fix(metrics): run a separate task for utilization metric to ensure it is regularly updated #22070

Open: esensar wants to merge 6 commits into master

esensar
Contributor

@esensar esensar commented Dec 21, 2024

Summary

This adds a separate task that runs periodically to emit utilization metrics and to collect messages from components that need their utilization metrics calculated. This ensures that the utilization metric is published even when no events are running through a component.
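
To illustrate why a timer-driven report matters, here is a minimal sketch. It is not the code in this PR. It assumes utilization is the fraction of a reporting window that a component spent not waiting for input. `UtilizationTimer` and its methods are hypothetical names, and timestamps are passed in explicitly so the example stays deterministic.

```rust
use std::time::Duration;

/// Hypothetical, simplified utilization tracker (an assumption for
/// illustration, not Vector's actual type).
#[derive(Debug)]
struct UtilizationTimer {
    /// Start of the current reporting window.
    window_start: Duration,
    /// When the component began waiting for input, if it is waiting now.
    waiting_since: Option<Duration>,
    /// Total time spent waiting inside the current window.
    waited: Duration,
}

impl UtilizationTimer {
    fn new(now: Duration) -> Self {
        // A component starts out waiting for its first event.
        Self {
            window_start: now,
            waiting_since: Some(now),
            waited: Duration::ZERO,
        }
    }

    /// The component went idle and is waiting for the next event.
    fn start_wait(&mut self, now: Duration) {
        if self.waiting_since.is_none() {
            self.waiting_since = Some(now);
        }
    }

    /// An event arrived and the component started working.
    fn stop_wait(&mut self, now: Duration) {
        if let Some(since) = self.waiting_since.take() {
            self.waited += now.saturating_sub(since);
        }
    }

    /// Close the window at `now` and return utilization in [0, 1].
    /// Called by the periodic task, not by event flow.
    fn report(&mut self, now: Duration) -> f64 {
        // Account for a wait that is still in progress.
        if let Some(since) = self.waiting_since {
            self.waited += now.saturating_sub(since);
            self.waiting_since = Some(now);
        }
        let total = now.saturating_sub(self.window_start);
        let utilization = if total.is_zero() {
            0.0
        } else {
            1.0 - self.waited.as_secs_f64() / total.as_secs_f64()
        };
        // Start a fresh window for the next report.
        self.window_start = now;
        self.waited = Duration::ZERO;
        utilization.clamp(0.0, 1.0)
    }
}
```

Because `report` is driven by the clock, a component that receives no events for a whole window reports 0.0 instead of leaving the gauge at its last value.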

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Ran Vector with internal metrics and observed that utilization was updated every ~5 seconds, instead of only when events are running.

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

…e it is regularly published

This adds a separate task that runs periodically to emit utilization metrics and collect messages
from components that need their utilization metrics calculated. This ensures that utilization metric
is published even when no events are running through a component.

Fixes: vectordotdev#20216

@github-actions github-actions bot added the domain: topology label Dec 21, 2024
@esensar
Contributor Author

esensar commented Dec 21, 2024

I have left this as a draft, since I am not sure how to handle shutdown (which shutdown signal to use) and how to name the task (or maybe run it in a completely different way, to not mix it up with components).

Also, the gauge is passed into the timer, instead of using the macro inside the timer, to ensure that the correct labels are inherited from the tracing context.

@pront pront self-assigned this Jan 2, 2025
@esensar
Contributor Author

esensar commented Jan 9, 2025

@pront
Any suggestions for running this separate task? It is currently started as follows:

running_topology.utilization_task =
    // TODO: how to name this custom task?
    Some(tokio::spawn(Task::new("".into(), "", async move {
        utilization_emitter
            .run_utilization(ShutdownSignal::noop())
            .await;
        // TODO: new task output type for this? Or handle this task in a completely
        // different way
        Ok(TaskOutput::Healthcheck)
    })));

I am not sure how to pass the shutdown signal to it, or whether I should do it at all (it made sense to me, but I might have misunderstood some part of the topology). Also, I currently create the task with an empty name, but maybe it would make more sense to run it in a different way compared to other tasks?
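
One general shape for a periodic task that honours a shutdown signal and is joined on stop is sketched below. This is an assumption-laden illustration, not Vector's topology code: it uses plain `std` threads and an `mpsc` channel instead of tokio and `ShutdownSignal`, and `PeriodicTask` is a hypothetical name.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle for a periodic background task (not a Vector type).
struct PeriodicTask {
    shutdown: Sender<()>,
    handle: JoinHandle<u32>,
}

impl PeriodicTask {
    /// Spawn a thread that calls `tick` once per `interval` until shutdown.
    /// The thread returns how many times `tick` ran.
    fn spawn<F>(interval: Duration, mut tick: F) -> Self
    where
        F: FnMut() + Send + 'static,
    {
        let (shutdown, rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || {
            let mut ticks = 0u32;
            loop {
                match rx.recv_timeout(interval) {
                    // No shutdown message within the interval: emit and continue.
                    Err(RecvTimeoutError::Timeout) => {
                        tick();
                        ticks += 1;
                    }
                    // Shutdown requested, or the owner dropped the sender: stop.
                    Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
                }
            }
            ticks
        });
        Self { shutdown, handle }
    }

    /// Signal shutdown, then join so the task cannot outlive its owner.
    fn stop(self) -> u32 {
        // Sending fails only if the thread has already exited; joining is
        // still the right thing to do in that case.
        let _ = self.shutdown.send(());
        self.handle.join().expect("periodic task panicked")
    }
}
```

Waiting on the shutdown channel with a timeout doubles as the interval timer, so a stop request does not have to wait for the next tick.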

@pront
Member

pront commented Jan 9, 2025

Hi @esensar,

This is a complex change, so I checked out this PR to do some testing.

config:

api:
  enabled: true

sources:
  internal_metrics_1:
    type: internal_metrics

transforms:
  filter_utilization:
    type: filter
    inputs: ["internal_metrics_1"]
    condition: .name == "utilization"

sinks:
  console:
    inputs: ["filter_utilization"]
    type: console
    encoding:
      codec: json
      json:
        pretty: true

Sample output:

/Users/pavlos.rontidis/.cargo/bin/cargo run --color=always --profile dev -- --config /Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.70s
     Running `target/debug/vector --config /Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml`
2025-01-09T20:46:27.736727Z  INFO vector::app: Log level is enabled. level="info"
2025-01-09T20:46:27.741218Z  INFO vector::app: Loading configs. paths=["/Users/pavlos.rontidis/CLionProjects/vector/pront/configs/internal_metrics.yaml"]
2025-01-09T20:46:27.766384Z  INFO vector::topology::running: Running healthchecks.
2025-01-09T20:46:27.767489Z  INFO vector::topology::builder: Healthcheck passed.
2025-01-09T20:46:27.769222Z  INFO vector: Vector has started. debug="true" version="0.44.0" arch="aarch64" revision=""
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:27.770905Z",
  "kind": "absolute",
  "gauge": {
    "value": 1.0
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:27.770905Z",
  "kind": "absolute",
  "gauge": {
    "value": 1.0
  }
}
2025-01-09T20:46:27.777873Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground graphql=http://127.0.0.1:8686/graphql
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:37.771882Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.010011816446046937
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:37.771882Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.01004418815411736
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "filter_utilization",
    "component_kind": "transform",
    "component_type": "filter",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:47.771505Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.0001184493997704478
  }
}
{
  "name": "utilization",
  "namespace": "vector",
  "tags": {
    "component_id": "console",
    "component_kind": "sink",
    "component_type": "console",
    "host": "COMP-LPF0JYPP2Q"
  },
  "timestamp": "2025-01-09T20:46:47.771505Z",
  "kind": "absolute",
  "gauge": {
    "value": 0.00010693227629135064
  }
}
...

Leaving this here as context. Will follow up with more questions.

src/topology/builder.rs
@@ -1053,6 +1055,17 @@ impl RunningTopology {
running_topology.connect_diff(&diff, &mut pieces).await;
running_topology.spawn_diff(&diff, pieces);

running_topology.utilization_task =

Member

(Still trying to parse the details here)
Do we join this handle at any point?

Contributor Author

I don't think we do; it seems I forgot to add that. Should it be joined in stop? I can see that other tasks are joined there.

Member

@pront pront Jan 10, 2025

Yes, that sounds right.

@@ -1053,6 +1055,17 @@ impl RunningTopology {
running_topology.connect_diff(&diff, &mut pieces).await;
running_topology.spawn_diff(&diff, pieces);

running_topology.utilization_task =
// TODO: how to name this custom task?
Some(tokio::spawn(Task::new("".into(), "", async move {

Member

A possible name: utilization_heartbeat. But I have a more basic question here: do we expect this to repeat every 5 seconds (the hardcoded value)? In my test it seemed to be much more frequent.

Contributor Author

Yes, it is expected every 5 seconds. I'm not sure what went wrong there; in my testing it repeated every 5 seconds (even though utilization was printed from the sink every second, it was only updated by this component every 5).

@pront
Member

pront commented Jan 9, 2025

cc @lukesteensen (just in case you are interested in this one)

@esensar esensar requested a review from pront January 15, 2025 17:14
@pront pront marked this pull request as ready for review January 15, 2025 19:07
@pront pront requested a review from a team as a code owner January 15, 2025 19:07
@esensar esensar changed the title fix(utilization_metric): run a separate task for utilization to ensure it is regularly published fix(metrics): run a separate task for utilization metric to ensure it is regularly published Jan 17, 2025
@esensar esensar changed the title fix(metrics): run a separate task for utilization metric to ensure it is regularly published fix(metrics): run a separate task for utilization metric to ensure it is regularly updated Jan 17, 2025
@esensar
Contributor Author

esensar commented Jan 20, 2025

I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/

@pront
Member

pront commented Jan 24, 2025

> I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/

I didn't have time to take a look at this yet. But I wouldn't be surprised if the validation framework also needs changes.

Labels
domain: topology Anything related to Vector's topology code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus stats "stuck" on last value seen for transforms using aggregations (vector_utilization)
2 participants