-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(metrics): run a separate task for utilization metric to ensure it is regularly updated #22070
base: master
Are you sure you want to change the base?
Conversation
…e it is regularly published This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that utilization metric is published even when no events are running through a component. Fixes: vectordotdev#20216
I have left this as a draft, since I am not sure how to handle shutdown (which shutdown signal to use) and how to name the task (or maybe run it in a completely different way, to not mix it up with components). Also, gauge is passed into the timer instead of using the macro inside the timer to ensure that correct labels are inherited from the tracing context. |
@pront running_topology.utilization_task =
// TODO: how to name this custom task?
Some(tokio::spawn(Task::new("".into(), "", async move {
utilization_emitter
.run_utilization(ShutdownSignal::noop())
.await;
// TODO: new task output type for this? Or handle this task in a completely
// different way
Ok(TaskOutput::Healthcheck)
}))); I am not sure how to pass the shutdown signal to it (and if I should do it at all, it made sense to me, but I might have misunderstood some part of the topology). Also, I currently create a task with empty name, but maybe it would make more sense to run it in a different way compared to other tasks? |
Hi @esensar, This is a complex so I checked out this PR to do some testing; config:
Sample output:
Leaving this here as context. Will followup with more questions. |
src/topology/running.rs
Outdated
@@ -1053,6 +1055,17 @@ impl RunningTopology { | |||
running_topology.connect_diff(&diff, &mut pieces).await; | |||
running_topology.spawn_diff(&diff, pieces); | |||
|
|||
running_topology.utilization_task = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Still trying to parse the details here)
Do we join this handle at any point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we do. I forgot to add that it seems. Should it be joined in stop
? I can see that other tasks are joined there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds right.
src/topology/running.rs
Outdated
@@ -1053,6 +1055,17 @@ impl RunningTopology { | |||
running_topology.connect_diff(&diff, &mut pieces).await; | |||
running_topology.spawn_diff(&diff, pieces); | |||
|
|||
running_topology.utilization_task = | |||
// TODO: how to name this custom task? | |||
Some(tokio::spawn(Task::new("".into(), "", async move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible name utilization_heartbeat
. But here I have a more basic question, do we expect this to repeat every 5 seconds (the hardcoded value)? In my test it seems like it was way more frequent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is expected every 5 seconds. Not sure what went wrong there, in my testing it was repeated every 5 seconds (even though utilization was printed from the sink every second, it was only updated by this component every 5).
cc @lukesteensen (just in case you are interested in this one) |
I haven't been able to figure out what causes these component validation tests to get stuck when stopping the topology. I can see that the utilization task stops properly, but sink tasks get stuck for some reason :/ |
I didn't have time to take a look at this yet. But I wouldn't be surprised if the validation framework also needs changes. |
Summary
This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that utilization metric is published even when no events are running through a component.
Change Type
Is this a breaking change?
How did you test this PR?
Ran vector with internal metrics and observer that utilization was updated every ~5 secs, instead of only when events are running.
Does this PR include user facing changes?
Checklist
Cargo.lock
), pleaserun
dd-rust-license-tool write
to regenerate the license inventory and commit the changes (if any). More details here.References