-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Proposal: Support Timestamped Profiling Events #594
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,284 @@ | ||
# Proposal: Support Timestamped Profiling Events | ||
|
||
**Author(s):** @felixge in collaboration with @thegreystone, @jbachorik et al. | ||
|
||
**🚧 WORK IN PROGRESS 🚧:** This proposal is still in the early stages. I'm mostly looking for feedback from active participants in the profiling SIG before soliciting feedback from the wider community. | ||
|
||
## Summary | ||
|
||
The OpenTelemetry profiling format is currently designed to record aggregated summaries of events. This proposal suggests to pivot the format toward recording individual events with timestamps while retaining the ability to record aggregated profiling data as well. | ||
|
||
This will enable the following use cases. | ||
|
||
1. Richer analysis of On-CPU as well as Off-CPU data. | ||
2. Powerful visualizations such as as Thread Timelines, Flame Charts, and FlameScopes. | ||
3. Better compatibility with JFR, Go Execution Traces and Linux perf. | ||
|
||
## Motivation | ||
|
||
Most of the current OpenTelemetry profiling format, including the concept of aggregating stack traces, is inherited from pprof. | ||
However, given that we have [decided](https://github.com/open-telemetry/opentelemetry-proto/issues/567#issuecomment-2286565449) that strict pprof compatibility is not a goal for the OpenTelemetry, we are now free to design a more flexible and extensible format that can be used for a wider range of profiling use cases than pprof. | ||
|
||
One use case that recently came up is the [collection of Off-CPU profiling data for the ebpf profiler](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/144). The attendees of the [Sep 5](https://docs.google.com/document/d/19UqPPPlGE83N37MhS93uRlxsP1_wGxQ33Qv6CDHaEp0/edit?tab=t.0#heading=h.lenvx4xd62c6) SIG meeting agreed that aggregated profiling data is not ideal for this use case, as it increases the difficulty to reason about the collected data and to correlate it with application behavior. This is especially true when it comes to the analysis of tail latency problems caused by outliers. So instead aggregation, it is much more useful to record this data as individual events with timestamps as well as additional context such as thread id, trace id and span id. This becomes even more powerful when On-CPU samples are also recorded with timestamps, as it allows users to identify spikes and stalls of CPU activity as well as Off-CPU examplars that explain the cause of the stalls and resulting latency (or lack of throughput). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
FWIW, I used aggregated off-cpu profiles in the past ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ack. I'm not trying to suggest that aggregated Off-CPU profiles are useless. But they tend to be dominated by idle threads, so it can be tricky to navigate them. Additionally they can't really be used to analyze tail latencies since all they show are averages. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, we use a mix of both. Overlaying active CPU over aggregate wait CPU shows where the CPU is trying to execute and getting delayed. From there we allow zooming into individual time sections involved in the aggregate to debug thread interactions, etc. We need both within the same profile to achieve the same thing we have in our private formats. |
||
|
||
In addition to the manual analysis of such data, collecting timestamps also enables powerful visualizations such as [Thread Timelines](https://www.datadoghq.com/blog/continuous-profiler-timeline-view/), [Flame Charts](https://developer.chrome.com/docs/devtools/performance/reference#flame-chart), and [FlameScopes](https://www.brendangregg.com/blog/2018-11-08/flamescope-pattern-recognition.html). See [this document](https://docs.google.com/document/d/1285Rp1pSu6eOTInzPynCiJvDoG-r2cjqxWwTewaTncs/edit?tab=t.0#heading=h.jze26hdz58dy) or [this presentation](https://youtu.be/53UIPZfz-_U?si=x4uYpMtOCIJy8ihY) for more details. | ||
|
||
Last but not least, first-class support for timestamps allows greater compatibility with other profiling tools such as JFR, Go Execution Traces and Linux perf that also record timestamps for profiling events. | ||
|
||
## Design | ||
|
||
This proposal comes with a prelimary sketch for the new format to illustrate the feasibility of supporting event based as well as aggregated profiling data within the same format. However, the final design will be subject to further discussion and refinement and the changes can be merged via smaller PRs if needed. | ||
|
||
The goals for the design are to be a strict superset of pprof and a useful subset of JFR, Go Execution Traces and Linux perf. Efficiency is also a goal, but it is secondary to the goal of providing a flexible and extensible format that can be implemented by a wide range of profilers and analysis tools. | ||
|
||
TODO: Finish describing the design from a high level. | ||
|
||
## Examples | ||
|
||
TODO: Add example for Go mutex and goroutine profiles. | ||
|
||
### Aggregated CPU Profile | ||
|
||
Below is a simple example of an aggregated CPU profile that demonstrates how the new format can operate in the same way as pprof. | ||
|
||
```txtpb | ||
event_types: [ | ||
{ # index 0 | ||
name_string_index: 1 # cpu_sample | ||
unit_string_index: 2 # ms | ||
default_value: 10 # each Event.count represents 10ms of CPU time - this avoid redundant | ||
} | ||
] | ||
strings: [ | ||
"", # index 0 | ||
"cpu_sample", # index 1 | ||
"ms" # index 2 | ||
] | ||
|
||
events: [ | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 1 # main;serve_request | ||
count: 2 # samples, i.e. 20ms of CPU time | ||
} | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 2 # main;foo | ||
count: 1 # samples, i.e. 10ms of CPU time | ||
}, | ||
] | ||
|
||
# stacks, locations, functions, mappings, etc. omitted for brevity | ||
``` | ||
|
||
### Timestamped CPU Profile | ||
|
||
```txtpb | ||
event_types: [ | ||
{ # index 0 | ||
name_string_index: 1 # cpu_sample | ||
unit_string_index: 2 # ms | ||
default_value: 10 # each event represents 10ms of CPU time | ||
clock_id: 0 | ||
} | ||
] | ||
clocks: [ | ||
{ # index 0 | ||
frequency : 1_000_000_000 # 1 Ghz clock, i.e. 1 cycle = 1 ns | ||
time_unix_nanos: 1_257_894_000_000_000_000 | ||
} | ||
] | ||
attributes: [ | ||
{ # index 0 | ||
key_index: 4 # thread.id | ||
value: {int_value: 1} # 1 | ||
}, | ||
{ # index 1 | ||
key_index: 4, # thread.id | ||
value: {int_value: 2} # 2 | ||
} | ||
] | ||
strings: [ | ||
"", # index 0 | ||
"cpu_time", # index 1 | ||
"milliseconds", # index 2 | ||
"unix_epoch", # index 3 | ||
"thread.id" # index 4 | ||
] | ||
|
||
events: [ | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 1 # main;serve_request | ||
time: 10_500_000 # 10.5ms since clock start | ||
attributes: [0] # thread_id: 1 | ||
} | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 2 # main;foo | ||
time: 300_000 # 0.3ms since previous event | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is time based on previous event? This implies we have events in perfect time order. Often during collection, especially off-CPU (cswitch) events, the start of the event is not in-order of when it completes. This would cause collectors to sort potentially large sets of data (cswitches can be massive). Can we just have an absolute time option vs the delta time encoding scheme? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I imagine delta encoding is used to reduce payload size There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The purpose of delta encoding is to reduce payload sizes. Anyway, I don't feel very strongly about delta encoding. I'm planning to do some benchmarks later to provide some evidence (or lack of evidence) that it's going to be beneficial. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 64-bits are already variable encoded, and if they were deltas off of a stable epoch (like profile start time). They would be quite small without deltas between each other. I get why we would want delta encoding, but it comes at a sorting cost that is likely really bad for off-cpu data sources (since they complete at random times compared to the start time). |
||
attributes: [1] # thread_id: 2 | ||
}, | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 1 # main;serve_request | ||
time: 9_200_000 # 9.2ms since previous event | ||
attributes: [0] # thread_id: 1 | ||
} | ||
] | ||
# stacks, locations, functions, mappings, etc. omitted for brevity | ||
``` | ||
|
||
### Timestamped CPU Profile + Off-CPU Thread State Profile | ||
|
||
|
||
```txtpb | ||
event_types: [ | ||
{ # index 0 | ||
name_string_index: 1 # "cpu_sample" | ||
unit_string_index: 2 # "ms" | ||
default_value: 10 # each event represents 10ms of CPU time | ||
clock_id: 0 | ||
} | ||
{ # index 1 | ||
name_string_index: 2 # "thread_state" | ||
clock_id: 0 | ||
} | ||
] | ||
clocks: [ | ||
{ # index 0 | ||
frequency : 1_000_000_000 # 1 Ghz clock, i.e. 1 cycle = 1 ns | ||
time_unix_nanos: 1_257_894_000_000_000_000 | ||
} | ||
] | ||
attributes: [ | ||
{ # index 0 | ||
key_index: 4 # thread.id | ||
value: {int_value: 1} # 1 | ||
}, | ||
{ # index 1 | ||
key_index: 4, # thread.id | ||
value: {int_value: 2} # 2 | ||
} | ||
{ # index 2 | ||
key_index: 5, # thread.state | ||
value: {key_index: 6} # mutex | ||
}, # index 3 | ||
{ | ||
key_index: 5, # thread.state | ||
value: {key_index: 7} # sleep | ||
} | ||
] | ||
strings: [ | ||
"", # index 0 | ||
"cpu_sample", # index 1 | ||
"thread_state", # index 2 | ||
"ms", # index 3 | ||
"thread.id", # index 4 | ||
"thread.state", # index 5 | ||
"mutex", # index 6 | ||
"sleep", # index 7 | ||
] | ||
|
||
events: [ | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 1 # main;serve_request | ||
time: 10_500_000 # 10.5ms since clock start | ||
attributes: [0] # thread_id: 1 | ||
} | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 2 # main;foo | ||
time: 300_000 # 0.3ms since previous event | ||
attributes: [1] # thread_id: 2 | ||
}, | ||
{ | ||
event_type_index: 0 # cpu_sample (can be omitted on the wire) | ||
stack_index: 1 # main;serve_request | ||
time: 9_200_000 # 9.2ms since previous event | ||
attributes: [0] # thread_id: 1 | ||
}, | ||
{ | ||
event_type_index: 1 # thread_state | ||
stack_index: 1 # main;serve_request;mutex_lock | ||
time: 300_000_000 # 300ms since previous event | ||
duration: 50_000_500 # 50ms | ||
attributes: [0, 2] # thread_id: 1, thread.state: mutex | ||
}, | ||
{ | ||
event_type_index: 1 # thread_state | ||
stack_index: 1 # main;foo;sleep | ||
time: 5_000_000 # 5ms since previous event | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you think about the stall profiling mentioned in Brendan Gregg's blog on AI flame graphs (https://www.brendangregg.com/blog/2024-10-29/ai-flame-graphs.html)? If this kind of flame graph needs to be supported, there are multiple samples taken in the sleep state, which might cause recording confusion or data inconsistencies. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMHO, if we had a duration in addition to time, we could make off_cpu and stall events clearer. IE: Stall started at time X, lasted for Y. This gives a start/stop. This could be stored in an attribute, but feels somewhat wasteful to have to have an attribute per-duration value. Is the value of the event always considered a duration? Often times I have events where the value is less time based (like memory allocations). But for off-CPU / stalls, the value is more duration based. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. duration and value are separate fields in the current proposal, so I think the use cases brought up here should be covered? https://github.com/open-telemetry/opentelemetry-proto/pull/594/files/7d85bba21098f7950813d9717f06e86865da0726#diff-4a0777f946b5d445b17b64906a2ee5b121a070ee499189a656988ad0df350e11R81-R94 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, sorry I missed that! Definitely covers what I was asking for, thanks! |
||
duration: 1_000_000_500 # 1s | ||
attributes: [1, 3] # thread_id: 2, thread.state: sleep | ||
}, | ||
] | ||
# stack, location, function, mapping, etc. omitted for brevity | ||
``` | ||
### Aggregated Memory Profile | ||
|
||
Below is an example of an aggregated memory profile like those available in Go. | ||
|
||
```txtpb | ||
event_types: [ | ||
{ # index 0 | ||
name_string_index: 1 # allocations | ||
unit_string_index: 3 # By (bytes) | ||
} | ||
{ # index 1 | ||
name_string_index: 2 # liveheap | ||
unit_string_index: 3 # By (bytes) | ||
} | ||
] | ||
strings: [ | ||
"", # index 0 | ||
"allocations", # index 1 | ||
"liveheap", # index 2 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's necessary to differentiate between virtual memory, physical memory, and situations like page faults. Typically, virtual memory consumed is more than physical memory because some memory regions might be zeroed and never referenced. Some might be 'copy-on-write'. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah. This needs to be clarified, probably via semantic conventions. But the example given here is allocations profiled by the Go runtime. Those are always measured in terms of virtual memory. I wrote up more details on this here. |
||
"By" # index 3 | ||
] | ||
|
||
events: [ | ||
{ | ||
event_type_index: 0 # allocations (can be omitted on the wire) | ||
stack_index: 1 # main;new_user | ||
count: 100 # objects allocated | ||
sum: 4_800 # bytes allocated | ||
} | ||
{ | ||
event_type_index: 1 # liveheap | ||
stack_index: 1 # main;new_user | ||
count: 12 # objects alive | ||
sum: 576 # bytes alive | ||
} | ||
{ | ||
event_type_index: 0 # allocations (can be omitted on the wire) | ||
stack_index: 2 # main;new_post | ||
count: 300 # objects allocated | ||
sum: 76_800 # bytes allocated | ||
} | ||
{ | ||
event_type_index: 1 # liveheap | ||
stack_index: 2 # main;new_post | ||
count: 73 # objects alive | ||
sum: 18688 # bytes alive | ||
} | ||
] | ||
|
||
# stacks, locations, functions, mappings, etc. omitted for brevity | ||
``` | ||
|
||
## Semantic Conventions | ||
|
||
The following semantic conventions attributes should be added as part of this proposal: | ||
|
||
* `thread.state`: The state of the thread, e.g. "running", "unscheduled", "sleeping", "mutex". | ||
* `go.goroutine.id`: The id of the goroutine that the event was recorded on. | ||
* `go.goroutine.state`: Like thread.state, but with goroutine specific wait states, e.g. "chansend". | ||
* `go.proc.id`: The id of the go processor that the event was recorded on. | ||
|
||
The following semantic conventions already exist and should be used where applicable: | ||
|
||
* [`system.cpu.logical_number`](https://opentelemetry.io/docs/specs/semconv/system/system-metrics/) can be used to record which CPU an event was recorded on. | ||
* [`thread.id`](https://opentelemetry.io/docs/specs/semconv/attributes-registry/thread/) can be used to record the id of the thread that the event was recorded on. | ||
* [`thread.name`](https://opentelemetry.io/docs/specs/semconv/attributes-registry/thread/) can be used to record the name of the thread that the event was recorded on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On- and Off-CPU profiling are just two profiling use cases. How about making it less specific:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow. The use case enabled by this proposal is event based CPU profiling rather than aggregation. We already have aggregation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTel ebpf profiler could aggregate on CPU profiling data. This is possible by setting Sample.value to a different value than 1 and provide a timestamp in Sample.timestamps_unix_nano for each
Sample.value
.Currently, this aggregation is not done as it was rejected during the review process and so it is still a low hanging fruit for optimization.