feat(telemetry): cross-component async write tracing #12405
+6,910
−1,805
Key Features:
- Async write requests (OpenAPI/Rest.li) will include a `trace id` generated by OpenTelemetry and returned via the standard HTTP response header used for tracing: `traceparent`.
- `SystemMetadata`, available from OpenAPI v3, will also include the `trace id` in its properties under the key `telemetryTraceId`.
- The `trace id` can be used to track the outcome of a write request, with information about its success, failure, or pending status.
- Propagation of the `trace id` from GMS to the consumers (`mce-consumer` and `mae-consumer`).
- The Failed MCP topic will now store more detailed error messages, and the trace API will fetch these errors so that it returns not only the failure status but also detailed information on why the write failed.
- For debugging, a cookie or a special header can be used with any request (read/write/sync/async) on any API (GraphQL/OpenAPI/Rest.li) to trigger logging of the request's spans, with detailed timing, in the logs.
  - Header `X-Enable-Trace-Log` with value `true`
  - Cookie `enable-trace-log` with value `true`
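As a client-side illustration of the features above, the sketch below extracts the trace id from a `traceparent` response header and builds the debug-logging header or cookie. The header layout (`version-traceid-parentid-flags`, lowercase hex) follows the W3C Trace Context format. The function names are hypothetical and not part of this PR, and the sketch assumes that `X-Enable-Trace-Log` is the header form and `enable-trace-log` the cookie form.

```python
import re
from typing import Dict, Optional

# W3C Trace Context layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def extract_trace_id(traceparent: str) -> Optional[str]:
    """Return the 32-character trace id, or None if the header is malformed."""
    match = _TRACEPARENT.match(traceparent.strip())
    if match is None:
        return None
    trace_id = match.group(2)
    # An all-zero trace id is invalid in the W3C format.
    if trace_id == "0" * 32:
        return None
    return trace_id


def debug_trace_headers(use_cookie: bool = False) -> Dict[str, str]:
    """Request headers that ask the server to log spans for this request.

    Assumption: X-Enable-Trace-Log is sent as a header and
    enable-trace-log as a cookie, both with the value "true".
    """
    if use_cookie:
        return {"Cookie": "enable-trace-log=true"}
    return {"X-Enable-Trace-Log": "true"}
```

A caller would typically read `traceparent` from the response of an async write, keep the extracted trace id, and use it later to look up the outcome of that write.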
Design Considerations:
- For the initial implementation no specific telemetry infrastructure is required; however, existing environment variables for OpenTelemetry can continue to be used and will export the new spans if configured.
- ... `trace id` or related timestamps.
- The `trace id` is stored in `systemMetadata` in both SQL and ES. For ES specifically, the presence of the `trace id` in the system metadata index is used as a proxy to determine a successful write to ES.
- Trace performance
  - `skipCache` is included as a flag to bypass the cache.
- This PR updates OpenTelemetry and transitions from the DropWizard-based timing instrumentation to OpenTelemetry. The existing DropWizard metrics are forwarded from OpenTelemetry, preserving the existing naming scheme.
- For easy access, the `OperationContext` now includes a `TraceContext` to facilitate the integration of OpenTelemetry into any part of the code base.

TODO: Create documentation, code coverage.
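The idea of using the stored trace id as a proxy for a successful write can be sketched as follows. This is only an illustration, not the PR's implementation: the function names, the dictionary shape of the system metadata, and the `SUCCESS`/`PENDING` labels are all assumptions. Only the `telemetryTraceId` property key comes from the description above. The sketch also ignores failures reported through the Failed MCP topic and any timestamp-based checks.

```python
from typing import Dict, Mapping, Optional

TRACE_ID_KEY = "telemetryTraceId"  # property key named in the PR description


def has_trace_id(system_metadata: Optional[Mapping], trace_id: str) -> bool:
    """True when the stored system metadata carries the given trace id."""
    if not system_metadata:
        return False
    # Hypothetical shape: {"properties": {"telemetryTraceId": "<id>"}}
    properties = system_metadata.get("properties") or {}
    return properties.get(TRACE_ID_KEY) == trace_id


def write_status(
    trace_id: str,
    sql_metadata: Optional[Mapping],
    es_metadata: Optional[Mapping],
) -> Dict[str, str]:
    """Per-storage status for one write, using trace id presence as the proxy."""
    return {
        "sql": "SUCCESS" if has_trace_id(sql_metadata, trace_id) else "PENDING",
        "es": "SUCCESS" if has_trace_id(es_metadata, trace_id) else "PENDING",
    }
```

For example, a write whose trace id is already present in the SQL system metadata but not yet in the ES system metadata index would be reported as successful for SQL and pending for ES.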
Checklist