
feat(telemetry): cross-component async write tracing #12405

Open · wants to merge 1 commit into base: master

Conversation

@david-leifker david-leifker (Collaborator) commented Jan 20, 2025

Key Features:

  • Async write requests (OpenAPI/Rest.li) include a trace id generated by OpenTelemetry, returned via the standard traceparent HTTP response header used for tracing (see the request sketch after this list).

    • SystemMetadata, available via OpenAPI v3, also includes the trace id in its properties under the key telemetryTraceId.
    • The trace id can be used to track the outcome of a write request, including its success, failure, or pending status.
    • Tracing spans multiple components across Kafka topics, providing a consistent trace id from GMS through to the consumers (mce-consumer and mae-consumer).
  • The failed MCP topic now stores more detailed error messages, and the trace API fetches these errors so it can return not only a failure status but also detailed information about why the write failed.

  • For debugging, a cookie or a special header can be attached to any request (read/write/sync/async) on any API (GraphQL/OpenAPI/Rest.li) to trigger logging of the request's spans with detailed timing:

    • Header X-Enable-Trace-Log with value true
    • Cookie enable-trace-log with value true
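
As a rough illustration of the request/response flow above, the sketch below uses Java's built-in HttpClient to submit an async write, opt into trace logging via X-Enable-Trace-Log, and read back the traceparent header. The host, endpoint path, query parameter, and payload are placeholders and not taken from this PR; only the header names and the W3C traceparent format come from the description above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TraceHeaderExample {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // Hypothetical async write via OpenAPI; path, query param, and body are placeholders.
    HttpRequest request = HttpRequest.newBuilder(
            URI.create("http://localhost:8080/openapi/v3/entity/dataset?async=true"))
        .header("Content-Type", "application/json")
        .header("X-Enable-Trace-Log", "true") // optional: also log span timings server-side
        .POST(HttpRequest.BodyPublishers.ofString("[]")) // placeholder payload
        .build();

    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

    // W3C traceparent format: 00-<trace-id>-<span-id>-<flags>; the second field is the trace id.
    String traceparent = response.headers().firstValue("traceparent").orElseThrow();
    String traceId = traceparent.split("-")[1];
    System.out.println("trace id: " + traceId);
  }
}
```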

Design Considerations:

  • The initial implementation requires no dedicated telemetry infrastructure; however, existing OpenTelemetry environment variables continue to be honored and will export the new spans if configured.

    • This means the tracing implementation does not rely on any additional external systems or infrastructure. As a consequence of this design choice, trace status is determined by inspecting the three storage systems (primary storage (SQL/Cassandra), Elasticsearch/OpenSearch, and Kafka topics) for the trace id or related timestamps.
    • The trace id is stored in systemMetadata in both SQL and ES. For ES specifically, the presence of the trace id in the system metadata index is used as a proxy for a successful write to ES.
    • The tracing feature additionally fetches messages from the Kafka topics (including the failed MCP topic) for more detailed error information. Pending states are derived by comparing the message's offset with the consumer groups' current offsets (see the offset-check sketch after this list).
  • Trace performance

    • For successful writes, the trace lookup is cheap: a single lookup from SQL and one from Elasticsearch, with no interaction with Kafka.
    • For pending and error states, a Kafka lookup is required, which is slower.
      • Where possible, timestamp-based offsets are used for efficient seeking.
      • Tracing in Kafka is performed in parallel with limited concurrency.
      • Offset caching is used to provide faster response times at the expense of potentially stale data; a skipCache flag is included to bypass the cache.
  • This PR updates OpenTelemetry and transitions the timing instrumentation from DropWizard to OpenTelemetry. The existing DropWizard metrics are forwarded from OpenTelemetry, preserving the existing naming scheme.

  • For easy access, OperationContext now includes a TraceContext to facilitate integrating OpenTelemetry into any part of the code base (see the OpenTelemetry span sketch after this list).
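
The pending-state derivation described above can be illustrated with the plain Kafka AdminClient: compare the offset at which the produced message landed against the consumer group's committed offset for that partition. The bootstrap server, topic, partition, group id, and message offset below are placeholders; this is a minimal sketch of the idea, not the PR's KafkaTraceReader implementation.

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PendingCheckExample {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

    try (AdminClient admin = AdminClient.create(props)) {
      TopicPartition tp = new TopicPartition("MetadataChangeProposal_v1", 0); // placeholder topic/partition
      long messageOffset = 42L; // offset recorded when the MCP was produced (placeholder)

      // Committed offset = next offset the group will consume for this partition.
      OffsetAndMetadata committed = admin
          .listConsumerGroupOffsets("mce-consumer-group") // placeholder group id
          .partitionsToOffsetAndMetadata()
          .get()
          .get(tp);

      // If the group has not advanced past the message's offset, the write is still pending.
      boolean pending = committed == null || committed.offset() <= messageOffset;
      System.out.println(pending ? "PENDING" : "CONSUMED");
    }
  }
}
```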
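For the last point, here is a minimal sketch of manual span creation with the plain OpenTelemetry API; the trace id surfaced this way is what the PR exposes as telemetryTraceId / traceparent. The instrumentation scope and span names are placeholders, and the PR's TraceContext wrapper API itself is not reproduced here.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class SpanExample {
  public static void main(String[] args) {
    // Obtain a tracer; without a configured SDK this is a no-op tracer (all-zero trace id).
    Tracer tracer = GlobalOpenTelemetry.getTracer("datahub.example"); // placeholder scope name

    Span span = tracer.spanBuilder("ingestProposal").startSpan(); // placeholder span name
    try (Scope scope = span.makeCurrent()) {
      // Work done here (and child spans created here) shares the same trace id.
      System.out.println("trace id: " + span.getSpanContext().getTraceId());
    } finally {
      span.end();
    }
  }
}
```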

TODO: create documentation; improve code coverage.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

* created TraceContext for opentelemetry spans
* added tracing header/cookies to control logging trace info
* support legacy dropwizard tracing using opentelemetry
* added smoke-tests for tracing conditions
@github-actions github-actions bot added the product, devops, and smoke_test labels Jan 20, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review label Jan 20, 2025
codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 46.58711% with 1119 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...om/linkedin/metadata/entity/EntityServiceImpl.java | 70.09% | 76 Missing and 17 partials ⚠️ |
| .../com/linkedin/metadata/trace/KafkaTraceReader.java | 58.82% | 68 Missing and 16 partials ⚠️ |
| ...s/factory/kafka/trace/KafkaTraceReaderFactory.java | 0.00% | 66 Missing ⚠️ |
| .../datahubproject/metadata/context/TraceContext.java | 63.27% | 50 Missing and 15 partials ⚠️ |
| ...in/metadata/entity/validation/ValidationUtils.java | 20.58% | 53 Missing and 1 partial ⚠️ |
| ...data/search/client/CachingEntitySearchService.java | 23.94% | 52 Missing and 2 partials ⚠️ |
| ...java/com/datahub/event/PlatformEventProcessor.java | 0.00% | 42 Missing ⚠️ |
| .../com/linkedin/metadata/trace/TraceServiceImpl.java | 84.32% | 21 Missing and 16 partials ⚠️ |
| ...ory/system_telemetry/OpenTelemetryBaseFactory.java | 0.00% | 36 Missing ⚠️ |
| ...project/openapi/operations/v1/TraceController.java | 0.00% | 35 Missing ⚠️ |
| ... and 50 more | | |
| Files with missing lines | Coverage Δ |
|---|---|
| ...inkedin/datahub/upgrade/UpgradeCliApplication.java | 20.00% <ø> (ø) |
| ...din/datahub/upgrade/config/SystemUpdateConfig.java | 90.69% <ø> (ø) |
| ...in/java/com/linkedin/metadata/aspect/ReadItem.java | 33.33% <ø> (ø) |
| ...edin/metadata/aspect/utils/DefaultAspectsUtil.java | 84.67% <100.00%> (+0.22%) ⬆️ |
| ...com/linkedin/metadata/client/JavaEntityClient.java | 31.41% <ø> (ø) |
| ...ta/entity/cassandra/CassandraRetentionService.java | 69.01% <ø> (ø) |
| ...n/metadata/entity/ebean/EbeanRetentionService.java | 55.49% <ø> (ø) |
| ...adata/graph/elastic/ElasticSearchGraphService.java | 68.32% <ø> (ø) |
| ...nkedin/metadata/graph/neo4j/Neo4jGraphService.java | 71.92% <100.00%> (+0.12%) ⬆️ |
| ...linkedin/metadata/search/LineageSearchService.java | 82.05% <ø> (ø) |
| ... and 81 more | |

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 262dd76...69db86b. Read the comment docs.
