feat(ingest/dbt): dbt column-level lineage #8991
The core functionality looked like it was in _infer_schemas_and_update_cll, which made enough sense, but I didn't really get a full picture of the process. I don't like how this logic is decently different from our CLL in other sources, but I'm guessing there are some key differences between the two.
metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py
    and should_fetch_target_node_schema
    and graph
):
    schema_metadata = graph.get_aspect(target_node_urn, SchemaMetadata)
This isn't cached, right? Is it possible we end up querying this multiple times for the same urn? Could we do a bulk fetch instead?
it won't ever query for the same urn multiple times
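For illustration only, if that guarantee ever stopped holding, a small memoizing wrapper would enforce one lookup per urn. This is a sketch under assumptions: `AspectCache` is a hypothetical helper, not part of DataHub, and the `fetch` callable stands in for something like `lambda urn: graph.get_aspect(urn, SchemaMetadata)`.

```python
from typing import Callable, Dict, Generic, Optional, TypeVar

T = TypeVar("T")


class AspectCache(Generic[T]):
    """Hypothetical helper: remembers the result of each per-urn fetch."""

    def __init__(self, fetch: Callable[[str], Optional[T]]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Optional[T]] = {}
        self.calls = 0  # number of real fetches, kept only for inspection

    def get(self, urn: str) -> Optional[T]:
        # Only hit the backend the first time a urn is requested.
        # A None result (aspect missing) is cached as well.
        if urn not in self._cache:
            self.calls += 1
            self._cache[urn] = self._fetch(urn)
        return self._cache[urn]
```

A bulk fetch would save round trips as well, but whether the graph client offers one is not something this thread establishes.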
if self.config.include_column_lineage and sql_result:
    # We only save the debug info here. We'll report errors based on it later, after
    # applying the configured node filters.
    node.cll_debug_info = sql_result.debug_info
Attaching lineage info to nodes is different from what we do in most other sources. How come you're doing it this way here?
lineage is determined by the view definition and the schema of the upstreams. in most other sources, we have the schemas available, but in the case of dbt ephemeral models, we have to infer the schemas. that means we need to do it in topological order, so it's easier to do it all in one go
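A rough sketch of the ordering argument, not the code in this PR: when a node's schema depends on its upstreams' schemas, walking the nodes in topological order guarantees every upstream schema exists before it is needed. The names `infer_schemas_in_order`, `upstreams`, `known_schemas`, and `infer` are hypothetical, and the `infer` callback stands in for whatever SQL parsing produces the real schema and column lineage.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable, Dict, List, Tuple

Schema = List[str]  # simplified: a schema is just a list of column names


def infer_schemas_in_order(
    upstreams: Dict[str, List[str]],
    known_schemas: Dict[str, Schema],
    infer: Callable[[str, Dict[str, Schema]], Schema],
) -> Tuple[Dict[str, Schema], List[str]]:
    """Fill in missing schemas, visiting upstream nodes before downstream ones."""
    schemas = dict(known_schemas)
    order: List[str] = []
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    for node in TopologicalSorter(upstreams).static_order():
        order.append(node)
        if node in schemas:
            continue  # schema already available, nothing to infer
        upstream_schemas = {u: schemas[u] for u in upstreams.get(node, [])}
        schemas[node] = infer(node, upstream_schemas)
    return schemas, order


# Toy usage: an ephemeral model sits between a source and a final model.
# The toy infer() just passes all upstream columns through.
toy_upstreams = {"ephemeral": ["source"], "final": ["ephemeral"]}
toy_known = {"source": ["id", "name"]}
toy_infer = lambda node, ups: sorted({c for cols in ups.values() for c in cols})
toy_schemas, toy_order = infer_schemas_in_order(toy_upstreams, toy_known, toy_infer)
```

Because each inferred schema feeds the next node's inference, doing schema inference and lineage in a single pass avoids a second traversal.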
Changes stacked on top of #8989
Caveats
`ref` or `source`), those won't get CLL
TODOs
Checklist