Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingestion/tableau): restructure the tableau graphql datasource query #11230

Conversation

sid-acryl
Copy link
Contributor

@sid-acryl sid-acryl commented Aug 23, 2024

Fixes

  • Restructure upstream GraphQl queries to avoid node limit error
  • Rewrite emit_project_container to emit project in topological order

Copy link
Contributor

coderabbitai bot commented Aug 23, 2024

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces enhancements to the Tableau ingestion module within the DataHub project. Key changes include the implementation of cursor-based pagination for metadata querying, the addition of new configuration fields, and the introduction of structured JSON files that represent data fields and connections. Multiple test files have also been updated to reflect changes in import paths and functionality.

Changes

File Change Summary
.../tableau/__init__.py Introduced imports for TableauConfig, TableauSiteSource, TableauSource, and TableauSourceReport.
.../tableau/tableau.py Replaced query_metadata with query_metadata_cursor_based_pagination, updated parameters in get_connection_object_page, added fetch_size field, and introduced update_datasource_for_field_upstream method.
.../tableau/tableau_common.py Added query_metadata_cursor_based_pagination function for efficient GraphQL queries with pagination.
.../tableau/tableau_constant.py Introduced new constant FIELDS_CONNECTION.
.../setup/*.json Multiple new JSON files added, defining structured representations of fields and connections for embedded and published data sources.
.../test_tableau_ingest.py Updated import paths and added calls to read_response in tests.
.../test_tableau_source.py Updated import paths for tableau_constant and tableau_common.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TableauAPI
    participant TableauConfig
    participant DataHub

    User->>TableauAPI: Request Metadata
    TableauAPI->>TableauConfig: Fetch Metadata with Pagination
    TableauConfig->>TableauAPI: Send Paginated Query
    TableauAPI-->>TableauConfig: Return Metadata Page
    TableauConfig-->>DataHub: Send Retrieved Metadata
    DataHub-->>User: Provide Data
Loading

🐇 In the field of data, I hop and play,
With tables and queries, I brighten the day.
New paths to explore, oh what a delight,
Ingestion is smoother, the future is bright!
With fields and connections, we dance and we sing,
Cheers to the changes, let the data take wing!
🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Aug 23, 2024
@sid-acryl sid-acryl changed the title fix(ingestion/tableau): cursor base pagination fix(ingestion/tableau): restructure the tableau graphql datasource query to resolve the node limit error. Aug 27, 2024
@sid-acryl sid-acryl changed the title fix(ingestion/tableau): restructure the tableau graphql datasource query to resolve the node limit error. fix(ingestion/tableau): restructure the tableau graphql datasource query. Aug 27, 2024
@sid-acryl sid-acryl changed the title fix(ingestion/tableau): restructure the tableau graphql datasource query. fix(ingestion/tableau): restructure the tableau graphql datasource query Aug 27, 2024
sid-acryl and others added 3 commits August 27, 2024 23:03
… github.com:sid-acryl/datahub-fork into cus-2491-tableau-ingestion-hitting-20000-node-limit
@hsheth2
Copy link
Collaborator

hsheth2 commented Aug 28, 2024

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Aug 28, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8f88858 and 2ec4ca4.

Files selected for processing (18)
  • metadata-ingestion/src/datahub/ingestion/source/tableau/init.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/tableau/tableau.py (15 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/tableau/tableau_common.py (6 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/tableau/tableau_constant.py (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_04ed1dcc7090_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_1570e7f932f6_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_26675da44a38_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_69eb47587cc2_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_6f5f4cc0b6c6_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_a0fced25e056_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_a561c7beccd3_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_bda46be068e3_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_c651da2f6ad8_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/publishedDatasourcesConnection_all.json (47 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/publishedDatasourcesFieldUpstream_17139d6e97ae_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/setup/publishedDatasourcesFieldUpstream_8e19660bb5dd_all.json (1 hunks)
  • metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py (19 hunks)
  • metadata-ingestion/tests/unit/test_tableau_source.py (1 hunks)
Files skipped from review due to trivial changes (3)
  • metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_6f5f4cc0b6c6_all.json
  • metadata-ingestion/tests/integration/tableau/setup/publishedDatasourcesConnection_all.json
  • metadata-ingestion/tests/unit/test_tableau_source.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/tableau/__init__.py

2-2: datahub.ingestion.source.tableau.tableau.TableauConfig imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


3-3: datahub.ingestion.source.tableau.tableau.TableauSiteSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


4-4: datahub.ingestion.source.tableau.tableau.TableauSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


5-5: datahub.ingestion.source.tableau.tableau.TableauSourceReport imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

Additional comments not posted (52)
metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_bda46be068e3_all.json (1)

1-39: LGTM!

The JSON structure is well-formed and consistent with the expected schema.

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/tableau/tableau_constant.py (1)

69-69: LGTM!

The new constant FIELDS_CONNECTION is correctly defined and consistent with the existing constants.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_04ed1dcc7090_all.json (5)

1-4: LGTM!

The initial structure and nodes of the JSON file are correctly formatted.

The code changes are approved.


5-16: LGTM!

The first node is correctly formatted and follows the expected structure.

The code changes are approved.


17-87: LGTM!

The subsequent nodes are correctly formatted and follow the expected structure.

The code changes are approved.


90-93: LGTM!

The pageInfo structure is correctly formatted and follows the expected structure.

The code changes are approved.


94-94: LGTM!

The totalCount field is correctly formatted and follows the expected structure.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_26675da44a38_all.json (5)

1-4: LGTM!

The initial structure and nodes of the JSON file are correctly formatted.

The code changes are approved.


5-16: LGTM!

The first node is correctly formatted and follows the expected structure.

The code changes are approved.


17-100: LGTM!

The subsequent nodes are correctly formatted and follow the expected structure.

The code changes are approved.


102-105: LGTM!

The pageInfo structure is correctly formatted and follows the expected structure.

The code changes are approved.


106-106: LGTM!

The totalCount field is correctly formatted and follows the expected structure.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/publishedDatasourcesFieldUpstream_8e19660bb5dd_all.json (5)

1-4: LGTM!

The initial structure and nodes of the JSON file are correctly formatted.

The code changes are approved.


5-16: LGTM!

The first node is correctly formatted and follows the expected structure.

The code changes are approved.


17-100: LGTM!

The subsequent nodes are correctly formatted and follow the expected structure.

The code changes are approved.


102-105: LGTM!

The pageInfo structure is correctly formatted and follows the expected structure.

The code changes are approved.


106-106: LGTM!

The totalCount field is correctly formatted and follows the expected structure.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/publishedDatasourcesFieldUpstream_17139d6e97ae_all.json (1)

1-281: LGTM!

The JSON structure is well-formed and consistent.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_69eb47587cc2_all.json (1)

1-421: LGTM!

The JSON structure is well-formed and consistent.

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/tableau/tableau_common.py (4)

933-964: LGTM!

The function query_metadata_cursor_based_pagination is well-implemented and follows best practices for dynamic query generation and pagination.

The code changes are approved.


Line range hint 966-996: LGTM!

The function query_metadata is correctly implemented.

The code changes are approved.


Line range hint 997-1017: LGTM!

The function get_filter_pages is correctly implemented and the comment provides useful context.

The code changes are approved.


226-238: Verify the impact of commenting out these sections.

The sections related to upstreamFields and upstreamColumns have been commented out. Ensure this change does not affect other parts of the codebase.

Run the following script to verify the impact:

Also applies to: 391-403

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_a561c7beccd3_all.json (3)

1-3: LGTM!

The top-level structure of the JSON file is correctly formatted.

The code changes are approved.


4-20: LGTM!

The "fieldsConnection" object and its nested "nodes" array are correctly formatted.

The code changes are approved.

Also applies to: 22-24, 26-33, 35-47, 49-59, 61-71, 73-85, 87-97, 99-109, 111-121, 123-133, 135-145, 147-157, 159-169, 171-181, 183-193, 195-205, 207-217, 219-229, 231-241, 243-253, 255-265, 267-277, 279-289, 291-301, 303-313, 315-325, 327-337, 339-349, 351-361, 363-373, 375-385, 387-397, 399-409, 411-421, 423-433, 435-445, 447-457, 459-469, 471-481, 483-493, 495-505, 507-517, 519-529, 531-541, 543-553, 555-565, 567-577, 579-589, 591-601, 603-613, 615-625, 627-637, 639-649, 651-661, 663-673, 675-685, 687-697, 699-709, 711-721, 723-733, 735-745, 747-757, 759-769, 771-781, 783-793, 795-805, 807-817, 819-829, 831-841, 843-853, 855-865, 867-877, 879-889, 891-901, 903-913, 915-925, 927-937, 939-949, 951-961, 963-973, 975-985, 987-997, 999-1004


581-584: LGTM!

The "pageInfo" object is correctly formatted.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_1570e7f932f6_all.json (3)

1-3: LGTM!

The top-level structure of the JSON file is correctly formatted.

The code changes are approved.


4-9: LGTM!

The "fieldsConnection" object and its nested "nodes" array are correctly formatted.

The code changes are approved.

Also applies to: 11-19, 21-29, 31-39, 41-49, 51-59, 61-69, 71-79, 81-101, 103-113, 115-123, 125-133, 135-143, 145-153, 155-163, 165-173, 175-183, 185-193, 195-203, 205-213, 215-223, 225-233, 235-243, 245-253, 255-263, 265-273, 275-283, 285-293, 295-303, 305-313, 315-323, 325-333, 335-343, 345-353, 355-363, 365-373, 375-383, 385-393, 395-403, 405-413, 415-423, 425-433, 435-443, 445-453, 455-463, 465-473, 475-483, 485-493, 495-503, 505-513, 515-523, 525-533, 535-543, 545-553, 555-563, 565-573, 575-583, 585-593, 595-603, 605-613, 615-623, 625-633, 635-643, 645-653, 655-663, 665-673, 675-683, 685-693, 695-703, 705-713, 715-723, 725-733, 735-743, 745-753, 755-763, 765-773, 775-783, 785-793, 795-803, 805-813, 815-823, 825-833, 835-843, 845-853, 855-863, 865-873, 875-883, 885-893, 895-903, 905-913, 915-923, 925-933, 935-943, 945-953, 955-963, 965-973, 975-985, 987-995, 997-1004


997-1000: LGTM!

The "pageInfo" object is correctly formatted.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_c651da2f6ad8_all.json (2)

1-1246: LGTM!

The structure of the JSON file is well-organized and follows a logical hierarchy.

The structure of the JSON file is approved.


1-1246: Verify the totalCount field.

The totalCount field is set to 8, which seems inconsistent with the number of nodes provided. Please verify and correct if necessary.

Ensure that the totalCount field accurately reflects the number of nodes in the nodes array.

metadata-ingestion/tests/integration/tableau/setup/embeddedDatasourcesFieldUpstream_a0fced25e056_all.json (3)

4-1539: LGTM!

The nodes array is correctly formatted, with unique id values and appropriate upstreamFields and upstreamColumns.

The code changes are approved.


1541-1544: LGTM!

The pageInfo object is correctly formatted with appropriate values for hasNextPage and endCursor.

The code changes are approved.


1545-1545: LGTM!

The totalCount value is correctly set to 8.

The code changes are approved.

metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py (18)

29-29: Update import path for tableau_common module.

The import path has been updated to reflect the new module structure. Ensure that the new path is correct and that the tableau_common module is accessible from the updated path.

The code changes are approved.


288-288: Update mock patch path for Server class.

The mock patch path has been updated to reflect the new module structure. Ensure that the new path is correct and that the Server class is being mocked correctly.

The code changes are approved.


334-365: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


386-386: Update mock patch path for Server class.

The mock patch path has been updated to reflect the new module structure. Ensure that the new path is correct and that the Server class is being mocked correctly.

The code changes are approved.


425-457: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


490-522: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


555-587: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


621-653: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


686-718: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


750-785: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


853-885: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


1001-1033: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


1201-1234: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


1352-1385: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


1418-1491: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


1523-1556: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


1588-1621: Enhance test coverage for embedded datasource fields.

The additional read_response calls introduce new JSON files that are read during the tests, enhancing the test coverage for embedded datasource fields. This ensures a broader range of scenarios are tested.

The code changes are approved.


Line range hint 1643-1666: Update mock patch path for Server class.

The mock patch path has been updated to reflect the new module structure. Ensure that the new path is correct and that the Server class is being mocked correctly.

The code changes are approved.

Comment on lines 1 to 6
from datahub.ingestion.source.tableau.tableau import (
TableauConfig,
TableauSiteSource,
TableauSource,
TableauSourceReport,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unused imports.

The following imports are unused and should be removed to maintain clean and efficient code:

  • TableauConfig
  • TableauSiteSource
  • TableauSource
  • TableauSourceReport

Apply this diff to remove the unused imports:

-from datahub.ingestion.source.tableau.tableau import (
-    TableauConfig,
-    TableauSiteSource,
-    TableauSource,
-    TableauSourceReport,
-)
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from datahub.ingestion.source.tableau.tableau import (
TableauConfig,
TableauSiteSource,
TableauSource,
TableauSourceReport,
)
Tools
Ruff

2-2: datahub.ingestion.source.tableau.tableau.TableauConfig imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


3-3: datahub.ingestion.source.tableau.tableau.TableauSiteSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


4-4: datahub.ingestion.source.tableau.tableau.TableauSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


5-5: datahub.ingestion.source.tableau.tableau.TableauSourceReport imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

Comment on lines 982 to 983
current_cursor: Optional[str], # initial value is None
fetch_size: int = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve error handling for "NODE_LIMIT_EXCEEDED".

The function is correctly implemented, but the error handling for "NODE_LIMIT_EXCEEDED" could be improved by suggesting actionable steps to the user.

Apply this diff to suggest actionable steps for "NODE_LIMIT_EXCEEDED" error:

if node_limit_errors:
    logger.debug(f"Node Limit Error. query_data {query_data}")
    self.report.warning(
        title="Node Limit Errors",
        message="Increase your tableau node limit or reduce the fetch size in the configuration.",
        context=f"""{{
                "errors": {node_limit_errors},
                "connection_type": {connection_type},
                "query_filter": {query_filter},
                "query": {query},
        }}""",
    )

Also applies to: 986-987, 990-991, 995-1001, 1003-1003, 1013-1017, 1031-1035, 1037-1037, 1051-1051, 1059-1063, 1082-1093, 1131-1137

Comment on lines +3013 to +3016
datasource = self.update_datasource_for_field_upstream(
datasource=datasource,
field_upstream_query=datasource_upstream_fields_graphql_query,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract logic to a helper function.

The function is correctly implemented, but the logic for updating the datasource fields with their upstream fields could be extracted to a helper function to improve readability.

Apply this diff to extract the logic to a helper function:

def update_datasource_with_upstream_fields(self, datasource: dict) -> dict:
    return self.update_datasource_for_field_upstream(
        datasource=datasource,
        field_upstream_query=datasource_upstream_fields_graphql_query,
    )

def emit_embedded_datasources(self) -> Iterable[MetadataWorkUnit]:
    datasource_filter = {c.ID_WITH_IN: self.embedded_datasource_ids_being_used}

    for datasource in self.get_connection_objects(
        embedded_datasource_graphql_query,
        c.EMBEDDED_DATA_SOURCES_CONNECTION,
        datasource_filter,
    ):
        datasource = self.update_datasource_with_upstream_fields(datasource)
        yield from self.emit_datasource(
            datasource,
            datasource.get(c.WORKBOOK),
            is_embedded_ds=True,
        )
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
datasource = self.update_datasource_for_field_upstream(
datasource=datasource,
field_upstream_query=datasource_upstream_fields_graphql_query,
)
def update_datasource_with_upstream_fields(self, datasource: dict) -> dict:
return self.update_datasource_for_field_upstream(
datasource=datasource,
field_upstream_query=datasource_upstream_fields_graphql_query,
)
def emit_embedded_datasources(self) -> Iterable[MetadataWorkUnit]:
datasource_filter = {c.ID_WITH_IN: self.embedded_datasource_ids_being_used}
for datasource in self.get_connection_objects(
embedded_datasource_graphql_query,
c.EMBEDDED_DATA_SOURCES_CONNECTION,
datasource_filter,
):
datasource = self.update_datasource_with_upstream_fields(datasource)
yield from self.emit_datasource(
datasource,
datasource.get(c.WORKBOOK),
is_embedded_ds=True,
)

Comment on lines +2324 to +2358
def update_datasource_for_field_upstream(
self,
datasource: dict,
field_upstream_query: str,
) -> dict:
# Collect field ids to fetch field upstreams
field_ids: List[str] = []
for field in datasource.get(c.FIELDS, []):
if field.get(c.ID):
field_ids.append(field.get(c.ID))

# Fetch field upstreams and arrange them in map
field_vs_upstream: Dict[str, dict] = {}
for field_upstream in self.get_connection_objects(
field_upstream_query,
c.FIELDS_CONNECTION,
{c.ID_WITH_IN: field_ids},
):
if field_upstream.get(c.ID):
field_id = field_upstream[c.ID]
# delete the field id, we don't need it for further processing
del field_upstream[c.ID]
field_vs_upstream[field_id] = field_upstream

# update datasource's field for its upstream
for field_dict in datasource.get(c.FIELDS, []):
field_upstream_dict: Optional[dict] = field_vs_upstream.get(
field_dict.get(c.ID)
)
if field_upstream_dict:
# Add upstream fields to field
field_dict.update(field_upstream_dict)

return datasource

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify the logic using dictionary comprehension.

The function is correctly implemented, but the logic could be simplified by using a dictionary comprehension.

Apply this diff to simplify the logic:

# Fetch field upstreams and arrange them in map
field_vs_upstream: Dict[str, dict] = {
    field_upstream[c.ID]: {k: v for k, v in field_upstream.items() if k != c.ID}
    for field_upstream in self.get_connection_objects(
        field_upstream_query,
        c.FIELDS_CONNECTION,
        {c.ID_WITH_IN: field_ids},
    )
    if field_upstream.get(c.ID)
}

# update datasource's field for its upstream
for field_dict in datasource.get(c.FIELDS, []):
    field_upstream_dict: Optional[dict] = field_vs_upstream.get(field_dict.get(c.ID))
    if field_upstream_dict:
        # Add upstream fields to field
        field_dict.update(field_upstream_dict)
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def update_datasource_for_field_upstream(
self,
datasource: dict,
field_upstream_query: str,
) -> dict:
# Collect field ids to fetch field upstreams
field_ids: List[str] = []
for field in datasource.get(c.FIELDS, []):
if field.get(c.ID):
field_ids.append(field.get(c.ID))
# Fetch field upstreams and arrange them in map
field_vs_upstream: Dict[str, dict] = {}
for field_upstream in self.get_connection_objects(
field_upstream_query,
c.FIELDS_CONNECTION,
{c.ID_WITH_IN: field_ids},
):
if field_upstream.get(c.ID):
field_id = field_upstream[c.ID]
# delete the field id, we don't need it for further processing
del field_upstream[c.ID]
field_vs_upstream[field_id] = field_upstream
# update datasource's field for its upstream
for field_dict in datasource.get(c.FIELDS, []):
field_upstream_dict: Optional[dict] = field_vs_upstream.get(
field_dict.get(c.ID)
)
if field_upstream_dict:
# Add upstream fields to field
field_dict.update(field_upstream_dict)
return datasource
def update_datasource_for_field_upstream(
self,
datasource: dict,
field_upstream_query: str,
) -> dict:
# Collect field ids to fetch field upstreams
field_ids: List[str] = []
for field in datasource.get(c.FIELDS, []):
if field.get(c.ID):
field_ids.append(field.get(c.ID))
# Fetch field upstreams and arrange them in map
field_vs_upstream: Dict[str, dict] = {
field_upstream[c.ID]: {k: v for k, v in field_upstream.items() if k != c.ID}
for field_upstream in self.get_connection_objects(
field_upstream_query,
c.FIELDS_CONNECTION,
{c.ID_WITH_IN: field_ids},
)
if field_upstream.get(c.ID)
}
# update datasource's field for its upstream
for field_dict in datasource.get(c.FIELDS, []):
field_upstream_dict: Optional[dict] = field_vs_upstream.get(
field_dict.get(c.ID)
)
if field_upstream_dict:
# Add upstream fields to field
field_dict.update(field_upstream_dict)
return datasource


parent_project: Optional[
TableauProject
] = self.tableau_project_registry.get(project_.parent_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how exactly does this interact with project path filters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here #L870, connector adding project into the registry if its allowed by pattern.

parent_container_key = self.gen_project_key(project.parent_id)
elif self.config.add_site_container and self.site and self.site.id:
parent_container_key = self.gen_site_key(self.site.id)
visited_project_ids: Set[str] = set()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
visited_project_ids: Set[str] = set()
generated_project_ids: Set[str] = set()

parent_container_key: Optional[ContainerKey] = None
# set site as parent container
if self.config.add_site_container and self.site and self.site.id:
parent_container_key = self.gen_site_key(self.site.id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this after the visited_project_ids check?

also, why are we adding to visited_project_ids in two places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorganized the code

@@ -910,6 +930,40 @@ def make_filter(filter_dict: dict) -> str:
return filter


def query_metadata_cursor_based_pagination(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this cursor stuff still relevant after your other graphql simplifications?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup it is relevant as per tableau documentation.

@hsheth2 hsheth2 merged commit 3150d90 into datahub-project:master Sep 9, 2024
68 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Dec 17, 2024
@haeniya
Copy link
Contributor

haeniya commented Jan 6, 2025

Hi @sid-acryl I just came across your changes regarding emitting projects in topological order. We solved this for us some time ago with these changes: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/tableau/tableau.py#L1103-L1107.

Did you still have issues with this? I wonder if we really need both, or if we can keep one implementation. For us, the issue was solved with this small snippet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants