pattern_add_dataset_dataproduct works for Oracle ingestion but not S3 #11656

Open
mikeburke24 opened this issue Oct 17, 2024 · 6 comments
Labels
bug Bug report

Comments

@mikeburke24
Contributor

mikeburke24 commented Oct 17, 2024

Describe the bug
We are trying to automatically assign data products to datasets and their container during ingestion from S3. I have included the format of our transformer below:

To Reproduce

transformers:
    -
        type: pattern_add_dataset_dataproduct
        config:
            is_container: true
            dataset_to_data_product_urns_pattern:                
                rules:
                    '.*': 'urn:li:dataProduct:<DATA_PRODUCT_URN>'

However, the ingestion fails with the following message:
Failed to configure transformers: 1 validation error for PatternDatasetDataProductConfig
is_container
  extra fields not permitted (type=value_error.extra)
If we remove the is_container portion, the ingestion still fails with the message below:
ERROR :: /assets/0/destinationUrn :: field is required but not found and has no default value

Expected behavior
The documentation that you linked states that is_container is supported:

Additional context
This transformer format works fine for Oracle (if is_container is removed) but doesn't work for S3

@mikeburke24 mikeburke24 added the bug Bug report label Oct 17, 2024
@hsheth2
Collaborator

hsheth2 commented Oct 22, 2024

@mikeburke24 Because of #10928, you probably want to be on server 0.14.1 with a 0.14.1.x CLI version.

That should solve both the is_container config issue and the error during emission.

@mikeburke24
Contributor Author

@hsheth2 Hi, I've upgraded to GMS tag [1f02c84] and CLI 0.14.1.3, and we are still getting this error:

ERROR :: /assets/0/destinationUrn :: field is required but not found and has no default value

It does work for Oracle though. Do you have any ideas why it doesn't work for S3? Would you have any example syntax that might work?

@jjoyce0510
Collaborator

Interesting, it looks like we need to investigate the pattern_add_dataset_dataproduct transformer a bit more closely to determine why it would not be providing this field.

@mikeburke24
Contributor Author

@jjoyce0510 thanks John! If you've ever gotten this to work or have any other example syntax, please send it my way. I'm not sure what field it is looking for that it can't find. Here's an example I've tried on a local build:

transformers:
    -
        type: pattern_add_dataset_dataproduct
        config:
            dataset_to_data_product_urns_pattern:
                rules:
                    '.*': 'urn:li:dataProduct:xxxxxxxx'

@asikowitz
Collaborator

Can you post your full S3 recipe (redacted)? It seems like we have some bug where we emit an invalid MCP but I'm having trouble narrowing it down.

@mikeburke24
Contributor Author

mikeburke24 commented Nov 12, 2024

@asikowitz sure

source:
    type: s3
    config:
        path_specs:
            -
                include: 's3://<mybucket>/<myfile.csv>'
transformers:
    -
        type: pattern_add_dataset_dataproduct
        config:
            is_container: true
            dataset_to_data_product_urns_pattern:
                rules:
                    'urn:li:dataset:(urn:li:dataPlatform:s3,<mybucket>/<myfile.csv>,prod)': 'urn:li:dataProduct:<urn>'
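
Note on the rule key in this recipe: if the transformer treats each key under `rules` as a regular expression (the earlier `'.*'` key suggests it does, but that is an assumption here), the unescaped parentheses in a full dataset URN form a regex group instead of matching the literal `(` and `)`. The sketch below shows only that regex behavior using Python's standard `re` module. The bucket and file names are hypothetical placeholders, no DataHub code is called, and whether this has anything to do with the `destinationUrn` error is not established.

```python
import re

# Hypothetical dataset URN; the bucket and file names are placeholders.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/my-file.csv,PROD)"

# Rule key written the same way as in the recipe: parentheses left unescaped,
# so the regex engine reads them as a group, not as literal characters.
raw_key = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/my-file.csv,PROD)"

# The same key with every regex metacharacter escaped.
escaped_key = re.escape(raw_key)


def matches(pattern: str, urn: str) -> bool:
    """Return True if the pattern matches at the start of the URN."""
    return re.match(pattern, urn) is not None


print(matches(raw_key, dataset_urn))      # False
print(matches(escaped_key, dataset_urn))  # True
print(matches(".*", dataset_urn))         # True
```

If the keys are indeed regexes, a key meant to match one exact URN would need its parentheses escaped (for example `\(` and `\)`), while a catch-all key such as `'.*'` is unaffected.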


4 participants