Merge branch 'master' into feat_snowflake_swap
mayurinehate authored Oct 15, 2024
2 parents a770bb1 + 1eec2c4 commit ac4f78f
Showing 68 changed files with 3,487 additions and 919 deletions.
1 change: 1 addition & 0 deletions build.gradle
@@ -398,6 +398,7 @@ subprojects {
implementation("com.fasterxml.jackson.core:jackson-databind:$jacksonVersion")
implementation("com.fasterxml.jackson.core:jackson-dataformat-cbor:$jacksonVersion")
implementation(externalDependency.commonsIo)
implementation(externalDependency.protobuf)
}
}

12 changes: 12 additions & 0 deletions docs-website/sidebars.js
@@ -113,6 +113,18 @@ module.exports = {
        id: "docs/automations/snowflake-tag-propagation",
        className: "saasOnly",
      },
      {
        label: "AI Classification",
        type: "doc",
        id: "docs/automations/ai-term-suggestion",
        className: "saasOnly",
      },
      {
        label: "AI Documentation",
        type: "doc",
        id: "docs/automations/ai-docs",
        className: "saasOnly",
      },
    ],
  },
  {
132 changes: 65 additions & 67 deletions docs/api/datahub-apis.md

Large diffs are not rendered by default.

14 changes: 13 additions & 1 deletion docs/api/tutorials/custom-assertions.md
@@ -265,7 +265,7 @@ query getAssertion {
      customType # Will be your custom type.
      description
      lastUpdated {
        time
        actor
      }
      customAssertion {
@@ -282,6 +282,18 @@
        }
      }
    }
    # Fetch the entities that this assertion is attached to
    relationships(input: {
      types: ["Asserts"]
      direction: OUTGOING
    }) {
      total
      relationships {
        entity {
          urn
        }
      }
    }
  }
}
```
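
To run this query programmatically, here is a minimal sketch that posts it to DataHub's GraphQL endpoint over plain HTTP; the endpoint URL, token, and assertion URN are placeholders to replace for your deployment:

```python
import requests

# Placeholders: substitute the GraphQL endpoint and token for your deployment.
GRAPHQL_URL = "https://<your-account>.acryl.io/api/graphql"
TOKEN = "<personal-access-token>"

query = """
query getAssertionEntities($urn: String!) {
  assertion(urn: $urn) {
    relationships(input: { types: ["Asserts"], direction: OUTGOING }) {
      total
      relationships {
        entity {
          urn
        }
      }
    }
  }
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={"query": query, "variables": {"urn": "urn:li:assertion:<assertion-id>"}},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Print the URN of every entity this assertion is attached to.
for rel in resp.json()["data"]["assertion"]["relationships"]["relationships"]:
    print(rel["entity"]["urn"])
```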
46 changes: 45 additions & 1 deletion docs/api/tutorials/structured-properties.md
@@ -532,6 +532,50 @@ Or you can run the following command to view the properties associated with the
datahub dataset get --urn {urn}
```

## Read Structured Properties From a Dataset

For reading all structured properties from a dataset:

<Tabs>
<TabItem value="graphql" label="GraphQL" default>

```graphql
query getDataset {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.ecommerce.customer,PROD)") {
    structuredProperties {
      properties {
        structuredProperty {
          urn
          type
          definition {
            displayName
            description
            allowedValues {
              description
            }
          }
        }
        values {
          ... on StringValue {
            stringValue
          }
          ... on NumberValue {
            numberValue
          }
        }
        valueEntities {
          urn
          type
        }
      }
    }
  }
}
```

</TabItem>
</Tabs>
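
The same query can also be issued from Python. Below is a sketch using the `acryl-datahub` SDK's GraphQL helper; the server URL, token, and dataset URN are assumptions to adapt to your environment:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Assumes a DataHub instance at localhost:8080; pass a token if auth is enabled.
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token="<token>"))

query = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    structuredProperties {
      properties {
        structuredProperty { urn }
        values {
          ... on StringValue { stringValue }
          ... on NumberValue { numberValue }
        }
      }
    }
  }
}
"""

result = graph.execute_graphql(
    query,
    variables={
        "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.ecommerce.customer,PROD)"
    },
)

# Print each structured property URN alongside its raw values.
for prop in result["dataset"]["structuredProperties"]["properties"]:
    print(prop["structuredProperty"]["urn"], prop["values"])
```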

## Remove Structured Properties From a Dataset

For removing a structured property or list of structured properties from a dataset:
@@ -1733,4 +1777,4 @@ Example Response:
```

</TabItem>
</Tabs>
36 changes: 36 additions & 0 deletions docs/automations/ai-docs.md
@@ -0,0 +1,36 @@
import FeatureAvailability from '@site/src/components/FeatureAvailability';

# AI Documentation

<FeatureAvailability saasOnly />

:::info

This feature is currently in closed beta. Reach out to your Acryl representative to get access.

:::

With AI-powered documentation, you can automatically generate documentation for tables and columns.

<p align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/_7DieZeZspY?si=Q5FkCA0gZPEFMj0Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</p>

## Configuring

No configuration is required. Just hit "Generate" on any table or column in the UI.

## How it works

Generating good documentation requires a holistic understanding of the data. Information we take into account includes, but is not limited to:

- Dataset name and any existing documentation
- Column name, type, description, and sample values
- Lineage relationships to upstream and downstream assets
- Metadata about other related assets

Data privacy: Your metadata is not sent to any third-party LLMs. We use AWS Bedrock internally, which means all metadata remains within the Acryl AWS account. We do not fine-tune on customer data.

## Limitations

- This feature is powered by an LLM, which can produce inaccurate results. While we've taken steps to reduce the likelihood of hallucinations, they can still occur.
72 changes: 72 additions & 0 deletions docs/automations/ai-term-suggestion.md
@@ -0,0 +1,72 @@
import FeatureAvailability from '@site/src/components/FeatureAvailability';

# AI Glossary Term Suggestions

<FeatureAvailability saasOnly />

:::info

This feature is currently in closed beta. Reach out to your Acryl representative to get access.

:::

The AI Glossary Term Suggestion automation uses LLMs to suggest [Glossary Terms](../glossary/business-glossary.md) for tables and columns in your data.

This is useful for improving coverage of glossary terms across your organization, which is important for compliance and governance efforts.

This automation can:

- Automatically suggest glossary terms for tables and columns.
- Go beyond a predefined set of terms and work with your own business glossary.
- Generate [proposals](../managed-datahub/approval-workflows.md) for owners to review, or automatically add terms to tables/columns.
- Automatically adjust to human-provided feedback and curation (coming soon).

## Prerequisites

- A business glossary with terms defined. Additional metadata, like documentation and existing term assignments, will improve the accuracy of our suggestions.

## Configuring

1. **Navigate to Automations**: Click on 'Govern' > 'Automations' in the navigation bar.

<p align="center">
<img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automations-nav-link.png"/>
</p>

2. **Create the Automation**: Click on 'Create' and select 'AI Glossary Term Suggestions'.

<p align="center">
<img width="40%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/ai-term-suggestion/automation-type.png"/>
</p>

3. **Configure the Automation**: Fill in the required fields to configure the automation.
The main fields to configure are (1) what terms to use for suggestions and (2) what entities to generate suggestions for.

<p align="center">
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/ai-term-suggestion/automation-config.png"/>
</p>

4. Once the automation is enabled, that's it! You'll start to see terms show up in the UI, either directly on assets or on the proposals page.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/ai-term-suggestion/term-proposals.png"/>
</p>

## How it works

The automation will scan through all the datasets matched by the configured filters. For each one, it will generate suggestions.
If new entities are added that match the configured filters, those will also be classified within 24 hours.

We take into account the following metadata when generating suggestions:

- Dataset name and description
- Column name, type, description, and sample values
- Glossary term name, documentation, and hierarchy
- Feedback loop: existing assignments and accepted/rejected proposals (coming soon)

Data privacy: Your metadata is not sent to any third-party LLMs. We use AWS Bedrock internally, which means all metadata remains within the Acryl AWS account. We do not fine-tune on customer data.

## Limitations

- A single configured automation can classify at most 10k entities.
- We cannot do partial reclassification. If you add a new column to an existing table, we won't regenerate suggestions for that table.
33 changes: 16 additions & 17 deletions docs/automations/snowflake-tag-propagation.md
@@ -1,4 +1,3 @@

import FeatureAvailability from '@site/src/components/FeatureAvailability';

# Snowflake Tag Propagation Automation
@@ -20,22 +19,22 @@ both columns and tables back to Snowflake. This automation is available in DataH

1. **Navigate to Automations**: Click on 'Govern' > 'Automations' in the navigation bar.

<p align="left">
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automations-nav-link.png"/>
<p align="center">
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automations-nav-link.png"/>
</p>

2. **Create An Automation**: Click on 'Create' and select 'Snowflake Tag Propagation'.

<p align="left">
<img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-type.png"/>
<p align="center">
<img width="60%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-type.png"/>
</p>

3. **Configure Automation**: Fill in the required fields to connect to Snowflake, along with the name, description, and category.
   Note that you can limit propagation based on specific Tags and Glossary Terms. If none are selected, then ALL Tags or Glossary Terms will be automatically
   propagated to Snowflake tables and columns. Finally, click 'Save and Run' to start the automation.

<p align="left">
<img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-form.png"/>
<p align="center">
<img width="60%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-form.png"/>
</p>

## Propagating for Existing Assets
Expand All @@ -46,13 +45,13 @@ Note that it may take some time to complete the initial back-filling process, de
To do so, navigate to the Automation you created in Step 3 above, click the 3-dot "More" menu

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
</p>

and then click "Initialize".

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-initialize.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-initialize.png"/>
</p>

This one-time step will kick off the back-filling process for existing descriptions. If you only want to begin propagating
Expand All @@ -68,21 +67,21 @@ that you no longer want propagated descriptions to be visible.
To do this, navigate to the Automation you created in Step 3 above, click the 3-dot "More" menu

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
</p>

and then click "Rollback".

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-rollback.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-rollback.png"/>
</p>

This one-time step will remove all propagated tags and glossary terms from Snowflake. To simply stop propagating new tags, you can disable the automation.

## Viewing Propagated Tags

You can view propagated Tags (and corresponding DataHub URNs) inside the Snowflake UI to confirm the automation is working as expected.

<p align="left">
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/view-snowflake-tags.png"/>
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/view-snowflake-tags.png"/>
</p>
3 changes: 2 additions & 1 deletion docs/lineage/airflow.md
@@ -132,7 +132,7 @@ conn_id = datahub_rest_default # or datahub_kafka_default
```

| Name | Default value | Description |
|----------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| enabled | true | If the plugin should be enabled. |
| conn_id | datahub_rest_default | The name of the datahub connection you set in step 1. |
| cluster | prod | name of the airflow cluster |
Expand All @@ -145,6 +145,7 @@ conn_id = datahub_rest_default # or datahub_kafka_default
| datajob_url_link | taskinstance | If taskinstance, the datajob URL will be the task instance link in Airflow. It can also be grid. |
| graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. |
| dag_filter_str | { "allow": [".*"] } | AllowDenyPattern value, as a JSON string, that selects which DAGs have lineage ingested. |
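
The `dag_filter_str` value is parsed into DataHub's `AllowDenyPattern`. A quick way to sanity-check a pattern before putting it in `airflow.cfg` is sketched below; the DAG names and regexes are illustrative only:

```python
from datahub.configuration.common import AllowDenyPattern

# Same parsing the plugin performs on the dag_filter_str config value.
pattern = AllowDenyPattern.parse_raw('{"allow": ["etl_.*"], "deny": [".*_test$"]}')

assert pattern.allowed("etl_orders")  # matches an allow regex
assert not pattern.allowed("etl_orders_test")  # deny regexes take precedence
assert not pattern.allowed("reporting_dag")  # matches no allow regex
```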

#### Validate that the plugin is working

@@ -3,7 +3,8 @@

import datahub.emitter.mce_builder as builder
from airflow.configuration import conf
from datahub.configuration.common import AllowDenyPattern, ConfigModel
from pydantic.fields import Field

if TYPE_CHECKING:
    from datahub_airflow_plugin.hooks.datahub import DatahubGenericHook
@@ -56,6 +57,11 @@ class DatahubLineageConfig(ConfigModel):
    # Makes extraction of jinja-templated fields more accurate.
    render_templates: bool = True

    dag_filter_pattern: AllowDenyPattern = Field(
        default=AllowDenyPattern.allow_all(),
        description="regex patterns for DAGs to ingest",
    )

    log_level: Optional[str] = None
    debug_emitter: bool = False

@@ -93,6 +99,9 @@ def get_lineage_config() -> DatahubLineageConfig:
    datajob_url_link = conf.get(
        "datahub", "datajob_url_link", fallback=DatajobUrl.TASKINSTANCE.value
    )
    dag_filter_pattern = AllowDenyPattern.parse_raw(
        conf.get("datahub", "dag_filter_str", fallback='{"allow": [".*"]}')
    )

    return DatahubLineageConfig(
        enabled=enabled,
Expand All @@ -109,4 +118,5 @@ def get_lineage_config() -> DatahubLineageConfig:
disable_openlineage_plugin=disable_openlineage_plugin,
datajob_url_link=datajob_url_link,
render_templates=render_templates,
dag_filter_pattern=dag_filter_pattern,
)
@@ -383,9 +383,13 @@ def on_task_instance_running(
            return

        logger.debug(
            f"DataHub listener got notification about task instance start for {task_instance.task_id} of dag {task_instance.dag_id}"
        )

        if not self.config.dag_filter_pattern.allowed(task_instance.dag_id):
            logger.debug(f"DAG {task_instance.dag_id} is not allowed by the pattern")
            return

        if self.config.render_templates:
            task_instance = _render_templates(task_instance)

@@ -492,6 +496,10 @@ def on_task_instance_finish(

dag: "DAG" = task.dag # type: ignore[assignment]

if not self.config.dag_filter_pattern.allowed(dag.dag_id):
logger.debug(f"DAG {dag.dag_id} is not allowed by the pattern")
return

datajob = AirflowGenerator.generate_datajob(
cluster=self.config.cluster,
task=task,
@@ -689,8 +697,12 @@ def on_dag_run_running(self, dag_run: "DagRun", msg: str) -> None:
f"DataHub listener got notification about dag run start for {dag_run.dag_id}"
)

self.on_dag_start(dag_run)
assert dag_run.dag_id
if not self.config.dag_filter_pattern.allowed(dag_run.dag_id):
logger.debug(f"DAG {dag_run.dag_id} is not allowed by the pattern")
return

self.on_dag_start(dag_run)
self.emitter.flush()

# TODO: Add hooks for on_dag_run_success, on_dag_run_failed -> call AirflowGenerator.complete_dataflow
