Merge branch 'datahub-project:master' into master
hsheth2 authored Mar 22, 2024
2 parents 939147d + 1cff5ef commit 0214c6e
Showing 19 changed files with 479 additions and 20 deletions.
4 changes: 3 additions & 1 deletion docs-website/docusaurus.config.js
@@ -283,6 +283,9 @@ module.exports = {
path: "src/pages",
mdxPageComponent: "@theme/MDXPage",
},
googleTagManager: {
containerId: 'GTM-WK28RLTG',
},
},
],
],
@@ -296,7 +299,6 @@ module.exports = {
routeBasePath: "/docs/graphql",
},
],
// '@docusaurus/plugin-google-gtag',
// [
// require.resolve("@easyops-cn/docusaurus-search-local"),
// {
44 changes: 29 additions & 15 deletions docs-website/sidebars.js
@@ -69,24 +69,38 @@ module.exports = {
type: "category",
items: [
{
type: "doc",
id: "docs/managed-datahub/observe/freshness-assertions",
className: "saasOnly",
},
{
type: "doc",
id: "docs/managed-datahub/observe/volume-assertions",
className: "saasOnly",
},
{
type: "doc",
id: "docs/managed-datahub/observe/custom-sql-assertions",
className: "saasOnly",
label: "Assertions",
type: "category",
link: {
type: "doc",
id: "docs/managed-datahub/observe/assertions",
},
items: [
{
type: "doc",
id: "docs/managed-datahub/observe/freshness-assertions",
className: "saasOnly",
},
{
type: "doc",
id: "docs/managed-datahub/observe/volume-assertions",
className: "saasOnly",
},
{
type: "doc",
id: "docs/managed-datahub/observe/custom-sql-assertions",
className: "saasOnly",
},
{
type: "doc",
id: "docs/managed-datahub/observe/column-assertions",
className: "saasOnly",
},
],
},
{
type: "doc",
id: "docs/managed-datahub/observe/column-assertions",
className: "saasOnly",
id: "docs/managed-datahub/observe/data-contract",
},
],
},
48 changes: 48 additions & 0 deletions docs/managed-datahub/observe/assertions.md
@@ -0,0 +1,48 @@
# Assertions

:::note Contract Monitoring Support
Currently we support Snowflake, Databricks, Redshift, and BigQuery for out-of-the-box contract monitoring as part of Acryl Observe.
:::

An assertion is **a data quality test that finds data that violates a specified rule.**
Assertions serve as the building blocks of [Data Contracts](/docs/managed-datahub/observe/data-contract.md) – this is how we verify the contract is met.
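
To make the definition concrete, here is a minimal Python sketch of an assertion as a rule that returns the rows violating it. The `find_violations` helper and the sample rows are hypothetical illustrations and are not part of the DataHub API.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def find_violations(rows: List[Row], rule: Callable[[Row], bool]) -> List[Row]:
    """Return every row for which the rule does not hold."""
    return [row for row in rows if not rule(row)]


# Hypothetical sample data: the rule says `amount` must be non-negative.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -5},
    {"id": 3, "amount": 0},
]
violations = find_violations(rows, lambda row: row["amount"] >= 0)

# The assertion passes only when no row violates the rule.
assertion_passed = len(violations) == 0
```

Here the assertion fails because one row breaks the rule; a real assertion would typically push the rule down to the warehouse as a query rather than scan rows in Python.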

## How to Create and Run Assertions

Data quality tests (a.k.a. assertions) can be created and run by Acryl or ingested from a 3rd party tool.

### Acryl Observe

For Acryl-provided assertion runners, we can deploy an agent in your environment that connects to your sources and to DataHub. Acryl Observe offers out-of-the-box evaluation of the following kinds of assertions:

- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md) (SLAs)
- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
- [Column](/docs/managed-datahub/observe/column-assertions.md)

These can be defined through the DataHub API or the UI.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/assertions/assertion-ui.png"/>
</p>

### 3rd Party Runners

You can integrate 3rd party tools as follows:

- [DBT Test](/docs/generated/ingestion/sources/dbt.md#integrating-with-dbt-test)
- [Great Expectations](../../../metadata-ingestion/integration_docs/great-expectations.md)

If you opt for a 3rd party tool, it will be your responsibility to ensure the assertions are run based on the Data Contract spec stored in DataHub. With 3rd party runners, you can get the Assertion Change events by subscribing to our Kafka topic using the [DataHub Actions Framework](/docs/actions/README.md).


## Alerts

You can see the results of assertion checks (and their history) both on the physical asset’s page in the DataHub UI and through DataHub API calls. You can also get notified via [Slack messages](/docs/managed-datahub/saas-slack-setup.md) (DMs or a team channel) based on your [subscription](https://youtu.be/VNNZpkjHG_I?t=79) to an assertion change event. In the future, we’ll also provide the ability to subscribe directly to contracts.

With Acryl Observe, you can also receive the Assertion Change event as an API event via [AWS EventBridge](/docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md). Availability and ease of setup depend on your current Acryl setup; chat with your Acryl representative to learn more.


## Cost

We provide several ways to run your assertions, so that you can choose the cheapest or the most accurate option for your use case. For example, for Freshness (SLA) assertions, it is relatively cheap to run checks against the data source's Audit Log or Information Schema. We support both of those, as well as Last Modified Column, High Watermark Column, and DataHub Operation ([see the docs for more details](/docs/managed-datahub/observe/freshness-assertions.md#3-change-source)).
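
As a rough illustration of what a freshness check boils down to, the sketch below compares a last-modified timestamp with an allowed age. The `is_fresh` function and the timestamps are hypothetical. In practice the timestamp would come from one of the change sources listed above, and Acryl Observe's actual evaluation logic is not shown here.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_modified: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the most recent change is within the allowed window."""
    return now - last_modified <= max_age


# Hypothetical values; a real check would read `last_modified` from, for example,
# the maximum of a last-modified column or an audit log entry.
now = datetime(2024, 3, 22, 8, 4, tzinfo=timezone.utc)
recent = datetime(2024, 3, 22, 2, 0, tzinfo=timezone.utc)
stale = datetime(2024, 3, 20, 2, 0, tzinfo=timezone.utc)

recent_ok = is_fresh(recent, now, timedelta(hours=24))
stale_ok = is_fresh(stale, now, timedelta(hours=24))
```

The cheaper sources answer this question from metadata alone, while column-based sources require a query against the table itself.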
119 changes: 119 additions & 0 deletions docs/managed-datahub/observe/data-contract.md
@@ -0,0 +1,119 @@
# Data Contracts

## What Is a Data Contract

A Data Contract is **an agreement between a data asset's producer and consumer**, serving as a promise about the quality of the data.
It often includes [assertions](assertions.md) about the data’s schema, freshness, and data quality.

Some of the key characteristics of a Data Contract are:

- **Verifiable**: based on the actual physical data asset, not its metadata (e.g., schema checks, column-level data checks, and operational SLAs, but not documentation, ownership, or tags).
- **A set of assertions**: the actual checks against the physical asset that determine a contract’s status (schema, freshness, volume, custom, and column).
- **Producer oriented**: one contract per physical data asset, owned by the producer.


<details>
<summary>Consumer-Oriented Data Contracts</summary>
We’ve gone with producer-oriented contracts to keep the number of contracts manageable and because we expect consumers to want a lot of overlap in a given physical asset’s contract. However, we've heard feedback that consumer-oriented data contracts meet certain needs that producer-oriented contracts do not. For example, having one contract per consumer, all on the same physical data asset, would allow each consumer to get alerts only when the assertions they care about are violated. We welcome feedback on this in Slack!
</details>

Below is a screenshot of the Data Contracts UI in DataHub.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/validated-data-contracts-ui.png"/>
</p>

## Data Contract and Assertions

Another way to word our vision of data contracts is: **a bundle of verifiable assertions on physical data assets representing a public producer commitment.**
These can be all the assertions on an asset or only the subset you want publicly promised to consumers. Data Contracts allow you to **promote a selected group of your assertions** as a public promise: if this subset of assertions is not met, the Data Contract is failing.
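
The relationship between a contract and its promoted assertions can be sketched as follows. The `Contract` class, the assertion names, and the rule that a promoted assertion with no recorded result counts as not passing are assumptions made for this illustration, not DataHub's data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Contract:
    # Names of the assertions promoted into the contract (hypothetical structure).
    promoted_assertions: List[str]


def contract_is_passing(contract: Contract, results: Dict[str, bool]) -> bool:
    """A contract passes only if every promoted assertion has a passing result.

    A promoted assertion without a recorded result is treated as not passing;
    that choice is an assumption of this sketch.
    """
    return all(results.get(name, False) for name in contract.promoted_assertions)


# Latest results for all assertions on the asset, promoted or not.
results = {
    "freshness": True,
    "unique_field_foo": True,
    "internal_volume_check": False,
}
contract = Contract(promoted_assertions=["freshness", "unique_field_foo"])
passing = contract_is_passing(contract, results)
```

The failing `internal_volume_check` does not affect the contract because it was not promoted, which is the point of publishing only a subset.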

See docs on [assertions](/docs/managed-datahub/observe/assertions.md) for more details on the types of assertions and how to create and run them.

:::note Ownership
The owner of the physical data asset is also the owner of the contract. The owner can accept proposed changes to the contract and make changes themselves.
:::


## How to Create Data Contracts

Data Contracts can be created via DataHub CLI (YAML), API, or UI.

### DataHub CLI using YAML

Creating a contract via the CLI takes a single upsert command, which you can integrate into your CI/CD system to publish your Data Contracts and any changes to them.

1. Define your data contract.

```yaml
{{ inline /metadata-ingestion/examples/library/create_data_contract.yml show_path_as_comment }}
```

2. Use the CLI to create the contract by running the command below.

```shell
datahub datacontract upsert -f contract_definition.yml
```

3. Now you can see your contract in the UI.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/data-contracts-ui.png"/>
</p>
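
For a CI/CD job, step 2 could be wrapped in a small script along these lines. The Python wrapper and its function names are hypothetical; only the `datahub datacontract upsert -f` command itself comes from the steps above, and the sketch assumes the `datahub` CLI is installed and configured in the job's environment.

```python
import subprocess
from typing import List


def build_upsert_command(contract_file: str) -> List[str]:
    """Build the argument list for the CLI command shown in step 2."""
    return ["datahub", "datacontract", "upsert", "-f", contract_file]


def publish_contract(contract_file: str) -> None:
    """Run the upsert; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_upsert_command(contract_file), check=True)
```

Calling `publish_contract("contract_definition.yml")` from the pipeline makes a failed upsert fail the job, since `check=True` turns a non-zero exit code into an exception.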


### UI

1. Navigate to the Dataset Profile for the dataset you wish to create a contract for.
2. Under the **Validations** > **Data Contracts** tab, click **Create**.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/create-data-contract-ui.png"/>
</p>


3. Select the assertions you wish to be included in the Data Contract.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/select-assertions.png"/>
</p>


:::note Create Data Contracts via UI
When creating a Data Contract via the UI, the Freshness, Schema, and Data Quality assertions must be created first.
:::

4. Now you can see it in the UI.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/contracts-created.png"/>
</p>


### API

_An API guide on creating data contracts is coming soon!_


## How to Run Data Contracts

Running a Data Contract depends on running the contract’s assertions and getting the results into DataHub. Using Acryl Observe (available on SaaS), you can schedule assertions in DataHub itself. Otherwise, you can run your assertions outside of DataHub and publish the results back to DataHub.

DataHub integrates with DBT Test and Great Expectations, as described below. For other 3rd party assertion runners, you’ll need to use our APIs to publish the assertion results back to our platform.

### DBT Test

During dbt ingestion, we pick up the dbt `run_results` file, which contains the dbt test run results, and translate it into assertion runs. [See details here.](/docs/generated/ingestion/sources/dbt.md#module-dbt)

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/dbt-test.png"/>
</p>
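
To give a feel for the translation, the sketch below reads a trimmed-down `run_results` payload and maps each dbt test to a pass/fail flag. This is not DataHub's ingestion code. The sample payload is hypothetical, the `results`, `unique_id`, and `status` field names reflect our understanding of the dbt artifact and should be checked against your dbt version, and treating every status other than `pass` as a failure is a simplification.

```python
import json
from typing import Dict

# Hypothetical, trimmed-down run_results content; real files carry more fields.
SAMPLE_RUN_RESULTS = """
{
  "results": [
    {"unique_id": "test.my_project.unique_orders_id", "status": "pass"},
    {"unique_id": "test.my_project.not_null_orders_id", "status": "fail"},
    {"unique_id": "model.my_project.orders", "status": "success"}
  ]
}
"""


def summarize_test_results(run_results_json: str) -> Dict[str, bool]:
    """Map each dbt test's unique_id to True (passed) or False (anything else)."""
    results = json.loads(run_results_json)["results"]
    return {
        r["unique_id"]: r["status"] == "pass"
        for r in results
        if r["unique_id"].startswith("test.")  # skip models, seeds, etc.
    }


summary = summarize_test_results(SAMPLE_RUN_RESULTS)
```

Each entry in `summary` corresponds to what would become one assertion run on the tested dataset.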



### Great Expectations

For Great Expectations, you can integrate the **DataHubValidationAction** directly into your Great Expectations Checkpoint to send the assertion (a.k.a. expectation) results to DataHub. [See the guide here](../../../metadata-ingestion/integration_docs/great-expectations.md).

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/data_contracts/gx-test.png"/>
</p>
39 changes: 39 additions & 0 deletions metadata-ingestion/examples/library/create_data_contract.yml
@@ -0,0 +1,39 @@
# id: sample_data_contract # Optional: if not provided, an id will be generated
entity: urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)
version: 1
freshness:
  type: cron
  cron: "4 8 * * 1-5"
data_quality:
  - type: unique
    column: field_foo
## here's an example of how you'd define the schema
# schema:
#   type: json-schema
#   json-schema:
#     type: object
#     properties:
#       field_foo:
#         type: string
#         native_type: VARCHAR(100)
#       field_bar:
#         type: boolean
#         native_type: boolean
#       field_documents:
#         type: array
#         items:
#           type: object
#           properties:
#             docId:
#               type: object
#               properties:
#                 docPolicy:
#                   type: object
#                   properties:
#                     policyId:
#                       type: integer
#                     fileId:
#                       type: integer
#     required:
#       - field_bar
#       - field_documents
Original file line number Diff line number Diff line change
@@ -407,8 +407,12 @@ def validate_include_column_lineage(
     def validate_skip_sources_in_lineage(
         cls, skip_sources_in_lineage: bool, values: Dict
     ) -> bool:
-        entites_enabled: DBTEntitiesEnabled = values["entities_enabled"]
-        if skip_sources_in_lineage and entites_enabled.sources == EmitDirective.YES:
+        entites_enabled: Optional[DBTEntitiesEnabled] = values.get("entities_enabled")
+        if (
+            skip_sources_in_lineage
+            and entites_enabled
+            and entites_enabled.sources == EmitDirective.YES
+        ):
             raise ValueError(
                 "When `skip_sources_in_lineage` is enabled, `entities_enabled.sources` must be set to NO."
             )
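
The hunk above swaps a direct `values["entities_enabled"]` lookup for `values.get(...)` plus a truthiness check. Our reading, which the commit does not state, is that the key can be missing from `values`, for example when an earlier field failed validation. The standalone sketch below mirrors the patched logic without pydantic; `EmitDirective`, `EntitiesEnabled`, and `check_skip_sources` are simplified stand-ins, not the real DataHub classes.

```python
from enum import Enum
from typing import Any, Dict, Optional


class EmitDirective(Enum):
    # Stand-in for the real enum; only the two values used here.
    YES = "YES"
    NO = "NO"


class EntitiesEnabled:
    # Stand-in for DBTEntitiesEnabled.
    def __init__(self, sources: EmitDirective) -> None:
        self.sources = sources


def check_skip_sources(skip_sources_in_lineage: bool, values: Dict[str, Any]) -> bool:
    """Mirror the patched check: tolerate a missing `entities_enabled` key."""
    entities_enabled: Optional[EntitiesEnabled] = values.get("entities_enabled")
    if (
        skip_sources_in_lineage
        and entities_enabled
        and entities_enabled.sources == EmitDirective.YES
    ):
        raise ValueError(
            "When `skip_sources_in_lineage` is enabled, "
            "`entities_enabled.sources` must be set to NO."
        )
    return skip_sources_in_lineage
```

With an empty `values` dict the check returns without raising, whereas the pre-patch indexing would have raised a `KeyError` before reaching the intended validation message.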
24 changes: 24 additions & 0 deletions metadata-ingestion/tests/unit/test_dbt_source.py
@@ -197,6 +197,30 @@ def test_dbt_entity_emission_configuration():
DBTCoreConfig.parse_obj(config_dict)


def test_dbt_config_skip_sources_in_lineage():
    with pytest.raises(
        ValidationError,
        match="skip_sources_in_lineage.*entities_enabled.sources.*set to NO",
    ):
        config_dict = {
            "manifest_path": "dummy_path",
            "catalog_path": "dummy_path",
            "target_platform": "dummy_platform",
            "skip_sources_in_lineage": True,
        }
        config = DBTCoreConfig.parse_obj(config_dict)

    config_dict = {
        "manifest_path": "dummy_path",
        "catalog_path": "dummy_path",
        "target_platform": "dummy_platform",
        "skip_sources_in_lineage": True,
        "entities_enabled": {"sources": "NO"},
    }
    config = DBTCoreConfig.parse_obj(config_dict)
    assert config.skip_sources_in_lineage is True


def test_dbt_s3_config():
# test missing aws config
config_dict: dict = {
1 change: 1 addition & 0 deletions metadata-service/openapi-servlet/build.gradle
@@ -9,6 +9,7 @@ dependencies {
implementation project(':metadata-service:auth-impl')
implementation project(':metadata-service:factories')
implementation project(':metadata-service:schema-registry-api')
implementation project (':metadata-service:openapi-servlet:models')

implementation externalDependency.reflections
implementation externalDependency.springBoot
16 changes: 16 additions & 0 deletions metadata-service/openapi-servlet/models/build.gradle
@@ -0,0 +1,16 @@
plugins {
id 'java'
}

dependencies {
implementation project(':entity-registry')
implementation project(':metadata-operation-context')
implementation project(':metadata-auth:auth-api')

implementation externalDependency.jacksonDataBind
implementation externalDependency.httpClient

compileOnly externalDependency.lombok

annotationProcessor externalDependency.lombok
}