fix: PR comments
sagar-salvi-apptware committed Oct 22, 2024
1 parent 9835e31 commit 0afe29f
Showing 6 changed files with 50 additions and 61 deletions.
33 changes: 14 additions & 19 deletions metadata-ingestion/docs/sources/dremio/README.md
@@ -1,23 +1,18 @@
### Concept Mapping

- **Dremio Datasets**: Mapped to DataHub’s `Dataset` entity.
- A dataset can be physical or virtual.
- **Lineage**: Mapped to DataHub’s `UpstreamLineage` aspect, representing the flow of data between datasets and columns.
- **Containers**: Spaces, folders, and sources in Dremio are mapped to DataHub’s `Container` aspect, organizing datasets logically.

The **Concept Mapping** table below gives an overview of how entities and concepts in Dremio map to their corresponding DataHub entities (a short code illustration follows the table):

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| **Physical Dataset** | `Dataset` | A dataset directly queried from an external source without modifications. |
| **Virtual Dataset** | `Dataset` | A dataset built from SQL-based transformations on other datasets. |
| **Spaces** | `Container` | Top-level organizational unit in Dremio, used to group datasets. Mapped to DataHub’s `Container` aspect. |
| **Folders** | `Container` | Substructure inside spaces, used for organizing datasets. Mapped as a `Container` in DataHub. |
| **Sources** | `Container` | External data sources connected to Dremio (e.g., S3, databases). Represented as a `Container` in DataHub. |
| **Column Lineage** | `ColumnLineage` | Lineage between columns in datasets, showing how individual columns are transformed across datasets. |
| **Dataset Lineage** | `UpstreamLineage` | Lineage between datasets, tracking the flow and transformations between different datasets. |
| **Ownership (Dataset)** | `Ownership` | Ownership information for datasets, representing the technical owner in DataHub’s `Ownership` aspect. |
| **Glossary Terms** | `GlossaryTerms` | Business terms associated with datasets, providing context. Mapped as `GlossaryTerms` in DataHub. |
| **Schema Metadata** | `SchemaMetadata` | Schema details (columns, data types) for datasets. Mapped to DataHub’s `SchemaMetadata` aspect. |
| **SQL Transformations** | `Dataset` (with lineage) | SQL queries in Dremio that transform datasets. Represented as `Dataset` in DataHub, with lineage showing dependency. |
| **Queries** | `Query` (if mapped) | Historical SQL queries executed on Dremio datasets. These can be tracked for audit purposes in DataHub. |
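For orientation, the sketch below shows how one row of this mapping could be expressed with DataHub's Python URN builder. The platform name `dremio` and the dot-joined `space.folder.dataset` naming are illustrative assumptions, not guarantees about how the connector names datasets.

```python
# Minimal sketch: how a Dremio virtual dataset might be addressed in DataHub.
# Assumes the DataHub Python SDK (acryl-datahub) is installed and that the
# connector names datasets as a dot-joined path; both are illustrative guesses.
from datahub.emitter.mce_builder import make_dataset_urn

# A virtual dataset living in space "marketing", folder "reports".
dremio_path = ["marketing", "reports", "campaign_summary"]
dataset_urn = make_dataset_urn(
    platform="dremio",
    name=".".join(dremio_path),
    env="PROD",
)
print(dataset_urn)
# e.g. urn:li:dataset:(urn:li:dataPlatform:dremio,marketing.reports.campaign_summary,PROD)
```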
17 changes: 3 additions & 14 deletions metadata-ingestion/docs/sources/dremio/dremio_pre.md
@@ -22,24 +22,13 @@ The API token should have the necessary permissions to **read metadata** and **r
- Log in to your Dremio instance.
- Navigate to your user profile in the top-right corner.
- Select **Generate API Token** to create an API token for programmatic access.
- Ensure that the API token has sufficient permissions to access datasets, spaces, sources, and lineage.

2. **Identify the API Endpoint**:

   - The Dremio API endpoint typically follows this format:
     `https://<your-dremio-instance>/api/v3/`
   - This endpoint is used to query metadata and lineage information.

3. **Get the Space, Folder, and Dataset Details**:

   - To identify specific datasets or containers (spaces, folders, sources), navigate to the Dremio web interface.
   - Explore the **Spaces** and **Sources** sections to identify the datasets you need to retrieve metadata for.

4. **Permissions**:

   - The token should have **read-only** or **admin** permissions that allow it to:
     - View all datasets (physical and virtual).
     - Access all spaces, folders, and sources.
     - Retrieve dataset and column-level lineage information.

5. **Verify External Data Source Permissions**:

   - If Dremio is connected to external data sources (e.g., AWS S3, relational databases), ensure that Dremio has access to the credentials required for querying those sources.

Ensure your API token has the correct permissions to interact with the Dremio metadata; a quick way to verify this is sketched below.
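As a quick sanity check that the token can actually read the catalog (spaces, sources, folders), something like the following can be used. The `/api/v3/catalog` path and the `Bearer` authorization header are assumptions based on the endpoint format above; adjust both to match your Dremio edition and its API documentation.

```python
# Hedged sketch: verify that an API token can list the Dremio catalog.
# The endpoint path and auth-header style are assumptions; check your
# Dremio edition's API docs if this request is rejected.
import requests

DREMIO_URL = "https://<your-dremio-instance>"  # replace with your host
API_TOKEN = "<your-api-token>"                 # replace with a real token

resp = requests.get(
    f"{DREMIO_URL}/api/v3/catalog",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for entry in resp.json().get("data", []):
    # Each top-level entry is typically a space or source visible to the token.
    print(entry.get("containerType") or entry.get("type"), entry.get("path"))
```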
6 changes: 5 additions & 1 deletion metadata-ingestion/docs/sources/dremio/dremio_recipe.yml
@@ -6,11 +6,15 @@ source:
port: 9047
tls: true

# Credentials with basic auth
authentication_method: password
username: user
password: pass

# Credentials with personal access token
authentication_method: PAT
password: pass

include_query_lineage: True

source_mappings:
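The recipe above lists both authentication blocks (basic auth and PAT) side by side; an actual recipe should contain only one of them. Below is a minimal, hedged sketch of running such a recipe programmatically with DataHub's Python API; the `datahub-rest` sink at `http://localhost:8080` is an assumption for illustration. The equivalent CLI invocation is `datahub ingest -c dremio_recipe.yml`.

```python
# Minimal sketch: run a Dremio ingestion recipe from Python instead of the CLI.
# The sink address is an assumption; point it at your DataHub instance.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dremio",
            "config": {
                "hostname": "localhost",
                "port": 9047,
                "tls": True,
                # Use exactly one authentication block: basic auth ...
                # "authentication_method": "password",
                # "username": "user",
                # "password": "pass",
                # ... or a personal access token.
                "authentication_method": "PAT",
                "password": "<personal-access-token>",
                "include_query_lineage": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```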
@@ -374,6 +374,26 @@ def community_get_formatted_tables(

return dataset_list

def get_pattern_condition(
self, patterns: Union[str, List[str]], field: str, allow: bool = True
) -> str:
if not patterns:
return ""

if isinstance(patterns, str):
patterns = [patterns.upper()]

if ".*" in patterns and allow:
return ""

patterns = [p.upper() for p in patterns if p != ".*"]
if not patterns:
return ""

operator = "REGEXP_LIKE" if allow else "NOT REGEXP_LIKE"
pattern_str = "|".join(f"({p})" for p in patterns)
return f"AND {operator}({field}, '{pattern_str}')"

def get_all_tables_and_columns(self, containers: Deque) -> List[Dict]:
if self.edition == DremioEdition.ENTERPRISE:
query_template = DremioSQLQueries.QUERY_DATASETS_EE
@@ -382,37 +402,19 @@ def get_all_tables_and_columns(self, containers: Deque) -> List[Dict]:
else:
query_template = DremioSQLQueries.QUERY_DATASETS_CE

schema_field = "CONCAT(REPLACE(REPLACE(REPLACE(UPPER(TABLE_SCHEMA), ', ', '.'), '[', ''), ']', ''))"
table_field = "UPPER(TABLE_NAME)"

schema_condition = self.get_pattern_condition(
    self.allow_schema_pattern, schema_field
)
table_condition = self.get_pattern_condition(
    self.allow_dataset_pattern, table_field
)
deny_schema_condition = self.get_pattern_condition(
    self.deny_schema_pattern, schema_field, allow=False
)
deny_table_condition = self.get_pattern_condition(
    self.deny_dataset_pattern, table_field, allow=False
)

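This commit promotes `get_pattern_condition` from a nested helper inside `get_all_tables_and_columns` to a method on the API class, and the call sites now go through `self.get_pattern_condition`. To make the behaviour concrete, here is a self-contained sketch that copies the method's logic into a standalone function (purely for illustration) and prints the SQL fragments it produces for a few allow/deny patterns.

```python
# Standalone copy of the get_pattern_condition logic shown above, for
# illustration only; the real method lives on the Dremio API class and is
# called with its allow_/deny_ pattern settings.
from typing import List, Union


def get_pattern_condition(
    patterns: Union[str, List[str]], field: str, allow: bool = True
) -> str:
    if not patterns:
        return ""
    if isinstance(patterns, str):
        patterns = [patterns.upper()]
    # A bare ".*" allow-pattern matches everything, so no filter is needed.
    if ".*" in patterns and allow:
        return ""
    patterns = [p.upper() for p in patterns if p != ".*"]
    if not patterns:
        return ""
    operator = "REGEXP_LIKE" if allow else "NOT REGEXP_LIKE"
    pattern_str = "|".join(f"({p})" for p in patterns)
    return f"AND {operator}({field}, '{pattern_str}')"


table_field = "UPPER(TABLE_NAME)"
print(get_pattern_condition(["sales.*", "finance.*"], table_field))
# -> AND REGEXP_LIKE(UPPER(TABLE_NAME), '(SALES.*)|(FINANCE.*)')
print(get_pattern_condition(["tmp_.*"], table_field, allow=False))
# -> AND NOT REGEXP_LIKE(UPPER(TABLE_NAME), '(TMP_.*)')
print(get_pattern_condition([".*"], table_field))
# -> "" (allow-all: no condition is appended)
```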
@@ -33,7 +33,7 @@ class DremioConnectionConfig(ConfigModel):
)

authentication_method: Optional[str] = Field(
default="password",
default="PAT",
description="Authentication method: 'password' or 'PAT' (Personal Access Token)",
)

Expand Down
@@ -47,7 +47,6 @@
)
from datahub.ingestion.source.dremio.dremio_profiling import (
DremioProfiler,
)
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalHandler,
@@ -138,7 +137,7 @@ def __init__(self, config: DremioSourceConfig, ctx: Any):
self.dremio_catalog = DremioCatalog(dremio_api)

# Initialize profiler
profile_config = self.config.profiling
self.profiler = DremioProfiler(dremio_api, profile_config)

# Initialize aspects
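The profiler is now constructed from `self.config.profiling` rather than a fresh `ProfileConfig()`, and the now-unused `ProfileConfig` import is dropped, so profiling settings supplied in the recipe actually reach `DremioProfiler`. A hedged sketch of such a source-config fragment follows; the key names inside `profiling` are assumptions about `ProfileConfig`'s fields rather than something this diff confirms.

```python
# Hedged sketch: a source config fragment whose "profiling" section now reaches
# DremioProfiler via self.config.profiling. Key names inside "profiling" are
# assumptions for illustration, not confirmed by this diff.
dremio_source_config = {
    "hostname": "localhost",
    "port": 9047,
    "tls": True,
    "authentication_method": "PAT",
    "password": "<personal-access-token>",
    "profiling": {
        "enabled": True,                    # assumed flag: turn profiling on
        "profile_table_level_only": False,  # assumed flag: include column stats
    },
}

print(dremio_source_config["profiling"])
```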
