Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added storage for direct filesystem references in code #2526

Merged
merged 107 commits into from
Sep 16, 2024
Merged
Show file tree
Hide file tree
Changes from 95 commits
Commits
Show all changes
107 commits
Select commit Hold shift + click to select a range
18950f1
add support for sql functional tests
ericvergnaud Aug 30, 2024
224796d
disable
ericvergnaud Aug 30, 2024
b63cb19
more functional tests
ericvergnaud Aug 30, 2024
6d911ca
move test to functional
ericvergnaud Aug 30, 2024
5f07dc8
formatting
ericvergnaud Aug 30, 2024
a8f3ecd
formatting
ericvergnaud Aug 30, 2024
f0de866
Merge branch 'main' into support-sql-notebooks-in-functional-tests
ericvergnaud Aug 30, 2024
505402f
enhance dbfs linting to all direct file system access
ericvergnaud Sep 2, 2024
992ffe7
use dfsa for pyspark
ericvergnaud Sep 2, 2024
830b026
fix duplicate advice
ericvergnaud Sep 2, 2024
a1e15e7
fix functional tests
ericvergnaud Sep 2, 2024
eb56636
formatting
ericvergnaud Sep 2, 2024
b8e6b82
move python parsing code under dedicated package
ericvergnaud Sep 2, 2024
472c417
move PythnCodeAnalyzer to dedicated file
ericvergnaud Sep 2, 2024
7345d67
Merge branch 'refactor-python-code-analysis' into lint-direct-file-sy…
ericvergnaud Sep 3, 2024
3e256be
fix merge issues
ericvergnaud Sep 3, 2024
d1d251f
merge from stale branch
ericvergnaud Sep 2, 2024
13ea1e6
more tests
ericvergnaud Sep 2, 2024
93d194f
merge from stale branch
ericvergnaud Sep 2, 2024
f4bc0b8
merge from stale branch
ericvergnaud Sep 3, 2024
7ae9adc
fix failing tests
ericvergnaud Sep 3, 2024
fdf7a39
rename ctor arg
ericvergnaud Sep 3, 2024
054f847
fix infinite recursion with unknown ASTs
ericvergnaud Sep 3, 2024
9c4a5bf
fix infinite recursion with unknown ASTs
ericvergnaud Sep 3, 2024
6889852
make register_magic_command a decorator
ericvergnaud Sep 3, 2024
7792199
formatting
ericvergnaud Sep 3, 2024
b4ba5ae
integrate with WorkflowLinter
ericvergnaud Sep 3, 2024
2646acc
fix failing tests
ericvergnaud Sep 3, 2024
361aa2e
Merge branch 'lint-direct-file-system-access' into store-dfsa-records
ericvergnaud Sep 4, 2024
a5416c0
finalize integration
ericvergnaud Sep 4, 2024
c94ee81
Merge branch 'refactor-python-code-analysis' into lint-direct-file-sy…
ericvergnaud Sep 4, 2024
4a30a79
Merge branch 'lint-direct-file-system-access' into store-dfsa-records
ericvergnaud Sep 4, 2024
bba91c9
add logs
ericvergnaud Sep 4, 2024
a2f66c4
Merge branch 'main' into refactor-python-code-analysis
ericvergnaud Sep 4, 2024
2da934e
enhance integration test or checking stored DFSAs
ericvergnaud Sep 4, 2024
353871f
Merge branch 'refactor-python-code-analysis' into lint-direct-file-sy…
ericvergnaud Sep 4, 2024
e1329a6
Merge branch 'lint-direct-file-system-access' into store-dfsa-records
ericvergnaud Sep 4, 2024
5c444ff
move 'magic'-related stuff to dedicated file
ericvergnaud Sep 4, 2024
4d14367
Merge branch 'main' into refactor-python-code-analysis
ericvergnaud Sep 4, 2024
7f93d10
formatting
ericvergnaud Sep 4, 2024
6258908
Merge branch 'refactor-python-code-analysis' into store-dfsa-records
ericvergnaud Sep 4, 2024
7192e85
fix failing tests
ericvergnaud Sep 4, 2024
46cd358
Merge branch 'refactor-python-code-analysis' into lint-direct-file-sy…
ericvergnaud Sep 4, 2024
2652e00
Merge branch 'lint-direct-file-system-access' into store-dfsa-records
ericvergnaud Sep 4, 2024
d4be072
formatting
ericvergnaud Sep 4, 2024
0f3941a
Merge branch 'main' into refactor-python-code-analysis
ericvergnaud Sep 5, 2024
e552e3e
rename dfsa -> directfs
ericvergnaud Sep 5, 2024
e30ccfa
improve naming and drop /tmp/ exclusion
ericvergnaud Sep 5, 2024
2fd1d49
Merge branch 'main' into lint-direct-file-system-access
ericvergnaud Sep 5, 2024
4c48951
Update docs
ericvergnaud Sep 5, 2024
6e07c9b
move to functional test
ericvergnaud Sep 5, 2024
cd3b115
update docs
ericvergnaud Sep 5, 2024
c3ee620
Merge branch 'refactor-python-code-analysis' into lint-direct-file-sy…
ericvergnaud Sep 5, 2024
26096bd
Merge branch 'lint-direct-file-system-access' into store-dfsa-records
ericvergnaud Sep 5, 2024
8fea3eb
improve naming and comments
ericvergnaud Sep 5, 2024
8ee8a77
Merge branch 'lint-direct-file-system-access' into store-dfsa-records
ericvergnaud Sep 5, 2024
64de1e0
fix failing test
ericvergnaud Sep 5, 2024
de0c4f7
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 6, 2024
6ce8f86
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 6, 2024
5b980df
store DFSAs for paths and queries in dedicated tables
ericvergnaud Sep 6, 2024
cdcc3e1
support lineage when walking dependency graph
ericvergnaud Sep 6, 2024
1515342
sore dfsa lineage
ericvergnaud Sep 6, 2024
ca058ed
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 6, 2024
1c88c97
fix merge issues
ericvergnaud Sep 6, 2024
ef261a6
capture and store source_timestamp
ericvergnaud Sep 6, 2024
d951286
simplify
ericvergnaud Sep 6, 2024
ea74ba6
capture and store job/task infos
ericvergnaud Sep 6, 2024
e75e07d
simplify
ericvergnaud Sep 6, 2024
e62fd18
capture and store assessment start/stop, also drop source_type
ericvergnaud Sep 6, 2024
b58f47d
drop mock_dfsa_crawlers
ericvergnaud Sep 6, 2024
5a5d4ff
rename _backend -> _sql_backend
ericvergnaud Sep 6, 2024
404f6cd
rename _backend -> _sql_backend
ericvergnaud Sep 6, 2024
cb6e45d
hdfs is irrelevant, replace with dbfs
ericvergnaud Sep 6, 2024
733decc
drop mock of DirectFsAccessCrawlers
ericvergnaud Sep 9, 2024
6338af6
gather and store dfsas from refresh_report
ericvergnaud Sep 9, 2024
166e34c
prevent pylint warning
ericvergnaud Sep 9, 2024
8bc3c39
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 9, 2024
0249310
fix failing tests
ericvergnaud Sep 9, 2024
b605b77
formatting
ericvergnaud Sep 9, 2024
7f9fa06
fix failing tests
ericvergnaud Sep 9, 2024
f7108cb
fix failing tests
ericvergnaud Sep 9, 2024
abcab87
fix failing tests
ericvergnaud Sep 9, 2024
eb603e4
catch infinite recursion
ericvergnaud Sep 10, 2024
837fe6e
drop legacy code
ericvergnaud Sep 10, 2024
e80b6f0
Revert "catch infinite recursion"
ericvergnaud Sep 10, 2024
f8cdc22
Use structured lineage for DependencyGraph (#2556)
ericvergnaud Sep 10, 2024
9d4bc86
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 11, 2024
697a85e
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 11, 2024
8cb4ac0
fix merge issues
ericvergnaud Sep 11, 2024
185d060
Update src/databricks/labs/ucx/source_code/base.py
ericvergnaud Sep 12, 2024
b09a549
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 12, 2024
1742415
refactor DirectFsAccess
ericvergnaud Sep 12, 2024
c5987bb
add view
ericvergnaud Sep 12, 2024
b6fbeb4
fix failing tests
ericvergnaud Sep 12, 2024
cc194b6
formatting
ericvergnaud Sep 12, 2024
a5ada24
install added table
ericvergnaud Sep 12, 2024
4e45840
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 12, 2024
2ce8e42
address verbal comments from @asnare
ericvergnaud Sep 12, 2024
f133b97
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 13, 2024
145fbae
rename table and drop unused view
ericvergnaud Sep 16, 2024
810e356
rename method that is not yet in line with new crawler design
ericvergnaud Sep 16, 2024
64b8a84
Merge branch 'main' into store-dfsa-records
ericvergnaud Sep 16, 2024
9f24f97
rename method that is not yet in line with new crawler design
ericvergnaud Sep 16, 2024
b93b565
Update src/databricks/labs/ucx/source_code/jobs.py
ericvergnaud Sep 16, 2024
3f846a8
document design decision
ericvergnaud Sep 16, 2024
67f9d68
simplify
ericvergnaud Sep 16, 2024
506a761
Merge branch 'store-dfsa-records' of github.com:databrickslabs/ucx in…
ericvergnaud Sep 16, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions src/databricks/labs/ucx/contexts/application.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
from databricks.labs.ucx.recon.metadata_retriever import DatabricksTableMetadataRetriever
from databricks.labs.ucx.recon.migration_recon import MigrationRecon
from databricks.labs.ucx.recon.schema_comparator import StandardSchemaComparator
from databricks.labs.ucx.source_code.directfs_access import DirectFsAccessCrawlers
from databricks.labs.ucx.source_code.python_libraries import PythonLibraryResolver
from databricks.sdk import AccountClient, WorkspaceClient, core
from databricks.sdk.errors import ResourceDoesNotExist
Expand Down Expand Up @@ -425,9 +426,14 @@ def workflow_linter(self):
self.dependency_resolver,
self.path_lookup,
MigrationIndex([]), # TODO: bring back self.tables_migrator.index()
self.directfs_access_crawlers,
self.config.include_job_ids,
)

@cached_property
def directfs_access_crawlers(self):
return DirectFsAccessCrawlers(self.sql_backend, self.inventory_database)

@cached_property
def redash(self):
return Redash(
Expand Down
2 changes: 1 addition & 1 deletion src/databricks/labs/ucx/mixins/fixtures.py
Original file line number Diff line number Diff line change
Expand Up @@ -831,7 +831,7 @@ def create(notebook_path: str | Path | None = None, **kwargs):

job = ws.jobs.create(**kwargs)
logger.info(f"Job: {ws.config.host}#job/{job.job_id}")
return job
return ws.jobs.get(job.job_id)

yield from factory("job", create, lambda item: ws.jobs.delete(item.job_id))

Expand Down
7 changes: 7 additions & 0 deletions src/databricks/labs/ucx/queries/views/direct_fs_access.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
SELECT
*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as a good practice of views, please specify explicit column names

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… especially for UNION where the order of columns is important.

(People often assume that schema changes which modify the column order are forward and backward compatible and don't realise UNIONs will break. The only solution is to defensively enumerate columns explicitly when performing a UNION.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped from this PR

FROM direct_file_system_access_in_paths
UNION
SELECT
*
FROM direct_file_system_access_in_queries
1 change: 1 addition & 0 deletions src/databricks/labs/ucx/source_code/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
from databricks.sdk.service.workspace import Language

from databricks.labs.blueprint.paths import WorkspacePath

from databricks.labs.ucx.source_code.python.python_ast import Tree

# Code mapping between LSP, PyLint, and our own diagnostics:
Expand Down
136 changes: 136 additions & 0 deletions src/databricks/labs/ucx/source_code/directfs_access.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
from __future__ import annotations


import logging
from collections.abc import Sequence, Iterable
from dataclasses import dataclass, field
from datetime import datetime

from databricks.labs.ucx.framework.crawlers import CrawlerBase, Result
from databricks.labs.lsql.backends import SqlBackend
from databricks.sdk.errors import DatabricksError

logger = logging.getLogger(__name__)


@dataclass
class LineageAtom:

object_type: str
object_id: str
other: dict[str, str] | None = None


@dataclass
class DirectFsAccess:
"""A record describing a Direct File System Access"""

UNKNOWN = "unknown"

path: str
is_read: bool
is_write: bool
source_id: str = UNKNOWN
source_timestamp: datetime = datetime.fromtimestamp(0)
source_lineage: list[LineageAtom] = field(default_factory=list)
job_id: int = -1
job_name: str = UNKNOWN
task_key: str = UNKNOWN
assessment_start_timestamp: datetime = datetime.fromtimestamp(0)
assessment_end_timestamp: datetime = datetime.fromtimestamp(0)
asnare marked this conversation as resolved.
Show resolved Hide resolved

def replace_source(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a follow-up PR, please use dataclasses.replace

self,
source_id: str | None = None,
source_lineage: list[LineageAtom] | None = None,
source_timestamp: datetime | None = None,
):
return DirectFsAccess(
path=self.path,
is_read=self.is_read,
is_write=self.is_write,
source_id=source_id or self.source_id,
source_timestamp=source_timestamp or self.source_timestamp,
source_lineage=source_lineage or self.source_lineage,
job_id=self.job_id,
job_name=self.job_name,
task_key=self.task_key,
assessment_start_timestamp=self.assessment_start_timestamp,
assessment_end_timestamp=self.assessment_start_timestamp,
)

def replace_job_infos(
self,
job_id: int | None = None,
job_name: str | None = None,
task_key: str | None = None,
):
return DirectFsAccess(
path=self.path,
is_read=self.is_read,
is_write=self.is_write,
source_id=self.source_id,
source_timestamp=self.source_timestamp,
source_lineage=self.source_lineage,
job_id=job_id or self.job_id,
job_name=job_name or self.job_name,
task_key=task_key or self.task_key,
assessment_start_timestamp=self.assessment_start_timestamp,
assessment_end_timestamp=self.assessment_start_timestamp,
)

def replace_assessment_infos(
self, assessment_start: datetime | None = None, assessment_end: datetime | None = None
):
return DirectFsAccess(
path=self.path,
is_read=self.is_read,
is_write=self.is_write,
source_id=self.source_id,
source_timestamp=self.source_timestamp,
source_lineage=self.source_lineage,
job_id=self.job_id,
job_name=self.job_name,
task_key=self.task_key,
assessment_start_timestamp=assessment_start or self.assessment_start_timestamp,
assessment_end_timestamp=assessment_end or self.assessment_start_timestamp,
)


class _DirectFsAccessCrawler(CrawlerBase):
asnare marked this conversation as resolved.
Show resolved Hide resolved

def __init__(self, backend: SqlBackend, schema: str, table: str):
"""
Initializes a DFSACrawler instance.

Args:
sql_backend (SqlBackend): The SQL Execution Backend abstraction (either REST API or Spark)
schema: The schema name for the inventory persistence.
"""
super().__init__(backend, "hive_metastore", schema, table, DirectFsAccess)

def append(self, dfsas: Sequence[DirectFsAccess]):
try:
self._append_records(dfsas)
except DatabricksError as e:
logger.error("Failed to store DFSAs", exc_info=e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of (callback) interface isn't supposed to be how crawlers gather and write their records: the records are written as a single operation at the end.

If you want to accumulate the records, pass a list (or builder of some sort) through for the linter to accumulate the records as it runs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asnare @nfx as discussed, this is acceptable for now because WorkflowLinter simultaneously collects JobProblems and DFSAs, thus avoiding duplicate i/o and parsing.
What we need is a MultiCrawler such that we can simultaneously collect data stored in different tables. Until we have that, the proposed code is acceptable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a crawler that is happy with existing empty snapshots


def _try_fetch(self) -> Iterable[DirectFsAccess]:
sql = f"SELECT * FROM {self.full_name}"
yield from self._backend.fetch(sql)

def _crawl(self) -> Iterable[Result]:
asnare marked this conversation as resolved.
Show resolved Hide resolved
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the linting process should be invoked and start from, and it should return the records that are going to be written into the table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works fine if the process is dedicated, but for jobs, the process collects simultaneously JobProblem's and DirectFsAccess's...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment



class DirectFsAccessCrawlers:

def __init__(self, sql_backend: SqlBackend, schema: str):
self._sql_backend = sql_backend
self._schema = schema

def for_paths(self) -> _DirectFsAccessCrawler:
return _DirectFsAccessCrawler(self._sql_backend, self._schema, "direct_file_system_access_in_paths")

def for_queries(self) -> _DirectFsAccessCrawler:
return _DirectFsAccessCrawler(self._sql_backend, self._schema, "direct_file_system_access_in_queries")
16 changes: 15 additions & 1 deletion src/databricks/labs/ucx/source_code/graph.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from __future__ import annotations

import abc
import itertools
import logging
from dataclasses import dataclass
from pathlib import Path
Expand All @@ -11,6 +12,7 @@
NodeNG,
)
from databricks.labs.ucx.source_code.base import Advisory, CurrentSessionState, is_a_notebook
from databricks.labs.ucx.source_code.directfs_access import LineageAtom
from databricks.labs.ucx.source_code.python.python_ast import Tree
from databricks.labs.ucx.source_code.path_lookup import PathLookup

Expand Down Expand Up @@ -304,7 +306,7 @@ class DependencyGraphContext:
session_state: CurrentSessionState


class Dependency(abc.ABC):
class Dependency:

def __init__(self, loader: DependencyLoader, path: Path, inherits_context=True):
self._loader = loader
Expand All @@ -331,6 +333,10 @@ def load(self, path_lookup: PathLookup) -> SourceContainer | None:
def __repr__(self):
return f"Dependency<{self.path}>"

@property
def lineage(self) -> list[LineageAtom]:
return [LineageAtom("path", str(self.path))]


class SourceContainer(abc.ABC):

Expand Down Expand Up @@ -608,6 +614,7 @@ def __init__(self, graph: DependencyGraph, walked_paths: set[Path], path_lookup:
self._graph = graph
self._walked_paths = walked_paths
self._path_lookup = path_lookup
self._lineage: list[Dependency] = []

def __iter__(self) -> Iterator[T]:
for dependency in self._graph.root_dependencies:
Expand All @@ -619,6 +626,7 @@ def __iter__(self) -> Iterator[T]:
def _iter_one(self, dependency: Dependency, graph: DependencyGraph, root_path: Path) -> Iterable[T]:
if dependency.path in self._walked_paths:
return
self._lineage.append(dependency)
self._walked_paths.add(dependency.path)
self._log_walk_one(dependency)
if dependency.path.is_file() or is_a_notebook(dependency.path):
Expand All @@ -631,6 +639,7 @@ def _iter_one(self, dependency: Dependency, graph: DependencyGraph, root_path: P
child_graph = maybe_graph.graph
for child_dependency in child_graph.local_dependencies:
yield from self._iter_one(child_dependency, child_graph, root_path)
self._lineage.pop()

def _log_walk_one(self, dependency: Dependency):
logger.debug(f'Analyzing dependency: {dependency}')
Expand All @@ -639,3 +648,8 @@ def _log_walk_one(self, dependency: Dependency):
def _process_dependency(
self, dependency: Dependency, path_lookup: PathLookup, inherited_tree: Tree | None
) -> Iterable[T]: ...

@property
def lineage(self) -> list[LineageAtom]:
lists: list[list[LineageAtom]] = [dependency.lineage for dependency in self._lineage]
return list(itertools.chain(*lists))
Loading
Loading