
Added storage for direct filesystem references in code #2526

Merged
107 commits merged into main from store-dfsa-records on Sep 16, 2024
Conversation

@ericvergnaud (Contributor) commented on Sep 3, 2024:

Changes

In addition to linting, this change collects direct filesystem access (DFSA) records and stores them.

Linked issues

Progresses #2350

Functionality

  • added a new table directfs_in_paths

Tests

  • added unit tests
  • updated integration tests
  • manually tested schema upgrade:

(Screenshot: manual schema-upgrade verification, 2024-09-13 at 16:00)
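
For context, a minimal sketch of what one stored DFSA record could look like. Only the two assessment timestamp fields are visible in the diffs below; every other field name here is an illustrative assumption, not the actual schema of directfs_in_paths:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class DirectFsAccess:
    """One row of the directfs_in_paths inventory table (sketch)."""
    path: str = ""          # e.g. "dbfs:/mnt/data/file.csv" -- assumed field
    is_read: bool = False   # assumed field
    is_write: bool = False  # assumed field
    assessment_start_timestamp: datetime = datetime.fromtimestamp(0)
    assessment_end_timestamp: datetime = datetime.fromtimestamp(0)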

@@ -120,6 +121,9 @@ def deploy_schema(sql_backend: SqlBackend, inventory_schema: str):
functools.partial(table, "udfs", Udf),
functools.partial(table, "logs", LogRecord),
functools.partial(table, "recon_results", ReconResult),
functools.partial(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding direct_file_system_access_in_queries table will be done in PR #2599

@@ -128,6 +132,7 @@ def deploy_schema(sql_backend: SqlBackend, inventory_schema: str):
    deployer.deploy_view("misc_patterns", "queries/views/misc_patterns.sql")
    deployer.deploy_view("code_patterns", "queries/views/code_patterns.sql")
    deployer.deploy_view("reconciliation_results", "queries/views/reconciliation_results.sql")
    # direct_file_system_access view will be added in upcoming PR

@ericvergnaud (author): Adding the direct_file_system_access view will be done in PR #2599.

@ericvergnaud requested a review from @nfx on September 13, 2024.
unix_time = 0.0
if isinstance(path, WorkspacePath):
    # TODO add stats method in blueprint, see https://github.com/databrickslabs/blueprint/issues/142
    # pylint: disable=protected-access
    unix_time += float(path._object_info.modified_at) / 1000.0 if path._object_info.modified_at else 0.0
elif isinstance(path, DBFSPath):
    # TODO add stats method in blueprint, see https://github.com/databrickslabs/blueprint/issues/143
    # pylint: disable=protected-access

@ericvergnaud (author, on both protected-access workarounds): will go away with databrickslabs/blueprint#144
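
For context, the modified_at values above are epoch milliseconds, hence the division by 1000.0. A standalone sketch of the conversion, independent of the blueprint internals (the helper name is ours, not the library's):

from datetime import datetime, timezone

def to_unix_seconds(modified_at_ms: int | None) -> float:
    """Epoch milliseconds (or None) -> epoch seconds, mirroring the logic above."""
    return float(modified_at_ms) / 1000.0 if modified_at_ms else 0.0

print(datetime.fromtimestamp(to_unix_seconds(1_726_495_200_000), tz=timezone.utc))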

@nfx (Collaborator) left a comment: waiting for final reply from @asnare

@@ -120,6 +121,9 @@ def deploy_schema(sql_backend: SqlBackend, inventory_schema: str):
functools.partial(table, "udfs", Udf),
functools.partial(table, "logs", LogRecord),
functools.partial(table, "recon_results", ReconResult),
functools.partial(
table, "direct_file_system_access_in_paths", DirectFsAccess
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
table, "direct_file_system_access_in_paths", DirectFsAccess
table, "directfs_in_workspace", DirectFsAccess

please pick a shorter name, as it's abnormally longer than others

@ericvergnaud (author): done, sticking to `...paths` because queries also belong to the workspace.

@@ -0,0 +1,7 @@
SELECT
*
@nfx: nit: as a good practice for views, please specify explicit column names.

@asnare (Contributor): … especially for UNION, where the order of columns is important.

(People often assume that schema changes which modify the column order are forward and backward compatible and don't realise UNIONs will break. The only solution is to defensively enumerate columns explicitly when performing a UNION.)
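
A minimal illustration of the point, using the directfs_in_paths table from this PR and the directfs_in_queries table added later in #2599; the column names are hypothetical, not the actual schema:

-- With SELECT *, reordering columns in either table silently misaligns the UNION.
-- Enumerating the columns keeps the view stable across schema changes:
SELECT path, is_read, is_write FROM inventory.directfs_in_paths
UNION ALL
SELECT path, is_read, is_write FROM inventory.directfs_in_queries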

@ericvergnaud (author): dropped from this PR.

@asnare (Contributor) left a comment: Nothing major; mainly some clarifying remarks/questions.

(Outdated review threads on src/databricks/labs/ucx/source_code/jobs.py and src/databricks/labs/ucx/source_code/directfs_access.py were resolved.)

@nfx (Collaborator) left a comment: lgtm

    assessment_start_timestamp: datetime = datetime.fromtimestamp(0)
    assessment_end_timestamp: datetime = datetime.fromtimestamp(0)

    def replace_source(

@nfx: in a follow-up PR, please use dataclasses.replace
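
A self-contained sketch of that suggestion: dataclasses.replace builds an updated copy of a (frozen) dataclass without a hand-written replace_source method. The Record class below is a stand-in with illustrative fields, not the actual DirectFsAccess:

import dataclasses
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Record:
    source_id: str
    assessment_start_timestamp: datetime = datetime.fromtimestamp(0)

r1 = Record(source_id="notebook/foo")
r2 = dataclasses.replace(r1, assessment_start_timestamp=datetime.now())
assert r1.source_id == r2.source_id  # untouched fields carry over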

@nfx merged commit 60e77e0 into main on Sep 16, 2024 (5 of 6 checks passed).
@nfx deleted the store-dfsa-records branch on September 16, 2024 at 11:39.
nfx added a commit that referenced this pull request Sep 19, 2024
* Added ability to run create-missing-principals command as collection ([#2675](#2675)). This release introduces the capability to run the `create-missing-principals` command as a collection in the UCX tool with the new optional flag `run-as-collection`. This allows for more control and flexibility when managing cloud resources, particularly in handling multiple workspaces. The existing `create-missing-principals` command has been modified to accept a new `run_as_collection` parameter, enabling the command to run on multiple workspaces when set to True. The function has been updated to handle a list of `WorkspaceContext` objects, allowing it to iterate over each object and execute necessary actions for each workspace. Additionally, a new `AccountClient` parameter has been added to facilitate the retrieval of all workspaces associated with a specific account. New test functions have been added to `test_cli.py` to test this new functionality on AWS and Azure cloud providers. The `acc_client` argument has been added to the test functions to enable running the tests with an authenticated AWS or Azure client, and the `MockPrompts` object is used to simulate user responses to the prompts displayed during the execution of the command.
* Added storage for direct filesystem references in code ([#2526](#2526)). The open-source library has been updated with a new table `directfs_in_paths` to store Direct File System Access (DFSA) records, extending support for managing and collecting DFSAs as part of addressing issues [#2350](#2350) and [#2526](#2526). The changes include a new class `DirectFsAccessCrawlers` and methods for handling DFSAs, as well as linting, testing, and a manually verified schema upgrade. Additionally, a new SQL query deprecates the use of direct filesystem references. The commit is co-authored by Eric Vergnaud, Serge Smertin, and Andrew Snare.
* Added task for linting queries ([#2630](#2630)). This commit introduces a new `QueryLinter` class for linting SQL queries in the workspace, similar to the existing `WorkflowLinter` for jobs. The `QueryLinter` checks for any issues in dashboard queries and reports them in a new `query_problems` table. The commit also includes the addition of unit tests, integration tests, and manual testing of the schema upgrade. The `QueryLinter` method has been updated to include a `TableMigrationIndex` object, which is currently set to an empty list and will be updated in a future commit. This change improves the quality of the codebase by ensuring that all SQL queries are properly linted and any issues are reported, allowing for better maintenance and development of the system. The commit is co-authored by multiple developers, including Eric Vergnaud, Serge Smertin, Andrew Snare, and Cor. Additionally, a new linting rule, "direct-filesystem-access", has been introduced to deprecate the use of direct filesystem references in favor of more abstracted file access methods in the project's codebase.
* Adopt `databricks-labs-pytester` PyPI package ([#2663](#2663)). In this release, we have made updates to the `pyproject.toml` file, removing the `pytest` package version 8.1.0 and updating it to 8.3.3. We have also added the `databricks-labs-pytester` package with a minimum version of 0.2.1. This update also includes the adoption of the `databricks-labs-pytester` PyPI package, which moves fixture usage from `mixins.fixtures` into its own top-level library. This affects various test files, including `test_jobs.py`, by replacing the `get_purge_suffix` fixture with `watchdog_purge_suffix` to standardize the approach to creating and managing temporary directories and files used in tests. Additionally, new fixtures have been introduced in a separate PR for testing the `databricks.labs.ucx` package, including `debug_env_name`, `product_info`, `inventory_schema`, `make_lakeview_dashboard`, `make_dashboard`, `make_dbfs_data_copy`, `make_mounted_location`, `make_storage_dir`, `sql_exec`, and `migrated_group`. These fixtures simplify the testing process by providing preconfigured resources that can be used in the tests. The `redash.py` file has been removed from the `databricks/labs/ucx/mixins` directory as the Redash API is being deprecated and replaced with a new library.
* Assessment: crawl UDFs as a task in parallel to tables instead of implicitly during grants ([#2642](#2642)). This release introduces changes to the assessment workflow, specifically in how User Defined Functions (UDFs) are crawled/scanned. Previously, UDFs were crawled/scanned implicitly by the GrantsCrawler, which requested a snapshot from the UDFSCrawler that hadn't executed yet. With this update, UDFs are now crawled/scanned as their own task, running in parallel with tables before grants crawling begins. This modification addresses issue [#2574](#2574), which requires grants and UDFs to be refreshable but only once within a given workflow run. A new method, crawl_udfs, has been introduced to iterate over all UDFs in the Hive Metastore of the current workspace and persist their metadata in a table named $inventory_database.udfs. This inventory is utilized when scanning securable objects for issues with grants that cannot be migrated to Unity Catalog. The crawl_grants task now depends on crawl_udfs, crawl_tables, and setup_tacl, ensuring that UDFs are crawled/scanned before grants are.
* Collect direct filesystem access from queries ([#2599](#2599)). This commit introduces support for extracting Direct File System Access (DirectFsAccess) records from workspace queries, adding a new table `directfs_in_queries` and a new view `directfs` that unions `directfs_in_paths` with the new table. The DirectFsAccessCrawlers class has been refactored into two separate classes: `DirectFsAccessCrawler.for_paths` and `DirectFsAccessCrawler.for_queries`, and a new `QueryLinter` class has been introduced to check queries for DirectFsAccess records. Unit tests and manual tests have been conducted to ensure the correct functioning of the schema upgrade. The commit is co-authored by Eric Vergnaud, Serge Smertin, and Andrew Snare.
* Fixed failing integration test: `test_reflect_account_groups_on_workspace_skips_groups_that_already_exists_in_the_workspace` ([#2624](#2624)). In this release, we have made updates to the group migration workflow, addressing an issue ([#2623](#2623)) where the integration test `test_reflect_account_groups_on_workspace_skips_groups_that_already_exists_in_the_workspace` failed due to unhandled scenarios where a workspace group already existed with the same name as an account group to be reflected. The changes include the addition of a new method, `_workspace_groups_in_workspace()`, which checks for the existence of workspace groups. We have also modified the `group-migration` workflow and integrated test `test_reflect_account_groups_on_workspace_skips_account_groups_when_a_workspace_group_has_same_name`. To enhance consistency and robustness, the `GroupManager` class has been updated with two new methods: `test_reflect_account_groups_on_workspace_warns_skipping_when_a_workspace_group_has_same_name` and `test_reflect_account_groups_on_workspace_logs_skipping_groups_when_already_reflected_on_workspace`. These new methods check if a group is skipped when a workspace group with the same name exists and log a warning message, as well as log skipping groups that are already reflected on the workspace. These improvements ensure that the system behaves as expected during the group migration process, handling cases where workspace groups and account groups share the same name.
* Fixed failing solution accelerator verification tests ([#2648](#2648)). This release includes a fix for an issue in the LocalCodeLinter class that was unable to normalize Python code at the notebook cell level. The solution involved modifying the LocalCodeLinter constructor to include a notebook loader, as well as adding a conditional block to the lint_path method to determine the correct loader to use based on whether the path is a notebook or not. These changes allow the linter to handle Python code more effectively within Jupyter notebook cells. The tests for this change were manually verified using `make solacc` on the files that failed in CI. This commit has been co-authored by Eric Vergnaud. The functionality of the linter remains unchanged, and there is no impact on the overall software functionality. The target audience for this description includes software engineers who adopt this open-source library.
* Fixed handling of potentially corrupt `state.json` of UCX workflows ([#2673](#2673)). This commit introduces a fix for potential corruption of `state.json` files in UCX workflows, addressing issue [#2673](#2673) and resolving [#2667](#2667). It updates the import statement in `install.py`, introduces a new `with_extra` function, and centralizes the deletion of jobs, improving code maintainability. Two new methods are added to check if a job is managed by UCX. Additionally, the commit removes deprecation warnings for direct filesystem references in pytester fixtures and adjusts the known.json file to accurately reflect the project's state. A new `Task` method is added for defining UCX workflow tasks, and several test cases are updated to ensure the correct handling of jobs during the uninstallation process. Overall, these changes enhance the reliability and user-friendliness of the UCX workflow installation process.
* Let `create-catalog-schemas` command run as collection ([#2653](#2653)). The `create-catalog-schemas` and `validate-external-locations` commands in the `databricks labs ucx` package have been updated to operate as collections, allowing for simultaneous execution on multiple workspaces. These changes, which resolve issue [#2609](#2609), include the addition of new parameters and flags to the command signatures and method signatures, as well as updates to the existing functionality for creating catalogs and schemas. The changes have been manually tested and accompanied by unit tests, with integration tests to be added in a future update. The `create-catalog-schemas` command now accepts a list of workspace clients and a `run_as_collection` parameter, and skips existing catalogs and schemas while logging a message. The `validate-external-locations` command also operates as a collection, though specific details about this change are not provided.
* Let `create-uber-principal` command run on collection of workspaces ([#2640](#2640)). The `create-uber-principal` command has been updated to support running on a collection of workspaces, allowing for more efficient management of service principals across multiple workspaces. This change includes the addition of a new flag, `run-as-collection`, which, when set to true, allows the command to run on a collection of workspaces with UCX installed. The command continues to grant STORAGE_BLOB_READER access to Azure storage accounts and identify S3 buckets used in AWS workspaces. The changes also include updates to the testing strategy, with manual testing and unit tests added. Integration tests will be added in a future PR. These modifications enhance the functionality and reliability of the command, improving the user experience for managing workspaces. In terms of implementation, the `create_uber_principal` method in the `access.py` and `cli.py` files has been updated to support running on a collection of workspaces. The modification includes the addition of a new parameter, `run_as_collection`, which, when set to True, allows the method to retrieve a collection of workspace contexts and execute the necessary operations for each context. The changes also include updates to the underlying methods, such as the `aws_profile` method, to ensure the correct cloud provider is being utilized. The behavior of the command has been isolated from the underlying `ucx` functionality by introducing mock values for the uber service principal ID and policy ID. The changes also include updates to the tests to reflect these modifications, with new tests added to ensure that the command behaves correctly when run on a collection of workspaces and to test the error handling for unsupported cloud providers and missing subscription IDs.
* Let `migrate-acls` command run as collection ([#2664](#2664)). The `migrate-acls` command in the `labs.yml` file has been updated to facilitate the migration of access control lists (ACLs) from a legacy metastore to a UC metastore for a collection of workspaces with Unity Catalog (UC) installed. This command now supports running as a collection, enabled by a new optional flag `run-as-collection`. When set to true, the command will run for all workspaces with UC installed, enhancing efficiency and ease of use. The new functionality has been manually tested and verified with added unit tests. However, integration tests are yet to be added. The command is part of the `databricks/labs/ucx` module and is implemented in the `cli.py` file. This update addresses issue [#2611](#2611) and includes both manual and unit tests.
* Let `migrate-dbsql-dashboards` command to run as collection ([#2656](#2656)). The `migrate-dbsql-dashboards` command in the `databricks labs ucx` command group has been updated to support running as a collection, allowing it to migrate queries for all dashboards in one or more workspaces. This new feature is achieved by adding an optional flag `run-as-collection` to the command. If set to True, the command will be executed for all workspaces with ucx installed, resolving issue [#2612](#2612). The `migrate-dbsql-dashboards` function has been updated to take additional parameters `ctx`, `run_as_collection`, and `a`. The `ctx` parameter is an optional `WorkspaceContext` object, which can be used to specify the context for a single workspace. If not provided, the function will retrieve a list of `WorkspaceContext` objects for all workspaces. The `run_as_collection` parameter is a boolean flag indicating whether the command should run as a collection. If set to True, the function will iterate over all workspaces and migrate queries for all dashboards in each workspace. The `a` parameter is an optional `AccountClient` object for authentication. Unit tests have been added to ensure that the new functionality works as expected. This feature will be useful for users who need to migrate many dashboards at once. Integration tests will be added in a future update after issue [#2507](#2507) is addressed.
* Let `migrate-locations` command run as collection ([#2652](#2652)). The `migrate-locations` command in the `databricks labs ucx` library for AWS and Azure has been enhanced to support running as a collection of workspaces, allowing for more efficient management of external locations. This has been achieved by modifying the existing `databricks labs ucx migrate-locations` command and adding a `run_as_collection` flag to specify that the command should run for a collection of workspaces. The changes include updates to the `run` method in `locations.py` to return a list of strings containing the URLs of missing external locations, and the addition of the `_filter_unsupported_location` method to filter out unsupported locations. A new `_get_workspace_contexts` function has been added to return a list of `WorkspaceContext` objects based on the provided `WorkspaceClient`, `AccountClient`, and named parameters. The commit also includes new test cases for handling unsupported cloud providers and testing the `run as collection` functionality with multiple workspaces, as well as manual and unit tests. Note that due to current limitations in unit testing, the `run as collection` tests for both Azure and AWS raise exceptions.
* Let `migrate-tables` command run as collection ([#2654](#2654)). The `migrate-tables` command in the `labs.yml` configuration file has been updated to support running as a collection of workspaces with UCX installed. This change includes adding a new flag `run_as_collection` that, when set to `True`, allows the command to run on all workspaces in the collection, and modifying the existing command to accept an `AccountClient` object and `WorkspaceContext` objects. The function `_get_workspace_contexts` is used to retrieve the `WorkspaceContext` objects for each workspace in the collection. Additionally, the `migrate_tables` command now checks for the presence of hiveserde and external tables and prompts the user to run the `migrate-external-hiveserde-tables-in-place-experimental` and `migrate-external-tables-ctas` workflows, respectively. The command's documentation and tests have also been updated to reflect this new functionality. Integration tests will be added in a future update. These changes improve the scalability and efficiency of the `migrate-tables` command, allowing for easier and more streamlined execution across multiple workspaces.
* Let `validate-external-locations` command run as collection ([#2649](#2649)). In this release, the `validate-external-locations` command has been updated to support running as a collection, allowing it to operate on multiple workspaces simultaneously. This change includes the addition of new parameters `ctx`, `run_as_collection`, and `a` to the `validate-external-locations` command in the `cli.py` file. The `ctx` parameter determines the current workspace context, obtained through the `_get_workspace_contexts` function when `run_as_collection` is set to True. The function queries for all available workspaces associated with the given account client `a`. The `save_as_terraform_definitions_on_workspace` method is then called to save the external locations as Terraform definitions on the workspace. This enhancement improves the validation process for external locations across multiple workspaces. Additionally, the command's implementation has been updated to include the `run_as_collection` parameter, which controls whether the command is executed as a collection, ensuring sequential execution of each statement within the command. The unit tests have been updated to include a test case that verifies this functionality. The `validate_external_locations` function has also been updated to include a `ctx` parameter, which is used to specify the workspace context. These changes improve the functionality of the `validate-external-locations` command, ensuring sequential execution of statements across workspaces.
* Let `validate-groups-membership` command to run as collection ([#2657](#2657)). The latest commit introduces an optional `run-as-collection` flag to the `validate-groups-membership` command in the `labs.yml` configuration file. This flag, when set to true, enables the command to run for a collection of workspaces equipped with UCX. The updated `validate-groups-membership` command in `databricks/labs/ucx/cli.py` now accepts new arguments: `ctx`, `run_as_collection`, and `a`. This change resolves issue [#2613](#2613) and includes updated unit and manual tests, ensuring thorough functionality verification. The new feature allows software engineers to validate group memberships across multiple workspaces simultaneously, enhancing efficiency and ease of use. When run as a collection, the command validates groups at both the account and workspace levels, comparing memberships for each specified workspace context.
* Removed installing on workspace log message in `_get_installer` ([#2641](#2641)). In this enhancement, the `_get_installer` function in the `install.py` file has undergone modification to improve the clarity of the installation process for users. Specifically, a confusing log message that incorrectly indicated that UCX was being installed when it was not, has been removed. The log message has been relocated to a more accurate position in the codebase. It is important to note that the `_get_installer` function itself has not been modified, only the log message has been removed. This change eliminates confusion about the installation of UCX, thus enhancing the overall user experience.
* Support multiple subscription ids for command line commands ([#2647](#2647)). The `databricks labs ucx` tool now supports multiple subscription IDs for the `create-uber-principal`, `guess-external-locations`, `migrate-credentials`, and `migrate-locations` commands. This change allows users to specify multiple subscriptions for scanning storage accounts, improving management for users who handle multiple subscriptions simultaneously. Relevant flags in the `labs.yml` configuration file have been updated, and unit tests, as well as manual testing, have been conducted to ensure proper functionality. In the `cli.py` file, the `create_uber_principal` and `principal_prefix_access` functions have been updated to accept a list of subscription IDs, affecting the `create_uber_principal` and `principal_prefix_access` commands. The `azure_subscription_id` property has been renamed to `azure_subscription_ids`, modifying the `azureResources` constructor and ensuring correct handling of the subscription IDs.
* Updated databricks-labs-lsql requirement from <0.11,>=0.5 to >=0.5,<0.12 ([#2666](#2666)). In this release, we have updated the version requirement for the `databricks-labs-lsql` library in the 'pyproject.toml' file from a version greater than or equal to 0.5 and less than 0.11 to a version greater than or equal to 0.5 and less than 0.12. This change allows us to use the latest version of the `databricks-labs-lsql` library while still maintaining a version range constraint. This library provides functionality for managing and querying data in Databricks, and this update ensures compatibility with the project's existing dependencies. No other changes are included in this commit.
* Updated sqlglot requirement from <25.21,>=25.5.0 to >=25.5.0,<25.22 ([#2633](#2633)). In this pull request, we have updated the `sqlglot` dependency requirement in the `pyproject.toml` file. The previous requirement was for a minimum version of 25.5.0 and less than 25.21, which has now been changed to a minimum version of 25.5.0 and less than 25.22. This update allows us to utilize the latest version of `sqlglot`, up to but not including version 25.22. While the changelog and commits for the latest version of `sqlglot` have been provided for reference, the specific changes made to the project as a result of this update are not detailed in the pull request description. Therefore, as a reviewer, it is essential to verify the compatibility of the updated `sqlglot` version with our project and ensure that any necessary modifications have been made to accommodate the new version.
* fix test_running_real_remove_backup_groups_job timeout ([#2651](#2651)). In this release, we have made an adjustment to the `test_running_real_remove_backup_groups_job` test case by increasing the timeout of an inner task from 90 seconds to 3 minutes. This change is implemented to address the timeout issue reported in issue [#2639](#2639). Furthermore, to ensure the correct functioning of the code, we have incorporated integration tests. It is important to note that the functionality of the code remains unaffected. This enhancement aims to provide a more reliable and efficient testing process, thereby improving the overall quality of the open-source library.

Dependency updates:

 * Updated sqlglot requirement from <25.21,>=25.5.0 to >=25.5.0,<25.22 ([#2633](#2633)).
 * Updated databricks-labs-lsql requirement from <0.11,>=0.5 to >=0.5,<0.12 ([#2666](#2666)).
@nfx mentioned this pull request on Sep 19, 2024.