Adding 1.9 dbt-databricks documentation for new python model config (#6350)

## What are you changing in this pull request and why?

Adds the first batch of new documentation for dbt-databricks 1.9, focusing on newly supported Python submission configuration.

## Checklist

- [x] I have reviewed the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.
- [x] The topic I'm writing about is for specific dbt version(s) and I have versioned it according to the [version a whole page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version) and/or [version a block of content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content) guidelines.
- [x] I have added checklist item(s) to this list for anything that needs to happen before this PR is merged, such as "needs technical review" or "change base branch."

<!-- PRE-RELEASE VERSION OF dbt (if so, uncomment):
- [ ] Add a note to the prerelease version [Migration Guide](https://github.com/dbt-labs/docs.getdbt.com/tree/current/website/docs/docs/dbt-versions/core-upgrade)
-->

<!-- ADDING OR REMOVING PAGES (if so, uncomment):
- [ ] Add/remove page in `website/sidebars.js`
- [ ] Provide a unique filename for new pages
- [ ] Add an entry for deleted pages in `website/vercel.json`
- [ ] Run link testing locally with `npm run build` to update the links that point to deleted pages
-->

---------

Co-authored-by: Amy Chen <[email protected]>
Co-authored-by: Leona B. Campbell <[email protected]>
1 parent cb69cf6, commit 469ab4c
Showing 1 changed file with 108 additions and 1 deletion.
@@ -65,6 +65,107 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this

</VersionBlock>

<VersionBlock firstVersion="1.9">

### Python submission methods

In dbt v1.9 and higher, or in [Versionless](/docs/dbt-versions/versionless-cloud) dbt Cloud, you can use these four options for `submission_method`:

* `all_purpose_cluster`: Executes the Python model either directly using the [Command API](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run.
* `job_cluster`: Creates a new job cluster to execute an uploaded notebook as a one-off job run.
* `serverless_cluster`: Uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run.
* `workflow_job`: Creates or updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters.
  :::caution
  This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt.
  :::

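
As a rough illustration of how one of these values might be selected, the sketch below sets `submission_method` for a single model in a properties file. The model name `my_python_model` is hypothetical, and `serverless_cluster` is only an example; any of the four values listed above could appear in its place.

```yaml
models:
  - name: my_python_model # hypothetical model name
    config:
      # Run the uploaded notebook as a one-off job run on serverless compute
      submission_method: serverless_cluster
```
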
We are currently in a transitional period in which the old submission methods (which were grouped by compute) do not line up with the logically distinct submission methods (command, job run, workflow).

As a result, the supported config matrix is somewhat complicated:

| Config                | Use                                                                  | Default            | `all_purpose_cluster`* | `job_cluster` | `serverless_cluster` | `workflow_job` |
| --------------------- | -------------------------------------------------------------------- | ------------------ | ---------------------- | ------------- | -------------------- | -------------- |
| `create_notebook`     | If false, use the Command API; otherwise upload a notebook and use a job run | `false`    | ✅                     | ❌            | ❌                   | ❌             |
| `timeout`             | Maximum time to wait for the command/job to run                      | `0` (No timeout)   | ✅                     | ✅            | ✅                   | ✅             |
| `job_cluster_config`  | Configures a [new cluster](https://docs.databricks.com/api/workspace/jobs/submit#tasks-new_cluster) for running the model | `{}` | ❌ | ✅ | ❌ | ✅ |
| `access_control_list` | Directly configures [access control](https://docs.databricks.com/api/workspace/jobs/submit#access_control_list) for the job | `{}` | ✅ | ✅ | ✅ | ✅ |
| `packages`            | List of packages to install on the executing cluster                 | `[]`               | ✅                     | ✅            | ✅                   | ✅             |
| `index_url`           | URL to install `packages` from                                       | `None` (uses PyPI) | ✅                     | ✅            | ✅                   | ✅             |
| `additional_libs`     | Directly configures [libraries](https://docs.databricks.com/api/workspace/jobs/submit#tasks-libraries) | `[]` | ✅ | ✅ | ✅ | ✅ |
| `python_job_config`   | Additional configuration for jobs/workflows (see table below)        | `{}`               | ✅                     | ✅            | ✅                   | ✅             |
| `cluster_id`          | ID of an existing all-purpose cluster to execute against             | `None`             | ✅                     | ❌            | ❌                   | ✅             |
| `http_path`           | Path to an existing all-purpose cluster to execute against           | `None`             | ✅                     | ❌            | ❌                   | ❌             |

\* Only `timeout` and `cluster_id`/`http_path` are supported when `create_notebook` is `false`.

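
To make the matrix above more concrete, here is a minimal sketch of an `all_purpose_cluster` configuration that opts into notebook upload so that the package-related options apply. The model name, cluster ID, package, and index URL are placeholders chosen for illustration, not values taken from the adapter.

```yaml
models:
  - name: my_python_model # hypothetical model name
    config:
      submission_method: all_purpose_cluster
      # Upload a notebook and use a job run instead of the Command API
      create_notebook: true
      cluster_id: "0000-000000-example0" # placeholder cluster ID
      # Per the footnote, these options assume create_notebook is true
      packages: ["numpy"] # illustrative package
      index_url: "https://pypi.example.com/simple" # hypothetical private index
```

If `create_notebook` were left at its default of `false`, the footnote suggests that only `timeout` and `cluster_id`/`http_path` would be honored.
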
With the introduction of the `workflow_job` submission method, we chose to group further configuration of Python model submission under a top-level config named `python_job_config`. This keeps configuration options for jobs and workflows namespaced so that they do not interfere with other model config, which allows us to be much more flexible about what is supported for job execution.

The support matrix for this feature is divided into `workflow_job` and all others (assuming `all_purpose_cluster` with `create_notebook` set to `true`).
Each config option listed must be nested under `python_job_config`:

| Config                     | Use                                                                                                                     | Default | `workflow_job` | All others |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ------- | -------------- | ---------- |
| `name`                     | The name to give (or use to look up) the created workflow                                                               | `None`  | ✅             | ❌         |
| `grants`                   | A simplified way to specify access control for the workflow                                                             | `{}`    | ✅             | ✅         |
| `existing_job_id`          | ID to use to look up the created workflow (in place of `name`)                                                          | `None`  | ✅             | ❌         |
| `post_hook_tasks`          | [Tasks](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include after the model notebook execution      | `[]`    | ✅             | ❌         |
| `additional_task_settings` | Additional [task config](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include in the model task      | `{}`    | ✅             | ❌         |
| [Other job run settings](https://docs.databricks.com/api/workspace/jobs/submit) | Config will be copied into the request, outside of the model task  | `None`  | ❌             | ✅         |
| [Other workflow settings](https://docs.databricks.com/api/workspace/jobs/create) | Config will be copied into the request, outside of the model task | `None`  | ✅             | ❌         |

This example uses the new configuration options in the previous table:

<File name='schema.yml'>

```yaml
models:
  - name: my_model
    config:
      submission_method: workflow_job

      # Define a job cluster to create for running this workflow
      # Alternately, could specify cluster_id to use an existing cluster, or provide neither to use a serverless cluster
      job_cluster_config:
        spark_version: "15.3.x-scala2.12"
        node_type_id: "rd-fleet.2xlarge"
        runtime_engine: "{{ var('job_cluster_defaults.runtime_engine') }}"
        data_security_mode: "{{ var('job_cluster_defaults.data_security_mode') }}"
        autoscale: { "min_workers": 1, "max_workers": 4 }

      python_job_config:
        # These settings are passed in, as is, to the request
        email_notifications: { on_failure: ["[email protected]"] }
        max_retries: 2

        name: my_workflow_name

        # Override settings for your model's dbt task. For instance, you can
        # change the task key
        additional_task_settings: { "task_key": "my_dbt_task" }

        # Define tasks to run before/after the model
        # This example assumes you have already uploaded a notebook to /my_notebook_path to perform optimize and vacuum
        post_hook_tasks:
          [
            {
              "depends_on": [{ "task_key": "my_dbt_task" }],
              "task_key": "OPTIMIZE_AND_VACUUM",
              "notebook_task":
                { "notebook_path": "/my_notebook_path", "source": "WORKSPACE" },
            },
          ]

        # Simplified structure, rather than having to specify permission separately for each user
        grants:
          view: [{ "group_name": "marketing-team" }]
          run: [{ "user_name": "[email protected]" }]
          manage: []
```
</File>
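
For comparison with the `workflow_job` example, the following sketch shows how `python_job_config` might look for one of the other submission methods, here `job_cluster`. It assumes, per the `python_job_config` table, that only `grants` and pass-through job run settings apply in this case. The model name, group name, and email address are hypothetical, and the cluster settings are simply reused from the example above.

```yaml
models:
  - name: my_other_model # hypothetical model name
    config:
      submission_method: job_cluster

      # New cluster created for the one-off job run
      job_cluster_config:
        spark_version: "15.3.x-scala2.12"
        node_type_id: "rd-fleet.2xlarge"
        autoscale: { "min_workers": 1, "max_workers": 4 }

      python_job_config:
        # Assumed to be copied into the job run request as is
        email_notifications: { on_failure: ["data-team@example.com"] }

        # Workflow-only options such as name or post_hook_tasks are omitted
        grants:
          view: [{ "group_name": "analytics-team" }]
          run: []
          manage: []
```
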
</VersionBlock>

## Incremental models | ||
dbt-databricks plugin leans heavily on the [`incremental_strategy` config](/docs/build/incremental-strategy). This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of four values: | ||

@@ -556,9 +657,15 @@ Databricks adapter ... using compute resource <name of compute>.

Materializing a Python model requires execution of SQL as well as Python.
Specifically, if your Python model is incremental, the current execution pattern involves executing Python to create a staging table that is then merged into your target table using SQL.
<VersionBlock lastVersion="1.8">
The Python code needs to run on an all-purpose cluster, while the SQL code can run on an all-purpose cluster or a SQL Warehouse.
</VersionBlock>
<VersionBlock firstVersion="1.9">
The Python code needs to run on an all-purpose cluster (or serverless cluster, see [Python Submission Methods](#python-submission-methods)), while the SQL code can run on an all-purpose cluster or a SQL Warehouse.
</VersionBlock>
When you specify your `databricks_compute` for a Python model, you are currently only specifying which compute to use when running the model-specific SQL.
If you wish to use a different compute for executing the Python itself, you must specify an alternate `http_path` in the config for the model. Please note that declaring a separate SQL compute and a Python compute for your Python dbt models is optional. If you wish to do this:
If you wish to use a different compute for executing the Python itself, you must specify an alternate compute in the config for the model.
For example:

<File name="model.py">
