Created new section called Parallel batch execution #6589

Merged
merged 83 commits into from
Dec 7, 2024
Changes from 78 commits
9db07cc
Created new section called Parallel batch execution
nataliefiann Dec 4, 2024
2f5ec6e
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 4, 2024
ca9e4e9
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
8fd0faa
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
ca3e04c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
0cb99ba
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
099a2bc
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
40c57e8
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
2491176
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
e9a42d7
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
efbeae5
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
ed3132e
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
3e207c0
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
cbed4d6
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
6d9e7c1
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
f349a16
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
a8e56df
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
7c89a59
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
c17a31a
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
5e24a5a
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
7ae1121
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 5, 2024
4da02b1
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
bc2adaf
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
b30107c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
1460526
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
ed0fa04
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
dadbe4a
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 5, 2024
6a7af2c
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
3c9539b
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
9a4b3c0
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
4cfa151
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
f989037
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
28e11af
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 5, 2024
e4c9bf2
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 5, 2024
be2d332
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 5, 2024
7384dd9
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
979f689
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
690286f
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
dc700ab
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
5f41e4c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
e54777a
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
08c4b5c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
fd5351f
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
ab12f7b
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
32144f5
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
4b54892
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
8f38762
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
864f52d
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
e060242
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
504fb91
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
113b356
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 5, 2024
dfc6555
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
33f0166
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
97d96c0
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
ff16511
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
b06f3ea
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 6, 2024
7586373
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
c9c1012
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
d1f94cc
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
8504487
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
14ccc71
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
c7f6642
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
349f912
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 6, 2024
2227b72
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
f4b3b15
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 6, 2024
292aef4
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
722851f
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
7967a82
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
78d0dd6
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
37a8ba8
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
722778b
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
685cbd1
Added nested bullet
nataliefiann Dec 6, 2024
ed8b85b
Merge branch 'nfiann-rbip' of https://github.com/dbt-labs/docs.getdbt…
nataliefiann Dec 6, 2024
19b4095
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
f102e15
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
d873999
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 6, 2024
7f824a4
Updated upgrading to v1.9 guide to included parallel batch execution …
nataliefiann Dec 6, 2024
261f781
Created concurrent batches page (#6601)
nataliefiann Dec 6, 2024
6f84d2b
Apply suggestions from code review
runleonarun Dec 6, 2024
ef3ccc3
Update incremental-microbatch.md
runleonarun Dec 6, 2024
3926908
Update incremental-microbatch.md
runleonarun Dec 6, 2024
6c66d84
Apply suggestions from code review
runleonarun Dec 6, 2024
258a66d
Update incremental-microbatch.md
runleonarun Dec 6, 2024
143 changes: 136 additions & 7 deletions website/docs/docs/build/incremental-microbatch.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ Incremental models in dbt are a [materialization](/docs/build/materializations)
Microbatch is an incremental strategy designed for large time-series datasets:
- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to define time-based ranges for filtering. Set the `event_time` column for your microbatch model and its direct parents (upstream models). Note, this is different to `partition_by`, which groups rows into partitions.
- It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
- Unlike traditional incremental strategies, microbatch doesn't require implementing complex conditional logic for [backfilling](#backfills).
- Unlike traditional incremental strategies, microbatch enables you to [reprocess failed batches](/docs/build/incremental-microbatch#retry), auto-detect [parallel batch execution](#parallel-batch-execution), and eliminate the need to implement complex conditional logic for [backfilling](#backfills).

- Note, microbatch might not be the best strategy for all use cases. Consider other strategies for use cases such as not having a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).

### How microbatch works
Expand Down Expand Up @@ -179,12 +180,15 @@ It does not matter whether the table already contains data for that day. Given t

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| [`begin`](/reference/resource-configs/begin) | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| [`batch_size`](/reference/resource-configs/batch-size) | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A |
| [`lookback`](/reference/resource-configs/lookback) | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |

| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `concurrent_batches` | An override for whether batches run concurrently (at the same time) or sequentially (one after the other). | `None` | Boolean | Optional |
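
The batch count in the `begin` example can be checked with a few lines of date arithmetic. This is only a sketch of the arithmetic, not how dbt computes batches internally; the helper name `daily_batch_starts` is hypothetical.

```python
from datetime import date, timedelta


def daily_batch_starts(begin: date, run_date: date) -> list[date]:
    """Return the start date of every daily batch from begin through run_date, inclusive."""
    n_days = (run_date - begin).days
    return [begin + timedelta(days=i) for i in range(n_days + 1)]


batches = daily_batch_starts(date(2023, 10, 1), date(2024, 10, 1))
print(len(batches) - 1)  # 366 full days before the run date (2024 is a leap year)
print(len(batches))      # 367 once the batch for "today" is included
```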


<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

Expand Down Expand Up @@ -280,6 +284,131 @@ For now, dbt assumes that all values supplied are in UTC:

While we may consider adding support for custom time zones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## Parallel batch execution

The microbatch strategy offers the benefit of updating a model in smaller, more manageable batches.

Parallel batch execution means that multiple batches are processed at the same time rather than one after the other (sequentially), which speeds up processing of your microbatch models.

dbt automatically detects whether a batch can be run in parallel in most cases, which means you don’t need to configure this setting. However, the `concurrent_batches` config is available as an override (not a gate), allowing you to specify whether batches should or shouldn’t be run in parallel in specific cases.

For example, if you have a microbatch model with 12 batches, dbt can run those batches in parallel, up to the number of [available threads](/docs/running-a-dbt-project/using-threads).
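
As a rough analogy for how a thread limit caps parallelism, the sketch below runs 12 placeholder batches on a pool of 4 workers and records the peak number in flight. It uses Python's standard `concurrent.futures` module and does not reflect dbt's actual scheduler; `run_batches` and the `time.sleep` stand-in for a query are assumptions made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_batches(n_batches: int, threads: int) -> tuple[list[int], int]:
    """Run placeholder batches on a fixed-size pool.

    Returns the finished batch ids and the peak number of batches in flight.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def process(batch_id: int) -> int:
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # stand-in for the batch's query
        with lock:
            in_flight -= 1
        return batch_id

    with ThreadPoolExecutor(max_workers=threads) as pool:
        done = list(pool.map(process, range(n_batches)))
    return done, peak


done, peak = run_batches(n_batches=12, threads=4)
print(len(done), peak <= 4)
```

All 12 batches finish, but never more than 4 run at once, which is the sense in which parallelism is "limited by" the thread count.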

### Prerequisites

To enable parallel execution, you must meet the following conditions:

- You use the following supported adapters:
  - Snowflake
  - Databricks
  - More adapters coming soon!
    - We'll be continuing to test and add concurrency support for adapters. This means that some adapters might get concurrency support _after_ the 1.9 initial release.
- You meet [additional conditions](#how-parallel-batch-execution-works) mentioned in the next section.

### How parallel batch execution works

A batch can only run in parallel if:

| Step | Condition | Parallel execution | Sequential execution|
| ---- | ---------------| :------------------: | :----------: |
| 1. | **Not** the first batch | ✅ | - |
| 2. | **Not** the last batch | ✅ | - |
| 3. | [Adapter supports](#prerequisites) parallel batches | ✅ | - |


After checking conditions 1, 2, and 3 in the previous table, and if no `concurrent_batches` value is set, dbt auto-detects whether the model invokes the [`{{ this }}`](/reference/dbt-jinja-functions/this) Jinja function. If the model references `{{ this }}`, the batches run sequentially, because `{{ this }}` represents the database relation of the current model, and multiple batches referencing the same relation at the same time can conflict.

Otherwise, if `{{ this }}` isn't detected (and other conditions are met), the batches will run in parallel. This can be overridden by setting a value for `concurrent_batches`.
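
The rules in this section can be summarized as a small decision function. This is a sketch of the logic as described on this page, not dbt's source code; the function and parameter names are hypothetical.

```python
from typing import Optional


def runs_in_parallel(
    is_first_batch: bool,
    is_last_batch: bool,
    adapter_supports_parallel: bool,
    references_this: bool,
    concurrent_batches: Optional[bool] = None,
) -> bool:
    """Decide whether a batch runs in parallel, following the rules described above."""
    # Conditions 1-3 from the table must all hold, whatever the override says.
    if is_first_batch or is_last_batch or not adapter_supports_parallel:
        return False
    # An explicit concurrent_batches value overrides the auto-detection.
    if concurrent_batches is not None:
        return concurrent_batches
    # Auto-detection: a model that references {{ this }} runs sequentially.
    return not references_this


# A middle batch on a supported adapter, no override, no {{ this }} reference
print(runs_in_parallel(False, False, True, references_this=False))
# The same batch when the model references {{ this }}
print(runs_in_parallel(False, False, True, references_this=True))
```
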
### Parallel or sequential execution



Choosing between parallel batch execution and sequential processing depends on the specific requirements of your use case.

- Parallel batch execution is faster but requires logic that's independent of batch execution order. For example, if you're developing a data pipeline for a system that processes user transactions in batches, each batch is executed in parallel for better performance. However, the logic used to process each transaction shouldn't depend on the order of how batches are executed or completed.
- Sequential processing is slower but essential for calculations like [cumulative metrics](/docs/build/cumulative) in microbatch models. It processes data in the correct order, allowing each step to build on the previous one.
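
To make the difference concrete, the sketch below compares a per-batch aggregate, which gives the same answer in any execution order, with a running total, which depends on the order. The data and helper names are made up for illustration and are unrelated to dbt's internals.

```python
from itertools import accumulate, permutations

# Hypothetical daily totals, one value per batch
daily_totals = {"2024-01-01": 10, "2024-01-02": 5, "2024-01-03": 20}
days = list(daily_totals)


def per_batch(order: list[str]) -> dict[str, int]:
    """Each batch only looks at its own data, so order doesn't matter."""
    return {day: daily_totals[day] for day in order}


def cumulative(order: list[str]) -> dict[str, int]:
    """Each batch builds on the batches processed before it, so order matters."""
    running = accumulate(daily_totals[day] for day in order)
    return dict(zip(order, running))


# Safe to parallelize: every processing order yields the same per-batch result.
print(all(per_batch(list(o)) == per_batch(days) for o in permutations(days)))
# Not safe to parallelize: processing out of order changes the running totals.
print(cumulative(days) == cumulative(days[::-1]))
```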

<!-- You can override the check for `this` by setting `concurrent_batches` to either `True` or `False`. If set to `False`, the batch will be run sequentially. If set to `True` the batch will be run in parallel (assuming [1], [2], and [3])
@nataliefiann checking w quigley

I think this can be removed

To override the `this` check, use the `concurrent_batches` configuration:


<File name='dbt_project.yml'>

```yaml
models:
+concurrent_batches: True
```

</File>

or:

<File name='models/my_model.sql'>

```sql
{{
config(
materialized='incremental',
concurrent_batches=True,
incremental_strategy='microbatch'

...
)
}}

select ...
```

</File>
-->

### Configure `concurrent_batches`

By default, dbt auto-detects whether batches can run in parallel for microbatch models, and this works correctly in most cases. However, you can override dbt's detection by setting the `concurrent_batches` config in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, given you meet all the [conditions](#prerequisites):

<Tabs>
<TabItem value="yaml" label="dbt_project.yml">

<File name='dbt_project.yml'>

```yaml
models:
  +concurrent_batches: True # value set to True to run batches in parallel
```

</File>
</TabItem>

<TabItem value="sql" label="my_model.sql">

<File name='models/my_model.sql'>

```sql
{# concurrent_batches is set to True to run batches in parallel #}
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='session_start',
        begin='2020-01-01',
        batch_size='day',
        concurrent_batches=True,
        ...
    )
}}

select ...
```
</File>
</TabItem>
</Tabs>

Depending on your use case, configuring your microbatch models to run in parallel offers faster processing than running batches sequentially.
### How microbatch compares to other incremental strategies

As data warehouses roll out new operations for concurrently replacing/upserting data partitions, we may find that the new operation for the data warehouse is more efficient than what the adapter uses for microbatch. In such instances, we reserve the right to update the default operation for microbatch, so long as it works as intended/documented for models that fit the microbatch paradigm.
## How `microbatch` compares to other incremental strategies?

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `{% if is_incremental() %}` conditional block. You are responsible for crafting this SQL in a way that queries [`{{ this }}`](/reference/dbt-jinja-functions/this) to check when the most recent record was last loaded, with an optional look-back window for late-arriving records.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@ Starting in Core 1.9, you can use the new [microbatch strategy](/docs/build/incr
- Simplified query design: Write your model query for a single batch of data. dbt will use your `event_time`, `lookback`, and `batch_size` configurations to automatically generate the necessary filters for you, making the process more streamlined and reducing the need for you to manage these details.
- Independent batch processing: dbt automatically breaks down the data to load into smaller batches based on the specified `batch_size` and processes each batch independently, improving efficiency and reducing the risk of query timeouts. If some of your batches fail, you can use `dbt retry` to load only the failed batches.
- Targeted reprocessing: To load a *specific* batch or batches, you can use the CLI arguments `--event-time-start` and `--event-time-end`.
- [Automatic parallel batch execution](/docs/build/incremental-microbatch#parallel-batch-execution): Process multiple batches at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models. dbt intelligently auto-detects if your batches can run in parallel, while also allowing you to manually override parallel execution with the `concurrent_batches` config.


Currently microbatch is supported on these adapters with more to come:
* postgres
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
---
title: "concurrent_batches"
resource_types: [models]
datatype: model_name
description: "Learn about concurrent_batches in dbt."
---

:::note

Available in dbt Core v1.9+ or the [dbt Cloud "Latest" release tracks](/docs/dbt-versions/cloud-release-tracks).

:::

<Tabs>
<TabItem value="Project file">


<File name='dbt_project.yml'>

```yaml
models:
  +concurrent_batches: true
```

</File>

</TabItem>


<TabItem value="sql file">

<File name='models/my_model.sql'>

```sql
{{
    config(
        materialized='incremental',
        concurrent_batches=true,
        incremental_strategy='microbatch'
        ...
    )
}}
select ...
```

</File>

</TabItem>
</Tabs>

## Definition

`concurrent_batches` is an override that lets you decide whether batches run in parallel or sequentially (one at a time).

For more information, refer to [how batch execution works](/docs/build/incremental-microbatch#how-parallel-batch-execution-works).
## Example

By default, dbt auto-detects whether batches can run in parallel for microbatch models. However, you can override dbt's detection by setting the `concurrent_batches` config in your `dbt_project.yml` or model `.sql` file. Setting it to `false` forces sequential execution, which is appropriate when you meet these conditions:
* You've configured a microbatch incremental strategy.
* You're working with cumulative metrics or any logic that depends on batch order.

Set `concurrent_batches` config to `false` to ensure batches are processed sequentially. For example:

<File name='dbt_project.yml'>

```yaml
models:
  my_project:
    cumulative_metrics_model:
      +concurrent_batches: false
```
</File>


<File name='models/my_model.sql'>

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        concurrent_batches=false
    )
}}
select ...

```
</File>


1 change: 1 addition & 0 deletions website/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -956,6 +956,7 @@ const sidebarSettings = {
"reference/resource-configs/materialized",
"reference/resource-configs/on_configuration_change",
"reference/resource-configs/sql_header",
"reference/resource-properties/concurrent_batches",
],
},
{
Expand Down