Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2563] [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null #159

Closed
2 tasks done
amardatar opened this issue May 11, 2023 · 7 comments · May be fixed by #110
Closed
2 tasks done
Labels
feature:incremental Issues related to incremental materializations Stale Mark an issue or PR as stale, to be closed type:bug Something isn't working as documented

Comments

@amardatar
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When dbt runs an incremental update which uses a unique_key, if that unique_key has fields which are null, dbt will insert "duplicate" rows.

Expected Behavior

Null fields should not result in duplicate rows, but should be overwritten when an equivalent (based on is null equivalence) row is available in the update.

Steps To Reproduce

  1. Create an incremental model which includes some null fields as part of the unique_key
  2. Run the model first to create the initial table
  3. Run the model again to perform an incremental update
  4. The model will have duplicate rows

As an example, with the SQL:

select 'a' as field_1, 1 as field_2, current_timestamp as field_3
union all
select 'b' as field_1, null as field_2, current_timestamp as field_3

and YAML:

models:
  - name: test
    config:
      materialized: incremental
      unique_key:
        - field_1
        - field_2

A first run will produce a table with:

field_1 field_2 field_3
a 1 2023-05-11 11:17:33.832151 +00:00
b 2023-05-11 11:17:33.832151 +00:00

And a second run will update the table to:

field_1 field_2 field_3
a 1 2023-05-11 11:21:46.899969 +00:00
b 2023-05-11 11:17:33.832151 +00:00
b 2023-05-11 11:21:46.899969 +00:00

The middle row should have been deleted.

Relevant log output

No response

Environment

- OS: macOS Ventura 13.3.1
- Python: 3.11.1
- dbt: 1.4.6

Which database adapter are you using with dbt?

redshift

Additional Context

I've chosen the Bug template for this, but I don't know if it counts as a bug or a feature (or something else).

Broader context:

In most databases, null comparison is done via the is null operator, and value = null will return null (which will be treated as false). Databases typically don't allow nullable fields in their primary key, from a bit of research that appears to be the main reason why.

In some senses, the unique_key option in dbt is corollary to a primary key in a database table. In practice - dbt tends to be used in data warehouse scenarios, which often don't enforce (or even allow) primary keys, and it's often the case that by necessity source tables being transformed will include nullable fields in what would be considered their unique key. Application designers can usually handle this by adding an extra field which is an actual primary key to handle this scenario; analytics engineers typically don't have this freedom and are required to use the data as it exists upstream. As such, I suspect this behaviour would be preferred by a number of other analytics teams, as it allows them to use incremental behaviour in scenarios like this.

Issue details:

The issue itself boils down to how the deletion is handled during an incremental merge. At present, the deletion uses the condition {{ target }}.{{ key }} = {{ source }}.{{ key }}. The change I'm proposing is essentially to change this to ({{ target }}.{{ key }} = {{ source }}.{{ key }} or {{ target }}.{{ key }} is null and {{ source }}.{{ key }} is null).

Suggestions:

Assuming this is an issue others face as well and that there's a desire to implement a change for, I'd imagine this could either be done by:

  1. Making the change directly, and assuming that if users have included a nullable field in their unique_key this is the behaviour they're expecting; or
  2. Adding an additional config option to allow null values to be compared using the above method if any fields in the unique_key are nullable.
@amardatar amardatar added type:bug Something isn't working as documented triage:product In Product's queue labels May 11, 2023
@github-actions github-actions bot changed the title [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null [CT-2563] [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null May 11, 2023
@dbeatty10 dbeatty10 self-assigned this May 11, 2023
@dbeatty10
Copy link
Contributor

Thanks for opening @amardatar !

Your proposal for NULL-aware comparisons makes sense to me 🤩

After all, "there are only two hard things in Computer Science: cache invalidation, naming things, and three-valued logic."

Prior art

I believe this aligns with the opposite logic that we see in snapshots for discovering when rows are different ("is distinct from"):
https://github.com/dbt-labs/dbt-core/blob/630cd3aba063c570b856e5dc1f6e60cf112ea220/core/dbt/include/global_project/macros/materializations/snapshots/strategies.sql#L152-L158

Related feature request

We have an open feature request for an ergonomic implementation of is not distinct from here:
dbt-labs/dbt-core#6997

If we were to implement it, then your update might* be as simple as:

{{ dbt.is_not_distinct_from(source_key, target_key) }}

* depending on final macro name and assuming you already have Jinja variables source_key and target_key.

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale Mark an issue or PR as stale, to be closed label Dec 12, 2023
@amardatar
Copy link
Author

Commenting to keep this open - I consider it's still relevant and there's a PR for it.

@github-actions github-actions bot removed the Stale Mark an issue or PR as stale, to be closed label Dec 13, 2023
@dbeatty10 dbeatty10 transferred this issue from dbt-labs/dbt-core Apr 10, 2024
@sjain-cn
Copy link

@dbeatty10 @martynydbt , is there any update on this ?

@dbeatty10
Copy link
Contributor

@sjain-cn there's an open PR that you can subscribe to: #110.

@dbeatty10 dbeatty10 added the feature:incremental Issues related to incremental materializations label May 30, 2024
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale Mark an issue or PR as stale, to be closed label Nov 27, 2024
mikealfare pushed a commit that referenced this issue Dec 2, 2024
Signed-off-by: Henri Blancke <[email protected]>
Co-authored-by: nicor88 <[email protected]>
Copy link
Contributor

github-actions bot commented Dec 4, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature:incremental Issues related to incremental materializations Stale Mark an issue or PR as stale, to be closed type:bug Something isn't working as documented
Projects
None yet
4 participants