Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Null-safe equality for unique_key within delete+insert and merge incremental strategies #110

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

amardatar
Copy link

@amardatar amardatar commented Feb 27, 2024

Null-safe equality for unique_key within delete+insert and merge incremental strategies with an appropriate null comparison.

resolves #159

Problem

The current implementation of the incremental merge does not use null-safe equality, meaning if any of the columns are null, duplicate values will be inserted every time the model is run. This is unlikely to be desired behaviour for anyone.

Solution

The SQL for the delete condition is updated to handle a check for null values.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@nikita-sheremet-flocktory

@dbeatty10

I appologies that reach you directly but is there news for this PR? If there is additional help or actions required? Could you please tell me wether help is needed and if I can I willl try to help. This PR fixed the issue that is really annoying :(

Thanks in advance!

@nikita-sheremet-flocktory

@amardatar

First of all many thanks for such usefull PR! You provided changes for default__get_delete_insert_merge_sql that implements delete+insert strategy. But there is also additional function default__get_merge_sql that responsible for merge strategy and it has it's own {% if unique_key %}.

Could you please add or is null condition for it also?

Thanks in advance!

@amardatar amardatar requested a review from a team as a code owner May 14, 2024 17:31
@dbeatty10 dbeatty10 changed the title Updating incremental merge WHERE condition to handle nullable fields Null-safe equality for unique_key within delete+insert incremental strategy May 15, 2024
@dbeatty10 dbeatty10 changed the title Null-safe equality for unique_key within delete+insert incremental strategy Null-safe equality for unique_key within delete+insert and merge incremental strategies May 15, 2024
@peterallenwebb
Copy link
Contributor

peterallenwebb commented May 30, 2024

@colin-rogers-dbt, I Reviewed this and thought about it with @dbeatty10 a bit. We think it looks good.

@dbeatty10
Copy link
Contributor

@peterallenwebb and I took a closer look at this today, and we think there's a couple more test cases to add:

  1. Check null comparison when the unique_key has a single column (also with multiple rows with null values)
  2. Check null comparison when the unique_key has a multiple columns (also with multiple rows with null values)

I will take responsibility for trying out these cases and adding these tests.

@dbeatty10
Copy link
Contributor

Important

After decoupling dbt-adapters from dbt-core, only the unit tests run in CI; none of the functional tests run in CI in this repo.

So we'll need to run the changes in this PR along within the functional tests in dbt-postgres (and possibly also dbt-snowflake, dbt-bigquery, and dbt-redshift) in order to know that the changes work as intended.

@colin-rogers-dbt colin-rogers-dbt added community A PR, or an issue with a PR, from a community member feature:incremental Issues related to incremental materializations labels Aug 9, 2024
@rightx2
Copy link

rightx2 commented Aug 19, 2024

Can't wait this feature!

@dbeatty10
Copy link
Contributor

Short test summary info for dbt-postgres (8 failed tests)

=========================== short test summary info ============================
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__no_unique_keys - AssertionError: assert 8 == 9
 +  where 8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=None, relation='seed').seed_rows
 +  and   9 = ResultHolder(seed_count=1, model_count=1, seed_rows=9, inc_test_model_count=1, opt_model_count=None, relation='no_unique_key').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__empty_str_unique_key - AssertionError: assert 8 == 9
 +  where 8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=None, relation='seed').seed_rows
 +  and   9 = ResultHolder(seed_count=1, model_count=1, seed_rows=9, inc_test_model_count=1, opt_model_count=None, relation='empty_str_unique_key').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__one_unique_key - AssertionError: assert 7 == 8
 +  where 7 = ResultHolder(seed_count=1, model_count=1, seed_rows=7, inc_test_model_count=1, opt_model_count=1, relation='one_str__overwrite').seed_rows
 +  and   8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=1, relation='str_unique_key').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__empty_unique_key_list - AssertionError: assert 8 == 9
 +  where 8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=None, relation='seed').seed_rows
 +  and   9 = ResultHolder(seed_count=1, model_count=1, seed_rows=9, inc_test_model_count=1, opt_model_count=None, relation='empty_unique_key_list').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__unary_unique_key_list - AssertionError: assert 7 == 8
 +  where 7 = ResultHolder(seed_count=1, model_count=1, seed_rows=7, inc_test_model_count=1, opt_model_count=1, relation='unique_key_list__inplace_overwrite').seed_rows
 +  and   8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=1, relation='unary_unique_key_list').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__duplicated_unary_unique_key_list - AssertionError: assert 7 == 8
 +  where 7 = ResultHolder(seed_count=1, model_count=1, seed_rows=7, inc_test_model_count=1, opt_model_count=1, relation='unique_key_list__inplace_overwrite').seed_rows
 +  and   8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=1, relation='duplicated_unary_unique_key_list').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__trinary_unique_key_list - AssertionError: assert 7 == 8
 +  where 7 = ResultHolder(seed_count=1, model_count=1, seed_rows=7, inc_test_model_count=1, opt_model_count=1, relation='unique_key_list__inplace_overwrite').seed_rows
 +  and   8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=1, relation='trinary_unique_key_list').seed_rows
FAILED tests/functional/adapter/test_incremental.py::TestIncrementalUniqueKey::test__trinary_unique_key_list_no_update - AssertionError: assert 8 == 9
 +  where 8 = ResultHolder(seed_count=1, model_count=1, seed_rows=8, inc_test_model_count=1, opt_model_count=None, relation='seed').seed_rows
 +  and   9 = ResultHolder(seed_count=1, model_count=1, seed_rows=9, inc_test_model_count=1, opt_model_count=None, relation='nontyped_trinary_unique_key_list').seed_rows
============ 8 failed, 569 passed, 12 skipped in 472.15s (0:07:52) =============

It looks like there are similar test failures across all the adapters I tested against.

I'm not sure if these failures are because the tests are making a simplifying assumption that isn't true as it relates to NULLs, of if there is something off about the implementation.

So it looks like either the implementation will need to be updated, or the these test cases will need to be updated.

@amardatar
Copy link
Author

Hey @dbeatty10 had a look - is it because we've added actual rows to the seeds, and needed to update the seed_rows param? I would've missed that, didn't realise that's how it worked - but have pushed a change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes The PR author has signed the CLA community A PR, or an issue with a PR, from a community member feature:incremental Issues related to incremental materializations triage:ready-for-review In Eng's queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2563] [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null
7 participants