
[Parquet] Improve speed of dictionary encoding NaN float values #6953

Merged
merged 2 commits into apache:main from adamreeve:nan-dict-encode on Jan 11, 2025

Conversation

adamreeve
Contributor

Which issue does this PR close?

Closes #6952

Rationale for this change

This treats NaNs as equal to other NaNs of the same type for the purpose of dictionary encoding them when writing f32 or f64 Parquet physical values.

What changes are included in this PR?

  • Introduces a new Intern trait to define equality behaviour for interning, replacing the use of PartialEq.
  • Adds a benchmark for writing floating point values with NaNs to Parquet.

Are there any user-facing changes?

  • Users should see improved performance when writing floating point data with many NaNs to Parquet.

@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Jan 8, 2025
@adamreeve
Contributor Author

Benchmark results from the new benchmarks before changing the interning behaviour:

write_batch primitive/4096 values float with NaNs
                        time:   [5.6968 ms 5.7060 ms 5.7141 ms]
                        thrpt:  [9.6186 MiB/s 9.6324 MiB/s 9.6479 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
write_batch primitive/4096 values float with no NaNs
                        time:   [383.44 µs 383.65 µs 383.85 µs]
                        thrpt:  [143.18 MiB/s 143.26 MiB/s 143.34 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

This shows that writing with 50% NaN values is much slower than with no NaNs.

After the change, performance with NaNs is very similar to without NaNs:

write_batch primitive/4096 values float with NaNs
                        time:   [406.40 µs 406.63 µs 406.88 µs]
                        thrpt:  [135.08 MiB/s 135.16 MiB/s 135.24 MiB/s]
                 change:
                        time:   [-92.875% -92.861% -92.845%] (p = 0.00 < 0.05)
                        thrpt:  [+1297.6% +1300.7% +1303.5%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values float with no NaNs
                        time:   [382.52 µs 384.16 µs 385.50 µs]
                        thrpt:  [142.58 MiB/s 143.07 MiB/s 143.68 MiB/s]
                 change:
                        time:   [+0.1803% +0.3520% +0.5192%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5165% -0.3507% -0.1799%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

(I removed the 4096 values float with no NaNs benchmark from this PR after running these benchmarks as I don't think there's a lot of value in keeping it)

@@ -66,7 +70,7 @@ impl<S: Storage> Interner<S> {
         .dedup
         .entry(
             hash,
-            |index| value == self.storage.get(*index),
+            |index| value.eq(self.storage.get(*index)),
@adamreeve (Contributor Author) commented on the diff

Hmm, after opening this PR I realised a simpler approach would be to just compare the values by their byte representation here:

|index| value.as_bytes() == self.storage.get(*index).as_bytes()

I will check what effect that has on performance.

@alamb
Contributor

alamb commented Jan 8, 2025

Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

@etseidl (Contributor) left a comment

Nice catch! This seems like an elegant solution. Thanks!

@adamreeve
Contributor Author

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

Sure, I've made #6955 to just add the benchmark and test; I'll make this PR a draft and rebase it once that is merged.

@adamreeve adamreeve marked this pull request as draft January 8, 2025 21:03
@alamb
Contributor

alamb commented Jan 8, 2025

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?
>
> Sure, I've made #6955 to just add the benchmark and test; I'll make this PR a draft and rebase it once that is merged.

Thanks! I have merged #6955 -- I'll run the benchmarks when this one is rebased.

Thanks again @adamreeve

@github-actions bot removed the arrow (Changes to the arrow crate) label on Jan 8, 2025
@adamreeve
Contributor Author

OK, I've rebased this now and switched to comparing byte representations for all types rather than needing a new trait, which is a simpler solution and is consistent with how values are hashed. The performance was very similar between the two approaches on my machine.

@adamreeve adamreeve marked this pull request as ready for review January 8, 2025 23:42
@alamb
Contributor

alamb commented Jan 10, 2025

Running benchmarks now

@alamb alamb changed the title [Parquet] Fix slow dictionary encoding of NaN float values [Parquet] Improve speed of dictionary encoding NaN float values Jan 10, 2025
@alamb (Contributor) left a comment

Thank you @adamreeve

I think technically speaking this could be a behavior change as previously different NaN representations would be collapsed into a single dictionary entry, but now they will use different entries.

This actually seems like a better behavior to me as it means the Parquet writer will produce exactly what went in (i.e. it won't normalize the NaN representation).

I also ran the newly added benchmark and verified this approach appears to be much faster for NaNs (7.5x faster):

++ critcmp main nan-dict-encode
group                                                main                                   nan-dict-encode
-----                                                ----                                   ---------------
write_batch primitive/4096 values float with NaNs    7.53      8.3±0.09ms     6.6 MB/sec    1.00  1108.4±40.39µs    49.6 MB/sec

Thanks again

@alamb
Contributor

alamb commented Jan 10, 2025

FYI @etseidl

@tustvold tustvold merged commit 7aecc3f into apache:main Jan 11, 2025
17 checks passed
CurtHagenlocher pushed a commit to CurtHagenlocher/arrow-rs that referenced this pull request Jan 13, 2025
[Parquet] Improve speed of dictionary encoding NaN float values (apache#6953)

* Treat NaNs equal to NaN when interning for dictionary encoding

* Compare all values by bytes rather than adding Intern trait
@adamreeve
Contributor Author

> I think technically speaking this could be a behavior change as previously different NaN representations would be collapsed into a single dictionary entry, but now they will use different entries.

I don't think this is true: since the NaN == NaN test is false, all NaN values would already have been given separate entries in the dictionary, so the behaviour shouldn't have changed. It does differ from C++ Arrow, though, which collapses all NaNs to a single entry.

@adamreeve adamreeve deleted the nan-dict-encode branch January 13, 2025 08:49
Labels
parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues:
  • Writing floating point values containing NaN to Parquet is slow when using dictionary encoding (#6952)

4 participants