
[Parquet] Improve speed of dictionary encoding NaN float values #6953

Merged
merged 2 commits into apache:main from adamreeve:nan-dict-encode on Jan 11, 2025

Conversation

adamreeve
Contributor

Which issue does this PR close?

Closes #6952

Rationale for this change

This treats NaNs as equal to other NaNs of the same type for the purpose of dictionary encoding them when writing f32 or f64 Parquet physical values.

What changes are included in this PR?

  • Introduces a new Intern trait to define equality behaviour for interning, replacing the use of PartialEq.
  • Adds a benchmark for writing floating point values with NaNs to Parquet.

Are there any user-facing changes?

  • Users should see improved performance when writing floating point data with many NaNs to Parquet.

@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Jan 8, 2025
@adamreeve
Contributor Author

Benchmark results from the new benchmarks before changing the interning behaviour:

write_batch primitive/4096 values float with NaNs
                        time:   [5.6968 ms 5.7060 ms 5.7141 ms]
                        thrpt:  [9.6186 MiB/s 9.6324 MiB/s 9.6479 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
write_batch primitive/4096 values float with no NaNs
                        time:   [383.44 µs 383.65 µs 383.85 µs]
                        thrpt:  [143.18 MiB/s 143.26 MiB/s 143.34 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

This shows that writing with 50% NaN values is much slower than with no NaNs.

After the change, performance with NaNs is very similar to without NaNs:

write_batch primitive/4096 values float with NaNs
                        time:   [406.40 µs 406.63 µs 406.88 µs]
                        thrpt:  [135.08 MiB/s 135.16 MiB/s 135.24 MiB/s]
                 change:
                        time:   [-92.875% -92.861% -92.845%] (p = 0.00 < 0.05)
                        thrpt:  [+1297.6% +1300.7% +1303.5%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values float with no NaNs
                        time:   [382.52 µs 384.16 µs 385.50 µs]
                        thrpt:  [142.58 MiB/s 143.07 MiB/s 143.68 MiB/s]
                 change:
                        time:   [+0.1803% +0.3520% +0.5192%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5165% -0.3507% -0.1799%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

(I removed the 4096 values float with no NaNs benchmark from this PR after running these benchmarks as I don't think there's a lot of value in keeping it)

@@ -66,7 +70,7 @@ impl<S: Storage> Interner<S> {
         .dedup
         .entry(
             hash,
-            |index| value == self.storage.get(*index),
+            |index| value.eq(self.storage.get(*index)),
@adamreeve (Contributor Author) commented on the diff

Hmm, after opening this PR I realised a simpler approach would be to just compare the values by their byte representation here:

|index| value.as_bytes() == self.storage.get(*index).as_bytes()

I will check what effect that has on performance.

@alamb
Contributor

alamb commented Jan 8, 2025

Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

@etseidl (Contributor) left a comment

Nice catch! This seems like an elegant solution. Thanks!

@adamreeve
Contributor Author

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

Sure, I've made #6955 to just add the benchmark and test; I'll make this PR a draft and rebase it once that is merged.

@adamreeve adamreeve marked this pull request as draft January 8, 2025 21:03
@alamb
Contributor

alamb commented Jan 8, 2025

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?
>
> Sure, I've made #6955 to just add the benchmark and test; I'll make this PR a draft and rebase it once that is merged.

Thanks! I have merged #6955 -- I'll run the benchmarks when this one is rebased.

Thanks again @adamreeve

@github-actions bot removed the arrow (Changes to the arrow crate) label on Jan 8, 2025
@adamreeve
Contributor Author

OK, I've rebased this now and switched to comparing byte representations for all types rather than needing a new trait, which is a simpler solution and is consistent with how values are hashed. The performance was very similar between the two approaches on my machine.

@adamreeve adamreeve marked this pull request as ready for review January 8, 2025 23:42
@alamb
Contributor

alamb commented Jan 10, 2025

Running benchmarks now

@alamb alamb changed the title [Parquet] Fix slow dictionary encoding of NaN float values [Parquet] Improve speed of dictionary encoding NaN float values Jan 10, 2025
@alamb (Contributor) left a comment

Thank you @adamreeve

I think technically speaking this could be a behavior change as previously different NaN representations would be collapsed into a single dictionary entry, but now they will use different entries.

This actually seems like a better behavior to me as it means the Parquet writer will produce exactly what went in (i.e. it won't normalize the NaN representation).

I also ran the newly added benchmark and verified this approach appears to be much faster for NaNs (7.5x faster):

++ critcmp main nan-dict-encode
group                                                main                                   nan-dict-encode
-----                                                ----                                   ---------------
write_batch primitive/4096 values float with NaNs    7.53      8.3±0.09ms     6.6 MB/sec    1.00  1108.4±40.39µs    49.6 MB/sec

Thanks again

@alamb
Contributor

alamb commented Jan 10, 2025

FYI @etseidl

@tustvold tustvold merged commit 7aecc3f into apache:main Jan 11, 2025
17 checks passed
CurtHagenlocher pushed a commit to CurtHagenlocher/arrow-rs that referenced this pull request Jan 13, 2025
[Parquet] Improve speed of dictionary encoding NaN float values (apache#6953)

* Treat NaNs equal to NaN when interning for dictionary encoding

* Compare all values by bytes rather than adding Intern trait
@adamreeve
Contributor Author

> I think technically speaking this could be a behavior change as previously different NaN representations would be collapsed into a single dictionary entry, but now they will use different entries.

I don't think this is true: since the NaN == NaN test is false, all NaN values would already have been given separate entries in the dictionary, so the behaviour shouldn't have changed. It does differ from C++ Arrow, though, which collapses all NaNs to a single entry.

@adamreeve adamreeve deleted the nan-dict-encode branch January 13, 2025 08:49
Labels
parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues:
  • Writing floating point values containing NaN to Parquet is slow when using dictionary encoding (#6952)

4 participants