[Parquet] Improve speed of dictionary encoding NaN float values #6953
Benchmark results from the new benchmarks before changing the interning behaviour:
This shows that writing with 50% NaN values is much slower than with no NaNs. After the change, performance with NaNs is very similar to without NaNs:
(I removed the |
parquet/src/util/interner.rs
Outdated
@@ -66,7 +70,7 @@ impl<S: Storage> Interner<S> {
             .dedup
             .entry(
                 hash,
-                |index| value == self.storage.get(*index),
+                |index| value.eq(self.storage.get(*index)),
Hmm, after opening this PR I realised a simpler approach would be to just compare the values by their byte representation here:
|index| value.as_bytes() == self.storage.get(*index).as_bytes()
I will check what effect that has on performance.
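A minimal sketch of why the byte comparison helps, using only the standard library. The `bytes_eq` helper is hypothetical, and `to_le_bytes()` stands in for the `as_bytes()` call in the suggestion above; this is not the Parquet crate's code.

```rust
/// Hypothetical helper: compare two f32 values by their byte representation.
/// `to_le_bytes` is a std stand-in for the crate's `as_bytes()`.
fn bytes_eq(a: f32, b: f32) -> bool {
    a.to_le_bytes() == b.to_le_bytes()
}

fn main() {
    let a = f32::NAN;
    let b = f32::NAN;

    // Under IEEE 754 semantics, a NaN never compares equal with `==`,
    // even against a NaN with the same bit pattern.
    assert!(a != b);

    // The same bit pattern compares equal once viewed as bytes.
    assert!(bytes_eq(a, b));

    // Ordinary values behave the same under both comparisons.
    assert!(bytes_eq(1.5, 1.5));
    assert!(!bytes_eq(1.5, 2.5));
}
```

With `==`, a lookup for a NaN can never match an existing entry, which is consistent with the slowdown the benchmark shows for NaN-heavy input.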
Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?
Nice catch! This seems like an elegant solution. Thanks!
Sure, I've made #6955 to add just the benchmark and test. I'll make this PR a draft and rebase it once that is merged.
Thanks! I have merged #6955 -- I'll run the benchmarks when this one is rebased. Thanks again @adamreeve
38d8581 to ef980a7
OK, I've rebased this now and switched to comparing byte representations for all types rather than needing a new trait, which is a simpler solution and is consistent with how values are hashed. The performance was very similar between the two approaches on my machine.
Running benchmarks now
Thank you @adamreeve
I think, technically speaking, this could be a behavior change: previously, different NaN representations would be collapsed into a single dictionary entry, but now they will use different entries.
This actually seems like better behavior to me, as it means the Parquet writer will produce exactly what went in (i.e. it won't normalize the NaN representation).
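To illustrate what "different NaN representations" means under byte comparison, here is a small standard-library sketch. The `distinct_bit_patterns` helper and the two payload values are chosen for illustration only and are not taken from the PR.

```rust
use std::collections::HashSet;

/// Hypothetical helper: count how many distinct bit patterns appear in `values`.
/// Under byte-wise interning, each distinct pattern would need its own entry.
fn distinct_bit_patterns(values: &[f32]) -> usize {
    values.iter().map(|v| v.to_bits()).collect::<HashSet<u32>>().len()
}

fn main() {
    // Two quiet NaNs that differ only in their payload bits.
    let nan_a = f32::from_bits(0x7fc0_0000);
    let nan_b = f32::from_bits(0x7fc0_0001);
    assert!(nan_a.is_nan() && nan_b.is_nan());

    // Both are NaN, but their byte representations differ.
    assert_ne!(nan_a.to_bits(), nan_b.to_bits());

    // Repeats of the same pattern collapse; different payloads do not.
    assert_eq!(distinct_bit_patterns(&[nan_a, nan_b, nan_a]), 2);
}
```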
I also ran the newly added benchmark and verified this approach appears to be much faster for NaNs (7.5x faster)
++ critcmp main nan-dict-encode
group                                                main                            nan-dict-encode
-----                                                ----                            ---------------
write_batch primitive/4096 values float with NaNs    7.53  8.3±0.09ms  6.6 MB/sec    1.00  1108.4±40.39µs  49.6 MB/sec
++ popd
~/datafusion-benchmarking
Thanks again
FYI @etseidl
…he#6953) * Treat NaNs equal to NaN when interning for dictionary encoding * Compare all values by bytes rather than adding Intern trait
I don't think this is true as the |
Which issue does this PR close?
Closes #6952
Rationale for this change
This treats NaNs as equal to other NaNs of the same type for the purpose of dictionary encoding them when writing f32 or f64 Parquet physical values.
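The effect on the dictionary can be sketched with a deliberately simplified interner. `intern_all` is a hypothetical linear-scan stand-in, not the hash-based `Interner` in `parquet/src/util/interner.rs`; it only shows how the choice of equality function changes the number of dictionary entries.

```rust
/// Hypothetical, simplified interner: returns the dictionary and the
/// dictionary index assigned to each input value. Uses a linear scan
/// instead of the hash table the real implementation relies on.
fn intern_all(values: &[f32], eq: impl Fn(f32, f32) -> bool) -> (Vec<f32>, Vec<usize>) {
    let mut dict: Vec<f32> = Vec::new();
    let mut indices = Vec::with_capacity(values.len());
    for &v in values {
        let idx = match dict.iter().position(|&d| eq(v, d)) {
            Some(i) => i,
            None => {
                // No existing entry matched, so add a new dictionary entry.
                dict.push(v);
                dict.len() - 1
            }
        };
        indices.push(idx);
    }
    (dict, indices)
}

fn main() {
    let values = [1.0_f32, f32::NAN, f32::NAN, f32::NAN];

    // With `==`, no NaN ever matches, so every NaN becomes a new entry.
    let (dict, _) = intern_all(&values, |a, b| a == b);
    assert_eq!(dict.len(), 4);

    // Comparing bit patterns lets identical NaNs share one entry.
    let (dict, indices) = intern_all(&values, |a, b| a.to_bits() == b.to_bits());
    assert_eq!(dict.len(), 2);
    assert_eq!(indices, vec![0, 1, 1, 1]);
}
```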
What changes are included in this PR?
An `Intern` trait to define equality behaviour for interning, replacing the use of `PartialEq`.
Are there any user-facing changes?