[FEA] Remove redundant hash combine step for single-column row hasher #17687
Comments
If I am remembering correctly, we can solve this by disentangling the MurmurHash3_x86_32 kernel (a user-facing hash algorithm that should align with external implementations of MurmurHash3_x86_32) from our internal uses of that hash algorithm. In other words, we need to replace the current cudf/cpp/src/hash/murmurhash3_x86_32.cu Lines 53 to 56 in caf97ef
cudf/cpp/src/hash/xxhash_64.cu Lines 52 to 62 in caf97ef
|
I'm working on a PR for this. |
I found that using the same construction as |
Using a row hasher here centralizes the handling of hash combine logic, simplifying the tracking and maintenance of similar issues in the future. Additionally, updating the row hasher eliminates the extra hash combine step in row hashing and ensures consistent behavior between the row hasher and column hash APIs. |
The main potential difference here is that the row hasher is meant to be internal and the column hash API is meant to be external. We should be able to have a boundary in our API so that no external API observes or needs to know how the internal row hasher works. That gives us the freedom to improve our internal hashing for performance, reduced collisions, etc. without external effects. From a different perspective, multiple libraries should be able to agree on what the MurmurHash3_x86_32 result is for a given message of bytes. Our “hash combine” is really only meant for internal use, and we extend (abuse?) it with a custom implementation for lists and structs. What is the binary representation of a list or struct? That question has no well-defined answer that all libraries would agree on, and thus most libraries do not implement hashes of lists or structs in their external APIs. Only types such as integer, float, and string, for which a common binary layout can be agreed on, typically have a public hash implementation. Of course, hashes of lists/structs are still needed internally for hashmaps and the like. I would agree that it is simpler to retain a single implementation, but we may want to rethink how much of the row hasher we expose externally. |
Is your feature request related to a problem? Please describe.
cudf/cpp/include/cudf/table/experimental/row_operators.cuh
Lines 1881 to 1884 in 34e2045
The current row hasher includes a hash combine step to improve hash quality for multi-column data. However, this step is unnecessary when only a single column is present. I would like libcudf to skip the hash combine step for single-column data.
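To make the redundancy concrete, here is a minimal host-side C++ sketch, not libcudf code. `element_hash` is a placeholder mixer (not MurmurHash3), `hash_combine` is a Boost-style combine assumed for illustration, and `row_hash` is a hypothetical fold over one row's values starting from the seed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using hash_value = std::uint32_t;

// Placeholder per-element hash (an assumption, NOT MurmurHash3_x86_32).
hash_value element_hash(std::uint32_t x) {
  x ^= x >> 16;
  x *= 0x85ebca6bu;
  x ^= x >> 13;
  x *= 0xc2b2ae35u;
  x ^= x >> 16;
  return x;
}

// Boost-style hash combine, assumed here for illustration.
hash_value hash_combine(hash_value lhs, hash_value rhs) {
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Hypothetical row hash: fold every column value of one row into the seed.
hash_value row_hash(std::vector<std::uint32_t> const& row, hash_value seed) {
  assert(!row.empty());
  hash_value h = seed;
  for (auto v : row) { h = hash_combine(h, element_hash(v)); }
  return h;
}
```

Under these assumptions, a one-value row still goes through one `hash_combine` call, so `row_hash({42}, 0)` is `element_hash(42) + 0x9e3779b9` (mod 2^32) rather than `element_hash(42)` itself. The combine adds work without mixing in a second column.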
Describe the solution you'd like
A straightforward approach would be to add an if branch like
if (_table.num_columns() == 1) { // no hash combine code path }
However, branching on the device is less optimal.
Additional context
Corresponding C++ unit tests and pytests need to be updated since the gold reference values are different when removing hash combine.
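One possible way to avoid the per-row device branch mentioned above is to make the single-column case a template parameter and pick the instantiation once on the host. The following is a host-only C++17 sketch under stated assumptions: `row_hasher_sketch`, `hash_rows`, and `hash_rows_impl` are hypothetical names, `element_hash` is a placeholder mixer (not MurmurHash3), and `hash_combine` is a Boost-style combine. It is not the libcudf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using hash_value = std::uint32_t;
using column     = std::vector<std::uint32_t>;

// Placeholder per-element hash (an assumption, NOT MurmurHash3_x86_32).
hash_value element_hash(std::uint32_t x) {
  x ^= x >> 16;
  x *= 0x85ebca6bu;
  x ^= x >> 13;
  x *= 0xc2b2ae35u;
  x ^= x >> 16;
  return x;
}

// Boost-style hash combine, assumed here for illustration.
hash_value hash_combine(hash_value lhs, hash_value rhs) {
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Hypothetical row hasher. SingleColumn is fixed at compile time, so the
// per-row call contains no runtime branch on the column count.
template <bool SingleColumn>
struct row_hasher_sketch {
  std::vector<column> const& columns;  // column-major table
  hash_value seed;                     // only used to start the fold in this sketch

  hash_value operator()(std::size_t row) const {
    if constexpr (SingleColumn) {
      return element_hash(columns[0][row]);  // no hash combine
    } else {
      hash_value h = seed;
      for (auto const& col : columns) { h = hash_combine(h, element_hash(col[row])); }
      return h;
    }
  }
};

template <bool SingleColumn>
std::vector<hash_value> hash_rows_impl(std::vector<column> const& columns, hash_value seed) {
  row_hasher_sketch<SingleColumn> hasher{columns, seed};
  std::vector<hash_value> out(columns[0].size());
  for (std::size_t r = 0; r < out.size(); ++r) { out[r] = hasher(r); }
  return out;
}

// The column-count check happens once here, outside the per-row loop.
std::vector<hash_value> hash_rows(std::vector<column> const& columns, hash_value seed) {
  assert(!columns.empty());
  return columns.size() == 1 ? hash_rows_impl<true>(columns, seed)
                             : hash_rows_impl<false>(columns, seed);
}
```

In a CUDA setting the same idea would amount to choosing which functor or kernel instantiation to launch from host code. The trade-off is an extra template instantiation per call site. Note that the single-column results no longer match the combined values, which is why reference values in tests would change.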