Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Remove redundant hash combine step for single-column row hasher #17687

Open
PointKernel opened this issue Jan 7, 2025 · 5 comments
Open
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@PointKernel
Copy link
Member

Is your feature request related to a problem? Please describe.

// Hash each element and combine all the hash values together
return detail::accumulate(it, it + _table.num_columns(), _seed, [](auto hash, auto h) {
return cudf::hashing::detail::hash_combine(hash, h);
});

The current row hasher includes a hash combine step to improve hash quality for multi-column data. However, this step is unnecessary when only a single column is present. I would like libcudf to skip the hash combine step for single-column data.

Describe the solution you'd like
A straightforward approach would be to add an if branch like if (_table.num_columns() == 1) { // no hash combine code path }. However, branching on the device is less optimal.

Additional context
Corresponding C++ unit tests and pytests need to be updated since the gold reference values are different when removing hash combine.

@PointKernel PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Jan 7, 2025
@PointKernel PointKernel mentioned this issue Jan 7, 2025
3 tasks
@bdice
Copy link
Contributor

bdice commented Jan 7, 2025

If am remembering correctly, we can solve this by disentangling the MurmurHash3_x86_32 kernel (a user-facing hash algorithm that should align with external implementations of MurmurHash3_x86_32) from our internal uses of that hash algorithm. In other words, murmurhash3_x86_32.cu should not call the row hasher.

We need to replace the current thrust::tabulate(..., row_hasher.device_hasher<MurmurHash3_x86_32>(...)) here (

thrust::tabulate(rmm::exec_policy(stream),
output_view.begin<hash_value_type>(),
output_view.end<hash_value_type>(),
row_hasher.device_hasher<MurmurHash3_x86_32>(nullable, seed));
) with a style more like XXHash64 (
__device__ auto operator()(size_type row_index) const noexcept
{
return cudf::detail::accumulate(
_table.begin(),
_table.end(),
_seed,
[row_index, nulls = _check_nulls] __device__(auto hash, auto column) {
return cudf::type_dispatcher(
column.type(), element_hasher_adapter{}, column, row_index, nulls, hash);
});
}
) if I am remembering correctly.

@bdice
Copy link
Contributor

bdice commented Jan 8, 2025

I'm working on a PR for this.

@bdice bdice self-assigned this Jan 8, 2025
@bdice
Copy link
Contributor

bdice commented Jan 8, 2025

I found that using the same construction as xxhash_32 and xxhash_64 breaks list and struct support. I'm trying the other approach (conditional) mentioned in the top of the issue.

@PointKernel
Copy link
Member Author

Using a row hasher here centralizes the handling of hash combine logic, simplifying the tracking and maintenance of similar issues in the future. Additionally, updating the row hasher eliminates the extra hash combine step in row hashing and ensures consistent behavior between the row hasher and column hash APIs.

@bdice
Copy link
Contributor

bdice commented Jan 9, 2025

The main potential difference here is that the row hasher is meant to be internal and the column hash API is meant to be external. We should be able to have a boundary in our API so that no external APIs will observe or need to know how the internal row hasher works. That gives us the freedom to improve our internal hashing for performance, reduced collisions, etc. without external effects.

From a different perspective, multiple libraries should be able to agree on what the MurmurHash3_x86_32 result is for a given message of bytes. Our “hash combine” is really only meant for internal use, and we (abuse?) extend it with a custom implementation for lists and structs. What is the binary representation of a list or struct? It doesn’t have a well-defined answer that all libraries would agree on, and thus most libraries do not implement hashes of lists or structs in their external APIs. Only types like integer, float, string for which a common binary layout can be agreed on typically have a public hash implementation. Of course, hashes of lists/structs are still needed internally for hashmaps and the like.

I would agree that it is simpler to retain a single implementation, but we may want to rethink how much of the row hasher we expose externally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

2 participants