[FEA] Remove redundant hash combine step for single-column row hasher #17687

PointKernel · 2025-01-07T17:43:02Z

Is your feature request related to a problem? Please describe.

cudf/cpp/include/cudf/table/experimental/row_operators.cuh

Lines 1881 to 1884 in 34e2045

    
           // Hash each element and combine all the hash values together 
        
           return detail::accumulate(it, it + _table.num_columns(), _seed, [](auto hash, auto h) { 
        
             return cudf::hashing::detail::hash_combine(hash, h); 
        
           });

The current row hasher includes a hash combine step to improve hash quality for multi-column data. However, this step is unnecessary when only a single column is present. I would like libcudf to skip the hash combine step for single-column data.

Describe the solution you'd like
A straightforward approach would be to add an if branch like if (_table.num_columns() == 1) { // no hash combine code path }. However, branching on the device is less optimal.

Additional context
Corresponding C++ unit tests and pytests need to be updated since the gold reference values are different when removing hash combine.

The text was updated successfully, but these errors were encountered:

bdice · 2025-01-07T18:00:46Z

If am remembering correctly, we can solve this by disentangling the MurmurHash3_x86_32 kernel (a user-facing hash algorithm that should align with external implementations of MurmurHash3_x86_32) from our internal uses of that hash algorithm. In other words, murmurhash3_x86_32.cu should not call the row hasher.

We need to replace the current thrust::tabulate(..., row_hasher.device_hasher<MurmurHash3_x86_32>(...)) here (

cudf/cpp/src/hash/murmurhash3_x86_32.cu

Lines 53 to 56 in caf97ef

    
           thrust::tabulate(rmm::exec_policy(stream), 
        
                            output_view.begin<hash_value_type>(), 
        
                            output_view.end<hash_value_type>(), 
        
                            row_hasher.device_hasher<MurmurHash3_x86_32>(nullable, seed));

) with a style more like XXHash64 (

cudf/cpp/src/hash/xxhash_64.cu

Lines 52 to 62 in caf97ef

    
           __device__ auto operator()(size_type row_index) const noexcept 
        
           { 
        
             return cudf::detail::accumulate( 
        
               _table.begin(), 
        
               _table.end(), 
        
               _seed, 
        
               [row_index, nulls = _check_nulls] __device__(auto hash, auto column) { 
        
                 return cudf::type_dispatcher( 
        
                   column.type(), element_hasher_adapter{}, column, row_index, nulls, hash); 
        
               }); 
        
           }

) if I am remembering correctly.

bdice · 2025-01-08T19:57:03Z

I'm working on a PR for this.

bdice · 2025-01-08T22:01:35Z

I found that using the same construction as xxhash_32 and xxhash_64 breaks list and struct support. I'm trying the other approach (conditional) mentioned in the top of the issue.

PointKernel · 2025-01-08T22:57:49Z

Using a row hasher here centralizes the handling of hash combine logic, simplifying the tracking and maintenance of similar issues in the future. Additionally, updating the row hasher eliminates the extra hash combine step in row hashing and ensures consistent behavior between the row hasher and column hash APIs.

bdice · 2025-01-09T02:16:10Z

The main potential difference here is that the row hasher is meant to be internal and the column hash API is meant to be external. We should be able to have a boundary in our API so that no external APIs will observe or need to know how the internal row hasher works. That gives us the freedom to improve our internal hashing for performance, reduced collisions, etc. without external effects.

From a different perspective, multiple libraries should be able to agree on what the MurmurHash3_x86_32 result is for a given message of bytes. Our “hash combine” is really only meant for internal use, and we (abuse?) extend it with a custom implementation for lists and structs. What is the binary representation of a list or struct? It doesn’t have a well-defined answer that all libraries would agree on, and thus most libraries do not implement hashes of lists or structs in their external APIs. Only types like integer, float, string for which a common binary layout can be agreed on typically have a public hash implementation. Of course, hashes of lists/structs are still needed internally for hashmaps and the like.

I would agree that it is simpler to retain a single implementation, but we may want to rethink how much of the row hasher we expose externally.

PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Jan 7, 2025

PointKernel mentioned this issue Jan 7, 2025

Add XXHash_32 hasher #17533

Merged

3 tasks

bdice self-assigned this Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Remove redundant hash combine step for single-column row hasher #17687

[FEA] Remove redundant hash combine step for single-column row hasher #17687

PointKernel commented Jan 7, 2025

bdice commented Jan 7, 2025

bdice commented Jan 8, 2025

bdice commented Jan 8, 2025

PointKernel commented Jan 8, 2025

bdice commented Jan 9, 2025 •

edited

Loading

[FEA] Remove redundant hash combine step for single-column row hasher #17687

[FEA] Remove redundant hash combine step for single-column row hasher #17687

Comments

PointKernel commented Jan 7, 2025

bdice commented Jan 7, 2025

bdice commented Jan 8, 2025

bdice commented Jan 8, 2025

PointKernel commented Jan 8, 2025

bdice commented Jan 9, 2025 • edited Loading

bdice commented Jan 9, 2025 •

edited

Loading