Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fixed page data truncation in parquet writer under certain conditions. (
rapidsai#15474) Fixes rapidsai#15473 The issue is that in some cases, for example where we have all nulls, we can fail to update the size of the page output buffer, resulting in a missing byte expected by some readers. Specifically, we poke the value of dict_bits into the output buffer here: https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1892 But, if we have no leaf values (for example, because everything in the page is null) `s->cur` never gets updated here, because we never enter the containing loop. https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1948 The fix is to just always update `s->cur` after this if-else block https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1891 Note that this was already handled by our reader. But some third party readers (Trino) are expecting that data to be there and crash if it's not. Authors: - https://github.com/nvdbaranec Approvers: - Nghia Truong (https://github.com/ttnghia) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: rapidsai#15474
- Loading branch information