Fixed page data truncation in parquet writer under certain conditions. (rapidsai#15474)

Fixes rapidsai#15473

The issue is that in some cases (for example, when a page contains only nulls) we fail to update the size of the page output buffer, so the page is written one byte short of what some readers expect. Specifically, we write the value of `dict_bits` into the output buffer here:

https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1892

However, if we have no leaf values (for example, because everything in the page is null), `s->cur` never gets updated here, because we never enter the containing loop:

https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1948

The fix is to always update `s->cur` after this if-else block:

https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1891
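
The failure mode can be illustrated with a small host-side model. This is a simplified sketch, not the actual kernel: `PageState`, `encode_dict_page`, and the `apply_fix` flag are hypothetical names introduced for illustration, the "encoding" just writes one placeholder byte per value, and the sketch assumes `cur` starts at the beginning of the page data. Only `cur`, `rle_out`, and `dict_bits` correspond to names in the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the encoder's page state.
struct PageState {
  uint8_t* rle_out = nullptr;  // where the next encoded byte is written
  uint8_t* cur     = nullptr;  // end of the bytes counted as page data
};

// Models dictionary-page setup followed by the value loop and returns the
// page data size in bytes. `apply_fix` toggles the one-line change.
std::size_t encode_dict_page(std::vector<uint8_t>& buf,
                             uint8_t dict_bits,
                             int num_leaf_values,
                             bool apply_fix)
{
  // The buffer must hold the bit-width byte plus one byte per value.
  assert(buf.size() >= 1 + static_cast<std::size_t>(num_leaf_values));

  uint8_t* dst = buf.data();
  PageState s;
  s.cur = dst;  // assumption: `cur` initially points at the start of the page data

  dst[0]    = dict_bits;  // the bit-width byte is written unconditionally
  s.rle_out = dst + 1;
  if (apply_fix) { s.cur = s.rle_out; }  // account for that byte right away

  for (int i = 0; i < num_leaf_values; ++i) {
    *s.rle_out++ = 0;      // placeholder for real RLE output
    s.cur = s.rle_out;     // without the fix, the only place `cur` advances
  }
  return static_cast<std::size_t>(s.cur - dst);
}
```

With zero leaf values, the model reports a page size of 0 without the fix even though one byte was written, and 1 with the fix. When at least one value is encoded, both variants agree, which is why the problem only appears for pages with no leaf values.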

Note that our reader already handled this case. However, some third-party readers (Trino) expect that byte to be present and crash if it is missing.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: rapidsai#15474
nvdbaranec authored Apr 9, 2024
1 parent 72b2759 commit 3b48f8b
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions cpp/src/io/parquet/page_enc.cu
@@ -1896,6 +1896,7 @@ CUDF_KERNEL void __launch_bounds__(block_size, 8)
       s->rle_out = dst + RLE_LENGTH_FIELD_LEN;
       s->rle_len_pos = dst;
     }
+    s->cur = s->rle_out;
     s->page_start_val = row_to_value_idx(s->page.start_row, s->col);
     s->chunk_start_val = row_to_value_idx(s->ck.start_row, s->col);
   }
