GH-45185: [C++][Parquet] Raise an error for invalid repetition levels when delimiting records #45186

Open
wants to merge 5 commits into
base: main

adamreeve
Contributor

@adamreeve adamreeve commented Jan 7, 2025

Rationale for this change

See #45185. Invalid repetition levels would previously only cause a fatal error in debug builds.
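For readers unfamiliar with why the first repetition level matters: in Parquet, a repetition level of 0 marks the start of a new record, so a level stream that begins with a non-zero value has leading values that belong to no record. A minimal sketch of that rule, assuming a hypothetical helper `RecordStarts` (not part of Arrow or this PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Arrow code: returns the index of every level
// that starts a new record, i.e. every position whose repetition level is 0.
std::vector<std::size_t> RecordStarts(const std::vector<int16_t>& rep_levels) {
  std::vector<std::size_t> starts;
  for (std::size_t i = 0; i < rep_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      starts.push_back(i);
    }
  }
  return starts;
}
```

With levels `{0, 1, 1, 0, 0, 1}` this yields starts at indices 0, 3 and 4. With `{1, 1, 0}` the only start is index 2, and the first two values are orphaned, which is the kind of input the reader needs to reject.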

What changes are included in this PR?

Replaces an existing ARROW_DCHECK_EQ on the repetition level with a check that raises an exception in release builds too.
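To illustrate the difference between a debug-only check and one that is always active, here is a rough sketch using the standard `assert` macro as a stand-in for a debug check and `std::runtime_error` as a stand-in for the exception type. Both function names are hypothetical and this is not the actual Arrow implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Debug-only check: assert() expands to nothing when NDEBUG is defined,
// so a release build silently accepts an invalid level.
void CheckRecordStartDebugOnly(int16_t rep_level) {
  assert(rep_level == 0);
  (void)rep_level;  // silences the unused-parameter warning under NDEBUG
}

// Always-on check: throws in every build type.
// std::runtime_error is an assumption standing in for the real exception type.
void CheckRecordStartAlways(int16_t rep_level) {
  if (rep_level != 0) {
    throw std::runtime_error(
        "The repetition level at the start of a record must be 0 but got " +
        std::to_string(rep_level));
  }
}
```

The second form costs one comparison per record start, but turns silent misbehaviour in release builds into a reportable error.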

Are these changes tested?

Yes, using a new example file (apache/parquet-testing#67)

Are there any user-facing changes?

Yes, reading columns with invalid repetition levels as Arrow arrays will now raise an exception.

@raulcd
Member

raulcd commented Jan 7, 2025

Note for others taking a look: the tests won't pass until we merge the parquet-testing PR.

@adamreeve adamreeve force-pushed the validate-repetition branch from be347da to 22794e2 Compare January 20, 2025 03:21
@adamreeve adamreeve marked this pull request as ready for review January 20, 2025 10:03
@adamreeve adamreeve requested a review from wgtmac as a code owner January 20, 2025 10:03
@adamreeve
Contributor Author

The failing tests all look to be caused by #45305 rather than this change.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jan 20, 2025
Member

@mapleFU mapleFU left a comment

I've checked the code, and checking the rep-levels here is OK, since checking them anywhere else is nearly impossible 😂

TEST(TestArrowReaderAdHoc, InvalidRepetitionLevels) {
  // GH-45185 - Repetition levels start with 1 instead of 0
  auto path = test::get_data_file("ARROW-GH-45185.parquet", /*is_good=*/false);
  TryReadDataFile(path, ::arrow::StatusCode::IOError);
Member

Would you mind also checking that the status message is "The repetition level at the start of a record must be 0 but got ..."?

Contributor Author

👍 done
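As a rough illustration of the request in this thread, a test can verify the error message as well as the status code. The sketch below is self-contained and does not use the Arrow or GoogleTest APIs: `SketchStatus`, `SketchCode`, `FakeReadWithFirstLevel` and `HasCodeAndMessage` are all hypothetical names, and the fake reader only mimics the message quoted above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, simplified status type; not arrow::Status.
enum class SketchCode { OK, IOError };

struct SketchStatus {
  SketchCode code;
  std::string message;
};

// Hypothetical stand-in for reading a file whose first repetition level is
// `first_level`; it only reproduces the shape of the expected error.
SketchStatus FakeReadWithFirstLevel(int16_t first_level) {
  if (first_level != 0) {
    return {SketchCode::IOError,
            "The repetition level at the start of a record must be 0 but got " +
                std::to_string(first_level)};
  }
  return {SketchCode::OK, ""};
}

// True when the status has the expected code and its message contains
// the expected fragment.
bool HasCodeAndMessage(const SketchStatus& status, SketchCode code,
                       const std::string& fragment) {
  return status.code == code &&
         status.message.find(fragment) != std::string::npos;
}
```

Matching on a message fragment rather than the full string keeps such a test stable if the trailing value or surrounding context in the message changes.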

@@ -1611,7 +1611,12 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
       // another record start or exhausting the ColumnChunk
       int64_t level = levels_position_;
       if (at_record_start_) {
-        ARROW_DCHECK_EQ(0, rep_levels[levels_position_]);
+        if (rep_levels[levels_position_] != 0) {
Member

Can we use ARROW_PREDICT_FALSE to check it?

Contributor Author

Yes, fixed
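For context on the suggestion above, branch-prediction hint macros of this kind are commonly built on the GCC/Clang builtin `__builtin_expect`. The following is a minimal sketch under that assumption; `SKETCH_PREDICT_FALSE` and `ValidateRecordStart` are hypothetical names, and this is not a copy of Arrow's macro definition:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical hint macro: tells the compiler the condition is expected to be
// false, so the error path can be laid out away from the hot path.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_PREDICT_FALSE(x) (x)  // no-op fallback on other compilers
#endif

// Validation on a hot path: the check always runs, but the hint marks the
// failure branch as unlikely.
void ValidateRecordStart(int16_t rep_level) {
  if (SKETCH_PREDICT_FALSE(rep_level != 0)) {
    throw std::invalid_argument("repetition level at record start must be 0");
  }
}
```

The hint does not change behaviour; it only affects code layout, which is why it suits checks that run for every record but almost never fail.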

3 participants