GH-43758: [C++] Compute: More comment in RowEncoder #43763

Merged (15 commits) on Sep 2, 2024

Member

@mapleFU mapleFU commented Aug 19, 2024

Rationale for this change

Some comments for RowEncoder

What changes are included in this PR?

Some comments for RowEncoder

Are these changes tested?

Covered by existing

Are there any user-facing changes?

no

@mapleFU mapleFU changed the title from "GH-43758: [C++] Compute: More comment in RowTable" to "GH-43758: [C++] Compute: More comment in RowEncoder" Aug 21, 2024
@mapleFU mapleFU marked this pull request as ready for review August 21, 2024 14:53
@mapleFU mapleFU requested a review from zanmato1984 August 28, 2024 09:24
Member Author

mapleFU commented Aug 28, 2024

@zanmato1984 @pitrou I've added some docs for row_encoder_internal.h. I'm not good at doc writing (but I've tried); would you mind taking a look?

Contributor

@zanmato1984 zanmato1984 left a comment

Some nits.

cpp/src/arrow/compute/row/row_encoder_internal.h (7 outdated review threads, resolved)
@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Aug 30, 2024
Contributor

@zanmato1984 zanmato1984 left a comment

Some more nits.

cpp/src/arrow/compute/row/row_encoder_internal.h (3 outdated review threads, resolved)
Comment on lines 281 to 282
/// 3. The "variable width" encoding for the column, it would exists only
/// for non-null string/binary columns.
Contributor

This description is misleading. If by "variable width" encoding you mean the length + var-length-bytes, then the length always occupies sizeof(Offset) bytes, even for null.

Member Author

Changed to "variable payload".

Contributor

@zanmato1984 zanmato1984 left a comment

Some more nits.

// within the bytes_ vector. This allows for quick access to individual rows.
//
// The size would be num_rows + 1 if not empty, the last element is the total
// length of the bytes_ vector.
Member Author

Unrelated to this issue: I'm thinking of an optimization here. We could define a flag indicating that all the columns are fixed-size or null. If it is set, we would not need to maintain the offsets; we could statically compute a fixed row size and use it to seek to a row.
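
To make the idea concrete, here is a minimal sketch of the two lookup strategies. It is not Arrow's implementation: `EncodedRows`, `RowByOffsets`, `RowByFixedSize`, and `fixed_row_size` are hypothetical names, and the sketch assumes the caller sets `fixed_row_size` only when every encoded row really has that size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the encoder's storage; not Arrow's actual class.
struct EncodedRows {
  std::vector<uint8_t> bytes;    // all encoded rows, stored back to back
  std::vector<int32_t> offsets;  // num_rows + 1 entries when non-empty
  int32_t fixed_row_size = -1;   // assumption: >= 0 only if all rows share one size

  // Append one encoded row and keep the "num_rows + 1 offsets" invariant.
  void Append(std::string_view row) {
    if (offsets.empty()) offsets.push_back(0);
    bytes.insert(bytes.end(), row.begin(), row.end());
    offsets.push_back(static_cast<int32_t>(bytes.size()));
  }

  int64_t num_rows() const {
    return offsets.empty() ? 0 : static_cast<int64_t>(offsets.size()) - 1;
  }

  // General path: read the row boundaries from the offsets vector.
  std::string_view RowByOffsets(int32_t i) const {
    const char* base = reinterpret_cast<const char*>(bytes.data());
    return {base + offsets[i], static_cast<size_t>(offsets[i + 1] - offsets[i])};
  }

  // Proposed fast path: with a known fixed row size, seek by multiplication
  // and skip the offsets vector entirely.
  std::string_view RowByFixedSize(int32_t i) const {
    const char* base = reinterpret_cast<const char*>(bytes.data());
    return {base + static_cast<size_t>(i) * static_cast<size_t>(fixed_row_size),
            static_cast<size_t>(fixed_row_size)};
  }
};
```

When the flag holds, both paths return the same bytes, so the offsets vector carries no extra information and would not need to be stored.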

@@ -259,6 +334,9 @@ class ARROW_EXPORT RowEncoder {
Status EncodeAndAppend(const ExecSpan& batch);
Result<ExecBatch> Decode(int64_t num_rows, const int32_t* row_ids);

// Returns the encoded representation of the row at index i.
// If i is kRowIdForNulls, it returns the pre-encoded all-nulls
// row.
Member Author

Another optimization might be std::string_view unsafe_encoded_row(int32_t i), which does not copy the row. When the std::string cannot use SSO, this would benefit from fewer heap allocations.
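
A rough sketch of the difference between the two accessors follows. `RowStore` is a hypothetical miniature written for illustration, not the real RowEncoder, and it leaves out the kRowIdForNulls case. The name `unsafe_encoded_row` is taken from the suggestion above and is not an existing Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical miniature of the encoder's row storage, for illustration only.
class RowStore {
 public:
  void Append(std::string_view row) {
    if (offsets_.empty()) offsets_.push_back(0);
    bytes_.append(row);
    offsets_.push_back(static_cast<int32_t>(bytes_.size()));
  }

  // Copying accessor: always safe to keep around, but it heap-allocates
  // whenever the row is too long for the small-string optimization.
  std::string encoded_row(int32_t i) const {
    return std::string(unsafe_encoded_row(i));
  }

  // Non-owning accessor: no copy and no allocation. The view is only
  // valid until the next Append(), which may reallocate bytes_.
  std::string_view unsafe_encoded_row(int32_t i) const {
    return std::string_view(bytes_).substr(offsets_[i],
                                           offsets_[i + 1] - offsets_[i]);
  }

 private:
  std::string bytes_;
  std::vector<int32_t> offsets_;
};
```

The "unsafe" prefix reflects the lifetime caveat: a caller that stores the view past the next append would read dangling memory, which is the trade-off for avoiding the copy.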

Member

Can you open a separate PR for that?

Member Author

> Can you open a separate PR for that?

Will do after this is merged.

cpp/src/arrow/compute/row/row_encoder_internal.h (2 outdated review threads, resolved)
cpp/src/arrow/compute/row/row_encoder_internal.cc (1 outdated review thread, resolved)
/// The row format is composed of the the KeyColumn encodings for each,
/// and the column is encoded as follows:
/// 1. A null byte for each column, indicating whether the column is null.
/// "1" for null, "0" for non-null.
/// 2. The "fixed width" encoding for the column, it would exist whether
/// the column is null or not.
- /// 3. The "variable width" encoding for the column, it would exists only
+ /// 3. The "variable payload" encoding for the column, it would exists only
Contributor

Does it include the "length" part of the var-length column (which exists whether or not the column is null)?

Member Author

I added a description below.

@mapleFU mapleFU force-pushed the more-comment-in-compute-row branch from cba86df to 8f3a8f6 Compare August 31, 2024 16:27
/// [null byte, variable-byte length, variable bytes]. For example:
///
/// String "abc" Would be encoded as:
/// [0 0 0 0 3 'a' 'b' 'c']
Member

Wouldn't this be [0 3 0 0 0 'a' 'b' 'c'] on little-endian platforms?

Member Author

Nice catch. How about:

0 (1 byte for not null) + 3 (4 bytes for length) + "abc" (payload)

Member

That would work too!

Member Author

done
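
For readers following the thread, the layout agreed on here can be sketched as below. This is an illustration rather than the encoder's actual code: `EncodeVarLength` is a hypothetical helper, a 32-bit length is assumed, and the length is written in native byte order. That last point is why the bytes read [0 3 0 0 0 'a' 'b' 'c'] on little-endian platforms and why the comment was reworded to avoid spelling out individual length bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

// Illustrative only: encode one var-length value as
//   [null byte][4-byte length, native byte order][payload bytes].
// Assumes a 32-bit length; the helper name is hypothetical.
std::vector<uint8_t> EncodeVarLength(std::optional<std::string_view> value) {
  std::vector<uint8_t> out;
  // 1 marks a null value, 0 a non-null value.
  out.push_back(static_cast<uint8_t>(value.has_value() ? 0 : 1));

  // The length field is always present, even for null (where it is 0).
  const int32_t length = value ? static_cast<int32_t>(value->size()) : 0;
  uint8_t length_bytes[sizeof(int32_t)];
  std::memcpy(length_bytes, &length, sizeof(length));  // native endianness
  out.insert(out.end(), length_bytes, length_bytes + sizeof(length));

  // The variable payload exists only for non-null values.
  if (value) out.insert(out.end(), value->begin(), value->end());
  return out;
}
```

Under these assumptions "abc" takes 1 + 4 + 3 = 8 bytes and a null value takes 1 + 4 = 5 bytes, which matches the earlier review remark that the length occupies sizeof(Offset) bytes even for null.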

@mapleFU mapleFU force-pushed the more-comment-in-compute-row branch from c470800 to 34ec45f Compare September 2, 2024 14:17
@mapleFU mapleFU requested a review from pitrou September 2, 2024 14:18
Member Author

mapleFU commented Sep 2, 2024

The failed CI checks are unrelated.

@mapleFU mapleFU merged commit 44d3f76 into apache:main Sep 2, 2024
37 of 40 checks passed
@mapleFU mapleFU removed the "awaiting committer review" label Sep 2, 2024
@mapleFU mapleFU deleted the more-comment-in-compute-row branch September 2, 2024 15:06

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44d3f76.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

mapleFU added a commit to mapleFU/arrow that referenced this pull request Sep 3, 2024
### Rationale for this change

Some comments for RowEncoder

### What changes are included in this PR?

Some comments for RowEncoder

### Are these changes tested?

Covered by existing

### Are there any user-facing changes?

no

* GitHub Issue: apache#43758

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: mwish <[email protected]>
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
3 participants