GH-43758: [C++] Compute: More comment in RowEncoder #43763
@zanmato1984 @pitrou I've added some doc for
Some nits.
Co-authored-by: Rossi Sun <[email protected]>
Some more nits.
/// 3. The "variable width" encoding for the column, it would exists only
/// for non-null string/binary columns.
This description is misleading. If by "variable width" encoding you mean the `length + var-length-bytes`, then the `length` always occupies `sizeof(Offset)` bytes, even for null.
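To make the layout concrete, here is a minimal sketch of the shape being discussed. It is not the actual Arrow implementation: `Offset` is assumed to be `int32_t`, `EncodeVarLength` is a hypothetical helper, and writing a zero length for a null value is an assumption. The only point it illustrates is that the length slot is present for null and non-null values alike, while the payload bytes are not.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the encoder's offset type (assumed 32-bit here).
using Offset = int32_t;

// Appends [null byte][length: sizeof(Offset) bytes][payload] to `out`.
// For a null value the length slot is still written (as 0, an assumption)
// and no payload bytes follow.
void EncodeVarLength(const std::optional<std::string>& value,
                     std::vector<uint8_t>* out) {
  out->push_back(value.has_value() ? 0 : 1);  // "1" for null, "0" for non-null
  const Offset length = value ? static_cast<Offset>(value->size()) : 0;
  uint8_t length_bytes[sizeof(Offset)];
  std::memcpy(length_bytes, &length, sizeof(Offset));  // native byte order
  out->insert(out->end(), length_bytes, length_bytes + sizeof(Offset));
  if (value) {
    out->insert(out->end(), value->begin(), value->end());  // payload only if non-null
  }
}
```

Under these assumptions a null value takes `1 + sizeof(Offset)` bytes and a non-null value takes `1 + sizeof(Offset) + size` bytes.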
Changed to "variable payload".
Some more nits.
// within the bytes_ vector. This allows for quick access to individual rows.
//
// The size would be num_rows + 1 if not empty, the last element is the total
// length of the bytes_ vector.
Unrelated to this issue: I'm thinking of an optimization here. We could define a flag indicating that all the columns are fixed-size or null. If so, we would not need to maintain the offsets at all; we could statically compute a `fixed-row-size` and use it to seek to the row.
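A rough sketch of the two lookup strategies, side by side. `RowStore`, its members, and the `fixed_row_size` field are hypothetical names for illustration and do not mirror the real `RowEncoder` layout; the sketch only contrasts an offsets-based lookup (`num_rows + 1` entries) with a multiply-and-seek lookup.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical, simplified row store; not the actual RowEncoder members.
struct RowStore {
  std::vector<uint8_t> bytes;    // all encoded rows, back to back
  std::vector<int32_t> offsets;  // num_rows + 1 entries when not empty
  int32_t fixed_row_size = -1;   // >= 0 only if every column is fixed-size or null

  // Returns a view of row i, using whichever lookup is available.
  std::string_view Row(int32_t i) const {
    const char* base = reinterpret_cast<const char*>(bytes.data());
    if (fixed_row_size >= 0) {
      // No offsets needed: the row start is i * fixed_row_size.
      const size_t size = static_cast<size_t>(fixed_row_size);
      return std::string_view(base + static_cast<size_t>(i) * size, size);
    }
    // General case: consecutive offsets delimit the row.
    const size_t begin = static_cast<size_t>(offsets[i]);
    const size_t end = static_cast<size_t>(offsets[i + 1]);
    return std::string_view(base + begin, end - begin);
  }
};
```

The fixed-size path trades one vector of offsets for a multiplication, which is the saving the comment above has in mind.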
@@ -259,6 +334,9 @@ class ARROW_EXPORT RowEncoder {
Status EncodeAndAppend(const ExecSpan& batch);
Result<ExecBatch> Decode(int64_t num_rows, const int32_t* row_ids);

// Returns the encoded representation of the row at index i.
// If i is kRowIdForNulls, it returns the pre-encoded all-nulls
// row.
Another optimization might be `std::string_view unsafe_encoded_row(int32_t i)`, which would not copy the row. When the `std::string` is too long for SSO, this would benefit from fewer heap allocations.
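For illustration, a minimal sketch of the copying and non-copying accessors. `EncodedRows` and its members are hypothetical; only the `unsafe_encoded_row` name and signature come from the comment above, and nothing here is the actual Arrow code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified holder; the member names are illustrative only.
class EncodedRows {
 public:
  void Append(std::string_view row) {
    if (offsets_.empty()) offsets_.push_back(0);
    bytes_.append(row);
    offsets_.push_back(static_cast<int32_t>(bytes_.size()));
  }

  // Copies the row; may heap-allocate when the row exceeds the SSO buffer.
  std::string encoded_row(int32_t i) const {
    return std::string(unsafe_encoded_row(i));
  }

  // Borrows the row without copying. "Unsafe" because the view dangles if
  // bytes_ reallocates (e.g. on a later Append) or the object is destroyed.
  std::string_view unsafe_encoded_row(int32_t i) const {
    const size_t begin = static_cast<size_t>(offsets_[i]);
    const size_t end = static_cast<size_t>(offsets_[i + 1]);
    return std::string_view(bytes_).substr(begin, end - begin);
  }

 private:
  std::string bytes_;
  std::vector<int32_t> offsets_;
};
```

The view is only valid while the underlying buffer is untouched, so callers would need to consume it before appending more rows.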
Can you open a separate PR for that?
> Can you open a separate PR for that?

Will do after this is merged.
/// The row format is composed of the the KeyColumn encodings for each,
/// and the column is encoded as follows:
/// 1. A null byte for each column, indicating whether the column is null.
/// "1" for null, "0" for non-null.
/// 2. The "fixed width" encoding for the column, it would exist whether
/// the column is null or not.
- /// 3. The "variable width" encoding for the column, it would exists only
+ /// 3. The "variable payload" encoding for the column, it would exists only
Does it include the "length" part of the var-length column (which exists whether the column is null or not)?
I added a description below.
Co-authored-by: Rossi Sun <[email protected]>
cba86df to 8f3a8f6
/// [null byte, variable-byte length, variable bytes]. For example:
///
/// String "abc" Would be encoded as:
/// [0 0 0 0 3 'a' 'b' 'c']
Wouldn't this be `[0 3 0 0 0 'a' 'b' 'c']` on little-endian platforms?
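A small sketch of why the byte order matters, assuming the length is a 32-bit integer stored in native byte order. `NativeBytes` and `IsLittleEndian` are hypothetical helpers written for this illustration; they are not part of Arrow.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Returns the in-memory bytes of a 32-bit length, as memcpy would store them.
std::array<uint8_t, 4> NativeBytes(int32_t length) {
  std::array<uint8_t, 4> out{};
  std::memcpy(out.data(), &length, sizeof(length));
  return out;
}

// True when the least significant byte is stored first.
bool IsLittleEndian() { return NativeBytes(1)[0] == 1; }
```

On a little-endian machine `NativeBytes(3)` yields `3 0 0 0`, and on a big-endian machine it yields `0 0 0 3`, so a byte-by-byte example in the doc comment is only accurate for one of the two.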
Nice catch. How about:
0 (1 byte for not null) + 3 (4 bytes for length) + "abc" (payload)
That would work too!
done
Co-authored-by: Antoine Pitrou <[email protected]>
c470800 to 34ec45f
The failed CI is unrelated.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44d3f76. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

Some comments for RowEncoder

### What changes are included in this PR?

Some comments for RowEncoder

### Are these changes tested?

Covered by existing

### Are there any user-facing changes?

no

* GitHub Issue: apache#43758

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: mwish <[email protected]>
Rationale for this change
Some comments for RowEncoder
What changes are included in this PR?
Some comments for RowEncoder
Are these changes tested?
Covered by existing
Are there any user-facing changes?
no