GH-43758: [C++] Compute: More comment in RowEncoder #43763

Merged
merged 15 commits into from
Sep 2, 2024
8 changes: 5 additions & 3 deletions cpp/src/arrow/compute/row/row_encoder_internal.cc
@@ -159,17 +159,19 @@ Status FixedWidthKeyEncoder::Encode(const ExecValue& data, int64_t batch_length,
};
if (data.is_array()) {
ArraySpan viewed = data.array;
// The original type might not be FixedSizeBinaryType, but it would
// treat the input as binary data.
auto view_ty = fixed_size_binary(byte_width_);
viewed.type = view_ty.get();
VisitArraySpanInline<FixedSizeBinaryType>(viewed, handle_next_valid_value,
handle_next_null_value);
} else {
const auto& scalar = data.scalar_as<arrow::internal::PrimitiveScalarBase>();
if (scalar.is_valid) {
const std::string_view data = scalar.view();
DCHECK_EQ(data.size(), static_cast<size_t>(byte_width_));
const std::string_view scalar_data = scalar.view();
DCHECK_EQ(scalar_data.size(), static_cast<size_t>(byte_width_));
for (int64_t i = 0; i < batch_length; i++) {
handle_next_valid_value(data);
handle_next_valid_value(scalar_data);
}
} else {
for (int64_t i = 0; i < batch_length; i++) {
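To make the fixed-width layout touched by this hunk concrete, here is a minimal standalone sketch. It assumes the layout described in the header comments of this PR (one null byte, then `byte_width` bytes that are present even for nulls). `EncodeFixedWidthColumn` and `EncodeExample` are hypothetical names for illustration, not Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper, not part of Arrow: encodes one Int8 column into
// per-row buffers. Each value is written as [null byte][1 value byte], and
// each row pointer is advanced past the bytes that were written.
void EncodeFixedWidthColumn(const std::vector<std::optional<int8_t>>& column,
                            std::vector<uint8_t*>& row_ptrs) {
  assert(column.size() == row_ptrs.size());
  for (size_t i = 0; i < column.size(); ++i) {
    uint8_t*& out = row_ptrs[i];
    if (column[i].has_value()) {
      *out++ = 0;  // "0" marks a non-null value
      *out++ = static_cast<uint8_t>(*column[i]);
    } else {
      *out++ = 1;  // "1" marks a null value
      *out++ = 0;  // the fixed-width slot is still present, zero-filled
    }
  }
}

// Encodes the column {5, null, 6} and returns the three 2-byte rows
// stored back to back.
std::vector<uint8_t> EncodeExample() {
  std::vector<std::optional<int8_t>> column = {int8_t{5}, std::nullopt, int8_t{6}};
  std::vector<uint8_t> storage(column.size() * 2, 0xFF);
  std::vector<uint8_t*> row_ptrs;
  for (size_t i = 0; i < column.size(); ++i) {
    row_ptrs.push_back(storage.data() + i * 2);
  }
  EncodeFixedWidthColumn(column, row_ptrs);
  return storage;
}
```

With a single Int8 column the rows come out as `[0 5]`, `[1 0]`, `[0 6]`, matching the example in the header comment. A scalar input would behave like a column that repeats the same value `batch_length` times.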
61 changes: 44 additions & 17 deletions cpp/src/arrow/compute/row/row_encoder_internal.h
@@ -38,7 +38,8 @@ struct ARROW_EXPORT KeyEncoder {

virtual ~KeyEncoder() = default;

// Increment the values in the lengths array by the length of the encoded key for the corresponding value in the given column.
// Increment the values in the lengths array by the length of the encoded key for the
// corresponding value in the given column.
//
// Generally if Encoder is for a fixed-width type, the length of the encoded key
// would add ExtraByteForNull + byte_width.
@@ -52,21 +53,24 @@ struct ARROW_EXPORT KeyEncoder {
// It's a special case for AddLength like `AddLength(Null-Scalar, 1, lengths)`.
virtual void AddLengthNull(int32_t* length) = 0;

// Encode the column into the encoded_bytes, which is an array of pointers to each row buffer.
// Encode the column into the encoded_bytes, which is an array of pointers to each row
// buffer.
//
// If value is an array, the array-size should be batch_length.
// If value is a scalar, the value would repeat batch_length times.
// NB: The pointers in encoded_bytes will be advanced as values are encoded into them.
virtual Status Encode(const ExecValue&, int64_t batch_length,
uint8_t** encoded_bytes) = 0;

// Encode a null value into the encoded_bytes, which is an array of pointers to each row buffer.
// Encode a null value into the encoded_bytes, which is an array of pointers to each row
// buffer.
//
// It's a special case for Encode like `Encode(Null-Scalar, 1, encoded_bytes)`.
// NB: The pointers in encoded_bytes will be advanced as values are encoded into them.
virtual void EncodeNull(uint8_t** encoded_bytes) = 0;

// Decode the encoded key from the encoded_bytes, which is an array of pointers to each row buffer, into an ArrayData.
// Decode the encoded key from the encoded_bytes, which is an array of pointers to each
// row buffer, into an ArrayData.
//
// NB: The pointers in encoded_bytes will be advanced as values are decoded from them.
virtual Result<std::shared_ptr<ArrayData>> Decode(uint8_t** encoded_bytes,
@@ -115,7 +119,7 @@ struct ARROW_EXPORT FixedWidthKeyEncoder : KeyEncoder {
MemoryPool* pool) override;

std::shared_ptr<DataType> type_;
int byte_width_;
const int byte_width_;
};

struct ARROW_EXPORT DictionaryKeyEncoder : FixedWidthKeyEncoder {
@@ -178,6 +182,7 @@ struct ARROW_EXPORT VarLengthKeyEncoder : KeyEncoder {
encoded_ptr += sizeof(Offset);
};
if (data.is_array()) {
DCHECK_EQ(data.length(), batch_length);
VisitArraySpanInline<T>(data.array, handle_next_valid_value,
handle_next_null_value);
} else {
@@ -267,37 +272,44 @@ struct ARROW_EXPORT NullKeyEncoder : KeyEncoder {
/// created by concatenating the encoded form of each column. The encoding
/// for each column depends on its data type.
///
/// This is used to encode columns into row-major format, which will be beneficial for grouping and joining operations.
/// This is used to encode columns into row-major format, which will be
/// beneficial for grouping and joining operations.
///
/// Unlike DuckDB and arrow-rs, this row format currently cannot help
/// sorting because the row format is not comparable.
///
/// # Key Column Encoding
///
/// The row format is composed of the KeyColumn encodings for each column,
/// and the column is encoded as follows:
/// 1. A null byte for each column, indicating whether the column is null.
/// "1" for null, "0" for non-null.
/// 2. The "fixed width" encoding for the column, it would exist whether
/// the column is null or not.
/// 3. The "variable width" encoding for the column, it would exists only
/// 3. The "variable payload" encoding for the column, it would exist only
Contributor

Does this count the "length" part of the var-length column (which exists whether or not the column is null)?

Member Author

I added a description below.

/// for non-null string/binary columns.
/// 4. Specially, if all columns in a row are null, the caller may decide to refer to kRowIdForNulls instead of actually encoding/decoding it. See the comment for encoded_nulls_.
/// 4. Specially, if all columns in a row are null, the caller may decide
/// to refer to kRowIdForNulls instead of actually encoding/decoding
/// it using any KeyEncoder. See the comment for encoded_nulls_.
///
/// ## Null Type
///
/// Null Type is a special case, it doesn't occupy any space in the encoded row.
/// Null Type is a special case, it doesn't occupy any space in the
/// encoded row.
///
/// ## Fixed Width Type
///
/// Fixed Width Type is encoded as a fixed-width byte sequence. For example:
/// ```
/// Int8: [5, null, 6]
/// Int8: 5, null, 6
/// ```
/// Would be encoded as [0 5 1 0 0 6].
/// Would be encoded as [0 5], [1 0], [0 6].
///
/// ### Dictionary Type
///
/// Dictionary Type is encoded as a fixed-width byte sequence using dictionary
/// indices, the dictionary should be identical for all rows.
/// Dictionary Type is encoded as a fixed-width byte sequence using
/// dictionary indices, the dictionary should be identical for all
/// rows.
///
/// ## Variable Width Type
///
@@ -309,6 +321,10 @@ struct ARROW_EXPORT NullKeyEncoder : KeyEncoder {
///
/// String null Would be encoded as:
/// [1 0 0 0 0]
///
/// # Row Encoding
///
/// The row format is the concatenation of the encodings of each column.
class ARROW_EXPORT RowEncoder {
public:
static constexpr int kRowIdForNulls() { return -1; }
@@ -318,7 +334,9 @@ class ARROW_EXPORT RowEncoder {
Status EncodeAndAppend(const ExecSpan& batch);
Result<ExecBatch> Decode(int64_t num_rows, const int32_t* row_ids);

// Return the encoded row at the given index as a string
// Returns the encoded representation of the row at index i.
// If i is kRowIdForNulls, it returns the pre-encoded all-nulls
// row.
Member Author

Another optimization might be `std::string_view unsafe_encoded_row(int32_t i)`, which avoids copying the row. When the row is too long for std::string to apply SSO, this would save a heap allocation.

Member

Can you open a separate PR for that?

Member Author

> Can you open a separate PR for that?

Will do after this is merged.
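The copy-versus-view trade-off discussed in this thread can be sketched as follows. `RowStore` and `encoded_row_view` are hypothetical stand-ins for illustration only, not the accessor that was eventually proposed for Arrow, and the sketch ignores the `kRowIdForNulls` case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the row storage, not Arrow's RowEncoder.
struct RowStore {
  std::vector<int32_t> offsets;  // num_rows + 1 entries
  std::vector<uint8_t> bytes;    // all encoded rows, back to back

  // Copying accessor: always materializes a new std::string, which needs a
  // heap allocation once the row is too long for the small-string buffer.
  std::string encoded_row(int32_t i) const {
    return std::string(encoded_row_view(i));
  }

  // Non-owning accessor: no copy and no allocation, but the view is only
  // valid while `bytes` is neither reallocated nor destroyed.
  std::string_view encoded_row_view(int32_t i) const {
    assert(i >= 0 && static_cast<size_t>(i) + 1 < offsets.size());
    const int32_t begin = offsets[i];
    const int32_t end = offsets[i + 1];
    return std::string_view(reinterpret_cast<const char*>(bytes.data()) + begin,
                            static_cast<size_t>(end - begin));
  }
};
```

The lifetime caveat is the reason such an accessor would plausibly carry an "unsafe" prefix: appending more rows may reallocate the byte vector and invalidate previously returned views.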

inline std::string encoded_row(int32_t i) const {
if (i == kRowIdForNulls()) {
return std::string(reinterpret_cast<const char*>(encoded_nulls_.data()),
@@ -336,11 +354,20 @@ class ARROW_EXPORT RowEncoder {
private:
ExecContext* ctx_{nullptr};
std::vector<std::shared_ptr<KeyEncoder>> encoders_;
// The offsets of each row in the encoded bytes.
// The size would be num_rows + 1 if not empty.
// offsets_ vector stores the starting position (offset) of each encoded row
// within the bytes_ vector. This allows for quick access to individual rows.
//
// The size would be num_rows + 1 if not empty, the last element is the total
// length of the bytes_ vector.
Member Author

Unrelated to this issue: I'm thinking of an optimization here. We could define a flag indicating that all the columns are fixed-size or null. If so, we would not need to maintain the offsets; we could statically compute a fixed row size and use it to seek to a row.
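A rough sketch of that idea, under the assumption that every row has the same encoded size: the offsets become a pure function of the row index, so they do not need to be stored. `BuildOffsets` and `FixedSizeRows` are hypothetical names, not existing Arrow code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// What an explicit offsets vector would contain when every row has the same
// size: num_rows + 1 entries, the last one being the total byte length.
std::vector<int32_t> BuildOffsets(int32_t num_rows, int32_t row_size) {
  std::vector<int32_t> offsets;
  offsets.reserve(static_cast<size_t>(num_rows) + 1);
  for (int32_t i = 0; i <= num_rows; ++i) {
    offsets.push_back(i * row_size);
  }
  return offsets;
}

// Hypothetical fixed-size variant: no offsets are stored, a row is located
// by multiplying its index by the statically known row size.
struct FixedSizeRows {
  int32_t row_size;
  std::vector<uint8_t> bytes;

  const uint8_t* row(int32_t i) const {
    assert(i >= 0 && static_cast<size_t>(i + 1) * row_size <= bytes.size());
    return bytes.data() + static_cast<size_t>(i) * row_size;
  }
};
```

Both ways of seeking agree on where each row starts; the fixed-size variant simply trades the per-row `int32_t` offset for one multiplication.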

std::vector<int32_t> offsets_;
// The encoded bytes of all "non kRowIdForNulls" rows.
std::vector<uint8_t> bytes_;
// A constant row with all its columns encoded as null. Useful when the caller is certain that an entire row is null and then uses kRowIdForNulls to refer to it.
// A pre-computed constant row with all its columns encoded as null. Useful when
// the caller is certain that an entire row is null and then uses kRowIdForNulls
// to refer to it.
//
// EncodeAndAppend would never append this row, but encoded_row and Decode would
// return this row when kRowIdForNulls is passed.
std::vector<uint8_t> encoded_nulls_;
std::vector<std::shared_ptr<ExtensionType>> extension_types_;
};
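To illustrate how the per-column encodings concatenate into rows and how the offsets relate to the byte buffer, here is a small self-contained sketch for one Int8 column and one String column. It assumes the var-length layout is a null byte, then a 4-byte length in native byte order, then the payload for non-null values; that agrees with the `[1 0 0 0 0]` null-string example in the comments, but the non-null layout is an assumption here. `EncodedRows`, `AppendInt8`, `AppendString`, and `AppendRow` are hypothetical names, and unlike the real `offsets_` this sketch keeps the leading 0 even when no rows have been appended.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container mirroring the roles of offsets_ and bytes_.
struct EncodedRows {
  std::vector<int32_t> offsets{0};  // offsets[i] is where row i starts
  std::vector<uint8_t> bytes;       // all rows, back to back
};

// Fixed-width column: null byte followed by the value slot (zero if null).
void AppendInt8(std::vector<uint8_t>& out, const std::optional<int8_t>& v) {
  out.push_back(static_cast<uint8_t>(v ? 0 : 1));
  out.push_back(v ? static_cast<uint8_t>(*v) : uint8_t{0});
}

// Var-length column: null byte, 4-byte length (always present), then the
// payload for non-null values only.
void AppendString(std::vector<uint8_t>& out, const std::optional<std::string>& v) {
  out.push_back(static_cast<uint8_t>(v ? 0 : 1));
  const int32_t length = v ? static_cast<int32_t>(v->size()) : 0;
  uint8_t length_bytes[sizeof(int32_t)];
  std::memcpy(length_bytes, &length, sizeof(length));
  out.insert(out.end(), length_bytes, length_bytes + sizeof(length));
  if (v) out.insert(out.end(), v->begin(), v->end());
}

// A row is the concatenation of its column encodings; the new end offset is
// recorded so that row i spans [offsets[i], offsets[i + 1]).
void AppendRow(EncodedRows& rows, const std::optional<int8_t>& key,
               const std::optional<std::string>& name) {
  AppendInt8(rows.bytes, key);
  AppendString(rows.bytes, name);
  rows.offsets.push_back(static_cast<int32_t>(rows.bytes.size()));
}
```

In this sketch the row `(5, "abc")` takes 2 + 1 + 4 + 3 = 10 bytes and an all-null row takes 2 + 5 = 7 bytes. The all-null row is the kind of constant that `encoded_nulls_` pre-computes, so callers can refer to it through `kRowIdForNulls` instead of appending it.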