-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400
base: main
Are you sure you want to change the base?
Conversation
This is port of #35691 . I'm busy previous days and now I've time on it now. The previous comment are solved. cc @pitrou @wgtmac @emkornfield |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this! I just did an initial review except the test.
I'm quite busy these few days but I promise I would try my best to check this in this month |
The feature freeze was yesterday. This will not make it in time for 17.0.0 |
@emkornfield I've try to resolve the comment |
[6, "f"] | ||
])"}; | ||
auto table = ::arrow::TableFromJSON(schema, contents); | ||
auto non_dict_table = ::arrow::TableFromJSON(origin_schema, contents); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about these:
table
-> dict_encoded_table
/table_with_dict
non_dict_table
-> table
cpp/src/parquet/bloom_filter.h
Outdated
/// Compute hash for ByteArray value by using its plain encoding result. | ||
/// | ||
/// @param value the value to hash. | ||
uint64_t Hash(const ByteArray& value) const { return Hash(&value); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems they are better to be relocated to below line 107.
if (iter == row_group_bloom_filter.end()) { | ||
auto block_split_bloom_filter = | ||
std::make_unique<BlockSplitBloomFilter>(properties_->memory_pool()); | ||
block_split_bloom_filter->Init(BlockSplitBloomFilter::OptimalNumOfBytes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed that PR and it could be a followup change. Writer implementation has the freedom to try smart things.
FYI, parquet-java also discards the bloom filter if dictionary encoding is applied to all data pages, though I don't think we should do the same thing.
std::array<uint64_t, kHashBatchSize> hashes; | ||
for (int64_t i = 0; i < num_values; i += kHashBatchSize) { | ||
int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i); | ||
bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have forgot something, where does int32 range
come from? @emkornfield
void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t, | ||
const uint8_t*, | ||
int64_t) { | ||
DCHECK(bloom_filter_ == nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on an explicit exception
# Conflicts: # cpp/src/parquet/column_writer_test.cc
505e23b
to
e9c550a
Compare
Two need fix:
|
fd3856d
to
79bdaeb
Compare
79bdaeb
to
d892819
Compare
@pitrou @wgtmac @emkornfield Sorry for late reply, I believe all comments are replyed or fixed now. Now the bloom filter becoming map in all use cases, since it would be more sparse then page-indices. |
The feature freeze for Arrow 19 is planned for January 6, 2025 and I'm curious if there might be capacity to get this fully reviewed and merged by then (or soon after). If not, feel free to comment with what you'd need (more reviewers, more time, etc). cc @mapleFU @pitrou @wgtmac @emkornfield |
Sorry for missing this! I will take a look. Meanwhile, I don't think we need to hurry for the code freeze. |
# Conflicts: # cpp/src/parquet/column_writer.cc # cpp/src/parquet/type_fwd.h
Rebased, ready for review now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed for another pass and generally LGTM.
It would be good if @pitrou @emkornfield can take a look after the holiday season.
ASSERT_EQ(nullptr, bloom_filter); | ||
} else { | ||
ASSERT_NE(nullptr, bloom_filter); | ||
bloom_filters_.push_back(std::move(bloom_filter)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing bloom_filters_
to be an output parameter to function ReadBloomFilters
instead of a class member variable?
std::vector<std::unique_ptr<BloomFilter>> bloom_filters_; | ||
}; | ||
|
||
TEST_F(ParquetBloomFilterRoundTripTest, SimpleRoundTrip) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The three test cases below share a lot of common logic (with exactly same data). Should we refactor them to eliminate the duplicate?
cpp/src/parquet/metadata.h
Outdated
struct BloomFilterLocation { | ||
/// Row group bloom filter index locations which uses row group ordinal as the key. | ||
/// | ||
/// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here | |
/// Note: Before Parquet Format v2.10, the bloom filter index only have "offset". But here |
/// | ||
/// Number of columns with a bloom filter to be relatively small compared to | ||
/// the number of overall columns, so map is used. | ||
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about defining RowGroupBloomFilterLocation
and FileBloomFilterLocation
in the BloomFilterLocation
?
Rationale for this change
Currently we allow reading bloom filter for specific column and rowgroup, now this patch allow it writing BF.
This patch is just a skeleton. If reviewer thinks interface would be OK, I'll go on and add testing.
What changes are included in this PR?
Allow writing bf:
ParquetPageIndexRoundTripTest
Are these changes tested?
Yes
Are there any user-facing changes?
User can create Bloom Filter in parquet with C++ api