Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
GH-45201: [C++][Parquet] Improve performance of generating size stati…
…stics (#45202) ### Rationale for this change We found out in #45085 that there is a non-trivial overhead when writing size statistics is enabled. ### What changes are included in this PR? Dramatically reduce overhead by speeding up def/rep levels histogram updates. Performance results on the author's machine: ``` ------------------------------------------------------------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------------------------------------------------ BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type> 8103053 ns 8098569 ns 86 bytes_per_second=1003.26Mi/s items_per_second=129.477M/s output_size=537.472k page_index_size=33 BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type> 8153499 ns 8148492 ns 86 bytes_per_second=997.117Mi/s items_per_second=128.683M/s output_size=537.488k page_index_size=33 BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type> 8212560 ns 8207754 ns 83 bytes_per_second=989.918Mi/s items_per_second=127.754M/s output_size=537.502k page_index_size=47 BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType> 10405020 ns 10400775 ns 67 bytes_per_second=444.142Mi/s items_per_second=100.817M/s output_size=848.305k page_index_size=34 BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType> 10464784 ns 10460778 ns 66 bytes_per_second=441.594Mi/s items_per_second=100.239M/s output_size=848.325k page_index_size=34 BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType> 10469832 ns 10465739 ns 67 bytes_per_second=441.385Mi/s items_per_second=100.191M/s output_size=848.344k page_index_size=48 BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type> 13004962 ns 12992678 ns 52 bytes_per_second=657.101Mi/s items_per_second=80.7052M/s output_size=617.464k page_index_size=34 BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type> 13718352 ns 13705599 ns 50 bytes_per_second=622.921Mi/s items_per_second=76.5071M/s output_size=617.486k page_index_size=34 BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type> 13845553 ns 13832138 ns 52 bytes_per_second=617.222Mi/s items_per_second=75.8072M/s output_size=617.506k page_index_size=54 BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType> 15715263 ns 15702707 ns 44 bytes_per_second=320.449Mi/s items_per_second=66.7768M/s output_size=927.326k page_index_size=35 BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType> 16507328 ns 16493800 ns 43 bytes_per_second=305.079Mi/s items_per_second=63.5739M/s output_size=927.352k page_index_size=35 BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType> 16575359 ns 16561311 ns 42 bytes_per_second=303.836Mi/s items_per_second=63.3148M/s output_size=927.377k page_index_size=55 ``` Performance results without this PR: ``` ------------------------------------------------------------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------------------------------------------------ BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type> 8042576 ns 8037678 ns 87 bytes_per_second=1010.86Mi/s items_per_second=130.458M/s output_size=537.472k page_index_size=33 BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type> 9576627 ns 9571279 ns 73 bytes_per_second=848.894Mi/s items_per_second=109.554M/s output_size=537.488k page_index_size=33 BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type> 9570204 ns 9563595 ns 73 bytes_per_second=849.576Mi/s items_per_second=109.642M/s output_size=537.502k page_index_size=47 BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType> 10165397 ns 10160868 ns 69 bytes_per_second=454.628Mi/s items_per_second=103.197M/s output_size=848.305k page_index_size=34 BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType> 11662568 ns 11657396 ns 60 bytes_per_second=396.265Mi/s items_per_second=89.9494M/s output_size=848.325k page_index_size=34 BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType> 11657135 ns 11653063 ns 60 bytes_per_second=396.412Mi/s items_per_second=89.9829M/s output_size=848.344k page_index_size=48 BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type> 13182006 ns 13168704 ns 51 bytes_per_second=648.318Mi/s items_per_second=79.6264M/s output_size=617.464k page_index_size=34 BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type> 16438205 ns 16421762 ns 43 bytes_per_second=519.89Mi/s items_per_second=63.8528M/s output_size=617.486k page_index_size=34 BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type> 16424615 ns 16409032 ns 42 bytes_per_second=520.293Mi/s items_per_second=63.9024M/s output_size=617.506k page_index_size=54 BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType> 15387808 ns 15373086 ns 46 bytes_per_second=327.32Mi/s items_per_second=68.2086M/s output_size=927.326k page_index_size=35 BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType> 18319628 ns 18302938 ns 37 bytes_per_second=274.924Mi/s items_per_second=57.29M/s output_size=927.352k page_index_size=35 BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType> 18346665 ns 18329336 ns 37 bytes_per_second=274.528Mi/s items_per_second=57.2075M/s output_size=927.377k page_index_size=55 ``` ### Are these changes tested? Tested by existing tests, validated by existing benchmarks. ### Are there any user-facing changes? No. * GitHub Issue: #45201 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
- Loading branch information