Apache Pinot (incubating) 0.4.0
Summary
This release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, and the deprecation of TimeFieldSpec in favor of DateTimeFieldSpec. Miscellaneous refactoring, performance improvements and bug fixes were also included in this release. See details below.
The release was cut from this commit:
008be2d
with cherry-picking the following patches:
Notable New Features
- Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
- Supported range queries using indexes (#5240)
- Supported complex aggregation functions
- Added a simple PinotFS benchmark driver (#5160)
- Supported default star-tree (#5147)
- Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)
- One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregationFunction (#5382)
- Added access control for Pinot server segment download api (#5260)
- Added Pinot S3 Filesystem Plugin (#5249)
- Text search improvement
- Pruned stop words for text index (#5297)
- Used 8byte offsets in chunk based raw index creator (#5285)
- Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)
- Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)
- Made text index query cache a configurable option (#5176)
- Added Lucene DocId to PinotDocId cache to improve performance (#5177)
- Removed the construction of second bitmap in text index reader to improve performance (#5199)
- Tooling/usability improvement
- Added template support for Pinot Ingestion Job Spec (#5341)
- Allowed users to specify the zk data dir and skip cleanup during zk shutdown (#5295)
- Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)
- Updated JVM settings for scripts (#5127)
- Added Stream github events demo (#5189)
- Moved docs link from gitbook to docs.pinot.apache.org (#5193)
- Re-implemented ORCRecordReader (#5267)
- Evaluated schema transform expressions during ingestion (#5238)
- Handled count distinct query in selection list (#5223)
- Enabled async processing in pinot broker query api (#5229)
- Supported bootstrap mode for table rebalance (#5224)
- Supported order-by on BYTES column (#5213)
- Added Nightly publish to binary (#5190)
- Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)
- Supported inbuilt transform functions (#5312)
- Added date time transform functions (#5326)
- Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)
- APIs Additions/Changes
- Added a new server api for download of segments
- GET /segments/{tableNameWithType}/{segmentName}
- Upgraded helix to 0.9.7 (#5411)
- Added support to execute functions during query compilation (#5406)
- Other notable refactoring
- Moved table config into pinot-spi (#5194)
- Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)
- Added jsonExtractScalar function to extract field from json object (#4597)
- Added template support for Pinot Ingestion Job Spec (#5372)
- Cleaned up AggregationFunctionContext (#5364)
- Optimized real-time range predicate when cardinality is high (#5331)
- Made PinotOutputFormat use table config and schema to create segments (#5350)
- Tracked unavailable segments in InstanceSelector (#5337)
- Added a new best effort segment uploader with bounded upload time (#5314)
- In SegmentPurger, used table config to generate the segment (#5325)
- Decoupled schema from RecordReader and StreamMessageDecoder (#5309)
- Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)
- Improved GroupBy query performance (#5291)
- Optimized ExpressionFilterOperator (#5132)
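As a rough illustration of the new server segment download API listed above (GET /segments/{tableNameWithType}/{segmentName}), the sketch below only assembles the request URL. The class name, the base address http://localhost:8097, and the table and segment names are assumptions made for this example. It also assumes both names contain only URL-safe characters, and it does not cover any access control that a deployment may enforce on this endpoint.

```java
import java.net.URI;

public class SegmentDownloadUrl {
    // Builds the URI for GET /segments/{tableNameWithType}/{segmentName}.
    // serverBaseUrl is an assumption: substitute the address of your server's admin API.
    // Both names are assumed to contain only URL-safe characters (no encoding is applied).
    static URI build(String serverBaseUrl, String tableNameWithType, String segmentName) {
        // Drop a single trailing slash so the path is not doubled up.
        String base = serverBaseUrl.endsWith("/")
                ? serverBaseUrl.substring(0, serverBaseUrl.length() - 1)
                : serverBaseUrl;
        return URI.create(base + "/segments/" + tableNameWithType + "/" + segmentName);
    }

    public static void main(String[] args) {
        // Hypothetical table and segment names, for illustration only.
        System.out.println(build("http://localhost:8097", "myTable_OFFLINE", "myTable_0"));
    }
}
```

The resulting URI can be passed to any HTTP client to issue the GET request.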
Major Bug Fixes
- Do not release the PinotDataBuffer when closing the index (#5400)
- Handled a no-arg function in query parsing and expression tree (#5375)
- Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
- Fixed missing error message from pinot-admin command (#5305)
- Fixed HDFS copy logic (#5218)
- Fixed spark ingestion issue (#5216)
- Fixed the capacity of the DistinctTable (#5204)
- Fixed various links in the Pinot website
Work in Progress
- Upsert: support overriding data in the real-time table (#4261).
- Add pinot upsert features to pinot common (#5175)
- Enhancements for theta-sketch, e.g. multi-value aggregation support, complex predicates, performance tuning, etc.
Backward Incompatible Changes
- TableConfig no longer supports de-serialization from a JSON string that contains nested JSON strings (i.e. no \" inside the JSON) (#5194)
- The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):
void aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
void aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
- A different segment writing logic was introduced in #5256. Although this is backward compatible in the sense that old segments can be read by the new code, rollback would be tricky: segments created after the upgrade are written in the new format, and the old code cannot read them.
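For custom aggregation functions affected by the blockValSetMap key change (#5371), the practical difference is that value blocks are looked up by an expression object instead of a column-name string. The sketch below shows that lookup pattern with stand-in types only: Expr and long[] are hypothetical placeholders for TransformExpressionTree and BlockValSet, not Pinot classes, and the sketch assumes the key type has value-based equality.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ExpressionKeyedLookup {
    // Hypothetical stand-in for TransformExpressionTree: an expression identified by its text.
    static final class Expr {
        final String text;

        Expr(String text) {
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Expr && ((Expr) o).text.equals(text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(text);
        }
    }

    // Sums the first `length` values of the block stored under `expr`.
    // long[] is a hypothetical stand-in for BlockValSet.
    static long sum(int length, Expr expr, Map<Expr, long[]> blockValSetMap) {
        long[] values = blockValSetMap.get(expr);
        long total = 0;
        for (int i = 0; i < length; i++) {
            total += values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Expr, long[]> blocks = new HashMap<>();
        blocks.put(new Expr("clicks"), new long[] {3, 4, 5});
        // A freshly constructed key finds the entry because Expr defines equals and hashCode.
        System.out.println(sum(2, new Expr("clicks"), blocks)); // prints 7
    }
}
```

Code that previously called blockValSetMap.get with a column name would need to obtain the corresponding expression object and use it as the key instead.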