
Apache Pinot 0.9.0

@xiangfu0 released this 12 Nov 09:06

Summary

This release introduces a new feature, Segment Merge and Rollup, to simplify users' day-to-day operational work. A new metrics plugin adds support for Dropwizard. As usual, the release also includes new functionality and many UI and performance improvements.

The release was cut from the following commit: 13c9ee9 and the following cherry-picks: 668b5e0, ee887b9

Support Segment Merge and Roll-up

LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This led to slow operations in Helix/ZooKeeper, long-running queries because there were too many tasks to process, and higher storage usage because of a lack of compression.

To solve this problem they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.

At the moment, this feature is available only for offline tables; support for real-time tables will be added in a future release.
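The idea behind merging and rolling up can be illustrated with a small, self-contained Python sketch. This is not Pinot's implementation: the row layout `(timestamp_ms, dimension, metric)`, the one-day bucket, the choice of `sum` as the aggregation, and the function name `merge_and_rollup` are all assumptions made for illustration.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000  # bucket width assumed for this sketch


def merge_and_rollup(segments, bucket_ms=DAY_MS):
    """Merge rows from many small segments into one row list per time bucket.

    Rows that fall into the same bucket and share a dimension value are
    rolled up by summing their metric (hypothetical aggregation choice).
    """
    buckets = defaultdict(lambda: defaultdict(int))
    for segment in segments:
        for ts_ms, dimension, metric in segment:
            # Round the timestamp down to the start of its bucket.
            bucket_start = ts_ms - ts_ms % bucket_ms
            buckets[bucket_start][dimension] += metric
    # One merged "segment" per bucket, with rows sorted for stable output.
    return {
        start: sorted((start, dim, total) for dim, total in rows.items())
        for start, rows in sorted(buckets.items())
    }


# Three small segments; the first two cover the same day.
small_segments = [
    [(1_000, "us", 3), (2_000, "eu", 1)],
    [(3_000, "us", 4)],
    [(DAY_MS + 500, "us", 7)],
]
merged = merge_and_rollup(small_segments)
```

Here three input segments collapse into two merged ones, and the two `"us"` rows from the first day are combined into a single row. Fewer, larger segments mean fewer entries to track in cluster metadata and fewer per-segment tasks at query time.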

Major Changes:

  • Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor (#7180)
  • Merge/Rollup task scheduler for offline tables. (#7178)
  • Fix MergeRollupTask uploading segments not updating their metadata (#7289)
  • MergeRollupTask integration tests (#7283)
  • Add mergeRollupTask delay metrics (#7368)
  • MergeRollupTaskGenerator enhancement: enable parallel buckets scheduling (#7481)
  • Use maxEndTimeMs for merge/roll-up delay metrics. (#7617)

UI Improvements

This release also sees improvements to Pinot’s query console UI.

  • Cmd+Enter shortcut to run query in query console (#7359)
  • Showing tooltip in SQL Editor (#7387)
  • Make the SQL Editor box expandable (#7381)
  • Fix tables ordering by number of segments (#7564)

SQL Improvements

There have also been improvements and additions to Pinot’s SQL implementation.

New functions:

  • IN (#7542)
  • LASTWITHTIME (#7584)
  • ID_SET on MV columns (#7355)
  • Raw results for Percentile TDigest and Est (#7226)
  • Add timezone as argument in function toDateTime (#7552)
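As a rough illustration of what LASTWITHTIME computes, the sketch below returns the value of one column from the row whose time column is greatest. It models only the semantics as understood from the function name and is not Pinot code; the helper `last_with_time`, the sample column names, and the behavior on empty input or ties are assumptions.

```python
def last_with_time(rows, data_col, time_col):
    """Return the data_col value from the row with the largest time_col.

    Returns None for empty input (an assumption of this sketch). On ties,
    the first row encountered with the maximum time wins.
    """
    if not rows:
        return None
    latest = max(rows, key=lambda row: row[time_col])
    return latest[data_col]


# Hypothetical rows: the order of arrival does not match the time order.
rows = [
    {"status": "queued", "updatedAt": 100},
    {"status": "done", "updatedAt": 300},
    {"status": "running", "updatedAt": 200},
]
result = last_with_time(rows, "status", "updatedAt")
```

With this sample data, `result` is the status attached to the highest `updatedAt` value, regardless of where that row appears in the input.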

New predicates are supported:

Query compatibility improvements:

  • Infer data type for Literal (#7332)
  • Support logical identifier in predicate (#7347)
  • Support JSON queries with top-level array path expression. (#7511)
  • Support configurable group by trim size to improve results accuracy (#7241)
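To see why a group-by trim size can affect accuracy, consider the simplified sketch below. Each segment keeps only its `trim_size` largest groups before the partial results are merged, so a group that is never the largest locally can be dropped even though it is the largest overall. This is a toy model, not Pinot's execution path; the function `top_groups` and its parameters are hypothetical.

```python
from collections import Counter
import heapq


def top_groups(segments, limit, trim_size):
    """Count group keys per segment, trim each segment to its trim_size
    largest groups, merge the survivors, and return the top `limit` groups."""
    merged = Counter()
    for segment in segments:
        counts = Counter(segment)
        # Keep only the largest groups from this segment (ties broken by key).
        kept = heapq.nlargest(
            trim_size, counts.items(), key=lambda kv: (kv[1], kv[0])
        )
        for key, count in kept:
            merged[key] += count
    # Order by count descending, then key ascending, and apply the limit.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Group "c" is second in each segment but first overall (4 + 4 = 8).
segments = [
    ["a"] * 5 + ["c"] * 4,
    ["b"] * 5 + ["c"] * 4,
]
aggressive = top_groups(segments, limit=1, trim_size=1)
relaxed = top_groups(segments, limit=1, trim_size=2)
```

With `trim_size=1`, group `"c"` is discarded in both segments and the query reports `"a"` as the top group. With `trim_size=2`, `"c"` survives and is correctly reported with a count of 8. A larger trim size trades memory and merge work for accuracy.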

Performance Improvements

This release contains many performance improvements that you may notice in your day-to-day queries. Thanks to all the great contributions listed below:

  • Reduce the disk usage for segment conversion task (#7193)
  • Simplify association between Java Class and PinotDataType for faster mapping (#7402)
  • Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation (#7412)
  • Replace MINUS with STRCMP (#7394)
  • Bit-sliced range index for int, long, float, double, dictionarized SV columns (#7454)
  • Use MethodHandle to access vectorized unsigned comparison on JDK9+ (#7487)
  • Add option to limit thread usage per query (#7492)
  • Improved range queries (#7513)
  • Faster bitmap scans (#7530)
  • Optimize EmptySegmentPruner to skip pruning when there are no empty segments (#7531)
  • Map bitmaps through a bounded window to avoid excessive disk pressure (#7535)
  • Allow RLE compression of bitmaps for smaller file sizes (#7582)
  • Support raw index properties for columns with JSON and RANGE indexes (#7615)
  • Enhance BloomFilter rule to include IN predicate (#7444) (#7624)
  • Introduce LZ4_WITH_LENGTH chunk compression type (#7655)
  • Enhance ColumnValueSegmentPruner and support bloom filter prefetch (#7654)
  • Apply the optimization on dictIds within the segment to DistinctCountHLL aggregation func (#7630)
  • During segment pruning, release the bloom filter after each segment is processed (#7668)
  • Fix JSONPath cache inefficiency issue (#7409)
  • Optimize getUnpaddedString with SWAR padding search (#7708)
  • Lighter weight LiteralTransformFunction, avoid excessive array fills (#7707)
  • Inline binary comparison ops to prevent function call overhead (#7709)
  • Memoize literals in query context in order to deduplicate them (#7720)

Other Notable New Features and Changes

  • Human Readable Controller Configs (#7173)
  • Add the support of geoToH3 function (#7182)
  • Add Apache Pulsar as Pinot Plugin (#7223) (#7247)
  • Add dropwizard metrics plugin (#7263)
  • Introduce OR Predicate Execution On Star Tree Index (#7184)
  • Allow to extract values from array of objects with jsonPathArray (#7208)
  • Add Realtime table metadata and indexes API. (#7169)
  • Support array with mixing data types (#7234)
  • Support force download segment in reload API (#7249)
  • Show uncompressed znRecord from zk api (#7304)
  • Add debug endpoint to get minion task status. (#7300)
  • Validate CSV Header For Configured Delimiter (#7237)
  • Add auth tokens and user/password support to ingestion job command (#7233)
  • Add option to store the hash of the upsert primary key (#7246)
  • Add null support for time column (#7269)
  • Add mode aggregation function (#7318)
  • Support disable swagger in Pinot servers (#7341)
  • Delete metadata properly on table deletion (#7329)
  • Add basic Obfuscator Support (#7407)
  • Add AWS sts dependency to enable auth using web identity token. (#7017) (#7445)
  • Mask credentials in debug endpoint /appconfigs (#7452)
  • Fix /sql query endpoint to be compatible with auth (#7230)
  • Fix case sensitive issue in BasicAuthPrincipal permission check (#7354)
  • Fix auth token injection in SegmentGenerationAndPushTaskExecutor (#7464)
  • Add segmentNameGeneratorType config to IndexingConfig (#7346)
  • Support trigger PeriodicTask manually (#7174)
  • Add endpoint to check minion task status for a single task. (#7353)
  • Showing partial status of segment and counting CONSUMING state as good segment status (#7327)
  • Add "num rows in segments" and "num segments queried per host" to the output of Realtime Provisioning Rule (#7282)
  • Check schema backward-compatibility when updating schema through addSchema with override (#7374)
  • Optimize IndexedTable (#7373)
  • Support indices remove in V3 segment format (#7301)
  • Optimize TableResizer (#7392)
  • Introduce resultSize in IndexedTable (#7420)
  • Offset based realtime consumption status checker (#7267)
  • Add causes to stack trace return (#7460)
  • Create controller resource packages config key (#7488)
  • Enhance TableCache to support schema name different from table name (#7525)
  • Add validation for realtimeToOffline task (#7523)
  • Unify CombineOperator multi-threading logic (#7450)
  • Support no downtime rebalance for table with 1 replica in TableRebalancer (#7532)
  • Introduce MinionConf, move END_REPLACE_SEGMENTS_TIMEOUT_MS to minion config instead of task config. (#7516)
  • Adjust tuner api (#7553)
  • Adding config for metrics library (#7551)
  • Add geo type conversion scalar functions (#7573)
  • Add BOOLEAN_ARRAY and TIMESTAMP_ARRAY types (#7581)
  • Add MV raw forward index and MV BYTES data type (#7595)
  • Enhance TableRebalancer to offload the segments from most loaded instances first (#7574)
  • Improve get tenant API to differentiate offline and realtime tenants (#7548)
  • Refactor query rewriter to interfaces and implementations to allow customization (#7576)
  • In ServiceStartable, apply global cluster config in ZK to instance config (#7593)
  • Make dimension tables creation bypass tenant validation (#7559)
  • Allow Metadata and Dictionary Based Plans for No Op Filters (#7563)
  • Reject query with identifiers not in schema (#7590)
  • Round-robin IP addresses when retrying segment upload/download (#7585)
  • Support multi-value derived column in offline table reload (#7632)
  • Support segmentNamePostfix in segment name (#7646)
  • Add select segments API (#7651)
  • Controller getTableInstance() call now returns the list of live brokers of a table. (#7556)
  • Allow MV Field Support For Raw Columns in Text Indices (#7638)
  • Allow override distinctCount to segmentPartitionedDistinctCount (#7664)
  • Add a quick start with both UPSERT and JSON index (#7669)
  • Add revertSegmentReplacement API (#7662)
  • Smooth segment reloading with non blocking semantic (#7675)
  • Clear the reused record in PartitionUpsertMetadataManager (#7676)
  • Replace args4j with picocli (#7665)
  • Handle datetime column consistently (#7645) (#7705)
  • Allow to carry headers with query requests (#7696) (#7712)
  • Allow adding JSON data type for dimension column types (#7718)
  • Separate SegmentDirectoryLoader and tierBackend concepts (#7737)
  • Implement size balanced V4 raw chunk format (#7661)
  • Add presto-pinot-driver lib (#7384)

Major Bug fixes

  • Fix null pointer exception for non-existent metric columns in schema for JDBC driver (#7175)
  • Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD (#7198)
  • Fixed pinot java client to add zkClient close (#7196)
  • Ignore query json parse errors (#7165)
  • Fix shutdown hook for PinotServiceManager (#7251) (#7253)
  • Make STRING to BOOLEAN data type change as backward compatible schema change (#7259)
  • Replace gcp hardcoded values with generic annotations (#6985)
  • Fix segment conversion executor for in-place conversion (#7265)
  • Fix reporting consuming rate when the Kafka partition level consumer isn't stopped (#7322)
  • Fix the issue with concurrent modification for segment lineage (#7343)
  • Fix TableNotFound error message in PinotHelixResourceManager (#7340)
  • Fix upload LLC segment endpoint truncated download URL (#7361)
  • Fix task scheduling on table update (#7362)
  • Fix metric method for ONLINE_MINION_INSTANCES metric (#7363)
  • Fix JsonToPinotSchema behavior to be consistent with AvroSchemaToPinotSchema (#7366)
  • Fix currentOffset volatility in consuming segment (#7365)
  • Fix misleading error msg for missing URI (#7367)
  • Fix the correctness of getColumnIndices method (#7370)
  • Fix SegmentZKMetadata time handling (#7375)
  • Fix retention for cleaning up segment lineage (#7424)
  • Fix segment generator to not return illegal filenames (#7085)
  • Fix missing LLC segments in segment store by adding controller periodic task to upload them (#6778)
  • Fix parsing error messages returned to FileUploadDownloadClient (#7428)
  • Fix manifest scan which drives /version endpoint (#7456)
  • Fix missing rate limiter if brokerResourceEV becomes null due to ZK connection (#7470)
  • Fix race conditions between segment merge/roll-up and purge (or convertToRawIndex) tasks (#7427)
  • Fix pql double quote checker exception (#7485)
  • Fix minion metrics exporter config (#7496)
  • Fix segment unable to retry issue by catching timeout exception during segment replace (#7509)
  • Add Exception to Broker Response When Not All Segments Are Available (Partial Response) (#7397)
  • Fix segment generation commands (#7527)
  • Return non zero from main with exception (#7482)
  • Fix parquet plugin shading error (#7570)
  • Fix the case where the lowest partition id is not 0 for LLC (#7066)
  • Fix star-tree index map when column name contains '.' (#7623)
  • Fix cluster manager URLs encoding issue (#7639)
  • Fix fieldConfig nullable validation (#7648)
  • Fix verifyHostname issue in FileUploadDownloadClient (#7703)
  • Fix TableCache schema to include the built-in virtual columns (#7706)
  • Fix DISTINCT with AS function (#7678)
  • Fix SDF pattern in DataPreprocessingHelper (#7721)
  • Fix fields missing issue in the source in ParquetNativeRecordReader (#7742)