description |
---|
Learn how to write fast queries for looking up ids in a list of values. |
A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but this approach doesn't perform well with large lists of ids. In these cases, you can use an IdSet.
ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )
This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:
- INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
- LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
- Other types - Bloom Filter
The following parameters are used to configure the Bloom Filter:
- expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
- fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0
Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
IN_ID_SET(columnName, base64EncodedIdSet)
This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.
IN_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.
IN_PARTITIONED_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.
This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
You can create an IdSet of the values in the yearID column by running the following:
SELECT ID_SET(yearID)
FROM baseballStats
WHERE teamID = 'WS1'
idset(yearID) |
---|
ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc= |
When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:
SELECT ID_SET(playerName, 'expectedInsertions=10')
FROM baseballStats
WHERE teamID = 'WS1'
idset(playerName) |
---|
AwIBBQAAAAL///////////////////// |
SELECT ID_SET(playerName, 'expectedInsertions=100')
FROM baseballStats
WHERE teamID = 'WS1'
idset(playerName) |
---|
AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8= |
We can also configure the fpp parameter:
SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
FROM baseballStats
WHERE teamID = 'WS1'
idset(playerName) |
---|
AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8= |
We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for _yearID_s in the IdSet, run the following:
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_ID_SET(
yearID,
'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
) = 1
GROUP BY yearID
To return rows for _yearID_s not in the IdSet, run the following:
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_ID_SET(
yearID,
'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
) = 0
GROUP BY yearID
To filter rows for _yearID_s in the IdSet on a Pinot Broker, run the following query:
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 1
GROUP BY yearID
To filter rows for _yearID_s not in the IdSet on a Pinot Broker, run the following query:
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 0
GROUP BY yearID
To filter rows for _yearID_s in the IdSet on a Pinot Server, run the following query:
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 1
GROUP BY yearID
To filter rows for _yearID_s not in the IdSet on a Pinot Server, run the following query:
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 0
GROUP BY yearID