Reservoir sampler #884

thorfour · 2024-05-29T18:14:09Z

Adds a new operator to the query engine to perform random sampling of the rows.

Initially this was implemented in the physical conversion layer, but that required pushing the filter down into the physical layer, which created significant overhead because filtering on Parquet values is slow. That approach has been abandoned for now but can be revisited in the future.
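For readers unfamiliar with the technique, the following is a minimal sketch of classic reservoir sampling ("Algorithm R") over a slice of row values. It is not the operator added in this PR: `reservoirSample`, its signature, and the use of plain `int64` rows are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"math/rand"
)

// reservoirSample keeps a uniform random sample of at most k rows using the
// classic "Algorithm R". Hypothetical helper for illustration; it is not the
// FrostDB operator.
func reservoirSample(rows []int64, k int, seed int64) []int64 {
	rng := rand.New(rand.NewSource(seed))
	reservoir := make([]int64, 0, k)
	for i, row := range rows {
		if len(reservoir) < k {
			// Initial fill: the first k rows are always kept.
			reservoir = append(reservoir, row)
			continue
		}
		// Row i (0-based) replaces a random slot with probability k/(i+1).
		if j := rng.Intn(i + 1); j < k {
			reservoir[j] = row
		}
	}
	return reservoir
}

func main() {
	rows := make([]int64, 100)
	for i := range rows {
		rows[i] = int64(i)
	}
	fmt.Println(len(reservoirSample(rows, 10, 1)))  // 10
	fmt.Println(len(reservoirSample(rows, 500, 1))) // 100
}
```

The second call shows the no-op case: when the reservoir is larger than the input, every row is kept and no random numbers are drawn.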

brancz

Generally the implementation looks fine (except for the scalability concerns).

I'm unconvinced this will work for the purpose we want it for. Both the use of randomness and the unknown order of records passed through the sampler mean that we cannot guarantee that running the same query twice will return the same data.
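The ordering concern can be illustrated with a small sketch. It assumes a simple Algorithm R sampler with a fixed seed (the `sample` helper is hypothetical and not the code under review): fixing the seed makes the chosen position repeatable, but if the rows arrive in a different order, a different row sits at that position.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sample draws k rows with Algorithm R using a fixed seed. Hypothetical
// helper for illustration only.
func sample(rows []string, k int, seed int64) []string {
	rng := rand.New(rand.NewSource(seed))
	out := make([]string, 0, k)
	for i, r := range rows {
		if len(out) < k {
			out = append(out, r)
		} else if j := rng.Intn(i + 1); j < k {
			out[j] = r
		}
	}
	return out
}

func main() {
	const seed = 42
	first := sample([]string{"a", "b"}, 1, seed)
	again := sample([]string{"a", "b"}, 1, seed)
	swapped := sample([]string{"b", "a"}, 1, seed)

	// Same seed and same arrival order: identical result.
	fmt.Println(first[0] == again[0]) // true
	// Same seed, different arrival order: the same position is chosen, so a
	// different row is returned.
	fmt.Println(first[0] == swapped[0]) // false
}
```

So a fixed seed alone would not be enough; the order in which records reach the sampler would also have to be stable.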

query/logicalplan/builder.go

query/physicalplan/sampler.go

asubiotto · 2024-05-30T11:11:47Z

I think we had agreed to not enforce determinism across query runs since it seems difficult to do and we're unsure it's a hard requirement currently: https://docs.google.com/document/d/1t26jWIT7kvmgyzfQld_KtXzrSw6BJ76cxrU2qIPryD8/edit#heading=h.lb631ysiorah

brancz · 2024-05-30T13:27:26Z

So while there are other ways to achieve this, the requirements in Parca/Polar Signals Cloud that must continue to hold true are:

sharing a link must result in users seeing exactly the same result
data across visualizations must be consistent (e.g. viewing data in an icicle graph must show the same values as tables)
the binary names returned from the new metadata call must be consistent with what's shown in the icicle graph

The only way I can see this continuing to be true would be if we cache results from a query indefinitely, and only render reports based on the cached result.

thorfour · 2024-05-31T14:40:02Z

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     390µs ± 0%  -0.40%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     403kB ± 0%  +0.27%  (p=0.000 n=8+9)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%       707 ± 0%  +2.72%  (p=0.000 n=10+10)

These are the aggregation benchmarks when sampling with a reservoir that is larger than the total number of rows, so there should be little impact on queries that don't need to be sampled.

thorfour · 2024-06-04T18:08:38Z

Latest benchmarks with the optimization commits. First, the aggregate sum test with a sampler whose reservoir exceeds the number of samples (i.e. a logical no-op):

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     395µs ± 1%  +0.91%  (p=0.000 n=9+8)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     403kB ± 0%  +0.27%  (p=0.000 n=8+9)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%       704 ± 0%  +2.40%  (p=0.000 n=10+10)

The same benchmark, but now sampling half of the samples. There is still overhead (+129% time/op), but it is far less than before the optimizations (+880% time/op):

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     898µs ± 0%   +129.34%  (p=0.000 n=9+7)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     736kB ± 0%    +83.08%  (p=0.000 n=8+10)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%     13731 ± 0%  +1895.84%  (p=0.000 n=10+10)

The same benchmark before the optimizations:

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%    3839µs ± 1%   +880.17%  (p=0.000 n=9+10)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%    2875kB ± 0%   +615.38%  (p=0.000 n=8+10)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%     35264 ± 0%  +5025.61%  (p=0.000 n=10+10)

Waiting to slice rows until sampling happens should speed up sampling when the reservoir is larger than the dataset.

With the sampling algorithm fixed to be optimal, only 3 rand calls are needed per sampled row.
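The PR text does not name the algorithm. Three random draws per replaced row is characteristic of Li's "Algorithm L", so the sketch below assumes that is what is meant. `sampleL` and its return values are hypothetical and exist only to make the draw count visible.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sampleL draws k rows using Algorithm L. It returns the sample, the number
// of replacements made after the initial fill, and the number of random
// draws. Hypothetical helper for illustration only.
func sampleL(rows []int64, k int, seed int64) (sample []int64, replaced, draws int) {
	n := len(rows)
	if k <= 0 {
		return nil, 0, 0
	}
	if n <= k {
		// Reservoir is at least as large as the input: keep everything.
		return append([]int64(nil), rows...), 0, 0
	}
	rng := rand.New(rand.NewSource(seed))
	// uniform returns a value in (0, 1] so that math.Log never sees 0.
	uniform := func() float64 {
		draws++
		return 1 - rng.Float64()
	}

	sample = append([]int64(nil), rows[:k]...) // initial fill
	w := math.Exp(math.Log(uniform()) / float64(k))
	i := k - 1 // index of the last row consumed
	for {
		// Draw 1: how many rows to skip before the next replacement.
		skip := math.Floor(math.Log(uniform()) / math.Log1p(-w))
		// Written as a negated "<" so that a NaN skip also ends the loop.
		if !(skip < float64(n-i-1)) {
			break // the skip runs past the end of the input
		}
		i += int(skip) + 1
		// Draw 2: which reservoir slot to overwrite.
		draws++
		sample[rng.Intn(k)] = rows[i]
		replaced++
		// Draw 3: shrink w for the next skip.
		w *= math.Exp(math.Log(uniform()) / float64(k))
	}
	return sample, replaced, draws
}

func main() {
	rows := make([]int64, 10000)
	for i := range rows {
		rows[i] = int64(i)
	}
	sample, replaced, draws := sampleL(rows, 100, 1)
	fmt.Println(len(sample))           // 100
	fmt.Println(draws == 3*replaced+2) // true
}
```

In this sketch the two extra draws are the initial value of `w` and the final skip that overshoots the input. Rows that are skipped cost no random draws at all, which is where the saving over a per-row draw comes from.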

This allows us to perform O(1) replacement of samples instead of an O(n*m) search to find the record to replace. It also avoids unnecessary allocations with NewSlice until the very end, when we know what we need to return.

name                        old time/op    new time/op    delta
_Sampler/10%_10_000_x10-10    7.13ms ± 1%    0.75ms ± 0%  -89.51%  (p=0.000 n=9+10)

name                        old alloc/op   new alloc/op   delta
_Sampler/10%_10_000_x10-10    4.84MB ± 0%    0.31MB ± 0%  -93.65%  (p=0.000 n=10+8)

name                        old allocs/op  new allocs/op  delta
_Sampler/10%_10_000_x10-10     22.0k ± 0%      7.0k ± 0%  -67.90%  (p=0.000 n=9+8)
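The data structure behind the O(1) replacement is not shown here. One way to get that behaviour, sketched below under that assumption, is to keep the reservoir as a flat slice of (record, row) references so that any slot can be overwritten directly, and to copy values out only once at the end. The `record`, `rowRef` and `reservoir` types are hypothetical stand-ins; `record` takes the place of an Arrow record.

```go
package main

import "fmt"

// record stands in for an Arrow record batch (hypothetical type).
type record struct{ values []int64 }

// rowRef points at one sampled row without copying or slicing the record.
type rowRef struct {
	rec *record
	row int
}

// reservoir is a flat slice of references; slot j is addressed directly.
type reservoir struct{ refs []rowRef }

// add appends a row during the initial fill.
func (r *reservoir) add(rec *record, row int) {
	r.refs = append(r.refs, rowRef{rec: rec, row: row})
}

// replace overwrites one slot in O(1); no search through records is needed.
func (r *reservoir) replace(slot int, rec *record, row int) {
	r.refs[slot] = rowRef{rec: rec, row: row}
}

// materialize copies out the sampled values once, at the very end.
func (r *reservoir) materialize() []int64 {
	out := make([]int64, len(r.refs))
	for i, ref := range r.refs {
		out[i] = ref.rec.values[ref.row]
	}
	return out
}

func main() {
	a := &record{values: []int64{10, 11, 12}}
	b := &record{values: []int64{20, 21}}

	var r reservoir
	for row := range a.values {
		r.add(a, row)
	}
	r.replace(1, b, 0) // swap the second sampled row for b's first row

	fmt.Println(r.materialize()) // [10 20 12]
}
```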

Don't preallocate until the initial fill: preallocating is too expensive for the normal case where the reservoir is larger than the sample size.

thorfour requested a review from asubiotto May 29, 2024 18:23

brancz reviewed May 30, 2024

thorfour force-pushed the reserviour-sampler branch from 34da44f to bd300cd Compare May 31, 2024 14:22

thorfour force-pushed the reserviour-sampler branch from bf295bf to 5a5cfe5 Compare May 31, 2024 16:12

thorfour added 19 commits June 4, 2024 13:10

Reservoir sampler

360e62c

lint

0e43450

lint

bdeefe4

Wait to slice rows until sampling happens

6a2fe4b

Fix validation of queries with Sample

95bc12f

retain reservoir record

6a8bb64

optimize replacing singular records

362881d

sampler randomness check

7409fa1

lint

e9023f0

benchmark for sampling

58ed75c

sampling benchmark fixes

4440730

benchmark changes

7997053

benchmark

dc15dce

fix the sampling algorithm to be optimal.

c53f111

WIP: Don't preallocate until initial fill

f156fa9

slice reservoir

2c2ab3e

fix record retain

6796898

remove benchmark as it was moved into sampler path

1b04c98

thorfour force-pushed the reserviour-sampler branch from e841762 to 1b04c98 Compare June 4, 2024 18:10

thorfour merged commit f4615c3 into main Jun 4, 2024
8 checks passed

thorfour deleted the reserviour-sampler branch June 4, 2024 18:12
