Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reservoir sampler #884

Merged
merged 19 commits into from
Jun 4, 2024
Merged

Reservoir sampler #884

merged 19 commits into from
Jun 4, 2024

Conversation

thorfour
Copy link
Contributor

Adds a new operator to the query engine to perform random sampling of the rows.

Initially this was implemented in the physical conversion layer but it required us to push down the filter into the physical layer which created a significant amount of overhead as filtering on parquet values is slow. That approach has been abandoned for now but can be revisited in the future.

@thorfour thorfour requested a review from asubiotto May 29, 2024 18:23
Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the implementation looks fine (except for the scalability concerns).

I'm unconvinced this will work for the purpose we want it for, both the use of randomness, as well as the unknown order of records passed through the sampler means that we cannot guarantee that the result of running the same query twice will return the same data.

query/logicalplan/builder.go Show resolved Hide resolved
query/physicalplan/sampler.go Outdated Show resolved Hide resolved
query/physicalplan/sampler.go Outdated Show resolved Hide resolved
@asubiotto
Copy link
Member

I think we had agreed to not enforce determinism across query runs since it seems difficult to do and we're unsure it's a hard requirement currently: https://docs.google.com/document/d/1t26jWIT7kvmgyzfQld_KtXzrSw6BJ76cxrU2qIPryD8/edit#heading=h.lb631ysiorah

@brancz
Copy link
Member

brancz commented May 30, 2024

So while there are other ways to achieve this, the requirements in Parca/Polar Siganls Cloud that must continue to hold true are:

  1. sharing a link must result in users seeing exactly the same result
  2. data across visualizations must be consistent (eg. viewing data in an icicle graph must show the same values as tables)
  3. the binary names returned from the new metadata call must be consistent with what's shown in the icicle graph

The only way I can see this continuing to be true would be if we cache results from a query indefinitely, and only render reports based on the cached result.

@thorfour thorfour force-pushed the reserviour-sampler branch from 34da44f to bd300cd Compare May 31, 2024 14:22
@thorfour
Copy link
Contributor Author

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     390µs ± 0%  -0.40%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     403kB ± 0%  +0.27%  (p=0.000 n=8+9)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%       707 ± 0%  +2.72%  (p=0.000 n=10+10)

Aggregation benchmarks when sampling with a reservoir that is > total rows. So there should be little impact to queries that don't need to be sampled.

@thorfour thorfour force-pushed the reserviour-sampler branch from bf295bf to 5a5cfe5 Compare May 31, 2024 16:12
@thorfour
Copy link
Contributor Author

thorfour commented Jun 4, 2024

Latest benchmarks with the optimization commits; the aggregate sum test where a sampler is added that exceeds the number of samples (i.e logical no-op)

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     395µs ± 1%  +0.91%  (p=0.000 n=9+8)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     403kB ± 0%  +0.27%  (p=0.000 n=8+9)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%       704 ± 0%  +2.40%  (p=0.000 n=10+10)

Same benchmark but now sampling half of the samples. Obviously there's overhead but it's a lot less than it was before the optimizations

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     898µs ± 0%   +129.34%  (p=0.000 n=9+7)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     736kB ± 0%    +83.08%  (p=0.000 n=8+10)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%     13731 ± 0%  +1895.84%  (p=0.000 n=10+10)

Before optimizations is below:

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%    3839µs ± 1%   +880.17%  (p=0.000 n=9+10)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%    2875kB ± 0%   +615.38%  (p=0.000 n=8+10)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%     35264 ± 0%  +5025.61%  (p=0.000 n=10+10)

thorfour added 19 commits June 4, 2024 13:10
This should speed up sampling when the reserviour is larger than the
dataset
Only 3 rand calls are needed per sampled row
This allows us to perform O(1) replacement of samples instead of
a O(n*m) search to find the record to replace. It also avoids unecessary
allocations with NewSlice until the very end where we know what we need
to return.

name                        old time/op    new time/op    delta
_Sampler/10%_10_000_x10-10    7.13ms ± 1%    0.75ms ± 0%  -89.51%  (p=0.000 n=9+10)

name                        old alloc/op   new alloc/op   delta
_Sampler/10%_10_000_x10-10    4.84MB ± 0%    0.31MB ± 0%  -93.65%  (p=0.000 n=10+8)

name                        old allocs/op  new allocs/op  delta
_Sampler/10%_10_000_x10-10     22.0k ± 0%      7.0k ± 0%  -67.90%  (p=0.000 n=9+8)
It's too expensive for the normal case where the reserviour
is larger than the sample size
@thorfour thorfour force-pushed the reserviour-sampler branch from e841762 to 1b04c98 Compare June 4, 2024 18:10
@thorfour thorfour merged commit f4615c3 into main Jun 4, 2024
8 checks passed
@thorfour thorfour deleted the reserviour-sampler branch June 4, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants