-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reservoir sampler #884
Reservoir sampler #884
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally the implementation looks fine (except for the scalability concerns).
I'm unconvinced this will work for the purpose we want it for, both the use of randomness, as well as the unknown order of records passed through the sampler means that we cannot guarantee that the result of running the same query twice will return the same data.
I think we had agreed to not enforce determinism across query runs since it seems difficult to do and we're unsure it's a hard requirement currently: https://docs.google.com/document/d/1t26jWIT7kvmgyzfQld_KtXzrSw6BJ76cxrU2qIPryD8/edit#heading=h.lb631ysiorah |
So while there are other ways to achieve this, the requirements in Parca/Polar Siganls Cloud that must continue to hold true are:
The only way I can see this continuing to be true would be if we cache results from a query indefinitely, and only render reports based on the cached result. |
34da44f
to
bd300cd
Compare
thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name old time/op new time/op delta
Aggregation/sum-10 392µs ± 0% 390µs ± 0% -0.40% (p=0.000 n=9+9)
name old alloc/op new alloc/op delta
Aggregation/sum-10 402kB ± 0% 403kB ± 0% +0.27% (p=0.000 n=8+9)
name old allocs/op new allocs/op delta
Aggregation/sum-10 688 ± 0% 707 ± 0% +2.72% (p=0.000 n=10+10) Aggregation benchmarks when sampling with a reservoir that is > total rows. So there should be little impact to queries that don't need to be sampled. |
bf295bf
to
5a5cfe5
Compare
Latest benchmarks with the optimization commits; the aggregate sum test where a sampler is added that exceeds the number of samples (i.e logical no-op) thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name old time/op new time/op delta
Aggregation/sum-10 392µs ± 0% 395µs ± 1% +0.91% (p=0.000 n=9+8)
name old alloc/op new alloc/op delta
Aggregation/sum-10 402kB ± 0% 403kB ± 0% +0.27% (p=0.000 n=8+9)
name old allocs/op new allocs/op delta
Aggregation/sum-10 688 ± 0% 704 ± 0% +2.40% (p=0.000 n=10+10) Same benchmark but now sampling half of the samples. Obviously there's overhead but it's a lot less than it was before the optimizations thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name old time/op new time/op delta
Aggregation/sum-10 392µs ± 0% 898µs ± 0% +129.34% (p=0.000 n=9+7)
name old alloc/op new alloc/op delta
Aggregation/sum-10 402kB ± 0% 736kB ± 0% +83.08% (p=0.000 n=8+10)
name old allocs/op new allocs/op delta
Aggregation/sum-10 688 ± 0% 13731 ± 0% +1895.84% (p=0.000 n=10+10) Before optimizations is below: thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name old time/op new time/op delta
Aggregation/sum-10 392µs ± 0% 3839µs ± 1% +880.17% (p=0.000 n=9+10)
name old alloc/op new alloc/op delta
Aggregation/sum-10 402kB ± 0% 2875kB ± 0% +615.38% (p=0.000 n=8+10)
name old allocs/op new allocs/op delta
Aggregation/sum-10 688 ± 0% 35264 ± 0% +5025.61% (p=0.000 n=10+10) |
This should speed up sampling when the reserviour is larger than the dataset
Only 3 rand calls are needed per sampled row
This allows us to perform O(1) replacement of samples instead of a O(n*m) search to find the record to replace. It also avoids unecessary allocations with NewSlice until the very end where we know what we need to return. name old time/op new time/op delta _Sampler/10%_10_000_x10-10 7.13ms ± 1% 0.75ms ± 0% -89.51% (p=0.000 n=9+10) name old alloc/op new alloc/op delta _Sampler/10%_10_000_x10-10 4.84MB ± 0% 0.31MB ± 0% -93.65% (p=0.000 n=10+8) name old allocs/op new allocs/op delta _Sampler/10%_10_000_x10-10 22.0k ± 0% 7.0k ± 0% -67.90% (p=0.000 n=9+8)
It's too expensive for the normal case where the reserviour is larger than the sample size
e841762
to
1b04c98
Compare
Adds a new operator to the query engine to perform random sampling of the rows.
Initially this was implemented in the physical conversion layer but it required us to push down the filter into the physical layer which created a significant amount of overhead as filtering on parquet values is slow. That approach has been abandoned for now but can be revisited in the future.