-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distribution metric to semantic-core and ffwd-reporter #79
Conversation
core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListenerV2.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java
Outdated
Show resolved
Hide resolved
Hey @ao2017 - could you provide a Use Case-like description please? With links to the actual US/Issue and supporting links e.g. T-digest, codehale. Also, I'm quite confused about |
/** | ||
* A no-op implementation of {@link MetricRegistryListener}. | ||
*/ | ||
abstract class Base implements MetricRegistryListener { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this NoOpMetricRegistryListener, or something else meaningful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That 's the original name Let me see if changing the name won't brake anything.
core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/MetricRegistryListener.java
Outdated
Show resolved
Hide resolved
* @param name the meter's name | ||
* @param derivingMeter the meter | ||
*/ | ||
void onDerivingMeterAdded(String name, DerivingMeter derivingMeter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is a distribution a Deriving Meter? I can't quite tell. Here's the definition of a deriving meter:
* A meter that takes the derivative of a value that is expected to be almost monotonically
* increasing. A typical use case is to get the rate of change of a counter of the total number of
* occurrences of something.
* <p>
* Implementations will ignore updates that are a decrease of the counter value. The rationale is
* that the counter is expected to be monotonically increasing between infrequent resets (such as
* when a process has been restarted). Thus, negative values should only happen on restart, and it
* should be safe to discard those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it is not. Distribution captures the recorded data distribution. It is used to compute percentile in a distributed environment. Can one use distribution to compute rate ? I doubt it won't be accurate since distribution is just capturing an approximation not the actual count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do distributions relate to histograms? https://github.com/infusionsoft/yammer-metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Histogram.java
Do we need to have both? Is the newly added Distribution a superset of histogram?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So distribution is a name for the type of Histogram that computes rank related stat using actual data distribution.
Dropwizard histogram doesn't support that feature. For more info, take a look at this issue spotify/heroic#476
Do we need to have both ?
At the moment we think yes. Some users may have a preference for local histogram. And I think it is the standard in the industry to support both at least DataDog and OpenTelemetry support both types.
@@ -107,6 +108,29 @@ public void onTimerRemoved(String name) { | |||
tryRemove(name); | |||
} | |||
|
|||
|
|||
@Override | |||
public void onDerivingMeterAdded(String name, DerivingMeter derivingMeter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about onHistogramRemoved()
and onHistogramAdded()
above. Are they a different "kind" of histogram or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't understand this question :).
core/src/main/java/com/spotify/metrics/core/DistributionImpl.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
/** | ||
* Returns the current count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not terribly useful :) Either remove or explain what's being counted, cheers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, something is better than guessing :) . I will update.
private final AtomicReference<TDigest> distRef; | ||
|
||
protected DistributionImpl() { | ||
this.distRef = new AtomicReference<>(TDigest.createDigest(COMPRESSION_DEFAULT_LEVEL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the rationale behind using an AtomicReference
? With AtomicRefernce, the only guarantee is that this reference will be updated/set in a thread-safe manner and AFAICT we don't update it...?
Also, AFAICT, multiple threads could still overwrite this.distRef
here, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's ok, As long as each thread has its own copy we are good.
ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java
Show resolved
Hide resolved
ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java
Show resolved
Hide resolved
core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistry.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
public void record(double val) { | ||
distRef.get().add(val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TDigest.add() does not appear to be threadsafe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am looking into this. I checked couple of Dropwizard metrics there are threadsafe so I will make this thread safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could either
- add synchronized around it and hope it's enough.
- Use ThreadLocal and then merge all the thread local entries somehow inside getValueAndFlush
- Push to a queue and have another thread popping them off and adding to the TDigest object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the implementation I added yesterday. I thought about adding threadLocal then change mind. ThreadLocal may introduce memory leak if getValueAndFlush isn't called often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that should work now I think (but it's hard to know if there will be much contention, some multithreaded JMH benchmarks might be useful).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sure will do. Please take a look I need a +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(@malish8632 asked me to take a look since I've worked with these classes before)
The main question I have is - what is a Distribution
? I think the initial PR summary is missing the motivation behind the addition or what it means to achieve (except adding new classes).
Is it necessary to have a separate interface and implementation for Distribution
and DistributionImpl
, when the other core types (Timer, Meter, Histogram) don't follow this pattern? Those classes implement interfaces, but those are more generic interfaces that have multiple implementations (i.e. Timer
and Meter
are implementations of Metered
) which allows for abstraction in some places (and code reuse). I don't see the similar need for Distribution
unless you plan to add more implementations of it later on, in which case DistributionImpl
could use a more descriptive name.
A few suggestions/questions below - I don't think you need to go to the trouble of declaring V2
interfaces or new packages, it'd be simpler to update the version of the library to indicate changes to the API of SemanticMetricRegistry
and friends.
core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListenerV2.java
Outdated
Show resolved
Hide resolved
* under the License. | ||
*/ | ||
|
||
package com.spotify.metrics.core.codahale.metrics.ext; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add the new Distribution
type to this package, instead of com.spotify.metrics.core
? There are interfaces in core
that refer to this type, so I think it would be natural to have it live in com.spotify.metrics.core
as well
If you want to add it in a subpackage, I'd still recommend against this specific name, its not like any classes live in or com.spotify.metrics.core.codahale
com.spotify.metrics.core.codahale.metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattnworb I don't see how SemanticMetricRegistry.addListener(SemanticMetricRegistryListener). is broken with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with Matt - could we use com.spotify.metrics.core
instead?
core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/MetricRegistryListener.java
Outdated
Show resolved
Hide resolved
ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java
Show resolved
Hide resolved
@@ -396,4 +338,101 @@ public void stop() { | |||
public void close() { | |||
stop(); | |||
} | |||
|
|||
public static final class Builder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd probably be better to make separate PRs for changes like reordering the structure of the class or whitespace changes, as it makes the git history harder to follow if its combined with this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to this comment - pls separate this into diff PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
1. Remove SemanticMetricRegistryListenerV2 2. Added Default methods to SemanticMetricRegistryListener 3. We understand that there is a chance for SemanticMetricRegistry to call a no op methods. This can lead to confusion hopefully users of this lib will look at the source code or documentation.
core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistry.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListener.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better. Some changes should help simplify this PR and clarify type packages.
…tricRegistry.getDistribution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for resolving all of my earlier comments, and I find the PR summary very helpful now - thanks for both of those.
I have a few small questions below
core/src/main/java/com/spotify/metrics/core/SemanticMetricDistribution.java
Show resolved
Hide resolved
* is reset and a new recording starts. | ||
* @return | ||
*/ | ||
ByteBuffer getValueAndFlush(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still curious about the idea of exposing the "value" of the Distribution as a ByteBuffer
. It seems like it would couple this Distribution
metric to the way in which the t-digest library serializes the underlying com.tdunning.math.stats.TDigest
into bytes.
For instance, if there was an alternate implementation of Distribution
in the future, it would have to return the values in the same byte format, since FastForwardReporter
has the expectation that the results of getValueAndFlush()
must look a certain way.
Is there any other data type that could be used? Or at least could the format of the bytes be documented here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use Distribution as the return type then the serializer will have to know something about Distribution implementation. For instance, FFW reporter will have an implementation of this method for every Distribution type.
https://github.com/spotify/semantic-metrics/pull/79/files/4ae53fe81b12e35376529ab79184798e982e7ade#diff-455c8027b591558e13ad0a4d245b2518R265. That will not align with the current paradigm of not exposing the detail of the metric implementation to the serializer.
Documentation for bytes format is handled by FFW. There is a version number, we will use that number upstream to determine how to handle the bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Return something immutable like ByteString
- Instead of ByteString, return something more semantically meaningful like a DistributionSnapshot wrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I am not sure about DistributionSnapshot, that will expose Distribution implementation to ffwReporter as noted above.
- Let's merge this branch. Bytebuffer is transformed into ByteString before serialization in ffw-client. I will change the return type later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
private void reportDistribution(final com.spotify.ffwd.v1.Metric metric, | ||
final Distribution distribution) { | ||
ByteBuffer byteBuffer = distribution.getValueAndFlush(); | ||
Value value = Value.distributionValue(byteBuffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed this PR: spotify/ffwd-client-java#10
Can we update the ffwd API to use ByteString instead of ByteBuffer?
I think it should be safe to change that if no one is using the distribution types yet (since they are not implemented yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the entire PR you will see that ByteBuffer is transformed into ByteString before serialization. Moving byteString up in the stack is something we can do in the next PR.
@@ -0,0 +1,88 @@ | |||
/* | |||
* Copyright (c) 2016 Spotify AB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update year
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments @ao2017 .
* Just get an instance through SemanticMetricBuilder and record data. | ||
* | ||
* <p> {@link Distribution} is a good choice if you care about percentile accuracy in | ||
* a distributed environment and you want to rely on P99 to set SLO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent comment!
* From P99.9 to P99.999 the error rate is slightly higher than 2%. | ||
* | ||
*/ | ||
public final class SemanticMetricDistribution implements Distribution { | ||
|
||
private static final int COMPRESSION_DEFAULT_LEVEL = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might we want to change this level? Should it be configuration-driven perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for the time being it is better to keep every aspect of Tdigest private. Power users have the ability to extend this class and use a different compression level. The typical use case 100 is ok.
@@ -93,6 +93,16 @@ public void onDerivingMeterAdded(MetricId name, DerivingMeter derivingMeter) { | |||
@Override | |||
public void onDerivingMeterRemoved(MetricId name) { | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not a fan of this class name. Is it convention?
private TagExtractor tagExtractor; | ||
private ScheduledExecutorService executorService; | ||
|
||
private Set<Percentile> histogramPercentiles = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these particular numbers? Can we comment or const them please.
public Builder histogramQuantiles(double... quantiles) { | ||
histogramPercentiles = new HashSet<>(); | ||
for (double q : quantiles) { | ||
histogramPercentiles.add(new Percentile(q)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we're not validating that the double is in range?
PR raison d'être
We are adding distribution support to resolve 2 problems.
1. Accurate Aggregated Histogram Data.
Currently percentile data provided by heroic are computed locally.
It is practically impossible to aggregate that data.
With this PR we are introducing a mechanism to record data and send data sketches
to heroic. A sketch of a dataset is a small data structure that let you approximate certain characteristics of the original
dataset. We use sketches to compute rank based statistics such as percentile. Sketches are mergeable and can be used
to compute any percentile on the entire data distribution.
2. Support sophisticated data point value.
We currently have many libraries that emit Open-census metrics. We cannot export Open-census metric to heroic because heroic supports only double data point value. With the introduction of Distribution we will be able to support sophisticated data point value such as Open-census metric distribution.
What's new ?
This PR introduces Distribution. Unlike most metric supported by SemanticMetric this metric is not a DropWizard metric.
However, Distribution implements DropWizard Metric and Counting interface. This is done so we can manage Distribution metric through Semantic Registry which is an extension of DropWizard Registry.
This PR also introduces SemanticMetricDistribution. SemanticMetricDistribution is an implementation of Distribution backed by Tdigest. Check the links below for the rational of using Tdigest.
This PR also refactors FastForwardReporter to add distribution support.
All efforts were made to ensure compatibility. This changes should be an Opt-in for Heroic Users. They shouldn't have to make any change to existing code unless they want to use distribution. At least that is the goal.
Related Issues and PR's
spotify/heroic#476
spotify/ffwd-client-java#13
#78
spotify/heroic#673
spotify/heroic#677