Retry on configurable exception #6991
Resolves #6962 |
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #6991 +/- ##
============================================
+ Coverage 89.88% 89.96% +0.07%
- Complexity 6655 6663 +8
============================================
Files 748 748
Lines 20077 20085 +8
Branches 1969 1970 +1
============================================
+ Hits 18047 18070 +23
+ Misses 1435 1423 -12
+ Partials 595 592 -3
☔ View full report in Codecov by Sentry.
RetryInterceptor::isRetryableException,
e ->
    retryPolicy.getRetryExceptionPredicate().test(e)
        || RetryInterceptor.isRetryableException(e),
The OR here is interesting. It means a user can choose to expand the definition of what is retryable but not reduce it. I wonder if there are any cases when you would not want to retry when the default would retry. 🤔
a user can choose to expand the definition of what is retryable but not reduce it
it's exactly the idea
I wonder if there are any cases when you would not want to retry when the default would retry
I would say no 🤔
I would say no 🤔
I think we might..
Suppose we want to expose options that give the user full control over what's retryable (as I alluded to at the end of this comment), we'd probably want to do something like:
- Expose a single configurable predicate option of the form setRetryPredicate(Predicate<Throwable>)
- Funnel all failed requests through this predicate, whether they resolved a response with a status code or ended with an exception
- This means we'd need to translate requests with a non-200 status code to an equivalent exception to pass to the predicate
- If the user doesn't define their own predicate, default to one that retries when the status is retryable (429, 502, 503, 504) or when the exception is retryable (like one of the SocketTimeoutException cases we've discussed).
In this case, it's possible that a user doesn't want to retry on a particular response status code like 502, even when the default behavior is to retry on it.
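A minimal sketch of how such a single-predicate design could behave, for illustration only. HttpStatusException, SinglePredicateSketch, effectivePredicate and NO_RETRY_ON_502 are hypothetical names that do not exist in the SDK, and the default shown here is an assumption based on the list above rather than the library's actual logic:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.function.Predicate;

/** Hypothetical stand-in for an exception that wraps a non-200 response. */
final class HttpStatusException extends IOException {
  private final int statusCode;

  HttpStatusException(int statusCode) {
    super("HTTP " + statusCode);
    this.statusCode = statusCode;
  }

  int statusCode() {
    return statusCode;
  }
}

final class SinglePredicateSketch {
  /** Assumed default: retry on 429/502/503/504 or on a socket timeout. */
  static final Predicate<Throwable> DEFAULT_RETRY_PREDICATE =
      t -> {
        if (t instanceof HttpStatusException) {
          int code = ((HttpStatusException) t).statusCode();
          return code == 429 || code == 502 || code == 503 || code == 504;
        }
        return t instanceof SocketTimeoutException;
      };

  /** A user-supplied predicate replaces the default; null means "use the default". */
  static Predicate<Throwable> effectivePredicate(Predicate<Throwable> userPredicate) {
    return userPredicate != null ? userPredicate : DEFAULT_RETRY_PREDICATE;
  }

  /** Example override: keep the default behavior but never retry on 502. */
  static final Predicate<Throwable> NO_RETRY_ON_502 =
      DEFAULT_RETRY_PREDICATE.and(
          t -> !(t instanceof HttpStatusException
              && ((HttpStatusException) t).statusCode() == 502));
}
```

With an OR-only extension point, NO_RETRY_ON_502 could not be expressed, because the default would still answer "retry" for 502.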
I did not fully understand the idea
It means a user can choose to expand the definition of what is retryable but not reduce it. I wonder if there are any cases when you would not want to retry when the default would retry.
the current implementation only allows extending the retry conditions, but we can change it to override or extend, so the user will have this option, and it's probably what you meant by this comment below
Expose a single configurable predicate option of the form setRetryPredicate(Predicate<Throwable>)
so we could achieve it
Funnel all failed requests through this predicate, whether they resolved a response with a status code or ended with an exception
we probably should not mix status codes and exceptions in the same predicate but use two predicates separately. what do you think? 🤔
we probably should not mix status codes and exceptions in the same predicate but use two predicates separately. what do you think? 🤔
I've seen HTTP clients generate exceptions when the status code is 4xx or 5xx, and this is essentially going in that direction. All non-200 results flow through a Predicate<Throwable>, and the caller can set up different rules for different Throwables. Supposing we translate non-200 responses to an exception class like our HttpExportException, example usage might look like:
// Override default retry policy to retry on all 5xx errors, and on any IOException
builder.setRetryPredicate(throwable -> {
  if (throwable instanceof HttpExportException) {
    int httpStatusCode = ((HttpExportException) throwable).getResponse().statusCode();
    return httpStatusCode >= 500;
  }
  return throwable instanceof IOException;
});
I suppose it's not super important whether exceptions and error responses are handled in a single predicate or in two separate predicates, but I think my point here is still valid, which suggests we should give the user the ability to not retry on some IOException we normally retry by default:
If the user doesn't define their own predicate, default to one that retries when the status is retryable (429, 502, 503, 504) or when the exception is retryable (like one of the SocketTimeoutException cases we've discussed).
I am a bit confused.
Do you want to say that currently we do not retry on default retryable status codes?
so in case of 429, 502, 503, 504 we throw exception in line 91 and line 96 is never executed
Lines 90 to 110 in 5640314
try {
  response = chain.proceed(chain.request());
} catch (IOException e) {
  exception = e;
}
if (response != null) {
  boolean retryable = Boolean.TRUE.equals(isRetryable.apply(response));
  if (logger.isLoggable(Level.FINER)) {
    logger.log(
        Level.FINER,
        "Attempt "
            + attempt
            + " returned "
            + (retryable ? "retryable" : "non-retryable")
            + " response: "
            + responseStringRepresentation(response));
  }
  if (!retryable) {
    return response;
  }
}
Thanks for the PR!
Wondering if you could elaborate on these, since it's possible that the errors aren't actually environment-specific and everyone could benefit from them. My initial inclination was that we should just update the static definition of what constitutes a retryable exception, but I'm open to being wrong.
hey @jack-berg
3 exceptions from me:
Also, recently we had (retryable) network issues when using SSL. Luckily they were solved by a Java upgrade, so I can neglect them, but handling them might be useful for some users on some Java versions.
I don't mind putting all of that into the "static" definition, but the reason I want to have it configurable is the ability to apply a quick fix when a new retryable exception is discovered. Also, I don't know all the exceptions that other people encounter in their networks, so the list of exceptions is not complete.
So I can combine it all in the static definition in addition to the current retryable exceptions, but I want to preserve the dynamic config as well.
Tell me your thoughts about it.
there is a flaky test in one of the checks, not related to my change because it's in the metrics product 🙈
Hmm.. let's think about these. They are clearly the result of some sort of timeout occurring. We take the arguments for
The callTimeout represents the total allotted time for everything to resolve with the call, including resolving DNS, connecting, writing the request body, reading the response body, and any additional retry requests. So if this is exceeded, it won't do any good to retry because there's no allotted time left for the additional attempts to resolve.
The connectTimeout is a little different. It represents the max amount of time for connecting a TCP socket to the target host. If this is less than callTimeout, then there is still time remaining in the allotted callTimeout for a retry to succeed. And so I think it's correct to retry if this occurs, and if we look at RetryInterceptor, we see there's an attempt to retry in this type of situation. So I think it's appropriate to extend the condition to include the exception you're seeing:
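To make the time-budget argument concrete, here is a small illustrative sketch. It is not the RetryInterceptor change itself. RetryBudgetSketch and its methods are hypothetical, and the model is deliberately simplified: every attempt is assumed to fail after exactly connectTimeout, and backoff delays are ignored.

```java
import java.time.Duration;

final class RetryBudgetSketch {
  /** True if another attempt can still start before the overall call budget runs out. */
  static boolean hasBudgetForRetry(Duration callTimeout, Duration elapsed) {
    return elapsed.compareTo(callTimeout) < 0;
  }

  /** Counts the attempts that can start when each one fails after connectTimeout. */
  static int attemptsThatFit(Duration callTimeout, Duration connectTimeout) {
    if (connectTimeout.isZero() || connectTimeout.isNegative()) {
      throw new IllegalArgumentException("connectTimeout must be positive");
    }
    int attempts = 0;
    Duration elapsed = Duration.ZERO;
    while (hasBudgetForRetry(callTimeout, elapsed)) {
      attempts++;
      elapsed = elapsed.plus(connectTimeout);
    }
    return attempts;
  }
}
```

Under this model a 10 second callTimeout with a 3 second connectTimeout leaves room for several attempts, while a connectTimeout equal to the callTimeout leaves room for only one, which is why retrying after the call timeout itself is pointless.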
There's two additional OkHttpClient settings that we don't configure: This still doesn't address the
This is the last one that's unaddressed. This exception is thrown when DNS lookup fails for the given host. I know that the java runtime caches DNS results, but wasn't sure what it does with DNS lookup failures. I did some searching and found that negative DNS cache TTL is controlled by a property called networkaddress.cache.negative.ttl which defaults to 10 seconds. This indicates that we should in fact retry when UnknownHostException occurs because there's a chance that the error is transient and succeeds with the next attempt.
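A rough sketch of what treating DNS failures as retryable could look like, together with a way to inspect the negative cache TTL mentioned above. DnsRetrySketch and its members are hypothetical names used for illustration; only Security.getProperty and the property name come from the JDK.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.function.Predicate;

final class DnsRetrySketch {
  /** Retry when the DNS lookup failed, since the failure may be transient. */
  static final Predicate<IOException> RETRY_ON_DNS_FAILURE =
      e -> e instanceof UnknownHostException;

  /** Returns the configured negative DNS cache TTL in seconds, or "unset" if absent. */
  static String negativeDnsCacheTtl() {
    String ttl = Security.getProperty("networkaddress.cache.negative.ttl");
    return ttl != null ? ttl : "unset";
  }
}
```

If the backoff between attempts is shorter than the negative TTL, a retry may simply hit the cached failure again, so the two settings are worth considering together.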
This is a good point. One downside I can think of to adding the proposed |
thanks for sharing your ideas here we have two stack traces
in case of the first stack trace I believe we will retry on there might be more edge cases and we have to be ready for them, so I will keep watching
Could you tell why it's downside? we could start with exceptions and later in future add response codes for http and gRPC
thanks for sharing I will have a look at the docs. Just wondering if the case with exceptions should be in spec, as for me spec is not language specific, but I am not sure I fully understand the open-telemetry project. Tell me please if there was something similar in the past where we had a difference in spec using different languages or we probably have spec per language 🤔 |
I think the fastest way to get results for you is to split up this work. Have one PR that expands the existing retry predicate to ensure that we retry on the types of exceptions you've seen in the real world. Have a second PR where we propose extending Given our strong API backwards compatibility guarantees, we're generally fairly slow / methodical / careful with our API design. But I don't want this process to slow down getting concrete results for you, and I believe that it makes sense to expand the set of retryable exceptions to reflect the ones you've seen.
c37e78d
to
a367900
Compare
I have made some changes in the PR: I added a predicate to RetryPolicy so we can extend or override the list of retryable exceptions.
sdk/common/src/main/java/io/opentelemetry/sdk/common/export/RetryPolicy.java
Thanks for working through the feedback and driving this forward!
thanks for guiding me forward :) |
public abstract RetryPolicyBuilder setRetryExceptionPredicate(
    Predicate<IOException> retryExceptionPredicate);
did you consider a customizer instead, to allow choice of either replacing or enhancing the default predicate?
I realize it's not the friendliest API, but we've found the pattern really useful in AutoConfigurationCustomizer
public abstract RetryPolicyBuilder setRetryExceptionPredicate(
    Predicate<IOException> retryExceptionPredicate);
public abstract RetryPolicyBuilder setRetryExceptionPredicateCustomizer(
    Function<Predicate<IOException>, Predicate<IOException>> retryExceptionCustomizer);
and later use something like this
static class MyFunction implements Function<Predicate<IOException>, Predicate<IOException>> {
  private final Predicate<IOException> newPredicate = e -> {
    return false; // logic for my retryable condition
  };

  @Override
  public Predicate<IOException> apply(Predicate<IOException> defaultPredicate) {
    // Use this statement to extend the default predicate:
    return newPredicate.or(defaultPredicate);
    // Or use this one instead to override it (only one return can be active):
    // return newPredicate;
  }
}
it depends on how many users will want to extend it. I started the PR quickly with the thought that I need to extend the retry policy, but after some time I realized that if a user wants to change the default policy, it's ok just to override.
but please tell me if you think it's nice to have the enhance option for it, and I will cover it in this PR and the next PR.
Please note that I also have a plan to change the defaults for retryable exceptions.
The advantage of a predicate customizer is that you can do things like:
return defaultPredicate.and(..)
return defaultPredicate.or(..)
If you're just wholesale overwriting the default predicate, you don't get anything from having a customizer:
return myCustomPredicate
This benefit is somewhat diminished if we make the default retry predicates accessible as part of our stable API (related to #6970), in which case the and(..) and or(..) cases can be achieved without a customizer by referencing the default predicate API.
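For comparison, a short sketch of the three customizer outcomes discussed here. CustomizerSketch, DEFAULT, and the three customizers are hypothetical; DEFAULT is only a stand-in, not the SDK's real default predicate.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.function.Function;
import java.util.function.Predicate;

final class CustomizerSketch {
  /** Stand-in for the library default (assumption for illustration). */
  static final Predicate<IOException> DEFAULT = e -> e instanceof SocketTimeoutException;

  /** Applies a user customizer to the default, as a builder might do internally. */
  static Predicate<IOException> customize(
      Function<Predicate<IOException>, Predicate<IOException>> customizer) {
    return customizer.apply(DEFAULT);
  }

  /** Widen: default OR DNS failures. */
  static final Function<Predicate<IOException>, Predicate<IOException>> WIDEN =
      d -> d.or(e -> e instanceof UnknownHostException);

  /** Narrow: default AND the exception carries a message. */
  static final Function<Predicate<IOException>, Predicate<IOException>> NARROW =
      d -> d.and(e -> e.getMessage() != null);

  /** Replace: ignore the default entirely; the customizer adds nothing here. */
  static final Function<Predicate<IOException>, Predicate<IOException>> REPLACE =
      d -> e -> e instanceof UnknownHostException;
}
```

If DEFAULT were exposed publicly, WIDEN and NARROW could be written as DEFAULT.or(..) and DEFAULT.and(..) and passed to a plain setter, which is the diminished-benefit case described above.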
I realize it's not the friendliest API
This is the key bit to me. I'd say that API ergonomics feel really foreign to someone who hasn't written an AutoConfigurationCustomizer. That pattern doesn't show up anywhere else in our API.
I'd say that API ergonomics feel really foreign to someone who hasn't written an AutoConfigurationCustomizer. That pattern doesn't show up anywhere else in our API.
good point
The number of retryable exceptions is very limited in the current logic, so we lose data whenever any other IO exception (one not mentioned in the current Java code) happens.
Since networks differ, we might experience different exceptions. In my environment I caught a few exceptions that very likely need to be retried, and they are not listed as retryable in the current code.
Since each environment is different, I suggest adding the ability to configure retryable exceptions.
The change is fully backward compatible and does not change the default behaviour of the library.