[EXPORTER] Support handling retry-able errors for OTLP/gRPC #3219

Open
wants to merge 14 commits into base: main

@chusitoo (Contributor) commented Dec 22, 2024

Fixes #2049

Changes

This change introduces a retry policy in the OTLP/gRPC exporter for select failures, via the gRPC service config mechanism.

  • Add support to set retry values via environment variables.
  • Enabled by default, using the same configuration values as in OTel java and dotnet.
  • Users can opt-out of the retry capabilities by zeroing out any (or all) of the retry settings.
  • Set service config JSON when creating gRPC channel if all parameters are non-zero.
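
The last two bullets can be sketched as follows. This is a minimal illustration and not the code in this PR: `RetrySettings`, `BuildRetryServiceConfig`, the example values, and the list of retryable status codes are all assumptions made for the example. It shows one way to turn four retry settings into a gRPC service config JSON string, and to skip the config entirely when any setting is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical holder for the four retry settings (names are illustrative).
struct RetrySettings
{
  std::uint32_t max_attempts;
  float initial_backoff_s;   // seconds
  float max_backoff_s;       // seconds
  float backoff_multiplier;
};

// Returns a gRPC service config JSON string with a retry policy, or an empty
// string when any setting is zero (which is treated as "retries disabled").
std::string BuildRetryServiceConfig(const RetrySettings &s)
{
  if (s.max_attempts == 0 || s.initial_backoff_s == 0.0f || s.max_backoff_s == 0.0f ||
      s.backoff_multiplier == 0.0f)
  {
    return std::string{};
  }

  // Durations are written as decimal seconds with an "s" suffix.
  // The status code list is an assumption chosen for this sketch.
  char buffer[512];
  const int written = std::snprintf(
      buffer, sizeof(buffer),
      R"json({"methodConfig":[{"name":[{}],"retryPolicy":{)json"
      R"json("maxAttempts":%u,"initialBackoff":"%.1fs","maxBackoff":"%.1fs",)json"
      R"json("backoffMultiplier":%.1f,"retryableStatusCodes":[)json"
      R"json("CANCELLED","DEADLINE_EXCEEDED","ABORTED","OUT_OF_RANGE",)json"
      R"json("DATA_LOSS","UNAVAILABLE"]}}]})json",
      static_cast<unsigned>(s.max_attempts), static_cast<double>(s.initial_backoff_s),
      static_cast<double>(s.max_backoff_s), static_cast<double>(s.backoff_multiplier));
  assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buffer));
  return std::string(buffer, static_cast<std::size_t>(written));
}
```

A non-empty result would then be handed to the channel, for example through `grpc::ChannelArguments::SetServiceConfigJSON`, before the channel is created. An empty result means the channel is created without a retry policy.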

The changes to support retries for the OTLP/HTTP exporter are addressed in #3223.

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

netlify bot commented Dec 22, 2024

Deploy Preview for opentelemetry-cpp-api-docs ready!

Name Link
🔨 Latest commit 9349d6c
🔍 Latest deploy log https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/678096da61c6ee0008b82ef0
😎 Deploy Preview https://deploy-preview-3219--opentelemetry-cpp-api-docs.netlify.app

codecov bot commented Dec 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.21%. Comparing base (d19eb32) to head (9349d6c).

@@            Coverage Diff             @@
##             main    #3219      +/-   ##
==========================================
+ Coverage   88.16%   88.21%   +0.06%     
==========================================
  Files         198      198              
  Lines        6224     6259      +35     
==========================================
+ Hits         5487     5521      +34     
- Misses        737      738       +1     
Files with missing lines Coverage Δ
sdk/src/common/env_variables.cc 99.03% <100.00%> (+0.50%) ⬆️

... and 1 file with indirect coverage changes

@chusitoo chusitoo changed the title Support handling Retryable error for OTLP/gRPC exporter Support handling retry-able errors for OTLP/gRPC exporter Dec 22, 2024
{
TestTraceService(std::vector<grpc::StatusCode> status_codes) : status_codes_(status_codes) {}

inline grpc::Status Export(
Contributor Author

It does not seem possible, at least not in a truthful manner, to test this by mocking Export() on the client side.
Instead, the real exporter is used for this test and the client retry behavior is directly observed on the server side.
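
The server-side half of that approach can be sketched roughly as below. This is only an illustration under stated assumptions: `ScriptedService` is a hypothetical name, and a local `StatusCode` enum stands in for `grpc::StatusCode` so the sketch compiles without gRPC. The idea is that the service replies with a scripted sequence of status codes and counts incoming requests, so a test can check how many attempts the client really made.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for grpc::StatusCode so the sketch does not depend on gRPC.
enum class StatusCode
{
  kOk,
  kUnavailable,
  kPermissionDenied
};

// Hypothetical server-side handler: answers each request with the next
// scripted status code and records how many requests arrived.
class ScriptedService
{
public:
  explicit ScriptedService(std::vector<StatusCode> status_codes)
      : status_codes_(std::move(status_codes))
  {}

  StatusCode Export()
  {
    assert(!status_codes_.empty());
    // Once the script is exhausted, keep returning its last entry.
    const std::size_t last  = status_codes_.size() - 1;
    const std::size_t index = request_count_ < last ? request_count_ : last;
    ++request_count_;
    return status_codes_[index];
  }

  std::size_t request_count() const { return request_count_; }

private:
  std::vector<StatusCode> status_codes_;
  std::size_t request_count_{0};
};
```

With a script of two `kUnavailable` replies followed by `kOk`, a client that retries correctly would leave `request_count()` at 3, while a script starting with a non-retryable code should leave it at 1.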

@@ -317,7 +317,7 @@ if(BUILD_TESTING)
 add_executable(otlp_grpc_exporter_test test/otlp_grpc_exporter_test.cc)
 target_link_libraries(
 otlp_grpc_exporter_test ${GTEST_BOTH_LIBRARIES} ${CMAKE_THREAD_LIBS_INIT}
-${GMOCK_LIB} opentelemetry_exporter_otlp_grpc)
+${GMOCK_LIB} opentelemetry_exporter_otlp_grpc gRPC::grpc++)
Contributor Author

Required for the retry integration tests, which use a real gRPC server.

Member

We can only link gRPC directly when gRPC is built as a dynamic library; otherwise there will be conflicts in some environments.

Contributor Author @chusitoo, Jan 3, 2025

I was basing it on the fact that opentelemetry_exporter_otlp_grpc_client links against the lib higher in this cmake file.

Should this be made optional or guarded somehow? What would be the right approach?

Update: I tried linking against a static grpc++ lib and it still appears to work. Is there an easy way to reproduce this problem you describe?

@@ -357,6 +363,205 @@ TEST_F(OtlpGrpcExporterTestPeer, ConfigUnknownInsecureFromEnv)
}
# endif

# ifndef NO_GETENV
TEST_F(OtlpGrpcExporterTestPeer, ConfigRetryDefaultValues)
Contributor Author @chusitoo, Dec 23, 2024

The defaults match what is currently implemented in other flavors of OTel (verified for dotnet, java, and js). I don't believe rust has a retry policy in place, though.

@chusitoo chusitoo changed the title Support handling retry-able errors for OTLP/gRPC exporter [EXPORTER] Support handling retry-able errors for OTLP/gRPC Dec 30, 2024
@chusitoo chusitoo marked this pull request as ready for review January 1, 2025 23:40
@chusitoo chusitoo requested a review from a team as a code owner January 1, 2025 23:40
std::uint32_t retry_policy_max_attempts{};

/** The initial backoff delay between retry attempts, random between (0, initial_backoff). */
float retry_policy_initial_backoff{};
Member

Why not use std::chrono::duration<> here?

Contributor Author @chusitoo, Jan 3, 2025

I had that as a chrono duration initially, but it was not really of any use for otlp/grpc since it just gets passed down to the service config, so it was moved to otlp/http, where it is needed to perform some computations for the backoff.

FYI, implementation in previous commit was like this: cb14857

Member

As I understand it, exporting over either otlp/http or otlp/grpc costs far more CPU than the type conversion here. I think it is more important to make clear what this parameter means (the name and comment alone do not tell us the meaning or the unit of this variable). A float also carries rounding error (epsilon) and is less precise.
What do you think?

Contributor Author

IMHO, the precision concern is fairly subjective: normal use cases need only a single decimal place (which is how the value is formatted here before the config settings are passed to the grpc library). That seems logical, given that measuring backoff in tens of milliseconds or less is probably a very niche requirement.

I think there is some truth in that chrono duration makes the type more descriptive. Part of the reasoning I went back to float was because I could not find a common place where I could alias this to a more descriptive name without having to repeat it in at least one more header file (for instance, otlp_environment.h and http_client.h).

For now, I will revert/update this in #3223 until it is approved/merged to avoid duplicating all these work in progress changes for common code bits...
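
To make the trade-off in this thread concrete, here is a small sketch of the `std::chrono` option. It is an assumption-laden illustration, not code from either PR: the alias `SecondsF` and the function `ToServiceConfigDuration` are hypothetical names. It shows a float-backed duration whose unit is part of the type, formatted to one decimal place in the "decimal seconds plus `s`" form used by the gRPC service config.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical alias: seconds stored as float, so the unit lives in the type.
using SecondsF = std::chrono::duration<float>;

// Formats a duration as decimal seconds with one decimal place and an "s"
// suffix, for example "1.5s".
std::string ToServiceConfigDuration(SecondsF value)
{
  assert(value.count() >= 0.0f);
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer), "%.1fs", static_cast<double>(value.count()));
  return std::string(buffer);
}
```

Because the representation is floating point, callers can pass other chrono units directly, for instance `ToServiceConfigDuration(std::chrono::milliseconds{500})`, and the conversion to seconds happens implicitly.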

@@ -80,6 +80,34 @@ static bool GetStringDualEnvVar(const char *signal_name,
return exists;
}

static std::uint32_t GetUintEnvVarOrDefault(opentelemetry::nostd::string_view signal_env,
Member

Code changes in this file are also present in #3223

Planning to review and merge the OTLP HTTP PR first.

Contributor Author

All changes from the above PR were backported here, as well.

The only minor annoyance with the current use of the WITH_OTLP_RETRY_PREVIEW flag on CI is that none of the maintainer builds have WITH_OTLP_GRPC enabled, so the new code would not be unit tested.

Would it be reasonable to also enable this new feature flag, for instance with "cmake.exporter.otprotocol.test", in order to have at least one build that exercises this code?
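
For readers following this thread without the diff open, the kind of helper under review (an unsigned integer read from an environment variable, with a default) can be sketched as below. This is not the function from the PR, whose signature is truncated above: `ParseUintOrDefault` and `ReadUintEnvOrDefault` are hypothetical names, and the fallback rules are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Parses `raw` as a base-10 unsigned integer. Returns `fallback` when the
// input is null, empty, negative, malformed, or does not fit in 32 bits.
std::uint32_t ParseUintOrDefault(const char *raw, std::uint32_t fallback)
{
  if (raw == nullptr || *raw == '\0' || *raw == '-')
  {
    return fallback;
  }
  errno                          = 0;
  char *end                      = nullptr;
  const unsigned long long value = std::strtoull(raw, &end, 10);
  if (errno != 0 || end == raw || *end != '\0' ||
      value > std::numeric_limits<std::uint32_t>::max())
  {
    return fallback;
  }
  return static_cast<std::uint32_t>(value);
}

// Looks up the variable `name` and parses it; an unset variable yields `fallback`.
std::uint32_t ReadUintEnvOrDefault(const char *name, std::uint32_t fallback)
{
  assert(name != nullptr);
  return ParseUintOrDefault(std::getenv(name), fallback);
}
```

Note that an explicit `0` is parsed as a valid value rather than replaced by the default, which matters here because zeroing a retry setting is how users opt out of retries.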

Member @marcalff left a comment

Great work, thanks.

More comments to follow.

# include "opentelemetry/exporters/otlp/protobuf_include_prefix.h"
# include "opentelemetry/nostd/shared_ptr.h"
Member

Move nostd/shared_ptr.h outside of the protobuf include prefix + suffix block.

Contributor Author

I never bothered to notice that these includes carried special meaning. Good to know!

@chusitoo chusitoo force-pushed the RetryableErrorGrpc branch 2 times, most recently from 64b875c to 0a8bc82 Compare January 7, 2025 15:08
@chusitoo chusitoo force-pushed the RetryableErrorGrpc branch 2 times, most recently from 6a92aa0 to 7f3d420 Compare January 7, 2025 15:31
@chusitoo chusitoo force-pushed the RetryableErrorGrpc branch from 7f3d420 to 8677f89 Compare January 7, 2025 16:11
Successfully merging this pull request may close these issues.

Support handling Retryable error for OTLP exporter (OTLP/gRPC and OTLP/HTTP)
3 participants