Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Failing arrow-flight-test when enabling opentelemetry #45120

Open
h-vetinari opened this issue Dec 29, 2024 · 8 comments
Open

[C++] Failing arrow-flight-test when enabling opentelemetry #45120

h-vetinari opened this issue Dec 29, 2024 · 8 comments

Comments

@h-vetinari
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

In conda-forge/arrow-cpp-feedstock#1058, we finally enabled the test suite for libarrow in conda-forge. This has been running fine, even though @pitrou had noted in #35587:

Our async Flight support is unfortunately finicky, and noone is working on it anymore. So, yeah, there are sporadic failures on those tests...

However, while working on the opentelemetry integration (conda-forge/arrow-cpp-feedstock#1048; no windows due to #45033), I keep running into the following failure in arrow-flight-test; this time it doesn't appear to be a fluke, but has reproduced 3x in a row.

      Start 76: arrow-flight-test
76/98 Test #76: arrow-flight-test ............................***Failed    0.91 sec
Running arrow-flight-test, redirecting output into $SRC_DIR/cpp/build/build/test-logs/arrow-flight-test.txt (attempt 1/1)
$SRC_DIR/cpp/build-support/run-test.sh: line 88: 28863 Aborted                 (core dumped) $TEST_EXECUTABLE "$@" > $LOGFILE.raw 2>&1
Running main() from /home/conda/feedstock_root/build_artifacts/gtest-split_1722457657850/work/googletest/src/gtest_main.cc
[==========] Running 97 tests from 20 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from GrpcConnectivityTest
[ RUN      ] GrpcConnectivityTest.GetPort
[       OK ] GrpcConnectivityTest.GetPort (14 ms)
[ RUN      ] GrpcConnectivityTest.BuilderHook
[       OK ] GrpcConnectivityTest.BuilderHook (2 ms)
[ RUN      ] GrpcConnectivityTest.Shutdown
[       OK ] GrpcConnectivityTest.Shutdown (21 ms)
[ RUN      ] GrpcConnectivityTest.ShutdownWithDeadline
[       OK ] GrpcConnectivityTest.ShutdownWithDeadline (1 ms)
[ RUN      ] GrpcConnectivityTest.BrokenConnection
[       OK ] GrpcConnectivityTest.BrokenConnection (4 ms)
[----------] 5 tests from GrpcConnectivityTest (44 ms total)

[----------] 18 tests from GrpcDataTest
[ RUN      ] GrpcDataTest.TestDoGetInts
[       OK ] GrpcDataTest.TestDoGetInts (7 ms)
[ RUN      ] GrpcDataTest.TestDoGetFloats
[       OK ] GrpcDataTest.TestDoGetFloats (4 ms)
[ RUN      ] GrpcDataTest.TestDoGetDicts
[       OK ] GrpcDataTest.TestDoGetDicts (5 ms)
[ RUN      ] GrpcDataTest.TestDoGetLargeBatch
[       OK ] GrpcDataTest.TestDoGetLargeBatch (149 ms)
[ RUN      ] GrpcDataTest.TestFlightDataStreamError
WARNING: Logging before InitGoogleLogging() is written to STDERR
W20241229 06:18:49.797973 140144816412224 status.cc:155] Failed to close FlightDataStream: IOError: Expected error
[       OK ] GrpcDataTest.TestFlightDataStreamError (5 ms)
[ RUN      ] GrpcDataTest.TestOverflowServerBatch
[       OK ] GrpcDataTest.TestOverflowServerBatch (78 ms)
[ RUN      ] GrpcDataTest.TestOverflowClientBatch
[       OK ] GrpcDataTest.TestOverflowClientBatch (42 ms)
[ RUN      ] GrpcDataTest.TestDoExchange
[       OK ] GrpcDataTest.TestDoExchange (4 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeNoData
[       OK ] GrpcDataTest.TestDoExchangeNoData (4 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeWriteOnlySchema
[       OK ] GrpcDataTest.TestDoExchangeWriteOnlySchema (3 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeGet
[       OK ] GrpcDataTest.TestDoExchangeGet (4 ms)
[ RUN      ] GrpcDataTest.TestDoExchangePut
[       OK ] GrpcDataTest.TestDoExchangePut (3 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeEcho
[       OK ] GrpcDataTest.TestDoExchangeEcho (5 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeTotal
[       OK ] GrpcDataTest.TestDoExchangeTotal (3 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeError
[       OK ] GrpcDataTest.TestDoExchangeError (4 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeConcurrency
[       OK ] GrpcDataTest.TestDoExchangeConcurrency (4 ms)
[ RUN      ] GrpcDataTest.TestDoExchangeUndrained
[       OK ] GrpcDataTest.TestDoExchangeUndrained (4 ms)
[ RUN      ] GrpcDataTest.TestIssue5095
[       OK ] GrpcDataTest.TestIssue5095 (4 ms)
[----------] 18 tests from GrpcDataTest (341 ms total)

[----------] 7 tests from GrpcDoPutTest
[ RUN      ] GrpcDoPutTest.TestInts
[       OK ] GrpcDoPutTest.TestInts (3 ms)
[ RUN      ] GrpcDoPutTest.TestFloats
[       OK ] GrpcDoPutTest.TestFloats (3 ms)
[ RUN      ] GrpcDoPutTest.TestEmptyBatch
[       OK ] GrpcDoPutTest.TestEmptyBatch (3 ms)
[ RUN      ] GrpcDoPutTest.TestDicts
[       OK ] GrpcDoPutTest.TestDicts (3 ms)
[ RUN      ] GrpcDoPutTest.TestLargeBatch
[       OK ] GrpcDoPutTest.TestLargeBatch (89 ms)
[ RUN      ] GrpcDoPutTest.TestSizeLimit
[       OK ] GrpcDoPutTest.TestSizeLimit (5 ms)
[ RUN      ] GrpcDoPutTest.TestUndrained
[       OK ] GrpcDoPutTest.TestUndrained (4 ms)
[----------] 7 tests from GrpcDoPutTest (114 ms total)

[----------] 5 tests from GrpcAppMetadataTest
[ RUN      ] GrpcAppMetadataTest.TestDoGet
[       OK ] GrpcAppMetadataTest.TestDoGet (3 ms)
[ RUN      ] GrpcAppMetadataTest.TestDoGetDictionaries
[       OK ] GrpcAppMetadataTest.TestDoGetDictionaries (4 ms)
[ RUN      ] GrpcAppMetadataTest.TestDoPut
[       OK ] GrpcAppMetadataTest.TestDoPut (4 ms)
[ RUN      ] GrpcAppMetadataTest.TestDoPutDictionaries
[       OK ] GrpcAppMetadataTest.TestDoPutDictionaries (4 ms)
[ RUN      ] GrpcAppMetadataTest.TestDoPutReadMetadata
[       OK ] GrpcAppMetadataTest.TestDoPutReadMetadata (4 ms)
[----------] 5 tests from GrpcAppMetadataTest (22 ms total)

[----------] 5 tests from GrpcIpcOptionsTest
[ RUN      ] GrpcIpcOptionsTest.TestDoGetReadOptions
[       OK ] GrpcIpcOptionsTest.TestDoGetReadOptions (3 ms)
[ RUN      ] GrpcIpcOptionsTest.TestDoPutWriteOptions
[       OK ] GrpcIpcOptionsTest.TestDoPutWriteOptions (3 ms)
[ RUN      ] GrpcIpcOptionsTest.TestDoExchangeClientWriteOptions
[       OK ] GrpcIpcOptionsTest.TestDoExchangeClientWriteOptions (3 ms)
[ RUN      ] GrpcIpcOptionsTest.TestDoExchangeClientWriteOptionsBegin
[       OK ] GrpcIpcOptionsTest.TestDoExchangeClientWriteOptionsBegin (3 ms)
[ RUN      ] GrpcIpcOptionsTest.TestDoExchangeServerWriteOptions
[       OK ] GrpcIpcOptionsTest.TestDoExchangeServerWriteOptions (4 ms)
[----------] 5 tests from GrpcIpcOptionsTest (19 ms total)

[----------] 3 tests from GrpcCudaDataTest
[ RUN      ] GrpcCudaDataTest.TestDoGet
$SRC_DIR/cpp/src/arrow/flight/test_definitions.cc:1418: Skipped
Arrow was built without ARROW_CUDA

[  SKIPPED ] GrpcCudaDataTest.TestDoGet (0 ms)
[ RUN      ] GrpcCudaDataTest.TestDoPut
$SRC_DIR/cpp/src/arrow/flight/test_definitions.cc:1419: Skipped
Arrow was built without ARROW_CUDA

[  SKIPPED ] GrpcCudaDataTest.TestDoPut (0 ms)
[ RUN      ] GrpcCudaDataTest.TestDoExchange
$SRC_DIR/cpp/src/arrow/flight/test_definitions.cc:1421: Skipped
Arrow was built without ARROW_CUDA

[  SKIPPED ] GrpcCudaDataTest.TestDoExchange (0 ms)
[----------] 3 tests from GrpcCudaDataTest (0 ms total)

[----------] 5 tests from GrpcErrorHandlingTest
[ RUN      ] GrpcErrorHandlingTest.TestAsyncGetFlightInfo
[       OK ] GrpcErrorHandlingTest.TestAsyncGetFlightInfo (42 ms)
[ RUN      ] GrpcErrorHandlingTest.TestGetFlightInfo
[       OK ] GrpcErrorHandlingTest.TestGetFlightInfo (37 ms)
[ RUN      ] GrpcErrorHandlingTest.TestGetFlightInfoMetadata
[       OK ] GrpcErrorHandlingTest.TestGetFlightInfoMetadata (3 ms)
[ RUN      ] GrpcErrorHandlingTest.TestDoPut
[       OK ] GrpcErrorHandlingTest.TestDoPut (6 ms)
[ RUN      ] GrpcErrorHandlingTest.TestDoExchange
[       OK ] GrpcErrorHandlingTest.TestDoExchange (4 ms)
[----------] 5 tests from GrpcErrorHandlingTest (95 ms total)

[----------] 3 tests from GrpcAsyncClientTest
[ RUN      ] GrpcAsyncClientTest.TestGetFlightInfo
[       OK ] GrpcAsyncClientTest.TestGetFlightInfo (4 ms)
[ RUN      ] GrpcAsyncClientTest.TestGetFlightInfoFuture
[       OK ] GrpcAsyncClientTest.TestGetFlightInfoFuture (4 ms)
[ RUN      ] GrpcAsyncClientTest.TestListenerLifetime
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1735453130.232915   29468 thd.cc:184] pthread_join failed: Resource deadlock avoided
~/feedstock_root/build_artifacts/apache-arrow_1735449323515/work/cpp/build/src/arrow/flight

      Start 77: arrow-flight-sql-test

This is the only test that fails:

99% tests passed, 1 tests failed out of 98

Label Time Summary:
[...]

Total Test time (real) = 165.44 sec

The following tests FAILED:
	 76 - arrow-flight-test (Failed)                        arrow_flight unittest

Component(s)

C++, Continuous Integration, Packaging

@h-vetinari
Copy link
Contributor Author

Any inputs here @kou @pitrou @raulcd @lidavidm? I'd like to enable otel support for arrow 19 (at least on unix; windows needs more work: #45033)

@h-vetinari
Copy link
Contributor Author

For arrow 19, this now also appears without otel integration (note that arrow 15-18 have just been rebuilt with the same set of baseline dependencies, so it's not due to regressions in ambient dependencies)

      Start 76: arrow-flight-internals-test
76/98 Test #76: arrow-flight-internals-test ..................   Passed    0.20 sec
      Start 77: arrow-flight-test
77/98 Test #77: arrow-flight-test ............................***Failed    0.36 sec
Running arrow-flight-test, redirecting output into $SRC_DIR/cpp/build/build/test-logs/arrow-flight-test.txt (attempt 1/1)
$SRC_DIR/cpp/build-support/run-test.sh: line 88: 29353 Aborted                 (core dumped) $TEST_EXECUTABLE "$@" > $LOGFILE.raw 2>&1
Running main() from /home/conda/feedstock_root/build_artifacts/gtest-split_1722457657850/work/googletest/src/gtest_main.cc
[==========] Running 96 tests from 20 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from GrpcConnectivityTest
[ RUN      ] GrpcConnectivityTest.GetPort
[       OK ] GrpcConnectivityTest.GetPort (18 ms)
[ RUN      ] GrpcConnectivityTest.BuilderHook
[       OK ] GrpcConnectivityTest.BuilderHook (2 ms)
[ RUN      ] GrpcConnectivityTest.Shutdown
[       OK ] GrpcConnectivityTest.Shutdown (22 ms)
[ RUN      ] GrpcConnectivityTest.ShutdownWithDeadline
[       OK ] GrpcConnectivityTest.ShutdownWithDeadline (1 ms)
[ RUN      ] GrpcConnectivityTest.BrokenConnection
[       OK ] GrpcConnectivityTest.BrokenConnection (6 ms)
[----------] 5 tests from GrpcConnectivityTest (51 ms total)

[----------] 18 tests from GrpcDataTest
[ RUN      ] GrpcDataTest.TestDoGetInts
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736612608.282647   29506 thd.cc:184] pthread_join failed: Resource deadlock avoided
~/feedstock_root/build_artifacts/apache-arrow_1736607738095/work/cpp/build/src/arrow/flight

      Start 78: arrow-flight-sql-test
78/98 Test #78: arrow-flight-sql-test ........................   Passed    0.39 sec

Additionally, we also see

      Start 70: arrow-dataset-file-parquet-test
70/98 Test #70: arrow-dataset-file-parquet-test ..............   Passed    6.90 sec
      Start 71: arrow-dataset-file-parquet-encryption-test
71/98 Test #71: arrow-dataset-file-parquet-encryption-test ...***Failed    0.30 sec
Running arrow-dataset-file-parquet-encryption-test, redirecting output into $SRC_DIR/cpp/build/build/test-logs/arrow-dataset-file-parquet-encryption-test.txt (attempt 1/1)
$SRC_DIR/cpp/build-support/run-test.sh: line 88: 28556 Segmentation fault      (core dumped) $TEST_EXECUTABLE "$@" > $LOGFILE.raw 2>&1
Running main() from /home/conda/feedstock_root/build_artifacts/gtest-split_1722457657850/work/googletest/src/gtest_main.cc
[==========] Running 3 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from DatasetEncryptionTest
[ RUN      ] DatasetEncryptionTest.WriteReadDatasetWithEncryption
~/feedstock_root/build_artifacts/apache-arrow_1736607738095/work/cpp/build/src/arrow/dataset

      Start 72: arrow-filesystem-test
72/98 Test #72: arrow-filesystem-test ........................   Passed    1.02 sec

@pitrou
Copy link
Member

pitrou commented Jan 11, 2025

Is it possible for you to get a backtrace of the arrow-dataset-file-parquet-encryption-test coredump?

@h-vetinari
Copy link
Contributor Author

Not easily (as this is all happening in CI), but we could certainly try!

@pitrou
Copy link
Member

pitrou commented Jan 11, 2025

We have a helper for that in the Arrow C++ test running harness, you can take a look at the function comments here:

function print_coredumps() {
# The script expects core files relative to the build directory with unique
# names per test executable because of the parallel running. So the corefile
# patterns must be set with prefix `core.{test-executable}*`:
#
# In case of macOS:
# sudo sysctl -w kern.corefile=/tmp/core.%N.%P
# On Linux:
# sudo sysctl -w kernel.core_pattern=/tmp/core.%e.%p
#
# and the ulimit must be increased:
# ulimit -c unlimited
#
# If the tests are run in a Docker container, the instructions are slightly
# different: see the 'Coredumps' comment section in `docker-compose.yml`.
# filename is truncated to the first 15 characters in case of linux, so limit
# the pattern for the first 15 characters

@lidavidm
Copy link
Member

#ifdef GRPC_ENABLE_ASYNC
    // TODO(https://github.com/apache/arrow/issues/30949): if there are async
    // RPCs running when the client is stopped, then when they go to use the
    // garbage bin, they'll instead synchronously dispose of resources from
    // the callback thread, and will likely crash. We could instead cancel
    // them first and wait for completion before stopping the thread, but
    // tracking all of the RPCs may be unacceptable overhead for clients that
    // are making many small concurrent RPC calls, so it remains to be seen
    // whether there's a pressing need for this.
    garbage_bin_->Stop();
#endif

I'd rather just disable the test (upstream), since it's deliberately misusing the library to see how well we protect the user from gRPC. But I've come around to thinking this is misguided (wrapping gRPC or trying to make it so that the user isn't aware of gRPC means the user can't debug/control things when they want to be able to and it just adds to the maintenance burden).

@lidavidm
Copy link
Member

#45232

@h-vetinari
Copy link
Contributor Author

Just to note that #45232 solved our problems for arrow 19, as well as with the otel integration. Thanks @lidavidm! 🙏

@amoeba amoeba added this to the 19.0.1 milestone Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants