Improve remote metrics store flushing #1724
Here's an example of a benchmark where the flushing gets further and further behind: logs
Similar to #1723, benchmarks with many concurrent/parallel tasks that generate a lot of samples can end up producing too many samples for our single-threaded metrics store flush method to handle, causing excessive memory usage on the load driver and delaying metrics delivery to the remote store.
The `Driver` already wakes up every `POST_PROCESS_INTERVAL_SECONDS` (30) to flush the collected samples to the metrics store:

- rally/esrally/driver/driver.py, lines 295 to 305 in 2470328
- rally/esrally/driver/driver.py, lines 950 to 955 in 2470328
- rally/esrally/driver/driver.py, lines 1062 to 1068 in 2470328
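For readers without the code open, the behaviour in those snippets boils down to something like the following minimal sketch (hypothetical names such as `Coordinator`, `on_wakeup` and `bulk_add`; this is not the actual driver.py code): the flush is gated purely on elapsed wall-clock time and runs synchronously on a single thread, so if samples arrive faster than one flush per interval can drain them, the buffer keeps growing.

```python
import time

POST_PROCESS_INTERVAL_SECONDS = 30  # current wakeup/flush interval


class Coordinator:
    """Hypothetical stand-in for the Driver's coordinator state."""

    def __init__(self, metrics_store):
        self.metrics_store = metrics_store
        self.raw_samples = []                # grows as workers report samples
        self._last_flush = time.monotonic()

    def on_wakeup(self):
        # Invoked by the periodic wakeup message; flush only once the interval elapsed.
        if time.monotonic() - self._last_flush >= POST_PROCESS_INTERVAL_SECONDS:
            self.flush_samples()
            self._last_flush = time.monotonic()

    def flush_samples(self):
        # Single-threaded and synchronous: nothing else happens while this runs.
        samples, self.raw_samples = self.raw_samples, []
        self.metrics_store.bulk_add(samples)
```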
Once `self._client.bulk_index()` completes we clear the in-memory buffer:

- rally/esrally/metrics.py, lines 924 to 942 in 2470328
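Conceptually that path looks something like this simplified sketch (not the actual metrics.py code; the `_docs` buffer and the `bulk_index` signature are assumptions based on the description above): documents pile up in a plain list and are only dropped after a single synchronous bulk call has gone through, so a slow or distant metrics store keeps that memory pinned longer.

```python
class InMemoryBufferSketch:
    """Illustration of the flush-then-clear behaviour described above."""

    def __init__(self, client):
        self._client = client
        self._docs = []            # in-memory buffer of metrics documents

    def add(self, doc):
        self._docs.append(doc)     # grows until the next flush completes

    def flush(self):
        if self._docs:
            # One synchronous call covering the whole buffer; cleared only afterwards.
            self._client.bulk_index(index="rally-metrics", items=self._docs)
            self._docs = []
```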
We use the `elasticsearch.helpers.bulk()` method to iterate over all the documents in the in-memory buffer, sending them in chunks of 5000 docs at a time:

- rally/esrally/metrics.py, lines 89 to 93 in 2470328
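For reference, this is roughly how that helper behaves when driven from a single synchronous client (an illustration with made-up connection details and documents, not Rally's code): the iterable is consumed in 5000-document chunks, one HTTP request after another, so total flush time scales with per-request latency times the number of chunks.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")   # assumed metrics store endpoint

# Stand-in for the in-memory buffer: 20,000 small metrics documents.
docs = [{"@timestamp": "2023-06-01T00:00:00Z", "value": i} for i in range(20_000)]
actions = ({"_index": "rally-metrics", "_source": doc} for doc in docs)

# bulk() walks the iterable and sends one synchronous request per `chunk_size`
# documents, i.e. four sequential requests for this buffer.
success, errors = bulk(es, actions, chunk_size=5000)
print(f"indexed {success} docs, {len(errors)} errors")
```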
This approach works fine for most benchmarks, but for challenges like `logging-indexing-querying` of the `elastic/logs` track we generate a lot of documents, meaning that this in-memory buffer is often full of tens to hundreds of thousands of documents that are indexed by a single client controlled by the `Driver` actor. This problem is exacerbated in environments where the load driver is a long way away from the metrics store (i.e. cross-regional), or if the metrics store itself is overloaded, because our single-client throughput is bound by the latency of each request.
There's a `parallel_bulk` helper that uses `multiprocessing.pool.ThreadPool`, but the Thespian docs specifically caution against using threads inside actors.
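For completeness, this is how `parallel_bulk` would be used if threads were acceptable inside the actor (illustrative only; endpoint, index name and documents are made up): it fans chunks out to a `ThreadPool` and yields per-document results that must be consumed for any work to happen.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")   # assumed metrics store endpoint

docs = [{"@timestamp": "2023-06-01T00:00:00Z", "value": i} for i in range(20_000)]
actions = ({"_index": "rally-metrics", "_source": doc} for doc in docs)

# parallel_bulk() splits the actions into chunks and indexes them from a
# ThreadPool; the generator must be drained for the requests to be sent.
for ok, info in parallel_bulk(es, actions, thread_count=4, chunk_size=5000):
    if not ok:
        print("failed to index a document:", info)
```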
We could consider a few things here, but all will require extensive testing:

- Reducing `POST_PROCESS_INTERVAL_SECONDS` from 30 to allow more frequent flushes, intending to prevent the buffer from growing too large
- Increasing the `chunk_size` used by the helper method to send more docs at once (i.e. 10,000), but I don't think this will make a large difference because we're still using a single client
- Adding a check on `len(self.coordinator.raw_samples)` (or similar) in addition to the timer check: rally/esrally/driver/driver.py, line 301 in 2470328
- Using the `AsyncElasticsearch` client instead of the sync client for concurrently flushing metrics, allowing us to use coroutines (and not threads or processes); see the sketch after this list
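On that last option, here is a rough sketch of what a concurrent flush could look like (an illustration only, using the publicly documented `AsyncElasticsearch` client and `async_bulk` helper from `elasticsearch[async]`; the index name, documents, batching scheme and concurrency cap are assumptions, not a proposed implementation): split the buffer into batches and keep several bulk requests in flight on one event loop, so total flush time is no longer a strict multiple of per-request latency.

```python
import asyncio

from elasticsearch import AsyncElasticsearch
from elasticsearch.helpers import async_bulk


async def flush_concurrently(docs, index="rally-metrics", batch_size=5000, max_in_flight=4):
    """Flush `docs` with several bulk requests in flight at once (sketch only)."""
    es = AsyncElasticsearch("http://localhost:9200")  # assumed metrics store endpoint
    semaphore = asyncio.Semaphore(max_in_flight)      # cap concurrent bulk requests

    async def send(batch):
        async with semaphore:
            # async_bulk() mirrors helpers.bulk() but awaits the HTTP calls.
            await async_bulk(es, ({"_index": index, "_source": d} for d in batch))

    try:
        batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
        await asyncio.gather(*(send(b) for b in batches))
    finally:
        await es.close()


# Example: flush a buffer of 50,000 fake metrics documents.
docs = [{"@timestamp": "2023-06-01T00:00:00Z", "value": i} for i in range(50_000)]
asyncio.run(flush_concurrently(docs))
```

How cleanly an asyncio event loop fits inside the `Driver` actor is presumably part of the extensive testing mentioned above.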