Improve remote metrics store flushing #1724
Here's an example of a benchmark where the flushing gets further and further behind: logs
Similar to #1723, benchmarks with many concurrent/parallel tasks that generate a lot of samples can end up producing too many samples for our single-threaded metrics store flush method to handle, causing excessive memory usage on the load driver and delaying metrics delivery to the remote store.
The `Driver` already wakes up every `POST_PROCESS_INTERVAL_SECONDS` (30) to flush the collected samples to the metrics store:

- rally/esrally/driver/driver.py, lines 295 to 305 in 2470328
- rally/esrally/driver/driver.py, lines 950 to 955 in 2470328
- rally/esrally/driver/driver.py, lines 1062 to 1068 in 2470328
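For readers without the code open, the behaviour in those snippets boils down to something like the following minimal sketch (hypothetical names such as `Coordinator`, `on_wakeup` and `bulk_add`; this is not the actual driver.py code): the flush is gated purely on elapsed wall-clock time and runs synchronously on a single thread, so if samples arrive faster than one flush per interval can drain them, the buffer keeps growing.

```python
import time

POST_PROCESS_INTERVAL_SECONDS = 30  # current wakeup/flush interval


class Coordinator:
    """Hypothetical stand-in for the Driver's coordinator state."""

    def __init__(self, metrics_store):
        self.metrics_store = metrics_store
        self.raw_samples = []                # grows as workers report samples
        self._last_flush = time.monotonic()

    def on_wakeup(self):
        # Invoked by the periodic wakeup message; flush only once the interval elapsed.
        if time.monotonic() - self._last_flush >= POST_PROCESS_INTERVAL_SECONDS:
            self.flush_samples()
            self._last_flush = time.monotonic()

    def flush_samples(self):
        # Single-threaded and synchronous: nothing else happens while this runs.
        samples, self.raw_samples = self.raw_samples, []
        self.metrics_store.bulk_add(samples)
```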
Once `self._client.bulk_index()` completes we clear the in-memory buffer:

- rally/esrally/metrics.py, lines 924 to 942 in 2470328
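Conceptually that path looks something like this simplified sketch (not the actual metrics.py code; the `_docs` buffer and the `bulk_index` signature are assumptions based on the description above): documents pile up in a plain list and are only dropped after a single synchronous bulk call has gone through, so a slow or distant metrics store keeps that memory pinned longer.

```python
class InMemoryBufferSketch:
    """Illustration of the flush-then-clear behaviour described above."""

    def __init__(self, client):
        self._client = client
        self._docs = []            # in-memory buffer of metrics documents

    def add(self, doc):
        self._docs.append(doc)     # grows until the next flush completes

    def flush(self):
        if self._docs:
            # One synchronous call covering the whole buffer; cleared only afterwards.
            self._client.bulk_index(index="rally-metrics", items=self._docs)
            self._docs = []
```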
We use the `elasticsearch.helpers.bulk()` method to iterate over all the documents in the in-memory buffer, sending them in chunks of 5000 docs at a time:

- rally/esrally/metrics.py, lines 89 to 93 in 2470328
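For reference, this is roughly how that helper behaves when driven from a single synchronous client (an illustration with made-up connection details and documents, not Rally's code): the iterable is consumed in 5000-document chunks, one HTTP request after another, so total flush time scales with per-request latency times the number of chunks.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")   # assumed metrics store endpoint

# Stand-in for the in-memory buffer: 20,000 small metrics documents.
docs = [{"@timestamp": "2023-06-01T00:00:00Z", "value": i} for i in range(20_000)]
actions = ({"_index": "rally-metrics", "_source": doc} for doc in docs)

# bulk() walks the iterable and sends one synchronous request per `chunk_size`
# documents, i.e. four sequential requests for this buffer.
success, errors = bulk(es, actions, chunk_size=5000)
print(f"indexed {success} docs, {len(errors)} errors")
```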
This approach works fine for most benchmarks, but for challenges like `logging-indexing-querying` of the `elastic/logs` track we generate a lot of documents, meaning that this in-memory buffer is often full of tens to hundreds of thousands of documents that are indexed by a single client controlled by the `Driver` actor. This problem is exacerbated in environments where the load driver is a long way away from the metrics store (i.e. cross-regional), or if the metrics store itself is overloaded, because our single-client throughput is bound by the latency of each request.
There's a `parallel_bulk` helper that uses `multiprocessing.pool.ThreadPool`, but the Thespian docs specifically caution against using threads inside actors.
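For completeness, this is how `parallel_bulk` would be used if threads were acceptable inside the actor (illustrative only; endpoint, index name and documents are made up): it fans chunks out to a `ThreadPool` and yields per-document results that must be consumed for any work to happen.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")   # assumed metrics store endpoint

docs = [{"@timestamp": "2023-06-01T00:00:00Z", "value": i} for i in range(20_000)]
actions = ({"_index": "rally-metrics", "_source": doc} for doc in docs)

# parallel_bulk() splits the actions into chunks and indexes them from a
# ThreadPool; the generator must be drained for the requests to be sent.
for ok, info in parallel_bulk(es, actions, thread_count=4, chunk_size=5000):
    if not ok:
        print("failed to index a document:", info)
```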
We could consider a few things here, but all will require extensive testing:

- Reducing `POST_PROCESS_INTERVAL_SECONDS` from 30 to allow more frequent flushes, intending to prevent the buffer from growing too large
- Increasing the `chunk_size` used by the helper method to send more docs at once (i.e. 10,000), but I don't think this will make a large difference because we're still using a single client
- Adding a check on `len(self.coordinator.raw_samples)` (or similar) in addition to the timer check: rally/esrally/driver/driver.py, line 301 in 2470328
- Using the `AsyncElasticsearch` client instead of the sync client for concurrently flushing metrics, allowing us to use coroutines (and not threads or processes); see the sketch after this list
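On that last option, here is a rough sketch of what a concurrent flush could look like (an illustration only, using the publicly documented `AsyncElasticsearch` client and `async_bulk` helper from `elasticsearch[async]`; the index name, documents, batching scheme and concurrency cap are assumptions, not a proposed implementation): split the buffer into batches and keep several bulk requests in flight on one event loop, so total flush time is no longer a strict multiple of per-request latency.

```python
import asyncio

from elasticsearch import AsyncElasticsearch
from elasticsearch.helpers import async_bulk


async def flush_concurrently(docs, index="rally-metrics", batch_size=5000, max_in_flight=4):
    """Flush `docs` with several bulk requests in flight at once (sketch only)."""
    es = AsyncElasticsearch("http://localhost:9200")  # assumed metrics store endpoint
    semaphore = asyncio.Semaphore(max_in_flight)      # cap concurrent bulk requests

    async def send(batch):
        async with semaphore:
            # async_bulk() mirrors helpers.bulk() but awaits the HTTP calls.
            await async_bulk(es, ({"_index": index, "_source": d} for d in batch))

    try:
        batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
        await asyncio.gather(*(send(b) for b in batches))
    finally:
        await es.close()


# Example: flush a buffer of 50,000 fake metrics documents.
docs = [{"@timestamp": "2023-06-01T00:00:00Z", "value": i} for i in range(50_000)]
asyncio.run(flush_concurrently(docs))
```

How cleanly an asyncio event loop fits inside the `Driver` actor is presumably part of the extensive testing mentioned above.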