Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant Performance Regression using Jruby Timeouts #152

Open
yaauie opened this issue Oct 25, 2019 · 8 comments
Open

Significant Performance Regression using Jruby Timeouts #152

yaauie opened this issue Oct 25, 2019 · 8 comments
Assignees

Comments

@yaauie
Copy link
Contributor

yaauie commented Oct 25, 2019

I've run a number of performance tests to get the raw throughput per-worker to determine how different variables affect the throughput of the grok, and determined that the changes to timeouts definitely have an outsized impact on grok filters that fall through many patterns to find their match (which makes sense, because the plugin initiates a timeout on each event for each pattern on until we find a match).

6.7.1 7.4.1 JEE 7.4.1 REE
grok-single-pattern 97k 65k 57k
grok-single-pattern-NT 96k 105k 98k
grok-52-fallthrough 42k 15k 17k
grok-52-fallthrough-NT 42k 71k 68k

Caveat: In the above, NT indicates "No Timeout", or timeout_millis => 0, which we do NOT recommend for production since a single pathological input can essentially tie up CPU resources indefinitely)

I will follow up with methodology for the above, and will see if I can get a better apples-to-apples comparison with 6.7.1 using the same version of this plugin as ships with 7.x.

@colinsurprenant
Copy link
Contributor

@yaauie interesting. in #147 I did not see such a dramatic performance regression for single pattern use-case; also I did not test for multiple fall through which, as you point out, will multiply any performance regression. I wonder if using different LS versions also affect the performance tests? Did you use the same LS version for all tests with the different plugin versions?

I wonder if it would make sense to have a single global timeout for the whole fall through list? (it could be configurable to keep BWC?)

@colinsurprenant
Copy link
Contributor

Meta issue created in elastic/logstash#11302

@kares
Copy link
Contributor

kares commented Nov 18, 2019

with #153 merged and available since >= 4.2.0 the fall-through case should be better when using timeout_scope => "event". for the single pattern the reported numbers still look like a considerable regression but we have yet to see how much that matters in real-world usage.

options we have left (if more performance is needed) are the following :

  • if contention is the main issue -> custom timeout thread (or executor) per worker
  • a scheduled thread pool executor wout blocking (again - assuming contention)
    there's an option e.g. in netty that is worth trying although it might be hard to extract out
  • reverting to using a single timeout thread but this time with proper (jruby-ext) thread polling
  • implemeting native timeout in Joni itself (best option - but requires work on joni's instructions)

@percygrunwald
Copy link

We have run into this performance regression after upgrading from 6.7.1 to 7.4.2. Given the same configuration, the update to 7.4.2 approximately doubled our event latency and dropped our events emitted by around 40%:
image

There was no way to improve the performance of our pipelines after the upgrade by changing any of JVM heap size, pipeline workers or batch size. It seems that the drop in throughput was in large part due to a bottleneck in the grok filter. Given the same settings as before the upgrade, we were able to recover some of our pre-upgrade throughput by adding timeout_millis => 0 to all grok invocations in our pipeline definitions:
image

We tested timeout_scope => "event" (instead of timeout_millis => 0) on some other machines and it had a very slight positive impact on overall performance:
image

@kares
Copy link
Contributor

kares commented Jan 6, 2020

Hey Percy, thanks for the data -> helpful to know timeout_scope => event did not help your case.
We are considering options on tackling this without reverting to ('problematic' but fast) previous timeout implementation. It will take us a while to have smt shipable, we might need folks to do a 'beta' test when ready (as real-world scenarios such as yours might reflect performance changes more accurately).

@percygrunwald
Copy link

@kares thanks for the follow up, appreciate your work on this. Our setup is holding steady with just the addition of timeout_millis => 0 to all our grok filters presently, but would prefer a more permanent solution that doesn't require us to add this to each invocation of the filter.

@abdalians
Copy link

+1 on timeout_millis => 0

@percygrunwald
Copy link

@kares has there been any progress on this case? We still have a code base full of timeout_millis => 0 as a mitigation attempt for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants