file_matches - why so much traffic #4
Comments
Hi @indera-shsp, the plugin should be listing the bucket every That could show up as several If you see Logstash downloading the contents of the file every 60 seconds, that's probably a bug. The plugin should keep a local cache of which objects it has already processed or mark them with a label in GCS. The label is preferred (by default it's Could you expound on your use case a little bit more (average size of file, average size of name, count of objects per bucket)? |
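To illustrate the behaviour described above (list on every cycle, download each object only once), here is a minimal Python sketch. It is not the plugin's code: `poll_once`, the in-memory `processed` set, and the fake bucket are all hypothetical stand-ins for the local cache or GCS label mentioned in the comment.

```python
# Hypothetical sketch: every cycle lists the bucket, but only objects
# that are not yet in the processed set are downloaded.
def poll_once(list_names, download, processed):
    """Run one polling cycle and return the names downloaded in it."""
    fetched = []
    for name in list_names():
        if name in processed:
            continue  # handled in an earlier cycle, skip the download
        download(name)
        processed.add(name)
        fetched.append(name)
    return fetched


bucket = ["a.json", "b.json"]  # fake bucket contents
processed = set()              # stands in for the local cache or label
downloads = []                 # records every download that happens

first = poll_once(lambda: list(bucket), downloads.append, processed)
bucket.append("c.json")        # a new object arrives between cycles
second = poll_once(lambda: list(bucket), downloads.append, processed)
print(first, second)
```

Under this model the listing still costs traffic on every cycle, but the object contents are transferred only once.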
The bucket contains about 20 small json files and we check every 30 seconds |
The list function here can optionally accept an option object containing a property named "prefix" https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-storage/src/main/java/com/google/cloud/storage/Storage.java#L973. I suspect that letting Logstash accept a "prefix" parameter (similar to how one provides file_matches) would allow Google to pre-filter large lists of files, which may be computationally cheaper and use less network bandwidth on each list cycle. |
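A rough Python sketch of why a prefix could help, under stated assumptions: `list_objects` is a hypothetical stand-in for the server-side listing (it is not the Cloud Storage client), the bucket layout is invented, and treating `file_matches` as a regular expression searched against the object name is an assumption about the plugin.

```python
import re


def list_objects(bucket, prefix=""):
    """Stand-in for a server-side listing: only names that start with
    the prefix are returned to the client."""
    return [name for name in bucket if name.startswith(prefix)]


def select(names, file_matches):
    """Client-side filter, assumed here to be a regex search on the name."""
    pattern = re.compile(file_matches)
    return [name for name in names if pattern.search(name)]


# Invented bucket: many old objects plus a small "incoming/" folder.
bucket = [f"archive/2018/{i}.json" for i in range(1000)] + [
    "incoming/a.json",
    "incoming/b.json",
    "incoming/readme.txt",
]

wanted = r"^incoming/.*\.json$"
unfiltered = list_objects(bucket)
prefixed = list_objects(bucket, prefix="incoming/")

# Both listings select the same files, but the prefixed one is far smaller.
print(len(unfiltered), len(prefixed))
print(select(unfiltered, wanted) == select(prefixed, wanted))
```

The regex still runs client-side in both cases; the prefix only shrinks what has to cross the network before that filter is applied.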
We have a bucket with 28716 objects (and growing). To retrieve the list of these objects plus their metadata, the resulting file is 38MB. Since our logstash |
From reading the ruby code it looks like |
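For a sense of scale, the hourly cost of re-listing can be estimated with simple arithmetic. The 38 MB listing size comes from the comment above; the 60 second interval is an assumption (the poster's actual interval is cut off), and `listing_traffic_mb_per_hour` is a hypothetical helper.

```python
def listing_traffic_mb_per_hour(listing_mb, interval_seconds):
    """Traffic from re-fetching the full listing once per interval."""
    cycles_per_hour = 3600 / interval_seconds
    return listing_mb * cycles_per_hour


# 38 MB listing, assumed 60 second polling interval.
per_hour = listing_traffic_mb_per_hour(38, 60)
print(per_hour)  # 2280.0 MB, i.e. about 2.3 GB per hour
```

Halving the interval doubles the figure, and the listing itself grows with the object count, which is why a server-side prefix matters more as the bucket grows.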
It sounds like even with a prefix, we might end up back here soon if the data is going to continue growing. It seems like the pipeline is trying to index something that's near-real time. Would one of the following approaches help?
If those are too much, I'm happy to look at just adding the prefix for now if you're willing to test it so we can get a Logstash maintainer to approve the PR (they like to see at least one real user testing it before approving a merge/release). |
@josephlewis42 Thank you for the prompt responses. We will evaluate the options you mentioned, but adding support for the |
@josephlewis42 if there is a PR enabling the use of a prefix pre-filter, we will happily test it |
@josephlewis42 how difficult would it be to fix this issue for somebody not familiar with the code base? |
@indera-shsp I'm taking a stab at it right now. The codebase is a bit hairy because it's Java mixed with Ruby. Our hope was full Java because then type checks and the like are easy but I think those plans have been stalled upstream. |
@indera-shsp or @tmegow I built a version with the fix and have it published here: https://storage.googleapis.com/logstash-prereleases/logstash-input-google_cloud_storage-0.12.0-java.gem for testing. If things look good, would you mind leaving your remarks in #7? Here are the docs for the new field:
[id="plugins-{type}s-{plugin}-file_prefix"]
===== `file_prefix`
added[0.12.0]
* Value type is <<string,string>>
* Default is: ``
A prefix filter applied server-side. Only files whose names start with this
prefix will be fetched from Cloud Storage. This can be useful if all the files
you want to process are in a particular folder and you want to reduce network
traffic.
|
We use file_matches as described here https://www.elastic.co/guide/en/logstash/current/plugins-inputs-google_cloud_storage.html to determine which files need processing.
We are observing excessive traffic initiated by Logstash. Is it downloading ALL the files from the bucket every 60 seconds?
I expected it to be smart and only download the file names, which should not add up to hundreds of megabytes every hour.