
grok items not in the docs #5605

Open
chrismo opened this issue Jan 27, 2025 · 4 comments · May be fixed by #5608

Comments


chrismo commented Jan 27, 2025

  1. If I have at least one "named" capture pattern, the others don't need to be named (if I don't want them).
super -z -c 'yield "foo bar" | grok("%{WORD} %{WORD:bar}", this)'
{bar:"bar"}

but at least one must be named:

super -z -c 'yield "foo bar" | grok("%{WORD}", this)'
error({message:"grok(): value does not match pattern",on:"foo bar"})
  2. If I have at least one named pattern, I can use inline regexes without named patterns:
super -z -c 'yield "foo bar" | grok("\\w+ %{WORD:bar}", this)'
{bar:"bar"} 

I presume these aren't glitches in the function. If they're intended behavior, could these items be added to the docs? These two things make it nicer, IMO, to use grok in more scenarios where I might otherwise reach for awk or sed.


philrz commented Jan 27, 2025

Hi @chrismo. Thanks for pointing these out. Indeed, our grok function and the accompanying docs have been an ongoing area of investment, so user feedback is definitely helpful here.

Before I dive into your specific questions, it might help us to understand why this comes across as an improvement for you over awk or sed. Specifically, is it the included patterns like WORD that make it attractive? I ask because our docs (and others, such as the Logstash docs) explain Grok's usefulness primarily in terms of parsing out what matches the patterns into named fields. Your example of grok("%{WORD}", this) where you're not specifying any fields is just something I'm not accustomed to seeing attempted with Grok, hence my guess that the appeal is the convenience of having those "patterns on tap" such as WORD. Correct?

As for the specific questions, something else we touch on in the docs is that there's no formal specification of Grok, so in situations like this we basically just see what Logstash does (since it's effectively the reference implementation) and then see if we have a defensible reason for doing something different. Here's what that shows me regarding your two questions, and I'll go over these with the lead feature developer to confirm.

  1. If I have at least one "named" capture pattern, the others don't need to be named (if I don't want them) but at least one must be named.

tl;dr - Yeah, that comes across as a bug.

Here's the Logstash comparison. I'm testing with Logstash v8.17.1 and will use an example based on the one from their Grok Basics docs.

$ cat logstash.conf 
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash.conf
...

To start from the "happy path", if I paste their example log line into that window so it's processed by their stdin plugin, I can see the parsing was successful because they show what was parsed into the named fields such as client and method.

55.3.244.1 GET /index.html 15824 0.043
{
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    },
         "bytes" => "15824",
    "@timestamp" => 2025-01-27T19:45:10.251797Z,
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
       "request" => "/index.html",
      "duration" => "0.043",
        "client" => "55.3.244.1",
       "message" => "55.3.244.1 GET /index.html 15824 0.043",
        "method" => "GET"
}

Likewise, if I type in junk that doesn't parse, in addition to the named fields not being populated, they communicate this by adding a _grokparsefailure tag.

junk
{
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    },
    "@timestamp" => 2025-01-27T19:46:19.622941Z,
         "event" => {
        "original" => "junk"
    },
          "tags" => [
        [0] "_grokparsefailure"
    ],
       "message" => "junk"
}

With that established, I'll then pivot to using this alternate config where I've dropped all the named fields, hence much like your example with just %{WORD} as the one and only pattern.

$ cat logstash-no-fields.conf
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP} %{WORD} %{URIPATHPARAM} %{NUMBER} %{NUMBER}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash-no-fields.conf 
...
55.3.244.1 GET /index.html 15824 0.043
{
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    },
    "@timestamp" => 2025-01-27T19:51:31.921195Z,
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
       "message" => "55.3.244.1 GET /index.html 15824 0.043"
}

As you can see, it did not show the _grokparsefailure tag, so it effectively looks like success. Of course there were no parsed fields in the output, but that's exactly what the config implied, so it seems everything is working as designed.

By comparison, an error like value does not match pattern does indeed seem incorrect on our part. As for how to communicate that parsing was successful but nothing was extracted into named fields: since the function returns a record, I'm guessing that returning an empty record {} might be the correct thing to do.
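For illustration, the distinction can be sketched in Python with a toy matcher (the PATTERNS table and grok function here are made-up names for this sketch, not the actual super or Logstash implementation): a successful match with no named fields yields an empty record, while a failed match yields nothing.

```python
import re

# Toy subset of a Grok pattern library (illustrative only).
PATTERNS = {"WORD": r"\b\w+\b", "NUMBER": r"\d+(?:\.\d+)?"}

def grok(pattern: str, text: str):
    """Expand %{NAME} / %{NAME:field} references into one regex and match.

    Returns the named captures as a dict on success ({} when the pattern
    names no fields), or None when the text does not match at all.
    """
    def expand(m):
        name, _, field = m.group(1).partition(":")
        body = PATTERNS[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"

    compiled = re.sub(r"%\{(\w+(?::\w+)?)\}", expand, pattern)
    m = re.match(compiled, text)
    return m.groupdict() if m else None

# A match with no named fields is still a successful match:
assert grok("%{WORD}", "foo bar") == {}
assert grok("%{WORD} %{WORD:bar}", "foo bar") == {"bar": "bar"}
assert grok("%{NUMBER}", "junk") is None
```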

  2. If I have at least one named pattern, I can use inline regexes without named patterns.

tl;dr - Now that the zero-named-fields behavior can be explained as a likely bug, the inline regexps doing what you showed seems to be pretty much working as designed.

The parts of a Grok expression outside the %{pattern:field_name} syntax are just additional text that also has to match, and that text can be literal characters or more regexp. For example, going back to the first Logstash config, all I need to do is add one extra "space" character between the IP address and the method and that makes the same log line fail to parse.

$ cat logstash-extra-space.conf
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP:client}  %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash-extra-space.conf 
...
55.3.244.1 GET /index.html 15824 0.043
{
       "message" => "55.3.244.1 GET /index.html 15824 0.043",
    "@timestamp" => 2025-01-27T20:03:55.043718Z,
          "tags" => [
        [0] "_grokparsefailure"
    ],
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    }
}

But likewise I can put .+ regular expressions in place of the single space characters and this log message will parse the same (though the pattern would now also tolerate multiple space characters in those spots, of course).

$ cat logstash-regexp-multi-spaces.conf
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP:client}.+%{WORD:method}.+%{URIPATHPARAM:request}.+%{NUMBER:bytes}.+%{NUMBER:duration}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash-regexp-multi-spaces.conf 
...
55.3.244.1 GET /index.html 15824 0.043
{
        "client" => "55.3.244.1",
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
         "bytes" => "15824",
       "message" => "55.3.244.1 GET /index.html 15824 0.043",
          "host" => {
        "hostname" => "wpm.local"
    },
        "method" => "GET",
    "@timestamp" => 2025-01-27T20:06:00.014176Z,
      "@version" => "1",
      "duration" => "0.043",
       "request" => "/index.html"
}
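To make those mechanics concrete, here's a hypothetical Python sketch of how a Grok expression expands into a plain regex (the LIB table and to_regex helper are illustrative names, not any real implementation): each %{NAME:field} becomes a named capture group, and everything between them, whether literal spaces or inline regexp like .+, is left in the regex verbatim.

```python
import re

# Tiny illustrative pattern table (real Grok libraries ship hundreds).
LIB = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
}

def to_regex(grok_pattern: str) -> str:
    """Replace each %{NAME:field} with a named capture group; any other
    text (literals or inline regexp such as `.+`) passes through as-is."""
    return re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: f"(?P<{m.group(2)}>{LIB[m.group(1)]})",
        grok_pattern,
    )

line = "55.3.244.1 GET /index.html 15824 0.043"

# Two literal spaces must match two spaces, so this fails on the log line:
assert re.match(to_regex("%{IP:client}  %{WORD:method}"), line) is None

# `.+` is ordinary inline regexp, so this matches (note a greedy `.+` can
# shift where a loose pattern like WORD lands; in the full Logstash example
# the stricter surrounding patterns pin each field down):
m = re.match(to_regex("%{IP:client}.+%{WORD:method}"), line)
assert m is not None and m.group("client") == "55.3.244.1"
```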

As for why this wasn't already called out explicitly in the docs, I guess that's my fault. I've been supporting Grok in one place or another for so many years that I've gotten accustomed to the fact that most users who look for it are already familiar with one of the existing implementations such as Logstash's, or, even if not, they're likely to pick up the basics (including details like this) through web searches. That's precisely why my last docs update put the Comparison to Other Implementations material up top.

But your experience illustrates that our docs should call this out explicitly for users without much prior Grok experience. Since I imagine a doc update will be necessary anyway after we make a change related to your first question, I'll plan to tackle both docs updates at the same time. If you have other feedback on what I've written here thus far, I'll take that into account when doing those updates as well.

mattnibs added a commit that referenced this issue Jan 27, 2025
mattnibs linked a pull request Jan 27, 2025 that will close this issue

chrismo commented Jan 27, 2025

The only use case for grok("%{WORD}", this) is when I'm building up a new grok call incrementally. It would just be temporary, but it could at least show me I'm on the right path, and an empty {} would probably be nicer than the current error. Or perhaps a warning saying, "Hey, you didn't name anything, so ..... I'm giving you what you asked for."

But I don't have any permanent use case for only a nameless capture; for that I'd just use grep.

mattnibs added a commit that referenced this issue Jan 27, 2025

philrz commented Jan 27, 2025

Thanks for the additional info @chrismo. As you may have noticed, @mattnibs already has a PR #5608 up for the proposed "empty record" approach, so here it is doing the right thing on that branch when faced with your example:

$ super -version
Version: v1.18.0-241-gdbf82b68

$ super -z -c 'yield "foo bar" | grok("%{WORD}", this)'
{}

As for your proposal of the warning, I see what you're getting at and will share that thought with the team, but don't be surprised if we don't pursue that idea. It's kind of an ongoing philosophical debate about how verbose we want to be in situations like this vs. letting the silence do the talking. When breaking ties in such debates someone will often bring up examples like how grep at the UNIX/Linux shell uses a combo of silence and the subtle non-zero exit code to communicate "I didn't find anything", and the debate usually ends there. 😄
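For reference, the grep convention mentioned above (silence plus an exit code instead of a warning) is easy to observe; this small sketch shells out to grep, assuming a POSIX system with grep on the PATH:

```python
import subprocess

# grep's convention: exit code 0 when at least one line matched,
# exit code 1 (and no output, with -q) when nothing matched.
found = subprocess.run(["grep", "-q", "foo"], input=b"foo bar\n").returncode
missing = subprocess.run(["grep", "-q", "baz"], input=b"foo bar\n").returncode
print(found, missing)  # prints: 0 1
```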


chrismo commented Jan 28, 2025

As for your proposal of the warning

Nah, I'm fine with that. We've discussed warnings in other cases, like where nothing at all is returned, but here an empty record being returned is great. Thx!
