
grok items not in the docs #5605

Open
chrismo opened this issue Jan 27, 2025 · 4 comments · May be fixed by #5608

Comments


chrismo commented Jan 27, 2025

  1. If I have at least one "named" capture pattern, the others don't need to be named (if I don't want them).
super -z -c 'yield "foo bar" | grok("%{WORD} %{WORD:bar}", this)'
{bar:"bar"}

but at least one must be named:

super -z -c 'yield "foo bar" | grok("%{WORD}", this)'
error({message:"grok(): value does not match pattern",on:"foo bar"})
  2. If I have at least one named pattern, I can use inline regexes without named patterns:
super -z -c 'yield "foo bar" | grok("\\w+ %{WORD:bar}", this)'
{bar:"bar"} 

I presume these aren't glitches in the function. If they're intended behavior, could these items be added to the docs? These two things make it nicer, IMO, to use grok in more scenarios where I might otherwise reach for awk or sed.


philrz commented Jan 27, 2025

Hi @chrismo. Thanks for pointing these out. Indeed, our grok function and the accompanying docs have been an ongoing area of investment, so user feedback is definitely helpful here.

Before I dive into your specific questions, it might help us to understand why this comes across as an improvement for you over awk or sed. Specifically, is it the included patterns like WORD that make it attractive? I ask because our docs (and others, such as the Logstash docs) explain Grok's usefulness primarily in terms of parsing out what matches the patterns into named fields. Your example of grok("%{WORD}", this) where you're not specifying any fields is just something I'm not accustomed to seeing attempted with Grok, hence my guess that the appeal is the convenience of having those "patterns on tap" such as WORD. Correct?

As for the specific questions, something else we touch on in the docs is that there's no formal specification of Grok, so in situations like this we basically just see what Logstash does (since it's effectively the reference implementation) and then see if we have a defensible reason for doing something different. Here's what that shows me regarding your two questions, and I'll go over these with the lead feature developer to confirm.

  1. If I have at least one "named" capture pattern, the others don't need to be named (if I don't want them) but at least one must be named.

tl;dr - Yeah, that comes across as a bug.

Here's the Logstash comparison. I'm testing with Logstash v8.17.1 and will use an example based on the one from their Grok Basics docs.

$ cat logstash.conf 
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash.conf
...

To start from the "happy path", if I paste their example log line into that window so it's processed by their stdin plugin, I can see the parsing was successful because they show what was parsed into the named fields such as client and method.

55.3.244.1 GET /index.html 15824 0.043
{
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    },
         "bytes" => "15824",
    "@timestamp" => 2025-01-27T19:45:10.251797Z,
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
       "request" => "/index.html",
      "duration" => "0.043",
        "client" => "55.3.244.1",
       "message" => "55.3.244.1 GET /index.html 15824 0.043",
        "method" => "GET"
}

Likewise, if I type in junk that doesn't parse, in addition to the named fields not being populated, they communicate this by adding a _grokparsefailure tag.

junk
{
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    },
    "@timestamp" => 2025-01-27T19:46:19.622941Z,
         "event" => {
        "original" => "junk"
    },
          "tags" => [
        [0] "_grokparsefailure"
    ],
       "message" => "junk"
}

With that established, I'll then pivot to using this alternate config where I've dropped all the named fields, hence much like your example with just %{WORD} as the one and only pattern.

$ cat logstash-no-fields.conf
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP} %{WORD} %{URIPATHPARAM} %{NUMBER} %{NUMBER}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash-no-fields.conf 
...
55.3.244.1 GET /index.html 15824 0.043
{
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    },
    "@timestamp" => 2025-01-27T19:51:31.921195Z,
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
       "message" => "55.3.244.1 GET /index.html 15824 0.043"
}

As you can see, it did not show the _grokparsefailure tag, so it effectively looks like success. Of course there were no parsed fields in the output, but that's exactly what the config implied, so it seems everything is working as designed.

By comparison, an error like value does not match pattern does indeed seem incorrect on our part. As for how to communicate that parsing was successful but nothing was extracted into named fields: since the function returns a record, I'm guessing that returning an empty record {} might be the correct thing to do.
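For illustration, the distinction can be sketched in Python with a toy matcher (the PATTERNS table and grok function here are made-up names for this sketch, not the actual super or Logstash implementation): a successful match with no named fields yields an empty record, while a failed match yields nothing.

```python
import re

# Toy subset of a Grok pattern library (illustrative only).
PATTERNS = {"WORD": r"\b\w+\b", "NUMBER": r"\d+(?:\.\d+)?"}

def grok(pattern: str, text: str):
    """Expand %{NAME} / %{NAME:field} references into one regex and match.

    Returns the named captures as a dict on success ({} when the pattern
    names no fields), or None when the text does not match at all.
    """
    def expand(m):
        name, _, field = m.group(1).partition(":")
        body = PATTERNS[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"

    compiled = re.sub(r"%\{(\w+(?::\w+)?)\}", expand, pattern)
    m = re.match(compiled, text)
    return m.groupdict() if m else None

# A match with no named fields is still a successful match:
assert grok("%{WORD}", "foo bar") == {}
assert grok("%{WORD} %{WORD:bar}", "foo bar") == {"bar": "bar"}
assert grok("%{NUMBER}", "junk") is None
```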

  2. If I have at least one named pattern, I can use inline regexes without named patterns.

tl;dr - Now that the zero-named-fields behavior can be explained as a likely bug, the inline regexps doing what you showed seems to be pretty much working as designed.

The parts of a Grok expression outside the %{pattern:field_name} syntax are just additional text that also has to match, and that text can be literal characters or more regexp. For example, going back to the first Logstash config, all I need to do is add one extra "space" character between the IP address and the method and that makes the same log line fail to parse.

$ cat logstash-extra-space.conf
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP:client}  %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash-extra-space.conf 
...
55.3.244.1 GET /index.html 15824 0.043
{
       "message" => "55.3.244.1 GET /index.html 15824 0.043",
    "@timestamp" => 2025-01-27T20:03:55.043718Z,
          "tags" => [
        [0] "_grokparsefailure"
    ],
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
      "@version" => "1",
          "host" => {
        "hostname" => "wpm.local"
    }
}

But likewise I can put .+ regular expressions in place of the single space characters and this log message will parse the same (though the pattern would now also tolerate multiple space characters in those spots, of course).

$ cat logstash-regexp-multi-spaces.conf
input {
  stdin { }
}
filter {
  grok {
    match => { "message" => "%{IP:client}.+%{WORD:method}.+%{URIPATHPARAM:request}.+%{NUMBER:bytes}.+%{NUMBER:duration}" }
  }
}
output {
  stdout {}
}

$ bin/logstash -f logstash-regexp-multi-spaces.conf 
...
55.3.244.1 GET /index.html 15824 0.043
{
        "client" => "55.3.244.1",
         "event" => {
        "original" => "55.3.244.1 GET /index.html 15824 0.043"
    },
         "bytes" => "15824",
       "message" => "55.3.244.1 GET /index.html 15824 0.043",
          "host" => {
        "hostname" => "wpm.local"
    },
        "method" => "GET",
    "@timestamp" => 2025-01-27T20:06:00.014176Z,
      "@version" => "1",
      "duration" => "0.043",
       "request" => "/index.html"
}
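To make those mechanics concrete, here's a hypothetical Python sketch of how a Grok expression expands into a plain regex (the LIB table and to_regex helper are illustrative names, not any real implementation): each %{NAME:field} becomes a named capture group, and everything between them, whether literal spaces or inline regexp like .+, is left in the regex verbatim.

```python
import re

# Tiny illustrative pattern table (real Grok libraries ship hundreds).
LIB = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
}

def to_regex(grok_pattern: str) -> str:
    """Replace each %{NAME:field} with a named capture group; any other
    text (literals or inline regexp such as `.+`) passes through as-is."""
    return re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: f"(?P<{m.group(2)}>{LIB[m.group(1)]})",
        grok_pattern,
    )

line = "55.3.244.1 GET /index.html 15824 0.043"

# Two literal spaces must match two spaces, so this fails on the log line:
assert re.match(to_regex("%{IP:client}  %{WORD:method}"), line) is None

# `.+` is ordinary inline regexp, so this matches (note a greedy `.+` can
# shift where a loose pattern like WORD lands; in the full Logstash example
# the stricter surrounding patterns pin each field down):
m = re.match(to_regex("%{IP:client}.+%{WORD:method}"), line)
assert m is not None and m.group("client") == "55.3.244.1"
```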

As for why this wasn't already called out explicitly in the docs, I guess that's my fault. I've been supporting Grok in one place or another for so many years that I've gotten accustomed to the fact that most users who look for it are already familiar with one of the existing implementations such as Logstash's, or, even if not, they're likely to pick up the basics (including details like this) through web searches. That's precisely why my last docs update put the Comparison to Other Implementations material up top.

But your experience illustrates that our docs should call this out explicitly for users without much prior Grok experience. Since I imagine a doc update will be necessary anyway after we make a change related to your first question, I'll plan to tackle both docs updates at the same time. If you have other feedback on what I've written here thus far, I'll take that into account when doing those updates as well.

mattnibs added a commit that referenced this issue Jan 27, 2025
mattnibs linked a pull request Jan 27, 2025 that will close this issue

chrismo commented Jan 27, 2025

The only use case for grok("%{WORD}", this) is when I'm building up a new grok call incrementally. It would just be temporary, but it could at least show me I'm on the right path, and an empty {} would probably be nicer than the current error. Or perhaps a warning saying, "Hey, you didn't name anything, so ..... I'm giving you what you asked for."

But I don't have any permanent use case for only a nameless capture; for that I'd just use grep.

mattnibs added a commit that referenced this issue Jan 27, 2025

philrz commented Jan 27, 2025

Thanks for the additional info @chrismo. As you may have noticed, @mattnibs already has a PR #5608 up for the proposed "empty record" approach, so here it is doing the right thing on that branch when faced with your example:

$ super -version
Version: v1.18.0-241-gdbf82b68

$ super -z -c 'yield "foo bar" | grok("%{WORD}", this)'
{}

As for your proposal of the warning, I see what you're getting at and will share that thought with the team, but don't be surprised if we don't pursue that idea. It's kind of an ongoing philosophical debate about how verbose we want to be in situations like this vs. letting the silence do the talking. When breaking ties in such debates someone will often bring up examples like how grep at the UNIX/Linux shell uses a combo of silence and the subtle non-zero exit code to communicate "I didn't find anything", and the debate usually ends there. 😄
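For reference, the grep convention mentioned above (silence plus an exit code instead of a warning) is easy to observe; this small sketch shells out to grep, assuming a POSIX system with grep on the PATH:

```python
import subprocess

# grep's convention: exit code 0 when at least one line matched,
# exit code 1 (and no output, with -q) when nothing matched.
found = subprocess.run(["grep", "-q", "foo"], input=b"foo bar\n").returncode
missing = subprocess.run(["grep", "-q", "baz"], input=b"foo bar\n").returncode
print(found, missing)  # prints: 0 1
```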


chrismo commented Jan 28, 2025

As for your proposal of the warning

Nah, I'm fine with that. We've discussed warnings in other cases, like where nothing at all is returned, but here an empty record being returned is great. Thx!
