A schema file specifies the delimiters and variables patterns (regular
expressions) necessary for log-surgeon
to parse log events. log-surgeon
uses
the delimiters to find tokens† (as in, strings of non-delimiters) in
the input, and categorizes any token that matches a variable pattern as a
variable. Any tokens that are not categorized as variables are treated as static
text. In essence, this allows the user to parse variables out of their
unstructured log events.
log-surgeon
also assigns types to each variable based on the variable
pattern's name in the schema file.
† Internally, log-surgeon
's lexer also treats a string of
delimiters as a token, just not one that matches a variable pattern.
A schema file essentially contains a list of rules, each of which has a name and a pattern (regular expression).
There are three types of rules in a schema file:
Syntax:
<variable-name>:<variable-pattern>
variable-name
may contain any alphanumeric characters, but may not be the reserved namesdelimiters
ortimestamp
.variable-pattern
is a regular expression using the supported syntax, but it cannot contain characters defined as delimiters.
Note that:
- A schema file may contain zero or more variable rules.
- Repeating the same variable name in another rule will
OR
the regular expressions (preform an alternation). - If a token matches multiple patterns from multiple rules, the token will be assigned the name of each rule, in the same order that they appear in the schema file.
Syntax:
delimiters:<characters>
delimiters
is a reserved name for this rulecharacters
is a set of characters that should be treated as delimiters
Note that:
- A schema file must contain a single
delimiters
rule. If multipledelimiters
rules are specified, only the last one will be used.
Syntax:
timestamp:<timestamp-pattern>
timestamp
is a reserved name for this ruletimestamp-pattern
is a regular expression using the supported syntax
Note that:
- Unlike variable patterns, timestamp patterns can contain delimiters.
- The parser uses a timestamp to denote the start of a new log event if:
- ... it appears as the first token in the input, or
- ... it appears after a newline character.
- Until a timestamp is found, the parser will use a newline character to denote the start of a new log event.
- The timestamp pattern is not used to match text inside a log event; since the pattern can contain delimiters, no token can match it.
Syntax: Comments are any text preceded by //
.
// Delimiters
delimiters: \t\r\n:,!;%
// Keywords
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}(\.\d{3}){0,1}
timestamp:\[\d{8}\-\d{2}:\d{2}:\d{2}\]
int:\-{0,1}[0-9]+
double:\-{0,1}[0-9]+\.[0-9]+
// Custom variables
hex:[a-fA-F]+
hasNumber:.*\d.*
equals:.*=.*[a-zA-Z0-9].*
delimiters: \t\r\n:,!;%
indicates that\t
,\r
,\n
,:
,,
,!
,;
,%
, and'
are delimiters. In a log file, consecutive delimiters, e.g., N consecutive spaces, will be tokenized as static text.timestamp
matches two different patterns:- 2023-04-19 12:32:08.064
- [20230419-12:32:08]
int
,double
,hex
,hasNumber
, andequals
all match different user defined variables.
The following regular expression rules are supported by the schema. When building a regular expression, the rules are applied as they appear in this list, from top to bottom.
REGEX RULE DEFINITION
ab Match 'a' followed by 'b'
a|b Match a OR b
[a-z] Match any character in the brackets (e.g., any lowercase letter)
- special characters must be escaped, even in brackets (e.g., [\.\(\\])
[^a-zA-Z] Match any character NOT in the brackets (e.g., non-alphabet character)
a* Match 'a' 0 or more times
a+ Match 'a' 1 or more times
a{N} Match 'a' exactly N times
a{N,M} Match 'a' between N and M times
(abc) Subexpression (concatenates abc)
\d Match any digit 0-9
\s Match any whitespace character (' ', '\r', '\t', '\v', or '\f')
. Match any *non-delimiter* character