lark examples
mmoskal committed Nov 29, 2024
1 parent 9ffbb7b commit 8634606
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions parser/src/lark/README.md
@@ -29,7 +29,60 @@ Following are currently not supported:
- templates
- imports (other than built-in `%import common`)
- regexes use Rust `regex` crate [syntax](https://docs.rs/regex/latest/regex/#syntax), not Python's `re` (though they are similar)
- certain string syntax, see [issue](https://github.com/microsoft/llguidance/issues/54)

The following features of llguidance are currently not exposed in Lark syntax:

- per-lexeme contextual and lazy flags

## Examples

### Llama JSON tool calling

Here, we restrict the output to either a normal text response
or a tool call to Brave or Wolfram Alpha.

```lark
start: normal_text | brave | wolfram
normal_text: /(.|\n)*/
brave: <|python_tag|> "brave_search.call(query=" JSON_STRING ")" <|eom_id|>
wolfram: <|python_tag|> "wolfram_alpha.call(query=" JSON_STRING ")" <|eom_id|>
JSON_STRING_CHAR: /(\\([\"\\\/bfnrt]|u[a-fA-F0-9]{4})|[^\"\\\x00-\x1F\x7F])/
JSON_STRING: "\"" JSON_STRING_CHAR* "\""
```

Note that, just as in Lark, uppercase identifiers define grammar lexemes
(often also called tokens). Lexemes can't be recursive,
because they are compiled to regular expressions.
This has performance implications; in particular, you should **avoid short lexemes**.
If the grammar used `json_string` instead of `JSON_STRING`,
then each `json_string` would consist of the lexeme `"`, followed
by any number of single-character lexemes, followed by the lexeme `"`.
Such a grammar would be very slow to run.
With upper-case `JSON_STRING`, the whole string is a single lexeme.
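
To make the "compiled to regular expressions" point concrete, here is a minimal sketch of how `JSON_STRING` collapses into one pattern.
It uses Python's `re` purely as a stand-in (llguidance itself uses the Rust `regex` crate), and the helper name `is_json_string` is hypothetical.

```python
import re

# One JSON string character, copied from JSON_STRING_CHAR in the grammar above.
# Assumption: Python's `re` accepts this pattern the same way Rust's `regex` does.
JSON_STRING_CHAR = r'(\\(["\\/bfnrt]|u[a-fA-F0-9]{4})|[^"\\\x00-\x1F\x7F])'

# JSON_STRING: "\"" JSON_STRING_CHAR* "\"" becomes a single regular expression,
# so the whole quoted string is matched as one lexeme.
JSON_STRING = re.compile('"' + JSON_STRING_CHAR + '*"')


def is_json_string(text: str) -> bool:
    """Hypothetical helper: True if `text` is exactly one JSON_STRING lexeme."""
    return JSON_STRING.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_json_string('"weather in Paris"'))  # well-formed string
    print(is_json_string('"unterminated'))       # missing closing quote
```

A lowercase `json_string` rule would instead hand the parser each quote and each character as a separate lexeme, which is the slow case described above.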

BTW, in this case you may want to replace the JSON string
with a Python string, depending on how the model was trained.

You can also use Lark-like syntax to combine JSON schemas with regular output.
In that case, you pass the JSON schemas as additional grammars, with
the lark grammar being the top-level one.

```lark
start: normal_text | fun_call
// @fun0, @fun1 refer to other sub-grammars, see below
fun_call: <|python_tag|> ( @fun0 | @fun1 ) <|eom_id|>
normal_text: /(.|\n)*/
```

```json
{
  "grammars": [
    {
      "lark_grammar": "...the lark above..."
    },
    {"name": "fun0", "json_schema": { ... }},
    {"name": "fun1", "json_schema": { ... }}
  ]
}
```
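
As a rough sketch, the structure above could be assembled in Python as follows.
The two schemas and the `build_grammars` helper are made up for illustration; only the `grammars`, `lark_grammar`, `name`, and `json_schema` keys come from the example above, and how the result is passed to llguidance is not shown here.

```python
import json

# Top-level Lark grammar from the example above. The raw string keeps `\n`
# as a literal backslash-n, which is what the regex in the grammar expects.
LARK_GRAMMAR = r"""
start: normal_text | fun_call
fun_call: <|python_tag|> ( @fun0 | @fun1 ) <|eom_id|>
normal_text: /(.|\n)*/
"""

# Hypothetical tool schemas, standing in for the elided `{ ... }` above.
SCHEMAS = {
    "fun0": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "fun1": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}


def build_grammars(lark_grammar: str, schemas: dict) -> dict:
    """Put the Lark grammar first (top-level), then one named entry per schema."""
    grammars = [{"lark_grammar": lark_grammar}]
    for name, schema in schemas.items():
        grammars.append({"name": name, "json_schema": schema})
    return {"grammars": grammars}


payload = build_grammars(LARK_GRAMMAR, SCHEMAS)

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```

Each `@funN` reference in the Lark grammar should have a sub-grammar entry with the matching `name`.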
