Commit 0ca091a
update integration and format status
mmoskal committed Dec 2, 2024
1 parent cfef3df commit 0ca091a
Showing 1 changed file with 4 additions and 2 deletions.
README.md
@@ -12,7 +12,7 @@ The following grammar formats are supported:
- a large subset of JSON schemas (but see [issue](https://github.com/microsoft/llguidance/issues/44))
- context-free grammars in (a [subset](./parser/src/lark/README.md) of) [Lark](https://github.com/lark-parser/lark) format

-The internal format is most powerful and can be generated by the following libraries:
+The internal format is the most powerful (though the Lark-like format is catching up) and can be generated by the following libraries:
- [Guidance](https://github.com/guidance-ai/guidance) (Python)
- [guidance.ts](https://github.com/mmoskal/guidance-ts) (TypeScript)
- hopefully more to come!
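
As a quick illustration, here is a minimal sketch of producing such a constraint grammar from the Guidance Python library; the model path and prompt are placeholders for the example, not part of this repository:

```python
# Minimal sketch using Guidance's documented primitives; "model.gguf"
# is a placeholder path to a local llama.cpp model file.
from guidance import models, select, gen

lm = models.LlamaCpp("model.gguf")

# Constrain the next tokens to a fixed set of choices...
lm += "Is the answer yes or no? " + select(["yes", "no"])

# ...or to a regular expression; Guidance compiles such constraints
# down to the grammar format that llguidance enforces during sampling.
lm += "\nConfidence (0-100): " + gen(regex=r"\d{1,3}")
```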
@@ -26,10 +26,10 @@ The library is currently integrated in:
- [Guidance](https://github.com/guidance-ai/guidance) - library for interacting with LLMs;
  uses either llama.cpp or HF Transformers
- [LLGTRT](https://github.com/guidance-ai/llgtrt) - OpenAI-compatible REST server using NVIDIA's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
+- [mistral.rs](https://github.com/EricLBuehler/mistral.rs/pull/899)

The integration is ongoing in:
- onnxruntime-genai - [draft PR](https://github.com/microsoft/onnxruntime-genai/pull/1038)
-- mistral.rs - [preliminary PR](https://github.com/EricLBuehler/mistral.rs/pull/899)
- llama.cpp - [preliminary PR](https://github.com/ggerganov/llama.cpp/pull/10224);
  note that llama.cpp is already fully integrated into Guidance (see above)
  via its Python bindings
@@ -46,6 +46,8 @@ The library implements a context-free grammar parser using Earley’s algorithm

[Outlines](https://github.com/dottxt-ai/outlines) builds an automaton from constraints and then pre-computes token masks for all automaton states, making sampling fast but inherently limiting constraint complexity and introducing significant startup cost and memory overhead. Llguidance computes token masks on the fly and has essentially no startup cost. The lexer’s automata are built lazily and are typically much smaller, as the context-free grammar imposes the top-level structure.

+The recently released [XGrammar](https://github.com/mlc-ai/xgrammar) follows an approach similar to llama.cpp (an explicit stack-based, character-level parser), with additional pre-computation of certain token masks, similar to Outlines.
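
To make the trade-off concrete, here is a toy sketch (hypothetical code, not either library's actual API) contrasting the two masking strategies:

```python
# Toy illustration only: real token masks are bitmasks over a tokenizer
# vocabulary, and real states come from an automaton or parser.
VOCAB = ["0", "1", "2", "a", "b"]
STATES = ["digit", "alpha"]

def compute_mask(state: str) -> list[bool]:
    # Stand-in for the real mask computation: state "digit" admits
    # digit tokens, state "alpha" admits the rest.
    return [tok.isdigit() == (state == "digit") for tok in VOCAB]

# Outlines-style: precompute a mask for every automaton state up front.
# Per-step lookup is then O(1), but startup time and memory grow with
# the number of states (times the vocabulary size).
precomputed = {state: compute_mask(state) for state in STATES}

# llguidance-style: essentially no startup cost; the mask for the
# current parser state is computed on demand at each decoding step.
def mask_on_the_fly(state: str) -> list[bool]:
    return compute_mask(state)

assert precomputed["digit"] == mask_on_the_fly("digit")
```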

In llguidance, online mask computation takes approximately 1ms of CPU time per sequence in a batch. Thus, with 16 cores and a 10ms forward pass, the library can handle batch sizes up to 160 without slowing down the model. (Note that a 10ms forward pass for small batch sizes typically increases to 20ms+ for batch sizes of 100-200.)
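
The 160-sequence figure follows directly from those numbers:

```python
# Back-of-the-envelope check of the batch-size claim above.
mask_ms_per_seq = 1.0    # CPU time to compute one token mask
cores = 16               # CPU cores computing masks in parallel
forward_pass_ms = 10.0   # forward pass the mask work must overlap with

# Masks the cores can produce while one forward pass runs:
max_batch = cores * forward_pass_ms / mask_ms_per_seq
print(max_batch)  # 160.0
```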

## Building
