This repository has been archived by the owner on Nov 30, 2024. It is now read-only.
Showing 2 changed files with 28 additions and 0 deletions.
# Support for special tokens

Tokenizers typically include special tokens, such as
`<|end_of_text|>`, `<|eot_id|>`, `<|python_tag|>`, `<|start_header_id|>`, etc.
This library translates between byte sequences and tokens.
If the bytes `<|eot_id|>` appear in the input, you may or may not want them
treated as a special token.

By default, the library assumes that you want to treat them as bytes
(so they would be tokenized as `<|`, `eot`, `_`, `id`, `|>` or similar).
To indicate that you want to treat them as a special token, you need to
prefix them with byte 0xFF (255) (`TokTrie::SPECIAL_TOKEN_PREFIX_BYTE`).
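
The prefixing rule can be sketched in a few lines of Python. This is only an illustration of the byte layout: the constant mirrors the value 0xFF stated above, and the helper `mark_special` is a hypothetical name, not part of the library's API.

```python
# Value of the marker byte as described above (0xFF / 255).
# Defined locally for illustration; not imported from the library.
SPECIAL_TOKEN_PREFIX_BYTE = 0xFF


def mark_special(token_text: str) -> bytes:
    """Hypothetical helper: prefix a token's UTF-8 bytes with the marker byte."""
    return bytes([SPECIAL_TOKEN_PREFIX_BYTE]) + token_text.encode("utf-8")


# Plain bytes: would be split into ordinary tokens such as `<|`, `eot`, ...
plain = "<|eot_id|>".encode("utf-8")

# Marked bytes: the leading 0xFF signals "treat this as the special token".
special = mark_special("<|eot_id|>")
```

The two byte strings differ only in the leading marker byte, which is what lets the same text be interpreted either way.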

Byte FF is chosen because it never appears in valid UTF-8, so it should not normally
occur in regular inputs.
In Rust, you cannot have byte FF in `&str`, only in `&[u8]`.
In Python, note the difference between `b"\xFF"` (the single byte FF) and `"\xFF".encode("utf-8")`
(or equivalently `"\u00FF".encode("utf-8")`), which is `b"\xC3\xBF"`.
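
The Python distinction is easy to check directly. The short sketch below uses only the standard library; `is_valid_utf8` is a hypothetical helper added for illustration.

```python
# A bytes literal: exactly one byte with value 255.
raw = b"\xFF"

# A str literal containing U+00FF, then encoded: two bytes in UTF-8.
encoded = "\xFF".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether `data` decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw_is_valid = is_valid_utf8(raw)          # the lone FF byte is not valid UTF-8
encoded_is_valid = is_valid_utf8(encoded)  # the encoded character is valid UTF-8
```

Only `raw` carries the marker byte; `encoded` is ordinary text and would be tokenized as such.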

If you're constructing the `TokTrie` manually,
the token array passed to its constructor should include the special tokens
with the prefix byte FF.
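
As a rough sketch of what such a token array might look like, the snippet below builds a list of byte strings in which only the special tokens carry the FF prefix. The toy vocabulary and the helper `build_token_array` are assumptions made for illustration, and the actual `TokTrie` constructor call is deliberately not shown, since its signature is not described here.

```python
from typing import List

# Marker byte value as described above; defined locally for illustration.
SPECIAL_TOKEN_PREFIX_BYTE = 0xFF


def build_token_array(regular: List[bytes], special: List[str]) -> List[bytes]:
    """Hypothetical helper: regular tokens as-is, special tokens FF-prefixed.

    The position of each entry in the returned list plays the role of its
    token id in this toy example.
    """
    prefix = bytes([SPECIAL_TOKEN_PREFIX_BYTE])
    return list(regular) + [prefix + s.encode("utf-8") for s in special]


# Toy vocabulary (an assumption, not a real tokenizer's vocabulary).
tokens = build_token_array(
    regular=[b"<|", b"eot", b"_", b"id", b"|>"],
    special=["<|eot_id|>", "<|end_of_text|>"],
)
```

Here the regular pieces and the special token `<|eot_id|>` coexist in the same array; the prefix byte is the only thing that distinguishes the special entry from its plain-byte spelling.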

The llguidance library does not expose the FF bytes externally
(except for special `tokenize_bytes_prefix` methods), so you
generally don't need to worry about them, except when building the `TokTrie`.