API Enhancement // strategy is now a module, not function capture #12

Merged 3 commits on Mar 8, 2024

6 changes: 3 additions & 3 deletions README.md
@@ -49,7 +49,7 @@ text = "Your text to be split..."
chunks = TextChunker.split(text)
```

-This will chunk up your text using the default parameters - a chunk size of `1000`, chunk overlap of `200`, format of :`plaintext` and using the `RecursiveChunk` strategy.
+This will chunk up your text using the default parameters - a chunk size of `1000`, chunk overlap of `200`, format of `:plaintext` and using the `RecursiveChunk` strategy.

The split method returns `Chunks` of your text. These chunks include the start and end bytes of each chunk.

@@ -67,7 +67,7 @@ If you wish to adjust these parameters, configuration can optionally be passed v

- `chunk_size` - The approximate target chunk size, as measured per code points. This means that both `a` and `👻` count as one. Chunks will not exceed this maximum, but may sometimes be smaller. **Important note** This means that graphemes *may* be split. For example, `👩‍🚒` may be split into `👩,🚒` or not depending on the split boundary.
- `chunk_overlap` - The contextual overlap between chunks, as measured per code point. Overlap is *not* guaranteed; again this should be treated as a maximum. The size of an individual overlap will depend on the semantics of the text being split.
-- `format` (informs separator selection). Because we are trying to preserve meaning between the chunks, the format of the text we are splitting is important. It's important to split newlines in plain text; it's important to split `###` headings in markdown.
+- `format` - What informs separator selection. Because we are trying to preserve meaning between the chunks, the format of the text we are splitting is important. It's important to split newlines in plain text; it's important to split `###` headings in markdown.

```elixir
text = """
@@ -103,7 +103,7 @@ iex(10)> TextChunker.split(text)
]

text = "This is a sample text. It will be split into properly-sized chunks using the TextChunker library."
-opts = [chunk_size: 50, chunk_overlap: 5, format: :plaintext, strategy: &TextChunker.Strategies.RecursiveChunk.split/2]
+opts = [chunk_size: 50, chunk_overlap: 5, format: :plaintext, strategy: TextChunker.Strategies.RecursiveChunk]

iex(10)> TextChunker.split(text, opts)

4 changes: 2 additions & 2 deletions lib/text_chunker.ex
@@ -13,7 +13,7 @@ defmodule TextChunker do
   @default_opts [
     chunk_size: 2000,
     chunk_overlap: 200,
-    strategy: &RecursiveChunk.split/2,
+    strategy: RecursiveChunk,
     format: :plaintext
   ]

@@ -43,6 +43,6 @@ defmodule TextChunker do
   def split(text, opts \\ []) do
     opts = Keyword.merge(@default_opts, opts)

-    opts[:strategy].(text, opts)
+    opts[:strategy].split(text, opts)
   end
 end
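
For readers following the change, here is a minimal sketch of what module-based dispatch implies for a custom strategy: any module that exports `split/2` can be passed as `:strategy`. Both `WhitespaceStrategy` and `DispatchSketch` are hypothetical names used only for illustration. Neither is part of TextChunker, and the sketch returns plain strings, whereas the README above says the library returns chunks that carry start and end bytes.

```elixir
# Hypothetical strategy module (not part of TextChunker).
# After this change a strategy only needs to export split/2.
defmodule WhitespaceStrategy do
  # Ignores the options and splits on whitespace, purely for illustration.
  def split(text, _opts), do: String.split(text)
end

# Hypothetical stand-in for TextChunker.split/2, reduced to the dispatch step
# shown in the diff above.
defmodule DispatchSketch do
  @default_opts [
    chunk_size: 2000,
    chunk_overlap: 200,
    strategy: WhitespaceStrategy,
    format: :plaintext
  ]

  def split(text, opts \\ []) do
    opts = Keyword.merge(@default_opts, opts)

    # opts[:strategy] is a module atom, so split/2 is called on it by name.
    # With the old function capture this line was opts[:strategy].(text, opts).
    opts[:strategy].split(text, opts)
  end
end

DispatchSketch.split("one two three")
# => ["one", "two", "three"]
```

One consequence of the new call shape is that passing a captured function such as `&WhitespaceStrategy.split/2` as `:strategy` no longer works, because `.split(text, opts)` expects a module on the left-hand side.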