Describe the Problem
The OpenAI APIs limit the number of elements that can be embedded in a single request and the size of each element. To work around this, the JS SDK and the Python SDK provide the following:
- Batching when the number of documents exceeds 2048 (a Promise.all over the batched API calls).
- Splitting large documents (>~8000 tokens) into sections and then joining the retrieved embeddings in a normalized fashion (here the JS SDK does not do it, but the Python one does). A sketch of both behaviors follows this list.
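As a rough sketch of what the two workarounds look like (the `embedBatch` function below is a hypothetical stand-in for the SDK's actual embedding call, not a real API):

```ts
// Hypothetical signature for the underlying embedding request.
type EmbedBatchFn = (batch: string[]) => Promise<number[][]>;

const MAX_BATCH_SIZE = 2048;

// Batching: chunk the input into groups of at most 2048 documents and
// fire the requests concurrently via Promise.all.
async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatchFn
): Promise<number[][]> {
  const batches: string[][] = [];
  for (let i = 0; i < texts.length; i += MAX_BATCH_SIZE) {
    batches.push(texts.slice(i, i + MAX_BATCH_SIZE));
  }
  const perBatch = await Promise.all(batches.map((b) => embedBatch(b)));
  return perBatch.flat();
}

// Split-and-join: combine the chunk embeddings of one oversized document
// as a length-weighted average, then L2-normalize the result (mirroring
// the normalized join described for the Python SDK).
function joinChunkEmbeddings(
  chunkEmbeddings: number[][],
  chunkLengths: number[]
): number[] {
  const dim = chunkEmbeddings[0].length;
  const totalLength = chunkLengths.reduce((a, b) => a + b, 0);
  const joined = new Array<number>(dim).fill(0);
  chunkEmbeddings.forEach((embedding, i) => {
    const weight = chunkLengths[i] / totalLength;
    for (let d = 0; d < dim; d++) {
      joined[d] += weight * embedding[d];
    }
  });
  const norm = Math.sqrt(joined.reduce((sum, v) => sum + v * v, 0));
  return joined.map((v) => v / norm);
}
```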
Propose a Solution
Base the implementation of the SDK chat model on @langchain/openai models.
For the per-document split, I believe this needs a custom implementation on top of the core SDK; I didn't see it in JS LangChain.
Describe Alternatives
Create a wrapper over the existing model.
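For illustration, such a wrapper could look roughly like the sketch below; `EmbeddingModel` and `BatchingEmbeddingModel` are hypothetical names, not part of the SDK:

```ts
// Minimal interface assumed for whatever embedding client already exists.
interface EmbeddingModel {
  embedDocuments(texts: string[]): Promise<number[][]>;
}

// Wrapper that adds batching on top of an existing model without
// modifying it.
class BatchingEmbeddingModel implements EmbeddingModel {
  constructor(
    private readonly inner: EmbeddingModel,
    private readonly maxBatchSize = 2048
  ) {}

  async embedDocuments(texts: string[]): Promise<number[][]> {
    const batches: string[][] = [];
    for (let i = 0; i < texts.length; i += this.maxBatchSize) {
      batches.push(texts.slice(i, i + this.maxBatchSize));
    }
    const results = await Promise.all(
      batches.map((b) => this.inner.embedDocuments(b))
    );
    return results.flat();
  }
}
```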
Affected Development Phase
Development
Impact
Inconvenience
Timeline
No response
Additional Context
No response
Thanks for raising this feature request. Have you checked the LangChain text splitters, the LangChain vector stores, and the sample code? We currently do not handle sending batched requests or joining embeddings. Would these satisfy your needs, since you are using LangChain?
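For reference, pre-splitting an oversized document with LangChain's text splitters before calling embedDocuments() could look like this; `longDocument` and `embeddingModel` are placeholders for the caller's data and whichever embeddings client is in use:

```ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

declare const longDocument: string;
declare const embeddingModel: {
  embedDocuments(texts: string[]): Promise<number[][]>;
};

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // characters per chunk; tune this toward the token limit
  chunkOverlap: 200, // overlap preserves context across chunk boundaries
});

// Split the document into chunks that fit the model's input limit,
// then embed each chunk.
const chunks = await splitter.splitText(longDocument);
const embeddings = await embeddingModel.embedDocuments(chunks);
```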
Out of all the feature requests, this one can, in my opinion, be treated as just a nice-to-have.
For splitting a single document: realistically, no document should be that big to begin with, and as you've pointed out, splitters exist. I added this point just for feature parity with the Python gen AI SDK and LangChain.
For batching multiple documents, however, langchain-js/openai does this as a convenience. I realize we should steer clear of an approach like this: if one request in the batch fails, we lose the quota spent on the whole Promise.all set. Again, this is here to provide parity with the existing behavior of the JS and Python SDKs.
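To illustrate the quota concern: Promise.all dispatches every batch up front, so quota is spent on all of them even if one fails, whereas a sequential loop stops sending at the first failure. A sketch, reusing the same hypothetical `embedBatch` call as above:

```ts
type EmbedBatchFn = (batch: string[]) => Promise<number[][]>;

// Await batches one at a time; if a call throws, no later batch has
// been sent yet, so no further quota is consumed.
async function embedSequentially(
  batches: string[][],
  embedBatch: EmbedBatchFn
): Promise<number[][]> {
  const results: number[][] = [];
  for (const batch of batches) {
    results.push(...(await embedBatch(batch)));
  }
  return results;
}
```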