[FEATURE] Enable to use passage chunks from hybrid neural search result as RAG input #2612
Comments
@reuschling, can you share some sample data from your index? I think you should save the text chunks together with the embeddings in the index. |
This is what a document looks like, returned by an empty query. I truncated the embedding vectors for better readability. I use dynamic mappings, hence the prefixes in the field names. Do you mean to save just the text chunks instead of the whole documents? In that case I cannot search the whole documents anymore. The hybrid search searches inside the whole documents in the 'classical' query part; only the embeddings rely on chunks, because the model input length is limited and the semantic representation of the embeddings may become too generalized, depending on the model used. If I additionally chunk the body with respect to the LLM input token size in an ingest pipeline, I get nested 'bodyChunk4LLM' fields in the origin document, as is currently the case for the paragraph and paragraph->chunkSize4Embedding fields. In the hybrid query with RAG postprocessing, the situation would be the following:
I tried to generate a scripted field inside the search query where neural search returns the matched chunk offset with "inner_hits". But "inner_hits" are not returned with hybrid queries, and scripted fields don't have access to the inner_hits return values, only to the fields inside the origin document. I also doubt that the returned format would be valid for the RAG processor, as scripted fields do not appear in _source. The second possibility would be to generate additional documents with a single bodyChunk4LLM field, i.e. for each input document N new documents with chunks and a reference to the origin doc, and then to use these documents for RAG and the big origin documents for the other searches (classic, neural + hybrid). But the chunk ingest processor doesn't generate separate documents, only new fields with N values inside the origin document. I don't see an ingest processor that can do this, maybe you have a hint? 😄 Third, I could generate both kinds of documents (origin and the N corresponding chunks) outside OpenSearch, but this means changing code inside all possible document providers, which we often don't have access to. It would be much better if this could be done entirely inside OpenSearch configs, e.g. with ingest pipelines. There would also be a huge redundancy in terms of data size, because all embeddings, chunks, etc. must be generated for both kinds of documents, origin and bodyChunk4LLM, in order to support hybrid neural search in both cases. But this is another story.
"hits": [
{
"_index": "testfiles",
"_id": "/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
"_score": 1.0,
"_source": {
"tns_description": "",
"tk_source": "file:/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
"paragraphs_tns_title": [
"Laptop power supplies are available in First Class only"
],
"resourceName": "/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
"dataEntityContentFingerprint": "1330962029000",
"paragraphs_tns_description": [],
"paragraphs_tns_body": [
"\n Code, Write, Fly\n\n",
" This chapter is being written 11,000 meters above New Foundland. \n "
],
"tns_title": "Laptop power supplies are available in First Class only",
"date_modified": "2012.03.05 16:40:29:000",
"tns_body": "\n Code, Write, Fly\n\n This chapter is being written 11,000 meters above New Foundland. \n ",
"embedding_chunked_512_paragraphs_chunks_tns_body": [
{
"knn": [
-0.016918883,
-0.0045324834,
...
]
},
{
"knn": [
0.03815532,
0.015174329,
...
]
}
],
"X-TIKA:Parsed-By": [
"org.apache.tika.parser.CompositeParser",
"de.dfki.km.leech.parser.HtmlCrawlerParser"
],
"Content-Encoding": "windows-1252",
"dataEntityId": "/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
"paragraphs_chunks_tns_title": [
"Laptop power supplies are available in First Class only"
],
"paragraphs_chunks_tns_body": [
"\n Code, Write, Fly\n\n",
" This chapter is being written 11,000 meters above New Foundland. \n "
],
"embedding_512_tns_body": [
0.021791747,
0.0016991429,
...
],
"Content-Type": "text/html; charset=windows-1252",
"paragraphs_chunks_tns_description": []
}
}
] |
You are right. To the best of my knowledge, none of the ingest processors can return multiple documents. Can you take a look at |
From my understanding, you only need the specific chunk instead of the whole document. You just need a way for the nested query to tell you which chunk was matched. Please correct me if I am wrong. @reuschling |
Yes, exactly, @yuye-aws, because the LLM has a limited text length that it can process. But there is still the problem that the chunks for the embedding model, which the nested query matches against, normally have different sizes than the chunks needed for the LLM, which are in general much bigger. Hence my thought to get the neighbor chunks as well. So, in terms of control over the whole search process, I would now assume it would be best to have control over the LLM chunk size, which is currently only possible via the field content size. This can currently be achieved only by generating separate documents with LLM-related chunk sizes, created outside (logstash, other document providers) or inside OpenSearch (ingest processor). Inside OpenSearch would be preferable by far - there exist millions of document providers in parallel to logstash, and otherwise they would all have to be adjusted. I personally would now look into whether it is possible to create a custom ingest processor (maybe script based?) that creates chunks like the current text chunking processor, but creates separate documents instead of additional fields. The idea of using logstash as a postprocessing step is also not bad, but it is not so easy to realize for a non-static index where new document content is added frequently. |
Aligning the chunk size with the LLM is a good practice for search relevance. I have created an RFC on a model-based tokenizer: opensearch-project/neural-search#794. Do you think your concern will be addressed if we can register the same tokenizer as the LLM? By the way, I have listed a few options under the RFC, along with their pros and cons. Would you like to share your opinions? Any comments will be appreciated. |
I am afraid not. The ingest processor in OpenSearch only performs certain actions upon a single document. We cannot create multiple documents based on a single document. @ylwu-amzn Can you provide some suggestions? Maybe we can have a discussion together. |
How about supporting inner_hits in hybrid queries? Feel free to create an RFC so that we can discuss next-step plans. |
There is already a feature request for this: opensearch-project/neural-search#718. This could also be a valid solution, because there would be no need for extra chunking for the LLM anymore. It would be possible to build the chunk for the LLM out of the matched embedding chunk and its neighbor chunks. The only drawback is that the match would rely only on the embedding part of the hybrid query. |
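The idea of building an LLM-sized chunk from the matched embedding chunk and its neighbors can be sketched as follows. This is only an illustration under stated assumptions, not an OpenSearch API: `build_llm_chunk` is a hypothetical helper, the chunks are assumed to be available as an ordered list with a known matched index, and tokens are approximated by whitespace-separated words rather than by the LLM's real tokenizer.

```python
def build_llm_chunk(chunks, matched_index, max_tokens,
                    count_tokens=lambda text: len(text.split())):
    """Grow a window around the matched chunk until the token budget is used up.

    Assumption: whitespace word count stands in for the real LLM tokenizer.
    The matched chunk is always returned, even if it alone exceeds the budget.
    """
    if not 0 <= matched_index < len(chunks):
        raise IndexError("matched_index out of range")

    start = end = matched_index
    used = count_tokens(chunks[matched_index])

    grew = True
    while grew:
        grew = False
        # Try to add the preceding neighbor chunk.
        if start > 0:
            cost = count_tokens(chunks[start - 1])
            if used + cost <= max_tokens:
                start -= 1
                used += cost
                grew = True
        # Try to add the following neighbor chunk.
        if end < len(chunks) - 1:
            cost = count_tokens(chunks[end + 1])
            if used + cost <= max_tokens:
                end += 1
                used += cost
                grew = True

    return " ".join(chunks[start:end + 1])
```

Note that the window grows purely by position; it makes no claim that the neighbor chunks are relevant to the query.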
This is a better way to specify valid chunk lengths for the current model, right? I currently use the formula |
Replied in opensearch-project/neural-search#794 |
I see. Supporting inner hits from the hybrid query does not suffice to resolve your problem. It may take us some time to investigate and check valid solutions. Thanks for your patience. If we enable the nested query to return the specific chunk, does that resolve your problem? |
Still, the RAG processor doesn't support this as input. And there is also no real control to build chunks with the right size for RAG / the LLM. The question is how OpenSearch can provide control over the LLM chunk sizes for RAG without the need to chunk the documents outside of OpenSearch. Outside of OpenSearch there is the same problem for chunking a document as you described in opensearch-project/neural-search#794. For answering a single question, the ideal LLM chunk size would be:
Nevertheless, for a conversation with follow-up questions the chunk sizes have to be smaller. Is there a solution in conversational search to ensure that the input context length of the LLM won't be exceeded? I see these possibilities to get control over the LLM chunk sizes:
|
The current OpenSearch version does not support this feature. We will have a few discussions within our team to explore possible solutions. |
Hi @yuye-aws, can you help take care of this issue? |
Sure |
Hi @reuschling! We are investigating possible solutions for this issue. Can you provide the postprocessing script to access the chunk offset? |
@yuye-aws I am not sure what you mean. Do you mean postprocessing as part of the query? As part of my suggested solution 2? I'm not sure how to accomplish the relationship between the hybrid query matched field and the needed llm chunk. Maybe with text offset overlaps? |
@reuschling, I'm building a solution based on Agents for another customer who has a similar problem. Is it OK to use an Agent to run RAG for your case https://github.com/opensearch-project/ml-commons/blob/2.x/docs/tutorials/agent_framework/RAG_with_conversational_flow_agent.md ? |
@ylwu-amzn, thanks for your hint. I had a look at the agent framework, but the circumstances are the same as with the other possibilities to configure RAG, right? It makes no difference regarding text chunks for the LLM whether RAG is configured with a conversational search template or a conversational agent. |
I mean post processing a nested search query (non-hybrid). How do you expect to use a script field and inner_hits to retrieve a specific field? (If you can access the inner_hits from the script field) Sorry for taking so long to respond. I was recently busy with other tasks. |
Well, after some investigation, I came up with the following two options. Which option do you prefer, @reuschling?
|
Here are the pros and cons of both options. I am personally in favor of the first option.
Option 1
Pros
Cons
Option 2
Pros
Cons
|
You mean implementing my suggested solution 2, 'Enabling RAG processor to somehow deal with field chunks made for LLM input', with a search response processor that can find the right chunk4llm_field out of the hybrid query result? With the help of further queries, so that similar chunk fields are ranked by search relevance score? But what should be the input of this query? By processing a hybrid search we will have a possible term-based, classic match, and an embedding-based match against a chunk4embeddings_field. What are the general criteria? Further, I doubt that OpenSearch can retrieve single (chunk) field values as results; if a single field has several values, they are treated as concatenated for search, aren't they? And, last but not least, processing further queries per result document could also be a performance issue. Your second point with a token limit for the retrieval_augmented_generation processor sounds good, at least for throwing an error if the input exceeds the configured limit. |
Actually it's not a query. It's a search response processor: https://opensearch.org/docs/latest/search-plugins/search-pipelines/search-processors/#search-response-processors. The processor will further process the results retrieved from the nested query. To be specific, it will visit the inner_hits to see the relevance score of each chunk, rerank the chunks and then return the results to the user. I will take a look into the neural query and hybrid query in the next few days. |
Truncating the input is not supported. But I think your LLM can automatically do the truncation. |
The token limit solution is only feasible when we support model-based tokenizer in OpenSearch. I'm afraid it will take at least a few releases to accomplish. Perhaps you can wait for the OpenSearch 2.19 release. |
I will take a look at this solution in the next few days. Just one question for you, @reuschling: is it required for you to use a hybrid query? Since the search response of a hybrid query does not support inner_hits, the proposed search response processor may not be able to retrieve the hybrid score of each chunk. |
Yes, I deal with hybrid queries. I think it would be unwise to ignore the core search competences of Lucene/OpenSearch when considering possible solutions. Generalized solutions would be better. But everything begins with the first step :) So, if I understand you right, your suggestion is to write a search response processor that does what I tried with a scripted field, right? To build the right chunk for the LLM out of the matched embedding chunk and its neighbors. This would be the third of the possible solutions I had in mind, but I am no longer sure whether this would be a valid solution.
Still, generating llm chunks on the fly would be nice of course.
Cons
Maybe a better solution would be to somehow get the term offset of the match inside the origin field, i.e. the term offset for the term-based match and the term offset for the matched embedding chunk. With this offset it would be possible to cut the chunk for the LLM out of the origin field. But for the term-based part of the match, there is no single clear term offset inside the origin document. There are only several term match offsets, and it is still unclear which part of the document would be the best chunk. Thus, this solution would also be possible for the neural search part only. For the term-based part, I currently only see the possibility to search inside pre-chunked document parts sized for the LLM - not 'on the fly'. Here the current gap in OpenSearch is that the chunking has to be done outside of OpenSearch. It would be a huge benefit - from my point of view - if this could be done inside OpenSearch as well. Possibilities would be my suggested solutions 1. and 2., i.e.
Possibility 1 sounds like the easiest solution to me, but I am not aware of possible hard restrictions for ingest processors. |
Basically, the ingest pipeline in OpenSearch is a one-to-one mapping of documents. For any ingested document, both the input and the output should be a single document. You can check the following code snippet to get a rough idea: https://github.com/opensearch-project/OpenSearch/blob/7c9c01d8831f57b853647bebebd8d91802186778/server/src/main/java/org/opensearch/ingest/IngestDocument.java#L797-L819 |
Hi @chishui! Do you think there is any possible method to generate multiple documents when a user is ingesting a document? Intuitively, can we modify the |
It is a good idea to reduce the overlapping tokens. I guess you are expecting to retrieve the neighbor chunks along with the matched chunk. In my opinion, that is not a hard requirement for addressing your problem. We just need to find the best-matching chunk and return it to the user. |
Will take a look at the hybrid query in the next few days, just in case other solutions do not work. I would like to begin with the search response processor solution to support the neural query first. Also, maybe you can leave an email or join the OpenSearch Slack channel so that we can have a meeting and respond to your messages ASAP. For your information: https://opensearch.org/slack.html |
No, here I am not talking about neighbor chunks. We have two different chunks, one with the chunk size for the LLM, one with the chunk size for the embeddings. Here I am thinking about how to establish a relationship between a matched embedding chunk from the neural search and the corresponding pre-calculated chunk field in LLM size. One possibility could be to look at the source offsets of both chunks. |
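Relating a matched embedding chunk to a pre-calculated LLM-sized chunk via source offsets could look roughly like this. It is a sketch under assumptions: `find_llm_chunk` is a hypothetical helper, and both kinds of chunks are assumed to carry half-open character offsets `(start, end)` into the origin field, which the current chunking processor output shown in this thread does not include.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two half-open offset ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def find_llm_chunk(embedding_span, llm_spans):
    """Return the index of the LLM chunk that overlaps the matched
    embedding chunk the most, or None if nothing overlaps.

    Assumption: spans are (start, end) character offsets into the
    same origin field.
    """
    best_index, best_overlap = None, 0
    for index, (start, end) in enumerate(llm_spans):
        shared = overlap(embedding_span[0], embedding_span[1], start, end)
        if shared > best_overlap:
            best_index, best_overlap = index, shared
    return best_index
```

An embedding chunk that straddles two LLM chunks is assigned to the one holding the larger share of it; ties go to the earlier chunk.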
Also, could you elaborate on why you cannot generate a document for each chunk? In my opinion, you could download the existing index and then create a new index with separate documents. The only cost is the duplicated storage space, right? |
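Creating separate chunk documents from an existing document, outside of OpenSearch, might be sketched like this. The helper `split_into_chunk_docs` and the fields `origin_id` and `chunk_index` are hypothetical names chosen for illustration; `tns_body` and `bodyChunk4LLM` are taken from this thread. Chunk size is counted in whitespace-separated words, which is an assumption and only approximates LLM tokens.

```python
def split_into_chunk_docs(doc_id, source, body_field="tns_body",
                          chunk_words=200, overlap_words=20):
    """Turn one origin document into N chunk documents that reference it.

    Assumption: fixed-size word windows with a small overlap; the field
    names origin_id and chunk_index are made up for this sketch.
    """
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")

    words = source.get(body_field, "").split()
    step = chunk_words - overlap_words
    docs = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_words]
        docs.append({
            "_id": f"{doc_id}#chunk{index}",
            "origin_id": doc_id,          # reference back to the origin doc
            "chunk_index": index,
            "bodyChunk4LLM": " ".join(window),
        })
        # Stop once this window has reached the end of the text.
        if start + chunk_words >= len(words):
            break
    return docs
```

The resulting dictionaries would still have to be indexed into a second index and kept in sync with the origin index, which is exactly the mirroring effort discussed in this thread.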
Sure. inner_hits can do that, I will check what is the blocker to support inner_hits with hybrid query with the neural-search team today. |
For my current use case, where I implemented a new importer, I do so. But we have several existing document corpora that are indexed/mirrored into OpenSearch. To enable RAG for existing OpenSearch applications - where the implementation of the document import is finished, possibly complicated, and the code maybe not available - it is currently mandatory to write code. Just configuring OpenSearch in a different way, transparently to the import process outside of OpenSearch, is much better and sometimes maybe the only possibility.
This is right of course. Things become complicated if you have mirrored (several) document corpora, where you check for modifications inside the corpus - new, modified or deleted documents - and re-index the delta incrementally. Someone has to implement this mirroring functionality for the index duplication for RAG as well. Technically everything is possible and doable of course. But again, it is a totally different scenario - much more work and cost - compared to being able to just re-configure OpenSearch on top of the existing, unmodified solution. Or in other words: for building new applications the current possibilities are sufficient. For the migration of existing applications to RAG, it would be a benefit if this could be done with the server config. |
I understand. As our first step, I am implementing a prototype of the search response processor. Will ping you when ready. |
Hi @reuschling. I regret to bring you some bad news :( I am running inner_hits with the neural query to search documents according to their chunks. It only returns the highest-scoring chunk in the inner_hits field. This is odd behavior that differs from the BM25 query. The search response processor can therefore return only one chunk per document. Suppose that the chunk relevance scores of two documents are [0.9, 0.8, 0.7] and [0.6, 0.5]. The search response processor can only return [0.9, 0.6]. This would definitely be an unexpected result. I will open an issue in the neural-search repo. Over the next few days, I will take a deeper dive to see if there are any blocking issues. |
Hi, but isn't the index of the chunk returned as well? With this, the neighbor chunks could be determined, couldn't they? Returning all chunks could be a real performance issue, as loading field data generally takes a lot of time. |
You can determine neighbor chunks via offsets, but there is no guarantee that neighbor chunks are relevant to the user query.
We can also have an example [0.9, 0.2, 0.1], [0.6, 0.5]. The expected returned chunks would be [0.9, 0.8] for the first example and [0.9, 0.6] for the second. Unfortunately, without inner hits, we cannot distinguish between them. |
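The difference between the two examples can be reproduced with a small sketch. The function names are hypothetical and the code only models the ranking logic on plain score lists; it does not call OpenSearch.

```python
import heapq


def top_chunks_global(doc_scores, k):
    """Rank all chunks of all documents together (what inner_hits with
    every chunk score would allow)."""
    all_scores = [score for scores in doc_scores for score in scores]
    return heapq.nlargest(k, all_scores)


def top_chunks_best_per_doc(doc_scores, k):
    """Rank only the best chunk of each document (what is possible when
    only the highest-scoring chunk per document is returned)."""
    best = [max(scores) for scores in doc_scores if scores]
    return heapq.nlargest(k, best)


example_1 = [[0.9, 0.8, 0.7], [0.6, 0.5]]
example_2 = [[0.9, 0.2, 0.1], [0.6, 0.5]]

print(top_chunks_global(example_1, 2))        # [0.9, 0.8]
print(top_chunks_global(example_2, 2))        # [0.9, 0.6]
print(top_chunks_best_per_doc(example_1, 2))  # [0.9, 0.6]
print(top_chunks_best_per_doc(example_2, 2))  # [0.9, 0.6]
```

With only the best chunk per document available, both examples collapse to the same result, which is why the two cases cannot be told apart.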
You can check the bug report in neural-search: opensearch-project/k-NN#2113. This is the current blocking issue. |
Latest update: the neural-search issue was transferred to the k-NN repo: opensearch-project/k-NN#2113 |
@reuschling There is an RFC and a draft PR in the neural-search repo, targeted at 2.19. |
I have implemented a hybrid search with corresponding ingest and search pipelines, using text embeddings on document chunks, as the embedding models have input token size limitations.
The ingest pipeline follows https://opensearch.org/docs/latest/search-plugins/text-chunking/
The top results should now be used as input for RAG, I configured a search pipeline for this, following https://opensearch.org/docs/latest/search-plugins/conversational-search/ :
Now I am able to send a search request using this search pipeline to
{{ _.openSearchUrl }}/{{ _.openSearchIndex }}/_search?search_pipeline=hybrid-rag-pipeline
which works.
Now I am running into the issue that the documents in my index are too long for my LLM input. In OpenSearch, context_size and message_size are currently configurable, but when the first document exceeds the input token limit, OpenSearch sends a message to the LLM provider that cannot be processed.
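For orientation, a request body for such a pipeline might be assembled as in the following sketch. The original request is not part of this text, so everything here is an assumption: the helper `build_hybrid_rag_request` and its parameters are hypothetical, the field names are borrowed from the sample document in this thread, and the query shape follows the hybrid, nested neural, and `generative_qa_parameters` structures described in the linked OpenSearch documentation.

```python
import json


def build_hybrid_rag_request(question, model_id, llm_model,
                             context_size=2, message_size=5, k=10):
    """Build a hybrid (term + chunked neural) query with RAG parameters.

    Assumptions: tns_body is the full-text field and the nested field
    below holds the chunk embeddings in a knn sub-field, as in the
    sample document; model_id and llm_model are placeholders.
    """
    nested_path = "embedding_chunked_512_paragraphs_chunks_tns_body"
    return {
        "query": {
            "hybrid": {
                "queries": [
                    # Classical term-based part on the whole document body.
                    {"match": {"tns_body": {"query": question}}},
                    # Neural part on the embedding chunks.
                    {
                        "nested": {
                            "path": nested_path,
                            "score_mode": "max",
                            "query": {
                                "neural": {
                                    f"{nested_path}.knn": {
                                        "query_text": question,
                                        "model_id": model_id,
                                        "k": k,
                                    }
                                }
                            },
                        }
                    },
                ]
            }
        },
        "ext": {
            "generative_qa_parameters": {
                "llm_model": llm_model,
                "llm_question": question,
                "context_size": context_size,
                "message_size": message_size,
            }
        },
    }


body = build_hybrid_rag_request("Where was the chapter written?",
                                "my-embedding-model-id", "my-llm")
print(json.dumps(body, indent=2))
```

Here context_size limits the number of result documents passed to the LLM, not their length, which is why a single long document can still exceed the token limit.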
Two things come to mind now:
Currently big documents are not only silently lost in RAG. Because the whole prompt exceeds the input token limit of the LLM, it is (in my setting at least) accidentally truncated, meaning that the question - which is the last part of the generated prompt - is lost. So the user question will not be answered at all.
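One way to avoid losing the question is to budget the context before the prompt is assembled, so that truncation hits the retrieved text and never the question. The sketch below is a client-side illustration only, not something the RAG processor does: `assemble_prompt` is a hypothetical helper and tokens are approximated by whitespace-separated words instead of the LLM's real tokenizer.

```python
def assemble_prompt(question, contexts, max_tokens,
                    count_tokens=lambda text: len(text.split())):
    """Fill the prompt with contexts until the budget is used, always
    keeping the question as the last part.

    Assumption: whitespace word count approximates the LLM token count.
    """
    budget = max_tokens - count_tokens(question)
    if budget < 0:
        raise ValueError("question alone exceeds the token limit")

    kept = []
    for context in contexts:
        cost = count_tokens(context)
        if cost > budget:
            # Cut this context down to the remaining budget and stop.
            words = context.split()[:budget]
            if words:
                kept.append(" ".join(words))
            break
        kept.append(context)
        budget -= cost

    return "\n\n".join(kept + [question])
```

For example, with a limit of 9 tokens, the contexts "a b c d" and "e f g h", and the question "what is x", the second context is cut to "e f" and the question stays intact at the end of the prompt.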