[PROPOSAL] Neural Search field type #803

Open
asfoorial opened this issue Jun 25, 2024 · 16 comments
Labels: Enhancements, neural-search

Comments

@asfoorial

asfoorial commented Jun 25, 2024

Can we mimic this feature in OpenSearch? See https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text

I know that a lot has been done recently in OpenSearch projects to make things headache-free. I think a neural-search field type in OpenSearch would be an interesting addition. However, it should account for synonyms to avoid any fine-tuning headache.

@asfoorial asfoorial changed the title [PROPOSAL] Neural Search built-in type [PROPOSAL] Neural Search field type Jun 25, 2024
@navneet1v navneet1v added Enhancements Increases software capabilities beyond original client specifications and removed untriaged labels Jul 4, 2024
@navneet1v
Collaborator

@asfoorial on the OpenSearch side we do this via a combination of an ingestion processor and a vector field. As there are multiple use cases for semantic search, including multi-model, this would be an interesting field to have.

But is there any specific reason you are looking for a field rather than what is available today? My main motive here is to understand the advantages of a new field over what OpenSearch currently offers.
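For readers unfamiliar with the current approach mentioned above (an ingest processor plus a vector field), here is a minimal sketch of the two request bodies involved, built as Python dictionaries. The pipeline name, index name, field names, model ID, and dimension are all placeholder assumptions, not values from this thread.

```python
import json

# Placeholders (assumptions): use a real deployed model ID and that model's output dimension.
MODEL_ID = "<your-model-id>"
EMBEDDING_DIMENSION = 768

# Body for: PUT /_ingest/pipeline/nlp-ingest-pipeline
ingest_pipeline = {
    "description": "Embed passage_text into passage_embedding at ingest time",
    "processors": [
        {
            "text_embedding": {
                "model_id": MODEL_ID,
                # source text field -> target vector field
                "field_map": {"passage_text": "passage_embedding"},
            }
        }
    ],
}

# Body for: PUT /my-nlp-index
index_body = {
    "settings": {
        "index.knn": True,
        "default_pipeline": "nlp-ingest-pipeline",
    },
    "mappings": {
        "properties": {
            "passage_text": {"type": "text"},
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": EMBEDDING_DIMENSION,
            },
        }
    },
}

print(json.dumps(ingest_pipeline, indent=2))
print(json.dumps(index_body, indent=2))
```

The user has to keep the pipeline's `field_map`, the index mapping, and the model's dimension consistent by hand, which is the coupling a dedicated field type would hide.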

@asfoorial
Author

The main reason is simplifying the process and keeping the focus on the business. In fact, Elasticsearch had the same reason when they introduced the field.

Another reason is alignment of new features across multiple OpenSearch projects. I have noticed that over the past several releases we get new features in ml-commons and k-NN, but it takes a while until we see their benefits reflected in neural-search. If they become one component (a neural-search field), then that would more or less guarantee that any new feature in ml-commons or k-NN must be reflected in the neural-search field type before release.

@navneet1v
Collaborator

> If they become one component (neural-search field), then that would sort of guarantee that any new feature in ml-common or kNN must be reflected in the neural-search field type before their release.

@asfoorial thanks for providing the details. I would like to know a little more about which features added in ML/k-NN don't make it into Neural. Maybe there is something missing.

But I really like the idea of having a field which can encapsulate the processor information.

@navneet1v
Collaborator

One place where having the field would be useful is nested fields. I find that putting this information in the processor is very painful and not intuitive.

@navneet1v
Collaborator

@minalsha please take a look at this and add your thoughts.

@heemin32
Collaborator

I think this is a good idea as it simplifies the use of neural search significantly. By defining a neural field, all other processes, such as the neural search pipeline, neural ingestion pipeline, KNN index creation, chunking, and more, will be handled behind the scenes.

@heemin32 heemin32 closed this as completed by moving to Backlog(Hot) in Neural Search RoadMap Dec 26, 2024
@heemin32 heemin32 moved this to Backlog(Hot) in Neural Search RoadMap Dec 26, 2024
@navneet1v
Collaborator

@heemin32 any reason for closing this gh issue?

@heemin32 heemin32 reopened this Dec 30, 2024
@heemin32
Collaborator

heemin32 commented Dec 30, 2024

> @heemin32 any reason for closing this gh issue?

@navneet1v I think it was closed automatically when I added it to the Neural Search RoadMap. I have reopened it.

@navneet1v
Collaborator

One case where I feel this field type would be very useful is complex nested fields. Currently, with the TextEmbedding processor, it always feels like we keep finding new cases where the processor does not work. Some GH issues:

  1. [BUG] Fail to generate embedding for ingest document with nested field defined in field map #1042
  2. [BUG] Fail to ingest document with nested list into text_embedding processor #1024
  3. [BUG] Text chunking processor not working with nested documents #895
  4. [BUG] _bulk update request failing when using text chunking processor pipeline #798
  5. [BUG] Incorrect validation logic for map type in xxxProcessor #739
  6. [BUG] error on complex types list type field [category] has empty string, cannot process it #678
  7. IllegalArgumentException when all embedding fields not shown or doing a partial update without embedding fields #73

I believe having a field type would solve this problem: the mapper itself would call the ML Commons inference APIs to convert the text to embeddings. I think we can use the concept of properties in the mapper so that a single neural field handles both text and vectors.

cc: @minalsha , @heemin32 , @vibrantvarun , @martin-gaievski
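To make the "properties in the mapper" idea above concrete, here is a purely hypothetical sketch of how one neural field could expand into a text sub-field plus a `knn_vector` sub-field, including inside a nested object. The function name `expand_neural_field`, the sub-field names `text` and `embedding`, and the example mapping are illustrative assumptions; nothing here describes an existing OpenSearch API.

```python
from typing import Any, Dict


def expand_neural_field(dimension: int) -> Dict[str, Any]:
    """Hypothetical expansion of one neural field into text + vector sub-fields."""
    return {
        "properties": {
            # Original text, searchable with ordinary text queries.
            "text": {"type": "text"},
            # Embedding of that text; the dimension would come from the model.
            "embedding": {"type": "knn_vector", "dimension": dimension},
        }
    }


# Example: a neural field declared inside a nested "comments" object.
# The mapper would own the expansion, so no processor field_map has to
# describe the nested path.
mapping = {
    "properties": {
        "title": {"type": "text"},
        "comments": {
            "type": "nested",
            "properties": {
                "author": {"type": "keyword"},
                "body": expand_neural_field(dimension=384),
            },
        },
    }
}

vector_mapping = mapping["properties"]["comments"]["properties"]["body"]["properties"]["embedding"]
print(vector_mapping)
```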

@YeonghyeonKO

YeonghyeonKO commented Dec 30, 2024

This will also reduce the number of inference requests when multiple fields have to be embedded.

> Inference requests in semantic_text fields are also batched. If you have 10 documents in a bulk API request, and each document contains 2 semantic_text fields, then that request will perform a single inference request with 20 texts to your inference service in one go, instead of making 10 separate inference requests of 2 texts each.

(Source: https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text)
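The batching described in the quote can be sketched as follows: collect every text from every document in the bulk request, call the inference service once, and scatter the vectors back. This is an illustration only; `embed_bulk` and `fake_infer` are hypothetical names, and `fake_infer` stands in for a real inference service.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Embedding = List[float]


def embed_bulk(
    docs: Sequence[Dict[str, str]],
    semantic_fields: Sequence[str],
    infer: Callable[[List[str]], List[Embedding]],
) -> List[Dict[str, Embedding]]:
    """Embed all semantic fields of all documents with a single inference call."""
    # Record where each text came from so the results can be scattered back.
    slots: List[Tuple[int, str]] = []
    texts: List[str] = []
    for doc_index, doc in enumerate(docs):
        for field in semantic_fields:
            if field in doc:
                slots.append((doc_index, field))
                texts.append(doc[field])

    vectors = infer(texts) if texts else []
    if len(vectors) != len(texts):
        raise ValueError("inference returned a different number of vectors than texts")

    results: List[Dict[str, Embedding]] = [{} for _ in docs]
    for (doc_index, field), vector in zip(slots, vectors):
        results[doc_index][field] = vector
    return results


# Stand-in inference function that records the size of each batch it receives.
call_sizes: List[int] = []


def fake_infer(texts: List[str]) -> List[Embedding]:
    call_sizes.append(len(texts))
    return [[float(len(text))] for text in texts]


docs = [{"title": f"title {i}", "body": f"body text {i}"} for i in range(10)]
embedded = embed_bulk(docs, ["title", "body"], fake_infer)
print(call_sizes)  # one call carrying 20 texts: [20]
```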

@bzhangam
Contributor

I'll work on this item.

@YeonghyeonKO

@bzhangam, is there room for consideration to include a minor feature? (See: opensearch-project/k-NN#2356)

Either

  • Give a warning message about a mismatch between the embedding model's original similarity function and the space_type of the index

or

  • Suggest or fix the space_type when defining mappings for an index, according to the embedding model that the neural_search field type will use.
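Both options above could be sketched with one small helper. The similarity names on the left of the table are hypothetical model metadata values (an assumption); the values on the right are k-NN space types. With no requested `space_type` the helper suggests one, and on a mismatch it warns.

```python
import warnings
from typing import Optional

# Assumed model similarity names (hypothetical metadata) -> k-NN space_type values.
SIMILARITY_TO_SPACE_TYPE = {
    "cosine": "cosinesimil",
    "dot_product": "innerproduct",
    "euclidean": "l2",
}


def resolve_space_type(model_similarity: str, requested: Optional[str] = None) -> str:
    """Suggest a space_type for the model, or warn if the requested one differs."""
    expected = SIMILARITY_TO_SPACE_TYPE.get(model_similarity)
    if expected is None:
        raise ValueError(f"unknown model similarity: {model_similarity!r}")
    if requested is None:
        # Option 2: fill in the space_type that matches the model.
        return expected
    if requested != expected:
        # Option 1: keep the user's choice but warn about the mismatch.
        warnings.warn(
            f"space_type {requested!r} does not match the model's similarity "
            f"{model_similarity!r}; expected {expected!r}"
        )
    return requested


print(resolve_space_type("cosine"))  # cosinesimil
```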

@heemin32
Collaborator

heemin32 commented Jan 1, 2025

@YeonghyeonKO, the space_type will be automatically retrieved from the model metadata, so users won't need to specify it explicitly.

@YeonghyeonKO

@heemin32
If so, users who aren't familiar with vector spaces can easily turn text fields into knn_vector fields. Thanks for initiating this proposal, @asfoorial.

@dblock dblock removed the untriaged label Jan 6, 2025
@dblock
Member

dblock commented Jan 6, 2025

[Catch All Triage - 1, 2, 3, 4]

@mingshl

mingshl commented Jan 9, 2025

I had a similar idea earlier when I heard about a use case that wanted to rewrite a match query into a neural search query.

Think about this:

  • The user configures the mapping so that a field text is defined as a neural search field, along with a model ID and, optionally, a text chunk size and model config.
  • When a document is ingested, the text field is ingested as usual. Text chunking is applied automatically if needed, and an ML inference processor or text embedding processor is called internally to generate an embedding field called text_embedding, which holds an array of embeddings.
  • When the user runs a match query with query text foo against the text_embedding field, the query text foo can be replaced by its embedding in a knn query, or the whole match query can be rewritten to a neural search query.

This can simplify the neural search experience. But again, we will have to consider how to handle different model input and output formats. Of course, we can use pre- and post-processing functions through connectors. But what if we could make it easier?
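The match-to-neural rewrite from the last bullet could look roughly like this. It is a sketch under assumptions: `rewrite_match_to_neural` and the `neural_fields` registry (vector field name to model ID) are hypothetical, and the default `k` of 10 is arbitrary. The output uses the `neural` query shape with `query_text`, `model_id`, and `k`.

```python
from typing import Any, Dict


def rewrite_match_to_neural(
    query: Dict[str, Any],
    neural_fields: Dict[str, str],
    k: int = 10,
) -> Dict[str, Any]:
    """Rewrite a single-field match query on a neural field into a neural query.

    neural_fields maps a vector field name to the model ID configured for it.
    Any other query is returned unchanged.
    """
    match = query.get("match")
    if not isinstance(match, dict) or len(match) != 1:
        return query
    field, value = next(iter(match.items()))
    if field not in neural_fields:
        return query
    # A match query accepts either a bare string or {"query": "..."}.
    query_text = value["query"] if isinstance(value, dict) else value
    return {
        "neural": {
            field: {
                "query_text": query_text,
                "model_id": neural_fields[field],
                "k": k,
            }
        }
    }


registry = {"text_embedding": "<your-model-id>"}  # placeholder model ID
rewritten = rewrite_match_to_neural({"match": {"text_embedding": "foo"}}, registry)
untouched = rewrite_match_to_neural({"match": {"title": "foo"}}, registry)
print(rewritten)
```

Queries on fields that are not registered as neural pass through unchanged, so ordinary match queries keep their current behavior.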

Projects
Status: Backlog(Hot)
Development

No branches or pull requests

8 participants