[RFC] Support Facebook's faiss library as another Approximate k-NN engine #70
Update on Proposed APIs
Refactored API design to center around the model resource. First draft can be found here. Second draft can be found here.
For faiss, we will introduce additional functionality to add support for faiss indices that require training. With this change, we introduce a new resource: models. A model is an empty, trained native library index that can be used to initialize another native library index during ingestion. A model will be stored as a document in the model system index, which has the following mapping:
state: Model state. Can be CREATED, FAILED, or TRAINING.
created_timestamp: Time at which the model was created.
description: Model description a user can provide to add additional details about a model.
error: Message provided to the user to communicate why the model is in a failed state.
model_blob: Base64 encoded representation of the model.
engine: Engine this model was created by.
space_type: Space this model was built with.
dimension: Dimension this model supports.
Get
model_id: [Required] Specify which model to return information for. If not specified, all model information will be returned.
filter_field: Fields to include. If not specified, all fields are returned.
Delete
model_id: [Required] Model to delete.
Upload
description: [Optional] Model description a user can provide to add additional details about a model.
model_blob: Base64 encoded representation of the model.
engine: Engine this model was created by.
space_type: Space this model was built with.
dimension: Dimension this model supports.
Train
node_id: User's preference for the node to execute training.
train_index: OpenSearch index from which to pull the training data.
train_field: Field of train_index from which to pull training data.
dimension: Dimension the model should be built for.
method: Method definition to produce the model.
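To make the model document fields listed above concrete, here is a minimal Python sketch of how such a document might be assembled. The class and function names (ModelState, ModelDocument, new_model_document) are hypothetical and not part of the plugin; the example values ("faiss", "l2", 128) and the millisecond timestamp are assumptions for illustration only.

```python
import base64
import time
from dataclasses import dataclass, asdict
from enum import Enum


class ModelState(str, Enum):
    # The three states named in the proposal.
    CREATED = "CREATED"
    FAILED = "FAILED"
    TRAINING = "TRAINING"


@dataclass
class ModelDocument:
    # Hypothetical in-memory mirror of the proposed model document fields.
    state: ModelState
    created_timestamp: int
    description: str
    error: str
    model_blob: str
    engine: str
    space_type: str
    dimension: int

    def to_source(self) -> dict:
        """Return a plain dict suitable for storing as a JSON document."""
        doc = asdict(self)
        doc["state"] = self.state.value
        return doc


def new_model_document(raw_model: bytes, engine: str, space_type: str,
                       dimension: int, description: str = "") -> ModelDocument:
    """Wrap serialized model bytes in a document, Base64 encoding the blob."""
    return ModelDocument(
        state=ModelState.CREATED,
        created_timestamp=int(time.time() * 1000),  # assumption: epoch millis
        description=description,
        error="",
        model_blob=base64.b64encode(raw_model).decode("ascii"),
        engine=engine,
        space_type=space_type,
        dimension=dimension,
    )


# Placeholder bytes stand in for a serialized native library index.
doc = new_model_document(b"\x00\x01fake-index-bytes", "faiss", "l2", 128)
source = doc.to_source()
```

Keeping the blob as Base64 text lets the model travel inside an ordinary JSON document, at the cost of roughly a third more storage than the raw bytes.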
I have a few questions and suggestions for discussion.
a. whether a model resource belongs to a node resource |
a. can a job be updated once created? |
a. the results from get might be changed to be consistent with that from put, i.e.,
a. the response can be only an ack, the same as in delete, i.e.
Thanks for the feedback @wnbts. Let me address your comments one by one:
Right,
No, node is not necessary, but it is optional.
Discussed above. They are not the same thing.
No it cannot. I think I got this backwards. I will switch to POST.
Good point, will update.
I see, I will update. Thanks for the suggestion. |
I agree. The data modeling would be clearer and more natural. Training jobs are an intuitive resource. The training job request can contain a model id, or a newly generated model id can be returned in the response. A separate question concerns the relationship between node and job: if a job is created with a node resource, is the job bound to the node? For example, if a job is /node-a/train-jobs/job-a, what would a get of /node-b/train-jobs/job-a return?
In this case, no results would be returned. A job is bound to a node; a model, however, is not.
@wnbts I decided to update the APIs to center around the model resource. I felt that having separate model and train job resources did not make sense. Please take a look at the update when you get time.
The new version also makes sense to me! I have a few details to raise for discussion.
for searching model resources, if needed
Right, in the mapping, it is implicitly defined as the document id. That being said, I think in the responses, it makes sense to contain the id. I will update.
Might have been misinterpreted. filter_field is meant to filter the items returned in the body. This is similar to how GET calls work: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters. That being said, GET calls take a "source_includes" param. I could refactor to this.
Preference also follows opensearch convention: https://opensearch.org/docs/opensearch/rest-api/document-apis/get- I thought |
Yeah, I misinterpreted the api. The input looks good then. The output can just be the same body used in PUT/POST.
I see, preference is a convention.
The difference is subtle. I see that the request is made to id/_train, not to id, and is therefore not a standard PUT. Or, a request to id/a/b would be PUT rather than POST. So using POST here can make the train API simpler for users without getting into those differences.
I see. I will update to POST. I will also add a _search API to get multiple models.
Signed-off-by: John Mazanec <[email protected]>
Overview
Over the past year, we have received a lot of interest in supporting Facebook’s MIT-licensed faiss library as another Approximate k-Nearest Neighbor (ANN) engine, in addition to nmslib. faiss offers a diverse set of algorithms that allow users to easily make tradeoffs between indexing latency, memory usage, query latency, and recall to fit their ANN workload requirements.
Earlier this year, a member of the community made a contribution in Open Distro for Elasticsearch for initial faiss support. This contribution integrated the faiss library into the plugin and added support for faiss’s implementation of Hierarchical Navigable Small World (HNSW) graphs. We are building on top of this contribution to support additional faiss features like vector quantizers and other ANN search methods.
Supporting faiss will allow users to choose from different ANN search methods and algorithms that are not available in nmslib. In particular, we are very interested in supporting faiss’s quantization methods that can reduce the amount of memory an ANN index requires.
Additionally, while this project focuses on integrating faiss, it should also refactor the plugin so that we can support additional ANN libraries and their methods in the future.
We are developing on the feature/faiss-support branch and are planning to merge to main once all requirements have been met.
Problem Statement
Because the k-NN plugin supports only one ANN engine and method, users have limited room to customize the plugin for their ANN workload.
Specifically, one problem k-NN plugin users face is that the plugin can consume a significant amount of memory. Currently, the plugin is built on top of nmslib’s implementation of HNSW. HNSW is a fast and fairly accurate ANN method. Still, for some workloads, the HNSW algorithm’s memory consumption can be an issue. From our documentation, each vector will consume approximately
1.1 * (4 * dimension + 8 * M)
bytes. faiss implements several algorithms that can provide ANN search using much less memory, at the cost of additional compute during training. By supporting faiss, we can let users make memory-based tradeoffs in order to achieve the solution that they want.
Requirements
In the initial phase, the requirements are:
In the future, we may consider supporting:
Proposed Solution
In order to support faiss in the k-NN plugin, we need to:
Training Support
Several faiss features, such as IVF and PQ, require a training step before indexing can begin. Training takes a set of training vectors and creates a model that these features need in order to operate.
From the plugin perspective, there are two approaches to support training: (1) Train a new model during segment creation with a subset of the segment’s index data and (2) Train a model before indexing can begin and use it to initialize the ANN library index during segment creation.
While the approaches are not mutually exclusive, initially we will only support Approach 2.
Approach 1 is easier to implement, but it significantly increases indexing latency. Every time a new segment is created, a new model needs to be trained. Additionally, because the model is trained with a subset of the segment’s data, it is difficult to guarantee the quantity and quality of the training data.
Approach 2 requires us to add APIs and OpenSearch utilities so that a user can train a model and connect it to an OpenSearch k-NN index. However, it speeds up indexing and gives the user more control over the model produced. It is also the approach recommended in the faiss documentation.
Model System Index
In order to persist faiss trained models and their metadata, we need to create a model system index.
During segment creation, a GET call is made to retrieve a model’s binary representation. The model is then used in the JNI layer to initialize the ANN library index. Once initialized, the vectors for the given segment are indexed into the ANN library index. After this completes, the ANN library index file is written to the OpenSearch index’s segment.
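The segment creation flow described above can be sketched in plain Python. This is a toy stand-in, not the plugin's JNI code or the faiss API: the "model" here is assumed to be a Base64 encoded JSON list of centroids (loosely imitating an IVF coarse quantizer), and decode_model and build_segment_index are hypothetical names.

```python
import base64
import json
from typing import Dict, List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def decode_model(model_blob: str) -> List[List[float]]:
    """Decode the (toy) model: Base64 text wrapping JSON centroids."""
    return json.loads(base64.b64decode(model_blob))["centroids"]


def build_segment_index(model_blob: str,
                        segment_vectors: Dict[int, Sequence[float]]) -> bytes:
    """Initialize an index from the model, add the segment's vectors,
    and return the bytes that would be written to the segment."""
    centroids = decode_model(model_blob)
    # One inverted list of doc ids per centroid; no training happens here.
    inverted_lists: List[List[int]] = [[] for _ in centroids]
    for doc_id, vec in segment_vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_l2(vec, centroids[i]))
        inverted_lists[nearest].append(doc_id)
    index = {"centroids": centroids, "lists": inverted_lists}
    return json.dumps(index).encode("utf-8")


# A pre-trained toy model with two centroids, as it might come back from a GET.
model_blob = base64.b64encode(
    json.dumps({"centroids": [[0.0, 0.0], [10.0, 10.0]]}).encode("utf-8")
).decode("ascii")

segment = {1: [0.5, 0.2], 2: [9.0, 11.0], 3: [1.0, -1.0]}
index_file = build_segment_index(model_blob, segment)
```

Because the centroids come from the stored model, every segment is built against the same trained structure and no per-segment training cost is paid, which is the motivation given for Approach 2.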
Train API
To support Approach 2, we need to let users train a model in their OpenSearch cluster. To do this, we need to add a train API:
This API triggers a training workflow that reads a training set of vectors from another OpenSearch index, creates and trains an ANN library model and then serializes it into the model system index.
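A rough sketch of that training workflow, including the TRAINING, CREATED, and FAILED states, might look like the following. Everything here is an assumption for illustration: the model system index is modeled as an in-memory dict, the training documents as a list of dicts, and toy_train (which just averages the vectors into one centroid) stands in for the native library's training routine.

```python
import base64
import json
import time
from typing import Dict, List


def toy_train(vectors: List[List[float]]) -> bytes:
    """Stand-in for native training: a single centroid at the mean vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return json.dumps({"centroids": [mean]}).encode("utf-8")


def train_model(model_index: Dict[str, dict], model_id: str,
                train_docs: List[dict], train_field: str,
                dimension: int) -> dict:
    """Read training vectors, train, and store the serialized model."""
    # Record the model up front so its state is visible during training.
    model_index[model_id] = {
        "state": "TRAINING",
        "created_timestamp": int(time.time() * 1000),
        "dimension": dimension,
        "error": "",
        "model_blob": "",
    }
    try:
        vectors = [d[train_field] for d in train_docs if train_field in d]
        if not vectors:
            raise ValueError("no training vectors found in field %r" % train_field)
        if any(len(v) != dimension for v in vectors):
            raise ValueError("training vector dimension does not match %d" % dimension)
        blob = toy_train(vectors)
        model_index[model_id].update(
            state="CREATED",
            model_blob=base64.b64encode(blob).decode("ascii"),
        )
    except Exception as exc:
        # Surface the reason to the user through the error field.
        model_index[model_id].update(state="FAILED", error=str(exc))
    return model_index[model_id]


models: Dict[str, dict] = {}
docs = [{"vec": [1.0, 3.0]}, {"vec": [3.0, 5.0]}]
ok = train_model(models, "model-a", docs, "vec", 2)
bad = train_model(models, "model-b", docs, "vec", 3)
```

Writing the TRAINING document before work starts means a Get on the model id can report progress or failure even while the job is still running.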
Upload API
One potential issue with training is that it can be very resource intensive, which could negatively impact an OpenSearch cluster that is processing a heavy workload. So, to unblock users who want to use models that require resource intensive training, we need to also provide an upload API:
This API triggers a workflow that validates the uploaded model and then serializes it to the model system index.
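The proposal does not spell out what validation involves, so the checks below are only a guess at the kind of request-level validation an upload might perform before the model is stored. The function name validate_upload, the set of known engines, and the choice of required fields are all assumptions; a real implementation would presumably also try to load the blob in the native library.

```python
import base64
import binascii
from typing import List

# Assumption: the two engines discussed in this RFC.
KNOWN_ENGINES = {"nmslib", "faiss"}


def validate_upload(body: dict) -> List[str]:
    """Return a list of problems with an upload request body (empty if none)."""
    errors: List[str] = []
    for field in ("model_blob", "engine", "space_type", "dimension"):
        if field not in body:
            errors.append("missing field: " + field)
    if errors:
        return errors

    if body["engine"] not in KNOWN_ENGINES:
        errors.append("unknown engine: " + str(body["engine"]))

    dim = body["dimension"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(dim, bool) or not isinstance(dim, int) or dim <= 0:
        errors.append("dimension must be a positive integer")

    try:
        raw = base64.b64decode(body["model_blob"], validate=True)
        if not raw:
            errors.append("model_blob is empty")
    except (binascii.Error, ValueError, TypeError):
        errors.append("model_blob is not valid base64")
    return errors


good_request = {
    "description": "uploaded IVF model",
    "model_blob": base64.b64encode(b"serialized-index").decode("ascii"),
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128,
}
bad_request = dict(good_request, model_blob="not base64!!", dimension=0)

good_errors = validate_upload(good_request)
bad_errors = validate_upload(bad_request)
```

Collecting all problems instead of failing on the first one lets the API return a single response that tells the user everything to fix.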
knn_vector Field Enhancements
In order for a user to configure an index to use faiss, we need to enhance our knn_vector field type. Currently, a user creates an index with the following mapping:
To support faiss indices that do not require training, we need to add an additional engine. This looks like:
For indices that require training, a user needs to have already trained/uploaded the model to the model index. Once they have done this, they can create an ANN OpenSearch index with the following mapping:
Feedback
We are interested in any and all feedback you may have. Please do not hesitate to comment!
Specifically, however, we are interested in: