Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Pagination in Hybrid query #933

Open
vibrantvarun opened this issue Oct 15, 2024 · 8 comments
Open

[RFC] Pagination in Hybrid query #933

vibrantvarun opened this issue Oct 15, 2024 · 8 comments

Comments

@vibrantvarun
Copy link
Member

vibrantvarun commented Oct 15, 2024

Introduction

This document talks about the design to support pagination functionality in hybrid query. OpenSearch supports multiple pagination techniques, and this document primarily focuses on two key techniques: the from and size parameters technique and the scroll operation technique.

Problem Statement

Pagination is an ability to divide a large set of search results into smaller subsets, known as pages, and retrieving them one at a time. Currently, hybrid query users cannot navigate in the result set as per their requirements.

Terminologies and key concepts

Here we define the key terms and concepts that are used throughout this document to ensure clarity and consistency in understanding the design approach later.

The from and size parameters

The from parameter is the document number from which you want to start showing the results. The size parameter is the number of results that you want to show. Together, they let you return a subset of the search results.

Scroll operation

It allows users to paginate over large set of data in a single search request. Think of it like a live cursor on a database.
If you need to request volumes of data larger than 1 PB from, for example, a machine learning job, use the scroll operation instead. The scroll operation allows you to request an unlimited number of results.

The important thing to note about the scroll operation is that scoring of the result set is not performed. It deals with large sets of data, and scoring such a large amount of data will lead to system outage.

Understanding search workflow in context of from and size parameters

The search process in OpenSearch for traditional queries executes primarily in two phases: Query Phase and Fetch Phase. The Query Phase is responsible for retrieving the query results from the shard as objects containing docId and score. The maximum number of query results collected during the Query Phase is determined by the sum of from and size parameters provided in the search request. The Fetch Phase is responsible for firstly trimming the topN query results obtained from the Query Phase where topN is equal to the value of from parameter. The source of the remaining query results are fetched from the shards. By changing the values of from and size params, users can navigate to another page of results.

Current Architecture

Hybrid query workflow

In the current hybrid query workflow, each sub-query executes the search process and returns the from + size equivalent number of results to the co-ordinator node. In the next step, normalization processor combines the sub-query result on each shard by using normalization and score-combination techniques. Later, the Fetch phase will combine all results from the shards and will return the search response. The number of hits present in the search response will be equal to the size parameter sent in the search request. In the entire hybrid query search process, the default value of from is 0.
However, when customized from and size values are sent to retrieve another page of search results, it fails to do so. This failure occurs due to change in ground truth of hybrid query results.
Copy of HLD_PAGINATION (1) (1)

Why ground truth changes when from > 0

Increasing from value will lead to increase in number of results retrieved for each subquery from a shard. Instead of the new results being added to the last page, they tend to appear earlier in the list, which changes the ground truth. This can happen in two scenarios:

  1. The subquery with higher weightage will rank its newly added results above others after the score combination. This means the new entries will appear earlier in the final result set, either on the first few pages or in the middle, instead of being added at the end.
  2. The newly added result can become a duplicate (check appendix section below). The combined score of a duplicate entry after score combination process will be higher thereby increasing the relevance. Here we assumed that the subqueries containing the duplicate result have equal weightage during the score combination process.

Requirements

Functional Requirements

  • The system must divide large query result sets into manageable pages to ensure efficient data retrieval.
  • Each page of results should consistently reflect the same underlying data, even when new data is added or updated, to avoid discrepancies when users navigate between pages.
  • Users should be able to navigate through pages by changing from parameter in the search request.
  • If the hybrid query returns duplicates from different queries, pagination should manage this to avoid showing the same result multiple times across different pages.
  • The system should allow dynamic page size configuration, enabling users to choose how many results are shown on each page.

Non Functional Requirements

  • The system should ensure quick response times for retrieving and rendering each page of results, even when dealing with large datasets from multiple subqueries.

Solution Overview

The proposal is to introduce a new parameter in the hybrid query clause, which will provide information on number of individual sub-query results to be retrieved from the shard. This new parameter will ensure that specified number of subquery results are retrieved, even when from and size values are changed. This process helps in maintaining the consistency by not adding new result entries when from > 0 is sent in the search request. Once the subquery results are retrieved, the rest of the hybrid query process remains unchanged. The from value will navigate to the requested page in the final result, and the size parameter will determine how many search results should be returned in the response.

Why this new parameter can only be added in the hybrid query clause?

Following places have been considered to add this new parameter

Sending it in the URL of the search request

It requires enhancement in SearchSourceBuilder of OpenSearch core to inculcate the changes related to the new parameter. Moreover, there is no use of this new parameter for other traditional query types as they are already supporting pagination functionality. This scope of this change is limited to hybrid query logic, therefore this option is not recommended.

Cluster settings or Index settings

For every hybrid query request, an additional call will be made to the OpenSearch core to get index or cluster settings. This will create overload and resource exhaustion in the cluster. Moreover, customers need to change the setting every time when they want to paginate on higher number of results. Additionally, settings change is captured by event listeners that triggers workflows for other important functionalities in OpenSearch. Therefore, this option is not recommended.

Sending it in the hybrid query clause [RECOMMENDED]

Adding in the hybrid query clause is recommended because the changes done by the new parameter are specific to hybrid query and do not impact other functionalities of OpenSearch. It helps customers to understand the results formation better by explicitly defining how many subquery results they want to process under hybrid query logic.

Changes in customer experience

The from and size parameters technique

From the customer’s perspective, the new parameter will act as an indicator on how deeply they want to apply hybridization of subquery results and support pagination on top of it.

The following example visualizes the addition of new parameter in the hybrid query clause. The query below retrieves up to 100 results of each subquery and applies normalization process on them. Then it returns the search response that contains 10 hits. The first hit in the response is the 5th entry after combining all the subquery results.

http://localhost:9200/my-nlp-index/_search?size=10&search_pipeline=nlp-search-pipeline

{
 "from":5,
 "query":{
   "hybrid":{
     (Name of the parameter will be finalized in the design doc later)
     "pagination_depth": 100, (new parameter where customer define
                               the depth of hybridization of subquery results they 
                               want OpenSearch take the depth as a reference 
                               and apply pagination on top of it)
     "queries":[
        "match":{
          "name":"foo"
        }
     ] 
   }
 }
}

The new parameter will be referenced as “pagination_depth” in the rest of document for explanatory purposes. There is open question below for the appropriate name suggestion.

Note: The default value of the pagination_depth will be 0. If the customer doesn’t want to apply pagination and just wants to perform the standard hybrid search, then they do not have to provide the depth information. In that case, the hybrid query will use the current formula of from + size for retrieving the shard results.

The scroll operation

Pagination using scroll api cannot be performed in hybrid query process. The reason being hybrid query needs scoring of documents and scroll api does not support it. The main use case of scroll operation is when users need to reindex the data to a new index. This use case can be easily achieved by any other traditional query. Therefore, an illegal argument exception will be thrown when user tries to apply scroll operation with hybrid query.

High level design

(red arrows == current behavior, green arrows == new flow)
HLD_PAGINATION (1)

Low level design

The interfaces in the following classes will be updated to inculcate the addition of new parameter.
LLD _PAGINATION (1)
As there are no component changes in the current hybrid query workflow, therefore flow diagram is not added.

Testability

Backward Compatibility

The changes made to inculcate the new parameter addition are backward compatible. If pagination_depth is not provided in the search request then the current formula of from+size will be used for the subquery results retrieval.

Unit and Integration Tests

Both the test will be added in the PR.

Open questions

Q) Possible names of the new parameter to be added?

Add the suggested names in the following list.

  1. pagination_depth
  2. subquery_results_retrieval_size
  3. pagination_size

Appendix

What is Pagination?

It is a process to divide a large set of search results into smaller subsets, known as pages, and retrieving them one at a time.
There are four different type of pagination techniques supported by OpenSearch:

  1. The from and size parameters : The from parameter is the document number from which you want to start showing the results. The size parameter is the number of results that you want to show. Together, they let you return a subset of the search results.
  2. scroll search : It allows users to paginate over large set of data in a single search request. Think of it like a live cursor on a database.
  3. search after : The search_after parameter provides a live cursor that uses the previous page’s results to obtain the next page’s results. It is similar to the scroll operation in that it is meant to scroll many queries in parallel. You can use search_after only when sorting is applied.
  4. Point in Time with search_after: The search_after search results are not frozen in time, so they may be inconsistent because of concurrent document indexing or deletion. By creating a point in time reference, users can freeze the search results to guarantee the consistency.

To know more about pagination and its techniques, please read the OpenSearch documentation on it.

Duplicates

If a result entry is part of another subquery result set, it is considered a duplicate result.
For example, consider a scenario where a user sends a hybrid query request. The hybrid query contains two subqueries, Subquery A and Subquery B. If the result entry of Subquery A is also part of Subquery B’s results then it is considered a duplicate result.

Unique results

If a result entry is part of only one subquery, it is considered as unique result.

@navneet1v
Copy link
Collaborator

Quite detailed RFC. Waiting for performance benchmarks with new attribute for pagination.

@brandon-carag
Copy link

These are exciting developments--also looking forward to seeing the benchmarks.

@minalsha minalsha removed Roadmap:Vector Database/GenAI Project-wide roadmap label v2.19.0 labels Nov 4, 2024
@brianf-aws
Copy link
Contributor

This is an amazing PR description!

@martin-gaievski
Copy link
Member

please add more detailed description of pagination_depth parameter, for instance need clearly state that it defines the size of the pagination window in number of documents or hits, not in number of pages.
Another question - can we go above the 10_000 limit with pagination_depth value?

@martin-gaievski
Copy link
Member

Is there a documentation request/PR for this feature?

@martin-gaievski
Copy link
Member

haven't found one, feel free to close it if it's a duplicate: opensearch-project/documentation-website#8952

@safakkbilici
Copy link

safakkbilici commented Jan 6, 2025

We would like to understand more about pagination_depth. How does it impact performance? Does setting pagination_depth to a very high value lead to latency issues similar to fetching all hits? or just only introduce latency from score normalization proportional to its size?

@vibrantvarun
Copy link
Member Author

vibrantvarun commented Jan 6, 2025

@safakkbilici logically the number of hits fetched from each shard will be equal to pagination_depth. Consider if 3 shards are there and each shard has equal distribution of data and pagination_depth is 10. Then total 3 x 10=30 hits will be catered at coordinator node. Now if you increase the pagination_depth to insanely higher number then it will increase the latency. To justify this, for deeper pagination performance will be hit for sure. But for top pages the performance will be almost same as of normal hybrid query. I will be doing a benchmarking soon to get a rough idea till what extent of pagination the performance stays the same. I will be writing a blog on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Status: 2.19.0
Development

No branches or pull requests

8 participants