
Invalid argument: max value for start_offset is 10_000, but got 20000 #5637

Open
tchaton opened this issue Jan 17, 2025 · 2 comments
Labels
bug Something isn't working

Comments


tchaton commented Jan 17, 2025

Describe the bug

Steps to reproduce (if applicable)
Steps to reproduce the behavior:

  1. Ingest a dataset
  2. Find a query with more than 20k responses
import requests
from time import time

QUERY = "paris"
MAX_HITS = 10_000  # per-request page size

t0 = time()
responses = []

session = requests.Session()

# First page: no start_offset
response = session.post(
    "http://localhost:7280/api/v1/fineweb/search",
    json={"query": QUERY, "max_hits": MAX_HITS},
)
data = response.json()
NUM_HITS = data["num_hits"]
responses.extend(data["hits"])
print(len(responses))

# Paginate with start_offset; this fails with
# "max value for start_offset is 10_000" once the offset exceeds 10k
while len(responses) != NUM_HITS:
    response = session.post(
        "http://localhost:7280/api/v1/fineweb/search",
        json={"query": QUERY, "max_hits": MAX_HITS, "start_offset": len(responses)},
    )
    data = response.json()
    if "hits" not in data:
        raise Exception(data)
    responses.extend(data["hits"])
    print(len(responses))

print(len(responses))
print(time() - t0)

I am trying to search through fineweb and collect all the matches. However, this doesn't seem to be possible, as start_offset is capped at 10k.

Expected behavior

I want an easy way to collect all the matches. Even better, I just want their ids.

Configuration:
Please provide:

  1. Output of quickwit --version
  2. The index_config.yaml
@tchaton tchaton added the bug Something isn't working label Jan 17, 2025
@trinity-1686a
Contributor

Doing increasingly deep pagination based only on a start_offset isn't very efficient: to fetch 10k docs with a start_offset of 100k, you'd need to find the best 110k results and drop the first 100k. For that reason, we don't support deep pagination that way.
Currently there isn't an alternative on the Quickwit API. If you don't mind using the ES-compatible API instead, we support both search_after and scroll, which don't suffer from that performance degradation (at least on Quickwit; scrolls are deprecated on ES).
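For reference, the search_after approach described above could look roughly like the following. This is a minimal sketch, not a confirmed Quickwit recipe: it assumes the ES-compatible endpoint at /api/v1/_elastic/<index>/_search, a hypothetical sortable "timestamp" field, and a hypothetical "text" field for the match query; each hit is assumed to carry a "sort" array as in Elasticsearch. The HTTP call is injected as a callable so the pagination logic itself is testable offline.

```python
def collect_all_hits(post_json, page_size=10_000):
    """Fetch every hit by feeding the last hit's sort values back as search_after.

    `post_json` is any callable taking a request body (dict) and returning the
    parsed JSON response. Against a live server it could wrap requests, e.g.:
    lambda body: session.post(
        "http://localhost:7280/api/v1/_elastic/fineweb/_search", json=body
    ).json()
    """
    hits = []
    search_after = None
    while True:
        body = {
            # "text" is an assumed field name for illustration
            "query": {"match": {"text": "paris"}},
            "size": page_size,
            # search_after requires a deterministic sort order;
            # "timestamp" is a hypothetical sortable field
            "sort": [{"timestamp": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        page = post_json(body)["hits"]["hits"]
        if not page:
            return hits
        hits.extend(page)
        # resume after the last hit of this page on the next request
        search_after = page[-1]["sort"]
```

Unlike start_offset pagination, each request only has to find the next page_size results past the cursor, so the cost per page stays flat no matter how deep you go.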

@tchaton
Author

tchaton commented Jan 22, 2025

Oh interesting, it wasn't documented, or at least I didn't find it in the docs, nor in the Swagger UI.
