CRAG
4,409 question-answer pairs: 2,425 Web Questions and 1,984 KG Questions
For each Web Question, 50 HTML pages retrieved via the Brave Search API and stored (220,000 web pages in total)
five domains: Finance, Sports, Music, Movie, and Open domain
seven question types: Conditions, Comparison, Aggregation, Multi-hop, Set, Post-processing-heavy, and False-premise questions
English only
3 tasks:
Task 1: Retrieval Summarization. 5 web pages per question, likely but not guaranteed to be relevant to the question (see the sketch after this list)
Task 2: KG and Web Retrieval Augmentation via mock APIs
Task 3: both web search results (50 web pages instead of 5) and mock APIs
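As a rough illustration of how Task 1 data could be consumed, here is a minimal Python sketch; the field names (query, search_results, page_result, answer) and the JSONL layout are my assumptions, not the verified CRAG schema.

```python
# Minimal sketch of a Task 1 (Retrieval Summarization) loop over CRAG-style records.
# Field names ("query", "search_results", "page_result", "answer") are assumptions
# about the JSONL layout, not the verified schema.
import json
from bs4 import BeautifulSoup  # third-party: beautifulsoup4

def iter_task1_examples(path: str):
    """Yield (question, list_of_page_texts, reference_answer) triples."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            question = record["query"]                      # assumed field name
            pages = [
                BeautifulSoup(hit["page_result"], "html.parser").get_text(" ", strip=True)
                for hit in record["search_results"][:5]     # 5 pages per question in Task 1
            ]
            yield question, pages, record.get("answer")     # assumed field name

# Example usage: plug the 5 candidate pages into any retrieve-then-answer pipeline.
# for question, pages, gold in iter_task1_examples("crag_task1.jsonl"):
#     context = "\n\n".join(pages)
#     prediction = my_rag_pipeline(question, context)       # hypothetical function
```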
Air-Bench
AIR-BENCH: Automated Heterogeneous Information Retrieval Benchmark
Pretty interesting, but: starting from an initial corpus, they generate a synthetic set of positive and negative chunks, then a question whose answer is contained in the positive chunks. This means the benchmark can test retrieval but not chunking, since we don't have a full document for each question, only a set of (positive and negative) chunks.
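To make the "retrieval but not chunking" point concrete, here is a hedged sketch of how a retriever could be scored on pre-chunked positives/negatives; the record layout (question, positive_chunks, negative_chunks) is invented for illustration and is not Air-Bench's actual format.

```python
# Sketch: scoring retrieval (recall@k) over pre-chunked positives/negatives.
# The record layout (question, positive_chunks, negative_chunks) is an assumption
# for illustration; it is not Air-Bench's published schema. Because the corpus is
# already chunked, only the retriever is exercised, never a chunking strategy.
from typing import Callable, Iterable

def recall_at_k(
    records: Iterable[dict],
    score: Callable[[str, str], float],  # similarity between a question and a chunk
    k: int = 5,
) -> float:
    hits, total = 0, 0
    for rec in records:
        chunks = [(c, True) for c in rec["positive_chunks"]] + \
                 [(c, False) for c in rec["negative_chunks"]]
        ranked = sorted(chunks, key=lambda pair: score(rec["question"], pair[0]), reverse=True)
        hits += any(is_pos for _, is_pos in ranked[:k])
        total += 1
    return hits / max(total, 1)

# Toy scorer (lexical overlap) as a stand-in for an embedding model:
def overlap(question: str, chunk: str) -> float:
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)
```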
GoogleNQ
output = long answer (similar to a chunk) + short answer (see the extraction sketch after the limitations below)
Limitations:
output (long answer) can be null if there is no answer on the page, or if the information required to answer the question is spread across many paragraphs. The short answer(s) can be a span or set of spans (typically entities) within the long answer that answer the question.
English only
based on Wikipedia
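For reference, a small sketch of how the long/short answers could be pulled out of a record; the field names follow my recollection of the simplified NQ JSONL format (document_text, annotations, long_answer, short_answers, start_token/end_token) and should be double-checked against the actual release.

```python
# Sketch: pulling the long/short answer spans out of a (simplified) NQ record.
# Field names are assumed from memory of the simplified NQ JSONL format and
# should be verified against the release before use.
def extract_answers(record: dict) -> tuple[str | None, list[str]]:
    tokens = record["document_text"].split(" ")
    annotation = record["annotations"][0]

    long_span = annotation["long_answer"]
    if long_span["start_token"] == -1:          # null long answer: no single passage answers
        return None, []
    long_answer = " ".join(tokens[long_span["start_token"]:long_span["end_token"]])

    short_answers = [
        " ".join(tokens[span["start_token"]:span["end_token"]])
        for span in annotation.get("short_answers", [])
    ]
    return long_answer, short_answers
```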
Observations:
we can adapt this to make it more relevant for RAG (see the sketch after this list):
select a sub-sample of N Wikipedia pages (open question: how many distinct pages are present in the dataset?)
select all questions related to those pages
chunk and embed all pages
run RAG on corpus
when N=1, this is equivalent to the original NQ benchmark, where each question is paired with its corresponding document
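A rough sketch of that adaptation, assuming NQ has already been flattened into per-question records carrying a page_title, page_text, and answer (these field names are mine, not NQ's), with a deliberately naive word-window chunker as a stand-in for a real chunking strategy.

```python
# Sketch of the proposed adaptation. Input: records of the form
# {"question": ..., "page_title": ..., "page_text": ..., "answer": ...} (assumed layout).
import random
from collections import defaultdict

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Naive word-window chunker; swap in any real chunking strategy."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def build_rag_corpus(records: list[dict], n_pages: int, seed: int = 0):
    by_page = defaultdict(list)
    for rec in records:
        by_page[rec["page_title"]].append(rec)

    # 1) sub-sample N Wikipedia pages, 2) keep only the questions related to them
    sampled = random.Random(seed).sample(sorted(by_page), k=min(n_pages, len(by_page)))
    questions = [rec for title in sampled for rec in by_page[title]]

    # 3) chunk all sampled pages into one shared corpus
    corpus = [c for title in sampled for c in chunk(by_page[title][0]["page_text"])]
    return corpus, questions

# 4)-5) embed the corpus and run any retriever + generator over (corpus, questions);
# with n_pages=1 each question is retrieved against its own page only, as in the original NQ setup.
```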
MMDocIR
Benchmarking Multi-Modal Retrieval for Long Documents
Related projects:
MIRACL
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages
UDA
UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
REPLIQA
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
CRAG
CRAG – Comprehensive RAG Benchmark
Air-Bench
AIR-BENCH: Automated Heterogeneous Information Retrieval Benchmark
GoogleNQ
Natural Questions: A Benchmark for Question Answering Research
MMDocIR
Benchmarking Multi-Modal Retrieval for Long Documents
MTRAG
MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
CYPHERBENCH
CYPHERBENCH: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
CQUAE
CQuAE: A New Question-Answering Corpus for Education (original title in French: Un nouveau corpus de question-réponse pour l'enseignement)
DEXTER
DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs