CRAG
4,409 question-answer pairs: 2,425 Web Questions and 1,984 KG Questions
For each Web Question, 50 HTML pages retrieved via the Brave Search API and stored (220,000 web pages in total)
five domains: Finance, Sports, Music, Movie, and Open domain
seven question types: Conditions, Comparison, Aggregation, Multi-hop, Set, Post-processing-heavy, and False-premise questions
English only
3 tasks:
Task 1: Retrieval Summarization. 5 web pages per question, likely but not guaranteed to be relevant to the question (see the sketch after this list)
Task 2: KG and Web Retrieval Augmentation via mock APIs
Task 3: both web search results (50 web pages instead of 5) and mock APIs
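As a rough illustration of how Task 1 data could be consumed, here is a minimal Python sketch; the field names (query, search_results, page_result, answer) and the JSONL layout are my assumptions, not the verified CRAG schema.

```python
# Minimal sketch of a Task 1 (Retrieval Summarization) loop over CRAG-style records.
# Field names ("query", "search_results", "page_result", "answer") are assumptions
# about the JSONL layout, not the verified schema.
import json
from bs4 import BeautifulSoup  # third-party: beautifulsoup4

def iter_task1_examples(path: str):
    """Yield (question, list_of_page_texts, reference_answer) triples."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            question = record["query"]                      # assumed field name
            pages = [
                BeautifulSoup(hit["page_result"], "html.parser").get_text(" ", strip=True)
                for hit in record["search_results"][:5]     # 5 pages per question in Task 1
            ]
            yield question, pages, record.get("answer")     # assumed field name

# Example usage: plug the 5 candidate pages into any retrieve-then-answer pipeline.
# for question, pages, gold in iter_task1_examples("crag_task1.jsonl"):
#     context = "\n\n".join(pages)
#     prediction = my_rag_pipeline(question, context)       # hypothetical function
```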
Air-Bench
AIR-BENCH: Automated Heterogeneous Information Retrieval Benchmark
Pretty interesting, but: starting from an initial corpus, they generate a synthetic set of positive and negative chunks, then a question whose answer is contained in the positive chunks. This means the benchmark can test retrieval but not chunking, since we don't have a full document for each question, only a set of (positive and negative) chunks.
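To make the "retrieval but not chunking" point concrete, here is a hedged sketch of how a retriever could be scored on pre-chunked positives/negatives; the record layout (question, positive_chunks, negative_chunks) is invented for illustration and is not Air-Bench's actual format.

```python
# Sketch: scoring retrieval (recall@k) over pre-chunked positives/negatives.
# The record layout (question, positive_chunks, negative_chunks) is an assumption
# for illustration; it is not Air-Bench's published schema. Because the corpus is
# already chunked, only the retriever is exercised, never a chunking strategy.
from typing import Callable, Iterable

def recall_at_k(
    records: Iterable[dict],
    score: Callable[[str, str], float],  # similarity between a question and a chunk
    k: int = 5,
) -> float:
    hits, total = 0, 0
    for rec in records:
        chunks = [(c, True) for c in rec["positive_chunks"]] + \
                 [(c, False) for c in rec["negative_chunks"]]
        ranked = sorted(chunks, key=lambda pair: score(rec["question"], pair[0]), reverse=True)
        hits += any(is_pos for _, is_pos in ranked[:k])
        total += 1
    return hits / max(total, 1)

# Toy scorer (lexical overlap) as a stand-in for an embedding model:
def overlap(question: str, chunk: str) -> float:
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)
```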
GoogleNQ
output = long answer (similar to a chunk) + short answer (see the extraction sketch after the limitations below)
Limitations:
output (long answer) can be null if there is no answer on the page, or if the information required to answer the question is spread across many paragraphs. The short answer(s) can be a span or set of spans (typically entities) within the long answer that answer the question.
English only
based on Wikipedia
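For reference, a small sketch of how the long/short answers could be pulled out of a record; the field names follow my recollection of the simplified NQ JSONL format (document_text, annotations, long_answer, short_answers, start_token/end_token) and should be double-checked against the actual release.

```python
# Sketch: pulling the long/short answer spans out of a (simplified) NQ record.
# Field names are assumed from memory of the simplified NQ JSONL format and
# should be verified against the release before use.
def extract_answers(record: dict) -> tuple[str | None, list[str]]:
    tokens = record["document_text"].split(" ")
    annotation = record["annotations"][0]

    long_span = annotation["long_answer"]
    if long_span["start_token"] == -1:          # null long answer: no single passage answers
        return None, []
    long_answer = " ".join(tokens[long_span["start_token"]:long_span["end_token"]])

    short_answers = [
        " ".join(tokens[span["start_token"]:span["end_token"]])
        for span in annotation.get("short_answers", [])
    ]
    return long_answer, short_answers
```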
Observations:
we can adapt this to make it more relevant for RAG (see the sketch after this list):
select a sub-sample of N Wikipedia pages (open question: how many distinct pages are present in the dataset?)
select all questions related to those pages
chunk and embed all pages
run RAG on corpus
when N=1, this is equivalent to the original NQ benchmark, where each question is paired with its corresponding document
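A rough sketch of that adaptation, assuming NQ has already been flattened into per-question records carrying a page_title, page_text, and answer (these field names are mine, not NQ's), with a deliberately naive word-window chunker as a stand-in for a real chunking strategy.

```python
# Sketch of the proposed adaptation. Input: records of the form
# {"question": ..., "page_title": ..., "page_text": ..., "answer": ...} (assumed layout).
import random
from collections import defaultdict

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Naive word-window chunker; swap in any real chunking strategy."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def build_rag_corpus(records: list[dict], n_pages: int, seed: int = 0):
    by_page = defaultdict(list)
    for rec in records:
        by_page[rec["page_title"]].append(rec)

    # 1) sub-sample N Wikipedia pages, 2) keep only the questions related to them
    sampled = random.Random(seed).sample(sorted(by_page), k=min(n_pages, len(by_page)))
    questions = [rec for title in sampled for rec in by_page[title]]

    # 3) chunk all sampled pages into one shared corpus
    corpus = [c for title in sampled for c in chunk(by_page[title][0]["page_text"])]
    return corpus, questions

# 4)-5) embed the corpus and run any retriever + generator over (corpus, questions);
# with n_pages=1 each question is retrieved against its own page only, as in the original NQ setup.
```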
MMDocIR
Benchmarking Multi-Modal Retrieval for Long Documents
Related projects:
MIRACL
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages
UDA
UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
REPLIQA
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
CRAG
CRAG – Comprehensive RAG Benchmark
Air-Bench
AIR-BENCH: Automated Heterogeneous Information Retrieval Benchmark
GoogleNQ
Natural Questions: A Benchmark for Question Answering Research
MMDocIR
Benchmarking Multi-Modal Retrieval for Long Documents
MTRAG
MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
CYPHERBENCH
CYPHERBENCH: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
CQUAE
CQuAE: A New Question-Answering Corpus for Education (original title in French: Un nouveau corpus de question-réponse pour l'enseignement)
DEXTER
DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs