Retrieval / generation datasets #3543

Open
chloedia opened this issue Jan 2, 2025 — with Linear · 1 comment
chloedia commented Jan 2, 2025

Related projects:

MIRACL

MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

UDA

UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

RepLiQA

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

CRAG

CRAG – Comprehensive RAG Benchmark

  • https://arxiv.org/pdf/2406.04744
  • https://github.com/facebookresearch/CRAG
  • 4,409 question-answer pairs: 2,425 Web Questions and 1,984 KG Questions
  • For each Web Question, 50 HTML pages retrieved via the Brave Search API are stored (220,000 web pages in total)
  • five domains: Finance, Sports, Music, Movie, and Open domain
  • seven types of questions: questions with Conditions, Comparison questions, Aggregation questions, Multi-hop questions, Set queries, Post-processing-heavy questions, and False-premise questions

Screenshot 2025-01-22 at 17.37.56.png

  • English only
  • 3 tasks
    • Task 1: Retrieval Summarization. 5 web pages per question, which are likely, but not guaranteed, to be relevant to the question (see the sketch after this list).
    • Task 2: KG and Web Retrieval Augmentation. Adds mock APIs for querying structured (knowledge-graph) data.
    • Task 3: End-to-End RAG. Provides both web search results (50 web pages instead of 5) and the mock APIs.
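
A minimal sketch of what a Task 1 pipeline could look like, assuming each question comes as a record bundling the query with its 5 HTML pages; the field names ("query", "search_results", "page_html") are assumptions for illustration, not the exact CRAG file format:

```python
# Hedged sketch of CRAG Task 1 (Retrieval Summarization): retrieve passages from
# the 5 HTML pages provided for a question, before handing them to a generator.
# The record fields below are illustrative, NOT the official CRAG schema.
from bs4 import BeautifulSoup
from rank_bm25 import BM25Okapi


def top_passages(record: dict, k: int = 3) -> list[str]:
    """record: {"query": str, "search_results": [{"page_html": str}, ...]} (assumed layout)."""
    passages = []
    for page in record["search_results"]:
        # Strip HTML and split each page into rough fixed-size passages.
        text = BeautifulSoup(page["page_html"], "html.parser").get_text(" ", strip=True)
        words = text.split()
        passages += [" ".join(words[i:i + 150]) for i in range(0, len(words), 150)]
    # Lexical (BM25) retrieval over all passages from the 5 pages.
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    return bm25.get_top_n(record["query"].lower().split(), passages, n=k)
```

The top passages would then go into the prompt of whatever LLM is being benchmarked; the generation step is deliberately left out of the sketch.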

AIR-Bench

AIR-BENCH: Automated Heterogeneous Information Retrieval Benchmark

  • https://arxiv.org/pdf/2412.13102
  • https://github.com/AIR-Bench/AIR-Bench
  • https://huggingface.co/AIR-Bench
  • Different languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean, Arabic, Persian, Indonesian, Hindi, Bengali
  • Different topics: wiki, web, news, healthcare, law, finance, arxiv, book, science
  • Pretty interesting, but note how it is built: starting from an initial corpus, they generate a synthetic set of positive and negative chunks, then a question whose answer is contained in the positive chunks. This means the benchmark can test retrieval but not chunking, since each question comes with a set of (positive and negative) chunks rather than a full document (see the retrieval-only sketch below).
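
A retrieval-only evaluation over such (positive, negative) chunk sets might look like the following sketch; the chunk layout ("text", "is_positive") and the embedding model are assumptions for illustration, not the actual AIR-Bench schema or baseline:

```python
# Hedged sketch: recall@k of a dense retriever over AIR-Bench-style chunk sets.
# The chunk format below is illustrative, NOT the official AIR-Bench schema.
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works here; this one is just a common default.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def recall_at_k(question: str, chunks: list[dict], k: int = 5) -> float:
    """chunks: [{"text": str, "is_positive": bool}, ...] (assumed layout)."""
    q_emb = model.encode([question])[0]
    c_embs = model.encode([c["text"] for c in chunks])
    # Cosine similarity between the question and every candidate chunk.
    sims = c_embs @ q_emb / (np.linalg.norm(c_embs, axis=1) * np.linalg.norm(q_emb))
    top_k = np.argsort(-sims)[:k]
    hits = sum(chunks[i]["is_positive"] for i in top_k)
    n_pos = sum(c["is_positive"] for c in chunks)
    return hits / max(n_pos, 1)
```

Because the candidates are pre-chunked, a score like this only reflects the embedding/retrieval stage; any chunking strategy we want to evaluate never comes into play.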

GoogleNQ

  • https://ai.google.com/research/NaturalQuestions
  • https://research.google/pubs/natural-questions-a-benchmark-for-question-answering-research/
  • input = question, Wikipedia page
  • output = long answer (similar to chunk), short answer
  • Limitations:
    • output (long answer) can be null if there is no answer on the page, or if the information required to answer the question is spread across many paragraphs. The short answer(s) can be a span or set of spans (typically entities) within the long answer that answer the question.
    • English only
    • based on Wikipedia
  • Observations:
    • can adapt this to make it more relevant for RAG (see the sketch after this list):
      • select a sub-sample of N Wikipedia pages
        • how many different pages are present in the dataset?
      • select all questions related to those pages
      • chunk and embed all pages
      • run RAG on the pooled corpus
      • when N=1, this is equivalent to the original NQ benchmark, where each question is paired with its corresponding document
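
A rough sketch of this adaptation, assuming each NQ example has been flattened into an illustrative record with "question", "page_title", "page_text", and "long_answer" fields (not the actual NQ release format):

```python
# Hedged sketch of the proposed NQ-for-RAG adaptation: pool N Wikipedia pages
# into one corpus, chunk + embed it, and retrieve for every attached question.
# The `examples` layout is illustrative, NOT the actual NQ schema.
import random
from sentence_transformers import SentenceTransformer, util


def chunk(text: str, size: int = 200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def nq_rag_retrieval_accuracy(examples: list[dict], n_pages: int, seed: int = 0) -> float:
    random.seed(seed)
    # 1. Select a sub-sample of N Wikipedia pages.
    titles = sorted({ex["page_title"] for ex in examples})
    selected = set(random.sample(titles, n_pages))
    # 2. Keep only the questions attached to those pages.
    subset = [ex for ex in examples if ex["page_title"] in selected]
    # 3. Chunk and embed the pooled corpus (all selected pages together).
    pages = {ex["page_title"]: ex["page_text"] for ex in subset}
    corpus = [c for text in pages.values() for c in chunk(text)]
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    corpus_emb = model.encode(corpus, convert_to_tensor=True)
    # 4. Retrieve for each question; with n_pages == 1 this reduces to the
    #    original NQ setting (one question paired with one document).
    hits = 0
    for ex in subset:
        q_emb = model.encode(ex["question"], convert_to_tensor=True)
        best = corpus[int(util.cos_sim(q_emb, corpus_emb).argmax())]
        hits += ex["long_answer"][:50] in best  # crude containment check
    return hits / max(len(subset), 1)
```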

MMDocIR

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

MTRAG

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

CypherBench

CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era

CQuAE

CQuAE: A new question-answering corpus for education (original French title: "CQuAE : Un nouveau corpus de question-réponse pour l'enseignement")

DEXTER

DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs

linear bot commented Jan 2, 2025

jacopo-chevallard changed the title from "First Datasets finding" to "RAG evaluation / benchmark datasets" on Jan 21, 2025
jacopo-chevallard changed the title from "RAG evaluation / benchmark datasets" to "Retrieval datasets" on Jan 23, 2025
jacopo-chevallard changed the title from "Retrieval datasets" to "Retrieval / generation datasets" on Jan 23, 2025