Pyserini: BGE-base Baseline for NFCorpus

This guide contains instructions for running a BGE-base baseline for NFCorpus.

If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've first done the previous step, a conceptual framework for retrieval. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.

If you've traversed the onboarding path, by now you've learned the basics of bag-of-words retrieval with BM25 using Lucene (via Anserini and Pyserini). Conceptually, you understand how it's a specific manifestation of a bi-encoder architecture where the vector representations are lexical and the weights are assigned in an unsupervised (or heuristic) manner.

In this guide, we're going to go through an example of retrieval using a learned, dense representation. These are often called "dense retrieval models" and informally referred to as "vector search". Coming back to the bi-encoder architecture from the conceptual framework:

The document and query encoders are now transformer-based models that are trained on large amounts of supervised data. The outputs of the encoders are often called embedding vectors, or just embeddings for short.

For this guide, assume that we've already got trained encoders. How to actually train such models will be covered later.

Learning outcomes for this guide, building on previous steps in the onboarding path:

  • Be able to use Pyserini to encode documents in NFCorpus with an existing dense retrieval model (BGE-base) and to build a Faiss index on the vector representations.
  • Be able to use Pyserini to perform a batch retrieval run on queries from NFCorpus.
  • Be able to evaluate the retrieved results above.
  • Be able to generate the retrieved results above interactively by directly manipulating Pyserini Python classes.

Data Prep

In this lesson, we'll be working with NFCorpus, a full-text learning-to-rank dataset for medical information retrieval. The rationale is that the corpus is quite small — only 3633 documents — so the latency of CPU-based inference with neural models (i.e., the encoders) is tolerable, which makes this lesson doable on a laptop. It is not practical to work with the MS MARCO passage ranking corpus using CPUs.

Let's first start by fetching the data:

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections
unzip collections/nfcorpus.zip -d collections

This just gives you an idea of what the corpus contains:

$ head -1 collections/nfcorpus/corpus.jsonl
{"_id": "MED-10", "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995\u20132003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08\u20139.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38\u20130.55 and HR 0.54, 95% CI 0.44\u20130.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins\u2019 effect on survival in breast cancer patients.", "metadata": {"url": "http://www.ncbi.nlm.nih.gov/pubmed/25329299"}}

We need to do a bit of data munging to get the queries into the right format (from json to tsv). Run the following Python script:

import json

# Convert the queries from JSONL to TSV: one "query-id<TAB>query-text" per line.
with open('collections/nfcorpus/queries.tsv', 'w') as out:
    with open('collections/nfcorpus/queries.jsonl', 'r') as f:
        for line in f:
            l = json.loads(line)
            out.write(l['_id'] + '\t' + l['text'] + '\n')

Similarly, we need to munge the relevance judgments (qrels) into the standard TREC format. This command-line invocation strips the header row and inserts a "Q0" column after the query id:

tail -n +2 collections/nfcorpus/qrels/test.tsv | sed 's/\t/\tQ0\t/' > collections/nfcorpus/qrels/test.qrels
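
If you prefer Python, here's a rough equivalent of the one-liner above (just a sketch, assuming the BEIR qrels file has the three tab-separated columns query-id, corpus-id, and score):

with open('collections/nfcorpus/qrels/test.qrels', 'w') as out:
    with open('collections/nfcorpus/qrels/test.tsv', 'r') as f:
        next(f)  # skip the header row (same as tail -n +2)
        for line in f:
            query_id, doc_id, rel = line.rstrip('\n').split('\t')
            # Standard TREC qrels format: qid Q0 docid relevance
            out.write(f'{query_id}\tQ0\t{doc_id}\t{rel}\n')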

Okay, the data are ready now.

Indexing

We can now "index" these documents using Pyserini:

python -m pyserini.encode \
  input   --corpus collections/nfcorpus/corpus.jsonl \
          --fields title text \
  output  --embeddings indexes/nfcorpus.bge-base-en-v1.5 \
          --to-faiss \
  encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
          --device cpu \
          --pooling mean \
          --fields title text \
          --batch 32

We're using the BAAI/bge-base-en-v1.5 encoder, which can be found on HuggingFace. Use --device cuda for faster encoding if you have a CUDA-enabled GPU.
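
To make the encoder flags above more concrete, here's a rough sketch of what encoding a single piece of text boils down to: tokenize, run the transformer, mean-pool the token embeddings (--pooling mean), and L2-normalize the result (--l2-norm). This is only an illustration using HuggingFace transformers directly, not Pyserini's actual implementation:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-base-en-v1.5')

# Title of the sample document we peeked at earlier.
text = 'Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland'
inputs = tokenizer(text, return_tensors='pt', truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padded tokens.
mask = inputs['attention_mask'].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# L2 normalization, so that inner product equals cosine similarity.
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)
print(embedding.shape)  # expected: torch.Size([1, 768]) for a base-sized model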

Try it using the Contriever model!

python -m pyserini.encode \
  input   --corpus collections/nfcorpus/corpus.jsonl \
          --fields title text \
  output  --embeddings indexes/faiss.nfcorpus.contriever-msmacro \
          --to-faiss \
  encoder --encoder facebook/contriever-msmarco \
          --device cpu \
          --pooling mean \
          --fields title text \
          --batch 32

We're using the facebook/contriever-msmarco encoder, which can be found on HuggingFace. Use --device cuda for faster encoding if you have a CUDA-enabled GPU.


Pyserini wraps Faiss, which is a library for efficient similarity search on dense vectors. That is, once all the documents have been encoded (i.e., converted into representation vectors), they are passed to Faiss to manage (i.e., for storage and for search later on). "Index" here is in quotes because, in reality, we're using something called a "flat" index (FlatIP to be exact), which just stores the vectors as fixed-width records, one after the other. At search time, each document vector is sequentially compared to the query vector. In other words, the library just performs brute-force dot products of the query vector against all document vectors.
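
To see the "flat" index idea in isolation, here's a toy sketch that uses Faiss directly on random vectors (purely illustrative; Pyserini handles all of this for you when you run the commands in this guide):

import faiss
import numpy as np

dim = 768
doc_vectors = np.random.rand(1000, dim).astype('float32')  # stand-ins for document embeddings
query_vector = np.random.rand(1, dim).astype('float32')    # stand-in for a query embedding

index = faiss.IndexFlatIP(dim)  # "IP" = inner product; no compression, no approximation
index.add(doc_vectors)          # just appends the raw vectors to the index

scores, docids = index.search(query_vector, 10)  # brute-force dot products, exact top-10
print(docids[0], scores[0])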

The above indexing command takes around 30 minutes to run on a modern laptop, with most of the time occupied by performing neural inference using the CPU. Adjust the batch parameter above for your hardware; 32 is the default, but you can lower it if your machine struggles (e.g., runs out of memory).

Retrieval

We can now perform retrieval in Pyserini using the following command:

python -m pyserini.search.faiss \
  --encoder-class auto --encoder BAAI/bge-base-en-v1.5 --l2-norm \
  --pooling mean \
  --index indexes/nfcorpus.bge-base-en-v1.5 \
  --topics collections/nfcorpus/queries.tsv \
  --output runs/run.beir.bge-base-en-v1.5.nfcorpus.txt \
  --batch 128 --threads 8 \
  --hits 1000

The queries are in collections/nfcorpus/queries.tsv.
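
As a quick sanity check, you can grep for the query we'll use again later in the interactive section:

$ grep PLAIN-3074 collections/nfcorpus/queries.tsv

Assuming the munging script above ran correctly, you should see the query text "How to Help Prevent Abdominal Aortic Aneurysms" next to that query id.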

If you indexed with Contriever above, try retrieval with it too:

python -m pyserini.search.faiss \
  --encoder-class contriever --encoder facebook/contriever-msmarco \
  --index indexes/faiss.nfcorpus.contriever-msmacro \
  --topics collections/nfcorpus/queries.tsv \
  --output runs/run.beir-contriever-msmarco.nfcorpus.txt \
  --batch 128 --threads 8 \
  --hits 1000

As mentioned above, Pyserini wraps the Faiss library. With the flat index here, we're performing brute-force computation of dot products (albeit in parallel and with batching). As a result, we are performing exact search, i.e., we are finding the exact top-k documents that have the highest dot products.

The above retrieval command takes only a few minutes on a modern laptop. Adjust the threads and batch parameters above accordingly for your hardware.

Evaluation

After the run finishes, we can evaluate the results using trec_eval:

python -m pyserini.eval.trec_eval \
  -c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
  runs/run.beir.bge-base-en-v1.5.nfcorpus.txt

The results will be something like:

Results:
ndcg_cut_10           	all	0.3808

And if you've been following along with Contriever:

python -m pyserini.eval.trec_eval \
  -c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
  runs/run.beir-contriever-msmarco.nfcorpus.txt

The results will be something like:

Results:
ndcg_cut_10           	all	0.3306
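
The same wrapper accepts other trec_eval metrics if you want to dig deeper; for example, to also report recall at a cutoff of 100 (an optional variation, assuming your trec_eval build recognizes the metric name, which recent versions do):

python -m pyserini.eval.trec_eval \
  -c -m ndcg_cut.10 -m recall.100 collections/nfcorpus/qrels/test.qrels \
  runs/run.beir.bge-base-en-v1.5.nfcorpus.txt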

If you've gotten here, congratulations! You've completed your first indexing and retrieval run using a dense retrieval model.

Interactive Retrieval

The final step, as with Lucene, is to learn to use the dense retriever interactively. This contrasts with the batch run above.

Here's the snippet of Python code that does what we want:

from pyserini.search.faiss import FaissSearcher
from pyserini.encode import AutoQueryEncoder

encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True)
searcher = FaissSearcher('indexes/nfcorpus.bge-base-en-v1.5', encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')

The FaissSearcher provides search capabilities using Faiss as its underlying implementation. The AutoQueryEncoder allows us to initialize an encoder using a HuggingFace model.

 1 MED-4555 0.791379
 2 MED-4560 0.710725
 3 MED-4421 0.688938
 4 MED-4993 0.686238
 5 MED-4424 0.686214
 6 MED-1663 0.682199
 7 MED-3436 0.680585
 8 MED-2750 0.677033
 9 MED-4324 0.675772
10 MED-2939 0.674646

You'll see that the ranked list is the same as the batch run you performed above (scores may differ in the last decimal digit due to rounding):

$ grep PLAIN-3074 runs/run.beir.bge-base-en-v1.5.nfcorpus.txt | head -10
PLAIN-3074 Q0 MED-4555 1 0.791378 Faiss
PLAIN-3074 Q0 MED-4560 2 0.710725 Faiss
PLAIN-3074 Q0 MED-4421 3 0.688938 Faiss
PLAIN-3074 Q0 MED-4993 4 0.686238 Faiss
PLAIN-3074 Q0 MED-4424 5 0.686214 Faiss
PLAIN-3074 Q0 MED-1663 6 0.682199 Faiss
PLAIN-3074 Q0 MED-3436 7 0.680585 Faiss
PLAIN-3074 Q0 MED-2750 8 0.677033 Faiss
PLAIN-3074 Q0 MED-4324 9 0.675772 Faiss
PLAIN-3074 Q0 MED-2939 10 0.674647 Faiss
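
The hits hold only docids and scores. If you want to see what the top-ranked document actually says, one simple (if inefficient) option is to scan the corpus file we downloaded earlier; this is just a sketch using the JSONL corpus, not a Pyserini API:

import json

docid = hits[0].docid  # e.g., 'MED-4555' from the run above
with open('collections/nfcorpus/corpus.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)
        if doc['_id'] == docid:
            print(doc['title'])
            break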

Again with Contriever!

Here's the snippet of Python code that does what we want:

from pyserini.search.faiss import FaissSearcher
from pyserini.encode import AutoQueryEncoder

encoder = AutoQueryEncoder('facebook/contriever-msmarco', device='cpu', pooling='mean')
searcher = FaissSearcher('indexes/faiss.nfcorpus.contriever-msmacro', encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')

The FaissSearcher provides search capabilities using Faiss as its underlying implementation. The AutoQueryEncoder allows us to initialize an encoder using a HuggingFace model.

 1 MED-4555 1.472201
 2 MED-3180 1.125014
 3 MED-1309 1.067153
 4 MED-2224 1.059536
 5 MED-4423 1.038440
 6 MED-4887 1.032622
 7 MED-2530 1.020758
 8 MED-2372 1.016142
 9 MED-1006 1.013599
10 MED-2587 1.010811

You'll see that the ranked list is the same as the batch run you performed above:

$ grep PLAIN-3074 runs/run.beir-contriever-msmarco.nfcorpus.txt | head -10
PLAIN-3074 Q0 MED-4555 1 1.472201 Faiss
PLAIN-3074 Q0 MED-3180 2 1.125014 Faiss
PLAIN-3074 Q0 MED-1309 3 1.067153 Faiss
PLAIN-3074 Q0 MED-2224 4 1.059537 Faiss
PLAIN-3074 Q0 MED-4423 5 1.038440 Faiss
PLAIN-3074 Q0 MED-4887 6 1.032622 Faiss
PLAIN-3074 Q0 MED-2530 7 1.020758 Faiss
PLAIN-3074 Q0 MED-2372 8 1.016142 Faiss
PLAIN-3074 Q0 MED-1006 9 1.013599 Faiss
PLAIN-3074 Q0 MED-2587 10 1.010811 Faiss

And that's it!

The next lesson will provide a deeper dive into dense and sparse representations. Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-character hexadecimal prefix for the link anchor text.

Reproduction Log*