Pyserini: BGE-base Baseline for NFCorpus

This guide contains instructions for running a BGE-base baseline for NFCorpus.

If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've first done the previous step, a conceptual framework for retrieval. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.

If you've traversed the onboarding path, by now you've learned the basics of bag-of-words retrieval with BM25 using Lucene (via Anserini and Pyserini). Conceptually, you understand how it's a specific manifestation of a bi-encoder architecture where the vector representations are lexical and the weights are assigned in an unsupervised (or heuristic) manner.

In this guide, we're going to go through an example of retrieval using a learned, dense representation. These are often called "dense retrieval models" and informally referred to as "vector search". Coming back to the bi-encoder architecture from the conceptual framework:

The document and query encoders are now transformer-based models that are trained on large amounts of supervised data. The outputs of the encoders are often called embedding vectors, or just embeddings for short.

For this guide, assume that we've already got trained encoders. How to actually train such models will be covered later.

Learning outcomes for this guide, building on previous steps in the onboarding path:

  • Be able to use Pyserini to encode documents in NFCorpus with an existing dense retrieval model (BGE-base) and to build a Faiss index on the vector representations.
  • Be able to use Pyserini to perform a batch retrieval run on queries from NFCorpus.
  • Be able to evaluate the retrieved results above.
  • Be able to generate the retrieved results above interactively by directly manipulating Pyserini Python classes.

Data Prep

In this lesson, we'll be working with NFCorpus, a full-text learning-to-rank dataset for medical information retrieval. The rationale is that the corpus is quite small — only 3633 documents — so the latency of CPU-based inference with neural models (i.e., the encoders) is tolerable, which makes this lesson doable on a laptop. It is not practical to work with the MS MARCO passage ranking corpus using CPUs.

Let's first start by fetching the data:

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections
unzip collections/nfcorpus.zip -d collections

This just gives you an idea of what the corpus contains:

$ head -1 collections/nfcorpus/corpus.jsonl
{"_id": "MED-10", "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995\u20132003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08\u20139.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38\u20130.55 and HR 0.54, 95% CI 0.44\u20130.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins\u2019 effect on survival in breast cancer patients.", "metadata": {"url": "http://www.ncbi.nlm.nih.gov/pubmed/25329299"}}

We need to do a bit of data munging to get the queries into the right format (from json to tsv). Run the following Python script:

import json

# Convert the queries from JSONL to TSV: one "query-id<TAB>query-text" per line.
with open('collections/nfcorpus/queries.tsv', 'w') as out:
    with open('collections/nfcorpus/queries.jsonl', 'r') as f:
        for line in f:
            l = json.loads(line)
            out.write(l['_id'] + '\t' + l['text'] + '\n')

Similarly, we need to munge the relevance judgments (qrels) into the standard TREC format. This command-line invocation strips the header row and inserts a "Q0" column after the query id:

tail -n +2 collections/nfcorpus/qrels/test.tsv | sed 's/\t/\tQ0\t/' > collections/nfcorpus/qrels/test.qrels
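
If you prefer Python, here's a rough equivalent of the one-liner above (just a sketch, assuming the BEIR qrels file has the three tab-separated columns query-id, corpus-id, and score):

with open('collections/nfcorpus/qrels/test.qrels', 'w') as out:
    with open('collections/nfcorpus/qrels/test.tsv', 'r') as f:
        next(f)  # skip the header row (same as tail -n +2)
        for line in f:
            query_id, doc_id, rel = line.rstrip('\n').split('\t')
            # Standard TREC qrels format: qid Q0 docid relevance
            out.write(f'{query_id}\tQ0\t{doc_id}\t{rel}\n')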

Okay, the data are ready now.

Indexing

We can now "index" these documents using Pyserini:

python -m pyserini.encode \
  input   --corpus collections/nfcorpus/corpus.jsonl \
          --fields title text \
  output  --embeddings indexes/nfcorpus.bge-base-en-v1.5 \
          --to-faiss \
  encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
          --device cpu \
          --pooling mean \
          --fields title text \
          --batch 32

We're using the BAAI/bge-base-en-v1.5 encoder, which can be found on HuggingFace. Use --device cuda for faster encoding if you have a CUDA-enabled GPU.
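
To make the encoder flags above more concrete, here's a rough sketch of what encoding a single piece of text boils down to: tokenize, run the transformer, mean-pool the token embeddings (--pooling mean), and L2-normalize the result (--l2-norm). This is only an illustration using HuggingFace transformers directly, not Pyserini's actual implementation:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-base-en-v1.5')

# Title of the sample document we peeked at earlier.
text = 'Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland'
inputs = tokenizer(text, return_tensors='pt', truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padded tokens.
mask = inputs['attention_mask'].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# L2 normalization, so that inner product equals cosine similarity.
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)
print(embedding.shape)  # expected: torch.Size([1, 768]) for a base-sized model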

Try it using the Contriever model!

python -m pyserini.encode \
  input   --corpus collections/nfcorpus/corpus.jsonl \
          --fields title text \
  output  --embeddings indexes/faiss.nfcorpus.contriever-msmacro \
          --to-faiss \
  encoder --encoder facebook/contriever-msmarco \
          --device cpu \
          --pooling mean \
          --fields title text \
          --batch 32

We're using the facebook/contriever-msmarco encoder, which can be found on HuggingFace. Use --device cuda for faster encoding if you have a CUDA-enabled GPU.


Pyserini wraps Faiss, which is a library for efficient similarity search on dense vectors. That is, once all the documents have been encoded (i.e., converted into representation vectors), they are passed to Faiss to manage (i.e., for storage and for search later on). "Index" here is in quotes because, in reality, we're using something called a "flat" index (FlatIP to be exact), which just stores the vectors as fixed-width records, one after the other. At search time, each document vector is sequentially compared to the query vector. In other words, the library just performs brute-force dot products of the query vector against all document vectors.
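
To see the "flat" index idea in isolation, here's a toy sketch that uses Faiss directly on random vectors (purely illustrative; Pyserini handles all of this for you when you run the commands in this guide):

import faiss
import numpy as np

dim = 768
doc_vectors = np.random.rand(1000, dim).astype('float32')  # stand-ins for document embeddings
query_vector = np.random.rand(1, dim).astype('float32')    # stand-in for a query embedding

index = faiss.IndexFlatIP(dim)  # "IP" = inner product; no compression, no approximation
index.add(doc_vectors)          # just appends the raw vectors to the index

scores, docids = index.search(query_vector, 10)  # brute-force dot products, exact top-10
print(docids[0], scores[0])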

The above indexing command takes around 30 minutes to run on a modern laptop, with most of the time occupied by performing neural inference using the CPU. Adjust the batch parameter above for your hardware; 32 is the default, but you can lower it if your machine struggles (e.g., runs out of memory).

Retrieval

We can now perform retrieval in Pyserini using the following command:

python -m pyserini.search.faiss \
  --encoder-class auto --encoder BAAI/bge-base-en-v1.5 --l2-norm \
  --pooling mean \
  --index indexes/nfcorpus.bge-base-en-v1.5 \
  --topics collections/nfcorpus/queries.tsv \
  --output runs/run.beir.bge-base-en-v1.5.nfcorpus.txt \
  --batch 128 --threads 8 \
  --hits 1000

The queries are in collections/nfcorpus/queries.tsv.
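
As a quick sanity check, you can grep for the query we'll use again later in the interactive section:

$ grep PLAIN-3074 collections/nfcorpus/queries.tsv

Assuming the munging script above ran correctly, you should see the query text "How to Help Prevent Abdominal Aortic Aneurysms" next to that query id.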

If you indexed with Contriever above, try retrieval with it too:

python -m pyserini.search.faiss \
  --encoder-class contriever --encoder facebook/contriever-msmarco \
  --index indexes/faiss.nfcorpus.contriever-msmacro \
  --topics collections/nfcorpus/queries.tsv \
  --output runs/run.beir-contriever-msmarco.nfcorpus.txt \
  --batch 128 --threads 8 \
  --hits 1000

As mentioned above, Pyserini wraps the Faiss library. With the flat index here, we're performing brute-force computation of dot products (albeit in parallel and with batching). As a result, we are performing exact search, i.e., we are finding the exact top-k documents that have the highest dot products.

The above retrieval command takes only a few minutes on a modern laptop. Adjust the threads and batch parameters above accordingly for your hardware.

Evaluation

After the run finishes, we can evaluate the results using trec_eval:

python -m pyserini.eval.trec_eval \
  -c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
  runs/run.beir.bge-base-en-v1.5.nfcorpus.txt

The results will be something like:

Results:
ndcg_cut_10           	all	0.3808

And if you've been following along with Contriever:

python -m pyserini.eval.trec_eval \
  -c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
  runs/run.beir-contriever-msmarco.nfcorpus.txt

The results will be something like:

Results:
ndcg_cut_10           	all	0.3306
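
The same wrapper accepts other trec_eval metrics if you want to dig deeper; for example, to also report recall at a cutoff of 100 (an optional variation, assuming your trec_eval build recognizes the metric name, which recent versions do):

python -m pyserini.eval.trec_eval \
  -c -m ndcg_cut.10 -m recall.100 collections/nfcorpus/qrels/test.qrels \
  runs/run.beir.bge-base-en-v1.5.nfcorpus.txt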

If you've gotten here, congratulations! You've completed your first indexing and retrieval run using a dense retrieval model.

Interactive Retrieval

The final step, as with Lucene, is to learn to use the dense retriever interactively. This contrasts with the batch run above.

Here's the snippet of Python code that does what we want:

from pyserini.search.faiss import FaissSearcher
from pyserini.encode import AutoQueryEncoder

encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True)
searcher = FaissSearcher('indexes/nfcorpus.bge-base-en-v1.5', encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')

The FaissSearcher provides search capabilities using Faiss as its underlying implementation. The AutoQueryEncoder allows us to initialize an encoder using a HuggingFace model.

 1 MED-4555 0.791379
 2 MED-4560 0.710725
 3 MED-4421 0.688938
 4 MED-4993 0.686238
 5 MED-4424 0.686214
 6 MED-1663 0.682199
 7 MED-3436 0.680585
 8 MED-2750 0.677033
 9 MED-4324 0.675772
10 MED-2939 0.674646

You'll see that the ranked list is the same as the batch run you performed above (scores may differ in the last decimal digit due to rounding):

$ grep PLAIN-3074 runs/run.beir.bge-base-en-v1.5.nfcorpus.txt | head -10
PLAIN-3074 Q0 MED-4555 1 0.791378 Faiss
PLAIN-3074 Q0 MED-4560 2 0.710725 Faiss
PLAIN-3074 Q0 MED-4421 3 0.688938 Faiss
PLAIN-3074 Q0 MED-4993 4 0.686238 Faiss
PLAIN-3074 Q0 MED-4424 5 0.686214 Faiss
PLAIN-3074 Q0 MED-1663 6 0.682199 Faiss
PLAIN-3074 Q0 MED-3436 7 0.680585 Faiss
PLAIN-3074 Q0 MED-2750 8 0.677033 Faiss
PLAIN-3074 Q0 MED-4324 9 0.675772 Faiss
PLAIN-3074 Q0 MED-2939 10 0.674647 Faiss
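
The hits hold only docids and scores. If you want to see what the top-ranked document actually says, one simple (if inefficient) option is to scan the corpus file we downloaded earlier; this is just a sketch using the JSONL corpus, not a Pyserini API:

import json

docid = hits[0].docid  # e.g., 'MED-4555' from the run above
with open('collections/nfcorpus/corpus.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)
        if doc['_id'] == docid:
            print(doc['title'])
            break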

Again with Contriever!

Here's the snippet of Python code that does what we want:

from pyserini.search.faiss import FaissSearcher
from pyserini.encode import AutoQueryEncoder

encoder = AutoQueryEncoder('facebook/contriever-msmarco', device='cpu', pooling='mean')
searcher = FaissSearcher('indexes/faiss.nfcorpus.contriever-msmacro', encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')

The FaissSearcher provides search capabilities using Faiss as its underlying implementation. The AutoQueryEncoder allows us to initialize an encoder using a HuggingFace model.

 1 MED-4555 1.472201
 2 MED-3180 1.125014
 3 MED-1309 1.067153
 4 MED-2224 1.059536
 5 MED-4423 1.038440
 6 MED-4887 1.032622
 7 MED-2530 1.020758
 8 MED-2372 1.016142
 9 MED-1006 1.013599
10 MED-2587 1.010811

You'll see that the ranked list is the same as the batch run you performed above:

$ grep PLAIN-3074 runs/run.beir-contriever-msmarco.nfcorpus.txt | head -10
PLAIN-3074 Q0 MED-4555 1 1.472201 Faiss
PLAIN-3074 Q0 MED-3180 2 1.125014 Faiss
PLAIN-3074 Q0 MED-1309 3 1.067153 Faiss
PLAIN-3074 Q0 MED-2224 4 1.059537 Faiss
PLAIN-3074 Q0 MED-4423 5 1.038440 Faiss
PLAIN-3074 Q0 MED-4887 6 1.032622 Faiss
PLAIN-3074 Q0 MED-2530 7 1.020758 Faiss
PLAIN-3074 Q0 MED-2372 8 1.016142 Faiss
PLAIN-3074 Q0 MED-1006 9 1.013599 Faiss
PLAIN-3074 Q0 MED-2587 10 1.010811 Faiss

And that's it!

The next lesson will provide a deeper dive into dense and sparse representations. Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-character hexadecimal prefix for the link anchor text.

Reproduction Log*