This guide presents a conceptual framework for a representational approach to information retrieval that integrates dense and sparse representations into the same underlying (bi-encoder) architecture.
If you're a Waterloo student traversing the onboarding path, make sure you've first done all the exercises leading up to this guide, starting here. The previous step in the onboarding path is to reproduce BM25 baselines for the MS MARCO passage ranking task in Pyserini. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.
Learning outcomes for this guide, building on previous steps in the onboarding path:
- Understand how sparse and dense representations can be viewed as variations in a bi-encoder architecture.
- Be able to identify correspondences between Lucene indexing and retrieval operations within the above conceptual framework.
- Be able to extract the BM25 vector representation of a document from a Lucene index and compute its inner product with respect to a query.
- Understand the difference between dense and sparse representations, and between supervised (learned) and unsupervised (heuristic) representations.
As a recap from here, this is the "core retrieval" problem that we're trying to solve:
Given an information need expressed as a query q, the text retrieval task is to return a ranked list of k texts {d₁, d₂, …, dₖ} from an arbitrarily large but finite collection of texts C = {dᵢ} that maximizes a metric of interest, for example, nDCG, AP, etc.
How might we tackle the challenge? One approach, known as a bi-encoder (or dual-encoder) architecture, is presented below:
The approach is conceptually quite simple: Let's say we have two "encoders":
- The document encoder takes a document and generates a representation of the document.
- The query encoder takes a query and generates a representation of the query.
In addition, we have a comparison function that takes two representations (one from a document, the other from a query) and produces an estimate of relevance, i.e., the degree to which the document is relevant to the query. In other words, the comparison function produces a relevance score; alternatively, we call this a query-document score.
Let's assume that the encoders generate representations in the form of vectors, and that the score is computed in terms of the inner product (or dot product) between the document and query vectors.
Let's further assume that the encoders have been designed or built in such a way that the larger the score (i.e., inner product), the more relevant the document is to the query. (Exactly how, we'll see later.)
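To make this concrete, here's a minimal sketch of what the comparison function computes. The vectors are made up for illustration (real representations have far more dimensions):

import numpy as np

# Hypothetical vector representations produced by the encoders.
query_vec = np.array([0.5, 1.0, 0.0, 0.2])
doc1_vec = np.array([0.4, 0.9, 0.1, 0.0])
doc2_vec = np.array([0.0, 0.1, 0.8, 0.3])

# The comparison function: an inner (dot) product per document.
print(np.dot(query_vec, doc1_vec))  # 1.1  -> higher score, judged more relevant
print(np.dot(query_vec, doc2_vec))  # 0.16

Everything that follows is about how to build encoders so that these scores actually track relevance, and how to compute them at scale.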
Given this setup, how would we build a retrieval system? Well, here's one obvious way:
Step (1). Let's take the document collection (or corpus), i.e., C = {dᵢ}, and encode each document. So we have a bunch of vectors now, each corresponding to one document.
Step (2). When a query arrives, we need to encode it, i.e., generate its vector representation.
Step (3). In the final step, we need to find the k document vectors that have the highest query-document scores in terms of the inner product of their vector representations. We say k because in nearly all settings, k is specified externally, i.e., the user says, give me the top 10 hits. Hence, top-k retrieval.
Step (1) and step (2) are relatively straightforward given the document and query encoders. Encoding the document collection is embarrassingly parallel and encoding the query happens at search time.
Step (3) has a very naive implementation: take the query vector, compute its score with the vector of the first document, compute its score with the vector of the second document, compute its score with the vector of the third document, etc. Repeat until we iterate through all documents, keeping track of the top-k scores along the way (e.g., in a heap). In other words, compute all inner products in a brute-force manner.
Don't laugh, this isn't as ridiculous as it sounds!
(For later: this is in fact what's happening with a FlatIP index in Faiss.)
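Here's a minimal sketch of that brute-force strategy (the names and data are made up for illustration), using a min-heap to keep the best k scores seen so far:

import heapq
import numpy as np

def brute_force_topk(query_vec, doc_vecs, k):
    heap = []  # min-heap of (score, docid); the worst surviving score sits on top
    for docid, doc_vec in enumerate(doc_vecs):
        score = float(np.dot(query_vec, doc_vec))
        if len(heap) < k:
            heapq.heappush(heap, (score, docid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, docid))
    return sorted(heap, reverse=True)  # best first

# Toy example: 1000 random 4-dimensional "documents", top-10 retrieval.
rng = np.random.default_rng(42)
print(brute_force_topk(rng.random(4), rng.random((1000, 4)), k=10))

This runs in time linear in the size of the collection, which is exactly why it stops being funny once the collection gets large.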
However, researchers have developed more efficient data structures and top-k retrieval algorithms for vectors of different types.
As a preview: for sparse vectors, we use inverted indexes; for dense vectors, we use HNSW (hierarchical navigable small-world) indexes.
Now, okay, what does that have to do with retrieval using BM25?
Well, BM25 is simply an "instantiation" of the above bi-encoder framework, where BM25 is the document encoder and the query encoder generates a so-called multi-hot vector, like this:
Wait, seriously?
Yup!
Let's consider docid 7187158, the answer to the query about Paula Deen's brother:
Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ€¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ€

(Yes, the text really is repeated and really does contain encoding artifacts; that's how the passage appears in the raw MS MARCO corpus, which is why a garbled term shows up in the vector below.)
This is the BM25 vector representation for that document:
{
"be": 2.637899875640869,
"brother": 4.09124231338501,
"bubba": 7.102361679077148,
"bubba's\u00e2": 11.091651916503906,
"deen": 7.4197235107421875,
"earl": 5.663764953613281,
"former": 3.8262834548950195,
"gener": 2.2932770252227783,
"her": 2.7393782138824463,
"hier": 8.24051284790039,
"manag": 2.832794189453125,
"paula": 6.438521862030029,
"su": 5.404428005218506,
"uncl": 5.362298488616943,
"w": 3.9339818954467773
}
This requires a bit of explanation. As previously mentioned, BM25 is a so-called "bag-of-words" model, which means that the representation of the document is a sparse vector in which each dimension corresponds to a term in the vocabulary space. Usually, we say "terms" (or sometimes "tokens") instead of "words" because there's some amount of processing to tokenize the input document. Typically, words are lowercased, "stopwords" are discarded, and some form of stemming is applied (e.g., to remove suffixes).
In a bag-of-words representation, the dimension of the sparse vector is the size of the vocabulary space, i.e., the number of unique terms in the entire collection (i.e., all documents). It's called a "bag of words" because we're only keeping track of the individual terms in a document, neglecting other important aspects of "meaning" such as the order of the terms, linguistic relationships, etc. Sometimes, these are called sparse lexical vectors, to emphasize that their dimensions correspond to lexical items (i.e., terms in the vocabulary).
Since most of the entries in a sparse vector (representing a document) are zero, it can be conveniently represented in JSON, with the keys denoting the terms that have non-zero weights (or scores). This is what we've done above. The weights (or scores) of those terms are determined by the BM25 scoring function, which is a function of various term statistics such as the term frequency (the number of times a term appears in a document), the term's document frequency (the number of documents in the collection that contain the term), the document length, etc.
However, the high-level idea is that "important" terms get high weights, and "unimportant" terms get low weights. BM25 is just one (of many) scoring functions that attempt to capture this intuition.
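To give a sense of the shape of such a scoring function, here's a minimal sketch of one common formulation of the BM25 term weight. Details vary across BM25 variants (Lucene's included), so treat this as illustrative rather than as exactly what Lucene computes; the parameter defaults k1=0.9 and b=0.4 are Pyserini's defaults for MS MARCO:

import math

def bm25_term_weight(tf, df, doclen, avg_doclen, num_docs, k1=0.9, b=0.4):
    # Inverse document frequency: rare terms get larger weights.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Saturating term frequency, normalized by document length.
    tf_component = tf / (tf + k1 * (1 - b + b * doclen / avg_doclen))
    return idf * tf_component

The key intuition: the weight grows with term frequency (but saturates), and shrinks for terms that appear in many documents.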
Who came up with the BM25 scoring function? Its origins date back to the 1970s, but you can find a good overview from 2009 here.
So, the indexing phase with Lucene, as in the previous exercise you've done, corresponds to the green box below:
Conceptually, you've computed the BM25 document vector of every document in the collection and stored them in a data structure called an inverted index. (In reality, the inverted index only stores component statistics that allow you to reconstruct the BM25 document vectors.)
With the LuceneIndexReader class in Pyserini, you can materialize (i.e., reconstruct) the BM25 document vector for a particular document:
from pyserini.index.lucene import LuceneIndexReader
import json

index_reader = LuceneIndexReader('indexes/lucene-index-msmarco-passage')

# Fetch the term frequencies for the document, then compute the BM25 weight
# of each term. The terms are already analyzed (stemmed, stopwords removed),
# so we pass analyzer=None.
tf = index_reader.get_document_vector('7187158')
bm25_weights = {term: index_reader.compute_bm25_term_weight('7187158', term, analyzer=None)
                for term in tf.keys()}

print(json.dumps(bm25_weights, indent=4, sort_keys=True))
The above snippet of code generates exactly the same JSON shown above.
Once you've built an index, the retrieval stage corresponds to the purple box below:
The below snippet of code generates the query representation:
from pyserini.analysis import Analyzer, get_lucene_analyzer

# Analyze the query with the same Lucene analyzer used at indexing time.
analyzer = Analyzer(get_lucene_analyzer())
query_tokens = analyzer.analyze('what is paula deen\'s brother')

# Each query term gets a weight of one: a multi-hot vector.
multihot_query_weights = {k: 1 for k in query_tokens}
The query tokens (query_tokens) are:
['what', 'paula', 'deen', 'brother']
The query representation is simply a sparse vector where all the query tokens get a score (weight) of one.
This is also known as a "multi-hot vector".
We represent the query vector as a Python dictionary in multihot_query_weights.
As described above, the top-k retrieval problem is to find the k documents from the collection that have the highest inner product (between a document vector and the query vector). Without getting into details, the inverted index allows this top-k to be computed efficiently.
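Without going into Lucene's actual implementation, here's a toy sketch of why the inverted index helps (the index and weights are made up, loosely echoing the example above): scoring only touches the postings lists of the query terms, and never looks at documents that share no terms with the query.

from collections import defaultdict

# A toy inverted index: each term maps to a postings list of
# (docid, BM25 weight) pairs; only documents containing the term appear.
toy_index = {
    'paula':   [(3, 6.44), (7, 5.91)],
    'deen':    [(3, 7.42), (9, 6.87)],
    'brother': [(3, 4.09), (5, 3.77), (9, 4.20)],
}

def term_at_a_time(query_terms, index):
    # Accumulate each document's score term by term; since the query is
    # multi-hot, each posting contributes its weight times one.
    scores = defaultdict(float)
    for term in query_terms:
        for docid, weight in index.get(term, []):
            scores[docid] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(term_at_a_time(['paula', 'deen', 'brother'], toy_index))
# doc 3 wins with 6.44 + 7.42 + 4.09 ≈ 17.95

Real implementations add many refinements (e.g., skipping and early termination), but this is the gist.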
Here, we can manually compute the inner product between the query vector and the document vector:
import numpy as np

# Gather up the dimensions (i.e., the combined dictionary of terms).
terms = set.union(set(bm25_weights.keys()), set(multihot_query_weights.keys()))

# Build aligned vectors; a term absent from a representation gets weight zero.
bm25_vec = np.array([bm25_weights.get(t, 0) for t in terms])
multihot_qvec = np.array([multihot_query_weights.get(t, 0) for t in terms])

np.dot(multihot_qvec, bm25_vec)
The dot product is 17.949487686157227.
In the code snippet above, we're first creating numpy vectors for the document (bm25_vec) and the query (multihot_qvec).
The dimensions of the vectors are the (unique) union of the terms from the query and the document.
Then we use numpy's dot product method to compute the query-document score.
In the approach above, we perform the operations explicitly, but it's a bit roundabout. Alternatively, we can compute the dot product more directly as follows:
sum({term: bm25_weights[term]
     for term in bm25_weights.keys() & multihot_query_weights.keys()}.values())
Here, we are using a dictionary comprehension to generate a new dictionary that only contains the keys (terms) that appear in both the query and the document.
Then we sum up the values (i.e., the term weights) to arrive at the query-document score.
This is exactly the inner product of the multihot_qvec and bm25_vec vectors, since multihot_qvec is a vector of zeros and ones.
You should get exactly the same query-document score.
Let's try searching with the same query using Lucene:
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/lucene-index-msmarco-passage')
hits = searcher.search('what is paula deen\'s brother')

# Print the top 10 hits: rank, docid, and BM25 score.
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')
You'll see that hit 1, docid 7187158, has a score of 17.94950, which is the same as the score we computed by hand (modulo rounding).
Taking stock of where we're at: this section explains how retrieval with BM25 (e.g., with Lucene) is an instantiation of the bi-encoder architecture. Repeat the same exercise above for a few other queries (make them up) to convince yourself.
Let's go back and look at the bi-encoder architecture again.
With BM25, the document encoder generates sparse lexical (or bag-of-words) vectors. We can imagine encoders generating other types of vector representations.
The two major axes of variations are:
- The basis of the representation vectors (sparse vs. dense). They could be sparse vectors (as in the case of BM25) or they could be dense vectors. In the case of dense vectors, the dimensions of the vectors putatively capture some "latent semantic space". Thus, we often contrast sparse lexical vectors with dense semantic vectors.
- Whether the representations are learned (supervised vs. unsupervised). In the case of BM25, there is no "learning" (modulo minor parameter tuning): the BM25 scoring function specifies the weights that each term should receive. BM25 was derived from a probabilistic model of retrieval and has evolved over many decades (see here), but it is not the product of a (supervised) machine-learning approach. Alternatively, document and query representations can be learned from large amounts of data. So, we have a contrast between unsupervised (or heuristic) representations and supervised (or learned) representations.
Indeed, dense retrieval models usually refer to encoders that learn to produce dense semantic representations using, you guessed it, transformer models! Where do the training data come from? Again, you got it: datasets such as MS MARCO.
So the picture for dense retrieval looks like this:
When people say "vector search" or "semantic search" these days, they're referring to the picture above. Often, the outputs of the encoders are called embedding vectors, or just embeddings for short.
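As a minimal sketch of this setup (assuming the sentence-transformers library and an off-the-shelf embedding model; this is not the specific model you'll use later in the onboarding path):

import numpy as np
from sentence_transformers import SentenceTransformer

# Here, a single model plays the role of both query and document encoder.
encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

query_emb = encoder.encode('what is paula deen\'s brother')
doc_emb = encoder.encode('Paula Deen and her brother Earl W. Bubba Hiers are being sued...')

# Same comparison function as before: the inner product, but now between
# dense vectors whose dimensions don't correspond to vocabulary terms.
print(np.dot(query_emb, doc_emb))

Note that the overall architecture is unchanged from the BM25 case; only the encoders (and hence the representations) differ.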
Again, to summarize: We often say that dense retrieval models generate learned dense representations. In contrast, BM25 generates unsupervised (or heuristic) sparse representations. To complete the possibilities, yes, there are learned sparse representations and unsupervised dense representations as well.
We'll save a more complete exploration of this design space for some other time, but it's sketched out in this article (on which this guide is based) if you want to learn more:
Jimmy Lin. A Proposed Conceptual Framework for a Representational Approach to Information Retrieval. arXiv:2110.01529, October 2021.
Okay, that's it for this lesson.
Next, you're going to play with an actual dense retrieval model.
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.
Reproduction Log*
- Results reproduced by @sahel-sh on 2023-07-23 (commit 9619502)
- Results reproduced by @Mofetoluwa on 2023-08-05 (commit 6a2088b)
- Results reproduced by @Andrwyl on 2023-08-26 (commit d9da49e)
- Results reproduced by @yilinjz on 2023-08-30 (commit 42b3549)
- Results reproduced by @UShivani3 on 2023-09-01 (commit 42b3549)
- Results reproduced by @Edward-J-Xu on 2023-09-04 (commit 8063322)
- Results reproduced by @mchlp on 2023-09-07 (commit d8dc5b3)
- Results reproduced by @lucedes27 on 2023-09-10 (commit 54014af)
- Results reproduced by @MojTabaa4 on 2023-09-14 (commit d4a829d)
- Results reproduced by @Kshama on 2023-09-24 (commit 7d18f4b)
- Results reproduced by @MelvinMo on 2023-09-24 (commit 7d18f4b)
- Results reproduced by @ksunisth on 2023-09-27 (commit 142c774)
- Results reproduced by @maizerrr on 2023-10-01 (commit bdb9504)
- Results reproduced by @Stefan824 on 2023-10-04 (commit 4f3da10)
- Results reproduced by @shayanbali on 2023-10-13 (commit f1d623c)
- Results reproduced by @gituserbs on 2023-10-18 (commit f1d623c)
- Results reproduced by @shakibaam on 2023-11-04 (commit 01889cc)
- Results reproduced by @gitHubAndyLee2020 on 2023-11-05 (commit 01889cc)
- Results reproduced by @Melissa1412 on 2023-11-05 (commit acd969f)
- Results reproduced by @salinaria on 2023-11-12 (commit 086e16b)
- Results reproduced by @Seun-Ajayi on 2023-11-13 (commit 086e16b)
- Results reproduced by @oscarbelda86 on 2023-11-13 (commit 086e16b)
- Results reproduced by @aliranjbari on 2023-11-14 (commit c344786)
- Results reproduced by @AndreSlavescu on 2023-11-28 (commit 1219cdb)
- Results reproduced by @tudou0002 on 2023-11-28 (commit 723e06c)
- Results reproduced by @alimt1992 on 2023-11-29 (commit e6700f6)
- Results reproduced by @golnooshasefi on 2023-11-29 (commit 1219cdb)
- Results reproduced by @sueszli on 2023-12-01 (commit 170e271)
- Results reproduced by @kdricci on 2023-12-01 (commit a2049c4)
- Results reproduced by @ljk423 on 2023-12-04 (commit 35002ad)
- Results reproduced by @saharsamr on 2023-12-14 (commit 039c137)
- Results reproduced by @Panizghi on 2023-12-17 (commit 0f5db95)
- Results reproduced by @AreelKhan on 2023-12-22 (commit f75adca)
- Results reproduced by @wu-ming233 on 2023-12-31 (commit 38a571f)
- Results reproduced by @Yuan-Hou on 2024-01-02 (commit 38a571f)
- Results reproduced by @himasheth on 2024-01-10 (commit a6ed27e)
- Results reproduced by @Tanngent on 2024-01-13 (commit 57a00cf)
- Results reproduced by @BeginningGradeMaker on 2024-01-15 (commit d4ea011)
- Results reproduced by @ia03 on 2024-01-18 (commit 05ee8ef)
- Results reproduced by @AlexStan0 on 2024-01-20 (commit 833ee19)
- Results reproduced by @charlie-liuu on 2024-01-23 (commit 87a120e)
- Results reproduced by @dannychn11 on 2024-01-28 (commit 2f7702f)
- Results reproduced by @ru5h16h on 2024-02-20 (commit 758eaaa)
- Results reproduced by @ASChampOmega on 2024-02-23 (commit 442e7e1)
- Results reproduced by @16BitNarwhal on 2024-02-26 (commit 19fcd3b)
- Results reproduced by @HaeriAmin on 2024-02-27 (commit 19fcd3b)
- Results reproduced by @17Melissa on 2024-03-03 (commit a9f295f)
- Results reproduced by @devesh-002 on 2024-03-05 (commit 84c6742)
- Results reproduced by @chloeqxq on 2024-03-07 (commit 19fcd3b)
- Results reproduced by @xpbowler on 2024-03-11 (commit 19fcd3b)
- Results reproduced by @jodyz0203 on 2024-03-12 (commit 280e009)
- Results reproduced by @kxwtan on 2024-03-12 (commit 2bb342a)
- Results reproduced by @syedhuq28 on 2024-03-28 (commit 2bb342a)
- Results reproduced by @khufia on 2024-03-29 (commit 2bb342a)
- Results reproduced by @Lindaaa8 on 2024-03-29 (commit 7dda9f3)
- Results reproduced by @th13nd4n0 on 2024-04-05 (commit df3bc6c)
- Results reproduced by @a68lin on 2024-04-12 (commit 7dda9f3)
- Results reproduced by @DanielKohn1208 on 2024-04-22 (commit 184a212)
- Results reproduced by @emadahmed19 on 2024-04-28 (commit 9db2584)
- Results reproduced by @CheranMahalingam on 2024-05-05 (commit f817186)
- Results reproduced by @billycz8 on 2024-05-08 (commit c945c50)
- Results reproduced by @KenWuqianhao on 2024-05-11 (commit c945c50)
- Results reproduced by @hrouzegar on 2024-05-13 (commit bf68fc5)
- Results reproduced by @Yuv-sue1005 on 2024-05-14 (commit 9df4015)
- Results reproduced by @RohanNankani on 2024-05-17 (commit a91ef1d)
- Results reproduced by @IR3KT4FUNZ on 2024-05-25 (commit a6f4d6)
- Results reproduced by @bilet-13 on 2024-06-01 (commit b0c53f3)
- Results reproduced by @SeanSong25 on 2024-06-05 (commit b7e1da3)
- Results reproduced by @alireza-taban on 2024-06-11 (commit d814290)
- Results reproduced by @hosnahoseini on 2024-06-18 (commit 49d8c43)
- Results reproduced by @Feng-12138 on 2024-06-23 (commit 3b9d541)
- Results reproduced by @FaizanFaisal25 on 2024-07-06 (commit 3b9d541)
- Results reproduced by @XKTZ on 2024-07-13 (commit 544046e)
- Results reproduced by @MehrnazSadeghieh on 2024-07-19 (commit 26a2538)
- Results reproduced by @alireza-nasirian on 2024-07-19 (commit 544046e)
- Results reproduced by @MariaPonomarenko38 on 2024-07-19 (commit d4509dc)
- Results reproduced by @valamuri2020 on 2024-08-03 (commit 3f81997)
- Results reproduced by @daisyyedda on 2024-08-06 (commit d814290)
- Results reproduced by @emily-emily on 2024-08-16 (commit 1bbf7a7)
- Results reproduced by @nicoella on 2024-08-18 (commit e65dd95)
- Results reproduced by @natek-1 on 2024-08-19 (commit e65dd95)
- Results reproduced by @setarehbabajani on 2024-08-31 (commit 0dd5fa7)
- Results reproduced by @anshulsc on 2024-09-07 (commit 2e4fa5d)
- Results reproduced by @r-aya on 2024-09-08 (commit 2e4fa5d)
- Results reproduced by @Amirkia1998 on 2024-09-20 (commit 83537a3)
- Results reproduced by @pjyi2147 on 2024-09-20 (commit f511655)
- Results reproduced by @krishh-p on 2024-09-21 (commit f511655)
- Results reproduced by @andrewxucs on 2024-09-22 (commit dd57b7d)
- Results reproduced by @Hossein-Molaeian on 2024-09-22 (commit bc13901)
- Results reproduced by @AhmedEssam19 on 2024-09-30 (commit 07f04d4)
- Results reproduced by @sisixili on 2024-10-01 (commit 07f04d4)
- Results reproduced by @alirezaJvh on 2024-10-05 (commit 3f76099)
- Results reproduced by @Raghav0005 on 2024-10-09 (commit 7ed8369)
- Results reproduced by @Pxlin-09 on 2024-10-26 (commit af2d3c5)
- Results reproduced by @Samantha-Zhan on 2024-11-17 (commit a95b0e0)
- Results reproduced by @Divyajyoti02 on 2024-11-24 (commit f6f8ecc)
- Results reproduced by @b8zhong on 2024-11-24 (commit 778968f)
- Results reproduced by @vincent-4 on 2024-11-28 (commit 576fdaf)
- Results reproduced by @ShreyasP20 on 2024-11-28 (commit 576fdaf)
- Results reproduced by @nihalmenon on 2024-11-30 (commit 94492de)
- Results reproduced by @zdann15 on 2024-12-04 (commit 5e66e98)
- Results reproduced by @sherloc512 on 2024-12-05 (commit 5e66e98)
- Results reproduced by @Alireza-Zwolf on 2024-12-18 (commit 6cc23d5)
- Results reproduced by @Linsen-gao-457 on 2024-12-20 (commit 10606f0)
- Results reproduced by @robro612 on 2025-01-05 (commit 9268591)
- Results reproduced by @nourj98 on 2025-01-07 (commit 6ac07cc)
- Results reproduced by @mithildamani256 on 2025-01-13 (commit ad41512)
- Results reproduced by @ezafar on 2025-01-15 (commit e1a3386)
- Results reproduced by @ErfanSadraiye on 2025-01-16 (commit cb14c93)