Realising Relevance Ranking
I stumbled across Artem Krylysov's Let's Build a Full-Text Search Engine and thought it would be fun to take the project a step further. This post is an unofficial part 2, where we implement relevance ranking functions such as TF–IDF and BM25.
I started with rough feature parity for the search engine, opting for Python instead of Go. I'm using a different Wikipedia dump (2023-12-27) and unfortunately, catopumas are extinct in mine:
Search: small wild cat
Found 1 results in 6.03ms
1. Wildcat: The wildcat is a species complex comprising two sm...
One of Artem's suggested improvements was to extend the boolean queries to support OR, making the query disjunctive (small OR wild OR cat) rather than conjunctive (small AND wild AND cat). This is quite straightforward to achieve: rather than intersecting the postings lists, we take their union (for those currently interview prepping, here's a similar problem).
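As a rough illustration (not the post's actual code), the union can be built by collecting doc IDs from each term's postings list. This sketch assumes the same idx.inverted structure (term → {doc_id: term frequency}) that appears in the TF–IDF snippet later on; the name search_any is mine:

def search_any(idx: Index, query_terms: list[str]) -> set[int]:
    """Disjunctive (OR) query: a document matches if it contains any query term."""
    candidates: set[int] = set()
    for term in query_terms:
        # idx.inverted maps term -> {doc_id: term_frequency}; iterating the
        # postings list yields the doc IDs containing the term.
        candidates.update(idx.inverted.get(term, {}))
    return candidates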
Using a disjunctive query, we now get 4523 candidate results instead of just 1:
Search: small wild cat
Found 4523 results in 9.05ms
1. Archipelago: An archipelago ( ), sometimes called an island gro...
2. Applet: In computing, an applet is any small application t...
3. Lory (disambiguation): A Lory is a small to medium-sized arboreal parrot....
4. Amaryllis: Amaryllis () is the only genus in the subtribe Ama...
5. Ancyra (planthopper): Ancyra is a small genus of planthoppers of the fam...
...
Recall has improved significantly, but precision still needs work — applets have nothing to do with cats. For those unfamiliar with search terminology, recall is the fraction of relevant results that are returned and precision is the fraction of results that are relevant, or as Daniel Tunkelang pithily puts it:
Strive to tell the whole truth (recall) and nothing but the truth (precision)
There is generally a trade-off between recall and precision but thankfully we can dodge it by using another of Artem's suggestions: "sort results by relevance". As stated in Relevant Search, "Relevance ranking is how you cheat the system. Relevance ranking helps you improve precision for the top N results".
How do we determine a relevance score for a given document? Well, we can start by using TF–IDF (Term Frequency–Inverse Document Frequency). In a nutshell, TF–IDF says that documents are more relevant if they contain more occurrences of the query's terms (high term frequency) and if these terms are rarer in the corpus (low document frequency = high IDF).
def raw_tf_idf_strategy(idx: Index, query_terms: list[str], doc_id: int) -> float:
    score = 0.0
    for term in query_terms:
        postings_list = idx.inverted.get(term, {})
        tf = postings_list.get(doc_id)  # Calculated at index time
        if tf is not None:
            df = len(postings_list)
            idf_squared = 1 / (df * df)  # Query IDF * document IDF
            score += tf * idf_squared
    return score
For each candidate document, a partial score is calculated for every query term it contains, and the partial scores are summed to give the final relevance score used to sort the candidates. Here, we're actually calculating the (cosine) similarity between a query vector and a document vector. Vectors?! Where did they come from? Well, consider the documents represented as multi-dimensional vectors where each dimension corresponds to a unique term in the corpus and has a value signifying the relative importance of that term (= TF–IDF score in our case). This is known as the vector space model.
         query    doc
      -----------------
cat  |  (0.68)  (0.37)
sat  |  (0.34)  (0.00)
 ..  |  ( .. )  ( .. )
mat  |  (0.00)  (0.54)
If a term is absent in the document, its value is 0. Since most documents only contain a fraction of the entire corpus, we end up with vectors containing many 0s or sparse vectors. This is in contrast to the dense vectors used in semantic search, a hot topic right now given the AI summer we're in. Whilst semantic search preserves the context of the original document, our keyword/lexical search does not — we are simply treating the document as a "bag-of-words".
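To make the vector picture concrete, here's a small standalone sketch of cosine similarity between two sparse vectors stored as term → weight dicts, using the illustrative weights from the table above (this isn't the scoring code used in the engine):

import math

def cosine_similarity(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    # Only terms present in both vectors contribute to the dot product
    dot = sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())
    query_length = math.sqrt(sum(w * w for w in query_vec.values()))
    doc_length = math.sqrt(sum(w * w for w in doc_vec.values()))
    if query_length == 0.0 or doc_length == 0.0:
        return 0.0
    return dot / (query_length * doc_length)

query = {"cat": 0.68, "sat": 0.34}
doc = {"cat": 0.37, "mat": 0.54}
print(cosine_similarity(query, doc))  # ≈ 0.51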
Using TF–IDF has improved precision, but our ideal document from earlier ("wildcat") doesn't even crack the top 5:
1. Ocicat: The Ocicat is an all-domestic breed of cat which r...
2. The Unadulterated Cat: The Unadulterated Cat by Terry Pratchett, illustra...
3. Bengal cat: The Bengal cat is a domesticated cat breed created...
4. Australian Mist: The Australian Mist (formerly known as Spotted Mis...
5. Bicolor cat: A bicolor cat or [color]-and-white cat is a cat wi...
...
Let's instead switch to classic similarity scoring from Apache Lucene, the underlying library used by both Elasticsearch and Solr. The classic similarity algorithm builds on TF–IDF, first dampening both TF and IDF because relevance does not increase proportionally with these 2 factors.
tf_dampened = math.sqrt(tf)
idf_dampened = 1 + math.log(n_docs / (df + 1))  # n_docs = total number of documents in the corpus
Next, let's introduce field_norm to account for different document lengths. The idea is that a short article containing "cat" many times is a stronger indicator of relevance than a longer article with the same number of occurrences. As an aside, Lucene lossily stores the field_norm as a single byte, which caused me several hours of frustration when comparing my results to Elasticsearch's...
field_length = idx.docs[doc_id]
field_norm = 1 / math.sqrt(field_length)
idf_dampened_squared = idf_dampened * idf_dampened  # Query IDF * document IDF, as before
score += tf_dampened * idf_dampened_squared * field_norm
We also want to punish documents that are missing query terms. To do this, another new parameter, coord, is thrown into the mix, which acts as a multiplier on the final document score. Finally, query_norm is used to normalise query scores, making them more comparable across different queries and indexes.
coord = n_query_terms_found / len(query_terms)
final_score = score * coord * query_norm
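Before looking at the results, here's a rough end-to-end sketch of how these pieces might combine into a single scoring function. This is my own reconstruction rather than the post's source: it assumes idx.docs maps doc_id to field length as in the snippet above, and it omits query_norm since that is constant for a given query and so doesn't affect the ordering:

import math

def classic_similarity_strategy(idx: Index, query_terms: list[str], doc_id: int) -> float:
    score = 0.0
    n_query_terms_found = 0
    for term in query_terms:
        postings_list = idx.inverted.get(term, {})
        tf = postings_list.get(doc_id)
        if tf is None:
            continue
        n_query_terms_found += 1
        df = len(postings_list)
        tf_dampened = math.sqrt(tf)
        idf_dampened = 1 + math.log(len(idx.docs) / (df + 1))
        idf_dampened_squared = idf_dampened * idf_dampened
        field_norm = 1 / math.sqrt(idx.docs[doc_id])  # idx.docs: doc_id -> field length
        score += tf_dampened * idf_dampened_squared * field_norm
    coord = n_query_terms_found / len(query_terms)
    return score * coord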
Putting this all together, we have our new algorithm (source) and our ideal document is now in 2nd place:
1. [3.72] Lynx (disambiguation): A lynx is a type of wild cat....
2. [3.36] Wildcat: The wildcat is a species complex comprising two sm...
3. [3.03] Bobcat (disambiguation): Bobcat is a species of wild cat in North America....
4. [2.39] Ocicat: The Ocicat is an all-domestic breed of cat which r...
5. [2.15] HMS Lynx: Ten Royal Navy ships have been named HMS Lynx afte...
In case it wasn't obvious from the name, Lucene classic similarity is no longer in use. It was replaced as the default scoring strategy by Okapi BM25 back in Elasticsearch 5 (Lucene 6). BM25 is a probabilistic model ("what is the probability that this document is relevant?") rather than a vector space model. However, it doesn't look too different in practice, with only a couple of changes of note:
- An asymptote is introduced for TF, so increasing it results in diminishing returns. The saturation point is tunable via a new parameter k1 (yes, naming is hard)
- Document lengths are normalised, like with field_norm, but relative to the corpus' average document length. This is also tunable using another tersely-named parameter, b (see the sketch after this list)
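For reference, here's a sketch of the BM25 scoring function under the same assumptions about the Index structure (again my own reconstruction, not the post's source). It uses the non-negative IDF variant adopted by Lucene, and in a real implementation the average document length would be precomputed at index time rather than on every call:

import math

def bm25_strategy(idx: Index, query_terms: list[str], doc_id: int,
                  k1: float = 1.2, b: float = 0.75) -> float:
    doc_length = idx.docs[doc_id]
    avg_doc_length = sum(idx.docs.values()) / len(idx.docs)
    score = 0.0
    for term in query_terms:
        postings_list = idx.inverted.get(term, {})
        tf = postings_list.get(doc_id)
        if tf is None:
            continue
        df = len(postings_list)
        # Rarer terms score higher; the +0.5 terms smooth the estimate and keep it non-negative
        idf = math.log(1 + (len(idx.docs) - df + 0.5) / (df + 0.5))
        # TF saturates as it grows (controlled by k1); b controls how strongly the
        # document length is normalised relative to the corpus average
        tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_length / avg_doc_length))
        score += idf * tf_norm
    return score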
Using BM25 (source) with the default parameters (k1=1.2, b=0.75):
1. [20.85] Lynx (disambiguation): A lynx is a type of wild cat....
2. [18.88] Bobcat (disambiguation): Bobcat is a species of wild cat in North America....
3. [17.76] Ocicat: The Ocicat is an all-domestic breed of cat which r...
4. [16.51] Wildcat: The wildcat is a species complex comprising two sm...
5. [14.71] HMS Lynx: Ten Royal Navy ships have been named HMS Lynx afte...
Wildcat has fallen to 4th! However, by tweaking the parameters and setting k1=0.4...
1. [28.71] Wildcat: The wildcat is a species complex comprising two sm...
2. [28.22] Lynx (disambiguation): A lynx is a type of wild cat....
3. [26.95] Bobcat (disambiguation): Bobcat is a species of wild cat in North America....
4. [26.16] Ocicat: The Ocicat is an all-domestic breed of cat which r...
5. [23.74] HMS Lynx: Ten Royal Navy ships have been named HMS Lynx afte...
We've overfitted our way to 1st place 🥇!
In summary, we started with a simple search engine and introduced relevance ranking using several algorithms, from basic TF–IDF to the more advanced BM25.
The full source code can be found here — I look forward to seeing someone else publish a part 3!