A scoring function takes a document and a query and retwns a numeric score; the rnost relevant documents have the highest scores. In the BM25 ftmction, the score is a linear weighted combination of scores for each of the words that make up the query. Three factors affect the weight of a query tenn:
- First, the frequency with which a query term appears in a document (also known as TF for term frequency). For the query [farming in Kansas], documents that mention "farming" frequently will have higher scores.
- Second, the inverse document frequency of the term, or IDF. The word "in" appears in almost every document, so it has a high document frequency, and thus a low inverse doctunent frequency, and thus it is not as important to the query as "farming" or "Kansas."
- Third, the length of the document. A million-word document will probably mention all the query words, but may not actually be about the query. A short document that mentions all the words is a much better candidate.
The BM25 function takes all three of these into account. We assume we have created an index of the N documents in the corpus so that we can look up TF( qi, dj ), the count of the number of times word qi appears in document dj We also assume a table of document frequency counts, DF(qi) that gives the number of documents that contain the word qi then given a document dj and a query consisting of the words q1:N we have.
where |dj| is the length of document dj in words, and L is the average document length in the corpus:
We have two parameters, k and b, that can be turned by cross-validation; typical values are k = 2.0 and b = 0. 75.
So IDF(qi) is the inverse document frequency of word qi given by
Of course, it would be impractical to apply the BM25 scoring function to every documentin the corpus. Instead, systems create an index ahead of time that lists, for each vocabulary word, the documents that contain the word. This is called the hit list for the word. Then when given a query, we intersect the hit lists of the query words and only score the documents in the intersection.