Information Retrieval : Scoring Functions

Most Information Retrieval systems have abandoned the Boolean model and use models based on the statistics of word counts. We describe the BM25 scoring function, which comes from the Okapi project of Stephen Robertson and Karen Sparck Jones at London's City College, and has been used in search engines such as the open-source Lucene project.
A scoring function takes a document and a query and retwns a numeric score; the rnost relevant documents have the highest scores. In the BM25 ftmction, the score is a linear weighted combination of scores for each of the words that make up the query. Three factors affect the weight of a query tenn:

First, the frequency with which a query term appears in a document (also known as TF for term frequency). For the query [farming in Kansas], documents that mention "farming" frequently will have higher scores.
Second, the inverse document frequency of the term, or IDF. The word "in" appears in almost every document, so it has a high document frequency, and thus a low inverse doctunent frequency, and thus it is not as important to the query as "farming" or "Kansas."
Third, the length of the document. A million-word document will probably mention all the query words, but may not actually be about the query. A short document that mentions all the words is a much better candidate.

The BM25 function takes all three of these into account. We assume we have created an index of the N documents in the corpus so that we can look up TF( q_i, d_j ), the count of the number of times word q_i appears in document d_j We also assume a table of document frequency counts, DF(q_i) that gives the number of documents that contain the word q_i then given a document d_j and a query consisting of the words q_1:N we have.

where |d_j| is the length of document d_j in words, and L is the average document length in the corpus:

We have two parameters, k and b, that can be turned by cross-validation; typical values are k = 2.0 and b = 0. 75.
So IDF(q_i) is the inverse document frequency of word q_i given by

Of course, it would be impractical to apply the BM25 scoring function to every documentin the corpus. Instead, systems create an index ahead of time that lists, for each vocabulary word, the documents that contain the word. This is called the hit list for the word. Then when given a query, we intersect the hit lists of the query words and only score the documents in the intersection.

Information Retrieval : Scoring Functions

www.CodeNirvana.in