
Information Retrieval: System Evaluation

How do we know whether an IR system is performing well?
In recent years the evaluation of Information Retrieval systems and of techniques for indexing, sorting, searching and retrieving information has become increasingly important. This growth in interest is due to two major factors: the growing number of retrieval systems in use and an additional focus on the evaluation methods themselves. The Internet is an example of an information space (infospace) whose text content is growing exponentially, along with the products built to find information of value in it. Information retrieval technologies are the basis for searching information on the Internet. In parallel with this commercial interest, the introduction of large standardized test databases and a forum for yearly analysis via TREC and other conferences has provided a methodology for evaluating the performance of algorithms and systems. There are many reasons to evaluate the effectiveness of an Information Retrieval System:
  • To aid in the selection of a system to procure
  • To monitor and evaluate system effectiveness
  • To evaluate the system to determine improvements
  • To provide inputs to cost-benefit analysis of an information system
  • To determine the effects of changes made to an existing information system.

From an academic perspective, measurements focus on the specific effectiveness of a system and are usually applied to determine the effects of changing a system's algorithms or to compare algorithms across systems. When evaluating systems for commercial use, measurements also focus on availability and reliability.
In an operational system there is less concern over 55% versus 65% precision than over 99% versus 89% availability. For academic purposes, controlled environments can be created that minimize errors in the data. In operational systems, there is no control over the users, and care must be taken to ensure the data collected are meaningful.
We undertake an experiment in which the system is given a set of queries and the result sets are scored with respect to human relevance judgments. Traditionally, two measures have been used in the scoring: recall and precision. We explain them with the help of an example. Imagine that an IR system has returned a result set for a single query, for which we know which documents are and are not relevant, out of a corpus of 100 documents. The document counts in each category are given in the following table:
                   IN RESULT SET    NOT IN RESULT SET
  RELEVANT              30                  20
  NOT RELEVANT          10                  40
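As a rough illustration, the four counts in this table can be tallied from two sets: the documents the system returned and the documents judged relevant. The Python sketch below uses hypothetical document IDs chosen so that the totals match the example; it is not part of any particular IR toolkit.

    # Sketch: deriving the table's counts from a (hypothetical) result set and
    # a set of human relevance judgments over a 100-document corpus.
    corpus_size = 100
    result_set = set(range(1, 41))                        # 40 retrieved document IDs
    relevant = set(range(1, 31)) | set(range(41, 61))     # 50 documents judged relevant

    relevant_retrieved = len(result_set & relevant)       # 30  (relevant, in result set)
    relevant_missed = len(relevant - result_set)          # 20  (relevant, not in result set)
    irrelevant_retrieved = len(result_set - relevant)     # 10  (not relevant, in result set)
    irrelevant_ignored = (corpus_size - relevant_retrieved
                          - relevant_missed - irrelevant_retrieved)  # 40

    print(relevant_retrieved, relevant_missed, irrelevant_retrieved, irrelevant_ignored)
    # 30 20 10 40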

Precision measures the proportion of documents in the result set that are actually relevant. In our example, the precision is 30/(30 + 10) = 0.75. The false positive rate is 1 - 0.75 = 0.25. Recall measures the proportion of all the relevant documents in the collection that are in the result set. In our example, recall is 30/(30 + 20) = 0.60.
The false negative rate is 1 - 0.60 = 0.40. In a very large document collection, such as the World Wide Web, recall is difficult to compute, because there is no easy way to examine every page on the Web for relevance.
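From those counts, precision, recall, and the two error rates quoted above follow directly. A minimal sketch, continuing the hypothetical counts from the table:

    # Sketch: precision, recall, and the error rates used in the text,
    # computed from the counts in the table above.
    def precision(relevant_retrieved, irrelevant_retrieved):
        return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)

    def recall(relevant_retrieved, relevant_missed):
        return relevant_retrieved / (relevant_retrieved + relevant_missed)

    p = precision(30, 10)        # 30 / (30 + 10) = 0.75
    r = recall(30, 20)           # 30 / (30 + 20) = 0.60
    false_positive_rate = 1 - p  # 0.25, as defined in the text
    false_negative_rate = 1 - r  # 0.40, as defined in the text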
All we can do is either estimate recall by sampling or ignore recall completely and just judge precision. In the case of a Web search engine, there may be thousands of documents in the result set, so it makes more sense to measure precision for several different result-set sizes, such as "P@10" (precision in the top 10 results) or "P@50", rather than to estimate precision over the entire result set.
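A sketch of precision-at-k, assuming a ranked result list and a set of relevance judgments; the document IDs here are made up for illustration:

    # Sketch: P@k - the fraction of the top k ranked results judged relevant.
    def precision_at_k(ranked_results, judged_relevant, k):
        top_k = ranked_results[:k]
        return sum(1 for doc in top_k if doc in judged_relevant) / k

    ranked = ["d3", "d17", "d4", "d9", "d21", "d2", "d30", "d11", "d5", "d8"]
    judged_relevant = {"d3", "d4", "d9", "d2", "d11", "d8", "d40"}
    print(precision_at_k(ranked, judged_relevant, 10))  # 6 of the top 10 are relevant -> 0.6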
It is possible to trade off precision against recall by varying the size of the result set returned. In the extreme, a system that returns every document in the collection is guaranteed a recall of 100%, but will have low precision. Alternatively, a system could return a single document and have low recall, but a decent chance at 100% precision. A summary of both measures is the F1 score, a single number that is the harmonic mean of precision and recall, 2PR/(P + R).
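As a sketch, the F1 score for the example above works out to roughly 0.67:

    # Sketch: harmonic mean of precision and recall (the F1 score).
    def f1_score(p, r):
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    print(f1_score(0.75, 0.60))  # 2 * 0.75 * 0.60 / 1.35 ≈ 0.667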


