Text Mining vs Text Retrieval

It is important to differentiate between Text Mining (TM) and Text Retrieval (or information retrieval, as it is more widely known). The goal of information retrieval is to help users find documents that satisfy their information needs (Baeza-Yates and Ribeiro-Neto, 1999). The standard procedure is analogous to looking for needles in a needle stack: the problem isn’t so much that the desired information is not known, but rather that it coexists with many other valid pieces of information (Hearst, 1999). The outcome of the information retrieval process is a set of documents.
The goal of TM is to discover or derive new information from text, to find patterns across datasets, and/or to separate signal from noise. The fact that an information retrieval system can return a document that contains the information a user requested does not imply that a new discovery has been made: the information had to have already been known to the author of the text; otherwise the author could not have written it down.

Figure IR.06 Key Steps in Information Retrieval

What is the principal computer specialty for processing documents and text? Many experts would respond “Information Retrieval.” The task of information retrieval, or IR as its practitioners call it, is to retrieve relevant documents in response to a query. Figure IR.06 illustrates the objectives of information retrieval, where:
  • a general description is given of the query,
  • the document collection is searched, and
  • subsets of relevant documents are returned.
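
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a bag-of-words representation and cosine similarity as the relevance measure; the function names (to_vector, cosine, retrieve) and the sample collection are hypothetical and do not come from any particular IR system.

```python
import math
from collections import Counter

def to_vector(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, collection, top_k=2):
    """Search the collection and return the top_k most similar documents.

    Documents with zero similarity to the query are dropped.
    """
    q = to_vector(query)
    scored = [(cosine(q, to_vector(doc)), doc) for doc in collection]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Hypothetical toy collection, for illustration only.
collection = [
    "stock markets fell sharply today",
    "the team won the football match",
    "football fans cheered the winning team",
]
results = retrieve("football team", collection)
```

Here the query is the general description, the loop over the collection is the search, and the returned list is the subset of relevant documents. Real systems typically use weighting schemes and index structures rather than a linear scan.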


Figure IR.07 Key Steps in Predictive Text Mining

These seem like objectives far afield from predictive text mining. For prediction, the objectives are to:
  • examine a collection of documents,
  • learn decision criteria for classification, and
  • apply these criteria to new documents.

The goals of predictive text mining are illustrated in Figure IR.07. These goals do not appear to match the goals of information retrieval.
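
To make the examine, learn, and apply steps concrete, here is a rough sketch using a multinomial naive Bayes classifier with add-one smoothing. The choice of naive Bayes is an assumption made purely for illustration, since the text does not name a specific learning method; the function names and the training data are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Examine the collection and learn per-class word counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Apply the learned criteria: pick the class with the highest log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing so a word unseen in this class does not zero out the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled collection, for illustration only.
training = [
    ("the team won the football match", "sports"),
    ("football fans cheered the winning team", "sports"),
    ("stock markets fell sharply today", "finance"),
    ("investors sold stock as markets dropped", "finance"),
]
model = train(training)
predicted = classify("markets and stock prices", model)
```

Note that nothing is retrieved here: the decision criteria (the word counts) are learned once from the whole collection and then applied to each new document.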

The fundamental technique of information retrieval is measuring similarity. A query is examined and transformed into a vector of values to be compared with the measurements taken over the stored documents. With this approach, the prediction problem is not solved directly by finding patterns in the collection of documents. Rather, similar documents are retrieved, and only then do we measure their properties. Because we are interested in classification, we count how often each class label occurs among the retrieved documents to decide which label should be assigned to a new, unlabeled document.
We can now see that our objectives can be posed in the form of an information retrieval model, where documents are retrieved that are relevant to a query. Our query will be a new document. Like all documents, the query will be posed in terms of a word vector model. The query will be matched to all the stored documents, and a subset of documents will be retrieved. To make predictions, we add another step, as illustrated in Figure IR.08: we must examine the properties of the retrieved documents, typically by simple criteria such as their labels.
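
The retrieve-then-vote procedure described above might look like the following sketch. It assumes term-frequency word vectors, cosine similarity, and a simple majority vote over the k most similar stored documents; the value of k, the function names, and the sample data are hypothetical choices for illustration.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def predict_label(new_doc, labeled_docs, k=3):
    """Use new_doc as the query, retrieve the k most similar stored
    documents, and return the most frequent label among them."""
    q = to_vector(new_doc)
    ranked = sorted(labeled_docs,
                    key=lambda item: cosine(q, to_vector(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled collection, for illustration only.
stored = [
    ("the team won the football match", "sports"),
    ("football fans cheered the winning team", "sports"),
    ("the coach praised the football team", "sports"),
    ("stock markets fell sharply today", "finance"),
    ("investors sold stock as markets dropped", "finance"),
]
label = predict_label("the football team lost the match", stored)
```

The only addition to plain retrieval is the final counting step: the labels of the retrieved documents are tallied, and the majority label is assigned to the new document.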


Figure IR.08 Predicting from Retrieved Documents

Because information retrieval has been studied as a separate endeavor, it would be wise to examine the details of this approach from both the IR perspective and the prediction perspective. This will allow us to make the best choices in applying a prediction method that measures document similarity. Although we have posed the prediction problem as a variant of information retrieval, prediction shares some, but not all, of the full range of information retrieval techniques, so we want to be aware of the relationship between the two.


