How to Mine the Data

Let us see now what the process of "mining" the data means. Schematically, we can identify three characteristic steps of the data mining process:

Exploring data, consisting of data "cleansing", data transformation, dimensionality reduction, feature subset selection, etc.;
Building and validating the model, which means analysing several candidate models and choosing the one with the best forecasting performance (a competitive evaluation of models);
Applying the model to new data to produce correct forecasts/estimates for the problems investigated.
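The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the toy records, the two candidate models (a constant-mean predictor and a least-squares line), the hold-out split, and all function names are hypothetical choices made for the example and do not come from the original text.

```python
from statistics import mean

# Hypothetical toy records (x, y); None marks a missing value.
raw = [(1, 2.1), (2, 3.9), (None, 5.0), (3, 6.2),
       (4, 7.8), (5, None), (6, 12.1), (7, 13.8)]

def clean(rows):
    """Step 1 (exploring): drop records that contain a missing value."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def fit_mean(train):
    """Candidate A: always predict the mean of the training targets."""
    m = mean(y for _, y in train)
    return lambda x: m

def fit_line(train):
    """Candidate B: least-squares line y = a + b*x (needs >= 2 distinct x)."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, rows):
    """Mean squared error of a fitted model on a set of records."""
    return mean((model(x) - y) ** 2 for x, y in rows)

# Step 1: explore / clean the data.
data = clean(raw)

# Step 2: build candidate models on a training part and compare them
# on a held-out validation part (competitive evaluation).
train, valid = data[:4], data[4:]
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
scores = {name: mse(model, valid) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

# Step 3: apply the selected model to a new, unseen input.
forecast = candidates[best_name](8)
print(best_name, scores, forecast)
```

On this toy data the line should beat the constant predictor on the validation records, so it is the model that gets applied to the new input. With real data the split would normally be randomised or cross-validated rather than taken in order.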

The core of the data mining process consists of building a particular model to represent the dataset that is ‘mined’, in order to solve concrete, real-life problems. We will briefly review some of the most important problems that call for data mining methods, that is, the methods underlying the construction of the model. To delimit the area of application more clearly, we list below two data mining goals (Velicov, 2000):

Predictive objectives. In this class of methods (also called asymmetrical, supervised or direct) the aim is to describe one or more of the variables in relation to all the others. This is done by looking for rules of classification or prediction based on the data. These rules help to predict or classify the future value of one or more response (target) variables from the values of the explanatory (input) variables. The main methods of this type are those developed in the field of machine learning, such as neural networks (multilayer perceptrons) and decision trees, but also classic statistical models such as linear and logistic regression.
Descriptive objectives. The main aim of this class of methods (also called symmetrical, unsupervised or indirect) is to describe groups of data in a succinct way. This can concern the observations, which are classified into groups not known beforehand (cluster analysis, Kohonen maps), as well as the variables, which are connected to one another by links not known beforehand (association methods, log-linear models, graphical models). In descriptive methods there are no hypotheses of causality among the available variables.
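The contrast between the two goals can be illustrated with a small sketch in plain Python. The data, the nearest-neighbour rule used for the predictive case, and the one-dimensional two-means grouping used for the descriptive case are all assumptions chosen for brevity; the text itself names neural networks, decision trees, regression and cluster analysis, which would follow the same pattern with more machinery.

```python
# Predictive (supervised): every training observation carries a known label,
# and the goal is to predict the label of a new observation.
labeled = [(1.0, "low"), (1.2, "low"), (0.8, "low"),
           (5.0, "high"), (5.3, "high"), (4.9, "high")]

def predict_1nn(x, examples):
    """Return the label of the labeled example whose input is closest to x."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

# Descriptive (unsupervised): no labels are given, and the goal is to find
# groups that were not known beforehand.
def two_means(values, iterations=10):
    """Split 1-D values into two groups around two iteratively updated centres.

    Assumes the values contain at least two distinct numbers.
    """
    c0, c1 = min(values), max(values)
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted(g0), sorted(g1)

unlabeled = [x for x, _ in labeled]   # same inputs, labels thrown away
print(predict_1nn(1.1, labeled))      # uses the labels
print(two_means(unlabeled))           # discovers the groups without labels
```

The predictive function cannot work without the target labels, whereas the descriptive one never sees them and only summarises the structure of the inputs.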

Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer that carries both “data” and “mining” became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Figure Dm.02 Data mining as a step in the process of knowledge discovery
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in Figure Dm.02 and consists of an iterative sequence of the following steps:

Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be combined)
Data selection (where data relevant to the analysis task are retrieved from the database)
Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
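The seven steps listed above can be read as a chain of small functions, each consuming the output of the previous one. The sketch below is a hypothetical illustration only: the two toy "data sources", the choice of counting item pairs that are bought together as the mining step, the support threshold as the interestingness measure, and every function name are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical data sources holding purchase records.
source_a = [
    {"region": "north", "items": ["bread", "milk"]},
    {"region": "north", "items": ["bread", "milk", "eggs"]},
    {"region": "south", "items": ["beer", "chips"]},
    {"region": "north", "items": None},                       # noisy record
]
source_b = [
    {"region": "north", "items": ["milk", "eggs"]},
    {"region": "north", "items": ["bread", "milk", "bread"]}, # repeated item
]

def clean(records):
    """Data cleaning: drop records with a missing or empty item list."""
    return [r for r in records if r.get("items")]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, region):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["region"] == region]

def transform(records):
    """Data transformation: turn each record into a set of distinct items."""
    return [frozenset(r["items"]) for r in records]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Knowledge presentation: readable lines, strongest pattern first."""
    ranked = sorted(patterns.items(), key=lambda kv: -kv[1])
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in ranked]

baskets = transform(select(integrate(clean(source_a), clean(source_b)), "north"))
report = present(evaluate(mine(baskets), len(baskets), min_support=0.5))
print("\n".join(report))
# bread + milk: support 0.75
# eggs + milk: support 0.50
```

In practice the sequence is iterative, as the text notes: an unsatisfying report would typically send the analyst back to an earlier step, for example to select different data or to change the transformation.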
