Data Mining

Before attempting a definition of data mining (sometimes called data or knowledge discovery), let us emphasize some aspects of its genesis. Data mining has three generic roots, from which it borrowed its techniques and terminology.

Statistics: its oldest root, without which data mining would not have existed. Classical statistics brings well-defined techniques that we can summarize in what is commonly known as Exploratory Data Analysis (EDA), used to identify systematic relationships between different variables when there is insufficient information about their nature. Among the classical EDA techniques used in data mining (DM), we can mention:
Computational methods: descriptive statistics (distributions, classical statistical parameters such as mean, median and standard deviation, correlation, multiple frequency tables), multivariate exploratory techniques (cluster analysis, factor analysis, principal components and classification analysis, canonical analysis, discriminant analysis, classification trees, correspondence analysis), and advanced linear/non-linear models;
Data visualization, which aims to represent information in a visual form and can be regarded as one of the most powerful and, at the same time, most attractive methods of data exploration. Among the most common visualization techniques, we can find: histograms and charts of all kinds (column, cylinder, cone, pyramid, pie, bar, etc.), box plots, scatter plots, contour plots, matrix plots, icon plots, etc.
Artificial Intelligence (AI), unlike statistics, is built on heuristics. AI thus contributes to the development of data mining with information processing techniques based on a model of human reasoning. Closely related to AI, Machine Learning (ML) is an extremely important scientific discipline in the development of data mining, using techniques that allow the computer to learn through "training". In this context, we can also consider Natural Computing (NC) as a solid additional root for data mining.
Database systems (DBS) are considered the third root of data mining, providing information to be "mined" using the methods mentioned above.
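
To make the computational side of EDA more concrete, here is a minimal sketch of the descriptive statistics named under the statistics root (mean, median, standard deviation, correlation). The data and the variable names (ad_spend, sales) are hypothetical and serve only as an illustration; the Pearson coefficient is written out by hand so that the sketch depends only on the Python standard library.

```python
import math
import statistics

# Hypothetical sample: monthly advertising spend and sales (arbitrary units).
ad_spend = [10, 12, 15, 18, 20, 25]
sales = [40, 44, 52, 58, 63, 75]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Classical statistical parameters of one variable, plus the
# correlation between the two variables.
summary = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),  # sample standard deviation
    "correlation": pearson(ad_spend, sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A correlation close to 1 on such data would suggest a systematic relationship between the two variables, which is the kind of hint EDA is meant to provide before any deeper modeling.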

The need to "mine" data can be summarized in the light of important real-life areas that require such investigative techniques:

Economics (business and finance): a huge amount of data has already been collected in areas such as Web data, e-commerce, super/hypermarket data, and financial and banking transactions, ready to be analyzed in order to make optimal decisions;
Health care: there are currently many different databases in the health care domain (medical and pharmaceutical) that have been only partially analyzed, mostly with specifically medical means, and that contain a large amount of information not yet sufficiently explored;
Scientific research: there are huge databases gathered over the years in various fields (astronomy, meteorology, biology, linguistics, etc.), which cannot be explored with traditional means.

So, by data mining we mean (equivalent approaches):

The automatic search of patterns in huge databases, using computational techniques from statistics, machine learning and pattern recognition;
The non-trivial extraction of implicit, previously unknown and potentially useful information from data;
The science of extracting useful information from large datasets or databases;
The automatic or semi-automatic exploration and analysis of large quantities of data, in order to discover meaningful patterns;
The automatic process of discovering information: the identification of patterns and relationships "hidden" in data.

We saw above what data mining means. In this context, it is interesting to see what data mining is not. We present below four concrete situations that illustrate what data mining is not, compared with what it could be:

What is not data mining?
Searching for particular information on the Internet (e.g., about cooking on Google).
What data mining could be?
Grouping together similar information in a certain context (e.g., about French cuisine, Italian cuisine, etc., found on Google).
What is not data mining?
A physician consulting a medical register to analyze the record of a patient with a certain disease.
What data mining could be?
Medical researchers finding a way of grouping patients with the same disease, based on a certain number of specific symptoms.
What is not data mining?
Looking up spa resorts in a list of place names.
What data mining could be?
Grouping together spa resorts that are more relevant for curing certain diseases (gastrointestinal, urological, etc.).
What is not data mining?
The analysis of figures in a financial report of a trading company.
What data mining could be?
Using the trading company's sales database to identify the customers' main profiles.

As can be seen from the above examples, we cannot equate a particular search for an individual object (of any kind) with data mining research. In the latter case, the research does not seek individualities, but sets of individualities which, in one way or another, can be grouped by certain criteria. Metaphorically speaking, the difference between a simple search and a data mining process is that between looking for a specific tree and identifying a forest (hence the well-known proverb “Can’t see the forest for the trees”, used when the research is not sufficiently relaxed with regard to constraints).
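
The contrast between looking up one object and grouping similar objects can be sketched in a few lines. The example below is only an illustration under stated assumptions: the patient records, the Jaccard similarity measure, the 0.5 threshold, and the helper names jaccard and group_similar are all hypothetical choices, and the greedy grouping is a deliberately simple stand-in for a real clustering method.

```python
# Hypothetical patient records: id -> set of observed symptoms.
patients = {
    "p1": {"fever", "cough", "fatigue"},
    "p2": {"fever", "cough"},
    "p3": {"rash", "itching"},
    "p4": {"rash", "itching", "swelling"},
    "p5": {"cough", "fatigue", "fever"},
}

def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def group_similar(records, threshold=0.5):
    """Greedy grouping: add each record to the first group whose first
    member is at least `threshold` similar, otherwise start a new group."""
    groups = []
    for pid, symptoms in records.items():
        for group in groups:
            representative = records[group[0]]
            if jaccard(symptoms, representative) >= threshold:
                group.append(pid)
                break
        else:
            # No existing group was similar enough.
            groups.append([pid])
    return groups

# Simple search: retrieve one known record (the "specific tree").
single_record = patients["p3"]

# Grouping: discover sets of similar records (the "forest").
groups = group_similar(patients)

print(single_record)
print(groups)
```

The lookup answers a question about one individual whose identity is already known, whereas the grouping produces sets of individuals that nobody specified in advance, which is the distinction the four situations above are meant to convey.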
