
Analysis: Cluster, Outlier, Evolution

Cluster Analysis

Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
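To make the idea concrete, the sketch below groups hypothetical 2-D points (for instance, customer locations) with a minimal k-means procedure, which iteratively assigns each point to its nearest centroid and then recomputes the centroids. The data, the choice of k = 3, and the helper names are illustrative assumptions, not part of the text above.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means sketch: group 2-D points so that members of a cluster
    are close to their own centroid (high intra-cluster similarity) and far
    from other centroids (low inter-cluster similarity)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # pick k initial centers
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:           # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customer locations (x, y) forming three loose groups.
data = [(1, 1), (1.2, 0.8), (0.9, 1.1),
        (5, 5), (5.1, 4.9), (4.8, 5.2),
        (9, 1), (9.2, 0.9), (8.9, 1.2)]
centers, groups = kmeans(data, k=3)
print(centers)
```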

Outlier Analysis

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier mining. Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
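As a minimal sketch of the statistical approach mentioned above, the snippet below flags values that lie far from the mean, assuming the data roughly follow a normal distribution; the transaction amounts and the 2.5-standard-deviation cutoff are illustrative choices, not from the text.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.5):
    """Flag values whose z-score exceeds the threshold, i.e. values that lie
    more than `threshold` standard deviations from the mean. A cutoff of
    roughly 2.5 to 3 is a common rule of thumb."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction amounts; 9800 is the injected anomaly.
amounts = [52, 48, 55, 60, 47, 51, 49, 53, 9800, 50, 46, 54]
print(zscore_outliers(amounts))   # -> [9800]
```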

Evolution Analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
Example: Suppose that you have the major stock market (time-series) data of the last several years available from the New York Stock Exchange and you would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to your decision making regarding stock investments.
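A minimal illustration of trend detection in time-series data: the sketch below smooths a hypothetical price series with a simple moving average and reports whether the smoothed series rises over the period. The prices and the window size are made-up values, not real market data.

```python
def moving_average(series, window=5):
    """Smooth a time series with a simple moving average so that the
    longer-term trend stands out from day-to-day fluctuation."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical daily closing prices for one stock.
prices = [20.1, 20.4, 19.8, 20.6, 21.0, 21.3, 20.9, 21.8, 22.1, 22.6]
smoothed = moving_average(prices, window=3)

# A crude evolution "regularity": does the smoothed series rise over the period?
trend = "upward" if smoothed[-1] > smoothed[0] else "flat or downward"
print(smoothed)
print("Detected trend:", trend)
```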
Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. Figure Dm.07 shows a 2-D plot of customers with respect to customer locations in a city. Three clusters of data points are evident.


Figure Dm.07. A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster “center” is marked with a “+”.




Data Mining Tools

The analytical techniques used in data mining are often well-known mathematical algorithms and techniques. What is new is the application of those techniques to general business problems, made possible by the increased availability of data and by inexpensive storage and processing power. Also, the use of graphical interfaces has led to tools that business experts can easily use.
In addition to using a particular data mining tool, internal auditors can choose from a variety of data mining techniques. The most commonly used techniques include artificial neural networks, decision trees, rule induction, genetic algorithms and the nearest-neighbor method. Each of these techniques analyzes data in different ways:
  • Artificial neural networks: Nonlinear predictive models that learn through training and resemble biological neural networks in structure.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
  • Rule induction: The extraction of useful if-then rules from databases based on statistical significance.
  • Genetic algorithms: Optimization techniques based on the concepts of genetic combination, mutation, and natural selection.
  • Nearest neighbor: A classification technique that classifies each record based on the records most similar to it in a historical database.


Each of these approaches brings different advantages and disadvantages that need to be considered prior to use. Neural networks, which are difficult to implement, require all input and resultant output to be expressed numerically, and therefore need some form of interpretation depending on the nature of the data-mining exercise. The decision tree technique is the most commonly used methodology, because it is simple and straightforward to implement. Finally, the nearest-neighbor method relies more on linking similar items and therefore works better for extrapolation than for predictive enquiries.
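For illustration, the sketch below implements the nearest-neighbor idea in its simplest form: a record is classified by a majority vote among the k most similar records in a labelled historical database. The feature choice (amount, hour of day), the labels, and k = 3 are hypothetical assumptions made for this example.

```python
import math
from collections import Counter

def knn_classify(record, history, k=3):
    """Classify `record` by majority vote among the k most similar
    (closest) records in a labelled historical database."""
    neighbours = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (amount, hour_of_day) -> label.
history = [((20.0, 14), "normal"), ((35.0, 10), "normal"),
           ((25.0, 16), "normal"), ((900.0, 3), "fraud"),
           ((850.0, 2), "fraud"), ((30.0, 11), "normal")]

print(knn_classify((880.0, 4), history))   # -> "fraud"
print(knn_classify((28.0, 13), history))   # -> "normal"
```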




Integration Testing and System Testing

Integration Testing

Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. In this approach, many unit-tested modules are combined into subsystems, which are then tested. The goal here is to see whether the modules can be integrated properly. Hence, the emphasis is on testing the interfaces between modules. This testing activity can be considered testing of the design.
A major problem that arises during integration testing is localising errors. There are complex interactions between the system components and, when an anomalous output is discovered, you may find it hard to identify where the error occurred. To make it easier to locate errors, you should always use an incremental approach to system integration and testing. Initially, you should integrate a minimal system configuration and test this system. You then add components to this minimal configuration and test after each added increment.

Figure st.02 Incremental Integration Testing

In the example shown in Figure st.02, A, B, C and D are components and T1 to T5 are related sets of tests of the features incorporated in the system. T1, T2 and T3 are first run on a system composed of component A and component B (the minimal system). If these reveal defects, they are corrected. Component C is integrated and T1, T2 and T3 are repeated to ensure that there have not been unexpected interactions with A and B. If problems arise in these tests, this probably means that they are due to interactions with the new component. The source of the problem is localised, thus simplifying defect location and repair. Test set T4 is also run on the system. Finally, component D is integrated and tested using existing and new tests (T5).
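A minimal sketch of this incremental strategy using Python's unittest: components A, B and C are stood in for by trivial functions invented purely for illustration, the first test class plays the role of the tests run on the minimal A+B system, and a second class re-runs those tests (by inheritance) and adds a new test after C is integrated.

```python
import unittest

# Illustrative stand-ins for components A, B and C from Figure st.02.
def component_a(x):
    """Component A: normalises raw input."""
    return x.strip().lower()

def component_b(text):
    """Component B: tokenises the normalised text."""
    return text.split()

def component_c(tokens):
    """Component C: counts tokens (added in the second increment)."""
    return len(tokens)

class MinimalSystemTests(unittest.TestCase):
    """T1-T3: run against the minimal system built from A and B."""

    def test_t1_interface_a_to_b(self):
        self.assertEqual(component_b(component_a("  Hello World ")),
                         ["hello", "world"])

    def test_t2_empty_input(self):
        self.assertEqual(component_b(component_a("   ")), [])

    def test_t3_single_token(self):
        self.assertEqual(component_b(component_a("Data")), ["data"])

class SecondIncrementTests(MinimalSystemTests):
    """After integrating C, T1-T3 are re-run (inherited) plus new test T4."""

    def test_t4_interface_b_to_c(self):
        tokens = component_b(component_a("one two three"))
        self.assertEqual(component_c(tokens), 3)

if __name__ == "__main__":
    unittest.main()
```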

System Testing

Here the entire software system is tested. The reference document for this process is the requirements document, and the goal is to see if the software meets its requirements. This is often a large exercise, which for large projects may last many weeks or months. This is essentially a validation exercise, and in many situations it is the only validation activity. The testing process is concerned with finding errors that result from unanticipated interactions between subsystems and system components. It is also concerned with validating that the system meets its functional and non-functional requirements. There are essentially three main kinds of system testing:
  • Alpha Testing. Alpha testing refers to the system testing carried out by the test team within the development organization. The alpha test is conducted at the developer’s site by the customer, under the project team’s guidance. In this test, users exercise the software on the development platform and point out errors for correction. However, because only a few users run it on the development platform, the alpha test has a limited ability to expose errors. Alpha tests are conducted in a controlled environment and are a simulation of real-life usage. Once the alpha test is complete, the software product is ready for transition to the customer site for implementation and deployment.
  • Beta Testing. Beta testing is the system testing performed by a selected group of friendly customers. If the system is complex, the software is not taken directly into live use. Instead, it is installed and all users are asked to use the software in testing mode; this is not live usage. This is called the beta test. In this test, end users record their observations, mistakes, errors, and so on and report them periodically. In a beta test, a user may suggest a modification, a major change, or a deviation. The development team has to examine each proposed change and put it into the change management system to ensure a smooth transition from the newly developed software to a revised, better version. It is standard practice to put all such changes into subsequent version releases.
  • Acceptance Testing. Acceptance testing is the system testing performed by the customer to determine whether to accept or reject the delivery of the system. When customer software is built for one customer, a series of acceptance tests are conducted to enable the customer to validate all requirements. Conducted by the end-user rather than the software engineers, an acceptance test can range from an informal ‘test drive’ to a planned and systematically executed series of tests. In fact, acceptance testing can be conducted over a period of weeks or months, thereby uncovering cumulative errors that might degrade the system over time.


How to Mine the Data

Let us now see what the process of "mining" the data means. Schematically, we can identify three characteristic steps of the data mining process (a small code sketch of these steps follows the list):
  • Exploring data, consisting of data "cleansing", data transformation, dimensionality reduction, feature subset selection, etc.;
  • Building the model and validating it, referring to the analysis of various candidate models and choosing the one with the best forecasting performance (competitive evaluation of models);
  • Applying the model to new data to produce correct forecasts/estimates for the problems investigated.
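A compact sketch of these three steps under simple assumptions: the raw records, the least-squares line used as the model, and the train/validation split are all illustrative choices, not prescribed by the text.

```python
from statistics import mean

# Step 1 - explore: clean the raw records (drop missing values, transform, etc.).
raw = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, 12.1), (7, 14.2)]
clean = [(x, y) for x, y in raw if y is not None]

# Step 2 - build and validate: fit a simple least-squares line on a training
# split and check its error on a held-out validation split.
train, valid = clean[:-2], clean[-2:]
xs, ys = zip(*train)
x_bar, y_bar = mean(xs), mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
val_error = mean(abs((slope * x + intercept) - y) for x, y in valid)
print("validation mean absolute error:", round(val_error, 2))

# Step 3 - apply: use the chosen model to forecast outcomes for new inputs.
for new_x in (8, 9):
    print(new_x, "->", round(slope * new_x + intercept, 2))
```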
The core process of data mining consists of building a particular model to represent the dataset that is ‘mined’ in order to solve concrete real-life problems. We will briefly review some of the most important issues that require the application of data mining methods, the methods underlying the construction of the model. Let us list below two data mining goals to distinguish its area of application more clearly (Velicov, 2000):
  • Predictive objectives. In this class of methods (also called asymmetrical, supervised or direct), the aim is to describe one or more of the variables in relation to all the others. This is done by looking for rules of classification or prediction based on the data. These rules help predict or classify the future result of one or more response or target variables in relation to what happens to the explanatory or input variables. The main methods of this type are those developed in the field of machine learning, such as neural networks (multilayer perceptrons) and decision trees, but also classic statistical models such as linear and logistic regression models.
  • Descriptive objectives. The main objective of this class of methods (also called symmetrical, unsupervised or indirect) is to describe groups of data in a succinct way. This can concern both the observations, which are classified into groups not known beforehand (cluster analysis, Kohonen maps), and the variables, which are connected among themselves according to links unknown beforehand (association methods, log-linear models, graphical models). In descriptive methods there are no hypotheses of causality among the available variables.
Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer that carries both “data” and “mining” became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Figure Dm.02 Data mining as a step in the process of knowledge discovery

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in Figure Dm.02 and consists of an iterative sequence of the following steps:
  • Data cleaning (to remove noise and inconsistent data)
  • Data integration (where multiple data sources may be combined)
  • Data selection (where data relevant to the analysis task are retrieved from the database)
  • Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
  • Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
  • Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
  • Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
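The sketch below mirrors this sequence with one small placeholder function per step; the record layout, the trivial "big spender" pattern, and the cutoff value are invented for illustration and stand in for real mining methods.

```python
def clean(records):            # data cleaning: drop noisy/inconsistent rows
    return [r for r in records if r.get("amount") is not None]

def integrate(*sources):       # data integration: merge multiple sources
    return [r for src in sources for r in src]

def select(records, fields):   # data selection: keep task-relevant attributes
    return [{f: r[f] for f in fields} for r in records]

def transform(records):        # data transformation: aggregate per customer
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, cutoff=100):  # data mining: extract a (toy) pattern
    return {c for c, total in totals.items() if total >= cutoff}

def evaluate(patterns):        # pattern evaluation: keep interesting patterns
    return patterns            # placeholder: everything passes

def present(patterns):         # knowledge presentation
    print("High-value customers:", sorted(patterns))

# Hypothetical records from two sources.
store_a = [{"customer": "ann", "amount": 80}, {"customer": "bob", "amount": None}]
store_b = [{"customer": "ann", "amount": 30}, {"customer": "bob", "amount": 40}]

data = integrate(clean(store_a), clean(store_b))   # steps 1-2
data = select(data, ["customer", "amount"])        # step 3
totals = transform(data)                           # step 4
patterns = evaluate(mine(totals))                  # steps 5-6
present(patterns)                                  # step 7
```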


