Data Classification

Classification is a type of data analysis that predicts the class labels of samples. A wide variety of classification techniques have been proposed in fields such as machine learning, expert systems and statistics. Normally, a classification model is first trained on a historical dataset (i.e., the training set) whose class labels are already known. The trained classifier is then applied to predict the class labels of new samples. The process of classification is based on four fundamental components:

Class, the dependent variable of the model, which is a categorical variable representing the 'label' put on the object after its classification. Examples of such classes are: presence of myocardial infarction, customer loyalty, class of stars (galaxies), class of an earthquake (hurricane), etc.
Predictors, the independent variables of the model, represented by the characteristics (attributes) of the data to be classified, on which the classification is based. Examples of such predictors are: smoking, alcohol consumption, blood pressure, frequency of purchase, marital status, characteristics of (satellite) images, specific geological records, wind speed and direction, season, location of phenomenon occurrence, etc.
Training Dataset, which is the set of data containing values for the two previous components, and is used for ‘training’ the model to recognize the appropriate class, based on the available predictors.
Testing Dataset, containing new data that will be classified by the (classifier) model constructed above, so that the classification accuracy (model performance) can be evaluated.

The terminology of the classification process includes the following terms:
- The dataset of records/tuples/vectors/instances/objects/samples forming the training set;
- Each record/tuple/vector/instance/object/sample contains a set of attributes (i.e., components/features) of which one is the class (label);
- The classification model (the classifier) which, in mathematical terms, is a function whose variables (arguments) are the values of the attributes (predictive/independent), and its value is the corresponding class;
- The testing dataset, containing data of the same nature as the training dataset, on which the model’s accuracy is tested.

The purpose of supervised learning is to predict the value (output) of the function for any new object/sample (input) after the completion of the training process. Classification, as a predictive method, is an example of a supervised machine learning technique: it assumes the existence of a group of labeled instances for each category of objects. In summary, a classification process is characterized by:

Input: a training dataset containing objects with attributes, of which one is the class label
Output: a model (classifier) that assigns a specific label for each object (classifies the object in one category), based on the other attributes
The classifier is used to predict the class of new, unknown objects. A testing dataset is also used to determine the accuracy of the model.
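
The input/output view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute values and the class names ("compact", "sedan") are invented, and the 1-nearest-neighbour rule is just one simple choice of classifier, not a method prescribed here.

```python
from math import dist

# Hypothetical training set: each record is (attributes, class label).
# Attributes are (age in years, monthly income in thousands); values are invented.
training_set = [
    ((25, 2.0), "compact"),
    ((30, 2.5), "compact"),
    ((45, 6.0), "sedan"),
    ((50, 7.5), "sedan"),
]

def train(records):
    """'Training' a 1-nearest-neighbour model simply stores the labeled records."""
    stored = list(records)

    def classifier(attributes):
        # The classifier is a function: attribute values in, class label out.
        # It returns the label of the stored record closest to the input.
        _, label = min(stored, key=lambda rec: dist(rec[0], attributes))
        return label

    return classifier

classifier = train(training_set)
print(classifier((28, 2.2)))  # predict the class of a new, unlabeled object
```

The point of the sketch is the shape of the process: a labeled training set goes in, a function comes out, and that function is then applied to objects whose class is unknown.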

Figure cl.01 illustrates graphically the design stages of building a classification model for the type of car that different people are likely to buy. It is what one would call the construction of a car buyer profile.

Figure cl.01 Stages of Building a Classification model (cars retailer)
Summarizing, we see from the drawing above that in the first phase we build the classification model (using the corresponding algorithm) by training it on the training set. At this stage the chosen model adjusts its parameters, starting from the correspondence between the input data (age and monthly income) and the corresponding known output (type of car). Once the classification function is identified, we verify its accuracy on the testing set by comparing the predicted (forecasted) output with the observed one, in order to decide whether or not to validate the model (accuracy rate = % of items in the testing set correctly classified).
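
As a minimal sketch of the accuracy rate defined above, the following Python fragment counts how many items of a testing set are classified correctly. The testing records and the threshold rule standing in for a trained model are hypothetical and serve only to make the arithmetic concrete.

```python
def accuracy_rate(classifier, testing_set):
    """Percentage of testing-set items whose predicted class equals the observed one."""
    correct = sum(
        1 for attributes, observed in testing_set
        if classifier(attributes) == observed
    )
    return 100.0 * correct / len(testing_set)

# Hypothetical stand-in for a trained model: a single income threshold.
def toy_classifier(attributes):
    age, monthly_income = attributes
    return "sedan" if monthly_income >= 4.0 else "compact"

# Hypothetical testing set: ((age, monthly income in thousands), observed car type).
testing_set = [
    ((27, 2.1), "compact"),
    ((33, 3.0), "compact"),
    ((48, 6.5), "sedan"),
    ((52, 3.5), "sedan"),  # the toy rule misclassifies this record
]

print(accuracy_rate(toy_classifier, testing_set))  # 3 of 4 correct
```

Here three of the four testing records are classified correctly, so the accuracy rate is 75%.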

Once a classification model is built, it is compared with others in order to choose the best one. Regarding the comparison of classifiers (classification models), we list below some key elements which need to be taken into account.
Predictive accuracy, referring to the model's ability to correctly classify every new, unknown object;
Speed, which refers to how quickly the model can process data;
Robustness, illustrating the model’s ability to make accurate predictions even in the presence of "noise" in data;
Scalability, referring mainly to the model’s ability to process increasingly large volumes of data; secondly, it might refer to the ability to process data from different fields;
Interpretability, referring to how easily the model can be understood and interpreted;
Simplicity, which relates to the model’s ability to remain uncomplicated while still being effective. In principle, we choose the simplest model that can effectively solve a specific problem, just as in Mathematics, where the most elegant proof is the simplest one.
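
Some of these criteria can be measured directly. The sketch below compares two candidate classifiers on predictive accuracy and speed, and uses simplicity as a tie-breaker. Everything in it is an assumption made for illustration: the two rules, the testing records, and the use of "number of conditions" as a crude measure of simplicity.

```python
import time

def evaluate(classifier, testing_set):
    """Return (accuracy in %, elapsed seconds) for one classifier."""
    start = time.perf_counter()
    correct = sum(1 for x, y in testing_set if classifier(x) == y)
    elapsed = time.perf_counter() - start
    return 100.0 * correct / len(testing_set), elapsed

# Two hypothetical classifiers over (age, monthly income) attributes.
def income_rule(x):
    return "sedan" if x[1] >= 4.0 else "compact"

def age_and_income_rule(x):
    return "sedan" if x[1] >= 4.0 or x[0] >= 50 else "compact"

# (name, classifier, number of conditions as a crude simplicity measure)
candidates = [
    ("income_rule", income_rule, 1),
    ("age_and_income_rule", age_and_income_rule, 2),
]

# Hypothetical testing set with known class labels.
testing_set = [
    ((27, 2.1), "compact"),
    ((33, 3.0), "compact"),
    ((48, 6.5), "sedan"),
    ((52, 3.5), "sedan"),
]

results = []
for name, clf, n_conditions in candidates:
    accuracy, seconds = evaluate(clf, testing_set)
    results.append((name, accuracy, seconds, n_conditions))

# Highest accuracy wins; among equally accurate models, prefer the simpler one.
best = max(results, key=lambda r: (r[1], -r[3]))
print(best[0], best[1])
```

On this invented data the two-condition rule classifies every testing record correctly and is selected. Had both rules reached the same accuracy, the one-condition rule would have been preferred, in line with the simplicity criterion.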

Among the most popular classification models (methods), which are obviously used for other purposes too, we could mention:
