This can be used to define, for each observation, a neighbourhood formed by the observations that are closest to it, in terms of the distance between the x values. For a continuous response variable, the nearest-neighbour fitted value for each observation's response value yi is defined by

ŷi = (1/k) Σ yj, where the sum is taken over all j such that xj ∈ N(xi)

This is the arithmetic mean of all response values whose corresponding x values are contained in the neighbourhood N(xi) of xi. Here k is a fixed constant that specifies the number of elements to be included in each neighbourhood.
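For readers who prefer to see this in code, the following is a minimal Python sketch of the nearest-neighbour fitted value. The function name nn_fitted_value, the single continuous attribute and the use of absolute difference as the distance are illustrative assumptions, not part of the definition above.

```python
# Minimal sketch (not from the original text) of the nearest-neighbour fitted
# value for a continuous response: the arithmetic mean of the k response
# values whose x values lie in the neighbourhood N(x_i).

def nn_fitted_value(x_i, xs, ys, k):
    """Average the responses of the k training observations nearest to x_i."""
    # Pair each observation with its distance to x_i; a single continuous
    # attribute and absolute difference are assumed here for simplicity.
    by_distance = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_i))
    neighbourhood = by_distance[:k]              # N(x_i): the k closest observations
    return sum(y for _, y in neighbourhood) / k  # arithmetic mean of their y values

# Example: fitted value at x = 5.0 from the 3 nearest neighbours
xs = [1.0, 2.5, 4.0, 5.5, 7.0, 9.0]
ys = [2.1, 3.0, 4.2, 5.1, 6.8, 9.3]
print(nn_fitted_value(5.0, xs, ys, k=3))         # mean of 4.2, 5.1 and 6.8
```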
Nearest Neighbour classification is mainly used when all attribute values are continuous, although it can be modified to deal with categorical attributes. The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define. Suppose we have a training set with just two instances, such as the following:
a | b | c | d | e | f | Class
yes | No | No | 6.4 | 8.3 | low | negative
yes | Yes | Yes | 18.2 | 4.7 | high | positive
There are six attribute values, followed by a classification (positive or negative). We are then given a third instance:
yes | No | No | 6.6 | 8.0 | low | ???
What should its classification be?
Even without knowing what the six attributes represent, it seems intuitively obvious that the unseen instance is nearer to the first instance than to the second. In the absence of any other information, we could reasonably predict its classification using that of the first instance, i.e. as ‘negative’. In practice there are likely to be many more instances in the training set but the same principle applies. It is usual to base the classification on those of the k nearest neighbours (where k is a small integer such as 3 or 5), not just the nearest one. The method is then known as k-Nearest Neighbour or just k-NN classification (Figure Nr.01).
Figure Nr.01 Basic k-Nearest Neighbour Classification Algorithm
– Find the k training instances that are closest to the unseen instance.
– Take the most commonly occurring classification for these k instances.
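These two steps translate almost directly into code. The sketch below is one minimal Python rendering of the basic algorithm, assuming numeric attributes and Euclidean distance (distance measures are discussed separately); the function name knn_classify and the representation of the training set as (attributes, class) pairs are illustrative choices, not taken from the original text.

```python
import math
from collections import Counter

def knn_classify(unseen, training, k):
    """Basic k-NN sketch: training is a list of (attributes, class) pairs,
    where attributes is a tuple of numeric values."""
    # Step 1: find the k training instances closest to the unseen instance
    # (Euclidean distance is assumed here purely for illustration).
    nearest = sorted(training, key=lambda inst: math.dist(inst[0], unseen))[:k]
    # Step 2: take the most commonly occurring classification among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up numeric instances
training = [((1.0, 2.0), 'negative'), ((1.2, 1.8), 'negative'), ((8.0, 9.0), 'positive')]
print(knn_classify((1.1, 2.1), training, k=3))   # 'negative' wins the vote 2 to 1
```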
We can illustrate k-NN classification diagrammatically when the dimension (i.e. the number of attributes) is small. The following example illustrates the case where the dimension is just 2. In real-world data mining applications it can of course be considerably larger.
Figure Nr.02 shows a training set with 20 instances, each giving the values of two attributes and an associated classification. How can we estimate the classification for an ‘unseen’ instance where the first and second attributes are 9.1 and 11.0, respectively? For this small number of attributes we can represent the training set as 20 points on a two-dimensional graph with values of the first and second attributes measured along the horizontal and vertical axes, respectively.
Attribute 1 | Attribute 2 | Class
0.8 | 6.3 | -
1.4 | 8.1 | -
2.1 | 7.4 | -
2.6 | 14.3 | +
6.8 | 12.6 | -
8.8 | 9.8 | +
9.2 | 11.6 | -
10.8 | 9.6 | +
11.8 | 9.9 | +
12.4 | 6.5 | +
12.8 | 1.1 | -
14.0 | 19.9 | -
14.2 | 18.5 | -
15.6 | 17.4 | -
15.8 | 12.2 | -
16.6 | 6.7 | +
17.4 | 4.5 | +
18.2 | 6.9 | +
19.0 | 3.4 | -
19.6 | 11.1 | +
Each point is labelled with a + or − symbol to indicate that the classification is positive or negative, respectively. The result is shown in Figure Nr.03. A circle has been added to enclose the five nearest neighbours of the unseen instance, which is shown as a small circle close to the centre of the larger one. The five nearest neighbours are labelled with three + signs and two - signs, so a basic 5-NN classifier would classify the unseen instance as "positive" by a form of majority voting.
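As a check, the sketch below applies the same 5-NN majority vote to the 20-instance training set above, again assuming Euclidean distance on the two attributes; it recovers three + votes and two - votes, so the unseen instance is classified as positive.

```python
import math
from collections import Counter

# Training set from the 20-instance table above: (attribute 1, attribute 2, class)
training = [
    (0.8, 6.3, '-'),  (1.4, 8.1, '-'),  (2.1, 7.4, '-'),  (2.6, 14.3, '+'),
    (6.8, 12.6, '-'), (8.8, 9.8, '+'),  (9.2, 11.6, '-'), (10.8, 9.6, '+'),
    (11.8, 9.9, '+'), (12.4, 6.5, '+'), (12.8, 1.1, '-'), (14.0, 19.9, '-'),
    (14.2, 18.5, '-'), (15.6, 17.4, '-'), (15.8, 12.2, '-'), (16.6, 6.7, '+'),
    (17.4, 4.5, '+'), (18.2, 6.9, '+'), (19.0, 3.4, '-'), (19.6, 11.1, '+'),
]

unseen = (9.1, 11.0)

# The five nearest neighbours (the points inside the circle of Figure Nr.03),
# using Euclidean distance on the two attributes.
nearest = sorted(training, key=lambda row: math.dist(row[:2], unseen))[:5]
votes = Counter(label for *_, label in nearest)
print(votes)                        # Counter({'+': 3, '-': 2})
print(votes.most_common(1)[0][0])   # '+', i.e. the instance is classified as positive
```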
Figure Nr.03 Two-dimensional Representation
We can represent two points in two dimensions ("in two-dimensional space" is the usual term) as (a1, a2) and (b1,b2) and visualise them as points in a plane. When there are three attributes we can represent the points by (a1, a2, a3) and (b1, b2, b3) and think of them as points in a room with three axes at right angles. As the number of dimensions (attributes) increases it rapidly becomes impossible to visualise them, at least for anyone who is not a physicist (and most of those who are).
When there are n attributes, we can represent the instances by the points (a1, a2, . . . , an) and (b1, b2, . . . , bn) in "n-dimensional space".
Related Posts:
Nearest Neighbour: Normalisation
Nearest Neighbour: Distance Measures
Data Classification