This can be used to define, for each observation, a neighbourhood formed by the observations that are closest to it, in terms of the distance between the x values. For a continuous response variable, the nearest-neighbour fitted value for each observation's response value yi is defined by

ŷi = (1/k) Σ yj, where the sum is taken over all j such that xj ∈ N(xi)

This is the arithmetic mean of all response values whose corresponding x values are contained in the neighbourhood N(xi) of xi. Here k is a fixed constant that specifies the number of elements to be included in each neighbourhood.
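For readers who prefer to see this in code, the following is a minimal Python sketch of the nearest-neighbour fitted value. The function name nn_fitted_value, the single continuous attribute and the use of absolute difference as the distance are illustrative assumptions, not part of the definition above.

```python
# Minimal sketch (not from the original text) of the nearest-neighbour fitted
# value for a continuous response: the arithmetic mean of the k response
# values whose x values lie in the neighbourhood N(x_i).

def nn_fitted_value(x_i, xs, ys, k):
    """Average the responses of the k training observations nearest to x_i."""
    # Pair each observation with its distance to x_i; a single continuous
    # attribute and absolute difference are assumed here for simplicity.
    by_distance = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_i))
    neighbourhood = by_distance[:k]              # N(x_i): the k closest observations
    return sum(y for _, y in neighbourhood) / k  # arithmetic mean of their y values

# Example: fitted value at x = 5.0 from the 3 nearest neighbours
xs = [1.0, 2.5, 4.0, 5.5, 7.0, 9.0]
ys = [2.1, 3.0, 4.2, 5.1, 6.8, 9.3]
print(nn_fitted_value(5.0, xs, ys, k=3))         # mean of 4.2, 5.1 and 6.8
```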
Nearest Neighbour classification is mainly used when all attribute values are continuous, although it can be modified to deal with categorical attributes. The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define. Suppose we have a training set with just two instances, such as the following:
a | b | c | d | e | f | Class
yes | No | No | 6.4 | 8.3 | low | negative
yes | Yes | Yes | 18.2 | 4.7 | high | positive
There are six attribute values, followed by a classification (positive or negative). We are then given a third instance:
yes | No | No | 6.6 | 8.0 | low | ???
What should its classification be?
Even without knowing what the six attributes represent, it seems intuitively obvious that the unseen instance is nearer to the first instance than to the second. In the absence of any other information, we could reasonably predict its classification using that of the first instance, i.e. as ‘negative’. In practice there are likely to be many more instances in the training set but the same principle applies. It is usual to base the classification on those of the k nearest neighbours (where k is a small integer such as 3 or 5), not just the nearest one. The method is then known as k-Nearest Neighbour or just k-NN classification (Figure Nr.01).
Figure Nr.01 Basic k-Nearest Neighbour Classification Algorithm
– Find the k training instances that are closest to the unseen instance.
– Take the most commonly occurring classification for these k instances.
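These two steps translate almost directly into code. The sketch below is one minimal Python rendering of the basic algorithm, assuming numeric attributes and Euclidean distance (distance measures are discussed separately); the function name knn_classify and the representation of the training set as (attributes, class) pairs are illustrative choices, not taken from the original text.

```python
import math
from collections import Counter

def knn_classify(unseen, training, k):
    """Basic k-NN sketch: training is a list of (attributes, class) pairs,
    where attributes is a tuple of numeric values."""
    # Step 1: find the k training instances closest to the unseen instance
    # (Euclidean distance is assumed here purely for illustration).
    nearest = sorted(training, key=lambda inst: math.dist(inst[0], unseen))[:k]
    # Step 2: take the most commonly occurring classification among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up numeric instances
training = [((1.0, 2.0), 'negative'), ((1.2, 1.8), 'negative'), ((8.0, 9.0), 'positive')]
print(knn_classify((1.1, 2.1), training, k=3))   # 'negative' wins the vote 2 to 1
```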
We can illustrate k-NN classification diagrammatically when the dimension (i.e. the number of attributes) is small. The following example illustrates the case where the dimension is just 2. In real-world data mining applications it can of course be considerably larger.
Figure Nr.02 shows a training set with 20 instances, each giving the values of two attributes and an associated classification. How can we estimate the classification for an ‘unseen’ instance where the first and second attributes are 9.1 and 11.0, respectively? For this small number of attributes we can represent the training set as 20 points on a two-dimensional graph with values of the first and second attributes measured along the horizontal and vertical axes, respectively.
Attribute 1 | Attribute 2 | Class
0.8 | 6.3 | -
1.4 | 8.1 | -
2.1 | 7.4 | -
2.6 | 14.3 | +
6.8 | 12.6 | -
8.8 | 9.8 | +
9.2 | 11.6 | -
10.8 | 9.6 | +
11.8 | 9.9 | +
12.4 | 6.5 | +
12.8 | 1.1 | -
14.0 | 19.9 | -
14.2 | 18.5 | -
15.6 | 17.4 | -
15.8 | 12.2 | -
16.6 | 6.7 | +
17.4 | 4.5 | +
18.2 | 6.9 | +
19.0 | 3.4 | -
19.6 | 11.1 | +
Each point is labelled with a + or − symbol to indicate that the classification is positive or negative, respectively. The result is shown in Figure Nr.03. A circle has been added to enclose the five nearest neighbours of the unseen instance, which is shown as a small circle close to the centre of the larger one. The five nearest neighbours are labelled with three + signs and two - signs, so a basic 5-NN classifier would classify the unseen instance as "positive" by a form of majority voting.
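As a check, the sketch below applies the same 5-NN majority vote to the 20-instance training set above, again assuming Euclidean distance on the two attributes; it recovers three + votes and two - votes, so the unseen instance is classified as positive.

```python
import math
from collections import Counter

# Training set from the 20-instance table above: (attribute 1, attribute 2, class)
training = [
    (0.8, 6.3, '-'),  (1.4, 8.1, '-'),  (2.1, 7.4, '-'),  (2.6, 14.3, '+'),
    (6.8, 12.6, '-'), (8.8, 9.8, '+'),  (9.2, 11.6, '-'), (10.8, 9.6, '+'),
    (11.8, 9.9, '+'), (12.4, 6.5, '+'), (12.8, 1.1, '-'), (14.0, 19.9, '-'),
    (14.2, 18.5, '-'), (15.6, 17.4, '-'), (15.8, 12.2, '-'), (16.6, 6.7, '+'),
    (17.4, 4.5, '+'), (18.2, 6.9, '+'), (19.0, 3.4, '-'), (19.6, 11.1, '+'),
]

unseen = (9.1, 11.0)

# The five nearest neighbours (the points inside the circle of Figure Nr.03),
# using Euclidean distance on the two attributes.
nearest = sorted(training, key=lambda row: math.dist(row[:2], unseen))[:5]
votes = Counter(label for *_, label in nearest)
print(votes)                        # Counter({'+': 3, '-': 2})
print(votes.most_common(1)[0][0])   # '+', i.e. the instance is classified as positive
```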
Figure Nr.03 Two-dimensional Representation
We can represent two points in two dimensions ("in two-dimensional space" is the usual term) as (a1, a2) and (b1,b2) and visualise them as points in a plane. When there are three attributes we can represent the points by (a1, a2, a3) and (b1, b2, b3) and think of them as points in a room with three axes at right angles. As the number of dimensions (attributes) increases it rapidly becomes impossible to visualise them, at least for anyone who is not a physicist (and most of those who are).
When there are n attributes, we can represent the instances by the points (a1, a2, . . . , an) and (b1, b2, . . . , bn) in "n-dimensional space".
Related Posts:
Nearest Neighbour: Normalisation
Nearest Neighbour: Distance Measures
Data Classification