Mileage
(miles)
|
Number
of doors
|
Age
(years)
|
Number
of owners
|
18,457
|
2
|
12
|
8
|
26,292
|
4
|
3
|
1
|
When the distance of these instances from an unseen one is calculated, the mileage attribute will almost certainly contribute a value of several thousands squared, i.e. several millions, to the sum of squares total. The number of doors will probably contribute a value less than 10. It is clear that in practice the only attribute that will matter when deciding which neighbours are the nearest using the Euclidean distance formula is the mileage.
This is unreasonable as the unit of measurement, here the mile, is entirely arbitrary. We could have chosen an alternative measure of distance travelled such as millimetres or perhaps light years. Similarly we might have measured age in some other unit such as milliseconds or millennia. The units chosen should not affect the decision on which are the nearest neighbours.
To overcome this problem we generally normalise the values of continuous attributes. The idea is to make the values of each attribute run from 0 to 1. Suppose that for some attribute A the smallest value found in the training data is −8.1 and the largest is 94.3. First we adjust each value of A by adding 8.1 to it, so the values now run from 0 to 94.3 + 8.1 = 102.4.
The spread of values from highest to lowest is now 102.4 units, so we divide all values by that number to make the spread of values from 0 to 1. In general if the lowest value of attribute A is min and the highest value is max, we convert each value of ;A, say a, to (a − min)/(max − min).
Using this approach all continuous attributes are converted to small numbers from 0 to 1, so the effect of the choice of unit of measurement on the outcome is greatly reduced.
Note that it is possible that an unseen instance may have a value of A that is less than min or greater than max. If we want to keep the adjusted numbers in the range from 0 to 1 we can just convert any values of A that are less than min or greater than max to 0 or 1, respectively. Another issue that occurs with measuring the distance between two points is the weighting of the contributions of the different attributes.
We may believe that the mileage of a car is more important than the number of doors it has (although no doubt not a thousand times more important, as with the unnormalised values). To achieve this we can adjust the formula for Euclidean distance to
where w1, w2, . . . , wn are the weights. It is customary to scale the weight values so that the sum of all the weights is one.
Dealing with Categorical Attributes
One of the weaknesses of the nearest neighbour approach to classification is that there is no entirely satisfactory way of dealing with categorical attributes. One possibility is to say that the difference between any two identical values of the attribute is zero and that the difference between any two different values is 1. Effectively this amounts to saying (for a colour attribute)
red − red = 0, red − blue = 1, blue − green = 1, etc.
Sometimes there is an ordering (or a partial ordering) of the values of an attribute, for example we might have values good, average and bad. We could treat the difference between good and average or between average and bad as 0.5 and the difference between good and bad as 1. This still does not seem completely right, but may be the best we can do in practice.
Related Posts :
Nearest Neighbour Classification
Nearest Neighbour: Distance Measures
Data Classification
Nearest Neighbour Classification
Nearest Neighbour: Distance Measures
Data Classification