K-Nearest Neighbor (KNN) Algorithm
The k-Nearest Neighbor (kNN) algorithm is one of the simplest supervised learning methods and can be used for both classification and regression problems. It has only one parameter to tune, k, and is easy to understand and implement. To make a prediction for a given data point, it relies on the labels of the k closest points in its vicinity. The underlying assumption is therefore that points with similar values of the response variable (e.g., the same labels) lie in close proximity to one another. The input vector, or input space, can have any number of dimensions d.
To apply the algorithm, the number of neighbors k must first be set to a positive integer. Once k is chosen, a prediction can be made directly from the training samples. For example, when solving a binary classification problem, the kNN algorithm runs the following steps to predict the label of a new observation X_s.
Find the k training examples that are closest to the sth observation, and call them N_k(X_s), the k-neighbors of the d-dimensional X_s. Since there are only two classes, the label of each training sample i can be represented by a zero or a one, i.e., y_i ∈ {0, 1}.
Using the labels of these k-neighbors, we then predict the label of the sth observation, y_s, as follows:

y_s = (1/k) Σ_{i ∈ N_k(X_s)} y_i
The equation above simply yields the fraction of neighbors whose label is class 1. If y_s > 0.5, the label of the sth observation is predicted as 1; otherwise it is 0 (i.e., the majority vote wins). As an illustration, Figure 2-1 shows two-dimensional data with binary labels. The class of a new point, depicted by a star, will be predicted as B when k is set to 3, since the majority of points within its k-neighborhood belong to class B.
Figure 2-1 Sample data with two classes and three nearest neighbors for a given point
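As a concrete illustration of the prediction rule above, the following is a minimal sketch in Python with NumPy (the function and variable names are hypothetical, not from the original text) of a kNN binary classifier that computes the fraction of class-1 labels among the k nearest neighbors and applies the 0.5 threshold.

import numpy as np

def knn_predict_binary(X_train, y_train, x_new, k=3):
    # Euclidean distances from the new point to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbors, N_k(x_new)
    neighbor_idx = np.argsort(distances)[:k]
    # Fraction of neighbors labeled 1 (the equation above)
    y_s = y_train[neighbor_idx].mean()
    # Majority vote: predict 1 if more than half the neighbors are class 1
    return 1 if y_s > 0.5 else 0

# Tiny usage example with made-up points
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict_binary(X_train, y_train, np.array([8.5, 8.0]), k=3))  # -> 1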
In implementing the kNN classification algorithm for a new data point, five main steps are typically followed: (i) set k to a positive integer; (ii) calculate the distances (e.g., Euclidean) between the new observation and all the training data; (iii) sort the distances and determine the k nearest neighbors; (iv) get the categories (classes) of these neighbors; and (v) estimate the category of the new observation based on the majority vote. If a regression problem is being solved, only step (v) needs to be modified: the mean or median of the k-neighbors can be computed to estimate the dependent variable for the new observation, as shown in the sketch below.
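To illustrate the modification in step (v), here is a minimal sketch, under the same assumptions as the classifier above, of a kNN regressor that replaces the majority vote with the mean (or median) of the neighbors' response values.

import numpy as np

def knn_predict_regression(X_train, y_train, x_new, k=3, use_median=False):
    # Steps (ii)-(iii): compute Euclidean distances and find the k nearest neighbors
    distances = np.linalg.norm(X_train - x_new, axis=1)
    neighbor_idx = np.argsort(distances)[:k]
    # Step (iv): collect the neighbors' response values
    neighbor_values = y_train[neighbor_idx]
    # Step (v): aggregate with the mean (or the median) instead of a majority vote
    return np.median(neighbor_values) if use_median else neighbor_values.mean()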
As indicated above, the kNN method has a single parameter, k, that needs to be tuned. As in any other ML algorithm, the value selected for this parameter affects the performance of the model. Figure 2-2 shows how the decision boundary changes depending on the selected value of k. Making k too small results in irregular boundaries and overfitting to the training data. On the other hand, a very large k may smooth away important local nonlinearities, because a larger number of samples influences the decision boundary and makes it smoother. In Figure 2-2, when k = 1 all the training samples are correctly classified, whereas for k = 15 there are some misclassifications. However, the training error should not be the criterion for selecting the best model parameter; otherwise, k would always be set to 1. Instead, the performance of the model under varying k should be assessed with a validation dataset. The value of k that minimizes the error on the validation data is expected to perform and generalize better.
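One way to carry out this validation-based tuning, assuming scikit-learn is available and using a synthetic dataset purely for illustration, is sketched below: the model is fit for a range of k values and the k with the highest validation accuracy is kept.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration; any labeled dataset could be used instead
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_k, best_acc = None, -1.0
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)  # accuracy on the validation split for this k
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"Selected k = {best_k} with validation accuracy {best_acc:.3f}")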
Given its simplicity, kNN is a commonly used method and often performs well, especially when the decision boundary is very irregular [1]. Because finding the k nearest neighbors requires computing the distance to every training data point, the kNN algorithm is not computationally efficient. However, algorithms that improve the computation speed have been developed [1].
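As one example of such a speed-up (the specific algorithms discussed in [1] may differ), neighbor searches can be accelerated with a space-partitioning structure such as a KD-tree. The sketch below uses SciPy's cKDTree; this particular choice is an assumption, not part of the original text.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 3))        # hypothetical training points
y_train = rng.integers(0, 2, size=10000)     # hypothetical binary labels

tree = cKDTree(X_train)                                # build the tree once
_, neighbor_idx = tree.query([0.1, -0.2, 0.3], k=5)    # fast k-nearest query
prediction = int(y_train[neighbor_idx].mean() > 0.5)   # majority vote as before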
Figure 2-2 The effects of k on decision boundaries for solving binary classification problems