k-Nearest Neighbors
Supervised learning
We are given a training set consisting of inputs and their corresponding labels.
Input vectors
To handle different types of data, we represent each input as a vector in \(\mathbb R^d\),
i.e. we map the raw input to another space that is easy to manipulate.
The training set consists of pairs of an input vector \(\vec x \in \mathbb R^d\) and its corresponding target (label) \(t\), where \(t\) can be:
Regression: \(t\in\mathbb R\)
Classification: \(t\in \{1,2,\dots,C\}\)
A more structured object
The training set is denoted \(\{(\vec x^{(i)}, t^{(i)}) \mid i\in \{1,2,\dots,N\}\}\)
k-Nearest Neighbors
Given a novel input \(\vec x\) to be classified, find the nearest input vector to \(\vec x\) in the training set and copy its label.
"Nearest" can be formalized, for example, by the Euclidean distance \(\|\vec x^{(a)} - \vec x^{(b)}\|_2\)
Decision boundary: the boundary between regions of input space assigned to different categories
1NN
Find the example \((\vec x^*, t^*)\) in the stored training set closest to \(\vec x\), i.e. \(\vec x^* = \operatorname{arg\,min}_{\vec x^{(i)}} \|\vec x^{(i)} - \vec x\|_2\)
Then output \(t^*\)
Note that we can work with the squared distance directly, since the square root is monotonic and we only need the argmin
Problem
1NN is sensitive to noise and to mislabeled data
kNN
To smooth out the effect of noise, we can have the \(k\) nearest neighbors vote
Find the \(k\) examples \(\{(\vec x^{(i)}, t^{(i)})\}\) closest to the test instance \(\vec x\); the classification output is the majority class among their labels, as sketched below
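As an illustration, here is a minimal from-scratch sketch of this rule in NumPy. The names `knn_predict`, `X_train`, `t_train` and the toy data are made up for illustration, not taken from any library.

```python
import numpy as np

def knn_predict(X_train, t_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Squared Euclidean distances suffice: the square root is monotonic,
    # so it does not change which points are nearest.
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority class among neighbors

# Toy usage with made-up 2-D points and binary labels.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 1.1], [1.2, 0.8]])
t_train = np.array([0, 1, 0, 1, 1])
print(knn_predict(X_train, t_train, np.array([1.0, 0.9]), k=3))  # -> 1
```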
Source code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# we create an instance of Neighbours Classifier and fit the data.
fig, axs = plt.subplots(2, 2, figsize=(8, 8))
k_set = [1, 5, 15, 50]
for i in range(4):
    clf = neighbors.KNeighborsClassifier(k_set[i], weights='distance')
    clf.fit(X, y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    axs[i//2][i%2].pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot also the training points
    axs[i//2][i%2].scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                           edgecolor='k', s=20)
    axs[i//2][i%2].set_xlim(xx.min(), xx.max())
    axs[i//2][i%2].set_ylim(yy.min(), yy.max())
    axs[i//2][i%2].set_title("(k = %i)" % k_set[i])
plt.tight_layout()
fig.savefig("../assets/knn.jpg")
Pitfalls
Tradeoffs with k
Small k is good at capturing fine-grained patterns
May overfit, i.e. be sensitive to random idiosyncrasies in the training data
Large k makes stable predictions by averaging over lots of examples
May underfit, i.e. fail to capture important regularities
Rule of thumb \(k < \sqrt n\), where \(n\) is the number of training examples
Training set, test set, and validation set
To measure how well the algorithm generalizes to data it has not seen before, we can measure the generalization error using a test set.
Note that a test set can only be used to estimate the final error rate; it cannot be used to improve the algorithm, otherwise it effectively becomes part of the training set and the measurement is meaningless.
\(k\) is an example of a hyperparameter, i.e. a parameter that is not fit as part of the learning algorithm itself. To choose a good \(k\), we can use a validation set, for example:
| training set | validation set | test set |
|---|---|---|
| train with k1 | err = 7.3 | |
| train with k2 | err = 1.1 | we can test on this |
| train with k3 | err = 10.5 | |
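A rough sketch of this train/validation/test protocol with scikit-learn (the split sizes, the candidate values of \(k\), and the choice of the Iris data are illustrative assumptions, not the only option):

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Hold out a test set first, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 5, 15, 50]:
    clf = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)   # validation accuracy guides the choice of k
    if acc > best_acc:
        best_k, best_acc = k, acc

# The test set is touched exactly once, with the chosen k.
final = neighbors.KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("best k:", best_k, "test accuracy:", final.score(X_test, y_test))
```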
The Curse of Dimensionality
Low-dimensional visualizations are misleading: in high dimensions, "most" points are far apart. Suppose we want the nearest neighbor of any query in \([0,1]^d\) to be closer than \(\epsilon\) (otherwise the query is effectively left uncategorized). Since a single ball centered at \(\vec x\) has volume \(V(B_\epsilon(\vec x)) = O(\epsilon^d)\) while the total volume of \([0,1]^d\) is 1, we need \(O(\epsilon^{-d})\) balls, and hence training points, to cover the volume, which grows exponentially with \(d\).
In high dimensions, "most" points are approximately the same distance apart. However, in practice data points are often not uniformly distributed in the space; they may have low intrinsic dimension, i.e. lie on or near a low-dimensional manifold. So nearest neighbors can still work well in high dimensions.
For example, in digit recognition each digit is given as a \(64\times 64\)px image, but the corners essentially never contain ink, so the data is not uniformly distributed and we can reduce the dimensionality.
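A quick numerical check of the distance-concentration claim, assuming points drawn uniformly from \([0,1]^d\) (the sample sizes and dimensions are arbitrary): as \(d\) grows, the spread of pairwise distances shrinks relative to their mean.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    A = rng.random((100, d))   # 100 points uniform in [0, 1]^d
    B = rng.random((100, d))
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # std/mean shrinks with d: most pairs are roughly the same distance apart.
    print(f"d={d:4d}  mean={dists.mean():6.2f}  std/mean={dists.std() / dists.mean():.3f}")
```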
Normalization
The units of each dimension are often arbitrary, so kNN can be sensitive to the ranges (scales) of different features.
Fix: normalize each dimension to zero mean and unit variance, i.e. for each dimension \(j\), compute \(\mu_j, \sigma_j\) over the training set and set \(\tilde x_j = \frac{x_j - \mu_j}{\sigma_j}\)
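A minimal sketch of this standardization in NumPy (the helper name `standardize` and the toy numbers are made up; the important detail is that the training-set \(\mu_j, \sigma_j\) are reused for new queries):

```python
import numpy as np

def standardize(X):
    """Per-dimension standardization: subtract the mean, divide by the std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Fit the statistics on the training set, then apply them to any query point.
X_train = np.array([[180.0, 70.0], [160.0, 55.0], [170.0, 65.0]])
X_train_norm, mu, sigma = standardize(X_train)
x_query_norm = (np.array([175.0, 60.0]) - mu) / sigma
```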
Computation cost
training time: 0 (we just store the training set)
test time per query (naive algorithm):
Computing the \(d\)-dimensional Euclidean distance to all \(N\) training points costs \(O(Nd)\)
Sorting the distances costs \(O(N\log N)\)
We also have to store the entire dataset in memory, which becomes expensive for large, high-dimensional datasets (see the rough timing sketch below)
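A rough timing sketch of the naive per-query cost (the dimension, dataset sizes, and use of a full sort are arbitrary illustrative choices; absolute times depend on the machine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
d = 64
x_query = rng.random(d)
for N in [1_000, 10_000, 100_000]:
    X = rng.random((N, d))
    start = time.perf_counter()
    dists = np.sum((X - x_query) ** 2, axis=1)   # O(Nd) distance computation
    nearest = np.argsort(dists)[:5]              # naive full sort: O(N log N)
    print(f"N={N:7d}  query time: {time.perf_counter() - start:.4f} s")
```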
Conclusion
Simple algorithm that involves no real "learning"
Can control the complexity by varying k
Curse of dimensionality