k-Nearest Neighbors
Supervised learning
We are given a training set consisting of inputs and their corresponding labels.
Input vectors
To handle different types of data, we represent each input as a vector in \(\mathbb R^d\),
i.e. we map the raw input to another space that is easy to manipulate.
The training set consists of pairs of an input vector \(\vec x \in \mathbb R^d\) and its corresponding target (label) \(t\), where \(t\) can be:
Regression: \(t\in\mathbb R\)
Classification: \(t\in \{1,2,\dots,C\}\)
A more structured object
The training set is denoted \(\{(\vec x^{(i)}, t^{(i)}) \mid i\in \{1,2,\dots,N\}\}\)
k-Nearest Neighbors
Given a novel input \(\vec x\) to be classified, find the nearest input vector to \(\vec x\) in the training set and copy its label.
"Nearest" can be formalized, for example, by the Euclidean distance \(\|\vec x^{(a)} - \vec x^{(b)}\|_2\)
Decision boundary: the boundary between regions of input space assigned to different categories
1NN
Find the example \((\vec x^*, t^*)\) in the stored training set closest to \(\vec x\), i.e. \(\vec x^* = \operatorname{arg\,min}_{\vec x^{(i)}} \|\vec x^{(i)} - \vec x\|_2\)
Then output \(t^*\)
Note that we can work with the squared distance directly, since the square root is monotonic and we only need the argmin
Problem
1NN is sensitive to noise and to mislabeled data
kNN
To smooth out the effect of noise, we can have the \(k\) nearest neighbors vote
Find the \(k\) examples \(\{(\vec x^{(i)}, t^{(i)})\}\) closest to the test instance \(\vec x\); the classification output is the majority class among their labels, as sketched below
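As an illustration, here is a minimal from-scratch sketch of this rule in NumPy. The names `knn_predict`, `X_train`, `t_train` and the toy data are made up for illustration, not taken from any library.

```python
import numpy as np

def knn_predict(X_train, t_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Squared Euclidean distances suffice: the square root is monotonic,
    # so it does not change which points are nearest.
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority class among neighbors

# Toy usage with made-up 2-D points and binary labels.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 1.1], [1.2, 0.8]])
t_train = np.array([0, 1, 0, 1, 1])
print(knn_predict(X_train, t_train, np.array([1.0, 0.9]), k=3))  # -> 1
```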
Source code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# we create an instance of Neighbours Classifier and fit the data.
fig, axs = plt.subplots(2, 2, figsize=(8, 8))
k_set = [1, 5, 15, 50]
for i in range(4):
    clf = neighbors.KNeighborsClassifier(k_set[i], weights='distance')
    clf.fit(X, y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    axs[i//2][i%2].pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot also the training points
    axs[i//2][i%2].scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                           edgecolor='k', s=20)
    axs[i//2][i%2].set_xlim(xx.min(), xx.max())
    axs[i//2][i%2].set_ylim(yy.min(), yy.max())
    axs[i//2][i%2].set_title("(k = %i)" % k_set[i])
plt.tight_layout()
fig.savefig("../assets/knn.jpg")
Pitfalls
Tradeoffs with k
Small k is good at capturing fine-grained patterns
May overfit, i.e. be sensitive to random idiosyncrasies in the training data
Large k makes stable predictions by averaging over lots of examples
May underfit, i.e. fail to capture important regularities
Rule of thumb \(k < \sqrt n\), where \(n\) is the number of training examples
Training set, test set, and validation set
To measure how well the algorithm generalizes to data it has not seen before, we can measure the generalization error using a test set.
Note that a test set can only be used to estimate the final error rate; it cannot be used to improve the algorithm, otherwise it effectively becomes part of the training set and the measurement is meaningless.
\(k\) is an example of a hyperparameter, i.e. a parameter that is not fit as part of the learning algorithm itself. To choose a good \(k\), we can use a validation set, for example:
| training set | validation set | test set |
|---|---|---|
| train with k1 | err = 7.3 | |
| train with k2 | err = 1.1 | we can test on this |
| train with k3 | err = 10.5 | |
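A rough sketch of this train/validation/test protocol with scikit-learn (the split sizes, the candidate values of \(k\), and the choice of the Iris data are illustrative assumptions, not the only option):

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Hold out a test set first, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 5, 15, 50]:
    clf = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)   # validation accuracy guides the choice of k
    if acc > best_acc:
        best_k, best_acc = k, acc

# The test set is touched exactly once, with the chosen k.
final = neighbors.KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("best k:", best_k, "test accuracy:", final.score(X_test, y_test))
```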
The Curse of Dimensionality
Low-dimensional visualizations are misleading: in high dimensions, "most" points are far apart. Suppose we want the nearest neighbor of any query in \([0,1]^d\) to be closer than \(\epsilon\) (otherwise the query is effectively left uncategorized). Since a single ball centered at \(\vec x\) has volume \(V(B_\epsilon(\vec x)) = O(\epsilon^d)\) while the total volume of \([0,1]^d\) is 1, we need \(O(\epsilon^{-d})\) balls, and hence training points, to cover the volume, which grows exponentially with \(d\).
In high dimensions, "most" points are approximately the same distance apart. However, in practice data points are often not uniformly distributed in the space; they may have low intrinsic dimension, i.e. lie on or near a low-dimensional manifold. So nearest neighbors can still work well in high dimensions.
For example, in digit recognition each digit is given as a \(64\times 64\)px image, but the corners essentially never contain ink, so the data is not uniformly distributed and we can reduce the dimensionality.
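A quick numerical check of the distance-concentration claim, assuming points drawn uniformly from \([0,1]^d\) (the sample sizes and dimensions are arbitrary): as \(d\) grows, the spread of pairwise distances shrinks relative to their mean.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    A = rng.random((100, d))   # 100 points uniform in [0, 1]^d
    B = rng.random((100, d))
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # std/mean shrinks with d: most pairs are roughly the same distance apart.
    print(f"d={d:4d}  mean={dists.mean():6.2f}  std/mean={dists.std() / dists.mean():.3f}")
```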
Normalization
The units of each dimension are often arbitrary, so kNN can be sensitive to the ranges (scales) of different features.
Fix: normalize each dimension to zero mean and unit variance, i.e. for each dimension \(j\), compute \(\mu_j, \sigma_j\) over the training set and set \(\tilde x_j = \frac{x_j - \mu_j}{\sigma_j}\)
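A minimal sketch of this standardization in NumPy (the helper name `standardize` and the toy numbers are made up; the important detail is that the training-set \(\mu_j, \sigma_j\) are reused for new queries):

```python
import numpy as np

def standardize(X):
    """Per-dimension standardization: subtract the mean, divide by the std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Fit the statistics on the training set, then apply them to any query point.
X_train = np.array([[180.0, 70.0], [160.0, 55.0], [170.0, 65.0]])
X_train_norm, mu, sigma = standardize(X_train)
x_query_norm = (np.array([175.0, 60.0]) - mu) / sigma
```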
Computation cost
training time: 0 (we just store the training set)
test time per query (naive algorithm):
Computing the \(d\)-dimensional Euclidean distance to all \(N\) training points costs \(O(Nd)\)
Sorting the distances costs \(O(N\log N)\)
We also have to store the entire dataset in memory, which becomes expensive for large, high-dimensional datasets (see the rough timing sketch below)
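A rough timing sketch of the naive per-query cost (the dimension, dataset sizes, and use of a full sort are arbitrary illustrative choices; absolute times depend on the machine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
d = 64
x_query = rng.random(d)
for N in [1_000, 10_000, 100_000]:
    X = rng.random((N, d))
    start = time.perf_counter()
    dists = np.sum((X - x_query) ** 2, axis=1)   # O(Nd) distance computation
    nearest = np.argsort(dists)[:5]              # naive full sort: O(N log N)
    print(f"N={N:7d}  query time: {time.perf_counter() - start:.4f} s")
```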
Conclusion
Simple algorithm that involves no real "learning"
Can control the complexity by varying k
Curse of dimensionality