Dimension Reduction - PCA and Autoencoders
Dimension Reduction
We lose some information (e.g. spread / \(\sigma\)) by projecting higher dimensions onto lower ones. In practice, the important features can often be accurately captured in a low-dimensional subspace.
Let \(\mathcal D = \{x^{(1)},...,x^{(N)}\}\subset \mathbb R^D\), so that the \(N\) instances form an \(N\times D\) matrix in which each row is one observation of \(D\) features.
Let \(x^{(i)}\sim N(\mu, \Sigma)\)
Projection onto a subspace
Given \(\mathcal D\), let \(\hat\mu=\frac1N\sum_{i=1}^N x^{(i)}\) be the sample mean.
We want to find a \(K<D\) dimensional subspace \(S\subset \mathbb R^D\) s.t. each \(x^{(i)}-\hat\mu\) is "well represented" by its projection onto the \(K\)-dimensional \(S\).
Here, the projection of \(x\) is the closest point \(\tilde x\in S\), i.e. the point minimizing \(\|x-\tilde x\|\).
In a 2-dimensional problem, we are looking for a direction \(u_1\) along which the data is well represented, e.g. the direction of highest variance or the direction of minimum difference after projection; these turn out to be the same direction.
Euclidean Projection
\(\text{Proj}_S(x):=\) projection of \(x\) on \(S\).
In the 2D case, \(S\) is the line along a unit vector \(u\) (a 1-D subspace), and \(u\) is a basis for \(S\).
Since \(x^Tu = \|x\|\|u\|\cos\theta =\|x\|\cos\theta\), the scalar \(x^Tu\) is the length of the projection of \(x\) onto \(u\), so \(\text{Proj}_S(x) = (x^Tu)\,u\).
In the \(K\)-D case, we have \(K\) orthonormal basis vectors \(u_1,...,u_K \in \mathbb R^D\), and the projection is \(\text{Proj}_S(x) = \sum_{k=1}^K (x^Tu_k)\,u_k\).
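As an illustration, here is a minimal NumPy sketch (not from the notes; the basis and vector are randomly generated) of projecting a vector onto the span of \(K\) orthonormal basis vectors:

```python
# Euclidean projection of x onto the subspace spanned by K orthonormal vectors.
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 2

# Build an orthonormal basis for a random K-dimensional subspace via QR.
U, _ = np.linalg.qr(rng.normal(size=(D, K)))   # U is D x K with U^T U = I

x = rng.normal(size=D)

# Proj_S(x) = sum_k (u_k^T x) u_k = U U^T x
proj = U @ (U.T @ x)

# The residual x - Proj_S(x) is orthogonal to every basis vector.
print(np.allclose(U.T @ (x - proj), 0))        # True
```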
Center data
Center the data by subtracting the mean \(\hat\mu\), i.e. make the mean (center) the origin. We center the data because we don't want the location of the data to influence the calculations.
Representation/code
Combining the two steps above, the reconstruction is \(\tilde x = \hat\mu + \sum_{k=1}^K z_k \vec u_k\), where \(z_k = \vec u_k^T(x-\hat\mu)\).
Define the matrix \(U_{D\times K} = [\vec u_1, ..., \vec u_K]\); then \(\vec z = U^T(x-\hat\mu)\) and \(\tilde x = \hat\mu + U\vec z = \hat\mu + UU^T(x-\hat\mu)\).
We call \(UU^T\) the projector onto \(S\); note that \(U^TU = I\).
Both \(x,\tilde x\in \mathbb R^D\) but \(\tilde x\) is a linear combination of vectors in a lower dimensional subspace with representation \(\vec z \in \mathbb R^K\).
We call \(\tilde x\) the reconstruction of \(x\), and \(\vec z\) its representation (code).
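A short NumPy sketch (illustrative only; the data and \(U\) are randomly generated) computing the codes and reconstructions in matrix form:

```python
# Codes z = U^T (x - mu_hat) and reconstructions x_tilde = mu_hat + U z
# for every point in a data matrix whose rows are observations.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 5, 2

X = rng.normal(size=(N, D))                    # rows are observations
mu_hat = X.mean(axis=0)

U, _ = np.linalg.qr(rng.normal(size=(D, K)))   # any D x K matrix with orthonormal columns

Z = (X - mu_hat) @ U                           # codes, shape (N, K)
X_tilde = mu_hat + Z @ U.T                     # reconstructions, shape (N, D)

# U U^T acts as the projector onto S (applied to the centered data).
print(np.allclose(X_tilde, mu_hat + (X - mu_hat) @ U @ U.T))   # True
```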
Learning a Subspace
Since dimension reduction necessarily loses some information, we want to choose a good \(D\times K\) matrix \(U\) with orthonormal columns.
To measure how "good" the subspace is, propose two criteria:
Minimize reconstruction error: find vectors in a subspace that are closest to data points
Maximize variance of reconstructions find a subspace where data has the most variability
Noting that \(\frac1N\sum_{i=1}^N \tilde x^{(i)} = \hat\mu + UU^T\Big(\frac1N\sum_{i=1}^N (x^{(i)}-\hat\mu)\Big) = \hat\mu\),
we can still use the mean of \(x\) (i.e. \(\hat\mu\)) to compute the variance of the reconstructions.
Equivalence of the criteria
Lemma 1. The norm of the centered reconstruction equals the norm of the representation: \(\|\tilde x^{(i)}-\hat\mu\| = \|z^{(i)}\|\).
Lemma 2. \(\tilde x^{(i)}-\hat\mu\) is orthogonal to \(\tilde x^{(i)} - x^{(i)}\).
Proposition. The two criteria are equivalent: \(\frac 1 N \sum_{i=1}^N\|x^{(i)}- \tilde x^{(i)}\|^2 = C - \frac1N\sum_{i=1}^N\|\tilde x^{(i)}-\hat\mu\|^2\), where \(C = \frac1N\sum_{i=1}^N\|x^{(i)}-\hat\mu\|^2\) does not depend on \(U\).
By Lemma 2, the two vectors are orthogonal, so the Pythagorean theorem gives \(\|x^{(i)}-\hat\mu\|^2 = \|\tilde x^{(i)}-\hat\mu\|^2 + \|x^{(i)}-\tilde x^{(i)}\|^2\); averaging over \(i\) and rearranging yields the proposition.
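A numerical sanity check of the proposition (illustrative; random data and an arbitrary orthonormal \(U\)):

```python
# For any orthonormal U, average reconstruction error plus variance of the
# reconstructions equals the fixed total variance C = mean ||x - mu_hat||^2.
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 200, 6, 3

X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated data
mu_hat = X.mean(axis=0)

U, _ = np.linalg.qr(rng.normal(size=(D, K)))             # arbitrary subspace
X_tilde = mu_hat + (X - mu_hat) @ U @ U.T

recon_error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
recon_var   = np.mean(np.sum((X_tilde - mu_hat) ** 2, axis=1))
C           = np.mean(np.sum((X - mu_hat) ** 2, axis=1))

print(np.isclose(recon_error, C - recon_var))            # True
```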
PCA
Spectral Decomposition (Eigendecomposition)
If \(A_{n\times n}\) is a symmetric matrix (and hence has a full set of orthogonal eigenvectors), then \(\exists Q_{n\times n}, \Lambda_{n\times n}\) s.t. \(A = Q\Lambda Q^T\), where \(Q\) is the orthogonal matrix whose columns are the \(n\) eigenvectors and \(\Lambda\) is diagonal with the eigenvalues \(\lambda_1,...,\lambda_n\).
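A quick check of the decomposition using `numpy.linalg.eigh` (illustrative; the matrix is randomly generated):

```python
# For a symmetric A, eigh returns eigenvalues and an orthogonal Q
# with A = Q diag(lambda) Q^T.
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(4, 4))
A = B + B.T                                     # make A symmetric

eigvals, Q = np.linalg.eigh(A)                  # eigenvalues in ascending order
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))            # True
print(np.allclose(Q.T @ Q, np.eye(4)))          # True: Q is orthogonal
```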
Applying the eigendecomposition to the empirical covariance matrix \(\hat\Sigma = \frac1N \sum_{i=1}^N (x^{(i)}-\hat\mu)(x^{(i)}-\hat\mu)^T\), the optimal PCA subspace is spanned by the \(K\) eigenvectors of \(\hat\Sigma\) with the largest eigenvalues.
These eigenvectors are called principal components, analogous to the principal axes of an ellipse.
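A minimal PCA sketch via eigendecomposition of \(\hat\Sigma\) (illustrative; the function and variable names are not from the notes):

```python
import numpy as np

def pca(X, K):
    """Return (mu_hat, U) where the columns of U are the top-K principal components."""
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat
    Sigma_hat = Xc.T @ Xc / X.shape[0]             # empirical covariance, D x D
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    U = eigvecs[:, order[:K]]                      # D x K, top-K eigenvectors
    return mu_hat, U

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
mu_hat, U = pca(X, K=1)
Z = (X - mu_hat) @ U                               # 1-D codes
X_tilde = mu_hat + Z @ U.T                         # reconstructions
```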
Deriving PCA for K = 1
For \(K=1\) the goal is to maximize the variance \(u^T\hat\Sigma u = \sum_{j=1}^D \lambda_j a_j^2\), where \(\vec a = Q^Tu\). Noting that \(\sum_j a_j^2 = a^Ta = u^TQQ^Tu = u^Tu = 1\), the objective is a convex combination of the eigenvalues, so it is maximized by putting all the weight on the largest eigenvalue \(\lambda_k\).
This maximum is attained by setting \(a_j = \mathbb I(j = k)\). Therefore \(\vec u = Q\vec a = q_k\), the eigenvector with the largest eigenvalue.
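A small numerical check (illustrative) that no unit vector achieves higher variance than the top eigenvector \(q_1\):

```python
# u^T Sigma_hat u <= lambda_1 for any unit u, with equality at u = q_1.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / X.shape[0]

eigvals, Q = np.linalg.eigh(Sigma_hat)
lam1, q1 = eigvals[-1], Q[:, -1]               # largest eigenvalue / its eigenvector

u = rng.normal(size=4)
u /= np.linalg.norm(u)                         # random unit vector
print(u @ Sigma_hat @ u <= lam1 + 1e-12)       # True
print(np.isclose(q1 @ Sigma_hat @ q1, lam1))   # True
```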
Decorrelation
The PCA code is decorrelated: \(\text{Cov}(z) = U^T\hat\Sigma U = U^TQ\Lambda Q^TU\) is the top \(K\times K\) block of \(\Lambda\), which is diagonal, so the coordinates of \(z\) are uncorrelated.
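A small check (illustrative) that the covariance of the PCA codes is diagonal:

```python
# Cov(z) = U^T Sigma_hat U is (numerically) diagonal when U holds eigenvectors.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]

eigvals, Q = np.linalg.eigh(Sigma_hat)
U = Q[:, ::-1][:, :3]                          # top-3 principal components

Z = (X - mu_hat) @ U                           # codes
cov_z = Z.T @ Z / X.shape[0]
off_diag = cov_z - np.diag(np.diag(cov_z))
print(np.allclose(off_diag, 0, atol=1e-8))     # True: codes are uncorrelated
```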
Autoencoder
An autoencoder is a feed-forward neural net trained on input/target pairs \((\vec x, \vec x)\). In such a network, we add a low-dimensional bottleneck layer, and the activations of that layer serve as our code vector.
The whole process goes through an encoder that maps the input to the code, \(\vec z = f(\vec x)\), and a decoder that maps the code back to a reconstruction, \(\tilde x = g(\vec z)\), trained to minimize the reconstruction error \(\|\vec x - \tilde x\|^2\).
By doing so, we learn abstract features in an unsupervised way, which can then be transferred to supervised tasks.
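A minimal autoencoder sketch in PyTorch (illustrative; the layer sizes, learning rate, and random stand-in batch are arbitrary assumptions, not from the notes):

```python
import torch
import torch.nn as nn

D, K = 784, 32                                   # e.g. flattened 28x28 images

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, K))
decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, D)                            # stand-in batch; the target is the input itself
for step in range(100):
    x_tilde = model(x)                           # decoder(encoder(x))
    loss = loss_fn(x_tilde, x)                   # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

z = encoder(x)                                   # the K-dimensional code vectors
```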
Linear autoencoders
Suppose we use linear activations and squared error loss, with one hidden layer of \(K\) units, so that \(\tilde x = W_2W_1x\) with \(W_2:D\times K\) and \(W_1: K\times D\). Then the optimal \(W_2\) spans the same subspace as the first \(K\) principal components, i.e. the PCA subspace.
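A sketch of this claim (illustrative, assuming a linear autoencoder trained by gradient descent on centered data): compare the projector onto the span of the learned \(W_2\) with the projector onto the PCA subspace.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(5)
N, D, K = 1000, 8, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)                          # work with centered data

# Linear autoencoder: x_tilde = W2 W1 x, no biases, no nonlinearity.
enc = nn.Linear(D, K, bias=False)                # W1
dec = nn.Linear(K, D, bias=False)                # W2
model = nn.Sequential(enc, dec)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
Xt = torch.tensor(Xc, dtype=torch.float32)

for step in range(3000):
    loss = ((model(Xt) - Xt) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Projector onto the column space of W2 (W2 need not be orthonormal).
W2 = dec.weight.detach().numpy()                 # shape (D, K)
P_ae = W2 @ np.linalg.pinv(W2)

# Projector onto the PCA subspace (top-K eigenvectors of the covariance).
eigvals, Q = np.linalg.eigh(Xc.T @ Xc / N)
U = Q[:, ::-1][:, :K]
P_pca = U @ U.T

print(np.linalg.norm(P_ae - P_pca))              # should approach 0 with enough training
```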
Nonlinear Autoencoder
If we use non-linear activations, the autoencoder can be more powerful than PCA for a given code dimensionality (but it is also much more computationally expensive, since finding a good subspace requires iterative training rather than an eigendecomposition).