Dimensionality Reduction
1. Why reduce dimensionality?
- The models we fit typically take higher-dimensional input data (2-, 3-, …, 100-dimensional, etc.)
- The model may not just consider the impact of each input feature separately, but also possible interactions of two or more features.
- Typically, each feature we allow to impact our predictions requires a parameter in the model
- Thus, the model’s number of parameters grows (often super-linearly) with the dimensions.
- When we fit the model, we must find a joint setting of all these parameters. So the dimension of the optimization problem is the number of parameters.
Dimensionality reduction allows reducing the number of dimensions, and thereby the number of parameters.
Benefits of dimensionality reduction
- Computationally less expensive than fitting/using a model whose cost scales poorly with the input dimension.
- Numerically/statistically, lower-dimensional representations are often more stable to fit, leading to better performance.
- Interpretability/simplicity is achieved by removing irrelevant/redundant features, which helps expose key structures in the data more clearly.
Why we can reduce dimensionality
- A key assumption in dimensionality reduction is the manifold hypothesis, that the data can be represented (approximated) well by lower-dimensional data.
- The data is said to lie on a lower-dimensional manifold.
- What follows is how to transform the higher-dimensional (real) vector-valued input data by (almost) projecting it onto a lower-dimensional subspace.
2. Principal Component Analysis (PCA)
Essentially, PCA finds lower-dimensional subspaces that describe the essential properties of our data. This allows one to project the data onto these lower-dimensional subspaces, sometimes leading to significant dimensionality reductions.
Principal components are closely related to the singular value decomposition (SVD).
PCA maximizes variance.
LDA (linear discriminant analysis), by contrast, is particularly useful in classification problems; we will discuss that technique later.
Change of basis
- PCA starts with finding the optimal basis (in some sense) for representing the data.
- Change of basis is implemented by multiplication by an orthogonal matrix $Q$.
We say $Q$ is orthogonal/orthonormal if $Q^TQ = I$ or $QQ^T = I$.
- Since $Q^TQ = I$, $Q^{-1} = Q^T$.
- Important special cases:
  - $Q$ is the identity matrix: nothing changes.
  - $Q$ is a permutation matrix: the order of the axes (attributes) changes.
  - $Q$ is a rotation matrix: the axes are rotated.
- Take note: a rotation has determinant $+1$ (it preserves orientation); a reflection about an axis has determinant $-1$ (it changes the sign). These special cases are illustrated below.
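As a quick numerical illustration of these special cases (a minimal NumPy sketch; the specific angle and axis swap are arbitrary choices for demonstration):

```python
import numpy as np

# Identity: changes nothing.
I = np.eye(2)

# Permutation: swaps the two axes (this swap is also a reflection).
P = np.array([[0., 1.],
              [1., 0.]])

# Rotation by 30 degrees (counter-clockwise).
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

for name, Q in [("identity", I), ("permutation", P), ("rotation", R)]:
    # Orthogonality: Q^T Q = I, so lengths and angles are preserved.
    assert np.allclose(Q.T @ Q, np.eye(2))
    # Rotations have det +1; reflections (like this axis swap) have det -1.
    print(name, "det =", round(np.linalg.det(Q), 3))
```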
Input/feature space
- Project a $d$-dimensional input (observation vector) $\textbf x$ onto another $d$-dimensional vector $\textbf u$: $\textbf u^T\textbf x$.
- If $\textbf u$ is made a unit vector, the dot product $\textbf u^T\textbf x$ gives the length of the projection of $\textbf x$ onto $\textbf u$.
- We call the projection length $\textbf u^T\textbf x$ the score of $\textbf x$ for the feature $\textbf u$.
- Collecting a number of unit vectors as features into a matrix $U$ (one column per feature), $U^T\textbf x$ gives a vector of scores.
- This vector of scores is called the feature vector corresponding to $\textbf x$ for the feature space defined by $U$.
The dimensionality reduction approach looks for good ways to choose the feature matrix $U$ (see the sketch below).
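For concreteness, a minimal sketch of computing scores (the data vector and the two feature vectors below are made up for illustration):

```python
import numpy as np

# A single 3-dimensional observation.
x = np.array([2.0, -1.0, 0.5])

# Two unit-length feature vectors collected as columns of U.
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)
U = np.column_stack([u1, u2])           # shape (3, 2)

# Score of x for each feature: length of the projection of x onto that feature.
scores = U.T @ x                        # shape (2,): the feature vector of x
print(scores)                           # [2.0, -0.3535...]
```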
Motivation for applying PCA
- We want our machine learning techniques not to depend on the axis system of the collected/recorded data.
- The key idea for choosing features in PCA is thus to remove dependence on the input axes.
How to change the axis system:
- Permuting the axes (changing the order of attributes)
- Translating the origin (e.g. Celsius vs. Fahrenheit vs. Kelvin)
- Scaling the axes (e.g. Celsius vs. Fahrenheit, imperial vs. metric)
- Reflecting the axes (e.g. % increase vs. % decrease, profit vs. loss)
- Rotating the axes (e.g. different camera orientations when capturing images)
- Projection cannot perform translation, so it is typical to centre the data first, i.e. subtract the mean from all observations.
- PCA defines a canonical set of axes (up to reflection), ordered by importance, which we can then use to reduce the dimension.
Data
- Given $N$ real vector-valued observations, each of dimension $d$.
- The original data is collected into a $d \times N$ matrix $X$ (one observation per column).
- The centered (mean-subtracted) data is collected into matrix $D$, also of shape $d \times N$.
- To center the data: subtract the mean of each attribute from the value of the corresponding attribute, for each observation.
- Equivalently, subtract the mean vector from each column:
$$\textbf d_n = \textbf x_n - \bar{\textbf x}, \qquad D = [\textbf d_1, \dots, \textbf d_N]$$
Given $N$ observations $\textbf x_n \in \mathbb{R}^d$, $n = 1,\dots,N$, of dimension $d$, we will begin by trying to find the directions of **maximum variation** in the data. Consider projecting the data points onto a one-dimensional subspace spanned by a unit vector $\textbf u_1$. Thus $\textbf x_n$ is projected onto the vector $(\textbf u_1^T\textbf x_n)\textbf u_1$, and in this coordinate system the projected value is just given by $\textbf u_1^T \textbf x_n$. The idea is to choose $\textbf u_1$ in such a way that the variance of the projected values is as large as possible. If the sample mean vector is given by
$$\bar{\textbf x} = \frac{1}{N}\sum_{n=1}^N \textbf x_n$$
then the mean of the projected data is $\textbf u_1^T \bar{\textbf x}$, and the **variance** of the projected data is
$$\begin{aligned} \frac{1}{N}\sum_{n=1}^N \left(\textbf u_1^T\textbf x_n - \textbf u_1^T\bar{\textbf x}\right)^2 &= \frac{1}{N}\sum_{n=1}^N \textbf u_1^T(\textbf x_n - \bar{\textbf x})(\textbf x_n - \bar{\textbf x})^T\textbf u_1 \\ &= \textbf u_1^T\left[\frac{1}{N}\sum_{n=1}^N (\textbf x_n-\bar{\textbf x})(\textbf x_n-\bar{\textbf x})^T\right]\textbf u_1 \\ &= \textbf u_1^T S\,\textbf u_1 \end{aligned} \tag{2.2}$$
where $S$ is the sample covariance matrix:
$$S = \frac{1}{N}\sum_{n=1}^N(\textbf x_n - \bar{\textbf x})(\textbf x_n - \bar{\textbf x})^T = \frac{1}{N}DD^T \tag{2.3}$$
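A short sketch of centering the data and forming the sample covariance exactly as in (2.3), using randomly generated data (`np.cov` with `bias=True` also divides by $N$, so the two should agree):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 500
X = rng.normal(size=(d, N))             # observations are the columns

x_bar = X.mean(axis=1, keepdims=True)   # sample mean vector, shape (d, 1)
D = X - x_bar                           # centered data, shape (d, N)

S = (D @ D.T) / N                       # sample covariance, equation (2.3)
assert np.allclose(S, np.cov(X, bias=True))
```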
#### Finding the first principal component ($\textbf u_1$) by optimization

- Sample covariance matrix $S$ is a $d \times d$ matrix.
- The $(i, j)$-th entry of $S$ is the covariance of attributes $i$ and $j$.
- Consider a potential feature specified by a vector $\textbf u$.
- The variance of the scores for this feature is $\textbf u^T S\textbf u$ (note this is a scalar, derived above).
- Maximizing this without constraints is unbounded (scaling $\textbf u$ scales the variance), so we require $\textbf u$ to be a unit vector.

#### Constrained optimization with Lagrange multiplier

Maximize $\textbf{u}^TS\textbf{u}$ subject to $\textbf{u}^{T}\textbf{u} = 1$. Introduce a Lagrange multiplier $\lambda$ for the constraint and optimize the Lagrangian:
$$\mathcal{L}(\textbf{u},\lambda) = \textbf{u}^TS\textbf{u} + \lambda(1-\textbf u^T\textbf u) \tag{2.4}$$
Set the gradient w.r.t. $\textbf u$ equal to zero:
$$\nabla_{\textbf u} \mathcal{L} = \frac{\partial}{\partial\textbf u}\left[\textbf{u}^TS\textbf{u} + \lambda(1-\textbf u^T\textbf u)\right]= 2S\textbf u - 2\lambda\textbf u = 0$$
$$2S\textbf u = 2\lambda\textbf u \quad\Longrightarrow\quad S\textbf u = \lambda\textbf u \tag{2.5}$$
Thus the critical points are eigenvectors of $S$, with the corresponding score variances given by the corresponding eigenvalues (since $\textbf u^T S\textbf u = \lambda\textbf u^T\textbf u = \lambda$). For the first principal component **(PC1)**, choose the eigenvector $\textbf u_1$ corresponding to the largest eigenvalue.

#### Finding the rest of the principal components

Since PCA requires the principal components to be orthogonal, we simply add an additional constraint and **Lagrange multiplier**, and again maximize the variance. Maximize $\textbf u^TS\textbf u$ subject to $\textbf u^T\textbf u = 1$, but also $\textbf u^T\textbf u_1 = 0$ (orthogonality). Introduce Lagrange multipliers for both constraints, and optimize similarly:
$$\mathcal{L}(\textbf{u},\lambda_1,\lambda_2) = \textbf{u}^TS\textbf{u} + \lambda_1(1-\textbf u^T\textbf u) + \lambda_2\textbf u^T\textbf u_1 \tag{2.6}$$
Set the gradient w.r.t. $\textbf u$ equal to zero:
$$\nabla_{\textbf u} \mathcal{L} = \frac{\partial}{\partial\textbf u}\left[\textbf{u}^TS\textbf{u} + \lambda_1(1-\textbf u^T\textbf u) + \lambda_2\textbf u^T\textbf u_1\right]= 2S\textbf u - 2\lambda_1\textbf u + \lambda_2\textbf u_1 = 0 \tag{2.7}$$
$$\frac{\partial \mathcal{L}}{\partial \lambda_1} = 1 - \textbf u^T\textbf u = 0 \tag{2.8}$$
$$\frac{\partial \mathcal{L}}{\partial \lambda_2} = \textbf u^T\textbf u_1 = 0 \tag{2.9}$$
We can prove that $\lambda_2$ is zero by left-multiplying equation (2.7) with $\textbf u_1^T$.
Then,
- $\textbf u_1^T S\textbf u = (S\textbf u_1)^T\textbf u \propto \textbf u_1^T\textbf u = 0$ (since $S\textbf u_1$ is a multiple of $\textbf u_1$, and $\textbf u$ and $\textbf u_1$ are orthogonal)
- $\textbf u_1^T\textbf u = 0$ (since $\textbf u$ and $\textbf u_1$ are orthogonal)
- $\textbf u_1^T\textbf u_1 = 1$ (since $\textbf u_1$ is a unit vector)
Now equation (2.7), left-multiplied by $\textbf u_1^T$ and combined with these facts, can be used to show:
$$\lambda_2 = 0$$
With this, equation (2.7) can be reduced to:
$$S\textbf u = \lambda_1\textbf u$$
The solutions are once again the eigenvectors of $S$, the same result as in (2.5); however, since $\textbf u$ must be orthogonal to $\textbf u_1$, the maximizer is the eigenvector corresponding to the next largest eigenvalue.
Therefore, the $n$-th principal component is the eigenvector corresponding to the $n$-th largest eigenvalue.
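As a numerical sanity check of this result (a sketch on synthetic data), the variance of the scores along the $i$-th eigenvector of $S$ should equal the $i$-th largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 3, 2000
D = rng.normal(size=(d, N)) * np.array([[3.0], [1.0], [0.2]])  # synthetic data
D = D - D.mean(axis=1, keepdims=True)                          # centre it

S = (D @ D.T) / N
eigvals, eigvecs = np.linalg.eigh(S)                # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder: largest first

for i in range(d):
    u = eigvecs[:, i]                   # i-th principal component
    scores = u @ D                      # projected values u^T x_n
    # Variance of the scores matches the i-th largest eigenvalue.
    assert np.isclose(scores.var(), eigvals[i])
print(eigvals)
```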
Finding principal components in practice
- Need to find the eigenvectors and eigenvalues of the covariance matrix.
- Natural tool: eigendecomposition of a symmetric matrix (a.k.a diagonalization).
- Also consider another approach: using the Singular Value Decomposition (SVD)
- Both approaches collect the eigenvectors of interest in a matrix, and we use matrix multiplication to project the observations into the feature space.
Finding principal components by eigendecomposition
- $S$ is symmetric, so it can be factored using an eigendecomposition:
$$S = Q\Lambda Q^T \tag{2.12}$$
- `numpy.linalg.eigh` computes the eigendecomposition of a symmetric matrix; note that it returns the eigenvalues in **ascending** order, so they (and the corresponding eigenvectors) must be reversed to get the decreasing order used for PCA.
- Columns of matrix $Q$ are the principal components.
- Diagonal entries of matrix $\Lambda$ are the eigenvalues.
Symmetric: $S^T = S$. A real symmetric matrix has real eigenvalues and orthonormal eigenvectors, which is why the factorization (2.12) exists.
Dimension reduction after eigendecomposition
- To reduce the dimension, we use the first $k$ principal components by forming $Q_k$ from the first $k$ columns of $Q$.
Notation: when we subscript a matrix by an integer value, e.g. $Q_k$, we limit the number of rows/columns to that value.
- Then the transformed data is $Q_k^T D$, a $k \times N$ matrix of scores.
- When $k = d$, there is no dimensionality reduction, just a change of basis specified by $Q$.
- We have $S \approx Q_k\Lambda_k Q_k^T$, where $\Lambda_k$ contains the $k$ largest eigenvalues on its diagonal: $Q_k\Lambda_k Q_k^T$ is a rank-$k$ approximation to $S$ (see the sketch below).
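A sketch of the whole eigendecomposition route on synthetic data (the choice of $k = 2$ and the injected correlation are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, k = 5, 1000, 2
X = rng.normal(size=(d, N))
X[1] += 2.0 * X[0]                      # inject correlation so PCA has structure to find

x_bar = X.mean(axis=1, keepdims=True)
D = X - x_bar                           # centered data
S = (D @ D.T) / N                       # sample covariance

lam, Q = np.linalg.eigh(S)              # eigh returns eigenvalues in ascending order
lam, Q = lam[::-1], Q[:, ::-1]          # reverse to decreasing order

Q_k = Q[:, :k]                          # first k principal components
Lam_k = np.diag(lam[:k])

scores = Q_k.T @ D                      # transformed (k x N) data
S_k = Q_k @ Lam_k @ Q_k.T               # rank-k approximation to S
print(scores.shape, np.linalg.matrix_rank(S_k))
```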
The singular value decomposition (SVD)
Any real $d \times N$ matrix $X$ can be decomposed as $X = U\Sigma V^T$, where
- $U$ is a $d \times d$ orthogonal matrix
- $\Sigma$ is a $d \times N$ (rectangular) diagonal matrix with non-negative entries in non-increasing order down the diagonal
- $V$ is an $N \times N$ orthogonal matrix
Such a decomposition is called an SVD of $X$.
$\Sigma$ is fully determined by $X$, and its number of non-zero entries is the rank of $X$.
The non-zero entries of $\Sigma$ are called the singular values of $X$.
Solving for these matrices:
Note: $XX^T = U\Sigma V^TV\Sigma^TU^T = U(\Sigma\Sigma^T)U^T$, because $V$ is orthogonal ($V^TV = I$) and $\Sigma\Sigma^T$ is diagonal since $\Sigma$ is diagonal.
$XX^T = U(\Sigma\Sigma^T)U^T$ is an eigenvalue problem ($A = Q\Lambda Q^T$), with $A = XX^T$, $Q = U$, and $\Lambda = \Sigma\Sigma^T$, thus an eigendecomposition is used to compute $U$ (and the singular values $\sigma_i = \sqrt{\lambda_i}$).
Similarly for $V$: $X^TX = V(\Sigma^T\Sigma)V^T$, so an eigendecomposition of $X^TX$ gives $V$ (see the check below).
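A small numerical check of these relations (a sketch on a random matrix; note the eigenvalues come back in ascending order and must be reversed):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 7))             # d = 4, N = 7

U, s, Vt = np.linalg.svd(X)             # full SVD: U (4x4), s (4,), Vt (7x7)

# Eigenvalues of X X^T are the squared singular values;
# X^T X has the same non-zero eigenvalues plus N - d zeros.
lam_left = np.linalg.eigvalsh(X @ X.T)[::-1]
lam_right = np.linalg.eigvalsh(X.T @ X)[::-1]

assert np.allclose(lam_left, s**2)
assert np.allclose(lam_right[:4], s**2)
assert np.allclose(lam_right[4:], 0.0)
```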
More on the SVD
- The columns of $U$ and $V$ (not $V^T$) are the left and right singular vectors of $X$, respectively.
- Each singular vector is associated with a corresponding singular value (in the same column position) of $\Sigma$.
- If a singular value is distinct, its left and right singular vectors are unique up to negation.
- When $N \gg d$, i.e. many more samples than input dimensions (which is always the case in this course), we can trim the zero columns of $\Sigma$ (and the corresponding columns of $V$) to get the reduced SVD: $X = U\Sigma_d V_d^T$, with $\Sigma_d$ of size $d \times d$ and $V_d$ of size $N \times d$.
- Setting $r = \operatorname{rank}(X)$, we get the compact SVD, which trims even more zeros from all three matrices: $X = U_r\Sigma_r V_r^T$, with $U_r$ of size $d \times r$, $\Sigma_r$ of size $r \times r$, and $V_r$ of size $N \times r$ (sketched below).
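In NumPy, `np.linalg.svd(X, full_matrices=False)` gives the reduced SVD directly; a compact SVD can then be obtained by dropping the (numerically) zero singular values. A sketch with a deliberately rank-deficient matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 4, 100
# Build a rank-2 matrix: d x N data lying in a 2-dimensional subspace.
X = rng.normal(size=(d, 2)) @ rng.normal(size=(2, N))

# Reduced SVD: U is d x d, s has d entries, Vt is d x N.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Compact SVD: drop the (numerically) zero singular values.
r = int(np.sum(s > 1e-10))              # numerical rank, here 2
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

assert np.allclose(X, U_r @ np.diag(s_r) @ Vt_r)
print("rank:", r)
```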
Finding principal components by SVD
- We consider performing the SVD on the centered data: $D = U\Sigma V^T$.
- Plug this into the formula for $S$ and compare the result to the eigendecomposition:
$$S = \frac{1}{N}DD^T = \frac{1}{N}U\Sigma V^TV\Sigma^TU^T = U\left(\tfrac{1}{N}\Sigma\Sigma^T\right)U^T$$
Note: this has the same form as equation (2.12), $S = Q\Lambda Q^T$.
- We see that $Q = U$ and $\Lambda = \frac{1}{N}\Sigma\Sigma^T$, i.e. $\lambda_i = \sigma_i^2/N$, so we can also use the singular values to calculate the variance of each principal component (PC).
- The benefit of using the SVD over the eigendecomposition is primarily numerical stability (by avoiding explicitly forming the covariance matrix); the two routes are compared below.
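A sketch comparing the two routes on the same synthetic centered data: the principal components agree up to sign, and $\lambda_i = \sigma_i^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 3, 400
D = rng.normal(size=(d, N)) * np.array([[2.0], [1.0], [0.5]])  # synthetic data
D -= D.mean(axis=1, keepdims=True)      # centered data

# Route 1: eigendecomposition of the covariance matrix.
S = (D @ D.T) / N
lam, Q = np.linalg.eigh(S)
lam, Q = lam[::-1], Q[:, ::-1]

# Route 2: SVD of the centered data (no covariance matrix formed).
U, s, Vt = np.linalg.svd(D, full_matrices=False)

assert np.allclose(lam, s**2 / N)       # variances from singular values
# Columns agree up to sign (singular/eigen-vectors are unique only up to negation).
assert np.allclose(np.abs(U.T @ Q), np.eye(d), atol=1e-6)
```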
Discarding dimensions with the SVD
- Suppose the rank of $D$ is $r < d$, so some diagonal entries of $\Sigma$ are zero.
- We can use the compact SVD, $D = U_r\Sigma_r V_r^T$, to obtain the $r$ PCs (the columns of $U_r$).
- We can also get fewer ($k < r$) PCs by approximating $D$ with the truncated SVD: $D \approx D_k = U_k\Sigma_k V_k^T$.
- Important: $D_k$ here is an approximation to $D$, not just a matrix retaining rows/columns of $D$! (See the sketch below.)
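Finally, a sketch of the truncated SVD (the value of $k$ is arbitrary): $D_k$ is a genuinely new rank-$k$ matrix built from the top $k$ singular triplets, and $U_k^T D$ gives the $k$-dimensional scores.

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, k = 6, 200, 2
D = rng.normal(size=(d, N))
D -= D.mean(axis=1, keepdims=True)      # centered data

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Truncated SVD: keep only the top-k singular values/vectors.
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
D_k = U_k @ np.diag(s_k) @ Vt_k         # rank-k approximation to D (not a sub-matrix of D)

scores = U_k.T @ D                      # k-dimensional PCA scores
print(np.linalg.matrix_rank(D_k), np.linalg.norm(D - D_k))
```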