Find principal components, i.e. orthogonal directions computed via the SVD of a centered data matrix
Think of a cloud of data points: if we can project all of the points onto only one or two lines, we would want those lines to point in the directions of largest variance, in order to capture the most information
Rank-one affine approximation:
Basic idea:
We can use the SVD to find the line that best approximates the data points (i.e., the projection of the points onto the line gives the lowest approximation error)
Let X := [x₁, ..., xₘ] be an n×m data matrix, with each column xᵢ ∈ ℝⁿ a data point
We seek a line L := {x₀ + αp : α ∈ ℝ}, with x₀, p ∈ ℝⁿ to be determined, such that the points projected onto the line are closest to the original ones.
If all the points lie approximately on such a line, then xᵢ ≈ x₀ + qᵢp for some qᵢ ∈ ℝ, i = 1, ..., m, so that X ≈ x₀1⊺ + pq⊺,
where 1 is the m-vector of ones
This leads to the problem min_{x₀,p,q} ‖X − x₀1⊺ − pq⊺‖_F², where x₀, p ∈ ℝⁿ and q ∈ ℝᵐ are all variables
and the optimal x₀ is x̂ = (1/m)∑_{i=1}^m xᵢ
So the problem can be reduced to a standard rank-one problem min_{p,qc} ‖Xc − p qc⊺‖_F²,
where qc = q − q̂1, q̂ = (1/m)∑_{i=1}^m qᵢ, and Xc is the centered data matrix Xc = [x₁ − x̂, ..., xₘ − x̂]
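As a quick numerical illustration (a minimal numpy sketch; the variable names are my own, not from the notes), the optimal line can be read off from one SVD of the centered matrix: x₀ is the sample mean, p is the top left singular vector, and qc = σ₁v₁.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 100))         # n x m data matrix, one data point per column
    x_hat = X.mean(axis=1, keepdims=True)     # optimal offset x0 is the sample mean
    Xc = X - x_hat                            # centered data matrix

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    p = U[:, 0]                               # direction of the best-fitting line
    qc = s[0] * Vt[0, :]                      # centered scores along the line

    # best rank-one affine fit: the i-th point is approximated by x_hat + qc[i] * p
    X_approx = x_hat + np.outer(p, qc)
    err = np.linalg.norm(Xc - np.outer(p, qc), 'fro')**2
    print(err, np.sum(s[1:]**2))              # the two values agree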
PCA
Deflation:
Once we’ve found the line closest to the cloud of points, can we repeat the process and find the next-closest direction?
Deflation method:
Project data points on the hyperplane orthogonal to the direction we found
Find a new direction for the projected data.
Iterate
Stop once k directions have been found
Geometric picture of deflation: project the data onto the hyperplane orthogonal to the first line (direction u₁); then find a new line lying in that hyperplane, orthogonal to the first direction
Setup:
Via the SVD, the centered n×m data matrix can be written as Xc = X − x̂1⊺ = US̃V⊺ = ∑_{i=1}^r σᵢuᵢvᵢ⊺, where:
S̃ is an n×m diagonal matrix, S̃ = diag(σ₁, ..., σᵣ, 0, ..., 0)
U := [u₁, ..., uₙ] and V := [v₁, ..., vₘ] are orthogonal
Each point decomposes as x − x₀ = (z − x₀) + (z⊥ − x₀), with the two components orthogonal; the component orthogonal to the line with direction u is z⊥ − x₀ = (I − uu⊺)(x − x₀)
After projecting the points onto the closest line, the (centered) matrix of projected points is Z⁽¹⁾ = u₁u₁⊺Xc, while the points projected onto the hyperplane orthogonal to the line form Xc⁽¹⁾ = P₁Xc, where P₁ := I − u₁u₁⊺
Since U is orthonormal, the projected points Xc⁽¹⁾ can be written as Xc⁽¹⁾ = P₁∑_{i=1}^r σᵢuᵢvᵢ⊺ = ∑_{i=2}^r σᵢuᵢvᵢ⊺ = US̃⁽²⁾V⊺,
where S̃⁽²⁾ := diag(0, σ₂, ..., σᵣ, 0, ..., 0)
In effect we have removed the component (dyad) associated with σ₁
Note that the new data matrix is still centered, and has largest singular value σ₂, with corresponding direction u₂.
Together, u₁ and u₂ span the plane closest to the data
After k steps of the deflation process, the directions returned are u₁, ..., uₖ, spanning the closest k-dimensional subspace.
Thus we can compute all the principal components with one SVD of the original centered data matrix
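A small numpy sketch of one deflation step (variable names are mine): projecting out u₁ and taking the SVD of the deflated matrix recovers, up to sign, the second left singular vector u₂ of the original centered matrix, with top singular value σ₂.

    import numpy as np

    rng = np.random.default_rng(1)
    Xc = rng.standard_normal((4, 50))
    Xc -= Xc.mean(axis=1, keepdims=True)      # center the data

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    u1 = U[:, [0]]
    P1 = np.eye(4) - u1 @ u1.T                # projector onto the hyperplane orthogonal to u1
    Xc1 = P1 @ Xc                             # deflated (still centered) data matrix

    U2, s2, _ = np.linalg.svd(Xc1, full_matrices=False)
    print(abs(U2[:, 0] @ U[:, 1]))            # approximately 1: same direction as u2
    print(s2[0], s[1])                        # largest singular value is now sigma_2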
Approximation error:
The sum of the squared distances between the points and their projections onto the line measures the approximation error.
Explained variance:
At the first step, the total squared error between the (centered) original points and the points projected onto the closest line is ‖Xc − Z⁽¹⁾‖_F² = ‖P₁Xc‖_F² = σ₂² + ... + σₙ²
After the k-th step, the error is thus ‖Xc − Z⁽ᵏ⁾‖_F² = ‖(Pₖ⋯P₁)Xc‖_F² = σₖ₊₁² + ... + σₙ²
The explained variance is the ratio ρ² := (σ₁² + ... + σₖ²)/(σ₁² + ... + σₙ²) = 1 − ‖Xc − Z⁽ᵏ⁾‖_F²/‖Xc‖_F², with 0 ≤ ρ² ≤ 1
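In code, the explained-variance ratio for the first k components comes directly from the singular values (a sketch; the function name is my own choice).

    import numpy as np

    def explained_variance(Xc, k):
        # Xc: centered n x m data matrix
        # returns the fraction of total variance captured by the first k directions
        s = np.linalg.svd(Xc, compute_uv=False)
        return np.sum(s[:k]**2) / np.sum(s**2)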
Data projection:
Once we have computed the SVD of the centered data matrix, we can project the data onto the plane spanned by the two vectors u₁, u₂.
This plane is the plane closest to the data. The (centered) projected points are the 2D vectors forming the columns of Z = [z₁, ..., zₘ] = [u₁, u₂]⊺Xc
More generally, the span of {u₁, ..., uₖ} is the subspace of dimension k closest to the data set (this comes directly from the low-rank approximation theorem). The projected points are given by Z = [z₁, ..., zₘ] = [u₁, ..., uₖ]⊺Xc
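A sketch of the projection step, assuming the same n×m column-per-point layout used above: stack the first k left singular vectors and apply them to the centered data to get the k-dimensional scores.

    import numpy as np

    def pca_scores(X, k):
        # X: n x m data matrix with one data point per column
        x_hat = X.mean(axis=1, keepdims=True)
        Xc = X - x_hat
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        Z = U[:, :k].T @ Xc                   # k x m matrix of projected (centered) points
        return Z, U[:, :k], x_hat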
Link between SVD and eigenvalue decomposition of covariance:
Consider a centered matrix of data points Xc
The covariance matrix can be written as C = (1/m)∑_{i=1}^m (xᵢ − x̂)(xᵢ − x̂)⊺ = (1/m)XcXc⊺, where x̂ := (1/m)∑_{i=1}^m xᵢ
If Xc has SVD Xc = US̃V⊺, then C = UΛU⊺, where Λ := (1/m)diag(σ₁², ..., σᵣ², 0, ..., 0)
Thus, the left singular vectors of Xc are the eigenvectors of C.
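A quick numerical check of this link (a sketch; variable names are mine): the eigenvalues of C are σᵢ²/m, and its leading eigenvector matches u₁ up to sign.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((3, 200))
    Xc = X - X.mean(axis=1, keepdims=True)
    m = Xc.shape[1]

    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    C = (Xc @ Xc.T) / m                       # sample covariance matrix
    evals, evecs = np.linalg.eigh(C)          # eigh returns eigenvalues in ascending order

    print(np.allclose(np.sort(s**2 / m), evals))   # eigenvalues of C are sigma_i^2 / m
    print(abs(evecs[:, -1] @ U[:, 0]))             # approximately 1: top eigenvector is u1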
Variance Maximization problem:
Let C be the sample n×n covariance matrix for a data set of m points x₁, ..., xₘ ∈ ℝⁿ with mean x̂ = (1/m)(x₁ + ... + xₘ)
For a given n-vector u, consider the line L(x̂, u) of points of the form x̂ + αu, α ∈ ℝ.
The projections of the data points onto this line are zᵢ = x̂ + αᵢu, with αᵢ = u⊺(xᵢ − x̂)
We seek a direction u such that the variance of the points projected onto the line L(x̂, u) is maximal. Without loss of generality, we can assume ‖u‖₂ = 1
The variance of the scores (the m-vector α) is (1/m)∑_{i=1}^m [u⊺(xᵢ − x̂)]² = u⊺Cu
We thus solve max_u u⊺Cu subject to ‖u‖₂ = 1
Solution: an optimal vector is u* = u₁, with u₁ an eigenvector of C corresponding to its largest eigenvalue λ₁
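A sketch of the variance-maximization route (names are mine): take the top eigenvector of C; the variance of the scores along it equals λ₁.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.standard_normal((4, 300))
    Xc = X - X.mean(axis=1, keepdims=True)
    C = (Xc @ Xc.T) / Xc.shape[1]             # sample covariance matrix

    evals, evecs = np.linalg.eigh(C)
    u1 = evecs[:, -1]                         # unit eigenvector for the largest eigenvalue
    alpha = u1 @ Xc                           # scores of the projected points
    print(np.mean(alpha**2), evals[-1])       # variance of the scores equals lambda_1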
Projection on maximum variance line:
The line of maximum variance is the set of points of the form x̂ + αu₁, α ∈ ℝ
The projection of a generic point x ∈ ℝⁿ onto the line is z = x̂ + (u₁⊺(x − x̂))u₁ = x̂ + u₁u₁⊺(x − x̂)
We can project all the data points at once: the (centered) matrix of projected points is the dyad Z⁽¹⁾ = [z₁⁽¹⁾ − x̂, ..., zₘ⁽¹⁾ − x̂] = u₁u₁⊺Xc
The line of maximum variance also minimizes the sum of squared Euclidean distances between the data points and the line.
In fact, PCA can be interpreted in two ways:
as a variance maximization process, relying on the eigenvalue decomposition of the covariance matrix;
as a low-rank approximation of the centered data matrix, relying on the SVD of that matrix
Minimum distance line:
The projection z of x onto the line {x̂ + αu : α ∈ ℝ} is z = x̂ + uu⊺(x − x̂)
So the sum of squared distances between the points x⁽ⁱ⁾ and their projections z⁽ⁱ⁾ is ∑_{i=1}^m ‖x⁽ⁱ⁾ − z⁽ⁱ⁾‖₂² = ∑_{i=1}^m ‖(I − uu⊺)xc⁽ⁱ⁾‖₂² = ∑_{i=1}^m ‖xc⁽ⁱ⁾‖₂² − ∑_{i=1}^m (u⊺xc⁽ⁱ⁾)², where xc⁽ⁱ⁾ := x⁽ⁱ⁾ − x̂
Hence, minimizing the sum of squared distances is the same as maximizing (1/m)∑ᵢ(u⊺xc⁽ⁱ⁾)², that is, the variance of the projected points.
Rank-one approximation:
Consider the centered data matrix Xc = [xc⁽¹⁾ ⋯ xc⁽ᵐ⁾] and assume we want to find the rank-one matrix σuv⊺ that best approximates Xc in the Frobenius norm sense: min_{σ,u,v} ‖Xc − σuv⊺‖_F², subject to σ ≥ 0, ‖u‖₂ = ‖v‖₂ = 1
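The minimizer (by the low-rank approximation theorem mentioned earlier) is the leading singular triple of Xc, i.e., σ = σ₁, u = u₁, v = v₁. A minimal numpy sketch (variable names are mine) checks this by comparing the SVD-based candidate against random feasible (σ, u, v) triples.

    import numpy as np

    rng = np.random.default_rng(5)
    Xc = rng.standard_normal((5, 40))
    Xc -= Xc.mean(axis=1, keepdims=True)

    def objective(sigma, u, v):
        return np.linalg.norm(Xc - sigma * np.outer(u, v), 'fro')**2

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    best = objective(s[0], U[:, 0], Vt[0, :])     # SVD candidate: sigma_1, u1, v1

    for _ in range(5):
        u = rng.standard_normal(5); u /= np.linalg.norm(u)
        v = rng.standard_normal(40); v /= np.linalg.norm(v)
        sigma = max(u @ Xc @ v, 0.0)              # best sigma for this fixed (u, v)
        print(objective(sigma, u, v) >= best)     # True: the SVD candidate does at least as well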