Definition:

  • Find principal components, i.e., orthogonal directions computed via the SVD of a centered data matrix
  • Think of a cloud of data points: if we can only project all of the points onto a line (or onto a few directions), we want that line to point in the direction of largest variance, so as to capture the most information (a minimal numerical sketch follows this list)
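
A minimal sketch of this idea, assuming NumPy; the synthetic data and variable names (X, Xc, U, s, Vt) are illustrative, not from the original notes. It centers a small data matrix whose columns are points and reads the principal directions off the SVD.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2, 100))          # data matrix: each column is a point in R^2
    X[1] += 2.0 * X[0]                     # correlate coordinates so one direction dominates

    x_hat = X.mean(axis=1, keepdims=True)  # sample mean of the columns
    Xc = X - x_hat                         # centered data matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    print("principal directions (columns of U):", U)
    print("singular values:", s)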

Rank-one affine approximation:

Basic idea:
  • We can use the SVD to find the line passing through 0 that best approximates the data points (i.e., the projection of the points onto the line gives the lowest approximation error)
  • This is a rank-one approximation, but here we want it to be affine (the line need not pass through 0)
Main result:
  • Just center the data
  • Let $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$ be a data matrix, with each column a data point
  • We seek a line $\{x_0 + t u : t \in \mathbb{R}\}$, with $x_0 \in \mathbb{R}^n$ and $u \in \mathbb{R}^n$ ($\|u\|_2 = 1$) to be determined, so that the projected points on the line are closest to the original ones
  • If all the points are approximately on such a line, then $x_i \approx x_0 + t_i u$ for some scalars $t_i$, so that $X \approx x_0 \mathbf{1}^T + u t^T$
    • where $\mathbf{1}$ is the m-vector of ones and $t = (t_1, \ldots, t_m)$
  • This leads to the problem $\min_{x_0, u, t} \|X - x_0 \mathbf{1}^T - u t^T\|_F^2$, where $x_0$, $u$, and $t$ are all variables
    • and the optimal $x_0$ is the sample mean $\hat{x} = \frac{1}{m} X \mathbf{1}$
  • So the problem can be reduced to a standard rank-one approximation problem $\min_{u, t} \|X_c - u t^T\|_F^2$
    • where $u$ and $t$ are the variables and $X_c = X - \hat{x} \mathbf{1}^T$ is the centered data matrix (a numerical sketch follows this list)
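
The main result can be checked numerically; this is a hedged sketch under the column-per-point convention above (the synthetic line and the names x_hat, Xc, u1, t are my own choices).

    import numpy as np

    rng = np.random.default_rng(1)
    # noisy points near an affine line in R^2: x_i = x0 + t_i * direction + noise
    X = np.outer([1.0, 2.0], np.linspace(-3, 3, 50)) + np.array([[5.0], [1.0]])
    X += 0.05 * rng.normal(size=X.shape)

    x_hat = X.mean(axis=1, keepdims=True)   # optimal offset x0 = sample mean
    Xc = X - x_hat                          # centered data matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    u1 = U[:, 0]                            # direction of the best-fitting line
    t = s[0] * Vt[0]                        # scores t_i along the line
    X_fit = x_hat + np.outer(u1, t)         # projected points x_hat + t_i * u1
    print("mean squared residual:", np.mean(np.sum((X - X_fit) ** 2, axis=0)))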

PCA

Deflation:
  • Once we have found the line closest to the cloud of points, can we repeat the process and find a second direction that is closest to the remaining data?
  • Deflation method:
    • Project the data points on the hyperplane orthogonal to the direction we found
    • Find a new closest direction for the projected data
    • Iterate
    • Stop when the desired number of directions (dimensions) has been found
  • Geometric view of deflation: project the data on the hyperplane orthogonal to the line with direction $u_1$; then find a second line lying in that hyperplane, hence orthogonal to the first direction
  • Setup:
    • After the SVD, we have the centered data matrix $X_c = U \tilde{\Sigma} V^T = \sum_{i=1}^r \sigma_i u_i v_i^T$, where:
      • $\tilde{\Sigma}$ is a diagonal matrix of singular values $\sigma_1 \ge \cdots \ge \sigma_r > 0$
      • $U$ and $V$ are orthogonal
  • Euclidean projection on a line: the projection of a point $x$ on the line $\{t u_1 : t \in \mathbb{R}\}$ is given by $(u_1^T x)\, u_1 = u_1 u_1^T x$
    • we have $u_1^T X_c = \sigma_1 v_1^T$, where $X_c = \sum_{i=1}^r \sigma_i u_i v_i^T$; therefore, $u_1 u_1^T X_c = \sigma_1 u_1 v_1^T$
  • After projecting the points on the closest line, the (centered) matrix of projected points is $\sigma_1 u_1 v_1^T$, while the points projected on the hyperplane orthogonal to the line are $(I - u_1 u_1^T) X_c$, where $I$ is the identity matrix
  • Since $U$ is orthogonal (the $u_i$ are orthonormal), the projected points can be written as $X_c^{(2)} := (I - u_1 u_1^T) X_c = \sum_{i=2}^r \sigma_i u_i v_i^T$
    • where we used $u_1^T u_1 = 1$ and $u_1^T u_i = 0$ for $i \ge 2$
  • In effect we have removed the component (dyad) $\sigma_1 u_1 v_1^T$ associated with the largest singular value $\sigma_1$
  • Note that the new data matrix $X_c^{(2)}$ is centered, and has largest singular value $\sigma_2$, with corresponding direction $u_2$
    • Together, $u_1, u_2$ span the plane closest to the data
  • After $k$ steps of the deflation process, the directions returned are $u_1, \ldots, u_k$, spanning the closest $k$-dimensional subspace
    • Thus we can compute all the principal components with a single SVD of the original centered data matrix (see the sketch after this list)
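
As a hedged illustration of the claim above (the setup and the names Xc, Xc2 are my own), the snippet runs one explicit deflation step and checks that the top direction of the deflated matrix matches the second left singular vector of the original centered data, with top singular value $\sigma_2$.

    import numpy as np

    rng = np.random.default_rng(2)
    Xc = rng.normal(size=(4, 30))
    Xc -= Xc.mean(axis=1, keepdims=True)       # centered data matrix

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    u1 = U[:, 0]

    Xc2 = (np.eye(4) - np.outer(u1, u1)) @ Xc  # deflation: project onto hyperplane orthogonal to u1
    U2, s2, _ = np.linalg.svd(Xc2, full_matrices=False)

    # Top direction of the deflated data equals u2 (up to sign); its singular value equals sigma_2.
    print(np.isclose(abs(U2[:, 0] @ U[:, 1]), 1.0), np.isclose(s2[0], s[1]))
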
Approximation error:
  • The sum of the squared distances between the points and their projections on the line measures the approximation error
  • Explained variance:
    • At the first step, the average squared error between the (centered) original points and the points projected on the closest line is $\frac{1}{m} \|X_c - \sigma_1 u_1 v_1^T\|_F^2 = \frac{1}{m} \sum_{i=2}^r \sigma_i^2$
  • After the $k$-th step, the error is thus $\frac{1}{m} \sum_{i=k+1}^r \sigma_i^2$
  • The explained variance is the ratio $\left( \sum_{i=1}^k \sigma_i^2 \right) / \left( \sum_{i=1}^r \sigma_i^2 \right)$ (computed numerically in the sketch below)
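
A minimal numerical sketch of these formulas, assuming NumPy; the cutoff k = 2 and the random data are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)
    Xc = rng.normal(size=(5, 200))
    Xc -= Xc.mean(axis=1, keepdims=True)      # centered data matrix

    s = np.linalg.svd(Xc, compute_uv=False)   # singular values
    k, m = 2, Xc.shape[1]

    avg_error_k = np.sum(s[k:] ** 2) / m      # average squared residual after k steps
    explained = np.sum(s[:k] ** 2) / np.sum(s ** 2)
    print(avg_error_k, explained)
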
Data projection:
  • Once we have computed the SVD of the centered data matrix, we can project the data on the plane spanned by the two vectors $u_1, u_2$.
  • This plane is the plane closest to the data. The (centered) projected points are the 2D vectors $p_i = [u_1, u_2]^T (x_i - \hat{x}) \in \mathbb{R}^2$, i.e., the columns of $[u_1, u_2]^T X_c$
  • More generally, the span of $u_1, \ldots, u_k$ is the subspace of dimension $k$ closest to the data set (this comes directly from the low-rank approximation theorem). The projected points are given by the columns of $[u_1, \ldots, u_k]^T X_c \in \mathbb{R}^{k \times m}$
  • Consider a centered matrix of data points $X_c = [x_1 - \hat{x}, \ldots, x_m - \hat{x}]$
  • The covariance matrix can be written as $\Sigma = \frac{1}{m} X_c X_c^T$
  • If $X_c$ has SVD $X_c = U \tilde{\Sigma} V^T$, then $\Sigma = \frac{1}{m} U \tilde{\Sigma}^2 U^T = U \Lambda U^T$, where $\Lambda = \frac{1}{m} \tilde{\Sigma}^2$ is diagonal
  • Thus, the left singular vectors of $X_c$ are the eigenvectors of the covariance matrix $\Sigma$ (a numerical check follows this list)
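
The covariance/SVD link can be verified numerically; a short sketch (with illustrative variable names and random data) follows.

    import numpy as np

    rng = np.random.default_rng(4)
    Xc = rng.normal(size=(3, 500))
    Xc -= Xc.mean(axis=1, keepdims=True)
    m = Xc.shape[1]

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Sigma = Xc @ Xc.T / m                      # sample covariance matrix

    lam, W = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    lam, W = lam[::-1], W[:, ::-1]             # reorder to descending

    print(np.allclose(lam, s ** 2 / m))                         # eigenvalues are sigma_i^2 / m
    print(np.allclose(np.abs(W.T @ U), np.eye(3), atol=1e-6))   # same directions, up to sign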

Variance Maximization problem:

  • Let $\Sigma = \frac{1}{m} \sum_{i=1}^m (x_i - \hat{x})(x_i - \hat{x})^T$ be the sample covariance matrix for a data set of $m$ points $x_1, \ldots, x_m$ with mean $\hat{x} = \frac{1}{m} \sum_{i=1}^m x_i$
  • For a given $n$-vector $u$, consider the line of points of the form $\hat{x} + t u$
    • for $t \in \mathbb{R}$
    • knowing that the projected data points are of the form $\hat{x} + t_i u$, with $t_i = u^T (x_i - \hat{x})$, for $i = 1, \ldots, m$
  • We seek a direction $u$ such that the variance of the points projected on the line is maximal. Without loss of generality, we can assume $\|u\|_2 = 1$
  • The variance of the scores (the $m$-vector $t = (t_1, \ldots, t_m)$) is $\frac{1}{m} \sum_{i=1}^m t_i^2 = \frac{1}{m} \sum_{i=1}^m (u^T (x_i - \hat{x}))^2 = u^T \Sigma u$
  • We wish to maximize $u^T \Sigma u$ subject to $\|u\|_2 = 1$
  • Solution: an optimal vector is $u_1$, an eigenvector of $\Sigma$ corresponding to its largest eigenvalue $\lambda_1$
  • Projection on maximum variance line:
    • The line with maximum variance is the set of points of the form $\hat{x} + t u_1$, $t \in \mathbb{R}$
    • The projection of a generic point $x$ on the line is $\hat{x} + (u_1^T (x - \hat{x}))\, u_1$
    • We can project all the data points at once: the (centered) matrix of projected points is the dyad $u_1 u_1^T X_c = u_1 t^T$, with $t_i = u_1^T (x_i - \hat{x})$
    • The line of maximum variance also minimizes the sum of squared Euclidean distances between the data points and the line.
    • In fact, PCA can be interpreted in two ways (a numerical sketch of the variance view follows this list):
      • as a variance maximization process, relying on the eigenvalue decomposition of the covariance matrix;
      • as a low-rank approximation of the centered data matrix, relying on the SVD of that matrix
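
A hedged sketch of the variance-maximization view: it compares the projected variance along the top eigenvector of the sample covariance with the variance along a few random unit directions (the synthetic data and comparison setup are my own illustration).

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                                cov=np.diag([9.0, 1.0, 0.2]), size=400).T  # columns are points

    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = Xc @ Xc.T / X.shape[1]          # sample covariance matrix

    lam, W = np.linalg.eigh(Sigma)
    u1 = W[:, -1]                           # eigenvector for the largest eigenvalue

    def projected_variance(u):
        t = u @ Xc                          # scores t_i = u^T (x_i - x_hat)
        return np.mean(t ** 2)              # variance of the scores (they have zero mean)

    print(projected_variance(u1))           # roughly the largest eigenvalue of Sigma
    for _ in range(3):
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)
        print(projected_variance(v), "<=", projected_variance(u1))
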

Minimum distance line:

  • The projection of the centered point $x_i - \hat{x}$ along the line with (unit-norm) direction $u$ is $(u^T (x_i - \hat{x}))\, u$
  • So the sum of squared distances between the points and their projections is $\sum_{i=1}^m \|x_i - \hat{x}\|_2^2 - \sum_{i=1}^m (u^T (x_i - \hat{x}))^2$
  • Hence, minimizing the sum of squared distances is the same as maximizing $\sum_{i=1}^m (u^T (x_i - \hat{x}))^2 = m\, u^T \Sigma u$, that is, the variance of the projected points (checked numerically below)
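
A small numerical check of the identity above, for an arbitrary unit-norm direction on illustrative random data.

    import numpy as np

    rng = np.random.default_rng(6)
    Xc = rng.normal(size=(3, 100))
    Xc -= Xc.mean(axis=1, keepdims=True)             # centered points as columns

    u = rng.normal(size=3)
    u /= np.linalg.norm(u)                           # any unit-norm direction

    proj = np.outer(u, u @ Xc)                       # projections (u^T (x_i - x_hat)) u
    dist2 = np.sum((Xc - proj) ** 2)                 # sum of squared distances to the line
    check = np.sum(Xc ** 2) - np.sum((u @ Xc) ** 2)  # squared norms minus squared scores
    print(np.isclose(dist2, check))                  # Pythagoras identity holds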

Rank-one approximation:

  • Consider the centered data matrix $X_c$ and assume we want to find the rank-one matrix $u t^T$ (with $u \in \mathbb{R}^n$, $t \in \mathbb{R}^m$) that best approximates $X_c$ in the Frobenius norm sense: $\min_{u, t} \|X_c - u t^T\|_F^2$
  • Since $\|X_c - u t^T\|_F^2 = \sum_{i=1}^m \|(x_i - \hat{x}) - t_i u\|_2^2$, we have: for fixed unit-norm $u$, the optimal scores are $t_i = u^T (x_i - \hat{x})$, and the problem reduces to the minimum distance line problem above
  • Given the SVD $X_c = \sum_{i=1}^r \sigma_i u_i v_i^T$, the best choice is to take $u = u_1$ and $t = \sigma_1 v_1$, whence $u t^T = \sigma_1 u_1 v_1^T$, etc. (a numerical sketch follows)
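
A minimal sketch of this result, assuming NumPy; it checks that the SVD-based dyad attains Frobenius error $\left( \sum_{i \ge 2} \sigma_i^2 \right)^{1/2}$, as the low-rank approximation theorem predicts.

    import numpy as np

    rng = np.random.default_rng(7)
    Xc = rng.normal(size=(4, 60))
    Xc -= Xc.mean(axis=1, keepdims=True)           # centered data matrix

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rank_one = s[0] * np.outer(U[:, 0], Vt[0])     # sigma_1 u_1 v_1^T

    err = np.linalg.norm(Xc - rank_one, "fro")
    print(np.isclose(err, np.sqrt(np.sum(s[1:] ** 2))))   # residual = sqrt(sigma_2^2 + ... + sigma_r^2)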

The left singular vector $u_1$:

  • The three problems:
    • Find the line of maximal projected data variance;
    • Find the line that minimizes the residual distance from the original points to their projections;
    • Find a rank-one approximation of the centered data matrix;
  • All have solutions that involve the principal left singular vector $u_1$ of the centered data matrix $X_c$.
  • N.B.: this is also the principal eigenvector of the covariance matrix $\Sigma = \frac{1}{m} X_c X_c^T$. Indeed, if $X_c = U \tilde{\Sigma} V^T$, then $\Sigma = U \Lambda U^T$ with $\Lambda = \frac{1}{m} \tilde{\Sigma}^2$
    • which is an eigenvalue factorization for $\Sigma$