From the SVD result: an m×n matrix has rank at most k ≤ min(m, n) if and only if it can be written as PQ⊺, with P = [p1, ..., pk] ∈ Rm×k and Q = [q1, ..., qk] ∈ Rn×k. Such a matrix can then be expressed as a sum of k dyads: PQ⊺ = ∑_{i=1}^k pi qi⊺
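As a quick sanity check, here is a minimal numpy sketch of this identity; the sizes and the random P, Q are arbitrary choices made only for illustration:

```python
import numpy as np

# Arbitrary sizes and random factors, purely for illustration.
m, n, k = 6, 5, 2
rng = np.random.default_rng(0)
P = rng.standard_normal((m, k))   # P = [p1, ..., pk]
Q = rng.standard_normal((n, k))   # Q = [q1, ..., qk]

A = P @ Q.T                       # matrix of rank <= k
dyad_sum = sum(np.outer(P[:, i], Q[:, i]) for i in range(k))

print(np.allclose(A, dyad_sum))   # True: PQ^T = sum_i pi qi^T
print(np.linalg.matrix_rank(A))   # 2, since P and Q have full column rank here
```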
Geometric interpretation:
If an m×n matrix A can be written as a single dyad, A = pq⊺, this means that every column (and row) is proportional to a single vector: Aej = (q⊺ej) p for j = 1, ..., n,
where p ∈ Rm and q ∈ Rn,
and ej is the j-th unit vector in Rn.
If each column represents a data point, then all these points lie on a line through 0, namely span{p}.
Sum of Dyads:
If A can be expressed as a sum of k dyads, then each column is a linear combination of k vectors:
Aej = ∑_{i=1}^k (qi⊺ej) pi, j = 1, ..., n
If each column represents a data point, all these points lie in a subspace, span{p1, ..., pk}. In practice, it is better if the pi’s are independent, as then each dyad brings in “new information”.
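A small numpy sketch of this picture; the random P, Q are illustrative, and the projector onto span{p1, ..., pk} is built with a pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 8, 2
P = rng.standard_normal((m, k))   # the pi's are independent with probability 1
Q = rng.standard_normal((n, k))
A = P @ Q.T                       # column j is A ej = sum_i (qi^T ej) pi

# Orthogonal projector onto span{p1, ..., pk}: it leaves every column of A unchanged.
proj = P @ np.linalg.pinv(P)
print(np.allclose(proj @ A, A))   # True: all columns lie in span{p1, ..., pk}
print(np.linalg.matrix_rank(A))   # 2: each independent dyad adds new information
```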
Approximating A with a matrix of rank k ≪ m, n creates a new dataset whose information is close enough to that of the original dataset.
We use the SVD because it provides the closest subspace in the Frobenius/Euclidean norm, and it does so cheaply, with independent (orthogonal) components.
Let A ∈ Rm×n be a given matrix, with rank(A) = r > 0.
Consider the problem of approximating A with a matrix of lower rank: min_{Ak ∈ Rm×n} ||A − Ak||_F² subject to rank(Ak) = k, where 1 ≤ k ≤ r.
Let A = U S V⊺ = ∑_{i=1}^r σi ui vi⊺ be a (compact) SVD of A.
Then keep only the first k terms of the sum: Ak = ∑_{i=1}^k σi ui vi⊺
Define the ratio of retained to total information: ηk = ||Ak||_F² / ||A||_F² = (σ1² + ... + σk²) / (σ1² + ... + σr²)
Then the relative (squared) norm approximation error is: ek = ||A − Ak||_F² / ||A||_F² = (σ_{k+1}² + ... + σr²) / (σ1² + ... + σr²) = 1 − ηk
Plotting ηk as a function of k shows which value of k is good enough.
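A numpy sketch of the truncation and of the ηk / ek ratios; the test matrix, the chosen values of k, and the helper name rank_k_approx are assumptions made only for illustration:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A in Frobenius norm, via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 30))
s = np.linalg.svd(A, compute_uv=False)

total = np.sum(s**2)                          # ||A||_F^2 = sigma_1^2 + ... + sigma_r^2
for k in (1, 5, 10, 20, 30):
    Ak = rank_k_approx(A, k)
    eta_k = np.sum(s[:k]**2) / total          # retained / total information
    e_k = np.linalg.norm(A - Ak, 'fro')**2 / total
    print(k, round(eta_k, 3), round(e_k, 3))  # e_k = 1 - eta_k (up to rounding)
```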
Low-dimensional representation of data:
Given an n×m data matrix X = [x1, …, xm] and the SVD-based rank-k approximation X~ to X, we have X = ∑_{i=1}^r σi ui vi⊺ = U S V⊺ ≈ X~ = ∑_{i=1}^k σi ui vi⊺ = U~ S~ V~⊺, where:
U~ = [u1, …, uk] ∈ Rn×k
V~ = [v1, …, vk] ∈ Rm×k
S~=diag(σ1,…,σk)∈Rk×k
Thus, projecting data points: for any data point xj, xj = X ej ≈ xj′ := X~ ej = U~ S~ V~⊺ ej = U~ hj, where:
hj := S~ V~⊺ ej = (σi vi⊺ ej)_{1≤i≤k} ∈ Rk
Vector hj is a low-dimensional representation of data point xj. More compactly: X ≈ U~ H, with H = S~ V~⊺ ∈ Rk×m
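A numpy sketch of this low-dimensional representation; the synthetic near-rank-k data matrix and the variable names are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 20, 100, 3                       # n features, m data points, target dimension k
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))   # exactly rank-k data ...
X += 0.01 * rng.standard_normal((n, m))                         # ... plus a little noise

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_t, S_t, Vt_t = U[:, :k], np.diag(s[:k]), Vt[:k, :]            # U~, S~, V~^T

H = S_t @ Vt_t                             # H = S~ V~^T: one k-dimensional column hj per point
print(H.shape)                             # (3, 100), i.e. k x m
print(np.linalg.norm(X - U_t @ H) / np.linalg.norm(X))          # small: X ~ U~ H
```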
We may also project features (the rows fi⊺ of X, 1 ≤ i ≤ n), using the same derivation as before with X⊺ in place of X: writing X as the stack of rows f1⊺, ..., fn⊺, we get fi = X⊺ ei ≈ V~ S~⊺ U~⊺ ei, 1 ≤ i ≤ n
More compactly: X⊺ ≈ V~ H, with H = S~⊺ U~⊺ ∈ Rk×n
SVD-based auto-encoder:
For any data point: xj ≈ X~ ej = U~ hj, with hj := S~ V~⊺ ej ∈ Rk
Thus, any data point can be encoded into a lower-dimensional vector.
Given a low-dimensional representation h of x, we can decode, i.e. go back to the approximation x′, via x′ = U~ h
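A minimal encode/decode sketch in numpy, assuming a helper fit_svd_autoencoder that is not a library function; for a column xj of X, encoding gives U~⊺ xj = S~ V~⊺ ej = hj:

```python
import numpy as np

def fit_svd_autoencoder(X, k):
    """Illustrative helper: build encode/decode maps from the rank-k truncated SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_t = U[:, :k]                          # U~ = [u1, ..., uk]

    def encode(x):
        return U_t.T @ x                    # h = U~^T x (= S~ V~^T ej when x is column j of X)

    def decode(h):
        return U_t @ h                      # x' = U~ h

    return encode, decode

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 100))   # rank-5 data matrix
encode, decode = fit_svd_autoencoder(X, k=5)

x = X[:, 0]
h = encode(x)                               # k-dimensional code
x_rec = decode(h)                           # decoded approximation x'
print(h.shape, np.linalg.norm(x - x_rec))   # (5,) and ~0, since rank(X) = k here
```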
Power iteration:
Solve the problem min_{p,q} ||A − pq⊺||_F², p ∈ Rn, q ∈ Rm, by alternating over p and q.
Each step of the alternation has a simple closed-form solution (e.g., for fixed q, the optimal p is Aq/||q||₂²).
Convergence is improved when the method is formulated as an iteration over normalized vectors: setting u = p/||p||₂ and v = q/||q||₂, the updates become
u = Av/||Av||₂ and v = A⊺u/||A⊺u||₂
Works extremely well for large, sparse A, since each update only needs matrix-vector products and the zero entries can be skipped.
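A minimal numpy sketch of the normalized iteration (random dense test matrix; the function name is illustrative). Each update only needs products with A and A⊺, which is what makes it attractive for large sparse A:

```python
import numpy as np

def top_singular_pair(A, iters=100):
    """Power-iteration sketch for the leading singular triple (u, sigma, v) of A."""
    rng = np.random.default_rng(5)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = A @ v
        u /= np.linalg.norm(u)              # u = Av / ||Av||_2
        v = A.T @ u
        v /= np.linalg.norm(v)              # v = A^T u / ||A^T u||_2
    sigma = u @ (A @ v)                     # estimate of the top singular value
    return u, sigma, v

A = np.random.default_rng(6).standard_normal((200, 100))
u, sigma, v = top_singular_pair(A)
print(abs(sigma - np.linalg.svd(A, compute_uv=False)[0]))   # ~0: matches sigma_1
```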