From the SVD result: an m×n matrix has rank at most k ≤ min(m, n) if and only if it can be written as PQ⊺, with P = [p1, ..., pk] ∈ Rm×k and Q = [q1, ..., qk] ∈ Rn×k. Such a matrix can then be expressed as a sum of k dyads: PQ⊺ = ∑_{i=1}^k pi qi⊺
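As a quick sanity check, here is a minimal numpy sketch of this identity; the sizes and the random P, Q are arbitrary choices made only for illustration:

```python
import numpy as np

# Arbitrary sizes and random factors, purely for illustration.
m, n, k = 6, 5, 2
rng = np.random.default_rng(0)
P = rng.standard_normal((m, k))   # P = [p1, ..., pk]
Q = rng.standard_normal((n, k))   # Q = [q1, ..., qk]

A = P @ Q.T                       # matrix of rank <= k
dyad_sum = sum(np.outer(P[:, i], Q[:, i]) for i in range(k))

print(np.allclose(A, dyad_sum))   # True: PQ^T = sum_i pi qi^T
print(np.linalg.matrix_rank(A))   # 2, since P and Q have full column rank here
```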
Geometric interpretation:
If an m×n matrix A can be written as a single dyad, A = pq⊺, this means that every column (and row) is proportional to a single vector: Aej = (q⊺ej) p for j = 1, ..., n,
where p ∈ Rm and q ∈ Rn,
and ej is the j-th unit vector in Rn.
If each column represents a data point, then all these points lie on a line through 0, namely span{p}.
Sum of Dyads:
If A can be expressed as a sum of k dyads, then each column is a linear combination of k vectors:
Aej = ∑_{i=1}^k (qi⊺ej) pi, j = 1, ..., n
If each column represents a data point, all these points lie in a subspace, span{p1, ..., pk}. In practice, it is better if the pi’s are independent, as then each dyad brings in “new information”.
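A small numpy sketch of this picture; the random P, Q are illustrative, and the projector onto span{p1, ..., pk} is built with a pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 8, 2
P = rng.standard_normal((m, k))   # the pi's are independent with probability 1
Q = rng.standard_normal((n, k))
A = P @ Q.T                       # column j is A ej = sum_i (qi^T ej) pi

# Orthogonal projector onto span{p1, ..., pk}: it leaves every column of A unchanged.
proj = P @ np.linalg.pinv(P)
print(np.allclose(proj @ A, A))   # True: all columns lie in span{p1, ..., pk}
print(np.linalg.matrix_rank(A))   # 2: each independent dyad adds new information
```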
Approximating A with a matrix of rank k ≪ m, n creates a new dataset whose information is close enough to that of the original dataset.
We use the SVD because it provides the closest subspace in the Frobenius/Euclidean norm, and it does so cheaply, with independent (orthogonal) components.
Let A ∈ Rm×n be a given matrix, with rank(A) = r > 0.
Consider the problem of approximating A with a matrix of lower rank: min_{Ak ∈ Rm×n} ||A − Ak||_F² subject to rank(Ak) = k, where 1 ≤ k ≤ r.
Let A = U S V⊺ = ∑_{i=1}^r σi ui vi⊺ be a (compact) SVD of A.
Then keep only the first k terms of the sum: Ak = ∑_{i=1}^k σi ui vi⊺
Define the ratio of retained to total information: ηk = ||Ak||_F² / ||A||_F² = (σ1² + ... + σk²) / (σ1² + ... + σr²)
Then the relative (squared) norm approximation error is: ek = ||A − Ak||_F² / ||A||_F² = (σ_{k+1}² + ... + σr²) / (σ1² + ... + σr²) = 1 − ηk
Plotting ηk as a function of k shows which value of k is good enough.
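A numpy sketch of the truncation and of the ηk / ek ratios; the test matrix, the chosen values of k, and the helper name rank_k_approx are assumptions made only for illustration:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A in Frobenius norm, via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 30))
s = np.linalg.svd(A, compute_uv=False)

total = np.sum(s**2)                          # ||A||_F^2 = sigma_1^2 + ... + sigma_r^2
for k in (1, 5, 10, 20, 30):
    Ak = rank_k_approx(A, k)
    eta_k = np.sum(s[:k]**2) / total          # retained / total information
    e_k = np.linalg.norm(A - Ak, 'fro')**2 / total
    print(k, round(eta_k, 3), round(e_k, 3))  # e_k = 1 - eta_k (up to rounding)
```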
Low-dimensional representation of data:
Given an n×m data matrix X = [x1, …, xm] and the SVD-based rank-k approximation X~ to X, we have X = ∑_{i=1}^r σi ui vi⊺ = U S V⊺ ≈ X~ = ∑_{i=1}^k σi ui vi⊺ = U~ S~ V~⊺, where:
U~ = [u1, …, uk] ∈ Rn×k
V~ = [v1, …, vk] ∈ Rm×k
S~=diag(σ1,…,σk)∈Rk×k
Thus, projecting data points: for any data point xj, xj = X ej ≈ xj′ := X~ ej = U~ S~ V~⊺ ej = U~ hj, where:
hj := S~ V~⊺ ej = (σi vi⊺ ej)_{1≤i≤k} ∈ Rk
Vector hj is a low-dimensional representation of data point xj. More compactly: X ≈ U~ H, with H = S~ V~⊺ ∈ Rk×m
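A numpy sketch of this low-dimensional representation; the synthetic near-rank-k data matrix and the variable names are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 20, 100, 3                       # n features, m data points, target dimension k
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))   # exactly rank-k data ...
X += 0.01 * rng.standard_normal((n, m))                         # ... plus a little noise

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_t, S_t, Vt_t = U[:, :k], np.diag(s[:k]), Vt[:k, :]            # U~, S~, V~^T

H = S_t @ Vt_t                             # H = S~ V~^T: one k-dimensional column hj per point
print(H.shape)                             # (3, 100), i.e. k x m
print(np.linalg.norm(X - U_t @ H) / np.linalg.norm(X))          # small: X ~ U~ H
```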
We may also project features (the rows fi⊺ of X, 1 ≤ i ≤ n), using the same derivation as before with X⊺ in place of X: writing X as the stack of rows f1⊺, ..., fn⊺, we get fi = X⊺ ei ≈ V~ S~⊺ U~⊺ ei, 1 ≤ i ≤ n
More compactly: X⊺ ≈ V~ H, with H = S~⊺ U~⊺ ∈ Rk×n
SVD-based auto-encoder:
For any data point: xj ≈ X~ ej = U~ hj, with hj := S~ V~⊺ ej ∈ Rk
Thus, any data point can be encoded into a lower-dimensional vector.
Given a low-dimensional representation h of x, we can decode, i.e. go back to the approximation x′, via x′ = U~ h
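A minimal encode/decode sketch in numpy, assuming a helper fit_svd_autoencoder that is not a library function; for a column xj of X, encoding gives U~⊺ xj = S~ V~⊺ ej = hj:

```python
import numpy as np

def fit_svd_autoencoder(X, k):
    """Illustrative helper: build encode/decode maps from the rank-k truncated SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_t = U[:, :k]                          # U~ = [u1, ..., uk]

    def encode(x):
        return U_t.T @ x                    # h = U~^T x (= S~ V~^T ej when x is column j of X)

    def decode(h):
        return U_t @ h                      # x' = U~ h

    return encode, decode

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 100))   # rank-5 data matrix
encode, decode = fit_svd_autoencoder(X, k=5)

x = X[:, 0]
h = encode(x)                               # k-dimensional code
x_rec = decode(h)                           # decoded approximation x'
print(h.shape, np.linalg.norm(x - x_rec))   # (5,) and ~0, since rank(X) = k here
```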
Power iteration:
Solve the problem min_{p,q} ||A − pq⊺||_F², p ∈ Rn, q ∈ Rm, by alternating over p and q.
Each step of the alternation has a simple closed-form solution (e.g., for fixed q, the optimal p is Aq/||q||₂²).
Convergence is improved when the method is formulated as an iteration over normalized vectors: setting u = p/||p||₂ and v = q/||q||₂, the updates become
u = Av/||Av||₂ and v = A⊺u/||A⊺u||₂
Works extremely well for large, sparse A, since each update only needs matrix-vector products and the zero entries can be skipped.
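A minimal numpy sketch of the normalized iteration (random dense test matrix; the function name is illustrative). Each update only needs products with A and A⊺, which is what makes it attractive for large sparse A:

```python
import numpy as np

def top_singular_pair(A, iters=100):
    """Power-iteration sketch for the leading singular triple (u, sigma, v) of A."""
    rng = np.random.default_rng(5)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = A @ v
        u /= np.linalg.norm(u)              # u = Av / ||Av||_2
        v = A.T @ u
        v /= np.linalg.norm(v)              # v = A^T u / ||A^T u||_2
    sigma = u @ (A @ v)                     # estimate of the top singular value
    return u, sigma, v

A = np.random.default_rng(6).standard_normal((200, 100))
u, sigma, v = top_singular_pair(A)
print(abs(sigma - np.linalg.svd(A, compute_uv=False)[0]))   # ~0: matches sigma_1
```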