Matrix factorization is a class of algorithms used for recommender systems in machine learning. Matrix factorization algorithms work by decomposing the (typically sparse) user-item interaction matrix into the product of two lower-dimensional matrices. Commonly known matrix factorization algorithms are SVD and PCA.

$$\begin{array}{c|cccc} & m_1 & m_2 & m_3 & m_4 \\ \hline u_1 & & w_{12}^{um} & & \\ u_2 & w_{21}^{um} & & & \\ u_3 & & & w_{32}^{um} & \end{array}$$
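As a minimal sketch of the idea (variable names and dimensions here are assumptions, not from the notes), we approximate such a ratings matrix $X \in \mathbb{R}^{s \times n}$ by a product $W Z^T$ with $W \in \mathbb{R}^{s \times k}$ and $Z \in \mathbb{R}^{n \times k}$ for a small latent dimension $k$:

```python
import numpy as np

s, n, k = 3, 4, 2            # users, movies, latent dimension (assumed values)
rng = np.random.default_rng(0)

W = rng.normal(size=(s, k))  # user factors
Z = rng.normal(size=(n, k))  # movie factors

X_hat = W @ Z.T              # predicted ratings for every (user, movie) pair
print(X_hat.shape)           # (3, 4)
```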

How do we compute matrix factorizations?

Possible approach:

$$\min_{W, Z}\left\|W Z^{T}-X\right\|_{\mathrm{Fro}}^{2}$$

Here $\|\cdot\|_{\mathrm{Fro}}$ denotes the so-called Frobenius norm, i.e.:

$$\|X\|_{\mathrm{Fro}} := \sqrt{\sum_{i=1}^{s} \sum_{j=1}^{n}\left|x_{i j}\right|^{2}}$$
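As a quick sanity check (a sketch, not part of the original notes), the Frobenius norm is just the Euclidean norm of the matrix entries, which numpy exposes directly:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])

fro_manual = np.sqrt(np.sum(np.abs(X) ** 2))  # definition above
fro_numpy = np.linalg.norm(X, ord="fro")      # built-in Frobenius norm

print(fro_manual, fro_numpy)                  # both ~5.477
```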

Then we can add regularization to the model for $W$ and $Z$:

$$\min_{W, Z}\left\|W Z^{T}-X\right\|_{\mathrm{Fro}}^{2}+R_{1}(W)+R_{2}(Z)$$

Here we need to choose suitable regularization functions $R_1$ and $R_2$.
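One common choice (an assumption here, the notes do not fix $R_1$, $R_2$) is a squared Frobenius penalty on both factors, $R_1(W)=\lambda\|W\|_{\mathrm{Fro}}^2$ and $R_2(Z)=\lambda\|Z\|_{\mathrm{Fro}}^2$; the regularized objective can then be evaluated like this:

```python
import numpy as np

def objective(X, W, Z, lam=0.1):
    """Regularized factorization loss with Frobenius penalties (assumed choice of R1, R2)."""
    fit = np.linalg.norm(W @ Z.T - X, ord="fro") ** 2
    reg = lam * (np.linalg.norm(W, ord="fro") ** 2 + np.linalg.norm(Z, ord="fro") ** 2)
    return fit + reg
```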

Since the product $W Z^{T}$ does not determine $W$ and $Z$ uniquely (for any invertible $A$, replacing $W$ by $WA$ and $Z$ by $ZA^{-T}$ gives the same product), this optimization problem is not jointly convex in $(W, Z)$. We can use coordinate descent, also called alternating minimization, to solve it.

Coordinate Descent / Alternating Minimization

Optimize over one block of variables while keeping the other fixed. The basic idea is to optimize one parameter (coordinate) block at a time, cycling through them in turn, which decomposes a complicated optimization problem into simple subproblems.

Example:

$$W^{k+1}=\arg\min_{W}\left\|W\left(Z^{k}\right)^{T}-X\right\|^{2}$$

$$Z^{k+1}=\arg\min_{Z}\left\|W^{k+1} Z^{T}-X\right\|^{2}$$

$$\Rightarrow \left\{ \begin{array}{l} W^{k+1}=X Z^{k}\left(\left(Z^{k}\right)^{T} Z^{k}\right)^{-1} \\ Z^{k+1}=\left(\left(\left(W^{k+1}\right)^{T} W^{k+1}\right)^{-1}\left(W^{k+1}\right)^{T} X\right)^{T} \end{array} \right.$$
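A minimal numpy sketch of these alternating least-squares updates (function and variable names are mine; no regularization, so $\left(Z^k\right)^T Z^k$ and $\left(W^{k+1}\right)^T W^{k+1}$ are assumed invertible):

```python
import numpy as np

def alternating_minimization(X, k, n_iters=50, seed=0):
    """Fit X ~ W Z^T by alternating the closed-form least-squares updates above."""
    s, n = X.shape
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, k))
    for _ in range(n_iters):
        # W-step: W = X Z (Z^T Z)^{-1}, computed via a linear solve for stability
        W = np.linalg.solve(Z.T @ Z, Z.T @ X.T).T
        # Z-step: Z = ((W^T W)^{-1} W^T X)^T
        Z = np.linalg.solve(W.T @ W, W.T @ X).T
    return W, Z

X = np.random.default_rng(1).normal(size=(10, 8))
W, Z = alternating_minimization(X, k=3)
print(np.linalg.norm(W @ Z.T - X, ord="fro"))  # decreases as n_iters grows
```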

Singular Value Decomposition

We can obtain other kinds of matrix factorizations. A very popular one is the singular value decomposition:

The singular value decomposition can factor any matrix into two orthogonal matrices and a diagonal matrix, and it reveals many properties of the matrix.

Every matrix $X \in \mathbb{R}^{s \times n}$ can be written as

$$X=U \Sigma V^{T}$$

$U \in \mathbb{R}^{s \times s}$ is a unitary matrix, i.e. $U^{T} U=I$

$V \in \mathbb{R}^{n \times n}$ is a unitary matrix, i.e. $V^{T} V=I$

$$\Sigma= \left( \begin{array}{ccccccc} \sigma_{1} & 0 & \ldots & 0 & 0 & \ldots & 0 \\ 0 & \sigma_{2} & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \sigma_{s} & 0 & \ldots & 0 \end{array} \right) \in \mathbb{R}^{s \times n} \quad \text{(diagonal, for } s < n\text{)}$$

$\sigma_1$ is the largest singular value and $\sigma_s$ is the smallest.

$$\Sigma= \left( \begin{array}{cccc} \sigma_{1} & 0 & \ldots & 0 \\ 0 & \sigma_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_{n} \\ 0 & 0 & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 0 \end{array} \right) \in \mathbb{R}^{s \times n} \quad \text{(diagonal, for } s > n\text{)}$$
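A short sketch of this factorization with numpy (shapes correspond to the full SVD above; variable names and the example matrix are assumptions):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 3))  # s = 5 > n = 3

U, sigma, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros_like(X)
Sigma[:3, :3] = np.diag(sigma)                    # embed singular values in the s x n "diagonal" matrix

print(np.allclose(U @ Sigma @ Vt, X))             # True: X = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(5)),
      np.allclose(Vt @ Vt.T, np.eye(3)))          # unitary checks: U^T U = I, V^T V = I
```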

Now for unitary matrices, we have:

$$\|U x\|^{2}=\langle U x, U x\rangle=\left\langle x, U^{T} U x\right\rangle=\langle x, x\rangle=\|x\|^{2}$$

i.e., rotations do not affect the Euclidean norm (Frobenius norm)

$$\|X V w\|^{2}=\left\|U \Sigma V^{T} V w\right\|^{2}=\langle U \Sigma w, U \Sigma w\rangle=\langle\Sigma w, \Sigma w\rangle=\sum_{i=1}^{s} \sigma_{i}^{2}\left|w_{i}\right|^{2}$$
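A quick numerical check of both identities (a sketch with assumed random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
U, sigma, Vt = np.linalg.svd(X)
x = rng.normal(size=4)
w = rng.normal(size=4)

print(np.allclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # ||Ux|| = ||x||
print(np.allclose(np.linalg.norm(X @ Vt.T @ w) ** 2,
                  np.sum(sigma ** 2 * w ** 2)))                 # ||XVw||^2 = sum_i sigma_i^2 w_i^2
```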

Principal Component Analysis

Principal Component Analysis (PCA) is another matrix factorization technique, closely related to the SVD: the principal components are obtained from the SVD of the mean-centered data matrix.
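A minimal sketch of PCA via the SVD of the centered data matrix (assuming rows are samples and columns are features, which the notes do not specify; names are mine):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)               # remove the column means
    U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                           # top-k right singular vectors = principal directions
    return X_centered @ components.T              # scores: coordinates in the principal subspace

scores = pca(np.random.default_rng(0).normal(size=(100, 5)), k=2)
print(scores.shape)  # (100, 2)
```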