by Sherwin Chen

Introduction

PCA is a dimensionality-reduction technique widely used in data preprocessing. This post walks through PCA and its derivatives: whitening and ZCA whitening.

Note: dimensionality-reduction techniques rely heavily on the assumption that the features contain significant redundancy, which may not always hold in practice (features are usually designed for their unique contributions, and removing any of them may hurt training accuracy to some degree [Guolin Ke et al. 2017]).


Singular Value Decomposition

Any real matrix \( \pmb A \) can be decomposed into

\[\begin{align} \pmb A= \pmb U\pmb \Sigma \pmb V^\top \end{align}\]

where

  • The columns of \( \pmb U \) are the left-singular vectors, i.e., orthonormal eigenvectors of \( \pmb A\pmb A^\top \),
  • \( \pmb \Sigma\) is a diagonal matrix whose diagonal entries are the singular values, i.e., the square roots of the eigenvalues of both \( \pmb A\pmb A^\top \) and \( \pmb A^\top\pmb A \),
  • The columns of \(\pmb V\) are the right-singular vectors, i.e., orthonormal eigenvectors of \( \pmb A^\top\pmb A \).

Therefore, we also have

\[\begin{align} \pmb A^\top\pmb A=\pmb V\pmb \Sigma^2 \pmb V^\top \end{align}\]

In addition, if \( \pmb A \) is a real symmetric positive semidefinite matrix (which implies \( \pmb A\pmb A^\top=\pmb A^\top\pmb A \)), \( \pmb U \) is identical to \( \pmb V \) and \( \pmb \Sigma \) is a diagonal matrix whose diagonal entries are the eigenvalues of \( \pmb A \). In this way, \( \pmb A \) can be rewritten as

\[\begin{align} \pmb A=\pmb V\pmb \Sigma \pmb V^\top \end{align}\]
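
As a quick numerical check of the symmetric case, the sketch below (NumPy; the random matrix is only illustrative) verifies that for a symmetric positive semidefinite matrix the singular values coincide with the eigenvalues and \(\pmb U\) coincides with \(\pmb V\):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                        # real symmetric positive semidefinite matrix

U, s, Vt = np.linalg.svd(A)        # A = U diag(s) V^T
eigvals = np.linalg.eigvalsh(A)    # eigenvalues, ascending order

assert np.allclose(np.sort(s), eigvals)  # singular values equal the eigenvalues
assert np.allclose(U, Vt.T)              # U coincides with V
```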


Dimensionality Reduction with SVD

In order to use SVD to reduce dimensionality, we take the matrix product of the truncated \(\pmb U\) and \(\pmb \Sigma\):

\[\begin{align} \tilde {\pmb A}=\pmb U_{:, 1:k}\pmb \Sigma_{1:k, 1:k} \end{align}\]
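
Equivalently, \(\tilde{\pmb A}=\pmb A\pmb V_{:,1:k}\), i.e., the rows of \(\pmb A\) expressed in the top-\(k\) right-singular directions. A minimal NumPy sketch (the function name and the choice of \(k\) are mine):

```python
import numpy as np

def svd_reduce(A, k):
    """Rank-k representation: U[:, :k] @ diag(s[:k]), equivalently A @ V[:, :k]."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]        # scale the first k left-singular vectors by the singular values

A = np.random.default_rng(1).standard_normal((100, 20))
A_tilde = svd_reduce(A, k=5)       # shape (100, 5)
```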

Principal Component Analysis

PCA is a technique that removes linear correlations among variables so that the covariance matrix of the resulting data is diagonal. We can then keep only the components with the largest variance to simplify the input data without losing much information.

Derivation

Objective. The objective of PCA is to find the projection of \(\pmb X\in\mathbb R^{m \times n}\) onto a \(k\)-dimensional subspace \(S\subseteq\mathbb R^n\) such that the information in the orthogonal complement (i.e., the information discarded) is minimized. Mathematically, let \(\pmb X=\{\pmb x_i\in\mathbb R^n:i\in[1,m]\}\) and define an orthonormal basis \(\{\pmb b_j\in\mathbb R^n: j\in[1,n]\}\) such that \(\pmb x_i=\sum_{j=1}^n\lambda_{ij}\pmb b_j\). Let \(\bar{\pmb x}_i=\sum_{j=1}^k\lambda_{ij}\pmb b_j\) be the projection of \(\pmb x_i\) onto \(S\). We want to find \(\{\bar{\pmb x}_1,\dots,\bar{\pmb x}_m\}\) such that \(\mathcal L=\sum_{i=1}^m\Vert \pmb x_i-\bar{\pmb x}_i\Vert_2^2\) is minimized.

Derivation. We first rewrite \({\pmb x}_i\) as

\[\begin{align} {\pmb x}_i=\sum_{j=1}^n \lambda_{ij}\pmb b_j=\sum_{j=1}^n \pmb b_j^\top\pmb x_i\pmb b_j =\sum_{j=1}^n \pmb b_j\pmb b_j^\top\pmb x_i\tag 1 \end{align}\]

where \(\sum_{j=1}^n \pmb b_j\pmb b_j^\top=\pmb I\) is the projection onto the whole space \(\mathbb R^n\). Similarly, projecting onto \(S\) only, we have

\[\begin{align} \bar{\pmb x}_i=\sum_{j=1}^k \pmb b_j\pmb b_j^\top\pmb x_i\tag 2 \end{align}\]

Plugging Equations \((1)\) and \((2)\) into \(\mathcal L\), we obtain

\[\begin{align} \mathcal L=&\sum_{i=1}^m\left\Vert\sum_{j=k+1}^n\pmb b_j\pmb b_j^\top\pmb x_i\right\Vert_2^2\\\ =&\sum_{i=1}^m\sum_{j=k+1}^n{\pmb x}_i^\top\pmb b_j\pmb b_j^\top\pmb x_i&\color{red}{\text{since }\{\pmb b_j:j\in[1,n]\}\text{ is orthonormal}}\\\ =&\sum_{j=k+1}^n\pmb b_j^\top\left(\sum_{i=1}^m\pmb x_i\pmb x_i^\top\right)\pmb b_j&\color{red}{\text{since }\pmb x_i^\top\pmb b_j=\pmb b_j^\top\pmb x_i\text{ is a scalar}}\\\ =&\sum_{j=k+1}^n\pmb b_j^\top \pmb C\pmb b_j&\color{red}{\pmb C=\sum_{i=1}^m\pmb x_i\pmb x_i^\top=\pmb X^\top\pmb X}\tag 3\\\ =&\sum_{j=k+1}^n\text{tr}(\pmb b_j^\top \pmb C\pmb b_j)&\color{red}{\text{since }\pmb b_j^\top \pmb C\pmb b_j \text{ is a scalar, tr}(\cdot)\text{ is the trace operator}}\\\ =&\text{tr}\left(\sum_{j=k+1}^n\pmb b_j^\top \pmb C\pmb b_j\right)\\\ =&\text{tr}\left(\sum_{j=k+1}^n\pmb b_j\pmb b_j^\top \pmb C\right) \end{align}\]

where \(\sum_{j=k+1}^n\pmb b_j\pmb b_j^\top\) is the projection matrix onto the orthogonal complement of \(S\). Note that \({1\over m}\pmb C\) is the sample covariance matrix, assuming \(\pmb X\) has zero mean. This shows that PCA minimizes the variance projected onto the orthogonal complement of \(S\); in other words, PCA aims to find a subspace \(S\) onto which the projected variance is maximized.

Now we restate PCA as a constrained optimization problem using Equation \((3)\)

\[\begin{align} \min_{\pmb b_{k+1},\dots,\pmb b_n}&\sum_{j=k+1}^n\pmb b_j^\top \pmb C\pmb b_j\tag 4\\\ s.t.\quad&\Vert \pmb b_j\Vert_2^2=1,\quad j=k+1,\dots,n \end{align}\]

The equivalent Lagrangian is

\[\begin{align} \min_{\pmb b_{k+1},\dots,\pmb b_n}\max_{\lambda_{k+1},\dots,\lambda_{n}}\mathcal L(\pmb b_{k+1},\dots,\pmb b_n,\lambda_{k+1},\dots,\lambda_{n})=\sum_{j=k+1}^n\left(\pmb b_j^\top \pmb C\pmb b_j+\lambda_j(1-\Vert \pmb b_j\Vert_2^2)\right) \end{align}\]

Taking the derivative w.r.t \(\pmb b_j\), we have

\[\begin{align} {\partial\mathcal L\over\partial\pmb b_j}=2(\pmb C\pmb b_j-\lambda_j\pmb b_j) \end{align}\]

Setting it to zero, we get

\[\begin{align} \pmb C\pmb b_j=\lambda_j\pmb b_j \end{align}\]

Therefore, \(\pmb b_j\) is an orthonormal eigenvector of \(\pmb C\) and \(\lambda_j\) is the corresponding eigenvalue. This indicates that Equation \((4)\) is minimized when \(\lambda_{k+1},\dots,\lambda_n\) are the \(n-k\) smallest eigenvalues of \(\pmb C\) (in fact, this follows directly from the Rayleigh quotient). Because \(\{\pmb b_{k+1},\dots,\pmb b_n\}\) is an orthonormal basis for \(S^\perp\), the subspace \(S\) is spanned by \(\pmb V_k=[\pmb b_1,\dots,\pmb b_k]\), the eigenvectors of \(\pmb C\) with the \(k\) largest eigenvalues. Therefore, we have \(\bar{\pmb X}=\pmb X\pmb V_k\).
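
The derivation suggests a direct implementation: take the \(k\) eigenvectors of \(\pmb C=\pmb X^\top\pmb X\) with the largest eigenvalues and project onto them. A sketch, assuming \(\pmb X\) is already centered (the function name is mine):

```python
import numpy as np

def pca_project_eig(X, k):
    C = X.T @ X                           # unnormalized sample covariance (X centered)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    V_k = eigvecs[:, ::-1][:, :k]         # eigenvectors with the k largest eigenvalues
    return X @ V_k                        # coordinates of the projection: X_bar = X V_k
```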

Algorithm

In order to perform PCA,

  • First, normalize the input \( \pmb X \): subtract the mean and divide each dimension of the centered data by the corresponding standard deviation. Subtracting the mean centers the data so that \( \pmb X^\top\pmb X \) is proportional to the sample covariance; dividing by the standard deviation is important when the variables are not measured on the same scale.
  • Then, to decorrelate the input \( \pmb X \), rotate \( \pmb X \) by the orthogonal matrix \( \pmb V \) whose columns are the orthonormal eigenvectors of \( \pmb X^\top\pmb X \) (i.e., the right-singular vectors of \( \pmb X \)); a NumPy sketch follows this list
\[\begin{align} \pmb X_{PCA}=\pmb X\pmb V \end{align}\]
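
A minimal sketch of the two steps above in NumPy (the function name and the small constant guarding against zero-variance features are mine):

```python
import numpy as np

def pca(X):
    X = X - X.mean(axis=0)            # mean normalization
    X = X / (X.std(axis=0) + 1e-12)   # variance normalization (features on different scales)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_pca = X @ Vt.T                  # rotate by the right-singular vectors V
    return X_pca, s, Vt.T             # also return singular values and V for later use
```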

Proof of Decorrelation

We can verify that the resulting data \( \pmb X_{PCA} \) is decorrelated by showing that its covariance matrix \( \pmb C_{PCA}=\pmb X^\top_{PCA}\pmb X_{PCA} \) is diagonal:

\[\begin{align} \pmb X_{PCA}^\top\pmb X_{PCA}&=\pmb V^\top\pmb X^\top\pmb X\pmb V \\\ &=\pmb V^\top\pmb V\pmb \Sigma^2 \pmb V^\top\pmb V\\\ &=\pmb \Sigma^2 \end{align}\]

where \( \pmb \Sigma \) is a diagonal matrix whose diagonal entries are the singular values of \( \pmb X \), so the diagonal entries of \( \pmb \Sigma^2 \) are the eigenvalues of \( \pmb X^\top\pmb X \).
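
A small self-contained check of this claim (random correlated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 8))  # correlated features
X = X - X.mean(axis=0)                                           # center

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt.T
C_pca = X_pca.T @ X_pca

assert np.allclose(C_pca, np.diag(s**2))   # diagonal, with entries Sigma^2
```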

Dimension Reduction

Usually we retain only the top \( k \) components by taking the first \( k \) columns of \( \pmb V \) (those associated with the \( k \) largest singular values of \(\pmb X\)), where \( k \) is the smallest value that satisfies

\[\begin{align} {\sum_{j=1}^k \lambda_j\over\sum_{j=1}^n\lambda_j}\ge0.99 \end{align}\]

where \( \lambda_i \) is the \( i \)-th largest eigenvalue of \( \pmb X^\top\pmb X \).

The above inequality says we retain \( 99\% \) of the variance. In practice, this threshold can be lowered to around \( 90\% \) depending on the application.
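
A sketch of choosing \( k \) from the singular values returned by the SVD (the 0.99 threshold is the value above; the helper name is mine):

```python
import numpy as np

def choose_k(singular_values, threshold=0.99):
    eigvals = np.sort(singular_values)[::-1] ** 2        # eigenvalues of X^T X, descending
    ratio = np.cumsum(eigvals) / eigvals.sum()           # retained fraction of variance
    return int(np.searchsorted(ratio, threshold)) + 1    # smallest k with ratio >= threshold
```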


Whitening

Whitening further standardizes the PCA-transformed components so that each has variance \( 1 \)

\[\begin{align} \pmb X_{whiten}=\pmb X_{PCA}\pmb \Sigma^{-1} \end{align}\]

After whitening, the data has covariance equal to the identity matrix \( I \) since

\[\begin{align} \pmb X_{whiten}^\top\pmb X_{whiten}&=\pmb \Sigma^{-1}\pmb X_{PCA}^\top\pmb X_{PCA}\pmb \Sigma^{-1}\\\ &=\pmb \Sigma^{-1}\pmb \Sigma^2\pmb \Sigma^{-1}\\\ &=I \end{align}\]

Regularization

In practice, some of the eigenvalues \( \lambda_i \) of \( \pmb X^\top\pmb X \) may be close to \( 0 \), so the whitening step, in which each column \( \pmb X_{:,i} \) of \( \pmb X_{PCA} \) is divided by \( \sqrt{\lambda_i} \), would cause the data to blow up. To maintain numerical stability, we usually apply a small amount of regularization by adding a small constant \( \epsilon \) to the eigenvalues before taking their square roots and inverting

\[\begin{align} \pmb X_{whiten}=\pmb X_{PCA}(\pmb \Sigma^2+\epsilon\pmb I)^{-1/2} \end{align}\]

When \( \pmb X \) takes values in roughly \( [-1, 1] \), a value of \( \epsilon\approx 10^{-5} \) is typical.
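
A minimal sketch of PCA whitening with this regularization (it re-centers the input; eps is the small constant above):

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)                     # center the data
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_pca = X @ Vt.T                           # rotate as in PCA
    return X_pca / np.sqrt(s**2 + eps)         # divide column i by sqrt(lambda_i + eps)
```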

For images, adding \( \epsilon \) also slightly smooths the input image. This has the desirable effect of removing aliasing artifacts caused by the way pixels are laid out in an image and can improve the learned features.


ZCA Whitening

Since whitening standardizes all components so that their covariance is \( I \), any rotation applied to the whitened data keeps it whitened. That is, for any orthogonal matrix \( \pmb R \), \( \pmb X_{whiten}\pmb R^\top \) is also whitened. In ZCA whitening, we choose \( \pmb R=\pmb V \). Thus, we have

\[\begin{align} \pmb X_{ZCAwhiten}=\pmb X_{whiten}\pmb V^\top \end{align}\]

The resulting data is as close as possible to the original data (in the least-squares sense). That is, ZCA whitening minimizes \( \Vert \pmb X-\pmb X\pmb A^\top\Vert_2^2 \) subject to \( \pmb X\pmb A^\top \) being whitened, where \( \pmb A \) is the whitening transform; in the case of ZCA whitening, \( \pmb A=\pmb V(\pmb \Sigma^2+\epsilon\pmb I)^{-1/2} \pmb V^\top \).
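
A sketch of ZCA whitening, built by rotating the (regularized) PCA-whitened data back with \( \pmb V^\top \) (the function name is mine):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)                     # center the data
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_white = (X @ Vt.T) / np.sqrt(s**2 + eps) # PCA whitening
    return X_white @ Vt                        # rotate back: X_zca = X_white V^T
```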
