by Sherwin Chen

Introduction

The conjugate gradient method is useful for solving the system of linear equations

\[\begin{align} Ax=b \end{align}\]

for the vector \( x \), where the matrix \( A \) is real, symmetric, and positive-definite (i.e., \( x^TAx>0 \) for all non-zero vectors \( x\in\mathbb R^n \)), and \( b \) is known. We denote the unique solution to this system (whose existence and uniqueness are guaranteed by \( A \) being positive definite, as we will see later) by \( x^* \).
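
As a running toy example for the code sketches in this post, we set up a small symmetric positive-definite system in NumPy (the variable names `A`, `b`, `n`, `x_star`, and `rng` are only for illustration):

```python
import numpy as np

# Toy setup used by the code sketches below: a random symmetric
# positive-definite matrix A and a right-hand side b.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)      # M^T M is PSD; adding n*I makes it positive-definite
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)   # reference solution x* for later comparisons
```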

Preliminaries

Conjugate. Two vectors \( u, v \) are conjugate (with respect to \( A \)) if

\[\begin{align} \langle u,v\rangle_{A}=u^T{Av}=0\tag {1} \end{align}\]

where the known \( n\times n \) matrix \( A \) is symmetric, positive-definite.
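
To make the definition concrete, note that eigenvectors of a symmetric \( A \) are mutually orthogonal and therefore also mutually \( A \)-conjugate, since \( u^TAv=\lambda_v u^Tv=0 \). A quick check against the toy `A` defined above (the helper name `a_inner` is an illustrative choice):

```python
def a_inner(u, v, A):
    """A-inner product <u, v>_A = u^T A v."""
    return u @ A @ v

# Eigenvectors of a symmetric matrix (as returned by eigh) are orthonormal,
# hence mutually A-conjugate: u^T A v = lambda_v * u^T v = 0.
eigvals, eigvecs = np.linalg.eigh(A)
u, v = eigvecs[:, 0], eigvecs[:, 1]
print(a_inner(u, v, A))  # ~0 up to floating-point error
```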

A Direct Method

Suppose that

\[\begin{align} P=\{p_1, \dots, p_n\} \end{align}\]

is a set of mutually conjugate vectors (with respect to \( A \)). Then \( P \) forms a basis for \( \mathbb R^n \) (a proof is given in the Proofs section below), and we may express the solution \( x^* \) of \( Ax=b \) in this basis

\[\begin{align} x^*=\sum_{i=1}^n\alpha_i p_i \end{align}\]

Then we calculate

\[\begin{align} b=Ax^*=\sum_{i=1}^n\alpha_i A p_i \end{align}\]

Left-multiplying by \( p_k^T \), we have

\[\begin{align} \sum_{i=1}^n \alpha_i p_k^T{Ap}_i&=p_k^Tb\\\ \alpha_kp_k^T{Ap}_k&=p_k^Tb\\\ \alpha_k&={\langle p_k,b\rangle\over \langle p_k, p_k\rangle_A} \end{align}\]

By solving for every \( \alpha_k \), we obtain \( x^* \) expressed in the basis \( P \).
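
Here is a minimal sketch of this direct method, reusing the eigenvectors of the toy `A` from the previous check as a ready-made mutually \( A \)-conjugate basis \( P \):

```python
# Direct method: expand x* in a mutually A-conjugate basis P = {p_1, ..., p_n}.
P = [eigvecs[:, i] for i in range(n)]

x = np.zeros(n)
for p in P:
    alpha = (p @ b) / (p @ A @ p)   # alpha_k = <p_k, b> / <p_k, p_k>_A
    x += alpha * p

print(np.allclose(x, x_star))       # True: matches np.linalg.solve(A, b)
```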

An Iterative Method

When \(x\) is of high dimension, computing \(x\) with the direct method is costly. Fortunately, if we choose the conjugate basis carefully, we may not need all of its vectors to obtain a good approximation to \(x^*\). We now discuss an iterative method that allows us to gradually construct \(x^*\).

Gram-Schmidt Orthogonalization

Given linearly independent vectors \( \{x_1, x_2, \dots, x_n\} \), the Gram-Schmidt process builds an orthogonal basis \( \{y_1, y_2,\dots, y_n\} \) as follows

\[\begin{align} y_1 &= x_1\\\ y_{k+1} &= x_{k+1} - \sum_{i=1}^{k}{\langle x_{k+1}, y_i \rangle\over \langle y_i, y_i \rangle}y_i \end{align}\]

where \({\langle x_{k+1}, y_i \rangle\over \langle y_i, y_i \rangle}y_i\) is the projection of \(x_{k+1}\) onto \(y_i\). An \( A \)-conjugate (\( A \)-orthogonal) basis is built similarly. Given linearly independent vectors \( \{g_1, g_2,\dots, g_n\} \), we construct an \(A\)-conjugate basis \(\{d_1,d_2,\dots,d_n\}\) as below

\[\begin{align} d_1 &= -g_1\\\ d_{k+1} &= -g_{k+1} + \sum_{i=1}^{k}{\langle g_{k+1}, d_i \rangle_A\over \langle d_i, d_i \rangle_A}d_i \end{align}\tag {2}\]

where we deliberately negate the \( d_k \)'s so that the signs match the discussion that follows.
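
A sketch of this \( A \)-conjugation process (Eq. \( (2) \)), written as a standalone function that reuses `a_inner` from the earlier sketch; the function name and the check below are illustrative choices:

```python
def a_conjugate_basis(G, A):
    """Build A-conjugate directions from the columns of G (assumed linearly
    independent), following Eq. (2), including the sign convention d_1 = -g_1."""
    D = []
    for k in range(G.shape[1]):
        g = G[:, k]
        d = -g
        for d_i in D:
            # add the A-projection of g onto each previous direction, as in Eq. (2)
            d += a_inner(g, d_i, A) / a_inner(d_i, d_i, A) * d_i
        D.append(d)
    return np.stack(D, axis=1)

D = a_conjugate_basis(rng.standard_normal((n, n)), A)
print(np.allclose(np.triu(D.T @ A @ D, k=1), 0))  # off-diagonal entries ~0
```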

Minimization of the Quadratic Function

Note that \( x^* \) is also the unique minimizer of the following quadratic function

\[\begin{align} f(x)={1\over 2}x^TAx-x^Tb \end{align}\]

To see that \( x^* \) is the unique minimizer, notice that \( A \) is positive definite, so \( f \) is strictly convex and therefore has a unique minimizer. Taking the gradient, we have

\[\begin{align} \nabla f(x)=Ax-b \end{align}\]

which is \( 0 \) at \( x^* \). Thus \( x^* \) is the unique minimizer of \( f(x) \).
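
As a quick sanity check of \( \nabla f(x)=Ax-b \) on the toy system above, we can compare the analytic gradient with central finite differences (the code below is only an illustration):

```python
def f(x):
    return 0.5 * x @ A @ x - x @ b

x = rng.standard_normal(n)
grad = A @ x - b                          # analytic gradient of f
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(n)])       # central finite differences
print(np.allclose(grad, fd, atol=1e-4))   # True
print(np.allclose(A @ x_star, b))         # True: the gradient vanishes at x*
```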

Let \( x=\sum_{i=0}^{n-1} \alpha_i d_i\), where \(\{d_0, \dots, d_{n-1}\}\) is an \( A \)-conjugate basis, and express \(f(x)\) as below

\[\begin{align} f(x)=\sum_{i=0}^{n-1} \left({1\over 2}\alpha_i^2\langle d_i, d_i \rangle_A-\alpha_i \langle d_i,b\rangle\right) \tag {3} \end{align}\]

This suggests that \( f(x) \) can be decomposed into \( n \) independent sub-functions, one per coefficient \( \alpha_i \). This is crucial because it allows us to construct \( x \) gradually:

\[\begin{align} x_0&=0\\\ x_{k+1}&=x_{k}+\alpha_kd_k\\\ &=x_0+\sum_{i=0}^k\alpha_id_i \end{align}\tag {4}\]

where we use \(x_0=0\) for simplicity, but it can be any value. With this at hand, we can verify that when \(x_{k+1}\) minimizes the quadratic function over the subspace spanned by \(\{d_0,\dots,d_k\}\), the gradient \( g_{k+1}=\nabla_{x_{k+1}}f(x_{k+1}) \) is orthogonal to \( \{d_0,\dots,d_k\} \) (proof is given in the Proofs section below). As \( g_k \) lies in the subspace spanned by \( \{d_0,\dots,d_k\} \), it naturally follows that \( g_{k+1} \) is orthogonal to \( g_k \). The same argument applies to all other previously computed gradients. Now that we have linearly independent vectors \(\{g_0, \dots, g_{n-1}\}\), the \( A \)-conjugate basis \(\{d_0, \dots, d_{n-1}\}\) can be constructed from \( (2) \). Furthermore, \( (2) \) can be simplified to

\[\begin{align} d_{k+1}&=-g_{k+1}+\beta_k d_k\\\ \mathrm {where}\quad\beta_k&={g_{k+1}^T(g_{k+1}-g_k)\over g_k^Tg_k} \end{align}\tag{5}\]

The detailed derivation is appended at the end.

From \( (3) \), we can see that minimizing \( f(x) \) amounts to minimizing each \(f_i(\alpha_i)\), where \(f_i(\alpha_i)={1\over 2}\alpha_i^2\langle d_i, d_i \rangle_A-\alpha_i\langle d_i,b\rangle\). Thus, we can compute \( \alpha_i \) directly by setting the derivative of \(f_i(\alpha_i)\) to zero:

\[\begin{align} \nabla_{\alpha_i}f_i(\alpha_i)&=0\\\ \alpha_i\langle d_i, d_i\rangle_A-\langle d_i,b\rangle&=0 \end{align}\]

Solving it, we have

\[\begin{align} \alpha_i&={\langle d_i, b\rangle \over \langle d_i,d_i\rangle_A}\\\ &=-{\langle d_i, g_i\rangle \over \langle d_i,d_i\rangle_A}\\\ &={\langle g_i, g_i\rangle \over \langle d_i,d_i\rangle_A}\tag {6} \end{align}\]

Some explanations:

  • We replace \( b \) with \( b=Ax_i - g_i \) in the second step and note that
\[\begin{align} d_i^TAx_i&=d_i^T\left(A\sum_{j=0}^{i-1}\alpha_jd_j\right)\\\ &=\sum_{j<i}\alpha_jd_i^TAd_j\\\ &=0 \end{align}\]
  • The last step follows from \( d_i=-g_i+\beta_{i-1}d_{i-1} \) (shown in \( (5) \)) and the fact that \( d_{i-1} \) is orthogonal to \( g_i \).

We have now introduced almost everything needed to implement the conjugate gradient method iteratively. There is still one more trick to add. Notice that the gradient is \( g_{k+1}=Ax_{k+1}-b \), where \( x_{k+1} \) is computed iteratively, so we expand \( x_{k+1} \) and obtain

\[\begin{align} g_{k+1}&=A(x_k+\alpha_k d_k)-b\\\ &=Ax_k-b+\alpha_k Ad_k\\\ &=g_k+\alpha_kAd_k\tag {7} \end{align}\]

Summary. In order to iteratively construct \(x^*\), we first reframe the linear system as the problem of finding the minimizer of a quadratic function \(f\) and express \(x\) in an \(A\)-conjugate basis \(\{d_0, \dots, d_{n-1}\}\). We then show that the gradients of the quadratic function \(g_k=\nabla f(x_k),\ k=0,\dots,n-1\) are orthogonal to each other. Following the Gram-Schmidt process, this allows us to iteratively construct an \(A\)-conjugate basis \(\{d_0, \dots, d_k\}\) from \(\{g_0, \dots, g_k\}\). The only thing left for computing \(x_{k+1}\) is the coefficients \(\{\alpha_0,\dots,\alpha_{k}\}\), which can be obtained by solving \(\nabla_{\alpha_i}f_i(\alpha_i)=0\).

Algorithm

The iterative version of the conjugate gradient method utilizes Equations \((4-7)\):

\[\begin{align} &x_0=0\\\ &g_0=Ax_0-b\\\ &d_0=-g_0\\\ &\mathrm{for}\ k=0, 1, \cdots:\\\ &\quad \alpha_k={\langle g_k, g_k\rangle \over \langle d_k,d_k\rangle_A} & Eq.(6)\\\ &\quad x_{k+1}=x_k+\alpha_k d_k & Eq.(4)\\\ &\quad g_{k+1}=g_k+\alpha_kAd_k & Eq.(7)\\\ &\quad \beta_k={g_{k+1}^T(g_{k+1}-g_k)\over g_k^Tg_k} & Eq.(5)\\\ &\quad d_{k+1}=-g_{k+1}+\beta_kd_k & Eq.(5) \end{align}\]

The algorithm can be outlined as simply repeating the following steps

  1. Recursively compute \(x_{k+1}\) in the current conjugate basis \( d_{0:k} \).
  2. Recursively compute the gradient of \( f(x_{k+1}) \), \(g_{k+1}\).
  3. Calculate the next conjugate vector \( d_{k+1} \) from \(g_{k+1}\).
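
Below is a minimal NumPy sketch of this loop, following Equations \((4-7)\) line by line and verified against the toy system from the beginning of the post; the function name and stopping tolerance are illustrative choices:

```python
def conjugate_gradient(A, b, tol=1e-10):
    """Solve Ax = b for symmetric positive-definite A via Eqs. (4)-(7)."""
    x = np.zeros_like(b)                        # x_0 = 0
    g = A @ x - b                               # g_0 = Ax_0 - b
    d = -g                                      # d_0 = -g_0
    for _ in range(b.shape[0]):                 # at most n steps in exact arithmetic
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (d @ A @ d)           # Eq. (6)
        x = x + alpha * d                       # Eq. (4)
        g_next = g + alpha * (A @ d)            # Eq. (7)
        beta = g_next @ (g_next - g) / (g @ g)  # Eq. (5)
        d = -g_next + beta * d                  # Eq. (5)
        g = g_next
    return x

print(np.allclose(conjugate_gradient(A, b), x_star))  # True
```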

Proofs

Proof for Linear Independence of \( n \) Mutually Conjugate Vectors

Suppose that

\[\begin{align} P=\{p_1,\dots,p_n\} \end{align}\]

is a set of mutually conjugate vectors w.r.t. some positive-definite \( A \). We now prove that the vectors in \( P \) are linearly independent.

Assume there exists some \( p_k \) such that

\[\begin{align} p_k&=\sum_{i=1, i\ne k}^n\alpha_ip_i \end{align}\]

Substituting this back into \( (1) \), and picking any index \( l\ne k \) with \( \alpha_l\ne0 \) (such an \( l \) exists since \( p_k\ne0 \)), we end up with a contradiction

\[\begin{align} p_k^TAp_l&=\sum_{i=1, i\ne k}^n\alpha_ip_i^TAp_l\\\ &=\alpha_lp_l^T{Ap}_l\\\ &\ne0 \end{align}\]

The last step follows from \( \alpha_l\ne0 \) and the positive-definiteness of \( A \), which gives \( p_l^TAp_l>0 \).

back to the context

Proof for \( g_{k+1}\bot d_0,\dots,d_k \)

Let \( A_k=\begin{bmatrix}\alpha_0\\ \vdots\\ \alpha_k\end{bmatrix} \) and \( D_k=\begin{bmatrix}d_0\\ \vdots\\ d_k\end{bmatrix} \), where the \( d_i \) are stacked as rows. Now we have

\[\begin{align} x_{k+1}=x_0+D_k^TA_k \end{align}\]

To find the \( A_k \) that minimizes \( f(x_{k+1}) \), we set the gradient with respect to \( A_k \) to \( 0 \). That is,

\[\begin{align} \nabla_{A_k} f(x_{k+1})&=0\\\ D_k\nabla_{x_{k+1}}f(x_{k+1})&=0 \end{align}\]

The above equation suggests

\[\begin{align} d_0^T\nabla_{x_{k+1}}f(x_{k+1})&=d_0^Tg_{k+1}=0\\\ d_1^T\nabla_{x_{k+1}}f(x_{k+1})&=d_1^Tg_{k+1}=0\\\ \vdots \end{align}\]

so we have

\[\begin{align} g_{k+1}\bot d_0,\dots,d_k \end{align}\]
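
As a numerical sanity check of this result on the toy system above, we can rerun the loop from the Algorithm section while recording the search directions and confirm that each new gradient is (numerically) orthogonal to all of them:

```python
# Check: at every step, g_{k+1} is orthogonal to d_0, ..., d_k.
x = np.zeros(n)
g = A @ x - b
d = -g
directions = []
for _ in range(n):
    directions.append(d)
    alpha = (g @ g) / (d @ A @ d)
    x = x + alpha * d
    g_next = g + alpha * (A @ d)
    print(max(abs(d_i @ g_next) for d_i in directions))  # tiny (~1e-13 or smaller)
    beta = g_next @ (g_next - g) / (g @ g)
    d = -g_next + beta * d
    g = g_next
```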

back to the context

Simplified Iterative Derivation of \( d \)

We have stated that

\[\begin{align} d_1 &= -g_1\\\ d_{k+1} &= -g_{k+1} + \sum_{i=1}^{k}{\langle g_{k+1}, d_i \rangle_A\over \langle d_i, d_i \rangle_A}d_i \end{align}\tag {2}\]

Now let us take a closer look at it and see what we can derive.

According to \( (4) \), we derive

\[\begin{align} d_i={1\over \alpha_i}(x_{i+1}-x_i) \end{align}\]

Left-multiplying by \( A \), we get

\[\begin{align} Ad_i={1\over \alpha_i}A(x_{i+1}-x_i)={1\over \alpha_i}(g_{i+1}-g_i) \end{align}\]

Plugging this back into \( (2) \), we obtain

\[\begin{align} d_{k+1}&=-g_{k+1}+\sum_{i=1}^k{g_{k+1}^T(g_{i+1}-g_{i})\over d_i^T(g_{i+1}-g_i)}d_i\\\ &=-g_{k+1}+{g_{k+1}^T(g_{k+1}-g_k)\over d_k^T(g_{k+1}-g_k)}d_k&\mathrm{since}\ g_{k+1}\bot g_1, \dots, g_k\\\ &=-g_{k+1}+{g^T_{k+1}(g_{k+1}-g_k)\over -d_k^Tg_k}d_k&\mathrm{since}\ d_k\bot g_{k+1}\\\ &=-g_{k+1}+{g^T_{k+1}(g_{k+1}-g_k)\over g_k^Tg_k}d_k&\mathrm{since}\ d_k=-g_k+\beta_{k-1}d_{k-1}\ \mathrm{and}\ d_{k-1}\bot g_k\\\ &=-g_{k+1}+\beta_kd_k&\mathrm{where}\ \beta_k={g^T_{k+1}(g_{k+1}-g_k)\over g_k^Tg_k} \end{align}\]

Although the numerator of \( \beta_k \) could be simplified further to \( g_{k+1}^Tg_{k+1} \) using \( g_{k+1}\bot g_k \), this is generally not done, for numerical reasons.

back to the context