Linear-Quadratic Regulator (LQR)
Problem Introduction
For a linear dynamics model \( s_{t+1}=f_t(s_t,a_t) \) and a quadratic reward function \( r_t(s_t,a_t) \)
\[\begin{align} s_{t+1}=f_t(s_t,a_t)&=F_t\begin{bmatrix}s_t\\\a_t\end{bmatrix}+f_t\tag{1} \\\ r_t(s_t,a_t)&={1\over 2}\begin{bmatrix}s_t\\\a_t\end{bmatrix}^TR_t\begin{bmatrix}s_t\\\a_t\end{bmatrix}+\begin{bmatrix}s_t\\\a_t\end{bmatrix}^Tr_t\tag{2} \end{align}\]where \( R_t \) is a negative definite matrix (so that the optimal action in Eq. \( (5) \) can be found by setting the derivative to zero), and \( R_t,r_t \) are written in block form as
\[\begin{align} R_t&=\begin{bmatrix}R_{s_t,s_t}&R_{s_t,a_t}\\\R_{a_t,s_t}&R_{a_t,a_t}\end{bmatrix}\\\ r_t&=\begin{bmatrix}r_{s_t}\\\r_{a_t}\end{bmatrix}\\\ \end{align}\]Our objective is to select actions \( a_1,\cdots, a_T \) so as to maximize the total rewards over a period of time \( T \)
\[\begin{align} \max_{a_1,\cdots, a_T} \sum_{t=1}^Tr_t(s_t,a_t)\tag{3} \end{align}\]For simplicity and without loss of generality, we assume \( R_t \) is a symmetric matrix. Also note that I omit constant terms throughout the discussion since they do not affect the algorithm we describe.
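To make the setup concrete, here is a small toy instance in numpy (my own illustrative choice of a 1D double integrator; the dimensions, horizon, and weights are arbitrary):

```python
import numpy as np

# A toy instance of the LQR problem in Eqs. (1)-(2): a 1D double integrator
# (state = [position, velocity], scalar action = acceleration) with a
# quadratic reward that penalizes distance from the origin and control effort.
T, dt = 50, 0.1

# Dynamics: s_{t+1} = F_t [s_t; a_t] + f_t (time-invariant here).
F = np.array([[1.0, dt, 0.0],
              [0.0, 1.0, dt]])        # shape (state_dim, state_dim + action_dim)
f = np.zeros(2)

# Reward: r_t(s_t, a_t) = 1/2 [s;a]^T R [s;a] + [s;a]^T r, with R negative definite.
R = -np.diag([1.0, 0.1, 0.01])        # negative definite => concave reward
r = np.zeros(3)

# One copy per time step, since the derivation allows time-varying F_t, R_t, etc.
F_seq, f_seq = [F] * T, [f] * T
R_seq, r_seq = [R] * T, [r] * T
```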
Analysis
We first consider the simplest case, the last action \( a_T \), whose action value at time step \( T \) is simply the reward at that time step
\[\begin{align} Q(s_T,a_T)&=r(s_T,a_T)\\\ &={1\over 2}\begin{bmatrix}s_T\\\a_T\end{bmatrix}^TR_T\begin{bmatrix}s_T\\\a_T\end{bmatrix}+\begin{bmatrix}s_T\\\a_T\end{bmatrix}^Tr_T\\\ &={1\over 2}\begin{bmatrix}s_T\\\a_T\end{bmatrix}^TQ_T\begin{bmatrix}s_T\\\a_T\end{bmatrix}+\begin{bmatrix}s_T\\\a_T\end{bmatrix}^Tq_T\tag{4} \end{align}\]In the last line, we rename \( R_T, r_T \) as \( Q_T, q_T \) so as to be consistent with the following discussion.
Since the \(Q\)-function is quadratic (and concave in \( a_T \)), we can compute the best action by taking the gradient of the \(Q\)-function with respect to \( a_T \) and setting it to zero
\[\begin{align} \nabla_{a_T}Q(s_T,a_T)&=Q_{a_T,s_T}s_T+Q_{a_T,a_T}a_T+q_{a_T}=0\\\ a_T&=K_Ts_T+k_T\tag{5}\\\ where\quad K_T&=-Q_{a_T,a_T}^{-1}Q_{a_T,s_T}\\\ k_T&=-Q_{a_T,a_T}^{-1}q_{a_T} \end{align}\]Plugging \( (5) \) back into \( (4) \), we get the maximal action value at state \( s_T \)
\[\begin{align} V(s_T)&=\max_{a_T}Q(s_T,a_T)\\\ &={1\over 2} \begin{bmatrix}s_T\\\K_Ts_T+k_T\end{bmatrix}^T Q_T \begin{bmatrix}s_T\\\K_Ts_T+k_T\end{bmatrix} + \begin{bmatrix}s_T\\\K_Ts_T+k_T\end{bmatrix}^Tq_T\tag{6} \end{align}\]Expanding \( (6) \) and working through the algebra, we end up with the following concise expression
\[\begin{align} V(s_T)&={1\over 2}s_T^TV_Ts_T+s_T^Tv_T\tag {7}\\\ where\quad V_T&=Q_{s_T,s_T}+Q_{s_T,a_T}K_T+K_T^TQ_{a_T,s_T}+K_T^TQ_{a_T,a_T}K_T\\\ &=Q_{s_T,s_T}-Q_{s_T,a_T}Q_{a_T,a_T}^{-1}Q_{a_T,s_T}\\\ v_T&=Q_{s_T,a_T}k_T+K_T^TQ_{a_T,a_T}k_T+q_{s_T}+K_T^Tq_{a_T}\\\ &=q_{s_T}-Q_{s_T,a_T}Q_{a_T,a_T}^{-1}q_{a_T} \end{align}\]Now we replace \( s_T \) in \( (7) \) with \( f_{T-1}(s_{T-1},a_{T-1}) \)
\[\begin{align} V(s_T)&={1\over 2}\left(F_{T-1}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix} + f_{T-1}\right)^T V_T \left(F_{T-1}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix} + f_{T-1}\right) + \left(F_{T-1}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix} + f_{T-1}\right)^T v_T\\\ &={1\over 2}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}^TF_{T-1}^T V_T F_{T-1}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix} + \begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}^T \left(F_{T-1}^TV_Tf_{T-1} + F_{T-1}^Tv_T\right)\tag{8} \end{align}\]Next, let’s express the action value at time step \( T-1 \)
\[\begin{align} Q(s_{T-1},a_{T-1})&={1\over 2}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}^T R_{T-1} \begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix} + \begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}^T r_{T-1} + V(s_T)\tag {9}\\\ &={1\over 2}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}^TQ_{T-1}\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}+\begin{bmatrix}s_{T-1}\\\a_{T-1}\end{bmatrix}^Tq_{T-1}\tag {10}\\\ where\quad Q_{T-1}&=R_{T-1}+F_{T-1}^TV_TF_{T-1}\\\ q_{T-1}&=r_{T-1}+F_{T-1}^TV_Tf_{T-1}+F_{T-1}^Tv_T \end{align}\]where in the second step, we substitute \( (8) \) for \( V(s_T) \).
From here, it is easy to see a recursive pattern in the \( Q \)-function, and hence in the whole process. We can therefore construct the solution recursively, from time step \( T \) down to the beginning.
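To make the single-step solution concrete, here is a small numpy check (my own illustration): it draws a random concave quadratic \( Q \)-function, computes \( K_T, k_T, V_T, v_T \) from \( (5) \) and \( (7) \), and verifies that \( a_T=K_Ts_T+k_T \) is a stationary point and that \( V(s_T) \) matches \( Q(s_T,a_T) \) up to the dropped constant.

```python
import numpy as np

rng = np.random.default_rng(0)
ns, na = 3, 2                                  # state and action dimensions

# Build a random symmetric Q_T with a negative definite action block Q_{a,a}.
M = rng.normal(size=(ns + na, ns + na))
Q_T = -(M @ M.T + np.eye(ns + na))             # negative definite => concave
q_T = rng.normal(size=ns + na)
Qss, Qsa = Q_T[:ns, :ns], Q_T[:ns, ns:]
Qas, Qaa = Q_T[ns:, :ns], Q_T[ns:, ns:]
qs, qa = q_T[:ns], q_T[ns:]

# Eq. (5): optimal action; Eq. (7): value function.
K_T = -np.linalg.solve(Qaa, Qas)
k_T = -np.linalg.solve(Qaa, qa)
V_T = Qss - Qsa @ np.linalg.solve(Qaa, Qas)
v_T = qs - Qsa @ np.linalg.solve(Qaa, qa)

def Q_fun(s, a):
    x = np.concatenate([s, a])
    return 0.5 * x @ Q_T @ x + x @ q_T

s = rng.normal(size=ns)
a_star = K_T @ s + k_T

# The gradient of Q w.r.t. a vanishes at a_star, and V(s) matches Q(s, a_star)
# once we add back the constant that the derivation drops.
grad_a = Qas @ s + Qaa @ a_star + qa
const = 0.5 * k_T @ Qaa @ k_T + k_T @ qa
assert np.allclose(grad_a, 0.0)
assert np.isclose(0.5 * s @ V_T @ s + s @ v_T + const, Q_fun(s, a_star))
```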
Algorithm
Backward recursion
The backward recursion computes the value functions and the auxiliary matrices \( K_t, k_t \) from time step \( T \) down to the beginning
\[\begin{align} &\mathrm{for\ }t=T \mathrm{\ to\ }1:\\\ &\quad Q_t=R_t+F_t^TV_{t+1}F_{t}\\\ &\quad q_t=r_t+F_{t}^TV_{t+1}f_{t}+F_{t}^Tv_{t+1}\\\ &\quad Q(s_t,a_t)={1\over 2}\begin{bmatrix}s_{t}\\\a_{t}\end{bmatrix}^T Q_t \begin{bmatrix}s_{t}\\\a_{t}\end{bmatrix} +\begin{bmatrix}s_{t}\\\a_{t}\end{bmatrix}^T q_t+\mathrm{const}\\\ &\quad K_t=-Q_{a_t,a_t}^{-1}Q_{a_t,s_t}\\\ &\quad k_t=-Q_{a_t,a_t}^{-1}q_{a_t}\\\ &\quad a_t=\arg\max_{a_t}Q(s_t,a_t)=K_ts_t+k_t\\\ &\quad V_t=Q_{s_t,s_t}-Q_{s_t,a_t}Q_{a_t,a_t}^{-1}Q_{a_t,s_t}\\\ &\quad v_t=q_{s_t}-Q_{s_t,a_t}Q_{a_t,a_t}^{-1}q_{a_t}\\\ &\quad V(s_t)={1\over 2}s_t^TV_ts_t+s_t^Tv_t+\mathrm{const} \end{align}\]where we take \( V_{T+1}=0 \) and \( v_{T+1}=0 \), so that \( Q_T=R_T \) and \( q_T=r_T \) as in \( (4) \). Here we add \( \mathrm{const} \) back to the action and state values for completeness, but note that these terms have no effect on the algorithm or on action selection.
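A compact numpy sketch of this backward pass (the function name and list-based bookkeeping are my own choices; it assumes each \( Q_{a_t,a_t} \) is negative definite and invertible):

```python
import numpy as np

def lqr_backward(F_seq, f_seq, R_seq, r_seq, ns):
    """Backward recursion; returns per-step gains K_t, k_t (0-indexed lists)."""
    T = len(F_seq)
    V, v = np.zeros((ns, ns)), np.zeros(ns)       # V_{T+1} = 0, v_{T+1} = 0
    K_seq, k_seq = [None] * T, [None] * T
    for t in reversed(range(T)):
        F, f, R, r = F_seq[t], f_seq[t], R_seq[t], r_seq[t]
        Q = R + F.T @ V @ F                        # Q_t
        q = r + F.T @ V @ f + F.T @ v              # q_t
        Qss, Qsa = Q[:ns, :ns], Q[:ns, ns:]
        Qas, Qaa = Q[ns:, :ns], Q[ns:, ns:]
        qs, qa = q[:ns], q[ns:]
        K_seq[t] = -np.linalg.solve(Qaa, Qas)      # K_t
        k_seq[t] = -np.linalg.solve(Qaa, qa)       # k_t
        V = Qss - Qsa @ np.linalg.solve(Qaa, Qas)  # V_t
        v = qs - Qsa @ np.linalg.solve(Qaa, qa)    # v_t
    return K_seq, k_seq
```

With the toy `F_seq, f_seq, R_seq, r_seq` from the earlier snippet, `lqr_backward(F_seq, f_seq, R_seq, r_seq, ns=2)` returns the gains consumed by the forward recursion below.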
Forward recursion
The forward recursion uses the auxiliary matrices computed in the backward recursion to select actions and predict the subsequent states.
\[\begin{align} &\mathrm{for}\ t=1\ \mathrm{to}\ T:\\\ &\quad a_t = K_ts_t+k_t\\\ &\quad s_{t+1} = f(s_t, a_t) \end{align}\]
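A matching sketch of the forward recursion, reusing the hypothetical `lqr_backward` output from above:

```python
import numpy as np

def lqr_forward(s1, K_seq, k_seq, F_seq, f_seq):
    """Roll out the linear policy a_t = K_t s_t + k_t under the linear dynamics."""
    states, actions = [s1], []
    for t in range(len(K_seq)):
        a = K_seq[t] @ states[t] + k_seq[t]                           # a_t
        actions.append(a)
        states.append(F_seq[t] @ np.concatenate([states[t], a]) + f_seq[t])  # s_{t+1}
    return states, actions

# Example with the toy double integrator, starting away from the origin:
# K_seq, k_seq = lqr_backward(F_seq, f_seq, R_seq, r_seq, ns=2)
# states, actions = lqr_forward(np.array([1.0, 0.0]), K_seq, k_seq, F_seq, f_seq)
```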
Stochastic dynamics
The model we have used so far is deterministic, where \( s_{t+1}=f(s_t,a_t) \). Things become more complex if the model is stochastic, in which case the next state is sampled from a probability distribution, i.e., \( s_{t+1}\sim p(s_{t+1}\vert s_t,a_t) \). Fortunately, if the distribution is a linear-Gaussian distribution with a constant covariance as follows, we can use the same algorithm as before.
\[\begin{align} p(s_{t+1}|s_t,a_t)&=\mathcal N(\mu_t,\Sigma_t)\\\ where\quad \mu_t&=F_t\begin{bmatrix}s_t\\\a_t\end{bmatrix}+f_t \end{align}\]To see why, let us recap the above analysis or just recall the Bellman equation: the dynamics model is only required when we compute \( Q(s_{t}, a_{t}) \) from \( V(s_{t+1}) \). Therefore, if we can prove the following result, we are free to use the same algorithm without any modification
\[\begin{align} \mathbb E[V(s_{t+1})]&=\mathbb E\left[{1\over 2}s_{t+1}^TV_{t+1}s_{t+1}+s_{t+1}^Tv_{t+1}\right]\\\ &={1\over 2}\mu_t^TV_{t+1}\mu_t+\mu_t^Tv_{t+1} +\mathrm{const}\tag{11} \end{align}\]Note that the const term is free to omit, as we stated at the beginning of the post.
Here we prove a more general result: for a random vector \( X \) with mean \( \mu \) and covariance \( \Sigma \), we have
\[\begin{align} \mathbb E[X^TAX+X^TB+C]=\mu^T A\mu+\mu^TB+\mathrm{const} \end{align}\]Decomposing the left side, we get
\[\begin{align} \mathbb E[X^TAX+X^TB+C]&=\mathbb E\left[\sum_{ij}a_{ij}x_ix_j+\sum_ib_ix_i+C\right]\\\ &=\sum_{ij}a_{ij}\mathbb E[x_ix_j]+\sum_ib_i\mathbb E[x_i]+C\\\ &=\sum_{ij}a_{ij}\left(\sigma_{ij}+\mu_i\mu_j\right)+\sum_ib_i\mu_i+C\\\ &=\sum_{ij}a_{ij}\sigma_{ij}+\sum_{ij}a_{ij}\mu_i\mu_j+\sum_ib_i\mu_i+C\\\ &=\mathrm{tr}(A\Sigma)+\mu^TA\mu+\mu^TB+C\\\ &=\mu^TA\mu+\mu^TB+\mathrm{const} \end{align}\]where the trace term \( \mathrm{tr}(A\Sigma) \) does not depend on \( \mu \) and is therefore absorbed into the constant.
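As a quick sanity check of this identity (my own verification snippet), we can compare a Monte Carlo estimate of the left-hand side against \( \mu^TA\mu+\mu^TB+\mathrm{tr}(A\Sigma)+C \):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)); A = 0.5 * (A + A.T)   # symmetric A
B, C = rng.normal(size=n), rng.normal()
mu = rng.normal(size=n)
L = rng.normal(size=(n, n))
Sigma = L @ L.T + np.eye(n)                         # positive definite covariance

X = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean(np.einsum('ni,ij,nj->n', X, A, X) + X @ B + C)
closed_form = mu @ A @ mu + mu @ B + np.trace(A @ Sigma) + C
print(mc, closed_form)   # the two agree up to Monte Carlo error
```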
Iterative LQR (iLQR)
For a nonlinear system, we can approximate it as a linear-quadratic system by taking a Taylor expansion as follows
\[\begin{align} \mathrm{Model:}\quad&f(s_t,a_t)\approx f(\hat s_t,\hat a_t)+\nabla_{s_t,a_t}f(\hat s_t,\hat a_t)\begin{bmatrix}s_t-\hat s_t\\\a_t-\hat a_t\end{bmatrix}\tag{12}\\\ \mathrm{Reward:}\quad&r(s_t,a_t)\approx r(\hat s_t,\hat a_t)+\nabla_{s_t,a_t}r(\hat s_t,\hat a_t)\begin{bmatrix}s_t-\hat s_t\\\a_t-\hat a_t\end{bmatrix}+{1\over 2}\begin{bmatrix}s_t-\hat s_t\\\a_t-\hat a_t\end{bmatrix}^T\nabla_{s_t,a_t}^2r(\hat s_t,\hat a_t)\begin{bmatrix}s_t-\hat s_t\\\a_t-\hat a_t\end{bmatrix}\tag{13} \end{align}\]If we take \( \delta s_t=s_t-\hat s_t \) and \( \delta a_t=a_t-\hat a_t \), we can rewrite the system in terms of \( \delta s_t \) and \( \delta a_t \).
\[\begin{align} \mathrm{Model:}\quad&\bar f(\delta s_t,\delta a_t)=f(\hat s_t,\hat a_t)+\nabla_{s_t,a_t}f(\hat s_t,\hat a_t)\begin{bmatrix}\delta s_t\\\ \delta a_t\end{bmatrix}\tag{14}\\\ \mathrm{Reward:}\quad&\bar r(\delta s_t,\delta a_t)=r(\hat s_t,\hat a_t)+\nabla_{s_t,a_t}r(\hat s_t,\hat a_t)\begin{bmatrix}\delta s_t\\\ \delta a_t\end{bmatrix}+{1\over 2}\begin{bmatrix}\delta s_t\\\ \delta a_t\end{bmatrix}^T\nabla_{s_t,a_t}^2r(\hat s_t,\hat a_t)\begin{bmatrix}\delta s_t\\\ \delta a_t\end{bmatrix}\tag{15} \end{align}\]Solving this system with LQR gives the ‘optimal’ sequence \( \delta a_1,\cdots, \delta a_T \), with which we update \( \hat a_t=\hat a_t+\delta a_t \). Then we run a forward pass with the real dynamics and these actions, and repeat until convergence.
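In code, the linearization in \( (14) \)-\( (15) \) only needs the Jacobian of the dynamics and the gradient and Hessian of the reward at each nominal point \( (\hat s_t,\hat a_t) \). The helper below is my own sketch using finite differences on black-box callables `f(s, a)` and `r(s, a)`; in practice one would use automatic differentiation:

```python
import numpy as np

def linearize(f, r, s_hat, a_hat, eps=1e-5):
    """Finite-difference F_t, f_t, R_t, r_t around a nominal point (s_hat, a_hat)."""
    x0 = np.concatenate([s_hat, a_hat])
    ns, nx = len(s_hat), len(x0)

    def f_x(x): return f(x[:ns], x[ns:])
    def r_x(x): return r(x[:ns], x[ns:])

    F = np.zeros((ns, nx))        # Jacobian of the dynamics, Eq. (12)
    grad = np.zeros(nx)           # gradient of the reward
    hess = np.zeros((nx, nx))     # Hessian of the reward, Eq. (13)
    for i in range(nx):
        d = np.zeros(nx); d[i] = eps
        F[:, i] = (f_x(x0 + d) - f_x(x0 - d)) / (2 * eps)
        grad[i] = (r_x(x0 + d) - r_x(x0 - d)) / (2 * eps)
        for j in range(nx):
            e = np.zeros(nx); e[j] = eps
            hess[i, j] = (r_x(x0 + d + e) - r_x(x0 + d - e)
                          - r_x(x0 - d + e) + r_x(x0 - d - e)) / (4 * eps ** 2)
    return F, f_x(x0), hess, grad   # F_t, f_t, R_t, r_t
```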
Algorithm
\[\begin{align} &\mathrm{Run\ forward\ pass\ with\ }f(s_t, a_t)\ \mathrm{and}\ a_t=0\\\ &\mathrm{Until\ convergence:}\\\ &\quad F_t = \nabla_{s_t,a_t}f(\hat s_t,\hat a_t)\\\ &\quad f_t = f(\hat s_t, \hat a_t)\\\ &\quad R_t = \nabla_{s_t,a_t}^2r(\hat s_t, \hat a_t)\\\ &\quad r_t = \nabla_{s_t,a_t}r(\hat s_t, \hat a_t)\\\ &\quad \mathrm{Run\ LQR\ backward\ pass\ on\ }\delta s_t, \delta a_t\\\ &\quad \mathrm{Run\ forward\ pass\ with\ the\ real\ dynamics\ }f(s_t,a_t)\ \mathrm{and}\ a_t=K_t(s_t-\hat s_t)+k_t+\hat a_t\\\ &\quad \mathrm{Update\ }\hat s_t,\hat a_t\ \mathrm{to\ the\ states\ and\ actions\ of\ this\ forward\ pass} \end{align}\]
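Putting the pieces together, here is a minimal iLQR loop built from the hypothetical helpers sketched earlier (`linearize`, `lqr_backward`). It runs the backward pass on the \( \delta \) system with a zero dynamics offset (since the nominal trajectory is generated by the real dynamics, the linearized \( \delta \)-dynamics have no constant term), and it omits convergence checks, regularization of the reward Hessian, and the line search discussed next:

```python
import numpy as np

def ilqr(f, r, s1, ns, na, T, num_iters=20):
    """Iterative LQR sketch: linearize around a nominal trajectory, solve, roll out."""
    # Initial nominal trajectory: roll out the real dynamics with a_t = 0.
    s_hat, a_hat = [s1], [np.zeros(na) for _ in range(T)]
    for t in range(T):
        s_hat.append(f(s_hat[t], a_hat[t]))

    for _ in range(num_iters):
        # Linearize the dynamics and quadratize the reward around the nominal trajectory.
        F_seq, R_seq, r_seq = [], [], []
        for t in range(T):
            F, _, R, r_t = linearize(f, r, s_hat[t], a_hat[t])
            F_seq.append(F); R_seq.append(R); r_seq.append(r_t)

        # LQR backward pass on the delta system; the offset is zero in delta
        # coordinates because s_hat[t+1] = f(s_hat[t], a_hat[t]).
        zero_offsets = [np.zeros(ns) for _ in range(T)]
        K_seq, k_seq = lqr_backward(F_seq, zero_offsets, R_seq, r_seq, ns)

        # Forward pass with the real dynamics.
        s_new, a_new = [s1], []
        for t in range(T):
            a = a_hat[t] + K_seq[t] @ (s_new[t] - s_hat[t]) + k_seq[t]
            a_new.append(a)
            s_new.append(f(s_new[t], a))
        s_hat, a_hat = s_new, a_new

    return s_hat, a_hat
```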
Line search
As with Newton’s method, we need to do a backtracking line search in iLQR to prevent overshooting. To do so, we multiply \( k_t \) in the action selection by \( \alpha \), which starts at \( 1 \) and gradually decreases until improvement is achieved. This makes sense since \( \alpha \) smoothly interpolates between the original trajectory and the optimal trajectory under the quadratic approximation. Notice that \( \alpha=0 \) reproduces the original trajectory: if we start with \( s_1=\hat s_1 \), then \( \delta s_1=0 \) and \( a_1=K_1\delta s_1+\hat a_1=\hat a_1 \), which results in \( s_2=\hat s_2 \), and so on.
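In code, the line search just wraps the forward pass in a loop over \( \alpha \). The sketch below uses a hypothetical `rollout` helper that also accumulates the total reward:

```python
import numpy as np

def rollout(f, r, s1, s_hat, a_hat, K_seq, k_seq, alpha):
    """Forward pass with a_t = a_hat_t + K_t (s_t - s_hat_t) + alpha * k_t."""
    states, actions, total_reward = [s1], [], 0.0
    for t in range(len(K_seq)):
        a = a_hat[t] + K_seq[t] @ (states[t] - s_hat[t]) + alpha * k_seq[t]
        total_reward += r(states[t], a)
        actions.append(a)
        states.append(f(states[t], a))
    return states, actions, total_reward

def line_search(f, r, s1, s_hat, a_hat, K_seq, k_seq, old_reward, decay=0.5, max_steps=10):
    """Backtracking on alpha: shrink until the rollout improves on the nominal reward."""
    alpha = 1.0
    for _ in range(max_steps):
        states, actions, reward = rollout(f, r, s1, s_hat, a_hat, K_seq, k_seq, alpha)
        if reward > old_reward:
            return states, actions
        alpha *= decay
    return s_hat, a_hat   # no improvement found: keep the nominal trajectory
```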
References
CS 294-112 at UC Berkeley. Deep Reinforcement Learning Lecture 10