Introduction

In the previous post, we introduced LQR, which attempts to find the optimal action sequence for a linear-quadratic system. In this post, we combine iLQR with a local linear-Gaussian model that is left to be learned but, as before, is assumed to have a linear mean and constant covariance.

Overall Algorithm

In many situations, we don’t need to learn a global model: it may be redundant, and a learned global model may be erroneously optimistic in many places. Instead, it may be sufficient, and more efficient, to estimate the dynamics only in a local region around the current policy. Here we introduce a specific type of local model, the linear-Gaussian model, defined as follows

\[\begin{align} p(s_{t+1}|s_t,a_t)&=\mathcal N(f(s_t,a_t),\Sigma_t) \tag{1} \\\ where\quad f(s_t,a_t)&=F_t\begin{bmatrix}s_t \\\ a_t\end{bmatrix} \end{align}\]

Unlike the linear-Gaussian model defined in the previous post, here \( F_t \) is unknown and must be learned. We do not include a constant term in \( f(s_t,a_t) \) since, when we later apply LQR to \( \delta s_t \) and \( \delta a_t \), the constant terms cancel out anyway.

Long story short, the algorithm repeats the following three steps:

  1. Run the policy \( p(a_t\vert s_t) \) to collect data \( D=\{\tau\} \)
  2. Fit the dynamics \( p(s_{t+1}\vert s_t,a_t) \) at each time step using linear regression (see the sketch after this list)
  3. Improve policy using dual gradient descent
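
To make step 2 concrete, below is a minimal NumPy sketch of the per-time-step fit; the function name, array shapes, and the plain least-squares estimator are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def fit_dynamics(S, A, S_next):
    """Fit p(s_{t+1}|s_t,a_t) = N(F_t [s_t; a_t], Sigma_t) at one time step.

    S, A, S_next: arrays of shape (N, ds), (N, da), (N, ds) containing the
    states, actions, and next states observed at this time step across N
    rollouts of the current policy.
    """
    X = np.concatenate([S, A], axis=1)              # (N, ds + da)
    # Least-squares fit of the linear mean: S_next ~ X @ F_t^T
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)  # (ds + da, ds)
    F_t = W.T                                       # (ds, ds + da)
    # Constant covariance estimated from the residuals
    resid = S_next - X @ W
    Sigma_t = resid.T @ resid / max(len(S) - 1, 1)
    return F_t, Sigma_t
```

With only a handful of rollouts per iteration, this fit is usually regularized in practice, e.g., by placing a prior over the dynamics that is shared across time steps; the sketch omits that for brevity.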

This process bears much resemblance to iLQR, introduced in our previous post, which iteratively applies LQR to an approximate dynamics model. Both algorithms follow the same pattern: run the current policy, collect data, adjust the model, and improve the policy. The differences lie in the model-improvement and policy-improvement steps. At the model-improvement step, iLQR adjusts the Taylor expansion according to the observed data, while the process above fits the dynamics using linear regression. At the policy-improvement step, here we require some exploration, and at the same time we have to constrain the policy so that the new trajectory does not deviate too much from the old one; this constraint is in fact what keeps our local model valid.

Action Distribution for Exploration

Before talking about policy improvement using LQR, let us take a step back and consider what we expect the policy to be. In regular iLQR, we deterministically take the action \( a_t=K_t(s_t-\hat s_t)+k_t+\hat a_t \). This is fine since we have the model at hand, and all we have to do is adjust the first-order Taylor expansion to improve the policy. Here, however, we do not have a model yet, so we need some exploration to figure out the local model. To that end, we adopt a time-varying linear-Gaussian policy

\[\begin{align} p(a_t|s_t)=\mathcal N(K_ts_t+k_t, \Sigma_t) \tag{2} \end{align}\]

A good choice for the covariance matrix \( \Sigma_t \) is \( Q_{a_t, a_t}^{-1} \), the inverse of the coefficient of the quadratic-in-\( a_t \) term of the \( Q \)-function. This choice has a nice intuitive interpretation: if changes in the action matter a lot to the \( Q \)-value (i.e., \( \vert Q_{a_t, a_t}\vert \) is large), we keep the variance of the action small; if they matter little (i.e., \( \vert Q_{a_t, a_t}\vert \) is small), we allow a large variance. Furthermore, the choice of a Gaussian distribution brings another benefit: LQR with a linear-Gaussian policy is the maximum-entropy solution, which provides an extra bonus for exploration. Together, our LQR produces a time-varying Gaussian action distribution that maximizes the objective

\[\begin{align} \max\sum_{t=1}^T \mathbb E_{(s_t,a_t)\sim p(s_t,a_t)}\left[r(s_t,a_t)+\mathcal H(p(a_t|s_t)) \right]\tag{3} \end{align}\]

Eq.\( (3) \) will come in handy in the following discussion.
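
As a small illustration of Eq.\( (2) \) with this choice of \( \Sigma_t \), sampling an action during data collection could look like the sketch below; `K_t`, `k_t`, and `Q_uu_t` are assumed to come from an LQR backward pass, and the function name is purely illustrative.

```python
import numpy as np

def sample_action(s_t, K_t, k_t, Q_uu_t, rng=None):
    """Draw a_t ~ N(K_t s_t + k_t, Q_uu_t^{-1}).

    Q_uu_t is the coefficient of the quadratic-in-action term of the
    Q-function at time t, so its inverse shrinks the exploration noise
    exactly where the action matters most to the Q-value.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = K_t @ s_t + k_t
    cov = np.linalg.inv(Q_uu_t)   # large |Q_uu| -> small variance, and vice versa
    return rng.multivariate_normal(mean, cov)
```

This assumes \( Q_{a_t,a_t} \) is positive definite, which the LQR backward pass typically ensures, e.g., via regularization.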

Trust Region iLQR

Because our local model is only accurate locally, the action sequence planned by iLQR will often produce a trajectory that deviates from the planned one. This issue becomes more severe when the new action sequence differs drastically from the old one. Therefore, we have to constrain the actions so as to limit the change, which is generally done via a KL-divergence constraint as follows

\[\begin{align} \max \sum_{t=1}^T\mathbb E_{p(s_t,a_t)}\left[r(s_t,a_t)\right]\\\ s.t.\ \sum_{t=1}^TD_{KL}(p(a_t|s_t)\Vert p_{old}(a_t|s_t))\le \epsilon\tag {4} \end{align}\]

where \( p \) and \( p_{old} \) are the new and old policies, respectively. To solve this optimization problem, we first write down the Lagrangian dual problem corresponding to Eq. \( (4) \)

\[\begin{align} \min_\lambda\sup_p\mathcal L(p, \lambda)\tag {5}\\\ s.t.\quad \lambda\ge 0\\\ where\quad \mathcal L(p,\lambda)=\sum_{t=1}^T\mathbb E_{p(s_t,a_t)}\left[r(s_t,a_t)-\lambda\log p(a_t|s_t)+\lambda\log p_{old}(a_t|s_t)\right] + \lambda\epsilon \end{align}\]

and then apply dual gradient descent to solve it:

  1. Find the optimal \( p \) that maximizes \( \mathcal L(p,\lambda) \)
  2. Perform gradient descent \( \lambda\leftarrow\lambda-\alpha \nabla_\lambda\mathcal L(p,\lambda) \)
  3. Repeat 1-2 until convergence

It is easy to perform step 2, but step 1 is a little tricky. Note that the term \( \lambda\epsilon \) does not depend on \( p \) and that dividing the objective by \( \lambda>0 \) does not change the maximizer; dropping the former and applying the latter, let us write step 1 down and align it with Eq.\( (3) \):

\[\begin{align} &\max_p\sum_{t=1}^T\mathbb E_{p(s_t,a_t)}\left[{1\over \lambda}r(s_t,a_t)-\log p(a_t|s_t)+\log p_{old}(a_t|s_t)\right]\\\ =&\max_p\sum_{t=1}^T\mathbb E_{p(s_t,a_t)}\left[{1\over \lambda}r(s_t,a_t)+\log p_{old}(a_t|s_t)+\mathcal H(p(a_t|s_t))\right]\\\ =&\max_p\sum_{t=1}^T\mathbb E_{p(s_t,a_t)}\left[\tilde r(s_t, a_t)+\mathcal H(p(a_t|s_t))\right]\tag {6}\\\ \mathrm{where}\quad &\tilde r(s_t,a_t)={1\over \lambda}r(s_t,a_t)+\log p_{old}(a_t|s_t)\tag {7} \end{align}\]

Eq.\( (6) \) has exactly the form of Eq.\( (3) \), so we can solve it directly with the linear-Gaussian LQR by using Eq.\( (7) \) as the reward function. In fact, any other maximum-entropy method could be used here instead.
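
Putting Eqs.\( (5) \)-\( (7) \) together, the whole policy-improvement step could be sketched as follows; `maxent_lqr` stands in for any solver of the maximum-entropy objective in Eq.\( (3) \), `estimate_kl` for an estimate of the summed KL divergence, and these, like the other names here, are assumed helpers for illustration rather than a fixed implementation.

```python
def improve_policy(reward_fn, log_p_old, maxent_lqr, estimate_kl,
                   epsilon, lam=1.0, alpha=0.01, num_iters=20):
    """Dual gradient descent on the KL-constrained objective of Eq. (4).

    reward_fn(s, a, t)   -> r(s_t, a_t)
    log_p_old(s, a, t)   -> log p_old(a_t | s_t)
    maxent_lqr(r_tilde)  -> new linear-Gaussian policy maximizing Eq. (3)
    estimate_kl(policy)  -> sum_t KL(p(a_t|s_t) || p_old(a_t|s_t))
    """
    for _ in range(num_iters):
        # Step 1: solve Eq. (6) with maximum-entropy LQR, using the
        # modified reward of Eq. (7).
        r_tilde = lambda s, a, t, lam=lam: reward_fn(s, a, t) / lam + log_p_old(s, a, t)
        policy = maxent_lqr(r_tilde)
        # Step 2: gradient step on the dual variable;
        # dL/dlam = epsilon - sum_t KL, so lam grows when the constraint is violated.
        kl = estimate_kl(policy)
        lam = max(lam - alpha * (epsilon - kl), 1e-6)
    return policy
```

Each outer iteration re-runs LQR with a re-weighted reward, and \( \lambda \) grows whenever the KL constraint is violated, which tightens the trust region.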

Discussion

We may additionally train a model-free algorithm using trajectories collected along the way, together with a decayed supervision signal from the policy trained by iLQR. This strategy bears some resemblance to policy distillation and kickstarting.

Since we directly fit a local linear-Gaussian model via linear regression, the policy (Eq.\((2)\)) does not contain \(\hat s_t\) and \(\hat a_t\), unlike the iLQR controller discussed in the previous post. The trajectory-centric property is instead implicitly imposed by the KL constraint in Eq.\((4)\).

Supplementary Materials

Dual Gradient Descent

For an optimization problem

\[\begin{align} \max &f(x)\tag {8}\\\ s.t.\quad g(x)&\le 0 \end{align}\]

the Lagrangian dual problem is defined as follows

\[\begin{align} \min_\lambda\sup_x&\mathcal L(x,\lambda)\\\ s.t.\quad &\lambda \ge 0\\\ where\quad \mathcal L(x, \lambda)&=f(x)-\lambda g(x) \end{align}\]

This problem can be solved by the following steps

  1. We first find the optimal value of \( x \) that maximizes \( \mathcal L(x,\lambda) \), i.e., we solve \( x^*\leftarrow\arg\max_x\mathcal L (x,\lambda) \)
  2. Then we apply gradient descent on \( \lambda \): \( \lambda \leftarrow \lambda - \alpha \nabla_\lambda \mathcal L(x^*,\lambda) \)
  3. Repeat the above process until convergence

Now let us take a look at why this actually works. Note that \( -\lambda g(x) \) is non-negative when the constraint is satisfied since \( \lambda \ge0 \) and \( g(x)\le 0 \). Therefore, the Lagrangian \( \mathcal L(x,\lambda) \) is always an upper bound of \( f(x) \) as long as the constraint is satisfied, i.e.

\[\begin{align} \mathcal L(x,\lambda)\ge f(x) \end{align}\]

Taking the supremum over the feasible set, and noting that the unconstrained supremum can only be larger, we have

\[\begin{align} \sup_x\mathcal L(x,\lambda)\ge\sup_{x:g(x)\le 0}\mathcal L(x,\lambda)\ge \sup_{x:g(x)\le 0}f(x) \end{align}\]

so for every \( \lambda\ge 0 \), \( \sup_x\mathcal L(x,\lambda) \) is an upper bound on the constrained optimum. As we tune \( \lambda \) to minimize \( \sup_x\mathcal L(x,\lambda) \), this bound is pushed down toward the constrained optimum \( \sup_{x:g(x)\le 0}f(x) \).
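
As a toy illustration of these three steps (an example constructed here, not from the lecture), consider maximizing \( f(x)=-(x-2)^2 \) subject to \( g(x)=x-1\le 0 \); the inner maximization has the closed form \( x^*=2-\lambda/2 \).

```python
def dual_gradient_descent(alpha=0.1, num_iters=200):
    """Maximize f(x) = -(x-2)^2 subject to g(x) = x - 1 <= 0 via DGD.

    The unconstrained maximizer x = 2 violates the constraint, so the dual
    variable grows until the solution is pushed back to the boundary x = 1.
    """
    lam = 0.0
    for _ in range(num_iters):
        # Step 1: x* = argmax_x L(x, lam), from dL/dx = -2(x - 2) - lam = 0
        x_star = 2.0 - lam / 2.0
        # Step 2: gradient descent on lam, with dL/dlam = -g(x*) = -(x* - 1)
        lam = max(lam + alpha * (x_star - 1.0), 0.0)
    return x_star, lam

print(dual_gradient_descent())   # -> approximately (1.0, 2.0)
```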

Augmented Lagrangian Method

Now let’s make the constraint in Eq.\( (8) \) stronger by replacing \( \le \) with \( = \), which gives

\[\begin{align} \max f(x) \tag {9}\\\ s.t.\quad g(x)= 0 \end{align}\]

We can still use dual gradient descent to solve Eq.\( (9) \). However, if we start with a poor initial value for the Lagrange multiplier \( \lambda \), the above process fails to enforce the constraint early in the optimization, especially for non-convex problems. To prevent the constraint from being violated too much, we can employ the augmented Lagrangian method, which simply adds the square of the constraint to the Lagrangian:

\[\begin{align} \bar{\mathcal L}(x,\lambda)=f(x)-\lambda g(x)-\rho \Vert g(x)\Vert^2 \end{align}\]

where the penalty coefficient \( \rho \) is a hyperparameter that can be increased as the optimization proceeds. The quadratic term further penalizes constraint violations, which not only improves the stability of the process but also preserves the correct solution, since the penalty vanishes as \( g(x) \) approaches \( 0 \) at convergence.
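
Continuing the same toy example with the equality constraint \( g(x)=x-1=0 \) (again constructed here for illustration), the inner maximizer of \( \bar{\mathcal L} \) has the closed form \( x^*=(4+2\rho-\lambda)/(2+2\rho) \), and the quadratic penalty keeps \( x^* \) near the feasible set even while \( \lambda \) is still far from its optimal value.

```python
def augmented_lagrangian(alpha=1.0, rho=10.0, num_iters=200):
    """Maximize f(x) = -(x-2)^2 subject to g(x) = x - 1 = 0 using
    L(x, lam) = f(x) - lam * g(x) - rho * g(x)**2.
    """
    lam = 0.0
    for _ in range(num_iters):
        # Step 1: closed-form maximizer of the augmented Lagrangian in x,
        # from dL/dx = -2(x - 2) - lam - 2 * rho * (x - 1) = 0
        x_star = (4.0 + 2.0 * rho - lam) / (2.0 + 2.0 * rho)
        # Step 2: gradient descent on lam, with dL/dlam = -g(x*) = -(x* - 1)
        lam = lam + alpha * (x_star - 1.0)
    return x_star, lam

print(augmented_lagrangian())   # -> approximately (1.0, 2.0)
```

Note that even with \( \lambda=0 \), the first inner solution is already close to the feasible point (\( x^*\approx 1.09 \) for \( \rho=10 \)), which is exactly the stabilizing effect described above.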

References

CS 294-112 at UC Berkeley. Deep Reinforcement Learning Lecture 10

Sergey Levine et al. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.