by Sherwin Chen
Introduction

Vanilla policy-based methods suffer from instability since it is hard to choose an appropriate step size. A bad (overly large) step in a policy gradient method is more damaging than in supervised learning: it leads to a bad policy, and a bad policy in turn produces a bad input distribution, from which the algorithm may not recover for a long time.

In this post, we will first briefly review the objective of policy gradient methods and its gradient estimator, then introduce several surrogate objectives built on that gradient estimator. Next we will discuss two algorithms that regulate the step size to help avoid large steps.


Objectives of Policy Gradient Methods

In reinforcement learning, we evaluate a policy based on the expected total rewards collected when following that policy. It is therefore intuitive to take the expected rewards as the objective of a policy-based method, which is

\[\begin{align} \max_\theta L(\theta)=\int_\tau r(\tau)p(\tau;\theta)d\tau \tag {1} \end{align}\]

where \( \tau \) represents a trajectory, \( r(\tau) \) the rewards collected along the trajectory, and \( p(\tau;\theta) \) the probability of taking the trajectory. As we’ve proved in the previous post, the policy gradient of the objective \( L(\theta) \) is

\[\begin{align} \nabla_\theta L(\theta)&=\mathbb E\left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\hat A_t \right] \end{align}\]

where \( \pi_\theta \) is a stochastic policy and \( \hat A_t \) is an estimator of the advantage function at timestep \( t \). Given this gradient estimator, we could write a surrogate objective as

\[\begin{align} \max_\theta L(\theta)=\mathbb E\left[\log \pi_\theta(a_t|s_t)\hat A_t \right] \end{align}\]

or equivalently

\[\begin{align} \max_{\theta} L_{\theta_{old}}(\theta)&=\mathbb E\left[{\pi_{\theta}(a_t|s_t)\over \pi_{\theta_{old}}(a_t|s_t)}\hat A_t\right]\tag {2} \end{align}\]

since at \( \theta_{old} \), we have

\[\begin{align} \nabla_{\theta} L_{\theta_{old}}(\theta)_{|\theta_{old}}=\mathbb E\left[{\nabla_{\theta}\pi_\theta(a_t|s_t)_{|\theta=\theta_{old}}\over \pi_{\theta_{old}}(a_t|s_t)}\hat A_t\right]=\mathbb E\left[\nabla_\theta\log\pi_\theta(a_t|s_t)_{|\theta=\theta_{old}}\hat A_t\right] \end{align}\]

Eq. \( (2) \) has a nice importance-sampling interpretation: assuming \(\pi_\theta\) stays close to \(\pi_{\theta_{old}}\), states and actions are sampled from the behavior policy \( \pi_{\theta_{old}} \), while we optimize the target policy \( \pi_\theta \).
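To make the equivalence concrete, here is a minimal PyTorch sketch (a single discrete state, a categorical policy, and randomly generated advantages, all placeholders of my own) checking that the ratio surrogate \( (2) \) and the log-probability surrogate yield the same gradient at \( \theta=\theta_{old} \):

```python
import torch

torch.manual_seed(0)
n_actions = 4
theta = torch.randn(n_actions, requires_grad=True)         # current policy logits
theta_old = theta.detach().clone()                          # behavior policy logits (fixed)

actions = torch.randint(n_actions, (32,))                   # actions sampled under pi_theta_old
advantages = torch.randn(32)                                # advantage estimates

log_pi = torch.log_softmax(theta, dim=0)[actions]           # log pi_theta(a_t|s_t)
log_pi_old = torch.log_softmax(theta_old, dim=0)[actions]   # log pi_theta_old(a_t|s_t)

# Log-probability surrogate: E[log pi_theta(a_t|s_t) * A_t]
grad_logp = torch.autograd.grad((log_pi * advantages).mean(), theta,
                                retain_graph=True)[0]

# Ratio (importance-sampling) surrogate (2): E[pi_theta / pi_theta_old * A_t]
ratio = (log_pi - log_pi_old).exp()
grad_ratio = torch.autograd.grad((ratio * advantages).mean(), theta)[0]

print(torch.allclose(grad_logp, grad_ratio, atol=1e-6))     # True at theta = theta_old
```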


Trust Region Policy Optimization (TRPO)

To address the step-size issue, TRPO builds on the concept of a trust region. Formally, the trust region is defined to limit the change of \( \pi_\theta \) from \( \pi_{\theta_{old}} \):

\[\begin{align} \max_\theta &L_{\theta_{old}}(\theta)\\\ \mathrm{s.t.}\quad \mathbb E[D_{KL}&(\pi_{\theta_{old}}(\cdot|s_t)\Vert \pi_\theta(\cdot|s_t))]\le\delta \tag {3} \end{align}\]

Apart from the theoretical justification, the KL divergence is chosen here largely for computational convenience: action distributions are typically Gaussian, for which the KL divergence can be computed analytically. Also, the KL divergence as defined in Eq. \( (3) \) exhibits moment-matching behavior, which is desirable in practice. (Note this is different from what we will see in IRL, where mode-seeking is preferred. The difference stems from the different purposes of using the KL divergence: in IRL, we want the policy to approach a target policy that we consider optimal, whereas here we only want to keep the policy from changing too much, so moment matching is more suitable. Perhaps the most important reason, though, is that the samples used to approximate the expectation are collected by following the old policy, which naturally leads to this direction of the KL constraint.)
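As a side note on the computational-convenience point, the KL divergence between two diagonal Gaussian policies has a simple closed form. A minimal NumPy sketch (the function name and arguments are my own):

```python
import numpy as np

def diag_gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(N(mu_old, std_old^2) || N(mu_new, std_new^2)), summed over action dimensions."""
    return np.sum(
        np.log(std_new / std_old)
        + (std_old ** 2 + (mu_old - mu_new) ** 2) / (2.0 * std_new ** 2)
        - 0.5
    )
```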

We could as well penalize the objective function instead of setting a constraint:

\[\begin{align} \max_\theta L_{\theta_{old}}(\theta)-\beta\mathbb E\left[D_{KL}(\pi_{\theta_{old}}(\cdot|s_t)\Vert \pi_\theta(\cdot|s_t))\right]\tag {4} \end{align}\]

It can be theoretically proved that if we take the maximal KL divergence over states instead of the expected KL divergence in the penalty term, then \( (4) \) is a lower bound on the policy performance \( (1) \). Therefore, by iteratively optimizing \( (4) \) (with the max KL divergence rather than the expectation), we are guaranteed to converge to at least a local optimum of the policy performance \( L(\pi_\theta) \) (as in the MM algorithm described in the supplementary materials). The reason we take the expectation here instead of the maximum is that the maximum is impractical to compute, since it would require evaluating the KL divergence at every state.

Note that, in practice, both \( \delta \) and \( \beta \) are hyperparameters, and \( \delta \) is easier to tune than \( \beta \): with a fixed \( \beta \), the penalized objective \( (4) \) tends to be too conservative, so a fixed \( \delta \) generally works better than a fixed \( \beta \).

Since TRPO is more complicated and harder to optimize than PPO, I leave the detailed discussion to the end for interested readers.

Update

Some improvements have been proposed since TRPO was first introduced. One effective way to make first-order gradient methods applicable to TRPO is to adaptively adjust the coefficient \( \beta \) in Eq. \( (4) \) as follows

\[\begin{align} &\mathrm{if}\ D_{KL}[\pi_{\theta_{old}}\Vert\pi_\theta] > \alpha_{high}D_{KL}^{exp}:\\\ &\quad \beta \leftarrow \lambda \beta\\\ &\mathrm{else\ if}\ D_{KL}[\pi_{\theta_{old}}\Vert\pi_\theta] < \alpha_{low}D_{KL}^{exp}:\\\ &\quad \beta \leftarrow {\beta\over\lambda} \end{align}\]

where \( D_{KL}^{exp} \) is the desired change in policy per iteration. The scaling term \( \lambda>1 \) controls the adjustment of the KL-penalty term when the actual change in the policy falls outside of the interval \( \left[\alpha_{low}D_{KL}^{exp}, \alpha_{high}D_{KL}^{exp}\right] \).
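In code, this adaptive adjustment is only a few lines. The sketch below is my own illustration; the default values of \( \alpha_{low} \), \( \alpha_{high} \) and \( \lambda \) are placeholders:

```python
def adapt_beta(beta, observed_kl, kl_target, alpha_low=0.5, alpha_high=2.0, lam=1.5):
    """Scale the KL-penalty coefficient when the measured KL divergence
    between pi_theta_old and pi_theta leaves the desired interval."""
    if observed_kl > alpha_high * kl_target:
        beta *= lam        # the policy changed too much: penalize the KL term more
    elif observed_kl < alpha_low * kl_target:
        beta /= lam        # the policy changed too little: penalize the KL term less
    return beta
```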


Proximal Policy Optimization (PPO)

TRPO relies on a hard KL constraint to restrict policy updates to a trust region. Such a constraint makes the algorithm complicated to optimize and incompatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks).

PPO dispenses with such a constraint and instead discourages large policy changes by modifying the surrogate objective \( (2) \). More concretely, let \( r_t(\theta) \) denote the probability ratio

\[\begin{align} r_t(\theta)={\pi_{\theta}(a_t|s_t)\over \pi_{\theta_{old}}(a_t|s_t)} \end{align}\]

PPO penalizes policy updates that move \( r_t(\theta) \) away from \( 1 \) with the following objective

\[\begin{align} \max_\theta L_t(\theta)=\mathbb E\left[\min (r_t(\theta)\hat A_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t)\right]\tag {5} \end{align}\]

where \( \epsilon \) is a hyperparameter, e.g., \( \epsilon=0.2 \). The first term inside the \( \min \) is just the surrogate objective defined in \( (2) \). The second term modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving \( r_t \) outside of the interval \( [1-\epsilon, 1+\epsilon] \). The \( \min \) says that when the advantage \(\hat A_t>0\), we cap the ratio at \(r_t \le 1+\epsilon\): we do not want \(\pi_\theta\) to become much more probable than \(\pi_{\theta_{old}}\) at \((s_t,a_t)\), since such over-optimization goes beyond the trust region and may hurt performance. Similar reasoning applies on the opposite side. As a result, the objective in Equation \((5)\) is a lower bound on the original objective \( (2) \), which prevents unexpectedly large policy updates.
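A quick numerical check of this intuition, in plain Python with made-up numbers: once \( r_t \) exceeds \( 1+\epsilon \) with a positive advantage, the clipped objective flattens out and offers no further incentive to increase the ratio.

```python
eps, A = 0.2, 1.0                               # positive advantage

def clipped_objective(r):
    clipped_r = max(min(r, 1 + eps), 1 - eps)   # clip(r, 1 - eps, 1 + eps)
    return min(r * A, clipped_r * A)

for r in (0.9, 1.0, 1.1, 1.2, 1.3, 1.5):
    print(r, clipped_objective(r))              # flattens at 1.2 for all r >= 1.2
```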

The authors also suggest adding an entropy bonus to ensure sufficient exploration. The final objective for the policy network becomes

\[\begin{align} \max_\theta L_t(\theta)=\mathbb E\left[\min\bigl(r_t(\theta)\hat A_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\bigr)+cH(\pi_\theta(\cdot|s_t))\right]\tag {6} \end{align}\]

where the entropy coefficient \( c \) used by the authors in Atari experiments is \( 0.01 \).
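The loss below is a minimal PyTorch sketch of objective \( (6) \) for a discrete-action policy; the function and argument names are my own, and this is not the authors' reference implementation.

```python
import torch
from torch.distributions import Categorical

def ppo_policy_loss(logits, old_log_probs, actions, advantages,
                    clip_eps=0.2, ent_coef=0.01):
    """Negative of objective (6), to be minimized by a first-order optimizer.

    logits:        output of the current policy network, shape (batch, n_actions)
    old_log_probs: log pi_theta_old(a_t|s_t), recorded at sampling time
    actions:       actions a_t sampled from pi_theta_old
    advantages:    advantage estimates
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    ratio = (log_probs - old_log_probs).exp()                      # r_t(theta)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()               # clipped surrogate (5)

    entropy = dist.entropy().mean()                                # exploration bonus
    return -(surrogate + ent_coef * entropy)
```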

In practice, we usually normalize \(\hat A\) with batch statistics, which effectively serves as an adaptive learning-rate heuristic that bounds the gradient variance (Tucker et al. 2018).
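The normalization itself is a one-liner (shown here for a PyTorch tensor; the small epsilon guarding against a zero standard deviation is a common convention of mine, not something prescribed above):

```python
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```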


Supplementary Materials

Solving (4)

Eq. \( (4) \) can be approximately solved by making a linear approximation to the objective term and a quadratic approximation to the penalty term, from which we have

\[\begin{align} \max_\theta\quad&L_{\theta_{old}}({\theta})_{\theta=\theta_{old}} + \left({\partial \over \partial\theta}L_{\theta_{old}}(\theta)_{|\theta=\theta_{old}}\right)^T(\theta-\theta_{old})\\\ &- \beta\bigg( D_{KL}(\pi_{\theta_{old}}, \pi_\theta)_{|\theta=\theta_{old}} + \left({\partial \over \partial \theta} D_{KL}(\pi_{\theta_{old}},\pi_\theta)_{|\theta=\theta_{old}}\right)^T(\theta-\theta_{old}) \\\ &\quad+{1\over 2}(\theta-\theta_{old})^T{\partial^2\over \partial\theta^2}D_{KL}(\pi_{\theta_{old}},\pi_\theta)_{|\theta=\theta_{old}}(\theta-\theta_{old}) \bigg)\tag{7} \end{align}\]

Since our objective is to find the \( \theta \) that optimizes \( (4) \), we can omit the terms in \( (7) \) that do not depend on \( \theta \). Also noticing that \( {\partial\over \partial \theta} D_{KL}(\pi_{\theta_{old}},\pi_\theta)_{|\theta=\theta_{old}}=0 \) since \( D_{KL}(\pi_{\theta_{old}},\pi_\theta) \) attains its minimum when \( \pi_{\theta}=\pi_{\theta_{old}} \), we simplify the objective \( (7) \) and have

\[\begin{align} \max_{\theta} g^T (\theta-\theta_{old})-{1\over 2}\beta(\theta-\theta_{old})^T H(\theta-\theta_{old})\tag {8} \end{align}\] \[\begin{align} where\ g&= {\partial\over \partial\theta}L_{\theta_{old}}(\pi_\theta)_{|\theta=\theta_{old}}\\\ H&={\partial^2\over \partial\theta^2}D_{KL}(\pi_{\theta_{old}},\pi_\theta)_{|\theta=\theta_{old}} \end{align}\]

By taking the derivative and setting it to zero, we can solve the above optimization directly and have

\[\begin{align} \beta H(\theta - \theta_{old}) &=g\tag {9}\\\ \theta&=\theta_{old}+{1\over\beta}H^{-1}g\tag {10} \end{align}\]

This approximate update is called the natural policy gradient.
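For intuition, here is a tiny NumPy sketch of the update \( (10) \) when \( H \) is small enough to form explicitly; the values of \( g \), \( H \) and \( \beta \) are placeholders.

```python
import numpy as np

g = np.array([0.5, -1.0])                  # policy gradient at theta_old
H = np.array([[2.0, 0.3],
              [0.3, 1.0]])                 # Hessian of the KL term at theta_old
beta = 10.0                                # penalty coefficient

theta_old = np.zeros(2)
step = np.linalg.solve(H, g) / beta        # (1/beta) H^{-1} g without forming H^{-1}
theta = theta_old + step
```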

This direct update is impractical in large-scale problems since computing \( H^{-1} \) is prohibitively costly (in both computation and memory). Instead, we generally solve \( (9) \) approximately using the conjugate gradient method, in which we directly compute \(Hx\) by

\[\begin{align} Hx&={\partial\over\partial\theta}\sum_j{\partial \over\partial\theta_j}D_{KL}(\pi_{\theta_{old}},\pi_\theta)x_j\tag {11}\\\ where\quad x&=\theta-\theta_{old} \end{align}\]

This is done by noticing that \(H_{i,j}={\partial\over\partial\theta_j}{\partial\over\partial\theta_i}D_{KL}(\pi_{\theta_{old}},\pi_\theta)\), and the \(k^{th}\) element of \(Hx\) is

\[\begin{align} (Hx)_{k}=&\sum_j H_{k,j}x_j\\\ =&\sum_j {\partial\over\partial\theta_j}{\partial\over\partial\theta_k}D_{KL}(\pi_{\theta_{old}},\pi_\theta)x_j\\\ =&{\partial\over\partial\theta_k}\sum_j {\partial\over\partial\theta_j}D_{KL}(\pi_{\theta_{old}},\pi_\theta)x_j\\\ \end{align}\]

which gives Equation \((11)\).

Since the summation in Equation \((11)\) yields a scalar, we only need to compute gradients twice in total.
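Putting the pieces together, here is a sketch of the double-backprop Hessian-vector product from Equation \( (11) \) and a conjugate gradient solver for \( (9) \). It assumes the policy parameters are a single flat tensor with `requires_grad=True` and that `policy_kl` is a placeholder returning the mean KL divergence on a batch of states; the small damping term is a common practical addition, not something discussed above.

```python
import torch

def hessian_vector_product(policy_kl, theta, x, damping=1e-2):
    """Compute Hx by differentiating the scalar (grad D_KL)^T x once more (Eq. (11))."""
    kl = policy_kl(theta)
    grad_kl = torch.autograd.grad(kl, theta, create_graph=True)[0]
    grad_kl_x = (grad_kl * x).sum()                    # scalar (grad D_KL)^T x
    hx = torch.autograd.grad(grad_kl_x, theta)[0]      # second backward pass gives Hx
    return hx + damping * x                            # damping keeps H positive definite

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve Hx = g given only the map x -> Hx."""
    x = torch.zeros_like(g)
    r = g.clone()                                      # residual g - Hx, with x = 0
    p = r.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        hp = hvp(p)
        alpha = rs_old / p.dot(hp)
        x = x + alpha * p
        r = r - alpha * hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Usage: x = conjugate_gradient(lambda v: hessian_vector_product(policy_kl, theta, v), g)
```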

According to [3], we can gain an additional \(20\%\) speed-up by noticing

\[\begin{align} H&=\left(JMJ^T+{\partial^2\pi_\theta\over\partial\theta^2}{\partial\over\partial\pi_\theta}D_{KL}(\pi_{\theta_{old}}, \pi_\theta)\right)_{|\theta=\theta_{old}}\\\ &\approx \left(JMJ^T\right)_{|\theta=\theta_{old}}\tag {12}\\\ where\ J&={\partial\pi_\theta\over\partial\theta}\\\ M&={\partial^2\over\partial\pi_{\theta}^2}D_{KL}(\pi_{\theta_{old}}, \pi_\theta) \end{align}\]

where the approximation is made since the second term vanishes as \( {\partial \over\partial\pi_\theta}D_{KL}(\pi_{\theta_{old}}, \pi_\theta) = 0 \) when \( \theta=\theta_{old} \).

Now let’s take a deeper look at \( M \) — in the following discussion, I omit \( \theta \) for clarity

\[\begin{align} M&={\partial^2\over\partial\pi^2}D_{KL}(\pi_{_{old}}, \pi)\\\ &=-{\partial^2 \over\partial\pi^2}\sum\pi_{old}\log\pi\\\ &=-{\partial\over\partial\pi}{\pi_{old}\over\pi} \end{align}\]

The resulting matrix is diagonal since \( {\partial\over\partial\pi_j}\left(\pi_{old}\over\pi\right)_i = 0 \) when \( i \ne j \). Therefore, we have

\[\begin{align} M=\begin{bmatrix} \left(\pi_{old}\over \pi^2\right)_1&0&\cdots&0\\\ 0&\left(\pi_{old}\over \pi^2\right)_2&\cdots&0\\\ \vdots&\vdots&\ddots&\vdots\\\ 0&0&\cdots&\left(\pi_{old}\over \pi^2\right)_n \end{bmatrix} \end{align}\]

where \( \left(\pi_{old}\over \pi^2\right)_n \) is the \( n^{th} \) entry of \( {\pi_{old}\over \pi^2} \). In particular, when \( \theta=\theta_{old} \), this reduces to

\[\begin{align} M_{|\theta=\theta_{old}}=\begin{bmatrix} \left(1\over \pi_{old}\right)_1&0&\cdots&0\\\ 0&\left(1\over \pi_{old}\right)_2&\cdots&0\\\ \vdots&\vdots&\ddots&\vdots\\\ 0&0&\cdots&\left(1\over \pi_{old}\right)_n \end{bmatrix} \end{align}\]

which can be compactly expressed as a vector consisting of the non-zero diagonal elements. In this way we can directly compute \(H\) using only two first-order derivatives. [3] in fact uses a more involved process to compute \(Hx\) based on this approximation, but I do not see why we would not compute \(H\) directly.
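As a sanity check of the \( JMJ^T \) approximation, the sketch below builds it for a single state and a tiny categorical policy whose logits are the parameters. The use of `torch.autograd.functional.jacobian` and the shape convention (actions along the rows of \( J \), so the product reads \( J^TMJ \)) are my own choices for illustration.

```python
import torch

theta = torch.randn(3, requires_grad=True)        # toy policy parameters: raw action logits

def policy_probs(params):
    return torch.softmax(params, dim=0)           # pi_theta(.|s) for a single state

pi_old = policy_probs(theta).detach()

J = torch.autograd.functional.jacobian(policy_probs, theta)   # d pi / d theta, shape (n_actions, n_params)
M_diag = 1.0 / pi_old                                          # non-zero diagonal of M at theta = theta_old

H_approx = J.t() @ torch.diag(M_diag) @ J                      # the approximation of H in (12)
```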

Solving (3)

With the machinery for solving \( (4) \) in hand, it is easier to solve \( (3) \). We first compute the search direction by the same conjugate-gradient procedure we used for \( (4) \), with \( \beta=1 \). Say we obtain the search direction \( s \). We then approximate the maximal step size \( \alpha \) allowed by the KL constraint as follows

\[\begin{align} D_{KL}(\pi_{\theta_{old}}\Vert\pi_\theta)\approx&{1\over2}(\theta_{old}+\alpha s-\theta_{old})^TH(\theta_{old}+\alpha s-\theta_{old})\\\ {1\over 2}(\alpha s)^TH(\alpha s)=&\delta\\\ \alpha=&\sqrt{2\delta\over s^THs} \end{align}\]

where \( \delta \) is the desired bound on the KL divergence. Finally, we perform a line search, starting from the maximal step size \( \alpha \) and shrinking it exponentially until the objective \( (2) \) improves.
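A compact sketch of this last step, where `surrogate` evaluates objective \( (2) \) at given parameters and `hvp` is the Hessian-vector product from above (the names, the backtracking ratio, and the maximum number of shrinks are my own; practical implementations usually also re-check the KL constraint during the line search):

```python
import torch

def trpo_line_search(theta_old, s, hvp, surrogate, delta,
                     backtrack_ratio=0.8, max_backtracks=10):
    """Scale the search direction s to the KL bound, then shrink until (2) improves."""
    sHs = s.dot(hvp(s))
    alpha = torch.sqrt(2 * delta / sHs)            # maximal step under the quadratic KL model
    old_obj = surrogate(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (backtrack_ratio ** i) * alpha * s
        if surrogate(theta_new) > old_obj:         # accept the first improving step
            return theta_new
    return theta_old                               # no improvement found: keep the old parameters
```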

MM Algorithm

Let \( f(\theta) \) be the concave objective function to be maximized. At step \( m \) of the MM algorithm, \( m=0, 1, \dots \), a constructed function \( g(\theta\vert \theta_m) \) is called a minorized version of the objective function (namely, a surrogate function) at \( \theta_m \) if

\[\begin{align} g(\theta|\theta_m)&\le f(\theta)\quad \mathrm{for\ all\ }\theta\\\ g(\theta_m|\theta_m)&=f(\theta_m) \end{align}\]

By maximizing \( g(\theta\vert \theta_m) \) instead of \( f(\theta) \) and letting

\[\begin{align} \theta_{m+1}=\arg\max_\theta g(\theta|\theta_m) \end{align}\]

the above iterative method guarantees that \( f(\theta_m) \) converges to a local optimum or a saddle point as \( m \) goes to infinity.
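A toy example of the MM recipe, with a concave \( f \) I made up: when \( f \) has an \( L \)-Lipschitz gradient, \( g(\theta|\theta_m)=f(\theta_m)+f'(\theta_m)(\theta-\theta_m)-{L\over2}(\theta-\theta_m)^2 \) minorizes \( f \) and touches it at \( \theta_m \), and maximizing it gives the update \( \theta_{m+1}=\theta_m+f'(\theta_m)/L \).

```python
def f(theta):                 # concave toy objective, maximized at theta = 2
    return -theta ** 2 + 4 * theta

def f_grad(theta):
    return -2 * theta + 4

L = 4.0                       # any L >= 2 works for this f
theta = -3.0
for m in range(50):
    theta = theta + f_grad(theta) / L   # argmax of the minorizer g(.|theta_m)

print(theta)                  # approaches 2, the maximizer of f
```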

References

  1. John Schulman et al. Trust Region Policy Optimization
  2. John Schulman et al. Proximal Policy Optimization Algorithms
  3. http://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
  4. A good intuition for trust region methods: https://www.youtube.com/watch?v=qaOKZkeutqE
  5. Tucker, George, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, and Sergey Levine. 2018. “The Mirage of Action-Dependent Baselines in Reinforcement Learning.” 35th International Conference on Machine Learning, ICML 2018 11: 7985–94. https://doi.org/10.17863/CAM.23539.
