by Sherwin Chen
Introduction

General policy gradient methods face two main challenges: 1) achieving stable and steady improvement despite the nonstationarity of the incoming data, and 2) sample efficiency. In the previous post, we discussed TRPO and PPO, which restrict the step size of policy updates to obtain reliable improvement. In this post, we shift our focus to the second challenge, which is addressed by using a value function to substantially reduce the variance of policy gradient estimates, at the cost of some bias, via an exponentially-weighted estimator of the advantage function.


Preliminary

First we introduce the notion of \( \gamma \)-just, which will be frequently referred to in the rest of the post. A \( \gamma \)-just estimator of the advantage function is one that does not introduce additional bias when we use it to approximate the true discounted advantage function \( A^{\pi,\gamma} \). In other words, an estimator is \( \gamma \)-just if \( \gamma \) is the only source of bias. Recall that the discount factor \( \gamma \) can itself be viewed as a variance reduction technique, at the price of some bias, since it diminishes the weight of future rewards. Formally, an estimator \( \hat A_t \) is \( \gamma \)-just for all \( t \) if

\[\begin{align} \mathbb E_{s_{0:\infty},a_{0:\infty}}\left[\hat A_t(s_{0:\infty}, a_{0:\infty})\nabla_\theta\log\pi_\theta(a_t|s_t)\right]=\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[A^{\pi, \gamma}(s_t,a_t)\nabla_\theta\log\pi_\theta(a_t|s_t)\right]\tag{1} \end{align}\]

where \(A^{\pi,\gamma}(s_t,a_t)=Q^{\pi,\gamma}(s_t,a_t)-V^{\pi,\gamma}(s_t)\) is the true discounted advantage function under policy \(\pi\).

One sufficient condition for \( \hat A_t \) to be \( \gamma \)-just is that

\[\begin{align} \hat A_t(s_{0:\infty}, a_{0:\infty})=Q_t(s_{t:\infty}, a_{t:\infty})-b_t(s_{0:t}, a_{0:t-1}) \tag{2} \end{align}\]

where \( b_t \) is an arbitrary function of states and actions sampled before \( a_t \), and \( Q_t \), which can depend on any trajectory variables starting from \( s_t, a_t \), is an unbiased estimator of the discounted \( Q \)-function \( Q^{\pi,\gamma}(s_t,a_t) \), i.e.

\[\begin{align} \mathbb E_{s_{t+1:\infty}, a_{t+1:\infty}}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi, \gamma}(s_t, a_t) \end{align}\]

A detailed proof that this condition is sufficient is given in the Supplementary Materials at the end of this post.
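For example, the Monte Carlo estimator that pairs the empirical discounted return with an arbitrary, possibly inaccurate, value function baseline,

\[\begin{align} \hat A_t(s_{0:\infty}, a_{0:\infty})=\sum_{i=0}^{\infty}\gamma^i r_{t+i}-V(s_t) \end{align}\]

is \( \gamma \)-just, because the return term is an unbiased estimator of \( Q^{\pi,\gamma}(s_t,a_t) \) and the baseline \( V(s_t) \) depends only on states sampled before \( a_t \). We will meet this estimator again as the \( \lambda=1 \) case of GAE in Equation \((6)\).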


Generalized Advantage Estimation

Let’s now consider the \( n \)-step advantage estimator, where \( \delta_t=r_t+\gamma V(s_{t+1})-V(s_t) \) denotes the TD residual of the approximate value function \( V \):

\[\begin{align} \hat A_t^{(n)} &= \sum_{i=0}^{n-1}\gamma^ir_{t+i}+\gamma^n V(s_{t+n})- V(s_t)\\\ &=r_{t}+\gamma V(s_{t+1})-V(s_{t})+\gamma\left(\sum_{i=1}^{n-1}\gamma^{i-1}r_{t+i}+\gamma ^{n-1}V(s_{t+n})-V(s_{t+1})\right)\\\ &=\delta_t+\gamma\left(\sum_{i=0}^{n-2}\gamma^{i}r_{t+i+1}+\gamma ^{n-1}V(s_{t+n})-V(s_{t+1})\right)\\\ &=\delta_t+\gamma\hat A_{t+1}^{(n-1)}\\\ &=\sum_{i=0}^{n-1}\gamma^i\delta_{t+i}\tag{3} \end{align}\]
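To make the telescoping concrete, consider the two-step case:

\[\begin{align} \hat A_t^{(2)}=r_t+\gamma r_{t+1}+\gamma^2V(s_{t+2})-V(s_t)=\delta_t+\gamma\delta_{t+1} \end{align}\]

where the intermediate terms \( \pm\gamma V(s_{t+1}) \) cancel.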

Equation \((3)\) admits a nice interpretation: if we view \( \delta_t \) as a shaped reward with \( V \) as the potential function (i.e., potential-based reward shaping), then the \( n \)-step advantage is simply the \( \gamma \)-discounted sum of these shaped rewards. With this in mind, we define the generalized advantage estimator \( GAE(\gamma, \lambda) \) as the exponentially-weighted average of all \( n \)-step estimators, which is closely analogous to \( TD(\lambda) \) but applied to the advantage function:

\[\begin{align} \hat A_t^{GAE(\gamma, \lambda)}&:=(1-\lambda)\left(\sum_{n=0}^{\infty}\lambda^{n}\hat A_{t}^{(n+1)}\right)\\\ &=(1-\lambda)\left(\sum_{n=0}^{\infty}\lambda^n\sum_{i=0}^{n}\gamma^i\delta_{t+i}\right)\\\ &=(1-\lambda)\left(\sum_{i=0}^{\infty}\gamma^i\delta_{t+i}\sum_{n=i}^{\infty}\lambda^n \right)\\\ &=(1-\lambda) \left(\sum_{i=0}^{\infty}\gamma^i\delta_{t+i}\lambda^i\sum_{n=0}^{\infty}\lambda^n\right)\\\ &=(1-\lambda)\left(\sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i}{1\over1-\lambda}\right)&\mathrm{since\ }\lambda < 1\\\ &=\sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i}\tag{4} \end{align}\]

Equation \((4)\) suggests that the generalized advantage estimator can simply be interpreted as the \( \gamma\lambda \)-discounted sum of shaped rewards. Taking a closer look at \( \lambda \), we can see that it controls a bias-variance trade-off just as \( \gamma \) does. In particular, setting \( \lambda=0 \) and \( \lambda=1 \) gives

\[\begin{align} \hat A_t^{GAE(\gamma, 0)}&=\delta_{t}\tag{5}\\\ \hat A_t^{GAE(\gamma, 1)}&=\sum_{i=0}^{\infty}\gamma^i r_{t+i}-V(s_t)\tag{6} \end{align}\]

It is easy to see that \( \hat A_t^{GAE(\gamma, 1)} \) is \( \gamma \)-just regardless of the accuracy of \( V \), since it satisfies Equation \((2)\), but it has high variance due to the sum of many reward terms. For \( \lambda < 1 \), \( \hat A_t^{GAE(\gamma, \lambda)} \) is \( \gamma \)-just only when \( V \) accurately estimates the discounted state value \( V^{\pi,\gamma} \); otherwise it induces bias, but it typically has much lower variance.
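In practice, Equation \((4)\) is evaluated over a finite trajectory of length \( T \) with the backward recursion \( \hat A_t=\delta_t+\gamma\lambda\hat A_{t+1} \), bootstrapping with \( V(s_T) \) at the end. Below is a minimal NumPy sketch of this computation; the function name and arguments are my own, and episode termination is ignored for simplicity.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    values = np.append(values, last_value)   # bootstrap with V(s_T)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]       # value-function regression targets
    return advantages, returns
```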

Although both \( \gamma \) and \( \lambda \) contribute to the bias-variance trade-off, they serve different purposes and work best with different ranges of values. Specifically, \( \gamma \) determines the scale of the return and value function, and it introduces bias whenever it is less than \( 1 \). \( \lambda \), on the other hand, governs the weights of the \( n \)-step returns, and introduces bias only when the value function is inaccurate. As a result, \( \lambda \) introduces far less bias than \( \gamma \) for a reasonably accurate value function, and the best value of \( \lambda \) is generally much lower than the best value of \( \gamma \).

Additionally, because \(\lambda\) controls the weights of the \(n\)-step returns, it is desirable to use long trajectory segments when \(\lambda\) is large, since the weights \(\lambda^n\) then decay slowly.


Algorithm

In this section, we’ll walk through the detailed algorithm introduced by John Schulman et al., which makes heavy use of trust regions to stabilize training.

Trust Region Value Function Estimation

The loss for the value function used in GAE is a simple mean squared error constrained by a trust region. More specifically, we define the objective as

\[\begin{align} \min_\phi\quad& L(\phi)=\mathbb E\left[\left\Vert V_\phi(s_t)-\hat V_t\right\Vert^2\right]\\\ s.t.\quad&\mathbb E\left[{\left\Vert V_\phi(s_t)- V_{\phi_{old}}(s_t)\right\Vert^2\over 2\sigma^2}\right]\le\epsilon \end{align}\]

where \( \sigma^2=\mathbb E\left[\left\Vert V_{\phi_{old}}(s_t)-\hat V_t\right\Vert^2\right] \) is computed before the optimization. This is solved with a procedure similar to TRPO's:

  1. Make a linear approximation to the objective and a quadratic approximation to the constraint, which gives
\[\begin{align} \min_\phi\quad &g^T(\phi-\phi_{old})\\\ s.t.\quad&{1\over 2}(\phi-\phi_{old})^TH(\phi-\phi_{old})\le\epsilon\\\ \mathrm{where}\quad&g=\nabla_{\phi}L(\phi)_{|\phi=\phi_{old}},\ H=\mathbb E \left[\nabla_{\phi}V(s){1\over \sigma^2}\nabla_\phi V(s)^T\right]_{|\phi=\phi_{old}} \end{align}\]
  2. Compute the step direction \( s \) by using the conjugate gradient algorithm to solve \( Hs=g \).
  3. Rescale \( s \) to \( \alpha s \) such that \( {1\over 2}(\alpha s)^TH(\alpha s)=\epsilon \) and take \( \phi=\phi_{old}-\alpha s \) (the minus sign because we are minimizing \( L \)); see the sketch after this list.
  4. Repeat the above process.
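To make the procedure concrete, here is a minimal NumPy sketch of steps 2 and 3. It assumes access to a function hvp(v) that returns the Gauss-Newton product \( Hv \) and a flattened gradient vector g; all names are placeholders of mine, not the authors' implementation.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H s = g with the conjugate gradient method,
    given only Hessian-vector products hvp(v) = H v."""
    s = np.zeros_like(g)
    r = g.copy()                 # residual g - H s (s = 0 initially)
    p = r.copy()
    r_dot = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p @ Hp)
        s += alpha * p
        r -= alpha * Hp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return s

def trust_region_value_step(phi, g, hvp, epsilon=0.01):
    """One trust-region update of the value-function parameters phi."""
    s = conjugate_gradient(hvp, g)                 # step direction: H s = g
    sHs = s @ hvp(s)                               # quadratic form s^T H s
    alpha = np.sqrt(2.0 * epsilon / (sHs + 1e-8))  # so 0.5 * (alpha s)^T H (alpha s) = epsilon
    return phi - alpha * s                         # subtract because we minimize L(phi)
```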


Policy Gradient Method

The choice of policy gradient method is rather flexible; both TRPO and PPO are good candidates. In the paper, the authors use TRPO for the policy update.

For completeness, the whole algorithm for iteratively updating the policy and the value function is given below.

Note that the advantages are computed before the value function update. The authors explain that additional bias would be introduced if the value function were updated first. In practice, we generally normalize the advantages to have zero mean and unit standard deviation to further stabilize training.
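Putting the pieces together, one iteration of the scheme described above might look like the following sketch, which reuses the compute_gae function from earlier; collect_trajectories, update_policy, and fit_value_function are hypothetical placeholders for the data collection, TRPO/PPO update, and trust-region value fit, respectively.

```python
import numpy as np

def train_iteration(env, policy, value_fn, gamma=0.99, lam=0.95):
    # Collect a batch of trajectories with the current policy (placeholder).
    batch = collect_trajectories(env, policy)

    # Compute GAE advantages with the *old* value function, before it is updated.
    for traj in batch:
        values = value_fn(traj["states"])      # length T + 1, including the final state
        traj["adv"], traj["returns"] = compute_gae(
            traj["rewards"], values[:-1], values[-1], gamma, lam)

    # Normalize advantages to zero mean and unit standard deviation.
    adv = np.concatenate([traj["adv"] for traj in batch])
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)

    # Update the policy first (e.g. with TRPO or PPO), then fit the value
    # function to the returns using the trust-region step from the previous section.
    update_policy(policy, batch, adv)
    fit_value_function(value_fn, batch)
```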

Supplementary Materials

Proof

We now prove that Equation \((2)\) is a sufficient condition for Equation \((1)\):

\[\begin{align} \mathbb E_{s_{0:\infty},a_{0:\infty}}\left[\hat A_t(s_{0:\infty}, a_{0:\infty})\nabla_\theta\log\pi_\theta(a_t|s_t)\right]&=\mathbb E_{s_{0:\infty},a_{0:\infty}}\Bigl[\bigl(Q_t(s_{t:\infty}, a_{t:\infty})-b_t(s_{0:t}, a_{0:t-1})\bigr)\nabla_\theta\log\pi_\theta(a_t|s_t)\Bigr]\\\ &=\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[Q_t(s_{t:\infty}, a_{t:\infty})\nabla_\theta\log\pi_\theta(a_t|s_t)\right]\\\ &\quad\quad -\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[b_t(s_{0:t}, a_{0:t-1})\nabla_\theta\log\pi_\theta(a_t|s_t)\right] \end{align}\]

Taking a closer look at the first term, we have

\[\begin{align} \mathbb E_{s_{0:\infty},a_{0:\infty}}\Bigl[Q_t(s_{t:\infty}, a_{t:\infty})\nabla_\theta\log\pi_\theta(a_t|s_t)\Bigr] &=\mathbb E_{s_{0:t},a_{0:t}}\Bigl[\mathbb E_{s_{t+1:\infty},a_{t+1:\infty}}[Q_t(s_{t:\infty}, a_{t:\infty})\nabla_\theta\log\pi_\theta(a_t|s_t)]\Bigr]\\\ &=\mathbb E_{s_{0:t},a_{0:t}}\Bigl[\mathbb E_{s_{t+1:\infty},a_{t+1:\infty}}[Q_t(s_{t:\infty}, a_{t:\infty})]\nabla_\theta\log\pi_\theta(a_t|s_t)\Bigr]\\\ &=\mathbb E_{s_{0:t},a_{0:t}}[Q^{\pi,\gamma}(s_{t}, a_{t})\nabla_\theta\log\pi_\theta(a_t|s_t)] \end{align}\]

where the last step follows because \(\mathbb E_{s_{t+1:\infty},a_{t+1:\infty}}[Q_t(s_{t:\infty}, a_{t:\infty})] \) is an unbiased estimator of the discounted \( Q \)-function. We follow a similar process to analyze the second term:

\[\begin{align} \mathbb E_{s_{0:\infty},a_{0:\infty}}\left[b_t(s_{0:t}, a_{0:t-1})\nabla_\theta\log\pi_\theta(a_t|s_t)\right] &=\mathbb E_{s_{0:t},a_{0:t-1}}\bigl[\mathbb E_{s_{t+1:\infty},a_{t:\infty}}[b_t(s_{0:t}, a_{0:t-1})\nabla_\theta\log\pi_\theta(a_t|s_t)]\bigr]\\\ &=\mathbb E_{s_{0:t},a_{0:t-1}}\bigl[b_t(s_{0:t}, a_{0:t-1})\mathbb E_{s_{t+1:\infty},a_{t:\infty}}[\nabla_\theta\log\pi_\theta(a_t|s_t)]\bigr]\\\ &=\mathbb E_{s_{0:t},a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\int_{a_t}\nabla_\theta\log\pi_\theta(a_t|s_t)\pi_\theta(a_t|s_t)da_t\right]\\\ &=\mathbb E_{s_{0:t},a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\int_{a_t}\nabla_\theta\pi_\theta(a_t|s_t)da_t\right]\\\ &=\mathbb E_{s_{0:t},a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\nabla_\theta\int_{a_t}\pi_\theta(a_t|s_t)da_t\right]\\\ &=\mathbb E_{s_{0:t},a_{0:t-1}}\bigl[b_t(s_{0:t}, a_{0:t-1})\nabla_\theta1\bigr]\\\ &=\mathbb E_{s_{0:t},a_{0:t-1}}\bigl[b_t(s_{0:t}, a_{0:t-1})0\bigr]\\\ &=0 \end{align}\]

Therefore, we have

\[\begin{align} &\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[\hat A_t(s_{0:\infty}, a_{0:\infty})\nabla_\theta\log\pi_\theta(a_t|s_t)\right]\\\ =&\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[Q^{\pi,\gamma}(s_t,a_t)\nabla_\theta\log\pi_\theta(a_t|s_t)\right]\\\ =&\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[\big(Q^{\pi, \gamma}(s_t,a_t)-V^{\pi, \gamma}(s_t)\big) \nabla_\theta\log\pi_\theta(a_t|s_t)\right]\\\ =&\mathbb E_{s_{0:\infty},a_{0:\infty}}\left[A^{\pi, \gamma}(s_t,a_t)\nabla_\theta\log\pi_\theta(a_t|s_t)\right] \end{align}\]

where the penultimate step holds because \(V^{\pi,\gamma}(s_t)\) does not depend on \(a_t\), so \(\mathbb E_{s_{0:\infty},a_{0:\infty}}[V^{\pi, \gamma}(s_t)\nabla_\theta\log\pi_\theta(a_t\vert s_t)]=0\) by the same argument used for \(b_t\). Now that we have derived Equation \((1)\) from Equation \((2)\), the proof is complete.


References

John Schulman et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.