Introduction

We discuss Maximum a Posteriori Policy Optimization (MPO), a KL-regularized reinforcement learning algorithm for continuous control problems. Despite its appealing performance on difficult control tasks, MPO is more involved than contemporary state-of-the-art methods such as SAC and TD3. Nevertheless, it introduces several interesting techniques that may benefit other algorithms.

Background

In our previous posts[1, 2], we considered control problems as a temporal probabilistic graphical model, in which we introduced an optimality variable \(O_t\) that indicates whether the corresponding state and action are optimal. We then derived an evidence lower bound on the log-likelihood of optimality \(\log p(O_{1:T})\) via variational inference. We briefly repeat the derivation as follows

\[\begin{align} \log p(O_{1:T})&=\log{p(O_{1:T}|\tau)p(\tau)q(\tau)\over p(\tau|O_{1:T}) q(\tau)}\\\ &\qquad\color{red}{\text{take expectation over }q(\tau) \text{ and rearrange}}\\\ &=\mathbb E_{q(\tau)}[\log p(O_{1:T}|\tau)]-D_{KL}(q(\tau)\Vert p(\tau))+D_{KL}(q(\tau)\Vert p(\tau|O_{1:T}))\\\ &\ge\mathbb E_{q(\tau)}[\log p(O_{1:T}|\tau)]-D_{KL}(q(\tau)\Vert p(\tau))\\\ &\qquad\color{red}{\text{expand the trajectory, omit transition probabilities, and rearrange}}\\\ &=\sum_t\mathbb E_{s_t,a_t\sim q}\left[\log p(O_t|s_t,a_t)\right]-D_{KL}(q(a_t| s_t)\Vert p(a_t|s_t))\\\ &\qquad\color{red}{p(O_t|s_t,a_t)\propto\exp(r(s_t,a_t))}\\\ &=\sum_t\mathbb E_{s_t,a_t\sim q}\left[r(s_t,a_t)\right]-D_{KL}(q(a_t| s_t)\Vert p(a_t|s_t)) \end{align}\]

Adding a discount factor \(\gamma\) and a temperature \(\eta\), we obtain the KL-regularized RL objective:

\[\begin{align} \max_q\mathcal J(q,p)=\max_q \mathbb E_{q}[\sum_{t=0}^\infty\gamma^t\big(r(s_t,a_t)-\eta D_{KL}(q(a_t|s_t)\Vert p(a_t|s_t))\big)]\tag 1 \end{align}\]

Define \(V(s_t)=\mathbb E_{q}[\sum_{t'\ge t}\gamma^{t'-t}\big(r(s_{t'},a_{t'})-\eta D_{KL}(q(a_{t'}\vert s_{t'})\Vert p(a_{t'}\vert s_{t'}))\big)]\) and \(Q(s_t,a_t)=r(s_t,a_t)+\gamma\mathbb E_{s_{t+1}\sim p(s_{t+1}\vert s_t,a_t)}[V(s_{t+1})]\). We can rewrite Equation \((1)\) as

\[\begin{align} \max_q\mathcal J(q,p)=\max_q \mathbb E_q[Q(s_t,a_t)-\eta D_{KL}(q(a_t|s_t)\Vert p(a_t|s_t))]\tag 2 \end{align}\]

Note that given the prior policy \(p\) and the action-value function \(Q\), we can derive the optimal policy \(q\) in closed form as \(q(a\vert s)\propto p(a\vert s)\exp(Q(s,a)/\eta)\). Therefore, MPO alternates between evaluating \(Q\) and optimizing \(p\), where \(p\) can be regarded as the \(q\) from the previous step. This process can be regarded as an instance of the family of Expectation-Maximization (EM) algorithms. We elaborate on MPO in the following sections, following the EM framework.

E-Step

In the E-step of iteration \(i\), we first evaluate the \(Q\)-function with \(p=q=\pi_i\), where \(\pi_i\) is the policy learned in the previous iteration. This reduces the regularized \(Q\)-function to the standard one, \(Q(s_t,a_t)=\mathbb E_{\pi_i}[\sum_{t'\ge t}\gamma^{t'-t}r(s_{t'},a_{t'})]\), and enables us to optimize \(Q\) by minimizing the Bellman residual; in practice, the target \(Q\) is estimated with the Retrace algorithm.
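As a rough illustration, here is a minimal PyTorch-style sketch of this policy-evaluation step, simplified to a one-step TD target instead of Retrace; `q_net`, `q_target_net`, `policy`, and the batch layout are hypothetical placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def e_step_q_loss(q_net, q_target_net, policy, batch, gamma=0.99, n_action_samples=8):
    """One-step TD loss for policy evaluation in the E-step (the paper uses Retrace
    targets instead of this simplified one-step target).

    q_net(s, a) and q_target_net(s, a) are assumed to return Q-values of shape [B],
    and policy(s) a torch.distributions.Distribution over actions.
    """
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer
    with torch.no_grad():
        # V(s') ~= E_{a'~pi_i}[Q_target(s', a')], estimated with a few sampled actions
        a_next = policy(s_next).sample((n_action_samples,))              # [N, B, act_dim]
        s_next_rep = s_next.unsqueeze(0).expand(n_action_samples, *s_next.shape)
        v_next = q_target_net(s_next_rep, a_next).mean(dim=0)            # [B]
        target = r + gamma * (1.0 - done) * v_next
    return F.mse_loss(q_net(s, a), target)
```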

M-Step

In the M-step of iteration \(i\), we update \(\pi_i\) to \(\pi_{i+1}\). This can be done in two ways, depending on which of \(p\) and \(q\) we optimize.

Optimizing \(q=\pi\)

In this way, we optimize Equation \((2)\) with \(p=\pi_i\), i.e.,

\[\begin{align} q=\arg\max_\pi{1\over N}\sum_{s\sim\mathcal D}\mathbb E_{a\sim \pi(a|s)} [Q(s,a)-\eta D_{KL}(\pi(a|s)\Vert p(a|s))]\tag 3 \end{align}\]

where \(N\) is the number of states \(s\) in a minibatch. This results in an algorithm similar to SAC or TRPO/PPO, depending on how \(Q\) is estimated: if \(Q\) is approximated by a learned action-value function, the update resembles SAC; if \(Q\) is estimated from trajectory returns, it resembles TRPO/PPO, but with a reversed KL term.
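For concreteness, a minimal sketch of the policy loss in Equation \((3)\), assuming reparameterizable distributions; `policy`, `prior`, and `q_net` are hypothetical modules, and the Monte Carlo estimate of the expectation over actions is a simplification.

```python
import torch

def soft_kl_policy_loss(policy, prior, q_net, states, eta=0.1, n_samples=8):
    """Maximize E_{a~pi}[Q(s, a)] - eta * KL(pi || prior), i.e. Equation (3),
    via the reparameterization trick. policy(s) and prior(s) are assumed to
    return torch.distributions objects supporting rsample() and kl_divergence.
    """
    pi = policy(states)
    with torch.no_grad():
        p = prior(states)                              # the prior p = pi_i
    a = pi.rsample((n_samples,))                       # [N, B, act_dim]
    s_rep = states.unsqueeze(0).expand(n_samples, *states.shape)
    expected_q = q_net(s_rep, a).mean(dim=0)           # Monte Carlo estimate of E_pi[Q]
    kl = torch.distributions.kl_divergence(pi, p)      # per-state KL(pi || p)
    return -(expected_q - eta * kl).mean()             # negate to minimize with SGD
```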

Optimizing \(p=\pi\)

Notice that the optimal solution of \(q\) for Equation \((2)\) is \(q(a\vert s)\propto p(a\vert s)\exp(Q(s,a)/\eta)\). Therefore, we can compute a sample-based policy \(q(a_i\vert s_j)={\exp(Q(s_j,a_i)/\eta)\over\sum_k\exp(Q(s_j,a_k)/\eta)}\), where \(a_i\sim p(a\vert s_j)\). This enables us to optimize \(p\) by minimizing the KL divergence between \(q\) and \(p\)

\[\begin{align} p=\arg\min_\pi D_{KL}(q\Vert \pi)=\arg\max_\pi{1\over N}\sum_{s\sim\mathcal D}\sum_{a\sim q(a|s)}q(a|s)\log \pi(a|s)\tag 4 \end{align}\]
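Because the sample-based \(q\) is just a per-state softmax over the \(Q\)-values of the sampled actions, it can be computed directly; a minimal sketch, assuming the \(Q\)-values are arranged as a \([K, N]\) tensor (\(K\) states, \(N\) actions per state):

```python
import torch

def nonparametric_weights(q_values, eta):
    """Sample-based q(a_i|s_j) = softmax_i(Q(s_j, a_i) / eta) for actions a_i drawn
    from the prior p(a|s_j).

    q_values: tensor of shape [K, N] holding Q(s_j, a_i) for N actions per state.
    Returns weights of the same shape that sum to one over the action dimension.
    """
    return torch.softmax(q_values / eta, dim=-1)
```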

Unfortunately, sample-based maximum likelihood may overfit to the samples. Additionally, \(q(a\vert s)\) can be unreliable when \(Q\) is poorly approximated, potentially causing a large change of the action distribution in the wrong direction when optimizing Equation \((4)\). One effective regularization that addresses both concerns is to limit the overall change of the policy. This is done by adding a KL constraint, changing the objective from Equation \((4)\) to

\[\begin{align} p=&\arg\max_\pi{1\over N}\sum_{s\sim\mathcal D}\sum_{a\sim q(a|s)}q(a|s)\log \pi(a|s)\tag 5\\\ s.t.&\quad D_{KL}(\pi_i\Vert \pi)\le \epsilon_\pi \end{align}\]

To make the objective amenable to gradient-based optimization, we employ the generalized Lagrangian, yielding the following primal optimization problem

\[\begin{align} \max_\pi\min_{\lambda\ge 0}{1\over N}\sum_{s\sim\mathcal D}\sum_{a\sim q(a|s)}q(a|s)\log \pi(a|s)+\lambda \left(\epsilon_\pi-{1\over N}\sum_{s\sim\mathcal D}D_{KL}(\pi_i(\cdot|s)\Vert \pi(\cdot|s))\right)\tag 6 \end{align}\]

We solve Equation \((6)\) by alternately optimizing \(\pi\) and \(\lambda\), holding the other fixed.
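As a rough sketch of this alternating scheme, assuming the policy networks return `torch.distributions` objects and that \(\lambda\) is kept positive through a softplus parameterization (all names below are hypothetical):

```python
import torch
import torch.nn.functional as F

def m_step_losses(policy, target_policy, states, actions, weights, log_lam, eps_pi=1e-2):
    """Alternating losses for the primal problem in Equation (6): one gradient step on
    the policy pi with lambda fixed, one on lambda with pi fixed.

    actions/weights come from the nonparametric q above (transposed so actions come
    first), with shapes [N, B, act_dim] and [N, B]; log_lam is an unconstrained scalar
    parameter and lambda = softplus(log_lam).
    """
    lam = F.softplus(log_lam)
    pi = policy(states)
    with torch.no_grad():
        pi_old = target_policy(states)                          # pi_i in Equation (6)
    weighted_ll = (weights * pi.log_prob(actions)).sum(dim=0).mean()
    kl = torch.distributions.kl_divergence(pi_old, pi).mean()   # KL(pi_i || pi)
    policy_loss = -(weighted_ll + lam.detach() * (eps_pi - kl))
    lam_loss = lam * (eps_pi - kl.detach())  # gradient descent raises lambda when KL > eps_pi
    return policy_loss, lam_loss
```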

Note that the reward and the KL terms are on arbitrary relative scales, which makes it difficult to choose the temperature \(\eta\). Abdolmaleki et al. (2018a, 2018b) find it empirically better to replace the soft KL regularization in Equation \((2)\) with a hard constraint

\[\begin{align}\max_q\mathcal J(q)=\max_q \mathbb E_q[Q(s_t,a_t)]\tag 7\\\ s.t.\quad D_{KL}(q(a_t|s_t)\Vert p(a_t|s_t))<\epsilon \end{align}\]

As a result, \(\eta\) can be found by solving the following convex dual function (derivation in the Supplementary Materials)

\[\begin{align} \eta=\arg\min_\eta\ \eta\epsilon+\eta{1\over K}\sum_{j=1}^K\log\left({1\over N}\sum_{i=1}^N\exp(Q(s_j,a_i)/\eta)\right)\tag 8 \end{align}\]

In practice, this optimization is performed with a few steps of gradient descent on \(\eta\) for every batch, after the network updates. As \(\eta\) must be positive, we use a projection operator (e.g., softplus) to ensure its positivity.
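A minimal sketch of this dual loss, again assuming a \([K, N]\) tensor of \(Q\)-values and a softplus parameterization of \(\eta\):

```python
import math
import torch
import torch.nn.functional as F

def eta_dual_loss(q_values, log_eta, eps=0.1):
    """Sample-based dual of Equation (8):
    g(eta) = eta * eps + eta * (1/K) sum_j log( (1/N) sum_i exp(Q(s_j, a_i) / eta) ).

    q_values: tensor of shape [K, N]; eta = softplus(log_eta) keeps the temperature
    positive. A few gradient steps on this loss per batch are enough in practice.
    """
    eta = F.softplus(log_eta)
    n = q_values.shape[-1]
    # logsumexp(Q/eta) - log N is a numerically stable log of the inner mean
    log_mean_exp = torch.logsumexp(q_values / eta, dim=-1) - math.log(n)
    return eta * eps + eta * log_mean_exp.mean()
```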

Song et al. (2019) further find it better to compute the nonparametric policy \(q\) and the temperature \(\eta\) using only the samples with the top \(50\%\) of advantages.
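A one-line sketch of this filtering step, assuming a flat tensor of advantages:

```python
import torch

def top_half_mask(advantages):
    """V-MPO-style filter: keep only the samples whose advantage lies in the top 50%.
    The resulting boolean mask is applied before computing the nonparametric q and
    the eta dual loss.
    """
    return advantages >= advantages.median()
```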

Fitting an improved Gaussian policy

The above method works for any distribution. However, especially for continuous action spaces, it can still suffer from premature convergence (i.e., the variance collapses to zero). The reason is that we are essentially maximizing \(Q\)-values, and the optimal solution assigns probability \(1\) to the best action according to \(Q\). However, \(Q\)-values are generally not an accurate estimate of the expected return, especially early in training, so such a policy collapse can be devastating and result in premature convergence. The KL constraint can postpone this effect, but it does not stop the policy from losing entropy. In that case, an entropy constraint or regularization may be helpful.

Abdolmaleki et al. (2018b) find that a simple change to Gaussian policies can avoid premature convergence. Concretely, they optimize the mean and covariance separately: when optimizing one, the other is fixed to its value from the target network.

This procedure has two advantages: 1) the gradient w.r.t. the parameters of the covariance is now independent of changes in the mean; hence the only way the policy can increase the likelihood of good samples far away from the mean is by stretching along the value landscape. 2) we can set the KL bounds for the mean and the covariance separately. This is especially useful in high-dimensional action spaces, where we want to avoid ill-conditioning of the covariance matrix while still allowing fast learning through large changes to the mean.
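As a sketch of the decoupled KL terms for a diagonal Gaussian policy (the papers use full covariance matrices, and the weighted log-likelihood in the M-step is split the same way), assuming mean and standard-deviation tensors from the online and target networks:

```python
import torch
from torch.distributions import Normal, Independent, kl_divergence

def decoupled_gaussian_kls(mean, std, target_mean, target_std):
    """Decoupled KL terms for a diagonal Gaussian policy: the mean update is bounded
    by a KL that holds the covariance at the target network's value, and the
    covariance update by a KL that holds the mean at the target network's value.
    Each term gets its own bound (epsilon_mu, epsilon_sigma) in Equation (6).
    """
    old = Independent(Normal(target_mean, target_std), 1)
    # KL(pi_i || pi) with the covariance fixed to the target's: constrains the mean
    kl_mean = kl_divergence(old, Independent(Normal(mean, target_std), 1))
    # KL(pi_i || pi) with the mean fixed to the target's: constrains the covariance
    kl_cov = kl_divergence(old, Independent(Normal(target_mean, std), 1))
    return kl_mean, kl_cov
```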

References

Abdolmaleki, Abbas, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. 2018a. “Maximum a Posteriori Policy Optimisation.” ArXiv.

Abdolmaleki, Abbas, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin Riedmiller. 2018b. “Relative Entropy Regularized Policy Iteration.” ArXiv.

Song, H. Francis, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, et al. 2019. “V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control,” 1–19. http://arxiv.org/abs/1909.12238.

Supplementary Materials

Dual Function Derivation

When optimizing \(p=\pi\), we solve

\[\begin{align} \max_q\mathcal J(q,p)=&\max_q \int_s\mu(s)\int_a q(a|s)Q(s,a)\\\ s.t.\quad\int_s \mu(s)D_{KL}(q(a|s)\Vert p(a|s))<&\epsilon\\\ \int_s\int_a\mu(s)q(a|s)=&1 \end{align}\]

where \(\mu(s)\) is the state distribution.

First we write the generalized Lagrangian

\[\begin{align} \max_q\min_\eta\min_\nu\mathcal L(q,\eta,\nu)=&\int_s\mu(s)\int_a q(a|s)Q(s,a)\\\ &+\eta\left(\epsilon-\int_s \mu(s)\int_aq(a|s)\log {q(a|s)\over p(a|s)}\right)\\\ &+\nu\left(1-\int_s\int_a\mu(s)q(a|s)\right) \end{align}\]

Taking the derivative w.r.t. \(q\) yields

\[\begin{align} {\partial\over\partial q}\mathcal L(q,\eta,\nu)=Q(s,a)-\eta\log q(a|s)+\eta\log p(a|s)-(\eta+\nu) \end{align}\]

Setting it to zero and rearranging terms, we get

\[\begin{align} q(a|s)=p(a|s)\exp(Q(s,a)/\eta)\exp(-(\eta+\nu)/\eta)\tag 9 \end{align}\]

The last exponential factor is independent of \(a\) and acts as the normalization constant for \(q\). Therefore, we have

\[\begin{align} \exp\left({\eta+\nu\over\eta}\right)=\int_a p(a|s)\exp(Q(s,a)/\eta)\\\ {\eta+\nu\over\eta}=\log\int_a p(a|s)\exp(Q(s,a)/\eta) \end{align}\]

Now we plug Equation \((9)\) into the log-likelihood ratio in the Lagrangian, which gives

\[\begin{align} \mathcal L(q,\eta,\nu)=&\int_s\mu(s)\int_a q(a|s)Q(s,a)\\\ &+\eta\left(\epsilon-\int_s \mu(s)\int_aq(a|s)\left({Q(s,a)\over\eta}-{(\eta+\nu)\over\eta}\right)\right)\\\ &+\nu\left(1-\int_s\int_a\mu(s)q(a|s)\right)\\\ =&\eta\epsilon+\eta\int_s\mu(s){\eta+\nu\over\eta}\\\ =&\eta\epsilon+\eta\int_s\mu(s)\log\int_a p(a|s)\exp(Q(s,a)/\eta)\tag{10} \end{align}\]

In practice, we sample \(K\) visited states from the dataset and, for each, \(N\) actions from \(p(a\vert s)\), which gives the sample-based objective in Equation \((8)\).