by Sherwin Chen

Introduction

Bootstrapping is a core mechanism in reinforcement learning: most RL algorithms replace the true value of the next state with their current estimate of that value. Vieillard et al. 2020a argue that one can augment the Bellman updates with a policy signal by adding a scaled log-policy to the immediate reward. They show that DQN with such a simple modification achieves performance competitive with distributional methods.

Munchausen Reinforcement Learning

Fun note. The proposed method is called "Munchausen Reinforcement Learning" (M-RL) in reference to a famous passage of The Surprising Adventures of Baron Munchausen by Raspe, in which the Baron pulls himself out of a swamp by pulling on his own hair.

Motivation: Suppose an optimal deterministic policy \(\pi^*\) is known. Its log-policy is \(0\) for optimal actions and \(-\infty\) for sub-optimal ones. This is a very strong learning signal that we could add to the reward to ease learning, without changing the optimal control. Since the optimal policy \(\pi^*\) is unknown, we replace it with the current policy \(\pi\). As we will see later, this choice favors policies with a large action gap, since such policies provide less negative log-policy rewards.

M-DQN

M-DQN modifies DQN by first replacing the hard-max target with a soft, entropy-regularized target and then adding a scaled log-policy to the immediate reward. We can write the target of M-DQN as follows

\[\begin{align} \hat q_{\text{m-dqn}}(s,a)=&r(s,a)\color{red}{+\alpha\tau\log\pi(a|s)}+\gamma\sum_{a'}\pi(a'|s')(q(s',a')\color{blue}{-\tau\log\pi(a'|s')})\tag 1\\\ \pi(a|s)=&\text{softmax}({q(s,a)\over\tau})\tag 2 \end{align}\]

where we omit the time step and use \(s',a'\) to denote the next state and action. The term introduced by the soft target is highlighted in blue and the one introduced by M-DQN in red.

To avoid numerical instability caused by small \(\tau\), we usually compute the policy as: \(\pi=\text{softmax}({q-\max(q)\over \tau})\).
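
To make this concrete, here is a minimal NumPy sketch of the single-transition target in Equations \((1)\)-\((2)\), using the stabilized log-softmax above. The function names, argument layout, and default hyperparameters are illustrative assumptions; in practice the \(q\)-values would come from the target network and the computation would be batched.

```python
import numpy as np

def log_softmax(q, tau):
    """Numerically stable log pi = log softmax(q / tau), using the q - max(q) trick."""
    z = (q - q.max()) / tau
    return z - np.log(np.exp(z).sum())

def m_dqn_target(r, a, q_s, q_next, alpha=0.9, tau=0.03, gamma=0.99):
    """M-DQN target (Eqs. 1-2) for a single transition (s, a, r, s').

    r      : immediate reward r(s, a)
    a      : index of the action taken in s
    q_s    : 1-D array of q(s, .), used for the red Munchausen term
    q_next : 1-D array of q(s', .), used for the blue soft bootstrap
    """
    log_pi_s = log_softmax(q_s, tau)          # log pi(.|s)
    log_pi_next = log_softmax(q_next, tau)    # log pi(.|s')
    pi_next = np.exp(log_pi_next)             # pi(.|s')

    munchausen_bonus = alpha * tau * log_pi_s[a]                      # red term
    soft_bootstrap = (pi_next * (q_next - tau * log_pi_next)).sum()   # blue soft value of s'
    return r + munchausen_bonus + gamma * soft_bootstrap
```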

Multi-step: In the multi-step case, it seems that M-DQN adds the log-policy term only to the immediate reward, and applies the entropy term only at the bootstrapped state rather than at every intermediate step. In other words, the target is computed as follows

\[\begin{align} \hat q_{\text{m-dqn}}(s_t,a_t)=\sum_{k=t}^{t+n-1}\gamma^{k-t}r(s_k,a_k)\color{red}{+\alpha\tau\log\pi(a_t|s_t)}+\gamma^n\sum_{a'}\pi(a'|s_{t+n})(q(s_{t+n},a')\color{blue}{-\tau\log\pi(a'|s_{t+n})}) \end{align}\]

This is memory- and compute-efficient, as storing and recomputing the policies at all intermediate states would be expensive. Alternatively, one could use the policies recorded at run time as an approximation, though such policies are often stale.
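
Under that convention, an n-step variant of the target might look as follows, reusing `log_softmax` from the sketch above; again, names and defaults are assumptions, and intermediate rewards are discounted as in the standard n-step return.

```python
def m_dqn_nstep_target(rewards, a_t, q_s_t, q_s_tn, alpha=0.9, tau=0.03, gamma=0.99):
    """n-step M-DQN target: the log-policy bonus is applied only at (s_t, a_t)
    and the soft bootstrap only at s_{t+n}.

    rewards : sequence [r_t, ..., r_{t+n-1}]
    a_t     : action taken at s_t
    q_s_t   : 1-D array of q(s_t, .)
    q_s_tn  : 1-D array of q(s_{t+n}, .)
    """
    n = len(rewards)
    n_step_return = sum(gamma**k * r for k, r in enumerate(rewards))
    log_pi_t = log_softmax(q_s_t, tau)
    log_pi_n = log_softmax(q_s_tn, tau)
    pi_n = np.exp(log_pi_n)
    soft_bootstrap = (pi_n * (q_s_tn - tau * log_pi_n)).sum()
    return n_step_return + alpha * tau * log_pi_t[a_t] + gamma**n * soft_bootstrap
```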

Following a similar process, one can derive the target for distributional methods such as IQN; we omit it and dive directly into the theoretical analysis.

Theoretical Analysis

Understanding M-VI

We can write M-DQN in the following abstract form

\[\begin{align} \begin{cases} \pi_{k+1}&=\arg\max_\pi\left<\pi, q_k\right>\color{blue}{+\tau\mathcal H(\pi)}\\\ q_{k+1}&=r\color{red}{+\alpha\tau\log\pi_{k+1}}+\gamma P\left<\pi_{k+1},q_{k}\color{blue}{-\tau\log\pi_{k+1}}\right>+\epsilon_{k+1} \end{cases}\tag 3 \end{align}\]

where we define the component-wise dot product \(\left<f,g\right>=\sum_af(s,a)g(s,a)\) and the transition kernel \(Pv=\sum_{s'}P(s'\vert s,a)v(s')\), and \(\epsilon_{k+1}\) stands for the error between the actual and the ideal update. We call this scheme Munchausen Value Iteration, M-VI(\(\alpha,\tau\)). Removing the red term, we retrieve entropy-regularized approximate value iteration (AVI), of which Soft-DQN and SAC are instantiations. Removing also the blue term, we retrieve classic AVI, of which DQN is an instantiation.
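
As a sanity check on the abstract scheme, the following is a hedged tabular sketch of one exact M-VI(\(\alpha,\tau\)) iteration for a known reward matrix and transition kernel; the array shapes and names are assumptions made for illustration, and there is no approximation error \(\epsilon_{k+1}\) here.

```python
import numpy as np

def log_softmax_rows(q, tau):
    """Row-wise log softmax(q / tau) with the max-subtraction trick."""
    z = (q - q.max(axis=-1, keepdims=True)) / tau
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def m_vi_step(q, r, P, alpha=0.9, tau=0.03, gamma=0.99):
    """One exact iteration of M-VI(alpha, tau), Equation (3).

    q : [S, A] current estimate q_k
    r : [S, A] reward function
    P : [S, A, S] transition kernel P(s'|s, a)
    """
    log_pi = log_softmax_rows(q, tau)    # greedy step: pi_{k+1} = softmax(q_k / tau)
    pi = np.exp(log_pi)
    # per-state soft value <pi_{k+1}, q_k - tau * log pi_{k+1}>
    v_soft = (pi * (q - tau * log_pi)).sum(axis=-1)          # shape [S]
    # evaluation step: r + alpha * tau * log pi_{k+1} + gamma * P v_soft
    return r + alpha * tau * log_pi + gamma * P @ v_soft     # shape [S, A]
```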

To gain some insight, we rewrite the evaluation step. Setting \(\alpha=1\) and defining \(q'_k:=q_k-\tau\log\pi_k\), the value update can be rewritten as

\[\begin{align} q_{k+1}=&r+\tau\log\pi_{k+1}+\gamma P\left<\pi_{k+1},q_k-\tau\log\pi_{k+1}\right>+\epsilon_{k+1}\\\ q_{k+1}-\tau\log\pi_{k+1}=&r+\gamma P\left<\pi_{k+1},q_k-\tau\log \pi_k +\tau\log \pi_k-\tau\log \pi_{k+1} \right>+\epsilon_{k+1}\\\ q'_{k+1}=&r+\gamma P\big(\left<\pi_{k+1},q'_k\right>-\tau D_{KL}[\pi_{k+1}\Vert\pi_k]\big)+\epsilon_{k+1} \end{align}\]

The policy update step can be rewritten as

\[\begin{align} \left<\pi,q_k\right>+\tau\mathcal H(\pi)=\left<\pi,q_k-\tau\log\pi_k\right>-\tau D_{KL}(\pi\Vert\pi_k)=\left<\pi,q'_k\right>-\tau D_{KL}(\pi\Vert\pi_k) \end{align}\]

This indicates that M-VI(\(1, \tau\)) implicitly performs KL regularization between successive policies; Theorem 1 below extends this view to general \(\alpha\).
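
A quick numeric sanity check of this identity, with arbitrary \(q\)-values and policies for a single state (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.1
q_k = rng.normal(size=5)                # q_k(s, .) for one state
pi_k = rng.dirichlet(np.ones(5))        # previous policy pi_k(.|s)
pi = rng.dirichlet(np.ones(5))          # an arbitrary policy pi(.|s)
q_prime = q_k - tau * np.log(pi_k)      # q'_k = q_k - tau * log pi_k

lhs = pi @ q_k - tau * (pi * np.log(pi)).sum()             # <pi, q_k> + tau * H(pi)
rhs = pi @ q_prime - tau * (pi * np.log(pi / pi_k)).sum()  # <pi, q'_k> - tau * KL(pi || pi_k)
assert np.isclose(lhs, rhs)
```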

Relation to MD-VI

We now show that M-VI(\(\alpha, \tau\)) is an instance of Mirror Descent VI (MD-VI) from Vieillard et al. 2020b, with the KL term scaled by \(\alpha\tau\) and the entropy term scaled by \((1-\alpha)\tau\). That is, M-VI(\(\alpha,\tau\)) is equivalent to MD-VI(\(\alpha\tau,(1-\alpha)\tau\)), as formalized below.

Theorem 1. For any \(k\ge 0\), defining \(q'_k:=q_k-\alpha\tau\log\pi_k\), we have

\[\begin{align} (3)\Leftrightarrow \begin{cases} \pi_{k+1}&=\arg\max_\pi\left<\pi,q'_k\right>-\alpha\tau D_{KL}(\pi\Vert\pi_k)+(1-\alpha)\tau\mathcal H(\pi)\\\ q'_{k+1}&=r+\gamma P\big(\left<\pi_{k+1},q'_k\right> -\alpha\tau D_{KL}[\pi_{k+1}\Vert\pi_k] +(1-\alpha)\tau\mathcal H(\pi_{k+1})\big)+\epsilon_{k+1} \end{cases}\tag 4 \end{align}\]

Proof. For the policy update, we have

\[\begin{align} \langle \pi,q_k\rangle+\tau\mathcal H(\pi)=&\langle\pi,q_k-\alpha\tau\log\pi_k\rangle+\alpha\tau\langle\pi, \log\pi_k\rangle+\tau\mathcal H(\pi)\\\ =&\langle\pi,q_k-\alpha\tau\log\pi_k\rangle-\alpha\tau\langle\pi, \log{\pi\over\pi_k}\rangle-\alpha\tau\mathcal H(\pi)+\tau\mathcal H(\pi)\\\ =&\langle\pi,q_k-\alpha\tau\log\pi_k\rangle-\alpha\tau\langle\pi, \log{\pi\over\pi_k}\rangle+(1-\alpha)\tau\mathcal H(\pi)\\\ =&\langle\pi,q'_k\rangle-\alpha\tau D_{KL}(\pi\Vert\pi_k)+(1-\alpha)\tau\mathcal H(\pi) \end{align}\]

Similarly, we write the value update

\[\begin{align} q_{k+1}=&r+\alpha\tau\log\pi_{k+1}+\gamma P\langle\pi_{k+1},q_k-\tau\log\pi_{k+1}\rangle+\epsilon_{k+1}\\\ q_{k+1}-\alpha\tau\log\pi_{k+1}=&r+\gamma P\langle\pi_{k+1},q_k-\alpha\tau \log\pi_{k}+\alpha\tau \log\pi_{k}-\alpha\tau\log\pi_{k+1}-(1-\alpha)\tau\log\pi_{k+1}\rangle+ \epsilon_{k+1}\\\ q'_{k+1}=&r+\gamma P\big(\left<\pi_{k+1},q'_k\right> -\alpha\tau D_{KL}[\pi_{k+1}\Vert\pi_k]+(1-\alpha)\tau\mathcal H(\pi_{k+1})\big)+\epsilon_{k+1} \end{align}\]

Theorem 1 establishes the equivalence between Equation \((3)\) and Equation \((4)\). However, the policy in Equation \((3)\) is entropy-regularized while the one in Equation \((4)\) is KL-regularized, which leads to analytical solutions of different forms, caused by the different definitions of the \(Q\)-function. Computing \(\pi_{k+1}\) from Equation \((3)\) yields a solution proportional to \(\exp{q_{k}\over\tau}\), while doing so from Equation \((4)\) yields a solution proportional to \(\pi_k^\alpha\exp{q'_k\over\tau}\), which explicitly depends on the previous policy \(\pi_k\).
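
A small numeric check that the two greedy steps nonetheless pick the same policy, despite their different parameterizations (all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, tau = 0.9, 0.1
q_k = rng.normal(size=4)
pi_k = rng.dirichlet(np.ones(4))
q_prime = q_k - alpha * tau * np.log(pi_k)   # q'_k as defined in Theorem 1

# Greedy step of Equation (3): proportional to exp(q_k / tau)
pi_eq3 = np.exp(q_k / tau)
pi_eq3 /= pi_eq3.sum()

# Greedy step of Equation (4): proportional to pi_k^alpha * exp(q'_k / tau)
pi_eq4 = pi_k**alpha * np.exp(q_prime / tau)
pi_eq4 /= pi_eq4.sum()

assert np.allclose(pi_eq3, pi_eq4)
```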

Relation to Advantage Learning

From Equation \((3)\), we have

\[\begin{align} &\max_\pi\left<\pi,q_k\right>+\tau\mathcal H(\pi)\\\ =&\max_\pi\left<\pi,\tau\log{\exp (q_k/\tau)\over\pi}\right>\\\ &\qquad\color{red}{Z=\langle1,\exp {q_k\over\tau}\rangle}\\\ =&\max_\pi\left<\pi,\tau\log{ {1\over Z}\exp (q_k/\tau)\over\pi}\right>+\tau\log Z\\\ =&\max_\pi-\tau D_{KL}\left(\pi\Big\Vert {1\over Z}\exp (q_k/\tau)\right)+\tau\log Z\\\ =&\tau\log Z\\\ \pi_{k+1}=&{1\over Z}\exp (q_k/\tau) \end{align}\]

Therefore, we can rewrite the value update as

\[\begin{align} q_{k+1}=&r+\alpha\tau\log\pi_{k+1}+\gamma P\left<\pi_{k+1},q_k-\tau\log\pi_{k+1}\right>+\epsilon_{k+1}\\\ =&r+\alpha(q_{k}-\tau\log Z)+\gamma P(\tau\log Z)+\epsilon_{k+1}\\\ =&r+\alpha(q_{k}-\tau\log\langle1,\exp{q_k\over\tau}\rangle)+\gamma P(\tau\log\langle1, \exp{q_k\over\tau}\rangle)+\epsilon_{k+1} \end{align}\]

Taking the limit \(\tau\rightarrow 0\), we retrieve Advantage Learning (AL):

\[\begin{align} q_{k+1}=&r+\alpha(q_k-\langle\pi_{k+1},q_k\rangle)+\gamma P\langle\pi_{k+1},q_k\rangle+\epsilon_{k+1}\\\ \pi_{k+1}=&\arg\max_a q_k \end{align}\]
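
The limit hinges on the soft maximum \(\tau\log\langle1,\exp(q_k/\tau)\rangle\) converging to \(\max_a q_k\) as \(\tau\rightarrow0\); a short numeric check (with arbitrary \(q\)-values) makes this concrete:

```python
import numpy as np

def soft_maximum(q, tau):
    """tau * log <1, exp(q / tau)>, computed stably; tends to max(q) as tau -> 0."""
    m = q.max()
    return m + tau * np.log(np.exp((q - m) / tau).sum())

q_k = np.array([1.0, 0.5, -0.2])        # q_k(s, .) for one state, arbitrary values
for tau in (1.0, 0.1, 0.01, 1e-4):
    print(tau, soft_maximum(q_k, tau))  # approaches max(q_k) = 1.0
```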

AL adds a small penalty to the reward when the selected action \(a\) is suboptimal, i.e., when \(a\ne \arg\max_a q_k\). This increases the action gap, defined, for a given state, as the difference between the (optimal) value of the optimal action and that of the suboptimal ones. A larger action gap can mitigate the undesirable effects that approximation and estimation errors in \(q\) have on the induced greedy policies. Bellemare et al. 2016 introduced a family of Bellman-like operators that are gap-increasing.

Experiments

Vieillard et al. 2020a provide an extensive empirical analysis on Atari games, which is well worth consulting when designing a single-agent RL algorithm for Atari. We briefly summarize several results on the DQN family:

  1. Adam significantly improves the performance of DQN, which originally uses RMSprop as its optimizer.
  2. Soft-DQN alone does not bring much benefit over DQN.
  3. Advantage Learning (AL) improves the overall performance.
  4. M-DQN performs slightly better than AL and achieves performance comparable to C51.
  5. M-RL seems to impair performance on hard-exploration games such as Montezuma's Revenge.

References

Vieillard, Nino, Olivier Pietquin, and Matthieu Geist. 2020a. “Munchausen Reinforcement Learning.” NeurIPS.

Vieillard, Nino, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. 2020b. “Leverage the Average: An Analysis of KL Regularization in RL.” NeurIPS. http://arxiv.org/abs/2003.14089.

Bellemare, Marc G., Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. 2016. “Increasing the Action Gap: New Operators for Reinforcement Learning.” 30th AAAI Conference on Artificial Intelligence, AAAI 2016, 1476–83.