Extensions to DQN
Ever since DQN was first introduced, many extensions have been proposed to alleviate its limitations. In this post, we briefly introduce the ones used in Rainbow.
Distributional DQN
Distributional DQN learns the value distribution \( Z(s, a) \), whose expectation is the action value \( Q(s,a) \), i.e., \( Q(s,a)=E[Z(s,a)] \). By replacing the action value with the value distribution, it is able to model the stochastic nature of the environment, thereby providing a more stable learning target.
The detailed explanation takes pages to dig into; interested readers may refer to my previous post or read the paper directly. Here I only write down the update step and the loss for later use.
\[\begin{align} \mathcal{\hat T}Z_{\theta^-}(s,a)&=R(s,a)+\gamma Z_{\theta^-}(s',\arg\max_{a'}Q_{\theta^-}(s', a'))\\\ L&=D_{KL}(\Phi\mathcal{\hat T}Z_{\theta^-}(s,a)\Vert Z_\theta(s,a)) \end{align}\]where \( \theta^- \) indicates that the parameters of the target value distribution are held fixed for stable training, and \( \Phi\mathcal{\hat T}Z_{\theta^-}(s,a) \) is the target value distribution projected onto the support of \( Z_\theta(s,a) \), whose \( i \)th component is
\[\begin{align} \left(\Phi \mathcal{\hat T}Z_{\theta^-}(s, a)\right)_i=\sum_{j=0}^{N-1}\left[1-{\left|\left[\mathcal{\hat T}z_j\right]_{V_{\min}}^{V_{\max}}-z_i\right|\over\Delta z} \right]_0^1{p_{\theta^-}}_j(s', \arg\max_{a'}Q_{\theta^{-}}(s', a')) \end{align}\]
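As a concrete illustration, here is a minimal NumPy sketch of how the action value is recovered from a categorical value distribution and how the KL loss reduces to a cross-entropy against the projected target. The support bounds, number of atoms, and function names below are my own example choices, not prescribed by the paper.

```python
import numpy as np

# Fixed support z_0 < ... < z_{N-1} of the value distribution
# (V_min = -10, V_max = 10, N = 51 are just example values).
z = np.linspace(-10.0, 10.0, 51)

def q_from_distribution(probs, z):
    """Q(s, a) = E[Z(s, a)] = sum_i z_i * p_i(s, a); `probs` has shape (n_actions, n_atoms)."""
    return probs @ z

def distributional_loss(projected_target, probs_sa, eps=1e-8):
    """KL(projected target || Z_theta(s, a)). Since the projected target is treated as
    fixed, minimizing the KL is equivalent to minimizing the cross-entropy -sum_i m_i log p_i."""
    return -np.sum(projected_target * np.log(probs_sa + eps))

probs = np.full((4, 51), 1.0 / 51)    # uniform distributions over the support for 4 actions
print(q_from_distribution(probs, z))  # all (close to) zero: the support is symmetric around 0
```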
Multi-Step Learning
We use multi-step bootstrapping instead of one-step bootstrapping via the truncated n-step return:
\[\begin{align} R_t^{(n)}=\sum_{k=0}^{n-1}\gamma_t^k R_{t+k+1} \end{align}\]Therefore we have the target value
\[\begin{align} Y=R_t^{(n)}+\gamma_t^n\max_{a'} Q(s_{t+n}, a') \end{align}\]Multi-step bootstrapping can actually be problematic because DQN is an off-policy algorithm, and n-step returns are not strictly justified off-policy without auxiliary correction techniques such as importance sampling. In practice, however, this works well for small \( n \): learning is typically faster thanks to faster propagation of reward signals, especially early on when the \( Q \)-values are still inaccurate.
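For illustration, here is a minimal sketch of computing the truncated n-step return; the function name and arguments are my own, and a real replay buffer would track more bookkeeping than this.

```python
import numpy as np

def n_step_return(rewards, gamma, n):
    """Truncated n-step return R_t^{(n)} = sum_{k=0}^{n-1} gamma^k * r_{t+k+1}.

    `rewards` holds r_{t+1}, ..., r_{t+n} for one transition window; a real
    replay buffer would also store s_{t+n} (and a done flag) for the bootstrap term.
    """
    rewards = np.asarray(rewards[:n], dtype=np.float64)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# Example: gamma = 0.99, rewards [1, 0, 2] -> 1 + 0.99 * 0 + 0.99**2 * 2 = 2.9602
print(n_step_return([1.0, 0.0, 2.0], gamma=0.99, n=3))
```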
Double Q-Learning
Conventional DQN is affected by an overestimation bias introduced by the maximization step in the target and loss
\[\begin{align} Y&=R(s,a)+\gamma \max_{a'} Q_{\theta^{-}}\left(s',a'\right)\\\ L&=\left(Y-Q_\theta(s,a)\right)^2 \end{align}\]Such an overestimation bias arises from two facts: 1) the \( Q \)-function is noisy due to environmental noise, function approximation, non-stationarity, etc., and 2) the expected value of the maximum of a set of random variables is greater than or equal to the maximum of their expected values (i.e., \( \mathbb E[\max_aQ(s, a)]\ge\max_a\mathbb E[Q(s, a)] \), which follows from Jensen's inequality and the convexity of \( \max \); note that the expectation here averages over the noise in the \( Q \)-function). This may not always be bad: if all values were uniformly higher, the relative action preferences would be preserved and we would not expect the resulting policy to be any worse. Furthermore, it is sometimes good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique. If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. (Hado van Hasselt et al. [3])
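The overestimation effect is easy to reproduce numerically. The toy sketch below is my own illustration, not from the paper: it sets every true action value to zero and adds independent noise to the estimates, and the maximum over the noisy estimates is positive on average even though the maximum over the true values is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# True Q-values are all zero, so max_a E[Q(s, a)] = 0.
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

# E[max_a Q_hat(s, a)] is clearly positive (about 1.54 for 10 standard normals).
print(noisy_q.max(axis=1).mean())
```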
Double Q-Learning addresses this overestimation by introducing two \( Q \)-functions: one for action selection and the other for value evaluation, so as to decouple these two operations. When incorporating double \( Q \)-learning into DQN, we generally use the online network for action selection and the target network for value evaluation. Therefore the target and loss for double DQN become
\[\begin{align} Y&=R(s,a)+\gamma Q_{\theta^{-}}\left(s',\arg\max_{a'}Q_{\theta}(s',a')\right)\\\ L&=\left(Y-Q_{\theta}(s,a) \right)^2 \end{align}\]
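A minimal NumPy sketch of this double DQN target might look as follows; the function and argument names are my own, and the terminal-state mask `dones` is a standard practical detail rather than part of the formula above.

```python
import numpy as np

def double_dqn_target(rewards, dones, gamma, q_online_next, q_target_next):
    """Y = r + gamma * Q_target(s', argmax_a' Q_online(s', a')).

    q_online_next, q_target_next: (batch, n_actions) Q-values at s' from the
    online and target networks respectively.
    """
    a_star = q_online_next.argmax(axis=1)                    # select with the online network
    q_eval = q_target_next[np.arange(len(a_star)), a_star]   # evaluate with the target network
    return rewards + gamma * (1.0 - dones) * q_eval
```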
Prioritized Replay
DQN samples uniformly from the replay buffer. Ideally, we would like to sample more frequently those transitions from which there is more to learn. Prioritized replay samples transitions with probability \( p_t \) relative to the last encountered absolute TD error:
\[\begin{align} p_t\propto \left|R(s,a)+\gamma \max_{a'}Q_{\theta^{-}}(s', a')-Q_\theta(s,a)\right|^{\alpha} \end{align}\]where \( \alpha \) is a hyper-parameter that determines the shape of the distribution. New transitions are inserted into the replay buffer with the maximum priority, providing a bias towards recent transitions. For more details, please refer to my previous post or read the paper.
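Here is a minimal sketch of proportional prioritized sampling under these priorities. It is O(N) for clarity; practical implementations use a sum-tree, and the small constant added to the priorities and the default value of \( \alpha \) are my own choices rather than part of the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prioritized(td_errors, batch_size, alpha=0.6):
    """Sample transition indices with probability proportional to |TD error|^alpha.

    This O(N) version only illustrates the sampling distribution; real
    implementations use a sum-tree so sampling and priority updates are O(log N).
    """
    priorities = (np.abs(td_errors) + 1e-6) ** alpha  # small constant keeps zero-error transitions sampleable
    probs = priorities / priorities.sum()
    return rng.choice(len(td_errors), size=batch_size, p=probs)

# Transitions with larger TD errors are drawn more often.
idx = sample_prioritized(np.array([0.1, 2.0, 0.5, 3.0]), batch_size=32)
print(np.bincount(idx, minlength=4))
```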
Dueling Networks
The dueling network is a network architecture designed for value-based RL. It redefines the \( Q \) network as
\[\begin{align} Q_{\theta}(s,a)=V_\eta(f_{\xi}(s))+A_\psi(f_{\xi}(s), a)-{\sum_{a'}A_\psi(f_{\xi}(s),a')\over |\mathcal A|} \end{align}\]The reason why we introduce the last average term is that it is not sufficient, with only the first two terms, to recover \( V \) and \( A \) uniquely for a given \( Q \).
The meaning of each term is easier to see if we replace the last average term with a max, i.e., \( \max_{a'}A_\psi(f_{\xi}(s),a') \). If we did so, then \( V(s) \) would be the optimal state-value function (i.e., \( V(s) = \max_{a}Q(s,a) \), which becomes clearer when considered together with the max term we substituted) and \( A(s,a) \) would be the advantage function. The reason we use the average term instead is that it improves the stability of the optimization: with the average term, the advantages only need to change as fast as the mean, instead of having to compensate for any change to the optimal action's advantage.
In this way, the dueling network learns value and advantage separately, so it can learn which states are (or are not) valuable without having to learn the effect of each action for each state. This is particularly useful in states where the actions do not affect the environment in any relevant way. For example, when an environment contains some redundant actions, the dueling network shows faster convergence.
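Below is a minimal PyTorch sketch of the mean-subtracted aggregation; the framework choice, module name, hidden size, and dimensions are my own, not prescribed by the paper.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(f(s)) + A(f(s), a) - mean_a' A(f(s), a')."""

    def __init__(self, feature_dim, n_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                    # (batch, 1)
        a = self.advantage(features)                # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # (batch, n_actions)

q = DuelingHead(feature_dim=64, n_actions=4)(torch.randn(2, 64))
print(q.shape)  # torch.Size([2, 4])
```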
Noisy Networks
Noisy Nets introduces extra noisy layers alongside the ordinary linear layers to add adaptive noise to the parameters, thereby providing some exploration. The basic idea is to add noise to the parameters:
\[\begin{align} y:=(\mu^w+\Sigma^w\cdot \epsilon^w)x + (\mu^b+\Sigma^b\cdot \epsilon^b)\tag {1} \end{align}\]This noise-adding process is almost the same as the reparameterization trick used in VAE, where \( \mu^{w}, \Sigma^{w}, \mu^b, \Sigma^b \) are parameters and the noise variables \( \epsilon^w, \epsilon^b \) are sampled from normal distributions.
Refactoring this equation we end up with two dense layers
\[\begin{align} y=(Wx + b)+((W_{\mathrm{noisy}}\cdot \epsilon^w)x+b_{\mathrm{noisy}}\cdot \epsilon^b)\tag {2} \end{align}\]where \( W, b, W_{\mathrm{noisy}}, b_{\mathrm{noisy}} \) are the parameters of the dense and noisy layers, and \( \epsilon^w, \epsilon^b \) are the noise variables. We intentionally use different symbols for the parameters in \( (1) \) and \( (2) \) to better express the roles they play in these equations; hopefully this does not introduce any confusion.
As a side note, the authors suggest using factorised Gaussian noise to gain some computational efficiency. That is, instead of sampling every element of \( \epsilon^w \) and \( \epsilon^b \) independently, we only sample \( p+q \) unit Gaussian variables, where \( p \) is the input dimension and \( q \) the output dimension. Then we define the noise random variables
\[\begin{align} \epsilon^w&=f(\epsilon_p)^Tf(\epsilon_q)\\\ \epsilon^b&=f(\epsilon_q) \end{align}\]where \( f(x)=\mathrm{sgn}(x)\sqrt{\vert x\vert } \).
The last thing worth remembering is that, when used with on-policy algorithms, the noise of the whole network should be held fixed for the duration of the rollout so that the policy produced by the network is fixed. In my humble opinion, this should also be done for off-policy algorithms to produce deep exploration.
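Below is a minimal PyTorch sketch of such a noisy linear layer with factorized Gaussian noise. The class name, the initialization constants, and the choice to store the sampled noise as plain attributes are my own and roughly follow common practice rather than any specific implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with factorized Gaussian noise:
    y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        bound = 1.0 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0 * bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0 * bound))
        self.reset_noise()

    @staticmethod
    def _f(x):
        # f(x) = sgn(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorized noise: p + q unit Gaussians instead of p*q + q independent ones.
        eps_in = self._f(torch.randn(self.in_features))
        eps_out = self._f(torch.randn(self.out_features))
        self.eps_w = torch.outer(eps_out, eps_in)  # (out_features, in_features)
        self.eps_b = eps_out                       # (out_features,)

    def forward(self, x):
        weight = self.mu_w + self.sigma_w * self.eps_w
        bias = self.mu_b + self.sigma_b * self.eps_b
        return F.linear(x, weight, bias)

layer = NoisyLinear(64, 4)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 4])
layer.reset_noise()                     # resample; hold fixed over a rollout as noted above
```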
Rainbow
Rainbow combines all six of the aforementioned improvements to DQN as follows:
- Incorporate multi-step learning into Distributional DQN: replace the distributional loss with a multi-step variant. The target distribution and its projection onto the support of \( Z_\theta \) become
\[\begin{align} \mathcal{\hat T}Z_{\theta^-}(s_t,a_t)&=R_t^{(n)}+\gamma_t^n Z_{\theta^-}(s_{t+n},\arg\max_{a'}Q_{\theta^-}(s_{t+n}, a'))\\\ \left(\Phi \mathcal{\hat T}Z_{\theta^-}(s_t, a_t)\right)_i&=\sum_{j=0}^{N-1}\left[1-{\left|\left[\mathcal{\hat T}z_j\right]_{V_{\min}}^{V_{\max}}-z_i\right|\over\Delta z} \right]_0^1{p_{\theta^-}}_j(s_{t+n}, \arg\max_{a'}Q_{\theta^{-}}(s_{t+n}, a')) \end{align}\]
And the loss now is
\[\begin{align} L=D_{KL}\left(\Phi\mathcal{\hat T}Z_{\theta^-}(s_t,a_t)\Vert Z_\theta(s_t,a_t)\right) \end{align}\]
- Combine with double Q-learning: select the greedy action according to the online network and evaluate it using the target network. Now we have the target distribution and its projection (a NumPy sketch of this projection is given after this list):
\[\begin{align} \mathcal{\hat T}Z_{\theta^-}(s_t,a_t)&=R_t^{(n)}+\gamma_t^n Z_{\theta^-}(s_{t+n},\arg\max_{a'}Q_{\theta}(s_{t+n}, a'))\\\ \left(\Phi \mathcal{\hat T}Z_{\theta^-}(s_t, a_t)\right)_i&=\sum_{j=0}^{N-1}\left[1-{\left|\left[\mathcal{\hat T}z_j\right]_{V_{\min}}^{V_{\max}}-z_i\right|\over\Delta z} \right]_0^1{p_{\theta^-}}_j(s_{t+n}, \arg\max_{a'}Q_{\theta}(s_{t+n}, a')) \end{align}\]
- Introduce prioritized replay: prioritize transitions by the KL loss
\[\begin{align} p_t\propto \left(D_{KL}\left(\Phi\mathcal{\hat T}Z_{\theta^-}(s_t,a_t)\Vert Z_\theta(s_t,a_t)\right)\right)^{\alpha} \end{align}\]
- Adapt the dueling network architecture to value distributions: the shared representation \( f_\xi(s) \) is fed into the value stream \( V_\eta \) with \( N_{\mathrm{atoms}} \) outputs, and into the advantage stream \( A_\psi \) with \( N_{\mathrm{atoms}}\times N_{\mathrm{actions}} \) outputs. For each atom \( z_i \), the value and advantage streams are aggregated and then passed through a softmax layer to obtain the normalized parametric distributions used to estimate the value distribution. Now the probability of \( z_i \) becomes
\[\begin{align} p^i_\theta(s,a)={\exp\left(V^i_\eta(\phi)+A^i_\psi(\phi,a)-\bar A^i_\psi(\phi)\right)\over\sum_j\exp\left(V^j_\eta(\phi)+A^j_\psi(\phi,a)-\bar A^j_\psi(\phi)\right)} \end{align}\]
where
\[\begin{align} \bar A^i_\psi(\phi)={1\over N_{\mathrm{actions}}}\sum_{a'}A^i_\psi(\phi,a'),\qquad \phi=f_\xi(s) \end{align}\]
- Replace linear layers with their noisy equivalents: within these noisy linear layers, we use factorized Gaussian noise to reduce the number of independent noise variables (thereby reducing the compute time). In factorized Gaussian noise, we use \( p \) unit Gaussian variables \( \epsilon_{in} \) for the noise of the inputs and \( q \) unit Gaussian variables \( \epsilon_{out} \) for the noise of the outputs (thus \( p + q \) Gaussian variables in total). Then \( \epsilon^w \) and \( \epsilon^b \) can be written as
\[\begin{align} \epsilon^w&=f(\epsilon_{in})^Tf(\epsilon_{out})\\\ \epsilon^b&=f(\epsilon_{out}) \end{align}\]
where \( f(x)=\mathrm{sgn}(x)\sqrt{\vert {x}\vert } \) (Meire Fortunato et al. [6])
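To make the first three items concrete, here is a NumPy sketch of computing the projected multi-step target distribution with double-Q action selection; it yields the \( \Phi\mathcal{\hat T}Z_{\theta^-} \) term that enters both the KL loss and the priorities above. The function name, argument shapes, and the corner-case handling for atoms that land exactly on a support point are my own choices.

```python
import numpy as np

def rainbow_target_distribution(next_probs_online, next_probs_target,
                                rewards_n, dones, gamma_n, z):
    """Projected n-step target distribution with double-Q action selection.

    next_probs_online, next_probs_target: (batch, n_actions, n_atoms) atom
        probabilities at s_{t+n} under the online and target networks.
    rewards_n: (batch,) truncated n-step returns R_t^{(n)}.
    dones: (batch,) 1.0 if the episode terminated inside the n-step window.
    gamma_n: scalar gamma ** n.
    z: (n_atoms,) fixed support from V_min to V_max.
    """
    batch, n_actions, n_atoms = next_probs_target.shape
    v_min, v_max = z[0], z[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Double Q: select the greedy action with the online network ...
    q_online = (next_probs_online * z).sum(axis=-1)       # (batch, n_actions)
    a_star = q_online.argmax(axis=-1)                     # (batch,)
    # ... and evaluate it with the target network's distribution.
    p_next = next_probs_target[np.arange(batch), a_star]  # (batch, n_atoms)

    # Bellman-updated atom locations, clipped to [V_min, V_max].
    tz = rewards_n[:, None] + gamma_n * (1.0 - dones[:, None]) * z[None, :]
    tz = np.clip(tz, v_min, v_max)
    b = (tz - v_min) / delta_z                            # fractional atom indices
    l, u = np.floor(b).astype(np.int64), np.ceil(b).astype(np.int64)
    # If b lands exactly on a support point (l == u), shift one index so no mass is lost.
    l[(u > 0) & (l == u)] -= 1
    u[(l < n_atoms - 1) & (l == u)] += 1

    # Distribute each atom's probability mass to its two neighbouring support points.
    proj = np.zeros_like(p_next)
    rows = np.arange(batch)[:, None]
    np.add.at(proj, (rows, l), p_next * (u - b))
    np.add.at(proj, (rows, u), p_next * (b - l))
    return proj  # compare against Z_theta(s_t, a_t) via the KL loss; also gives the priority
```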
Summary
Finally, we briefly summarize each component of Rainbow and its effect:
- Multi-step learning: Speeds up the propagation of reward signals and reduces the bias introduced by bootstrapping
- Distributional DQN: Models the stochasticity of the environment
- Double DQN: Alleviates the overestimation bias introduced by the maximization step
- Prioritized replay: Improves data efficiency
- Dueling networks: Learn value and advantage separately, so the network can learn which states are (or are not) valuable without having to learn the effect of each action for each state
- Noisy networks: Add adaptive noise to dense layers to provide exploration
Supplementary Materials
Proof That \( \max \) is Convex
\[\begin{align} \max(\theta x + (1-\theta)y) &\le \max(\theta x) +\max ((1-\theta)y)\\\ &=\theta \max(x) + (1-\theta)\max(y) \end{align}\]where the inequality holds because every component of \( \theta x + (1-\theta)y \) is bounded above by \( \max(\theta x) +\max((1-\theta)y) \), and the equality uses \( \theta\in[0,1] \).
References
[1] Matteo Hessel et al. Rainbow: Combining Improvements in Deep Reinforcement Learning
[2] Marc G. Bellemare et al. A Distributional Perspective on Reinforcement Learning
[3] Hado van Hasselt et al. Deep Reinforcement Learning with Double Q-Learning
[4] Tom Schaul et al. Prioritized Experience Replay
[5] Ziyu Wang et al. Dueling Network Architectures for Deep Reinforcement Learning
[6] Meire Fortunato et al. Noisy Networks For Exploration