by Sherwin Chen

Introduction

One of the challenges in reinforcement learning is dealing with rewards of different scales and frequencies, which can result in tedious hyperparameter finetuning for different tasks. The original DQN clips rewards to \([-1, 1]\) in order to play the whole suite of Atari games with a single setup. Although clipping rewards facilitates learning across different tasks, for some games it fundamentally changes the objective, resulting in qualitatively different policies of behavior. For example, most games have sparse non-zero rewards outside of \([−1, +1]\). Clipping then results in optimizing the frequency of rewards, rather than their sum.

We discuss PopArt (Preserving Outputs Precisely, while Adaptively Rescaling Targets), proposed by van Hasselt et al. (2016), which adaptively normalizes the target values used in value-based reinforcement learning. Such normalization brings several benefits:

  1. Normalizing the target values makes the algorithm more generalizable across different tasks, freeing us from tedious hyperparameter finetuning.
  2. For multi-variate functions the normalization can be used to disentangle the natural magnitude of each component from its relative importance in the loss function. This is particularly useful when the components have different units, such as when we predict signals from sensors with different modalities.
  3. Adaptive normalization can help deal with non-stationarity. For instance, in reinforcement learning the policy of behavior can change repeatedly during learning, thereby changing the distribution and magnitude of the values.

PopArt

Normalization

In PopArt, we normalize both the prediction \(h_{\pmb \theta}(X)\) and the target \(Y\) to keep them on a consistent scale. We first normalize the target using the following affine transformation

\[\begin{align} \tilde{Y}=\pmb\Sigma^{-1}(Y-\pmb\mu)\tag{1} \end{align}\]

where \(\pmb\Sigma\) and \(\pmb\mu\) are scale and shift that are computed from data. We then use this normalized target to learn a normalized function \(g\) of form

\[\begin{align} g_{\pmb \theta,\pmb W,\pmb b}(X)=\pmb Wh_{\pmb\theta}(X)+\pmb b\tag {2} \end{align}\]

where \(h_{\pmb\theta}\) can be an arbitrary neural network and \((\pmb W,\pmb b)\) is a dense layer at the end. We can write its unnormalized counterpart as

\[\begin{align} f_{\pmb\theta,\pmb\Sigma,\pmb \mu,\pmb W,\pmb b}(X)=\pmb \Sigma g_{\pmb \theta,\pmb W,\pmb b}(X)+\pmb \mu=\pmb\Sigma(\pmb Wh_\theta(X)+\pmb b)+\pmb\mu\tag{3} \end{align}\]
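As a concrete, purely illustrative sketch, the snippet below spells out Equations \((1)\)–\((3)\) in NumPy for the scalar-target case; the network body \(h_{\pmb\theta}\) is stood in by a fixed random feature map, and all names are hypothetical rather than taken from any reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the network body h_theta(X): a fixed random feature map.
theta = rng.normal(size=(3, 4))

def h(X):
    return np.tanh(X @ theta)

# Final dense layer (W, b) and normalization statistics (sigma, mu), scalar case.
W, b = rng.normal(size=(4, 1)), np.zeros(1)
sigma, mu = 1.0, 0.0

def normalize_target(Y):
    """Equation (1): map targets onto the normalized scale."""
    return (Y - mu) / sigma

def g(X):
    """Equation (2): normalized prediction."""
    return h(X) @ W + b

def f(X):
    """Equation (3): unnormalized prediction."""
    return sigma * g(X) + mu
```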

Preserving Outputs Precisely

In Equation \((3)\), \(f\) depends on \(\pmb\Sigma\) and \(\pmb\mu\), which means the output of the unnormalized function may change when the scale and shift change, even for the same input \(X\) and parameters. Such inconsistency, introduced purely by a change of the normalization statistics, is not a desirable behavior. To prevent such unwanted changes, we adjust the parameters \(\pmb W\) and \(\pmb b\) accordingly as follows

\[\begin{align} \pmb W_{new}=\pmb\Sigma_{new}^{-1}\pmb\Sigma\pmb W\quad\text{and}\quad \pmb b_{new}=\pmb\Sigma^{-1}_{new}(\pmb\Sigma\pmb b+\pmb\mu-\pmb\mu_{new})\tag {4} \end{align}\]

Here \(\pmb\Sigma_{new}\) and \(\pmb\mu_{new}\) are the new scale and shift. Then we will have

\[\begin{align} f_{\pmb\theta,\pmb\Sigma,\pmb \mu,\pmb W,\pmb b}(X)=f_{\pmb\theta,\pmb\Sigma_{new},\pmb \mu_{new},\pmb W_{new},\pmb b_{new}}(X)\tag{5} \end{align}\]

Now we can use these new parameters to update the normalized function \(g\) towards the normalized target \(\tilde Y\).
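Below is a minimal sketch of the output-preserving rescaling in Equations \((4)\)–\((5)\), again for the scalar case; `W`, `b`, `sigma`, `mu` play the roles of \(\pmb W,\pmb b,\pmb\Sigma,\pmb\mu\), and the final assert only checks the identity in Equation \((5)\) numerically.

```python
import numpy as np

def preserve_outputs(W, b, sigma, mu, sigma_new, mu_new):
    """Equation (4): rescale the last layer so that Equation (5) holds."""
    W_new = (sigma / sigma_new) * W
    b_new = (sigma * b + mu - mu_new) / sigma_new
    return W_new, b_new

# Numerical check of Equation (5): the unnormalized output is unchanged.
rng = np.random.default_rng(1)
h_x = rng.normal(size=4)                      # h_theta(X) for some input X
W, b, sigma, mu = rng.normal(size=4), 0.3, 2.0, -1.0
sigma_new, mu_new = 5.0, 4.0
W_new, b_new = preserve_outputs(W, b, sigma, mu, sigma_new, mu_new)

f_old = sigma * (h_x @ W + b) + mu
f_new = sigma_new * (h_x @ W_new + b_new) + mu_new
assert np.isclose(f_old, f_new)
```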

Adaptively Rescaling Targets

The scale and shift for targets \(\{Y_i\}_{i=1}^t\) up to some time \(t\) can be computed by

\[\begin{align} \mu_t&={1\over t}\sum_{i=1}^tY_i\\\ \sigma_t^2&={1\over t}\sum_{i=1}^tY^2_i-\mu_t^2\tag {6} \end{align}\]

Here we follow the original paper and discuss the scalar case only; the extension to the multivariate case is straightforward. We can compute these statistics incrementally as follows

\[\begin{align} \mu_t&=(1-\beta_t)\mu_{t-1}+\beta_tY_t\\\ \sigma_t^2&=\nu_t-\mu_t^2,\quad\text{where}\ \nu_t=(1-\beta_t)\nu_{t-1}+\beta_tY_t^2\tag {7} \end{align}\]

Here \(\nu_t\) estimates the second moment of the targets and \(\beta_t\in[0,1]\) is a step size. Equation \((7)\) is equivalent to Equation \((6)\) when \(\beta_t={1\over t}\). If \(\beta_t=\beta\) is constant we get exponential moving averages, placing more weight on recent data points which is appropriate in non-stationary settings.
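A small sketch of the incremental statistics in Equation \((7)\); `beta` is taken to be constant here, and the epsilon floor on the variance is a practical guard of my own, not something discussed in the text.

```python
import numpy as np

class RunningStats:
    """First and second moments of the targets, updated as in Equation (7)."""

    def __init__(self, beta=1e-3, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.mu, self.nu = 0.0, 1.0   # mean and second moment

    def update(self, y):
        self.mu = (1 - self.beta) * self.mu + self.beta * y
        self.nu = (1 - self.beta) * self.nu + self.beta * y * y

    @property
    def sigma(self):
        # sigma_t = sqrt(nu_t - mu_t^2), floored for numerical safety.
        return np.sqrt(max(self.nu - self.mu ** 2, self.eps))
```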

A constant \(\beta\) has the additional benefit of never becoming negligibly small. Suppose that, at some time step \(t\gg1\), we first observe a target \(Y_t\) that is much larger than all previously observed targets. If \(\beta_t={1\over t}\), then the statistics would adapt only slightly, the normalized target could still be large, and the resulting gradient may be large enough to harm learning. A constant \(\beta\) ensures the normalization can adapt to the large target before updating, potentially making learning more robust. The following proposition shows that the normalized target is bounded in terms of \(\beta\)

Proposition 2. When using update \((7)\) to adapt the normalization parameters \(\sigma\) and \(\mu\), the normalized targets are bounded for all \(t\) by

\[\begin{align} -\sqrt{1-\beta_t\over\beta_t}\le{(Y_t-\mu_t)\over\sigma_t}\le\sqrt{1-\beta_t\over\beta_t} \end{align}\]

The proof is provided in the supplementary materials. Note that Proposition 2 does not rely on any assumptions about the distribution of the targets. This is an important result, because it implies we can bound the potential normalized errors before learning, without any prior knowledge about the actual targets we may observe.
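As a quick, informal check of Proposition 2, the snippet below feeds a stream of targets spanning many orders of magnitude through the update in Equation \((7)\) and verifies that the normalized targets never exceed \(\sqrt{(1-\beta)/\beta}\); it is only a sanity check under arbitrary toy settings, not a proof.

```python
import numpy as np

beta = 0.01
bound = np.sqrt((1 - beta) / beta)

mu, nu = 0.0, 1.0
rng = np.random.default_rng(2)
for t in range(10_000):
    # Targets spanning many orders of magnitude.
    y = rng.normal() * 10.0 ** rng.integers(0, 6)
    mu = (1 - beta) * mu + beta * y
    nu = (1 - beta) * nu + beta * y * y
    sigma = np.sqrt(max(nu - mu * mu, 1e-12))
    assert abs(y - mu) <= bound * sigma + 1e-6   # the bound of Proposition 2
```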

We now present the full PopArt algorithm

\[\begin{align} &\mathbf{PopArt}:\\\ 1.&\quad \mathbf {For}\ t=1...:\\\ 2.&\quad\quad\text{Use }Y_t\text{ to update the statistics }\pmb\Sigma, \pmb\mu\text{ using Equation (7)}\\\ 3.&\quad\quad\text{Update }\pmb W\text{ and }\pmb b\text{ using Equation }(4)\\\ 4.&\quad\quad\text{Compute the normalized target }\tilde Y_t \text{ using Equation }(1)\\\ 5.&\quad\quad\text{Compute the normalized output }g(X_t)\text{ using Equation }(2)\\\ 6.&\quad\quad\text{Perform an optimization step (e.g., SGD) to update }g(X_t)\text{ toward } \tilde Y_t \end{align}\]

Notice that \(\pmb W\) and \(\pmb b\) are updated twice: first to adapt to the new scale and shift to preserve the outputs of the function, and then by SGD.
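Putting the pieces together, here is a minimal NumPy sketch of one PopArt update step for scalar regression; the network body is a single tanh layer for brevity, the learning rate and \(\beta\) are placeholder values, and this is a sketch of the algorithm above rather than the authors' reference code.

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny scalar-regression model: h(x) = tanh(theta x), g = w.h + b, f = sigma*g + mu.
theta = rng.normal(size=(8, 4)) * 0.1
w, b = np.zeros(8), 0.0
mu, nu, sigma = 0.0, 1.0, 1.0
beta, lr = 1e-3, 1e-2

def popart_step(x, y):
    global theta, w, b, mu, nu, sigma
    # Steps 1-2: update the statistics with the new target (Equation 7).
    mu_new = (1 - beta) * mu + beta * y
    nu = (1 - beta) * nu + beta * y * y
    sigma_new = np.sqrt(max(nu - mu_new ** 2, 1e-8))
    # Step 3: preserve the outputs (Equation 4).
    w = (sigma / sigma_new) * w
    b = (sigma * b + mu - mu_new) / sigma_new
    mu, sigma = mu_new, sigma_new
    # Steps 4-5: normalized target and normalized prediction (Equations 1 and 2).
    y_tilde = (y - mu) / sigma
    h = np.tanh(theta @ x)
    g = w @ h + b
    # Step 6: one SGD step on the normalized squared error.
    delta = g - y_tilde
    grad_w, grad_b = delta * h, delta
    grad_theta = delta * np.outer(w * (1 - h ** 2), x)
    w, b, theta = w - lr * grad_w, b - lr * grad_b, theta - lr * grad_theta
    return sigma * g + mu   # unnormalized prediction f(x), Equation (3)

# Targets far outside [-1, 1]; no reward clipping or per-task tuning needed.
for _ in range(1000):
    x = rng.normal(size=4)
    popart_step(x, 1000.0 * np.sin(x).sum())
```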

An Equivalence for Stochastic Gradient Descent

This completes the discussion of PopArt. In this section, we analyze the effect of the magnitude of the errors on the gradients when using regular SGD, following the authors' analysis.

Consider SGD updates for an unnormalized function

\[\begin{align} f_{\pmb\theta,\pmb W,\pmb b}(X)=\pmb Wh_{\pmb\theta}(X)+\pmb b \end{align}\]

The update for the weight matrix \(\pmb W\) is

\[\begin{align} \pmb W_t=\pmb W_{t-1}-\alpha_t\pmb\delta_th_{\pmb\theta_t}(X_t)^\top \end{align}\]

where \(\pmb \delta_t=f_{\pmb\theta,\pmb W,\pmb b}(X_t)-Y_t\) is the gradient of the squared error, which we call the unnormalized error. The magnitude of this update depends linearly on the magnitude of the error, which implies that the ideal scale of the weights \(\pmb W\) depends linearly on the magnitude of the targets.

Now consider the parameters \(\pmb\theta^{-1}\) of the last layer in \(h_{\pmb \theta}\), updated as \(\pmb\theta^{-1}_t=\pmb\theta^{-1}_{t-1}-\alpha_t\pmb J_t\pmb W_{t-1}^\top\pmb\delta_t\), where \(\pmb J_t\) is the Jacobian of that layer. Because the magnitudes of both the weights \(\pmb W\) and the errors \(\pmb\delta\) depend linearly on the magnitude of the targets, the magnitude of this update depends quadratically on the magnitude of the targets. This effect propagates to shallower layers as well, yet there is no clear reason for the updates to depend on the target magnitude at all.

To prevent such unexpected growth, van Hasselt et al. propose to track the magnitude of the targets in a separate parameter \(\pmb\Sigma\), and to multiply the updates for all lower layers by a factor \(\pmb\Sigma^{-2}\). This is equivalent to moving the normalization in PopArt from the targets to the parameters \(\pmb\theta\).
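In code, this alternative amounts to a small change to plain SGD: keep the loss on the unnormalized scale, but divide the gradients of every layer below the last by \(\sigma^{2}\). A minimal sketch under the same toy-model assumptions as above (all names are again hypothetical):

```python
import numpy as np

def scaled_sgd_step(x, y, theta, w, b, sigma, lr=1e-2):
    """SGD on the unnormalized error, with lower-layer updates scaled by sigma**-2."""
    h = np.tanh(theta @ x)                      # network body h_theta(x)
    delta = (w @ h + b) - y                     # unnormalized error
    grad_w, grad_b = delta * h, delta
    grad_theta = delta * np.outer(w * (1 - h ** 2), x)
    # Only the layers below the last one are rescaled by sigma**-2.
    return (theta - lr * grad_theta / sigma ** 2,
            w - lr * grad_w,
            b - lr * grad_b)
```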

References

Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems 29 (NIPS 2016).

Supplementary Materials

Proof of Proposition 2

In the following proof, we omit the subscript \(t\) and abbreviate the subscript \(t-1\) as \(-1\).

\[\begin{align} \left(Y-\mu\over\sigma\right)^2&=\left(Y-(1-\beta)\mu_{-1}-\beta Y\over\sigma\right)^2\\\ &={\left(1-\beta\right)^2(Y-\mu_{-1})^2\over\nu-\mu^2}\\\ &={(1-\beta)^2(Y-\mu_{-1})^2\over(1-\beta)\nu_{-1}+\beta Y^2-\left((1-\beta)\mu_{-1}+\beta Y\right)^2}\\\ &={(1-\beta)^2(Y-\mu_{-1})^2\over(1-\beta)\left(\nu_{-1}+\beta Y^2-(1-\beta)\mu_{-1}^2-2\beta\mu_{-1}Y\right)}\\\ &={(1-\beta)(Y-\mu_{-1})^2\over\nu_{-1}+\beta Y^2-(1-\beta)\mu_{-1}^2-2\beta\mu_{-1}Y}\\\ &\le{(1-\beta)(Y-\mu_{-1})^2\over\mu_{-1}^2+\beta Y^2-(1-\beta)\mu_{-1}^2-2\beta\mu_{-1}Y}\\\ &={(1-\beta)(Y-\mu_{-1})^2\over\beta Y^2+\beta\mu_{-1}^2-2\beta\mu_{-1}Y}\\\ &={1-\beta\over\beta} \end{align}\]

where the inequality follows from the fact that \(\nu_{-1}\ge\mu_{-1}^2\).