by Sherwin Chen
4 min read

Introduction

We briefly summarize several recent papers that focus on generalization in reinforcement learning.

IBAC-SNI

Igl et al. 2019 propose two methods to improve generalization in reinforcement learning. We briefly discuss each of them in the following subsections.

Selective Noise Injection (SNI)

Igl et al. first argue that the noise injected by regularization methods such as dropout and batch normalization can deteriorate the agent’s performance and thereby distort the data distribution it collects. Furthermore, such stochasticity can destabilize training through the target critic or importance sampling. Therefore, Igl et al. propose to remove this noise at rollout time and from the training targets. For example, when batch normalization is involved, the moving-average statistics are used both when computing actions during rollouts and when computing the targets during training. In addition, they propose a mixture of policy gradients

\[\begin{align} \mathcal G^{SNI}(\pi^r,\pi,V)=\lambda \mathcal G(\bar\pi^r,\bar\pi,\bar V)+(1-\lambda)\mathcal G(\bar\pi^r,\pi,\bar V) \end{align}\]

where \(\pi^r\) is the rollout policy (they use actor-critic algorithms in their experiments), \(\pi\) is the policy network being updated, \(V\) is the value network, and \(\mathcal G\) denotes the gradient function. A bar over a symbol indicates that the noise is suppressed when computing that quantity. Therefore, the first term is the gradient computed entirely without injected noise, while the second injects noise only into the policy network \(\pi\) being trained.

\(\mathcal G^{SNI}\) consists of two terms interpolated by \(\lambda\in[0,1]\). The first computes the gradients w.r.t. the denoised network \(\bar \pi\), which reduces the variance. This term is especially important early in training, when the network has not yet learned to compensate for the injected noise. Experiments also justify the effectiveness of this interpolation: they find \(\lambda=0.5\) outperforms \(\lambda=1\) and \(\lambda=0\) in most cases.
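
Below is a minimal sketch of how the mixture gradient could be computed, assuming a Keras-style policy whose injected noise (e.g., dropout) is controlled by the `training` flag and which returns a `tfp` action distribution; `obs`, `act`, and `adv` are placeholders for a rollout batch and are not from the paper.

    import tensorflow as tf

    def sni_policy_gradients(policy, obs, act, adv, lam=0.5):
        # training=False suppresses dropout / uses moving BatchNorm statistics,
        # giving the denoised policy pi_bar; training=True injects the noise.
        with tf.GradientTape() as tape:
            logp_bar = policy(obs, training=False).log_prob(act)  # denoised pi_bar
            logp = policy(obs, training=True).log_prob(act)       # noisy pi
            # minimize the negative of the interpolated policy-gradient objective
            loss = -tf.reduce_mean((lam * logp_bar + (1 - lam) * logp) * adv)
        return tape.gradient(loss, policy.trainable_variables)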

Information Bottleneck Actor Critic (IBAC)

IBAC applies an information bottleneck to the AC network, which minimizes \(\mathcal I(o;z)\) and maximizes \(\mathcal I(z;a)\), where \(z=f_\theta(o,\epsilon)\) is the output of the encoder parameterized by \(\theta\). The architecture thus becomes similar to a \(\beta\)-VAE. As the encoder \(p_\theta(z\vert o)\) is already regularized, they apply the policy entropy term only to the action heads. The final loss becomes

\[\begin{align} \mathcal L=\mathcal L_{AC}-\lambda \mathcal H(\pi(\cdot|z))+\beta\mathcal L_{KL} \end{align}\]

where \(\mathcal L_{AC}\) is the loss function of the AC algorithm, \(\mathcal H(\pi(\cdot\vert z))\) is the entropy loss, and \(\mathcal L_{KL}=D_{KL}(p_\theta(z\vert o)\Vert q(z))\).
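
A minimal sketch of the bottleneck terms, assuming the encoder outputs the mean and log standard deviation of \(p_\theta(z\vert o)\) and that the prior \(q(z)\) is a standard normal (the usual \(\beta\)-VAE choice); `encoder` and the tensor shapes are illustrative assumptions, not the authors' exact implementation.

    import tensorflow as tf
    import tensorflow_probability as tfp
    tfd = tfp.distributions

    def information_bottleneck(encoder, obs):
        mean, log_std = encoder(obs)            # parameters of p_theta(z|o)
        p_z = tfd.MultivariateNormalDiag(mean, tf.exp(log_std))
        q_z = tfd.MultivariateNormalDiag(tf.zeros_like(mean), tf.ones_like(mean))
        z = p_z.sample()                        # z = f_theta(o, eps), reparameterized
        kl = tf.reduce_mean(tfd.kl_divergence(p_z, q_z))  # the L_KL term
        return z, kl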

DrAC

Raileanu et al. 2020 experiment with a collection of data augmentation techniques in RL. Similar work was done earlier by Laskin et al. 2020, who directly applied data augmentation to the PPO objective. Raileanu et al. 2020 point out that this is problematic as it changes \(\pi(a\vert s)\) to \(\pi(a\vert f(s))\), where \(f\) applies the data augmentation to \(s\). Instead, Raileanu et al. 2020 leave the PPO objective as it is and add two additional loss terms to regularize the policy and value functions:

\[\begin{align} \mathcal J&=\mathcal J_{PPO} - \alpha(\mathcal L_\pi+\mathcal L_V)\\ \mathcal L_\pi&=D_{KL}[\pi_\theta(a|s)\Vert \pi(a|f(s))]\\ \mathcal L_V&={1\over 2}(V(f(s)) - \text{sg}(V(s)))^2 \end{align}\]

They call the resulting algorithm Data-regularized Actor-Critic (DrAC).
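
A minimal sketch of the two regularizers, following the formulas above; `policy` is assumed to return a `tfp` action distribution, `value` a scalar estimate, and `augment` stands in for the transformation \(f\). None of these names come from the paper.

    import tensorflow as tf
    import tensorflow_probability as tfp
    tfd = tfp.distributions

    def drac_regularizers(policy, value, augment, obs):
        aug_obs = augment(obs)                                # f(s)
        loss_pi = tf.reduce_mean(
            tfd.kl_divergence(policy(obs), policy(aug_obs)))  # L_pi
        # pull V(f(s)) toward V(s) without back-propagating through V(s)
        loss_v = 0.5 * tf.reduce_mean(
            tf.square(value(aug_obs) - tf.stop_gradient(value(obs))))  # L_V
        return loss_pi, loss_v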

mixreg

Wang et al. 2020 propose generating augmented observations by linearly interpolating two observations

\[\begin{align} \tilde s=\lambda s_i+(1-\lambda)s_j \end{align}\]

where \(\lambda\sim \text{Beta}(\alpha,\alpha)\) with \(\alpha=0.2\) in their experiments.

Because the new observation is a convex combination of two random observations, they also mix the training signals accordingly. For policy gradient methods, the objective for augmented observations becomes

\[\begin{align} \mathcal J=\mathbb E\left[\log\pi_\theta(\tilde a|\tilde s)\tilde A\right] \end{align}\]

where \(\tilde A=\lambda A_i+(1-\lambda)A_j\), and \(\tilde a\) is \(a_i\) if \(\lambda\ge 0.5\) and \(a_j\) otherwise.
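
A minimal sketch of the policy-gradient variant, assuming batched NumPy arrays of observations, (discrete) actions, and advantages; the pairing by a random permutation and the variable names are illustrative assumptions.

    import numpy as np

    def mixreg_pg_batch(obs, act, adv, alpha=0.2):
        n = len(obs)
        j = np.random.permutation(n)                       # pair each sample i with a random j
        lam = np.random.beta(alpha, alpha, size=n)         # lambda ~ Beta(alpha, alpha)
        lam_o = lam.reshape((n,) + (1,) * (obs.ndim - 1))  # broadcast over observation dims
        mixed_obs = lam_o * obs + (1 - lam_o) * obs[j]     # s_tilde
        mixed_adv = lam * adv + (1 - lam) * adv[j]         # A_tilde
        mixed_act = np.where(lam >= 0.5, act, act[j])      # a_tilde
        return mixed_obs, mixed_act, mixed_adv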

For Q-learning, the objective for augmented observations becomes

\[\begin{align} \mathcal L=\mathbb E\left[\left(\tilde r+\gamma\max_{a'}Q(\tilde s',a')-Q(\tilde s,\tilde a)\right)^2\right] \end{align}\]

where \(\tilde r=\lambda r_i+(1-\lambda)r_j\), \(Q(\tilde s',a')=\lambda Q(s_i',a_i')+(1-\lambda)Q(s_j',a_j')\), and \(\tilde a\) is \(a_i\) if \(\lambda\ge 0.5\) and \(a_j\) otherwise.
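
Following the same pairing as above, a sketch of the mixed Q-learning target; `q_next_i` and `q_next_j` stand for the bootstrapped values \(Q(s_i',a_i')\) and \(Q(s_j',a_j')\) of the two original transitions and are assumptions of this sketch.

    def mixreg_q_target(r_i, r_j, q_next_i, q_next_j, lam, gamma=0.99):
        mixed_r = lam * r_i + (1 - lam) * r_j                  # r_tilde
        mixed_q_next = lam * q_next_i + (1 - lam) * q_next_j   # Q(s_tilde', a')
        return mixed_r + gamma * mixed_q_next                  # target for Q(s_tilde, a_tilde)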

It is quite astonishing that, at test time, mixreg performs better than regular regularization techniques such as data augmentation, \(\ell_2\) regularization, and batch normalization. Although the authors demonstrate that mixing the training signals is important to mixreg, it is still unclear why the method works. One possible explanation is that mixreg imposes a piecewise-linearity regularization on the learned policy and value functions w.r.t. the states, which encourages the agent to learn a smoother policy with better generalization performance.

Surprise minimization

Chen 2020 shows that adding a surprise-minimizing term to the reward can improve generalization. Specifically, Chen 2020 trains PPO with the reward function defined as \(r(s,a)+\alpha\log p(s)\), where \(\alpha\) controls the relative scale of the surprise term and \(\log p(s)\) estimates the surprise. Two ways are proposed to estimate the state distribution:

  1. Normal distributions. A buffer of size \(20\) times the mini-batch size is used to store the most recent grayscale observations. Before each training iteration, a surprise-minimizing reward is computed from the buffer as
\[\log p(s_t)=-\sum_i\left(\log\sigma_i+{(s_{t,i}-\mu_i)^2\over 2\sigma_i^2}\right)\]

where \(\mu_i\) and \(\sigma_i\) are the sample mean and standard deviation of the \(i^{th}\) dimension computed across the states in the buffer (see the sketch after this list).

  2. Variational autoencoder. A VAE is trained on the raw RGB observations. Before each PPO training iteration, we first compute the embeddings \(\pmb z\) from the encoder of the VAE for all observations \(\pmb o\). Then we model \(p(z)\) as a normal distribution whose mean and diagonal covariance are computed from \(\pmb z\), and we estimate \(\log p(z)\) for each sample:

    import tensorflow as tf
    import tensorflow_probability as tfp
    tfd = tfp.distributions

    # fit a diagonal Gaussian to the batch of embeddings z of shape [N, d]
    dist = tfd.MultivariateNormalDiag(
        tf.reduce_mean(z, 0, keepdims=True),
        tf.math.reduce_std(z, 0, keepdims=True),
    )
    logp = dist.log_prob(z)  # log p(z), the surprise-minimizing reward per sample

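For the first (Gaussian) estimator, here is a minimal sketch, assuming `buffer` is a NumPy array of recent grayscale observations with one row per state and `obs` is the current observation; the names and the small \(\epsilon\) for numerical stability are assumptions of this sketch.

    import numpy as np

    def gaussian_surprise(buffer, obs, eps=1e-8):
        # per-dimension sample statistics over the buffer of recent observations
        mu = buffer.mean(axis=0)
        sigma = buffer.std(axis=0) + eps
        # log p(s_t) up to an additive constant, per the formula above
        return -np.sum(np.log(sigma) + (obs - mu) ** 2 / (2 * sigma ** 2))
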
The experimental results are somewhat confounding, as the agent performs better on the test environments than on the training environments.

References

Igl, Maximilian, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, and Katja Hofmann. 2019. “Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck,” no. NeurIPS. http://arxiv.org/abs/1910.12911.

Raileanu, Roberta, Max Goldstein, Denis Yarats, and Rob Fergus. 2020. “Automatic Data Augmentation for Generalization in Deep Reinforcement Learning.”

Laskin, Michael, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. 2020. “Reinforcement Learning with Augmented Data.” http://arxiv.org/abs/2004.14990.

Wang, Kaixin, Bingyi Kang, Jie Shao, and Jiashi Feng. 2020. “Improving Generalization in Reinforcement Learning with Mixture Regularization,” no. NeurIPS: 1–21.

Chen, Jerry Zikun. 2020. “Reinforcement Learning Generalization with Surprise Minimization.”