by Sherwin Chen

Introduction

We discuss self-imitation learning (SIL), proposed by Oh et al., in which the agent exploits past transitions that received better returns than it expected.

Loss

The objective of self-imitation learning is to exploit transitions that led to high returns. To do so, Oh et al. introduce a prioritized replay that prioritizes transitions based on \((R-V(s))_+\), where \(R\) is the discounted sum of rewards and \((\cdot)_+=\max(\cdot,0)\). Besides the traditional A2C updates, the agent also updates its network using transitions sampled from the replay in proportion to their priorities, with the following losses:

\[\begin{align}
\mathcal L^{sil}&=\mathcal L^{sil}_\pi+\beta^{sil}\mathcal L^{sil}_v\\
\mathcal L^{sil}_\pi&=\mathbb E_{s,a,R\in\mathcal D}\left[-\log\pi_\theta(a|s)\big(R-V(s)\big)_+\right]\\
\mathcal L^{sil}_v&=\mathbb E_{s,a,R\in\mathcal D}\left[{1\over 2}\big\Vert\big(R-V(s)\big)_+\big\Vert^2\right]
\end{align}\]

where \(\beta^{sil}\) weights the value loss. Oh et al. also draw a connection between the above loss and lower-bound soft Q-learning. We do not elaborate on it here since, although a little tricky, it is easy to follow: the proof per se is built on the theory of maximum entropy RL (MaxEnt RL), but the final conclusion assumes the temperature \(\alpha\) in MaxEnt RL is close to \(0\) in order to establish the connection, which effectively ignores the policy update \(\mathcal L_\pi^{sil}\) and the entropy regularization introduced by MaxEnt RL.
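
For concreteness, here is a minimal PyTorch-style sketch of the two losses, assuming a discrete action space. The function and argument names (e.g. `sil_losses`, `policy_logits`, `beta_sil`) and the default \(\beta^{sil}\) value are my own choices for illustration, not taken from the paper's code. The clipped advantage it returns is exactly \((R-V(s))_+\), which doubles as the priority used by the prioritized replay described above.

```python
import torch
import torch.nn.functional as F

def sil_losses(policy_logits, values, actions, returns, beta_sil=0.01):
    """Sketch of the self-imitation losses on a batch of replayed transitions.

    policy_logits: (B, num_actions) logits from the actor head
    values:        (B,) state-value estimates V(s)
    actions:       (B,) long tensor of actions taken in the sampled transitions
    returns:       (B,) discounted returns R recorded when the episode finished
    beta_sil:      weight on the value loss, i.e. beta^{sil}
    """
    # Clipped advantage (R - V(s))_+ ; it is also the replay priority.
    advantage = torch.clamp(returns - values, min=0.0)

    # L^{sil}_pi = E[ -log pi(a|s) * (R - V(s))_+ ]
    # The advantage is detached so the policy loss does not update V.
    log_probs = F.log_softmax(policy_logits, dim=-1)
    log_prob_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_prob_a * advantage.detach()).mean()

    # L^{sil}_v = E[ 1/2 * ||(R - V(s))_+||^2 ]
    # Gradients flow to V only when R > V(s), due to the clipping.
    value_loss = 0.5 * advantage.pow(2).mean()

    return policy_loss + beta_sil * value_loss, advantage
```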

Interesting Experimental Results

To see whether off-policy actor-critic methods can also benefit from past good experiences, Oh et al. apply the idea to ACER by replacing its replay buffer with the same prioritized replay. Unfortunately, the resulting algorithm performs worse than even A2C. They conjecture that this is due to the importance weight term in ACER, which weakens the effect of good transitions when the current policy deviates too much from the decisions made in the past, as the toy calculation below illustrates. However, they do not compare ACER with A2C in their experiments, which makes the comparison obscure.
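
The following toy calculation (my own illustration, not from the paper) shows how ACER's truncated importance weight can mute a high-return transition once the current policy has drifted away from the behaviour policy that generated it.

```python
# ACER scales the main policy-gradient term by rho_bar = min(c, pi(a|s) / mu(a|s)),
# where mu is the behaviour policy that produced the stored transition.
c = 10.0     # truncation threshold
mu_a = 0.5   # probability the old (behaviour) policy assigned to the good action
pi_a = 0.01  # the current policy has drifted away from that action

rho = pi_a / mu_a        # importance ratio = 0.02
rho_bar = min(c, rho)    # truncated weight, still 0.02

# Even if this transition had a high return, its gradient contribution is
# scaled down by a factor of 0.02, so the good experience barely moves the policy.
print(rho_bar)
```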

References

Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-Imitation Learning. In ICML, 2018.