Introduction
We discuss Reactor (Retrace Actor), proposed by Gruslys et al. 2017, a single-agent, model-free algorithm that combines Retrace, distributional RL, \(\beta\)-LOO, and prioritized replay. As Retrace and distributional RL have been covered in our previous posts, we focus here on \(\beta\)-LOO only. Gruslys et al. 2017 also introduce a prioritization mechanism for sequential data, but we don't cover it here, as more recent algorithms often use the one from R2D2.
TL; DR
- The \(\beta\)-LOO policy gradient estimate interpolates between the Reinforce and \(Q\)-learning-based policy gradient estimates:
\[\begin{align} \mathcal G_{\beta-LOO}=\beta(R(s,\hat a)-Q(s,\hat a))\nabla\pi(\hat a|s)+\sum_aQ(s,a)\nabla\pi(a|s) \end{align}\]
where \(\beta=\min(c,{1\over \mu(\hat a\vert s)})\). \(\mathbb E_{\mu}\mathcal G_{\beta-LOO}\) is unbiased when \(c\rightarrow\infty\) and \(R(s,\hat a)\) is an unbiased estimate of the return, regardless of the accuracy of \(Q(s,\hat a)\). In practice, \(R(s,\hat a)\) is computed with the Retrace algorithm.
\(\beta\)-LOO
Given the state-action value \(Q^\pi(s,a)\), we can compute the exact policy gradient by
\[\begin{align} \mathcal G=\sum_a Q^\pi(s,a)\nabla\pi(a|s)\tag 1 \end{align}\]Consider the off-policy case, where actions \(\hat a\) are sampled from a behavior policy \(\mu(\hat a\vert s)\). Assume that we have access to an unbiased estimate \(R(s, \hat a)\) of \(Q^\pi(s,\hat a)\). We can then estimate \(\mathcal G\) using the likelihood ratio method with an importance sampling ratio
\[\begin{align} \mathcal G_{ISLR}=\mathbb E_{\hat a\sim\mu}[{\pi(\hat a|s)\over\mu(\hat a|s)}R(s, \hat a)\nabla\log\pi(\hat a|s)]\tag 2 \end{align}\]This estimate has high variance due to the importance sampling ratio; the estimates introduced in this section are compared numerically in the sketches after Equation \((5)\). A possible way to reduce the variance is to estimate \(\mathcal G\) directly from Equation \((1)\), using the return \(R(s,\hat a)\) for the chosen action \(\hat a\) and our current estimate \(Q\) of \(Q^\pi\) for the other actions. This leads to the so-called leave-one-out (LOO) policy gradient estimate:
\[\begin{align} \mathcal G_{LOO}=R(s,\hat a)\nabla\pi(\hat a|s)+\sum_{a\ne\hat a}Q(s,a)\nabla\pi(a|s)\tag 3 \end{align}\]This estimate has lower variance because the importance ratio is gone, but it may be biased when the estimated \(Q\) values differ from \(Q^\pi\). Gruslys et al. 2017 therefore introduce the \(\beta\)-LOO policy gradient estimate to obtain a better bias-variance tradeoff
\[\begin{align} \mathcal G_{\beta-LOO}=\beta(R(s,\hat a)-Q(s,\hat a))\nabla\pi(\hat a|s)+\sum_aQ(s,a)\nabla\pi(a|s)\tag 4 \end{align}\]where \(\beta\) can be a function of the policies \(\mu\) and \(\pi\) and of the selected action \(\hat a\). When \(\beta=1\), Equation \((4)\) reduces to Equation \((3)\). When \(\beta={1\over\mu(\hat a\vert s)}\), Equation \((4)\) becomes
\[\begin{align} \mathcal G_{\beta-LOO}={\pi(\hat a|s)\over\mu(\hat a|s)}(R(s,\hat a)-Q(s,\hat a))\nabla\log \pi(\hat a|s)+\sum_aQ(s,a)\nabla\pi(a|s)\tag 5 \end{align}\]Proposition 1 below shows that Equation \((5)\) is unbiased, regardless of the accuracy of \(Q\).
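To make the comparison concrete, here is a minimal NumPy sketch of Equations \((1)\) and \((2)\) for a single state with a softmax policy. Everything in it (the four actions, the uniform behavior policy, the stand-in action values `q_true`, and the Gaussian return noise) is made up for illustration and is not part of Reactor; it merely checks that averaging many ISLR samples recovers the exact gradient, at the price of noisy individual samples.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 4                                       # number of discrete actions (made up)
theta = rng.normal(size=A)                  # logits of the target policy pi(.|s)
pi = np.exp(theta) / np.exp(theta).sum()    # softmax target policy
mu = np.full(A, 1.0 / A)                    # behavior policy mu(.|s), uniform here
q_true = rng.normal(size=A)                 # stand-in for Q^pi(s, .)

# Softmax Jacobian: dpi[a, j] = d pi(a|s) / d theta_j = pi(a) (1{a=j} - pi(j))
dpi = np.diag(pi) - np.outer(pi, pi)

# Equation (1): exact policy gradient, G_j = sum_a Q^pi(s, a) d pi(a|s) / d theta_j
g_exact = q_true @ dpi

# Equation (2): importance-sampled likelihood-ratio (ISLR) estimate
def g_islr(n_samples=100_000, noise=0.5):
    grads = np.zeros((n_samples, A))
    for i in range(n_samples):
        a_hat = rng.choice(A, p=mu)                # a_hat ~ mu(.|s)
        r = q_true[a_hat] + noise * rng.normal()   # unbiased return R(s, a_hat)
        grad_log_pi = np.eye(A)[a_hat] - pi        # grad_theta log pi(a_hat|s)
        grads[i] = pi[a_hat] / mu[a_hat] * r * grad_log_pi
    return grads.mean(axis=0), grads.std(axis=0)

g_mean, g_std = g_islr()
print("Equation (1):", g_exact)
print("Equation (2):", g_mean)     # matches Equation (1) up to Monte Carlo error ...
print("per-sample std:", g_std)    # ... but each individual sample is noisy
```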
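Under the same made-up setup, here is a sketch of the \(\beta\)-LOO estimate of Equation \((4)\), again purely illustrative: with \(\beta=1\) it reduces to the LOO estimate of Equation \((3)\) and inherits the bias of the inaccurate critic `q_est`, whereas with \(\beta={1\over\mu(\hat a\vert s)}\) (Equation \((5)\)) its average matches the exact gradient even though `q_est` is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 4
theta = rng.normal(size=A)
pi = np.exp(theta) / np.exp(theta).sum()    # target policy pi(.|s)
mu = np.full(A, 1.0 / A)                    # behavior policy mu(.|s), uniform here
q_true = rng.normal(size=A)                 # stand-in for Q^pi(s, .)
q_est = q_true + rng.normal(size=A)         # deliberately inaccurate critic Q(s, .)
dpi = np.diag(pi) - np.outer(pi, pi)        # dpi[a, j] = d pi(a|s) / d theta_j
g_exact = q_true @ dpi                      # Equation (1)

# Equation (4): beta-LOO estimate, averaged over actions sampled from mu
def g_beta_loo(beta_fn, n_samples=200_000, noise=0.5):
    total = np.zeros(A)
    for _ in range(n_samples):
        a_hat = rng.choice(A, p=mu)                # a_hat ~ mu(.|s)
        r = q_true[a_hat] + noise * rng.normal()   # unbiased return R(s, a_hat)
        total += beta_fn(a_hat) * (r - q_est[a_hat]) * dpi[a_hat] + q_est @ dpi
    return total / n_samples

print("exact (Eq. 1)      :", g_exact)
print("beta = 1 (Eq. 3)   :", g_beta_loo(lambda a: 1.0))            # biased: q_est is off
print("beta = 1/mu (Eq. 5):", g_beta_loo(lambda a: 1.0 / mu[a]))    # unbiased despite q_est
```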
Proposition 1. Assume \(\hat a\sim\mu\) and \(\mathbb E[R(s,\hat a)]=Q^\pi(s,\hat a)\). Then the bias of \(\mathcal G_{\beta-LOO}\) is \(\vert \sum_a(1-\mu(a\vert s)\beta)(Q(s, a)-Q^\pi(s, a))\nabla\pi(a\vert s)\vert \).
Proof. The bias of \(\mathcal G_{\beta-LOO}\) is
\[\begin{align} \mathbb E_\mu\mathcal G_{\beta-LOO} - \mathcal G=&\sum_{a}\mu(a|s)\beta(\mathbb E[R(s,a)]-Q(s,a))\nabla \pi(a|s)+\sum_aQ(s,a)\nabla\pi(a|s) - \sum_a Q^\pi(s,a)\nabla\pi(a|s)\\\ =&\sum_{a}\mu(a|s)\beta(Q^\pi(s,a)-Q(s,a))\nabla \pi(a|s)+\sum_a(Q(s,a)-Q^\pi(s,a))\nabla\pi(a|s)\\\ =&\sum_a(1-\mu(a|s)\beta)(Q(s,a)-Q^\pi(s,a))\nabla\pi(a|s) \end{align}\]where the second equality uses the assumption that \(R(s,a)\) is an unbiased estimate of \(Q^\pi(s,a)\), i.e., \(\mathbb E[R(s,a)]=Q^\pi(s,a)\). We can see that when \(\beta={1\over\mu(a\vert s)}\), \(\mathcal G_{\beta-LOO}\) is unbiased regardless of the accuracy of \(Q(s,a)\). In practice, to achieve a better bias-variance tradeoff, we use \(\beta=\min(c,{1\over \mu(a\vert s)})\) for some constant \(c\ge 1\).
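As a quick sanity check on Proposition 1 and on the practical choice \(\beta=\min(c,{1\over \mu(a\vert s)})\), the sketch below reuses the same hypothetical single-state setup (softmax \(\pi\), a made-up skewed \(\mu\), stand-in `q_true` and `q_est`) and compares the exact expected bias of \(\mathcal G_{\beta-LOO}\) with the closed-form expression from the proposition; the two coincide, and the bias shrinks toward zero as \(c\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 4
theta = rng.normal(size=A)
pi = np.exp(theta) / np.exp(theta).sum()    # target policy pi(.|s)
mu = np.array([0.5, 0.3, 0.15, 0.05])       # a skewed behavior policy mu(.|s), made up
q_true = rng.normal(size=A)                 # stand-in for Q^pi(s, .)
q_est = q_true + rng.normal(size=A)         # inaccurate critic Q(s, .)
dpi = np.diag(pi) - np.outer(pi, pi)        # dpi[a, j] = d pi(a|s) / d theta_j
g_exact = q_true @ dpi                      # Equation (1), the exact gradient

for c in [1.0, 2.0, 10.0, 1e6]:
    beta = np.minimum(c, 1.0 / mu)          # beta(a) = min(c, 1 / mu(a|s))
    # E_mu[G_beta-LOO], using E[R(s, a)] = Q^pi(s, a)
    expected_g = (mu * beta * (q_true - q_est)) @ dpi + q_est @ dpi
    # Proposition 1 (signed, before taking the absolute value):
    # sum_a (1 - mu(a|s) beta(a)) (Q(s, a) - Q^pi(s, a)) grad pi(a|s)
    predicted_bias = ((1.0 - mu * beta) * (q_est - q_true)) @ dpi
    print(f"c = {c:g}: bias {expected_g - g_exact}, Proposition 1 {predicted_bias}")
```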
References
Gruslys, Audrunas, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc G. Bellemare, and Rémi Munos. 2017. “The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning.” ArXiv, 1–18.