Introduction
Consider the general operator for return-based off-policy algorithms:
\[\begin{align} \mathcal RQ(x_t,a_t):=Q(x_t,a_t)+\mathbb E_\mu\left[\sum_{k=t}^{t+n-1}\gamma^{k-t}\left(\prod_{i=t+1}^k c_i\right)(r_k+\gamma\mathbb E_{\pi} Q(x_{k+1},\cdot)-Q(x_k,a_k))\right]\tag {1} \end{align}\]where \(\pi\) and \(\mu\) are the target and behavior policies, respectively. All \(Q\)s are computed from the target network. Munos et al. 2016 introduced Retrace(\(\lambda\)), which defines \(c_i=\lambda\min\left(1,{\pi(a_i\vert x_i)\over\mu(a_i\vert x_i)}\right)\), with the following advantages:
- Retrace(\(\lambda\)) truncates the IS ratio at \(1\), so it does not suffer from the variance explosion caused by the product of importance sampling (IS) ratios (see the numerical sketch at the end of this introduction).
- It does not cut the traces in the on-policy case, making it possible to benefit from the full returns.
- For any traces \(0\le c_i\le {\pi(a_i\vert x_i)\over\mu(a_i\vert x_i)}\), \(\mathcal R\) is a \(\gamma\)-contraction around \(Q^\pi\) for arbitrary \(\pi\) and \(\mu\).
- In the control setting (where the target policies become increasingly greedy), Retrace(\(\lambda\)) converges almost surely to \(Q^*\), without requiring the GLIE assumption.
In this post, we show the following:
- For any traces \(0\le c_i\le{\pi(a_i\vert x_i)\over\mu(a_i\vert x_i)}\), the return-based operator (Equation \((1)\)) is a \(\gamma\)-contraction around \(Q^\pi\) for arbitrary policies \(\mu\) and \(\pi\).
- In the control case (where \(\pi\) is replaced by a sequence of increasingly greedy policies), the online Retrace(\(\lambda\)) algorithm converges almost surely (a.s.) to \(Q^*\), without requiring the GLIE (Greedy in the Limit with Infinite Exploration) assumption.
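To make the truncation in Retrace(\(\lambda\)) concrete, here is a minimal Python sketch (the state-independent three-action policies and all numbers are made up for illustration) that samples a trajectory from \(\mu\) and compares the running product of raw IS ratios with the running product of the truncated traces \(c_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, horizon, lam = 3, 10, 0.9

# Hypothetical tabular policies over 3 actions (state-independent for brevity).
pi = np.array([0.7, 0.2, 0.1])   # target policy pi(a)
mu = np.array([0.2, 0.3, 0.5])   # behavior policy mu(a)

actions = rng.choice(n_actions, size=horizon, p=mu)   # trajectory sampled from mu

rho = pi[actions] / mu[actions]          # raw importance-sampling ratios
c = lam * np.minimum(1.0, rho)           # Retrace(lambda) traces c_i

print("cumulative product of raw IS ratios :", np.cumprod(rho))  # can blow up or vanish
print("cumulative product of Retrace traces:", np.cumprod(c))    # always <= lam**k
```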
Analysis of Retrace(\(\lambda\))
Policy Evaluation
In the evaluation setting, we prove that the \(\mathcal R\) operator defined in Equation \((1)\) is a \(\gamma\)-contraction around its fixed point \(Q^\pi\) for any traces \(0\le c_i\le{\pi(a_i\vert x_i)\over\mu(a_i\vert x_i)}\). We first introduce the following lemma.
Lemma 1. The difference between \(\mathcal RQ\) and its fixed point \(Q^\pi\) is
\[\begin{align} \mathcal RQ(x_t,a_t)-Q^\pi(x_t,a_t)=\mathbb E_\mu\left[\sum_{k\ge t+1}\gamma^{k-t}\left(\prod_{i=t+1}^{k-1}c_i\right)\left(\mathbb E_\pi(Q-Q^\pi)(x_k,\cdot)-c_k(Q-Q^\pi)(x_k,a_k)\right)\right]\tag {2} \end{align}\]Proof. We begin by rewriting Equation \((1)\):
\[\begin{align} \mathcal RQ(x_t,a_t)=&\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t+1}^kc_i\right)\left(r_k+\gamma\big(\mathbb E_\pi Q(x_{k+1},\cdot)-c_{k+1}Q(x_{k+1},a_{k+1})\big)\right)\right]\tag {3} \end{align}\]where we let \(n\to\infty\) and move the \(-Q(x_{k+1},a_{k+1})\) term of the \((k+1)\)-th TD error into the \(k\)-th term of the sum (picking up a factor \(\gamma c_{k+1}\)). Since \(Q^\pi\) is the fixed point of \(\mathcal R\), we have
\[\begin{align} Q^\pi(x_t,a_t)=\mathcal RQ^\pi(x_t,a_t)=\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t+1}^kc_i\right)\left(r_k+\gamma\big(\mathbb E_\pi Q^\pi(x_{k+1},\cdot)-c_{k+1}Q^\pi(x_{k+1},a_{k+1})\big)\right)\right]\tag4 \end{align}\]We can now deduce Lemma 1 by subtracting Equation \((4)\) from Equation \((3)\).
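Concretely, subtracting Equation \((4)\) from Equation \((3)\) cancels the reward terms and, writing \(\Delta Q:=Q-Q^\pi\), leaves
\[\begin{align} \mathcal RQ(x_t,a_t)-Q^\pi(x_t,a_t)=\sum_{k\ge t}\gamma^{k-t+1}\mathbb E_\mu\left[\left(\prod_{i=t+1}^{k}c_i\right)\left(\mathbb E_\pi\Delta Q(x_{k+1},\cdot)-c_{k+1}\Delta Q(x_{k+1},a_{k+1})\right)\right] \end{align}\]and shifting the summation index \(k\rightarrow k-1\) gives exactly Equation \((2)\).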
In the rest of the discussion, all expectations are taken with respect to the behavior policy \(\mu\) (or, in the control setting, the sequence of behavior policies) unless the distribution is explicitly specified.
Define the history \(\mathcal F_i\) as the trajectory drawn from the behavior policy \(\mu\) up to time step \(i\), and let the non-negative coefficients \(c_i=c_i(a_i,\mathcal F_i)\) be functions of \(a_i\) and \(\mathcal F_i\). Next, we show that \(\mathcal R\) is a \(\gamma\)-contraction around \(Q^\pi\), i.e., \(\Vert \mathcal RQ-Q^\pi\Vert\le\gamma\Vert Q-Q^\pi\Vert\), where \(\Vert\cdot\Vert\) is the supremum norm.
Theorem 1. The operator \(\mathcal R\) defined by Equation \((1)\) has a fixed point \(Q^\pi\). Furthermore, if for each \(a_i\in\mathcal A\) and each history \(\mathcal F_i\) we have \(c_i\in\left[0,{\pi(a_i\vert x_i)\over \mu(a_i\vert x_i)}\right]\), then for any \(Q\)-function \(Q\)
\[\begin{align} \Vert \mathcal RQ-Q^\pi\Vert\le\gamma\Vert Q-Q^\pi\Vert \end{align}\]Proof. From Lemma 1 and defining \(\Delta Q:=Q-Q^\pi\), we have
\[\begin{align} \mathcal RQ(x_t,a_t)-Q^\pi(x_t,a_t)=&\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k}}\left[\left(\prod_{i=t+1}^{k-1}c_i\right)\left(\mathbb E_\pi\Delta Q(x_k,\cdot)-c_k\Delta Q(x_k,a_k)\right)\right]\\\ &\qquad\color{red}{\text{move }\mathbb E_{a_k\sim\mu}\text{ inward}}\\\ =&\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k-1}}\left[\left(\prod_{i=t+1}^{k-1}c_i\right)\left(\sum_b\big(\pi(b|x_k)\Delta Q(x_k,b)-\mu(b|x_k)c_k\Delta Q(x_k,b)\big)\right)\right]\\\ =&\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k-1}}\left[\left(\prod_{i=t+1}^{k-1}c_i\right)\left(\sum_b\big(\pi(b|x_k)-\mu(b|x_k)c_k\big)\Delta Q(x_k,b)\right)\right]\\\ \end{align}\]Since \(c_k\le{\pi(b\vert x_k)\over\mu(b\vert x_k)}\) for every action \(b\), each weight \(\pi(b\vert x_k)-\mu(b\vert x_k)c_k\) is non-negative, so \(\mathcal RQ(x_t,a_t)-Q^\pi(x_t,a_t)=\sum_{y,b}w_{y,b}\Delta Q(y,b)\) is a linear combination of \(\Delta Q\) values over states \(y\) and actions \(b\), with non-negative coefficients
\[\begin{align} w_{y,b}=\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k-1}}\left[\left(\prod_{i=t+1}^{k-1}c_i\right)\mathbb 1\{x_k=y\}\big(\pi(b|x_k)-\mu(b|x_k)c_k\big)\right] \end{align}\]The sum of these coefficients is
\[\begin{align} \sum_{y,b}w_{y,b}=&\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k-1}}\left[\left(\prod_{i=t+1}^{k-1}c_i\right)\left(\sum_b\big(\pi(b|x_k)-\mu(b|x_k)c_k\big)\right)\right]\\\ =&\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k-1}}\left[\left(\prod_{i=t+1}^{k-1}c_i\right)\big(1-\mathbb E_{a_k}c_k\big)\right]\\\ =&\sum_{k\ge t+1}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k}}\left[\prod_{i=t+1}^{k-1}c_i-\prod_{i=t+1}^{k}c_i\right]\\\ =&\gamma\sum_{k\ge t}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k}}\prod_{i=t+1}^{k}c_i-\left(\sum_{k\ge t}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k}}\prod_{i=t+1}^{k}c_i-\gamma^0\prod_{i=t+1}^tc_i\right)\\\ &\qquad\color{red}{C:=\sum_{k\ge t}\gamma^{k-t}\mathbb E_{x_{t+1:k},a_{t+1:k}}\prod_{i=t+1}^{k}c_i}\\\ =&\gamma C-(C-1)\\\ =&1-(1-\gamma)C \end{align}\]Because \(C\ge1\) (the \(k=t\) term alone contributes \(\gamma^0\cdot1=1\)), we have \(\sum_{y,b}w_{y,b}-\gamma=1-\gamma-(1-\gamma)C=(1-\gamma)(1-C)\le 0\), i.e., \(\sum_{y,b}w_{y,b}\le\gamma\). Since \(\mathcal R Q(x_t,a_t)-Q^\pi(x_t,a_t)\) is a linear combination of \(\Delta Q\) values weighted by non-negative coefficients \(w_{y,b}\) that sum to at most \(\gamma\), \(\mathcal R\) is a \(\gamma\)-contraction mapping around \(Q^\pi\).
Remark 1. Notice that the coefficient \(C\) above depends on \((x, a)\). If we write \(\eta(x,a)=1-(1-\gamma)C\), then we have shown that
\[\begin{align} \vert \mathcal RQ(x,a)-Q^\pi(x,a)\vert\le\eta(x,a)\Vert Q-Q^\pi\Vert \end{align}\]where \(\eta(x,a)\in[0,\gamma]\) is an \((x,a)\)-specific contraction coefficient: it equals \(\gamma\) when \(c_{t+1}=0\) (the trace is cut immediately) and can be close to zero when learning from full returns (\(c_k\approx1\) for all \(k\ge t+1\)).
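The stated range \(\eta(x,a)\in[0,\gamma]\) follows from the range of \(C\): the \(k=t\) term of \(C\) is an empty product and equals \(1\), while \(\mathbb E_{a_i\sim\mu}[c_i]\le\mathbb E_{a_i\sim\mu}\left[{\pi(a_i\vert x_i)\over\mu(a_i\vert x_i)}\right]=1\) implies, by iterated expectations, \(\mathbb E_\mu\left[\prod_{i=t+1}^kc_i\right]\le1\). Hence
\[\begin{align} 1\le C\le\sum_{k\ge t}\gamma^{k-t}={1\over1-\gamma}\qquad\Longrightarrow\qquad \eta(x,a)=1-(1-\gamma)C\in[0,\gamma] \end{align}\]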
Control
In the control setting, the single target policy \(\pi\) is replaced by a sequence of policies \(\pi_k\), which depend on \(Q_k\). We first define what it means for a sequence of policies to be increasingly greedy.
Definition 1. A sequence of policies (\(\pi_k:k\in \mathbb N\)) is increasingly greedy w.r.t. a sequence (\(Q_k:k\in\mathbb N\)) of \(Q\)-functions if the following property holds for all \(k\): \(P^{\pi_{k+1}}Q_{k+1}\ge P^{\pi_k}Q_{k+1}\), where \(P^\pi Q(x,a)=\sum_{x'}\sum_{a'}P(x'\vert x,a)\pi(a'\vert x')Q(x',a')\).
Intuitively, this means that each \(\pi_{k+1}\) is at least as greedy as the previous policy \(\pi_k\) with respect to \(Q_{k+1}\). Many natural sequences of policies are increasingly greedy, including \(\epsilon_k\)-greedy policies (with non-increasing \(\epsilon_k\)) and softmax policies (with non-increasing temperature), as the sketch below illustrates.
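As a quick numerical sanity check of Definition 1 for \(\epsilon\)-greedy policies, the following sketch (a hypothetical random transition kernel and random \(Q_k,Q_{k+1}\); all names are illustrative) verifies that \(P^{\pi_{k+1}}Q_{k+1}\ge P^{\pi_k}Q_{k+1}\) holds elementwise when \(\pi_k\) is \(\epsilon_k\)-greedy w.r.t. \(Q_k\), \(\pi_{k+1}\) is \(\epsilon_{k+1}\)-greedy w.r.t. \(Q_{k+1}\), and \(\epsilon_{k+1}\le\epsilon_k\):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3

# Hypothetical transition kernel P[x, a, x'] and two consecutive Q-functions.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
Qk = rng.normal(size=(n_states, n_actions))    # Q_k
Qk1 = rng.normal(size=(n_states, n_actions))   # Q_{k+1}

def eps_greedy(Q, eps):
    """epsilon-greedy policy pi[x, a] w.r.t. Q."""
    pi = np.full_like(Q, eps / Q.shape[1])
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] += 1.0 - eps
    return pi

def P_pi(P, pi, Q):
    """(P^pi Q)(x, a) = sum_{x', a'} P(x'|x, a) pi(a'|x') Q(x', a')."""
    return P @ (pi * Q).sum(axis=1)

eps_k, eps_k1 = 0.3, 0.1                          # non-increasing epsilons
lhs = P_pi(P, eps_greedy(Qk1, eps_k1), Qk1)       # P^{pi_{k+1}} Q_{k+1}
rhs = P_pi(P, eps_greedy(Qk, eps_k), Qk1)         # P^{pi_k}     Q_{k+1}
print(np.all(lhs >= rhs - 1e-12))                 # True
```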
Notice that, in the Retrace algorithm, \(c_i\) is Markovian, in the sense that it depends only on \(x_i\) and \(a_i\) (as well as on \(\pi\) and \(\mu\)), not on the full past history. This allows us to define the (sub-)probability operator
\[\begin{align} (P^{c\mu}Q)(x,a)=\sum_{x'}\sum_{a'}P(x'|x,a)\mu(a'|x')c(x',a')Q(x',a')\tag 5 \end{align}\]With the help of Equation \((5)\), we can rewrite the Retrace operator in state-based form as follows
\[\begin{align} \mathcal RQ=&Q+\sum_{t\ge 0}\gamma^t(P^{c\mu})^t(\mathcal T^\pi Q-Q)\tag 6\\\ =&Q+(I-\gamma P^{c\mu})^{-1}(\mathcal T^\pi Q-Q)\tag 7\\\ where\quad \mathcal T^\pi Q=&r+\gamma P^\pi Q \end{align}\]where we omit \(x,a\) for simplicity. Notice that the Retrace operator defined by Equations \((6-7)\) is state-based, while the one defined by Equation \((1)\) is time-based.
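Equations \((5-7)\) are straightforward to implement exactly for a small tabular MDP. The following sketch (a hypothetical random MDP; every name is illustrative) builds \(P^{c\mu}\) as an \(\vert\mathcal X\vert\vert\mathcal A\vert\times\vert\mathcal X\vert\vert\mathcal A\vert\) matrix, applies Equation \((7)\), and checks the conclusions of Theorem 1: \(\mathcal RQ^\pi=Q^\pi\) and \(\Vert\mathcal RQ-Q^\pi\Vert\le\gamma\Vert Q-Q^\pi\Vert\):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, lam = 5, 3, 0.9, 1.0

# Hypothetical random MDP: transitions P[x, a, x'], rewards r[x, a], tabular policies.
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A)).reshape(-1)                           # r(x, a), flattened
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)     # target policy
mu = rng.random((S, A)); mu /= mu.sum(-1, keepdims=True)     # behavior policy
c = lam * np.minimum(1.0, pi / mu)                           # Retrace traces c(x, a)

# Operators as (S*A, S*A) matrices over flattened (x, a) pairs.
P_pi  = np.einsum('xay,yb->xayb', P, pi).reshape(S * A, S * A)       # P^pi
P_cmu = np.einsum('xay,yb->xayb', P, mu * c).reshape(S * A, S * A)   # P^{c mu}, Eq. (5)

Q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r)              # Q^pi = (I - gamma P^pi)^{-1} r

def retrace(Q):
    """R Q = Q + (I - gamma P^{c mu})^{-1} (T^pi Q - Q), Eq. (7)."""
    bellman = r + gamma * P_pi @ Q                                    # T^pi Q
    return Q + np.linalg.solve(np.eye(S * A) - gamma * P_cmu, bellman - Q)

Q = rng.normal(size=S * A)
print(np.max(np.abs(retrace(Q_pi) - Q_pi)))                               # ~0: Q^pi is the fixed point
print(np.max(np.abs(retrace(Q) - Q_pi)) <= gamma * np.max(np.abs(Q - Q_pi)))  # True (Theorem 1)
```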
Theorem 2. Consider an arbitrary sequence of behavior policies (\(\mu_k\)) (which may depend on (\(Q_k\))) and a sequence of target policies (\(\pi_k\)) that are increasingly greedy w.r.t. the sequence (\(Q_k\)) generated by
\[\begin{align} Q_{k+1}=\mathcal R_k Q_k \end{align}\]where \(\mathcal R_k\) denotes the Retrace operator of Equation \((7)\) associated with \(\pi_k\) and \(\mu_k\). Assume the target policies \(\pi_k\) are \(\epsilon_k\)-away from the greedy policies w.r.t. \(Q_k\), in the sense that \(\mathcal T^{\pi_k}Q_k\ge\mathcal T^* Q_k-\epsilon_k\Vert Q_k\Vert e\), where \(e\) is the vector with all components equal to \(1\). Further suppose that \(\mathcal T^{\pi_0}Q_0\ge Q_0\). Then for any \(k\ge 0\),
\[\begin{align} \Vert Q_{k+1}-Q^\*\Vert\le\gamma \Vert Q_{k}-Q^\*\Vert+\epsilon_k\Vert Q_k\Vert \end{align}\]In consequence, if \(\epsilon_k\rightarrow 0\) then \(Q_k\rightarrow Q^*\).
Theorem 2 has a nice implication: in the control setting, Retrace(\(\lambda\)) converges almost surely to the optimal value function \(Q^*\) without requiring the GLIE assumption; the numerical sketch after the proof illustrates this.
Proof. We prove the upper and lower bounds on \(Q_{k+1}- Q^*\) below; for the remaining details, in particular the inductive argument that \(\mathcal T^{\pi_k}Q_k\ge Q_k\), we refer interested readers to the paper.
Upper bound on \(Q_{k+1}- Q^*\). Since \(Q_{k+1}=\mathcal R_k Q_k\), from Equation \((7)\), we have
\[\begin{align} Q_{k+1}- Q^\*=&Q_k+(I-\gamma P^{c\mu_k})^{-1}(\mathcal T^{\pi_k} Q_k-Q_k)-Q^\*\\\ =&(I-\gamma P^{c\mu_k})^{-1}\big(\mathcal T^{\pi_k} Q_k-Q_k+(I-\gamma P^{c\mu_k})(Q_k-Q^\*)\big)\\\ =&(I-\gamma P^{c\mu_k})^{-1}\big(\mathcal T^{\pi_k} Q_k-Q^\*-\gamma P^{c\mu_k}(Q_k-Q^\*)\big)\\\ =&(I-\gamma P^{c\mu_k})^{-1}\big(\mathcal T^{\pi_k} Q_k-\mathcal T^\*Q^\*-\gamma P^{c\mu_k}(Q_k-Q^\*)\big)\\\ &\qquad \color{red}{\mathcal T^{\pi_k} Q_k-\mathcal T^\*Q^\*=\gamma P^{\pi_k}Q_k-\gamma P^{\pi^\*}Q^\*\le \gamma P^{\pi_k}(Q_k-Q^\*)\text{ since }P^{\pi^\*}Q^\*\ge P^{\pi_k}Q^\*}\\\ \le &(I-\gamma P^{c\mu_k})^{-1}\big(\gamma P^{\pi_k}(Q_k-Q^\*)-\gamma P^{c\mu_k}(Q_k-Q^\*)\big)\\\ =&\gamma(I-\gamma P^{c\mu_k})^{-1}(P^{\pi_k}-P^{c\mu_k})(Q_k-Q^\*)\\\ &\qquad \color{red}{A_k:=\gamma(I-\gamma P^{c\mu_k})^{-1}(P^{\pi_k}-P^{c\mu_k})}\\\ =&A_k(Q_k-Q^\*) \end{align}\]where we used \(Q^\*=\mathcal T^\*Q^\*\) and, for the inequality, the fact that \((I-\gamma P^{c\mu_k})^{-1}=\sum_{t\ge0}\gamma^t(P^{c\mu_k})^t\) has non-negative elements. For the same reason \(A_k\) has non-negative elements, since
\[\begin{align} \big(P^{\pi_k}-P^{c\mu_k}\big)\big((x,a),(x',a')\big)=P(x'|x,a)\big(\pi_k(a'|x')-c(x',a')\mu_k(a'|x')\big)\ge 0 \end{align}\]because \(c\le{\pi_k\over\mu_k}\). By rewriting \(A_k\) as \(\gamma\sum_{t\ge0}\gamma^t(P^{c\mu_k})^t(P^{\pi_k}-P^{c\mu_k})\), we have
\[\begin{align} A_ke=&\gamma\sum_{t\ge0}\gamma^t(P^{c\mu_k})^t(P^{\pi_k}-P^{c\mu_k})e\\\ &\qquad\color{red}{P^{\pi_k}e=e}\\\ =&\gamma\sum_{t\ge0}\gamma^t(P^{c\mu_k})^te-\sum_{t\ge0}\gamma^{t+1}(P^{c\mu_k})^{t+1}e\\\ &\qquad\color{red}{\sum_{t\ge0}\gamma^{t+1}(P^{c\mu_k})^{t+1}e=\sum_{t\ge0}\gamma^{t}(P^{c\mu_k})^{t}e-e}\\\ =&(\gamma-1)\sum_{t\ge0}\gamma^{t}(P^{c\mu_k})^{t}e+e\\\ &\qquad\color{red}{\sum_{t\ge0}\gamma^{t}(P^{c\mu_k})^{t}e\ge e\text{ and }\gamma < 1}\\\ \le&\gamma e \end{align}\]Therefore, \(A_k\) has non-negative elements whose sum over each row (i.e., over all \((x',a')\) for a given \((x,a)\)) is at most \(\gamma\), so \(A_k(Q_k-Q^*)\le\Vert Q_k-Q^*\Vert A_ke\le\gamma\Vert Q_k-Q^*\Vert e\). We deduce that \(Q_{k+1}-Q^*\le \gamma\Vert Q_k-Q^*\Vert e\).
Lower bound on \(Q_{k+1}-Q^*\). We have
\[\begin{align} Q_{k+1}=&Q_k+\sum_{i\ge 0}\gamma ^i (P^{c\mu_k})^i(\mathcal T^{\pi_k}Q_k-Q_k)\\\ =&\mathcal T^{\pi_k}Q_k+\sum_{i\ge 1}\gamma ^i (P^{c\mu_k})^i(\mathcal T^{\pi_k}Q_k-Q_k)\\\ =&\mathcal T^{\pi_k}Q_k+\gamma P^{c\mu_k}(I-\gamma P^{c\mu_k})^{-1}(\mathcal T^{\pi_k}Q_k-Q_k) \end{align}\]and by the assumption that \(\pi_k\) is \(\epsilon_k\)-away from the greedy policy w.r.t. \(Q_k\),
\[\begin{align} Q^\*=&\mathcal T^{\pi^\*}Q^\*-\mathcal T^{\pi^\*}Q_k+\mathcal T^{\pi^\*}Q_k-\mathcal T^{\pi_k}Q_k+\mathcal T^{\pi_k}Q_k\\\ &\qquad\color{red}{\mathcal T^{\pi^\*}Q^\*-\mathcal T^{\pi^\*}Q_k=-\gamma P^{\pi^\*}(Q_k-Q^\*)\text{ and }\mathcal T^{\pi^\*}Q_k-\mathcal T^{\pi_k}Q_k\le\mathcal T^\*Q_k-\mathcal T^{\pi_k}Q_k\le\epsilon_k\Vert Q_k\Vert e}\\\ \le&-\gamma P^{\pi^\*}(Q_k-Q^\*)+\epsilon_k\Vert Q_k\Vert e+\mathcal T^{\pi_k}Q_k \end{align}\]Subtracting this from the expression for \(Q_{k+1}\) above, we obtain
\[\begin{align} Q_{k+1}-Q^\*\ge\gamma P^{c\mu_k}(I-\gamma P^{c\mu_k})^{-1}(\mathcal T^{\pi_k}Q_k-Q_k)+\gamma P^{\pi^\*}(Q_k-Q^\*)-\epsilon_k\Vert Q_k\Vert e \end{align}\]Because \(\gamma P^{c\mu_k}(I-\gamma P^{c\mu_k})^{-1}\) has non-negative elements and \(\mathcal T^{\pi_k}Q_k-Q_k\ge 0\) (this follows by induction from the increasingly greedy property and the assumption \(\mathcal T^{\pi_0}Q_0\ge Q_0\); see the paper), we obtain
\[\begin{align} Q_{k+1}-Q^\*\ge \gamma P^{\pi^\*}(Q_k-Q^\*)-\epsilon_k\Vert Q_k\Vert e\ge-\gamma\Vert Q_k-Q^\*\Vert e-\epsilon_k\Vert Q_k\Vert e \end{align}\]Combining the upper and the lower bounds yields \(\Vert Q_{k+1}-Q^\*\Vert\le\gamma\Vert Q_k-Q^\*\Vert+\epsilon_k\Vert Q_k\Vert\), which completes the proof.
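To see Theorem 2 at work, here is a minimal, self-contained sketch (again a hypothetical random tabular MDP with non-negative rewards, so that \(\mathcal T^{\pi_0}Q_0\ge Q_0\) holds for \(Q_0=0\); all names are illustrative) that iterates \(Q_{k+1}=\mathcal R_kQ_k\) with \(\epsilon_k\)-greedy target policies (non-increasing \(\epsilon_k\)) and arbitrary behavior policies, and tracks \(\Vert Q_k-Q^*\Vert\):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, lam = 5, 3, 0.9, 1.0

# Hypothetical random MDP with non-negative rewards.
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))

# Q* via value iteration with the Bellman optimality operator.
Q_star = np.zeros((S, A))
for _ in range(1000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

Q = np.zeros((S, A))                                   # Q_0 = 0, so T^{pi_0} Q_0 = r >= Q_0
for k in range(50):
    eps_k = 1.0 / (k + 1)                              # non-increasing -> increasingly greedy
    pi = np.full((S, A), eps_k / A)
    pi[np.arange(S), Q.argmax(axis=1)] += 1.0 - eps_k  # eps_k-greedy target policy w.r.t. Q_k
    mu = rng.random((S, A)); mu /= mu.sum(-1, keepdims=True)   # arbitrary behavior policy mu_k
    c = lam * np.minimum(1.0, pi / mu)                 # Retrace traces

    P_pi  = np.einsum('xay,yb->xayb', P, pi).reshape(S * A, S * A)
    P_cmu = np.einsum('xay,yb->xayb', P, mu * c).reshape(S * A, S * A)
    q = Q.reshape(-1)
    bellman = r.reshape(-1) + gamma * P_pi @ q                                           # T^{pi_k} Q_k
    Q = (q + np.linalg.solve(np.eye(S * A) - gamma * P_cmu, bellman - q)).reshape(S, A)  # R_k Q_k, Eq. (7)

    print(k, np.max(np.abs(Q - Q_star)))               # ||Q_{k+1} - Q*|| shrinks toward 0
```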
References
Munos, Rémi, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. 2016. “Safe and Efficient Off-Policy Reinforcement Learning.” NIPS 2016, November. http://arxiv.org/abs/1606.02647.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. MIT Press.
Supplementary Materials
We can derive Equation \((1)\) from the off-policy action-value return with control variates (Sutton & Barto 2018):
\[\begin{align} q(x_t,a_t)=&r_t+\gamma\big(c_{t+1}q(x_{t+1},a_{t+1})+ V(x_{t+1})-c_{t+1}Q(x_{t+1},a_{t+1})\big)\\\ where\quad V(x_t)=&\mathbb E_{a\sim\pi(\cdot|x_t)}[Q(x_t,a)] \end{align}\]where \(q\) denotes the return target and \(Q\) is the action-value function approximator. Expanding \(q(x_{t+1},a_{t+1})\) recursively, we have
\[\begin{align} q(x_t,a_t)=&r_t+\gamma V(x_{t+1})-Q(x_t,a_t)+Q(x_t,a_t)+\gamma c_{t+1}\big(q(x_{t+1},a_{t+1})-Q(x_{t+1},a_{t+1})\big)\\\ =&\delta_t+Q(x_t,a_t)+\gamma c_{t+1}(q(x_{t+1},a_{t+1})-Q(x_{t+1},a_{t+1}))\\\ =&\delta_t+Q(x_t,a_t)+\gamma c_{t+1}(\delta_{t+1}+\gamma c_{t+2}(q(x_{t+2},a_{t+2})-Q(x_{t+2},a_{t+2})))\\\ =&Q(x_t,a_t)+\sum_{k=t}^{t+n-1}\gamma^{k-t}\prod_{i=t+1}^kc_i\delta_k \end{align}\]where \(\delta_k=r_k+\gamma V(x_{k+1})-Q(x_k,a_k)\) and we take \(q(x_{t+n}, a_{t+n})=Q(x_{t+n}, a_{t+n})\) at the last step. Taking the expectation over the behavior policy \(\mu\), we conclude \(\mathbb E_\mu[q(x_t,a_t)]=\mathcal RQ(x_t,a_t)\); the control-variate return of Sutton & Barto corresponds to the special case where no truncation is applied to the importance ratio, i.e., \(c_i={\pi(a_i\vert x_i)\over\mu(a_i\vert x_i)}\).
We now derive a recursive form for Retrace(\(\lambda\))
\[\begin{align} q(x_t,a_t)=&Q(x_t,a_t)+r_t+\gamma V(x_{t+1})-Q(x_t,a_t)+\sum_{k=t+1}^{t+n-1}\gamma^{k-t}\prod_{i=t+1}^kc_i\delta_k\\\ =&r_t+\gamma V(x_{t+1})+\gamma c_{t+1}\big(q(x_{t+1},a_{t+1})-Q(x_{t+1},a_{t+1})\big) \end{align}\]Alternatively, we can write
\[\begin{align} q(x_t,a_t)=&Q(x_t,a_t)+\sum_{k=t}^{t+n-1}\gamma^{k-t}\prod_{i=t+1}^kc_i\delta_k\\\ =&Q(x_t,a_t)+\delta_t+\gamma c_{t+1}\sum_{k=t+1}^{t+n-1}\gamma^{k-t-1}\prod_{i=t+2}^kc_i\delta_k\\\ q(x_t,a_t)-Q(x_t,a_t)=&\delta_t+\gamma c_{t+1}\big(q(x_{t+1},a_{t+1})-Q(x_{t+1}, a_{t+1})\big) \end{align}\]
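In practice this is exactly what one implements: starting from the truncation point, where \(q(x_{t+n},a_{t+n})=Q(x_{t+n},a_{t+n})\), and working backwards with \(q_t-Q_t=\delta_t+\gamma c_{t+1}(q_{t+1}-Q_{t+1})\). Below is a minimal sketch (the function name and the made-up trajectory arrays are purely illustrative):

```python
import numpy as np

def retrace_targets(rewards, q_values, v_values, traces, gamma):
    """Backward recursion q_t - Q_t = delta_t + gamma * c_{t+1} * (q_{t+1} - Q_{t+1}),
    with delta_t = r_t + gamma * V(x_{t+1}) - Q(x_t, a_t) and q_n = Q(x_n, a_n).

    rewards:  r_t,          shape [n]
    q_values: Q(x_t, a_t),  shape [n + 1]  (includes the bootstrap value at t = n)
    v_values: V(x_t) = E_{a~pi}[Q(x_t, a)], shape [n + 1]
    traces:   c_t,          shape [n + 1]  (traces[0] is unused)
    """
    n = len(rewards)
    deltas = rewards + gamma * v_values[1:] - q_values[:-1]   # TD errors delta_t
    targets = np.empty(n)
    residual = 0.0                                            # q_{t+1} - Q_{t+1}; zero at t = n
    for t in reversed(range(n)):
        residual = deltas[t] + gamma * traces[t + 1] * residual
        targets[t] = q_values[t] + residual                   # q_t = Q(x_t, a_t) + (q_t - Q_t)
    return targets

# Tiny usage example with made-up numbers.
gamma = 0.99
r = np.array([0.0, 1.0, 0.5])
Q = np.array([0.2, 0.4, 0.3, 0.1])
V = np.array([0.25, 0.35, 0.28, 0.12])
c = np.array([1.0, 0.9, 0.4, 0.7])
print(retrace_targets(r, Q, V, c, gamma))
```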