Introduction
The V-trace loss, introduced by Espeholt et al. (2018), targets near-on-policy data and has been successfully applied to challenging tasks such as StarCraft II. In this post, we theoretically analyze V-trace and show that, when data is far off-policy, V-trace does not converge to a locally optimal solution, not even when an optimal value function is provided. Finally, we demonstrate that V-trace can learn a locally optimal greedy policy from off-policy data if we mix in a proportion of on-policy data.
V-trace
The V-trace target is defined as
\[\begin{align} v(x_t) &:= V(x_t)+\sum_{k=t}^{t+n-1}\gamma^{k-t}\left(\prod_{i=t}^{k-1}c_i\right)\delta_kV\tag {1}\\\ where\quad \delta_kV&:=\rho_k(r_k+\gamma V(x_{k+1})-V(x_k))\\\ c_{i}&:=\lambda \min\left(\bar c, {\pi(a_i|x_i)\over \mu(a_i|x_i)}\right)\\\ \rho_k&:=\min\left(\bar\rho, {\pi(a_k|x_k)\over \mu(a_k|x_k)}\right) \end{align}\]The advantages of V-trace are that 1) it is a multi-step algorithm, allowing the reward signal to propagate through multiple steps and enabling efficient learning; 2) it reduces the potentially infinite variance of the traditional importance-sampled multi-step target by clipping the importance ratios at \(\bar c\) and \(\bar\rho\). In the next section, we analyze V-trace and conclude that it is only suitable for near-on-policy learning, as the clipped importance sampling introduces bias into the value function.
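For concreteness, here is a minimal NumPy sketch (not from the original paper's codebase; the function name and array layout are my own) of the target in Equation \((1)\), computed with the backward recursion \(v(x_t)=V(x_t)+\delta_tV+\gamma c_t\big(v(x_{t+1})-V(x_{t+1})\big)\) that follows directly from the definition; the horizon is truncated at the end of the trajectory segment.

```python
import numpy as np

def vtrace_targets(rewards, values, next_values, pi_probs, mu_probs,
                   gamma=0.99, lam=1.0, c_bar=1.0, rho_bar=1.0):
    """Return v(x_t) for every step t of a trajectory segment (Equation (1))."""
    ratios = pi_probs / mu_probs
    cs = lam * np.minimum(c_bar, ratios)                        # c_i
    rhos = np.minimum(rho_bar, ratios)                          # rho_k
    deltas = rhos * (rewards + gamma * next_values - values)    # delta_k V

    targets = np.array(values, dtype=np.float64)
    acc = 0.0
    # Backward recursion: acc_t = delta_t + gamma * c_t * acc_{t+1} accumulates
    # sum_{k >= t} gamma^{k-t} (prod_{i=t}^{k-1} c_i) delta_k V.
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        targets[t] += acc
    return targets

# Toy usage with made-up numbers: a 3-step segment, constant value estimates.
print(vtrace_targets(rewards=np.array([1.0, 0.0, 1.0]),
                     values=np.full(3, 0.5), next_values=np.full(3, 0.5),
                     pi_probs=np.array([0.9, 0.2, 0.5]),
                     mu_probs=np.full(3, 0.5)))
```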
Analysis
V-trace Convergence
Denote Equation \((1)\) as the V-trace operator \(\mathcal R\):
\[\begin{align} \mathcal RV(x_t) :=& V(x_t)+\mathbb E_\mu\left[\sum_{k\ge t}\gamma^{k-t}\left(\prod_{i=t}^{k-1}c_i\right)\delta_kV\right]\tag 2\\\ where\quad \delta_kV=&\rho_k(r_k+\gamma V(x_{k+1})-V(x_k)) \end{align}\]where the expectation \(\mathbb E_\mu\) is with respect to the behavior policy \(\mu\). Here we consider the infinite horizon operator but very similar results hold for the n-step truncated operator.
Theorem 1. Let \(c_{i}=\lambda \min\left(\bar c, {\pi(a_i\vert x_i)\over \mu(a_i\vert x_i)}\right)\) and \(\rho_k:=\min\left(\bar\rho, {\pi(a_k\vert x_k)\over \mu(a_k\vert x_k)}\right)\) be the truncated importance sampling weights, with \(\bar\rho\ge \bar c\). Assume that there exists \(\beta\in(0,1]\) such that \(\mathbb E_\mu\rho_t\ge\beta\). Then the operator \(\mathcal R\) has a unique fixed point \(V^{\pi_{\bar\rho}}\), which is the value function of the policy \(\pi_{\bar\rho}\) defined by
\[\begin{align} \pi_{\bar\rho}(a|x)={\min(\bar\rho\mu(a|x),\pi(a|x))\over{\sum_{b\in A}\min(\bar\rho\mu(b|x),\pi(b|x))}}\tag 3 \end{align}\]Furthermore, \(\mathcal R\) is an \(\eta\)-contraction mapping in sup-norm with
\[\begin{align} \eta:=\gamma^{-1}-(\gamma^{-1}-1)\mathbb E_\mu\left[\sum_{k\ge t}\gamma^{k-t}\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]\le 1-(1-\gamma)\beta< 1\tag 4 \end{align}\]Before we prove it, let us look at the roles that \(\bar c\) and \(\bar\rho\) play. \(\bar c\), which appears in the contraction modulus \(\eta\), affects the speed at which V-trace converges to \(V^{\pi_{\bar\rho}}\): a small \(\bar c\) corresponds to lower variance but a worse contraction rate. On the other hand, \(\bar\rho\) shapes the policy \(\pi_{\bar\rho}\) and thus determines the fixed point \(V^{\pi_{\bar\rho}}\). Moreover, \(\bar\rho\) biases the policy, and therefore the value function, whenever \(\bar\rho\mu(a\vert x)<\pi(a\vert x)\), which makes the V-trace operator inappropriate for data that is far off-policy.
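To see this bias numerically, the short sketch below (illustrative probabilities, not from the paper) evaluates Equation \((3)\) for a single state: with the common choice \(\bar\rho=1\), the fixed-point policy \(\pi_{\bar\rho}\) is pulled noticeably toward the behavior policy, while a very large \(\bar\rho\) recovers \(\pi\).

```python
import numpy as np

def pi_rho_bar(pi, mu, rho_bar):
    """Equation (3): the policy whose value function V-trace converges to."""
    clipped = np.minimum(rho_bar * mu, pi)   # min(rho_bar * mu(a|x), pi(a|x))
    return clipped / clipped.sum()

pi = np.array([0.8, 0.15, 0.05])   # target policy pi(.|x)
mu = np.array([0.2, 0.4, 0.4])     # far-off behavior policy mu(.|x)

print(pi_rho_bar(pi, mu, rho_bar=1.0))    # [0.5, 0.375, 0.125]: much flatter than pi
print(pi_rho_bar(pi, mu, rho_bar=100.0))  # [0.8, 0.15, 0.05]: large rho_bar recovers pi
```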
Proof. First notice that we can rewrite Equation \((2)\) as
\[\begin{align} \mathcal R V(x_t)=(1-\mathbb E_\mu\rho_t)V(x_t)+\mathbb E_\mu\left[\sum_{k\ge t}\gamma^{k-t}\left(\prod_{i=t}^{k-1}c_i\right)\big(\rho_kr_k+\gamma(\rho_k-c_k\rho_{k+1})V(x_{k+1})\big)\right] \end{align}\]where the \(-\rho_{k+1}V(x_{k+1})\) term of \(\delta_{k+1}V\) has been folded into the \(k\)-th summand. Therefore, we have
\[\begin{align} \mathcal RV_1(x_{t})-\mathcal RV_2(x_t)&=(1-\mathbb E_\mu\rho_t)(V_1(x_t)-V_2(x_t))+\mathbb E_\mu\left[\sum_{k\ge t}\gamma^{k-t+1}\left(\prod_{i=t}^{k-1}c_i\right)\big((\rho_k-c_k\rho_{k+1})(V_1(x_{k+1})-V_2(x_{k+1}))\big)\right]\\\ &=\mathbb E_\mu\left[\sum_{k\ge t}\gamma^{k-t}\left(\prod_{i=t}^{k-2}c_i\right)\big(\underbrace{(\rho_{k-1}-c_{k-1}\rho_{k})}_{\alpha_k}(V_1(x_{k})-V_2(x_{k}))\big)\right] \end{align}\]with the notation that \(c_{t-1}=\rho_{t-1}=1\) and \(\prod_{i=t}^{k-2}c_i=1\) for \(k=t\) and \(t+1\).
We now show that the coefficients \((\alpha_k)_{k\ge t}\) are non-negative in expectation. Because \(\bar\rho\ge \bar c\) and \(\lambda\le1\), we have \(\rho_{k-1}\ge c_{k-1}\) and hence
\[\begin{align} \mathbb E_\mu\alpha_k=\mathbb E_\mu[\rho_{k-1}-c_{k-1}\rho_k]\ge\mathbb E_\mu[c_{k-1}(1-\rho_k)]\ge 0 \end{align}\]since \(\mathbb E_\mu[\rho_k\vert x_k]\le\mathbb E_\mu\left[{\pi(a_k\vert x_k)\over\mu(a_k\vert x_k)}\,\Big\vert\, x_k\right]=1\). Thus \(\mathcal RV_1(x_{t})-\mathcal RV_2(x_t)\) is a linear combination of the differences \(V_1-V_2\) at other states, weighted by non-negative coefficients whose sum is
\[\begin{align} &\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)(\rho_{k-1}-c_{k-1}\rho_{k})\right]\\\ =&\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]-\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-1}c_i\right)\rho_{k}\right]\\\ &\qquad\color{red}{\text{add }\gamma^{-1}(\mathbb E_{\mu}\rho_{t-1}-1)=0\text{ to the second term and rearrange}}\\\ =&\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]-\gamma^{-1}\left(\sum_{k\ge t}\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]-1\right)\\\ =&\gamma^{-1}-(\gamma^{-1}-1)\sum_{k\ge t}{\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]}=\eta\\\ &\qquad\color{red}{\sum_{k\ge t}{\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]}\ge\sum_{k=t}^{t+1}{\gamma^{k-t}\mathbb E_\mu\left[\left(\prod_{i=t}^{k-2}c_i\right)\rho_{k-1}\right]}=1+\gamma\mathbb E_\mu\rho_t}\\\ &\qquad\color{red}{\gamma<1\text{ and }(\gamma^{-1}-1)>0}\\\ \le&\gamma^{-1}-(\gamma^{-1}-1)(1+\gamma\mathbb E_\mu\rho_t)\\\ =& 1-(1-\gamma)\mathbb E_\mu\rho_t\\\ &\qquad\color{red}{\mathbb E_\mu\rho_t\ge\beta}\\\ \le& 1-(1-\gamma)\beta\\\ &\qquad\color{red}{\beta>0\text{ and }\gamma<1\text{, so }(1-\gamma)\beta>0}\\\ <&1 \end{align}\]We deduce that \(\Vert\mathcal RV_1-\mathcal RV_2\Vert_\infty\le \eta\Vert V_1-V_2\Vert_\infty\), with \(\eta\) defined in Equation \((4)\), so \(\mathcal R\) is a contraction mapping. Thus \(\mathcal R\) possesses a unique fixed point. Let us now prove that this fixed point is \(V^{\pi_{\bar\rho}}\). We have
\[\begin{align} &\mathbb E_\mu[\rho_t(r_t+\gamma V^{\pi_\bar\rho}(x_{t+1})-V^{\pi_\bar\rho}(x_t))|x_t]\\\ =&\sum_a\mu(a|x_t)\min\left(\bar\rho, {\pi(a|x_t)\over \mu(a|x_t)}\right)\left(r_t+\gamma\sum_{x_{t+1}} p(x_{t+1}|x_t,a)V^{\pi_\bar\rho}(x_{t+1})-V^{\pi_\bar\rho}(x_t)\right)\\\ =&\sum_a\left(r_t+\gamma\sum_{x_{t+1}} p(x_{t+1}|x_t,a)V^{\pi_\bar\rho}(x_{t+1})-V^{\pi_\bar\rho}(x_t)\right)\min\left(\bar\rho\mu(a|x_t), {\pi(a|x_t)}\right)\\\ =&\underbrace{\sum_a\pi_{\bar\rho}(a|x_t)\left(r_t+\gamma\sum_{x_{t+1}} p(x_{t+1}|x_t,a)V^{\pi_\bar\rho}(x_{t+1})-V^{\pi_\bar\rho}(x_t)\right)}_{=0,\text{ since }V^{\pi_\bar\rho}\text{ is the value function of }{\pi_\bar\rho}}\sum_{b\in A}\min(\bar\rho\mu(b|x_t),\pi(b|x_t))\\\ =&0 \end{align}\]Therefore \(\mathbb E_\mu[\delta_kV^{\pi_{\bar\rho}}\vert x_k]=0\), hence \(\mathcal RV^{\pi_{\bar\rho}}=V^{\pi_{\bar\rho}}\), i.e., \(V^{\pi_{\bar\rho}}\) is the unique fixed point of \(\mathcal R\).
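As an optional numerical sanity check of this fixed-point argument (not part of the proof; the MDP, rewards, and policies below are randomly generated for illustration), one can build a small tabular MDP, solve for \(V^{\pi_{\bar\rho}}\) exactly, and verify that \(\mathbb E_\mu[\delta_tV^{\pi_{\bar\rho}}\vert x_t]\approx0\) at every state.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, rho_bar = 3, 2, 0.9, 1.0

P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel P[s, a, s']
r = rng.random((S, A))                        # rewards r(s, a)
pi = rng.dirichlet(np.ones(A), size=S)        # target policy pi(a|s)
mu = rng.dirichlet(np.ones(A), size=S)        # behavior policy mu(a|s)

# Equation (3): the biased policy pi_rho_bar.
clipped = np.minimum(rho_bar * mu, pi)
pi_tilde = clipped / clipped.sum(axis=1, keepdims=True)

# Solve for V^{pi_rho_bar} exactly: (I - gamma * P_tilde) V = r_tilde.
P_tilde = np.einsum('sa,sap->sp', pi_tilde, P)
r_tilde = np.einsum('sa,sa->s', pi_tilde, r)
V = np.linalg.solve(np.eye(S) - gamma * P_tilde, r_tilde)

# E_mu[rho_t (r_t + gamma V(x_{t+1}) - V(x_t)) | x_t] should vanish at every state.
rho = np.minimum(rho_bar, pi / mu)
td = r + gamma * P @ V - V[:, None]
print(np.einsum('sa,sa,sa->s', mu, rho, td))  # ~[0, 0, 0] up to floating-point error
```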
V-trace Policy Gradient
In the previous subsection, we showed that the value function learned by the V-trace operator converges to \(V^{\pi_{\bar\rho}}\) rather than \(V^*\). Now we show that, even with the optimal value function \(V^*\), the V-trace policy gradient does not converge to a locally optimal \(\pi^*\) when data is collected by an off-policy behavior policy \(\mu\).
The V-trace policy gradient is
\[\begin{align} \nabla \mathcal J^\mu(\pi)=\mathbb E_{\mu}[\rho_t(r_t+\gamma v(x_{t+1}))\nabla \log\pi(a_t|x_t)]\tag 5 \end{align}\]where we omit the baseline and entropy terms of the original IMPALA policy gradient.
Proposition 1. The V-trace policy gradient is biased: given the optimal value function \(V^*\), the V-trace policy gradient does not converge to a locally optimal \(\pi^*\) for all off-policy behavior distributions \(\mu\).
Proof. We analyze the V-trace policy gradient
\[\begin{align} \nabla \mathcal J^\mu(\pi)=&\mathbb E_{\mu}[\rho_t(r_t+\gamma V^\*(x_{t+1}))\nabla \log\pi(a_t|x_t)]\\\ =&\mathbb E_{\mu}[\rho_tQ^\*(x_t,a_t)\nabla \log\pi(a_t|x_t)]\\\ =&\mathbb E_{\mu}\left[\min\left(\bar\rho,{\pi(a_t|x_t)\over \mu(a_t|x_t)}\right)Q^\*(x_t,a_t)\nabla \log\pi(a_t|x_t)\right]\\\ =&\mathbb E_{\mu}\left[{\pi(a_t|x_t)\over \mu(a_t|x_t)}\min\left(1,\bar\rho{\mu(a_t|x_t)\over \pi(a_t|x_t)}\right)Q^\*(x_t,a_t)\nabla \log\pi(a_t|x_t)\right]\\\ =&\mathbb E_{\pi}\left[\underbrace{\min\left(1,\bar\rho{\mu(a_t|x_t)\over \pi(a_t|x_t)}\right)}_{\omega(x_t,a_t)}Q^\*(x_t,a_t)\nabla \log\pi(a_t|x_t)\right]\\\ =&\mathbb E_{\pi}\left[Q^\omega(x_t,a_t)\nabla \log\pi(a_t|x_t)\right]\tag 6 \end{align}\]where \(Q^\omega(x_t,a_t):=\omega(x_t,a_t)Q^*(x_t,a_t)\). When \(\bar\rho\mu(a_t\vert x_t)< \pi(a_t\vert x_t)\), we have \(\omega(x_t,a_t)<1\) and thus the policy gradient is biased: the V-trace policy gradient penalizes the action values by weighting \(Q^*(x_t,a_t)\) with \(\omega(x_t,a_t)\). From a probabilistic view, this changes the policy from one proportional to \(\exp(Q^*)\) to one proportional to \(\exp(Q^\omega)\). In the greedy case, it may or may not change the optimal policy, depending on how much \(Q^\omega\) is distorted from \(Q^*\): regret is introduced as long as there exists some \(x\) such that \(\arg\max_a Q^\omega(x,a)\ne\arg\max_a Q^*(x,a)\).
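The toy computation below (made-up numbers, two actions at a single state) illustrates how the clipping weight \(\omega\) in Equation \((6)\) can flip the greedy action: when \(\mu\) heavily favors the action that \(\pi\) considers worse, \(Q^\omega\) ranks the actions differently from \(Q^*\).

```python
import numpy as np

q_star = np.array([1.0, 0.9])   # Q*(x, .): action 0 is truly better
pi = np.array([0.9, 0.1])       # target policy favors action 0
mu = np.array([0.1, 0.9])       # far-off behavior policy favors action 1
rho_bar = 1.0

omega = np.minimum(1.0, rho_bar * mu / pi)   # clipping weight omega(x, .)
q_omega = omega * q_star                     # implied Q^omega(x, .)

print(q_omega)                               # [0.111..., 0.9]
print(q_star.argmax(), q_omega.argmax())     # 0 vs. 1: the greedy action flips
```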
Mixing On- and Off-Policy Experiences
In Equation \((6)\), we showed that the V-trace policy gradient is biased when \(\bar\rho\mu(a_t\vert x_t)< \pi(a_t\vert x_t)\). We now show that by mixing the on- and off-policy data, the bias can be reduced. Furthermore, it is possible to select a proportion of on-policy data so that the learned policy is unbiased in the greedy case.
Proposition 2. Mixing on-policy data into the V-trace policy gradient with ratio \(\alpha\) reduces the bias by providing a regularization of the implied state-action values. In the general function approximation case, it changes the off-policy V-trace policy gradient from \(\sum_x d^\mu(x)\mathbb E_\pi[Q^\omega(x, a)\nabla\log\pi(a\vert x)]\) to \(\sum_x\mathbb E_\pi [Q^\alpha(x, a)\nabla\log\pi(a\vert x)]\), where \(Q^\alpha(x,a) = Q(x,a)d^\pi(x)\alpha + Q^\omega(x,a) d^\mu(x)(1-\alpha)\) is a regularized state-action estimate and \(d^\pi\), \(d^\mu\) are the state distributions under \(\pi\) and \(\mu\). Note that there exists \(\alpha\le 1\) such that \(Q^\alpha\) has the same argmax (i.e., best action) as \(Q\).
Proof. Note that the on-policy policy gradient is given by
\[\begin{align} \nabla \mathcal J^\pi(\pi)=\sum_xd^\pi(x)\mathbb E_{\pi}[Q(x,a)\nabla \log\pi(a|x)]\tag 7 \end{align}\]Similarly the off-policy V-trace gradient is given by
\[\begin{align} \nabla \mathcal J^\mu(\pi)=\sum_xd^\mu(x)\mathbb E_{\pi}[\omega(x,a)Q(x,a)\nabla \log\pi(a|x)]\tag 8 \end{align}\]The \(\alpha\)-interpolation of both gradients can be transformed as follows:
\[\begin{align} \alpha\nabla \mathcal J^\pi(\pi)+(1-\alpha)\nabla\mathcal J^\mu(\pi)=&\alpha\sum_xd^\pi(x)\mathbb E_{\pi}[Q(x,a)\nabla \log\pi(a|x)]\\\ &+(1-\alpha)\sum_xd^\mu(x)\mathbb E_{\pi}[\omega(x,a)Q(x,a)\nabla \log\pi(a|x)]\\\ =&\sum_x\mathbb E_\pi\Big[\big(Q(x,a)d^\pi(x)\alpha +Q^\omega(x,a)d^\mu(x)(1-\alpha)\big)\nabla\log\pi(a|x)\Big]\\\ =&\sum_x\mathbb E_\pi [Q^\alpha(x, a)\nabla\log\pi(a|x)] \end{align}\]for \(Q^\alpha(x,a) = Q(x,a)d^\pi(x)\alpha + Q^\omega(x,a) d^\mu(x)(1-\alpha)\).
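Continuing the toy example above (illustrative numbers only, with \(d^\pi(x)=d^\mu(x)\)), the sketch below evaluates \(Q^\alpha\) at a single state and shows that a sufficiently large on-policy ratio \(\alpha\) restores the argmax of \(Q\).

```python
import numpy as np

def q_alpha(q, q_omega, d_pi, d_mu, alpha):
    """Q^alpha(x,a) = Q(x,a) d^pi(x) alpha + Q^omega(x,a) d^mu(x) (1 - alpha)."""
    return alpha * d_pi * q + (1.0 - alpha) * d_mu * q_omega

q = np.array([1.0, 0.9])          # true Q(x, .), best action is 0
q_omega = np.array([0.11, 0.9])   # distorted Q^omega(x, .) from the toy example above
d_pi = d_mu = 0.3                 # assume equal state visitation at x

for alpha in (0.0, 0.5, 0.9):
    print(alpha, q_alpha(q, q_omega, d_pi, d_mu, alpha).argmax())
# alpha = 0.0 and 0.5 still pick action 1; alpha = 0.9 recovers the true best action 0.
```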
Interpretation. We show that in the greedy sense, choosing \(\alpha\) such that
\[\begin{align} {\alpha\over1-\alpha}>\max_{b\notin A^\*}\left[{Q^\omega(x,b)-Q^\omega(x,a^\*)\over Q(x,a^\*)-Q(x,b)}\right]{d^\mu(x)\over d^\pi(x)}\tag 9 \end{align}\]for every \(x\) will result in a policy that produces the same optimal action as the optimal policy, i.e.,
\[\begin{align} \arg\max_a[Q(x,a)]=\arg\max_a[Q^\alpha(x,a)]\quad \forall x \tag 10 \end{align}\]Let \(a^*=\arg\max_a Q(x,a)\) be the best action and \(A^*\) the set of best actions. Equation \((9)\) is equivalent to
\[\begin{align} Q^\alpha(x,a^\*)>Q^\alpha(x,b)\quad\forall b\notin A^\* \end{align}\]By the definition of \(Q^\alpha\), we have
\[\begin{align} Q(x,a^\*)d^\pi(x)\alpha + Q^\omega(x,a^\*) d^\mu(x)(1-\alpha)&>Q(x,b)d^\pi(x)\alpha + Q^\omega(x,b) d^\mu(x)(1-\alpha)\quad\forall b\notin A^\*\\\ \big(Q(x,a^\*)d^\pi(x)-Q(x,b)d^\pi(x)\big)\alpha&>\big(Q^\omega(x,b) d^\mu(x)-Q^\omega(x,a^\*) d^\mu(x)\big)(1-\alpha)\quad\forall b\notin A^\*\\\ {\alpha\over1-\alpha}&>\max_{b\notin A^\*}\left[{Q^\omega(x,b)-Q^\omega(x,a^\*)\over Q(x,a^\*)-Q(x,b)}\right]{d^\mu(x)\over d^\pi(x)} \end{align}\]This inequality provides several interesting observations (a numerical sketch of the threshold follows the list below):
- The larger the action value gap \(Q(x,a^*)-Q(x,b)\) in the real \(Q\)-function, the less on-policy data is required.
- If \(\max_{b\notin A^*}Q^\omega(x,b)-Q^\omega(x,a^*)\) is negative, \(\alpha\) may be as small as zero. That is, when \(Q^\omega\) has the same optimal action as \(Q\), the proportion of on-policy data does not matter.
- Less on-policy data is required if \({d^\mu(x)\over d^\pi(x)}\) is small, i.e., if \(\pi\) visits state \(x\) more often than \(\mu\).
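For concreteness, the sketch below plugs the toy numbers used above into Equation \((9)\) to compute the smallest admissible on-policy ratio at a single state, assuming \(d^\mu(x)=d^\pi(x)\) for simplicity.

```python
import numpy as np

q = np.array([1.0, 0.9])            # Q(x, .), best action a* = 0
q_omega = np.array([0.11, 0.9])     # Q^omega(x, .)
d_mu_over_d_pi = 1.0                # d^mu(x) / d^pi(x)

a_star = q.argmax()
others = [b for b in range(len(q)) if b != a_star]
# Right-hand side of Equation (9): alpha / (1 - alpha) must exceed this threshold.
threshold = max((q_omega[b] - q_omega[a_star]) / (q[a_star] - q[b])
                for b in others) * d_mu_over_d_pi

alpha_min = threshold / (1.0 + threshold)
print(alpha_min)   # ~0.888: at this state, roughly 89% on-policy data is needed
```

This is consistent with the previous sketch, where \(\alpha=0.9\) recovered the true best action while \(\alpha=0.5\) did not.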
It is worth noting that Proposition 2 only says that, by mixing on- and off-policy data, it is possible for V-trace to learn an optimal greedy policy. However, in the case where a stochastic policy is preferred, such guarantees no longer hold.
References
Espeholt, Lasse, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, et al. 2018. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.” Proceedings of the 35th International Conference on Machine Learning (ICML 2018): 2263–84.
Schmitt, Simon, Matteo Hessel, and Karen Simonyan. 2019. “Off-Policy Actor-Critic with Shared Experience Replay.” arXiv preprint.