Introduction
In the previous post, we discussed two distributional reinforcement learning algorithms, QR-DQN and IQN, which model the distribution over returns with a set of quantile values and optimize these values using the Huber quantile regression loss. In this post, we follow the work of Yang et al. 2019, which extends IQN with an additional network that proposes the quantiles. Experiments show that the resulting algorithm, Fully Parameterized Quantile Function (FQF), significantly outperforms previous distributional RL algorithms.
Fully Parameterized Quantile Function
QR-DQN uses a parameterized quantile function to estimate the quantile values at a set of fixed quantiles, and optimizes this function with respect to the Wasserstein distance between the Bellman-updated and current return distributions. IQN later extends QR-DQN to learn a full quantile function that maps probabilities to returns. This dispenses with the fixed set of quantiles and allows us to incorporate prior knowledge of a desired risk-sensitive policy. FQF further automates the quantile proposal process, enabling the agent to adjust the quantiles and adapt its behavior to the environment.
In this section, we discuss the fraction proposal network, which produces a set of probabilities \(\tau_1,\dots,\tau_{N-1}\), and some details that differ from IQN. We refer the reader to our previous post for a more detailed discussion of IQN.
Fraction Proposal Network
Proposition 1. For any continuous, non-decreasing quantile function \(F^{-1}_Z\), define the 1-Wasserstein loss of \(F_Z^{-1}\) and \(F_Z^{-1,\tau}\) by
\[\begin{align} W_1(Z,\tau)=&\sum_{i=0}^{N-1}\int_{\tau_i}^{\tau_{i+1} }|F_Z^{-1}(\omega)-F_Z^{-1}(\hat\tau_i)|d\omega\tag {1}\\\ where\quad\hat\tau_{i}=&{\tau_{i+1}+\tau_{i}\over 2} \end{align}\]\(\partial W_1\over\partial\tau_i\) is given by
\[\begin{align} \forall i\in(0,N),\ {\partial W_1\over\partial \tau_i}=2F_Z^{-1}(\tau_i)-F^{-1}_Z(\hat\tau_i)-F^{-1}_Z(\hat\tau_{i-1})\\\ \tag {2} \end{align}\]Furthermore, \(\forall \tau_{i-1},\tau_{i+1}\in [0, 1]\), \(\tau_{i-1}<\tau_{i+1}\), \(\exists \tau_i\in(\tau_{i-1}, \tau_{i+1})\) s.t. \({\partial W_1\over\partial \tau_i}=0\).
As in QR-DQN, the goal is to minimize the 1-Wasserstein distance; the fraction proposal network aims to propose \(\tau_1,\dots,\tau_{N-1}\) (with \(\tau_0=0\) and \(\tau_N=1\) fixed) that minimize \(W_1\) defined in Equation 1. However, because of the integral, computing \(W_1\) without bias is usually impractical. Equation 2 provides a way to minimize \(W_1\) without computing it. We leave the proof to the Supplementary Materials, but it is worth noting that \(W_1\) is minimized (with zero gradients) when \(\forall i\in(0,N), F_Z^{-1}(\tau_i)={F_Z^{-1}(\hat\tau_i)+F_Z^{-1}(\hat\tau_{i-1})\over 2}\).
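To make this concrete, below is a minimal PyTorch-style sketch of one common way to use Equation 2 in practice; it is not taken from the paper's code, and the function name and tensor shapes are assumptions. The analytic gradient is computed from (detached) quantile values and multiplied by the proposed fractions, so that backpropagating the resulting scalar sends exactly \(\partial W_1\over\partial\tau_i\) into the fraction proposal network via the chain rule.

```python
import torch

def fraction_loss(taus, quantile_values, quantile_hat_values):
    """Surrogate loss whose gradient w.r.t. tau_1..tau_{N-1} equals Equation (2).

    taus:                (batch, N+1) proposed fractions with taus[:, 0]=0, taus[:, -1]=1
    quantile_values:     (batch, N-1) F_Z^{-1}(tau_i) at the inner fractions tau_1..tau_{N-1}
    quantile_hat_values: (batch, N)   F_Z^{-1}(tau_hat_i) at the midpoints tau_hat_0..tau_hat_{N-1}
    """
    # dW_1/dtau_i = 2 F^{-1}(tau_i) - F^{-1}(tau_hat_i) - F^{-1}(tau_hat_{i-1}),  i = 1..N-1.
    # The quantile values are treated as constants here: this loss only updates the proposal network.
    grad_taus = (2 * quantile_values
                 - quantile_hat_values[:, 1:]
                 - quantile_hat_values[:, :-1]).detach()
    # Multiplying the detached analytic gradient by tau_i and summing gives a scalar whose
    # autograd gradient w.r.t. the proposal network parameters is exactly grad_taus.
    return (grad_taus * taus[:, 1:-1]).sum(dim=1).mean()
```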
Implementation Details
In practice, we first obtain probabilities by applying a softmax layer to the observation embedding. This gives us \(N\) probabilities \(q_1, \dots,q_{N}\) that sum to one. Then we compute the cumulative probabilities \(\tau_i=\sum_{j=0}^{i}q_j\) with \(q_0=0\), which yields non-decreasing fractions \(\tau_0,\dots,\tau_N\) with \(\tau_0=0\) and \(\tau_N=1\). As \(W_1\) itself is impractical to compute, we use the gradient \(\partial W_1\over\partial\tau_i\) from Equation 2 and apply the chain rule to backpropagate through the fraction proposal network. In addition, we may add the entropy of \(q\), \(H(q)=-\sum_{i=1}^{N} q_i\log q_i\), as a regularization term to prevent the distribution from degenerating into a deterministic one (i.e., all \(\tau\)s being either \(0\) or \(1\)).
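The following is a minimal sketch of such a fraction proposal network in PyTorch, assuming a standard DQN-style state embedding; the class and argument names (`FractionProposalNetwork`, `embedding_dim`, `n_fractions`) are made up for illustration. It returns the fractions, their midpoints, and the entropy term used for regularization, and pairs with the `fraction_loss` sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FractionProposalNetwork(nn.Module):
    """Maps a state embedding to fractions tau_0=0 <= tau_1 <= ... <= tau_N=1."""

    def __init__(self, embedding_dim, n_fractions):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, n_fractions)

    def forward(self, state_embedding):
        # q_i > 0 and sum_i q_i = 1
        q = F.softmax(self.linear(state_embedding), dim=-1)     # (batch, N)
        # cumulative sums give non-decreasing fractions; prepend tau_0 = 0
        taus_1_to_N = torch.cumsum(q, dim=-1)                   # (batch, N), last entry = 1
        taus = F.pad(taus_1_to_N, (1, 0))                       # (batch, N+1), taus[:, 0] = 0
        # midpoints at which the quantile function is evaluated
        tau_hats = (taus[:, 1:] + taus[:, :-1]) / 2             # (batch, N)
        # entropy regularizer that keeps the proposal away from a degenerate distribution
        entropy = -(q * q.clamp(min=1e-8).log()).sum(dim=-1)    # (batch,)
        return taus, tau_hats, entropy
```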
Quantile Function
In QR-DQN, and in IQN with the identity distortion risk measure, the distribution of the return is approximated by a uniform mixture of Diracs,
\[\begin{align} Z(x,a)={1\over N}\sum_{i=1}^N\delta_{\theta_i(x,a)} \end{align}\]In FQF, because the quantile fractions are neither fixed nor uniformly spaced, we write the distribution as
\[\begin{align} Z_{\theta,\tau}(x, a):=\sum_{i=0}^{N-1}(\tau_{i+1}-\tau_i)\delta_{\theta_i(x,a)} \end{align}\]where \(\tau_1,\dots,\tau_{N-1}\) are the non-decreasing fractions produced by the fraction proposal network, with \(\tau_0=0\) and \(\tau_N=1\).
This also changes the action-value function accordingly
\[\begin{align} Q(x,a)=\sum_{i=0}^{N-1}(\tau_{i+1}-\tau_i){F_Z^{-1}(\hat\tau_i)} \end{align}\]where \(F_Z^{-1}\) is the learned quantile function (see IQN) and \(\hat \tau_i={\tau_i+\tau_{i+1}\over 2}\) is the minimizer of the following 1-Wasserstein metric,
\[\begin{align} \int_{\tau_i}^{\tau_{i+1} }\left|F_Z^{-1}(\omega)-F^{-1}_Z(\hat\tau_i)\right|d\omega \end{align}\]
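As a small illustration of this formula (again a sketch with assumed tensor shapes, not the paper's code), the Q-values can be assembled from the proposed fractions and the quantile values evaluated at the midpoints:

```python
import torch

def q_values(taus, quantile_hat_values):
    """Q(x, a) = sum_i (tau_{i+1} - tau_i) * F_Z^{-1}(tau_hat_i), per action.

    taus:                (batch, N+1)          proposed fractions, taus[:, 0]=0, taus[:, -1]=1
    quantile_hat_values: (batch, N, n_actions) quantile values at the midpoints tau_hat_i
    """
    weights = (taus[:, 1:] - taus[:, :-1]).unsqueeze(-1)   # (batch, N, 1) interval lengths
    return (weights * quantile_hat_values).sum(dim=1)      # (batch, n_actions)
```

References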
Yang, Derek, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tieyan Liu. 2019. “Fully Parameterized Quantile Function for Distributional Reinforcement Learning.” NeurIPS. http://arxiv.org/abs/1911.02140.
Supplementary Materials
Proof of Proposition 1
For any continuous, non-decreasing quantile function \(F^{-1}_Z\), define the 1-Wasserstein loss of \(F_Z^{-1}\) and \(F_Z^{-1,\tau}\) by
\[\begin{align} W_1(Z,\tau)=\sum_{i=0}^{N-1}\int_{\tau_i}^{\tau_{i+1} }|F_Z^{-1}(\omega)-F_Z^{-1}(\hat\tau_i)|d\omega \end{align}\]\(\partial W_1\over\partial\tau_i\) is given by
\[\begin{align} \forall i\in(0,N),\ {\partial W_1\over\partial \tau_i}=2F_Z^{-1}(\tau_i)-F^{-1}_Z(\hat\tau_i)-F^{-1}_Z(\hat\tau_{i-1})\\\ where\ \hat\tau_{i}={\tau_{i+1}+\tau_{i}\over 2} \end{align}\]Furthermore, \(\forall \tau_{i-1},\tau_{i+1}\in [0, 1]\), \(\tau_{i-1}<\tau_{i+1}\), \(\exists \tau_i\in(\tau_{i-1}, \tau_{i+1})\) s.t. \({\partial W_1\over\partial \tau_i}=0\).
Proof: Note that \(F_Z^{-1}\) is non-decreasing. We have
\[\begin{align} {\partial W_1\over\partial{\tau_i} } &={\partial\over\partial\tau_i}\left(\int^{\tau_i}_{\tau_{i-1} }|F^{-1}_Z(\omega)-F^{-1}_Z(\hat\tau_{i-1})|d\omega+\int^{\tau_{i+1} }_{\tau_i}|F^{-1}_Z(\omega)-F^{-1}_Z(\hat\tau_i)|d\omega\right)\\\ &={\partial\over\partial\tau_i}\left(\int^{\hat\tau_{i-1} }_{\tau_{i-1} }\underbrace{F^{-1}_Z(\hat\tau_{i-1})}_{1}-\underbrace{F^{-1}_Z(\omega)d\omega}_2 +\int^{\tau_i}_{\hat \tau_{i-1} }\underbrace{F^{-1}_Z(\omega)}_3-\underbrace{F^{-1}_Z(\hat\tau_{i-1})d\omega}_4 +\int^{\tau_{i+1} }_{\tau_i}|F^{-1}_Z(\omega)-F^{-1}_Z(\hat\tau_i)|d\omega\right)\\\ &=\underbrace{ {1\over 2}F_{Z}^{-1}(\hat\tau_{i-1})+{\tau_i-\tau_{i-1}\over 4}{ {\partial\over\partial\tau_i} }F^{-1}_Z(\hat\tau_{i-1})}_1 -\underbrace{ {1\over 2}F_Z^{-1}(\hat\tau_{i-1})}_{2}\\\ &\quad\quad+\underbrace{F_Z^{-1}(\tau_i)-{1\over 2}F_Z^{-1}(\hat\tau_{i-1})}_3 -\underbrace{\left({1\over 2}F_{Z}^{-1}(\hat\tau_{i-1})+{\tau_i-\tau_{i-1}\over 4}{\partial\over\partial\tau_i}F^{-1}_Z(\hat\tau_{i-1})\right)}_4\\\ &\quad\quad +{\partial\over\partial\tau_i}\int^{\tau_{i+1} }_{\tau_i}|F^{-1}_Z(\omega)-F^{-1}_Z(\hat\tau_i)|d\omega\\\ &=F_{Z}^{-1}(\tau_i)-F_{Z}^{-1}(\hat\tau_{i-1})+{\partial\over\partial\tau_i}\int^{\tau_{i+1} }_{\tau_i}|F^{-1}_Z(\omega)-F^{-1}_Z(\hat\tau_i)|d\omega\\\ &=F_{Z}^{-1}(\tau_i)-F_{Z}^{-1}(\hat\tau_{i-1}) +F_{Z}^{-1}(\tau_i)-F_{Z}^{-1}(\hat\tau_i)\\\ &=2F_{Z}^{-1}(\tau_i)-F_{Z}^{-1}(\hat\tau_{i-1})-F_{Z}^{-1}(\hat\tau_i) \end{align}\]where in the penultimate step, we split the remaining integral into \([\tau_i,\hat\tau_i]\) and \([\hat\tau_{i},\tau_{i+1}]\) and follow the same process used to compute \({\partial\over\partial\tau_i}\int^{\tau_i}_{\tau_{i-1} }\vert F^{-1}_Z(\omega)-F^{-1}_Z(\hat\tau_{i-1})\vert d\omega\).
As \(F_Z^{-1}\) is non-decreasing, we have \({\partial W_1\over\partial\tau_i}\big\vert_{\tau_i=\tau_{i-1} }\le0\) and \({\partial W_1\over\partial \tau_i}\big\vert_{\tau_i=\tau_{i+1} }\ge0\). Moreover, because \(F_Z^{-1}\) is continuous, \(\exists \tau_i\in(\tau_{i-1},\tau_{i+1})\) s.t. \({\partial W_1\over\partial\tau_i}=0\).
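As a quick numerical sanity check of Proposition 1 (purely illustrative, not part of the original derivation), one can pick an arbitrary continuous non-decreasing quantile function, approximate \(W_1\) by a Riemann sum, and compare a central finite-difference estimate of \(\partial W_1\over\partial\tau_i\) with Equation 2; the choice of \(F_Z^{-1}\) and the fractions below are arbitrary.

```python
import numpy as np

def F_inv(w):
    # any continuous, non-decreasing quantile function on [0, 1] works for the check
    return np.sqrt(w)

def W1(taus, n_grid=200_000):
    """Approximate Equation (1) by a Riemann sum over each interval [tau_i, tau_{i+1}]."""
    total = 0.0
    for lo, hi in zip(taus[:-1], taus[1:]):
        omega = np.linspace(lo, hi, n_grid)
        total += np.abs(F_inv(omega) - F_inv((lo + hi) / 2)).mean() * (hi - lo)
    return total

taus = np.array([0.0, 0.3, 0.55, 0.8, 1.0])   # tau_0 = 0, ..., tau_N = 1
i, eps = 2, 1e-4

# analytic gradient from Equation (2)
analytic = (2 * F_inv(taus[i])
            - F_inv((taus[i] + taus[i + 1]) / 2)
            - F_inv((taus[i - 1] + taus[i]) / 2))

# central finite-difference estimate of dW_1 / dtau_i
tau_plus, tau_minus = taus.copy(), taus.copy()
tau_plus[i] += eps
tau_minus[i] -= eps
numeric = (W1(tau_plus) - W1(tau_minus)) / (2 * eps)

print(analytic, numeric)   # the two estimates should agree closely
```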