Optimal Policy and Value Functions for KL-Regularized RL

We start with the following KL-regularized RL objective:

\[\begin{align} \mathcal J(\pi)=\mathbb E_\pi\left[\sum_{t=0}^\infty \gamma^t r_t-\alpha\gamma^t KL_t\right]\tag 1 \end{align}\]

where \(KL_t=\log{\pi(a_t\vert s_t)\over\pi_0(a_t\vert s_t)}\), in which \(\pi\) is the policy we aim to optimize and \(\pi_0\) is a policy prior. One may also make the KL term action-independent, i.e., \(KL_t=D_{KL}(\pi(\cdot\vert s_t)\Vert\pi_0(\cdot\vert s_t))\), which has the advantage of always ensuring the non-negativity of \(KL_t\).
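For concreteness, here is a minimal sketch (PyTorch, with illustrative tensor and function names that are my assumptions, not part of the text) of the two choices of \(KL_t\) for a discrete policy:

```python
import torch
import torch.nn.functional as F

def kl_terms(logits, prior_logits, actions):
    """Two variants of the KL penalty for a discrete policy.

    logits, prior_logits: [batch, num_actions] unnormalized log-probabilities
    actions: [batch] sampled actions a_t ~ pi(.|s_t)
    """
    logp = F.log_softmax(logits, dim=-1)          # log pi(.|s)
    logp0 = F.log_softmax(prior_logits, dim=-1)   # log pi_0(.|s)

    # Per-sample estimate: KL_t = log pi(a_t|s_t) - log pi_0(a_t|s_t).
    # Unbiased, but individual samples can be negative.
    kl_sample = (logp - logp0).gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Action-independent version: D_KL(pi(.|s_t) || pi_0(.|s_t)), always >= 0.
    kl_analytic = (logp.exp() * (logp - logp0)).sum(-1)
    return kl_sample, kl_analytic
```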

Define the state-value function as the expected return:

\[\begin{align} V^\pi(s_t)=\mathbb E_\pi\left[\sum_{t'\ge t} \gamma^{t'-t}\Big( r_{t'}-\alpha KL_{t'}\Big)\right]\tag 2 \end{align}\]

Then we shall have the \(Q\)-function as

\[\begin{align} Q^\pi(s_t, a_t)&=\mathbb E[r(s_t,a_t)]+\mathbb E_\pi\left[\sum_{t'> t} \gamma^{t'-t}\Big( r_{t'}-\alpha KL_{t'}\Big)\right]\tag 3\\\ &=\mathbb E[r(s_t,a_t)+\gamma V^\pi(s_{t+1})]\tag 4 \end{align}\]

Note that this \(Q\)-function does not include the KL term at time step \(t\). This leads to the following relationship between \(Q^\pi\) and \(V^\pi\)

\[\begin{align} V^\pi(s_t)=\mathbb E_{a\sim\pi}[Q^\pi(s_t,a)-\alpha KL_t]\tag 5 \end{align}\]

We now derive the optimal policy for Equation \((1)\), \(\pi^*(a_t\vert s_t)\) as follows

\[\begin{align} \pi^\*(a_t|s_t)&=\arg\max_{\pi}\mathbb E_\pi\left[\sum_{t=0}^\infty \gamma^t r_t-\alpha\gamma^t KL_t\right]\\\ &=\arg\max_{\pi}\mathbb E_\pi\left[Q^\pi(s_t,a_t)-\alpha {KL}_t\right]\\\ &=\arg\max_{\pi}\alpha\mathbb E_\pi\left[\log \exp \left({1\over\alpha}Q^\pi(s_t,a_t)\right)-\log\pi(a_t|s_t)+\log\pi_0(a_t|s_t)\right]\\\ &=\arg\max_{\pi}\alpha\mathbb E_\pi\left[\log {\pi_0(a_t|s_t)\exp \left({1\over\alpha}Q^\pi(s_t,a_t)\right)\over\pi(a_t|s_t)}\right]\\\ &=\arg\max_{\pi}\alpha\mathbb E_\pi\left[\log { {1\over Z^\pi}\pi_0(a_t|s_t)\exp \left({1\over\alpha}Q^\pi(s_t,a_t)\right)\over\pi(a_t|s_t)}\right]+\alpha\log Z^\pi\\\ &\qquad\qquad\qquad\qquad\qquad\qquad \text{where }Z^\pi=\mathbb E_{a\sim\pi_0}\left[\exp \left({1\over\alpha}Q^\pi(s_t,a)\right)\right]\\\ &=\arg\max_\pi-\alpha D_{KL}\left(\pi(a_t|s_t)\Big\Vert \pi_0(a_t|s_t)\exp\left({1\over\alpha}\Big(Q^\pi(s_t,a_t)-\alpha\log Z^\pi\Big)\right)\right)+\alpha\log Z^\pi\\\ &=\pi_0(a_t|s_t)\exp\left({1\over\alpha}\Big(Q^\*(s_t,a_t)-\alpha\log Z^\*\Big)\right)\qquad\color{red}{\text{since the argmax is attained when }D_{KL}=0}\\\ \end{align}\]

Plugging \(\pi^\*\) back into Equation \((5)\) gives

\[\begin{align} V^\*(s_t)=\mathcal J(\pi^\*)=\alpha\log Z^\*=\alpha\log\mathbb E_{a\sim\pi_0}\left[\exp\left({1\over\alpha}Q^\*(s_t,a)\right)\right]\tag 6 \end{align}\]

so that

\[\begin{align} Q^\*(s_t,a_t)&=\mathbb E[r_t+\gamma V^\*(s_{t+1})]\tag 7\\\ \pi^\*(a_t|s_t)&=\pi_0(a_t|s_t)\exp\left({1\over\alpha}(Q^\*(s_t,a_t)-V^\*(s_t))\right)\\\ &=\pi_0(a_t|s_t)\exp\left({1\over\alpha}A^\*(s_t,a_t)\right)\tag 8 \end{align}\]
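To see why Equation \((6)\) holds, substitute \(\pi^*(a|s_t)=\pi_0(a|s_t)\exp\big({1\over\alpha}(Q^*(s_t,a)-\alpha\log Z^*)\big)\) into Equation \((5)\): the \(Q\) terms cancel against the log-ratio,

\[\begin{align} V^\*(s_t)=\mathbb E_{a\sim\pi^\*}\left[Q^\*(s_t,a)-\alpha\log{\pi^\*(a|s_t)\over\pi_0(a|s_t)}\right]=\mathbb E_{a\sim\pi^\*}\left[Q^\*(s_t,a)-\Big(Q^\*(s_t,a)-\alpha\log Z^\*\Big)\right]=\alpha\log Z^\* \end{align}\]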

In the following section, we derive several algorithms based on Equations \((5-8)\).

Algorithms

Soft Q-Learning

Naive Version

From Equations \((6-7)\), we obtain the loss for soft Q-learning

\[\begin{align} \mathcal L_Q(\theta)=\mathbb E_{s,a,s'}\left[\left(Q(s,a;\theta)-\left(r+\gamma\alpha\log\sum_{a'}\pi_0(a'|s')\exp\left({1\over \alpha}Q(s',a')\right)\right)\right)^2\right]\tag 9 \end{align}\]

where \(Q\) is parameterized by \(\theta\). We intentionally omit \(\theta\) in the target value to indicate that no gradient is propagated through it.
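A minimal sketch of Equation \((9)\) for a discrete action space follows (PyTorch; the network and batch names are assumptions, and the `(1 - done)` termination mask is a standard addition not written in Equation \((9)\)):

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_net, prior_net, batch, alpha, gamma):
    """Soft Q-learning loss of Equation (9) for discrete actions.

    batch: dict of tensors s, a, r, s2 (next state), done.
    """
    q = q_net(batch["s"]).gather(-1, batch["a"].unsqueeze(-1)).squeeze(-1)

    with torch.no_grad():  # no gradient through the target, as noted above
        q_next = q_net(batch["s2"])                                 # [B, A]
        log_prior = F.log_softmax(prior_net(batch["s2"]), dim=-1)   # log pi_0(.|s')
        # alpha * log sum_a' pi_0(a'|s') exp(Q(s',a') / alpha)
        v_next = alpha * torch.logsumexp(log_prior + q_next / alpha, dim=-1)
        target = batch["r"] + gamma * (1.0 - batch["done"]) * v_next

    return F.mse_loss(q, target)
```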

Dueling Version

In the following discussion, we use the framework of entropy-regularized RL, where the prior is assumed to be uniform and therefore ignored. We discuss this further later.

Schulman et al. 2017 parameterize the \(Q\)-function as a dueling network: \(Q(s,a;\theta)=V(s;\theta)+\alpha\log\pi(a\vert s;\theta)\), where \(\alpha\log\pi(a\vert s;\theta)\) is the advantage term. One problem with this architecture is that \(\alpha\) is usually around \(0.01\), making the gradients that flow into the advantage stream much smaller than those that flow into the value stream. Schulman et al. 2017 therefore scale the gradients entering the advantage stream by \(1\over\gamma\) (likely a typo; \(1\over \alpha\) seems more sensible) and scale the gradients entering the value stream by \(c=0.5\). This matches the gradient scale in traditional on-policy algorithms, where the loss is \(\mathcal L=\mathcal L_{policy}+c\mathcal L_{value}\) with \(c=0.5\). On the other hand, this also means that the two streams require separate optimizers, introducing separate backward passes. A sketch of this parameterization is given below.
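The sketch below (layer sizes and names are illustrative assumptions) realizes the per-stream gradient scaling with a gradient-scaling identity inside a single network, as an alternative to the separate-optimizer setup described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scale_gradient(x, scale):
    # Identity in the forward pass; multiplies the backward gradient by `scale`.
    return x * scale + x.detach() * (1.0 - scale)

class DuelingSoftQ(nn.Module):
    """Q(s,a;theta) = V(s;theta) + alpha * log pi(a|s;theta)."""

    def __init__(self, obs_dim, num_actions, alpha, hidden=256, value_scale=0.5):
        super().__init__()
        self.alpha = alpha
        self.value_scale = value_scale
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)  # advantage stream
        self.value_head = nn.Linear(hidden, 1)              # value stream

    def forward(self, obs):
        h = self.torso(obs)
        log_pi = F.log_softmax(self.policy_head(h), dim=-1)
        v = self.value_head(h)
        # Rescale per-stream gradients: ~1/alpha into the advantage stream,
        # value_scale (e.g. 0.5) into the value stream.
        q = scale_gradient(v, self.value_scale) \
            + self.alpha * scale_gradient(log_pi, 1.0 / self.alpha)
        return q, log_pi, v.squeeze(-1)
```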

A similar dueling architecture was also adopted by O’Donoghue et al. 2017. Different from the work of Schulman et al. 2017, they 1) use the canonical definition of the value function, \(V(s_t)=\sum_{a_t}\pi(a_t\vert s_t)Q(s_t,a_t)\), and regularize the policy with an additional entropy term; 2) put a constraint on the policy so that the probabilities sum to \(1\). Together, 1) and 2) result in \(Q(s, a)=\alpha(\log\pi(a\vert s)+\mathcal H(s))+V(s)\). I admit this one is confounding, but the derivation seems sound; the question boils down to whether we should add the policy constraint, which is already guaranteed by the softmax output of the policy network. Finally, they 3) update the network with an on-policy method (e.g., A3C) and periodically optimize the \(Q\)-learning loss with replayed experiences.

Soft Actor Critic

Similarly, we derive the losses for soft actor critic from Equations \((5),(7),(8)\)

\[\begin{align} \mathcal L_Q(\theta)&=\mathbb E_{s,a,s'}\left[\left(Q(s,a;\theta)-\left(r+\gamma\mathbb E_{a'\sim\pi(\cdot|s')}[Q(s',a')-\alpha KL']\right)\right)^2\right]\tag{10}\\\ \mathcal L_\pi(\phi)&=\mathbb E_s[D_{KL}(\pi(\cdot|s;\phi)\Vert\pi^\*(\cdot|s))]\\\ &=\mathbb E_s\left[\mathbb E_{a\sim\pi(\cdot|s;\phi)}\left[\log\pi(a|s;\phi)-\log\pi_0(a|s)-{1\over\alpha}A(s,a)\right]\right]\\\ &=\mathbb E_s\left[\mathbb E_{a\sim\pi(\cdot|s;\phi)}\left[\alpha\log\pi(a|s;\phi)-\alpha\log\pi_0(a|s)-A(s,a)\right]\right]\tag{11} \end{align}\]

When the action space is continuous, we apply the reparameterization trick, and \(A(s,a)\) in Equation \((11)\) can be replaced by \(Q(s,a)\) since \(V(s)\) does not depend on \(a\). When the action space is discrete, we compute \(\mathbb E_{a\sim\pi(\cdot\vert s;\phi)}[\cdot]\) analytically and propagate the gradient through \(\pi(\phi)\). Things become trickier when the policy and value function share some of their parameters, in which case we have to balance the weights of the different losses. In most cases, we find that simply stopping the gradients of the policy loss from propagating into the shared parameters works just fine. A sketch for the discrete case is given below.
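Below is a minimal sketch of Equations \((10)\) and \((11)\) for a discrete action space, computing the expectation over actions analytically; all network and batch names are illustrative, and the continuous-action case would use the reparameterization trick instead:

```python
import torch
import torch.nn.functional as F

def sac_losses(q_net, policy_net, prior_net, batch, alpha, gamma):
    """Discrete-action soft actor-critic losses, Equations (10) and (11)."""
    log_pi = F.log_softmax(policy_net(batch["s"]), dim=-1)     # log pi(.|s; phi)
    log_pi0 = F.log_softmax(prior_net(batch["s"]), dim=-1)     # log pi_0(.|s)

    # Critic loss, Equation (10); no gradient through the target.
    q = q_net(batch["s"]).gather(-1, batch["a"].unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        log_pi_n = F.log_softmax(policy_net(batch["s2"]), dim=-1)
        log_pi0_n = F.log_softmax(prior_net(batch["s2"]), dim=-1)
        q_n = q_net(batch["s2"])
        # E_{a'~pi}[Q(s',a') - alpha * KL'], computed analytically over actions
        v_n = (log_pi_n.exp() * (q_n - alpha * (log_pi_n - log_pi0_n))).sum(-1)
        target = batch["r"] + gamma * (1.0 - batch["done"]) * v_n
    critic_loss = F.mse_loss(q, target)

    # Policy loss, Equation (11); A(s,a) is replaced by Q(s,a) since the
    # baseline V(s) does not change the gradient of the analytic expectation.
    q_all = q_net(batch["s"]).detach()
    actor_loss = (log_pi.exp() *
                  (alpha * log_pi - alpha * log_pi0.detach() - q_all)).sum(-1).mean()
    return critic_loss, actor_loss
```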

Soft Policy Gradient

The value loss for a policy gradient method is

\[\begin{align} \mathcal L_V(\psi)&=\mathbb E_{s}\left[\left(V(s;\psi)-y_s\right)^2\right]\tag{12} \end{align}\]

where \(y_s\) could be the \(n\)-step return or the \(TD(\lambda)\) target, with the reward function defined as \(\hat r(s_t,a_t)=r(s_t,a_t)-\alpha KL_t\).

The policy objective at \(s_t, a_t\) is

\[\begin{align} \mathcal J_\pi(\phi)&=\mathbb E_\pi\left[\sum_{t'\ge t}^\infty \gamma^{t'-t} (r_{t'}-\alpha KL_{t'})\right]\\\ &=\mathbb E_\pi\left[r_{t}+\sum_{t'> t}^\infty \gamma^{t'-t} \big(r_{t'}-\alpha {KL}_{t'}\big)\right]-\alpha D_{KL}(\pi(\cdot|s_t;\phi)\Vert \pi_0(\cdot|s_t))\\\ &=\mathbb E_\pi[Q^\pi(s_t,a_t)]-\alpha D_{KL}(\pi(\cdot|s_t;\phi)\Vert \pi_0(\cdot|s_t))\\\ &\qquad\color{red}{\text{subtract baseline }V^\pi(s_t)}\\\ &=\mathbb E_\pi[Q^\pi(s_t,a_t)-V^\pi(s_t)]-\alpha D_{KL}(\pi(\cdot|s_t;\phi)\Vert \pi_0(\cdot|s_t))\\\ &=\mathbb E_\pi[A^\pi(s_t,a_t)]-\alpha D_{KL}(\pi(\cdot|s_t;\phi)\Vert \pi_0(\cdot|s_t))\tag{13} \end{align}\]

Here we add superscript \(\pi\) to value functions to indicate that they are directly computed from on-policy data. This gives us the following gradient estimate

\[\begin{align} g_\pi(\phi)=\mathbb E_\pi[A^\pi(s_t,a_t)\nabla_\phi\log\pi(a_t|s_t;\phi)]-\alpha\nabla_\phi D_{KL}(\pi(\cdot|s_t;\phi)\Vert\pi_0(\cdot|s_t)) \end{align}\]

In practice, \(A^\pi\) could be the GAE estimator with the KL-regularized reward function (see the Multi-Step Backup section below). It is easy to see that Equation \((13)\) is simply the negative of Equation \((11)\) under a different expectation; both optimize Equation \((1)\). A sketch of the resulting losses is given below.
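Here is a minimal sketch of the corresponding surrogate losses (Equations \((12)\) and \((13)\)); `advantages` and `returns` are assumed to be precomputed with the KL-regularized reward, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_pg_losses(policy_net, prior_net, value_net, batch, advantages, returns, alpha):
    """Soft policy-gradient losses: the negated objective of Equation (13)
    plus the value loss of Equation (12), for discrete actions."""
    log_pi = F.log_softmax(policy_net(batch["s"]), dim=-1)
    log_pi0 = F.log_softmax(prior_net(batch["s"]), dim=-1).detach()
    logp_a = log_pi.gather(-1, batch["a"].unsqueeze(-1)).squeeze(-1)

    # Score-function term; the advantage estimate is treated as a constant.
    pg_loss = -(advantages.detach() * logp_a).mean()

    # Analytic KL penalty, differentiated directly through pi(.|s; phi).
    kl = (log_pi.exp() * (log_pi - log_pi0)).sum(-1).mean()
    policy_loss = pg_loss + alpha * kl

    # Value loss of Equation (12); `returns` is the KL-regularized target y_s.
    value_loss = F.mse_loss(value_net(batch["s"]).squeeze(-1), returns)
    return policy_loss, value_loss
```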

Learning Prior

The previous algorithms all assume that a fixed policy prior is given. It turns out that such prior knowledge is not necessary: the prior can be learned in practice. Several algorithms exploit this.

MIRL (Grau-Moya et al. 2019) learns a state-independent action prior as a slow moving average of the current policy and uses the prior together with the \(Q\)-function to compute a behavior policy \(\pi\) via Equation \((8)\), as sketched below.
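A rough sketch of this scheme for discrete actions; the averaging rate `tau` and the batch-level marginalization are my assumptions rather than details from the paper:

```python
import torch

def mirl_step(prior_probs, q_values, alpha, tau=0.01):
    """One MIRL-style update of a state-independent action prior.

    prior_probs: [A] current prior pi_0 over actions.
    q_values:    [B, A] Q(s, a) for a batch of states.
    """
    # Behavior policy from Equation (8): pi(a|s) proportional to pi_0(a) * exp(Q(s,a)/alpha).
    logits = torch.log(prior_probs).unsqueeze(0) + q_values / alpha
    pi = torch.softmax(logits, dim=-1)                     # [B, A]

    # Move the prior slowly toward the marginal of the current policy.
    new_prior = (1.0 - tau) * prior_probs + tau * pi.mean(0)
    return pi, new_prior
```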

MPO optimizes \(Q\) and \(\pi\) alternately, following the EM framework. In particular, Abdolmaleki et al. 2018ab find it better 1) to compute a sample-based policy \(\pi\) and learn \(\pi_0\) by minimizing the KL divergence between \(\pi\) and \(\pi_0\), and 2) to replace the soft KL constraint in Equation \((10)\) with a hard constraint.

Galashov et al. 2019 propose learning a policy prior \(\pi_0\) from limited information that omits task-specific information (information asymmetry). Such a policy prior, learned by minimizing the KL divergence between \(\pi\) and \(\pi_0\), is expected to capture meaningful and consistent behavior in the environment without knowledge of any particular task.

More recently, Tirumala et al. 2020 study the effect of behavior priors learned from asymmetric information in multi-task and transfer learning and extend these ideas to latent variable models to learn a hierarchical prior.

On-Policy and Off-Policy Differences

The fact that SQL and SAC are off-policy algorithms while SPG is an on-policy algorithm brings several subtleties.

  • To make use of partial trajectories, SPG uses GAE or V-trace, while SQL and SAC use the Retrace algorithm.
  • Because \(A^\pi(s_t,a_t)\) in Equation \((13)\) is computed from trajectories, we need to resort to the score function estimator to compute the gradient. In SQL and SAC, we can analytically compute Equation \((11)\) and directly differentiate through \(\pi\) since \(A(s,a)\) is computed using \(Q\)-function. One could also estimate \(A(s,a)\) from trajectories, using the Retrace algorithm for example, and adopt the score function estimator for SQL and SAC.
  • When memory (e.g., a recurrent network) is involved, we need to account for the shift of the initial state for SQL and SAC.

Multi-Step Backup

Denote \(n\)-step backup operator by \(\mathcal T^{\pi,n}\). We have

\[\begin{align} \mathcal T^{\pi, n}V(s_t)&=\mathbb E_{a_t\sim\pi}\left[\mathcal T^{\pi,n}Q(s_t,a_t)-\alpha KL_t\right]\\\ &=\mathbb E\left[\sum_{i=0}^{n-1}\gamma^i(r_{t+i}-\alpha KL_{t+i})+\gamma^n\mathbb E_{a\sim\pi}\Big[Q(s_{t+n},a)-\alpha KL_{t+n}\Big]\right]\\\ &=\mathbb E\left[\sum_{i=0}^{n-1}\gamma^i(r_{t+i}-\alpha KL_{t+i})+\gamma^n V(s_{t+n})\right]\\\ \end{align}\]

Because multi-step advantages are widely used in the sequential setting (e.g., GAE involves multi-step advantages and Retrace involves multi-step TD errors), we further derive the multi-step advantage

\[\begin{align} A^n(s_t,a_t)&=\mathcal T^{\pi,n}Q(s_t,a_t)-V(s_t)\\\ &=r(s_t,a_t)+\sum_{i=1}^{n-1}\gamma^i(r_{t+i}-\alpha KL_{t+i})+\gamma^n V(s_{t+n})-V(s_t)\\\ &=\sum_{i=0}^{n-1}\gamma^i(r_{t+i}-\alpha KL_{t+i})+\gamma^n V(s_{t+n})-V(s_t)+\alpha KL_t\\\ &=\sum_{i=0}^{n-1}\gamma^i\delta_{t+i}+\alpha KL_t\\\ \quad\text{where }\delta_{t+i}&=r_{t+i}-\alpha KL_{t+i}+\gamma V(s_{t+i+1})-V(s_{t+i}) \end{align}\]

This gives us the following GAE estimator

\[\begin{align} A^{GAE}(s_t,a_t)&=(1-\lambda)\sum_{n=1}^\infty\lambda^{n-1}A^n(s_t,a_t)\\\ &=(1-\lambda)\sum_{n=1}^\infty\lambda^{n-1}\left(\alpha KL_t+\sum_{i=0}^{n-1}\gamma^i\delta_{t+i}\right)\\\ &=\alpha KL_t+\sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i} \end{align}\]

Following a similar process, one can derive the Retrace target.
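As an illustration, here is a minimal sketch of the KL-regularized GAE derived above as a backward recursion over a single trajectory (array names and the bootstrap value are assumptions):

```python
import numpy as np

def kl_regularized_gae(rewards, kls, values, last_value, alpha, gamma, lam):
    """A^GAE(s_t, a_t) = alpha*KL_t + sum_i (gamma*lam)^i delta_{t+i}.

    rewards, kls, values: length-T arrays for one trajectory.
    last_value: bootstrap V(s_T) for the state after the last transition.
    """
    T = len(rewards)
    next_values = np.append(values[1:], last_value)
    # delta_t = r_t - alpha*KL_t + gamma*V(s_{t+1}) - V(s_t)
    deltas = rewards - alpha * kls + gamma * next_values - values

    advantages = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    # Add back alpha*KL_t, which sits outside the discounted sum of deltas.
    return advantages + alpha * kls
```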

Entropy Regularized RL

Entropy-regularized RL can be obtained by taking the prior to be uniform, in which case it contributes only a constant and can be ignored. However, there is a fundamental difference between KL-regularized and entropy-regularized RL: the KL-regularized objective subtracts a non-negative penalty from the return, while the entropy-regularized one adds a non-negative bonus. This affects the agent’s will to live: a per-step bonus encourages prolonging episodes, whereas a per-step penalty encourages ending them.
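Concretely, with a uniform prior over \(|\mathcal A|\) discrete actions the KL term reduces to a negative entropy plus a constant:

\[\begin{align} D_{KL}(\pi(\cdot|s_t)\Vert\mathcal U(\mathcal A))=\sum_a\pi(a|s_t)\log{\pi(a|s_t)\over 1/|\mathcal A|}=-\mathcal H(\pi(\cdot|s_t))+\log|\mathcal A| \end{align}\]

Dropping the constant \(\log|\mathcal A|\) and flipping the sign recovers the usual entropy bonus, which is why the uniform prior can be ignored.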

References

Schulman, John, Xi Chen, and Pieter Abbeel. 2017. “Equivalence Between Policy Gradients and Soft Q-Learning,” 1–15.

O’Donoghue, Brendan, Rémi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. 2017. “Combining Policy Gradient and Q-Learning.” 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 1–15.

Abdolmaleki, Abbas, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. 2018. “Maximum a Posteriori Policy Optimisation.”

Abdolmaleki, Abbas, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin Riedmiller. 2018. “Relative Entropy Regularized Policy Iteration.” ArXiv.

Grau-Moya, Jordi, Felix Leibfried, and Peter Vrancx. 2019. “Soft Q-Learning with Mutual-Information Regularization.” ICLR 2019, 1–9.

Galashov, Alexandre, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, and Nicolas Heess. 2019. “Information Asymmetry in KL-Regularized RL.” ArXiv, 1–25.

Tirumala, Dhruva, Alexandre Galashov, Hyeonwoo Noh, Leonard Hasenclever, Razvan Pascanu, Jonathan Schwarz, Guillaume Desjardins, et al. 2020. “Behavior Priors for Efficient Reinforcement Learning,” 1–58. http://arxiv.org/abs/2010.14274.

Supplementary Materials

Soft \(Q\) loss and Soft Policy Gradient

Following Schulman et al. 2017, we demonstrate that the gradient of the n-step soft \(Q\) loss for soft Q-learning and soft actor-critic is the same as the soft policy gradient. We reparameterize the \(Q\)-function in terms of Equation \((5)\): \(Q(s_t,a_t)=V(s_t)+\alpha KL_t\). For convenience, we define the 1-step TD error \(\delta_t=(r(s_t,a_t)-\alpha KL_t)+\gamma V(s_{t+1})-V(s_t)\) and the n-step TD error \(\Delta_t=\sum_{d=0}^{n-1}\gamma^d\delta_{t+d}\). The n-step soft \(Q\) target is

\[\begin{align} y^q_t&=r(s_t,a_t)+\sum_{d=1}^{n-1}\gamma^d (r(s_{t+d},a_{t+d})-\alpha KL_{t+d})+\gamma^n V(s_{t+n})\\\ &=\alpha KL_{t}+\sum_{d=0}^{n-1}\gamma^d(r(s_{t+d},a_{t+d})-\alpha KL_{t+d})+\gamma^n V(s_{t+n})\\\ &=\alpha KL_{t}+V(s_t)+\Delta_t \end{align}\]

Similarly, the n-step soft \(V\) target is

\[\begin{align} y^v_t=V(s_t)+\Delta_t \end{align}\]

Now, let’s consider the gradient of the n-step soft \(Q\) loss

\[\begin{align} &\nabla_\theta\mathbb E_{\pi}\left[{1\over 2}\Vert Q_\theta(s_t,a_t)-y^q_t\Vert^2\right]\\\ =&\mathbb E_{\pi}[(Q(s_t,a_t)-y^q_t)\nabla_\theta Q_\theta(s_t,a_t)]\\\ =&\mathbb E_{\pi}\left[\left(V(s_t)+\alpha KL_t-(\alpha KL_t+V(s_t)+\Delta_t)\right) \nabla_\theta Q_\theta(s_t,a_t)\right]\\\ &\qquad\color{red}{\text{ use }D_{KL}(s_t)=D_{KL}(\pi\Vert\pi_0)\text{ to approximate the second }KL_t}\\\ \approx&\mathbb E_{\pi}\left[\left(V(s_t)+\alpha KL_t-(\alpha D_{KL}(s_t)+V(s_t)+\Delta_t)\right) \nabla_\theta Q_\theta(s_t,a_t)\right]\\\ =&\mathbb E_{\pi}[(\alpha KL_t-\alpha D_{KL}(s_t)-\Delta_t) \nabla_\theta Q_\theta(s_t,a_t)]\\\ =&\mathbb E_{\pi}[(\alpha KL_t-\alpha D_{KL}(s_t)-\Delta_t) (\alpha\nabla_\theta \log\pi_\theta(a_t|s_t)+\nabla_\theta V_\theta(s_t))]\\\ =&\mathbb E_{\pi}\left[\alpha^2 KL_t\nabla_\theta \log\pi_\theta(a_t|s_t)-\alpha^2 \underbrace{D_{KL}(s_t)\nabla_\theta \log\pi_\theta(a_t|s_t)}_{=0,\text{ since }D_{KL}(s_t)\text{ is const and }\mathbb E_\pi[\text{const}\cdot\nabla_\theta\log\pi_\theta]=0}-\alpha\Delta_t\nabla_\theta \log\pi_\theta(a_t|s_t) \\\ \qquad +\underbrace{\alpha KL_t\nabla_\theta V_\theta(s_t)-\alpha D_{KL}(s_t)\nabla_\theta V_\theta(s_t)}_{=0,\text{ since }\mathbb E_\pi[KL_t]=D_{KL}(s_t)}-\Delta_t\nabla_\theta V_\theta(s_t)\right]\\\ =&\mathbb E_\pi[\alpha^2 KL_t(\theta)\nabla_\theta \log\pi_\theta(a_t|s_t)-\alpha\Delta_t\nabla_\theta \log\pi_\theta(a_t|s_t)-\Delta_t\nabla_\theta V_\theta(s_t)]\\\ &\qquad \color{red}{\text{rearrange terms and assume }\pi_0\text{ is uniform}}\\\ =&\underbrace{\mathbb E_\pi[-\alpha\Delta_t\nabla_\theta\log\pi_\theta(a_t|s_t)]+\alpha^2\nabla_\theta D_{KL}(\pi_\theta(\cdot|s_t;\theta)\Vert\pi_0(\cdot|s_t))}_{\text{policy grad}}+\underbrace{\mathbb E_\pi[\nabla_\theta{1\over 2}\Vert V_\theta(s_t)-y_t^v \Vert^2]}_\text{value function grad} \end{align}\]

Notice that the above expectation is over policy \(\pi\), which is important for establishing the equivalence.

Convergence of Soft Bellman Update

We prove the convergence of the soft Bellman update in the entropy-regularized RL setting. Let \(\pi_\alpha\) be the soft policy learned with temperature \(\alpha\), \(\mathcal T^*\) the optimal Bellman update, and \(\mathcal T^{\pi_\alpha}\) the soft Bellman update. We show that \(\Vert \mathcal T^*Q^{\pi_\alpha} -Q^{\pi_\alpha}\Vert \rightarrow 0\) as the temperature \(\alpha\) decreases. First, we have \(\mathcal T^* Q^{\pi_\alpha}\ge\mathcal T^{\pi_\alpha} Q^{\pi_\alpha}=Q^{\pi_\alpha}\), so \(\mathcal T^*Q^{\pi_\alpha} -Q^{\pi_\alpha}\ge 0\). We now show that \(\mathcal T^*Q^{\pi_\alpha} -Q^{\pi_\alpha}\) is upper bounded by a function of \(\alpha\)

\[\begin{align} &\mathcal T^\*Q^{\pi_\alpha} -Q^{\pi_\alpha}\\\ =&\mathcal T^\*Q^{\pi_\alpha} -\mathcal T^{\pi_\alpha}Q^{\pi_\alpha}\\\ =&\mathbb E_{s'}\left[\max_cQ^{\pi_\alpha}(s',c)-\sum_{b}\pi_\alpha(b|s')Q^{\pi_\alpha}(s',b)\right]\\\ =&\mathbb E_{s'}\left[\sum_{b}\pi_\alpha(b|s')\left(\max_cQ^{\pi_\alpha}(s',c)-Q^{\pi_\alpha}(s',b)\right)\right]\\\ =&\mathbb E_{s'}\left[\sum_b\exp(({Q^{\pi_\alpha}(s',b)}-V^{\pi_\alpha}(s'))/\alpha)(\max_cQ^{\pi_\alpha}(s',c)-Q^{\pi_\alpha}(s',b))\right]\\\ &\qquad \color{red}{\text{since normalizing }\pi_\alpha(b|s')=\exp((Q^{\pi_\alpha}(s',b)-V^{\pi_\alpha}(s'))/\alpha)\text{ gives }V^{\pi_\alpha}(s')=\alpha\log\sum_b\exp(Q^{\pi_\alpha}(s',b)/\alpha)\ge\max_cQ^{\pi_\alpha}(s',c)}\\\ \le&\mathbb E_{s'}\left[\sum_b\exp(({Q^{\pi_\alpha}(s',b)}-\max_c Q^{\pi_\alpha}(s',c))/\alpha)(\max_cQ^{\pi_\alpha}(s',c)-Q^{\pi_\alpha}(s',b))\right]\\\ =&\mathbb E_{s'}\left[\sum_bf_\alpha\left(\max_c Q^{\pi_\alpha}(s',c)-Q^{\pi_\alpha}(s',b)\right)\right] \end{align}\]

where \(f_\alpha(x)=x\exp(-x/\alpha)\). Notice that \(f_\alpha(x)\le \sup_x f_\alpha(x)=f_\alpha(\alpha)=\alpha e^{-1}\), which yields

\[\begin{align} 0\le \mathcal T^\*Q^{\pi_\alpha} -Q^{\pi_\alpha}\le |\mathcal A |\alpha e^{-1} \end{align}\]
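A quick numerical sanity check of \(\sup_x f_\alpha(x)=\alpha e^{-1}\) (purely illustrative; the grid and temperature are arbitrary choices):

```python
import numpy as np

alpha = 0.1
x = np.linspace(0.0, 5.0, 100001)
f = x * np.exp(-x / alpha)
# Both values are approximately 0.0368, and the maximum is attained near x = alpha.
print(f.max(), alpha / np.e)
```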