Introduction
We discuss HIDIO, a hierarchical RL algorithm that learns a skill-conditioned lower-level policy and a scheduler that selects options to modulate it so as to maximize the environment reward.
TL;DR
- The higher-level policy maximizes the discounted cumulative rewards from the environment: \(\sum_{h}\gamma^{hK}R_h + \beta_H\mathcal H(\pi_\theta(\cdot\vert \pmb s_{h,0}))\), where \(R_h=\mathbb E_{\pi_\phi}[\sum_{k}\gamma^k r(\pmb s_{h,k},\pmb a_{h,k})]\). The entropy term is added because SAC is used for training.
- The option-conditioned lower-level policy optimizes \(\sum_{h,k}\log q_\psi(\pmb u_h\vert \bar{\pmb{a}}_{h,k},\bar{\pmb s}_{h,k+1})+\beta\mathcal H(\pi_\phi(\cdot\vert \bar{\pmb s}_{h,k},\bar{\pmb u}_h))\) using SAC. The reward term encourages state and action pairs that are more distinguishable for the option.
- No correction is made for the higher-level policy training even though experiences become outdated as the lower-level policy evolves.
Method
Overview
We assume each episode has a length of \(T\) and the scheduler (i.e., the higher-level policy) \(\pi_\theta\) outputs an option every \(K\) steps. The scheduler's option \(\pmb u\in[-1,1]^D\) is a latent representation learned from scratch for the environment task. Modulated by \(\pmb u\), the worker (i.e., the lower-level policy) \(\pi_\phi\) executes \(K\) steps before the scheduler outputs the next option. Let the time horizon of the scheduler be \(H=\lceil{T\over K}\rceil\). We define
\[\begin{align} &\text{Scheduler policy:}&\pmb u_h\sim\pi_\theta(\cdot|\pmb s_{h,0})\qquad&0\le h<H\\ &\text{Worker policy:}&\pmb a_{h,k}\sim\pi_\phi(\cdot|\pmb s_{h,k},\pmb u_h)\qquad&0\le h<H, 0 \le k<K\\ &\text{Environment dynamics:}&\pmb s_{h,k+1}\sim \mathcal P(\cdot|\pmb s_{h,k},\pmb a_{h,k})\qquad&0\le h<H, 0 \le k<K \end{align}\]Note that the final state under option \(\pmb u_h\) is the first state under option \(\pmb u_{h+1}\), i.e., \(\pmb s_{h,K}=\pmb s_{h+1,0}\).
We further define \(\bar{\pmb s}_{h,k}=(\pmb s_{h,0},\dots,\pmb s_{h,k})\) and \(\bar{\pmb a}_{h,k}=(\pmb a_{h,0},\dots,\pmb a_{h,k})\), and call the pairs \((\bar{\pmb a}_{h,k},\bar{\pmb s}_{h,k+1})\) option sub-trajectories. We dig into the details of the higher- and lower-level policies next.
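To make the indexing concrete, here is a minimal rollout sketch (illustrative only, not the authors' implementation). The callables `scheduler`, `worker`, and the Gym-style `env` are assumed placeholders for \(\pi_\theta\), \(\pi_\phi\), and the environment dynamics.

```python
# Minimal two-level rollout sketch (illustrative, not the authors' code).
# `scheduler(s)` and `worker(s, u)` are placeholder callables for pi_theta and
# pi_phi; `env` follows the classic Gym API (reset -> obs, step -> 4-tuple).
import numpy as np

def rollout(env, scheduler, worker, T=1000, K=10):
    """Collect one episode in which the scheduler emits an option every K steps."""
    s = env.reset()
    H = int(np.ceil(T / K))          # scheduler horizon
    segments, t, done = [], 0, False
    for h in range(H):
        u = scheduler(s)             # u_h ~ pi_theta(. | s_{h,0})
        states, actions, rewards = [s], [], []
        for k in range(K):
            a = worker(s, u)         # a_{h,k} ~ pi_phi(. | s_{h,k}, u_h)
            s, r, done, _ = env.step(a)
            actions.append(a)        # a_{h,k}
            states.append(s)         # s_{h,k+1}
            rewards.append(r)
            t += 1
            if done or t >= T:
                break
        # (actions, states) is the option sub-trajectory used by the discriminator;
        # `rewards` feeds the scheduler's return R_h (see the next section).
        segments.append((u, states, actions, rewards))
        if done or t >= T:
            break
    return segments
```

The segments collected this way provide both the environment rewards for the scheduler and the option sub-trajectories used by the lower-level objective below.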
Higher-Level Policy
The higher-level policy schedules the lower-level policy to maximize the discounted cumulative rewards from the environment:
\[\begin{align} \sum_{h}\gamma^{hK}R_h + \beta_H\mathcal H(\pi_\theta(\cdot|\pmb s_{h,0})) \end{align}\]where \(R_h=\mathbb E_{\pi_\phi}[\sum_{k}\gamma^k r(\pmb s_{h,k},\pmb a_{h,k})]\) and the entropy term is added as SAC is used for training. Note that the higher-level policy does not produce goals as general HRL does; it produces options that dictate which skill the worker should use in the following \(K\) steps.
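Concretely, each \(K\)-step segment can be collapsed into a single transition for the scheduler's replay buffer. A minimal sketch under the same placeholder interfaces as the rollout above (the function name is illustrative):

```python
def scheduler_transition(segment, gamma=0.99):
    """Collapse one option period into a single replay transition for the scheduler.

    `segment` is one (u_h, states, actions, rewards) tuple from `rollout` above.
    R_h = sum_k gamma^k r(s_{h,k}, a_{h,k}) is the reward the scheduler receives
    for emitting u_h at s_{h,0}; across periods its SAC update further discounts
    these returns by gamma^{hK}.
    """
    u_h, states, _, rewards = segment
    R_h = sum(gamma ** k * r for k, r in enumerate(rewards))
    # (s_{h,0}, u_h, R_h, s_{h+1,0}); recall s_{h,K} = s_{h+1,0}
    return states[0], u_h, R_h, states[-1]
```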
The higher-level policy is jointly trained with the lower-level policy. An ablation study shows this is especially important for difficult tasks.
As in HIRO, when training in an off-policy fashion, option sub-trajectories become outdated as the lower-level policy evolves. One may apply importance sampling for correction when learning the \(Q\)-function, but in practice the importance ratio has very high variance and hinders training. Zhang et al. 2021 found that HIDIO performs well empirically even without any correction.
Lower-Level Policy
HIDIO trains the option-conditioned lower-level policy with an information-theoretic objective similar to DIAYN's, which maximizes
\[\begin{align} \sum_{h,k}\mathbb E_{\pi_\phi}[\log q_\psi(\pmb u_h|\bar{\pmb{a}}_{h,k},\bar{\pmb s}_{h,k+1})]+\beta\mathcal H(\pi_\phi(\cdot|\bar{\pmb s}_{h,k},\bar{\pmb u}_h)) \end{align}\]where \(q_\psi\) computes the conditional probability of the option \(\pmb u_h\). The reward term encourages state and action pairs that make the option more distinguishable; the \(\log p(\pmb u_h)\) term used in DIAYN is omitted to encourage efficient behavior.
The discriminator \(q_\psi\) predicts the option \(\pmb u_h\) from the option sub-trajectory. Assume it produces a unit-variance Gaussian with mean \(f_\psi(\bar{\pmb{a}}_{h,k},\bar{\pmb s}_{h,k+1})\); then \(\log q_\psi(\pmb u_h\vert \bar{\pmb{a}}_{h,k},\bar{\pmb s}_{h,k+1})=-\Vert f_\psi(\bar{\pmb{a}}_{h,k},\bar{\pmb s}_{h,k+1})-\pmb u_h\Vert_2^2\), where constants and scales are omitted. Zhang et al. 2021 experiment with six different inputs for \(f_\psi\), finding that \([s_{h,0}, a_{h,k}]\), \([s_{h,k+1}-s_{h,k}]\), and \([a_{h,k}, s_{h,k+1}]\) generally work best. Note that Zhang et al. 2021 recompute the reward term for every update of \(\pi_\phi\); however, no ablation study is provided for that choice.
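Under the unit-variance Gaussian assumption, the intrinsic reward reduces to a negative squared error between the discriminator's prediction and the sampled option. Below is a minimal PyTorch sketch; the network architecture and the \([a_{h,k}, s_{h,k+1}]\) feature choice are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Predicts the option u_h from features of the option sub-trajectory.

    Here we use the [a_{h,k}, s_{h,k+1}] feature variant; the paper also reports
    alternatives such as [s_{h,0}, a_{h,k}] and state differences.
    """
    def __init__(self, state_dim, action_dim, option_dim, hidden=256):
        super().__init__()
        self.f_psi = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, option_dim), nn.Tanh(),   # options live in [-1, 1]^D
        )

    def intrinsic_reward(self, a_hk, s_next, u_h):
        """log q_psi(u_h | .) up to constants = -|| f_psi(.) - u_h ||_2^2."""
        pred = self.f_psi(torch.cat([a_hk, s_next], dim=-1))
        return -((pred - u_h) ** 2).sum(dim=-1)

    def loss(self, a_hk, s_next, u_h):
        """Train q_psi by maximizing the log-likelihood of the sampled option."""
        return -self.intrinsic_reward(a_hk, s_next, u_h).mean()
```

The same quantity serves two roles: its negative mean is the discriminator's training loss, and its per-sample value is the worker's reward in the SAC update.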
Zhang et al. 2021 also experiment with another objective that takes into account the discounted information rewards from future option sub-trajectories. However, this generally yields suboptimal performance in practice.
References
Zhang, Jesse, Haonan Yu, and Wei Xu. 2021. “Hierarchical Reinforcement Learning By Discovering Intrinsic Options,” 1–18. http://arxiv.org/abs/2101.06521.