Introduction
We discuss a novel Hierarchical Reinforcement Learning(HRL) framework that can efficiently learn multiple levels of policies in parallel. Experiments shows, this framework, proposed by Andrew Levy et al. 2019, can significantly accelerate learning in sparse reward problems, specifically those whose objective is to reach some goal state. Noticeably, this is the first framework that succeeds in learning 3-level hierarchies in parallel in tasks with continuous state and action space. Some experiments done by the authors even demonstrate its capability to harness 4-level hierarchies. This video shows its competence in 2- and 3-level hierarchies, and this gives a simple demonstration with 4-level agents.
Overview
We start by characterizing the overall architecture and purpose of this framework — from here on, we focus on its derivative algorithm in continuous domains, namely Hierarchical Actor-Critic(HAC); the algorithm in discrete domains, Hierarchical Q-learning(HierQ), should be almost the same, except it uses Q-learning at each level.
-
HAC is designed for goal-directed tasks, in which the agent tries to reach some goal state—by reaching a goal, we mean the agent gets close to the goal within a certain range. As we will see soon, this provides HAC the flexibility to design its own reward in different situations.
-
HAC uses an off-policy RL method at each layer: it could be Q-learning variants for discrete action space or DDPG variants for continuous space. Transitions are defined as tuples of form [initial state,action,reward,next state,goal,discount rate]. For low-level policies, HAC puts an explicit constraint on the maximum number H of actions the policy takes to reach a subgoal.
For a better understanding, we also present a brief comparison between HAC and HIRO, an HRL algorithm we discussed in the previous post.
Both HIRO and HAC aim to efficiently learn multiple levels of value-based policies in parallel, but they differ in some fundamental aspects: 1) At the high level, HAC takes the target state as subgoals, while HIRO uses as subgoals the difference between the current and target states 2) HAC does not rely on any extrinsic reward; it gives negative rewards to transitions not achieving the goal and 0 otherwise. HIRO, on the other hand, employs the Euclidean distance as low-level rewards and extrinsic rewards as high-level rewards. From this point of view, HIRO is more general than HAC since the later is specialized for goal-directed problems. 3) There are three sources of transitions in HAC, in contrast to one in HIRO: hindsight action transitions, hindsight goal transitions, and subgoal testing transitions.
In the rest of the post, we concentrate on the transitions in HAC, seeing how they help to jointly learn multiple level policies and deal with the sparse reward issue. Once we figure the transitions out, there is no trick left in HAC.
Hindsight Action Transitions (HAT)
Most value-based(off-policy) reinforcement learning algorithms are built upon the Bellman equations, which count on a stationary transition function to guarantee theoretical convergence. In hierarchical reinforcement learning, the high-level policies often face non-stationary transition functions as high-level transition functions usually depend on low-level policies, which constantly evolve over time. This may be further exacerbated by low-level exploration since exploration generally introduces randomness, making the high-level transition function more uncertain. HIRO and NORL-HRL discussed in the previous posts have recourse to goal relabeling to reinstate transition tuples. But this is more like an ad hoc technique and may be hard to generalize to more than 2-level hierarchies. HAC overcomes these non-stationary issues by training high-level policy with respect to hindsight action transitions, which simulates a transition function using the optimal lower-level policy hierarchy.
Take the example from the paper. Consider a two-level agent and assume the agent starts from s0 and tries to reach the yellow flag. The high-level policy plans subgoal g0 for the low-level policy to achieve. The low-level policy executes H primitive actions using some behavior policy but is unable to achieve g0, instead landing in s1. Now we can compose the first hindsight action transition, [s0,s1,−1,s1,yellow flag,γ]. Notice that the reward is always −1 if the goal(yellow flag) has not been achieved and 0 otherwise. Furthermore, we no longer use subgoal g0 as the action, and instead, we use s1, which is exactly the next state in the tuple. To get a betters sense of it, one might take it as if g0 denotes the action selected by the high-level target policy and s1 is the result of some high-level behavior policy that takes into account exploration. In this way, transitions are always taken as if the low-level policy is optimal and therefore succeeds in achieving the subgoal.
There is also a potential benefit of these transitions: since subgoals produced by the high-level are always H low-level actions away from the initial states, through these transitions, the high-level policy learns subgoals that fit the time scale of H low-level actions per high-level actions.
Hindsight Goal Transitions (HGT)
Hindsight action transitions suffer the sparse reward problem, which makes it hard to learn. To enable each level to learn more effectively, the authors propose hindsight goal transitions that extend the idea of Hindsight Experience Replay(HER) to the hierarchical setting. After each of H actions being executed, for each transition, we select one of the future states as its new goal. The reward is −1 if the new goal has not been achieved and 0 otherwise. We do this for all levels.
We again take the example in the previous section for illustration. Let’s say we want to apply HER to [s0,s1,−1,s1,yellow flag,γ]. If we happen to take s1 as the new goal, then the hindsight goal transition is [s0,s1,0,s1,s1,γ]; if we take s4 as the new goal, then we will have [s0,s1,−1,s1,s4,γ].
Subgoal Testing Transitions (STT)
Hindsight action and hindsight goal transitions enable multi-level policies to learn effectively in parallel, but some key issues remain. The most serious flaw is that the strategy only enables a level to learn about a restricted set of subgoal states that its lower-level policy hierarchy is able to achieve. This could be problematic, especially to the reward function we use above. Recall we define the reward to be −1 if the goal has not been achieved and 0 otherwise. Q-values for those state-action pairs achievable by the lower-level policy hierarchy are, in general, negative values and are likely to be smaller than those cannot be achieved by the lower-level policy hierarchy. As a result, the high-level policy will prefer a path of subgoals that cannot yet be achieved by its lower-level policy hierarchy, resulting in risk-seeking behaviors of high-level policy.
To address these issues, we need to penalize those subgoals not achievable by the lower-level policy hierarchy. In tabular RL, one can easily achieve this by initializing Q-values with large negative values, but this is not the story for function approximations, in which all Q-values share the same set of parameters. Then, what should we do? Levy et al. introduce subgoal testing transitions to penalizes those subgoals not achievable by the current policy. Here’s how it works (please be aware of the difference between subgoal testing and lower-level subgoal testing: the former refers to the subgoal testing at the layer we are talking about).
- For each level, we define a boolean variable, subgoal testing. Because the subgoal for each layer is specified by its higher-level, this variable indicates whether its higher-level adds subgoal testing transitions to the replay buffer. These variables are initialized to be false and only modified by higher-levels. Note the subgoal testing in the highest-level policy is always false.
- At the action selection stage of some level, if its subgoal testing is true, then its lower-level subgoal testing is set to true; if not, we give it some chance(30% in their implementation) to set its lower-level subgoal testing to true.
- At each level, if its subgoal testing is set to true by its higher-level, we take deterministic action produced by the policy since we want to know if the current policy is able to achieve the subgoal; if not, we add some noise to the action for exploration as we usually do.
- At each layer, if the lower-level subgoal testing is true and lower-level policy hierarchy fails to achieve the subgoal, we add subgoal testing transitions to the replay buffer. Such transitions take subgoals as actions and use a low reward, penalty—In their experiments, they set penalty to −H, the negative of the maximum horizon of a subgoal.
In 2, we only give 30% chance to set the lower-level subgoal testing to true. For those subgoals that are not achievable by the current policy but still might be achieved by some optimal policy in H steps, this helps their Q-values lean toward the target Q-values prescribed by hindsight action/goal transitions.
Summary of Transitions
Hindsight action transitions(HAT) relabel actions to the next state so that the transitions are always achievable by the lower-level policy. With hindsight action transitions, each level in the hierarchy can focus on learning the sequence of actions that achieve the goal given by its higher-level policy without worrying about if its lower-level policy is able to carry out such actions.
Hindsight goal transitions(HGT) further relabels the goal in HAT to help each level to learn a goal-conditioned policy in sparse reward tasks(recall that HAT gives −1 reward to all transitions that does not achieve the goal).
Subgoal testing transitions(STT) penalizes transitions that cannot yet be achieved by the lower-level policy hierarchy to mitigate risk-seeking behaviors.
Algorithm
We now present the pseudocode for completeness
DDPG is adopted by HAC to learn each level policy. Notice that there is no subgoal testing transitions at the lowest-level i=0 as it’s the ground policy.
Reference
Andrew Levy et al. Learning Multi-Level Hierarchies with Hindsight
Code: https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-