Introduction

Finally, I'm coming back to this one. Time flies :-(

This post is closely related to the previous post on HIRO. Hmm, yeah, that’s the whole introduction.

TL;DR

NORL-HRL learns a representation function \(f\) and an inverse goal model \(\varphi\) that maximize the mutual information (MI) between \([s_t,\pi]\) and \(s_{t+k}\), by minimizing the following objective:

$$
J(\theta,s_t,\pi)=\sum_{k=1}^{c}\gamma^{k-1}w_k\Big(\mathbb{E}_{P_\pi(s_{t+k}\vert s_t)}\big[D\big(f_\theta(s_{t+k}),\varphi_\theta(s_t,\pi)\big)\big]+\log\mathbb{E}_{\tilde s\sim\rho(s_{t+k})}\big[\exp\big(-D\big(f_\theta(\tilde s),\varphi_\theta(s_t,\pi)\big)\big)\big]\Big)
$$

where \(\pi\) is the lower-level policy, \(D\) is a distance function (e.g., the Huber function), \(f\) maps the target state \(s_{t+k}\) to the goal space, and \(\varphi\) predicts which goal \(g\) the higher-level policy must have produced such that the lower-level policy \(\pi\) was selected at the start state \(s_t\). In practice, we replace \(\pi\) with the actions taken by the lower-level policy between \(t\) and \(t+c\), i.e., \(\pi=a_{t:t+c-1}\). Notice that the above objective is the (negated) conditional version of the MINE estimator; that’s why we say NORL-HRL learns \(f\) and \(\varphi\) to maximize the MI between \([s_t,\pi]\) and \(s_{t+k}\).

With the representation function and the inverse goal model, NORL-HRL trains the higher- and lower-level policies in a similar way to HIRO, except that

  1. The higher-level policy produces goals in the goal space, a space of lower dimension than the state space.

  2. The lower-level reward function now becomes

$$
\tilde r_k=w_k\Big(-D\big(f(s_{t+k}),g\big)-\big(-D\big(f(s_{t+k}),\varphi(s_t,\pi)\big)-\log\mathbb{E}_{\tilde s\sim\rho(s_{t+k})}\big[\exp\big(-D\big(f_\theta(\tilde s),\varphi_\theta(s_t,\pi)\big)\big)\big]\big)\Big),\quad\text{where } w_k=\begin{cases}1&k<c\\(1-\gamma)^{-1}&k=c\end{cases}
$$

where \(\pi=a_{t:t+c-1}\). Here, the first term encourages the lower-level policy to approach the goal \(g\), while the second term regularizes the policy by minimizing the MI between \(s_{t+k}\) and \([s_t,\pi]\).

  3. NORL-HRL is more of a theoretical model for representation learning than an HRL algorithm. Nachum et al. only ran experiments on the representation learning part.

Framework

Different from HIRO, in which goals serve as a measure of dissimilarity between the current state and the desired state, goals here are used, in conjunction with the current state, to directly produce a lower-level policy. Formally, we use the higher-level policy to sample a goal \(g_t\sim\pi^{high}(g\vert s_t)\) and translate it to a lower-level policy \(\pi^{low}_t=\Psi(s_t,g_t)\), which is used to sample actions \(a_{t+k}\sim\pi^{low}_t(a\vert s_{t+k},k)\) for \(k\in[0,c-1]\). The process is then repeated from \(s_{t+c}\). (Note that the lower-level policy described above is a theoretical model, slightly different from the one used in practice, which we will discuss shortly.)
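To make the control flow concrete, here is a minimal rollout sketch of the scheme just described; `env`, `high_policy`, and `make_low_policy` (standing in for \(\Psi\)) are hypothetical placeholders, not the authors' implementation.

```python
def rollout(env, high_policy, make_low_policy, c=10, n_segments=100):
    """Hierarchical rollout: sample a goal every c steps; the lower-level
    policy (produced by Psi) acts in between. All arguments are hypothetical
    placeholders used only to illustrate the data flow described above."""
    s_t = env.reset()
    trajectory = []
    for _ in range(n_segments):
        g_t = high_policy(s_t)                    # g_t ~ pi^high(g | s_t)
        low_policy = make_low_policy(s_t, g_t)    # pi^low_t = Psi(s_t, g_t)
        s = s_t
        for k in range(c):                        # k in [0, c-1]
            a = low_policy(s, k)                  # a_{t+k} ~ pi^low_t(a | s_{t+k}, k)
            s, reward, done, _ = env.step(a)
            trajectory.append((s_t, g_t, k, a, s, reward))
            if done:
                return trajectory
        s_t = s                                   # repeat from s_{t+c}
    return trajectory
```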

The mapping \(\Psi\) is usually expressed as the result of an RL optimization over the policy space

$$
\Psi(s_t,g_t)=\arg\max_\pi\sum_{k=1}^{c}\gamma^{k-1}\,\mathbb{E}_{P_\pi(s_{t+k}\vert s_t)}\big[-D\big(f(s_{t+k}),g_t\big)\big]\tag{1}
$$

where \(P_\pi(s_{t+k}\vert s_t)\) denotes the probability of being in state \(s_{t+k}\) after following the lower-level policy \(\pi\) for \(k\) steps starting from \(s_t\), \(f\) denotes a representation function encoding a state to a low-dimensional representation, and \(D\) denotes a distance function (e.g., the Huber function), whose negative serves as the intrinsic lower-level reward. Note that the use of \(D\) here differs from HIRO in two aspects: 1) there is no \(s_t\) involved, since \(g_t\) is no longer a measure of dissimilarity (it is more like the projection of \(s_{t+c}\) onto the goal space); 2) the distance is measured in the goal space (usually a lower-dimensional space) instead of the raw state space.
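As a concrete (and purely illustrative) example of the distance \(D\) and the resulting intrinsic reward \(-D(f(s_{t+k}),g_t)\), here is a small NumPy sketch using the Huber function; the linear encoder \(f\) at the end is a toy placeholder, not the learned representation.

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber function: quadratic near zero, linear in the tails."""
    abs_x = np.abs(x)
    return np.where(abs_x <= delta, 0.5 * x ** 2, delta * (abs_x - 0.5 * delta))

def distance(z1, z2, delta=1.0):
    """D(z1, z2): Huber distance summed over goal-space dimensions."""
    return np.sum(huber(z1 - z2, delta), axis=-1)

def intrinsic_reward(f, s_next, g):
    """Negative distance between the encoded state and the goal, measured in
    the (low-dimensional) goal space rather than the raw state space."""
    return -distance(f(s_next), g)

# Toy usage with a random linear encoder f (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6))              # goal-space dim 2, state dim 6
f = lambda s: W @ s
s_next, g = rng.normal(size=6), rng.normal(size=2)
print(intrinsic_reward(f, s_next, g))
```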

Good Representations Lead to Bounded Sub-Optimality

Eq.(1) gives us a glimpse of what the desired lower-level policy looks like, but it does not offer any insight into how the representation function \(f\) should be defined. Claim 4 in the paper shows that if we define \(\Psi\) as a slight modification of the traditional objective given in Eq.(1), then we may translate the sub-optimality of \(\Psi\) into a practical representation learning objective for \(f\). (Do not be put off by the notation and terminology; explanations will come shortly.)

Claim 4: Let \(\rho(s)\) be a prior over the state space \(S\). Let \(f\) and \(\varphi\) be such that

$$
\sup_{s_t,\pi}\frac{1}{\bar w}\sum_{k=1}^{c}\gamma^{k-1}w_k\,D_{KL}\big(P_\pi(s_{t+k}\vert s_t)\,\big\Vert\,K(s_{t+k}\vert s_t,\pi)\big)\le\epsilon^2/8,\quad\text{where } K(s_{t+k}\vert s_t,\pi)\propto\rho(s_{t+k})\exp\big(-D\big(f(s_{t+k}),\varphi(s_t,\pi)\big)\big)\tag{2}
$$

If the lower-level objective is defined as

$$
\Psi(s_t,g)=\arg\max_\pi\sum_{k=1}^{c}\gamma^{k-1}w_k\,\mathbb{E}_{P_\pi(s_{t+k}\vert s_t)}\big[-D\big(f(s_{t+k}),g\big)+\log\rho(s_{t+k})-\log P_\pi(s_{t+k}\vert s_t)\big]\tag{3}
$$

then the sub-optimality of \(\Psi\) is bounded by \(C\epsilon\).

We do not prove this claim here, since the detailed proof provided by the paper is already pretty neat but still takes more than two pages. Instead, we explain the notation and terminology used above to provide some insight:

  • \(f\) is the representation learning function we discussed before. \(\varphi\) is an auxiliary inverse goal model, which aims to predict which goal will cause \(\Psi\) to yield a policy \(\tilde\pi=\Psi(s,g)\) that induces a subsequent distribution \(P_{\tilde\pi}(s_{t+k}\vert s_t)\) similar to \(P_\pi(s_{t+k}\vert s_t)\) for \(k\in[1,c]\).

  • We weight the KL divergence between the distributions by \(w_k=1\) for \(k<c\) and \(w_k=(1-\gamma)^{-1}\) for \(k=c\). We further denote \(\bar w=\sum_{k=1}^{c}\gamma^{k-1}w_k\), which normalizes the weights so that they sum to one.

  • Nachum et al. interpret \(K\) as the conditional \(P(\text{state}=s\vert\text{repr}=z)\), evaluated at \(z=\varphi(s_t,\pi)\), of the joint distribution \(P(\text{state}=s)P(\text{repr}=z\vert\text{state}=s)=\frac{1}{Z}\rho(s)\exp\big(-D(f(s),z)\big)\) for a normalization constant \(Z\). In this way, we may regard \(P(\text{repr}=z\vert\text{state}=s)\) as a Boltzmann distribution with logits \(-D(f(s),z)\) (see the short sketch after this list). However, although Nachum et al. claim \(K\) is designed this way for a reason, to the best of my effort I still cannot find any evidence for it in their proof. But as we will see later here and here, by designing \(K\) in this way, we can align the representation learning with MINE and give a nice interpretation based on PGMs.

  • The sub-optimality of \(\Psi\) measures the loss, in terms of expected value, of using \(\Psi\) against the optimal policy. Formally, it is defined as the maximum value difference between the optimal policy \(\pi^*\) and the hierarchical policy \(\pi^{hier}\) learned through \(\Psi\), i.e., \(\mathrm{SubOpt}(\Psi)=\sup_s V^{\pi^*}(s)-V^{\pi^{hier}}(s)\).

  • The lower-level objective can also be expressed as a KL minimization:

$$
\Psi(s_t,\varphi(s_t,\bar\pi))=\arg\min_\pi\sum_{k=1}^{c}\gamma^{k-1}w_k\,D_{KL}\big(P_\pi(s_{t+k}\vert s_t)\,\big\Vert\,K(s_{t+k}\vert s_t,\bar\pi)\big)\tag{4}
$$

where we replace \(g\) in Eq.(3) with \(\varphi(s_t,\bar\pi)\), and \(\bar\pi\) denotes a fixed policy from which the inverse goal model produces \(g\). In fact, Eq.(4) is identical to the LHS of Eq.(2), which gives us a nice explanation of the correlation between policy optimization and representation learning. We will come back to this in the Discussion.
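To make the Boltzmann interpretation of \(K\) above concrete, the sketch below normalizes \(\rho(s)\exp(-D(f(s),z))\) over a finite batch of candidate states, which is just a softmax over \(\log\rho(s)-D(f(s),z)\); all inputs are illustrative placeholders.

```python
import numpy as np

def K_probs(candidates, rho, f, z, distance):
    """Finite-sample version of K(s | s_t, pi) ∝ rho(s) exp(-D(f(s), z)),
    where z = phi(s_t, pi). Purely illustrative.

    candidates: array [N, state_dim] of candidate next states
    rho:        prior probabilities of the candidates, shape [N]
    z:          output of the inverse goal model, shape [goal_dim]
    """
    logits = np.log(rho) - np.array([distance(f(s), z) for s in candidates])
    logits -= logits.max()            # for numerical stability
    unnorm = np.exp(logits)
    return unnorm / unnorm.sum()      # Boltzmann (softmax) normalization
```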

Putting It All Together

Now let us develop some intuition about Claim 4. Claim 4 equivalently says that if the representation function \(f\) and the inverse goal model \(\varphi\) minimize the following loss

$$
\arg\min_{f,\varphi}\frac{1}{\bar w}\sum_{k=1}^{c}\gamma^{k-1}w_k\,D_{KL}\big(P_\pi(s_{t+k}\vert s_t)\,\big\Vert\,K(s_{t+k}\vert s_t,\pi)\big),\quad\text{where } K(s_{t+k}\vert s_t,\pi)\propto\rho(s_{t+k})\exp\big(-D\big(f(s_{t+k}),\varphi(s_t,\pi)\big)\big)\tag{5}
$$

and a lower-level policy \(\pi^{low}\) maximizes an RL objective with the reward function defined as

$$
\tilde r_k=w_k\big(-D\big(f(s_{t+k}),g\big)+\log\rho(s_{t+k})-\log P_\pi(s_{t+k}\vert s_t)\big),\quad\text{where } w_k=\begin{cases}1&\text{if }k<c\\(1-\gamma)^{-1}&\text{if }k=c\end{cases}
$$

then a hierarchical policy \(\pi^{hier}\) learned through \(\pi^{low}\) has bounded sub-optimality.

Now that we have grasped the overall theory, let’s dive into each part and see what they are actually doing.

Representation Learning

Let us first parameterize the representation function \(f\) and the inverse goal model \(\varphi\) by \(\theta\). The supremum in Eq.(2) indicates that \(f\) and \(\varphi\) should minimize the inner part of the supremum. In practice, however, we do not have access to a representation of the policy \(\pi\). Thus, Nachum et al. propose sampling \(s_t\) uniformly from the replay buffer and using the subsequent \(c\) actions \(a_{t:t+c-1}\) as a representation of the policy. As a result, we obtain the representation learning objective

$$
\arg\min_\theta L(\theta)=\mathbb{E}_{s_t,a_{t:t+c-1}\sim\text{replay}}\big[L(\theta,s_t,a_{t:t+c-1})\big],\quad\text{where } L(\theta,s_t,a_{t:t+c-1})=\sum_{k=1}^{c}\gamma^{k-1}w_k\,D_{KL}\big(P_{a_{t:t+c-1}}(s_{t+k}\vert s_t)\,\big\Vert\,K_\theta(s_{t+k}\vert s_t,a_{t:t+c-1})\big)\tag{6}
$$

To simplify notation, in the following discussion we write \(\pi=a_{t:t+c-1}\) and \(E_\theta(s_{t+k},s_t,\pi)=\exp\big(-D(f_\theta(s_{t+k}),\varphi_\theta(s_t,\pi))\big)\). Also, recall that \(K_\theta(s_{t+k}\vert s_t,\pi)=\rho(s_{t+k})E_\theta(s_{t+k},s_t,\pi)\big/\int\rho(s_{t+k})E_\theta(s_{t+k},s_t,\pi)\,ds_{t+k}\). We now further expand Eq.(6):

$$
\begin{aligned}
L(\theta,s_t,\pi)&=\sum_{k=1}^{c}\gamma^{k-1}w_k\,D_{KL}\big(P_\pi(s_{t+k}\vert s_t)\,\big\Vert\,K_\theta(s_{t+k}\vert s_t,\pi)\big)\\
&=\sum_{k=1}^{c}\gamma^{k-1}w_k\,\mathbb{E}_{P_\pi(s_{t+k}\vert s_t)}\big[-\log K_\theta(s_{t+k}\vert s_t,\pi)\big]\\
&=\sum_{k=1}^{c}\gamma^{k-1}w_k\Big(\mathbb{E}_{P_\pi(s_{t+k}\vert s_t)}\big[D\big(f_\theta(s_{t+k}),\varphi_\theta(s_t,\pi)\big)\big]+\log\mathbb{E}_{\tilde s\sim\rho(s_{t+k})}\big[\exp\big(-D\big(f_\theta(\tilde s),\varphi_\theta(s_t,\pi)\big)\big)\big]\Big)
\end{aligned}\tag{7}
$$

where we omit terms irrelevant to \(\theta\) whenever possible. Note that Eq.(7) is similar to the objective of the MINE estimator. This suggests that our representation learning objective is in fact maximizing the mutual information between \([s_t,\pi]\) and \(s_{t+k}\), with discounting as \(k\) increases. As in MINE, the gradient of the second term in the brackets introduces bias, so the authors propose replacing that term with

$$
\frac{\mathbb{E}_{\tilde s\sim\rho}\big[E_\theta(\tilde s,s_t,\pi)\big]}{\mathbb{E}_{\tilde s\sim\text{replay}}\big[E_\theta(\tilde s,s_t,\pi)\big]}
$$

where the denominator is approximated using an additional mini-batch of states sampled from the replay buffer, and no gradient is back-propagated through it.
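Putting Eq.(7) and the bias-corrected ratio together, a per-sample loss might look like the NumPy sketch below; `f_theta`, `phi_theta`, the sampled batches, and the handling of the stop-gradient are my assumptions about how this could be wired up, not the released implementation.

```python
import numpy as np

def repr_loss(f_theta, phi_theta, distance,
              s_t, actions, future_states, prior_batch, replay_batch, gamma=0.99):
    """Per-sample representation loss following Eq.(7) (illustrative placeholders).

    future_states: the observed [s_{t+1}, ..., s_{t+c}]
    prior_batch:   states sampled from the prior rho
    replay_batch:  extra states from the replay buffer; the denominator built
                   from them should be treated as a constant (stop-gradient).
    """
    c = len(future_states)
    z = phi_theta(s_t, actions)                      # phi_theta(s_t, pi), pi = a_{t:t+c-1}

    E = lambda s: np.exp(-distance(f_theta(s), z))   # E_theta(s, s_t, pi)

    # Bias-corrected surrogate for log E_{s~rho}[E_theta(s, s_t, pi)]:
    numer = np.mean([E(s) for s in prior_batch])
    denom = np.mean([E(s) for s in replay_batch])    # no gradient through this in practice
    second_term = numer / denom

    loss = 0.0
    for k, s_tk in enumerate(future_states, start=1):
        w_k = 1.0 if k < c else 1.0 / (1.0 - gamma)  # w_k = 1 (k<c), (1-gamma)^{-1} (k=c)
        loss += gamma ** (k - 1) * w_k * (distance(f_theta(s_tk), z) + second_term)
    return loss
```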

Since Eq.(7) essentially maximizes the mutual information between \([s_t,\pi]\) and \(s_{t+k}\) for \(k\in[1,c]\), other methods such as DIM may further be used to improve performance.

Lower-Level Policy Learning

Eq.(3) suggests optimizing a separate policy \(\pi_{s_t,g}(a\vert s_{t+k-1},k)\) for every \(s_t,g\). This is equivalent to optimizing a single parameterized policy \(\pi(a\vert s_t,g,s_{t+k},k)\). Standard RL algorithms may be employed to maximize the lower-level reward implied by Eq.(3),

$$
\tilde r_k=w_k\big(-D\big(f(s_{t+k}),g\big)+\log\rho(s_{t+k})-\log P_\pi(s_{t+k}\vert s_t)\big)\tag{8}
$$

where the first term in the brackets is straightforward to compute, but the remaining terms are in general unknown. To address this issue, we replace \(P_\pi(s_{t+k}\vert s_t)\) with \(K_\theta(s_{t+k}\vert s_t,\pi)\) (justified by Eqs. (5) and (6)), ending up with

$$
\begin{aligned}
\tilde r_k&=w_k\big(-D\big(f(s_{t+k}),g\big)+\log\rho(s_{t+k})-\log K_\theta(s_{t+k}\vert s_t,\pi)\big)\\
&=w_k\Big(-D\big(f(s_{t+k}),g\big)+D\big(f(s_{t+k}),\varphi(s_t,\pi)\big)+\log\mathbb{E}_{\tilde s\sim\rho(s_{t+k})}\big[\exp\big(-D\big(f_\theta(\tilde s),\varphi_\theta(s_t,\pi)\big)\big)\big]\Big)\\
&=w_k\Big(-D\big(f(s_{t+k}),g\big)-\big(-D\big(f(s_{t+k}),\varphi(s_t,\pi)\big)-\log\mathbb{E}_{\tilde s\sim\rho(s_{t+k})}\big[\exp\big(-D\big(f_\theta(\tilde s),\varphi_\theta(s_t,\pi)\big)\big)\big]\big)\Big)
\end{aligned}\tag{9}
$$

where \(\pi\) is approximated by \(a_{t:t+c-1}\) as before. Now that the lower-level reward is computable from Eq.(9), the lower-level policy can be learned as in HIRO using some off-policy method. (Wait, off-policy?)

Now let’s analyze the lower-level reward a bit to develop some intuition. The first term \(-D(f(s_{t+k}),g)\) simply encourages the agent to reach states that are close to \(g\) in the goal space; the second term, \(-D(f(s_{t+k}),\varphi(s_t,\pi))-\log\mathbb{E}_{\tilde s\sim\rho(s_{t+k})}\big[\exp\big(-D(f_\theta(\tilde s),\varphi_\theta(s_t,\pi))\big)\big]\), estimates the mutual information between \([s_t,\pi]\) and \(s_{t+k}\), as we discussed in the previous subsection. By penalizing this mutual information, we regularize the lower-level policy.
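For concreteness, the reward of Eq.(9) could be computed roughly as in the sketch below, with the expectation over \(\rho\) approximated by a batch of sampled states; the function names and interfaces are assumptions for illustration, not the paper's code.

```python
import numpy as np

def low_level_reward(f, phi, distance, s_t, actions, s_tk, g,
                     prior_batch, k, c, gamma=0.99):
    """Lower-level reward of Eq.(9): goal-reaching term minus an MI penalty.

    prior_batch approximates E_{s~rho}[exp(-D(f(s), phi(s_t, pi)))].
    """
    w_k = 1.0 if k < c else 1.0 / (1.0 - gamma)
    z = phi(s_t, actions)                              # phi(s_t, pi)
    log_E = np.log(np.mean([np.exp(-distance(f(s), z)) for s in prior_batch]))
    goal_term = -distance(f(s_tk), g)                  # approach the goal in goal space
    mi_term = -distance(f(s_tk), z) - log_E            # MINE-style MI estimate
    return w_k * (goal_term - mi_term)
```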

Recap and Algorithm

So far we have discussed essentially everything we need to build the hierarchical algorithm. Let us do a brief review: we first compared the architectures of NORL-HRL and HIRO. Then we started from Claim 4, seeing how to learn good representations that lead to bounded sub-optimality and how the intrinsic reward for the lower-level policy is defined. Now it is time to present the whole algorithm.

Pseudocode for NORL-HRL
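The original pseudocode figure is not reproduced here, so below is a rough, hedged sketch of how the pieces might be scheduled in one training loop; every object and the update ordering are my own guesses for illustration, not the paper's algorithm.

```python
def train(env, high_policy, low_policy, replay, c=10, total_steps=1_000_000):
    """Rough NORL-HRL-style training loop; all objects are hypothetical placeholders."""
    s_t = env.reset()
    step = 0
    while step < total_steps:
        g_t = high_policy.sample(s_t)                # goal in the learned goal space
        s, done = s_t, False
        segment = []
        for k in range(c):                           # lower-level acts for c steps
            a = low_policy.sample(s, g_t, k)
            s, env_reward, done, _ = env.step(a)
            segment.append((a, s, env_reward))
            step += 1
            if done:
                break
        replay.add(s_t, g_t, segment)

        # In a real implementation, after (or interleaved with) data collection:
        #   1. update f and phi by minimizing Eq.(6)/(7) on replayed (s_t, a_{t:t+c-1}, s_{t+k});
        #   2. update the lower-level policy off-policy with the intrinsic reward of Eq.(9);
        #   3. update the higher-level policy on the environment reward,
        #      relabeling goals as in HIRO (see Discussion).

        s_t = env.reset() if done else s
    return high_policy, low_policy
```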

Discussion

In this section, I humbly discuss some personal thoughts from reading this paper.

A Different Interpretation of \(K(s_{t+k}\vert s_t,\pi)\propto\rho(s_{t+k})\exp\big(-D(f(s_{t+k}),\varphi(s_t,\pi))\big)\)

If we take \(-D(f(s_{t+k}),\varphi(s_t,\pi))\) as a reward function \(r(s_t,\pi,s_{t+k})\), and relate the definition of \(K\) to the probabilistic graphical model (PGM) that we’ve discussed in posts [1] [2], we obtain the following PGM

A probabilistic graphical model for K

Then we have

$$
K(s_{t+k}\vert s_t,\pi)=\rho(s_{t+k})\,P(O_{t+k}\vert s_t,\pi,s_{t+k})=P(O_{t+k},s_{t+k}\vert s_t,\pi)
$$

Now we may interpret \(K(s_{t+k}\vert s_t,\pi)\) as the joint probability that the agent lands in state \(s_{t+k}\) and that state is optimal, given the current state \(s_t\) and policy \(\pi\).

Correlation Between Lower-Level Policy Optimization and Representation Learning

As indicated by Eqs.(4) and (6), both lower-level policy optimization and representation learning essentially minimize the KL divergence below, if we neglect all weights and discount factors,

$$
D_{KL}\big(P_\pi(s_{t+k}\vert s_t)\,\big\Vert\,K_\theta(s_{t+k}\vert s_t,\bar\pi)\big)\quad\forall k\in[1,c],\quad\text{where } K_\theta(s_{t+k}\vert s_t,\bar\pi)\propto\rho(s_{t+k})\exp\big(-D\big(f_\theta(s_{t+k}),\varphi_\theta(s_t,\bar\pi)\big)\big)
$$

where we treat \(\varphi_\theta(s_t,\bar\pi)\) and \(g\) as interchangeable. We can see from this KL divergence that lower-level policy optimization draws \(P_\pi\) close to \(K_\theta\), while representation learning draws \(K_\theta\) close to \(P_\pi\). Intuitively, when we optimize the lower-level policy, we want the distribution of \(s_{t+k}\) under that policy to match the probability of \(s_{t+k}\) being optimal. On the other hand, when we do representation learning, we maximize the mutual information between \(s_{t+k}\) and \([s_t,\pi]\), making the lower-level policy easier to optimize. In this regard, representation learning plays a similar role to a value function: both evaluate the current lower-level policy.

Should we relabel goals here as we do in HIRO?

This question has been haunting me for a while, since it is not mentioned in the paper. So far, my personal answer is that maybe we should also do relabeling, but it is not as urgent as it is in HIRO. The same reason HIRO requires relabeling goals still applies here: as the lower-level policy evolves, previously collected transition tuples may no longer be valid for the higher-level policy. However, this problem may, to some extent, be mitigated by the fact that we represent goals in a lower-dimensional space.

Well, maybe the paper does mention this, though in a very implicit way… The experience used to train the higher-level policy includes the state and the lower-level action sequence, and the latter is only useful for off-policy correction.

This has since been confirmed by Ofir Nachum: goals are indeed relabeled as in HIRO.
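For reference, here is a hedged sketch of what HIRO-style relabeling could look like in this setting: candidate goals are scored by how well they explain the stored lower-level actions under the current lower-level policy. The candidate scheme (original goal plus Gaussian perturbations) and the `low_policy_logp` interface are assumptions, not details from the paper.

```python
import numpy as np

def relabel_goal(low_policy_logp, states, actions, orig_goal, rng,
                 n_candidates=8, std=1.0):
    """Pick the candidate goal that best explains the stored lower-level actions.

    low_policy_logp(s, g, k, a) -> log pi^low(a | s, g, k) under the current policy.
    states:  [s_t, ..., s_{t+c-1}], actions: [a_t, ..., a_{t+c-1}]
    Candidates are the original goal plus Gaussian perturbations, mirroring HIRO;
    whether NORL-HRL uses exactly this scheme is an assumption.
    """
    candidates = [orig_goal] + [orig_goal + std * rng.normal(size=orig_goal.shape)
                                for _ in range(n_candidates - 1)]
    scores = []
    for g in candidates:
        scores.append(sum(low_policy_logp(s, g, k, a)
                          for k, (s, a) in enumerate(zip(states, actions))))
    return candidates[int(np.argmax(scores))]
```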

Is it a good idea to learn NORL-HRL in an off-policy way?

Nachum et al. propose learning NORL-HRL in an off-policy way to gain sample efficiency. However, is this really a good idea? Be aware that we are dealing with a nonstationary setting here; neither the higher-level nor the lower-level rewards are reusable. The higher-level transitions soon become invalid as the lower-level policy evolves, and the lower-level reward function constantly changes, since it has a complex relationship with the representation function (through \(f\)), the inverse goal model (through \(\varphi\)), the higher-level policy (through \(g\)), and even the lower-level policy (through \(\pi\), although in practice we approximate it with actions). As a result, it may be preferable to learn these using an on-policy algorithm.

References

Mohamed Ishmael Belghazi et al. Mutual Information Neural Estimation

Ofir Nachum et al. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

Code: https://github.com/tensorflow/models/tree/master/research/efficient-hrl