Preliminaries
We extend the policy-gradient objective by conditioning it on a goal $g$:
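$$
\nabla_\theta J(\theta) = \sum_g p(g) \sum_\tau \underbrace{p(\tau \mid g, \theta)}_{\text{trajectory probability}} \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, A(s_t, a_t, g) \tag{1}
$$

where $A(s_t, a_t, g)$ is some advantage function. Note that we have written the expectation as an explicit sum for later use.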
Hindsight experience replay (Andrychowicz et al. 2017) samples future states along the trajectory as additional goals, which provides more training signal to the agent. This technique has been shown to significantly improve training speed and final performance in goal-directed problems where the reward signal is sparse and binary.
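To make the relabeling idea concrete, here is a minimal sketch of sampling hindsight goals from later states of the same trajectory. The function name `sample_future_goals`, the list layout of `achieved_states`, and the parameter `k` are assumptions made for illustration; they are not taken from the original papers.

```python
import random


def sample_future_goals(achieved_states, t, k, rng=random):
    """Sample k hindsight goals for step t from states visited after step t.

    achieved_states is assumed to be the list of states visited along one
    trajectory, in order. Returns an empty list if no later state exists.
    """
    future = achieved_states[t + 1:]
    if not future:
        return []
    # Sample with replacement: the same future state may be picked twice.
    return [rng.choice(future) for _ in range(k)]
```

Each sampled goal would then be paired with the transition at step `t`, so that a trajectory that missed its original goal still yields transitions in which some goal is reached.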
Hindsight Policy Gradients
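It is theoretically sound to apply hindsight experience directly to one-step Q-learning-style methods: the action at the current step is already specified as an argument of the Q-function, so no importance sampling is required. This is not the case for policy-gradient methods, because a trajectory collected while pursuing one goal is not distributed as if the policy had been pursuing another. Writing $g'$ for the goal under which a trajectory was actually collected, we therefore have to apply importance sampling to Eq.(1):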
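$$
\begin{aligned}
\nabla_\theta J(\theta) &= \sum_g p(g) \sum_\tau p(\tau \mid g', \theta)\, \frac{p(\tau \mid g, \theta)}{p(\tau \mid g', \theta)} \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, A(s_t, a_t, g) \\
&= \sum_g p(g) \sum_\tau p(\tau \mid g', \theta) \underbrace{\prod_{t'=1}^{T-1} \frac{p(a_{t'} \mid s_{t'}, g, \theta)}{p(a_{t'} \mid s_{t'}, g', \theta)}}_{\text{expand trajectory}} \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, A(s_t, a_t, g) \\
&= \sum_g p(g) \sum_\tau p(\tau \mid g', \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \underbrace{\prod_{t'=1}^{t} \frac{p(a_{t'} \mid s_{t'}, g, \theta)}{p(a_{t'} \mid s_{t'}, g', \theta)}}_{\text{causality}} A(s_t, a_t, g)
\end{aligned}
\tag{2}
$$

In the second step, we expand the trajectory probabilities and cancel the transition probabilities, which do not depend on the goal. In the last step, we move the importance-sampling ratios inside the sum over time steps and apply causality to drop the future ratios, which are unrelated to the reward at the current time step.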
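The per-step ratio that survives the causality argument is a running product of action-probability ratios. A minimal sketch of computing it for one trajectory is given below; the function name `cumulative_ratios` and the choice to pass per-step log-probabilities are assumptions made for illustration, and the sum is accumulated in log space only to reduce the risk of numerical underflow.

```python
import math


def cumulative_ratios(logp_hindsight, logp_original):
    """Return, for each step t, the product over t' <= t of
    p(a_t' | s_t', g) / p(a_t' | s_t', g').

    logp_hindsight[t] is log p(a_t | s_t, g) under the hindsight goal g,
    logp_original[t] is log p(a_t | s_t, g') under the original goal g'.
    """
    ratios = []
    log_ratio = 0.0
    for lp_g, lp_g_prime in zip(logp_hindsight, logp_original):
        # Add this step's log ratio to the running sum, then exponentiate.
        log_ratio += lp_g - lp_g_prime
        ratios.append(math.exp(log_ratio))
    return ratios
```

For example, with action probabilities (0.5, 0.2) under the hindsight goal and (0.25, 0.4) under the original goal, the per-step ratios are 2 and 0.5, so the running products are 2 and 1.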
In practice, we approximate Eq.(2) with a batch of trajectories and goals $\{(\tau^i, g^i)\}_{i=1}^{N}$ as follows:
$$
\nabla_\theta J(\theta) \approx \sum_{i=1}^{N} \sum_{t=1}^{T-1} \nabla \log p(a_t^i \mid s_t^i, g^i, \theta) \prod_{t'=1}^{t} \frac{p(a_{t'}^i \mid s_{t'}^i, g^i, \theta)}{p(a_{t'}^i \mid s_{t'}^i, g', \theta)}\, A(s_t^i, a_t^i, g^i)
$$

In preliminary experiments, Rauber et al. (2019) found that this estimator leads to unstable learning, probably because of its high variance. They therefore propose weighted importance sampling, which reduces variance at the cost of some bias and gives the final gradient estimate:
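$$
\nabla_\theta J(\theta) \approx \sum_{i=1}^{N} \sum_{t=1}^{T-1} \nabla \log p(a_t^i \mid s_t^i, g^i, \theta)\, \frac{\prod_{t'=1}^{t} \frac{p(a_{t'}^i \mid s_{t'}^i, g^i, \theta)}{p(a_{t'}^i \mid s_{t'}^i, g', \theta)}}{\sum_{j=1}^{N} \prod_{t'=1}^{t} \frac{p(a_{t'}^j \mid s_{t'}^j, g^i, \theta)}{p(a_{t'}^j \mid s_{t'}^j, g', \theta)}}\, A(s_t^i, a_t^i, g^i)
$$

Interestingly, the authors found that applying baselines does not help HPG much in their experiments.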
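The normalization used by weighted importance sampling can be sketched as follows for a single hindsight goal. This is an illustration under stated assumptions, not the authors' implementation: the function name `normalize_weights` is hypothetical, all trajectories are assumed to have the same length, and `ratios[j][t]` is assumed to hold the cumulative importance ratio of trajectory `j` at step `t`.

```python
def normalize_weights(ratios):
    """Normalize cumulative importance ratios across trajectories, per step.

    ratios[j][t] is the cumulative ratio of trajectory j at step t.
    Returns weights of the same shape that sum to one over j for each t.
    Assumes equal-length trajectories and a positive total at every step.
    """
    n_steps = len(ratios[0])
    # Total ratio mass at each step, summed over all trajectories.
    totals = [sum(traj[t] for traj in ratios) for t in range(n_steps)]
    return [[traj[t] / totals[t] for t in range(n_steps)] for traj in ratios]
```

Because the weights at each step sum to one, a single trajectory with a very large ratio can no longer scale the whole gradient, which is where the variance reduction comes from; the price is that the estimator is no longer unbiased.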
Experimental Results
The authors test the agent on several environments in which the agent receives a reward equal to the remaining number of time steps plus one upon reaching the goal state, which also ends the episode; in every other situation it receives no reward.
Since the experiments are not our focus here, we refer readers to the official HPG website for more information about the experimental results: http://paulorauber.com/hpg
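As a small illustration of this reward scheme, the sketch below reads the reward as the remaining number of time steps plus one, paid only when the goal is reached. The function name `sparse_reward`, the 1-indexed step `t`, and the definition of "remaining" as `horizon - t` are assumptions made for illustration.

```python
def sparse_reward(reached_goal, t, horizon):
    """Reward for the transition at step t (1-indexed) of an episode
    that lasts at most `horizon` steps.

    Reaching the goal pays the remaining steps (horizon - t) plus one;
    every other transition pays zero.
    """
    if reached_goal:
        return (horizon - t) + 1
    return 0
```

Under these assumptions, reaching the goal earlier always pays more than reaching it later, so the return favors short paths even though the signal is sparse.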
References
Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. 2017. “Hindsight Experience Replay.” Advances in Neural Information Processing Systems (NIPS): 5049–59.
Rauber, Paulo, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. 2019. “Hindsight Policy Gradients.” ICLR.