Introduction
When a task requires a precise sequence of \(N\) actions before it yields any reward, the probability that a random agent obtains the reward decays exponentially with \(N\), which makes naive exploration increasingly inefficient as \(N\) grows. We discuss a curriculum learning method, proposed by Salimans and Chen (2018), that trains a policy-based agent by starting each episode from a point near the end of a successful demonstration and gradually moving that starting point backward in time.
For simplicity, we call the method Backward in the rest of the post.
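To make the exploration difficulty concrete (the numbers here are a back-of-the-envelope illustration, not from the paper): with \(|A|\) discrete actions, a uniformly random policy produces one specific length-\(N\) action sequence with probability \(|A|^{-N}\). For Atari's 18 actions and a modest \(N=100\), this is \(18^{-100} < 10^{-125}\), so a random agent essentially never sees the reward without some form of guidance.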
Method
Given a previously recorded demonstration \(\{\tilde s_t,\tilde a_t,\tilde r_t\}_{t=0}^T\), Backward lets each rollout worker start its episode from a demonstration state \(\tilde s_{\tau^*}\), where \(\tau^*\) is sampled uniformly from \([\tau-D,\tau]\). Here \(D\) is the number of possible starting points, and \(\tau\) is the current reset point: it lies near the end of the demonstration (time \(T\)) early in training and gradually moves back in time as training proceeds. The data produced by the rollout workers is fed to a central optimizer that updates the policy. In addition, the central optimizer computes the proportion of rollouts that beat or at least tie the demonstration's score on the corresponding part of the game; once this proportion exceeds a threshold \(\rho=20\%\), the reset point \(\tau\) is moved backward in the demonstration.
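To make the curriculum logic concrete, here is a minimal Python sketch of the reset-point update. The helpers `run_episode`, `demo.states`, and `demo.return_from` are hypothetical placeholders for illustration, not part of the original implementation:

```python
import random

# Hyperparameter values below are illustrative, not the paper's exact settings.
D = 100        # width of the window of starting points
RHO = 0.20     # success-rate threshold for moving the reset point back
BATCH = 1024   # rollouts collected between curriculum checks

def update_reset_point(tau, demo, policy, run_episode):
    """Collect a batch of rollouts starting in [tau - D, tau] and move the
    reset point tau backward once enough of them match or beat the demo."""
    successes = 0
    for _ in range(BATCH):
        # Each rollout worker starts its episode near the current reset point.
        tau_star = random.randint(max(0, tau - D), tau)
        score = run_episode(policy, start_state=demo.states[tau_star])
        # Compare against the demonstration's score on the same remaining segment.
        if score >= demo.return_from(tau_star):
            successes += 1
    if successes / BATCH >= RHO:
        tau = max(0, tau - D)  # shift the curriculum earlier in the demonstration
    return tau
```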
Backward uses an RNN policy for contextual inference. In each episode, the RNN's hidden state is initialized by taking \(K\) actions based on the demonstration segment \(\{\tilde s_i,\tilde a_i,\tilde r_i\}_{i=\tau^*-K}^{\tau^*-1}\) directly preceding the local starting point \(\tau^*\); after that, the agent takes actions according to its current policy. The demonstration segment is used only for this hidden-state initialization, not for training.
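One way to read this warm-up step is sketched below. The interfaces `env.restore`, `env.step`, `policy.step`, and `policy.initial_hidden` are assumed stand-ins for whatever the actual implementation uses; only the final hidden state is kept, and none of these transitions enter the training batch:

```python
import torch

def warm_up_hidden_state(env, policy, demo, tau_star, K):
    """Replay the K demonstration steps preceding tau_star so the RNN's hidden
    state carries context before the agent's own policy takes over."""
    obs = env.restore(demo.states[tau_star - K])  # reset to the start of the segment
    hidden = policy.initial_hidden()
    for i in range(tau_star - K, tau_star):
        with torch.no_grad():                     # no gradients: this segment is not trained on
            _, hidden = policy.step(obs, hidden)  # update the hidden state only
        obs, _, _, _ = env.step(demo.actions[i])  # replay the recorded demonstration action
    return obs, hidden                            # the learned policy acts from here on
```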
Experiments
Backward trains PPO with a network significantly larger than the Nature CNN used in DQN, which the authors found necessary for beating the demonstration. The agent was trained for about 50 billion frames on \(128\) GPUs with \(8\) workers each, for a total of \(M=1024\) rollout workers.
Although Backward successfully beats the demonstration on Montezuma’s Revenge, it does not work on other games such as Gravitar and Pitfall: Salimans and Chen (2018) were unable to find hyperparameters that could train the full curriculum for these games.
References
Salimans, Tim, and Richard Chen. 2018. “Learning Montezuma’s Revenge from a Single Demonstration.” NIPS 2018.