Introduction
Model-free algorithms have been shown to achieve good results on a variety of tasks, but they generally suffer from very high sample complexity. Model-based methods, on the other hand, are more sample-efficient and more flexible, but their asymptotic performance is usually worse than that of model-free methods due to model bias. Nagabandi et al. (2017) showed that using a model-based algorithm to initialize a model-free learner can significantly speed up learning (a \( 3\text{–}5\times \) gain in sample efficiency) and may even achieve a better final result.
Model-Based Deep Reinforcement Learning
Neural Network Dynamics Function
A straightforward way to learn the dynamics function \( f \), parameterized as a deep neural network, is to take the current state \( s_t \) and action \( a_t \) as input and output the predicted next state \( \hat s_{t+1} \). This function, however, can be difficult to learn when the states \( s_t \) and \( s_{t+1} \) are too similar and the action has seemingly little effect on the output, a problem that is further exacerbated when the time interval \( \Delta t \) is small.
To overcome this issue, the authors propose to predict the change in state over the time step duration \( \Delta t \). That is, instead of learning \( f(s_t,a_t)=\hat s_{t+1} \), \( f \) learns \( f(s_t,a_t)=\hat s_{t+1}-s_t \), so the next state is recovered as \( \hat s_{t+1}=s_t+f(s_t,a_t) \).
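For concreteness, here is a minimal PyTorch sketch of such a delta-predicting dynamics model; the `DynamicsModel` name and the two-hidden-layer architecture are illustrative choices, not a definitive implementation:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """f(s_t, a_t) -> predicted state change, so that s_{t+1} ~ s_t + f(s_t, a_t)."""

    def __init__(self, state_dim, action_dim, hidden=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        # predict the state delta from the concatenated state and action
        return self.net(torch.cat([s, a], dim=-1))

    def predict_next(self, s, a):
        # the network outputs the delta; add it back to get the next state
        return s + self.forward(s, a)
```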
Network Training Details
The loss for the dynamics function is defined to be the average \(H\)-step open-loop prediction error. That is,
\[\begin{align} \mathcal L(\theta)=&\mathbb E_{(s_t,a_t,\dots,s_{t+H})\sim\mathcal D}\left[{1\over H}\sum_{h=1}^H{1\over 2}\Vert s_{t+h}-\hat s_{t+h}\Vert^2\right]\tag 1\\\ \mathrm{where\quad}\hat s_{t+h}=&\begin{cases}s_t& h=0\\\ \hat s_{t+h-1}+f(\hat s_{t+h-1}, a_{t+h-1})&h>0\end{cases} \end{align}\]This loss measures how well the model predicts over an \(H\)-step horizon, which makes sense since we will ultimately use the model for long-horizon control.
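Here is a minimal sketch of Eq. (1) for a single trajectory segment, assuming `model` is a delta-predicting network like the one above (batching and the expectation over the dataset are omitted for brevity):

```python
import torch

def h_step_loss(model, states, actions, H):
    """Average H-step open-loop prediction error, Eq. (1).

    states:  tensor of shape (H+1, state_dim), ground truth s_t ... s_{t+H}
    actions: tensor of shape (H, action_dim),  actions a_t ... a_{t+H-1}
    """
    s_hat = states[0]                      # h = 0: start from the true state
    loss = 0.0
    for h in range(1, H + 1):
        # open-loop: feed the model's own prediction back in
        s_hat = s_hat + model(s_hat, actions[h - 1])
        loss = loss + 0.5 * torch.sum((states[h] - s_hat) ** 2)
    return loss / H
```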
The detailed algorithm is given below:
\[\begin{align} 1:\ &\mathrm{Collect\ dataset\ }\mathcal D_{rand}\ \mathrm{by\ executing\ random\ actions} \\\ 2:\ &\mathrm{Initialize\ empty\ dataset\ }\mathcal D_{f},\ \mathrm{and\ model\ network\ }f_{\theta}\\\ 3:\ &\mathbf{For}\ i=1,\dots:\\\ 4:\ &\quad \mathrm{Optimize\ }f_{\theta}\mathrm{\ on\ loss\ defined\ in\ Eq.(1)\ using\ }\mathcal D_{rand}\ \mathrm{and\ }\mathcal D_f\\\ 5:\ &\quad \mathbf{For}\ t=1,\dots:\\\ 6:\ &\quad\quad \mathrm{Use\ the\ model\ to\ do\ planning\ from\ the\ current\ state\ }s_t\\\ 7:\ &\quad\quad \mathrm{Execute\ the\ first\ action\ }a_t\\\ 8:\ &\quad\quad \mathrm{Add\ }(s_t,a_t)\ \mathrm{to}\ \mathcal D_f \end{align}\]Some additional notes:
- The data are normalized so that the loss function weights the different dimensions of the input equally. Gaussian noise is added to both inputs and outputs to increase model robustness.
- In step 6, we can use simple rollout-based planning (e.g., random shooting: sample candidate action sequences, evaluate them under the learned model, and pick the best) or more sophisticated methods such as MCTS or iLQR. Together with step 7, this comprises a model predictive controller (MPC); see the sketch after this list.
- Steps 5-8 are the standard process of data aggregation (DAgger), which mitigates the mismatch between the data’s state-action distribution and the model-based controller’s distribution, thereby improving performance.
- A benefit of this MPC controller is that the model can be reused to achieve a variety of goals at run time by simply changing the reward function.
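To make step 6 concrete, here is a minimal random-shooting MPC sketch, assuming a batched `model(states, actions)` that returns predicted state changes and a batched `reward_fn(states, actions)`; the hyperparameter values are illustrative:

```python
import torch

def mpc_first_action(model, reward_fn, s_t, action_low, action_high,
                     horizon=10, n_candidates=1000):
    """Random-shooting MPC (steps 6-7): sample candidate action sequences,
    score them under the learned model, execute only the best first action."""
    action_dim = action_low.shape[0]
    # K candidate action sequences, sampled uniformly at random
    seqs = torch.rand(n_candidates, horizon, action_dim) \
           * (action_high - action_low) + action_low
    states = s_t.expand(n_candidates, -1)
    returns = torch.zeros(n_candidates)
    with torch.no_grad():
        for h in range(horizon):
            a = seqs[:, h, :]
            returns += reward_fn(states, a)
            states = states + model(states, a)   # model predicts the delta
    best = torch.argmax(returns)
    return seqs[best, 0, :]                      # replan at the next step
```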
Model-Free Deep Reinforcement Learning
The neural network policy used by the model-free algorithm is first initialized by imitating the MPC controller with DAgger. The initialization algorithm is as follows:
\[\begin{align} 1:\ &\mathrm{Collect\ dataset\ }\mathcal D_{mb}\ \mathrm{using\ the\ MPC\ controller\ with\ the\ help\ of\ model\ }f_\theta \\\ 2:\ &\mathrm{Initialize\ policy\ network\ }f_{\phi}\\\ 3:\ &\mathrm{Optimize\ }f_{\phi}\ \mathrm{using}\ \mathcal D_{mb}\\\ 4:\ &\mathbf{For}\ i=1,\dots:\\\ 5:\ &\quad \mathrm{Collect\ dataset\ }\{s_1,\dots,s_T\}\ \mathrm{using\ the\ learned\ policy\ network}\ f_{\phi}\\\ 6:\ &\quad \mathrm{Query\ the\ MPC\ controller\ for\ action\ labels,\ }\mathcal D=\{(s_1, a_1),\dots,(s_T,a_T)\}\\\ 7:\ &\quad \mathrm{Optimize\ }f_\phi\ \mathrm{using}\ \mathcal D\\\ \end{align}\]This policy network is then used as the initial policy for a model-free reinforcement learning algorithm (TRPO in the paper).
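A condensed sketch of the DAgger loop (steps 4-7), assuming an old-style Gym environment, a `policy` already pretrained on \( \mathcal D_{mb} \) (steps 1-3), and an `expert` function that wraps the MPC controller above; all names and hyperparameters are illustrative:

```python
import numpy as np
import torch
import torch.nn as nn

def dagger_init(policy, expert, env, n_iters=3, n_steps=1000, epochs=50):
    """Initialize policy f_phi by imitating the MPC expert (steps 4-7).

    policy: nn.Module mapping state -> action
    expert: function state -> expert action from the MPC controller
    """
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    states, labels = [], []                      # aggregated across iterations
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(n_steps):
            states.append(s)                     # visit states under f_phi ...
            labels.append(expert(s))             # ... but label them with MPC
            with torch.no_grad():
                a = policy(torch.as_tensor(s, dtype=torch.float32)).numpy()
            s, _, done, _ = env.step(a)
            if done:
                s = env.reset()
        S = torch.as_tensor(np.array(states), dtype=torch.float32)
        A = torch.as_tensor(np.array(labels), dtype=torch.float32)
        for _ in range(epochs):                  # behavior cloning (step 7)
            opt.zero_grad()
            nn.functional.mse_loss(policy(S), A).backward()
            opt.step()
    return policy
```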
References
Nagabandi, A., Kahn, G., Fearing, R. S., & Levine, S. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. 2017.