by Sherwin Chen

Introduction

Model-free algorithms have been shown to achieve good results in a variety of tasks, but they generally suffer from very high sample complexity. Model-based methods, on the other hand, are more sample-efficient and more flexible, but their asymptotic performance is usually worse than that of model-free methods due to model bias. Anusha Nagabandi et al. (2017) showed that using a model-based algorithm to initialize a model-free learner can significantly speed up learning (a \( 3\text{-}5\times \) gain in sample efficiency) and may even achieve a better final result.

Model-Based Deep Reinforcement Learning

Neural Network Dynamics Function

A straightforward way to learn the dynamics function \( f \) as a deep neural network is to take the current state \( s_t \) and action \( a_t \) as input and output the predicted next state \( \hat s_{t+1} \). This function, however, can be difficult to learn when the states \( s_t \) and \( s_{t+1} \) are too similar and the action has seemingly little effect on the output, which may be further exacerbated when the time interval \( \Delta t \) is small.

To overcome this issue, the authors propose to predict the change in state over the time-step duration \( \Delta t \). That is, instead of learning \( f(s_t,a_t)=\hat s_{t+1} \), \( f \) learns \( f(s_t,a_t)=\hat s_{t+1}-s_t \).
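To make this concrete, here is a minimal sketch (not the paper's code) of such a dynamics network in PyTorch; the class name `DynamicsModel`, the layer sizes, and the activations are my own illustrative choices.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the state change s_{t+1} - s_t from (s_t, a_t)."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),  # outputs the state delta, not the next state
        )

    def forward(self, state, action):
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta  # predicted next state \hat s_{t+1}
```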

Network Training Details

The loss for the dynamics function is defined to be the average \(H\)-step open-loop prediction error. That is,

\[\begin{align} \mathcal L(\theta)=&\mathbb E\left[{1\over H}\sum_{h=1}^H{1\over 2}\Vert s_{t+h}-\hat s_{t+h}\Vert^2\right]\tag 1\\\ \mathrm{where\quad}\hat s_{t+h}=&\begin{cases}s_t& h=0\\\ \hat s_{t+h-1}+f(\hat s_{t+h-1}, a_{t+h-1})&h>0\end{cases} \end{align}\]

This loss measures how well the model predicts over an \(H\)-step horizon, which makes sense since we will ultimately use the model for long-horizon control.
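Below is a rough sketch of this loss under the `DynamicsModel` above; it assumes mini-batches of \(H+1\) consecutive states and \(H\) actions, and the tensor shapes are illustrative rather than taken from the paper.

```python
def open_loop_loss(model, states, actions, H):
    """Average H-step open-loop prediction error (Eq. (1)).

    states:  (batch, H + 1, state_dim) -- s_t, ..., s_{t+H}
    actions: (batch, H, action_dim)    -- a_t, ..., a_{t+H-1}
    """
    pred = states[:, 0]  # \hat s_t = s_t
    loss = 0.0
    for h in range(H):
        pred = model(pred, actions[:, h])  # \hat s_{t+h+1} from the model's own prediction
        loss = loss + 0.5 * ((states[:, h + 1] - pred) ** 2).sum(dim=-1).mean()
    return loss / H
```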

The detailed algorithm is given below

\[\begin{align} 1:\ &\mathrm{Collect\ dataset\ }\mathcal D_{rand}\ \mathrm{by\ executing\ random\ actions} \\\ 2:\ &\mathrm{Initialize\ empty\ dataset\ }\mathcal D_{f},\ \mathrm{and\ model\ network\ }f_{\theta}\\\ 3:\ &\mathbf{For}\ i=1,\dots:\\\ 4:\ &\quad \mathrm{Optimize\ }f_{\theta}\mathrm{\ on\ loss\ defined\ in\ Eq.(1)\ using\ }\mathcal D_{rand}\ \mathrm{and\ }\mathcal D_f\\\ 5:\ &\quad \mathbf{For}\ t=1,\dots:\\\ 6:\ &\quad\quad \mathrm{Use\ the\ model\ to\ do\ planning\ from\ the\ current\ state\ }s_t\\\ 7:\ &\quad\quad \mathrm{Execute\ the\ first\ action\ }a_t\\\ 8:\ &\quad\quad \mathrm{Add\ }(s_t,a_t)\ \mathrm{to}\ \mathcal D_f \end{align}\]

Some additional notes:

  1. The dataset is normalized so that the loss function weights the different dimensions of the input equally. Gaussian noise is added to both inputs and outputs to increase model robustness.
  2. In step 6, planning can be done with a simple random-shooting method (sampling candidate action sequences and evaluating them under the model) or with more sophisticated planners such as MCTS or iLQR. Together with step 7, this constitutes a model predictive controller (MPC); see the sketch after this list.
  3. Steps 5-8 follow the standard data-aggregation (DAgger) procedure, which mitigates the mismatch between the data’s state-action distribution and the model-based controller’s distribution, thereby improving performance.
  4. A benefit of this MPC controller is that the model can be reused to achieve a variety of goals at run time by simply changing the reward function.
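The following is a hedged sketch of one random-shooting MPC step with the learned model; `reward_fn`, the candidate count, the horizon, and the action bounds are assumptions for illustration, not the paper's exact settings.

```python
import torch

def mpc_action(model, reward_fn, state, action_dim, horizon=10,
               n_candidates=1000, action_low=-1.0, action_high=1.0):
    """Random-shooting MPC: sample candidate action sequences, roll them out
    through the learned model, and return the first action of the best one."""
    with torch.no_grad():
        # Candidate action sequences: (n_candidates, horizon, action_dim)
        actions = torch.empty(n_candidates, horizon, action_dim).uniform_(action_low, action_high)
        # Start every candidate rollout from the current state
        states = state.unsqueeze(0).expand(n_candidates, -1)
        returns = torch.zeros(n_candidates)
        for h in range(horizon):
            returns = returns + reward_fn(states, actions[:, h])  # (n_candidates,) predicted rewards
            states = model(states, actions[:, h])                 # step the learned dynamics
        best = returns.argmax()
        return actions[best, 0]  # execute only the first action, then re-plan (MPC)
```

Because the reward is evaluated only at planning time, swapping in a different `reward_fn` retargets the same learned model to a new goal, as note 4 describes.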

Model-Free Deep Reinforcement Learning

The neural network policy used by the model-free algorithm is first initialized by the MPC controller through imitation learning with DAgger. The initialization algorithm is given as follows:

\[\begin{align} 1:\ &\mathrm{Collect\ dataset\ }\mathcal D_{mb}\ \mathrm{using\ the\ MPC\ controller\ with\ the\ help\ of\ model\ }f_\theta \\\ 2:\ &\mathrm{Initialize\ policy\ network\ }f_{\phi}\\\ 3:\ &\mathrm{Optimize\ }f_{\phi}\ \mathrm{using\ }\mathcal D_{mb}\\\ 4:\ &\mathbf{For}\ i=1,\dots:\\\ 5:\ &\quad \mathrm{Collect\ dataset\ }\{s_1,\dots,s_T\}\ \mathrm{using\ the\ learned\ policy\ network\ }f_{\phi}\\\ 6:\ &\quad \mathrm{Query\ the\ MPC\ controller\ for\ action\ labels,\ }\mathcal D=\{(s_1, a_1),\dots,(s_T,a_T)\}\\\ 7:\ &\quad \mathrm{Optimize\ }f_\phi\ \mathrm{using\ }\mathcal D \end{align}\]

This policy net is then used as an initial policy for a model-free reinforcement learning algorithm.
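A rough sketch of the DAgger iterations (steps 4-7) follows; it assumes a Gymnasium-style environment API, the `mpc_action` sketch above as the expert, a deterministic policy network mapping states to actions, and a single full-batch behavior-cloning update per iteration for brevity.

```python
import torch
import torch.nn.functional as F

def imitate_mpc(policy, optimizer, model, reward_fn, env, n_iters=5, T=1000):
    """DAgger-style initialization: run the current policy, relabel the visited
    states with the MPC controller's actions, and regress the policy onto them."""
    for _ in range(n_iters):
        states, labels = [], []
        s, _ = env.reset()
        for _ in range(T):
            s_tensor = torch.as_tensor(s, dtype=torch.float32)
            a = policy(s_tensor).detach().numpy()          # act with the learned policy
            states.append(s_tensor)
            labels.append(mpc_action(model, reward_fn, s_tensor,
                                     action_dim=len(a)))   # expert label from the MPC controller
            s, _, terminated, truncated, _ = env.step(a)
            if terminated or truncated:
                s, _ = env.reset()
        # Behavior cloning on the relabeled data
        states, labels = torch.stack(states), torch.stack(labels)
        loss = F.mse_loss(policy(states), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```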

References

Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, Sergey Levine. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. 2017.