by Sherwin Chen

Introduction

Importance sampling is a cornerstone of off-policy learning: it provides a way to estimate expected values under the target policy given experiences sampled from the behavior policy.

Why Importance Sampling?

All control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions. To deal with this dilemma, on-policy approaches make a compromise: they make do with a near-optimal policy that is used for both policy evaluation and exploration. Off-policy methods, on the other hand, employ separate behavior and target policies, so that the agent can learn about the optimal policy while behaving according to an exploratory policy.

However, we cannot directly apply different policies without any modification, because the data used to evaluate the target policy comes from a different policy. We need a technique for estimating expected values under one distribution (the target policy) given samples from another (the behavior policy), and this is where importance sampling comes in. More specifically, importance sampling transforms expected returns under the behavior policy into expected values under the target policy using the importance sampling ratio, which weights returns according to the relative probability of their trajectories occurring under the target and behavior policies:

\[\begin{align} \rho_t := {P_{target}(X_t)\over P_{behavior}(X_t)} \end{align}\]

where \( P_{target} \) is the distribution generated by the target policy and \( P_{behavior} \) is the distribution generated by the behavior policy. We can see that \( E_{P_{behavior}}\left[\rho_t X \right] = E_{P_{target}}\left[ X \right] \):

\[\begin{align} E_{P_{behavior}}\left[\rho_t X \right] &= \int {P_{target}(x)\over P_{behavior}(x)}x{P_{behavior}(x)} dx\\\ &= \int xP_{target}(x)dx \\\ &= E_{P_{target}}\left[ X \right] \end{align}\]
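
As a sanity check, this identity can be verified numerically. Below is a minimal sketch (using two hypothetical discrete distributions of my own choosing, not from the original derivation) that reweights samples drawn from a behavior distribution and compares the result with the true expectation under the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete distributions over 3 outcomes with values x = [1, 2, 3].
x = np.array([1.0, 2.0, 3.0])
p_target = np.array([0.7, 0.2, 0.1])    # target distribution
p_behavior = np.array([0.2, 0.3, 0.5])  # behavior distribution (covers the target)

# Sample from the behavior distribution and reweight with rho = p_target / p_behavior.
idx = rng.choice(len(x), size=100_000, p=p_behavior)
rho = p_target[idx] / p_behavior[idx]
print("weighted estimate:", np.mean(rho * x[idx]))  # approx. E_target[X]
print("true target mean: ", np.sum(p_target * x))   # = 1.4
```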

The next section derives the importance sampling ratio for n-step TD learning, which covers both one-step TD and Monte Carlo methods as special cases.

Importance Sampling in n-step TD Learning

n-step TD Learning

Before we start talking about importance sampling, let's briefly recap n-step TD learning, which relies on the n-step return, defined below, to adjust the value estimate of the current state:

\[\begin{align} G_t^{(n)}=\sum_{i=0}^{n-1}\gamma^iR(S_{t+i})+\gamma^nV(S_{t+n}) \end{align}\]

where \( \gamma \) is the discount factor, \( R(s) \) is the reward at state \( s \), and \( V(s) \) is the state value at state \( s \) . Also, for simplicity, we assume \( t+n<T \).
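
To make the definition concrete, here is a small helper of my own (a sketch, assuming rewards and the bootstrap value are passed in directly) that computes the n-step return:

```python
def n_step_return(rewards, v_next, gamma, n):
    """Compute G_t^{(n)} = sum_{i=0}^{n-1} gamma^i * R(S_{t+i}) + gamma^n * V(S_{t+n}).

    rewards: list [R(S_t), ..., R(S_{t+n-1})] of length n
    v_next:  V(S_{t+n}), the bootstrap value
    """
    assert len(rewards) == n
    g = sum(gamma ** i * r for i, r in enumerate(rewards))
    return g + gamma ** n * v_next
```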

The update rule for on-policy n-step TD learning is then

\[\begin{align} V(S_t)=V(S_t)+\alpha (G_t^{(n)}-V(S_t)) \end{align}\]

To keep the notation simple, we refer to the n-step TD error \( G_t^{(n)}-V(S_t) \) as \( \delta_t^{(n)} \).
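
Using the helper above, the on-policy update is just a couple of lines. This is a sketch under assumed bookkeeping: a tabular value function `V` and time-indexed lists `states` and `rewards`, with `alpha` as the learning rate.

```python
# On-policy n-step TD update for state S_t (hypothetical tabular setup).
delta = n_step_return(rewards[t:t + n], V[states[t + n]], gamma, n) - V[states[t]]
V[states[t]] += alpha * delta  # delta is delta_t^{(n)}
```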

Importance Sampling Ratio Conditioned on State

In the introduction, we described how importance sampling uses the importance sampling ratio to transform expected returns under the behavior policy. The importance sampling ratio, in turn, is the ratio of the probability that the trajectory occurs under the target policy to the probability that it occurs under the behavior policy. Hence, in order to compute the importance sampling ratio, we first need to figure out the probability of a trajectory occurring under a policy. Given a start state \( S_t \) and a policy \( \pi \), this probability is defined by

\[\begin{align} &P_\pi(A_t, S_{t+1}, A_{t+1},\dots,S_{t+n}|S_t)\\\ =&\pi(A_t|S_t)P(S_{t+1}|S_t,A_t)\dots\pi(A_{t+n-1}|S_{t+n-1})P(S_{t+n}|S_{t+n-1},A_{t+n-1})\\\ =&\prod_{k=t}^{t+n-1}\pi(A_k|S_k)P(S_{k+1}|S_k,A_k) \end{align}\]

Thus, the importance sampling ratio given a start state \( S_t \) is

\[\begin{align} \rho_{t:t+n-1}&={P_{\pi}(A_t,S_{t+1},A_{t+1}\dots S_{t+n}|S_t)\over P_{b}(A_t,S_{t+1},A_{t+1}\dots S_{t+n}|S_t)}\\\ &={\prod_{k=t}^{t+n-1} \pi(A_{k}|S_{k})p(S_{k+1}|S_k, A_k)\over \prod_{k=t}^{t+n-1} b(A_{k}|S_{k})p(S_{k+1}|S_k, A_k)}\\\ &=\prod_{k=t}^{t+n-1}{\pi(A_k|S_k)\over b(A_k|S_k)} \end{align}\]

where \( \pi \) is the target policy and \( b \) is the behavior policy.

Notice that the transition probabilities appear identically in the numerator and denominator, and thus cancel. The importance sampling ratio therefore depends only on the sequence of states and actions and on the two policies, not on the MDP dynamics.
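
Since the dynamics cancel, computing the ratio only requires the two policies' action probabilities along the trajectory. A minimal sketch, assuming the policies are callables that return \( \pi(a\vert s) \) and \( b(a\vert s) \):

```python
import math

def importance_ratio_state(states, actions, target_policy, behavior_policy):
    """rho_{t:t+n-1} = prod_{k=t}^{t+n-1} pi(A_k | S_k) / b(A_k | S_k).

    states, actions: S_t..S_{t+n-1} and A_t..A_{t+n-1} (length n each).
    target_policy(a, s), behavior_policy(a, s): probability of action a in state s.
    """
    return math.prod(
        target_policy(a, s) / behavior_policy(a, s)
        for s, a in zip(states, actions)
    )
```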

Now we can formulate the off-policy update rule for the state value:

\[\begin{align} V(S_t)=V(S_t)+\alpha \rho_{t:t+n-1}\delta_t^{(n)} \end{align}\]
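
Putting the ratio and the n-step TD error together, under the same hypothetical tabular setup as before:

```python
# Off-policy n-step TD update for the state value.
rho = importance_ratio_state(states[t:t + n], actions[t:t + n],
                             target_policy, behavior_policy)
delta = n_step_return(rewards[t:t + n], V[states[t + n]], gamma, n) - V[states[t]]
V[states[t]] += alpha * rho * delta
```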

Importance Sampling Ratio Conditioned on State and Action

Given a start state-action pair \( S_t, A_t \), the probability of a trajectory occurring under a policy \( \pi \) becomes

\[\begin{align} &P_\pi(S_{t+1}, A_{t+1},\dots,S_{t+n}|S_t,A_t)\\\ =&P(S_{t+1}|S_t,A_t)\prod_{k=t+1}^{t+n-1}\pi(A_k|S_k)P(S_{k+1}|S_k,A_k) \end{align}\]

Hence, the importance sampling ratio given a start state-action pair \( S_t, A_t \) is

\[\begin{align} \rho_{t:t+n-1}&={P_{\pi}(S_{t+1},A_{t+1}\dots S_{t+n}|S_t,A_t)\over P_{b}(S_{t+1},A_{t+1}\dots S_{t+n}|S_t,A_t)}\\\ &={p(S_{t+1}|S_t, A_t)\prod_{k=t+1}^{t+n-1} \pi(A_{k}|S_{k})p(S_{k+1}|S_k, A_k)\over p(S_{t+1}|S_t, A_t)\prod_{k=t+1}^{t+n-1} b(A_{k}|S_{k})p(S_{k+1}|S_k, A_k)}\\\ &=\prod_{k=t+1}^{t+n-1}{\pi(A_k|S_k)\over b(A_k|S_k)} \end{align}\]

Notice that the importance sampling ratio starts from \( t \) in the case of the state value, whereas for the action value it starts from \( t + 1 \). This difference arises because the action value already specifies the action taken at time \( t \), so no correction is needed for that action.

This also explains why off-policy one-step TD algorithms that use action values instead of state values, such as Q-learning and Expected SARSA, seem not to rely on importance sampling at first glance: when \( n=1 \), the product runs from \( k=t+1 \) to \( t+n-1=t \) and is therefore empty, so the importance sampling ratio is \( 1 \).
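
The only change relative to the state-value case is the starting index of the product. A sketch under the same assumptions as before:

```python
def importance_ratio_action(states, actions, target_policy, behavior_policy):
    """rho_{t:t+n-1} = prod_{k=t+1}^{t+n-1} pi(A_k | S_k) / b(A_k | S_k).

    states, actions: S_t..S_{t+n-1} and A_t..A_{t+n-1}; the pair at time t is
    skipped because the action value already conditions on A_t.
    For n = 1 the product is empty and the ratio is 1.
    """
    return math.prod(
        target_policy(a, s) / behavior_policy(a, s)
        for s, a in zip(states[1:], actions[1:])
    )
```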

The update rule for the action value is almost identical to that for the state value, except that we now use action values instead of state values and the importance sampling ratio is the one defined above (starting from \( t+1 \)):

\[\begin{align} Q(S_t,A_t)=Q(S_t,A_t)+\alpha\rho_{t:t+n-1}\delta_t^{(n)} \end{align}\]
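
A sketch of the corresponding update, assuming (my choice, not stated in the original) that the n-step return for the action value bootstraps from \( Q(S_{t+n},A_{t+n}) \) and that `Q` is a tabular dict keyed by (state, action):

```python
# Off-policy n-step TD update for the action value
# (assumes A_{t+n} has already been selected).
rho = importance_ratio_action(states[t:t + n], actions[t:t + n],
                              target_policy, behavior_policy)
g = n_step_return(rewards[t:t + n], Q[(states[t + n], actions[t + n])], gamma, n)
delta = g - Q[(states[t], actions[t])]
Q[(states[t], actions[t])] += alpha * rho * delta
```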

Variance in Importance Sampling

In practice, the target policy is usually greedy with respect to the current value estimates. That is, the target policy generally assigns probability \( 1 \) to some action \( a \) and \( 0 \) to all other actions. Thus, \( {\pi(a\vert s)\over b(a\vert s)} \) is either \( 0 \) or some value greater than \( 1 \) (since the behavior policy must always maintain a certain rate of exploration). As a result, \( \rho_{t:t+n-1} \) may either explode or vanish as \( n \) increases. This property of the importance sampling ratio makes it hard to work with large \( n \), let alone Monte Carlo methods. Therefore, we usually clip each per-step ratio \( {\pi(a\vert s)\over b(a\vert s)} \) at \( r=1 \).
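
A simple way to implement the clipping described above is to cap each per-step ratio before taking the product. This is a sketch following the clipping in this section, reusing the assumed policy callables from earlier:

```python
def clipped_importance_ratio(states, actions, target_policy, behavior_policy, clip=1.0):
    """Like importance_ratio_state, but each per-step ratio is clipped at `clip`
    so the product cannot explode as n grows."""
    return math.prod(
        min(target_policy(a, s) / behavior_policy(a, s), clip)
        for s, a in zip(states, actions)
    )
```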