Introduction
Similar to the way humans think, recursive reasoning refers to the belief reasoning process in which each agent considers the reasoning process of other agents, based on which it expects to make better decisions. Importantly, it allows an opponent to reason about the behavior of the modeling agent rather than reasoning purely from history (on which traditional opponent modeling methods are based); the process can therefore be nested in the form “I believe that you believe that I believe …”
Deficiency of Opponent Modeling
Traditional opponent modeling (OM) methods model how the opponent behaves based on history. However, the opponent may change its policy from time to time, making the history less informative. As a result, OM algorithms usually need to know the opponent’s exact Nash equilibrium policy during training in order to model the opponent’s behavior effectively, which restricts their application in practice.
Deficiency of Centralized-Critic Methods
Centralized-critic methods usually rest on the strong assumption that the opponents’ policies are fully observable, not to mention that the centralized \(Q\)-network can prevent the algorithms from scaling up.
Decoupling the joint policy simplifies the algorithm, but it ignores the connections between agents, e.g., the impact of one agent’s action on the other agents and their subsequent reactions.
Preliminaries
We define the MDP for an \(n\)-agent stochastic game as \((\mathcal S, \mathcal A^1,\dots,\mathcal A^n,r^1,\dots,r^n,p,\gamma)\), where \(\mathcal S\) denotes the state space, \(p\) is the distribution of the initial state, \(\gamma\) is the discount factor for future rewards, and \(\mathcal A^i\) and \(r^i=r^i(s,a^i,a^{-i})\) are the action space and the reward function for agent \(i\in\{1,\dots,n\}\). Agent \(i\) chooses its action \(a^i\in \mathcal A^i\) according to the policy \(\pi^i_{\theta^i}(a^i\vert s)\) parameterized by \(\theta^i\). Let us define the joint policy as the collection of all agents’ policies \(\pi_\theta\) with \(\theta\) representing the joint parameter. It is convenient to interpret the joint policy from the perspective of agent \(i\) such that \(\pi_\theta=(\pi_{\theta^i}^i(a^i\vert s),\pi_{\theta^{-i}}^{-i}(a^{-i}\vert s))\), where \(a^{-i}=(a^j)_{j\ne i}\), \(\theta^{-i}=(\theta^j)_{j\ne i}\), and \(\pi^{-i}=(\pi^j)_{j\ne i}\) is a compact representation of the joint policy of all agents other than \(i\). Each agent is presumed to pursue the maximal cumulative reward, expressed as
\[\begin{align} \max\eta^i(\pi_\theta)=\mathbb E_{\pi_\theta}\left[\sum_{t=0}^\infty \gamma^tr^i(s_t,a_t^i,a_t^{-i})\right]\tag {1} \end{align}\]The \(Q\)-function of each agent, \(Q^i_{\pi_\theta}(s_t,a_t^i,a_t^{-i})=\mathbb E_{\pi_\theta}\left[\sum_{t'=t}^\infty \gamma^{t'-t}r^i(s_{t'},a_{t'}^i,a_{t'}^{-i})\right]\), depends on the joint policy \(\pi_\theta\) consisting of all agents’ policies. One common approach in opponent modeling is to decouple the joint policy by assuming conditional independence of actions from different agents:
\[\begin{align} \pi_\theta(a^i,a^{-i}|s)=\pi_{\theta^i}^i(a^i|s)\pi_{\theta^{-i}}^{-i}(a^{-i}|s) \end{align}\]This, however, does not capture the recursive reasoning we discussed in the Introduction. To model the recursive reasoning, we re-formulate the joint policy as
\[\begin{align} \pi_\theta(a^i,a^{-i}|s)=\underbrace{\pi_{\theta^i}^i(a^i|s)\pi_{\theta^{-i}}^{-i}(a^{-i}|s,a^i)}_{\text{Agent }i\text{'s perspective}}=\underbrace{\pi_{\theta^{-i}}^{-i}(a^{-i}|s)\pi_{\theta^{i}}^{i}(a^{i}|s,a^{-i})}_{\text{The opponent's perspective}}\tag {2} \end{align}\]where \(\pi_{\theta^{-i}}^{-i}(a^{-i}\vert s,a^i)\) models the action the opponent would take given that it knows the current state of the environment and agent \(i\)’s action. Intuitively, Eq.\((2)\) allows agent \(i\) to consider how the opponents would respond if it acted one way, how they would respond if it acted another way, and so on. The same inference logic applies to the opponents from their perspectives, as shown in the second equality of Eq.\((2)\).
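To make Eq.\((2)\) concrete, here is a small numerical check in Python. The \(2\times 2\) joint policy for a fixed state is made up purely for illustration; the point is that both factorizations, from agent \(i\)'s perspective and from the opponent's, recover the same joint distribution via the chain rule:

```python
# Verify that both factorizations in Eq. (2) recover the same joint policy.
import numpy as np

joint = np.array([[0.3, 0.2],   # rows: agent i's action a^i, columns: opponent action a^{-i}
                  [0.1, 0.4]])  # entries: pi_theta(a^i, a^{-i} | s) for a fixed state s

# Agent i's perspective: pi^i(a^i|s) * pi^{-i}(a^{-i}|s, a^i)
pi_i = joint.sum(axis=1)                      # marginal over a^{-i}
pi_mi_given_i = joint / pi_i[:, None]         # opponent's conditional given a^i
recovered_1 = pi_i[:, None] * pi_mi_given_i

# The opponent's perspective: pi^{-i}(a^{-i}|s) * pi^i(a^i|s, a^{-i})
pi_mi = joint.sum(axis=0)                     # marginal over a^i
pi_i_given_mi = joint / pi_mi[None, :]        # agent i's conditional given a^{-i}
recovered_2 = pi_mi[None, :] * pi_i_given_mi

assert np.allclose(recovered_1, joint) and np.allclose(recovered_2, joint)
```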
Probabilistic Recursive Reasoning
In Probabilistic Recursive Reasoning (PR2), agent \(i\) maximizes the total rewards under the joint policy distribution (this distinguishes PR2 from MADDPG, which evaluates the total rewards only under the current agent’s policy):
\[\begin{align} \underset{\theta^i}{\arg\max}\ \eta^i\left(\pi_{\theta^i}^i(a^i|s)\pi^{-i}_{\theta^{-i}}(a^{-i}|s,a^i)\right)\tag {3} \end{align}\]where \(\eta^i\) is defined in Eq.\((1)\). We can compute the policy gradient of Eq.\((3)\) w.r.t. \(\theta^i\) as follows
\[\begin{align} \nabla_{\theta^i}\eta^i=\mathbb E_{s\sim p, a^i\sim\pi^i_{\theta^i}}\left[\nabla_{\theta^i}\log\pi^i_{\theta^i}(a^i|s)\int_{a^{-i}}\pi^{-i}_{\theta^{-i}}(a^{-i}|s,a^i)Q^i(s,a^i,a^{-i})da^{-i}\right] \end{align}\]As in MADDPG, we may want to use off-policy learning to improve sample efficiency, which gives us the following gradient estimator for a deterministic policy \(\mu^i_{\theta^i}\):
\[\begin{align} \nabla_{\theta^i}\eta^i=\mathbb E_{s\sim p}\left[\nabla_{\theta^i}\mu^i_{\theta^i}(s)\,\mathbb E_{ {a^{-i}}\sim \pi^{-i}_{\theta^{-i}}(a^{-i}|s,a^i)} \big[\nabla_{a^i}Q^i(s,a^i,a^{-i})\big]\Big|_{a^i=\mu^i_{\theta^i}(s)}\right] \end{align}\]However, since we aim for decentralized training, agent \(i\) may not have access to \(\pi_{\theta^{-i}}^{-i}(a^{-i}\vert s,a^i)\), so we approximate \(\pi_{\theta^{-i}}^{-i}(a^{-i}\vert s,a^i)\) with \(\rho_{\phi^{-i}}^{-i}(a^{-i}\vert s,a^i)\). This gives us
\[\begin{align} \nabla_{\theta^i}\eta^i=\mathbb E_{s\sim p}\left[\nabla_{\theta^i}\mu^i_{\theta^i}(s)\,\mathbb E_{ {a^{-i}}\sim \rho^{-i}_{\phi^{-i}}(a^{-i}|s,a^i)}\big[\nabla_{a^i}Q^i(s,a^i,a^{-i})\big]\Big|_{a^i=\mu^i_{\theta^i}(s)}\right]\tag {4} \end{align}\]Now the last missing piece is how to find the best-fit approximation \(\rho_{\phi^{-i}}^{-i}(a^{-i}\vert s,a^i)\).
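To illustrate Eq.\((4)\), here is a minimal PyTorch-style sketch of the resulting actor update. The module interfaces (`mu_i`, `rho_mi.sample`, `joint_q`) and tensor shapes are assumptions made for illustration, not the official implementation:

```python
# A sketch of the PR2 actor update implied by Eq. (4); only mu_i's parameters
# should be registered in the optimizer that consumes this loss.
import torch

def pr2_actor_loss(mu_i, rho_mi, joint_q, states, n_samples=16):
    """Loss whose gradient w.r.t. theta^i matches Eq. (4) up to sign."""
    a_i = mu_i(states)                                      # a^i = mu^i_{theta^i}(s), shape [B, act_dim]
    # Sample opponent responses from the learned model rho^{-i}(a^{-i} | s, a^i).
    # They only serve the inner expectation, so gradients are blocked through the sampler.
    a_mi = rho_mi.sample(states, a_i.detach(), n_samples)   # shape [B, K, opp_act_dim]
    s_rep = states.unsqueeze(1).expand(-1, n_samples, -1)
    a_i_rep = a_i.unsqueeze(1).expand(-1, n_samples, -1)
    # Inner expectation over opponent actions of the joint Q-function; the
    # deterministic-policy-gradient chain rule then sends dQ/da^i into theta^i.
    q = joint_q(s_rep, a_i_rep, a_mi).mean(dim=1)
    return -q.mean()
```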
Variational Inference on Opponent Conditional Policy
The authors propose to use variational inference to find the best approximation \(\rho_{\phi^{-i}}^{-i}(a^{-i}\vert s,a^i)\), as we did in this post. Almost the same reasoning works here if we regard \(Q^i(s,a^i,a^{-i})\) and \(Q^i(s, a^i)\) as the \(Q\) and \(V\) in that post, respectively. These substitutions are valid because we are now approximating the opponent policy, which takes the \((s,a^i)\) pair as its ‘state’. Working through the math, we have
\[\begin{align} \rho_{\phi^{-i}}^{-i}(a^{-i}|s,a^i)=\exp(Q^i(s,a^i,a^{-i})-Q^i(s,a^i)) \end{align}\]In continuous control, the authors adopt amortized Stein Variational Gradient Descent to sample from the soft \(Q\)-function; one might as well use Soft Actor-Critic (SAC) as a replacement.
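As a rough illustration of the SAC-style alternative, the sketch below fits \(\rho_{\phi^{-i}}^{-i}\) to the energy-based form above by minimizing \(\mathrm{KL}\big(\rho\,\Vert\,\exp(Q^i(s,a^i,\cdot)-Q^i(s,a^i))\big)\) with the reparameterization trick. The network interfaces (`rho_mi.rsample_with_log_prob`, `joint_q`) are assumed, not taken from the paper:

```python
# A sketch of fitting the opponent model to the energy-based target with a
# reparameterized, SAC-style objective.
import torch

def opponent_model_loss(rho_mi, joint_q, states, a_i):
    """Approximately minimizes KL(rho^{-i}(.|s,a^i) || exp(Q^i(s,a^i,.) - Q^i(s,a^i)))."""
    # Reparameterized opponent actions and their log-densities under rho^{-i}.
    a_mi, log_prob = rho_mi.rsample_with_log_prob(states, a_i)
    # Q^i(s, a^i) is constant w.r.t. phi^{-i}, so it drops out of the objective;
    # the KL then reduces (up to a constant) to E_rho[log rho - Q^i(s, a^i, a^{-i})].
    q = joint_q(states, a_i, a_mi)
    # Only rho's parameters (phi^{-i}) should be optimized with this loss.
    return (log_prob - q).mean()
```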
Summary
In PR2, there are four types of networks (not counting target networks and double \(Q\)-networks) for each agent \(i\):
- Deterministic policy network \(\mu^i\)
- Approximator \(\rho^{-i}\) for the opponents’ policies. In the official implementation, all opponents are treated identically; that is, all opponents’ policies are generated by the same network, i.e. \(\rho(a^j\vert s,a^i) = \rho(a^k\vert s,a^i)\) for \(j,k\in -i\)
- \(Q\)-function \(Q(s,a^i)\)
- Joint \(Q\)-function \(Q(s,a^i,a^{-i})\)
The official implementation trains \(\mu^i\) according to Eq.\((4)\), and trains \(\rho^{-i}\), \(Q(s,a^i)\) and \(Q(s,a^i,a^{-i})\) with a MaxEnt algorithm.
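For reference, a minimal sketch of these four per-agent modules might look as follows; the MLP architectures and the single shared Gaussian-mean opponent model are my assumptions based on the summary above, not the official code:

```python
# A sketch of the four per-agent modules (target/double networks omitted).
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class PR2Agent(nn.Module):
    def __init__(self, state_dim, action_dim, opp_action_dim):
        super().__init__()
        self.mu = mlp(state_dim, action_dim)                              # deterministic policy mu^i(s)
        # A single conditional opponent model rho(a^j | s, a^i), shared across all j != i;
        # here it outputs, e.g., the mean of a Gaussian over one opponent's action.
        self.rho = mlp(state_dim + action_dim, opp_action_dim)
        self.q = mlp(state_dim + action_dim, 1)                           # Q^i(s, a^i)
        self.q_joint = mlp(state_dim + action_dim + opp_action_dim, 1)    # Q^i(s, a^i, a^{-i})
```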
Deficiency
PR2 requires that all agents observe the same global state so that agent \(i\) can model the other agents’ policies. However, this is often not feasible in practice.
Discussion
In the paper, the authors compute \(\rho^{-i}_{\phi^{-i}}(a^{-i}_t\vert s_t,a^i_t)\), a stand-in for the opponents’ policy, by minimizing a KL divergence, or equivalently, maximizing a maximum-entropy reinforcement learning objective. But opponents in general do not act in the interest of the current agent; more often they act against it. So why does it make sense to optimize \(\rho^{-i}_{\phi^{-i}}(a^{-i}_t\vert s_t,a^i_t)\) in this way?
I’m glad that someone on OpenReview asked the same question. The authors respond that, in the multi-agent case, all agents act towards an equilibrium that no one wants to deviate from. Such an equilibrium may not be optimal for any agent: one may deviate from the equilibrium and gain better results for a while, but its opponents will quickly find a policy that beats it in the long run. At the end of the day, they will again converge to the equilibrium. Therefore, the variables \(O\) can be regarded as the corresponding optimality after considering all possible opponents’ responses.
Here’s the original answer from the authors:
Despite the high level similarity between single-agent energy-based RL framework and our probabilistic recursive reasoning framework, the fundamental graphical model is different (see Fig. 8 in Appendix E). The “most probable trajectory” in the multi-agent case, represented together by the variables \(\{O, O^{-i}\}\), does not necessarily stand for the trajectory where each agent just chooses the action that will give him the maximum reward (namely the “optimal trajectory” in single-agent case), but rather some kinds of equilibrium that no one would want to deviate from. In the example of the matrix game in Section 5.1, both agents reach the Nash equilibrium at \((0.5, 0.5)\) in the end, that is because agent \(1\) knows choosing the action \(1\) which gives the maximum reward \(3\) (at the same time assuming agent \(2\) choose action \(2\)) will not last because agent \(2\) will simply defect to choose action \(1\) to avoid the case of reward \(0\) for itself; therefore, the trajectory of (action \(1\), action \(2\)) is not optimal to agent \(1\) anymore after considering the consequent influence on agent \(2\)’s action. Another example is to think about the prisoner’s dilemma, (cooperate, cooperate) is not a probable trajectory, because it is always agent’s interest to defect, thereby the \(\{O=1, O^{-i}=1\}\) will only occur at the (defect, defect) instead. To sum up, \(\{O, O^{-i}\}\) describes the likelihood of certain trajectory being observed, in the multi-agent scenario, the goal of equilibrium certainly allows the case where the opponent reward is different from \(r^i\).
Theoretically speaking, we have also proved in Theorem 2 that PR2 methods converge in the games with either fully-cooperative equilibrium or fully-competitive equilibrium. In addition, we have further added one extra zero-sum game, i.e. matching penny, in Fig. 9 of Appendix E. PR2 methods present convergent results as expected.
You may find appendix E here
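To make the equilibrium argument concrete, here is a quick check of the prisoner’s dilemma example from the authors’ answer. The payoff matrix below is a standard one of my choosing, not taken from the paper; the check confirms that (defect, defect) is the only pure-strategy profile from which neither player benefits by deviating:

```python
# Enumerate pure-strategy Nash equilibria of a symmetric 2x2 prisoner's dilemma.
import numpy as np

# Row player's payoffs; the game is symmetric, so the column player's payoff is the transpose.
# Actions: 0 = cooperate, 1 = defect.
R = np.array([[-1, -3],
              [ 0, -2]])
C = R.T

def is_pure_nash(i, j):
    best_row = R[:, j].max() == R[i, j]   # row player cannot improve by switching
    best_col = C[i, :].max() == C[i, j]   # column player cannot improve by switching
    return best_row and best_col

print([(i, j) for i in range(2) for j in range(2) if is_pure_nash(i, j)])  # -> [(1, 1)]
```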
References
Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning. In ICLR 2019.
code: https://github.com/ml3705454/mapr2