Introduction
When dealing with complex tasks, humans usually cooperate by decomposing the task into a set of subtasks and letting each individual be responsible for solving one subtask. This strategy effectively shrinks the observation and action space for each individual, making learning much easier. Inspired by this, Wang et al. 2021 propose RODE, which first decomposes the action space into a set of subspaces according to the actions' effects on the environment, and then trains a hierarchical architecture where the higher-level policy (the role selector) selects an action subspace from which the lower-level policy (the role policy) picks an action.
Method
As shown in Figure 1, RODE consists of three parts. The first part factors the action space according to the actions' effects. The role selector then selects an action subspace every \(c\) steps. Finally, the role policy chooses an action from the selected action subspace. We discuss each component in this section.
Determining Role Action Subspaces by Learning Action Representations
RODE learns a representation for each discrete action via the dynamics predictive model shown in Figure 1a. In the experiments, the action \(a_i\) is encoded by an MLP with one hidden layer and \((o_i,\pmb a_{-i})\) is encoded by another MLP with one hidden layer. The concatenation of both representations is used to predict the next observation and reward. The model is trained to minimize the following prediction loss
\[\begin{align} \mathcal L_e=\mathbb E_{(o_i,\pmb a_i,r_i, o_i')\sim\mathcal D}\left[\Vert p(o_i'|a_i,o_i,\pmb a_{-i})-o_i'\Vert_2^2+\lambda_e(p(r_i|a_i,o_i,\pmb a_{-i})-r_i)^2\right] \end{align}\]where \(\lambda_e=10\) controls the weight of the reward prediction loss.
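Below is a minimal PyTorch sketch of this predictive model, assuming one-hot discrete actions and placeholder dimensions; the layer sizes, batch keys, and interface are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the action-representation predictive model (assumptions: one-hot
# actions, other agents' actions concatenated with o_i, placeholder sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionEncoder(nn.Module):
    """Maps a one-hot action a_i to its representation z_{a_i}."""
    def __init__(self, n_actions, repr_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_actions, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim),
        )

    def forward(self, a_onehot):
        return self.net(a_onehot)

class PredictiveModel(nn.Module):
    """Predicts o_i' and r_i from (a_i, o_i, a_{-i})."""
    def __init__(self, n_actions, obs_dim, n_agents, repr_dim, hidden_dim=128):
        super().__init__()
        self.action_encoder = ActionEncoder(n_actions, repr_dim, hidden_dim)
        # (o_i, a_{-i}) is encoded by a second one-hidden-layer MLP
        ctx_dim = obs_dim + (n_agents - 1) * n_actions
        self.ctx_encoder = nn.Sequential(
            nn.Linear(ctx_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim),
        )
        self.obs_head = nn.Linear(2 * repr_dim, obs_dim)  # predicts o_i'
        self.rew_head = nn.Linear(2 * repr_dim, 1)        # predicts r_i

    def forward(self, a_onehot, obs, others_onehot):
        z_a = self.action_encoder(a_onehot)
        z_ctx = self.ctx_encoder(torch.cat([obs, others_onehot], dim=-1))
        h = torch.cat([z_a, z_ctx], dim=-1)
        return self.obs_head(h), self.rew_head(h)

def prediction_loss(model, batch, lambda_e=10.0):
    """L_e = ||o_hat - o'||^2 + lambda_e * (r_hat - r)^2, averaged over a batch
    (batch is an assumed dict of tensors sampled from the replay buffer)."""
    o_hat, r_hat = model(batch["a"], batch["obs"], batch["others_a"])
    obs_loss = F.mse_loss(o_hat, batch["next_obs"])
    rew_loss = F.mse_loss(r_hat.squeeze(-1), batch["reward"])
    return obs_loss + lambda_e * rew_loss
```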
The above model yields an action encoder that maps \(a_i\) to \(\pmb z_{a_i}\). We then apply \(k\)-means clustering to divide actions into several groups. In practice, the implementation is more involved and relies on environment-specific heuristics so that some actions are shared among groups.
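A sketch of the clustering step follows, assuming scikit-learn's `KMeans` and a fixed number of clusters; the environment-specific heuristics mentioned above are omitted.

```python
# Cluster learned action representations into role action subspaces
# (assumptions: scikit-learn KMeans, n_clusters chosen by hand).
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def build_action_subspaces(action_encoder, n_actions, n_clusters=5):
    # encode every discrete action (as a one-hot vector) into z_{a_i}
    eye = torch.eye(n_actions)
    z = action_encoder(eye).cpu().numpy()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z)
    # group action indices by cluster label; each group is one role's subspace
    subspaces = [[a for a in range(n_actions) if labels[a] == j]
                 for j in range(n_clusters)]
    return subspaces
```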
Note that the predictive model is trained separately from the other two components and usually requires far fewer samples (around \(50K\)).
Role Selector
The role selector is a variant of the QMIX network with the "action" of role \(\rho_j\) defined as the mean representation of its available actions
\[\begin{align} \pmb z_{\rho_j}={1\over |A_j|}\sum_{a\in A_j}\pmb z_a \end{align}\]where \(A_j\) is the \(j^{th}\) action subspace. We call \(\pmb z_{\rho_j}\) the role representation.
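Computing the role representations is then a small reduction over the learned action representations; `action_reprs` and `subspaces` below are assumed to come from the previous steps.

```python
# Role representation z_{rho_j} = mean of z_a over the actions in A_j.
# (action_reprs: (n_actions, repr_dim) tensor; subspaces: list of index lists)
import torch

def role_representations(action_reprs, subspaces):
    return torch.stack([action_reprs[idx].mean(dim=0) for idx in subspaces])
```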
Different from QMIX, the "\(Q\)-value" of role \(\rho_j\) for agent \(i\) is computed as the dot product between the trajectory representation \(\pmb z_{\tau_i}\) and the role representation \(\pmb z_{\rho_j}\), i.e., \(Q_i^\beta(\tau_i,\rho_j)=\pmb z_{\tau_i}^\top\pmb z_{\rho_j}\). To better coordinate role assignments, we combine all \(Q_i^\beta\) using a monotonic mixing network to estimate a global \(Q\)-value \(Q_{tot}^\beta\). We then optimize \(Q_{tot}^\beta\) by minimizing the following TD loss
\[\begin{align} \mathcal L_{\beta}=\mathbb E_{\mathcal D}\left[\left(\sum_{i=0}^{c-1}r_{t+i}+\gamma\max_{\pmb \rho'}\bar Q^{\beta}_{tot}(s_{t+c},\pmb \rho')-Q_{tot}^\beta(s_t,\pmb \rho)\right)^2\right] \end{align}\]where \(\pmb \rho=(\rho_1,\dots,\rho_k)\) is the joint role of all agents and \(\bar Q_{tot}^\beta\) is the target network.
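A rough sketch of the role selector's value computation and \(c\)-step TD loss follows, assuming a trajectory encoder (e.g. a GRU) that produces \(\pmb z_{\tau_i}\) and a QMIX-style monotonic mixer; all names, shapes, and the replay interface are placeholders, not the authors' code.

```python
# Role selector: Q_i^beta as dot products, mixed into Q_tot^beta, trained
# with the c-step TD loss above (names and shapes are assumptions).
import torch

def role_q_values(z_tau, z_rho):
    # z_tau: (batch, n_agents, repr_dim), z_rho: (n_roles, repr_dim)
    # Q_i^beta(tau_i, rho_j) = z_{tau_i}^T z_{rho_j}
    return torch.einsum("bnd,kd->bnk", z_tau, z_rho)

def role_selector_td_loss(mixer, target_mixer, q, target_q, chosen_roles,
                          state, next_state, rewards_c, gamma=0.99):
    # q, target_q: (batch, n_agents, n_roles); chosen_roles: (batch, n_agents) long
    # rewards_c: (batch, 1), the summed rewards between two role decisions
    chosen = q.gather(-1, chosen_roles.unsqueeze(-1)).squeeze(-1)
    q_tot = mixer(chosen, state)                       # (batch, 1)
    with torch.no_grad():
        best = target_q.max(dim=-1).values             # greedy roles at s_{t+c}
        target_tot = target_mixer(best, next_state)
        y = rewards_c + gamma * target_tot
    return ((y - q_tot) ** 2).mean()
```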
Role Policy
The role policy is also a variant of the QMIX network. The "\(Q\)-value" of action \(a_j\) for agent \(i\) is computed as the dot product between the trajectory representation \(\pmb z_{\tau_i}\) and the action representation \(\pmb z_{a_j}\), i.e., \(Q_i(\tau_i,a_j)=\pmb z_{\tau_i}^\top\pmb z_{a_j}\), where \(a_j\) ranges over the selected action subspace. Similar to the role selector, we combine all \(Q_i\) using a monotonic mixing network to estimate a global \(Q\)-value \(Q_{tot}\) and optimize it by minimizing the following TD loss
\[\begin{align} \mathcal L=\mathbb E_{\mathcal D}\left[\left(r+\gamma\max_{\pmb a'}\bar Q_{tot}(s',\pmb a')-Q_{tot}(s,\pmb a)\right)^2\right] \end{align}\]Note that the agent networks in Figures 1b and 1c are separate and do not share any parameters.
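A sketch of the role policy's \(Q\)-values, with the selected role's action subspace enforced by masking; shapes and names are again assumptions rather than the reference implementation.

```python
# Role policy: Q_i(tau_i, a_j) = z_{tau_i}^T z_{a_j}, restricted to the
# action subspace of each agent's currently selected role via masking.
import torch

def policy_q_values(z_tau, action_reprs, subspaces, roles):
    # z_tau: (batch, n_agents, repr_dim); action_reprs: (n_actions, repr_dim)
    # roles: (batch, n_agents) long tensor of selected role indices
    q = torch.einsum("bnd,ad->bna", z_tau, action_reprs)
    # build a mask so each agent can only pick actions in its selected subspace
    n_actions = action_reprs.shape[0]
    subspace_masks = torch.zeros(len(subspaces), n_actions)
    for j, idx in enumerate(subspaces):
        subspace_masks[j, idx] = 1.0
    mask = subspace_masks[roles]                  # (batch, n_agents, n_actions)
    return q.masked_fill(mask == 0, float("-inf"))
```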
Experimental Results
Action Representations
Figure 2 shows that the predictive model effectively learns action representations that reflect the actions' effects, with actions that have similar effects clustered together.
Performance
RODE shows outstanding performance on hard and super hard environments from the SMAC suite, demonstrating that dividing the action space into subspaces helps exploration. On the other hand, on easy maps, RODE typically needs more samples to learn a successful strategy.
Ablations
Figure 6 shows ablation studies on the three most difficult maps from the SMAC suite, where
- “RODE No Action Repr” does not condition the \(Q\)-networks on the action representations (i.e., without the attention mechanism). Its performance is the closest to that of RODE and better than the other ablations, showing that most of the performance gain comes from the restricted action spaces.
- “RODE Random Restricted Action Spaces” uses random action subspaces and “RODE Full Action Space” does not divide the action space at all.
One explanation for why restricted action spaces work is that they introduce a bias into the exploration space, guiding exploration in a more directed way.
References
Wang, Tonghan, Tarun Gupta, Anuj Mahajan, and Bei Peng. 2021. “RODE: Learning Roles to Decompose Multi-Agent Tasks.” ICLR.