Introduction
In social psychology, cognitive consistency theories show that people work together usually have consistent cognition about their environments. Inspired by that, Mao et al. 2020 propose to incorporate such consistent cognition into the value network to facilitate large-scale agent cooperations.
Method
Overview
Suppose an agent’s neighboring agents is provided by the environment. Neighborhood cognition consistency(NCC) is achieved in three steps. First, it adopts graph convolutional network to extract a high-level representation from the joint observation of all neighboring agents. Second, it decomposes this high-level representation into an agent-specific cognitive representation and a neighborhood-specific cognitive representation. Third, it aligns the neighborhood-specific representations of all neighboring agents.
Due to the consistency between neighborhood-specific cognitions as well as the difference between agent-specific cognitions, the neighboring agents can achieve coordinated and still personalized policies based on the combination of both cognitions. Meanwhile, since some agents belong to multiple neighborhoods, they are able to act as a bridge for all agents. Thus, NCC can facilitate the coordination among a large number of agents at the whole team level.
Despite the difference in the detail, the idea of NCC in fact resembles how OpenAI Five achieves cooperation, which shares \(25\%\) latent features among agents by max-pooling them.
NCC-Q
 
  The Q-learning instantiation of NCC, namely NCC-Q, consists of five modules as show on the left of Figure 1a. Before diving into details, we first introduce the cognition variable \(C\) in Figure 1a, which defines the consistent neighborhood cognition. Different from the conventional POMDP, we assume the global state can be decomposed into several hidden cognitive variables \(C_k\), from which the observations are derived. In the following discussion, we consider the neighborhood with hidden cognitive variable \(C\).
- 
    The fully-connected(FC) module encodes the local observation \(o_i\) into a high-level representation \(h_i\), which contains only agent-specific information. No matter which neighborhood we consider, \(h_i\) is the same for agent \(i\), which is no longer true for representations of higher level where we begin to consider neighborhood cognition consistency. 
- 
    The graph convolutional network(GCN) module aggregates all \(h_i\) within the neighborhood, and further extracts a high-level cognition vector \(H_i\) by 
where \(\sqrt{\vert N(j)\vert \vert N(i)\vert }\) normalizes feature \(h_j\) to avoid large features produced by the summation. We use \(\sqrt{\vert N(j)\vert \vert N(i)\vert }\) instead of \(N(i)\) since it’s a more adaptive way to down weight high-degree neighbors. \(W\) is shared among all neighborhood.
Truth be told, I’m not familiar with GCNs, and literally have no idea how Equation \((1)\) connects to convolution; unfortunately, the paper does not offer any related references. What I can conceive is that if \(W\) is only specific to some neighborhood and different for different neighborhoods, then the convolution makes some sense; though we still may not simplify Equation \((1)\) into a 1D convolutional layer if the number of agents in each neighborhood varies/the neighboring agents cannot be arranged sequentially. However, in that case, an agent may end up with different \(Q\)-values(see step 4) in different neighborhoods, making the action selection process complicate. Besides that, the choice of the sigmoid function as activation is also odd to me.
- The cognition module decomposes \(H_i\) into two branches: \(A_i\) and \(\hat C_i\), where \(A_i\) represents the agent-specific cognition, and \(\hat C_i\) represents the neighborhood-specific cognition. We expect \(\hat C_i\) to recover \(C\) as shown on the right of Figure 1a. Because directly computing \(p(C\vert o_i)={p(o_i\vert C)p(C_k)\over p(o_i)}\) is hard due to the integral involved in the denominator, we approximate it using the variational inference, which gives us the VAE structure on the right of Figure 1a and the following EBLO objective
In practice, \(p(C_k)\) is usually unknown. Therefore, we use the neighboring agents’ cognitive distribution \(q(\hat C_i\vert o_i)\) as a surrogate, which gives us the following objective
\[\mathbb E_{C_k\sim q(\hat C_i|o_i)}[\log p(o_i|C)]-{1\over |N_i|}\sum_{j\in N(i)}D_{KL}(q(\hat C_i|o_i)\Vert q(\hat C_j|o_j))\tag 3\]Note that there is a slight difference in Equations \((2)\) and \((3)\); Equation \((2)\) optimizes all \(C_i\)
- 
    The Q-value module sums \(A_i\) and \(\hat C_i\) and generate \(Q_i\). As in VDN and QMIX, \(Q_i\) here is more like a utility function instead of the state-action value function. 
- 
    The mixing module combines(sums up) all \(Q_i\) in the neighborhood to generate a joint \(Q\)-value function \(Q_{total}\). We optimize \(Q_{total}\) by minimizing the TD loss. 
NCC-AC
The AC architecture is similar to MADDPG, except that the critic is implemented to take into account consistent neighborhood cognition. Figure 1b illustrates the critic architecture. Two design differences are made compared to NCC-Q: 1) the GCN module integrates the action and observation separately as shown by the two branches (i.e., \(H_i^a\) and \(H_i^o\)) 2) we directly generate the agent-specific cognition \(A_i\) from \(h_i^a\).
Experimental Results
 
  Figures 8&9 show the efficacy of the CD loss(Equation \((3)\)) on Google football environments.
References
Mao, Hangyu, Wulong Liu, Jianye Hao, Jun Luo, Dong Li, Zhengchao Zhang, Jun Wang, and Zhen Xiao. 2019. “Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning.” ArXiv. https://doi.org/10.1609/aaai.v34i05.6212.
 
      