Introduction
Unlike the other posts, here we only distill some useful tips on how to do contrastive representation learning from Chen et al. 2020.
Tips for Representation Learning
- Data augmentation helps representation learning: random crop (and resize) and color distortion are especially useful.
- It is beneficial to add an additional small network (the so-called projection head) on top of the representation model that maps representations to the space where the contrastive loss is applied.
- The representation before the projection head (learned by the representation model) is more suitable for downstream tasks than the representation after the projection head. This is because the projection head loses some information: it is trained to be invariant to data transformations and simply to minimize the contrastive loss. By pushing this invariance into the projection head, more information can be formed and maintained in the representation before it (a minimal sketch of this setup appears after this list).
- Representation learning needs stronger data augmentation: stronger color augmentation substantially improves the representation learning model, whereas it hurts the supervised model.
- Contrastive learning benefits from larger batch sizes and longer training compared to its supervised counterpart. Like supervised learning, contrastive learning benefits from deeper and wider networks.
- The cross-entropy loss works best for downstream classification tasks when coupled with cosine similarity. Moreover, normalization and a proper temperature help extract good representations. We detail this a bit further in the next section.
- In distributed training, the mean and variance of batch normalization are typically computed locally per device. The model may learn to exploit this local information leakage to decrease the loss without improving representations. Chen et al. 2020 address this issue by aggregating the BN mean and variance over all devices during training.
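Below is a minimal PyTorch sketch of the setup described above: a stronger augmentation pipeline, an encoder whose output is kept for downstream tasks, and a small projection head whose output feeds the contrastive loss. The ResNet-50 backbone, layer sizes, and augmentation magnitudes are illustrative assumptions, not the exact configuration of Chen et al. 2020.

```python
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Stronger augmentation: random crop (and resize) plus color distortion.
# The exact magnitudes below are assumptions for illustration.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])


class ContrastiveModel(nn.Module):
    def __init__(self, feature_dim=2048, proj_dim=128):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # drop the classifier; the output is the representation h
        self.encoder = backbone
        # Small projection head mapping h to the space where the contrastive loss is applied.
        self.projector = nn.Sequential(
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feature_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)    # use h (before the head) for downstream tasks
        z = self.projector(h)  # use z (after the head) only for the contrastive loss
        return h, z


# In distributed training, convert BN layers so their statistics are aggregated across devices.
# model = nn.SyncBatchNorm.convert_sync_batchnorm(ContrastiveModel())
```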
Contrastive Loss Comparison
Similarities
We usually measure the similarity between two vectors in one of the following ways:
\[\begin{align} &\text{dot/cos}&\pmb u^{\top}\pmb v\\\ &\text{general}&\pmb u^{\top}W\pmb v\\\ &\text{concat}&w^\top\tanh(W_v\pmb v+W_u\pmb u) \end{align}\]where sometimes L2 normalization may be performed before computing the similarities to ensure vectors are of the same scale.
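As a rough illustration, the three forms can be written in PyTorch as follows; the parameters `W`, `W_u`, `W_v`, and `w` are assumed to be learnable tensors introduced only for this sketch.

```python
import torch
import torch.nn.functional as F

def dot_cos_sim(u, v, normalize=True):
    # With L2 normalization this is the cosine similarity; without it, a plain dot product.
    if normalize:
        u, v = F.normalize(u, dim=-1), F.normalize(v, dim=-1)
    return (u * v).sum(-1)

def general_sim(u, v, W):
    # Bilinear form u^T W v.
    return torch.einsum('bi,ij,bj->b', u, W, v)

def concat_sim(u, v, W_u, W_v, w):
    # Additive form w^T tanh(W_v v + W_u u); W_u, W_v project to a common hidden size.
    return torch.tanh(u @ W_u.T + v @ W_v.T) @ w
```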
Contrastive losses based on similarities
We list the following contrastive losses commonly used in the literature:
\[\begin{align} &\text{Cross-entropy}&- s(\pmb u, \pmb v^+)/\tau+\log\sum_{\pmb v\in\{\pmb v^+,\pmb v^-\}}\exp(s(\pmb u,\pmb v)/\tau)\\\ &\text{Logistic}&-\log\sigma(s(\pmb u, \pmb v^+)/\tau)-\log\sigma(-s(\pmb u,\pmb v^-)/\tau)\\\ &\text{Margin Triplet}&\max(s(\pmb u,\pmb v^-)-s(\pmb u,\pmb v^+)+m,0) \end{align}\]where \(s(\cdot, \cdot)\) computes the similarity, \(\tau\) is the temperature (usually around \(0.1\)), and \(m\) is the margin. Their negative gradients with respect to \(\pmb u\) are defined as follows
\[\begin{align} &\text{Cross-entropy}&\left(\left(1-{\exp(s(\pmb u,\pmb v^+)/\tau)\over Z(\pmb u)}\right){d\over d\pmb u}s(\pmb u,\pmb v^+)-{\exp(s(\pmb u,\pmb v^-)/\tau)\over Z(\pmb u)}{d\over d\pmb u}s(\pmb u,\pmb v^-)\right)/\tau\\\ &\text{Logistic}&\Big(\sigma(-s(\pmb u, \pmb v^+)/\tau){d\over d\pmb u}s(\pmb u,\pmb v^+)-\sigma(s(\pmb u, \pmb v^-)/\tau){d\over d\pmb u}s(\pmb u,\pmb v^-)\Big)/\tau\\\ &\text{Margin Triplet}&{d\over d\pmb u}s(\pmb u,\pmb v^+)-{d\over d\pmb u}s(\pmb u,\pmb v^-)\text{ if }s(\pmb u, \pmb v^+)-s(\pmb u, \pmb v^-) < m\text{ else }\pmb 0 \end{align}\]where \(Z(\pmb u)=\sum_{\pmb v\in\{\pmb v^+,\pmb v^-\}}\exp(s(\pmb u,\pmb v)/\tau)\). Looking at the gradients, we observe that 1) \(\ell_2\) normalization along with the temperature effectively weights different examples, and an appropriate temperature can help the model learn from hard negatives; and 2) unlike the cross-entropy loss, the other objectives do not weight the negatives by their relative hardness.
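For concreteness, here is a rough PyTorch sketch of the three losses for a single positive/negative similarity pair; how the similarities \(s(\pmb u,\pmb v)\) are computed and batched over many negatives is left open.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(s_pos, s_neg, tau=0.1):
    # -s(u, v+)/tau + log(exp(s(u, v+)/tau) + exp(s(u, v-)/tau))
    logits = torch.stack([s_pos, s_neg]) / tau
    return -s_pos / tau + torch.logsumexp(logits, dim=0)

def logistic_loss(s_pos, s_neg, tau=0.1):
    # -log sigma(s(u, v+)/tau) - log sigma(-s(u, v-)/tau)
    return -F.logsigmoid(s_pos / tau) - F.logsigmoid(-s_neg / tau)

def margin_triplet_loss(s_pos, s_neg, m=1.0):
    # max(s(u, v-) - s(u, v+) + m, 0)
    return torch.clamp(s_neg - s_pos + m, min=0.0)
```

The weighting factors in the gradients above correspond to the derivatives of these functions with respect to `s_pos` and `s_neg`.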
Chen et al. empirically demonstrate that the cross-entropy loss works best for downstream classification tasks when coupled with cosine similarity, and that normalization helps. In their experiments, they also found that \(\tau=0.1\) performs better than other values for downstream classification tasks.
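As a rough batched sketch of this best-performing setup (cross-entropy over cosine similarities with a temperature around \(0.1\)), an NT-Xent-style loss might look as follows; the two-view batch construction and the function name `nt_xent` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """z1, z2: projections of the two augmented views of the same batch, shape (n, d)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # L2 normalization -> cosine similarity
    sim = z @ z.T / tau                                  # (2n, 2n) scaled similarity matrix
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))           # exclude self-similarity
    # The positive of example i in the first view is example i in the second view, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Here each example's positive is its other augmented view, and the remaining \(2n-2\) examples in the batch act as negatives.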
References
Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. “A Simple Framework for Contrastive Learning of Visual Representations.” http://arxiv.org/abs/2002.05709.