by Sherwin Chen
Introduction

We briefly analyze the effectiveness of layer normalization in deep learning and discuss AdaNorm, an improvement proposed by Xu et al. (2019).

Takeaways

The main takeaways are listed as follows:

  • Many previous studies attribute the success of LayerNorm to forward normalization, which normalizes inputs to a standard normal distribution. Xu et al. find that forward normalization has little to do with its effectiveness. Instead, their experiments show that the derivatives of the mean and variance play a significant role in LayerNorm. Theoretically, this is because the derivative of \(\mu\) re-centers \(\partial\ell\over\partial\pmb x\) to zero mean and the derivative of \(\sigma\) reduces the variance of \(\partial\ell\over\partial\pmb x\) (due to the length of the derivation, we refer to Section 3.3 in the paper for details; a quick numerical check is also given after this list). Furthermore, their ablation study shows that the derivative of the variance becomes more important in deeper networks.

  • The gain \(\gamma\) and bias \(\beta\) learned by LayerNorm increase the risk of overfitting. This may be because they are learned from the training set and ignore the input distribution of the test data.

  • Based on the above observations, Xu et al. propose AdaNorm, which adaptively controls the scaling weights according to the input (again, we refer interested readers to Theorem 2 in the paper for details; a minimal implementation sketch is given below):

\[\begin{align} \pmb z&=\phi(\pmb y)\cdot\pmb y,\\ \text{where}\quad \pmb y&={\pmb x-\mu_x\over\sigma_x},\\ \phi(y_i)&=C(1-ky_i) \end{align}\]

where \(C>0\) is a hyperparameter and \(k\) should be chosen such that \(\vert y_i\vert <{1/k}\) holds with high probability. Since \(\pmb y\) is normalized to zero mean and unit standard deviation, Chebyshev's inequality gives

\[P(|y_i|<1/k)\geq 1-k^2\]

If we want \(\vert y_i\vert <{1/k}\) to hold with probability at least \(99\%\), we can choose \(k=0.1\).
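
To make the formula concrete, here is a minimal PyTorch sketch of the AdaNorm forward pass under the definitions above; it is an illustration, not the authors' implementation. The function name `adanorm`, the default \(C=1\), and the tensor shapes are illustrative choices, and the detach reflects the paper's treatment of \(\phi(\pmb y)\) as a constant during backpropagation.

```python
import torch

def adanorm(x, C=1.0, k=0.1, eps=1e-5):
    """Minimal AdaNorm forward pass over the last dimension (no learned gain/bias).

    C and k are the hyperparameters from the formula above; k = 0.1 keeps
    |y_i| < 1/k with high probability, so phi(y) = C(1 - k*y) stays positive.
    C = 1.0 is an illustrative default, not necessarily the paper's choice.
    """
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, keepdim=True, unbiased=False)
    y = (x - mu) / (sigma + eps)
    phi = C * (1.0 - k * y)
    # Treat phi(y) as a constant in the backward pass, following the paper.
    return phi.detach() * y

x = torch.randn(2, 5, 16)   # (batch, sequence, hidden); shapes are illustrative
z = adanorm(x)
print(z.shape)              # torch.Size([2, 5, 16])
```

Unlike the learned gain \(\gamma\), which multiplies every input by the same weights, the scaling \(\phi(y_i)=C(1-ky_i)\) decreases as \(y_i\) grows, so it adapts to each input rather than being fixed after training.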
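Returning to the first takeaway, the re-centering effect of the mean's derivative is easy to check numerically. The sketch below compares the gradient through plain normalization with a version whose statistics are detached, roughly in the spirit of the paper's DetachNorm ablation; the surrogate loss, the helper `normalize`, and the dimension `d = 512` are made up for the illustration.

```python
import torch

torch.manual_seed(0)

d = 512
x = torch.randn(d, requires_grad=True)
g_up = torch.randn(d)  # stand-in for the upstream gradient dl/dy

def normalize(x, detach_stats=False):
    """LayerNorm core without gain/bias. If detach_stats is True, mu and sigma
    are treated as constants in the backward pass."""
    mu = x.mean()
    sigma = x.std(unbiased=False)
    if detach_stats:
        mu, sigma = mu.detach(), sigma.detach()
    return (x - mu) / sigma

# Gradients of the surrogate loss <g_up, y> w.r.t. x, with and without
# backpropagating through mu and sigma.
grad_full, = torch.autograd.grad((normalize(x) * g_up).sum(), x)
grad_detach, = torch.autograd.grad((normalize(x, detach_stats=True) * g_up).sum(), x)

print(grad_full.mean().item())    # ~0: the derivative of mu re-centers dl/dx
print(grad_detach.mean().item())  # generally non-zero
print((grad_full.var() <= grad_detach.var()).item())  # True for this construction
```

The first print is zero up to floating-point error, matching the claim that the derivative of \(\mu\) re-centers \(\partial\ell\over\partial\pmb x\); the variance comparison illustrates, for this particular construction, the claimed variance reduction.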

References

Xu, Jingjing, and Xu Sun. 2019. “Understanding and Improving Layer Normalization.” In Advances in Neural Information Processing Systems (NeurIPS).