by Sherwin Chen

Introduction

In the previous post, we discussed Differentiable Neural Computers (DNCs), which can effectively write information to and retrieve it from an external memory. In this post, we discuss several improvements to the DNC proposed by Csordás & Schmidhuber.

Masked Content-Based Addressing

Problem

The goal of content-based addressing is to find memory cells similar to a given key. The query key may contain only partial information, requiring the content-based read to complete the missing part from memory. However, it is difficult to control which part of the memory vector to search for because there is no key-value separation: the entire key and the entire memory value are compared to produce the similarity score. As a result, the part of the cell value that is unknown at search time and should be retrieved also enters the normalization term of the cosine similarity, producing an unpredictable score.

Let’s take an example to demonstrate this effect. Consider \(\pmb k\) and \(\pmb M\) as follows

\[\begin{align} \pmb k&=\begin{bmatrix} 0\\1\\1 \end{bmatrix}\\ \pmb M&=\begin{bmatrix} 10&1&1\\ 1&1&0.1 \end{bmatrix} \end{align}\]

where \(\pmb k\) contains information only in the last two dimensions; the first is unknown. The unnormalized and normalized similarity scores are

\[\begin{align} \pmb M\pmb k&=\begin{bmatrix} 2\\1.1 \end{bmatrix}\\ {\pmb M\pmb k\over|\pmb M||\pmb k|}&\approx \begin{bmatrix} 0.14\\0.55 \end{bmatrix} \end{align}\]

where the ranking of the normalized similarity scores is reversed because of the large \(\ell^2\)-norm of the first memory row. The issue is further exacerbated when the known part of the key \(\pmb k\) is short and the word size of the memory \(\pmb M\) is large.
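
The effect is easy to check numerically. Below is a minimal NumPy sketch of the example above (the variable names are mine):

```python
import numpy as np

k = np.array([0.0, 1.0, 1.0])            # partial key: first dimension unknown
M = np.array([[10.0, 1.0, 1.0],          # cell 0 matches k in the known dimensions
              [ 1.0, 1.0, 0.1]])         # cell 1 is a weaker match

unnormalized = M @ k                     # [2.0, 1.1]: cell 0 wins
cosine = unnormalized / (np.linalg.norm(M, axis=1) * np.linalg.norm(k))
print(unnormalized)                      # [2.  1.1]
print(cosine.round(2))                   # [0.14 0.55]: the ranking is reversed
```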

Proposed Solution

The problem can be solved by explicitly masking the part of the key that is unknown and should not take part in the search. This is done by having the controller emit a separate mask vector \(\pmb m\) and multiplying both the search key and the memory content by it before computing the similarity:

\[\begin{align} \mathcal C(\pmb M,\pmb k,\beta,\pmb m)=\text{softmax}(\beta \hat{\pmb M}\hat{\pmb k}) \end{align}\]

where \(\hat{\pmb M}\) is the masked memory \(\pmb M\circ\pmb 1\pmb m^{\top}\) with \(\ell^2\)-normalized rows and \(\hat{\pmb k}\) is the \(\ell^2\)-normalized masked key \(\pmb k\circ\pmb m\). Applying the mask before normalization keeps the unknown dimensions out of both the dot product and the norms of the cosine similarity.
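
Here is a minimal NumPy sketch of this masked look-up, assuming a memory of shape (N, W), a key and mask of length W, and a scalar key strength; the function name and the small epsilon guard are mine:

```python
import numpy as np

def masked_content_addressing(M, k, beta, m, eps=1e-8):
    """Mask both the key and every memory row, then compare them with
    cosine similarity and turn the scores into weights with a softmax."""
    Mm = M * m                                   # mask the memory rows
    km = k * m                                   # mask the key
    sim = Mm @ km / (np.linalg.norm(Mm, axis=1) * np.linalg.norm(km) + eps)
    z = beta * sim
    z -= z.max()                                 # numerically stable softmax
    w = np.exp(z)
    return w / w.sum()

# With the example above and a mask hiding the unknown first dimension,
# the first cell receives the larger weight again:
M = np.array([[10.0, 1.0, 1.0], [1.0, 1.0, 0.1]])
k = np.array([0.0, 1.0, 1.0])
m = np.array([0.0, 1.0, 1.0])
print(masked_content_addressing(M, k, beta=5.0, m=m))
```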

De-allocation and Content-Based Look-Up

Problem

De-allocation in the DNC relies on the write weighting \(\pmb w_t^w\), which is a combination of the allocation weighting and the content-based weighting. Because this single weighting governs both how much is erased and how much is written, it is not very effective for de-allocation. The result is memory leakage: a read head may still find cells that were supposed to have been freed.

Proposed Solution

Csordás & Schmidhuber propose to reuse the retention vector \(\pmb \psi_t\) from the allocation mechanism to zero out the content of de-allocated cells:

\[\begin{align} \pmb M_t=\pmb M_{t-1}\circ\pmb\psi_t\pmb 1^\top\circ(\pmb E-\pmb w_t^w\pmb e_t^{\top})+\pmb w_t^w\pmb v_t^{\top} \end{align}\]
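
A minimal NumPy sketch of this modified write step, assuming a memory of shape (N, W); the function and variable names are mine:

```python
import numpy as np

def write_with_deallocation(M, psi, w_w, e, v):
    """M: (N, W) memory, psi: (N,) retention vector, w_w: (N,) write weighting,
    e: (W,) erase vector, v: (W,) write vector."""
    M = M * psi[:, None]                 # de-allocation: psi ~ 0 wipes a freed cell
    M = M * (1.0 - np.outer(w_w, e))     # standard DNC erase
    M = M + np.outer(w_w, v)             # standard DNC write
    return M
```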

Note that the cosine similarity used in content-based look-up is normalized by the \(\ell^2\)-norm, which would seem to cancel the effect of this operation: when a vector is scaled down, its \(\ell^2\)-norm shrinks by the same factor, so the normalized vector is unchanged. In practice, however, a small constant is added to the denominator for numerical stability:

\[\begin{align} D(\pmb u,\pmb v)={\pmb u\cdot\pmb v\over|\pmb u||\pmb v|+\epsilon} \end{align}\]

When a memory cell \(\pmb u\) has been erased and is therefore close to zero, the stabilizing constant \(\epsilon\) dominates the denominator. This results in a low score for the erased cell during content-based addressing.
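
A small NumPy sketch of this effect; the magnitudes are illustrative values of my own choosing:

```python
import numpy as np

def stabilized_cosine(u, v, eps=1e-6):
    """D(u, v) with the stabilizing epsilon in the denominator."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

key = np.array([0.0, 1.0, 1.0])
kept = np.array([0.0, 1.0, 1.0])       # untouched cell
erased = kept * 1e-8                   # cell after scaling by psi ~ 0
print(stabilized_cosine(key, kept))    # ~1.0
print(stabilized_cosine(key, erased))  # ~0.02: eps dominates, the erased cell scores low
```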

Sharpness of Temporal Link Distributions

Problem

The temporal memory linkage tracks the order in which cells were written using a link matrix \(\pmb L\), which is updated as follows:

\[\begin{align} \hat {\pmb L}_t&=\left(\pmb E-\pmb w_t^w\pmb 1^\top-\pmb 1(\pmb w_t^w)^\top\right)\circ\pmb L_{t-1}+\pmb w_t^w(\pmb p_{t-1})^\top\\ \pmb L_t&=\hat{\pmb L}_t\circ(\pmb E-\pmb I)\qquad\text{removes self-links} \end{align}\]

where \(\pmb E\) is a matrix of ones, \(\pmb I\) is the identity matrix, and \(\pmb p_{t-1}\) is the precedence weighting from the previous step.
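
For concreteness, a minimal NumPy sketch of this update for a single write head (the function and variable names are mine):

```python
import numpy as np

def update_link_matrix(L, p, w_w):
    """L: (N, N) link matrix, p: (N,) precedence weighting from the previous
    step, w_w: (N,) current write weighting."""
    N = len(w_w)
    scale = 1.0 - w_w[:, None] - w_w[None, :]   # (E - w 1^T - 1 w^T)
    L = scale * L + np.outer(w_w, p)            # decay old links, add new ones
    return L * (1.0 - np.eye(N))                # remove self-links
```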

If \(\pmb w_t^w\) is not one-hot, the sequential information stored in \(\pmb L_t\) for all non-zero addresses is weakened, and noise from the current write is mixed in again at every update. This makes the forward (\(f_t^i\)) and backward (\(b_t^i\)) distributions of cells that remain in memory for a long time increasingly noisy and eventually flattens them out. When multiple reads are chained through temporal links, each new address is produced by another multiplication by \(\pmb L_t\), so the blurring compounds exponentially.
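
A toy illustration of the blurring: the link matrix below is made up, encoding the write order 0 → 1 → 2 → 0 plus some noise left by soft (non-one-hot) writes.

```python
import numpy as np

L = np.array([[0.0, 0.1, 0.7],
              [0.7, 0.0, 0.1],
              [0.1, 0.7, 0.0]])
w = np.array([1.0, 0.0, 0.0])      # start from a sharp read address
for step in range(1, 6):
    w = L @ w                      # follow the forward link one hop
    w = w / w.sum()                # renormalize for readability
    print(step, w.max().round(3))  # the peak shrinks with every hop
```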

Proposed Solution

Csordás & Schmidhuber propose to sharpen the forward and backward link distributions. The controller emits additional scalars \(s_t^{f,i}\) and \(s_t^{b,i}\) that control an exponentiation applied before the softmax:

\[\begin{align} f_t^i&=S(\pmb L_t\pmb w_{t-1}^{r,i}, s_t^{f,i})\\ b_t^i&=S(\pmb L_t^\top\pmb w_{t-1}^{r,i}, s_t^{b,i})\\ \text{where}\quad S(\pmb d,s)&=\text{softmax}(\pmb d^s) \end{align}\]

Although this does not fix the noise accumulation in the link matrix \(\pmb L_t\), it significantly reduces the exponential blurring when following temporal links, making the noise in \(\pmb L_t\) less harmful.

Note that \(S(\pmb d,s)\) can be numerically unstable for small \(\pmb d\). We can stabilize it by:

\[\begin{align} S(\pmb d, s)=\text{softmax}\left(\left({\pmb d+\epsilon\over \max(\pmb d+\epsilon)}\right)^s\right) \end{align}\]
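
A minimal NumPy sketch of the stabilized sharpening; the epsilon value and the toy inputs are mine:

```python
import numpy as np

def sharpen(d, s, eps=1e-6):
    """Shift by eps, rescale by the maximum so exponentiation stays in a safe
    range, raise to the power s, then normalize with a softmax."""
    d = (d + eps) / np.max(d + eps)
    z = d ** s
    z = np.exp(z - z.max())
    return z / z.sum()

# Following a forward link with a blurry distribution: the sharpened result
# concentrates more mass on the best candidate than a plain softmax would.
L = np.array([[0.0, 0.6, 0.3],
              [0.1, 0.0, 0.5],
              [0.4, 0.2, 0.0]])
w_prev = np.array([0.7, 0.2, 0.1])
print(sharpen(L @ w_prev, s=5.0))
```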

References

Róbert Csordás, Jürgen Schmidhuber. Improving Differentiable Neural Computers Through Memory Masking, De-allocation, and Link Distribution Sharpness Control. ICLR 2019.

Code: https://github.com/xdever/dnc