Gated self-attention is an improvement on the self-attention mechanism. In this tutorial, we will introduce it for deep learning beginners.
Gated self-attention has two parts: a gate and self-attention.
The gate is computed with a sigmoid function, for example:
\(g_t = \mathrm{sigmoid}(W[h_t, s_t])\)
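Here \([h_t, s_t]\) is the concatenation of the two feature vectors and \(W\) is a learned weight matrix. Because \(g_t\) lies between 0 and 1, we can use it to fuse \(h_t\) and \(s_t\) as follows: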
\(u_t = g_t \cdot h_t + (1-g_t) \cdot s_t\)
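The two formulas above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function name `gated_fusion`, the shapes, and the random weights are assumptions made for the example, and the gate is assumed to be a vector with one value per feature dimension (the formulas would also work with a single scalar gate per time step).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h, s, W):
    """Fuse h and s with a sigmoid gate.

    h, s: arrays of shape (T, d), one row per time step t.
    W:    weight matrix of shape (2d, d) (hypothetical shape, no bias,
          to match g_t = sigmoid(W[h_t, s_t]) with row vectors).
    """
    hs = np.concatenate([h, s], axis=-1)  # [h_t, s_t], shape (T, 2d)
    g = sigmoid(hs @ W)                   # gate, shape (T, d), values in (0, 1)
    u = g * h + (1.0 - g) * s             # u_t = g_t * h_t + (1 - g_t) * s_t
    return u, g

rng = np.random.default_rng(0)
T, d = 4, 8
h = rng.normal(size=(T, d))
s = rng.normal(size=(T, d))
W = rng.normal(scale=0.1, size=(2 * d, d))  # stands in for a learned matrix

u, g = gated_fusion(h, s, W)
print(u.shape)  # (4, 8)
```

Since each gate value is between 0 and 1, every entry of `u` lies between the corresponding entries of `h` and `s`. In a real model, `W` would be trained together with the rest of the network.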
Here \(h_t\) or \(s_t\) can be computed by self-attention.
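As a rough sketch of how \(s_t\) could come from self-attention, the code below applies plain scaled dot-product self-attention to a sequence of \(h_t\) vectors. It is a simplification: there are no learned query, key, or value projections, and the function names are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    """Scaled dot-product self-attention without learned projections.

    h: array of shape (T, d). Returns s of shape (T, d) and the
    attention weights of shape (T, T).
    """
    d = h.shape[-1]
    scores = h @ h.T / np.sqrt(d)        # similarity between every pair of steps
    weights = softmax(scores, axis=-1)   # each row sums to 1
    s = weights @ h                      # s_t is a weighted average of all h
    return s, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
s, weights = self_attention(h)
print(s.shape, weights.shape)  # (5, 4) (5, 5)
```

The resulting `s` has the same shape as `h`, so the two can be passed to a gate and fused.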
Alternatively, you can concatenate \(h_t\) and \(s_t\) to get \(u_t\).
If there are more than two features, you can use self-attention to compute a weight for each feature.
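One simple way to weight more than two features is sketched below. Note that this is a swapped-in simplification (attention pooling with a single scoring vector) rather than full self-attention, and `attention_fuse` and the scoring vector `v` are hypothetical names chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(features, v):
    """Fuse K feature vectors with softmax attention weights.

    features: array of shape (K, d), one row per feature.
    v:        scoring vector of shape (d,) (assumed to be learned).
    """
    scores = features @ v          # one score per feature, shape (K,)
    weights = softmax(scores)      # weights are positive and sum to 1
    fused = weights @ features     # weighted sum, shape (d,)
    return fused, weights

rng = np.random.default_rng(0)
K, d = 3, 6
features = rng.normal(size=(K, d))
v = rng.normal(size=d)

fused, weights = attention_fuse(features, v)
print(fused.shape, weights.shape)  # (6,) (3,)
```

With only two features, the softmax over two scores is the same as a sigmoid of their difference, so this reduces to a scalar version of the gate.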
When to use gated self-attention?
If you plan to fuse two features, you can use a gate function to assign a different weight to each of them.
Here is an example: