The attention mechanism is an important method for improving the performance of deep learning models. It has two basic forms:

\(s_i = \sum_{j=1}^nf(a_{ij}w_{ij}) \) (1)

or

\(s_i = \sum_{j=1}^na_{ij}f(w_{ij})\) (2)

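where \(a_{ij}\) is the attention weight of word \(w_{ij}\) and \(f\) is the function applied to it. Equation (1) weights each word first and then applies \(f\); Equation (2) applies \(f\) first and then weights the result.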
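
To make the difference concrete, here is a minimal sketch of both forms. It assumes NumPy, treats each \(w_{ij}\) as a scalar for simplicity, and uses `tanh` as a stand-in for \(f\); the function names and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np

def attention_form_1(a, w, f):
    """Equation (1): weight each input, apply f, then sum over j."""
    return f(a * w).sum(axis=1)

def attention_form_2(a, w, f):
    """Equation (2): apply f to each input, weight the result, then sum over j."""
    return (a * f(w)).sum(axis=1)

# Toy data (assumed): 2 positions i, 3 words j, one scalar per word.
w = np.array([[1.0, 2.0, 3.0],
              [0.5, -1.0, 2.0]])
# Attention weights: each row is non-negative and sums to 1.
a = np.array([[0.1, 0.2, 0.7],
              [0.3, 0.3, 0.4]])

s1 = attention_form_1(a, w, np.tanh)
s2 = attention_form_2(a, w, np.tanh)
print("form (1):", s1)
print("form (2):", s2)
```

The two forms coincide when \(f\) is the identity and generally differ for a nonlinear \(f\), which is why the choice between them matters.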

Which form is better? Equation (1) or (2)?

We can find the answer from Jensen’s Inequality.

## Jensen’s Inequality

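For a convex function \(f\) and weights \(a_j \ge 0\) with \(\sum_{j=1}^n a_j = 1\), Jensen’s Inequality is:

\(f\left(\sum_{j=1}^n a_j x_j\right) \le \sum_{j=1}^n a_j f(x_j)\)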

Here is the full text:

http://www.cse.yorku.ca/~kosta/CompVis_Notes/jensen.pdf

The illustrative example looks like:
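
A small numerical check can serve as one such example. The sketch below assumes NumPy and compares \(f(\sum_j a_j x_j)\) with \(\sum_j a_j f(x_j)\) for weights that are non-negative and sum to 1. The helper `jensen_gap` is hypothetical (not a library function), and the points `x` and weights `a` are arbitrary.

```python
import numpy as np

def jensen_gap(f, x, a):
    """Return sum_j a_j f(x_j) - f(sum_j a_j x_j).

    The result is non-negative when f is convex and non-positive when f is concave.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    if np.any(a < 0) or not np.isclose(a.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return float(np.dot(a, f(x)) - f(np.dot(a, x)))

x = [1.0, 2.0, 4.0]   # arbitrary positive points
a = [0.2, 0.3, 0.5]   # arbitrary weights summing to 1

gap_square = jensen_gap(np.square, x, a)  # x**2 is convex
gap_exp = jensen_gap(np.exp, x, a)        # exp is convex
gap_log = jensen_gap(np.log, x, a)        # log is concave

print("x^2 gap:", gap_square)
print("exp gap:", gap_exp)
print("log gap:", gap_log)
```

The two convex functions give a non-negative gap, while the concave `log` gives a non-positive one, which is the inequality reversed.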

In deep learning, the loss function is taken to be convex so that SGD can reach its minimum value. On that basis, we can use Equation (1).

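To determine whether a function is convex, we can compute its second derivative: a twice-differentiable function is convex on an interval if and only if its second derivative is non-negative everywhere on that interval.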
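
The second-derivative test can be approximated numerically. The sketch below assumes NumPy and estimates \(f''(x)\) with a central finite difference on a grid of sample points. The helpers `second_derivative` and `looks_convex` are hypothetical names, and the step size, grid, and tolerance are arbitrary choices. A sampled check like this can only suggest convexity on the tested interval; it does not prove it.

```python
import numpy as np

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def looks_convex(f, lo, hi, num=201, tol=1e-6):
    """True if the estimated f'' is >= -tol at every sampled point in [lo, hi]."""
    xs = np.linspace(lo, hi, num)
    return bool(np.all(second_derivative(f, xs) >= -tol))

print("x^2 :", looks_convex(np.square, -3.0, 3.0))
print("exp :", looks_convex(np.exp, -3.0, 3.0))
print("tanh:", looks_convex(np.tanh, -3.0, 3.0))
```

On this interval `x^2` and `exp` pass the check, while `tanh` fails because its second derivative is negative for positive inputs.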