# Understand Jensen’s Inequality and Attention Mechanism in Deep Learning – Deep Learning Tutorial

March 7, 2021

The attention mechanism is an important technique for improving the performance of deep learning models. It has two basic forms:

$$s_i = \sum_{j=1}^nf(a_{ij}w_{ij})$$    (1)

or

$$s_i = \sum_{j=1}^na_{ij}f(w_{ij})$$    (2)

where $$a_{ij}$$ is the attention weight assigned to the word representation $$w_{ij}$$, and $$f(\cdot)$$ is a transformation function (for example, a nonlinearity).
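To make the difference concrete, here is a minimal NumPy sketch of the two forms for a single position $$i$$. The weight values, word vectors, and the choice of $$f = \tanh$$ are toy assumptions for illustration, not values from this post:

```python
import numpy as np

def attention_form_1(a, w, f):
    """Equation (1): apply f to the weighted inputs, then sum."""
    return float(np.sum(f(a * w)))

def attention_form_2(a, w, f):
    """Equation (2): apply f to the inputs, then take the weighted sum."""
    return float(np.sum(a * f(w)))

# Toy example (illustrative values, not from the post):
a = np.array([0.2, 0.3, 0.5])   # attention weights for one position i
w = np.array([1.0, -2.0, 3.0])  # word representations w_ij
f = np.tanh                     # a common choice of nonlinearity

s1 = attention_form_1(a, w, f)  # weight first, then apply f
s2 = attention_form_2(a, w, f)  # apply f first, then weight
```

The two forms generally produce different values whenever $$f$$ is nonlinear, which is exactly why the choice between them matters.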

Which form is better? Equation (1) or (2)?

We can find the answer from Jensen’s Inequality.

## Jensen’s Inequality

For a convex function $$f$$, weights $$\lambda_j \ge 0$$ with $$\sum_{j=1}^n\lambda_j = 1$$, and points $$x_1, \dots, x_n$$, Jensen's Inequality states:

$$f\left(\sum_{j=1}^n\lambda_jx_j\right) \le \sum_{j=1}^n\lambda_jf(x_j)$$

(For a concave function, the inequality is reversed.) The full text is available here:

http://www.cse.yorku.ca/~kosta/CompVis_Notes/jensen.pdf
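A quick numerical check of the inequality, using the convex function $$f(x) = x^2$$ (the weights and points below are arbitrary toy values):

```python
import numpy as np

# Jensen's Inequality for a convex f and normalized weights:
#   f(sum_j lambda_j * x_j) <= sum_j lambda_j * f(x_j)
f = np.square                     # x^2 is convex (f'' = 2 > 0)
lam = np.array([0.2, 0.3, 0.5])   # weights, non-negative and summing to 1
x = np.array([1.0, -2.0, 3.0])    # arbitrary points

lhs = f(np.sum(lam * x))          # f of the weighted average -> f(1.1) = 1.21
rhs = np.sum(lam * f(x))          # weighted average of f      -> 5.9
assert lhs <= rhs                 # Jensen's Inequality holds
```

Here the gap is large (1.21 vs. 5.9) because the points are spread out; for points close together the two sides nearly coincide.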


In deep learning, the loss function is usually designed to be convex (at least locally) so that SGD can reach a minimum. When $$f$$ is convex and the attention weights $$a_{ij}$$ are normalized to sum to 1, Jensen's Inequality tells us that applying the weights inside $$f$$ yields a value no larger than the weighted combination of $$f$$'s outputs. This motivates choosing Equation (1), where the weighting is applied before $$f$$.

To determine whether a function is convex, we can compute its second derivative: if $$f''(x) \ge 0$$ over the whole domain, the function is convex.
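This check can be sketched numerically with a central finite-difference approximation of the second derivative. The step size, test points, and tolerance below are assumptions chosen for illustration:

```python
import numpy as np

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

def looks_convex(f, xs, tol=1e-6):
    """f is convex on a region if f'' >= 0 there (checked numerically)."""
    return all(second_derivative(f, x) >= -tol for x in xs)

xs = np.linspace(-3.0, 3.0, 101)

# x^2 is convex everywhere (f'' = 2),
# while sin(x) is not convex on [-3, 3] (f'' = -sin(x) < 0 on (0, pi)).
```

For example, `looks_convex(lambda t: t**2, xs)` is true, while `looks_convex(np.sin, xs)` is false. A numerical check like this only samples a finite set of points, so it is a sanity check, not a proof of convexity.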