Layer Normalization was proposed by Jimmy Ba et al. in 2016. Here is the paper:

https://arxiv.org/abs/1607.06450

In this tutorial, we will introduce it for machine learning beginners.

## Layer Normalization in Neural Networks

Layer Normalization is widely used today, for example in multi-head attention networks such as the Transformer.

Layer Normalization is applied within each layer: it normalizes over the hidden units of a single sample, so it does not depend on the batch size.

## What is Layer Normalization?

Layer Normalization can be viewed as a function $\mathrm{LN}(\cdot)$ applied to the inputs of a layer:

$$y_{i} = \mathrm{LN}(x_{i})$$

In a neural network, the $l$-th layer can be computed as:

$$a_{i}^{l} = \left(w_{i}^{l}\right)^{\top} h^{l}, \qquad h_{i}^{l+1} = f\left(a_{i}^{l} + b_{i}^{l}\right)$$

where $w_{i}^{l}$ is the incoming weight vector of the $i$-th hidden unit in the $l$-th layer, $b_{i}^{l}$ is its bias, $h^{l}$ is the input to the layer, and $f$ is the activation function.
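As a concrete illustration, here is a minimal NumPy sketch of this forward computation; the layer sizes, the example input, and the ReLU activation are arbitrary choices for the example, not part of the paper:

```python
import numpy as np

# Hypothetical sizes: 4 inputs feeding a layer l with 3 hidden units.
h = np.array([1.0, 2.0, 3.0, 4.0])   # inputs h^l to layer l
W = np.random.randn(3, 4)            # row i is the weight vector w_i^l
b = np.zeros(3)                      # biases b_i^l

a = W @ h                            # summed inputs a_i^l = (w_i^l)^T h^l
h_next = np.maximum(0.0, a + b)      # h^{l+1} = f(a^l + b^l), with f = ReLU
print(h_next)
```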

In order to normalize the $l$-th layer, we can normalize $a_{i}^{l}$ as follows:

$$\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_{i}^{l}, \qquad \sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_{i}^{l} - \mu^{l}\right)^{2}}$$

$$\bar{a}^{l} = \frac{g^{l}}{\sigma^{l} + \varepsilon} \odot \left(a^{l} - \mu^{l}\right)$$

where $H$ denotes the number of hidden units in a layer, $\varepsilon$ is a small constant (e.g. 1e-12) added to avoid division by zero, $g^{l}$ is a gain parameter, and $\odot$ is the element-wise multiplication between two vectors.

You should notice: the gain $g^{l}$ can be omitted if you do not want to scale the normalized values.
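Here is a minimal NumPy sketch of these equations; the function name layer_norm and the example vector are illustrative, and eps plays the role of $\varepsilon$ above:

```python
import numpy as np

def layer_norm(a, g=None, eps=1e-12):
    """Normalize the summed inputs a^l over the H hidden units of one layer."""
    mu = a.mean()                             # mu^l: mean over hidden units
    sigma = np.sqrt(((a - mu) ** 2).mean())   # sigma^l: std over hidden units
    a_norm = (a - mu) / (sigma + eps)         # normalized summed inputs
    if g is not None:                         # optional gain g^l, element-wise
        a_norm = g * a_norm
    return a_norm

a = np.array([1.0, 2.0, 3.0, 4.0])  # a^l for a layer with H = 4 hidden units
print(layer_norm(a))                # output has mean ~0 and std ~1
```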

## Layer Normalization in RNN

In an RNN, the summed inputs at the $t$-th time step are $a^{t} = W_{hh} h^{t-1} + W_{xh} x^{t}$, and they can be normalized as:

$$\mu^{t} = \frac{1}{H}\sum_{i=1}^{H} a_{i}^{t}, \qquad \sigma^{t} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_{i}^{t} - \mu^{t}\right)^{2}}$$

$$h^{t} = f\!\left[\frac{g}{\sigma^{t}} \odot \left(a^{t} - \mu^{t}\right) + b\right]$$

where $g$ and $b$ are the gain and bias parameters, shared across all time steps.
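As a sketch, here is one layer-normalized time step of a vanilla RNN in NumPy; the cell structure, the tanh activation, and all the sizes are illustrative assumptions:

```python
import numpy as np

def ln_rnn_step(x_t, h_prev, W_xh, W_hh, g, b, eps=1e-12):
    """One layer-normalized step of a vanilla RNN (f = tanh)."""
    a_t = W_hh @ h_prev + W_xh @ x_t           # summed inputs a^t
    mu = a_t.mean()                            # mu^t
    sigma = np.sqrt(((a_t - mu) ** 2).mean())  # sigma^t
    # h^t = f[(g / sigma^t) * (a^t - mu^t) + b]
    return np.tanh(g / (sigma + eps) * (a_t - mu) + b)

# Hypothetical sizes: 2-dimensional input, 3 hidden units.
H, D = 3, 2
h = np.zeros(H)
W_xh, W_hh = np.random.randn(H, D), np.random.randn(H, H)
g, b = np.ones(H), np.zeros(H)
for x_t in np.random.randn(5, D):              # unroll over 5 time steps
    h = ln_rnn_step(x_t, h, W_xh, W_hh, g, b)
print(h)
```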

## How to Implement Layer Normalization in TensorFlow?

In TensorFlow 1.x, we can use tf.contrib.layers.layer_norm() to implement it. Note that tf.contrib was removed in TensorFlow 2.x, where tf.keras.layers.LayerNormalization is the equivalent.
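For example, a minimal TensorFlow 2.x sketch using tf.keras.layers.LayerNormalization; the input tensor and the epsilon value are just illustrative:

```python
import tensorflow as tf

# TensorFlow 2.x: normalize each sample over its last axis (the hidden units).
ln = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-12)

x = tf.constant([[1.0, 2.0, 3.0, 4.0]])  # one sample with H = 4 hidden units
y = ln(x)                                # per-sample mean ~0, variance ~1
print(y.numpy())
```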