In this tutorial, we will introduce the post-norm and pre-norm residual units, which are often used to improve the Transformer model in deep learning. You can find more detail in the paper Learning Deep Transformer Models for Machine Translation.

## Post-Norm

Post-Norm is defined as:

$$x_{l+1} = \mathrm{LN}(x_l + F(x_l))$$

where $x_l$ is the input of the $l$-th sub-layer and $F(\cdot)$ is the sub-layer function, such as self-attention or a feed-forward network.

## Pre-Norm

Pre-Norm is defined as:

$$x_{l+1} = x_l + F(\mathrm{LN}(x_l))$$

Note that pre-norm models usually apply one additional LN() to the output of the final layer.

Here, LN() is the layer normalization function. To learn how to implement layer normalization, you can read:

Layer Normalization Explained for Beginners – Deep Learning Tutorial
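To make the two definitions concrete, here is a minimal NumPy sketch of both residual units. The function names (`layer_norm`, `post_norm_unit`, `pre_norm_unit`) and the toy linear sub-layer standing in for F() are illustrative choices, not from the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def post_norm_unit(x, sublayer):
    # Post-Norm: x_{l+1} = LN(x_l + F(x_l))
    return layer_norm(x + sublayer(x))

def pre_norm_unit(x, sublayer):
    # Pre-Norm: x_{l+1} = x_l + F(LN(x_l))
    return x + sublayer(layer_norm(x))

# Toy sub-layer: a fixed linear projection (a hypothetical stand-in
# for self-attention or a feed-forward network).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
F = lambda x: x @ W

x = rng.normal(size=(4, 8))  # (sequence_length, d_model)
post = post_norm_unit(x, F)
pre = pre_norm_unit(x, F)

# Post-norm output is normalized last, so each position has ~zero mean.
print(np.allclose(post.mean(axis=-1), 0.0, atol=1e-5))
```

The key difference is where LN() sits: post-norm normalizes after the residual addition, while pre-norm normalizes only the sub-layer input and leaves the residual path untouched.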

## Which one is better?

Both of these methods are good choices when implementing a Transformer. In our experiments, they show comparable BLEU performance for a system based on a 6-layer encoder.

However, in the paper Transformers without Tears: Improving the Normalization of Self-Attention, pre-norm is found to work better.

For example, in the paper Conformer: Convolution-augmented Transformer for Speech Recognition, pre-norm is also used:

> We use prenorm residual units with dropout which helps training and regularizing deeper models.
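The dropout mentioned in the quote is typically applied to the sub-layer output before the residual addition. Here is a hedged sketch of such a pre-norm unit with inverted dropout; the placement and function names are common conventions rather than details taken from the Conformer paper:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def dropout(x, rate, rng, training=True):
    # Inverted dropout: scale kept units by 1/(1-rate) at train time,
    # identity at inference time.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def pre_norm_dropout_unit(x, sublayer, rate, rng, training=True):
    # x_{l+1} = x_l + Dropout(F(LN(x_l)))
    return x + dropout(sublayer(layer_norm(x)), rate, rng, training)

rng = np.random.default_rng(0)
F = np.tanh  # toy sub-layer standing in for attention / FFN
x = rng.normal(size=(4, 8))

# At inference time (training=False), dropout reduces to the identity.
out = pre_norm_dropout_unit(x, F, rate=0.1, rng=rng, training=False)
print(np.allclose(out, x + np.tanh(layer_norm(x))))
```

Because the residual path is never normalized or dropped, gradients can flow directly from the output to any lower layer, which is one reason pre-norm units help when training deeper models.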