Post-Norm and Pre-Norm Residual Units Explained – Deep Learning Tutorial

By | March 24, 2022

In this tutorial, we will introduce post-norm and pre-norm residual units, which are often used to improve the Transformer in deep learning. You can find more details in the paper Learning Deep Transformer Models for Machine Translation.


Post-Norm is defined as:

x_{l+1} = LN(x_l + F(x_l; θ_l))


Pre-Norm is defined as:

x_{l+1} = x_l + F(LN(x_l); θ_l)

Here LN() is the layer normalization function, and F(·; θ_l) is the sublayer (for example self-attention or the feed-forward network) with parameters θ_l. To learn how to implement layer normalization, you can read:

Layer Normalization Explained for Beginners – Deep Learning Tutorial
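The two definitions above can be sketched in plain NumPy. This is a minimal sketch, not the full Transformer sublayers: the function F below is a hypothetical toy sublayer (a small ReLU projection) standing in for attention or feed-forward layers.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # LN(): normalize over the last (feature) dimension, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def post_norm_unit(x, sublayer):
    # Post-Norm: x_{l+1} = LN(x_l + F(x_l))
    return layer_norm(x + sublayer(x))

def pre_norm_unit(x, sublayer):
    # Pre-Norm: x_{l+1} = x_l + F(LN(x_l))
    return x + sublayer(layer_norm(x))

# Hypothetical toy sublayer F: a ReLU projection (not a real attention block).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1
F = lambda x: np.maximum(x @ W, 0.0)

x = rng.standard_normal((2, 8))     # (batch, d_model)
print(post_norm_unit(x, F).shape)   # (2, 8)
print(pre_norm_unit(x, F).shape)    # (2, 8)
```

Note the structural difference: in post-norm the residual sum passes through LN, so each layer's output is re-normalized; in pre-norm the residual path x_l is left untouched, which gives a clean identity path from input to output and tends to make very deep stacks easier to train.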

Which one is better?

Both of these methods are good choices for implementing a Transformer. In the authors' experiments, the two show comparable BLEU performance for a system based on a 6-layer encoder.

In the paper Transformers without Tears: Improving the Normalization of Self-Attention, the authors find that pre-norm is better, particularly for training stability.

For example:

(Figure from the paper: Pre-Norm and Post-Norm, which is better)

In the paper Conformer: Convolution-augmented Transformer for Speech Recognition, pre-norm is also used.

The authors use pre-norm residual units with dropout, which helps train and regularize deeper models.
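A pre-norm unit with dropout can be sketched as follows. This is a hedged illustration, not the Conformer implementation: F is again a hypothetical toy sublayer, and inverted dropout is applied to the sublayer output before the residual addition.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN(): normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_dropout_unit(x, sublayer, rng, p=0.1, training=True):
    # Pre-Norm residual unit with dropout on the sublayer output:
    # x_{l+1} = x_l + Dropout(F(LN(x_l)))
    y = sublayer(layer_norm(x))
    if training and p > 0.0:
        mask = rng.random(y.shape) >= p      # keep each unit with prob 1 - p
        y = y * mask / (1.0 - p)             # inverted dropout: rescale so the
                                             # expected activation is unchanged
    return x + y

rng = np.random.default_rng(1)
F = np.tanh                                  # hypothetical toy sublayer
x = rng.standard_normal((2, 8))
out = pre_norm_dropout_unit(x, F, rng, p=0.1)
```

At evaluation time (training=False) the unit reduces exactly to the pre-norm equation x + F(LN(x)), which is one reason inverted dropout is the common convention: no extra rescaling is needed at inference.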