# An Introduction to Scaled Dot-Product Attention in Deep Learning – Deep Learning Tutorial

October 11, 2020

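Scaled Dot-Product Attention was proposed in the paper *Attention Is All You Need*.

Scaled Dot-Product Attention is defined as:

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

Here $$Q$$, $$K$$ and $$V$$ are the query, key and value matrices, and $$d_k$$ is the dimension of each key vector.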

## How to understand Scaled Dot-Product Attention?

Scaled Dot-Product Attention has three parts:

## 1. Scaled

It means the dot product is scaled. In the definition of Scaled Dot-Product Attention, $$QK^T$$ is divided (scaled) by $$\sqrt{d_k}$$.
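To see why $$\sqrt{d_k}$$ is a natural scaling factor, the sketch below estimates the spread of dot products between random vectors. It assumes the components of the query and key vectors are independent with mean 0 and variance 1, which is the setting the paper uses to motivate the scaling. The function name `dot_product_std` and the sample sizes are illustrative choices, not part of the original tutorial.

```python
import numpy as np

def dot_product_std(d_k, n_samples=5000, seed=0):
    """Return the empirical std of q.k before and after dividing by sqrt(d_k).

    Assumption: q and k have independent standard-normal components.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)          # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

Under this assumption the raw standard deviation grows roughly like $$\sqrt{d_k}$$, while the scaled one stays close to 1 for every dimension.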

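## Why should we scale the dot product of two vectors?

Because the dot product of two vectors can be very large. For example, suppose one entry of $$QK^T$$ is:

$$QK^T=1000$$

Then softmax has to compute $$e^{1000}$$, which may cause an overflow. The paper itself gives a related reason: for large $$d_k$$ the dot products grow large in magnitude, which pushes softmax into regions where its gradients are extremely small.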
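The overflow can be reproduced in a few lines. This is a minimal sketch, assuming NumPy with 64-bit floats; `naive_softmax`, `stable_softmax` and the choice of $$d_k = 64$$ are illustrative and not taken from the paper. Note that scaling alone is not a general guard against overflow, so the sketch also shows the common trick of subtracting the maximum score before exponentiating.

```python
import numpy as np

def naive_softmax(x):
    """Softmax computed directly from the definition (can overflow)."""
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    """Softmax with the maximum subtracted first; same result, no overflow."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([1000.0, 999.0, 998.0])

# exp(1000) overflows to inf in float64, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)

stable = stable_softmax(scores)

# Dividing by sqrt(d_k) shrinks the scores, so even the naive version works
# here, and the resulting distribution is less sharply peaked.
d_k = 64  # assumed key dimension, for illustration only
scaled = naive_softmax(scores / np.sqrt(d_k))

print("naive :", naive)
print("stable:", stable)
print("scaled:", scaled)
```

The scaled weights are flatter than the unscaled ones, which is the regime where softmax still has useful gradients.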

## 2. Dot-Product

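It means the computation of $$QK^T$$. Each entry of this matrix is the dot product of one query vector (a row of $$Q$$) with one key vector (a row of $$K$$).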

## 3. Attention

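It means the computation of $$softmax(\frac{QK^T}{\sqrt{d_k}})$$. The result is a set of attention weights, which are then multiplied by $$V$$ to produce the output.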
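Putting the three parts together, the whole operation can be sketched in NumPy as follows. This is a minimal, unbatched version without masking or dropout; the function names and the example shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    Returns the output (n_queries, d_v) and the weights (n_queries, n_keys).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot product, then scaling
    weights = softmax(scores, axis=-1)   # attention weights over the keys
    return weights @ V, weights

# Example shapes chosen arbitrarily: 2 queries, 5 keys/values.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # (2, 4)
print(weights.shape)   # (2, 5)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of `V`.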

## The Thinking in Scaled Dot-Product Attention

Following the same idea, we can also define our own scaled dot-product attention.

We only need to scale the dot product by the square root of the vector dimension, just as $$\sqrt{d_k}$$ is used in the original attention.

For example:

$$softmax(\frac{QK^T}{\sqrt{d_m}})$$.

Here $$Q\in R^{m \times n}$$, $$K \in R^{1 \times n}$$, and $$d_m = n$$ is the length of the vectors in each dot product.
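A small numeric sketch of this custom variant is shown below. Since $$QK^T$$ has shape $$m \times 1$$ here, the sketch assumes the softmax is taken over the $$m$$ rows; the original formula does not state the axis, so treat that choice, the function name `custom_attention_weights` and the example values as assumptions.

```python
import numpy as np

def custom_attention_weights(Q, K):
    """Weights softmax(Q K^T / sqrt(n)) for Q of shape (m, n), K of shape (1, n).

    Assumption: softmax is normalized over the m rows of Q.
    """
    m, n = Q.shape
    if K.shape != (1, n):
        raise ValueError("K must have shape (1, n)")
    scores = Q @ K.T / np.sqrt(n)        # shape (m, 1), scaled by sqrt(n)
    e = np.exp(scores - scores.max())    # subtract max for stability
    return e / e.sum()

Q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
K = np.array([[2.0, 2.0, 0.0, 0.0]])

weights = custom_attention_weights(Q, K)
print(weights.ravel())
```

With these values the raw dot products are 2, 4 and 0, and $$\sqrt{n} = 2$$, so the scaled scores are 1, 2 and 0; the second row of $$Q$$ receives the largest weight.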