Scaled Dot-Product Attention is proposed in the paper: Attention Is All You Need.

Scaled Dot-Product Attention is defined as:

\[Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V\]
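
As a concrete illustration, here is a minimal NumPy sketch of this formula. The function name `scaled_dot_product_attention`, the shapes, and the random inputs are illustrative assumptions, not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    d_k = Q.shape[-1]
    # Dot-Product: raw attention scores QK^T, shape (seq_len_q, seq_len_k)
    scores = Q @ K.T
    # Scaled: divide the scores by sqrt(d_k)
    scores = scores / np.sqrt(d_k)
    # Attention: softmax over the last axis gives the attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values
    return weights @ V

# Example with made-up sizes
Q = np.random.randn(4, 64)   # 4 queries, d_k = 64
K = np.random.randn(6, 64)   # 6 keys
V = np.random.randn(6, 32)   # 6 values, d_v = 32
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 32)
```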

## How to understand Scaled Dot-Product Attention?

Scaled Dot-Product Attention contains three parts:

## 1. Scaled

It means the dot product is scaled. In the equation above, \(QK^T\) is divided (scaled) by \(\sqrt{d_k}\).

## Why should we scale the dot product of two vectors?

Because the dot product of two vectors may be very large. For example:

\[QK^T=1000\]

Then computing \(e^{1000}\) in the softmax may cause an overflow problem. The original paper also notes that large dot products push the softmax into regions with extremely small gradients, which is another reason for the scaling.
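
A quick NumPy check illustrates the point; the score 1000 and the key dimension \(d_k = 64\) are just hypothetical numbers for this example.

```python
import numpy as np

score = 1000.0                       # hypothetical unscaled dot product QK^T
print(np.exp(score))                 # inf (overflow, with a RuntimeWarning)

d_k = 64                             # hypothetical key dimension
print(np.exp(score / np.sqrt(d_k)))  # e^125 ≈ 1.9e54, large but still finite
```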

## 2. Dot-Product

It means the computation of \(QK^T\).

## 3. Attention

It means the computation of \(softmax(\frac{QK^T}{\sqrt{d_k}})\), which produces the attention weights that are then applied to \(V\).
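
To see that this softmax step turns the scaled scores into attention weights, here is a small sketch; the scores matrix is made up. Each row of the result is a probability distribution that sums to 1.

```python
import numpy as np

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])  # hypothetical scaled scores QK^T / sqrt(d_k)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights)                        # each row is a probability distribution
print(weights.sum(axis=-1))           # [1. 1.]
```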

## The Idea Behind Scaled Dot-Product Attention

Similar to the attention above, we can also define our own scaled dot-product attention.

You only need to scale the dot product by the square root of its dimension.

For example:

\(softmax(\frac{QK^T}{\sqrt{d_m}})\).

where \(Q \in R^{m \times n}\), \(K \in R^{1 \times n}\), and \(d_m\) is the scalar \(n\).
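
Here is a minimal NumPy sketch of this custom attention under one reading of the shapes: with a single key vector, \(QK^T\) is an \(m \times 1\) column of scores, and the softmax is taken over those \(m\) scores. The sizes \(m = 5\), \(n = 8\) are illustrative assumptions.

```python
import numpy as np

m, n = 5, 8                    # illustrative sizes
Q = np.random.randn(m, n)      # Q in R^{m x n}
K = np.random.randn(1, n)      # K in R^{1 x n}

scores = Q @ K.T / np.sqrt(n)  # scale the dot products by sqrt(d_m) = sqrt(n)
weights = np.exp(scores)
weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over the m scores
print(weights.shape)           # (5, 1)
print(weights.sum())           # 1.0
```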