In most classification problems (text classification, sentiment classification, etc.), researchers often use **Cross Entropy** as the loss function for their models. Do you know why?

**Cross Entropy** is computed as:

*H(p, q) = -∑_{i}p_{i}log(q_{i})*

It is also defined as:

*H(p, q) = H(p) + D_{KL}(p||q)*

Here you can read about the relation between Cross Entropy, Entropy and Kullback-Leibler Divergence.

**Why is cross entropy often used as the loss function of deep learning models in classification problems?** We can analyze this from its equation.

## 1. Consider only one class.

(1) *y* is the true class label, one-hot encoded, e.g. *y = [0, 0, 0, 0, 1]*

We can compute its entropy (using the convention *0·log(0) = 0*):

*H(y) = -∑_{i}y_{i}log(y_{i}) = 0*

(2) *y_{pred}* is the predicted class distribution computed by the model, e.g. *y_{pred} = [0.1, 0.15, 0.2, 0.35, 0.2]*

To make our model predict the class label correctly, we should minimize the error between *y* and *y_{pred}*.

It means:

*error = f(y, y_{pred})*

**How do we measure the error? In other words, what is *f*?**

The best value of *y_{pred}* is *[0, 0, 0, 0, 1]*, that is:

*y_{pred} = y = [0, 0, 0, 0, 1]*

However, it is hard for the model to reach this best value exactly.

To measure the error, we use cross entropy as *f*.

It means:

*f(y, y_{pred}) = H(y, y_{pred})*

However, we can not use:

*f(y_{pred}, y) = H(y_{pred}, y)*

because cross entropy is not symmetric:

*H(y, y_{pred}) ≠ H(y_{pred}, y)*

Because:

*H(y, y_{pred}) = H(y) + D_{KL}(y||y_{pred})*

Since *H(y) = 0*, this simplifies to:

*H(y, y_{pred}) = D_{KL}(y||y_{pred})*

For *y = [0, 0, 0, 0, 1]*:

*H(y, y_{pred}) = D_{KL}(y||y_{pred}) = 0·log(0/y_{pred}[0]) + 0·log(0/y_{pred}[1]) + 0·log(0/y_{pred}[2]) + 0·log(0/y_{pred}[3]) + 1·log(1/y_{pred}[4])*

*= 0 + log(1/y_{pred}[4]) = -log(y_{pred}[4])*
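Since the loss reduces to *-log(y_{pred}[4])* for this one-hot label, a quick numeric check (with illustrative values) shows it shrinks toward 0 as *y_{pred}[4]* grows:

```python
import math

# For one-hot y = [0, 0, 0, 0, 1], H(y, y_pred) reduces to -log(y_pred[4]).
for p4 in [0.2, 0.5, 0.9, 0.99]:
    print(p4, -math.log(p4))  # loss decreases as p4 approaches 1
```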

It means we only need to make the value of *y_{pred}[4]* as large as possible; the best case is *y_{pred}[4] ≈ 1*.

**Why can we not use H(y_{pred}, y)?**

For *y = [0, 0, 0, 0, 1]*:

*H(y_{pred}, y) = H(y_{pred}) + D_{KL}(y_{pred}||y) = H(y_{pred}) + y_{pred}[0]·log(y_{pred}[0]/0) + y_{pred}[1]·log(y_{pred}[1]/0) + y_{pred}[2]·log(y_{pred}[2]/0) + y_{pred}[3]·log(y_{pred}[3]/0) + y_{pred}[4]·log(y_{pred}[4]/1)*

*= H(y_{pred}) + ∞ + y_{pred}[4]·log(y_{pred}[4]/1)*

Whenever any *y_{pred}[i] > 0* for *i ≠ 4*, the term *log(y_{pred}[i]/0)* diverges, so we can not minimize *H(y_{pred}, y)*.
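To see the asymmetry concretely, here is a small sketch of the KL divergence under the conventions above (*0·log(0) = 0*, division by zero diverges); the arrays are the example values from this section:

```python
import math

y = [0, 0, 0, 0, 1.0]
y_pred = [0.1, 0.15, 0.2, 0.35, 0.2]

def dkl(p, q):
    """D_KL(p || q); any p[i] > 0 with q[i] == 0 makes the sum infinite."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0/qi) := 0
        if qi == 0:
            return math.inf  # pi * log(pi/0) diverges
        total += pi * math.log(pi / qi)
    return total

print(dkl(y, y_pred))  # finite: log(1/0.2) ≈ 1.609, can be minimized
print(dkl(y_pred, y))  # inf: prediction puts mass on classes where y is 0
```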

## 2. Consider multiple samples.

For a batch of samples, the loss value is the sum (or mean) of the cross entropy of each sample.

To summarize:

- Using cross entropy as the loss function in classification problems enables the model to learn to classify correctly.
- We can use *H(y, y_{pred})*, but we can not use *H(y_{pred}, y)*.