**Kullback-Leibler divergence** (KL divergence) can measure the difference between two probability distributions over the same variable *x*. Specifically, the Kullback-Leibler (KL) divergence of *q(x)* from *p(x)*, denoted *D*_KL(*p(x)* || *q(x)*), is a measure of the information lost when *q(x)* is used to approximate *p(x)*.

Let *p(x)* and *q(x)* be two probability distributions of a discrete random variable *x*. That is, both *p(x)* and *q(x)* sum to 1, and *p(x) > 0* and *q(x) > 0* for every *x* in *X*. *D*_KL(*p(x)* || *q(x)*) is defined as:

$$D_{KL}(p(x) \| q(x)) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$

The KL divergence measures the expected number of extra bits required to code samples from *p(x)* when using a code based on *q(x)*, rather than using a code based on *p(x)*. Typically *p(x)* represents the “true” distribution of data, observations, or a precisely calculated theoretical distribution. The measure *q(x)* typically represents a theory, model, description, or approximation of *p(x)*.
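The discrete definition translates directly into code. As a minimal sketch (the helper name `kl_divergence` is hypothetical, and natural logarithms are used, so the result is in nats rather than bits):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q), where p and q are lists of
    probabilities over the same outcomes (each summing to 1, q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# p plays the role of the "true" distribution, q an approximation of it
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))
```

Using `math.log2` instead of `math.log` would give the answer in bits, matching the coding interpretation above.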

The continuous version of the KL divergence is:

$$D_{KL}(p(x) \| q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

Although the KL divergence measures the “distance” between two distributions, it is not a distance measure, because it is not a metric. It is not symmetric: the KL divergence from *p(x)* to *q(x)* is generally not the same as the KL divergence from *q(x)* to *p(x)*. Furthermore, it need not satisfy the triangle inequality. Nevertheless, *D*_KL(*P* || *Q*) is a non-negative measure:

$$D_{KL}(P \| Q) \ge 0, \quad \text{and} \quad D_{KL}(P \| Q) = 0 \ \text{if and only if}\ P = Q.$$
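Both properties are easy to observe numerically. A small sketch (the helper `kl` is a hypothetical name for the discrete definition above, in nats) showing that the two directions differ while each stays non-negative:

```python
import math

def kl(p, q):
    # Discrete KL divergence in nats; assumes matching outcomes with q > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q), kl(q, p))  # the two directions give different values
print(kl(p, p))            # identical distributions give zero
```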

Notice that attention should be paid when computing the KL divergence when *p(x) = 0* or *q(x) = 0*. By convention, terms with *p(x) = 0* contribute nothing (taking 0 log 0 = 0), whereas a point where *q(x) = 0* but *p(x) > 0* makes the divergence infinite.
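These conventions can be made explicit in code. A sketch (the helper name `kl_safe` is hypothetical):

```python
import math

def kl_safe(p, q):
    """Discrete KL divergence with explicit zero handling:
    terms with p(x) = 0 are skipped (convention 0 log 0 = 0), and the
    result is infinite if q(x) = 0 at any point where p(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 log 0 = 0 by convention
        if qi == 0:
            return math.inf   # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

print(kl_safe([0.5, 0.5, 0.0], [0.25, 0.5, 0.25]))
print(kl_safe([0.5, 0.5], [1.0, 0.0]))  # inf
```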