Kullback-Leibler divergence (KL divergence) measures the difference between two probability distributions over the same variable x. Specifically, the Kullback-Leibler (KL) divergence of q(x) from p(x), denoted DKL(p(x) || q(x)), is a measure of the information lost when q(x) is used to approximate p(x).
Let p(x) and q(x) be two probability distributions of a discrete random variable x. That is, both p(x) and q(x) sum up to 1, and p(x) > 0 and q(x) > 0 for any x in X. DKL(p(x) || q(x)) is defined as:

D_{KL}(p(x) \,\|\, q(x)) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}
The KL divergence measures the expected number of extra bits required to code samples from p(x) when using a code based on q(x), rather than using a code based on p(x). Typically p(x) represents the “true” distribution of data, observations, or a precisely calculated theoretical distribution. The distribution q(x) typically represents a theory, model, description, or approximation of p(x).
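As a minimal illustration, the Python sketch below evaluates the discrete definition directly; the function name kl_divergence and the example arrays p and q are illustrative choices, and both inputs are assumed to be strictly positive and to sum to 1, as required above. Natural logarithms give the result in nats; replacing np.log with np.log2 gives the extra-bits interpretation just described.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats (use np.log2 for bits).

    Assumes p and q are strictly positive and each sums to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Illustrative distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))  # D_KL(p || q)
print(kl_divergence(q, p))  # D_KL(q || p) -- generally a different value
```

Printing both directions already hints at the asymmetry discussed below: the two results are close here but not equal.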
The continuous version of the KL divergence is:

D_{KL}(p(x) \,\|\, q(x)) = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)} \, dx
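For continuous distributions the integral can often be approximated numerically. The sketch below is one such approximation for two univariate Gaussians; the parameter values, the helper integrand, and the finite integration limits are illustrative assumptions, and the well-known closed-form expression for the KL divergence between two Gaussians is included only as a cross-check.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative Gaussians: p(x) = N(0, 1), q(x) = N(1, 2^2).
mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 1.0, 2.0

def integrand(x):
    # p(x) * ln(p(x) / q(x)), the integrand of the continuous definition.
    p = norm.pdf(x, mu1, sigma1)
    q = norm.pdf(x, mu2, sigma2)
    return p * np.log(p / q)

# Both densities are numerically negligible outside [-30, 30].
kl_numeric, _ = quad(integrand, -30.0, 30.0)

# Closed-form KL divergence between two univariate Gaussians, for comparison.
kl_exact = np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

print(kl_numeric, kl_exact)  # the two values should agree closely
```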
Although the KL divergence measures the “distance” between two distributions, it is not a true distance measure, because it is not a metric. It is not symmetric: the KL divergence from p(x) to q(x) is generally not the same as the KL divergence from q(x) to p(x). Furthermore, it need not satisfy the triangle inequality. Nevertheless, DKL(P || Q) is non-negative: DKL(P || Q) ≥ 0, with DKL(P || Q) = 0 if and only if P = Q.
Note that care must be taken when computing the KL divergence when p(x) = 0 or q(x) = 0: by convention, a term with p(x) = 0 contributes 0 (since 0 ln 0 = 0), whereas the divergence is infinite if q(x) = 0 at some x where p(x) > 0.
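One way to encode these conventions is sketched below; the function name kl_divergence_safe and the example inputs are illustrative. Terms with p(x) = 0 are skipped, and infinity is returned whenever q(x) = 0 at a point where p(x) > 0.

```python
import numpy as np

def kl_divergence_safe(p, q):
    """Discrete KL divergence with the usual zero conventions:
    terms with p(x) = 0 contribute 0 (0 * ln 0 = 0 by convention),
    and the result is +inf if q(x) = 0 where p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0                       # skip terms where p(x) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence_safe([0.5, 0.5, 0.0], [0.25, 0.5, 0.25]))  # finite
print(kl_divergence_safe([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))    # inf: q = 0 where p > 0
```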