Graduate Descent

Interactive KL Divergence Fitting


statistics machine-learning probability interactive

In a previous post, I covered the theory behind the two directions of KL divergence for fitting a model $q_\theta$ to a target $p$. The key takeaway: $\textbf{KL}(p \| q)$ is inclusive (mean-seeking) while $\textbf{KL}(q \| p)$ is exclusive (mode-seeking). But reading about it is one thing—seeing it is another.

The widget below lets you watch both directions optimize simultaneously. The target $p$ is a Gaussian mixture (shaded region), and we fit a single Gaussian $q$ (colored curve) using Adam. Drag the modes of $p$ to rearrange them, scroll to resize them, or drag $q$ to set its starting point. Use the + and − buttons to add or remove modes.

[Interactive widget. Top panel: KL(p || q), inclusive / mean-seeking. Bottom panel: KL(q || p), exclusive / mode-seeking.]

What to look for:

  • Inclusive (top, red): $q$ spreads out to cover all of $p$'s mass. It finds the moment-matching solution: the mean and variance of $q$ match those of the mixture. The dashed line shows this global optimum. The problem is convex in $q$'s natural parameters and has a single minimum, so the optimizer converges to the same answer from any starting point.

  • Exclusive (bottom, blue): $q$ locks onto a single mode. The dashed lines show local optima (darker = global best). The optimizer may find a different local minimum each time you hit Reset—the landscape is nonconvex. Try it a few times!

  • Drag the handles on $p$'s modes (dark circles) to move them. Scroll on a handle to change its width. You can also drag $q$'s handle (colored circle) to set a new starting point. Use + and − to add or remove modes.

  • Fixed-variance mode: Switch to "Gaussian(μ, fixed σ²)" to see mode-seeking even more clearly—$q$ can only slide left and right.
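The exclusive bullet and the fixed-variance mode can be reproduced offline with a rough sketch. It assumes a hypothetical target with two equal-weight, unit-variance modes at -3 and +3, and it swaps in plain gradient descent with a finite-difference gradient for the widget's Adam optimizer. The grid, step size, and function names are illustrative choices, not the widget's actual code.

```python
import numpy as np

# Hypothetical target p: two equal-weight, unit-variance modes at -3 and +3.
WEIGHTS = np.array([0.5, 0.5])
MUS = np.array([-3.0, 3.0])
SIGMAS = np.array([1.0, 1.0])

# Uniform grid used for numerical integration (an illustrative choice).
X = np.linspace(-15.0, 15.0, 6001)
DX = X[1] - X[0]


def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


def log_mixture(x):
    """Log density of the mixture p, computed stably in log space."""
    comps = np.log(WEIGHTS)[:, None] + log_normal(
        x[None, :], MUS[:, None], SIGMAS[:, None]
    )
    return np.logaddexp.reduce(comps, axis=0)


LOG_P = log_mixture(X)


def exclusive_kl(mu, sigma=1.0):
    """KL(q || p) for q = N(mu, sigma^2), approximated by a Riemann sum."""
    log_q = log_normal(X, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - LOG_P)) * DX


def descend(mu, lr=0.1, steps=500, eps=1e-4):
    """Plain gradient descent on mu (sigma fixed), central-difference gradient."""
    for _ in range(steps):
        grad = (exclusive_kl(mu + eps) - exclusive_kl(mu - eps)) / (2 * eps)
        mu -= lr * grad
    return mu


# Two starting points on opposite sides of the midpoint between the modes.
mu_left = descend(-1.0)
mu_right = descend(1.0)
print(f"start -1.0 -> mu = {mu_left:.3f}")
print(f"start +1.0 -> mu = {mu_right:.3f}")
```

Under these assumptions the two runs settle near different modes, one close to -3 and one close to +3, which is the start-dependent behavior the bottom panel shows.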

Why do we care?

The choice of KL direction determines what your model learns to ignore:

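  • Inclusive ($\textbf{KL}(p \| q)$): $q$ must put mass everywhere $p$ does, because the penalty $p \log (p / q)$ blows up wherever $q$ is near zero and $p$ is not. So it overshoots, spreading too wide. It would rather be wrong everywhere than miss a mode. This is the direction maximum likelihood minimizes, with $p$ as the empirical data distribution.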

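  • Exclusive ($\textbf{KL}(q \| p)$): $q$ can safely ignore modes of $p$, collapsing onto just one, because regions where $q$ is near zero contribute almost nothing to the objective. It would rather be precise about one thing than vaguely right about everything. This is the direction standard variational inference minimizes when it maximizes the ELBO.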
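To make the trade-off concrete, here is a minimal sketch that scores two candidate fits under both directions. It assumes the same kind of hypothetical target (two equal-weight, unit-variance modes at -3 and +3) and approximates each KL with a Riemann sum on a grid. The candidates are the moment-matched Gaussian and a Gaussian sitting on the right-hand mode; all names and grid settings are illustrative.

```python
import numpy as np

# Integration grid (illustrative choice; wide enough for both candidates).
x = np.linspace(-25.0, 25.0, 20001)
dx = x[1] - x[0]


def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


# Hypothetical target p: two equal-weight, unit-variance modes at -3 and +3.
weights = np.array([0.5, 0.5])
mus = np.array([-3.0, 3.0])
sigmas = np.array([1.0, 1.0])
log_p = np.logaddexp.reduce(
    np.log(weights)[:, None] + log_normal(x[None, :], mus[:, None], sigmas[:, None]),
    axis=0,
)


def kl(log_a, log_b):
    """KL(a || b) approximated as sum of a * (log a - log b) * dx on the grid."""
    return np.sum(np.exp(log_a) * (log_a - log_b)) * dx


# Candidate 1: moment-matched Gaussian (mean and variance of the mixture).
p_mean = np.sum(weights * mus)
p_var = np.sum(weights * (sigmas**2 + mus**2)) - p_mean**2
log_wide = log_normal(x, p_mean, np.sqrt(p_var))

# Candidate 2: a Gaussian that copies the right-hand mode.
log_mode = log_normal(x, 3.0, 1.0)

# Each entry holds (inclusive KL(p || q), exclusive KL(q || p)).
results = {
    "wide": (kl(log_p, log_wide), kl(log_wide, log_p)),
    "mode": (kl(log_p, log_mode), kl(log_mode, log_p)),
}
for name, (inclusive, exclusive) in results.items():
    print(f"{name}: KL(p||q) = {inclusive:.3f}, KL(q||p) = {exclusive:.3f}")
```

For this target, the inclusive direction scores the wide candidate lower and the exclusive direction scores the single-mode candidate lower, so the two objectives disagree about which fit is better.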