Interactive Divergence Fitting

This page builds on Tim Vieira's interactive KL-divergence fitting post.

Monte Carlo gradient estimation — drag the target modes or the fitted distribution

[Interactive demo. Controls: a steps/s speed setting, a live readout of the fitted parameters $\phi$ of $\textcolor{#7B2D8E}{q_\phi}$, and a choice of objective: the mean-seeking $\mathrm{KL}(p \| \textcolor{#7B2D8E}{q_\phi})$ or $\chi^2(p \| \textcolor{#7B2D8E}{q_\phi})$, versus the mode-seeking $\mathrm{KL}(\textcolor{#7B2D8E}{q_\phi} \| p)$ or $\chi^2(\textcolor{#7B2D8E}{q_\phi} \| p)$, with the current divergence value displayed. Legend: target $p(z)$; fitted $\textcolor{#7B2D8E}{q_\phi}$; optimal $q^*$ (dashed = global, dotted = local). A lower panel shows the weights $w = \tilde{p}/\textcolor{#7B2D8E}{q_\phi}$, or $w^2 = (\tilde{p}/\textcolor{#7B2D8E}{q_\phi})^2$ in forward-$\chi^2$ mode; rug marks: blue = positive gradient contribution, red = negative.]
Reading the plots
How the gradients are computed

Forward KL: $\nabla_\phi \mathrm{KL}(p \| q_\phi) = -\mathbb{E}_p[\nabla_\phi \log q_\phi(x)]$. For a Gaussian $q$ this depends only on $\mathbb{E}_p[x]$ and $\mathbb{E}_p[(x-\mu)^2]$, which are available in closed form for a Gaussian mixture — no integration or sampling needed. The forward KL is convex; the gradient always points toward the unique moment-matching solution.
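A minimal sketch of this closed-form computation; the mixture parameters and function names below are illustrative assumptions, not the demo's actual values:

```python
import numpy as np

# Illustrative Gaussian-mixture target: weights, means, stds (assumed values)
wts = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
sds = np.array([0.6, 0.6])

def forward_kl_grad(mu, sigma):
    """Exact gradient of KL(p || N(mu, sigma^2)) from closed-form mixture moments."""
    m1 = np.sum(wts * mus)                           # E_p[x]
    s2 = np.sum(wts * (sds**2 + (mus - mu)**2))      # E_p[(x - mu)^2]
    d_mu = -(m1 - mu) / sigma**2
    d_sigma = 1.0 / sigma - s2 / sigma**3
    return d_mu, d_sigma

# The unique stationary point is the moment-matching solution:
mu_star = np.sum(wts * mus)
var_star = np.sum(wts * (sds**2 + mus**2)) - mu_star**2
g = forward_kl_grad(mu_star, np.sqrt(var_star))
```

At the moment-matching $(\mu^*, \sigma^*)$ both partial derivatives vanish, which is the convexity claim made concrete.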

Reverse KL: $\nabla_\phi \mathrm{KL}(q_\phi \| p) = -\nabla_\phi \mathbb{E}_{q_\phi}[\log p(x)] + \ldots$ requires $\mathbb{E}_q[\nabla_x \log p(x)]$, an expectation of a nonlinear function of $p$ under $q$. This has no closed form, so it is computed by numerical integration (trapezoidal rule, 400 grid points) — effectively exact in 1D. The reverse KL is nonconvex, so gradient descent finds local minima but cannot escape them.
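A sketch of that quadrature, under an assumed two-mode mixture target (parameter values are mine). It uses the identity $\mathbb{E}_q[\nabla_\phi \log q_\phi] = 0$, so the gradient reduces to the integral of $q \cdot (\text{score of } q) \cdot (\log q - \log p)$:

```python
import numpy as np

# Illustrative mixture target (parameter values are assumptions, not the demo's)
wts = np.array([0.5, 0.5]); mus = np.array([-2.0, 2.0]); sds = np.array([0.6, 0.6])

def log_p(x):
    c = wts * np.exp(-0.5 * ((x[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    return np.log(c.sum(axis=1))

def trapz(y, x):
    """Trapezoidal rule on a grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def reverse_kl_grad(mu, sigma, n=400, lo=-8.0, hi=8.0):
    """grad of KL(q || p) wrt (mu, sigma) by trapezoidal quadrature."""
    x = np.linspace(lo, hi, n)
    q = np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2*np.pi))
    f = np.log(q) - log_p(x)                      # pointwise log q - log p
    s_mu = (x - mu) / sigma**2                    # score of q wrt mu
    s_sd = ((x - mu)**2 - sigma**2) / sigma**3    # score of q wrt sigma
    return trapz(q * s_mu * f, x), trapz(q * s_sd * f, x)
```

Left of the left mode the $\mu$-gradient is negative (descent pulls $\mu$ right, toward the mode); between a mode and the central barrier it is positive (descent pulls back toward the mode), which is exactly the mode-locking behavior.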

Why this doesn't scale. In higher dimensions, numerical integration requires $N^d$ evaluations ($N$ grid points per dimension), and closed-form mixture moments are unavailable for complex targets like empirical data distributions. In maximum likelihood training, $\mathbb{E}_p[\cdot]$ is replaced with a sample average over training data — this is Monte Carlo estimation of the forward KL gradient. Switch to Monte Carlo mode to see how stochastic estimation changes the dynamics.

Forward $\chi^2$: $\nabla_\phi \chi^2(p \| q_\phi) = -\mathbb{E}_p[w(z)\,\nabla_\phi \log q_\phi(z)]$ where $w = p/q_\phi$. Unlike forward KL, no closed-form gradient exists — the extra $w(z)$ factor prevents reduction to simple moments — so it is computed by numerical integration (trapezoidal rule, 400 grid points). Minimizing forward $\chi^2$ is equivalent to minimizing the variance of the importance-sampling (IS) weights.
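The same quadrature works here; a sketch for the $\mu$-component, again with an assumed mixture target:

```python
import numpy as np

# Illustrative mixture target (assumed values) and Gaussian q
wts = np.array([0.5, 0.5]); mus = np.array([-2.0, 2.0]); sds = np.array([0.6, 0.6])

def p_pdf(x):
    c = wts * np.exp(-0.5 * ((x[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    return c.sum(axis=1)

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def forward_chi2_grad_mu(mu, sigma, n=400, lo=-8.0, hi=8.0):
    """d/d(mu) of chi^2(p || q) = -integral of (p^2/q) * (score of q wrt mu)."""
    x = np.linspace(lo, hi, n)
    q = np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2*np.pi))
    s_mu = (x - mu) / sigma**2
    return -trapz((p_pdf(x)**2 / q) * s_mu, x)
```

Note how $p^2/q$ explodes wherever $q$ leaves a mode of $p$ uncovered: the integrand, and hence the gradient, is dominated by exactly those regions.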

Reverse $\chi^2$: $\nabla_\phi \chi^2(q_\phi \| p) = 2\,\mathbb{E}_{q_\phi}[(q_\phi/p)\,\nabla_\phi \log q_\phi(z)]$. Also computed by numerical integration. Like reverse KL, the landscape is nonconvex and mode-seeking, but the penalty for placing mass where $p \approx 0$ is quadratic rather than logarithmic.

Convexity: Like forward KL, forward $\chi^2(p \| q_\phi)$ is convex in the natural parameters — in fact log-convex, which is stronger. This is because $\chi^2 + 1 = \exp(A(\eta) + f(\eta))$ where both $A$ (log-partition) and $f(\eta) = \log \int p(x)^2 e^{-\eta \cdot T(x)} dx$ are convex. Reverse $\chi^2$ is nonconvex in all parameterizations.
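A sketch of that identity, assuming an exponential-family parameterization $q_\eta(x) = \exp(\eta \cdot T(x) - A(\eta))$ (base measure absorbed into $T$ for brevity):

$$
\chi^2(p \,\|\, q_\eta) + 1 \;=\; \int \frac{p(x)^2}{q_\eta(x)}\,dx
\;=\; e^{A(\eta)} \int p(x)^2\, e^{-\eta \cdot T(x)}\,dx
\;=\; \exp\!\big(A(\eta) + f(\eta)\big),
$$

with $f(\eta) = \log \int p(x)^2\, e^{-\eta \cdot T(x)}\,dx$. Both $A$ and $f$ are logarithms of integrals of $e^{\pm\eta \cdot T(x)}$ against fixed positive measures, hence convex in $\eta$ by Hölder's inequality, so $\log(\chi^2 + 1)$ is convex in the natural parameters.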

Monte Carlo mode

Forward KL: $\nabla_\phi \mathrm{KL}(p \| q_\phi) = -\mathbb{E}_p[\nabla_\phi \log q_\phi(x)]$. The expectation is under $p$, which we can evaluate but not sample from, so we use importance sampling from $q_\phi$: sample $z^{(i)} \sim q_\phi$, compute self-normalized weights $\bar{w}_i \propto p(z^{(i)})/q_\phi(z^{(i)})$, and estimate the gradient as a weighted average of score functions.
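A minimal sketch of this self-normalized importance sampling (SNIS) estimator; the mixture parameters and function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative mixture target; only an unnormalized density is needed
wts = np.array([0.5, 0.5]); mus = np.array([-2.0, 2.0]); sds = np.array([0.6, 0.6])

def p_tilde(z):
    c = wts * np.exp(-0.5 * ((z[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    return c.sum(axis=1)

def snis_forward_kl_grad(mu, sigma, K=10_000):
    """SNIS estimate of grad KL(p || q) = -E_p[grad_phi log q], sampling from q."""
    z = mu + sigma * rng.standard_normal(K)                   # z ~ q_phi
    log_q = -0.5 * ((z - mu) / sigma)**2 - np.log(sigma * np.sqrt(2*np.pi))
    log_w = np.log(p_tilde(z)) - log_q
    wbar = np.exp(log_w - log_w.max()); wbar /= wbar.sum()    # self-normalized weights
    s_mu = (z - mu) / sigma**2                                # scores of q
    s_sd = ((z - mu)**2 - sigma**2) / sigma**3
    return -np.sum(wbar * s_mu), -np.sum(wbar * s_sd)
```

With a broad proposal the estimate tracks the closed-form moment gradient; with a narrow, mode-locked proposal the weights carry no information about the uncovered mode.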

Reverse KL: $\nabla_\phi \mathrm{KL}(q_\phi \| p) = -\nabla_\phi \mathbb{E}_{q_\phi}[\log p(x)] + \ldots$. The expectation is under $q_\phi$, which depends on $\phi$, so we need to differentiate through the sampling distribution. REINFORCE uses the log-derivative trick: weight each sample's score function $\nabla_\phi \log q_\phi(z^{(i)})$ by the reward $r_i = \log p(z^{(i)}) - \log q_\phi(z^{(i)})$. Reparameterization writes $z = \mu + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0,1)$ and differentiates through the sample directly, using $\nabla_x \log p(z)$.
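Both estimators, sketched side by side for the $\mu$-gradient under an assumed mixture target (values and names are mine); they estimate the same quantity but with very different variance:

```python
import numpy as np

rng = np.random.default_rng(1)
wts = np.array([0.5, 0.5]); mus = np.array([-2.0, 2.0]); sds = np.array([0.6, 0.6])

def log_p(z):
    c = wts * np.exp(-0.5 * ((z[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    return np.log(c.sum(axis=1))

def dlogp_dz(z):
    """Exact score of the mixture: responsibility-weighted component scores."""
    c = wts * np.exp(-0.5 * ((z[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    resp = c / c.sum(axis=1, keepdims=True)
    return (resp * (-(z[:, None] - mus) / sds**2)).sum(axis=1)

def grad_mu_reinforce(mu, sigma, K=50_000):
    z = mu + sigma * rng.standard_normal(K)
    log_q = -0.5 * ((z - mu) / sigma)**2 - np.log(sigma * np.sqrt(2*np.pi))
    r = log_p(z) - log_q                     # reward
    r = r - r.mean()                         # baseline (valid since E_q[score] = 0)
    return -np.mean(((z - mu) / sigma**2) * r)

def grad_mu_reparam(mu, sigma, K=50_000):
    z = mu + sigma * rng.standard_normal(K)  # z = mu + sigma*eps
    # For Gaussian q, log q(z(eps)) does not depend on mu, so only -log p remains:
    return -np.mean(dlogp_dz(z))
```

The reparameterization estimator uses $\nabla_z \log p$ directly and is typically far lower-variance than REINFORCE, even with a baseline.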

Why MC optimization can get stuck. All MC estimators sample from $q_\phi$, so the gradient can only reflect what $q_\phi$ "sees." If $q_\phi$ doesn't overlap a mode of $p$, the gradient has no signal to move toward it. This is most striking for the forward KL, which is convex — true gradient descent always converges, but MC estimation can get stuck indefinitely. Set $K$ low and drag $q_\phi$ onto one mode to see this. For the reverse KL, mode-locking is a genuine local minimum even with exact gradients, but MC noise can additionally trap the optimizer in places that are not true local minima.

Forward $\chi^2$: Like forward KL, uses importance sampling from $q_\phi$, but the per-particle weight is $\propto \bar{w}_i^2$ (squared IS weights) rather than $\bar{w}_i$. This concentrates the gradient on the highest-weight particles — the effective sample size $\mathrm{ESS} = 1/\sum_i \bar{w}_i^2$ directly quantifies how many particles contribute.
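A sketch of the squared-weight estimator together with the ESS diagnostic, under the same assumed mixture target:

```python
import numpy as np

rng = np.random.default_rng(2)
wts = np.array([0.5, 0.5]); mus = np.array([-2.0, 2.0]); sds = np.array([0.6, 0.6])

def p_tilde(z):
    c = wts * np.exp(-0.5 * ((z[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    return c.sum(axis=1)

def chi2_grad_mu_and_ess(mu, sigma, K=10_000):
    """Forward-chi^2 gradient wrt mu via squared SNIS weights, plus the ESS."""
    z = mu + sigma * rng.standard_normal(K)
    log_q = -0.5 * ((z - mu) / sigma)**2 - np.log(sigma * np.sqrt(2*np.pi))
    log_w = np.log(p_tilde(z)) - log_q
    wbar = np.exp(log_w - log_w.max()); wbar /= wbar.sum()
    ess = 1.0 / np.sum(wbar**2)              # effective sample size in [1, K]
    w2 = wbar**2 / np.sum(wbar**2)           # normalized squared weights
    g_mu = -np.sum(w2 * (z - mu) / sigma**2)
    return g_mu, ess
```

Because the per-particle weight is $\bar{w}_i^2$ renormalized, the gradient is concentrated on even fewer particles than the SNIS forward-KL estimate drawn from the same samples.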

Reverse $\chi^2$: Sample $z^{(i)} \sim q_\phi$ and weight each score by $q_\phi(z^{(i)})/p(z^{(i)}) = e^{-r_i}$, so the descent update carries per-sample weight $-e^{-r_i}$. Unlike REINFORCE, these update weights are always negative: the gradient only pushes $q_\phi$ away from regions where it overestimates $p$, never directly pulling toward high-$p$ modes. No REINFORCE or reparameterization trick is needed — the pathwise and REINFORCE terms are equal, so backpropagating through the density $q_\phi(z)$ with gradients stopped through the samples recovers the gradient up to an overall factor of two.
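A sketch of this estimator for the $\mu$-component, with the same assumed mixture target as above:

```python
import numpy as np

rng = np.random.default_rng(3)
wts = np.array([0.5, 0.5]); mus = np.array([-2.0, 2.0]); sds = np.array([0.6, 0.6])

def log_p(z):
    c = wts * np.exp(-0.5 * ((z[:, None] - mus) / sds)**2) / (sds * np.sqrt(2*np.pi))
    return np.log(c.sum(axis=1))

def reverse_chi2_grad_mu(mu, sigma, K=50_000):
    """grad wrt mu of chi^2(q || p) = 2 E_q[(q/p) * score]. A descent step
    applies the negative, i.e. per-sample weights -exp(-r_i)."""
    z = mu + sigma * rng.standard_normal(K)
    log_q = -0.5 * ((z - mu) / sigma)**2 - np.log(sigma * np.sqrt(2*np.pi))
    r = log_p(z) - log_q                     # same "reward" as REINFORCE
    s_mu = (z - mu) / sigma**2
    return 2.0 * np.mean(np.exp(-r) * s_mu)
```

With $q_\phi$ sitting left of the nearest mode, the samples in $q$'s left tail have the largest $q/p$, so the descent step pushes $\mu$ right, away from the overestimated region.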

Why MC optimization can get stuck. The same issue as KL: MC gradients are blind to modes $q_\phi$ doesn't overlap. For forward $\chi^2$, the squared weighting makes this worse — even fewer particles contribute. For reverse $\chi^2$, the "push only" dynamics mean $q_\phi$ can only learn to flee bad regions, never actively seek good ones.