Temperature scaling

What modifying temperature does to a distribution.

Let \(p\) be a distribution over a finite set of elements, with \(p_i\) the probability of element \(i\). For a temperature \(T\in\mathbb{R}\setminus\{0\}\) (typically \(T>0\)), define the temperature-scaled distribution \(p^{(T)}\) as

\[ p^{(T)}_i \;\triangleq\; \frac{p_i^{1/T}}{Z_T}, \qquad Z_T = \sum_j p_j^{1/T}. \]

Equivalently, \(p^{(T)}\) is the softmax of the linearly scaled log-probabilities,

\[ p^{(T)} = \operatorname{softmax}\!\big(\tfrac{1}{T} \log p\big), \qquad \text{where}\quad \operatorname{softmax}(x)_i \;\triangleq\; \frac{e^{x_i}}{\sum_j e^{x_j}}. \]
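The two forms can be compared numerically. Below is a minimal sketch in plain Python; the function names `temper_direct` and `temper_softmax` and the example distribution are made up for illustration, and the softmax route assumes every \(p_i > 0\) so that the logarithm is finite.

```python
import math

def temper_direct(p, T):
    """Raise each probability to the power 1/T, then renormalize by Z_T."""
    weights = [pi ** (1.0 / T) for pi in p]
    Z = sum(weights)  # the normalizer Z_T
    return [w / Z for w in weights]

def softmax(x):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

def temper_softmax(p, T):
    """Same distribution via softmax(log(p) / T); assumes all p_i > 0."""
    return softmax([math.log(pi) / T for pi in p])

p = [0.5, 0.3, 0.2]          # illustrative base distribution
a = temper_direct(p, 0.5)    # T = 0.5, i.e. 1/T = 2
b = temper_softmax(p, 0.5)
print(a)
print(b)
```

Up to floating-point rounding, both calls return the same list. The softmax route is usually the safer one when probabilities are very small, since it works with log-probabilities throughout.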

It’s convenient to parametrize with the inverse temperature parameter \(\beta \triangleq 1/T\).
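Written with \(\beta\), the definition and its log-space form can be spelled out as follows. This only restates the formulas above, with logs taken on the support of \(p\):

\[ p^{(T)}_i = \frac{p_i^{\beta}}{Z_T}, \qquad \log p^{(T)}_i = \beta \log p_i - \log Z_T, \qquad Z_T = \sum_j p_j^{\beta}. \]

So in log space, tempering multiplies every log-probability by \(\beta\) and then subtracts the same constant \(\log Z_T\) from each.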

At \(T = \beta = 1\), \(p^{(T)} = p\). As \(T \to 0^+\) (\(\beta \to \infty\)), mass concentrates on the argmax of \(p\), split evenly among ties; as \(T \to \infty\) (\(\beta \to 0\)), \(p^{(T)}\) flattens toward uniform over the support of \(p\).
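These limits can be checked directly. Here is a short derivation, writing \(p_* = \max_j p_j\) and \(k\) for the number of elements attaining it (both symbols are introduced here for illustration), and taking \(\beta > 0\):

\[ p^{(T)}_i = \frac{(p_i/p_*)^{\beta}}{\sum_j (p_j/p_*)^{\beta}} \;\xrightarrow[\beta\to\infty]{}\; \begin{cases} 1/k & \text{if } p_i = p_*, \\ 0 & \text{otherwise,} \end{cases} \]

because each ratio \(p_j/p_* < 1\) is driven to 0 while the \(k\) ratios equal to 1 stay put. In the other direction, as \(\beta \to 0^+\) every ratio with \(p_j > 0\) tends to 1 while zero-probability elements stay at 0, so

\[ p^{(T)}_i \;\xrightarrow[\beta\to 0^+]{}\; \frac{1}{|\{j : p_j > 0\}|} \quad \text{for } p_i > 0. \]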

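[Interactive figure: controls for \(T\) and \(\beta\) (the \(\beta\) scale is marked 0, 1, \(\infty\)), an editable base distribution \(p\), and a top-\(p\) threshold. Defaults: \(T = \beta = 1\), top-\(p\) = 1.]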

Mathematically, nothing stops us from taking \(T<0\): it simply reverses the ranking, so that the least likely elements in the support of \(p\) become the most likely.
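A small numerical check of the negative-temperature case, as a sketch only. The helper name `temper` and the example values are hypothetical, and keeping zero-probability elements at zero is an assumption made here, since \(p_i^{1/T}\) is undefined for \(p_i = 0\) when \(T < 0\).

```python
def temper(p, T):
    """Temperature-scale p for any nonzero T.

    Assumption: elements with p_i == 0 are kept at zero, so only the
    support of p is rescaled (0 ** negative would otherwise raise).
    """
    weights = [pi ** (1.0 / T) if pi > 0 else 0.0 for pi in p]
    Z = sum(weights)
    return [w / Z for w in weights]

p = [0.5, 0.3, 0.2]     # illustrative base distribution, sorted descending
q = temper(p, -1.0)     # beta = -1, so each weight is 1 / p_i
print(q)                # the ordering of the entries is reversed relative to p
```

With \(\beta = -1\) each weight is the reciprocal of the original probability, so the smallest \(p_i\) ends up with the largest share.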

Top row: the base distribution \(p\) (drag bars to edit) and the tempered \(p^{(T)}\). Bottom row: the same two distributions in log space, where temperature scaling is linear: bars are scaled by \(\beta\), then shifted by the common offset \(-\log Z_T\) (in the right panel, ticks mark the pre-normalization values \(\beta \log p_i\); the dotted segments are the shift). Dashed line: the uniform distribution, i.e. the \(\beta\to 0\) limit. The top-\(p\) slider truncates the tempered distribution to its nucleus (the smallest set of highest-probability elements whose mass reaches the threshold, with ties broken by index) and renormalizes.
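The nucleus truncation described in the caption can be sketched as below. This is an illustrative reading, not the figure's actual code: the function name `top_p_truncate` is hypothetical, and breaking ties in favor of the lower index is an assumption about the direction of the tie-break.

```python
def top_p_truncate(q, top_p):
    """Keep the smallest set of highest-probability elements whose total
    mass reaches top_p, zero out the rest, and renormalize."""
    # Sort indices by descending probability; the index is the secondary
    # key, so ties go to the lower index (assumed tie-break direction).
    order = sorted(range(len(q)), key=lambda i: (-q[i], i))
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += q[i]
        if mass >= top_p:   # nucleus is complete once the threshold is reached
            break
    out = [0.0] * len(q)
    for i in kept:
        out[i] = q[i] / mass  # renormalize over the kept elements
    return out

q = [0.5, 0.3, 0.2]                            # illustrative tempered distribution
nucleus = top_p_truncate(q, 0.7)               # keeps the first two elements
tied = top_p_truncate([0.4, 0.4, 0.2], 0.3)    # tie resolved toward index 0
print(nucleus, tied)
```

At a threshold of 1 every element is kept and the distribution is returned unchanged up to rounding, which matches the figure's default setting.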