Exploring some standard resampling schemes.⊕I drafted this post back in October 2023, with visualizations unfinished. It’s structured pedagogically as a standalone introduction aimed at my own understanding. Now (2026) I’ve had Claude reimplement and vastly improve the visualizations with interactivity, and thought the result was was worth sharing …with a vibe-coding disclaimer: I have not checked the code for simulations, which is all new. But even with that caveat, the illustrations should serve their purpose better than what I had originally.

I’ve been thinking recently about the way in which you do resampling in sequential Monte Carlo (SMC). Like many other things, while there are many asymptotically identical methods, in practice it matters which one you choose. In this post, I’m making some visualizations to explore some standard resampling schemes, and get intuitions about why they work, and why you might choose one over another.

1 Why to resample

SMC relies on importance sampling. So let’s start quickly recapping that, and defining some notation. We approximate a target distribution $\target(\cdot)$ with a set of samples from some surrogate proposal distribution, by assigning each sample an importance weight $\impwt^\idx$ (proportional to the density ratio of target to proposal). Then the set of weighted particles $(\state^\idx,\impwt^\idx)_{\idx=1}^\np$ defines an empirical approximation to the current target $\target(\cdot)$:

\[\widehat{\target}(\cdot) \triangleq \sum_{\idx=1}^\np \normwt^\idx \delta_{\state^\idx}(\cdot)\]

where $\normwt^\idx = \impwt^\idx / \sum_j \impwt^j$ are the normalized weights.Throughout the rest of this post, “weights” and weights in code refer to the normalized weights $\normwt^\idx$.

For any test function $f$, taking a weighted average of the function applied over the particles gives an estimator of its expectation under the target: $\widehat{\E_{\target}[f]}\triangleq\E_{\widehat{\target}}[f]=\sum_\idx \normwt^\idx f(\state^\idx)$.

In sequential importance sampling (SIS), we apply importance sampling sequentially, to approximate a sequence of target distributions by maintaining an evolving population of particles, iteratively propagating and reweighting. One main issue with SIS is that it can suffer from weight degeneracy: When the weights become concentrated and the budget of $\np$ particles effectively behaves as if it were just one sample, defeating the purpose of having multiple particles.

SISSMC
Particle weight evolution.
Bars show normalized weights $\normwt_t^\idx$ at each step $t$ in a bootstrap particle filter.Illustrated for a Gaussian random walk model: $\state_t \sim \mathcal{N}(\state_{t-1}, 1)$; $y_t \sim \mathcal{N}(\state_t, 0.25)$; $y_t{=}2$.

Click a particle to trace its lineage, or click a timestep label to see all ancestors.

This weight degeneracy issue is one key motivation for SMC, which generalizes from SIS by the addition of resampling steps. Resampling replaces a current set of particles (potentially with degenerate weights) with a new set of samples with weights all set to be equal, duplicating high-weight particles and dropping low-weight ones. This addresses weight degeneracy, but introduces a different problem: path degeneracy.I’m getting this terminology for the two kinds of degeneracy from the excellent Naesseth et al. (2019, Chapter 2). They note that adaptive resampling can be a partial mitigation for path degeneracy. Only resample when the effective sample size $\text{ESS} = 1/\sum_\idx (\normwt^\idx)^2$ (which ranges from 1 when one particle has all the weight to $\np$ when weights are uniform) drops below a threshold (e.g., $\np/2$), rather than at every step. In this post I want to focus just on what happens when we do resample, rather than on when to resample, but the ESS values in the illustration can give a sense of when resampling would be triggered in an adaptive resampling method.

After enough steps, all current particles may trace back to a single ancestor at earlier timesteps [click on a timestep label in the illustration to see ancestry].

Adding resampling can keep our particle approximation useful (fixing SIS’s weight degeneracy problem), but the method we use to resample can affect both the variance of our estimates and how quickly path diversity is lost. The rest of this post explores several resampling methods, visualizes their differences, to help build intuition for why some would perform better than others.

1.1 The first-moment (unbiasedness) condition

For any resampling method to be unbiased, the expected number of copies made of each particle must be proportional to its weight. That is, $\cnt^\idx\sim r$ be the number of copies of particle $\idx$ after resampling (writing $r$ for the resampling method, some distribution over nonnegative integers, conditioned on the current set of weighted particles)

\[\E_{r}[\cnt^\idx] = \np \normwt^\idx \quad \text{for all } \idx\]

This ensures that the equally-weighted resampled particles $\sum_{\idx=1}^{\np} \frac{1}{\np} f(\rstate^\idx)$ form an unbiased estimator of the original weighted sum $\sum_\idx \normwt^\idx f(\state^\idx)$.

2 Some standard methods

We’ll focus on visualizing and understanding four canonical methods: multinomial, stratified, systematic, and residual resampling. These methods all satisfy this ‘first-moment condition.’ However, higher moments can differ greatly between schemes (as we’ll see, looking at variance).

2.1 Inverse transform sampling

The first methods we’ll look at share the same core idea of inverse transform sampling.A little while ago, I made another post exploring density transformations in the continuous case. Here we are doing this in a discrete setting.

We can think of partitioning the unit interval into segments of width $\normwt^\idx$, and mapping from ‘probe’ locations on the unit interval to determine the resampled particles. More precisely, the cumulative distribution function (CDF) of the discrete distribution defined by the weights, $\cdf(\idx) = \sum_{j=1}^{\idx} \normwt^j$, defines this partition, and its inverse (the quantile function) it will map a value $\probe\in(0,1]$ to a selected (resampled) particle $\invcdf(\probe)$.⊕If you haven’t heard of inverse transform sampling before, this is actually a nice way to build the intuition. Think about how to randomly place $\np$ probes so that any particle $i$ gets selected $\np\normwt^\idx$ times on average.

Try clicking on the CDF plot at right to place probes, and consider what strategy would you use for placing $\np$ probes, in order to have the distribution of resampled particles recreate the weight histogram on average.

Drag bars on left to change weights, or choose from one of the preset options in the “Weights” dropdown below.
Click on plot to place ‘probes.’

2.2 Multinomial, stratified, and systematic resampling

You may have found that the most natural idea is to use $\np$ independent draws from the uniform distribution. This works, and leads to the first standard algorithm, multinomial resampling. We will then look at two other common strategies that reduce variance by spreading probes more evenly for lower variance.

2.2.1 Multinomial resampling

Draw $\np$ independent uniforms $\probe_1, \ldots, \probe_\np \overset{\text{iid}}{\sim} \mathrm{Uniform}(0,1)$ and map each through the inverse CDF. The counts $(\cnt^1, \ldots, \cnt^\np)$ follow a multinomial distribution.

def multinomial_resample(weights, N):
    cumulative_sum = np.cumsum(weights) # Step 1: inverse CDF
    positions = random(N) # Step 2: N independent uniform probes
    return np.searchsorted(cumulative_sum, positions) # Step 3: probes → indices

[Note, the one-liner indices = np.random.choice(N, size=N, replace=True, p=weights) would be functionally equivalent and looks simpler, but obscures the similarity to the other methods that we’ll look at next.]

Click Resample once to see the counts fluctuate; Run $K$ trials to see means settle toward the weights.

$K$: 1000

Unbiasedness Each probe $\probe_k$ is independently $\mathrm{Uniform}(0,1)$, so it lands in particle $\idx$’s CDF segment (of width $\normwt^\idx$) with probability $\normwt^\idx$. With $\np$ independent probes, $\cnt^\idx \sim \mathrm{Binomial}(\np, \normwt^\idx)$ and $\E[\cnt^\idx] = \np\,\normwt^\idx.$

2.2.2 Stratified resampling

You may have noticed that because the probes are all independently sampled in multinomial resampling, they can cluster and leave gaps. This leads to undesirable extra variance (consider in particulr the case where the weights are roughly uniform).

Stratified resampling introduces a technique to spread them more evenly: Partition $[0,1]$ into $\np$ equal strata $\bigl(\frac{k-1}{\np},\, \frac{k}{\np}\bigr)_{k=1}^{\np}$

def stratified_resample(weights, N):
    cumulative_sum = np.cumsum(weights) # Step 1 (same as above)
    positions = (random(N) + range(N)) / N # Step 2: one uniform per stratum
    return np.searchsorted(cumulative_sum, positions) # Step 3 (same as above)

Each stratum gets exactly one probe, so we can bound the counts $\lfloor \np\normwt^\idx \rfloor \leq \cnt^\idx \leq \lceil \np\normwt^\idx \rceil$.

$K$: 1000

Unbiasedness Write $\mathbf{1}_k^\idx$ for the indicator that stratum $k$’s probe lands in particle $\idx$’s segment. Within stratum $k$, the probe is $\mathrm{Uniform}\bigl(\frac{k-1}{\np},\, \frac{k}{\np}\bigr)$, so $\E[\mathbf{1}_k^\idx]$ equals the overlap between stratum $k$ and particle $\idx$’s CDF interval, scaled by $\np$. Summing over all strata: $\E[\cnt^\idx] = \sum_{k=1}^{\np} \E[\mathbf{1}_k^\idx] = \np\,\normwt^\idx$, since the strata tile $[0,1]$ and particle $\idx$’s segment has total length $\normwt^\idx.$

2.2.3 Systematic resampling

Notice that stratified resampling still uses $\np$ random draws to place the probe in each stratum. Another standard technique called systematic resampling uses just one random value, and places all the probes based on that single offset: Draw $U \sim \mathrm{Uniform}(0, 1/\np)$ and set $\probe_k = U + (k{-}1)/\np$.

The probes form an equally-spaced comb. Drag the green handle to slide it, or click Random offset to sample.

def systematic_resample(weights, N):
    cumulative_sum = np.cumsum(weights) # Step 1 (same as above)
    positions = (random() + np.arange(N)) / N # Step 2: one offset for all N probes
    return np.searchsorted(cumulative_sum, positions) # Step 3 (same as above)

$K$: 1000

Unbiasedness The offset $U$ is $\mathrm{Uniform}(0, 1/\np)$, so each probe $\probe_k = U + (k{-}1)/\np$ is marginally $\mathrm{Uniform}\bigl(\frac{k-1}{\np},\, \frac{k}{\np}\bigr)$, the same distribution as the stratified probe in stratum $k$. The same tiling argument gives $\E[\cnt^\idx] = \np\,\normwt^\idx.$

Note that the probes are no longer independent. A single $U$ determines all of them. The marginal distributions are identical to stratified, so unbiasedness holds, but the joint distribution differs (and so also the variance).

2.2.4 ⚠︎ Systematic can be higher variance than multinomial

It might seem like systematic resampling will always be at least as good as multinomial in terms of variance. And indeed Douc et al. (2005, Section 3.4) describe the “frequently encountered conjecture that systematic resampling dominates multinomial resampling in terms of conditional variance.” But, the conjecture is false, and they give the following counterexample.

Consider a case where weights are a ‘comb’ alternating large, small, large, small, and the test function $f(\state^\idx) = \mathbf{1}[\idx \text{ even}]$.⊕To see this in action, set weights to the “Alternating” preset, and $f$ to $\mathbf{1}[\idx \text{ even}]$, and look at the variance in the head to head comparison in § 2.4. Try switching to other test functions (e.g., mean position) to see the effect vanish when $f$ is not aligned with the weight periodicity.

In this case, systematic resampling leads to samples that are all on the ‘large’ weights, or split uniformly, producing a bimodal estimator. The systematic variance is $(\normwt_{\mathrm{even}}-\tfrac{1}{2})(1-\normwt_{\mathrm{even}})$, constant in $\np$, while multinomial’s decreases as $\normwt_{\mathrm{even}}(1 - \normwt_{\mathrm{even}})/\np$.

Note. This counterexample depends on the particle ordering.Douc et al. (2005, Section 3.4) observe that both stratified and systematic resampling are sensitive to particle ordering: permuting the indices before resampling changes the distribution of the resampled set. After random permutation, systematic resampling becomes empirically similar to residual/stratified. They conclude that the counterexample is likely a “rare” situation, but that it shows systematic resampling is not as robust a variance-reduction method as stratified or residual, and that its theoretical analysis is considerably harder.

2.3 Residual resampling

We could also consider a deterministic-stochastic hybrid, where some particles are deterministically set, and others are allocated randomly. In residual resampling, we give particle $\idx$ exactly $\lfloor \np\normwt^\idx \rfloor$ copies, then fill the remaining $R = \np - \sum_\idx \lfloor \np\normwt^\idx \rfloor$ slots by resampling on the residual weights $\resid^\idx = (\np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor)/R$. This nondeterministic part of the algorithm could be done using any of the three CDF methods (select below).⊕$\Var_{\text{resid}} \leq \Var_{\text{mult}}$ when phase 2 uses multinomial or stratified resampling (Douc et al., 2005). The deterministic phase removes variance for the integer parts; the phase-2 choice affects only the residual variance. Note that residual-systematic does not have this guarantee, since systematic resampling can hit the same counterexample on the residual weights.

The right plot shows the residual CDF (solid) overlaid on the original weights CDF (dotted). For highly skewed weights, you can see that first allocating the deterministic weights makes the residual CDF much more even than the original was.

# Phase 1 (deterministic): guaranteed copies from the integer part
num_copies = np.floor(N * weights) # ⌊Nwⁱ⌋
R = N - sum(num_copies) # remaining slots

# Phase 2 (stochastic): resample on the fractional remainders
residual = weights * N - num_copies
residual /= sum(residual) # normalize residuals
indexes[k:N] = resample_residuals(residual, R)

where the phase-2 subroutine resample_residuals is any of the three CDF-based methods above (choose with the dropdown):

Phase 2 method:

resample_residuals = multinomial_resample

resample_residuals = stratified_resample

resample_residuals = systematic_resample

$K$: 1000

Unbiasedness Phase 1 gives particle $\idx$ exactly $\lfloor \np\normwt^\idx \rfloor$ copies. Phase 2 resamples $R = \np - \sum_j \lfloor \np\normwt^j \rfloor$ particles using normalized residual weights $\resid^\idx = (\np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor)/R$. By the unbiasedness of whichever CDF method is used for phase 2, the expected number of phase-2 copies of particle $\idx$ is $R \cdot \resid^\idx = \np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor$. Adding the two phases: $\E[\cnt^\idx] = \lfloor \np\normwt^\idx \rfloor + (\np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor) = \np\,\normwt^\idx.$

2.4 Comparison

Comparing all four methods on the same weights: Colored error bars show mean count $\pm$ 1 std over $K$ trials.Click the “set to counterexample” button to set weights and test function $f$ to the pathological values in order to show the counterexample described in § 2.2.4 above, where variance of systematic resampling is larger than multinomial.

$K$: 1000

Estimator distributions for $f(\state^\idx) =$ (dashed = true value)

Estimator std: Multinomial Stratified Systematic Residual

	Multinomial	Stratified	Systematic	Residual
$\Var \leq \Var_{\text{mult}}$?	baseline	always	not guaranteed	mult/strat phase 2; not syst
$\\|\cnt^\idx - \np\normwt^\idx\\|$ bound	up to $\np$	$\leq 1$	$\leq 1$	phase-2 dependent
Random draws	$\np$	$\np$	1	$R \leq \np$

2.5 Other resampling schemes

The four methods above are the most widely used, but they are not the only options.

2.5.1 Branch-kill resampling

The methods above all produce exactly $\np$ resampled particles. “Branch-kill” resampling relaxes that requirement: Each particle $\idx$ independently gets $\lfloor \np\normwt^\idx \rfloor$ deterministic copies plus one bonus copy with probability $\np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor$. No shared CDF or residual phase needed. The total $\sum_\idx \cnt^\idx$ fluctuates around $\np$ rather than equalling it exactly.

# For each particle independently:
num_copies = np.floor(N * weights)
p_bonus = N * weights - num_copies # fractional part
u = np.random.rand(N) # one uniform draw per particle
bonus = (u >= 1 - p_bonus) # inverse CDF: right of step → bonus
num_copies += bonus # total may differ from N

Unbiasedness Particle $i$ receives $\lfloor \np\normwt^\idx \rfloor$ deterministic copies plus one bonus copy with probability $\np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor$. So $\E[\cnt^\idx] = \lfloor \np\normwt^\idx \rfloor + (\np\normwt^\idx - \lfloor \np\normwt^\idx \rfloor) = \np\,\normwt^\idx.$

$K$: 1000

2.5.2 Even more methods…

Rounding-copy resampling. Like branch-kill, but fully deterministic: each particle gets $\mathrm{round}(\np\normwt^\idx)$ copies. This uses zero random draws, but sacrifices the strict unbiasedness condition (the bias per particle is at most $0.5/\np$, vanishing as $\np$ grows).
See Li et al. (2015) for this and a few more.

3 Resampling methods in action

To see how the choice of resampling method affects particle diversity in practice, here is the same random walk model from the introduction, now with a method selector.

Model details

State-space model (Gaussian random walk): Transition $\state_t \mid \state_{t-1} \sim \mathcal{N}(\state_{t-1},\, 1)$. Observation $y_t \mid \state_t \sim \mathcal{N}(\state_t,\, 0.25)$. All observations fixed at $y_t = 2$.

Bootstrap particle filter: Proposal $= $ transition (prior). Weight update $\normwt_t^\idx \propto p(y_t \mid \state_t^\idx)$ (after resampling resets weights to $1/\np$).

Resampling method: Fix seed:

3.1 Comparing methods over many runs

To see the typical behavior rather than a single random run, we can see what happens over many runs of the same method. The plot below shows path degeneracy (unique ancestors at t=1) over $K$ runs for all four methods. You can also the number of particles $\np$ (which we’ve fixed at $8$ for everything above).

We should see multinomial show worse path degeneracy than the lower-variance alternatives.

$\np$: 50 $K$: 50

References

Douc, R., Cappé, O., & Moulines, E. (2005). Comparison of resampling schemes for particle filtering. ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 64–69. https://doi.org/10.1109/ISPA.2005.195385

Li, T., Bolic, M., & Djuric, P. M. (2015). Resampling methods for particle filtering: Classification, implementation, and strategies. IEEE Signal Processing Magazine, 32(3), 70–86. https://doi.org/10.1109/MSP.2014.2330626

Naesseth, C. A., Lindsten, F., & Schön, T. B. (2019). Elements of sequential Monte Carlo. Foundations and Trends in Machine Learning, 12(3), 307–392. https://doi.org/10.1561/2200000074

	Multinomial	Stratified	Systematic	Residual
\(\Var \leq \Var_{\text{mult}}\)?	baseline	always	not guaranteed	mult/strat phase 2; not syst
\(\\|\cnt^\idx - \np\normwt^\idx\\|\) bound	up to \(\np\)	\(\leq 1\)	\(\leq 1\)	phase-2 dependent
Random draws	\(\np\)	\(\np\)	1	\(R \leq \np\)