# Algorithms for sampling without replacement

In this notebook, we'll describe, implement, and test some simple and efficient strategies for sampling without replacement from a categorical distribution.

Given a set of items indexed by $1, \ldots, n$ and weights $w_1, \ldots, w_n$, we want to sample $0 < k \le n$ elements without replacement from the set.

## Theory

The probability of the sampling without replacement scheme can be computed analytically. Let $z$ be an ordered sample without replacement from the indices $\{1, \ldots, n\}$ of size $0 < k \le n$. Borrowing Python notation, let $z_{:t}$ denote the indices up to, but not including, $t$. The probability of $z$ is $$\mathrm{Pr}(z) = \prod_{t=1}^{k} p(z_t \mid z_{:t}) \quad\text{ where }\quad p(z_t \mid z_{:t}) = \frac{ w_{z_t} }{ W_t(z) } \quad\text{ and }\quad W_t(z) = \sum_{i=t}^n w_{z_i}$$

Note that $w_{z_t}$ is the weight of the $t^{\text{th}}$ item sampled in $z$ and $W_t(z)$ is the normalizing constant at time $t$, i.e., the total weight of the items that have not yet been sampled (for $k < n$, read $z$ as the length-$k$ prefix of a full permutation of the indices).

This probability is evaluated by p_perm (below), and it can be used to test that $z$ is sampled according to the correct sampling without replacement process.

In :
import numpy as np

def p_perm(w, z):
    "The probability of a permutation z under the sampling without replacement scheme."
    n = len(w); k = len(z)
    assert 0 < k <= n
    wz = w[np.array(z, dtype=int)]
    W = wz[::-1].cumsum()[::-1]   # W[t] = total weight of the items sampled at times >= t
    return np.prod(wz / W)
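
For example, with $w = (0.5, 0.3, 0.2)$ and the ordered sample $z = (1, 2, 3)$, the formula gives $\frac{0.5}{1.0} \cdot \frac{0.3}{0.5} \cdot \frac{0.2}{0.2} = 0.3$. A quick check (indices are 0-based in code):

w = np.array([0.5, 0.3, 0.2])
np.testing.assert_allclose(p_perm(w, [0, 1, 2]), 0.3)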


## Algorithms

### Baseline: The numpy implementation

In :
def swor_numpy(w, R):
    n = len(w)
    p = w / w.sum()     # must normalize w first, unlike the Gumbel version
    U = list(range(n))
    return np.array([np.random.choice(U, size=n, p=p, replace=False)
                     for _ in range(R)])


### Heap-based sampling

Using heap sampling, we can do the computation in $\mathcal{O}(n + k \log n)$ time. It's possible that shrinking the heap as items are removed, rather than leaving it at size $n$, could yield a further improvement. The implementation that I am using is from my Python arsenal.

In :
from arsenal.maths.sumheap import SumHeap

def swor_heap(w, R):
    n = len(w)
    z = np.zeros((R, n), dtype=int)
    for r in range(R):
        z[r] = SumHeap(w).swor(n)
    return z
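
To make the heap approach concrete, here's a rough, unvectorized sketch of how a binary sum-heap sampler can work. It is just an illustration of the idea; it is not the SumHeap implementation from arsenal, and the helper names (sumheap, heap_sample, heap_update, swor_heap_sketch) are specific to this sketch.

def sumheap(w):
    "Build a binary sum-heap: leaves hold the weights, internal nodes hold subtree sums."
    n = len(w)
    d = int(np.ceil(np.log2(n)))      # depth of the tree
    m = 2 ** (d + 1)                  # number of heap slots (1-indexed)
    S = np.zeros(m)
    S[m // 2: m // 2 + n] = w         # leaves start at slot m // 2
    for i in reversed(range(1, m // 2)):
        S[i] = S[2 * i] + S[2 * i + 1]
    return S

def heap_sample(S, u):
    "Draw one index with probability proportional to the leaf weights."
    i = 1
    u = u * S[1]                      # scale the uniform by the total weight
    while 2 * i < len(S):             # walk down until we reach a leaf
        left = 2 * i
        if u < S[left]:
            i = left
        else:
            u -= S[left]
            i = left + 1
    return i - len(S) // 2            # convert the heap slot back to an item index

def heap_update(S, k, new_w):
    "Overwrite item k's weight and repair the sums on the leaf-to-root path in O(log n)."
    i = k + len(S) // 2
    delta = new_w - S[i]
    while i >= 1:
        S[i] += delta
        i //= 2

def swor_heap_sketch(w, k):
    "Sample k items without replacement in O(n + k log n) time."
    S = sumheap(w)
    z = np.zeros(k, dtype=int)
    for t in range(k):
        z[t] = heap_sample(S, np.random.uniform())
        heap_update(S, z[t], 0.0)     # zero out the sampled item so it can't repeat
    return z

The sketch simply zeroes out sampled items; actually shrinking the heap, as suggested above, would be a further refinement.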


### The Gumbel-sort trick

The running time of the Gumbel version is $\mathcal{O}(n + n \log k)$, assuming we use a bounded heap of size $k$. The implementation below does not include the bounded-heap optimization, and my experiments use $k=n$.

In :
def swor_gumbel(w, R):
    n = len(w)
    G = np.random.gumbel(0, 1, size=(R, n))
    G += np.log(w)
    G *= -1
    return np.argsort(G, axis=1)
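
For reference, here is a minimal sketch of the bounded-heap variant mentioned above. It keeps only the $k$ largest Gumbel keys using heapq.nlargest, which maintains a heap of size $k$ internally; swor_gumbel_topk is just a name for this sketch (it is not used in the experiments below), and it draws a single ordered sample rather than $R$ of them.

import heapq

def swor_gumbel_topk(w, k):
    "Ordered sample of size k via the k largest Gumbel-perturbed log-weights."
    n = len(w)
    G = np.log(w) + np.random.gumbel(0, 1, size=n)
    # heapq.nlargest keeps a bounded heap of size k, so this runs in
    # O(n log k) rather than the O(n log n) of a full argsort.
    return np.array(heapq.nlargest(k, range(n), key=G.__getitem__))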


### Exponential-sort trick

This is Efraimidis and Spirakis (2006)'s algorithm, modified slightly to use Exponential random variates for aesthetic reasons. The Gumbel-sort and Exponential-sort algorithms are tightly connected, as I discussed in a 2014 article, and the connection shows in how similar the code for the two methods is.

In :
def swor_exp(w, R):
    n = len(w)
    E = -np.log(np.random.uniform(0, 1, size=(R, n)))
    E /= w
    return np.argsort(E, axis=1)


## Test cases

### Correctness

In :
import numpy as np, pylab as pl
from numpy.random import uniform
from arsenal.maths import compare, random_dist

R = 50_000
v = random_dist(4)

methods = [
    swor_numpy,
    swor_gumbel,
    swor_heap,
    swor_exp,
]

S = {f.__name__: f(v, R) for f in methods}

In :
from collections import Counter
from arsenal.maths.combinatorics import permute

def counts(S):
    "Empirical distribution over z."
    c = Counter()
    m = len(S)
    for s in S:
        c[tuple(s)] += 1 / m
    return c

D = {name: counts(S[name]) for name in S}

R = {}
n = len(v)
for z in permute(range(n)):
    R[z] = p_perm(v, z)
    for d in D.values():
        d[z] += 0   # make sure every permutation appears in each empirical distribution

# Check that p_perm sums to one.
np.testing.assert_allclose(sum(R.values()), 1)

In :
for name, d in sorted(D.items()):
    compare(R, d).show(title=name);

Comparison: n=24
norms: [0.336428, 0.337442]
zero F1: 1
pearson: 0.999762
spearman: 0.99913
Linf: 0.00428132
same-sign: 100.0% (24/24)
max rel err: 0.105585
regression: [0.995 0.000]

Comparison: n=24
norms: [0.336428, 0.337414]
zero F1: 1
pearson: 0.999894
spearman: 0.998261
Linf: 0.0025007
same-sign: 100.0% (24/24)
max rel err: 0.118721
regression: [0.995 0.000]

Comparison: n=24
norms: [0.336428, 0.336196]
zero F1: 1
pearson: 0.999919
spearman: 0.997391
Linf: 0.00188318
same-sign: 100.0% (24/24)
max rel err: 0.118791
regression: [1.001 -0.000]

Comparison: n=24
norms: [0.336428, 0.336499]
zero F1: 1
pearson: 0.999856
spearman: 0.998261
Linf: 0.00253601
same-sign: 100.0% (24/24)
max rel err: 0.126029
regression: [1.000 0.000]

### Efficiency

In :
from arsenal.timer import timers

T = timers()
R = 50
for i in range(1, 15):
    n = 2**i
    #print('n=', n, 'i=', i)
    for _ in range(R):
        v = random_dist(n)
        np.random.shuffle(methods)   # randomize the timing order to avoid systematic bias
        for f in methods:
            name = f.__name__
            with T[name](n=n):
                S = f(v, R=1)
            assert S.shape == (1, n)   # sanity check on the output shape
print('done')

done

In :
fig, ax = pl.subplots(ncols=2, figsize=(12, 5))
T.plot_feature('n', ax=ax)
fig.tight_layout()
T.plot_feature('n', ax=ax); ax.set_yscale('log'); ax.set_xscale('log'); In :
T.compare()

swor_exp is 1.5410x faster than swor_gumbel (median: swor_gumbel: 4.92334e-05 swor_exp: 3.19481e-05)
swor_exp is 1.1082x faster than swor_heap (median: swor_heap: 3.54052e-05 swor_exp: 3.19481e-05)
swor_exp is 10.4478x faster than swor_numpy (median: swor_numpy: 0.000333786 swor_exp: 3.19481e-05)


Remarks:

• The numpy version is not very competitive. That's because it uses a less efficient base algorithm that is not optimized for sampling without replacement.

• The heap-based implementation is pretty fast. It has the best asymptotic complexity if the sample size $k$ is less than $n$.

• That said, the heap-based sampler is harder to implement than the Exp and Gumbel algorithms, and it is also harder to vectorize.

• The difference between the Exp and Gumbel tricks is just that the Gumbel trick does a few more floating-point operations. In fact, as I pointed out in a 2014 article, the Exp-sort and Gumbel-sort tricks produce precisely the same sample if we use the same random seed (see the quick check after this list).

• I suspect that the performance of both the Exp and Gumbel methods could be improved with a bit of implementation effort. For example, currently, there are some unnecessary temporary memory allocations. These algorithms are also trivial to parallelize. The real bottleneck is the random variate generation time.
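
Here is a quick check of that equivalence claim. Rather than relying on how numpy's Gumbel generator consumes the underlying uniform stream, it feeds both tricks the same uniforms explicitly; the variable names are specific to this snippet.

w = random_dist(10)
U = np.random.uniform(0, 1, size=(1000, 10))
G = -np.log(-np.log(U))            # Gumbel(0, 1) variates via the inverse CDF
gumbel_keys = -(np.log(w) + G)     # the keys swor_gumbel sorts (ascending)
exp_keys = -np.log(U) / w          # the keys swor_exp sorts (ascending)
# exp_keys = exp(gumbel_keys), and exp is monotone, so the orderings agree.
assert (np.argsort(gumbel_keys, axis=1) == np.argsort(exp_keys, axis=1)).all()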
