The following post is about an instance of "sampling in proportion to \(p\) is not optimal, but you probably think it is." It's surprising how few people seem to know this trick. Myself included! It was brought to my attention recently by Nikos Karampatziakis. (Thanks, Nikos!)
The paper credited for this trick is Press (2008). I'm borrowing heavily from that paper as well as an email exchange with Nikos.
Setting: Suppose you're an aspiring chef with a severe head injury affecting your long- and short-term memory. You're trying to find a special recipe from a cookbook: you made it once, but you just can't remember exactly which recipe it was. So, based on the ingredients of each recipe, you come up with a prior probability \(p_i\) that recipe \(i\) is the one you're looking for. In total, the cookbook has \(n\) recipes and \(\sum_{i=1}^n p_i = 1.\)
A good strategy would be to sort recipes by \(p_i\) and cook the most promising ones first. Unfortunately, you're not a great chef, so there is some probability that you'll mess up the recipe. So, it's a good idea to try recipes multiple times. Also, you have no short-term memory...
This suggests a sampling with replacement strategy, where we sample a recipe from the cookbook to try independently of whether we've tried it before (called a memoryless strategy). Let's give this strategy the name \(\boldsymbol{q}.\) Note that \(\boldsymbol{q}\) is a probability distribution over the recipes in the cookbook, just like \(\boldsymbol{p}.\)
How many recipes until we find the special one? To start, suppose the special recipe is \(j.\) Then, the expected number of recipes we have to make until we find \(j\) under the strategy \(\boldsymbol{q}\) is
\[
\mathbb{E}[T_j] = \sum_{t=1}^\infty t \, (1 - q_j)^{t-1} q_j = \frac{1}{q_j}.
\]
The equation says that the expected time it takes to sample \(j\) for the first time is the probability that we didn't sample \(j\) for \((t-1)\) steps times the probability that we sample \(j\) at time \(t.\) We multiply this probability by the time \(t\) to get the expected time.
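We can sanity-check this geometric-distribution argument with a quick simulation. The sketch below uses a hypothetical three-recipe strategy \(q\); the Monte Carlo average of the hitting time should come out close to \(1/q_j.\)

```python
import random

def time_to_find(q, j, rng):
    """Sample recipes i.i.d. from q; return the first step at which index j is drawn."""
    items = list(range(len(q)))
    t = 0
    while True:
        t += 1
        if rng.choices(items, weights=q)[0] == j:
            return t

rng = random.Random(0)
q = [0.5, 0.3, 0.2]   # a hypothetical memoryless strategy
j = 2                 # suppose recipe 2 is the special one
trials = [time_to_find(q, j, rng) for _ in range(20_000)]
avg = sum(trials) / len(trials)
print(avg)            # should be close to 1/q[j] = 5
```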
Note that this equation assumes that we know \(j\) is the special recipe with certainty when we sample it. We'll revisit this assumption later when we consider potential errors in executing the recipe.
Since we don't know which \(j\) is the right one, we take an expectation over it according to the prior distribution, which yields the following equation,
\[
f(\boldsymbol{q}) = \sum_{j=1}^n p_j \, \mathbb{E}[T_j] = \sum_{j=1}^n \frac{p_j}{q_j}.
\]
The first surprising thing: Uniform is just as good as \(\boldsymbol{p}\), yikes! \(f(\boldsymbol{p}) = \sum_{i=1}^n \frac{p_i}{p_i} = n\) and \(f(\text{uniform}(n)) = \sum_{i=1}^n \frac{p_i }{ 1/n } = n.\) (Assume, without loss of generality, that \(p_i > 0\), since we can just drop any elements with \(p_i = 0\) from \(\boldsymbol{p}.\))
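The tie between \(\boldsymbol{p}\) and uniform is easy to verify numerically. A minimal sketch, using a randomly drawn prior (the Dirichlet draw is just an arbitrary way to get a valid \(\boldsymbol{p}\)):

```python
import numpy as np

def expected_time(p, q):
    """Expected number of attempts, sum_i p_i / q_i, under memoryless sampling from q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p / q))

rng = np.random.default_rng(0)
n = 10
p = rng.dirichlet(np.ones(n))               # a random prior with all p_i > 0
print(expected_time(p, p))                  # sampling from p: exactly n = 10
print(expected_time(p, np.full(n, 1 / n)))  # uniform sampling: also n = 10
```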
What's the optimal \(\boldsymbol{q}\)? We can address this question by solving the following optimization (which will have a nice closed-form solution),
\[
\boldsymbol{q}^* = \operatorname*{argmin}_{\boldsymbol{q}} \sum_{i=1}^n \frac{p_i}{q_i}
\quad \text{subject to} \quad \sum_{i=1}^n q_i = 1 \ \text{ and } \ q_i \ge 0 \text{ for all } i.
\]
The optimization problem says minimize the expected time to find the special recipe. The constraints enforce that \(\boldsymbol{q}\) is a valid probability distribution.
The optimal strategy, which we get via Lagrange multipliers, turns out to be,
\[
q^*_i = \frac{\sqrt{p_i}}{\sum_{j=1}^n \sqrt{p_j}}.
\]
How much better is \(\boldsymbol{q}^*\)? Plugging it into the objective gives
\[
f(\boldsymbol{q}^*) = \sum_{i=1}^n \frac{p_i}{q^*_i} = \left( \sum_{i=1}^n \sqrt{p_i} \right)^{\!2},
\]
which sometimes equals \(n\), e.g., when \(\boldsymbol{p}\) is uniform, but is never bigger than \(n\) (by Cauchy–Schwarz, \(\left(\sum_i \sqrt{p_i}\right)^2 \le n \sum_i p_i = n\)).
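Here is a small sketch of the closed-form solution, checking that \(\sqrt{p}\)-sampling never does worse than the baseline value of \(n\) (again, the Dirichlet draw is just an arbitrary valid prior):

```python
import numpy as np

def optimal_q(p):
    """The minimizer q*_i ∝ sqrt(p_i) of sum_i p_i / q_i over the probability simplex."""
    q = np.sqrt(np.asarray(p, float))
    return q / q.sum()

rng = np.random.default_rng(0)
n = 10
p = rng.dirichlet(np.ones(n))
q_star = optimal_q(p)
f_star = float(np.sum(p / q_star))   # equals (sum_i sqrt(p_i))**2
print(f_star)                        # at most n = 10, with equality only for uniform p
```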
What's the intuition? The reason the \(\sqrt{p}\)-scheme is preferred is that we save on redundant cooking experiments. For example, if a recipe has \(k\) times higher prior probability than the average recipe, then we will try that recipe \(\sqrt{k}\) times more often, compared to \(k\) times under \(\boldsymbol{p}.\) Repeating the same experiment over and over has diminishing returns.
Allowing for noise in the cooking process: Suppose that for each recipe we had a prior belief about how hard that recipe is for us to cook. Denote that belief \(s_i\); these beliefs are between zero (never get it right) and one (perfect every time), and they do not necessarily sum to one over the cookbook.
Following a similar derivation to before, the expected time to cook the special recipe \(j\) and cook it correctly is,
\[
\mathbb{E}[T_j] = \sum_{t=1}^\infty t \, (1 - s_j q_j)^{t-1} s_j q_j = \frac{1}{s_j \, q_j}.
\]
That gives rise to a modified objective,
\[
f(\boldsymbol{q}) = \sum_{i=1}^n \frac{p_i}{s_i \, q_i}.
\]
This is exactly the same as the previous objective, except we've replaced \(p_i\) with \(p_i/s_i.\) Thus, we can reuse our previous derivation to get the optimal strategy, \(q^*_i \propto \sqrt{p_i / s_i}.\) If the noise is constant across recipes, then we recover the original solution, \(q^*_i \propto \sqrt{p_i}.\)
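A minimal sketch of the noise-aware strategy, with made-up per-recipe success probabilities \(s_i\), confirming that \(q^* \propto \sqrt{p/s}\) beats both sampling from the prior and plain \(\sqrt{p}\)-sampling under the noisy objective:

```python
import numpy as np

def expected_time_noisy(p, s, q):
    """sum_i p_i / (s_i q_i): expected attempts when recipe i only succeeds w.p. s_i."""
    p, s, q = (np.asarray(a, float) for a in (p, s, q))
    return float(np.sum(p / (s * q)))

def optimal_q_noisy(p, s):
    """The noise-aware optimum q*_i ∝ sqrt(p_i / s_i)."""
    q = np.sqrt(np.asarray(p, float) / np.asarray(s, float))
    return q / q.sum()

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8))
s = rng.uniform(0.2, 1.0, size=8)        # made-up per-recipe success probabilities
q_star = optimal_q_noisy(p, s)
sqrt_p = np.sqrt(p) / np.sqrt(p).sum()   # the noise-blind strategy
assert expected_time_noisy(p, s, q_star) <= expected_time_noisy(p, s, p)
assert expected_time_noisy(p, s, q_star) <= expected_time_noisy(p, s, sqrt_p)
```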
Extension to finding multiple tasty recipes: Suppose we're trying to find several tasty recipes, not just a single special one. Now, \(p_i\) is our prior belief that we'll like the recipe at all. How do we minimize the time until we find a tasty one? It turns out the same trick works without modification, because all derivations apply to each recipe independently. In particular, the trick still works if \(p_i\) does not sum to one over the \(n\) recipes, as when \(p_i\) is the independent probability that you'll like recipe \(i\) at all, rather than the probability that it's the special one.
Beyond memoryless policies: Clearly, our choice of a memoryless policy can be beaten by a policy family that balances exploration (trying new recipes) and exploitation (trying our best guess).

Overall, the problem we've posed is similar to a multi-armed bandit. In our case, the arms are the recipes, pulling an arm is trying a recipe, and the reward is whether or not we liked the recipe (possibly noisy). The key difference between our setup and multi-armed bandits is that we trust our prior distribution \(\boldsymbol{p}\) and noise model \(\boldsymbol{s}.\)

If the amount of noise \(s_i\) is known and we trust the prior \(p_i\), then there is an optimal deterministic (without-replacement) strategy that we can get by sorting the recipes by \(p_i\), accounting for the error rates \(s_i.\) This approach is described in the original paper.
A more realistic application: In certain language modeling applications, we avoid computing normalization constants (which require summing over a massive vocabulary) by using importance sampling, negative sampling, or noise contrastive estimation techniques (e.g., Ji+,16; Levy+,15). These techniques depend on a proposal distribution, which folks often take to be the unigram distribution. Unfortunately, this gives too many samples of stop words (e.g., "the", "an", "a"), so practitioners "anneal" the unigram distribution (to increase the entropy), that is, sample from \(q_i \propto p_{\text{unigram},i}^\alpha.\) Typically, \(\alpha\) is set by grid search and (no surprise) \(\alpha \approx 1/2\) tends to work best! The \(\sqrt{p}\)-sampling trick is possibly a reverse-engineered justification in favor of annealing as "the right thing to do" (e.g., why not do additive smoothing?) and it even tells us how to set the annealing parameter \(\alpha.\) The key assumption is that we want to sample the actual word at a given position as often as possible while still being diverse thanks to the coverage of the unigram prior. (Furthermore, memoryless sampling leads to simpler algorithms.)
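To make the connection concrete, here is a minimal sketch of annealing, using a random stand-in for a real unigram distribution: setting \(\alpha = 1/2\) in \(q_i \propto p_i^\alpha\) is exactly \(\sqrt{p}\)-sampling.

```python
import numpy as np

def anneal(p, alpha):
    """Annealed proposal q_i ∝ p_i**alpha (alpha=1 recovers p, alpha=0 is uniform)."""
    q = np.asarray(p, float) ** alpha
    return q / q.sum()

rng = np.random.default_rng(0)
unigram = rng.dirichlet(np.ones(1000))   # stand-in for a real unigram distribution
q_half = anneal(unigram, 0.5)
sqrt_p = np.sqrt(unigram) / np.sqrt(unigram).sum()
print(np.allclose(q_half, sqrt_p))       # True: alpha = 1/2 is sqrt(p)-sampling
```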
Final remarks: I have uploaded a Jupyter notebook with test cases that illustrate the ideas in this article.