Graduate Descent

Evaluating ∇f(x) is as fast as f(x)

Automatic differentiation ('autodiff' or 'backprop') is great, not just because it makes it easy to rapidly prototype deep networks with plenty of doodads and geegaws, but because it means that evaluating the gradient \(\nabla f(x)\) is as fast as computing \(f(x)\). In fact, the gradient provably requires at most a small constant factor more arithmetic operations than the function itself. Furthermore, autodiff tells us how to derive and implement the gradient efficiently. This is a fascinating result that is perhaps not emphasized enough in machine learning.

The gradient should never be asymptotically slower than the function. In my recent EMNLP'16 paper, my coauthors and I found a line of work on variable-order CRFs (Ye+'09; Cuong+'14) whose algorithm for computing gradients was unnecessarily slow and complicated: it was asymptotically (and practically) slower than their forward algorithm. Without breaking a sweat, we derived a simpler and more efficient gradient algorithm by applying backprop to the forward algorithm (and made some other contributions).

Many algorithms are just backprop. For example, forward-backward and inside-outside are actually just instances of automatic differentiation (Eisner,'16) (i.e., outside is just backprop on inside). This shouldn't be a surprise because these algorithms are used to compute gradients. Basically, if you know backprop and the inside algorithm, then you can derive the outside algorithm by applying the backprop transform manually. I find it easier to understand the outside algorithm via its connection to backprop than via the usual presentation. Note that inside-outside and forward-backward pre-date backpropagation and have additional uses beyond computing gradients.
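To make the "backward is just backprop on forward" claim concrete, here is a minimal sketch on a toy chain model. The potentials (`init`, `trans`, `emit`) and all function names are assumptions made up for this illustration; they are not taken from any of the cited papers. The reverse sweep is derived by hand from the forward recurrence, and the resulting adjoint table plays the role of the usual backward table: multiplying it by the forward table and dividing by \(Z\) gives the position-wise marginals, which we check against brute-force enumeration.

```python
import itertools
import numpy as np

def forward(init, trans, emit):
    """Forward algorithm on a chain with nonnegative potentials.

    init: (S,) start potentials, trans: (S, S) transition potentials,
    emit: (T, S) per-position potentials. Returns Z and the alpha table.
    """
    T, S = emit.shape
    alpha = np.zeros((T, S))
    alpha[0] = init * emit[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]
    return alpha[-1].sum(), alpha

def backprop_forward(trans, emit):
    """Hand-derived reverse sweep over `forward`: adj[t] = dZ / d alpha[t]."""
    T, S = emit.shape
    adj = np.zeros((T, S))
    adj[-1] = 1.0                    # because Z = sum(alpha[-1])
    for t in range(T - 1, 0, -1):    # reverse alpha[t] = (alpha[t-1] @ trans) * emit[t]
        adj[t - 1] = trans @ (adj[t] * emit[t])
    return adj

def brute_force_marginals(init, trans, emit):
    """Enumerate every path; only feasible for tiny T and S."""
    T, S = emit.shape
    marg = np.zeros((T, S))
    for path in itertools.product(range(S), repeat=T):
        w = init[path[0]] * emit[0, path[0]]
        for t in range(1, T):
            w *= trans[path[t - 1], path[t]] * emit[t, path[t]]
        for t, s in enumerate(path):
            marg[t, s] += w
    return marg / marg[0].sum()      # marg[0].sum() is the total weight Z

rng = np.random.default_rng(0)
T, S = 4, 3
init, trans, emit = rng.random(S), rng.random((S, S)), rng.random((T, S))

Z, alpha = forward(init, trans, emit)
beta = backprop_forward(trans, emit)   # the "backward" table, obtained by backprop
marginals = alpha * beta / Z
```

Note that the reverse sweep does the same amount of work per position as the forward sweep, which is the "gradient is as fast as the function" property in miniature.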

Once you've grokked backprop, the world is your oyster! You can backprop through many approximate inference algorithms, e.g., Stoyanov+'11 and many of Justin Domke's papers, to avoid issues I've mentioned before. You can even backprop through optimization algorithms to get gradients of dev loss wrt hyperparameters, e.g., Domke'12 and Maclaurin+'15.

There's at least one catch! Although the time complexity of computing the gradient is as good as the function, the space complexity may be much larger because the autodiff recipe (at least the default reverse-mode one) requires memoizing all intermediate quantities (e.g., the quantities you overwrite in a loop). There are generic methods for balancing the time-space tradeoff in autodiff, since you can (at least in theory) reconstruct the intermediate quantities by playing the forward computation again from intermediate checkpoints (at a cost to runtime, of course). A recent example is Gruslys+'16.

A final remark. Despite the name "automatic" differentiation, there is no need to rely on software to "automatically" give you gradient routines. Applying the backprop transformation is generally easy to do manually and sometimes more efficient than using a library. Many autodiff libraries lack good support for dynamic computation graphs, i.e., when the structure depends on quantities that vary with the input (e.g., sentence length).
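As a small illustration of applying the backprop transformation by hand, here is a sketch for a toy recurrence whose loop length depends on the input. The function \(f\), the parameter `w`, and the `tape` list are all hypothetical choices for this example. The tape makes the space cost visible: every value of `h` that the loop overwrites has to be saved so that the reverse sweep can use it.

```python
import numpy as np

def f(w, x):
    """Toy recurrence h_t = tanh(w * h_{t-1} + x_t), returning the final h and a tape."""
    h = 0.0
    tape = []
    for xt in x:
        tape.append(h)               # save the value we are about to overwrite
        h = np.tanh(w * h + xt)
    return h, tape

def grad_f(w, x):
    """Manual reverse mode: returns f, df/dw, and df/dx in one backward sweep."""
    h, tape = f(w, x)
    dh, dw, dx = 1.0, 0.0, np.zeros(len(x))
    for t in reversed(range(len(x))):
        h_prev = tape[t]
        h_t = np.tanh(w * h_prev + x[t])
        da = dh * (1.0 - h_t ** 2)   # adjoint of the pre-activation a = w * h_prev + x[t]
        dw += da * h_prev
        dx[t] = da
        dh = da * w                  # adjoint flowing into h_{t-1}
    return h, dw, dx

w = 0.7
x = np.array([0.3, -1.2, 0.5, 2.0])
value, dw, dx = grad_f(w, x)
```

The backward loop performs a constant number of operations per forward step, so the gradient costs a small constant factor more than \(f\) itself. A finite-difference comparison is an easy way to check a hand-derived gradient like this one.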

Fast sigmoid sampling

In this notebook, we describe a simple trick for efficiently sampling a Bernoulli random variable $Y$ from a sigmoid-defined distribution, $p(Y = 1) = (1 + \exp(-x))^{-1}$, where $x \in \mathbb{R}$ is the only parameter of the distribution ($x$ is often defined as the dot product of features and weights).

The "slow" method for sampling from a sigmoid,

$$ u \sim \textrm{Uniform}(0,1) $$

$$ Y = \textrm{sigmoid}(x) > u $$

This method is slow because it calls the sigmoid function for every value of $x$, and each call evaluates $\exp$, which is 2-3x slower than basic arithmetic operations.

In this post, I'll describe a simple trick, which is well-suited to vectorized computations (e.g., numpy, matlab). The way it works is by precomputing the expensive stuff (i.e., calls to expensive functions like $\exp$).

$$ \textrm{sigmoid}(x) > u \Leftrightarrow \textrm{logit}(\textrm{sigmoid}(x)) > \textrm{logit}(u) \Leftrightarrow x > \textrm{logit}(u). $$

Some details worth mentioning: (a) logit is the inverse of sigmoid and (b) logit is strictly monotonically increasing, so you can apply it to both sides of the inequality and preserve the ordering (there's a plot in the appendix).

The "fast" method derives its advantage from the fact that the expensive computation can be done independently of the data (i.e., specific values of $x$). The fast method is also interesting as just cute math. In the bonus section of this post, we'll make a connection to the Gumbel max trick.

How fast is it in practice? Below, we run a quick experiment to check that the method is correct and to measure how fast it is.

In [1]:
%matplotlib inline
import numpy as np
import pylab as pl
from numpy.random import uniform
from numpy import exp
from scipy.special import expit as sigmoid, logit
from arsenal.timer import timers
In [3]:
T = timers()

# These are the sigmoid parameters we're going to sample from.
n = 10000
X = np.linspace(-5,5,n)

# number of runs to average over.
R = 1000

# Used for plotting average p(Y=1)
F = np.zeros_like(X)

# Temporary array for saving on memory allocation, cf. method slow-2.
tmp = np.empty(n)                     

for _ in range(R):

    # Let's use the same random variables for all methods. This allows 
    # for a lower variance comparison and equivalence testing.
    u = uniform(0,1,size=n)
    z = logit(u)       # used in fast method: precompute expensive stuff.

    # Requires computing sigmoid for each x.
    with T['slow1']:
        s1 = sigmoid(X) > u           
    # Avoid memory allocation in slow-1 by using the out option to sigmoid
    # function. It's a little bit faster than slow-1.
    with T['slow2']:
        sigmoid(X, out=tmp)           
        s2 = tmp > u

    # Rolling our sigmoid is a bit slower than using the library function.
    # Not to mention this implementation isn't as numerically stable.
    with T['slow3']:
        s3 = 1/(1+exp(-X)) > u
    # The fast method.
    with T['fast']:
        f = X > z
    F += f / R    
    assert (s1 == f).all()
    assert (s2 == f).all()
    assert (s3 == f).all()

pl.plot(X, F)
pl.plot(X, sigmoid(X), c='r', lw=2)
fast is 28.4239x faster than slow1 (avg: slow1: 0.00114061 fast: 4.01285e-05)
slow2 is 1.0037x faster than slow1 (avg: slow1: 0.00114061 slow2: 0.0011364)
slow1 is 1.0840x faster than slow3 (avg: slow3: 0.00123637 slow1: 0.00114061)

It looks like our trick is about $28$x faster than the fastest competing slow method!

We also see that the assert statements passed, which means that the methods tested produce precisely the same samples.

The final plot demonstrates that we get the right expected value (red curve) as we sweep the distribution's parameter (x-axis).


We could alternatively use the Gumbel max trick to derive a similar algorithm. If we ground out the trick for a sigmoid instead of a general multinomial distribution, we end up with

$$ Z_0 \sim \textrm{Gumbel}(0,1) $$

$$ Z_1 \sim \textrm{Gumbel}(0,1) $$

$$ Y = x > Z_0 - Z_1 $$

Much like our new trick, this one benefits from the fact that all expensive stuff is done independent of the data (i.e., the value of $x$). However, it seems silly that we "need" to generate two Gumbel RVs to get one sample from the sigmoid. With a little bit of Googling, we discover that the difference of $\textrm{Gumbel}(0,1)$ RVs is a logistic RV (specifically $\textrm{Logistic}(0,1)$).

It turns out that $\textrm{logit}(\textrm{Uniform}(0,1))$ is a $\textrm{Logistic}(0,1)$ RV.
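The reason is a one-line inverse-CDF argument. For $U \sim \textrm{Uniform}(0,1)$ and any $z \in \mathbb{R}$, using the fact that sigmoid is the increasing inverse of logit,

$$ p(\textrm{logit}(U) \le z) = p(U \le \textrm{sigmoid}(z)) = \textrm{sigmoid}(z) = (1 + \exp(-z))^{-1}, $$

which is exactly the CDF of a $\textrm{Logistic}(0,1)$ RV.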

Voila! Our fast sampling trick and the Gumbel max trick are connected!
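A quick empirical sanity check of this connection, as a sketch: the parameter value `x = 0.8`, the sample size, and the seed are arbitrary choices for illustration. Both samplers should produce $Y = 1$ with frequency close to $\textrm{sigmoid}(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = 0.8                                   # arbitrary sigmoid parameter

# Gumbel max trick specialized to two outcomes: Y = x > Z_0 - Z_1.
z0 = rng.gumbel(0.0, 1.0, size=n)
z1 = rng.gumbel(0.0, 1.0, size=n)
y_gumbel = x > z0 - z1

# Fast method: Y = x > logit(u), with logit(u) = log(u) - log(1 - u).
u = rng.uniform(0.0, 1.0, size=n)
y_logit = x > np.log(u) - np.log1p(-u)

target = 1.0 / (1.0 + np.exp(-x))         # sigmoid(x)
print(y_gumbel.mean(), y_logit.mean(), target)
```

Unlike the experiment above, the two methods here use different random numbers, so they agree only in distribution, not sample by sample.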

Another trick is Justin Domke's trick to reduce calls to $\exp$ by $\approx 88\%$. The disadvantage of this approach is that it's harder to implement with vectorization. The advantage is that we don't need to precompute any expensive things.


Logit plot

In [3]:
xs = np.linspace(0,1,100)
ys = logit(xs)
pl.plot(xs, ys);

Logistic random variable

Check that our sampling method is equivalent to sampling from a logistic distribution.

In [5]:
from scipy.stats import logistic
u = uniform(0,1,size=10000)
z = logit(u)
pl.hist(z, bins=100, density=True)   # `normed` was removed from matplotlib; `density` replaces it
xs = np.linspace(-6,6,100)
ys = logistic.pdf(xs)
pl.plot(xs, ys, c='r', lw=2);

Sqrt-biased sampling

The following post is about an instance of "sampling in proportion to \(p\) is not optimal, but you probably think it is." It's surprising how few people seem to know this trick. Myself included! It was brought to my attention recently by Nikos Karampatziakis. (Thanks, Nikos!)

The paper credited for this trick is Press (2008). I'm borrowing heavily from that paper as well as an email exchange with Nikos.

Setting: Suppose you're an aspiring chef with a severe head injury affecting your long- and short-term memory. You're trying to find a special recipe in a cookbook, one that you made once but can't remember exactly which recipe it was. So, based on the ingredients of each recipe, you come up with a prior probability \(p_i\) that recipe \(i\) is the one you're looking for. In total, the cookbook has \(n\) recipes and \(\sum_{i=1}^n p_i = 1.\)

A good strategy would be to sort recipes by \(p_i\) and cook the most promising ones first. Unfortunately, you're not a great chef, so there is some probability that you'll mess up the recipe. So, it's a good idea to try recipes multiple times. Also, you have no short-term memory...

This suggests a sampling with replacement strategy, where we sample a recipe from the cookbook to try independently of whether we've tried it before (called a memoryless strategy). Let's give this strategy the name \(\boldsymbol{q}.\) Note that \(\boldsymbol{q}\) is a probability distribution over the recipes in the cookbook, just like \(\boldsymbol{p}.\)

How many recipes until we find the special one? To start, suppose the special recipe is \(j.\) Then, the expected number of recipes we have to make until we find \(j\) under the strategy \(\boldsymbol{q}\) is

$$ \sum_{t=1}^\infty t \cdot (1 - q_j)^{t-1} q_{j} = 1/q_{j}. $$

The equation says that the expected time it takes to sample \(j\) for the first time is a sum over times \(t\): the probability that we didn't sample \(j\) for \((t-1)\) steps, times the probability that we sample it at time \(t\), gives the probability that the first success happens at time \(t.\) We multiply this probability by the time \(t\) and sum to get the expected time.

Note that this equation assumes that we know \(j\) is the special recipe with certainty when we sample it. We'll revisit this assumption later when we consider potential errors in executing the recipe.

Since we don't know which \(j\) is the right one, we take an expectation over it according to the prior distribution, which yields the following equation,

$$ f(\boldsymbol{q}) = \sum_{i=1}^n \frac{p_i}{q_i}. $$
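The formula is easy to check by simulation. The sketch below uses made-up values for \(\boldsymbol{p}\) and \(\boldsymbol{q}\), and the function names are hypothetical. It draws the special recipe from the prior and then uses the fact that the number of memoryless draws from \(\boldsymbol{q}\) until the first hit on recipe \(j\) is geometric with success probability \(q_j.\)

```python
import numpy as np

def expected_time(p, q):
    """Closed form f(q) = sum_i p_i / q_i."""
    return float(np.sum(p / q))

def simulated_time(p, q, trials, rng):
    """Monte Carlo estimate of the expected number of recipes cooked."""
    j = rng.choice(len(p), size=trials, p=p)   # special recipe, drawn from the prior
    waits = rng.geometric(q[j])                # draws until recipe j first comes up
    return float(waits.mean())

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                  # made-up prior
q = np.array([0.2, 0.3, 0.5])                  # made-up (deliberately poor) strategy

exact = expected_time(p, q)
estimate = simulated_time(p, q, 200_000, rng)
print(exact, estimate)
```

The two numbers should agree up to Monte Carlo error, which shrinks as `trials` grows.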

The first surprising thing: Uniform is just as good as \(\boldsymbol{p}\), yikes! \(f(\boldsymbol{p}) = \sum_{i=1}^n \frac{p_i}{p_i} = n\) and \(f(\text{uniform}(n)) = \sum_{i=1}^n \frac{p_i }{ 1/n } = n.\) (Assume, without loss of generality, that \(p_i > 0\) for all \(i\), since we can just drop any zero-probability elements from \(\boldsymbol{p}.\))

What's the optimal \(\boldsymbol{q}\)? We can address this question by solving the following optimization (which will have a nice closed form solution),

$$ \begin{eqnarray*} && \boldsymbol{q}^* = \underset{\boldsymbol{q}}{\operatorname{argmin}} \sum_{i=1}^n \frac{p_i}{q_i} \\ && \ \ \ \ \ \ \ \ \text{ s.t. } \sum_{i=1}^n q_i = 1 \\ && \ \ \ \ \ \ \ \ \ \ \ \ \, q_1 \ldots q_n \ge 0. \end{eqnarray*} $$

The optimization problem says minimize the expected time to find the special recipe. The constraints enforce that \(\boldsymbol{q}\) be a valid probability distribution.

The optimal strategy, which we get via Lagrange multipliers, turns out to be,

$$ q^*_i = \frac{ \sqrt{p_i} }{ \sum_{j=1}^n \sqrt{p_j} }. $$
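For readers who want the omitted steps, here is a sketch of the Lagrange multiplier argument. It ignores the nonnegativity constraints, which turn out to be inactive when every \(p_i > 0\); the multiplier \(\lambda\) is notation introduced only for this sketch.

$$ \mathcal{L}(\boldsymbol{q}, \lambda) = \sum_{i=1}^n \frac{p_i}{q_i} + \lambda \left( \sum_{i=1}^n q_i - 1 \right) $$

$$ \frac{\partial \mathcal{L}}{\partial q_i} = -\frac{p_i}{q_i^2} + \lambda = 0 \quad \Rightarrow \quad q_i = \sqrt{p_i / \lambda} $$

$$ \sum_{i=1}^n q_i = 1 \quad \Rightarrow \quad \sqrt{\lambda} = \sum_{j=1}^n \sqrt{p_j} $$

Substituting \(\sqrt{\lambda}\) back into \(q_i\) gives the expression for \(q^*_i.\) Since each term \(p_i / q_i\) is convex in \(q_i > 0\), this stationary point is the minimizer.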

How much better is \(q^*\)?

$$ f(q^*) = \sum_i \frac{p_i}{q^*_i} = \sum_i \frac{p_i}{ \frac{\sqrt{p_i} }{ \sum_j \sqrt{p_j}} } = \left( \sum_i \frac{p_i}{ \sqrt{p_i} } \right) \left( \sum_j \sqrt{p_j} \right) = \left( \sum_i \sqrt{p_i} \right)^2 $$

which sometimes equals \(n\), e.g., when \(\boldsymbol{p}\) is uniform, but is never bigger than \(n.\)
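The bound \(f(q^*) \le n\) follows from the Cauchy-Schwarz inequality: \(\left( \sum_i \sqrt{p_i} \cdot 1 \right)^2 \le \left( \sum_i p_i \right) \left( \sum_i 1 \right) = n.\) Here is a small numerical sketch comparing the three strategies on a made-up skewed prior (the values of `p` and the function names are hypothetical).

```python
import numpy as np

def expected_time(p, q):
    """f(q) = sum_i p_i / q_i."""
    return float(np.sum(p / q))

def sqrt_biased(p):
    """q*_i proportional to sqrt(p_i)."""
    r = np.sqrt(p)
    return r / r.sum()

p = np.array([0.7, 0.2, 0.05, 0.03, 0.02])   # made-up skewed prior
n = len(p)
uniform = np.full(n, 1.0 / n)
q_star = sqrt_biased(p)

f_p = expected_time(p, p)                    # sampling from the prior itself
f_uniform = expected_time(p, uniform)        # sampling uniformly
f_star = expected_time(p, q_star)            # sqrt-biased sampling
print(f_p, f_uniform, f_star)
```

On this prior, sampling from \(\boldsymbol{p}\) and sampling uniformly both cost \(n = 5\) recipes in expectation, while the \(\sqrt{p}\)-scheme costs \(\left( \sum_i \sqrt{p_i} \right)^2\), which is noticeably smaller.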

What's the intuition? The \(\sqrt{p}\)-scheme is preferred because it saves on additional cooking experiments. For example, if a recipe has \(k\) times higher prior probability than the average recipe, then we will try that recipe \(\sqrt{k}\) times more often, compared to \(k\) times more often under \(\boldsymbol{p}.\) Additional cooking experiments are not so advantageous.

Allowing for noise in the cooking process: Suppose that for each recipe we had a prior belief about how hard that recipe is for us to cook. Denote by \(s_i\) our belief that we cook recipe \(i\) correctly on a given attempt. These beliefs are between zero (never get it right) and one (perfect every time) and do not sum to one over the cookbook.

Following a similar derivation to before, the expected time to sample the special recipe \(j\) and cook it correctly is,

$$ \sum_{t=1}^\infty t \cdot (1 - \color{red}{s_j} q_j)^{t-1} q_{j} \color{red}{s_j} = \frac{1}{s_j \cdot q_j} $$

That gives rise to a modified objective,

$$ f'(\boldsymbol{q}) = \sum_{i=1}^n \frac{p_i}{\color{red}{s_i} \cdot q_i} $$

This is exactly the same as the previous objective, except we've replaced \(p_i\) with \(p_i/s_i.\) Thus, we can reuse our previous derivation to get the optimal strategy, \(q^*_i \propto \sqrt{p_i / s_i}.\) If noise is constant, then we recover the original solution, \(q^*_i \propto \sqrt{p_i}.\)
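A minimal sketch of the noisy variant, with made-up values for \(\boldsymbol{p}\) and \(\boldsymbol{s}\) and hypothetical function names. It compares the noise-aware strategy against the plain \(\sqrt{p}\) strategy under the modified objective \(f'.\)

```python
import numpy as np

def noisy_expected_time(p, s, q):
    """f'(q) = sum_i p_i / (s_i * q_i)."""
    return float(np.sum(p / (s * q)))

def noisy_sqrt_biased(p, s):
    """q*_i proportional to sqrt(p_i / s_i)."""
    r = np.sqrt(p / s)
    return r / r.sum()

p = np.array([0.5, 0.3, 0.2])     # made-up prior over recipes
s = np.array([0.9, 0.5, 0.1])     # made-up per-recipe success probabilities

q_noisy = noisy_sqrt_biased(p, s)
q_plain = np.sqrt(p) / np.sqrt(p).sum()

cost_noisy = noisy_expected_time(p, s, q_noisy)
cost_plain = noisy_expected_time(p, s, q_plain)
print(q_noisy, cost_noisy, cost_plain)
```

The noise-aware strategy shifts probability toward recipes that are hard to cook correctly, since they need more attempts, and its expected time is no larger than that of the plain \(\sqrt{p}\) strategy.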

Extension to finding multiple tasty recipes: Suppose we're trying to find several tasty recipes, not just a single special one. Now, \(p_i\) is our prior belief that we'll like the recipe at all. How do we minimize the time until we find a tasty one? It turns out the same trick works without modification because all derivations apply to each recipe independently. The same trick also works if \(p_i\) does not sum to one over the \(n\) recipes, for example, if \(p_i\) is the independent probability that you'll like recipe \(i\) at all, not the probability that it's the special one.

Beyond memoryless policies: Clearly, our choice of a memoryless policy can be beaten by a policy family that balances exploration (trying new recipes) and exploitation (trying our best guess).

  • Overall, the problem we've posed is similar to a multi-armed bandit. In our case, the arms are the recipes, pulling the arm is trying the recipe and the reward is whether or not we liked the recipe (possibly noisy). The key difference between our setup and multi-armed bandits is that we trust our prior distribution \(\boldsymbol{p}\) and noise model \(\boldsymbol{s}.\)

  • If the amount of noise \(s_i\) is known and we trust the prior \(p_i\) then there is an optimal deterministic (without-replacement) strategy that we can get by sorting the recipes by \(p_i\) accounting for the error rates \(s_i.\) This approach is described in the original paper.

A more realistic application: In certain language modeling applications, we avoid computing normalization constants (which require summing over a massive vocabulary) by using importance sampling, negative sampling or noise contrastive estimation techniques (e.g., Ji+,16; Levy+,15). These techniques depend on a proposal distribution, which folks often take to be the unigram distribution. Unfortunately, this gives too many samples of stop words (e.g., "the", "an", "a"), so practitioners "anneal" the unigram distribution (to increase the entropy), that is, they sample from \(q_i \propto p_{\text{unigram},i}^\alpha.\) Typically, \(\alpha\) is set by grid search and (no surprise) \(\alpha \approx 1/2\) tends to work best! The \(\sqrt{p}\)-sampling trick is possibly a reverse-engineered justification in favor of annealing as "the right thing to do" (e.g., why not do additive smoothing?) and it even tells us how to set the annealing parameter \(\alpha.\) The key assumption is that we want to sample the actual word at a given position as often as possible while still being diverse thanks to the coverage of the unigram prior. (Furthermore, memoryless sampling leads to simpler algorithms.)
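To make the annealing step concrete, here is a toy sketch. The word counts are invented for illustration and the function names are hypothetical; real proposal distributions would be built from corpus counts over a full vocabulary.

```python
import numpy as np

def annealed(counts, alpha):
    """Proposal q_i proportional to count_i ** alpha."""
    w = np.asarray(counts, dtype=float) ** alpha
    return w / w.sum()

def entropy(q):
    """Shannon entropy in nats."""
    return float(-np.sum(q * np.log(q)))

# Invented counts: two stop words dominate a handful of content words.
counts = {"the": 5000, "a": 3000, "cat": 40, "sat": 25, "zygote": 1}
words = list(counts)
c = list(counts.values())

p_unigram = annealed(c, 1.0)      # plain unigram distribution
q_half = annealed(c, 0.5)         # sqrt-biased proposal (alpha = 1/2)

for word, pu, qh in zip(words, p_unigram, q_half):
    print(f"{word:>7s}  unigram={pu:.4f}  annealed={qh:.4f}")
print(entropy(p_unigram), entropy(q_half))
```

With \(\alpha = 1/2\) the stop words still get sampled most often, but rare words receive a much larger share of the proposal mass, and the entropy of the proposal goes up.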