Graduate Descent

How to test gradient implementations

Setup: Suppose we have a function, \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), and we want to test code that computes \(\nabla f\). (Note that these techniques also apply when \(f\) has multivariate output.)

Finite-difference approximation

The main way that people test gradient computation is by comparing it against a finite-difference (FD) approximation to the gradient:

$$ \boldsymbol{d}^\top\! \nabla f(\boldsymbol{x}) \approx \frac{1}{2 \varepsilon}(f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{d}) - f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{d})) $$

where \(\boldsymbol{d} \in \mathbb{R}^n\) is an arbitrary "direction" in parameter space. We will look at many directions when we test. Generally, people take the \(n\) elementary vectors as the directions, but random directions are just as good (and you can catch bugs in all dimensions with less than \(n\) of them).

Always use the two-sided difference formula. There is a version which doesn't add and subtract, just does one or the other. Do not use it ever.

Make sure you test multiple inputs (values of \(\boldsymbol{x}\)) or any thing else the function depends on (e.g., the minibatch).

What directions to use: When debugging, I tend to use elementary directions because they tell me something about which dimensions that are wrong... this doesn't always help though. The random directions are best when you want the test cases to run really quickly. In that case, you can switch to check a few random directions using a spherical distribution—do not sample them from a multivariate uniform!

Always test your implementation of \(f\)! It's very easy to correctly compute the gradient of the wrong function. The FD approximation is a "self-consistency" test, it does not validate \(f\) only the relationship between \(f\) and \(\nabla\! f\).

Obviously, how you test \(f\) depends strongly on what it's supposed to compute.

  • Example: For a conditional random field (CRF), you can also test that your implementation of a dynamic program for computing \(\log Z_\theta(x)\) is correctly by comparing against brute-force enumeration of \(\mathcal{Y}(x)\) on small examples.

Similarly, you can directly test the gradient code if you know a different way to compute it.

  • Example: In a CRF, we know that the \(\nabla \log Z_\theta(x)\) is a feature expectation, which you can also test against a brute-force enumeration on small examples.

Why not just use the FD approximation as your gradient?

For low-dimensional functions, you can straight-up use the finite-difference approximation instead of rolling code to compute the gradient. (Take \(n\) axis-aligned unit vectors for \(\boldsymbol{d}\).) The FD approximation is very accurate. Of course, specialized code is probably a little more accurate, but that's not really why we bother to do it! The reason why we write specialized gradient code is not improve numerical accuracy, it's to improve efficiency. As I've ranted before, automatic differentiation techniques guarantee that evaluating \(\nabla f(x)\) gradient should be as efficient as computing \(f(x)\) (with the caveat that space complexity may increase substantially - i.e., space-time tradeoffs exists). FD is \(\mathcal{O}(n \cdot \textrm{runtime } f(x))\), where as autodiff is \(\mathcal{O}(\textrm{runtime } f(x))\).

How to compare vectors

Absolute difference is the devil. You should never compare vectors in absolute difference (this is Lecture 1 of any numerical methods course). In this case, the problem is that gradients depend strongly on the scale of \(f\). If \(f\) takes tiny values then it's easy for differences to be lower than a tiny threshold.

Most people use relative error \(= \frac{|\textbf{want} - \textbf{got}|}{|\textbf{want}|}\), to get a scale-free error measure, but unfortunately relative error chokes when \(\textbf{want}\) is zero.

I compute several error measures with a script that you can import from my github arsenal.math.checkgrad.{fdcheck}.

I use two metrics to test gradients:

  1. Relative error (skipping zeros): If relative error hits a zero, I skip it. I'll rely on the other measure.

  2. Pearson correlation: Checks the direction of the gradient, but allows a scale and shift transformation. This measure doesn't have trouble with zeros, but allows scale and shift problems to pass by. Make sure you fix those errors! (e.g. In the CRF example, you might have forgotten to divide by \(Z(x)\), which not really a constant... I've made this exact mistake a few times.)

I also look at some diagnostics, which help me debug stuff:

  • Accuracy of predicting the sign {+,-,0} of each dimension (or dot random product).

  • Absolute error (just as a diagnostic)

  • Scatter plot: When debugging, I like to scatter plot the elements of FD vs. my implementation.

All these measurements (and the scatter plot) can be computed with{compare}, which I find super useful when debugging absolutely anything numerical.

Bonus tests

Testing modules: You can test the different modules of your code as well (assuming you have a composable module-based setup). E.g., I test my DP algorithm independent of how the features and downstream loss are computed. You can also test feature and downstream loss modules independent of one another. Note that autodiff (implicitly) computes Jacobian-vector products because modules are multivariate in general. We can reduce to the scalar case by taking a dot product of the outputs with a (fixed) random vector.

Something like this:

r = spherical(m)  # fixed random vector |output|=|m|
h = lambda x: module.fprop(x).dot(r)   # scalar function for use in fd

module.fprop(x)  # propagate
module.outputs.adjoint = r. # set output adjoint to r, usually we set adjoint of scalar output=1
ad = module.input.adjoint # grab the gradient
fd = fdgrad(h, x)
compare(fd, ad)

Integration tests: Test that running a gradient-based optimization algorithm is successful with your gradient implementation. Use smaller versions of your problem if possible. A related test for machine learning applications is to make sure that your model and learning procedure can (over)fit small datasets.

Test that batch = minibatch (if applicable). It's very easy to get this bit wrong. Broadcasting rules (in numpy, for example) make it easy to hide matrix conformability mishaps. So make sure you get the same results as manual minibatching (Of course, you should only do minibatching if are get a speed-up from vectorization or parallelism. You should probably test that it's actually faster.)

Further reading: I've written about gradient approximations before, you might like these articles: gradient-vector products, complex-step method. I strongly recommend learning how automatic differentiation works, I learned it from Justin Domke's course notes.

Evaluating ∇f(x) is as fast as f(x)

Automatic differentiation ('autodiff' or 'backprop') is great—not just because it makes it easy to rapidly prototype deep networks with plenty of doodads and geegaws, but because it means that evaluating the gradient \(\nabla f(x)\) is as fast of computing \(f(x)\). In fact, the gradient provably requires at most a small constant factor more arithmetic operations than the function itself. Furthermore, autodiff tells us how to derive and implement the gradient efficiently. This is a fascinating result that is perhaps not emphasized enough in machine learning.

The gradient should never be asymptotically slower than the function. In my recent EMNLP'16 paper, my coauthors and I found a line of work on variable-order CRFs (Ye+'09; Cuong+'14), which had an unnecessarily slow and complicated algorithm for computing gradients, which was asymptotically (and practically) slower than their forward algorithm. Without breaking a sweat, we derived a simpler and more efficient gradient algorithm by simply applying backprop to the forward algorithm (and made some other contributions).

Many algorithms are just backprop. For example, forward-backward and inside-outside, are actually just instances of automatic differentiation (Eisner,'16) (i.e., outside is just backprop on inside). This shouldn't be a surprise because these algorithms are used to compute gradients. Basically, if you know backprop and the inside algorithm, then you can derive the outside algorithm by applying the backprop transform manually. I find it easier to understand the outside algorithm via its connection to backprop, then via the usual presentation. Note that inside-outside and forward-backward pre-date backpropagation and have additional uses beyond computing gradients.

Once you've grokked backprop, the world is your oyster! You can backprop through many approximate inference algorithms, e.g., Stoyanov+'11 and many of Justin Domke's papers, to avoid issues I've mentioned before. You can even backprop through optimization algorithms to get gradients of dev loss wrt hyperparameters, e.g., Domke'12 and Maclaurin+'15.

There's at least one catch! Although the time complexity of computing the gradient is as good as the function, the space complexity may be much larger because the autodiff recipe (at least the default reverse-mode one) requires memoizing all intermediate quantities (e.g., the quantities you overwrite in a loop). There are generic methods for balancing the time-space tradeoff in autodiff, since you can (at least in theory) reconstruct the intermediate quantities by playing the forward computation again from intermediate checkpoints (at a cost to runtime, of course). A recent example is Gruslys+'16.

A final remark. Despite the name "automatic" differentiation, there is no need to rely on software to "automatically" give you gradient routines. Applying the backprop transformation is generally easy to do manually and sometimes more efficient than using a library. Many autodiff libraries lack good support for dynamic computation graph, i.e., when the structure depends on quantities that vary with the input (e.g., sentence length).

Gradient-based Hyperparameter Optimization and the Implicit Function Theorem

The most approaches to hyperparameter optimization can be viewed as a bi-level optimization---the "inner" optimization optimizes training loss (wrt \(\theta\)), while the "outer" optimizes hyperparameters (\(\lambda\)).

$$ \lambda^* = \underset{\lambda}{\textbf{argmin}}\ \mathcal{L}_{\text{dev}}\left( \underset{\theta}{\textbf{argmin}}\ \mathcal{L}_{\text{train}}(\theta, \lambda) \right) $$

Can we estimate \(\frac{\partial \mathcal{L}_{\text{dev}}}{\partial \lambda}\) so that we can run gradient-based optimization over \(\lambda\)?

Well, what does it mean to have an \(\textbf{argmin}\) inside a function?

Well, it means that there is a \(\theta^*\) that gets passed to \(\mathcal{L}_{\text{dev}}\). And, \(\theta^*\) is a function of \(\lambda\), denoted \(\theta(\lambda)\). Furthermore, \(\textbf{argmin}\) must set the derivative of the inner optimization is zero in order to be a local optimum of the inner function. So we can rephrase the problem as

$$ \lambda^* = \underset{\lambda}{\textbf{argmin}}\ \mathcal{L}_{\text{dev}}\left(\theta(\lambda) \right), $$

where \(\theta(\lambda)\) is the solution to,

$$ \frac{\partial \mathcal{L}_{\text{train}}(\theta, \lambda)}{\partial \theta} = 0. $$

Now how does \(\theta\) change as the result of an infinitesimal change to \(\lambda\)?

The constraint on the derivative implies a type of "equilibrium"---the inner optimization process will continue to optimize regardless of how we change \(\lambda\). Assuming we don't change \(\lambda\) too much, then the inner optimization shouldn't change \(\theta\) too much and it will change in a predictable way.

To do this, we'll appeal to the implicit function theorem. Let's looking the general case to simplify notation. Suppose \(x\) and \(y\) are related through a function \(g\) as follows,

$$g(x,y) = 0.$$

Assuming \(g\) is a smooth function in \(x\) and \(y\), we can perturb either argument, say \(x\) by a small amount \(\Delta_x\) and \(y\) by \(\Delta_y\). Because system preserves the constraint, i.e.,

$$ g(x + \Delta_x, y + \Delta_y) = 0. $$

We can solve for the change of \(x\) as a result of an infinitesimal change in \(y\). We take the first-order expansion,

$$ g(x, y) + \Delta_x \frac{\partial g}{\partial x} + \Delta_y \frac{\partial g}{\partial y} = 0. $$

Since \(g(x,y)\) is already zero,

$$ \Delta_x \frac{\partial g}{\partial x} + \Delta_y \frac{\partial g}{\partial y} = 0. $$

Next, we solve for \(\frac{\Delta_x}{\Delta_y}\).

$$ \Delta_x \frac{\partial g}{\partial x} = - \Delta_y \frac{\partial g}{\partial y}. $$
$$ \frac{\Delta_x}{\Delta_y} = -\left( \frac{\partial g}{\partial y} \right)^{-1} \frac{\partial g}{\partial x}. $$

Back to the original problem: Now we can use the implicit function theorem to estimate how \(\theta\) varies in \(\lambda\) by plugging in \(g \mapsto \frac{\partial \mathcal{L}_{\text{train}}}{\partial \theta}\), \(x \mapsto \theta\) and \(y \mapsto \lambda\):

$$ \frac{\partial \theta}{\partial \lambda} = - \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top} $$

This tells us how \(\theta\) changes with respect to an infinitesimal change to \(\lambda\). Now, we can apply the chain rule to get the gradient of the whole optimization problem wrt \(\lambda\),

$$ \frac{\partial \mathcal{L}_{\text{dev}}}{\partial \lambda} = \frac{\partial \mathcal{L}_{\text{dev}}}{\partial \theta} \left( - \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top} \right) $$

Since we don't like (explicit) matrix inverses, we compute \(- \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top}\) as the solution to \(\left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right) x = -\frac{ \partial^2 \mathcal{L}_{\text{train}}}{ \partial \theta\, \partial \lambda^\top}\). When the Hessian is positive definite, the linear system can be solved with conjugate gradient, which conveniently only requires matrix-vector products---i.e., you never have to materialize the Hessian. (Apparently, matrix-free linear algebra is a thing.) In fact, you don't even have to implement the Hessian-vector and Jacobian-vector products because they are accurately and efficiently approximated with centered differences (see earlier post).

At the end of the day, this is an easy algorithm to implement! However, the estimate of the gradient can be temperamental if the linear system is ill-conditioned.

In a later post, I'll describe a more-robust algorithms based on automatic differentiation through the inner optimization algorithm, which make fewer and less-brittle assumptions about the inner optimization.

Further reading: