Graduate Descent

Evaluating ∇f(x) is as fast as f(x)

Automatic differentiation ('autodiff' or 'backprop') is great—not just because it makes it easy to rapidly prototype deep networks with plenty of doodads and geegaws, but because it means that evaluating the gradient \(\nabla f(x)\) is as fast as computing \(f(x)\). In fact, the gradient provably requires at most a small constant factor more arithmetic operations than the function itself. Furthermore, autodiff tells us how to derive and implement the gradient efficiently. This is a fascinating result that is perhaps not emphasized enough in machine learning.

The gradient should never be asymptotically slower than the function. In my recent EMNLP'16 paper, my coauthors and I found a line of work on variable-order CRFs (Ye+'09; Cuong+'14) that used an unnecessarily slow and complicated algorithm for computing gradients, one that was asymptotically (and practically) slower than their forward algorithm. Without breaking a sweat, we derived a simpler and more efficient gradient algorithm by simply applying backprop to the forward algorithm (and made some other contributions).

Many algorithms are just backprop. For example, forward-backward and inside-outside are actually just instances of automatic differentiation (Eisner,'16) (i.e., outside is just backprop on inside). This shouldn't be a surprise, because these algorithms are used to compute gradients. Basically, if you know backprop and the inside algorithm, then you can derive the outside algorithm by applying the backprop transform manually. I find it easier to understand the outside algorithm via its connection to backprop than via the usual presentation. Note that inside-outside and forward-backward pre-date backpropagation and have additional uses beyond computing gradients.

Once you've grokked backprop, the world is your oyster! You can backprop through many approximate inference algorithms, e.g., Stoyanov+'11 and many of Justin Domke's papers, to avoid issues I've mentioned before. You can even backprop through optimization algorithms to get gradients of dev loss wrt hyperparameters, e.g., Domke'12 and Maclaurin+'15.

There's at least one catch! Although the time complexity of computing the gradient is as good as the function's, the space complexity may be much larger because the autodiff recipe (at least the default reverse-mode one) requires memoizing all intermediate quantities (e.g., the quantities you overwrite in a loop). There are generic methods for balancing the time-space tradeoff in autodiff, since you can (at least in theory) reconstruct the intermediate quantities by replaying the forward computation from intermediate checkpoints (at a cost to runtime, of course). A recent example is Gruslys+'16.
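
To make that time-space tradeoff concrete, here is a tiny sketch (a toy setup of my own, not any particular library's API) of checkpointed reverse-mode differentiation for a scalar chain \(x_{t+1} = f_t(x_t)\): only every \(k\)-th state is stored on the forward pass, and the rest are recomputed segment by segment on the backward pass.

```python
def chain_grad_checkpointed(steps, dsteps, x0, k=10):
    """Return d x_n / d x_0, where steps[t] is f_t and dsteps[t] is f_t'."""
    n = len(steps)
    # Forward pass: keep only the checkpoints x_0, x_k, x_{2k}, ...
    checkpoints = {0: x0}
    x = x0
    for t in range(n):
        x = steps[t](x)
        if (t + 1) % k == 0:
            checkpoints[t + 1] = x
    # Backward pass, one segment at a time.
    adj = 1.0                               # d x_n / d x_n
    hi = n
    while hi > 0:
        c = ((hi - 1) // k) * k             # nearest checkpoint below hi
        xs = [checkpoints[c]]               # recompute x_c, ..., x_{hi-1}
        for t in range(c, hi - 1):
            xs.append(steps[t](xs[-1]))
        for t in reversed(range(c, hi)):    # accumulate adjoints in this segment
            adj *= dsteps[t](xs[t - c])
        hi = c
    return adj
```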

A final remark. Despite the name "automatic" differentiation, there is no need to rely on software to "automatically" give you gradient routines. Applying the backprop transformation is generally easy to do manually and sometimes more efficient than using a library. Many autodiff libraries lack good support for dynamic computation graphs, i.e., graphs whose structure depends on quantities that vary with the input (e.g., sentence length).

Gradient-based Hyperparameter Optimization and the Implicit Function Theorem

Most approaches to hyperparameter optimization can be viewed as bi-level optimization---the "inner" optimization minimizes training loss (wrt \(\theta\)), while the "outer" optimization tunes the hyperparameters (\(\lambda\)).

$$ \lambda^* = \underset{\lambda}{\textbf{argmin}}\ \mathcal{L}_{\text{dev}}\left( \underset{\theta}{\textbf{argmin}}\ \mathcal{L}_{\text{train}}(\theta, \lambda) \right) $$

Can we estimate \(\frac{\partial \mathcal{L}_{\text{dev}}}{\partial \lambda}\) so that we can run gradient-based optimization over \(\lambda\)?

Well, what does it mean to have an \(\textbf{argmin}\) inside a function?

Well, it means that there is a \(\theta^*\) that gets passed to \(\mathcal{L}_{\text{dev}}\). And \(\theta^*\) is a function of \(\lambda\), which we'll denote \(\theta(\lambda)\). Furthermore, since \(\textbf{argmin}\) returns a local optimum of the inner function, it must set the derivative of the inner objective to zero. So we can rephrase the problem as

$$ \lambda^* = \underset{\lambda}{\textbf{argmin}}\ \mathcal{L}_{\text{dev}}\left(\theta(\lambda) \right), $$

where \(\theta(\lambda)\) is the solution to,

$$ \frac{\partial \mathcal{L}_{\text{train}}(\theta, \lambda)}{\partial \theta} = 0. $$

Now how does \(\theta\) change as the result of an infinitesimal change to \(\lambda\)?

The constraint on the derivative implies a type of "equilibrium"---the inner optimization keeps \(\theta\) at a stationary point no matter how we change \(\lambda\). As long as we don't change \(\lambda\) too much, the inner optimization shouldn't change \(\theta\) too much, and it will change in a predictable way.

To do this, we'll appeal to the implicit function theorem. Let's look at the general case to simplify notation. Suppose \(x\) and \(y\) are related through a function \(g\) as follows,

$$g(x,y) = 0.$$

Assuming \(g\) is a smooth function of \(x\) and \(y\), we can perturb either argument: \(x\) by a small amount \(\Delta_x\) and \(y\) by \(\Delta_y\). The perturbed system must still satisfy the constraint, i.e.,

$$ g(x + \Delta_x, y + \Delta_y) = 0. $$

We can solve for the change of \(x\) as a result of an infinitesimal change in \(y\). We take the first-order expansion,

$$ g(x, y) + \Delta_x \frac{\partial g}{\partial x} + \Delta_y \frac{\partial g}{\partial y} = 0. $$

Since \(g(x,y)\) is already zero,

$$ \Delta_x \frac{\partial g}{\partial x} + \Delta_y \frac{\partial g}{\partial y} = 0. $$

Next, we solve for \(\frac{\Delta_x}{\Delta_y}\).

$$ \Delta_x \frac{\partial g}{\partial x} = - \Delta_y \frac{\partial g}{\partial y}. $$
$$ \frac{\Delta_x}{\Delta_y} = -\left( \frac{\partial g}{\partial x} \right)^{-1} \frac{\partial g}{\partial y}. $$
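
As a quick numerical sanity check of this formula (a toy example of my own, not part of the original derivation), take \(g(x, y) = x^2 - y\), whose solution branch \(x(y) = \sqrt{y}\) has derivative \(\frac{1}{2\sqrt{y}}\):

```python
# Check the implicit-function formula on g(x, y) = x^2 - y.
import numpy as np

y = 2.0
x = np.sqrt(y)                  # satisfies g(x, y) = 0
dg_dx = 2 * x                   # dg/dx
dg_dy = -1.0                    # dg/dy
implicit = -dg_dy / dg_dx       # -(dg/dx)^{-1} dg/dy
direct = 1 / (2 * np.sqrt(y))   # d sqrt(y) / dy
assert np.isclose(implicit, direct)
```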

Back to the original problem: Now we can use the implicit function theorem to estimate how \(\theta\) varies in \(\lambda\) by plugging in \(g \mapsto \frac{\partial \mathcal{L}_{\text{train}}}{\partial \theta}\), \(x \mapsto \theta\) and \(y \mapsto \lambda\):

$$ \frac{\partial \theta}{\partial \lambda} = - \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top} $$

This tells us how \(\theta\) changes with respect to an infinitesimal change to \(\lambda\). Now, we can apply the chain rule to get the gradient of the whole optimization problem wrt \(\lambda\),

$$ \frac{\partial \mathcal{L}_{\text{dev}}}{\partial \lambda} = \frac{\partial \mathcal{L}_{\text{dev}}}{\partial \theta} \left( - \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top} \right) $$

Since we don't like (explicit) matrix inverses, we compute \(- \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top}\) as the solution to \(\left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right) x = -\frac{ \partial^2 \mathcal{L}_{\text{train}}}{ \partial \theta\, \partial \lambda^\top}\). When the Hessian is positive definite, the linear system can be solved with conjugate gradient, which conveniently only requires matrix-vector products---i.e., you never have to materialize the Hessian. (Apparently, matrix-free linear algebra is a thing.) In fact, you don't even have to implement the Hessian-vector and Jacobian-vector products because they are accurately and efficiently approximated with centered differences (see earlier post).
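
Here is a rough numpy/scipy sketch of that recipe. The callables grad_train (\(\partial \mathcal{L}_{\text{train}}/\partial \theta\) as a function of \(\theta\) and \(\lambda\)) and grad_dev (\(\partial \mathcal{L}_{\text{dev}}/\partial \theta\)) are names I've made up for user-supplied routines; both second-derivative products are approximated with centered differences, as described above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hypergradient(theta, lam, grad_train, grad_dev, eps=1e-5):
    d, k = theta.size, lam.size

    def hvp(v):
        # Hessian-vector product (d^2 L_train / dtheta dtheta^T) v via centered differences.
        return (grad_train(theta + eps * v, lam)
                - grad_train(theta - eps * v, lam)) / (2 * eps)

    H = LinearOperator((d, d), matvec=hvp)   # never materialize the Hessian

    # Solve H x = -d^2 L_train / dtheta dlam^T, one hyperparameter (column) at a time,
    # assuming the Hessian is positive definite so conjugate gradient applies.
    dtheta_dlam = np.zeros((d, k))
    for j in range(k):
        e = np.zeros(k); e[j] = 1.0
        b = -(grad_train(theta, lam + eps * e)
              - grad_train(theta, lam - eps * e)) / (2 * eps)
        dtheta_dlam[:, j], _ = cg(H, b)

    # Chain rule: dL_dev/dlam = (dL_dev/dtheta) (dtheta/dlam).
    return grad_dev(theta) @ dtheta_dlam
```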

At the end of the day, this is an easy algorithm to implement! However, the estimate of the gradient can be temperamental if the linear system is ill-conditioned.

In a later post, I'll describe a more robust algorithm based on automatic differentiation through the inner optimization algorithm, which makes fewer and less brittle assumptions about the inner optimization.


Gradient of a product

$$ \newcommand{\gradx}[1]{\grad{x}{ #1 }} \newcommand{\grad}[2]{\nabla_{\! #1}\! \left[ #2 \right]} \newcommand{\R}{\mathbb{R}} \newcommand{\bigo}[0]{\mathcal{O}} $$

In this post we'll look at how to compute the gradient of a product. This is such a common subroutine in machine learning that it's worth careful consideration. In a later post, I'll describe the gradient of a sum-over-products, which is another interesting and common pattern in machine learning (e.g., exponential families, CRFs, context-free grammar, case-factor diagrams, semiring-weighted logic programming).

Given a collection of functions with a common argument, \(f_1, \ldots, f_n \in \{ \R^d \mapsto \R \}\), define their product \(p(x) = \prod_{i=1}^n f_i(x)\).

Suppose we'd like to compute the gradient of this product with respect to the common argument, \(x\):

$$ \begin{eqnarray*} \gradx{ p(x) } &=& \gradx{ \prod_{i=1}^n f_i(x) } &=& \sum_{i=1}^n \left( \gradx{f_i(x)} \prod_{j \ne i} f_j(x) \right) \end{eqnarray*} $$

As you can see in the equation above, the gradient takes the form of a "leave-one-out product" sometimes called a "cavity."

A naive method for computing the gradient computes the leave-one-out products from scratch for each \(i\) (outer loop)---resulting in an overall runtime of \(O(n^2)\) to compute the gradient. Later, we'll see a dynamic program for computing this efficiently.
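
For reference, a direct transcription of the naive method might look like this (the function names and calling convention are mine, purely for illustration):

```python
# Naive O(n^2) leave-one-out gradient. fs[i] returns f_i(x) and gs[i] returns
# its gradient at x.
def grad_product_naive(fs, gs, x):
    n = len(fs)
    total = 0.0
    for i in range(n):
        loo = 1.0                       # leave-one-out product, rebuilt from scratch
        for j in range(n):
            if j != i:
                loo *= fs[j](x)
        total = total + gs[i](x) * loo
    return total
```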

Division trick: Before going down the dynamic programming rabbit hole, let's consider the following relatively simple method for computing the gradient, which uses division:

$$ \begin{eqnarray*} \gradx{ p(x) } &=& \sum_{i=1}^n \left( \frac{\gradx{f_i(x)} }{ f_i(x) } \prod_{j=1}^n f_j(x) \right) &=& \left( \sum_{i=1}^n \frac{\gradx{f_i(x)} }{ f_i(x) } \right) \left( \prod_{j=1}^n f_j(x) \right) \end{eqnarray*} $$

Pro:

  • Runtime \(\bigo(n)\) with space \(\bigo(1)\).

Con:

  • Requires \(f \ne 0\). No worries, we can handle zeros with three cases: (1) If no zeros: the division trick works fine. (2) Only one zero: implies that only one term in the sum will have a nonzero gradient, which we compute via leave-one-out product. (3) Two or more zeros: all gradients are zero and there is no work to be done.

  • Requires multiplicative inverse operator (division) and associative-commutative multiplication, which means it's not applicable to matrices.
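
Here is the same computation with the division trick, as a minimal sketch that assumes none of the \(f_i(x)\) is zero:

```python
# Division trick: O(n) time, O(1) extra space, assuming every f_i(x) != 0.
def grad_product_division(fs, gs, x):
    p = 1.0        # running product of the f_i(x)
    s = 0.0        # running sum of grad f_i(x) / f_i(x)
    for f, g in zip(fs, gs):
        fi = f(x)
        p *= fi
        s = s + g(x) / fi
    return p * s
```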

Log trick: Suppose \(f_i\) are very small numbers (e.g., probabilities), which we'd rather not multiply together because we'll quickly lose precision (e.g., for large \(n\)). It's common practice (especially in machine learning) to replace \(f_i\) with \(\log f_i\), which turns products into sums, \(\prod_{j=1}^n f_j(x) = \exp \left( \sum_{j=1}^n \log f_j(x) \right)\), and tiny numbers (like \(\texttt{3.72e-44}\)) into large ones (like \(\texttt{-100}\)).

Furthermore, using the identity \((\nabla g) = g \cdot \nabla \log g\), we can operate exclusively in the "\(\log\)-domain".

$$ \begin{eqnarray*} \gradx{ p(x) } &=& \left( \sum_{i=1}^n \gradx{ \log f_i(x) } \right) \exp\left( \sum_{j=1}^n \log f_j(x) \right) \end{eqnarray*} $$

Pro:

  • Numerically stable

  • Runtime \(\bigo(n)\) with space \(\bigo(1)\).

  • Doesn't require multiplicative inverse assuming you can compute \(\gradx{ \log f_i(x) }\) without it.

Con:

  • Requires \(f > 0\). We can use a log-real number class to represent negative numbers in log-space, but we still need to be careful about zeros (as in the division trick).

  • Doesn't easily generalize to other notions of multiplication.
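
And a minimal sketch of the log-domain version, assuming every \(f_i(x) > 0\) and that we are handed \(\log f_i\) and its gradient directly:

```python
# Log trick: O(n) time, numerically stable, assuming every f_i(x) > 0.
# logfs[i] returns log f_i(x) and gradlogfs[i] returns grad log f_i(x).
import numpy as np

def grad_product_logtrick(logfs, gradlogfs, x):
    logp = sum(lf(x) for lf in logfs)        # log of the full product
    s = sum(g(x) for g in gradlogfs)         # sum of grad log f_i(x)
    return np.exp(logp) * s
```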

Dynamic programming trick: \(\bigo(n)\) runtime and \(\bigo(n)\) space. You may recognize this as the forward-backward algorithm for linear-chain CRFs (cf. Wallach (2004), section 7).

The trick is very straightforward when you think about it in isolation. Compute the products of all prefixes and suffixes. Then, multiply them together.

Here are the equations:

$$ \begin{eqnarray*} \alpha_0(x) &=& 1 \\ \alpha_t(x) &=& \prod_{i \le t} f_i(x) = \alpha_{t-1}(x) \cdot f_t(x) \\ \beta_{n+1}(x) &=& 1 \\ \beta_t(x) &=& \prod_{i \ge t} f_i(x) = f_t(x) \cdot \beta_{t+1}(x)\\ \gradx{ p(x) } &=& \sum_{i=1}^n \left( \prod_{j < i} f_j(x) \right) \gradx{f_i(x)} \left( \prod_{j > i} f_j(x) \right) \\ &=& \sum_{i=1}^n \alpha_{i-1}(x) \cdot \gradx{f_i(x)} \cdot \beta_{i+1}(x) \end{eqnarray*} $$

Clearly, this requires \(O(n)\) additional space.

It only requires an associative operator (i.e., unlike the earlier strategies, the operator does not need to be commutative or invertible).

Why do we care about non-commutative multiplication? A common example is matrix multiplication, where \(A B C \ne B C A\) in general, even when all the matrices have conformable dimensions.
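
Here is a sketch of the \(\alpha\)/\(\beta\) dynamic program for scalar-valued \(f_i\) (function names are mine); the comments note the small change needed for the matrix case:

```python
# Prefix/suffix (alpha/beta) dynamic program: O(n) time and O(n) space.
# Written for scalar f_i; for matrices, swap the 1.0 initializers for identity
# matrices and keep the multiplication order exactly as below.
def grad_product_dp(fs, dfs, x):
    n = len(fs)
    vals = [f(x) for f in fs]
    alpha = [1.0] * (n + 1)                  # alpha[t] = f_1(x) * ... * f_t(x)
    beta = [1.0] * (n + 2)                   # beta[t]  = f_t(x) * ... * f_n(x)
    for t in range(1, n + 1):
        alpha[t] = alpha[t - 1] * vals[t - 1]
    for t in range(n, 0, -1):
        beta[t] = vals[t - 1] * beta[t + 1]
    # grad p(x) = sum_i alpha_{i-1} * grad f_i(x) * beta_{i+1}
    return sum(alpha[i - 1] * dfs[i - 1](x) * beta[i + 1] for i in range(1, n + 1))
```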

Connections to automatic differentiation: The theory behind reverse-mode automatic differentiation says that if you can compute a function, then you can compute its gradient with the same asymptotic time complexity, but you might need more space. That's exactly what we did here: we started with a naive algorithm for computing the gradient with \(\bigo(n^2)\) time and \(\bigo(1)\) space (beyond the space to store the \(n\) functions) and, with a little clever thinking, ended up with an \(\bigo(n)\) time, \(\bigo(n)\) space algorithm. What I'm saying is that autodiff---even if you don't use a magical package---tells us that an efficient algorithm for the gradient always exists. Furthermore, it tells you how to derive it manually, if you are so inclined. The key is to reuse intermediate quantities (hence the increase in space).

Sketch: In the gradient-of-a-product case, assuming we implement the product left-to-right (the forward pass), that pass already defines the prefix products (\(\alpha\)). It turns out that the backward pass gives us the \(\beta\)'s as adjoints. Lastly, we propagate gradients through the \(f\)'s to get \(\frac{\partial p}{\partial x}\). Essentially, we end up with exactly the dynamic programming algorithm we came up with above.