Graduate Descent

Evaluating ∇f(x) is as fast as f(x)

Automatic differentiation ('autodiff' or 'backprop') is great—not just because it makes it easy to rapidly prototype deep networks with plenty of doodads and geegaws, but because it means that evaluating the gradient \(\nabla f(x)\) is as fast as computing \(f(x)\). In fact, the gradient provably requires at most a small constant factor more arithmetic operations than the function itself. Furthermore, autodiff tells us how to derive and implement the gradient efficiently. This is a fascinating result that is perhaps not emphasized enough in machine learning.
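To make the constant-factor claim concrete, here's a minimal sketch of reverse-mode autodiff in plain Python (my own toy illustration, not any particular library): each operation records its inputs and local partial derivatives, and the backward sweep does a constant amount of extra work per recorded operation.

```python
class Var:
    """Scalar node in the computation graph."""
    def __init__(self, value, parents=()):
        self.value = value             # result of the forward computation
        self.parents = tuple(parents)  # pairs (parent node, local derivative)
        self.adjoint = 0.0             # d(output)/d(this node), filled in by backward()

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    """Propagate adjoints from `out` back to every node that feeds into it."""
    order, seen = [], set()
    def visit(v):                      # topological order of the graph
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(out)
    out.adjoint = 1.0
    for v in reversed(order):          # one O(1) update per recorded edge
        for parent, d in v.parents:
            parent.adjoint += d * v.adjoint

# f(x, y) = x*y + x  =>  df/dx = y + 1,  df/dy = x
x, y = Var(3.0), Var(4.0)
f = x * y + x
backward(f)
print(f.value, x.adjoint, y.adjoint)   # 15.0 5.0 3.0
```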

The gradient should never be asymptotically slower than the function. In my recent EMNLP'16 paper, my coauthors and I found a line of work on variable-order CRFs (Ye+'09; Cuong+'14) that used an unnecessarily slow and complicated algorithm for computing gradients, one that was asymptotically (and practically) slower than their forward algorithm. Without breaking a sweat, we derived a simpler and more efficient gradient algorithm by simply applying backprop to the forward algorithm (and made some other contributions).

Many algorithms are just backprop. For example, forward-backward and inside-outside are actually just instances of automatic differentiation (Eisner'16); i.e., outside is just backprop on inside. This shouldn't be a surprise, because these algorithms are used to compute gradients. Basically, if you know backprop and the inside algorithm, then you can derive the outside algorithm by applying the backprop transform manually. I find it easier to understand the outside algorithm via its connection to backprop than via the usual presentation. Note that inside-outside and forward-backward pre-date backpropagation and have additional uses beyond computing gradients.
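As a tiny concrete illustration of that connection (my own sketch, with made-up potentials, not code from Eisner'16): run the forward algorithm for a linear chain, then mechanically backprop through it. The adjoints of the forward messages are exactly the backward messages, and combining them gives the usual node marginals.

```python
import numpy as np

def forward(potentials):
    """potentials[0] has shape (K,): scores for the first position.
    potentials[t] has shape (K, K) for t >= 1: pairwise factor scores."""
    alpha = [potentials[0]]
    for M in potentials[1:]:
        alpha.append(alpha[-1] @ M)        # alpha[t][y] = sum_{y'} alpha[t-1][y'] M[y', y]
    Z = alpha[-1].sum()
    return Z, alpha

def backprop_forward(potentials, alpha):
    """Adjoints of the alphas w.r.t. Z are exactly the backward messages beta."""
    T = len(potentials)
    beta = [None] * T
    beta[-1] = np.ones_like(alpha[-1])     # dZ/dalpha[T-1][y] = 1
    for t in range(T - 1, 0, -1):
        beta[t - 1] = potentials[t] @ beta[t]   # reverse of alpha[t] = alpha[t-1] @ M
    return beta

K, T = 3, 4
rng = np.random.default_rng(0)
pots = [rng.random(K)] + [rng.random((K, K)) for _ in range(T - 1)]
Z, alpha = forward(pots)
beta = backprop_forward(pots, alpha)
marginals = [a * b / Z for a, b in zip(alpha, beta)]   # p(y_t = k); each sums to 1
print([m.sum() for m in marginals])
```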

Once you've grokked backprop, the world is your oyster! You can backprop through many approximate inference algorithms, e.g., Stoyanov+'11 and many of Justin Domke's papers, to avoid issues I've mentioned before. You can even backprop through optimization algorithms to get gradients of dev loss wrt hyperparameters, e.g., Domke'12 and Maclaurin+'15.

There's at least one catch! Although the time complexity of computing the gradient is as good as that of the function, the space complexity may be much larger because the autodiff recipe (at least the default reverse-mode one) requires memoizing all intermediate quantities (e.g., the quantities you overwrite in a loop). There are generic methods for balancing the time-space tradeoff in autodiff, since you can (at least in theory) reconstruct the intermediate quantities by replaying the forward computation from intermediate checkpoints (at a cost to runtime, of course). A recent example is Gruslys+'16.
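Here's a toy sketch of that tradeoff (my own illustration, not the scheme of Gruslys+'16): differentiate through a long loop of a scalar map while storing only every k-th state, and recompute each segment from its nearest checkpoint on the way back.

```python
import math

def step(x):              # one iteration of the forward loop
    return x + 0.01 * math.sin(x)

def dstep(x):             # derivative of one iteration
    return 1.0 + 0.01 * math.cos(x)

def grad_full(x0, T):
    """d x_T / d x_0, storing all T intermediate states (the default recipe)."""
    xs = [x0]
    for _ in range(T):
        xs.append(step(xs[-1]))
    adj = 1.0
    for x in reversed(xs[:-1]):
        adj *= dstep(x)
    return adj

def grad_checkpointed(x0, T, k):
    """Same value, storing only O(T/k) checkpoints plus one O(k) segment at a time."""
    checkpoints, x = {0: x0}, x0
    for t in range(1, T + 1):
        x = step(x)
        if t % k == 0:
            checkpoints[t] = x
    boundaries = sorted(checkpoints)
    if boundaries[-1] != T:
        boundaries.append(T)
    adj = 1.0
    for s, e in reversed(list(zip(boundaries, boundaries[1:]))):
        xs = [checkpoints[s]]            # replay the segment from its checkpoint
        for _ in range(e - s - 1):
            xs.append(step(xs[-1]))      # extra time, much less memory
        for x in reversed(xs):           # chain rule over steps s+1 .. e
            adj *= dstep(x)
    return adj

print(grad_full(0.3, 200), grad_checkpointed(0.3, 200, k=16))   # same number
```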

A final remark. Despite the name "automatic" differentiation, there is no need to rely on software to "automatically" give you gradient routines. Applying the backprop transformation is generally easy to do manually, and sometimes it's more efficient than using a library. Many autodiff libraries lack good support for dynamic computation graphs, i.e., graphs whose structure depends on quantities that vary with the input (e.g., sentence length).

Multiclass logistic regression and conditional random fields are the same thing

A short rant: multiclass logistic regression and conditional random fields (CRFs) are the same thing. This comes as a surprise to many people because CRFs tend to be surrounded by additional "stuff."

Multiclass logistic regression is simple. The goal is to predict the correct label \(y^*\) from a handful of labels \(\mathcal{Y}\) given the observation \(x\), based on features \(\phi(x,y)\). Training this model typically requires computing the gradient of the conditional log-likelihood:

$$ \phi(x,y^*) - \sum_{y \in \mathcal{Y}} p(y|x) \phi(x,y) $$

where

$$ \begin{eqnarray*} p(y|x) &=& \frac{1}{Z(x)} \exp(\theta^\top \phi(x,y)) & \ \ \ \ \text{and} \ \ \ \ & Z(x) &=& \sum_{y \in \mathcal{Y}} \exp(\theta^\top \phi(x,y)) \end{eqnarray*} $$

At test-time, we often take the highest-scoring label under the model.

$$ \hat{y}(x) = \textbf{argmax}_{y \in \mathcal{Y}} \theta^\top \phi(x,y) $$
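For concreteness, here's a minimal sketch (the feature function and data are made up) that computes this gradient and the argmax by direct enumeration over \(\mathcal{Y}\), which is perfectly fine while \(\mathcal{Y}\) is a handful of labels.

```python
import numpy as np

def gradient(theta, phi, x, y_star, labels):
    scores = np.array([theta @ phi(x, y) for y in labels])
    p = np.exp(scores - scores.max())
    p /= p.sum()                                      # p(y | x)
    expected = sum(p_y * phi(x, y) for p_y, y in zip(p, labels))
    return phi(x, y_star) - expected                  # phi(x, y*) - E_p[phi(x, y)]

def predict(theta, phi, x, labels):
    return max(labels, key=lambda y: theta @ phi(x, y))

# Toy setup: conjoin the input vector with the label by giving each label its own block.
labels, D = [0, 1, 2], 2
def phi(x, y):
    f = np.zeros(D * len(labels))
    f[y * D:(y + 1) * D] = x
    return f

theta = np.zeros(D * len(labels))
x, y_star = np.array([1.0, -2.0]), 1
for _ in range(100):                                  # plain gradient ascent on log-likelihood
    theta += 0.1 * gradient(theta, phi, x, y_star, labels)
print(predict(theta, phi, x, labels))                 # 1
```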

A conditional random field is exactly multiclass logistic regression. The only difference is that the sum and argmax are inefficient to compute naively (i.e., by enumeration). This point is often lost when people first learn about CRFs. Some people never make this connection.
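To drive home how literal that statement is, here's a sketch (with an arbitrary, made-up feature function) that treats every tag sequence as one "label" and computes the sum and argmax by brute-force enumeration. The only problem is that \(|\mathcal{Y}(x)| = K^T\) blows up with the sequence length.

```python
import itertools
import numpy as np

K, T, D = 3, 5, 8                       # 3 tags, length-5 sequences, 8 weights
rng = np.random.default_rng(0)
theta = rng.normal(size=D)

def phi(x, y):                          # made-up feature function over a whole sequence
    f = np.zeros(D)
    for t, tag in enumerate(y):
        f[(x[t] + tag) % D] += 1.0      # some arbitrary x/y interaction
    return f

x = [2, 0, 1, 2, 1]
ys = list(itertools.product(range(K), repeat=T))    # Y(x): all 243 tag sequences
scores = np.array([theta @ phi(x, y) for y in ys])
Z = np.exp(scores).sum()                            # the "sum" ...
y_hat = ys[int(np.argmax(scores))]                  # ... and the "argmax"
print(len(ys), Z, y_hat)
```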

Here's some stuff you'll see once we start talking about CRFs:

  1. Inference algorithms (e.g., Viterbi decoding, forward-backward, Junction tree)

  2. Graphical models (factor graphs, Bayes nets, Markov random fields)

  3. Model templates (i.e., repeated feature functions)

In the logistic regression case, we'd never use the term "inference" to describe the "sum" and "max" over a handful of categories. Once we move to a structured label space, this term gets thrown around. (BTW, this isn't "statistical inference," just algorithms to compute the sum and max over \(\mathcal{Y}\).)

Graphical models establish a notation and pin down the structural properties that determine whether efficient inference is possible -- things like cycles and treewidth.

Model templating is the only essential trick to move from logistic regression to a CRF. Templating "solves" the problem that not all training examples have the same "size" -- the set of outputs \(\mathcal{Y}(x)\) now depends on \(x\). A model template specifies how to compute the features for an entire output, by looking at interactions between subsets of variables.

$$ \phi(x,\boldsymbol{y}) = \sum_{\alpha \in A(x)} \phi_\alpha(x, \boldsymbol{y}_\alpha) $$

where \(\alpha\) is a labeled subset of variables, often called a factor, and \(\boldsymbol{y}_\alpha\) is the subvector of \(\boldsymbol{y}\) containing the values of the variables in \(\alpha\). Basically, each feature function \(\phi_\alpha\) gets to look at some subset of the variables being predicted \(\boldsymbol{y}\), as well as the entire input \(x\). The ability to look at more of \(\boldsymbol{y}\) allows the model to make more coherent predictions.
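Here's a sketch of what such a template might look like for a linear chain (the feature functions are made up): the global feature vector is a sum of per-position "emission" factors and per-adjacent-pair "transition" factors.

```python
import numpy as np

K, D_emit, D_trans = 3, 4, 3 * 3           # 3 tags, 4 hashed emission features, K*K transitions
D = D_emit + D_trans

def phi_emit(x, t, y_t):                   # factor touching {y_t} and the input
    f = np.zeros(D)
    f[hash((x[t], y_t)) % D_emit] += 1.0   # hashed word/tag feature
    return f

def phi_trans(y_prev, y_t):                # factor touching the pair {y_{t-1}, y_t}
    f = np.zeros(D)
    f[D_emit + y_prev * K + y_t] += 1.0    # indicator for the tag pair
    return f

def phi(x, y):
    """phi(x, y) = sum over factors alpha in A(x) of phi_alpha(x, y_alpha)."""
    total = np.zeros(D)
    for t, y_t in enumerate(y):
        total += phi_emit(x, t, y_t)
    for y_prev, y_t in zip(y, y[1:]):
        total += phi_trans(y_prev, y_t)
    return total

print(phi(["the", "cat", "sat"], [0, 2, 1]))
```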

Anywho, it's often useful to take a step back and think about what you are trying to compute instead of how you're computing it. In this post, this allowed us to see the similarity between logistic regression and CRFs even though they seem quite different.