Automatic differentiation ('autodiff' or 'backprop') is great—not just
because it makes it easy to rapidly prototype deep networks with plenty of
doodads and geegaws, but because it means that evaluating the gradient \(\nabla
f(x)\) is as fast of computing \(f(x)\). In fact, the gradient provably requires at
most a *small* constant factor more arithmetic operations than the function
itself. Furthermore, autodiff tells us how to derive and implement the gradient
efficiently. This is a fascinating result that is perhaps not emphasized enough
in machine learning.

**The gradient should never be asymptotically slower than the function.** In my
recent EMNLP'16 paper, my coauthors and I found a
line of work on variable-order CRFs
(Ye+'09;
Cuong+'14), which
had an unnecessarily slow and complicated algorithm for computing gradients,
which was asymptotically (and practically) slower than their forward
algorithm. Without breaking a sweat, we derived a simpler and more efficient
gradient algorithm by simply applying backprop to the forward algorithm (and
made some other contributions).

**Many algorithms are just backprop.** For example, forward-backward and
inside-outside, are actually just instances of automatic differentiation
(Eisner,'16) (i.e.,
outside is just backprop on inside). This shouldn't be a surprise because these
algorithms are used to compute gradients. Basically, if you know backprop and
the inside algorithm, then you can derive the outside algorithm by applying the
backprop transform manually. I find it easier to understand the outside
algorithm via its connection to backprop, then via
the usual presentation. Note
that inside-outside and forward-backward pre-date backpropagation and have
additional uses beyond computing gradients.

**Once you've grokked backprop, the world is your oyster!** You can backprop
through many approximate inference algorithms, e.g.,
Stoyanov+'11
and many of Justin Domke's papers, to avoid issues I've mentioned
before. You
can even backprop through optimization algorithms to get gradients of dev loss wrt
hyperparameters, e.g.,
Domke'12 and
Maclaurin+'15.

**There's at least one catch!** Although the *time* complexity of computing the
gradient is as good as the function, the *space* complexity may be much larger
because the autodiff recipe (at least the default reverse-mode one) requires memoizing
all intermediate quantities (e.g., the quantities you overwrite in a
loop). There are generic methods for balancing the time-space tradeoff in
autodiff, since you can (at least in theory) reconstruct the intermediate
quantities by playing the forward computation again from intermediate
checkpoints (at a cost to runtime, of course). A recent example is
Gruslys+'16.

**A final remark**. Despite the name "automatic" differentiation, there is no
need to rely on software to "automatically" give you gradient routines. Applying
the backprop transformation is generally easy to do manually and sometimes more
efficient than using a library. Many autodiff libraries lack good support for
dynamic computation graph, i.e., when the structure depends on quantities that
vary with the input (e.g., sentence length).