We've all written the following test for our gradient code (known as the finite-difference approximation):

\[
\frac{\partial}{\partial x_i} f(\boldsymbol{x}) \approx \frac{f(\boldsymbol{x} + \varepsilon\, \boldsymbol{e_i}) - f(\boldsymbol{x} - \varepsilon\, \boldsymbol{e_i})}{2\varepsilon},
\]
where \(\varepsilon > 0\) and \(\boldsymbol{e_i}\) is a vector of zeros except at position \(i\), where it is \(1\). This approximation is exact in the limit and accurate to \(\mathcal{O}(\varepsilon^2)\) additive error.
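In code, that test might look something like the following sketch (NumPy; `f` and `x` are placeholders for your own function and evaluation point):

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-5):
    """Central-difference approximation to the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e_i = np.zeros_like(x, dtype=float)
        e_i[i] = 1.0  # i-th standard basis vector
        g[i] = (f(x + eps * e_i) - f(x - eps * e_i)) / (2 * eps)
    return g
```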
This is a specific instance of a more general approximation! The dot product of the gradient and any (conformable) vector \(\boldsymbol{d}\) can be approximated with the following formula:

\[
\nabla f(\boldsymbol{x}) \cdot \boldsymbol{d} \approx \frac{f(\boldsymbol{x} + \varepsilon\, \boldsymbol{d}) - f(\boldsymbol{x} - \varepsilon\, \boldsymbol{d})}{2\varepsilon}.
\]
We get the special case above when \(\boldsymbol{d}=\boldsymbol{e_i}\). This is also exact in the limit and just as accurate.
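A sketch of the general version (again, `f`, `x`, and `d` stand in for your own function, point, and direction):

```python
def directional_derivative(f, x, d, eps=1e-5):
    """Central-difference approximation to the dot product of the gradient of f at x with d."""
    return (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
```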
Runtime? The finite-difference approximation is probably too slow for approximating a high-dimensional gradient because it requires \(2n\) function evaluations, where \(n\) is the dimensionality of \(x\). However, if the end goal is to approximate a gradient-vector product, a mere \(2\) function evaluations is probably faster than specialized code for computing the gradient.
How to set \(\varepsilon\)? The second approach is more sensitive to \(\varepsilon\) because \(\boldsymbol{d}\) is arbitrary, unlike \(\boldsymbol{e_i}\), which is a simple unit-norm vector. Luckily, some guidance is available: Andrei (2009) recommends choosing \(\varepsilon\) based on machine epsilon, \(\epsilon_{\text{mach}}\) (NumPy users: `numpy.finfo(x.dtype).eps`).
Why do I care?
- Well, I tend to work on sparse, but high-dimensional problems where finite differences would be too slow. Thus, my usual solution is to test only a handful of randomly selected dimensions, biasing samples toward dimensions which should be nonzero. With the new trick, I can effectively test more dimensions at once by taking random vectors \(\boldsymbol{d}\). I recommend sampling \(\boldsymbol{d}\) from a spherical Gaussian so that we're uniform on the angle of the vector (see the sketch after this list).
- Sometimes the gradient-vector dot product is the end goal. This is the case with Hessian-vector products, which arise in many optimization algorithms, such as stochastic meta-descent; more on this in the next section.
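A sketch of the randomized check from the first point above (the function names and tolerances here are illustrative assumptions, not from the original post):

```python
import numpy as np

def check_grad(f, grad_f, x, n_checks=10, eps=1e-5):
    """Compare an analytic gradient against the directional finite-difference
    approximation along random directions."""
    for _ in range(n_checks):
        d = np.random.randn(*x.shape)  # spherical Gaussian: uniform over directions
        analytic = grad_f(x).dot(d)
        numeric = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
        assert np.isclose(analytic, numeric, rtol=1e-4, atol=1e-6), (analytic, numeric)
```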
Hessian-vector product
Hessian-vector products are an instance of the gradient-vector dot product because the Hessian is just the gradient of the gradient! Now you only need to remember one formula!
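Concretely, substituting \(\nabla f\) for \(f\) in the general formula gives

\[
\nabla^2 f(\boldsymbol{x})\, \boldsymbol{d} \approx \frac{\nabla f(\boldsymbol{x} + \varepsilon\, \boldsymbol{d}) - \nabla f(\boldsymbol{x} - \varepsilon\, \boldsymbol{d})}{2\varepsilon}.
\]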
With this trick you never have to actually compute the gnarly Hessian! More on Justin Domke's blog.
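A sketch in the same spirit (`grad_f` is a placeholder for your own gradient implementation):

```python
def hessian_vector_product(grad_f, x, d, eps=1e-5):
    """Approximate the Hessian-vector product H(x) @ d by central differences
    of the gradient, without ever forming the Hessian."""
    return (grad_f(x + eps * d) - grad_f(x - eps * d)) / (2 * eps)
```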