Gradient-vector product — Graduate Descent

We've all written the following test for our gradient code (known as the finite-difference approximation).

$$ \frac{\partial}{\partial x_i} f(\boldsymbol{x}) \approx \frac{1}{2 \varepsilon} \Big( f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{e_i}) - f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{e_i}) \Big) $$

where $\varepsilon > 0$ and $\boldsymbol{e_i}$ is a vector of zeros except at $i$ where it is $1$. This approximation is exact in the limit, and accurate to $o(\varepsilon)$ additive error.

This is a specific instance of a more general approximation! The dot product of the gradient and any (conformable) vector $\boldsymbol{d}$ can be approximated with the following formula,

$$ \nabla f(\boldsymbol{x})^{\top} \boldsymbol{d} \approx \frac{1}{2 \varepsilon} \Big( f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{d}) - f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{d}) \Big) $$

We get the special case above when $\boldsymbol{d}=\boldsymbol{e_i}$. This also exact in the limit and just as accurate.

Runtime? Finite-difference approximation is probably too slow for approximating a high-dimensional gradient because the number of function evaluations required is $2 n$ where $n$ is the dimensionality of $x$. However, if the end goal is to approximate a gradient-vector product, a mere $2$ function evaluations is probably faster than specialized code for computing the gradient.

How to set $\varepsilon$? The second approach is more sensitive to $\varepsilon$ because $\boldsymbol{d}$ is arbitrary, unlike $\boldsymbol{e_i}$, which is a simple unit-norm vector. Luckily some guidance is available. Andrei (2009) reccommends

$$ \varepsilon = \sqrt{\epsilon_{\text{mach}}} (1 + \|\boldsymbol{x} \|_{\infty}) / \| \boldsymbol{d} \|_{\infty} $$

where $\epsilon_{\text{mach}}$ is machine epsilon. (Numpy users: numpy.finfo(x.dtype).eps).

Why do I care?

Well, I tend to work on sparse, but high-dimensional problems where finite-difference would be too slow. Thus, my usual solution is to only test several randomly selected dimensions$-$biasing samples toward dimensions which should be nonzero. With the new trick, I can effectively test more dimensions at once by taking random vectors $\boldsymbol{d}$. I recommend sampling $\boldsymbol{d}$ from a spherical Gaussian so that we're uniform on the angle of the vector.
Sometimes the gradient-vector dot product is the end goal. This is the case with Hessian-vector products, which arises in many optimization algorithms, such as stochastic meta descent. Hessian-vector products are an instance of the gradient-vector dot product because the Hessian is just the gradient of the gradient.

Hessian-vector product

Hessian-vector products are an instance of the gradient-vector dot product because since the Hessian is just the gradient of the gradient! Now you only need to remember one formula!

$$ H(\boldsymbol{x})\, \boldsymbol{d} \approx \frac{1}{2 \varepsilon} \Big( \nabla f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{d}) - \nabla f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{d}) \Big) $$

With this trick you never have to actually compute the gnarly Hessian! More on Justin Domke's blog

Why do I care?

Hessian-vector product

Comments