We've all written the following test for our gradient code (known as the finite-difference approximation).

$$\frac{\partial}{\partial x_i} f(\boldsymbol{x}) \approx \frac{1}{2 \varepsilon} \Big( f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{e_i}) - f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{e_i}) \Big)$$

where $$\varepsilon > 0$$ and $$\boldsymbol{e_i}$$ is a vector of zeros except at position $$i$$, where it is $$1$$. This approximation is exact in the limit $$\varepsilon \to 0$$ and, for sufficiently smooth $$f$$, its additive error is $$\mathcal{O}(\varepsilon^2)$$ because the central difference cancels the first-order error terms.

This is a specific instance of a more general approximation! The dot product of the gradient and any (conformable) vector $$\boldsymbol{d}$$ can be approximated with the following formula,

$$\nabla f(\boldsymbol{x})^{\top} \boldsymbol{d} \approx \frac{1}{2 \varepsilon} \Big( f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{d}) - f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{d}) \Big)$$

We get the special case above when $$\boldsymbol{d}=\boldsymbol{e_i}$$. This, too, is exact in the limit and just as accurate.

Runtime? The finite-difference approximation is probably too slow for approximating a high-dimensional gradient because it requires $$2 n$$ function evaluations, where $$n$$ is the dimensionality of $$\boldsymbol{x}$$. However, if the end goal is to approximate a gradient-vector product, a mere $$2$$ function evaluations is probably faster than specialized code for computing the gradient.
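For concreteness, here's a minimal numpy sketch of the full $$2n$$-evaluation version; the function name and default step size are mine, not from any particular library.

```python
import numpy as np

def fd_gradient(f, x, eps=1e-5):
    """Approximate the full gradient of f at x with central differences.

    Requires 2*n evaluations of f, where n = len(x).
    """
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = 1.0                                    # i-th coordinate vector
        g[i] = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    return g
```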

How to set $$\varepsilon$$? The second approach is more sensitive to $$\varepsilon$$ because $$\boldsymbol{d}$$ is arbitrary, unlike $$\boldsymbol{e_i}$$, which is a simple unit-norm vector. Luckily, some guidance is available. Andrei (2009) recommends

$$\varepsilon = \sqrt{\epsilon_{\text{mach}}} (1 + \|\boldsymbol{x} \|_{\infty}) / \| \boldsymbol{d} \|_{\infty}$$

where $$\epsilon_{\text{mach}}$$ is machine epsilon. (Numpy users: `numpy.finfo(x.dtype).eps`.)
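Putting the pieces together, a minimal sketch of the two-evaluation approximation with this step-size rule might look like the following (the helper name `fd_directional` is mine):

```python
import numpy as np

def fd_directional(f, x, d):
    """Approximate the directional derivative grad f(x) . d with two
    evaluations of f, using the step size suggested by Andrei (2009)."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    eps_mach = np.finfo(x.dtype).eps
    eps = np.sqrt(eps_mach) * (1 + np.linalg.norm(x, np.inf)) / np.linalg.norm(d, np.inf)
    return (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
```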

## Why do I care?

1. Well, I tend to work on sparse, but high-dimensional, problems where finite differences would be too slow. Thus, my usual solution is to test only several randomly selected dimensions, biasing samples toward dimensions which should be nonzero. With the new trick, I can effectively test more dimensions at once by taking random vectors $$\boldsymbol{d}$$. I recommend sampling $$\boldsymbol{d}$$ from a spherical Gaussian so that we're uniform on the angle of the vector (see the sketch after this list).

2. Sometimes the gradient-vector dot product is the end goal. This is the case with Hessian-vector products, which arise in many optimization algorithms, such as stochastic meta-descent.
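As a quick illustration of (1), here's a sketch that reuses the `fd_directional` helper from above to check an analytic gradient along a few random directions. The toy quadratic and the tolerances are made up purely for illustration:

```python
import numpy as np

def quadratic(x):
    return 0.5 * x @ x          # toy objective

def quadratic_grad(x):
    return x                    # its analytic gradient

x = np.random.randn(10_000)     # high-dimensional test point
for _ in range(5):
    d = np.random.randn(*x.shape)   # spherical Gaussian => uniform over angles
    approx = fd_directional(quadratic, x, d)
    exact = quadratic_grad(x) @ d
    # illustrative tolerances; loosen/tighten for your own objective
    assert np.allclose(approx, exact, rtol=1e-4, atol=1e-3)
```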

## Hessian-vector product

Hessian-vector products are an instance of the gradient-vector dot product because the Hessian is just the gradient of the gradient! Now you only need to remember one formula!

$$H(\boldsymbol{x})\, \boldsymbol{d} \approx \frac{1}{2 \varepsilon} \Big( \nabla f(\boldsymbol{x} + \varepsilon \cdot \boldsymbol{d}) - \nabla f(\boldsymbol{x} - \varepsilon \cdot \boldsymbol{d}) \Big)$$

With this trick you never have to actually compute the gnarly Hessian! More on Justin Domke's blog.
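Here's a minimal numpy sketch of that formula, assuming you already have a routine `grad_f` for the exact (or autodiff) gradient; the helper name and the reuse of Andrei's step-size rule are my choices:

```python
import numpy as np

def fd_hvp(grad_f, x, d):
    """Approximate the Hessian-vector product H(x) d from two gradient
    evaluations, never forming the Hessian itself."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    eps_mach = np.finfo(x.dtype).eps
    eps = np.sqrt(eps_mach) * (1 + np.linalg.norm(x, np.inf)) / np.linalg.norm(d, np.inf)
    return (grad_f(x + eps * d) - grad_f(x - eps * d)) / (2 * eps)
```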