Blackbox optimization algorithms are a fantastic tool that everyone should be aware of. I frequently use blackbox optimization algorithms for prototyping and when gradientbased algorithms fail, e.g., because the function is not differentiable, because the function is truly opaque (no gradients), because the gradient would require too much memory to compute efficiently.
From a young age, we are taught to love gradients and never learn about any optimization algorithms other than gradient descent. I believe this obsession has put us in a local optimum. I've been amazed at how few people know about nongradient algorithms for optimization. Although, this is slowly improving thanks to the prevalence of hyperparameter optimization, so most people have used random search and at least know of Bayesian optimization.
There are many ways to optimize a function! The gradient just happens to have a beautiful and computationally efficient shortcut for finding the direction of steepest descent in Euclidean space.
What is a descent direction anyway? For minimizing an function \(f: \mathbb{R}^d \mapsto \mathbb{R}\), a descent direction for \(f\) is a \((d+1)\)dimensional hyperplane. The gradient gives a unique hyperplane that is tangent to the surface of \(f\) at the point \(x\); the \((d+1)^{\text{th}}\) coordinate comes from the value \(f(x)\)—think of it like a firstorder Taylor approximation to \(f\) at \(x\). (For an indepth discussion on notions of steepest descent, check out this post.)
The baseline: Without access to gradient code, approximating the gradient takes \(d+1\) function evaluations via the finitedifference approximation to the gradient,^{1} which I've discussed a few times. This shouldn't be surprising since that's the size of the object we're looking for anyways!^{2}
Can we do better? Suppose we had \((d+1)\) arbitrary points \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(d+1)}\) in \(\mathbb{R}^n\) with values \(f^{(i)} = f(\boldsymbol{x}^{(i)}).\) Can we find efficiently find a descent direction without extra \(f\) evaluations?
The NelderMead trick: Take the worstperforming point in this set \(\boldsymbol{x}^{(\text{worst})}\) and consider moving that point through the centerofmass of the \(d\) remaining points. Call this the NM direction. At some point along that direction (think line search) there will be a good place to put that point, which will make it the new best point. We can now repeat this process: pick the worst point, reflect it through the center of mass, etc.

The cost of finding the NM descent direction requires no additional function evaluations, which allows the method to be very frugal with function evaluations. Of course, stepping in the search direction should use line search, which will require additional function evaluations; gradientbased methods also benefit from line search.

Finding the worst point can be done in time \(\mathcal{O}(\log d)\) using a heap.

This NM direction might is not the steepest descent direction—like the gradient—but it does give a reasonable descent direction to follow. Often, the NM direction is more useful than the gradient direction because it is not based on an infinitesimal ball around the current point like the gradient. NM often "works" on noisy and nonsmooth functions where gradients do not exist.

On highdimensional problems, NM requires a significant number of "warm up" function evaluations before it can take its first informed step. Whereas, gradient descent could plausibly CONVERGE in fewer gradient evaluations (assuming sufficiently "nice" functions)! So, if you have highdimensional problem and efficient gradients, use them.

In three dimensions, we can visualize this as a tetrahedron with corners that "stick" to the surface of the function. At each iterations, the highest (i.e., worst performing) point is the one most likely to be affected by "gravity" which causes it to flip through the middle of the blob, as the other points stay stuck.
(animation source: Wikipedia page for NelderMead)
 This is exactly the descent direction used in the
NelderMead algorithm
(Nelder & Mead, 1965), which happens to be a great default algorithm for
locally optimizing functions without access to gradients. Matlab and scipy
users may know it better as
fmin
. There are some additional "search moves" required to turn NM into a robust algorithm; these include shrinking and growing the set of points. I won't try to make yet another tutorial on the specifics of NelderMead, as several already exist, but rather bring it to your attention as a plausible approach for efficiently finding descent directions. You can find a tutorial with plenty of visualization on its Wikipedia page.
Summary
I described the NelderMead search direction as an efficient way to leverage past function evaluations to find a descent directions, which serves as a reasonable alternative to gradients when they are unavailable (or not useful).
Further reading
 There are plenty of other blackbox optimization algorithms out there. The wiki page on derivativefree optimization is a good starting point for learning more.

Of course, it's better to use the twosided difference approximation to the gradient in practice, which requires \(2 \cdot d\) function evaluations, not \(d+1\). ↩

Note that we can get noisy, approximations with much fewer than \(\mathcal{O}(d)\) evaluations, e.g., SPSA or even REINFORCE obtain gradients approximations with just \(\mathcal{O}(1)\) evaluations per iteration. ↩