Most approaches to hyperparameter optimization can be viewed as a bilevel optimization: the "inner" optimization minimizes the training loss (with respect to the parameters \(\theta\)), while the "outer" optimization minimizes the development loss (with respect to the hyperparameters \(\lambda\)),
\[
\lambda^* = \operatorname*{argmin}_{\lambda}\; \mathcal{L}_{\text{dev}}\Big( \operatorname*{argmin}_{\theta}\; \mathcal{L}_{\text{train}}(\theta, \lambda) \Big).
\]
Can we estimate \(\frac{\partial \mathcal{L}_{\text{dev}}}{\partial \lambda}\) so that we can run gradient-based optimization over \(\lambda\)?
Well, what does it mean to have an \(\textbf{argmin}\) inside a function?
Well, it means that there is a \(\theta^*\) that gets passed to \(\mathcal{L}_{\text{dev}}\). And \(\theta^*\) is a function of \(\lambda\), denoted \(\theta(\lambda)\). Furthermore, \(\textbf{argmin}\) must set the derivative of the inner objective to zero in order to be a local optimum of the inner function. So we can rephrase the problem as
\[
\lambda^* = \operatorname*{argmin}_{\lambda}\; \mathcal{L}_{\text{dev}}\big( \theta(\lambda) \big),
\]
where \(\theta(\lambda)\) is a solution to
\[
\frac{\partial \mathcal{L}_{\text{train}}(\theta, \lambda)}{\partial \theta} = 0.
\]
Now, how does \(\theta\) change as a result of an infinitesimal change to \(\lambda\)?
The constraint on the derivative implies a type of "equilibrium": the inner optimization process will continue to optimize regardless of how we change \(\lambda\). As long as we don't change \(\lambda\) too much, the inner optimization shouldn't change \(\theta\) too much, and it will change in a predictable way.
To do this, we'll appeal to the implicit function theorem. Let's look at the general case to simplify notation. Suppose \(x\) and \(y\) are related through a function \(g\) as follows:
\[
g(x, y) = 0.
\]
Assuming \(g\) is a smooth function in \(x\) and \(y\), we can perturb either argument: say \(x\) by a small amount \(\Delta_x\) and \(y\) by \(\Delta_y\). Because the perturbed system must preserve the constraint, i.e.,
\[
g(x + \Delta_x,\; y + \Delta_y) = 0,
\]
we can solve for the change in \(x\) as a result of an infinitesimal change in \(y\). We take the first-order expansion,
\[
g(x, y) + \Delta_x \frac{\partial g}{\partial x} + \Delta_y \frac{\partial g}{\partial y} \approx 0.
\]
Since \(g(x, y)\) is already zero,
\[
\Delta_x \frac{\partial g}{\partial x} + \Delta_y \frac{\partial g}{\partial y} \approx 0.
\]
Next, we solve for \(\frac{\Delta_x}{\Delta_y}\):
\[
\frac{\Delta_x}{\Delta_y} = -\left( \frac{\partial g}{\partial x} \right)^{-1} \frac{\partial g}{\partial y}.
\]
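As a quick sanity check on that formula, here's a tiny numerical sketch (the function names are mine, not part of any library) using the unit circle \(g(x, y) = x^2 + y^2 - 1 = 0\), where the implicit derivative \(-\left(\frac{\partial g}{\partial x}\right)^{-1}\frac{\partial g}{\partial y} = -y/x\) can be compared against a finite difference of the explicit solution \(x(y) = \sqrt{1 - y^2}\):

```python
import numpy as np

# Constraint: g(x, y) = x^2 + y^2 - 1 = 0 (upper half of the unit circle).
def g(x, y):
    return x**2 + y**2 - 1.0

# Implicit function theorem: dx/dy = -(dg/dx)^{-1} (dg/dy) = -(2x)^{-1}(2y) = -y/x.
def dx_dy_implicit(x, y):
    return -y / x

y = 0.3
x = np.sqrt(1 - y**2)          # explicit solution x(y) on the upper half-circle

# Centered finite difference of the explicit solution, for comparison.
eps = 1e-6
fd = (np.sqrt(1 - (y + eps)**2) - np.sqrt(1 - (y - eps)**2)) / (2 * eps)

print(dx_dy_implicit(x, y), fd)   # both ≈ -0.3145
```

The two estimates agree to high precision, which is exactly what the theorem promises: we recovered \(\frac{dx}{dy}\) without ever differentiating the explicit solution.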
Back to the original problem: now we can use the implicit function theorem to estimate how \(\theta\) varies in \(\lambda\) by plugging in \(g \mapsto \frac{\partial \mathcal{L}_{\text{train}}}{\partial \theta}\), \(x \mapsto \theta\), and \(y \mapsto \lambda\):
\[
\frac{\partial \theta}{\partial \lambda^\top} = -\left( \frac{\partial^2 \mathcal{L}_{\text{train}}}{\partial \theta\, \partial \theta^\top} \right)^{-1} \frac{\partial^2 \mathcal{L}_{\text{train}}}{\partial \theta\, \partial \lambda^\top}.
\]
This tells us how \(\theta\) changes with respect to an infinitesimal change to \(\lambda\). Now we can apply the chain rule to get the gradient of the whole optimization problem wrt \(\lambda\):
\[
\frac{\partial \mathcal{L}_{\text{dev}}}{\partial \lambda}
= \frac{\partial \mathcal{L}_{\text{dev}}}{\partial \theta}^{\top} \frac{\partial \theta}{\partial \lambda^\top}
= -\frac{\partial \mathcal{L}_{\text{dev}}}{\partial \theta}^{\top} \left( \frac{\partial^2 \mathcal{L}_{\text{train}}}{\partial \theta\, \partial \theta^\top} \right)^{-1} \frac{\partial^2 \mathcal{L}_{\text{train}}}{\partial \theta\, \partial \lambda^\top}.
\]
Since we don't like (explicit) matrix inverses, we compute \( \left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right)^{-1} \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \lambda^\top}\) as the solution \(x\) to the linear system \(\left( \frac{ \partial^2 \mathcal{L}_{\text{train}} }{ \partial \theta\, \partial \theta^\top } \right) x = \frac{ \partial^2 \mathcal{L}_{\text{train}}}{ \partial \theta\, \partial \lambda^\top}\). When the Hessian is positive definite, the linear system can be solved with the conjugate gradient method, which conveniently only requires matrix-vector products, i.e., you never have to materialize the Hessian. (Apparently, matrix-free linear algebra is a thing.) In fact, you don't even have to implement the Hessian-vector and Jacobian-vector products, because they can be accurately and efficiently approximated with centered differences (see earlier post).
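To illustrate the centered-differences trick, here's a minimal sketch (the helper name `hvp` is mine): a Hessian-vector product \(H v\) is just a directional derivative of the gradient, so two gradient evaluations suffice,

```python
import numpy as np

def hvp(grad, theta, v, eps=1e-5):
    # Hessian-vector product via centered differences on the gradient:
    #   H v ≈ [∇L(θ + εv) − ∇L(θ − εv)] / (2ε)
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

# Toy check on L(θ) = ½ θᵀ M θ, whose Hessian is exactly M.
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad = lambda th: M @ th
theta = np.array([1.0, -1.0])
v = np.array([0.3, 0.7])

print(hvp(grad, theta, v), M @ v)   # both ≈ [0.95, 0.85]
```

For this quadratic the gradient is linear, so the centered difference is exact up to rounding; for general losses the approximation error is \(O(\varepsilon^2)\).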
At the end of the day, this is an easy algorithm to implement! However, the estimate of the gradient can be temperamental if the linear system is ill-conditioned.
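To back up the "easy to implement" claim, here is a toy end-to-end sketch under assumptions of my own choosing: the inner problem is ridge regression (so \(\theta(\lambda)\) has a closed form and \(\frac{\partial^2 \mathcal{L}_{\text{train}}}{\partial \theta\, \partial \lambda} = \theta\)), the outer loss is squared error on dev data, and the linear solve uses SciPy's conjugate gradient with a matrix-free `LinearOperator`. All names (`solve_inner`, `hypergradient`, etc.) are illustrative, not from any library.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20)   # training data
C = rng.normal(size=(10, 5)); d = rng.normal(size=10)   # dev data

def solve_inner(lam):
    # θ(λ) = argmin_θ ½‖Aθ − b‖² + (λ/2)‖θ‖²  (closed form for this toy problem)
    return np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

def L_dev(lam):
    r = C @ solve_inner(lam) - d
    return 0.5 * r @ r

def hypergradient(lam):
    theta = solve_inner(lam)
    g_dev = C.T @ (C @ theta - d)        # ∂L_dev/∂θ
    # Matrix-free Hessian-vector product: H v = AᵀA v + λ v (Hessian never materialized).
    H = LinearOperator((5, 5), matvec=lambda v: A.T @ (A @ v) + lam * v)
    v, _ = cg(H, g_dev)                  # solve H v = ∂L_dev/∂θ with conjugate gradient
    # For ridge regression, ∂²L_train/∂θ∂λ = θ; apply the chain rule with a minus sign.
    return -v @ theta

lam = 0.5
eps = 1e-5
fd = (L_dev(lam + eps) - L_dev(lam - eps)) / (2 * eps)
print(hypergradient(lam), fd)   # the two estimates should agree closely
```

The finite-difference check re-solves the inner problem at \(\lambda \pm \varepsilon\), whereas the implicit estimate needs only one inner solve plus one linear solve, which is the whole point of the method.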
In a later post, I'll describe more robust algorithms based on automatic differentiation through the inner optimization algorithm, which make fewer and less brittle assumptions about the inner optimization.
Further reading:

Efficient multiple hyperparameter learning for log-linear models

Gradient-based Hyperparameter Optimization through Reversible Learning

Hyperparameter optimization with approximate gradient (paper): This paper looks at the implicit differentiation approach where you have an approximate solution to the inner optimization problem. They are able to provide error bounds and convergence guarantees under some reasonable conditions.