Graduate Descent

Expected Value of a Quadratic and the Delta Method


statistics probability

Expected value of a quadratic: Suppose we'd like to compute the expectation of a quadratic function, i.e., $\mathbb{E}\left[ x^{\top}\negthinspace\negthinspace A x \right]$, where $x$ is a random vector and $A$ is a deterministic symmetric matrix. Let $\mu$ and $\Sigma$ be the mean and covariance matrix of $x$. It turns out the expected value of a quadratic has the following simple form:

$$ \mathbb{E}\left[ x^{\top}\negthinspace\negthinspace A x \right] = \text{trace}\left( A \Sigma \right) + \mu^{\top}\negthinspace A \mu $$
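A quick numerical sanity check of this identity can be sketched with NumPy. The matrices, the Gaussian sampling distribution, and the helper name `expected_quadratic` below are illustrative assumptions, not part of the derivation; the identity itself only requires that $x$ has mean $\mu$ and covariance $\Sigma$.

```python
import numpy as np

def expected_quadratic(A, mu, Sigma):
    """Closed form: E[x^T A x] = trace(A Sigma) + mu^T A mu."""
    return np.trace(A @ Sigma) + mu @ A @ mu

# Illustrative (assumed) parameters: A symmetric, Sigma positive definite.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, -1.0],
              [0.0, -1.0, 1.0]])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
mu = np.array([1.0, -2.0, 0.5])

# Monte Carlo estimate. A Gaussian is used only for convenience;
# any distribution with this mean and covariance would do.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=200_000)
mc_estimate = np.mean(np.einsum("ni,ij,nj->n", x, A, x))

closed_form = expected_quadratic(A, mu, Sigma)
print(f"closed form: {closed_form:.4f}   Monte Carlo: {mc_estimate:.4f}")
```

The two numbers should agree up to Monte Carlo noise. Applied to the empirical mean and the (biased) empirical covariance of the samples, the closed form reproduces the sample average exactly, which is a handy way to unit-test an implementation.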

Delta Method: Suppose we'd like to compute the expected value of a nonlinear function $f$ applied to our random variable $x$, $\mathbb{E}\left[ f(x) \right]$. The Delta method approximates this expectation by replacing $f$ with its second-order Taylor approximation $\hat{f_{a}}$ taken at some point $a$, where $H(a)$ denotes the Hessian of $f$ at $a$:

$$ \hat{f_{a}}(x) = f(a) + \nabla f(a)^{\top} (x - a) + \frac{1}{2} (x - a)^\top H(a) (x - a) $$

This Taylor approximation is a quadratic function of $x$, so let's try to apply our new equation for the expected value of a quadratic. We can use the trick from above with $A = H(a)$ and the shifted random vector $x - a$ in place of $x$. The shifted vector has mean $\mu - a$ and, because the covariance matrix is shift-invariant, covariance $\Sigma$. Note also that the Hessian is a symmetric matrix!

$$ \begin{aligned} \mathbb{E}\left[ \hat{f_{a}}(x) \right] & = \mathbb{E} \left[ f(a) + \nabla\negthinspace f(a)^{\top} (x - a) + \frac{1}{2} (x - a)^{\top} H(a)\, (x - a) \right] \\ & = f(a) + \nabla\negthinspace f(a)^{\top} ( \mu - a ) + \frac{1}{2} \mathbb{E} \left[ (x - a)^{\top} H(a)\, (x - a) \right] \\ & = f(a) + \nabla\negthinspace f(a)^{\top} ( \mu - a ) + \frac{1}{2}\left( \text{trace}\left( H(a) \, \Sigma \right) + (\mu - a)^{\top} H(a)\, (\mu - a) \right) \end{aligned} $$
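One way to check the general formula is to apply it to a function that is itself quadratic: the second-order Taylor expansion is then exact at every expansion point, so the result should not depend on $a$. The sketch below assumes hand-written gradient and Hessian callables and a hypothetical helper name `delta_method_mean_at`; the test function and parameters are made up for illustration.

```python
import numpy as np

def delta_method_mean_at(f, grad, hess, a, mu, Sigma):
    """E[f_hat_a(x)] = f(a) + grad(a)^T (mu - a)
                       + 1/2 * (trace(H(a) Sigma) + (mu - a)^T H(a) (mu - a))."""
    d = mu - a
    H = hess(a)
    return f(a) + grad(a) @ d + 0.5 * (np.trace(H @ Sigma) + d @ H @ d)

# Assumed test function: f(x) = x^T A x + b^T x, with A symmetric.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -1.0])
f = lambda x: x @ A @ x + b @ x
grad = lambda x: 2 * A @ x + b   # gradient of f
hess = lambda x: 2 * A           # Hessian of f (constant)

mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# Exact expectation from the quadratic identity plus linearity.
exact = np.trace(A @ Sigma) + mu @ A @ mu + b @ mu

# Three different expansion points; all should reproduce `exact`.
at_mu = delta_method_mean_at(f, grad, hess, mu, mu, Sigma)
at_zero = delta_method_mean_at(f, grad, hess, np.zeros(2), mu, Sigma)
at_other = delta_method_mean_at(f, grad, hess, np.array([3.0, 0.5]), mu, Sigma)
print(exact, at_mu, at_zero, at_other)
```

For a non-quadratic $f$ the answer does depend on $a$, since the Taylor approximation is only accurate near the expansion point.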

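Taking the Taylor expansion around the mean, $a = \mu$, simplifies the equation: since $\mathbb{E}\left[ x - \mu \right] = 0$, both the first-order term and the quadratic term in $\mu - a$ vanish, leaving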

$$ \begin{aligned} \mathbb{E}\left[ \hat{f_{\mu}} (x) \right] &= \mathbb{E}\left[ f(\mu) + \nabla\negthinspace f(\mu)^{\top} (x - \mu) + \frac{1}{2} (x - \mu)^{\top} H(\mu)\, (x - \mu) \right] \\ &= f(\mu) + \frac{1}{2} \, \text{trace}\Big( H(\mu) \, \Sigma \Big) \end{aligned} $$

That looks much more tractable! Error bounds are possible to derive, but are outside the scope of this post. For a nice use of the delta method in machine learning, see (Wager+,'13) and (Smith & Eisner,'06).
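To get a feel for the accuracy of the mean-centered approximation, here is a minimal sketch on a case where the exact answer is known: $f(x) = \exp(w^{\top} x)$ with Gaussian $x$, for which $\mathbb{E}[f(x)] = \exp(w^{\top}\mu + \frac{1}{2} w^{\top} \Sigma w)$. The choice of $f$, the vector `w`, the parameter values, and the helper name `delta_method_mean` are assumptions made for illustration.

```python
import numpy as np

def delta_method_mean(f, hess, mu, Sigma):
    """Second-order delta-method approximation of E[f(x)], expanded at mu:
    f(mu) + 1/2 * trace(H(mu) Sigma)."""
    return f(mu) + 0.5 * np.trace(hess(mu) @ Sigma)

# Assumed example: f(x) = exp(w^T x), with Hessian exp(w^T x) * w w^T.
w = np.array([0.3, -0.2])
f = lambda x: np.exp(w @ x)
hess = lambda x: np.exp(w @ x) * np.outer(w, w)

mu = np.array([0.5, 1.0])
Sigma = np.array([[0.5, 0.1],
                  [0.1, 0.3]])

approx = delta_method_mean(f, hess, mu, Sigma)
exact = np.exp(w @ mu + 0.5 * w @ Sigma @ w)  # Gaussian moment generating function
plug_in = f(mu)                               # naive estimate that ignores the variance

print(f"exact:   {exact:.6f}")
print(f"delta:   {approx:.6f}")
print(f"plug-in: {plug_in:.6f}")
```

With these values the trace correction recovers most of the gap between the plug-in estimate $f(\mu)$ and the exact expectation. The approximation degrades as $\Sigma$ grows or as $f$ becomes less well described by a quadratic near $\mu$.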