Importance sampling is a powerful and pervasive technique in statistics, machine learning and randomized algorithms.
Basics
Importance sampling is a technique for estimating the expectation \(\mu\) of a random variable \(f(x)\) under distribution \(p\) from samples of a different distribution \(q.\)
The key observation is that \(\mu\) is can expressed as the expectation of a different random variable \(f^*(x)=\frac{p(x)}{q(x)}\! \cdot\! f(x)\) under \(q.\)
Technical condition: \(q\) must have support everywhere \(p\) does, \(f(x) p(x) > 0
\Rightarrow q(x) > 0.\) Without this condition, the equation is biased! Note: \(q\)
can support things that \(p\) doesn't.
Terminology: The quantity \(w(x) = \frac{p(x)}{q(x)}\) is often referred to as the "importance weight" or "importance correction". We often refer to \(p\) as the target density and \(q\) the proposal density.
Now, given samples \(\{ x^{(i)} \}_{i=1}^{n}\) from \(q,\) we can use the Monte Carlo estimate, \(\hat{\mu} \approx \frac{1}{n} \sum_{i=1}^n f^{*}(x^{(i)}),\) as an unbiased estimator of \(\mu.\)
Remarks
There are a few reasons we might want use importance sampling:

Convenience: It might be trickier to sample directly from \(p.\)

Biascorrection: Suppose, we're developing an algorithm which requires samples to satisfy some "safety" condition (e.g., a minimum support threshold) and be unbiased. Importance sampling can be used to remove bias, while satisfying the condition.

Variance reduction: It might be the case that sampling directly from \(p\) would require more samples to estimate \(\mu.\) Check out these great notes for more.

Offpolicy evaluation and learning: We might want to collect some "exploratory data" from \(q\) and evaluate different "policies", \(p\) (e.g., to pick the best one). Here's a link to a future post on offpolicy evaluation and counterfactual reasoning and some cool papers: counterfactual reasoning, reinforcement learning, contextual bandits, domain adaptation.
There are a few common cases for \(q\) worth separate consideration:

Control over \(q\): This is the case in experimental design, variance reduction, active learning and reinforcement learning. It's often difficult to design \(q,\) which results in an estimator with "reasonable" variance. A very difficult case is in offpolicy evaluation because it (essentially) requires a good exploratory distribution for every possible policy. (I have much more to say on this topic.)

Little to no control over \(q\): For example, you're given some dataset (e.g., new articles) and you want to estimate performance on a different dataset (e.g., Twitter).

Unknown \(q\): In this case, we want to estimate \(q\) (typically referred to as the propensity score) and use it in the importance sampling estimator. This technique, as far as I can tell, is widely used to remove selection bias when estimating effects of different treatments.
Drawbacks: The main drawback of importance sampling is variance. A few bad samples with large weights can drastically throw off the estimator. Thus, it's often the case that a biased estimator is preferred, e.g., estimating the partition function, clipping weights, indirect importance sampling. A secondary drawback is that both densities must be normalized, which is often intractable.
What's next? I plan to cover "variance reduction" and offpolicy evaluation in more detail in future posts.