This post is intended to convince conditional random field (CRF) lovers that deep learning might not be as crazy as it seems. And maybe even convince some deep learning lovers that the graphical models might have interesting things to offer.
In the world of structured prediction, we are plagued by the hightreewidth problem  models with loopy factors are "bad" because exact inference is intractable. There are three common approaches for dealing with this problem:

Limit the expressiveness of the model (i.e., don't use to model you want)

Change the training objective

Approximate inference
Approximate inference is tricky. Things can easily go awry.
For example, structured perceptron training with loopy maxproduct BP instead of exact max product can diverge (Kulesza & Pereira, 2007). Another example: using approximate marginals from sumproduct loopy BP in place of the true marginals in the gradient of the loglikelihood. This results in a different nonconvex objective function. (Note: sometimes these loopy BP approximations work fine.)
It looks like using approximate inference during training changes the training objective.
So, here's a simple idea: learn a model which makes accurate predictions given the approximate inference algorithm that will be used at testtime. Furthermore, we should minimize empirical risk instead of loglikelihood because it is robust to model missspecification and approximate inference. In other words, make training conditions as close as possible to testtime conditions.
Now, as long as everything is differentiable, you can apply automatic differentiation (backprop) to train the endtoend system. This idea appears in a few publications, including a handful of papers by Justin Domke, and a few by Stoyanov & Eisner.
Unsuprisingly, it works pretty well.
I first saw this idea in Stoyanov & Eisner (2011). They use loopy belief propagation as their approximate inference algorithm. At the end of the day, their model is essentially a deep recurrent network, which came from unrolling inference in a graphical model. This idea really struck me because it's clearly right in the middle between graphical models and deep learning.
You can immediately imagine swapping in other approximate inference algorithms in place of loopy BP.
Deep learning approaches get a bad reputation because there are a lot of "tricks" to get nonconvex optimization to work and because model structures are more open ended. Unlike graphical models, deep learning models have more variation in model structures. Maybe being more open minded about model structures is a good thing. We seem to have hit a brick wall with likelihoodbased training. At the same time, maybe we can port over some of the good work on approximate inference as deep architectures.