Pretty much anything that covers the theory gets you there, such as Gelman et al.’s *Bayesian Data Analysis*. But even the Wikipedia entry for EM covers MAP estimation. And it has links to other papers with the theory. Or you can read Rubin et al.’s original paper, which isn’t that hard to follow.

I have since learned that log likelihood is guaranteed to converge, even if you have something like a separable logistic regression problem where the coefficients don’t converge. As the coefficients head off to positive (or negative) infinity [because the problem’s separable], the log likelihood converges.

It’s also important to keep in mind that if you’re doing something like a latent mixture model (the usual poster child app for EM), the latent parameters are not considered data — they get marginalized out. Again, the Wikipedia’s presentation is pretty good in the theory section here.

]]>Your reply brings up two very important points.

One is the Very Important Problem, which is that even if you converge, there’s no guarantee your local maximum is a global maximum (MLE or MAP). I was just trying to point out what it is you’re measuring for converging to a LOCAL optimum. [This can also be a problem for Bayesian sampling with Gibbs or other MCMC approaches — you need to get in the neighborhood of all the modes to cover the posterior well.]

I really like the discussion in David MacKay’s book, which points out very clearly that in the standard Gaussian mixture example, there is no maximum likelihood point, because likelihood is unbounded when there is a single element in a cluster and variance goes to zero.

The second point, that you only get convergence in likelihood, was news to me. I see where it’s a problem with non-identified problems, like perfectly correlated predictors for a regression problem. I’m now wondering if the coefficients will converge along with the likelihood at a local maximum in practice, but they just won’t do so stably. EM’s deterministic once you have inits and have started to climb.

Lots of people stop EM short to do a kind of poor-man’s regularization. For some reason, this always annoys my inner neat freak. I’d rather they fix the model then fit it rather than mixing half model, half heuristic.

I’m in the midst of learning Python and am about to roll out a Python-based annotation inference package. I implemented EM first because I’m working with Andrey Rzhetsky at U. Chicago on this and he wanted to see EM. It’s turning out to be unstable for ordinal annotation problems in that if two items are close on the ordinal scale and hence easily confusible, EM tends to drive prevalence in a winner-take-all direction. I’m hoping the Gibbs sampler I write next will be a bit more stable in spreading out the wealth.

]]>Your right, of course. With hidden variables the likelihood surface can have lots of local maxima, and different “tricks” work best on different problems.

I suspect most people think of the likelihood surface as several mountain peaks, but it can be much more complicated! (Or perhaps the right comment is that 10,000-dimensional mountain ranges are more complicated!). Many of our problems are non-identifiable (or close to non-identifiable) which means the likelihood surface contains ridges, plateaus and mesas — the EM theorems guarantee convergence in likelihood, not parameters, and for these problems you shouldn’t expect convergence in parameter space!

Another observation is that most people don’t run EM for enough iterations to get anything like convergence. In some experiments on EM for HMM POS tagging we showed that EM may need hundreds of iterations to approach convergence.

Best,

Mark

]]>