The Expectation-Maximization (EM) algorithm is widely used for maximum likelihood estimation when a model involves hidden (latent) variables. Examples include Gaussian mixture models, hidden Markov models, and incomplete-data problems where some variables are unobserved. A key reason EM is trusted in practice is its convergence property: each EM iteration does not decrease the observed-data log-likelihood. This article walks through the proof idea in clear steps, focusing on why the likelihood increases monotonically.
If you are learning this in data science classes in Pune, understanding the convergence argument helps you interpret EM’s behaviour beyond just “run E-step, run M-step.”
1) The Incomplete-Data Setup
Assume we observe data $X$, but there is an unobserved latent variable $Z$. The model is defined by a parameter vector $\theta$. The quantity we want to maximise is the observed-data log-likelihood:

$$\ell(\theta) = \log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta).$$

That summation (or integral, in continuous cases) is what makes direct optimisation hard. EM solves this by iteratively working with the complete-data likelihood $p(X, Z \mid \theta)$, which is usually easier to handle.
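To make the marginalisation concrete, here is a minimal sketch for a hypothetical two-component, unit-variance Gaussian mixture; the weights and means below are illustrative values, not anything fixed by the article:

```python
import math

# Illustrative two-component 1-D Gaussian mixture: Z takes values in {0, 1}.
# theta = (mixing weights, component means), with a shared unit variance.
weights = [0.4, 0.6]
means = [-1.0, 2.0]

def normal_pdf(x, mu, sigma=1.0):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def observed_loglik(xs):
    """log p(X | theta) = sum_i log sum_z p(x_i, z | theta)."""
    total = 0.0
    for x in xs:
        # Marginalise the latent component indicator Z out of the joint:
        # p(x | theta) = sum_z p(z) p(x | z, theta).
        total += math.log(sum(w * normal_pdf(x, m) for w, m in zip(weights, means)))
    return total

print(observed_loglik([-0.5, 1.8, 2.2]))
```

Note that the sum over $Z$ sits inside the `log`, which is exactly why the log-likelihood does not decompose into easy per-component terms.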
2) Lower Bounding the Log-Likelihood Using Jensen’s Inequality
EM’s proof relies on constructing a lower bound on $\ell(\theta)$. Start by introducing any distribution $q(Z)$ over the latent variable:

$$\ell(\theta) = \log \sum_Z q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)}.$$

Now apply Jensen’s inequality ($\log \mathbb{E}[Y] \ge \mathbb{E}[\log Y]$, which holds because $\log$ is concave):

$$\ell(\theta) \ge \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}.$$

Define the bound function:

$$\mathcal{F}(q, \theta) = \sum_Z q(Z) \log p(X, Z \mid \theta) - \sum_Z q(Z) \log q(Z).$$

This is often called the evidence lower bound (ELBO). The second term is the entropy of $q$, which does not depend on $\theta$ once $q$ is fixed.
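You can check the bound numerically. As a sketch, take a toy joint $p(x, z)$ over a binary latent variable for one fixed observation (the probabilities below are made up for illustration) and evaluate $\mathcal{F}(q, \theta)$ for several choices of $q$:

```python
import math

# Toy joint p(x, z) for a single fixed observation x and latent z in {0, 1}
# (illustrative numbers, not from the article).
p_xz = [0.12, 0.28]           # p(x, z=0), p(x, z=1)

loglik = math.log(sum(p_xz))  # l(theta) = log sum_z p(x, z)

def elbo(q):
    """F(q, theta) = sum_z q(z) log p(x, z) - sum_z q(z) log q(z)."""
    return sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, p_xz))

# Jensen's inequality: the bound holds for ANY distribution q(Z).
for q in ([0.5, 0.5], [0.1, 0.9], [0.3, 0.7]):
    assert elbo(q) <= loglik + 1e-12
    print(f"q={q}: F={elbo(q):.4f} <= l={loglik:.4f}")
```

Every choice of $q$ gives a valid lower bound; they differ only in how loose the bound is.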
In data science classes in Pune, this is a crucial step: EM is not directly maximising $\ell(\theta)$ at each iteration; it is maximising a lower bound that “touches” $\ell(\theta)$ at the right place.
3) The E-Step: Make the Bound Tight at the Current Parameters
At iteration $t$, EM has parameters $\theta^{(t)}$. The E-step chooses $q$ to tighten the bound as much as possible. The best choice is:

$$q^{(t)}(Z) = p(Z \mid X, \theta^{(t)}).$$

With this choice, the bound becomes exact at $\theta^{(t)}$. You can show this using the KL divergence:

$$\ell(\theta) = \mathcal{F}(q, \theta) + \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big),$$

where the KL divergence is always non-negative. Therefore, for a fixed $\theta$, $\mathcal{F}$ is maximised when $q(Z) = p(Z \mid X, \theta)$, which makes the KL term zero.
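The decomposition $\ell(\theta) = \mathcal{F}(q, \theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \theta))$ can also be verified numerically. A sketch on the same kind of toy joint (illustrative numbers, not from the article):

```python
import math

# Toy joint p(x, z) for one observation x and latent z in {0, 1}.
p_xz = [0.12, 0.28]
loglik = math.log(sum(p_xz))
posterior = [p / sum(p_xz) for p in p_xz]   # p(z | x, theta)

def elbo(q):
    """F(q, theta) = sum_z q(z) log p(x, z) - sum_z q(z) log q(z)."""
    return sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, p_xz))

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

# The identity l(theta) = F(q, theta) + KL(q || posterior) holds for every q,
# and KL vanishes exactly when q is the posterior (the E-step choice).
for q in ([0.5, 0.5], [0.2, 0.8], posterior):
    assert abs(loglik - (elbo(q) + kl(q, posterior))) < 1e-12
print("KL at the posterior:", kl(posterior, posterior))
```

Since KL is the only slack in the identity, setting $q$ to the posterior is precisely what makes the bound touch $\ell(\theta)$ at $\theta^{(t)}$.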
So after the E-step:
$$\ell(\theta^{(t)}) = \mathcal{F}(q^{(t)}, \theta^{(t)}).$$

4) The M-Step: Increase the Bound, Which Increases the Likelihood
Now keep $q^{(t)}$ fixed and choose new parameters to maximise the bound:

$$\theta^{(t+1)} = \arg\max_\theta \mathcal{F}(q^{(t)}, \theta).$$

Because $\theta^{(t+1)}$ maximises $\mathcal{F}$ with $q^{(t)}$ fixed, we have:

$$\mathcal{F}(q^{(t)}, \theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t)}).$$

Now connect the bound back to the likelihood. Since

$$\ell(\theta) \ge \mathcal{F}(q^{(t)}, \theta) \quad \text{for any } \theta,$$

we get:

$$\ell(\theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t+1)}).$$

Chain these inequalities (the final equality is the E-step tightness from Section 3):

$$\ell(\theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t)}) = \ell(\theta^{(t)}).$$

This proves monotonic non-decrease of the observed-data log-likelihood:

$$\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)}).$$

That is the core EM convergence guarantee taught in strong data science classes in Pune: every iteration improves (or at least does not worsen) the fit in terms of likelihood.
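You can watch the guarantee hold in practice. Below is a minimal end-to-end sketch of EM for a two-component, fixed-unit-variance Gaussian mixture on synthetic data; the data-generating means, the initial guess, and the iteration count are all illustrative assumptions:

```python
import math
import random

random.seed(0)
# Synthetic 1-D data from two unit-variance Gaussians (illustrative setup).
data = [random.gauss(-2.0, 1.0) for _ in range(150)] + \
       [random.gauss(3.0, 1.0) for _ in range(150)]

def npdf(x, mu):
    """Unit-variance normal density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def loglik(w, mu):
    """Observed-data log-likelihood l(theta) for weights w and means mu."""
    return sum(math.log(w[0] * npdf(x, mu[0]) + w[1] * npdf(x, mu[1]))
               for x in data)

w, mu = [0.5, 0.5], [0.0, 1.0]          # deliberately poor initial guess
history = [loglik(w, mu)]
for _ in range(30):
    # E-step: responsibilities r_k(x) = p(z = k | x, theta^(t)).
    resp = []
    for x in data:
        a = [w[k] * npdf(x, mu[k]) for k in (0, 1)]
        s = a[0] + a[1]
        resp.append([a[0] / s, a[1] / s])
    # M-step: closed-form updates that maximise F(q^(t), theta).
    n_k = [sum(r[k] for r in resp) for k in (0, 1)]
    w = [n_k[k] / len(data) for k in (0, 1)]
    mu = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k] for k in (0, 1)]
    history.append(loglik(w, mu))

# Monotonicity: no iteration decreases the observed-data log-likelihood.
assert all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
print(f"log-likelihood: {history[0]:.2f} -> {history[-1]:.2f}")
```

Plotting (or simply printing) `history` for any starting point should give a non-decreasing curve, which is exactly the chained inequality above made visible.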
Conclusion
The EM algorithm’s monotonic likelihood increase comes from a simple but powerful strategy: build a lower bound on the log-likelihood, tighten the bound with the E-step using the posterior over latent variables, then raise the bound in the M-step by optimising parameters. While EM does not guarantee reaching the global maximum (it may converge to a local optimum or saddle point), it reliably improves the likelihood each iteration. For learners in data science classes in Pune, mastering this proof builds real confidence in when and why EM works—and how to diagnose its behaviour in practical modelling tasks.
