The Expectation-Maximization (EM) algorithm is widely used for maximum likelihood estimation when a model involves hidden (latent) variables. Examples include Gaussian mixture models, hidden Markov models, and incomplete-data problems where some variables are unobserved. A key reason EM is trusted in practice is its convergence property: each EM iteration does not decrease the observed-data log-likelihood. This article walks through the proof idea in clear steps, focusing on why the likelihood increases monotonically.
If you are learning this in data science classes in Pune, understanding the convergence argument helps you interpret EM’s behaviour beyond just “run E-step, run M-step.”
1) The Incomplete-Data Setup
Assume we observe data $X$, but there is an unobserved latent variable $Z$. The model is defined by a parameter vector $\theta$. The quantity we want to maximise is the observed-data log-likelihood:

$$\ell(\theta) = \log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta).$$

That summation (or integral, in continuous cases) is what makes direct optimisation hard. EM solves this by iteratively working with the complete-data likelihood $p(X, Z \mid \theta)$, which is usually easier to handle.
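To make the marginalisation concrete, here is a minimal sketch for a hypothetical two-component, unit-variance Gaussian mixture; the weights and means below are illustrative values, not anything fixed by the article:

```python
import math

# Illustrative two-component 1-D Gaussian mixture: Z takes values in {0, 1}.
# theta = (mixing weights, component means), with a shared unit variance.
weights = [0.4, 0.6]
means = [-1.0, 2.0]

def normal_pdf(x, mu, sigma=1.0):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def observed_loglik(xs):
    """log p(X | theta) = sum_i log sum_z p(x_i, z | theta)."""
    total = 0.0
    for x in xs:
        # Marginalise the latent component indicator Z out of the joint:
        # p(x | theta) = sum_z p(z) p(x | z, theta).
        total += math.log(sum(w * normal_pdf(x, m) for w, m in zip(weights, means)))
    return total

print(observed_loglik([-0.5, 1.8, 2.2]))
```

Note that the sum over $Z$ sits inside the `log`, which is exactly why the log-likelihood does not decompose into easy per-component terms.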
2) Lower Bounding the Log-Likelihood Using Jensen’s Inequality
EM’s proof relies on constructing a lower bound on $\ell(\theta)$. Start by introducing any distribution $q(Z)$ over the latent variable:

$$\ell(\theta) = \log \sum_Z q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)}.$$

Now apply Jensen’s inequality ($\log \mathbb{E}[Y] \ge \mathbb{E}[\log Y]$, which holds because $\log$ is concave):

$$\ell(\theta) \ge \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}.$$

Define the bound function:

$$\mathcal{F}(q, \theta) = \sum_Z q(Z) \log p(X, Z \mid \theta) - \sum_Z q(Z) \log q(Z).$$

This is often called the evidence lower bound (ELBO). The second term is the entropy of $q$, which does not depend on $\theta$ once $q$ is fixed.
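You can check the bound numerically. As a sketch, take a toy joint $p(x, z)$ over a binary latent variable for one fixed observation (the probabilities below are made up for illustration) and evaluate $\mathcal{F}(q, \theta)$ for several choices of $q$:

```python
import math

# Toy joint p(x, z) for a single fixed observation x and latent z in {0, 1}
# (illustrative numbers, not from the article).
p_xz = [0.12, 0.28]           # p(x, z=0), p(x, z=1)

loglik = math.log(sum(p_xz))  # l(theta) = log sum_z p(x, z)

def elbo(q):
    """F(q, theta) = sum_z q(z) log p(x, z) - sum_z q(z) log q(z)."""
    return sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, p_xz))

# Jensen's inequality: the bound holds for ANY distribution q(Z).
for q in ([0.5, 0.5], [0.1, 0.9], [0.3, 0.7]):
    assert elbo(q) <= loglik + 1e-12
    print(f"q={q}: F={elbo(q):.4f} <= l={loglik:.4f}")
```

Every choice of $q$ gives a valid lower bound; they differ only in how loose the bound is.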
In data science classes in Pune, this is a crucial step: EM is not directly maximising $\ell(\theta)$ at each iteration; it is maximising a lower bound that “touches” $\ell(\theta)$ at the right place.
3) The E-Step: Make the Bound Tight at the Current Parameters
At iteration $t$, EM has parameters $\theta^{(t)}$. The E-step chooses $q$ to tighten the bound as much as possible. The best choice is:

$$q^{(t)}(Z) = p(Z \mid X, \theta^{(t)}).$$

With this choice, the bound becomes exact at $\theta^{(t)}$. You can show this using the KL divergence:

$$\ell(\theta) = \mathcal{F}(q, \theta) + \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big),$$

where the KL divergence is always non-negative. Therefore, for a fixed $\theta$, $\mathcal{F}$ is maximised when $q(Z) = p(Z \mid X, \theta)$, which makes the KL term zero.
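The decomposition $\ell(\theta) = \mathcal{F}(q, \theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \theta))$ can also be verified numerically. A sketch on the same kind of toy joint (illustrative numbers, not from the article):

```python
import math

# Toy joint p(x, z) for one observation x and latent z in {0, 1}.
p_xz = [0.12, 0.28]
loglik = math.log(sum(p_xz))
posterior = [p / sum(p_xz) for p in p_xz]   # p(z | x, theta)

def elbo(q):
    """F(q, theta) = sum_z q(z) log p(x, z) - sum_z q(z) log q(z)."""
    return sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, p_xz))

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

# The identity l(theta) = F(q, theta) + KL(q || posterior) holds for every q,
# and KL vanishes exactly when q is the posterior (the E-step choice).
for q in ([0.5, 0.5], [0.2, 0.8], posterior):
    assert abs(loglik - (elbo(q) + kl(q, posterior))) < 1e-12
print("KL at the posterior:", kl(posterior, posterior))
```

Since KL is the only slack in the identity, setting $q$ to the posterior is precisely what makes the bound touch $\ell(\theta)$ at $\theta^{(t)}$.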
So after the E-step:
$$\ell(\theta^{(t)}) = \mathcal{F}(q^{(t)}, \theta^{(t)}).$$

4) The M-Step: Increase the Bound, Which Increases the Likelihood
Now keep $q^{(t)}$ fixed and choose new parameters to maximise the bound:

$$\theta^{(t+1)} = \arg\max_\theta \mathcal{F}(q^{(t)}, \theta).$$

Because $\theta^{(t+1)}$ maximises $\mathcal{F}$ with $q^{(t)}$ fixed, we have:

$$\mathcal{F}(q^{(t)}, \theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t)}).$$

Now connect the bound back to the likelihood. Since

$$\ell(\theta) \ge \mathcal{F}(q^{(t)}, \theta) \quad \text{for any } \theta,$$

we get:

$$\ell(\theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t+1)}).$$

Chain these inequalities (the final equality is the E-step tightness from Section 3):

$$\ell(\theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t+1)}) \ge \mathcal{F}(q^{(t)}, \theta^{(t)}) = \ell(\theta^{(t)}).$$

This proves monotonic non-decrease of the observed-data log-likelihood:

$$\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)}).$$

That is the core EM convergence guarantee taught in strong data science classes in Pune: every iteration improves (or at least does not worsen) the fit in terms of likelihood.
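You can watch the guarantee hold in practice. Below is a minimal end-to-end sketch of EM for a two-component, fixed-unit-variance Gaussian mixture on synthetic data; the data-generating means, the initial guess, and the iteration count are all illustrative assumptions:

```python
import math
import random

random.seed(0)
# Synthetic 1-D data from two unit-variance Gaussians (illustrative setup).
data = [random.gauss(-2.0, 1.0) for _ in range(150)] + \
       [random.gauss(3.0, 1.0) for _ in range(150)]

def npdf(x, mu):
    """Unit-variance normal density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def loglik(w, mu):
    """Observed-data log-likelihood l(theta) for weights w and means mu."""
    return sum(math.log(w[0] * npdf(x, mu[0]) + w[1] * npdf(x, mu[1]))
               for x in data)

w, mu = [0.5, 0.5], [0.0, 1.0]          # deliberately poor initial guess
history = [loglik(w, mu)]
for _ in range(30):
    # E-step: responsibilities r_k(x) = p(z = k | x, theta^(t)).
    resp = []
    for x in data:
        a = [w[k] * npdf(x, mu[k]) for k in (0, 1)]
        s = a[0] + a[1]
        resp.append([a[0] / s, a[1] / s])
    # M-step: closed-form updates that maximise F(q^(t), theta).
    n_k = [sum(r[k] for r in resp) for k in (0, 1)]
    w = [n_k[k] / len(data) for k in (0, 1)]
    mu = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k] for k in (0, 1)]
    history.append(loglik(w, mu))

# Monotonicity: no iteration decreases the observed-data log-likelihood.
assert all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
print(f"log-likelihood: {history[0]:.2f} -> {history[-1]:.2f}")
```

Plotting (or simply printing) `history` for any starting point should give a non-decreasing curve, which is exactly the chained inequality above made visible.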
Conclusion
The EM algorithm’s monotonic likelihood increase comes from a simple but powerful strategy: build a lower bound on the log-likelihood, tighten the bound with the E-step using the posterior over latent variables, then raise the bound in the M-step by optimising parameters. While EM does not guarantee reaching the global maximum (it may converge to a local optimum or saddle point), it reliably improves the likelihood each iteration. For learners in data science classes in Pune, mastering this proof builds real confidence in when and why EM works—and how to diagnose its behaviour in practical modelling tasks.
