2007-11-06, 3:30pm in Wean 5409
Projection Pursuit, Gaussian Scale Mixtures, and the EM Algorithm
Sanjoy Dasgupta
University of California – San Diego
The EM algorithm for fitting Gaussian mixture models is one of the most widely-used clustering methods. Yet, surprisingly little is known about its behavior. Under what conditions does it converge to a good local optimum? What are good ways to initialize it, and to merge/remove intermediate clusters? Such questions are difficult to answer with the mathematical tools that have traditionally been applied to EM.
I will describe an alternative way of analyzing EM: by a probabilistic analysis. This reveals, first of all, that many common methods of initializing EM produce highly suboptimal results even in ideal clustering scenarios. On the other hand, I’ll show that a particular variant of EM will provably recover a near-optimal clustering, provided that the clusters are adequately separated and that their distributions are of a certain fairly general form.
The type of cluster distributions allowed is motivated by a new result in projection pursuit, along the lines of the (somewhat mistaken) folklore that “almost all projections of high-dimensional data look Gaussian”.
Specifically, I’ll show that for any D-dimensional distribution with finite second moments, there is a precise sense in which almost all of its linear projections into d < D dimensions look like a scale-mixture of spherical Gaussians (concentric spherical Gaussians with the same center).
The extent of this effect depends upon the ratio of d to D, and upon the “eccentricity” of the high-dimensional distribution.