k-Means Clustering

The k-means clustering algorithm classifies n points into k clusters by assigning each point to the cluster whose average value on a set of p variables is nearest to it by some distance measure (usually Euclidean) on that set. The algorithm computes these assignments iteratively, until reassigning points and recomputing averages (over all points in a cluster) produces no changes.

Another way to describe the k-means algorithm is to define its goal geometrically. If n points are embedded in a p-dimensional space, then k clusters are summarized by their respective centroids (average of the cluster members' coordinates) in that space. When k-means has finished iterating, the result partitions the set of points in this space using cutting planes (lines in 2D). Each cutting plane is positioned orthogonal to the line connecting a pair of cluster centroids so that it bisects that line. For k clusters, there are k(k-1)/2 cutting planes.

Computing a k-means clustering involves identifying a set of cluster centroids, which implicitly solves for these planes. There are many methods for finding a satisfactory set of centroids given a set of data. The simplest is to pick an initial set of centroid seeds randomly (assuming we know how many clusters we want) and to assign each point to its closest seed. After each assignment, we need to update the assigned centroid by adding in the coordinates of the new point (a simple calculation). Assigning all points to a set of successively updated centroids constitutes one iteration of the k-means algorithm.

Each new iteration consists of a re-assignment of all points, until no point can be moved to a centroid closer than the one for the cluster it is already a member of. Every time a point is re-assigned, its old centroid must be downdated and its new centroid must be updated.

Starting with arbitrary random centroids is a relatively poor method, however. If there really are blobs of points, we do much better to begin with locations that are relatively close to the center of these blobs. John Hartigan (Clustering Algorithms, Wiley, 1975) suggests a stagewise method to approach this goal. We begin with two centroids. These are computed by splitting all points on the variable (dimension) with greatest range (or some other measure of spread). The split separates points above the mean (or some other measure of location) on this variable from those below the mean. Centroids are then computed for each group by averaging coordinates of its members. Then we do an entire k-means cycle (assignments plus iterations). After convergence, we move to computing three centroids. Again, we split the cluster having the largest range on a variable into two clusters (one above and one below the mean on this variable). New centroids are computed and another k-means cycle is performed. We continue this process until we reach the desired number (k)of centroids.

If we do not know k in advance, how can we choose a value based on our data? There is a circularity here: we need to know k to find clusters and we need to identify clusters to determine k. Hartigan's procedure gives us a lever. At each stage of Hartigan's method, we compute the sum-of-squares within groups over all variables (sum of squared deviations of each point from its centroid on every dimension). This sum of squares should decline as we add new clusters; indeed, it would be zero if we made every point a cluster. So we look for the reduction in sum of squares at each step and stop adding clusters when this reduction is negligible. Hartigan gives an approximate F statistic that can be used to test the "significance" of this reduction, but a simple method that works well for most datasets is to look for a proportional reduction in error (PRE) of about .4 or better to justify a split. PRE is the ratio of reduction in sum of squares to the previous sum of squares.

The applet above computes a k-means clustering on two variables, using the PRE method to determine the number of clusters present. Each cluster is colored differently in the display. There are three random data distributions controlled by the buttons at the top. The first is Uniform random, the second is bivariate Normal, and the third is a mixture of bivariate normals (Lumpy). Every time you press the Sample button, you get a new random sample from these distributions. The size of these samples is controlled by the buttons at the bottom. The GIGO button at the bottom right stands for "Garbage In, Garbage Out." More about the GIGO button later.

You should expect to see only one color for Uniform or Normal data (except occasionally for the smaller samples, where it is possible to get a blobby pattern by chance). And you should expect to see different colored clusters wherever they appear separated in the Lumpy display. In some cases you will see the straight cutting planes (lines) between clusters when overlapping blobs of points are cut apart. There will also be rare instances when clearly separated blobs are not identified. These are cases where the PRE statistic misses the cutoff (.4) by a small amount. There is always a tradeoff between false positives and false negatives, but we could improve this situation a bit by using more information than simple sums of squares.

The applet also illustrates that k-means is most suited for separating convex clusters (clusters in which any line passing through a cluster intersects its boundary only twice). If your data contain doughnut-shaped or wormy-shaped clusters, don't expect k-means to find them.

In general, k-means is a popular clustering method because it is simple to program and is easy to compute on large samples. Data mining programs incorporating k-means sometimes ignore the subtleties of the algorithm (updating, downdating, identifying an appropriate number of clusters, choosing suitable seeds, iterating enough times to converge). Ignoring these subtleties has its benefits, though: you will find nice clusters even when they don't exist in the data! To this end, I have given you a GIGO button to try. This button finds four clusters in anything. It uses random seeds, no updating/downdating, and iterates only once. For uniform random data, it produces a random Voronoi tessellation of the plane. For the lumpy data, it slices clusters as if a monkey were chopping beans.

The GIGO button in a 2D applet lets you see the consequences of a badly designed k-means algorithm, but you would not perceive this as easily in higher dimensions. We humans are disposed to see patterns everywhere, even in random data. Thus, the insidious aspect of a badly designed cluster program is that it unfailingly gives us what we want.