
Another way to describe the k-means algorithm is to define its goal geometrically. If we require that every point be closer to its own cluster's centroid than to any other centroid, then the boundary between any two clusters is a plane (a line in two dimensions) that perpendicularly bisects the segment joining their centroids.

Computing a k-means clustering involves identifying a set of cluster centroids, which implicitly solves for these planes. There are many methods for finding a satisfactory set of centroids given a set of data. The simplest is to pick an initial set of centroid seeds randomly (assuming we know how many clusters we want) and to assign each point to its closest seed. After each assignment, we need to update the assigned centroid by adding in the coordinates of the new point (a simple calculation). Assigning all points to a set of successively updated centroids constitutes one iteration of the k-means algorithm.
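One iteration of this scheme can be sketched in Python. This is an illustration of the idea, not the applet's own code; the function and variable names are mine:

```python
import random

def one_iteration(points, k, seed=0):
    """One pass of the scheme described above: pick k random seeds,
    assign each point to its nearest seed, and fold each assigned
    point into its centroid with a running-mean update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random seeds
    counts = [0] * k
    labels = []
    for p in points:
        # assign to the closest current centroid (squared Euclidean distance)
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        labels.append(j)
        # update the assigned centroid by adding in the new point's coordinates
        counts[j] += 1
        for d in range(len(p)):
            centroids[j][d] += (p[d] - centroids[j][d]) / counts[j]
    return centroids, labels
```

The running-mean update is the "simple calculation" mentioned above: adding a point to a cluster of n members moves the centroid by (point - centroid) / (n + 1) in each coordinate.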

Each subsequent iteration re-assigns all the points, and iterations continue until no point is closer to another cluster's centroid than to the centroid of the cluster it already belongs to. Every time a point is re-assigned, its old centroid must be downdated (the point's contribution removed) and its new centroid must be updated.
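The update/downdate arithmetic is just a pair of running-mean adjustments. A minimal sketch (my own helper names, not the applet's code):

```python
def update(centroid, count, point):
    """Fold a point into a centroid's running mean."""
    count += 1
    return [c + (p - c) / count for c, p in zip(centroid, point)], count

def downdate(centroid, count, point):
    """Remove a point's contribution from a centroid's running mean.
    Assumes the cluster keeps at least one member (count > 1)."""
    count -= 1
    return [c - (p - c) / count for c, p in zip(centroid, point)], count
```

Re-assigning a point is then a downdate on its old centroid followed by an update on its new one; nothing need be recomputed from scratch.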

Starting with arbitrary random centroids is a relatively poor method, however. If there really are blobs of points, we do much better to begin with locations that are relatively close to the centers of these blobs. John Hartigan (1975) discusses this seeding problem, among many others, in his book Clustering Algorithms.

If we do not know how many clusters there are, we must estimate that number from the data as well. One approach is to split clusters recursively and accept a split only when it yields a large enough proportional reduction in error (PRE), measured on the within-cluster sums of squares.
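A PRE criterion of this kind can be sketched as follows. The applet's exact statistic isn't shown here, so take this as one plausible form based on the sums of squares and the cutoff mentioned later:

```python
def sse(points, centroid):
    """Within-cluster sum of squared distances to a centroid."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in points)

def mean(points):
    """Componentwise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def pre_for_split(cluster, left, right):
    """Proportional reduction in error from splitting `cluster` into
    `left` and `right`: (SSE_before - SSE_after) / SSE_before."""
    before = sse(cluster, mean(cluster))
    after = sse(left, mean(left)) + sse(right, mean(right))
    return (before - after) / before
```

A split would be accepted when this ratio exceeds the cutoff (.4 in the applet); splitting stops when no candidate split clears it.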

The applet above computes a k-means clustering on two variables, using the PRE method to determine the number of clusters present. Each cluster is colored differently in the display. There are three random data distributions controlled by the buttons at the top. The first is Uniform random, the second is bivariate Normal, and the third is a mixture of bivariate normals (Lumpy). Every time you press the Sample button, you get a new random sample from these distributions. The size of these samples is controlled by the buttons at the bottom. The GIGO button at the bottom right stands for "Garbage In, Garbage Out." More about the GIGO button later.

You should expect to see only one color for Uniform or Normal data (except occasionally for the smaller samples, where it is possible to get a blobby pattern by chance). And you should expect to see different colored clusters wherever they appear separated in the Lumpy display. In some cases you will see the straight cutting planes (lines) between clusters when overlapping blobs of points are cut apart. There will also be rare instances when clearly separated blobs are not identified. These are cases where the PRE statistic misses the cutoff (.4) by a small amount. There is always a tradeoff between false positives and false negatives, but we could improve this situation a bit by using more information than simple sums of squares.

The applet also illustrates that k-means is most suited for separating convex clusters (clusters in which any line passing through a cluster intersects its boundary only twice). If your data contain doughnut-shaped or wormy-shaped clusters, don't expect k-means to find them.

In general, k-means is a popular clustering method because it is simple to program and is easy to compute on large samples. Data mining programs incorporating k-means sometimes ignore the subtleties of the algorithm (updating, downdating, identifying an appropriate number of clusters, choosing suitable seeds, iterating enough times to converge). Ignoring these subtleties has its benefits, though: you will find nice clusters even when they don't exist in the data! To this end, I have given you a GIGO button to try. This button finds four clusters in anything. It uses random seeds, no updating/downdating, and iterates only once. For uniform random data, it produces a random Voronoi tessellation of the plane. For the lumpy data, it slices clusters as if a monkey were chopping beans.
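The GIGO procedure is easy to reproduce: random seeds, one assignment pass, no updating or downdating. A sketch of that behavior (my names, not the applet's code):

```python
import random

def gigo_kmeans(points, k=4, seed=0):
    """Four 'clusters' in anything: k random seeds and a single
    assignment pass, with no centroid updates and no further iterations."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)
    return [min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, seeds[i])))
            for p in points]

# On uniform random data this just labels each point by its nearest seed:
# a random Voronoi tessellation of the plane, not discovered structure.
noise = [(random.random(), random.random()) for _ in range(200)]
labels = gigo_kmeans(noise)
```

Every run on pure noise dutifully reports four colored "clusters" whose boundaries are nothing but the bisectors between the random seeds.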

The GIGO button in a 2D applet lets you see the consequences of a badly designed k-means algorithm, but you would not perceive this as easily in higher dimensions. We humans are disposed to see patterns everywhere, even in random data. Thus, the insidious aspect of a badly designed cluster program is that it unfailingly gives us what we want.