K-MEANS
The objective of k-means clustering is to partition a dataset of n points into k groups, where each cluster i is defined by a prototype point c_i, which is also its centroid, and every point is assigned to the cluster with the closest prototype. In 2-D space, the result of a k-means clustering can be represented by a centroidal Voronoi tessellation, an example of which is shown in Figure 1 below.
Figure 1: Example of Centroidal Voronoi Tessellation
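To make the objective concrete, the following is a minimal NumPy sketch of the quantity k-means minimizes: the sum of squared distances from each point to its nearest prototype. The array names `points` and `centroids` and the function name `kmeans_objective` are illustrative, not part of any particular library.

```python
import numpy as np

def kmeans_objective(points, centroids):
    """Sum of squared Euclidean distances from each point to its
    nearest prototype -- the quantity k-means tries to minimize."""
    # Pairwise squared distances, shape (n_points, k)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point is charged the distance to its closest prototype
    return d2.min(axis=1).sum()
```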
The k-means clustering problem is NP-hard in Euclidean space, but many heuristic algorithms are available. The advantages of k-means are that these heuristic methods are easy to implement and typically converge quickly.
The disadvantages of k-means are that it is sensitive to initialization, requires the user to choose k, and assumes that clusters are “blob-like” and of similar size.
The standard method for k-means clustering, Lloyd’s algorithm [1], tends to converge to local minima, which makes the solution dependent on the initial locations of the k prototypes. To address this problem, the k-means++ algorithm uses weighted random sampling to choose the starting prototypes and has been shown to improve both the speed and accuracy of k-means.
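The following is a minimal NumPy sketch of how the two pieces fit together: k-means++-style weighted seeding followed by Lloyd’s alternating assignment and update steps. The function names `kmeans_pp_init` and `lloyd` and the default parameters are illustrative choices, not a reference implementation.

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick the first centroid uniformly at random,
    then pick each subsequent centroid with probability proportional to
    its squared distance from the nearest centroid chosen so far."""
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        c = np.array(centroids)
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)

def lloyd(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points to their
    nearest centroid and moving each centroid to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(points, k, rng)
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        labels = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return centroids, labels

# Example usage on a data matrix X of shape (n_points, n_dims):
# centroids, labels = lloyd(X, k=3)
```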
Choosing an appropriate value of k is difficult, as a value that is either too small or too large produces clusters that are not meaningful. Several papers have proposed methods for automatically determining k.
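One widely used heuristic (not necessarily among the methods proposed in those papers) is the elbow method: run k-means for a range of candidate k values and look for the point where the objective stops decreasing sharply. A minimal sketch using scikit-learn, whose KMeans estimator exposes the final objective as `inertia_`; the function name `elbow_curve` is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_values):
    """Fit k-means for each candidate k and record the final objective;
    a sharp 'elbow' in this curve is a common heuristic for picking k."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(model.inertia_)
    return inertias

# Example: inspect the curve for k = 1..10 on some data matrix X
# X = np.random.default_rng(0).normal(size=(300, 2))
# print(elbow_curve(X, range(1, 11)))
```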
K-means is best at identifying clusters that are “blob-like” (i.e. roughly spherical) and of similar size. A well-known example where k-means performs poorly is the mouse dataset, reproduced below in Figure 2. This example contains three clusters: two small ones (the ears) and one large one (the head). Because the head is much larger than the ears and there is little separation between them, k-means will incorrectly assign to the ears many points that actually belong to the head.
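This failure mode can be reproduced with a synthetic sketch along the following lines; the blob positions, sizes, and counts below are illustrative choices rather than the canonical mouse dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Mouse-like data: one large "head" blob and two small "ear" blobs
# (positions and spreads chosen purely for illustration).
head = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))
ear1 = rng.normal(loc=[-1.5, 1.5], scale=0.3, size=(150, 2))
ear2 = rng.normal(loc=[1.5, 1.5], scale=0.3, size=(150, 2))
X = np.vstack([head, ear1, ear2])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Because k-means prefers similarly sized, spherical clusters, the two
# "ear" clusters absorb points from the upper part of the "head".
for j in range(3):
    print(f"cluster {j}: {np.sum(labels == j)} points")
```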