Hierarchical Clustering
Hierarchical clustering is a class of clustering methods applicable to most types of data. These methods can be subdivided into agglomerative (or bottom-up) and divisive (or top-down) approaches. Both approaches generate a dendrogram, a tree that shows the nested grouping of data points and the similarity levels at which groupings change. An example is shown in Figure 3.
Figure 3: Example of a Dendrogram
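As a concrete illustration, the following sketch builds and plots a dendrogram with SciPy; the toy data and linkage method are assumptions for illustration, not from the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data: ten 2-D points
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Build the merge tree (agglomerative, complete-link here)
Z = linkage(X, method="complete")

# Plot the dendrogram: heights show the distance at which clusters merge
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```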
Agglomerative hierarchical clustering starts with each data point in its own cluster. In each step, the most similar clusters are merged, and this process continues until only one cluster remains. A common algorithm for agglomerative hierarchical clustering is complete-link. In this algorithm, the distance between two clusters is the maximum of all pairwise distances between their component points. In every step, clusters whose distance falls below a threshold value are merged, and the threshold is increased in each step until only one cluster remains.
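The following is a minimal, naive sketch of complete-link agglomerative clustering, assuming Euclidean distance. It merges the pair of clusters with the smallest complete-link distance at each step, which is equivalent to the threshold-raising formulation above; production code would use an optimized routine such as SciPy's linkage.

```python
import numpy as np

def complete_link(X):
    """Naive complete-link agglomerative clustering.

    Returns the sequence of merges as (cluster_a, cluster_b, distance).
    O(N^3) as written; library implementations are far more efficient.
    """
    n = len(X)
    # Pairwise distances between individual points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member points
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Complete link: cluster distance = max pairwise point distance
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    dist = max(d[i, j] for i in clusters[a] for j in clusters[b])
                    if best is None or dist < best[2]:
                        best = (a, b, dist)
        a, b, dist = best
        merges.append(best)
        # Merge the closest pair into a new cluster
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```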
Divisive approaches start with all data points in a single cluster and perform splitting until a stopping criterion is met [13]. One such algorithm selects the cluster with the highest squared error and splits it using bisecting k-means, with the objective of maximizing the Ward's distance between the two parts. Another type of divisive algorithm starts by constructing a weighted graph from the data and finding a minimum spanning tree. Then, the highest weighted edge remaining in the minimum spanning tree is removed in each step to split clusters.
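Below is a sketch of the MST-based divisive variant using SciPy; the helper name mst_divisive and the stopping criterion (a target number of clusters) are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(X, n_clusters):
    """MST-based divisive clustering: repeatedly cut the heaviest
    remaining edge of the minimum spanning tree until n_clusters remain."""
    d = squareform(pdist(X))                   # dense pairwise distances
    mst = minimum_spanning_tree(d).toarray()   # edge weights of the MST
    # Each removed edge splits one cluster into two
    for _ in range(n_clusters - 1):
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    # Remaining connected components are the clusters
    graph = mst + mst.T
    _, labels = connected_components(graph, directed=False)
    return labels
```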
The advantages of hierarchical clustering are that these methods have been shown to produce clusters of higher quality and do not require predefined parameters such as the k in the methods described above. The resulting dendrogram is a useful representation of the structure of the data and allows the user to determine k at the end of the process by selecting the cut that is most meaningful.
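For example, with SciPy the cut can be chosen either by a distance threshold read off the dendrogram or by requesting k clusters directly; the threshold and k values below are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))           # toy data, as in the earlier sketch

Z = linkage(X, method="complete")

# Cut at a distance threshold read off the dendrogram...
labels_by_distance = fcluster(Z, t=2.5, criterion="distance")

# ...or request k clusters directly once k has been chosen
labels_by_k = fcluster(Z, t=3, criterion="maxclust")
```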
The disadvantage of hierarchical clustering is that its computational cost is typically higher than that of other clustering methods such as k-means. The time complexity is O(N^2 log N) for agglomerative approaches and O(2^N) for divisive approaches [14], which makes hierarchical clustering impractical for large datasets unless heuristics or hybrid approaches are used.