Hierarchical Cluster Analysis

Blogs

Sign-up to receive the latest articles related to the area of business excellence.

Hierarchical Cluster Analysis

What is it?

Hierarchical clustering is an algorithm that groups similar objects into groups or clusters often without prior information of the data structure. The objects within a group are similar to each other and objects in one group are dissimilar to the objects in another group. For example, if you have several patients who come to visit your clinic, based on their symptoms you can probably group patients into different groups where each group of patients are broadly similar to each other in terms of the symptoms.

Hierarchical Cluster Example

There are a number of different approaches to generating clusters such as K-Means clustering, density-based clustering, Gaussian Mixture Models clustering, and Hierarchical clustering. Each method has its own advantages and disadvantages. Several of the clustering algorithms only group the objects into different clusters and the resulting groups can be viewed on a cluster plot. A hierarchical clustering algorithm in addition to breaking up the objects into clusters also shows the hierarchy or ranking of the distance and shows how dissimilar one cluster is from the other. In this article we will only focus on the hierarchical clustering.

Applications

Hierarchical clustering can be used in a number of different areas where we would like to understand the “natural” grouping of the objects. Here are some examples:

Cluster search results so that similar results are grouped and shown together
Relate different species into groups based on their genetic DNA makeup
Tracking viral outbreak based on mutation of viral DNA
Group loan applicants into high/medium/low risk based on attributes
Organize customers into groups/segments with similar traits, product preferences
Social network analysis based on tweets
Anomaly detection in medical imaging
Diagnostic questionnaire for symptoms to identify patients with similar symptoms
Identify homogeneous groups of customers with similar needs and attitudes
Group students into similar groups based on psychological, aptitude and achievement scores

For example, you are currently serving all customers located in a city from your warehouses. You would like to group your customers into two or more groups based on their distance such that all customers in one group can be served from a warehouse that is located close to that group.

How does it work?

Hierarchical clusters can be built either as agglomerative or divisive. In the agglomerative model, the two closest nodes are combined together into one node and this process is repeated until all the nodes have been grouped together. In the divisive approach, all the nodes are together initially, and these are broken up into two groups and so on until only individual nodes remain. In most cases, we will be using the agglomerative model to create the hierarchical clusters.

Let’s take an example of 10 nodes as shown in Figure 1, the nodes that are the closest together are combined and replaced by the average of these nodes. Let’s calculate the distance matrix to find out the distance between nodes. In this example, nodes 4 and 10 are the closest together and these are combined into one cluster. Next, nodes 2 and 5 are group together. This process is repeated until only one node is left. The process of combining nodes is shown in Figure 2.

Example Points

Figure 1: Location of Nodes & Distance Matrix

Step by Step Agglomeration Process

Figure 2: Sequence of Agglomeration

We can also show the clusters and the distance between clusters graphically on a plot called the dendrogram. An example of the dendrogram is shown in Figure 3.

Dendrogram Example

Figure 3: Cluster Dendrogram for the given problem with 2 clusters

Once all the groups have been determined, we get a tree like structure with the leaves being the individual nodes. Note that the Y-axis on this plot shown as Height is the average distance between the nodes. For example, the distance between nodes 4 and 10 is 0.5 and the distance between nodes 2 and 5 is 1.5 and so on. If you want to break-up the nodes into two groups, we determine the appropriate height that will split this into two groups. For example, if you draw a horizontal line at a height of 8 units, then we get two groups. The first group only contains 4 and 10 and the second group contains (2, 3, 5, 6, 7, 8, and 9). If we wanted to breakup the data into three groups, we could have drawn a horizontal line at 7, in which case the three groups will be Group 1 (4, 10), Group 2 (2, 3, 5, 8, 9) and Group 3 (1, 6, 7). Hence, the final number of clusters required can be decided later on and is not required prior to building this hierarchical groups.

Algorithm Options

There are two main options available for the hierarchical cluster analysis. The algorithm used to determine the distance between two nodes and the formula used to calculate the distance between clusters that contain multiple nodes.

Distance Calculation
There are several ways to calculate distance between two points. In the above example, we used the Euclidean calculation for distance which is the default and most commonly used. However, depending on your application, you may need to pick a different way to calculate the distances. Here the the available options:

Binary: The two points are regarded as binary bits with non-zero elements as “on” and zero elements as “off”. The distance between them is the proportion of bits in which only one is “on” vs. the bits in which at least one is “on”. As an example, let’s say that we have 4 patients who have the following symptoms. We want to determine which of these patient symptoms are similar to each other and which ones are different.

In order to answer the question on similarity, we will use the distance computation between each of these patients using the “Binary” method. Any response marked as “Yes” will be 1 and any response marked as “No” will be a 0. We can create a contingency table between each of these patients. Let’s say, we want to compare patients A and B. The contingency table is as follows:

Where p is the number of times A is “on” and B is “on”, r is the number of times A is “on” and B is “off”, q is the number of times A is “off” and B is “on”, and s is the number of times both A and B are “off”.

Note that for asymmetric distance metric the value of s is not used in the above formula. The following table shows the distance using this formula. From this table, the distance between C and D is minimum and hence their symptoms are the most similar.
Canberra: The Canberra distance is based on the sum of a series of fraction differences between the coordinates of a pair of objects. Each term of the fraction has a value between 0 and 1. The formula for calculating the distance as:

Note that if one of the points is 0 and the other is not, then the distance value is 1.0. If both points are 0, then the distance between them is defined as 0 (note the formula cannot be evaluated as it is 0/0 form). Canberra distance is more sensitive for values close to 0.

Let’s take an example. We have 4 projects that are rated based on cost, quality, timeline, and risk. We want to determine which of the projects are similar to each other. From this table, it can be seen that the projects A and D are the most similar.
Euclidean: The Euclidean distance is the straight line distance between two points. This distance is also termed as the L2 distance. The formula for computing the distance is:

For the example shown above, the distance between points is shown by the following table. The shortest distance is between B and D. Note that this is different from the solution provided by Canberra distance.
Manhattan: The Manhattan distance measures the distance between two points measured along the axes at right angles. The formula for the distance between two points p and q is given as:

Some of the areas where Manhattan distance is used is in integrated circuits where the wires only run parallel to X and Y Axis, compute distance between two points for a vehicle that is moving on a grid layout of the streets. For the example shown above, the Manhattan distance is shown in the following table. The closest points are A and D and C and D at 10 units.

Distance Between Clusters
Computing the distance between two nodes is clear using the distance calculation listed above. However, if each cluster contains different number of nodes, how do we compute the distance between the clusters? There are several approaches to do this and there is no one right answer. You will need to pick the method that best suits your application.

Single Linkage: In this method, the distance between clusters is computed as the distance between its two closest members. This method often results in clusters in which individuals are added sequentially to a single group.
Complete Linkage: In this method, the distance between clusters is computed as the distance between two of their farthest-apart members. This method often results in clusters that are well separated and compact.
Average Linkage: In this method, the distance between each each group of nodes between the clusters are calculated and the overall average is reported as the cluster distance.
Centroid: In this method, the distance between two clusters is calculated as the distance between their centroids. This method should only be used for Euclidean distances. This may result in a dendrogram that is difficult to interpret.
Median: In this method, the distance between two clusters is calculated as the weighted distance between the centroids, the weights being proportion to the number of nodes in each cluster. This method should only be used for Euclidean distances. This may result in a dendrogram that is difficult to interpret.
Ward’s Distance: In this method, the groups are formed so that the pooled within-group sum of squares is minimized. The two clusters are selected to fuse if it results in the least increase in the within-group sum of squares.

Data Types

In the real-world, we will get objects that contain mixed data types. For example, let’s say we want to cluster bank clients based on the following characteristics: age, industry, job title, marital status, education level, past loan defaults status, bank balance, owns house, has kids. Some of these characteristics are numeric (like age, bank balance), some of them are categorical (like industry, job title, marital status, education level, past loan defaults status, owns house, has kids). For numeric values, we can calculate the distance using either Euclidean or Manhattan distance. For categorical values, we need to determine if they are binary, ordinal or nominal values. For ordinal data types, we can first rank the variables, and then compute the distance similar to numeric values. For nominal values, we can use dummy variables to convert them into binary variables and then use distance formula for binary variables. In addition, if you have mixed data types, you may want to normalize the variables before you combine the distances into a single value. You can use the Gower distance to handle this situation if you are using the R software. More details about how to handle mixed data types and Gower distance are out of scope for this article as it is already too long – we will follow it up in a different article.

In Summary, use hierarchical cluster analysis to divide the data into multiple groups or clusters and when the underlying data has a hierarchical structure. The strengths and weaknesses of hierarchical cluster analysis are shown below.

Strengths

It is easy to understand and easy to do
Relatively straightforward to program
Main output, the dendrogram, is appealing to users
Not necessary to specify the number of clusters a-priori

Weaknesses

Involves a lot of arbitrary decisions (distance metric, linkage criteria)
Does not work very well with missing data
Computations are expensive and difficult to use with very large data sets
Works poorly with mixed data types (arbitrary decisions on distances)
Rarely provides the best solution. Easy to see the result if only 2 dimensions are present
If data is in high dimensions, we may not be aware that we have a poor solution

Blogs