Hey folks, in this post I will show you how to perform K-means clustering in R. But before moving onto the R facet of K-means clustering algorithm, let us spare a moment or two in understanding what it actually is and why is it required at the first place.
What is K-means clustering algorithm?
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). A more formal definition can be found here. The phrase “unsupervised learning” simply means that there is no specific outcome to be predicted, the algorithm simply tries to find patterns in the data.
Why K-means clustering?
To answer this question, let me take a real time business scenario.
Consider a garment manufacturing company XYZ, which is planning to introduce a new line of shirts into the market. As the saying goes “to each his own”, the company cannot just manufacture shirts of a fixed size. In fact, it must manufacture multiple shirt sizes suiting different category of people. In order to cater to all these needs, the company retrieves data consisting of people’s heights and weights and plots them on a graph, which looks like this:
Fig 1: plot of people’s height v/s weight
Of course the company cannot manufacture shirts of all these sizes. Instead, it opts for a smart route and divides these sizes into 3 broad categories: Small, Medium and Large. These categories are selected such that all sizes can be accommodated within them. Now one might wonder how the company arrived at this ingenious grouping of sizes. This is where K-means clustering jumps right across the scene.
This classification of sizes into 3 groups can be accomplished by K-means clustering, and the algorithm provides the 3 best sizes, which will meet the demands of all the people. If in case it doesn’t, the company can divide people to few more groups, such as X- large, XX-large and so on.
Fig 2: Divide the size into more groups
Now that you have a fair understanding on the “what” and “why” aspects of the algorithm, let us emphasize the “how” part and see what are the steps involved in performing the clustering.
- Select “k” centroids (k rows chosen at random)
- Assign each data point to its closest centroid.
- Recalculate the centroids as the average of all data points in a cluster (i.e. the centroids are p-length mean vectors, where p is the number of variables.)
- Assign data points to their closest centroids.
- Continue steps 3) and 4) until the observations are not reassigned or the maximum number of iterations (the default being 10) is reached.
As a rule of thumb, the algorithm aims at minimizing an objective function, which is usually a squared error function, given as:
‘||xi – vj||’ is the Euclidian distance between xi and vj
‘ci’ is the number of data points in the ith cluster
‘c’ is the number of cluster centers
K-means clustering in R:
I would be using the “lenses” dataset provided by R to perform a simple clustering. This dataset can be downloaded from here.
Exploring the data:
The lenses dataset consists of attributes such as:
- Age of the patient- (1) young, (2) pre-presbyopic, (3) presbyopic
- Spectacle prescription- (1) mypoic, (2) hypermetropic
- Astigmatic- (1) no, (2) yes
- Tear production rate- (1) reduced, (2) normal
Additionally, the data is classified into 3 classes:
(1) The patient should be fitted with hard contact lenses
(2) The patient should be fitted with soft contact lenses
(3) The patient should not be fitted with contact lenses
Once the downloaded dataset has been loaded into the R working environment, let us see how it looks like:
If you type these commands, the output would be:
Fig 3: Lenses dataset
Now since we know that k-means is an unsupervised algorithm, we are focused only on the feature space of the data. As such, it makes sense to remove the last attribute i.e. class from the input dataset as we will focus purely on the features. The corresponding code can be shown as:
Fig 4: Removing the “class” column
Performing k-means clustering:
Now that we are well acquainted with the data, let us subject it to clustering.
R provides a function named “kmeans” to perform k-means clustering on an input dataset. In this example, it takes in 2 parameters, namely:
- i/p dataset: Lenses.features
- number of clusters: 3
If you want to better understand the kmeans function and its various attributes, click here.
The process of selecting the desired number of clusters is often confusing and requires a deep understanding of the dataset and the objective of clustering. Some methods can be found here.
In this example, we know that the patients are broadly classified into 3 classes, as such, there are 3 clusters required.
It can be shown as:
Fig 5: Performing k-means clustering
As you can see, there are 3 clusters, of size 8 each (meaning there are 8 observations from the dataset in each cluster). You can also see the mean of each attribute per cluster, the individual mapping of each data point to a specific cluster and the within cluster sum of squares result.
Plotting the results:
Let us plot a graph of age and the tear production rate, using the color of each cluster as the differentiator.
Fig 6: Plot using cluster color as differentiator
Now, let us check the accuracy of our grouping by plotting the same graph, but in this case using the original class color as differentiator.
Fig 7: plot of class color as differentiator for comparison
As you can see, the clustering achieves a pretty similar grouping as the assigned classes. This proves the validity of the clustering done here.
- The t-shirt example can be found here
- An Efficient k-means Clustering Algorithm: Analysis and Implementation by Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu.