This is where the journey of Machine Learning (just a fraction of Data Science) begins.
Let me warn you: it will be tough in the beginning, you’ll have to sit for hours playing around these algorithms, but these are the building blocks of the all-so-hyped “Machine Learning”.
I can totally guarantee you: If you religiously follow whatever I am covering, cracking Data Hackathons and those interviews would be simpler, much simpler.
To give you a brief, there are many (actually MANY) algorithms that can be applied to a ML problem. To make the process easier, I have divided them into different categories:
- Instance – Based
- Dimensionality Reduction
- Decision Tree
- Association Rule
- Neural Networks
- Deep Learning
- Miscellaneous (never-ending list)
This link gives you a good overview about these algorithms.
Now, what is Clustering?
It is a process of partitioning a large number of objects into a small number of clusters such that objects in the same cluster are more similar to each other than those in other clusters. For example, the clothes in any Pantaloons showroom are clustered in different categories (men’s lowers, women’s tops and kids wear). There can be different ways of partitioning: Qualitative and Quantitative. The example discussed belongs to the Qualitative part, but some selected features can be quantified and measured to partition the clothes (or any commodity).
It follows a simple and easy way to classify a given data set into a certain number of clusters (assume k) fixed before-hand. The idea is to define k centroids, one for each cluster. The centroids should be placed in such a way that they are as far as possible from each other, so that coverage of the entire data is possible. At the end of the algorithm, all the given points N will be assigned to one of the k centroids, wherein each centroid will describe a cluster.
It is a Heuristic algorithm, i.e. there is no guarantee that it will converge to the global optimum, and the result will depend on initial clusters. This is so, because it follows a recursive approach and we will always get a different result depending on the initial conditions.
As the algorithm is generally very fast, it is very common to run it multiple times with different starting conditions.
Steps involved in this algorithm:
1. Define initial k centroids for the given data set consisting of N data points.
2. Assign the N data points to the closest initial centroid as defined in step 1.
3. Recalculate the centroids for each cluster by finding new centroid position (mean of all points belonging to that cluster).
4. Repeat step 3 till convergence.
In this algorithm, the objective function is a squared error function which is minimised at all times. When we are finding the closest centroid for each data point, automatically, this objective function J is getting minimised.
Here, xi is the data point and cj is the centroid. We are minimising the squared error of all points in each cluster to effectively minimise the objective function.
The chosen distance measure can vary from situation to situation.
Eg. Euclidean, Manhattan, Pearson
NOTE: K-Means generally uses Euclidean distance.
Step 1: Initialising the position of the ‘k’ clusters:
It is really upto you!
- Forgy: takes k points randomly from the set and calls them ‘initial means’
- Random Partition: first assigns a cluster to each point and then proceeds to the assignment step with means of random points belonging to each cluster.
- KMeans++: This method is used for small data sets.
The underlying logic is that spreading out the k initial cluster centres is a good thing.
: First cluster centre is chosen randomly from the given dataset.
: For each data point x, find D(x) i.e. the distance between x and the nearest centre that has been already chosen.
: Choose one new data point at random as a new centre, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
: Repeat the above two steps, until k centres have been chosen.NOTE – D(x) should be updated as more centres are being added. It refers to the distance between a data point and the nearest centroid.Here is the link to original research paper for those interested. 🙂
- Canopy Clustering: It is a unsupervised pre-clustering algorithm used as preprocessing step for K-Means or any Hierarchical Clustering. It helps in speeding up clustering operations on large data sets.This algorithm uses two thresholds T1 (the loose distance) and T2 (the tight distance), obviously T1>T2.Algorithm –
: From the set of data points to be clustered, remove a point, thus beginning a new “canopy”.
: For each point left in the dataset, assign it to the new canopy is the distance is less than T1.
: If the distance is additionally less than T2, remove it from the original set.
: Repeat from step 2, until there are no points in set to be clustered.
The algorithm loops until the initial set is empty, accumulating a set of Canopies, each containing one or more points.A given point may occur in more than one Canopy.For more optimisation, an approximate and fast distance metric can be used for step 3, and slow and accurate metric can be used for step 4.The only problem may arise when it comes to choosing the distance function because of high dimensionality in large data sets.
Step 2: Assigning Data to Clusters
Distance is calculated between each data point and cluster centres (initial cluster centres for the first iteration), and the data point is assigned to the cluster whose distance is minimum of all the cluster centres.
There can be different ways of finding distance between two multi-dimensional vectors:
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
- Cosine Distance
- Correlation Distance
As mentioned earlier, K-Means generally uses Euclidean Distance and there is a valid reason for it. (as is for everything in this post)
The way K-Means is constructed is not based on distances. It is based on within-cluster variance. Basic idea is to minimise squared errors, and distance is not involved anywhere.
We just cannot use any arbitrary distance, because k-means may stop converging with other distance functions.
At the end of this step, we have k clusters each having some points assigned to itself.
Step 3: Recalculating new cluster centres
The centre of each cluster is calculated using the ordinary way of finding mean of given observations.
New Cluster Centroid = Mean of all points within that cluster
– Steps 2 and 3 are repeated until no centroid moves or changes on recalculation. In this way, we find a grouping of objects using which the objective function to be minimised can be calculated.
Determining ‘k’ for a dataset
I have discussed the entire algorithm of clustering the given data into suitable number of groups, but the question arises that how many groups/clusters?
For small data, it is sometimes possible to have a look at the data and decide by intuition, but that isn’t a reliable option as the size and dimensionality of data increases.
- Elbow Method:
– The idea is to run k-means clustering on the dataset for a range of values(say, k from 2 to 15).
– For each value of k, calculate the sum of squared errors (SSE).
– Plot a line chart of the SSE for each value of k.
If the line chart looks like a arm, then the “elbow” on the arm gives us the best k.
We want a small SSE, but as we keep on increasing k, SSE tends to zero (because when the number of clusters is equal to number of points in the dataset, there will be zero error). Hence, choose a small value of k, that has a small SSE as well.
As it can be seen in the picture above, for Dataset A, we obtain a clear elbow at k = 3.
For Dataset B, we may argue that the line is smooth and it isn’t possible to define a suitable k using this method. Look at next method for better results.
- Silhouette Score: This method is used to study the separation distance between the resulting clusters.The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters.This measure has a range of [-1,1]. Coefficients near +1 indicate that the sample is far away from neighbouring clusters whereas -1 may indicate that the sample has been assigned to the wrong cluster.For every data point i, we calculate two parameters a(i) and b(i).
a(i) – Average dissimilarity of i with all other points in it’s own cluster, i.e. average of distance of i from other data points.
b(i) – Lowest average dissimilarity of i to any other cluster. The cluster with this lowest value is the next best fit for i.s(i) = Difference of a(i) and b(i) divided by maximum of a(i) and b(i)Average s(i) over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. Hence, average s(i) over all data of the entire dataset is a measure of how appropriately the data has been clustered.The silhouette analysis is used to find the number of clusters. To put it in a sentence, “More positive the Silhouette Score, more optimal value of k.”
So, till now, I have discussed K-Means in detail. Hope you got a hang of it and it was helpful.
Implementation in Python with examples would be updated soon.
Share your feedback and comments are welcome. 🙂
PS – This is my first blog post. Suggestions would be highly appreciated.