Clustering is a Machine Learning algorithm to divide a set of data points into a limited nmber of groups.
The most popular clustering algorithm, known for its simplicity and ease of use, is k-means. The reason why this algorithm, compared to many others, is so popular is that it outperforms many other alternatives on a small sample and is quite easy to use (it only requires five lines of code). This algorithm, however, can be equally be applied to Big Data producing valid results.
When are k-means used?
K-means performs very well with these two following pre-conditions:
- The data is low-dimensional
- All variables have the same variance
- All clusters have a similar number of data points
If our data meets these requirements, k-means is probably the algorithm you might want to use on your dataset.
The Gaussian sphere assumption
The most important principle that the k-means algorithm relies upon is the Gaussian Sphere assumption. Because the algorithm uses the Euclidean distance to assign data points to the closest cluster, the algorithm will work by assuming that all clusters are spherical (because the circumference of a cluster is the set of all data points with the same distance from the centroid).
This assumption works quite well when all clusters have a similar number of data points, but if data is irregular, then this assumption is no longer valid. K-means begins to underperform compared to other clustering algorithms.
There are several ways of classifying clustering algorithms because each can have multiple features. One of the simplest ways to differentiate between clustering algorithms is to determine which inputs they require (hyperparameters). We can divide all clustering algorithms into two separate categories:
When we run the algorithm, we already specify the number of clusters we want. This scenario is prevalent when our data is small enough to analyze it, or we need to keep a meager number of clusters. K-means is an example of parametric clustering.
There are methods used to estimate the optimal number of clusters before running the clustering algorithms, like the elbow method or the silhouette method (both highly computationally intensive).
This class of algorithms is usually used on Big Data, where the structure of the data is so variegated that we are unable to estimate the optimal number of clusters. In such a case, we prefer the algorithm to give us an estimate of the optimal number of clusters. Some of the most common non-parametric clustering algorithms are affinity propagation and HDBSCAN.
When to avoid using k-means
K-means is a very powerful algorithm when the previously outlined conditions are met. However, when we work on datasets with a high number of dimensions, the distance between points gets wider, and the Gaussian sphere assumption is no longer valid, making this algorithm less relevant for these scenarios.
One perfect example is when we are working with language or image data. Once converted into a vector by using encoders, we end up with more than 500 dimensions (sometimes over 1000, for textual data). With this degree of complexity, we will likely need more complex algorithms that are optimized for working in high-dimensional space: the so-called density-based clustering. One example is HDBSCAN.
Not that k-means will still produce good results; we are only making the case that there are a set of algorithms that will outperform k-means in the same task. For example, in the following app, we have been using k-means to subdivide the data points of a 768-dimensional space into 200 clusters, and the insights obtained from this dataset are valid.
How does k-means work?
K-means is an algorithm that executes a single process iteratively until this process can no longer be run. Compared to all the other clustering algorithms, it is one of the less computationally intensive.
Before explaining in detail how the k-means algorithm works, we first need to explain what centroids are and how they are used by this clustering algorithm. A centroid is a point in space that represents the entire cluster. This point has the potential of having the same coordinates (but is not always the case) of one of the data points in the dataset; in this case, it becomes a medoid.
By using centroids and using a single dot to represent the coordinates of an entire cluster, we can easily determine how data points interact with the clusters in our dataset without having to use excessive computational power. For example, if I need to assign a new data point to the closest cluster, by using a centroid, I can calculate the distance between the new point and all the centroids. If I had no centroid, I would need to average the distance between this new point and all the data points in the dataset (a solution that would require huge computational costs).
Finding k clusters
The k-means algorithm is executed using these simple steps:
- The algorithm chooses random datapoints that become centroids
- All the data points the dataset are assigned to the closest cluster (the centroids are used to see which cluster is the closest one)
- Once all points have been assigned, new centroids are computer per cluster (This is done by averaging all the data points in a cluster and calculating the position of the new centroid)
- If some data points could be more efficiently assigned to new clusters, restart the process from step 2
- When data points no longer need to be assigned to unique clusters, end the process.
Use Case on E-commerce data
To better understand how k-means work on a real-life problem and the value you can generate from it, let us take a sample dataset and run k-means on the data to label it appropriately. After that, we will analyze the generated insights.
The starting data for clustering will need to be numerical and in tabular form, exactly like the dataset shown below. The data is taken from an e-commerce platform, and shows the data of 1000 customers:
- The average expense of all transactions
- The Customer age
- Consecutive days without making a purchase
Because the data is 3 dimensional, we can visualize it in a 3d graph after we have applied a clustering algorithm and we have segmented our customer base into 3 main clusters.
We can manually label each one of our clusters manually:
Cluster 1: young people (x-axis) that buy frequently (y-axis) with a high range of spending (z-axis)
Cluster 1: young people (x-axis) that do not buy frequently (y-axis) with a high range of spending (z-axis)
Cluster 1: older people (x-axis) that buy regularly (y-axis) with a low-medium range of spending (z-axis)
If we had more data at our disposal (many more customer information as well as a higher number of samples) we would get way more clusters that could give us precious insights into the trends that are emerging from our customer activity.
Use Case on textual data
Clustering on textual data is not much different from tabular data. Once the text has been encoded, it is converted into vectors, and we can apply a clustering algorithm to it.
For example, we can vectorize a sample of 1000 Amazon customer reviews with an encoder, converting each review into a 768-dimensional vector.
We can then apply a clustering algorithm on the result to group all samples (reviews) into groups with similar semantic meanings.
To visualize results, we need to compress the original 768 dimensions of each vector into 2 dimensions. If we graph the results, we can see how each cluster contains all the reviews in a region of space.
How can you take advantage of clustering?
Would you need an end-to-end vector platform that incorporates all the state of the art clustering algorithms and workflows?
This is where RelevanceAI comes into play.
Book your platform demo here with our vector experts and learn how you can take the next steps.
Alternatively with knowledge of Python and Juypiter notebooks, you can create an account and get started today.