What is clustering?

  • What is clustering?
  • Why is clustering useful?
  • What are the different types of clustering? (high level)
  • What are good packages for clustering?
  • How have vectors allowed one to cluster images and text?

Intro

Machine Learning is an essential tool for harnessing artificial intelligence (AI). It is frequently referred to as AI because of its learning and decision-making capabilities, although it is actually a subset of AI. It was a part of AI’s evolution until the late 1970s, when it split off and began to evolve on its own.

Machine learning has become a critical tool in cloud computing and eCommerce, as well as in a number of other cutting-edge technologies.

Machine learning is currently behind some of the most significant technological breakthroughs. It’s being employed in the burgeoning self-driving vehicle sector, as well as in space exploration, where it aids in the discovery of exoplanets.

Stanford University has described machine learning as “the study of getting computers to act without being explicitly programmed.” Machine learning encompasses several approaches, including supervised and unsupervised learning.

What is Machine Learning?

Machine Learning refers to a set of advanced algorithms used to either create models or extract valuable insights from data. For many firms today, machine learning is an essential part of modern business and research. It makes use of algorithms and neural network models to help computers improve their performance over time.

Machine learning algorithms automatically build a mathematical model from sample data, often known as “training data”, and use it to make decisions without being explicitly programmed to make them. Among these algorithms, one of the most popular families is clustering.

Supervised vs. Unsupervised

To understand what clustering is, you first need a high-level understanding of what ML is and how it is organized.

There are multiple ways of categorizing Machine Learning algorithms. The simplest way is to divide them into two main groups: supervised learning algorithms and unsupervised learning algorithms.

Supervised Learning algorithms

All algorithms that are considered supervised have one thing in common: the data is labeled. The algorithms use the remaining part of the data (called features) to learn how to predict each label. Once an algorithm has been trained to make this prediction, it becomes a model.

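If the labels are categorical, we use classification algorithms to predict them. If the labels are numerical, we use regression algorithms. For example, predicting whether an email is spam is a classification task, while predicting the price of a house is a regression task.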

Unsupervised Learning algorithms

What defines this class of algorithms is that we have at our disposal only unlabeled data: basically, we only have features, but no labels. The purpose of these algorithms is not to create models, but rather to extract insights from the data.

If we operate on rows, we are using clustering algorithms, and our purpose is to estimate the missing labels. If we operate on columns, we are using dimensionality reduction algorithms, and our purpose is to reduce the number of columns.
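The difference between working on rows and working on columns can be sketched with scikit-learn. This is only an illustration: the random dataset, the choice of 3 clusters, and the choice of 2 output columns are assumptions made for the example, with k-means standing in for clustering and PCA for dimensionality reduction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical unlabeled dataset: 100 samples (rows), 5 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Operating on rows: clustering assigns one estimated label to every sample.
row_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Operating on columns: dimensionality reduction keeps every row
# but shrinks the number of features.
X_reduced = PCA(n_components=2).fit_transform(X)

print(row_labels.shape)  # (100,)
print(X_reduced.shape)   # (100, 2)
```

Clustering returns one value per row, while dimensionality reduction returns the same rows with fewer columns.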

What is clustering?

Clustering is a family of algorithms applied to unlabeled, numerical data. Each sample in the data corresponds to a point in a multi-dimensional space, and the algorithm groups together the points that lie close to each other.

How does clustering work?

A myriad of clustering algorithms can be employed depending on our computational or performance needs. The most common clustering algorithm is called k-means, and it will be explained in detail in the following article.

A clustering algorithm works by dividing the entire dataset into groups. Once the algorithm is complete, each sample will be assigned a cluster number, essentially a label. We can then explore the content of each group to give it a more meaningful label.

In the example above, we have a dataset of university courses. The x-axis represents the course duration (in hours), while the y-axis represents the cost (in USD). Each point in the graph represents a single course. Initially, the courses are unlabeled. After applying a clustering algorithm and instructing it to find 3 clusters, we can divide all the courses into 3 different groups. All the samples belonging to one group will have the same color.
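A minimal sketch of this course example, assuming scikit-learn's KMeans. The nine courses are invented for illustration, and the scaling step is our own addition so that cost (large numbers) does not dominate duration (small numbers).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical courses: [duration in hours, cost in USD].
courses = np.array([
    [10, 200], [12, 250], [15, 300],        # short, cheap
    [40, 1200], [45, 1500], [50, 1400],     # medium
    [120, 5000], [130, 5500], [150, 6000],  # long, expensive
], dtype=float)

# Put both axes on a comparable scale before measuring distances.
scaled = StandardScaler().fit_transform(courses)

# Ask for 3 clusters; each course receives a cluster number (0, 1, or 2).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(scaled)

for course, cluster_id in zip(courses, cluster_ids):
    print(course, "-> cluster", cluster_id)
```

The cluster numbers themselves are arbitrary. What matters is which courses share a number.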

Why is clustering useful?

Clustering is used when we have to work on structured raw data. Thanks to vector-based technology, we can even convert unstructured data, like text or images, into cross-sectional data to which clustering algorithms can be applied.

Many companies collect all their unstructured data in data lakes, and this is where clustering algorithms can prove invaluable for giving meaning to that data.

The most common use case for clustering is explaining the content of the data without any manual work, by assigning a label to each cluster. If we know, for example, that a portion of our data is similar in scope, we can group it and extract information from the entire group.

Some examples include all the data collected from customer feedback, including emails, reviews, and customer service calls. A company that interacts regularly with its customers will collect plenty of data, but it will likely be unstructured and unlabeled, and all the insights from this data will be contained in transcripts and written text.

By vectorizing this data and applying clustering analysis, we can identify all the problems that users complain about as well as what they would like to see improved, and, if we wish to get more creative, we can also profile different users.

Another common example of unstructured data is customer data. We are only able to capture a part of the user data into a structured (mostly tabular) format.

For example, we can input in a table the general user information that can be used for analytics, such as the age of users, their average budget, most expensive item bought, preference tags… Unfortunately, this way of formatting information can only partially capture the value of each customer.

Each individual customer makes their own choices, writes reviews, shares on social media… None of this data can be added to a structured file. Instead, by applying clustering techniques we can get value from this data and profile users into several different categories.

This analysis is known as customer segmentation and allows us to understand which market niches we can target.

What are the different types of clustering?

In total, there are more than 30 different clustering algorithms, with several ways of classifying them based on their properties. One of the most useful distinctions is between flat and hierarchical clustering.

Flat clustering

With flat clustering, we divide the data into n groups without any further structure. Because these algorithms are simpler than the alternatives, they are also much faster.

Usually, flat clustering is used when we know there aren’t too many insights that can be extracted from the data, or when its volume is big enough to represent a challenge for more computationally intensive algorithms (such as hierarchical clustering).

A good example would be to analyze employees’ data to divide them into different groups. We wish to know if we can categorize employees into different groups to better know how to improve efficiency and better support each group. A group would be represented by all the employees who are creative, while another one by all the employees who are hard-working, etc.

We only want a high-level understanding of what these groups are, not of all the existing sub-categories. Otherwise, the data would become too complex for us to get any value out of it. In such a case, flat clustering is perfect.

The most common algorithms in this category are:

K-means

This algorithm is probably the most common clustering algorithm used by data scientists. We need to input the number of clusters we want to obtain from the dataset, and then every sample in the dataset is assigned to exactly one group.

K-medoids

The k-medoids algorithm works quite similarly to k-means, with one difference in how each cluster is represented. Both algorithms use a single representative point per cluster. K-means uses centroids, points in the same region of space as the cluster that are not necessarily existing data points, while k-medoids uses one of the data points of the dataset as the representative.
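The difference between the two kinds of representative can be shown on a single cluster. This is a small NumPy sketch with made-up points, not a full k-medoids implementation: it only computes the centroid and the medoid of one group.

```python
import numpy as np

# Hypothetical cluster of five 2-D points (the last one is far from the rest).
cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [9.0, 9.0]])

# Centroid (k-means): the mean of the points; it need not be a real sample.
centroid = cluster.mean(axis=0)

# Medoid (k-medoids): the real sample with the smallest total distance
# to all the other samples in the cluster.
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid = cluster[pairwise.sum(axis=1).argmin()]

print(centroid)  # [3. 3.]
print(medoid)    # [2. 2.]
```

Here the centroid (3, 3) is not one of the samples, whereas the medoid (2, 2) is.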

Hierarchical clustering

With hierarchical clustering, we can divide the data into n groups, with each group having a hierarchy of its own.

There are several use cases of hierarchical clustering. For example, when segmenting customers (customer segmentation) into different groups, we will be interested in knowing what the main customer niches are, but also all the sub-niches for every segment.

For example, the niche of eco-friendly customers of an e-commerce platform will likely contain additional clusters, like customers who choose to buy vegan products, others who choose biodegradable products, and others who support smaller eco-friendly startups. As we can see, the data can provide many insights that flat clustering is unable to give us.

Two widespread algorithms that fall under this description are:

Agglomerative clustering

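Agglomerative clustering is the most common kind of hierarchical clustering. It works bottom-up: each sample starts as its own cluster, and the closest clusters are merged step by step, so that the entire dataset ends up organized into groups made of smaller groups.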
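The idea of groups made of smaller groups can be sketched with SciPy's hierarchical clustering functions. The data (four tight blobs arranged as two pairs) and the choice of Ward linkage are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: four tight blobs arranged as two pairs of neighbours.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.1, size=(10, 2)) for c in centers])

# Bottom-up merging: every sample starts alone, the closest clusters merge first.
merge_tree = linkage(X, method="ward")

# Cutting the same tree at two depths gives coarse groups and their sub-groups.
coarse = fcluster(merge_tree, t=2, criterion="maxclust")
fine = fcluster(merge_tree, t=4, criterion="maxclust")

for sub_group in np.unique(fine):
    parents = np.unique(coarse[fine == sub_group])
    print(f"sub-group {sub_group} sits inside group(s) {parents}")
```

Each of the four sub-groups falls entirely inside one of the two coarse groups, which is the nesting that a flat algorithm does not expose.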

HDBSCAN

This algorithm is considered one of the most advanced clustering algorithms. It is quite computationally expensive but can be used when the data is so complex that flat clustering would no longer make sense. It works by finding only the denser areas of the dataset, while labeling everything else as noise.

However, there are several other properties that can be used to identify the purpose of a clustering algorithm; these are some of the most common:

Parametric (cluster-based) vs. non-parametric (density-based)

Parametric clustering is the set of all algorithms that require us to input the desired number of clusters. They are useful when the data is limited and we have an idea of its shape.

For example, in the case of analyzing employees’ or customers’ feedback data, we know from the start that we want to deal with a limited number of clusters. Having 200 different groups of employees for the purpose of better managing them would make things worse, while managing only 5 groups would be quite efficient.

In such a case, we want to force clustering to use the number of clusters we input.

Non-parametric clustering is the set of all algorithms that do not require us to input any number of desired clusters but are able to make an estimation on the optimal number of clusters with iterative techniques.

These algorithms are handy when the data is complex and we have no idea of how it is distributed. For example, when analyzing customer data, the data is so massive and varied that it would be impossible for us to estimate an optimal number of customer niches.

There might be 8, but there could just as well be 200. In this case, non-parametric clustering is perfect.
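As a rough illustration of a density-based algorithm that is not told how many clusters to find, here is a sketch using scikit-learn's DBSCAN (a simpler relative of HDBSCAN, chosen here only for the example). The data and the values of eps and min_samples are assumptions picked to suit this toy dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense blobs plus one isolated point.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(30, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([blob_a, blob_b, outlier])

# No number of clusters is passed; eps and min_samples describe density instead.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise with the label -1, so exclude it when counting clusters.
n_clusters = len(set(labels) - {-1})
print(n_clusters)   # 2
print(labels[-1])   # -1
```

The algorithm recovers the two dense regions on its own and flags the isolated point as noise. Note that the density parameters still have to be chosen, so the number of clusters is estimated rather than free of any input.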

Hard vs. soft

With hard clustering, each sample can only belong to a single group.

When we work on customer data, for example, we wish to keep things simple enough, and we want to assign customers to a single cluster, rather than having them in multiple niches.

With soft clustering, a sample can belong to several groups, each with its own probability score. An example of soft clustering would be grouping documents into different categories.

This is one of the most common use cases for extracting value from internal documents in a company. However, a single document, a contract, for example, can contain information about finance, as well as legal issues.

It is in our interest to assign multiple labels to each document, so searches can become more efficient.
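The contrast between hard and soft assignments can be sketched with a Gaussian mixture model from scikit-learn, which is one possible soft clustering method and is used here as an assumption. The one-dimensional data is invented and simply stands in for encoded documents.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data drawn from two overlapping groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 1))
group_b = rng.normal(loc=6.0, scale=1.0, size=(50, 1))
X = np.vstack([group_a, group_b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Three new samples: one near each group and one in between.
samples = np.array([[0.0], [3.0], [6.0]])

# Soft clustering: one membership probability per cluster for each sample.
probabilities = gmm.predict_proba(samples)
print(probabilities.round(2))

# Hard clustering: collapse each sample to its single most likely cluster.
hard_labels = gmm.predict(samples)
print(hard_labels)
```

Each row of probabilities sums to 1. A sample between the two groups keeps a share of both, which is the information that a hard assignment throws away.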

What are good packages for clustering?

Although there is, as we have seen already, a considerable number of clustering algorithms, most of them, including DBSCAN, are available in the sklearn (scikit-learn) library. Other algorithms, usually more complex, have historically not been included in standard Machine Learning libraries and can be found in independent libraries on GitHub. This applies to HDBSCAN, which is distributed as the standalone hdbscan package and has also been included in scikit-learn since version 1.3.

How have vectors allowed one to cluster images and text?

Clustering algorithms only work on structured numerical data. Most of our data, especially if it is unstructured, is not numerical: it consists of text, images, or categorical values. This is an issue if we wish to apply a clustering algorithm to the unstructured data collected in our data lake.

The latest ML models allow the conversion of this data into numbers (vectors) with a process called encoding.

Conversion of the text ‘Chip’ into a vector

When data is encoded, similar data (text with the same meaning or images that look alike) occupies the same region of space. With this principle, we can use clustering algorithms to group the data that looks similar and shares similar content, separating it from the rest of the data.

To perform the encoding, we commonly use neural networks (which are quite advanced machine learning algorithms). Before the advent of these encoders, the applications of NLP were very limited, and it was only possible to extract data from text using statistical indicators such as frequency, word counts, and tf-idf. Vector-based technology has enhanced our capabilities of extracting insights from data, specifically semantic meaning from every corpus of text.
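The overall pipeline (encode text into vectors, then cluster the vectors) can be sketched without a neural network by using the older tf-idf encoding mentioned above. This is only an illustration under that assumption: the feedback texts are invented, and a neural encoder would take the place of TfidfVectorizer to capture meaning rather than shared words.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical customer feedback about two different problems.
feedback = [
    "delivery was late",
    "late delivery again",
    "package delivery late",
    "cannot login to account",
    "login to account fails",
    "account login error",
]

# Encode each text as a numerical vector (one column per vocabulary word).
vectors = TfidfVectorizer().fit_transform(feedback)

# Cluster the vectors; texts that share vocabulary end up in the same group.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for text, label in zip(feedback, labels):
    print(label, text)
```

Because tf-idf only counts words, two complaints phrased with different vocabulary would not be grouped together. That is the limitation that semantic encoders address.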

Word/sentence encoding

The first encoders (like word2vec) could only encode a single word into a vector, for example:

chip → [.45, .23, .72, .25…]

Although the semantic meaning of every word was assigned to a region of space, which was quite an improvement compared to the old statistical NLP methods, the model could not grasp the meaning of entire sentences, let alone that of entire documents. The vector of each sentence was computed by averaging the vectors of its words.

Nowadays, thanks to the introduction of transformer-based technology, both the word order and the context are taken into account. For example, in the two sentences:
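Why averaging loses the meaning of a sentence can be seen in a small NumPy sketch. The three-dimensional word vectors below are made up for the example, and sentence_vector is a hypothetical helper, not part of word2vec.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; real word2vec vectors are learned
# from data and typically have hundreds of dimensions.
word_vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "bites": np.array([0.2, 0.8, 0.1]),
    "man": np.array([0.1, 0.2, 0.9]),
}

def sentence_vector(sentence):
    """Average the vectors of the words in a whitespace-separated sentence."""
    return np.mean([word_vectors[word] for word in sentence.split()], axis=0)

first = sentence_vector("dog bites man")
second = sentence_vector("man bites dog")

# Averaging ignores word order, so both sentences map to the same vector.
print(np.allclose(first, second))  # True
```

Two sentences with opposite meanings receive the same vector, because the average depends only on which words appear and not on their order.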

‘I watched Snake Eyes at the movie theater’

‘You have the eyes of a snake’

Snake is part of a movie name in the first sentence, while it represents an animal in the second one. If we were to use word2vec, we would convert snake into the same vector for both sentences.

However, with transformers, this difference is taken into account and two different vectors are used: one occupies the same region of space as animals, while the other is placed in a region with warrior names.

Document encoding

The same method is used to subdivide several documents into different groups. One document is made up of several sentences, which are then encoded into individual vectors.

Encoding of the sentence: RelevanceAI is one of the most advanced vector-based technology startup on the market

At the end of the process, a document is represented by several vectors. However, we can average them to represent the document with a single vector. We can then apply a clustering algorithm on thousands of documents to group them into categories.
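The last two steps (averaging the sentence vectors of each document, then clustering the documents) can be sketched as follows. No real encoder is used: fake_document is a hypothetical stand-in that produces noisy four-dimensional sentence vectors around two invented topics.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
topic_a = np.array([1.0, 0.0, 0.0, 0.0])
topic_b = np.array([0.0, 0.0, 1.0, 0.0])

def fake_document(topic, n_sentences):
    """Stand-in for an encoder: one noisy 4-d vector per sentence."""
    return topic + rng.normal(scale=0.05, size=(n_sentences, 4))

# Six hypothetical documents with different numbers of sentences.
documents = [fake_document(topic_a, n) for n in (3, 5, 4)]
documents += [fake_document(topic_b, n) for n in (4, 3, 6)]

# Collapse each document (several sentence vectors) into a single vector.
document_vectors = np.array([doc.mean(axis=0) for doc in documents])

# Group the documents by clustering their averaged vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(document_vectors)
print(labels)
```

Averaging gives every document a vector of the same size regardless of its length, which is what allows a single clustering algorithm to run over the whole collection.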

How can you take advantage of clustering?

Do you need an end-to-end vector platform that incorporates all the state-of-the-art clustering algorithms and workflows?

This is where RelevanceAI comes into play.

Book your platform demo here with our vector experts and learn how you can take the next steps.

Alternatively, with knowledge of Python and Jupyter notebooks, you can create an account and get started today.

Michelangiolo Mazzeschi
April 14, 2022
Find out how your business can glean insights through unstructured data with vectors