K-means in Python

  • Sklearn experiment
  • Using the Relevance AI clustering dashboard on Elon Musk tweets
  • The clustering dashboard

In the previous two articles, we explored the theory behind K-means, including several use cases and examples where the algorithm can be employed.

In this article, I will focus on the code you can use to run a clustering experiment with k-means: first a simple sklearn example that applies NLP to a set of restaurant reviews, and then the Relevance AI platform and its advanced clustering dashboard.

Sklearn experiment

Load dataset

import pandas as pd

# the file is tab-separated, so reading it with sep=';' keeps each row in a single 'Review\tLiked' column
df = pd.read_csv('Restaurant_Reviews.csv', sep=';')
# drop the trailing tab and 0/1 "Liked" label, keeping only the review text
df['Review\tLiked'] = df['Review\tLiked'].apply(lambda x: x[:-2])
df.columns = ['review']
df

Encoding textual data

To perform this step, I'll use a library called sentence-transformers:

import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
tqdm.pandas()

# all-mpnet-base-v2 gives higher-quality embeddings; all-MiniLM-L6-v2 is a faster alternative
model = SentenceTransformer('all-mpnet-base-v2')

# encode the reviews row by row: fine for small datasets only
df['text_vector_'] = df['review'].progress_apply(lambda x: model.encode(x).tolist())
df
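Note that progress_apply encodes one review at a time, which is fine for a small dataset like this one. For larger datasets, a quicker option is to pass the whole list of texts to model.encode, which lets sentence-transformers batch them internally; a minimal sketch:

#encode the whole column in one call: faster on larger datasets
embeddings = model.encode(df['review'].tolist(), show_progress_bar=True)
df['text_vector_'] = embeddings.tolist()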

K-means algorithm

from sklearn.cluster import KMeans
import numpy as np

# fit k-means with 5 clusters on the sentence embeddings
kmeans = KMeans(n_clusters=5, random_state=0).fit(df['text_vector_'].values.tolist())
kmeans

# attach the cluster assignment of each review to the dataframe
df['cluster'] = kmeans.labels_
df
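The choice of n_clusters=5 is somewhat arbitrary. If you want to sanity-check it, a quick elbow plot of the k-means inertia over a range of k values can help; a minimal sketch:

#optional: elbow check, plotting the inertia for k = 2..10
import matplotlib.pyplot as plt

X = df['text_vector_'].values.tolist()
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in range(2, 11)]

plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()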

Dimensionality Reduction with PCA

from sklearn.decomposition import PCA
import plotly.express as px

# project the high-dimensional embeddings onto 2 principal components
pca = PCA(n_components=2, svd_solver='auto')
pca_result = pca.fit_transform(df['text_vector_'].values.tolist())
print(pca_result)

# 2D coordinates of each review, coloured by cluster
x = list(pca_result[:, 0])
y = list(pca_result[:, 1])
fig = px.scatter(df, x=x, y=y, color='cluster', hover_name='review')
fig.update_traces(textfont_size=22)
fig.show()
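Keep in mind that two principal components usually retain only a fraction of the variance of the 768-dimensional sentence embeddings, so the 2D plot is only an approximation of the cluster geometry. You can check how much variance is kept:

#fraction of variance explained by each of the two components
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())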

Dimensionality Reduction with UMAP

import umap
import plotly.express as px

# reduce the embeddings to 2 dimensions with UMAP, using cosine distance
umap_embeddings = umap.UMAP(
    n_neighbors=17,
    n_components=2,
    metric='cosine'
).fit_transform(df['text_vector_'].values.tolist())

# 2D coordinates of each review, coloured by cluster
x = list(umap_embeddings[:, 0])
y = list(umap_embeddings[:, 1])
fig = px.scatter(df, x=x, y=y, color='cluster', hover_name='review')
fig.update_traces(textfont_size=22)
fig.show()
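If you would rather have a numerical indication of cluster quality than a purely visual one, the silhouette score (computed on the original embeddings, not the 2D projection) is one common option; a minimal sketch:

from sklearn.metrics import silhouette_score

#values closer to 1 indicate better-separated clusters
print(silhouette_score(df['text_vector_'].values.tolist(), df['cluster']))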

Using the Relevance AI clustering dashboard on Elon Musk tweets

Install libraries

!pip install relevanceai
!pip install vectorhub
!pip install transformers
!pip install sentence-transformers

Declare variables

To create a clustering application, I will first import my data, preprocess it, and encode part of it. After uploading it to Relevance AI, I will apply a clustering algorithm to create a shareable application. The following parameters will be used throughout the project.

PROJECT_ID = '<project-name>'
API_KEY = '<api-key>'
REGION = 'us-east-1'
ENCODING_FIELDS = ["Text"] #fields to encode: must be a list
DATASET_ID = "elon_musk_twitter" #name of the dataset on Relevance AI
MODEL = "all-MiniLM-L6-v2"
VECTOR_SUFFIX = '_sentence_transformers_vector_'
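
The snippets below also assume that a Relevance AI client has been initialized with these credentials. The exact constructor arguments depend on the version of the relevanceai SDK you have installed; as a sketch, older versions accepted the project id and API key directly:

from relevanceai import Client

#assumption: constructor arguments may differ across relevanceai SDK versions
client = Client(project=PROJECT_ID, api_key=API_KEY)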

Import and create functions

To run the entire pipeline we will use a set of very common libraries, plus a few dependencies used to encode our data. In addition, we will define two custom functions: one to split our data into batches and one to extract zero-shot labels from text:

import nltk
import pandas as pd
import numpy as np
nltk.download('punkt')
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec
from sklearn.metrics.pairwise import euclidean_distances
import progressbar
import relevanceai

def batch_splitting(len_df, range_len):
    range_list = list()
    if range_len >= len_df:
        range_list.append([0, len_df])
    else:
        for a in range(int(len_df/range_len)):
            range_list.append([a*range_len, (a+1)*range_len])
        range_list.append([range_list[-1][1], len_df])
    return range_list

def zeroshot(df_text, df_text_vectors, model, top_common, top_sample):
    #df_text is the list of texts
    #df_text_vectors is the list of vectorized texts

    #tokenize all words
    all_words = []
    for t in list(df_text):
        all_words += nltk.tokenize.word_tokenize(t)

    #frequency dictionary
    all_words_dist = nltk.FreqDist(w.lower() for w in all_words)
    all_words_except_stop_dist = nltk.FreqDist(w.lower() for w in all_words if w not in stopwords and w.isalnum() and len(w) != 1)

    #dictionary of vectorized top frequent words
    dictionary_words = [{"_id": i,"label": w[0], "label_vector_": model.encode(w[0])} for i, w in enumerate(all_words_except_stop_dist.most_common(top_common))]

    #for each text vector, find the indices of the top_sample closest label vectors
    closest_topn_index = np.argsort(euclidean_distances(
        [d for d in df_text_vectors], 
        np.array([vectorized_word["label_vector_"] for vectorized_word in dictionary_words])
    ), axis=1)[:, :top_sample]

    word_list = list()
    count = 0
    for vector in df_text_vectors:
        tags = []
        for ind in closest_topn_index[count]:
            tags.append(dictionary_words[ind]["label"])
        word_list.append(tags)
        count += 1

    #we obtain a list of lists, as long as the sample itself
    return word_list
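
To make the batching logic concrete, batch_splitting returns a list of [start, stop) index pairs covering the whole dataset; for example:

print(batch_splitting(len_df=12, range_len=5))
#[[0, 5], [5, 10], [10, 12]]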

Import data

To perform this experiment, we have scraped Elon Musk's tweets from his personal Twitter account. We will upload the dataset, convert each tweet into a vector, and then perform clustering using the Relevance AI dashboard:

#if it's a pandas df (here the csv file shares its name with DATASET_ID)
df = pd.read_csv(rf'{DATASET_ID}.csv')
df = df.drop(['Unnamed: 0.1', 'index'], axis=1)
df = df.dropna()
#df.columns = ['_id']+list(df.columns)[1:]
df

Create a list of dictionaries

To upload data into Relevance AI, we need to convert our Pandas DataFrame into a list of dictionaries. We could do that with a single line of code, but because that can take a long time on big datasets, the best approach is to make the conversion in batches, so that the entire process takes only a few seconds:

#convert to df_ready in batches
rows_ = list()
for rows in range(0, len(df), 20):
    rows_.append(rows)
rows_.append(len(df))

df_ready = list()
for r in range(len(rows_)-1):
    #print(rows_[r], rows_[r+1])
    df_ready += df[rows_[r]:rows_[r+1]].to_dict(orient='records')
df_ready
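
For reference, the single-line equivalent mentioned above is pandas' to_dict, which is perfectly fine for small datasets:

#one-line version, fine for small datasets
df_ready = df.to_dict(orient='records')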

Encode

Now that I have prepared the data, I can encode the text, attach the zero-shot labels to the dataset, and finally upload everything to Relevance AI:

#clean dataset, otherwise repeated clustering throws error
#client.datasets.delete(dataset_id=DATASET_ID) #in case we want a fresh start
batches = batch_splitting(len_df=len(df_ready), range_len=5000)

bar = progressbar.ProgressBar(maxval=len(batches), widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
model = SentenceTransformer2Vec(MODEL)

#encoding
df_ready_encoded = list()
bar.start()
counter = 0
for batch in batches:
    bar.update(counter)
    current_vectors = model.encode_documents(documents=df_ready[batch[0]:batch[1]], fields=ENCODING_FIELDS)
    df_ready_encoded += current_vectors
    counter += 1
bar.finish()

#we operate on df_ready
df_ready = df_ready_encoded

#add zeroshot list
zeroshot_list = zeroshot([x[ENCODING_FIELDS[0]] for x in df_ready_encoded], [x[ENCODING_FIELDS[0]+VECTOR_SUFFIX] for x in df_ready_encoded], model, 5000, 10)
#update df_ready
for index in range(len(df_ready_encoded)):
    df_ready_encoded[index]['label_'+ENCODING_FIELDS[0]] = zeroshot_list[index]
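
At this point each document in df_ready is a dictionary containing the original tweet fields, the text vector (stored under the field name plus VECTOR_SUFFIX), and the list of zero-shot labels; you can quickly inspect one document to confirm:

#inspect the fields of the first encoded document
print(df_ready[0].keys())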

Upload data

It is now time to upload our encoded data to Relevance AI. To do this, I will use the bulk_insert function, which supports uploads of up to 250MB. However, I'll still upload the data in batches of 5000 samples each, to show how a batch upload works:

#upload
bar.start()
counter = 0
for batch in batches:
    bar.update(counter)
    client.datasets.bulk_insert(dataset_id=DATASET_ID, documents=df_ready[batch[0]:batch[1]])
    counter += 1
bar.finish()

Clustering

Before entering the dashboard and seeing the results in the clustering app, we need to generate a set of clusters. The following code runs k-means three times, each with a different number of clusters:

# Vector field on which clustering is performed (currently only one vector field is supported);
# here we cluster on the encoded text field, ENCODING_FIELDS[0] + VECTOR_SUFFIX

for CLUSTER in [180, 200, 220]:
  #local or remote?
  centroids = client.vector_tools.cluster.kmeans_cluster(
      dataset_id = DATASET_ID, 
      #vector_fields=[[f'{x}' for x in ENCODING_FIELDS][0]], 
      vector_fields=[ENCODING_FIELDS[0]+VECTOR_SUFFIX], #potential bug when in our dataset we do not have a text field
      alias=f"kmeans_{CLUSTER}",
      k = CLUSTER)

  #creates clusters but only gives the centroids
  #clustering results are uploaded on the database

  client.datasets.schema(DATASET_ID)

  client.services.cluster.centroids.list_closest_to_center(
    dataset_id=DATASET_ID,
    #vector_fields=[[f'{x}' for x in ENCODING_FIELDS][0]], 
    vector_fields=[ENCODING_FIELDS[0]+VECTOR_SUFFIX],
    page_size=40,
    #cluster_ids=[], # Leave this as an empty list if you want all of the clusters
    alias=f"kmeans_{CLUSTER}" #change to 'kmeans_10' 
  )

The clustering dashboard

Editing the Dashboard

Now that we have prepared and uploaded the data to Relevance AI, we need to specify which insights we want to see on our dashboard. There are three main sections we can edit:

  • Label section: by editing the label section, we can specify which labels are shown for every cluster and how many of them we want to visualize.
  • Metric section: in the metric section, we can specify the numerical measurements used to sort the clusters and look at our data from different viewpoints.
  • Groupby section: here I will use the zero-shot labels extracted from the text, visualized in a WordCloud, to see which words best represent each cluster of tweets.

Insights extraction

Using the Relevance AI dashboard, we can sort the clusters by any of the metrics we selected. For example, we might want to see which clusters are the biggest (sorting by cluster size in descending order), or which clusters were the most popular, by sorting them by number of retweets.

How can you take advantage of clustering?

Do you need an end-to-end vector platform that incorporates state-of-the-art clustering algorithms and workflows?

This is where Relevance AI comes into play.

Book your platform demo here with our vector experts and learn how you can take the next steps.

Alternatively, with some knowledge of Python and Jupyter notebooks, you can create an account and get started today.
