Ken Jee is a data scientist and YouTube personality with a strong focus on sports analytics.
He published a dataset on Kaggle covering 223 of his YouTube videos. It contains the following:
1) Aggregated Metrics By Video – all topline metrics from his channel (2015 to Jan 22 2022).
2) Aggregated Metrics By Video with Country and Subscriber Status – Same data as aggregated metrics by video, but it includes dimensions for which country people are viewing from and if the viewers are subscribed to the channel or not.
3) Video Performance Over Time – Daily data from each of Ken’s videos.
4) All Comments – All comment data gathered from the YouTube API with usernames anonymized.
I analyzed this dataset using Natural Language Processing (NLP) text vectors and clustering to find which video titles perform best, meaning which ones draw the highest engagement. You can interact with the clustering results in the dashboard below:
Dashboard Link: Take a look at the Cluster app and play around with it here.
Best Performing Video Overall – How to start learning Data Science (Highest watch time, Subscriptions added and Views)
This cluster has a clear theme. The most popular videos, with the highest watch times and subscriptions, are all about learning data science or data science education.
Worst Performing Overall – Golf
The worst performing video cluster was related to golf. It picked up only one video: “Golf: Would You Rather Be the LONGEST or STRAIGHTEST Driver on the PGA Tour?”
The dataset I used was “Aggregated_Metrics_By_Country_And_Subscriber_Status.csv”, the aggregated video stats grouped by country and video ID.
Steps to make the App:
- Pre-process: quick cleaning of the data to use only the crucial data points
- Vectorize: vectorize and turn all the text into text vectors using
- Upload: Insert the data into Relevance AI’s data experimentation platform.
- Clustering + Experiment + Evaluate: Experiment with different configurations of KMeans and qualitatively evaluate them within the Cluster App to find the best one. This is done by looking through each cluster and its furthest_from_center to see if each cluster’s text makes sense.
- Interpret: I went through the data running filters and sorts to find the best performing videos for each metric.
- Deploy: Once I picked the best and worst videos, I clicked deploy so I could share my findings.
From start to finish, steps 1 to 6 took 30 minutes.
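The vectorize, cluster and evaluate steps above can be sketched in a few lines. This is only an illustration and not the code behind the Cluster App: it assumes scikit-learn's `TfidfVectorizer` as a stand-in for the text vectorizer, uses a handful of hypothetical titles rather than the real dataset, and the helper names `cluster_titles` and `furthest_from_center` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample titles; the real input would be the dataset's title column.
titles = [
    "How to start learning data science",
    "Learning data science from scratch",
    "Data science portfolio project review",
    "Reviewing your data science portfolio",
    "Golf: longest or straightest driver on the PGA Tour",
    "Golf swing analytics on the PGA Tour",
]


def cluster_titles(titles, n_clusters, seed=0):
    """Vectorize titles with TF-IDF (an assumed stand-in), then run KMeans."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(vectors)
    # transform() gives the distance from every title to every cluster center.
    distances = model.transform(vectors)
    # Keep only the distance to the center of the cluster each title belongs to.
    own_distance = distances[np.arange(len(titles)), model.labels_]
    return model.labels_, own_distance


def furthest_from_center(titles, labels, own_distance):
    """For each cluster, return the member title furthest from its center."""
    result = {}
    for cluster in sorted(set(labels)):
        members = np.where(labels == cluster)[0]
        result[int(cluster)] = titles[members[np.argmax(own_distance[members])]]
    return result


labels, own_distance = cluster_titles(titles, n_clusters=3)
outliers = furthest_from_center(titles, labels, own_distance)
for cluster, title in outliers.items():
    print(cluster, title)
```

Trying a few values of `n_clusters` and reading the furthest title in each cluster mirrors the qualitative check described in the steps: if the furthest title still fits the cluster's theme, the cluster is probably coherent.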
Google Colab/Jupyter Code to reproduce it here.
Other data Insights:
We know the best performing and worst performing, but what about 2nd and 3rd best for different metrics?
- The main takeaway is that topics unrelated to data science didn’t do too well. The worst two video categories by watch time were:
- “Random” video content
- “Youtube/Q&A/Podcasts” video content.
- This was similar for worst views and user subscriptions as well, with the addition of video content related to “Life lessons”.
- The videos that added most subscribers were categorized within more niche Data Science related topics such as:
- “What is ___?” (Content diving deep into data science tools such as lambda, pandas, etc.)
- “Problems with Data Science”
- “Sports Analytics”
- The most engaging ones in terms of watch time and views were more related to getting jobs in data science, such as:
- “Data Science Portfolio”
- “Data Science Interviews and Internships”.
- The Reviewing your Data Science Projects series had a mixture of high and low performing videos.
- The portfolio specific cluster performed better in terms of engagement (views and user subscription added) compared to the general & non-specific cluster.
Looking through hundreds, or hundreds of thousands, of videos can be very time consuming. With natural language processing and clustering, we can save a lot of time by summarizing similar videos into groups and looking at how these groups perform instead.
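As a rough sketch of what comparing groups instead of individual videos can look like, the snippet below ranks clusters by a chosen metric with pandas. The column names, cluster labels and numbers are all hypothetical toy values, not figures from Ken's dataset, and summing per cluster is an assumption; a mean may be fairer when clusters differ a lot in size.

```python
import pandas as pd

# Hypothetical toy rows: one per video, with its cluster label and metrics.
videos = pd.DataFrame(
    {
        "cluster": ["learning", "learning", "portfolio", "golf"],
        "views": [1000, 800, 600, 50],
        "watch_time_hours": [120.0, 90.0, 70.0, 2.0],
        "subscribers_added": [40, 30, 25, 0],
    }
)


def rank_clusters(df, metric):
    """Sum a metric per cluster and sort clusters from best to worst."""
    return df.groupby("cluster")[metric].sum().sort_values(ascending=False)


ranking = rank_clusters(videos, "watch_time_hours")
print(ranking)
```

Calling `rank_clusters` once per metric (views, watch time, subscribers added) gives the best, second best and worst groups for each one without reading every video row.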
Don’t hesitate to reach out to us via our Slack community channel to learn how your unstructured data can be unleashed.