Product Categorization with Machine Learning

  • Categorizing products in e-Commerce
  • Product categorization using machine learning
  • Product categorization with similarity.ai

What Is Product Categorization and Why Does It Matter?

Product categorization is essential for most e-commerce stores. Stores must show customers the products they want to buy as quickly as possible. Otherwise, potential customers churn and buy from stores where they can find those products immediately.

However, e-commerce stores have hundreds of thousands of different items for sale, and each customer only wants to buy a few of them. Online stores therefore need to categorise and group their products so that customers only have to look at the groups they care about. A customer buying a keyboard in a tech store does not want to scroll past televisions, because that makes their item take longer to find.

The solution to this is ‘Product Categorization’.

For example, customers could find keyboards under Computers → Computer Accessories → Keyboards. They could find televisions under TVs → LED TVs.

E-commerce Product Categorization

E-commerce stores can improve categories by adding new ones. If we want to add a category for mechanical keyboards, employees need to move every mechanical keyboard to Computers → Computer Accessories → Keyboards → Mechanical Keyboards.

The challenge comes with large volumes of products and categories. With hundreds of thousands of products to relabel by hand, stores can only experiment with categories very slowly.

To solve this, we can have a computer read item titles and descriptions and categorise them, so that employees do not have to label each item by hand.

For example, if the title of an item is ‘Wireless Mechanical Keyboard’, we can tell the computer that titles containing the words ‘mechanical’ and ‘keyboard’ belong in Computers → Computer Accessories → Keyboards → Mechanical Keyboards.

This is called ‘tagging’, and it’s the most common way e-commerce stores categorise items.

When tagging, employees must set up rules to ‘tag’ items.

Setting up these rules takes a very long time if there are thousands of categories. Rules also fail when products have wrong or incomplete titles and descriptions. For example, a product titled ‘machine keyboard’ might not get categorised as a mechanical keyboard.
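The rule-based tagging described above can be sketched in a few lines of Node.js. This is only an illustration: the rule table `TAG_RULES` and the function `tagByRules` are hypothetical names, and a real store would have far more rules.

```javascript
// Hypothetical rule table (not taken from any real store): a rule matches
// when every one of its keywords appears as a word in the product title.
const TAG_RULES = [
  {
    keywords: ['mechanical', 'keyboard'],
    category: 'Computers → Computer Accessories → Keyboards → Mechanical Keyboards',
  },
  {
    keywords: ['keyboard'],
    category: 'Computers → Computer Accessories → Keyboards',
  },
];

// Returns the category of the first matching rule, or null if no rule matches.
function tagByRules(title) {
  const words = title.toLowerCase().split(/\W+/).filter(Boolean);
  const rule = TAG_RULES.find(r => r.keywords.every(k => words.includes(k)));
  return rule ? rule.category : null;
}

console.log(tagByRules('Wireless Mechanical Keyboard'));
console.log(tagByRules('machine keyboard'));
```

The first title lands in Mechanical Keyboards, but ‘machine keyboard’ only matches the generic Keyboards rule, because no rule knows that ‘machine’ is close to ‘mechanical’.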

How to categorize products using machine learning?

An employee tagging products knows that the word ‘machine’ is similar to the word ‘mechanical’, and so assigns an item titled ‘machine keyboard’ to the ‘mechanical keyboards’ category. With machine learning, we can teach computers to make the same decisions.
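One common way for a program to judge that two words are ‘similar’ is to represent each word as a vector of numbers and compare the vectors with cosine similarity. The sketch below uses tiny made-up vectors purely to show the arithmetic; it is not a description of how similarity.ai works internally, and `TOY_VECTORS` is a hypothetical name.

```javascript
// Toy 3-dimensional vectors, invented purely for illustration.
// Real language models produce vectors with hundreds of dimensions.
const TOY_VECTORS = {
  mechanical: [0.9, 0.1, 0.3],
  machine: [0.8, 0.2, 0.4],
  television: [0.1, 0.9, 0.2],
};

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  return dot / (norm(a) * norm(b));
}

const { mechanical, machine, television } = TOY_VECTORS;
console.log(cosineSimilarity(mechanical, machine).toFixed(2));
console.log(cosineSimilarity(mechanical, television).toFixed(2));
```

With these toy numbers, ‘mechanical’ and ‘machine’ score close to 1, while ‘mechanical’ and ‘television’ score much lower.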

In the past, training computers “to think in the same way” as humans was challenging and required machine learning expertise and access to lots of data to train the machines.

Thanks to improvements since 2020 (e.g. transformer models), people with no machine learning expertise can now use these tools easily.

One powerful option is to use the similarity.ai API to categorise e-commerce items. It is also cheap, costing $2 per 1,000,000 characters of text.

To use similarity.ai, we need to sign up at https://dashboard.similarity.ai/. Signing up is free.
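To get a feel for the quoted price, the arithmetic can be sketched as below. It assumes the cost scales with the total number of characters in the texts sent; how similarity.ai actually meters requests (for example, whether category labels count) is not stated here, and `estimateCostUsd` is a hypothetical helper.

```javascript
// Price quoted in the article: $2 per 1,000,000 characters of text.
const USD_PER_MILLION_CHARACTERS = 2;

// Assumption: cost scales with the total number of characters sent.
function estimateCostUsd(texts) {
  const totalCharacters = texts.reduce((sum, text) => sum + text.length, 0);
  return (totalCharacters / 1e6) * USD_PER_MILLION_CHARACTERS;
}

// Example: 10,000 placeholder titles of 60 characters each (600,000 characters).
const sampleTitles = Array.from({ length: 10000 }, () => 'x'.repeat(60));
console.log(`$${estimateCostUsd(sampleTitles).toFixed(2)}`);
```

Under that assumption, 10,000 titles of 60 characters each come to $1.20.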

How to use similarity.ai for Product Categorization

For many developers, this will be their first time using machine learning to solve product categorization. To get used to how machine learning works, we will use the similarity.ai testing web page. Go to https://dashboard.similarity.ai/, and then click ‘Classify’ in the menu at the top of the window.

We will use items from Algolia’s ecommerce dataset to test the similarity.ai machine learning tool, but you can use your own product titles and categories as well.

Add these categories in the ‘Labels’ section, each as a separate label:

Cell Phones > Cell Phone Accessories > iPhone Accessories

Computers & Tablets > Laptops > All Laptops > PC Laptops

Cameras & Camcorders > Digital Camera Accessories > Camera Bags & Cases

Add these product titles in the ‘Sentences’ section, each as a separate sentence:

ADOPTED - Leather Wrap Case for Apple® iPhone® 6 - White/Gold

Acer - Aspire 15.6" Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Diamond Black

Acme Made - Fillmore 100 Camera Case - Licorice Lime

Finally, we run the classifier by clicking ‘Classify Sentences’.

The first category in the list for each product is the one that the machine learning classifier chose.

It correctly categorised all three!

Here we only categorised three products into three categories. Next, we will categorise several thousand products into hundreds of categories using the similarity.ai API.

Categorising an entire storefront using similarity.ai

We can call the similarity.ai API to categorise thousands of items from any programming language. We will use Node.js, but any language that can make HTTP requests works, including Python, Java, Ruby, PHP, Bash or Scala.

We use Algolia’s open source ecommerce dataset here, which we saved as a local file: https://raw.githubusercontent.com/algolia/datasets/master/ecommerce/records.json

You can use your own products and categories here as well.

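Save the Algolia dataset JSON in a file named ecommerce-products.json.

In the same directory, we create a file called test-product-categorisation.js.

We need to install the package ‘node-fetch’ (https://www.npmjs.com/node-fetch) to call the similarity.ai API. Because the script loads it with require, install version 2 (npm install node-fetch@2); version 3 of node-fetch is ESM-only and cannot be loaded with require.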

We also need to define an environment variable SIMILARITY_AI_API_KEY that contains our similarity.ai API key.
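A missing API key is easy to overlook, so it can help to fail early with a clear message. The helpers below are a minimal sketch; `requireApiKey` and `buildHeaders` are hypothetical names, and the header shape simply mirrors the one used in the script that follows.

```javascript
// Reads the API key from an environment-like object (defaults to process.env)
// and throws a descriptive error if it is missing or empty.
function requireApiKey(env = process.env) {
  const apiKey = env.SIMILARITY_AI_API_KEY;
  if (!apiKey) {
    throw new Error('SIMILARITY_AI_API_KEY is not set');
  }
  return apiKey;
}

// Builds request headers in the same shape the categorisation script uses.
function buildHeaders(apiKey) {
  return {
    authorization: `ApiKey ${apiKey}`,
    'content-type': 'application/json',
  };
}
```

Calling `buildHeaders(requireApiKey())` at the top of the script would stop the run before any request is made if the variable is not defined.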

const fetch = require('node-fetch');
const fs = require('fs');
const products = require('./ecommerce-products.json');

We define a function that will get our product titles and product categories from the ecommerce products:

function GetProductInfo(){
  const TITLE_FIELD = 'name';
  const CATEGORY_FIELD = 'categories';
  const titlesList = products.map(product => product[TITLE_FIELD]);
  const categoriesList = products.map(product => product[CATEGORY_FIELD][product[CATEGORY_FIELD].length - 1]);
  const uniqueCategoriesList = [...(new Set(categoriesList))];
  console.log(`categorizing ${titlesList.length} products`);
  console.log(`categorizing ${uniqueCategoriesList.length} categories`);
  return {titlesList,uniqueCategoriesList};
}
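To see what this function extracts, here is a sketch of the same logic applied to a few hand-written records. It assumes each record has a `name` string and a `categories` array ordered from broad to specific; `getProductInfo` (taking the products as a parameter) and `sampleProducts` are hypothetical, and the third record is invented to show how a product without categories could be skipped.

```javascript
// Hand-written records in the shape the script assumes.
const sampleProducts = [
  {
    name: 'ADOPTED - Leather Wrap Case for Apple® iPhone® 6 - White/Gold',
    categories: ['Cell Phones', 'Cell Phone Accessories', 'iPhone Accessories'],
  },
  {
    name: 'Acer - Aspire 15.6" Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Diamond Black',
    categories: ['Computers & Tablets', 'Laptops', 'All Laptops', 'PC Laptops'],
  },
  { name: 'Uncategorised sample product', categories: [] }, // invented edge case
];

// Same idea as GetProductInfo, but parameterised and tolerant of missing categories.
function getProductInfo(products, titleField = 'name', categoryField = 'categories') {
  const titlesList = products.map(product => product[titleField]);
  const leafCategories = products
    .map(product => (product[categoryField] || []).slice(-1)[0]) // most specific category
    .filter(category => category !== undefined);
  const uniqueCategoriesList = [...new Set(leafCategories)];
  return { titlesList, uniqueCategoriesList };
}

console.log(getProductInfo(sampleProducts).uniqueCategoriesList);
```

Only the last, most specific entry of each `categories` array is used as a label, so the three records above yield the two labels ‘iPhone Accessories’ and ‘PC Laptops’.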

We send our data to the similarity.ai API to be categorised. In the request body JSON, ‘documents’ is an array of titles and ‘labels’ is an array of categories. The API key from the SIMILARITY_AI_API_KEY environment variable goes in the authorization header to tell similarity.ai who we are.
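As an illustration of that request body, the sketch below builds one for two titles and two categories from earlier in the article and measures its size in bytes. `buildClassifyBody` is a hypothetical helper, and the actual payload limit of the API is not stated here, so the size is only printed, not checked against a limit.

```javascript
// Builds the JSON body described above: titles go in `documents`, categories in `labels`.
function buildClassifyBody(documents, labels) {
  return { documents, labels };
}

const body = buildClassifyBody(
  [
    'ADOPTED - Leather Wrap Case for Apple® iPhone® 6 - White/Gold',
    'Acer - Aspire 15.6" Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Diamond Black',
  ],
  ['iPhone Accessories', 'PC Laptops']
);

// Size of the serialised body in bytes (non-ASCII characters such as ® take more than one byte).
const payloadBytes = Buffer.byteLength(JSON.stringify(body), 'utf8');

console.log(JSON.stringify(body, null, 2));
console.log(`payload size: ${payloadBytes} bytes`);
```

Measuring the serialised size like this is one way to choose a chunk size that stays under whatever limit the API enforces.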

async function CategorizeProducts(){
  const {titlesList,uniqueCategoriesList} = GetProductInfo();
  let finalClassifications = [];
  // The API has a maximum byte payload size. We need to split the list of products into chunks to be below this limit.
  const CHUNK_SIZE = 2000;
  for ( let i = 0; i < titlesList.length; i += CHUNK_SIZE) {
    const body = {
      documents:titlesList.slice(i,i+CHUNK_SIZE),
      labels:uniqueCategoriesList
    };
    console.time(`categorized ${i} to ${i+CHUNK_SIZE}`);
    const response = await fetch("https://api.ap-southeast-2.similarity.ai/classify/zeroshot",{
      "headers": {
        "authorization": `ApiKey ${process.env.SIMILARITY_AI_API_KEY}`,
        "content-type": "application/json",
      },
      "body": JSON.stringify(body),
      "method": "POST"
    });
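    // Stop early with a clear message if the API rejects the request.
    if (!response.ok) {
      throw new Error(`similarity.ai request failed with status ${response.status}`);
    }
    const {classifications} = await response.json();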
    console.timeEnd(`categorized ${i} to ${i+CHUNK_SIZE}`);
    finalClassifications = finalClassifications.concat(classifications);
  }
  fs.writeFileSync('categorized-products-output.json',JSON.stringify(finalClassifications,null,2));
  console.log(JSON.stringify(finalClassifications.slice(0,10),null,2));
}
CategorizeProducts();
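The chunking loop in the script can also be written as a small standalone helper, which makes it easy to check on its own. This is a sketch; `splitIntoChunks` is a hypothetical name.

```javascript
// Splits an array into consecutive chunks of at most `chunkSize` items.
function splitIntoChunks(items, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

console.log(splitIntoChunks([1, 2, 3, 4, 5], 2));
```

Five items with a chunk size of 2 give the chunks [1, 2], [3, 4] and [5]. With a chunk size of 2000, a list of 10,000 titles is sent in 5 requests.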

Finally, we run our script using node test-product-categorisation.js

It took 50 seconds to categorise 10,000 products into 800 categories.

We save our results to the file categorized-products-output.json

In each result, ‘document’ is the product title and ‘labels[0].label’ is the machine learning program’s predicted category out of the 800 categories.

[
  {
    "document": "3-Year Unlimited Cloud Storage Service Activation Card - Other",
    "labels": [
      {
        "label": "Prepaid Game Cards",
        "score": 0.38910271588519424
      }
    ]
  },
  {
    "document": "360fly - Panoramic 360° HD Video Camera - Black",
    "labels": [
      {
        "label": "360 & Panoramic Cameras",
        "score": 0.675741954956854
      }
    ]
  },
  {
    "document": "3DR - Backpack for Solo - Black",
    "labels": [
      {
        "label": "Camera Backpacks",
        "score": 0.5568706497949379
      }
    ]
  },
  {
    "document": "3DR - Propellers for 3DR Solo Drones (2-Pack) - Black",
    "labels": [
      {
        "label": "Drone Parts",
        "score": 0.49319492263359466
      }
    ]
  },
  {
    "document": "3DR - Solo Gimbal - Black",
    "labels": [
      {
        "label": "Gimbals",
        "score": 0.541564647093372
      }
    ]
  },
  {
    "document": "3DR - Solo Smart Rechargeable Battery - Black",
    "labels": [
      {
        "label": "Laptop Batteries",
        "score": 0.3932439709341342
      }
    ]
  }
]

For some product titles, the API is not sure which category they belong to, and these predictions have a low score. To categorise these products better, we can send the product description as well, or use manual tagging on top of the machine learning algorithm to achieve very high accuracy.

For more information, please see https://dashboard.similarity.ai
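One way to combine the two approaches is to flag low-scoring predictions for manual review. The sketch below uses two entries copied from the output above; the cut-off `REVIEW_THRESHOLD` is a hypothetical value that would need tuning on your own results, and `summarize` is a hypothetical helper that assumes the best label comes first in `labels`.

```javascript
// Hypothetical cut-off; pick a value by inspecting your own results.
const REVIEW_THRESHOLD = 0.45;

// Two entries copied from the output shown in the article.
const classifications = [
  {
    document: '360fly - Panoramic 360° HD Video Camera - Black',
    labels: [{ label: '360 & Panoramic Cameras', score: 0.675741954956854 }],
  },
  {
    document: '3DR - Solo Smart Rechargeable Battery - Black',
    labels: [{ label: 'Laptop Batteries', score: 0.3932439709341342 }],
  },
];

// Keeps the top prediction per product and marks low scores for manual review.
function summarize(results, threshold) {
  return results.map(({ document, labels }) => {
    const top = labels[0]; // assumes labels are ordered best-first
    return {
      title: document,
      category: top.label,
      score: top.score,
      needsReview: top.score < threshold,
    };
  });
}

const needsReview = summarize(classifications, REVIEW_THRESHOLD).filter(r => r.needsReview);
console.log(needsReview.map(r => r.title));
```

With this threshold, only the drone battery that was predicted as ‘Laptop Batteries’ is sent for manual review.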

Josh PF
June 30, 2022