Quickstart (K-Means)

Clustering showing how a certain variety of images can improve pricing

If you haven't set up your client or installed the necessary dependencies, get set up first.

Introduction

Clustering groups items so that those in the same group (cluster) share meaningful similarities, such as specific features or properties. By surfacing patterns in the data, clustering supports informed decision-making. Building on strong vector representations, Relevance AI provides powerful and easy-to-use clustering endpoints.

In this guide, you will learn to run K-Means clustering with a single function call. K-Means clustering partitions your dataset into K distinct clusters, assigning each data point to the cluster whose centroid is nearest.
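To make the idea of partitioning into K clusters concrete, here is a minimal sketch of the standard K-Means loop (often called Lloyd's algorithm) in plain Python. It is an illustration only and not the Relevance AI implementation; the function names `squared_distance` and `kmeans`, the toy 2-D points, and the fixed iteration count are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Toy K-Means: returns (labels, centroids) for a list of vectors."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two visually obvious groups of 2-D points (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other. Which group receives label 0 depends on the random initialisation.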


1. Create a dataset and insert data

First, you need to set up a client object to interact with Relevance AI. You also need a dataset under your Relevance AI account. You can either use our sample data as shown in this step or follow the tutorial on how to create your own dataset.

from relevanceai import Client 

client = Client()

In this guide, we use our sample e-commerce dataset, which includes fields such as product_title as well as vectorized versions of those fields, such as product_title_clip_vector_. You can load these documents via:

from relevanceai.datasets import get_dummy_ecommerce_dataset
documents = get_dummy_ecommerce_dataset()

Next, upload these documents to your personal Relevance AI account under the name quickstart_clustering_kmeans:

DATASET_ID = 'quickstart_clustering_kmeans'
client.insert_documents(dataset_id=DATASET_ID, docs=documents)

Let's have a look at the schema to see what vector fields are available for clustering.

client.datasets.schema(DATASET_ID)
{
 'insert_date_': 'date',
 'product_image': 'text',
 'product_image_clip_vector_': {'vector': 512},
 'product_link': 'text',
 'product_price': 'text',
 'product_title': 'text',
 'product_title_clip_vector_': {'vector': 512},
 'query': 'text',
 'source': 'text'
}
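If you want to pick out the vector fields programmatically rather than by eye, a small helper along these lines may help. It assumes the schema is a plain dictionary shaped like the output above, where vector fields map to a `{'vector': <dimension>}` entry; the helper name `vector_fields` is hypothetical and not part of the Relevance AI SDK.

```python
def vector_fields(schema):
    """Return the names of fields whose schema entry looks like {'vector': n}."""
    return sorted(
        name
        for name, field_type in schema.items()
        if isinstance(field_type, dict) and "vector" in field_type
    )


# Schema copied from the output shown above.
schema = {
    "insert_date_": "date",
    "product_image": "text",
    "product_image_clip_vector_": {"vector": 512},
    "product_link": "text",
    "product_price": "text",
    "product_title": "text",
    "product_title_clip_vector_": {"vector": 512},
    "query": "text",
    "source": "text",
}

fields = vector_fields(schema)
print(fields)
```

For this schema the helper returns the two CLIP vector fields, either of which can be used for clustering.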

2. Run K-Means clustering in one go

The K parameter of the K-Means algorithm is set to 10 by default, but you can change it via the k argument.
Simply call the kmeans_cluster() function with the arguments below to receive the centroids.

# Vector field based on which clustering is done
VECTOR_FIELD = 'product_title_clip_vector_'

# K in the Kmeans algorithm
KMEAN_NUMBER_OF_CLUSTERS = 10

clustered_documents = client.vector_tools.cluster.kmeans_cluster(
    dataset_id = DATASET_ID, 
    vector_fields = [VECTOR_FIELD],
    k = KMEAN_NUMBER_OF_CLUSTERS,
    alias = 'kmeans_10'
)

The kmeans_cluster() function performs the following steps:

  1. loading the data
  2. clustering
  3. writing the results back to the dataset
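The three steps above can be sketched on in-memory documents as follows. This is an illustration under stated assumptions, not the SDK's internals: the function names, the toy field `title_vector_`, the alias `toy`, the fixed centroids, and the integer cluster labels are all hypothetical. Only the nested `_cluster_` / vector field / alias layout is taken from this guide.

```python
def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, centroids[i])),
    )


def cluster_and_write_back(documents, vector_field, alias, centroids):
    # 1. Load the data: pull the vectors out of the documents.
    vectors = [doc[vector_field] for doc in documents]
    # 2. Cluster: here, a simple nearest-centroid assignment stands in
    #    for a full K-Means run.
    labels = [nearest_centroid(v, centroids) for v in vectors]
    # 3. Write the results back under doc["_cluster_"][vector_field][alias].
    for doc, label in zip(documents, labels):
        doc.setdefault("_cluster_", {}).setdefault(vector_field, {})[alias] = label
    return documents


# Hypothetical documents with 2-D vectors.
docs = [
    {"product_title": "red shoe", "title_vector_": [0.1, 0.0]},
    {"product_title": "oak table", "title_vector_": [0.9, 1.1]},
    {"product_title": "blue shoe", "title_vector_": [0.2, 0.1]},
]
cluster_and_write_back(docs, "title_vector_", "toy", [[0.0, 0.0], [1.0, 1.0]])
print([d["_cluster_"]["title_vector_"]["toy"] for d in docs])
```

Storing the label under an alias lets several clustering runs (for example, with different values of K) coexist on the same documents.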

By loading the data from the dataset after clustering is done, you can see to which cluster each data point belongs. Here, we see how the first 5 data points are clustered:

from relevanceai import show_json

sample_documents = client.datasets.documents.list(DATASET_ID, page_size=5)
# Keep only the product title and the assigned cluster for each document
samples = [{
    "product_title": d["product_title"],
    "cluster": d["_cluster_"][VECTOR_FIELD]["kmeans_10"]
} for d in sample_documents["documents"]]

show_json(samples, text_fields=["product_title", "cluster"])

Clustering results fetched back from a dataset
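Once cluster labels are stored on the documents, a quick sanity check is to count how many documents fall into each cluster. The sketch below assumes documents follow the nested `_cluster_` layout used in this guide; the helper name `cluster_sizes`, the sample documents, and the integer labels are hypothetical.

```python
from collections import Counter


def cluster_sizes(documents, vector_field, alias):
    """Count documents per cluster label, skipping unclustered documents."""
    counts = Counter()
    for doc in documents:
        label = doc.get("_cluster_", {}).get(vector_field, {}).get(alias)
        if label is not None:
            counts[label] += 1
    return counts


# Hypothetical documents; the last one has not been clustered.
field, alias = "product_title_clip_vector_", "kmeans_10"
documents = [
    {"_cluster_": {field: {alias: 0}}},
    {"_cluster_": {field: {alias: 1}}},
    {"_cluster_": {field: {alias: 0}}},
    {"_cluster_": {field: {alias: 2}}},
    {"product_title": "not clustered yet"},
]

sizes = cluster_sizes(documents, field, alias)
print(sizes.most_common())
```

Very uneven cluster sizes can be a hint that a different value of K is worth trying.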

If you would like to know more about what happens behind the scenes, visit our next page on step-by-step clustering.

