Quickstart (step by step)

Get started with clustering in a few minutes!

Clustering showing how a certain variety of images can improve pricing

If you haven't set up your client or installed the necessary dependencies, get set up first.

Introduction

Clustering groups items so that those in the same group (cluster) share meaningful similarities, such as specific features or properties. By surfacing patterns in the data, clustering supports informed decision-making. Building on strong vector representations, Relevance AI provides powerful and easy-to-use clustering endpoints.

In this guide, you will learn to run clustering with the K-Means algorithm, which aims to partition your dataset into K distinct clusters.
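To make the idea of partitioning into K clusters concrete, here is a minimal pure-Python sketch of K-Means. It is an illustration only and not the implementation Relevance AI uses; the function names (`kmeans`, `squared_distance`, `mean_vector`) and the toy 2-D points are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    # Component-wise mean of a non-empty list of vectors.
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    # Illustrative K-Means: pick k starting centroids, then alternate
    # between assigning points and recomputing centroids.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # Assignments are stable, so the algorithm has converged.
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated groups of 2-D points.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three the other. Which group receives label 0 depends on the random initialisation.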

1. Create a dataset and insert data

First, set up a client object to interact with Relevance AI. You also need a dataset under your Relevance AI account: you can either use our dummy sample data as shown below, or follow the tutorial on how to create your own dataset.

from relevanceai import Client 
client = Client()

In this guide, we use our e-commerce dataset, which includes fields such as product_name, as well as product_name_default_vector_, the vectorized version of product_name. You can load these documents via:

from relevanceai.datasets import get_dummy_ecommerce_dataset
documents = get_dummy_ecommerce_dataset()
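Because clustering operates on the vector field, it can be worth confirming that every document actually carries one before going further. The sketch below is a plain-Python check and does not call the Relevance AI SDK; the helper name `split_by_vector_field` and the two sample documents are hypothetical and stand in for the real `documents` list.

```python
def split_by_vector_field(documents, vector_field):
    # Separate documents that have a non-empty list under vector_field
    # from those that do not.
    with_vector, without_vector = [], []
    for doc in documents:
        value = doc.get(vector_field)
        if isinstance(value, list) and len(value) > 0:
            with_vector.append(doc)
        else:
            without_vector.append(doc)
    return with_vector, without_vector


# Hypothetical sample documents, shaped like simple e-commerce records.
sample_documents = [
    {"_id": "1", "product_name": "Red shoes",
     "product_name_default_vector_": [0.1, 0.2, 0.3]},
    {"_id": "2", "product_name": "Blue hat"},  # No vector field.
]

usable, skipped = split_by_vector_field(
    sample_documents, "product_name_default_vector_"
)
print(len(usable), "usable,", len(skipped), "skipped")
```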

Next, upload these documents to your Relevance AI account as a dataset named quickstart_clustering:

DATASET_ID = "quickstart_clustering"
client.insert_documents(dataset_id=DATASET_ID, docs=documents)

2. Run the clustering algorithm (KMeans)

To run KMeans clustering, first define a clustering object, KMeans, which loads the clustering algorithm with a specified number of clusters.

from relevanceai.vector_tools.cluster import KMeans
KMEAN_NUMBER_OF_CLUSTERS = 10
clusterer = KMeans(k=KMEAN_NUMBER_OF_CLUSTERS)

Next, the algorithm is fitted on the vector field, product_name_default_vector_, to assign each document to a cluster. The call returns the cluster to which each document belongs.

VECTOR_FIELD = "product_name_default_vector_"
clustered_documents = clusterer.fit_documents(
    vector_fields = [VECTOR_FIELD], # Cluster on a single vector field
    documents = documents,
    return_only_clusters = True # If True, return only clusters
)
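Once each document has a cluster label, a quick sanity check is to count how many documents fall into each cluster. This guide does not show the exact shape of the returned documents, so the sketch below assumes each one is a dict with its label stored under a hypothetical key, `cluster_label`. Adjust `label_key` to match what `fit_documents` actually returns in your version of the SDK.

```python
from collections import Counter


def cluster_sizes(documents, label_key="cluster_label"):
    # Count documents per cluster label.
    # label_key is a hypothetical field name used for illustration.
    return Counter(doc[label_key] for doc in documents)


# Hypothetical stand-in for clustered_documents.
sample_clustered_documents = [
    {"_id": "a", "cluster_label": "cluster-0"},
    {"_id": "b", "cluster_label": "cluster-1"},
    {"_id": "c", "cluster_label": "cluster-0"},
    {"_id": "d", "cluster_label": "cluster-2"},
]

sizes = cluster_sizes(sample_clustered_documents)
for label, count in sizes.most_common():
    print(label, count)
```

Very uneven cluster sizes, or clusters with only a handful of documents, can be a hint that a different number of clusters is worth trying.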

3. Update the dataset with the cluster labels

Next, upload the clustered documents back to the dataset so that each document's cluster label is stored as an additional field.

client.update_documents(dataset_id=DATASET_ID, documents=clustered_documents)

4. Insert the cluster centroids

Get the centroid vectors and insert them as centroids into Relevance AI.

centers = clusterer.get_centroid_documents()

client.services.cluster.centroids.insert(
    dataset_id = DATASET_ID,
    cluster_centers = centers,
    vector_fields = [VECTOR_FIELD],
    alias = 'default'
)
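For intuition about what a centroid is: in K-Means, each centroid is the component-wise mean of the vectors assigned to its cluster. The following plain-Python sketch computes centroids from vectors and labels. The helper `compute_centroids` and the toy data are assumptions for illustration and do not reflect the output format of `get_centroid_documents`.

```python
from collections import defaultdict


def compute_centroids(vectors, labels):
    # Group vectors by label, then average each group component-wise.
    groups = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[label].append(vector)
    return {
        label: [sum(component) / len(members) for component in zip(*members)]
        for label, members in groups.items()
    }


# Toy vectors and hypothetical cluster labels.
vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = ["cluster-0", "cluster-0", "cluster-1"]

centroids = compute_centroids(vectors, labels)
print(centroids["cluster-0"])  # Mean of [1.0, 2.0] and [3.0, 4.0]
```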

Once you have stored your cluster centroids, you can view them using the following code.

client.services.cluster.centroids.list(
    dataset_id = DATASET_ID,
    vector_fields = [VECTOR_FIELD],
    page_size = 10,
    include_vector = False,
    alias = "default"
)
