If you haven't set up your client or installed the necessary dependencies, get set up first.
Clustering groups items so that those in the same cluster share meaningful similarities, such as specific features or properties. By surfacing patterns in the data, clustering supports informed decision-making. Building on strong vector representations, Relevance AI provides powerful and easy-to-use clustering endpoints.
In this guide, you will learn to run clustering based on the K-Means algorithm, which aims to partition your dataset into K distinct clusters.
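To make the idea of partitioning into K clusters concrete, here is a minimal pure-Python sketch of K-Means. It is not the Relevance AI implementation; the function names (`kmeans`, `squared_distance`, `mean_vector`), the random initialisation, and the toy two-dimensional points are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans(vectors, k, max_iterations=100, seed=0):
    """Toy K-Means (hypothetical helper): returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k data points chosen at random without replacement.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so we have converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # Keep the old centroid if a cluster is empty.
                centroids[c] = mean_vector(members)
    return labels, centroids


# Made-up data: two well-separated groups of three points each.
points = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
labels, centroids = kmeans(points, k=2)
```

On this toy data the first three points end up sharing one label and the last three the other. Which group receives label 0 depends on the random initialisation.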
First, set up a client object to interact with Relevance AI. You will also need a dataset under your Relevance AI account. You can either use our dummy sample data, as shown below, or follow the tutorial on how to create your own dataset.
from relevanceai import Client

client = Client()
In this guide, we use our e-commerce dataset, which includes fields such as product_name, as well as a vectorized version of that field, product_name_default_vector_. The documents can be loaded via:
from relevanceai.datasets import get_dummy_ecommerce_dataset

documents = get_dummy_ecommerce_dataset()
Next, upload these documents to your personal Relevance AI account under the dataset name quickstart_clustering:
DATASET_ID = "quickstart_clustering"
client.insert_documents(dataset_id=DATASET_ID, docs=documents)
To run K-Means clustering, we first define a clustering object, KMeans, which loads the clustering algorithm with a specified number of clusters.
from relevanceai.vector_tools.cluster import KMeans

KMEAN_NUMBER_OF_CLUSTERS = 10
clusterer = KMeans(k=KMEAN_NUMBER_OF_CLUSTERS)
Next, the algorithm is fitted on the vector field, product_name_default_vector_, to distinguish between clusters. The cluster to which each document belongs is returned.
VECTOR_FIELD = "product_name_default_vector_"
clustered_documents = clusterer.fit_documents(
    vector_fields=[VECTOR_FIELD],  # Cluster 1 field
    documents=documents,
    return_only_clusters=True,  # If True, return only clusters
)
Finally, these categorised documents are uploaded back to the dataset as an additional field.
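As an illustration of what "an additional field" means here, the sketch below attaches a cluster label to each document in plain Python. The helper `attach_cluster_labels`, the field name `cluster_label`, and the sample documents are hypothetical; the field name and structure that the Relevance AI SDK actually writes may differ.

```python
def attach_cluster_labels(documents, labels, field="cluster_label"):
    """Return copies of the documents with a cluster label stored under `field`.

    `field` is a hypothetical name chosen for this sketch, not the SDK's own.
    """
    if len(documents) != len(labels):
        raise ValueError("documents and labels must have the same length")
    # Build new dicts so the original documents are left untouched.
    return [{**doc, field: label} for doc, label in zip(documents, labels)]


# Made-up documents and labels, for illustration only.
docs = [
    {"_id": "a", "product_name": "red shoes"},
    {"_id": "b", "product_name": "blue shoes"},
    {"_id": "c", "product_name": "desk lamp"},
]
labelled = attach_cluster_labels(docs, ["cluster-0", "cluster-0", "cluster-1"])
```

Each labelled document keeps its original fields and gains one extra key, so it can be written back to the dataset without losing existing data.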
Get the centroid vectors and insert them as centroids into Relevance AI.
centers = clusterer.get_centroid_documents()
client.services.cluster.centroids.insert(
    dataset_id=DATASET_ID,
    cluster_centers=centers,
    vector_fields=[VECTOR_FIELD],
    alias="default",
)
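For intuition about what a centroid is, the sketch below averages the vectors of the documents in each cluster. It is a plain-Python illustration, not the SDK's `get_centroid_documents`; the helper `centroid_documents`, the `_id` key, and the sample vectors are assumptions.

```python
from collections import defaultdict

VECTOR_FIELD = "product_name_default_vector_"


def centroid_documents(documents, labels, vector_field=VECTOR_FIELD):
    """Average the vectors in each cluster into one centroid document per cluster.

    The output layout (an "_id" plus the vector field) is an assumption
    made for this sketch.
    """
    grouped = defaultdict(list)
    for doc, label in zip(documents, labels):
        grouped[label].append(doc[vector_field])
    centroids = []
    for label in sorted(grouped):
        vectors = grouped[label]
        # Component-wise mean of all vectors assigned to this cluster.
        mean = [sum(component) / len(vectors) for component in zip(*vectors)]
        centroids.append({"_id": label, vector_field: mean})
    return centroids


# Made-up two-dimensional vectors, for illustration only.
docs = [
    {VECTOR_FIELD: [0.0, 0.0]},
    {VECTOR_FIELD: [2.0, 4.0]},
    {VECTOR_FIELD: [10.0, 10.0]},
]
toy_centers = centroid_documents(docs, ["cluster-0", "cluster-0", "cluster-1"])
```

Here the centroid of "cluster-0" is the midpoint of its two vectors, and the centroid of "cluster-1" is its single member.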
Once you have stored your cluster centroids, you can view them using the following code.
client.services.cluster.centroids.list(
    dataset_id=DATASET_ID,
    vector_fields=[VECTOR_FIELD],
    page_size=10,
    include_vector=False,
    alias="default",
)