Quickstart

Evaluate your clusters in 5 quick steps

Introduction

There are several ways to evaluate the success of a clustering algorithm. Broadly speaking, they can be categorised into internal and external methods:

  1. Internal methods, which examine how much variation is explained within the clusters
  2. External methods, which compare the clusters against a ground truth.

Relevance AI provides you with the tools to perform all of these analyses. The results should guide your choice of hyperparameters for the clustering algorithm, including the number of clusters and the clustering methodology.
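For instance, an internal metric such as the silhouette score can be swept across candidate cluster counts to pick the number of clusters. This is a minimal sketch using scikit-learn on synthetic vectors, not Relevance AI's API; the blob data and the range of k values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs standing in for document vectors
vectors = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(50, 8)) for c in (0.0, 3.0, 6.0)]
)

# Score each candidate k with the (internal) silhouette metric
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    scores[k] = silhouette_score(vectors, labels)

best_k = max(scores, key=scores.get)  # the k with the highest silhouette
```

With well-separated blobs like these, the sweep recovers k = 3; on real document vectors the curve is usually flatter and worth inspecting by eye.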


1. Create a dataset and insert data

First, you need to set up a client object to interact with Relevance AI. You also need a dataset under your Relevance AI account. You can either use our dummy sample data, ecommerce-1, as shown in this step, or follow the tutorial on how to create your own dataset. When you run the code below, the loaded documents are uploaded to your personal Relevance AI account under the name quickstart_clustering.

from relevanceai import Client 
from relevanceai.datasets import get_dummy_ecommerce_dataset

client = Client()
documents = get_dummy_ecommerce_dataset()

DATASET_ID = "quickstart_clustering"
client.insert_documents(dataset_id=DATASET_ID, docs=documents)

2. Run the KMeans clustering algorithm

The following code clusters the descriptiontextmulti_vector_ field using KMeans. For more information, see Quickstart (K means).

VECTOR_FIELD = 'descriptiontextmulti_vector_'
KMEAN_NUMBER_OF_CLUSTERS = 10

centroids = client.vector_tools.cluster.kmeans_cluster(
    dataset_id = DATASET_ID, 
    vector_fields = [VECTOR_FIELD],
    k = KMEAN_NUMBER_OF_CLUSTERS,
    alias = "kmeans_10"
)

3. View distribution of clusters

You can use the distribution function to examine how the clusters are distributed:

  1. within themselves, and/or
  2. against a ground-truth field in the dataset.

These two cases are shown in the code snippets below, respectively; ideally, each cluster would represent a different category.

# Cluster Distribution
client.vector_tools.cluster.distribution(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = "kmeans_10"
)
{'Cluster-9': 167,
 'Cluster-7': 138,
 'Cluster-2': 147,
 ...
}
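Conceptually, the within-cluster distribution is just a count of documents per cluster label. A minimal local sketch with `collections.Counter`, where the sample documents and the `cluster` key are illustrative, not the fields Relevance AI stores:

```python
from collections import Counter

# Toy documents with an illustrative cluster label field
documents = [
    {"_id": "a", "cluster": "Cluster-9"},
    {"_id": "b", "cluster": "Cluster-7"},
    {"_id": "c", "cluster": "Cluster-9"},
]

# Count how many documents landed in each cluster
distribution = Counter(doc["cluster"] for doc in documents)
# distribution == Counter({'Cluster-9': 2, 'Cluster-7': 1})
```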

Here, we have chosen the field category in our dataset as the ground truth.

# Select ground truth 
GROUND_TRUTH_FIELD = "category"

# Cluster against Ground Truth Distribution
client.vector_tools.cluster.distribution(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = "kmeans_10",
  ground_truth_field = GROUND_TRUTH_FIELD
)
{'Home Decor & Festive Needs ': {'Cluster-1': 0.46774193548387094,
 'Cluster-6': 0.2903225806451613,
 'Cluster-2': 0.1774193548387097,
 'Cluster-4': 0.06451612903225806},
 'Mobiles & Accessories ': {'Cluster-5': 0.5802469135802469,
 'Cluster-1': 0.2345679012345679,
 'Cluster-6': 0.14814814814814814,
 'Cluster-4': 0.037037037037037035},
 'Beauty and Personal Care ': {'Cluster-6': 0.7142857142857143,
 'Cluster-0': 0.14285714285714285,
 'Cluster-4': 0.14285714285714285},
  ...
}
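The ground-truth variant applies the same counting per category and then normalizes to fractions, which is why each category's values above sum to 1. A local sketch; the documents, category names, and field names are illustrative:

```python
from collections import Counter, defaultdict

# Toy documents carrying both a ground-truth category and a cluster label
documents = [
    {"category": "Mobiles", "cluster": "Cluster-5"},
    {"category": "Mobiles", "cluster": "Cluster-5"},
    {"category": "Mobiles", "cluster": "Cluster-1"},
    {"category": "Decor",   "cluster": "Cluster-1"},
]

# Count cluster labels within each ground-truth category
per_category = defaultdict(Counter)
for doc in documents:
    per_category[doc["category"]][doc["cluster"]] += 1

# Normalize counts to fractions, mirroring the output shape above
result = {
    cat: {cl: n / sum(counts.values()) for cl, n in counts.items()}
    for cat, counts in per_category.items()
}
```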

4. View metrics of clusters

The following code reports metrics of the clusters, including the Silhouette Score, Rand Score, Homogeneity and Completeness; an explanation of these metrics is provided at Cluster Metrics. If a ground truth is not provided, only the Silhouette Score is shown.

# Cluster Metrics
client.vector_tools.cluster.metrics(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = 'kmeans_10', 
  ground_truth_field = GROUND_TRUTH_FIELD
)
{
'Silhouette Score': 0.11150895287708772, 
'Rand Score': 0.2953462481652017
}
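For reference, scikit-learn offers equivalents of the two metrics shown above: `silhouette_score` (internal, needs no ground truth) and `rand_score` (external). A minimal sketch on synthetic vectors whose cluster labels happen to line up perfectly with the ground truth:

```python
import numpy as np
from sklearn.metrics import silhouette_score, rand_score

rng = np.random.default_rng(1)
# Two well-separated blobs standing in for document vectors
vectors = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(40, 4)) for c in (0.0, 4.0)]
)
labels = np.repeat([0, 1], 40)              # cluster assignment
truth = np.repeat(["cat_a", "cat_b"], 40)   # ground-truth field

sil = silhouette_score(vectors, labels)  # in [-1, 1]; higher means tighter clusters
rand = rand_score(truth, labels)         # in [0, 1]; 1 means perfect agreement
```

Because the labels agree exactly with the ground truth here, the Rand score is 1.0; the much lower scores in the output above reflect a harder, real dataset.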

5. Plot clusters

The following code plots a 3D, dimension-reduced version of the vectors, colour-coded by cluster and, optionally, by the ground truth.

# Plot Cluster
client.vector_tools.cluster.plot(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = 'kmeans_10', 
  ground_truth_field = GROUND_TRUTH_FIELD
)
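The plot depends on first reducing the high-dimensional vectors to three components. A minimal sketch of that reduction step using PCA from scikit-learn; the reduction method the plot function actually uses may differ:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
vectors = rng.normal(size=(100, 32))  # stand-in for document vectors

# Project to 3 components so each point can be placed in a 3D scatter plot
reduced = PCA(n_components=3).fit_transform(vectors)
# reduced has shape (100, 3): one 3D point per document
```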

Put it all together

from relevanceai import Client 
from relevanceai.datasets import get_dummy_ecommerce_dataset

client = Client()
docs = get_dummy_ecommerce_dataset()

DATASET_ID = "quickstart_clustering"
client.insert_documents(dataset_id=DATASET_ID, docs=docs)

# Cluster vectors
VECTOR_FIELD = "descriptiontextmulti_vector_"
KMEAN_NUMBER_OF_CLUSTERS = 10

centroids = client.vector_tools.cluster.kmeans_cluster(
    dataset_id = DATASET_ID, 
    vector_fields = [VECTOR_FIELD],
    k = KMEAN_NUMBER_OF_CLUSTERS,
    alias = "kmeans_10"
)


# Cluster Distribution
client.vector_tools.cluster.distribution(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = "kmeans_10"
)

# Select ground truth 
GROUND_TRUTH_FIELD = 'category'


# Cluster against Ground Truth Distribution
client.vector_tools.cluster.distribution(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = "kmeans_10",
  ground_truth_field = GROUND_TRUTH_FIELD
)

# Cluster Metrics
client.vector_tools.cluster.metrics(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = 'kmeans_10', 
  ground_truth_field = GROUND_TRUTH_FIELD
)

# Plot Cluster
client.vector_tools.cluster.plot(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = 'kmeans_10', 
  ground_truth_field = GROUND_TRUTH_FIELD
)
