Cluster Distribution

One way to evaluate clustering performance is to examine how the vectors are distributed across the clusters. Depending on the use case, success usually means that vectors are spread fairly evenly across clusters rather than concentrated in a few of them. If the dataset contains a field that can serve as ground truth, you can also examine how the values of that field are distributed across individual clusters. Both views can help guide decisions about clustering hyperparameters, including the number of clusters and the clustering method.

Code Example

The following code examines how the clusters are distributed, first by the number of vectors in each cluster, and then against a ground truth label (here, category, which is one of the fields in the dataset). Ideally, each ground truth category falls predominantly into a single cluster.

DATASET_ID = "quickstart_clustering"
VECTOR_FIELD = "descriptiontextmulti_vector_"

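# Cluster Distribution
# Assumes `client` is an already authenticated client instance
# and that a clustering with the alias "kmeans_10" exists on the dataset.
client.vector_tools.cluster.distribution(
  dataset_id = DATASET_ID,
  vector_field = VECTOR_FIELD,
  cluster_alias = "kmeans_10"
)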
{'Cluster-9': 167,
 'Cluster-7': 138,
 'Cluster-2': 147,
 ...
}
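
To put a number on how evenly the vectors are spread, one option is the normalized Shannon entropy of the cluster sizes. The sketch below is not part of the SDK: the helper name size_balance is hypothetical, and the sample dictionary holds only the three counts shown in the output above, so the score is illustrative rather than a result for the full clustering.

```python
import math


def size_balance(cluster_sizes):
    """Normalized Shannon entropy of cluster sizes.

    Returns a value in [0, 1]: 1.0 means all clusters have the same size,
    values near 0 mean almost all vectors sit in a single cluster.
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        raise ValueError("cluster_sizes must contain at least one vector")
    if len(cluster_sizes) < 2:
        # A single cluster is trivially "balanced".
        return 1.0
    # Empty clusters contribute nothing to the entropy sum.
    probs = [n / total for n in cluster_sizes.values() if n > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum possible entropy for this many clusters.
    return entropy / math.log(len(cluster_sizes))


# Partial sample: only the three clusters visible in the output above.
sizes = {"Cluster-9": 167, "Cluster-7": 138, "Cluster-2": 147}
balance = size_balance(sizes)
print(f"size balance: {balance:.3f}")
```

A score close to 1 suggests similarly sized clusters. A low score may indicate that one cluster absorbs most of the vectors, which can be a reason to revisit the number of clusters.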

Cluster against ground truth distribution

DATASET_ID = "quickstart_clustering"
VECTOR_FIELD = "descriptiontextmulti_vector_"
GROUND_TRUTH_FIELD = "category"


# Cluster against Ground Truth Distribution
client.vector_tools.cluster.distribution(
  dataset_id = DATASET_ID, 
  vector_field = VECTOR_FIELD, 
  cluster_alias = "kmeans_10",
  ground_truth_field = GROUND_TRUTH_FIELD
)
{'Home Decor & Festive Needs ': {'Cluster-1': 0.46774193548387094,
                                 'Cluster-6': 0.2903225806451613,
                                 'Cluster-2': 0.1774193548387097,
                                 'Cluster-4': 0.06451612903225806},
 'Mobiles & Accessories ': {'Cluster-5': 0.5802469135802469,
                            'Cluster-1': 0.2345679012345679,
                            'Cluster-6': 0.14814814814814814,
                            'Cluster-4': 0.037037037037037035},
 'Beauty and Personal Care ': {'Cluster-6': 0.7142857142857143,
                               'Cluster-0': 0.14285714285714285,
                               'Cluster-4': 0.14285714285714285},
 ...
}
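
The nested output above can be summarized by finding, for each ground truth category, the cluster that holds the largest share of it. The following is a minimal sketch, assuming the result has the shape shown above (category mapped to a dictionary of cluster shares). The helper dominant_clusters is hypothetical and not part of the SDK, and the sample data copies only the three categories visible in the output.

```python
def dominant_clusters(distribution):
    """Map each category to its (cluster, share) pair with the largest share."""
    summary = {}
    for category, shares in distribution.items():
        if not shares:
            # Skip categories that have no cluster assignments.
            continue
        cluster, share = max(shares.items(), key=lambda item: item[1])
        summary[category] = (cluster, share)
    return summary


# Partial sample copied from the output above (remaining categories omitted).
distribution = {
    "Home Decor & Festive Needs ": {
        "Cluster-1": 0.46774193548387094,
        "Cluster-6": 0.2903225806451613,
        "Cluster-2": 0.1774193548387097,
        "Cluster-4": 0.06451612903225806,
    },
    "Mobiles & Accessories ": {
        "Cluster-5": 0.5802469135802469,
        "Cluster-1": 0.2345679012345679,
        "Cluster-6": 0.14814814814814814,
        "Cluster-4": 0.037037037037037035,
    },
    "Beauty and Personal Care ": {
        "Cluster-6": 0.7142857142857143,
        "Cluster-0": 0.14285714285714285,
        "Cluster-4": 0.14285714285714285,
    },
}

summary = dominant_clusters(distribution)
for category, (cluster, share) in summary.items():
    print(f"{category.strip()}: {cluster} holds {share:.0%}")
```

A category whose largest share is well below 1 is split across several clusters. If many categories look like this, it may be worth trying a different number of clusters or a different clustering method.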
