Required Knowledge: Vectors, Encoding, Classification problems
Audience: Data scientists, Vector enthusiasts, Statisticians, Machine learning engineers
Reading time: 5 minutes
Classification is the task of assigning a label to an input. Machine learning classification requires a dataset of labeled examples. A model is first trained (i.e. it observes and analyses the data), after which it can infer the class of an unlabeled item (i.e. assign it to one of the classes seen during training).
Binary classification models are the simplest case and distinguish only two classes (e.g. positive/negative, cat/dog, True/False). More complex problems need models that can distinguish between several classes (e.g. cat/dog/rabbit/horse, agree/disagree/neutral). Multiclass classification models are used in such scenarios.
Typically, we may want to use a neural network to solve these kinds of problems. However, embeddings offer a less computationally intensive alternative.
Vectors reframe traditional classification into a vector search problem
One way to replace neural networks in classification tasks is a technique called k-nearest neighbours (kNN), i.e. finding the k data points nearest to a query. This approach requires the data to be encoded (i.e. vectorized). All data points are encoded and placed together to form the embedding space. New query/search items are encoded in the same way.
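As a rough illustration of how data points and a query end up in the same embedding space, here is a minimal sketch. The vocabulary `VOCAB`, the `encode` helper, and the sample documents are hypothetical stand-ins for a real embedding model, which would normally produce dense learned vectors rather than word counts.

```python
import numpy as np

# Hypothetical toy vocabulary; a real system would use a trained embedding model.
VOCAB = ["cat", "dog", "purr", "bark", "meow", "fetch"]


def encode(text: str) -> np.ndarray:
    """Encode text as an L2-normalised bag-of-words vector over VOCAB."""
    tokens = text.lower().split()
    counts = np.array([tokens.count(word) for word in VOCAB], dtype=float)
    norm = np.linalg.norm(counts)
    # Leave all-zero vectors untouched to avoid dividing by zero.
    return counts / norm if norm > 0 else counts


# Every known data point is encoded with the same function...
documents = ["cat purr meow", "dog bark fetch", "cat meow"]
embedding_space = np.stack([encode(d) for d in documents])

# ...and so is any new query item, so both live in the same space.
query_vector = encode("meow meow cat")
```

The important design point is that the same encoder is applied to stored items and to queries; distances between vectors are only meaningful when both come from the same embedding space.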
To classify with kNN, we look at the k vectors closest to the new/query item (k is chosen through a set of pilot experiments). Distance is calculated in the vector space with a vector distance measure, for example cosine or Euclidean distance. The query is then assigned a class based on the classes of these k nearest neighbours, typically by majority vote.
Here are a few key advantages and possible disadvantages of using vectors rather than neural networks for classification.
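The kNN classification step can be sketched as follows. This is a minimal example under stated assumptions: the function name `knn_classify`, the two-dimensional sample vectors, and the choice of cosine distance with a simple majority vote are all illustrative, not a prescribed implementation.

```python
from collections import Counter

import numpy as np


def knn_classify(query, vectors, labels, k=3):
    """Return the majority label among the k vectors closest to the query.

    Assumes cosine distance and non-zero vectors (illustrative choices).
    """
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity between the query and every stored vector.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    distances = 1.0 - sims
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of the nearest neighbours.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy embedding space: three "cat" vectors and three "dog" vectors.
vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
    [0.1, 1.0], [0.0, 0.9], [0.2, 0.8],
])
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

predicted = knn_classify(np.array([0.95, 0.15]), vectors, labels, k=3)
```

No training step is involved: adding a new labeled example only means appending its vector and label. With two classes, an odd k avoids tied votes.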
Advantages and Disadvantages of the Vector Similarity Approach
Advantages: resolves the cold-start issue; reduced computational costs
Disadvantages: fine-tuning might be necessary for specific data