Vectors for classification

Vectors are re-framing how we approach traditional deep learning problems

Required Knowledge: Vectors, Encoding, Classification problems
Audience: Data scientists, Vector enthusiasts, Statisticians, Machine learning engineers

Defining Multi-Classification

Classification is the task of assigning a label to an input. Machine Learning classification requires a dataset full of labeled data. A Machine Learning model is first trained (i.e. the model observes and analyses the data), and then it will be able to inference the class of an unlabeled item (i.e. classify them into one of the seen categories/classes).
Binary classification models are simple and only know two classes (e.g. positive/negative, cat/dog, True/False). For more complex problems, we need stronger models that are able to inference multiple data classes (e.g. cat/dog/rabbit/horse, agree/disagree/neutral). Multilabel classification models are to use in such scenarios.
Typically, we may want to use a neural network to solve these kinds of problems. However, embeddings offer an alternative, less computationally intensive, solution.

Vectors reframe traditional classification into a vector search problem

One solution to substitute neural networks in classification tasks is using a technique called k-nearest neighbour (kNN) (i.e. finding the K nearest neighbouring data points). This approach requires data to be encoded (i.e vectorized). All data points are encoded/vectorized and placed along with each other to form the embedding space. This also applies to new query/search items.
To perform classification under the KNN approach, we look at the closest k vectors to our new/query item (k is decided based on a set of pilot experiments); distance is calculated in the vector space, following vector algebra. Classification is done according to the classes of these k nearest neighbours.

Here are a few key advantages and possible disadvantages of using vectors over neural networks for classifications.

Pros

Cons

Resolves the cold-start issue
In traditional approaches, classification neural networks would have to be re-trained in order to adapt to new categories. Adding new classes to existing models would often require re-training costs, compute costs and significant time into ensuring a new model has been tested to work .

Fine-tuning might be necessary for specific data
Under specific data, the encoders might need to be finetuned which requires labeled data.

Reduced computational costs
Using excellent out-of-the-box vectors/similarity search that resolves this issue means you can then reduce the cost of initial data science experiments and ensure data to value quickly