How to vectorize text using VectorHub - Transformers
A guide on vectorizing text using Vectorhub
Using VectorHub
VectorHub provides users with access to various state of the art encoders to vectorize different data types such as text or image. It manages the encoding process as well, allowing users to focus on the data they want to encode rather than the actual model behind the scene.
On this page, we introduce sentence-transformer based text encoders.
sentence-transformers
First, sentence-transformers
must be installed. Restart the notebook when the installation is finished.
# remove `!` if running the line in a terminal
!pip install vectorhub[sentence-transformers]
Then from the sentence_transformers
category, we import our desired transformer and specific model; the full list can be accessed here.
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec
model = SentenceTransformer2Vec("all-mpnet-base-v2")
Encoding a single text input via the encode
function and encoding a specified text field in the whole data (i.e. list of dictionaries) via the encode_documents
function are shown below.
# Encode a single input
model.encode("I love working with vectors.")
# documents are saved as a list of dictionaries
documents = [{'sentence': '"This is the first sentence."', '_id': 1}, {'sentence': '"This is the second sentence."', '_id': 2}]
# Encode the `"sentence"` field in a list of documents
encoded_documents = model.encode_documents(["sentence"], documents)
ds.upsert_documents(documents=encoded_documents)
Encoding an entire dataset using df.apply()
The easiest way to update an existing dataset with encoding results is to run df.apply
. This function fetches all the data-points in a dataset, runs the specified function (i.e. encoding in this case) and writes the result back to the dataset.
For instance, in the sample code below, we use a dataset called ecommerce_dataset
, and encode the product_description
field using the SentenceTransformer2Vec
encoder.
ds["sentence"].apply(lambda x: model.encode(x), output_field="sentence_vector")
Some famous models
Encoding using native Transformers
- BERT
Below, we show an example of how to get vectors from the popular BERT model from HuggingFace Transformers library.
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
def vectorize(text):
return (
torch.mean(model(**tokenizer(text, return_tensors="pt"))[0], axis=1)
.detach()
.tolist()[0]
)
Encoding using Vectorhub's Sentence Transformers
Vectorhub helps us to more easily work with models to encode fields in our documents of different modal types.
- CLIP
Below, we show an example of how to get vectors from the popular CLIP model from HuggingFace Transformers library.
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec
model = SentenceTransformer2Vec('clip-ViT-B-32')
text_vector = model.encode("I love working with vectors.")
# documents are saved as a list of dictionaries
documents=[{'image_url': 'https://relevance.ai/wp-content/uploads/2021/10/statue-illustration.png'}, {'image_url': 'https://relevance.ai/wp-content/uploads/2021/09/Group-193-1.png'}]
# Encode the images accessible from the URL saved in `image_url` field in a list of documents
docs_with_vecs = model.encode_documents(["image_url"], documents)
Updated about 2 months ago