Inserting and updating documents

A guide to inserting data into Relevance AI and updating it afterwards

Inserting data

In general, data can be inserted into Relevance AI in either of two ways:

  1. Upload the data to Relevance AI first, then vectorize certain fields with a second API call (if necessary).
  2. Vectorize certain fields and upload the data to Relevance AI in a single call.

The first option suits data that already includes the desired vectors, because the second call can then be skipped. Under both options, Relevance AI provides state-of-the-art models for vectorizing additional fields.

To upload multiple documents to Relevance AI, use the insert_documents method.

If the dataset you specify does not exist yet, a new dataset is created automatically on insertion.

Inserting the data in one go

If the dataset is no larger than 100MB, there is no need to break it into batches. The example below uploads all the data in one go. As mentioned in Preparing data from CSV / Pandas Dataframe, the data is a list of dictionaries and is passed to the endpoint via the documents argument.

client.insert_documents(dataset_id="ecommerce-sample-dataset", documents=documents)
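As a rough sketch of the preparation step, the snippet below builds the list of dictionaries from CSV text and estimates whether it falls under the 100MB guideline. It uses only the Python standard library. The helper names, the sample columns, and the choice to measure size as the JSON-encoded byte length (with 100MB taken as 100 * 1024 * 1024 bytes) are assumptions made for illustration, not part of the Relevance AI SDK.

```python
import csv
import io
import json

# Assumption: the 100MB guideline is interpreted as 100 * 1024 * 1024 bytes.
MAX_SINGLE_UPLOAD_BYTES = 100 * 1024 * 1024


def csv_to_documents(csv_text):
    """Parse CSV text into a list of dictionaries, one per row.

    All values come back as strings; convert types as needed before upload.
    """
    return list(csv.DictReader(io.StringIO(csv_text)))


def approximate_size_bytes(documents):
    """Estimate the payload size as the UTF-8 length of the JSON encoding."""
    return len(json.dumps(documents).encode("utf-8"))


def fits_in_one_upload(documents, limit=MAX_SINGLE_UPLOAD_BYTES):
    """Return True if the estimated size is within the given limit."""
    return approximate_size_bytes(documents) <= limit


# Hypothetical sample data with an _id column.
sample_csv = "_id,product_title,price\n1,Blue mug,4.5\n2,Red mug,5.0\n"
documents = csv_to_documents(sample_csv)
```

If fits_in_one_upload(documents) is true, the list can be passed as the documents argument in a single insert_documents call like the one above.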

Updating documents

To only update specific documents, use update_documents as shown below:

documents = [{"_id": "example_id", "value": 3}]
client.update_documents(dataset_id="ecommerce-sample-dataset", documents=documents)

🚧

When To Use Update Vs Insert

insert replaces the entire document, whereas update changes only the fields that are specified and adds any fields that are new. update does not delete fields that already exist in the document, and it does not insert new documents.
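The difference can be modelled with a plain dictionary keyed by _id. This is an in-memory sketch of the behaviour described above, not the SDK's implementation. The functions here are local stand-ins that only share their names with the client methods, and the sample documents are made up.

```python
def insert_documents(store, documents):
    """Model of insert: each document fully replaces any existing one."""
    for doc in documents:
        store[doc["_id"]] = dict(doc)


def update_documents(store, documents):
    """Model of update: merge fields into existing documents only."""
    for doc in documents:
        existing = store.get(doc["_id"])
        if existing is None:
            continue  # update does not create new documents
        existing.update(doc)  # other existing fields are left untouched


store = {}
insert_documents(store, [{"_id": "a", "title": "Blue mug", "price": 4.5}])

# Updating adds "value" and keeps "title" and "price".
update_documents(store, [{"_id": "a", "value": 3}])
after_update = dict(store["a"])

# Updating an unknown _id leaves the store unchanged.
update_documents(store, [{"_id": "missing", "value": 1}])
ids_after_missing_update = sorted(store)

# Inserting the same _id again replaces the whole document.
insert_documents(store, [{"_id": "a", "value": 3}])
after_reinsert = dict(store["a"])
```

Here after_update still contains title and price alongside the new value field, while after_reinsert contains only _id and value.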

The easiest way to modify and update all documents in a dataset is to run pull_update_push in the Python SDK.

Updating An Entire Dataset

📘

pull_update_push is the easiest way to edit documents in a single dataset

pull_update_push lets you quickly try out new experiments on an entire dataset by updating every document in it. You pass it a Python function (a callable) that modifies a list of documents, and the modified documents are written back to the vector database immediately.

The example below uses pull_update_push to add a new field, new_parameter, to every document in the specified dataset.

def encode_documents(documents):
    for d in documents:
        d["new_parameter"] = "new_value"
    return documents

client.pull_update_push(
    dataset_id="ecommerce-sample-dataset",
    update_function=encode_documents
)

About Pull Update Push

pull_update_push modifies documents and writes them back to the vector database. Because the process can fail partway through, and documents that were already processed should not necessarily be processed again, it also logs the IDs of processed documents to a separate logging collection. To resume processing after a failure, specify the logging_collection parameter in pull_update_push.
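The resume behaviour can be sketched in plain Python. The code below is a simplified in-memory model, not the SDK's implementation: the dataset is a dictionary keyed by _id, a set of processed IDs stands in for the logging collection, and the function name, the chunk_size parameter, and the simulated failure are all assumptions made for illustration.

```python
def pull_update_push_sketch(dataset, update_function, processed_ids, chunk_size=2):
    """Pull unprocessed documents in chunks, update them, and push them back.

    processed_ids plays the role of the logging collection: IDs are recorded
    only after their chunk has been pushed successfully.
    """
    pending = [doc_id for doc_id in dataset if doc_id not in processed_ids]
    for start in range(0, len(pending), chunk_size):
        chunk_ids = pending[start:start + chunk_size]
        pulled = [dict(dataset[doc_id]) for doc_id in chunk_ids]  # pull copies
        updated = update_function(pulled)                         # update
        for doc in updated:
            dataset[doc["_id"]].update(doc)                       # push
        processed_ids.update(chunk_ids)                           # log


calls = []  # records which documents were actually processed


def add_parameter(documents):
    for d in documents:
        calls.append(d["_id"])
        d["new_parameter"] = "new_value"
    return documents


def flaky_add_parameter(documents):
    # Simulated failure on the chunk that contains document "c".
    if any(d["_id"] == "c" for d in documents):
        raise RuntimeError("simulated failure")
    return add_parameter(documents)


dataset = {doc_id: {"_id": doc_id} for doc_id in "abcd"}
processed = set()

try:
    pull_update_push_sketch(dataset, flaky_add_parameter, processed)
except RuntimeError:
    pass
processed_after_failure = set(processed)

# Second run with the same log: only the unprocessed documents are handled.
pull_update_push_sketch(dataset, add_parameter, processed)
```

In this model, the first run logs only the first chunk before failing, and the second run picks up the remaining documents without touching the ones already logged.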

Architecture diagram of pull_update_push

