How to prepare data for Relevance AI

So you're ready to get started uploading data to Relevance AI? Great!

Before doing so, run through the checklist to make sure your data meets our recommendations and requirements.

  • The general format for uploading data to Relevance AI is CSV.
  • The maximum number of rows in a dataset is 300,000. Please contact us if your dataset is larger.
  • Make sure to read further down on this page if you are processing images or audio (i.e. media data).
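
The row limit can be checked before uploading. The snippet below is a minimal sketch using Python's standard csv module; the function names are hypothetical and not part of any Relevance AI tooling, and it assumes the first row is the header.

```python
import csv
import io

MAX_ROWS = 300_000  # limit stated in the checklist above


def count_data_rows(csv_file):
    """Count non-empty rows after the header in an open CSV file object."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def within_row_limit(csv_file, max_rows=MAX_ROWS):
    """Return True when the file has at most max_rows data rows."""
    return count_data_rows(csv_file) <= max_rows


# In-memory stand-in for a real file opened with open(path, newline="")
sample = io.StringIO("No,Name,Company,Age\n1,Jim,ABC,32\n2,Jack,XYZ,24\n")
print(within_row_limit(sample))
```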

File format: CSV

Your dataset should be saved in valid CSV format before being uploaded to Relevance AI's platform.

CSV files store table-like data, similar to what you see in an Excel sheet. Make sure that all columns have unique names and that the values in each column follow the same data type and format (see the "Values" section below).

No | Name | Company | Age
---|------|---------|----
1  | Jim  | ABC     | 32
2  | Jack | XYZ     | 24
3  | Dave | LMN     | 39

Headers: Names of fields / columns

  • Column names/headers are included as the first row of the file
  • Column names/headers should be on one line (i.e. multiple-line headers are not accepted)
  • Names should be short but descriptive
  • Column names must be unique (otherwise numbers are automatically added to duplicate headers)
  • Names can only contain letters, numbers, dashes or underscores (any other character will be replaced by our upload engine)
  • Remove any white spaces in the field name (or white spaces will be replaced with - by our upload engine)
  • Avoid using . in the field / column names (or . will be replaced with - by our upload engine)
  • If your dataset includes vectors, make sure the vector field name ends in _vector_
    Vector fields:
    Vector fields represent data in numeric form (i.e. as a list of numbers, or a vector to be precise). For instance, if your dataset includes a field/column named "description" that holds a text description of each item, vectorizing each description value produces a corresponding vector. These vectors can be saved in the original dataset under a vector field (an example is provided under Data FAQs).
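
The header rules above can be applied before upload so that the upload engine has nothing to rewrite. The snippet below is a minimal sketch, not the upload engine's actual logic: the function names are hypothetical, replacing every disallowed character with - is an assumption (the page only states this for white spaces and dots), "letters" is taken to mean ASCII letters, and the _1, _2 suffix for duplicates is an illustrative choice.

```python
import re


def clean_header(name):
    """Replace every character other than letters, digits, '-' and '_' with '-'."""
    return re.sub(r"[^A-Za-z0-9_-]", "-", name.strip())


def clean_headers(names):
    """Clean all headers and make duplicates unique by appending a number."""
    seen = {}
    result = []
    for name in names:
        cleaned = clean_header(name)
        count = seen.get(cleaned, 0)
        seen[cleaned] = count + 1
        # The first occurrence keeps its name; later ones get _1, _2, ...
        result.append(cleaned if count == 0 else f"{cleaned}_{count}")
    return result


print(clean_headers(["First Name", "price.usd", "Age", "Age"]))
# ['First-Name', 'price-usd', 'Age', 'Age_1']
```

A name such as description_vector_ passes through unchanged, so vector fields keep their required suffix.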

Values: Values under headers

  • Include only one data type and format in each column. For instance:
    • DATES - All date fields formatted in "YYYY-MM-DD" format
    • CURRENCY - Values in digits only, without the currency sign (e.g. 119.50), as opposed to: price = [$119.50, 200 dollars]
    • POSTCODES - If your data has a postcode field that mixes numeric values (e.g. 90210) and string values (e.g. SW1A 1AA) in the same field (e.g. postcodes across countries), ensure that the first postcode value (under the column header 'postcode') is a string format postcode (e.g. SW1A 1AA).
    • None / No values - When there is no value, simply leave the cell empty in your CSV file. An empty cell is a cell with nothing typed in it: not a white space, not 0, not N/A, not None, not null, literally nothing. A common case is when people do not respond to a question.
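
The value rules above can be sketched as a few small helpers. These are illustrative only: the function names, the list of placeholder markers and the assumed input date format (DD/MM/YYYY) are assumptions, so adjust them to your own data. Note that 0 is deliberately not treated as a placeholder, since it can be a real value.

```python
import re
from datetime import datetime

# Placeholder strings that should become empty cells (illustrative list)
EMPTY_MARKERS = {"", "n/a", "na", "none", "null"}


def normalize_empty(value):
    """Return '' for placeholder values; leave real values unchanged."""
    return "" if value.strip().lower() in EMPTY_MARKERS else value


def normalize_date(value, input_format="%d/%m/%Y"):
    """Convert a date string in input_format to YYYY-MM-DD."""
    return datetime.strptime(value, input_format).strftime("%Y-%m-%d")


def normalize_currency(value):
    """Keep only the number, e.g. '$119.50' -> '119.50'."""
    match = re.search(r"\d+(?:\.\d+)?", value.replace(",", ""))
    return match.group(0) if match else ""


print(normalize_date("31/01/2023"))       # 2023-01-31
print(normalize_currency("$119.50"))      # 119.50
print(normalize_currency("200 dollars"))  # 200
```

The currency helper ignores negative signs and assumes a dot as the decimal separator.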

Categorical measures

  • If your data has coded values (e.g. Is Member = 1/0), we recommend replacing the codes with natural-language labels that business users can understand, e.g. Is Member / Is Not Member, Yes / No or True / False.
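
Decoding a coded column can be as simple as a lookup table. The mapping and function name below are hypothetical and only mirror the Is Member example above.

```python
# Hypothetical mapping for the "Is Member = 1/0" example
MEMBER_LABELS = {"1": "Is Member", "0": "Is Not Member"}


def decode_value(value, labels=MEMBER_LABELS):
    """Map a coded value to its label; leave unknown or empty values unchanged."""
    return labels.get(value.strip(), value)


print([decode_value(v) for v in ["1", "0", ""]])
# ['Is Member', 'Is Not Member', '']
```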

Numeric measures

  • For numeric scores like NPS, we recommend including both columns: numeric scores (e.g. 0-10 scores) and the coded value/label as an additional field (e.g. detractor, passive, promoter).
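
For NPS, the label column can be derived from the score. The sketch below uses the standard NPS bands (0-6 detractor, 7-8 passive, 9-10 promoter); the bands are general NPS convention rather than something this page specifies, and the function name is hypothetical.

```python
def nps_label(score):
    """Return the standard NPS category for a 0-10 score."""
    if not 0 <= score <= 10:
        raise ValueError(f"NPS score must be between 0 and 10, got {score}")
    if score <= 6:
        return "detractor"
    if score <= 8:
        return "passive"
    return "promoter"


print([(s, nps_label(s)) for s in (3, 7, 10)])
# [(3, 'detractor'), (7, 'passive'), (10, 'promoter')]
```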

No values

When there is no value, simply leave the cell empty in your CSV file. An empty cell is a cell with nothing typed in it: not a white space, not 0, not N/A, not None, not null, literally nothing. A common case is when people do not respond to a question.

_id field

Every entry in a dataset on Relevance AI's platform has a unique identifier (_id). The _id field can already exist in the CSV (i.e. included in the uploaded CSV by the dataset owner). Otherwise, the platform automatically adds the field with unique values.

This unique identifier is your access point to an individual entry in a dataset. So, if you might need to modify existing values later, we recommend including the _id field in advance.

Note 1: Pay attention to the spelling; it must be exactly _id (underscore, lowercase i and lowercase d).

Note 2: The assigned unique identifier can be accessed on the platform or when you Export your data to CSV.
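
If you want to control the identifiers yourself, one option is to fill in _id before uploading. The sketch below assumes rows have already been read into dictionaries (for example with csv.DictReader); using random UUIDs is an illustrative choice, not the platform's own scheme, and the function name is hypothetical.

```python
import uuid


def ensure_id(rows):
    """Add a unique '_id' to every row (a dict) that does not have one yet."""
    for row in rows:
        if not row.get("_id"):
            row["_id"] = uuid.uuid4().hex  # random 32-character hex string
    return rows


rows = ensure_id([{"Name": "Jim"}, {"Name": "Jack", "_id": "7"}])
print(rows[1]["_id"])  # 7 (existing identifiers are kept)
```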

A field to access the initial order of the data rows

If the order of the rows in your CSV file is important, add a new field to the file and fill it with sequential values (e.g. 1, 2, 3, ...). This lets you sort your data back into its original order after exporting it from Relevance AI.
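
Adding such a field can be sketched with the standard csv module. The column name row_order and the function name are hypothetical; any name that follows the header rules works.

```python
import csv
import io


def add_row_order(csv_text, field="row_order"):
    """Return CSV text with an extra first column holding 1, 2, 3, ..."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow([field] + header)
    for position, row in enumerate(reader, start=1):
        writer.writerow([position] + row)
    return out.getvalue()


print(add_row_order("Name,Age\nJim,32\nJack,24\n"))
```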

Cleaning text data (optional)

When working with text data, it is recommended (but not required) to apply certain preprocessing steps that can improve the analysis results. Common text preprocessing steps are:

  • Stop words removal: to remove frequent but not important words used in our language (e.g. the, there).
  • Lemmatization: replacing words with their common root (e.g. changes or changing become change)
  • Lowercasing: converting all characters to their lowercase form
  • Breaking into shorter pieces of text: when automatically analyzing text, processing smaller pieces of text (e.g. a sentence vs paragraph) often produces more precise results.
  • Noise removal: this step is entirely data-specific. Popular cleaning methods are HTML, URL or hashtag removal.
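
Several of these steps can be sketched with the standard library alone. The snippet below covers noise removal, sentence splitting, lowercasing and stop word removal; the tiny stop word list and the regular expressions are illustrative assumptions, and lemmatization is left out because it normally requires a dedicated NLP library.

```python
import re

# Tiny illustrative stop word list; real lists are much longer
STOP_WORDS = {"the", "there", "a", "an", "is", "of", "and"}


def remove_noise(text):
    """Strip HTML tags, URLs and hashtags, then collapse white space."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)          # hashtags
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split on white space that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def remove_stop_words(text):
    """Drop words that appear in STOP_WORDS (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def clean_text(text):
    """Return a list of cleaned, lowercased sentences."""
    sentences = split_sentences(remove_noise(text))
    return [remove_stop_words(s.lower()) for s in sentences]


raw = "<p>The app is great!</p> There is a bug in the export. #feedback https://example.com/x"
print(clean_text(raw))
# ['app great!', 'bug in export.']
```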

File formats for media data (images, audio)

Media data such as images or audio files must be accessible via a URL in order to work with them on Relevance AI's platform. If your media files are available online, simply include their corresponding URLs in your dataset. Note that you can include other fields in your CSV file as shown in the second example below.

How To Get Started: Audio Use Case

  • Save your audio file(s) in one of the common audio formats - mp3 is recommended
  • Your audio file must be accessible via an http... link. Use your preferred hosting method, include the URL(s) in your CSV and upload the CSV file to Relevance AI, or simply run [Connect Media], which takes care of this step.
 _id |              Audio-URL
-----|--------------------------------------
  1  |  https://my-repo/my-audio-file1.mp3
  2  |  https://my-repo/my-audio-file2.mp3
  3  |  https://my-repo/my-audio-file3.mp3

 _id |              URL              | project
-----|-------------------------------|---------
  1  |  https://my-repo/my-file1.mp3 |   X1
  2  |  https://my-repo/my-file2.mp3 |   X2
  3  |  https://my-repo/my-file3.mp3 |   X1
  4  |  https://my-repo/my-file4.mp3 |   X3
  5  |  https://my-repo/my-file5.mp3 |   X2

Otherwise, you first need to host them on the internet. This is possible through the "Upload your media files" workflow on Relevance AI.
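
For files that are already hosted, the media CSV can be generated with a short script. The sketch below is illustrative: the function names are hypothetical, the URLs are the placeholder ones from the example tables, and the check only confirms that each value looks like an http(s) URL, not that the file is actually reachable.

```python
import csv
import io
from urllib.parse import urlparse


def is_http_url(value):
    """Return True when value is syntactically an http or https URL."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_media_csv(records, url_field="URL"):
    """Build CSV text from a list of dicts that all share the same keys."""
    bad = [r[url_field] for r in records if not is_http_url(r[url_field])]
    if bad:
        raise ValueError(f"Not an http(s) URL: {bad}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


records = [
    {"_id": 1, "URL": "https://my-repo/my-file1.mp3", "project": "X1"},
    {"_id": 2, "URL": "https://my-repo/my-file2.mp3", "project": "X2"},
]
print(build_media_csv(records))
```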

Audio files

  • Save your audio file under common formats such as mp3
  • Make sure the moderator is the first person heard in the audio, so that speaker A is always the moderator/interviewer. This makes the data easier to filter
  • Make sure people do not speak over each other when recording the audio

Image files

  • Save your image file under common formats such as jpg
     _id |           Image-URL            | project
    -----|--------------------------------|---------
      1  |  https://my-repo/my-image1.jpg |   X1
      2  |  https://my-repo/my-image2.jpg |   X2
      3  |  https://my-repo/my-image3.jpg |   X1
      4  |  https://my-repo/my-image4.jpg |   X3
      5  |  https://my-repo/my-image5.jpg |   X2

See the guide on "How to update an existing dataset", which covers all of the items below:

  • adding new items (i.e. rows) to an existing dataset
  • adding new fields (columns) to an existing dataset
  • modifying existing values in a dataset