How to better prepare my data

This page gives a brief overview of the steps you can take to improve analysis results and to avoid error messages when using the platform.

File format

Your dataset should be saved in valid CSV or JSON format before being uploaded to Relevance AI's platform.

CSV files

CSV files store data as a table, similar to an Excel sheet. Make sure every column has a unique name, and keep the data type and format consistent within each column (see the "Field / Column values" section below).

No|Name|Company|Age
--|----|-------|---
1 |Jim |ABC    |32
2 |Jack|XYZ    |24
3 |Dave|LMN    |39
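
The unique-column-name requirement can be checked before uploading. The following is a minimal sketch using only Python's standard library; the function name `duplicate_columns` and the sample data are illustrative assumptions, not part of the platform.

```python
import csv
import io
from collections import Counter


def duplicate_columns(csv_text):
    """Return the header names that appear more than once in the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # first row is assumed to hold the column names
    counts = Counter(name.strip() for name in header)
    return sorted(name for name, n in counts.items() if n > 1)


sample = "No,Name,Company,Age\n1,Jim,ABC,32\n2,Jack,XYZ,24\n3,Dave,LMN,39\n"
print(duplicate_columns(sample))
```

An empty list means every column name is unique; any name returned should be renamed before upload.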

JSON files

JSON files are lists of dictionaries and are more commonly used by programmers. Keep the data type and format consistent for each field (see the "Field / Column values" section below).

[
  {"No":1, "Name":"Jim",  "Company":"ABC", "Age":32},
  {"No":2, "Name":"Jack", "Company":"XYZ", "Age":24},
  {"No":3, "Name":"Dave", "Company":"LMN", "Age":39}
]
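
A quick way to confirm that a JSON file has this shape is to load it and compare the fields of every record. This is a rough sketch under the assumption that all records should share the same set of fields; `check_records` is a hypothetical helper name.

```python
import json


def check_records(json_text):
    """Return the indexes of records whose fields differ from the first record."""
    records = json.loads(json_text)  # raises an error if the JSON is invalid
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("expected a list of dictionaries")
    expected = set(records[0]) if records else set()
    return [i for i, rec in enumerate(records) if set(rec) != expected]


data = '[{"No": 1, "Name": "Jim"}, {"No": 2, "Name": "Jack"}, {"No": 3}]'
print(check_records(data))
```

Each index in the result points at a record with missing or extra fields compared with the first one.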

Field / Column names

  • Choose short but descriptive names.
  • Use a unique name for each field / column in a dataset.
  • Avoid using . in field / column names.
  • If your dataset includes vectors, make sure each vector field name ends in _vector_
    Vector fields:
    Vector fields hold a numeric representation of your data (a list of numbers, i.e. a vector). For instance, if your dataset includes a field/column named "description" that contains a text description of each item, vectorizing each description value produces a corresponding vector. These vectors can be saved in the dataset under a vector field (an example is provided under Data FAQs).
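
The naming rules above can be applied automatically. The sketch below is one possible approach, not a platform feature: `clean_field_name` is a hypothetical helper, and replacing . with _ is an assumed convention.

```python
def clean_field_name(name, is_vector=False):
    """Return a field name without '.', ending in '_vector_' for vector fields."""
    cleaned = name.strip().replace(".", "_")  # assumed replacement character
    if is_vector:
        if cleaned.endswith("_vector"):
            cleaned += "_"  # only the trailing underscore is missing
        elif not cleaned.endswith("_vector_"):
            cleaned += "_vector_"
    return cleaned


print(clean_field_name("item.description"))
print(clean_field_name("description", is_vector=True))
```

After renaming, check again that the cleaned names are still unique, since two different original names can map to the same cleaned name.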

Field / Column values

Format

It is best to use only one data type and format in each column. For instance:

  • All date values in "yyyy-mm-dd" format, as opposed to mixed formats such as date = [21-01-18, 2021-Jan-18, March 13th 2020]
  • All price values as digits only, without the $ sign or currency words, as opposed to price = [150, $119.50, 200 dollars]
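
Mixed values like the ones above can often be normalized with a small script. The following sketch uses only the standard library; the list of accepted date formats and the helper names `normalize_date` and `normalize_price` are assumptions to adapt to your own data. Values that match none of the listed formats, such as the ambiguous 21-01-18, are returned as None so they can be reviewed by hand.

```python
import re
from datetime import datetime

# Input formats to try, in order (assumed; month names require an English locale).
DATE_FORMATS = ["%Y-%m-%d", "%Y-%b-%d", "%B %d %Y"]


def normalize_date(value):
    """Return the date as 'yyyy-mm-dd', or None if no known format matches."""
    # Drop ordinal suffixes such as "13th" -> "13" before parsing.
    text = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", value.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def normalize_price(value):
    """Return the first number found in the value as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


print([normalize_date(d) for d in ["2021-Jan-18", "March 13th 2020"]])
print([normalize_price(p) for p in [150, "$119.50", "200 dollars"]])
```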

Optional cleaning for text

When working with textual data, it is recommended (but not required) to apply certain preprocessing steps that can improve the analysis results. Common text preprocessing steps are:

  • Stop word removal: removing frequent but uninformative words (e.g. the, there).
  • Lemmatization: replacing words with their base form (e.g. changes or changing become change).
  • Lowercasing: converting all characters to lowercase.
  • Breaking into shorter pieces of text: when text is analyzed automatically, processing smaller pieces (e.g. a sentence rather than a paragraph) often produces more precise results.
  • Noise removal: this step is entirely data-specific. Common examples are removing HTML tags, URLs or hashtags.
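
Several of these steps can be sketched with the standard library alone. The example below covers noise removal, lowercasing, stop word removal and sentence splitting; lemmatization is left out because it normally needs a dedicated NLP library. The stop word list is a tiny illustrative sample and the regular expressions are simplified assumptions, not a complete cleaner.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "there", "a", "an", "is", "are", "of", "to", "and"}


def clean_text(text):
    """Strip HTML tags, URLs and hashtags, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)          # hashtags
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


print(clean_text("<p>There is a SALE at https://example.com #deal</p>"))
print(split_sentences("First one. Second one! Third?"))
```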

Media data

Media data such as images or audio files must be accessible via a URL to be used on Relevance AI's platform. If your media files are already available online, simply include their URLs in your dataset. Otherwise, you need to host them on the web first. Contact us via our website if you need storage space for uploading your media online.
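
Before uploading, it can help to confirm that every value in a media column looks like a web URL rather than a local file path. This is a minimal sketch that only inspects the text of the value and does not check that the file is actually reachable; `is_web_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_web_url(value):
    """Return True if the value is an http(s) URL with a host name."""
    parsed = urlparse(str(value).strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


values = ["https://example.com/cat.jpg", "images/cat.jpg"]
print([is_web_url(v) for v in values])
```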

