How do I manage data quality? Which pre-processing techniques are recommended when working with text?

When working with textual data, it is recommended (but not required) to apply certain pre-processing steps, which can improve the analysis results. Common text pre-processing techniques are:

  • Stop word removal: removing words that are frequent but carry little meaning (e.g. the, there).
  • Stemming: replacing words with their word stem, which need not be a real word (e.g. changes or changing become chang-).
  • Lemmatization: replacing words with their dictionary form, or lemma (e.g. changes or changing become change).
  • Lowercasing: converting all characters to their lowercase form.
  • Text cleaning: this step depends entirely on the data. Common cleaning steps include removing HTML markup, URLs or hashtags.
  • Breaking into shorter pieces of text: when automatically analyzing text, processing smaller pieces (e.g. a sentence rather than a paragraph) often produces more precise results.
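
As a rough illustration, several of the steps above can be sketched in Python using only the standard library. This is a minimal sketch, not a recommended implementation: the function names, the tiny stop word list and the naive sentence splitter are all assumptions made for the example. Stemming and lemmatization are left out because they usually rely on a dedicated NLP library.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "there", "a", "an", "is", "are", "of", "to", "and"}


def clean_text(text):
    """Text cleaning: drop HTML tags, URLs and hashtags, then tidy whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)          # hashtags
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def preprocess(text):
    """Return one list of lowercased tokens, without stop words, per sentence."""
    result = []
    for sentence in split_sentences(clean_text(text)):
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())  # lowercasing
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result


sample = (
    "<p>The service is GREAT! See https://example.com for details #happy. "
    "There are changes.</p>"
)
print(preprocess(sample))
```

Cleaning runs before sentence splitting here so that the periods inside URLs are not mistaken for sentence boundaries. Whether each step helps depends on the data and the analysis, so it is worth comparing results with and without it.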
