24.1 One-token-per-row (unnest_tokens)

  • String: Text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form.
  • Corpus: These types of objects typically contain raw strings annotated with additional metadata and details.
  • Document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents, with one row for each document and one column for each term. Each value in the matrix is typically a word count or a tf-idf weight (see Chapter 3).
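
The storage forms above can be related to the one-token-per-row layout named in this section's title with a minimal sketch. It assumes the dplyr and tidytext packages are installed, plus tm, which cast_dtm() needs to build the matrix. The two sample sentences and the names text, docs, tokens, counts, and dtm are made up for illustration; a corpus object is not shown.

```r
library(dplyr)
library(tidytext)

# String: a plain character vector, one element per document (made-up text)
text <- c("the cat sat on the mat", "the dog sat")

# One-token-per-row: put the strings in a data frame, then split them into
# lowercase word tokens so that each row holds exactly one word
docs <- tibble(doc = c("d1", "d2"), text = text)
tokens <- docs %>% unnest_tokens(word, text)

# Document-term matrix: count each word within each document, then cast the
# counts into a sparse matrix with one row per document and one column per term
counts <- tokens %>% count(doc, word)
dtm <- counts %>% cast_dtm(doc, word, n)
```

Here tokens has one row per word occurrence, while dtm has one row per document, so the same counts can move between the tidy layout and the matrix layout as an analysis requires.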