Workflow

quanteda has three basic types of objects:

  1. Corpus

    • Saves character strings and variables in a data frame
    • Combines texts with document-level variables
  2. Tokens

    • Stores tokens in a list of vectors
    • More efficient than character strings, while still preserving the positions of words
    • Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams() and tokens_select(), or fcm() with the window option
  3. Document-feature matrix (DFM)

    • Represents frequencies of features in documents in a matrix
    • The most efficient structure, but it does not have information on positions of words
    • Non-positional (bag-of-words) analysis is performed using many of the textstat_* and textmodel_* functions
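
The three object types above can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the two example texts, the `source` document variable, and the object names (`corp`, `toks`, `dfmat`) are made up for this sketch, and it assumes the quanteda package is installed.

```r
library(quanteda)

# Two toy documents (invented for illustration)
txt <- c(doc1 = "The quick brown fox jumps.",
         doc2 = "The lazy dog sleeps.")

# 1. Corpus: texts combined with document-level variables
corp <- corpus(txt, docvars = data.frame(source = c("a", "b")))

# 2. Tokens: a list of token vectors that keeps word order
toks <- tokens(corp, remove_punct = TRUE)

# Positional analysis works on tokens, because word order is still available
bigrams <- tokens_ngrams(toks, n = 2)                 # adjacent word pairs
cooc <- fcm(toks, context = "window", window = 2)     # co-occurrence within a window

# 3. DFM: feature counts per document; word positions are no longer stored
dfmat <- dfm(toks)

# Non-positional analysis works on the DFM
topfeatures(dfmat, 3)
```

The bigrams and the windowed co-occurrence matrix can only be computed from `toks`; once `dfm()` has been called, only the counts remain.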

Text analysis with quanteda passes through all three of these object types, either explicitly or implicitly.

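[Figure: workflow diagram. Text files and document-level variables are combined into a Corpus; the Corpus is converted to Tokens, which support positional (string-of-words) analysis; Tokens are converted to a DFM, which supports non-positional (bag-of-words) analysis.]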

For example, if character vectors are given to dfm(), it internally constructs corpus and tokens objects before creating a DFM.
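
The implicit conversion described here can be mimicked with an explicit wrapper. The sketch below is only an illustration of the corpus, tokens, DFM sequence: `texts_to_dfm()` is a hypothetical helper written for this example, not a quanteda function, and it does not reproduce quanteda's actual internals. Whether `dfm()` itself accepts character vectors directly may depend on the installed quanteda version, so spelling out the steps is the safer assumption.

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): makes the intermediate
# objects explicit when starting from a plain character vector.
texts_to_dfm <- function(x, ...) {
  stopifnot(is.character(x))
  corp <- corpus(x)         # character vector -> corpus
  toks <- tokens(corp, ...) # corpus -> tokens; extra arguments go to tokens()
  dfm(toks)                 # tokens -> document-feature matrix
}

dfmat <- texts_to_dfm(
  c(d1 = "A small example.", d2 = "Another small example."),
  remove_punct = TRUE
)
dfmat
```

Keeping the intermediate `corp` and `toks` objects in separate variables, rather than inside a wrapper, is useful whenever a positional analysis is needed alongside the DFM.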