Workflow
quanteda has three basic types of objects:
- Corpus
  - Saves character strings and variables in a data frame
  - Combines texts with document-level variables
- Tokens
  - Stores tokens in a list of vectors
  - More efficient than character strings, while still preserving the positions of words
  - Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams() and tokens_select(), or fcm() with the window option
- Document-feature matrix (DFM)
  - Represents frequencies of features in documents in a matrix
  - The most efficient structure, but it does not retain information on the positions of words
  - Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions
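As a rough illustration of the difference between positional and non-positional analysis, the sketch below runs both kinds of operation on a single made-up sentence. The example text is an assumption chosen for illustration; the functions are the ones named in the list above, plus featfreq() to read off feature counts.

```r
library(quanteda)

# Made-up example sentence (not from any real dataset)
toks <- tokens("the cat sat on the mat")

# Positional: bigrams depend on the order of tokens
bigrams <- tokens_ngrams(toks, n = 2)
print(bigrams)

# Positional: co-occurrence counts within a window of 2 tokens
fcmat <- fcm(toks, context = "window", window = 2)
print(fcmat)

# Non-positional: the DFM keeps only how often each feature occurs
dfmat <- dfm(toks)
print(featfreq(dfmat))
```

The bigrams and the windowed co-occurrence matrix can only be computed from the tokens object, because the DFM no longer records where each word appeared.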
Text analysis with quanteda passes through all three of these object types, either explicitly or implicitly.
For example, if character vectors are given to dfm(), it internally constructs corpus and tokens objects before creating a DFM.
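The explicit version of this workflow can be sketched as follows. The texts, document names and the year variable are hypothetical and exist only for illustration; each step produces one of the three object types described above.

```r
library(quanteda)

# Hypothetical texts; the names and the "year" variable are made up
txt <- c(doc1 = "The quick brown fox jumps.",
         doc2 = "The lazy dog sleeps all day.")

# 1. Corpus: texts combined with document-level variables
corp <- corpus(txt, docvars = data.frame(year = c(2020, 2021)))

# 2. Tokens: one vector of tokens per document, word order preserved
toks <- tokens(corp, remove_punct = TRUE)

# 3. DFM: documents-by-features frequency matrix, word order discarded
dfmat <- dfm(toks)

print(ndoc(corp))    # number of documents in the corpus
print(ntoken(toks))  # number of tokens per document
print(dim(dfmat))    # documents x features
```

Writing out each step makes it easier to inspect the intermediate objects and to apply operations such as tokens_select() before the DFM is built.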