Workflow

quanteda has three basic types of objects:

Corpus
- Saves character strings and variables in a data frame
- Combines texts with document-level variables
Tokens
- Stores tokens in a list of vectors
- More efficient than character strings, while still preserving the positions of words
- Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams(), and tokens_select(), or fcm() with the window option
Document-feature matrix (DFM)
- Represents frequencies of features in documents in a matrix
- The most efficient structure, but it does not have information on positions of words
- Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions
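
The difference between positional and non-positional objects can be seen in a minimal sketch. The two texts and the docvar named label are made up for illustration and are not part of the tutorial; the sketch assumes the quanteda package is installed.

```r
library(quanteda)

# Two hypothetical documents with the same words in a different order
txt <- c(d1 = "dog bites man", d2 = "man bites dog")

# Corpus: texts combined with a document-level variable (label is illustrative)
corp <- corpus(txt, docvars = data.frame(label = c("common", "rare")))

# Tokens: a list of vectors that keeps the order of words
toks <- tokens(corp)

# DFM: a document-by-feature matrix of counts, with no word positions
dfmat <- dfm(toks)

# Word order survives in tokens, so the bigrams differ between documents
bigrams <- tokens_ngrams(toks, n = 2)
print(bigrams)

# Word order is lost in the DFM, so both rows hold the same counts
print(as.matrix(dfmat))
```

Because the DFM only counts features, the two documents become indistinguishable there, while any analysis on the tokens object can still tell them apart.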

Text analysis with quanteda passes through all three of these object types, either explicitly or implicitly.

For example, if character vectors are given to dfm(), it internally constructs corpus and tokens objects before creating a DFM.
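
The same chain can also be written out explicitly, one object at a time. This is a sketch with made-up example texts, not code from the tutorial. Depending on the installed quanteda version, passing character vectors straight to dfm() may be deprecated or unsupported, so spelling out each step is the safer assumption.

```r
library(quanteda)

# Hypothetical example texts (not from the tutorial)
txt <- c(a = "Text analysis is fun.",
         b = "Analysis of text takes time.")

# Step 1: character vector -> corpus
corp <- corpus(txt)

# Step 2: corpus -> tokens (punctuation dropped here for illustration)
toks <- tokens(corp, remove_punct = TRUE)

# Step 3: tokens -> document-feature matrix (features are lowercased by default)
dfmat <- dfm(toks)

# Confirm the type of object produced at each stage
print(c(is.corpus(corp), is.tokens(toks), is.dfm(dfmat)))
print(featnames(dfmat))
```

Writing the steps separately also leaves room to inspect or modify the tokens object, for example with tokens_select(), before the positions of words are discarded in the DFM.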