Workflow

quanteda has three basic types of objects:

Corpus
- Saves character strings and variables in a data frame
- Combines texts with document-level variables
Tokens
- Stores tokens in a list of vectors
- More efficient than character strings, while still preserving the positions of words
- Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams() and tokens_select(), or fcm() with the window option
Document-feature matrix (DFM)
- Represents frequencies of features in documents in a matrix
- The most efficient structure, but it does not retain information on the positions of words
- Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions
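
The difference between positional and non-positional objects can be sketched with two toy documents that contain the same words in opposite order. The texts and object names here are made up for illustration; this is a minimal sketch, not a recommended analysis.

```r
library(quanteda)

# Two toy documents with the same words in reverse order (illustrative only)
toks <- tokens(c(d1 = "a b c", d2 = "c b a"))

# Positional: bigrams depend on word order, so the two documents differ
toks_ngram <- tokens_ngrams(toks, n = 2)
print(as.list(toks_ngram))

# Positional: co-occurrences counted within a window of one token
fcmat <- fcm(toks, context = "window", window = 1)
print(fcmat)

# Non-positional: the DFM keeps only counts, so both rows are identical
dfmat <- dfm(toks)
print(dfmat)
```

The tokens object distinguishes the two documents, while the DFM cannot, because it stores only how often each feature occurs.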

Text analysis with quanteda goes through all three types of objects, either explicitly or implicitly.

graph TD
    D[Text files]
    V[Document-level variables]
    C(Corpus)
    T(Tokens)
    AP["Positional analysis (string-of-words)"]
    AN["Non-positional analysis (bag-of-words)"]
    M(DFM)
    style C stroke-width:4px
    style T stroke-width:4px
    style M stroke-width:4px
    D --> C
    V --> C
    C --> T
    T --> M
    T -.-> AP
    M -.-> AN

For example, if character vectors are given to dfm(), it internally constructs corpus and tokens objects before creating a DFM.
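
Written out explicitly, the three steps might look like the sketch below. The example texts, the document names and the `source` variable are hypothetical, chosen only to show how each object is built from the previous one.

```r
library(quanteda)

# Hypothetical input texts and one document-level variable
txt <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
         doc2 = "The dog barks at the quick fox.")
vars <- data.frame(source = c("news", "blog"))

# Step 1: corpus combines the texts with document-level variables
corp <- corpus(txt, docvars = vars)

# Step 2: tokens splits each text into words and keeps their order
toks <- tokens(corp, remove_punct = TRUE)

# Step 3: DFM counts features per document and drops word order
dfmat <- dfm(toks)

print(dfmat)
print(docvars(dfmat))
```

Keeping the intermediate corpus and tokens objects makes it possible to reuse them for positional analysis before the DFM is created.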