Group documents

require(quanteda)
options(width = 110)

toks_inaug <- tokens(data_corpus_inaugural)
dfmat_inaug <- dfm(toks_inaug)
print(dfmat_inaug)

## Document-feature matrix of: 59 documents, 9,439 features (91.84% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and house representatives : among vicissitudes
##   1789-Washington               1  71 116      1  48     2               2 1     1            1
##   1793-Washington               0  11  13      0   2     0               0 1     0            0
##   1797-Adams                    3 140 163      1 130     0               2 0     4            0
##   1801-Jefferson                2 104 130      0  81     0               0 1     1            0
##   1805-Jefferson                0 101 143      0  93     0               0 0     7            0
##   1809-Madison                  1  69 104      0  43     0               0 0     0            0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]

head(colSums(dfmat_inaug), 10)

## fellow-citizens              of             the          senate             and           house 
##              39            7180           10183              15            5406              11 
## representatives               :           among    vicissitudes 
##              19             144             108               5

dfm_group() merges documents that share the same value in the vector passed to the groups argument. The feature frequencies of the merged documents are summed.

dfmat_party <- dfm_group(dfmat_inaug, groups = Party)
print(dfmat_party)

## Document-feature matrix of: 6 documents, 9,439 features (66.93% sparse) and 1 docvar.
##                        features
## docs                    fellow-citizens   of  the senate  and house representatives  : among vicissitudes
##   Democratic                          3 1994 2742      2 1728     4               3 54    25            3
##   Democratic-Republican              10  945 1416      0  640     0               2  1    16            1
##   Federalist                          3  140  163      1  130     0               2  0     4            0
##   none                                1   82  129      1   50     2               2  2     1            1
##   Republican                          9 3055 4408      5 2386     4               6 86    52            0
##   Whig                               13  964 1325      6  472     1               4  1    10            0
## [ reached max_nfeat ... 9,429 more features ]

head(colSums(dfmat_party), 10)

## fellow-citizens              of             the          senate             and           house 
##              39            7180           10183              15            5406              11 
## representatives               :           among    vicissitudes 
##              19             144             108               5
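Because grouping sums feature counts, the total for each feature across all documents should be the same before and after dfm_group(), as the two colSums() outputs above suggest. The following is a minimal sketch of that check, rebuilding the same objects so that it runs on its own; dfmat_whig is a name introduced here purely for illustration.

```r
library(quanteda)

dfmat_inaug <- dfm(tokens(data_corpus_inaugural))
dfmat_party <- dfm_group(dfmat_inaug, groups = Party)

# Grouping sums counts within each party, so per-feature totals are unchanged
all.equal(colSums(dfmat_inaug), colSums(dfmat_party))

# The row for one group equals the column sums over its member documents
dfmat_whig <- dfm_subset(dfmat_inaug, Party == "Whig")  # illustrative name
all.equal(colSums(dfmat_whig), colSums(dfmat_party["Whig", ]))
```

The second comparison uses dfm_subset() to select the documents belonging to one party and compares their summed counts with that party's row in the grouped matrix.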

From quanteda version 3.0 onwards, dfm_group() supports non-standard evaluation. This means that the name of the grouping variable should be written without quotation marks.
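As a rough illustration of the two ways to specify groups, the sketch below passes the document variable by its bare name and then as an explicit vector with one value per document. It assumes quanteda 3.0 or later; dfmat_nse and dfmat_vec are names introduced here for illustration only.

```r
library(quanteda)

dfmat_inaug <- dfm(tokens(data_corpus_inaugural))

# Bare (unquoted) name, looked up among the document variables
dfmat_nse <- dfm_group(dfmat_inaug, groups = Party)

# Explicit vector with one element per document
dfmat_vec <- dfm_group(dfmat_inaug, groups = docvars(dfmat_inaug, "Party"))

# Both calls should yield the same groups
identical(docnames(dfmat_nse), docnames(dfmat_vec))
```

The explicit-vector form is useful when the grouping values are not stored as a document variable, for example when they are computed on the fly.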

dfm_group() keeps only the document-level variables whose values are identical within every group and drops the rest. In this example only Party is retained.

docvars(dfmat_party)

##                   Party
## 1            Democratic
## 2 Democratic-Republican
## 3            Federalist
## 4                  none
## 5            Republican
## 6                  Whig
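To see which variables are dropped, the names of the document variables can be compared before and after grouping. This is a small sketch rather than part of the original tutorial; the last call groups by Year on the assumption that each year occurs only once in this corpus, in which case no variable would vary within a group.

```r
library(quanteda)

dfmat_inaug <- dfm(tokens(data_corpus_inaugural))
dfmat_party <- dfm_group(dfmat_inaug, groups = Party)

# Document variables before grouping
names(docvars(dfmat_inaug))

# Document variables after grouping by party
names(docvars(dfmat_party))

# Grouping by a variable assumed to be unique to each document
names(docvars(dfm_group(dfmat_inaug, groups = Year)))
```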