require(quanteda)
require(quanteda.textstats)
options(width = 110)
dfm()
constructs a document-feature matrix (DFM) from a tokens object.
toks_inaug <- tokens(data_corpus_inaugural, remove_punct = TRUE)
dfmat_inaug <- dfm(toks_inaug)
print(dfmat_inaug)
## Document-feature matrix of: 59 documents, 9,423 features (91.89% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and house representatives among vicissitudes incident
## 1789-Washington 1 71 116 1 48 2 2 1 1 1
## 1793-Washington 0 11 13 0 2 0 0 0 0 0
## 1797-Adams 3 140 163 1 130 0 2 4 0 0
## 1801-Jefferson 2 104 130 0 81 0 0 1 0 0
## 1805-Jefferson 0 101 143 0 93 0 0 7 0 0
## 1809-Madison 1 69 104 0 43 0 0 0 0 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,413 more features ]
You can get the number of documents and features ndoc()
and nfeat()
.
ndoc(dfmat_inaug)
## [1] 59
nfeat(dfmat_inaug)
## [1] 9423
You can also obtain the names of documents and features by docnames()
and featnames()
.
head(docnames(dfmat_inaug), 20)
## [1] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" "1805-Jefferson"
## [6] "1809-Madison" "1813-Madison" "1817-Monroe" "1821-Monroe" "1825-Adams"
## [11] "1829-Jackson" "1833-Jackson" "1837-VanBuren" "1841-Harrison" "1845-Polk"
## [16] "1849-Taylor" "1853-Pierce" "1857-Buchanan" "1861-Lincoln" "1865-Lincoln"
head(featnames(dfmat_inaug), 20)
## [1] "fellow-citizens" "of" "the" "senate" "and"
## [6] "house" "representatives" "among" "vicissitudes" "incident"
## [11] "to" "life" "no" "event" "could"
## [16] "have" "filled" "me" "with" "greater"
Just like normal matrices, you can userowSums()
and colSums()
to calculate marginals.
head(rowSums(dfmat_inaug), 10)
## 1789-Washington 1793-Washington 1797-Adams 1801-Jefferson 1805-Jefferson 1809-Madison
## 1430 135 2318 1726 2166 1175
## 1813-Madison 1817-Monroe 1821-Monroe 1825-Adams
## 1210 3370 4472 2915
head(colSums(dfmat_inaug), 10)
## fellow-citizens of the senate and house
## 39 7180 10183 15 5406 11
## representatives among vicissitudes incident
## 19 108 5 8
The most frequent features can be found using topfeatures()
.
topfeatures(dfmat_inaug, 10)
## the of and to in a our we that be
## 10183 7180 5406 4591 2827 2292 2224 1827 1813 1502
If you want to convert the frequency count to a proportion within documents, use dfm_weight(scheme = "prop")
.
dfmat_inaug_prop <- dfm_weight(dfmat_inaug, scheme = "prop")
print(dfmat_inaug_prop)
## Document-feature matrix of: 59 documents, 9,423 features (91.89% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and house representatives
## 1789-Washington 0.0006993007 0.04965035 0.08111888 0.0006993007 0.03356643 0.001398601 0.0013986014
## 1793-Washington 0 0.08148148 0.09629630 0 0.01481481 0 0
## 1797-Adams 0.0012942192 0.06039689 0.07031924 0.0004314064 0.05608283 0 0.0008628128
## 1801-Jefferson 0.0011587486 0.06025492 0.07531866 0 0.04692932 0 0
## 1805-Jefferson 0 0.04662973 0.06602031 0 0.04293629 0 0
## 1809-Madison 0.0008510638 0.05872340 0.08851064 0 0.03659574 0 0
## features
## docs among vicissitudes incident
## 1789-Washington 0.0006993007 0.0006993007 0.0006993007
## 1793-Washington 0 0 0
## 1797-Adams 0.0017256255 0 0
## 1801-Jefferson 0.0005793743 0 0
## 1805-Jefferson 0.0032317636 0 0
## 1809-Madison 0 0 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,413 more features ]
textstat_frequency()
, described in Chapter 4, offers more advanced functionalities than topfeatures()
and returns a data.frame
object, making it easier to use the output for further analyses.
You can also weight the frequency count by uniqueness of the features across documents using dfm_tfidf()
.
dfmat_inaug_tfidf <- dfm_tfidf(dfmat_inaug)
print(dfmat_inaug_tfidf)
## Document-feature matrix of: 59 documents, 9,423 features (91.89% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and house representatives among vicissitudes
## 1789-Washington 0.4920984 0 0 0.8166095 0 1.735524 1.249448 0.1373836 1.071882
## 1793-Washington 0 0 0 0 0 0 0 0 0
## 1797-Adams 1.4762952 0 0 0.8166095 0 0 1.249448 0.5495342 0
## 1801-Jefferson 0.9841968 0 0 0 0 0 0 0.1373836 0
## 1805-Jefferson 0 0 0 0 0 0 0 0.9616849 0
## 1809-Madison 0.4920984 0 0 0 0 0 0 0 0
## features
## docs incident
## 1789-Washington 0.9927008
## 1793-Washington 0
## 1797-Adams 0
## 1801-Jefferson 0
## 1805-Jefferson 0
## 1809-Madison 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,413 more features ]
Even after applying dfm_weight()
or dfm_tfidf()
, topfeatures()
works on a document-feature matrix, but it can be misleading if applied to more than one document.