A feature co-occurrence matrix (FCM) records the number of co-occurrences of tokens. It is a special object in quanteda, but behaves similarly to a DFM.
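To make the definition concrete, the following base R sketch builds a small co-occurrence matrix by hand. The matrix `toy_dfm`, its feature names, and its counts are made up for illustration. The counting rule used here (the number of documents that contain both features) is a simplification and not necessarily how quanteda counts co-occurrences.

```r
# Toy document-feature matrix: 3 documents (rows) by 3 features (columns).
# All counts are invented for this illustration.
toy_dfm <- matrix(
  c(1, 1, 0,
    0, 1, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("tax", "government", "budget"))
)

# 1 if the feature occurs in the document at all, 0 otherwise.
present <- (toy_dfm > 0) * 1

# Cell [i, j] is the number of documents containing both feature i and j.
toy_fcm <- crossprod(present)  # same as t(present) %*% present
diag(toy_fcm) <- 0             # ignore a feature co-occurring with itself
toy_fcm
```

In this toy matrix, `toy_fcm["tax", "government"]` is 2, because both words appear in doc1 and doc3. The result has one row and one column per feature.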
require(quanteda)
require(quanteda.corpora)
corp <- download('data_corpus_guardian')
When a corpus is large, you have to select features of a DFM before constructing an FCM, because the FCM has one row and one column for every feature.
news_dfm <- dfm(corp, remove = stopwords('en'), remove_punct = TRUE)
news_dfm <- dfm_remove(news_dfm, pattern = c('*-time', 'updated-*', 'gmt', 'bst'))
news_dfm <- dfm_trim(news_dfm, min_termfreq = 100)
topfeatures(news_dfm)
##       said     people        one        new       also         us
##      28413      11169       9884       8024       7901       7091
##        can government       year       last
##       6972       6821       6570       6335
nfeat(news_dfm)
## [1] 4209
You can construct an FCM from a DFM or a tokens object using fcm(). topfeatures() returns the most frequently co-occurring words.
news_fcm <- fcm(news_dfm)
dim(news_fcm)
## [1] 4209 4209
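Because fcm() also accepts a tokens object, co-occurrence can be counted within a window of neighbouring tokens rather than across whole documents. The following is a minimal sketch, assuming the quanteda package is installed; the two toy sentences and the name `toy_fcm_window` are hypothetical and not part of the Guardian example.

```r
library(quanteda)

# Two made-up "documents", kept tiny so the result is easy to inspect.
toks <- tokens(c(d1 = "a b c d", d2 = "a c e"))

# Count co-occurrences only between tokens that are at most
# one position apart (one token on either side).
toy_fcm_window <- fcm(toks, context = "window", window = 1)
toy_fcm_window
dim(toy_fcm_window)
```

Window-based counting needs token order, so it works on a tokens object but not on a DFM, where word positions are no longer available.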
You can select features of an FCM using fcm_select().
feat <- names(topfeatures(news_fcm, 50))
news_fcm <- fcm_select(news_fcm, pattern = feat)
dim(news_fcm)
## [1] 50 50
An FCM can be used to train word embedding models with the text2vec package, or to visualize a semantic network with textplot_network().
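# log of the total frequency of each selected feature, used to scale vertices
size <- log(colSums(dfm_select(news_dfm, feat)))
# fix the random seed so the network layout is reproducible
set.seed(144)
textplot_network(news_fcm, min_freq = 0.8, vertex_size = size / max(size) * 3)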