Simple frequency analysis

require(quanteda)
require(quanteda.textstats)  # provides textstat_frequency() in quanteda >= 3.0
require(quanteda.textplots)  # provides textplot_wordcloud() in quanteda >= 3.0
require(quanteda.corpora)
require(ggplot2)

Unlike topfeatures(), textstat_frequency() shows both term and document frequencies. You can also use the function to find the most frequent features within groups.
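As a minimal sketch of that difference, using quanteda's built-in data_corpus_inaugural rather than the Twitter data downloaded below (the object name dfmat_inaug is only used here for illustration):

# topfeatures() returns a named vector of term frequencies only;
# textstat_frequency() returns a data frame with frequency, rank, docfreq and group
dfmat_inaug <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
topfeatures(dfmat_inaug, 5)
textstat_frequency(dfmat_inaug, n = 5)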

tweet_corp <- download(url = 'https://www.dropbox.com/s/846skn1i5elbnd2/data_corpus_sampletweets.rds?dl=1')
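To confirm what was downloaded, a quick optional inspection of the corpus might look like this:

# number of documents and the document-level variables (we use "lang" below)
ndoc(tweet_corp)
head(docvars(tweet_corp))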

We analyse the most frequent hashtags by applying select = "#*" when creating the dfm, which keeps only features that begin with "#".

tweet_toks <- tokens(tweet_corp, remove_punct = TRUE) 
tweet_dfm <- dfm(tweet_toks, select = "#*")
freq <- textstat_frequency(tweet_dfm, n = 5, groups = docvars(tweet_dfm, "lang"))
head(freq, 20)
##              feature frequency rank docfreq     group
## 1           #twitter         1    1       1    Basque
## 2     #canviemeuropa         1    2       1    Basque
## 3             #prest         1    3       1    Basque
## 4           #psifizo         1    4       1    Basque
## 5     #ekloges2014gr         1    5       1    Basque
## 6            #ep2014         1    1       1 Bulgarian
## 7         #yourvoice         1    2       1 Bulgarian
## 8      #eudebate2014         1    3       1 Bulgarian
## 9            #велико         1    4       1 Bulgarian
## 10 #savedonbaspeople         1    1       1  Croatian
## 11   #vitoriagasteiz         1    2       1  Croatian
## 12           #ep14dk        31    1      31    Danish
## 13            #dkpol        18    2      18    Danish
## 14            #eupol         7    3       7    Danish
## 15        #vindtilep         6    4       6    Danish
## 16    #patentdomstol         4    5       4    Danish
## 17           #ep2014        34    1      34     Dutch
## 18              #vvd        10    2      10     Dutch
## 19               #eu         8    3       6     Dutch
## 20              #pvv         8    4       8     Dutch
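Because textstat_frequency() returns an ordinary data frame, you can filter the result with base R, for example to keep only one language group (shown here for Danish, purely as an illustration):

# keep only the Danish rows of the grouped frequency table
subset(freq, group == "Danish")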

You can also easily plot these hashtag frequencies using ggplot().

tweet_dfm %>% 
  textstat_frequency(n = 15) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()
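If you want a per-language view rather than the overall top 15, a faceted version of the same plot is one option. The sketch below assumes the labels "Danish", "Dutch", and "English" match values of the lang docvar; adjust them to your data (dfm_sub and freq_sub are names introduced here for illustration).

# sketch: top 5 hashtags for a few selected languages, one panel per language
dfm_sub <- dfm_subset(tweet_dfm, lang %in% c("Danish", "Dutch", "English"))
freq_sub <- textstat_frequency(dfm_sub, n = 5, groups = docvars(dfm_sub, "lang"))

ggplot(freq_sub, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  facet_wrap(~ group, scales = "free") +  # features are ordered by overall frequency
  labs(x = NULL, y = "Frequency") +
  theme_minimal()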

Alternatively, you can create a word cloud of the 100 most common hashtags.

textplot_wordcloud(tweet_dfm, max_words = 100)
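textplot_wordcloud() also takes arguments for the minimum count, colours, and rotation; the values below are illustrative choices, not defaults used in this tutorial:

# drop hashtags used fewer than 5 times, colour from least to most frequent,
# and rotate a quarter of the words
textplot_wordcloud(tweet_dfm, max_words = 100, min_count = 5,
                   color = c("grey70", "steelblue", "darkblue"),
                   rotation = 0.25)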

Finally, it is possible to compare different groups within one word cloud. We first create a dummy variable that indicates whether a tweet was posted in English or in another language. Afterwards, we compare the most frequent hashtags of English and non-English tweets.

# create document-level variable indicating whether Tweet was in English or other language
docvars(tweet_corp, "dummy_english") <- factor(ifelse(docvars(tweet_corp, "lang") == "English", "English", "Not English"))
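Before grouping, it can be useful to check how the tweets split across the two categories (a quick sanity check, not part of the pipeline above):

# number of tweets in each category
table(docvars(tweet_corp, "dummy_english"))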

# create a grouped dfm and compare groups
tweet_corp_language <- dfm(tweet_corp, select = "#*", groups = "dummy_english")
textplot_wordcloud(tweet_corp_language, comparison = TRUE, max_words = 200)