We can compound multi-word expressions through collocation analysis. In this example, we will identify sequences of capitalized words and compound them as proper names, which are important linguistic features of newspaper articles.
require(quanteda)
require(quanteda.textstats)
require(quanteda.corpora)
options(width = 110)
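As a minimal sketch of what tokens_compound() does, here is a toy example (the sentence and patterns are illustrative, not taken from the corpus used below):
# toy example: compound two known phrases into single tokens
toks_toy <- tokens("Boris Johnson visited New York last week")
tokens_compound(toks_toy, pattern = phrase(c("Boris Johnson", "New York")))
Each matched sequence becomes a single token, e.g. "Boris_Johnson" and "New_York".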
This corpus contains 6,000 Guardian news articles from 2012 to 2016.
corp_news <- download("data_corpus_guardian")
We remove punctuation marks and symbols in tokens() and stopwords in tokens_remove(), with padding = TRUE to keep the original positions of tokens.
toks_news <- tokens(corp_news, remove_punct = TRUE, remove_symbols = TRUE, padding = TRUE) %>%
    tokens_remove(stopwords("en"), padding = TRUE)
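To see the effect of padding = TRUE, a quick illustration on a toy sentence (an assumed example, not from the corpus): removed tokens are replaced by empty strings rather than deleted, so the remaining tokens keep their positions.
# stopwords such as "The", "is", and "in" become "" pads instead of disappearing
tokens("The White House is in Washington") %>%
    tokens_remove(stopwords("en"), padding = TRUE)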
One of the most common types of multi-word expressions is proper names, which we can select simply based on capitalization in English texts.
toks_news_cap <- tokens_select(toks_news,
                               pattern = "^[A-Z]",
                               valuetype = "regex",
                               case_insensitive = FALSE,
                               padding = TRUE)
tstat_col_cap <- textstat_collocations(toks_news_cap, min_count = 10, tolower = FALSE)
head(tstat_col_cap, 20)
## collocation count count_nested length lambda z
## 1 David Cameron 860 0 2 8.288932 149.63137
## 2 Donald Trump 774 0 2 8.459635 124.65110
## 3 George Osborne 362 0 2 8.780452 109.09755
## 4 Hillary Clinton 525 0 2 9.226408 104.05605
## 5 New York 1016 0 2 10.580500 101.53958
## 6 Islamic State 330 0 2 9.934794 99.47335
## 7 White House 479 0 2 10.054592 97.46159
## 8 European Union 348 0 2 8.371449 96.20098
## 9 Jeremy Corbyn 244 0 2 8.862147 92.17368
## 10 Boris Johnson 245 0 2 9.796771 85.93115
## 11 Bernie Sanders 394 0 2 10.034026 85.75118
## 12 Guardian Australia 237 0 2 6.460533 85.44272
## 13 Northern Ireland 205 0 2 10.015792 84.37474
## 14 Home Office 216 0 2 9.823115 79.74718
## 15 Ed Miliband 173 0 2 9.984852 79.42691
## 16 Barack Obama 343 0 2 9.892183 78.99228
## 17 South Africa 172 0 2 7.701286 78.97229
## 18 Ted Cruz 417 0 2 10.888985 78.83869
## 19 Black Friday 190 0 2 8.591620 78.00359
## 20 South Carolina 271 0 2 9.537382 77.88448
We compound only strongly associated multi-word expressions here by subsetting tstat_col_cap on the z-score (z > 3); under the normal approximation, z > 3 corresponds to a one-sided p-value below 0.002.
toks_comp <- tokens_compound(toks_news, pattern = tstat_col_cap[tstat_col_cap$z > 3, ],
                             case_insensitive = FALSE)
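As a quick sanity check (an assumed extra step, not part of the original code), we can count how many collocations pass this threshold and are therefore used as compounding patterns:
# number of collocations with z > 3
sum(tstat_col_cap$z > 3)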
kw_comp <- kwic(toks_comp, pattern = c("London_*", "British_*"))
head(kw_comp, 10)
## Keyword-in-context with 10 matches.
## [text9204, 398] researchers publishing | British_Medical_Journal | found drop heart
## [text150582, 373] including Bermuda | British_Virgin_Islands | Cayman_Islands
## [text150582, 663] included Panama | British_Virgin_Islands | published
## [text120395, 1117] director general | British_Chambers | Commerce said businesses
## [text64192, 300] Guardian 90 York_Way | London_N1 | 9GU Please include
## [text145860, 1814] Association | British_Insurers | ABI says insurers
## [text148174, 435] EZY5258 Rome Fiumicino | London_Gatwick | 29 March delayed
## [text109224, 17] Coast range | British_Columbia | Hanging
## [text109224, 115] coast | British_Columbia | Today however
## [text109224, 220] Alberta coast | British_Columbia | plan
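A possible follow-up (hypothetical here, not shown above) is to build a document-feature matrix from the compounded tokens so that multi-word proper names are counted as single features; remove_padding = TRUE drops the empty-string pads introduced earlier:
# count compounded proper names alongside ordinary tokens
dfmat_comp <- dfm(toks_comp, remove_padding = TRUE)
topfeatures(dfmat_comp, 10)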