Newsmap is a semi-supervised model for geographical document classification. Whereas fully supervised models are trained on manually classified data, this semi-supervised model learns from “seed words” in dictionaries.
Install the newsmap package from CRAN.
install.packages("newsmap")
require(quanteda)
require(quanteda.corpora)
require(newsmap)
require(maps)
require(ggplot2)
Download a corpus of news articles using quanteda.corpora's download() function.
corp_news <- download(url = "https://www.dropbox.com/s/r8zhsu8zvjzhnml/data_corpus_yahoonews.rds?dl=1")
corp_news contains 10,000 news summaries downloaded from Yahoo News in 2014.
ndoc(corp_news)
## [1] 10000
range(corp_news$date)
## [1] "2014-01-01" "2014-12-31"
Proper nouns are the most useful features of documents for geographical classification. However, not all capitalized words are proper nouns, so we define custom stopwords.
month <- c("January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December")
day <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
agency <- c("AP", "AFP", "Reuters")
toks_news <- tokens(corp_news, remove_punct = TRUE) %>%
    tokens_remove(pattern = c(stopwords("en"), month, day, agency),
                  valuetype = "fixed", padding = TRUE)
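The padding = TRUE argument leaves an empty string in place of each removed token, so words that were separated by a stopword do not become adjacent. A minimal sketch on a made-up sentence (the text and the object names toks_toy and toks_pad are illustrative assumptions, not part of the corpus):

```r
library(quanteda)

# Hypothetical one-sentence example; not taken from the Yahoo News corpus
toks_toy <- tokens("Reuters reported on Monday that talks in New York had stalled",
                   remove_punct = TRUE)

# Remove stopwords and custom words but keep their positions as empty strings
toks_pad <- tokens_remove(toks_toy,
                          pattern = c(stopwords("en"), "Monday", "Reuters"),
                          valuetype = "fixed", padding = TRUE)

# Removed tokens appear as "", so the sequence keeps its original length
as.list(toks_pad)[[1]]
```

Keeping the positions matters later if tokens are compounded, because only words that were truly adjacent in the text can then be joined.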
The newsmap package contains seed geographical dictionaries in English, German, Spanish, Japanese and Russian. data_dictionary_newsmap_en is the seed dictionary for English texts.
toks_label <- tokens_lookup(toks_news, dictionary = data_dictionary_newsmap_en,
                            levels = 3) # level 3 is countries
dfmat_label <- dfm(toks_label, tolower = FALSE)
dfmat_feat <- dfm(toks_news, tolower = FALSE)
dfmat_feat_select <- dfm_select(dfmat_feat, pattern = "^[A-Z][A-Za-z0-9]+",
                                valuetype = "regex", case_insensitive = FALSE) %>%
    dfm_trim(min_termfreq = 10)
tmod_nm <- textmodel_newsmap(dfmat_feat_select, y = dfmat_label)
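The regular expression passed to dfm_select() above keeps features that start with an upper-case letter followed by at least one letter or digit. The sketch below illustrates the pattern with base R's grepl() rather than dfm_select() itself; the feature names are hypothetical, and perl = TRUE is used as a stand-in that should behave the same way for this simple ASCII pattern.

```r
# Hypothetical feature names, chosen to show which ones the pattern keeps
feats <- c("Washington", "WASHINGTON", "washington", "A",
           "Abe's", "2014", "Asia-Pacific")

# Case-sensitive match: an upper-case letter, then at least one letter or digit.
# The pattern is anchored only at the start, so "Abe's" and "Asia-Pacific" match.
keep <- grepl("^[A-Z][A-Za-z0-9]+", feats, perl = TRUE)

feats[keep]
```

Single-letter tokens such as "A", lower-case words and numbers are dropped, which is why the model's features consist mostly of proper nouns.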
The seed dictionary contains only names of countries and capital cities, but the model additionally extracts features associated with the countries. Countries are identified by their two-letter codes as defined in ISO 3166-1.
coef(tmod_nm, n = 15)[c("US", "GB", "FR", "BR", "JP")]
## $US
## WASHINGTON US American Washington YORK States Americans
## 7.154239 7.036785 6.829831 6.605774 6.369994 6.054570 5.359837
## York Brunnstrom Kirby Platinum Anglo Stewart Keystone
## 4.993287 3.781652 3.701609 3.614598 3.568078 3.162613 3.088505
## Admiral
## 3.008462
##
## $GB
## British LONDON London Britain Britain's UK UKIP Kingdom
## 7.877081 7.846983 7.562997 7.264183 6.670409 5.487239 4.905318 4.774289
## Tesco Hamza Cameron's Osborne Salmond Clegg Cameron
## 4.445785 4.358774 3.857999 3.752638 3.695480 3.665627 3.595009
##
## $FR
## French France PARIS Paris Hollande
## 8.183063 8.088991 7.541210 7.303251 6.532546
## Hollande's Fabius Valls Francois Saint-Germain
## 5.401143 5.295783 5.295783 5.277091 4.970361
## Froome Le France's Renault Pen
## 4.803306 3.755338 3.734547 3.704694 3.504023
##
## $BR
## Brazil SAO PAULO RIO JANEIRO Brazilian Rio DE
## 8.174995 7.261448 7.247654 7.048526 7.048526 6.996340 6.922232 6.355378
## Janeiro Sao Paulo BELO HORIZONTE BRASILIA Dilma
## 6.303193 5.966720 5.966720 5.915427 5.915427 5.804201 5.306363
##
## $JP
## Japan Japanese TOKYO Abe Tokyo Shinzo
## 8.166952 7.896661 7.764115 7.063062 6.962979 6.791129
## Abe's Tokyo's Fukushima Japan's Nikkei Toyota
## 5.810299 5.317823 5.171219 4.390779 3.836218 3.479543
## Pyongyang Asia-Pacific Honda
## 3.182292 3.156316 3.143071
Names of people, organizations and places are often multi-word expressions. To distinguish between “New York” and “York”, for example, it is useful to compound tokens using tokens_compound() as explained in Advanced Operations.
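As a minimal sketch of compounding, assuming a made-up sentence and a single hand-written phrase (the object names toks_toy and toks_comp are illustrative), tokens_compound() joins the matched sequence into one token:

```r
library(quanteda)

# Hypothetical sentence containing both "New York" and a standalone "York"
toks_toy <- tokens("Talks moved from New York to York last week")

# Join the two-word sequence into one token; matching is case-sensitive here
toks_comp <- tokens_compound(toks_toy, pattern = phrase("New York"),
                             case_insensitive = FALSE)

# "New_York" is now a feature distinct from the standalone "York"
as.list(toks_comp)[[1]]
```

In practice the phrases would usually come from collocation analysis rather than being typed by hand.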
You can predict the most strongly associated countries using predict() and count the frequency using table().
pred_nm <- predict(tmod_nm)
head(pred_nm, 20)
## text1 text2 text3 text4 text5 text6 text7 text8 text9 text10 text11
## KP SY IQ RU TH CN UA SY GB US SY
## text12 text13 text14 text15 text16 text17 text18 text19 text20
## US UA SY LK ES AU CR ID BH
## 204 Levels: BI DJ ER ET KE MG MU MW MZ RE RW SO TZ UG ZM ZW AO CD CF CG ... WS
Factor levels are set to obtain zero counts for countries that did not appear in the corpus.
count <- sort(table(factor(pred_nm, levels = colnames(dfmat_label))), decreasing = TRUE)
head(count, 20)
##
## GB US RU UA AU CN CA FR IQ BR SY DE ZA NZ JP IL IN ES EG PS
## 621 578 516 440 367 362 319 311 295 278 262 250 236 228 198 197 187 182 157 155
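The effect of setting factor levels explicitly can be seen in a small base R sketch. The predictions and the list of codes below are invented for illustration; in the tutorial the levels come from colnames(dfmat_label).

```r
# Hypothetical predictions covering only three countries
pred_toy <- factor(c("US", "GB", "US", "FR"))

# Hypothetical full set of country codes known to the dictionary
all_codes <- c("US", "GB", "FR", "JP", "BR")

# Without explicit levels, unobserved countries are simply absent
table(pred_toy)

# With explicit levels, unobserved countries are reported with a count of zero
count_toy <- sort(table(factor(pred_toy, levels = all_codes)), decreasing = TRUE)
count_toy
```

The zero counts are useful for mapping, since every country then has a value even when no document was assigned to it.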
You can visualise the distribution of global news attention using geom_map().
dat_country <- as.data.frame(count, stringsAsFactors = FALSE)
colnames(dat_country) <- c("id", "frequency")
world_map <- map_data(map = "world")
world_map$region <- iso.alpha(world_map$region) # convert country name to ISO code
ggplot(dat_country, aes(map_id = id)) +
    geom_map(aes(fill = frequency), map = world_map) +
    expand_limits(x = world_map$long, y = world_map$lat) +
    scale_fill_continuous(name = "Frequency") +
    theme_void() +
    coord_fixed()
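The data frame passed to ggplot() needs an id column whose values match the region codes in the map data. A minimal base R sketch of that conversion, using an invented count table rather than the real predictions:

```r
# Hypothetical counts, including a country with no documents
count_toy <- table(factor(c("US", "GB", "US"), levels = c("US", "GB", "JP")))

# Converting a one-dimensional table yields the columns "Var1" and "Freq";
# stringsAsFactors = FALSE keeps the country codes as character strings
dat_toy <- as.data.frame(count_toy, stringsAsFactors = FALSE)
colnames(dat_toy) <- c("id", "frequency")

dat_toy
```

Keeping id as a character vector means it can be compared directly with the ISO codes stored in world_map$region.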