Latent Semantic Scaling (LSS) is a flexible and cost-efficient semi-supervised document scaling technique. Because it relies on word embeddings, users only need to provide a small set of “seed words” to locate documents on a specific dimension.
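For example, seed words for a sentiment dimension can be written as a named numeric vector that maps each word to its polarity. The vector below is a hypothetical sketch; the analysis later in this tutorial uses the ready-made data_dictionary_sentiment instead.
# hypothetical seed words: +1 for positive, -1 for negative
seed_custom <- c("good" = 1, "excellent" = 1, "bad" = -1, "terrible" = -1)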
Install the LSX package from CRAN.
install.packages("LSX")
require(quanteda)
require(quanteda.corpora)
require(LSX)
Download a corpus with news articles using quanteda.corpora's download() function.
corp_news <- download("data_corpus_guardian")
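If download() is unavailable, any quanteda corpus with a date document variable works just as well. A minimal sketch, assuming a hypothetical news.csv with text and date columns:
# fallback: build an equivalent corpus from a local file
# "news.csv" and its column names are hypothetical; adjust them to your data
dat_news <- read.csv("news.csv", stringsAsFactors = FALSE)
dat_news$date <- as.Date(dat_news$date)
corp_news <- corpus(dat_news, text_field = "text")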
We segment the news articles in the corpus into sentences to more accurately estimate semantic proximity between words. We also use the Marimo stopwords list (source = "marimo") to remove words commonly used in news reports.
# tokenize text corpus and remove various features
corp_sent <- corpus_reshape(corp_news, to = "sentences")
toks_sent <- corp_sent %>%
    tokens(remove_punct = TRUE, remove_symbols = TRUE,
           remove_numbers = TRUE, remove_url = TRUE) %>%
    tokens_remove(stopwords("en", source = "marimo")) %>%
    tokens_remove(c("*-time", "*-timeUpdated", "GMT", "BST", "*.com"))
# create a document feature matrix from the tokens object
dfmat_sent <- toks_sent %>%
    dfm() %>%
    dfm_remove("") %>%
    dfm_trim(min_termfreq = 5)
topfeatures(dfmat_sent, 20)
##     people        new       also         us        can government       last 
##      11168       8024       7901       7090       6972       6821       6335 
##        now      years       time      first       just         uk     police 
##       5883       5839       5694       5380       5369       4874       4621 
##       like      party        get       make       made   minister 
##       4582       3890       3852       3844       3752       3680 
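Because LSS fits a truncated SVD to this matrix, it is worth checking its size before estimation:
# number of sentences and features after trimming
ndoc(dfmat_sent)
nfeat(dfmat_sent)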
We use generic sentiment seed words to perform sentiment analysis.
seed <- as.seedwords(data_dictionary_sentiment)
print(seed)
##        good        nice   excellent    positive   fortunate     correct 
##           1           1           1           1           1           1 
##    superior         bad       nasty        poor    negative unfortunate 
##           1          -1          -1          -1          -1          -1 
##       wrong    inferior 
##          -1          -1 
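data_dictionary_sentiment ships with LSX, but seed words for other dimensions can be derived the same way from any two-key quanteda dictionary: as.seedwords() assigns 1 to entries under the first key and -1 to those under the second. A hypothetical sketch for a hawkish-dovish dimension:
# hypothetical dictionary for a monetary policy dimension
dict_policy <- dictionary(list(hawkish = c("tighten*", "hike*"),
                               dovish = c("stimul*", "eas*")))
seed_policy <- as.seedwords(dict_policy)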
With the seed words, LSS computes the polarity of words that are frequent in the context of the economy. We identify these context words with char_context(pattern = "econom*") before fitting the model.
# identify context words
eco <- char_context(toks_sent, pattern = "econom*", p = 0.05)
# run LSS model
tmod_lss <- textmodel_lss(dfmat_sent, seeds = seed,
                          terms = eco, k = 300, cache = TRUE)
## Reading cache file: lss_cache/svds_d3662ce2f8b0820f.RDS
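It can be useful to inspect which words char_context() selected, since only these terms receive polarity scores from the model:
# inspect the context words selected by char_context()
length(eco)
head(eco, 20)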
head(coef(tmod_lss), 20) # most positive words
##        far   positive     status        job        use      share    quarter 
## 0.12629148 0.10050133 0.09527710 0.07910809 0.07489797 0.07164349 0.07082497 
##      every      force     behind      third    rolling      hopes   maintain 
## 0.06734511 0.06681834 0.06583539 0.06531428 0.06505486 0.06442772 0.06070326 
##   strategy    welcome       halt  currently    example     slowed 
## 0.06055693 0.06036880 0.05873444 0.05825310 0.05790878 0.05690293 
tail(coef(tmod_lss), 20) # most negative words
## uncertainty       worse         cut       taxes      blamed     reserve 
## -0.08457099 -0.08689701 -0.09193834 -0.09234093 -0.09276866 -0.09339824 
##     raising      easing         low     cutting       raise      warned 
## -0.09962980 -0.10091116 -0.10322200 -0.10499354 -0.10629547 -0.11080288 
##    interest      bubble    negative       rates        poor      things 
## -0.12025183 -0.12144500 -0.13226094 -0.13293624 -0.13458223 -0.13849767 
##         bad       wrong 
## -0.14773609 -0.20868570 
By highlighting negative words from a manually compiled sentiment dictionary (data_dictionary_LSD2015), we can confirm that many of them (but not all) have negative polarity scores in this corpus.
textplot_terms(tmod_lss, data_dictionary_LSD2015["negative"])
We reconstruct the original articles from their sentences using dfm_group() before predicting the polarity of documents.
# reconstruct original articles by grouping sentences by document
dfmat_doc <- dfm_group(dfmat_sent)
dat <- docvars(dfmat_doc)
dat$fit <- predict(tmod_lss, newdata = dfmat_doc)
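predict() returns a numeric vector of polarity scores aligned with the rows of dfmat_doc; a quick look at their distribution can catch problems before plotting:
# distribution of predicted document polarity scores
summary(dat$fit)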
We can smooth the polarity scores of documents with smooth_lss() to visualize the trend. With engine = "locfit", smoothing is fast even when there are many documents.
dat_smooth <- smooth_lss(dat, engine = "locfit")
head(dat_smooth)
##         date time        fit    se.fit
## 1 2012-01-02    0 -0.1969236 0.1260521
## 2 2012-01-03    1 -0.1972945 0.1240897
## 3 2012-01-04    2 -0.1976635 0.1221659
## 4 2012-01-05    3 -0.1980304 0.1202802
## 5 2012-01-06    4 -0.1983952 0.1184324
## 6 2012-01-07    5 -0.1987576 0.1166221
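smooth_lss() also accepts engine = "loess", which is slower but a reasonable choice for smaller datasets; a sketch under that assumption:
# alternative: loess smoothing (slower, suited to smaller datasets)
dat_smooth2 <- smooth_lss(dat, engine = "loess")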
In the plot below, the circles are the polarity scores of individual documents and the curve is their local mean with a 95% confidence interval.
plot(dat$date, dat$fit, col = rgb(0, 0, 0, 0.05), pch = 16, ylim = c(-0.5, 0.5),
     xlab = "Time", ylab = "Economic sentiment")
lines(dat_smooth$date, dat_smooth$fit, type = "l")
lines(dat_smooth$date, dat_smooth$fit + dat_smooth$se.fit * 1.96, type = "l", lty = 3)
lines(dat_smooth$date, dat_smooth$fit - dat_smooth$se.fit * 1.96, type = "l", lty = 3)
abline(h = 0, lty = 2) # dashed reference line at neutral sentiment