Latent Semantic Scaling (LSS) is a flexible and cost-efficient semi-supervised document scaling technique. Because it relies on word embeddings, users only need to provide a small set of “seed words” to locate documents on a specific dimension.
Install the LSX package from CRAN.
install.packages("LSX")
require(quanteda)
require(quanteda.corpora)
require(LSX)
Download a corpus of news articles using quanteda.corpora's download() function.
corp_news <- download("data_corpus_guardian")
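To confirm what has been downloaded, we can check the number of documents and the range of publication dates; the exact values depend on the corpus version.
# inspect the downloaded corpus
ndoc(corp_news)
range(docvars(corp_news, "date"))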
We must segment the news articles in the corpus into sentences to accurately estimate semantic proximity between words. We can also remove words commonly used in news reports with the Marimo stopwords list (source = "marimo").
# reshape the corpus into sentences
corp_sent <- corpus_reshape(corp_news, to = "sentences")
# tokenize the corpus and remove various features
toks_sent <- corp_sent %>%
    tokens(remove_punct = TRUE, remove_symbols = TRUE,
           remove_numbers = TRUE, remove_url = TRUE) %>%
    tokens_remove(stopwords("en", source = "marimo")) %>%
    tokens_remove(c("*-time", "*-timeUpdated", "GMT", "BST", "*.com"))
# create a document-feature matrix from the tokens object
dfmat_sent <- toks_sent %>%
    dfm() %>%
    dfm_remove(pattern = "") %>%
    dfm_trim(min_termfreq = 5)
topfeatures(dfmat_sent, 20)
##     people        new       also         us        can government       last
##      11168       8024       7901       7090       6972       6821       6335
##        now      years       time      first       just         uk     police
##       5883       5839       5694       5380       5369       4874       4621
##       like      party        get       make       made   minister
##       4582       3890       3852       3844       3752       3680
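It may also help to check the size of the trimmed dfm before fitting the model; ndoc() and nfeat() report the number of sentences and remaining features (this check is not part of the original output).
# size of the trimmed document-feature matrix
ndoc(dfmat_sent)
nfeat(dfmat_sent)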
We will use generic sentiment seed words to perform sentiment analysis.
seed <- as.seedwords(data_dictionary_sentiment)
print(seed)
##        good        nice   excellent    positive   fortunate     correct
##           1           1           1           1           1           1
##    superior         bad       nasty        poor    negative unfortunate
##          -1          -1          -1          -1          -1          -1
##       wrong    inferior
##          -1          -1
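as.seedwords() simply converts the dictionary into a named numeric vector of polarity scores, so domain-specific seed words can be supplied in the same form. A minimal sketch with hypothetical seed words (not used in the rest of this example):
# hypothetical custom seed words: a named numeric vector of polarity scores
seed_custom <- c("growth" = 1, "recovery" = 1, "boom" = 1,
                 "recession" = -1, "crisis" = -1, "slump" = -1)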
With the seed words, LSS computes the polarity of words that occur frequently in the context of the economy. We can identify these context words with char_context(pattern = "econom*") before fitting the model.
# identify context words
eco <- char_context(toks_sent, pattern = "econom*", p = 0.05)
# run LSS model
tmod_lss <- textmodel_lss(dfmat_sent, seeds = seed,
                          terms = eco, k = 300, cache = TRUE)
## Reading cache file: lss_cache/svds_402bda98dac8a560.RDS
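Before interpreting the coefficients, we can peek at the context words that char_context() identified; the exact words depend on the corpus.
# words that co-occur with "econom*" more often than expected by chance
head(eco, 20)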
head(coef(tmod_lss), 20) # most positive words
##        good      status    positive opportunity     success      energy
##  0.12822551  0.09776482  0.09765251  0.09381356  0.07952077  0.07830681
##     quarter       third       model    strategy     rolling      slowed
##  0.07211174  0.06657593  0.06474845  0.06424124  0.06345038  0.06302725
##     welcome      fourth    maintain         key      points    regional
##  0.05989433  0.05975046  0.05920437  0.05887532  0.05830341  0.05798655
##      polled        halt
##  0.05797763  0.05778025
tail(coef(tmod_lss), 20) # most negative words
##        worse policymakers  uncertainty          low      raising        taxes
##  -0.08342634  -0.08379120  -0.08644833  -0.08895333  -0.09099723  -0.09112566
##      reserve          cut        raise       easing       blamed       caused
##  -0.09180054  -0.09293358  -0.09296114  -0.09337542  -0.09574268  -0.09663011
##      cutting     interest       warned       bubble        rates     negative
##  -0.09668367  -0.10814302  -0.11510122  -0.11667443  -0.11899036  -0.12778956
##         poor          bad
##  -0.13595186  -0.14532555
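Since coef() returns a named numeric vector, the polarity of specific terms can also be looked up directly; the terms below are hypothetical examples and return NA if they were not retained as model terms.
# polarity of selected (hypothetical) terms
coef(tmod_lss)[c("growth", "unemployment")]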
By highlighting negative words in a manually compiled sentiment dictionary (data_dictionary_LSD2015), we can confirm that many of the words (but not all of them) have negative meanings in the corpus.
textplot_terms(tmod_lss, data_dictionary_LSD2015["negative"])
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
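The warning comes from ggrepel, which textplot_terms() uses for labelling. If every highlighted word should be labelled, one option is to raise ggrepel's global limit before plotting (a sketch, assuming a recent ggrepel version).
# allow ggrepel to label all points, at the cost of a more crowded plot
options(ggrepel.max.overlaps = Inf)
textplot_terms(tmod_lss, data_dictionary_LSD2015["negative"])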
We must reconstruct the original articles from their sentences using dfm_group() before predicting the polarity of documents.
# group sentences back into their original articles
dfmat_doc <- dfm_group(dfmat_sent)
dat <- docvars(dfmat_doc)
# predict document-level polarity scores
dat$fit <- predict(tmod_lss, newdata = dfmat_doc)
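A quick look at the distribution of the predicted scores (not part of the original output) shows the range of document-level polarity.
# distribution of document polarity scores
summary(dat$fit)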
We can smooth the polarity scores of documents with smooth_lss() to visualize the trend over time. If engine = "locfit", smoothing is very fast even when there are many documents.
dat_smooth <- smooth_lss(dat, engine = "locfit")
head(dat_smooth)
## date time fit se.fit
## 1 2012-01-02 0 -0.2941611 0.1258069
## 2 2012-01-03 1 -0.2933006 0.1238591
## 3 2012-01-04 2 -0.2924451 0.1219494
## 4 2012-01-05 3 -0.2915943 0.1200775
## 5 2012-01-06 4 -0.2907481 0.1182429
## 6 2012-01-07 5 -0.2899064 0.1164455
In the plot below, the circles are the polarity scores of individual documents, and the curve is their local mean with a 95% confidence interval.
# plot document scores with the smoothed trend and 95% confidence bands
plot(dat$date, dat$fit, col = rgb(0, 0, 0, 0.05), pch = 16, ylim = c(-0.5, 0.5),
     xlab = "Time", ylab = "Economic sentiment")
lines(dat_smooth$date, dat_smooth$fit, type = "l")
lines(dat_smooth$date, dat_smooth$fit + dat_smooth$se.fit * 1.96, type = "l", lty = 3)
lines(dat_smooth$date, dat_smooth$fit - dat_smooth$se.fit * 1.96, type = "l", lty = 3)
abline(h = 0, lty = c(1, 2))
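The same figure can also be sketched with ggplot2 for those who prefer it (assuming ggplot2 is installed; this is not part of the original tutorial).
library(ggplot2)
# circles: document scores; line and ribbon: smoothed mean with 95% confidence band
ggplot() +
  geom_point(data = dat, aes(x = date, y = fit), alpha = 0.05) +
  geom_ribbon(data = dat_smooth,
              aes(x = date, ymin = fit - 1.96 * se.fit, ymax = fit + 1.96 * se.fit),
              alpha = 0.2) +
  geom_line(data = dat_smooth, aes(x = date, y = fit)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_cartesian(ylim = c(-0.5, 0.5)) +
  labs(x = "Time", y = "Economic sentiment")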