Naive Bayes is a supervised model typically used to classify documents into two or more categories. We train the classifier using class labels attached to documents, and predict the most likely class(es) of new unlabeled documents.
require(quanteda)
require(quanteda.textmodels)
require(caret)
data_corpus_moviereviews from the quanteda.textmodels package contains 2000 movie reviews classified either as “positive” or “negative”.
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
## Corpus consisting of 2000 documents, showing 5 documents:
##
## Text Types Tokens Sentences sentiment id1 id2
## cv000_29416.txt 354 841 9 neg cv000 29416
## cv001_19502.txt 156 278 1 neg cv001 19502
## cv002_17424.txt 276 553 3 neg cv002 17424
## cv003_12683.txt 313 555 2 neg cv003 12683
## cv004_12641.txt 380 841 2 neg cv004 12641
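Before modelling, it is worth confirming the class balance; the corpus contains 1000 negative and 1000 positive reviews, which a simple tabulation of the sentiment docvar verifies.
# tabulate the sentiment docvar: 1000 negative and 1000 positive reviews
table(corp_movies$sentiment)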
The docvar “sentiment” indicates whether a movie review was classified as positive or negative. In this example, we will use 1500 reviews as the training set and build a Naive Bayes classifier based on this subset. In the second step, we will predict the sentiment of the remaining 500 reviews (our test set).
Since the first 1000 reviews are negative and the remaining reviews are classified as positive, we need to draw a random sample of the documents.
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
## [1] 590 874 1602 985 1692 789 553 1980 1875 1705
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_remove(pattern = stopwords("en")) %>%
tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
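As a quick sanity check (not strictly necessary), we can confirm that the split has the expected sizes: 1500 training documents and 500 test documents.
# number of documents in each subset
ndoc(dfmat_training) # should be 1500
ndoc(dfmat_test) # should be 500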
Next, we will train the Naive Bayes classifier using textmodel_nb().
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
##
## Call:
## textmodel_nb.dfm(x = dfmat_training, y = dfmat_training$sentiment)
##
## Class Priors:
## (showing first 2 elements)
## neg pos
## 0.5 0.5
##
## Estimated Feature Scores:
## plot two teen coupl go church parti
## neg 0.002579 0.002318 0.0002870 0.0007157 0.002663 8.719e-05 0.0002652
## pos 0.001507 0.002338 0.0001656 0.0005456 0.002348 8.768e-05 0.0002728
## drink drive get accid one guy die
## neg 1.199e-04 0.0003052 0.004486 9.445e-05 0.007389 0.0014458 0.0005485
## pos 9.417e-05 0.0002630 0.003783 1.851e-04 0.007355 0.0009937 0.0005488
## girlfriend continu see life nightmar deal watch
## neg 0.0003124 0.0003161 0.002557 0.001435 0.0001199 0.0004323 0.001642
## pos 0.0002338 0.0003215 0.003020 0.002497 0.0001202 0.0005196 0.001539
## movi sorta find critiqu mind-fuck generat touch
## neg 0.010117 1.090e-05 0.001453 9.445e-05 3.633e-06 0.0002652 0.0002289
## pos 0.007657 1.624e-05 0.001630 8.443e-05 3.247e-06 0.0002923 0.0004449
## cool idea
## neg 0.0003052 0.0008210
## pos 0.0002273 0.0005845
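To get a feel for what the model has learned, we can compare the estimated feature scores between the two classes. This sketch assumes that the fitted object stores these scores in its param element with classes in rows, as quanteda.textmodels does; stems with a ratio far above 1 are strongly associated with positive reviews.
# ratio of class-conditional feature probabilities; large values mark
# stems much more likely to appear in positive than in negative reviews
ratio_pos <- tmod_nb$param["pos", ] / tmod_nb$param["neg", ]
head(sort(ratio_pos, decreasing = TRUE), 10)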
Naive Bayes can only take into consideration features that occur in both the training set and the test set, but we can make the two feature sets identical using dfm_match().
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
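A quick check that the matching worked: the matched test dfm should now report exactly the same feature names, in the same order, as the training dfm.
# verify that the feature sets are now identical
identical(featnames(dfmat_matched), featnames(dfmat_training)) # TRUE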
Let’s inspect how well the classification worked.
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
## predicted_class
## actual_class neg pos
## neg 213 45
## pos 37 205
From the cross-table we can see that the numbers of false positives and false negatives are similar. The classifier made mistakes in both directions, but does not seem to systematically over- or under-predict one class.
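The overall accuracy can be read directly off this table, since the correctly classified documents sit on the diagonal.
# share of correctly classified documents: (213 + 205) / 500
sum(diag(tab_class)) / sum(tab_class) # 0.836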
We can use the confusionMatrix() function from the caret package to assess the performance of the classification.
confusionMatrix(tab_class, mode = "everything", positive = "pos")
## Confusion Matrix and Statistics
##
## predicted_class
## actual_class neg pos
## neg 213 45
## pos 37 205
##
## Accuracy : 0.836
## 95% CI : (0.8006, 0.8674)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.672
##
## Mcnemar's Test P-Value : 0.4395
##
## Sensitivity : 0.8200
## Specificity : 0.8520
## Pos Pred Value : 0.8471
## Neg Pred Value : 0.8256
## Precision : 0.8471
## Recall : 0.8200
## F1 : 0.8333
## Prevalence : 0.5000
## Detection Rate : 0.4100
## Detection Prevalence : 0.4840
## Balanced Accuracy : 0.8360
##
## 'Positive' Class : pos
##
Precision, recall and the F1 score are frequently used to assess classification performance. Precision is measured as TP / (TP + FP), where TP is the number of true positives and FP the number of false positives. Recall divides the true positives by the sum of true positives and false negatives: TP / (TP + FN). Finally, the F1 score is the harmonic mean of precision and recall: 2 * (Precision * Recall) / (Precision + Recall).
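As a worked check of the last formula, plugging the precision and recall reported above into the harmonic mean reproduces the F1 score.
# F1 as the harmonic mean of the reported precision and recall
precision <- 0.8471
recall <- 0.8200
2 * (precision * recall) / (precision + recall) # approximately 0.8333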