Implementing Naive Bayes for text classification using Quanteda - r

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.

As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
# NA out the test portion (documents 1781-2225) so only 1-1780 are used for training
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
  bbc_pred$docs$predicted[1781:2225],
  all_classes[1781:2225]
)
However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
  ifelse(docvars(bbc_corpus)$category == 'sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).
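For what it's worth, in more recent quanteda releases textmodel_nb() lives in the quanteda.textmodels package, handles more than two classes, and predict() accepts the test documents via newdata. Under those versions, a sketch of the same workflow would look roughly like:
library(quanteda.textmodels)  # textmodel_nb() moved here in recent releases
bbcNb <- textmodel_nb(bbc_dfm[1:1780, ], all_classes[1:1780])
bbc_pred <- predict(bbcNb, newdata = bbc_dfm[1781:2225, ])
table(bbc_pred, all_classes[1781:2225])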

Related

KNN in R -- All arguments must have the same length, test.X is empty

I'm trying to perform KNN in R on a dataframe, doing 3-way classification for vehicle types (car, boat, plane), using columns such as mpg and cost as features.
To start, when I run:
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
then
knn.pred
returns
factor(0) Levels: car boat plane
And
table(knn.pred,VehicleType.All)
returns
Error in table(knn.pred, VehicleType.All) :
all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
train=(DATA$Values<=200) # to train for all 200 entries including cars, boats and planes
train.X = cbind(DATA$mpg,DATA$cost)[train,]
summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
test.X = cbind(DATA$mpg,DATA$cost)[!train,]
When I try and print test.X it returns an empty matrix like so:
[,1] [,2]
Apologies for such a long question; I'm probably not including all the relevant info. If anyone has any idea what's going wrong here, or why my test.X isn't getting any data, I'd appreciate it!
Without any info on your data, it is hard to guess where the problem is. You should post a minimal reproducible example, or at least dput your data or part of it. However, here I show two methods for training a knn model, using two different packages (class and caret) with the mtcars built-in dataset.
with class
library(class)
data("mtcars")
str(mtcars)
mtcars$gear <- as.factor(mtcars$gear)
ind <- sample(1:nrow(mtcars), 20)
train.X <- mtcars[ind, ]
test.X <- mtcars[-ind, ]
train.VehicleType <- train.X[, "gear"]
VehicleType.All <- test.X[, "gear"]
# knn() needs numeric predictors, so drop the factor outcome (gear is column 10)
knn.pred <- knn(train.X[, -10], test.X[, -10], train.VehicleType, k = 3)
table(knn.pred, VehicleType.All)
with caret
library(caret)
ind <- createDataPartition(mtcars$gear, p = 0.60, list = FALSE)
train.X <- mtcars[ind, ]
test.X <- mtcars[-ind, ]
control <- trainControl(method = "cv", number = 10)
grid <- expand.grid(k = 2:10)
knn.pred <- train(gear ~ ., data = train.X, method = "knn",
                  trControl = control, tuneGrid = grid)
pred <- predict(knn.pred, test.X[, -10])
cm <- confusionMatrix(pred, test.X$gear)
The caret package allows performing cross-validation for parameter tuning during model fitting in a straightforward way. Here trainControl requests 10-fold cross-validation; if you omit the trControl argument, train defaults to a 25-rep bootstrap to find the best value of k among the values supplied in the grid object.
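If you want to see which value of k won and how the candidates compared, the fitted train object exposes that directly (bestTune, results, and the plot method are standard caret accessors):
knn.pred$bestTune   # the value of k selected by resampling
knn.pred$results    # accuracy and kappa for every k in the grid
plot(knn.pred)      # accuracy as a function of k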
From your example, it seems that your test object is empty, so the result of knn is a 0-length vector. Probably your problem is in the data reading. In any case, a better way to subset your DATA is this:
# instead of
train.X = cbind(DATA$mpg,DATA$cost)[train,]
# you should do (note !train rather than -train, since train is a logical vector):
train.X <- DATA[train, c("mpg","cost")]
test.X <- DATA[!train, c("mpg","cost")]
However, I do not understand what the variable DATA$Values is. At first I thought it was the outcome, but this line confused me a lot:
train=(DATA$Values<=200)
You can work on these examples to catch your error on your own. If you can't, post an example that reproduces your situation.

My KNN model - Vector to data.frame issue

I need your help this time. I'm working on my KNN model (I'm after the class probabilities).
predictions <- knn(x_training, x_testing, y_training, k = 5, prob = TRUE)
However, I'd like to get a dataframe out of it. When I apply the data.frame function, I get only the 0/1 predictions (whether each case is true or false) but not the probability.
It is a little difficult to understand your question; you should try to give us a reproducible example with code. It is much easier for us as a community to answer your questions efficiently if you do.
Here is an example:
library(class)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
predictions <- knn(train, test, cl, k = 3, prob=TRUE)
I believe you are running into trouble because you are trying to coerce the output of the KNN function into a dataframe. However, the KNN output has the probabilities as an attribute of the data.
So you need to access the probabilities using the attr() function. For more info type:
?attr
into the R console.
To achieve your desired outcome you need to do this:
data.frame(Value=predictions,Prob=attr(predictions,"prob"))
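To see what this gives you, note that, per ?knn, the prob attribute is the proportion of votes for the winning class, not a full per-class probability matrix:
result <- data.frame(Value = predictions, Prob = attr(predictions, "prob"))
head(result)  # one row per test case: the predicted class and its vote share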

Text Analysis in R: How to add variables to my machine learning classifier in addition to the tokens?

how to consider additional variables
I am working on a classification task using quanteda in R, and I want to include some variables for my models to consider apart from the bag of words.
For instance, I computed dictionary-based sentiment indexes, and I'd like to include these variables so that the models consider them.
These are the indexes I created, one per document:
dfneg <- cbind(negDfm1@docvars$label, negDfm1@x, posDfm@x, angDfm@x, disgDfm1@x)
colnames(dfneg) <- c("label", "neg", "pos", "ang", "disg")
dfneg <- as.data.frame(dfneg)
This is the document-feature matrix I will work with:
DFM
newsdfm <- dfm(newscorp, tolower = TRUE, stem = FALSE, remove_punct = TRUE,
               remove = stopwords("english"), verbose = TRUE)
newst <- dfm_trim(newsdfm, min_docfreq = 2, verbose = TRUE)
id_train <- sample(1:6335, 5384, replace = FALSE)
# create docvar with ID
docvars(newst, "id_numeric") <- 1:ndoc(newst)
# get training set
train <- dfm_subset(newst, id_numeric %in% id_train)
# get test set (documents not in id_train)
test <- dfm_subset(newst, !id_numeric %in% id_train)
Finally, I run a classification, for instance a Naive Bayes classifier or lasso:
NBmodel <- textmodel_nb(train, train@docvars$label)
lasso <- cv.glmnet(train, train@docvars$label,
                   family = "binomial", alpha = 1, nfolds = 10,
                   type.measure = "class")
This is what I tried after creating the dfm, but it didn't work:
newsdfm@Dimnames$features$negz <- dfneg$neg
newsdfm@Dimnames$features$posz <- dfneg$pos
newsdfm@Dimnames$features$angz <- dfneg$ang
newsdfm@Dimnames$features$disgz <- dfneg$disg
Then I thought of creating document variables before creating newsdfm:
docvars(newscorp , "negz") <- dfneg$neg
docvars(newscorp , "posz") <- dfneg$pos
docvars(newscorp , "angz") <- dfneg$ang
docvars(newscorp , "disgz") <- dfneg$disg
But at that point, I don't know how to tell the classifier to also consider these document variables in addition to the bag of words.
In summary, I expect the model to consider both the matrix with all the words per document and the indexes I created per document.
Any suggestion is highly appreciated.
Thank you in advance,
Carlo
Internally, dfms are sparse matrices, but it is better to avoid manipulating them directly if possible.
To add new features for textmodel_nb(), you need to add them to the dfm itself. As you might expect, the easiest way to do so is to cbind() them to the dfm.
In your example, you can run something like this:
additional_features <- as.matrix(dfneg[, c("neg", "pos", "ang", "disg")])
newsdfm_added <- cbind(newsdfm, additional_features)
As you can see, I first created a matrix of additional features and then ran cbind(). When you execute cbind() you will get the following warnings:
Warning messages:
1: cbinding dfms with different docnames
2: cbinding dfms with overlapping features will result in duplicated features
As this indicates, you have to make sure that the column names of the additional features do not already appear in the original dfm.
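One way to guard against such a collision (a sketch; the xtra_ prefix is just an illustrative naming choice) is to prefix the new columns and assert that they are absent from the dfm's feature set before binding:
# prefix the new feature names so they cannot clash with real tokens
colnames(additional_features) <- paste0("xtra_", colnames(additional_features))
# featnames() lists a dfm's features; stop if any new name already exists
stopifnot(!any(colnames(additional_features) %in% featnames(newsdfm)))
newsdfm_added <- cbind(newsdfm, additional_features)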

Mclust in R: How to output cluster centers

I'm currently using RStudio to do text mining on support tickets, clustering them by their description (free text). For this, I compare kmeans to the EM algorithm. I prepared the data with the tm package, and now I'm trying to apply clustering algorithms to the data matrix.
With the kmeans() function, I can use the following code snippet to output the 5 most frequent terms in the text clusters (kmeans21):
for (i in 1:num_cluster) {
  cat(paste("cluster ", i, ": ", sep = ""))
  s <- sort(kmeans21$centers[i, ], decreasing = TRUE)
  cat(names(s)[1:5], "\n")
}
Until now, I couldn't find a function to do the same within the mclust package. My data has the following format:
> bic21 <- MclustBIC(m1, G=21)
> emmodel21 <- summary(bic21, data = m1)
With the command
> emmodel21$classification
I can see the cluster for each support ticket, but is there also a way to output the most frequent terms, like in the first code block for kmeans?
I think you can try
summary(mod1, parameters = TRUE)
I just tried it on the same example from the link:
library(mclust)
data(diabetes)
X <- diabetes[,-1]
BIC <- mclustBIC(X)
mod1 <- Mclust(X, x = BIC)
summary(mod1, parameters = TRUE)
Slightly altering the first example in the vignette:
library(mclust)
data(diabetes)
X <- diabetes[,-1]
mod <- Mclust(X)  # note: the fitting function is Mclust(); mclust is the package name
means <- mod$parameters$mean
The means object is now a matrix of cluster means, with one row per variable and one column per cluster.
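To mirror the kmeans loop from the question, you could then do something like the following sketch. It assumes your document-term matrix has terms as columns, so that the rows of means are named after the terms; note that Mclust stores clusters in the columns of the mean matrix rather than the rows:
for (i in 1:ncol(means)) {
  cat(paste("cluster ", i, ": ", sep = ""))
  s <- sort(means[, i], decreasing = TRUE)  # named vector: one mean per term
  cat(names(s)[1:5], "\n")                  # the 5 terms with the highest means
}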

Understanding how to use nnet in R

This is my first attempt using a machine learning paradigm in R. I'm using a planet data set (url: https://www.kaggle.com/mrisdal/open-exoplanet-catalogue) and I simply want to predict a planet's size based on the size of its Sun. This is the code I currently have, using nnet():
library(nnet)
#Organize data:
cols_to_keep = c(1,4,21)
full_data <- na.omit(read.csv('Planet_Data.csv')[, cols_to_keep])
#Split data:
train_data <- full_data[sample(nrow(full_data), round(nrow(full_data)/2)), ]
# build the test set before resetting rownames, otherwise the %in% match breaks
test_data <- full_data[!rownames(full_data) %in% rownames(train_data), ]
rownames(train_data) <- 1:nrow(train_data)
rownames(test_data) <- 1:nrow(test_data)
#nnet
nnet_attempt <- nnet(RadiusJpt ~ HostStarRadiusSlrRad, data = train_data,
                     size = 0, linout = TRUE, skip = TRUE, maxNWts = 10000,
                     trace = FALSE, maxit = 1000, decay = .001)
nnet_newdata <- predict(nnet_attempt, newdata=test_data)
nnet_newdata
When I print nnet_newdata I get a value for each row in my data, but I don't really understand what these values mean. Is this a proper way to use the nnet package for a simple regression?
Thanks
When predict is called on an object of class nnet, you will get, by default, the raw output of the nnet model applied to your new dataset. If, instead, yours is a classification problem, you can use type = "class".
See ?predict.nnet for details.
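Since your model uses linout = TRUE (a linear output unit), those raw values are the predicted RadiusJpt for each test row, so this is a valid way to fit a simple regression with nnet. As a quick sanity check, here is a sketch reusing the objects from your code:
# observed vs. predicted radii for the test set
results <- data.frame(observed  = test_data$RadiusJpt,
                      predicted = as.vector(nnet_newdata))
head(results)
# root mean squared error as a single-number quality summary
sqrt(mean((results$observed - results$predicted)^2))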
