Perform feature selection over a document-term matrix in R

I have a matrix with 99,814 items containing reviews and their respective polarities (positive or negative). Before passing it to a model, I want to run feature selection over the terms of the corpus and keep only those that best discriminate between the two polarities.
The problem is I am currently working with 16,554 terms, so trying to transform the document-term matrix into a sparse matrix so I can apply something like chi-squared to the terms is getting me a "Cholmod error out of memory" message.
So my question is: is there any feasible way I can get the chi-squared value of all terms with the matrix in its more "memory efficient" format? Or am I out of luck?
Here's some sample code that should give one an idea of what I am trying to do. I am using the text2vec library to do the transformation on the text.
library(text2vec)
review_matrix <- data.frame(id=c(1,2,3),
review=c('This review is negative',
'This review is positive',
'This review is positive'),
sentiment=c('Negative', 'Positive', 'Positive'))
tokenizer <- word_tokenizer
tokens <- tokenizer(review_matrix$review)
iterator <- itoken(tokens,
ids = review_matrix$id,
progressbar = FALSE)
vocabulary <- create_vocabulary(iterator)
vectorizer <- vocab_vectorizer(vocabulary)
document_term_matrix <- create_dtm(iterator, vectorizer)
model_tf_idf <- TfIdf$new()
document_term_matrix <- model_tf_idf$fit_transform(document_term_matrix)
# This is where I am trying to do the chisq.test
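One way to score every term without converting the matrix to a dense one is to build the 2x2 contingency counts (term present or absent, by class) from sparse column sums and apply the closed-form Pearson chi-squared formula. The following is a minimal sketch under stated assumptions, not a text2vec feature: the helper name chi2_per_term, the toy matrix, and the labels vector are hypothetical, terms are reduced to presence or absence (so tf-idf weights are ignored), and no continuity correction is applied.

```r
library(Matrix)

# Toy sparse document-term matrix: 4 documents x 3 terms (hypothetical data).
dtm <- sparseMatrix(
  i = c(1, 2, 3, 4, 1, 3),
  j = c(1, 1, 2, 2, 3, 3),
  x = 1,
  dims = c(4, 3),
  dimnames = list(NULL, c("bad", "good", "review"))
)
labels <- c("Negative", "Negative", "Positive", "Positive")

# Hypothetical helper: chi-squared score of each term against a binary label.
chi2_per_term <- function(dtm, labels, positive = "Positive") {
  present <- dtm > 0                 # stays sparse (logical sparse matrix)
  is_pos  <- labels == positive
  n       <- nrow(dtm)
  n_pos   <- sum(is_pos)
  n_neg   <- n - n_pos

  # Cells of the 2x2 table for every term at once, as doubles to avoid overflow.
  n11 <- as.numeric(colSums(present[is_pos, , drop = FALSE]))   # present, positive
  n10 <- as.numeric(colSums(present[!is_pos, , drop = FALSE]))  # present, negative
  n01 <- n_pos - n11                                            # absent, positive
  n00 <- n_neg - n10                                            # absent, negative

  num <- n * (n11 * n00 - n10 * n01)^2
  den <- (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
  scores <- ifelse(den == 0, 0, num / den)
  setNames(scores, colnames(dtm))
}

scores <- chi2_per_term(dtm, labels)
print(sort(scores, decreasing = TRUE))
```

The highest-scoring terms can then be kept by subsetting the columns, for example dtm[, order(scores, decreasing = TRUE)[1:k]] for some chosen k; the matrix stays sparse throughout.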

Related

KNN in R -- All arguments must have the same length, test.X is empty

I'm trying to perform KNN in R on a dataframe, following 3-way classification for vehicle types (car, boat, plane), using columns such as mpg, cost as features.
To start, when I run:
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
then
knn.pred
returns
factor(0) Levels: car boat plane
And
table(knn.pred,VehicleType.All)
returns
Error in table(knn.pred, VehicleType.All) :
all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
train=(DATA$Values<=200) # to train for all 200 entries including cars, boats and planes
train.X = cbind(DATA$mpg,DATA$cost)[train,]
summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
test.X = cbind(DATA$mpg,DATA$cost)[!train,]
When I try and print test.X it returns an empty matrix like so:
[,1] [,2]
Apologies for such a long question and I'm probably not including all relevant info. If anyone has any idea what's going wrong here or why my test.X isn't loading through any data I'd appreciate it!
Without any information about your data, it is hard to guess where the problem is. You should post a minimal reproducible example, or at least dput your data or part of it. In the meantime, here are two ways of training a knn model with the built-in mtcars dataset, using two different packages (class and caret).
with class
library(class)
data("mtcars")
str(mtcars)
mtcars$gear <- as.factor(mtcars$gear)
ind <- sample(1:nrow(mtcars),20)
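# Drop the label column so that gear is not used as a feature
train.X <- mtcars[ind, names(mtcars) != "gear"]
test.X <- mtcars[-ind, names(mtcars) != "gear"]
train.VehicleType <- mtcars[ind, "gear"]
VehicleType.All <- mtcars[-ind, "gear"]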
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
table(knn.pred,VehicleType.All)
with caret
library(caret)
ind <- createDataPartition(mtcars$gear,p=0.60,list=F)
train.X <- mtcars[ind,]
test.X <- mtcars[-ind,]
control <-trainControl(method = "cv",number = 10)
grid <- expand.grid(k=2:10)
knn.pred <- train(gear~.,data=train.X,method="knn",tuneGrid=grid)
pred <- predict(knn.pred,test.X[,-10])
cm <- confusionMatrix(pred,test.X$gear)
The caret package makes it straightforward to tune parameters by resampling during model fitting. By default, train uses 25 bootstrap resamples to pick the best value of k among the values supplied in the grid object. Note that the control object defined above is not used unless it is passed as trControl = control, in which case train performs 10-fold cross-validation instead.
From your example, it seems that your test object is empty so the result of knn is a 0-length vector. Probably your problem is in the data reading. However, a better way to subset your DATA can be this:
# instead of
train.X = cbind(DATA$mpg,DATA$cost)[train,]
# you should do (train is a logical vector, so negate it with ! rather than -):
train.X <- DATA[train,c("mpg","cost")]
test.X <- DATA[!train,c("mpg","cost")]
However, I do not understand what the variable DATA$Values is. At first I thought it was the outcome, but this line confused me a lot:
train=(DATA$Values<=200)
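One possible explanation, sketched below with made-up data: if DATA$Values happens to be a counter running from 1 to 200 (this is only a guess, since the real data was not posted), the condition is TRUE for every row, so negating it selects nothing and test.X ends up empty. A random split by row index avoids the problem. The column values here are hypothetical.

```r
# Hypothetical data frame standing in for DATA; Values is assumed to be 1..200.
DATA <- data.frame(
  Values = 1:200,
  mpg    = seq(10, 50, length.out = 200),
  cost   = seq(1000, 5000, length.out = 200)
)

# The mask from the question: TRUE for every row, so the complement is empty.
train  <- DATA$Values <= 200
test.X <- DATA[!train, c("mpg", "cost")]
print(nrow(test.X))

# A random 75/25 split by row index instead.
set.seed(1)
train_idx <- sample(nrow(DATA), 150)
train.X2  <- DATA[train_idx, c("mpg", "cost")]
test.X2   <- DATA[-train_idx, c("mpg", "cost")]
print(c(nrow(train.X2), nrow(test.X2)))
```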
You can work through these examples to find the error on your own. If you can't, post an example that reproduces your situation.

Stock price prediction based on financial news in R with SVM

I'm new to R and trying to predict the S&P 500 stock price based on financial news with the help of support vector machines (SVM). I have two datasets: one is the stock market data and the other is the cleaned financial news corpus. I converted the corpus into a document-term matrix and also applied sentiment analysis to it (once with the SentimentAnalysis package and once with the tidytext package). Now I'm struggling to get the model running. I've found different approaches to predicting stock prices with SVM, but none that use financial news. How can I combine the two datasets to create the model? My current code looks like this:
docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
# Cleaning steps are not shown here
# Creating DTM
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Sentiment analysis DTM
dtm.sent <- analyzeSentiment(dtm)
# Creating DTM Tidy Format
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Sentiment analysis Tidy DTM
sent.afinn <- dtm.tidy %>%
inner_join(get_sentiments("afinn"), by = c(term = "word"))
sent.bing <- dtm.tidy %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
sent.nrc <- dtm.tidy %>%
inner_join(get_sentiments("nrc"), by = c(term = "word"))
# Data split
id_dtm <- sample(nrow(dtm),nrow(dtm)*0.70)
dtm.train = dtm[id_dtm,]
dtm.test = dtm[-id_dtm,]
id_sp500 <- sample(nrow(SP500.Data),nrow(SP500.Data)*0.70)
sp500.train = SP500.Data[id_sp500,]
sp500.test = SP500.Data[-id_sp500,]
That is where I stand. Now I would like to fit the SVM model on the two datasets described above, but I think I need to create class labels first: I have seen examples that use -1 / +1 or something similar. My sentiment analysis sorts terms into positive and negative classes, but I don't know how to put the two datasets together to build the model. I would be very grateful for any help. Thanks in advance!
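A minimal sketch of one way the two datasets could be joined, under several assumptions that are not in the question: each news document has a date and a single sentiment score, the price table has date and close columns, and the target is the direction of the next day's close (+1 up, -1 down). All column names and values below are hypothetical. Note that sampling the two tables independently, as in the split above, breaks the row alignment between news and prices, so the sketch merges by date first and then splits chronologically. The SVM step uses e1071::svm and only runs if that package is installed.

```r
# Hypothetical per-document sentiment scores with filing dates.
news <- data.frame(
  date = as.Date(c("2020-01-02", "2020-01-02", "2020-01-03",
                   "2020-01-06", "2020-01-07", "2020-01-08")),
  sentiment = c(0.4, 0.2, -0.3, 0.1, -0.5, 0.6)
)
# Hypothetical daily closing prices.
prices <- data.frame(
  date = as.Date(c("2020-01-02", "2020-01-03", "2020-01-06",
                   "2020-01-07", "2020-01-08", "2020-01-09")),
  close = c(100, 101, 99, 100, 98, 102)
)

# 1. Collapse the news to one sentiment value per day.
daily <- aggregate(sentiment ~ date, data = news, FUN = mean)

# 2. Label each day by the direction of the next day's close.
prices <- prices[order(prices$date), ]
next_close <- c(prices$close[-1], NA)
prices$direction <- factor(ifelse(next_close > prices$close, 1, -1),
                           levels = c(-1, 1))

# 3. Join features and labels on the date, dropping days without a label.
model_data <- merge(daily, prices[, c("date", "direction")], by = "date")
model_data <- model_data[!is.na(model_data$direction), ]

# 4. Chronological split: earlier days for training, later days for testing.
n_train <- floor(0.7 * nrow(model_data))
train <- model_data[seq_len(n_train), ]
test  <- model_data[-seq_len(n_train), ]

# 5. Fit and evaluate the SVM if e1071 is available.
if (requireNamespace("e1071", quietly = TRUE)) {
  fit  <- e1071::svm(direction ~ sentiment, data = train, kernel = "linear")
  pred <- predict(fit, newdata = test)
  print(table(predicted = pred, actual = test$direction))
}
```

With real data, the single sentiment column would be replaced by whatever per-day features are derived from the document-term matrix or the sentiment lexicons.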

Text Analysis in R: How to add variables to my machine learning classifier in addition to the tokens?

how to consider additional variables
I am working on a classification task using quanteda in R and I want to include some variables to be considered by my models apart from the bag of words.
for instance, I computed dictionary-based sentiment indexes and I'd like to include these variables so that the models consider them.
these are the indexes I created, for each document.
dfneg <- cbind(negDfm1@docvars$label , negDfm1@x ,posDfm@x , angDfm@x ,
disgDfm1@x)
colnames(dfneg) <- c("label","neg" , "pos" , "ang" , "disg")
dfneg <- as.data.frame(dfneg)
this is the document features matrix I will work with:
DFM
newsdfm <- dfm(newscorp, tolower = TRUE , stem = FALSE , remove_punct =
TRUE, remove = stopwords("english"),verbose=TRUE)
newst<- dfm_trim(newsdfm , min_docfreq=2 , verbose=TRUE)
id_train <- sample(1:6335, 5384, replace = FALSE)
# create docvar with ID
docvars(newst, "id_numeric") <- 1:ndoc(newst)
# get training set
train <- dfm_subset(newst, id_numeric %in% id_train)
# get test set (documents not in id_train)
test <- dfm_subset(newst, !id_numeric %in% id_train)
finally, I run a classification, for instance, a Naive Bayes classifier or lasso
Naive Bayes classifier or lasso
NBmodel <- textmodel_nb(train , train@docvars$label)
lasso <- cv.glmnet(train, train@docvars$label,
family="binomial", alpha=1, nfolds=10,
type.measure="class")
this is what I tried after creating the dfm, but it didn't work
newsdfm@Dimnames$features$negz <- dfneg$neg
newsdfm@Dimnames$features$posz <- dfneg$pos
newsdfm@Dimnames$features$angz <- dfneg$ang
newsdfm@Dimnames$features$disgz <- dfneg$disg
then I thought of creating document variables before creating newsdfm
docvars(newscorp , "negz") <- dfneg$neg
docvars(newscorp , "posz") <- dfneg$pos
docvars(newscorp , "angz") <- dfneg$ang
docvars(newscorp , "disgz") <- dfneg$disg
but at that point, I don't know how to tell the classifier that I want it to consider also these document variables in addition to the bag of words.
In summary, I expect the model to consider both the matrix with all the words per each document and the indexes I created per each document.
any suggestion is highly appreciated
thank you in advance,
Carlo
Internally, dfms are sparse matrices, but it is better to avoid manipulating their slots directly if possible.
To add new features for textmodel_nb(), you need to add them to the dfm itself. As you might expect, the easiest way to do so is to cbind() them to the dfm.
In your example, you can run something like this:
additional_features <- dfneg[, c("neg", "pos", "ang", "disg")] %>% as.matrix()
newsdfm_added <- cbind(newsdfm, additional_features)
As you can see, I first created a matrix of additional features and then ran cbind(). When you execute cbind(), you will get the following warnings:
Warning messages:
1: cbinding dfms with different docnames
2: cbinding dfms with overlapping features will result in duplicated features
As the warnings indicate, you have to make sure that the document names match and that the column names of the additional features do not already occur as features in the original dfm.
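The two checks behind those warnings can be sketched with plain sparse matrices from the Matrix package, since that is what a dfm is underneath. This is an illustration under assumptions, not quanteda code: the toy matrices, the document names, and the "_idx" suffix are all hypothetical. The idea is to rename any index column whose name clashes with an existing term, and to verify that the rows are in the same document order before binding.

```r
library(Matrix)

# Hypothetical bag-of-words counts: 3 documents x 3 terms ("pos" is also a term).
bag <- sparseMatrix(
  i = c(1, 2, 3), j = c(1, 2, 3), x = c(2, 1, 1),
  dims = c(3, 3),
  dimnames = list(c("d1", "d2", "d3"), c("good", "bad", "pos"))
)

# Hypothetical sentiment indexes per document; "pos" clashes with a term name.
extra <- matrix(
  c(0.1, 0.5, 0.9,
    0.7, 0.2, 0.0),
  ncol = 2,
  dimnames = list(c("d1", "d2", "d3"), c("neg", "pos"))
)

# Rename clashing columns so that no feature name appears twice.
is_clash <- colnames(extra) %in% colnames(bag)
colnames(extra)[is_clash] <- paste0(colnames(extra)[is_clash], "_idx")

# The rows must refer to the same documents in the same order.
stopifnot(identical(rownames(bag), rownames(extra)))

combined <- cbind(bag, Matrix(extra, sparse = TRUE))
print(colnames(combined))
```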

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (i.e. business, entertainment) of an article based on type.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
As a stylistic note, you don't need to separately load the labels/classes/categories, the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
bbc_pred$docs$predicted[1781:2225],
all_classes[1781:2225]
)
However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).

Review star rating - prediction in R

I have a dataset of reviews that have the following structure:
{
"reviewerID": "XXXX",
"asin": "12345XXX",
"reviewerName": "Paul",
"helpful": [2, 5],
"reviewText": "Nice product, works as it should.",
"overall": 5.0,
"summary": "Nice product",
"unixReviewTime": 1152700000,
"reviewTime": "08 14, 2010"
}
I have got a bunch of reviews and would like to create a forecast based on the text of the review ("reviewText") using some text mining techniques.
I would like to train a classifier and then have an accuracy measure how well the system works. The rating of each review is given ("overall").
So far I did the following:
Required packages (not all are required)
library(plyr)
library(rjson)
library(magrittr)
library(lubridate)
library(stringi)
library(doSNOW)
library(tm)
library(NLP)
library(wordcloud)
library(SnowballC)
library(rpart)
The input data is available in JSON format:
[Image: sample input table]
Out of this table reviewTexts are converted to a corpus.
Create a corpus and apply some pre-processing steps
corpus <- Corpus(VectorSource(tr.review.asin$reviewText))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
Making a document term matrix
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
Creating a training and test set
dtmsparse <- as.data.frame(as.matrix(dtm))
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Creating a model
train$overall <- tr.review.asin[1:6500,]$overall
model <- rpart(overall ~., data = train, method= 'class')
mypred <- predict(model, newdata =test, type = 'class')
When I am plotting obs_test and mypred I am getting the following plot:
[Image: plot of obs_test and mypred]
Unfortunately I don't have an idea if the last lines will guide me to a solution.
I would like to have a procedure where I can test how well my model is forecasting (comparison between real overall rating and predicted rating).
it completely slipped my attention that you are actually dealing with a classification problem and not with regression, hence a complete edit.
to see how well a classification tree performs, one would want to know how many instances (in the test data) were misclassified, i.e. the assigned class was not the same as the observed class. it is also informative to see how well the prediction model works on each individual class. using the confusionMatrix function from the caret package you can do the following:
install.packages("caret")
library(caret)
mypred <- predict(model, newdata =test, type = 'class')
# confusionMatrix needs two factors with the same levels
obs <- factor(tr.review.asin[6501:7561,]$overall, levels = levels(mypred))
confusionMatrix(obs, mypred)
you will get a confusion matrix and some statistics as output. the confusion matrix tells you for how many instances predictions and observations coincide in each class -- these are the values on the diagonal. in general, the ijth entry of the matrix tells you how many instances were classified as j while the real class was i.
in the Overall Statistics section of the confusionMatrix output you will see Accuracy -- this is the percentage of the instances in the test set that were classified correctly.
next, in the Statistics by Class section, the row named Pos Pred Value tells you what percentage of observations in class x were classified correctly. the function outputs a number of other useful statistics, which you can read up on online, for example here or here.
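to make these quantities concrete, here is a small base R sketch that builds the same kind of table by hand, with observed classes in rows and predicted classes in columns. the two vectors are made-up ratings, not output from the model above.

```r
# Hypothetical observed and predicted star ratings for 8 test reviews.
obs  <- factor(c(5, 5, 4, 3, 5, 4, 1, 5), levels = 1:5)
pred <- factor(c(5, 4, 4, 5, 5, 4, 5, 5), levels = 1:5)

# Rows are observed classes (i), columns are predicted classes (j).
cm <- table(observed = obs, predicted = pred)
print(cm)

# Accuracy: share of test instances on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each observed class that was predicted correctly
# (NaN for classes that never occur in the observed data).
per_class <- diag(cm) / rowSums(cm)

print(accuracy)
print(per_class)
```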
i hope this helps.

Resources