Review star rating prediction in R

I have a dataset of reviews that have the following structure:
{
"reviewerID": "XXXX",
"asin": "12345XXX",
"reviewerName": "Paul",
"helpful": [2, 5],
"reviewText": "Nice product, works as it should.",
"overall": 5.0,
"summary": "Nice product",
"unixReviewTime": 1152700000,
"reviewTime": "08 14, 2010"
}
I have a set of reviews and would like to predict the rating from the text of the review ("reviewText") using some text mining techniques.
I would like to train a classifier and then have an accuracy measure of how well the system works. The rating of each review is given ("overall").
So far I have done the following:
Packages loaded (not all of them end up being used)
library(plyr)
library(rjson)
library(magrittr)
library(lubridate)
library(stringi)
library(doSNOW)
library(tm)
library(NLP)
library(wordcloud)
library(SnowballC)
library(rpart)
The input data is available in JSON format:
[Sample input table omitted]
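Not shown in the original post, but as a minimal sketch, a data frame such as tr.review.asin (used below) could be read in like this, assuming the reviews are stored as one JSON object per line (the file name reviews.json is made up):
library(jsonlite)  # an alternative to the rjson package loaded above
tr.review.asin <- jsonlite::stream_in(file("reviews.json"), verbose = FALSE)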
The reviewText column of this table is then converted to a corpus.
Create a corpus and apply some pre-processing steps
corpus <- Corpus(VectorSource(tr.review.asin$reviewText))
# content_transformer() keeps the documents as PlainTextDocument objects,
# so a separate tm_map(corpus, PlainTextDocument) step is not needed
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case the text
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords('english'))  # drop English stopwords
corpus <- tm_map(corpus, stemDocument)                       # stem the remaining terms
Making a document term matrix
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
Creating a training and test set
dtmsparse <- as.data.frame(as.matrix(dtm))
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Creating a model
train$overall <- tr.review.asin[1:6500,]$overall
model <- rpart(overall ~., data = train, method= 'class')
mypred <- predict(model, newdata =test, type = 'class')
When I plot obs_test and mypred I get the following plot:
[Plot of obs_test vs. mypred omitted]
Unfortunately I have no idea whether these last lines are leading me towards a solution.
I would like a procedure to test how well my model predicts, i.e. a comparison between the real overall rating and the predicted rating.

It completely slipped my attention that you are actually dealing with a classification problem and not with regression, hence a complete edit.
To see how well a classification tree performs, you want to know how many instances in the test data were misclassified, i.e. for how many the assigned class was not the same as the observed class. It is also informative to see how well the prediction model works on each individual class. Using the confusionMatrix function from the caret package you can do the following:
install.packages("caret")
library(caret)
mypred <- predict(model, newdata =test, type = 'class')
obs <- factor(tr.review.asin[6501:7561, ]$overall)  # observed ratings as a factor
confusionMatrix(mypred, obs)                         # predictions first, observed classes second
You will get a confusion matrix and some statistics as output. The confusion matrix tells you, for each class, on how many instances the predictions and the observations coincide; these are the values on the diagonal. In general, the entry in row i and column j tells you how many instances were classified as i while the real class was j.
In the Overall Statistics section of the confusionMatrix output you will see Accuracy, which is the percentage of instances in the test set that were classified correctly.
Next, in the Statistics by Class section, the row named Sensitivity tells you what percentage of the observations that really belong to class x were classified as x, and Pos Pred Value tells you what percentage of the instances classified as x really were x. The function outputs a bunch of other useful statistics, which you can read up on in the caret documentation or online.
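As a quick sanity check next to confusionMatrix, the overall accuracy can also be computed directly (a one-line sketch, assuming the obs and mypred objects from above):
# fraction of test reviews whose predicted rating equals the observed rating
mean(as.character(mypred) == as.character(obs))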
I hope this helps.

Related

Perform feature selection over document-term matrix in R

I have a matrix of 99,814 items containing reviews and their respective polarities (positive or negative), and I want to do some feature selection over the terms of the corpus, to keep only those that are most useful for identifying each polarity, before passing it to a model.
The problem is that I am currently working with 16,554 terms, so trying to transform the document-term matrix into a format where I can apply something like chi-squared to the terms gets me a "Cholmod error out of memory" message.
So my question is: is there any feasible way I can get the chi-squared value of all terms with the matrix in its more "memory efficient" format? Or am I out of luck?
Here's some sample code that should give one an idea of what I am trying to do. I am using the text2vec library to do the transformation on the text.
library(text2vec)
review_matrix <- data.frame(id = c(1, 2, 3),
                            review = c('This review is negative',
                                       'This review is positive',
                                       'This review is positive'),
                            sentiment = c('Negative', 'Positive', 'Positive'),
                            stringsAsFactors = FALSE)
tokenizer <- word_tokenizer
tokens <- tokenizer(review_matrix$review)
iterator <- itoken(tokens,
                   ids = review_matrix$id,  # the data frame has an id column, not reviewId
                   progressbar = FALSE)
vocabulary <- create_vocabulary(iterator)
vectorizer <- vocab_vectorizer(vocabulary)
document_term_matrix <- create_dtm(iterator, vectorizer)
model_tf_idf <- TfIdf$new()
document_term_matrix <- model_tf_idf$fit_transform(document_term_matrix)
# This is where I am trying to do the chisq.test
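One possible direction (a sketch, not code from the original post): the statistic that chisq.test would compute for each term's 2x2 presence-by-class table (without continuity correction) can be obtained directly from the sparse matrix using column sums, so the matrix never has to be densified. The helper name term_chisq and the use of presence/absence counts are assumptions for illustration:
library(Matrix)

term_chisq <- function(dtm, classes, positive = "Positive") {
  x <- 1 * (dtm > 0)                              # presence/absence, stays sparse
  pos <- classes == positive
  a <- Matrix::colSums(x[pos, , drop = FALSE])    # term present, positive doc
  b <- Matrix::colSums(x[!pos, , drop = FALSE])   # term present, negative doc
  c <- sum(pos) - a                               # term absent, positive doc
  d <- sum(!pos) - b                              # term absent, negative doc
  n <- length(classes)
  # classic chi-squared statistic for a 2x2 contingency table, per term
  (n * (a * d - b * c)^2) / ((a + b) * (c + d) * (a + c) * (b + d))
}

scores <- term_chisq(document_term_matrix, review_matrix$sentiment)
head(sort(scores, decreasing = TRUE), 20)         # most discriminative terms
The highest-scoring terms (e.g. the top few thousand) can then be kept before building the model matrix.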

How to sample 75 percent of rows from a dtm? [closed]

How can I sample a dtm? I have tried a lot of code but I keep getting the same error:
Error in dtm[splitter, ] : incorrect number of dimensions
This is the code:
n <- dtm$nrow
splitter <- sample(1:n, round(n * 0.75))
train_set <- dtm[splitter, ]
valid_set <- dtm[-splitter, ]
You can use the quanteda package for this. See example below:
Create example data based on the crude data set from tm:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
crude <- tm_map(crude, stemDocument)
dtm <- DocumentTermMatrix(crude)
library(quanteda)
# Transform your dtm into a dfm for quanteda
my_dfm <- as.dfm(dtm)
# number of documents
ndoc(my_dfm)
[1] 20
set.seed(4242)
# create training set
train_set <- dfm_sample(my_dfm,
                        size = round(ndoc(my_dfm) * 0.75),  # set sample size
                        margin = "documents")
# create the test set by selecting the documents that do not match the documents in the training set
test_set <- dfm_subset(my_dfm, !docnames(my_dfm) %in% docnames(train_set))
# number of documents in train
ndoc(train_set)
[1] 15
# number of documents in test
ndoc(test_set)
[1] 5
Afterwards you can use the quanteda function convert to convert your train and test sets to be used with topicmodels, lda, lsa, etc. See ?convert for more info.
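For example, a small sketch (assuming the objects created above) of handing the training set to the topicmodels package:
# convert the quanteda dfm into the DocumentTermMatrix format expected by topicmodels
train_dtm <- convert(train_set, to = "topicmodels")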
Try using the caret package:
library(caret)
#help(package="caret")
# here `sample` stands for your outcome vector and `news.raw` for your full data set
index <- createDataPartition(sample, times = 1, p = 0.75, list = FALSE)
train <- news.raw[index, ]
test <- news.raw[-index, ]
Hope this helps!
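As another option, if dtm really is a tm DocumentTermMatrix (the error in the question suggests it may have been turned into something else along the way), it can also be split with base R directly; a minimal sketch using the dtm built above:
set.seed(4242)
n <- nrow(dtm)                                   # number of documents
splitter <- sample(seq_len(n), round(n * 0.75))  # 75% of the document indices
train_set <- dtm[splitter, ]
valid_set <- dtm[setdiff(seq_len(n), splitter), ]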

Stock price prediction based on financial news in R with SVM

I'm new to R and trying to predict the S&P 500 stock price based on financial news with the help of support vector machines (SVM). I have two datasets: one is the stock market data and the other is the cleaned financial-news corpus data. I converted the corpus into a document-term matrix and also applied sentiment analysis to it (once with the SentimentAnalysis package and once with the tidytext package). Now I'm desperate to get this model running. I've found different approaches for using SVM to predict stock prices, but none with financial news. How can I combine the two datasets to create the model? My current code and situation is this:
docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
# Cleaning steps are not shown here
# Creating DTM
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Sentiment analysis DTM
dtm.sent <- analyzeSentiment(dtm)
# Creating DTM Tidy Format
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Sentiment analysis Tidy DTM
sent.afinn <- dtm.tidy %>%
  inner_join(get_sentiments("afinn"), by = c(term = "word"))
sent.bing <- dtm.tidy %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
sent.nrc <- dtm.tidy %>%
  inner_join(get_sentiments("nrc"), by = c(term = "word"))
# Data split
id_dtm <- sample(nrow(dtm),nrow(dtm)*0.70)
dtm.train = dtm[id_dtm,]
dtm.test = dtm[-id_dtm,]
id_sp500 <- sample(nrow(SP500.Data),nrow(SP500.Data)*0.70)
sp500.train = SP500.Data[id_sp500,]
sp500.test = SP500.Data[-id_sp500,]
That is my status quo. Now I would like to run the SVM model based on the two datasets described above. But I think I need to do some classification first; I have seen people work with labels like (-1 / +1) or something similar. My sentiment analysis gives me terms grouped into positive and negative classes, but I just don't know how to put both sets together to build the model. I would be very happy if somebody could help me. Thanks so much in advance!
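Not an answer from the original thread, but a minimal sketch of one common setup (every object and column name below is an assumption): aggregate a sentiment score per trading day, label each day by the sign of the S&P 500 return, join the two tables on date, and only then split into training and test sets (sampling the DTM and the S&P 500 data independently, as in the code above, breaks the alignment between news days and market days):
library(e1071)

# hypothetical inputs: daily.sent has one sentiment score per date,
# SP500.Data has a closing price per date
dat <- merge(daily.sent, SP500.Data, by = "date")
dat <- dat[order(dat$date), ]
dat$direction <- factor(ifelse(c(NA, diff(dat$close)) > 0, "up", "down"))
dat <- na.omit(dat)

id <- sample(nrow(dat), nrow(dat) * 0.70)
fit <- svm(direction ~ sentiment, data = dat[id, ], kernel = "radial")
pred <- predict(fit, newdata = dat[-id, ])
mean(pred == dat[-id, "direction"])              # out-of-sample hit rate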

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (e.g. business, entertainment) of an article based on its text.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))  # keep the first 1780 labels for training
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
  bbc_pred$docs$predicted[1781:2225],
  all_classes[1781:2225]
)
However, as Ken Benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
  ifelse(docvars(bbc_corpus)$category == 'sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).

Getting a constant answer while finding patterns with neuralnet package

I'm trying to find patterns in a large dataset using the neuralnet package.
My data file looks something like this (30,204,447 rows) :
id.company,EPS.or.Sales,FQ.or.FY,fiscal,date,value
000001,EPS,FY,2001,20020201,-5.520000
000001,SAL,FQ,2000,20020401,70.300003
000001,SAL,FY,2001,20020325,49.200001
000002,EPS,FQ,2008,20071009,-4.000000
000002,SAL,FY,2008,20071009,1.400000
I have split this initial file into four new files for annual/quarterly sales/EPS, and it is on those files that I want to use neural networks, to see whether the variables id.company, fiscal and date (in the case below) can be used to predict the annual sales results.
To do so, I have written the following code:
dataset <- read.table("fy_sal_data.txt", header = TRUE, sep = "\t")  # my file doesn't actually use commas as separators
#extract training set and testing set
trainset <- dataset[1:1000, ]
testset <- dataset[1001:2000, ]
#building the NN
library(neuralnet)
ann <- neuralnet(value ~ id.company + fiscal + date, trainset, hidden = 3,
                 lifesign = "minimal", threshold = 0.01)
#testing the output
temp_test <- subset(testset, select=c("id.company", "fiscal", "date"))
ann.results <- compute(ann, temp_test)
#display the results
cleanoutput <- cbind(testset$value, as.data.frame(ann.results$net.result))
colnames(cleanoutput) <- c("Expected Output", "NN Output")
head(cleanoutput, 30)
Now my problem is that the compute function returns a constant answer no matter the inputs of the testing set.
Expected Output NN Output
1001 2006.500000 1417.796651
1002 2009.000000 1417.796651
1003 2006.500000 1417.796651
1004 2002.500000 1417.796651
I am very new to R and its neural-network packages, but I have found online that some of the reasons for such results can be either:
an insufficient number of training examples (here I'm using a thousand, but I've also tried a million rows and the results were the same, only it took 4 hours to train),
or an error in the formula.
I am sure I'm doing something wrong but I can't seem to figure out what.
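One more thing that is often suggested in such cases (a sketch, not code from the original post): neuralnet is very sensitive to the scale of its inputs, and raw values such as dates around 20,000,000 can saturate the hidden units so that the network outputs a near-constant value. A minimal sketch of rescaling all variables to [0, 1] before training, reusing the object names from the code above:
library(neuralnet)

vars <- c("id.company", "fiscal", "date", "value")
rng  <- sapply(dataset[vars], range)                 # per-column min and max
scaled <- as.data.frame(scale(dataset[vars],
                              center = rng[1, ],
                              scale  = rng[2, ] - rng[1, ]))   # rescale to [0, 1]

trainset <- scaled[1:1000, ]
testset  <- scaled[1001:2000, ]

ann <- neuralnet(value ~ id.company + fiscal + date, trainset,
                 hidden = 3, lifesign = "minimal", threshold = 0.01)
pred <- compute(ann, testset[, c("id.company", "fiscal", "date")])$net.result

# map the predictions back to the original units of `value`
pred <- pred * (rng[2, "value"] - rng[1, "value"]) + rng[1, "value"]
Also note that id.company is an identifier rather than a quantity, so treating it as a numeric predictor (as the original formula does) is itself questionable.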
