How to sample 75 percent of rows from a dtm? [closed] - r

How can I sample a dtm? I have tried a lot of code, but it always returns the same error:
Error in dtm[splitter, ] : incorrect number of dimensions
This is the code:
n <- dtm$nrow
splitter <- sample(1:n, round(n * 0.75))
train_set <- dtm[splitter, ]
valid_set <- dtm[-splitter, ]
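(For what it's worth, plain row-subsetting does work on a genuine tm DocumentTermMatrix, so this error usually means dtm is no longer a DTM by the time it is sliced. A quick sanity check, assuming dtm was built with DocumentTermMatrix():)
library(tm)
class(dtm)                  # expect "DocumentTermMatrix" "simple_triplet_matrix"
n <- nrow(dtm)              # nrow() works the same as dtm$nrow
splitter <- sample(seq_len(n), round(n * 0.75))
train_set <- dtm[splitter, ]   # supported for a real DocumentTermMatrix
valid_set <- dtm[-splitter, ]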

You can use the quanteda package for this. See example below:
Example data created from the crude data set that ships with tm:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
crude <- tm_map(crude, stemDocument)
dtm <- DocumentTermMatrix(crude)
library(quanteda)
# Transform your dtm into a dfm for quanteda
my_dfm <- as.dfm(dtm)
# number of documents
ndoc(my_dfm)
[1] 20
set.seed(4242)
# create training set (dfm_sample() samples documents; older quanteda
# versions also took a margin = "documents" argument)
train_set <- dfm_sample(my_dfm,
size = round(ndoc(my_dfm) * 0.75)) # set sample size
# create test set by selecting the documents that do not match the documents in the training set
test_set <- dfm_subset(my_dfm, !docnames(my_dfm) %in% docnames(train_set))
# number of documents in train
ndoc(train_set)
[1] 15
# number of documents in test
ndoc(test_set)
[1] 5
Afterwards you can use the quanteda function convert to convert your train and test sets to be used with topicmodels, lda, lsa, etc. See ?convert for more info.
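For example, a minimal sketch of handing the training set to the topicmodels package (the choice of k = 5 here is arbitrary):
library(topicmodels)
# convert() reshapes the dfm into the sparse document-term format topicmodels expects
train_lda_input <- convert(train_set, to = "topicmodels")
lda_fit <- LDA(train_lda_input, k = 5, control = list(seed = 4242))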

Try the caret package:
library(caret)
# help(package = "caret")
# createDataPartition() stratifies on an outcome vector; replace `labels`
# with the class/response vector for your data (e.g. a column of news.raw)
index <- createDataPartition(labels, times = 1, p = 0.75, list = FALSE)
train <- news.raw[index, ]
test <- news.raw[-index, ]
Hope this helps!
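A self-contained sketch with toy data (illustrative names only): createDataPartition() stratifies on the outcome you pass it, so the class proportions survive the split:
library(caret)
set.seed(42)
# toy data frame standing in for news.raw: 100 rows with a binary label
toy <- data.frame(x = rnorm(100),
label = factor(sample(c("pos", "neg"), 100, replace = TRUE)))
idx <- createDataPartition(toy$label, times = 1, p = 0.75, list = FALSE)
prop.table(table(toy$label[idx]))   # training-set class proportions
prop.table(table(toy$label[-idx]))  # test-set proportions, roughly the same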

Related

Random Forest: number of items to replace is not a multiple of replacement length

We created a table in R with values from the S&P 500 and added columns such as the simple 10-day moving average (SMA10). We set the NA values to 0. Example:
library(TTR)  # SMA() comes from the TTR package
myStartDate <- '2020-01-01'
myEndDate <- Sys.Date()
Dataset$SMA10 <- SMA(Dataset[, "Close"], 10)
Dataset$SMA10 <- as.numeric(Dataset$SMA10)
Dataset$SMA10[is.na(Dataset$SMA10)] <- 0
Our goal is to create a random forest model, so we split the data into training and validation sets:
set.seed(100)
train <- sample(nrow(Dataset), 0.5 * nrow(Dataset), replace = FALSE)
TrainSet <- Dataset[train, ]
ValidSet <- Dataset[-train, ]
Now when we try to fit the model with the following code:
library(randomForest)
model1 <- randomForest(SMA10 ~ ., data = TrainSet, mtry = 5, importance = TRUE, ntree = 500)
print(model1)
we get this error message:
Error in x[, i] <- frame[[i]] : number of items to replace is not a multiple of replacement length
Searching the forum for this error suggests it is related to NA values, which confuses us because we have no NA values in our table. Can you tell us what we are doing wrong? Thank you very much in advance.
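(A hedged diagnostic, for what it's worth: this error is often caused by column types rather than literal NAs. If Dataset is an xts object from quantmod, a column added with $<- can itself be a one-column xts/matrix, which randomForest cannot unpack. Checking the structure and flattening to a plain data frame is worth a try:)
str(TrainSet)  # look for columns that are secretly matrices or non-numeric
# if Dataset is an xts/zoo series, flatten it to a plain data frame of
# numeric columns before splitting; index() and coredata() are from zoo
library(zoo)
Dataset_df <- data.frame(Date = index(Dataset), coredata(Dataset))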

Stock price prediction based on financial news in R with SVM

I'm new to R and trying to predict the S&P 500 stock price based on financial news with the help of support vector machines (SVM). I have 2 datasets: one is the stock market data and the other the cleaned financial news corpus data. I converted the corpus into a document-term matrix and also applied sentiment analysis to it (once with the SentimentAnalysis package and once with the tidytext package). And now I'm desperate to get this model running. I've found different approaches for using SVM to predict the stock price, but none with financial news. How can I combine the two datasets to create the model? My current code and situation is this:
docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
# Cleaning steps are not shown here
# Creating DTM
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Sentiment analysis DTM
dtm.sent <- analyzeSentiment(dtm)
# Creating DTM Tidy Format
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Sentiment analysis Tidy DTM
sent.afinn <- dtm.tidy %>%
inner_join(get_sentiments("afinn"), by = c(term = "word"))
sent.bing <- dtm.tidy %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
sent.nrc <- dtm.tidy %>%
inner_join(get_sentiments("nrc"), by = c(term = "word"))
# Data split
id_dtm <- sample(nrow(dtm),nrow(dtm)*0.70)
dtm.train = dtm[id_dtm,]
dtm.test = dtm[-id_dtm,]
id_sp500 <- sample(nrow(SP500.Data),nrow(SP500.Data)*0.70)
sp500.train = SP500.Data[id_sp500,]
sp500.test = SP500.Data[-id_sp500,]
That is my status quo. Now I would like to run the SVM model on the two datasets described above. But I think I need to do some classification first; I have seen approaches that work with labels like (-1 / +1) or something similar. My sentiment analysis sorted terms into positive and negative classes, but I just don't know how to put both sets together to build the model. I would be very happy if somebody could help me. Thanks so much in advance!
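(A hedged sketch of one common way to wire the two datasets together: derive an up/down label from the S&P 500 closes, align it with the document-term rows, and fit an SVM with the e1071 package. The Date/Close column names and the assumption that row i of dtm corresponds to trading day i are illustrative; adapt them to your data.)
library(e1071)
# label each trading day +1/-1 by the sign of the close-to-close return
ret <- diff(SP500.Data$Close) / head(SP500.Data$Close, -1)
label <- factor(ifelse(ret >= 0, "+1", "-1"))
# assume each remaining row of dtm lines up with one labelled day, so
# features and labels can be stacked into one modelling frame
svm_df <- data.frame(dtm[-1, ], label = label)
id <- sample(nrow(svm_df), nrow(svm_df) * 0.70)
fit <- svm(label ~ ., data = svm_df[id, ], kernel = "linear")
mean(predict(fit, svm_df[-id, ]) == svm_df$label[-id])  # test accuracy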

Using tm and rpart in R: decision tree for textual data?

I am using the tm package in R to create a corpus of text documents and I would like to create a decision tree with rpart for classification purposes. However, I can't find any examples on the internet about using textual data with rpart. Is it even possible or are there other packages I could use?
Here's a starter:
library(tm)
library(rpart)
docs <- c(txt1="Hello world", txt2="lorem ipsum")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)), control = list(weighting = weightBin))
m <- as.matrix(dtm)
train <- as.data.frame(m)
train$Docs <- factor(rownames(m), labels=names(docs))
fit <- rpart(Docs~.,data=train, control = rpart.control(minsplit=1))
test <- data.frame(hello=c(1,0),world=c(0,0),ipsum=c(0,1),lorem=c(0,0), row.names=letters[1:2])
predict(fit, newdata=test, type="class")
# a b
# txt1 txt2
# Levels: txt1 txt2
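In practice the test matrix would come from new text rather than being typed in by hand; a sketch using the training vocabulary so the columns line up (the example strings are made up):
new_docs <- c("hello hello", "lorem lorem")
new_dtm <- DocumentTermMatrix(Corpus(VectorSource(new_docs)),
control = list(dictionary = Terms(dtm), weighting = weightBin))
predict(fit, newdata = as.data.frame(as.matrix(new_dtm)), type = "class")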

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (e.g. business, entertainment) of an article based on its text.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
As a stylistic note, you don't need to separately load the labels/classes/categories, the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
bbc_pred$docs$predicted[1781:2225],
all_classes[1781:2225]
)
However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).

Review star rating - prediction in R

I have a dataset of reviews that have the following structure:
{
"reviewerID": "XXXX",
"asin": "12345XXX",
"reviewerName": "Paul",
"helpful": [2, 5],
"reviewText": "Nice product, works as it should.",
"overall": 5.0,
"summary": "Nice product",
"unixReviewTime": 1152700000,
"reviewTime": "08 14, 2010"
}
I have got a bunch of reviews and would like to create a forecast based on the text of the review ("reviewText") using some text mining techniques.
I would like to train a classifier and then have an accuracy measure how well the system works. The rating of each review is given ("overall").
So far I did the following:
Packages used (not all are strictly necessary)
library(plyr)
library(rjson)
library(magrittr)
library(lubridate)
library(stringi)
library(doSNOW)
library(tm)
library(NLP)
library(wordcloud)
library(SnowballC)
library(rpart)
The input data is available in JSON format:
Sample Input
From this table, the reviewText fields are converted to a corpus.
Create a corpus and apply some pre-processing steps
corpus <- Corpus(VectorSource(tr.review.asin$reviewText))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
Making a document term matrix
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
Creating a training and test set
dtmsparse <- as.data.frame(as.matrix(dtm))
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Creating a model
train$overall <- tr.review.asin[1:6500,]$overall
model <- rpart(overall ~., data = train, method= 'class')
mypred <- predict(model, newdata =test, type = 'class')
When I plot obs_test and mypred I get the following plot:
[Plot of obs_test and mypred]
Unfortunately I have no idea whether these last lines will lead me to a solution.
I would like to have a procedure to test how well my model forecasts (a comparison between the real overall rating and the predicted rating).
It completely slipped my attention that you are actually dealing with a classification problem and not with regression, hence a complete edit.
To see how well a classification tree performs, you want to know how many instances in the test data were misclassified, i.e. where the assigned class was not the same as the observed class. It is also informative to see how well the prediction model works on each individual class. Using the confusionMatrix function from the caret package you can do the following:
install.packages("caret")
library(caret)
mypred <- predict(model, newdata = test, type = 'class')
obs <- tr.review.asin[6501:7561,]$overall
confusionMatrix(factor(mypred), factor(obs))
You will get a confusion matrix and some stats as output. The confusion matrix tells you for how many instances the predictions and observations coincide for each class; these are the values on the diagonal. In general, the entry in row i, column j tells you how many instances were predicted as class i while the true class was j.
In the Overall Statistics section of the confusionMatrix output you will see Accuracy; this is the percentage of instances in the test set that were classified correctly.
Next, in the Statistics by Class section, the row named Pos Pred Value tells you what percentage of the observations assigned to class x actually belong to it. The function outputs a bunch of other useful statistics that you can read up on online, for example here or here.
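If you just want a single headline number, accuracy can also be computed directly (assuming mypred and obs are aligned as above):
# share of test reviews whose predicted rating equals the observed one
mean(as.character(mypred) == as.character(obs))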
I hope this helps.
