Stock price prediction based on financial news in R with SVM

I'm new to R and trying to predict the S&P 500 stock price from financial news with support vector machines (SVM). I have two datasets: the stock market data and a cleaned corpus of financial news. I converted the corpus into a document-term matrix and also ran sentiment analysis on it (once with the SentimentAnalysis package and once with the tidytext package). Now I'm struggling to get the model running. I've found different approaches that use SVM to predict stock prices, but none that use financial news. How can I combine the two datasets to build the model? My current code is:
docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
# Cleaning steps are not shown here
# Creating DTM
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Sentiment analysis DTM
dtm.sent <- analyzeSentiment(dtm)
# Creating DTM Tidy Format
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Sentiment analysis Tidy DTM
sent.afinn <- dtm.tidy %>%
  inner_join(get_sentiments("afinn"), by = c(term = "word"))
sent.bing <- dtm.tidy %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
sent.nrc <- dtm.tidy %>%
  inner_join(get_sentiments("nrc"), by = c(term = "word"))
# Data split
id_dtm <- sample(nrow(dtm),nrow(dtm)*0.70)
dtm.train = dtm[id_dtm,]
dtm.test = dtm[-id_dtm,]
id_sp500 <- sample(nrow(SP500.Data),nrow(SP500.Data)*0.70)
sp500.train = SP500.Data[id_sp500,]
sp500.test = SP500.Data[-id_sp500,]
That is my status quo. Now I would like to fit the SVM model on the two datasets described above, but I think I need to do some classification first. I have seen examples that work with labels such as -1 / +1. My sentiment analysis assigned terms to positive and negative classes, but I don't know how to put both sets together to build the model. I would be very happy if somebody could help. Thanks so much in advance!
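One possible way to put the two sets together, sketched under several assumptions: every news document carries a date, sentiment is reduced to one score per day, and the class label is the direction of the next day's close (+1 up, -1 otherwise). The data below is randomly generated, and the column names `date`, `score`, `close` and `direction` are hypothetical stand-ins for the real data. The SVM step assumes the e1071 package and is skipped if it is not installed.

```r
set.seed(1)

# Toy stand-ins for the two data sets (hypothetical columns and values).
days <- seq(as.Date("2020-01-01"), by = "day", length.out = 40)
news <- data.frame(
  date  = sample(days, 120, replace = TRUE),  # several documents per day
  score = runif(120, -1, 1)                   # per-document sentiment score
)
prices <- data.frame(
  date  = days,
  close = 3000 + cumsum(rnorm(40, sd = 10))
)

# 1. One sentiment value per day: mean score of that day's documents.
daily_sent <- aggregate(score ~ date, data = news, FUN = mean)

# 2. Label each day by the direction of the NEXT day's close.
prices <- prices[order(prices$date), ]
next_close <- c(prices$close[-1], NA)
prices$direction <- ifelse(next_close > prices$close, 1, -1)

# 3. Join on date, so every sentiment row is paired with its own price row.
model_data <- merge(daily_sent, prices, by = "date")
model_data <- model_data[!is.na(model_data$direction), ]  # last day has no label
model_data <- model_data[order(model_data$date), ]
model_data$direction <- factor(model_data$direction, levels = c(-1, 1))

# 4. Split by time (first 70% of days for training), not by random rows.
n_train <- floor(0.7 * nrow(model_data))
train <- model_data[seq_len(n_train), ]
test  <- model_data[-seq_len(n_train), ]

# 5. Fit and evaluate the SVM if e1071 is available.
if (requireNamespace("e1071", quietly = TRUE)) {
  fit  <- e1071::svm(direction ~ score, data = train, kernel = "linear")
  pred <- predict(fit, newdata = test)
  print(table(predicted = pred, observed = test$direction))
}
```

The point of the join in step 3 is that sampling the DTM rows and the price rows independently, as in the split above, gives two sets whose rows no longer correspond to each other. More features (for example selected DTM columns aggregated per day) could be merged in the same way before fitting.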

Related

Text Analysis in R: How to add variables to my machine learning classifier in addition to the tokens?

how to consider additional variables
I am working on a classification task using quanteda in R, and I want my models to consider some variables in addition to the bag of words.
For instance, I computed dictionary-based sentiment indexes and I'd like to include them as features.
These are the indexes I created for each document:
dfneg <- cbind(negDfm1@docvars$label , negDfm1@x ,posDfm@x , angDfm@x ,
disgDfm1@x)
colnames(dfneg) <- c("label","neg" , "pos" , "ang" , "disg")
dfneg <- as.data.frame(dfneg)
this is the document features matrix I will work with:
DFM
newsdfm <- dfm(newscorp, tolower = TRUE , stem = FALSE , remove_punct =
TRUE, remove = stopwords("english"),verbose=TRUE)
newst<- dfm_trim(newsdfm , min_docfreq=2 , verbose=TRUE)
id_train <- sample(1:6335, 5384, replace = FALSE)
# create docvar with ID
docvars(newst, "id_numeric") <- 1:ndoc(newst)
# get training set
train <- dfm_subset(newst, id_numeric %in% id_train)
# get test set (documents not in id_train)
test <- dfm_subset(newst, !id_numeric %in% id_train)
Finally, I run a classifier, for instance Naive Bayes or lasso:
# Naive Bayes classifier or lasso
NBmodel <- textmodel_nb(train , train@docvars$label)
lasso <- cv.glmnet(train, train@docvars$label,
family="binomial", alpha=1, nfolds=10,
type.measure="class")
this is what I tried after creating the dfm, but it didn't work
newsdfm@Dimnames$features$negz <- dfneg$neg
newsdfm@Dimnames$features$posz <- dfneg$pos
newsdfm@Dimnames$features$angz <- dfneg$ang
newsdfm@Dimnames$features$disgz <- dfneg$disg
then I thought of creating document variables before creating newsdfm
docvars(newscorp , "negz") <- dfneg$neg
docvars(newscorp , "posz") <- dfneg$pos
docvars(newscorp , "angz") <- dfneg$ang
docvars(newscorp , "disgz") <- dfneg$disg
but at that point, I don't know how to tell the classifier that I want it to consider also these document variables in addition to the bag of words.
In summary, I expect the model to consider both the matrix with all the words per each document and the indexes I created per each document.
any suggestion is highly appreciated
thank you in advance,
Carlo
Internally, dfm are sparse matrices, but it is better to avoid manipulating them directly if possible.
To add new features for textmodel_nb(), you need to add them to the dfm. As you might expect, the easiest way to do so is to call cbind() on the dfm.
In your example, you can run something like this:
additional_features <- dfneg[, c("neg", "pos", "ang", "disg")] %>% as.matrix()
newsdfm_added <- cbind(newsdfm, additional_features)
As you can see, I first created a matrix of additional features and then ran cbind(). When you execute cbind(), you will get the following warnings:
Warning messages:
1: cbinding dfms with different docnames
2: cbinding dfms with overlapping features will result in duplicated features
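As the second warning indicates, you have to make sure that the column names of the additional features do not already occur in the original dfm. The first warning is a reminder that the document names differ, so the rows of the feature matrix must be in the same document order as the dfm.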
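A minimal sketch of the same idea on toy data, assuming a recent quanteda (version 3 or later, where dfm() is built from tokens()). The texts and the `negz`/`posz` scores are made up for illustration. Giving the extra matrix the dfm's document names and converting it with as.dfm() should avoid both warnings.

```r
library(quanteda)

# Three toy documents (hypothetical).
txt <- c(d1 = "good great product", d2 = "bad awful product", d3 = "great value")
bow <- dfm(tokens(txt))

# Hypothetical per-document sentiment scores, in the same document order,
# with the dfm's document names as row names.
extra <- matrix(
  c(0, 2, 0,   # negz
    2, 0, 1),  # posz
  ncol = 2,
  dimnames = list(docnames(bow), c("negz", "posz"))
)

# The new column names must not clash with existing features.
stopifnot(!any(colnames(extra) %in% featnames(bow)))

# Bind the extra columns to the bag of words.
combined <- cbind(bow, as.dfm(extra))
dim(combined)
```

The resulting `combined` object is still a dfm, so it can be subset into training and test sets and passed to textmodel_nb() or cv.glmnet() like the original one.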

Extracting Class Probabilities from SparkR ML Classification Functions

I'm wondering if it's possible (using the built-in features of SparkR or any other workaround) to extract the class probabilities from some of the classification algorithms included in SparkR. The ones of particular interest are:
spark.gbt()
spark.mlp()
spark.randomForest()
Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence."
I've seen several other questions on similar topics, but none that are specific to SparkR, and many have not been answered with regard to Spark's most recent updates.
I ran into the same problem and, following this answer, now use SparkR:::callJMethod to convert the probability DenseVector (which R cannot deserialize) to an Array (which R reads as a list). It's not very elegant or fast, but it does the job:
denseVectorToArray <- function(dv) {
SparkR:::callJMethod(dv, "toArray")
}
e.g.:
start your spark session
#library(SparkR)
#sparkR.session(master = "local")
generate toy data
data <- data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
someString = base::sample(c("this", "that"),
100, replace=TRUE),
stringsAsFactors=FALSE)
trainidxs <- base::sample(nrow(data), nrow(data)*0.7)
traindf <- as.DataFrame(data[trainidxs,])
testdf <- as.DataFrame(data[-trainidxs,])
train a random forest and run predictions:
rf <- spark.randomForest(traindf,
clicked~.,
type = "classification",
maxDepth = 2,
maxBins = 2,
numTrees = 100)
predictions <- predict(rf, testdf)
collect your predictions:
collected = SparkR::collect(predictions)
now extract the probabilities:
collected$probabilities <- lapply(collected$probability, function(x) denseVectorToArray(x))
str(collected$probabilities)
Of course, the function wrapper around SparkR:::callJMethod is a bit of overkill. You can also use it directly, e.g. with dplyr:
withprobs = collected %>%
rowwise() %>%
mutate("probabilities" = list(SparkR:::callJMethod(probability,"toArray"))) %>%
mutate("prob0" = probabilities[[1]], "prob1" = probabilities[[2]])

How to use weights from survey package in TermDocumentMatrix

I work a lot with samples that I want to generalize to larger populations. Most of the time, however, the samples are biased and need to be weighted with the survey package. I have not found a way to apply this kind of weights to a term-document matrix. Consider this example:
library(tm)
library(wordcloud)
set.seed(123)
# Consider this example: I have performed a sample from a population and now have
# 1000 observations of text. In the data I also have information about gender.
# The sample
data <- rbind(data.frame(gender = "M",
words = sample(c("education", "money", "family",
"house", "debts"),
600, replace = TRUE)),
data.frame(gender = "F",
words = sample(c("career", "bank", "friends",
"drinks", "relax"),
400, replace = TRUE)))
# I create a simple wordcloud
text <- paste(data$words, collapse = " ")
matrix <- as.matrix(
TermDocumentMatrix(
VCorpus(
VectorSource(text)
)
)
)
Which produces a wordcloud that looks something like this:
As you can see, the terms mentioned by men are bigger because they appear more often. However, I know the true distribution of this population, so this wordcloud is biased.
The true gender distribution
true_gender_dist <- data.frame(gender = c("M", "F"), freq = nrow(data) * c(0.49,0.51))
With the survey package I can weight the data with the rake function
library(survey)
rake_data <- rake(design = svydesign(ids = ~1, data = data),
sample.margins = list(~gender),
population.margins = list(true_gender_dist))
In order to use the weights in analysis, visualizations etc. (that are not included in the survey package) I add the weights to the original data.
data_weighted <- cbind(data, data.frame(weights = weights(rake_data)))
So far so good. However, I would like to make a wordcloud that takes these weights into consideration.
My first attempt would be to use the weights in making the Term Document Matrix.
text_corp <- VCorpus(VectorSource(text))
w_tdm <- TermDocumentMatrix(text_corp,
control = list(weighting = weights(rake_data)))
But then I get:
Error in .TermDocumentMatrix(m, weighting) : invalid weighting
Is this at all possible?
I can't comment yet, so I'll use an answer to comment on your question:
You might be interested in the R package stm (structural topic models). It lets you infer latent topics conditional on metadata variables (continuous and/or discrete).
You can generate different kinds of plots to check how the metadata variables influence
a) the selected topics,
b) the preferred words inside one topic,
c) and some more :)
Some links, if you're interested:
Paper describing the R package
R documentation
Some more Papers <-- this is a really good collection, if you want to dive into the subject some more!
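For the wordcloud itself, a rougher alternative that stays close to the question might be to skip the term-document matrix and sum the weights per word instead of counting occurrences. The sketch below is only an illustration: it rebuilds the sample from the question and, as a stand-in for weights(rake_data), computes simple per-gender weights by hand as population share divided by sample share (for a single margin this should coincide with what raking produces). The names `pop_share`, `sample_share` and `weighted_freq` are made up here.

```r
set.seed(123)

# The sample from the question: 600 men, 400 women, one word each.
data <- rbind(
  data.frame(gender = "M",
             words = sample(c("education", "money", "family", "house", "debts"),
                            600, replace = TRUE)),
  data.frame(gender = "F",
             words = sample(c("career", "bank", "friends", "drinks", "relax"),
                            400, replace = TRUE))
)

# Stand-in for weights(rake_data): population share / sample share per gender.
pop_share    <- c(M = 0.49, F = 0.51)
sample_share <- prop.table(table(data$gender))
g            <- as.character(data$gender)
data$weights <- as.numeric(pop_share[g] / sample_share[g])

# Weighted frequency of a word = sum of the weights of everyone who used it.
weighted_freq <- tapply(data$weights, as.character(data$words), sum)
sort(round(weighted_freq, 1), decreasing = TRUE)

# The result can be passed to the wordcloud package, e.g.
# wordcloud::wordcloud(names(weighted_freq), weighted_freq)
```

With these weights the words used by women add up to 510 and those used by men to 490, so the cloud reflects the population shares rather than the sample shares.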

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (i.e. business, entertainment) of an article based on type.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
As a stylistic note, you don't need to separately load the labels/classes/categories, the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
bbc_pred$docs$predicted[1781:2225],
all_classes[1781:2225]
)
However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).

Review star rating - prediction in R

I have a dataset of reviews that have the following structure:
{
"reviewerID": "XXXX",
"asin": "12345XXX",
"reviewerName": "Paul",
"helpful": [2, 5],
"reviewText": "Nice product, works as it should.",
"overall": 5.0,
"summary": "Nice product",
"unixReviewTime": 1152700000,
"reviewTime": "08 14, 2010"
}
I have a bunch of reviews and would like to predict the rating from the text of the review ("reviewText") using some text mining techniques.
I would like to train a classifier and then measure how accurately it works. The rating of each review is given ("overall").
So far I did the following:
Required packages (not all are required)
library(plyr)
library(rjson)
library(magrittr)
library(lubridate)
library(stringi)
library(doSNOW)
library(tm)
library(NLP)
library(wordcloud)
library(SnowballC)
library(rpart)
The input data is available in JSON format:
[Image: sample input]
Out of this table reviewTexts are converted to a corpus.
Create a corpus and apply some pre-processing steps
corpus <- Corpus(VectorSource(tr.review.asin$reviewText))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
Making a document term matrix
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
Creating a training and test set
dtmsparse <- as.data.frame(as.matrix(dtm))
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Creating a model
train$overall <- tr.review.asin[1:6500,]$overall
model <- rpart(overall ~., data = train, method= 'class')
mypred <- predict(model, newdata =test, type = 'class')
When I am plotting obs_test and mypred I am getting the following plot:
[Figure: plot of obs_test and mypred]
Unfortunately I don't have an idea if the last lines will guide me to a solution.
I would like to have a procedure where I can test how well my model is forecasting (comparison between real overall rating and predicted rating).
It completely slipped my attention that you are actually dealing with a classification problem and not with regression, hence a complete edit.
To see how well a classification tree performs, one would want to know how many instances (in the test data) were misclassified, i.e. the assigned class was not the same as the observed class. It is also informative to see how well the prediction model works on each individual class. Using the confusionMatrix function from the caret package you can do the following:
install.packages("caret")
library(caret)
mypred <- predict(model, newdata = test, type = 'class')
# confusionMatrix() expects two factors with the same levels: predictions first, observed classes second
obs <- factor(tr.review.asin[6501:7561,]$overall, levels = levels(mypred))
confusionMatrix(data = mypred, reference = obs)
You will get a confusion matrix and some stats as output. The confusion matrix tells you on how many instances predictions and observations coincide for each class -- these are the values on the diagonal. In general, the ij-th entry of the matrix tells you how many instances were classified as i whilst the real class was j (rows are predictions, columns are the observed classes).
In the Overall Statistics section of the confusionMatrix output you will see Accuracy -- this is the proportion of the instances in the test set that were classified correctly.
Next, in the Statistics by Class section, the row named Sensitivity tells you what proportion of observations in class x were classified correctly, and the row named Pos Pred Value tells you what proportion of the predictions of class x were correct. There is a bunch of other useful statistics that the function outputs, and you can read up on them on-line, for example here or here.
i hope this helps.
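To make these numbers less of a black box, here is a small base R sketch that computes them by hand. The ten observed and predicted ratings are invented, and the table uses rows for predictions and columns for observed classes, which is the layout caret prints.

```r
# Hypothetical observed and predicted star ratings for ten test reviews.
lv   <- c("1", "3", "5")
obs  <- factor(c("5", "5", "3", "1", "5", "3", "1", "5", "3", "5"), levels = lv)
pred <- factor(c("5", "3", "3", "1", "5", "5", "1", "5", "3", "5"), levels = lv)

# Confusion matrix: rows = predicted class, columns = observed class.
cm <- table(predicted = pred, observed = obs)
print(cm)

# Accuracy: share of test reviews on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correctly classified / all observed in that class.
sensitivity <- diag(cm) / colSums(cm)

# Pos Pred Value per class: correctly classified / all predicted as that class.
pos_pred_value <- diag(cm) / rowSums(cm)

accuracy        # 0.8
sensitivity     # 1, 2/3 and 0.8 for classes 1, 3 and 5
pos_pred_value  # 1, 2/3 and 0.8 for classes 1, 3 and 5
```

Sensitivity and Pos Pred Value happen to be equal in this toy example; on real data they usually differ, which is why it is worth looking at both.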
