Stock prediction + news sentiment with SVM in R? - r

I would like to model stock prices together with news sentiment scores using an SVM in R, in order to see whether news has an impact on stock prices and their prediction. I read that support vector machines (SVMs) are a good machine learning approach for this problem. I have one column with the date of the stock price and news, one column with the stock price on that day, and 4 columns with sentiment scores based on different lexica. I would like to test with one of the lexica first and, if the model works, try the others. The dataset is included below. I found some examples in Python but couldn't find anything for R. I would like to use the svm() function from the e1071 package.
I split the data into train and test set:
sample <- sample(nrow(sentGI),nrow(sentGI)*0.70)
df.trainGI = sentGI[sample,]
df.testGI = sentGI[-sample,]
I have already tried this SVM code, but my wrong prediction rate is 100:
plot(df.trainGI$GSPC.Close, df.trainGI$SentimentGI, pch = 19, col = c("red", "blue"))
svm_model_GI <- svm(SentimentGI.Class ~ ., df.trainGI)
print(svm_model_GI)
plot(svm_model_GI, df.trainGI)
svm_pred_GI <- predict(svm_model_GI, newdata = df.testGI, type="response")
rmse <- sqrt(mean((svm_pred_GI - df.testGI$GSPC.Close)^2))
rmse
What am I doing wrong here? I hope somebody can help me!
Dataset

You're using model accuracy to evaluate the model. Accuracy is used for classification problems but your response variable is continuous. You should use RMSE.
pred <- predict(radial.svm, newdata=df.test, type='response')
rmse <- sqrt(mean((pred - df.test$GSPC.Close)^2))
rmse
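A minimal sketch of that evaluation on synthetic data. The column names mirror the question, but the values are simulated, the `day` column is an assumption, and lm() stands in for svm() so that the snippet runs without e1071. Comparing the model's RMSE with that of a naive mean predictor gives the number some context.

```r
# Synthetic stand-in for the question's data (values are simulated, not real).
set.seed(42)
n <- 200
toy <- data.frame(
  SentimentGI = rnorm(n, mean = 0, sd = 0.05),
  day         = seq_len(n)
)
toy$GSPC.Close <- 2000 + 2 * toy$day + rnorm(n, sd = 20)

# 70/30 split, as in the question.
idx      <- sample(nrow(toy), round(0.70 * nrow(toy)))
df.train <- toy[idx, ]
df.test  <- toy[-idx, ]

# Root mean squared error between predictions and observed values.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# lm() is only a placeholder for the regression SVM here.
fit  <- lm(GSPC.Close ~ SentimentGI + day, data = df.train)
pred <- predict(fit, newdata = df.test)

# Compare against always predicting the training mean.
model_rmse    <- rmse(pred, df.test$GSPC.Close)
baseline_rmse <- rmse(mean(df.train$GSPC.Close), df.test$GSPC.Close)
c(model = model_rmse, baseline = baseline_rmse)
```

An RMSE is only meaningful relative to the scale of the response, which is why the baseline is reported next to it.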
Continuation from comments:
The first plot shows GSPC.Close against date (left) and the second shows SentimentGI against date (right). Notice that stock prices generally increase over time, whereas sentiment has a slope of 0 over that same time frame. What does that tell you?

Related

How to make predictions using an LDA (Linear discriminant analysis) model in R

As the title suggests, I am trying to make predictions using an LDA model in R. I have two sets of data that I'm working with: the first set is a series of entries associated with 16 predictor variables and 1 outcome variable (the outcome variable is the "group" that each entry belongs to, which I've assigned myself); the second set also consists of entries associated with the same 16 predictor variables, but with no outcome variable. What I would like to do is predict the group membership of the entries in the second set of data.
So far I've successfully managed to create an LDA model by separating the first dataset into a "training set" and a "test set". However, now that I have the model I don't know how I would go about predicting the group membership of the entries in my second data set.
Thanks for the help! Please let me know if any more information is required, this is my first post on stack overflow so I am still learning the ropes.
Short example based on An Introduction to Statistical Learning, chapter 4. Say you have fitted a model lda_model on a training_set, with dependent variable Group, which you aim to predict, and predictors Predictor1 and Predictor2:
library(MASS)
lda_model <- lda(Group ~ Predictor1 + Predictor2, data = training_set)
You can then make predictions with the lda_model using the predict function on the testing_set
lda_predictions <- predict(lda_model, testing_set)
lda_predictions then holds the posterior probabilities in $posterior that the observation is part of Group j.
You could then apply a threshold of, for instance (but not limited to), 50% probability. E.g.
sum(lda_predictions$posterior[, 7] >= .5)
returns the number of observations for which the probability that the observation is part of Group 7 is at least 50%.
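A minimal sketch of the whole workflow, including prediction on data that has no outcome column. The iris data and the arbitrary 120/30 split are stand-ins for the two datasets in the question, and the object names `labeled` and `unlabeled` are hypothetical.

```r
library(MASS)  # ships with standard R installations

# Labeled data (first dataset) and unlabeled data (second dataset).
# iris is only a stand-in; the split below is arbitrary.
set.seed(1)
holdout   <- sample(nrow(iris), 30)
labeled   <- iris[-holdout, ]
unlabeled <- iris[holdout, names(iris) != "Species"]  # no outcome column

lda_model <- lda(Species ~ ., data = labeled)

# predict() only needs the predictor columns in newdata.
lda_predictions <- predict(lda_model, newdata = unlabeled)

head(lda_predictions$class)      # predicted group per entry
head(lda_predictions$posterior)  # posterior probability per group

# Number of entries whose posterior probability for "virginica" is >= 0.5.
sum(lda_predictions$posterior[, "virginica"] >= 0.5)
```

The `$class` element already holds the group with the highest posterior probability, so a manual threshold is only needed when a different decision rule is wanted.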

How to plot multi-level meta-analysis by study (in contrast to the subgroup)?

I am doing a multi-level meta-analysis. Many studies have several subgroups. When I make a forest plot, the estimates are shown per subgroup, so there are 60 rows. I would rather plot one row per study, which would give 25 rows and be more appropriate. Does anyone have an idea how to make this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | Author/Study,
                     test = "t",
                     method = "REML")
forest(full.model)
It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the type as you have with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence (note that this makes use of aggregate.escalc() which only works with escalc objects, so if it is not, you need to convert the dataset to one - see help(aggregate.escalc) for details):
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
You will find that if you then fit an equal-effects model to these estimates based on the aggregated data that the results are identical to what you obtained from the multilevel model (we use an equal-effects model since the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
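For intuition, aggregating correlated estimates within one cluster can be written as a generalized least squares weighted average: with estimates y and var-cov matrix V, the aggregate is (1'V^-1 1)^-1 1'V^-1 y and its variance is (1'V^-1 1)^-1. The base R sketch below illustrates that formula on made-up numbers; it is not a replacement for aggregate(), the helper name `gls_aggregate` is hypothetical, and it is my assumption that this is the kind of weighting applied when V is supplied.

```r
# Hypothetical helper: GLS-weighted average of correlated estimates.
gls_aggregate <- function(y, V) {
  W     <- solve(V)                 # inverse of the var-cov matrix
  one   <- rep(1, length(y))
  v_agg <- 1 / drop(t(one) %*% W %*% one)
  y_agg <- v_agg * drop(t(one) %*% W %*% y)
  c(yi = y_agg, vi = v_agg)
}

# Made-up estimates from one cluster with a shared covariance of 0.02.
y <- c(0.10, 0.25, 0.40)
V <- diag(c(0.04, 0.05, 0.06)) + 0.02

gls_aggregate(y, V)

# With a diagonal V this reduces to the usual inverse-variance weighted mean.
v <- c(0.04, 0.05, 0.06)
gls_aggregate(y, diag(v))
weighted.mean(y, 1 / v)
```

The off-diagonal entries of V are what make the aggregated variance larger than it would be if the estimates were treated as independent.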
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level and I will mess a bit with the data so that all three variance components are non-zero (again, just for illustration purposes):
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and using again vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
Now the model coefficients of this model are the aggregated values and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains in these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that and the plot is just a visualization of the aggregated estimates and the CIs shown are correct.
You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5) # example
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | study_group,
                     test = "t",
                     method = "REML")
forest(full.model)

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset, the confusion matrices are very good for what I need, and seven years of data have already been combined. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm() function and your model name is CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for it. The output of your model when you apply it to new data, or even to the same data used to fit the model, is probabilities. These are values between zero and one. In the development phase of your model, you need to determine the cutoff threshold for these probabilities, which you can then use when you predict on new data. For example, you may choose 0.5 as a cutoff, so that every probability above it is considered NOTCONTINUE and every probability below it is CONTINUE. However, the best threshold can also be determined from your data, for example by picking the point on the receiver operating characteristic (ROC) curve that maximizes the sum of sensitivity and specificity; the area under that curve (AUC) summarizes how well the model separates the two classes. There are many packages that can do this for you, such as the pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
1. Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc_object, "best", ret="threshold", transpose = FALSE)
2. Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
3. Now, the content of Predict2020 is just probabilities for each student. Use the cutoff you obtained from step (1) to classify your students accordingly
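The three steps can be sketched end to end in base R. Everything below is simulated: the predictors `attendance` and `grade` are hypothetical, and the cutoff is found with a hand-written search for the threshold that maximizes sensitivity + specificity, which is similar in spirit to (but not the same code as) the "best" option in pROC.

```r
# Simulated training data; only NOTCONTINUE follows the question's naming.
set.seed(123)
n <- 500
train <- data.frame(attendance = runif(n), grade = rnorm(n))
p <- plogis(0.5 - 2 * train$attendance - 0.8 * train$grade)
train$NOTCONTINUE <- rbinom(n, 1, p)

CombinedYears <- glm(NOTCONTINUE ~ attendance + grade,
                     data = train, family = binomial)

# Step 1: pick the cutoff that maximizes sensitivity + specificity - 1.
probs_train <- fitted(CombinedYears)
cands <- sort(unique(probs_train))
youden <- sapply(cands, function(th) {
  pred <- probs_train >= th
  sens <- mean(pred[train$NOTCONTINUE == 1])   # true positive rate
  spec <- mean(!pred[train$NOTCONTINUE == 0])  # true negative rate
  sens + spec - 1
})
cutoff <- cands[which.max(youden)]

# Step 2: predict on new data that has no NOTCONTINUE column.
Data2020 <- data.frame(attendance = runif(50), grade = rnorm(50))
Predict2020 <- predict(CombinedYears, newdata = Data2020, type = "response")

# Step 3: turn probabilities into classes (1 = NOTCONTINUE, 0 = CONTINUE).
Data2020$PredictedClass <- as.integer(Predict2020 >= cutoff)
table(Data2020$PredictedClass)
```

Note that the new data only needs the predictor columns, so there is no need to add an empty NOTCONTINUE column before calling predict().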

How can I extract coefficients from this model in caret?

I'm using the caret package with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses mdl$bestTune number of variables? If this can't be done are there functions in other packages you would recommend that allow for loocv of a stepwise linear regression and actually allow me to find the final model?
Below is reproducible code. From it, I can tell from mdl$bestTune that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel) but I'm not sure how I would do this in a general case and not just this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns for hundreds of mutual funds. Each fund's returns will be a dependent variable that I'd like to regress against the returns on a handful (e.g. 5) of factors. For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any fund.
You can use:
coef(mdl$finalModel,unlist(mdl$bestTune))
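If caret and leaps are not a hard requirement, one base R alternative is sketched below. It is a different technique from leapSeq: instead of a sequential search it fits every subset of the five predictors (31 models) and scores each with leave-one-out RMSE, computed via the standard identity for linear models that the LOOCV residual equals the ordinary residual divided by (1 - leverage). The helper name `loocv_rmse` is hypothetical; the data generation is copied from the question.

```r
set.seed(101)
x <- matrix(rnorm(36 * 5), nrow = 36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2 * x[, 1] + 0.3 * x[, 3] + 0.5 * x[, 4] + rnorm(36) * 0.0001
dat <- data.frame(y = y, x)

# Leave-one-out RMSE of an lm fit without refitting the model n times.
loocv_rmse <- function(fit) {
  sqrt(mean((residuals(fit) / (1 - hatvalues(fit)))^2))
}

# All non-empty subsets of the predictor names.
vars <- colnames(x)
subsets <- unlist(
  lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)

# Score every subset and keep the one with the lowest LOOCV RMSE.
scores <- sapply(subsets, function(s) {
  loocv_rmse(lm(reformulate(s, response = "y"), data = dat))
})
best_vars <- subsets[[which.min(scores)]]
best_fit  <- lm(reformulate(best_vars, response = "y"), data = dat)

best_vars
coef(best_fit)
```

An exhaustive search like this is only practical for a handful of candidate factors, which matches the five-factor setting described in the question.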

Predicted(?) values from an lmer model

I have a data frame of bird counts. I have each participant's ID number, the number of birds they counted, the year they counted them, their lat and long coordinates, and their effort. I have made this model:
model = lmer(count~year+lat+long+effort+(1|participant), data = df)
I now want the model to plot predicted values from that same data set. So, that data was for 1997-2017, and I want the model to give me predicted values for each year. I want to plot these, so the final plot will have the predicted count on the y-axis, and the year (categorical) on the x-axis. Each year will have one data point w/ a confidence interval.
I have tried figuring out predict(), but I'm not quite sure how to use that to get what I want. It seems to need a new data frame, but I don't have a new data set to run through the model to predict a future count. I want the model to go back and work on the previous data that I put into it already, based off of the Beta values in the output of summary(model).
I found this thread, and it seems to be basically what I'm looking to do, but I can't get the sjPlot dependencies to download, sjlabelled throws an error every time: How to plot predicted values with standard errors for lmer model results?
You could try the ggeffects-package, which will be used in the forthcoming sjPlot-update to plot predicted values.
library(ggeffects)
dat <- ggpredict(model, terms = "year")
plot(dat)
If you're missing dependencies, try:
install.packages(
c("sjlabelled", "sjmisc", "sjstats", "ggeffects", "sjPlot"),
dependencies = TRUE
)
You may even want to install ggeffects from GitHub, since the current dev-version has some fixes and improvements for mixed models.
devtools::install_github("strengejacke/ggeffects")
I found the package I was looking for. It's called predictmeans and has a function where you put in the model and the model term you want predictions for: predictmeans(model, model term). It works perfectly!
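On the point in the question that predict() "seems to need a new data frame": that data frame does not have to be new observations. It can be a small grid built from the original data, with one row per year and the other covariates held at their means. The sketch below shows the idea on simulated counts, with lm() standing in for the fixed-effects part of the lmer model so that it runs without lme4. With a real lmer fit, the analogous call would be predict(model, newdata = newdat, re.form = NA), and confidence intervals may need extra work or a helper package such as ggeffects.

```r
# Simulated bird-count data (made-up values; column names follow the question).
set.seed(7)
df <- data.frame(
  year   = factor(rep(1997:2017, each = 20)),
  lat    = runif(420, 40, 45),
  long   = runif(420, -80, -70),
  effort = runif(420, 1, 5)
)
df$count <- 10 + as.integer(df$year) * 0.3 + 2 * df$effort + rnorm(420)

# lm() is a stand-in for the fixed-effects part of the lmer model.
model <- lm(count ~ year + lat + long + effort, data = df)

# One row per year, other covariates held at their sample means.
newdat <- data.frame(
  year   = factor(levels(df$year), levels = levels(df$year)),
  lat    = mean(df$lat),
  long   = mean(df$long),
  effort = mean(df$effort)
)

# Predicted count per year with a 95% confidence interval.
pred   <- predict(model, newdata = newdat, interval = "confidence")
result <- cbind(newdat["year"], pred)  # columns: year, fit, lwr, upr
head(result)
```

The `result` data frame has one row per year and can be passed to any plotting function that draws points with error bars.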
