Use glm to predict on fresh data - r

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset, the confusion matrices are very good for what I need, and seven years of data have been combined to fit it. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this structure:
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance

You mentioned that you already created a model to predict whether a particular student will continue studies or not. You used the glm function and your model name is CombinedYears.
Your problem is a binary classification, and you used logistic regression for it. When you apply the model to new data, or even to the data used to fit it, the output is a set of probabilities: values between zero and one. While developing the model, you need to determine a cutoff threshold for these probabilities, which you can then reuse when you predict on new data. For example, you may choose 0.5 as the cutoff, so that every probability above it is classified as NOTCONTINUE and every probability below it as CONTINUE. However, a better threshold can be determined from your data by balancing sensitivity and specificity along the receiver operating characteristic (ROC) curve, for example by picking the point that maximizes their sum. The area under that curve (AUC) summarizes how well the model discriminates overall, but the cutoff itself is read off the curve. There are many packages that can do this for you, such as the pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
(1) Determine the cutoff threshold from the ROC curve of the fitted model:
library(pROC)
# ROC curve from the observed outcome and the fitted probabilities
roc_object <- roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
# "best" picks the threshold that maximizes sensitivity + specificity by default
coords(roc_object, "best", ret = "threshold", transpose = FALSE)
(2) Use your model to predict on the new year's data (as you did):
Predict2020 <- predict(CombinedYears, newdata = data.frame(Data2020), type = "response")
(3) Predict2020 now contains just a probability for each student. Use the cutoff you obtained in step (1) to classify your students accordingly.
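As a rough end-to-end sketch of these steps, the following uses simulated data and base R only. The data frames `train` and `new_data`, the predictors `x1` and `x2`, and the hand-rolled search over thresholds are all made up for illustration. The search maximizes sensitivity + specificity, which is what `coords(..., "best")` in pROC does by default. Note that the new data has no outcome column at all, since `predict()` only needs the predictors.

```r
set.seed(42)

# Hypothetical training data: two predictors and a 0/1 outcome
n <- 500
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$NOTCONTINUE <- rbinom(n, 1, plogis(-1 + 1.5 * train$x1 - 0.8 * train$x2))

fit <- glm(NOTCONTINUE ~ x1 + x2, data = train, family = binomial)

# Step 1: choose the cutoff on the training data by maximizing
# sensitivity + specificity - 1 over all observed fitted probabilities
p_train <- fitted(fit)
candidates <- sort(unique(p_train))
youden <- sapply(candidates, function(t) {
  predicted_one <- p_train > t
  sens <- mean(predicted_one[train$NOTCONTINUE == 1])
  spec <- mean(!predicted_one[train$NOTCONTINUE == 0])
  sens + spec - 1
})
cutoff <- candidates[which.max(youden)]

# Step 2: predict probabilities for new data (no NOTCONTINUE column needed)
new_data <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
p_new <- predict(fit, newdata = new_data, type = "response")

# Step 3: turn probabilities into 0/1 classes with the stored cutoff
class_new <- ifelse(p_new > cutoff, 1L, 0L)
print(cutoff)
print(table(class_new))
```

If the chosen cutoff turns out well below 0.5, new-data probabilities that all sit under 0.5 can still produce a mix of both classes.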

Related

Is there a way to change threshold of a classification within a model in caret R?

I would like to change the threshold of the model and have come across posts like the Cross Validated thread How to change threshold for classification in R randomForests?
If I change the threshold post creating a model that means I will again have to tweak things for test data or new data.
Is there a way in R & caret to change the threshold within the model so that I can run the same model with same threshold value on new data or test data as well?
In probabilistic classifiers, such as Random Forests, no threshold is involved when fitting the model, nor is any threshold associated with the fitted model; hence, there is actually nothing to change. As correctly pointed out in the CV thread Reduce Classification Probability Threshold:
Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Quoting from my own answer in Change threshold value for Random Forest classifier :
There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which indeed require a threshold, are neither produced nor used in any stage of the model training - only during prediction, and even then only in the cases we indeed require a hard classification (not always the case). Please see Predict classes or class probabilities? for more details.
So, if you produce predictions from a fitted model, say rf, with the argument type = "prob", as shown in the CV thread you have linked to:
pred <- predict(rf, mydata, type = "prob")
these predictions will be probability values in [0, 1], and not hard classes 0/1. From here, you are free to choose the threshold as shown in the answer there, i.e.:
thresh <- 0.6 # any desired value in [0, 1]
# pred holds one column of probabilities per class; "1" is assumed here
# to be the label of the positive class
class_pred <- ifelse(pred[, "1"] > thresh, 1, 0)
or of course experiment with different values of threshold without needing to change anything in the model itself.
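If the goal is simply to carry the same threshold along to test or new data, one option is to store it next to the fitted model and always predict through a small helper. The sketch below is not a caret feature. `bundle_threshold()` and `predict_class()` are hypothetical helper names, and a logistic `glm` on the built-in `mtcars` data stands in for the random forest so the example needs no extra packages.

```r
# Hypothetical helper: keep a fitted model and its decision threshold together
bundle_threshold <- function(model, threshold) {
  list(model = model, threshold = threshold)
}

# Hypothetical helper: predict hard 0/1 classes using the stored threshold
predict_class <- function(bundle, newdata) {
  p <- predict(bundle$model, newdata = newdata, type = "response")
  as.integer(p > bundle$threshold)
}

# Stand-in probabilistic classifier (logistic regression, not a random forest)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
clf <- bundle_threshold(fit, threshold = 0.6)

# The same threshold is applied to the training data and to new data
train_classes <- predict_class(clf, mtcars)
new_classes <- predict_class(clf, data.frame(wt = c(2.0, 3.5)))
print(table(train_classes))
print(new_classes)
```

The model itself is untouched, so trying another threshold only means building another bundle around the same fit.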

How to check and control for autocorrelation in a mixed effect model of longitudinal data?

I have behavioral data for many groups of birds over 10 days of observation. I wanted to investigate whether there is a temporal pattern in some behaviors (e.g. does mate competition increase over time?) And I was told that I had to account for the autocorrelation of the data, since behavior is unlikely to be independent in each day.
However I was wondering about two things:
Since I'm not interested in the differences in y among days but the trend of y over days, do I still need to correct for autocorrelation?
If yes, how do I control for the autocorrelation so that I'm left with only the signal (and noise, of course)?
For the second question, keep in mind I will be analyzing the effect of time on behavior using mixed models in R (since there are random effects such as pseudo-replication), but I have not found any straightforward method of correcting for autocorrelation in the data when modeling the responses.
(1) Yes, you should check for/account for autocorrelation.
The first example here shows an example of estimating trends in a mixed model while accounting for autocorrelation.
You can fit these models with lme from the nlme package. Here's a mixed model without autocorrelation included:
cmod_lme <- lme(GS.NEE ~ cYear,
data=mc2, method="REML",
random = ~ 1 + cYear | Site)
and you can explore the autocorrelation by using plot(ACF(cmod_lme)).
(2) Add a correlation structure to the model, something like this:
cmod_lme_acor <- update(cmod_lme,
correlation=corAR1(form=~cYear|Site))
@JeffreyGirard notes that to check the ACF after updating the model to include the correlation argument, you will need to use plot(ACF(cmod_lme_acor, resType = "normalized")).
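Since the mc2 data is not shown here, the same workflow can be tried on the Ovary dataset that ships with nlme. The sketch follows the example in the corAR1 help page; the variable names `fm_indep`, `fm_ar1`, `comparison` and `phi` are just illustrative.

```r
library(nlme)  # recommended package, included in standard R installations

# Ovary is a grouped dataset in nlme: follicle counts per mare over time.
# Mixed model assuming independent within-mare errors
fm_indep <- lme(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time),
                data = Ovary, random = pdDiag(~ sin(2 * pi * Time)))

# Same model with AR(1) errors within each mare
fm_ar1 <- update(fm_indep, correlation = corAR1())

# Compare the two fits (AIC, BIC, likelihood-ratio test)
comparison <- anova(fm_indep, fm_ar1)
print(comparison)

# Estimated AR(1) parameter on its natural scale
phi <- coef(fm_ar1$modelStruct$corStruct, unconstrained = FALSE)
print(phi)

# Visual check of the remaining autocorrelation:
# plot(ACF(fm_ar1, resType = "normalized"))
```

If the AR(1) fit is clearly preferred and the normalized-residual ACF looks flat, the trend estimates from that model already account for the serial dependence.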

Evaluate forecasts in terms of p-value and Pearson correlation

I am using R to evaluate two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value using the function cor.(). The graph below shows the final result of the correlation coefficient and its p-value.
We suggest that the model with the lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or the one with the higher correlation coefficient but a fairly high corresponding p-value).
So, in this case, overall, we would say that model1 is better than model2.
But the question here is: is there any other specific statistical method to quantify the comparison?
Thanks a lot !!!
I'm assuming you're working with time series data, since you called it a "forecast". I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constraint
# inc.mean: flag for constant term of the model.
Backtesting allows you to see how well your models perform on past data, and Tsay's backtest.R provides RMSE and mean absolute error, which will give you another perspective outside of correlation. Caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The RMSE values of two models evaluated on the same data are directly comparable, and smaller is better.
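As a small illustration of comparing two models by RMSE and MAE on held-out data, here is a sketch using the built-in LakeHuron series and base R's arima(). This is a single train/holdout split, not Tsay's rolling backtest, and the two AR models are arbitrary stand-ins for your model1 and model2.

```r
# Hold out the last 12 years of the series for evaluation
train <- window(LakeHuron, end = 1960)
test  <- window(LakeHuron, start = 1961)
h <- length(test)

# Two candidate models fitted on the training part only
m1 <- arima(train, order = c(1, 0, 0))
m2 <- arima(train, order = c(2, 0, 0))

# Forecast h steps ahead from the end of the training data
f1 <- predict(m1, n.ahead = h)$pred
f2 <- predict(m2, n.ahead = h)$pred

# Error measures: smaller is better for both
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))
mae  <- function(actual, forecast) mean(abs(actual - forecast))

scores <- data.frame(
  model = c("AR(1)", "AR(2)"),
  RMSE  = c(rmse(test, f1), rmse(test, f2)),
  MAE   = c(mae(test, f1), mae(test, f2))
)
print(scores)
```

Both measures are in the units of the series itself, which makes them easier to interpret than a correlation coefficient.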
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
As a side note, your interpretation of a p-value as "lower is better" is not quite right.
P-values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis.

Forecast future values for a time series using support vector machine

I am using support vector regression in R to forecast future values for a univariate time series. Splitting the historical data into test and train sets, I find a model by applying the svm function in R to the test data and then use the predict() command with the train data to predict values for the train set. We can then compute prediction errors. I wonder what happens then? We have a model, and by checking the model on the train data we see that the model is efficient. How can I use this model to predict future values outside of the train data? Generally speaking, we use the predict function in R and give it a forecast horizon (h=12) to predict 12 future values. Based on what I saw, the predict() command for SVM does not have such an option and needs a train dataset. How should I build a train dataset for predicting future data which is not in our historical dataset?
Thanks
Just a stab in the dark... SVMs are most often used for supervised classification, although a regression variant exists as well. I am guessing you are trying to predict stock values, no? How about classifying your existing data, using some window size of your choice, say 100 values at a time, as noise (N), up (U), big up (UU), down (D), or big down (DD)? In this way, as your data comes in, you slide your classification frame and get it to tell you whether the upcoming trend is N, U, UU, D, or DD.
What you can do is build a data frame with columns representing the actual stock price and its n lagged values, and use it as a train set/test set (the actual value is the output and the previous values are the explanatory variables). With this method you can make a forecast one day (or whatever the granularity is) into the future, then use that prediction to make the next one, and so on.
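The lagged data frame and the "feed the prediction back in" loop could look roughly like the sketch below. To stay free of extra packages it fits a plain linear model with `lm()` instead of an SVM; the assumption is that a support vector regression (for example from the e1071 package) could be dropped in at the marked line, since only the fitting call would change. The series (built-in LakeHuron), the number of lags and the horizon are arbitrary choices.

```r
y <- as.numeric(LakeHuron)
n_lags <- 3   # how many previous values are used as predictors
h <- 12       # forecast horizon

# embed() puts y_t in column 1 and y_{t-1}, ..., y_{t-n_lags} in the rest
lagged <- as.data.frame(embed(y, n_lags + 1))
names(lagged) <- c("y", paste0("lag", seq_len(n_lags)))

# Stand-in learner; an SVM regression fit would replace this line
fit <- lm(y ~ ., data = lagged)

# Recursive forecasting: each prediction becomes an input for the next step
history <- y
forecasts <- numeric(h)
for (i in seq_len(h)) {
  newest_first <- rev(tail(history, n_lags))   # lag1 is the most recent value
  new_row <- as.data.frame(t(newest_first))
  names(new_row) <- paste0("lag", seq_len(n_lags))
  forecasts[i] <- predict(fit, newdata = new_row)
  history <- c(history, forecasts[i])
}
print(forecasts)
```

Errors accumulate as predictions are fed back in, so the later steps of the horizon are usually less reliable than the first one.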

Pseudo R squared for cumulative link function

I have an ordinal dependent variable and am trying to use a number of independent variables to predict it. I use R. The function I use is clm in the ordinal package, which fits a cumulative link model, here with a probit link.
I tried the function pR2 in the package pscl to get the pseudo R squared with no success.
How do I get pseudo R squareds with the clm function?
Thanks so much for your help.
There are a variety of pseudo-R^2. I don't like to use any of them because I do not see the results as having a meaning in the real world. They do not estimate effect sizes of any sort and they are not particularly good for statistical inference. Furthermore in situations like this with multiple observations per entity, I think it is debatable which value for "n" (the number of subjects) or the degrees of freedom is appropriate. Some people use McFadden's R^2 which would be relatively easy to calculate, since clm generated a list with one of its values named "logLik". You just need to know that the logLikelihood is only a multiplicative constant (-2) away from the deviance. If one had the model in the first example:
library(ordinal)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
fm0 <- clm(rating ~ 1, data = wine)
( McF.pR2 <- 1 - fm1$logLik/fm0$logLik )
[1] 0.1668244
I had seen this question on CrossValidated and was hoping to see the more statistically sophisticated participants over there take this one on, but they saw it as a programming question and dumped it over here. Perhaps their opinion of R^2 as a worthwhile measure is as low as mine?
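The McFadden calculation above can be wrapped in a small function that relies only on the generic logLik(). In this sketch, `mcfadden_r2()` is a made-up helper name, and the demonstration uses a logistic `glm` on the built-in `mtcars` data so it runs without the ordinal package. It is an assumption here that passing clm fits works the same way through their logLik method.

```r
# Hypothetical helper: McFadden's pseudo R^2 from a fitted model and its
# intercept-only counterpart, using the generic logLik()
mcfadden_r2 <- function(fit, null_fit) {
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))
}

# Demonstration with a binary logistic regression on built-in data
full_fit <- glm(am ~ wt, data = mtcars, family = binomial)
null_fit <- glm(am ~ 1, data = mtcars, family = binomial)

r2 <- mcfadden_r2(full_fit, null_fit)
print(r2)

# For 0/1 outcomes the deviance is -2 * logLik, so this gives the same number
print(1 - full_fit$deviance / full_fit$null.deviance)
```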
I recommend using the nagelkerke function from the rcompanion package to get pseudo R-squared values.
When your predictor or outcome variables are categorical or ordinal, the R-squared will typically be lower than with truly numeric data. R-squared is only a very weak indicator of a model's fit, and you can't choose a model based on it alone.

Resources