Predicted vs Actual values of GBM in H2O - r

I am trying to calculate Gini for my regression models. Since there is no Gini index for regression models, I am retrieving all the scores and calculating it with Gini functions in R, using this code:
preds <- h2o.predict(model, test)
pred_vs_actual <- as.data.frame(h2o.cbind(test$target, preds))
Does this code return correctly matched pairs of actual and predicted values? I know that rows in a Spark table have no guaranteed order, but I am not sure whether the same applies to an H2O object.

Yes, what you have (pred_vs_actual) binds your model's predictions to the corresponding rows (records) of test. As a quick check, look at the first few rows of pred_vs_actual to verify that the cbind does what you expect.
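As a rough illustration of the quick check and the Gini step, here is a minimal base-R sketch. It assumes pred_vs_actual has already been converted to an ordinary data frame with columns named target and predict; the data are simulated, and the gini and normalized_gini helpers are hypothetical functions written for this example, not part of H2O.

```r
# Stand-in for as.data.frame(h2o.cbind(test$target, preds)); values are simulated.
set.seed(42)
target <- runif(100, 0, 10)
pred_vs_actual <- data.frame(target = target,
                             predict = target + rnorm(100))

# Quick look at the first few actual/predicted pairs
head(pred_vs_actual)

# Hypothetical helper: Gini of `actual` when rows are ranked by `pred`
gini <- function(actual, pred) {
  ord <- order(pred, decreasing = TRUE)
  n <- length(actual)
  lorenz <- cumsum(actual[ord]) / sum(actual)
  sum(lorenz - seq_len(n) / n) / n
}

# Normalized so that a perfect ranking scores 1
normalized_gini <- function(actual, pred) {
  gini(actual, pred) / gini(actual, actual)
}

g <- normalized_gini(pred_vs_actual$target, pred_vs_actual$predict)
```

If the pairs were misaligned, the normalized Gini would be expected to drop toward zero, which makes it a useful sanity check alongside head().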

Related

Forecast ARIMA using a different training set

I estimate an ARIMA model on a training dataset using the auto.arima function in R. Afterwards I use the forecast function to make, say, 50 predictions and calculate accuracy measures such as RMSE and MAE.
When I use the forecast function, it uses only the observations in the training set and then makes each prediction at time t from the values predicted at time t-1. What I am trying to do is make one prediction at a time, adding one observed value to the training set at each time t, without re-estimating the ARIMA model. So instead of the values predicted at time t-1, I would use the real values. For example, if the ARIMA model was estimated on a training set of 100 observations, the first forecast uses the training set of length 100, the second uses a training set of length 101, the third a training set of length 102, and so on.
The auto.arima output contains the series "x", which is the training set used to estimate the model, and the series "fitted", which contains the fitted values. It also has the element "nobs", which is the length of "x". I tried replacing the "x" element of the auto.arima output with a new training set whose last observations are the true values, added one at a time, and I also updated "nobs" to the length of the new "x". But the one-step-ahead forecast always seems to use the old training set. For instance, I added one observed value at a time to the training set and made the one-step-ahead prediction 50 times, but all the predictions are equal to the first one, as if the forecast function ignores the fact that I replaced the "x" series inside the auto.arima output. Replacing the "fitted" values gave the same result.
Does someone know exactly how the forecast function uses the training set when making predictions? What should I modify inside the auto.arima output at each time t to get one-step-ahead predictions based on the real values at previous times instead of the estimated ones? Or is there a way to tell the forecast function to use a different training dataset?
I don't want to refit the ARIMA model on the new training dataset (using the Arima function) and re-estimate the residual variance; it takes literally forever...
Any suggestion would be helpful
Thank you in advance
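One possible approach, sketched here under assumptions, is to re-apply the already estimated coefficients to the extended series instead of editing the fitted object. The forecast package's Arima() has a model argument intended for applying a previous fit to new data without re-estimating parameters. The sketch below uses base R's arima() with the fixed argument instead so that it runs without extra packages; the simulated series and the AR(1) order are assumptions made for illustration.

```r
set.seed(123)
y <- as.numeric(arima.sim(list(ar = 0.6), n = 150))  # simulated series
n_train <- 100
h <- 50

# Estimate once on the training window (order assumed known here)
fit <- arima(y[1:n_train], order = c(1, 0, 0))

one_step <- numeric(h)
for (i in seq_len(h)) {
  window <- y[1:(n_train + i - 1)]
  # Re-apply the estimated coefficients; all are fixed, so nothing is optimized
  refit <- arima(window, order = c(1, 0, 0),
                 fixed = coef(fit), transform.pars = FALSE)
  one_step[i] <- predict(refit, n.ahead = 1)$pred
}

# Accuracy of the one-step-ahead forecasts against the held-out values
actual <- y[(n_train + 1):(n_train + h)]
rmse <- sqrt(mean((actual - one_step)^2))
mae  <- mean(abs(actual - one_step))
```

Because every coefficient is fixed, each call only runs the filter over the longer series, which should be much cheaper than a full re-estimation.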

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset, the confusion matrices are very good for what I need, and seven years of data have been combined. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this structure:
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you have already built a model to predict whether a particular student will continue their studies. You used the glm function, and your model is named CombinedYears.
Now, what you have to know is that your problem is a binary classification, and you used logistic regression for it. When you apply the model to new data, or even to the same data used to fit it, the output is a set of probabilities: values between zero and one. In the development phase of your model, you need to determine a cutoff threshold for these probabilities, which you then use when you predict on new data. For example, you may choose 0.5 as the cutoff, so that every probability above it is classified as NOTCONTINUE and every probability below it as CONTINUE. However, a better threshold can be determined from your data, for example by choosing the point on the receiver operating characteristic (ROC) curve that jointly maximizes sensitivity and specificity. There are many packages that can compute the ROC curve and its area (AUC) for you, such as the pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
1. Determine the cutoff threshold from the ROC curve:
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc_object, "best", ret = "threshold", transpose = FALSE)
2. Use your model to predict on your new year's data (as you did):
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
3. The content of Predict2020 is just a probability for each student. Use the cutoff you obtained from step (1) to classify your students accordingly.
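The last step can be sketched in base R as follows. This is only an illustration: the data are simulated, the predictor names gpa and absences are hypothetical, and the cutoff of 0.3 is a placeholder for whatever value the ROC analysis returns.

```r
set.seed(7)
# Simulated stand-in for the combined historical data
n <- 500
hist_data <- data.frame(gpa = rnorm(n, 3, 0.5), absences = rpois(n, 5))
lin <- -1 - 1.2 * (hist_data$gpa - 3) + 0.25 * (hist_data$absences - 5)
hist_data$NOTCONTINUE <- rbinom(n, 1, plogis(lin))

CombinedYears <- glm(NOTCONTINUE ~ gpa + absences,
                     data = hist_data, family = binomial)

# New data holds the predictors only; no NOTCONTINUE column is needed
Data2020 <- data.frame(gpa = rnorm(50, 3, 0.5), absences = rpois(50, 5))
Predict2020 <- predict(CombinedYears, newdata = Data2020, type = "response")

cutoff <- 0.3  # placeholder; in practice taken from the ROC analysis
Class2020 <- ifelse(Predict2020 > cutoff, 1, 0)  # 1 = NOTCONTINUE, 0 = CONTINUE
table(Class2020)
```

Seeing only probabilities below 0.5 is not necessarily a problem: when NOTCONTINUE is the rarer outcome, a cutoff below 0.5 may be the appropriate one.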

Using predict for linear model with NA values in R

I have a dataset of ~32,000 observations, for which I have fitted a linear model. ~12,000 observations were dropped due to missing values.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected], this gives the error 'replacement has 20000 rows, data has 32000'.
Is there any way I can use that model made on the 20,000 rows to predict that of the 32,000? I am happy to have 'zero' for observations that don't have results for every column used in the model.
If not, how can I at least subset the 32,000 dataset correctly such that it only includes the 20,000 whole observations? If my model was lm(a ~ x+y+Z, data=data), for example, how would I filter data to only include observations with full data in x, y and z?
The best thing to do is to use na.action=na.exclude when you fit the model in the first place: from ?na.exclude,
when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
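A minimal sketch of the padding behaviour, using a small simulated data frame in place of the real one (the column names a, x, y and Z follow the question; the values are made up):

```r
set.seed(1)
data <- data.frame(x = rnorm(20), y = rnorm(20), Z = rnorm(20))
data$a <- 1 + 2 * data$x - data$y + 0.5 * data$Z + rnorm(20, sd = 0.1)
data$x[c(3, 8)] <- NA   # introduce some missing predictors
data$Z[15] <- NA

# With na.exclude, incomplete rows are skipped when fitting
# but keep their place in fitted values and residuals
fit <- lm(a ~ x + y + Z, data = data, na.action = na.exclude)

data$fitted <- fitted(fit)  # same length as data, NA where a predictor was missing
```

The fitted values can then be assigned straight back into the original data frame without a length mismatch.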
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 rather than missing. For instance, if your variable x had a range of 10-100, the model would treat your imputed 0s as observations below the training data's range and give you artificially low (or high, depending on the coefficient's sign) predictions. If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (e.g. replace the NAs with the mean or the median, or use k-nearest neighbors).
Using
data[complete.cases(data),]
gives you only observations without NAs. Perhaps that's what you are looking for.
Another way is
na.omit(data)
which additionally records the indices of the removed observations (in the "na.action" attribute of the result).
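Both calls above drop a row if any column is missing. To keep every row that is complete in the model variables only, one option is to restrict complete.cases() to those columns. A small sketch with made-up data, where the notes column is a hypothetical variable not used in the model:

```r
data <- data.frame(a = c(1, 2, 3, 4, 5),
                   x = c(1, NA, 3, 4, 5),
                   y = c(2, 2, NA, 4, 5),
                   Z = c(3, 3, 3, 3, 3),
                   notes = c(NA, "a", "b", NA, "c"))  # not used in the model

# Only the variables that appear in lm(a ~ x + y + Z) matter here
model_vars <- c("a", "x", "y", "Z")
keep <- complete.cases(data[, model_vars])
data_complete <- data[keep, ]

# na.omit() drops rows with an NA in any column and records which ones
dropped <- attr(na.omit(data), "na.action")
```

Here data_complete keeps rows 1, 4 and 5 even though notes is missing in two of them, whereas na.omit(data) would keep row 5 only.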

Forecast future values for a time series using support vector machine

I am using support vector regression in R to forecast future values of a univariate time series. After splitting the historical data into training and test sets, I fit a model to the training data with the svm function and then use predict() with the test data to predict values for the test set. We can then compute the prediction errors. But what happens next? We have a model, and checking it on the test data shows that it performs well. How can I use this model to predict future values beyond the historical data? Generally speaking, we use the predict function in R and give it a forecast horizon (h=12) to predict 12 future values. From what I have seen, predict() for SVM has no such argument and needs a dataset to predict on. How should I build a dataset for predicting future values that are not in our historical data?
Thanks
Just a stab in the dark... SVMs are mostly used for supervised classification rather than numeric prediction (although support vector regression does exist). I am guessing you are trying to predict stock values, no? How about classifying your existing data, using a window size of your choice, say 100 values at a time, as noise (N), up (U), big up (UU), down (D), or big down (DD)? That way, as your data comes in, you slide your classification frame along and have it tell you whether the upcoming trend is N, U, UU, D, or DD.
What you can do is build a data frame with columns holding the actual stock price and its n lagged values, and use it as the train/test set (the actual value is the output and the previous values are the explanatory variables). With this method you can forecast one day (or whatever the granularity is) into the future, then use that prediction to make the next one, and so on.
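The lagged data frame and the recursive forecast can be sketched as follows. To keep the example runnable in base R, lm() is used as a stand-in learner; this is an assumption of the sketch, and an svm() fit from a package such as e1071 could be substituted at the marked line. The series is simulated and the choice of three lags is arbitrary.

```r
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 200))  # simulated series
p <- 3   # number of lags used as predictors (arbitrary choice)
h <- 12  # forecast horizon

# embed() puts y_t in column 1 and y_{t-1}, ..., y_{t-p} in the next columns
train <- as.data.frame(embed(y, p + 1))
names(train) <- c("y", paste0("lag", 1:p))

fit <- lm(y ~ ., data = train)  # stand-in learner; swap in an SVM here

# Recursive forecast: each prediction becomes a lag for the next step
history <- y
fc <- numeric(h)
for (i in seq_len(h)) {
  newrow <- as.data.frame(t(rev(tail(history, p))))  # lag1 = most recent value
  names(newrow) <- paste0("lag", 1:p)
  fc[i] <- predict(fit, newdata = newrow)
  history <- c(history, fc[i])
}
```

Errors accumulate as predictions are fed back in, so forecasts further out on the horizon are generally less reliable than the first few.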

What is the objective of model.matrix()?

I'm currently going through the 'Introduction to Statistical Learning' MOOC by Stanford OpenX. In one of the lab exercises, it suggests creating a model matrix from the test data by explicitly using model.matrix().
Extract from textbook
We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test.mat=model.matrix(Salary~.,data=Hitters[test,])
The model.matrix() function is used in many regression packages for
building an X matrix from data. Now we run a loop, and for each size i, we
extract the coefficients from regfit.best for the best model of that
size, multiply them into the appropriate columns of the test model
matrix to form the predictions, and compute the test MSE.
val.errors=rep(NA,19)
for(i in 1:19){
  coefi=coef(regfit.best,id=i)
  pred=test.mat[,names(coefi)]%*%coefi
  val.errors[i]=mean((Hitters$Salary[test]-pred)^2)
}
I understand that model.matrix() converts string (factor) variables into numeric dummy columns for their levels, and that models like lm() do this conversion under the hood.
However, in which situations would we call model.matrix() explicitly, and why?
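One situation where the explicit call is useful is when predictions have to be computed by hand from a coefficient vector, as in the lab extract above, where the coefficients come from regfit.best and are multiplied into the matching columns of the test matrix. The sketch below illustrates the same idea with lm() on the built-in mtcars data, chosen only because the result can be checked against predict(); it is not the lab's code.

```r
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Build the design matrix explicitly: intercept, wt, and dummy columns for cyl
X <- model.matrix(mpg ~ wt + factor(cyl), data = mtcars)
colnames(X)

# Manual prediction: select the columns named like the coefficients, then multiply
beta <- coef(fit)
pred_manual <- as.numeric(X[, names(beta)] %*% beta)

# Should agree with what predict() computes internally
all.equal(pred_manual, as.numeric(predict(fit)))
```

The same pattern applies whenever a fitting function returns coefficients but offers no predict() method, or expects a numeric matrix rather than a formula and data frame.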
