k-fold cross validation of prediction error using mgcv - r

I would like to evaluate the performance of a GAM at predicting novel data using five-fold cross-validation. Model training is based on a random subset of 80% of the data, with the remaining 20% held out as the test set. I can calculate the mean square prediction error between the training and test data, but I am uncertain how to implement this across k folds.
I have the following code for training and test datasets and to calculate MSPE. I have not included sample data, but can do so.
library(mgcv)

indexes <- sample(1:nrow(data), size = 0.2 * nrow(data))
testP  <- data[indexes, ]  # 20% test set
trainP <- data[-indexes, ] # 80% training set

# intercept-only formula used as a placeholder for the actual smooth terms
gam0 <- gam(x ~ 1, family = quasibinomial(link = 'logit'), data = trainP, gamma = 1.4)

pv <- predict(gam0, newdata = testP, type = "response")
diff  <- pv - testP$x  # predicted - observed
diff2 <- diff^2        # (predicted - observed)^2
mspegam0 <- mean(diff2)
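One way to extend this to five folds is to assign every row to a fold up front and loop over the folds, refitting the model on the other four folds each time and scoring on the held-out fold. A minimal sketch, assuming data is the full data frame and again using an intercept-only formula as a placeholder for the actual model terms:
library(mgcv)

k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment

mspe <- numeric(k)
for (i in 1:k) {
  trainP <- data[folds != i, ]  # ~80% of the rows
  testP  <- data[folds == i, ]  # ~20% of the rows

  # replace the right-hand side with the actual smooth terms
  gam_i <- gam(x ~ 1, family = quasibinomial(link = "logit"),
               data = trainP, gamma = 1.4)

  pv <- predict(gam_i, newdata = testP, type = "response")
  mspe[i] <- mean((pv - testP$x)^2)  # MSPE on the held-out fold
}

mspe        # per-fold mean square prediction error
mean(mspe)  # averaged over the five folds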

Related

Difference between fitted values and cross validation values from pls model in r

I have a small dataset of only 30 samples, so I have a training set but no separate test set. I therefore want to use cross-validation to assess the model. I have run pls models in R using cross-validation and LOO. The mvr output has fitted values and validation$pred values, and these are different. For the final R2 and RMSE on the training set, should I use the fitted values or the validation$pred values?
Short answer: if you want to know how good the model is at predicting, use the validation$pred values, because they are computed on data the model was not fitted to. The values under $fitted.values come from fitting the final model on all of your training data, meaning the same data is used both to build the model and to make the predictions. Values from this final fit will therefore overestimate how well your model performs on unseen data.
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to find which is the best hyperparameter, in this case number of components for the model.
During cross-validation, one part of the data is left out of the fit and serves as a test set. This provides a rough estimate of how the model will work on unseen data. See the cross-validation diagram in the scikit-learn documentation for how CV works.
LOO works in a similar way. After finding the best parameter, you refit a final model to be used on the test set. In this case, mvr evaluates each candidate number of components via cross-validation, but $fitted.values comes from a model trained on all of the training data.
You can also see below how different they are. First, I fit a model:
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx = sample(nrow(BostonHousing),400)
trainData = BostonHousing[idx,]
testData = BostonHousing[-idx,]
mdl <- mvr(medv ~ ., ncomp = 4, data = trainData,
           validation = "CV", method = "oscorespls")
Then we calculate the error in CV, on the full training model, and on the test data, using 4 components (note that the helper below actually returns the mean squared error, not the RMSE, despite its name):
calc_RMSE = function(pred, actual){ mean((pred - actual)^2) } # mean squared error
# error in CV
calc_RMSE(mdl$validation$pred[,,4],trainData$medv)
[1] 43.98548
# error on the full training model, not very useful
calc_RMSE(mdl$fitted.values[,,4],trainData$medv)
[1] 40.99985
# error on test data
calc_RMSE(predict(mdl,testData,ncomp=4),testData$medv)
[1] 42.14615
You can see that the cross-validation error is closer to the test-set error than the training error is. How close they are will, of course, depend on your data.
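If you also want to use the cross-validation results to choose the number of components rather than fixing it at 4, pls exposes the CV error directly. A short sketch using the mdl object fitted above (the one-sigma rule is just one possible selection heuristic):
# RMSEP for each number of components, estimated from cross-validation
RMSEP(mdl, estimate = "CV")
plot(RMSEP(mdl, estimate = "CV"))

# automated choice of ncomp based on the CV curve
selectNcomp(mdl, method = "onesigma", plot = TRUE)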

calculate MSE for a test set that's missing the response variable

I have a training set with a response variable ViolentCrimesPerPop, and I purposely fit a large regression tree with the following control settings:
library(rpart)
control1 <- rpart.control(minsplit = 2, cp = 1e-8, xval = 20)
train_control <- rpart(ViolentCrimesPerPop ~ ., data = train, method = 'anova', control = control1)
Then I use it to predict on the test set:
predict1 <- predict(train_control, newdata = test)
However, I'm not sure how to compute the mean squared error for the test set, because that requires the response variable ViolentCrimesPerPop, which is not given in the test set. Can someone give me a hint on how to approach this problem?
You can only compute the MSE if you know the ground truth.
If you don't know the test labels, the only option is to train your model on 70-80% of the training data and compute the MSE on the remaining 20-30%.
You won't be able to calculate the MSE for the test set if you don't know the ground truth (response variable). However, it may be that you were asked to split a dataset that does contain the ground truth into train and test sets; in that case, you can easily compute the MSE.
Are you working on a Kaggle competition that doesn't provide the response variable for the test set?
Regardless, try splitting your training set into new subsets: use one part for training and the rest to test your model. You cannot assess model performance without the response variable.
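As a rough sketch of that suggestion with rpart (the 80/20 split and the seed are arbitrary choices, and train is assumed to contain ViolentCrimesPerPop):
library(rpart)

set.seed(1)
idx <- sample(nrow(train), size = 0.8 * nrow(train))
sub_train <- train[idx, ]   # 80% used to grow the tree
sub_valid <- train[-idx, ]  # 20% held out, response known

control1 <- rpart.control(minsplit = 2, cp = 1e-8, xval = 20)
fit <- rpart(ViolentCrimesPerPop ~ ., data = sub_train,
             method = "anova", control = control1)

pred <- predict(fit, newdata = sub_valid)
mean((pred - sub_valid$ViolentCrimesPerPop)^2)  # MSE on the held-out 20%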

can a validated model be used to predict on the whole dataset?

We have been running gbm models on a dataset of about 15k rows. We implemented 10-fold cross-validation directly to come up with a cross-validated model, which we are then using to predict on the same dataset again.
This has probably resulted in overfitted models, with a training AUC of about 0.99 and a CV AUC of about 0.92. The AUC of the predictions on the full dataset is also very high, about 0.99.
Reviewers have asked us to validate the model with a holdout dataset.
We assume this means splitting the data into a holdout set and a training set, running k-fold cross-validation on the training data again, and then validating the model against the holdout set. My final question is: can we then use the validated model on the whole dataset for prediction?
You can... whether you should depends on what you are trying to portray.
Ideally you want to be able to show that your model generalises well to new data (the holdout) and compare that to how the model performs on the training data. If there is a large discrepancy in performance between the two, you have likely overfit the data.
I wouldn't see much point in predicting across all the data (training and holdout) at once, as it doesn't help demonstrate the model's ability to predict on unseen data.
You would aim to report the performance on the training data during k-fold CV and then on the holdout.
Depending on your k-fold CV setup, you would train the model on the entire training set before predicting on both sets and comparing the results. You would need to be more specific in describing your exact setup.
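As a rough sketch of that workflow with gbm and pROC (the object names dat and y, and the tuning values, are placeholders rather than your actual setup):
library(gbm)
library(pROC)

set.seed(42)
idx      <- sample(nrow(dat), size = 0.8 * nrow(dat))
training <- dat[idx, ]   # used for 10-fold CV and model fitting
holdout  <- dat[-idx, ]  # untouched until the final evaluation

fit <- gbm(y ~ ., data = training, distribution = "bernoulli",
           n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
           cv.folds = 10)
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# cross-validated AUC on the training data
# (cv.fitted holds the CV predictions on the link scale; AUC only depends on ranks)
auc(training$y, fit$cv.fitted)

# AUC on the untouched holdout, which is what the reviewers are asking for
p_hold <- predict(fit, newdata = holdout, n.trees = best_iter, type = "response")
auc(holdout$y, p_hold)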

Unable to predict on a new dataset using coxph in R

I built a coxph model on training data and I am having trouble using it for predictions on the validation dataset. I used Surv to build a survival object and used it as the response variable in the coxph model. I then used predict to get predictions on both the training and test datasets, but both times it predicted using the training data.
my.surv<-Surv(train$time_to_event, train$event, type="right")
basic.coxph<-coxph(my.surv~train$basin+train$region+train$district_code)
prediction_train<-predict(basic.coxph,train,type="risk")
prediction_test<-predict(object=basic.coxph,newdata=validate,type="risk")
The results of prediction_train and prediction_test have the same dimensions and are exactly the same, even though the training and validation datasets have different numbers of rows (but all the same columns). Any suggestion on what I am doing wrong here?
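A likely cause: because the formula references train$basin and so on directly, the model is tied to those specific vectors, and predict() cannot match the columns of newdata against them, so it falls back to the training data. A sketch of the usual fix, building the Surv object inside the formula and passing a data argument:
library(survival)

basic.coxph <- coxph(Surv(time_to_event, event, type = "right") ~
                       basin + region + district_code,
                     data = train)

prediction_train <- predict(basic.coxph, newdata = train,    type = "risk")
prediction_test  <- predict(basic.coxph, newdata = validate, type = "risk")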

R glm - how to do multiple cross-validation

I have training data which I randomly split into two parts:
70% -> train_train
30% -> train_cv (for cross-validation)
I fit a glm (glmnet) model using train_train, then cross-validate with train_cv.
My problem is that a different random split for train_train and train_cv returns different cross-validation results (evaluated using Area Under the Curve, "AUC"):
AUC = 0.6381583 the 1st time
AUC = 0.6164524 the 2nd time
Is there a way to run multiple cross-validations, without duplicating the code?
There are some confusing things here. I think what you are describing is more of a standard train/test split; the term cross-validation is usually used for something slightly different. So you've held out 30% of the data for testing, which is good, and you can use that to find out how optimistic your train-set estimate of AUC is. But of course the estimate depends on how you do the train/test split, and it would be good to know how much this test performance varies. You can use multiple runs of cross-validation to achieve this.
Cross-validation is slightly different from just using a holdout set - five-fold cross-validation, for example, involves the following steps:
Randomly split the full dataset into five equal sized parts.
For i = 1 to 5, fit the model on all the data except the ith part.
Evaluate AUC on the part that was held out from the fit.
Average the five AUC results.
This process can be repeated multiple times to estimate the mean and variance of the out of sample estimate.
The R package cvTools allows you to do this. For example
library(ROCR)
library(cvTools)

calc_AUC <- function(pred, act) {
  u <- prediction(pred, act)
  return(performance(u, "auc")@y.values[[1]])
}

# m is the fitted model object
cvFit(m, data = train, y = train$response,
      cost = calc_AUC, predictArgs = list(type = "response"))
will perform 5-fold cross-validation of the model m using AUC as the performance metric. cvFit also takes the arguments K (the number of cross-validation folds) and R (the number of times to repeat the cross-validation with different random splits).
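For example, to repeat 5-fold cross-validation ten times with different random splits (reusing m and calc_AUC from above):
set.seed(123)
cv_rep <- cvFit(m, data = train, y = train$response,
                cost = calc_AUC, K = 5, R = 10,
                predictArgs = list(type = "response"))
cv_rep       # average AUC across the 10 repetitions
cv_rep$reps  # results from the individual repetitions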
See http://en.wikipedia.org/wiki/Cross-validation_(statistics) for more info on cross-validation.
