Unable to predict on a new dataset using coxph in R

I built a coxph model on training data and I am having trouble using it for predictions on the validation dataset. I used Surv to build a survival object and used it as the response variable in the coxph model. I then used predict to get predictions on both the training and test datasets, but both times it predicted using the training data.
my.surv<-Surv(train$time_to_event, train$event, type="right")
basic.coxph<-coxph(my.surv~train$basin+train$region+train$district_code)
prediction_train<-predict(basic.coxph,train,type="risk")
prediction_test<-predict(object=basic.coxph,newdata=validate,type="risk")
The results of prediction_train and prediction_test have the same dimensions and are exactly the same, even though the training and validation datasets have different numbers of rows (but all the same columns). Any suggestion on what I am doing wrong here?
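A minimal hedged sketch of the usual fix (an assumption, since the cause is not confirmed here): when the formula references train$ columns directly, predict() keeps evaluating those columns from the training data and ignores newdata, so the model is normally fit with bare column names and a data argument instead:
library(survival)
# Assumed fix: reference columns by name and pass data = train,
# so predict() can rebuild the model frame from newdata.
basic.coxph <- coxph(Surv(time_to_event, event, type = "right") ~
                       basin + region + district_code,
                     data = train)
prediction_train <- predict(basic.coxph, newdata = train, type = "risk")
prediction_test  <- predict(basic.coxph, newdata = validate, type = "risk")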

Related

knn imputation for mixed data in glmnet during cross-validation without information leakage

I would like to use glmnet and impute missing data with a knn imputation method based on the Gower distance, which handles numerical, categorical, ordered, and semi-continuous variables (this is not possible with caret's knn imputation method). In order to assess the model adequately, I need to use the imputation model fitted on the training data to impute missing data in the test data during cross-validation. I assume it should be possible to do this with the makeX function in glmnet, replacing the mean imputation with a different imputation method. Does anyone know how to do this, or have any other idea?
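As a rough sketch of the fold-wise pattern (fit the imputation on the training fold only, then apply it to the held-out fold), here is the structure with glmnet's makeX mean imputation standing in for the knn/Gower imputer; the toy data frame X and response y are made up for illustration:
library(glmnet)
set.seed(1)
n <- 200
X <- data.frame(num1 = rnorm(n), num2 = rnorm(n),
                cat = factor(sample(letters[1:3], n, replace = TRUE)))
X$num1[sample(n, 20)] <- NA                    # introduce missing values
y <- rnorm(n)
k <- 5
folds <- sample(rep(1:k, length.out = n))      # random fold assignment
cv_err <- numeric(k)
for (i in 1:k) {
  tr <- folds != i
  # makeX one-hot encodes factors and, with na.impute = TRUE, fills the
  # held-out fold using means computed from the training fold only; a
  # knn/Gower imputer fitted on X[tr, ] would be swapped in here.
  mats <- makeX(train = X[tr, ], test = X[!tr, ], na.impute = TRUE)
  fit  <- cv.glmnet(mats$x, y[tr])
  pred <- predict(fit, newx = mats$xtest, s = "lambda.min")
  cv_err[i] <- mean((pred - y[!tr])^2)
}
mean(cv_err)                                   # leakage-free CV error estimate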

Difference between fitted values and cross validation values from pls model in r

I only have a small dataset of 30 samples, so I only have a training data set and no test set. I therefore want to use cross-validation to assess the model. I have run pls models in R using cross-validation and LOO. The mvr output has fitted values and validation$pred values, and these are different. For the final R2 and RMSE results on just the training set, should I be using the final fitted values or the validation$pred values?
The short answer is: if you want to know how good the model is at predicting, use the validation$pred values, because they are obtained on data not used for fitting. The values under $fitted.values are obtained by fitting the final model on all your training data, meaning the same data are used both to construct the model and to make the predictions. Values obtained from this final fit will therefore overstate how well your model performs on unseen data.
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to find the best hyperparameter, in this case the number of components for the model.
During cross-validation one part of the data is held out from fitting and serves as a test set. This provides a rough estimate of how the model will perform on unseen data. See this image from scikit-learn for how CV works.
LOO works in a similar way. After finding the best parameter, you then fit a final model to be used on the test set. In this case, cross-validation fits a model for each candidate number of components, but $fitted.values comes from the single model trained on all the training data.
You can also see below how different they are. First, fit a model:
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx = sample(nrow(BostonHousing), 400)
trainData = BostonHousing[idx, ]    # 400 rows for training
testData  = BostonHousing[-idx, ]   # remaining rows held out as a test set
mdl <- mvr(medv ~ ., 4, data = trainData, validation = "CV",
           method = "oscorespls")
Then we calculate the mean squared error (the helper below returns MSE, not RMSE) in cross-validation, on the full training fit, and on the test data, using 4 components:
calc_MSE = function(pred, actual){ mean((pred - actual)^2) }
# error estimated by cross-validation
calc_MSE(mdl$validation$pred[,,4], trainData$medv)
[1] 43.98548
# error on the full training fit, optimistic and not very useful
calc_MSE(mdl$fitted.values[,,4], trainData$medv)
[1] 40.99985
# error on the held-out test data
calc_MSE(predict(mdl, testData, ncomp = 4), testData$medv)
[1] 42.14615
You can see that the cross-validation error is closer to the error you get on held-out test data. Again, this really depends on your data.

Producing graph of training and validation sets using caret's train function

I am using caret's train function in R to produce a model using GBM. I have used repeated cross-validation with 5 repetitions, meaning there will be 50 resamples. I want to ask if there is a way to plot the results differently, so that the plot shows the boosting iterations on the x-axis and AUC on the y-axis, with the results from the best parameter selection drawn as separate lines for the training folds and the test folds. This can be produced when you use the "gbm" function from the gbm package and use "gbm.perf" along with a sampling technique to plot the training and validation curves for deviance.
Is it possible to do the same with caret's train function somehow?
Thanks.
Within your caret object, if you used method='gbm', you can select the attribute 'finalModel', which is the resulting gbm object. For example, if your train object is named 'a', then
gbm_model <- a$finalModel
With gbm_model, you can then run the functions internal to the gbm package.
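For example (a hedged sketch, not part of the original answer): because caret refits the final model on the full training set without an internal validation split, gbm.perf can typically only draw the training deviance together with the out-of-bag estimate of the best iteration, rather than a held-out test curve.
library(gbm)
gbm_model <- a$finalModel            # the underlying gbm object from caret
# Plots training deviance versus boosting iterations and returns the OOB
# estimate of the optimal iteration (requires bag.fraction < 1, the gbm default)
best_iter <- gbm.perf(gbm_model, method = "OOB")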

Forecast future values for a time series using support vector machine

I am using support vector regression in R to forecast future values for a univariate time series. Splitting the historical data into train and test sets, I fit a model with the svm function on the training data and then use the predict() command on the test data, so we can compute prediction errors. I wonder what happens next: we have a model, and by checking it on the test data we see that it performs well. How can I use this model to predict future values beyond the historical data? Generally speaking, we use the predict function in R and give it a forecast horizon (h=12) to predict 12 future values. From what I saw, the predict() command for SVM does not have such an argument and needs a data set of predictors instead. How should I build such a data set for predicting future values that are not in our historical data?
Thanks
Just a stab in the dark... SVM is typically used for classification rather than forecasting, specifically supervised classification. I am guessing you are trying to predict stock values, no? How about classifying your existing data, using a window of your choice, say 100 values at a time, as noise (N), up (U), big up (UU), down (D), or big down (DD)? Then, as new data comes in, you slide the classification window and have the model tell you whether the upcoming trend is N, U, UU, D, or DD.
What you can do is build a data frame whose columns are the actual stock price and its n lagged values, and use it as a train/test set (the current value is the output and the lagged values are the explanatory variables). With this method you can make a forecast one day (or whatever the granularity is) into the future, and then feed that prediction back in as a lag to make the next one, and so on, as sketched below.
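A rough sketch of this recursive approach with e1071::svm on a built-in example series; the lag order and horizon are arbitrary choices for illustration, not from the answer:
library(e1071)
y <- as.numeric(AirPassengers)        # example univariate series
n_lags <- 12                          # lagged values used as predictors
h <- 12                               # forecast horizon
# embed() puts y_t in the first column and y_{t-1}, ..., y_{t-n_lags} after it
lagged <- as.data.frame(embed(y, n_lags + 1))
names(lagged) <- c("y", paste0("lag", 1:n_lags))
fit <- svm(y ~ ., data = lagged)      # support vector regression on the lags
# Recursive forecasting: predict one step, append it, reuse it as a lag
history <- tail(y, n_lags)
forecasts <- numeric(h)
for (i in 1:h) {
  newdata <- as.data.frame(t(rev(tail(history, n_lags))))
  names(newdata) <- paste0("lag", 1:n_lags)
  forecasts[i] <- predict(fit, newdata)
  history <- c(history, forecasts[i])
}
forecasts                             # h steps beyond the historical data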

k-fold cross validation of prediction error using mgcv

I would like to evaluate the performance of a GAM at predicting novel data using five-fold cross-validation. Model training is based on a random subset of 80% of the data, with the remaining 20% serving as the test set. I can calculate the mean squared prediction error between the predicted and observed test data, but I am uncertain how to implement this across k folds.
I have the following code to create the training and test datasets and to calculate the MSPE. I have not included sample data, but can do so.
indexes <- sample(1:nrow(data), size = 0.2*nrow(data))
testP  <- data[indexes, ]   # 20% test set
trainP <- data[-indexes, ]  # 80% training set
gam0 <- gam(x ~ NULL, family = quasibinomial(link = 'logit'),
            data = trainP, gamma = 1.4)   # fit on the training subset
pv <- predict(gam0, newdata = testP, type = "response")
diff  <- pv - testP$x       # (predicted - observed)
diff2 <- diff^2             # (predicted - observed)^2
mspegam0 <- mean(diff2)
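One way to wrap this in a five-fold loop (a sketch only; the toy data frame and the smooth term s(z) are placeholders standing in for the real data and model formula):
library(mgcv)
set.seed(1)
data <- data.frame(x = rbinom(200, 1, 0.4), z = rnorm(200))  # toy data
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
mspe <- numeric(k)
for (i in 1:k) {
  testP  <- data[folds == i, ]   # hold-out fold (~20%)
  trainP <- data[folds != i, ]   # training folds (~80%)
  gam_i  <- gam(x ~ s(z), family = quasibinomial(link = "logit"),
                data = trainP, gamma = 1.4)
  pv <- predict(gam_i, newdata = testP, type = "response")
  mspe[i] <- mean((pv - testP$x)^2)   # (predicted - observed)^2, averaged
}
mspe        # per-fold mean squared prediction error
mean(mspe)  # five-fold cross-validated MSPE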
