Difference between fitted values and cross validation values from pls model in r - r

I only have a small dataset of 30 samples, so I only have a training data set but no test set. So I want to use cross-validation to assess the model. I have run pls models in r using cross-validation and LOO. The mvr output has the fitted values and validation$preds values, and these are different. As the final results of R2 and RMSE for just the training set should I be using the final fitted values or the validation$preds values?

Short answer is if you want to know how good the model is at predicting, you will use the validation$preds because it is tested on unseen data. The values under $fitted.values are obtained by fitting the final model on all your training data, meaning the same training data is used in constructing model and prediction. So values obtained from this final fit, will underestimate the performance of your model on unseen data.
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to find which is the best hyperparameter, in this case number of components for the model.
During cross-validation one part of the data is not used for fitting and serves as a test set. This actually provides a rough estimate the model will work on unseen data. See this image from scikit learn for how CV works.
LOO works in a similar way. After finding the best parameter supposedly you obtain a final model to be used on the test set. In this case, mvr trains on all models from 2-6 PCs, but $fitted.values is coming from a model trained on all the training data.
You can also see below how different they are, first I fit a model
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx = sample(nrow(BostonHousing),400)
trainData = BostonHousing[idx,]
testData = BostonHousing[-idx,]
mdl <- mvr(medv ~ ., 4, data = trainData, validation = "CV",
method = "oscorespls")
Then we calculate mean RMSE in CV, full training model, and test data, using 4 PCs:
calc_RMSE = function(pred,actual){ mean((pred - actual)^2)}
# error in CV
calc_RMSE(mdl$validation$pred[,,4],trainData$medv)
[1] 43.98548
# error on full training model , not very useful
calc_RMSE(mdl$fitted.values[,,4],trainData$medv)
[1] 40.99985
# error on test data
calc_RMSE(predict(mdl,testData,ncomp=4),testData$medv)
[1] 42.14615
You can see the error on cross-validation is closer to what you get if you have test data. Again this really depends on your data.

Related

How to identify the non-zero coefficients in final caret elastic net model -

I have used caret to build a elastic net model using 10-fold cv and I want to see which coefficients are used in the final model (i.e the ones that aren't reduced to zero). I have used the following code to view the coefficients, however, this apears to create a dataframe of every permutation of coefficient values used, rather than the ones used in the final model:
tr_control = train_control(method="cv",number=10)
formula = response ~.
model1 = caret::train(formula,
data=training,
method="glmnet",
trControl=tr_control,
metric = "Accuracy",
family = "binomial")
Then to extract the coefficients from the final model and using the best lambda value, I have used the following:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$.lambda)))
However, this just returns a dataframe of all the coefficients and I can see different instances of where the coefficients have been reduced to zero, however, I'm not sure which is the one the final model uses. Using some slightly different code, I get slightly different results, but in this instance, non of the coefficients are reduced to zero, which suggests to me that the the final model isn't reducing any coefficients to zero:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda))) #i have removed the full stop preceeding lambda
Basically, I want to know which features are in the final model to assess how the model has performed as a feature reduction process (alongside standard model evaluation metrics such as accuracy, sensitivity etc).
Since you do not provide any example data I post an example based on the iris built-in dataset, slightly modified to fit better your need (a binomial outcome).
First, modify the dataset
library(caret)
set.seed(5)#just for reproducibility
iris
irisn <- iris[iris$Species!="virginica",]
irisn$Species <- factor(irisn$Species,levels = c("versicolor","setosa"))
str(irisn)
summary(irisn)
fit the model (the caret function for setting controls parameters for train is trainControl, not train_control)
tr_control = trainControl(method="cv",number=10)
model1 <- caret::train(Species~.,
data=irisn,
method="glmnet",
trControl=tr_control,
family = "binomial")
You can extract the coefficients of the final model as you already did:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda)))
Also here the model did not reduce any coefficients to 0, but what if we add a random variable that explains nothing about the outcome?
irisn$new1 <- runif(nrow(irisn))
model2 <- caret::train(Species~.,
data=irisn,
method="glmnet",
trControl=tr_control,
family = "binomial")
var <- data.frame(as.matrix(coef(model2$finalModel, model2$bestTune$lambda)))
Here, as you can see, the coefficient of the new variable was turning to 0. You can extract the variable name retained by the model with:
rownames(var)[var$X1!=0]
Finally, the accuracy metrics from the test set can be obtained with
confusionMatrix(predict(model1,test),test$outcome)

calculate MSE for training set that's missing the response variable

I have a training set with a response variable ViolentCrimesPerPop, and I purposely fit a large regression tree with control
control1 <- rpart.control(minsplit=2, cp=1e-8, xval=20)
train_control <- rpart(ViolentCrimesPerPop ~ ., data=train, method='anova', control=control1)
then i use it to predict the testing set
predict1 <- predict(train_control, newdata=test)
however I'm not sure how to compute the mean square error of the test set because it requires the response variable ViolentCrimesPerPop, which is not given in the test set. Can someone give me a hint on how to approach this problem?
You can find the MSE only knowing the ground truth.
If you don't know the test labels then the only way is to train your model with 70 or 80% of the train data and test the MSE on the other 20/30% of the train data.
You won't be able to calculate the MSE for the test set if you don't know the ground truth (response variable). However, there may be a possibility that you had been asked to split a dataset that contains the ground truth into train and test; in that case, you can easily compute the MSE.
Are you working on some Kaggle tests that do not provide the response variable for the test set?
Regardless, try to split your training set into new subsets, and use part as training, and the rest to test your model. You cannot assess the model performance without the response variable.

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they can predict the scores in the hold-out set. In the guide I linked to above, the author uses a ConfusionMatrix in order to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models on the hold-out set of data (which is referred to as test in the link). However, ConfusionMatrix can only be applied to classification models, wherein the outcome (or response) is a categorical value (as far as my research has indicated. Please correct me if this is incorrect, as I have not been able to conclude without any doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and hold-out data, then calculate Rsquared and RMSE values using caret's R2 and RMSE methods. But I'm not sure if such an approach is best possible way for comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring values under 100 were not. This would allow me utilize the ConfusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and have ranges that are not normally distributed.
Here is the code I thus far been using:
library(caret)
data <- read.table("mydata.csv")
sorted_Data<- data[order(data$fluorescence, decreasing= TRUE),]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p=splitprob, list=F)
holdoutset <- sorted_Data[-traintestindex,]
trainingset <- sorted_Data[traintestindex,]
traindata<- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number= 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence~., traindata, method = "glmStepAIC", preProc = c("center","scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence~., traindata, method = "rlm", preProc = c("center","scale"), trControl = cvCtrl)
pred_glmStepAIC<- predict.lm(modelglmStepAIC$finalModel, holdoutset)
pred_rlm<- predict.lm(model_rlm$finalModel, holdoutset)
glmStepAIC_r2<- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse<- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2<- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse<- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by Caret are RMSE, MAE and squared correlation between fitted and observed values (called R2). See more info here https://topepo.github.io/caret/measuring-performance.html
At least in time series regression context, RMSE is the standard measure for out-of-sample performance of regression models.
I would advise against discretising continuous outcome variable, because you are essentially throwing away information by discretising.

How to test your tuned SVM model on a new data-set using machine learning and Caret Package in R?

Guys!
I am a newbie in machine learning methods and have a question about it. I try to use Caret package in R to start this method and work with my dataset.
I have a training dataset (Dataset1) with mutation information regarding my gene of interest let's say Gene A.
In Dataset1, I have the information regarding the mutation of Gene A in the form of Mut or Not-Mut. I used the Dataset1 with SVM model to predict the output (I chose SVM because it was more accurate than LVQ or GBM).
So, in my first step, I divided my dataset into training and test groups because I've had information as a test and train set in the dataset. then I've done the cross validation with 10 fold.
I tuned my model and assessed the performance of the model using the test dataset (using ROC curve).
Everything goes fine till this step.
I have another dataset. Dataset2 which doesn't have mutation information regarding Gene A.
What I want to do now is to use my tuned SVM model from the Dataset1 on the Dataset2 to see if it could give me mutation information regarding Gene A in the Dataset 2 in a form of Mut/Not-Mut. I've gone through Caret package guide but I couldn't get it. I am stuck here and don't know what to do.
I am not sure if I chose a right approach.Any suggestions or help would really be appreciated.
Here is my code till I tuned my model from the first dataset.
Selecting training and test models from the first dataset:
M_train <- Dataset1[Dataset1$Case=='train',-1] #creating train feature data frame
M_test <- Dataset1[Dataset1$Case=='test',-1] #creating test feature data frame
y=as.factor(M_train$Class) # Target variable for training
ctrl <- trainControl(method="repeatedcv", # 10fold cross validation
repeats=5, # do 5 repititions of cv
summaryFunction=twoClassSummary, # Use AUC to pick the best model
classProbs=TRUE)
#Use the expand.grid to specify the search space
#Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(interaction.depth = seq(1,4,by=2), #tree depths from 1 to 4
n.trees=seq(10,100,by=10), # let iterations go from 10 to 100
shrinkage=c(0.01,0.1), # Try 2 values fornlearning rate
n.minobsinnode = 20)
# Set up for parallel processing
#set.seed(1951)
registerDoParallel(4,cores=2)
#Train and Tune the SVM
svm.tune <- train(x=M_train,
y= M_train$Class,
method = "svmRadial",
tuneLength = 9, # 9 values of the cost function
preProc = c("center","scale"),
metric="ROC",
trControl=ctrl) # same as for gbm above
#Finally, assess the performance of the model using the test data set.
#Make predictions on the test data with the SVM Model
svm.pred <- predict(svm.tune,M_test)
confusionMatrix(svm.pred,M_test$Class)
svm.probs <- predict(svm.tune,M_test,type="prob") # Gen probs for ROC
svm.ROC <- roc(predictor=svm.probs$mut,
response=as.factor(M_test$Class),
levels=y))
plot(svm.ROC,main="ROC for SVM built with GA selected features")
So, here is where I stuck, how can I use svm.tune model to predict the mutation of Gene A in Dataset2?
Thanks in advance,
Now you just take the model you built and tuned and predict off of it using predict :
D2.predictions <- predict(svm.tune, newdata = Dataset2)
They keys are to be sure that you have ALL off the same predictor variables in this set, with the same column names (and in my paranoid world in the same order).
D2.predictions will contain your predicted classes for the unlabeled data.

extract predicted values from Partial least square regression in R

I have following:
library(pls)
pcr(price ~ X, 6, data=cars, validation="CV")
it works, but because I have a small dataset, I cannot divide in into training and test and therefore I want to perform cross-validation and then extract predicted data for AUC and accuracy. But I could not find how I can extract the predicted data.Which parameter is it?
When you fit a cross-validated principal component regression model with pcr() and the validation= argument, one of the components of the output list is called validation. This contains the results of the cross validation. This in turn is a list and it has a component called pred, which contains the cross-validated predictions.
An example adapted from example("pcr"):
sens.pcr <- pcr(sensory ~ chemical, data = oliveoil, validation = "CV")
sens.pcr$validation$pred
As an aside, it's generally a good idea to set your random seed immediately prior to performing cross validation to ensure reproducibility of your results.

Resources