Implementing Monte Carlo Cross Validation on linear regression in R

I have a dataset of 90 stations with a variety of covariates that I would like to use for prediction with step-wise forward multiple regression. I would therefore like to use Monte Carlo Cross Validation (MCCV) to estimate the performance of my linear model by repeatedly splitting the data into training and test sets.
How can I implement MCCV in R so that I can test my model over a chosen number of iterations? I found the WilcoxCV package, which gives me the observation numbers for each iteration, and the CMA package, which hasn't helped me much so far.
I checked all the threads about MCCV but didn't find an answer.

You can use the caret package. MCCV is called 'LGOCV' in this package (i.e. Leave Group Out CV). It repeatedly draws random splits into training and test sets.
Here is an example that trains an L1-regularized regression model (you should look into regularization instead of step-wise selection, by the way), using MCCV to validate the choice of the penalizing lambda parameter:
library(caret)
library(glmnet)

n <- 1000 # nbr of observations
m <- 20   # nbr of features

# Generate example data
x <- matrix(rnorm(m * n), n, m)
colnames(x) <- paste0("var", 1:m)
y <- rnorm(n)
dat <- as.data.frame(cbind(y, x))

# Set up training settings object
trControl <- trainControl(method = "LGOCV", # Leave Group Out CV (MCCV)
                          number = 10)      # Number of random splits/iterations

# Set up grid of parameters to test
params <- expand.grid(alpha = c(0, 0.5, 1),              # L1 & L2 mixing parameter
                      lambda = 2^seq(1, -10, by = -0.3)) # regularization parameter

# Run training over tuneGrid and select best model
glmnet.obj <- train(y ~ .,                 # model formula (. means all features)
                    data = dat,            # data.frame containing training set
                    method = "glmnet",     # model to use
                    trControl = trControl, # set training settings
                    tuneGrid = params)     # set grid of params to test over

# Plot performance for different params
plot(glmnet.obj, xTrans = log, xlab = "log(lambda)")

# Plot regularization paths for the best model
plot(glmnet.obj$finalModel, xvar = "lambda", label = TRUE)
You can use glmnet to train linear models. If you want to use step-wise selection, caret supports that too, e.g. via method = 'glmStepAIC' or similar.
A list of the feature selection wrappers can be found here: http://topepo.github.io/caret/Feature_Selection_Wrapper.html
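For the original question's setting (a plain linear model on the 90 stations, validated over a chosen number of random splits), a minimal sketch could look like the following; stations and response are hypothetical names for your data.frame and outcome column, and p sets the proportion of data used for training in each split:

library(caret)

mccvControl <- trainControl(method = "LGOCV",  # MCCV
                            p = 0.8,           # 80% training / 20% test in each split
                            number = 100)      # number of random splits (iterations)

lm.obj <- train(response ~ .,            # regress the outcome on all covariates
                data = stations,         # hypothetical data.frame of the 90 stations
                method = "lm",           # or e.g. "glmStepAIC" for stepwise selection
                trControl = mccvControl)

lm.obj$results                           # RMSE / Rsquared aggregated over the 100 splits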
Edit
The alpha and lambda arguments in the expand.grid call are glmnet-specific parameters. If you use another model it will have a different set of parameters to optimize over.
lambda is the amount of regularization, i.e. the amount of penalization on the beta values. Larger values give "simpler" models that are less prone to overfit, while smaller values give more complex models that will tend to overfit if not enough data is available. The lambda values I supplied are just an example; supply the grid you are interested in. In general it is nice to supply an exponentially decreasing sequence for lambda.
alpha is the mixing parameter between L1 and L2 regularization: alpha=1 is pure L1 and alpha=0 is pure L2 regularization. The grid above supplies three values, alpha=c(0,0.5,1), which tests L1, L2 and an even mix of the two.
expand.grid creates a grid of potential parameter values we want to run the MCCV procedure over. Essentially, the MCCV procedure will evaluate performance for each of the different values in the grid and select the best one for you.
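As a quick illustration of what that grid is (every combination of the supplied values, one row per candidate model):

grid <- expand.grid(alpha = c(0, 0.5, 1),
                    lambda = 2^seq(1, -10, by = -0.3))
head(grid)
nrow(grid)   # 3 * 37 = 111 parameter combinations to evaluate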
You can read more about glmnet, caret and parameter tuning here:
An Introduction to Glmnet
glmnet documentation
Model Training and Parameter Tuning with Caret

Related

I am setting a seed on a Gradient Boosting Machine (GBM) model but I keep getting different predictions

I am performing credit risk modelling using the Gradient Boosting Machine (GBM) algorithm, and when making predictions of Probability of Default (PD) I keep getting different PDs on each run, even though I have set.seed(1234) in my code.
What could be causing this to happen and how do I fix it? Here is my code:
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5)

modelLookup(model = 'gbm')

# Creating grid
grid <- expand.grid(n.trees = c(10, 20, 50, 100, 500, 1000),
                    shrinkage = c(0.01, 0.05, 0.1, 0.5),
                    n.minobsinnode = c(3, 5, 10),
                    interaction.depth = c(1, 5, 10))

# Set seed
set.seed(1234)

# Training the model
model_gbm <- train(trainSet[, predictors], trainSet[, outcomeName],
                   method = 'gbm', trControl = fitControl, tuneGrid = grid)

# Summarizing the model
print(model_gbm)
plot(model_gbm)

# Using tuneLength
model_gbm <- train(trainSet[, predictors], trainSet[, outcomeName],
                   method = 'gbm', trControl = fitControl, tuneLength = 10)
print(model_gbm)
plot(model_gbm)

# Checking variable importance for GBM
library(gbm)
varImp(object = model_gbm, numTrees = 50)
# Plotting variable importance for GBM
plot(varImp(object = model_gbm), main = "GBM - Variable Importance")

# Checking variable importance for RF
varImp(object = model_rf)
# Plotting variable importance for Random Forest
plot(varImp(object = model_rf), main = "RF - Variable Importance")

# Checking variable importance for NNET
varImp(object = model_nnet)
# Plotting variable importance for Neural Network
plot(varImp(object = model_nnet), main = "NNET - Variable Importance")

# Checking variable importance for GLM
varImp(object = model_glm)
# Plotting variable importance for GLM
plot(varImp(object = model_glm), main = "GLM - Variable Importance")

# Predictions
predictions <- predict.train(object = model_gbm, testSet[, predictors], type = "raw")
table(predictions)
confusionMatrix(predictions, testSet[, outcomeName])

PD <- predict.train(object = model_gbm, credit_transformed[, predictors], type = "prob")
I assume you are using train() from caret.
I recommend you use the more complex but customizable trainControl() from the same package.
As you can see from ?trainControl, the seeds parameter is:
an optional set of integers that will be used to set the seed at each
resampling iteration. This is useful when the models are run in
parallel. A value of NA will stop the seed from being set within the
worker processes while a value of NULL will set the seeds using a
random set of integers. Alternatively, a list can be used. The list
should have B+1 elements where B is the number of resamples, unless
method is "boot632" in which case B is the number of resamples plus 1.
The first B elements of the list should be vectors of integers of
length M where M is the number of models being evaluated. The last
element of the list only needs to be a single integer (for the final
model). See the Examples section below and the Details section.
Fixing seeds should do the trick.
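A minimal sketch of such a seeds list for the question's setup (5-fold CV repeated 5 times, i.e. B = 25 resamples, and the 6 x 4 x 3 x 3 = 216-row tuning grid above); the upper bound 10000 passed to sample.int() is an arbitrary choice:

set.seed(1234)
B <- 5 * 5          # number = 5 folds x repeats = 5 -> 25 resampling iterations
M <- 6 * 4 * 3 * 3  # rows of the tuning grid (216 candidate parameter combinations)

seeds <- vector("list", length = B + 1)
for (i in 1:B) seeds[[i]] <- sample.int(10000, M)  # one seed per candidate model, per resample
seeds[[B + 1]] <- sample.int(10000, 1)             # a single seed for the final model fit

fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           seeds = seeds)

With this fitControl, repeated train() calls should give identical resampling results, including when the resamples are run in parallel.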
Please, next time, try to offer a dput() or analogous sample of your data so that your problem is reproducible.
Best!

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they can predict the scores in the hold-out set. In the guide I linked to above, the author uses a ConfusionMatrix in order to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models on the hold-out set of data (which is referred to as test in the link). However, ConfusionMatrix can only be applied to classification models, wherein the outcome (or response) is a categorical value (as far as my research has indicated. Please correct me if this is incorrect, as I have not been able to conclude without any doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create prediction objects using the recently built models and the hold-out data, then calculate R-squared and RMSE values using caret's R2 and RMSE functions. But I'm not sure whether such an approach is the best possible way of comparing the models and picking the best one.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring values under 100 were not. This would allow me utilize the ConfusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and has a range that is not normally distributed.
Here is the code I have been using thus far:
library(caret)

data <- read.table("mydata.csv")
sorted_Data <- data[order(data$fluorescence, decreasing = TRUE), ]

splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p = splitprob, list = FALSE)
holdoutset <- sorted_Data[-traintestindex, ]
trainingset <- sorted_Data[traintestindex, ]
traindata <- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]

cvCtrl <- trainControl(method = "repeatedcv", number = 20, repeats = 20, verboseIter = FALSE)

modelglmStepAIC <- train(fluorescence ~ ., traindata, method = "glmStepAIC",
                         preProc = c("center", "scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence ~ ., traindata, method = "rlm",
                   preProc = c("center", "scale"), trControl = cvCtrl)

pred_glmStepAIC <- predict.lm(modelglmStepAIC$finalModel, holdoutset)
pred_rlm <- predict.lm(model_rlm$finalModel, holdoutset)

glmStepAIC_r2 <- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse <- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2 <- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse <- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret are RMSE, MAE and the squared correlation between fitted and observed values (called R2). See more info here: https://topepo.github.io/caret/measuring-performance.html
At least in a time series regression context, RMSE is the standard measure of out-of-sample performance for regression models.
I would advise against discretising a continuous outcome variable, because by discretising you are essentially throwing away information.
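For the concrete comparison on the hold-out set, a minimal sketch using caret's postResample(), reusing the objects from the question, could be the following. Note that predicting through the train objects (rather than calling predict.lm() on finalModel) lets caret re-apply the center/scale preprocessing to the hold-out data:

pred_glmStepAIC <- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm        <- predict(model_rlm, newdata = holdoutset)

# RMSE, Rsquared and MAE on the held-out data, one row per model
rbind(glmStepAIC = postResample(pred_glmStepAIC, holdoutset$fluorescence),
      rlm        = postResample(pred_rlm, holdoutset$fluorescence))

Picking the model with the lower hold-out RMSE (or higher Rsquared) is then consistent with the advice above.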

How to test your tuned SVM model on a new data-set using machine learning and Caret Package in R?

Guys!
I am a newbie to machine learning methods and have a question about them. I am trying to use the caret package in R to get started with these methods and work with my dataset.
I have a training dataset (Dataset1) with mutation information regarding my gene of interest let's say Gene A.
In Dataset1, I have the information regarding the mutation of Gene A in the form of Mut or Not-Mut. I used Dataset1 with an SVM model to predict the output (I chose SVM because it was more accurate than LVQ or GBM).
So, in my first step, I divided my dataset into training and test groups, because the dataset already labels each case as train or test. Then I did cross-validation with 10 folds.
I tuned my model and assessed the performance of the model using the test dataset (using ROC curve).
Everything goes fine till this step.
I have another dataset. Dataset2 which doesn't have mutation information regarding Gene A.
What I want to do now is to use my tuned SVM model from Dataset1 on Dataset2, to see if it can give me mutation information regarding Gene A in Dataset2 in the form of Mut/Not-Mut. I've gone through the caret package guide but I couldn't figure it out. I am stuck here and don't know what to do.
I am not sure if I chose the right approach. Any suggestions or help would really be appreciated.
Here is my code up to the point where I tuned my model on the first dataset.
Selecting the training and test sets from the first dataset:
library(caret)
library(pROC)        # for roc()
library(doParallel)  # for registerDoParallel()

M_train <- Dataset1[Dataset1$Case == 'train', -1] # creating train feature data frame
M_test <- Dataset1[Dataset1$Case == 'test', -1]   # creating test feature data frame
y <- as.factor(M_train$Class)                     # target variable for training

ctrl <- trainControl(method = "repeatedcv",              # 10-fold cross validation
                     repeats = 5,                        # do 5 repetitions of cv
                     summaryFunction = twoClassSummary,  # use AUC to pick the best model
                     classProbs = TRUE)

# Use expand.grid to specify the search space
# Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(interaction.depth = seq(1, 4, by = 2), # tree depths from 1 to 4
                    n.trees = seq(10, 100, by = 10),       # let iterations go from 10 to 100
                    shrinkage = c(0.01, 0.1),              # try 2 values for learning rate
                    n.minobsinnode = 20)

# Set up for parallel processing
# set.seed(1951)
registerDoParallel(4, cores = 2)

# Train and tune the SVM
svm.tune <- train(x = M_train,
                  y = M_train$Class,
                  method = "svmRadial",
                  tuneLength = 9,                  # 9 values of the cost parameter C
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)                # same as for gbm above

# Finally, assess the performance of the model using the test data set.
# Make predictions on the test data with the SVM model
svm.pred <- predict(svm.tune, M_test)
confusionMatrix(svm.pred, M_test$Class)

svm.probs <- predict(svm.tune, M_test, type = "prob") # class probabilities for ROC
svm.ROC <- roc(predictor = svm.probs$mut,
               response = as.factor(M_test$Class),
               levels = levels(y))
plot(svm.ROC, main = "ROC for SVM built with GA selected features")
So, here is where I am stuck: how can I use the svm.tune model to predict the mutation status of Gene A in Dataset2?
Thanks in advance,
Now you just take the model you built and tuned and predict from it using predict():
D2.predictions <- predict(svm.tune, newdata = Dataset2)
The keys are to be sure that you have ALL of the same predictor variables in this set, with the same column names (and, in my paranoid world, in the same order).
D2.predictions will contain your predicted classes for the unlabeled data.
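If you also want the predicted class probabilities rather than hard labels (the question's trainControl already sets classProbs = TRUE, so this should work), a small follow-up sketch:

D2.probs <- predict(svm.tune, newdata = Dataset2, type = "prob")
head(D2.probs)   # one column of probabilities per class level (e.g. the "mut" column used above)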

Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events have on an outcome. I'm using the glmnet() package, specifically using the Poisson feature. Here's my code:
# de <- data imported from sql connection
x <- model.matrix(~.,data = de[,2:7])
y <- (de[,1])
reg <- cv.glmnet(x,y, family = "poisson", alpha = 1)
reg1 <- glmnet(x,y, family = "poisson", alpha = 1)
**Co <- coef(?reg or reg1?,s=???)**
summ <- summary(Co)
c <- data.frame(Name= rownames(Co)[summ$i],
Lambda= summ$x)
c2 <- c[with(c, order(-Lambda)), ]
The beginning imports a large amount of data from my database in SQL. I then put it in matrix format and separate the response from the predictors.
This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() function and the cv.glmnet() function. I realize that the cv.glmnet() function is a k-fold cross-validation of glmnet(), but what exactly does that mean in practical terms? They provide the same value for lambda, but I want to make sure I'm not missing something important about the difference between the two.
I'm also unclear as to why it runs fine when I specify alpha=1 (supposedly the default), but not if I leave it out?
Thanks in advance!
glmnet() is a function from the glmnet R package, which can be used to fit regression models, lasso models and others. The alpha argument determines what type of model is fit: when alpha=0 a ridge model is fit, and when alpha=1 a lasso model is fit.
cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted using nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping groups/folds of approximately equal size. Each fold is used once as the validation set while the model is fit on the other 9 folds. Bias-variance considerations are usually the motivation behind using such model validation methods. In the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
In your example, you can use plot(reg) or reg$lambda.min to see the value of lambda which results in the smallest CV error. You can then derive the test MSE for that value of lambda. By default, glmnet() will perform ridge or lasso regression over an automatically selected range of lambda values, which may not give the lowest test MSE. Hope this helps!
Between reg$lambda.min and reg$lambda.1se: lambda.min will give you the lowest CV error, but, depending on how flexible you can be with the error, you may want to choose reg$lambda.1se instead, as this value further shrinks the number of predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
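As a minimal sketch of the coefficient-extraction step the question's placeholder line is driving at, using the cross-validated object reg and one of those lambda choices; summary() on the sparse matrix returned by coef() gives the non-zero entries:

Co <- coef(reg, s = "lambda.min")              # or s = "lambda.1se" for a sparser model
summ <- summary(Co)                            # non-zero entries of the sparse coefficient matrix
coefs <- data.frame(Name = rownames(Co)[summ$i],
                    Coefficient = summ$x)
coefs[order(-abs(coefs$Coefficient)), ]        # sort by magnitude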

R glm - how to do multiple cross-validation

I have train data which I randomly split in two parts:
70% -> train_train
30% -> train_cv (for cross-validation)
I fit a glm (glmnet) model using train_train, then cross-validate with train_cv.
My problem is that a different random split for train_train and train_cv returns different cross-validation results (evaluated using Area Under the Curve, "AUC"):
AUC = 0.6381583 the 1st time
AUC = 0.6164524 the 2nd time
Is there a way to run multiple cross-validations, without duplicating the code?
There are some confusing things here. I think what you are describing is more of a standard train/test split; the word cross-validation is usually used differently. So you've held out 30% of the data for testing, which is good, and you can use that to find out how optimistic your train set estimate of the AUC is. But of course the estimate depends on how you do the train/test split, and it would be good to know how much this test performance varies. You can use multiple runs of cross-validation to achieve this.
Cross-validation is slightly different from just using a holdout set. Five-fold cross-validation, for example, involves the following steps:
Randomly split the full dataset into five equal sized parts.
For i = 1 to 5, fit the model on all the data except the ith part.
Evaluate AUC on the part that was held out from the fit.
Average the five AUC results.
This process can be repeated multiple times to estimate the mean and variance of the out of sample estimate.
The R package cvTools allows you to do this. For example:
library(ROCR)
library(cvTools)

calc_AUC <- function(pred, act) {
  u <- prediction(pred, act)
  performance(u, "auc")@y.values[[1]]
}

cvFit(m, data = train, y = train$response,
      cost = calc_AUC, predictArgs = list(type = "response"))
This will perform 5-fold cross-validation of the model m using AUC as the performance metric. cvFit also takes the arguments K (number of cross-validation folds) and R (number of times to perform the cross-validation with different random splits).
See http://en.wikipedia.org/wiki/Cross-validation_(statistics) for more info on cross-validation.
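To get at the original question (how much the AUC estimate varies across different random splits), a hedged sketch of the repeated version, assuming the interface described in ?cvFit (R controls the number of replications and reps holds the per-replication results), could be:

cv_res <- cvFit(m, data = train, y = train$response,
                cost = calc_AUC,
                predictArgs = list(type = "response"),
                K = 5,    # 5-fold cross-validation
                R = 10)   # repeat with 10 different random splits
cv_res        # aggregated AUC over the 10 replications
cv_res$reps   # individual per-replication AUC values, to inspect their spread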
