How to test a tuned SVM model on a new dataset using the caret package in R?

Hi everyone!
I am a newbie to machine learning methods and have a question. I am trying to use the caret package in R to get started and to work with my dataset.
I have a dataset (Dataset1) with mutation information for my gene of interest, let's say Gene A.
In Dataset1, the mutation status of Gene A is recorded as Mut or Not-Mut. I used Dataset1 with an SVM model to predict this output (I chose SVM because it was more accurate than LVQ or GBM).
As a first step, I divided the dataset into training and test groups, since each case is already labelled as train or test in the dataset. Then I did 10-fold cross-validation.
I tuned my model and assessed its performance on the test dataset (using a ROC curve).
Everything goes fine up to this step.
I have another dataset, Dataset2, which doesn't have mutation information for Gene A.
What I want to do now is apply my tuned SVM model from Dataset1 to Dataset2 to see if it can give me the mutation status of Gene A in Dataset2 in the form of Mut/Not-Mut. I've gone through the caret package guide but couldn't figure it out. I am stuck here and don't know what to do.
I am not sure if I chose the right approach. Any suggestions or help would really be appreciated.
Here is my code till I tuned my model from the first dataset.
Selecting the training and test sets from the first dataset:
library(caret)
library(doParallel)
library(pROC)
M_train <- Dataset1[Dataset1$Case == 'train', -1]   # create training feature data frame
M_test  <- Dataset1[Dataset1$Case == 'test', -1]    # create test feature data frame
y <- as.factor(M_train$Class)                       # target variable for training
ctrl <- trainControl(method = "repeatedcv",         # 10-fold cross-validation
                     repeats = 5,                   # do 5 repetitions of CV
                     summaryFunction = twoClassSummary,  # use AUC to pick the best model
                     classProbs = TRUE)
# Use expand.grid to specify the search space
# Note that the default search grid selects 3 values of each tuning parameter
# (these are gbm tuning parameters; this grid is not passed to the SVM fit below)
grid <- expand.grid(interaction.depth = seq(1, 4, by = 2),   # tree depths from 1 to 4
                    n.trees = seq(10, 100, by = 10),         # let iterations go from 10 to 100
                    shrinkage = c(0.01, 0.1),                # try 2 values for the learning rate
                    n.minobsinnode = 20)
# Set up parallel processing
# set.seed(1951)
registerDoParallel(cores = 4)   # register 4 workers for parallel resampling
# Train and tune the SVM
svm.tune <- train(x = M_train[, names(M_train) != "Class"],  # drop the target from the predictors
                  y = y,
                  method = "svmRadial",
                  tuneLength = 9,                  # try 9 values of the cost parameter
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)                # same control object as for gbm above
# Finally, assess the performance of the model using the test data set.
# Make predictions on the test data with the SVM model
svm.pred <- predict(svm.tune, M_test)
confusionMatrix(svm.pred, as.factor(M_test$Class))
svm.probs <- predict(svm.tune, M_test, type = "prob")   # class probabilities for the ROC curve
svm.ROC <- roc(predictor = svm.probs$mut,
               response = as.factor(M_test$Class),
               levels = levels(as.factor(M_test$Class)))
plot(svm.ROC, main = "ROC for SVM built with GA-selected features")
So, here is where I am stuck: how can I use the svm.tune model to predict the mutation status of Gene A in Dataset2?
Thanks in advance,

Now you just take the model you built and tuned and predict from it using predict():
D2.predictions <- predict(svm.tune, newdata = Dataset2)
The key is to be sure that you have all of the same predictor variables in this set, with the same column names (and, in my paranoid world, in the same order).
D2.predictions will contain your predicted classes for the unlabeled data.
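As a minimal sketch (assuming Dataset2 uses the same column names as Dataset1, and that the predictors used for training are the columns of M_train other than Class), you could align the new data with the training predictors before predicting:
predictor_cols <- setdiff(names(M_train), "Class")   # predictors used to fit svm.tune
D2.features    <- Dataset2[, predictor_cols]         # same columns, same order
D2.predictions <- predict(svm.tune, newdata = D2.features)                  # Mut / Not-Mut class labels
D2.probs       <- predict(svm.tune, newdata = D2.features, type = "prob")   # class probabilities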

Related

Difference between fitted values and cross-validation values from a pls model in R

I only have a small dataset of 30 samples, so I have a training set but no test set, and I want to use cross-validation to assess the model. I have run pls models in R using cross-validation and LOO. The mvr output has both fitted values and validation$pred values, and these are different. For the final R2 and RMSE on the training set, should I be using the fitted values or the validation$pred values?
The short answer is: if you want to know how good the model is at predicting, use validation$pred, because it is evaluated on data that was not used for fitting. The values under $fitted.values are obtained by fitting the final model on all of your training data, so the same data is used both to construct the model and to make the predictions. The error from this final fit is therefore optimistically low, and it overestimates how well your model will perform on unseen data.
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to find the best hyperparameter, in this case the number of components for the model.
During cross-validation, one part of the data is not used for fitting and serves as a test set. This provides a rough estimate of how the model will perform on unseen data. See this image from scikit-learn for how CV works.
LOO works in a similar way. After finding the best parameter you would normally refit a final model to be used on the test set. In this case, mvr trains models with 2-6 PCs, but $fitted.values comes from a model trained on all the training data.
You can also see below how different they are. First I fit a model:
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx <- sample(nrow(BostonHousing), 400)
trainData <- BostonHousing[idx, ]
testData  <- BostonHousing[-idx, ]
mdl <- mvr(medv ~ ., 4, data = trainData, validation = "CV",
           method = "oscorespls")
Then we calculate the mean squared error for the CV predictions, the full training fit, and the test data, using 4 PCs:
calc_RMSE <- function(pred, actual){ mean((pred - actual)^2) }  # note: despite the name, this returns the mean squared error (no square root)
# error in CV
calc_RMSE(mdl$validation$pred[,,4], trainData$medv)
[1] 43.98548
# error of the full training fit, not very useful
calc_RMSE(mdl$fitted.values[,,4], trainData$medv)
[1] 40.99985
# error on test data
calc_RMSE(predict(mdl, testData, ncomp = 4), testData$medv)
[1] 42.14615
You can see that the cross-validation error is closer to what you get on the test data. Again, this really depends on your data.
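As a side note (a sketch, not part of the original answer): the pls package can also report the cross-validated RMSEP directly, which avoids computing the error by hand and takes the square root for you.
RMSEP(mdl, estimate = "CV", ncomp = 4)   # cross-validated root mean squared error of prediction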

I am setting a seed on a Gradient Boosting Machine (GBM) model but I keep getting different predictions

I am performing credit risk modelling using the Gradient Boosting Machine (GBM) algorithm, and when making predictions of the Probability of Default (PD) I keep getting different PDs on each run, even though I have set.seed(1234) in my code.
What could be causing this, and how do I fix it? Here is my code:
fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5)
modelLookup(model = 'gbm')
# Creating the tuning grid
grid <- expand.grid(n.trees = c(10, 20, 50, 100, 500, 1000),
                    shrinkage = c(0.01, 0.05, 0.1, 0.5),
                    n.minobsinnode = c(3, 5, 10),
                    interaction.depth = c(1, 5, 10))
#SetSeed
set.seed(1234)
# training the model
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneGrid=grid)
# summarizing the model
print(model_gbm)
plot(model_gbm)
#using tune length
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneLength=10)
print(model_gbm)
plot(model_gbm)
#Checking variable importance for GBM
#Variable Importance
library(gbm)
varImp(object=model_gbm, numTrees = 50)
#Plotting Variable importance for GBM
plot(varImp(object=model_gbm),main="GBM - Variable Importance")
#Checking variable importance for RF
varImp(object=model_rf)
#Plotting Variable importance for Random Forest
plot(varImp(object=model_rf),main="RF - Variable Importance")
#Checking variable importance for NNET
varImp(object=model_nnet)
#Plotting Variable importance for Neural Network
plot(varImp(object=model_nnet),main="NNET - Variable Importance")
#Checking variable importance for GLM
varImp(object=model_glm)
#Plotting Variable importance for GLM
plot(varImp(object=model_glm),main="GLM - Variable Importance")
#Predictions
predictions<-predict.train(object=model_gbm,testSet[,predictors],type="raw")
table(predictions)
confusionMatrix(predictions,testSet[,outcomeName])
PD <- predict.train(object=model_gbm,credit_transformed[,predictors],type="prob")
I assume you are using train() from caret.
I recommend you use the more verbose but fully customizable seeds argument of trainControl() from the same package.
As you can see from ?trainControl, the parameter seeds is:
an optional set of integers that will be used to set the seed at each
resampling iteration. This is useful when the models are run in
parallel. A value of NA will stop the seed from being set within the
worker processes while a value of NULL will set the seeds using a
random set of integers. Alternatively, a list can be used. The list
should have B+1 elements where B is the number of resamples, unless
method is "boot632" in which case B is the number of resamples plus 1.
The first B elements of the list should be vectors of integers of
length M where M is the number of models being evaluated. The last
element of the list only needs to be a single integer (for the final
model). See the Examples section below and the Details section.
Fixing seeds should do the trick.
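As a rough sketch (assuming the grid and fitControl from above, i.e. 5 x 5 = 25 resamples and nrow(grid) candidate models evaluated per resample; the tuneLength = 10 call would need vectors as long as its own number of candidate models), the seeds list could be built like this:
set.seed(1234)
n_resamples <- 5 * 5                               # number = 5, repeats = 5
n_models    <- nrow(grid)                          # candidate models evaluated at each resample
seeds <- vector("list", n_resamples + 1)
for (i in seq_len(n_resamples)) seeds[[i]] <- sample.int(100000, n_models)
seeds[[n_resamples + 1]] <- sample.int(100000, 1)  # single seed for the final model
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                           seeds = seeds)
With this control object, repeated calls to train() should resample identically, even when run in parallel.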
Please, next time try to provide a dput() or an analogous sample of your data so the problem is reproducible.
Best!

Questions about the predict function in the randomForestSRC package

As is common with other machine learning methods, I divided my original data set into a 70% training set and a 30% test set.
Here is my code.
install.packages("randomForestSRC")
library(randomForestSRC)
data(pbc, package = "randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
                   data[train, ],
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)
data.pred <- predict(data.grow,
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)
My question is about the predict function in this code.
Originally, I wanted to build a prediction model based on a random survival forest to predict disease development.
For example, after building the prediction model with the training data set, I want to know the probability of disease development for test data that has no information about disease incidence for each individual, because I would like to estimate this probability from the subject's general characteristics such as age, BMI, sex, and so on.
However, contrary to my intention described above, the predict function in this package did not seem to work on data that has no status information (event/censored).
It appeared that predict must be given the outcome information (event/censored).
Therefore, I cannot understand what the predict function means here.
If predict works only with outcome information, how can I make a prediction of disease development based on a subject's general characteristics in the future?
In addition, if the prediction in this model is constructed using the outcome information, what does "predict" mean in the random survival forest model?
Please let me know what the predict function in this package actually does.
Thank you for reading my long question.
The predict for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train an RFSRC model
library(randomForestSRC)
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32, ]
# Predict on the new data
predicted <- predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
event.1 event.2 event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809 2.15895
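Building on the claim above, a hedged sketch (not from the original answer): the same predict call should still return predictions if the outcome columns are dropped from the new data; only the test-set error rate would then be unavailable.
new_data_no_outcome <- new_data[, !(names(new_data) %in% c("mpg", "cyl"))]  # drop the outcome columns
predict(mtcars.mreg, new_data_no_outcome)$predicted                         # predicted values without known outcomes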

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses to calibrate a large number of variables (% reflectance at each wavelength across the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to resolve. My provisional code is shown below:
First I split my data into training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set and extract and write out the non-zero coefficients for lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the predict function and apply the object cv.lasso.1 (the model obtained above) to the predictors of the testing set (x.test) in order to get predictions of the response, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response",
                       s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and there has been no problem so far. The point is that I would like to loop the whole procedure (one hundred repetitions) and, for each repetition, get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs. actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
Thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined it.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables with the most variation within a variable AND between variables. However, it does not consider your outcome at all, so if you have a poor model design it will pick the least correlated data in your repository, which may not be predictive; you should therefore be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work properly: they remove scale differences between your variables by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those settings as they are. If you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
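As a small follow-up sketch (using the yourPCA object above): you can inspect how much variance each component explains and take the component scores as lower-dimensional predictors.
summary(yourPCA)          # proportion of variance explained by each component
head(yourPCA$x[, 1:3])    # scores on the first three components, usable as predictors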
If you have a classification problem you are looking to solve with a lot of data, try LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class, or you can leave them out and lda will estimate the class probabilities from the training set. You can read about that here:
LDA from MASS package
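A short usage sketch (yourNewData is hypothetical, with the same columns as yourdata minus the outcome): predictions from the fitted LDA come from predict(), which returns predicted classes and posterior probabilities.
ldaPred <- predict(yourLDA, newdata = yourNewData)
head(ldaPred$class)       # predicted class labels
head(ldaPred$posterior)   # posterior probability for each class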
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid way. Building the most robust model via repeated model fitting is known as cross-validation. There is a cv.glm function in the boot package which can help you take care of this in a safe way.
You can use the following as a rough guide:
require(boot)
yourGLM   <- glm(outcome ~ ., data = yourdata, family = "gaussian")
yourCVGLM <- cv.glm(data = yourdata, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100-fold cross-validation: 100 models are fit, each on a different random partition of your data OBSERVATIONS, not your variables.
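As a brief follow-up sketch: the cross-validated estimate of prediction error is returned in the delta component of the cv.glm result.
yourCVGLM$delta   # raw and adjusted cross-validation estimates of prediction error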
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping; it is powerful and available for many different model types.
Not as much code as you might have hoped for, but this should point you in a decent direction.

Implementing Monte Carlo Cross Validation on linear regression in R

I have a dataset of 90 stations with a variety of covariates which I would like to use for prediction with step-wise forward multiple regression. I would therefore like to use Monte Carlo cross-validation (MCCV) to estimate the performance of my linear model by splitting it into test and training sets many times.
How can I implement MCCV in R to test my model over a certain number of iterations? I found the WilcoxCV package, which gives me the observation numbers for each iteration. I also found the CMA package, which hasn't helped me much so far.
I checked all the threads about MCCV but didn't find an answer.
You can use the caret package. MCCV is called 'LGOCV' in this package (i.e. Leave-Group-Out CV). It randomly selects splits between training and test sets.
Here is an example training an L1-regularized regression model (you should look into regularization instead of step-wise selection, by the way), validating the choice of the penalizing lambda parameter using MCCV:
library(caret)
library(glmnet)
n <- 1000   # number of observations
m <- 20     # number of features
# Generate example data
x <- matrix(rnorm(m * n), n, m)
colnames(x) <- paste0("var", 1:m)
y <- rnorm(n)
dat <- as.data.frame(cbind(y, x))
# Set up training settings object
trControl <- trainControl(method = "LGOCV",   # Leave Group Out CV (MCCV)
                          number = 10)        # number of random splits/iterations
# Set up grid of parameters to test
params <- expand.grid(alpha = c(0, 0.5, 1),               # L1 & L2 mixing parameter
                      lambda = 2^seq(1, -10, by = -0.3))  # regularization parameter
# Run training over tuneGrid and select the best model
glmnet.obj <- train(y ~ .,                   # model formula (. means all features)
                    data = dat,              # data.frame containing the training set
                    method = "glmnet",       # model to use
                    trControl = trControl,   # training settings
                    tuneGrid = params)       # grid of params to test over
# Plot performance for the different params
plot(glmnet.obj, xTrans = log, xlab = "log(lambda)")
# Plot regularization paths for the best model
plot(glmnet.obj$finalModel, xvar = "lambda", label = TRUE)
You can use glmnet to train linear models. If you want to use step-wise selection, caret supports that too, using e.g. method = 'glmStepAIC' or similar.
A list of the feature selection wrappers can be found here: http://topepo.github.io/caret/Feature_Selection_Wrapper.html
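As a small follow-up sketch (using the glmnet.obj trained above): once training finishes you can inspect which alpha/lambda combination won and how it performed over the LGOCV resamples.
glmnet.obj$bestTune        # best alpha / lambda combination found
getTrainPerf(glmnet.obj)   # resampled performance (RMSE, R-squared) for the best model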
Edit
The alpha and lambda arguments in the expand.grid call are glmnet-specific parameters. If you use another model it will have a different set of parameters to optimize over.
lambda is the amount of regularization, i.e. the amount of penalization on the beta values. Larger values give "simpler" models that are less prone to overfitting, while smaller values give more complex models that will tend to overfit if not enough data is available. The lambda values supplied here are just an example; supply the grid you are interested in, but in general it is good to use an exponentially decreasing sequence for lambda.
alpha is the mixing parameter between L1 and L2 regularization: alpha = 1 is pure L1 (lasso) and alpha = 0 is pure L2 (ridge). The grid above supplies three values, alpha = c(0, 0.5, 1), which tests L2, an even mix of the two, and L1.
expand.grid creates a grid of candidate parameter values to run the MCCV procedure over. Essentially, the MCCV procedure evaluates performance for each combination in the grid and selects the best one for you.
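For instance (an illustrative sketch), restricting the search to pure lasso with a denser lambda sequence would look like this:
params_lasso <- expand.grid(alpha = 1,                         # lasso only
                            lambda = 2^seq(2, -12, by = -0.1)) # finer, exponentially decreasing grid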
You can read more about glmnet, caret and parameter tuning here:
An Introduction to Glmnet
glmnet documentation
Model Training and Parameter Tuning with Caret
