I'm trying to implement some functions to compare five different machine learning models to predict some values in a regression problem.
My intention is to build a suite of functions that train the different models and organize the results in a single structure. The models I selected are: Lasso, Random Forest, SVM, Linear Model and Neural Network. To tune some of the models I intend to use Max Kuhn's reference: https://topepo.github.io/caret/available-models.html.
However, since each model requires different tuning parameters, I'm not sure how to set them.
First I set up the grid for tuning the 'nnet' model. Here I selected different numbers of nodes in the hidden layer and different decay coefficients:
my.grid <- expand.grid(size=seq(from = 1, to = 10, by = 1), decay = seq(from = 0.1, to = 0.5, by = 0.1))
Then I construct the function that will run each of the five models with 5 repeats of 6-fold cross-validation:
my_list_model <- function(model) {
  set.seed(1)
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                repeats = 5,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configuration of the machine learning models:
  set.seed(1)
  fit_m <- train(ST1 ~ .,
                 data = train,   # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,     # linear activation function for the output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = my.grid) # here is how I pass the 'nnet' tuning parameters
  return(fit_m)
}
Lastly, I execute the five models:
lapply(list(
Lass = "lasso",
RF = "rf",
SVM = "svmLinear",
OLS = "lm",
NN = "nnet"),
my_list_model) -> model_list
However, when I run this, it shows:
Error: The tuning parameter grid should not have columns fraction
From what I understand, I haven't specified the tuning parameters correctly. If I drop the 'nnet' model and replace it with, for example, an XGBoost model in the penultimate line, it seems to work fine and the results are calculated. That is, the problem seems to be with the 'nnet' tuning parameters.
So I think my real question is: how do I configure the different tuning parameters of the models, in particular the 'nnet' model? In addition, since I didn't set up parameters for lasso, random forest, svmLinear and the linear model, how were they tuned by the caret package?
You can make the tuning grid an argument of the function and pass a model-specific grid (or NULL) for each model:
my_list_model <- function(model, grd = NULL){
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configuration of the machine learning models:
  set.seed(1)
  fit_m <- train(Y ~ .,
                 data = df,      # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,     # linear activation function for the output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = grd) # here the model-specific tuning grid is passed
  return(fit_m)
}
First, run the code below to see the tuning parameters of a given model:
modelLookup('rf')
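The same lookup works for the other models in the original question. The sketch below lists the parameter names I would expect caret to report (treat the comments as assumptions; they can vary between caret versions). Note that when tuneGrid is NULL, caret builds a default grid over exactly these parameters, with tuneLength values per parameter (3 by default), which is how lasso, rf, svmLinear and lm were tuned without an explicit grid.
# Expected tuning parameters (comments are assumptions; verify with your caret version):
modelLookup('lasso')     # fraction  -> hence the "should not have columns fraction" error
modelLookup('nnet')      # size, decay
modelLookup('svmLinear') # C
modelLookup('lm')        # intercept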
Now make a grid for each model, based on the parameters returned by the lookup above:
svmGrid <- expand.grid(C=c(3,2,1))
rfGrid <- expand.grid(mtry=c(5,10,15))
Create a list of the models' grids, making sure each name in the list matches the caret method name:
grd_all <- list(svmLinear = svmGrid,
                rf = rfGrid)
model_list <- lapply(c("rf", "svmLinear"),
                     function(x) my_list_model(x, grd_all[[x]]))
model_list
[[1]]
Random Forest
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
5 63.54864 0.5247415 55.72074
10 63.70247 0.5255311 55.35263
15 62.13805 0.5765130 54.53411
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 15.
[[2]]
Support Vector Machines with Linear Kernel
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
1 59.83309 0.5879396 52.26890
2 66.45247 0.5621379 58.74603
3 67.28742 0.5576000 59.55334
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was C = 1.
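Following the same pattern, a grid list covering all five models from the original question could look roughly like the sketch below. The parameter values are only placeholders, and the nnet-specific arguments (linout, trace, maxit) hard-coded in my_list_model would likely need to be passed only when the method is 'nnet' (e.g. behind an if()), since other underlying fit functions may not accept them.
# Sketch: one grid per method, named after the caret method (values are illustrative)
grd_all <- list(lasso     = expand.grid(fraction = seq(0.1, 0.9, by = 0.2)),
                rf        = expand.grid(mtry = c(5, 10, 15)),
                svmLinear = expand.grid(C = c(1, 2, 3)),
                lm        = NULL,  # nothing worth tuning for a plain linear model
                nnet      = expand.grid(size = 1:10, decay = seq(0.1, 0.5, by = 0.1)))
model_list <- lapply(c("lasso", "rf", "svmLinear", "lm", "nnet"),
                     function(x) my_list_model(x, grd_all[[x]]))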
When I run a quantile regression forest model with caret::train, I get the following error: Error in { : task 1 failed - "non-numeric argument to binary operator".
When I set ntree to a higher number (in my reproducible example this would be ntree = 150), my code runs without errors.
This code
library(caret)
library(quantregForest)
data(segmentationData)
dat <- segmentationData[segmentationData$Case == "Train",]
dat <- dat[1:50,]
# predictors
preds <- dat[,c(5:ncol(dat))]
# convert all to numeric
preds <- data.frame(sapply(preds, function(x) as.numeric(as.character(x))))
# response variable
response <- dat[,4]
# set up error measures
sumfct <- function(data, lev = NULL, model = NULL){
  RMSE <- sqrt(mean((data$pred - data$obs)^2, na.rm = TRUE))
  c(RMSE = RMSE)
}
# specify folds
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
folds_train <- caret::createMultiFolds(y = dat$Cell,
k = 10,
times = 5)
# specify trainControl for tuning mtry with the created multifolds
finalcontrol <- caret::trainControl(search = "grid", method = "repeatedcv", number = 10, repeats = 5,
index = folds_train, savePredictions = TRUE, summaryFunction = sumfct)
# build grid for tuning mtry
tunegrid <- expand.grid(mtry = c(2, 10, sqrt(ncol(preds)), ncol(preds)/3))
# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds,
y = response,
method ="qrf",
ntree = 30, # with ntree = 150 it works
metric = "RMSE",
tuneGrid = tunegrid,
trControl = finalcontrol,
importance = TRUE,
keep.inbag = TRUE
)
produces the error. With my real data the model has ntree = 10000 and the task still fails.
How can I fix this?
Where in the source code of caret can I find the conditions for the error message Error in { : task 1 failed - "non-numeric argument to binary operator"? From which part of the source code does the error message come from?
UPDATE:
I adapted my code for my real data according to StupidWolf's answer, so it now looks like this:
# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds,
y = response,
method ="qrf",
ntree = 30, # with ntree = 150 it works
metric = "RMSE",
sampsize = ceiling(length(response)*0.4),
tuneGrid = tunegrid,
trControl = finalcontrol,
importance = TRUE,
keep.inbag = FALSE
)
With my real data I still get the above error message, and in the worst case I had to reduce sampsize to 0.1*length(response) in order to compute the model successfully. Setting keep.inbag = FALSE alone still produced errors. I have up to 1500 predictors, while the number of samples (rows) is only 50 to 60. I still don't understand what exactly causes the error message. I tried the model without the sampsize argument but with keep.inbag = FALSE, and the error still occurred; only setting sampsize very low ensured success.
How can I run the model successfully without setting sampsize? I actually wanted the standard bootstrap approach for the out-of-bag data sets, not an artificial sampsize of 40% or 10% of my data set for training the forest.
You get the error because you used the option keep.inbag = TRUE; see the quantregForest code, line 95:
minoob <- min( apply(!is.na(valuesPredict),1,sum))
if(minoob<10) stop("need to increase number of trees for sufficiently many out-of-bag observations")
So it requires that every observation is out of bag (OOB) in at least 10 trees in order to keep the out-of-bag predictions. If your real data set is large, the number of trees required to satisfy this is going to be large as well.
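As a rough back-of-the-envelope check (a sketch of my own, not part of quantregForest): with sampling with replacement, an observation is out of bag for a given tree with probability (1 - 1/n)^sampsize, so the expected number of OOB trees per observation is ntree times that. This makes it plausible why ntree = 30 is borderline while ntree = 150 is safe for ~50 samples:
# Expected number of trees for which one observation is out of bag
exp_oob <- function(ntree, n, sampsize = n) ntree * (1 - 1/n)^sampsize
exp_oob(ntree = 30,  n = 50)                # ~11, barely above the threshold of 10
exp_oob(ntree = 150, n = 50)                # ~55, comfortably above 10
exp_oob(ntree = 30,  n = 50, sampsize = 17) # ~21, which is why lowering sampsize helps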
If you are using caret for training, keeping the OOB predictions on top of savePredictions = TRUE seems redundant. On the whole, OOB predictions might not be that useful, since you will be using the test fold for prediction anyway.
Another option, given the size of your data, is to tweak sampsize. In randomForest, only sampsize observations are sampled with replacement to fit each tree. If you set this lower, you ensure there are enough OOB observations. For example, with the data given we can do:
model <- caret::train(x = preds,
y = response,
method ="qrf",
ntree = 30, sampsize=17,
metric = "RMSE",
tuneGrid = tunegrid,
trControl = finalcontrol,
importance = TRUE,
keep.inbag = TRUE)
model
Quantile Random Forest
50 samples
57 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 44, 43, 44, 46, 45, 46, ...
Resampling results across tuning parameters:
mtry RMSE
2.000000 42.53061
7.549834 42.72116
10.000000 43.11533
19.000000 42.80340
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
This is starting to confuse me a bit. Take, for example, the following code that trains a GLM model:
glm_sens = train(
form = target ~ .,
data = ABT,
trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, classProbs = TRUE, summaryFunction = twoClassSummary, savePredictions = TRUE),
method = "glm",
family = "binomial",
metric = "Sens"
)
I expected this to train a few models and then select the one that performs best on sensitivity. Yet when I read up on cross-validation, most of what I find is about how it is used to calculate average performance scores.
Was my assumption wrong?
caret does train different models, but normally they differ in their hyper-parameters. You can check out an explanation of the process. Hyper-parameters cannot be learned directly from the data, so you need this tuning process. These parameters decide how your model will behave; for example, lambda in the lasso decides how much regularization is applied to the model.
In a glm there is no hyper-parameter to tune. I guess what you are looking for is a way to select the best possible linear model out of the many potential variables. You can use step():
fit <- lm(mpg ~ ., data = mtcars)
step(fit, direction = "backward")
Another option is to use leaps with caret; for example, an equivalent of the above would be:
train(mpg ~ ., data = mtcars, method = 'leapBackward',
      trControl = trainControl(method = "cv", number = 10),
      tuneGrid = data.frame(nvmax = 2:6))
Linear Regression with Backwards Selection
32 samples
10 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 30, 28, 28, 28, 30, 28, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 3.299712 0.9169529 2.783068
3 3.124146 0.8895539 2.750305
4 3.249803 0.8849213 2.853777
5 3.258143 0.8939493 2.823721
6 3.123481 0.8917197 2.723475
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 6.
You can check out more about variable selection using leaps on this website
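Coming back to the glm_sens fit from the question: since a glm has nothing to tune, caret simply fits it once per resample and reports the metrics averaged over the 5 x 10 folds. A quick way to see this (a sketch, using the object from the question):
glm_sens$results   # one row: ROC, Sens, Spec averaged over all resamples
glm_sens$resample  # the per-fold values those averages come from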
I would like to cross-validate a neural network using the package neuralnet and caret.
The data df can be copied from this post.
When running the neuralnet() function, there is an argument called hidden where you can set the hidden layers and neurons in each. Let's say I want 2 hidden layers with 3 and 2 neurons respectively. It would be written as hidden = c(3, 2).
However, as I want to cross-validate it, I decided to use the fantastic caret package. But when using the function train(), I do not know how to set the number of layers and neurons.
Does anyone know where can I add these numbers?
This is the code I ran:
nn <- caret::train(DC1 ~ ., data=df,
method = "neuralnet",
#tuneGrid = tune.grid.neuralnet,
metric = "RMSE",
trControl = trainControl (
method = "cv", number = 10,
verboseIter = TRUE
))
By the way, I am getting some warnings with the previous code:
predictions failed for Fold01: layer1=3, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
Ideas on how to solve it?
When using the neuralnet model in caret, you can specify the number of hidden units in each of the three supported layers with the parameters layer1, layer2 and layer3. I found this out by checking the source.
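If you want to check this yourself without digging through the source, caret exposes the parameter table for each method (a sketch; modelLookup and getModelInfo are standard caret helpers):
library(caret)
modelLookup("neuralnet")                                  # layer1, layer2, layer3
getModelInfo("neuralnet", regex = FALSE)[[1]]$parameters  # same information, from the model definition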
library(caret)
grid <- expand.grid(layer1 = c(32, 16),
layer2 = c(32, 16),
layer3 = 8)
Use case with BostonHousing data:
library(mlbench)
data(BostonHousing)
Let's select just the numeric columns to keep the example simple:
BostonHousing[,sapply(BostonHousing, is.numeric)] -> df
nn <- train(medv ~ .,
data = df,
method = "neuralnet",
tuneGrid = grid,
metric = "RMSE",
preProc = c("center", "scale", "nzv"), #good idea to do this with neural nets - your error is due to non scaled data
trControl = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE)
)
The part
preProc = c("center", "scale", "nzv")
is essential for the algorithm to converge; neural nets don't like unscaled features.
It's super slow, though.
nn
#output
Neural Network
506 samples
12 predictor
Pre-processing: centered (12), scaled (12)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 405, 404, 404, 405, 406
Resampling results across tuning parameters:
layer1 layer2 RMSE Rsquared MAE
16 16 NaN NaN NaN
16 32 4.177368 0.8113711 2.978918
32 16 3.978955 0.8275479 2.822114
32 32 3.923646 0.8266605 2.783526
Tuning parameter 'layer3' was held constant at a value of 8
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were layer1 = 32, layer2 = 32 and layer3 = 8.
I have fitted an averaged neural network in R with caret; see the code below. Does the term "averaged" mean that the average is based on the outcomes of 1000 neural networks, since there are 1000 iterations (maxit = 1000) in this case?
Thanks.
library(AppliedPredictiveModeling)
data(solubility)
### Create a control function that will be used across models. We
### create the fold assignments explicitly instead of relying on the
### random number seed being set to identical values.
library(caret)
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
################################################################################
### Section 7.1 Neural Networks
### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.
### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. We
### estimate the memory usage (VSIZE = total memory size) to be
### 2677M/core.
library(doMC)
registerDoMC(10)
library(caret)
nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
size = c(1, 3, 5, 7, 9, 11, 13),
bag = FALSE)
set.seed(100)
nnetTune <- train(x = solTrainXtrans, y = solTrainY,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
maxit = 1000,
allowParallel = FALSE)
nnetTune
plot(nnetTune)
testResults <- data.frame(obs = solTestY,
NNet = predict(nnetTune, solTestXtrans))
################################################################################
See also:
https://scientistcafe.com/post/nnet.html
avNNet is a model where the same neural network model is fit using different random number seeds. All the resulting models are used for prediction. For regression, the outputs from the networks are averaged. For classification, the model scores are first averaged, then translated to predicted classes. Source.
The number of models fit is controlled by the argument repeats which is passed down to the model in caret via ...
repeats - the number of neural networks with different random number seeds. By default this is set to 5, so five models will be averaged. In caret's definition of the model I do not see this default being changed.
If the bag argument is set to TRUE, model fitting and aggregation are performed by bootstrap aggregation (bagging), which in my opinion is almost guaranteed to provide better predictive performance if the number of models is high enough.
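A sketch of how repeats (and bag) could be adjusted in the call from the question; this assumes train() forwards repeats through ... to avNNet, the same way maxit, linout and trace reach nnet, and it reuses the objects (solTrainXtrans, solTrainY, nnetGrid, ctrl) defined in the question's code:
set.seed(100)
nnetTune10 <- train(x = solTrainXtrans, y = solTrainY,
                    method = "avNNet",
                    repeats = 10,         # average 10 networks instead of the default 5
                    tuneGrid = nnetGrid,  # setting bag = TRUE in this grid switches to bagging
                    trControl = ctrl,
                    preProc = c("center", "scale"),
                    linout = TRUE,
                    trace = FALSE,
                    MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
                    maxit = 1000,
                    allowParallel = FALSE)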
I'd like to know if it is possible to estimate the RMSE variance of the model trained in each K-fold on the test data.
I'm using the crantastic caret package.
For example:
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 20,
allowParallel = TRUE)
gbmGrid <- expand.grid(interaction.depth = seq(9, 11, by = 1),
n.trees = seq(700, 900, by = 25),
shrinkage = c(0.01, 0.025, 0.05))
set.seed(100)
gbmFit <- train(R ~ .,
data = trainSet,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
gbmFit$resample gives me the RMSE of each fold, but I'd like to have the RMSE of the model trained in each fold evaluated on the test data. Is that possible?
To illustrate this further: based on the best model from train, one can predict the values of the output variable for a test set:
predict(gbmFit, testSet)
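For example, the test-set RMSE of that single best model could be computed from the prediction like this (a sketch; it assumes the outcome column in testSet is named R, matching the formula R ~ .):
pred_best <- predict(gbmFit, testSet)
sqrt(mean((pred_best - testSet$R)^2))  # test RMSE, but only for the final/best tune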
However it would be nice to do this for the ixk models (20x5) used in train and not only for the best model. For models which are very sensitive to model parameters this is a necessity. Given the two best models out of CV, they can have very different accuracy values for the test set. Perhaps the second best model out of CV is the overall best model. See http://ai.stanford.edu/~ang/papers/cv-final.pdf