Keeping one parameter fixed and searching the other randomly in caret - R

I would like to keep the parameter alpha fixed at 1 and use random search for lambda, is this possible?
library(caret)
X <- iris[, 1:4]
Y <- iris[, 5]
fit_glmnet <- train(X, Y, method = "glmnet", tuneLength = 2, trControl = trainControl(search = "random"))

I do not think this can be achieved by specifying it directly in caret's train, but here is how to emulate the desired behavior.
From this link one can see that random search for lambda is achieved by:
lambda = 2^runif(len, min = -10, 3)
where len is the tune length
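If the link is unavailable, the same grid-generating function can be pulled straight from caret (the same getModelInfo() pattern is used further down this page); its random-search branch is where the 2^runif(...) line above comes from:
library(caret)
getModelInfo("glmnet", regex = FALSE)[[1]]$grid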
To emulate random search over one parameter:
len <- 2
fit_glmnet <- train(X, Y,
                    method = "glmnet",
                    tuneLength = len,
                    trControl = trainControl(search = "grid"),
                    tuneGrid = data.frame(alpha = 1,
                                          lambda = 2^runif(len, min = -10, 3)))

First off, I'm not sure you can use a random search and fix specific tuning parameters.
However, as an alternative you could use a grid search for optimising tuning parameters instead of a random search. You can then fix tuning parameters using tuneGrid:
fit <- train(X,
             Y,
             method = "glmnet",
             tuneLength = 2,
             trControl = trainControl(search = "grid"),
             tuneGrid = data.frame(alpha = 1, lambda = 10^seq(-4, -1, by = 0.5)))
fit
#glmnet
#
#150 samples
# 4 predictor
# 3 classes: 'setosa', 'versicolor', 'virginica'
#
#No pre-processing
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
#Resampling results across tuning parameters:
#
# lambda Accuracy Kappa
# 0.0001000000 0.9398036 0.9093246
# 0.0003162278 0.9560817 0.9336278
# 0.0010000000 0.9581838 0.9368050
# 0.0031622777 0.9589165 0.9379580
# 0.0100000000 0.9528997 0.9288533
# 0.0316227766 0.9477923 0.9212374
# 0.1000000000 0.9141015 0.8709753
#
#Tuning parameter 'alpha' was held constant at a value of 1
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were alpha = 1 and lambda = 0.003162278.

Related

Hyperparameters not changing results from random forest regression trees

I am trying to tune the hyperparameters of a random forest regression model and all of the accuracy measures are exactly the same, regardless of changes to hyperparameters. I've tested the same code on the "diamonds" dataset and have been able to reproduce the problem. Here is my code:
train = diamonds[,c(1, 5, 8:10)]
x = c(1:6)
folds = sample(x, size = nrow(diamonds), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)
Here's what I get for results:
What the heck?
UPDATE:
I reduced the sample size to 1000 to allow for faster processing and got different results, still all identical to each other. Code:
train = diamonds[,c(1, 5, 8:10)]
train = train[c(1:1000),]
x = c(1:6)
folds = sample(x, size = nrow(train), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)
Results:
This seems to be an issue with your cross-validation folds. When I run your code and look at the results of model it says:
Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...
indicating that each fold only has a sample size of 1.
I think if you define folds like this, it will work more like you're expecting it to:
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
The results then look like this:
Random Forest
1000 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 832, 833, 835, 834, 834, 832, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.01582362 0.9933839 0.00985451
3 0.01601980 0.9932625 0.00994588
4 0.01567161 0.9935624 0.01018242
Tuning parameter 'splitrule' was held constant at a value of variance
Tuning parameter 'min.node.size' was held constant at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 4, splitrule = variance and min.node.size = 20.
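For completeness, a quick sketch of the difference between the two ways of building folds (a minimal check, assuming diamonds from ggplot2 and the same 1000-row subset as in the question; bad_folds and good_folds are just illustrative names):
library(caret)
library(ggplot2)   # for the diamonds data

train <- diamonds[1:1000, c(1, 5, 8:10)]

# A vector of fold labels: trainControl(index = ...) treats each element as the
# training rows of one resample, so every "resample" trains on a single row.
bad_folds <- sample(1:6, size = nrow(train), replace = TRUE)

# What index actually expects: a list with one element per resample, holding the
# row numbers used for training in that resample.
good_folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
lengths(good_folds)   # each training set holds roughly 5/6 of the rows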

Elastic net issue in R - Error in check_dims(x = x, y = y) : nrow(x) == n is not TRUE

Error: nrow(x) == n is not TRUE
I am not sure what "n" is referring to in this case. Here is the code throwing the error:
# BUILD MODEL
set.seed(9353)
elastic_net_model <- train(x = predictors, y = y,
                           method = "glmnet",
                           family = "binomial",
                           preProcess = c("scale"),
                           tuneLength = 10,
                           metric = "ROC",
                           # metric = "Spec",
                           trControl = train_control)
The main problem that others were running into with this error is that their y variable was not a factor or numeric. They were often passing it as a matrix or dataframe. I explicitly make my y a factor, shown here:
# Make sure that the outcome variable is a two-level factor
dfBlocksAll$trophout1 = as.factor(dfBlocksAll$trophout1)
# Set levels for dfBlocksAll$trophout1
levels(dfBlocksAll$trophout1) <- c("NoTrophy", "Trophy")
# Split the data into training and test set, 70/30 split
set.seed(1934)
index <- createDataPartition(y = dfBlocksAll$trophout1, p = 0.70, list = FALSE)
training <- dfBlocksAll[index, ]
testing <- dfBlocksAll[-index, ]
# This step is the heart of the process
y <- dfBlocksAll$trophout1 # outcome variable - did they get a trophy or not?
predictors <- training[,which(colnames(training) != "trophout1")]
The only other potentially relevant code that comes before the block throwing the error is this:
train_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              # sampling = "down",
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              allowParallel = TRUE,
                              savePredictions = "final",
                              verboseIter = FALSE)
Since my y is already a factor, I assume that my error has something to do with the x, not the y. As you can see from the code, my x is a data frame called "predictors". This data frame contains 768 obs. of 67 variables and is filled with characters and numerics.
Your response variable has to come from the training set; here I use an example dataset:
dfBlocksAll = data.frame(matrix(runif(1000),ncol=10))
dfBlocksAll$trophout1 = factor(sample(c("NoTrophy", "Trophy"),100,replace=TRUE))
index <- createDataPartition(y = dfBlocksAll$trophout1, p = 0.70, list = FALSE)
training <- dfBlocksAll[index, ]
testing <- dfBlocksAll[-index, ]
And this part should be changed:
y <- training$trophout1
predictors <- training[,which(colnames(training) != "trophout1")]
And the rest runs pretty ok:
elastic_net_model <- train(x = predictors, y = y,
                           method = "glmnet",
                           family = "binomial",
                           preProcess = c("scale"),
                           tuneLength = 10,
                           metric = "ROC",
                           trControl = train_control)
elastic_net_model
glmnet
71 samples
10 predictors
2 classes: 'NoTrophy', 'Trophy'
Pre-processing: scaled (10)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 65, 64, 64, 63, 64, 64, ...
Resampling results across tuning parameters:
alpha lambda ROC Sens Spec
0.1 0.0003090198 0.5620833 0.5908333 0.51666667
0.1 0.0007138758 0.5620833 0.5908333 0.51666667
0.1 0.0016491457 0.5614583 0.5908333 0.51083333
0.1 0.0038097407 0.5594444 0.5933333 0.51083333
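As a side note, the "nrow(x) == n" message is just a dimension check: the number of rows of x must equal the number of observations in y. With y taken from the full data (as in the question) and predictors taken from the training split, the two cannot match, which a quick check on the example data above makes clear:
length(dfBlocksAll$trophout1)   # all rows of the full data
nrow(predictors)                # only ~70% of the rows (training split)

# after the fix, both come from the training split
length(y) == nrow(predictors)   # TRUE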

R caret (svmRadial) keep sigma constant and use grid search for C

I am implementing a Support Vector Machine with a Radial Basis Function kernel ('svmRadial') with caret. As far as I understand the documentation and the source code, caret uses an analytical formula to get a reasonable estimate of sigma and fixes it at that value (according to the output: Tuning parameter 'sigma' was held constant at a value of 0.1028894). In addition, caret cross-validates over a set of cost parameters C (3 values by default).
However, if I now want to set my own grid of cost parameters (tuneGrid), I have to additionally specify a value of sigma. Otherwise the following error appears:
Error: The tuning parameter grid should have columns sigma, C
How can I fix Sigma based on the analytical formula and still implement my own grid of cost parameters C?
Here is a MWE:
library(caret)
library(mlbench)
data(BostonHousing)
set.seed(1)
index <- sample(nrow(BostonHousing),nrow(BostonHousing)*0.75)
Boston.train <- BostonHousing[index,]
Boston.test <- BostonHousing[-index,]
# without tuneGrid
set.seed(1)
svmR <- train(medv ~ .,
              data = Boston.train,
              method = "svmRadial",
              preProcess = c("center", "scale"),
              trControl = trainControl(method = "cv", number = 5))
# with tuneGrid (gives the error message)
set.seed(1)
svmR <- train(medv ~ .,
              data = Boston.train,
              method = "svmRadial",
              preProcess = c("center", "scale"),
              tuneGrid = expand.grid(C = c(0.01, 0.1)),
              trControl = trainControl(method = "cv", number = 5))
If you look at the model information, it shows how the grid is generated if you don't provide one:
getModelInfo("svmRadial")$svmRadial$grid
function(x, y, len = NULL, search = "grid") {
  sigmas <- kernlab::sigest(as.matrix(x), na.action = na.omit, scaled = TRUE)
  if(search == "grid") {
    out <- expand.grid(sigma = mean(as.vector(sigmas[-2])),
                       C = 2^((1:len) - 3))
  } else {
    rng <- extendrange(log(sigmas), f = .75)
    out <- data.frame(sigma = exp(runif(len, min = rng[1], max = rng[2])),
                      C = 2^runif(len, min = -5, max = 10))
  }
  out
}
So the way to get it is to estimate sigma using kernlab::sigest. First we pull out the model info for svmRadial:
models <- getModelInfo("svmRadial", regex = FALSE)[[1]]
Set up the input x and y since you are providing a formula:
preProcValues = preProcess(Boston.train, method = c("center", "scale"))
processData = predict(preProcValues,Boston.train)
x = model.matrix(medv ~ .,data=processData)[,-1]
y = processData$medv
And we use the grid function for this model; you can see the result is the same as in your output:
set.seed(1)
models$grid(x,y,3)
sigma C
1 0.1028894 0.25
2 0.1028894 0.50
3 0.1028894 1.00
svmR$results
sigma C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0.1028894 0.25 5.112750 0.7591398 2.982241 0.8569208 0.05387213 0.4032354
2 0.1028894 0.50 4.498887 0.8046234 2.594059 0.7823051 0.05357678 0.3644430
3 0.1028894 1.00 4.055564 0.8349416 2.402248 0.8403222 0.06825159 0.3732571
And this is what happens underneath:
set.seed(1)
sigmas = kernlab::sigest(as.matrix(x), na.action = na.omit, scaled = TRUE)
# from the code, you can see it takes the mean of the two extreme quantiles
mean(sigmas[-2])
[1] 0.1028894
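Putting the pieces together, one way to fix sigma at the sigest-based value and still tune your own grid of C values is a sketch like the following (it reuses x and Boston.train from above; sigma_fixed and the C values are illustrative, not part of the original answer):
set.seed(1)
sigma_fixed <- mean(kernlab::sigest(as.matrix(x), na.action = na.omit,
                                    scaled = TRUE)[-2])

set.seed(1)
svmR_customC <- train(medv ~ .,
                      data = Boston.train,
                      method = "svmRadial",
                      preProcess = c("center", "scale"),
                      tuneGrid = expand.grid(sigma = sigma_fixed,
                                             C = c(0.01, 0.1, 1, 10)),
                      trControl = trainControl(method = "cv", number = 5))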

Save Gradient Boosting Machine values obtained with Bootstrap

I am calculating gradient boosting variable importance, and I am resampling to see how the importance of each variable behaves.
But I can't correctly save each variable's name together with its importance calculated in each bootstrap replicate.
I'm doing this using a function that is called by the boot() command from the boot package.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)
df <- make_ames()
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  return(summary(gbm.fit)[, 2])
}
results_GBM <- boot(data = df,statistic = imp_gbm, R=100)
results_GBM$t0
I expect to save the bootstrap results with their variable names but I can only save the importance of variables without their names.
With summary.gbm, the default is to order the variables according to importance and to plot them. You need to set order = FALSE and plotit = FALSE; the returned variable importance then follows the order of the variables in the fit.
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  # use gbmfit because gbm.fit is a function
  gbmfit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
  names(o) <- gbmfit$var.names
  return(o)
}
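A quick usage sketch of how the names then carry over to the bootstrap output (keep in mind R = 100 refits 100 GBMs, so this is slow; results_GBM$t has no column names by default, but they can be copied from the named t0 vector):
results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
head(results_GBM$t0)                               # named vector of importances
colnames(results_GBM$t) <- names(results_GBM$t0)   # label each bootstrap column
head(results_GBM$t[, 1:3])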

Tuning XGboost parameters Using Caret - Error: The tuning parameter grid should have columns

I am using caret for modeling with "xgboost".
1- However, I get the following error:
"Error: The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample"
The code:
library(caret)
library(doParallel)
library(dplyr)
library(pROC)
library(xgboost)
## Create train/test indexes
## preserve class indices
set.seed(42)
my_folds <- createFolds(train_churn$churn, k = 10)
# Compare class distribution
i <- my_folds$Fold1
table(train_churn$churn[i]) / length(i)
my_control <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = my_folds
)
my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gammma = 1,
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)
set.seed(42)
model_xgb <- train(
  class ~ ., data = train_churn,
  metric = "ROC",
  method = "xgbTree",
  trControl = my_control,
  tuneGrid = my_grid)
2- I also want to get a prediction by averaging the predictions from the models fitted on each fold.
I know it's a tad late, but check your spelling of gamma in the grid of tuning parameters. You misspelled it as gammma (with triple m's).
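For completeness, the corrected grid looks like this (same values as in the question, only the column name changes):
my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gamma = 1,   # was misspelled as gammma
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)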
