Is there a way to create multiple random forest models by fine-tuning hyperparameters on the train data, check the test data performance of all the models, and store the results in a CSV file?
For example, I have one model with mtry = 6 and nodesize = 3, and another model with mtry = 10 and nodesize = 4. What I need to do is test the performance of these two models on the test data and store the key model metrics, like the confusion matrix, sensitivity, and specificity.
I have tried the following code:
library(randomForest)
library(caret)

train_performance <- data.frame()   # one row of metrics per model will be appended here
modellist <- list()
for (mtry in c(6, 11)) {
  for (nodesize in c(2, 3)) {
    fit_model <- randomForest(DV ~ ., train_final, mtry = mtry, importance = TRUE,
                              nodesize = nodesize,
                              sampsize = ceiling(.8 * nrow(train_final)),
                              proximity = TRUE, na.action = na.omit,
                              ntree = 500)
    Key_col <- paste0(mtry, "-", nodesize)
    modellist[[Key_col]] <- fit_model
    pred_train <- predict(fit_model, train_final)
    cf <- confusionMatrix(pred_train, train_final$DV, mode = 'everything', positive = '1')
    # append a row per model instead of overwriting the same row each iteration
    train_performance <- rbind(train_performance, data.frame(
      key         = Key_col,
      TN          = cf$table[1],
      FP          = cf$table[2],
      FN          = cf$table[3],
      TP          = cf$table[4],
      accuracy    = cf$overall[1],
      kappa       = cf$overall[2],
      sensitivity = cf$byClass[1],
      specificity = cf$byClass[2]
    ))
  }
}
Below is a sample approach using the caret package to tune and train a random forest model; it reports accuracy metrics for all of the models it evaluates:
library(randomForest)
library(mlbench)
library(caret)
# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]
# Create model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
output:
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.8138384 0.6209924 0.0747572 0.1569159
Tune Using Caret:
Random Search:
One search strategy that we can use is to try random values within a range.
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(seed)
mtry <- sqrt(ncol(x))
rf_random <- train(Class~., data=dataset, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
plot(rf_random)
output:
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
11 0.8218470 0.6365181 0.09124610 0.1906693
14 0.8140620 0.6215867 0.08475785 0.1750848
17 0.8030231 0.5990734 0.09595988 0.1986971
24 0.8042929 0.6002362 0.09847815 0.2053314
30 0.7933333 0.5798250 0.09110171 0.1879681
34 0.8015873 0.5970248 0.07931664 0.1621170
45 0.7932612 0.5796828 0.09195386 0.1887363
47 0.7903896 0.5738230 0.10325010 0.2123314
49 0.7867532 0.5673879 0.09256912 0.1899197
50 0.7775397 0.5483207 0.10118502 0.2063198
60 0.7790476 0.5513705 0.09810647 0.2005012
Grid Search:
Another search strategy is to define a grid of algorithm parameters to try.
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(seed)
tunegrid <- expand.grid(.mtry=c(1:15))
rf_gridsearch <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
plot(rf_gridsearch)
output:
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
1 0.8377273 0.6688712 0.07154794 0.1507990
2 0.8378932 0.6693593 0.07185686 0.1513988
3 0.8314502 0.6564856 0.08191277 0.1700197
4 0.8249567 0.6435956 0.07653933 0.1590840
5 0.8268470 0.6472114 0.06787878 0.1418983
6 0.8298701 0.6537667 0.07968069 0.1654484
7 0.8282035 0.6493708 0.07492042 0.1584772
8 0.8232828 0.6396484 0.07468091 0.1571185
9 0.8268398 0.6476575 0.07355522 0.1529670
10 0.8204906 0.6346991 0.08499469 0.1756645
11 0.8073304 0.6071477 0.09882638 0.2055589
12 0.8184488 0.6299098 0.09038264 0.1884499
13 0.8093795 0.6119327 0.08788302 0.1821910
14 0.8186797 0.6304113 0.08178957 0.1715189
15 0.8168615 0.6265481 0.10074984 0.2091663
There are many other ways to tune a random forest model and store the results of each model; the two above are the most widely used.
Moreover, you can also set these parameters manually and train and tune the model yourself.
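For the specific goal in the original question (comparing a handful of manually chosen mtry/nodesize combinations on the test set and writing the metrics to a CSV file), here is a minimal sketch; train_final is taken from the question, while test_final and the factor levels of DV (with '1' as the positive class) are assumptions:
library(randomForest)
library(caret)

# hyperparameter combinations to compare (values taken from the question)
param_grid <- expand.grid(mtry = c(6, 10), nodesize = c(3, 4))

results <- list()
for (i in seq_len(nrow(param_grid))) {
  m  <- param_grid$mtry[i]
  ns <- param_grid$nodesize[i]
  fit <- randomForest(DV ~ ., data = train_final, mtry = m, nodesize = ns, ntree = 500)

  # evaluate on the held-out test set (test_final is an assumed name)
  pred <- predict(fit, newdata = test_final)
  cf   <- confusionMatrix(pred, test_final$DV, positive = '1')

  # rows of cf$table are predictions, columns are the reference (levels '0', '1')
  results[[paste0(m, "-", ns)]] <- data.frame(
    key         = paste0(m, "-", ns),
    TN          = cf$table[1, 1],
    FP          = cf$table[2, 1],
    FN          = cf$table[1, 2],
    TP          = cf$table[2, 2],
    accuracy    = unname(cf$overall["Accuracy"]),
    kappa       = unname(cf$overall["Kappa"]),
    sensitivity = unname(cf$byClass["Sensitivity"]),
    specificity = unname(cf$byClass["Specificity"])
  )
}

test_performance <- do.call(rbind, results)
write.csv(test_performance, "test_performance.csv", row.names = FALSE)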
Related
I'm trying to create a boxplot of the distribution of RMSE over all resamples. The mean of the per-resample RMSEs equals the model's reported RMSE, so it would be interesting to show how that number is calculated. How can I obtain the predicted RMSE for each of the model's resamples? For example, with 5-fold CV:
Model RMSE: 5
Fold 1, 2, 3, 4, 5 = 5.02, 5.01, 5.00, 4.99, 4.98
# Load packages
library(mlbench)
library(caret)
# Load data
data(BostonHousing)
#Dividing the data into train and test set
set.seed(1)
sample <- createDataPartition(BostonHousing$medv, p=0.75, list = FALSE)
train <- BostonHousing[sample,]
test <- BostonHousing[-sample,]
control <- trainControl(method='repeatedcv', number=10, repeats=3, savePredictions=TRUE)
metric <- 'RMSE'
# some random model
set.seed(1)
example <- train(medv~., data=train, method='example', metric=metric,
preProc=c('center', 'scale'), trControl=control)
I know one can obtain the resampled results on the training data with example$resample.
Is there a similar built-in way to get the predicted RMSE for each resample?
Appreciate all help, thanks.
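The per-fold numbers whose mean is the reported model RMSE are already stored in example$resample, and with savePredictions they can also be recomputed from the held-out predictions. A minimal sketch, reusing the setup above but with method='rf' purely as a stand-in for the unspecified model and savePredictions='final' so only the best tune's predictions are kept:
control <- trainControl(method='repeatedcv', number=10, repeats=3, savePredictions='final')
set.seed(1)
example <- train(medv~., data=train, method='rf', metric=metric,
                 preProc=c('center', 'scale'), trControl=control)

# RMSE recomputed per resample from the saved hold-out predictions;
# these values match example$resample$RMSE and their mean is the reported RMSE
fold_rmse <- sapply(split(example$pred, example$pred$Resample),
                    function(d) RMSE(d$pred, d$obs))
mean(fold_rmse)

# distribution of RMSE over all resamples
boxplot(fold_rmse, ylab = 'RMSE per resample')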
I want to combine recursive feature elimination via rfe() with tuning and model selection via trainControl(), using the method rf (random forest). Instead of the standard summary statistic I would like to have the MAPE (mean absolute percentage error). Therefore I tried the following code using the ChickWeight data set:
library(caret)
library(randomForest)
library(MLmetrics)
# Compute MAPE instead of other metrics
mape <- function(data, lev = NULL, model = NULL){
mape <- MAPE(y_pred = data$pred, y_true = data$obs)
c(MAPE = mape)
}
# specify trainControl
trc <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid", savePred =T,
summaryFunction = mape)
# set up grid
tunegrid <- expand.grid(.mtry=c(1:3))
# specify rfeControl
rfec <- rfeControl(functions=rfFuncs, method="cv", number=10, saveDetails = TRUE)
set.seed(42)
results <- rfe(weight ~ Time + Chick + Diet,
sizes=c(1:3), # number of predictor subset sizes the algorithm should evaluate
data = ChickWeight,
method="rf",
ntree = 250,
metric= "RMSE",
tuneGrid=tunegrid,
rfeControl=rfec,
trControl = trc)
The code runs without errors. But where do I find the MAPE, which I defined as the summaryFunction in trainControl? Is trainControl executed or ignored?
How could I rewrite the code so that it does recursive feature elimination with rfe, tunes the hyperparameter mtry using trainControl within rfe, and at the same time computes an additional error measure (MAPE)?
trainControl is ignored, as its description
Control the computational nuances of the train function
would suggest. To use MAPE, you want
rfec$functions$summary <- mape
Then
rfe(weight ~ Time + Chick + Diet,
sizes = c(1:3),
data = ChickWeight,
method ="rf",
ntree = 250,
metric = "MAPE", # Modified
maximize = FALSE, # Modified
rfeControl = rfec)
#
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold)
#
# Resampling performance over subset size:
#
# Variables MAPE MAPESD Selected
# 1 0.1903 0.03190
# 2 0.1029 0.01727 *
# 3 0.1326 0.02136
# 53 0.1303 0.02041
#
# The top 2 variables (out of 2):
# Time, Chick.L
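For the second part of the question (tuning mtry with train inside rfe), one option is to switch from rfFuncs to caretFuncs, which wraps train and forwards arguments such as method, tuneGrid and trControl to it. The sketch below is an assumption-laden outline rather than a verified solution: the summary function is extended with defaultSummary so that train's own default metric (RMSE) still exists, the inner train call will select mtry by that default, and rfe picks the subset size by MAPE:
# MAPE on top of the default metrics, so both rfe and train find what they need
mape_summary <- function(data, lev = NULL, model = NULL){
  c(defaultSummary(data, lev, model),
    MAPE = MAPE(y_pred = data$pred, y_true = data$obs))
}

rfec2 <- rfeControl(functions = caretFuncs, method = "cv", number = 10)
rfec2$functions$summary <- mape_summary      # outer (rfe) resampling summary

trc2 <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     summaryFunction = mape_summary)   # inner (train) summary

set.seed(42)
results2 <- rfe(weight ~ Time + Chick + Diet,
                data = ChickWeight,
                sizes = c(1:3),
                metric = "MAPE",       # used by rfe to pick the subset size
                maximize = FALSE,
                rfeControl = rfec2,
                # everything below is passed through to train()
                method = "rf",
                ntree = 250,
                tuneGrid = expand.grid(mtry = c(1:3)),
                trControl = trc2)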
I'm using the caret package in R to fit a LASSO regression model. My code runs fine, however I would like to extract the Intercept for the final model so I can build a scoring key using the selected predictors and coefficients.
For example, if "Extraversion" is the variable I am trying to model using survey items, I would like to produce the following scoring key:
Intercept + Survey_Item_1*Slope + Survey_Item_2*Slope + and so on
FWIW, I am able to extract the coefficients for the predictors.
My code for reference:
library(caret)
library(elasticnet)   # provides predict.enet()

##Create Training & test set
set.seed(9808)
ind <- sample(0:1, nrow(df), replace=T, prob=c(.75,.25))
train <- df[ind==0,]
test <- df[ind==1,]
ctrl <- trainControl(method = "repeatedcv", number=5, repeats = 5)
##Train Lasso model
fit.lasso <- train(Extraversion ~ ., data=train, method="lasso",
                   preProc=c('scale','center','nzv'), trControl=ctrl)
fit.lasso
predict.enet(fit.lasso$finalModel, type='coefficients', s=fit.lasso$bestTune$fraction, mode='fraction')
##Fit models to test data
lasso_test<- predict(fit.lasso, newdata=test, na.action="na.pass")
postResample(pred = lasso_test, obs = test[,c(1)])
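Since the coefficients extracted above do not include the intercept, one alternative (an assumption, not what the question used) is to fit the LASSO through method="glmnet", because coef() on a glmnet model returns the intercept together with the slopes. The lambda grid below is purely illustrative:
library(caret)
library(glmnet)

set.seed(9808)
fit.glmnet <- train(Extraversion ~ ., data=train, method="glmnet",
                    preProc=c('scale','center','nzv'), trControl=ctrl,
                    tuneGrid=expand.grid(alpha = 1,                            # alpha = 1 -> LASSO penalty
                                         lambda = seq(0.0001, 0.5, length = 25)))  # illustrative grid

# coefficients at the selected lambda; the first row is "(Intercept)"
coef(fit.glmnet$finalModel, s = fit.glmnet$bestTune$lambda)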
I am using the random forest algorithm to predict a target variable "Y" that has 4 values.
The syntax below is used to create the model:
control <- trainControl(method="repeatedcv", number=2, repeats=1, search="random")
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(train))
model <- train(Target~., data=complete, method="rf", metric=metric, tuneLength=15, trControl=control)
But when I test the trained model on the test dataset it only gives accuracy close to 50%. Is there any way the accuracy can be increased to 70% or above?
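No setting guarantees a jump from roughly 50% to 70% accuracy, because that ceiling depends on how informative the predictors are, but two things worth trying before anything else are more thorough resampling than number=2, repeats=1 and an explicit grid of mtry values. A sketch reusing the question's object names (the grid values and ntree are assumptions to adjust to the data):
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(mtry = c(2, 4, 6, 8, 10, 15))   # assumed values; keep below the number of predictors

set.seed(7)
model <- train(Target~., data=complete, method="rf", metric="Accuracy",
               tuneGrid=tunegrid, ntree=1000, trControl=control)
print(model)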
I'm a little confused about how caret scores the test folds in k-fold cross-validation.
I'd like to generate a data frame or matrix containing the scored records of the ten test datasets in 10-fold cross validation.
For example, using the iris dataset to train a decision tree model:
install.packages("caret", dependencies=TRUE)
library(caret)
data(iris)
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE),
model <- train(Species ~ ., data=iris, trControl=train_control, method="rpart")
model$pred
The model$pred command lists predictions for ten folds in 450 records.
This doesn't seem right - shouldn't model$pred produce predictions for the 150 records in the ten test folds (1/10 * 150 = 15 records per test fold)? How are 450 records generated?
By default, train iterates over three values for the complexity parameter cp of rpart (see ?rpart.control):
library(caret)
data(iris)
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)
model <- train(Species ~ .,
data=iris,
trControl=train_control,
method="rpart")
nrow(model$pred)
# [1] 450
length(unique(model$pred$cp))
# [1] 3
You can change that for example by explicitly specifying cp=0.05:
model <- train(Species ~ .,
data=iris,
trControl=train_control,
method="rpart",
tuneGrid = data.frame(cp = 0.05))
nrow(model$pred)
# [1] 150
length(unique(model$pred$cp))
# [1] 1
or by using tuneLength=1 instead of the default 3:
model <- train(Species ~ .,
data=iris,
trControl=train_control,
method="rpart",
tuneLength = 1)
nrow(model$pred)
# [1] 150
length(unique(model$pred$cp))
# [1] 1
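Alternatively, keep the default tuning and filter the saved predictions down to the selected cp afterwards; with model referring to the first, default-tuned fit (450 rows in model$pred), this leaves exactly one hold-out prediction per record:
# keep only the hold-out predictions made with the cp that train selected
best_pred <- subset(model$pred, cp == model$bestTune$cp)
nrow(best_pred)
# [1] 150
table(best_pred$Resample)   # 15 held-out records per fold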