Can I use random forest to do feature selection with caret in R?

I am using the caret R package to do model training; I am totally new to machine learning.
I don't know if I can use the idea shown below to do feature selection and train a model.
My code is shown below.
The idea is that I first fit a random forest with the train function, then select the top 20 most important features (based on the varImp function), and re-train the model using only these top 20 features.
I am not sure whether this method works.
1. First, I train a random forest model with all features:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T)
##################################### train model
set.seed(1234)
model <- train(Bgroup ~ .,
               data = all_data,
               method = "rf",
               preProc = c("center", "scale", "nzv"),
               trControl = ctrl,
               tuneLength = 10,
               metric = "ROC")
##################################### based on this model, I can get a plot of feature importance
feature_importance <- varImp(model)
##################################### I only selected the top 20 important features
importance_df <- data.frame(feature_importance$importance,
                            feature = rownames(feature_importance$importance))
top20 <- head(importance_df[order(importance_df[, 1], decreasing = T), ], n = 20) %>%
  .$feature %>%
  gsub("`", "", .)
##################################### top20 data selection
data_top20 <- all_data[,top20]
data_top20$group <- all_data$Bgroup
##################################### re-train model again based on these 20 features
set.seed(1234)
model_top20 <- train(group ~ .,
                     data = data_top20,
                     method = "rf",
                     preProc = c("center", "scale", "nzv"),
                     trControl = ctrl,
                     tuneLength = 10,
                     metric = "ROC")
### calculate performance (predictions are stored in the model because savePredictions = T)
a <- filter(model_top20$pred, mtry == 4)
confusionMatrix(a$pred, a$obs, positive = "positive")
Is this idea okay? I would really appreciate any suggestions.
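A minimal sketch of a closely related alternative (not from the original post; the subset sizes are illustrative): caret's built-in rfe() performs the same "rank by importance, refit on a subset" loop, but inside the resampling, so the chosen subset is not evaluated on the same data that selected it.
rfe_ctrl <- rfeControl(functions = rfFuncs,  # random-forest ranking and refitting
                       method = "cv",
                       number = 5)
set.seed(1234)
rfe_fit <- rfe(x = all_data[, setdiff(names(all_data), "Bgroup")],
               y = all_data$Bgroup,
               sizes = c(10, 20, 30),        # candidate subset sizes to compare
               rfeControl = rfe_ctrl)
predictors(rfe_fit)                          # the selected features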

Related

Listing model coefficients in descending order

I have a dataset with both continuous and categorical variables. I am running regression to predict one of the variables based on the other variables in the dataset. After comparing the results of ridge, lasso and elastic-net regression, the lasso regression is the best model to proceed with.
I used the 'coef' function to extract the model's coefficients; however, the result is a very long list with over 800 variables (as some of my categorical variables have many levels). Is there a way I can quickly rank the coefficients from largest to smallest? This is a glmnet model output.
Reproducible problem with example code:
# Libraries Needed
library(caret)
library(glmnet)
library(mlbench)
library(psych)
# Data
data("BostonHousing")
data <- BostonHousing
str(data)
# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = T)
# Linear Model
set.seed(1234)
lm <- train(medv ~ .,
            train,
            method = 'lm',
            trControl = custom)
# Results
lm$results
lm
summary(lm)
plot(lm$finalModel)
# Ridge Regression
set.seed(1234)
ridge <- train(medv ~ .,
               train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0,
                                      lambda = seq(0.0001, 1, length = 5)), # try 5 values for lambda between 0.0001 and 1
               trControl = custom)
# increasing lambda = increasing penalty and vice versa
# increasing lambda therefore causes the coefficients to shrink
# Plot Results
plot(ridge)
plot(ridge$finalModel, xvar = "lambda", label = T)
plot(ridge$finalModel, xvar = 'dev', label=T)
plot(varImp(ridge, scale=T))
# Lasso Regression
set.seed(1234)
lasso <- train(medv ~ .,
               train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 1,
                                      lambda = seq(0.0001, 1, length = 5)),
               trControl = custom)
# Plot Results
plot(lasso)
lasso
plot(lasso$finalModel, xvar = 'lambda', label=T)
plot(lasso$finalModel, xvar = 'dev', label=T)
plot(varImp(lasso, scale=T))
# Elastic Net Regression
set.seed(1234)
en <- train(medv ~ .,
            train,
            method = 'glmnet',
            tuneGrid = expand.grid(alpha = seq(0, 1, length = 10),
                                   lambda = seq(0.0001, 1, length = 5)),
            trControl = custom)
# Plot Results
plot(en)
plot(en$finalModel, xvar = 'lambda', label=T)
plot(en$finalModel, xvar = 'dev', label=T)
plot(varImp(en))
# Compare Models
model_list <- list(LinearModel = lm, Ridge = ridge, Lasso = lasso, ElasticNet=en)
res <- resamples(model_list)
summary(res)
bwplot(res)
xyplot(res, metric = 'RMSE')
# Best Model
en$bestTune
best <- en$finalModel
coef(best, s = en$bestTune$lambda)
For most models all you'd have to do would be:
sort(coef(model), decreasing=TRUE)
Since you're using glmnet it's a little bit more complicated. I'm going to replicate a minimal version of your example here (the other models, plots, etc. aren't necessary to reproduce your problem ...)
## Packages
library(caret)
library(glmnet)
library(mlbench) ## for BostonHousing data
# Data
data("BostonHousing")
data <- BostonHousing
# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = TRUE)
# Elastic Net Regression
set.seed(1234)
en <- train(medv ~ .,
            train,
            method = 'glmnet',
            tuneGrid = expand.grid(alpha = seq(0, 1, length = 10),
                                   lambda = seq(0.0001, 1, length = 5)),
            trControl = custom)
# Best Model
best <- en$finalModel
coefs <- coef(best, s = en$bestTune$lambda)
(This could probably be made simpler: for example, do you really need the custom control parameters to show us the example? This would be even simpler without caret, just using `glmnet` directly, but I was afraid I might leave something out.)
Once you've got the coefficients, sorting does appear to work, albeit with a message about possible inefficiency:
sort(coefs, decreasing=TRUE)
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] 25.191049410 5.078589706 1.389548822 0.244605193 0.045600250
## [6] 0.008840485 0.004372752 -0.012701593 -0.028337745 -0.162794401
## [11] -0.335062819 -0.901475516 -1.395091095 -12.632336419
sort(as.numeric(coefs)) also appears to work fine.
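If you want to keep the variable names attached while sorting (a small extra step, not shown in the original answer), you can convert the one-column sparse matrix to an ordinary matrix first:
cc <- as.matrix(coefs)                            # dgCMatrix -> one-column matrix, row names preserved
coefs_sorted <- sort(cc[, 1], decreasing = TRUE)  # named numeric vector, largest first
head(coefs_sorted)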
If you want to sort the entire matrix (i.e. keeping the values for all penalization levels), you can take advantage of the fact that the penalization doesn't change the rank-order of the parameters:
coeftab <- coef(best)
lastvals <- coeftab[, ncol(coeftab)]
coeftab_s <- coeftab[order(lastvals, decreasing = TRUE), ]
## plot, leaving out the intercept
matplot(t(coeftab_s)[,-1],type="l")

Extract the Intercept from a Caret LASSO Model

I'm using the caret package in R to fit a LASSO regression model. My code runs fine; however, I would like to extract the intercept for the final model so I can build a scoring key using the selected predictors and coefficients.
For example, if "Extraversion" is the variable I am trying to model using survey items, I would like to produce the following scoring key:
Intercept + Survey_Item_1*Slope + Survey_Item_2*Slope + and so on
FWIW, I am able to extract the coefficients for the predictors.
My code for reference:
##Create Training & test set
set.seed(9808)
ind <- sample(0:1, nrow(df), replace=T, prob=c(.75,.25))
train <- df[ind==0,]
test <- df[ind==1,]
ctrl <- trainControl(method = "repeatedcv", number=5, repeats = 5)
##Train Lasso model
fit.lasso <- train(Extraversion ~ ., data = train, method = "lasso",
                   preProc = c('scale', 'center', 'nzv'), trControl = ctrl)
fit.lasso
predict.enet(fit.lasso$finalModel, type='coefficients', s=fit.lasso$bestTune$fraction, mode='fraction')
##Fit models to test data
lasso_test<- predict(fit.lasso, newdata=test, na.action="na.pass")
postResample(pred = lasso_test, obs = test[,c(1)])
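For the intercept itself, a minimal sketch of one alternative (an assumption, not taken from the post; it reuses the same train and ctrl objects and an illustrative lambda grid) is to fit the lasso through method = "glmnet", since coef() on a glmnet fit returns the intercept together with the slopes:
fit.lasso.glmnet <- train(Extraversion ~ ., data = train,
                          method = "glmnet",
                          preProc = c('scale', 'center', 'nzv'),
                          tuneGrid = expand.grid(alpha = 1,  # alpha = 1 is the lasso penalty
                                                 lambda = seq(0.0001, 1, length = 20)),
                          trControl = ctrl)
cf <- coef(fit.lasso.glmnet$finalModel, s = fit.lasso.glmnet$bestTune$lambda)
cf["(Intercept)", ]   # intercept for the scoring key; the remaining rows are the slopes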

R: Feature Selection with Cross Validation using Caret on Logistic Regression

I am currently learning how to implement logistic regression in R.
I have taken a data set and split it into a training and test set, and I wish to implement forward selection, backward selection and best subset selection using cross-validation to select the best features.
I am using caret to implement cross-validation on the training data set and then testing the predictions on the test data.
I have seen the rfe control in caret and have also had a look at the documentation on the caret website, as well as following the links in the question How to use wrapper feature selection with algorithms in R?. It isn't apparent to me how to change the type of feature selection, as it seems to default to backward selection. Can anyone help me with my workflow? Below is a reproducible example.
library("caret")
# Create an Example Dataset from German Credit Card Dataset
mydf <- GermanCredit
# Create Train and Test Sets 80/20 split
trainIndex <- createDataPartition(mydf$Class, p = .8,
                                  list = FALSE,
                                  times = 1)
train <- mydf[ trainIndex,]
test <- mydf[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE)
mod_fit <- train(Class ~ ., data = train,
                 method = "glm",
                 family = "binomial",
                 trControl = ctrl,
                 tuneLength = 5)
# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)
# Test the new model on new and unseen Data for reproducibility
pred = predict(mod_fit, newdata=test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy))/sum(accuracy)
You can specify this directly in the train() call. For backward stepwise selection, the code below is sufficient:
trControl <- trainControl(method = "cv",
                          number = 5,
                          savePredictions = T,
                          classProbs = T,
                          summaryFunction = twoClassSummary)
caret_model <- train(Class ~ .,
                     train,
                     method = "glmStepAIC",  # This method fits the best model stepwise.
                     family = "binomial",
                     direction = "backward", # Direction
                     trControl = trControl)
Note that in trControl:
method = "cv",                     # No need for "repeatedcv" here; the number defined afterwards sets the k in k-fold CV.
classProbs = T,
summaryFunction = twoClassSummary  # Gives back ROC, sensitivity and specificity of the chosen model.
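If you specifically want the wrapper-style selection mentioned in the question, a minimal sketch with caret's rfe() and its built-in logistic-regression functions looks like this (the subset sizes are illustrative, not taken from the answer); note that rfe performs recursive backward elimination, dropping the least important predictors at each step:
rfe_ctrl <- rfeControl(functions = lrFuncs,  # GLM-based ranking and refitting
                       method = "cv",
                       number = 10)
set.seed(123)
rfe_fit <- rfe(x = train[, setdiff(names(train), "Class")],
               y = train$Class,
               sizes = c(5, 10, 20),         # candidate subset sizes to compare
               rfeControl = rfe_ctrl)
predictors(rfe_fit)                          # features retained at the best subset size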

Hyper-parameter tuning using pure ranger package in R

Love the speed of the ranger package for random forest model creation, but can't see how to tune mtry or number of trees. I realize I can do this via caret's train() syntax, but I prefer the speed increase that comes from using pure ranger.
Here's my example of basic model creation using ranger (which works great):
library(ranger)
data(iris)
fit.rf = ranger(
  Species ~ .,
  training_data = iris,
  num.trees = 200
)
print(fit.rf)
Looking at the official documentation for tuning options, it seems like the csrf() function may provide the ability to tune hyper-parameters, but I can't get the syntax right:
library(ranger)
data(iris)
fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)
print(fit.rf.tune)
Results in:
Error in ranger(Species ~ ., training_data = iris, num.trees = 200) :
unused argument (training_data = iris)
And I'd prefer to tune with the regular (read: non-csrf) rf algorithm ranger provides. Any idea as to a hyper-parameter tuning solution for either path in ranger? Thank you!
To answer my (unclear) question, apparently ranger has no built-in CV/GridSearch functionality. However, here's how you do hyper-parameter tuning with ranger (via a grid search) outside of caret. Thanks goes to Marvin Wright (the maintainer of ranger) for the code. Turns out caret CV with ranger was slow for me because I was using the formula interface (which should be avoided).
ptm <- proc.time()
library(ranger)
library(mlr)
# Define task and learner
task <- makeClassifTask(id = "iris",
                        data = iris,
                        target = "Species")
learner <- makeLearner("classif.ranger")
# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", 3, 4),
                   makeDiscreteParam("num.trees", 200))
# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
                 control = makeTuneControlGrid())
# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, iris.task)
print(m)
print(proc.time() - ptm) # ~6 seconds
For the curious, the caret equivalent is
ptm <- proc.time()
library(caret)
data(iris)
grid <- expand.grid(mtry = c(3,4))
fitControl <- trainControl(method = "CV",
                           number = 5,
                           verboseIter = TRUE)
fit = train(
  x = iris[, names(iris) != 'Species'],
  y = iris[, names(iris) == 'Species'],
  method = 'ranger',
  num.trees = 200,
  tuneGrid = grid,
  trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds
Overall, caret is the fastest way to do a grid search with ranger if one uses the non-formula interface.
I think there are at least two errors:
First, the function ranger does not have a parameter called training_data. Your error message Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : unused argument (training_data = iris) refers to that. You can see that when you look at ?ranger or args(ranger).
Second, the function csrf, on the other hand, has training_data as input, but also requires test_data. Most importantly, these two arguments do not have any defaults, implying that you must provide them. The following works without problems:
fit.rf = ranger(
  Species ~ ., data = iris,
  num.trees = 200
)
fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  test_data = iris,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)
Here, I have just provided iris as both the training and the test dataset. You would obviously not want to do that in your real application. Moreover, note that ranger also takes num.trees and mtry as input, so you could try tuning them there.
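For example (a tiny illustration, not from the original answer), both can be passed straight to a plain ranger() call:
fit.rf2 <- ranger(Species ~ ., data = iris, num.trees = 500, mtry = 2)
fit.rf2$prediction.error   # OOB error for this particular setting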
Note that mlr per default disables the internal parallelization of ranger. Set hyperparameter num.threads to the number of cores available to speed mlr up:
learner <- makeLearner("classif.ranger", num.threads = 4)
Alternatively, start a parallel backend via
parallelStartMulticore(4) # linux/osx
parallelStartSocket(4) # windows
before calling tuneParams to parallelize the tuning.
Another way to tune the model is to create a manual grid. There may be better ways to train the model, but this offers a different option.
hyper_grid <- expand.grid(
  mtry = 1:4,
  node_size = 1:3,
  num.trees = seq(50, 500, 50),
  OOB_RMSE = 0
)
system.time(
  for (i in 1:nrow(hyper_grid)) {
    # train model
    rf <- ranger(
      formula = Species ~ .,
      data = iris,
      num.trees = hyper_grid$num.trees[i],
      mtry = hyper_grid$mtry[i],
      min.node.size = hyper_grid$node_size[i],
      importance = 'impurity')
    # add OOB error to grid
    hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)
  })
   user  system elapsed
   3.17    0.19    1.36
nrow(hyper_grid) # 120 models
position = which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE),],5)
   mtry node_size num.trees     OOB_RMSE
6     2         2        50 0.1825741858
23    3         3       100 0.1825741858
3     3         1        50 0.2000000000
11    3         3        50 0.2000000000
14    2         1       100 0.2000000000
# fit best model
rf.model <- ranger(Species ~ ., data = iris,
                   num.trees = hyper_grid$num.trees[position],
                   importance = 'impurity',
                   probability = FALSE,
                   min.node.size = hyper_grid$node_size[position],
                   mtry = hyper_grid$mtry[position])
rf.model
Ranger result
Call:
ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position])
Type: Classification
Number of trees: 50
Sample size: 150
Number of independent variables: 4
Mtry: 2
Target node size: 2
Variable importance mode: impurity
Splitrule: gini
OOB prediction error: 5.33 %
I hope this helps.
There is also the tuneRanger R package, which is specifically designed for tuning ranger and uses predefined tuning parameters, hyperparameter spaces and intelligent tuning based on the out-of-bag observations.
Note that random forest is not an algorithm where tuning usually makes a big difference, but it can usually improve the performance a bit.
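A minimal sketch of what a tuneRanger call can look like, assuming the argument names from the package documentation (treat the specific settings as illustrative); it takes an mlr task and tunes mtry, min.node.size and sample.fraction against the OOB error:
library(tuneRanger)
library(mlr)
iris.task <- makeClassifTask(data = iris, target = "Species")
res <- tuneRanger(iris.task, num.trees = 200, num.threads = 2, iters = 30)
res   # recommended hyperparameters plus a ranger model refit with them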

Different results with randomForest() and caret's randomForest (method = "rf")

I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.
Although I’ve read the caret package website (http://topepo.github.io/caret/index.html), as well as various StackOverflow questions that seem potentially relevant, I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset from the MASS package.
library(MASS)
data(CO2)
library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")
library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
                         data = CO2,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob"),
                         allowParallel = FALSE)
set.seed(1)
caret.boot.model <- train(uptake ~ .,
                          data = CO2,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", number = 50),
                          allowParallel = FALSE)
print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)
Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest, you should use the non-formula interface.
In your case, you should provide a seed inside trainControl to get the same result as in randomForest.
In the training section of the caret webpage, there are some notes on reproducibility that explain how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob", seed = 1),
                         allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
# 25 bootstrap resamples + 1 final model = 26 seeds; with a single tuning
# combination (mtry = 2), one integer per resample is enough.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1
caret.boot.model <- train(CO2[, -5], CO2$uptake,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", seeds = seeds),
                          allowParallel = FALSE)
Defining the non-formula interface correctly with caret, and setting the seed in trainControl, you will get the same results in all three models:
rf.model
caret.oob.model$final
caret.boot.model$final
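More generally (a sketch based on the seeds description in ?trainControl, not part of the original answer): with B resamples and k candidate tuning combinations, seeds is a list of length B + 1 whose first B elements are integer vectors of length k, followed by a single integer for the final model.
B <- 25   # e.g. 25 bootstrap resamples
k <- 3    # e.g. a tuning grid with three mtry values
set.seed(1)
seeds <- vector(mode = "list", length = B + 1)
for (i in seq_len(B)) seeds[[i]] <- sample.int(10000, k)
seeds[[B + 1]] <- sample.int(10000, 1)   # seed used when fitting the final model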
