Variable importance with ranger - r

I trained a random forest using caret + ranger.
fit <- train(
y ~ x1 + x2
,data = total_set
,method = "ranger"
,trControl = trainControl(method="cv", number = 5, allowParallel = TRUE, verbose = TRUE)
,tuneGrid = expand.grid(mtry = c(4,5,6))
,importance = 'impurity'
)
Now I'd like to see the variable importance. However, none of these work:
> importance(fit)
Error in UseMethod("importance") : no applicable method for 'importance' applied to an object of class "c('train', 'train.formula')"
> fit$variable.importance
NULL
> fit$importance
NULL
> fit
Random Forest
217380 samples
32 predictors
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 173904, 173904, 173904, 173904, 173904
Resampling results across tuning parameters:
mtry RMSE Rsquared
4 0.03640464 0.5378731
5 0.03645528 0.5366478
6 0.03651451 0.5352838
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 4.
Any idea whether and how I can get it?
Thanks.

varImp(fit) will get it for you.
To figure that out, I looked at names(fit), which led me to names(fit$modelInfo) - then you'll see varImp as one of the options.

According to @fmalaussena:
set.seed(123)
ctrl <- trainControl(method = 'cv',
number = 10,
classProbs = TRUE,
savePredictions = TRUE,
verboseIter = TRUE)
rfFit <- train(Species ~ .,
data = iris,
method = "ranger",
importance = "permutation", #***
trControl = ctrl,
verbose = T)
You can pass either "permutation" or "impurity" to the importance argument.
A description of both values can be found here: https://alexisperrier.com/datascience/2015/08/27/feature-importance-random-forests-gini-accuracy.html

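For a model fitted directly with the 'ranger' package, you can get the importance with
fit$variable.importance
When the model was trained through caret, as in the question, the ranger object is stored in fit$finalModel, so use fit$finalModel$variable.importance instead.
As a side note, you can see all the available outputs for the model using str()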
str(fit)
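The two routes mentioned in the answers above can be compared side by side. The following is a minimal sketch, assuming the caret and ranger packages are installed; the object name fit_iris and the use of the iris data are illustrative choices, not part of the original question.

```r
library(caret)   # assumed installed, together with the ranger package

set.seed(123)
fit_iris <- train(
  Species ~ .,
  data = iris,
  method = "ranger",
  importance = "impurity",   # passed through to ranger(); without it no importance is stored
  trControl = trainControl(method = "cv", number = 5)
)

# Route 1: caret's accessor (values are scaled to 0-100 by default)
imp_caret <- varImp(fit_iris)
print(imp_caret)

# Route 2: raw importances from the underlying ranger object
imp_raw <- fit_iris$finalModel$variable.importance
print(sort(imp_raw, decreasing = TRUE))
```

Note that fit_iris$variable.importance is NULL here, as in the question, because the train object only wraps the ranger model.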

Related

R caret: "non-numeric argument to binary operator" in train with qrf

When I run a quantile regression forest model with caret::train, I get the following error: Error in { : task 1 failed - "non-numeric argument to binary operator".
When I set ntree to a higher number (in my reproducible example this would be ntree = 150), my code runs without errors.
This code
library(caret)
library(quantregForest)
data(segmentationData)
dat <- segmentationData[segmentationData$Case == "Train",]
dat <- dat[1:50,]
# predictors
preds <- dat[,c(5:ncol(dat))]
# convert all to numeric
preds <- data.frame(sapply(preds, function(x) as.numeric(as.character(x))))
# response variable
response <- dat[,4]
# set up error measures
sumfct <- function(data, lev = NULL, model = NULL){
RMSE <- sqrt(mean((data$pred - data$obs)^2, na.rm = TRUE))
c(RMSE = RMSE)
}
# specify folds
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
folds_train <- caret::createMultiFolds(y = dat$Cell,
k = 10,
times = 5)
# specify trainControl for tuning mtry with the created multifolds
finalcontrol <- caret::trainControl(search = "grid", method = "repeatedcv", number = 10, repeats = 5,
index = folds_train, savePredictions = TRUE, summaryFunction = sumfct)
# build grid for tuning mtry
tunegrid <- expand.grid(mtry = c(2, 10, sqrt(ncol(preds)), ncol(preds)/3))
# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds,
y = response,
method ="qrf",
ntree = 30, # with ntree = 150 it works
metric = "RMSE",
tuneGrid = tunegrid,
trControl = finalcontrol,
importance = TRUE,
keep.inbag = TRUE
)
produces the error. The model with my real data has ntree = 10000 and still the task is failing.
How can I fix this?
Where in the source code of caret can I find the conditions for the error message Error in { : task 1 failed - "non-numeric argument to binary operator"? Which part of the source code does the error message come from?
UPDATE:
I adapted my code with my real data according to the answer of StupidWolf, so it looks like this:
# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds,
y = response,
method ="qrf",
ntree = 30, # with ntree = 150 it works
metric = "RMSE",
sampsize = ceiling(length(response)*0.4),
tuneGrid = tunegrid,
trControl = finalcontrol,
importance = TRUE,
keep.inbag = FALSE
)
With my real data I still get the above error message, so in the worst case I had to lower sampsize to 0.1*length(response) in order to compute the model successfully. Setting only keep.inbag = FALSE still produced errors. I have up to 1500 predictors while the number of samples (rows) is only 50 to 60. I still don't understand what exactly causes the error message. I tried the model without the sampsize argument, always with keep.inbag = FALSE, and the error still occurred. Only setting sampsize very low ensured success.
How can I run the model successfully without setting sampsize? I actually wanted the bootstrap approach for the out of bag data sets and not the artificial sampsize of 40 % or 10% of my data set for training the forest.

You get the error because you used the option keep.inbag = TRUE. In the quantregForest code, line 95:
minoob <- min( apply(!is.na(valuesPredict),1,sum))
if(minoob<10) stop("need to increase number of trees for sufficiently many out-of-bag observations")
So it requires every observation to be out of bag (OOB) in at least 10 trees in order to keep the out-of-bag predictions. If your real data is huge, the ntree required to satisfy this is going to be huge as well.
If you are using caret for training, keeping the OOB predictions and having savePredictions = TRUE seems redundant. On the whole, OOB predictions might not be so useful since you will be using the test fold to predict anyway.
Another option, given the size of your data, is to tweak sampsize. In randomForest, each tree is fitted on sampsize observations drawn from the data (with replacement by default). If you lower this number, more observations are left out of each tree, which ensures there are enough OOB predictions. For instance, with the example data:
model <- caret::train(x = preds,
y = response,
method ="qrf",
ntree = 30, sampsize=17,
metric = "RMSE",
tuneGrid = tunegrid,
trControl = finalcontrol,
importance = TRUE,
keep.inbag = TRUE)
model
Quantile Random Forest
50 samples
57 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 44, 43, 44, 46, 45, 46, ...
Resampling results across tuning parameters:
mtry RMSE
2.000000 42.53061
7.549834 42.72116
10.000000 43.11533
19.000000 42.80340
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
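The effect of ntree and sampsize on the number of OOB trees per observation can be estimated with a little arithmetic. This is a rough sketch under the assumption that each tree draws sampsize rows with replacement and independently of the other trees, so the number of trees in which a given row is out of bag is binomial; the helper p_oob and the fold size of 45 are illustrative, not taken from quantregForest.

```r
# Probability that one observation is out of bag for a single tree when
# `sampsize` rows are drawn with replacement from `n` rows.
p_oob <- function(n, sampsize = n) (1 - 1 / n)^sampsize

n_train <- 45   # approximate size of one training fold in the example above
ntree   <- 30

for (s in c(n_train, 17)) {
  p <- p_oob(n_train, s)
  cat(sprintf(
    "sampsize = %2d: P(OOB) = %.3f, expected OOB trees = %.1f, P(fewer than 10) = %.3f\n",
    s, p, ntree * p, pbinom(9, size = ntree, prob = p)
  ))
}
```

With the default sampsize a row is expected to be out of bag in only about 11 of 30 trees, so some row falling below the threshold of 10 is likely; with sampsize = 17 the expectation rises to about 20.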

How to fix "The metric "Accuracy" was not in the result set. AUC will be used instead"

I am trying to run a logistic regression on a classification problem
the dependent variable "SUBSCRIBEDYN" is a factor with 2 levels ("Yes" and "No")
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = F,
classProbs = T,
summaryFunction = prSummary)
set.seed(13)
simple.logistic.regression <- caret::train(SUBSCRIBEDYN ~ .,
data = train_data,
method = "glm",
metric = "Accuracy",
trControl = train.control)
simple.logistic.regression
However, it does not accept Accuracy as a metric
"The metric "Accuracy" was not in the result set. AUC will be used instead"
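
The warning appears because metric must be one of the statistics returned by the summaryFunction. prSummary returns AUC (area under the precision-recall curve), Precision, Recall and F, so "Accuracy" is not in the result set and train falls back to the first returned statistic, AUC. To select the model on ROC, use summaryFunction = twoClassSummary together with metric = "ROC"; to select on accuracy, use the default summary function, which returns Accuracy and Kappa. Either way, after training the model you can use the confusion matrix to retrieve the accuracy, for example with the function confusionMatrix().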
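Retrieving the accuracy from a confusion matrix can be sketched as follows, assuming the caret package is installed. The obs and pred vectors are made-up values for illustration; in practice pred would come from predict() on the trained model and obs from held-out data.

```r
library(caret)  # assumed installed

# Hypothetical observed and predicted classes, for illustration only
obs  <- factor(c("Yes", "No", "No", "Yes", "No", "Yes"), levels = c("No", "Yes"))
pred <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"), levels = c("No", "Yes"))

# Cross-tabulate predictions against the reference and extract the accuracy
cm <- confusionMatrix(data = pred, reference = obs, positive = "Yes")
cm$overall["Accuracy"]
```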

How to compute AUC under ROC in R (caret, random forest , svm)

I am using random forest and support vector machine method in caret package in R. I want to calculate AUC under ROC for both cases; however, I do not know how to do it in this particular case. My outcome is coded as 0 and 1. Here is the example of code I am using :
set.seed(123)
cvCtrl <- trainControl(method = "cv", number = 10)
rf_moded<-train(readm30~.,data=train,method="rf", trControl=cvCtrl)

Do you want to train the model using ROC as the metric? Then you need the following:
For trainControl:
control <- trainControl(method = 'cv', number = 10,
savePredictions = 'final', classProbs = TRUE, summaryFunction = twoClassSummary)
And in train:
train(
outcome ~ .,
data = data,
method = method,
trControl = control,
metric = "ROC"
)
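One detail worth checking with an outcome coded as 0 and 1: when classProbs = TRUE, caret requires the outcome to be a factor whose level names are valid R variable names, which "0" and "1" are not. The sketch below uses only base R and made-up values; the labels "no" and "yes" are arbitrary choices.

```r
# Outcome coded as 0/1, as in the question (values here are made up)
readm30 <- c(0, 1, 1, 0, 0, 1)

# "0" and "1" are not syntactically valid R names: make.names() alters them
levels_raw <- as.character(sort(unique(readm30)))
make.names(levels_raw)

# Recode to a factor with valid level names before calling train()
readm30_f <- factor(readm30, levels = c(0, 1), labels = c("no", "yes"))
levels(readm30_f)

# TRUE when every level name is already a valid R name
all(make.names(levels(readm30_f)) == levels(readm30_f))
```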

How to downsample using r-caret?

I'd like to downsample my data given that I have a significant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy = 0.5. I applied the same downsampling to another GBM model and got exactly the same results. What gives?
set.seed(1914)
down_train_my_gbm <- downSample(x = combined_features,
y = combined_features$label)
down_train_my_gbm$label <- NULL
my_gbm_combined_downsampled <- train(Class ~ .,
data = down_train_my_gbm,
method = "gbm",
trControl = trainControl(method="repeatedcv",
number=10, repeats=3,
classProbs = TRUE),
preProcess = c("range"),
verbose = FALSE)
I suspected that the issue might have to do with classProbs=TRUE. Changing this to FALSE skyrockets the accuracy to >0.95...but I get the exact same results for multiple models (which do not result in the same accuracy without downsampling). I'm baffled by this. What am I doing wrong here?

caret's train function can downsample, upsample and more through the trainControl options, as described in the guide "Subsampling During Resampling". In your case it would be:
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
## new option here:
sampling = "down")
model_with_down_sample <- train(Class ~ ., data = imbal_train,
method = "gbm",
preProcess = c("range"),
verbose = FALSE,
trControl = ctrl)
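For comparison, this is what a standalone downSample() call returns. A minimal sketch, assuming caret is installed; the data are made up. Note that when the data are downsampled before train(), the resampling folds are drawn from already-balanced data, so the resampled accuracy no longer reflects the original class ratio, whereas sampling = "down" applies the subsampling inside each resample.

```r
library(caret)  # assumed installed

set.seed(1914)
# Made-up imbalanced data: 90 "major" rows and 10 "minor" rows
features <- data.frame(f1 = rnorm(100), f2 = runif(100))
label    <- factor(rep(c("major", "minor"), times = c(90, 10)))

# Randomly drop rows of the larger class until both classes are the same size;
# the outcome is returned in a column named `Class` by default
down <- downSample(x = features, y = label)
table(down$Class)
names(down)
```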
As a side note, avoid the formula style (e.g. Class ~ .) and pass the predictor columns and the outcome directly instead: the formula interface has been reported to have memory and speed issues when many predictors are used (https://github.com/topepo/caret/issues/263).
Hope it helps.

Different results with randomForest() and caret's randomForest (method = "rf")

I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.
I’ve read the caret Package website (http://topepo.github.io/caret/index.html), as well as various StackOverflow questions that seem potentially relevant, but I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset (it ships with base R's datasets package, so library(MASS) is not strictly needed).
library(MASS)
data(CO2)
library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry=2,
importance=TRUE,
metric="RMSE")
library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl = trainControl(method="oob"),
allowParallel=FALSE)
set.seed(1)
caret.boot.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl=trainControl(method="boot", number=50),
allowParallel=FALSE)
print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)

Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest you should use the non-formula interface.
In your case, you should also provide a seed inside trainControl to get the same result as in randomForest.
The model training section of the caret website has some notes on reproducibility that explain how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1
caret.boot.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "boot", seeds = seeds),
allowParallel = FALSE)
With the non-formula interface defined correctly and the seeds set in trainControl, you will get the same results in all three models:
rf.model
caret.oob.model$finalModel
caret.boot.model$finalModel
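The effect of the formula interface on the predictor set can be inspected with base R alone. The sketch below uses model.matrix() as a stand-in for what train's formula method does; that is an assumption about caret's internals, and the exact encoding may differ. The names x_direct and x_formula are illustrative.

```r
# CO2 ships with base R (package datasets)
data(CO2)

x_direct  <- CO2[, -5]                                   # predictors passed as columns
x_formula <- model.matrix(uptake ~ ., data = CO2)[, -1]  # factors expanded, intercept dropped

ncol(x_direct)       # the original predictor columns
ncol(x_formula)      # more columns once the factors are expanded
colnames(x_formula)
```

With mtry = 2, the forest samples 2 candidates from a larger set of expanded columns in the formula case, which is one reason the fitted models differ.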
