Why is importance affected after parallelization of randomForest? - r

I am working with the randomForest package in R. To speed up the classification step, I was interested in growing the forest in parallel. To do that, I used the 'foreach' package in the way shown in the 'foreach' vignette: split the total number of trees across the number of cores you want to use, then combine the resulting forests with the 'combine' function of the 'randomForest' package:
require(randomForest)
require(foreach)
require(doParallel)
registerDoParallel(cores=CPUS)
rf <- foreach::foreach(ntree=rep(ceiling(NTREE/CPUS), CPUS), .combine=randomForest::combine,
                       .packages='randomForest') %dopar% {
  randomForest::randomForest(x=t(Y), y=A, ntree=ntree, importance=TRUE, ...)
}
I compared the results of the "parallel" forest with a forest grown on a single core. The prediction performance on the test set is similar, but the 'importance' values are considerably reduced, and this affects the subsequent variable-selection steps.
imp <- importance(rf,type=1)
I would like to know why this happens, and whether it is correct or there is some mistake. Thanks a lot!

randomForest::combine does not recompute variable importance. In the randomForest package, importance is calculated only once, just before randomForest::randomForest returns. Two options are:
1. Write your own variable importance function, taking the combined forest and the training set as inputs. That is roughly 50 lines of code; a sketch of the idea follows this list.
2. Use an 'lapply'-like parallel computation in which each randomForest object is an element of the output list; combine them with do.call(combine, rf.list) outside the foreach loop, then aggregate the variable importance across all forests by simply taking the mean. This method gives an approximation of the total variable importance, but quite a good one.
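For option 1, here is a minimal sketch of the idea. It is my own illustration, not the package's internal computation: real randomForest importance is measured on out-of-bag samples, while this toy version permutes each predictor on whatever data you pass in, so it is optimistic on training data.
library(randomForest)
## Hypothetical helper: permutation importance for a combined forest 'rf',
## given a predictor data.frame 'x' and the true class labels 'y'.
perm_importance <- function(rf, x, y, nrep = 5) {
  baseline <- mean(predict(rf, x) == y)        # accuracy with intact predictors
  sapply(names(x), function(v) {
    mean(replicate(nrep, {
      xp <- x
      xp[[v]] <- sample(xp[[v]])               # permute one predictor
      baseline - mean(predict(rf, xp) == y)    # drop in accuracy
    }))
  })
}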
A code example of option 2 that also works on Windows:
library(randomForest)
library(doParallel)
CPUS = 6; NTREE = 5000
cl = makeCluster(CPUS)
registerDoParallel(cl)
data(iris)
rf.list = foreach(ntree = rep(ceiling(NTREE/CPUS), CPUS),
                  .combine = c,
                  .packages = "randomForest") %dopar% {
  list(randomForest(Species~., data = iris, importance = TRUE, ntree = ntree))
}
stopCluster(cl)
big.rf = do.call(combine, rf.list)
## approximate the combined importance by averaging the per-forest matrices
big.rf$importance = rf.list[[1]]$importance
for(i in 2:CPUS) big.rf$importance = big.rf$importance + rf.list[[i]]$importance
big.rf$importance = big.rf$importance / CPUS
varImpPlot(big.rf)
#check the number of trees in the combined forest and in one sub-forest
print(big.rf) # ~5000 trees (CPUS * ceiling(NTREE/CPUS))
rf.list[[1]]$ntree
#train a single forest of the same size for comparison
rf.single = randomForest(Species~., data = iris, ntree = 5000, importance = TRUE)
varImpPlot(big.rf)
varImpPlot(rf.single)
#print unscaled variable importance, no large deviations
print(big.rf$importance)
# setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.033184860 0.023506673 0.04043017 0.03241500 9.679552
# Sepal.Width 0.008247786 0.002135783 0.00817186 0.00613059 2.358298
# Petal.Length 0.335508637 0.304525644 0.29786704 0.30933142 43.160074
# Petal.Width 0.330610910 0.307016328 0.27129746 0.30023245 44.043737
print(rf.single$importance)
# setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.031771614 0.0236603417 0.03782824 0.031049531 9.516198
# Sepal.Width 0.008436457 0.0009236979 0.00880401 0.006048261 2.327478
# Petal.Length 0.341879367 0.3090482654 0.29766905 0.312507316 43.786481
# Petal.Width 0.322015885 0.3045458852 0.26885097 0.296227150 43.623370
#but when plotting with varImpPlot, scale=TRUE by default
#either simply turn off scaling to get comparable results
varImpPlot(big.rf,scale=F)
varImpPlot(rf.single,scale=F)
#... or correct the scaling for the number of trees: the combined forest has
#CPUS times as many trees, so the standard error of the importance shrinks by sqrt(CPUS)
big.rf$importanceSD = CPUS^-.5 * big.rf$importanceSD
#and now there are no large differences for scaled variable importance either
varImpPlot(big.rf,scale=T)
varImpPlot(rf.single,scale=T)

Related

shap plots for random forest models

I would like to get the SHAP contributions of the variables of a ranger/random forest model and produce plots like this in R:
(image: beeswarm plot)
I have tried using the following libraries: DALEX, shapr, fastshap, shapper. I could only end up with plots like this:
(image: fastshap plot)
Is it possible to get such plots? I have also tried the reticulate package and it still doesn't work.
Random forests need to grow many deep trees. While possible, crunching TreeSHAP for deep trees requires an awful lot of memory and CPU power. An alternative is to use the Kernel SHAP algorithm, which works for all kinds of models.
library(ranger)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- ranger(Sepal.Length ~ ., data = iris)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")
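If the target is categorical, the same pipeline should work with a probability forest; this is a sketch based on my assumption that kernelshap and shapviz handle the multi-output case, and it was not part of the original answer:
# Sketch (assumption): explain class probabilities of a classification forest
fit_cls <- ranger(Species ~ ., data = iris, probability = TRUE)
s_cls <- kernelshap(fit_cls, iris[-5], bg_X = iris)  # one SHAP matrix per class
sv_cls <- shapviz(s_cls)
sv_importance(sv_cls, kind = "bee")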
Disclaimer: I wrote "kernelshap" and "shapviz"

How to use expand.grid values to run various model hyperparameter combinations for ranger in R

I've seen various posts on how to select the independent variables for a model by using expand.grid and then create a formula based on that selection. However, I prepare my input tables beforehand and store them in a list.
library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris) # let's assume these are different input tables
I'm rather interested in trying all the possible hyperparameter combinations for a given algorithm (here: Random Forest using ranger) for my list of input tables. I do the following to set up the grid:
hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")
> head(hyper_grid)
Input_table Trees Importance Classification Repeats Target
1 iris1 10 none TRUE 1 Species
2 iris2 10 none TRUE 1 Species
3 iris1 20 none TRUE 1 Species
4 iris2 20 none TRUE 1 Species
5 iris1 10 impurity TRUE 1 Species
6 iris2 10 impurity TRUE 1 Species
My question is: what is the best way to pass these values to the model? Currently I'm using a for loop:
for (i in 1:nrow(hyper_grid)) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"],
    data = Input_list[[hyper_grid[i, "Input_table"]]], # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"],
    importance = hyper_grid[i, "Importance"],
    classification = hyper_grid[i, "Classification"]) # otherwise regression is performed
  print(RF_train)
}
iterating over each row of the grid. But for one thing, I now have to tell the model whether to do classification or regression. I assume the factor Species is converted to numeric factor levels, so regression is performed by default. Is there a way to prevent this, and could I use e.g. apply for this instead? This way of iterating also results in messy function calls:
Call:
ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i, "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i, "Importance"], classification = hyper_grid[i, "Classification"])
Second: in reality the output of the model is obviously not printed; I immediately capture the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, on the same row as the input parameters. Is this too costly performance-wise? Because if I store the ranger objects, I run into memory issues at some point.
Thank you!
I think it is cleanest to wrap the training and extraction of the values you need into a function. The dots (...) are needed for usage with the purrr::pmap function below.
fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
  RF_train <- ranger(
    dependent.variable.name = Target,
    data = Input_list[[Input_table]], # referring to the named object in the list
    num.trees = Trees,
    importance = Importance,
    classification = Classification) # otherwise regression is performed
  data.frame(Prediction_error = RF_train$prediction.error,
             True_positive = RF_train$confusion.matrix[1])
}
Then you can add the results as a column by mapping over the rows using for example purrr::pmap:
hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)
By mapping in this way, the function is applied row by row, so you should not run into memory issues.
The result of purrr::pmap is a list, which means that the column res contains a list for every row. This can be unnested using tidyr::unnest to spread the elements of that list across your data frame.
tidyr::unnest(hyper_grid, res)
I think this approach is very elegant, but it requires some tidyverse knowledge. I highly recommend the book R for Data Science if you want to know more about that; Chapter 25 (Many models) describes an approach similar to the one I'm taking here.
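If you would rather stay in base R (the question mentions apply), a rough equivalent is to loop over the rows yourself. This is only a sketch using the fit_and_extract_metrics function defined above; the factor-to-character conversion is my own precaution, since expand.grid creates factor columns by default:
hg <- hyper_grid
hg[] <- lapply(hg, function(x) if (is.factor(x)) as.character(x) else x)
res <- do.call(rbind, lapply(seq_len(nrow(hg)), function(i) {
  do.call(fit_and_extract_metrics, as.list(hg[i, ]))  # row -> named arguments
}))
cbind(hg, res)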

R biglasso results don't match hdm or glmnet

I've been experimenting with the R package 'biglasso' for high-dimensional data. However, the results I'm getting don't match the results of the LASSO functions from 'hdm' or 'glmnet'. The documentation for biglasso is also really poor.
In the example below, the results from hdm and glmnet are very close but not exact, which is expected. However, biglasso doesn't drop the 'share' variable. I've tried all the different screen settings, and it doesn't make a difference. Any thoughts on how to get biglasso to be more consistent with the others? Thanks!
EDIT: for a given value of lambda, the results are highly similar. But each method seems to select a different lambda, which for hdm makes sense, given that it's intended for causal inference and isn't concerned with out-of-sample prediction. hdm uses a different objective function from Belloni et al. (2012), but I'm not sure why cv.biglasso and cv.glmnet would differ so much. If I run biglasso without a screening rule, they should be maximizing the same objective function, just with random differences in the CV folds, no?
EDIT 2: I've edited the code below to include F. Privé's code to make glmnet use an algorithm similar to biglasso, and some additional code to make biglasso mimic glmnet.
##########
## PREP ##
##########
## Load required libraries
library(hdm)
library(biglasso)
library(glmnet)
## Read automobile dataset
data(BLP)
df <- BLP[[1]]
## Extract outcome
Y <- scale(df$mpg)
## Rescale variables
df$price <- scale(df$price)
df$mpd <- scale(df$mpd)
df$space <- scale(df$space)
df$hpwt <- scale(df$hpwt)
df$outshr <- scale(df$outshr)
## Limit to variables I want
df <- df[,names(df) %in% c("price","mpd","space","hpwt","share","outshr","air")]
## Convert to matrix
df.mat <- data.matrix(df)
df.bm <- as.big.matrix(df.mat)
#########
## HDM ##
#########
## Set seed for reproducibility
set.seed(1233)
## Run LASSO
fit.hdm <- rlasso(x=df.mat, y=Y, post=FALSE, intercept=TRUE)
## Check results
coef(fit.hdm)
############
## GLMNET ##
############
## Set seed for reproducibility
set.seed(1233)
## LASSO with 10-fold cross-validation
fit.glmnet <- cv.glmnet(df.mat, Y, alpha=1, family="gaussian")
## Check default results
coef(fit.glmnet)
## Try to mimic results of biglasso
coef(fit.glmnet, s = "lambda.min")
##############
## BIGLASSO ##
##############
## LASSO with 10-fold cross-validation
fit.bl <- cv.biglasso(df.bm, Y, penalty="lasso", eval.metric="default",
                      family="gaussian", screen="None",
                      seed=1233, nfolds=10)
## Check default results
coef(fit.bl)
## Try to mimic results of glmnet
## Calculate threshold for CV error (minimum + 1 standard error)
thresh <- min(fit.bl$cve) + sd(fit.bl$cve)/sqrt(100)
## Identify highest lambda with CVE at or below threshold
max.lambda <- max(fit.bl$lambda[fit.bl$cve <= thresh])
## Check results for the given lambda
coef(fit.bl$fit)[,which(fit.bl$fit$lambda==max.lambda)]
There are basically two ways to choose the "best" lambda after CV:
The one that minimizes the CV error (default of {biglasso})
The one that is the most parsimonious (highest lambda) with the CV error lower than the minimum + 1 standard error (default of {glmnet}).
Try coef(fit.glmnet, s = "lambda.min") to use the minimum.
Also, to ensure reproducibility, set the CV folds directly instead of relying on a seed: see the parameters foldid in cv.glmnet() and cv.ind in cv.biglasso(), as in the sketch below.
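A minimal sketch, assuming the objects from the code above, that hands one shared fold assignment to both routines, so any remaining difference cannot come from the folds:
## Sketch: one shared fold assignment for both CV routines
set.seed(1233)
folds <- sample(rep(1:10, length.out = nrow(df.mat)))
fit.glmnet2 <- cv.glmnet(df.mat, Y, alpha = 1, family = "gaussian", foldid = folds)
fit.bl2 <- cv.biglasso(df.bm, Y, penalty = "lasso", family = "gaussian",
                       screen = "None", cv.ind = folds)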

How do I program multiple output nodes using the neuralnet package in R?

I am building a neural network to predict "fit" based on a number of variables. "FitCls" is in three classes: "Excellent", "Good" and "Poor". I have 10 input variables, and have chosen one hidden layer with 6 neurons. I would like three output neurons so that I can classify the case that is presented to the neural network as a "fit" which is "excellent", "good" or "poor".
I have seen a similar example where this was done using iris data (on slide 40, et seq.) here: http://www.slideshare.net/DerekKane/data-science-part-viii-artifical-neural-network. I have tried to copy that structure, but I still only get a single output node when I plot the network.
Here is my code (after loading the 'nfit' dataframe):
nfit[nfit$FitCls=="Excellent", "Output"] <- 2
nfit[nfit$FitCls=="Good", "Output"] <- 1
nfit[nfit$FitCls=="Poor", "Output"] <- 0
nn <- neuralnet(Output~Universalism+Benevolence+Tradition+Conformity+Security+Power+Achievement+Hedonism+Stimulation+SelfDir, data = nfit, hidden = 6, err.fct = "ce", linear.output = FALSE)
When I run neuralnet, it gives me a warning message that it has forced err.fct to "sse" because the response is not binary. I am not sure what is going wrong because in the example that I am copying, the plot of the neural network shows three output nodes. Please let me know what I am doing wrong.
If this is not the right way to go about using neuralnet for classification I would also appreciate any help you can provide as to what I should be doing. Many thanks!
To more or less replicate the iris example you will need:
library(neuralnet)
library(nnet)
trainset <- cbind(iris[, 1:4], class.ind(iris$Species))
espnnet2=neuralnet(setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, trainset)
plot(espnnet2)
Unfortunately neuralnet is sensitive to the data; try scaling the explanatory variables.
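For example (a sketch; the scaling step and decoding the three output nodes with max.col are my additions, not part of the original answer):
# Sketch: scale the inputs, refit, and map the three output nodes back to classes
trainset2 <- data.frame(scale(iris[, 1:4]), class.ind(iris$Species))
nn2 <- neuralnet(setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width +
                   Petal.Length + Petal.Width, trainset2, hidden = 6)
pred <- compute(nn2, trainset2[, 1:4])$net.result  # one column per output node
predicted <- c("setosa", "versicolor", "virginica")[max.col(pred)]
table(predicted, iris$Species)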

trouble using foreach from doParallel with gamm4

I am trying to use foreach to exploit parallel processing for a complete subsets regression problem. I am fitting a complete list of models using the gamm4 package, with the binomial family where the response is a proportion and the weights argument supplies the number of trials. The code works fine when run using %do%, but fails under %dopar% (it returns only NAs for AIC and BIC). Strangely, the code does work under %dopar% if the weights argument is left out of the call, but obviously this is not a viable solution. I have used similar code with no issues for a gaussian distribution, and for a binomial distribution where the response is entered as 1s and 0s (thus no need for weights). I am using Windows 7 64-bit with R version 3.1.2, and I have updated all the relevant packages. A reproducible (but toy) example:
set.seed(666)
# generate a random factor with a random offset effect
random.factor=factor(sort(rep(1:10,10)))
random.effect=sort(rep(rnorm(10),10))
# generate some random predictor variables
X1 = rnorm(100)
X2 = rnorm(100)
X3 = rnorm(100)
X4 = rep(0,100) # make it so one variable fails (just to check the "try" if statement)
#X4 = rnorm(100)
X5 = rnorm(100)
# calculate a response variable based on some of the predictors
z = 1 + 2*X1 + 3*X2 + 2*X3^2 # linear combination with a bias
pr = 1/(1+exp(-(z+random.effect))) # pass through an inv-logit function
y = rbinom(n=100,size=100,pr)/100 # binomial response expressed as a proportion
# Note that the response variable is a proportion of successes out of 100 trials
# We want to feed the number of trials as a "weights" argument to gamm
# now make a data frame of predictors
pred.dat=data.frame(X1=X1,X2=X2,X3=X3,X4=X4,X5=X5)
pred.vars=colnames(pred.dat)
# make a dataframe for passing to gamm
use.dat = data.frame(random.factor=random.factor,y=y,pred.dat)
# now set up the models to run
# this includes all combinations of variables, but only up to a total of two in
# any one model
model.fits.test=c(combn(1:ncol(pred.dat), 1, simplify = F),
                  combn(1:ncol(pred.dat), 2, simplify = F))
models.use=list(1,2,3,4,5)
n.models=length(model.fits.test)
require(lme4)
require(doParallel)
registerDoParallel(cores=4)
# if I run this using %do%, it works fine (with NA values from the try()
# wrapper returned for models that fail)
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4","gamm4")) %do% {
  vars.vec = model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse="+"),
    "+(1|random.factor)", sep=""))
  model.fit = try(glmer(formula.l,
                        data = use.dat,
                        family = "binomial",
                        weights = rep(100, nrow(use.dat))))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if(success){
    out.vec["AIC"] <- AIC(model.fit)
    out.vec["BIC"] <- BIC(model.fit)
  }
  return(out.vec)
}
out.dat
# if I run it using %dopar%, nothing is returned.
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4","gamm4")) %dopar% {
  vars.vec = model.fits.test[[l]]
  formula.l <- as.formula(paste("y~",
    paste(colnames(pred.dat)[vars.vec], collapse="+"),
    "+(1|random.factor)", sep=""))
  model.fit = try(glmer(formula.l,
                        data = use.dat,
                        family = "binomial",
                        weights = rep(100, nrow(use.dat))))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if(success){
    out.vec["AIC"] <- AIC(model.fit)
    out.vec["BIC"] <- BIC(model.fit)
  }
  return(out.vec)
}
out.dat
# Now run %dopar% without the weights argument (not really appropriate,
# but for the sake of demonstration). I get results again, but it doesn't
# really make sense to do this. Also, my real example fails unless I can
# supply weights.
out.dat <- foreach(l = 1:n.models, .combine = rbind,
                   .packages = c("lme4","gamm4")) %dopar% {
  vars.vec = model.fits.test[[l]]
  formula.l <- as.formula(paste("y~1+",
    paste("s(", colnames(pred.dat)[vars.vec], ")", collapse="+"), sep=""))
  model.fit = try(gamm4(formula.l, random = ~(1|random.factor),
                        data = use.dat, family = "binomial"))
  success <- class(model.fit)[[1]] != "try-error"
  out.vec <- c(rep(NA, 2), rep(NA, ncol(pred.dat)))
  names(out.vec) <- c("AIC", "BIC", colnames(pred.dat))
  out.vec[which(match(names(out.vec), pred.vars[vars.vec]) > 0)] <- 1
  if(success){
    out.vec["AIC"] <- AIC(model.fit$mer)
    out.vec["BIC"] <- BIC(model.fit$mer)
  }
  return(out.vec)
}
out.dat
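One workaround that may be worth trying (my assumption, not an answer from the thread): arguments such as weights are looked up in the data argument first when the model frame is built, so storing the trial counts as a column of use.dat and referring to them by name removes the dependence on the workers' calling environment. A drop-in sketch for the glmer() call inside the %dopar% loop:
# Sketch (untested assumption): keep the trial counts inside the data frame
use.dat$n.trials <- rep(100, nrow(use.dat))
model.fit <- try(glmer(formula.l,
                       data = use.dat,
                       family = "binomial",
                       weights = n.trials))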
