Obtain importance of individual trees in a RandomForest - r

Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object?
rf_mod$forest doesn't seem to have this information, and the docs don't mention it.
In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).
library(randomForest)
df <- mtcars
set.seed(1)
rf_mod = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 200)
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
We can also extract individual tree structure with getTree. Here's the first tree.
head(getTree(rf_mod, k = 1, labelVar = TRUE))
left daughter right daughter split var split point status prediction
1 2 3 wt 2.15 -3 18.91875
2 0 0 <NA> 0.00 -1 31.56667
3 4 5 wt 3.16 -3 17.61034
4 6 7 drat 3.66 -3 21.26667
5 8 9 carb 3.50 -3 15.96500
6 0 0 <NA> 0.00 -1 19.70000
One workaround is to fit many single-tree forests (i.e., ntree = 1), get the variable importance of each one, and average the resulting %IncMSE:
# number of trees to grow
nn <- 200
# function to run nn CART models
run_rf <- function(rand_seed){
set.seed(rand_seed)
one_tr = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 1)
return(one_tr)
}
# fit nn single-tree models and store them in a list
l <- lapply(seq_len(nn), run_rf)
The next step extracts the importance from each model, averages it, and compares the result with the full forest:
# extract importance of each CART model
library(dplyr); library(purrr)
map(l, importance) %>%
map(as.data.frame) %>%
map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>%
bind_rows() %>%
group_by(var) %>%
summarise(`%IncMSE` = mean(`%IncMSE`)) %>%
arrange(-`%IncMSE`)
# A tibble: 10 x 2
var `%IncMSE`
<chr> <dbl>
1 wt 8.52
2 cyl 7.75
3 disp 7.74
4 hp 5.53
5 drat 1.65
6 carb 1.52
7 vs 0.938
8 qsec 0.824
9 gear 0.495
10 am 0.355
# compare to the RF model above
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
I'd like to extract the variable importance of each tree directly from a randomForest object, without this roundabout method that involves completely re-running the RF. The goal is to facilitate reproducible cumulative variable importance plots like this one, and the one below shown for mtcars. Minimal example here.
I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for visualization, to communicate that as the number of trees in a forest increases, the variable importance measures jump around before stabilizing.

When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.
Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:
## Starting with a random forest having a single tree,
## grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
rep(1,9), randomForest::grow )
## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )
## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 18.8 8.63 1.05 0 1.17 0 0 0 0.194
# 2 0 10.0 46.4 0.561 0 -0.299 0 0 0.543 2.05
# 3 0 22.4 31.2 0.955 0 -0.199 0 0 0.362 5.1
# 4 1.55 24.1 23.4 0.717 0 -0.150 0 0 0.272 5.28
# 5 1.24 22.8 23.6 0.573 0 -0.178 0 0 -0.0259 4.98
# 6 1.03 26.2 22.3 0.478 1.25 0.775 0 0 -0.0216 4.1
# 7 0.887 22.5 22.5 0.406 1.79 -0.101 0 0 -0.0185 3.56
# 8 0.776 19.7 21.3 0.944 1.70 0.105 0 0.0225 -0.0162 3.11
# 9 0.690 18.4 19.1 0.839 1.51 1.24 1.01 0.02 -0.0144 2.77
# 10 0.621 18.4 21.2 0.937 1.32 1.11 0.910 0.0725 -0.114 2.49
The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
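The cumulative importance table above can be turned into a plot with base R alone. The following is a minimal sketch, not a definitive implementation: it assumes the randomForest package is installed, refits the forest on mtcars directly (rather than the question's df), and uses base Reduce() in place of purrr::accumulate(). The names rfs, imp, and cols are illustrative.

```r
library(randomForest)

set.seed(1)
rf_mod <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 200)

# Start from a one-tree forest and add one tree at a time (10 forests in total).
rfs <- Reduce(function(rf, k) grow(rf, k), rep(1, 9),
              update(rf_mod, ntree = 1), accumulate = TRUE)

# Rows = number of trees in the forest, columns = predictors.
imp <- t(sapply(rfs, function(rf) importance(rf)[, "%IncMSE"]))
rownames(imp) <- seq_len(nrow(imp))

# One line per predictor, showing how %IncMSE moves as trees are added.
cols <- seq_len(ncol(imp))
matplot(imp, type = "l", lty = 1, col = cols,
        xlab = "Number of trees", ylab = "%IncMSE")
legend("topright", legend = colnames(imp), col = cols, lty = 1, cex = 0.6)
```

Increasing the length of rep(1, 9) extends the curve to larger forests, at the cost of refitting one tree per step.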

Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.
While I (think I) understand your question, to be honest I am unsure whether your question makes sense from a statistics/ML point-of-view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.
Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):
At each split in each tree, the improvement in the split-criterion is the
importance measure attributed to the splitting variable, and is accumulated
over all the trees in the forest separately for each variable. [...]
Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.
So the variable importance measure in RF is defined as a measure accumulated over all trees.
In traditional single classification trees (CARTs), variable importance is characterised through the Gini index that measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis)
More complex measures to characterise variable importance in CART-like models exist; for example in rpart:
An overall measure of variable importance is the sum of the goodness of split
measures for each split for which it was the primary variable, plus goodness * (adjusted
agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum
to 100 and the rounded values are shown, omitting any variable whose proportion is less
than 1%.
So the bottom line is this: at the very least it won't be easy (and in the worst case it won't make sense) to compare variable importance measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.
Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.
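To make the rpart measure quoted above concrete, here is a small sketch that fits a single regression tree and reads its variable importance. It assumes the rpart package (a recommended package that normally ships with R) and uses mtcars purely for illustration; the rescaling to 100 mirrors what the rpart documentation says the printout does, and the names tree and vi are illustrative.

```r
library(rpart)

# A single CART-style regression tree on mtcars.
tree <- rpart(mpg ~ ., data = mtcars)

# Named numeric vector: one importance value per variable that acted
# as a primary or surrogate splitter.
vi <- tree$variable.importance

# Rescale so the values sum to 100, as in the printed summary.
round(100 * vi / sum(vi))
```

These numbers are on a different scale from randomForest's %IncMSE, which is the point made above: the two measures are not directly comparable.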

We can simplify the averaging step by summing the importance matrices with reduce() and dividing by the number of trees:
library(tidyverse)
out <- map(seq_len(nn), ~
run_rf(.x) %>%
importance) %>%
reduce(`+`) %>%
magrittr::divide_by(nn)

Related

How do I export coefficients from a lm() object containing multiple lm()?

I have an object (S3; lm) that contains the linear regression outputs of 471 different models. I am trying to extract the standard error of a specific variable in each model but I'm unsure how to do so, can anyone help? Specifically, I want to extract the standard error for the variable "p" for EACH of the 471 models saved in the "fit" object.
varnames = names(merged1)[2036:2507]
fit <- lapply(varnames,
FUN=function(p) lm(formula(paste("Dx ~ x + y + z + q +", p)),data=merged1))
names(fit) <- varnames
Thank you so much!
Note
Edited to reflect the anonymous function p, rather than x, as stated previously.
Using fit (constructed reproducibly in the Note at the end), call map_dfr on it with tidy. This gives a data frame containing the coefficients and associated statistics for every model. We then filter for the rows we want.
library(broom) # tidy
library(dplyr)
library(purrr) # map_dfr
fit %>%
map_dfr(tidy, .id = "variable") %>%
filter(term == variable)
giving:
# A tibble: 8 x 6
variable term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 hp hp -0.0147 0.0147 -1.00 0.325
2 drat drat 1.21 1.50 0.812 0.424
3 wt wt -3.64 1.04 -3.50 0.00160
4 qsec qsec -0.243 0.402 -0.604 0.551
5 vs vs -0.634 1.90 -0.334 0.741
6 am am 1.93 1.34 1.44 0.161
7 gear gear 0.158 0.910 0.174 0.863
8 carb carb -0.737 0.393 -1.88 0.0711
Note
We compute fit reproducibly using mtcars which is built into R.
data <- mtcars
resp <- "mpg" # response
fixed <- c("cyl", "disp") # always include these
varnames <- setdiff(names(data), c(resp, fixed)) # incl one at a time
fit <- Map(function(v) {
fo <- reformulate(c(fixed, v), resp)
lm(fo, data)
}, varnames)
Updated
Significantly revised.
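sapply(names(fit), function(p) summary(fit[[p]])$coefficients[p, "Std. Error"], simplify = FALSE)
Here p is the name of the extra variable in each model. The "Std. Error" column (the 2nd column of the coefficient matrix) holds the standard error for that variable.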
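For comparison, the same extraction can be written in base R without broom. This is a sketch under the assumptions of the Note above: it rebuilds fit from mtcars with cyl and disp always included, then pulls the standard error of each model's own extra variable. The name se is illustrative.

```r
data <- mtcars
resp <- "mpg"                                     # response
fixed <- c("cyl", "disp")                         # always included
varnames <- setdiff(names(data), c(resp, fixed))  # added one at a time

# One lm per extra variable; Map() names the list after varnames.
fit <- Map(function(v) lm(reformulate(c(fixed, v), resp), data), varnames)

# For each model, look up the row for its own variable in the
# coefficient table and take the "Std. Error" column.
se <- mapply(function(m, v) coef(summary(m))[v, "Std. Error"],
             fit, names(fit))
se
```

Indexing by row and column name avoids relying on the position of the variable in the coefficient table.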

R run linear model by group in dataset [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 2 years ago.
My dataset looks like this
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site, but it doesn't seem to be working. I've looked at some existing solutions on Stack Overflow but cannot seem to adapt the code to my problem.
#Trying to run this by group, and output the resulting coefficients per site in a separate df with their significance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated
library(tidyverse)
library(broom)
library(MASS) # note: MASS::select() now masks dplyr::select()
# We first create a formula object from all columns except site and the response
predictors <- setdiff(names(df), c("site", "nps_score"))
my_formula <- reformulate(predictors, response = "nps_score")
# Now we can group by site and use the formula object within the pipe.
results <- df %>%
group_by(site) %>%
do(tidy(rlm(my_formula, data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function, so I don't know whether it provides p-values in the first place. At least the tidy function doesn't return p-values for rlm. If a simple linear regression would suit your needs, you could replace rlm with lm, in which case a sixth column with p-values would be added.
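If ordinary least squares is acceptable, the per-site coefficients with p-values can also be collected in base R. The sketch below is an illustration only: it swaps rlm for lm, regenerates data of the same shape as in the question with an arbitrary seed, and uses the made-up names fits and coefs.

```r
set.seed(42)  # arbitrary seed, only to make the simulated data reproducible
df <- data.frame(
  site          = rep(c("A", "B", "C"), times = c(95, 110, 250)),
  nps_score     = floor(runif(455, min = 0, max = 10)),
  service_score = floor(runif(455, min = 0, max = 10)),
  food_score    = floor(runif(455, min = 0, max = 10)),
  clean_score   = floor(runif(455, min = 0, max = 10))
)

# One lm per site.
fits <- lapply(split(df, df$site), function(d)
  lm(nps_score ~ service_score + food_score + clean_score, data = d))

# Stack the coefficient tables, tagging each row with its site and term.
coefs <- do.call(rbind, lapply(names(fits), function(s) {
  cf <- coef(summary(fits[[s]]))
  data.frame(site = s, term = rownames(cf), cf,
             row.names = NULL, check.names = FALSE)
}))
head(coefs)
```

The resulting data frame has one row per site and term, with the p-value in the "Pr(>|t|)" column.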

Effect size (Cohen's d) for pairwise comparisons

I'm trying to calculate the effect size among different factor levels. To compare the two means within each factor level, the code below works fine:
cohens_d_list <- by(mydata, mydata$factor, function(sub)
cohens_d(sub$score1, sub$score2)
)
cohens_d_list
However, I couldn't figure out how to compare the factor levels with each other for a single score (e.g. for score1: factor level 1 vs. factor level 2, factor level 1 vs. factor level 3, factor level 1 vs. factor level 4, and so on). I used the psych, effectsize, and effsize packages, but they don't seem to handle more than 2 levels in a single factor variable. Any suggestions for code or a package?
After trying dozens of packages, the esvis package did the trick.
library(dplyr)
library(esvis)
df %>%
ungroup(Group) %>% # Include this line if you get a grouping error
coh_d(score1 ~ Group)
You get a nice table with all possible comparisons.
You can fit a model and use the eff_size() function from emmeans (which will have the benefit of using the pooled SD from all groups, not just the 2 being compared):
m <- lm(mpg ~ factor(cyl), data = mtcars)
library(emmeans)
(em <- emmeans(m, ~ cyl))
#> cyl emmean SE df lower.CL upper.CL
#> 4 26.7 0.972 29 24.7 28.7
#> 6 19.7 1.218 29 17.3 22.2
#> 8 15.1 0.861 29 13.3 16.9
#>
#> Confidence level used: 0.95
eff_size(em, sigma = sigma(m), edf = df.residual(m))
#> contrast effect.size SE df lower.CL upper.CL
#> 4 - 6 2.15 0.56 29 1.003 3.29
#> 4 - 8 3.59 0.62 29 2.320 4.86
#> 6 - 8 1.44 0.50 29 0.418 2.46
#>
#> sigma used for effect sizes: 3.223
#> Confidence level used: 0.95
Created on 2021-06-07 by the reprex package (v2.0.0)
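For readers who want to see what such a pairwise table contains, here is a base R sketch. It is not how esvis or emmeans compute their results: it uses the classic Cohen's d with the SD pooled over only the two groups being compared, so the values will differ from the eff_size() output above, which pools over all groups. The helper cohens_d_pair and the use of mtcars are illustrative assumptions.

```r
# Cohen's d for two numeric vectors, using their pooled standard deviation.
cohens_d_pair <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / s_pooled
}

# Split the score by factor level, then enumerate all pairs of levels.
groups <- split(mtcars$mpg, mtcars$cyl)
lvl_pairs <- combn(names(groups), 2)

d <- apply(lvl_pairs, 2, function(p)
  cohens_d_pair(groups[[p[1]]], groups[[p[2]]]))
names(d) <- apply(lvl_pairs, 2, paste, collapse = " - ")
d
```

combn() generates each unordered pair once, so a factor with k levels yields k(k-1)/2 comparisons.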

Fitting linear model / ANOVA by group [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I'm trying to run anova() in R and running into some difficulty. This is what I've done up to now to help shed some light on my question.
Here is the str() of my data to this point.
str(mhw)
'data.frame': 500 obs. of 5 variables:
$ r : int 1 2 3 4 5 6 7 8 9 10 ...
$ c : int 1 1 1 1 1 1 1 1 1 1 ...
$ grain: num 3.63 4.07 4.51 3.9 3.63 3.16 3.18 3.42 3.97 3.4 ...
$ straw: num 6.37 6.24 7.05 6.91 5.93 5.59 5.32 5.52 6.03 5.66 ...
$ Quad : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
Column r is a numerical value indicating the row of the field in which an individual plot resides
Column c is a numerical value indicating the column of the field in which an individual plot resides
Column Quad corresponds to the geographical location in the field in which each plot resides
Quad <- ifelse(mhw$c > 13 & mhw$r < 11, "NE",ifelse(mhw$c < 13 & mhw$r < 11,"NW", ifelse(mhw$c < 13 & mhw$r >= 11, "SW","SE")))
mhw <- cbind(mhw, Quad)
I have fit a lm() as follows
nov.model <-lm(mhw$grain ~ mhw$straw)
anova(nov.model)
This is an anova() for the entire field, which is testing grain yield against straw yield for each plot in the dataset.
My trouble is that I want to run an individual anova() for the Quad column of my data to test grain yield and straw yield in each quadrant.
Perhaps with() might fix that. I have never used it before, and I am still learning R. Any help would be greatly appreciated.
I think you are looking for the by() facility in R.
fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))
Since you have 4 levels in Quad, you end up with 4 linear models in fit, i.e., fit is a "by" class object (a type of "list") of length 4.
To get coefficient for each model, you can use
sapply(fit, coef)
To produce model summary, use
lapply(fit, summary)
To export ANOVA table, use
lapply(fit, anova)
As a reproducible example, I am taking the example from ?by:
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
class(tmp)
# [1] "by"
mode(tmp)
# [1] "list"
sapply(tmp, coef)
# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778
lapply(tmp, anova)
#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736
I was aware of the lmList option, but not familiar with it. Thanks to @Roland for providing code for the above reproducible example:
library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)
For your data I think it would be
fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)
You don't need to install nlme; it comes with R as one of recommended packages.
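If only the test result per group is needed, the list of ANOVA tables can be reduced to one p-value per group. This is a small sketch using the warpbreaks example from ?by shown above; the name p_vals is illustrative.

```r
# One lm per tension level, as in the example from ?by.
tmp <- with(warpbreaks,
            by(warpbreaks, tension,
               function(x) lm(breaks ~ wool, data = x)))

# First row of each ANOVA table is the wool term; pull its p-value.
p_vals <- sapply(tmp, function(m) anova(m)[["Pr(>F)"]][1])
p_vals
```

For the mhw data, the same pattern would presumably apply to the fit object built with by() or lmList() above.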

Finding correlation of multiple columns in R

I need to find the correlation between E_Id, IncomeType and Tax, to understand whether any particular E_Id or IncomeType always leads to higher Tax. My sample data for the required columns is:
E_id IncomeType Tax
1 1 121
2 1 11.23
2 3 51.623
1 1 115.23
3 4 675.1
I have around 5 lakh (500,000) rows of data, 4 types of IncomeType, and 340 unique E_id values. I grouped the data, and now it looks something like this:
E_Id Tax_Income_1 Tax_Income_2 Tax_Income_3 Tax_Income_4
1 118025 66513.25 148134 274072.16
2 200527 235278 247536.42 487333.98
3 3376.93 11279 114312.5 130463.97
4 44630 22285.95 20830.55 2375
5 42902.63 15649 7602.01 3624
Now I don't have any idea how to find the correlation. This is my first analytics project, please provide some guidance.
I would like to draw your attention to correlation_table() from the funModeling package:
library(funModeling)
data(mtcars)
correlation_table(data=mtcars, target="mpg")
Variable mpg
1 mpg 1.00
2 drat 0.68
3 gear 0.48
4 qsec 0.42
5 carb -0.55
6 hp -0.78
7 cyl -0.85
8 disp -0.85
9 wt -0.87
Also using the mtcars data set as an example, the cor() function will produce a matrix of variable correlations.
data(mtcars)
cor(mtcars)
You can also graphically represent these correlations with the corrgram package:
library(corrgram)
corrgram(mtcars)
Using the mtcars dataset as an example you can visualize the correlations of all the variables like this:
data(mtcars)
pairs(mpg ~ ., data = mtcars)
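Applied to the grouped table from the question, cor() works the same way as on mtcars. The sketch below types in only the five sample rows shown in the question, so the numbers themselves are not meaningful; it just illustrates the call. Using Spearman rank correlation is an assumption made here because tax totals tend to be skewed, and the names wide and tax_cor are illustrative.

```r
# The five sample rows from the question's grouped table.
wide <- data.frame(
  E_Id         = 1:5,
  Tax_Income_1 = c(118025, 200527, 3376.93, 44630, 42902.63),
  Tax_Income_2 = c(66513.25, 235278, 11279, 22285.95, 15649),
  Tax_Income_3 = c(148134, 247536.42, 114312.5, 20830.55, 7602.01),
  Tax_Income_4 = c(274072.16, 487333.98, 130463.97, 2375, 3624)
)

# Correlate the four tax columns across E_Id; drop the identifier column.
tax_cor <- cor(wide[, -1], method = "spearman")
round(tax_cor, 2)
```

Each cell shows how strongly the tax totals for two income types move together across E_Id values; it does not by itself say which E_Id or IncomeType leads to higher tax.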
