I'm trying to calculate the effect size among different factor levels. To compare the two means within each factor level, the code below works fine:
cohens_d_list <- by(mydata, mydata$factor, function(sub)
  cohens_d(sub$score1, sub$score2)
)
cohens_d_list
However, I couldn't figure out how to compare the factor levels with each other for a single score (e.g. for score1, I want every pairwise comparison: level 1 vs. level 2, level 1 vs. level 3, level 1 vs. level 4, and so on). I used the psych, effectsize, and effsize packages, but they don't seem to handle more than 2 levels in a single factor variable. Any suggestions for code or a package?
After trying dozens of packages, the esvis package did the trick.
df %>%
  ungroup(Group) %>% # include this line if you get a grouping error
  coh_d(score1 ~ Group)
You get a nice table with all possible comparisons.
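For example, here is a minimal runnable sketch with a built-in data set, assuming esvis::coh_d() follows the same data-first, outcome ~ group call pattern shown above:
# minimal sketch (assumes the coh_d(data, outcome ~ group) interface as above)
library(esvis)
library(dplyr)
mtcars %>%
  mutate(cyl = factor(cyl)) %>%  # use cyl as the grouping factor
  coh_d(mpg ~ cyl)               # one row per pairwise comparison of cyl levels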
You can fit a model and use the eff_size() function from emmeans (which will have the benefit of using the pooled SD from all groups, not just the 2 being compared):
m <- lm(mpg ~ factor(cyl), data = mtcars)
library(emmeans)
(em <- emmeans(m, ~ cyl))
#> cyl emmean SE df lower.CL upper.CL
#> 4 26.7 0.972 29 24.7 28.7
#> 6 19.7 1.218 29 17.3 22.2
#> 8 15.1 0.861 29 13.3 16.9
#>
#> Confidence level used: 0.95
eff_size(em, sigma = sigma(m), edf = df.residual(m))
#> contrast effect.size SE df lower.CL upper.CL
#> 4 - 6 2.15 0.56 29 1.003 3.29
#> 4 - 8 3.59 0.62 29 2.320 4.86
#> 6 - 8 1.44 0.50 29 0.418 2.46
#>
#> sigma used for effect sizes: 3.223
#> Confidence level used: 0.95
Created on 2021-06-07 by the reprex package (v2.0.0)
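Translated to the setup in the original question (a sketch only; score1 and Group stand in for the asker's actual column names, so this is not tested on their data):
# assumes a data frame df with a numeric score1 and a grouping factor Group
m <- lm(score1 ~ Group, data = df)
em <- emmeans(m, ~ Group)
eff_size(em, sigma = sigma(m), edf = df.residual(m))  # all pairwise effect sizes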
I have a dataframe (gdata) with x (as "r") and y (as "km") coordinates of a function.
When I plot it like this:
plot(x = gdata$r, y = gdata$km, type = "l")
I get the graph of the function.
Now I want to calculate the area under the curve from x = 0 to x = 0.6. When I look for appropriate packages I only find things like calculating the AUC of a ROC curve. But is there a way to just calculate the AUC of a normal function?
The area under the curve (AUC) for a given set of data points can be obtained by numeric integration:
Let data be your data frame containing x and y values. You can get the area under the curve from the lower bound x0 = 0 to the upper bound x1 = 0.6 by integrating a function that linearly interpolates your data.
This is a numeric approximation and not exact, because we do not have an infinite number of data points: for y = sqrt(x) we get 0.3033 instead of the true value 0.3098. With 200 rows in data we do even better, getting auc = 0.3096.
library(tidyverse)
data <-
tibble(
x = seq(0, 2, length.out = 20)
) %>%
mutate(y = sqrt(x))
data
#> # A tibble: 20 × 2
#> x y
#> <dbl> <dbl>
#> 1 0 0
#> 2 0.105 0.324
#> 3 0.211 0.459
#> 4 0.316 0.562
#> 5 0.421 0.649
#> 6 0.526 0.725
#> 7 0.632 0.795
#> 8 0.737 0.858
#> 9 0.842 0.918
#> 10 0.947 0.973
#> 11 1.05 1.03
#> 12 1.16 1.08
#> 13 1.26 1.12
#> 14 1.37 1.17
#> 15 1.47 1.21
#> 16 1.58 1.26
#> 17 1.68 1.30
#> 18 1.79 1.34
#> 19 1.89 1.38
#> 20 2 1.41
qplot(x, y, data = data)
integrate(approxfun(data$x, data$y), 0, 0.6)
#> 0.3033307 with absolute error < 8.8e-05
Created on 2021-10-03 by the reprex package (v2.0.1)
The absolute error returned by integrate() is correct only if the real-world function between every two data points is a perfect linear interpolation, as we assumed.
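To see where these numbers come from, here is a quick check of the exact value and of the 200-point grid mentioned above (reusing the packages already loaded; the second result is stated approximately):
# exact value: the integral of sqrt(x) on [0, 0.6] is (2/3) * 0.6^(3/2)
(2 / 3) * 0.6^1.5
#> [1] 0.3098387
# a denser grid (200 points instead of 20) shrinks the interpolation error
data200 <- tibble(x = seq(0, 2, length.out = 200)) %>% mutate(y = sqrt(x))
integrate(approxfun(data200$x, data200$y), 0, 0.6)$value
#> about 0.3096 (vs. 0.3033 with 20 points)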
I used the package MESS to solve the problem:
# Toy example
library(MESS)
x <- seq(0,3, by=0.1)
y <- x^2
auc(x,y, from = 0.1, to = 2, type = "spline")
The analytical result is:
7999/3000
which is approximately 2.666333.
The R script offered gives: 2.66632 using the spline approximation and 2.6695 using the linear approximation.
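As a sanity check, the analytical value and the linear interpolation can be computed alongside (a small sketch; type = "linear" is, to my knowledge, the default interpolation in MESS::auc()):
# analytical value: the integral of x^2 from 0.1 to 2 is (2^3 - 0.1^3) / 3 = 7999/3000
(2^3 - 0.1^3) / 3
#> [1] 2.666333
# linear interpolation for comparison (reported above as roughly 2.6695)
auc(x, y, from = 0.1, to = 2, type = "linear")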
I would like to pass a variable containing a factor name from an ANOVA model to the emmeans() statement. Here I use the oranges dataset (shipped with emmeans) to make the code reproducible. This is my model and how I would usually calculate the emmeans of the factor store:
library(emmeans)
oranges$store <- as.factor(oranges$store)
model <- lm(sales1 ~ 1 + price1 + store, data = oranges)
means <- emmeans(model, pairwise ~ store, adjust = "tukey")
Now I would like to assign a variable (lsmeanfact) defining the factor for which the lsmeans are calculated.
lsmeanfact<-"store"
However, when I try to evaluate this variable inside the emmeans() function it returns an error: it does not find a variable named lsmeanfact, so the variable is never evaluated.
means<-emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust="tukey")
Error in emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust = "tukey") :
No variable named lsmeanfact in the reference grid
How should I change my code so that the variable lsmeanfact is evaluated and the lsmeans for "store" are correctly calculated?
You can make use of the reformulate() function.
library(emmeans)
lsmeanfact<-"store"
means <- emmeans(model, reformulate(lsmeanfact, 'pairwise'), adjust="tukey")
Or construct a formula with formula/as.formula.
means <- emmeans(model, formula(paste('pairwise', lsmeanfact, sep = '~')), adjust="tukey")
Here both reformulate(lsmeanfact, 'pairwise') and formula(paste('pairwise', lsmeanfact, sep = '~')) return pairwise ~ store.
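A quick way to convince yourself that both constructions produce the same formula object:
lsmeanfact <- "store"
reformulate(lsmeanfact, "pairwise")
#> pairwise ~ store
formula(paste("pairwise", lsmeanfact, sep = "~"))
#> pairwise ~ store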
You do not need to do anything special at all. The specs argument to emmeans() can be a character value. You can get the pairwise comparisons in a separate call, which is actually a better way to go anyway.
library(emmeans)
model <- lm(sales1 ~ price1 + store, data = oranges)
lsmeanfact <- "store"
( EMM <- emmeans(model, lsmeanfact) )
## store emmean SE df lower.CL upper.CL
## 1 8.01 2.61 29 2.67 13.3
## 2 9.60 2.30 29 4.89 14.3
## 3 7.84 2.30 29 3.13 12.6
## 4 10.44 2.35 29 5.63 15.2
## 5 10.19 2.28 29 5.53 14.9
## 6 15.22 2.28 29 10.56 19.9
##
## Confidence level used: 0.95
pairs(EMM)
## contrast estimate SE df t.ratio p.value
## 1 - 2 -1.595 3.60 29 -0.443 0.9976
## 1 - 3 0.165 3.60 29 0.046 1.0000
## 1 - 4 -2.428 3.72 29 -0.653 0.9856
## 1 - 5 -2.185 3.50 29 -0.625 0.9882
## 1 - 6 -7.209 3.45 29 -2.089 0.3206
## 2 - 3 1.761 3.22 29 0.546 0.9936
## 2 - 4 -0.833 3.23 29 -0.258 0.9998
## 2 - 5 -0.590 3.23 29 -0.182 1.0000
## 2 - 6 -5.614 3.24 29 -1.730 0.5239
## 3 - 4 -2.593 3.23 29 -0.802 0.9648
## 3 - 5 -2.350 3.23 29 -0.727 0.9769
## 3 - 6 -7.375 3.24 29 -2.273 0.2373
## 4 - 5 0.243 3.26 29 0.075 1.0000
## 4 - 6 -4.781 3.28 29 -1.457 0.6930
## 5 - 6 -5.024 3.23 29 -1.558 0.6314
##
## P value adjustment: tukey method for comparing a family of 6 estimates
Created on 2021-06-29 by the reprex package (v2.0.0)
Moreover, what is needed in specs is the name(s) of the factor(s) involved, not the factors themselves. Note also that it was unnecessary to convert store to a factor before fitting the model.
I have a mixed-effects model (with lme4) with a 2-way interaction term whose factors each have 4 levels, and I would like to investigate their effects relative to the grand mean. I present an example here using the Cars93 data set (from the MASS package) and omit the random-effects term since it is not necessary for this example:
## shorten data frame for simplicity
library(MASS) # for the Cars93 data set
df = Cars93[is.element(Cars93$Make, c('Acura Integra', 'Audi 90', 'BMW 535i', 'Subaru Legacy')), ]
df$Make = droplevels(df$Make)  # base-R replacement for gdata::drop.levels
df$Model = droplevels(df$Model)
## define contrasts (every factor has 4 levels)
contrasts(df$Make) = contr.treatment(4)
contrasts(df$Model) = contr.treatment(4)
## model
m1 <- lm(Price ~ Model*Make,data=df)
summary(m1)
as you can see, the first levels are omitted in the interaction term. I would like to have all 4 levels in the output, referenced to the grand mean (often referred to as deviation coding). These are the sources I looked at: https://marissabarlaz.github.io/portfolio/contrastcoding/#coding-schemes and How to change contrasts to compare with mean of all levels rather than reference level (R, lmer)?. The last reference does not cover interactions, though.
The simple answer is that what you want is not possible directly. You have to use a slightly different approach.
In a model with interactions, you want to use contrasts in which the mean is zero and not a specific level. Otherwise, the lower-order effects (i.e., main effects) are not main effects but simple effects (evaluated where the other factor is at its reference level). This is explained in more detail in my chapter on mixed models:
http://singmann.org/download/publications/singmann_kellen-introduction-mixed-models.pdf
To get what you want, you have to fit the model in a reasonable manner and then pass it to emmeans to compare against the intercept (i.e., the unweighted grand mean). This also works for interactions, as shown below (since your code did not run, I use warpbreaks).
afex::set_sum_contrasts() ## uses contr.sum globally
library("emmeans")
## model
m1 <- lm(breaks ~ wool * tension,data=warpbreaks)
car::Anova(m1, type = 3)
coef(m1)[1]
# (Intercept)
# 28.14815
## both CIs include grand mean:
emmeans(m1, "wool")
# wool emmean SE df lower.CL upper.CL
# A 31.0 2.11 48 26.8 35.3
# B 25.3 2.11 48 21.0 29.5
#
# Results are averaged over the levels of: tension
# Confidence level used: 0.95
## same using test
emmeans(m1, "wool", null = coef(m1)[1], infer = TRUE)
# wool emmean SE df lower.CL upper.CL null t.ratio p.value
# A 31.0 2.11 48 26.8 35.3 28.1 1.372 0.1764
# B 25.3 2.11 48 21.0 29.5 28.1 -1.372 0.1764
#
# Results are averaged over the levels of: tension
# Confidence level used: 0.95
emmeans(m1, "tension", null = coef(m1)[1], infer = TRUE)
# tension emmean SE df lower.CL upper.CL null t.ratio p.value
# L 36.4 2.58 48 31.2 41.6 28.1 3.196 0.0025
# M 26.4 2.58 48 21.2 31.6 28.1 -0.682 0.4984
# H 21.7 2.58 48 16.5 26.9 28.1 -2.514 0.0154
#
# Results are averaged over the levels of: wool
# Confidence level used: 0.95
emmeans(m1, c("tension", "wool"), null = coef(m1)[1], infer = TRUE)
# tension wool emmean SE df lower.CL upper.CL null t.ratio p.value
# L A 44.6 3.65 48 37.2 51.9 28.1 4.499 <.0001
# M A 24.0 3.65 48 16.7 31.3 28.1 -1.137 0.2610
# H A 24.6 3.65 48 17.2 31.9 28.1 -0.985 0.3295
# L B 28.2 3.65 48 20.9 35.6 28.1 0.020 0.9839
# M B 28.8 3.65 48 21.4 36.1 28.1 0.173 0.8636
# H B 18.8 3.65 48 11.4 26.1 28.1 -2.570 0.0133
#
# Confidence level used: 0.95
Note that for coef() you probably want to use fixef() for lme4 models.
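A minimal sketch of why that matters, using the sleepstudy data that ships with lme4 rather than the model above: coef() on a merMod returns per-group coefficient tables, so the scalar intercept has to come from fixef().
library(lme4)
fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fixef(fm)[1]  # scalar fixed-effect intercept, analogous to coef(m1)[1] above
# coef(fm)    # by contrast, a list of per-Subject coefficient tables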
I've noticed that emmeans (in R) isn't working for an intercept-only estimate after the latest update.
Reproducible example:
test = lm(mpg ~ 1, mtcars)
library(emmeans)
emmeans::emmeans(test, ~ 1)
The output on 2 of my machines (windows and Linux) is:
> emmeans::emmeans(test,~1)
Error in `[[<-.data.frame`(`*tmp*`, ".wgt.", value = 2) :
replacement has 1 row, data has 0
Is this a known issue, or have I messed up my system somehow?
I believe this used to work.
It does work if you include a variable:
test2 = lm(mpg ~ as.factor(cyl), mtcars)
emmeans(test2, ~ cyl)
Thanks very much for the help in advance.
It turns out that the fix for issue #197, incorporated in CRAN version 1.4.7, created the issue (#206) that we see here. I think I have them both fixed now:
require(emmeans)
## Loading required package: emmeans
#206...
warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
emmeans(warp.lm, "1")
## 1 emmean SE df lower.CL upper.CL
## overall 28.1 1.49 48 25.2 31.1
##
## Results are averaged over the levels of: wool, tension
## Confidence level used: 0.95
emmeans(warp.lm, "1", by = "wool")
## wool = A:
## 1 emmean SE df lower.CL upper.CL
## overall 31.0 2.11 48 26.8 35.3
##
## wool = B:
## 1 emmean SE df lower.CL upper.CL
## overall 25.3 2.11 48 21.0 29.5
##
## Results are averaged over the levels of: tension
## Confidence level used: 0.95
#197...
model <- lm(Sepal.Length ~ poly(Petal.Length,2), data = iris)
emtrends(model, ~ 1, "Petal.Length", max.degree = 2)
## degree = linear:
## 1 Petal.Length.trend SE df lower.CL upper.CL
## overall 0.4474 0.0180 147 0.4119 0.483
##
## degree = quadratic:
## 1 Petal.Length.trend SE df lower.CL upper.CL
## overall 0.0815 0.0132 147 0.0554 0.108
##
## Confidence level used: 0.95
Created on 2020-06-01 by the reprex package (v0.3.0)
Users who need this now can install from github via
remotes::install_github("rvlenth/emmeans")
It is working fine with emmeans 1.4.6 on macOS Catalina 10.15.4 and R 4.0:
emmeans::emmeans(test,~1)
# 1 emmean SE df lower.CL upper.CL
# overall 20.1 1.07 31 17.9 22.3
#Confidence level used: 0.95
In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).
Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object? rf_mod$forest doesn't seem to have this information, and the docs don't mention it.
library(randomForest)
df <- mtcars
set.seed(1)
rf_mod = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 200)
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
We can also extract individual tree structure with getTree. Here's the first tree.
head(getTree(rf_mod, k = 1, labelVar = TRUE))
left daughter right daughter split var split point status prediction
1 2 3 wt 2.15 -3 18.91875
2 0 0 <NA> 0.00 -1 31.56667
3 4 5 wt 3.16 -3 17.61034
4 6 7 drat 3.66 -3 21.26667
5 8 9 carb 3.50 -3 15.96500
6 0 0 <NA> 0.00 -1 19.70000
One workaround is to grow many CARTs (i.e. - ntree = 1), get the variable importance of each tree, and average the resulting %IncMSE:
# number of trees to grow
nn <- 200
# function to run nn CART models
run_rf <- function(rand_seed){
set.seed(rand_seed)
one_tr = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 1)
return(one_tr)
}
# list to store output of each model
l <- vector("list", length = nn)
l <- lapply(1:nn, run_rf)
The extraction, averaging, and comparison step.
# extract importance of each CART model
library(dplyr); library(purrr)
map(l, importance) %>%
map(as.data.frame) %>%
map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>%
bind_rows() %>%
group_by(var) %>%
summarise(`%IncMSE` = mean(`%IncMSE`)) %>%
arrange(-`%IncMSE`)
# A tibble: 10 x 2
var `%IncMSE`
<chr> <dbl>
1 wt 8.52
2 cyl 7.75
3 disp 7.74
4 hp 5.53
5 drat 1.65
6 carb 1.52
7 vs 0.938
8 qsec 0.824
9 gear 0.495
10 am 0.355
# compare to the RF model above
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
I'd like to be able to extract the variable importance of each tree directly from a randomForest object, without this roundabout method of completely re-running the RF, in order to produce reproducible cumulative variable importance plots like the linked example, and the one shown below for mtcars (minimal example linked in the post).
I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for the purpose of visualization and communicating that as trees increase in a forest, the variable importance measures jump around before stabilizing.
When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.
Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:
## Starting with a random forest having a single tree,
## grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
rep(1,9), randomForest::grow )
## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )
## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 18.8 8.63 1.05 0 1.17 0 0 0 0.194
# 2 0 10.0 46.4 0.561 0 -0.299 0 0 0.543 2.05
# 3 0 22.4 31.2 0.955 0 -0.199 0 0 0.362 5.1
# 4 1.55 24.1 23.4 0.717 0 -0.150 0 0 0.272 5.28
# 5 1.24 22.8 23.6 0.573 0 -0.178 0 0 -0.0259 4.98
# 6 1.03 26.2 22.3 0.478 1.25 0.775 0 0 -0.0216 4.1
# 7 0.887 22.5 22.5 0.406 1.79 -0.101 0 0 -0.0185 3.56
# 8 0.776 19.7 21.3 0.944 1.70 0.105 0 0.0225 -0.0162 3.11
# 9 0.690 18.4 19.1 0.839 1.51 1.24 1.01 0.02 -0.0144 2.77
# 10 0.621 18.4 21.2 0.937 1.32 1.11 0.910 0.0725 -0.114 2.49
The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
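For instance, a small sketch pulling the first tree out of that final forest, reusing getTree() from earlier in the thread:
# rfs was built above; its last element is the full 10-tree forest
final_rf <- dplyr::last(rfs)
head(randomForest::getTree(final_rf, k = 1, labelVar = TRUE))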
Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.
While I (think I) understand your question, to be honest I am unsure whether it makes sense from a statistics/ML point of view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.
Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):
At each split in each tree, the improvement in the split-criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable. [...] Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.
So the variable importance measure in RF is defined as a measure accumulated over all trees.
In traditional single classification trees (CARTs), variable importance is characterised through the Gini index that measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis)
More complex measures to characterise variable importance in CART-like models exist; for example in rpart:
An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness * (adjusted agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum to 100 and the rounded values are shown, omitting any variable whose proportion is less than 1%.
So the bottom line here is the following: at the very least it won't be easy (and in the worst case it won't make sense) to compare variable importance measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.
Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.
We can simplify the extraction and averaging step to:
library(tidyverse)
out <- map(seq_len(nn), ~ run_rf(.x) %>% importance) %>%
  reduce(`+`) %>%
  magrittr::divide_by(nn)
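Here each importance() call returns a matrix, reduce(`+`) sums those matrices element-wise, and dividing by nn averages both the %IncMSE and IncNodePurity columns in a single matrix, matching the dplyr/purrr pipeline above.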