I am trying to write a custom function to run a one-way within-subjects ANOVA using rlang + ez.
An example of the output I am expecting:
# setup
set.seed(123)
library(WRS2)
library(ez)
library(tidyverse)
# getting data in format that `ez` expects
df <- WRS2::WineTasting %>%
dplyr::mutate_if(
.tbl = .,
.predicate = purrr::is_bare_character,
.funs = as.factor
) %>%
dplyr::mutate(.data = ., Taster = as.factor(Taster))
# this works
ez::ezANOVA(
data = df,
dv = Taste,
wid = Taster,
within = Wine,
detailed = TRUE,
return_aov = TRUE
)
#> $ANOVA
#> Effect DFn DFd SSn SSd F p
#> 1 (Intercept) 1 21 2.005310e+03 4.2186364 9982.254929 1.311890e-29
#> 2 Wine 2 42 9.371212e-02 0.3129545 6.288308 4.084101e-03
#> p<.05 ges
#> 1 * 0.99774530
#> 2 * 0.02026075
#>
#> $`Mauchly's Test for Sphericity`
#> Effect W p p<.05
#> 2 Wine 0.7071776 0.03128132 *
#>
#> $`Sphericity Corrections`
#> Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
#> 2 Wine 0.7735015 0.008439799 * 0.8233709 0.007188822 *
#>
#> $aov
#>
#> Call:
#> aov(formula = formula(aov_formula), data = data)
#>
#> Grand Mean: 5.512121
#>
#> Stratum 1: Taster
#>
#> Terms:
#> Residuals
#> Sum of Squares 4.218636
#> Deg. of Freedom 21
#>
#> Residual standard error: 0.4482047
#>
#> Stratum 2: Taster:Wine
#>
#> Terms:
#> Wine Residuals
#> Sum of Squares 0.09371212 0.31295455
#> Deg. of Freedom 2 42
#>
#> Residual standard error: 0.08632091
#> Estimated effects may be unbalanced
Now here is a custom function I have written to do the same but using non-standard evaluation implemented in rlang:
# custom function
aov_fun <- function(data, x, y, id) {
# getting data in format that `ez` expects
df <- data %>%
dplyr::mutate_if(
.tbl = .,
.predicate = purrr::is_bare_character,
.funs = as.factor
) %>%
dplyr::mutate(.data = ., {{ id }} := as.factor({{ id }})) %>%
tibble::as_tibble(.)
# print the dataframe to see if it was cleaned as expected
print(df)
# running anova
ez::ezANOVA(
data = df,
dv = {{ y }},
wid = {{ id }},
within = {{ x }},
detailed = TRUE,
return_aov = TRUE
)
}
But this doesn't work. Note that the dataframe is getting cleaned properly, so that's not where the error lies.
# using the function
aov_fun(WRS2::WineTasting, Wine, Taste, Taster)
#> # A tibble: 66 x 3
#> Taste Wine Taster
#> <dbl> <fct> <fct>
#> 1 5.4 Wine A 1
#> 2 5.5 Wine B 1
#> 3 5.55 Wine C 1
#> 4 5.85 Wine A 2
#> 5 5.7 Wine B 2
#> 6 5.75 Wine C 2
#> 7 5.2 Wine A 3
#> 8 5.6 Wine B 3
#> 9 5.5 Wine C 3
#> 10 5.55 Wine A 4
#> # ... with 56 more rows
#> Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, : "{
#> y
#> }" is not a variable in the data frame provided.
Instead of dv = {{ y }}, I have also tried-
dv = rlang::as_string(y)
dv = rlang::as_name(y)
dv = rlang::enquo(y)
But none of these work.
Whenever I want to bridge rlang's NSE with functions that don't explicitly support it,
I find that dividing the procedure in these 2 steps (at least conceptually) is always helpful:
Create the final expression I'd like using rlang functions.
Evaluate it, either with rlang::eval_tidy if quosures are involved, or with base::eval otherwise.
In your case, you can probably finish your function with something like:
# running anova
rlang::eval_tidy(rlang::expr(ez::ezANOVA(
data = df,
dv = {{ y }},
wid = {{ id }},
within = {{ x }},
detailed = TRUE,
return_aov = TRUE
)))
expr creates the expression and obviously supports rlang's NSE,
and eval_tidy simply evaluates the expression.
Oh and BTW, if ezANOVA (or any other function you want to use NSE with) supported strings instead of expressions as input,
you'd need something like rlang::as_string(rlang::enexpr(param)),
first capturing the expression of what the user wrote as param,
and then using as_string to transform that expression.
This can be corrected with
aov_fun <- function(data, x, y, id) {
lst1 <- as.list(match.call()[-1])
names(lst1)<- c("data", "dv", "wid", "within")[match(names(lst1),
c("data", "y", "id", "x"))]
df <- data %>%
dplyr::mutate_if(
.tbl = .,
.predicate = purrr::is_bare_character,
.funs = as.factor
) %>%
dplyr::mutate(.data = ., {{ id }} := as.factor({{ id }})) %>%
tibble::as_tibble(.)
do.call(getFromNamespace("ezANOVA", "ez"),
c(lst1, detailed = TRUE, return_aov = TRUE))
}
-testing
aov_fun(WRS2::WineTasting, x = Wine,y = Taste, id = Taster)
#$ANOVA
# Effect DFn DFd SSn SSd F p p<.05 ges
# 1 (Intercept) 1 21 2.005310e+03 4.2186364 9982.254929 1.311890e-29 * 0.99774530
# 2 Wine 2 42 9.371212e-02 0.3129545 6.288308 4.084101e-03 * 0.02026075
# $`Mauchly's Test for Sphericity`
# Effect W p p<.05
# 2 Wine 0.7071776 0.03128132 *
# $`Sphericity Corrections`
# Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
# 2 Wine 0.7735015 0.008439799 * 0.8233709 0.007188822 *
# $aov
# Call:
# aov(formula = formula(aov_formula), data = data)
# Grand Mean: 5.512121
# Stratum 1: Taster
# Terms:
# Residuals
# Sum of Squares 4.218636
# Deg. of Freedom 21
# Residual standard error: 0.4482047
# Stratum 2: Taster:Wine
# Terms:
# Wine Residuals
# Sum of Squares 0.09371212 0.31295455
# Deg. of Freedom 2 42
# Residual standard error: 0.08632091
# Estimated effects may be unbalanced
This is a great application for using_bang from #moody_mudskipper's tags package
aov_fun <- function(data, x, y, id) {
# ...
# code as before
# running anova
tags::using_bang$ezANOVA(
data = df,
dv = {{y}},
wid = {{id}},
within = {{x}},
detailed = TRUE,
return_aov = TRUE
)
}
Related
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 months ago.
Improve this question
I´m trying to build a function which would receive: a dataframe (data), variable(s) to group by (groupby), and the name of a dependent variable (var); The function will then: a. create a plot of the means of var, separated by group(s) in groupby. In addition, a nice to have would be adding an anova at the end.
I´ll start with the end: my problem is obviously how to use (string) values in further manipulations in a user defined function.
I unfortunately have problems parsing groupby, which I couldn´t solve after a couple of days trying: I tried using:
!!!rlang::parse_exprs, strsplit, etc... but with no success. Currently it looks like something like that (that´s the simplified version with less aesthetics..):
grp_comp <- function(data, groupby, var){
data %>%
filter(!is.na(var)) %>%
group_by(!!!rlang::parse_exprs(groupby)) %>%
summarize(n = n(),
mean = mean(!!!rlang::parse_expr(var)),
sd = sd(!!!rlang::parse_expr(var)),
se = sd / sqrt(n)) -> ddata
gg <- unlist(rlang::parse_exprs(groupby))
if(length(as.vector(rlang::parse_exprs(groupby))) == 1){
g <- ggplot(ddata, aes(x = as.character(gg[1]),
y = mean)) +
geom_point()}
else{
g <- ggplot(ddata, aes(x = as.character(gg[1]),
y = mean,
shape = as.character(gg[2]),
color= as.character(gg[2])),
group = as.character(gg[2]))}
form <- unlist(strsplit(groupby, ';', fixed = T))
form <- paste(form, collapse = " + ")
form <- paste(var, " ~ ", form)
form
data%>%
filter(!is.na(var)) %>%
aov(formula = form) -> anova
summary(anova) -> anova
l <- list(ddata, g, anova)
l
}
My problems are:
a. groupby could contain one or two variables. I can´t manage to use groupby as an argument for group_by in the ggplots. Either I get: Error: Discrete value supplied to continuous scale in case I use: x = gg[1], or I use: x = as.factor(gg[1]) or: as.character and get the following plot (i.e. x is only labeled "BPL", but not grouped by the factor).
b. when I try to use two (instead of one) groupby factors, things get even worse and the plot is completely empty...
c. I manage to create the right formula for the anova, but when I try to actually calculate it I encounter: Error: $ operator is invalid for atomic vectors -> any ideas why?
d. not critical, but any ideas for using the second, optional group as color & shape in aes() in case the argument contains two groups, without using the if ?
Many many thanks in advance!
Guy
It's not clear how you want to call this function, but you could do something like:
library(tidyverse)
grp_comp <- function(data, groupby, var){
ddata <- data %>%
filter(!is.na({{var}})) %>%
group_by(!!!rlang::parse_exprs(groupby)) %>%
summarize(n = n(),
mean = mean({{var}}),
sd = sd({{var}}),
se = sd / sqrt(n))
gg <- unlist(rlang::parse_exprs(groupby))
g <- if(length(as.vector(rlang::parse_exprs(groupby))) == 1)
ggplot(ddata, aes(x = !!gg[[1]], y = mean)) + geom_point()
else {
ggplot(ddata, aes(x = !!gg[[1]], y = mean, shape = factor(!!gg[[2]]),
color= !!gg[[2]], group = !!gg[[2]])) + geom_point()
}
form <- unlist(strsplit(groupby, ';', fixed = T))
form <- paste(form, collapse = " + ")
form <- paste(deparse(substitute(var)), " ~ ", form)
data%>%
filter(!is.na({{var}})) %>%
aov(formula = as.formula(form)) -> anova
summary(anova) -> anova
list(ddata, g, anova)
}
This allows:
grp_comp(iris, "Species", Sepal.Length)
#> [[1]]
#> # A tibble: 3 x 5
#> Species n mean sd se
#> <fct> <int> <dbl> <dbl> <dbl>
#> 1 setosa 50 5.01 0.352 0.0498
#> 2 versicolor 50 5.94 0.516 0.0730
#> 3 virginica 50 6.59 0.636 0.0899
#>
#> [[2]]
#>
#> [[3]]
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Species 2 63.21 31.606 119.3 <2e-16 ***
#> Residuals 147 38.96 0.265
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And
grp_comp(mtcars, c("gear", "cyl"), mpg)
#> `summarise()` has grouped output by 'gear'. You can override using the
#> `.groups` argument.
#> [[1]]
#> # A tibble: 8 x 6
#> # Groups: gear [3]
#> gear cyl n mean sd se
#> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 3 4 1 21.5 NA NA
#> 2 3 6 2 19.8 2.33 1.65
#> 3 3 8 12 15.0 2.77 0.801
#> 4 4 4 8 26.9 4.81 1.70
#> 5 4 6 4 19.8 1.55 0.776
#> 6 5 4 2 28.2 3.11 2.2
#> 7 5 6 1 19.7 NA NA
#> 8 5 8 2 15.4 0.566 0.400
#>
#> [[2]]
#>
#> [[3]]
#> Df Sum Sq Mean Sq F value Pr(>F)
#> gear 1 259.7 259.7 24.87 2.63e-05 ***
#> cyl 1 563.4 563.4 53.94 4.32e-08 ***
#> Residuals 29 302.9 10.4
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-08-27 with reprex v2.0.2
I have been struggling for a while to make my own CLDs from the output of a TukeyHSD test.
First I've done a two-way ANOVA:
aov2_arbuscular <- aov(arbuscular_count ~ block + pesticide*fertilizer, data = main_trial)
And did a TukeyHSD test as a post hoc test:
tk_arbuscular <- TukeyHSD(aov2_arbuscular)
Because I could not generate the CLD with the TukeyHSD output I used the emmeans() and cld() function.
tk_arbuscular_model <- emmeans(aov2_arbuscular,
pairwise ~ pesticide*fertilizer,
adjust = "tukey")
tk_arbuscular_model_cld <- cld(tk_arbuscular_model$emmeans,
alpha = .05,
Letters = letters)
I thought that both the TukeyHSD and emmeans with adjust = "tukey" result in the same output. Which they do for the most results, with unfortunately a few exceptions.
I have already written my result part and do not want to adjust all the p-values again. Therefore, can someone help me to generate the CLDs with the TukeyHSD output, so I can integrate them in a ggplot?
You did not provide your data, so I am creating my own reprex that can also be analyzed as a two-factorial block design, i.e. with a model similar to yours.
As you can see, I could not reproduce your problem - all p-values are basically identical. I noticed that you added the adjust = "Tukey" to the emmeans() statement, but I usually add it to the cld() statement instead - however that shouldn't be the problem.
library(tidyverse)
library(emmeans)
library(multcomp)
library(multcompView)
dataURL <- "https://raw.githubusercontent.com/SchmidtPaul/DSFAIR/master/data/Gomez%26Gomez1984.csv"
dat <- read_csv(dataURL) %>%
filter(G %in% c("A", "B") & N %in% c("N1", "N2")) %>%
mutate_at(vars(rep:N), as.factor)
aov <- aov(yield ~ G + N + G:N + rep, data = dat)
# get contrasts via 3 options ---------------------------------------------
option1 <- stats::TukeyHSD(aov) %>%
pluck("G:N")
option2 <- emmeans::emmeans(aov, ~ G:N) %>%
emmeans::pairs(adjust = "Tukey")
option3 <- emmeans::emmeans(aov, ~ G:N) %>%
multcomp::cld(adjust = "Tukey", details = TRUE)
# uniform format ----------------------------------------------------------
option1 <- option1 %>%
as_tibble(rownames = "contrast") %>%
transmute(contrast = contrast,
estimate = diff,
p.value = `p adj`)
option2 <- option2 %>%
as_tibble() %>%
dplyr::select(contrast, estimate, p.value)
option3 <- option3 %>%
pluck("comparisons") %>%
as_tibble() %>%
dplyr::select(contrast, estimate, p.value)
# compare -----------------------------------------------------------------
option1
#> # A tibble: 6 x 3
#> contrast estimate p.value
#> <chr> <dbl> <dbl>
#> 1 B:N1-A:N1 53.3 0.999
#> 2 A:N2-A:N1 1419. 0.0860
#> 3 B:N2-A:N1 1729. 0.0401
#> 4 A:N2-B:N1 1366 0.0984
#> 5 B:N2-B:N1 1676 0.0455
#> 6 B:N2-A:N2 310 0.910
option2
#> # A tibble: 6 x 3
#> contrast estimate p.value
#> <chr> <dbl> <dbl>
#> 1 A N1 - B N1 -53.3 0.999
#> 2 A N1 - A N2 -1419. 0.0860
#> 3 A N1 - B N2 -1729. 0.0401
#> 4 B N1 - A N2 -1366 0.0984
#> 5 B N1 - B N2 -1676 0.0455
#> 6 A N2 - B N2 -310. 0.910
option3
#> # A tibble: 6 x 3
#> contrast estimate p.value
#> <chr> <dbl> <dbl>
#> 1 B N1 - A N1 53.3 0.999
#> 2 A N2 - A N1 1419. 0.0860
#> 3 A N2 - B N1 1366 0.0984
#> 4 B N2 - A N1 1729. 0.0401
#> 5 B N2 - B N1 1676 0.0455
#> 6 B N2 - A N2 310. 0.910
tibble(
o1_p = option1$p.value,
o2_p = option2$p.value,
o3_p = option3$p.value
) %>% cor()
#> o1_p o2_p o3_p
#> o1_p 1.0000000 1.0000000 0.9967731
#> o2_p 1.0000000 1.0000000 0.9967731
#> o3_p 0.9967731 0.9967731 1.0000000
Created on 2021-12-02 by the reprex package (v2.0.1)
Does this help? If not, can you try to create a reproducible example with your data so that we have all the information?
Also, here is a chapter I wrote on using and interpreting the compact letter display.
I am new here and quite new to programming, so any help would be greatly appreciated.
I have a dataframe df1 which looks like this:
Picture
Emotion
Gender
Type
Trial
Attr_scores
Fear_scores
Appr_scores
Avoid_scores
1
happy
male
human
first
11
3
21
21
2
sad
male
human
first
12
6
22
22
3
neutral
male
human
first
13
2
23
23
4
happy
male
cartoon
first
14
3
24
24
5
sad
male
cartoon
first
15
6
25
25
6
neutral
male
cartoon
first
16
2
26
26
7
happy
male
animal
first
17
3
27
27
8
sad
male
animal
first
18
6
28
28
9
neutral
male
animal
first
19
2
29
29
10
happy
female
human
first
20
3
21
30
11
sad
female
human
first
21
6
22
31
12
neutral
female
human
first
22
2
23
32
13
happy
female
cartoon
first
23
3
24
33
14
sad
female
cartoon
first
24
6
25
34
15
neutral
female
cartoon
first
25
2
26
35
16
happy
female
animal
first
26
3
27
36
17
sad
female
animal
first
27
6
28
37
18
neutral
female
animal
first
28
2
29
38
And here is the code to generate it:
Picture <- c(1:18)
Emotion <- rep(c('happy','sad','neutral'),times=6)
Gender <- rep(c('male','female'),each=9)
Type <- rep(c('human','cartoon','animal','human','cartoon','animal'),each=3)
Trial <- rep(c('first'),times=18)
Attr_scores <- c(11:28)
Fear_scores <- rep(c(3,6,2),times=6)
Appr_scores <- rep(c(21:29),times=2)
Avoid_scores <- c(21:38)
df1<-data.frame(Picture,Emotion,Gender,Type,Trial,Attr_scores,Fear_scores,Appr_scores,Avoid_scores)
I need to take several pairs of variables (one independent variable + one dependent variable, e.g. Emotion + Attr_scores, Emotion + Fear_scores, Gender + Attr_scores, Gender + Avoid_scores), and for each of them: 1) run summary statistics (compare means and SDs), 2) run one-way ANOVA, 3) create a scatter plot.
So far, I have created the code for the first pair of variables (Gender + Attr_scores). Here is the code:
# Summary Statistics
library(dplyr)
group_by(df1, Gender) %>%
summarise(
N = n(),
Mean = mean(Attr_scores, na.rm = TRUE),
Sd = sd(Attr_scores, na.rm = TRUE)
)
# ANOVA
res.aov <- aov(Attr_scores ~ Gender, data = df1)
summary(res.aov)
#Plot
gender_attr_plot <- ggplot(df1, aes(x=Gender, y=Attr_scores)) +
geom_jitter(position=position_jitter(0.2))+
stat_summary(fun.data=mean_sdl, fun.args = list(mult = 1),
geom="pointrange", color="red")
ggsave("gender_attr_plot.png", gender_attr_plot, width = 1600, height = 900, units = "px")
I can copypaste the code for each additional pair of variables and change the variable names manually each time, but this seems like a very inefficient way of doing things. Moreover, if I need to run the same analyses for any additional pair of variables, I will have to copy the entire code again just to do that.
What I want to do instead, is create a table or nested list with pairs of variables (which can be easily updated later, if additional pairs of variables are required) and write a loop that goes through these pairs of variables and performs all 3 actions (summary statistics, ANOVA and plot) for each of them.
I think it should look something like this (this is very far from an actual working code, it's just to give a general idea):
variables <- list(
c(Gender, Attr_scores),
c(Gender, Fear_scores),
c(Type, Appr_scores),
c(Emotion, Avoid_scores))
for(i in variables){
library(dplyr)
group_by(df1, variables,'[[',1) %>%
summarise(
N = n(),
Mean = mean(variables,'[[',2, na.rm = TRUE),
Sd = sd(variables,'[[',2, na.rm = TRUE)
)
res.aov <- aov(variables,'[[',2 ~ variables,'[[',1, data = df1)
summary(res.aov)
plot <- ggplot(df1, aes(x=variables,'[[',1, y=variables,'[[',2)) +
geom_jitter(position=position_jitter(0.2))+
stat_summary(fun.data=mean_sdl, fun.args = list(mult = 1),
geom="pointrange", color="red")
ggsave("??????.png", plot, width = 1600, height = 900, units = "px")
}
Obviously, this is not working, and I have been searching all over the internet for a solution, but my knowledge of R is not yet sufficient to figure out how to make it work. Any help would be most appreciated!
Here is a possible solution for your task:
I modified your code a little and created one function my_function with this function you get the desired output for one pair of your data set. The result is return in a list!
library(dplyr)
library(ggplot2)
my_function <- function(df, x, y) {
# Summary
a <- group_by(df, {{x}}) %>%
summarise(
N = n(),
Mean = mean({{y}}, na.rm = TRUE),
Sd = sd({{y}}, na.rm = TRUE)
)
# ANOVA
res.aov <- aov({{y}} ~ {{x}}, data = df)
b <- summary(res.aov)
# Plot
c <- ggplot(df1, aes(x={{x}}, y={{y}})) +
geom_jitter(position=position_jitter(0.2))+
stat_summary(fun.data=mean_sdl, fun.args = list(mult = 1),
geom="pointrange", color="red")
ggsave(paste0(deparse(substitute(x)), "_",
deparse(substitute(y)), ".png"), width = 1600, height = 900, units = "px")
output<-list(a,b,c)
return(output)
}
# cases 1 - 4
my_function(df1, Gender, Attr_scores)
my_function(df1, Gender, Avoid_scores)
my_function(df1, Emotion, Attr_scores)
my_function(df1, Emotion, Fear_scores)
this may be useful
https://r4ds.had.co.nz/iteration.html#the-map-functions
https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/
variables <-
structure(list(
x = c("Gender", "Gender", "Type", "Emotion"),
y = c("Attr_scores", "Fear_scores", "Appr_scores", "Avoid_scores")
),
class = "data.frame",
row.names = c(NA,-4L))
variables
#> x y
#> 1 Gender Attr_scores
#> 2 Gender Fear_scores
#> 3 Type Appr_scores
#> 4 Emotion Avoid_scores
library(tidyverse)
# GROUP
map2(
.x = variables$x,
.y = variables$y,
.f = ~ group_by(df,!!sym(.x)) %>%
summarise(
N = n(),
Mean = mean(!!sym(.y), na.rm = TRUE),
Sd = sd(!!sym(.y), na.rm = TRUE)
)) %>%
set_names(nm = str_c(variables$x, variables$y, sep = "#"))
#> $`Gender#Attr_scores`
#> # A tibble: 2 x 4
#> Gender N Mean Sd
#> <chr> <int> <dbl> <dbl>
#> 1 female 9 24 2.74
#> 2 male 9 15 2.74
#>
#> $`Gender#Fear_scores`
#> # A tibble: 2 x 4
#> Gender N Mean Sd
#> <chr> <int> <dbl> <dbl>
#> 1 female 9 3.67 1.80
#> 2 male 9 3.67 1.80
#>
#> $`Type#Appr_scores`
#> # A tibble: 3 x 4
#> Type N Mean Sd
#> <chr> <int> <dbl> <dbl>
#> 1 animal 6 28 0.894
#> 2 cartoon 6 25 0.894
#> 3 human 6 22 0.894
#>
#> $`Emotion#Avoid_scores`
#> # A tibble: 3 x 4
#> Emotion N Mean Sd
#> <chr> <int> <dbl> <dbl>
#> 1 happy 6 28.5 5.61
#> 2 neutral 6 30.5 5.61
#> 3 sad 6 29.5 5.61
# ANOVA
map2(
.x = variables$x,
.y = variables$y,
.f = ~ aov(as.formula(str_c(.y, .x, sep = "~")), data = df)
) %>%
set_names(nm = str_c(variables$x, variables$y, sep = "#"))
#> $`Gender#Attr_scores`
#> Call:
#> aov(formula = as.formula(str_c(.y, .x, sep = "~")), data = df)
#>
#> Terms:
#> Gender Residuals
#> Sum of Squares 364.5 120.0
#> Deg. of Freedom 1 16
#>
#> Residual standard error: 2.738613
#> Estimated effects may be unbalanced
#>
#> $`Gender#Fear_scores`
#> Call:
#> aov(formula = as.formula(str_c(.y, .x, sep = "~")), data = df)
#>
#> Terms:
#> Gender Residuals
#> Sum of Squares 0 52
#> Deg. of Freedom 1 16
#>
#> Residual standard error: 1.802776
#> Estimated effects may be unbalanced
#>
#> $`Type#Appr_scores`
#> Call:
#> aov(formula = as.formula(str_c(.y, .x, sep = "~")), data = df)
#>
#> Terms:
#> Type Residuals
#> Sum of Squares 108 12
#> Deg. of Freedom 2 15
#>
#> Residual standard error: 0.8944272
#> Estimated effects may be unbalanced
#>
#> $`Emotion#Avoid_scores`
#> Call:
#> aov(formula = as.formula(str_c(.y, .x, sep = "~")), data = df)
#>
#> Terms:
#> Emotion Residuals
#> Sum of Squares 12.0 472.5
#> Deg. of Freedom 2 15
#>
#> Residual standard error: 5.612486
#> Estimated effects may be unbalanced
#PLOT
f <- function(x, y) {
gender_attr_plot <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
geom_jitter(position = position_jitter(0.2)) +
stat_summary(
fun.data = mean_sdl,
fun.args = list(mult = 1),
geom = "pointrange",
color = "red"
)
}
all_plots <- map2(.x = variables$x, .y = variables$y, .f = f)
plotnames <- str_c(variables$x, "#", variables$y, ".png")
walk2(
.x = plotnames,
.y = all_plots,
.f = ~ ggsave(
filename = .x,
plot = .y,
width = 1600,
height = 900,
units = "px"
)
)
Created on 2021-10-25 by the reprex package (v2.0.1)
data
Picture <- c(1:18)
Emotion <- rep(c('happy', 'sad', 'neutral'), times = 6)
Gender <- rep(c('male', 'female'), each = 9)
Type <-
rep(c('human', 'cartoon', 'animal', 'human', 'cartoon', 'animal'),
each = 3)
Trial <- rep(c('first'), times = 18)
Attr_scores <- c(11:28)
Fear_scores <- rep(c(3, 6, 2), times = 6)
Appr_scores <- rep(c(21:29), times = 2)
Avoid_scores <- c(21:38)
df <-
data.frame(
Picture,
Emotion,
Gender,
Type,
Trial,
Attr_scores,
Fear_scores,
Appr_scores,
Avoid_scores
)
(Updated at the end based on Julia's reply. TL;DR: This seems to be an issue with the underlying kknn package, instead of with tidymodels)
I'm doing some k-nearest neighbours regression models with tidymodels. This is through the nearest_neighbor() function. I want to see what the difference is between the results with and without normalization of the features.
Now set_engine("kknn") uses the kknn::train.kknn() function under the hood, which has a normalization argument scale = TRUE. I want to compare models with scale = FALSE to scale = TRUE (actually, I want to do that in a recipe, but that is not possible, as I'll explain below).
But it does not seem as if I am able to reliably set scale = FALSE through tidymodels. Below is a reprex showing what I see.
The questions so long: Am I doing something wrong or is this a bug? If it is a bug, is it known and can I read about it somewhere? I'd be very grateful if someone can shed light on this.
Set up for the reprex
Here I'll use mtcars:
library(tidymodels)
data("mtcars")
A train-test split is:
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
Here is a common recipe I'll use:
mtcars_recipe <- recipe(mpg ~ disp + wt, data = mtcars)
Here is model 1 (called knn_FALSE) where scale = FALSE:
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
Here is model 2 (called knn_TRUE) where scale = TRUE:
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
I bundle these two models into two workflows:
## Workflow with scale = FALSE
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_recipe(mtcars_recipe)
## Worflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_recipe(mtcars_recipe)
Using fit(), it is possible to have scale = FALSE
It does seem to be possible to have one version with scale = TRUE and one with scale = FALSE when using fit() on a workflow.
For example, for scale = TRUE I get:
wf_TRUE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~TRUE)
Type of response variable: continuous
minimal mean absolute error: 2.09425
Minimal mean squared error: 7.219114
Best kernel: optimal
Best k: 5
Whereas for scale = FALSE I have:
wf_FALSE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~FALSE)
Type of response variable: continuous
minimal mean absolute error: 2.1665
Minimal mean squared error: 6.538769
Best kernel: optimal
Best k: 5
The results are clearly different, which comes from the difference in the scale parameter.
But the plot thickens.
No difference with last_fit()
When using last_fit() however, the results for scale = TRUE and scale = FALSE are identical though.
For scale = TRUE:
wf_TRUE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
Whereas for scale = FALSE:
wf_FALSE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
These are clearly --- and unexpectedly --- the same.
There is also no difference when tuning using tune_grid()
If I do tuning with tune_grid() and a validation_split(), there is also no difference between the results for scale = TRUE and scale = FALSE.
Here is the code for that:
## Tune grid
knn_grid <- tibble(neighbors = c(5, 15))
## Tune Model 1: kNN regresson with no scaling in train.kknn
knn_FALSE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
## Model 2: kNN regresson with scaling in train.kknn
knn_TRUE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
## Workflow with scale = FALSE
wf_FALSE_tune <- workflow() %>%
add_model(knn_FALSE_tune) %>%
add_recipe(mtcars_recipe)
## Worflow with scale = TRUE
wf_TRUE_tune <- workflow() %>%
add_model(knn_TRUE_tune) %>%
add_recipe(mtcars_recipe)
## Validation split
mtcars_val <- validation_split(mtcars)
## Tune results: Without scaling
wf_FALSE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
## Tune results: With scaling
wf_TRUE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
The result when scale = FALSE:
> wf_FALSE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
The results when scale = TRUE:
> wf_TRUE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
Question
Am I misunderstanding (or missing my own bug), or are the last_fit() and tune_grid() functions not respecting my choice for scale?
I'm new to tidymodels, so I might have missed something. Answers much appreciated.
I was hoping to use step_normalize() in a recipe to do the normalization myself, but since I cannot reliably set scale = FALSE in the underlying engine, I have not been able to experiment with that.
Update after Julia's reply
As Julia shows, predictions from train.kknn() provide the same predictions for scale = FALSE and scale = TRUE. So this isn't an tidymodels issue. Rather the kknn:::predict.train.kknn() function does not respect all parameters passed to train.kknn() when predicting.
Consider the following output which uses kknn() instead of train.kknn():
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = FALSE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = TRUE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
These are different, as it should be. The problem is that kknn:::predict.train.kknn() calls kknn(), but without passing along scale (and some other optional arguments):
function (object, newdata, ...)
{
if (missing(newdata))
return(predict(object, ...))
res <- kknn(formula(terms(object)), object$data, newdata,
k = object$best.parameters$k, kernel = object$best.parameters$kernel,
distance = object$distance)
return(predict(res, ...))
}
<bytecode: 0x55e2304fba10>
<environment: namespace:kknn>
I think you don't have a bug or problem but are just misunderstanding what last_fit() and friends are predicting on to estimate performance.
library(tidymodels)
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
knn_FALSE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = FALSE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = FALSE)
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
knn_TRUE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = TRUE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = TRUE)
Notice that both parsnip models are correctly passing the scale parameter to the underlying engine.
We can now add these two parsnip models to a workflow(), with a formula preprocessor (a recipe would be fine too).
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_formula(mpg ~ disp + wt)
## Worflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_formula(mpg ~ disp + wt)
The function last_fit() fits on the training data and predicts on the testing data. We can do that manually with our workflows. Importantly, notice that for these examples in the testing set, the predictions are the same, so the metrics you would get are the same.
wf_TRUE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
wf_FALSE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
The same thing is true for fitting the models directly:
knn_TRUE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
knn_FALSE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
And in fact is true if we fit the underlying kknn model directly:
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = FALSE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = TRUE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
Created on 2020-11-12 by the reprex package (v0.3.0.9001)
The scale parameter is correctly being passed to the underlying engine; it just doesn't change the prediction for these test cases.
I'm writing a regression function which can take several arguments (I include just 3 arguments here for simplicity, but there will be tens of them). Ultimately I want to pass all possible combinations of arguments to the regression function and compile the estimates from the models. So I will first make the full set of combinations using cross_df then iterate over each row of the resulting dataframe, with each row containing the set of arguments to pass to the custom regression function (one argument per column). In the dataframe, I then want to create two new columns: one with the estimated coefficient for an independent variable, and one with the associated p-value.
Here's what I've tried:
rm(list = ls())
library(DeclareDesign)
library(tidyverse)
set.seed(12345)
df <- fabricate(N = 100,
oneDV = rnorm(N),
anotherDV = draw_binary(prob = 0.5, N),
X = draw_binary(prob = 0.5, N),
M = rnorm(N))
myreg <- function(DV = NULL,
control = FALSE,
subset = FALSE) {
dat <- df
# declare dv
dv <- DV
if(subset) {
dat %>% filter(M < median(M))} else {
dat <- dat
}
# controls
if(control) {
cntr <- "+ M"} else {
cntr <- ""
}
# decalare formula
frm <- paste0(dv, "~ X", cntr)
out <- lm_robust(as.formula(frm), data = df)
out$coef <- as.vector(out$coefficients[2])
out$pval <- as.vector(out$p.value[2])
return(out)
}
args <- list(
DV = c("oneDV", "anotherDV"),
control = c(T,F),
double = c(T,F))
args %>%
purrr::cross_df() %>%
mutate(coef = myreg(DV, control, subset)$coef,
pval = myreg(DV, control, subset)$pval)
As you can see, this isn't looping over each row as I want it to---the same result is showing for every row, even though every row is supposed to represent a separate model (8 distinct ones in this example). What am I doing wrong?
Use map2_dbl:
library(tidyverse)
args %>%
purrr::cross_df() %>%
mutate(coef = map2_dbl(DV, control, ~myreg(.x, .y)$coef),
pval = map2_dbl(DV, control, ~myreg(.x, .y)$pval))
#> # A tibble: 4 x 4
#> DV control coef pval
#> <chr> <lgl> <dbl> <dbl>
#> 1 oneDV TRUE 0.120 0.569
#> 2 anotherDV TRUE 0.0957 0.354
#> 3 oneDV FALSE 0.163 0.437
#> 4 anotherDV FALSE 0.0833 0.408
Created on 2019-06-21 by the reprex package (v0.3.0)
Using pmap with more than 2 arguments:
args %>%
purrr::cross_df() %>%
mutate(mod = pmap(., myreg),
coef = map_dbl(mod, ~.x$coef),
pval = map_dbl(mod, ~.x$pval))
#> # A tibble: 8 x 6
#> DV control subset mod coef pval
#> <chr> <lgl> <lgl> <list> <dbl> <dbl>
#> 1 oneDV TRUE TRUE <lm_robst> -0.0917 0.678
#> 2 anotherDV TRUE TRUE <lm_robst> -0.0404 0.693
#> 3 oneDV FALSE TRUE <lm_robst> -0.0825 0.706
#> 4 anotherDV FALSE TRUE <lm_robst> -0.0369 0.717
#> 5 oneDV TRUE FALSE <lm_robst> -0.0917 0.678
#> 6 anotherDV TRUE FALSE <lm_robst> -0.0404 0.693
#> 7 oneDV FALSE FALSE <lm_robst> -0.0825 0.706
#> 8 anotherDV FALSE FALSE <lm_robst> -0.0369 0.717
Created on 2019-06-22 by the reprex package (v0.3.0)