I have a dataframe "data" that contains
employee ID ("CPNo") - int
Gender - factor
Job Role - factor
Country - factor
Annual Salary - int
I want to run a t-test for each job role in each country to see if there is a significant paygap between the genders in the same job role and country.
I create a nested dataframe which contains dataframes with at least 20 observations:
dataNested <- data %>%
select(CPNo, Gender, JobRole, Country, AnnualSalaryLocal) %>%
nest(data = c(CPNo, Gender, AnnualSalaryLocal)) %>% filter(map_int(data, nrow) > 20)
And I want to run a t-test on that nested dataframe:
dataNested %>%
mutate(t_test = map(data, ~t.test(.x$AnnualSalaryLocal ~ .x$Gender, var.eq=F, paired=F)))
Now, if I run the code I get the following table which is a nested dataframe that contain the results of my t-tests:
JobRole
<fctr>
JobStage
<fctr>
Country
<fctr>
data
<list>
t_test
<list>
76 Product Development 06 Ireland <tibble> <S3: htest>
76 Product Development 06 Italy <tibble> <S3: htest>
82 Service Delivery 05 Italy <tibble> <S3: htest>
82 Service Delivery 06 Italy <tibble> <S3: htest>
82 Service Delivery 03 Mexico <tibble> <S3: htest>
83 Supply & Logistics 01 Mexico <tibble> <S3: htest>
76 Product Development 05 Poland <tibble> <S3: htest>
How do I write the syntax if I want to add a new variable "sig" which extracts the p.value from my "t_test" variable?
You can extract by using broom::tidy(). Here's an example using the gapminder dataset:
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
gapminder |>
filter(continent %in% c("Europe", "Asia")) |>
group_by(year) |>
nest() |>
mutate(t_test = map(data, ~ t.test(.x$lifeExp ~ .x$continent, var.eq = F, paired = F)),
res = map(t_test, tidy)) |>
unnest(res) |>
ungroup()
# A tibble: 12 × 13
year data t_test estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
<int> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1952 <tibble [63 × 5]> <htest> -18.1 46.3 64.4 -9.09 1.14e-12 56.8 -22.1 -14.1 Welch Two Sample t-test two.sided
2 1957 <tibble [63 × 5]> <htest> -17.4 49.3 66.7 -8.98 4.73e-12 50.6 -21.3 -13.5 Welch Two Sample t-test two.sided
3 1962 <tibble [63 × 5]> <htest> -17.0 51.6 68.5 -9.02 1.24e-11 44.7 -20.8 -13.2 Welch Two Sample t-test two.sided
4 1967 <tibble [63 × 5]> <htest> -15.1 54.7 69.7 -8.29 2.01e-10 42.5 -18.7 -11.4 Welch Two Sample t-test two.sided
5 1972 <tibble [63 × 5]> <htest> -13.5 57.3 70.8 -7.50 3.96e- 9 39.6 -17.1 -9.83 Welch Two Sample t-test two.sided
6 1977 <tibble [63 × 5]> <htest> -12.3 59.6 71.9 -6.72 5.46e- 8 38.7 -16.0 -8.61 Welch Two Sample t-test two.sided
7 1982 <tibble [63 × 5]> <htest> -10.2 62.6 72.8 -6.38 1.18e- 7 41.7 -13.4 -6.96 Welch Two Sample t-test two.sided
8 1987 <tibble [63 × 5]> <htest> -8.79 64.9 73.6 -5.71 1.04e- 6 42.1 -11.9 -5.68 Welch Two Sample t-test two.sided
9 1992 <tibble [63 × 5]> <htest> -7.90 66.5 74.4 -5.19 5.54e- 6 42.7 -11.0 -4.83 Welch Two Sample t-test two.sided
10 1997 <tibble [63 × 5]> <htest> -7.48 68.0 75.5 -4.93 1.34e- 5 42.0 -10.5 -4.42 Welch Two Sample t-test two.sided
11 2002 <tibble [63 × 5]> <htest> -7.47 69.2 76.7 -4.81 2.13e- 5 40.3 -10.6 -4.33 Welch Two Sample t-test two.sided
12 2007 <tibble [63 × 5]> <htest> -6.92 70.7 77.6 -4.65 3.39e- 5 41.5 -9.93 -3.91 Welch Two Sample t-test two.sided
There is a package, rstatix, that do what you want in one step:
library(rstatix)
data %>% t_test(Gender ~ Country + JobRole)
Related
Likely a trivial task for the pros out there, but have not been able to figure out how to insert the text found in the "Slug" column into each of the three nested tables associated with the slug.
![data] (https://i.stack.imgur.com/YClrE.png)
I am just looking to get the Slug value inserted into the nested tables and repeated for each row so I can combine and keep track of associations properly.
Any tips are most welcome! Thank you
Solution
You can use rowwise() with mutate(across())
df %>%
rowwise() %>%
mutate(across(floor_price_array:holder_hist, ~list(mutate(.x,slug=slug))))
Explanation
If your original data, say df, looks like this:
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
2 b awxeb <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
3 c pbncj <tibble [10 x 3]> <tibble [10 x 3]> <tibble [10 x 3]>
then, the above code will add the value in the slug column as a new constant column in each of the nested tibbles, and resulting in this (notice that each now has four columns):
id slug floor_price_array num_listed_hist holder_hist
<chr> <chr> <list> <list> <list>
1 a hyznu <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
2 b awxeb <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
3 c pbncj <tibble [10 x 4]> <tibble [10 x 4]> <tibble [10 x 4]>
For example, floor_price_array, now contains this:
[[1]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 1.44 2.02 -0.272 hyznu
2 -0.598 -0.723 -0.528 hyznu
3 0.490 -0.576 -1.62 hyznu
4 -0.145 0.349 0.341 hyznu
5 -0.362 0.503 0.584 hyznu
6 -0.798 -0.839 -0.352 hyznu
7 -0.503 -1.27 -1.18 hyznu
8 -0.916 -0.654 0.335 hyznu
9 0.578 0.137 -0.590 hyznu
10 -0.194 -0.674 1.73 hyznu
[[2]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 0.876 0.665 -0.723 awxeb
2 -0.0442 -0.00906 0.0829 awxeb
3 -2.15 1.33 0.0692 awxeb
4 0.264 0.237 -0.497 awxeb
5 0.0381 0.0502 -1.58 awxeb
6 -0.802 0.783 -1.34 awxeb
7 -0.940 1.50 -0.542 awxeb
8 0.209 -1.06 0.853 awxeb
9 0.569 -1.15 -0.347 awxeb
10 -1.57 -0.0774 0.0250 awxeb
[[3]]
# A tibble: 10 x 4
x y z slug
<dbl> <dbl> <dbl> <chr>
1 -0.0289 -1.63 1.29 pbncj
2 -0.716 0.647 0.0230 pbncj
3 -0.0797 -0.0227 2.12 pbncj
4 -0.358 -1.43 -1.81 pbncj
5 -1.35 -0.402 -0.463 pbncj
6 -0.00494 -0.136 1.50 pbncj
7 1.09 0.124 -0.974 pbncj
8 -1.18 1.78 -0.836 pbncj
9 0.896 -1.38 0.199 pbncj
10 0.293 0.420 0.562 pbncj
Input data:
df <- tibble(id = letters[1:3]
) %>%
rowwise() %>%
mutate(slug = paste0(sample(letters,5), collapse="")) %>%
mutate(floor_price_array=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
num_listed_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10))),
holder_hist=list(tibble(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
) %>% ungroup()
I have the following dataset (dput here):
# A tibble: 3,713 x 17
ID Age Group RHR HRV Sleep.Onset Wake.Onset Hours.in.Bed Hours.of.Sleep Sleep.Disturbances Latency.min Cycles REM.Sleep.hours Deep.Sleep.hours
<int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <dbl>
1 5027 Young Increase 58 73 0.180 0.458 6.66 5.33 9 8.98 6 1.4 0.32
2 5027 Young Increase 83 27 0.162 0.542 9.1 6.84 15 3.48 9 1.19 1.54
3 5027 Young Increase 57 85 0.113 0.318 4.92 4.43 5 1.98 4 1.32 0.44
4 5027 Young Increase 60 70 0.0975 0.319 5.32 3.75 3 26.5 4 1.02 0.14
5 5027 Young Increase 63 72 0.105 0.329 5.38 4.74 5 2.48 5 1.32 0.07
6 5027 Young Increase 62 61 0.983 0.472 11.8 9.44 9 4.48 8 2.07 0.84
7 5027 Young Increase 66 68 0.142 0.426 6.83 5.48 15 2.98 6 1.48 0.35
8 5027 Young Increase 81 28 0.0908 0.177 2.06 1.93 2 2.48 1 0.22 0.22
9 5027 Young Increase 69 57 0.158 0.443 6.85 6.58 13 0.48 6 2.43 0
10 5027 Young Increase 63 60 0.0859 0.318 5.58 5.47 4 0.48 5 1.34 0.13
# ... with 3,703 more rows, and 3 more variables: Light.Sleep.hours <dbl>, Awake.hours <dbl>, Session <chr>
I am trying to calculate a t-test across every variable, grouped by Age and Group between Session (pre or post).
df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(p.value = ~ t.test(. ~ Session)$p.value))
I am successful with p values:
# A tibble: 4 x 15
# Groups: Age [2]
Age Group RHR_p.value HRV_p.value Sleep.Onset_p.value Wake.Onset_p.value Hours.in.Bed_p.value Hours.of.Sleep_p~ Sleep.Disturban~ Latency.min_p.v~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease 0.0594 0.865 0.495 0.885 0.316 0.307 0.148 0.00237
2 Old Increase 0.00920 0.634 0.0979 0.0514 0.00774 0.00762 0.247 0.933
3 Young Decrease 0.0975 0.259 0.779 0.760 0.959 0.975 0.256 0.181
4 Young Increase 0.115 0.604 0.846 0.164 0.140 0.242 0.692 0.412
# ... with 5 more variables: Cycles_p.value <dbl>, REM.Sleep.hours_p.value <dbl>, Deep.Sleep.hours_p.value <dbl>, Light.Sleep.hours_p.value <dbl>,
# Awake.hours_p.value <dbl>
However, I am struggling to calculate the other t-statistics (mean, sd, t, df, 95%CI) between these pre-post and also correct p-values groups. I am struggling to do this so any help is appreciated.
I think I may need to convert data long and use something like this?
df %>%
group_by(Age, Group) %>%
t_test(mean ~ ., by = "Session") %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance()
Dndata frames can only have certain object classes as column types. A
htest is not one of those.
However, we can store lists as list-columns.
If we adapt the current code to output lists htests as results, we can later extract elements of the tests separately.
library(dplyr)
output <- df %>%
select(-ID) %>%
group_by(Age, Group) %>%
summarize_at(
vars(-group_cols(), -Session),
list(t.test = ~ list(t.test(. ~ Session))))
output
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
2 Old Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
3 Young Decrease <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
4 Young Increase <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest> <htest>
With this output data.frame, we can extract individual tests and values from them as desired:
output$RHR_t.test
[[1]]
Welch Two Sample t-test
data: . by Session
t = -1.8965, df = 188.22, p-value = 0.05942
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.09118590 0.06082897
sample estimates:
mean in group Post mean in group Pre
62.28902 63.80420
[[2]]
Welch Two Sample t-test
data: . by Session
t = -2.6271, df = 226.21, p-value = 0.009199
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-3.3949577 -0.4848655
sample estimates:
mean in group Post mean in group Pre
57.95946 59.89937
[[3]]
Welch Two Sample t-test
data: . by Session
t = 1.6633, df = 251.75, p-value = 0.0975
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.2074028 2.4611194
sample estimates:
mean in group Post mean in group Pre
60.58255 59.45570
[[4]]
Welch Two Sample t-test
data: . by Session
t = 1.5849, df = 208.4, p-value = 0.1145
alternative hypothesis: true difference in means between group Post and group Pre is not equal to 0
95 percent confidence interval:
-0.244287 2.247775
sample estimates:
mean in group Post mean in group Pre
60.23462 59.23288
output$RHR_t.test %>%
map_dbl('p.value')
[1] 0.059424354 0.009199459 0.097497620 0.114502332
We can also convert these lists to user-friendly tibbles with broom::tidy
output %>%
mutate(across(ends_with('t.test'), map, broom::tidy))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.te… Wake.Onset_t.test Hours.in.Bed_t.t… Hours.of.Sleep_… Sleep.Disturbanc… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <list> <list> <list> <list> <list> <list> <list> <list> <list> <list>
1 Old Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
2 Old Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
3 Young Decrease <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
4 Young Increase <tibble [1 × 10]> <tibble [1 … <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 × 10]> <tibble [1 × 10… <tibble [1 ×… <tibble [1 × 10…
# … with 3 more variables: Deep.Sleep.hours_t.test <list>, Light.Sleep.hours_t.test <list>, Awake.hours_t.test <list>
To have all tests "statistics", we can do it like this:
tidy_output %>%
mutate(across(ends_with('t.test'), sapply, pull, 'statistic'))
# A tibble: 4 × 15
# Groups: Age [2]
Age Group RHR_t.test HRV_t.test Sleep.Onset_t.test Wake.Onset_t.test Hours.in.Bed_t.test Hours.of.Sleep_t.test Sleep.Disturbance… Latency.min_t.t… Cycles_t.test REM.Sleep.hours…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Old Decrease -1.90 0.171 0.684 -0.145 -1.01 -1.02 -1.45 3.05 -0.928 -0.906
2 Old Increase -2.63 0.477 -1.66 -1.96 -2.69 -2.69 -1.16 0.0848 -1.76 -1.87
3 Young Decrease 1.66 1.13 0.281 -0.305 0.0509 -0.0320 1.14 -1.34 -0.675 0.672
4 Young Increase 1.58 0.519 0.195 -1.40 -1.48 -1.17 0.397 -0.821 -1.73 0.886
# … with 3 more variables: Deep.Sleep.hours_t.test <dbl>, Light.Sleep.hours_t.test <dbl>, Awake.hours_t.test <dbl>
I calculated the mean of bootstrap samples for mpg variable from mtcars dataset. My code looks like this (Please, let me know if there's a "better practice" to do it.):
mean_mpg <- function(x) {
rsample::analysis(x) %>%
pull(mpg) %>%
mean()
}
mtcars2 <- rsample::bootstraps(mtcars) %>%
mutate(mean_mpg = purrr::map(splits, mean_mpg)) %>%
tidyr::unnest(mean_mpg) %>%
select(-splits)
However, now I would like to do the same on a grouped dataset. For example:
mtcars %>%
group_by(am)
# now calculate boostrap means of `mpg` for each `am` group
What's the best way to do it?
I think I would nest() to do this, rather than group_by().
Here is a slightly modified version of how to find the mean mpg for each bootstrap resample of the dataset overall.
library(rsample)
library(tidyverse)
bootstraps(mtcars) %>%
mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
mean_mpg = map_dbl(mpg, mean))
#> # Bootstrap sampling
#> # A tibble: 25 x 4
#> splits id mpg mean_mpg
#> * <list> <chr> <list> <dbl>
#> 1 <split [32/10]> Bootstrap01 <dbl [32]> 18.8
#> 2 <split [32/13]> Bootstrap02 <dbl [32]> 20.4
#> 3 <split [32/9]> Bootstrap03 <dbl [32]> 21.1
#> 4 <split [32/12]> Bootstrap04 <dbl [32]> 19.4
#> 5 <split [32/10]> Bootstrap05 <dbl [32]> 19.8
#> 6 <split [32/11]> Bootstrap06 <dbl [32]> 20.1
#> 7 <split [32/13]> Bootstrap07 <dbl [32]> 19.1
#> 8 <split [32/11]> Bootstrap08 <dbl [32]> 18.7
#> 9 <split [32/13]> Bootstrap09 <dbl [32]> 19.3
#> 10 <split [32/13]> Bootstrap10 <dbl [32]> 20.9
#> # … with 15 more rows
And here is how I would go about creating bootstrap resamples for each value of am, and then finding the mean value of mpg for those resamples.
mtcars %>%
nest(-am) %>%
mutate(nested_boot = map(data, bootstraps)) %>%
select(-data) %>%
unnest(nested_boot) %>%
mutate(mpg = map(splits, ~ analysis(.) %>% pull(mpg)),
mean_mpg = map_dbl(mpg, mean))
#> # A tibble: 50 x 5
#> am splits id mpg mean_mpg
#> <dbl> <list> <chr> <list> <dbl>
#> 1 1 <split [13/4]> Bootstrap01 <dbl [13]> 21.9
#> 2 1 <split [13/4]> Bootstrap02 <dbl [13]> 24.0
#> 3 1 <split [13/5]> Bootstrap03 <dbl [13]> 24.8
#> 4 1 <split [13/5]> Bootstrap04 <dbl [13]> 25.9
#> 5 1 <split [13/3]> Bootstrap05 <dbl [13]> 24.0
#> 6 1 <split [13/5]> Bootstrap06 <dbl [13]> 22.1
#> 7 1 <split [13/4]> Bootstrap07 <dbl [13]> 24.3
#> 8 1 <split [13/4]> Bootstrap08 <dbl [13]> 25.0
#> 9 1 <split [13/5]> Bootstrap09 <dbl [13]> 22.7
#> 10 1 <split [13/6]> Bootstrap10 <dbl [13]> 23.3
#> # … with 40 more rows
Created on 2020-05-26 by the reprex package (v0.3.0)
Using mtcars data, I am testing map() to build some lm() models:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x)))
#> # A tibble: 3 x 3
#> cyl data fit
#> <dbl> <list> <list>
#> 1 6 <tibble [7 x 10]> <S3: lm>
#> 2 4 <tibble [11 x 10]> <S3: lm>
#> 3 8 <tibble [14 x 10]> <S3: lm>
The output shows that I have a new column, fit.
Now I wish to see the summary of each lm
When I try:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
map(fit,summary)
#> Error in as_mapper(.f, ...): object 'fit' not found
It gives the error:
Error in as_mapper(.f, ...) : object 'fit' not found
If I wish to calculate R2 or aic then I can using the following code without any problem:
library(tidyverse)
library(modelr)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
mutate(r2 = map_dbl(fit, ~rsquare(., data = mtcars)),
aic = map_dbl(fit, ~AIC(.))) %>%
arrange(aic)
#> # A tibble: 3 x 5
#> cyl data fit r2 aic
#> <dbl> <list> <list> <dbl> <dbl>
#> 1 6 <tibble [7 x 10]> <S3: lm> -8.96 -Inf
#> 2 4 <tibble [11 x 10]> <S3: lm> -26.4 56.4
#> 3 8 <tibble [14 x 10]> <S3: lm> -1.000 67.3
Created on 2019-06-18 by the reprex package (v0.3.0)
What am I missing?
As IceCreamToucan's comment laid out, purrr::map does not look into the data which has been made within your piping.
If you use it with dplyr::mutate then it has access to fit which you have created in the previous piping.
Another option would be explicitly referring to fit which you can see below, as my second suggestion.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
mutate(fit_sum = map(fit,summary))
#> # A tibble: 3 x 4
#> cyl data fit fit_sum
#> <dbl> <list> <list> <list>
#> 1 6 <tibble [7 x 10]> <lm> <smmry.lm>
#> 2 4 <tibble [11 x 10]> <lm> <smmry.lm>
#> 3 8 <tibble [14 x 10]> <lm> <smmry.lm>
mtcars %>%
group_by(cyl) %>%
nest()%>%
mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
{map(.$fit, summary)} #or using pull: `pull(fit) %>% map(summary)`
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ ., data = .x)
#>
#> Residuals:
#> ALL 7 residuals are 0: no residual degrees of freedom!
#>
#> Coefficients: (3 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 32.78649 NA NA NA
#> disp 0.07456 NA NA NA
#> hp -0.04252 NA NA NA
#> drat 1.52367 NA NA NA
#> wt 5.12418 NA NA NA
#> qsec -2.33333 NA NA NA
#> vs -1.75289 NA NA NA
#> am NA NA NA NA
#> gear NA NA NA NA
#> carb NA NA NA NA
#>
#> Residual standard error: NaN on 0 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: NaN
#> F-statistic: NaN on 6 and 0 DF, p-value: NA
####truncated the results for the sake of space####
Created on 2019-06-17 by the reprex package (v0.3.0)
From the latest release of dplyr, tidyverse seems to be encouraging using group_modify functions instead of using purrr + nested dataframes.
In that workflow, here is how you can get both model summaries and estimates in the same dataframe via broom package:
# setup
set.seed(123)
library(tidyverse)
options(tibble.width = Inf)
# joining dataframes with regression estimates and model summaries
dplyr::full_join(
# to get a tidy dataframe of regression estimates
x = mtcars %>%
group_by(cyl) %>%
group_modify(.f = ~ broom::tidy(lm(mpg ~ ., data = .x), conf.int = TRUE)),
# to get a tidy dataframe of model summaries
y = mtcars %>%
group_by(cyl) %>%
group_modify(.f = ~ broom::glance(lm(mpg ~ ., data = .x))),
by = "cyl"
) %>%
dplyr::ungroup(x = .)
#> Warning in qt(a, object$df.residual): NaNs produced
#> # A tibble: 25 x 20
#> cyl term estimate std.error statistic.x p.value.x conf.low
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 60.9 180. 0.338 0.793 -2229.
#> 2 4 disp -0.345 0.469 -0.735 0.596 -6.31
#> 3 4 hp -0.0332 0.364 -0.0915 0.942 -4.65
#> 4 4 drat -4.19 46.4 -0.0903 0.943 -594.
#> 5 4 wt 4.48 29.7 0.151 0.905 -373.
#> 6 4 qsec -0.106 7.82 -0.0136 0.991 -99.4
#> 7 4 vs -3.64 34.0 -0.107 0.932 -435.
#> 8 4 am -6.33 45.2 -0.140 0.912 -581.
#> 9 4 gear 4.07 29.1 0.140 0.912 -366.
#> 10 4 carb 3.22 28.2 0.114 0.928 -355.
#> conf.high r.squared adj.r.squared sigma statistic.y p.value.y df
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2351. 0.928 0.276 3.84 1.42 0.576 9
#> 2 5.62 0.928 0.276 3.84 1.42 0.576 9
#> 3 4.59 0.928 0.276 3.84 1.42 0.576 9
#> 4 586. 0.928 0.276 3.84 1.42 0.576 9
#> 5 382. 0.928 0.276 3.84 1.42 0.576 9
#> 6 99.2 0.928 0.276 3.84 1.42 0.576 9
#> 7 428. 0.928 0.276 3.84 1.42 0.576 9
#> 8 568. 0.928 0.276 3.84 1.42 0.576 9
#> 9 374. 0.928 0.276 3.84 1.42 0.576 9
#> 10 362. 0.928 0.276 3.84 1.42 0.576 9
#> logLik AIC BIC deviance df.residual nobs
#> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 -17.2 56.4 60.8 14.7 1 11
#> 2 -17.2 56.4 60.8 14.7 1 11
#> 3 -17.2 56.4 60.8 14.7 1 11
#> 4 -17.2 56.4 60.8 14.7 1 11
#> 5 -17.2 56.4 60.8 14.7 1 11
#> 6 -17.2 56.4 60.8 14.7 1 11
#> 7 -17.2 56.4 60.8 14.7 1 11
#> 8 -17.2 56.4 60.8 14.7 1 11
#> 9 -17.2 56.4 60.8 14.7 1 11
#> 10 -17.2 56.4 60.8 14.7 1 11
#> # ... with 15 more rows
Created on 2019-06-17 by the reprex package (v0.3.0)
I have found that the following works
iris %>%
select(Sepal.Length) %>%
modelr::bootstrap(100) %>%
mutate(mean = map(strap, mean))
but the below does not
iris %>%
select(Sepal.Length) %>%
modelr::bootstrap(100) %>%
mutate(median = map(strap, median))
The only difference is that the second line of code uses the median.
The error I get is
Error in mutate_impl(.data, dots) : Evaluation error: unimplemented type 'list' in 'greater' .
The code looks like it's working, but if you unnest it, you're actually just getting a lot of NAs because you're trying to take the mean of a resample object, which is a classed list with a reference to the data resampled and the indices for the particular resample. Taking the mean of such a list is not useful, so returning NA with a warning is helpful behavior. To get the code to work, coerce the resample to a data frame, which you can operate upon as usual within map's anonymous function.
For a direct route, extract the data and take the mean, simplifying the list to a numeric vector with map_dbl:
library(tidyverse)
set.seed(47)
iris %>%
select(Sepal.Length) %>%
modelr::bootstrap(100) %>%
mutate(sepal_mean = map_dbl(strap, ~mean(as_data_frame(.x)$Sepal.Length)))
#> # A tibble: 100 x 3
#> strap .id sepal_mean
#> <list> <chr> <dbl>
#> 1 <S3: resample> 001 5.844000
#> 2 <S3: resample> 002 6.016000
#> 3 <S3: resample> 003 5.851333
#> 4 <S3: resample> 004 5.869333
#> 5 <S3: resample> 005 5.840667
#> 6 <S3: resample> 006 5.825333
#> 7 <S3: resample> 007 5.824000
#> 8 <S3: resample> 008 5.790000
#> 9 <S3: resample> 009 5.858000
#> 10 <S3: resample> 010 5.810000
#> # ... with 90 more rows
Translating this approach to median works fine:
iris %>%
select(Sepal.Length) %>%
modelr::bootstrap(100) %>%
mutate(sepal_median = map_dbl(strap, ~median(as_data_frame(.x)$Sepal.Length)))
#> # A tibble: 100 x 3
#> strap .id sepal_median
#> <list> <chr> <dbl>
#> 1 <S3: resample> 001 5.9
#> 2 <S3: resample> 002 5.8
#> 3 <S3: resample> 003 5.8
#> 4 <S3: resample> 004 5.7
#> 5 <S3: resample> 005 5.7
#> 6 <S3: resample> 006 5.8
#> 7 <S3: resample> 007 5.8
#> 8 <S3: resample> 008 5.7
#> 9 <S3: resample> 009 5.8
#> 10 <S3: resample> 010 5.7
#> # ... with 90 more rows
If you'd like both median and mean, you could repeatedly coerce the resample to a data frame, or store it in another column, but neither approach is very efficient. It's better to return a list of data frames with map that can be unnested:
iris %>%
select(Sepal.Length) %>%
modelr::bootstrap(100) %>%
mutate(stats = map(strap, ~summarise_all(as_data_frame(.x), funs(mean, median)))) %>%
unnest(stats)
#> # A tibble: 100 x 4
#> strap .id mean median
#> <list> <chr> <dbl> <dbl>
#> 1 <S3: resample> 001 5.744667 5.60
#> 2 <S3: resample> 002 5.725333 5.70
#> 3 <S3: resample> 003 5.808667 5.70
#> 4 <S3: resample> 004 5.809333 5.70
#> 5 <S3: resample> 005 5.964000 5.85
#> 6 <S3: resample> 006 5.931333 5.95
#> 7 <S3: resample> 007 5.838667 5.80
#> 8 <S3: resample> 008 5.926000 5.95
#> 9 <S3: resample> 009 5.855333 5.75
#> 10 <S3: resample> 010 5.888667 5.70
#> # ... with 90 more rows
Updated syntax is:
iris %>%
select(Sepal.Length) %>%
modelr::bootstrap(100) %>%
mutate(stats = map(strap, ~summarise_all(as_tibble(.x), list(mean = mean, median = median)))) %>%
unnest(stats)
because as_data_frame and funs are deprecated