I have a grouped dataset and I am interested in summarising a column of counts (number of ___). To calculate the standard error for the summary, I want to bootstrap within groups and calculate the standard deviation of medians. I am struggling to figure out how to code this manually (i.e., resampling with replacement myself rather than using functions like boot()) without using for loops; I am hoping for a purely tidyverse solution. If there is a way other than using *apply(), that would be preferred. Wrapping the whole process into a function would be great, either to be used in a pipeline with, say, summarise(), or as a standalone function that can be applied to the grouped data.
An ad hoc dataset can be mtcars which I have grouped by gear. I am now interested in summarising the hp column using median and also obtaining confidence intervals for the same. I have already attempted a bunch of solutions suggested by slightly related threads on SO, like replicate()+across(), map()/pmap(), etc. but couldn't get them to work for my specific case.
library(tidyverse)
data <- mtcars %>%
select(gear, hp) %>%
group_by(gear)
> data
# A tibble: 32 x 2
# Groups: gear [3]
gear hp
<dbl> <dbl>
1 4 110
2 4 110
3 4 93
4 3 110
5 3 175
6 3 105
7 3 245
8 4 62
9 4 95
10 4 123
# ... with 22 more rows
I am hoping for a way to integrate the bootstrap results with the simple summarisation as another column (SEs per group):
data2 <- data %>%
summarise(hp = median(hp))
While it may not make much sense to summarise horsepower by number of gears, and the distribution of hp might not be a typical Poisson, I think the coding solution for this example will apply to my specific case nonetheless.
EDIT 1
The solution need not be a clean and robust function. It can be just the lines of code required to obtain the bootstrapped SE value in each group for this specific case. The desired output is just the data2 object, where hp is the column of medians and hpse is the column of SEs.
data2 <- data %>%
summarise(hp = median(hp),
### hpse = workingcode()
)
If not possible to do it directly this way inside the summarise() call, it must at least be possible to later join the values to data2.
Related threads
Using boot()
How to perform a bootstrap and find 95% confidence interval for the median of a dataset
Stratified Bootstrapping in R with >25 strata
Bootsrapping a statistic in a nested data column and retrieve results in tidy format
Bootstrapping a vector of results, by group in R
Using *apply()
Bootstrap a large data set
Using for loop
How to perform a bootstrap and find 95% confidence interval for the median of a dataset
Others
Creating bootstrap samples and storing sampled data in different names
First we can make a bootstrap function:
boot_fn = function(x, fn = median, B = 1000) {
1:B %>%
# For each iteration, generate a sample of x with replacement
map(~ x[sample(1:length(x), replace = TRUE)]) %>%
# Obtain the fn estimate for each bootstrap sample
map_dbl(fn) %>%
# Obtain the standard error
sd()
}
Note how I gave the parameter fn a default value of median, which gives you the opportunity to pass any numeric function you wish into boot_fn().
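As a small illustration of that flexibility, here is a sketch that passes a different statistic to boot_fn(). The function is repeated so the snippet runs on its own; the seed and the choice of the mean and the 75th percentile are arbitrary assumptions for the example.

```r
library(purrr)  # provides map(), map_dbl() and re-exports %>%

# Same boot_fn() as above, repeated so the snippet is self-contained
boot_fn <- function(x, fn = median, B = 1000) {
  1:B %>%
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    map_dbl(fn) %>%
    sd()
}

set.seed(42)  # arbitrary seed, only so the run is reproducible

# Bootstrap SE of the mean instead of the median
se_mean <- boot_fn(mtcars$hp, fn = mean)

# Any function returning a single number works, e.g. the 75th percentile
se_q75 <- boot_fn(mtcars$hp, fn = function(v) quantile(v, 0.75, names = FALSE))
```

The only requirement is that fn returns a single numeric value, because map_dbl() expects exactly one number per bootstrap sample.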
Now we can use the function as you originally asked:
mtcars %>%
group_by(gear) %>%
summarise(
hp_median = median(hp),
se = boot_fn(hp, fn = median)
)
# A tibble: 3 x 3
gear hp_median se
<dbl> <dbl> <dbl>
1 3 180 13.2
2 4 94 15.2
3 5 175 70.3
This works because our data is grouped. For each group, summarise() sends a new value of x to boot_fn(). In this case, three different values of x were passed, each holding the hp values for one value of gear.
This is easy to confirm if we just add a cat() statement in our function:
boot_fn = function(x, fn = median, B = 1000, verbose = FALSE) {
if (verbose) cat("Hello, x is ", x, "\n")
1:B %>%
# For each iteration, generate a sample of x with replacement
map(~ x[sample(1:length(x), replace = TRUE)]) %>%
# Obtain the fn estimate for each bootstrap sample
map_dbl(fn) %>%
# Obtain the standard error
sd()
}
data %>%
summarise(
hp_median = median(hp),
se = boot_fn(hp, fn = median, verbose = TRUE)
)
Output:
Hello, x is 110 175 105 245 180 180 180 205 215 230 97 150 150 245 175
Hello, x is 110 110 93 62 95 123 123 66 52 65 66 109
Hello, x is 91 113 264 175 335
# A tibble: 3 x 3
gear hp_median se
<dbl> <dbl> <dbl>
1 3 180 13.5
2 4 94 14.9
3 5 175 69.6
This function may break down when used on real-world data (due to things like NAs), but this is a good start.
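One way to guard against missing values is sketched below. The name boot_fn_na and its na.rm argument are hypothetical and not part of the answer above; the three NAs are injected only to exercise the function.

```r
library(dplyr)

# Hypothetical NA-tolerant variant of boot_fn(); the name and the na.rm
# argument are illustrative assumptions
boot_fn_na <- function(x, fn = median, B = 1000, na.rm = TRUE) {
  # Drop missing values before resampling
  if (na.rm) x <- x[!is.na(x)]
  # With fewer than two observations a bootstrap SE is not meaningful
  if (length(x) < 2) return(NA_real_)
  # Indexing with sample.int() avoids sample()'s special case for a
  # length-one numeric x
  stats <- vapply(
    seq_len(B),
    function(i) fn(x[sample.int(length(x), replace = TRUE)]),
    numeric(1)
  )
  sd(stats)
}

set.seed(1)  # arbitrary seed, only so the run is reproducible
res <- mtcars %>%
  mutate(hp = replace(hp, 1:3, NA)) %>%  # inject a few NAs for the demo
  group_by(gear) %>%
  summarise(
    hp_median = median(hp, na.rm = TRUE),
    hpse = boot_fn_na(hp)
  )
```

Whether dropping NAs is appropriate depends on why they are missing, so treat this as a starting point rather than a general fix.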
An alternative to @kybazzi's solution that fits in the pipeline workflow is this:
boot_se <- function(x, fn = median, B = 100) {
  # length(x) instead of n(), so the function also works outside dplyr verbs
  replicate(B,
            fn(sample(x, length(x), replace = TRUE)),
            simplify = FALSE) %>%
    unlist() %>%
    sd()
}
It seems to be slower at times:
boot_fn = function(x, fn = median, B = 100) {
1:B %>%
# For each iteration, generate a sample of x with replacement
map(~ x[sample(1:length(x), replace = TRUE)]) %>%
# Obtain the fn estimate for each bootstrap sample
map_dbl(fn) %>%
# Obtain the standard error
sd()
}
data <- mtcars %>%
select(gear, hp) %>%
group_by(gear)
data2 <- data %>%
summarise(hpmed = median(hp),
hpse = boot_se(hp))
data3 <- data %>%
summarise(hpmed = median(hp),
hpse = boot_fn(hp))
#######################################
library(microbenchmark)
microbenchmark((data %>%
summarise(hpmed = median(hp),
hpse = boot_fn(hp))),
(data %>%
summarise(hpmed = median(hp),
hpse = boot_se(hp))))
# Output:
Unit: milliseconds
expr min lq
(data %>% summarise(hpmed = median(hp), hpse = boot_fn(hp))) 14.5737 15.63690
(data %>% summarise(hpmed = median(hp), hpse = boot_se(hp))) 20.6675 21.64715
mean median uq max neval
22.23120 16.78140 25.85675 91.4154 100
29.15338 22.68525 32.01430 87.6299 100
#######################################
# Note: this only times looking up the already-computed objects, not the bootstrap itself
microbenchmark(data2, data3, times = 1000)
# Output:
Unit: nanoseconds
expr min lq mean median uq max neval
data2 0 100.0 95.986 101 101 3501 1000
data3 0 1.5 92.318 101 101 2700 1000
Related
I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!
df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))
Here's a rough solution (using set.seed(47) if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum of that proportion, and then cut that into the desired number of buckets.
library(dplyr)
n_groups = 10
df %>% arrange(X1) %>%
mutate(
prop = X1 / sum(X1),
cprop = cumsum(prop),
bins = cut(cprop, breaks = n_groups - 1)
) %>%
group_by(bins) %>%
summarize(
group_n = n(),
group_sum = sum(X1)
)
# # A tibble: 9 × 3
# bins group_n group_sum
# <fct> <int> <int>
# 1 (-0.001,0.111] 322 54959
# 2 (0.111,0.222] 141 54867
# 3 (0.222,0.333] 111 55186
# 4 (0.333,0.444] 92 55074
# 5 (0.444,0.556] 80 54976
# 6 (0.556,0.667] 71 54574
# 7 (0.667,0.778] 66 55531
# 8 (0.778,0.889] 60 54731
# 9 (0.889,1] 57 55397
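This could of course be simplified: you don't need to keep the extra columns around. Just mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1)) will add the bins column to the original data (and no other columns); the group_by() %>% summarize() is only there to diagnose the result. Note that cut() with a single number for breaks creates that many intervals, so breaks = n_groups - 1 gives the 9 bins shown above; use breaks = n_groups if you want exactly n_groups bins.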
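The simplified version can be sketched as a self-contained snippet. Two assumptions differ from the code above: the data frame is built with only the X1 column, and breaks = n_groups is used so that the number of bins equals n_groups.

```r
library(dplyr)

set.seed(47)
# Only X1 is needed for the binning, so the other columns are omitted here
df <- data.frame(X1 = sample(0:1000, 1000, replace = TRUE))
n_groups <- 10

# Sort by X1, then cut the cumulative share of the total into equal-width bins
binned <- df %>%
  arrange(X1) %>%
  mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups))

# Diagnostic only: rows and sum of X1 per bin
check <- binned %>%
  group_by(bins) %>%
  summarise(group_n = n(), group_sum = sum(X1))
```

Because rows are assigned whole, the bin sums are only approximately equal; each can differ from the ideal share by roughly the size of one X1 value.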
Hi all,
I am having trouble referencing simple functions inside sdf_pivot. Can anyone help? Thanks!
This is the code that works, but not exactly what I need:
library(sparklyr)
library(dplyr)
spark_disconnect_all()
sc <- spark_connect(master = "yarn-client")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "mean"))
I want to calculate the mean with NAs removed, and the median as well, ideally also with NAs removed. I cannot get the following code to work though:
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "mean(na.rm=TRUE)"))
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "percentile(0.5)"))
This is the result I need:
mpg cyl `3.0` `4.0` `5.0`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Low 8 194. NaN 300.
2 High 4 NaN 61 113
3 Low 4 97 85 91
4 Low 6 108. 116. 175
My data has 800 million rows, and I'm using a single example here only for easy replication. In reality, I cannot just collect it into a data frame in R; all the calculation has to happen on Spark. A lot of things stop working on Spark, and the median function is one of them. I can get the percentile function to work, but not median, and I can't figure out how to supply the extra parameters in this specific setting.
Is it possible to pass a list of functions to dplyr::summarize in a way that allows the list of functions to vary? I'd like to create an overall function that builds a summary table but allows for different groups of functions in the output - [edit: when the functions are not all being applied to the same column].
I was thinking this could be done by creating an overall function with T/F arguments that control which groups of summary functions are included (funA and funB are lists of functions, and the user could include all functions from funA, funB, or both), but I am not sure how to write the initial function lists (funA, funB) when the functions are not all applied to the same column. Below is an idea of how it would be structured. Is this possible, or is there a better way to do this?
#Essentially - how would I write a function to selectively include a group of functions (for example either funA = c(n, min, max) or funB=c(n_na, n_neg), or both).
extract_all <- function(x){
x %>% summarize(n=n(),
min = min(disp, na.rm=TRUE),
max = max(disp, na.rm=TRUE),
n_na = sum(is.na(wt)),
n_neg = sum(vs < 0, na.rm=TRUE))
}
test <- mtcars %>% group_by(cyl) %>% extract_all()
#Does this structure work?
extract_summaries <- function(x, funA=TRUE, funB=FALSE){
funAls <- list() #but how do you write n, min, max in here?
funBls <- list() #and n_na, n_neg in here
funls <- append(funAls[funA], funBls[funB])
summarize(x, funls)
}
#which could be run with:
test <- mtcars %>% group_by(cyl) %>% extract_summaries(funA=TRUE, funB=TRUE)
Here is one option
extract_summaries <- function(x, colnm, funA=TRUE, funB=FALSE){
funAls <- list(n = length, min= min, max = max)
funBls <- list(n_na = function(y) sum(is.na(y)),
n_neg = function(y) sum(y < 0, na.rm=TRUE))
funls <- append(funAls[funA], funBls[funB])
x %>%
summarise_at(vars({{colnm}}), funls)
}
test <- mtcars %>%
group_by(cyl) %>%
extract_summaries(mpg, funA=TRUE, funB=TRUE)
test
# A tibble: 3 x 6
# cyl n min max n_na n_neg
# <dbl> <int> <dbl> <dbl> <int> <int>
#1 4 11 21.4 33.9 0 0
#2 6 7 17.8 21.4 0 0
#3 8 14 10.4 19.2 0 0
test <- mtcars %>%
group_by(cyl) %>%
extract_summaries(mpg, funA = FALSE, funB = TRUE)
test
# A tibble: 3 x 3
# cyl n_na n_neg
# <dbl> <int> <int>
#1 4 0 0
#2 6 0 0
#3 8 0 0
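summarise_at() has been superseded by across() since dplyr 1.0.0. The same idea can be sketched with across() as follows; the name extract_summaries2 is hypothetical and is used only to avoid overwriting the function above.

```r
library(dplyr)

# Hypothetical across()-based variant of extract_summaries()
extract_summaries2 <- function(x, colnm, funA = TRUE, funB = FALSE) {
  funAls <- list(n = length, min = min, max = max)
  funBls <- list(n_na = function(y) sum(is.na(y)),
                 n_neg = function(y) sum(y < 0, na.rm = TRUE))
  # Logical indexing keeps a whole group of functions (TRUE) or none (FALSE)
  funls <- c(funAls[funA], funBls[funB])
  # .names = "{.fn}" names each output column after the function only
  x %>%
    summarise(across({{ colnm }}, funls, .names = "{.fn}"))
}

test2 <- mtcars %>%
  group_by(cyl) %>%
  extract_summaries2(mpg, funA = TRUE, funB = TRUE)
```

As in the original, all selected functions are applied to the single column passed as colnm.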
I can summarize the mean by groups using
t(mtcars %>%
group_by(gear) %>%
dplyr::summarize(Mean_Mpg = mean(mpg, na.rm=TRUE),
StdD_Mpg = sd(mpg, na.rm=TRUE)
))
gear 3 4 5
Mean_Mpg 16.106667 24.533333 21.380000
StdD_Mpg 3.371618 5.276764 6.658979
I know summary(aov(gear ~ mpg , mtcars)) will output the results from the ANOVA test, including the F statistic.
Df Sum Sq Mean Sq F value Pr(>F)
mpg 1 3.893 3.893 8.995 0.0054 **
Residuals 30 12.982 0.433
Also chisq.test(table(mtcars$gear,mtcars$carb)) will output the results from Chi.Square test.
Pearson's Chi-squared test
X-squared = 16.518, df = 10, p-value = 0.08573
What I am trying to do is produce an output like this below, where I am combining the mean, standard deviation and F Statistic value from ANOVA, X-Squared test statistic.
gear 3 4 5 Test-Statistic Test
Mpg (Mean) 16.106667 24.533333 21.380000 8.995 ANOVA
(StdD) 3.371618 5.276764 6.658979
Carb(N) 16.518 Chi.Square
3 4 0
4 4 2
3 0 0
5 4 1
0 0 1
0 0 1
I am not sure how to put together a table like this by combining the mean, standard deviation, F statistic, chi-square statistic, etc. I would welcome any help from the community on formatting the results like this.
One option is to think about all the results you want and how to manipulate them so that they share the same structure. Then use bind_rows(), for instance, to gather all the results into a single table.
The functions group_by() and summarise() can calculate the mean (and other statistics) for several variables, returning a data.frame, whereas apply() lets you apply the same function, or a combination of functions (like summary(aov(...))), to several variables, returning a vector.
library(tidyverse)
# mean (± standard error, i.e. sd / sqrt(n)) of x per group
mtcars %>%
group_by(gear) %>%
summarise_at(
vars(mpg, carb),
~ paste0(round(mean(.), 2), '(±', round(sd(.) / sqrt(n()), 1), ')')
) %>%
mutate(gear = as.character(gear)) %>%
# add ANOVA: gear ~ x
bind_rows(
c(gear = 'ANOVA',
apply(mtcars %>% select(mpg, carb), 2,
function(x) summary(aov(mtcars$gear ~ x))[[1]]$`F value`[1] %>% round(3) %>% as.character()
))
) %>%
# add Chi-Square: gear ~ x
bind_rows(
c(gear = 'CHI-SQUARE',
apply(mtcars %>% select(mpg, carb), 2,
function(x) chisq.test(table(mtcars$gear, x))$statistic %>% round(3) %>% as.character()
))
)
# # A tibble: 5 x 3
# gear mpg carb
# <chr> <chr> <chr>
# 1 3 16.11(±0.9) 2.67(±0.3)
# 2 4 24.53(±1.5) 2.33(±0.4)
# 3 5 21.38(±3) 4.4(±1.2)
# 4 ANOVA 8.995 2.436
# 5 CHI-SQUARE 54.667 16.518
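To make the extraction step easier to follow, here is a base R sketch that pulls the two test statistics out on their own for the pair of variables used in the question. The object names (fit, chi, stats) are illustrative, and suppressWarnings() is an assumption added because chisq.test() warns about small expected counts on this table.

```r
# F statistic from the ANOVA, using the same formula as in the question
fit <- summary(aov(gear ~ mpg, data = mtcars))
f_stat <- fit[[1]][["F value"]][1]  # first row is the mpg term

# Chi-square statistic for gear vs carb
chi <- suppressWarnings(chisq.test(table(mtcars$gear, mtcars$carb)))

# Collect both in one small table that can be bound to other results
stats <- data.frame(
  test = c("ANOVA", "Chi.Square"),
  statistic = round(c(f_stat, unname(chi$statistic)), 3)
)
```

Once each result has this shape, it can be converted to character and bound to the group summaries in the same way as above.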
I've noticed a lot of examples here which use dplyr::mutate in combination with a function returning multiple outputs to create multiple columns. For example:
tmp <- mtcars %>%
group_by(cyl) %>%
summarise(min = summary(mpg)[1],
median = summary(mpg)[3],
mean = summary(mpg)[4],
max = summary(mpg)[6])
Such syntax however means that the summary function is called 4 times, in this example, which does not seem particularly efficient. What ways are there to efficiently assign a list output to a list of column names in summarise or mutate?
For example, from a previous question: Split a data frame column containing a list into multiple columns using dplyr (or otherwise), I know that you can assign the output of summary as a list and then split it using do(data.frame(...)), however this means that you have to then add the column names later and the syntax is not as pretty.
The tie package from Romain Francois can do this very neatly
devtools::install_github("romainfrancois/tie")
library('tidyverse')
library('tie')
tmp <- mtcars %>%
group_by(cyl) %>%
bow( tie(min, median, mean, max) := summary(mpg)[c(1,3,4,6)] )
Note the use of := rather than =.
This issue of using functions that return vectors (not scalars) inside summarise is considered by the tidyverse team here https://github.com/tidyverse/dplyr/issues/154 and in further posts referenced within.
This addresses your example, but perhaps not your principal question. In the case you showed, you could rewrite this as:
tmp <- mtcars %>%
group_by(cyl) %>%
summarise_each(funs(min, median, mean, max), mpg)
This is more efficient, taking about 40% as much time to run:
microbenchmark(mtcars %>%
group_by(cyl) %>%
summarise_each(funs(min, median, mean, max), mpg),
times = 1000L)
mtcars %>% group_by(cyl) %>% summarise_each(funs(min, median,mean, max), mpg)
min lq mean median uq max neval
2.002762 2.159464 2.330703 2.216719 2.271264 7.771477 1000
microbenchmark(mtcars %>%
group_by(cyl) %>%
summarise(min = summary(mpg)[1],
median = summary(mpg)[3],
mean = summary(mpg)[4],
max = summary(mpg)[6]), times = 1000L)
mtcars %>% group_by(cyl) %>% summarise(min = summary(mpg)[1], median = summary(mpg)[3], mean = summary(mpg)[4], max = summary(mpg)[6])
min lq mean median uq max neval
4.967731 5.21122 5.571605 5.360689 5.530197 13.26596 1000
However, there are certainly other cases where this will not address the problem.
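Note that summarise_each() and funs() are deprecated in current dplyr. A sketch of the equivalent call with across(), assuming dplyr 1.0.0 or later, would be:

```r
library(dplyr)

# Each named function becomes one output column; .names = "{.fn}" keeps
# the column names as min, median, mean and max
tmp <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg,
                   list(min = min, median = median, mean = mean, max = max),
                   .names = "{.fn}"))
```

The timings above were measured with summarise_each(), so they do not necessarily carry over to this form.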
EDIT:
The do() function can solve this. e.g.
by_cyl <- group_by(mtcars, cyl) %>%
do(mod = summary(.)[c(1,4,6),])
I cannot find a suitable solution in dplyr that lets you assign names in an easy-to-remember way. I find the following data.table solution acceptable, if a bit wordy:
library(data.table)
library(magrittr)  # for %>%
data.table(mtcars) %>%
.[, setattr(as.list(summary(mpg)[c(1,3,4,6)]),
"names", c("min", "median", "mean", "max")),
by = cyl]
This is derived from akrun's answer, in which:
data.table(mtcars) %>%
.[, as.list(summary(mpg)[c(1,3,4,6)]), by = cyl]
automatically assigns the output of the function into 4 columns. Thus the only thing left is to rename the columns appropriately using the setattr function.
Note that the output of summary is not a list, therefore has to be coerced to a list for this to work.
I managed to do it this way. It works reasonably fast with a 45 million row dataset I own.
tmp <- mtcars %>%
group_by(cyl) %>%
do(data.frame(t(as.matrix(summary(.$mpg)[c(1, 3, 4, 6)]))))
Source: local data frame [3 x 5]
Groups: cyl [3]
cyl Min. Median Mean Max.
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 21.4 26.0 26.66 33.9
2 6 17.8 19.7 19.74 21.4
3 8 10.4 15.2 15.10 19.2
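In dplyr 1.0.0 and later, an unnamed expression inside summarise() that returns a data frame is unpacked into several columns, which gives another way to call summary() only once per group. A minimal sketch, where summary_cols is a hypothetical helper name:

```r
library(dplyr)

# Hypothetical helper: call summary() once and return a one-row tibble
summary_cols <- function(x) {
  s <- unclass(summary(x))  # plain named numeric vector
  tibble(
    min = s[["Min."]],
    median = s[["Median"]],
    mean = s[["Mean"]],
    max = s[["Max."]]
  )
}

# The unnamed data-frame result is spread into min, median, mean and max
tmp <- mtcars %>%
  group_by(cyl) %>%
  summarise(summary_cols(mpg))
```

The column names are chosen inside the helper, so no renaming step is needed afterwards.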
This can also be accomplished using tidyr::nest and purrr::map. Note that the output returned by summary needs to be converted from a named vector to a data.frame or tibble; I'm using dplyr::bind_rows below to accomplish this, but data.frame(as.list(summary(.$mpg))) could equally be used instead.
suppressWarnings(library(tidyverse))
mtcars %>%
group_by(cyl) %>%
nest() %>%
summarise(stats = map(data, ~ bind_rows(summary(.$mpg)))) %>%
unnest(stats)
#> # A tibble: 3 x 7
#> cyl Min. `1st Qu.` Median Mean `3rd Qu.` Max.
#> <dbl> <table> <table> <table> <table> <table> <table>
#> 1 4 21.4 22.80 26.0 26.66364 30.40 33.9
#> 2 6 17.8 18.65 19.7 19.74286 21.00 21.4
#> 3 8 10.4 14.40 15.2 15.10000 16.25 19.2
Created on 2021-04-19 by the reprex package (v0.3.0)