function referencing inside sdf_pivot

all:
I am having trouble referencing simple functions inside sdf_pivot. Can anyone help? Thanks!
This is the code that works, but not exactly what I need:
library(sparklyr)
library(dplyr)
spark_disconnect_all()
sc <- spark_connect(master = "yarn-client")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "mean"))
I want to calculate the mean with NAs removed, and also the median, ideally with NAs removed as well. I cannot get the following code to work, though:
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "mean(na.rm=TRUE)"))
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "percentile(0.5)"))
This is the result I need:
mpg cyl `3.0` `4.0` `5.0`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Low 8 194. NaN 300.
2 High 4 NaN 61 113
3 Low 4 97 85 91
4 Low 6 108. 116. 175
My data is 800 million rows, and I'm using a simple example here only for easy replication. In reality, I cannot just collect it into a data frame in R; all of the calculation has to happen on Spark. A lot of things stop working on Spark, and the median function is one of them: I can get the percentile function to work, but not median, and I can't figure out how to supply the extra parameters in this specific setting.
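One possible route, sketched here under assumptions and not tested on a cluster: sdf_pivot() also accepts an R function as fun.aggregate, which is invoked on the grouped Spark dataset, so a Spark SQL expression string can be handed to Spark's own agg() method. The helper name spark_sql_agg is hypothetical. percentile_approx() is Spark SQL's approximate percentile function, and Spark's avg() already skips NULL values, so no na.rm-style argument should be needed.

```r
# Hypothetical helper: builds an aggregation function for sdf_pivot() from a
# Spark SQL expression string. Nothing here touches Spark until sdf_pivot()
# calls the returned function with the grouped dataset.
spark_sql_agg <- function(sc, sql_expr) {
  force(sc)
  force(sql_expr)
  function(gdf) {
    # Turn the SQL string into a Spark Column expression
    expr <- sparklyr::invoke_static(
      sc, "org.apache.spark.sql.functions", "expr", sql_expr
    )
    # Call agg() on the grouped (pivoted) dataset
    sparklyr::invoke(gdf, "agg", expr, list())
  }
}

# Intended usage (requires a live connection `sc` and `mtcars_tbl`):
# mtcars_tbl %>%
#   mutate(mpg = ifelse(mpg > 30, "High", "Low")) %>%
#   sdf_pivot(mpg + cyl ~ gear,
#             fun.aggregate = spark_sql_agg(sc, "percentile_approx(hp, 0.5)"))
```

Swapping the expression string for "avg(hp)" would give the mean with the same call shape.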

Related

Why does R ignore relevel when using group_nest()?

As a continuation from this question, I'm trying to efficiently perform many logistic regression in order to generate a column saying if a group differs significantly from my reference group.
When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:
Adding a relevel argument (shown below)
Adding a relevel argument within the custom function itself (also shown below)
Renaming the desired reference group to start with 'AAA' to trick R into making it the first option
Here's a sample dataset:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
app_deadline = ymd(c(rep("2021-04-04", 3), rep("2020-03-23", 3), rep("2019-05-23", 1))),
time = ymd(c(rep("2021-01-01", 3), rep("2020-01-01", 3), rep("2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
Here's the code that won't let me change the reference level:
#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
full_join(test2)
#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group
Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english' but could also be switched to work for 'computer science' too.

It does respect the relevel() call. The problem, such as it is, is that the returned results don't line up with the major column. See what happens if you stop at the unnest() step:
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
tmp <- test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy))
Now look at major and term:
tmp %>% select(major, term)
# # A tibble: 6 × 2
# major term
# <fct> <chr>
# 1 undeclared "(Intercept)"
# 2 computer science "relevel(major, ref = \"english\")computer science"
# 3 english "relevel(major, ref = \"english\")undeclared"
# 4 undeclared "(Intercept)"
# 5 computer science "relevel(major, ref = \"english\")computer science"
# 6 english "relevel(major, ref = \"english\")undeclared"
You can see that the rows where major is "english" are actually for the "undeclared" parameter estimate. Taking the above result, I think you can capture what you want with the following:
tmp %>%
filter(term != "(Intercept)") %>%
mutate(major = gsub(".*\\)(.*)", "\\1", term)) %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(year, app_deadline, major, time, significant) %>%
full_join(test2)
# Joining, by = c("app_deadline", "major", "time")
# # A tibble: 7 × 9
# year app_deadline major time significant admit reject total accept_rate
# <dbl> <date> <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 2020-03-23 computer science 2020-01-01 Yes 300 100 400 0.75
# 2 2020 2020-03-23 undeclared 2020-01-01 Yes 800 210 1010 0.792
# 3 2021 2021-04-04 computer science 2021-01-01 Yes 1000 300 1300 0.769
# 4 2021 2021-04-04 undeclared 2021-01-01 No 500 1000 1500 0.333
# 5 NA 2021-04-04 english 2021-01-01 NA 450 1000 1450 0.310
# 6 NA 2020-03-23 english 2020-01-01 NA 100 900 1000 0.1
# 7 NA 2019-05-23 undeclared 2019-01-01 NA 1000 1500 2500 0.4
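To make the two moving parts above easier to see in isolation, here is a small base-R sketch. The helper name level_from_term is hypothetical; it wraps the same gsub() pattern used in the answer, and the second half just shows what relevel() does to the level order for different reference levels.

```r
# Hypothetical helper: strip everything up to the last ")" from a model term,
# leaving only the factor level that the coefficient refers to.
level_from_term <- function(term) gsub(".*\\)(.*)", "\\1", term)

terms <- c(
  "(Intercept)",
  "relevel(major, ref = \"english\")computer science",
  "relevel(major, ref = \"english\")undeclared"
)
level_from_term(terms)
# "" "computer science" "undeclared"

# relevel() moves the chosen level to the front; that level becomes the
# reference, so it never gets its own coefficient row.
major <- factor(c("undeclared", "computer science", "english"))
levels(relevel(major, ref = "english"))
# "english" "computer science" "undeclared"
levels(relevel(major, ref = "computer science"))
# "computer science" "english" "undeclared"
```

Because the reference level has no coefficient, its rows come back with NA in the significant column after the join, which is what the output above shows for english.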

Manual bootstrapping for confidence intervals using tidyverse only

I have a grouped dataset and I am interested in summarising a column of counts (number of ___). To calculate the standard error for the summary, I want to bootstrap within groups and calculate the standard deviation of medians. I am struggling to figure out how to manually code this (resampling with replacement, and not functions like boot()), without using for loops (i.e., I am hoping for a purely tidyverse solution). If there is a way other than using *apply(), that would be preferred. Wrapping the whole process into a function would be great---either to be used in pipeline with, say, summarise(), or as a standalone function that can be applied to the grouped data.
An ad hoc dataset can be mtcars which I have grouped by gear. I am now interested in summarising the hp column using median and also obtaining confidence intervals for the same. I have already attempted a bunch of solutions suggested by slightly related threads on SO, like replicate()+across(), map()/pmap(), etc. but couldn't get them to work for my specific case.
library(tidyverse)
data <- mtcars %>%
select(gear, hp) %>%
group_by(gear)
> data
# A tibble: 32 x 2
# Groups: gear [3]
gear hp
<dbl> <dbl>
1 4 110
2 4 110
3 4 93
4 3 110
5 3 175
6 3 105
7 3 245
8 4 62
9 4 95
10 4 123
# ... with 22 more rows
I am hoping for a way to integrate the bootstrap results with the simple summarisation as another column (SEs per group):
data2 <- data %>%
summarise(hp = median(hp))
While it may not make much sense to summarise horsepower by number of gears, and the distribution of hp might not be a typical Poisson, I think the coding solution for this example will apply to my specific case nonetheless.
EDIT 1
The solution need not be a clean and robust function. It can be just the lines of code required to obtain the bootstrapped SE value in each group for this specific case. The desired output is just the data2 object, where hp is the column of medians and hpse is the column of SEs.
data2 <- data %>%
summarise(hp = median(hp),
### hpse = workingcode()
)
If not possible to do it directly this way inside the summarise() call, it must at least be possible to later join the values to data2.
Related threads
Using boot()
How to perform a bootstrap and find 95% confidence interval for the median of a dataset
Stratified Bootstrapping in R with >25 strata
Bootsrapping a statistic in a nested data column and retrieve results in tidy format
Bootstrapping a vector of results, by group in R
Using *apply()
Bootstrap a large data set
Using for loop
How to perform a bootstrap and find 95% confidence interval for the median of a dataset
Others
Creating bootstrap samples and storing sampled data in different names

First we can make a bootstrap function:
boot_fn = function(x, fn = median, B = 1000) {
1:B %>%
# For each iteration, generate a sample of x with replacement
map(~ x[sample(1:length(x), replace = TRUE)]) %>%
# Obtain the fn estimate for each bootstrap sample
map_dbl(fn) %>%
# Obtain the standard error
sd()
}
Note how I gave the parameter fn a default value of median, which gives you the opportunity to pass any numeric function you wish into boot_fn().
Now we can use the function as you originally asked:
mtcars %>%
group_by(gear) %>%
summarise(
hp_median = median(hp),
se = boot_fn(hp, fn = median)
)
# A tibble: 3 x 3
gear hp_median se
<dbl> <dbl> <dbl>
1 3 180 13.2
2 4 94 15.2
3 5 175 70.3
The reason this works is because our data is grouped. For each group, a new value of x is sent to boot_fn(). In this case, three different values of x were passed, each being the hp values corresponding to each different value of gear.
This is easy to confirm if we just add a cat() statement in our function:
boot_fn = function(x, fn = median, B = 1000, verbose = FALSE) {
if (verbose) cat("Hello, x is ", x, "\n")
1:B %>%
# For each iteration, generate a sample of x with replacement
map(~ x[sample(1:length(x), replace = TRUE)]) %>%
# Obtain the fn estimate for each bootstrap sample
map_dbl(fn) %>%
# Obtain the standard error
sd()
}
data %>%
summarise(
hp_median = median(hp),
se = boot_fn(hp, fn = median, verbose = TRUE)
)
Output:
Hello, x is 110 175 105 245 180 180 180 205 215 230 97 150 150 245 175
Hello, x is 110 110 93 62 95 123 123 66 52 65 66 109
Hello, x is 91 113 264 175 335
# A tibble: 3 x 3
gear hp_median se
<dbl> <dbl> <dbl>
1 3 180 13.5
2 4 94 14.9
3 5 175 69.6
This function may break down when used on real-world data (due to things like NAs), but this is a good start.
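As a sketch of how the NA issue could be handled, here is a base-R variant; the name boot_se_na and the choice to silently drop missing values are assumptions, not part of the answer above. It resamples by index so that a group with a single value is not misread by sample().

```r
# Hypothetical NA-tolerant variant of the bootstrap SE helper (base R only).
boot_se_na <- function(x, fn = median, B = 1000) {
  x <- x[!is.na(x)]                     # drop missing values before resampling
  if (length(x) == 0) return(NA_real_)  # nothing left to resample
  est <- replicate(B, {
    # index-based resampling avoids sample()'s special case for length-1 input
    fn(x[sample.int(length(x), replace = TRUE)])
  })
  sd(est)
}

set.seed(1)
boot_se_na(c(mtcars$hp, NA, NA))  # finite, positive SE despite the NAs
boot_se_na(c(5, 5, NA))           # constant data, so the SE is 0
boot_se_na(c(NA, NA))             # NA
```

It can be dropped into summarise() the same way as boot_fn().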

An alternative to @kybazzi's solution that fits in the pipeline workflow is this:
boot_se <- function(x, fn = median, B = 100){
replicate(B,
do.call("fn", list(sample(x, n(), replace = T))),
simplify = F) %>%
unlist() %>%
sd()
}
It seems to be slower at times:
boot_fn = function(x, fn = median, B = 100) {
1:B %>%
# For each iteration, generate a sample of x with replacement
map(~ x[sample(1:length(x), replace = TRUE)]) %>%
# Obtain the fn estimate for each bootstrap sample
map_dbl(fn) %>%
# Obtain the standard error
sd()
}
data1 <- mtcars %>%
select(gear, hp) %>%
group_by(gear)
data2 <- data %>%
summarise(hpmed = median(hp),
hpse = boot_se(hp))
data3 <- data %>%
summarise(hpmed = median(hp),
hpse = boot_fn(hp))
#######################################
library(microbenchmark)
microbenchmark((data %>%
summarise(hpmed = median(hp),
hpse = boot_fn(hp))),
(data %>%
summarise(hpmed = median(hp),
hpse = boot_se(hp))))
# Output:
Unit: milliseconds
expr min lq
(data %>% summarise(hpmed = median(hp), hpse = boot_fn(hp))) 14.5737 15.63690
(data %>% summarise(hpmed = median(hp), hpse = boot_se(hp))) 20.6675 21.64715
mean median uq max neval
22.23120 16.78140 25.85675 91.4154 100
29.15338 22.68525 32.01430 87.6299 100
#######################################
microbenchmark(data2, data3, times = 1000)
# Output:
Unit: nanoseconds
expr min lq mean median uq max neval
data2 0 100.0 95.986 101 101 3501 1000
data3 0 1.5 92.318 101 101 2700 1000

Creating a versatile descriptives table using dplyr

I'm trying to create a simple code that I can reuse over and over (with minimal adjustments) to be able to print a table of summary statistics.
A reproducible example creates a table with M and SD for the variable V1 broken down by group:
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
library(dplyr)
descriptives <- data %>% group_by(Group) %>%
summarize(
Mean = mean(V1)
, SD = sd(V1)
)
descriptives
I'd like to modify this function so that it will compute M and SD for all variables in my dataset.
I'd like to be able to replace the call to V1 with something like vars which is just a list of all the variables in my dataset; in this example, V1 and V2. But usually I have like 100 variables.
The reason I'd like it to work this way is so that I can do something very easy like:
vars <- names(data[3:4])
and very quickly select the columns for which I want summary statistics.
A few things for my wishlist:
M and SD for a given variable should be next to each other, and I'd like to add a column above each pair with the variable name.
I'd like the end product to look something like
I'd like to use dplyr, but I'm open to other options.
I'd also like to learn how I could switch the rows and columns of the table so that the variables are on separate rows and each group has a column (or two columns, one for M and one for SD). Like this:
Close, but no cigar:
The newish summarise(across()) kind of helps:
dplyr::group_by(data, Group) %>%
dplyr::summarise(dplyr::across(.cols = c(V1, V2), .fns = c(mean, sd)))
But I don't know how to scale it without making multiple tables and using rbind() to stack them.
I really like the format of table1() (vignette), but from what I can tell I can only stratify the column M/SDs by another variable. I really wish I could just add additional grouping variables on.

The ordering of the output columns is a limitation, but with select() we can reorder them by the variable-name part of each column name:
library(dplyr)
library(stringr)
data %>%
group_by(Group) %>%
summarise_at(vars(vars), list(Mean = mean, SD = sd)) %>%
select(Group, order(str_remove(names(.)[-1], "_.*")) + 1)
# A tibble: 2 x 5
# Group V1_Mean V1_SD V2_Mean V2_SD
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.165 0.915 0.146 1.16
#2 2 0.308 1.31 -0.00711 0.854

I had a similar question here, and got some really useful and simple answers using tidyverse. In the end a really robust approach was made, which I wrapped in a function and use regularly.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
}
It takes three arguments, data, variables and the grouping_var - all of which are rather self explanatory.
Here is a test using mtcars with a 2 level and 3 level grouping var.
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "am"
)
# A tibble: 2 x 3
variable `0` `1`
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4` `6` `8`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
It works out of the box and is applicable to any data; below I used iris:
baseline_table(
data = iris,
variables = c("Sepal.Length", "Sepal.Width"),
grouping_var = "Species"
)
# A tibble: 2 x 4
variable setosa versicolor virginica
<chr> <chr> <chr> <chr>
1 Sepal.Length 5.01(±0.35) 5.94(±0.52) 6.59(±0.64)
2 Sepal.Width 3.43(±0.38) 2.77(±0.31) 2.97(±0.32)
Of course, some grouping variables are not directly suited for this, cyl being one, although it still serves as a good example. You can recode your grouping variables accordingly:
baseline_table(
data = mtcars %>% mutate(cyl = paste(cyl, "Cylinders", sep = " ")),
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
You can also modify the function to include descriptive strings, about the values,
baseline_table <- function(data, variables, grouping_var) {
# Generate the table;
tmpTable <- data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
# Generate Descriptives dynamically
tmpDesc <- tmpTable[1,] %>% mutate(
across(.fns = ~ paste("Mean (±SD)"))
) %>% mutate(
variable = ""
)
bind_rows(
tmpDesc,
tmpTable
)
}
Granted, this extension is a bit awkward - but it is nonetheless still robust. The output is,
# A tibble: 3 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 "" Mean (±SD) Mean (±SD) Mean (±SD)
2 "mpg" 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
3 "hp" 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
Update: I've rewritten the function for added flexibility, as noted in the comments.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!!syms(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% unite(
"grouping",
all_of(grouping_var)
) %>% pivot_longer(
cols = -"grouping",
names_to = "variables"
) %>% pivot_wider(
names_from = "grouping"
)
}
It works in the same way, and outputs the same, unless there is more than one grouping_var,
baseline_table(
mtcars,
variables = c("hp", "mpg"),
grouping_var = c("am", "cyl")
)
# A tibble: 2 x 7
variables `0_4` `0_6` `0_8` `1_4` `1_6` `1_8`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hp 84.67(±19.66) 115.25(±9.18) 194.17(±33.36) 81.88(±22.66) 131.67(±37.53) 299.5(±50.2)
2 mpg 22.9(±1.45) 19.12(±1.63) 15.05(±2.77) 28.08(±4.48) 20.57(±0.75) 15.4(±0.57)
In the updated function I used unite() with the default separator. You can modify this to suit your needs so that the column names read, for example, 4 Cylinder (Automatic), 6 Cylinder (Automatic), etc.
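A minimal sketch of changing the separator, run on plain mtcars and not inside baseline_table(); the " / " separator and the object name labelled are just illustrative choices.

```r
library(tidyr)

# Combine two grouping columns into one label with a custom separator.
labelled <- unite(
  mtcars[, c("am", "cyl", "mpg")],
  "grouping",
  all_of(c("am", "cyl")),
  sep = " / "
)
head(labelled$grouping, 3)
# "1 / 6" "1 / 6" "1 / 4"
```

Inside the function, the same sep argument could be added to the existing unite() call; more descriptive labels would need the grouping columns recoded first, as in the "Cylinders" example earlier.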

As a slight variation on your original code, you could use across() more simply and flexibly by excluding the ID column (Group is already a grouping column) and keeping everything else:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"))
# A tibble: 2 x 5
Group V1_Mean V1_SD V2_Mean V2_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0167 0.979 0.145 1.02
2 2 0.119 1.11 -0.277 1.05
EDIT:
If you want to create your (first) goal exactly, you can use the gt package to make an html table with column spanners:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")) %>%
gt::gt() %>%
gt::tab_spanner_delim("_") %>%
gt::fmt_number(-Group, decimals = 2)
As to your other question, you could alternately do something like this to get the combined & transposed variation:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = ~paste0(
sprintf("%.2f", mean(.x)),
sprintf(" (%.2f)", sd(.x))))) %>%
t() %>%
as.data.frame()
V1 V2
Group 1 2
V1 -0.02 (0.98) 0.12 (1.11)
V2 0.15 (1.02) -0.28 (1.05)

Outside dplyr, you could use the tables package, which lets you create summary statistics from a table formula:
library(tables)
vars <- c("V1","V2")
vars <- paste(vars, collapse="+")
table <- as.formula(paste("(group = factor(Group)) ~ (", vars ,")*(mean+sd)"))
table
# (group = factor(Group)) ~ (V1 + V2) * (mean + sd)
tables::tabular(table, data = data)
# V1 V2
# group mean sd mean sd
# 1 -0.15759 0.9771 0.1405 1.0697
# 2 0.05084 0.9039 -0.1470 0.9949

One way to make a nice summary table is to use the gtsummary package (FYI, I am a co-author of this package). Below I format the data a little in data2 and drop the ID variable. Then it is a two-line call to gtsummary to summarize your data. The by argument is what stratifies the table, and in the statistic argument I simply tell it to show the mean and SD; by default gtsummary shows the median and Q1-Q3. The table can be rendered in all markdown output formats (Word, PDF, HTML).
library(dplyr)
library(gtsummary)
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
data2 <- data %>%
mutate(Group = ifelse(Group == 1, "Group Var1","Group Var2")) %>%
select(-ID)
tbl_summary(data2, by = Group,
statistic = all_continuous()~ "{mean} ({sd})")
If you want more than one stratum but do not want to use tbl_strata, you can combine two variables into one column and use that in the by argument. You can unite() as many variables as you want (although this may not be recommended):
trial %>%
tidyr::unite(col = "trt_grade", trt, grade, sep = ", ") %>%
select(age, marker,stage,trt_grade) %>%
tbl_summary(by = c(trt_grade))

A data.table option
dcast(
setDT(data)[,
c(
.(Meas = c("M", "Sd")),
lapply(.SD, function(x) c(mean(x), sd(x)))
),
Group,
.SDcols = patterns("V\\d")
], Group ~ Meas,
value.var = c("V1", "V2")
)
gives
Group V1_M V1_Sd V2_M V2_Sd
1: 1 -0.2392583 1.097343 -0.08048455 0.7851212
2: 2 0.1059716 1.011769 -0.23356373 0.9927975

You can also use base R:
# using do.call to make the result a data.frame
do.call(
data.frame
# here you aggregate for all the functions you need
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x))))
)
This leads to something like this:
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.1239868 1.008214 0.07215481 1.026059
2 2 -0.2324611 1.048230 0.11348897 1.071467
If you want a fancier table, kableExtra could really help. Note that %>% should also be available once kableExtra is loaded; failing that, from R 4.1 onward you can use |> instead:
library(kableExtra)
# data manipulation as above, note the [,-1] to remove the Group column
do.call(
data.frame
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x)))))[,-1] %>%
# here you define as a kable, and give the names you want to columns
kbl(col.names = rep(c('mean','sd'),2) ) %>%
# some formatting
kable_paper() %>%
# adding the first header
add_header_above(c( "Group 1" = 2, "Group 2" = 2)) %>%
# another header if you need it
add_header_above(c( "Big group" = 4))
And you can find much more to make great tables.
Alternatively, you can try something like this:
do.call(data.frame,
aggregate(. ~ Group, data = data[,-1], FUN = function(x) paste0(round(mean(x),2),' (', round(sd(x),2),')'))
) %>%
kbl() %>%
kable_paper()
That leads to:

Using clusrank by group

Simple question: I want to perform a one-sample rank test on clustered data. After searching for a while, I found clusWilcox.test from the clusrank package. A toy example for illustration:
df = data.frame(x_1 = rnorm(200),
x_2 = rnorm(200),
group = c(rep('A',100),rep('B',100)),
clus = c(rep('a_1',50),rep('a_2',50),rep('b_1',50),rep('b_2',50)))
It works like a charm when used directly:
clusWilcox.test(x_1,paired = TRUE,cluster = "clus",data = df)
But went wrong when I tried to perform the test by group:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus")$p.value), vars = c('x_1','x_2'))
Error in complete.cases(x, cluster, group, stratum) :
not all arguments have the same length
It seemed like a data problem, so I set the function's data argument to df. That ran, but it tests all of the data instead of testing by group:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus",data = df)$p.value), vars = c('x_1','x_2'))
> temp_test
# A tibble: 2 x 3
group vars1 vars2
<fct> <dbl> <dbl>
1 A 0.168 0.136
2 B 0.168 0.136
This doesn't happen when I run a one-sample t.test the same way:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(t.test(.)$p.value), vars = c('x_1','x_2'))
My guess is that clusWilcox.test somehow cannot pick up the grouped data from dplyr. Does anyone know how to fix this?

According to ?clusWilcox.test, the cluster parameter should be a numeric vector. In your df, it is a factor.
Therefore, running the test separately for group A with your factor cluster variable
clusWilcox.test(x_1, paired = TRUE, cluster = clus, data = df[df$group == "A", ])
results in:
Clustered Wilcoxon signed rank test using Rosner-Glynn-Lee method
data: x_1; cluster: clus; (from [)x_1; cluster: clus; (from temp)x_1; cluster: clus; (from temp$group == "A")x_1; cluster: clus; (from )
number of observations: 100; number of clusters: 4
Z = NA, p-value = NA
alternative hypothesis: true shift in location is not equal to 0
If you create a new cluster variable that is numeric, it runs the tests correctly:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise(pvalue = clusWilcox.test(x_1, paired = TRUE, cluster = clus)$p.value)
group pvalue
<fct> <dbl>
1 A 0.175
2 B 0.801
If you want to calculate it for different columns:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value)
group x_1 x_2
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
To indicate that it contains the p-value:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), list(pvalue = ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value))
group x_1_pvalue x_2_pvalue
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
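If you prefer not to rely on group_indices(), a numeric cluster id can also be built in base R. This is only a sketch on a small made-up data frame; the column name clus_id is hypothetical.

```r
df_small <- data.frame(
  group = rep(c("A", "B"), each = 4),
  clus  = rep(c("a_1", "a_2", "b_1", "b_2"), each = 2)
)

# One integer per distinct (group, clus) combination that actually occurs
df_small$clus_id <- as.integer(
  interaction(df_small$group, df_small$clus, drop = TRUE)
)
df_small$clus_id
# 1 1 2 2 3 3 4 4
```

The resulting clus_id column can then be passed as the cluster argument in the grouped summarise() calls above.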

Summarise multiple variables by one group at a time

There are a number of questions and answers about summarising multiple variables by one or more groups (e.g., Means multiple columns by multiple groups). I don't think this is a duplicate.
Here's what I'm trying to do: I want to calculate the mean for 4 variables by Displacement, then calculate the mean for those same four by Horsepower, and so on. I don't want to group by vs, am, gear, and carb simultaneously (i.e., I'm not looking for simply mydata %>% group_by(vs, am, gear, carb) %>% summarise_if(...)).
How can I calculate the means for a set of variables by Displacement, then calculate the means for that same set of variables by Horsepower, etc., then place in a table side by side?
I tried to come up with a reproducible example but couldn't. Here is a tibble from mtcars that shows what I'm ultimately looking for (data is made up):
tibble(Item = c("vs", "am" ,"gear", "carb"),
"Displacement (mean)" = c(2.4, 1.4, 5.5, 1.3),
"Horsepower (mean)" = c(155, 175, 300, 200))

Perhaps something like this using purrr::map and some rlang syntax?
grps <- list("cyl", "vs")
map(setNames(grps, unlist(grps)), function(x)
mtcars %>%
group_by(!!rlang::sym(x)) %>%
summarise(mean.mpg = mean(mpg), mean.disp = mean(disp)) %>%
rename(id.val = 1)) %>%
bind_rows(.id = "id")
## A tibble: 5 x 4
# id id.val mean.mpg mean.disp
# <chr> <dbl> <dbl> <dbl>
#1 cyl 4. 26.7 105.
#2 cyl 6. 19.7 183.
#3 cyl 8. 15.1 353.
#4 vs 0. 16.6 307.
#5 vs 1. 24.6 132.

With so few groupings, why not do each set of means one at a time:
out1 <- mydata %>% group_by(Var1) %>%
summarise(mean_1a = mean(var_a), mean_1b = mean(var_b))
out2 <- mydata %>% group_by(Var2) %>%
summarise(mean_2a = mean(var_a), mean_2b = mean(var_b))
out3 <- mydata %>% group_by(Var3) %>%
summarise(mean_3a = mean(var_a), mean_3b = mean(var_b))
If it makes sense to place the results side-by-side, you could do so with something like:
result <- cbind(out1, out2, out3)
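Note that cbind() generally needs the pieces to have the same number of rows, which separate groupings often will not. A base-R sketch that stacks the per-grouping results instead, using mtcars as stand-in data (the names groupers, pieces and stacked are illustrative):

```r
groupers <- c("cyl", "vs")

# For each grouping variable, compute the means and tag rows with its name
pieces <- lapply(groupers, function(g) {
  out <- aggregate(
    mtcars[c("mpg", "disp")],
    by = list(id.val = mtcars[[g]]),
    FUN = mean
  )
  cbind(id = g, out)
})

# Stack the pieces into one long table: id, id.val, mpg, disp
stacked <- do.call(rbind, pieces)
stacked
```

This gives three rows for cyl and two for vs, the same layout as the purrr answer above.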
