Why does R ignore relevel when using group_nest()? - r

As a continuation from this question, I'm trying to efficiently run many logistic regressions in order to generate a column saying whether each group differs significantly from my reference group.
When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:
Adding a relevel argument (shown below)
Adding a relevel argument within the custom function itself (also shown below)
Renaming the desired reference group to start with 'AAA' to trick R into making it the first option
Here's a sample dataset:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
app_deadline = ymd(c(rep("'2021-04-04", 3), rep("'2020-03-23", 3), rep("'2019-05-23", 1))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
Here's the code that won't let me change the reference level:
#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
full_join(test2)
#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group
Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english' but could also be switched to work for 'computer science' too.

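It does respect the relevel() call. The problem, such as it is, is that the returned results don't line up with the major column: unnest(c(data, reg_tidy)) binds the rows of data and the rows of reg_tidy side by side by position, not by matching major to term. See what happens if you stop at the unnest() step: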
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
tmp <- test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy))
Now, look at major and term:
tmp %>% select(major, term)
# # A tibble: 6 × 2
# major term
# <fct> <chr>
# 1 undeclared "(Intercept)"
# 2 computer science "relevel(major, ref = \"english\")computer science"
# 3 english "relevel(major, ref = \"english\")undeclared"
# 4 undeclared "(Intercept)"
# 5 computer science "relevel(major, ref = \"english\")computer science"
# 6 english "relevel(major, ref = \"english\")undeclared"
You can see that the rows where major is "english" are actually for the "undeclared" parameter estimate. Taking the above result, I think you can capture what you want with the following:
tmp %>%
filter(term != "(Intercept)") %>%
mutate(major = gsub(".*\\)(.*)", "\\1", term)) %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(year, app_deadline, major, time, significant) %>%
full_join(test2)
# Joining, by = c("app_deadline", "major", "time")
# # A tibble: 7 × 9
# year app_deadline major time significant admit reject total accept_rate
# <dbl> <date> <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 2020-03-23 computer science 2020-01-01 Yes 300 100 400 0.75
# 2 2020 2020-03-23 undeclared 2020-01-01 Yes 800 210 1010 0.792
# 3 2021 2021-04-04 computer science 2021-01-01 Yes 1000 300 1300 0.769
# 4 2021 2021-04-04 undeclared 2021-01-01 No 500 1000 1500 0.333
# 5 NA 2021-04-04 english 2021-01-01 NA 450 1000 1450 0.310
# 6 NA 2020-03-23 english 2020-01-01 NA 100 900 1000 0.1
# 7 NA 2019-05-23 undeclared 2019-01-01 NA 1000 1500 2500 0.4
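To make the reference level switchable, one option is to pass it as an argument and read the compared level off the coefficient names instead of relying on row order. The following is a minimal base-R sketch under stated assumptions: fit_with_ref and compared_levels are hypothetical helper names, the three-row data frame d is made up for illustration, and the model is written with cbind(admit, reject) as the response, which specifies the same binomial model as accept_rate with weights = total.

```r
# Toy data for a single group (illustrative values only)
d <- data.frame(
  major  = factor(c("undeclared", "computer science", "english")),
  admit  = c(500, 1000, 450),
  reject = c(1000, 300, 1000)
)

# Hypothetical helper: relevel inside the function so `ref` is an argument
fit_with_ref <- function(df, ref) {
  df$major <- relevel(df$major, ref = ref)
  glm(cbind(admit, reject) ~ major, data = df, family = binomial)
}

# Hypothetical helper: strip the "major" prefix from the coefficient names
# to recover which level each non-intercept term compares against `ref`
compared_levels <- function(fit) {
  sub("^major", "", names(coef(fit))[-1])
}

vs_english <- compared_levels(fit_with_ref(d, "english"))
vs_cs      <- compared_levels(fit_with_ref(d, "computer science"))
print(vs_english)
print(vs_cs)
```

Because the term names change with the reference level, joining the p-values back on the recovered level (rather than on row position) keeps each result attached to the right major.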

Related

How to efficiently run many logistic regressions in R and skip over equations that throw errors?

As a continuation from this question, I want to run many logistic regression equations at once and then note whether a group was significantly different from a reference group. This solution works, but only when I'm not missing values. Since my data has 100 equations, it's bound to have missing values. Rather than having the solution fail when it hits an error, how can I program it to skip the instances that throw an error?
Here's a modified dataset that's missing cases:
library(dplyr)
library(lubridate)
library(broom)
test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
# use across() here: `test` does not exist yet while it is being defined
mutate(total = rowSums(across(c(admit, reject)), na.rm = TRUE)) %>%
mutate(accept_rate = admit/total)
And here's the solution that works when it has all cases (see dataset here), but when it hits the 2019 grouping that's missing cases, it fails:
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test %>%
# create year column
mutate(year = year(time),
major = relevel(major, "undeclared")) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
family = binomial, weights = total, na.action = na.exclude)),
# use broom::tidy to make a tibble out of model object
reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
I also tried wrapping the solution using the custom functions here, but I also couldn't get it to work. Last, I'm also open to other ideas for a solution if it produces a similar output and is resistant to these errors.
To ignore errors, use this function:
get_model <- function(df) {
tryCatch(
glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
Use it where you call mutate(reg=map()...):
# compute regression
mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy))
Output:
# A tibble: 4 x 8
year major time admit reject total accept_rate significant
<dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
2 2021 english 2021-01-01 450 1000 1450 0.310 No
3 2020 computer science 2020-01-01 300 100 400 0.75 No
4 2020 english 2020-01-01 100 900 1000 0.1 Yes
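The tryCatch() pattern used in get_model can be seen in isolation with a smaller function. This is only an illustrative sketch: safe_sqrt is a hypothetical helper, and sqrt() stands in for glm().

```r
# Hypothetical helper: returns NULL instead of stopping on an error or warning
safe_sqrt <- function(x) {
  tryCatch(
    sqrt(x),
    error   = function(e) NULL,
    warning = function(w) NULL
  )
}

ok      <- safe_sqrt(4)     # normal result
errored <- safe_sqrt("a")   # sqrt() errors on a character input
warned  <- safe_sqrt(-1)    # sqrt() warns (NaNs produced), so this is NULL too
```

Note the design choice: because the warning handler also returns NULL, a model that fits but merely warns is discarded as well. Dropping the warning argument would keep those fits.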
purrr::safely lets you take care of errors. To wrap the glm call inside purrr::safely, I use a helper function glm_safe. glm_safe returns a list with two elements, result and error.
When there is no error, result contains the model object, while error is NULL. In case of an error, the error object is stored in error and result is NULL.
To use the results in your pipeline, we have to extract the result elements, which can be done via transpose(reg)$result.
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test <- test %>%
mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
mutate(accept_rate = admit / total)
glm_safe <- purrr::safely(
function(x) {
glm(accept_rate ~ major,
data = x,
family = binomial, weights = total, na.action = na.exclude
)
}
)
test %>%
# create year column
mutate(
year = year(time),
major = relevel(major, "undeclared")
) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, glm_safe),
reg = transpose(reg)$result) |>
mutate(reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#> year major time admit reject total accept_rate significant
#> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
#> 2 2021 english 2021-01-01 450 1000 1450 0.310 No
#> 3 2020 computer science 2020-01-01 300 100 400 0.75 No
#> 4 2020 english 2020-01-01 100 900 1000 0.1 Yes
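The result/error structure returned by safely() is easier to see on a tiny function. The sketch below assumes purrr is installed; safe_log is a hypothetical name, and log() stands in for glm().

```r
library(purrr)

# safely() wraps a function so it always returns list(result = , error = )
safe_log <- safely(log)

ok  <- safe_log(10)
bad <- safe_log("a")   # log() errors on a character input

# Map over several inputs, then pull out just the results
fits    <- map(list(10, "a"), safe_log)
results <- transpose(fits)$result
```

Here ok$error and bad$result are both NULL, and results is a list whose second element is NULL, which is the shape the pipeline above passes on to tidy().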

Variable importance signs from vip are opposite of expected from glmnet / tidymodels

I am using a lasso regression to classify some text as either related to AI or not. When I calculate variable importance using vip and tidymodels, the sign is opposite of expected -- words like "machine", "learning", and "algorithm" have a negative sign.
Apologies for the lack of reprex, but here is my code:
fy21_raw %>%
sample_n(5)
# A tibble: 5 x 3
# prog_title text artificial_intel
# <chr> <chr> <fct>
#1 Advanced Batt~ "ABMS l~ not
#2 Energy Effici~ "This e~ not
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not
# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"
set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel)
budget_train <- training(budget_split)
budget_test <- testing(budget_split)
set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5)
budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
update_role(prog_title, new_role = "id") %>%
step_tokenize(text) %>%
step_tokenfilter(text, max_tokens = 1000) %>%
step_upsample(artificial_intel) %>% # update dv with actual name
step_tfidf(text) %>%
step_normalize(recipes::all_predictors())
budget_wf <- workflow() %>%
add_recipe(budget_rec)
lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
set.seed(1234)
lasso_res <- budget_wf %>%
add_model(lasso_spec) %>%
fit_resamples(resamples = budget_folds,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE, pkgs = c('textrecipes')))
set.seed(123)
budget_imp <- budget_wf %>%
add_model(lasso_spec) %>%
fit(budget_train) %>%
pull_workflow_fit() %>%
vi()
# A tibble: 1,000 x 3
# Variable Importance Sign
# <chr> <dbl> <chr>
# 1 tfidf_text_machine -6.82 NEG
# 2 tfidf_text_artificial -5.84 NEG
# 3 tfidf_text_learning -3.69 NEG
Is it calculating the importance relative to the "not" outcome rather than "artificial_intel"?
From the glmnet vignette:
Note that for "binomial" models, results are returned only for the
class corresponding to the second level of the factor response.
So if you want the right coefficient sign, the positive level with glmnet must be the second.
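If you use glmnet with yardstick, keep in mind that yardstick treats the first factor level as the event by default. Therefore, you need to set options(yardstick.event_first = FALSE); in newer yardstick versions that option is deprecated and you pass event_level = "second" to the metric function instead.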
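The effect of level order on the coefficient sign can be checked without glmnet. The sketch below uses base glm() as a stand-in (an assumption made to keep it dependency-free; glm() likewise treats the first factor level as "failure"), with made-up toy data.

```r
# Toy data (made up): larger x goes with the "artificial_intel" class
x <- c(1, 2, 3, 4, 5, 6)
y <- c("not", "not", "artificial_intel", "not", "artificial_intel", "artificial_intel")

# Default factor(): levels are alphabetical, so "not" is the second level
y_default <- factor(y)
# Explicit levels: put the class of interest second
y_flipped <- factor(y, levels = c("not", "artificial_intel"))

# Base glm() stands in for glmnet here; it also models the non-first level
fit_default <- glm(y_default ~ x, family = binomial)
fit_flipped <- glm(y_flipped ~ x, family = binomial)

slope_default <- unname(coef(fit_default)["x"])  # log-odds of "not"
slope_flipped <- unname(coef(fit_flipped)["x"])  # log-odds of "artificial_intel"
```

The two slopes have the same magnitude and opposite signs, so setting the levels explicitly before modelling is enough to get coefficients oriented toward the class of interest.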

Creating a versatile descriptives table using dplyr

I'm trying to create a simple code that I can reuse over and over (with minimal adjustments) to be able to print a table of summary statistics.
A reproducible example creates a table with M and SD for the variable V1 broken down by group:
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
library(dplyr)
descriptives <- data %>% group_by(Group) %>%
summarize(
Mean = mean(V1)
, SD = sd(V1)
)
descriptives
I'd like to modify this function so that it will compute M and SD for all variables in my dataset.
I'd like to be able to replace the call to V1 with something like vars which is just a list of all the variables in my dataset; in this example, V1 and V2. But usually I have like 100 variables.
The reason I'd like it to work this way is so that I can do something very easy like:
vars <- names(data[3:4])
and very quickly select the columns for which I want summary statistics.
A few things for my wishlist:
M and SD for a given variable should be next to each other, and I'd like to add a column above each pair with the variable name.
I'd like the end product to look something like
I'd like to use dplyr, but I'm open to other options.
I'd also like to learn how I could switch the rows and columns of the table so that the variables are on separate rows and each group has a column (or two columns, one for M and one for SD). Like this:
Close, but no cigar:
The newish summarise(across()) kind of helps:
dplyr::group_by(data, Group) %>%
dplyr::summarise(dplyr::across(.cols = c(V1, V2), .fns = c(mean, sd)))
But I don't know how to scale it without making multiple tables and using rbind() to stack them.
I really like the format of table1() (vignette), but from what I can tell I can only stratify the column M/SDs by another variable. I really wish I could just add additional grouping variables on.
The column ordering is a limitation, but with select() we can reorder the columns based on the variable-name prefix of each column name:
library(dplyr)
library(stringr)
data %>%
group_by(Group) %>%
summarise_at(vars(vars), list(Mean = mean, SD = sd)) %>%
select(Group, order(str_remove(names(.)[-1], "_.*")) + 1)
# A tibble: 2 x 5
# Group V1_Mean V1_SD V2_Mean V2_SD
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.165 0.915 0.146 1.16
#2 2 0.308 1.31 -0.00711 0.854
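The reordering step in the select() call can be tried on the column names alone. This is a small base-R sketch; the vector cols is written out by hand on the assumption that summarise_at() returns the columns grouped by statistic, and sub() is used in place of str_remove().

```r
# Column names in the order assumed to come out of summarise_at()
cols <- c("V1_Mean", "V2_Mean", "V1_SD", "V2_SD")

# Strip everything from the first underscore on, leaving the variable name
prefix <- sub("_.*", "", cols)

# order() keeps ties in their original order, so Mean stays before SD
reordered <- cols[order(prefix)]
print(reordered)
```

In the answer's code the same index vector is shifted by 1 because the Group column occupies the first position.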
I had a similar question here, and got some really useful and simple answers using tidyverse. In the end a really robust approach was made, which I wrapped in a function and use regularly.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
}
It takes three arguments, data, variables and the grouping_var - all of which are rather self explanatory.
Here is a test using mtcars with a 2 level and 3 level grouping var.
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "am"
)
# A tibble: 2 x 3
variable `0` `1`
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4` `6` `8`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
It works out of the box and is applicable to any data; below I use iris:
baseline_table(
data = iris,
variables = c("Sepal.Length", "Sepal.Width"),
grouping_var = "Species"
)
# A tibble: 2 x 4
variable setosa versicolor virginica
<chr> <chr> <chr> <chr>
1 Sepal.Length 5.01(±0.35) 5.94(±0.52) 6.59(±0.64)
2 Sepal.Width 3.43(±0.38) 2.77(±0.31) 2.97(±0.32)
Of course, some grouping variables are not directly suited for this, cyl being one example (though it serves well for illustration). You can recode your grouping variables accordingly:
baseline_table(
data = mtcars %>% mutate(cyl = paste(cyl, "Cylinders", sep = " ")),
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
You can also modify the function to include descriptive strings, about the values,
baseline_table <- function(data, variables, grouping_var) {
# Generate the table;
tmpTable <- data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
# Generate Descriptives dynamically
tmpDesc <- tmpTable[1,] %>% mutate(
across(.fns = ~ paste("Mean (±SD)"))
) %>% mutate(
variable = ""
)
bind_rows(
tmpDesc,
tmpTable
)
}
Granted, this extension is a bit awkward - but it is nonetheless still robust. The output is,
# A tibble: 3 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 "" Mean (±SD) Mean (±SD) Mean (±SD)
2 "mpg" 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
3 "hp" 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
Update: I've rewritten the function for added flexibility, as noted in the comments.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!!syms(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% unite(
"grouping",
all_of(grouping_var)
) %>% pivot_longer(
cols = -"grouping",
names_to = "variables"
) %>% pivot_wider(
names_from = "grouping"
)
}
It works in the same way, and outputs the same, unless there is more than one grouping_var,
baseline_table(
mtcars,
variables = c("hp", "mpg"),
grouping_var = c("am", "cyl")
)
# A tibble: 2 x 7
variables `0_4` `0_6` `0_8` `1_4` `1_6` `1_8`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hp 84.67(±19.66) 115.25(±9.18) 194.17(±33.36) 81.88(±22.66) 131.67(±37.53) 299.5(±50.2)
2 mpg 22.9(±1.45) 19.12(±1.63) 15.05(±2.77) 28.08(±4.48) 20.57(±0.75) 15.4(±0.57)
In the updated function I used unite with its default separator. You can modify this to suit your needs so that the column names say, for example, 4 Cylinder (Automatic), 6 Cylinder (Automatic), etc.
Slight variation of your original code, you could use across() more simply/flexibly if you specify you don't want the ID (or the already-grouped Group) column, but rather everything else:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"))
# A tibble: 2 x 5
Group V1_Mean V1_SD V2_Mean V2_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0167 0.979 0.145 1.02
2 2 0.119 1.11 -0.277 1.05
EDIT:
If you want to create your (first) goal exactly, you can use the gt package to make an html table with column spanners:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")) %>%
gt::gt() %>%
gt::tab_spanner_delim("_") %>%
gt::fmt_number(-Group, decimals = 2)
As to your other question, you could alternately do something like this to get the combined & transposed variation:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = ~paste0(
sprintf("%.2f", mean(.x)),
sprintf(" (%.2f)", sd(.x))))) %>%
t() %>%
as.data.frame()
V1 V2
Group 1 2
V1 -0.02 (0.98) 0.12 (1.11)
V2 0.15 (1.02) -0.28 (1.05)
Outside dplyr, you could use the tables package, which lets you create summary statistics from a table formula:
library(tables)
vars <- c("V1","V2")
vars <- paste(vars, collapse="+")
table <- as.formula(paste("(group = factor(Group)) ~ (", vars ,")*(mean+sd)"))
table
# (group = factor(Group)) ~ (V1 + V2) * (mean + sd)
tables::tabular(table, data = data)
# V1 V2
# group mean sd mean sd
# 1 -0.15759 0.9771 0.1405 1.0697
# 2 0.05084 0.9039 -0.1470 0.9949
One way to make a nice summary table is to use the gtsummary package (as an FYI, I am a co-author of this package). Below, I format the data a little in data2 and drop the ID variable. Then a two-line call to gtsummary summarizes the data. The by argument stratifies the table, and the statistic argument tells it to show the mean and SD; by default gtsummary shows the median and Q1-Q3. The table can be rendered in all markdown output formats (Word, PDF, HTML).
library(dplyr)
library(gtsummary)
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
data2 <- data %>%
mutate(Group = ifelse(Group == 1, "Group Var1","Group Var2")) %>%
select(-ID)
tbl_summary(data2, by = Group,
statistic = all_continuous()~ "{mean} ({sd})")
If you want more than one stratum but do not want to use tbl_strata, you can combine two variables into one column and use that in the by argument. You can unite() as many variables as you want (although that may not be recommended):
trial %>%
tidyr::unite(col = "trt_grade", trt, grade, sep = ", ") %>%
select(age, marker,stage,trt_grade) %>%
tbl_summary(by = c(trt_grade))
A data.table option
dcast(
setDT(data)[,
c(
.(Meas = c("M", "Sd")),
lapply(.SD, function(x) c(mean(x), sd(x)))
),
Group,
.SDcols = patterns("V\\d")
], Group ~ Meas,
value.var = c("V1", "V2")
)
gives
Group V1_M V1_Sd V2_M V2_Sd
1: 1 -0.2392583 1.097343 -0.08048455 0.7851212
2: 2 0.1059716 1.011769 -0.23356373 0.9927975
You can also use base R:
# using do.call to make the result a data.frame
do.call(
data.frame
# here you aggregate for all the functions you need
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x))))
)
This leads to something like this:
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.1239868 1.008214 0.07215481 1.026059
2 2 -0.2324611 1.048230 0.11348897 1.071467
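Why the do.call(data.frame, ...) wrapper is needed becomes clearer when you inspect the raw aggregate() result. The sketch below uses mtcars instead of the question's random data so that it is reproducible; the variable names agg, n_before and flat are illustrative.

```r
# aggregate() with a function that returns two values per group
agg <- aggregate(
  cbind(mpg, hp) ~ am,
  data = mtcars,
  FUN = function(x) c(mn = mean(x), sd = sd(x))
)
# Each summarised variable is a single matrix column here
n_before <- ncol(agg)

# do.call(data.frame, ...) splits the matrix columns into ordinary ones
flat <- do.call(data.frame, agg)
print(names(flat))
```

Before flattening, agg has only three columns (am plus one two-column matrix per variable); afterwards each statistic is its own column with a .mn or .sd suffix.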
If you want a fancier table, kableExtra could really help. Note that loading kableExtra should also make %>% available; if it does not, from R 4.1 you can use the native pipe |> instead:
library(kableExtra)
# data manipulation as above, note the [,-1] to remove the Group column
do.call(
data.frame
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x)))))[,-1] %>%
# here you define as a kable, and give the names you want to columns
kbl(col.names = rep(c('mean','sd'),2) ) %>%
# some formatting
kable_paper() %>%
# adding the first header
add_header_above(c( "Group 1" = 2, "Group 2" = 2)) %>%
# another header if you need it
add_header_above(c( "Big group" = 4))
And you can find much more to make great tables.
Alternatively, you can also try something like this:
do.call(data.frame,
aggregate(. ~ Group, data = data[,-1], FUN = function(x) paste0(round(mean(x),2),' (', round(sd(x),2),')'))
) %>%
kbl() %>%
kable_paper()
That leads to:

Using clusrank by group

Simple question: I want to perform a one-sample rank test on clustered data. After searching for a while, I found clusWilcox.test from the clusrank package. A toy example for illustration:
df = data.frame(x_1 = rnorm(200),
x_2 = rnorm(200),
group = c(rep('A',100),rep('B',100)),
clus = c(rep('a_1',50),rep('a_2',50),rep('b_1',50),rep('b_2',50)))
Worked like a charm when used directly
clusWilcox.test(x_1,paired = TRUE,cluster = "clus",data = df)
But went wrong when I tried to perform the test by group:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus")$p.value), vars = c('x_1','x_2'))
Error in complete.cases(x, cluster, group, stratum) :
not all arguments have the same length
This seemed like a data problem, so I filled the data argument of the function with df. It ran, but it tested all the data instead of testing by group.
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus",data = df)$p.value), vars = c('x_1','x_2'))
> temp_test
# A tibble: 2 x 3
group vars1 vars2
<fct> <dbl> <dbl>
1 A 0.168 0.136
2 B 0.168 0.136
This doesn't happen when I perform the one-sample t.test:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(t.test(.)$p.value), vars = c('x_1','x_2'))
My guess is that clusWilcox.test somehow cannot pick up the grouped data from dplyr. Does anyone know how to fix this?
According to ?clusWilcox.test, the cluster parameter should be a numeric vector. In your df, it is a factor.
Therefore, running the test separately for group A with your factor cluster variable
clusWilcox.test(x_1, paired = TRUE, cluster = clus, data = df[df$group == "A", ])
results in:
Clustered Wilcoxon signed rank test using Rosner-Glynn-Lee method
data: x_1; cluster: clus; (from [)x_1; cluster: clus; (from temp)x_1; cluster: clus; (from temp$group == "A")x_1; cluster: clus; (from )
number of observations: 100; number of clusters: 4
Z = NA, p-value = NA
alternative hypothesis: true shift in location is not equal to 0
If you create a new cluster variable that is numeric, it runs the tests correctly:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise(pvalue = clusWilcox.test(x_1, paired = TRUE, cluster = clus)$p.value)
group pvalue
<fct> <dbl>
1 A 0.175
2 B 0.801
If you want to calculate it for different columns:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value)
group x_1 x_2
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
To indicate that it contains the p-value:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), list(pvalue = ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value))
group x_1_pvalue x_2_pvalue
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
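If you prefer not to depend on group_indices(), a numeric cluster id can also be built in base R. This is a minimal sketch under assumptions: the data mimic the question's df (random values, fixed seed), and clus_id and key are hypothetical names.

```r
# Toy data shaped like the question's df (values are random, as in the question)
set.seed(1)
df <- data.frame(
  x_1   = rnorm(200),
  group = rep(c("A", "B"), each = 100),
  clus  = rep(c("a_1", "a_2", "b_1", "b_2"), each = 50)
)

# Build one integer id per distinct (group, clus) pair, numbered by first appearance
key <- paste(df$group, df$clus)
df$clus_id <- match(key, unique(key))

table(df$clus_id)
```

The resulting clus_id column is an integer vector with one value per cluster and can be passed as the cluster argument in place of the character labels.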

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e., if it's in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
gather(key = var, value = value, -group_id, -weighting) %>%
mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
group_by(group_id, var) %>%
summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
ungroup() %>%
spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
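An alternative that avoids reshaping is to call across() twice inside summarise(), once per variable list. This is not the approach of the answer above, just a sketch that assumes dplyr 1.0 or later; the small data frame is made up so the arithmetic can be checked by hand, and to_be_summed and to_be_weighted play the role of the question's name vectors.

```r
library(dplyr)

# Small made-up data so the arithmetic is easy to check by hand
df <- data.frame(
  group_id      = c(1, 1, 2),
  var_1         = c(1, 2, 3),
  var_percent_1 = c(0.2, 0.4, 0.5),
  weighting     = c(1, 3, 2)
)
to_be_summed   <- "var_1"
to_be_weighted <- "var_percent_1"

collapsed <- df %>%
  group_by(group_id) %>%
  summarise(
    # plain sums for the level variables
    across(all_of(to_be_summed), sum),
    # weighted means for the percentage variables; `weighting` is the
    # current group's weights because it is evaluated inside summarise()
    across(all_of(to_be_weighted), ~ weighted.mean(.x, weighting))
  )
print(collapsed)
```

For group 1 the weighted mean is (0.2 * 1 + 0.4 * 3) / 4 = 0.35, which differs from the unweighted mean of 0.3. The original attempt failed because weighted.mean() was handed the vector of column names rather than the column values.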
