median doesn't work after group_by in r function - r

I wrote an R function to compute the median by group:
varA<-rep(c(1:2),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1 <- df1 %>% mutate(outcome=ifelse(varA==1, "Yes", "No"))
ctn_me<- function(df, var, group_var) {
df[[group_var]]<-as.character(df[[group_var]])
# df[[var]]<-as.numeric(df[[var]])
tbl1<-df %>%
bind_rows(mutate(., !!group_var := 'Total')) %>%
dplyr::group_by(gpvar=.[[group_var]])%>%
dplyr::summarise(
median=median(.[[var]], na.rm = TRUE),
N = n())
print(tbl1)
}
ctn_me(df1, "var1", "outcome")
It gave me results like this:
#### gpvar median N
#### <chr> <dbl> <int>
#### 1 No 734 30
#### 2 Total 734 60
#### 3 Yes 734 30
So it counts the number of rows within each group correctly, but for the median it returned the overall median instead of the per-group median.
This gave me the results I wanted:
df1 %>% bind_rows(mutate(., outcome := 'Total')) %>%
dplyr::group_by(outcome)%>%
dplyr::summarise(
median=median(var1, na.rm = TRUE),
N = n())
# A tibble: 3 x 3
# outcome median N
# <chr> <dbl> <int>
# 1 No 713 30
# 2 Total 734 60
# 3 Yes 788. 30
I was trying to figure out what was wrong with my R function. Can anyone let me know? Thanks!

The docs state that you need to specifically reference ".data" within the summarise() function:
"When you have an env-variable that is a character vector, you need to
index into the .data pronoun with [[, like summarise(df, mean =
mean(.data[[var]]))."
In this case, you need to change .[[variable]] to .data[[variable]], i.e.
library(tidyverse)
set.seed(123)
varA<-rep(c(1:2),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1 <- df1 %>% mutate(outcome=ifelse(varA==1, "Yes", "No"))
ctn_me <- function(df, var, group_var) {
df %>%
bind_rows(mutate(., !!group_var := "Total")) %>%
group_by(gpvar = .[[group_var]]) %>%
summarise(
median_group = median(.data[[var]], na.rm = TRUE),
N = n()
)
}
ctn_me(df1, "var1", "outcome")
#> # A tibble: 3 × 3
#> gpvar median_group N
#> <chr> <dbl> <int>
#> 1 No 740. 30
#> 2 Total 754 60
#> 3 Yes 776. 30
Created on 2022-07-19 by the reprex package (v2.0.1)
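To make the difference concrete, here is a minimal sketch on a made-up data frame (the names toy, g, x and col are hypothetical, not from the question). It assumes the magrittr pipe that dplyr re-exports: inside summarise(), `.` is still the whole data frame that was piped in, whereas `.data` refers to the rows of the current group.

```r
library(dplyr)

# Toy data (hypothetical): two groups with clearly different medians
toy <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))
col <- "x"

res <- toy %>%
  group_by(g) %>%
  summarise(
    whole_column = median(.[[col]]),     # `.` is the full piped data frame
    per_group    = median(.data[[col]])  # `.data` is the current group's rows
  )

print(res)
```

The whole_column value is identical in every row because the grouping is ignored, which matches the symptom in the question; per_group differs by group.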
Original answer:
If you use a different syntax inside the summarise() function it works as expected, so I think it's something to do with the summarise() function:
library(tidyverse)
set.seed(123)
varA<-rep(c(1:2),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1 <- df1 %>% mutate(outcome=ifelse(varA==1, "Yes", "No"))
ctn_me <- function(df, var, group_var) {
df %>%
bind_rows(mutate(., !!group_var := "Total")) %>%
group_by(gpvar = .[[group_var]]) %>%
summarise(
median_group = median(!!sym(var), na.rm = TRUE),
N = n()
)
}
ctn_me(df1, "var1", "outcome")
#> # A tibble: 3 × 3
#> gpvar median_group N
#> <chr> <dbl> <int>
#> 1 No 740. 30
#> 2 Total 754 60
#> 3 Yes 776. 30
Created on 2022-07-19 by the reprex package (v2.0.1)

Try this for non-standard evaluation.
ctn_me<- function(df, var, group_var) {
df[[group_var]]<-as.character(df[[group_var]])
# df[[var]]<-as.numeric(df[[var]])
tbl1<-df %>%
bind_rows(mutate(., !!group_var := 'Total')) %>%
dplyr::group_by(.data[[group_var]])%>%
dplyr::summarise(
median=median(.data[[var]], na.rm = TRUE),
N = n())
print(tbl1)
}

Related

How to combine function argument with group_by in R

I would like to use the group_by() function inside my custom function, but the column names that go within group_by would be defined in my function arguments.
See a hypothetical example of what my data would look like:
data <- data.frame(ind = rep(c("A", "B", "C"), 4),
gender = rep(c("F", "M"), each = 6),
value = sample(1:100, 12))
And this is the result I would like to have:
result <- data %>%
group_by(ind, gender) %>%
mutate(value = mean(value)) %>%
distinct()
This is how I was trying to make my function to work:
myFunction <- function(data, set_group, variable){
result <- data %>%
group_by(get(set_group)) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
distinct()
}
result3 <- myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
result3
I want to allow the user to define as many set_group columns and as many variable columns as needed. I tried using the get(), all_of() and mget() functions within group_by, but none worked.
Does anyone know how I can code it?
Thank you!
We could use across within group_by
myFunction <- function(data, set_group, variable){
data %>%
group_by(across(all_of(set_group))) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
ungroup %>%
distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
ind gender value
<chr> <chr> <dbl>
1 A F 43.5
2 B F 87.5
3 C F 67.5
4 A M 13
5 B M 43.5
6 C M 37.5
Another option is to convert to symbols and evaluate (!!!)
myFunction <- function(data, set_group, variable){
data %>%
group_by(!!! rlang::syms(set_group)) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
ungroup %>%
distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
ind gender value
<chr> <chr> <dbl>
1 A F 43.5
2 B F 87.5
3 C F 67.5
4 A M 13
5 B M 43.5
6 C M 37.5
NOTE: get is used when there is a single object; for multiple objects mget can be used. However, it is better to use tidyverse functions.
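If one row per group is all that is needed, a possible variant is to use summarise() instead of mutate() followed by distinct(). This is only a sketch: the function name group_means is hypothetical, and the value column is filled with 1:12 rather than sample() so that the result is reproducible.

```r
library(dplyr)

# Hypothetical helper: one row per group, mean of each requested column
group_means <- function(data, set_group, variable) {
  data %>%
    group_by(across(all_of(set_group))) %>%
    summarise(across(all_of(variable), ~ mean(.x, na.rm = TRUE)),
              .groups = "drop")
}

# Same layout as the question's data, with deterministic values
data <- data.frame(ind = rep(c("A", "B", "C"), 4),
                   gender = rep(c("F", "M"), each = 6),
                   value = 1:12)

res <- group_means(data, set_group = c("ind", "gender"), variable = "value")
print(res)
```

Unlike the mutate() version, this drops any columns that are neither grouping columns nor listed in variable, so it is not a drop-in replacement when other columns must be kept.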

Summaries based on one reference column compared to the other columns in dplyr

I want to get sums of a variable depending on the other variable's na or non-na values in R. A working example code is below:
library(dplyr)
df <- data.frame(A = c(1,2,3,NA,4),
B = c(NA,2,3,NA,5),
C = c(3,4,NA,NA,NA),
REF = c(10,20,30,40,50))
df.na <- df %>% mutate_at(vars(-REF),is.na)
sums <- matrix(0,2,3)
row.names(sums) <- c("NON-NA","NA")
colnames(sums) <- c("A","B","C")
for(i in 1:3){
sums[,i] <- df.na %>% group_by_at(i) %>% summarise(sum=sum(REF)) %>% select(sum) %>% unlist()
}
> sums
A B C
NON-NA 110 100 30
NA 40 50 120
For example, since the 4th term in the A column is NA, the corresponding values are 40 and 10+20+30+50 = 150-40 = 110 in the sums object.
My question is how do I get this output without a for loop?
Here is a solution using the pivot_ functions from tidyr. The approach pivots to a longer form so that you can group by column name and whether the column value is NA.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = c("A", "B", "C")) %>%
mutate(isna = is.na(value)) %>%
group_by(name, isna) %>%
summarize(value = sum(REF)) %>%
pivot_wider()
isna A B C
<lgl> <dbl> <dbl> <dbl>
1 FALSE 110 100 30
2 TRUE 40 50 120
df <- data.frame(A = c(1,2,3,NA,4),
B = c(NA,2,3,NA,5),
C = c(3,4,NA,NA,NA),
REF = c(10,20,30,40,50))
library(tidyverse)
imap(.x = df[1:3],
.f = ~ df %>%
group_by(grp = is.na(.x)) %>%
summarise(!!.y := sum(REF, na.rm = T))) %>%
reduce(left_join)
#> Joining, by = "grp"
#> Joining, by = "grp"
#> # A tibble: 2 x 4
#> grp A B C
#> <lgl> <dbl> <dbl> <dbl>
#> 1 FALSE 110 100 30
#> 2 TRUE 40 50 120
Created on 2022-01-26 by the reprex package (v2.0.1)
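For comparison, the same table can be sketched in base R without an explicit for loop. This is not taken from either answer; it simply applies the definition (sum REF where the column is non-NA, and where it is NA) to each of the three columns with sapply.

```r
df <- data.frame(A = c(1, 2, 3, NA, 4),
                 B = c(NA, 2, 3, NA, 5),
                 C = c(3, 4, NA, NA, NA),
                 REF = c(10, 20, 30, 40, 50))

# For each column, sum REF over the non-NA rows and over the NA rows
sums <- sapply(df[c("A", "B", "C")], function(col) {
  c("NON-NA" = sum(df$REF[!is.na(col)]),
    "NA"     = sum(df$REF[is.na(col)]))
})

print(sums)
```

The result is a 2 x 3 matrix with the same row and column names as the sums object in the question.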

Calculate mean and sd for given variables in a dataframe

Given a vector of names of numeric variables in a dataframe, I need to calculate mean and sd for each variable. For example, given the mtcars dataset and the following vector of variable names:
vars_to_transform <- c("mpg", "disp")
I'd like to have the following as result:
The first solution that came into my mind is the following:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } )
The result is the following:
As you can see, all the returned variables are characters, but I expected to have numbers for avg and sd.
Is there a way to fix this? Or is there any better solution than this?
P.S.
I'm using purrr 0.3.4
Seems like an overcomplicated way of doing select->pivot->group->summarise.
mtcars %>%
select(all_of(vars_to_transform)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(
mean = mean(value),
sd = sd(value)
)
# A tibble: 2 x 3
name mean sd
<chr> <dbl> <dbl>
1 disp 231. 124.
2 mpg 20.1 6.03
The following works (instead of using c() in your code, use tibble):
vars_to_transform %>%
map_dfr(~ tibble(variable = .x, avg = mean(mtcars[[.x]], na.rm = T),
sd = sd(mtcars[[.x]], na.rm = T)))
Explanation: With c(), you are using a vector, whose elements must have the same type (character in your case, because variable is character). With tibble, one can have a different type per element.
@Gwang-Jin Kim suggests in a comment below (thank you) that one could also have used list instead of tibble.
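The coercion described above can be checked directly. A small sketch using only base R and the built-in mtcars data (the names as_vector and as_list are made up for the illustration):

```r
# c() builds an atomic vector, so the number is coerced to character
as_vector <- c(variable = "mpg", avg = mean(mtcars$mpg))
print(class(as_vector))

# list() keeps each element's own type
as_list <- list(variable = "mpg", avg = mean(mtcars$mpg))
print(sapply(as_list, class))
```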
Or try with adding type.convert:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } ) %>%
type.convert(as.is=T)
#> # A tibble: 2 × 3
#> variable avg sd
#> <chr> <dbl> <dbl>
#> 1 mpg 20.1 6.03
#> 2 disp 231. 124.
Another option:
library(purrr)
library(dplyr)
vars_to_transform <- c("mpg", "disp")
funs <- lst(mean, sd)
mtcars %>%
select(all_of(vars_to_transform)) %>%
map_df(~ funs %>%
map(exec, .x), .id = "var")
# A tibble: 2 x 3
var mean sd
<chr> <dbl> <dbl>
1 mpg 20.1 6.03
2 disp 231. 124.
m <- mtcars[, vars_to_transform]
tibble(variable = names(m), avg = apply(m, 2, mean), sd = apply(m, 2, sd))
## A tibble: 2 × 3
# variable avg sd
# <chr> <dbl> <dbl>
#1 mpg 20.1 6.03
#2 disp 231. 124.

Perform multiple two-sample t-test using dplyr in R

I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract p-values for every comparison. A shorter representative example of the data, using only 2 variables could be the following;
df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))
Ideally the end result will be a summarised data frame containing four columns; one for the variable being tested (var1, var2 etc.), two for the groups being tested every time and a final one for the p-value.
I've tried duplicating the group column in the long form, and doing a double group_by in order to do the comparisons but with no result
result <- df %>%
pivot_longer(var1:var2, "var", "value") %>%
rename(group_a = group) %>%
mutate(group_b = group_a) %>%
group_by(group_a, group_b) %>%
summarise(n = n())
We can reshape the data into 'long' format with pivot_longer, then, grouped by 'group', apply pairwise.t.test, convert the result into a tibble with tidy (from broom), and unnest the list column
library(dplyr)
library(tidyr)
library(broom)
df %>%
pivot_longer(cols = -group, names_to = 'grp') %>%
group_by(group) %>%
summarise(out = list(pairwise.t.test(value, grp
) %>%
tidy)) %>%
unnest(c(out))
-output
# A tibble: 3 x 4
group group1 group2 p.value
<int> <chr> <chr> <dbl>
1 1 var2 var1 0.0760
2 2 var2 var1 0.0233
3 3 var2 var1 0.000244
In case you end up wanting more information about the t-tests, here is an approach that will allow you to extract more information such as the degrees of freedom and value of the test statistic:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- tibble(
var1 = rnorm(90, 1, 1),
var2 = rnorm(90, 1.5, 1),
group = rep(1:3, each = 30)
)
df %>%
select(-group) %>%
names() %>%
map_dfr(~ {
y <- .
combn(3, 2) %>%
t() %>%
as.data.frame() %>%
pmap_dfr(function(V1, V2) {
df %>%
select(group, all_of(y)) %>%
filter(group %in% c(V1, V2)) %>%
t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>%
tidy() %>%
transmute(y = y,
group_1 = V1,
group_2 = V2,
df = parameter,
t_value = statistic,
p_value = p.value
)
})
})
#> # A tibble: 6 x 6
#> y group_1 group_2 df t_value p_value
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 var1 1 2 58 -0.337 0.737
#> 2 var1 1 3 58 -1.35 0.183
#> 3 var1 2 3 58 -1.06 0.295
#> 4 var2 1 2 58 -0.152 0.879
#> 5 var2 1 3 58 1.72 0.0908
#> 6 var2 2 3 58 1.67 0.100
And here is @akrun's answer tweaked to give the same p-values as the above approach. Note the p.adjust.method = "none", which gives unadjusted pairwise t-tests and will therefore inflate your Type I error rate.
df %>%
pivot_longer(
cols = -group,
names_to = "y"
) %>%
group_by(y) %>%
summarise(
out = list(
tidy(
pairwise.t.test(
value,
group,
p.adjust.method = "none",
pool.sd = FALSE
)
)
)
) %>%
unnest(c(out))
#> # A tibble: 6 x 4
#> y group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 var1 2 1 0.737
#> 2 var1 3 1 0.183
#> 3 var1 3 2 0.295
#> 4 var2 2 1 0.879
#> 5 var2 3 1 0.0909
#> 6 var2 3 2 0.100
Created on 2021-07-30 by the reprex package (v1.0.0)
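If the unadjusted p-values are kept, a correction can still be applied afterwards with p.adjust from the stats package. The sketch below takes the three var1 p-values from the output above as plain numbers (they come from random data, so they will differ on another run) and applies two common corrections.

```r
# Unadjusted p-values for var1, copied from the output above
p_raw <- c(0.737, 0.183, 0.295)

# Two common multiple-comparison corrections from the stats package
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

print(data.frame(p_raw, p_holm, p_bonf))
```

Adjusted p-values are never smaller than the raw ones. With about 400 variables, whether to adjust within each variable or across all comparisons is a design decision this sketch does not settle.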

Passing parameters into function that uses dplyr

I have the following function to describe a variable
library(dplyr)
describe = function(.data, variable){
args <- as.list(match.call())
evalue = eval(args$variable, .data)
summarise(.data,
'n'= length(evalue),
'mean' = mean(evalue),
'sd' = sd(evalue))
}
I want to use dplyr for describing the variable.
set.seed(1)
df = data.frame(
'g' = sample(1:3, 100, replace=T),
'x1' = rnorm(100),
'x2' = rnorm(100)
)
df %>% describe(x1)
# n mean sd
# 1 100 -0.01757949 0.9400179
The problem is that when I try to apply the same descriptive function after group_by, the describe function is not applied within each group
df %>% group_by(g) %>% describe(x1)
# # A tibble: 3 x 4
# g n mean sd
# <int> <int> <dbl> <dbl>
# 1 1 100 -0.01757949 0.9400179
# 2 2 100 -0.01757949 0.9400179
# 3 3 100 -0.01757949 0.9400179
How would you change the function to obtain what is desired using a small number of modifications?
You need tidyeval:
describe = function(.data, variable){
evalue = enquo(variable)
summarise(.data,
'n'= length(!!evalue),
'mean' = mean(!!evalue),
'sd' = sd(!!evalue))
}
df %>% group_by(g) %>% describe(x1)
# A tibble: 3 x 4
g n mean sd
<int> <int> <dbl> <dbl>
1 1 27 -0.23852862 1.0597510
2 2 38 0.11327236 0.8470885
3 3 35 0.01079926 0.9351509
The dplyr vignette 'Programming with dplyr' has a thorough description of using enquo and !!
Edit:
In response to Axeman's comment, I'm not 100% sure why the group_by and describe does not work here.
However, using debugonce with the function in its original form
debugonce(describe)
df %>% group_by(g) %>% describe(x1)
one can see that evalue is not grouped and is just a numeric vector of length 100.
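With rlang 0.4.0 or later, the enquo() and !! pair can also be written with the embrace operator {{ }}. A minimal sketch, assuming a recent dplyr; the function name describe2 and the toy data frame are hypothetical and chosen so the group means are easy to verify by hand.

```r
library(dplyr)

# Hypothetical variant of describe(): {{ variable }} forwards the unquoted
# column name into summarise(), so it is evaluated once per group
describe2 <- function(data, variable) {
  summarise(data,
            n    = n(),
            mean = mean({{ variable }}),
            sd   = sd({{ variable }}))
}

toy <- data.frame(g = c(1, 1, 2, 2), x = c(1, 3, 10, 20))

res <- toy %>% group_by(g) %>% describe2(x)
print(res)
```

Because the column is only evaluated inside summarise(), it is computed per group, unlike the eval() call in the original function, which runs once on the whole data frame before summarise() is reached.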
Base NSE appears to work, too:
describe <- function(data, var){
var_q <- substitute(var)
data %>%
summarise(n = n(),
mean = mean(eval(var_q)),
sd = sd(eval(var_q)))
}
df %>% describe(x1)
n mean sd
1 100 -0.1266289 1.006795
df %>% group_by(g) %>% describe(x1)
# A tibble: 3 x 4
g n mean sd
<int> <int> <dbl> <dbl>
1 1 33 -0.1379206 1.107412
2 2 29 -0.4869704 0.748735
3 3 38 0.1581745 1.020831
