I wrote a simple function to create tables of percentages in dplyr:
library(dplyr)
library(tidyr)  # provides spread(), used in bygender_tab() below
df = tibble(
Gender = sample(c("Male", "Female"), 100, replace = TRUE),
FavColour = sample(c("Red", "Blue"), 100, replace = TRUE)
)
quick_pct_tab = function(df, col) {
col_quo = enquo(col)
df %>%
count(!! col_quo) %>%
mutate(Percent = (100 * n / sum(n)))
}
df %>% quick_pct_tab(FavColour)
# Output:
# A tibble: 2 x 3
FavColour n Percent
<chr> <int> <dbl>
1 Blue 58 58
2 Red 42 42
This works great. However, when I tried to build on it with a new function that calculates the same percentages within groups, I could not figure out how to call quick_pct_tab from inside the new function, even after trying many combinations of quo(col), !! quo(col), enquo(col), etc.
bygender_tab = function(df, col) {
col_enquo = enquo(col)
# Want to replace this with
# df %>% quick_pct_tab(col)
gender_tab = df %>%
group_by(Gender) %>%
count(!! col_enquo) %>%
mutate(Percent = (100 * n / sum(n)))
gender_tab %>%
select(!! col_enquo, Gender, Percent) %>%
spread(Gender, Percent)
}
> df %>% bygender_tab(FavColour)
# A tibble: 2 x 3
FavColour Female Male
* <chr> <dbl> <dbl>
1 Blue 52.08333 63.46154
2 Red 47.91667 36.53846
From what I understand, the older standard-evaluation verbs (such as count_()) are deprecated, so it would be great to learn how to achieve this with tidy evaluation in dplyr >= 0.7. How do I have to quote the col argument to pass it through to a further dplyr function?
We need to unquote with !! so that the quosure stored in col_enquo, rather than the name col_enquo itself, is passed through to quick_pct_tab:
bygender_tab = function(df, col) {
col_enquo = enquo(col)
df %>%
group_by(Gender) %>%
quick_pct_tab(!!col_enquo) %>% ## change
select(!! col_enquo, Gender, Percent) %>%
spread(Gender, Percent)
}
df %>%
bygender_tab(FavColour)
# A tibble: 2 x 3
# FavColour Female Male
#* <chr> <dbl> <dbl>
#1 Blue 54.54545 41.07143
#2 Red 45.45455 58.92857
Using the OP's function, the output is
# A tibble: 2 x 3
# FavColour Female Male
#* <chr> <dbl> <dbl>
#1 Blue 54.54545 41.07143
#2 Red 45.45455 58.92857
Note that no seed was set when creating the dataset, so these numbers differ from those in the question.
Update
With rlang >= 0.4.0 (tested with dplyr 0.8.2), we can also use the curly-curly operator {{ }}, which quotes and unquotes the argument in a single step:
bygender_tabN = function(df, col) {
df %>%
group_by(Gender) %>%
quick_pct_tab({{col}}) %>% ## change
select({{col}}, Gender, Percent) %>%
spread(Gender, Percent)
}
df %>%
bygender_tabN(FavColour)
# A tibble: 2 x 3
# FavColour Female Male
# <chr> <dbl> <dbl>
#1 Blue 50 46.3
#2 Red 50 53.7
Checking the output against the previous function (no seed was set, so the numbers differ from the earlier run):
df %>%
bygender_tab(FavColour)
# A tibble: 2 x 3
# FavColour Female Male
# <chr> <dbl> <dbl>
#1 Blue 50 46.3
#2 Red 50 53.7
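For reference, here is a minimal self-contained sketch of the same pass-through pattern with a fixed seed, so the result is reproducible. It assumes rlang >= 0.4.0 and tidyr >= 1.0.0; the seed value, the name bygender_tab2 and the use of pivot_wider() in place of spread() are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only for reproducibility
df <- tibble(
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  FavColour = sample(c("Red", "Blue"), 100, replace = TRUE)
)

# Inner helper: {{ col }} forwards the caller's column to count()
quick_pct_tab <- function(df, col) {
  df %>%
    count({{ col }}) %>%
    mutate(Percent = 100 * n / sum(n))
}

# Outer wrapper (hypothetical name): forwards col once more with {{ }}
bygender_tab2 <- function(df, col) {
  df %>%
    group_by(Gender) %>%
    quick_pct_tab({{ col }}) %>%   # percentages are computed within each Gender
    ungroup() %>%
    select({{ col }}, Gender, Percent) %>%
    pivot_wider(names_from = Gender, values_from = Percent)
}

res <- df %>% bygender_tab2(FavColour)
res
```

Each of the Female and Male columns should sum to 100, since the percentages are taken within each gender.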
I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract p-values for every comparison. A shorter representative example of the data, using only 2 variables, could be the following:
df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))
Ideally the end result will be a summarised data frame containing four columns: one for the variable being tested (var1, var2, etc.), two for the pair of groups being compared, and a final one for the p-value.
I've tried duplicating the group column in the long form and doing a double group_by in order to do the comparisons, but without success:
result <- df %>%
pivot_longer(var1:var2, names_to = "var", values_to = "value") %>%
rename(group_a = group) %>%
mutate(group_b = group_a) %>%
group_by(group_a, group_b) %>%
summarise(n = n())
We can reshape the data into 'long' format with pivot_longer, then, grouped by 'group', apply pairwise.t.test, convert the result to a tibble with tidy (from broom), and unnest the list column. Note that, as written, this compares var1 against var2 within each group.
library(dplyr)
library(tidyr)
library(broom)
df %>%
pivot_longer(cols = -group, names_to = 'grp') %>%
group_by(group) %>%
summarise(out = list(tidy(pairwise.t.test(value, grp)))) %>%
unnest(c(out))
-output
# A tibble: 3 x 4
group group1 group2 p.value
<int> <chr> <chr> <dbl>
1 1 var2 var1 0.0760
2 2 var2 var1 0.0233
3 3 var2 var1 0.000244
In case you end up wanting more information about the t-tests, here is an approach that will allow you to extract more information such as the degrees of freedom and value of the test statistic:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- tibble(
var1 = rnorm(90, 1, 1),
var2 = rnorm(90, 1.5, 1),
group = rep(1:3, each = 30)
)
df %>%
select(-group) %>%
names() %>%
map_dfr(~ {
y <- .
combn(3, 2) %>%
t() %>%
as.data.frame() %>%
pmap_dfr(function(V1, V2) {
df %>%
select(group, all_of(y)) %>%
filter(group %in% c(V1, V2)) %>%
t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>%
tidy() %>%
transmute(y = y,
group_1 = V1,
group_2 = V2,
df = parameter,
t_value = statistic,
p_value = p.value
)
})
})
#> # A tibble: 6 x 6
#> y group_1 group_2 df t_value p_value
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 var1 1 2 58 -0.337 0.737
#> 2 var1 1 3 58 -1.35 0.183
#> 3 var1 2 3 58 -1.06 0.295
#> 4 var2 1 2 58 -0.152 0.879
#> 5 var2 1 3 58 1.72 0.0908
#> 6 var2 2 3 58 1.67 0.100
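And here is @akrun's answer tweaked to give the same p-values as the above approach (to about three significant digits): it groups by variable and compares the groups. Note that p.adjust.method = "none" applies no multiple-comparison correction, so the tests are treated as independent, which will inflate your Type I error rate.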
df %>%
pivot_longer(
cols = -group,
names_to = "y"
) %>%
group_by(y) %>%
summarise(
out = list(
tidy(
pairwise.t.test(
value,
group,
p.adjust.method = "none",
pool.sd = FALSE
)
)
)
) %>%
unnest(c(out))
#> # A tibble: 6 x 4
#> y group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 var1 2 1 0.737
#> 2 var1 3 1 0.183
#> 3 var1 3 2 0.295
#> 4 var2 2 1 0.879
#> 5 var2 3 1 0.0909
#> 6 var2 3 2 0.100
Created on 2021-07-30 by the reprex package (v1.0.0)
My code is messy.
If a vote's total is smaller than two, it should be renamed to "unpopular".
df <- data.frame(vote=c("A","A","A","B","B","B","B","B","B","C","D"),
val=c(rep(1,11))
)
df %>% group_by(vote) %>% summarise(val=sum(val))
Output:
vote val
<fct> <dbl>
1 A 3
2 B 6
3 C 1
4 D 1
but I need
vote val
<fct> <dbl>
1 A 3
2 B 6
3 unpopular 2
my idea is
df2 <- df %>% group_by(vote) %>% summarise(val=sum(val))
df2$vote[df2$val < 2] <- "unpop"
df2 %>% group_by....
That's not elegant.
Do you know of a cleaner, more helpful function?
We can do a double grouping
library(dplyr)
df %>%
group_by(vote) %>%
summarise(val=sum(val)) %>%
group_by(vote = replace(vote, val <2, 'unpop')) %>%
summarise(val = sum(val))
-output
# A tibble: 3 x 2
# vote val
# <chr> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
Or another option with rowsum
df %>%
group_by(vote = replace(vote, vote %in%
names(which((rowsum(val, vote) < 2)[,1])), 'unpopular')) %>%
summarise(val = sum(val))
Or using fct_lump_n from forcats
library(forcats)
df %>%
group_by(vote = fct_lump_n(vote, 2, other_level = "unpop")) %>%
summarise(val = sum(val))
# A tibble: 3 x 2
# vote val
# <fct> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
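If the rule should be "total val below 2" rather than "keep the two most frequent levels", fct_lump_min() expresses it more directly. A minimal sketch, assuming a reasonably recent forcats; the w argument weights each row by val, so the threshold applies to the summed val per vote:

```r
library(dplyr)
library(forcats)

df <- data.frame(
  vote = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "D"),
  val = rep(1, 11)
)

res <- df %>%
  # lump every level whose weighted count (sum of val) is below 2
  group_by(vote = fct_lump_min(vote, min = 2, w = val, other_level = "unpop")) %>%
  summarise(val = sum(val))
res
```

Unlike fct_lump_n(vote, 2, ...), this keeps however many levels reach the threshold, so it does not need to know in advance how many "popular" votes there are.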
Or using table
df %>%
group_by(vote = replace(vote,
vote %in% names(which(table(vote) < 2)), 'unpop')) %>%
summarise(val = sum(val))
If you want to relabel the votes based on the sum of val, you can do this in base R:
aggregate(val~vote, transform(aggregate(val~vote, df, sum),
vote = replace(vote, val < 2, 'unpop')), sum)
# vote val
#1 A 3
#2 B 6
#3 unpop 2
In my code below, I was wondering why the result of n = n() is not shown in the final output?
library(tidyverse)
hsb <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/hsb.csv')
hsb %>% dplyr::select(math, sector) %>% group_by(sector) %>%
summarise(across(.fns = list(mean=mean, sd=sd), n = n()))
The issue is the position of the closing parenthesis of across(): n = n() ends up inside across() rather than being an argument of summarise(). We want n as a single column instead of repeating it for each column, so close across() first and then add n = n() separately, i.e. outside across():
library(dplyr)
hsb %>%
dplyr::select(math, sector) %>%
group_by(sector) %>%
summarise(across(.fns = list(mean=mean, sd=sd)), n = n(), .groups = 'drop')
# A tibble: 2 x 4
# sector math_mean math_sd n
# <int> <dbl> <dbl> <int>
#1 0 11.4 7.08 3642
#2 1 14.2 6.36 3543
For illustration only, n can also be computed inside across() as one of the functions, which gives one 'n' column per selected column (rarely needed). Here we select only two columns and one of them is the grouping column, so only a single 'n' column (math_n) is returned:
hsb %>%
dplyr::select(math, sector) %>%
group_by(sector) %>%
summarise(across(.fns = list(mean = mean, sd = sd,
n = ~ n())), .groups = 'drop')
# A tibble: 2 x 4
# sector math_mean math_sd math_n
# <int> <dbl> <dbl> <int>
#1 0 11.4 7.08 3642
#2 1 14.2 6.36 3543
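The same pattern on a small made-up data frame, so it runs without downloading hsb.csv. The toy values are invented for illustration, and the column is named explicitly in across() because, as far as I know, newer dplyr releases discourage calling across() without .cols:

```r
library(dplyr)

# Toy data (invented values), standing in for the hsb data set
toy <- tibble(
  sector = c(0, 0, 1, 1, 1),
  math = c(10, 12, 14, 16, 18)
)

res <- toy %>%
  group_by(sector) %>%
  summarise(
    across(math, list(mean = mean, sd = sd)),  # one column per function
    n = n(),                                   # a single count column, outside across()
    .groups = "drop"
  )
res
```

The result has the columns sector, math_mean, math_sd and n, with one row per sector.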
I have the following function to describe a variable
library(dplyr)
describe = function(.data, variable){
args <- as.list(match.call())
evalue = eval(args$variable, .data)
summarise(.data,
'n'= length(evalue),
'mean' = mean(evalue),
'sd' = sd(evalue))
}
I want to use dplyr for describing the variable.
set.seed(1)
df = data.frame(
'g' = sample(1:3, 100, replace=T),
'x1' = rnorm(100),
'x2' = rnorm(100)
)
df %>% describe(x1)
# n mean sd
# 1 100 -0.01757949 0.9400179
The problem is that when I try to apply the same description after group_by, the describe function is not applied within each group:
df %>% group_by(g) %>% describe(x1)
# # A tibble: 3 x 4
# g n mean sd
# <int> <int> <dbl> <dbl>
# 1 1 100 -0.01757949 0.9400179
# 2 2 100 -0.01757949 0.9400179
# 3 3 100 -0.01757949 0.9400179
How would you change the function to obtain the desired result with a small number of modifications?
You need tidyeval:
describe = function(.data, variable){
evalue = enquo(variable)
summarise(.data,
'n'= length(!!evalue),
'mean' = mean(!!evalue),
'sd' = sd(!!evalue))
}
df %>% group_by(g) %>% describe(x1)
# A tibble: 3 x 4
g n mean sd
<int> <int> <dbl> <dbl>
1 1 27 -0.23852862 1.0597510
2 2 38 0.11327236 0.8470885
3 3 35 0.01079926 0.9351509
The dplyr vignette 'Programming with dplyr' has a thorough description of how to use enquo and !!.
Edit:
In response to Axeman's comment, I'm not 100% sure why combining group_by and describe does not work here.
However, using debugonce with the function in its original form
debugonce(describe)
df %>% group_by(g) %>% describe(x1)
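one can see that evalue is not grouped and is just a numeric vector of length 100: eval(args$variable, .data) runs once on the whole data frame, before summarise() gets a chance to split it by group.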
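With rlang >= 0.4.0 the enquo()/!! pair can be collapsed into the curly-curly operator. A minimal sketch of the same function under that assumption; using n() in place of length() is a substitution made here, and it counts the rows of the current group:

```r
library(dplyr)

describe <- function(.data, variable) {
  summarise(.data,
            n = n(),                       # rows in the current group
            mean = mean({{ variable }}),   # {{ }} forwards the caller's column
            sd = sd({{ variable }}))
}

set.seed(1)
df <- data.frame(
  g = sample(1:3, 100, replace = TRUE),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

res <- df %>% group_by(g) %>% describe(x1)
res
```

Because the column is evaluated inside summarise(), the statistics are computed per group, and the n column adds up to the 100 rows of df.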
Base NSE appears to work, too:
describe <- function(data, var){
var_q <- substitute(var)
data %>%
summarise(n = n(),
mean = mean(eval(var_q)),
sd = sd(eval(var_q)))
}
df %>% describe(x1)
n mean sd
1 100 -0.1266289 1.006795
df %>% group_by(g) %>% describe(x1)
# A tibble: 3 x 4
g n mean sd
<int> <int> <dbl> <dbl>
1 1 33 -0.1379206 1.107412
2 2 29 -0.4869704 0.748735
3 3 38 0.1581745 1.020831
Using tidyr/dplyr, I have some factor columns which I'd like to Z-score, and then mutate an average Z-score, whilst retaining the original data for reference.
I'd like to avoid using a for loop in tidyr/dplyr, thus I'm gathering my data and performing my calculation (Z-score) on a single column. However, I'm struggling with restoring the wide format.
Here is a MWE:
library(dplyr)
library(tidyr)
# Original Data
dfData <- data.frame(
Name = c("Steve","Jwan","Ashley"),
A = c(10,20,12),
B = c(0.2,0.3,0.5)
) %>% tbl_df()
# Gather to Z-score
dfLong <- dfData %>% gather("Factor","Value",A:B) %>%
mutate(FactorZ = paste0("Z_",Factor)) %>%
group_by(Factor) %>%
mutate(ValueZ = (Value - mean(Value,na.rm = TRUE))/sd(Value,na.rm = TRUE))
# Now go wide to do some mutations (e.g. Z_Avg = (Z_A + Z_B)/2)
# This does not work
dfWide <- dfLong %>%
spread(Factor,Value) %>%
spread(FactorZ,ValueZ)%>%
mutate(Z_Avg = (Z_A+Z_B)/2)
# This is the desired result
dfDesired <- dfData %>% mutate(Z_A = (A - mean(A,na.rm = TRUE))/sd(A,na.rm = TRUE)) %>% mutate(Z_B = (B - mean(B,na.rm = TRUE))/sd(B,na.rm = TRUE)) %>%
mutate(Z_Avg = (Z_A+Z_B)/2)
Thanks for any help/input!
Another approach using dplyr (version 0.5.0)
library(dplyr)
dfData %>%
mutate_each(funs(Z = scale(.)), -Name) %>%
mutate(Z_Avg = (A_Z+B_Z)/2)
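Alternatively, starting from the question's dfWide, collapse the NA-padded rows by taking the mean per Name:
means <- function(x) mean(x, na.rm = TRUE)
dfWide %>% group_by(Name) %>% summarise_each(funs(means)) %>% mutate(Z_Avg = (Z_A + Z_B)/2)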
# A tibble: 3 x 6
Name A B Z_A Z_B Z_Avg
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
3 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
Here is one approach with long and wide format. For z-transformation, you can use the base function scale. Furthermore, this approach includes a join to combine the original data frame and the one including the new values.
dfLong <- dfData %>%
gather(Factor, Value, A:B) %>%
group_by(Factor) %>%
mutate(ValueZ = scale(Value))
# Name Factor Value ValueZ
# <fctr> <chr> <dbl> <dbl>
# 1 Steve A 10.0 -0.7559289
# 2 Jwan A 20.0 1.1338934
# 3 Ashley A 12.0 -0.3779645
# 4 Steve B 0.2 -0.8728716
# 5 Jwan B 0.3 -0.2182179
# 6 Ashley B 0.5 1.0910895
dfWide <- dfData %>% inner_join(dfLong %>%
ungroup %>%
select(-Value) %>%
mutate(Factor = paste0("Z_", Factor)) %>%
spread(Factor, ValueZ) %>%
mutate(Z_Avg = (Z_A + Z_B) / 2))
# Name A B Z_A Z_B Z_Avg
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
# 2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
# 3 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
I would just do it all in wide format. No need to keep switching between the long and wide formats.
dfData %>%
mutate(Z_A = (A - mean(A)) / sd(A),
Z_B = (B - mean(B)) / sd(B)) %>%
mutate(Z_AVG=(Z_A+Z_B)/2)
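With many columns, writing one line per variable gets tedious. A minimal sketch of the same wide-format idea using across(), assuming dplyr >= 1.0.0; the helper name z is hypothetical, and the .names pattern is chosen so the new columns match the Z_A / Z_B naming used in the question:

```r
library(dplyr)

dfData <- data.frame(
  Name = c("Steve", "Jwan", "Ashley"),
  A = c(10, 20, 12),
  B = c(0.2, 0.3, 0.5)
)

# Hypothetical helper: plain Z-score, ignoring missing values
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

res <- dfData %>%
  mutate(across(c(A, B), z, .names = "Z_{.col}")) %>%  # adds Z_A and Z_B, keeps A and B
  mutate(Z_Avg = (Z_A + Z_B) / 2)
res
```

The original columns are retained for reference, and c(A, B) can be replaced by any tidyselect expression (for example -Name) when there are more columns to score.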