n() not showing in the summarise() output from tidyvesre in R - r

I my code below, I was wondering why the result of n = n() is not shown in the final output?
library(tidyverse)
hsb <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/hsb.csv')
hsb %>% dplyr::select(math, sector) %>% group_by(sector) %>%
summarise(across(.fns = list(mean=mean, sd=sd), n = n()))

The issue seems to be with the closing bracket of across. We want the n to be a single column instead of repeating for each case, so for that, we can close the across and use n = n() separately i.e outside the across
library(dplyr)
hsb %>%
dplyr::select(math, sector) %>%
group_by(sector) %>%
summarise(across(.fns = list(mean=mean, sd=sd)), n = n(), .groups = 'drop')
# A tibble: 2 x 4
# sector math_mean math_sd n
# <int> <dbl> <dbl> <int>
#1 0 11.4 7.08 3642
#2 1 14.2 6.36 3543
Just to show that if we need multiple 'n' columns (not really needed). Here, we select only two columns and one of them is the grouping column, so it would return only a single 'n'
hsb %>%
dplyr::select(math, sector) %>%
group_by(sector) %>%
summarise(across(.fns = list(mean = mean, sd = sd,
n = ~ n())), .groups = 'drop')=
# A tibble: 2 x 4
# sector math_mean math_sd math_n
# <int> <dbl> <dbl> <int>
#1 0 11.4 7.08 3642
#2 1 14.2 6.36 3543

Related

Dplyr Summarise Groups as Column Names

I got a data frame with a lot of columns and want to summarise them with multiple functions.
test_df <- data.frame(Group = sample(c("A", "B", "C"), 10, T), var1 = sample(1:5, 10, T), var2 = sample(3:7, 10, T))
test_df %>%
group_by(Group) %>%
summarise_all(c(Mean = mean, Sum = sum))
# A tibble: 3 x 5
Group var1_Mean var2_Mean var1_Sum var2_Sum
<chr> <dbl> <dbl> <int> <int>
1 A 3.14 5.14 22 36
2 B 4.5 4.5 9 9
3 C 4 6 4 6
This results in a tibble with the first row Group and column names with a combination of the previous column name and the function name.
The desired result is a table with the previous column names as first row and the groups and functions in the column names.
I can achive this with
test_longer <- test_df %>% pivot_longer(cols = starts_with("var"), names_to = "var", values_to = "val")
# Add row number because spread needs unique identifiers for rows
test_longer <- test_longer %>%
group_by(Group) %>%
mutate(grouped_id = row_number())
spread(test_longer, Group, val) %>%
select(-grouped_id) %>%
group_by(var) %>%
summarise_all(c(Mean = mean, Sum = sum), na.rm = T)
# A tibble: 2 x 7
var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
<chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 var1 3.14 4.5 4 22 9 4
2 var2 5.14 4.5 6 36 9 6
But this seems to be a rather long detour... There probably is a better way, but I could not find it. Any suggestions? Thank you
There's lots of ways to go about it, but I would simplify it by pivoting to a longer data frame initially, and then grouping by var and group. Then you can just pivot wider to get the final result you want. Note that I used summarize(across()) which replaces the deprecated summarize_all(), even though with a single column could've just manually specified Mean = ... and Sum = ....
set.seed(123)
test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22

Find relative frequencies of summarized columns in R

I need to get the relative frequencies of a summarized column in R. I've used dplyr's summarize to find the total of each grouped row, like this:
data %>%
group_by(x) %>%
summarise(total = sum(dollars))
x total
<chr> <dbl>
1 expense 1 3600
2 expense 2 2150
3 expense 3 2000
But now I need to create a new column for the relative frequencies of each total row to get this result:
x total p
<chr> <dbl> <dbl>
1 expense 1 3600 46.45%
2 expense 2 2150 27.74%
3 expense 3 2000 25.81%
I've tried this:
data %>%
group_by(x) %>%
summarise(total = sum(dollars), p = scales::percent(total/sum(total))
and this:
data %>%
group_by(x) %>%
summarise(total = sum(dollars), p = total/sum(total)*100)
but the result is always this:
x total p
<chr> <dbl> <dbl>
1 expense 1 3600 100%
2 expense 2 2150 100%
3 expense 3 2000 100%
The problem seems to be the summarized total column that may be affecting the results. Any ideas to help me? Thanks
You get 100% because of the grouping. However, after you've summarized, dplyr will drop the one level of grouping. Meaning that if you e.g. do mutate() after, you get the results you need:
library(dplyr)
data <- tibble(
x = c("expense 1", "expense 2", "expense 3"),
dollars = c(3600L, 2150L, 2000L)
)
data %>%
group_by(x) %>%
summarise(total = sum(dollars)) %>%
mutate(p = total/sum(total)*100)
# A tibble: 3 x 3
x total p
<chr> <int> <dbl>
1 expense 1 3600 46.5
2 expense 2 2150 27.7
3 expense 3 2000 25.8
You get 100% because it calculates the total of that particular group. You need to ungroup. Assuming that you want to divide by total entries just divide by nrow(df).
data %>%
group_by(x) %>%
summarise(total = sum(dollars), p = total/nrow(data)*100)
After the first sum, ungroup and create p with mutate.
iris %>%
group_by(Species) %>%
summarise(total = sum(Sepal.Length)) %>%
ungroup() %>%
mutate(p = total/sum(total)*100)
## A tibble: 3 x 3
# Species total p
# <fct> <dbl> <dbl>
#1 setosa 250. 28.6
#2 versicolor 297. 33.9
#3 virginica 329. 37.6

dplyr number of rows across groups after filtering

I want the count and proportion (of all of elements) of each group in a data frame (after filtering). This code produces the desired output:
library(dplyr)
df <- data_frame(id = sample(letters[1:3], 100, replace = TRUE),
value = rnorm(100))
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(proportion = count / sum(count))
> summary
# A tibble: 3 x 3
id count proportion
<chr> <int> <dbl>
1 a 17 0.3695652
2 b 13 0.2826087
3 c 16 0.3478261
Is there an elegant solution to avoid the ungroup() and second summarize() steps. Something like:
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n(),
proportion = n() / [?TOTAL_ROWS()?])
I couldn't find such a function in the documentation, but I must be missing something obvious. Thanks!
You can use nrow on . which refers to the entire data frame piped in:
df %>%
filter(value > 0) %>%
group_by(id) %>%
summarise(count = n(), proportion = count / nrow(.))
# A tibble: 3 x 3
# id count proportion
# <chr> <int> <dbl>
#1 a 14 0.2592593
#2 b 22 0.4074074
#3 c 18 0.3333333

Passing column names through multiple functions with dplyr

I wrote a simple function to create tables of percentages in dplyr:
library(dplyr)
df = tibble(
Gender = sample(c("Male", "Female"), 100, replace = TRUE),
FavColour = sample(c("Red", "Blue"), 100, replace = TRUE)
)
quick_pct_tab = function(df, col) {
col_quo = enquo(col)
df %>%
count(!! col_quo) %>%
mutate(Percent = (100 * n / sum(n)))
}
df %>% quick_pct_tab(FavColour)
# Output:
# A tibble: 2 x 3
FavColour n Percent
<chr> <int> <dbl>
1 Blue 58 58
2 Red 42 42
This works great. However, when I tried to build on top of this, writing a new function that calculated the same percentages with grouping, I could not figure out how to use quick_pct_tab within the new function - after trying multiple different combinations of quo(col), !! quo(col) and enquo(col), etc.
bygender_tab = function(df, col) {
col_enquo = enquo(col)
# Want to replace this with
# df %>% quick_pct_tab(col)
gender_tab = df %>%
group_by(Gender) %>%
count(!! col_enquo) %>%
mutate(Percent = (100 * n / sum(n)))
gender_tab %>%
select(!! col_enquo, Gender, Percent) %>%
spread(Gender, Percent)
}
> df %>% bygender_tab(FavColour)
# A tibble: 2 x 3
FavColour Female Male
* <chr> <dbl> <dbl>
1 Blue 52.08333 63.46154
2 Red 47.91667 36.53846
From what I understand non-standard evaluation in dplyr is deprecated so it would be great to learn how to achieve this using dplyr > 0.7. How do I have to quote the col argument to pass it through to a further dplyr function?
We need to do !! to trigger the evaluation of the 'col_enquo'
bygender_tab = function(df, col) {
col_enquo = enquo(col)
df %>%
group_by(Gender) %>%
quick_pct_tab(!!col_enquo) %>% ## change
select(!! col_enquo, Gender, Percent) %>%
spread(Gender, Percent)
}
df %>%
bygender_tab(FavColour)
# A tibble: 2 x 3
# FavColour Female Male
#* <chr> <dbl> <dbl>
#1 Blue 54.54545 41.07143
#2 Red 45.45455 58.92857
Using the OP's function, the output is
# A tibble: 2 x 3
# FavColour Female Male
#* <chr> <dbl> <dbl>
#1 Blue 54.54545 41.07143
#2 Red 45.45455 58.92857
Note that the seed was not set while creating the dataset
Update
with rlang version 0.4.0 (ran with dplyr - 0.8.2), we can also use the {{...}} to do quote, unquote, substitution
bygender_tabN = function(df, col) {
df %>%
group_by(Gender) %>%
quick_pct_tab({{col}}) %>% ## change
select({{col}}, Gender, Percent) %>%
spread(Gender, Percent)
}
df %>%
bygender_tabN(FavColour)
# A tibble: 2 x 3
# FavColour Female Male
# <chr> <dbl> <dbl>
#1 Blue 50 46.3
#2 Red 50 53.7
-checking output with previous function (set.seed was not provided)
df %>%
bygender_tab(FavColour)
# A tibble: 2 x 3
# FavColour Female Male
# <chr> <dbl> <dbl>
#1 Blue 50 46.3
#2 Red 50 53.7

How to gather then mutate a new column then spread to wide format again

Using tidyr/dplyr, I have some factor columns which I'd like to Z-score, and then mutate an average Z-score, whilst retaining the original data for reference.
I'd like to avoid using a for loop in tidyr/dplyr, thus I'm gathering my data and performing my calculation (Z-score) on a single column. However, I'm struggling with restoring the wide format.
Here is a MWE:
library(dplyr)
library(tidyr)
# Original Data
dfData <- data.frame(
Name = c("Steve","Jwan","Ashley"),
A = c(10,20,12),
B = c(0.2,0.3,0.5)
) %>% tbl_df()
# Gather to Z-score
dfLong <- dfData %>% gather("Factor","Value",A:B) %>%
mutate(FactorZ = paste0("Z_",Factor)) %>%
group_by(Factor) %>%
mutate(ValueZ = (Value - mean(Value,na.rm = TRUE))/sd(Value,na.rm = TRUE))
# Now go wide to do some mutations (eg Z)Avg = (Z_A + Z_B)/2)
# This does not work
dfWide <- dfLong %>%
spread(Factor,Value) %>%
spread(FactorZ,ValueZ)%>%
mutate(Z_Avg = (Z_A+Z_B)/2)
# This is the desired result
dfDesired <- dfData %>% mutate(Z_A = (A - mean(A,na.rm = TRUE))/sd(A,na.rm = TRUE)) %>% mutate(Z_B = (B - mean(B,na.rm = TRUE))/sd(B,na.rm = TRUE)) %>%
mutate(Z_Avg = (Z_A+Z_B)/2)
Thanks for any help/input!
Another approach using dplyr (version 0.5.0)
library(dplyr)
dfData %>%
mutate_each(funs(Z = scale(.)), -Name) %>%
mutate(Z_Avg = (A_Z+B_Z)/2)
means <-function(x)mean(x, na.rm=T)
dfWide %>% group_by(Name) %>% summarise_each(funs(means)) %>% mutate(Z_Avg = (Z_A + Z_B)/2)
# A tibble: 3 x 6
Name A B Z_A Z_B Z_Avg
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
3 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
Here is one approach with long and wide format. For z-transformation, you can use the base function scale. Furthermore, this approach includes a join to combine the original data frame and the one including the new values.
dfLong <- dfData %>%
gather(Factor, Value, A:B) %>%
group_by(Factor) %>%
mutate(ValueZ = scale(Value))
# Name Factor Value ValueZ
# <fctr> <chr> <dbl> <dbl>
# 1 Steve A 10.0 -0.7559289
# 2 Jwan A 20.0 1.1338934
# 3 Ashley A 12.0 -0.3779645
# 4 Steve B 0.2 -0.8728716
# 5 Jwan B 0.3 -0.2182179
# 6 Ashley B 0.5 1.0910895
dfWide <- dfData %>% inner_join(dfLong %>%
ungroup %>%
select(-Value) %>%
mutate(Factor = paste0("Z_", Factor)) %>%
spread(Factor, ValueZ) %>%
mutate(Z_Avg = (Z_A + Z_B) / 2))
# Name A B Z_A Z_B Z_Avg
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
# 2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
# 3 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
I would just do it all in wide format. No need to keep switching between the long and wide formats.
dfData %>%
mutate(Z_A=(A-mean(unlist(dfData$A)))/sd(unlist(dfData$A)),
Z_B=(B-mean(unlist(dfData$B)))/sd(unlist(dfData$B))) %>%
mutate(Z_AVG=(Z_A+Z_B)/2)

Resources