dplyr pipe multiple datasets to summarize() - r

I am making a table using dplyr and want to perform the same summarize() call on multiple datasets. I know that in ggplot2 you can just swap out the dataset and rerun the plot, which is convenient.
Here's what I want to avoid:
table_1 <-
  group_by(df_1, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob == "broker"),
            pr_broker = n_broker / n_units)
table_2 <-
  group_by(df_2, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob == "broker"),
            pr_broker = n_broker / n_units)
Basically, is there a way to set up the summarize call as a function (or something similar) so I can just pour in df_1 and df_2?

If you know all the variable names in advance and if they are the same in all the data sets you want to look at, you can just do something like:
myfunc <- function(df) {
  df %>%
    group_by(cyl) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}
myfunc(mtcars)
#Source: local data frame [3 x 3]
#
#  cyl  n   mean_hp
#1   4 11  82.63636
#2   6  7 122.28571
#3   8 14 209.21429
And then use it with a different data set (that would have the same structure and variable names). If you need flexibility, i.e. you don't know all the variables in advance and want to be able to specify them as input to the function, look at the dplyr non-standard evaluation vignette.
Here's a tiny example of how you could build "standard evaluation" into your function to allow for more flexibility. If you wanted to let the user of the function specify the column by which the data should be grouped, you could do:
myfunc <- function(df, grp) {
  df %>%
    group_by_(grp) %>%   # notice the "group_by_" instead of "group_by"
    summarize(n = n(),
              mean_hp = mean(hp))
}
and then use it:
myfunc(mtcars, "gear")
#Source: local data frame [3 x 3]
#
#  gear  n  mean_hp
#1    3 15 176.1333
#2    4 12  89.5000
#3    5  5 195.6000
myfunc(mtcars, "cyl")
#Source: local data frame [3 x 3]
#
#  cyl  n   mean_hp
#1   4 11  82.63636
#2   6  7 122.28571
#3   8 14 209.21429
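Note that the underscore verbs such as group_by_() have since been deprecated in dplyr. As a minimal sketch (assuming a recent dplyr with tidy evaluation), the same flexibility can be had with the embrace operator {{ }}:
library(dplyr)

myfunc <- function(df, grp) {
  df %>%
    group_by({{ grp }}) %>%   # {{ }} forwards the unquoted column name
    summarize(n = n(),
              mean_hp = mean(hp))
}

myfunc(mtcars, gear)
myfunc(mtcars, cyl)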

The %>% operator simply passes the object on its left as the first argument to the function on its right, and summarize() just expects a tbl. So you can define
mysummary <- function(.data) {
  summarize(.data,
            n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob == "broker"),
            pr_broker = n_broker / n_units)
}
And then call
table_1 <- group_by(df_1, boro) %>% mysummary
table_2 <- group_by(df_2, boro) %>% mysummary
With an actual working example:
mysummary <- function(.data) {
  summarize(.data,
            ave.mpg = mean(mpg),
            ave.hp = mean(hp))
}
mtcars %>% group_by(cyl) %>% mysummary
mtcars %>% group_by(gear) %>% mysummary
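Since the pipe only supplies its left-hand side as the first argument, piping into mysummary is equivalent to calling it directly on the grouped data. A small self-contained illustration with mtcars:
library(dplyr)

mysummary <- function(.data) {
  summarize(.data,
            ave.mpg = mean(mpg),
            ave.hp = mean(hp))
}

# These two calls produce the same summary table:
mtcars %>% group_by(cyl) %>% mysummary()
mysummary(group_by(mtcars, cyl))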

Related

R: making group_by and summarise more efficient [duplicate]

I've got a data set called data with column headers Region, 2006, 2007, and so on up to 2020. The Region column gives the name of the area, while each year column gives the population of that region in that year.
The code below gives me my desired output (the total population for each year by region). However, it is very time consuming to type out. Is there a way to make this code more efficient and save typing out 15 different lines?
newData <- data %>%
  group_by(Region) %>%
  summarise(totalPop2006 = sum(`2006`, na.rm = TRUE),
            totalPop2007 = sum(`2007`, na.rm = TRUE),
            totalPop2008 = sum(`2008`, na.rm = TRUE),
            totalPop2009 = sum(`2009`, na.rm = TRUE),
            totalPop2010 = sum(`2010`, na.rm = TRUE),
            totalPop2011 = sum(`2011`, na.rm = TRUE),
            totalPop2012 = sum(`2012`, na.rm = TRUE),
            totalPop2013 = sum(`2013`, na.rm = TRUE),
            totalPop2014 = sum(`2014`, na.rm = TRUE),
            totalPop2015 = sum(`2015`, na.rm = TRUE),
            totalPop2016 = sum(`2016`, na.rm = TRUE),
            totalPop2017 = sum(`2017`, na.rm = TRUE),
            totalPop2018 = sum(`2018`, na.rm = TRUE),
            totalPop2019 = sum(`2019`, na.rm = TRUE),
            totalPop2020 = sum(`2020`, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(Region)
Thanks!
We can use summarise with across:
library(dplyr)
data %>%
  group_by(Region) %>%
  summarise(across(`2006`:`2020`, ~ sum(., na.rm = TRUE),
                   .names = 'totalPop{col}'), .groups = 'drop') %>%
  arrange(Region)
Using the built-in mtcars dataset:
data(mtcars)
mtcars %>%
  group_by(cyl) %>%
  summarise(across(disp:wt, ~ sum(., na.rm = TRUE), .names = 'totalPop{col}'),
            .groups = 'drop')
# A tibble: 3 x 5
#    cyl totalPopdisp totalPophp totalPopdrat totalPopwt
#  <dbl>        <dbl>      <dbl>        <dbl>      <dbl>
#1     4        1156.        909         44.8       25.1
#2     6        1283.        856         25.1       21.8
#3     8        4943.       2929         45.2       56.0
Or in base R with aggregate:
aggregate(. ~ Region, data[c('Region', 2006:2020)],
          sum, na.rm = TRUE, na.action = NULL)
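Another option (my own sketch, not part of the original answers, assuming the year columns really are named `2006` through `2020`) is to reshape to long format with tidyr so the year columns never have to be listed one by one:
library(dplyr)
library(tidyr)

# Stack the year columns into (year, pop) pairs, sum by Region and year,
# then spread back to one totalPop column per year.
data %>%
  pivot_longer(`2006`:`2020`, names_to = "year", values_to = "pop") %>%
  group_by(Region, year) %>%
  summarise(total = sum(pop, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = total,
              names_prefix = "totalPop") %>%
  arrange(Region)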

How can I loop different variables into the same command

I am trying to loop different variables into the same command:
Following is the list of variables and values I want to loop over:
behavior_list <- c("knocked1", "questions1", ...)
answer_list <- c(0, 1)
answer_label_list <- c("Yes", "No")
Following is the command:
data_aliki %>%
  group_by(indicator) %>%
  summarise(total_indicator = n(),
            yes_knocked1 = sum(knocked1 == 1, na.rm = TRUE))
I am trying to loop over the following:
yes_knocked1 = sum(knocked1==1, na.rm = TRUE)
no_knocked1 = sum(knocked1==0, na.rm = TRUE)
yes_questions1 = sum(questions1==1, na.rm = TRUE)
no_questions1 = sum(questions1==0, na.rm = TRUE)
Is there an easier way to do this instead of copy and paste?
You did not provide a reproducible example, so I will illustrate how to achieve what you want in dplyr for the mtcars data set:
mtcars %>%
  group_by(cyl) %>%
  summarize_at(c("mpg", "hp"), list("lt15" = ~ sum(. < 15, na.rm = TRUE),
                                    "lt18" = ~ sum(. < 18, na.rm = TRUE)))
Output:
    cyl mpg_lt15 hp_lt15 mpg_lt18 hp_lt18
  <dbl>    <int>   <int>    <int>   <int>
1     4        0       0        0       0
2     6        0       0        1       0
3     8        5       0       12       0
This should work in your case:
data_aliki %>%
  group_by(indicator) %>%
  summarize_at(c("knocked1", "questions1"),
               list("yes" = ~ sum(. == 1, na.rm = TRUE),
                    "no" = ~ sum(. == 0, na.rm = TRUE)))

How to Add Column Totals to Grouped Summaries in R

I'm in the process of creating summary tables based on subgroups and would love to add an overall summary in a tidier/more efficient manner.
What I have so far is this. I've created summaries via levels within my factor variables.
library(tidyverse)
df <- data.frame(var1 = 10:18,
                 var2 = c("A", "B", "A", "B", "A", "B", "A", "B", "A"))
group_summary <- df %>%
  group_by(var2) %>%
  filter(var2 != "NA") %>%
  summarise("Max" = max(var1, na.rm = TRUE),
            "Median" = median(var1, na.rm = TRUE),
            "Min" = min(var1, na.rm = TRUE),
            "IQR" = IQR(var1, na.rm = TRUE),
            "Count" = n())
Next I created an overall summary.
Summary <- df %>%
  filter(var2 != "NA") %>%
  summarise("Max" = max(var1, na.rm = TRUE),
            "Median" = median(var1, na.rm = TRUE),
            "Min" = min(var1, na.rm = TRUE),
            "IQR" = IQR(var1, na.rm = TRUE),
            "Count" = n())
Finally, I bound the two objects together with dplyr::bind_rows:
complete_summary <- bind_rows(Summary, group_summary)
What I've done works, but it is very verbose and surely not the most efficient way. I tried to use ungroup:
group_summary <- df %>%
  group_by(var2) %>%
  filter(var2 != "NA") %>%
  summarise("Max" = max(var1, na.rm = TRUE),
            "Median" = median(var1, na.rm = TRUE),
            "Min" = min(var1, na.rm = TRUE),
            "IQR" = IQR(var1, na.rm = TRUE),
            "Count" = n()) %>%
  ungroup %>%
  summarise("Max" = max(var1, na.rm = TRUE),
            "Median" = median(var1, na.rm = TRUE),
            "Min" = min(var1, na.rm = TRUE),
            "IQR" = IQR(var1, na.rm = TRUE),
            "Count" = n())
but it threw an error:
Evaluation error: object var1 not found.
Thanks in advance for your assistance.
Ideally, if you want to do it in one chain, you can use bind_rows to combine both results, just as you've done, but without the temporary objects you created.
library(tidyverse)
df <- data.frame(var1 = 10:18,
                 var2 = c("A", "B", "A", "B", "A", "B", "A", "B", "A"))
df %>%
  group_by(var2) %>%
  filter(var2 != "NA") %>%
  summarise("Max" = max(var1, na.rm = TRUE),
            "Median" = median(var1, na.rm = TRUE),
            "Min" = min(var1, na.rm = TRUE),
            "IQR" = IQR(var1, na.rm = TRUE),
            "Count" = n()) %>% #ungroup() %>%
  bind_rows(df %>% summarise("Max" = max(var1, na.rm = TRUE),
                             "Median" = median(var1, na.rm = TRUE),
                             "Min" = min(var1, na.rm = TRUE),
                             "IQR" = IQR(var1, na.rm = TRUE),
                             "Count" = n()))
#> # A tibble: 3 x 6
#>   var2    Max Median   Min   IQR Count
#>   <fct> <dbl>  <dbl> <dbl> <dbl> <int>
#> 1 A        18     14    10     4     5
#> 2 B        17     14    11     3     4
#> 3 <NA>     18     14    10     4     9
Created on 2019-01-29 by the reprex package (v0.2.1)
Not the most elegant solution either, but simple:
c <- mtcars %>%
  mutate(total_mean = mean(wt),
         total_median = median(wt)) %>%
  group_by(cyl) %>%
  summarise(meanweight = mean(wt),
            medianweight = median(wt),
            total_mean = first(total_mean),
            total_median = first(total_median))
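Another way to cut the repetition (my own sketch, not from the answers above) is to wrap the summary in a small helper function and bind the grouped and overall results:
library(dplyr)

df <- data.frame(var1 = 10:18,
                 var2 = c("A", "B", "A", "B", "A", "B", "A", "B", "A"))

# Helper: summarise var1 for whatever grouping the input carries.
summarise_var1 <- function(data) {
  data %>%
    summarise(Max = max(var1, na.rm = TRUE),
              Median = median(var1, na.rm = TRUE),
              Min = min(var1, na.rm = TRUE),
              IQR = IQR(var1, na.rm = TRUE),
              Count = n())
}

bind_rows(
  df %>% group_by(var2) %>% summarise_var1(),
  df %>% summarise_var1()   # overall row; var2 comes out as NA
)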

Drop rows in data frame only if values in two columns are reversed and all other values identical

I am working with the iris dataset, and manipulating it as follows to get a species, feature1, feature2, value data frame:
gatherpairs <- function(data, ...,
                        xkey = '.xkey', xvalue = '.xvalue',
                        ykey = '.ykey', yvalue = '.yvalue',
                        na.rm = FALSE, convert = FALSE, factor_key = FALSE) {
  vars <- quos(...)
  xkey <- enquo(xkey)
  xvalue <- enquo(xvalue)
  ykey <- enquo(ykey)
  yvalue <- enquo(yvalue)

  data %>% {
    cbind(gather(., key = !!xkey, value = !!xvalue, !!!vars,
                 na.rm = na.rm, convert = convert, factor_key = factor_key),
          select(., !!!vars))
  } %>%
    gather(., key = !!ykey, value = !!yvalue, !!!vars,
           na.rm = na.rm, convert = convert, factor_key = factor_key) %>%
    filter(!(.xkey == .ykey)) %>%
    mutate(var = apply(.[, c(".xkey", ".ykey")], 1,
                       function(x) paste(sort(x), collapse = ""))) %>%
    arrange(var)
}
test <- iris %>%
  gatherpairs(sapply(colnames(iris[, -ncol(iris)]), eval))
This was taken from https://stackoverflow.com/a/47731111/8315659
This gives me a data frame with all combinations of feature1 and feature2, but I want to remove the duplicates where the pair is just reversed. For example, Petal.Length vs Petal.Width is the same as Petal.Width vs Petal.Length. However, if two rows have identical values for Petal.Length vs Petal.Width, I do not want to drop either of them. In other words, I only want to drop rows where all values are identical except that .xkey and .ykey are swapped. Essentially, this just recreates the bottom triangle of the ggplot matrix shown in the linked answer.
How can this be done?
Jack
I think this could be accomplished using the first part of the source code, which performs a single gathering operation. Using the iris example, this will produce 600 rows of output, one for each of the 150 rows x 4 columns in iris.
gatherpairs <- function(data, ...,
                        xkey = '.xkey', xvalue = '.xvalue',
                        ykey = '.ykey', yvalue = '.yvalue',
                        na.rm = FALSE, convert = FALSE, factor_key = FALSE) {
  vars <- quos(...)
  xkey <- enquo(xkey)
  xvalue <- enquo(xvalue)
  ykey <- enquo(ykey)
  yvalue <- enquo(yvalue)

  data %>% {
    cbind(gather(., key = !!xkey, value = !!xvalue, !!!vars,
                 na.rm = na.rm, convert = convert, factor_key = factor_key),
          select(., !!!vars))
  } # %>% gather(., key = !!ykey, value = !!yvalue, !!!vars,
    #            na.rm = na.rm, convert = convert, factor_key = factor_key) %>%
    # filter(!(.xkey == .ykey)) %>%
    # mutate(var = apply(.[, c(".xkey", ".ykey")], 1,
    #                    function(x) paste(sort(x), collapse = ""))) %>%
    # arrange(var)
}
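If you would rather keep both gather steps and simply drop the mirrored pairs, one possible sketch (my own suggestion, not part of the original answer) is to keep only the alphabetically ordered half of the key pairs; rows that are genuine duplicates within a pair are retained:
library(dplyr)

# Keep each unordered feature pair once (the "lower triangle"): rows whose
# .xkey/.ykey are merely swapped are dropped, duplicate value rows are kept.
test_lower <- test %>%
  filter(as.character(.xkey) < as.character(.ykey))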

Normalising data with dplyr mutate() brings inconsistencies

I'm trying to reproduce the framework from this blog post http://www.luishusier.com/2017/09/28/balance/ with the following code, but I seem to get inconsistent results:
library(tidyverse)
library(magrittr)
ids <- c("1617", "1516", "1415", "1314", "1213", "1112", "1011", "0910",
         "0809", "0708", "0607", "0506")
data <- ids %>%
  map(function(i) {
    read_csv(paste0("http://www.football-data.co.uk/mmz4281/", i, "/F1.csv")) %>%
      select(Date:AST) %>%
      mutate(season = i)
  })
data <- bind_rows(data)
data <- data[complete.cases(data[, 1:3]), ]
tmp1 <- data %>%
  select(season, HomeTeam, FTHG:FTR, HS:AST) %>%
  rename(BP = FTHG,
         BC = FTAG,
         TP = HS,
         TC = AS,
         TCP = HST,
         TCC = AST,
         team = HomeTeam) %>%
  mutate(Pts = ifelse(FTR == "H", 3, ifelse(FTR == "A", 0, 1)),
         Terrain = "Domicile")
tmp2 <- data %>%
  select(season, AwayTeam, FTHG:FTR, HS:AST) %>%
  rename(BP = FTAG,
         BC = FTHG,
         TP = AS,
         TC = HS,
         TCP = AST,
         TCC = HST,
         team = AwayTeam) %>%
  mutate(Pts = ifelse(FTR == "A", 3, ifelse(FTR == "H", 0, 1)),
         Terrain = "Extérieur")
tmp3 <- bind_rows(tmp1, tmp2)
l1_0517 <- tmp3 %>%
  group_by(season, team) %>%
  summarise(j = n(),
            pts = sum(Pts),
            diff_but = sum(BP) - sum(BC),
            diff_t_ca = sum(TCP, na.rm = TRUE) - sum(TCC, na.rm = TRUE),
            diff_t = sum(TP, na.rm = TRUE) - sum(TC, na.rm = TRUE),
            but_p = sum(BP),
            but_c = sum(BC),
            tir_ca_p = sum(TCP, na.rm = TRUE),
            tir_ca_c = sum(TCC, na.rm = TRUE),
            tir_p = sum(TP, na.rm = TRUE),
            tir_c = sum(TC, na.rm = TRUE)) %>%
  arrange(season, desc(pts), desc(diff_but))
Then I apply the framework mentioned above:
l1_0517 <- l1_0517 %>%
  mutate(
    # First, see how many goals the team scores relative to the average
    norm_attack = but_p %>%
      divide_by(mean(but_p)) %>%
      # Then, transform it into an unconstrained scale
      log(),
    # Next, see how many goals the team concedes relative to the average
    norm_defense = but_c %>%
      divide_by(mean(but_c)) %>%
      # Invert it, so a higher defense is better
      raise_to_power(-1) %>%
      # Then, transform it into an unconstrained scale
      log(),
    # Now that we have normalized attack and defense ratings, we can compute
    # measures of quality and attacking balance
    quality = norm_attack + norm_defense,
    balance = norm_attack - norm_defense
  ) %>%
  arrange(desc(norm_attack))
When I look at the column norm_attack, I expect to find the same value for equivalent but_p values, which is not the case here:
head(l1_0517, 10)
For instance, when but_p has the value 83 (rows 5 and 7), I get norm_attack values of 0.5612738 and 0.5128357 respectively.
Is this normal? I would expect mean(l1_0517$but_p) to be fixed, and therefore to obtain the same result whenever a value of l1_0517$but_p is normalised and logged.
UPDATE
I have tried to work on a simpler example but I can't reproduce this issue:
df <- tibble(a = as.integer(runif(200, 15, 100)))
df <- df %>%
  mutate(norm_a = a %>%
           divide_by(mean(a)) %>%
           log())
I found the solution after looking at the type of l1_0517: it is a grouped_df, hence the different results. Because summarise() drops only the last level of grouping, the result is still grouped by season, so mean(but_p) inside mutate() is computed per season rather than over the whole table.
The correct code is:
l1_0517 <- tmp3 %>%
  group_by(season, team) %>%
  summarise(j = n(),
            pts = sum(Pts),
            diff_but = sum(BP) - sum(BC),
            diff_t_ca = sum(TCP, na.rm = TRUE) - sum(TCC, na.rm = TRUE),
            diff_t = sum(TP, na.rm = TRUE) - sum(TC, na.rm = TRUE),
            but_p = sum(BP),
            but_c = sum(BC),
            tir_ca_p = sum(TCP, na.rm = TRUE),
            tir_ca_c = sum(TCC, na.rm = TRUE),
            tir_p = sum(TP, na.rm = TRUE),
            tir_c = sum(TC, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(season, desc(pts), desc(diff_but))
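To see why the grouping matters, here is a small illustration of my own (using mtcars): on a grouped data frame, mean() inside mutate() is evaluated per group, while after ungroup() it is evaluated over all rows.
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  mutate(norm_hp_grouped = hp / mean(hp)) %>%   # mean(hp) within each cyl group
  ungroup() %>%
  mutate(norm_hp_overall = hp / mean(hp)) %>%   # mean(hp) over all 32 rows
  select(cyl, hp, norm_hp_grouped, norm_hp_overall) %>%
  head()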
