How can I loop different variables to the same command - r

I am trying to loop different variables into the same command.
The following is the list of variables and values I want to loop over:
behavior_list <- c("knocked1", "questions1", ...)
answer_list <- c(0, 1)
answer_label_list <- c("Yes", "No")
Following is the command:
data_aliki %>%
group_by(indicator) %>%
summarise(
total_indicator = n(),
yes_knocked1 = sum(knocked1==1, na.rm = TRUE)
)
I am trying to generate lines like these in a loop:
yes_knocked1 = sum(knocked1==1, na.rm = TRUE)
no_knocked1 = sum(knocked1==0, na.rm = TRUE)
yes_questions1 = sum(questions1==1, na.rm = TRUE)
no_questions1 = sum(questions1==0, na.rm = TRUE)
Is there an easier way to do this instead of copy and paste?

You did not provide a reproducible example, so I will illustrate how to achieve what you want in dplyr for the mtcars data set:
mtcars %>% group_by(cyl) %>%
summarize_at(c("mpg","hp"), list("lt15" = ~sum(. < 15, na.rm = TRUE),
"lt18" = ~sum(. < 18, na.rm = TRUE)))
Output
cyl mpg_lt15 hp_lt15 mpg_lt18 hp_lt18
<dbl> <int> <int> <int> <int>
1 4 0 0 0 0
2 6 0 0 1 0
3 8 5 0 12 0
This should work in your case:
data_aliki %>%
group_by(indicator) %>%
summarize_at(c("knocked1","questions1"),
list("yes" = ~sum(. == 1, na.rm = TRUE),
"no" = ~sum(. == 0, na.rm = TRUE)))

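For newer dplyr versions (1.0.0 or higher), the same idea can be sketched with across(), which supersedes summarize_at(). This is only a sketch under assumptions: `toy` is a made-up stand-in for data_aliki, and behavior_list is shortened to the two columns named in the question. The .names argument produces the yes_knocked1 / no_knocked1 naming the question asks for.

```r
library(dplyr)

# Hypothetical stand-in for data_aliki (the real data was not posted)
toy <- data.frame(
  indicator  = c("a", "a", "b", "b", "b"),
  knocked1   = c(1, 0, 1, NA, 1),
  questions1 = c(0, 0, 1, 1, NA)
)

# Columns to loop over, as in the question (shortened)
behavior_list <- c("knocked1", "questions1")

result <- toy %>%
  group_by(indicator) %>%
  summarise(
    total_indicator = n(),
    # apply both counting functions to every column in behavior_list
    across(
      all_of(behavior_list),
      list(yes = ~ sum(.x == 1, na.rm = TRUE),
           no  = ~ sum(.x == 0, na.rm = TRUE)),
      .names = "{.fn}_{.col}"
    )
  )

print(result)
```

Adding another variable then only means adding its name to behavior_list.
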
Related

R: Calculating Quantiles with (group_by .add = TRUE)

I am working with the R programming language.
I have the following dataset:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status )
Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
Patient_ID Gender Status Height Weight Hospital_Visits Disease
1 1 Female Citizen 145.0583 113.70725 1 No
2 2 Male Immigrant 161.2759 88.33188 18 No
3 3 Female Immigrant 138.5305 99.26961 6 Yes
4 4 Male Citizen 164.8102 84.31848 12 No
5 5 Male Citizen 159.1619 92.25090 12 Yes
6 6 Female Citizen 153.3513 101.31986 11 Yes
In a previous question (R: Calculating Proportions Based on Nested Groups), I learned how to calculate "nested proportions" based on ntiles (e.g. calculate 3 ntiles for one variable, group by these 3 ntiles, then calculate 3 ntiles for the second variable within each of those groups, and so on):
# e.g. using 3 ntiles
my_data %>%
group_by(Gender, Status) %>%
mutate(Height_ntile = ntile(Height, 3),
Height_range = paste(min(Height), max(Height), sep = "-")) %>%
group_by(Height_ntile, Height_range, .add = TRUE) %>%
mutate(Weight_ntile = ntile(Weight, 3),
Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
group_by(Weight_ntile, Weight_range, .add = TRUE) %>%
mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3),
Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
group_by(Hospital_Visits_ntile, Hospital_range, .add = TRUE) %>%
summarize(percent_disease = mean(Disease == "Yes"),
count = n(),
.groups = "drop")
Now I am trying to repeat the same procedure using quantiles instead of ntiles.
I modified the above code; here is my attempt:
my_data %>%
group_by(Gender, Status) %>%
mutate(Height_group = cut(Height, breaks = c(-Inf,
quantile(Height, c(0.33, 0.67)),
Inf)),
Height_range = paste(min(Height), max(Height), sep = "-")) %>%
group_by(Height_group, Height_range, .add = TRUE) %>%
mutate(Weight_group = cut(Weight, breaks = c(-Inf,
quantile(Weight, c(0.33, 0.67)),
Inf)),
Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
group_by(Weight_group, Weight_range, .add = TRUE) %>%
mutate(Hospital_Visits_group = cut(Hospital_Visits, breaks = c(-Inf,
quantile(Hospital_Visits, c(0.33, 0.67)),
Inf)),
Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
group_by(Hospital_Visits_group, Hospital_range, .add = TRUE) %>%
summarize(percent_disease = mean(Disease == "Yes"),
count = n(),
.groups = "drop")
This code runs, but I am not sure if I have done this correctly (e.g. the "infinite" values appearing):
A tibble: 108 x 10
Gender Status Height_~1 Heigh~2 Weigh~3 Weigh~4 Hospi~5 Hospi~6 perce~7
<fct> <fct> <fct> <chr> <fct> <chr> <fct> <chr> <dbl>
1 Female Citizen (-Inf,14~ 115.86~ (-Inf,~ 58.991~ (-Inf,~ 1-20 0.314
2 Female Citizen (-Inf,14~ 115.86~ (-Inf,~ 58.991~ (7,14] 1-20 0.458
Can someone please show me if I have done this correctly?
Thanks!
Answer based on insights provided by @akrun:
my_data %>%
group_by(Gender, Status) %>%
mutate(Height_group = as.integer(cut(Height, breaks = c(-Inf,
quantile(Height, c(0.33, 0.67)),
Inf))),
Height_range = paste(min(Height), max(Height), sep = "-")) %>%
group_by(Height_group, Height_range, .add = TRUE) %>%
mutate(Weight_group = as.integer(cut(Weight, breaks = c(-Inf,
quantile(Weight, c(0.33, 0.67)),
Inf))),
Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
group_by(Weight_group, Weight_range, .add = TRUE) %>%
mutate(Hospital_Visits_group = as.integer(cut(Hospital_Visits, breaks = c(-Inf,
quantile(Hospital_Visits, c(0.33, 0.67)),
Inf))),
Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
group_by(Hospital_Visits_group, Hospital_range, .add = TRUE) %>%
summarize(percent_disease = mean(Disease == "Yes"),
count = n(),
.groups = "drop")
Have I understood this correctly?
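A small sketch of what the cut() calls above do, on a made-up vector `x` (not the posted data). The -Inf and Inf in the labels are only the open ends of the first and last intervals, so that the minimum and maximum values are not dropped; they are not infinite values in the data. Wrapping the result in as.integer() replaces the interval labels with the group numbers 1, 2, 3.

```r
# Hypothetical toy vector, used only to illustrate the binning
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Same break construction as in the code above
breaks <- c(-Inf, quantile(x, c(0.33, 0.67)), Inf)

groups <- cut(x, breaks = breaks)

levels(groups)      # three interval labels; the outer ones mention -Inf / Inf
as.integer(groups)  # group numbers instead of interval labels
table(groups)       # how many values fall in each interval
```
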

Using group_by and summarise_all to create dummy indicators for categorical variable

I want to generate dummy indicators per id for the categorical variable fruit. I get the following warning when using summarise_all with a self-defined function. I also tried summarise_all(any), which gave a warning about coercing double to logical. Is there an efficient or up-to-date way to implement this? Thanks a lot!
fruit = c("apple", "banana", "orange", "pear",
"strawberry", "blueberry", "durian",
"grape", "pineapple")
df_sample = data.frame(id = c(rep("a", 3), rep("b", 5), rep("c", 6)),
fruit = c(sample(fruit, replace = T, size = 3),
sample(fruit, replace = T, size = 5),
sample(fruit, replace = T, size = 6)))
fruit_indicator =
model.matrix(~ -1 + fruit, df_sample) %>%
as.data.frame() %>%
bind_cols(df_sample) %>%
select(-fruit) %>%
group_by(id) %>%
summarise_all(funs(ifelse(any(. > 0), 1, 0)))
# Warning message:
# `funs()` is deprecated as of dplyr 0.8.0.
# Please use a list of either functions or lambdas:
#
# # Simple named list:
# list(mean = mean, median = median)
#
# # Auto named with `tibble::lst()`:
# tibble::lst(mean, median)
#
# # Using lambdas
# list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
You can use across which is available in dplyr 1.0.0 or higher.
library(dplyr)
model.matrix(~ -1 + fruit, df_sample) %>%
as.data.frame() %>%
bind_cols(df_sample) %>%
select(-fruit) %>%
group_by(id) %>%
summarise(across(everything(), ~as.integer(any(. > 0))))
# id fruitapple fruitbanana fruitdurian fruitgrape fruitpear
#* <chr> <int> <int> <int> <int> <int>
#1 a 0 1 1 0 1
#2 b 1 0 0 1 0
#3 c 0 1 0 1 1
# … with 1 more variable: fruitpineapple <int>
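As a possible alternative that skips model.matrix entirely, here is a minimal base R sketch using table(). The small `df_sample` below is a fixed, made-up version of the sampled data so that the result is reproducible; the result is an id-by-fruit table of 0/1 values rather than a tibble.

```r
# Hypothetical fixed sample (the original uses random sampling)
df_sample <- data.frame(
  id    = c("a", "a", "a", "b", "b", "c"),
  fruit = c("apple", "apple", "pear", "banana", "pear", "grape")
)

# Count how often each fruit occurs for each id
counts <- table(df_sample$id, df_sample$fruit)

# Turn counts into 0/1 indicators: 1 if the id has the fruit at least once
indicator <- (counts > 0) * 1L

print(indicator)
```
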

R: making group_by and summarise more efficient [duplicate]

I have a data set called data with the columns Region, 2006, 2007, and so on through 2020. The Region column gives the name of the area, and each year column gives the population of that region in that year.
The below code gives me my desired output (it shows the total population for each year by region). However, it is very time consuming to type this code out. Is there a way to make this code more efficient and save time typing out 15 different lines?
newData <- data %>%
group_by(Region) %>%
summarise(totalPop2006 = sum(`2006`, na.rm = TRUE),
totalPop2007 = sum(`2007`, na.rm = TRUE),
totalPop2008 = sum(`2008`, na.rm = TRUE),
totalPop2009 = sum(`2009`, na.rm = TRUE),
totalPop2010 = sum(`2010`, na.rm = TRUE),
totalPop2011 = sum(`2011`, na.rm = TRUE),
totalPop2012 = sum(`2012`, na.rm = TRUE),
totalPop2013 = sum(`2013`, na.rm = TRUE),
totalPop2014 = sum(`2014`, na.rm = TRUE),
totalPop2015 = sum(`2015`, na.rm = TRUE),
totalPop2016 = sum(`2016`, na.rm = TRUE),
totalPop2017 = sum(`2017`, na.rm = TRUE),
totalPop2018 = sum(`2018`, na.rm = TRUE),
totalPop2019 = sum(`2019`, na.rm = TRUE),
totalPop2020 = sum(`2020`, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(Region)
Thanks!
We can use summarise with across
library(dplyr)
data %>%
group_by(Region) %>%
summarise(across(`2006`:`2020`, ~ sum(., na.rm = TRUE),
.names = 'totalPop{col}'), .groups = 'drop') %>%
arrange(Region)
Using the default dataset 'mtcars'
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise(across(disp:wt, ~ sum(., na.rm = TRUE), .names = 'totalPop{col}'),
.groups = 'drop')
# A tibble: 3 x 5
# cyl totalPopdisp totalPophp totalPopdrat totalPopwt
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 1156. 909 44.8 25.1
#2 6 1283. 856 25.1 21.8
#3 8 4943. 2929 45.2 56.0
Or in base R with aggregate
aggregate(. ~ Region, data[c('Region', 2006:2020)],
sum, na.rm = TRUE, na.action = NULL)
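To check the across() approach on year-named columns, here is a small sketch with made-up numbers. `pop_data` is a hypothetical stand-in for data, with only two year columns. It selects the columns with matches() on a four-digit pattern instead of the `2006`:`2020` range, which is an assumption that may be handy when the year columns are not contiguous.

```r
library(dplyr)

# Hypothetical stand-in for `data`; check.names = FALSE keeps the
# non-syntactic column names "2006" and "2007" as they are
pop_data <- data.frame(
  Region = c("North", "North", "South"),
  `2006` = c(10, 20, NA),
  `2007` = c(1, 2, 3),
  check.names = FALSE
)

newData <- pop_data %>%
  group_by(Region) %>%
  # sum every column whose name consists of exactly four digits
  summarise(across(matches("^[0-9]{4}$"), ~ sum(.x, na.rm = TRUE),
                   .names = "totalPop{.col}"),
            .groups = "drop") %>%
  arrange(Region)

print(newData)
```
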

How to Add Column Totals to Grouped Summaries in R

I'm in the process of creating summary tables based on subgroups and would love to add an overall summary in a tidier, more efficient manner.
What I have so far is this. I've created summaries via levels within my factor variables.
library(tidyverse)
df <- data.frame(var1 = 10:18,
var2 = c("A","B","A","B","A","B","A","B","A"))
group_summary <- df %>% group_by(var2) %>%
filter(var2 != "NA") %>%
summarise("Max" = max(var1, na.rm = TRUE),
"Median" = median(var1, na.rm = TRUE),
"Min" = min(var1, na.rm = TRUE),
"IQR" = IQR(var1, na.rm = TRUE),
"Count" = n())
Next I created an overall summary.
Summary <- df %>%
filter(var2 != "NA") %>%
summarise("Max" = max(var1, na.rm = TRUE),
"Median" = median(var1, na.rm = TRUE),
"Min" = min(var1, na.rm = TRUE),
"IQR" = IQR(var1, na.rm = TRUE),
"Count" = n())
Finally, I bound the two objects with dplyr::bind_rows
complete_summary <- bind_rows(Summary, group_summary)
What I've done works but it is very, very verbose and can't be the most efficient way. I tried to use ungroup
group_summary <- df %>% group_by(var2) %>%
filter(var2 != "NA") %>%
summarise("Max" = max(var1, na.rm = TRUE),
"Median" = median(var1, na.rm = TRUE),
"Min" = min(var1, na.rm = TRUE),
"IQR" = IQR(var1, na.rm = TRUE),
"Count" = n()) %>% ungroup %>%
summarise("Max" = max(var1, na.rm = TRUE),
"Median" = median(var1, na.rm = TRUE),
"Min" = min(var1, na.rm = TRUE),
"IQR" = IQR(var1, na.rm = TRUE),
"Count" = n())
but it threw an error:
Evaluation error: object var1 not found.
Thanks in advance for your assistance.
If you want to do it in a single chain, you can use bind_rows to combine the two results, just as you did, but without the temporary objects you created:
library(tidyverse)
df <- data.frame(var1 = 10:18,
var2 = c("A","B","A","B","A","B","A","B","A"))
df %>% group_by(var2) %>%
filter(var2 != "NA") %>%
summarise("Max" = max(var1, na.rm = TRUE),
"Median" = median(var1, na.rm = TRUE),
"Min" = min(var1, na.rm = TRUE),
"IQR" = IQR(var1, na.rm = TRUE),
"Count" = n()) %>% #ungroup() %>%
bind_rows( df %>% summarise("Max" = max(var1, na.rm = TRUE),
"Median" = median(var1, na.rm = TRUE),
"Min" = min(var1, na.rm = TRUE),
"IQR" = IQR(var1, na.rm = TRUE),
"Count" = n()))
#> # A tibble: 3 x 6
#> var2 Max Median Min IQR Count
#> <fct> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 A 18 14 10 4 5
#> 2 B 17 14 11 3 4
#> 3 <NA> 18 14 10 4 9
Created on 2019-01-29 by the reprex package (v0.2.1)
Not the most elegant solution either, but simple:
c <- mtcars %>%
mutate(total_mean = mean(wt),
total_median = median(wt)) %>%
group_by(cyl) %>%
summarise(meanweight = mean(wt),
medianweight = median(wt),
total_mean = first(total_mean),
total_median = first(total_median))
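One way to cut the repetition is to put the summarise() call in a small helper and apply it to both the grouped and the ungrouped data. This is a sketch: the helper name `summarise_var1` and the "Total" label are my own choices, not from the question. It also shows why the ungroup attempt failed: after the first summarise(), the column var1 no longer exists, so a second summarise() on var1 cannot find it.

```r
library(dplyr)

df <- data.frame(var1 = 10:18,
                 var2 = c("A", "B", "A", "B", "A", "B", "A", "B", "A"))

# Hypothetical helper: the summary is written once and reused
summarise_var1 <- function(.data) {
  summarise(.data,
            Max    = max(var1, na.rm = TRUE),
            Median = median(var1, na.rm = TRUE),
            Min    = min(var1, na.rm = TRUE),
            IQR    = IQR(var1, na.rm = TRUE),
            Count  = n())
}

complete_summary <- bind_rows(
  df %>% group_by(var2) %>% summarise_var1(),          # one row per group
  df %>% summarise_var1() %>% mutate(var2 = "Total")   # overall row, labelled
)

print(complete_summary)
```
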

dplyr pipe multiple datasets to summarize()

I am making a table using dplyr. I want to perform the same "summarize" command on multiple datasets. I know in ggplot2, you can just change out the dataset and rerun the plot, which is cool.
Here's what I want to avoid:
table_1 <-
group_by(df_1, boro) %>%
summarize(n_units = n(),
mean_rent = mean(rent_numeric, na.rm = TRUE),
sd_rend = sd(rent_numeric,na.rm = TRUE),
median_rent = median(rent_numeric, na.rm = TRUE),
mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
mean_sqft = mean(sqft, na.rm = TRUE),
sd_sqft = sd(sqft, na.rm = TRUE),
n_broker = sum(ob=="broker"),
pr_broker = n_broker/n_units)
table_2 <-
group_by(df_2, boro) %>%
summarize(n_units = n(),
mean_rent = mean(rent_numeric, na.rm = TRUE),
sd_rend = sd(rent_numeric,na.rm = TRUE),
median_rent = median(rent_numeric, na.rm = TRUE),
mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
mean_sqft = mean(sqft, na.rm = TRUE),
sd_sqft = sd(sqft, na.rm = TRUE),
n_broker = sum(ob=="broker"),
pr_broker = n_broker/n_units)
Basically, is there a way to set up the summarize command as a function, so that I can just pass in df_1 and df_2?
If you know all the variable names in advance and if they are the same in all the data sets you want to look at, you can just do something like:
myfunc <- function(df) {
df %>%
group_by(cyl) %>%
summarize(n = n(),
mean_hp = mean(hp))
}
myfunc(mtcars)
#Source: local data frame [3 x 3]
#
# cyl n mean_hp
#1 4 11 82.63636
#2 6 7 122.28571
#3 8 14 209.21429
And then use it with a different data set (one that has the same structure and variable names). If you need flexibility, i.e. you don't know all the variables in advance and want to be able to specify them as input to the function, look at the dplyr non-standard evaluation vignette.
Here's just a tiny example of how you could implement "standard evaluation" into your function to allow for more flexibility. Consider if you wanted to allow the user of the function to specify by which column the data should be grouped, you could do:
myfunc <- function(df, grp) {
df %>%
group_by_(grp) %>% # notice that I use "group_by_" instead of "group_by"
summarize(n = n(),
mean_hp = mean(hp))
}
and then use it:
myfunc(mtcars, "gear")
#Source: local data frame [3 x 3]
#
# gear n mean_hp
#1 3 15 176.1333
#2 4 12 89.5000
#3 5 5 195.6000
myfunc(mtcars, "cyl")
#Source: local data frame [3 x 3]
#
# cyl n mean_hp
#1 4 11 82.63636
#2 6 7 122.28571
#3 8 14 209.21429
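Note that group_by_() has been deprecated in later dplyr releases. In current dplyr, a function that takes the grouping column as a string can be sketched with across() and all_of() instead. The function name is kept from the example above; everything else is plain dplyr.

```r
library(dplyr)

myfunc <- function(df, grp) {
  df %>%
    # grp is a character string naming the grouping column
    group_by(across(all_of(grp))) %>%
    summarize(n = n(),
              mean_hp = mean(hp),
              .groups = "drop")
}

res <- myfunc(mtcars, "gear")
print(res)
```
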
The %>% operator just passes the tbl object on its left as the first argument of the function on its right, and summarize just expects a tbl. So you can define
mysummary <- function(.data) {
summarize(.data, n_units = n(),
mean_rent = mean(rent_numeric, na.rm = TRUE),
sd_rend = sd(rent_numeric,na.rm = TRUE),
median_rent = median(rent_numeric, na.rm = TRUE),
mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
mean_sqft = mean(sqft, na.rm = TRUE),
sd_sqft = sd(sqft, na.rm = TRUE),
n_broker = sum(ob=="broker"),
pr_broker = n_broker/n_units)
}
And then call
table_1 <- group_by(df_1, boro) %>% mysummary
table_2 <- group_by(df_2, boro) %>% mysummary
With an actual working example
mysummary <- function(.data) {
summarize(.data,
ave.mpg=mean(mpg),
ave.hp=mean(hp)
)
}
mtcars %>% group_by(cyl) %>% mysummary
mtcars %>% group_by(gear) %>% mysummary
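If there are more than two datasets, the same function can be applied to all of them at once by keeping them in a named list. A minimal sketch, assuming all datasets share the same column names; the two subsets of mtcars below are hypothetical stand-ins for df_1 and df_2.

```r
library(dplyr)

mysummary <- function(.data) {
  summarize(.data,
            ave.mpg = mean(mpg),
            ave.hp  = mean(hp))
}

# Hypothetical stand-ins for df_1 and df_2
datasets <- list(
  manual    = filter(mtcars, am == 1),
  automatic = filter(mtcars, am == 0)
)

# One summary table per dataset, returned as a named list
tables <- lapply(datasets, function(d) d %>% group_by(cyl) %>% mysummary())

# Optionally stack them, with a column recording which dataset each row came from
combined <- bind_rows(tables, .id = "dataset")
print(combined)
```
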
