Standard Evaluation to calculate the mean using summarise() - r

So I am taking a column name which is guaranteed to exist in the data frame as the input, and want to calculate the mean of the given column (Let's suppose c is the parameter in the function).
EDIT: Tried recreating a scenario here.
states <- c("Washington", "Washington", "California", "California")
random.data <- c(10, 20, 30, 40)
data <- data.frame(states, random.data)
data.frame <- data %>%
group_by(states) %>%
summarise_(average = mean(paste0("random", ".data")))
When I tried doing this I got the following error:
Warning message:
In mean.default(paste0("random", ".data")) :
argument is not numeric or logical: returning NA
I have an idea about standard evaluation, but for some reason it is not working for this mean function.

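paste0("random", ".data") just returns the string "random.data", so mean() is asked to average a character value, which is what the warning reports. Convert the string to a symbol with rlang::sym() and unquote it with !!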
col <- paste0("random", ".data")
data %>%
group_by(states) %>%
summarise(average = mean(!! rlang::sym(col)))
##or use directly
#summarise(average = mean(!! rlang::sym(paste0("random", ".data"))))
# A tibble: 2 x 2
# states average
# <fctr> <dbl>
#1 California 35
#2 Washington 15
Or another option is get
data %>%
group_by(states) %>%
summarise(average = mean(get(paste0("random", ".data"))))
# A tibble: 2 x 2
# states average
# <fctr> <dbl>
#1 California 35
#2 Washington 15
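A third option, sketched here under the assumption of a reasonably recent dplyr, is the .data pronoun, which looks a column up by its string name. The helper name mean_by_state and its arguments are made up for illustration; they are not part of the question.

```r
library(dplyr)

states <- c("Washington", "Washington", "California", "California")
random.data <- c(10, 20, 30, 40)
data <- data.frame(states, random.data)

# Hypothetical helper: `col` is the name of an existing column, given as a string.
mean_by_state <- function(df, col) {
  df %>%
    group_by(states) %>%
    # .data[[col]] fetches the column called `col` within the current group
    summarise(average = mean(.data[[col]]), .groups = "drop")
}

res <- mean_by_state(data, paste0("random", ".data"))
print(res)
```

Because the column is looked up by name inside the data, this form does not need rlang::sym() or !!.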

Related

Create a function to get summary statistics of a data frame in R

I have the data frame df3 below.
City    Income  Cost  Age
NY      1237    2432  43
NY      6352    8632  32
Boston  6487    2846  54
NJ      6547    7353  42
Boston  7564    7252  21
NY      9363    7563  35
Boston  3262    7352  54
NY      9473    8667  76
NJ      6234    4857  31
Boston  5242    7684  39
NJ      7483    4748  47
NY      9273    6573  53
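For reproducibility, the values listed above can be entered as a data frame. This is only a sketch using base R; the column types (character for City, numeric for the rest) are an assumption.

```r
# df3 rebuilt from the values listed in the question
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363,
             3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563,
             7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35, 54, 76, 31, 39, 47, 53),
  stringsAsFactors = FALSE
)
str(df3)
```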
I need to create a function 'ST' that returns the mean and standard deviation of each variable for a given city. For example, ST(NY) should return a table like the one below.
variable  Mean  SD
Income    XX    XX
Cost      XX    XX
Age       XX    XX
XX are values rounded to 2 decimal places. I wrote a few pieces of code, but I am struggling to combine them into one function. My code is below.
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), median,2)
ST <- function(c) {
if (df3$City == s)
dataframe (
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), mean,2),
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), sd,2)
else {
"NA"
}
}
ST(NJ)
No need to call library(dplyr) multiple times, and doing so in the middle of a data.frame(..) expression is not right. Candidly, even if that were syntactically correct code (it could be with {...} bracing), it is generally considered better to put things like that at the beginning of the function, organizing the code. Put it at the beginning of your function, ST <- function(c) { library(dplyr); ... }.
From ?summarize_at,
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.
I'll demo the use of across.
summarize can be given multiple (named) functions at once, I'll show that, too.
Your if (df3$City == .) is wrong for a few reasons, notably because if requires its conditional to be exactly length-1 (anything else is an error, a warning, and/or logical failure) but the test is returning a logical vector as long as the number of rows in df3. A better tactic is to use dplyr::filter.
Your function uses objects that were neither passed to it nor defined within it; this is bad practice. Best practice is to pass the data and arguments in the function call.
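The point about if needing a length-1 condition can be checked in isolation. This is a minimal base-R sketch with a made-up vector standing in for df3$City; whether the failure is an error or a warning depends on the R version, so the code catches both.

```r
city <- c("NY", "NY", "Boston", "NJ")   # stand-in for df3$City

cond <- city == "NY"
length(cond)    # one logical value per row, not a single TRUE/FALSE

# if() wants exactly one logical value; capture what happens with a longer one
outcome <- tryCatch(
  if (cond) "ran" else "skipped",
  error   = function(e) "error",
  warning = function(w) "warning"
)
print(outcome)

# Row selection is a subsetting problem, not a branching problem
selected <- city[cond]
print(selected)
```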
ST <- function(X, city, na.rm = TRUE) {
library(dplyr)
library(tidyr) # pivot_longer
filter(X, City %in% city) %>%
summarize(across(c("Income", "Cost", "Age"),
list(mu = ~ mean(., na.rm = na.rm),
sigma = ~ sd(., na.rm = na.rm)))) %>%
pivot_longer(everything(), names_pattern = "(.*)_(.*)",
names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
# variable mu sigma
# <chr> <dbl> <dbl>
# 1 Income 7140. 3550.
# 2 Cost 6773. 2576.
# 3 Age 47.8 17.7
Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:
NA inclusion works. Note that NA == NA returns NA (which derails a lot of conditional processing if it is not handled correctly) whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).
It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from city to cities, suggesting it can take more than one.)
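Both benefits can be seen on a small made-up vector (x here is illustrative only, not taken from df3):

```r
x <- c("NY", NA, "Boston")

eq_one  <- x == "NY"                 # the NA element stays NA
in_one  <- x %in% "NY"               # the NA element becomes FALSE
na_eq   <- NA == NA                  # NA
na_in   <- NA %in% NA                # TRUE
in_many <- x %in% c("NY", "Boston")  # right-hand side may have any length

print(eq_one)
print(in_one)
print(in_many)
```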
From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in
ST <- function(X, cities, digits = 2, na.rm = TRUE) {
library(dplyr)
library(tidyr) # pivot_longer
filter(X, City %in% cities) %>%
group_by(City) %>%
summarize(across(c("Income", "Cost", "Age"),
list(mu = ~ mean(., na.rm = na.rm),
sigma = ~ sd(., na.rm = na.rm)))) %>%
pivot_longer(-City, names_pattern = "(.*)_(.*)",
names_to = c("variable", ".value")) %>%
mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
# City variable mu sigma
# <chr> <chr> <dbl> <dbl>
# 1 Boston Income 5639. 1847.
# 2 Boston Cost 6284. 2299.
# 3 Boston Age 42 15.7
# 4 NY Income 7140. 3550.
# 5 NY Cost 6773. 2576.
# 6 NY Age 47.8 17.7
Edit: I added the rounding.
ST <- function(city_name) {
df3 %>%
filter(City == city_name) %>%
pivot_longer(cols = Income:Age, names_to = "variable") %>%
group_by(City, variable) %>%
summarise(mean = mean(value),
sd = sd(value), .groups = "drop")
}
ST("Boston")
# A tibble: 3 × 4
City variable mean sd
<chr> <chr> <dbl> <dbl>
1 Boston Age 42 15.7
2 Boston Cost 6284. 2299.
3 Boston Income 5639. 1847.

Why does R ignore relevel when using group_nest()?

As a continuation from this question, I'm trying to efficiently perform many logistic regression in order to generate a column saying if a group differs significantly from my reference group.
When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:
Adding a relevel argument (shown below)
Adding a relevel argument within the custom function itself (also shown below)
Renaming the desired reference group to start with 'AAA' to trick R into making it the first option
Here's a sample dataset:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
app_deadline = ymd(c(rep("'2021-04-04", 3), rep("'2020-03-23", 3), rep("'2019-05-23", 1))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
Here's the code that won't let me change the reference level:
#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
full_join(test2)
#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group
Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english' but could also be switched to work for 'computer science' too.
It does respect relevel(). The problem, such as it is, is that the returned results don't line up with the major column. See what happens if you stop at the unnest() call:
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
tmp <- test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy))
Now, look at major and term
tmp %>% select(major, term)
# # A tibble: 6 × 2
# major term
# <fct> <chr>
# 1 undeclared "(Intercept)"
# 2 computer science "relevel(major, ref = \"english\")computer science"
# 3 english "relevel(major, ref = \"english\")undeclared"
# 4 undeclared "(Intercept)"
# 5 computer science "relevel(major, ref = \"english\")computer science"
# 6 english "relevel(major, ref = \"english\")undeclared"
You can see that the rows where major is "english" are actually for the "undeclared" parameter estimate. Taking the above result, I think you can capture what you want with the following:
tmp %>%
filter(term != "(Intercept)") %>%
mutate(major = gsub(".*\\)(.*)", "\\1", term)) %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(year, app_deadline, major, time, significant) %>%
full_join(test2)
# Joining, by = c("app_deadline", "major", "time")
# # A tibble: 7 × 9
# year app_deadline major time significant admit reject total accept_rate
# <dbl> <date> <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 2020-03-23 computer science 2020-01-01 Yes 300 100 400 0.75
# 2 2020 2020-03-23 undeclared 2020-01-01 Yes 800 210 1010 0.792
# 3 2021 2021-04-04 computer science 2021-01-01 Yes 1000 300 1300 0.769
# 4 2021 2021-04-04 undeclared 2021-01-01 No 500 1000 1500 0.333
# 5 NA 2021-04-04 english 2021-01-01 NA 450 1000 1450 0.310
# 6 NA 2020-03-23 english 2020-01-01 NA 100 900 1000 0.1
# 7 NA 2019-05-23 undeclared 2019-01-01 NA 1000 1500 2500 0.4
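The two pieces this relies on can be checked without fitting a model. The sketch below uses base R only: it shows that relevel() moves the chosen level to the front, and that the gsub() pattern keeps whatever follows the last closing parenthesis of a term label. The term strings are copied from the output above.

```r
major <- factor(c("undeclared", "computer science", "english"))
levels(major)            # default order is alphabetical

releveled <- relevel(major, ref = "english")
levels(releveled)        # the reference level now comes first

# Term labels as they appear in the tidy() output shown earlier
terms <- c("(Intercept)",
           "relevel(major, ref = \"english\")computer science",
           "relevel(major, ref = \"english\")undeclared")

# Keep only the text after the last ")"
extracted <- gsub(".*\\)(.*)", "\\1", terms)
print(extracted)
```

Since the pattern does not mention "english", the same extraction should also apply if the reference level is switched to another major.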

Fairly new to R , can anyone tell me the difference between the queries?

penguins %>% group_by(island, species) %>% drop_na() %>%
summarise(meaxbill = max(penguins$bill_length_mm))
penguins %>% group_by(island, species) %>% drop_na() %>%
summarise(meaxbill = max(bill_length_mm))
I'll word it a little more strongly: when using the pipe operator %>% and the dplyr package, you should not use the dataframe name with the column names ($-indexing); while it works sometimes, if anything in the pipeline removes, adds, or reorders the rows, then your subsequent calculations will be wrong. It isn't that you don't need to assign the dataframe name, it's that if you do use it then you are likely corrupting your data. The first code is broken, do not trust it. (Whether it is truly corrupted or not may be contextual; I don't know if it corrupts it here.)
Let me demonstrate. If we want to know the max bill length (mm) of all of the penguins, by sex, we should do something like this:
library(dplyr)
library(tidyr) # drop_na()
data("penguins", package = "palmerpenguins")
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(bill_length_mm))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female 58
# 2 male 59.6
If for some reason we instead use penguins$bill_length_mm, then we'll see this:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(penguins$bill_length_mm))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female NA
# 2 male NA
which will likely encourage us to add na.rm=TRUE to the max() call, and we'll get a seemingly valid-ish number:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(penguins$bill_length_mm, na.rm = TRUE))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female 59.6
# 2 male 59.6
but the problem is that max(.) is being passed all of penguins$bill_length_mm, not just the values within each group.
In this case, the use of penguins$ is not a syntax error, it is a logical error, and there is no way for dplyr or anything else in R to know that what you are doing is not what you really need. It works, because max(.) sees a vector and it returns a single number; then summarize(.) sees a single number and assigns it to a new variable.
And in this case, our results are corrupted.
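The same effect shows up on a tiny made-up data frame, which makes it easy to verify by eye. This is only a sketch; toy, g and x are illustrative names.

```r
library(dplyr)

toy <- data.frame(g = c("a", "a", "b", "b"),
                  x = c(1, 2, 10, 20))

# Bare column name: max() sees only the rows of the current group
right <- toy %>%
  group_by(g) %>%
  summarise(m = max(x))

# toy$x: max() sees the whole original column, whatever the grouping
wrong <- toy %>%
  group_by(g) %>%
  summarise(m = max(toy$x))

print(right)
print(wrong)
```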
The only time it may be valid to use penguins$ in this is if we truly need to bring in a number or object from outside of the current "view" of the data. Realize that the data that summarize(.) sees is not the data that started in the pipe: it has been filtered (by drop_na()), it might be changed (if we mutated some columns into it) or reordered (if we arrange the data).
But if we need to find out the percentage of the max bill length with respect to the max of the original data, we might do this:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(
maxbill = max(bill_length_mm),
maxbill_ratio = max(bill_length_mm) / max(penguins$bill_length_mm, na.rm = TRUE)
)
# # A tibble: 2 x 3
# sex maxbill maxbill_ratio
# <fct> <dbl> <dbl>
# 1 female 58 0.973
# 2 male 59.6 1
(Recall that we needed to add na.rm=TRUE in that call because one of the rows has an NA ... and the data we see in that last max has not been filtered/cleaned by the drop_na() call.)

Use summarize and a for loop taking column names from a character vector

I have a dataset which I cannot share here, but I need to create columns using a for loop and the column names should come from a character vector. Below I try to replicate what I am trying to achieve using the flights dataset from the nycflights13 package.
install.packages("nycflights13")
library(nycflights13)
flights <- nycflights13::flights
flights <- flights[c(10, 16, 17)]
var_interest <- c("distance", "hour")
for (i in 1:length(var_interest)) {
flights %>% group_by(carrier) %>%
summarize(paste(var_interest[i], "n", sep = "_") = sum(paste(var_interest[i])))
}
This code generates the following error:
Error: unexpected '=' in:
" flights %>% group_by(carrier) %>%
summarize(paste(var_interest[i], "n", sep = "_") ="
> }
Error: unexpected '}' in "}"
My actual dataset is more complex than this example and therefore, I need to follow this approach. So if you could help me find what I am missing here, that would be highly appreciated!
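The string can be converted to a symbol with rlang::sym() and unquoted with !! inside sum(). On the left-hand side, unquote the constructed name with !! as well and use := instead of =, because R's parser does not accept an expression such as paste(...) as an argument name (that is what triggers the "unexpected '='" error).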
out <- vector('list', length(var_interest))
for (i in seq_along(var_interest)) {
out[[i]] <- flights %>%
group_by(carrier) %>%
summarize(!! paste(var_interest[i], "n", sep = "_") :=
sum(!! rlang::sym(var_interest[i])), .groups = 'drop')
}
lapply(out, head, 3)
#[[1]]
# A tibble: 3 x 2
# carrier distance_n
# <chr> <dbl>
#1 9E 9788152
#2 AA 43864584
#3 AS 1715028
#[[2]]
# A tibble: 3 x 2
# carrier hour_n
# <chr> <dbl>
#1 9E 266419
#2 AA 413361
#3 AS 9013
There are multiple ways to pass a string column name and evaluate it:
As stated above, convert it to a symbol and unquote it (!!).
Use across(), which accepts unquoted names, strings (via all_of()), or integer column positions. In that case we don't need a loop at all:
flights %>%
group_by(carrier) %>%
summarise(across(all_of(var_interest), ~
sum(., na.rm = TRUE), .names = '{.col}_n'),
.groups = 'drop')
# A tibble: 16 x 3
# carrier distance_n hour_n
# <chr> <dbl> <dbl>
# 1 9E 9788152 266419
# 2 AA 43864584 413361
# 3 AS 1715028 9013
# 4 B6 58384137 747278
# 5 DL 59507317 636932
# 6 EV 30498951 718187
# 7 F9 1109700 9441
# 8 FL 2167344 43960
# 9 HA 1704186 3324
#10 MQ 15033955 358779
#11 OO 16026 550
#12 UA 89705524 754410
#13 US 11365778 252595
#14 VX 12902327 63876
#15 WN 12229203 151366
#16 YV 225395 9300
A tidy way to do this might be to stack it longer rather than wider:
install.packages("nycflights13")
library(nycflights13)
library(dplyr) # %>%, select()
flights <- nycflights13::flights %>%
select(carrier,distance,hour)
by_carrier <- purrr::map_dfr( c('distance','hour'), function(x) {
flights %>%
dplyr::group_by(carrier) %>%
dplyr::summarize(n = sum(!!as.name(x))) %>%
dplyr::mutate(key = x)
})
If you still want the for loop to append columns you can use the !!as.name() feature twice with something like
by_carrier <- NULL
for ( i in c('distance','hour')) {
df <-
flights %>%
dplyr::group_by(carrier) %>%
dplyr::summarize(!!as.name(i) := sum(!!as.name(i) ))
by_carrier <- bind_cols(by_carrier,df)
}
although you'd have to clean up the carrier columns after that one.
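One way to sidestep that cleanup, sketched here on a small made-up table rather than the flights data, is to give each summary its own column name and join the pieces on carrier instead of binding columns. The names toy, vars and pieces are illustrative.

```r
library(dplyr)

toy <- data.frame(
  carrier  = c("AA", "AA", "UA"),
  distance = c(100, 200, 300),
  hour     = c(5, 6, 7)
)
vars <- c("distance", "hour")

# One summary table per variable, each with a distinct "<var>_n" column
pieces <- lapply(vars, function(v) {
  toy %>%
    group_by(carrier) %>%
    summarise(!!paste0(v, "_n") := sum(!!as.name(v)), .groups = "drop")
})

# Join on carrier so the key column appears only once
by_carrier <- Reduce(function(a, b) left_join(a, b, by = "carrier"), pieces)
print(by_carrier)
```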

Meaning of error using . shorthand inside dplyr function

I'm getting a dplyr::bind_rows error. It's a very trivial problem, because I can easily get around it, but I'd like to understand the meaning of the error message.
I have the following data of some population groups for New England states, and I'd like to bind on a copy of these same values with the name changed to "New England," so that I can group by name and add them up, giving me values for the individual states, plus an overall value for the region.
df <- structure(list(name = c("CT", "MA", "ME", "NH", "RI", "VT"),
estimate = c(501074, 1057316, 47369, 76630, 141206, 27464)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
I'm doing this as part of a much larger flow of piped steps, so I can't just do bind_rows(df, df %>% mutate(name = "New England")). dplyr gives the convenient . shorthand for a data frame being piped from one function to the next, but I can't use that to bind the data frame to itself in a way I'd like.
What does work and gets me the output I want:
library(tidyverse)
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
bind_rows(mutate(., name = "New England")) %>%
group_by(name) %>%
summarise(estimate = sum(estimate))
#> # A tibble: 7 x 2
#> name estimate
#> <chr> <dbl>
#> 1 ct 501074
#> 2 ma 1057316
#> 3 me 47369
#> 4 New England 1851059
#> 5 nh 76630
#> 6 ri 141206
#> 7 vt 27464
But when I try to do the same thing with the . shorthand, I get this error:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(. %>% mutate(name = "New England"))
#> Error in bind_rows_(x, .id): Argument 2 must be a data frame or a named atomic vector, not a fseq/function
Like I said, doing it the first way is fine, but I'd like to understand the error because I write a lot of multi-step piped code.
As @aosmith noted in the comments, it's due to the way magrittr parses the dot in this case:
from ?'%>%':
Using the dot-place holder as lhs
When the dot is used as lhs, the
result will be a functional sequence, i.e. a function which applies
the entire chain of right-hand sides in turn to its input.
To avoid triggering this, any modification of the expression on the lhs will do:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows((.) %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows({.} %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(identity(.) %>% mutate(name = "New England"))
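The difference is easy to see outside of dplyr. In this sketch (magrittr only, with made-up numbers) a bare dot on the left-hand side builds a function, while wrapping the dot in parentheses makes it an ordinary value again.

```r
library(magrittr)

# A bare `.` as lhs does not run anything yet: it defines a functional sequence
f <- . %>% sum() %>% sqrt()
is.function(f)
f(c(9, 16))

# Inside a pipeline, (.) is just the piped value, so the inner pipe runs immediately
out <- c(1, 2) %>% c((.) %>% rev())
print(out)
```

This is why bind_rows() in the question complained about receiving an fseq/function rather than a data frame.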
Here's a suggestion that avoids the problem altogether:
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
replicate(2,.,simplify = FALSE) %>%
map_at(2,mutate_at,"name",~"New England") %>%
bind_rows
# # A tibble: 12 x 2
# name estimate
# <chr> <dbl>
# 1 ct 501074
# 2 ma 1057316
# 3 me 47369
# 4 nh 76630
# 5 ri 141206
# 6 vt 27464
# 7 New England 501074
# 8 New England 1057316
# 9 New England 47369
# 10 New England 76630
# 11 New England 141206
# 12 New England 27464

Resources