How to use tidy evaluation with column names as strings? - r

I've read most of the documentation about tidy evaluation and programming with dplyr but cannot get my head around this (simple) problem.
I want to program with dplyr and pass column names as strings to the function.
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_var) {
df %>%
group_by(group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
This gives me Error : Column 'group_var' is unknown.
What must I change inside the my_summarise function in order to make this work?

We can also use ensym with !!
my_summarise <- function(df, group_var) {
df %>%
group_by(!!rlang::ensym(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
Or another option is group_by_at
my_summarise <- function(df, group_var) {
df %>%
group_by_at(vars(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
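For what it's worth, in dplyr 1.0.0 and later the _at verbs are superseded, and the same string-based grouping can be written with across() and all_of(). A minimal sketch, assuming dplyr >= 1.0.0; the fixed values for a and b are my own substitution for sample(5) so that the result is reproducible:

```r
library(dplyr)

# Fixed values instead of sample(5) so the output is reproducible (assumption)
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5),
  b = c(5, 4, 3, 2, 1)
)

my_summarise <- function(df, group_var) {
  df %>%
    # all_of() selects the columns named in the character vector group_var
    group_by(across(all_of(group_var))) %>%
    summarise(a = mean(a), .groups = "drop")
}

res_one <- my_summarise(df, "g1")
res_two <- my_summarise(df, c("g1", "g2"))
```

A side benefit of this form is that group_var can hold several column names at once, as in the second call.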

Convert the string column name to a bare column name using as.name() and then use the new {{}} (read Curly-Curly) operator as below:
library(dplyr)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_var) {
grp_var <- as.name(group_var)
df %>%
group_by({{grp_var}}) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')

You can also use sym and !!
my_summarise <- function(df, group_var) {
df %>%
group_by(!!sym(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
# A tibble: 2 x 2
g1 a
<dbl> <dbl>
1 1 3.5
2 2 2.67
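Another option, assuming a reasonably recent dplyr, is the .data pronoun, which subsets the current data frame by a string without building a symbol first. A sketch with fixed values for a (my substitution for sample(5), so the means are predictable):

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5)  # fixed values (assumption), not sample(5)
)

my_summarise <- function(df, group_var) {
  df %>%
    # .data[[group_var]] looks up the column whose name is stored in group_var
    group_by(.data[[group_var]]) %>%
    summarise(a = mean(a), .groups = "drop")
}

res <- my_summarise(df, "g1")
```

Unlike ensym, this also works when the string is held in a variable rather than typed literally in the call.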

Related

How to do rowwise matching in data.table between list-column and external vector?

I'm trying to do a rowwise matching for vector elements nested inside a data.table list-column.
library(tibble)
library(data.table)
my_dt <-
tibble::tribble(
~col_x, ~col_y,
"a", c(1, 2, 3),
"b", c(4, 5, 6),
"c", c(7, 8, 9)
) %>%
as.data.table()
external_vec <- 1:9
I want to match each element from col_y against external_vec such that I will get a new column in my_dt that is equivalent to external_vec %in% col_y, such as:
# desired output
## col_x col_y new_col
## 1: a 1,2,3 TRUE, TRUE, TRUE,FALSE,FALSE,FALSE,...
## 2: b 4,5,6 FALSE,FALSE,FALSE, TRUE, TRUE, TRUE,...
## 3: c 7,8,9 FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,...
EDIT
I know how to implement a solution based on tidyverse's purrr, like the following. However, I'm looking for a data.table native functionality.
# this works but relies on purrr and dplyr, which I hope to avoid in this problem
my_dt %>%
tibble::add_column(ext_vec_as_col = list(external_vec)) %>%
mutate(new_col = purrr::map2(.x = ext_vec_as_col, .y = col_y, .f = ~.x %in% .y) )
#> col_x col_y ext_vec_as_col new_col
#> 1: a 1,2,3 1,2,3,4,5,6,... TRUE, TRUE, TRUE,FALSE,FALSE,FALSE,...
#> 2: b 4,5,6 1,2,3,4,5,6,... FALSE,FALSE,FALSE, TRUE, TRUE, TRUE,...
#> 3: c 7,8,9 1,2,3,4,5,6,... FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,...
I've got the tidyverse-solution:
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
library(magrittr)
my_dt <-
tibble::tribble(
~col_x, ~col_y,
"a", c(1, 2, 3),
"b", c(4, 5, 6),
"c", c(7, 8, 9)
)
external_vec <- 1:9
my_dt %>%
mutate(new_col = col_y %>%
map(~external_vec %in% .x)) %>%
unnest_wider(col_y, names_sep = "_") %>% # this is just to see the result..
unnest_wider(new_col, names_sep = "_") # this is just to see the result..
Usually, library(tidyverse) includes a lot of these, but I just thought that it would be nicer to know exactly what you need to make this work.
data.table solution
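I think this is the equivalent, provided my_dt is a data.table (for example via as.data.table(my_dt)); the tibble built above does not support := :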
my_dt[, new_col := lapply(col_y, \(x) external_vec %in% x)]
my_dt
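To make the data.table route easy to try on its own, here is a small self-contained sketch that builds the table directly with data.table() rather than converting a tibble. It uses function(y) instead of the \(x) shorthand, on the assumption that some readers are on an R version older than 4.1:

```r
library(data.table)

# Build the list-column directly; each element of col_y is a numeric vector
my_dt <- data.table(
  col_x = c("a", "b", "c"),
  col_y = list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
)
external_vec <- 1:9

# lapply() walks the list-column row by row; := adds the result by reference
my_dt[, new_col := lapply(col_y, function(y) external_vec %in% y)]
```

Each element of new_col is a logical vector of the same length as external_vec.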
Performance
Since that is important, here is a performance benchmark:
bench::mark(
tidy =
my_dt %>%
mutate(new_col = col_y %>%
map(~external_vec %in% .x)),
dt = my_dt[, new_col := lapply(col_y, \(x) external_vec %in% x)],
op_tidy = my_dt %>%
tibble::add_column(ext_vec_as_col = list(external_vec)) %>%
mutate(new_col = purrr::map2(.x = ext_vec_as_col, .y = col_y, .f = ~.x %in% .y)),
check = FALSE
) %>%
plot()

Vectorization to extract and bind very nested data

I have some very nested data. Within my list-column-dataframes, there are some pieces I need to put together and I've done so in a single instance to get my desired dataframe:
a <- df[[2]][["result"]]@data
b <- df[[2]][["result"]]@coords
desired_df <- cbind(a, b)
My original Large list has 171 elements, meaning I have 1:171 (3.3 GB) to go inside those square brackets and would ideally end up with 171 desired dataframes (which I would then bind all together).
I haven't needed to write a loop in 10 years, but I don't see a tidyverse way to deal with this. I also no longer know how to write loops. There are definitely some elements in there that are junk and will fail.
You haven't provided any sort of minimal example of the data.
I've condensed it to mean something like this
base_data <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
base_data2 = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
rownames(base_data2) = c("d", "e", "f")
methods::setClass(
"weird_object",
slots = c(data = "data.frame", coords = "matrix"),
prototype = list(data = base_data, coords = base_data2)
)
df <- list(
list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
)
)
And if I had such a list with these objects, then I could do
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value)
But the selecting / hoisting function might fail, thus
one can wrap it in a purrr::possibly, and
choose a reasonable default:
df %>%
map(possibly(. %>% {
list(data = .$result@data,
coords = .$result@coords)
},
otherwise = list(data = NA, coords = NA))) %>%
enframe() %>%
unnest_wider(value)
Hopefully, this could be a step forward.
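If the end goal is simply the cbind(a, b) from the question repeated over every element and then stacked, a shorter route might look like the sketch below. The helper names extract_one and safe_extract are hypothetical, the toy class only mimics the structure described in the question, and the "junk" element stands in for the entries that are expected to fail:

```r
library(methods)
library(purrr)

# Toy S4 class mimicking the structure in the question (assumption)
setClass("weird_object", slots = c(data = "data.frame", coords = "matrix"))

good <- new(
  "weird_object",
  data = data.frame(group = c("a", "b"), var1 = c(3, 1)),
  coords = matrix(1:4, nrow = 2, dimnames = list(NULL, c("x", "y")))
)

# The second element is junk and has no slots, so extraction fails for it
results <- list(list(result = good), list(result = "junk"), list(result = good))

# Hypothetical helper: bind the two slots of one element side by side
extract_one <- function(x) cbind(x$result@data, x$result@coords)

# possibly() turns an error into NULL instead of stopping the whole run
safe_extract <- possibly(extract_one, otherwise = NULL)

pieces <- map(results, safe_extract)

# compact() drops the NULLs left by failed elements before stacking
combined <- do.call(rbind, compact(pieces))
```

With 171 elements the same code applies unchanged; only results would be the real list.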
Next step is probably something resembling this:
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value) %>%
mutate(coords = coords %>% map(. %>% as_tibble(rownames = "rowid"))) %>%
unnest(cols = c(data, coords)) %>%
#' rotating the thing now
pivot_longer(cols = c(group, rowid),
names_to = "var_name",
values_to = "var") %>%
select(-var_name) %>%
pivot_longer(cols = c(var1, var2, V1, V2, V3),
names_to = "var_name") %>%
pivot_wider(names_from = var, values_from = value) %>%
identity()
If I understand your data structure, which I probably don't, you could do:
library(tidyverse)
# Create dummy data
df <- mtcars
df$mpg <- list(result = I(list('test')))
df$mpg$result <- list("@data" = I(list('your data')))
df <- df %>% select(mpg, cyl)
df1 <- df
df2 <- df
# Pull data you're interested in.
# The index is 1 here, instead of 2, because it's fake data and not your data.
# Assuming the @ is not unique, and is just parsed from JSON or some other format.
dont_at_me <- function(x){
a <- x[[1]][["result"]][["@data"]]
a
}
# Get a list of all of your data.frames
all_dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# Vectorize
purrr::map(all_dfs, ~dont_at_me(.))

Nested loops in R that create new variable names and lags

I am new to R, but experienced in Stata. To learn R, I am tracking Covid-19 infections. That requires creating seven-day trailing averages, and I do so with the following loop.
for (mylag in c(1:7)) {
data <- data %>% group_by(state) %>% mutate(!!paste0("deathIncrease", "_", mylag) := lag(deathIncrease, mylag)) %>% ungroup()
}
This works, but then I want to run the same code, not just for deaths, but also for cases. So I tried the following.
var_list <- c("deathIncrease", "positiveIncrease")
for (var in var_list) {
for (mylag in c(1:7)) {
var <- enquo(var)
varname <- enquo( paste0(quo_name(var), "_", mylag) )
data <- data %>% group_by(state) %>% mutate(!!varname := lag(!!var, mylag)) %>% ungroup()
}
}
But that leads to the error arg must be a symbol. Any help would be much appreciated. In Stata, loops are simpler. Is there no package that gets R to automatically fill in the looping variables everywhere, like so: {{ var }}?
Edit: here is a minimal working example. The first way to create lags works, but only for var1. The second nested loop does not.
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
for (mylag in c(1:2)) {
df <- df %>% group_by(group_var) %>% mutate(!!paste0("var1", "_lag", mylag) := lag(var1, mylag)) %>% ungroup()
}
## Another loop
var_list <- c("var1", "var2")
for (myvar in var_list) {
for (mylag in c(1:2)) {
myvar <- enquo(myvar)
varname <- enquo( paste0(quo_name(myvar), "_", mylag) )
data <- data %>% group_by(state) %>% mutate(!!varname := lag(!!myvar, mylag)) %>% ungroup()
}
}
You can use get(), as in lag(get(myvar), mylag), to look up the column that the string myvar names:
for(mylag in 1:7){
for(myvar in c('deathIncrease', 'positiveIncrease')){
data <- data %>%
group_by(state) %>%
mutate(
!!paste0(myvar, '_', mylag) := lag(get(myvar), mylag)
) %>%
ungroup()
}
}
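A variant of the same loop, assuming a dplyr version that provides the .data pronoun: .data[[myvar]] only looks inside the data frame, whereas get() could also pick up an unrelated object of the same name from the surrounding environment. The sketch uses the small example data from the question's edit:

```r
library(dplyr)

df <- tibble(
  group_var = c("A", "A", "A", "B", "B", "B"),
  var1 = c(1, 2, 3, 1, 2, 3),
  var2 = c(10, 11, 12, 10, 11, 12)
)

for (myvar in c("var1", "var2")) {
  for (mylag in 1:2) {
    df <- df %>%
      group_by(group_var) %>%
      # the left-hand side builds the new column name, the right-hand side
      # reads the existing column whose name is stored in myvar
      mutate(!!paste0(myvar, "_lag", mylag) := lag(.data[[myvar]], mylag)) %>%
      ungroup()
  }
}
```

This leaves df with four extra columns: var1_lag1, var1_lag2, var2_lag1 and var2_lag2.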
My first solution contained a function that did not respect grouped data. I wanted to look at that anyway, so I spent a bit of time making it respect grouped data as well.
This is my solution now, it works as expected on grouped data, but it feels a bit hacky tbh.
add_lag <- function(.data, column, days) {
group <- unlist(groups(.data))
if(is.null(group)){
new <- mapply(function(x, y) {
lag(x, y)
}, x = .data[column], y = sort(rep(days, length(column))))
if(is.null(dim(new))){
new <- t(new)
}
new <- as.data.frame(new, stringsAsFactors = F)
names(new) <- paste0(column, "_", sort(rep(days, length(column))))
new <- as_tibble(new) %>%
select(sort(names(new)))
mutate(.data, !!!new)
} else {
tmp <- .data %>%
nest()
tmp$data <- lapply(tmp$data, function(x,y){
x %>%
add_lag(column,y)
}, y = days)
tmp %>% unnest(c(data))
}
}
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
df %>%
group_by(group_var) %>%
add_lag("var1", 1:2)
# A tibble: 6 x 5
# Groups: group_var [2]
group_var var1 var2 var1_1 var1_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 NA NA
2 A 2 11 1 NA
3 A 3 12 2 1
4 B 1 10 NA NA
5 B 2 11 1 NA
6 B 3 12 2 1
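A possibly less hacky take on the same idea, assuming dplyr >= 1.0.0: build a named list of lag functions and hand it to across(), which respects existing groups without any nesting. The function name add_lags and its arguments are hypothetical, not an existing API:

```r
library(dplyr)

# Hypothetical helper: one lagged column per (column, lag) combination
add_lags <- function(data, columns, lags) {
  lag_funs <- lapply(lags, function(k) {
    force(k)  # make sure each closure keeps its own k
    function(x) dplyr::lag(x, k)
  })
  names(lag_funs) <- paste0("lag", lags)
  # across() names the new columns "<column>_<function name>"
  mutate(data, across(all_of(columns), lag_funs))
}

df <- tibble(
  group_var = c("A", "A", "A", "B", "B", "B"),
  var1 = c(1, 2, 3, 1, 2, 3),
  var2 = c(10, 11, 12, 10, 11, 12)
)

res <- df %>%
  group_by(group_var) %>%
  add_lags(c("var1", "var2"), 1:2)
```

The new columns come out as var1_lag1, var1_lag2, var2_lag1 and var2_lag2, and the lags restart within each group.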

Tidyeval with list of column names in a function

I am trying to create a function that passes a list of column names to a dplyr function. I know how to do this if the list of columns names is given in the ... form, as explained in the tidyeval documentation:
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, ...) {
group_var <- quos(...)
df %>%
group_by(!!!group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1, g2)
But if I want to list the column names as an argument of the function, the above solution will not work (of course):
my_summarise <- function(df, group_var, sum_var) {
group_var <- quos(group_var) # nor enquo(group_var)
sum_var <- enquo(sum_var)
df %>%
group_by(!!!group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, list(g1, g2), a)
my_summarise(df, list(g1, g2), b)
How can I get the items inside the list to be quoted individually?
This question is similar to Passing dataframe column names in a function inside another function but in the comments it was suggested to use strings, while here I would like to use bare column names.
library(dplyr)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise = function(df, group_var, fun_name) {
df %>%
group_by(!!! group_var) %>%
summarize_all(fun_name)
}
my_summarise(df, alist(g1, g2), mean)
alist() handles the arguments 'g1' and 'g2' as function arguments (it does not evaluate them), while !!! (the same as UQS()) unquotes and splices the list. sum_var is not necessary, as it looks like you want to take the mean of both 'a' and 'b'. Also, you can generalize it by passing in the function as well.
You could pass your list of arguments using alist instead of list, as it won't evaluate the arguments.
my_summarise = function(df, group_var, sum_var) {
group_var = quos(!!! group_var)
sum_var = enquo(sum_var)
df %>%
group_by(!!! group_var) %>%
summarise(!! quo_name( sum_var) := mean( !! sum_var) )
}
my_summarise(df, alist(g1, g2), b)
# A tibble: 4 x 3
# Groups: g1 [?]
g1 g2 b
<dbl> <dbl> <dbl>
1 1 1 2.0
2 1 2 3.0
3 2 1 4.5
4 2 2 1.0
Another alternative would be to pass that argument directly with quos instead of list, as shown in this answer, which bypasses some complications altogether.
my_summarise = function(df, group_var, sum_var) {
# group_var = quos(!!! group_var)
sum_var = enquo(sum_var)
df %>%
group_by(!!! group_var) %>%
summarise(!! quo_name( sum_var) := mean( !! sum_var) )
}
my_summarise(df, quos(g1, g2), b)
# A tibble: 4 x 3
# Groups: g1 [?]
g1 g2 b
<dbl> <dbl> <dbl>
1 1 1 2.0
2 1 2 3.0
3 2 1 4.5
4 2 2 1.0
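With newer versions of dplyr (1.0.0 or later, as far as I know) the same thing can be written without quos or alist by embracing the argument inside across(), so the caller passes c(g1, g2) with bare names. A sketch, with fixed values for a and b that are my own substitution for sample(5):

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5),  # fixed values (assumption), not sample(5)
  b = c(5, 4, 3, 2, 1)
)

my_summarise <- function(df, group_vars, sum_var) {
  df %>%
    # {{ }} forwards the caller's bare column names into across()
    group_by(across({{ group_vars }})) %>%
    summarise(across({{ sum_var }}, mean), .groups = "drop")
}

res <- my_summarise(df, c(g1, g2), b)
```

The summarised column keeps the name of the column passed as sum_var.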

dplyr 0.7 - Specify grouping variable as string

I have some variable names specified as string (e.g. input from Shiny app) and I would like to use them in my dplyr and ggplot2 code as if they were variables.
I got it to work by trial and error, but I feel like there must be a better way. What is a better way to perform these operations?
library(rlang)
library(ggplot2)
library(dplyr)
someString <- "g1"
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_var) {
print(group_var)
df %>%
group_by(!!group_var) %>%
summarise(a = mean(a))
}
my_plot <- function(df, group_var) {
print(group_var)
ggplot(data = df %>%
group_by(!!group_var) %>%
summarise(a = mean(a)),
aes_string(x = quo_name(group_var), y = "a")) +
geom_bar(stat = "identity")
}
my_summarise(df, quo(UQ(sym(someString))))
my_plot(df, quo(UQ(sym(someString))))
Either of these options is probably simpler:
my_summarise <- function(df, group_var) {
print(group_var)
df %>%
#Either works
group_by_at(.vars = group_var) %>%
#group_by(!!sym(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df,someString)
my_plot <- function(df, group_var) {
print(group_var)
ggplot(data = df %>%
group_by_at(.vars = group_var) %>%
#group_by(!!sym(group_var)) %>%
summarise(a = mean(a)),
aes_string(x = group_var, y = "a")) +
geom_bar(stat = "identity")
}
my_plot(df, someString)
...where you could use either group_by or group_by_at.
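Since aes_string() is deprecated in current ggplot2, a version that avoids it may be worth having. The sketch below assumes ggplot2 >= 3.0.0 and dplyr >= 1.0.0, uses the .data pronoun inside aes(), and swaps geom_bar(stat = "identity") for the equivalent geom_col(); the fixed values for a are mine:

```r
library(dplyr)
library(ggplot2)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5)  # fixed values (assumption), not sample(5)
)
someString <- "g1"

my_plot <- function(df, group_var) {
  summary_df <- df %>%
    group_by(across(all_of(group_var))) %>%
    summarise(a = mean(a), .groups = "drop")
  # .data[[group_var]] maps the column named by the string to the x axis
  ggplot(summary_df, aes(x = .data[[group_var]], y = a)) +
    geom_col()
}

p <- my_plot(df, someString)
```

Printing p draws one bar per level of the chosen grouping column.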
What about calling with my_summarise(df, as.name(someString))?
