Using a list of symbols with tidy-select - r

In our data aggregation pipeline we have a bunch of conditioning variables that are used to define groups. To improve code readability and maintainability, we use a preconfigured list of symbols with tidy evaluation as per this illustrative snippet:
# this is the list of our condition variables
condition_vars <- rlang::exprs(var1, var2, var3)
# split the data into groups
data %>% group_by(!!!condition_vars) %>% summarize(...)
This works great, but I can't figure out an elegant way to use this in a <tidy-select> context, e.g. for something like nest(). The problem is that the new nest() wants something like nest(data = c(var1, var2, var3)) and not nest(var1, var2, var3), so nest(!!!condition_vars) will give me a warning.
The best I could come up with is
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
vars <- exprs(y, z)
nest(df, data = !!call2("c", !!!vars))
but surely there is a better way...

You can do nest(df, data = c(!!!vars)).
But nowadays, if the expressions are simple column names, I would store them in a character vector. You can supply the character vectors with all_of() in selection contexts. In action verbs like mutate() or group_by(), use across() to create a selection context where you can use all_of() (and other features like starts_with()).
cols <- c("cyl", "am")
mtcars %>% group_by(across(all_of(cols)))
mtcars %>% nest(data = all_of(cols))
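As a minimal sketch of the two approaches side by side, the snippet below reuses the small tibble from the question. The object names nested_syms, nested_chr and grouped are made up for illustration, and it assumes dplyr 1.0.0 or later and tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

# Option 1: keep a list of symbols and splice it into c()
vars <- rlang::exprs(y, z)
nested_syms <- nest(df, data = c(!!!vars))

# Option 2: keep plain column names and select them with all_of()
cols <- c("y", "z")
nested_chr <- nest(df, data = all_of(cols))

# The same character vector also works in group_by() via across()
grouped <- df %>% group_by(across(all_of("x")))
group_vars(grouped)
```

Both calls should nest y and z into a list-column named data, leaving one row per value of x.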

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){ aggdata <- data %>% group_by({{group}}) %>% select_if(function(col) !is.numeric(col) & !is.integer(col)) %>% summarise_if(is.numeric, sum, na.rm = TRUE) return(aggdata)
Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.
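For what it's worth, here is one way the function might look with that idea applied. This is only a sketch under assumptions: it guesses that the goal is to sum every numeric column within each group, and it uses across() (dplyr 1.0.0 or later) so the single group argument can take either one column or several wrapped in c().

```r
library(dplyr)

# Hypothetical rewrite of agg(): `group` accepts one column or c(col1, col2)
agg <- function(data, group) {
  data %>%
    group_by(across({{ group }})) %>%
    # sum every numeric column that is not a grouping column
    summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
              .groups = "drop")
}

by_one <- agg(mtcars, cyl)           # one grouping variable
by_two <- agg(mtcars, c(cyl, gear))  # two grouping variables, same argument
```

If the group names arrive as strings instead, group_by(across(all_of(my_groups))) or the !!!syms(my_groups) form above would do the same job.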

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
mutate_at and summarise_at are superseded, and you should use across instead.
The reason your code wasn't working is that you did not write your function as a formula (you did not add ~ at the beginning), and you were using df$Population instead of Population. When you write Population, summarise knows you're talking about the column Population which, at that point, is grouped like the rest of the dataframe. When you use df$Population, you are calling the column of the original dataframe without grouping. Not only is it wrong, but you would also get an error, because the length of the variable you are trying to average and the length of the weights provided by df$Population would not match.
Here is how you could do it:
library(dplyr)
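df %>%
  group_by(cz) %>%
  # all_of() selects the columns named in vlist; the lambda makes sure
  # Population is evaluated within each group
  summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
            .groups = "drop")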
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I considered df and vlist like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
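To make the difference between Population and df$Population concrete, here is a small check on a made-up data frame. The object toy and its values are assumptions chosen so the weighted means are easy to verify by hand.

```r
library(dplyr)

# Toy data: two commuting zones, three rows each
toy <- data.frame(
  cz         = rep(c("a", "b"), each = 3),
  Population = c(1, 2, 3, 4, 5, 6),
  spend      = c(10, 20, 30, 40, 50, 60)
)

# Bare `Population` is the per-group slice, so lengths match (3 and 3)
res <- toy %>%
  group_by(cz) %>%
  summarise(across(all_of("spend"), ~ weighted.mean(.x, Population)),
            .groups = "drop")

# Manual check for group "a": sum(x * w) / sum(w)
manual_a <- sum(c(10, 20, 30) * c(1, 2, 3)) / sum(c(1, 2, 3))

# `toy$Population` is the full ungrouped column (length 6), so this fails
bad <- try(
  toy %>%
    group_by(cz) %>%
    summarise(across(all_of("spend"), ~ weighted.mean(.x, toy$Population))),
  silent = TRUE
)
failed <- inherits(bad, "try-error")
```

Here res$spend[1] should agree with manual_a, and failed should be TRUE because each group of 3 values is paired with all 6 weights.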

How do I convert a bunch of factors to ordinals?

I have a bunch of factors that are really ordinals but they're coded as numerics.
This is my code
student_performance <-
read_csv("https://raw.githubusercontent.com/UBC-MDS/ellognea-smwatts-student-performance/master/data/student-math-perf.csv") %>%
as_tibble()
convert.to.ordinals <-
c("Medu",
"Fedu",
"traveltime",
"studytime")
student_perf %>%
mutate_at(vars(convert.to.ordinals), as.factor(ordered = T))
I'm trying to organize them as ordinals and get them to be in ascending order, so it would be the same as doing factor(student_performance$Medu, levels = c(1, 2, 3, 4)) except for all of the ones in the list of variable names
In newer versions of dplyr (1.0.0 and later), we can use across() to loop over the columns named in the vector convert.to.ordinals and apply factor() to each of them. Assign the output back to the original object to modify it.
library(dplyr)
student_performance <- student_performance %>%
mutate(across(all_of(convert.to.ordinals), ~
factor(., ordered = TRUE)))
NOTE: across() is a generic way to loop over groups of columns, and it replaces mutate_at(), mutate_all() and mutate_if(). The difference is in how .cols is specified: a vector of column names wrapped in all_of(), select helpers such as matches(), starts_with() or ends_with(), everything() (the mutate_all() case), or where() (the mutate_if() case).
Or with mutate_at, the key is the lambda function (~ => function(x))
student_performance %>%
mutate_at(vars(convert.to.ordinals), ~ factor(., ordered = TRUE))
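Since the CSV in the question may not be available, here is a self-contained sketch of the same idea on a made-up tibble. The objects toy, ord_cols and toy_ord, and the values in them, are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for the student data
toy <- tibble(
  Medu      = c(2, 4, 1, 3),
  studytime = c(1, 3, 2, 2),
  score     = c(10, 12, 9, 15)
)
ord_cols <- c("Medu", "studytime")

# Convert only the listed columns to ordered factors
toy_ord <- toy %>%
  mutate(across(all_of(ord_cols), ~ factor(.x, ordered = TRUE)))

is.ordered(toy_ord$Medu)  # TRUE
levels(toy_ord$Medu)      # "1" "2" "3" "4"
```

factor() sorts the unique values to build the levels, so numeric codes end up in ascending order without spelling out levels = c(1, 2, 3, 4). If a level never occurs in the data it will be missing, in which case passing levels explicitly is the safer choice.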

Programmatically choosing which variables to put into dplyr pipe

I'm working with datasets (from smartphone experience sampling) where I very frequently have to perform grouped operations (such as finding the variability of a measure within each person, or within each day within each person, etc). Typical code might look like the code below, which calculates within-day variability for some variables, then takes the mean of the within-day variability and joins it to the original data.
output <- group_by(mydata, id, day) %>%
mutate_at(vars(angr, sad, guil, anx, hap), funs(sd(., na.rm = TRUE))) %>%
ungroup() %>%
group_by(id) %>%
summarize_at(vars(angr, sad, guil, anx, hap), funs('var_day_mean' = mean(., na.rm = TRUE))) %>%
join(mydata, .)
What I want to do is be able to save this as a function so that instead of having to type out angr, sad, guil, anx, hap many times over, I can call this code (and slight variations on it saved as different functions) on a vector of variable names in a string. So the desired functionality is:
vars <- c('angr', 'sad', 'guil', 'anx', 'hap')
output <- myfunc(vars)
Where myfunc performs the piped operations above.
I'm aware that there is a vignette for non standard evaluation using dplyr but it's very limited and doesn't cover mutate or most of what I need to do with this use case, so would appreciate any insight.
Reproducible example - what I desire is essentially that the below code work, but currently the dplyr pipe cannot take vars as a character vector the way I have input it.
Edit: I was mistaken - the below code does work, and dplyr can function in this way (and can also take character vectors to group_by, making this easy to program with). I leave the code below as a (working) reference.
data <- data.frame('ID' = rep(1:10, each = 10),
'day' = rep(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 10),
'anx' = rnorm(100), 'sad' = rnorm(100), 'hap' = rnorm(100))
vars = c('anx', 'sad', 'hap')
out <- group_by(data, ID, day) %>%
mutate_at(vars, funs(sd(., na.rm = TRUE)))
With mutate_at you can simply supply the names of the columns as a vector:
mtcars %>% mutate_at(c("mpg", "hp"), funs(mean))
This should do the trick.
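Note that funs() has been deprecated since dplyr 0.8.0, so on a current dplyr the same idea is usually written with across(). Below is a rough sketch of the function the question asks for. The name within_day_sd_mean and the argument measure_cols are made up, the grouping columns ID and day are taken from the reproducible example, and dplyr 1.0.0 or later is assumed.

```r
library(dplyr)

# Hypothetical helper: mean of the within-day SD for each person
within_day_sd_mean <- function(data, measure_cols) {
  data %>%
    group_by(ID, day) %>%
    # replace each measure by its within-day standard deviation
    mutate(across(all_of(measure_cols), ~ sd(.x, na.rm = TRUE))) %>%
    group_by(ID) %>%
    # average those values per person, with a suffix on the new columns
    summarise(across(all_of(measure_cols), ~ mean(.x, na.rm = TRUE),
                     .names = "{.col}_var_day_mean"),
              .groups = "drop")
}

set.seed(1)
data <- data.frame(ID = rep(1:10, each = 10),
                   day = rep(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 10),
                   anx = rnorm(100), sad = rnorm(100), hap = rnorm(100))

out <- within_day_sd_mean(data, c("anx", "sad", "hap"))
joined <- left_join(data, out, by = "ID")
```

The last line joins the per-person summaries back onto the original rows, which is what the join() step in the question appears to be doing.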

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
my_data %>%
group_by(my_grouping_variable) %>%
do(my_function_call(data.frame(x = .$X, y = .$Y),
GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
do(my_function_call(data.frame(x = .$X, y = .$Y),
unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe-chain, but check in your function:
your_function <- function(x) {
  # nest the non-grouping columns if a grouped data frame was passed in
  if (inherits(x, "grouped_df")) x <- nest(x)
  x
}
Your function should then iterate over the list-column data with all grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function directly receives the values for each group, passed as class structure. It's a bit difficult to explain, just try it out and debug your_function_call step by step...
You can use groups(); however, a standard-evaluation (SE) version of it does not exist, so I'm unsure how useful it is in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
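In more recent dplyr versions there are also helpers aimed at exactly this: group_vars() returns the grouping column names as a character vector, group_modify() (dplyr 0.8.0 or later) hands each group's key to your function, and cur_group() (dplyr 1.0.0 or later) exposes the current key inside summarise() or mutate(). A minimal sketch follows, where describe_group is a made-up stand-in for my_function_call.

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# Names of the grouping columns as plain strings
group_vars(grouped)

# Hypothetical stand-in for the user's function: takes values and the group key
describe_group <- function(values, key) {
  paste0("cyl=", key, ", n=", length(values))
}

# group_modify(): .x is the group's data, .y is a one-row tibble of keys
res_modify <- grouped %>%
  group_modify(~ tibble(label = describe_group(.x$mpg, .y$cyl)))

# cur_group(): the same key, available inside summarise()
res_summarise <- grouped %>%
  summarise(label = describe_group(mpg, cur_group()$cyl))
```

Either form avoids calling unique() on the grouping column inside every group.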
