I'm currently repeating a lot of code, because I need to summarize the same columns for different groups. How can I write the summarize call (which is always the same) only once, but define the output name and the group_by arguments case by case?
A minimal example:
library(dplyr)
col1 <- c("UK", "US", "UK", "US")
col2 <- c("Tech", "Social", "Social", "Tech")
col3 <- c("0-5years", "6-10years", "0-5years", "0-5years")
col4 <- 1:4
col5 <- 5:8
df <- data.frame(col1, col2, col3, col4, col5)
result1 <- df %>%
group_by(col1, col2) %>%
summarize(sum1 = sum(col4, col5))
result2 <- df %>%
group_by(col2, col3) %>%
summarize(sum1 = sum(col4, col5))
result3 <- df %>%
group_by(col1, col3) %>%
summarize(sum1 = sum(col4, col5))
Using combn:
combn(colnames(df)[1:3], 2, FUN = function(x){
df %>%
# across(all_of(x)) replaces the deprecated .dots argument (dplyr >= 1.0.0)
group_by(across(all_of(x))) %>%
summarize(sum1 = sum(col4, col5))
}, simplify = FALSE)
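To see what combn hands to the function, and to give each result a readable name, the call can be split into two steps. This is only a sketch: the names pairs and results are illustrative, and across(all_of()) assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Every pair of grouping columns, as a list of character vectors
pairs <- combn(colnames(df)[1:3], 2, simplify = FALSE)

# One summary per pair of grouping columns
results <- lapply(pairs, function(x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
})

# Name each result after its grouping columns, e.g. "col1_col2"
names(results) <- vapply(pairs, paste, character(1), collapse = "_")

results$col1_col2
```

Keeping the summaries in one named list avoids creating result1, result2, ... by hand.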
To use dplyr inside your own functions, you can use tidy evaluation. This is needed because dplyr evaluates its arguments with non-standard evaluation: column names are captured as expressions and looked up in the data frame instead of being evaluated like ordinary R code. I recommend reading this:
https://tidyeval.tidyverse.org/modifying-inputs.html#modifying-quoted-expressions
summarizefunction <- function(data, ..., sumvar1, sumvar2) {
groups <- enquos(...)
sumvar1 <- enquo(sumvar1)
sumvar2 <- enquo(sumvar2)
result <- data %>%
group_by(!!!groups) %>%
summarise(sum1 = sum(!!sumvar1, !!sumvar2))
return(result)
}
summarizefunction(df, col1, col2, sumvar1 = col4, sumvar2 = col5)
enquo() captures (quotes) an argument so that it is not evaluated immediately, and enquos() does the same for the arguments passed through .... You can then use the !! operator (called "bang bang") to unquote a single captured argument, and !!! to unquote a list of them. I think this is the most flexible and reusable solution, even though it requires a little more code up front.
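With rlang 0.4.0 or later, the quote-then-unquote pattern can usually be shortened with the embrace operator {{ }}, and ... can be forwarded to group_by directly. The sketch below assumes such a version is installed; summarize_by is a hypothetical name, not part of dplyr.

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical helper: grouping columns go through ...,
# the two columns to add up are embraced with {{ }}
summarize_by <- function(data, ..., sumvar1, sumvar2) {
  data %>%
    group_by(...) %>%
    summarise(sum1 = sum({{ sumvar1 }}, {{ sumvar2 }}), .groups = "drop")
}

res <- summarize_by(df, col1, col2, sumvar1 = col4, sumvar2 = col5)
res
```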
You can also use purrr::partial in these situations:
library(purrr)
summarize45 <- partial(summarize, sum1 = sum(col4, col5))
result1b <- df %>%
group_by(col1, col2) %>%
summarize45()
identical(result1, result1b)
# [1] TRUE
Or, pushing it further:
gb_df <- partial(group_by, df)
result1c <- gb_df(col1, col2) %>% summarize45()
identical(result1, result1c)
# [1] TRUE
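If an extra dependency on purrr is not wanted, an ordinary wrapper function gives the same effect for this case. A minimal sketch, where summarize45_fn is an illustrative name:

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# The summary is written once; the grouping is decided by the caller
summarize45_fn <- function(.data) {
  summarize(.data, sum1 = sum(col4, col5))
}

result1d <- df %>%
  group_by(col1, col2) %>%
  summarize45_fn()

result3d <- df %>%
  group_by(col1, col3) %>%
  summarize45_fn()
```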
First, you'll need a function that turns the column names (passed as strings) into symbols and unquotes them:
library(tidyverse)
res_func <- function(x, y){
df %>%
group_by(!!as.symbol(x), !!as.symbol(y)) %>%
summarize(sum1 = sum(col4, col5))
}
This works like a charm:
res_func("col1", "col2")
# A tibble: 4 x 3
# Groups: col1 [2]
col1 col2 sum1
<fct> <fct> <int>
1 UK Social 10
2 UK Tech 6
3 US Social 8
4 US Tech 12
We can then use assign inside the function to create a data frame whose name is built from the arguments passed in:
res_func2 <- function(x, y){
assign(paste0("result_", x, y),
df %>%
group_by(!!as.symbol(x), !!as.symbol(y)) %>%
summarize(sum1 = sum(col4, col5)),
envir = parent.frame())
}
This creates a new data frame called result_col1col2 in the calling environment, simply by running res_func2("col1", "col2"):
> result_col1col2
# A tibble: 4 x 3
# Groups: col1 [2]
col1 col2 sum1
<fct> <fct> <int>
1 UK Social 10
2 UK Tech 6
3 US Social 8
4 US Tech 12
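As an alternative to assign, the output names and the grouping columns can be declared together in a named list and the results kept in a list as well. This is a sketch under the assumption that dplyr 1.0.0 or later is available; group_sum and specs are hypothetical names.

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical helper: cols is a character vector of grouping columns
group_sum <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
}

# Output name on the left, grouping columns on the right
specs <- list(
  result1 = c("col1", "col2"),
  result2 = c("col2", "col3"),
  result3 = c("col1", "col3")
)

# lapply keeps the names of specs, so results$result1 etc. are available
results <- lapply(specs, function(cols) group_sum(df, cols))

results$result2
```

Nothing is written into the calling environment, so the function has no side effects and the results are easy to loop over.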
I have a data.frame where values are repeated in col1.
col1 <- c("A", "A", "B", "B", "C")
col2 <- c(1995, 1997, 1999, 2000, 2005)
df <- data.frame(col1, col2)
I want to combine values in col2 that correspond to the same letter in col1 into one cell, so that col2 shows a range of values for a particular letter in col1. I do this by splitting the data.frame by col1, applying fun, and binding the split data.frames back together.
library(tidyverse)
split_df <- split(df, df$col1)
fun <- function(df) {
if (length(unique(df$col2)) > 1) {
df$col2 <- paste(min(df$col2),
max(df$col2),
sep = "-")
df <- distinct(df)
}
return(df)
}
split_df <- lapply(split_df, fun)
df <- do.call(rbind, split_df)
This works, but I am wondering if there is a more intuitive or more efficient solution?
Base R way using aggregate -
aggregate(col2~col1, df, function(x) paste0(unique(range(x)), collapse = '-'))
# col1 col2
#1 A 1995-1997
#2 B 1999-2000
#3 C 2005
Same can also be written with dplyr -
library(dplyr)
df %>%
group_by(col1) %>%
summarise(col2 = paste0(unique(range(col2)), collapse = '-'))
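The reason unique(range(x)) needs no special case for single-row groups can be checked on plain vectors. In this small sketch, collapse_range is just an illustrative wrapper around the expression used above.

```r
# Illustrative wrapper around the expression from the answers above
collapse_range <- function(x) {
  paste0(unique(range(x)), collapse = "-")
}

# range() returns c(min, max); with two distinct years both are kept
collapse_range(c(1995, 1997))
# [1] "1995-1997"

# With a single year, min and max coincide and unique() drops the duplicate
collapse_range(2005)
# [1] "2005"

# With more than two years only the extremes survive
collapse_range(c(1999, 2003, 2000))
# [1] "1999-2003"
```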
One option would be the tidyverse, where you can accomplish this a little more succinctly. The basic idea is the same:
library(tidyverse)
new.result <- df %>%
group_by(col1) %>%
summarize(
col2 = ifelse(n() == 1, as.character(col2), paste(min(col2), max(col2), sep = '-'))
)
col1 col2
<chr> <chr>
1 A 1995-1997
2 B 1999-2000
3 C 2005
A different (but possibly overcomplicated) approach assumes that you have at most two years per grouping. We can pivot the start and end years into their own columns, and then paste them together directly. This requires a little more data transformation but avoids having to check explicitly for groups with 1 year:
df %>%
group_by(col1) %>%
mutate(n = row_number()) %>%
pivot_wider(names_from = n, values_from = col2) %>%
rowwise() %>%
mutate(
vec = list(c(`1`, `2`)),
col2 = paste(vec[!is.na(vec)], collapse = '-')
) %>%
select(col1, col2)
Let us consider the following dataframe
library(dplyr)
d <- data.frame(v1 = c("a","a","b","b"), v2 = c("X","Y","Y","X"))
For the "a" group, the v2 column is in the order (X,Y), which I consider the correct order. By contrast, in the "b" group the order is incorrect (Y,X).
Using dplyr and the do() function, I can check for each group, whether the order is correct or not:
filter_fn <- function(my_row){
iX <- filter(my_row, v2 == "X")$i
iY <- filter(my_row, v2 == "Y")$i
res <- as.logical(iX < iY)
return(data.frame(res))
}
d %>%
group_by(v1) %>%
dplyr::mutate(i = row_number()) %>%
do(filter_fn(.)) %>%
ungroup()
But to avoid a proliferation of helper functions, I want to write the logic directly in the dplyr chain. I have tried group_map and group_modify:
d %>%
group_by(v1) %>%
dplyr::mutate(i = row_number()) %>%
group_map( ~ {
filter(.$v2 == "X")$i < filter(.$v2 == "Y")$i
})
But apparently my understanding of group_map is wrong.
In the documentation I don't see how a function can be used in do(.) without first being defined as a separate function.
The expected output would be the following data frame:
v1 res
a TRUE
b FALSE
You can define the correct order, use match to get the position of each value in v2, and use diff to compute the difference between those positions within each v1 group. res is TRUE when every difference is positive, i.e. when the values occur in the correct order.
library(dplyr)
correct_order = c('X', 'Y')
d %>%
group_by(v1) %>%
summarise(res = all(diff(match(correct_order, v2)) > 0))
# v1 res
# <chr> <lgl>
#1 a TRUE
#2 b FALSE
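The match/diff logic can be followed on the two groups by hand. This base R sketch uses the same values as d; the pos_a and pos_b names are only for illustration.

```r
correct_order <- c("X", "Y")

# Position of "X" and of "Y" inside each group's v2 values
pos_a <- match(correct_order, c("X", "Y"))  # group "a": 1 2
pos_b <- match(correct_order, c("Y", "X"))  # group "b": 2 1

# diff() is positive when "Y" comes after "X"
diff(pos_a)
# [1] 1
diff(pos_b)
# [1] -1

all(diff(pos_a) > 0)
# [1] TRUE
all(diff(pos_b) > 0)
# [1] FALSE
```

Because correct_order can hold more than two values, the same expression checks longer sequences as well.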
We can reshape to 'wide' format and then do an elementwise comparison of the resulting columns:
library(stringr)
library(dplyr)
library(tidyr)
library(data.table)
d %>%
mutate(rn = str_c('col', rowid(v1))) %>%
pivot_wider(names_from = rn, values_from = v2) %>%
transmute(v1, res = col1 < col2)
# A tibble: 2 x 2
# v1 res
# <chr> <lgl>
#1 a TRUE
#2 b FALSE
Another option is to convert v2 to an ordered factor and then, grouped by 'v1', check elementwise whether the levels of the factor equal its unique values in order of appearance:
d %>%
mutate(v2 = ordered(v2, c('X', 'Y'))) %>%
group_by(v1) %>%
summarise(res = all(levels(v2) == unique(v2)))
# A tibble: 2 x 2
# v1 res
# <chr> <lgl>
#1 a TRUE
#2 b FALSE
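For the attempt in the question, the comparison can also be written inline, either in summarise or as a group_modify lambda. Both variants below are sketches that assume each of "X" and "Y" occurs exactly once per group.

```r
library(dplyr)

d <- data.frame(v1 = c("a", "a", "b", "b"), v2 = c("X", "Y", "Y", "X"))

# which() gives the within-group position, so no helper column is needed
res <- d %>%
  group_by(v1) %>%
  summarise(res = which(v2 == "X") < which(v2 == "Y"))

# Same idea with group_modify: .x is the data of one group (without v1),
# and the lambda has to return a data frame
res2 <- d %>%
  group_by(v1) %>%
  mutate(i = row_number()) %>%
  group_modify(~ data.frame(res = .x$i[.x$v2 == "X"] < .x$i[.x$v2 == "Y"])) %>%
  ungroup()

res
res2
```

The original group_map call failed mainly because filter() was given a logical vector as its first argument instead of a data frame.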
How can I use the magic dots (...) / ellipsis to filter on an arbitrary column?
df = tibble::tibble(col1 = c('a', 'b', 'c'), col2 = c(1,3,4))
my_func = function(x, ...){
df %>%
dplyr::filter(... == x)
}
my_func('a', col1)
# Should return:
# A tibble: 1 x 2
col1 col2
<chr> <dbl>
1 a 1
We can convert ... to a quosure (quo) and unquote it (!!). Here, we assume that only a single column is passed through ... into the filter:
my_func <- function(x, ...){
df %>%
dplyr::filter(!! quo(...) == x)
}
my_func('a', col1)
# A tibble: 1 x 2
# col1 col2
# <chr> <dbl>
#1 a 1
If there are multiple columns, it may be better to use filter_at (superseded in current dplyr by filter() combined with if_any() or if_all()).
It may also be better to use {{ }} if the column can be passed as a named argument instead of through ...:
my_func <- function(x, coln) {
df %>%
filter({{coln}} == x)
}
my_func('a', col1)
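The {{ }} version becomes easier to reuse when the data frame is an argument too, instead of being picked up from the global environment. A minimal sketch, where filter_equal is a hypothetical name:

```r
library(dplyr)

df <- tibble::tibble(col1 = c("a", "b", "c"), col2 = c(1, 3, 4))

# Hypothetical helper: data, column and value are all passed explicitly
filter_equal <- function(data, coln, x) {
  data %>%
    filter({{ coln }} == x)
}

by_name  <- filter_equal(df, col1, "a")
by_value <- filter_equal(df, col2, 3)

by_name
by_value
```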
I am fairly new to R. I wrote the below function which tries to summarise a dataframe, based on a feature variable (passed to the function as 'variable') and a target variable (passed to the function as target_var). I also pass it a value (target_val) on which to filter.
The function below falls over on the filter line (filter(target_var == target_val)). I think it has something to do with quo, quosures, etc., but I can't figure out how to fix it. The following code should be ready to run: if you exclude the filter line it should work, and if you include the filter line it will fall over.
library(dplyr)
target <- c('good', 'good', 'bad', 'good', 'good', 'bad')
var_1 <- c('debit_order', 'other', 'other', 'debit_order','debit_order','debit_order')
dset <- data.frame(target, var_1)
odds_by_var <- function(dataframe, variable, target_var, target_val){
df_name <- paste('odds', deparse(substitute(variable)), sep = "_")
variable_string <- deparse(substitute(variable))
target_string <- deparse(substitute(target_var))
temp_df1 <- dataframe %>%
group_by_(variable_string, target_string) %>%
summarise(cnt = n()) %>%
group_by_(variable_string) %>%
mutate(total = sum(cnt)) %>%
mutate(rate = cnt / total) %>%
filter(target_var == target_val)
assign(df_name, temp_df1, envir=.GlobalEnv)
}
odds_by_var(dset, var_1, target, 'bad')
So I assume you want to filter by target, good or bad.
In my experience it is safer to filter() before you group_by() and summarise, because the summary can drop the variables you want to filter on. I restructured your function a little:
dset <- data.frame(target, var_1)
odds_by_var <- function(dataframe, variable, target_var, target_val){
df_name <- paste('odds', deparse(substitute(variable)), sep = "_")
variable_string <- deparse(substitute(variable))
target_string <- deparse(substitute(target_var))
temp_df1 <- dataframe %>%
group_by_(variable_string, target_string) %>%
summarise(cnt = n()) %>%
mutate(total = sum(cnt),
rate = cnt / total)
names(temp_df1) <- c(variable_string,"target","cnt","total","rate" )
temp_df1 <- temp_df1[temp_df1$target == target_val,]
assign( df_name,temp_df1, envir=.GlobalEnv)
}
odds_by_var(dset, var_1, target, "bad")
result:
> odds_var_1
# A tibble: 2 x 5
# Groups: var_1 [2]
var_1 target cnt total rate
<chr> <chr> <int> <int> <dbl>
1 debit_order bad 1 4 0.25
2 other bad 1 2 0.5
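Since group_by_() is deprecated, the same table can also be produced with the embrace operator {{ }}. The sketch below assumes dplyr 1.0.0 or later; odds_by_var2 is a hypothetical name, and it returns the result instead of assigning it into the global environment.

```r
library(dplyr)

target <- c("good", "good", "bad", "good", "good", "bad")
var_1 <- c("debit_order", "other", "other",
           "debit_order", "debit_order", "debit_order")
dset <- data.frame(target, var_1)

# Hypothetical rewrite: columns are passed unquoted and embraced with {{ }}
odds_by_var2 <- function(dataframe, variable, target_var, target_val) {
  dataframe %>%
    count({{ variable }}, {{ target_var }}, name = "cnt") %>%  # rows per combination
    group_by({{ variable }}) %>%
    mutate(total = sum(cnt),       # rows per value of `variable`
           rate = cnt / total) %>%
    ungroup() %>%
    filter({{ target_var }} == target_val)
}

odds_var_1 <- odds_by_var2(dset, var_1, target, "bad")
odds_var_1
```

Here the filter works because {{ target_var }} refers to the target column itself, whereas the bare name target_var in the original function is not a column of the data.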
I have a dataframe that looks like this
df = data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5), race = rep(1:2, 5))
I'm trying to write a function that takes a data frame as its first argument, together with any number of further arguments that are column names in that data frame, and uses these column names to perform operations on the data frame. My function would look like this:
library(dplyr)
myFunction <- function(df, ...){
columns <- list(...)
for (i in 1:length(columns)){
var <- enquo(columns[[i]])
df <- df %>% group_by(!!var)
}
df2 = summarise(df, mean = mean(wt))
return(df2)
}
I call the function as follows:
myFunction(df, race, gender)
However, I get the following error message:
Error in myFunction(df, race, gender) : object 'race' not found
We can convert the elements in ... to quosures and then do the evaluation (!!!)
myFunction <- function(dat, ...){
columns <- quos(...) # convert to quosures
dat %>%
group_by(!!! columns) %>% # evaluate
summarise(mean = mean(wt))
}
myFunction(df, race, gender)
# A tibble: 2 x 3
# Groups: race [?]
# race gender mean
# <int> <int> <dbl>
#1 1 1 75
#2 2 2 76
myFunction(df, race)
# A tibble: 2 x 2
# race mean
# <int> <dbl>
#1 1 75
#2 2 76
NOTE: In the OP's example, 'race' and 'gender' are identical, so grouping by both gives the same result as grouping by one of them.
If we change the data, we will see the difference:
df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
race = rep(1:2, each = 5))
myFunction(df, race, gender)
myFunction(df, race)
myFunction(df, gender)
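When the column names are only forwarded to group_by, ... can also be passed straight through without quos() and !!!. A minimal sketch with the modified data, where myFunction3 is an illustrative name:

```r
library(dplyr)

df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
                 race = rep(1:2, each = 5))

# ... is handed to group_by unchanged, which does the quoting itself
myFunction3 <- function(dat, ...) {
  dat %>%
    group_by(...) %>%
    summarise(mean = mean(wt), .groups = "drop")
}

by_race <- myFunction3(df, race)
by_both <- myFunction3(df, race, gender)

by_race
by_both
```

quos() and !!! remain useful when the captured columns have to be inspected or modified before they are used.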
If we decide to pass the arguments as quoted strings, then we can make use of group_by_at
myFunction2 <- function(df, ...) {
columns <- c(...)
df %>%
group_by_at(columns) %>%
summarise(mean= mean(wt))
}
myFunction2(df, "race", "gender")
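In dplyr 1.0.0 and later, group_by_at is superseded, and the string-based variant can be written with across(all_of()). This is a sketch assuming such a version; myFunction4 is an illustrative name.

```r
library(dplyr)

df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
                 race = rep(1:2, each = 5))

# Column names arrive as strings and are selected with all_of()
myFunction4 <- function(df, ...) {
  columns <- c(...)
  df %>%
    group_by(across(all_of(columns))) %>%
    summarise(mean = mean(wt), .groups = "drop")
}

res_str <- myFunction4(df, "race", "gender")
res_str
```

all_of() raises an error if one of the names is not a column of df, which makes typos visible early.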