Passing parameters into function that uses dplyr - r

I have the following function to describe a variable
library(dplyr)
describe = function(.data, variable){
args <- as.list(match.call())
evalue = eval(args$variable, .data)
summarise(.data,
'n'= length(evalue),
'mean' = mean(evalue),
'sd' = sd(evalue))
}
I want to use dplyr for describing the variable.
set.seed(1)
df = data.frame(
'g' = sample(1:3, 100, replace=T),
'x1' = rnorm(100),
'x2' = rnorm(100)
)
df %>% describe(x1)
# n mean sd
# 1 100 -0.01757949 0.9400179
The problem is that when I try to apply the same descriptive summary after group_by, the describe function is not applied within each group:
df %>% group_by(g) %>% describe(x1)
# # A tibble: 3 x 4
# g n mean sd
# <int> <int> <dbl> <dbl>
# 1 1 100 -0.01757949 0.9400179
# 2 2 100 -0.01757949 0.9400179
# 3 3 100 -0.01757949 0.9400179
How would you change the function to get the desired result with a small number of modifications?

You need tidyeval:
describe = function(.data, variable){
evalue = enquo(variable)
summarise(.data,
'n'= length(!!evalue),
'mean' = mean(!!evalue),
'sd' = sd(!!evalue))
}
df %>% group_by(g) %>% describe(x1)
# A tibble: 3 x 4
g n mean sd
<int> <int> <dbl> <dbl>
1 1 27 -0.23852862 1.0597510
2 2 38 0.11327236 0.8470885
3 3 35 0.01079926 0.9351509
The dplyr vignette 'Programming with dplyr' has a thorough description of using enquo and !!
Edit:
In response to Axeman's comment, I'm not 100% sure why combining group_by and describe does not work here.
However, using debugonce with the function in its original form
debugonce(describe)
df %>% group_by(g) %>% describe(x1)
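one can see that evalue is not grouped: it is just a numeric vector of length 100, because eval(args$variable, .data) is evaluated once against the whole data frame before summarise sees the groups.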
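For readers on a recent dplyr (1.0 or later is assumed here), the same idea can probably be written more compactly with the embrace operator `{{ }}`, which captures the argument and unquotes it in one step. This is only a sketch, not the answerer's code; `describe2` is a hypothetical name, and `n()` is used in place of `length()`:

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  g  = sample(1:3, 100, replace = TRUE),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

# `describe2` is a hypothetical name. {{ variable }} forwards the bare column
# name into summarise(), so each expression is evaluated once per group.
describe2 <- function(.data, variable) {
  summarise(.data,
            n    = n(),                      # rows in the current group
            mean = mean({{ variable }}),
            sd   = sd({{ variable }}))
}

res <- df %>% group_by(g) %>% describe2(x1)
res
```

Called on an ungrouped data frame, the same function should return a single row, so it covers both uses from the question.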

Base NSE appears to work, too:
describe <- function(data, var){
var_q <- substitute(var)
data %>%
summarise(n = n(),
mean = mean(eval(var_q)),
sd = sd(eval(var_q)))
}
df %>% describe(x1)
n mean sd
1 100 -0.1266289 1.006795
df %>% group_by(g) %>% describe(x1)
# A tibble: 3 x 4
g n mean sd
<int> <int> <dbl> <dbl>
1 1 33 -0.1379206 1.107412
2 2 29 -0.4869704 0.748735
3 3 38 0.1581745 1.020831

Related

Iterating name of a field with dplyr::summarise function

This is my first question here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data stored per farm as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 data frames in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but the field does not have the same name in every df (it is either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between data frames; I only want to use the name, which varies.
Here's an example:
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but this doesn't work, because cur_camp is a character string rather than a reference to the column, and it also relies on the position of the field.
How can I build the current_name_of_erosion_field here?
We can convert the string to a symbol and evaluate it (!!), or pass the string to across. Because we are using a for loop, make sure to create a list to store the output. To name the output column from a string stored in an object, use := with !!
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or we can use across
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
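Since the question asks not to depend on the column position, a possible variation is to find the erosion column by name pattern and refer to it through the `.data` pronoun. The sketch below assumes every data frame has exactly one column whose name starts with "ERO"; the `lapply` call replaces the explicit loop and output list:

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

# Assumption: each data frame has exactly one column starting with "ERO".
out <- lapply(lst_df, function(cur_df) {
  cur_camp <- grep("^ERO", names(cur_df), value = TRUE)  # look up by name, not position
  cur_df %>%
    group_by(ID) %>%
    # .data[[cur_camp]] reads the column named by the string;
    # !!cur_camp := names the result after that same string
    summarise(!!cur_camp := mean(.data[[cur_camp]]))
})
out
```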
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)

How to add a column based on values of columns indicated by another column in a tibble in R

In the example below, I would like to add column 'value' based on the values of column 'variable' (i.e., 1 and 20).
toy_data <-
tibble::tribble(
~x, ~y, ~variable,
1, 2, "x",
10, 20, "y"
)
Like this:
x   y   variable  value
1   2   x         1
10  20  y         20
However, none of the below works:
toy_data %>%
dplyr::mutate(
value = get(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable, inherits = TRUE)
)
toy_data %>%
dplyr::mutate(
value = !!variable
)
How can I do this?
If you know which variables you have in the dataframe in advance: use simple logic like ifelse() or dplyr::case_when() to choose between them.
If not: use functional programming. Below is an example:
library(dplyr)
f <- function(data, variable_col) {
data[[variable_col]] %>%
purrr::imap_dbl(~ data[[.y, .x]])
}
toy_data$value <- f(toy_data, "variable")
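For the first case mentioned in this answer, where the candidate columns are known in advance, a minimal sketch with dplyr::case_when might look like this (the two conditions simply enumerate the column names from the question's toy data):

```r
library(dplyr)

toy_data <- tibble::tribble(
  ~x, ~y, ~variable,
   1,  2, "x",
  10, 20, "y"
)

# One branch per known column; rows whose `variable` matches no branch get NA.
res <- toy_data %>%
  mutate(value = case_when(
    variable == "x" ~ x,
    variable == "y" ~ y
  ))
res
```

This stays readable for a handful of columns but does not scale to an unknown set of names, which is where the functional approach comes in.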
Here are a few options that should scale well.
First is a base option that works along both the variable column and its index. (I made a copy of the data frame just so I had the original intact for more programming.)
library(dplyr)
toy2 <- toy_data
toy2$value <- mapply(function(v, i) toy_data[[v]][i], toy_data$variable, seq_along(toy_data$variable))
toy2
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Second uses purrr::imap_dbl to iterate along the variable and its index and return a double.
toy_data %>%
mutate(value = purrr::imap_dbl(variable, function(v, i) toy_data[[v]][i]))
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Third is the least straightforward, but it is what I'd most likely use personally, maybe just because it's a process that fits many of my workflows. Pivoting makes a long version of the data, letting you see both the values of variable and the corresponding values of x and y, which you can then filter to the rows where those 2 columns match. Then self-join back to the data frame.
inner_join(
toy_data,
toy_data %>%
tidyr::pivot_longer(cols = -variable, values_to = "value") %>%
filter(variable == name),
by = "variable"
) %>%
select(-name)
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Edit: @jpiversen rightly points out that the self-join won't work if variable has duplicates; in that case, add a row number to the data and use that as an additional joining column. Here I first add an additional observation to illustrate.
toy3 <- toy_data %>%
add_row(x = 5, y = 4, variable = "x") %>%
tibble::rowid_to_column()
inner_join(
toy3,
toy3 %>%
tidyr::pivot_longer(cols = c(-rowid, -variable), values_to = "value") %>%
filter(variable == name),
by = c("rowid", "variable")
) %>%
select(-name, -rowid)
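Another option that avoids both iteration and joins is base R matrix indexing: build a two-column (row, column) index and pull all values at once. This is a sketch that assumes the candidate columns are all numeric and are listed explicitly in `value_cols` (a name introduced here for illustration); duplicates in variable are not a problem:

```r
toy_data <- tibble::tribble(
  ~x, ~y, ~variable,
   1,  2, "x",
  10, 20, "y",
   5,  4, "x"   # duplicate value of `variable`, as in the edit above
)

value_cols <- c("x", "y")                 # assumption: all candidates are numeric
m   <- as.matrix(toy_data[value_cols])    # numeric matrix of candidate values
idx <- cbind(seq_len(nrow(toy_data)),     # row index
             match(toy_data$variable, value_cols))  # column index per row
toy_data$value <- m[idx]                  # one value per (row, column) pair
toy_data
```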

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time and I think I am using it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this df?
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can use the .data pronoun in case we don't want to use bare (unquoted) column names
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))
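If the group_map result is wanted anyway, the list can be reattached to its group keys afterwards. The sketch below uses a hypothetical stand-in function, `value_range`, instead of gini so that it runs without the REAT package; swapping the real function back in should work the same way as long as it returns a single number per group:

```r
library(dplyr)

df <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Hypothetical stand-in for REAT::gini(), used only to keep the sketch
# self-contained: the range of the values in a group.
value_range <- function(x) max(x) - min(x)

grouped <- group_by(df, group)
stats   <- group_map(grouped, ~ value_range(.x$value))  # list, one element per group

# group_keys() returns one row per group, in the same order as group_map()
res <- mutate(group_keys(grouped), stat = unlist(stats))
res
```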

dplyr: passing column name to summarize inside function

I have the following example, where I pass a simple dataframe to a function that summarizes a column. The name of the summarizing column, s, I would like to have as a parameter to the function:
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,2),
a=c(1:10),
b=c(10:19))
sum <- function(df, s){
df <- df %>%
group_by(id) %>%
summarize(s = sum(a))
return(df)
}
sum(df = df, s = "summarizing.column.label")
However, regardless of the value I set, the summarizing column always gets the same name, s. Is there a way to alter it?
EDIT: The output I would like is:
sum(df = df, s = "summarizing.column.label")
id summarizing.column.label
<dbl> <int>
1 1.00 15
2 2.00 40
sum(df = df, s = "a")
id a
<dbl> <int>
1 1.00 15
2 2.00 40
If we are passing a quoted argument (a string), one option is to use rename_at after the summarise
sumf <- function(df, s){
df %>%
group_by(id) %>%
summarize(a = sum(a))%>%
rename_at("a", ~ s)
}
sumf(df, s ="summarizing.column.label" )
# A tibble: 2 x 2
# id summarizing.column.label
# <dbl> <int>
#1 1.00 15
#2 2.00 40
sumf(df, s ="a" )
# A tibble: 2 x 2
# id a
# <dbl> <int>
#1 1.00 15
#2 2.00 40
Or another option is to make use of := with !!
sumf <- function(df, s){
df %>%
group_by(id) %>%
summarize(a = sum(a))%>%
rename(!! (s) := a)
}
sumf(df, s ="summarizing.column.label" )
# A tibble: 2 x 2
# id summarizing.column.label
# <dbl> <int>
#1 1.00 15
#2 2.00 40
Or within summarise
sumf <- function(df, s){
df %>%
group_by(id) %>%
summarise(!!(s) := sum(a))
}
sumf(df, s ="summarizing.column.label" )
Try this:
# named sumf rather than sum, so that sum(a) below still calls base::sum()
sumf <- function(df, s){
df <- df %>%
group_by(id) %>%
summarize(!!s := sum(a))
return(df)
}
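With dplyr 1.0 or later (an assumption about the installed version), the output name can also be built with glue-style syntax on the left of `:=`. A minimal sketch; `sum_by_id` is a hypothetical name chosen so that base::sum() is not masked:

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 a  = 1:10,
                 b  = 10:19)

# `sum_by_id` is a hypothetical name. "{s}" is interpolated from the function
# argument, so the summary column is named after the string the caller passes.
sum_by_id <- function(df, s) {
  df %>%
    group_by(id) %>%
    summarise("{s}" := sum(a))
}

res <- sum_by_id(df, s = "summarizing.column.label")
res
```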

Dplyr function to compute average, n, sd and standard error

I find myself writing this bit of code all the time to produce standard errors for group means (to then use for plotting confidence intervals).
It would be nice to write my own function to do this in one line of code, though. I have read the nse vignette in dplyr on non-standard evaluation and this blog post as well. I get it somewhat, but I'm too much of a noob to figure this out on my own. Can anyone help out? Thanks.
var1<-sample(c('red', 'green'), size=10, replace=T)
var2<-rnorm(10, mean=5, sd=1)
df<-data.frame(var1, var2)
df %>%
group_by(var1) %>%
summarize(avg=mean(var2), n=n(), sd=sd(var2), se=sd/sqrt(n))
You can use the function enquo to capture the variables passed in your function call, and !! to unquote them inside the dplyr verbs:
my_fun <- function(x, cat_var, num_var){
cat_var <- enquo(cat_var)
num_var <- enquo(num_var)
x %>%
group_by(!!cat_var) %>%
summarize(avg = mean(!!num_var), n = n(),
sd = sd(!!num_var), se = sd/sqrt(n))
}
which gives you:
> my_fun(df, var1, var2)
# A tibble: 2 x 5
var1 avg n sd se
<fctr> <dbl> <int> <dbl> <dbl>
1 green 4.873617 7 0.7515280 0.2840509
2 red 5.337151 3 0.1383129 0.0798550
and that matches the output of your example:
> df %>%
+ group_by(var1) %>%
+ summarize(avg=mean(var2), n=n(), sd=sd(var2), se=sd/sqrt(n))
# A tibble: 2 x 5
var1 avg n sd se
<fctr> <dbl> <int> <dbl> <dbl>
1 green 4.873617 7 0.7515280 0.2840509
2 red 5.337151 3 0.1383129 0.0798550
EDIT:
The OP has asked to remove the group_by statement from the function, to add the ability to group by more than one variable. There are two ways to go about this IMO. First, you could simply remove the group_by statement and pipe a grouped data frame into the function. That method would look like this:
my_fun <- function(x, num_var){
num_var <- enquo(num_var)
x %>%
summarize(avg = mean(!!num_var), n = n(),
sd = sd(!!num_var), se = sd/sqrt(n))
}
df %>%
group_by(var1) %>%
my_fun(var2)
Another way to go about this is to use ... and quos to allow for the function to capture multiple arguments for the group_by statement. That would look like this:
#first, build the new dataframe
var1<-sample(c('red', 'green'), size=10, replace=T)
var2<-rnorm(10, mean=5, sd=1)
var3 <- sample(c("A", "B"), size = 10, replace = TRUE)
df<-data.frame(var1, var2, var3)
# using the first version `my_fun`, it would look like this
df %>%
group_by(var1, var3) %>%
my_fun(var2)
# A tibble: 4 x 6
# Groups: var1 [?]
var1 var3 avg n sd se
<fctr> <fctr> <dbl> <int> <dbl> <dbl>
1 green A 5.248095 1 NaN NaN
2 green B 5.589881 2 0.7252621 0.5128378
3 red A 5.364265 2 0.5748759 0.4064986
4 red B 4.908226 5 1.1437186 0.5114865
# Now doing it with a new function `my_fun2`
my_fun2 <- function(x, num_var, ...){
group_var <- quos(...)
num_var <- enquo(num_var)
x %>%
group_by(!!!group_var) %>%
summarize(avg = mean(!!num_var), n = n(),
sd = sd(!!num_var), se = sd/sqrt(n))
}
df %>%
my_fun2(var2, var1, var3)
# A tibble: 4 x 6
# Groups: var1 [?]
var1 var3 avg n sd se
<fctr> <fctr> <dbl> <int> <dbl> <dbl>
1 green A 5.248095 1 NaN NaN
2 green B 5.589881 2 0.7252621 0.5128378
3 red A 5.364265 2 0.5748759 0.4064986
4 red B 4.908226 5 1.1437186 0.5114865
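On dplyr 1.0 or later (assumed here), the second approach can likely be shortened: `...` can be forwarded to group_by directly, and `{{ }}` replaces the enquo/!! pair. A sketch only; `my_fun3` is a hypothetical name, and the seed and data are made up for illustration:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(
  var1 = sample(c("red", "green"), size = 10, replace = TRUE),
  var2 = rnorm(10, mean = 5, sd = 1),
  var3 = sample(c("A", "B"), size = 10, replace = TRUE)
)

# `my_fun3` is a hypothetical name. Grouping columns arrive through `...`
# and are passed on unchanged; {{ num_var }} forwards the measured column.
my_fun3 <- function(x, num_var, ...) {
  x %>%
    group_by(...) %>%
    summarise(avg = mean({{ num_var }}),
              n   = n(),
              sd  = sd({{ num_var }}),
              se  = sd / sqrt(n),
              .groups = "drop")   # return an ungrouped tibble
}

res <- my_fun3(df, var2, var1, var3)
res
```

Groups with a single observation will still give NA for sd and se, as in the output above.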
