Create dplyr statements to later be evaluated in R

I want to create a single function called eval_data where the user can input
a list of data frames
a list of dplyr functions to apply to the data frames
a list of columns to select from each dataframe:
This will look something like:
eval_data <- function(data, dplyr_logic, select_vector) {
data %>%
# this doesn't work
eval(dplyr_logic) %>%
select(
{ select_vector }
)
}
The dplyr_logic is a list of either:
nothing
a mutate statement
2 mutate statements
a filter
Input 1: List of data frames:
dd <- list()
dd$data <- list(
mutate0 = iris,
mutate1 = iris,
mutate2= iris,
filter1 = iris
)
Input 3 Select vector:
select_vec <- list(
c("Species", "Sepal.Length"),
c("Species", "New_Column1"),
c("Species", "New_Column2", "New_Column3"),
c("Species", "Sepal.Width")
)
Input 2: list of logic to apply to each data frame in the list
logic <- list(
# do nothing -- this one works
I(),
#mutate1
rlang::expr(mutate(New_Column1 = case_when(
Sepal.Length > 7 ~'Big',
Sepal.Length > 6 ~ 'Medium',
TRUE ~ 'Small'
)
)),
#mutate2
rlang::expr(mutate(New_Column2 = case_when(
Sepal.Width > 3.5 ~'Big2',
Sepal.Width > 3 ~ 'Medium2',
TRUE ~ 'Small2'
)) %>%
mutate(
New_Column3 = case_when(
Petal.Width > 2 ~'Big3',
Petal.Width > 1 ~ 'Medium3',
TRUE ~ 'Small3'
)
)
),
#filter1
rlang::expr(filter(Sepal.Width > 3))
)
# eval_data(dd$data[[1]], logic[[1]], select_vec[[1]]) works
# eval_data(dd$data[[2]], logic[[2]], select_vec[[2]]) does not
Desired Goal:
pmap(dd$data, logic, select_vec, ~eval_data)
Desired Output
pmap_output <- list(
iris1 = iris %>% I() %>% select("Species", "Sepal.Length"),
iris2 = iris %>%
mutate(New_Column1 =
case_when(
Sepal.Length > 7 ~'Big',
Sepal.Length > 6 ~ 'Medium',
TRUE ~ 'Small')) %>%
select("Species", "New_Column1"),
iris4 = iris %>%
mutate(New_Column2 = case_when(
Sepal.Width > 3.5 ~'Big2',
Sepal.Width > 3 ~ 'Medium2',
TRUE ~ 'Small2'
)) %>%
mutate(
New_Column3 = case_when(
Petal.Width > 2 ~'Big3',
Petal.Width > 1 ~ 'Medium3',
TRUE ~ 'Small3'
)
) %>%
select("Species", "New_Column2", "New_Column3"),
iris3 = iris %>% filter(Sepal.Width > 3) %>% select("Species", "Sepal.Width")
)
What do I need to change in eval_data and the logic list in order to make this work? Any help appreciated!!

Two changes. First, you need to include data %>% in the expression that gets evaluated:
eval_data <- function(data, dplyr_logic, select_vector) {
rlang::expr( data %>% !!dplyr_logic ) %>%
eval() %>%
select( one_of(select_vector) )
}
Second, the chained mutate is actually a bit tricky. Recall that x %>% f(y) can be rewritten as f(x,y). Your double-mutate expression can therefore be re-written as mutate( mutate(expr1), expr2 ). When you feed the data to it, it becomes
mutate(data, mutate(expr1), expr2)
instead of the desired
mutate(mutate(data, expr1), expr2)
So we need to use the dot placeholder . to specify where the pipe input should go in our complex expression:
logic <- rlang::exprs( # We can use exprs instead of list(expr())
I(),
mutate(New_Column1 = case_when(
Sepal.Length > 7 ~'Big',
Sepal.Length > 6 ~ 'Medium',
TRUE ~ 'Small'
)),
{mutate(., New_Column2 = case_when( # <--- NOTE the { and the .
Sepal.Width > 3.5 ~'Big2',
Sepal.Width > 3 ~ 'Medium2',
TRUE ~ 'Small2')) %>%
mutate(
New_Column3 = case_when(
Petal.Width > 2 ~'Big3',
Petal.Width > 1 ~ 'Medium3',
TRUE ~ 'Small3'
))}, # <--- NOTE the matching }
filter(Sepal.Width > 3)
)
Everything works now:
res <- pmap(list(dd$data, logic, select_vec), eval_data)
## Compare to desired output
map2_lgl( res, pmap_output, identical )
# mutate0 mutate1 mutate2 filter1
# TRUE TRUE TRUE TRUE
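As a quick check of the rewrite rule used above, the sketch below compares a piped double mutate with its nested form, then evaluates a captured, braced chain that uses the dot. The column names a and b are made up for illustration and are not part of the original question.

```r
library(dplyr)

# x %>% f(y) is f(x, y), so two piped mutates equal two nested mutates
piped  <- iris %>% mutate(a = Sepal.Length * 2) %>% mutate(b = a + 1)
nested <- mutate(mutate(iris, a = Sepal.Length * 2), b = a + 1)
same_nested <- identical(piped, nested)

# Captured chain: the braces plus the dot mark where the piped data goes
step <- rlang::expr({
  mutate(., a = Sepal.Length * 2) %>% mutate(b = a + 1)
})
via_expr <- eval(rlang::expr(iris %>% !!step))
same_expr <- identical(via_expr, nested)
```

Both same_nested and same_expr should come out TRUE if the braced form behaves as described.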

Related

Different function outcome based on nested structure

I'm trying to apply a function which looks at the structure of a nested tibble, then depending on which group the row belongs different outcomes are sought, for instance:
library(tidyverse)
df <- data.frame(
species = c(
"setosa",
"virginica",
"versicolor"
),
detection_timestamp_utc = as.POSIXct("2018-03-22 23:59:59", tz = "UTC")
) %>%
group_by(species) %>%
nest()
# idea of function is that if the group == "virginica" then a specific rule is applied whereby if the date
# within the data falls within the interval "summer_2018" should be assigned to the newly created season column
# for all other groups further rules will be applied (in my case it will be a different set of dates but I simplified it here)
# in the case of the actual function I have, there are a number of intervals per group and a number of outcomes
fun1 <- function(nested_list) {
if (nested_list$species == "virginica") {
nested_list %>% mutate(data = map(data, ~ mutate(.x, season = case_when(
detection_timestamp_utc %within% interval(
ymd_hms("2018-02-01 00:00:00"),
ymd_hms("2018-09-30 23:59:59")
) ~ "summer_2018",
"winter_2018"
))))
} else {
nested_list %>% mutate(data = map(data, ~ mutate(.x, season = "not_applicable_wrong_group")))
}
}
# from the above I would want the nested tibble row which == "virginica" to be assigned summer_2018, but all the rest
# should be assigned "not_applicable_wrong_group" as the test which looks at group should return false.
# however, it all fails to work
df2 = fun1(df)
#> Warning in if (nested_list$species == "virginica") {: the condition has length >
#> 1 and only the first element will be used
I'm not sure this is the best structure, but in theory I'll have a number of groups and slightly different rules will need to be applied to each group.
I think it's probably a mistake to nest the table and then transform it in different ways; I would think it would be easier to simply transform it unnested. Putting that aside for a moment ...
I think you can make the following changes to your starting df and to your function: add a final step of dplyr::group_split() to your df. This lets you purrr::map over each group, and because each group has a single species value, your existing code will likely work as you intended.
Here is that approach:
library(tidyverse)
library(lubridate)
df <- data.frame(
species = c(
"setosa",
"virginica",
"versicolor"
),
detection_timestamp_utc = as.POSIXct("2018-03-22 23:59:59", tz = "UTC")
) %>% group_by(species) %>% nest() %>% group_split()
fun1 <- function(nested_list) {
map(nested_list,~{
if (.x$species == "virginica") {
.x %>% mutate(data = map(data, ~ mutate(.x, season = case_when(
detection_timestamp_utc %within% interval(
ymd_hms("2018-02-01 00:00:00"),
ymd_hms("2018-09-30 23:59:59")
) ~ "summer_2018",
TRUE ~ "winter_2018"
))))
} else {
.x %>% mutate(data = map(data, ~ mutate(.x, season = "not_applicable_wrong_group")))
}
}) %>% bind_rows()
}
fun1(df)
But again, my advice would be not to overcomplicate things by nesting; it doesn't seem particularly natural here. You already have a mutate containing maps that contain further mutates, which is overcomplicated for the sort of scenario you describe.
Therefore, consider the relative simplicity of a non-nested approach: instead of adding group_split to your df, leave your df unnested. Then you might do:
library(tidyverse)
library(lubridate)
df <- data.frame(
species = c(
"setosa",
"virginica",
"versicolor"
),
detection_timestamp_utc = as.POSIXct("2018-03-22 23:59:59", tz = "UTC")
)
fun1 <- function(normal_df) {
normal_df %>% mutate(season = case_when(
species == "virginica" ~
case_when(
detection_timestamp_utc %within% interval(
ymd_hms("2018-02-01 00:00:00"),
ymd_hms("2018-09-30 23:59:59")
) ~ "summer_2018",
TRUE ~ "winter_2018"
),
TRUE ~ "not_applicable_wrong_group"
))
}
fun1(df)
#or if you must
fun1(df) %>% group_by(species) %>% nest()
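For context on the warning in the question: if expects a single TRUE or FALSE, while a comparison against a whole column yields one value per row (in R 4.2.0 and later this is an error rather than a warning). A minimal base-R sketch of the difference follows; the species vector is made up for illustration, and ifelse() stands in for the vectorised logic that case_when() provides in the unnested solution.

```r
# Hypothetical stand-in for the species column
species <- c("setosa", "virginica", "versicolor")

# The comparison is vectorised: one logical value per element
is_virginica <- species == "virginica"
n_conditions <- length(is_virginica)

# A vectorised conditional handles every element at once,
# which a plain if () cannot do
season <- ifelse(is_virginica, "summer_2018", "not_applicable_wrong_group")
print(season)
```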

default arguments not being recognized in custom function using dplyr

Take this function foo(). I want it to have a default argument of cyl because that's the name of the field it will usually process.
library(tidyverse)
foo <- function(x = cyl){
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
# works:
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
But I am surprised that the function will not work unless I explicitly supply the default argument. See failing code below
# fails:
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
Error: Problem with `mutate()` column `cyl_refactor`. ℹ `cyl_refactor = foo()`. x object 'cyl' not found
It seems that default arguments are only processed when there is also a data parameter as below.
foo2 <- function(data, x = cyl){
data %>%
mutate(cyl_refactor = case_when(
{{x}} == 6 ~ TRUE,
{{x}} == 8 ~ FALSE,
{{x}} == 4 ~ NA
))
}
mtcars %>%
foo2() %>%
select(cyl, cyl_refactor)
I am sure there is some gap in my knowledge of quasiquotation, but I would like to understand how to use a default argument in foo().
Here's one that will "work", though I wouldn't recommend it:
foo <- function(x = cyl){
x <- enquo(x)
eval.parent(rlang::quo_squash(rlang::quo(case_when(
!!x == 6 ~ TRUE,
!!x == 8 ~ FALSE,
!!x == 4 ~ NA
))))
}
# Both run without error
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
The problem is that in order for case_when to work, you can't just pass in a column name without also passing in the data. In order to "find" the data in this case, I've used eval.parent() to go up the call chain to try to find the cyl variable.
It's better to make proper functions where you pass in the input data directly (rather than variable names they need to look up themselves).
We could do this with missing() and cur_data_all() (note that cur_data_all() is deprecated as of dplyr 1.1.0 in favour of pick()):
foo <- function(x = cyl){
if(missing(x)) x <- cur_data_all()[["cyl"]]
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
Testing:
> out1 <- mtcars %>%
+ mutate(cyl_refactor = foo(cyl)) %>%
+ select(cyl, cyl_refactor)
> out2 <- mtcars %>%
+ mutate(cyl_refactor = foo()) %>%
+ select(cyl, cyl_refactor)
>
> identical(out1, out2)
[1] TRUE
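The behaviour in the question follows from R's scoping rules: a supplied argument is evaluated where the function is called (inside mutate, that is the data), whereas a default argument is evaluated inside the function itself, which cannot see the caller's variables. A base-R sketch of the same effect, with hypothetical names (f, g_default, g_supplied, col_value) chosen only for illustration:

```r
# The default refers to a name that exists only in the caller
f <- function(x = col_value) x

g_supplied <- function() {
  col_value <- 1
  f(col_value)   # supplied argument: looked up in g_supplied()
}

g_default <- function() {
  col_value <- 1
  f()            # default argument: looked up from inside f()
}

supplied_result <- g_supplied()
default_result  <- tryCatch(g_default(), error = function(e) e)
print(supplied_result)
print(inherits(default_result, "error"))
```

This assumes no variable called col_value exists in the environment where f is defined; if one did, the default would silently pick that up instead.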

Making tidyeval function inside case_when

I have a data set in which I would like to impute one value from the others, based on the probability distribution of those values. Let me make a reproducible example first:
library(tidyverse)
library(janitor)
dummy1 <- runif(5000, 0, 1)
dummy11 <- case_when(
dummy1 < 0.776 ~ 1,
dummy1 < 0.776 + 0.124 ~ 2,
TRUE ~ 5)
df1 <- tibble(q1 = dummy11)
here is the output:
df1 %>% tabyl(q1)
q1 n percent
1 3888 0.7776
2 605 0.1210
5 507 0.1014
I used mutate and sample to redistribute the value 5 among the values 1 and 2, like this:
df1 %>%
mutate(q1 = case_when(q1 == 5 ~ sample(
2,
length(q1),
prob = c(0.7776, 0.1210),
replace = TRUE
),
TRUE ~ as.integer(q1))
)
and here is the result :
q1 n percent
1 4322 0.8644
2 678 0.1356
This approach seems to work. However, since I need to apply it to several variables, I tried to write a tidyeval function that works with the tidyverse, like this:
my_impute <- function(.data, .prob_var, ...) {
.prob_var <- enquo(.prob_var)
.data %>%
sample(2, prob=c(!!.prob_var), replace = TRUE)
}
# running on data
df1 %>%
mutate(q1 = case_when(q1 == 5 ~ !!my_impute(q1),
TRUE ~ as.integer(q1))
)
The error is :
Error in eval_tidy(pair$lhs, env = default_env) : object 'q1' not found
We need the prob values from the 'percent' column generated from tabyl, so the function can be modified to
library(janitor)
library(dplyr)
my_impute <- function(.data, .prob_var, vals, ...) {
.prob_var = enquo(.prob_var)
.prob_vals <- .data %>%
janitor::tabyl(!!.prob_var) %>%
filter(!!.prob_var %in% vals) %>%
pull(percent)
.data %>%
mutate(!! .prob_var := case_when(!! .prob_var == 5 ~
sample(
2,
n(),
prob = .prob_vals,
replace = TRUE
),
TRUE ~ as.integer(!! .prob_var))
)
}
df1 %>%
my_impute(q1, vals = 1:2) %>%
tabyl(q1)
# q1 n percent
# 1 4285 0.857
# 2 715 0.143
Just to add my two cents: newer versions of rlang let you replace the enquo() + !! quasiquotation pattern with the curly-curly operator {{ }} to embrace variables. The function would look like this:
my_impute <- function(.data, .prob_var, vals, ...) {
#.prob_var = enquo(.prob_var)
# commented out since it is no longer needed
.prob_vals <- .data %>%
janitor::tabyl({{.prob_var}}) %>%
filter({{.prob_var}} %in% {{vals}}) %>%
pull(percent)
.data %>%
mutate( {{.prob_var}} := case_when( {{.prob_var}} == 5 ~
sample(
2,
n(),
prob = {{.prob_vals}},
replace = TRUE
),
TRUE ~ as.integer({{.prob_var}}))
)
}
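To isolate the curly-curly pattern from the imputation logic, here is a minimal sketch of a function that takes a bare column name and overwrites that column. The function name double_col is hypothetical; it only shows {{ }} on both sides of :=, which is the same mechanism the function above relies on.

```r
library(dplyr)

# Hypothetical helper: doubles whichever column is passed in
double_col <- function(.data, col) {
  .data %>%
    mutate({{ col }} := {{ col }} * 2)
}

out <- double_col(mtcars, cyl)
head(out$cyl)
```

Because the column is embraced everywhere it is used, the function is not tied to one particular column name.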

Repetitive filtering with multiple conditions without a loop

I have a large dataset of around 35000 observations and 24 variables (one of which is a time-series), but I can summarise what I want to achieve using iris.
library(tidyverse)
iris.new <- iris %>%
arrange(Species, Sepal.Length, Sepal.Width) %>%
group_by(Species)
unwanted <- iris.new %>%
filter(Sepal.Length > 5 & Sepal.Width==min(Sepal.Width))
while(nrow(unwanted)!=0) {
iris.new <- iris.new %>%
arrange(Species, Sepal.Length, Sepal.Width) %>%
group_by(Species) %>%
filter(!(Sepal.Length > 5 & Sepal.Width == min(Sepal.Width)))
unwanted <- iris.new %>%
filter(Sepal.Length > 5 & Sepal.Width==min(Sepal.Width))
}
I want to filter out only the rows with Sepal.Length > 5 that have the minimum Sepal.Width within each Species (setosa and versicolor have none). Once I got rid of the first ones, I repeated the filter to see whether any were left, and finally used a while loop to do that for me.
Is there a way to filter them without using a loop?
I think this does the trick:
# get minimum Sepal.Width without Sepal.Length > 5
iris_min <- iris %>%
group_by(Species) %>%
filter(Sepal.Length <= 5) %>%
summarize(min_sep_width = min(Sepal.Width))
# check to see that nothing is below our minimum
# or equal to it with a Sepal.Length that's too long
iris_new <- iris %>%
left_join(iris_min, by = c('Species')) %>%
filter(min_sep_width < Sepal.Width |
(min_sep_width == Sepal.Width & Sepal.Length <= 5)) %>%
select(-min_sep_width)
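One way to gain confidence that the join-based version matches the loop is to tag every row and compare which rows survive each approach. The sketch below does that on iris; the row_id column and the object names iris_id, iris_loop and iris_join are additions for the comparison only.

```r
library(dplyr)

# Tag each row so both results can be compared by row id
iris_id <- iris %>% mutate(row_id = row_number())

# Loop approach from the question
iris_loop <- iris_id %>% group_by(Species)
repeat {
  unwanted <- iris_loop %>%
    filter(Sepal.Length > 5 & Sepal.Width == min(Sepal.Width))
  if (nrow(unwanted) == 0) break
  iris_loop <- iris_loop %>%
    filter(!(Sepal.Length > 5 & Sepal.Width == min(Sepal.Width)))
}

# Join approach from the answer
iris_min <- iris_id %>%
  group_by(Species) %>%
  filter(Sepal.Length <= 5) %>%
  summarize(min_sep_width = min(Sepal.Width))

iris_join <- iris_id %>%
  left_join(iris_min, by = "Species") %>%
  filter(min_sep_width < Sepal.Width |
           (min_sep_width == Sepal.Width & Sepal.Length <= 5)) %>%
  select(-min_sep_width)

# TRUE when both approaches keep exactly the same rows
same_rows <- identical(iris_loop$row_id, iris_join$row_id)
```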

catch different errors from dplyr data frame

I have a situation where my data frame can contain different errors and I want to catch both cases with an if statement afterwards.
Situation 1: the data frame contains NA
library(dplyr)
data(iris)
attach(iris)
data <- iris %>% filter(Sepal.Length >=7.9)
sepal_slope <- data %>% group_by(Species) %>%
do(fit = lm(Sepal.Width ~ Sepal.Length, .)) %>%
summarise(sepal_slope = coef(fit)[2])
this is FALSE:
nrow(sepal_slope) == 0
# FALSE
is.na is TRUE here as intended
is.na(sepal_slope)
# TRUE
Situation 2: the data frame is empty
data <- iris %>% filter(Sepal.Length >=12)
sepal_slope <- data %>% group_by(Species) %>%
do(fit = lm(Sepal.Width ~ Sepal.Length, .)) %>%
summarise(sepal_slope = coef(fit)[2])
now this is TRUE as intended:
nrow(sepal_slope) == 0
# TRUE
but this returns an empty result instead of TRUE or FALSE:
is.na(sepal_slope)
# sepal_slope
So I cannot use
if(nrow(sepal_slope) == 0 | is.na(sepal_slope)) sepal_slope <- 5
# Error in if (nrow(sepal_slope) == 0 | is.na(sepal_slope)) sepal_slope <- 5 :
argument is of length zero
How can I catch both situations in one if statement?
Of course the case where sepal_slope contains a num value should be handled, if should yield TRUE here by default.
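If you coerce sepal_slope to numeric and take the first element, is.na returns TRUE in both cases. For the empty data frame this works because indexing a zero-length vector with [1] yields NA.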
if(is.na(as.numeric(unlist(sepal_slope))[1])) sepal_slope <- 5
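The three cases can be checked in base R without fitting any models. In the sketch below the data frames and the helper name needs_default are made up for illustration; the helper simply wraps the condition from the line above.

```r
# Hypothetical stand-ins for the three possible results
empty_slope <- data.frame(sepal_slope = numeric(0))  # no rows
na_slope    <- data.frame(sepal_slope = NA_real_)    # slope is NA
ok_slope    <- data.frame(sepal_slope = 0.5)         # usable slope

# Same test as in the if () condition above
needs_default <- function(x) is.na(as.numeric(unlist(x))[1])

needs_default(empty_slope)  # TRUE: [1] on a zero-length vector gives NA
needs_default(na_slope)     # TRUE
needs_default(ok_slope)     # FALSE
```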
