default arguments not being recognized in custom function using dplyr - r

Take this function foo(). I want it to have a default argument of cyl because that's the name of the field it will usually process.
library(tidyverse)
foo <- function(x = cyl){
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
# works:
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
But I am surprised that the function does not work unless I explicitly supply the argument. See the failing code below:
# fails:
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
Error: Problem with `mutate()` column `cyl_refactor`.
ℹ `cyl_refactor = foo()`.
x object 'cyl' not found
It seems that default arguments are only processed when there is also a data parameter as below.
foo2 <- function(data, x = cyl){
data %>%
mutate(cyl_refactor = case_when(
{{x}} == 6 ~ TRUE,
{{x}} == 8 ~ FALSE,
{{x}} == 4 ~ NA
))
}
mtcars %>%
foo2() %>%
select(cyl, cyl_refactor)
I am sure there is some gap in my knowledge of quasiquotation, but I would like to understand how to use a default argument in foo().

Here's one that will "work", though I wouldn't recommend it:
foo <- function(x = cyl){
x <- enquo(x)
eval.parent(rlang::quo_squash(rlang::quo(case_when(
!!x == 6 ~ TRUE,
!!x == 8 ~ FALSE,
!!x == 4 ~ NA
))))
}
# Both run without error
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
The problem is that a default argument is evaluated inside the function's own environment, not in the data that mutate() is working on, so case_when() cannot find a column called cyl unless the data is passed in as well. To "find" the data in this case, I've used eval.parent() to go up the call chain and look for the cyl variable there.
It's better to make proper functions where you pass in the input data directly (rather than variable names they need to look up themselves).
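The lookup problem can be reproduced without dplyr. A minimal base-R sketch (show_default() is a hypothetical helper, and with() merely stands in for mutate() here) suggests why a supplied argument finds the column while the default does not:

```r
# Hypothetical helper, for illustration only: returns its argument unchanged.
show_default <- function(x = cyl) x

# A supplied argument is evaluated where the call was written,
# which here is inside the data, so `cyl` is found.
supplied <- with(mtcars, show_default(cyl))

# A default argument is evaluated inside the function itself, and from there
# R looks in the environment where the function was defined, never in the data.
defaulted <- tryCatch(
  with(mtcars, show_default()),
  error = function(e) conditionMessage(e)
)

supplied[1:3]
defaulted
```

The second call ends in the same "object not found" failure as foo() inside mutate(), which is why the workarounds have to reach back into the caller's data explicitly.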

We could do this with missing() and cur_data_all():
foo <- function(x = cyl){
if(missing(x)) x <- cur_data_all()[["cyl"]]
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
Testing:
> out1 <- mtcars %>%
+ mutate(cyl_refactor = foo(cyl)) %>%
+ select(cyl, cyl_refactor)
> out2 <- mtcars %>%
+ mutate(cyl_refactor = foo()) %>%
+ select(cyl, cyl_refactor)
>
> identical(out1, out2)
[1] TRUE
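In newer dplyr releases cur_data_all() is deprecated in favour of pick(). Assuming dplyr 1.1.0 or later, the same idea might look like the sketch below (foo_pick is a hypothetical name, chosen so that foo above is not overwritten):

```r
library(dplyr)

# Hypothetical variant of foo(); assumes dplyr >= 1.1.0, where pick() exists.
foo_pick <- function(x) {
  # When no argument is given, fetch the cyl column from the data frame
  # that the enclosing mutate() call is currently working on.
  if (missing(x)) x <- pick(cyl)[["cyl"]]
  case_when(
    x == 6 ~ TRUE,
    x == 8 ~ FALSE,
    x == 4 ~ NA
  )
}

out <- mtcars %>%
  mutate(with_arg = foo_pick(cyl), without_arg = foo_pick())

identical(out$with_arg, out$without_arg)
```

As with the original, this only works when foo_pick() is called from inside a dplyr verb, since pick() needs a data frame to pick from.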

Related

Change iteration number in one Dplyr command

I need to be able to change the iteration number in each separate line of one dplyr pipeline.
I have prepared an example of the 'by hand' approach and of what I need to do in 'pseudo steps'.
library(tidyverse)
cr <-
mtcars %>%
group_by(gear) %>%
nest()
# This is the 'by-hand' approach of what I would like to do. How can it be automated? E.g. we do not know all values of 'carb'
cr$data[[1]] %>%
mutate(VARIABLE1 =
case_when(carb == 1 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE2 =
case_when(carb == 2 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE4 =
case_when(carb == 4 ~ hp/mpg,
TRUE ~ 0))
# This is a pseudo-idea of what I need to do (not valid R). Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort %>% unique()
for (i in vals) {
message(i)
cr$data[[1]] %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be first element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be second element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be third element of vals
TRUE ~ 0))
}
Is there any trick, maybe using the purrr package, or some other solution?
I need to iterate over the unique values of some variable and create a new column in the data frame for each unique value. I need to automate this, but I am not able to do so on my own.
You can do this using sym to convert text to symbols and !! to evaluate within dplyr functions. See this question and this vignette for further details.
For your application, you probably want something like this:
carbs = c(1,2,4)
for(cc in carbs){
var_name = sym(paste0("VARIABLE",cc))
cr$data[[1]] = cr$data[[1]] %>%
mutate(!!var_name := case_when(carb == cc ~ hp/mpg,
TRUE ~ 0))
}
There are three key parts to this:
sym() turns the text string into a symbol (a variable name).
!! unquotes the symbol, so that mutate() sees the name it stands for rather than the name var_name itself.
:= is used in place of = because R does not allow !! on the left-hand side of =.
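The three parts can be combined without hard-coding the carb values. A small sketch, assuming the whole of mtcars is used rather than cr$data[[1]], with dat and vals as illustrative names:

```r
library(dplyr)

dat <- mtcars
vals <- sort(unique(dat$carb))  # discover the values instead of listing them

for (cc in vals) {
  var_name <- rlang::sym(paste0("VARIABLE", cc))  # string -> symbol
  dat <- dat %>%
    mutate(!!var_name := case_when(carb == cc ~ hp / mpg,
                                   TRUE ~ 0))
}

# One new column per distinct carb value
setdiff(names(dat), names(mtcars))
```

Reassigning dat inside the loop is what accumulates the columns; without the assignment each iteration's result would be discarded.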

Automate Dplyr's mutate function

What is the best way to automate the mutate function in one dplyr aggregation?
It is best if I demonstrate with an example.
In the first part of the example I create new columns based on the values of the variable gear. However, imagine I need to automate this step so that it 'iterates' over all unique values of gear and creates a new column for each value.
Is there any way to do so?
library(tidyverse)
cr <-
mtcars %>%
group_by(gear) %>%
nest()
# This is the 'by-hand' approach of what I would like to do. How can it be automated? E.g. we do not know all values of 'carb'
cr$data[[1]] %>%
mutate(VARIABLE1 =
case_when(carb == 1 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE2 =
case_when(carb == 2 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE4 =
case_when(carb == 4 ~ hp/mpg,
TRUE ~ 0))
# This is a pseudo-idea of what I need to do (not valid R). Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort %>% unique()
for (i in vals) {
message(i)
cr$data[[1]] %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be first element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be second element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be third element of vals
TRUE ~ 0))
}
One way would be to use dummy_cols from the fastDummies package.
Doing it for one data frame at a time:
library(fastDummies)
cr$data[[1]] %>%
dummy_cols(select_columns = 'carb') %>%
# funs() is deprecated in current dplyr; a formula does the same job
mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg)
You can also do this first and then group by gear, since the gear value is not used in the calculation, so the order does not matter. For that:
cr_new <- mtcars %>%
dummy_cols(select_columns = 'carb') %>%
mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg) %>%
group_by(gear) %>%
nest()
Perhaps, something like this would help -
library(dplyr)
library(purrr)
bind_cols(mtcars, map_dfc(unique(mtcars$carb),
~mtcars %>%
transmute(!!paste0('carb', .x) := case_when(carb == .x ~ hp/mpg, TRUE ~ 0))))
It sounds a lot like what's called "the XY-problem".
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Please read about tidy data, and/or tidyr's pivot_longer/pivot_wider. Column names should not encode information.
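For comparison, here is a sketch of how tidyr's pivot_wider() could produce the same kind of columns in one step. The helper columns row, key and ratio are assumptions introduced only for this illustration:

```r
library(dplyr)
library(tidyr)

wide <- mtcars %>%
  mutate(
    row   = row_number(),  # keeps every car on its own row after pivoting
    key   = carb,          # copy of carb, consumed by pivot_wider()
    ratio = hp / mpg
  ) %>%
  pivot_wider(
    names_from   = key,
    values_from  = ratio,
    names_prefix = "VARIABLE",
    values_fill  = 0       # cars with a different carb value get 0
  )

names(wide)
```

Keeping the data long (one ratio column plus carb) and pivoting only when a wide layout is really needed is usually the simpler route.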

Access column name inside function used inside summarize_all dplyr structure

I'm building a dplyr structure to run some custom functions over the columns of a dataframe in one block of code.
Currently my function looks like this:
funx <- function(x) {
logchoice <- if(max(x) < 400) {'T' } else { 'F' }
logtest <- suppressWarnings(log10(x))
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <- tibble(pvalue = y$p.value, Transform = logchoice)
return(z)
}
and the dplyr structure looks like this:
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather %>%
unnest %>%
arrange(pvalue) %>%
rename(Parameter = key)
which gives me:
Parameter pvalue Transform
1 cyl 0.00000000 T
2 drat 0.03026093 T
3 hp 0.04252001 T
4 disp 0.06050505 F
I would like to know how I can access the column name inside my function, mainly because I would like to change the name in the result table to look like the output of paste('log10', original_column_name, sep = '_') if the function applies the log transformation, but leave the original name as is when it decides not to.
so the expected output is:
Parameter pvalue Transform
1 log10_cyl 0.00000000 T
2 log10_drat 0.03026093 T
3 log10_hp 0.04252001 T
4 disp 0.06050505 F
You were quite close. You can just add a mutate() to the end
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather() %>%
unnest() %>%
arrange(pvalue) %>%
rename(Parameter = key) %>%
mutate(Parameter = ifelse(Transform == "T", paste0("log10_", Parameter), Parameter)) %>%
select(Parameter, pvalue)
# Parameter pvalue
# log10_cyl 0.00000000
# log10_drat 0.01389723
# disp 0.02771770
# log10_hp 0.08493466
Answering in a separate post as the solution is different. To get the column names in a print(), I would pass them into the function and use purrr::map_dfr to build a dataframe of the results. The small changes I made are to grab the column name, col_name, and to specify the dataframe. I tried a few approaches to grab the column name using your original function but was unsuccessful.
logtest_pval <- function(col, df) {
col_name <- col
x <- df %>% pull(!!col)
logchoice <- ifelse(max(x) < 400, TRUE, FALSE)
logtest <- log10(x)
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <-
tibble(
transform = logchoice,
column = ifelse(logchoice, paste0("log10_", col_name), col_name),
pvalue = y$p.value
)
print(paste0(z, collapse = " | "))
return(z)
}
Then you can build your dataframe:
purrr::map_dfr(
.x = names(mtcars), # the columns to use
.f = logtest_pval, # the function to use
df = mtcars # additional arguments needed
)
Here's another example
df <-
mtcars %>%
select_if(is.numeric)
pvalues <-
map_dfr(names(df), logtest_pval, df)
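With dplyr 1.0.0 or later, another possible route is across() together with cur_column(), which reports the name of the column currently being processed. The sketch below leaves out the dip test and only shows the naming step; label_column is a hypothetical helper:

```r
library(dplyr)

# Hypothetical helper: prefixes the column name when the log rule applies.
# cur_column() only works while across() is running.
label_column <- function(x) {
  name <- cur_column()
  if (max(x) < 400) paste0("log10_", name) else name
}

labels <- mtcars %>%
  summarise(across(c(cyl, disp, hp, drat), ~ label_column(.x)))

labels
```

The same cur_column() call could be placed inside funx() so that the returned tibble carries its own Parameter column.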

regression output in dplyr

I would like to define similar functions as in the 'broom' package
library(dplyr)
library(broom)
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
glance(model)
works fine. But how do I define custom functions like
myglance <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance(model)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument of type 'character'
glance works this way because the broom package defines a method for rowwise data frames here. If you were willing to bring in that whole .R file (along with the col_name utility from here), you could use my code to do the same thing:
myglance_df <- wrap_rowwise_df(wrap_rowwise_df_(myglance))
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance_df(model)
There's also a workaround that doesn't require adding so much code from broom: change the class of each of your models, and define your own glance function on that class.
glance.mylm <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
mutate(model = list(structure(model, class = c("mylm", class(model))))) %>%
glance(model)
Finally, you also have the option of performing myglance on the model right away.
mtcars %>%
group_by(am) %>%
do(myglance(lm(mpg ~ wt, .)))
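do() is superseded as of dplyr 1.0.0. A roughly equivalent sketch with group_modify(), reusing the myglance() definition from the question, could look like this:

```r
library(dplyr)

# Same helper as in the question: pulls a few numbers out of a fitted lm.
myglance <- function(x, ...) {
  s <- summary(x)
  with(s, data.frame(r2 = adj.r.squared, a = coefficients[1], b = coefficients[2]))
}

res <- mtcars %>%
  group_by(am) %>%
  group_modify(~ myglance(lm(mpg ~ wt, data = .x)))

res
```

group_modify() hands each group's rows to the function as .x and binds the returned data frames together, keeping the am column alongside the results.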
Here is my take on how it would work. Basically, the approach would be:
1. Extract the appropriate column from the dataframe. (My solution is based on this answer; there must be a better way, and I hope someone will correct me!)
2. Run lapply on the result and construct the variables that you wanted in the myglance function you have above.
3. Run do.call with rbind to return a data.frame.
myglance <- function(df, ...) {
# step 1
s <- collect(select(df, ...))[[1]] # based on this answer: https://stackoverflow.com/a/21629102/1992167
# step 2
lapply(s, function(x) {
data.frame(r2 = summary(x)$adj.r.squared,
a = summary(x)$coefficients[1],
b = summary(x)$coefficients[2])
}) %>% do.call(rbind, .) # step 3
}
Output:
> mtcars %>%
+ group_by(am) %>%
+ do(model = lm(mpg ~ wt, .)) %>%
+ myglance(model)
r2 a b
1 0.5651357 31.41606 -3.785908
2 0.8103194 46.29448 -9.084268

catch different errors from dplyr data frame

I have a situation where my data frame can contain different errors and I want to catch both cases with an if statement afterwards.
Situation 1: the data frame contains NA
library(dplyr)
data(iris)
attach(iris)
data <- iris %>% filter(Sepal.Length >=7.9)
sepal_slope <- data %>% group_by(Species) %>%
do(fit = lm(Sepal.Width ~ Sepal.Length, .)) %>%
summarise(sepal_slope = coef(fit)[2])
this is FALSE:
nrow(sepal_slope) == 0
# FALSE
is.na is TRUE here as intended
is.na(sepal_slope)
# TRUE
Situation 2: the data frame is empty
data <- iris %>% filter(Sepal.Length >=12)
sepal_slope <- data %>% group_by(Species) %>%
do(fit = lm(Sepal.Width ~ Sepal.Length, .)) %>%
summarise(sepal_slope = coef(fit)[2])
now this is TRUE as intended:
nrow(sepal_slope) == 0
# TRUE
but this produces an error:
is.na(sepal_slope)
# sepal_slope
So I cannot use
if(nrow(sepal_slope) == 0 | is.na(sepal_slope)) sepal_slope <- 5
# Error in if (nrow(sepal_slope) == 0 | is.na(sepal_slope)) sepal_slope <- 5 :
argument is of length zero
How can I catch both situations in one if statement?
Of course the case where sepal_slope contains a num value should be handled, if should yield TRUE here by default.
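If you coerce your sepal_slope to numeric, both cases respond TRUE to is.na(): unlist() on the empty data frame gives a zero-length vector, and taking element [1] of a zero-length vector returns NA.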
if(is.na(as.numeric(unlist(sepal_slope))[1])) sepal_slope <- 5
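Why this works can be checked on small stand-in data frames (empty_df, na_df and value_df are made up for this sketch and only mimic the three possible shapes of sepal_slope):

```r
# Stand-ins for the three cases; not the real regression output.
empty_df <- data.frame(sepal_slope = numeric(0))
na_df    <- data.frame(sepal_slope = NA_real_)
value_df <- data.frame(sepal_slope = 0.5)

# The test from the answer, wrapped in a hypothetical helper.
needs_default <- function(df) is.na(as.numeric(unlist(df))[1])

needs_default(empty_df)  # zero-length vector: [1] yields NA
needs_default(na_df)     # the value itself is NA
needs_default(value_df)  # a real number, so no default is needed
```

Because the helper always returns a single TRUE or FALSE, it is safe to use directly as an if condition.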
