NSE on complex expressions with dplyr's do()

Can someone help me understand how NSE works with dplyr when the variable reference is in the form ".$mpg"?
After reading here, I thought using as.name would do it, since I have a character string that gives a variable name.
For example, this works:
mtcars %>%
summarise_(interp(~mean(var), var = as.name("mpg")))
and this doesn't work:
mtcars %>%
summarise_(interp(~mean(var), var = as.name(".$mpg")))
but this does:
mtcars %>%
summarise(mean(.$mpg))
and so does this:
mtcars %>%
summarise(mean(mpg))
I want to be able to specify the variable in the form .$mpg so that I can use it with do() when I don't have the option of specifying a dot for the data like in the following example:
library(dplyr)
library(broom)
mtcars %>%
tbl_df() %>%
slice(., 1) %>%
do(tidy(prop.test(.$mpg, .$disp, p = .50)))
I chose random variables here to demonstrate how the prop.test function works; please don't interpret this as misuse of the test.
Eventually, I want to turn this into a function like this:
library(lazyeval)
library(broom)
library(dplyr)
p_test <- function(x, miles, distance){
x %>%
tbl_df() %>%
slice(., 1) %>%
do_(tidy(prop.test(miles, distance, p = .50)))
}
p_test(mtcars, ".$mpg", ".$disp")
I originally thought that I would have to do something like:
interp(~var, var = as.name(miles)), where miles would get replaced with .$mpg, but as I mentioned at the top, this does not seem to work.

The reason is that as.name creates an unevaluated variable name, but .$mpg, when used in code, is not a variable name. Rather, it’s a complex expression which is equivalent to:
`$`(., mpg)
That is, it’s a function call to the function $ with two arguments. Using as.name causes R to subsequently search for a variable with the name `.$mpg` rather than calling the above-described function.
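The difference is easy to check in base R. The sketch below is an added illustration rather than part of the original answer; it binds the dot through a plain list instead of through dplyr, and the variable names are arbitrary.

```r
# A symbol whose name happens to contain "$" versus a parsed call to `$`
sym <- as.name(".$mpg")
cl  <- parse(text = ".$mpg")[[1]]

is.name(sym)                        # TRUE: a single variable name
is.call(cl)                         # TRUE: a function call
identical(cl, quote(`$`(., mpg)))   # TRUE: same call written in prefix form

# Evaluating the call with `.` bound to the data extracts the column
col <- eval(cl, list(. = mtcars))

# Evaluating the symbol looks for a variable literally named ".$mpg" and fails
lookup <- tryCatch(eval(sym, list(. = mtcars)),
                   error = function(e) "not found")
```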
That’s the explanation of why your attempt doesn’t work. The solution is then relatively straightforward: instead of creating an unevaluated variable name, we need to create an unevaluated function call expression. We can do this in various ways, and I’m going to show two here.
The first is simply to call parse:
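p_test = function (data, miles, distance) {
  # Parse the strings into unevaluated calls such as `$`(., mpg)
  x = parse(text = miles)[[1]]
  n = parse(text = distance)[[1]]
  data %>%
    slice(1) %>%
    # Splice the calls into the formula before do_() evaluates it
    do_(interp(~tidy(prop.test(x, n, p = 0.5)), x = x, n = n))
}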
Now you can call p_test(mtcars, '.$mpg', '.$disp') and get the desired result.
However, a more dplyr-y way of doing the same thing would be to pass unevaluated objects to p_test:
p_test(mtcars, mpg, disp)
… and we can easily do this with a simple change:
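p_test_ = function (data, var1, var2) {
  data %>%
    slice(1) %>%
    # interp() replaces x and n with the column names, giving .$mpg and .$disp
    do_(interp(~tidy(prop.test(.$x, .$n, p = 0.5)),
               x = as.name(var1), n = as.name(var2)))
}
# NSE wrapper: capture the unquoted arguments and forward them
p_test = function (data, var1, var2) {
  p_test_(data, substitute(var1), substitute(var2))
}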
Now the following two pieces of code both work:
p_test(mtcars, mpg, disp)
p_test_(mtcars, 'mpg', 'disp')
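To see the expression that the interpolation step builds, here is a base R sketch that does not depend on lazyeval or do_(). The helper name p_test_base is hypothetical, and bquote() stands in for interp(); it only illustrates the mechanics of the first approach and is not meant to replace the code above.

```r
p_test_base <- function(data, miles, distance) {
  x <- parse(text = miles)[[1]]      # e.g. the call .$mpg
  n <- parse(text = distance)[[1]]   # e.g. the call .$disp
  # .() splices the parsed calls into the template
  call <- bquote(prop.test(.(x), .(n), p = 0.5))
  # Evaluate with `.` bound to the first row, as slice(1) does in the pipeline
  list(call = call, result = eval(call, list(. = data[1, ])))
}

out <- p_test_base(mtcars, ".$mpg", ".$disp")
out$call   # prop.test(.$mpg, .$disp, p = 0.5)
```

The returned call shows that the strings have become ordinary code by the time anything is evaluated, which is what as.name alone could not achieve.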


tidy function cannot be used within future_map?

I have R code below.
For the last row, when I used the map() function, it worked well.
However, when I changed to the future_map() function, I got the following error message:
"Error: Problem with mutate() column model.
i model = future_map(splits, fun1).
x no applicable method for 'tidy' applied to an object of class "c('lmerMod', 'merMod')""
Any idea what's wrong? Thanks.
fun1 <- function(data) {
data %>% analysis %>%
lmer(val ~ period + (1 | id), data = .) %>% tidy
}
plan(multisession)
raw %>%
nest(data = -c(analyte, var)) %>%
mutate(boot = future_map(data, ~ bootstraps(
data = .x,
times = 5,
strata = id
),
.progress = T)) %>%
unnest(boot) %>%
mutate(model =future_map(splits, fun1))
I experienced exactly the same problem with one of my scripts. In order to get future_map to work properly with tidy, I needed to explicitly reference the broom package (i.e. I needed to use broom::tidy in place of tidy). In your example, you are attempting to extract summary statistics from a mixed model, for which the tidy method lives in broom.mixed, so the code should run without error if we modify fun1 to be as follows:
fun1 <- function(data) {
data %>% analysis %>%
lmer(val ~ period + (1 | id), data = .) %>% broom.mixed::tidy
}
UPDATE (13-Dec-2021):
After a bit more reading, I now understand that the problem, as described in the original post, is due to the broom.mixed package not being attached in the R environment(s) where the future is evaluated. Instead of modifying fun1 (which is a very hacky way of resolving the problem), we should make use of the .options argument of future_map to guarantee that broom.mixed is attached (and all associated functions are available) in the future environments. The following code should run without error:
fun1 <- function(data) {
data %>%
analysis %>%
lmer(val ~ period + (1 | id), data = .) %>%
tidy
}
plan(multisession)
raw %>%
nest(data = -c(analyte, var)) %>%
mutate(boot = future_map(data, ~ bootstraps(data = .x,
times = 5,
strata = id),
.progress = T)) %>%
unnest(boot) %>%
mutate(model = future_map(splits,
fun1,
.options = furrr_options(packages = "broom.mixed")))
My take-home from this is that it's probably good practice to always list the packages that we need to use (as a character vector) using the .options argument of future_map, just to be on the safe side. I hope this helps!

How to put a formula within a function in R?

I want to store a dplyr function/formula (e.g. filter(exercise=="Inadequate") or mutate(exercise="adequate")) in the variable_to_filter section for my function. I have lots of variables that need to go through this function. How can I do that? I know the code below doesn't work, but I hope you can see the logic in what I'm trying to do.
exercise_inadequate<-(exercise=="Inadequate")
variable_to_mutate<-(mutate(exercise="adequate"))
difference_pe<-function(percent, variable_to_filter, variable_to_mutate){
filtered <- dataset %>% filter(variable_to_filter)
sampled <- sample_frac(filtered, percent/100)
sampled <- sampled %>% mutate(variable_to_mutate)
}
difference_pe(100, exercise_inadequate, exercise_adequate)
I would prefer passing the column name and value separately to the function, because evaluating a string as the condition in a filter statement can be ugly.
library(dplyr)
library(rlang)
difference_pe<- function(dataset, percent, col, value) {
filtered <- dataset %>% filter({{col}} == value)
sampled <- sample_frac(filtered, percent/100)
return(sampled)
}
You can use this function as :
difference_pe(dataset, 100, exercise, "Inadequate")
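The question also asked about the mutate() step, which the function above leaves out. Here is a minimal sketch of one way to handle it, assuming dplyr 1.0 or later (for the "{{ col }}" := naming syntax). The toy data frame and the name relabel_sample are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the asker's data
dataset <- data.frame(
  id = 1:4,
  exercise = c("Inadequate", "Adequate", "Inadequate", "Adequate")
)

relabel_sample <- function(dataset, percent, col, old_value, new_value) {
  dataset %>%
    filter({{ col }} == old_value) %>%   # keep rows matching the old value
    sample_frac(percent / 100) %>%       # sample the requested share
    mutate("{{ col }}" := new_value)     # overwrite the same column
}

out <- relabel_sample(dataset, 100, exercise, "Inadequate", "adequate")
```

Passing the column and the two values separately keeps both the filter and the mutate step free of string parsing.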
If for some reason the above is not possible and you need to pass the condition as a string, we can use rlang's parse_expr, which plays the role of parse(text = ...) in the familiar eval(parse(...)) idiom.
exercise_inadequate<- 'exercise=="Inadequate"'
difference_pe<- function(dataset, percent, variable_to_filter) {
filtered <- dataset %>% filter(eval(parse_expr(variable_to_filter)))
#filtered <- dataset %>% filter(eval(parse(text = variable_to_filter)))
sampled <- sample_frac(filtered, percent/100)
return(sampled)
}
difference_pe(dataset, 100, exercise_inadequate)

How to properly parse (?) mdsets in expss within a loop?

I'm new to R and I don't know all the basic concepts yet. The task is to produce one merged table with multiple response sets. I am trying to do this using the expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
#path
df.path = "C:/dataset.sav"
#dataset load
df = read_sav(df.path)
#table
table_undropped1 = df %>%
tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
There are 10 multiple response sets, so I need to create 10 tables in the manner shown above. Then I transpose those tables and merge them. To simplify the code (and learn something new) I decided to produce the tables using a loop. However, nothing works. I looked for a solution, and I think the closest to correct is:
#this generates a message: '1' not found
for(i in 1:10) {
assign(paste0("table_undropped",i),1) = df %>%
tab_cells(mdset(assign(paste0("q20s",i,"i1"),1) %to% assign(paste0("q20s",i,"i8"),1)))
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
Still it causes an error described above the code.
Alternatively, an SPSS macro for that would be (published only to better express the problem because I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words I am looking for a working substitute of !concat() in R.
%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for(i in 1:10) {
table_name = paste0("table_undropped",i)
..$table_name = df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
However, it is not good practice to store all tables as separate variables in the global environment. A better approach is to save all tables in a list:
all_tables = lapply(1:10, function(i)
df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
)
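As for the literal question of a substitute for !concat(): in base R that role is played by paste0(). The sketch below only builds the variable names. It assumes, based on the comment in the code above, that the template "q20s{i}i{1:8}" stands for the same eight names.

```r
# Names for one multiple response set
i <- 3
vars <- paste0("q20s", i, "i", 1:8)
vars
# "q20s3i1" "q20s3i2" ... "q20s3i8"

# Names for all ten sets, one character vector per set
all_vars <- lapply(1:10, function(i) paste0("q20s", i, "i", 1:8))
```

A character vector like vars can then be used to select columns, for example with df[vars].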
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
tab_total_row_position("none")
for(i in 1:10) {
my_big_table = my_big_table %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_stat_cpct()
}
my_big_table = my_big_table %>%
tab_pivot(stat_position = "inside_columns") # here we say that we need to combine subtables horizontally

Supplying the list argument to .l in pmap() or pwalk()

I am unclear as to when arguments can be explicitly paired in pmap() and pwalk()'s .l argument. Sometimes these purrr functions only seem to work when the dataframe supplied to them has names that map directly to the expected arguments of the function named in .f. Other times a full dataframe can be supplied to pmap() and the variables can be pair mapped explicitly.
library(dplyr)
library(purrr)
library(tibble)
set.seed(57)
ds_mt <-
mtcars %>%
rownames_to_column("model") %>%
mutate(am = factor(am, labels = c("auto", "manual"))) %>%
select(model, mpg, wt, cyl, am) %>%
sample_n(3)
foo <- function(model, am, mpg){
print(
paste("The", model, "has a", am, "transmission and gets", mpg, "mpgs.")
)
}
Why do these code chunks work?
ds_mt %>%
select(model, am, mpg) %>%
pwalk(
.l = .,
.f = foo
)
# example with explicit pair mapping
ds_mt %>%
mutate(
new_var =
pmap(
.l = list(model=model, am=am, mpg=mpg),
.f = foo
)
)
While these code chunks fail?
ds_mt %>%
pwalk(
.l = list(model, am, mpg),
.f = foo
)
ds_mt %>%
pwalk(
.l = list(model=model, am=am, mpg=mpg),
.f = foo
)
Your problem has nothing to do with pmap() or pwalk(). It stems from some misunderstanding of how the pipe and the mutate() function work.
First, the pipe:
Unless otherwise specified by a dot, the pipe passes the left-hand side (LHS) as the first argument of the function on the right-hand side (RHS).
So this works:
ds_mt %>%
select(model, am, mpg) %>%
pwalk(
.l = .,
.f = foo
)
because your list (= your data frame since a data frame is a list of vectors), which is the LHS of the pipe, is used as the first argument of pwalk() on the RHS.
In this case, you actually do not need the dot and you could have written it much more simply as:
ds_mt %>%
select(model, am, mpg) %>%
pwalk(foo)
On the other hand, when you try to run:
ds_mt %>%
pwalk(
.l = list(model, am, mpg),
.f = foo
)
the connection between your LHS and your RHS does not follow the rules of the pipe, so R has no idea what model is since you don't have any object called model.
For this expression to work, you can write it, without the pipe, in this manner:
pwalk(
.l = list(ds_mt$model, ds_mt$am, ds_mt$mpg),
.f = foo
)
Or, if you want to use the pipe, you have to replace the LHS of the pipe by dots (since it is not passed as the first argument of the function on the RHS) where it is necessary for the code to work, but here, since you are passing the LHS inside nested functions, you also have to put the RHS between curly braces because R would otherwise pass the LHS as the first argument of the outer-most function of the RHS:
ds_mt %>% {
pwalk(
.l = list(.$model, .$am, .$mpg),
.f = foo
)
}
or, in a style a little more compact:
ds_mt %>% {pwalk(list(.$model, .$am, .$mpg), foo)}
In conclusion, it is not enough to have an object on the LHS of a pipe for R to magically apply it at the right places of the RHS (but I think your confusion might come from the case of dplyr functions (see below)). By default, it is used as the first argument of the function on the RHS (and in that case, you don't need any dot). For other placements, you do need a dot at each place where the LHS is needed. And in the case of nested functions (as you have here), you also need to enclose the RHS in curly braces or R will pass the LHS as the first argument of your outer-most RHS function.
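These three rules can be checked on a tiny example. The sketch below uses magrittr directly and a made-up two-column data frame; list() stands in for pwalk() so that the arguments actually received are visible.

```r
library(magrittr)

df <- data.frame(a = 1:3, b = 4:6)

# Default: the LHS becomes the first argument, so this is nrow(df)
n <- df %>% nrow()

# Dots only inside nested calls: the LHS is still inserted first,
# so this is list(df, df$a, df$b)
with_first <- df %>% list(.$a, .$b)

# Braces suppress the automatic first argument: list(df$a, df$b)
braced <- df %>% { list(.$a, .$b) }

length(with_first)  # 3
length(braced)      # 2
```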
Now, to your mutate() example:
ds_mt %>%
mutate(
new_var = pmap(
.l = list(model, am, mpg),
.f = foo
)
)
This one works because dplyr verbs such as mutate() evaluate their arguments in the context of the data frame, so the data frame and dollar sign are not necessary when referring to columns. Here, R does not wonder what model is because you are in a "mutate framework", so to speak, and R understands model as meaning .$model or ds_mt$model. So here again, this has nothing to do with pmap() or pwalk() but is a particularity of the dplyr functions (it would be the same with summarise()). I guess this notational shortcut that dplyr functions allow is what led to the confusion.
Finally, what you call "explicit pair mapping" has no effect. Since you defined your function foo() as accepting 3 arguments, as long as you keep the arguments in the right order,
foo(model = model, am = am, mpg = mpg)
and
foo(model, am, mpg)
are exactly the same. If you swap the arguments around, however, you do need to name them. For instance:
foo(am = am, model = model, mpg = mpg)
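The same matching rule carries over to the list given to .l: pmap() and pwalk() supply named elements by name and unnamed elements by position. A small sketch, with foo() simplified to return the string instead of printing it and with made-up input vectors:

```r
library(purrr)

foo <- function(model, am, mpg) {
  paste("The", model, "has a", am, "transmission and gets", mpg, "mpgs.")
}

model <- c("Fiat 128", "Valiant")
am    <- c("manual", "auto")
mpg   <- c(32.4, 18.1)

# Unnamed list: elements are matched to foo()'s arguments by position
by_position <- pmap_chr(list(model, am, mpg), foo)

# Named list in a different order: elements are matched by name
by_name <- pmap_chr(list(mpg = mpg, model = model, am = am), foo)

identical(by_position, by_name)  # TRUE
```

So naming the list elements changes nothing when they are already in the order of foo()'s signature, but it is what keeps the call correct once the order differs.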

Can one argument be mapped to more than one argument in a user-defined function?

Assume I want to run this:
MS_date<-bind_inpatient_MSW %>%
arrange(NRIC,
APPROVED_DATE_BILL,APPROVED_DATE_FF_APPLICATION) %>%
group_by(NRIC,
APPROVED_DATE_BILL,APPROVED_DATE_FF_APPLICATION) %>%
mutate(n_marital_status=n_distinct(MARITAL_STATUS,na.rm=TRUE))
and this
TH_date<-bind_inpatient_MSW %>%
arrange(NRIC,
APPROVED_DATE_BILL) %>%
group_by(NRIC,
APPROVED_DATE_BILL) %>%
mutate(n_TH=n_distinct(TYPE_OF_HOUSING,na.rm=TRUE))
These two differ in the variables used to arrange and group the dataframe, as well as in the added variable. I would like to write a user-defined function so that I don't have to write this more than once. I tried as follows:
df_date<-function(df,grpby,cntby){
dfnew<-df %>%
arrange(grpby) %>%
group_by(grpby) %>%
mutate(n=n_distinct(cntby,na.rm=TRUE))
return(dfnew)
}
And applying df_date(bind_inpatient_MSW,NRIC,APPROVED_DATE_BILL,APPROVED_DATE_FF_APPLICATION,MARITAL_STATUS)
and
df_date(bind_inpatient_MSW,NRIC,APPROVED_DATE_BILL,TYPE_OF_HOUSING)
Neither call works. How could I solve this?
You can try something like:
fun <- function(dat, group, ctnby) {
  dat %>%
    group_by_(group) %>%
    do((function(., ctnby) {
      with(., data.frame(n = n_distinct(get(ctnby))))
    })(., ctnby))
}
fun(mtcars, "cyl", "hp")
which sidesteps non-standard evaluation by passing the column names as strings and doing the counting inside do().
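With dplyr 1.0 or later, where the underscored verbs such as group_by_() are deprecated, a version closer to the function in the question might look like the sketch below. It assumes the grouping columns are passed as a character vector and the counted column as a string, which is a change from the calls in the question; mtcars stands in for the asker's data.

```r
library(dplyr)

df_date <- function(df, grpby, cntby) {
  df %>%
    arrange(across(all_of(grpby))) %>%     # sort by every grouping column
    group_by(across(all_of(grpby))) %>%    # group by the same columns
    mutate(n = n_distinct(.data[[cntby]], na.rm = TRUE))
}

# Several grouping columns map onto the single grpby argument
out <- df_date(mtcars, c("cyl", "gear"), "carb")
```

Because grpby is a vector, one argument can carry any number of grouping columns, which is what the title of the question asks for.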
