Passing data-variables to R formulas - r

Let's say I'd like to write anscombe %>% lm_tidy("x1", "y1") (actually, I'd like to write anscombe %>% lm_tidy(x1, y1), where x1 and y1 are columns of the data frame). Since the following function seems to work:
plot_gg <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(df, aes(x = !!x, y = !!y)) + geom_point() +
    geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
}
I started writing the following function:
lm_tidy_1 <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  fm <- y ~ x ##### I tried many stuff here!
  lm(fm, data = df)
}
## Error in model.frame.default(formula = fm, data = df, drop.unused.levels = TRUE) :
## object is not a matrix
A comment on passing in column name as argument states that the embrace operator {{...}} is shorthand for the quote-and-unquote pattern. Unfortunately, the error messages were different in the two cases:
lm_tidy_2 <- function(df, x, y) {
  fm <- !!enquo(y) ~ !!enquo(x) # alternative: {{y}} ~ {{x}} with different errors!!
  lm(fm, data = df)
}
## Error:
## ! Quosures can only be unquoted within a quasiquotation context.
This seems to work (based on #jubas's answer), but we're stuck with string handling and paste():
lm_tidy_str <- function(df, x, y) {
  fm <- formula(paste({{y}}, "~", {{x}}))
  lm(fm, data = df)
}
Yet again, {{y}} != !!enquo(y). Worse, the following function breaks down with the same quosure error as earlier:
lm_tidy_str_1 <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  fm <- formula(paste(!!y, "~", !!x))
  lm(fm, data = df)
}
Is {{y}} != !!enquo(y)?
How to pass data-variables to lm?
EDIT: Sorry, there were leftovers from my many trials. I want to pass the data-variables themselves (say x1 and y1) to the function that will use them as formula components (such as lm), not their string versions ("x1" and "y1"): I try to avoid strings as long as possible, and it is more streamlined from the user's perspective.

Consider:
lm_tidy_1 <- function(df, x, y) {
  fm <- reformulate(as.character(substitute(x)), substitute(y))
  lm(fm, data = df)
}
lm_tidy_1(iris, Species, Sepal.Length)
lm_tidy_1(iris, 'Species', Sepal.Length)
lm_tidy_1(iris, Species, 'Sepal.Length')
lm_tidy_1(iris, 'Species', 'Sepal.Length')
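All four calls work because reformulate() builds the formula from character input, and as.character(substitute(x)) yields the same string whether x was passed as a bare symbol or as a string. A quick sketch of the building blocks (base R only; the values are just for illustration):
# reformulate() takes character term labels and an optional response:
reformulate("Species", response = "Sepal.Length")
#> Sepal.Length ~ Species
# as.character() flattens either the symbol or the string to "Species":
as.character(quote(Species))
#> [1] "Species"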
Edit:
If you need the actual formula to appear in the printed call, change the call object:
lm_tidy_1 <- function(df, x, y) {
  fm <- reformulate(as.character(substitute(x)), substitute(y))
  res <- lm(fm, data = df)
  res$call[[2]] <- fm
  res
}
lm_tidy_1(iris, Species, Sepal.Length)
Call:
lm(formula = Sepal.Length ~ Species, data = df)
Coefficients:
(Intercept) Speciesversicolor Speciesvirginica
5.006 0.930 1.582

#BiranSzydek's answer is pretty good.
However, it has three downsides:
The printed call is
Call:
lm(formula = fm, data = .)
so one cannot see the formula or the data that were actually used.
One has to input the symbols as strings.
The dependency on rlang - though it is a great package.
You can indeed solve this problem with pure base R!
The solution in pure base R
R is actually, under the hood, a Lisp. It is well suited to such meta-programming tasks. The only downside of R is its horrible syntax.
Especially when facing meta-programming, it is not as beautiful and elegant as the Lisp languages. The syntax really can confuse - as you experienced yourself when trying to solve this problem.
The solution is to use substitute(), with which you can substitute code pieces in a quoted manner:
lm_tidy <- function(df, x, y) {
  # capture the arguments as code pieces instead of evaluating them:
  .x <- substitute(x)
  .y <- substitute(y)
  .df <- substitute(df)
  # take the code piece `y ~ x` and substitute using a list as lookup table
  .fm <- substitute(y ~ x, list(y = .y, x = .x))
  # take the code `lm(fm, data = df)`, replace `fm` and `df` by the code pieces
  # stored in `.fm` and `.df`,
  # and finally evaluate the substituted code in the parent environment
  # (the environment where the function was called!)
  eval.parent(substitute(lm(fm, data = df), list(fm = .fm, df = .df)))
}
The trick is to use eval.parent(substitute( <your expression>, <a list which serves as the evaluation lookup table for the variables in your expression> )).
Beware of scoping! As long as <your expression> is constructed only from variables that are defined inside the function or inside the lookup list given to substitute(), there won't be any scoping problems. But avoid referring to any other variables within <your expression> - this is the only rule you have to obey to use eval()/eval.parent() safely in this context. And even then, eval.parent() takes care that the substituted code is executed in the environment where the function was called.
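To see what eval.parent() will actually run, here is a minimal sketch of the two substitute() steps with hard-coded code pieces (mpg, cyl and mtcars are just placeholders for illustration):
.fm <- substitute(y ~ x, list(y = quote(mpg), x = quote(cyl)))
.fm
#> mpg ~ cyl
substitute(lm(fm, data = df), list(fm = .fm, df = quote(mtcars)))
#> lm(mpg ~ cyl, data = mtcars)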
Now, you can do:
lm_tidy(mtcars, cyl, mpg)
The output is now as desired:
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
And we did this with pure base R!
The trick for the safe use of eval() is really that every variable in the substitute() expression is defined/given inside the lookup table for substitute() or by the function's arguments. In other words: none of the substituted variables refers to any dangling variable outside the function definition.
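A deliberately broken sketch of what this rule guards against: if the substituted expression mentions a helper that only exists inside the function body (helper_fm below is hypothetical), eval.parent() will not find it:
bad_lm_tidy <- function(df, x, y) {
  helper_fm <- substitute(y ~ x)  # lives only inside the function
  # helper_fm is not in the lookup list, so it survives as a bare symbol:
  eval.parent(substitute(lm(helper_fm, data = df), list(df = substitute(df))))
}
bad_lm_tidy(mtcars, cyl, mpg)
# fails with "object 'helper_fm' not found", because the caller's environment
# has no binding for helper_fm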
plot_gg function
So following these rules, your plot_gg function would be defined as:
plot_gg <- function(df, x, y) {
  .x <- substitute(x)
  .y <- substitute(y)
  .df <- substitute(df)
  .fm <- substitute(y ~ x, list(x = .x, y = .y))
  eval.parent(substitute(
    ggplot(df, aes(x = x, y = y)) + geom_point() +
      geom_smooth(formula = fm, method = "lm", se = FALSE),
    list(fm = .fm, x = .x, y = .y, df = .df)
  ))
}
When you want to enter x and y as strings
lm_tidy_str <- function(df, x, y) {
  .x <- as.name(x)
  .y <- as.name(y)
  .df <- substitute(df)
  .fm <- substitute(y ~ x, list(y = .y, x = .x))
  eval.parent(substitute(lm(fm, data = df), list(fm = .fm, df = .df)))
}
plot_gg_str <- function(df, x, y) {
  .x <- as.name(x)
  .y <- as.name(y)
  .df <- substitute(df)
  .fm <- substitute(y ~ x, list(x = .x, y = .y))
  eval.parent(substitute(
    ggplot(df, aes(x = x, y = y)) + geom_point() +
      geom_smooth(formula = fm, method = "lm", se = FALSE),
    list(fm = .fm, x = .x, y = .y, df = .df)
  ))
}
lm_tidy_str(mtcars, "cyl", "mpg")
# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
#
require(ggplot2)
plot_gg_str(mtcars, "cyl", "mpg")

Wrap the formula in expr(), then evaluate it.
library(dplyr)
lm_tidy <- function(df, x, y) {
  x <- sym(x)
  y <- sym(y)
  fm <- expr(!!y ~ !!x)
  lm(fm, data = df)
}
This function is equivalent:
lm_tidy <- function(df, x, y) {
  fm <- expr(!!sym(y) ~ !!sym(x))
  lm(fm, data = df)
}
Then
lm_tidy(mtcars, "cyl", "mpg")
gives
Call:
lm(formula = fm, data = .)
Coefficients:
(Intercept) cyl
37.885 -2.876
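If you also want the real formula to appear in the Call (the point raised in the base R answer above), one option in the same rlang style is to splice fm into the call before evaluation, e.g. with rlang::inject(); this is a sketch, not part of the original answer:
lm_tidy_show <- function(df, x, y) {
  fm <- expr(!!sym(y) ~ !!sym(x))
  rlang::inject(lm(!!fm, data = df))
}
lm_tidy_show(mtcars, "cyl", "mpg")
# the printed call now reads lm(formula = mpg ~ cyl, data = df)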
EDIT per comment below:
library(rlang)
lm_tidy_quo <- function(df, x, y) {
  y <- enquo(y)
  x <- enquo(x)
  fm <- paste(quo_text(y), "~", quo_text(x))
  lm(fm, data = df)
}
You can then pass symbols as arguments
lm_tidy_quo(mtcars, cyl, mpg)
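Note that quo_text() deparses the quosure back to a string, so this variant ultimately builds the formula from text, much like the paste() approach in the question; lm() then coerces the character string to a formula. A quick check of the building block (assuming rlang is attached, as above):
quo_text(quo(mpg))
#> [1] "mpg"
paste(quo_text(quo(mpg)), "~", quo_text(quo(cyl)))
#> [1] "mpg ~ cyl"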

Related

Function table(x, y) compatible with R base language and native pipe

I am looking to develop a tab(x, y) or tab(d, x, y) function compatible with base R syntax (table(d$x, d$y)) and the native pipe (d |> table(x, y)).
Native pipe compatible function:
tab1 <- function(d, x, y) {
  eval(substitute(table(d$x, d$y)))
}
mtcars |> tab1(cyl, vs)
# or
tab1(mtcars, cyl, vs)
R base:
tab2 <- function(x, y) {
  result <- table(x, y)
  result
}
tab2(mtcars$cyl, mtcars$vs)
What should I add to the tab1 function so that the command tab1(mtcars, mtcars$cyl, mtcars$vs) does not return an error message? (Sometimes, when using a pipe-compatible function, the name of the data frame gets repeated incorrectly.)
I would like to get a function (tab(d, x, y) or tab(x, y)) that works with these commands:
tab(mtcars, cyl, vs) # it's tab1 function
mtcars |> tab(cyl, vs) # it's tab1 function
tab(mtcars, mtcars$cyl, mtcars$vs) # doesn't work with tab1 function
and ideally also with:
tab(mtcars$cyl, mtcars$vs) # it's tab2 function
With this definition, the examples below all work as table does. Note that table(mtcars$vs, mtcars$cyl) does not show names, so tab3 does not either in those cases. (If table in tab3 were replaced with qtab from the collapse package, then it would show names.)
tab3 <- function(d = parent.frame(), x, y, ...) {
  if (missing(y)) eval(substitute(table(d, x)))
  else eval(substitute(table(x, y)), d)
}
tab3(x = mtcars$vs, y = mtcars$cyl)
tab3(mtcars, vs, cyl)
tab3(mtcars$vs, mtcars$cyl)
tab3(mtcars, mtcars$vs, mtcars$cyl)
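To see why tab3 covers both call styles, here is roughly what each branch ends up evaluating for two of the calls above (written out by hand, for illustration only):
# tab3(mtcars, vs, cyl): y is supplied, so the second branch runs, roughly
eval(quote(table(vs, cyl)), mtcars)  # vs and cyl are looked up in mtcars
# tab3(mtcars$vs, mtcars$cyl): y is missing, so the first branch runs, roughly
eval(quote(table(mtcars$vs, mtcars$cyl)))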
The code could be simplified if d were last:
tab4 <- function(x, y, d = parent.frame(), ...) {
  eval(substitute(table(x, y)), d)
}
tab4(x = mtcars$vs, y = mtcars$cyl)
tab4(vs, cyl, mtcars)
tab4(mtcars$vs, mtcars$cyl)
tab4(mtcars$vs, mtcars$cyl, mtcars)
Perhaps you could just use xtabs. All of these work and do show names.
xtabs(~ vs + cyl, mtcars)
xtabs(~ mtcars$vs + mtcars$cyl)
xtabs(~ mtcars$vs + mtcars$cyl, mtcars)

Function within a function: cramer function "Error in loglin"

I tried to embed the cramer function from the sjstats package. Although the function works perfectly outside the custom function, it doesn't work within it.
Thank you very much in advance.
library(sjstats)
cramer2 <- function(dta, x, y) {
  effsize <- cramer(x ~ y, data = dta)
  return(effsize)
}
cramer2(x=gender, y=age, dta=df)
Error in loglin(data, margins, start = start, fit = fitted, param = param, :
incorrect specifications for 'table' or 'start'
This happens because x and y are not automatically substituted in a formula for the variables you have passed. Look:
f <- function(x, y) {
  return(x ~ y)
}
f(a, b)
#> x ~ y
If you want the variables substituted, you can do something like
f2 <- function(x, y) {
  call("~", substitute(x), substitute(y))
}
f2(a, b)
#> a ~ b
So in your case you can do:
library(sjstats)
cramer2 <- function(dta, x, y) {
  f <- as.formula(call("~", substitute(x), substitute(y)))
  effsize <- cramer(f, data = dta)
  return(effsize)
}
Obviously we don't have your data, but using the built-in data set efc we can demonstrate this works as expected:
data(efc)
cramer2(efc, e16sex, c161sex)
#> [1] 0.05258249
Created on 2022-02-27 by the reprex package (v2.0.1)
The solution provided by Allan works perfectly fine if your function does not refer to the variables with quotation marks, i.e. customfunction(dta=mydata, x=gender, y=age, weight=dataweight).
If, however, you must for some reason pass the variables with quotation marks, e.g. customfunction(dta=mydata, x="gender", y="age", weight="dataweight"), then replace substitute with sym:
library(sjstats)
library(rlang) # sym() comes from rlang
cramer2 <- function(dta, x, y, weight) {
  f <- as.formula(call("~", sym(x), sym(y)))
  effsize <- cramer(f, weights = dta[[weight]], data = dta)
  return(effsize)
}

How can I use dplyr/magrittr's pipe inside functions in R?

I'm trying to write a function which takes as arguments a data frame and the name of a column. When I write the function with standard R syntax, I can get the right result using eval and substitute as recommended by #hadley in http://adv-r.had.co.nz/Computing-on-the-language.html
> df <- data.frame(y = 1:10)
> f <- function(data, x) {
+ out <- mean(eval(expr = substitute(x), envir = data))
+ return(out)
+ }
> f(data = df, x = y)
[1] 5.5
Now, when I try to write the same function using the %>% operator, it doesn't work:
> df <- data.frame(y = 1:10)
> f <- function(data, x) {
+ data %>%
+ eval(expr = substitute(x), envir = .) %>%
+ mean()
+ }
> f(data = df, x = y)
Error in eval(expr, envir, enclos) : object 'y' not found
>
How can I combine the piping operator with the use of eval and substitute? It seems really tricky to me.
A workaround would be
f <- function(data, x) {
  v <- substitute(x)
  data %>%
    eval(expr = v, envir = .) %>%
    mean()
}
The problem is that the pipe functions (%>%) are creating another level of closure which interferes with the evaluation of substitute(x). You can see the difference with this example
df <- data.frame(y = 1:10)
f1 <- function(data, x) {
  print(environment())
  eval(expr = environment(), envir = data)
}
f2 <- function(data, x) {
  print(environment())
  data %>%
    eval(expr = environment(), envir = .)
}
f1(data = df, x = y)
# <environment: 0x0000000006388638>
# <environment: 0x0000000006388638>
f2(data = df, x = y)
# <environment: 0x000000000638a4a8>
# <environment: 0x0000000005f91ae0>
Notice how the environments differ in the magrittr version. You want to take care of the substitute stuff as soon as possible when mucking about with non-standard evaluation.
I hope your use case is a bit more complex than your example, because it seems like
mean(df$y)
would be a much easier bit of code to read.
I've been trying to understand my problem.
First, I've written what I want with the summarise_() function:
> library(dplyr)
> df <- data.frame(y = 1:10)
> summarise_(.data = df, mean = ~mean(y))
mean
1 5.5
Then I try to program my own function. I've found a solution which seems to work with the lazyeval package in this post. I use the lazy() and the interp() functions to write what I want.
The first possibility is here:
> library(lazyeval)
> f <- function(data, col) {
+ col <- lazy(col)
+ inter <- interp(~mean(x), x = col)
+ summarise_(.data = data, mean = inter)
+ }
> f(data = df, col = y)
mean
1 5.5
I can also use pipes:
> f <- function(data, col) {
+ col <- lazy(col)
+ inter <- interp(~mean(x), x = col)
+ data %>%
+ summarise_(.data = ., mean = inter)
+ }
>
> f(data = df, col = y)
mean
1 5.5
I would not use eval and substitute.
What follows is a simplified version of this great post suited to your question.
df <- data.frame(y = 1:10)
f <- function(data, x) {
  x <- enquo(x)
  data %>% summarise(mean = mean(!!x))
}
f(data = df, x = y)
There are two things happening here:
Transforming the column name with enquo()
Prefixing the column with !!
Please refer to the link for a more complicated example.
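In current dplyr/rlang, the embrace operator {{ x }} is shorthand for !!enquo(x) inside tidy-eval verbs such as summarise(), so the same function can also be written as:
f <- function(data, x) {
  data %>% summarise(mean = mean({{ x }}))
}
f(data = df, x = y)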

Using strings in a for loop in R - error when including string as variable in ANOVA within for loop

I have the following ANOVA in R which works great:
fit <- aov(dependent1 ~ X + Z + X*Z, data=dataset)
drop1(fit,~.,test="F")
"dependent1", "X", and "Z" are the column names.
I want to make a for loop where I loop over a certain amount of dependent variables, and I tried this:
dependent_variables <- c("dependent1", "dependent2", "dependent3")
for (i in dependent_variables) {
  fit <- aov(i ~ X + Z + X*Z, data = dataset)
  drop1(fit, ~., test = "F")
}
If I run this, I get an error message:
Error in model.frame.default(formula = i ~ X + Z + X * :
variable lengths differ (found for 'X')
Any idea what goes wrong here?
Example data (which may or may not fulfil the criteria for an ANOVA)
X <- rnorm(100)
Z <- rnorm(100)
dependent1 <- rnorm(100)
dependent2 <- rnorm(100)
dependent3 <- rnorm(100)
dataset <- cbind(data.frame(X, Z, dependent1, dependent2, dependent3))
The following script would work; you need to put in the column numbers of your dependent variables:
for (i in 3:5) {
  fit <- aov(dataset[, i] ~ X + Z + X*Z, data = dataset)
  drop <- drop1(fit, ~., test = "F")
  print(fit)
  print(drop)
}
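A variant that keeps each dependent variable's real name in the fitted model is to build the formula from the string inside the loop; a sketch reusing the question's dataset and variable names:
for (i in dependent_variables) {
  fm <- as.formula(paste(i, "~ X + Z + X*Z"))
  fit <- aov(fm, data = dataset)
  print(drop1(fit, ~., test = "F"))
}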
Why not loop through data instead of looping through names? Perhaps this is a bit clunkier than what you're trying to do.
Create data
dependent1 = runif(100);
dependent2 = runif(100);
dependent3 = runif(100);
dataset = data.frame(X = 1:100, Z = rnorm(100, 1, 100))
Run single ANOVA
fit = aov(dependent1 ~ X + Z + X*Z, data=dataset)
drop1(fit,~.,test="F")
cbind the dependents together and loop over them, storing results in list objects
d = cbind(dependent1, dependent2, dependent3)
fit = list(); drop = list()
for (i in 1:ncol(d)) {
  fit[[i]] = aov(d[, i] ~ X + Z + X*Z, data = dataset)
  drop[[i]] = drop1(fit[[i]], ~., test = "F")
}
** Edited: called fit instead of fit[[i]]. Sorry about that.

R - converting dot-dot-dot in a formula

I want to write a function that can take a variable number of inputs and regress the first input on the rest of the inputs. More specifically,
Hypothetically, if the function had been supplied with 2 or 3 or 4 variables, I would have defined it as:
egen_neut <- function(x, y) residuals(lm(x ~ y, na.action = na.exclude))
egen_neut <- function(x, y, z) residuals(lm(x ~ y + z, na.action = na.exclude))
egen_neut <- function(x, y, z, w) residuals(lm(x ~ y + z + w, na.action = na.exclude))
How can I convert the dot-dot-dot, i.e. "...", so that it is interpreted as a formula with a "+" between the variables? That is, what should go in place of the ????? below:
egen_neut <- function(x, ...) {
  residuals(lm(x ~ ?????, na.action = na.exclude))
}
Here's one way to do it:
ff <- function(x, ...) {
  mc <- as.list(match.call())[-1]
  ll <- as.character(mc[[1]])
  rr <- paste(sapply(mc[-1], as.character), collapse = "+")
  fm <- as.formula(paste(ll, "~", rr))
  # now you can execute `lm` as:
  lm(fm, na.action = na.exclude)
}
# now you can do
ff(x, y, z, w)
where all these input variables reside in the global environment. Hope this helps.
Since you're dealing with a data.frame, this is how I'd do it:
ff <- function(df, ...) {
  mc <- as.list(match.call())[-(1:2)]
  ll <- as.character(mc[[1]])
  rr <- paste(sapply(mc[-1], as.character), collapse = "+")
  fm <- as.formula(paste(ll, "~", rr))
  # now you can execute `lm` as:
  lm(fm, data = df, na.action = na.exclude)
}
Suppose your data.frame is DF with columns x, y, z, and you want x ~ y; then:
ff(DF, x, y)
Or x ~ y + z, then:
ff(DF, x, y, z)
You get the idea.
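Under the same assumptions (a data.frame DF whose columns are named in ...), a shorter route to the same formula is reformulate(), which takes the deparsed arguments directly; ff2 below is a sketch, not part of the original answer:
ff2 <- function(df, ...) {
  vars <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  fm <- reformulate(vars[-1], response = vars[1])
  lm(fm, data = df, na.action = na.exclude)
}
ff2(DF, x, y, z)  # fits x ~ y + z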
You don't have to do that. Just write your function to take a formula argument:
egen_neut <- function(fm)
resid(lm(fm, na.action=na.exclude))
egen_neut(x ~ y)
egen_neut(x ~ y + z)
egen_neut(x ~ y + z + w)
If you put all the variables into a data frame, then you can use a loop to extract a subset of the columns and a . in the formula object to fit an additive model to all of the variables in the data frame not already in the formula. Suppose your data frame is d, with columns x, y, z, etc. Then something like
sapply(seq(2, length(d)), function(ix, d) residuals(lm(x ~ ., d[, seq(ix)])), d = d)
should do the trick, but this is untried so it might need some tweaking
