R - converting dot-dot-dot in a formula

I want to write a function that takes a variable number of inputs and regresses the first input on the rest. More specifically, if the function were supplied with 2, 3, or 4 variables, I would define it as:
egen_neut<-function(x,y) residuals(lm(x~y,na.action=na.exclude))
egen_neut<-function(x,y,z) residuals(lm(x~y+z,na.action=na.exclude))
egen_neut<-function(x,y,z,w) residuals(lm(x~y+z+w,na.action=na.exclude))
How can I convert the dot-dot-dot, i.e. "...", so that it is interpreted as a formula with a "+" between the variables? That is, what should go in place of the ????? below?
egen_neut<-function(x,...) {
residuals(lm(x ~ ?????,na.action=na.exclude))
}

Here's one way to do it:
ff <- function(x, ...) {
mc <- as.list(match.call())[-1]
ll <- as.character(mc[[1]])
rr <- paste(sapply(mc[-(1)], as.character), collapse="+")
fm <- as.formula(paste(ll, "~", rr))
# now you can execute `lm` as:
lm(fm, na.action = na.exclude)
}
# now you can do
ff(x, y, z, w)
where all these input variables reside in the global environment. Hope this helps.
Since you're dealing with data.frame, this is how I'd do it:
ff <- function(df, ...) {
mc <- as.list(match.call())[-(1:2)]
ll <- as.character(mc[[1]])
rr <- paste(sapply(mc[-(1)], as.character), collapse="+")
fm <- as.formula(paste(ll, "~", rr))
# now you can execute `lm` as:
lm(fm, data = df, na.action = na.exclude)
}
Suppose your data.frame is DF with columns x, y, z and you want to do x ~ y then:
ff(DF, x, y)
Or x ~ y + z, then:
ff(DF, x, y, z)
You get the idea.
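A variant of the same idea that avoids pasting the "+" by hand, as a minimal sketch: reformulate() from base R joins the term labels itself. The name egen_neut_dots and the choice to pass the data frame as the first argument are assumptions made for illustration; they are not part of the answers above.

```r
# Hypothetical helper: the first argument after the data frame is the response,
# everything passed through `...` becomes a predictor joined with "+".
egen_neut_dots <- function(df, x, ...) {
  response <- deparse(substitute(x))
  # Capture the `...` arguments unevaluated, one language object per argument
  dots <- match.call(expand.dots = FALSE)$...
  predictors <- vapply(dots, deparse, character(1))
  fm <- reformulate(predictors, response = response)
  residuals(lm(fm, data = df, na.action = na.exclude))
}

# Same residuals as residuals(lm(mpg ~ cyl + wt, data = mtcars))
res <- egen_neut_dots(mtcars, mpg, cyl, wt)
```

Because the names are only deparsed, never evaluated, the columns are looked up in df when lm() runs.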

You don't have to do that. Just write your function to take a formula argument:
egen_neut <- function(fm)
resid(lm(fm, na.action=na.exclude))
egen_neut(x ~ y)
egen_neut(x ~ y + z)
egen_neut(x ~ y + z + w)

If you put all the variables into a data frame, you can use a loop to extract a subset of the columns and use a . in the formula to fit an additive model on every variable in the data frame that is not already in the formula. Suppose your data frame is d, with columns x, y, z, etc. Then something like
sapply(seq(2, length(d)), function(ix, d) residuals(lm(x ~ ., d[, seq(ix)])), d = d)
should do the trick, but this is untested, so it might need some tweaking.
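A small check of that idea on simulated data, as a sketch only. The data frame d and its three columns are made up here, and x is assumed to be the first column so that d[, seq(ix)] always contains the response.

```r
set.seed(42)
# Made-up data: x is the response and must be the first column
d <- data.frame(x = rnorm(20), y = rnorm(20), z = rnorm(20))

# Column j of `res` holds the residuals of x regressed on the first j predictors
res <- sapply(seq(2, length(d)), function(ix) {
  residuals(lm(x ~ ., data = d[, seq(ix)]))
})

dim(res)  # one row per observation, one column per nested model
```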

Related

Passing data-variables to R formulas

Let's say I'd like to write anscombe %>% lm_tidy("x1", "y1") (actually, I'd like to write anscombe %>% lm_tidy(x1, y1), where x1 and y1 are columns of the data frame). Since the following function seems to work:
plot_gg <- function(df, x, y) {
x <- enquo(x)
y <- enquo(y)
ggplot(df, aes(x = !!x, y = !!y)) + geom_point() +
geom_smooth(formula = y ~ x, method="lm", se = FALSE)
}
I started writing the following function:
lm_tidy_1 <- function(df, x, y) {
x <- enquo(x)
y <- enquo(y)
fm <- y ~ x ##### I tried many stuff here!
lm(fm, data=df)
}
## Error in model.frame.default(formula = fm, data = df, drop.unused.levels = TRUE) :
## object is not a matrix
One comment on "passing in column name as argument" states that the embrace operator {{...}} is shorthand for the quote-unquote pattern. Unfortunately, the error messages differed between the two approaches:
lm_tidy_2 <- function(df, x, y) {
fm <- !!enquo(y) ~ !!enquo(x) # alternative: {{y}} ~ {{x}} with different errors!!
lm(fm, data=df)
}
## Error:
## ! Quosures can only be unquoted within a quasiquotation context.
This seems to work (based on @jubas's answer, but we're stuck with string handling and paste):
lm_tidy_str <- function(df, x, y) {
fm <- formula(paste({{y}}, "~", {{x}}))
lm(fm, data=df)
}
Yet again, {{y}} != !!enquo(y). But it's worse: the following function breaks down with the same Quosure error as earlier:
lm_tidy_str_1 <- function(df, x, y) {
x <- enquo(x)
y <- enquo(y)
fm <- formula(paste(!!y, "~", !!x))
lm(fm, data=df)
}
Is {{y}} != !!enquo(y)?
How to pass data-variables to lm?
EDIT: Sorry, there were left-overs from my many trials. I want to directly pass the data-variables (say x1 and y1) to the function that is going to use them as formula components (such as lm) and not their string versions ("x1" and "y1"): I try to avoid strings as long as possible and it's more streamlined from the user perspective.
Consider:
lm_tidy_1 <- function(df, x, y) {
fm <- reformulate(as.character(substitute(x)), substitute(y))
lm(fm, data=df)
}
lm_tidy_1(iris, Species, Sepal.Length)
lm_tidy_1(iris, 'Species', Sepal.Length)
lm_tidy_1(iris, Species, 'Sepal.Length')
lm_tidy_1(iris, 'Species', 'Sepal.Length')
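Why all four calls behave the same, as a short sketch: substitute() returns either a symbol or the string itself, and as.character() maps both to the same character value. The helper name name_of is made up for this illustration.

```r
# Hypothetical helper: capture the argument unevaluated and turn it into a string
name_of <- function(x) as.character(substitute(x))

name_of(Species)    # bare symbol
name_of("Species")  # quoted string, same result

# reformulate() accepts the response as a symbol or as a string
f_sym <- reformulate(name_of(Species), quote(Sepal.Length))
f_chr <- reformulate(name_of("Species"), "Sepal.Length")
```

Note that this only holds for plain column names. For a call such as log(x), as.character(substitute(x)) returns several elements, so deparse() would be the safer choice there.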
Edit:
If you need the formula to appear, change the call object:
lm_tidy_1 <- function(df, x, y) {
fm <- reformulate(as.character(substitute(x)), substitute(y))
res<-lm(fm, data=df)
res$call[[2]]<- fm
res
}
lm_tidy_1(iris, Species, Sepal.Length)
Call:
lm(formula = Sepal.Length ~ Species, data = df)
Coefficients:
(Intercept) Speciesversicolor Speciesvirginica
5.006 0.930 1.582
@BiranSzydek's answer is pretty good.
However it has 3 downsides:
Call:
lm(formula = fm, data = .)
One cannot see the formula or the data that were actually used.
One has to input the symbols as strings.
It depends on rlang, though that is a great package.
You can indeed solve this problem with pure base R!
The solution in pure base R
Under the hood, R is actually a Lisp, which makes it suitable for such meta-programming tasks. The only downside of R is its horrible syntax.
Especially for meta-programming, it is not as beautiful and elegant as the Lisp languages. The syntax can be really confusing, as you experienced yourself when trying to solve this problem.
The solution is to use substitute() by which you can substitute code pieces in a quoted manner:
lm_tidy <- function(df, x, y) {
# take the arguments as code pieces instead of evaluating them:
.x <- substitute(x)
.y <- substitute(y)
.df <- substitute(df)
# take the code piece `y ~ x` and substitute using list lookup table
.fm <- substitute(y ~ x, list(y=.y, x=.x))
# take the code `lm(fm, data=df)` and substitute into it the code pieces from the lookup table,
# i.e. replace `fm` and `df` by the code pieces stored in `.fm` and `.df`,
# and finally evaluate the substituted code in the parent environment (the environment where the function was called!)
eval.parent(substitute(lm(fm, data=df), list(fm=.fm, df=.df)))
}
The trick is to use eval.parent(substitute( <your expression>, <a list which determines the evaluation lookup-table for the variables in your expression>)).
Beware of scoping! As long as <your expression> is constructed only from variables that are defined inside the function or inside the lookup list given to substitute(), there won't be any scoping problems. But avoid referring to any other variables within <your expression>! This is the only rule you have to obey to use eval()/eval.parent() safely in this context.
And even if you do, eval.parent() makes sure that the substituted code is executed within the environment where this function was called.
Now, you can do:
lm_tidy(mtcars, cyl, mpg)
the output is now as desired:
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
And we did this with pure base R!
The trick for safe use of eval() is really that every variable in the substitute() expression is defined/given inside the lookup tables for substitute() or the function's argument. In other words: None of the replaced variables refers to any dangling variables outside the function definition.
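To make the two substitution steps visible outside a function, here is a minimal sketch run at top level. The variable names x_sym, y_sym, fm_call and lm_call are illustrative only.

```r
# Symbols standing in for what substitute(x) and substitute(y) would capture
x_sym <- quote(cyl)
y_sym <- quote(mpg)

# Step 1: fill the template `y ~ x` from a lookup list
fm_call <- substitute(y ~ x, list(y = y_sym, x = x_sym))

# Step 2: fill the template `lm(fm, data = df)` the same way
lm_call <- substitute(lm(fm, data = df), list(fm = fm_call, df = quote(mtcars)))
print(lm_call)

# Only now is anything evaluated
fit <- eval(lm_call)
```

Until eval() runs, both objects are plain unevaluated calls, which is why the printed Call of the fitted model shows the real formula and data names.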
plot_gg function
So following these rules, your plot_gg function would be defined as:
plot_gg <- function(df, x, y) {
.x <- substitute(x)
.y <- substitute(y)
.df <- substitute(df)
.fm <- substitute( y ~ x, list(x=.x, y=.y))
eval.parent(substitute(
ggplot(df, aes(x=x, y=y)) + geom_point() +
geom_smooth(formula = fm, method="lm", se=FALSE),
list(fm=.fm, x=.x, y=.y, df=.df)
))
}
When you want to enter x and y as strings
lm_tidy_str <- function(df, x, y) {
.x <- as.name(x)
.y <- as.name(y)
.df <- substitute(df)
.fm <- substitute(y ~ x, list(y=.y, x=.x))
eval.parent(substitute(lm(fm, data=df), list(fm=.fm, df=.df)))
}
plot_gg_str <- function(df, x, y) {
.x <- as.name(x)
.y <- as.name(y)
.df <- substitute(df)
.fm <- substitute( y ~ x, list(x=.x, y=.y))
eval.parent(substitute(
ggplot(df, aes(x=x, y=y)) + geom_point() +
geom_smooth(formula = fm, method="lm", se=FALSE),
list(fm=.fm, x=.x, y=.y, df=.df)
))
}
lm_tidy_str(mtcars, "cyl", "mpg")
# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
#
require(ggplot2)
plot_gg_str(mtcars, "cyl", "mpg")
Wrap the formula in expr(), then pass it to lm().
library(dplyr)
lm_tidy <- function(df, x, y) {
x <- sym(x)
y <- sym(y)
fm <- expr(!!y ~ !!x)
lm(fm, data = df)
}
This function is equivalent:
lm_tidy <- function(df, x, y) {
fm <- expr(!!sym(y) ~ !!sym(x))
lm(fm, data = df)
}
Then
lm_tidy(mtcars, "cyl", "mpg")
gives
Call:
lm(formula = fm, data = .)
Coefficients:
(Intercept) cyl
37.885 -2.876
EDIT per comment below:
library(rlang)
lm_tidy_quo <- function(df, x, y){
y <- enquo(y)
x <- enquo(x)
fm <- paste(quo_text(y), "~", quo_text(x))
lm(fm, data = df)
}
You can then pass symbols as arguments
lm_tidy_quo(mtcars, cyl, mpg)

Function within a function: cramer function "Error in loglin"

I tried to embed the cramer function from sjstats package. Although the function works perfectly outside the custom function, it doesn't work within it.
Thank you very much in advance.
library (sjstats)
cramer2 <- function(dta, x, y){
effsize <- cramer(x ~ y, data = dta)
return(effsize)
}
cramer2(x=gender, y=age, dta=df)
Error in loglin(data, margins, start = start, fit = fitted, param = param, :
incorrect specification of 'table' or 'start'
This happens because x and y are not automatically substituted in a formula for the variables you have passed. Look:
f <- function(x, y) {
return(x ~ y)
}
f(a, b)
#> x ~ y
If you want the variables substituted, you can do something like
f2 <- function(x, y) {
call("~", substitute(x), substitute(y))
}
f2(a, b)
#> a ~ b
So in your case you can do:
library (sjstats)
cramer2 <- function(dta, x, y) {
f <- as.formula(call("~", substitute(x), substitute(y)))
effsize <- cramer(f, data = dta)
return(effsize)
}
Obviously we don't have your data, but using the built-in data set efc we can demonstrate this works as expected:
data(efc)
cramer2(efc, e16sex, c161sex)
#> [1] 0.05258249
Created on 2022-02-27 by the reprex package (v2.0.1)
The solution provided by Allan works perfectly well if your function takes the variables unquoted, i.e. customfunction(dta=mydata, x=gender, y=age, weight=dataweight).
If, however, you must for some reason pass the variable names as quoted strings, e.g. customfunction(dta=mydata, x="gender", y="age", weight="dataweight"), then replace substitute with sym:
library(sjstats)
library(rlang) # provides sym()
# weight is now an argument; it was used in the body but missing from the signature
cramer2 <- function(dta, x, y, weight) {
f <- as.formula(call("~", sym(x), sym(y)))
effsize <- cramer(f, weights=dta[[weight]], data = dta)
return(effsize)
}
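If the rlang dependency is unwanted, base R's as.name() does the same job of turning a string into a symbol. A minimal sketch, with make_formula as a made-up helper name and no sjstats call, so it runs without extra packages:

```r
# Hypothetical helper: build `x ~ y` from two column names given as strings
make_formula <- function(x, y) {
  as.formula(call("~", as.name(x), as.name(y)))
}

f <- make_formula("gender", "age")
f
```

The resulting formula can then be passed to cramer() or any other function with a formula interface.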

Using strings in a for loop in R - error when including string as variable in ANOVA within for loop

I have the following ANOVA in R which works great:
fit <- aov(dependent1 ~ X + Z + X*Z, data=dataset)
drop1(fit,~.,test="F")
"dependent1", "X", and "Z" are the column names.
I want to make a for loop where I loop over a certain amount of dependent variables, and I tried this:
dependent_variables <- c("dependent1", "dependent2", "dependent3")
for (i in dependent_variables) {
fit <- aov(i ~ X + Z + X*Z, data=dataset)
drop1(fit,~.,test="F")
}
If I run this, I get an error message:
Error in model.frame.default(formula = i ~ X + Z + X * :
variable lengths differ (found for 'X')
Any idea what goes wrong here?
Example data (which may or may not fulfil the criteria for an ANOVA)
X <- rnorm(100)
Z <- rnorm(100)
dependent1 <- rnorm(100)
dependent2 <- rnorm(100)
dependent3 <- rnorm(100)
dataset <- cbind(data.frame(X, Z, dependent1, dependent2, dependent3))
The following script would work; you need to put in the column numbers of your dependent variables:
for (i in 3:5) {
fit <- aov(dataset[ , i] ~ X + Z + X*Z, data=dataset)
drop <- drop1(fit,~.,test="F")
print(fit)
print(drop)
}
Why not loop through the data instead of looping through names? Perhaps this is a bit clunkier than what you're trying to do.
Create data
dependent1 = runif(100)
dependent2 = runif(100)
dependent3 = runif(100)
# assumed intent: 100 draws; rnorm(1,1,100) yields a single value that is recycled, making Z constant
dataset = data.frame(X=1:100, Z=rnorm(100,1,100))
Run single ANOVA
fit = aov(dependent1 ~ X + Z + X*Z, data=dataset)
drop1(fit,~.,test="F")
cbind the dependents together and loop over them, storing results in list objects
d = cbind(dependent1, dependent2, dependent3)
fit = list(); drop = list()
for (i in 1:ncol(d)) {
fit[[i]] = aov(d[,i] ~ X + Z + X*Z, data=dataset)
drop[[i]] = drop1(fit[[i]],~.,test="F")
}
** Edited: called fit instead of fit[[i]]. Sorry about that.
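The original loop fails because i inside the formula is taken literally as a variable named i (a length-one string), not as the column it names. If looping over the names is preferred, one option is to build each formula from the string with reformulate(). This is a sketch on simulated data, not taken from the answers above; the list name results is arbitrary.

```r
set.seed(1)
dataset <- data.frame(
  X = rnorm(100), Z = rnorm(100),
  dependent1 = rnorm(100), dependent2 = rnorm(100), dependent3 = rnorm(100)
)
dependent_variables <- c("dependent1", "dependent2", "dependent3")

results <- lapply(dependent_variables, function(dv) {
  # Builds e.g. dependent1 ~ X * Z, which expands to X + Z + X:Z
  fm <- reformulate("X * Z", response = dv)
  fit <- aov(fm, data = dataset)
  drop1(fit, ~ ., test = "F")
})
names(results) <- dependent_variables

results$dependent1
```

Storing the tables in a list also sidesteps the fact that a bare drop1() call inside a for loop prints nothing.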

Error in terms.formula(formula) : '.' in formula and no 'data' argument

I'm trying to use neuralnet for prediction.
Create some X:
x <- cbind(seq(1, 50, 1), seq(51, 100, 1))
Create Y:
y <- x[,1]*x[,2]
Give them names:
colnames(x) <- c('x1', 'x2')
names(y) <- 'y'
Make data.frame:
dt <- data.frame(x, y)
And now I get an error:
model <- neuralnet(y~., dt, hidden=10, threshold=0.01)
error in terms.formula(formula) : '.' in formula and no 'data' argument
By contrast, this works in lm (linear model).
As my comment states, this looks like a bug in the non-exported function neuralnet:::generate.initial.variables. As a work around, just build a long formula from the names of dt, excluding y, e.g.
n <- names(dt)
f <- as.formula(paste("y ~", paste(n[!n %in% "y"], collapse = " + ")))
f
## gives
> f
y ~ x1 + x2
## fit model using `f`
model <- neuralnet(f, data = dt, hidden=10, threshold=0.01)
> model
Call: neuralnet(formula = f, data = dt, hidden = 10, threshold = 0.01)
1 repetition was calculated.
Error Reached Threshold Steps
1 53975276.25 0.00857558698 1967
Offering a simpler alternative to the previous answer, you can create a formula from names of dt using reformulate():
f <- reformulate(setdiff(colnames(dt), "y"), response="y")
reformulate() doesn't require the use of paste() and automatically adds the terms together.
To expand a formula
f <- formula(terms(f, data= dt))
or even shorter
f <- formula(dt, f)
where f is the formula and dt is the data.
For instance, the original formula could be:
f <- as.formula("y ~ .")
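A quick check of the terms() route on a data frame shaped like the one in the question, as a sketch. The variable name f_expanded is illustrative.

```r
dt <- data.frame(x1 = 1:50, x2 = 51:100)
dt$y <- dt$x1 * dt$x2

f <- as.formula("y ~ .")

# terms() resolves the dot against the columns of `dt` that are not on the left-hand side
f_expanded <- formula(terms(f, data = dt))
f_expanded
```

The expanded formula no longer contains a dot, so it can be handed to functions that do not resolve `.` themselves.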

How to reliably get dependent variable name from formula object?

Let's say I have the following formula:
myformula<-formula("depVar ~ Var1 + Var2")
How to reliably get dependent variable name from formula object?
I failed to find any built-in function that serves this purpose.
I know that as.character(myformula)[[2]] works, as does
sub("^(\\w*)\\s~\\s.*$","\\1",deparse(myformula))
It just seems to me that these methods are more hackery than a reliable, standard way to do it.
Does anyone happen to know exactly what method e.g. lm uses? I've seen its code, but it is a little too cryptic for me. Here is a quote for your convenience:
> lm
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
{
ret.x <- x
ret.y <- y
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())
if (method == "model.frame")
return(mf)
else if (method != "qr")
warning(gettextf("method = '%s' is not supported. Using 'qr'",
method), domain = NA)
mt <- attr(mf, "terms")
y <- model.response(mf, "numeric")
w <- as.vector(model.weights(mf))
if (!is.null(w) && !is.numeric(w))
stop("'weights' must be a numeric vector")
offset <- as.vector(model.offset(mf))
if (!is.null(offset)) {
if (length(offset) != NROW(y))
stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
length(offset), NROW(y)), domain = NA)
}
if (is.empty.model(mt)) {
x <- NULL
z <- list(coefficients = if (is.matrix(y)) matrix(, 0,
3) else numeric(), residuals = y, fitted.values = 0 *
y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
0) else if (is.matrix(y)) nrow(y) else length(y))
if (!is.null(offset)) {
z$fitted.values <- offset
z$residuals <- y - offset
}
}
else {
x <- model.matrix(mt, mf, contrasts)
z <- if (is.null(w))
lm.fit(x, y, offset = offset, singular.ok = singular.ok,
...)
else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
...)
}
class(z) <- c(if (is.matrix(y)) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
if (model)
z$model <- mf
if (ret.x)
z$x <- x
if (ret.y)
z$y <- y
if (!qr)
z$qr <- NULL
z
}
Try using all.vars:
all.vars(myformula)[1]
I suppose you could also cook your own function to work with terms():
getResponse <- function(formula) {
tt <- terms(formula)
vars <- as.character(attr(tt, "variables"))[-1] ## [1] is the list call
response <- attr(tt, "response") # index of response var
vars[response]
}
R> myformula <- formula("depVar ~ Var1 + Var2")
R> getResponse(myformula)
[1] "depVar"
It is just as hacky as as.character(myformula)[[2]] but you have the assurance that you get the correct variable as the ordering of the call parse tree isn't going to change any time soon.
This isn't so good with multiple dependent variables:
R> myformula <- formula("depVar1 + depVar2 ~ Var1 + Var2")
R> getResponse(myformula)
[1] "depVar1 + depVar2"
as they'll need further processing.
I found a useful package, 'formula.tools', which is suitable for your task.
Code example:
library(formula.tools)
f <- as.formula(a1 + a2~a3 + a4)
lhs.vars(f) #get dependent variables
[1] "a1" "a2"
rhs.vars(f) #get independent variables
[1] "a3" "a4"
Based on your edit to get the actual response, not just its name, we can use the nonstandard evaluation idiom employed by lm() and most other modelling functions with a formula interface in base R
form <- formula("depVar ~ Var1 + Var2")
dat <- data.frame(depVar = rnorm(10), Var1 = rnorm(10), Var2 = rnorm(10))
# the argument must be named `formula` so that match() below finds it in the call
getResponse <- function(formula, data) {
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
y
}
> getResponse(form, dat)
1 2 3 4 5
-0.02828573 -0.41157817 2.45489291 1.39035938 -0.31267835
6 7 8 9 10
-0.39945771 -0.09141438 0.81826105 0.37448482 -0.55732976
As you see, this gets the actual response variable data from the supplied data frame.
How this works is that the function first captures the function call without expanding the ... argument as that contains things not needed for the evaluation of the data for the formula.
Next, the "formula" and "data" arguments are matched with the call. The line mf[c(1L, m)] selects the function name from the call (1L) and the locations of the two matched arguments. The drop.unused.levels argument of model.frame() is set to TRUE in the next line, and then the function name in the call is switched to model.frame (from getResponse here, from lm in lm() itself). All the above code does is take the original call and turn it into a call to the model.frame() function.
This modified call is then evaluated in the calling environment of the function (its parent frame), which in this case is the global environment.
The last line uses the model.response() extractor function to take the response variable from the model frame.
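To see the rewritten call itself, the same steps can be run without the final eval(). This is a sketch; to_model_frame_call is a made-up name, and the extra = 1 argument is only there to show that anything passed through ... is dropped.

```r
# Hypothetical helper: return the rewritten call instead of evaluating it
to_model_frame_call <- function(formula, data, ...) {
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data"), names(mf), 0L)
  mf <- mf[c(1L, m)]                   # keep the function name and the two matched arguments
  mf$drop.unused.levels <- TRUE
  mf[[1L]] <- as.name("model.frame")   # swap the function being called
  mf
}

set.seed(1)
dat <- data.frame(depVar = rnorm(10), Var1 = rnorm(10), Var2 = rnorm(10))

cl <- to_model_frame_call(depVar ~ Var1 + Var2, dat, extra = 1)
print(cl)

# Evaluating the call gives the model frame; model.response() extracts the response
y <- model.response(eval(cl), "numeric")
```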
This should always give you all dependent vars:
myformula<-formula("depVar1 + depVar2 ~ Var1 + Var2")
as.character(myformula[[2]])[-1]
#[1] "depVar1" "depVar2"
And I wouldn't consider this particularly "hacky".
Edit:
Something strange happens with 3 dependents:
myformula<-formula("depVar1 + depVar2 + depVar3 ~ Var1 + Var2")
as.character(myformula[[2]])
#[1] "+" "depVar1 + depVar2" "depVar3"
So this might not be as reliable as I thought.
Edit2:
Okay, myformula[[2]] is a language object and as.character seems to do something similar to languageEl.
length(myformula[[2]])
#[1] 3
languageEl(myformula[[2]],which=1)
#`+`
languageEl(myformula[[2]],which=2)
#depVar1 + depVar2
languageEl(myformula[[2]],which=3)
#depVar3
languageEl(languageEl(myformula[[2]],which=2),which=2)
#depVar1
If you check the length of each element, you could create your own extraction function. But this is probably too much of a hack.
Edit3:
Based on the answer by @seancarmody, all.vars(myformula[[2]]) is the way to go.
Using all.vars is very tricky as it won't detect the response from a one-sided formula. For example
all.vars(~x+1)
[1] "x"
that is wrong.
Here is the most reliable way of getting the response:
getResponseFromFormula = function(formula) {
if (attr(terms(as.formula(formula)) , which = 'response'))
all.vars(formula)[1]
else
NULL
}
getResponseFromFormula(~x+1)
NULL
getResponseFromFormula(y~x+1)
[1] "y"
Note that you can replace all.vars(formula)[1] in the function with formula[[2]], which returns the whole left-hand side as an unevaluated expression, if the formula contains more than one variable for the response.
I know this question is quite old, but I thought I'd add a base R answer which doesn't require indexing, doesn't depend on the order of the variables listed in a call to all.vars, and which gives the response variables as separate elements when there is more than one:
myformula <- formula("depVar1 + depVar2 ~ Var1 + Var2")
all_vars <- all.vars(myformula)
response <- all_vars[!(all_vars %in% labels(terms(myformula)))]
> response
[1] "depVar1" "depVar2"
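Pulling the points from the answers together, one possible helper is sketched below. The name get_response_vars is made up. It relies only on the fact that a two-sided formula has length 3 and keeps its left-hand side in element 2.

```r
# Hypothetical helper: names of all variables on the left-hand side, or
# character(0) for a one-sided formula
get_response_vars <- function(formula) {
  formula <- as.formula(formula)
  if (length(formula) < 3L) {
    return(character(0))
  }
  all.vars(formula[[2L]])
}

get_response_vars(depVar ~ Var1 + Var2)
get_response_vars(depVar1 + depVar2 + depVar3 ~ Var1 + Var2)
get_response_vars(~ x + 1)
get_response_vars(log(y) ~ x)
```

Since all.vars() is applied to the left-hand side only, a right-hand side variable can never be mistaken for the response, and transformed responses such as log(y) still yield the underlying variable name.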

Resources