I would like to write a function and calls different sub-functions with parameters specified by string, such as:
genericModel <- function(model, dat, y, x, ...) {
fit <- get(model)(get(y) ~ get(x), data = dat, ...)
return(fit)
}
I am able to get it to work with simple cases:
> d <- data.frame(x.var = rnorm(10), y.var = rnorm(10), w = rep(1, 10))
> genericModel('lm', d, 'y.var', 'x.var')
Call:
get(model)(formula = get(y) ~ get(x), data = dat)
Coefficients:
(Intercept) get(x)
-0.04242 -0.31619
However, I have not been successful in terms of passing other optional arguments by string:
> genericModel('lm', d, 'y.var', 'x.var', weights = 'w')
Error in model.frame.default(formula = get(y) ~ get(x), data = dat, weights = "w", :
variable lengths differ (found for '(weights)')
I know I can do genericModel('lm', d, 'y.var', 'x.var', weights = d$w), but that defeats the purpose of creating a flexible function where I can specify the model and column names by string.
Also I can foresee complications where the optional parameters include both column names of the data.frame(ex:weights = w) and generic options for the sub-function(ex:na.action=na.pass).
EDIT:
Just to clarify, what I am hoping to achieve is:
genericModel('lm', d, 'y.var', 'x.var', weights = 'w')
genericModel('glm', d, 'y.var', 'x.var', family = 'binomial')
To run linear regression and logistic regression, respectively. I need some way to pass the optional arguments when calling genericModel.
Does anyone know how to deal with this? Thanks.
One suggestion: rather than fiddling with strings to specify analysis variables, what you should do is pass the formula. This is also much more flexible, since you'll be able to pass complicated model formulas directly to the underlying functions without any parsing.
If you do this, then obtaining what you want is simple with some language hacking. Get the call to the function, then manipulate it to call the model-fitting function instead.
genericModel <- function(mod, formula, data, ...)
{
cl <- match.call(expand=TRUE)
cl[[1]] <- cl$mod
cl$mod <- NULL
eval(cl, parent.frame())
}
genericModel(lm, mpg ~ hp, data=mtcars, weights=gear)
genericModel(glm, Volume ~ Girth + Height, data=trees, family=Gamma(link=log))
Related
Please help me understand the re-producible example below.
I am trying to write a function glm_func() that would call glm(). It works perfectly fine outside of a function.
However, if I pass the linear model formula as an argument, the function glm_func() gives out a strange error:
Error in eval(extras, data, env) : object 'modeldata' not found
Can someone help me understand what went wrong?
# Fully reproducable example
# Specify data
aa = data.frame(y=1:100, x1=1:100, x2=rep(1, 100), z=runif(100))
lm_formula = as.formula('y ~ x1 + x2')
weight_var = 'z'
# GLM works as-is outside of a function
model1 = glm(formula = lm_formula, data = aa, weights = aa[[weight_var]])
# Why does this function not work?
glm_func <- function(modeldata, formula, weight){
thismodel=glm(
formula = formula, #<----- Does not work if formula is passed from argument
data = modeldata, weights = modeldata[[weight]])}
glm_func(modeldata=aa, formula=lm_formula, weight=weight_var)
# This function works
glm_func2 <- function(modeldata, weight){
thismodel=glm(
formula = y ~ x1 + x2, #<----- Works if formula is hardcoded
data = modeldata, weights = modeldata[[weight]])}
glm_func2(modeldata=aa, weight=weight_var)
From help("formula"):
A formula object has an associated environment, and this environment
(rather than the parent environment) is used by model.frame to
evaluate variables that are not found in the supplied data argument.
Formulas created with the ~ operator use the environment in which they
were created. Formulas created with as.formula will use the env
argument for their environment.
From this one would expect that you don't need to care about the environment if you use the data argument. Sadly that's not the case here because the weights are evaluated within the formula's environment (Thanks to useruser2554330 for pointing this out!).
So, you need to ensure that your function environment is associated with the formula:
glm_func <- function(modeldata, formula, weight){
environment(formula) <- environment()
glm(formula = formula, data = modeldata,
weights = modeldata[[weight]])
}
glm_func(modeldata=aa, formula=lm_formula, weight=weight_var)
#works
Personally, I'd do this instead:
glm_func <- function(modeldata, formula, weight){
environment(formula) <- environment()
eval(
bquote(
glm(formula = .(formula), data = modeldata,
weights = modeldata[[weight]])
)
)
}
This way, the actual formula is printed when you print the model object.
As #Roland commented that a formula object has an associated environment so instead of passing a formula object you can pass the variables and create the formula inside the function.
glm_func <- function(modeldata, resp, predictor, weight){
glm(formula = reformulate(predictor, resp),
data = modeldata, weights = modeldata[[weight]])
}
glm_func(modeldata=aa, 'y', c('x1', 'x2'), weight=weight_var)
I've encounter a weird behavior when calling lm within a lapply using the weights argument.
My code consist of a list of formula on which I run a linear model that I call in lapply. So far it was working:
dd <- data.frame(y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100),
x4 = rnorm(100),
wg = runif(100,1,100))
ls.form <- list(
formula(y~x1+x2),
formula(y~x3+x4),
formula(y~x1|x2|x3),
formula(y~x1+x2+x3+x4)
)
res.no.wg <- lapply(ls.form, lm, data = dd)
However, when I add the weights argument, I get a weird error:
res.with.wg <- lapply(ls.form, lm, data = dd, weights = dd[,"wg"])
Error in eval(extras, data, env) :
..2 used in an incorrect context, no ... to look in
It's like if the ... from lapply was conflicting with the ... of the lm call but only because of the weights argument.
Any idea was is the cause of this problem and how to fix it?
NOTE: using the call without the lapply works as expected:
lm(ls.form[[1]], data = dd, weights = dd[,"wg"] )
Call:
lm(formula = ls.form[[1]], data = dd, weights = dd[, "wg"])
Coefficients:
(Intercept) x1 x2
-0.12020 0.06049 -0.01937
EDIT The final call is a lapply within a function of the type:
f1 <- function(samp, dat, wgt){
res.with.wg2 <- lapply(ls.form, function(x) {lm(formula = x, data=dat[samp,], weights=dat[samp,wgt])})
}
f1(1:66, dat=dd, wgt = "wg")
I am not really sure why it is not working, but I do think I have a fix for you:
res.with.wg2 <- lapply(ls.form,
function(x) {lm(formula = x, data=dd, weights=dd[,"wg"])})
Hope this helps!
There is a note in the help file for lapply:
For historical reasons, the calls created by lapply are unevaluated,
and code has been written (e.g., bquote) that relies on this. This
means that the recorded call is always of the form FUN(X[[i]], ...),
with i replaced by the current (integer or double) index. This is not
normally a problem, but it can be if FUN uses sys.call or match.call
or if it is a primitive function that makes use of the call. This
means that it is often safer to call primitive functions with a
wrapper, so that e.g. lapply(ll, function(x) is.numeric(x)) is
required to ensure that method dispatch for is.numeric occurs
correctly.
lm uses match.call twice in its opening lines:
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
The solution noted in the help file and by #Florian is to use an anonymous function wrapper.
Update
For this specific problem of changing the model formula, you can rewrite to avoid calling lm within the lapply by using update instead:
# create base model (the formula here doesn't really matter, but we can specify the weights safely here)
baselm <- lm(y+x1,data=dd,weights=dd[,"wg"])
# update with lapply
lapply(ls.form,update,object=baselm)
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = dd, weights = dd[, "wg"])
Coefficients:
(Intercept) x1 x2
0.07561 0.16111 0.15014
...
The R code shown below is a minimal working example to reproduce an error that I can't say I understand. Running the script should yield the error, Error in eval(expr, envir, enclos) : could not find function "fitModel". After reading a thing or two on environments I think I understand why this is happening in this case, "fitModel" is not defined in the execution environment of "obscureFunction". This I fixed by making the following change to "myFormula":
myFormula <- "y ~ eval(fitModel(x, a), envir = environment(fitModel))"
I don't understand how "fitModel" can be evaluated in the environment of "fitModel" when the function can't be found in the calling environment of "obscureFunction", in other words I don't understand why this code change works. I also don't understand why the original code works fine if the body of "topFunction" is run without calling it, i.e. if we define "fitModel" and "obscureFunction" in R_GlobalEnv and call "obscureFunction" from the console.
## Minimum Working Example to reproduce error
rm(list = ls())
library(minpack.lm)
topFunction <- function(){
fitModel <- function(x, a){
exp(-a * x)
}
## Create a function to use with lapply()
obscureFunction <- function(){
x <- seq(-1, 1, 0.01)
y <- exp(-0.5 * x)
Data <- data.frame(x, y)
init <- c(a = 1)
myFormula <- "y ~ fitModel(x, a)"
myFormula <- as.formula(myFormula)
nlsOutput <- nlsLM(formula = myFormula, start = init, data = Data)
return(nlsOutput)
}
## Function call
obscureFunction()
## Other calculations done with fitModel()
}
topFunction()
Well, there are two issues here. The first is using a string for a formula. It's better to use
myFormula <- y ~ fitModel(x, a)
The reason is that formulas capture their environment, strings do not. (As noted by #BridieG, the as.formula() will capture the environment; I skipped over that line when reading the code. I still think it's better to create the formula directly.) Having a reference environment makes it easier to find functions used in a formula. So if you were using lm() rather than nlsLM, this would work with these two changes
# myFormula <- "y ~ fitModel(x, a)" ... becomes
myFormula <- y ~ fitModel(x, 1)
#nlsOutput <- nlsLM(formula = myFormula, start = init, data = Data) ...becomes
nlsOutput <- lm(formula = myFormula, data = Data)
This works with the formula syntax (unquoted var names) and not the string because the formula can capture the environment.
At least that's how it should work. Package authors are free to evaluate formulas how ever they want, and the authors of the nlsLM() function decided to ignore the environment assigned to the formula. They do so in this function inside nlsLM()
FCT <- function(par) {
mf[m] <- par
rhs <- eval(formula[[3L]], envir = mf)
res <- lhs - rhs
res <- .swts * res
res
}
So this is the second problem. Here they enforce the evaluation in the mf object which is a data.frame made up of the covariates of the data and the parameter estimates. Had it been written as
rhs <- eval(formula[[3L]], envir = mf, environment(formula))
it would have worked. This is basically what model.frame() does in lm() that allows this to work. We can make our own "corrected" version of the function with
# tested with minpack.lm_1.1-8
nlsLM2<-nlsLM
body(nlsLM2)[[27]][[3]][[3]][[3]]<-quote(rhs<-eval(formula[[3L]], envir = mf, environment(formula)))
And then make these substitutions
# myFormula <- "y ~ fitModel(x, a)" ... becomes
myFormula <- y ~ fitModel(x, a)
#nlsOutput <- nlsLM(formula = myFormula, start = init, data = Data) ...becomes
nlsOutput <- nlsLM2(formula = myFormula, start = init, data = Data)
it work work and return
Nonlinear regression model
model: y ~ fitModel(x, a)
data: Data
a
0.5
residual sum-of-squares: 0
Number of iterations to convergence: 5
Achieved convergence tolerance: 1.49e-08
So there's not really much you can say about how all R functions handle environments and scope. This behavior is unique to how the nlsLM() authors decided to evaluate their parameters.
I've done a fair amount of reading here on SO and learned that I should generally avoid manipulation of formula objects as strings, but I haven't quite found how to do this in a safe manner:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
# Arguments are unquoted and in the typical form for lm etc
# Do some plotting with lattice using formula & groups (works, not shown)
# Append 'groups' to 'formula':
# Change y ~ x as passed in argument 'formula' to
# y ~ x * gr where gr is the argument 'groups' with
# scoping so it will be understood by aov
new_formula <- y ~ x * gr
# Now do some anova (could do if formula were right)
model <- aov(formula = new_formula, data = data)
# And print the aov table on the plot (can do)
print(summary(model)) # this will do for testing
}
Perhaps the closest I came was to use reformulate but that only gives + on the RHS, not *. I want to use the function like this:
p <- tf(carat ~ color, groups = clarity, data = diamonds)
and have the aov results for carat ~ color * clarity. Thanks in Advance.
Solution
Here is a working version based on #Aaron's comment which demonstrates what's happening:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
print(deparse(substitute(groups)))
f <- paste(".~.*", deparse(substitute(groups)))
new_formula <- update.formula(formula, f)
print(new_formula)
model <- aov(formula = new_formula, data = data)
print(summary(model))
}
I think update.formula can solve your problem, but I've had trouble with update within function calls. It will work as I've coded it below, but note that I'm passing the column to group, not the variable name. You then add that column to the function dataset, then update works.
I also don't know if it's doing exactly what you want in the second equation, but take a look at the help file for update.formula and mess around with it a bit.
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.formula.html
tf <- function(formula,groups,d){
d$groups=groups
newForm = update(formula,~.*groups)
mod = lm(newForm,data=d)
}
dat = data.frame(carat=rnorm(10,0,1),color=rnorm(10,0,1),color2=rnorm(10,0,1),clarity=rnorm(10,0,1))
m = tf(carat~color,dat$clarity,d=dat)
m2 = tf(carat~color+color2,dat$clarity,d=dat)
tf2 <- function(formula, group, d) {
f <- paste(".~.*", deparse(substitute(group)))
newForm <- update.formula(formula, f)
lm(newForm, data=d)
}
mA = tf2(carat~color,clarity,d=dat)
m2A = tf2(carat~color+color2,clarity,d=dat)
EDIT:
As #Aaron pointed out, it's deparse and substitute that solve my problem: I've added tf2 as the better option to the code example so you can see how both work.
One technique I use when I have trouble with scoping and calling functions within functions is to pass the parameters as strings and then construct the call within the function from those strings. Here's what that would look like here.
tf <- function(formula, data, groups) {
f <- paste(".~.*", groups)
m <- eval(call("aov", update.formula(as.formula(formula), f), data = as.name(data)))
summary(m)
}
tf("mpg~vs", "mtcars", "am")
See this answer to one of my previous questions for another example of this: https://stackoverflow.com/a/7668846/210673.
Also see this answer to the sister question of this one, where I suggest something similar for use with xyplot: https://stackoverflow.com/a/14858661/210673
I have a large set of model specifications to test, which share a dv but have unique IVs. In the following example
foo <- data.frame(dv = sample(c(0,1), 100, replace=T),
x1 = runif(100),
x2 = runif(100))
I want the first model to only include x1, the second x2, the third both, and the fourth their interaction. So I thought a sensible way would be to build a list of formula statements:
bar <- list("x1",
"x2",
"x1+x2",
"x1*x2")
which I would then use in a llply call from the plyr package to obtain a list of model objects.
require(plyr)
res <- llply(bar, function(i) glm(dv ~ i, data = foo, family = binomial()))
Unfortunately I'm told
Error in model.frame.default(formula = dv ~ i, data = foo, drop.unused.levels = TRUE):variable lengths differ (found for 'i')
Obviously I'm mixing up something fundamental--do I need to manipulate the original foo list in some fashion?
Your problem is with how you are specifying the formula, since inside the function i is a variable. This would work:
glm(paste("dv ~", i), data = foo, family = binomial())
The problem is that dv ~ i isn't a formula. i is (inside the anonymous function) simply a symbol that represents a variable containing a character value.
Try this:
bar <- list("dv~x1",
"dv~x2",
"dv~x1+x2",
"dv~x1*x2")
res <- llply(bar, function(i) glm(i, data = foo, family = binomial()))
But setting statistical issues aside, it might possibly be easier to use something like ?step or ?stepAIC in the MASS package for tasks similar to this?