passing a column name to function in R [duplicate] - r

I'm trying to understand why
foo = function(d,y,x) {
fit = with(d, lm(y ~ x))
}
foo(myData, Y, X)
won't work, where for instance
myData = data.frame(Y=rnorm(50), X=runif(50))
The bit that seems tricky to me is passing the arguments x and y to a formula, as in lm(y ~ x).

#DMT's answer explains what's going on nicely.
Here are the hoops to jump through if you want things to work as you expect:
lmwrap <- function(d,y,x) {
ys <- deparse(substitute(y))
xs <- deparse(substitute(x))
f <- reformulate(xs,response=ys)
return(lm(f,data=d))
}
mydata <- data.frame(X=1:10,Y=rnorm(10))
lmwrap(mydata,Y, X)
Or it can be simplified a bit if you pass the column names as strings rather than symbols.
lmwrap <- function(d,y,x) {
f <- reformulate(xs, response=ys)
return(lm(f, data=d))
}
lmwrap(mydata, "Y", "X")
This approach will be a bit fragile, e.g. if you pass arguments through another function. Also, getting the "Call" part of the formula to read Y~X takes more trickery ...

Y and X are your column names, not variables. They wouldn't in this case, be arguments to your function unless you passed them in as strings and essentially call
lm(mydata[,"Y"]~ mydata[,"X"])
If you were to run ls() on your console, Y and X would most likely not be there, so the function won't work. Print both x and y prior to the fit = call, and you'll likely see NULLs, which won't fly in lm.
One way to do this in your form is the following
lmwrap<-function(df, yname, xname){
fit=lm(d[,yname] ~ d[,xname])
}
lmwrap(mydata,"Y", "X")
But you could just make the lm call like regular

Related

Inner functions pulls call from outer function and causes error

I'm using a function from the library leaps within another function. The last two rows of the leaps function in question goes:
rval$call <- sys.call(sys.parent())
rval
This apparently causes the call to the outer function to be passed to rval$call. And the actual call to the regsubsets function is needed as an argument later on.
Below an example to illustrate:
library(leaps)
#Create some sample data to perform a regression on
inda <- rnorm(100)
indb <- rnorm(100)
dep <- 2 + 0.1*inda + 0.2*indb + rnorm(100, sd = 0.3)
dfk <- data.frame(dep=dep, inda = inda, indb = indb)
#Create some arbitrary outer function
test <- function(dependent, data){
best.fit <- regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)
return(best.fit)
}
#Call outer function
best <- test("dep", dfk)
best$call #Returns "test("dep", dfk)"
So best$call will contain the call to the outer function (test), and not the call to the inner (regsubsets) function. As it's not really an option to change the inner function, is there any way of avoiding this problem?
EDIT:
One way around the problem could be something like this:
test <- function(dependent, data){
thecall <- 'regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)'
best.fit <- eval(parse(text = thecall))
#best.fit$call <- [some transformation of thecall
return(best.fit)
}
EDIT2:
The reason I need to access what's inside $call is that it's needed in a predict function that I copied from Introduction to statitical learning:
predict.regsubsets <- function(regsubset_model, newdata, id, ...){
form <- as.formula(regsubset_model$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(regsubset_model, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
In the second line it uses $call
I’m still not entirely clear on how this is going to be used but in the case of your test function, you could write the following code:
test = function (dependent, data) {
regsubsets_call = bquote(regsubsets(.(as.formula(paste0(dependent, " ~ ."))),
data = .(substitute(data)), nvmax = 2))
best_fit = eval(regsubsets_call)
best_fit$call = regsubsets_call
best_fit
}
However, the result may not work with downstream functions the package provides (though, realistically, it probably will; I’m guessing summary.regsubsets only uses it to print the call).
What’s going on here?
bquote constructs an unevaluated R expression; it’s similar to quote but it allows you to interpolate values (similar to substitute). substitute(data) means that, rather than putting the actual data.frame into the call (which would lead to a very unwieldy output, it puts the variable name (or expression) the user passed to test. So if the user called it as test('mpg', mtcars), then the resulting expression would be
regsubsets(mpg ~ ., data = mtcars, nvmax = 2)
The resulting call object is then (a) evaluated via eval, and (b) stored in the resulting $call.
Incidentally, the formula can (and, as far as I’m concerned, should) be constructed in the same way; no need to parse a string:
as.formula(bquote(.(as.name(dependent)) ~ .))
Taken together, the whole expression would then become:
formula = as.formula(bquote(.(as.name(dependent)) ~ .))
regsubsets_call = bquote(regsubsets(.(formula), data = .(substitute(data)), nvmax = 2))

R Programming: Evaluating an expression when objects exist in multiple environments

Short Version
An expression with two variables, x and y, where x is contained in environment 1
and y is contained in a second environment. How does the programmer evaluate
the expression?
Detailed Version
I have a function that takes a formula and data.frame as arguments. On the
the right hand side of the formula is a call to the function splines::bs to
generate a B-spline basis. The workhorse function does a few things, one of
which requires extracting the bs call from the formula and evaluating it. The
problem I am trying to solve involves evaluating the bs call when argument
values are contained in different environments.
Here are the functions needed to recreate the issue I am working on
library(splines)
extract_bmat <- function(form) {
B <- NULL
rr <- function(x) {
if (is.call(x) && grepl("bs", deparse(x[[1]]))) {
B <<- x
} else if (is.recursive(x)) {
as.call(lapply(as.list(x), rr))
} else {
x
}
}
z <- lapply(as.list(form), rr)
B
}
some_workhorse <- function(formula, data) {
# ... lots of cool stuff ...
# fit <- lm(formula, data)
bmat <- eval(extract_bmat(formula), data)
bmat
}
# The following works when evaluated in the .GlobalEnv
# The eval(extract_bmat(formula), data) call within the some_workhorse
# function works without errors
xi <- c(3, 4.5)
eg_data <- data.frame(x = 1:10, y = sin(1:10))
some_workhorse(y ~ bs(x, knots = xi), data = eg_data)
Now, if the function some_workhorse and the xi vector and eg_data
data.frame are generated within a function environment causes an error.
foo <- function() {
xi_in_foo <- c(2, 3)
eg_data_in_foo <- data.frame(x = 1:10, y = sin(1:10))
some_workhorse(y ~ bs(x, knots = xi_in_foo), data = eg_data_in_foo)
}
foo()
# Error in sort(c(rep(Boundary.knots, ord), knots)) :
# object 'xi_in_foo' not found
The location of the error is within the splines::bs call, but that is not the
important part; xi_in_foo not found is the important issue to address.
I know the issue is related to my poor handling of environments in R. My
primary question is
How should the call eval(extract_bmat(formula), data) within the
some_workhorse function be written so that it works correctly when called in
the .GlobalEnv or when called within a function environment?
Secondary question:
Within the extract_bmat function, I would prefer to define an environment
for B and use assign instead of <<-. I suspect that <<- is the best
option because of the uncertainty in the levels of recursion taking place.
That said, I would like to see other solutions.
Thanks for the help.
You should define your function as
some_workhorse <- function(formula, data) {
# ... lots of cool stuff ...
# fit <- lm(formula, data)
bmat <- eval(extract_bmat(formula), data, environment(formula))
bmat
}
Note that formulas in R capture the environment in which they were created. As long as xi_in_foo exists in the environment where the formula was defined, this should work. Variables will first be looked up in the data list/data.frame and then the formula environment would be used as the enclosing environment. If you weren't using formula,s sometimes people use parent.frame() as the enclos= parameter so that variables are looked for in the environment in which the function was called, rather than were the function was defined as is the default with R's lexical scoping.

User defined functions as formula input

Built-in functions in R can be used in formula objects, for example
reg1 = lm(y ~ log(x), data = data1)
How can I write my functions such that they can be used in formula objects?
fnMyFun = function(x) {
return(x^2)
}
reg2 = lm(y ~ fnMyFun(x), data = data1)
What you've got certainly works. One problem is that different modelling functions handle formulas in different ways. I think that as long as you return something that model.matrix can make sense of, you'll be fine. That would mean
The function is vectorised; ie given a vector of length N, it returns a result also of length N
It has to return an atomic vector or matrix (but not a list, or of type raw)

Proper method to append to a formula where both formula and stuff to be appended are arguments

I've done a fair amount of reading here on SO and learned that I should generally avoid manipulation of formula objects as strings, but I haven't quite found how to do this in a safe manner:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
# Arguments are unquoted and in the typical form for lm etc
# Do some plotting with lattice using formula & groups (works, not shown)
# Append 'groups' to 'formula':
# Change y ~ x as passed in argument 'formula' to
# y ~ x * gr where gr is the argument 'groups' with
# scoping so it will be understood by aov
new_formula <- y ~ x * gr
# Now do some anova (could do if formula were right)
model <- aov(formula = new_formula, data = data)
# And print the aov table on the plot (can do)
print(summary(model)) # this will do for testing
}
Perhaps the closest I came was to use reformulate but that only gives + on the RHS, not *. I want to use the function like this:
p <- tf(carat ~ color, groups = clarity, data = diamonds)
and have the aov results for carat ~ color * clarity. Thanks in Advance.
Solution
Here is a working version based on #Aaron's comment which demonstrates what's happening:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
print(deparse(substitute(groups)))
f <- paste(".~.*", deparse(substitute(groups)))
new_formula <- update.formula(formula, f)
print(new_formula)
model <- aov(formula = new_formula, data = data)
print(summary(model))
}
I think update.formula can solve your problem, but I've had trouble with update within function calls. It will work as I've coded it below, but note that I'm passing the column to group, not the variable name. You then add that column to the function dataset, then update works.
I also don't know if it's doing exactly what you want in the second equation, but take a look at the help file for update.formula and mess around with it a bit.
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.formula.html
tf <- function(formula,groups,d){
d$groups=groups
newForm = update(formula,~.*groups)
mod = lm(newForm,data=d)
}
dat = data.frame(carat=rnorm(10,0,1),color=rnorm(10,0,1),color2=rnorm(10,0,1),clarity=rnorm(10,0,1))
m = tf(carat~color,dat$clarity,d=dat)
m2 = tf(carat~color+color2,dat$clarity,d=dat)
tf2 <- function(formula, group, d) {
f <- paste(".~.*", deparse(substitute(group)))
newForm <- update.formula(formula, f)
lm(newForm, data=d)
}
mA = tf2(carat~color,clarity,d=dat)
m2A = tf2(carat~color+color2,clarity,d=dat)
EDIT:
As #Aaron pointed out, it's deparse and substitute that solve my problem: I've added tf2 as the better option to the code example so you can see how both work.
One technique I use when I have trouble with scoping and calling functions within functions is to pass the parameters as strings and then construct the call within the function from those strings. Here's what that would look like here.
tf <- function(formula, data, groups) {
f <- paste(".~.*", groups)
m <- eval(call("aov", update.formula(as.formula(formula), f), data = as.name(data)))
summary(m)
}
tf("mpg~vs", "mtcars", "am")
See this answer to one of my previous questions for another example of this: https://stackoverflow.com/a/7668846/210673.
Also see this answer to the sister question of this one, where I suggest something similar for use with xyplot: https://stackoverflow.com/a/14858661/210673

Using function arguments in update.formula

I am writing a function that takes two variables and separately regresses each of them on a set of controls expressed as a one-sided formula. Right now I'm using the following to make the formula for one of the regressions, but it feels a bit hacked-up:
foo <- function(x, y, controls) {
cl <- match.call()
xn <- cl[["x"]]
xf <- as.formula(paste(xn, deparse(controls)))
}
I'd prefer to do this using update.formula(), but of course update.formula(controls, x ~ .) and update.formula(controls, as.name(x) ~ .) don't work. What should I be doing?
Here's one approach:
right <- ~ a + b + c
left <- ~ y
left_2 <- substitute(left ~ ., list(left = left[[2]]))
update(right, left_2)
But I think you'll have to either paste text strings together, or use substitute. To the best of my knowledge, there are no functions to create one two sided formula from two one-sided formulas (or similar equivalents).
I am not sure about update.formula(), but I have used the approach you take here of pasting text and converting it via as.formula in the past with success. My reading of help(update.formula) does not make me think you can substitute the left-hand side as you desire.
Lastly, trust the dispatching mechanism. If you object is of type formula, just call update which is preferred over the explicit update.formula.

Resources