variable scope & resolution in R function - r

I want to loop through the vars in a dataframe, calling lm() on each one, and so I wrote this:
findvars <- function(x = samsungData, dv = 'activity', id = 'subject') {
# Loops through the possible predictor vars, does an lm() predicting the dv
# from each, and returns a data.frame of coefficients, one row per IV.
r <- data.frame()
# All varnames apart from the dependent var, and the case identifier
ivs <- setdiff(names(x), c(dv, id))
for (iv in ivs) {
print(paste("trying", iv))
m <- lm(dv ~ iv, data = x, na.rm = TRUE)
# Take the absolute value of the coefficient, then transpose.
c <- t(as.data.frame(sapply(m$coefficients, abs)))
c$iv <- iv # which IV produced this row?
r <- c(r, c)
}
return(r)
}
This doesn't work, I believe b/c the formula in the lm() call consists of function-local variables that hold strings naming vars in the passed-in dataframe (e.g., "my_dependant_var" and "this_iv") as opposed to pointers to the actual variable objects.
I tried wrapping that formula in eval(parse(text = )), but could not get that to work.
If I'm right about the problem, can someone explain to me how to get R to resolve the contents of those vars iv & dv into the pointers I need? Or if I'm wrong, can someone explain what else is going on?
Many thanks!
Here is some repro code:
library(datasets)
data(USJudgeRatings)
findvars(x = USJudgeRatings, dv = 'CONT', id = 'DILG')

So there's enough bad stuff happening in your function besides your trouble with the formula, that I think someone should walk you through it all. Here are some annotations, followed by a better version:
#For small examples, "growing" objects isn't a huge deal,
# but you will regret it very, very quickly. It's a bad
# habit. Learn to ditch it now. So don't inititalize
# empty lists and data frames.
r <- data.frame()
ivs <- setdiff(names(x), c(dv, id))
for (iv in ivs) {
print(paste("trying", iv))
#There is no na.rm argument to lm, only na.action
m <- lm(dv ~ iv, data = x, na.rm = TRUE)
#Best not to name variables c, its a common function, see two lines from now!
# Also, use the coef() extractor functions, not $. That way, if/when
# authors change the object structure your code won't break.
#Finally, abs is vectorized, no need for sapply
c <- t(as.data.frame(sapply(m$coefficients, abs)))
#This is probably best stored in the name
c$iv <- iv # which IV produced this row?
#Growing objects == bad! Also, are you sure you know what happens when
# you concatenate two data frames?
r <- c(r, c)
}
return(r)
}
Try something like this instead:
findvars <- function(x,dv,id){
ivs <- setdiff(names(x),c(dv,id))
#initialize result list of the appropriate length
result <- setNames(vector("list",length(ivs)),ivs)
for (i in seq_along(ivs)){
result[[i]] <- abs(coef(lm(paste(dv,ivs[i],sep = "~"),data = x,na.action = na.omit)))
}
result
}

Related

Looping over objects in R

I am trying to loop over objects in R.
myfunc.linear.pred <- function(x){
linear.pred <- predict(object = x)
w <- exp(linear.pred)/(1+exp(linear.pred))
as.vector(w)
}
The function here works perfectly as it should. It returns a vector of 48 rows and it comes from the object x. Now 'x' is nothing but the full regression model from a GLM function (think: mod.fit <- glm (dep~indep, data = data)). The problem is that I have 20 different such ('mod.fit') objects and need to find predictions for each of these. I could literally repeat the code, but I was looking to find a neater solution. So what I want is a matrix with 48 rows and 20 columns for the above function. This is probably basic for an advanced user, but I have only ever used "apply" and "for" loops for numbers and never objects. I looked into lapply but couldn't figure it out.
I tried: (and this is probably dumb)
allmodels <- c(mod.fit, mod.fit2, mod.fit3)
lpred.matrix <- matrix(data=NA, nrow=48, ncol=20)
for(i in allmodels){
lpred.matrix[i,] <- myfunc.linear.pred(i)
}
which obviously won't work because allmodels has a class of "list" and it contains all the stuff from the GLM function. Hope someone can help. Thanks!
In order to use lapply, you must have a list object not a vector object. Something like this should work:
## Load data
data("mtcars")
# fit models
mod.fit1 <- glm (mpg~disp, data = mtcars)
mod.fit2 <- glm (mpg~drat, data = mtcars)
mod.fit3 <- glm (mpg~wt, data = mtcars)
# build function
myfunc.linear.pred <- function(x){
linear.pred <- predict(object = x)
w <- exp(linear.pred)/(1+exp(linear.pred))
as.vector(w)
}
# put models in a list
allmodels <- list("mod1" = mod.fit1, "mod2" = mod.fit2, "mod2" =
mod.fit3)
# use lapply and do.call to generate matrix of prediction results
df <- do.call('cbind', lapply(allmodels, function(x){
a <- myfunc.linear.pred(x)
}))
Hope this helps

Inner functions pulls call from outer function and causes error

I'm using a function from the library leaps within another function. The last two rows of the leaps function in question goes:
rval$call <- sys.call(sys.parent())
rval
This apparently causes the call to the outer function to be passed to rval$call. And the actual call to the regsubsets function is needed as an argument later on.
Below an example to illustrate:
library(leaps)
#Create some sample data to perform a regression on
inda <- rnorm(100)
indb <- rnorm(100)
dep <- 2 + 0.1*inda + 0.2*indb + rnorm(100, sd = 0.3)
dfk <- data.frame(dep=dep, inda = inda, indb = indb)
#Create some arbitrary outer function
test <- function(dependent, data){
best.fit <- regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)
return(best.fit)
}
#Call outer function
best <- test("dep", dfk)
best$call #Returns "test("dep", dfk)"
So best$call will contain the call to the outer function (test), and not the call to the inner (regsubsets) function. As it's not really an option to change the inner function, is there any way of avoiding this problem?
EDIT:
One way around the problem could be something like this:
test <- function(dependent, data){
thecall <- 'regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)'
best.fit <- eval(parse(text = thecall))
#best.fit$call <- [some transformation of thecall
return(best.fit)
}
EDIT2:
The reason I need to access what's inside $call is that it's needed in a predict function that I copied from Introduction to statitical learning:
predict.regsubsets <- function(regsubset_model, newdata, id, ...){
form <- as.formula(regsubset_model$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(regsubset_model, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
In the second line it uses $call
I’m still not entirely clear on how this is going to be used but in the case of your test function, you could write the following code:
test = function (dependent, data) {
regsubsets_call = bquote(regsubsets(.(as.formula(paste0(dependent, " ~ ."))),
data = .(substitute(data)), nvmax = 2))
best_fit = eval(regsubsets_call)
best_fit$call = regsubsets_call
best_fit
}
However, the result may not work with downstream functions the package provides (though, realistically, it probably will; I’m guessing summary.regsubsets only uses it to print the call).
What’s going on here?
bquote constructs an unevaluated R expression; it’s similar to quote but it allows you to interpolate values (similar to substitute). substitute(data) means that, rather than putting the actual data.frame into the call (which would lead to a very unwieldy output, it puts the variable name (or expression) the user passed to test. So if the user called it as test('mpg', mtcars), then the resulting expression would be
regsubsets(mpg ~ ., data = mtcars, nvmax = 2)
The resulting call object is then (a) evaluated via eval, and (b) stored in the resulting $call.
Incidentally, the formula can (and, as far as I’m concerned, should) be constructed in the same way; no need to parse a string:
as.formula(bquote(.(as.name(dependent)) ~ .))
Taken together, the whole expression would then become:
formula = as.formula(bquote(.(as.name(dependent)) ~ .))
regsubsets_call = bquote(regsubsets(.(formula), data = .(substitute(data)), nvmax = 2))

Deduplicate a list of lm objects in R

I have a list of lm models objects with possible repeated, so I'd like to find a way of checking if some of these lm objects are equal, if so them delete it. In words, I want to "deduplicate" my list.
I'd appreciate very much any help.
An example of the problem:
## Creates outcome and predictors
outcome <- c(names(mtcars)[1:3])
predictors <- c(names(mtcars)[4:11])
dataset <- mtcars
## Creates model list
model_list <- lapply(seq_along((predictors)), function(n) {
left_hand_side <- outcome[1]
right_hand_side <- apply(X = combn(predictors, n), MARGIN = 2, paste, collapse = " + ")
paste(left_hand_side, right_hand_side, sep = " ~ ")
})
## Convert model list into a verctor
model_vector <- unlist(model_list)
## Fit linear models to all itens from the vector of models
list_of_fit <- lapply(model_vector, function(x) {
formula <- as.formula(x)
fit <- step(lm(formula, data = dataset))
fit
})
# Exclude possible missing
list_of_fit <- Filter(Negate(function(x) is.null(unlist(x))), list_of_fit)
# These models are the same in my list
lm253 <- list_of_fit[[253]];lm253
lm254 <- list_of_fit[[254]];lm254
lm255 <- list_of_fit[[255]];lm255
I want to exclude duplicated entries in list_of_fit.
It seems wasteful to fit so many models and then throw away most of them. Your object names make your code hard to read for me, but it seems your models can be distinguished based on their formula. Maybe this helps:
lista_de_ajustes[!duplicated(vapply(lista_de_ajustes,
function(m) deparse(m$call),
FUN.VALUE = "a"))]
I made a simple correction in you code Roland, so it worked for me.
I changed from deparse(m$call) to deparse(formula(m)), due this I'm able to compare the complete formulas.
lista_de_ajustes[!duplicated(vapply(lista_de_ajustes, function(m) deparse(formula(m)), FUN.VALUE = "a"))]
Thank you very much!

Proper method to append to a formula where both formula and stuff to be appended are arguments

I've done a fair amount of reading here on SO and learned that I should generally avoid manipulation of formula objects as strings, but I haven't quite found how to do this in a safe manner:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
# Arguments are unquoted and in the typical form for lm etc
# Do some plotting with lattice using formula & groups (works, not shown)
# Append 'groups' to 'formula':
# Change y ~ x as passed in argument 'formula' to
# y ~ x * gr where gr is the argument 'groups' with
# scoping so it will be understood by aov
new_formula <- y ~ x * gr
# Now do some anova (could do if formula were right)
model <- aov(formula = new_formula, data = data)
# And print the aov table on the plot (can do)
print(summary(model)) # this will do for testing
}
Perhaps the closest I came was to use reformulate but that only gives + on the RHS, not *. I want to use the function like this:
p <- tf(carat ~ color, groups = clarity, data = diamonds)
and have the aov results for carat ~ color * clarity. Thanks in Advance.
Solution
Here is a working version based on #Aaron's comment which demonstrates what's happening:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
print(deparse(substitute(groups)))
f <- paste(".~.*", deparse(substitute(groups)))
new_formula <- update.formula(formula, f)
print(new_formula)
model <- aov(formula = new_formula, data = data)
print(summary(model))
}
I think update.formula can solve your problem, but I've had trouble with update within function calls. It will work as I've coded it below, but note that I'm passing the column to group, not the variable name. You then add that column to the function dataset, then update works.
I also don't know if it's doing exactly what you want in the second equation, but take a look at the help file for update.formula and mess around with it a bit.
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.formula.html
tf <- function(formula,groups,d){
d$groups=groups
newForm = update(formula,~.*groups)
mod = lm(newForm,data=d)
}
dat = data.frame(carat=rnorm(10,0,1),color=rnorm(10,0,1),color2=rnorm(10,0,1),clarity=rnorm(10,0,1))
m = tf(carat~color,dat$clarity,d=dat)
m2 = tf(carat~color+color2,dat$clarity,d=dat)
tf2 <- function(formula, group, d) {
f <- paste(".~.*", deparse(substitute(group)))
newForm <- update.formula(formula, f)
lm(newForm, data=d)
}
mA = tf2(carat~color,clarity,d=dat)
m2A = tf2(carat~color+color2,clarity,d=dat)
EDIT:
As #Aaron pointed out, it's deparse and substitute that solve my problem: I've added tf2 as the better option to the code example so you can see how both work.
One technique I use when I have trouble with scoping and calling functions within functions is to pass the parameters as strings and then construct the call within the function from those strings. Here's what that would look like here.
tf <- function(formula, data, groups) {
f <- paste(".~.*", groups)
m <- eval(call("aov", update.formula(as.formula(formula), f), data = as.name(data)))
summary(m)
}
tf("mpg~vs", "mtcars", "am")
See this answer to one of my previous questions for another example of this: https://stackoverflow.com/a/7668846/210673.
Also see this answer to the sister question of this one, where I suggest something similar for use with xyplot: https://stackoverflow.com/a/14858661/210673

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is, that if I define the function parameters (linmod,k,direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
form <- formula(linmod$call)
# generate indices for cross validation
rar <- n/k
xval.idx <- list()
s <- sample(1:n, n) # permutation of 1:n
for (i in 1:k) {
xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
}
#error calculation
errors <- R2 <- 0
for (j in 1:k){
datas.test <- datas[xval.idx[[j]],]
datas.train <- datas[-xval.idx[[j]],]
test.idx <- xval.idx[[j]]
#THE MODELS+
lm.1 <- lm(form,data= datas.train)
lm.step <- step(lm.1,direction=direction,trace=0)
step.pred <- predict(lm.step,newdata= datas.test)
step.error <- sum((step.pred-response[test.idx])^2)
errors[j] <- step.error/length(response[test.idx])
SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
R2[j] <- 1 - step.error/SS.tot
}
CVerror <- sum(errors)/k
CV.R2 <- sum(R2)/k
res <- list()
res$CV.error <- CVerror
res$CV.R2 <- CV.R2
return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your formula, lm.cars, in was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
.env <- environment() ## identify the environment of cv.step
## extract the formula in the environment of cv.step
form <- as.formula(linmod$call, env = .env)
## The rest of your function follows
Another problem that can cause this is if one passes a character (string vector) to lm instead of a formula. vectors have no environment, and so when lm converts the character to a formula, it apparently also has no environment instead of being automatically assigned the local environment. If one then uses an object as weights that is not in the data argument data.frame, but is in the local function argument, one gets a not found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
#the formula to use
f = "x ~ y"
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
#the formula to use
f = as.formula("x ~ y")
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one used the name weights for the weights variable. In this case, since lm can't find the weights object, it defaults to the function weights() from base thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.

Resources