Calling mlogit() from inside another function, scoping problem with variables when using attach

I need to call the mlogit() R function from inside another function.
This is a function for demonstrative purposes:
#-------------------------
# DEMO FUNCTION
#-------------------------
# f = formula (string)
# fData = data.frame
# cVar = choice variable (string)
# optVar = alternative variable (string)
##########################
mlogitSum <- function(f, fData, cVar="choice", optVar="option"){
  library(mlogit)
  r2 <- mlogit(as.formula(f), shape = "long", data = fData, alt.var=optVar, choice = cVar)
  return(summary(r2))
}
Apparently there is an environment problem, such that variables that are not declared globally are not found by mlogit() when its arguments are evaluated.
This example doesn't work:
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/mlogit.csv"))
attach(mydata)
library(mlogit)
mydata$brand<-as.factor(mydata$brand)
mlData<-mlogit.data(mydata, varying=NULL, choice="brand", shape="wide")
myFormula <-"brand~1|female+age"
var1 <- "brand"
var2 <- "alt"
mlogitSum(myFormula, fData = mlData, var1, var2)
It works, however, if the variables are assigned in the global environment under the same names as the function's arguments:
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/mlogit.csv"))
attach(mydata)
library(mlogit)
mydata$brand<-as.factor(mydata$brand)
fData<-mlogit.data(mydata, varying=NULL, choice="brand", shape="wide")
myFormula <-"brand~1|female+age"
cVar <- "brand"
optVar <- "alt"
mlogitSum(myFormula, fData, cVar, optVar)
Alternatively, it works if I assign the variables globally from inside the function:
#-------------------------
# DEMO FUNCTION
#-------------------------
# f = formula (string)
# fData = data.frame
# cVar = choice variable (string)
# optVar = alternative variable (string)
##########################
mlogitSum_rev <- function(f, fData, cVar="choice", optVar="option"){
  fData <<- fData
  cVar <<- cVar
  optVar <<- optVar
  #return(head(lcmData))
  library(mlogit)
  # I need this later to extract model.matrix(r2); otherwise it would be redundant
  r2 <- mlogit(as.formula(f), shape = "long", data = fData, alt.var=optVar, choice = cVar)
  return(summary(r2))
}
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/mlogit.csv"))
attach(mydata)
library(mlogit)
mydata$brand<-as.factor(mydata$brand)
mlData<-mlogit.data(mydata, varying=NULL, choice="brand", shape="wide")
myFormula <-"brand~1|female+age"
var1 <- "brand"
var2 <- "alt"
mlogitSum_rev(myFormula, mlData, var1, var2)
Any idea how to avoid assigning the variables globally?

tl;dr this appears to be a bug in mlogit, which you can fix yourself (see below) or ask the maintainer to fix.
Deep inside mlogit, the function tries to evaluate the data as follows:
nframe <- length(sys.calls()) ## line 11
...
data <- eval(mldata, sys.frame(which = nframe)) ## line 44
This is moderately sophisticated messing about with R's scoping structures -- it's trying to evaluate mldata in the frame one above the current frame, and it will fail if someone does something tricky (but perfectly reasonable!) like call mlogit from within a function.
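To see concretely why nesting matters, here is a minimal sketch (deliberately not mlogit's actual code) contrasting the two evaluation strategies; bad_eval and good_eval are just illustrative names:
bad_eval <- function(data) {
  mld <- match.call()$data            # the unevaluated argument, e.g. the symbol fData
  nframe <- length(sys.calls())
  # evaluated in a frame picked by its position on the call stack; that frame's
  # enclosure is the global environment, so fData is only found if it exists globally
  eval(mld, sys.frame(which = nframe))
}
good_eval <- function(data) {
  mld <- match.call()$data
  eval(mld, parent.frame())           # the caller's frame, where fData actually lives
}
wrapper_bad  <- function(fData) bad_eval(fData)
wrapper_good <- function(fData) good_eval(fData)
df <- data.frame(x = 1:3)
bad_eval(df)          # works when called from the top level
wrapper_good(df)      # works when nested
try(wrapper_bad(df))  # Error: object 'fData' not found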
I solved the problem (sort of!) by running fix(mlogit), which will dump you into an editor and allow you to modify the function. I changed line 44 to
data <- eval(mldata, parent.frame())
after which the code seemed to work.
If this works for you, you can either (1) fix() mlogit every time you need to use it; (2) download a copy of the source (.tar.gz) package, modify it, and install it; or (3) [preferably!] contact the package maintainer, let them know about the issue, and ask them to release a patched version ...
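Another workaround worth trying if you'd rather not modify the package at all (I haven't verified it against every mlogit version, so treat this as a sketch): build the call with do.call(), so that the data object itself, rather than the local symbol fData, ends up in the call that mlogit later re-evaluates. The same trick is used for nlme further down this page.
mlogitSum2 <- function(f, fData, cVar = "choice", optVar = "option") {
  library(mlogit)
  # do.call() evaluates the argument list first, so the stored call carries the
  # actual data.frame instead of a symbol that must be looked up again later
  r2 <- do.call(mlogit, list(formula = as.formula(f), shape = "long",
                             data = fData, alt.var = optVar, choice = cVar))
  summary(r2)
}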
PS: depending on your general data analysis protocol, you may want to get out of the habit of using attach; see "Why is it not advisable to use attach() in R, and what should I use instead?"

Related

What does "invalid type (closure) for variable 'variable1'" mean and how do I fix it?

I am trying to write a function in R which calls a function from another package. The code works perfectly outside a function.
I am guessing it might have something to do with the package I am using (survey).
A self-contained code example:
#activating the packages
library(survey)
library(foreign) # read.spss() comes from the foreign package
#getting the dataset into R
tm <- read.spss("tm.sav", to.data.frame = T, max.value.labels = 5)
# creating svydesign object (it basically contains the weights to adjust the variables (~persgew: also a column variable contained in the tm-dataset))
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#getting overview of the welle-variable
#this variable is part of the tm-dataset. it is needed to execute the following steps
table(tm$welle)
# data manipulation as in: taking the v12d_gr-variable as well as the welle-variable and the svydesign-object to create a longitudinal variable which is transformed into a data frame that can be passed to ggplot
t <- svytable(~v12d_gr+welle, tm_w)
tt <- round(prop.table(t,2)*100, digits=0)
v12d <- tt[2,]
v12d <- as.data.frame(v12d)
This is the code outside the function, and it works perfectly. Since I have to transform quite a few variables in exactly the same way, I want to write a function to save some time.
The following function is supposed to take the variable to be transformed (v12sd2_gr) as an argument.
#making sure the survey-object is loaded
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#trying to write a function containing the code from above
ltd_zsw <- function(variable1){
  t <- svytable(~variable1+welle, tm_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  var_ltd_zsw <- tt[2,]
  var_ltd_zsw <- as.data.frame(var_ltd_zsw)
  return(var_ltd_zsw)
}
Calling the function:
#as v12d has been altered already, I am trying to transform another variable v12sd2_gr
v12sd2 <- ltd_zsw(v12sd2_gr)
Console output:
Error in model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design)) :
invalid type (closure) for variable 'variable1'
Called from: model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design))
How do I fix it? And what does it mean to dynamically build a formula, e.g. with reformulate()?
PS: I hope this is the appropriate way to respond to the feedback in the comments.
Update: I think I was able to trace the problem back to the argument I am passing (variable1), and I am guessing it has something to do with the fact that I try to use it in a formula within the function. But when I try wrapping the call as as.formula(svytable(~variable1+welle, tm_w)) it still doesn't work.
What to do?
I have found a solution to the problem.
Here is the tested and working function:
ltd_test <- function (var, x, string1="con", string2="pro") {
  print(table(var))
  x$w12d_gr <- ifelse(as.numeric(var)>2,1,0)
  x$w12d_gr <- factor(x$w12d_gr, levels = c(0,1), labels = c(string1,string2))
  print(table(x$w12d_gr))
  x_w <- svydesign(ids=~0, weights = ~persgew, data = x)
  t <- svytable(~w12d_gr+welle, x_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  w12d <- tt[2,]
  w12d <- as.data.frame(w12d)
}
The problem appeared to be caused by the svydesign() function. Its output is an object that is then used by the formula for the svytable() function. That's why it is imperative to first create the x_w object with svydesign() and only then use svytable() to create the t object.
In the code snippet I originally posted in the question, the tm_w object had been created and stored globally.
Thanks for the help to everyone. I hope this is gonna be of use to someone one day!
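For completeness, the "dynamically build a formula" route asked about above also fixes the original closure error directly: pass the variable name as a string and let reformulate() build the formula. A sketch reusing tm_w and welle from the question (ltd_zsw2 is just an illustrative name):
ltd_zsw2 <- function(varname, design = tm_w) {
  f <- reformulate(c(varname, "welle"))   # builds ~ <varname> + welle
  t <- svytable(f, design)
  tt <- round(prop.table(t, 2) * 100, digits = 0)
  as.data.frame(tt[2, ])
}
# v12sd2 <- ltd_zsw2("v12sd2_gr")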

Avoid larger (bloated) size when saving (regression) model in R (environments)

I want to create a regression model within another function, but my problem is that when I save the model it becomes really, really big, because other data in the environment is saved along with it. So I think the solution might be to handle environments differently; this helped me understand the topic better. Below I have explained the problem in a few steps.
# Helper function just to quickly assess how big the object becomes when being saved.
saveSize <- function (object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}
# Subset of columns to be used
subset = 1:4
# Model size to compare with; i.e., not created within a function
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
saveSize(model1)
# Size = 965
# Function where there are other data that should NOT be saved.
Function2 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  model2 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
model2 <- Function2(subset)
saveSize(model2)
# Size = 1148 ; problematic that the size is larger than model1.
# Solution to above is to create a new environment
Function3 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  # New environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}
model3 <- Function3(subset)
saveSize(model3)
# 1002 # Success: considerably smaller than in Function 2.
# PROBLEM: getting the solution from Function3 to work within another function.
# This function runs but results in a large object again.
# Also note that I do not want to call the iris dataset within the lm call.
Function5 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  Function5 <- function (subset) {
    env <- new.env(parent = globalenv())
    env$subset <- subset
    env$datainenvorment <- iris
    with(env, lm(Sepal.Length ~ Sepal.Width, data = datainenvorment, subset = subset))
  }
  model5 <- Function5(subset)
}
model5 <- Function5(subset)
saveSize(model5)
Thanks in advance
The solution you are using works correctly. You do not see it because in newer R versions sequential integer vectors are stored in a very memory-efficient way, so data_not_to_be_saved adds almost nothing. The small difference comes from the small overhead of additional variables such as env. What matters most is that the data_not_to_be_saved variable is skipped.
Use some bigger data to see it more clearly.
data_not_to_be_saved <- rnorm(10**5)
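For example, a quick sketch reusing saveSize() and Function3() from above (Function2b is a hypothetical variant of Function2 whose payload is not stored compactly):
Function2b <- function (subset){
  data_not_to_be_saved <- rnorm(10**5)  # 100,000 random doubles kept in the function's frame
  lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
saveSize(Function2b(1:4))  # large: the payload is saved via the formula's environment
saveSize(Function3(1:4))   # small: the clean environment holds only subset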
What is the source of this problem? lm() returns an object that contains references to other environments (a function's environment gives access to all variables from the place where it was defined). Additionally, save() with default parameters serializes those referenced environments along with everything they contain.
str(model5)
# like .. .. ..- attr(*, ".Environment")=<environment: 0x7fdc9e6c2b68>
Another solution might be to use the lm.fit() function, which returns only base structures; no additional environment references are kept.
model_fit <- lm.fit(cbind(1,iris$Sepal.Width[subset]), iris$Sepal.Length[subset])
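A quick check with the helper from the question (exact sizes will vary by platform), shown here as a usage sketch:
saveSize(model_fit)  # small: no environments are referenced, only the fit itself is saved
coef(model_fit)      # same fitted coefficients as model1 above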

Proper method to append to a formula where both formula and stuff to be appended are arguments

I've done a fair amount of reading here on SO and learned that I should generally avoid manipulation of formula objects as strings, but I haven't quite found how to do this in a safe manner:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
  # Arguments are unquoted and in the typical form for lm etc
  # Do some plotting with lattice using formula & groups (works, not shown)
  # Append 'groups' to 'formula':
  # Change y ~ x as passed in argument 'formula' to
  # y ~ x * gr where gr is the argument 'groups' with
  # scoping so it will be understood by aov
  new_formula <- y ~ x * gr
  # Now do some anova (could do if formula were right)
  model <- aov(formula = new_formula, data = data)
  # And print the aov table on the plot (can do)
  print(summary(model)) # this will do for testing
}
Perhaps the closest I came was to use reformulate but that only gives + on the RHS, not *. I want to use the function like this:
p <- tf(carat ~ color, groups = clarity, data = diamonds)
and have the aov results for carat ~ color * clarity. Thanks in Advance.
Solution
Here is a working version based on @Aaron's comment which demonstrates what's happening:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
  print(deparse(substitute(groups)))
  f <- paste(".~.*", deparse(substitute(groups)))
  new_formula <- update.formula(formula, f)
  print(new_formula)
  model <- aov(formula = new_formula, data = data)
  print(summary(model))
}
I think update.formula can solve your problem, but I've had trouble with update within function calls. It works as I've coded it below, but note that I'm passing the column to group by, not the variable name. You then add that column to the function's dataset, and then update works.
I also don't know if it's doing exactly what you want in the second equation, but take a look at the help file for update.formula and mess around with it a bit.
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.formula.html
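For a quick feel for what the .~.* groups update does (the printed result may come back expanded into main effects plus interaction terms rather than the * shorthand, but the model is the same):
update.formula(carat ~ color, . ~ . * clarity)
# equivalent to carat ~ color * clarity
update.formula(carat ~ color + color2, . ~ . * clarity)
# equivalent to carat ~ (color + color2) * clarity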
tf <- function(formula, groups, d){
  d$groups <- groups
  newForm <- update(formula, ~.*groups)
  mod <- lm(newForm, data=d)
}

dat <- data.frame(carat=rnorm(10,0,1), color=rnorm(10,0,1), color2=rnorm(10,0,1), clarity=rnorm(10,0,1))
m <- tf(carat~color, dat$clarity, d=dat)
m2 <- tf(carat~color+color2, dat$clarity, d=dat)

tf2 <- function(formula, group, d) {
  f <- paste(".~.*", deparse(substitute(group)))
  newForm <- update.formula(formula, f)
  lm(newForm, data=d)
}

mA <- tf2(carat~color, clarity, d=dat)
m2A <- tf2(carat~color+color2, clarity, d=dat)
EDIT:
As @Aaron pointed out, it's deparse and substitute that solve my problem; I've added tf2 to the code example as the better option so you can see how both work.
One technique I use when I have trouble with scoping and calling functions within functions is to pass the parameters as strings and then construct the call within the function from those strings. Here's what that would look like here.
tf <- function(formula, data, groups) {
  f <- paste(".~.*", groups)
  m <- eval(call("aov", update.formula(as.formula(formula), f), data = as.name(data)))
  summary(m)
}
tf("mpg~vs", "mtcars", "am")
See this answer to one of my previous questions for another example of this: https://stackoverflow.com/a/7668846/210673.
Also see this answer to the sister question of this one, where I suggest something similar for use with xyplot: https://stackoverflow.com/a/14858661/210673

Using predict in a function call with NLME objects and a formula

I have a problem with the nlme package when using the following code:
library(nlme)
x <- rnorm(100)
z <- rep(c("a","b"),each=50)
y <- rnorm(100)
test.data <- data.frame(x,y,z)
test.fun <- function(test.dat)
{
  form <- as.formula("y~x")
  ran.form <- as.formula("~1|z")
  modell <- lme(fixed = form, random=ran.form, data=test.dat)
  pseudo.newdata <- test.dat[1,]
  predict(modell, newdata= pseudo.newdata) ###THIS CAUSES THE ERROR!
}
test.fun(test.data)
The call to predict causes an error, and I have already found what basically causes it.
The modell object saves how it was called, and predict seems to use that call to make predictions, but it is unable to find the formula objects form and ran.form because it does not look for them in the right environment. In fact, I can avoid the problem by doing this:
attach(environment(form), warn.conflicts = FALSE)
predict(modell, newdata= pseudo.newdata)
detach()
My long term goal however is to save the modell to disk and use them later. I suppose I could try saving the formula objects as well, but this strikes me as a very annoying and cumbersome way to deal with the problem.
I work with automatically generated formula objects instead of writing them down explicitly because I create many models with different definitions in a sort of batch process, so I cannot avoid them. So my ideal solution would be a way to create the lme object such that I can forget about the formula objects afterwards and predict "just works". Thanks for any help.
Try replacing lme(arg1, arg2, arg3) with do.call(lme, list(arg1, arg2, arg3)).
library(nlme)
x <- rnorm(100)
z <- rep(c("a","b"),each=50)
y <- rnorm(100)
test.data <- data.frame(x,y,z)
test.fun <- function(test.dat)
{
  form <- as.formula("y~x")
  ran.form <- as.formula("~1|z")
  ## JUST NEED TO CHANGE THE FOLLOWING LINE
  ## modell <- lme(fixed = form, random=ran.form, data=test.dat)
  modell <- do.call(lme, list(fixed=form, random=ran.form, data=test.dat))
  pseudo.newdata <- test.dat[1,]
  predict(modell, newdata= pseudo.newdata) ### no longer an error with do.call()
}
test.fun(test.data)
# a
# 0.07547742
# attr(,"label")
# [1] "Predicted values"
This works because do.call() evaluates its argument list in the calling frame, before evaluating the call to lme() that it constructs. To see why that helps, type debug(predict), and then run your code and mine, comparing the debugging messages printed when you are popped into the browser.
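A minimal sketch of that difference, using lm() instead of lme() since the mechanism is the same (f_local, direct and via_do are just illustrative names):
f_local <- function() {
  form <- y ~ x
  d <- data.frame(x = rnorm(10), y = rnorm(10))
  direct <- lm(form, data = d)
  via_do <- do.call(lm, list(formula = form, data = d))
  # what each fitted object remembers as its formula argument:
  list(direct = direct$call$formula,       # the symbol form, which must be found again later
       via_do_call = via_do$call$formula)  # the actual formula y ~ x, self-contained
}
f_local()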

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is that if I define the function parameters (linmod, k, direction) in the workspace and run the contents of the function line by line, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop while inside a function. (Try commenting out the step command and setting the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod, k=10, direction="both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  form <- formula(linmod$call)

  # generate indices for cross validation
  rar <- n/k
  xval.idx <- list()
  s <- sample(1:n, n) # permutation of 1:n
  for (i in 1:k) {
    xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
  }

  # error calculation
  errors <- R2 <- 0
  for (j in 1:k){
    datas.test <- datas[xval.idx[[j]],]
    datas.train <- datas[-xval.idx[[j]],]
    test.idx <- xval.idx[[j]]

    # THE MODELS
    lm.1 <- lm(form, data=datas.train)
    lm.step <- step(lm.1, direction=direction, trace=0)

    step.pred <- predict(lm.step, newdata=datas.test)
    step.error <- sum((step.pred-response[test.idx])^2)
    errors[j] <- step.error/length(response[test.idx])
    SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
    R2[j] <- 1 - step.error/SS.tot
  }

  CVerror <- sum(errors)/k
  CV.R2 <- sum(R2)/k

  res <- list()
  res$CV.error <- CVerror
  res$CV.R2 <- CV.R2
  return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your model lm.cars, its formula was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment of the formula inside your function:
cv.step <- function(linmod, k=10, direction="both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model

  .env <- environment() ## identify the environment of cv.step

  ## extract the formula in the environment of cv.step
  form <- as.formula(linmod$call, env = .env)

  ## The rest of your function follows
Another problem that can cause this is passing a character string to lm instead of a formula. Character vectors have no environment, so when lm converts the character to a formula, the resulting formula apparently also has no environment, instead of being assigned the calling environment automatically. If one then uses as weights an object that is not a column of the data argument's data.frame but is a local argument of the calling function, one gets a "not found" error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer <- function(data, x, y, wtds) {
  # the formula to use
  f <- "x ~ y"
  # residualize
  resid(lm(formula = f, data = data, weights = wtds))
}

residualizer2 <- function(data, x, y, wtds) {
  # the formula to use
  f <- as.formula("x ~ y")
  # residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one uses the name weights for the weights variable. In this case, since lm can't find the weights object, it falls back to the weights() function from the stats package, thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.
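A sketch of yet another way to sidestep the lookup entirely (assuming, as above, that the weights line up row for row with the data): put the weights into the data.frame that is passed to lm(), so model.frame() finds them there no matter what environment the formula carries.
residualizer3 <- function(data, wtds) {
  data$.w <- wtds                     # weights travel inside the data argument
  resid(lm(x ~ y, data = data, weights = .w))
}
residualizer3(data = d_example, wtds = weightsvar)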
