Avoid larger (bloated) size when saving (regression) model in R (environments) - r

I want to create a regression model inside another function, but when I save the model it becomes very large because other data in the environment is saved with it. I think the solution might be to handle the environments differently; this helped me understand this better. I have laid out the problem in a few steps below.
# Helper function just to quickly assess how big the object becomes when being saved.
saveSize <- function (object) {
tf <- tempfile(fileext = ".RData")
on.exit(unlink(tf))
save(object, file = tf)
file.size(tf)
}
# Subset of rows (observations) to be used
subset = 1:4
# Model size to compare with; i.e., not created within a function
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
saveSize(model1)
# Size = 965
# Function where there are other data that should NOT be saved.
Function2 <- function (subset){
data_not_to_be_saved <- 1:1e+15
model2 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
model2 <- Function2(subset)
saveSize(model2)
# Size = 1148 ; Problematic that the size is larger than for model1.
# Solution to above is to create a new environment
Function3 <- function (subset){
data_not_to_be_saved <- 1:1e+15
# New environment
env <- new.env(parent = globalenv())
env$subset <- subset
with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}
model3 <- Function3(subset)
saveSize(model3)
# 1002 # Success: considerably smaller than in Function 2.
# PROBLEM: Getting the solution in Function3 to work within another function.
# This function runs but results in a large object again.
# Also note that I do not want to refer to the iris dataset inside the lm call.
Function5 <- function (subset){
data_not_to_be_saved <- 1:1e+15
Function5 <- function (subset) {
env <- new.env(parent = globalenv())
env$subset <- subset
env$datainenvorment <- iris
with(env, lm(Sepal.Length ~ Sepal.Width, data = datainenvorment, subset = subset))
}
model5 <- Function5(subset)
}
model5 <- Function5(subset)
saveSize(model5)
Thanks in advance

The solution you are using works correctly. You do not see the effect because, in recent R versions, sequential integer vectors are stored very memory-efficiently. The small remaining difference comes from the overhead of additional variables such as env. What matters most is that the data_not_to_be_saved variable is skipped.
Use some bigger data to see it more clearly.
data_not_to_be_saved <- rnorm(10**5)
What is the source of this problem? lm returns an object that contains references to other environments (for example, a function's environment gives access to all variables from the place where it was defined). In addition, the save function with its default parameters looks for the needed variables across all of those environments.
str(model5)
# like .. .. ..- attr(*, ".Environment")=<environment: 0x7fdc9e6c2b68>
Another solution might be to use the lm.fit function, which returns only base structures. Here no additional environment reference is kept.
model_fit <- lm.fit(cbind(1,iris$Sepal.Width[subset]), iris$Sepal.Length[subset])
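The environment capture described above can be checked directly. The following is a minimal sketch, assuming base R only; fit_leaky, fit_clean and size are hypothetical names introduced for illustration, and serialize() is used here as a stand-in for measuring what save() would write.

```r
# Model built directly in the function: its formula remembers the
# function's frame, which also holds the unrelated vector `big`.
fit_leaky <- function() {
  big <- rnorm(1e5)  # unrelated data that should not be saved
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

# Model built in a fresh environment whose parent is the global
# environment, as in Function3 of the question.
fit_clean <- function() {
  big <- rnorm(1e5)
  env <- new.env(parent = globalenv())
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

# Size in bytes of the serialized object (hypothetical helper)
size <- function(x) length(serialize(x, NULL))

leaky <- fit_leaky()
clean <- fit_clean()

# Variables that travel with each model via the terms' environment
ls(attr(terms(leaky), ".Environment"))
ls(attr(terms(clean), ".Environment"))

size_leaky <- size(leaky)
size_clean <- size(clean)
```

Comparing size_leaky with size_clean shows how much of the difference is due to the captured vector rather than to the model itself.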

Related

Include an object within a function only if it exists

I have a loop that needs to be executed; within which are 6 models. The objects that those models are stored in then need to get passed into a function that executes an AIC analysis. However, sometimes one of the models does not work, which then breaks the code for the AIC function because it does not recognize whatever model that failed because it was not stored as an object.
So, I need a way to pull those models that worked into the AIC function.
Here is an example, but keep in mind it is important that this can all be executed within a loop. Here are three hypothetical models:
hn.1 <- ds(data)
hn.1.obs <- ds(data,formula = ~OBSCODE)
hn.1.obs.mas <- ds(data, formula = ~OBSCODE+MAS)
And this would be my AIC function that compares the models:
summarize_ds_models(hn.1, hn.1.obs, hn.1.obs.mas)
But I get an error if say, the hn.1.obs.mas model failed.
I tried to use "get" and "ls" and I successfully pull the models that exist when I call:
get(ls(pattern='hn.15*'))
But that just returns a character vector, so that when I call:
summarize_ds_models(get(ls(pattern='hn.15*')))
it only conducts the AIC analysis on the first model in the above character vector.
Am I on the right track or is there a better way to do this?
UPDATE with a reproducible example.
Here is a simplified version of my problem:
create and fill two data frames that will be put into a list:
data.frame <- data.frame(x = integer(4),
y = integer(4),
z = integer(4),
i = integer(4))
data.frame$x <- c(1,2,3,4)
data.frame$y <- c(1,4,9,16)
data.frame$z <- c(1,3,8,10)
data.frame$i <- c(1,5,10,15)
data.frame.2 <- data.frame[1:4,1:3]
my.list <- list(data.frame,data.frame.2)
create df to fill with best models from AIC analyses
bestmodels <- data.frame(modelname = character(2))
Here is the function that will run the loop:
myfun <- function(list) {
for (i in 1:length(my.list)){
mod.1 = lm(y ~ x, data = my.list[[i]])
mod.2 = lm(y ~ x + z, data = my.list[[i]])
mod.3 = lm(y ~ i, data = my.list[[i]])
bestmodels[i,1] <- rownames(AIC(mod.1,mod.2,mod.3))[1]#bestmodel is 1st row
}
print(bestmodels)
}
However, on the second iteration of the loop, the AIC function will fail because mod.3 will fail. So, is there a generic way to make it so the AIC function will only execute for those models that worked? The outcome I would want here would be:
> bestmodels
modelname
1 mod.1
2 mod.1
since mod.1 would be chosen for both AIC analyses.
Gregor's comment:
Use a list instead of individual named objects. Then do.call(summarize_ds_models, my_list_of_models). If it isn't done already, you can Filter the list first to make sure only working models are in the list.
solved my problem. Thanks
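The approach from the comment can be sketched for the reproducible example as follows. This is only an illustration under some assumptions: safe_lm and best_model are hypothetical helpers, failed fits are turned into NULL with tryCatch, and the "best" model is taken to be the one with the smallest AIC (the question instead took the first row of the AIC table). For a function such as summarize_ds_models that takes the models as separate arguments, the filtered list would be passed with do.call, as the comment suggests.

```r
# Two data sets; the second lacks column i, so the third model fails on it.
df1 <- data.frame(x = c(1, 2, 3, 4), y = c(1, 4, 9, 16),
                  z = c(1, 3, 8, 10), i = c(1, 5, 10, 15))
df2 <- df1[, 1:3]
my.list <- list(df1, df2)

# Keep the model specifications in a named list
formulas <- list(mod.1 = y ~ x, mod.2 = y ~ x + z, mod.3 = y ~ i)

# Hypothetical helper: return NULL instead of stopping when a fit fails
safe_lm <- function(f, d) tryCatch(lm(f, data = d), error = function(e) NULL)

# Hypothetical helper: fit all models, drop failures, pick the smallest AIC
best_model <- function(d) {
  models <- lapply(formulas, safe_lm, d = d)
  models <- Filter(Negate(is.null), models)   # only the models that worked
  aic <- vapply(models, AIC, numeric(1))
  names(aic)[which.min(aic)]
}

bestmodels <- data.frame(modelname = vapply(my.list, best_model, character(1)))
bestmodels
```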

Save an object in R using a function parameter as name

I looked all over the website and could not get the correct answer for this dilemma:
I have a UDF for evaluating some classification models with different datasets, and I wanted a single function for evaluating them. I want something like the following: given the name of the model and the data, it computes some metrics (a confusion matrix, for example) and saves them to an object outside the function.
The problem here is that I want to create this object using the name of the model I am evaluating.
I ended up with something like this:
foo <- function(x) {return(as.character(substitute(x)))}
model1 <- lm(Sepal.Width ~ Sepal.Length, iris)
Validation.func <- function(model_name, dataset){
Pred_Train = predict(model_name, dataset)
assign(paste("Pred_Train_",foo(model_name), sep=''), Pred_Train, envir=globalenv())
Pred_Train_prob = predict(model_name, dataset, type = "prob")
MC_Train = confusionMatrix(Pred_Train, dataset$target_salto)
}
When running Validation.func(model1, iris), we would want the variable to be stored as "Pred_Train_model1".
Because model_name is not a string, we tried to convert it with the foo function (based on an answer I found here: foo = function(x) deparse(substitute(x))). I do not get what I want, since it saves the object as "Pred_Train_model_name" instead of "Pred_Train_model1".
Does anyone know how to solve it?
model_name in your function must be a model object, so it cannot be used in the paste function, which expects character strings.
I think you want your function to know that the model object is actually called "model1" in the environment it comes from. This is quite a tricky thing to attempt, since your model object may go by various names.
The easiest implementation would be to pass both the model object and its name separately, and then use the former for prediction and the latter for naming the result.
func1 <- function(model, model_str, dataset)
{
p <- predict(model, dataset)
assign(paste("predict_", model_str, sep=""), p, envir=globalenv())
}
model1 <- lm(mpg ~ cyl, data=mtcars)
func1(model1, "model1", mtcars)
predict_model1
Another implementation, tricky but works if used with care, would be to give only the character name of the model and obtain the model object by get function from the parent environment.
func2 <- function(model_str, dataset)
{
p <- predict(get(model_str, envir=parent.env(environment())), dataset)
assign(paste("predict_", model_str, sep=""), p, envir=globalenv())
}
model2 <- lm(mpg ~ cyl, data=mtcars)
func2("model2", mtcars)
predict_model2
Finally, if you want to pass the model object to the function and let the function find the variable name, you can use the match.call function to recover how the function was called.
func3 <- function(model, dataset)
{
s <- match.call()
model_str <- as.character(s)[2]
p <- predict(model, dataset)
assign(paste("predict_", model_str, sep=""), p, envir=globalenv())
}
model3 <- lm(mpg ~ cyl, data=mtcars)
func3(model3, mtcars)
predict_model3
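The reason the foo helper from the question yields "model_name" is that substitute captures the expression foo itself was called with, which inside Validation.func is the symbol model_name. Calling deparse(substitute(...)) directly in the outer function sees the caller's expression instead. A minimal sketch, assuming a hypothetical function predict_named that returns a named list rather than assigning into the global environment:

```r
# Hypothetical helper: names the result after the expression that was
# passed as `model`, without writing to the global environment.
predict_named <- function(model, dataset) {
  # Captured here, so it reflects how predict_named itself was called
  model_str <- deparse(substitute(model))
  p <- predict(model, dataset)
  setNames(list(p), paste0("predict_", model_str))
}

model1 <- lm(mpg ~ cyl, data = mtcars)
res <- predict_named(model1, mtcars)
names(res)
```

Like func3, this only gives a useful name when the argument is a plain variable name; an inline expression such as lm(...) would be deparsed in full.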
So here's a suggestion, that does not exactly solve the problem, but does make the function work.
Validation.func <- function(model_name, dataset){
model_name_obj<- eval(parse(text = model_name))
Pred_Train = predict(model_name_obj, dataset)
assign(paste("Pred_Train_",model_name, sep=''), Pred_Train, envir=globalenv())
Pred_Train_prob = predict(model_name_obj, dataset, type = "prob")
MC_Train = confusionMatrix(Pred_Train, dataset$target_salto)
}
Validation.func("model1", iris)
What I did is pretty much the opposite of what you were trying. I passed model_name as a string, and then evaluate it using parse(text = model_name). Note that the evaluated object is now called model_name_obj and it is passed in the predict function.
I got some errors later on in the function, but they are irrelevant to the issue at hand. They had to do with the type argument in predict and about not recognizing the confusionMatrix, because I assume I didn't load the corresponding package.

Inner function pulls call from outer function and causes error

I'm using a function from the library leaps within another function. The last two lines of the leaps function in question are:
rval$call <- sys.call(sys.parent())
rval
This apparently causes the call to the outer function to be passed to rval$call. And the actual call to the regsubsets function is needed as an argument later on.
Below an example to illustrate:
library(leaps)
#Create some sample data to perform a regression on
inda <- rnorm(100)
indb <- rnorm(100)
dep <- 2 + 0.1*inda + 0.2*indb + rnorm(100, sd = 0.3)
dfk <- data.frame(dep=dep, inda = inda, indb = indb)
#Create some arbitrary outer function
test <- function(dependent, data){
best.fit <- regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)
return(best.fit)
}
#Call outer function
best <- test("dep", dfk)
best$call #Returns "test("dep", dfk)"
So best$call will contain the call to the outer function (test), and not the call to the inner (regsubsets) function. As it's not really an option to change the inner function, is there any way of avoiding this problem?
EDIT:
One way around the problem could be something like this:
test <- function(dependent, data){
thecall <- 'regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)'
best.fit <- eval(parse(text = thecall))
#best.fit$call <- [some transformation of thecall]
return(best.fit)
}
EDIT2:
The reason I need to access what's inside $call is that it's needed in a predict function that I copied from Introduction to Statistical Learning:
predict.regsubsets <- function(regsubset_model, newdata, id, ...){
form <- as.formula(regsubset_model$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(regsubset_model, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
In the second line it uses $call
I’m still not entirely clear on how this is going to be used but in the case of your test function, you could write the following code:
test = function (dependent, data) {
regsubsets_call = bquote(regsubsets(.(as.formula(paste0(dependent, " ~ ."))),
data = .(substitute(data)), nvmax = 2))
best_fit = eval(regsubsets_call)
best_fit$call = regsubsets_call
best_fit
}
However, the result may not work with downstream functions the package provides (though, realistically, it probably will; I’m guessing summary.regsubsets only uses it to print the call).
What’s going on here?
bquote constructs an unevaluated R expression; it's similar to quote but it allows you to interpolate values (similar to substitute). substitute(data) means that, rather than putting the actual data.frame into the call (which would lead to very unwieldy output), it puts in the variable name (or expression) the user passed to test. So if the user called it as test('mpg', mtcars), then the resulting expression would be
regsubsets(mpg ~ ., data = mtcars, nvmax = 2)
The resulting call object is then (a) evaluated via eval, and (b) stored in the resulting $call.
Incidentally, the formula can (and, as far as I’m concerned, should) be constructed in the same way; no need to parse a string:
as.formula(bquote(.(as.name(dependent)) ~ .))
Taken together, the whole expression would then become:
formula = as.formula(bquote(.(as.name(dependent)) ~ .))
regsubsets_call = bquote(regsubsets(.(formula), data = .(substitute(data)), nvmax = 2))
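The same pattern can be tried without the leaps package. The sketch below is an assumption-laden stand-in: it uses lm from base R in place of regsubsets, and fit_with_call is a hypothetical name. It shows that the stored call then refers to the inner function and that the formula can be read back from it, as the predict function in the question does.

```r
# Hypothetical wrapper: lm stands in for regsubsets here
fit_with_call <- function(dependent, data) {
  form <- as.formula(bquote(.(as.name(dependent)) ~ .))
  # substitute(data) keeps the name the user passed, not the data itself
  inner_call <- bquote(lm(.(form), data = .(substitute(data))))
  fit <- eval(inner_call)
  fit$call <- inner_call   # store the inner call, not the wrapper's call
  fit
}

fit <- fit_with_call("mpg", mtcars)
fit$call

# The formula can be recovered from the stored call
form_back <- as.formula(fit$call[[2]])
```

Note that eval here looks up the data name from inside the wrapper, so this sketch assumes the data object is visible from there (for example, a global variable or a packaged dataset).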

Calculated values on imputed data

I'd like to do something like the following: (myData is a data table)
library(data.table)
library(mice)
#create some data
myData = data.table(invisible.covariate=rnorm(50),
visible.covariate=rnorm(50),
category=factor(sample(1:3,50, replace=TRUE)),
treatment=sample(0:1,50, replace=TRUE))
myData[,outcome:=invisible.covariate+visible.covariate+treatment*as.integer(category)]
myData[,invisible.covariate:=NULL]
#process it
myData[treatment == 0,untreated.outcome:=outcome]
myData[treatment == 1,treated.outcome:=outcome]
myPredictors = matrix(0,ncol(myData),ncol(myData))
myPredictors[5,] = c(1,1,0,0,0,0)
myPredictors[6,] = c(1,1,0,0,0,0)
myImp = mice(myData,predictorMatrix=myPredictors)
fit1 = with(myImp, lm(treated.outcome ~ category)) #this works fine
for_each_imputed_dataset(myImp, #THIS IS NOT A REAL FUNCTION but I hope you get the idea
function(imputed_data_table) {
imputed_data_table[,treatment.effect:=treated.outcome-untreated.outcome]
})
fit2 = with(myImp, lm(treatment.effect ~ category))
#I want fit2 to be an object similar to fit1
...
I would like to add a calculated value to each imputed data set, then do statistics using that calculated value. Obviously the structure above is probably not how you'd do it. I'd be happy with any solution, whether it involves preparing the data table somehow before the mice, a step before the "fit =" as sketched above, or some complex function inside the "with" call.
The complete() function will generate the "complete" imputed data set for each of the requested iterations. But note that mice expects to work with data.frames, so it returns data.frames and not data.tables. (Of course you can convert if you like). But here is one way to fit all those models
imp = mice(myData,predictorMatrix=myPredictors)
fits<-lapply(seq.int(imp$m), function(i) {
lm(I(treated.outcome-untreated.outcome)~category, complete(imp, i))
})
fits
The results will be in a list and you can extract particular lm objects via fits[[1]], fits[[2]], etc

Proper method to append to a formula where both formula and stuff to be appended are arguments

I've done a fair amount of reading here on SO and learned that I should generally avoid manipulation of formula objects as strings, but I haven't quite found how to do this in a safe manner:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
# Arguments are unquoted and in the typical form for lm etc
# Do some plotting with lattice using formula & groups (works, not shown)
# Append 'groups' to 'formula':
# Change y ~ x as passed in argument 'formula' to
# y ~ x * gr where gr is the argument 'groups' with
# scoping so it will be understood by aov
new_formula <- y ~ x * gr
# Now do some anova (could do if formula were right)
model <- aov(formula = new_formula, data = data)
# And print the aov table on the plot (can do)
print(summary(model)) # this will do for testing
}
Perhaps the closest I came was to use reformulate but that only gives + on the RHS, not *. I want to use the function like this:
p <- tf(carat ~ color, groups = clarity, data = diamonds)
and have the aov results for carat ~ color * clarity. Thanks in Advance.
Solution
Here is a working version based on @Aaron's comment which demonstrates what's happening:
tf <- function(formula = NULL, data = NULL, groups = NULL, ...) {
print(deparse(substitute(groups)))
f <- paste(".~.*", deparse(substitute(groups)))
new_formula <- update.formula(formula, f)
print(new_formula)
model <- aov(formula = new_formula, data = data)
print(summary(model))
}
I think update.formula can solve your problem, but I've had trouble with update within function calls. It will work as I've coded it below, but note that I'm passing the column to group, not the variable name. You then add that column to the function dataset, then update works.
I also don't know if it's doing exactly what you want in the second equation, but take a look at the help file for update.formula and mess around with it a bit.
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.formula.html
tf <- function(formula,groups,d){
d$groups=groups
newForm = update(formula,~.*groups)
mod = lm(newForm,data=d)
}
dat = data.frame(carat=rnorm(10,0,1),color=rnorm(10,0,1),color2=rnorm(10,0,1),clarity=rnorm(10,0,1))
m = tf(carat~color,dat$clarity,d=dat)
m2 = tf(carat~color+color2,dat$clarity,d=dat)
tf2 <- function(formula, group, d) {
f <- paste(".~.*", deparse(substitute(group)))
newForm <- update.formula(formula, f)
lm(newForm, data=d)
}
mA = tf2(carat~color,clarity,d=dat)
m2A = tf2(carat~color+color2,clarity,d=dat)
EDIT:
As @Aaron pointed out, it's deparse and substitute that solve my problem: I've added tf2 as the better option to the code example so you can see how both work.
One technique I use when I have trouble with scoping and calling functions within functions is to pass the parameters as strings and then construct the call within the function from those strings. Here's what that would look like here.
tf <- function(formula, data, groups) {
f <- paste(".~.*", groups)
m <- eval(call("aov", update.formula(as.formula(formula), f), data = as.name(data)))
summary(m)
}
tf("mpg~vs", "mtcars", "am")
See this answer to one of my previous questions for another example of this: https://stackoverflow.com/a/7668846/210673.
Also see this answer to the sister question of this one, where I suggest something similar for use with xyplot: https://stackoverflow.com/a/14858661/210673
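Since the question asks how to avoid treating formulas as strings, the update step can also be written with bquote instead of paste. This is a minimal sketch rather than a drop-in replacement: tf_sketch is a hypothetical name, the plotting part is omitted, and mtcars is used only as example data.

```r
# Hypothetical variant of tf: appends `groups` as an interaction term
# without building the formula from text.
tf_sketch <- function(formula, data, groups) {
  grp <- substitute(groups)   # the unevaluated symbol, e.g. am
  new_formula <- update(formula, bquote(. ~ . * .(grp)))
  aov(new_formula, data = data)
}

model <- tf_sketch(mpg ~ vs, data = mtcars, groups = am)
summary(model)
```

update expands the product, so the fitted terms are the two main effects plus their interaction.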
