Why does lm keep the whole environment when called within a function?

If you call lm or glm inside a function, the fitted object keeps a reference to the function's whole environment.
Example:
fit_lm = function(dt){
# Do some heavy data processing
tmp = data.frame(x = rnorm(10000000))
# fit and return model
return(lm(y~x, data = dt))
}
dt = data.frame(x = runif(100))
dt$y = 4 * dt$x + rnorm(100, sd = 0.5)
fit = fit_lm(dt)
If I look at the environment attr(fit$terms, ".Environment"), it contains the data used for model fitting, dt, but also the data frame tmp, even though tmp was not used by lm.
Does anyone know why?
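A minimal sketch of what is going on, assuming only base R: a formula records the environment in which it was created, and lm stores that environment on fit$terms. Because y ~ x is created inside fit_lm, that environment is the function's call frame, which also holds tmp. The helper fit_lm_clean is a hypothetical name used for illustration; resetting the formula's environment is only reasonable when every variable in the formula is a column of the data argument.

```r
set.seed(1)

fit_lm <- function(dt) {
  tmp <- data.frame(x = rnorm(1000))  # stand-in for heavy intermediate data
  lm(y ~ x, data = dt)                # the formula is created in this call frame
}

dt <- data.frame(x = runif(100))
dt$y <- 4 * dt$x + rnorm(100, sd = 0.5)
fit <- fit_lm(dt)

# The terms object points at the call frame of fit_lm(), so all of its
# local variables stay reachable for as long as fit exists.
env <- attr(fit$terms, ".Environment")
ls(env)

# Hypothetical variant: point the formula at the global environment instead.
# Assumption: the formula only refers to columns of dt.
fit_lm_clean <- function(dt) {
  tmp <- data.frame(x = rnorm(1000))
  f <- y ~ x
  environment(f) <- globalenv()
  lm(f, data = dt)
}

fit2 <- fit_lm_clean(dt)
identical(attr(fit2$terms, ".Environment"), globalenv())
```

Both fits have the same coefficients; only the environment referenced by the terms object differs.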

Related

erroneous function in function: r.squaredGLMM is not able to take over arguments in function

I have trouble with embedding a function in another function I am about to write. The particular function that shows some misbehaviour (or at least, it seems so to me), is the r.squaredGLMM function. It works fine on its own, but when called within the function I am writing, there is an error happening that I, honestly, do not understand. Here comes some example data:
library (lme4)
library (MuMIn)
my_table <- data.frame ("type" = c (rep ("low", 15), rep ("mid", 15), rep ("high", 15)),
"group" = rep (c ("A", "B", "C"),5),
"dependent" = sort (rpois (45, 3)),
"var_A" = rnorm (45),
"var_B" = c (rnorm (15), rnorm (15, 6, 0.1), rnorm (15, 18, 2)),
"var_C" = rep (c(3,1,6,2,8,2,1,6,7,3,20,12,23,15,15), 3) * c (1:3)
)
I use this data frame to calculate the following GLMM:
large_model <- glmer (dependent ~ var_A + var_B + var_C + (1|group),
data = my_table,
family = "poisson")
The r.squaredGLMM function yields the following result:
r.squaredGLMM (large_model)
> R2m R2c
> delta 0.4171626 0.4171626
> lognormal 0.4485549 0.4485549
> trigamma 0.3824522 0.3824522
Now comes the interesting part. I create a function to alter the large_model. Within the function, I create an object containing the call of the large_model via getCall and an object containing the family used in the large_model. A for loop is then used to alter the model; here, it produces a set of (perhaps senseless) shorter models. Note that the the_family object works fine here for telling glmer which family to use. The resulting short_model can then also be used to calculate, for example, the residuals of the respective models.
my_function <- function (global.model = NULL, my_i = 2) {
the_call <- getCall (global.model)
the_family <- family (global.model)$family
for (i in length(global.model@frame[[1]]): my_i) {
short_model <- glmer (the_call, data = global.model@frame[2:i,], family = the_family)
short_model_resids <- resid (short_model)
}
}
This works fine:
my_function (large_model, 40)
However, when I try to pass the short_model to r.squaredGLMM, I get an error:
my_function <- function (global.model = NULL, my_i = 2) {
the_call <- getCall (global.model)
the_family <- family (global.model)$family
for (i in length(global.model@frame[[1]]): my_i) {
short_model <- glmer (the_call, data = global.model@frame[2:i,], family = the_family)
short_model_resids <- resid (short_model)
short_model_R <- r.squaredGLMM (short_model)
}
}
my_function (large_model, 40)
> Error in glmer(formula = dependent ~ 1 + (1 | group), data = global.model@frame[2:i, :
> object 'the_family' not found
When I do it without the function and step by step by hand, it works fine (I suppose this is due to R storing the_family in the environment):
my_i <- 40
the_call <- getCall (large_model)
the_family <- family (large_model)$family
for (i in length(large_model@frame[[1]]): my_i) {
short_model <- glmer (the_call, data = large_model@frame[2:i,], family = the_family)
short_model_resids <- resid (short_model)
short_model_R <- r.squaredGLMM (short_model)
}
I have really no idea what is going wrong here. I hope very much, someone can help - it'd be so very good to use r.squaredGLMM within a function and to feed it with a family-object created within the same function.
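One plausible reading of the error, not checked against the MuMIn source: a fitted model stores its call unevaluated, so family = the_family is kept as a symbol, and any later refit (for example via update) looks the_family up outside the function where it was defined. The sketch below reproduces that mechanism with base R's glm rather than glmer; fit_inside and fit_inlined are hypothetical names. Inlining the value with bquote is one way to make the stored call self-contained.

```r
set.seed(42)
d <- data.frame(x = rnorm(50))
d$y <- rpois(50, lambda = exp(0.3 * d$x))

# The family is passed through a local variable, so the stored call
# contains the symbol the_family rather than its value.
fit_inside <- function(d) {
  the_family <- "poisson"
  glm(y ~ x, data = d, family = the_family)
}
fit_a <- fit_inside(d)
fit_a$call$family

# Refitting from the global environment fails: the_family does not exist here.
# (data = d still resolves only because a global object named d exists.)
res_a <- tryCatch(update(fit_a, . ~ 1),
                  error = function(e) conditionMessage(e))
res_a

# bquote() substitutes the value of the_family into the call before it runs,
# so the stored call no longer depends on the function's local variables.
fit_inlined <- function(d) {
  the_family <- "poisson"
  eval(bquote(glm(y ~ x, data = d, family = .(the_family))))
}
fit_b <- fit_inlined(d)
res_b <- update(fit_b, . ~ 1)
```

Whether the same change is enough for r.squaredGLMM would need to be tried with glmer itself, since the data argument in the question is also an expression that only resolves inside the function.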

Regression in R using a function

I am trying to smooth out my data for each variable in the data frame. Let's say it looks like this:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
data$x <- 1:nrow(data)
I then specify my x and y variables as:
x <- data$x
y <- data$v1
I can fit the predicted line I want (and I am happy with the process):
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_v1 <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
I therefore end up with an appropriately titled new variable (the predicted values for v1). I now want to do this for each of the other 100 variables in my data frame (v2 and v3 in this example).
I tried using a function to do this but can't get it to work as intended. Here is my attempt:
myfunction <- function(xaxis,yaxis){
# Specify my "y" and "x"
x <- data$xaxis
y <- data$yaxis
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_yaxis <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
}
myfunction(x,v1)
myfunction(x,v2)
myfunction(x,v3)
Not only does the function not work as intended, I would like to avoid calling the function 100 times for each variable and instead somehow loop through it.
This is really simple to do in SAS using macros but I am struggling to get this to work in R.
You can model your data directly with the lm() function:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- 1:nrow(data)
# initialize a list to store the models
models = vector("list", length = (ncol(data)))
# create a loop running over the columns of data
for (i in 1:(ncol(data))){
models[[i]] = lm(data[,i] ~ poly(x,2, raw = TRUE))}
You can also use lapply instead of the for-loop, as stated in the comments.
Use predict() to get the values of the models:
smoothed_v1 = predict(models[[1]], newdata=data.frame(x = x))
Edit:
Regarding your comment - you can store the new values in data with:
for (i in (length(models):1)){
data <- cbind(predict(models[[i]], newdata=data.frame(x = x)), data)
# set the name for the new column
names(data)[1] = paste("pred_v",i, sep ="")}
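The lapply alternative mentioned above could look like the following sketch. It fits the same quadratic with lm for every column and appends the fitted values as new columns; the pred_ prefix and the variable name vars are illustrative choices, not part of the original answer.

```r
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
                   v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
                   v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- seq_len(nrow(data))
vars <- names(data)  # columns to smooth, captured before new columns are added

# One quadratic fit per column, stored in a list named after the columns
models <- lapply(vars, function(v) lm(data[[v]] ~ poly(x, 2, raw = TRUE)))
names(models) <- vars

# Fitted values at the observed x, appended as pred_v1, pred_v2, ...
data[paste0("pred_", vars)] <- lapply(models, fitted)
head(data)
```

Because the list is named, a single model can be retrieved with models[["v2"]] or models$v2.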

Function to regress chosen variable against all others

In my dataset I have 6 variables (x1, x2, x3, x4, x5, x6). I wish to create a function that takes one variable as input and regresses it on the rest of the variables in the data set.
For instance,
fitRegression <- function(data, dependentVariable) {
fit = lm(formula = x1 ~., data = data1)
return(fit)
}
fitRegression(x2)
However, this function only returns the results for x1. My desired result is to pass in any variable and have the formula built automatically from the rest of the variables.
For Example:
fitRegression(x2)
should subtract x2 from the variable list therefore we only compare x2 with x1,x3,x4,x5,x6.
and if:
fitRegression(x3)
should subtract x3 from the comparable list, therefore we compare x3 with x1,x2,x4,x5,x6.
Is there any way to express this in my function, or is there an even better function?
You can do it like this:
# sample data
sampleData <- data.frame(matrix(rnorm(500),100,5))
colnames(sampleData) <- c("A","B","C","D","E")
# function
fitRegression <- function(mydata, dependentVariable) {
# select your independent and dependent variables
dependentVariableIndex<-which(colnames(mydata)==dependentVariable)
independentVariableIndices<-which(colnames(mydata)!=dependentVariable)
fit = lm(formula = as.formula(paste(colnames(mydata)[dependentVariableIndex], "~", paste(colnames(mydata)[independentVariableIndices], collapse = "+"), sep = "" )), data = mydata)
return(fit)
}
# ground truth
lm(formula = A~B+C+D+E, data = sampleData)
# reconcile results
fitRegression(sampleData, "A")
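As a shorter way to build the same formula, base R's reformulate() takes the predictor names and the response name directly. This is a sketch of an alternative, not part of the answer above; fit_regression is a hypothetical name, and it assumes the column names are syntactically valid R names.

```r
set.seed(1)
sampleData <- data.frame(matrix(rnorm(500), 100, 5))
colnames(sampleData) <- c("A", "B", "C", "D", "E")

fit_regression <- function(mydata, dependent) {
  stopifnot(dependent %in% names(mydata))
  # every column except the response becomes a predictor
  predictors <- setdiff(names(mydata), dependent)
  f <- reformulate(predictors, response = dependent)
  lm(f, data = mydata)
}

fit <- fit_regression(sampleData, "A")
coef(fit)
```

The coefficients should match those of lm(A ~ B + C + D + E, data = sampleData).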
You want to pass the response variable as an argument. The main difficulty is passing it unquoted, which is what your code appears to expect. For that you can use the combination deparse(substitute(...)):
fitRegression <- function(data, dependentVariable) {
formula <- as.formula(paste0(deparse(substitute(dependentVariable)), "~."))
return(lm(formula, data) )
}
fitRegression(mtcars, disp)
That will return the model.
The code below uses the purrr and caret packages to produce a list of models, one per column.
df <-mtcars
library(purrr);library(caret)
#create training set
vect <- createDataPartition(1:nrow(df), p=0.8, list = FALSE)
#build model list
ModList <- 1:length(df) %>%
map(function(col) train(y= df[vect,col], x= df[vect,-col], method="lm"))

List Indexing in R over a loop

I'm new to using lists in R and am trying to run a loop over various data frames that stores multiple models for each frame. I would like the models that correspond to a given data frame within the first index of the list; e.g. [[i]][1], [[i]][2]. The following example overwrites the list:
f1 <- data.frame(x = seq(1:6), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = seq(6,11), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1,f2)
fit <- list()
for(i in 1:length(data.frames)){
fit[[i]] <- lm(y ~ x, data = data.frames[[i]])
fit[[i]] <- lm(y ~ x + z, data = data.frames[[i]])
}
Any idea how to set up the list or the indexing in the loop such that it generates an output that has the two models for the first frame referenced as [[1]][1] and [[1]][2] and the second frame as [[2]][1] and [[2]][2]? Thanks for any and all help.
Calculate both models in a single lapply call applied to each part of the data.frames list:
lapply(data.frames, function(i) {
list(lm(y ~ x, data = i),
lm(y ~ x + z, data=i))
})
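To show how the result is indexed, here is a small sketch that assigns the output and names the inner list elements. The names fits, simple, and full are illustrative only.

```r
set.seed(1)
f1 <- data.frame(x = 1:6,  y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = 6:11, y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1, f2)

# One inner list of two models per data frame
fits <- lapply(data.frames, function(d) {
  list(simple = lm(y ~ x, data = d),
       full   = lm(y ~ x + z, data = d))
})

fits[[1]][[2]]    # second model for the first data frame
fits[[2]]$simple  # first model for the second data frame, by name
```

Note that fits[[1]][2] (single brackets) returns a list of length one containing the model, whereas fits[[1]][[2]] returns the model itself.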

How to use a distinct data set per chain in Stan?

I have a data set with many missing observations and I used the Amelia package to create imputed data sets. I'd like to know if it's possible to run the same model in parallel with a different data set per chain and combine the results into a single Stan object.
# Load packages
library(Amelia)
library(rstan)
# Load built-in data
data(freetrade)
# Create 2 imputed data sets (polity is an ordinal variable)
df.imp <- amelia(freetrade, m = 2, ords = "polity")
# Check the first data set
head(df.imp$imputations[[1]])
# Run the model in Stan
code <- '
data {
int<lower=0> N;
vector[N] tariff;
vector[N] polity;
}
parameters {
real b0;
real b1;
real<lower=0> sigma;
}
model {
b0 ~ normal(0,100);
b1 ~ normal(0,100);
tariff ~ normal(b0 + b1 * polity, sigma);
}
'
# Create a list from the first and second data sets
df1 <- list(N = nrow(df.imp$imputations[[1]]),
tariff = df.imp$imputations[[1]]$tariff,
polity = df.imp$imputations[[1]]$polity)
df2 <- list(N = nrow(df.imp$imputations[[2]]),
tariff = df.imp$imputations[[2]]$tariff,
polity = df.imp$imputations[[2]]$polity)
# Run the model
m1 <- stan(model_code = code, data = df1, chains = 1, iter = 1000)
My question is how to run the last line of code on both data sets at the same time, running 2 chains and combining the output with the same stan() function. Any suggestions?
You can run the models separately, and then combine them using sflist2stanfit().
E.g.
seed <- 12345
s1 <- stan_model(model_code = code) # compile the model
m1 <- sampling(object = s1, data = df1, chains = 1,
seed = seed, chain_id = 1, iter = 1000)
m2 <- sampling(object = s1, data = df2, chains = 1,
seed = seed, chain_id = 2, iter = 1000)
f12 <- sflist2stanfit(list(m1, m2))
You will have to use one of the packages for parallel computing in R.
According to this post, it should then work:
Will RStan run on a supercomputer?
Here is an example that may work (I use this code with JAGS, will test it with Stan later):
library( doParallel )
cl <- makeCluster( 2 ) # for 2 processes
registerDoParallel( cl )
library(rstan)
# make a function to combine the results
stan.combine <- function(...) { return( sflist2stanfit( list(...) ) ) }
mydatalist <- list(df1 , df2)
myseeds <- c(123, 456)
# now start the chains
nchains <- 2
m_both <- foreach(i=1:nchains ,
.packages = c( 'rstan' ),
.combine = "stan.combine") %dopar% {
result <- stan(model_code = code,
data = mydatalist[[i]], # use the right dataset
seed=myseeds[i], # use different seeds
chains = 1, iter = 1000)
return(result) }
Let me know whether it works with Stan. As I said, I haven't tested it yet.
