using lm function in R with a variable name in a loop - r

I am trying to create a simple linear model in R inside a for loop, where one of the variables is specified as a parameter and looped through, creating a different model on each pass of the loop. The following does NOT work:
model <- lm(test_par[i] ~ weeks, data=all_data_plant)
If I tried the same model with the "test_par[i]" replaced with the variable's explicit name, it works just as expected:
model <- lm(weight_dry ~ weeks, data=all_data_plant)
I tried reformulate and paste without success. Any thoughts?

Maybe try something like this:
n <- #add the column position of first variable
m <- #add the column position of last variable
lm_models <- lapply(n:m, function(x) lm(all_data_plant[,x] ~ weeks, data=all_data_plant))

You can pass the "formula" argument of lm() as a character string built with paste(). Here is a working example:
data("trees")
test_par <- names(trees)
model <- lm(Girth ~ Height, data = trees)
model <- lm("Girth ~ Height", data = trees) # character formula works
model <- lm(paste(test_par[1], "~ Height"), data=trees)
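The same idea can be put into the loop the question asks about. The following is a minimal sketch, assuming the built-in trees data set and an illustrative choice of response columns; wrapping the string in as.formula() is optional here but makes the intent explicit.

```r
data("trees")
test_par <- c("Girth", "Volume")   # response columns to loop over (illustrative choice)

models <- list()
for (i in seq_along(test_par)) {
  # Build e.g. Girth ~ Height from the column name held in test_par[i]
  f <- as.formula(paste(test_par[i], "~ Height"))
  models[[test_par[i]]] <- lm(f, data = trees)
}

# One fitted model per response, stored under the response name
names(models)
```

Storing the fits in a named list keeps each model retrievable by its response name, for example models$Volume.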

Related

Include an object within a function only if it exists

I have a loop that needs to be executed, within which 6 models are fitted. The objects those models are stored in then need to be passed into a function that runs an AIC analysis. However, sometimes one of the models fails, which then breaks the AIC function because the failed model was never stored as an object.
So, I need a way to pull those models that worked into the AIC function.
Here is an example, but keep in mind it is important that this can all be executed within a loop. Here are three hypothetical models:
hn.1 <- ds(data)
hn.1.obs <- ds(data,formula = ~OBSCODE)
hn.1.obs.mas <- ds(data, formula = ~OBSCODE+MAS)
And this would be my AIC function that compares the models:
summarize_ds_models(hn.1, hn.1.obs, hn.1.obs.mas)
But I get an error if say, the hn.1.obs.mas model failed.
I tried to use "get" and "ls" and I successfully pull the models that exist when I call:
get(ls(pattern='hn.15*'))
But that just returns a character vector, so that when I call:
summarize_ds_models(get(ls(pattern='hn.15*')))
it only conducts the AIC analysis on the first model in the above character vector.
Am I on the right track or is there a better way to do this?
UPDATE with a reproducible example.
Here is a simplified version of my problem:
create and fill two data frames that will be put into a list:
data.frame <- data.frame(x = integer(4),
y = integer(4),
z = integer(4),
i = integer(4))
data.frame$x <- c(1,2,3,4)
data.frame$y <- c(1,4,9,16)
data.frame$z <- c(1,3,8,10)
data.frame$i <- c(1,5,10,15)
data.frame.2 <- data.frame[1:4,1:3]
my.list <- list(data.frame,data.frame.2)
create df to fill with best models from AIC analyses
bestmodels <- data.frame(modelname = character(2))
Here is the function that will run the loop:
myfun <- function(list) {
  for (i in 1:length(my.list)){
    mod.1 = lm(y ~ x, data = my.list[[i]])
    mod.2 = lm(y ~ x + z, data = my.list[[i]])
    mod.3 = lm(y ~ i, data = my.list[[i]])
    bestmodels[i,1] <- rownames(AIC(mod.1,mod.2,mod.3))[1] # bestmodel is 1st row
  }
  print(bestmodels)
}
However, on the second iteration of the loop, the AIC function will fail because mod.3 will fail. So, is there a generic way to make it so the AIC function will only execute for those models that worked? The outcome I would want here would be:
> bestmodels
modelname
1 mod.1
2 mod.1
since mod.1 would be chosen for both AIC analyses.
Gregor's comment:
Use a list instead of individual named objects. Then do.call(summarize_ds_models, my_list_of_models). If it isn't done already, you can Filter the list first to make sure only working models are in the list.
solved my problem. Thanks
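The list-and-Filter approach from the comment can be sketched as follows. This is only an illustration under some assumptions: safe_lm is a hypothetical helper (not part of any package) that returns NULL when a fit fails, the fourth column is named w rather than i so it cannot clash with a loop index, and the lowest AIC is picked with which.min instead of taking the first row. For a function that expects the models as separate arguments, do.call(summarize_ds_models, models) would take the place of the sapply line.

```r
# Hypothetical helper: fit a model, returning NULL instead of stopping on error
safe_lm <- function(formula, data) {
  tryCatch(lm(formula, data = data), error = function(e) NULL)
}

df_full <- data.frame(x = c(1, 2, 3, 4), y = c(1, 4, 9, 16),
                      z = c(1, 3, 8, 10), w = c(1, 5, 10, 15))
df_part <- df_full[, 1:3]   # column w is missing, so mod.3 fails for this one

best <- character(0)
for (d in list(df_full, df_part)) {
  models <- list(
    mod.1 = safe_lm(y ~ x, d),
    mod.2 = safe_lm(y ~ x + z, d),
    mod.3 = safe_lm(y ~ w, d)
  )
  models <- Filter(Negate(is.null), models)   # keep only the models that fitted
  aic <- sapply(models, AIC)                  # one AIC value per surviving model
  best <- c(best, names(aic)[which.min(aic)]) # name of the lowest-AIC model
}
best
```

Because failed fits are dropped before the comparison, the loop keeps running even when one of the candidate models cannot be estimated.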

Use string of independent variables within the lm function

I have a data frame with many variables. I want to fit a linear regression that explains the last one using the others. Since there was too much to write by hand, I thought about creating a string with the independent variables, e.g. Var1 + Var2 +...+ VarK. I achieved it by pasting "+" to all column names except the last one with this code:
ExVar <- toString(paste(names(datos)[1:11], "+ ", collapse = ''))
I also had to remove the last "+":
ExVar <- substr(ExVar, 1, nchar(ExVar)-2)
So I copied and pasted the ExVar string within the lm() function and the result looked like this:
m1 <- lm(calidad ~ Var1 + Var2 +...+ VarK)
The question is: Is there any way to use "ExVar" within the lm() function as a string, not as a variable, to have a cleaner code?
For better understanding:
If I use this code:
m1 <- lm(calidad ~ ExVar)
It interprets ExVar as an independent variable.
The following will all produce the same results. I am providing multiple methods because there are simpler ways of doing what you are asking (see examples 2 and 3) than writing the expression as a string.
First, I will generate some example data:
n <- 100
p <- 11
dat <- array(rnorm(n*p),c(n,p))
dat <- as.data.frame(dat)
colnames(dat) <- paste0("X",1:p)
If you really want to specify the model as a string, this example code will help:
ExVar <- toString(paste(names(dat[2:11]), "+ ", collapse = ''))
ExVar <- substr(ExVar, 1, nchar(ExVar)-3)
model1 <- paste("X1 ~ ",ExVar)
fit1 <- lm(eval(parse(text = model1)),data = dat)
Otherwise, note that the 'dot' notation will specify all other variables in the model as predictors.
fit2 <- lm(X1 ~ ., data = dat)
Or, you can select the predictors and outcome variables by column, if your data is structured as a matrix.
dat <- as.matrix(dat)
fit3 <- lm(dat[,1] ~ dat[,-1])
All three of these fit objects have the same estimates:
fit1
fit2
fit3
If you have a data frame and you want to explain the last column using all the rest, you can use the code below:
lm(calidad~.,dat)
or you can use
lm(rev(dat))#Only if the last column is your response variable
Either of the two above will give you the results needed.
To do it your way:
EXV=as.formula(paste0("calidad~",paste0(names(dat)[-12],collapse = '+')))
lm(EXV,dat)
There is no need to do it this way since the lm function itself will do this by using the first code above.
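If the predictor names do have to come from a character vector, reformulate() avoids the manual pasting and trimming. The sketch below uses simulated data with made-up column names (Var1 to Var3 and calidad) purely for illustration, and checks that the result matches the dot notation.

```r
set.seed(1)
# Simulated stand-in for the real data; column names are illustrative
datos <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))
names(datos) <- c("Var1", "Var2", "Var3", "calidad")

# Every column except the response becomes a predictor
predictors <- setdiff(names(datos), "calidad")
f <- reformulate(predictors, response = "calidad")   # calidad ~ Var1 + Var2 + Var3

fit_string <- lm(f, data = datos)
fit_dot    <- lm(calidad ~ ., data = datos)

# Both routes estimate the same coefficients
all.equal(coef(fit_string), coef(fit_dot))
```

The vector of predictor names can be subset before calling reformulate(), which is the main reason to prefer this over the dot when only some columns should enter the model.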

lm function gives estimate for the y-variable also

I am trying to run a simple lm model. I am using the following
dt <- data.table(
y=rnorm(100,0,1),
x1=rnorm(100,0,1),
x2=rnorm(100,0,1),
x3=rnorm(100,0,1))
y_var2 <- names(dt)[names(dt)%like%"y"]
x_var2 <- names(dt)[names(dt)%like%"x"]
tmp2 <- summary(a <- lm(get(y_var2)~.,dt[,c(x_var2,y_var2),with=F]))
coefs2 <- as.data.table(tmp2$coefficients,keep.rownames = T)
So in the end, coefs2 should contain the estimates, p-values, etc. But in the last row of coefs2 I also see the y-variable.
But if I use
tmp2 <- summary(a <- lm(y~.,dt[,c(x_var2,y_var2),with=F]))
Then this does not happen. Why is that?
This has to do with how R expands the dot in a formula. The dot stands for every column of the data that does not already appear in the formula. y_var2 is the character string "y", and with get(y_var2) on the left-hand side R sees the term get(y_var2), not the column y, so y is not excluded and ends up among the predictors. The formula therefore has to be y ~ . itself, built from the variable name.
lm( formula(paste(y_var2,"~.")),dt[,c(x_var2,y_var2),with=F])
will do the trick. formula constructs a formula from the string that paste assembled.
Actually it would probably be cleaner just to make the formula with reformulate() and the data= parameter of lm
tmp2 <- summary(a <- lm(reformulate(x_var2, y_var2), dt))
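The difference can be seen on a small simulated data frame. This is a sketch with made-up data and a plain data.frame rather than a data.table; the variable names are illustrative.

```r
set.seed(42)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
y_name <- "y"

# get(y_name) is not recognised as the column y, so the dot still includes y
fit_get <- lm(get(y_name) ~ ., data = df)

# Building the formula from the string gives y ~ ., and y is excluded from the dot
fit_str <- lm(as.formula(paste(y_name, "~ .")), data = df)

names(coef(fit_get))   # y appears among the predictors
names(coef(fit_str))   # only the intercept and the x columns
```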

Using list of LM estimates as stargazer input

I'm trying to use stargazer on several lm estimates at once, say "OLS1",...,"OLS5".
I would usually insert them as separate arguments at the beginning of the stargazer input. What I'm looking for is a way to input them all with a list that contains them all, being one argument. Something like
stargazer(list,...)
stargazer arguments explanation states that
one or more model objects (for regression analysis tables) or data frames/vectors/matrices (for summary statistics, or direct output of content). They can also be included as lists (or even lists within lists).
I was wondering what is the correct way to gather LM estimates in a list so that this would work. When I just save the results in a list I get the following error
Error in list.of.objects[[i]] : subscript out of bounds
I will mention that I create the elements storing the estimate using assign. E.G:
assign(some_string,lm(...))
So what I have is a string, called some_string, and I want to put the lm result named by some_string inside a list. Using get doesn't help with that.
EDIT: I think you want mget
library(stargazer)
Y <- rnorm(100)
X <- rnorm(100)
assign("string_1", lm(Y ~ X))
assign("string_2", lm(Y ~ X))
my_list <- mget(x = c("string_1", "string_2"))
stargazer(my_list)
works for me?
library(stargazer)
Y <- rnorm(100)
X <- rnorm(100)
fit_1 <- lm(Y ~ X)
fit_2 <- lm(Y ~ X)
stargazer(list(fit_1, fit_2))
Did you name your list list? Maybe it's grabbing the function instead.
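An alternative to assign() and mget() is to build the named list directly. The sketch below uses simulated data and illustrative model names; the stargazer call is left as a comment so the example runs without the package installed.

```r
set.seed(1)
dat <- data.frame(Y = rnorm(100), X1 = rnorm(100), X2 = rnorm(100))

model_names <- c("OLS1", "OLS2")            # illustrative names
formulas    <- list(Y ~ X1, Y ~ X1 + X2)

# Fit each formula and store the results under the chosen names
fits <- setNames(
  lapply(formulas, function(f) lm(f, data = dat)),
  model_names
)

# The whole list can then be handed over as a single argument, e.g.
# stargazer(fits)
names(fits)
```

Calling lm() inside an anonymous function keeps the stored call as an lm(...) call, which tools that inspect the call of a fitted model may rely on.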

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
# Function call
plot.model("Predictor1", crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the gam function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations? I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
Where the train set is a data frame with column names "Response" & "Predictor1".
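Use aes_string instead of aes when you pass a column name as a string. Every aesthetic must then be given as a string, including y. (In ggplot2 3.0.0 and later, aes_string is deprecated in favour of aes(x = .data[[predictor]], ...).)
plot <- ggplot(data = test, mapping = aes_string(x = predictor, y = "ViolentCrimesPerPop"))
For the gam function, here is an example adapted from the gam documentation. I have used a vector of column names; a single name is even easier. It's just paste with a collapse parameter.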
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is the same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
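For the single-predictor case from the question, the formula can be assembled inside the function before it is handed to gam. The sketch below only builds and inspects the formula, so it runs in base R; build_gam_formula is a hypothetical helper name, and the spar argument is simply carried over from the question's working call.

```r
# Hypothetical helper: build Response ~ s(<predictor>, spar = <spar>) from a column name
build_gam_formula <- function(predictor, spar = 1, response = "Response") {
  as.formula(sprintf("%s ~ s(%s, spar = %s)", response, predictor, format(spar)))
}

f <- build_gam_formula("Predictor1", spar = 1)
f             # the formula that would be passed on
all.vars(f)   # the columns it refers to

# Inside plot.model this would then be used as, for example:
# mod <- gam(build_gam_formula(predictor), data = train)
```

Because the column name ends up as a symbol in the formula, gam looks it up in the data argument just as it does in the hand-written call.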
