Use string of independent variables within the lm function - r

I have a dataframe with many variables. I want to apply a linear regression to explain the last one with the others. Since that was too much to write out, I thought about creating a string with the independent variables, e.g. Var1 + Var2 +...+ VarK. I achieved it by pasting "+" to all column names except the last one with this code:
ExVar <- toString(paste(names(datos)[1:11], "+ ", collapse = ''))
I also had to remove the last "+":
ExVar <- substr(ExVar, 1, nchar(ExVar)-2)
So I copied and pasted the ExVar string within the lm() function and the result looked like this:
m1 <- lm(calidad ~ Var1 + Var2 +...+ VarK)
The question is: Is there any way to use "ExVar" within the lm() function as a string, not as a variable, to have a cleaner code?
For better understanding:
If I use this code:
m1 <- lm(calidad ~ ExVar)
It interprets ExVar as an independent variable.

The following will all produce the same results. I am providing multiple methods because there are simpler ways of doing what you are asking (see examples 2 and 3) than writing the expression as a string.
First, I will generate some example data:
n <- 100
p <- 11
dat <- array(rnorm(n*p),c(n,p))
dat <- as.data.frame(dat)
colnames(dat) <- paste0("X",1:p)
If you really want to specify the model as a string, this example code will help:
ExVar <- toString(paste(names(dat[2:11]), "+ ", collapse = ''))
ExVar <- substr(ExVar, 1, nchar(ExVar)-3)
model1 <- paste("X1 ~ ",ExVar)
fit1 <- lm(eval(parse(text = model1)),data = dat)
Otherwise, note that the 'dot' notation will specify all other variables in the model as predictors.
fit2 <- lm(X1 ~ ., data = dat)
Or, you can select the predictors and outcome variables by column, if your data is structured as a matrix.
dat <- as.matrix(dat)
fit3 <- lm(dat[,1] ~ dat[,-1])
All three of these fit objects have the same estimates:
fit1
fit2
fit3
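As a side note, the string can also be assembled with collapse = " + " directly, which avoids trimming the trailing "+", and turned into a formula with as.formula() or reformulate(). The following is only a sketch on simulated data like the example above; the names rhs, form_a, form_b, fit_a, fit_b and fit_dot are made up for illustration.

```r
# Simulated data in the same shape as the example above (11 columns X1..X11).
set.seed(1)
n <- 100
p <- 11
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
colnames(dat) <- paste0("X", 1:p)

# Join the predictor names with " + " in one step; no trailing "+" to remove.
rhs <- paste(names(dat)[-1], collapse = " + ")

# Two ways to turn the names into a formula object.
form_a <- as.formula(paste("X1 ~", rhs))
form_b <- reformulate(names(dat)[-1], response = "X1")

fit_a   <- lm(form_a, data = dat)
fit_b   <- lm(form_b, data = dat)
fit_dot <- lm(X1 ~ ., data = dat)   # the dot notation, for comparison

# All three fits should agree on the coefficients.
all.equal(coef(fit_a), coef(fit_dot))
all.equal(coef(fit_b), coef(fit_dot))
```

Passing a formula object keeps eval(parse()) out of the call, which tends to be easier to debug.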

If you have a data frame and you want to explain the last column using all the rest, you can use the code below:
lm(calidad~.,dat)
or you can use
lm(rev(dat))#Only if the last column is your response variable
Either of the two above will give you the results needed.
To do it your way:
EXV <- as.formula(paste0("calidad~", paste0(names(dat)[-12], collapse = '+')))
lm(EXV,dat)
There is no need to do it this way, since lm() does this itself when you use the first form above (calidad ~ .).

Related

using lm function in R with a variable name in a loop

I am trying to create a simple linear model in R inside a for loop, where one of the variables is specified as a parameter and looped through, creating a different model on each pass of the loop. The following does NOT work:
model <- lm(test_par[i] ~ weeks, data=all_data_plant)
If I tried the same model with the "test_par[i]" replaced with the variable's explicit name, it works just as expected:
model <- lm(weight_dry ~ weeks, data=all_data_plant)
I tried reformulate and paste ineffectively. Any thoughts?
Maybe try something like this:
n <- #add the column position of first variable
m <- #add the column position of last variable
lm_models <- lapply(n:m, function(x) lm(all_data_plant[,x] ~ weeks, data=all_data_plant))
You can pass the "formula" argument of lm() as a character string built with paste(). Here is a working example:
data("trees")
test_par <- names(trees)
model <- lm(Girth ~ Height, data = trees)
model <- lm("Girth ~ Height", data = trees) # character formula works
model <- lm(paste(test_par[1], "~ Height"), data=trees)
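Since the question mentions reformulate(), here is a rough sketch of how it could be used in the loop, assuming the response names are held in a character vector. The built-in trees data stands in for all_data_plant, and the names responses and models are made up for illustration.

```r
data("trees")

# Hypothetical vector of response names to loop over.
responses <- c("Girth", "Volume")

# Build "<response> ~ Height" for each name and fit one model per response.
models <- lapply(responses, function(resp) {
  lm(reformulate("Height", response = resp), data = trees)
})
names(models) <- responses

# Slope on Height from each fitted model.
sapply(models, function(m) coef(m)[["Height"]])
```

The original attempt most likely fails because test_par[i] evaluates to a character string rather than to the column it names. Building the formula from the name avoids that.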

Include an object within a function only if it exists

I have a loop that needs to be executed; within which are 6 models. The objects that those models are stored in then need to get passed into a function that executes an AIC analysis. However, sometimes one of the models does not work, which then breaks the code for the AIC function because it does not recognize whatever model that failed because it was not stored as an object.
So, I need a way to pull those models that worked into the AIC function.
Here is an example, but keep in mind it is important that this can all be executed within a loop. Here are three hypothetical models:
hn.1 <- ds(data)
hn.1.obs <- ds(data,formula = ~OBSCODE)
hn.1.obs.mas <- ds(data, formula = ~OBSCODE+MAS)
And this would be my AIC function that compares the models:
summarize_ds_models(hn.1, hn.1.obs, hn.1.obs.mas)
But I get an error if say, the hn.1.obs.mas model failed.
I tried to use "get" and "ls" and I successfully pull the models that exist when I call:
get(ls(pattern='hn.15*'))
But that just returns a character vector, so that when I call:
summarize_ds_models(get(ls(pattern='hn.15*')))
it only conducts the AIC analysis on the first model in the above character vector.
Am I on the right track or is there a better way to do this?
UPDATE with a reproducible example.
Here is a simplified version of my problem:
Create and fill two data frames that will be put into a list:
data.frame <- data.frame(x = integer(4),
y = integer(4),
z = integer(4),
i = integer(4))
data.frame$x <- c(1,2,3,4)
data.frame$y <- c(1,4,9,16)
data.frame$z <- c(1,3,8,10)
data.frame$i <- c(1,5,10,15)
data.frame.2 <- data.frame[1:4,1:3]
my.list <- list(data.frame,data.frame.2)
Create a data frame to fill with the best models from the AIC analyses:
bestmodels <- data.frame(modelname = character(2))
Here is the function that will run the loop:
myfun <- function(list) {
for (i in 1:length(my.list)){
mod.1 = lm(y ~ x, data = my.list[[i]])
mod.2 = lm(y ~ x + z, data = my.list[[i]])
mod.3 = lm(y ~ i, data = my.list[[i]])
bestmodels[i,1] <- rownames(AIC(mod.1,mod.2,mod.3))[1]#bestmodel is 1st row
}
print(bestmodels)
}
However, on the second iteration of the loop, the AIC function will fail because mod.3 will fail. So, is there a generic way to make it so the AIC function will only execute for those models that worked? The outcome I would want here would be:
> bestmodels
modelname
1 mod.1
2 mod.1
since mod.1 would be chosen for both AIC analyses.
Gregor's comment:
Use a list instead of individual named objects. Then do.call(summarize_ds_models, my_list_of_models). If it isn't done already, you can Filter the list first to make sure only working models are in the list.
solved my problem. Thanks
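For anyone landing here later, this is roughly how the list approach from that comment could look with lm() and AIC(). It is only a sketch: fit_or_null is a made-up helper, the data are the second data frame from the example (no column i), and sapply(models, AIC) is used in place of do.call() so that the model names are kept.

```r
# The second data frame from the example above: it has no column i.
dat <- data.frame(x = c(1, 2, 3, 4),
                  y = c(1, 4, 9, 16),
                  z = c(1, 3, 8, 10))

# Candidate models kept as a named list of formulas.
formulas <- list(mod.1 = y ~ x,
                 mod.2 = y ~ x + z,
                 mod.3 = y ~ i)   # fails here, assuming no object i exists

# Hypothetical helper: return the fitted model, or NULL if fitting fails.
fit_or_null <- function(f, data) {
  tryCatch(lm(f, data = data), error = function(e) NULL)
}

models <- lapply(formulas, fit_or_null, data = dat)

# Keep only the models that actually fitted.
models <- Filter(Negate(is.null), models)

# AIC per surviving model, then the name of the smallest one.
aic_values <- sapply(models, AIC)
best <- names(which.min(aic_values))
```

For a function that needs the models as separate arguments, do.call(summarize_ds_models, models) on the filtered list would be the equivalent step, as the comment suggests.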

using non-standard evaluation with formula Bis

Non-Standard evaluation in R. I want to send a formula to a function that uses lm.
I have a data frame with one response, y, and many predictors. I will fit a model inside a function. The function will receive a filter criterion as a string and the name of the predictor variable as a string. The response will enter as a name. The function will filter on the filter criterion, then fit a model using the predictor variable that was sent to it as a string. I can't get the predictor string to work correctly.
This is very close to using non-standard evaluation with formula.
In fact I illustrate that solution, which gets me part of the way there.
Difference: I want to send a string with the value of my predictor instead of sending the predictor to the function.
Use case: Eventually I will put this in a Shiny app and let the user select the predictor and response as well as the filter.
Here is what works:
library(dplyr) # provides tibble(), filter(), %>% and enquo()
# Create a data frame.
n <- 100
levels_1 <- sample(c("a","b","c"),n,replace=TRUE)
levels_2 <- sample(c("a","b","c"),n,replace=TRUE)
d <-tibble(l_1 = levels_1 ,l_2 = levels_2, y = rnorm(n))
# A function that works
my_lm <- function(d,predictor,response,filter_criteria){
d1 <- d %>% filter(l_2 == 'a')
lm(y ~ l_1,data=d1)
}
my_lm(d,l_1,y,'a')
my_lm2 <- function(d,predictor,response,filter_criteria){
enquo_predictor <- enquo(predictor)
enquo_response <- enquo(response)
enquo_filter_criteria <- enquo(filter_criteria)
d1 <- d %>% filter(l_2 == !!filter_criteria)
form <- as.formula(paste(enquo_response, " ~ ", predictor)[2])
# form <- as.formula(paste(enquo_response, " ~ ", enquo_predictor)[2]) wrong way to do it.
lm(form,data=d1)
#lm(!!enqu_preditor ~ !!enquo_response,data=d1)
}
selected_var <- names(d)[1]
selected_var
filter_value <- 'a'
my_lm2(d,l_1,y,filter_value) # This works but is not what I want.
my_lm2(d,selected_var,y,filter_value) # This does not work but is what I want to work.
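One possible way to get the string version working is to pass both the predictor and the response as strings and build the formula with reformulate(). The sketch below uses base R only and a plain data frame instead of a tibble. Note that my_lm_str is a made-up name, and that taking the response as a string is an assumption that differs from the setup above.

```r
# Same kind of data as above, as a plain data frame.
set.seed(1)
n <- 100
d <- data.frame(
  l_1 = sample(c("a", "b", "c"), n, replace = TRUE),
  l_2 = sample(c("a", "b", "c"), n, replace = TRUE),
  y   = rnorm(n)
)

# Hypothetical helper: predictor and response are column names given as strings.
my_lm_str <- function(d, predictor, response, filter_value) {
  d1 <- d[d$l_2 == filter_value, , drop = FALSE]      # filter on l_2
  form <- reformulate(predictor, response = response) # e.g. y ~ l_1
  lm(form, data = d1)
}

selected_var <- names(d)[1]   # "l_1"
filter_value <- "a"
fit <- my_lm_str(d, selected_var, "y", filter_value)
formula(fit)
```

In a Shiny app the selected inputs arrive as strings anyway, so no quoting or unquoting should be needed with this design.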

lm function gives estimate for the y-variable also

I am trying to run a simple lm model. I am using the following
library(data.table)
dt <- data.table(
y=rnorm(100,0,1),
x1=rnorm(100,0,1),
x2=rnorm(100,0,1),
x3=rnorm(100,0,1))
y_var2 <- names(dt)[names(dt)%like%"y"]
x_var2 <- names(dt)[names(dt)%like%"x"]
tmp2 <- summary(a <- lm(get(y_var2)~.,dt[,c(x_var2,y_var2),with=F]))
coefs2 <- as.data.table(tmp2$coefficients,keep.rownames = T)
So in the end, coefs2 should contain the estimates, p-values etc. But in the last row of coefs2 I also see the y-variable.
But if I use
tmp2 <- summary(a <- lm(y~.,dt[,c(x_var2,y_var2),with=F]))
Then this does not happen. Why is that?
This comes down to how R expands the dot in a formula. y_var2 is the character string "y", and with get(y_var2) on the left-hand side the response term is the expression get(y_var2), which R does not recognize as the column y. The . on the right-hand side therefore expands to every column of the data, including y. You have to give R the formula y ~ . rather than get("y") ~ ., which are two different expressions for R.
lm( formula(paste(y_var2,"~.")),dt[,c(x_var2,y_var2),with=F])
will do the trick. formula() builds a formula object from the string that paste() constructs.
Actually it would probably be cleaner just to make the formula with reformulate() and the data= parameter of lm:
tmp2 <- summary(a <- lm(reformulate(x_var2, y_var2), dt))
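To see the difference directly, here is a small base R sketch with a plain data frame instead of a data.table. The names df, y_var, x_vars, fit_get and fit_ref are made up for illustration.

```r
set.seed(42)
df <- data.frame(y  = rnorm(100),
                 x1 = rnorm(100),
                 x2 = rnorm(100),
                 x3 = rnorm(100))

y_var  <- "y"
x_vars <- c("x1", "x2", "x3")

# With get() on the left-hand side, the dot expands to all columns,
# so y itself shows up among the coefficients.
fit_get <- lm(get(y_var) ~ ., data = df)
names(coef(fit_get))

# With a formula built from the names, only x1, x2 and x3 are predictors.
fit_ref <- lm(reformulate(x_vars, response = y_var), data = df)
names(coef(fit_ref))
```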

How to use one variable in regression with many independent variables in lm()

I need to reproduce this code using all of these variables.
composite <- read.csv("file.csv", header = T, stringsAsFactors = FALSE)
composite <- subset(composite, select = -Date)
model1 <- lm(indepvariable ~., data = composite, na.action = na.exclude)
composite is a data frame with 82 variables.
UPDATE:
What I have done is found a way to create an object that contains only the significantly correlated variables, to narrow the number of independent variables down.
I now have a variable, sigvars, which holds the names from an object that sorts a correlation matrix and picks out only the variables with correlation coefficients > 0.5 or < -0.5. Here is the code:
sortedcor <- sort(cor(composite)[,1])
regvar = NULL
k = 1
for(i in 1:length(sortedcor)){
if(sortedcor[i] > .5 | sortedcor[i] < -.5){
regvar[k] = i
k = k+1
}
}
regvar
sigvars <- names(sortedcor[regvar])
However, it is not working in my lm() function:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, : variable lengths differ (found for 'sigvars')
Think about what sigvars is for a minute...?
After sigvars <- names(sortedcor[regvar]), sigvars is a character vector of column names. Say your data have 100 rows and 5 variables come out as significant using the method you've chosen (which doesn't sound overly defensible to me). The model formula you are using will result in composite[, 1] being a vector of length 100 (100 rows) and sigvars being a character vector of length 5, so the two sides of the formula have different lengths, which is what the error message reports.
Assuming you have the variables you want to include in the model, then you could do:
form <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(form, data = composite)
or
model1 <- lm(composite[,1] ~ ., data = composite[, sigvars])
In the latter case, do yourself a favour and write the name of the dependent variable into the formula instead of composite[,1].
Also, you don't seem to have appreciated the difference between [i] and [i,j] for data frames, hence you are doing data.matrix(composite[1]) which is taking the first component of composite, leaving it as a data frame, then converting that to a matrix via the data.matrix() function. All you really need is just the name of the dependent variable on the LHS of the formula.
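Putting the pieces together, the selection loop and the model fit could be sketched as below. This is only an illustration on simulated data: the columns resp, x1, x2 and x3 are made up, and the 0.5 cutoff is the one from the question. The setdiff() call removes the response itself from sigvars, since its correlation with itself is 1 and it would otherwise pass the cutoff.

```r
# Simulated stand-in for composite: the response is the first column.
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
composite <- data.frame(resp = 2 * x1 - 2 * x2 + rnorm(n, sd = 0.5),
                        x1 = x1, x2 = x2, x3 = x3)

response <- names(composite)[1]

# Correlation of every column with the response.
cors <- cor(composite)[, response]

# Names with |correlation| > 0.5, excluding the response itself.
sigvars <- setdiff(names(cors)[abs(cors) > 0.5], response)

# Build resp ~ <selected variables> and fit.
form   <- reformulate(sigvars, response = response)
model1 <- lm(form, data = composite)
form
```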
The error is here:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
sigvars is just a character vector of names. A model is usually specified in the form lm(var1 ~ var2 + var3 + var4); what you have is effectively lm(var1 ~ var2 var3 var4).
Hopefully that helps.
