Loop for multiple linear regression - r

Hi, I'm starting to use R and am stuck on analyzing my data. I have a data frame with 80 columns. Column 1 is the dependent variable and columns 2 to 80 are the independent variables. I want to run 78 multiple linear regressions, keeping the first independent variable (column 2) fixed in every model, and save all the regressions in a list so that I can later compare the models using AIC scores. How can I do it?
Here is my loop
data.frame
for(i in 2:80)
{
Regressions <- lm(data.frame$column1 ~ data.frame$column2 + data.frame [,i])
}

Using the iris dataset as an example you can do:
lapply(seq_along(iris)[-c(1:2)], function(x) lm(data = iris[,c(1:2, x)]))
[[1]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Length
2.2491 0.5955 0.4719
[[2]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Width
3.4573 0.3991 0.9721
[[3]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Speciesversicolor Speciesvirginica
2.2514 0.8036 1.4587 1.9468
This works because when you pass a dataframe to lm() without a formula it applies the function DF2formula() under the hood which treats the first column as the response and all other columns as predictors.
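A small sketch of that behaviour, assuming only base R and the built-in iris data: formula() applied to a data frame shows the formula that lm() will use, and the resulting fit matches the one obtained from an explicit formula. The object names (three_cols, implied, fit_implicit, fit_explicit) are made up for illustration.

```r
# Response (column 1), fixed predictor (column 2), one extra predictor (column 3)
three_cols <- iris[, c(1:2, 3)]

# Formula derived from the data frame: first column ~ all remaining columns
implied <- formula(three_cols)
implied

# Fitting from the data frame alone vs. from an explicit formula
fit_implicit <- lm(three_cols)
fit_explicit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
all.equal(coef(fit_implicit), coef(fit_explicit))
```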

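The original loop overwrites Regressions on every iteration, so only the last model survives. With the for loop we can initialize a list to store the output, and loop over columns 3 to 80 only, because column 2 is already fixed in every model
nm1 <- names(df1)[3:80]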
Regressions <- vector('list', length(nm1))
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(reformulate(c("column2", nm1[i]), "column1"), data = df1)
}
Or use paste instead of reformulate
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(as.formula(paste0("column1 ~ column2 + ",
nm1[i])), data = df1)
}
Using a reproducible example
nm2 <- names(iris)[3:5]
Regressions2 <- vector('list', length(nm2))
for(i in seq_along(Regressions2)) {
Regressions2[[i]] <- lm(reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"), data = iris)
}
Regressions2[[1]]
#Call:
#lm(formula = reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"),
# data = iris)
#Coefficients:
# (Intercept) Sepal.Width Petal.Length
# 2.2491 0.5955 0.4719
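Since the goal in the question is to compare the models by AIC, here is a minimal sketch of that last step on the same iris example. It rebuilds the list with lapply() so the block is self-contained; aic_scores is a name introduced here for illustration.

```r
# Fit one model per extra predictor, keeping Sepal.Width fixed
nm2 <- names(iris)[3:5]
Regressions2 <- lapply(nm2, function(v) {
  lm(reformulate(c("Sepal.Width", v), "Sepal.Length"), data = iris)
})
names(Regressions2) <- nm2

# One AIC value per model, named after the varying predictor
aic_scores <- sapply(Regressions2, AIC)
sort(aic_scores)               # lowest AIC first
names(which.min(aic_scores))   # predictor of the best-scoring model
```

Naming the list elements after the varying predictor makes it easier to see which model each score belongs to.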

Related

Auto VIF (variance inflation factor) for variable analysis

I want to calculate an "Auto" VIF for variables, one vs. the others. For example, in the iris dataset, I want Sepal.Width to act as the target and the others as explanatory variables.
First I remove the Species column, so only the numeric variables remain. Then I want to loop over each variable and test it against the others. Finally, I want the VIF results to be stored in a list.
This is what I have tried:
library(car)
library(dplyr)
library(tidyr)
iris_clean <- iris %>%
select(-Species)
col_names <- colnames(iris)
i <- 1
for(col in col_names) {
regr <- lm(col ~ ., data=iris_clean)
list[i] <- vif(regr)
i <- i+1
}
For some reason I get an error:
Error in model.frame.default(formula = col ~ ., data = iris_clean, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Sepal.Length')
Which I don't understand because the variables have the same length. Please, any help will be greatly appreciated.
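You can try this. In the original loop, col is a character string, so lm(col ~ ., ...) sees a length-1 variable instead of a column (hence "variable lengths differ"). In addition, col_names still includes Species, and list[i] tries to assign into the base function list rather than into a list object. Building the formula from the column name and storing each result in a preallocated list addresses all three: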
library(car)
library(dplyr)
library(tidyr)
iris_clean <- iris %>% select(-Species)
col_names <- colnames(iris_clean)
result <- vector('list', length(col_names))
for(i in seq_along(col_names)) {
regr <- lm(paste0(col_names[i], '~ .'), data=iris_clean)
result[[i]] <- vif(regr)
}
result
#[[1]]
# Sepal.Width Petal.Length Petal.Width
# 1.270815 15.097572 14.234335
#[[2]]
#Sepal.Length Petal.Length Petal.Width
# 4.278282 19.426391 14.089441
#[[3]]
#Sepal.Length Sepal.Width Petal.Width
# 3.415733 1.305515 3.889961
#[[4]]
#Sepal.Length Sepal.Width Petal.Length
# 6.256954 1.839639 7.557780
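As a cross-check that does not need car, the VIF of a predictor can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors. The sketch below does this for the model with Sepal.Length as the response; manual_vif() is a hypothetical helper written for this example, and the values should agree with the first element of result above.

```r
# Numeric columns only, as in the answer above
iris_num <- iris[, setdiff(names(iris), "Species")]

# Hypothetical helper (not part of car): VIF of `target` among `predictors`
manual_vif <- function(target, predictors, data) {
  others <- setdiff(predictors, target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

# Sepal.Length is the response, so the remaining columns are the predictors
preds <- setdiff(names(iris_num), "Sepal.Length")
vif_manual <- sapply(preds, manual_vif, predictors = preds, data = iris_num)
vif_manual
```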

Residuals from R regression of a data subset

I would like to get regression residuals, but only for a subset of the data:
My R code:
reg = lm(Y ~ X1+X2+.....+Xn,data=fic)
step_reg = step(reg, direction= "both")
summary(step_reg)
fic is a dataframe with n columns called X1, X2, ...Xn.
To get all residuals: step_reg$residuals
But I would like to get residuals only for the rows that meet a criterion, for example X1 == 'xxxx'
What could be the solution, please?
You can use the data you used for the regression to subset the residuals like:
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
step_reg <- step(reg, direction= "both")
step_reg$residuals[iris$Species=="setosa"]
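In case there are missing values, lm() drops the incomplete rows by default, so the residual vector is shorter than the data. Its names are the row names of the rows that were kept, so you can use them to look up the matching rows:
x <- iris
x[1,2] <- NA
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=x)
reg$residuals[x[names(reg$residuals), "Species"] == "setosa"]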
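An alternative sketch for the missing-value case, assuming you are free to change how the model is fitted: with na.action = na.exclude, residuals() returns one value per original row (NA for the dropped rows), so a plain logical index on the data works again. The names reg_ex, res_all and res_setosa are introduced here for illustration.

```r
x <- iris
x[1, 2] <- NA   # introduce one missing predictor value

# na.exclude keeps track of dropped rows so residuals() can pad them with NA
reg_ex <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
             data = x, na.action = na.exclude)

res_all <- residuals(reg_ex)              # same length as nrow(x)
res_setosa <- res_all[x$Species == "setosa"]
head(res_setosa)
```

Note that this relies on the residuals() extractor; reg_ex$residuals still holds only the rows that were actually used in the fit.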

R: subsetting within a function

Suppose I have a data frame in the environment, mydata, with three columns, A, B, C.
mydata = data.frame(A=c(1,2,3),
B=c(4,5,6),
C=c(7,8,9))
I can create a linear model with
lm(C ~ A, data=mydata)
I want a function to generalize this, to regress B or C on A, given just the name of the column, i.e.,
f = function(x){
lm(x ~ A, data=mydata)
}
f(B)
f(C)
or
g = function(x){
lm(mydata$x ~ mydata$A)
}
g(B)
g(C)
These solutions don't work. I know there is something wrong with the evaluation, and I have tried permutations of quo() and enquo() and !!, but no success.
This is a simplified example, but the idea is, when I have dozens of similar models to build, each fairly complicated, with only one variable changing, I want to do so without repeating the entire formula each time.
If we want to pass an unquoted column name, an option is {{}} from the tidyverse. With select, the function accepts both a string and an unquoted name
library(dplyr)
printcol2 <- function(data, x) {
data %>%
select({{x}})
}
printcol2(mydata, A)
# A
#1 1
#2 2
#3 3
printcol2(mydata, 'A')
# A
#1 1
#2 2
#3 3
If the OP wants to pass an unquoted column name through to lm
f1 <- function(x){
rsp <- deparse(substitute(x))
fmla <- reformulate("A", response = rsp)
out <- lm(fmla, data=mydata)
out$call <- as.symbol(paste0("lm(", deparse(fmla), ", data = mydata)"))
out
}
f1(B)
#Call:
#lm(B ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 3 1
f1(C)
#Call:
#lm(C ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 6 1
Maybe you are looking for deparse(substitute(.)), which turns the unquoted argument into a string that can be pasted into a formula.
f = function(x, data = mydata){
y <- deparse(substitute(x))
fmla <- paste(y, 'Species', sep = '~')
lm(as.formula(fmla), data = data)
}
mydata <- iris
f(Sepal.Length)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 5.006 0.930 1.582
f(Petal.Width)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 0.246 1.080 1.780
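If passing the column name as a string is acceptable, no non-standard evaluation is needed at all, and fitting many models becomes a plain lapply(). The following is a minimal sketch using the three-column mydata from the question; fit_on_A() and models are names made up for this example.

```r
mydata <- data.frame(A = c(1, 2, 3),
                     B = c(4, 5, 6),
                     C = c(7, 8, 9))

# Hypothetical helper: regress the column named `response` on A
fit_on_A <- function(response, data) {
  fmla <- reformulate("A", response = response)
  # bquote() splices the finished formula into the call,
  # so the printed Call shows e.g. B ~ A instead of `fmla`
  eval(bquote(lm(.(fmla), data = data)))
}

# One model per response column, stored in a named list
models <- lapply(c(B = "B", C = "C"), fit_on_A, data = mydata)
coef(models$B)
coef(models$C)
```

Only the response name changes between calls, so the rest of the formula is written once inside the helper.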
I think generally, you might be looking for:
printcol <- function(x){
print(x)
}
printcol(mydata$A)
This doesn't involve any fancy evaluation, you just need to specify the variable you'd like to subset in your function call.
This gives us:
[1] 1 2 3
Note that you're only printing the vector A, and not actually subsetting column A from mydata.

How to convert `c('a','b')` to `cbind(a,b)` in R formula?

myfun<-function(c('a','b'),c('g'),df){
manova(cbind(a,b)~g,data=df)
}
myfun(c('Sepal.Length','Sepal.Width'),c('Species'),iris)
If I want myfun(c('Sepal.Length','Sepal.Width'),c('Species'),iris) to return the manova result, I need to revise myfun.
I tried but failed:
myfun<-function(var,group,df){
manova(as.formula(cbind(print(var,quote = FALSE))~group),data=df)
}
I don't know how to convert c('a','b') to cbind(a,b). Any thoughts?
Thanks for any answer in advance.
We can create a formula with paste
myfun <- function(colnms1, group, dat) {
fmla <- as.formula(paste0("cbind(",
paste(colnms1, collapse=","), ")", " ~ ", group))
mva <- manova(fmla, data = dat)
mva$call <- fmla
mva
}
myfun(c('Sepal.Length','Sepal.Width'), 'Species' ,iris)
#Call:
# cbind(Sepal.Length, Sepal.Width) ~ Species
#Terms:
# Species Residuals
#Sepal.Length 63.21213 38.95620
#Sepal.Width 11.34493 16.96200
#Deg. of Freedom 2 147
#Residual standard errors: 0.5147894 0.3396877
#Estimated effects may be unbalanced
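A variant of the same idea that builds the cbind(...) call from symbols rather than pasted text, which also copes with column names that are not syntactically valid. This is only a sketch; manova_by_name() is a hypothetical name, and the function assumes that every entry of vars and group is a column of data.

```r
# Hypothetical helper: MANOVA of the columns in `vars` against `group`
manova_by_name <- function(vars, group, data) {
  # Build the call cbind(var1, var2, ...) from the column names
  lhs <- as.call(c(as.name("cbind"), lapply(vars, as.name)))
  fmla <- reformulate(group, response = lhs)
  # Splice the formula into the call so it is shown in the printed Call
  eval(bquote(manova(.(fmla), data = data)))
}

fit <- manova_by_name(c("Sepal.Length", "Sepal.Width"), "Species", iris)
summary(fit)
```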

Extracting a list of R2 from within lm() based on variable in multiple regression in R

I've performed a multiple regression analysis on a dataset in R using lm() and I am able to extract the coefficients for each day of year using the function below. I would also like to extract the R2 for each day of year but this doesn't seem to work in the same way.
This is pretty much the same question as:
Print R-squared for all of the models fit with lmList
but when I try this I get 'Error: $ operator is invalid for atomic vectors'. I would also like to include it in the same function if possible. How can I extract the R2 for each doy in this way?
#Create MR function for extracting coefficients
getCoef <- function(df) {
coefs <- lm(y ~ T + P + L + T * L + P * L, data = df)$coef
names(coefs) <- c("intercept", "T", "P", "L", "T_L", "P_L")
coefs
}
#Extract coefficients for each doy
coefs.MR_uM <- ddply(MR_uM, ~ doy, getCoef)
The point is that r.squared is stored in the object returned by summary(lm(...)), not in the lm(...) fit itself. Here is another version of your function to extract R2:
library(plyr)
df <- iris
#Create MR function for extracting coefficients and R2
getCoef <- function(df) {
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = df)
coefs <- model$coef
names(coefs) <- c("intercept", "Sepal.Width", "Petal.Length", "Petal.Width")
R2 <- summary(model)$r.squared
names(R2) <- c("R2")
c(coefs, R2)
}
#Extract coefficients and R2 for each Species
coefs.MR_uM <- ddply(df, ~ Species, getCoef)
coefs.MR_uM # output
Species intercept Sepal.Width Petal.Length Petal.Width R2
1 setosa 2.351890 0.6548350 0.2375602 0.2521257 0.5751375
2 versicolor 1.895540 0.3868576 0.9083370 -0.6792238 0.6050314
3 virginica 0.699883 0.3303370 0.9455356 -0.1697527 0.7652193
As suggested by Parfait, you don't need plyr::ddply(), you can use do.call(rbind, by(df, df$Species, getCoef))
Hope this helps !
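For completeness, a base R sketch of the same per-group extraction using split() and lapply(), with no plyr dependency. The names get_fit_stats and stats_by_species are introduced here for illustration; the result is a matrix with one row per species.

```r
# Coefficients plus R2 for one group's data
get_fit_stats <- function(d) {
  model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d)
  c(coef(model), R2 = summary(model)$r.squared)
}

# Split by group, fit each piece, then stack the named vectors row-wise
stats_by_species <- do.call(rbind, lapply(split(iris, iris$Species), get_fit_stats))
stats_by_species
```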

Resources