Looping regression commands - r

I'm going to simplify my problem as much as possible to show that I'm not just throwing my assignment at you; I really want to learn how to get a loop to work with regressions.
Let's suppose I want to run two OLS regressions, but I don't want to type the same lm command twice or add another series of commands to my script. This is because a) I actually have far more than two regressions and b) I want to code this as efficiently as I can (I have tried copying and pasting the same OLS commands). Also, I'm not just running a simple OLS: I also apply a HAC estimator depending on the serial correlation and heteroskedasticity tests.
The code that I have come up with so far is:
Packages
if (!require("lmtest")) install.packages("lmtest")
library("lmtest")
if (!require("sandwich")) install.packages("sandwich")
library("sandwich")
Data
data<-read.csv(file.choose())
x1<-data$x1
x2<-data$x2
x3<-data$x3
x4<-data$x4
x5<-data$x5
x6<-data$x6
x7<-data$x7
y1<-data$y1
Regressions
reg1<-(y1 ~ x1 + x2 + x3 + x4)
reg2<-(y1 ~ x2 + x4 + x6 + x7)
p<-0.05
Loop
for (i in 1:2) {
  # OLS #
  ols[i] <- lm(reg[i])
  # Breusch-Pagan test #
  bptest(ols[i], varformula = NULL, studentize = TRUE)
  bpp <- bptest(ols[i])$p.value
  if (bpp > p) hs <- F else hs <- T
  # Breusch-Godfrey serial correlation test #
  bgtest(ols[i], order = 2, order.by = NULL, type = c("Chisq"))
  bgp <- bgtest(ols[i])$p.value
  if (bgp > p) sc <- F else sc <- T
  # HAC estimator #
  HAC <- vcovHAC(ols[i], order.by = NULL, prewhite = FALSE, adjust = TRUE, diagnostics = FALSE, sandwich = TRUE, ar.method = "ols")
  if (sc == T | hs == T) coeftest(ols[i], vcov. = HAC) else ols[i]
  if (sc == T | hs == T) write.csv(coeftest(ols[i], vcov. = HAC), file = "ols[i]HAC.csv") else write.csv(summary(ols[i])$coefficient, file = "ols1.csv")
}
When I run this I get
Error in stats::model.frame(formula = reg[i], drop.unused.levels = TRUE) : object 'reg' not found
I have also tried the above code with
for (i in reg[1]:reg[2]) {
}
but it only returned
Error: object 'reg' not found.
Where did I go wrong?

This is too long for a comment, so I'm posting it as a partial answer.
The only thing that differs between your regressions is the formula, and you are asking for a way to make your code more efficient. One way is to store the formulae in a list and then iterate over that list with lapply. For instance:
reg <- list(
  reg1 = as.formula(y1 ~ x1 + x2 + x3 + x4),
  reg2 = as.formula(y1 ~ x2 + x4 + x6 + x7)
)
ols <- lapply(reg, function(x) lm(x, data=data))
Here, ols is a list of two elements, each of which is a fitted regression corresponding to one formula in the list. You can use the same principle for other functions, for instance:
bgtests <- lapply(ols, function(x)
  bgtest(x, order = 2, order.by = NULL, type = c("Chisq")))
This executes your bgtest function for each regression stored in ols. In a similar fashion, you can write it up so that it executes your heteroskedasticity corrections etc. (see the sketch below). The important point is this: you supply a list to lapply, and each element of that list is passed on to the function that you provide. The output of lapply is then a list with the output of that function.
In case you don't want to use lapply, and to address your actual question: the problem in your code is that there is no object called reg. Subsetting a non-existent object such as reg[1] therefore does not work. If you execute the first lines of my code above, reg[1] and reg[2] become defined, so your loop should work.
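To illustrate the lapply approach end to end, here is a minimal sketch (assuming lmtest and sandwich are loaded and p <- 0.05, as in the question) that folds your test-and-correct logic into one function:
results <- lapply(ols, function(m) {
  hs <- bptest(m)$p.value <= p               # heteroskedasticity detected?
  sc <- bgtest(m, order = 2)$p.value <= p    # serial correlation detected?
  if (hs || sc) {
    coeftest(m, vcov. = vcovHAC(m))          # HAC-corrected coefficient table
  } else {
    summary(m)$coefficients                  # plain OLS coefficient table
  }
})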

The 'get' function is what you want, in conjunction with 'paste'. Below I fit two regressions using the cars data in R. Then I write a loop that extracts their coefficients. The 'get' function goes and finds the object that matches the object name you specify.
> (reg1 <- lm(dist ~ speed, data = cars))
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept) speed
-17.579 3.932
> (reg2 <- lm(dist ~ 1 + I(speed^2), data = cars))
Call:
lm(formula = dist ~ 1 + I(speed^2), data = cars)
Coefficients:
(Intercept) I(speed^2)
8.860 0.129
> coeff <- matrix(0, nrow = 2, ncol = 2)
> for (i in 1:2)
+ {
+
+ # Main step
+ model <- get(paste("reg", i, sep = ""))
+ coeff[i,] <- coefficients(model)
+ }
> coeff
[,1] [,2]
[1,] -17.579095 3.9324088
[2,] 8.860049 0.1289687
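To tie this back to the question's setup, a rough sketch (assuming reg1 and reg2 are the formula objects defined in the question and data holds the columns they reference):
for (i in 1:2) {
  f <- get(paste("reg", i, sep = ""))  # fetches the object named reg1, then reg2
  fit <- lm(f, data = data)
  print(summary(fit)$coefficients)
}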

Related

How to run regression when formula is given by a string?

Let's consider the following data:
set.seed(42)
y <- runif(100)
df <- data.frame("Exp" = rexp(100), "Norm" = rnorm(100), "Wei" = rweibull(100, 1))
I want to perform a linear regression, but the formula is given as a string in the format:
form <- "Exp + Norm + Wei"
I thought that I only had to use:
as.formula(lm(y~form, data = df))
However, it's not working. The error is about differing variable lengths (it seems that lm still treats form as a character vector of length 1, but I have no idea why).
Do you know how I can do it?
We can use paste to construct the formula and use it directly in lm:
lm(paste('y ~', form), data = df)
Output:
#Call:
#lm(formula = paste("y ~", form), data = df)
#Coefficients:
#(Intercept) Exp Norm Wei
# 0.495861 0.026988 0.046689 0.003612
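Equivalently, the string can first be converted to a formula object explicitly; as.formula was the right idea in the question, but it has to wrap the pasted string rather than the lm call:
f <- as.formula(paste("y ~", form))  # "y ~ Exp + Norm + Wei" as a formula object
lm(f, data = df)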

R order lapply output from a function with multiple outputs by variable (column) rather than by function

I have a function in R that calls multiple other functions, including a custom one. I then use lapply to run the combined function across multiple variables. However, when the output is produced, it is in the order of
function1: variable a, variable b, variable c
function2: variable a, variable b, variable c
What I would like is for it to be the other way around:
variable a: function 1, function 2...
variable b: function 1, function 2...
I have recreated an example below using the mtcars dataset, with number of cylinders as a predictor variable, and vs and am as outcome variables.
library(datasets)
library(tidyverse)
library(skimr)
library(car)
data(mtcars)
mtcars_binary <- mtcars %>%
  dplyr::select(cyl, vs, am)
# logistic regression function
logistic.regression <- function(logmodel) {
  dev <- logmodel$deviance
  null.dev <- logmodel$null.deviance
  modelN <- length(logmodel$fitted.values)
  R.lemeshow <- 1 - dev / null.dev
  R.coxsnell <- 1 - exp(-(null.dev - dev) / modelN)
  R.nagelkerke <- R.coxsnell / (1 - (exp(-(null.dev / modelN))))
  cat("Logistic Regression\n")
  cat("Hosmer and Lemeshow R^2 ", round(R.lemeshow, 3), "\n")
  cat("Cox and Snell R^2 ", round(R.coxsnell, 3), "\n")
  cat("Nagelkerke R^2", round(R.nagelkerke, 3), "\n")
}
# all logistic regression results
log_regression_tests1 <- function(df_vars, df_data) {
  glm_summary <- glm(df_data[, df_vars] ~ df_data[, 1], data = df_data, family = binomial, na.action = "na.omit")
  glm_print <- print(glm_summary)
  log_results <- logistic.regression(glm_summary)
  blr_coefficients <- exp(glm_summary$coefficients)
  blr_confint <- exp(confint(glm_summary))
  list(glm_summary = glm_summary, glm_print = glm_print, log_results = log_results, blr_coefficients = blr_coefficients, blr_confint = blr_confint)
}
log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE)
log_regression_results1
When I do this, the output is being produced as:
glm_summary: vs, am
log_results: vs, am
etc. etc.
What I would like is for the output to be ordered as:
vs: all function outputs
am: all function outputs
In addition, when I run the line log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE), I see only the printed results of the logistic regression function; the remaining output appears only when I print log_regression_results1. Could anyone explain why?
Finally, the glm_summary step is not producing all of the output it should. When I run the functions independently on a single variable, like so,
glm_vs <- glm(vs ~ cyl, data = mtcars_binary, family = binomial, na.action = "na.omit")
summary(glm_vs)
logistic.regression(glm_vs)
exp(glm_vs$coefficients)
exp(confint(glm_vs))
it also produces the standard error, z value, and p value for summary(glm_vs), which it does not do when embedded in the function, even though I have glm_print <- print(glm_summary) included. Is there a way to get the output of the full summary function within the log_regression_tests1 function?
When I run your code up to log_regression_results1, I get exactly what you asked for:
summary(log_regression_results1)
Length Class Mode
vs 5 -none- list
am 5 -none- list
Maybe you meant to ask the other way round?
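On the last sub-question (the missing standard errors, z values, and p values): print(glm_summary) prints the glm object itself, not its summary, and storing print()'s invisible return value does not capture the full table. A minimal sketch of the change that would keep it, assuming that is what is wanted (log_regression_tests2 is a hypothetical name):
log_regression_tests2 <- function(df_vars, df_data) {
  glm_fit <- glm(df_data[, df_vars] ~ df_data[, 1], data = df_data,
                 family = binomial, na.action = "na.omit")
  # store the summary object itself rather than print()'s return value
  glm_full_summary <- summary(glm_fit)  # keeps SEs, z values, and p values
  list(glm_fit = glm_fit, glm_full_summary = glm_full_summary)
}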

Is there a function for substituting (or removing) explanatory variables in a linear model (lm)?

I have a linear model with many explanatory (independent) variables:
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the machine to find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula, and then rerun lm with the new formula:
model <- lm(y ~ x1 + x3 + ... + x100)
I have already learned how to retrieve the name of the explanatory variable with the highest VIF:
max_vif <- function(x) {  # vif() comes from the car package
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
But I still don't understand how to find the needed explanatory variable, delete it from the formula, and run the function again.
We can use the update function and paste in the variable that needs to be removed. We first fit a model and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. with whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x) {
  # find variable(s) needing to be removed
  rm_var <- max_vif(x)
  # concatenate with "-" to remove variable(s) from the formula
  rm_var <- paste(paste0("-", rm_var), collapse = " ")
  # update the model
  update(x, paste0(".~.", rm_var))
}
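As a quick usage sketch (assuming the car package is loaded so that vif() is available; fit3 is a hypothetical name): applied to fit1 above, rm_max_vif refits the model without whichever of mpg, cyl, and am has the largest VIF:
library(car)              # provides vif(), used inside max_vif()
fit3 <- rm_max_vif(fit1)  # drops the max-VIF variable and refits
coef(fit3)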
Problem solved!
I created a list containing all the variables for the lm model:
Price <- list(y,x1,...,x100)
Then I used a different way of setting up the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from the Price list.
With the max_vif function I already came up with, the code will be:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # function for finding the name of the variable with the highest VIF
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
n <- max(data.frame(vif(model)))
while (n >= 5) { # delete the variable with the highest VIF from the `Price` list, one at a time, until no VIF is equal to or higher than 5
  m <- max_vif(model)        # name of the variable with the highest VIF
  Price[[m]] <- NULL         # drop that variable from the list
  model <- lm(y ~ ., data = Price)
  n <- max(data.frame(vif(model)))
}

Writing function to identify confounding variables

I am trying to model an outcome as a function of several exposures, adjusting the models for any covariates that may be confounders (≥ 10% ∆ in the outcome coefficient when added to the model). I am looking at many covariates as potential confounders, so I have created a data frame with all of them and am using lapply (the outcome and exposures are in a separate data frame which has already been attached). To make sorting through all my outputs easier, I have tried to write a function which will only display the output if the covariate is a confounder. The exposures and their number differ in each model, so I find myself having to write code like the below each time I run my analyses, but I know there must be an easier way. Would there be a function I could write to just lapply with, using the model without confounders and the Covariates data frame as arguments? Thanks!
lapply(Covariates, function(x) {
  model <- summary(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
  if ((model$coefficients[2, 1] - summary(lm(Outcome ~ Exposure))$coefficients[2, 1]) /
      model$coefficients[2, 1] >= .1)
    return(model)
})
I have written a function to solve this problem!
confounder <- function(model) {
  model.sum <- summary(model)
  model.b <- model.sum$coefficients[2, 1]
  oldmodel <- update(model, . ~ . - x)
  oldmodel.sum <- summary(oldmodel)
  oldmodel.b <- oldmodel.sum$coefficients[2, 1]
  model.frame <- tidy(model.sum)                 # tidy() comes from the broom package
  model.sub <- subset(model.frame, term == "x")  # note: ==, not =
  model.sub.b <- model.sub[, 5]                  # column 5 of the tidy output is the p value
  if ((model.b - oldmodel.b) / model.b >= .1 |
      model.sub.b < .05)
    return(model.sum)
}
I then lapply this function to the model:
lapply(Covariates, function(x) {
confounder(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
})
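For reference, here is a self-contained sketch of the same ≥ 10% change criterion using mtcars, with hypothetical stand-ins (mpg as the outcome, wt as the exposure, and three other columns as candidate covariates):
base_b <- coef(lm(mpg ~ wt, data = mtcars))[2]  # exposure coefficient without covariates
covs <- mtcars[, c("hp", "qsec", "drat")]       # candidate confounders
lapply(covs, function(x) {
  m <- lm(mpg ~ wt + x, data = mtcars)
  b <- coef(m)[2]                               # exposure coefficient with the covariate added
  if (abs((b - base_b) / b) >= 0.1) summary(m)  # >= 10% change: flag as a confounder
})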

R: can't get an lme {nlme} to fit when using self-constructed interaction variables

I'm trying to get an lme with self-constructed interaction variables to fit. I need those for post-hoc analysis.
library(nlme)
# construct fake dataset
obsr <- 100
dist <- rep(rnorm(36), times=obsr)
meth <- dist+rnorm(length(dist), mean=0, sd=0.5); rm(dist)
meth <- meth/dist(range(meth)); meth <- meth-min(meth)
main <- data.frame(meth = meth,
                   cpgl = as.factor(rep(1:36, times = obsr)),
                   pbid = as.factor(rep(1:obsr, each = 36)),
                   agem = rep(rnorm(obsr, mean = 30, sd = 10), each = 36),
                   trma = as.factor(rep(sample(c(TRUE, FALSE), size = obsr, replace = TRUE), each = 36)),
                   depr = as.factor(rep(sample(c(TRUE, FALSE), size = obsr, replace = TRUE), each = 36)))
# check if all factor combinations are present
# TRUE for my real dataset; Naturally TRUE for the fake dataset
with(main, all(table(depr, trma, cpgl) >= 1))
# construct interaction variables
main$depr_trma <- interaction(main$depr, main$trma, sep=":", drop=TRUE)
main$depr_cpgl <- interaction(main$depr, main$cpgl, sep=":", drop=TRUE)
main$trma_cpgl <- interaction(main$trma, main$cpgl, sep=":", drop=TRUE)
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl, sep=":", drop=TRUE)
# model WITHOUT preconstructed interaction variables
form1 <- list(fixd = meth ~ agem + depr + trma + depr*trma + cpgl +
                depr*cpgl + trma*cpgl + depr*trma*cpgl,
              rndm = ~ 1 | pbid,
              corr = ~ cpgl | pbid)
modl1 <- nlme::lme(fixed = form1[["fixd"]],
                   random = form1[["rndm"]],
                   correlation = corCompSymm(form = form1[["corr"]]),
                   data = main)
# model WITH preconstructed interaction variables
form2 <- list(fixd = meth ~ agem + depr + trma + depr_trma + cpgl +
                depr_cpgl + trma_cpgl + depr_trma_cpgl,
              rndm = ~ 1 | pbid,
              corr = ~ cpgl | pbid)
modl2 <- nlme::lme(fixed = form2[["fixd"]],
                   random = form2[["rndm"]],
                   correlation = corCompSymm(form = form2[["corr"]]),
                   data = main)
The first model fits without any problems, whereas the second model gives me the following error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
Nothing I have found about this error so far has helped me solve the problem. However, the solution is probably pretty simple.
Can someone help me? Thanks in advance!
EDIT 1:
When I run:
modl3 <- lm(form1[["fixd"]], data=main)
modl4 <- lm(form2[["fixd"]], data=main)
The summaries reveal that modl4 (with the self-constructed interaction variables), in contrast to modl3, contains many more predictors. All those that are in modl4 but not in modl3 show NA as coefficients. The problem therefore definitely lies in the way I create the interaction variables...
EDIT 2:
In the meantime I have created the interaction variables "by hand" (mainly paste() and grepl()), and it seems to work now. However, I would still be interested in how I could have done it using the interaction() function.
I should have constructed only the largest of the interaction variables (the one combining all three simple variables).
If I do so, the model fits. The likelihoods are then very close to each other and the number of coefficients matches exactly.
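A sketch of one reading of that fix, reusing the question's objects (form3 and modl5 are hypothetical names): since the 144-level depr_trma_cpgl factor already spans the main effects and all lower-order interactions of depr, trma, and cpgl, it can enter the fixed part alone alongside agem, giving the same number of coefficients as modl1:
form3 <- list(fixd = meth ~ agem + depr_trma_cpgl,
              rndm = ~ 1 | pbid,
              corr = ~ cpgl | pbid)
modl5 <- nlme::lme(fixed = form3[["fixd"]],
                   random = form3[["rndm"]],
                   correlation = corCompSymm(form = form3[["corr"]]),
                   data = main)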
