Writing function to identify confounding variables - r

I am trying to model an outcome as a function of several exposures, adjusting the models for any covariates that may be confounders (≥ 10% ∆ in the exposure coefficient when the covariate is added to the model). I am looking at many covariates as potential confounders, so I have created a data frame with all of them and am using lapply (the outcome and exposures are in a separate data frame which has already been attached). To make sorting through all my outputs easier, I have tried to write a function which only displays the output if the covariate is a confounder. The exposures, and the number of them, are different in each model, so I find myself having to write code like the example below each time I run my analyses, but I know there must be an easier way. Is there a function I could write and simply lapply, using the model without confounders and the Covariates data frame as arguments? Thanks!
lapply(Covariates, function(x) {
model <- summary(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
if ((model$coefficients[2, 1] - summary(lm(Outcome ~ Exposure))$coefficients[2, 1])/
model$coefficients[2, 1] >= .1)
return(model)
})

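I have written a function that solves this problem:
library(broom) # provides tidy()
confounder <- function(model) {
# coefficient of the first exposure in the model that includes covariate x
model.sum <- summary(model)
model.b <- model.sum$coefficients[2, 1]
# refit the same model without the covariate x
oldmodel <- update(model, . ~ . - x)
oldmodel.sum <- summary(oldmodel)
oldmodel.b <- oldmodel.sum$coefficients[2, 1]
# p-value of the covariate itself
model.frame <- tidy(model)
model.sub <- subset(model.frame, term == "x")
model.sub.p <- model.sub$p.value
# show the model if the exposure coefficient changes by at least 10%
# (in either direction) or the covariate is significant at the 5% level
if (abs(model.b - oldmodel.b) / abs(model.b) >= .1 ||
any(model.sub.p < .05))
return(model.sum)
}
I then apply this function to each covariate with lapply: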
lapply(Covariates, function(x) {
confounder(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
})
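The question asks for a single reusable function. The following is only a minimal sketch of one way to do that, not the approach from the thread: it assumes everything lives in one data frame rather than being attached, that the exposures are numeric (so coefficient names match variable names), and that the change is measured relative to the unadjusted coefficient of the first exposure. The name screen_confounders and its arguments are hypothetical, and mtcars stands in for the real data.

```r
# Hypothetical helper: returns the relative change in the first exposure's
# coefficient for every covariate that shifts it by at least `threshold`.
screen_confounders <- function(data, outcome, exposures, covariates,
                               threshold = 0.1) {
  # Model without any candidate confounder
  base_fit <- lm(reformulate(exposures, response = outcome), data = data)
  b_base <- coef(base_fit)[[exposures[1]]]

  # Refit once per covariate and record the relative change
  rel_change <- sapply(covariates, function(cv) {
    fit <- lm(reformulate(c(exposures, cv), response = outcome), data = data)
    abs(coef(fit)[[exposures[1]]] - b_base) / abs(b_base)
  })

  # Keep only covariates that meet the threshold
  rel_change[rel_change >= threshold]
}

# Stand-in example: mpg as outcome, wt and hp as exposures
flagged <- screen_confounders(mtcars, "mpg", c("wt", "hp"),
                              c("cyl", "qsec", "drat", "am"))
flagged
```

Because the formulas are built from names with reformulate, the same call works for any number of exposures.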

Related

Is there a function for substituting (or removing entirely) explanatory variables in a linear model (lm)?

I have a linear model with lots of explanatory (independent) variables
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the machine to find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula and then rerun the lm function with the new formula
model <- lm(y ~ x1 + x3 + ... + x100)
I have already learned how to retrieve the name of the explanatory variable with the highest VIF:
max_vif <- function(x) {
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
But I still don't understand how to look up that variable, delete it from the formula and run the function again.
We can use the update function and paste in the column that needs to be removed. We first can fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x){
# find variable(s) needing to be removed
rm_var <- max_vif(x)
# concatenate with "-" to remove variable(s) from formula
rm_var <- paste(paste0("-", rm_var), collapse = " ")
# update model
update(x, paste0(".~.", rm_var))
}
Problem solved!
I created a data frame (a named list) containing all variables for the lm model:
Price <- data.frame(y, x1, ..., x100)
Then I set up the lm model in a different way:
model <- lm(y ~ ., data = Price)
So we can simply delete the variable with the highest VIF from Price.
With the function I had already come up with, the code becomes:
library(car) # provides vif()
Price <- data.frame(y, x1, x2, ..., x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # Function for finding name of variable with the highest VIF
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
n <- max(data.frame(vif(model)))
while(n >= 5) { # Loop deleting the variable with the highest VIF from `Price`, one at a time, until no VIF is equal to or higher than 5
m <- max_vif(model) # name of the variable to drop in this pass
Price[m] <- NULL
model <- lm(y ~ ., data = Price) # refit so the next pass sees the updated model
n <- max(data.frame(vif(model)))
}
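The two answers above can be combined into one loop that keeps calling update until every VIF is below the cutoff. This is a sketch under stated assumptions: to stay free of package dependencies it computes VIFs by hand as the diagonal of the inverse correlation matrix of the predictors, which matches the usual definition only for numeric predictors without interactions, instead of calling vif from the car package. The names simple_vif and drop_high_vif are hypothetical, and mtcars is a stand-in data set.

```r
# VIF for numeric predictors: diagonal of the inverse correlation matrix
simple_vif <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  setNames(diag(solve(cor(X))), colnames(X))
}

# Repeatedly drop the predictor with the highest VIF until all are < cutoff
drop_high_vif <- function(fit, cutoff = 5) {
  repeat {
    # VIF is not meaningful with fewer than two predictors
    if (length(attr(terms(fit), "term.labels")) < 2) break
    v <- simple_vif(fit)
    if (max(v) < cutoff) break
    worst <- names(v)[which.max(v)]
    # ". ~ . - name" removes one term and refits the model
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

fit_all <- lm(mpg ~ ., data = mtcars)
fit_reduced <- drop_high_vif(fit_all)
formula(fit_reduced)
```

Dropping one variable per pass matters because removing a collinear predictor changes the VIFs of all the others.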

output table for multiple regressions

I ran several time-series regressions (one for every year) and now I'd like to generate a table similar to what coef() returns but also with level of significance (the stars), R-squared and F-statistic for each year which will look somewhat like this:
b0 b1 b2 b3 b4 R-sq. F-stat.
2010 ...*
2011 ...
2012 ...**
So far I have tried mtable() from the memisc package, which gives me years as columns and coefficients as rows, but I'd prefer the result to be "transposed" (like above).
Since we don't have access to your data or the code you used to run your models, I created my own dummy models using the mtcars dataset:
data("mtcars")
model1 <- lm(mpg ~ wt + cyl, data = mtcars)
model2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)
For future reference, you'll always want to supply some of your data using, for example, dput(head(my_dataframe, 20)). You should also put up more of the code you used to get where you're at; in fact, the minimum amount of code needed to reproduce your problem. You may want to read How to Create a Great R Reproducible Example for more information; it just helps others help you.
Then I rigged up the following (clumsy) function that I think does roughly what you're looking for. In any event, it should get you started in the right direction:
get_row <- function(x, coef_names) {
coef_mat <- coef(summary(x))
these_coef_names <- rownames(coef_mat)
rows <- match(coef_names, these_coef_names)
p <- coef_mat[rows, 4]
stars <- c("", "*", "**", "***")[(p < 0.05) + (p < 0.01) + (p < 0.001) + 1]
coefs <- round(coef_mat[rows, 1], 3)
output <- paste0(coefs, stars)
output <- ifelse(grepl("NA", output), NA, output)
return(output)
}
get_table <- function(...) {
models <- list(...)
if ( !all(sapply(models, function(m) identical(class(m), "lm"))) ) {
stop("This function has only been tested with lm objects.")
}
coef_names <- unique(unlist(sapply(models, variable.names)))
coef_table <- t(sapply(models, get_row, coef_names))
colnames(coef_table) <- coef_names
return(coef_table)
}
get_table(model1, model2)
# (Intercept) wt cyl hp
# [1,] "39.686***" "-3.191***" "-1.508**" NA
# [2,] "38.752***" "-3.167***" "-0.942" "-0.018"
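The question also asked for R-squared and the F-statistic, with one row per year. A rough sketch of how that could be added follows. It is self-contained rather than an extension of get_table, the function name star_row is hypothetical, and because no yearly data is available the groups of mtcars$cyl stand in for years.

```r
# One table row per model: starred coefficients, then R-squared and F
star_row <- function(fit, coef_names) {
  s <- summary(fit)
  cm <- coef(s)
  idx <- match(coef_names, rownames(cm))
  p <- cm[idx, 4]
  stars <- c("", "*", "**", "***")[(p < 0.05) + (p < 0.01) + (p < 0.001) + 1]
  # Coefficients missing from this model stay NA
  est <- ifelse(is.na(idx), NA, paste0(round(cm[idx, 1], 3), stars))
  c(est, round(s$r.squared, 3), round(s$fstatistic[["value"]], 2))
}

# Stand-in for "one regression per year": one model per cyl group
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) lm(mpg ~ wt + hp, data = d))

coef_names <- unique(unlist(lapply(fits, variable.names)))
tab <- t(sapply(fits, star_row, coef_names))
colnames(tab) <- c(coef_names, "R-sq.", "F-stat.")
tab
```

Naming the list elements (here via split) is what gives the table its row labels, so a list named by year would produce the layout requested in the question.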

Use all variables in a model with {plm} in R

Using different sources, I wrote a little function that, after a linear regression, creates a table with standard errors, t statistics and p-values that are clustered according to a group variable "cluster". The code is as follows:
cl1 <- function(modl,clust) {
# model is the regression model
# clust is the clustervariable
# id is a unique identifier in ids
library(plm)
library(lmtest)
# Get Formula
form <- formula(modl$call)
# Get Data frame
dat <- eval(modl$call$data)
dat$row <- rownames(dat)
dat$id <- ave(dat$row, dat[[deparse(substitute(clust))]], FUN =seq_along)
pdat <- pdata.frame(dat,
index=c("id", deparse(substitute(clust)))
, drop.index= F, row.names= T)
# # Regression
reg <- plm(form, data=pdat, model="pooling")
# # Adjustments
G <- length(unique(dat[, deparse(substitute(clust))]))
N <- length(dat[,deparse(substitute(clust))])
# # Resid degrees of freedom, adjusted
dfa <- (G/(G-1))*(N-1)/reg$df.residual
d.vcov <- dfa* vcovHC(reg, type="HC0", cluster="group", adjust=T)
table <- coeftest(reg, vcov=d.vcov)
# # Output: se, t-stat and p-val
cl1out <- data.frame(table[, 2:4])
names(cl1out) <- c("se", "tstat", "pval")
# # Cluster VCE
return(cl1out)
}
For a regression like reg1 <- lm (y ~ x1 + x2 , data= df), calling the function cl1(reg1, cluster) will work just fine.
However, if I use a model like reg2 <- lm(y ~ . , data=df), I will get the error message:
Error in terms.formula(object) : '.' in formula and no 'data' argument
After some tests, I am guessing that I can't use "." to signal "use all variables in the data frame" for {plm}. Is there a way I can do this with {plm}? Otherwise, any ideas on how I could improve my function in a way that does not use {plm} and that accepts all possible specifications of a linear model?
Indeed, you can't use the . notation in a formula within the plm package.
data("Produc", package = "plm")
plm(gsp ~ .,data=Produc)
Error in terms.formula(object) : '.' in formula and no 'data' argument
One idea is to expand the formula when it contains a dot. Here is a custom function that does the job (this is surely already done within other packages):
expand_formula <-
function(form="A ~.",varNames=c("A","B","C")){
has_dot <- any(grepl('.',form,fixed=TRUE))
if(has_dot){
ii <- intersect(as.character(as.formula(form)),
varNames)
varNames <- varNames[!grepl(paste0(ii,collapse='|'),varNames)]
exp <- paste0(varNames,collapse='+')
as.formula(gsub('.',exp,form,fixed=TRUE))
}
else as.formula(form)
}
Now test it :
(eform = expand_formula("gsp ~ .",names(Produc)))
# gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
plm(eform,data=Produc)
# Model Formula: gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
# <environment: 0x0000000014c3f3c0>
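As an alternative to string substitution, base R can expand the dot itself: terms() accepts a data argument, and formula() applied to a fitted lm object returns the formula stored in its terms. The sketch below uses mtcars as a stand-in and does not call plm at all; whether the expanded formula suits a particular plm model (for example with the index variables among the regressors) is left as an assumption to check.

```r
# Expand "." against a data frame without any string manipulation
expanded <- formula(terms(mpg ~ ., data = mtcars))
expanded

# A fitted lm already stores the expanded terms, so inside a wrapper
# such as cl1() one could use formula(modl) instead of formula(modl$call)
fit <- lm(mpg ~ ., data = mtcars)
from_fit <- formula(fit)
from_fit
```

Unlike replacing the dot character in a string, this approach is not confused by variable names that themselves contain a dot.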

How to extract residuals from the regressed data?

This might be a very easy question, but I do need some help with R.
I have expression data, on which I have run a linear regression to correct for the covariates, and I would like to write the residuals to a file.
So following is the loop I have
for (i in 1:n) {
geneProbe <- z.na[,i]
lm1 <- lm(geneProbe ~ phenotype + covariate1 + covariate2 + covariate3)
write.table(lm1$residuals, file="residuals.txt", sep="\t")
}
Of course, when I do the following
write.table(lm1$residuals, file="residuals.txt", sep="\t")
I am able to retrieve the residuals only for one loop as follows (residuals.txt):
Res1
-0.00224226
0.005144119
0.011142788
1.90E-05
-0.003698019
I would rather have them for all values of i (all loop iterations) in a single file, as follows (residuals.txt). In other words, every iteration should add a column:
Res1 Res2 Res3 Res4
-0.00224226 0.009583449 0.000538104 0.012497267
0.005144119 0.015632242 -0.000104554 -0.009199898
0.011142788 -0.012912383 -0.004363051 -0.010270967
1.90E-05 -0.038716093 0.004149837 0.011071139
-0.003698019 0.015219847 -0.002486236 -0.009230721
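Save the residuals in a matrix or data frame within your loop and then write the table when finished. E.g.:
# one row per sample, one column per probe
resdat <- matrix(NA, nrow = nrow(z.na), ncol = n)
colnames(resdat) <- paste0("Res", 1:n)
for (i in 1:n) {
geneProbe <- z.na[,i]
# na.exclude pads residuals with NA so they stay aligned with the input rows
lm1 <- lm(geneProbe ~ phenotype + covariate1 + covariate2 + covariate3, na.action = na.exclude)
resdat[,i] <- residuals(lm1)
}
write.table(resdat, file="residuals.txt", sep="\t", row.names = FALSE)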
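When the response columns have no missing values, the loop can be avoided altogether, because lm accepts a matrix on the left-hand side and fits one regression per column. The following is a small sketch with simulated data; expr, phenotype and covariate1 are made-up stand-ins for z.na and the real covariates, and the file is written to a temporary directory.

```r
set.seed(1)
n_samples <- 20

# Simulated stand-ins: 4 probes measured on 20 samples
expr <- matrix(rnorm(n_samples * 4), ncol = 4,
               dimnames = list(NULL, paste0("Res", 1:4)))
phenotype <- rnorm(n_samples)
covariate1 <- rnorm(n_samples)

# A matrix response fits all columns in one call
fit_all <- lm(expr ~ phenotype + covariate1)

# residuals() then returns a matrix with one column per probe
res_mat <- residuals(fit_all)

out_file <- file.path(tempdir(), "residuals.txt")
write.table(res_mat, file = out_file, sep = "\t", row.names = FALSE)
```

With missing values this shortcut drops every row that has an NA in any column, so the loop with na.exclude is the safer choice in that case.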

Efficiency in time-series regression in R: How can I do this better?

I am working with time series and want to check all the lagged differences for significance (essentially doing a Dickey-Fuller test by hand, but that's not important). I can do it, but it's really mechanical, and there must be a way to do this more elegantly, or at least more efficiently. Any ideas?
y <- log.real.gdp.ts
delta.y.t <- diff(y,differences=1)
lag.y <- lag(y, -1)
L1Dy <- lag(delta.y.t, k=-1)
L2Dy <- lag(delta.y.t, k=-2)
L3Dy <- lag(delta.y.t, k=-3)
L4Dy <- lag(delta.y.t, k=-4)
L5Dy <- lag(delta.y.t, k=-5)
L6Dy <- lag(delta.y.t, k=-6)
L7Dy <- lag(delta.y.t, k=-7)
L8Dy <- lag(delta.y.t, k=-8)
L9Dy <- lag(delta.y.t, k=-9)
L10Dy <- lag(delta.y.t, k=-10)
L11Dy <- lag(delta.y.t, k=-11)
L12Dy <- lag(delta.y.t, k=-12)
d = ts.union(delta.y.t, lag.y, L1Dy, L2Dy, L3Dy, L4Dy, L5Dy, L6Dy, L7Dy, L8Dy, L9Dy, L10Dy, L11Dy, L12Dy) ## takes care of NA's
lm.model.III <- lm(delta.y.t~ lag.y + time(lag.y) + L1Dy + L2Dy + L3Dy + L4Dy + L5Dy + L6Dy + L7Dy + L8Dy + L9Dy + L10Dy + L11Dy + L12Dy, data=d)
I'd really like some kind of loop where I can generate 1:n lagged differences, and then some way to insert all n into my linear model, like
lm.model.III <- lm(delta.y.t ~ lag.y + time(lag.y) + lagged.diffs.mts)
How about:
require(zoo)
# lag() takes a vector of lags only for zoo objects, so convert the ts first
y <- as.zoo(y)
delta.y.t <- diff(y,differences=1)
lag.y <- lag(y, -1)
L1Dy <- lag(delta.y.t, -(0:12), na.pad=TRUE)
# for any regression you can then access the number of lags you want:
# the 0 lag and na.pad=TRUE are crucial
lm(lag.y ~ L1Dy[,1:5])
Hope this helps
-Chris
The package dynlm adds handling of time series operators to R formulas:
The interface and internals of dynlm are very similar to lm, but
currently dynlm offers three advantages over the direct use of lm: 1.
extended formula processing, 2. preservation of time series
attributes, 3. instrumental variables regression (via two-stage least
squares). For specifying the formula of the model to be fitted, there
are additional functions available which allow for convenient
specification of dynamics (via d() and L()) or linear/cyclical
patterns (via trend(), season(), and harmon()). All new formula
functions require that their arguments are time series objects (i.e.,
"ts" or "zoo").
Here is an example:
library(foreign)
library(zoo)
library(dynlm)
dfKlein = read.dta('http://www.stata-press.com/data/r12/klein.dta')
summary(dfKlein)
zooKlein = as.zoo(dfKlein, order.by = dfKlein$year)
lmKlein = dynlm(consump ~ L(profits, 1) + profits + wagetot,
data = zooKlein)
summary(lmKlein)
Note, in particular, that it allows you to specify a vector of lags in the formula object, such as y ~ L(y, 1:4).
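For readers who prefer to stay in base R, the lagged differences asked for in the question can also be generated in one step with embed(), which lays out a series and its lags as matrix columns. This is a sketch on a simulated random walk, not the original GDP series; the names p, lagged_diffs and y_lag1 are illustrative, and the regression mirrors the question's model with a trend term.

```r
set.seed(42)
# Simulated stand-in for log.real.gdp.ts
y <- ts(cumsum(rnorm(120)), start = c(1990, 1), frequency = 4)

p <- 4                       # number of lagged differences to include
dy <- as.numeric(diff(y))    # dy[i] = y[i + 1] - y[i]

# embed(): column 1 is dy_t, column k + 1 is dy_{t-k}
E <- embed(dy, p + 1)
dy_t <- E[, 1]
lagged_diffs <- E[, -1, drop = FALSE]
colnames(lagged_diffs) <- paste0("L", 1:p, "Dy")

# Lagged level y_{t-1}, aligned with the rows of E
y_lag1 <- as.numeric(y)[(p + 1):(length(y) - 1)]
trend <- seq_along(dy_t)

# The matrix enters the formula as a single term, one coefficient per column
adf_fit <- lm(dy_t ~ y_lag1 + trend + lagged_diffs)
summary(adf_fit)
```

Changing p is then the only edit needed to add or remove lags.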
