Regression of variables in a dataframe - r

I have a dataframe:
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
I would like to regress each variable versus all the other variables, for instance:
fit1 <- lm(x1 ~ ., data = df)
fit2 <- lm(x2 ~ ., data = df)
etc. (Of course, the real dataframe has a lot more variables).
I tried putting them in a loop, but it didn't work. I also tried using lapply but couldn't produce the desired result either. Does anyone know the trick?

You can use reformulate to dynamically build formuals
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
vars <- names(df)
result <- lapply(vars, function(resp) {
lm(reformulate(".",resp), data=df)
})
alternatively you could use do.call to get "prettier" formauls in each of the models
vars <- names(df)
result <- lapply(vars, function(resp) {
do.call("lm", list(reformulate(".",resp), data=quote(df)))
})
each of these methods returns a list. You can extract individual models with result[[1]], result[[2]], etc

Or you can try this...
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
models = list()
for (i in (1: ncol(df))){
formula = paste(colnames(df)[i], "~ .", sep="")
models[[i]] = lm(formula, data = df)
}
This will save all models as a list
To retrieve stored models:
eg : model regressed on x4
#retrieve model - replace modelName with the name of the required column
modelName = "x4"
out = models[[which( colnames(df)== modelName )]]
Output :
Call:
lm(formula = formula, data = df)
Coefficients:
(Intercept) x1 x2 x3
-0.17383 0.07602 -0.09759 -0.23920

Related

How to have output from lm() include std. error and others without using summary() for stargazer

I'm fitting several linear models in r in the following way:
set.seed(12345)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)+0.1
y = x + rnorm(n)
df <- data.frame(x1, x2, y)
x_str <- c("x1", "x1+x2")
regf_lm <- function(df,y_var, x_str ) {
frmla <- formula(paste0(y_var," ~ ", x_str ))
fit <- lm(frmla, data = df )
summary(fit) #fit
}
gbind_lm <- function(vv) {
n <- vv %>% length()
fits <- list()
coefs <- list()
ses <- list()
for (i in 1:n ) {
coefs[[i]] <- vv[[i]]$coefficients[,1]
ses[[i]] <- vv[[i]]$coefficients[,2]
fits[[i]] <- vv[[i]]
}
list("fits" = fits, "coefs" = coefs, "ses" = ses)
}
stargazer_lm <- function(mylist, fname, title_str,m_type = "html",...) {
stargazer(mylist$fits, coef = mylist$coefs,
se = mylist$ses,
type = m_type, title = title_str,
out = paste0("~/projects/outputs",fname), single.row = T ,...)
}
p_2 <- map(x_str,
~ regf_lm (df = df ,
y_var = "y", x_str = .))
m_all <- do.call(c, list(p_2)) %>% gbind_lm()
stargazer_lm(m_all,"name.html","My model", m_type = "html")
In regf_lm, if I use summary(fit) on the last line, I'm able to generate reg output with columns for estimated coefficients, std. error, etc. But Stargazer() does not work with summary(lm()) (returns error $ operator is invalid for atomic vectors). However, if I just use "fit" on the last line in regf_lm, the output shows only the estimated coefficients and not std error, R sq...and gbind_lm() won't work because I cannot extract ses or fit.
Any advice is greatly appreciated.
You can directly export model statistics in tidy format with the package broom
library(broom)
set.seed(12345)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)+0.1
y = x1 + rnorm(n)
df <- data.frame(x1, x2, y)
x_str <- c("x1", "x1+x2")
regf_lm <- function(df,y_var, x_str ) {
frmla <- formula(paste0(y_var," ~ ", x_str ))
fit <- lm(frmla, data = df )
return(list(fit,select(broom::tidy(fit),std.error))) #fit
}
exm_model <- regf_lm(iris,'Sepal.Width','Sepal.Length')
stargazer(exm_model[[1]], coef = exm_model[[2]], title = 'x_model',
out ='abc', single.row = T)
This piece of code worked on my local with no problem, I think you can apply this in your workflow.

Linear regression with ongoing data, in R

Modell
y ~ x1 + x2 + x3
about 1000 rows
What Iwant to do is to do an prediction "step-by-step"
Using Row 0:20 to predict y of 21:30 and then using 11:30 to predict y of 31:40 and so on.
You can use the predict function:
mod = lm(y ~ ., data=df[1:990,])
pred = predict(mod, newdata=df[991:1000,2:4])
Edit: to change the range of training data in a loop:
index = seq(10,990,10)
pred = matrix(nrow=10, ncol=length(index))
for(i in index){
mod = lm(y ~ ., data=df[1:i,])
pred[,i/10] = predict(mod, newdata=df[(i+1):(i+10),2:4])
MSE[i/10] = sum((df$y[(i+1):(i+10)]-pred[,i/10])^2)}
mean(MSE)
Are you looking for something like this?
# set up mock data
set.seed(1)
df <- data.frame(y = rnorm(1000),
x1 = rnorm(1000),
x2 = rnorm(1000),
x3 = rnorm(1000))
# for loop
prd <- list()
for(i in 1:970){
# training data
trn <- df[i:(i+20), ]
# test data
tst <- df[(i+21):(i+30), ]
# lm model
mdl <- lm(y ~ x1 + x2 + x3, trn)
# append a list of data.frame with both predicted and actual values
# for later confrontation
prd[[i]] <- data.frame(prd = predict(mdl, tst[-1]),
act = tst[[1]])
}
# your list
prd
You can also try something fancier with the package slider:
# define here your model and how you wanna handle the preditions
sliding_lm <- function(..., frm, n_trn, n_tst){
df <- data.frame(...)
trn <- df[1:n_trn, ]
tst <- df[n_trn+1:n_tst, ]
mdl <- lm(y ~ x1 + x2 + x3, trn)
data.frame(prd = predict(mdl, tst[-1]),
act = tst[[1]])
}
n_trn <- 20 # number of training obs
n_tst <- 10 # number of test obs
frm <- y ~ x1 + x2 + x3 # formula of your model
prd <- slider::pslide(df, sliding_lm,
frm = frm,
n_trn = n_trn,
n_tst = n_tst,
.after = n_trn + n_tst,
.complete = TRUE)
Note that the last 30 entries in the list are NULL, because you look only at complete windows [30 observations with training and test]

Including t statistics in regression output using stargazer in R

I have three regressions that I am trying to include into one table using the -stargazer- function. I have the following code:
library(Jmisc)
library(tidyverse)
library(sandwich)
library(lmtest)
library(multiwayvcov)
library(stargazer)
set.seed(123)
df <- data.frame(
x1 = rnorm(10, mean=0, sd=1),
x2 = rnorm(10, mean=0, sd=1),
y = rnorm(10, mean=0, sd=1)
)
r1 <- lm(y ~ x1 + x2, df)
cov1 <- vcovHC(r1, type="HC1", cluster="clustervar")
robust.se1 <- sqrt(diag(cov1))
t1 <- coef(r1)/robust.se1
r2 <- lm(y ~ x1, df)
cov2 <- vcovHC(r2, type="HC1", cluster="clustervar")
robust.se2 <- sqrt(diag(cov2))
t2 <- coef(r2)/robust.se2
r3 <- lm(y ~ x2, df)
cov3 <- vcovHC(r3, type="HC1", cluster="clustervar")
robust.se3 <- sqrt(diag(cov3))
t3 <- coef(r3)/robust.se2
stargazer(r1, r2, r3,
se = NULL,
t = list(t1, t2, t3),
align=TRUE,
type="html",
nobs=TRUE,
out="StargazerTest.txt")
The table that is produced reports standard errors as opposed to the t-statistics I created. This is most likely due to the -stargazer- function at the bottom. I have looked up the directory for it and still don't understand how to get it to do what I want.
As explained here, you can specify the values you want with report (since stargazer 5.0). In your case, remove se = NULL and t = list(t1, t2, t3) and put:
report = ('c*t')
such as:
stargazer(r1, r2, r3,
report = ('c*t'),
align=TRUE,
type="html",
nobs=TRUE,
out="StargazerTest.txt")
Edit: since you need to use the robust standard error, you should use the function coeftest (library lmtest) instead of computing the robust standard error manually. Below is an example on one of your regressions:
library(Jmisc)
library(tidyverse)
library(sandwich)
library(lmtest)
library(multiwayvcov)
library(stargazer)
set.seed(123)
df <- data.frame(
x1 = rnorm(10, mean=0, sd=1),
x2 = rnorm(10, mean=0, sd=1),
y = rnorm(10, mean=0, sd=1)
)
r1 <- lm(y ~ x1 + x2, df)
cov1 <- vcovHC(r1, type="HC1", cluster="clustervar")
robust.se1 <- sqrt(diag(cov1))
t1 <- coef(r1)/robust.se1
foo <- coeftest(r1, vcov = vcovHC(r1, type = "HC1"))
stargazer(foo,
report = ('c*t'),
align=TRUE,
type="html",
nobs=TRUE,
out="StargazerTest.txt")
Notice that foo gives the same t-values than t1 but also displays coefficients, se, etc. which allows stargazer to work properly

Loop through various data subsets in lm() in R

I would like to loop over various regressions referencing different data subsets, however I'm unable to appropriately call different subsets. For example:
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
x.list <- list(dat$x1,dat$x2,dat$x3)
dat1 <- dat[-9,]
fit <- list()
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat))}
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat1))}
Is there a way to call in "dat1" such that it subsets the other variables accordingly? Thanks for any recs.
I'm not sure it makes sense to copy your covariates into a new list like that. Here's a way to loop over columns and to dynamically build formulas
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
#x.list not used
fit <- list()
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat))}
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat1))}
How about this?
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
mods <- lapply(list(y ~ x1, y ~ x2, y ~ x3), lm, data = dat1)
If you have lots of predictors, create the formulas something like this:
lapply(paste('y ~ ', 'x', 1:10, sep = ''), as.formula)
If your data was in long format, it would be similarly simple to do by using lapply on a split data.frame.
dat <- data.frame(y = rnorm(30), x = rnorm(30), f = rep(1:3, each = 10))
lapply(split(dat, dat$f), function(x) lm(y ~ x, data = x))
Sorry being late - but have you tried to apply the data.table solution similar to yours in:
R data.table loop subset by factor and do lm()
I have just applied the links solution by altering your data which should illustrate how I understood your question:
set.seed(1)
df <- data.frame(x1 = letters[1:3],
x2 = sample(c("a","b","c"), 30, replace = TRUE),
x3 = sample(c(20:50), 30, replace = TRUE),
y = sample(c(20:50), 30, replace = TRUE))
dt <- data.table(df,key="x1")
fits <- lapply(unique(dt$x1),
function(z)lm(y~x2+x3, data=dt[J(z),], y=T))
fit <- dt[, lm(y ~ x2 + x3)]
# Using id as a "by" variable you get a model per id
coef_tbl <- dt[, as.list(coef(lm(y ~ x2 + x3))), by=x1]
# coefficients
sapply(fits,coef)
anova_tbl = dt[, as.list(anova(lm(y ~ x2 + x3))), by=x1]
row_names = dt[, row.names(anova(lm(y ~ x2 + x3))), by=x1]
anova_tbl[, variable := row_names$V1]
It extends your solution.

Rename terms in fitted model object

I have a list of regression models which all have the same number of terms (that is, the same number of predictive variables). Substantively, that they all have different model terms is right. But when it comes to putting them in a regression table, I want them all the models to share a single formula, simply for the sake of presentation.
Some indicative data
library(plyr)
d1 <- data.frame(y = rnorm(100),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
x4 = runif(100))
Fit the models
mods.form <- paste("y ~ x", 1:4, sep = "")
mod.list <- llply(mods.form, function(i) lm(i, d1))
Here are the terms I want to modify
llply(mod.list, function(i) attr(terms(i), "variables"))
[[1]]
list(y, x1)
[[2]]
list(y, x2)
[[3]]
list(y, x3)
[[4]]
list(y, x4)
I want every model in the list to have the same variable names as the first model, so I tried:
mod.list2 <- llply(mod.list, function(i) attr(terms(i), "variables") = list("y", "x1"))
which provides this error
Error in attr(terms(i), "variables") = list("y", "x1") :
could not find function "terms<-"
Is there a simple solution here?
Perhaps this is what you are looking for:
Using the dataframe that you provided
d1 <- data.frame(y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100),
x4 = rnorm(100))
First, rename each x variable to some desired name "x"
names(d1) <- c("y", rep("x", times=length(d1)-1))
Then, use lapply on list d1 for each x variable, passing y as an argument
in to an anonymous function
mod.list <- lapply(d1[2:ncol(d1)], function(x,y){
lm("y ~ x",d1)
}, y=d1[, 'y'])
Finally, calling llply on the mod.list we get:
> llply(mod.list, function(x){
+ attr(terms(x), "variables")
+ })
$x
list(y, x)
$x.1
list(y, x)
$x.2
list(y, x)
$x.3
list(y, x)

Resources