How to run the same regression but replacing the dataframe used in R? - r

I have 3 dataframes (df1, df2, df3) with the same variable names, and I would like to perform essentially the same regressions on all 3 dataframes. My regressions currently look like this:
m1 <- lm(y ~ x1 + x2, df1)
m2 <- lm(y ~ x1 + x2, df2)
m3 <- lm(y ~ x1 + x2, df3)
Is there a way I can use for-loops in order to perform these regressions by just swapping out dataframe used?
Thank you

You can add the dataframes to a list and map the lm function over the list.
library(tidyverse)
df1 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df2 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df3 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df_list <- list(df1, df2, df3)
m <- map(df_list, ~lm(y ~ x, data = .))
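Since the question asks about for-loops specifically, here is a minimal base R sketch of the same idea. The helper make_df and the simulated data are made up for illustration; only the loop pattern matters.

```r
set.seed(1)

# Hypothetical helper that simulates one data frame with columns x1, x2, y
make_df <- function() {
  x1 <- rnorm(30)
  x2 <- rnorm(30)
  data.frame(x1, x2, y = 1 + 2 * x1 - x2 + rnorm(30))
}

# A named list keeps track of which model belongs to which data frame
dfs <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# Pre-allocate the result list, then fit the same formula on each data frame
models <- vector("list", length(dfs))
names(models) <- names(dfs)
for (nm in names(dfs)) {
  models[[nm]] <- lm(y ~ x1 + x2, data = dfs[[nm]])
}

models$df2  # the fit for the second data frame
```

Keeping the data frames in a named list means adding a fourth one only requires one more list entry.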

Using update.
(fit <- lm(Y ~ X1 + X2 + X3, df1))
# Call:
# lm(formula = Y ~ X1 + X2 + X3, data = df1)
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.9416 -0.2400 0.6481 0.9357
update(fit, data=df2)
# Call:
# lm(formula = Y ~ X1 + X2 + X3, data = df2)
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.6948 0.3199 0.6255 0.9588
Or lapply
lapply(mget(ls(pattern='^df\\d$')), lm, formula=Y ~ X1 + X2 + X3)
# $df1
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.9416 -0.2400 0.6481 0.9357
#
#
# $df2
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.6948 0.3199 0.6255 0.9588
#
#
# $df3
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.5720 0.6106 -0.1576 1.1391
Data:
set.seed(42)
f <- \() transform(data.frame(X1=rnorm(10), X2=rnorm(10), X3=rnorm(10)),
Y=1 + .2*X1 + .4*X2 + .8*X3 + rnorm(10))
set.seed(42); df1 <- f(); df2 <- f(); df3 <- f()
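A small variation, sketched under the same simulated data: wrapping lm in an anonymous function keeps a readable formula in each Call (instead of FUN(formula = ..1, data = X[[i]])), and sapply can then collect the coefficients into one table. The names dfs, fits and coef_table are just illustrative.

```r
set.seed(42)

# Same data generator as above, written with function() so it also runs on R < 4.1
f <- function() transform(data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10)),
                          Y = 1 + .2 * X1 + .4 * X2 + .8 * X3 + rnorm(10))

dfs <- list(df1 = f(), df2 = f())

# Fit the same model on every data frame
fits <- lapply(dfs, function(d) lm(Y ~ X1 + X2 + X3, data = d))

# One row per data frame, one column per coefficient
coef_table <- t(sapply(fits, coef))
coef_table
```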

Related

Implementing a function in R to find the coefficients of a linear regression model from the ground up

# but cannot handle categorical variables
my_lm <- function(explanatory_matrix, response_vec) {
exp_mat <- as.matrix(explanatory_matrix)
intercept <- rep(1, nrow(exp_mat))
exp_mat <- cbind(exp_mat, intercept)
solve(t(exp_mat) %*% exp_mat) %*% (t(exp_mat) %*% response_vec)
}
The above code will not work when there are categorical variables in the explanatory_matrix.
How can I implement that?
Here is an example for a data set with one categorical variable:
set.seed(123)
x <- 1:10
a <- 2
b <- 3
y <- a*x + b + rnorm(10)
# categorical variable
x2 <- sample(c("A", "B"), 10, replace = TRUE)
# one-hot encoding (d is not defined yet at this point, so use x2 directly)
x2 <- as.integer(x2 == "A")
xm <- matrix(c(x, x2, rep(1, length(x))), ncol = 3, nrow = 10)
ym <- matrix(y, ncol = 1, nrow = 10)
beta_hat <- MASS::ginv(t(xm) %*% xm) %*% t(xm) %*% ym
beta_hat
This gives (note the order of coefficients - it matches the order of the predictor columns):
[,1]
[1,] 1.9916754
[2,] -0.7594809
[3,] 3.2723071
which is identical to the output of lm:
d <- data.frame(x = x,
x2 = x2,
y = y)
lm(y ~ ., data = d)
Output
# Call:
# lm(formula = y ~ ., data = d)
#
# Coefficients:
# (Intercept) x x2
# 3.2723 1.9917 -0.7595
For categorical handling you should use one-hot encoding.
Do something like
formula <- dep_var ~ indep_var
exp_mat <- model.matrix(formula, explanatory_matrix)
solve(t(exp_mat) %*% exp_mat) %*% (t(exp_mat) %*% response_vec)
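To make that concrete, here is a minimal sketch with a three-level factor. The data and the names d, g, X and beta_hat are invented for the example. model.matrix builds the intercept column and the dummy columns, and the normal equations are then solved on that matrix.

```r
set.seed(123)

# Toy data: one numeric predictor and one categorical predictor with three levels
d <- data.frame(x = 1:10,
                g = factor(rep(c("A", "B", "C"), length.out = 10)))
d$y <- 2 * d$x + 3 + rnorm(10)

# model.matrix adds the intercept and expands the factor into dummy columns
X <- model.matrix(y ~ x + g, data = d)

# Normal equations: solve (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, d$y))
beta_hat

# Should agree with lm up to numerical precision
coef(lm(y ~ x + g, data = d))
```

By default model.matrix uses treatment contrasts, so the first level ("A" here) is absorbed into the intercept rather than getting its own column.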

R: merge strings into one formula object

I have a character object that describes the control variables for a regression model. I fail to dynamically reference those correctly, whenever there is more than one control variable. Consider the following example:
x1 = runif(1000); x2 = runif(1000); x3 = runif(1000); e = runif(1000)
y = 2*x1+3*x2+x3+ e
df = data.frame(y, x1,x2,x3)
# define formula inputs
depvar =as.symbol("y")
variableofinterest = as.symbol("x1")
control1 = as.symbol('x2')
control2 = as.symbol('x2+x3')
# this works
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control1) , data = df)))
# this does not
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control2) , data = df)))
It does not work, since the dataframe obviously contains no variable x2+x3. But how can I disentangle those terms so that they are referenced correctly, when the input control = x2+x3 is a given character string (beyond my control)?
We can quote instead of as.symbol
control2 <- quote(x2 + x3)
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control2) , data = df)))
#Call:
#lm(formula = y ~ x1 + (x2 + x3), data = df)
#Coefficients:
#(Intercept) x1 x2 x3
# 0.450 2.056 3.007 1.056
Note that when we do as.symbol, it adds a backquote
as.symbol('x2 + x3')
#`x2 + x3`
compare it with quote which returns a language object instead of symbol
quote(x2 + x3)
#x2 + x3
If it is already a string, then we can use parse_expr from rlang
control2 <- rlang::parse_expr('x2 + x3')
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control2) , data = df)))
#Call:
#lm(formula = y ~ x1 + (x2 + x3), data = df)
#Coefficients:
#(Intercept) x1 x2 x3
# 0.450 2.056 3.007 1.056
If your objective is to have just one coefficient for x2+x3, you should use I (Inhibit Interpretation/Conversion of Objects).
Furthermore, you would need what @Roland has said:
control2 = parse(text = 'x2+x3')[[1]]
eval(bquote(lm(.(depvar)~ .(variableofinterest) + I(.(control2)) , data = df)))
Call:
lm(formula = y ~ x1 + I(x2 + x3), data = df)
Coefficients:
(Intercept) x1 I(x2 + x3)
0.4899 2.0157 2.0342
Otherwise, if you don't want to work with eval, as.symbol, bquote and .( ) you can use as.formula and paste0.
# define formula inputs
depvar = "y"
variableofinterest = "x1"
control1 = 'x2'
control2 = 'I(x2+x3)'
lm(as.formula(paste0(depvar,
"~",
paste0(c(variableofinterest, control2), collapse = "+"))),
data = df)
Call:
lm(formula = as.formula(paste0(depvar, "~", paste0(c(variableofinterest,
control2), collapse = "+"))), data = df)
Coefficients:
(Intercept) x1 I(x2 + x3)
0.4899 2.0157 2.0342
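Another base R option, shown here only as a sketch with simulated data: reformulate builds the formula from character vectors, so a control string like "x2+x3" is parsed into separate terms. The variable controls stands in for the externally supplied string.

```r
set.seed(1)
df <- data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
df$y <- 2 * df$x1 + 3 * df$x2 + df$x3 + runif(100)

# Controls arrive as a single string (assumed input, as in the question)
controls <- "x2+x3"

# reformulate pastes the term labels together and parses them into a formula
fmla <- reformulate(c("x1", controls), response = "y")
fmla

fit <- lm(fmla, data = df)
coef(fit)
```

If a single coefficient for the sum is wanted instead, the control string can be wrapped as paste0("I(", controls, ")") before passing it to reformulate.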

Using update on a linear model with new lagged variables

I have a base model y ~ x1 + x2.
I want to update the model to contain y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2).
x3 and x4 are also dynamically selected.
fmla <- as.formula(paste('.~.', paste(c(x3, x4), collapse = '+')))
My update formula: update(fit, fmla)
I get an error saying x3/x4 is not found from the as.formula function. I understand the error, just not how to get around it to do what I want.
A possible solution for your problem can be:
# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))
# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)
# Add lagged x3 and x4 variables (names are passed as strings)
addvars <- c("x3", "x4")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
#
# Coefficients:
# (Intercept) x1 x2 lag(x3, 2) lag(x4, 2)
# -0.083180 0.015753 0.041998 0.000612 -0.093265
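One caveat worth checking: in base R, stats::lag only moves the time index of a series, so on plain data frame columns inside lm it may not shift the values the way one expects. The sketch below uses a hand-written helper, shift (a made-up name, not from any package), that pads with NA, and builds the update formula from the variable names in the same way.

```r
set.seed(1)
yX <- as.data.frame(matrix(rnorm(1000), ncol = 5))
names(yX) <- c("y", paste0("x", 1:4))

fit <- lm(y ~ x1 + x2, data = yX)

# Hypothetical helper: move values k positions down, padding the start with NA
shift <- function(v, k) c(rep(NA, k), head(v, -k))

# Variable names chosen dynamically, passed as strings
addvars <- c("x3", "x4")
fmla <- as.formula(paste("~ . +", paste0("shift(", addvars, ", 2)", collapse = " + ")))

fit2 <- update(fit, fmla)
formula(fit2)
nobs(fit2)  # rows with NA from the shift are dropped by lm
```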
Below an example with the dynlm package.
data("USDistLag", package = "lmtest")
# Start with a dynamic linear model with gnp as explanatory variables
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )
# Time series regression with "ts" data:
# Start = 1963, End = 1982
#
# Call:
# dynlm(formula = f1, data = USDistLag)
#
# Coefficients:
# (Intercept) gnp
# -24.0889 0.6448
# Add lagged gnp
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Time series regression with "ts" data:
# Start = 1963, End = 1980
#
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
#
# Coefficients:
# (Intercept) gnp lag(gnp, 2)
# -31.1437 0.5366 0.1067

Using lapply on a list of models

I have generated a list of models, and would like to create a summary table.
As an example, here are two models:
x <- seq(1:10)
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))
and two functions, the first generating the equation from the components of the formula
get.model.equation <- function(x) {
x <- as.character((x$call)$formula)
x <- paste(x[2],x[1],x[3])
}
and the second generating the name of model as a string
get.model.name <- function(x) {
x <- deparse(substitute(x))
}
With these, I create a summary table
model.list <- list(model1, model2)
AIC.data <- lapply(X = model.list, FUN = AIC)
AIC.data <- as.numeric(AIC.data)
model.models <- lapply(X = model.list, FUN = get.model.equation)
model.summary <- cbind(model.models, AIC.data)
model.summary <- as.data.frame(model.summary)
names(model.summary) <- c("Model", "AIC")
model.summary$AIC <- unlist(model.summary$AIC)
rm(AIC.data)
model.summary[order(model.summary$AIC),]
Which all works fine.
I'd like to add the model name to the table using get.model.name
x <- get.model.name(model1)
Which gives me "model1" as I want.
So now I apply the function to the list of models
model.names <- lapply(X = model.list, FUN = get.model.name)
but now instead of model1 I get X[[1L]]
How do I get model1 rather than X[[1L]]?
I'm after a table that looks like this:
Model Formula AIC
model1 y ~ x 11.89136
model2 y ~ x + I(x^2) + I(x^3) 15.03888
Do you want something like this?
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)
I'd do something like this:
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
# changed Reduce('rbind', ...) to do.call(rbind, ...) (Hadley's comment)
do.call(rbind,
lapply(names(model.list), function(x)
data.frame(model = x,
formula = get.model.equation(model.list[[x]]),
AIC = AIC(model.list[[x]])
)
)
)
# model formula AIC
# 1 model1 y ~ x 11.89136
# 2 model2 y ~ x + I(x^2) + I(x^3) 15.03888
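The reason deparse(substitute(x)) returns X[[1L]] is that lapply calls the function on its own internal expression, so the original variable name is no longer visible. A sketch of a variant that avoids the issue by taking names from the list itself, and that builds the table in one data.frame call with vapply:

```r
x <- 1:10
y <- sin(x)^2

# Names are attached to the list, so they survive any apply-style call
model.list <- list(model1 = lm(y ~ x),
                   model2 = lm(y ~ x + I(x^2) + I(x^3)))

model.summary <- data.frame(
  Model   = names(model.list),
  Formula = vapply(model.list, function(m) deparse(formula(m)), character(1)),
  AIC     = vapply(model.list, AIC, numeric(1)),
  row.names = NULL
)

# Sort by AIC, best model first
model.summary[order(model.summary$AIC), ]
```

This assumes each formula deparses to a single line; very long formulas would need the pieces pasted together first.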
Another option, with ldply, but see hadley's comment below for a more efficient use of ldply:
# prepare data
x <- seq(1:10)
y <- sin(x)^2
dat <- data.frame(x,y)
# create list of named models obviously these are not suited to the data here, just to make the workflow work...
models <- list(model1=lm(y~x, data = dat),
model2=lm(y~I(1/x), data=dat),
model3=lm(y ~ log(x), data = dat),
model4=nls(y ~ I(1/x*a) + b*x, data = dat, start = list(a = 1, b = 1)),
model5=nls(y ~ (a + b*log(x)), data=dat, start = setNames(coef(lm(y ~ log(x), data=dat)), c("a", "b"))),
model6=nls(y ~ I(exp(1)^(a + b * x)), data=dat, start = list(a=0,b=0)),
model7=nls(y ~ I(1/x*a)+b, data=dat, start = list(a=1,b=1))
)
library(plyr)
library(AICcmodavg) # for small sample sizes
# build table with model names, function, AIC and AICc
data.frame(cbind(ldply(models, function(x) cbind(AICc = AICc(x), AIC = AIC(x))),
model = sapply(1:length(models), function(x) deparse(formula(models[[x]])))
))
.id AICc AIC model
1 model1 15.89136 11.89136 y ~ x
2 model2 15.78480 11.78480 y ~ I(1/x)
3 model3 15.80406 11.80406 y ~ log(x)
4 model4 16.62157 12.62157 y ~ I(1/x * a) + b * x
5 model5 15.80406 11.80406 y ~ (a + b * log(x))
6 model6 15.88937 11.88937 y ~ I(exp(1)^(a + b * x))
7 model7 15.78480 11.78480 y ~ I(1/x * a) + b
It's not immediately obvious to me how to replace the .id with a column name in the ldply function, any tips?

multivariate regression

I have two dependent variables that both depend on two variables AND on each other. Can this be modelled in R (it must be!)? I can't figure out how; does anyone have a hint?
In clear terms:
I want to model my data with the following model:
Y1=X1*coef1+X2*coef2
Y2=X1*coef2+X2*coef3
Note: coef2 appears in both lines
Xi, Yi is input and output data respectively
I got this far:
lm(Y1~X1+X2,mydata)
now how do I add the second line of the model including the cross dependency?
Your help is greatly appreciated!
Cheers, Bastiaan
Try this:
# sample data - true coefs are 2, 3, 4
set.seed(123)
n <- 35
DF <- data.frame(X1 = 1, X2 = 1:n, X3 = (1:n)^2)
DF <- transform(DF, Y1 = X1 * 2 + X2 * 3 + rnorm(n),
Y2 = X1 * 3 + X2 * 4 + rnorm(n))
# construct data frame for required model
DF2 <- with(DF, data.frame(y = c(Y1, Y2),
x1 = c(X1, 0*X1),
x2 = c(X2, X1),
x3 = c(0*X2, X2)))
lm(y ~. - 1, DF2)
We see it does, indeed, recover the true coefs of 2, 3, 4:
> lm(y ~. - 1, DF2)
Call:
lm(formula = y ~ . - 1, data = DF2)
Coefficients:
x1 x2 x3
2.084 2.997 4.007
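The same stacking trick, sketched for the exact model in the question (two varying regressors, no intercept, coef2 shared across both equations). The true values 1.5, -2 and 0.5 and the column names c1, c2, c3 are made up for the illustration.

```r
set.seed(1)
n <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)

# Simulated outputs: coef1 = 1.5, coef2 = -2 (shared), coef3 = 0.5
Y1 <- 1.5 * X1 - 2 * X2 + rnorm(n, sd = 0.1)
Y2 <- -2 * X1 + 0.5 * X2 + rnorm(n, sd = 0.1)

# Stack both equations; each column carries the regressor that multiplies one coefficient
stacked <- data.frame(
  y  = c(Y1, Y2),
  c1 = c(X1, rep(0, n)),  # coef1 only appears in the Y1 equation
  c2 = c(X2, X1),         # coef2 multiplies X2 in Y1 and X1 in Y2
  c3 = c(rep(0, n), X2)   # coef3 only appears in the Y2 equation
)

fit <- lm(y ~ c1 + c2 + c3 - 1, data = stacked)
coef(fit)
```

Note that this treats the errors of both equations as having the same variance, which is an assumption of the stacked fit rather than something stated in the question.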
