Updating a linear model with new lagged variables - r

I have a base model y ~ x1 + x2.
I want to update the model to contain y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2).
x3 and x4 are also dynamically selected.
fmla <- as.formula(paste('.~.', paste(c(x3, x4), collapse = '+')))
My update formula: update(fit, fmla)
I get an error from the as.formula function saying x3/x4 is not found. I understand the error, just not how to get around it to do what I want.

A possible solution for your problem can be:
# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))
# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)
# Add lagged x3 and x4 variables (names given as strings)
addvars <- c("x3", "x4")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
#
# Coefficients:
# (Intercept) x1 x2 lag(x3, 2) lag(x4, 2)
# -0.083180 0.015753 0.041998 0.000612 -0.093265
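The error in the question arises because c(x3, x4) is evaluated as R objects rather than as variable names. Here is a minimal sketch of the same idea as above, assuming the selected names arrive as a character vector. The seed, addvars and lag_terms are illustrative choices, not part of the question.

```r
# Illustrative data: y and x1..x4 (seed chosen arbitrarily)
set.seed(1)
yX <- as.data.frame(matrix(rnorm(1000), ncol = 5))
names(yX) <- c("y", paste0("x", 1:4))

fit <- lm(y ~ x1 + x2, data = yX)

# Names of the dynamically selected variables, kept as strings
addvars <- c("x3", "x4")

# Build ". ~ . + lag(x3, 2) + lag(x4, 2)" from the strings
lag_terms <- sprintf("lag(%s, 2)", addvars)
fmla <- as.formula(paste(". ~ . +", paste(lag_terms, collapse = " + ")))

fit2 <- update(fit, fmla)
formula(fit2)
```

One caveat, as far as I am aware: on plain data frame columns stats::lag() only changes the time-series attributes, so lm() does not actually shift the rows. That is one reason to use a package such as dynlm for real lags.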
Below is an example with the dynlm package.
data("USDistLag", package = "lmtest")
# Start with a dynamic linear model with gnp as explanatory variables
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )
# Time series regression with "ts" data:
# Start = 1963, End = 1982
#
# Call:
# dynlm(formula = f1, data = USDistLag)
#
# Coefficients:
# (Intercept) gnp
# -24.0889 0.6448
# Add lagged gnp
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Time series regression with "ts" data:
# Start = 1963, End = 1980
#
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
#
# Coefficients:
# (Intercept) gnp lag(gnp, 2)
# -31.1437 0.5366 0.1067
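The fitted sample above ends in 1980 rather than starting in 1965, which suggests that lag(gnp, 2) is acting as a two-period lead. Here is a small base R sketch of the sign convention of stats::lag(), using a toy series that is not part of the original example.

```r
x <- ts(1:5, start = 2001)       # toy annual series, 2001..2005

lead2 <- stats::lag(x, 2)        # positive k moves the series earlier in time
lag2  <- stats::lag(x, -2)       # negative k gives the conventional lag

start(lead2)[1]                  # 1999
start(lag2)[1]                   # 2003
```

If a conventional lag is intended, lag(gnp, -2) or dynlm's L(gnp, 2) may be what is wanted; this is worth checking against the dynlm documentation.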

Related

How to run the same regression but replacing the dataframe used in R?

I have 3 dataframes (df1, df2, df3) with the same variable names, and I would like to perform essentially the same regressions on all 3 dataframes. My regressions currently look like this:
m1 <- lm(y ~ x1 + x2, df1)
m2 <- lm(y~ x1 + x2, df2)
m3<- lm(y~ x1 + x2, df3)
Is there a way I can use for-loops in order to perform these regressions by just swapping out dataframe used?
Thank you
You can also add the data frames to a list and map the lm function over the list.
library(tidyverse)
df1 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df2 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df3 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df_list <- list(df1, df2, df3)
m <- map(df_list, ~lm(y ~ x, data = .))
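Since the question asked about a for loop specifically, here is a base R sketch of the same idea. The helper make_df, the seed and the list names are illustrative assumptions.

```r
# Illustrative data: three data frames with the same columns
set.seed(1)
make_df <- function() {
  x <- 1:20
  data.frame(x = x, y = 3 * x + rnorm(20, sd = 5))
}
dfs <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# Pre-allocate a named list, then fit one model per data frame
models <- vector("list", length(dfs))
names(models) <- names(dfs)
for (nm in names(dfs)) {
  models[[nm]] <- lm(y ~ x, data = dfs[[nm]])
}

# One column of coefficients per data frame
sapply(models, coef)
```

Keeping the data frames in a named list means the loop never has to build object names such as df1 from strings.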
Using update.
(fit <- lm(Y ~ X1 + X2 + X3, df1))
# Call:
# lm(formula = Y ~ X1 + X2 + X3, data = df1)
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.9416 -0.2400 0.6481 0.9357
update(fit, data=df2)
# Call:
# lm(formula = Y ~ X1 + X2 + X3, data = df2)
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.6948 0.3199 0.6255 0.9588
Or lapply
lapply(mget(ls(pattern='^df\\d$')), lm, formula=Y ~ X1 + X2 + X3)
# $df1
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.9416 -0.2400 0.6481 0.9357
#
#
# $df2
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.6948 0.3199 0.6255 0.9588
#
#
# $df3
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept) X1 X2 X3
# 0.5720 0.6106 -0.1576 1.1391
Data:
set.seed(42)
f <- \() transform(data.frame(X1=rnorm(10), X2=rnorm(10), X3=rnorm(10)),
Y=1 + .2*X1 + .4*X2 + .8*X3 + rnorm(10))
set.seed(42); df1 <- f(); df2 <- f(); df3 <- f()

How to use a variable in lm() function in R?

Let us say I have a dataframe (df) with two columns called "height" and "weight".
Let's say I define:
x = "height"
How do I use x within my lm() function? Neither df[x] nor just using x works.
Two ways :
Create a formula with paste
x = "height"
lm(paste0(x, '~', 'weight'), df)
Or use reformulate
lm(reformulate("weight", x), df)
Using reproducible example with mtcars dataset :
x = "cyl"
lm(paste0(x, '~', 'mpg'), data = mtcars)
#Call:
#lm(formula = paste0(x, "~", "mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
and same with
lm(reformulate("mpg", x), mtcars)
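reformulate also accepts several predictor names at once, which helps when the variable names are only known at run time. A short sketch with mtcars follows; the choice of columns is arbitrary.

```r
# Variable names held as strings
response   <- "mpg"
predictors <- c("wt", "hp")

# reformulate() builds the formula mpg ~ wt + hp from the strings
f <- reformulate(predictors, response = response)
fit <- lm(f, data = mtcars)

f
names(coef(fit))
```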
We can use glue to create the formula
x <- "height"
lm(glue::glue('{x} ~ weight'), data = df)
Using a reproducible example with mtcars
x <- 'cyl'
lm(glue::glue('{x} ~ mpg'), data = mtcars)
#Call:
#lm(formula = glue::glue("{x} ~ mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
When you run x = "height", you are assigning a character string to the variable x.
Consider this data frame:
df <- data.frame(
height = c(176, 188, 165),
weight = c(75, 80, 66)
)
If you want a regression using height and weight you can either do this:
lm(height ~ weight, data = df)
# Call:
# lm(formula = height ~ weight, data = df)
#
# Coefficients:
# (Intercept) weight
# 59.003 1.593
or this:
lm(df$height ~ df$weight)
# Call:
# lm(formula = df$height ~ df$weight)
#
# Coefficients:
# (Intercept) df$weight
# 59.003 1.593
If you really want to use x instead of height, you must have a variable called x (in your df or in your environment). You can do that by creating a new variable:
x <- df$height
y <- df$weight
lm(x ~ y)
# Call:
# lm(formula = x ~ y)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593
Or by changing the names of existing variables:
names(df) <- c("x", "y")
lm(x ~ y, data = df)
# Call:
# lm(formula = x ~ y, data = df)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593

Is there any way to construct real regression equation by taking parameters from models in R?

data is:
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
function to fit 4 models
library(splines)
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
mods[[1]]
stargazer(mods, type="text")
I want to construct the actual regression equation for each model by automatically taking the fitted parameters and independent variables from the model objects in R, if that is possible. For example, for the fit1 model: intercept = -0.20612, x = 0.17443, z = 0.03203. The equation would then be something like y = -0.206 + 0.174x + 0.032z. I want to list these equations for all models in a table along with common useful statistics such as R2, p-value, adj. R2, number of observations, etc. stargazer is not showing my desired output. Is there any way to do this in R without doing it manually in Excel?
Thanks in advance!
We can map over mods using @J.R.'s function here (regEq) and broom::glance to get the model R2, p-value, and adj. R2.
library(purrr)
library(broom)
map_dfr(mods,
function(x) data.frame('Eq'=regEq(lmObj = x, dig = 3), broom::glance(x), stringsAsFactors = FALSE),
.id='Model')
Model Eq r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
1 fit1 y = 0.091 - 0.022*x - 0.027*z 0.0012601436 -0.01933243 1.028408 0.06119408 0.9406769 3 -143.1721 294.3441 304.7648
2 fit2 y = 0.093 - 0.022*x - 0.003*I(z^2) 0.0006154188 -0.01999045 1.028740 0.02986619 0.9705843 3 -143.2043 294.4087 304.8294
3 fit3 y = 0.093 - 0.248*poly(x, 3)1 - 0.186*poly(x, 3)2 - 0.581*poly(x, 3)3 - 0.031*z 0.0048717358 -0.03702840 1.037296 0.11627016 0.9764662 5 -142.9909 297.9819 313.6129
4 fit4 y = 0.201 + 0.08*ns(x, 3)1 - 0.385*ns(x, 3)2 - 0.281*ns(x, 3)3 - 0.031*z 0.0032813558 -0.03868575 1.038125 0.07818877 0.9887911 5 -143.0708 298.1416 313.7726
deviance df.residual
1 102.5894 97
2 102.6556 97
3 102.2184 95
4 102.3818 95
The problem is that your models do not fit neatly into a single table: for example, fit3 returns 5 estimates while fit1 returns just 3.
If you are comfortable with lists, I would suggest them as a good way of storing this kind of information:
library(broom)
library(tidyverse)
library(splines)
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
list_representation<- map(mods,tidy)
Assuming mods as shown in the Note at the end, and that what is wanted is a character vector containing a text representation of each formula with the coefficients substituted in, we have the following.
The fit2text function takes a fitted object and outputs a character string with the text representation of the formula. The round argument gives the number of digits that the coefficients are rounded to in the result. The rmI argument, if TRUE, removes any I(...) and just leaves the ... inside assuming, for ease of implementation, that the expression inside does not contain any parentheses. If FALSE then I is not removed.
Other statistics can be extracted from summary(mods[[1]]) or broom::glance(mods[[1]])
fit2text <- function(fit, round = 2, rmI = TRUE) {
fo <- formula(fit)
resp <- all.vars(fo)[1]
co <- round(coef(fit), round)
labs <- c(if (terms(fit, "intercept") == 1) "", labels(fit))
p <- gsub("\\+ *-", "- ", paste(resp, "~ ", paste(paste(co, labs), collapse = " + ")))
p2 <- if (rmI) gsub("I\\(([^)]+)\\)", "\\1", p) else p
gsub(" +", " ", p2)
}
sapply(mods, fit2text)
giving:
fit1
"y ~ -0.11 - 0.05 x + 0.03 z"
fit2
"y ~ -0.07 - 0.05 x - 0.04 z^2"
fit3
"y ~ -0.11 - 0.43 poly(x, 3) - 1.05 z + 0.27 + 0.04 poly(x, 3)"
fit4
"y ~ -0.55 + 0.23 ns(x, 3) + 0.79 z - 0.25 + 0.04 ns(x, 3)"
Note
The code in the question was not reproducible because the library calls were missing, it used random numbers without a set.seed and there were some further errors in the code. For clarity, we provide the following reproducible code that we used to provide the input for the above answer.
library(splines)
set.seed(123)
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
# function to fit 4 models
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
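In the fit3 and fit4 results of fit2text above, labels(fit) returns one label per term while coef(fit) has one entry per basis column, so the labels are recycled and do not line up with the coefficients. A variant that pairs each coefficient with its own name from names(coef(fit)) could look like the sketch below. coef_text is a hypothetical helper, not part of the original answer, and the mtcars model is only for illustration.

```r
# Hypothetical helper: pair every coefficient with its own name
coef_text <- function(fit, digits = 3) {
  co   <- round(coef(fit), digits)
  resp <- all.vars(formula(fit))[1]            # response variable name
  # Intercept is shown as a bare number, other terms as "value*name"
  piece <- ifelse(names(co) == "(Intercept)",
                  as.character(abs(co)),
                  paste0(abs(co), "*", names(co)))
  rhs <- paste(ifelse(co < 0, "-", "+"), piece, collapse = " ")
  rhs <- sub("^\\+ ", "", rhs)                 # drop a leading "+ "
  rhs <- sub("^- ", "-", rhs)                  # attach a leading minus
  paste(resp, "=", rhs)
}

fit <- lm(mpg ~ poly(wt, 2) + hp, data = mtcars)
coef_text(fit)
```

It can be applied to the whole list with sapply(mods, coef_text). The sketch assumes no coefficient is NA (no aliased terms).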

R: merge strings into one formula object

I have a character object that describes the control variables for a regression model. I fail to dynamically reference those correctly, whenever there is more than one control variable. Consider the following example:
x1 = runif(1000); x2 = runif(1000); x3 = runif(1000); e = runif(1000)
y = 2*x1+3*x2+x3+ e
df = data.frame(y, x1,x2,x3)
# define formula inputs
depvar =as.symbol("y")
variableofinterest = as.symbol("x1")
control1 = as.symbol('x2')
control2 = as.symbol('x2+x3')
# this works
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control1) , data = df)))
# this does not
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control2) , data = df)))
It does not work, since the dataframe obviously contains no variable x2+x3. But how can I disentangle those so that they are referenced correctly when the input control = x2+x3 is a given character string (beyond my control)?
We can quote instead of as.symbol
control2 <- quote(x2 + x3)
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control2) , data = df)))
#Call:
#lm(formula = y ~ x1 + (x2 + x3), data = df)
#Coefficients:
#(Intercept) x1 x2 x3
# 0.450 2.056 3.007 1.056
Note that when we do as.symbol, it adds a backquote
as.symbol('x2 + x3')
#`x2 + x3`
compare it with quote which returns a language object instead of symbol
quote(x2 + x3)
#x2 + x3
If it is already a string, then we can use parse_expr from rlang
control2 <- rlang::parse_expr('x2 + x3')
eval(bquote(lm(.(depvar)~ .(variableofinterest) + .(control2) , data = df)))
#Call:
#lm(formula = y ~ x1 + (x2 + x3), data = df)
#Coefficients:
#(Intercept) x1 x2 x3
# 0.450 2.056 3.007 1.056
If your objective is to have just one coefficient for x2 + x3, you should use I() (Inhibit Interpretation/Conversion of Objects).
Furthermore, you would need what @Roland has said:
control2 = parse(text = 'x2+x3')[[1]]
eval(bquote(lm(.(depvar)~ .(variableofinterest) + I(.(control2)) , data = df)))
Call:
lm(formula = y ~ x1 + I(x2 + x3), data = df)
Coefficients:
(Intercept) x1 I(x2 + x3)
0.4899 2.0157 2.0342
Otherwise, if you don't want to work with eval, as.symbol, bquote and .( ) you can use as.formula and paste0.
# define formula inputs
depvar = "y"
variableofinterest = "x1"
control1 = 'x2'
control2 = 'I(x2+x3)'
lm(as.formula(paste0(depvar,
"~",
paste0(c(variableofinterest, control2), collapse = "+"))),
data = df)
Call:
lm(formula = as.formula(paste0(depvar, "~", paste0(c(variableofinterest,
control2), collapse = "+"))), data = df)
Coefficients:
(Intercept) x1 I(x2 + x3)
0.4899 2.0157 2.0342
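When the controls arrive as a single string such as 'x2+x3' and separate coefficients are wanted, reformulate can splice the string into the right-hand side directly. This is a minimal sketch that rebuilds data similar to the question's; the seed, sample size and object names are assumptions made for illustration.

```r
# Illustrative data resembling the question's setup
set.seed(1)
df <- data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
df$y <- 2 * df$x1 + 3 * df$x2 + df$x3 + runif(100)

# Formula inputs, all given as character strings
depvar             <- "y"
variableofinterest <- "x1"
controls           <- "x2+x3"    # several controls in one string

# The term labels are pasted together with "+" and parsed as one expression
f <- reformulate(c(variableofinterest, controls), response = depvar)
fit <- lm(f, data = df)

f
names(coef(fit))
```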

What is the difference between x^2 and I(x^2) in R?

What is the difference between these two models in R?
model1 <- glm(y~ x + x^2, family=binomial(link=logit), weights=numbers)
model2 <- glm(y~ x + I(x^2),family=binomial(link=logit), weights=numbers)
Also, what is the equivalent of I(x^2) in SAS?
The I() function means 'as is' whereas the ^n (to the power of n) operator means 'include these variables and all interactions up to n way'
This means:
I(X^2) is literally regressing Y against X squared and
X^2 means 'include X and its 2-way interactions', but since there is only one variable there is nothing to interact with, so it reduces to X itself. Note that your formula says X + X^2, which translates to X + X. The formula syntax counts a repeated term only once, so one of the two Xs is dropped.
Demonstration:
Y <- runif(100)
X1 <- runif(100)
X2 <- runif(100)
df <- data.frame(Y,X1,X2)
b <- lm( Y ~ X2 + X2^2 + X2,data=df)
> b
Call:
lm(formula = Y ~ X2 + X2^2 + X2, data = df)
Coefficients:
(Intercept) X2
0.48470 0.05098
a <- lm( Y ~ X2 + I(X2^2),data=df)
> a
Call:
lm(formula = Y ~ X2 + I(X2^2), data = df)
Coefficients:
(Intercept) X2 I(X2^2)
0.47545 0.11339 -0.06682
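The same point can be checked without fitting anything by asking terms() how it expands each formula. A small sketch follows; term_labels is just a convenience wrapper defined here.

```r
# Convenience wrapper: which terms does a formula expand to?
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ x + x^2)      # "x"                 (x^2 collapses to x)
term_labels(y ~ x + I(x^2))   # "x" "I(x^2)"        (arithmetic square kept)
term_labels(y ~ (a + b)^2)    # "a" "b" "a:b"       (main effects + interaction)
```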
Hope it helps!
