What is the difference between x^2 and I(x^2) in R?

What is the difference between these two models in R?
model1 <- glm(y ~ x + x^2, family=binomial(link=logit), weights=numbers)
model2 <- glm(y ~ x + I(x^2), family=binomial(link=logit), weights=numbers)
Also, what is the equivalent of I(x^2) in SAS?

The I() function means 'as is', whereas inside a formula the ^n (to the power of n) operator means 'include these variables and all interactions up to n-way'.
This means:
I(X^2) literally regresses Y against X squared, and
X^2 means include X and its two-way interactions. With only one variable there is nothing to interact with, so it returns only the variable itself, i.e. X. Note that your formula says x + x^2, which translates to x + x, and in formula syntax a repeated term is only counted once, i.e. one of the two is dropped.
Demonstration:
Y <- runif(100)
X1 <- runif(100)
X2 <- runif(100)
df <- data.frame(Y,X1,X2)
b <- lm( Y ~ X2 + X2^2 + X2,data=df)
> b
Call:
lm(formula = Y ~ X2 + X2^2 + X2, data = df)
Coefficients:
(Intercept) X2
0.48470 0.05098
a <- lm( Y ~ X2 + I(X2^2),data=df)
> a
Call:
lm(formula = Y ~ X2 + I(X2^2), data = df)
Coefficients:
(Intercept) X2 I(X2^2)
0.47545 0.11339 -0.06682
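The same expansion can be checked without fitting anything, by asking R which term labels it derives from each formula. This is a minimal sketch; the helper name term_labels and the variables a and b are illustrative and not part of the question.

```r
# Term labels R derives from a formula; no data is needed for this.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ x + x^2)     # only "x": x^2 collapses to x, duplicate dropped
term_labels(y ~ x + I(x^2))  # "x" and "I(x^2)": the square is kept as arithmetic
term_labels(y ~ (a + b)^2)   # "a", "b", "a:b": ^2 expands to two-way interactions
```

The last call shows what ^ is meant for in a formula: crossing several variables up to a given order.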
Hope it helps!

Related

Is there any way to construct real regression equation by taking parameters from models in R?

data is:
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
Function to fit four models:
library(splines)
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
mods[[1]]
stargazer(mods, type="text")
I want to build the actual regression equation for each model automatically in R, taking the coefficients and independent variables from the fitted models. For example, for fit1 with intercept = -0.20612, x = 0.17443, z = 0.03203, the equation would be something like y = -0.206 + 0.174x + 0.032z. I then want to list the equations for all models in a table along with common statistics such as R2, p-value, adjusted R2, and number of observations. stargazer does not give me this output. Is there a way to do this in R rather than manually in Excel?
Thanks in advance!
We can map over mods using @J.R.'s regEq function (linked from the original answer, not defined here) and broom::glance to get the model R2, p-value, and adjusted R2.
library(purrr)
library(broom)
map_dfr(mods,
function(x) data.frame('Eq'=regEq(lmObj = x, dig = 3), broom::glance(x), stringsAsFactors = FALSE),
.id='Model')
Model Eq r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
1 fit1 y = 0.091 - 0.022*x - 0.027*z 0.0012601436 -0.01933243 1.028408 0.06119408 0.9406769 3 -143.1721 294.3441 304.7648
2 fit2 y = 0.093 - 0.022*x - 0.003*I(z^2) 0.0006154188 -0.01999045 1.028740 0.02986619 0.9705843 3 -143.2043 294.4087 304.8294
3 fit3 y = 0.093 - 0.248*poly(x, 3)1 - 0.186*poly(x, 3)2 - 0.581*poly(x, 3)3 - 0.031*z 0.0048717358 -0.03702840 1.037296 0.11627016 0.9764662 5 -142.9909 297.9819 313.6129
4 fit4 y = 0.201 + 0.08*ns(x, 3)1 - 0.385*ns(x, 3)2 - 0.281*ns(x, 3)3 - 0.031*z 0.0032813558 -0.03868575 1.038125 0.07818877 0.9887911 5 -143.0708 298.1416 313.7726
deviance df.residual
1 102.5894 97
2 102.6556 97
3 102.2184 95
4 102.3818 95
The problem is that your models do not all fit one tabular layout: for example, fit3 returns 5 estimates while fit1 returns just 3.
If you are comfortable with lists, they are a good way of storing this kind of information:
library(broom)
library(tidyverse)
library(splines)
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
list_representation<- map(mods,tidy)
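If one row of fit statistics per model is still wanted next to the list of coefficient tables, it can be assembled from summary() in base R. The following is only a sketch: the helper name fit_stats, the seed, and the choice of columns are assumptions made for illustration.

```r
library(splines)

set.seed(123)  # seed chosen only to make this sketch repeatable
d <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mods <- list(
  fit1 = lm(y ~ x + z, data = d),
  fit2 = lm(y ~ x + I(z^2), data = d),
  fit3 = lm(y ~ poly(x, 3) + z, data = d),
  fit4 = lm(y ~ ns(x, 3) + z, data = d)
)

# One row of fit statistics per model, taken from summary.lm().
fit_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  data.frame(
    r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    # p-value of the overall F-test
    p.value       = unname(pf(f["value"], f["numdf"], f["dendf"],
                              lower.tail = FALSE)),
    nobs          = nobs(fit)
  )
}

stats_tab <- do.call(rbind, lapply(mods, fit_stats))
stats_tab <- data.frame(Model = names(mods), stats_tab, row.names = NULL)
stats_tab
```

Because every model contributes exactly one row here, the differing numbers of coefficients no longer get in the way of a rectangular table.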
Assuming mods shown in the Note at the end and that what is wanted is a character vector of a text representation of the formulas with the coefficients substituted we have the following.
The fit2text function takes a fitted object and outputs a character string with the text representation of the formula. The round argument gives the number of digits that the coefficients are rounded to in the result. The rmI argument, if TRUE, removes any I(...) and just leaves the ... inside assuming, for ease of implementation, that the expression inside does not contain any parentheses. If FALSE then I is not removed.
Other statistics can be extracted from summary(mods[[1]]) or broom::glance(mods[[1]])
fit2text <- function(fit, round = 2, rmI = TRUE) {
fo <- formula(fit)
resp <- all.vars(fo)[1]
co <- round(coef(fit), round)
labs <- c(if (terms(fit, "intercept") == 1) "", labels(fit))
p <- gsub("\\+ *-", "- ", paste(resp, "~ ", paste(paste(co, labs), collapse = " + ")))
p2 <- if (rmI) gsub("I\\(([^)]+)\\)", "\\1", p) else p
gsub(" +", " ", p2)
}
sapply(mods, fit2text)
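giving the following. Note that for fit3 and fit4 the single term label (poly(x, 3) or ns(x, 3)) is recycled across that term's three coefficients, so those two strings pair some coefficients with the wrong label: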
fit1
"y ~ -0.11 - 0.05 x + 0.03 z"
fit2
"y ~ -0.07 - 0.05 x - 0.04 z^2"
fit3
"y ~ -0.11 - 0.43 poly(x, 3) - 1.05 z + 0.27 + 0.04 poly(x, 3)"
fit4
"y ~ -0.55 + 0.23 ns(x, 3) + 0.79 z - 0.25 + 0.04 ns(x, 3)"
Note
The code in the question was not reproducible because the library calls were missing, it used random numbers without a set.seed and there were some further errors in the code. For clarity, we provide the following reproducible code that we used to provide the input for the above answer.
library(splines)
set.seed(123)
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
# function to fit 4 models
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
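In the fit3 and fit4 strings shown earlier, labels(fit) supplies one label per term while poly(x, 3) and ns(x, 3) each contribute three coefficients. One possible variant, which is not part of the original answer, labels each coefficient with its own name from names(coef(fit)). The function name fit2text_coef and its digits argument are assumptions for this sketch.

```r
# Variant of fit2text that labels every coefficient by its own name, so
# multi-column terms such as poly(x, 3) get one label per coefficient.
fit2text_coef <- function(fit, digits = 2, rmI = TRUE) {
  resp <- all.vars(formula(fit))[1]
  co <- round(coef(fit), digits)
  labs <- names(co)
  labs[labs == "(Intercept)"] <- ""          # intercept is printed bare
  rhs <- paste(trimws(paste(co, labs)), collapse = " + ")
  p <- gsub("\\+ *-", "- ", paste(resp, "~", rhs))  # "+ -0.1" becomes "- 0.1"
  if (rmI) p <- gsub("I\\(([^)]+)\\)", "\\1", p)    # same I(...) rule as fit2text
  gsub(" +", " ", p)
}

set.seed(123)
d <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
fit2text_coef(lm(y ~ poly(x, 3) + z, data = d))
```

The result keeps the suffixes R adds to the basis columns (poly(x, 3)1, poly(x, 3)2, ...), which is less pretty but keeps each number next to the column it belongs to.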

Updating a linear model with lagged new variables

I have a base model y ~ x1 + x2.
I want to update the model to contain y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2).
x3 and x4 are also dynamically selected.
fmla <- as.formula(paste('.~.', paste(c(x3, x4), collapse = '+')))
My update formula: update(fit, fmla)
I get an error from as.formula saying x3/x4 is not found. I understand the error, just not how to get around it to do what I want.
A possible solution for your problem can be:
# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))
# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)
# Add lagged x3 and x4 variables (names supplied as character strings)
addvars <- c("x3", "x4")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
#
# Coefficients:
# (Intercept) x1 x2 lag(x3, 2) lag(x4, 2)
# -0.083180 0.015753 0.041998 0.000612 -0.093265
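One caveat with plain data-frame columns: stats::lag() only shifts the time attribute of a vector and leaves the values where they are, so lm() sees the unshifted variable. The sketch below checks this on made-up data; shift_down is a hypothetical helper, not something from the answer above.

```r
set.seed(1)
x3 <- rnorm(20)
y  <- rnorm(20)

# The values are unchanged; only the "tsp" attribute differs.
identical(as.numeric(stats::lag(x3, 2)), x3)

# Consequently the two regressions have the same coefficients.
same <- all.equal(unname(coef(lm(y ~ stats::lag(x3, 2)))),
                  unname(coef(lm(y ~ x3))))
same

# Hypothetical helper: move values down by k rows, padding with NA.
shift_down <- function(v, k) c(rep(NA, k), head(v, -k))
coef(lm(y ~ x3 + shift_down(x3, 2)))  # rows with NA are dropped by lm()
```

With a helper like this, the formula string could be built from "shift_down(" instead of "lag(" in the same paste() call.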
Below is an example with the dynlm package.
data("USDistLag", package = "lmtest")
# Start with a dynamic linear model with gnp as explanatory variables
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )
# Time series regression with "ts" data:
# Start = 1963, End = 1982
#
# Call:
# dynlm(formula = f1, data = USDistLag)
#
# Coefficients:
# (Intercept) gnp
# -24.0889 0.6448
# Add lagged gnp
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Time series regression with "ts" data:
# Start = 1963, End = 1980
#
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
#
# Coefficients:
# (Intercept) gnp lag(gnp, 2)
# -31.1437 0.5366 0.1067
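With real "ts" data the sign convention of stats::lag() matters: a positive k moves the time base back, so the shifted series holds future values at each date. That fits the output above, where the sample end moves from 1982 to 1980. A small base-R check on a made-up series (the names z, lead2, and lag2 are illustrative):

```r
z <- ts(1:5, start = 2001)

lead2 <- stats::lag(z, 2)   # time base moves back: series now starts in 1999
lag2  <- stats::lag(z, -2)  # time base moves forward: series starts in 2003

start(lead2)
start(lag2)
cbind(z, lead2, lag2)       # aligned by year; lead2 holds z two years ahead
```

If values from two periods earlier are wanted, lag(gnp, -2) would be the form to try; as far as I recall, dynlm also documents an L() helper for this, but check its help page before relying on that.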

How to plot ols with r.c. splines

I'd like to plot the predicted line, with standard error bands, of a regression that contains a restricted cubic spline (used because of non-linearity in the model). I can get the predicted points, but am not sure how to plot just the lines and error bands. ggplot is preferred, but base graphics is fine too. Thanks.
Here is an example from the documentation:
library(rms)
# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred <- fitted(f) # or predict(f) or f$linear.predictors
f2 <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
fastbw(f2, aics=100000)
options(datadist=NULL)
And a plot of the predicted values of the model:
plot(predict(f2))
The rms package has a number of helpful functions for this purpose. It is worth looking at http://biostat.mc.vanderbilt.edu/wiki/Main/RmS
In this instance, you can simply set datadist (which sets up distribution summaries for the predictor variables) appropriately and then use plot(Predict(f)) or ggplot(Predict(f)):
set.seed(5)
# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
ddist <- datadist(x1,x2,x3,x4)
options(datadist='ddist')
plot(Predict(f))
ggplot(Predict(f))
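For readers without rms, the same idea can be sketched in base R. This is an approximation under stated assumptions: splines::ns() (a natural cubic spline) stands in for rcs(), with a different basis and default knots, and the band is a 95% confidence interval from predict() rather than plus or minus one standard error.

```r
library(splines)

set.seed(5)
x1 <- runif(200)
x2 <- runif(200)
y  <- x1 + x2 + rnorm(200)
dat <- data.frame(y, x1, x2)

# ns() is a natural cubic spline basis, used here as a stand-in for rcs().
fit <- lm(y ~ ns(x1, df = 3) + x2, data = dat)

# Predict over a grid of x1 with x2 held at its mean.
grid <- data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                   x2 = mean(x2))
pr <- predict(fit, newdata = grid, interval = "confidence")  # fit, lwr, upr

# Fitted line plus dashed lower and upper confidence limits.
plot(grid$x1, pr[, "fit"], type = "l", ylim = range(pr),
     xlab = "x1", ylab = "Predicted y")
lines(grid$x1, pr[, "lwr"], lty = 2)
lines(grid$x1, pr[, "upr"], lty = 2)
```

If standard errors themselves are wanted, predict(fit, newdata = grid, se.fit = TRUE) returns them in its se.fit component.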

Stack coefficient plots in R

I'm running a set of models with the same independent variables but different dependent variables, and would like to create a set of coefficient plots in one figure in which each model gets its own panel. The following code gives the idea, but here all of the models are merged into one plot rather than shown as 3 separate panels side by side:
require("coefplot")
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
mod2 <- lm(y2 ~ x + z, data = dat)
mod3 <- lm(y3 ~ x + z, data = dat)
multiplot(mod1,mod2, mod3)
Which generates this plot:
Any thoughts on how to get them to panel next to each other in one figure? Thanks!
I haven't used the coefplot package before, but you can create a coefficient plot directly in ggplot2.
library(ggplot2)
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
mod2 <- lm(y2 ~ x + z, data = dat)
mod3 <- lm(y3 ~ x + z, data = dat)
## Create data frame of model coefficients and standard errors
# Function to extract what we need
ce = function(model.obj) {
extract = summary(get(model.obj))$coefficients[ ,1:2]
return(data.frame(extract, vars=row.names(extract), model=model.obj))
}
# Run function on the three models and bind into single data frame
coefs = do.call(rbind, sapply(paste0("mod",1:3), ce, simplify=FALSE))
names(coefs)[2] = "se"
# Faceted coefficient plot
ggplot(coefs, aes(vars, Estimate)) +
geom_hline(yintercept=0, lty=2, lwd=1, colour="grey50") +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se, colour=vars),
lwd=1, width=0) +
geom_point(size=3, aes(colour=vars)) +
facet_grid(. ~ model) +
coord_flip() +
guides(colour="none") +
labs(x="Coefficient", y="Value") +
theme_grey(base_size=15)
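The coefficient data frame can also be built from a named list of models, which avoids looking objects up by name with get(). A minimal sketch under that assumption (the list name mods and the use of reformulate() are choices made here, not part of the answer above):

```r
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100),
                  y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))

# Keep the models in a named list rather than as separate objects.
mods <- lapply(c(mod1 = "y1", mod2 = "y2", mod3 = "y3"), function(resp) {
  lm(reformulate(c("x", "z"), response = resp), data = dat)
})

# Estimate and standard error for every coefficient, tagged with the model name.
coefs <- do.call(rbind, lapply(names(mods), function(nm) {
  est <- summary(mods[[nm]])$coefficients[, 1:2]
  data.frame(Estimate = est[, 1], se = est[, 2],
             vars = rownames(est), model = nm, row.names = NULL)
}))
coefs
```

The resulting coefs has the same columns the faceted plot uses (Estimate, se, vars, model), so the ggplot call should work on it unchanged.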

Using lapply on a list of models

I have generated a list of models, and would like to create a summary table.
As an example, here are two models:
x <- seq(1:10)
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))
and two functions, the first generating the equation from the components of the formula
get.model.equation <- function(x) {
x <- as.character((x$call)$formula)
x <- paste(x[2],x[1],x[3])
}
and the second generating the name of model as a string
get.model.name <- function(x) {
x <- deparse(substitute(x))
}
With these, I create a summary table
model.list <- list(model1, model2)
AIC.data <- lapply(X = model.list, FUN = AIC)
AIC.data <- as.numeric(AIC.data)
model.models <- lapply(X = model.list, FUN = get.model.equation)
model.summary <- cbind(model.models, AIC.data)
model.summary <- as.data.frame(model.summary)
names(model.summary) <- c("Model", "AIC")
model.summary$AIC <- unlist(model.summary$AIC)
rm(AIC.data)
model.summary[order(model.summary$AIC),]
Which all works fine.
I'd like to add the model name to the table using get.model.name
x <- get.model.name(model1)
Which gives me "model1" as I want.
So now I apply the function to the list of models
model.names <- lapply(X = model.list, FUN = get.model.name)
but now instead of model1 I get X[[1L]]
How do I get model1 rather than X[[1L]]?
I'm after a table that looks like this:
Model Formula AIC
model1 y ~ x 11.89136
model2 y ~ x + I(x^2) + I(x^3) 15.03888
Do you want something like this?
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)
I'd do something like this:
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
# changed Reduce('rbind', ...) to do.call(rbind, ...) (Hadley's comment)
do.call(rbind,
lapply(names(model.list), function(x)
data.frame(model = x,
formula = get.model.equation(model.list[[x]]),
AIC = AIC(model.list[[x]])
)
)
)
# model formula AIC
# 1 model1 y ~ x 11.89136
# 2 model2 y ~ x + I(x^2) + I(x^3) 15.03888
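As for why X[[1L]] appears: lapply() calls FUN(X[[i]], ...), so deparse(substitute(x)) inside the function sees that expression instead of the original object name. If the models already exist as separate objects, one way to get a named list without retyping them is mget(). This is a sketch; the name summary_tab is illustrative.

```r
x <- 1:10
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))

# mget() looks the objects up by name and returns a list named after them.
model.list <- mget(c("model1", "model2"))
names(model.list)

# Build the table from the list names instead of from substitute().
summary_tab <- data.frame(
  Model   = names(model.list),
  Formula = vapply(model.list, function(m) deparse(formula(m)), character(1)),
  AIC     = vapply(model.list, AIC, numeric(1)),
  row.names = NULL
)
summary_tab[order(summary_tab$AIC), ]
```

For very long formulas deparse() can return more than one string, in which case the vapply() call would need paste(..., collapse = " ") around it.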
Another option, with ldply, but see hadley's comment below for a more efficient use of ldply:
# prepare data
x <- seq(1:10)
y <- sin(x)^2
dat <- data.frame(x,y)
# create list of named models obviously these are not suited to the data here, just to make the workflow work...
models <- list(model1=lm(y~x, data = dat),
model2=lm(y~I(1/x), data=dat),
model3=lm(y ~ log(x), data = dat),
model4=nls(y ~ I(1/x*a) + b*x, data = dat, start = list(a = 1, b = 1)),
model5=nls(y ~ (a + b*log(x)), data=dat, start = setNames(coef(lm(y ~ log(x), data=dat)), c("a", "b"))),
model6=nls(y ~ I(exp(1)^(a + b * x)), data=dat, start = list(a=0,b=0)),
model7=nls(y ~ I(1/x*a)+b, data=dat, start = list(a=1,b=1))
)
library(plyr)
library(AICcmodavg) # for small sample sizes
# build table with model names, function, AIC and AICc
data.frame(cbind(ldply(models, function(x) cbind(AICc = AICc(x), AIC = AIC(x))),
model = sapply(1:length(models), function(x) deparse(formula(models[[x]])))
))
.id AICc AIC model
1 model1 15.89136 11.89136 y ~ x
2 model2 15.78480 11.78480 y ~ I(1/x)
3 model3 15.80406 11.80406 y ~ log(x)
4 model4 16.62157 12.62157 y ~ I(1/x * a) + b * x
5 model5 15.80406 11.80406 y ~ (a + b * log(x))
6 model6 15.88937 11.88937 y ~ I(exp(1)^(a + b * x))
7 model7 15.78480 11.78480 y ~ I(1/x * a) + b
It's not immediately obvious to me how to replace the .id with a column name in the ldply function, any tips?
