Is there any way to construct a real regression equation by taking parameters from models in R?

The data is:
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
Function to fit 4 models:
library(splines)
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
mods[[1]]
stargazer(mods, type = "text")
I want to construct the actual regression equation of each model automatically inside R, by taking the coefficients and the independent variables from the fitted models, if that is possible. For example, for the fit1 model: intercept = -0.20612, x = 0.17443, z = 0.03203. The equation would then be something like y = -0.206 + 0.174x + 0.032z. I want to list the equations of all models in a table along with common useful statistics such as R2, p-value, adj. R2, number of observations, etc. stargazer is not giving me the output I want. Is there any way to do this in R without doing it manually in Excel?
Thanks in advance!

We can map over mods using @J.R.'s regEq function here, together with broom::glance to get each model's R2, p-value, and adjusted R2.
library(purrr)
library(broom)
map_dfr(mods,
function(x) data.frame('Eq'=regEq(lmObj = x, dig = 3), broom::glance(x), stringsAsFactors = FALSE),
.id='Model')
Model Eq r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
1 fit1 y = 0.091 - 0.022*x - 0.027*z 0.0012601436 -0.01933243 1.028408 0.06119408 0.9406769 3 -143.1721 294.3441 304.7648
2 fit2 y = 0.093 - 0.022*x - 0.003*I(z^2) 0.0006154188 -0.01999045 1.028740 0.02986619 0.9705843 3 -143.2043 294.4087 304.8294
3 fit3 y = 0.093 - 0.248*poly(x, 3)1 - 0.186*poly(x, 3)2 - 0.581*poly(x, 3)3 - 0.031*z 0.0048717358 -0.03702840 1.037296 0.11627016 0.9764662 5 -142.9909 297.9819 313.6129
4 fit4 y = 0.201 + 0.08*ns(x, 3)1 - 0.385*ns(x, 3)2 - 0.281*ns(x, 3)3 - 0.031*z 0.0032813558 -0.03868575 1.038125 0.07818877 0.9887911 5 -143.0708 298.1416 313.7726
deviance df.residual
1 102.5894 97
2 102.6556 97
3 102.2184 95
4 102.3818 95
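The regEq helper used above is linked rather than shown, so its actual definition is not available here. The following is only a hypothetical stand-in, assuming it takes an lm object and a number of digits and returns a string of the form "y = a + b*x - c*z"; the argument names lmObj and dig are copied from the call above, everything else is a guess.

```r
# Hypothetical stand-in for the linked regEq(); the original is not shown here.
regEq <- function(lmObj, dig = 3) {
  co   <- round(coef(lmObj), dig)
  resp <- all.vars(formula(lmObj))[1]
  # magnitude plus term name; the intercept has no term name attached
  body <- ifelse(names(co) == "(Intercept)",
                 as.character(abs(co)),
                 paste0(abs(co), "*", names(co)))
  sgn  <- ifelse(co < 0, "-", "+")
  # the leading term keeps a bare minus sign and drops a leading plus
  first <- paste0(if (co[1] < 0) "-" else "", body[1])
  rest  <- if (length(co) > 1) paste(sgn[-1], body[-1], collapse = " ") else NULL
  paste(c(paste(resp, "=", first), rest), collapse = " ")
}

set.seed(123)
d <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
regEq(lm(y ~ x + z, data = d), dig = 3)
```

Because the term names come from names(coef(...)), each basis column of poly() or ns() gets its own label, which matches the shape of the Eq column shown above.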

The problem is that your models are not ideal for a single tabular layout: for example, fit3 returns 5 estimates while fit1 returns just 3.
If you are comfortable with lists, I would suggest them as a great way of storing this kind of information.
library(broom)
library(tidyverse)
library(splines)
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
list_representation<- map(mods,tidy)
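If a single table is still wanted, one option is a long layout with one row per coefficient, so models with different numbers of terms stack without padding. The sketch below uses base R only and is an assumption about what would be useful here, not part of the answer above; the name coef_long is made up for illustration, and only two of the four models are fitted to keep it short.

```r
library(splines)

set.seed(123)
d <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mods <- list(
  fit1 = lm(y ~ x + z, data = d),
  fit3 = lm(y ~ poly(x, 3) + z, data = d)
)

# one row per (model, coefficient) pair
coef_long <- do.call(rbind, lapply(names(mods), function(nm) {
  co <- coef(mods[[nm]])
  data.frame(model = nm, term = names(co), estimate = unname(co),
             row.names = NULL)
}))
coef_long
```

With purrr and broom loaded, map_dfr(mods, tidy, .id = "model") should give a similar long table with standard errors and p-values added.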

Assuming mods is as shown in the Note at the end, and that what is wanted is a character vector holding a text representation of each formula with the coefficients substituted in, we have the following.
The fit2text function takes a fitted object and outputs a character string with the text representation of the formula. The round argument gives the number of digits that the coefficients are rounded to in the result. The rmI argument, if TRUE, removes any I(...) and just leaves the ... inside assuming, for ease of implementation, that the expression inside does not contain any parentheses. If FALSE then I is not removed.
Other statistics can be extracted from summary(mods[[1]]) or broom::glance(mods[[1]])
fit2text <- function(fit, round = 2, rmI = TRUE) {
fo <- formula(fit)
resp <- all.vars(fo)[1]
co <- round(coef(fit), round)
labs <- c(if (attr(terms(fit), "intercept") == 1) "", labels(fit))
p <- gsub("\\+ *-", "- ", paste(resp, "~ ", paste(paste(co, labs), collapse = " + ")))
p2 <- if (rmI) gsub("I\\(([^)]+)\\)", "\\1", p) else p
gsub(" +", " ", p2)
}
sapply(mods, fit2text)
giving:
fit1
"y ~ -0.11 - 0.05 x + 0.03 z"
fit2
"y ~ -0.07 - 0.05 x - 0.04 z^2"
fit3
"y ~ -0.11 - 0.43 poly(x, 3) - 1.05 z + 0.27 + 0.04 poly(x, 3)"
fit4
"y ~ -0.55 + 0.23 ns(x, 3) + 0.79 z - 0.25 + 0.04 ns(x, 3)"
Note
The code in the question was not reproducible because the library calls were missing, it used random numbers without a set.seed and there were some further errors in the code. For clarity, we provide the following reproducible code that we used to provide the input for the above answer.
library(splines)
set.seed(123)
d <- data.frame(x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1),
z = rnorm(100, 0, 1))
# function to fit 4 models
func <-function(d){
fit1 <- lm( y~ x + z, data = d)
fit2 <- lm( y~x + I(z^2), data = d)
fit3 <- lm( y~poly(x,3) + z, data = d)
fit4 <- lm( y~ns(x, 3) + z, data = d)
l <- list(fit1, fit2, fit3, fit4)
names(l) <- paste0("fit", 1:4)
return(l)
}
mods <- func(d)
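Note that labels(fit) returns one label per model term, while poly(x, 3) and ns(x, 3) each contribute three coefficients, which is why the fit3 and fit4 strings above show the labels recycled out of order. A possible variant, sketched here as an assumption rather than as the answer's own code, pairs each coefficient with its own name from names(coef(fit)); the function name fit2text_coef is hypothetical.

```r
library(splines)

# Variant that labels every coefficient individually (hypothetical name)
fit2text_coef <- function(fit, digits = 2) {
  co   <- round(coef(fit), digits)
  resp <- all.vars(formula(fit))[1]
  # the intercept is printed as a bare number
  labs <- ifelse(names(co) == "(Intercept)", "", names(co))
  p <- paste(resp, "~", paste(trimws(paste(co, labs)), collapse = " + "))
  # turn "+ -0.05" into "- 0.05"
  gsub("\\+ -", "- ", p)
}

set.seed(123)
d <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
fit2text_coef(lm(y ~ x + z, data = d))
fit2text_coef(lm(y ~ poly(x, 3) + z, data = d))
```

This sketch leaves I(...) wrappers in place; the rmI substitution from fit2text could be applied to its result in the same way.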

Related

How to perform a nonlinear regression of a complex function that has a summation using R?

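I have the following function:
$$y = A\left(1 - \frac{6}{\pi^2}\sum_{n=1}^{\infty}\frac{1}{n^2}\exp\left(-\frac{B n^2 \pi^2 x}{R^2}\right)\right)$$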
In this function, R is a constant with a value of 22.5. I want to estimate the parameters A and B using nonlinear regression (the nls() function). I made a few attempts, but all were unsuccessful. I'm not very familiar with this type of operation in R, so I would like your help.
Additionally, if possible, I would also like to plot this function using ggplot2.
# Initial data
x <- c(0, 60, 90, 120, 180, 240)
y <- c(0, 0.967676, 1.290101, 1.327099, 1.272404, 1.354246)
R <- 22.5
df <- data.frame(x, y)
f <- function(x) (1/(n^2))*exp((-B*(n^2)*(pi^2)*x)/(R^2))
# First try
nls(formula = y ~ A*(1-(6/(pi^2))*sum(f, seq(1, Inf, 1))),
data = df,
start = list(A = 1,
B = 0.7))
Error in seq.default(1, Inf, 1) : 'to' must be a finite number
# Second try
nls(formula = y ~ A*(1-(6/(pi^2))*integrate(f, 1, Inf)),
data = df,
start = list(A = 1,
B = 0.7))
Error in f(x, ...) : object 'n' not found
You can use a finite sum approximation. Using 25 terms:
f <- function(x, B, n = 1:25) sum((1/(n^2))*exp((-B*(n^2)*(pi^2)*x)/(R^2)))
fm <- nls(formula = y ~ cbind(A = (1-(6/pi^2))* Vectorize(f)(x, B)),
data = df,
start = list(B = 0.7),
alg = "plinear")
fm
giving:
Nonlinear regression model
model: y ~ cbind(A = (1 - (6/pi^2)) * Vectorize(f)(x, B))
data: df
B .lin.A
-0.00169 1.39214
residual sum-of-squares: 1.054
Number of iterations to convergence: 12
Achieved convergence tolerance: 9.314e-06
The model does not seem to fit the data very well (solid line in graph below); however, a logistic model seems to work well (dashed line).
fm2 <- nls(y ~ SSlogis(x, Asym, xmid, scal), df)
plot(y ~ x, df)
lines(fitted(fm) ~ x, df)
lines(fitted(fm2) ~ x, df, lty = 2)
legend("bottomright", c("fm", "fm2"), lty = 1:2)
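One thing worth checking in the fit above: the formula multiplies the series by (1 - 6/pi^2), whereas the question's attempts use A*(1 - (6/pi^2)*sum(...)), that is, one minus the scaled series. The sketch below refits with that parenthesization, still truncating the sum at 25 terms. It is an assumption about the intended model, not a confirmed correction, and the helper name series_sum is made up for illustration.

```r
x <- c(0, 60, 90, 120, 180, 240)
y <- c(0, 0.967676, 1.290101, 1.327099, 1.272404, 1.354246)
R <- 22.5
df <- data.frame(x, y)

# truncated series: sum over n of exp(-B n^2 pi^2 x / R^2) / n^2, for each x
series_sum <- function(x, B, n = 1:25) {
  sapply(x, function(xi) sum(exp(-B * n^2 * pi^2 * xi / R^2) / n^2))
}

# A enters linearly, so "plinear" only needs a starting value for B
fm_alt <- nls(y ~ cbind(A = 1 - (6 / pi^2) * series_sum(x, B)),
              data = df,
              start = list(B = 0.7),
              algorithm = "plinear")
coef(fm_alt)
```

Adding lines(fitted(fm_alt) ~ x, df, lty = 3) to the plot above would allow a visual comparison with fm and fm2.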

Calculating RSS manually with given pairs of beta0 and beta1

I am trying to manually calculate the RSS for a dataset with given pairs of beta0 and beta1. For each (beta_0,beta_1) pair of values, I need to calculate the residual sum of squares. Store it as a vector in data called RSS. Here's the code provided.
x = pinotnoir$Aroma
y = pinotnoir$Quality
fit = lm(y ~ x)
summary(fit)
b0s <- seq(0, 10, .1)
b1s <- seq(0, 4, .01)
data <- expand.grid(beta0=b0s, beta1=b1s)
Here's what I have so far. I think the residual calculation is wrong but I'm not sure how to fix it.
rows = length(b1s)
rsd <- rep(NA,rows)
for (i in 1:rows){
residual = (y - (b0s[i] + b1s[i] * x))^2
rsd[i] <- residual
}
data <- expand.grid(beta0=b0s, beta1=b1s, RSS=rsd)
Any help would be appreciated. Thanks in advance!
I am not sure this is exactly what you are aiming for, but by adapting your code slightly you can get the residual sum of squares for each pair and find which betas minimize it (using the mtcars data for the example).
mtcars
x = mtcars$drat
y = mtcars$wt
(fit = lm(y ~ x))
summary(fit)
grid_len <- 20
b0s <- seq(5, 10, length.out = grid_len)
b1s <- seq(-3, -1, length.out = grid_len)
(data <- expand.grid(beta0=b0s, beta1=b1s))
rows = nrow(data)
resids <- rep(NA,rows)
for (i in 1:rows) {
fitted <- (data$beta0[i] + (data$beta1[i] * x))
squared_resid <- (y - fitted)^2
SSR <- sum(squared_resid)
resids[i] <- SSR
cat(i, ": ", SSR, "\n")
}
data[which.min(resids), ]
fit
results:
> data[which.min(resids), ]
beta0 beta1
332 7.894737 -1.315789
> fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
7.906 -1.304
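The same grid search can be written without an explicit loop and with the result stored as an RSS column, which is what the question asked for. This is a minimal sketch on the same mtcars variables; the name rss_grid is an illustrative choice.

```r
x <- mtcars$drat
y <- mtcars$wt

rss_grid <- expand.grid(beta0 = seq(5, 10, length.out = 20),
                        beta1 = seq(-3, -1, length.out = 20))

# residual sum of squares for every (beta0, beta1) pair in the grid
rss_grid$RSS <- mapply(function(b0, b1) sum((y - (b0 + b1 * x))^2),
                       rss_grid$beta0, rss_grid$beta1)

# grid point with the smallest RSS, to compare against coef(lm(y ~ x))
rss_grid[which.min(rss_grid$RSS), ]
```

No grid point can have a smaller RSS than the least-squares fit itself, so deviance(lm(y ~ x)) serves as a lower bound for a sanity check.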

natural cubic spline regression with R

I seem to have a problem with the splines::ns() function in R.
I created a simple dummy problem
dat <- data.frame(t <- seq(0, 6, .01),
x <- rnorm(length(t), sd = 1),
y <- 5 + t - x^2 + rnorm(length(t), sd = .33))
lm(y ~ t + I(x^2), data = dat)
library(splines)
lm(y ~ t + ns(x, knots = c(0), Boundary.knots = c(-3, 3)), data = dat)
While the first model works fine, the second one fails to identify the intercept correctly. What am I missing here?
There is nothing wrong, because you are not fitting exactly the same model, and they are not even equivalent.
To explain the different result you see, it is sufficient to use a simpler example with a single covariate x. We generate data from a quadratic polynomial: 5 + x + x^2, then fit several models.
set.seed(0)
x <- rnorm(500, mean = 1) ## `x` with non-zero mean
y <- 5 + x + x * x + rnorm(500, sd = 0.5)
library(splines)
fit1 <- lm(y ~ x + I(x^2))
#(Intercept) x I(x^2)
# 4.992 1.032 0.980
fit2 <- lm(y ~ poly(x, degree = 2))
#(Intercept) poly(x, degree = 2)1 poly(x, degree = 2)2
# 7.961 70.198 28.720
fit3 <- lm(y ~ bs(x, degree = 2, df = 2))
#(Intercept) bs(x, degree = 2, df = 2)1 bs(x, degree = 2, df = 2)2
# 6.583 -8.337 20.650
fit4 <- lm(y ~ ns(x, df = 2))
#(Intercept) ns(x, df = 2)1 ns(x, df = 2)2
# 5.523 10.737 21.265
The first 3 models are not the same in terms of parameterization, but they are equivalent: they all fit a quadratic polynomial with 3 degrees of freedom. To see their equivalence, we check their fitted values:
sum(abs(fit1$fitted - fit2$fitted))
# [1] 1.54543e-13
sum(abs(fit1$fitted - fit3$fitted))
# [1] 2.691181e-13
To see the difference in parameterization, we look at the design matrix:
X1 <- model.matrix(~ x + I(x^2))
X2 <- model.matrix(~ poly(x, degree = 2))
X3 <- model.matrix(~ bs(x, degree = 2, df = 2))
par(mfrow = c(3,3), oma = rep.int(1,4), mar = c(4, 4, 0, 0))
plot(x, X1[, 1], cex = 0.2)
plot(x, X1[, 2], cex = 0.2)
plot(x, X1[, 3], cex = 0.2)
plot(x, X2[, 1], cex = 0.2)
plot(x, X2[, 2], cex = 0.2)
plot(x, X2[, 3], cex = 0.2)
plot(x, X3[, 1], cex = 0.2)
plot(x, X3[, 2], cex = 0.2)
plot(x, X3[, 3], cex = 0.2)
Since the design matrices are not the same (in shape or in scale), you will not end up with the same set of coefficients. In case you are surprised, let's try an even simpler example:
x1 <- x - mean(x)
test <- lm(y ~ x1 + I(x1^2))
#(Intercept) x1 I(x1^2)
# 7.003 2.991 0.980
sum(abs(fit1$fitted - test$fitted))
# [1] 1.24345e-13
Here I have just applied a simple transformation to x (centering it), and the coefficients are different, but the fit is still equivalent.
The 4th model, fit4, fits a natural cubic spline with 3 degrees of freedom, so it is not equivalent to any of the previous models. We can check the fitted values:
sum(abs(fit1$fitted - fit4$fitted))
# [1] 39.36563
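Equivalence of the raw and orthogonal polynomial fits also carries over to predictions at new x values, which is usually what matters in practice. A small sketch on the same simulated data; the object names raw_fit, orth_fit and new_x are illustrative.

```r
set.seed(0)
x <- rnorm(500, mean = 1)
y <- 5 + x + x * x + rnorm(500, sd = 0.5)

raw_fit  <- lm(y ~ x + I(x^2))            # raw polynomial basis
orth_fit <- lm(y ~ poly(x, degree = 2))   # orthogonal polynomial basis

new_x <- data.frame(x = seq(-1, 3, by = 0.5))
pred_raw  <- predict(raw_fit,  newdata = new_x)
pred_orth <- predict(orth_fit, newdata = new_x)

# different coefficients, same predicted curve (up to numerical tolerance)
all.equal(pred_raw, pred_orth)
```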
Ignoring ns() completely you are missing two things:
1) The comment above explaining how to define a dataframe:
t <- seq(0, 6, .01)
x <- rnorm(length(t), sd = 1)
y <- 5 + t - x^2 + rnorm(length(t), sd = .33)
df <- data.frame(t, x, y)
rm(t, x, y)
2) The way you are calling your models:
lm(y ~ t + I(t^2), data=df)
lm(y ~ splines::ns(t, knots = c(0), Boundary.knots = c(-3, 3)), data=df)
The first model does not properly identify what you think it does.

Using lapply on a list of models

I have generated a list of models, and would like to create a summary table.
As an example, here are two models:
x <- seq(1:10)
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))
and two functions, the first generating the equation from the components of the formula
get.model.equation <- function(x) {
x <- as.character((x$call)$formula)
x <- paste(x[2],x[1],x[3])
}
and the second generating the name of model as a string
get.model.name <- function(x) {
x <- deparse(substitute(x))
}
With these, I create a summary table
model.list <- list(model1, model2)
AIC.data <- lapply(X = model.list, FUN = AIC)
AIC.data <- as.numeric(AIC.data)
model.models <- lapply(X = model.list, FUN = get.model.equation)
model.summary <- cbind(model.models, AIC.data)
model.summary <- as.data.frame(model.summary)
names(model.summary) <- c("Model", "AIC")
model.summary$AIC <- unlist(model.summary$AIC)
rm(AIC.data)
model.summary[order(model.summary$AIC),]
Which all works fine.
I'd like to add the model name to the table using get.model.name
x <- get.model.name(model1)
Which gives me "model1" as I want.
So now I apply the function to the list of models
model.names <- lapply(X = model.list, FUN = get.model.name)
but now instead of model1 I get X[[1L]]
How do I get model1 rather than X[[1L]]?
I'm after a table that looks like this:
Model Formula AIC
model1 y ~ x 11.89136
model2 y ~ x + I(x^2) + I(x^3) 15.03888
Do you want something like this?
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)
I'd do something like this:
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
# changed Reduce('rbind', ...) to do.call(rbind, ...) (Hadley's comment)
do.call(rbind,
lapply(names(model.list), function(x)
data.frame(model = x,
formula = get.model.equation(model.list[[x]]),
AIC = AIC(model.list[[x]])
)
)
)
# model formula AIC
# 1 model1 y ~ x 11.89136
# 2 model2 y ~ x + I(x^2) + I(x^3) 15.03888
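The reason get.model.name stops working inside lapply is that substitute() captures the expression lapply passes to the function, not the name the model had when the list was built; list(model1, model2) stores the values only. The sketch below illustrates this and why the answers above switch to a named list. The exact placeholder text returned inside lapply may differ between R versions, so it is not shown.

```r
x <- 1:10
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))

get.model.name <- function(x) deparse(substitute(x))

direct     <- get.model.name(model1)   # sees the symbol model1
via_lapply <- unlist(lapply(list(model1, model2), get.model.name))
# via_lapply holds lapply's internal indexing expression, not the names

# a named list keeps the names as data, so names() recovers them
model.list <- list(model1 = model1, model2 = model2)
names(model.list)
```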
Another option, with ldply, but see hadley's comment below for a more efficient use of ldply:
# prepare data
x <- seq(1:10)
y <- sin(x)^2
dat <- data.frame(x,y)
# create list of named models obviously these are not suited to the data here, just to make the workflow work...
models <- list(model1=lm(y~x, data = dat),
model2=lm(y~I(1/x), data=dat),
model3=lm(y ~ log(x), data = dat),
model4=nls(y ~ I(1/x*a) + b*x, data = dat, start = list(a = 1, b = 1)),
model5=nls(y ~ (a + b*log(x)), data=dat, start = setNames(coef(lm(y ~ log(x), data=dat)), c("a", "b"))),
model6=nls(y ~ I(exp(1)^(a + b * x)), data=dat, start = list(a=0,b=0)),
model7=nls(y ~ I(1/x*a)+b, data=dat, start = list(a=1,b=1))
)
library(plyr)
library(AICcmodavg) # for small sample sizes
# build table with model names, function, AIC and AICc
data.frame(cbind(ldply(models, function(x) cbind(AICc = AICc(x), AIC = AIC(x))),
model = sapply(1:length(models), function(x) deparse(formula(models[[x]])))
))
.id AICc AIC model
1 model1 15.89136 11.89136 y ~ x
2 model2 15.78480 11.78480 y ~ I(1/x)
3 model3 15.80406 11.80406 y ~ log(x)
4 model4 16.62157 12.62157 y ~ I(1/x * a) + b * x
5 model5 15.80406 11.80406 y ~ (a + b * log(x))
6 model6 15.88937 11.88937 y ~ I(exp(1)^(a + b * x))
7 model7 15.78480 11.78480 y ~ I(1/x * a) + b
It's not immediately obvious to me how to replace the .id with a column name in the ldply function, any tips?

Extract prediction band from lme fit

I have following model
x <- rep(seq(0, 100, by=1), 10)
y <- 15 + 2*rnorm(1010, 10, 4)*x + rnorm(1010, 20, 100)
id <- NULL
for(i in 1:10){ id <- c(id, rep(i,101)) }
dtfr <- data.frame(x=x,y=y, id=id)
library(nlme)
with(dtfr, summary( lme(y~x, random=~1+x|id, na.action=na.omit)))
model.mx <- with(dtfr, (lme(y~x, random=~1+x|id, na.action=na.omit)))
pd <- predict( model.mx, newdata=data.frame(x=0:100), level=0)
with(dtfr, plot(x, y))
lines(0:100, predict(model.mx, newdata=data.frame(x=0:100), level=0), col="darkred", lwd=7)
With predict and level = 0 I can plot the mean population response. How can I extract and plot the 95% confidence intervals / prediction bands from the nlme object for the whole population?
Warning: Read this thread on r-sig-mixed models before doing this. Be very careful when you interpret the resulting prediction band.
From r-sig-mixed models FAQ adjusted to your example:
set.seed(42)
x <- rep(0:100,10)
y <- 15 + 2*rnorm(1010,10,4)*x + rnorm(1010,20,100)
id<-rep(1:10,each=101)
dtfr <- data.frame(x=x ,y=y, id=id)
library(nlme)
model.mx <- lme(y~x,random=~1+x|id,data=dtfr)
#create data.frame with new values for predictors
#more than one predictor is possible
new.dat <- data.frame(x=0:100)
#predict response
new.dat$pred <- predict(model.mx, newdata=new.dat,level=0)
#create design matrix
Designmat <- model.matrix(eval(eval(model.mx$call$fixed)[-2]), new.dat[-ncol(new.dat)])
#compute standard error for predictions
predvar <- diag(Designmat %*% model.mx$varFix %*% t(Designmat))
new.dat$SE <- sqrt(predvar)
new.dat$SE2 <- sqrt(predvar+model.mx$sigma^2)
library(ggplot2)
p1 <- ggplot(new.dat,aes(x=x,y=pred)) +
geom_line() +
geom_ribbon(aes(ymin=pred-2*SE2,ymax=pred+2*SE2),alpha=0.2,fill="red") +
geom_ribbon(aes(ymin=pred-2*SE,ymax=pred+2*SE),alpha=0.2,fill="blue") +
geom_point(data=dtfr,aes(x=x,y=y)) +
scale_y_continuous("y")
p1
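The predvar step above forms the full matrix Designmat %*% varFix %*% t(Designmat) and then keeps only its diagonal. For larger prediction grids, the diagonal can be computed row by row instead. The sketch below checks that equivalence on small made-up matrices: X and V are hypothetical stand-ins for Designmat and model.mx$varFix, not values from the fitted model.

```r
set.seed(1)
# Hypothetical stand-ins: X for the design matrix, V for the fixed-effects covariance
X <- cbind(1, 0:100)
A <- matrix(rnorm(4), 2, 2)
V <- crossprod(A)   # symmetric positive semi-definite 2 x 2 matrix

se_full <- sqrt(diag(X %*% V %*% t(X)))   # builds a 101 x 101 matrix first
se_rows <- sqrt(rowSums((X %*% V) * X))   # computes only the diagonal entries

all.equal(se_full, se_rows)

# normal quantile that the "2 * SE" bands approximate
z <- qnorm(0.975)
```

With the fitted model, the same expression would read sqrt(rowSums((Designmat %*% model.mx$varFix) * Designmat)).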
Sorry for coming back to such an old topic, but this might address a comment here:
it would be nice if some package could provide this functionality
This functionality is included in the ggeffects package when you use type = "re", which includes the random effect variances and not only the residual variance (in this particular example, however, the result is the same).
library(nlme)
library(ggeffects)
x <- rep(seq(0, 100, by = 1), 10)
y <- 15 + 2 * rnorm(1010, 10, 4) * x + rnorm(1010, 20, 100)
id <- NULL
for (i in 1:10) {
id <- c(id, rep(i, 101))
}
dtfr <- data.frame(x = x, y = y, id = id)
m <- lme(y ~ x,
random = ~ 1 + x | id,
data = dtfr,
na.action = na.omit)
ggpredict(m, "x") %>% plot(rawdata = T, dot.alpha = 0.2)
ggpredict(m, "x", type = "re") %>% plot(rawdata = T, dot.alpha = 0.2)
Created on 2019-06-18 by the reprex package (v0.3.0)

Resources