I'm trying to demonstrate that there is an important difference between two ways of making linear model predictions. The first way, which my heart tells me is more correct, uses predict.lm, which as I understand it preserves the correlations between coefficients. The second approach draws each parameter independently.
Is this the correct way to show the difference? The two approaches seem somewhat close.
Also, is the StdErr of the coefficients the same as the standard deviation of their distributions? Or have I misunderstood what the model table is saying?
Below is a quick reprex to show what I mean:
# fake dataset
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
# predictions
coef_sterr <- summary(model)$coefficients
inters <- rnorm(500, mean = coef_sterr[1,1], sd = coef_sterr[1,2])
slopes <- rnorm(500, mean = coef_sterr[2,1], sd = coef_sterr[2,2])
newx <- seq(from = -1, to= 1, length.out = 20)
avg_predictions <- cbind(1, newx) %*% rbind(inters, slopes)
conf_predictions <- apply(avg_predictions, 1, quantile, probs = c(.025, .975), simplify = TRUE)
# from predict()
conf_interval <- predict(model, newdata = data.frame(xs = newx),
                         interval = "confidence",
                         level = 0.95)
# plot to visualize
plot(ys~xs)
# averages are exactly the same
abline(model)
abline(a = coef(model)[1], b = coef(model)[2], col = "red")
# from predict, using parameter covariance
matlines(newx, conf_interval[,2:3], col = "blue", lty=1, lwd = 3)
# from simulated lines, ignoring parameter covariance
matlines(newx, t(conf_predictions), col = "orange", lty = 1, lwd = 2)
Created on 2022-04-05 by the reprex package (v2.0.1)
In this case, they would be close because there is very little correlation between the model parameters, so drawing them from two independent normals versus a multivariate normal is not that different:
set.seed(519)
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
cov2cor(vcov(model))
# (Intercept) xs
# (Intercept) 1.00000000 -0.08054106
# xs -0.08054106 1.00000000
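On the standard-error question: the Std. Error column is the estimated standard deviation of each coefficient's sampling distribution, i.e. the square root of the diagonal of vcov(model). A quick check on re-simulated data of the same shape (the names se_table and se_vcov are just illustrative):

```r
# Re-create a small model like the one above
set.seed(519)
xs <- runif(200, min = -1, max = 1)
ys <- rnorm(200, -1.3 + 3.1 * xs, sd = 1)
model <- lm(ys ~ xs)

# Standard errors as reported in the coefficient table
se_table <- summary(model)$coefficients[, "Std. Error"]

# Square roots of the variances on the diagonal of the covariance matrix
se_vcov <- sqrt(diag(vcov(model)))

all.equal(se_table, se_vcov)
```

So using the table's standard errors as the sd of independent normals gets the marginal spread right; what it leaves out is the off-diagonal part of vcov(model).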
Also, it is probably worth calculating both of the intervals the same way, though it shouldn't make that much difference. That said, 500 draws may not be enough to get reliable estimates of the 2.5th and 97.5th percentiles of the distribution. Let's consider a slightly more complex example. Here, the two X variables are correlated - the correlation of the parameters derives in part from the correlation of the columns of the design matrix, X.
set.seed(519)
X <- MASS::mvrnorm(200, c(0,0), matrix(c(1,.65,.65,1), ncol=2))
b <- c(-1.3, 3.1, 2.5)
ytrue <- cbind(1,X) %*% b
y <- ytrue + rnorm(200, 0, .5*sd(ytrue))
dat <- data.frame(y=y, x1=X[,1], x2=X[,2])
model <- lm(y ~ x1 + x2, data=dat)
cov2cor(vcov(model))
# (Intercept) x1 x2
# (Intercept) 1.00000000 0.02417386 -0.01515887
# x1 0.02417386 1.00000000 -0.73228003
# x2 -0.01515887 -0.73228003 1.00000000
In this example, the coefficients for x1 and x2 are correlated around -0.73. As you'll see, this still doesn't result in a huge difference. Let's calculate the relevant statistics.
First, we draw B1 using the multivariate method that you rightly suspect is correct. Then, we'll draw B2 from a bunch of independent normals (actually, I'm using a multivariate normal with a diagonal variance-covariance matrix, which is the same thing).
b_est <- coef(model)
v <- vcov(model)
B1 <- MASS::mvrnorm(2500, b_est, v, empirical=TRUE)
B2 <- MASS::mvrnorm(2500, b_est, diag(diag(v)), empirical = TRUE)
Now, let's make a hypothetical X matrix and generate the relevant predictions:
hypX <- data.frame(x1=seq(-3,3, length=50),
x2 = mean(dat$x2))
yhat1 <- as.matrix(cbind(1, hypX)) %*% t(B1)
yhat2 <- as.matrix(cbind(1, hypX)) %*% t(B2)
Then we can calculate confidence intervals, etc...
yh1_ci <- t(apply(yhat1, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh2_ci <- t(apply(yhat2, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh1_ci <- as.data.frame(yh1_ci)
yh2_ci <- as.data.frame(yh2_ci)
names(yh1_ci) <- names(yh2_ci) <- c("lwr", "upr")
yh1_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh2_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh1_ci$method <- factor(1, c(1,2), labels=c("Multivariate", "Independent"))
yh2_ci$method <- factor(2, c(1,2), labels=c("Multivariate", "Independent"))
yh1_ci$x1 <- hypX[,1]
yh2_ci$x1 <- hypX[,1]
yh <- rbind(yh1_ci, yh2_ci)
We could then plot the two confidence intervals as you did.
library(ggplot2)
ggplot(yh, aes(x=x1, y=fit, ymin=lwr, ymax=upr, fill=method)) +
  geom_ribbon(colour="transparent", alpha=.25) +
  geom_line() +
  theme_classic()
Perhaps a better visual would be to compare the widths of the intervals.
w1 <- yh1_ci$upr - yh1_ci$lwr
w2 <- yh2_ci$upr - yh2_ci$lwr
ggplot() +
geom_point(aes(x=hypX[,1], y=w2-w1)) +
theme_classic() +
labs(x="x1", y="Width (Independent) - Width (Multivariate)")
This shows that for small values of x1, the independent confidence intervals are wider than the multivariate ones. For values of x1 above 0, it's a more mixed bag.
This tells you that there is some difference, but you don't need the simulation to know which one is 'right'. That's because the prediction is a linear combination of constants and random variables.
In this case, the b terms are the random variables and the x values are the constants. We know that the variance of a linear combination can be calculated this way:
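Written out, with x the fixed vector of predictor values (including the 1 for the intercept), b the estimated coefficients and V their covariance matrix (what vcov(model) returns), the standard identity is as follows. The symbols x, b and V are notation assumed here for illustration.

```latex
\operatorname{Var}\!\left(x^{\top} b\right)
  = x^{\top} V x
  = \sum_{j} x_j^{2}\,\operatorname{Var}(b_j)
  + 2 \sum_{j<k} x_j x_k \,\operatorname{Cov}(b_j, b_k)
```

Drawing the coefficients independently keeps the first sum and drops the second, so the two intervals differ exactly by the covariance terms.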
All that is to say that your intuition is correct.
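As a sketch of that point, the variance of a linear combination can be computed directly from vcov(model) and compared with what predict() reports. This uses base R only and simulates two correlated predictors in a similar spirit to the example above (not the identical data); se_full, se_indep and se_pred are illustrative names.

```r
# Simulate two correlated predictors and fit the model
set.seed(519)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.65 * x1 + sqrt(1 - 0.65^2) * rnorm(n)  # correlation with x1 around 0.65
y  <- -1.3 + 3.1 * x1 + 2.5 * x2 + rnorm(n, sd = 2)
dat   <- data.frame(y, x1, x2)
model <- lm(y ~ x1 + x2, data = dat)

# Hypothetical prediction grid: vary x1, hold x2 at its mean
hypX <- data.frame(x1 = seq(-3, 3, length.out = 50), x2 = mean(dat$x2))
Xh   <- cbind(1, as.matrix(hypX))
V    <- vcov(model)

# diag(Xh %*% V %*% t(Xh)) without forming the full 50 x 50 matrix
se_full  <- sqrt(rowSums((Xh %*% V) * Xh))
# Same calculation with the covariances zeroed out (independent draws)
se_indep <- sqrt(rowSums((Xh %*% diag(diag(V))) * Xh))

# Standard errors of the fitted mean as reported by predict()
se_pred <- predict(model, newdata = hypX, se.fit = TRUE)$se.fit

all.equal(unname(se_full), unname(se_pred))
range(se_indep - se_full)
```

The first comparison shows that predict() is using the full covariance matrix; the second shows how far the independent version drifts from it across the grid.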
The data consist of 4 variables: id; x1 and x2, continuous variables which are correlated with y; and y, a binary variable whose values 0 and 1 represent different states. Is it possible to use Markov chain models to calculate and plot state transition probabilities along the gradient of covariate values for each id and subsequently for the pooled data?
set.seed(1)
id =rep(1, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
a<-data.frame(id,x1,x2, y)
set.seed(2)
id =rep(2, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
b<-data.frame(id,x1,x2, y)
set.seed(3)
id =rep(3, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
c<-data.frame(id,x1,x2, y)
d<-rbind(a,b,c)
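One possible starting point, offered only as a rough sketch: if the rows within each id are in time order (an assumption, since the data above have no time column), a first-order Markov structure can be approximated by regressing the current state on the previous state and the covariates. The names y_prev, trans, fit and grid are illustrative, and the data are re-simulated compactly rather than reusing a, b and c.

```r
# Re-simulate data of the same shape: 3 ids, 100 rows each
set.seed(1)
d <- do.call(rbind, lapply(1:3, function(i) {
  x1 <- rnorm(100)
  x2 <- rnorm(100)
  pr <- 1 / (1 + exp(-(1 + 2 * x1 + 3 * x2)))
  data.frame(id = i, x1 = x1, x2 = x2, y = rbinom(100, 1, pr))
}))

# Previous state within each id (NA for the first row of each id)
d$y_prev <- ave(d$y, d$id, FUN = function(v) c(NA, head(v, -1)))

# Pooled empirical transition matrix: rows = from-state, columns = to-state
trans <- prop.table(table(from = d$y_prev, to = d$y), margin = 1)

# Transition probabilities that vary with x1: logistic regression on the
# previous state, its interaction with x1, and x2
fit <- glm(y ~ factor(y_prev) * x1 + x2, family = binomial, data = d)

# P(y_t = 1 | y_{t-1}, x1) along a grid of x1, holding x2 at 0
grid <- expand.grid(y_prev = c(0, 1), x1 = seq(-2, 2, length.out = 50), x2 = 0)
grid$p_to_1 <- predict(fit, newdata = grid, type = "response")
```

Fitting the same model on subset(d, id == 1) and so on would give per-id curves, and plotting p_to_1 against x1 separately for each y_prev gives the transition probability along the covariate gradient.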
I am trying to calculate the incremental variance explained by each variable in a multivariate multiple linear regression model, but I don't get sums of squares as I would for an ordinary multiple linear regression. I'd like something like:
library(car)
#Create variables and adjusted the model
set.seed(123)
N <- 100
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
X3 <- abs(rnorm(N, 60, 30))
Y1 <- 0.2*X1 - 0.3*X2 - 0.4*X3 + 10 + rnorm(N, 0, 10)
Y2 <- -0.3*X2 + 0.2*X3 + rnorm(N, 10)
Y <- cbind(Y1, Y2)
dfRegr <- data.frame(X1, X2, X3, Y1, Y2)
(fit <- lm(cbind(Y1, Y2) ~ X1 + X2 + X3, data=dfRegr))
#How do we get the proportion now?
af <- Anova(fit)
afss <- af$"test stat"
print(cbind(af,PctExp=afss/sum(afss)*100))
#
Obviously this doesn't work. Is there some kind of approach for this?
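One simple workaround, sketched here under the assumption that a per-response breakdown is acceptable: fit each response separately and take each term's share of the total sum of squares from the sequential (Type I) ANOVA table. This is not a single multivariate measure, and the shares depend on the order of the predictors. pct_explained is an illustrative name.

```r
# Same simulated data as in the question
set.seed(123)
N  <- 100
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
X3 <- abs(rnorm(N, 60, 30))
Y1 <- 0.2*X1 - 0.3*X2 - 0.4*X3 + 10 + rnorm(N, 0, 10)
Y2 <- -0.3*X2 + 0.2*X3 + rnorm(N, 10)
dfRegr <- data.frame(X1, X2, X3, Y1, Y2)

# For each response: sequential sums of squares as a percentage of the total
pct_explained <- sapply(c("Y1", "Y2"), function(resp) {
  form <- reformulate(c("X1", "X2", "X3"), response = resp)
  tab  <- anova(lm(form, data = dfRegr))
  setNames(100 * tab[["Sum Sq"]] / sum(tab[["Sum Sq"]]), rownames(tab))
})
round(pct_explained, 1)
```

Each column sums to 100, with the Residuals row holding the unexplained share.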
I'd like to plot the predicted line of a regression that contains a restricted cubic spline (because of non-linearity in the model), together with the standard error bands. I can get the predicted points, but am not sure how to plot just the lines and error bands. ggplot is preferred, but base graphics is fine too. Thanks.
Here is an example from the documentation:
library(rms)
# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred <- fitted(f) # or predict(f) or f$linear.predictors
f2 <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
fastbw(f2, aics=100000)
options(datadist=NULL)
And a plot of the predicted values of the model:
plot(predict(f2))
The rms package has a number of helpful functions for this purpose. It is worth looking at http://biostat.mc.vanderbilt.edu/wiki/Main/RmS
In this instance, you can simply set datadist (which sets up distribution summaries for the predictor variables) appropriately and then use plot(Predict(f)) or ggplot(Predict(f)):
set.seed(5)
# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
ddist <- datadist(x1,x2,x3,x4)
options(datadist='ddist')
plot(Predict(f))
ggplot(Predict(f))
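If you want just the line and band for one predictor and full control over the plot, a sketch along these lines may help. It assumes the rms and ggplot2 packages are installed and relies on Predict() returning a data frame with yhat, lower and upper columns; note that the band is a 95% confidence band by default (conf.int = 0.95), not plus or minus one standard error.

```r
library(rms)
library(ggplot2)

set.seed(5)
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y  <- x1 + x2 + rnorm(200)

# datadist must be set before Predict() can choose sensible grid values
ddist <- datadist(x1, x2, x3, x4)
options(datadist = "ddist")

f <- ols(y ~ rcs(x1, 4) + x2 + x3 + x4)

# Predictions over a grid of x1, other predictors held at reference values
p <- as.data.frame(Predict(f, x1))

ggplot(p, aes(x = x1, y = yhat)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.25) +
  geom_line() +
  theme_classic()
```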
I have two dependent variables that both depend on two variables AND on each other. Surely this can be modelled in R, but I can't figure out how. Does anyone have a hint?
In clear terms:
I want to model my data with the following model:
Y1=X1*coef1+X2*coef2
Y2=X1*coef2+X2*coef3
Note: coef2 appears in both lines
Xi, Yi is input and output data respectively
I got this far:
lm(Y1~X1+X2,mydata)
Now how do I add the second line of the model, including the cross dependency?
Your help is greatly appreciated!
Cheers, Bastiaan
Try this:
# sample data - true coefs are 2, 3, 4
set.seed(123)
n <- 35
DF <- data.frame(X1 = 1, X2 = 1:n, X3 = (1:n)^2)
DF <- transform(DF, Y1 = X1 * 2 + X2 * 3 + rnorm(n),
Y2 = X1 * 3 + X2 * 4 + rnorm(n))
# construct data frame for required model
DF2 <- with(DF, data.frame(y = c(Y1, Y2),
x1 = c(X1, 0*X1),
x2 = c(X2, X1),
x3 = c(0*X2, X2)))
lm(y ~. - 1, DF2)
We see it does, indeed, recover the true coefs of 2, 3, 4:
> lm(y ~. - 1, DF2)
Call:
lm(formula = y ~ . - 1, data = DF2)
Coefficients:
x1 x2 x3
2.084 2.997 4.007
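The same stacking idea, sketched for two non-constant predictors so it maps directly onto the model in the question. Here mydata is simulated purely for illustration, the true coefficients 2, 3, 4 are made up, and the column names z1 to z3 are hypothetical. Note that a single lm on the stacked data assumes both equations share the same error variance.

```r
# Illustrative data: Y1 = 2*X1 + 3*X2, Y2 = 3*X1 + 4*X2, plus small noise
set.seed(42)
n <- 50
mydata <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
mydata$Y1 <- 2 * mydata$X1 + 3 * mydata$X2 + rnorm(n, sd = 0.1)
mydata$Y2 <- 3 * mydata$X1 + 4 * mydata$X2 + rnorm(n, sd = 0.1)

# Stack both equations; each column carries one coefficient
stacked <- with(mydata, data.frame(
  y  = c(Y1, Y2),
  z1 = c(X1, 0 * X1),  # coef1: X1 in the first equation only
  z2 = c(X2, X1),      # coef2: X2 in the first equation, X1 in the second
  z3 = c(0 * X2, X2)   # coef3: X2 in the second equation only
))

# No intercept, matching the model as written in the question
fit <- lm(y ~ z1 + z2 + z3 - 1, data = stacked)
coef(fit)
```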