Individual terms in prediction of linear regression - r

I performed a regression analyses in R on some dataset and try to predict the contribution of each individual independent variable on the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1,v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it multiplies the coefficient of the regression result with the variable data. So something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The prediciton function is of by 0.1 or so Also if you add all terms in the prediction function together with the constant it doesn’t add up to the total prediction (using type=”response”). What does the prediction function calculate here and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?

All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" here.
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function calculation (terms_predict[,1]) are different is because the columns in the terms matrix are centered around the mean, so their mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
Hope this helps.

The function predict(...,type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied to its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()

Related

Replace lm coefficients and calculate results of lm new in R

I am able to change the coefficients of my linear model. Then i want to compare the results of my "new" model with the new coefficients, but R is not calculating the results with the new coefficients.
As you can see in my following example the summary of my models fit and fit1 are excactly the same, though results like multiple R-squared should or fitted values should change.
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# generate second function for comparing results
fit1 <- fit
# replace coefficients with new values, use whole name which is coefficients:
fit1$coefficients[2:3] <- c(5, 1)
# view new coefficents
coef(fit1)
# Comparing
summary(fit)
summary(fit1)
Thanks in advance
It might be easier to compute the multiple R^2 yourself with the substituted parameters.
mult_r2 <- function(beta, y, X) {
tot_ss <- var(y) * (length(y) - 1)
rss <- sum((y - X %*% beta)^2)
1 - rss/tot_ss
}
(or, more compactly, following the comments, you could compute p <- X %*% beta; (cor(y,beta))^2)
mult_r2(coef(fit), y = model.response(model.frame(fit)), X = model.matrix(fit))
## 0.9931179, matches summary()
Now with new coefficients:
new_coef <- coef(fit)
new_coef[2:3] <- c(5,1)
mult_r2(new_coef, y = model.response(model.frame(fit)), X = model.matrix(fit))
## [1] -343917
That last result seems pretty wild, but the substituted coefficients are very different from the true least-squares coeffs, and negative R^2 is possible when the model is bad enough ...

Random forest variable importance AND direction of correlation for binomial response

I am using the randomForest package in R, but am not partial to solutions using other packages.
my RF model is using various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to be able to show the direction of variable importance for predictors used in my RF model. Other publications have done exactly this: Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read said to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way.
Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")
response = Threatened
response = Non_Threatened
You could use something like an average marginal effect (or like below, an average first difference) approach.
First, I'll make some data
set.seed(11)
n = 200
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
yhat = 10 * sin(pi* X[ ,1] * X[,2]) +20 *
(X[,3] -.5)^2 + 10 * -X[ ,4] + 5 * -X[,5]
y = as.numeric((yhat+ rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X,y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Net, we can loop through each variable, in each time through the loop, we're adding one standard deviation to a single x variable for all observations. In your approach, you could also change from one category to another for categorical variables. Then, we predict the probability of a positive response under both conditions - the original condition and the one with a standard deviation added to each variable. Then we could take the difference and summarize.
nx <- names(df)
nx <- nx[-which(nx == "y")]
res <- NULL
for(i in 1:length(nx)){
p1 <- predict(rf, newdata=df, type="prob")
df2 <- df
df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])
p2 <- predict(rf, newdata=df2, type="prob")
diff <- (p2-p1)[,2]
res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836

why do ggplot2 95%CI and prediction 95%CI calculated manually differ?

I'd like to know why when calculating 95% confidence bands from a linear mixed effects model does ggplot2 produces narrower bands than when calculated manually, e.g. by following Ben Bolker's method here confidence intervals on predictions. That is, is ggplot2 giving an inaccurate representation of the model?
Here is a reproducible example using the sleepstudy dataset (modified to be structurally similar to a df that I'm working on):
data("sleepstudy") # load dataset
height <- seq(165, 185, length.out = 18) # create vector called height
Treatment <- rep(c("Control", "Drug"), 9) # create vector called treatment
Subject <- levels(sleepstudy$Subject) # get vector of Subject
ht.subject <- data.frame(height, Subject, Treatment)
sleepstudy <- dplyr::left_join(sleepstudy, ht.subject, by="Subject") # Append df so that each subject has its own height and treatment
sleepstudy$Treatment <- as.factor(sleepstudy$Treatment)
Generate model, add predictions to original df, and plot
m.sleep <- lmer(Reaction ~ Treatment*height + (1 + Days|Subject), data=sleepstudy)
sleepstudy$pred <- predict(m.sleep)
ggplot(sleepstudy, aes(height, pred, col=Treatment)) + geom_smooth(method="lm")[2]
Calculate confidence intervals following Bolker method
newdf <- expand.grid(height=seq(165, 185, 1),
Treatment=c("Control","Drug"))
newdf$Reaction <- predict(m.sleep, newdf, re.form=NA)
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1+VarCorr(m.sleep)$Subject[1]
cmult <- 1.96
newdf <- data.frame(newdf
,plo = newdf$Reaction-cmult*sqrt(pvar1)
,phi = newdf$Reaction+cmult*sqrt(pvar1)
,tlo = newdf$Reaction-cmult*sqrt(tvar1)
,thi = newdf$Reaction+cmult*sqrt(tvar1))
# plot confidence intervals
ggplot(newdf, aes(x=height, y=Reaction, colour=Treatment)) +
geom_point() +
geom_ribbon(aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4)[2]
With a few tweaks, this seems consistent. The confidence intervals are indeed larger, but not enormously much larger. Keep in mind that ggplot is fitting a very different model; it is fitting separate linear (not linear mixed) models by treatment that ignore (1) repeated measures and (2) the effect of day.
It seems weird to fit a model with random slopes but no population-level slope (e.g.see here), so I added a fixed effect of Days:
m.sleep <- lmer(Reaction ~ Treatment*height + Days +
(1 + Days|Subject),
data=sleepstudy)
I reorganized the plotting code a little bit:
theme_set(theme_bw())
gg0 <- ggplot(sleepstudy, aes(height, colour=Treatment)) +
geom_point(aes(y=Reaction))+
geom_smooth(aes(y=pred), method="lm")
If you want to compute confidence intervals (which would be comparable with what lm()/ggplot2 is doing), then you probably should not add VarCorr(m.sleep)$Subject[1] to the variance (the tvar1 variable from the FAQ example is for creating prediction intervals rather than confidence intervals ...)
since I had Days in the model above, I added mean(sleepstudy$Days) to the prediction data frame.
newdf <- expand.grid(height=seq(165, 185, 1),
Treatment=c("Control","Drug"),
Days=mean(sleepstudy$Days))
newdf$Reaction <- newdf$pred <- predict(m.sleep, newdf, re.form=NA)
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1
cmult <- 1.96
newdf <- data.frame(newdf
,plo = newdf$Reaction-cmult*sqrt(pvar1)
,phi = newdf$Reaction+cmult*sqrt(pvar1)
,tlo = newdf$Reaction-cmult*sqrt(tvar1)
,thi = newdf$Reaction+cmult*sqrt(tvar1))
gg0 +
geom_point(data=newdf,aes(y=Reaction)) +
geom_ribbon(data=newdf,
aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4,
colour=NA)
Comparing with the estimated slopes and standard errors:
m0 <- lm(Reaction~height*Treatment,sleepstudy)
ff <- function(m) {
print(coef(summary(m))[-1,c("Estimate","Std. Error")],digits=2)
}
> ff(m0)
## Estimate Std. Error
## height -0.3 0.94
## TreatmentDrug -602.2 234.01
## height:TreatmentDrug 3.5 1.34
ff(m.sleep)
## Estimate Std. Error
## TreatmentDrug -55.03 425.3
## height 0.41 1.7
## Days 10.47 1.5
## TreatmentDrug:height 0.33 2.4
This looks consistent/about right: the mixed model is giving larger standard errors for the slope with respect to height and the height:treatment interaction. (The main effects of TreatmentDrug look crazy because they're the expected effects of treatment at height==0 ...)
As a cross-check, I can get similar answers from sjPlot::plot_model() ...
library(sjPlot)
plot_model(m.sleep, type="pred", terms=c("height","Treatment"))

Back-transform coefficients from glmer with scaled independent variables for prediction

I've fitted a mixed model using the lme4 package. I transformed my independent variables with the scale() function prior to fitting the model. I now want to display my results on a graph using predict(), so I need the predicted data to be back on the original scale. How do I do this?
Simplified example:
database <- mtcars
# Scale data
database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)
# Make model
model.1 <- glmer(vs ~ scale(wt) + scale(am) + (1|carb), database, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
xweight <- as.data.frame(lapply(lapply(database[, -1], mean), rep, 100))
# make new values for wt
xweight$wt <- (wt = seq(min(database$wt), max(database$wt), length = 100))
# predict from new values
a <- predict(model.1, newdata = xweight, type="response", re.form=NA)
# returns scaled prediction
I've tried using this example to back-transform the predictions:
# save scale and center values
scaleList <- list(scale = attr(database$wt, "scaled:scale"),
center = attr(database$wt, "scaled:center"))
# back-transform predictions
a.unscaled <- a * scaleList$scale + scaleList$center
# Make model with unscaled data to compare
un.model.1 <- glmer(vs ~ wt + am + (1|carb), mtcars, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
un.xweight <- as.data.frame(lapply(lapply(mtcars[, -1], mean), rep, 100))
# make new values for wt
un.xweight$wt <- (wt = seq(min(mtcars$wt), max(mtcars$wt), length = 100))
# predict from new values
b <- predict(un.model.1, newdata = xweight, type="response", re.form=NA)
all.equal(a.unscaled,b)
# [1] "Mean relative difference: 0.7223061"
This doesn't work - there shouldn't be any difference. What have I done wrong?
I've also looked at a number of similar questions but not managed to apply any to my case (How to unscale the coefficients from an lmer()-model fitted with a scaled response, unscale and uncenter glmer parameters, Scale back linear regression coefficients in R from scaled and centered data, https://stats.stackexchange.com/questions/302448/back-transform-mixed-effects-models-regression-coefficients-for-fixed-effects-f).
The problem with your approach is that it only "unscales" based on the wt variable, whereas you scaled all of the variables in your regression model. One approach that works is to adjust all of the variables in your new (prediction) data frame using the centering/scaling values that were used on the original data frame:
## scale variable x using center/scale attributes
## of variable y
scfun <- function(x,y) {
scale(x,
center=attr(y,"scaled:center"),
scale=attr(y,"scaled:scale"))
}
## scale prediction frame
xweight_sc <- transform(xweight,
wt = scfun(wt, database$wt),
am = scfun(am, database$am))
## predict
p_unsc <- predict(model.1,
newdata=xweight_sc,
type="response", re.form=NA)
Comparing this p_unsc to your predictions from the unscaled model (b in your code), i.e. all.equal(b,p_unsc), gives TRUE.
Another reasonable approach would be to
unscale/uncenter all of your parameters using the "unscaling" approaches presented in one of the linked question (such as this one), generating a coefficient vector beta_unsc
construct the appropriate model matrix from your prediction frame:
X <- model.matrix(formula(model,fixed.only=TRUE),
newdata=pred_frame)
compute the linear predictor and back-transform:
pred <- plogis(X %*% beta_unsc)

Obtaining fitted values from second-order quantile regressions

I'm sure this is easily resolvable, but I have a question regarding quantile regression.
Say I have a data frame which follows the trend of a second-order polynomial curve and I construct a quantile regression fitted through different parts of the data:
##Data preperation
set.seed(5)
d <- data.frame(x=seq(-5, 5, len=51))
d$y <- 50 - 0.3*d$x^2 + rnorm(nrow(d))
##Quantile regression
Taus <- c(0.1,0.5,0.9)
QUA<-rq(y ~ 1 + x + I(x^2), tau=Taus, data=d)
plot(y~x,data=d)
for (k in 1:length(Taus)){
curve((QUA$coef[1,k])+(QUA$coef[2,k])*(x)+(QUA$coef[3,k])*(x^2),lwd=2,lty=1, add = TRUE)
}
I can obtain the maximum y value through the 'predict.rq' function and you can see this the following plot.
##Maximum prediction
Pred_df<- as.data.frame(predict.rq(QUA))
apply(Pred_df,2,max)
So my question is how do I obtain the x-value which corresponds to the maximum y-value (i.e. the break in slope) for each quantile?
The package broom could be very useful here:
library(broom)
library(dplyr)
augment(QUA) %>%
group_by(.tau) %>%
filter(.fitted == max(.fitted))

Resources