Obtaining fitted values from second-order quantile regressions - r

I'm sure this is easily resolvable, but I have a question regarding quantile regression.
Say I have a data frame which follows the trend of a second-order polynomial curve and I construct a quantile regression fitted through different parts of the data:
##Data preperation
set.seed(5)
d <- data.frame(x=seq(-5, 5, len=51))
d$y <- 50 - 0.3*d$x^2 + rnorm(nrow(d))
##Quantile regression
Taus <- c(0.1,0.5,0.9)
QUA<-rq(y ~ 1 + x + I(x^2), tau=Taus, data=d)
plot(y~x,data=d)
for (k in 1:length(Taus)){
curve((QUA$coef[1,k])+(QUA$coef[2,k])*(x)+(QUA$coef[3,k])*(x^2),lwd=2,lty=1, add = TRUE)
}
I can obtain the maximum y value through the 'predict.rq' function and you can see this the following plot.
##Maximum prediction
Pred_df<- as.data.frame(predict.rq(QUA))
apply(Pred_df,2,max)
So my question is how do I obtain the x-value which corresponds to the maximum y-value (i.e. the break in slope) for each quantile?

The package broom could be very useful here:
library(broom)
library(dplyr)
augment(QUA) %>%
group_by(.tau) %>%
filter(.fitted == max(.fitted))

Related

Find Cook's Distance on Predicted Values for LM

Problem
I would like to use Cook's distance to identify outliers in my predicted data.
Background
I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance() (illustrated in Example 1 below).
More Explanation of Problem
When I fit new data with that model (using predict()), I can't see how to get the Cook's distance on the new points since cooks.distance() only operates on a model object. I understand that it is calculated by a leave-one-out method iteratively rebuilding the model so perhaps it doesn't make sense to calculate it on fitted values but I was hoping that I'm missing something simple about how one might approach this.
Desired Output
In Example 2 below I show the predicted values where I'd like to highlight outliers in by their Cook's D, but since I didn't know how to do it I just used their residual to illustrate something close to my desired output.
Example 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers
c <- cooks.distance(m)
# visualize outliers with cook's d
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
plot(mpg ~ disp,
col = pal[1 + round(100 * scale(c, min(c), max(c)))],
pch = 19,
main = "Color by Cook's D")); abline(m)
Example 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
plot(mpg ~ disp,
pch = 19,
col = pal[1 + round(100 * scale(resid, min(resid), max(resid)))],
main = "Color by Residual")); abline(m)
Created on 2022-03-10 by the reprex package (v2.0.1)

intervals for contrasts at specific values of covariate - emmeans and bootMer

I've been learning emmeans (great package) and using it to generate confidence intervals for contrasts of levels of a categorical variable (variable m) at specific values of a continuous variable (variable s), and I'd like to know if the same thing is possible using bootMer from lme4.
I've pasted in the results of running sjPlot::plot_model on the model to help visualization. I know that the confidence intervals for the contrasts are not shown on the plot, but I'm interested in knowing how to obtain the point estimates and confidence intervals for:
the B-A contrast at s=1
the B-A contrast at s=5
the C-A contrast at s=1
the C-A contrast at s=5
I'm not trying to control the family-wise error rate, so adjusting for multiple comparisons isn't necessarily needed.
I would have used predict(), but that doesn't work to get confidence intervals (no interval="confidence") for lmer models, with the recommendation to use bootMer instead found in the help for predict.merMod. Unfortunately I still haven't been able to figure out how to get the same four confidence intervals using bootMer as I have with emmeans. Is it even possible? If not, is that because it's not statistically legitimate and I'm just confused about things?
library(lme4)
library(emmeans)
library(ggplot2)
library(sjPlot)
# create the dataset, unbalanced at the lowest stratum ( 2 repeats for L2 instead of 3)
set.seed(1234)
s_levels <- 1:5
m_levels <- c("A", "B", "C")
v_levels <- c("L2", "L3", "L4")
reps <- 1:3
df <- expand.grid(rep=reps, s=s_levels, m=m_levels, v=v_levels)
df$y <- 10 + as.numeric(as.factor(df$v))*0.1 + rnorm(nrow(df), mean=0, sd=0.1)
df$subunit <- as.factor(paste(df$v,"-",df$m,"-",df$s, sep=""))
df <- subset(df, !(rep==3 & v=="L2")) # drop the 3rd repeat for v=="L2"
# fit the with-interaction model using lmer()
fit <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
# emmeans confidence intervals with size fixed = 1
ref_grid(fit, at=list(s = 1))
fit_rg <- ref_grid(fit, at=list(s = 1))
fit_emmeans <- emmeans(fit_rg, specs=~m*s)
contrast(fit_emmeans, method="trt.vs.ctrl1", infer=TRUE, adjust="none")
# emmeans confidence intervals with size fixed = 5
ref_grid(fit, at=list(s = 5))
fit_rg <- ref_grid(fit, at=list(s = 5))
fit_emmeans <- emmeans(fit_rg, specs=~m*s)
contrast(fit_emmeans, method="trt.vs.ctrl1", infer=TRUE, adjust="none")
# requires sjPlot library
plot_model(
model = fit ,
type="pred" ,
terms=c("s", "m", "v") ,
ci.lvl = 0.95)

Get significance level (p-value) of a model with quantreg [duplicate]

I have a data series of around 250 annual maximum rainfall measurements, maxima[,] and want to apply quantile regression to all series at once and obtain the significance of each regression model in R.
library(quantreg)
qmag <- array(NA, c(250,4))
taus <- c(0.05, 0.1, 0.95, 0.975)
for(igau in 1:250){
qure <- rq(maxima[,igau+1]~maxima[,1], tau=taus)
qmag[igau,] <- coef(qure)[2,]
}
I've tried
summary(qure, se="boot")$p.value
ci(qure)
and other similar variations but get NULL values. Is it actually possible to automatically extract the p-values from quantreg to a table, rather than just viewing them individually in summary() for each model?
have a look at the structure produced by running str() of the summary-object:
require(quantreg)
data(engel)
mod <- rq(foodexp ~ income, data = engel)
summ <- summary(mod, se = "boot")
summ
str(summ)
summ$coefficients[,4]

Individual terms in prediction of linear regression

I performed a regression analyses in R on some dataset and try to predict the contribution of each individual independent variable on the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1,v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it multiplies the coefficient of the regression result with the variable data. So something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The prediciton function is of by 0.1 or so Also if you add all terms in the prediction function together with the constant it doesn’t add up to the total prediction (using type=”response”). What does the prediction function calculate here and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" here.
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function calculation (terms_predict[,1]) are different is because the columns in the terms matrix are centered around the mean, so their mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
Hope this helps.
The function predict(...,type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied to its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()

How to plot the survival curve generated by survreg (package survival of R)?

I’m trying to fit and plot a Weibull model to a survival data. The data has just one covariate, cohort, which runs from 2006 to 2010. So, any ideas on what to add to the two lines of code that follows to plot the survival curve of the cohort of 2010?
library(survival)
s <- Surv(subSetCdm$dur,subSetCdm$event)
sWei <- survreg(s ~ cohort,dist='weibull',data=subSetCdm)
Accomplishing the same with the Cox PH model is rather straightforward, with the following lines. The problem is that survfit() doesn’t accept objects of type survreg.
sCox <- coxph(s ~ cohort,data=subSetCdm)
cohort <- factor(c(2010),levels=2006:2010)
sfCox <- survfit(sCox,newdata=data.frame(cohort))
plot(sfCox,col='green')
Using the data lung (from the survival package), here is what I'm trying to accomplish.
#create a Surv object
s <- with(lung,Surv(time,status))
#plot kaplan-meier estimate, per sex
fKM <- survfit(s ~ sex,data=lung)
plot(fKM)
#plot Cox PH survival curves, per sex
sCox <- coxph(s ~ as.factor(sex),data=lung)
lines(survfit(sCox,newdata=data.frame(sex=1)),col='green')
lines(survfit(sCox,newdata=data.frame(sex=2)),col='green')
#plot weibull survival curves, per sex, DOES NOT RUN
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
lines(survfit(sWei,newdata=data.frame(sex=1)),col='red')
lines(survfit(sWei,newdata=data.frame(sex=2)),col='red')
Hope this helps and I haven't made some misleading mistake:
copied from above:
#create a Surv object
s <- with(lung,Surv(time,status))
#plot kaplan-meier estimate, per sex
fKM <- survfit(s ~ sex,data=lung)
plot(fKM)
#plot Cox PH survival curves, per sex
sCox <- coxph(s ~ as.factor(sex),data=lung)
lines(survfit(sCox,newdata=data.frame(sex=1)),col='green')
lines(survfit(sCox,newdata=data.frame(sex=2)),col='green')
for Weibull, use predict, re the comment from Vincent:
#plot weibull survival curves, per sex,
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
lines(predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
lines(predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
The trick here was reversing the quantile orders for plotting vs predicting. There is likely a better way to do this, but it works here. Good luck!
An alternative option is to make use of the package flexsurv. This offers some additional functionality over the survival package - including that the parametric regression function flexsurvreg() has a nice plot method which does what you ask.
Using lung as above;
#create a Surv object
s <- with(lung,Surv(time,status))
require(flexsurv)
sWei <- flexsurvreg(s ~ as.factor(sex),dist='weibull',data=lung)
sLno <- flexsurvreg(s ~ as.factor(sex),dist='lnorm',data=lung)
plot(sWei)
lines(sLno, col="blue")
You can plot on the cumulative hazard or hazard scale using the type argument, and add confidence intervals with the ci argument.
This is just a note clarifying Tim Riffe's answer, which uses the following code:
lines(predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
lines(predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
The reason for the two mirror-image sequences, seq(.01,.99,by=.01) and seq(.99,.01,by=-.01), is because the predict() method is giving quantiles for the event distribution f(t) - that is, values of the inverse CDF of f(t) - while a survival curve is plotting 1-(CDF of f) versus t. In other words, if you plot p versus predict(p), you'll get the CDF, and if you plot 1-p versus predict(p) you'll get the survival curve, which is 1-CDF. The following code is more transparent and generalizes to arbitrary vectors of p values:
pct <- seq(.01,.99,by=.01)
lines(predict(sWei, newdata=list(sex=1),type="quantile",p=pct),1-pct,col="red")
lines(predict(sWei, newdata=list(sex=2),type="quantile",p=pct),1-pct,col="red")
In case someone wants to add a Weibull distribution to the Kaplan-Meyer curve in the ggplot2 ecosystem, we can do the following:
library(survminer)
library(tidyr)
s <- with(lung,Surv(time,status))
fKM <- survfit(s ~ sex,data=lung)
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
pred.sex1 = predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01))
pred.sex2 = predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01))
df = data.frame(y=seq(.99,.01,by=-.01), sex1=pred.sex1, sex2=pred.sex2)
df_long = gather(df, key= "sex", value="time", -y)
p = ggsurvplot(fKM, data = lung, risk.table = T)
p$plot = p$plot + geom_line(data=df_long, aes(x=time, y=y, group=sex))
In case you'd like to use the survival function itself S(t) (instead of the inverse survival function S^{-1}(p) used in other answers here) I've written a function to implement that for the case of the Weibull distribution (following the same inputs as the pec::predictSurvProb family of functions:
survreg.predictSurvProb <- function(object, newdata, times){
shape <- 1/object$scale # also equals 1/exp(fit$icoef[2])
lps <- predict(object, newdata = newdata, type = "lp")
surv <- t(sapply(lps, function(lp){
sapply(times, function(t) 1 - pweibull(t, shape = shape, scale = exp(lp)))
}))
return(surv)
}
You can then do:
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
times <- seq(min(lung$time), max(lung$time), length.out = 1000)
new_dat <- data.frame(sex = c(1,2))
surv <- survreg.predictSurvProb(sWei, newdata = new_dat, times = times)
lines(times, surv[1, ],col='red')
lines(times, surv[2, ],col='red')

Resources