How to plot a Cox hazard model with splines - r

I have a following model:
coxph(Surv(fulength, mortality == 1) ~ pspline(predictor))
where is fulength is a duration of follow-up (including mortality), predictor is a predictor of mortality.
The output of the command above is this:
coef se(coef) se2 Chisq DF p
pspline(predictor), line 0.174 0.0563 0.0562 9.52 1.00 0.002
pspline(predictor), nonl 4.74 3.09 0.200
How can I plot this model so that I get the nice curvy line with 95% confidence bands and hazard ratio on the y axis? What I am aiming for is something similar to this:

This is when you get when you run the first example in ?cph of the rms-package:
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n,
rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
S <- Surv(dt,e)
f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE)
cox.zph(f, "rank") # tests of PH
anova(f)
plot(Predict(f, age, sex)) # plot age effect, 2 curves for 2 sexes
Because the rms/Hmisc package combo uses lattice plots, annotation with a marginal age-density feature would need to be done with lattice-functions. On the other hand, if you want to change the response value to relative hazard you can just add a 'fun=exp' argument to the Predict call and relable the graph to get:
png(); plot(Predict(f, age, sex, fun=exp), ylab="Relative Hazard");dev.off()

Related

Lethal Dose (LD50) for the different factors in a logistic regression

I have a logistic regression model that contains a continuous variable and a qualitative variable with various levels/factors. Is there a way to calculate LD50 for each of the factors?
library(dplyr)
library(ggeffects)
library(MASS)
# Produce a table with data
set.seed(321)
x1 <- rnorm(400)
x2 <- 3*x1^3 + 2*x1^2 + x1 + 1
x3 <- 1/(1 + exp(-x2))
x4 = rbinom(400, 1, x3)
dt <- tibble(binary = x4,
continuous = x1,
qualitative = sample(c("white", "pink", "lime"), 400, TRUE))
# Logistic model
mdl <- glm(binary ~ continuous + qualitative, family = "binomial", data = dt)
# Calculate LD50, which gives the lethal dose for the whole model
dose.p(mdl) # -0.58 ± 0.11 se
# Plot regression model by factor of the qualitative variable
newdat <- ggpredict(mdl, terms = c("continuous[all]", "qualitative"))
plot(newdat)
How can I calculate LD50 ± SE for each of the factors (LD50 for lime, LD50 for pink, etc.)?
This below fits a model for binary vs continuous for every factor and pulls out the LD50 and SE:
get_LD50 = function(fit){
data.frame(
LD50 = dose.p(fit)[1],
SE = attributes(dose.p(fit))$SE[,1]
)
}
dt %>% group_by(qualitative) %>%
do(get_LD50(glm(binary ~ continuous, family = "binomial", data = .)))
# A tibble: 3 x 3
# Groups: qualitative [3]
qualitative LD50 SE
<chr> <dbl> <dbl>
1 lime -0.578 0.115
2 pink -0.479 0.104
3 white -0.392 0.116
If you would need something actually more complicated on the statistical side, you have to clarify

Plotting a survival curve from a survreg prediction

I'm relatively new to survival analysis and have been used some standard telco churn data example with a sample below called 'telco':
telco <- read.csv(text = "State,Account_Length,Area_Code,Intl_Plan,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn
IN,65,415,no,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,TRUE
RI,74,415,no,187.7,127,31.91,163.4,148,13.89,196,94,8.82,9.1,5,2.46,0,FALSE
IA,168,408,no,128.8,96,21.9,104.9,71,8.92,141.1,128,6.35,11.2,2,3.02,1,FALSE
MT,95,510,no,156.6,88,26.62,247.6,75,21.05,192.3,115,8.65,12.3,5,3.32,3,FALSE
IA,62,415,no,120.7,70,20.52,307.2,76,26.11,203,99,9.14,13.1,6,3.54,4,FALSE
NY,161,415,no,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,TRUE")
I've run:
library(survival)
dependentvars = Surv(telco$Account_Length, telco$Churn)
telcosurvreg = survreg(dependentvars ~ -Churn -Account_Length, dist="gaussian",data=telco)
telcopred = predict(telcosurvreg, newdata=telco, type="quantile", p=.5)
...to get the predicted lifetime of each customer.
What I'm struggling with is how to visualise a survival curve for this. Is there a way (preferably in ggplot2) to do this from the data I have?
Here is a base R version that plots the predicted survival curves. I have changed the formula so the curves differ for each row
> # change setup so we have one covariate
> telcosurvreg = survreg(
+ Surv(Account_Length, Churn) ~ Eve_Charge, dist = "gaussian", data = telco)
> telcosurvreg # has more than an intercept
Call:
survreg(formula = Surv(Account_Length, Churn) ~ Eve_Charge, data = telco,
dist = "gaussian")
Coefficients:
(Intercept) Eve_Charge
227.274695 -3.586121
Scale= 56.9418
Loglik(model)= -12.1 Loglik(intercept only)= -12.4
Chisq= 0.54 on 1 degrees of freedom, p= 0.46
n= 6
>
> # find linear predictors
> vals <- predict(telcosurvreg, newdata = telco, type = "lp")
>
> # use the survreg.distributions object. See ?survreg.distributions
> x_grid <- 1:400
> sur_curves <- sapply(
+ vals, function(x)
+ survreg.distributions[[telcosurvreg$dist]]$density(
+ (x - x_grid) / telcosurvreg$scale)[, 1])
>
> # plot with base R
> matplot(x_grid, sur_curves, type = "l", lty = 1)
Here is the result

Mean variance of a difference of BLUEs or BLUPs in `lme4`

Given below is the code for analysis of a resolvable alpha design (alpha lattice design) using the R package asreml.
# load the data
library(agridat)
data(john.alpha)
dat <- john.alpha
# load asreml
library(asreml)
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- asreml(yield ~ 1 + rep, data=dat, random=~ gen + rep:block)
# variance due to `gen`
sg2 <- summary(model1 )$varcomp[1,'component']
# mean variance of a difference of two BLUPs
vblup <- predict(model1 , classify="gen")$avsed ^ 2
# model2 - fixed `gen`
#----------------------
model2 <- asreml(yield ~ 1 + gen + rep, data=dat, random = ~ rep:block)
# mean variance of a difference of two adjusted treatment means (BLUE)
vblue <- predict(model2 , classify="gen")$avsed ^ 2
# H^2 = .803
sg2 / (sg2 + vblue/2)
# H^2c = .809
1-(vblup / 2 / sg2)
I am trying to replicate the above using the R package lme4.
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- lmer(yield ~ 1 + (1|gen) + rep + (1|rep:block), dat)
# variance due to `gen`
varcomp <- VarCorr(model1)
varcomp <- data.frame(print(varcomp, comp = "Variance"))
sg2 <- varcomp[varcomp$grp == "gen",]$vcov
# model2 - fixed `gen`
#----------------------
model2 <- lmer(yield ~ 1 + gen + rep + (1|rep:block), dat)
How to compute the vblup and vblue (mean variance of difference) in lme4 equivalent to predict()$avsed ^ 2 of asreml ?
I'm not that familiar with this variance partitioning stuff, but I'll take a shot.
library(lme4)
model1 <- lmer(yield ~ 1 + rep + (1|gen) + (1|rep:block), john.alpha)
model2 <- update(model1, . ~ . + gen - (1|gen))
## variance due to `gen`
sg2 <- c(VarCorr(model1)[["gen"]]) ## 0.142902
Get conditional variances of BLUPs:
rr1 <- ranef(model1,condVar=TRUE)
vv1 <- attr(rr$gen,"postVar")
str(vv1)
## num [1, 1, 1:24] 0.0289 0.0289 0.0289 0.0289 0.0289 ...
This is a 1x1x24 array (effectively just a vector of variances; we could collapse using c() if we needed to). They're not all the same, but they're pretty close ... I don't know whether they should all be identical (and this is a roundoff issue)
(uv <- unique(vv1))
## [1] 0.02887451 0.02885887 0.02885887
The relative variation is approximately 5.4e-4 ...
If these were all the same then the mean variance of a difference of any two would be just twice the variance (Var(x-y) = Var(x)+Var(y); by construction the BLUPs are all independent). I'm going to go ahead and use this.
vblup <- 2*mean(vv1)
For the model with gen fitted as a fixed effect, let's extract the variances of the parameters relating to genotypes (which are differences in the expected value from the first level):
vv2 <- diag(vcov(model2))[-(1:3)]
summary(vv2)
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06631 0.06678 0.07189 0.07013 0.07246 0.07286
I'm going to take the means of these values (not double the values, since these are already the variances of differences)
vblue <- mean(vv2)
sg2/(sg2+vblue/2) ## 0.8029779
1-(vblup/2/sg2) ## 0.7979965
The H^2 estimate looks right on, but the H^2c estimate is a little different (0.797 vs. 0.809, a 1.5% relative difference); I don't know if that is big enough to be of concern or not.

Clustered standard errors with texreg?

I'm trying to reproduce this stata example and move from stargazer to texreg. The data is available here.
To run the regression and get the se I run this code:
library(readstata13)
library(sandwich)
cluster_se <- function(model_result, data, cluster){
model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- as.integer(rownames(model_result$model))
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum));
vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
sqrt(diag(vcovCL))
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <- lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll, data = elemapi2)
se.lm1 <- cluster_se(model_result = lm1, data = elemapi2, cluster = "dnum")
stargazer::stargazer(lm1, type = "text", style = "aer", se = list(se.lm1))
==========================================================
api00
----------------------------------------------------------
acs_k3 6.954
(6.901)
acs_46 5.966**
(2.531)
full 4.668***
(0.703)
enroll -0.106**
(0.043)
Constant -5.200
(121.786)
Observations 395
R2 0.385
Adjusted R2 0.379
Residual Std. Error 112.198 (df = 390)
F Statistic 61.006*** (df = 4; 390)
----------------------------------------------------------
Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.
texreg produces this:
texreg::screenreg(lm1, override.se=list(se.lm1))
========================
Model 1
------------------------
(Intercept) -5.20
(121.79)
acs_k3 6.95
(6.90)
acs_46 5.97 ***
(2.53)
full 4.67 ***
(0.70)
enroll -0.11 ***
(0.04)
------------------------
R^2 0.38
Adj. R^2 0.38
Num. obs. 395
RMSE 112.20
========================
How can I fix the p-values?
Robust Standard Errors with texreg are easy: just pass the coeftest directly!
This has become much easier since the question was last answered: it appears you can now just pass the coeftest with the desired variance-covariance matrix directly. Downside: you lose the goodness of fit statistics (such as R^2 and number of observations), but depending on your needs, this may not be a big problem
How to include robust standard errors with texreg
> screenreg(list(reg1, coeftest(reg1,vcov = vcovHC(reg1, 'HC1'))),
custom.model.names = c('Standard Standard Errors', 'Robust Standard Errors'))
=============================================================
Standard Standard Errors Robust Standard Errors
-------------------------------------------------------------
(Intercept) -192.89 *** -192.89 *
(55.59) (75.38)
x 2.84 ** 2.84 **
(0.96) (1.04)
-------------------------------------------------------------
R^2 0.08
Adj. R^2 0.07
Num. obs. 100
RMSE 275.88
=============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
To generate this example, I created a dataframe with heteroscedasticity, see below for full runnable sample code:
require(sandwich);
require(texreg);
set.seed(1234)
df <- data.frame(x = 1:100);
df$y <- 1 + 0.5*df$x + 5*100:1*rnorm(100)
reg1 <- lm(y ~ x, data = df)
First, notice that your usage of as.integer is dangerous and likely to cause problems once you use data with non-numeric rownames. For instance, using the built-in dataset mtcars whose rownames consist of car names, your function will coerce all rownames to NA, and your function will not work.
To your actual question, you can provide custom p-values to texreg, which means that you need to compute the corresponding p-values. To achieve this, you could compute the variance-covariance matrix, compute the test-statistics, and then compute the p-value manually, or you just compute the variance-covariance matrix and supply it to e.g. coeftest. Then you can extract the standard errors and p-values from there. Since I am unwilling to download any data, I use the mtcars-data for the following:
library(sandwich)
library(lmtest)
library(texreg)
cluster_se <- function(model_result, data, cluster){
model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- rownames(model_result$model) # changed to be able to work with mtcars, not tested with other data
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum));
vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
}
lm1 <- lm(formula = mpg ~ cyl + disp, data = mtcars)
vcov.lm1 <- cluster_se(model_result = lm1, data = mtcars, cluster = "carb")
standard.errors <- coeftest(lm1, vcov. = vcov.lm1)[,2]
p.values <- coeftest(lm1, vcov. = vcov.lm1)[,4]
texreg::screenreg(lm1, override.se=standard.errors, override.p = p.values)
And just for completeness sake, let's do it manually:
t.stats <- abs(coefficients(lm1) / sqrt(diag(vcov.lm1)))
t.stats
(Intercept) cyl disp
38.681699 5.365107 3.745143
These are your t-statistics using the cluster-robust standard errors. The degree of freedom is stored in lm1$df.residual, and using the built in functions for the t-distribution (see e.g. ?pt), we get:
manual.p <- 2*pt(-t.stats, df=lm1$df.residual)
manual.p
(Intercept) cyl disp
1.648628e-26 9.197470e-06 7.954759e-04
Here, pt is the distribution function, and we want to compute the probability of observing a statistic at least as extreme as the one we observe. Since we testing two-sided and it is a symmetric density, we first take the left extreme using the negative value, and then double it. This is identical to using 2*(1-pt(t.stats, df=lm1$df.residual)). Now, just to check that this yields the same result as before:
all.equal(p.values, manual.p)
[1] TRUE

Get confidence intervals for regression coefficients of "mlm" object returned by `lm()`

I'm running a multivariate regression with 2 outcome variables and 5 predictors. I would like to obtain the confidence intervals for all regression coefficients. Usually I use the function lm but it doesn't seem to work for a multivariate regression model (object mlm).
Here's a reproducible example.
library(car)
mod <- lm(cbind(income, prestige) ~ education + women, data=Prestige)
confint(mod) # doesn't return anything.
Any alternative way to do it? (I could just use the value of the standard error and multiply by the right critical t value, but I was wondering if there was an easier way to do it).
confint won't return you anything, because there is no "mlm" method supported:
methods(confint)
#[1] confint.default confint.glm* confint.lm confint.nls*
As you said, we can just plus / minus some multiple of standard error to get upper / lower bound of confidence interval. You were probably going to do this via coef(summary(mod)), then use some *apply method to extract standard errors. But my answer to Obtain standard errors of regression coefficients for an “mlm” object returned by lm() gives you a supper efficient way to get standard errors without going through summary. Applying std_mlm to your example model gives:
se <- std_mlm(mod)
# income prestige
#(Intercept) 1162.299027 3.54212524
#education 103.731410 0.31612316
#women 8.921229 0.02718759
Now, we define another small function to compute lower and upper bound:
## add "mlm" method to generic function "confint"
confint.mlm <- function (model, level = 0.95) {
beta <- coef(model)
se <- std_mlm (model)
alpha <- qt((1 - level) / 2, df = model$df.residual)
list(lower = beta + alpha * se, upper = beta - alpha * se)
}
## call "confint"
confint(mod)
#$lower
# income prestige
#(Intercept) -3798.25140 -15.7825086
#education 739.05564 4.8005390
#women -81.75738 -0.1469923
#
#$upper
# income prestige
#(Intercept) 814.25546 -1.72581876
#education 1150.70689 6.05505285
#women -46.35407 -0.03910015
It is easy to interpret this. For example, for response income, the 95%-confidence interval for all variables are
#(intercept) (-3798.25140, 814.25546)
# education (739.05564, 1150.70689)
# women (-81.75738, -46.35407)
This comes from the predict.lm example. You want the interval = 'confidence' option.
x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")
matplot(new$x, pred.w.clim,
lty = c(1,2,2,3,3), type = "l", ylab = "predicted y")
This seems to have been discussed recently (July 2018) on the R-devel list, so hopefully by the next version of R it will be fixed. A workaround proposed on that list is to use:
confint.mlm <- function (object, level = 0.95, ...) {
cf <- coef(object)
ncfs <- as.numeric(cf)
a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual)
pct <- stats:::format.perc(a, 3)
ses <- sqrt(diag(vcov(object)))
ci <- ncfs + ses %o% fac
setNames(data.frame(ci),pct)
}
Test:
fit_mlm <- lm(cbind(mpg, disp) ~ wt, mtcars)
confint(fit_mlm)
Gives:
2.5 % 97.5 %
mpg:(Intercept) 33.450500 41.119753
mpg:wt -6.486308 -4.202635
disp:(Intercept) -204.091436 -58.205395
disp:wt 90.757897 134.198380
Personnally, I like it in a clean tibble way (using broom::tidy would be even better, but has an issue currently)
library(tidyverse)
confint(fit_mlm) %>%
rownames_to_column() %>%
separate(rowname, c("response", "term"), sep=":")
Gives:
response term 2.5 % 97.5 %
1 mpg (Intercept) 33.450500 41.119753
2 mpg wt -6.486308 -4.202635
3 disp (Intercept) -204.091436 -58.205395
4 disp wt 90.757897 134.198380

Resources