I am using the survdiff function from the R survival package. How can I get the p-value from the returned object?
> diff = survdiff(Surv(Time, Censored) ~ Treatment+Gender, data = dat)
> diff
Call:
survdiff(formula = Surv(Time, Censored) ~ Treatment + Gender,
data = dat)
N Observed Expected (O-E)^2/E (O-E)^2/V
Treatment=Control, Gender=M 2 1 1.65 0.255876 0.360905
Treatment=Control, Gender=F 7 3 2.72 0.027970 0.046119
Treatment=IND, Gender=M 5 2 2.03 0.000365 0.000519
Treatment=IND, Gender=F 6 2 1.60 0.100494 0.139041
Chisq= 0.5 on 3 degrees of freedom, p= 0.924
I want to get the p value 0.924 using some function. Thanks.
The p-value is not stored in the survdiff object; it is calculated on the fly when the object is printed. To reproduce it, you can use the chi-squared distribution function pchisq:
diff = survdiff(Surv(Time, Censored) ~ Treatment+Gender, data = dat)
pchisq(diff$chisq, length(diff$n)-1, lower.tail = FALSE)
The code in the function print.survdiff that displays the p-value is:
cat("\n Chisq=", format(round(x$chisq, 1)), " on", df,
"degrees of freedom, p=", format(signif(1 - pchisq(x$chisq,
df), digits)), "\n")
The code leading up to it:
if (is.matrix(x$obs)) {
otmp <- apply(x$obs, 1, sum)
etmp <- apply(x$exp, 1, sum)
} else {
otmp <- x$obs
etmp <- x$exp
}
df <- (sum(1 * (etmp > 0))) - 1
And 'digits' is set to 3 in the argument list, so, using the example on the survdiff help page (with df computed as in the block above):
x <- survdiff(Surv(time, status) ~ pat.karno + strata(inst), data = lung)
digits <- 3
cat("p=", format(signif(1 - pchisq(x$chisq, df), digits)))
# p= 0.00326
Addressing the comment: In the example the second code block reduces to:
df <- with(x, (sum(1 * (apply(x$exp, 1, sum) > 0))) - 1 )
> df
[1] 7
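Putting the pieces together, here is a small convenience wrapper (a sketch that mirrors the logic of print.survdiff; the name survdiff_pvalue is just illustrative):
survdiff_pvalue <- function(fit) {
  # degrees of freedom = number of non-empty groups minus one, as in print.survdiff
  etmp <- if (is.matrix(fit$exp)) apply(fit$exp, 1, sum) else fit$exp
  df <- sum(etmp > 0) - 1
  pchisq(fit$chisq, df, lower.tail = FALSE)
}
survdiff_pvalue(x)  # should match the p-value printed by print(x)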
With the glance() function from the broom package, it is very easy to get the p.value.
diff = survdiff(Surv(Time, Censored) ~ Treatment+Gender, data = dat)
broom::glance(diff)$p.value
In the mixed-model (or REWB) framework it is common to model within-cluster changes by subtracting the cluster mean from a time-varying x-variable (demeaning); see e.g. Bell, Fairbrother & Jones (2018). This estimator is basically the same as a fixed effects (FE) estimator (shown below using the sleepstudy data).
The issue arises when trying to model polynomials using the same principle. The equality between the estimators breaks when we enter our demeaned variable as a polynomial. We can restore this equality by first squaring the variable and then demeaning (see re_poly_fixed).
dt <- lme4::sleepstudy
dt$days_squared <- dt$Days * dt$Days
dt <- cbind(dt, datawizard::demean(dt, select = c("Days", "days_squared"), group = "Subject"))
re <- lme4::lmer(Reaction ~ Days_within + (1 | Subject), data = dt, REML = FALSE)
fe <- fixest::feols(Reaction ~ Days | Subject, data = dt)
re_poly <- lme4::lmer(Reaction ~ poly(Days_within, 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
fe_poly <- fixest::feols(Reaction ~ poly(Days, 2, raw = TRUE) | Subject, data = dt)
re_poly_fixed <- lme4::lmer(Reaction ~ Days_within + days_squared_within + (1 | Subject),
data = dt, REML = FALSE)
models <-
list("re" = re, "fe" = fe, "re_poly" = re_poly, "fe_poly" = fe_poly, "re_poly_fixed" = re_poly_fixed)
modelsummary::modelsummary(models)
The main issue with this strategy is that for post-estimation, especially with packages that calculate marginal effects (e.g. marginaleffects in R or margins in Stata), the variable needs to be entered as a polynomial term for the calculations to consider both x and x^2 (that is, using poly() or I() in R, or factor notation c.x##c.x in Stata). The difference can be seen in the two calls below, where the FE call returns one effect for "Days" and the manual call returns two separate terms.
(me_fe <- summary(marginaleffects::marginaleffects(fe_poly)))
(me_re <- summary(marginaleffects::marginaleffects(re_poly_fixed)))
I may be missing something obvious here, but is it possible to retain the equality between the estimators in FE and the Mixed model setups with polynomials, while still being able to use common packages for marginal effects?
The problem is that when a transformed variable is hardcoded, the marginaleffects package does not know that it should manipulate both the transformed and the original at the same time to compute the slope. One solution is to de-mean inside the formula with I(). You should be aware that this may make the model fitting less efficient.
Here’s an example where I pre-compute the within-group means using data.table, but you could achieve the same result with dplyr::group_by():
library(lme4)
library(data.table)
library(modelsummary)
library(marginaleffects)
dt <- data.table(lme4::sleepstudy)
dt[, `:=`(Days_mean = mean(Days),
Days_within = Days - mean(Days)),
by = "Subject"]
re_poly <- lmer(
Reaction ~ poly(Days_within, 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
re_poly_2 <- lmer(
Reaction ~ poly(I(Days - Days_mean), 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
models <- list(re_poly, re_poly_2)
modelsummary(models, output = "markdown")
|                                           | Model 1 | Model 2 |
|-------------------------------------------|---------|---------|
| (Intercept)                               | 295.727 | 295.727 |
|                                           | (9.173) | (9.173) |
| poly(Days_within, 2, raw = TRUE)1         | 10.467  |         |
|                                           | (0.799) |         |
| poly(Days_within, 2, raw = TRUE)2         | 0.337   |         |
|                                           | (0.316) |         |
| poly(I(Days - Days_mean), 2, raw = TRUE)1 |         | 10.467  |
|                                           |         | (0.799) |
| poly(I(Days - Days_mean), 2, raw = TRUE)2 |         | 0.337   |
|                                           |         | (0.316) |
| SD (Intercept Subject)                    | 36.021  | 36.021  |
| SD (Observations)                         | 30.787  | 30.787  |
| Num.Obs.                                  | 180     | 180     |
| R2 Marg.                                  | 0.290   | 0.290   |
| R2 Cond.                                  | 0.700   | 0.700   |
| AIC                                       | 1795.8  | 1795.8  |
| BIC                                       | 1811.8  | 1811.8  |
| ICC                                       | 0.6     | 0.6     |
| RMSE                                      | 29.32   | 29.32   |
The estimated average marginal effects are, as expected, identical:
marginaleffects(re_poly) |> summary()
#> Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 Days_within 10.47 0.7989 13.1 < 2.22e-16 8.902 12.03
#>
#> Model type: lmerMod
#> Prediction type: response
marginaleffects(re_poly_2) |> summary()
#> Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 Days 10.47 0.7989 13.1 < 2.22e-16 8.902 12.03
#>
#> Model type: lmerMod
#> Prediction type: response
The following is not exactly what I asked for in the question, but it is at least a decent workaround for anyone with similar problems.
library(lme4)
library(data.table)
library(fixest)
library(marginaleffects)
dt <- data.table(lme4::sleepstudy)
dt[, `:=`(Days_mean = mean(Days),
Days_within = Days - mean(Days),
Days2 = Days^2,
Days2_within = Days^2 - mean(Days^2)),
by = "Subject"]
fe_poly <- fixest::feols(
Reaction ~ poly(Days, 2, raw = TRUE) | Subject, data = dt)
re_poly_fixed <- lme4::lmer(
Reaction ~ Days_within + Days2_within + (1 | Subject), data = dt, REML = FALSE)
modelsummary(list(fe_poly, re_poly_fixed), output = "markdown")
We start with the two models previously described. We can manually calculate the AME, or marginal effects at other values, and get confidence intervals using multcomp::glht(). The approach is quite similar to that of lincom in Stata. I have written a wrapper that returns the values in a data.table:
lincom <- function(model, linhyp) {
  # test the linear hypothesis, then collect estimate, CI bounds, t and p in a data.table
  t <- summary(multcomp::glht(model, linfct = c(linhyp)))
  ci <- confint(t)
  dt <- data.table::data.table(
    "estimate" = t[["test"]]$coefficients,
    "se"       = t[["test"]]$sigma,
    "ll"       = ci[["confint"]][2],
    "ul"       = ci[["confint"]][3],
    "t"        = t[["test"]]$tstat,
    "p"        = t[["test"]]$pvalues,
    "id"       = rownames(t[["linfct"]])[1])
  return(dt)
}
This can likely be improved or adapted to other similar needs. We can calculate the AME by taking the partial derivative with respect to Days; for the present model this is coef(Days) + 2 * coef(Days2) * mean(Days).
marginaleffects(fe_poly) |> summary()
Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
1 Days 10.47 1.554 6.734 1.6532e-11 7.421 13.51
Model type: fixest
Prediction type: response
By adding this formula to the lincom function, we get similar results:
names(fe_poly$coefficients) <- c("Days", "Days2")
mean(dt$Days) # Mean = 4.5
lincom(fe_poly, "Days + 2 * Days2 * 4.5 = 0")
estimate se ll ul t p id
1: 10.46729 1.554498 7.397306 13.53727 6.733549 2.817051e-10 Days + 2 * Days2 * 4.5
lincom(re_poly_fixed, "Days_within + 2 * Days2_within * 4.5 = 0")
estimate se ll ul t p id
1: 10.46729 0.798932 8.901408 12.03316 13.1016 0 Days_within + 2 * Days2_within * 4.5
It is possible to check other ranges of values and to add other variables from the model using the formula. This can be done using lapply or a loop and the output can then be combined using a simple rbind. This should make it relatively easy to present/plot results.
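For example, a sketch of that loop (assuming the renamed fe_poly coefficients and the lincom() wrapper from above; the grid of Days values is arbitrary):
day_grid <- c(2, 4.5, 7)  # arbitrary values of Days at which to evaluate the effect
hyps <- sprintf("Days + 2 * Days2 * %s = 0", day_grid)
me_by_day <- data.table::rbindlist(lapply(hyps, function(h) lincom(fe_poly, h)))
me_by_day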
EDIT: As Vincent pointed out below, there is also marginaleffects::deltamethod. This looks to be a better, more robust option that provides similar results (with the same syntax):
mfx1 <- marginaleffects::deltamethod(
fe_poly, "Days + 2 * Days2 * 4.5 = 0")
mfx2 <- marginaleffects::deltamethod(
re_poly_fixed, "Days_within + 2 * Days2_within * 4.5 = 0")
rbind(mfx1, mfx2)
term estimate std.error statistic p.value conf.low conf.high
1 Days + 2 * Days2 * 4.5 = 0 10.46729 1.554498 6.733549 1.655739e-11 7.420527 13.51405
2 Days_within + 2 * Days2_within * 4.5 = 0 10.46729 0.798932 13.101597 3.224003e-39 8.901408 12.03316
I'm relatively new to survival analysis and have been using a standard telco churn data example, with a sample below called 'telco':
telco <- read.csv(text = "State,Account_Length,Area_Code,Intl_Plan,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn
IN,65,415,no,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,TRUE
RI,74,415,no,187.7,127,31.91,163.4,148,13.89,196,94,8.82,9.1,5,2.46,0,FALSE
IA,168,408,no,128.8,96,21.9,104.9,71,8.92,141.1,128,6.35,11.2,2,3.02,1,FALSE
MT,95,510,no,156.6,88,26.62,247.6,75,21.05,192.3,115,8.65,12.3,5,3.32,3,FALSE
IA,62,415,no,120.7,70,20.52,307.2,76,26.11,203,99,9.14,13.1,6,3.54,4,FALSE
NY,161,415,no,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,TRUE")
I've run:
library(survival)
dependentvars = Surv(telco$Account_Length, telco$Churn)
telcosurvreg = survreg(dependentvars ~ -Churn -Account_Length, dist="gaussian",data=telco)
telcopred = predict(telcosurvreg, newdata=telco, type="quantile", p=.5)
...to get the predicted lifetime of each customer.
What I'm struggling with is how to visualise a survival curve for this. Is there a way (preferably in ggplot2) to do this from the data I have?
Here is a base R version that plots the predicted survival curves. I have changed the formula so the curves differ between rows.
> # change setup so we have one covariate
> telcosurvreg = survreg(
+ Surv(Account_Length, Churn) ~ Eve_Charge, dist = "gaussian", data = telco)
> telcosurvreg # has more than an intercept
Call:
survreg(formula = Surv(Account_Length, Churn) ~ Eve_Charge, data = telco,
dist = "gaussian")
Coefficients:
(Intercept) Eve_Charge
227.274695 -3.586121
Scale= 56.9418
Loglik(model)= -12.1 Loglik(intercept only)= -12.4
Chisq= 0.54 on 1 degrees of freedom, p= 0.46
n= 6
>
> # find linear predictors
> vals <- predict(telcosurvreg, newdata = telco, type = "lp")
>
> # use the survreg.distributions object. See ?survreg.distributions
> x_grid <- 1:400
> sur_curves <- sapply(
+ vals, function(x)
+ survreg.distributions[[telcosurvreg$dist]]$density(
+ (x - x_grid) / telcosurvreg$scale)[, 1])
>
> # plot with base R
> matplot(x_grid, sur_curves, type = "l", lty = 1)
Here is the result (a plot with one predicted survival curve per row of telco).
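The question asked for ggplot2 if possible; here is a minimal sketch of the same plot (assuming the x_grid and sur_curves objects computed above; the data frame and axis labels are just illustrative):
library(ggplot2)
# reshape the matrix of survival curves into long format: one row per (time, customer) pair
plot_dat <- data.frame(
  time     = rep(x_grid, times = ncol(sur_curves)),
  survival = as.vector(sur_curves),
  customer = factor(rep(seq_len(ncol(sur_curves)), each = length(x_grid))))
ggplot(plot_dat, aes(time, survival, colour = customer)) +
  geom_line() +
  labs(x = "Account length", y = "Predicted survival probability")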
I did a linear regression with two-tailed t-tests on 178 degrees of freedom. The summary function gives me two p-values for my two t-values.
t value Pr(>|t|)
5.06 1.04e-06 ***
10.09 < 2e-16 ***
...
...
F-statistic: 101.8 on 1 and 178 DF, p-value: < 2.2e-16
I want to calculate manually the p-value of the t-values with this formula:
p = 1 - 2*F(|t|)
p_value_1 <- 1 - 2 * pt(abs(t_1), 178)
p_value_2 <- 1 - 2 * pt(abs(t_2), 178)
I don't get the same p-values as in the model summary. Therefore, I want to know how the summary function Pr(>|t|) is different from my formula, as I can't find the definition of Pr(>|t|).
Can you help me? Thanks a lot!
It is
2 * pt(-abs(t_value), df)
For example:
2 * pt(-5.06, 178)
#[1] 1.038543e-06
2 * pt(-10.09, 178)
#[1] 3.223683e-19
Alternatively, use
2 * pt(abs(t_value), df, lower.tail = FALSE)
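As a quick self-contained check (a sketch using the built-in mtcars data rather than the asker's model), this reproduces the Pr(>|t|) column reported by summary.lm:
fit <- lm(mpg ~ wt, data = mtcars)
tval <- coef(summary(fit))[, "t value"]
dfres <- df.residual(fit)
# manual two-sided p-values versus those reported by summary()
cbind(manual = 2 * pt(abs(tval), dfres, lower.tail = FALSE),
      summary = coef(summary(fit))[, "Pr(>|t|)"])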
We can compute the p value Pr(>|t|) in the following different ways:
tval <- 5.06
df <- 178
# compute area under the t-pdf
integrate(function(x) dt(x, df), -Inf, -tval)$value + integrate(function(x) dt(x, df), tval, Inf)$value # sum of two areas
# [1] 1.038543e-06
1-integrate(function(x) dt(x, df), -tval, tval)$value
# [1] 1.038543e-06
# 2-sided t-test: Pr_T(|t|>|tval|) = 2*(1 - F_T(|tval|)) = 2*F_T(-|tval|), where T~t(df=178)
2*(1 - pt(tval, df))
# [1] 1.038543e-06
2*pt(tval, df, lower.tail = FALSE)
# [1] 1.038543e-06
1 - (pt(tval, df) - pt(-tval, df))
# [1] 1.038543e-06
2*pt(-tval, df)
# [1] 1.038543e-06
The following illustrates the same thing geometrically, with a different (less extreme) value of the t-statistic. As we can see, there are two (symmetric) blue regions that together represent the corresponding probability under the two-sided t-test.
df <- 178
x <- seq(-6, 6,0.01)
y <- dt(x, df)
tval <- 1.25
plot(x, y, type='l', main='t-distribution and p-value (5% significance level, 2-sided t-test)')
abline(h=0)
abline(v = c(tval, -tval), col='red')
index1 <- which(x >= -tval)[1]
index2 <- which(x >= tval)[1]
polygon(x = c(x[1:index1], x[index1], x[1]),
y = c(y[1:index1], 0, 0),
col = "blue")
polygon(x = c(x[index2], x[index2], x[index2:length(x)]),
y = c(0, y[index2], y[index2:length(y)]),
col = "blue")
I'm performing a cross-validation on a competing risks proportional hazards model. With help from the mstate package, I've prepared my data and am fitting it with survival::coxph. I get a fitted Cox model object for my training data, but I want to evaluate the partial likelihood of my trained coefficients on my test data.
If I need to, I'll write the partial likelihood function myself, but I'd rather not (though it would probably be good for me). The survival package calculates it in C code, but the likelihood calculation is embedded in the fitting function. Maybe there's a way to fix the parameters, or some other tool to easily get at the partial likelihood?
Minimal Working Example
# Adapted from examples in the mstate vignette
# http://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf
# beginning at the bottom of page 28
library(mstate)
library(survival)
# Get data. I add a second explanatory variable (badx) for illustration
# Also divide the data by subject into training and test sets.
data(aidssi)
si <- aidssi # Just a shorter name
si$badx <- sample(c("A", "B"), size = nrow(si), replace = TRUE)
si$fold <- sample(c("train", "test"), size = nrow(si), replace = TRUE, prob = c(0.7, 0.3))
tmat <- trans.comprisk(2, names = c("event-free", "AIDS", "SI"))
si$stat1 <- as.numeric(si$status == 1)
si$stat2 <- as.numeric(si$status == 2)
# Convert the data to a long competing risks format
silong <- msprep(time = c(NA, "time", "time"),
status = c(NA,"stat1", "stat2"),
data = si, keep = c("ccr5", "badx", "fold"), trans = tmat)
silong <- na.omit(silong)
silong <- expand.covs(silong, c("ccr5", "badx"))
train.dat <- subset(silong, fold == "train")
test.dat <- subset(silong, fold == "test")
Data looks like this:
> head(silong)
An object of class 'msdata'
Data:
id from to trans Tstart Tstop time status ccr5 badx fold ccr5WM.1 ccr5WM.2 badxB.1 badxB.2
1 1 1 2 1 0 9.106 9.106 1 WW A train 0 0 0 0
2 1 1 3 2 0 9.106 9.106 0 WW A train 0 0 0 0
3 2 1 2 1 0 11.039 11.039 0 WM B train 1 0 1 0
4 2 1 3 2 0 11.039 11.039 0 WM B train 0 1 0 1
5 3 1 2 1 0 2.234 2.234 1 WW B train 0 0 1 0
6 3 1 3 2 0 2.234 2.234 0 WW B train 0 0 0 1
Now, the ccr5 variable could be modeled as transition-specific, or as having an equal proportional effect across all transitions. The models are:
train.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
data = train.dat)
train.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
data = train.dat)
Now I would like to use the test data to evaluate the variable selection: should ccr5 be transition-specific or not? I have a large data set and many variables, mostly but not all categorical, that could go either way. The evaluation is where I'm stuck.
# We can fit the same models to the test data,
# this yields new parameter estimates of course,
# but the model matrices might be useful
test.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
data = test.dat)
test.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
data = test.dat)
test.eq.mm <- model.matrix(test.mod.equal)
test.sp.mm <- model.matrix(test.mod.specific)
# We can use these to get the first part of the sum of the partial likelihood:
xbeta.eq <- test.eq.mm[test.dat$status == 1, ] %*% coef(train.mod.equal)
xbeta.sp <- test.sp.mm[test.dat$status == 1, ] %*% coef(train.mod.specific)
# We can also get linear predictors
lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
lp.sp <- predict(train.mod.specific, newdata = test.dat, type = "lp")
I'm hoping to calculate the partial likelihood for each of the models on the test data with the training coefficient estimates. Maybe I should move the question to Cross Validated and ask if the sum of the linear predictors (or the sum of the linear predictors excluding censored cases) is close enough to an equivalent measure.
This is what I was proposing when I wrote: 'Can you calculate a "neo-model" (using the [new data]) with a formula that includes an offset [built with] beta estimates [from the original fit], and then use summary(mdl) to do the heavy lifting for you? You might even be able to calculate the offset with predict.coxph.' It turns out I don't need to use summary.coxph, since print.coxph gives the LLR statistic.
lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
eq.test.mod <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq),
data=test.dat )
eq.test.mod
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) +
offset(lp.eq), data = test.dat)
coef exp(coef) se(coef) z p
ccr5WM -0.20841 0.812 0.323 -0.6459 0.52
badxB -0.00829 0.992 0.235 -0.0354 0.97
Likelihood ratio test=0.44 on 2 df, p=0.804 n= 212, number of events= 74
I would interpret this to mean that a similar model, fit with the predictions based on the first model but with new data, was not significantly different (than a null model) and that on a log-likelihood scale, it was 0.44 "away" from an exact fit.
As pointed out by @Gregor, one can access the 'loglik' node of the coxph object, but I would advise against attaching too much meaning to the individual values. To get the LRT statistic one could run:
> diff(eq.test.mod$loglik)
[1] 0.399137
For interest sake, also look at the result without the offset:
> coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
+ data=test.dat)
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans),
data = test.dat)
coef exp(coef) se(coef) z p
ccr5WM -0.8618 0.422 0.323 -2.671 0.0076
badxB -0.0589 0.943 0.235 -0.251 0.8000
Likelihood ratio test=8.42 on 2 df, p=0.0148 n= 212, number of events= 74
And you do get the expected result when testing against the original data:
> lp.eq2 <- predict(train.mod.equal, newdata = train.dat, type = "lp")
> coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq2),
+ data=train.dat)
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) +
offset(lp.eq2), data = train.dat)
coef exp(coef) se(coef) z p
ccr5WM -4.67e-12 1 0.230 -2.03e-11 1
badxB 2.57e-14 1 0.168 1.53e-13 1
Likelihood ratio test=0 on 2 df, p=1 n= 436, number of events= 146
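If you want the test-set log partial likelihood at the trained coefficients itself (rather than the LRT against it), it can be read off the same offset fits: loglik[1] is the log partial likelihood with the new coefficients at their starting value of zero, i.e. evaluated purely at the offset. A sketch of how one might compare the two specifications this way (the pairing below is my own illustration, not part of the original answer):
# test-data log partial likelihood at the trained "equal effect" coefficients
eq.test.mod$loglik[1]
# same idea for the transition-specific specification, offsetting by its trained fit
lp.sp <- predict(train.mod.specific, newdata = test.dat, type = "lp")
sp.test.mod <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx +
                       strata(trans) + offset(lp.sp),
                     data = test.dat)
c(equal = eq.test.mod$loglik[1], specific = sp.test.mod$loglik[1])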
I need to fit a parametric PH model (so, not a Cox model) with time-varying covariates. Can we do that in R? I heard the survreg function cannot handle time-varying covariates. I've looked in vain for packages that could deal with that.
As @adibender writes, you can easily estimate a model with a constant baseline hazard using glm with the Poisson family and a log-time offset. Here is an example:
> # Input parameters
> n <- 100 # Number of individuals
> t_max <- 5 # max number of period per individual
> beta <- c(-1, 1) # true coefficient
>
> # Simulate data
> set.seed(47261114)
> sim_dat <- replicate(
+ n,
+ {
+ out <- data.frame(
+ tstart = rep(NA_integer_, t_max),
+ tstop = rep(NA_integer_, t_max),
+ event = rep(NA, t_max),
+ x = rnorm(t_max))
+
+ for(i in 1:t_max){
+ rate <- exp(beta %*% c(1, out$x[i]))
+ tstop <- min(rexp(1, rate), 1)
+ out[i, ] <- list(i - 1, i - (1 - tstop), tstop < 1, out$x[i])
+ if(out$event[i])
+ break
+ }
+ out[!is.na(out$tstart), ]
+ }, simplify = FALSE)
>
> sim_dat <- do.call(rbind, sim_dat)
> head(sim_dat) # show final data
tstart tstop event x
1 0 0.3018182 TRUE 0.7095841
2 0 0.6724803 TRUE 1.5152877
3 0 1.0000000 FALSE 0.1036868
4 1 2.0000000 FALSE -0.5214508
5 2 2.4831577 TRUE 1.0101403
6 0 1.0000000 FALSE 0.1437594
>
> # Fit with glm
> glm(event ~ x + offset(log(tstop - tstart)), sim_dat, family = poisson())
Call: glm(formula = event ~ x + offset(log(tstop - tstart)), family = poisson(),
data = sim_dat)
Coefficients:
(Intercept) x
-0.9053 0.9714
Degrees of Freedom: 248 Total (i.e. Null); 247 Residual
Null Deviance: 382.5
Residual Deviance: 306.4 AIC: 498.4
For other distributions, it seems the flexsurv package may do the job. See this post.
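For illustration, a hedged sketch with the simulated data from above, under the assumption that flexsurvreg accepts counting-process (start, stop) Surv objects for the chosen distribution:
library(flexsurv)
# assumption: counting-process Surv input is supported here; if not, the data
# would need to be restructured before fitting
fit_weib <- flexsurvreg(Surv(tstart, tstop, event) ~ x,
                        data = sim_dat, dist = "weibull")
fit_weib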