Delayed entry survival model. R vs Stata differences

I have some code in Stata that I'm trying to redo in R. I'm working on a delayed entry survival model and I want to limit the follow-up to 5 years. In Stata this is very easy and can be done as follows for example:
stset end, fail(failure) id(ID) origin(start) enter(entry) exit(time 5)
stcox var1
However, I'm having trouble recreating this in R. I've made a toy example limiting follow-up to 1000 days - here is the setup:
library(survival); library(foreign); library(rstpm2)
data(brcancer)
brcancer$start <- 0
# Make delayed entry time
brcancer$entry <- brcancer$rectime / 2
# Write to dta file for Stata
write.dta(brcancer, "brcancer.dta")
Ok so now we've set up an identical dataset for use in both R and Stata. Here is the Stata code and model result:
use "brcancer.dta", clear
stset rectime, fail(censrec) origin(start) enter(entry) exit(time 1000)
stcox hormon
And here is the R code and results:
# Limit follow-up to 1000 days
brcancer$limit <- ifelse(brcancer$rectime <1000, brcancer$rectime, 1000)
# Cox model
mod1 <- coxph(Surv(time=entry, time2= limit, event = censrec) ~ hormon, data=brcancer, ties = "breslow")
summary(mod1)
As you can see, the R estimates and Stata estimates differ slightly, and I cannot figure out why. Have I set up the R model incorrectly to match Stata, or is there another reason the results differ?

Since the methods match on an available dataset after recoding the deaths that occur after the termination date, I'm posting the relevant sections of my comment as an answer.
I also think that you should have changed any deaths at times greater than 1000 to be considered censored. (Notice that the number of events is quite different in the two sets of results.)
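A minimal sketch of that recoding, using a small made-up data frame (toy) in place of brcancer so that it runs with only the survival package; the names toy, cutoff, event and toy_in are illustrative. It assumes that an event at exactly the cutoff still counts, and that subjects whose entry time is at or beyond the cutoff contribute nothing, which is my reading of what exit(time 1000) does in Stata:

```r
library(survival)

# Hypothetical stand-in for brcancer (same column names, invented values)
toy <- data.frame(
  rectime = c(300, 1500, 800, 2200, 950, 1200, 400, 1800),
  censrec = c(1, 1, 0, 1, 1, 0, 1, 1),
  hormon  = c(0, 1, 0, 1, 1, 0, 1, 0)
)
toy$entry <- toy$rectime / 2          # delayed entry, as in the question

cutoff <- 1000
toy$limit <- pmin(toy$rectime, cutoff)                       # truncate follow-up
toy$event <- ifelse(toy$rectime > cutoff, 0, toy$censrec)    # censor late events

# Subjects who enter at or after the cutoff have no time at risk
toy_in <- subset(toy, entry < limit)

mod <- coxph(Surv(entry, limit, event) ~ hormon, data = toy_in, ties = "breslow")
```

With brcancer the same recoding applies; the key change from the question's code is passing the recoded event rather than censrec to Surv().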

Related

Syntax for survival analysis with late-entry

I am trying to fit a survival model with left-truncated data using the survival package; however, I am unsure of the correct syntax.
Let's say we are measuring the effect of age when hired (age) and job type (parttime) on the duration of employment of doctors in public health clinics. Whether the doctor quit or was censored is indicated by the censor variable (0 for quitting, 1 for censoring). This behaviour was measured in an 18-month window. Time to either quitting or censoring is indicated by two variables, entry (start time) and exit (stop time), indicating how long, in years, the doctor was employed at the clinic. If doctors commenced employment after the window 'opened', their entry time is set to 0. If they commenced employment prior to the window 'opening', their entry time represents how long they had already been employed in that position when the window 'opened', and their exit time is how long after they were initially hired they either quit or were censored by the window 'closing'. We also postulate a two-way interaction between age and duration of employment (exit).
This is the toy data set. It is much smaller than a normal dataset would be, so the estimates themselves are not as important as whether the syntax and variables included (using the survival package in R) are correct, given the structure of the data. The toy data has the exact same structure as a dataset discussed in Chapter 15 of Singer and Willett's Applied Longitudinal Data Analysis. I have tried to match the results they report, without success. There is not a lot of explicit information online about how to conduct survival analyses on left-truncated data in R, and the website that provides code for the book (here) does not provide R code for the chapter in question. The methods for modeling time-varying covariates and interaction effects are quite complex in R, and I just wonder if I am missing something important.
Here is the toy data
id <- 1:40
entry <- c(2.3,2.5,2.5,1.2,3.5,3.1,2.5,2.5,1.5,2.5,1.4,1.6,3.5,1.5,2.5,2.5,3.5,2.5,2.5,0.5,rep(0,20))
exit <- c(5.0,5.2,5.2,3.9,4.0,3.6,4.0,3.0,4.2,4.0,2.9,4.3,6.2,4.2,3.0,3.9,4.1,4.0,3.0,2.0,0.2,1.2,0.6,1.9,1.7,1.1,0.2,2.2,0.8,1.9,1.2,2.3,2.2,0.2,1.7,1.0,0.6,0.2,1.1,1.3)
censor <- c(1,1,1,1,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,rep(1,20))
parttime <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0)
age <- c(34,28,29,38,33,33,32,28,40,30,29,34,31,33,28,29,29,31,29,29,30,37,33,38,34,37,37,40,29,38 ,49,32,30,27,35,34,35,30,35,34)
doctors <- data.frame(id,entry,exit,censor,parttime,age)
Now for the model.
coxph(Surv(entry, exit, 1-censor) ~ parttime + age + age:exit, data = doctors)
Is this the correct way to specify the model given the structure of the data and what we want to know? An answer here suggests it is correct, but I am not sure whether, for example, the interaction variable is correctly specified.
As is often the case, it's not until I post a question about a problem on SO that I work out how to do it myself. If there is an interaction between a predictor and time, we need to convert the dataset into counting-process, person-period format (i.e. a long form). This is because each participant needs an interval that tracks their status with respect to the event for every time point at which the event occurred to anyone else in the data set, up to the point when they exited the study.
First let's make an event variable
doctors$event <- 1 - doctors$censor
Before we run the Cox model we need to use the survSplit function in the survival package. To do this we need to make a vector of all the time points when an event occurred:
cutPoints <- sort(unique(doctors$exit[doctors$event == 1]))
Now we can pass this into the survSplit function to create a new dataset...
docNew <- survSplit(Surv(entry, exit, event) ~ .,
                    data = doctors,
                    cut = cutPoints,
                    end = "exit")
... which we then run our model on
coxph(Surv(entry,exit,event) ~ parttime + age + age:exit, data = docNew)
Voila!
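To see what the split does to delayed-entry records, here is a small sketch on a made-up four-person data frame in the same layout as doctors (mini and miniLong are illustrative names, not part of the original data). It relies on survSplit only cutting strictly inside each person's (entry, exit] interval:

```r
library(survival)

# Hypothetical four-person subset in the same layout as `doctors`
mini <- data.frame(
  id    = 1:4,
  entry = c(2.5, 0.5, 0.0, 0.0),
  exit  = c(4.0, 2.0, 1.2, 2.2),
  event = c(1, 1, 0, 0),
  age   = c(32, 29, 37, 40)
)

# sort() returns the event times themselves (order() would return positions)
cutPoints <- sort(unique(mini$exit[mini$event == 1]))   # 2 and 4

miniLong <- survSplit(Surv(entry, exit, event) ~ ., data = mini,
                      cut = cutPoints, end = "exit")

# Inspect the counting-process rows
miniLong[, c("id", "entry", "exit", "event", "age")]
```

Only person 4 is split (at time 2), because that is the only cut point falling strictly between an entry and an exit time. The late entrants keep their original entry times, so nobody is counted as at risk before they came under observation.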

Count-process datasets for Non-proportional Hazard (Cox) models with interaction variables

I am trying to run a nonproportional Cox regression model featuring an interaction-with-time variable, as described in Chapter 15 (section 15.3) of Applied Longitudinal Data Analysis by Singer and Willett. However, I cannot seem to get answers that agree with the book.
The data used in this book and the source code are supplied at this fantastic website. Unfortunately, no R code is supplied for the final chapter, and the R dataset supplied for the example discussed in the text is incomplete and gives incorrect answers even for the simplest model (which I do know how to run). Instead, to obtain the complete dataset for this example, one must click the 'Download' link in the 'SAS' column (which has the correct dataset) and then, after installing the haven package (which allows one to read in foreign data formats), read in the dataset in question via:
los <- haven::read_sas("alda/lengthofstay.sas7bdat")
This dataset indicates participants' (variable ID) length of stay (variable DAYS) in inpatient treatment in a hospital. The censoring variable is CENSOR. The researchers hypothesised that two different types of treatment (binary variable TREAT) would predict different hazards of checking out of treatment. In addition, they anticipated that the between-group difference in hazard would not be constant over time, therefore requiring the creation of an interaction term. I can get the simple main effect model to work, returning the same hazard coefficients reported in the book (which is how I eventually found out the .csv file supplied with the R code was incomplete).
summary(modA <- coxph(Surv(DAYS,1-CENSOR) ~ TREAT, data = los))
coef exp(coef) se(coef) z Pr(>|z|)
TREAT 0.1457 1.1568 0.1541 0.945 0.345
I tried to follow the procedure laid out here, and here, and the sources listed therein (e.g. Therneau vignette on time-varying covariates in the survival package), and, of course, when I am copy-pasting someone else's code and running that it all works fine. But I am trying to do this for myself from scratch with a dataset whose results I can compare against mine. And I just can't make it work.
First I created an EVENT variable
los$EVENT <- 1 - los$CENSOR
There is a duplicate ID number in the dataset that causes issues, so we have to change it to a new ID number
los$ID[which(duplicated(los$ID))] <- 842
Now, based on what I read here and here the dataframe needs to be split so that, for every participant, there is one row indicating the EVENT status at every point prior to their event (or censorship) time when any other participant experienced an event. Therefore we need to create a vector of all the unique event times, then split the dataset on those event times
cutPoints <- sort(unique(los$DAYS[los$EVENT == 1]))
# now split the dataset
longLOS <- survSplit(Surv(DAYS,EVENT)~ ., data = los, cut = cutPoints)
# and (just because I'm anal) rename the interval upper bound column (formerly "DAYS")
names(longLOS)[5] <- "tstop"
When I looked at this dataset it appeared to be what I was after, with (1) as many rows for each participant as there are intervals prior to their event time when anyone else in the dataset experienced an event, (2) two columns indicating the lower and upper bounds of each interval, and (3) an event column with a 0 for all rows when the respondent did not experience the event, and a 1 in the final row if they did experience the event (participants who were censored have a 0 in every row, including the last).
Next I created the interaction-with-time variable, subtracting 1 from the 'interval upper bound' column so that main effect of TREAT represents the treatment effect on the first day of hospitalisation.
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
And ran the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
But it doesn't work! I got the (fairly unhelpful) error message
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
routine failed due to numeric overflow.This should never happen. Please contact the author.
What am I doing wrong? I have been slowly working through Singer and Willett for almost three years (I started while still a grad student), and now the final chapter is proving to be by far my greatest challenge. I have thirty pages to go; any help would be incredibly appreciated.
I figured out what I was doing wrong. It was a stupid error when I created the interaction variable TREATINT. Instead of
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
it should have been
longLOS$TREATINT <- longLOS$TREAT*(longLOS$tstop - 1)
Now when you run the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
Not only does it work, it yields coefficients that match those reported in the Singer and Willett book.
coef exp(coef) se(coef) z Pr(>|z|)
TREAT 0.706411 2.026705 0.292404 2.416 0.0157
TREATINT -0.020833 0.979383 0.009207 -2.263 0.0237
Given how dumb my mistake was I was tempted to just delete this whole post but I think I'll leave it up for others like me who want to know how to do interaction with time Cox models in R.
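For comparison, coxph can also build this kind of interaction internally through its tt argument, which avoids creating the long dataset by hand. The sketch below is not from the book or from the post above: it uses the lung data shipped with survival as a stand-in (the length-of-stay file is not bundled with R), and the names d, female, femaleInt, fit_split and fit_tt are illustrative. Under those assumptions the two routes should give the same coefficients up to numerical tolerance:

```r
library(survival)

# Stand-in data: lung ships with survival; status == 2 marks a death
d <- data.frame(time   = lung$time,
                event  = as.integer(lung$status == 2),
                female = as.integer(lung$sex == 2))

# Route 1: split at every event time, then build the interaction by hand
cuts <- sort(unique(d$time[d$event == 1]))
long <- survSplit(Surv(time, event) ~ ., data = d, cut = cuts)
long$femaleInt <- long$female * (long$time - 1)
fit_split <- coxph(Surv(tstart, time, event) ~ female + femaleInt, data = long)

# Route 2: let coxph evaluate the interaction at each event time via tt()
fit_tt <- coxph(Surv(time, event) ~ female + tt(female), data = d,
                tt = function(x, t, ...) x * (t - 1))

# Compare the two sets of coefficients side by side
rbind(split = unname(coef(fit_split)), tt = unname(coef(fit_tt)))
```

The tt() route keeps the data in one-row-per-person form, at the cost of making the time transformation less visible than an explicit column such as TREATINT.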

"Simulating" a large number of regressions with different predictor values

Let's say I have the following data and I'm interested in examining some counterfactuals. In particular, I want to examine whether there would be changes in predicted income given a change in a predictor (for example, gpa). The best way I can think to do this is to write a loop that runs this regression 1:n. However, how do I also make adjustments to the data frame while running through the loop? I'm really hoping that there is a base R function or something in a package that someone can point me to.
df = data.frame(year=c(2000,2001,2002,2003,2004,2005,2006,2007,2009,2010),
                income=c(100,50,70,80,50,40,60,100,90,80),
                age=c(26,30,35,30,28,29,31,34,20,35),
                gpa=c(2.8,3.5,3.9,4.0,2.1,2.65,2.9,3.2,3.3,3.1))
df
mod = lm(income ~ age + gpa, data=df)
summary(mod)
Here are some counterfactuals that may be worth considering when looking at the relationship between age, gpa, and income.
# What if everyone in the class had a lower/higher gpa?
df$gpa2 = df$gpa + 0.55
# what if one person had a lower/higher gpa?
df$gpa2[3] = 1.6
# what if the most recent employee/person had a lower/higher gpa?
df[10,4] = 4.0
With or without looping, what would be the best way to "simulate" a large (1000+) number of regression models in order to examine various counterfactuals, and then save those results in some data structure? Is there a "counterfactual" analysis package which could save me a bit of work?
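One possible approach, sketched here under the assumption that the model is fitted once on the observed data and only the predictor values change between scenarios: store each counterfactual as a small function that returns a modified copy of the data, then call predict() on each copy. The scenarios list and its element names are made up for illustration.

```r
df <- data.frame(
  income = c(100, 50, 70, 80, 50, 40, 60, 100, 90, 80),
  age    = c(26, 30, 35, 30, 28, 29, 31, 34, 20, 35),
  gpa    = c(2.8, 3.5, 3.9, 4.0, 2.1, 2.65, 2.9, 3.2, 3.3, 3.1)
)
mod <- lm(income ~ age + gpa, data = df)

# Hypothetical scenarios: each function returns a modified copy of the data
scenarios <- list(
  observed      = function(d) d,
  all_gpa_up    = function(d) transform(d, gpa = gpa + 0.55),
  third_gpa_low = function(d) { d$gpa[3] <- 1.6; d },
  last_gpa_high = function(d) { d$gpa[nrow(d)] <- 4.0; d }
)

# One column of predicted income per scenario, from the single fitted model
preds <- sapply(scenarios, function(f) predict(mod, newdata = f(df)))
colMeans(preds)
```

Because the original df is never overwritten, the list can grow to hundreds of scenarios without the bookkeeping of extra columns such as gpa2. If the aim is instead to refit the regression on each altered dataset, the same list can be passed through lm() rather than predict().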

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
1. make a Surv object (time variable with indication as to whether each observation was censored);
2. fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
3. calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
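Wrapping this in a function, as mentioned in the question, might look like the sketch below. The helper name logrank_from_fit is made up, and the sketch assumes the survfit call used a data argument so that the same data frame can be handed to survdiff:

```r
library(survival)

# Fit once, with the variables looked up in `lung` via the data argument
lung.survfit <- survfit(Surv(time, status) ~ sex, data = lung)

# Hypothetical helper: log-rank test using the formula of an existing fit
logrank_from_fit <- function(fit, data) {
  survdiff(formula(fit), data = data)
}

lung.survdiff <- logrank_from_fit(lung.survfit, lung)
print(lung.survdiff)
```

Passing data explicitly keeps the formula free of lung$ prefixes, so the curves and the test are guaranteed to use the same model specification.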

Cox regression in MATLAB

I know there is a coxphfit function in MATLAB to do Cox regression, but I have problems understanding how to apply it.
1) How do I compare two groups of samples with survival data in days (survdays), censoring (cens) and some predictor value (x)? The groups are defined by the logical variable groups. The groups have different numbers of samples.
2) What is the baseline parameter in coxphfit? I did read the docs, but how should I choose the baseline properly?
It would be great if you know a site with good detailed examples on medical survival data. I found only the MathWorks demo, which does not even mention coxphfit.
Do you perhaps know of another third-party function for Cox regression?
UPDATE: The r tag added since the answer I've got is for R.
With survival analysis, the hazard function is the instantaneous death rate.
In these analyses, you are typically measuring what effect something has on this hazard function. For example, you may ask "does swallowing arsenic increase the rate at which people die?". A background hazard is the level at which people would die anyway (without swallowing arsenic, in this case).
If you read the docs for coxphfit carefully, you will notice that that function tries to calculate the baseline hazard; it is not something that you enter.
baseline: The X values at which to compute the baseline hazard.
EDIT: MATLAB's coxphfit function doesn't obviously work with grouped data. If you are happy to switch to R, then the analysis is a one-liner.
library(survival)
# Create some data
n <- 20
dfr <- data.frame(
  survdays = runif(n, 5, 15),
  cens = runif(n) < .3,
  x = rlnorm(n),
  groups = rep(c("first", "second"), each = n / 2)
)
# The Cox PH analysis
summary(coxph(Surv(survdays, cens) ~ x / groups, dfr))
ANOTHER EDIT: That baseline parameter to MATLAB's coxphfit appears to be a normalising constant. R's coxph function doesn't have an equivalent parameter. I looked in Statistical Computing by Michael Crawley and it seems to suggest that the baseline hazard isn't important, since it cancels out when you calculate the likelihood of your individual dying. See Chapter 33, and p615-616 in particular. My knowledge of how the model works isn't deep enough to explain the discrepancy in the MATLAB and R implementations; perhaps you could ask on the Stack Exchange Stats Analysis site.
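The cancellation mentioned above can be written out explicitly. This is the standard Cox partial likelihood, stated here for the case of no tied event times; the symbols (h_0 for the baseline hazard, \delta_i for the event indicator, R(t_i) for the set of subjects still at risk at time t_i) are notation introduced for this sketch rather than taken from either package's documentation:

```latex
% Proportional hazards model for subject i with covariates x_i
h(t \mid x_i) = h_0(t)\, \exp(x_i^\top \beta)

% Partial likelihood: each observed event contributes the probability that
% subject i fails at t_i, given that someone in the risk set R(t_i) fails then
L(\beta)
  = \prod_{i:\, \delta_i = 1}
    \frac{h_0(t_i)\, \exp(x_i^\top \beta)}
         {\sum_{j \in R(t_i)} h_0(t_i)\, \exp(x_j^\top \beta)}
  = \prod_{i:\, \delta_i = 1}
    \frac{\exp(x_i^\top \beta)}
         {\sum_{j \in R(t_i)} \exp(x_j^\top \beta)}
```

Since h_0(t_i) appears in both numerator and denominator, it drops out, so the coefficient estimates do not depend on any baseline value supplied by the user.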

Resources