This is a dataset were week is nested in period, this gets problematic when I want to see pairwise comparisons between Diet and week. What does the error "Try taking nested factors out of 'by'." mean?
form <- as.formula(paste(colnames(df)[8],'~ Diet + period +week*Diet +(1|id)')) #get data for interactions
dflmer <- lmer(form, data=df)
a <- Anova(dflmer, type=3)
library(emmeans)
emm <- emmeans(dflmer, pairwise ~ Diet | week)
NOTE: A nesting structure was detected in the fitted model:
week %in% period
Note: Grouping factor(s) for 'week' have been added to the 'by' list.
Error in .nested_contrast(rgobj = object, method = method, by = by, adjust = adjust, :
There are no factor levels left to contrast. Try taking nested factors out of 'by'.
Since week is nested in period, you can't condition on week without also conditioning on period. Try
emmeans(dflmer , pairwise ~ Diet | period:week)
The very latest version, 1.46, of emmeans fixes this, in that older versions did not consider the possibility of nesting in by variables.
Addendum
I think I'm remembering some details wrong. The code that generates this error message was misplaced in versions <= 1.4.5. I think you may need to install version 1.4.6 to get this to work. See the related issue report
Addendum 2
I constructed a similar example, and I got errors from this model still. The problem is that week is nested in period, and the model has Diet crossed with week but not with period, which doesn't make sense. I was able to get results after I fitted the model with fixed-effect terms Diet*(period + week)
I am trying to fit a survival model with left-truncated data using the survival package however I am unsure of the correct syntax.
Let's say we are measuring the effect of age at when hired (age) and job type (parttime) on duration of employment of doctors in public health clinics. Whether the doctor quit or was censored is indicated by the censor variable (0 for quittting, 1 for censoring). This behaviour was measured in an 18-month window. Time to either quit or censoring is indicated by two variables, entry (start time) and exit(stop time) indicating how long, in years, the doctor was employed at the clinic. If doctors commenced employment after the window 'opened' their entry time is set to 0. If they commenced employment prior to the window 'opening' their entry time represents how long they had already been employed in that position when the window 'opened', and their exit time is how long from when they were initially hired they either quit or were censored by the window 'closing'. We also postulate a two-way interaction between age and duration of employment (exit).
This is the toy data set. It is much smaller than a normal dataset would be, so the estimates themselves are not as important as whether the syntax and variables included (using the survival package in R) are correct, given the structure of the data. The toy data has the exact same structure as a dataset discussed in Chapter 15 of Singer and Willet's Applied Longitudinal Data Analysis. I have tried to match the results they report, without success. There is not a lot of explicit information online how to conduct survival analyses on left-truncated data in R, and the website that provides code for the book (here) does not provide R code for the chapter in question. The methods for modeling time-varying covariates and interaction effects are quite complex in R and I just wonder if I am missing something important.
Here is the toy data
id <- 1:40
entry <- c(2.3,2.5,2.5,1.2,3.5,3.1,2.5,2.5,1.5,2.5,1.4,1.6,3.5,1.5,2.5,2.5,3.5,2.5,2.5,0.5,rep(0,20))
exit <- c(5.0,5.2,5.2,3.9,4.0,3.6,4.0,3.0,4.2,4.0,2.9,4.3,6.2,4.2,3.0,3.9,4.1,4.0,3.0,2.0,0.2,1.2,0.6,1.9,1.7,1.1,0.2,2.2,0.8,1.9,1.2,2.3,2.2,0.2,1.7,1.0,0.6,0.2,1.1,1.3)
censor <- c(1,1,1,1,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,rep(1,20))
parttime <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0)
age <- c(34,28,29,38,33,33,32,28,40,30,29,34,31,33,28,29,29,31,29,29,30,37,33,38,34,37,37,40,29,38 ,49,32,30,27,35,34,35,30,35,34)
doctors <- data.frame(id,entry,exit,censor,parttime,age)
Now for the model.
coxph(Surv(entry, exit, 1-censor) ~ parttime + age + age:exit, data = doctors)
Is this the correct way to specify the model given the structure of the data and what we want to know? An answer here suggests it is correct, but I am not sure whether, for example, the interaction variable is correctly specified.
As is often the case, it's not until I post a question about a problem on SO that I work out how to do it myself. If there is an interaction with time predictor we need to convert the dataset into a count process, person period format (i.e. a long form). This is because each participant needs an interval that tracks their status with respect to the event for every time point that the event occurred to anyone else in the data set, up to the point when they exited the study.
First let's make an event variable
doctors$event <- 1 - doctors$censor
Before we run the cox model we need to use the survSplit function in the survival package. To do this we need to make a vector of all the time points when an event occurred
cutPoints <- order(unique(doctors$exit[doctors$event == 1]))
Now we can pass this into the survSplit function to create a new dataset...
docNew <- survSplit(Surv(entry, exit, event)~.,
data = doctors,
cut = cutPoints,
end = "exit")
... which we then run our model on
coxph(Surv(entry,exit,event) ~ parttime + age + age:exit, data = docNew)
Voila!
I have some code in Stata that I'm trying to redo in R. I'm working on a delayed entry survival model and I want to limit the follow-up to 5 years. In Stata this is very easy and can be done as follows for example:
stset end, fail(failure) id(ID) origin(start) enter(entry) exit(time 5)
stcox var1
However, I'm having trouble recreating this in R. I've made a toy example limiting follow-up to 1000 days - here is the setup:
library(survival); library(foreign); library(rstpm2)
data(brcancer)
brcancer$start <- 0
# Make delayed entry time
brcancer$entry <- brcancer$rectime / 2
# Write to dta file for Stata
write.dta(brcancer, "brcancer.dta")
Ok so now we've set up an identical dataset for use in both R and Stata. Here is the Stata bit code and model result:
use "brcancer.dta", clear
stset rectime, fail(censrec) origin(start) enter(entry) exit(time 1000)
stcox hormon
And here is the R code and results:
# Limit follow-up to 1000 days
brcancer$limit <- ifelse(brcancer$rectime <1000, brcancer$rectime, 1000)
# Cox model
mod1 <- coxph(Surv(time=entry, time2= limit, event = censrec) ~ hormon, data=brcancer, ties = "breslow")
summary(mod1)
As you can see the R estimates and State estimates differ slightly, and I cannot figure out why. Have I set up the R model incorrectly to match Stata, or is there another reason the results differ?
Since the methods match on an avaialble dataset after recoding the deaths that occur after to termination date, I'm posting the relevant sections of my comment as an answer.
I also think that you should have changed any of the deaths at time greater than 1000 to be considered censored. (Notice that the numbers of events is quite different in the two sets of results.
I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.
Most descriptions I can find about the function are as follows:
"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "
I know what needs to be on the left of the operator, the issue is what the function expects from the right-hand side.
Here is a link of what my data looks like (The actual data set is much larger, I'm only displaying the first 20 data points for brevity):
Short explanation of data:
-Row 1 is the header
-Each row after that is a separate patient
-The first column is the age of the patient at the time of the study
-columns 2 through 14 (headed by x2-x13), and 19 (x18) and 20 (x19) are covariates such as race, relationship status, medical conditions that take on either true (1) or false (0) values.
-columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole number values greater than 0.
-The second to last column "sur" is the number of months survived, and "index" is whether or not that is a right-censored time (1 for true, 0 for false).
Given this data I need to plot a Cox Proportional hazard curve, but I end up with an incorrect plot because the right hand side of the formula object is wrong.
Here is my code, "temp4" is the name I gave to the data table:
library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))
I should also note that I have tried replacing the right hand side that you're seeing with the number 1, a period, leaving it blank. These methods produce a kaplan-meier curve.
The following is the console output:
Each new line is an example of the error produced depending on how I filter the data. (ie if I only include patients with ages greater than 85, etc.)
If someone could explain how it works, it would be greatly appreciated.
PS- I have searched for over a week to my solution, and I am asking for help here as a last resort.
You should not be using the prefix temp$ if you are also using a data argument. The whole purpose of supplying a data argument is to allow dropping those in the formula.
seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)
The above would use all of the x-variables in your temp data.frame. This will use just the first 3:
seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
Exactly what the warnings signify depends on the data (as you have in one sense already exemplified by producing different sorts of collinearity with different subsets.) If you have collinear columns, then you get singularities in the inversion of the model matrix and the software will attempt to drop aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table calls is often informative.
Bottom line: This is not a problem with your formula construction, so much as it is a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories such as diagnostic categories or comorbities?
I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!