How to do survival analysis in R with time-varying exposure to an intervention, using Surv and coxph?

I have survival data in this format, with a time-varying exposure to Intervention:
ID start stop status Intervention
1 2 14 0 0
2 2 5 0 0
3 2 3 0 0
3 3 10 1 1
4 5 8 0 0
5 6 10 0 0
For example, for patient ID #3: from day 2 to day 3, the patient has not yet received the intervention (Intervention = 0), but starting on day 3 and lasting until day 10 (when the patient dies), the patient has received the intervention (Intervention = 1).
I thought that I could then estimate the time-varying effect of exposure in the following manner:
coxph(Surv(start, stop, status) ~ Intervention + cluster(ID), data = df.td)
However, I recently found that this method is not correct for right-censored data (see: Two different results from coxph in R, using same stop and start times, why?). Yet most basic guides to time-dependent survival analysis use a line like this (for example, https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html).
Is this method correct for estimating the effect of Intervention on outcome, given the structure of the data?
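For reference, the survival package's tmerge() is the standard tool for building exactly this counting-process structure before fitting. Below is a minimal sketch that reconstructs the sample data above; the one-row-per-patient input (with the assumed column names entry, futime, death and int_day) is hypothetical:

library(survival)

# Hypothetical one-row-per-patient source data matching the table above:
# entry/futime = follow-up window, death = status at futime,
# int_day = day the intervention started (NA if never).
base <- data.frame(ID      = 1:5,
                   entry   = c(2, 2, 2, 5, 6),
                   futime  = c(14, 5, 10, 8, 10),
                   death   = c(0, 0, 1, 0, 0),
                   int_day = c(NA, NA, 3, NA, NA))

# First call establishes each patient's (entry, futime] time range;
# second call adds the event and the 0/1 time-dependent covariate.
df.td <- tmerge(base, base, id = ID, tstart = entry, tstop = futime)
df.td <- tmerge(df.td, base, id = ID,
                status = event(futime, death),
                Intervention = tdc(int_day))

fit <- coxph(Surv(tstart, tstop, status) ~ Intervention, data = df.td)

Note that the survival package's time-dependent covariates vignette fits this model without a cluster(ID) term; whether the robust variance from cluster(ID) belongs in the call is exactly what the linked question discusses.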

Related

Survival analysis time dependent input data

I am fairly new to survival analysis and I apologize if this is a trivial question, but I wasn't able to find any solution to my problem.
I'm trying to find a good model for predicting whether and when a contract for a specific product (identified by the ID column) will be bought, i.e. a time-to-event prediction. I am mostly interested in the probability that the event will occur within 3 months. However, my data is essentially a monthly time series. A sample of the dataset looks something like this:
ID Time    Number of assistance calls Number of product malfunctions Time to fix Contract bought
1  2012-01 0                          0                              NA          0
1  2012-02 3                          1                              37.124      0
1  2012-03 2                          0                              NA          0
1  2012-04 0                          0                              NA          1
2  2012-03 1                          0                              NA          0
2  2012-04 0                          0                              NA          0
Here's what I struggle with. I could use a survival analysis model that can deal with time-dependent variables, e.g. a Cox proportional hazards model, but in that case it wouldn't be able to predict [1]. I could also summarize the data for each ID, but that would mean losing some of the information contained in the data, e.g. whether a malfunction occurred 1, 2 or 3 months before the event.
Is there a better way to approach this?
Thank you very much for any tips!
Sources:
[1] https://www.annualreviews.org/doi/10.1146/annurev.publhealth.20.1.145
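
One standard way to keep the monthly information and still get a 3-month probability is a discrete-time hazard model: each monthly row is already a person-period, so a logistic regression on those rows estimates the monthly hazard, and time-varying predictors enter directly. A rough sketch, where calls, malfunctions and bought are stand-in names for the columns shown above:

library(dplyr)

# dat: one row per ID and month, with bought = 1 in the month the
# contract is bought and 0 otherwise.
dat <- dat %>%
  arrange(ID, Time) %>%
  group_by(ID) %>%
  mutate(month = row_number()) %>%   # months since start of observation
  ungroup()

# Monthly hazard of buying, given time on file and the covariates.
fit <- glm(bought ~ factor(month) + calls + malfunctions,
           family = binomial, data = dat)

# Probability of buying within the next 3 months for a given ID:
# 1 - prod(1 - h), where h are the predicted hazards for those 3 months
# (predict(fit, newdata, type = "response") on the future person-period rows).

This avoids both problems mentioned above: nothing is summarized away, and prediction reduces to scoring future person-period rows.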

How can I specify the carryover in the first period of a two-treatment three-period crossover study (ABB/BAA)?

I have found a lot of information on how to analyze a 2×2 (AB/BA) crossover trial; however, there is less material on how to disentangle the carryover effect when the study is designed with three periods and two sequences (ABB/BAA). It is worth mentioning that A and B are the treatments and there were washout phases between the three periods.
As sample data, I would like to use the bioequivalence data from "daewr" library.
library("daewr")
data(bioequiv)
head(bioequiv)
Group Subject Period Treat Carry y
1 1 2 1 A none 112.25
2 1 2 2 B A 106.36
3 1 2 3 B B 88.59
4 1 3 1 A none 153.71
5 1 3 2 B A 150.13
6 1 3 3 B B 151.31
The variable Carry contains lagged information from the previous period's Treatment.
The model below should be able to disentangle the effects, but I don't know how to handle the none entries in the Carry column: I am not sure how to specify them in the model, or how to check whether the carryover effect is negligible. If the none entries are left as they are, the model below runs into multicollinearity.
library(lme4)
fit <- lmer(y ~ Period + Treat + Carry + (1 | Subject), data = bioequiv)
anova(fit)
summary(fit)
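
One common way to check whether the carryover effect is negligible is a likelihood-ratio test between the models with and without Carry, keeping none as the reference level of the factor so that the first-period rows need no replacement. A sketch:

library(daewr)
library(lme4)
data(bioequiv)

# Make "none" the reference level: the carryover contrasts for A and B
# are then estimated relative to no carryover, and period 1 stays as-is.
bioequiv$Carry <- relevel(factor(bioequiv$Carry), ref = "none")

# REML = FALSE because likelihood-ratio tests of fixed effects
# are not valid under REML.
fit0 <- lmer(y ~ Period + Treat + (1 | Subject), bioequiv, REML = FALSE)
fit1 <- lmer(y ~ Period + Treat + Carry + (1 | Subject), bioequiv, REML = FALSE)
anova(fit0, fit1)  # LRT: does adding Carry improve the fit?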

Creating time-varying covariates when time is different for each observation

I'm attempting to conduct survival analysis with time-varying covariates. The data comes from a longitudinal survey that is administered every two years, and currently looks something like this:
id event1yr event2yr income14 income16 income18 income20
1  2014     2020     8        10       13       8
2  2018     NA       13       15       24       35
In my study, time (t_0) begins at event1yr and is measured from that variable, which obviously differs for each observation. So, for instance, the time to event for observation 1 is 6 years, whereas observation 2 is right-censored at 2 years. The main issue is pulling covariate data from different time points, since the start year differs: income for years 0-2 (exclusive) for observation 1 comes from income14, but income for years 0-2 for observation 2 comes from income18. In the end, I'd like my data to look something like this:
id st.time end.time event2 censor inc
1 0 2 0 0 8
1 2 4 0 0 10
1 4 6 1 0 13
2 0 2 0 1 24
Thus, I'm trying to work out the best way to code this, accounting for the fact that the data must be pulled from different columns because the reference start time is not constant across observations.
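
A sketch of one way to do this reshaping with tidyr/dplyr, under the assumptions that surveys run every two years from 2014 to 2020 (so income14 is the 2014 wave) and that follow-up ends with the 2020 wave; the toy data frame below is hypothetical:

library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 2),
                 event1yr = c(2014, 2018), event2yr = c(2020, NA),
                 income14 = c(8, 13), income16 = c(10, 15),
                 income18 = c(13, 24), income20 = c(8, 35))

last_wave <- 2020  # assumed end of follow-up

long <- df %>%
  pivot_longer(starts_with("income"), names_prefix = "income",
               names_to = "wave", values_to = "inc") %>%
  mutate(wave     = 2000 + as.numeric(wave),
         st.time  = wave - event1yr,            # years since event1yr
         end.time = st.time + 2,
         censor   = as.numeric(is.na(event2yr)),
         event2   = as.numeric(!is.na(event2yr) & wave + 2 >= event2yr)) %>%
  filter(st.time >= 0,                          # drop waves before event1yr
         wave < last_wave,                      # no follow-up past the last survey
         is.na(event2yr) | wave < event2yr)     # drop waves after the event

This reproduces the target rows above: each remaining wave contributes one interval, with the income pulled from whichever column matches event1yr plus the elapsed time.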

How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals into 3 categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals into 3 groups based on these 30 variables.
Problem: despite extensive research, I could not find out how to extract the loading factors from PCA_loadings and use them to give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual for further classification. Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading factor, and summing them up wherever the corresponding binary variable == 1 (though I am not sure how to proceed with the continuous variables), which would assign each individual a score. However, I do not know how to combine the 30 loading-factor values into a single score per individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
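To get from PC1 to the rank and the three categories you describe, one option is to rank on PC1 and cut at the tertiles (the tertile cut-points are an assumption, not the only choice):

# Rank individuals on PC1, then cut into three equal-sized groups.
# Check the signs of the loadings (PCA$rotation[, 1]) before deciding
# whether high PC1 should be labelled "top" or "bottom".
df$rank <- rank(df$PC1)
df$category <- cut(df$PC1,
                   breaks = quantile(df$PC1, probs = seq(0, 1, length.out = 4)),
                   labels = c("bottom", "middle", "top"),
                   include.lowest = TRUE)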
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.

Competing risk analysis of interval data

I am studying competing risks and use R.
I would like to use the model in Fine and Gray (1999), A proportional hazards model for the subdistribution of a competing risk, JASA, 94:496-509.
I found the cmprsk package.
However, I have an “interval data” configuration, with a start time t0 and an end time t1 for each interval; t1 is the exit time, or the right-censoring time when the interval is the last one for a given entity. Here is an extract of the dataset:
entity t0 t1 cov
1 0 3 12
1 3 7 4
1 7 9 1
2 2 3 2
2 3 10 9
3 0 10 11
4 0 1 0
4 1 6 21
4 6 7 12
...
I cannot find a way to implement this with cmprsk, although this kind of interval structure is supported in, for example, the survival package (Surv(time, time2, ...)).
Is it possible to do it with cmprsk or should I go to another package?
I know that there is a Stata command (stcrreg) that does this, but I prefer working with R.
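
For what it's worth, one route that stays within R is the finegray() helper in the survival package, which expands the data into a weighted dataset on which an ordinary coxph() fit gives the Fine-Gray subdistribution hazards model. A sketch only: the extract above has no event-type column, so the ec code below is hypothetical, and you should check ?finegray in your installed version of survival for whether and how (t0, t1] counting-process input with an id is supported:

library(survival)

# Assumption: the full data has an event code per interval (here `ec`):
# 0 = censored, 1 = event of interest, 2 = competing event.
df$status <- factor(df$ec, levels = 0:2,
                    labels = c("censor", "event1", "event2"))

# finegray() builds the weighted dataset for the chosen cause.
fg <- finegray(Surv(t0, t1, status) ~ ., data = df,
               etype = "event1", id = entity)

# Ordinary Cox fit on the expanded data = Fine-Gray model.
fit <- coxph(Surv(fgstart, fgstop, fgstatus) ~ cov,
             weights = fgwt, data = fg)

The cmprsk::crr() interface, by contrast, takes one row per subject and has no start/stop arguments, which is why the interval structure does not fit it directly.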
