I'm attempting to conduct survival analysis with time-varying covariates. The data come from a longitudinal survey administered every two years, and currently look something like this:
id  event1yr  event2yr  income14  income16  income18  income20
1   2014      2020      8         10        13        8
2   2018      NA        13        15        24        35
In my study, I want time (t_0) to begin at event1yr and to be measured from that variable, which obviously differs across observations. So, for instance, the time to event for observation 1 is 6 years, whereas observation 2 is right-censored at 2 years. The main issue is pulling covariate values from different time points, since the starting year differs: income for years 0-2 (exclusive) for observation 1 would come from income14, but income for years 0-2 for observation 2 would come from income18. In the end, I'd like my data to look something like this:
id st.time end.time event2 censor inc
1 0 2 0 0 8
1 2 4 0 0 10
1 4 6 1 0 13
2 0 2 0 1 24
Thus, I'm trying to work out the best way to code this so that the data are pulled from the correct columns, given that the beginning reference time is not constant across observations.
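One possible base-R sketch of that reshaping (the names `wide` and `max_year` are my own, and I'm assuming cases with no second event are right-censored at the last survey year, 2020):

```r
## Minimal sketch: wide data as in the example above.
wide <- data.frame(
  id       = c(1, 2),
  event1yr = c(2014, 2018),
  event2yr = c(2020, NA),
  income14 = c(8, 13),
  income16 = c(10, 15),
  income18 = c(13, 24),
  income20 = c(8, 35)
)

max_year <- 2020  # last observed survey year, used for censoring

long <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i) {
  row <- wide[i, ]
  end <- if (is.na(row$event2yr)) max_year else row$event2yr
  ## survey waves falling inside the follow-up window [event1yr, end)
  waves <- seq(row$event1yr, end - 2, by = 2)
  data.frame(
    id       = row$id,
    st.time  = waves - row$event1yr,
    end.time = waves - row$event1yr + 2,
    event2   = as.integer(!is.na(row$event2yr) & waves + 2 == row$event2yr),
    censor   = as.integer(is.na(row$event2yr) & waves + 2 == end),
    ## pull income from the wave-specific column, e.g. income14 for 2014
    inc      = unlist(row[paste0("income", substr(waves, 3, 4))],
                      use.names = FALSE)
  )
}))
```

For the two example rows this reproduces the target layout above (three intervals for observation 1, one censored interval for observation 2).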
I am fairly new to survival analysis and I apologize if this is a trivial question, but I wasn't able to find any solution to my problem.
I'm trying to find a good model for predicting whether and when a contract for a specific product (identified by the ID column) will be bought, i.e. a time-to-event prediction. I am mostly interested in the probability that the event will occur within 3 months. However, my data is pretty much a monthly time series. A sample of the dataset would look somewhat like this:
ID  Time     Number of assistance calls  Number of product malfunctions  Time to fix  Contract bought
1   2012-01  0                           0                               NA           0
1   2012-02  3                           1                               37.124       0
1   2012-03  2                           0                               NA           0
1   2012-04  0                           0                               NA           1
2   2012-03  1                           0                               NA           0
2   2012-04  0                           0                               NA           0
Here's what I struggle with. I could use a survival analysis model, e.g. a Cox proportional hazards model, which can deal with time-dependent variables, but in that case it wouldn't be able to predict [1]. I could also summarize the data for each ID, but that would mean losing some of the information contained in the data, e.g. a malfunction could occur 1, 2, or 3 months before the event.
Is there a better way to approach this?
Thank you very much for any tips!
Sources:
[1] https://www.annualreviews.org/doi/10.1146/annurev.publhealth.20.1.145
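One approach sometimes used for monthly panel data like this is a discrete-time hazard model: treat each ID-month row as a binary trial and combine the predicted monthly hazards into a 3-month probability. A minimal, hedged sketch (all data and names below are made up for illustration):

```r
## Each row is one ID-month; `bought` is 1 in the month the contract
## is bought. Toy stand-in for the dataset described above.
df <- data.frame(
  id     = c(1, 1, 1, 1, 2, 2, 2),
  calls  = c(0, 3, 2, 0, 1, 0, 2),   # assistance calls that month
  malf   = c(0, 1, 0, 0, 0, 0, 1),   # malfunctions that month
  bought = c(0, 0, 0, 1, 0, 0, 1)
)

## logistic regression on person-months = discrete-time hazard model
fit <- glm(bought ~ calls + malf, family = binomial, data = df)

## monthly hazard for a hypothetical covariate profile, then the
## probability of the event occurring within the next 3 months
h  <- predict(fit, newdata = data.frame(calls = 1, malf = 0),
              type = "response")
p3 <- 1 - (1 - h)^3
```

This keeps the monthly covariate information (lagged calls or malfunctions can be added as extra columns) while still yielding the 3-month probability directly.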
I have two datasets, one longitudinal (following individuals over multiple years) and one cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable which does not allow tracking someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID-year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked) I cannot match on them.
I guess I would need to find a set of variables that uniquely identifies each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
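A hedged base-R sketch of that idea: greedily grow a candidate key until it uniquely identifies every row of the long dataset, then merge on it. The names `long_df`/`cross_df` are illustrative, and the y values below are altered slightly from the example above so that a unique key actually exists (note that in the posted example year and y do not uniquely identify rows, since year 2 with y = 20 appears for both ids):

```r
long_df  <- data.frame(id   = c(1, 1, 1, 2, 2, 2),
                       year = rep(1:3, 2),
                       y    = c(10, 20, 30, 15, 25, 5))
cross_df <- data.frame(id   = c(912, 492, 363, 789, 134, 267),
                       year = rep(1:3, 2),
                       y    = c(10, 20, 30, 15, 25, 5),
                       x    = c(1, 1, 0, 1, 0, 0))

candidates <- setdiff(names(long_df), "id")  # 200-300 variables in practice
key <- character(0)
for (v in candidates) {
  key <- c(key, v)
  if (!anyDuplicated(long_df[key])) break    # stop once the key is unique
}

## keep the key plus the extra variables wanted from the cross data
merged <- merge(long_df, cross_df[c(key, "x")], by = key, all.x = TRUE)
```

Note this greedy pass just finds a sufficient key in column order; finding the truly smallest set would require searching combinations, and the chosen key should also be checked for uniqueness in the cross-sectional file.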
Thanks in advance!
Is it possible to split episodes by a given variable in survival analysis in R, similar to Stata's stsplit used in the following way: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episodes at given cut points such as c(0,5,10,15) in survSplit, but if a variable, say time of divorce, differs for each individual, then providing cut points for each individual would be impossible, and the split would have to be based on the value of a variable (say graduation, divorce, or job termination).
Is anyone aware of a package or know a resource I might be able to tap into?
Perhaps the Epi package is what you are looking for. It offers multiple ways to cut/split follow-up time using Lexis objects. Here is the documentation of cutLexis().
After some poking around, I think tmerge() in the survival package can achieve what stsplit var does, which is to split episodes not just at given cut points (the same for all observations), but at the time an event occurs for an individual.
This is the only way I knew how to split data
id<-c(1,2,3)
age<-c(19,20,29)
job<-c(1,1,0)
time<-age-16 ## create time since age 16 ##
data<-data.frame(id,age,job,time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 up to 2 years, 2-5 years, 5+ years ##
data2 <- survSplit(data, cut = c(0, 2, 5), end = "time", start = "start",
                   event = "job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individual finished school, each person might have a different cut point (they finished school at different ages).
## split by time dependent variable (age finished school) ##
d1<-data.frame(id,age,time,job)
scend<-c(17,21,24)-16
d2<-data.frame(id,scend)
## create start/stop time ##
base<-tmerge(d1,d1,id=id,tstop=time)
## create time-dependent covariate ##
s1 <- tmerge(base, d2, id = id,
             finish = tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable to the stsplit function in Stata.
I have survival data in this format, with a time-varying exposure to Intervention:
ID start stop status Intervention
1 2 14 0 0
2 2 5 0 0
3 2 3 0 0
3 3 10 1 1
4 5 8 0 0
5 6 10 0 0
For example, for patient ID #3: from day 2 to day 3, the patient has not yet received the intervention (Intervention = 0), but starting on day 3 and lasting until day 10 (when the patient dies), the patient has received the intervention (Intervention = 1).
I thought that I could then estimate the time-varying effect of exposure in the following manner:
coxph(Surv(start, stop, status) ~ Intervention + cluster(ID), data = df.td)
However, I recently found that this method is not correct for right-censored data (see "Two different results from coxph in R, using same stop and start times, why?"). Most basic guides to time-dependent survival analysis use a line like this (for example, the tutorial at https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html).
Is this method correct for estimating the effect of Intervention on outcome, given the structure of the data?
I have a scheduling puzzle that I am looking for suggestions/solutions using R.
Context
I am coordinating a series of live online group discussions where registered participants will be grouped according to their availability. In a survey, 28 participants (id) indicated morning, afternoon, or evening (am, after, pm) availability on days Monday through Saturday (18 possibilities). I need to generate groups of 4-6 participants who are available at the same time, without replacement (meaning they can only be assigned to one group). Once assigned, groups will meet weekly at the same time (i.e. Group A members will always meet Monday mornings).
Problem
Currently group assignment is being done manually (by a human), but as the number of participants grows, optimizing group assignment will become increasingly challenging. I am interested in an algorithm that efficiently achieves relatively equal group placements and respects other factors such as a person's timezone.
Sample Data
Sample data are in long-format located in an R-script here.
>str(x)
'data.frame': 504 obs. of 4 variables:
$ id : Factor w/ 28 levels "1","10","11",..: 1 12 22 23 24 25 26 27 28 2 ...
$ timezone: Factor w/ 4 levels "Central","Eastern",..: 2 1 3 4 2 1 3 4 2 1 ...
$ day.time: Factor w/ 18 levels "Fri.after","Fri.am",..: 5 5 5 5 5 5 5 5 5 5 ...
$ avail : num 0 0 1 0 1 1 0 1 0 0 ...
The first 12 rows of the data look like this:
> head(x, 12)
id timezone day.time avail
1 1 Eastern Mon.am 0
2 2 Central Mon.am 0
3 3 Mountain Mon.am 1
4 4 Pacific Mon.am 0
5 5 Eastern Mon.am 1
6 6 Central Mon.am 1
7 7 Mountain Mon.am 0
8 8 Pacific Mon.am 1
9 9 Eastern Mon.am 0
10 10 Central Mon.am 0
11 11 Mountain Mon.am 0
12 12 Pacific Mon.am 1
Ideal Solution
An algorithm to optimally define groups (size = 4 to 6) that exactly match on day.time and avail while minimizing differences on other more flexible factors (in this case timezone). In the final result, a participant should only exist in a single group.
Okay, so I am not the most knowledgeable when it comes to this, but have you looked at the K-Means Clustering algorithm? You can specify the number of clusters you want and the variables for the algorithm to consider. It will then cluster the data into the specified number of clusters (i.e. categories) for you.
What do you think?
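One caveat worth flagging: plain kmeans() will not by itself enforce the hard constraints here (exact day.time match, groups of 4-6), so those usually have to be handled around the clustering step. A minimal illustrative sketch (toy data, my own names; the technique is simple sort-and-chunk rather than k-means) that subsets one slot, sorts by timezone, and cuts into chunks of at most 5:

```r
## Toy stand-in for the availability data described above.
x <- data.frame(
  id       = 1:12,
  timezone = rep(c("Central", "Eastern", "Mountain", "Pacific"), times = 3),
  day.time = rep(c("Mon.am", "Mon.am", "Mon.pm"), each = 4),
  avail    = 1
)

## everyone available Monday morning
slot <- x[x$day.time == "Mon.am" & x$avail == 1, ]

## sort so each chunk is as timezone-homogeneous as chunking allows
slot <- slot[order(slot$timezone), ]
slot$group <- ceiling(seq_len(nrow(slot)) / 5)
```

A trailing group of fewer than 4 would still need to be merged or rebalanced by hand; repeating this per day.time slot, removing assigned ids as you go, gives the without-replacement property.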
References:
https://datascienceplus.com/k-means-clustering-in-r/
http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning