Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episodes by a given variable in survival analysis in R, similar to Stata's stsplit used in the following way: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episodes at given cut points, such as c(0,5,10,15), using survSplit. But if the splitting variable, say time of divorce, differs for each individual, then providing cut points for every individual would be impossible, and the split would have to be based on the value of a variable (say graduation, divorce, or job termination).
Is anyone aware of a package, or does anyone know of a resource I might be able to tap into?

Perhaps the Epi package is what you are looking for. It offers multiple ways to cut/split follow-up time using Lexis objects. Here is the documentation of cutLexis().
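For concreteness, here is a minimal sketch of that approach on hypothetical data (the variable names and ages below are my own, not from the question). The key point is that cut accepts one value per record, so the split point can differ across individuals:

library(Epi)

## hypothetical data: follow-up on the age scale, one row per person ##
d <- data.frame(id      = 1:3,
                entry   = c(16, 16, 16),   # age at entry
                exit    = c(19, 20, 29),   # age at exit
                fail    = c(1, 1, 0),      # event of interest at exit
                div.age = c(18, NA, 24))   # age at divorce; NA = never divorced

L <- Lexis(entry = list(age = entry),
           exit  = list(age = exit),
           exit.status = fail,
           id = id, data = d)

## split each record at that person's own divorce age (NA = no split);
## person-time after the cut is moved to state 2 ##
L2 <- cutLexis(L, cut = L$div.age, timescale = "age",
               new.state = 2, precursor.states = 0)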

After some poking around, I think tmerge() in the survival package can achieve what stsplit var does, which is to split episodes not just at given cut points (the same for all observations) but at the time an event occurs for each individual.
This is the only way I knew how to split data:
id <- c(1, 2, 3)
age <- c(19, 20, 29)
job <- c(1, 1, 0)
time <- age - 16  ## create time since age 16 ##
data <- data.frame(id, age, job, time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 up to 2 years, 2-5 years, 5+ years ##
data2 <- survSplit(data, cut = c(0, 2, 5), end = "time", start = "start",
                   event = "job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individual finished school, each person might have a different cut point (they finished school at different ages).
## split by a time-dependent variable (age finished school) ##
d1 <- data.frame(id, age, time, job)
scend <- c(17, 21, 24) - 16  ## ages at finishing school, as time since age 16 ##
d2 <- data.frame(id, scend)
## create start/stop times ##
base <- tmerge(d1, d1, id = id, tstop = time)
## create time-dependent covariate ##
s1 <- tmerge(base, d2, id = id, finish = tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable to the stsplit function in Stata.


Cavs vs. Warriors - probability of Cavs winning the series includes combinations like "0,1,0,0,0,1,1" - but the series is over after game 5

There is a problem on DataCamp about computing the probability of winning an NBA series. The Cavs and the Warriors are playing a seven-game championship series; the first to win four games wins the series, and each team has a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computed the probability using Monte Carlo simulation:
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when a game is lost and 1 when won
  sum(x) >= 4
})
mean(results)
Here is a different way they computed the probability using simple code:
# Assign a variable 'n' as the number of remaining games.
n <- 6
# Assign a variable 'outcomes' as a vector of possible game outcomes:
# 0 indicates a loss and 1 a win for the Cavs.
outcomes <- c(0, 1)
# Assign a variable 'l' to a list of all possible outcomes in all remaining games.
# Use the 'rep' function on 'list(outcomes)' to create a list of length 'n'.
l <- rep(list(outcomes), n)
# Create a data frame named 'possibilities' that contains all combinations of
# possible outcomes for the remaining games.
possibilities <- expand.grid(l)  # My comment: note how this produces 64 combinations.
# Create a vector named 'results' that indicates whether each row in the data
# frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results <- rowSums(possibilities) >= 4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
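(As a quick cross-check, not part of the exercise: winning at least 4 of the 6 remaining fair games is a binomial tail probability.)

# cross-check: exact P(win >= 4 of 6 remaining fair games)
pbinom(3, size = 6, prob = 0.5, lower.tail = FALSE)  # 0.34375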
Question/Problem:
They both produce approximately the same probability of winning the series, ~0.34. However, there seems to be a flaw in the concept and the code design. For example, the code (sampling six times) allows for combinations such as the following:
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for game G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for game G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
As you can see, these sequences are never possible. After winning the first four of the remaining six games, no more games would be played; similarly, after losing the first three of the remaining six games, no more games would be played. So these combinations shouldn't be included in the computation of the probability of winning the series, and some of the combinations are double-counted.
Here is what I did to omit some of the combinations that are not possible in real life.
library(dplyr)
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities <- possibilities %>%
  mutate(rowsums = rowSums(possibilities)) %>%
  filter(rowsums <= 4)
But then I am not able to omit the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. This is because no more games would be played after losing three times in a row, so (b) and (c) are basically double counting.
There are too many conditions for me to be able to filter them individually. There has to be a more efficient and intuitive way to do this. Can someone provide me with some hints on how to solve this whole mess?
Here is a way:
library(dplyr)
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities %>%
  mutate(rowsums = rowSums(cur_data()),
         anti_sum = rowSums(!cur_data())) %>%
  filter(rowsums <= 4, anti_sum <= 3)
We use the fact that R can coerce numbers into logicals, where 0 becomes FALSE. See sum(!0) as a short example.
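For instance (a quick illustration of that point, added here):

x <- c(1, 0, 0, 0, 1, 1)
!x       # FALSE  TRUE  TRUE  TRUE FALSE FALSE
sum(!x)  # 3 -- counts the zeros, i.e. the losses in the row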

Creating time-varying covariates when time is different for each observation

I'm attempting to conduct survival analysis with time-varying covariates. The data comes from a longitudinal survey that is administered every two years, and currently looks something like this:
id event1yr event2yr income14 income16 income18 income20
 1     2014     2020        8       10       13        8
 2     2018       NA       13       15       24       35
In the case of my study, I am trying to begin time (t_0) at event1yr and measure time from that variable, which obviously differs for each observation. So, for instance, the time to event for observation 1 is 6 years, whereas the time for observation 2 is right-censored at 2 years. The main issue comes with also trying to pull data from different time points, since the beginning time differs. For instance, income for years 0-2 (exclusive) for observation 1 would come from income14, but income for years 0-2 for observation 2 would come from income18. In the end, I'd like my data to look something like this:
id st.time end.time event2 censor inc
1 0 2 0 0 8
1 2 4 0 0 10
1 4 6 1 0 13
2 0 2 0 1 24
Thus, I'm trying to think of the best way to code pulling the data from different points in time, since the beginning reference time is not constant across observations.
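One possible approach (a sketch, not from the original thread; the column names and the 2020 censoring year are assumptions based on the example above): reshape the income columns to long format, convert the wave suffixes to calendar years, and measure each wave relative to that person's event1yr.

library(dplyr)
library(tidyr)

## hypothetical reconstruction of the wide data above
d <- data.frame(id = 1:2,
                event1yr = c(2014, 2018), event2yr = c(2020, NA),
                income14 = c(8, 13), income16 = c(10, 15),
                income18 = c(13, 24), income20 = c(8, 35))

last.wave <- 2020  # assumed end of observation for right-censored cases

long <- d %>%
  pivot_longer(starts_with("income"), names_prefix = "income",
               names_to = "wave", values_to = "inc") %>%
  mutate(wave     = 2000 + as.numeric(wave),   # "14" -> 2014, etc.
         st.time  = wave - event1yr,           # years since each person's t_0
         end.time = st.time + 2,
         censor   = as.numeric(is.na(event2yr)),
         stop.yr  = ifelse(censor == 1, last.wave, event2yr)) %>%
  filter(st.time >= 0, wave < stop.yr) %>%     # keep waves between t_0 and exit
  mutate(event2 = as.numeric(censor == 0 & wave + 2 == stop.yr)) %>%
  select(id, st.time, end.time, event2, censor, inc)

long  # reproduces the desired output shown above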

Algorithm to optimally define groups based on multiple responses in R

I have a scheduling puzzle that I am looking for suggestions/solutions using R.
Context
I am coordinating a series of live online group discussions where registered participants will be grouped according to their availability. In a survey, 28 participants (id) indicated morning, afternoon, or evening (am, after, pm) availability on days Monday through Saturday (18 possibilities). I need to generate groups of 4-6 participants who are available at the same time, without replacement (meaning they can only be assigned to one group). Once assigned, groups will meet weekly at the same time (i.e. Group A members will always meet Monday mornings).
Problem
Currently, group assignment is being done manually (by a human), but with more participants, optimizing group assignment will become increasingly challenging. I am interested in finding an algorithm that efficiently achieves relatively equal group placements and respects other factors, such as a person's timezone.
Sample Data
Sample data are in long-format located in an R-script here.
>str(x)
'data.frame': 504 obs. of 4 variables:
$ id : Factor w/ 28 levels "1","10","11",..: 1 12 22 23 24 25 26 27 28 2 ...
$ timezone: Factor w/ 4 levels "Central","Eastern",..: 2 1 3 4 2 1 3 4 2 1 ...
$ day.time: Factor w/ 18 levels "Fri.after","Fri.am",..: 5 5 5 5 5 5 5 5 5 5 ...
$ avail : num 0 0 1 0 1 1 0 1 0 0 ...
The first 12 rows of the data look like this:
> head(x, 12)
id timezone day.time avail
1 1 Eastern Mon.am 0
2 2 Central Mon.am 0
3 3 Mountain Mon.am 1
4 4 Pacific Mon.am 0
5 5 Eastern Mon.am 1
6 6 Central Mon.am 1
7 7 Mountain Mon.am 0
8 8 Pacific Mon.am 1
9 9 Eastern Mon.am 0
10 10 Central Mon.am 0
11 11 Mountain Mon.am 0
12 12 Pacific Mon.am 1
Ideal Solution
An algorithm to optimally define groups (size = 4 to 6) that exactly match on day.time and avail while minimizing differences on other more flexible factors (in this case timezone). In the final result, a participant should only exist in a single group.
Okay, so I am not the most knowledgeable when it comes to this, but have you looked at the k-means clustering algorithm? You can specify the number of clusters you want and the variables for the algorithm to consider. It will then cluster the data into the specified number of clusters, i.e. categories, for you.
What do you think?
References:
https://datascienceplus.com/k-means-clustering-in-r/
http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning
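For concreteness, a rough sketch of that idea on the question's long data x (the choice of k is my assumption). Note that k-means will not enforce group sizes of 4-6 or exact matches on day.time, so treat this as a starting point rather than a full solution:

library(tidyr)

## one row per participant, with the 18 day.time slots as 0/1 columns
wide <- pivot_wider(x, id_cols = id, names_from = day.time,
                    values_from = avail)

set.seed(1)
k <- 6  # roughly 28 participants / ~5 per group (assumption)
km <- kmeans(wide[, -1], centers = k, nstart = 25)

## tentative group assignments
split(wide$id, km$cluster)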

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs data from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info data frame. Year contains the period during which the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to take the modulo of each interval and sort them into their new intervals that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate this nicely.
If I understand your problem, you'll want something like this:
library(dplyr)
bottom <- seq(1750, 2010, 5)
new_df <- mlt.mx.info %>%
  arrange(Year) %>%
  mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
  mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",
                           bottom[findInterval(year2, bottom) + 1] - 1))
View(new_df)
So what this does is create five-year bins and output a new column (new_year) labelling each row with the bin its end year falls into. So everything ending in 1750-1754 will correspond to a new value of "1750-1754" (in string form; the original is an integer type, and I'm not sure how to fix that). Does this do what you want? Double-check the results, but it looks right to me.
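And for the barplot mentioned in the question, something like this should work on the result (a sketch using ggplot2):

library(ggplot2)

## bar chart of life-table counts per simplified five-year interval
ggplot(new_df, aes(x = new_year)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))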

Competing risk analysis of interval data

I study competing risks and use R.
I would like to use the model in Fine and Gray (1999), A proportional hazards model for the subdistribution of a competing risk, JASA, 94:496-509.
I found the cmprsk package.
However, I have an "interval data" configuration with a starting time t0 and an ending time t1 for each interval, t1 being the exit or right-censoring time when it is the last interval for a given entity. Here is an extract of the dataset:
entity t0 t1 cov
1 0 3 12
1 3 7 4
1 7 9 1
2 2 3 2
2 3 10 9
3 0 10 11
4 0 1 0
4 1 6 21
4 6 7 12
...
I cannot find how to implement that with cmprsk, while it is implemented, for example, in the survival package (Surv(time, time2, ...)).
Is it possible to do it with cmprsk or should I go to another package?
I know that there is a Stata command (stcrreg) that does this, but I prefer working with R.
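Not a full answer, but a possible direction (a sketch, untested on this data): as far as I know, cmprsk::crr() takes a single failure time per subject (its cov2/tf arguments only allow a restricted form of time-varying covariates), whereas survival::finegray() builds the Fine-Gray weighted data set and has an id argument for multiple rows per subject; check ?finegray for whether your (t0, t1] structure is supported and for the caveats on time-dependent covariates.

library(survival)

## assumes the data `d` from the question plus an event-type column `status`
## (not shown in the extract): a factor whose FIRST level is censoring,
## e.g. levels c("cens", "ev1", "ev2")

## build the Fine-Gray weighted data set for the event of interest;
## `id` links the multiple (t0, t1] rows of each entity
fg <- finegray(Surv(t0, t1, status) ~ cov, data = d,
               id = entity, etype = "ev1")

## fit the subdistribution hazard model using the finegray() weights
fit <- coxph(Surv(fgstart, fgstop, fgstatus) ~ cov,
             data = fg, weights = fgwt)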
