Calculating months of follow-up with pyears() of survival package - r

I want to calculate the person-time of follow-up by calendar month. In my example, I have three subjects, with different times of follow-up. I want to know if the rates of the event vary by year tertiles, so I want to sum up the time at risk they spend in each of the tertiles.
library(lubridate)
library(survival)
event <- c(1,1,1)
id <- c(1,2,2)
followup_time <- c(365, 365*2, 365*3)
right.date <- c(ymd("2012-06-01"), ymd("2013-09-01"), ymd("2011-01-01"))
left.date <- right.date - followup_time
tertile <- cut(month(right.date), c(0,4,9,12), include.lowest = T)
df <- data.frame(id, left.date, right.date, followup_time, event, tertile); df
id left.date right.date followup_time event tertile
1 1 2011-06-01 2012-06-01 365 1 (4,9]
2 2 2011-09-01 2013-09-01 730 1 (4,9]
3 2 2008-01-01 2011-01-01 1095 1 [0,4]
sum(df$followup_time)
[1] 2190
Using pyears() from the survival package, I get the following results. The numbers of subjects and events are correct, but the person-time of follow-up is not what I need.
s <- Surv(time = followup_time, event = event)
summary(pyears(s ~ tertile , scale = 1))
Call: pyears(formula = s ~ tertile , scale = 1)
number of observations = 3
month N Events Time
-------- --- -------- ------
[0,4] 1 1 1095
(4,9] 2 2 1095
(9,12] 0 0 0
I expect the following results, which correspond to the sum of the time at risk each subject spent in each of the intervals.
month N Events Time
-------- --- -------- ------
[0,4] 1 1 547.5
(4,9] 2 2 547.5
(9,12] 0 0 547.5
Some people use the function tcut() from this same package to do this kind of operation for calculating person-time, but I did not have satisfactory results.
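To make the goal concrete, here is a minimal base R sketch of the allocation described in the question: each day of follow-up is assigned to the month tertile it falls in, and the days are summed per tertile. It is an illustration, not pyears() or tcut() output. The convention that the left date is included and the right date is excluded is an assumption, and the names days and time_by_tertile are made up for this example.

```r
# Follow-up windows from the question, rebuilt with base R only
followup_time <- c(365, 365 * 2, 365 * 3)
right.date <- as.Date(c("2012-06-01", "2013-09-01", "2011-01-01"))
left.date <- right.date - followup_time

# Expand every subject into one entry per day at risk
# (assumption: left date included, right date excluded)
days <- do.call(c, lapply(seq_along(left.date), function(i) {
  seq(left.date[i], right.date[i] - 1, by = "day")
}))

# Assign each day to a month tertile and sum the days per tertile
day_month <- as.integer(format(days, "%m"))
tertile <- cut(day_month, c(0, 4, 9, 12), include.lowest = TRUE)
time_by_tertile <- table(tertile)
time_by_tertile
```

The total over the three tertiles equals sum(followup_time), but the split is not exactly even because the three month groups do not contain the same number of days.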

I don't understand the confusion (or maybe it's really simple and nothing to do with survival package functions):
df
#--------
id left.date right.date followup_time event tertile
1 1 2011-06-02 2012-06-01 365 1 (4,9]
2 2 2011-09-02 2013-09-01 730 1 (4,9]
3 2 2008-01-02 2011-01-01 1095 1 [0,4]
month(right.date)
#[1] 6 9 1
It has to do with how R's default cut() function works: intervals are closed on the right, so month 9 falls in (4,9]. I find that most people expect the intervals to be closed on the left; if you want that, pass right = FALSE to cut() when creating tertile and then rebuild the data frame:
> df <- data.frame(id, left.date, right.date, followup_time, event, tertile); df
id left.date right.date followup_time event tertile
1 1 2011-06-02 2012-06-01 365 1 [4,9)
2 2 2011-09-02 2013-09-01 730 1 [9,12]
3 2 2008-01-02 2011-01-01 1095 1 [0,4)
> s <- with(df, Surv(time = followup_time, event = event))
>
> summary(pyears(s ~ tertile , scale = 1))
Call: pyears(formula = s ~ tertile, scale = 1)
number of observations = 3
tertile N Events Time
--------- --- -------- ------
[0,4) 1 1 1095
[4,9) 1 1 365
[9,12] 1 1 730
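A small base R sketch of the difference between the two conventions, using the three months from the example (6, 9 and 1). The variable names right_closed and left_closed are only for illustration.

```r
months <- c(6, 9, 1)
breaks <- c(0, 4, 9, 12)

# Default: intervals closed on the right, so month 9 lands in (4,9]
right_closed <- cut(months, breaks, include.lowest = TRUE)

# right = FALSE: intervals closed on the left, so month 9 lands in [9,12]
left_closed <- cut(months, breaks, include.lowest = TRUE, right = FALSE)

as.character(right_closed)  # "(4,9]" "(4,9]" "[0,4]"
as.character(left_closed)   # "[4,9)" "[9,12]" "[0,4)"
```

With include.lowest = TRUE and right = FALSE, the last interval is closed on both sides, which is why month 12 would still be classified.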

Related

Count number of rows for each row that meet a logical condition

So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in comment.
Data with new mid-minute reading:
library(dplyr)
df %>%
  # Take the text above and convert to datetime
  mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
  # Count how many observations per minute
  group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
  summarize(obs_per_min = sum(!is.na(data))) %>%
  # Add rows for any missing minutes, count as zero observations
  padr::pad(interval = "1 min") %>%
  tidyr::replace_na(list(obs_per_min = 0)) %>%
  # Count cumulative observations, and calc how many in window that
  # begins 5 minutes ago and ends at end of current minute
  mutate(cuml_count = cumsum(obs_per_min),
         prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
         in_window = cuml_count - lag(prior_cuml, 5)) %>%
  # Exclude unneeded columns and rows
  select(-cuml_count, -prior_cuml) %>%
  filter(obs_per_min > 0)
Output (now reflects add'l reading at 1:06:30)
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
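For comparison, here is a minimal base R sketch of the original (one reading per minute) problem using vapply(), which is the apply-family approach the question asks about. The timestamps are hard-coded as minutes since midnight, which is an assumption made to keep the example self-contained; ts_min and window are illustrative names.

```r
# Timestamps from the question, expressed as minutes since midnight
ts_min <- c(61, 62, 63, 64, 66, 67, 70, 71, 72)
window <- 5

# For each row, count the rows whose timestamp lies in [t - window, t]
count <- vapply(ts_min, function(t) {
  sum(ts_min >= t - window & ts_min <= t)
}, integer(1))

# Rows less than five minutes after the first reading get NA
count[ts_min - ts_min[1] < window] <- NA
count  # NA NA NA NA 5 5 3 4 4
```

This compares every row against every other row, so it is quadratic in the number of rows; the cumulative-sum approach above scales better on long series.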

lag date upon condition, carry over

I have repeated measurements on individuals who have or have not made donations when solicited. I want to carry the last successful solicitation date over to the following observations until a new success occurs.
Here is my sample data:
set.seed(13)
df <- data.frame(ID=rep(letters[1:3], each=4),
SolicitationDate= sample(seq(as.Date('2016/01/01'),
as.Date('2018/01/01'), by="day"), 3),
Success=rbinom(4,1,0.2))
df$ExpectedResult <- c(NA, NA, "2016-06-28", "2016-06-28",
NA, NA, "2016-10-11", "2016-10-11",
NA,NA,"2017-06-03", "2017-06-03")
Should an individual have multiple successes, the last success date should be carried over.
Thanks
Romain
Here's a version using tidyverse. I think your expected output may be off as the dates should be ordered within ID but that may be wrong. In that case let me know.
library(dplyr)
df %>%
  group_by(ID) %>%                                               # Group by ID
  arrange(SolicitationDate, .by_group = TRUE) %>%                # Sort by date within each ID
  mutate(res = replace(SolicitationDate, Success == 0, NA)) %>%  # Keep the date only on success
  tidyr::fill(res)                                               # Fill down
This will give you
# A tibble: 12 x 4
# Groups: ID [3]
ID SolicitationDate Success res
<fct> <date> <int> <date>
1 a 2016-06-28 1 2016-06-28
2 a 2016-10-11 0 2016-06-28
3 a 2017-06-03 0 2016-06-28
4 a 2017-06-03 0 2016-06-28
5 b 2016-06-28 0 NA
6 b 2016-06-28 0 NA
7 b 2016-10-11 1 2016-10-11
8 b 2017-06-03 0 2016-10-11
9 c 2016-06-28 0 NA
10 c 2016-10-11 0 NA
11 c 2016-10-11 0 NA
12 c 2017-06-03 1 2017-06-03
I'm not sure if you want the success dates to be part of the result or not. If not then you could set to missing and fill down again. In any case: hope this helps.
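The same fill-down idea can be sketched in base R with cummax(). The small data set and the helper name last_success are hypothetical and only serve the illustration. As in the tidyverse version, the success date itself is part of the result.

```r
# Hypothetical sample data, already sorted by ID and date
df <- data.frame(
  ID = c("a", "a", "a", "b", "b", "b"),
  SolicitationDate = as.Date(c("2016-01-10", "2016-03-05", "2016-07-20",
                               "2016-02-01", "2016-04-15", "2016-09-30")),
  Success = c(0, 1, 0, 0, 0, 1)
)

# Index of the most recent success at or before each row (NA if none yet)
last_success <- function(dates, success) {
  idx <- cummax(seq_along(success) * (success == 1))
  idx[idx == 0] <- NA
  dates[idx]
}

# Apply the helper within each ID
df$LastSuccess <- as.Date(NA)
for (id in unique(df$ID)) {
  rows <- df$ID == id
  df$LastSuccess[rows] <- last_success(df$SolicitationDate[rows], df$Success[rows])
}
df
```

For ID a this gives NA, 2016-03-05, 2016-03-05, and for ID b it gives NA, NA, 2016-09-30.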

Count consecutive events

I have daily data for 1 year with values of 0 and 1. For each month, I want to count the events in which the value 1 occurs on 3 or more consecutive days. How can I do this in R?
set.seed(123)
abts1 <- sample(0:1, 366, replace=TRUE)
library(xts)
d16 <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), 1)
ax16 <- as.Date(d16,"%y-%m-%d")
abts12 <- xts(abts1, ax16)
# but it gives events for complete period, not as monthly.
apply.monthly(abts12, function(x) sum(with(rle(c(x!=0)), lengths*values)>=3))
The last line of your code throws an error for me when I use xts_0.9-7.
R> apply.monthly(abts12, function(x) sum(with(rle(x!=0), lengths*values)>=3))
Error in rle(x != 0) : 'x' must be a vector of an atomic type
That's easy to fix though. You just need to wrap x != 0 in as.logical.
R> apply.monthly(abts12, function(x) sum(with(rle(as.logical(x!=0)), lengths*values)>=3))
[,1]
2016-01-31 2
2016-02-29 1
2016-03-31 3
2016-04-30 2
2016-05-31 1
2016-06-30 2
2016-07-31 3
2016-08-31 3
2016-09-30 2
2016-10-31 3
2016-11-30 0
2016-12-31 2
That seems like the output you expect: the number of times in each month that there are 3 or more consecutive days with a value of 1.
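To see what the rle() expression does, here is a small base R sketch on a hand-made vector, without xts. The helper name count_runs and the twelve sample days are made up for the illustration, and tapply() stands in for apply.monthly().

```r
# Count runs of 1s that last at least min_len days
count_runs <- function(x, min_len = 3) {
  runs <- rle(x == 1)
  sum(runs$values & runs$lengths >= min_len)
}

# Twelve hypothetical days spanning the end of January and start of February
dates <- seq(as.Date("2016-01-29"), by = "day", length.out = 12)
x <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0)

# Whole period: runs of length 3, 2 and 4, so two qualify
total <- count_runs(x)

# Per month: the series is split at the month boundary before counting
monthly <- tapply(x, format(dates, "%Y-%m"), count_runs)
monthly
```

Note that splitting by month cuts any run that crosses a month boundary into two pieces, which is also how apply.monthly() treats the data.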

Difftime for workdays according to holidayNYSE in R

I'm trying to compute difftime over working days only, following the holidayNYSE calendar. When I use the difftime function, weekends and holidays are included in the result. My dataset contains only data from working days, so when using difftime I have to subtract the non-working days somehow.
A is a vector of 0s and 1s, and I want to find the duration, in days, of each run of 0s or 1s. The duration of run one is supposed to be 35, but I get 49 (working days from January 1990).
library(data.table)
df <- data.frame(Date=(dates), A)
setDT(df)
DF1 <- df[, list(A = unique(A), duration = difftime(max(Date),min(Date), holidayNYSE
  (year=setRmetricsOptions(start="1990-01-01", end="2015-31-12")))), by = run]
DF1
run A duration
1: 1 1 49 days
2: 2 0 22 days
3: 3 1 35 days
4: 4 0 27 days
5: 5 1 14 days
---
291: 291 1 6 days
292: 292 0 34 days
293: 293 1 10 days
294: 294 0 15 days
295: 295 1 29 days
An answer to my question without use of difftime:
library(data.table)
df <- data.frame(Date=(dates), Value1=bull01)
setDT(df)
# Number each run of identical consecutive values
df[, run := cumsum(c(1, diff(Value1) != 0))]
# The data contain working days only, so the number of rows in a run is its duration
df2 <- df[, .(duration = .N, type = Value1[1]), by = run]
> df2
run duration type
1 35 1
2 17 0
3 25 1
4 20 0
5 10 1
---
291 5 1
292 25 0
293 9 1
294 11 0
295 21 1
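The reason row counting works where difftime does not can be sketched in base R on a few made-up working days. The dates and values below are hypothetical; rle() replaces the explicit loop.

```r
# Hypothetical working days only (6 and 7 January 1990 were a weekend)
dates <- as.Date(c("1990-01-02", "1990-01-03", "1990-01-04",
                   "1990-01-05", "1990-01-08", "1990-01-09"))
value <- c(1, 1, 1, 0, 0, 1)

# Run lengths in rows are durations in working days
runs <- rle(value)
result <- data.frame(run = seq_along(runs$lengths),
                     duration = runs$lengths,
                     type = runs$values)
result

# difftime between the two dates of run 2 counts calendar days instead
calendar_days <- as.numeric(difftime(dates[5], dates[4], units = "days"))
calendar_days  # 3, although the two dates are consecutive working days
```

Because the data set already excludes weekends and holidays, no holiday calendar is needed as long as durations are counted in rows.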

R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations

I have a data.frame in R that records the amounts of something that users with unique IDs used between two dates.
ID start date end date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
....
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000
The time series for each ID starts and ends at inconsistent times, and contains an inconsistent number of observations. However, they are all formatted in the above manner; the start and end dates are Date objects.
I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighting the observed amounts that happen to straddle two or more months accordingly.
In other words, I would like to turn this series into something like
ID start date end date amount
1 1-1-2012 2-1-2012 3096 = 6000 * 16/31
1 2-1-2012 3-1-2012 4339 = 6000*15/31+4000*14/39
1 3-1-2012 4-1-2012 etc
....
1 6-1-2012 7-1-2012 etc
2 1-1-2012 2-1-2012 etc
2 2-1-2012 3-1-2012 etc
2 3-1-2012 4-1-2012 etc
2 4-1-2012 5-1-2012 etc
2 5-1-2012 6-1-2012 etc
....
10000 1-1-2012 2-1-2012 etc
....
10000 6-1-2012 7-1-2012 etc
Here the value for ID 1 between 2/1/12 and 3/1/12 combines two observations. The 1-15-2012 to 2-15-2012 observation contributes its amount (6000) weighted by the share of its days that land in February (15 days / 31 days), and the 2-15 to 3-25 observation contributes its amount (4000) weighted by the share of its days that fall in February (14 days / 39 days, as 2012 was a leap year). This yields 6000*15/31 + 4000*14/39 = 4339. This should be done for each ID's time series. We do not consider the case where the observation periods all fit into one month, but if an observation is spread out over more than two months it should be split over that number of months with the appropriate weightings.
I'm rather new to r and could certainly use some help on this!
Here is a solution using base R:
#The data
df=read.table(text='ID start_date end_date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000',
header=T,row.names = NULL,stringsAsFactors =FALSE)
df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")
df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot
#using by compute new amount
df1$amount=do.call(c,
by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
)
df1
> df1
ID startm endm amount
1 1 2012-01-01 2012-02-01 3096.774
2 1 2012-02-01 2012-03-01 4339.123
3 1 2012-03-01 2012-05-01 4306.038
4 1 2012-05-01 2012-06-01 1535.842
5 2 2012-01-01 2012-02-01 2500.000
6 2 2012-02-01 2012-03-01 4700.000
7 2 2012-03-01 2012-05-01 3754.545
8 10000 2012-01-01 2012-02-01 5302.326
9 10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
To solve this I think the easiest way is to break it down into two problems.
1. How can I get a daily breakdown of the figures I'm interested in? This is my assumption based on the information you provided above.
2. How do I group by a date range and summarise to what I'm interested in?
For the following example, I will use the data set which I created using the code below:
df <- data.frame(
id=c(1,1,1,1,2,2,2),
start_date=as.Date(c("1-15-2012",
"2-15-2012",
"3-25-2012",
"5-26-2012",
"1-16-2012",
"2-27-2012",
"3-18-2012"), "%m-%d-%Y"),
end_date=as.Date(c("2-15-2012",
"3-25-2012",
"5-26-2012",
"6-13-2012",
"2-27-2012",
"3-18-2012",
"5-23-2012"), "%m-%d-%Y"),
amount=c(6000,
4000,
3000,
1000,
7000,
2000,
3000)
)
1. Provide daily figures
To provide the daily figures, firstly we get the daily contribution:
df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)
Then, we will expand the date range using the start and end dates. There are a couple ways which you can do it, but seeing that you apply the dplyr tag, using the dplyr way we have:
library(dplyr)
df <- df %>%
rowwise() %>%
do(data.frame(id=.$id,
date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")),
daily_contribution=.$daily_contribution))
which has some output which looks like this:
Source: local data frame [285 x 3]
Groups: <by row>
id date daily_contribution
1 1 2012-01-15 193.5484
2 1 2012-01-16 193.5484
3 1 2012-01-17 193.5484
4 1 2012-01-18 193.5484
5 1 2012-01-19 193.5484
6 1 2012-01-20 193.5484
7 1 2012-01-21 193.5484
8 1 2012-01-22 193.5484
9 1 2012-01-23 193.5484
10 1 2012-01-24 193.5484
.. .. ... ...
2. Create a grouping variable
Next we create some kind of grouping variable that we're interested in. I've used lubridate for ease to get the month and year of the dates:
library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)
Now with all of this we can easily use dplyr to summarise our information by the dates as required.
df %>%
group_by(id, mnth, yr) %>%
summarise(amount=sum(daily_contribution))
with output:
Source: local data frame [11 x 4]
Groups: id, mnth
id mnth yr amount
1 1 1 2012 3290.3226
2 1 2 2012 4441.6873
3 1 3 2012 2902.8122
4 1 4 2012 1451.6129
5 1 5 2012 1591.3978
6 1 6 2012 722.2222
7 2 1 2012 2666.6667
8 2 2 2012 4800.0000
9 2 3 2012 2436.3636
10 2 4 2012 1363.6364
11 2 5 2012 1045.4545
To get it precisely in the format you specified:
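df %>%
  group_by(id, mnth, yr) %>%
  summarise(amount=sum(daily_contribution)) %>%
  ungroup() %>%
  rowwise() %>%
  # %m+% comes from lubridate; it avoids month 13 when mnth is December
  mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
         end_date=start_date %m+% months(1)) %>%
  select(id, start_date, end_date, amount)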
with output:
Source: local data frame [11 x 4]
Groups: <by row>
id start_date end_date amount
1 1 2012-01-01 2012-02-01 3290.3226
2 1 2012-02-01 2012-03-01 4441.6873
3 1 2012-03-01 2012-04-01 2902.8122
4 1 2012-04-01 2012-05-01 1451.6129
5 1 2012-05-01 2012-06-01 1591.3978
6 1 2012-06-01 2012-07-01 722.2222
7 2 2012-01-01 2012-02-01 2666.6667
8 2 2012-02-01 2012-03-01 4800.0000
9 2 2012-03-01 2012-04-01 2436.3636
10 2 2012-04-01 2012-05-01 1363.6364
11 2 2012-05-01 2012-06-01 1045.4545
as needed.
Note: your example has 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39, and my numbers differ because seq() includes both the start date and the end date. For the first observation, for example, 15 Jan to 31 Jan is 17 days when the date range is inclusive. You can adjust the endpoints of the sequence if you need a different convention.
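As an illustration of that adjustment, here is a base R sketch that drops the start date from each sequence, so that every observation covers exactly end minus start days. This convention is an assumption, chosen because it reproduces the 16/31 and 15/31 + 14/39 weights from the question. Only the first two observations of ID 1 are used, and the names obs, daily and monthly are made up.

```r
# First two observations for ID 1 from the question
obs <- data.frame(
  id = c(1, 1),
  start_date = as.Date(c("2012-01-15", "2012-02-15")),
  end_date = as.Date(c("2012-02-15", "2012-03-25")),
  amount = c(6000, 4000)
)

# One row per day, start date excluded and end date included,
# so each amount is spread over exactly (end_date - start_date) days
daily <- do.call(rbind, lapply(seq_len(nrow(obs)), function(i) {
  days <- seq(obs$start_date[i] + 1, obs$end_date[i], by = "day")
  data.frame(id = obs$id[i], day = days,
             per_day = obs$amount[i] / length(days))
}))

# Sum the daily shares by ID and calendar month
daily$month <- format(daily$day, "%Y-%m")
monthly <- aggregate(per_day ~ id + month, data = daily, FUN = sum)
monthly
```

With this convention January gets 6000 * 16/31, February gets 6000 * 15/31 + 4000 * 14/39, and the monthly amounts add up to the original total of 10000.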
Here's a solution using plyr and reshape. The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month).
df$index <- 1:nrow(df) #Create a unique index number
#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")
library(plyr); library(reshape) #Load the libraries
#dlaply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
# ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) {
ID <- X$ID #Keep the ID value
n.days <- as.numeric(difftime( X$end.date, X$start.date )) #Calculate time difference in days, report the result as a number
day <- seq(X$start.date, X$end.date, by="days") #Sequence of days
amount.per.day <- X$amount/n.days #Amount for that day
data.frame(ID, day, amount.per.day) #Last line is the output
})
#Change list back into data.frame
df3 <- ldply(df2, data.frame) #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <- as.numeric(format(df3$day, "%m")) #Assign a month to all dates
#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))
# ID mon amount
# 1 1 1 3290.3226
# 2 1 2 4441.6873
# 3 1 3 2902.8122
# 4 1 4 1451.6129
# 5 1 5 1591.3978
# 6 1 6 722.2222
# 7 2 1 2666.6667
# 8 2 2 4800.0000
# 9 2 3 2436.3636
# 10 2 4 1363.6364
# 11 2 5 1045.4545
Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. If your code is somewhat complicated, you can use dput(yourdata).
HTH!
