I have a large dataset that includes date periods with different disease states per id and reference date. I would like to add a 'healthy' state for all missing date periods within +/- 5 years from the reference date per id.
I have tried to modify the solution here: Fill in missing date ranges but failed. Preferably, I would like to keep to the data.table framework. Any advice is greatly appreciated!
Sample data:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
Desired output:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2005-03-16 2008-10-10 0
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2008-10-13 2014-11-04 0
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2008-05-10 2012-01-31 0
2 2013-05-10 2012-02-01 2012-03-01 2
2 2013-05-10 2012-03-02 2018-05-10 0
2 2014-06-11 2009-06-11 2012-01-31 0
2 2014-06-11 2012-02-01 2012-03-01 2
2 2014-06-11 2012-03-02 2019-06-11 0
3 2011-08-14 2006-08-14 2016-08-14 0
")
Comment:
For the first row, the +/-5 year date interval is from 2005-01-10 to 2015-01-10. However, because of the ongoing disease state that ends 2005-03-15, the "healthy" period starts at 2005-03-16. Because there can be several reference dates per id, duplicate date periods (as observed for id 2: 2012-02-01-2012-03-01) will be present and are OK. Finally, ids with no disease states are represented by NA (as id 3).
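To make the comment above concrete, here is a minimal base R sketch of how the window and the first "healthy" start date for id 1 could be computed. The variable names (window_start, window_end, disease_end, healthy_start) are made up for illustration and are not part of the data.

```r
# Reference date for id 1 in the sample data
reference_date <- as.Date("2010-01-10")

# +/- 5 years around the reference date (calendar years, not 5 * 365 days)
window_start <- seq(reference_date, by = "-5 years", length.out = 2L)[2L]
window_end   <- seq(reference_date, by = "5 years",  length.out = 2L)[2L]

# The first disease period is still ongoing when the window opens,
# so the first healthy day is the day after that period ends
disease_end   <- as.Date("2005-03-15")
healthy_start <- max(window_start, disease_end + 1L)

window_start   # 2005-01-10
window_end     # 2015-01-10
healthy_start  # 2005-03-16
```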
EDIT: I had some problems with the real data, so I tweaked the solution a bit and added a step that collapses the status per date interval:
DT2 <- DT[,{
# +/-5 years from t0
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(period_start[1L])) {
# replace NA with full time interval for 'healthy'
.(period_start=sdt, period_end=edt, status='notsick')
} else{
# Add date for -5 years if it is the minimum, otherwise use existing minimum
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
# Add date for +5 years if it is the maximum, otherwise use existing maximum
if (edt > period_end[.N]) {
period_end <- c(period_end,edt)
}
dates=unique(sort(c(period_start, period_end+1L)))
.(period_start=dates[-length(dates)], period_end=dates[-1L]-1, status='')
}
},
.(id,reference_date)]
## (c). Collapse status for overlapping periods
DT <- DT[DT2, on = .(id,reference_date, period_start <= period_start, period_end >= period_end), {
# Status is the column in DT; a lower-case status would pick up DT2's column instead
.(Status = paste(Status, collapse = ";"))},
by = .EACHI, allow.cartesian = TRUE]
Here is an option:
interweave <- function(x, y) c(rbind(x, y)) #see ref
ans <- DT[, {
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(period_start[1L])) {
.(period_start=sdt, period_end=edt, Status=0L)
} else {
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
# unique() drops the duplicate created when rbind() recycles the shorter vector
ps <- as.IDate(unique(sort(interweave(period_start, period_end+1L))))
if (period_end[.N] > edt) {
ps <- ps[-length(ps)]
pe <- period_end[.N]
} else {
pe <- edt
}
.(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
}
},
.(id, reference_date)]
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans
data:
library(data.table)
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
Reference:
Alternate, interweave or interlace two vectors
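As a small illustration of the helper used in the answer, the sketch below shows what interweave returns for equal-length vectors and what happens when the lengths differ. The input values are arbitrary numbers chosen for the example.

```r
# Same one-line helper as in the answer: rbind() stacks the two vectors as rows,
# and c() reads the resulting matrix column by column
interweave <- function(x, y) c(rbind(x, y))

# Equal lengths: elements alternate x1, y1, x2, y2, ...
same_length <- interweave(c(1, 3, 5), c(2, 4, 6))

# Unequal lengths: rbind() recycles the shorter vector, so 2 appears twice
recycled <- interweave(c(1, 3), 2)

same_length  # 1 2 3 4 5 6
recycled     # 1 2 3 2
```

Note that rbind() also drops attributes such as the Date or IDate class, which is why the answer converts the result back with as.IDate().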
rptdate st
1 2/18/2017 2/12/2017
2 2/25/2017 2/19/2017
3 3/4/2017 2/26/2017
4 3/11/2017 3/5/2017
5 3/18/2017 3/12/2017
6 3/25/2017 3/19/2017
7 4/1/2017 3/26/2017
8 4/8/2017 4/2/2017
9 4/15/2017 4/9/2017
10 4/22/2017 4/16/2017
11 4/29/2017 4/23/2017
12 5/6/2017 4/30/2017
13 5/13/2017 5/7/2017
14 5/20/2017 5/14/2017
15 5/27/2017 5/21/2017
16 6/3/2017 5/28/2017
17 6/10/2017 6/4/2017
So basically rptdate is a bunch of Saturdays and st is each previous Sunday.
I would like to reshape this data frame (the columns are in date format). In pseudocode, what I would like to do is this:
i=1
j=1
While (rptdate[i][j] > st[i][j])
{add a new row where rptdate[i][j+1] = rptdate[i][j] and st[i][j+1] = st[i][j]+1}
So basically, my desired new dataframe should be like this:
rptdate st
1 2/18/2017 2/12/2017
2/18/2017 2/13/2017
2/18/2017 2/14/2017
2/18/2017 2/15/2017
2/18/2017 2/16/2017
2/18/2017 2/17/2017
2/18/2017 2/18/2017
2 2/25/2017 2/19/2017
2/25/2017 2/20/2017
2/25/2017 2/21/2017
2/25/2017 2/22/2017
2/25/2017 2/23/2017
2/25/2017 2/24/2017
2/25/2017 2/25/2017
Thank you very much for your time.
Here is an idea via base R. You need to convert your variables to dates first. Then expand the data frame so that each date has 7 rows (1 week). Generate all the missing dates using seq and store them in your st variable.
d2[] <- lapply(d2, function(i) as.Date(i, format = '%m/%d/%Y'))
d3 <- d2[rep(row.names(d2), each = 7),]
d3$st<- do.call(c, Map(function(x, y)seq(x, y, by = 1), d2$st, d2$rptdate))
head(d3, 10)
# rptdate st
#1 2017-02-18 2017-02-12
#1.1 2017-02-18 2017-02-13
#1.2 2017-02-18 2017-02-14
#1.3 2017-02-18 2017-02-15
#1.4 2017-02-18 2017-02-16
#1.5 2017-02-18 2017-02-17
#1.6 2017-02-18 2017-02-18
#2 2017-02-25 2017-02-19
#2.1 2017-02-25 2017-02-20
#2.2 2017-02-25 2017-02-21
...
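The each = 7 in the code above assumes every row spans exactly one week. As a hedged variation, the sketch below derives the number of repeats from the dates themselves, so rows with shorter or longer gaps are also handled. The two-row d2 here is a made-up example (the second row deliberately spans only 5 days), and n_days is an illustrative name.

```r
# Small example input; the second row covers 5 days instead of 7
d2 <- data.frame(
  rptdate = as.Date(c("2017-02-18", "2017-02-25")),
  st      = as.Date(c("2017-02-12", "2017-02-21"))
)

# Number of calendar days from st to rptdate, inclusive
n_days <- as.integer(d2$rptdate - d2$st) + 1L

# Repeat each row as many times as it has days
d3 <- d2[rep(seq_len(nrow(d2)), times = n_days), ]

# Fill st with the full daily sequence for each original row
d3$st <- do.call(c, Map(function(from, to) seq(from, to, by = "day"),
                        d2$st, d2$rptdate))
rownames(d3) <- NULL
d3
```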
library(data.table)
dt <- data.table(V1=as.Date(c("2/18/2017","2/25/2017","3/4/2017","3/11/2017"),format = "%m/%d/%Y"),
V2=as.Date(c("2/12/2017","2/19/2017","2/26/2017","3/5/2017"),format = "%m/%d/%Y"))
for(i in 0:6){
dt[,paste0("column_i",i):=V1-i]
}
dt[,V2:=NULL]
temp <- melt(dt,id.vars = "V1")
setorder(temp,V1,value)
temp[,variable:=NULL]
Note that, in the end, V2 is not actually needed.
Here is an example using functions from dplyr and lubridate. dt2 would be the final output.
# Create example data frame
dt <- read.table(text = "rptdate st
2/18/2017 2/12/2017
2/25/2017 2/19/2017
3/4/2017 2/26/2017
3/11/2017 3/5/2017
3/18/2017 3/12/2017
3/25/2017 3/19/2017
4/1/2017 3/26/2017
4/8/2017 4/2/2017
4/15/2017 4/9/2017
4/22/2017 4/16/2017
4/29/2017 4/23/2017
5/6/2017 4/30/2017
5/13/2017 5/7/2017
5/20/2017 5/14/2017
5/27/2017 5/21/2017
6/3/2017 5/28/2017
6/10/2017 6/4/2017",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(lubridate)
# Process the data
dt2 <- dt %>%
mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
rowwise() %>%
do(tibble(rptdate = rep(.$rptdate[1], 7),
st = seq(.$st[1], .$rptdate[1], by = 1))) %>%
mutate(rptdate = format(rptdate, "%m/%d/%Y"),
st = format(st, "%m/%d/%Y"))
Or you can use the map2 function from purrr and the unnest function from tidyr, both part of the tidyverse.
# Load packages
library(tidyverse)
library(lubridate)
# Process the data
dt2 <- dt %>%
mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
mutate(st = map2(st, rptdate, seq, by = 1)) %>%
unnest(st) %>%
mutate(rptdate = format(rptdate, "%m/%d/%Y"),
st = format(st, "%m/%d/%Y"))
I would like to create a column of 0s and 1s based on inequalities of three columns of dates.
The idea is the following. If event_date is before death_date or study_over, the column event should be 1; if event_date occurs after death_date or study_over, event should be 0. Both event_date and death_date may contain NAs.
set.seed(1337)
rand_dates <- Sys.Date() - 365:1
df <-
data.frame(
event_date = sample(rand_dates, 20),
death_date = sample(rand_dates, 20),
study_over = sample(rand_dates, 20)
)
My attempt was the following
eventR <-
function(x, y, z){
if(is.na(y)){
ifelse(x <= z, 1, 0)
} else if(y <= z){
ifelse(x < y, 1, 0)
} else {
ifelse(x <= z, 1, 0)
}
}
I use it in the following manner
library(dplyr)
df[c(3, 5, 7), "event_date"] <- NA #there are some NA in .$event_date
df[c(3, 4, 6), "death_date"] <- NA #there are some NA in .$death_date
df %>%
mutate(event = sapply(.$event_date, eventR, y = .$death_date, z = .$study_over))
##Error: wrong result size (400), expected 20 or 1
##In addition: There were 40 warnings (use warnings() to see them)
I can't figure out how to do this. Any suggestions?
This constructs a binary column (with NAs where a comparison involves a missing date) in which 1 indicates "event_date is before death_date or study_over" and 0 is used elsewhere. As already pointed out, your specification does not cover all cases:
df$event <- with(df, as.numeric( event_date < pmax( death_date , study_over) ) )
df
You can use pmap_dbl() from the purrr package instead of sapply:
library(dplyr)
library(purrr)
df %>% mutate(event = pmap_dbl(list(event_date, death_date, study_over), eventR))
event_date death_date study_over event
1 2016-10-20 2017-01-27 2016-12-16 1
2 2016-10-15 2016-12-12 2017-01-20 1
3 <NA> <NA> 2016-10-09 NA
4 2016-09-04 <NA> 2016-11-17 1
5 <NA> 2016-10-13 2016-06-09 NA
6 2016-07-21 <NA> 2016-04-26 0
7 <NA> 2017-02-21 2016-07-12 NA
8 2016-07-02 2017-02-08 2016-08-24 1
9 2016-06-19 2016-09-07 2016-04-11 0
10 2016-05-14 2017-03-13 2016-08-03 1
11 2017-03-06 2017-02-05 2017-02-28 0
12 2017-03-10 2016-04-28 2016-11-30 0
13 2017-01-10 2016-12-10 2016-10-27 0
14 2016-05-31 2016-06-12 2016-08-13 1
15 2017-03-03 2016-12-25 2016-12-20 0
16 2016-04-01 2016-11-03 2016-06-30 1
17 2017-02-26 2017-02-25 2016-05-12 0
18 2017-02-08 2016-12-08 2016-10-14 0
19 2016-07-19 2016-07-03 2016-09-22 0
20 2016-06-17 2016-06-06 2016-11-09 0
You might also be interested in the dplyr function case_when() for handling many if-else statements.
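As a rough sketch of that suggestion, the same branching as eventR could be written with case_when(), which works on whole columns and so needs neither sapply nor pmap_dbl. The five-row df_small below is a hand-picked subset of the output above (rows 1, 4, 5, 6 and 11), and the column logic assumes the rules encoded in eventR are the intended ones.

```r
library(dplyr)

# Hand-picked rows from the printed result above
df_small <- data.frame(
  event_date = as.Date(c("2016-10-20", "2016-09-04", NA, "2016-07-21", "2017-03-06")),
  death_date = as.Date(c("2017-01-27", NA, "2016-10-13", NA, "2017-02-05")),
  study_over = as.Date(c("2016-12-16", "2016-11-17", "2016-06-09", "2016-04-26", "2017-02-28"))
)

df_small <- df_small %>%
  mutate(event = case_when(
    is.na(event_date)        ~ NA_real_,                             # no event date recorded
    is.na(death_date)        ~ as.numeric(event_date <= study_over), # no death: compare with study end
    death_date <= study_over ~ as.numeric(event_date < death_date),  # death before study end
    TRUE                     ~ as.numeric(event_date <= study_over)  # death after study end
  ))

df_small$event  # 1 1 NA 0 0
```

The conditions are checked in order, so the first one that is TRUE for a row decides its value.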
I am trying to replicate something like this with a custom function but I am getting errors. I have the following data frame
> dd
datetimeofdeath injurydatetime
1 2/10/05 17:30
2 2/13/05 19:15
3 2/15/05 1:10
4 2/24/05 21:00 2/16/05 20:36
5 3/11/05 0:45
6 3/19/05 23:05
7 3/19/05 23:13
8 3/23/05 20:51
9 3/31/05 11:30
10 4/9/05 3:07
The typeof these is integer but for some reason they have levels as if they were factors. This could be the root of my problem but I am not sure.
> typeof(dd$datetimeofdeath)
[1] "integer"
> typeof(dd$injurydatetime)
[1] "integer"
> dd$injurydatetime
[1] 2/10/05 17:30 2/13/05 19:15 2/15/05 1:10 2/16/05 20:36 3/11/05 0:45 3/19/05 23:05 3/19/05 23:13 3/23/05 20:51 3/31/05 11:30
[10] 4/9/05 3:07
549 Levels: 1/1/07 18:52 1/1/07 20:51 1/1/08 17:55 1/1/11 15:25 1/1/12 0:22 1/1/12 22:58 1/11/06 23:50 1/11/07 6:26 ... 9/9/10 8:15
Now I would like to apply the following function rowwise()
library(lubridate)
library(dplyr)
get_time_alive = function(datetimeofdeath, injurydatetime)
{
if(as.character(datetimeofdeath) == "" | as.character(injurydatetime) == "") return(NA)
time_of_death = parse_date_time(as.character(datetimeofdeath), "%m/%d/%y %H:%M")
time_of_injury = parse_date_time(as.character(injurydatetime), "%m/%d/%y %H:%M")
time_alive = as.duration(interval(time_of_injury,time_of_death))
time_alive_hours = as.numeric(time_alive) / (60*60)
return(time_alive_hours)
}
This works on individual rows, but not when I do the operation rowwise.
> get_time_alive(dd$datetimeofdeath[1], dd$injurydatetime[1])
[1] NA
> get_time_alive(dd$datetimeofdeath[4], dd$injurydatetime[4])
[1] 192.4
> dd = dd %>% rowwise() %>% dplyr::mutate(time_alive_hours=get_time_alive(datetimeofdeath, injurydatetime))
There were 20 warnings (use warnings() to see them)
> dd
Source: local data frame [10 x 3]
Groups:
datetimeofdeath injurydatetime time_alive_hours
1 2/10/05 17:30 NA
2 2/13/05 19:15 NA
3 2/15/05 1:10 NA
4 2/24/05 21:00 2/16/05 20:36 NA
5 3/11/05 0:45 NA
6 3/19/05 23:05 NA
7 3/19/05 23:13 NA
8 3/23/05 20:51 NA
9 3/31/05 11:30 NA
10 4/9/05 3:07 NA
As you can see the fourth element is NA even though when I applied my custom function to it by itself I got 192.4. Why is my custom function failing here?
I think you can simplify your code a lot and just use something like this:
dd %>%
mutate(across(everything(), ~ as.POSIXct(as.character(.x), format = "%m/%d/%y %H:%M"))) %>%
mutate(time_alive = datetimeofdeath - injurydatetime)
# datetimeofdeath injurydatetime time_alive
#1 <NA> 2005-02-15 01:10:00 NA days
#2 2005-02-24 21:00:00 2005-02-16 20:36:00 8.016667 days
#3 <NA> 2005-03-11 00:45:00 NA days
Side notes:
I shortened your input data, because it's not easy to copy (I only took those three rows that you also see in my answer)
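If you want "time_alive" in hours, use mutate(time_alive = as.numeric(difftime(datetimeofdeath, injurydatetime, units = "hours"))) in the last mutate. Multiplying the difference by 24 only works when the difftime happens to be in days, because subtraction picks the units automatically.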
If you use this code, there's no need for rowwise() - which should also make it faster, I guess
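A minimal base R sketch of the hours calculation for the one complete row in the question (row 4). The tz = "UTC" argument is an assumption added here so that daylight saving changes cannot affect the result; the original data does not state a time zone.

```r
# Parse the two timestamps from row 4 of the question's data
fmt <- "%m/%d/%y %H:%M"
death  <- as.POSIXct("2/24/05 21:00", format = fmt, tz = "UTC")
injury <- as.POSIXct("2/16/05 20:36", format = fmt, tz = "UTC")

# Ask for hours explicitly instead of relying on the automatic unit choice
time_alive_hours <- as.numeric(difftime(death, injury, units = "hours"))

time_alive_hours  # 192.4
```

This matches the 192.4 the question reports for that row. If the columns are factors, as in the question, they need as.character() before as.POSIXct().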
I have two large data sets like these:
df1=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'))
df2=data.frame(subject = c(rep(1, 10), rep(2, 10)), day=c(1,1,2,2,3,3,9,9,15,15,1,1,2,2,3,3,9,9,15,15),dtime=c('4/16/2012 6:15','4/16/2012 15:16','4/18/2012 7:15','4/18/2012 21:45','4/19/2012 7:05','4/19/2012 23:17','4/28/2012 7:15','4/28/2012 21:12','5/1/2012 7:15','5/1/2012 15:15','4/23/2012 6:45','4/23/2012 16:45','4/25/2012 6:45','4/25/2012 21:30','4/26/2012 6:45','4/26/2012 22:00','5/2/2012 7:00','5/2/2012 22:00','5/8/2012 6:45','5/8/2012 15:45'))
...
In df2, 'dtime' contains two time points for each subject on each day. For every observation in df1, I want to subtract the second 'dtime' for that subject and day from 'stime'. If the result is positive, the observation should get the second time point in 'dtime'; otherwise it should get the first. For example, for subject 1 on day 1, ('4/16/2012 6:25'-'4/16/2012 15:16')<0, so this observation gets the first time point '4/16/2012 6:15'; ('4/16/2012 17:22'-'4/16/2012 15:16')>0, so this observation gets the second time point '4/16/2012 15:16'. The expected output should look like this:
df3=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'), dtime=c('4/16/2012 6:15','4/16/2012 6:15','4/16/2012 15:16','4/16/2012 15:16','4/16/2012 15:16','4/18/2012 7:15','4/19/2012 7:05','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 15:15','5/1/2012 15:15','.','4/23/2012 6:45','4/23/2012 6:45','4/23/2012 16:45','4/23/2012 16:45','4/25/2012 6:45','4/26/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 15:45','5/8/2012 15:45'))
...
I used the code below to do this; however, because 'dtime' is missing for day 19, R kept giving me the error shown after the code:
df1$dtime <- apply(df1, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[2],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
Error in if (as.POSIXct(x["stime"], format = "%m/%d/%Y %H:%M") < as.POSIXct(choices[2], : missing value where TRUE/FALSE needed
Does anyone have idea how to solve this problem?
To start, I loaded the two data frames to try things out. Here is what I am thinking in terms of a pseudo-code approach (I will leave you to finish the code). df1, when loaded, looks like the following:
subject day stime
1 1 1 4/16/2012 6:25
2 1 1 4/16/2012 7:01
3 1 1 4/16/2012 17:22
4 1 1 4/16/2012 17:45
5 1 1 4/16/2012 18:13
6 1 2 4/18/2012 6:50
7 1 3 4/19/2012 6:55
8 1 15 5/1/2012 6:28
9 1 15 5/1/2012 7:00
10 1 15 5/1/2012 16:28
11 1 15 5/1/2012 17:00
12 2 1 4/23/2012 5:56
13 2 1 4/23/2012 6:30
14 2 1 4/23/2012 16:55
15 2 1 4/23/2012 17:20
16 2 2 4/25/2012 6:32
17 2 3 4/26/2012 6:28
18 2 15 5/8/2012 5:54
19 2 15 5/8/2012 6:30
20 2 15 5/8/2012 15:55
21 2 15 5/8/2012 16:30
Why not try the following:
First, write a simple loop that goes through each value in the stime column of df1 and looks up the corresponding dtime values in df2. To make this easy, you could convert df1 and df2 into matrices if you like (using as.matrix(), which is my preference).
After you grab the value in row 1, column 3 of df1, which is 4/16/2012 6:25, pull out the 6:25 and store it in a temporary variable ... let's call this variable a
Do the exact same thing for df2, which you also want to compare to, and store this in a temporary variable, except grab the variable from the relevant position ... let's call this variable b
Subtract the two temporary variables (you may need to write some code to get the two parts set up so that you can easily do an a-b and get a numerical answer. That said, I will leave that up to you).
Check whether the answer is positive or negative using a simple conditional if statement
Get the value of a or b depending on the output from your conditional check
Add this new value to a new data table, with the appropriate subject and day. You have called this df3.
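The steps above can be sketched in base R roughly as follows. This is only one possible reading: it compares full date-times rather than just the clock time, follows the rule stated in the question (use the second dtime when stime is later than it, otherwise the first), and returns NA when df2 has no pair of times for that subject and day. The helper names to_time and pick_dtime are made up for this example, and the data frames are a four-row subset of the question's data.

```r
# Parse the "m/d/Y H:M" strings used in the question (UTC is an assumption)
to_time <- function(x) as.POSIXct(x, format = "%m/%d/%Y %H:%M", tz = "UTC")

df1 <- data.frame(
  subject = c(1, 1, 1, 1),
  day     = c(1, 1, 2, 19),
  stime   = c("4/16/2012 6:25", "4/16/2012 17:22", "4/18/2012 6:50", "5/5/2012 17:00"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject = c(1, 1, 1, 1),
  day     = c(1, 1, 2, 2),
  dtime   = c("4/16/2012 6:15", "4/16/2012 15:16", "4/18/2012 7:15", "4/18/2012 21:45"),
  stringsAsFactors = FALSE
)

# For one observation: look up the two dtime values and choose between them
pick_dtime <- function(subject, day, stime) {
  choices <- df2$dtime[df2$subject == subject & df2$day == day]
  if (length(choices) < 2) return(NA_character_)  # e.g. day 19 has no entry in df2
  if (to_time(stime) > to_time(choices[2])) choices[2] else choices[1]
}

# Apply the helper to every row of df1
df1$dtime <- mapply(pick_dtime, df1$subject, df1$day, df1$stime, USE.NAMES = FALSE)
df1
```

Returning NA for the unmatched day avoids the "missing value where TRUE/FALSE needed" error, because the if() condition is never evaluated on a missing time.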
I'm getting different answers than you. First I made a copy of df1 to work with:
df4 <- df1
df4$dtime <- apply(df4, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[1],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
#----------------------------------------------
subject day stime dtime
1 1 1 4/16/2012 6:25 4/16/2012 15:16
2 1 1 4/16/2012 7:01 4/16/2012 15:16
3 1 1 4/16/2012 17:22 4/16/2012 15:16
4 1 1 4/16/2012 17:45 4/16/2012 15:16
5 1 1 4/16/2012 18:13 4/16/2012 15:16
6 1 2 4/18/2012 6:50 4/18/2012 7:15
7 1 3 4/19/2012 6:55 4/19/2012 7:05
8 1 15 5/1/2012 6:28 5/1/2012 7:15
9 1 15 5/1/2012 7:00 5/1/2012 7:15
10 1 15 5/1/2012 16:28 5/1/2012 15:15
11 1 15 5/1/2012 17:00 5/1/2012 15:15
12 2 1 4/23/2012 5:56 4/23/2012 6:45
13 2 1 4/23/2012 6:30 4/23/2012 6:45
14 2 1 4/23/2012 16:55 4/23/2012 16:45
15 2 1 4/23/2012 17:20 4/23/2012 16:45
16 2 2 4/25/2012 6:32 4/25/2012 6:45
17 2 3 4/26/2012 6:28 4/26/2012 6:45
18 2 15 5/8/2012 5:54 5/8/2012 6:45
19 2 15 5/8/2012 6:30 5/8/2012 6:45
20 2 15 5/8/2012 15:55 5/8/2012 15:45
21 2 15 5/8/2012 16:30 5/8/2012 15:45