R data manipulation - Rank condition in data.table / dplyr - r

In R I am conducting analyses on df1 but I also need to pull data from the more detailed records / observations in df2 and attach to df1 based on certain conditions.
This is sample data comparable to my own:
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ),
actiontime=c("2020-03-10" , "2020-02-17" , "2020-04-22" , "2020-04-19" , "2020-04-20" , "2020-04-22" , "2020-03-02" , "2020-05-07" ) )
df2 <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4, 5,5,5) ,
observation=c( "2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00", "2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00", "2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" , "2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" , "2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30", "2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" , "2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
var1=round(runif(25, min=10, max=60),0) ,
var2=c("Red" , "Blue" , "Yellow" , NA , "Yellow" , "Blue" , "Red" , "Yellow" , NA , NA , "Yellow" , NA , NA , NA , NA , NA , "Blue", NA , "Blue" , "Yellow" , NA , "Blue" , "Yellow" , "Red" , "Blue") )
For example, how can I do the following procedures (preferably with data.table, but if someone would also like to demonstrate with dplyr that is also nice):
Q1. If I decide on the following rank Blue > Red > Yellow, how can I then get the highest-ranked color in df2$var2 among the observations related to the same id (if any), attached as a new variable to df1 by the respective id?
Q2. In addition to the rank as in Q1, how do I add a condition to only select var2 if the observation happens a day before the actiontime in df1?
Q3. And to learn even more - how can the data that was pulled out of df2 and joined to df1 in Q1 instead be placed on the record with the earliest observation for the id in df2 - meaning just working on df2, not involving df1 (and the join)?
The output for Q3 would be something like this:
id observation var1 var2 color
1 1 2020-03-09 01:00 37 Red Blue
2 1 2020-03-09 10:00 35 Blue <NA>
3 1 2020-03-10 05:00 27 Yellow <NA>
4 2 2020-02-15 08:00 21 <NA> Yellow
5 2 2020-02-16 09:00 37 Yellow <NA>
6 2 2020-02-17 08:00 38 Blue <NA>
7 3 2020-04-16 14:30 56 Red <NA>
8 3 2020-04-16 07:30 35 Yellow Red
9 3 2020-04-17 15:00 40 <NA> <NA>
10 3 2020-04-25 07:20 20 <NA> <NA>
11 3 2020-04-18 10:00 49 <NA> <NA>
12 3 2020-04-19 10:30 58 <NA> <NA>
13 3 2020-04-20 12:00 37 <NA> <NA>
14 3 2020-04-21 12:00 25 <NA> <NA>
15 3 2020-04-22 09:30 16 <NA> <NA>
16 3 2020-04-24 23:00 52 <NA> <NA>
17 3 2020-04-23 17:30 46 Blue <NA>
18 4 2020-03-01 08:00 16 <NA> Blue
19 4 2020-03-02 08:00 14 Blue <NA>
20 4 2020-03-03 08:00 21 Yellow <NA>
21 4 2020-03-15 16:45 52 <NA> <NA>
22 4 2020-03-16 08:00 40 Blue <NA>
23 5 2020-05-05 13:45 13 Yellow Red
24 5 2020-05-06 08:00 12 Red <NA>
25 5 2020-05-07 11:00 11 Blue <NA>

There are 3 questions in one; I will try to answer them one by one.
Question 1
If I understand correctly, the OP wants to identify the highest ranked color in var2 per id and wants to copy the color to a new column in df1 for the matching ids.
This can be solved by turning var2 into an ordered factor, aggregating df2 by id, and adding the result to df1 by an update join:
library(data.table)
setDT(df1)
setDT(df2)
df2[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow", NA), exclude = NULL)]
str(df2)
Classes ‘data.table’ and 'data.frame': 25 obs. of 5 variables:
$ id : num 1 1 1 2 2 2 3 3 3 3 ...
$ observation: chr "2020-03-09 01:00" "2020-03-09 10:00" "2020-03-10 05:00" "2020-02-15 08:00" ...
$ var1 : num 15 58 12 35 11 25 24 54 14 15 ...
$ var2 : Ord.factor w/ 4 levels "Blue"<"Red"<"Yellow"<..: 2 1 3 4 3 1 2 3 4 4 ...
$ action_day : IDate, format: "2020-03-10" "2020-03-10" "2020-03-11" "2020-02-16" ...
- attr(*, ".internal.selfref")=<externalptr>
So, we can find the highest ranked color per id by using min():
df2[, min(var2, na.rm = TRUE), by = id]
id V1
1: 1 Blue
2: 2 Blue
3: 3 Blue
4: 4 Blue
5: 5 Blue
which is rather trivial because all id groups include Blue in var2.
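As a quick aside, the reason min() works here is that comparisons on an ordered factor follow the declared level order, and Blue was deliberately made the lowest level. A minimal self-contained illustration:

```r
# Comparisons on an ordered factor follow the level order,
# so min() returns the "best" color under the ranking Blue < Red < Yellow.
f <- ordered(c("Yellow", "Blue", "Red"), levels = c("Blue", "Red", "Yellow"))
min(f)
#> [1] Blue
#> Levels: Blue < Red < Yellow
```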
This can be appended to df1 by an update join
df1[df2[, min(var2, na.rm = TRUE), by = id], on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Blue
3: 3 a 2020-04-22 Blue
4: 3 b 2020-04-19 Blue
5: 3 b 2020-04-20 Blue
6: 4 a 2020-04-22 Blue
7: 4 a 2020-03-02 Blue
8: 5 a 2020-05-07 Blue
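Since the OP also asked for a dplyr demonstration, here is a sketch of Q1 assuming df1 and df2 as originally defined (character columns); rank_levels and best_color are helper names introduced here for illustration. Instead of an ordered factor, match() against a rank vector gives a numeric rank, the best rank is taken per id, and the result is left-joined onto df1.

```r
library(dplyr)

rank_levels <- c("Blue", "Red", "Yellow")  # rank 1 is best

# best-ranked color per id, ignoring NA observations
best_color <- df2 %>%
  filter(!is.na(var2)) %>%
  group_by(id) %>%
  summarise(color = rank_levels[min(match(var2, rank_levels))])

df1 %>% left_join(best_color, by = "id")
```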
Question 2
If I understand correctly, the OP wants to filter df2 so that only those rows are kept where the date of the observation in df2 is exactly one day before an actiontime in df1 (for the same id). This intermediate result is then processed in the same way as df2 in Question 1, above.
The filtering is accomplished by a join operation, but it requires coercing the character date actiontime and the character date-time observation to the Date class for date arithmetic.
df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
keep_df2_rows
[1] 1 2 5 14 11 12 18 24
keep_df2_rows contains the row numbers of those rows of df2 which fulfil the condition that the observation happened exactly one day before an actiontime in df1 (for the same id).
Now, we can use the code of question 1 but use keep_df2_rows to filter df2:
df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id],
    on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Yellow
3: 3 a 2020-04-22 Yellow
4: 3 b 2020-04-19 Yellow
5: 3 b 2020-04-20 Yellow
6: 4 a 2020-04-22 <NA>
7: 4 a 2020-03-02 <NA>
8: 5 a 2020-05-07 Red
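A dplyr sketch of Q2, again assuming the original df1/df2 (eligible and best_color are illustrative helper names): a semi_join keeps only observations dated exactly one day before an actiontime of the same id, then the ranking step works as in Q1.

```r
library(dplyr)

rank_levels <- c("Blue", "Red", "Yellow")  # rank 1 is best

# keep observations that happened exactly one day before an actiontime (same id)
eligible <- df2 %>%
  mutate(obs_day = as.Date(observation)) %>%
  semi_join(df1 %>% mutate(day_before = as.Date(actiontime) - 1),
            by = c("id", "obs_day" = "day_before"))

best_color <- eligible %>%
  filter(!is.na(var2)) %>%
  group_by(id) %>%
  summarise(color = rank_levels[min(match(var2, rank_levels))])

df1 %>% left_join(best_color, by = "id")
```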
Question 3
If I understand correctly, the final goal of the OP is to add the color column to df2 instead of df1, with the additional requirement that only the row with the earliest observation within an id is to be updated.
This can be accomplished by an update join with a look-up table lut which contains the colors by id as described above, together with the earliest observation by id:
library(data.table)
setDT(df2)[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow"))]
setDT(df1)[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
agg1 <- df2[keep_df2_rows][!is.na(var2), min(var2), by = id]
agg2 <- df2[, .(observation = min(observation)), by = id]
lut <- merge(agg1, agg2, by = "id")
df2[lut, on = .(id, observation), color := as.character(V1)][]
id observation var1 var2 action_day color
1: 1 2020-03-09 01:00 23 Red 2020-03-10 Blue
2: 1 2020-03-09 10:00 29 Blue 2020-03-10 <NA>
3: 1 2020-03-10 05:00 39 Yellow 2020-03-11 <NA>
4: 2 2020-02-15 08:00 55 <NA> 2020-02-16 Yellow
5: 2 2020-02-16 09:00 20 Yellow 2020-02-17 <NA>
6: 2 2020-02-17 08:00 55 Blue 2020-02-18 <NA>
7: 3 2020-04-16 14:30 57 Red 2020-04-17 <NA>
8: 3 2020-04-16 07:30 43 Yellow 2020-04-17 Yellow
9: 3 2020-04-17 15:00 41 <NA> 2020-04-18 <NA>
10: 3 2020-04-25 07:20 13 <NA> 2020-04-26 <NA>
11: 3 2020-04-18 10:00 20 Yellow 2020-04-19 <NA>
12: 3 2020-04-19 10:30 19 <NA> 2020-04-20 <NA>
13: 3 2020-04-20 12:00 44 <NA> 2020-04-21 <NA>
14: 3 2020-04-21 12:00 29 <NA> 2020-04-22 <NA>
15: 3 2020-04-22 09:30 48 <NA> 2020-04-23 <NA>
16: 3 2020-04-24 23:00 35 <NA> 2020-04-25 <NA>
17: 3 2020-04-23 17:30 46 Blue 2020-04-24 <NA>
18: 4 2020-03-01 08:00 60 <NA> 2020-03-02 <NA>
19: 4 2020-03-02 08:00 29 Blue 2020-03-03 <NA>
20: 4 2020-03-03 08:00 49 Yellow 2020-03-04 <NA>
21: 4 2020-03-15 16:45 57 <NA> 2020-03-16 <NA>
22: 4 2020-03-16 08:00 21 Blue 2020-03-17 <NA>
23: 5 2020-05-05 13:45 43 Yellow 2020-05-06 Red
24: 5 2020-05-06 08:00 16 Red 2020-05-07 <NA>
25: 5 2020-05-07 11:00 23 Blue 2020-05-08 <NA>
id observation var1 var2 action_day color
Note that the result differs from the example table posted by the OP because the OP's definition of df2 differs from that example table.
Also note that I had to modify the computation of agg1 because of an unexpected behaviour of min(var2, na.rm = TRUE) when an id group consists only of NA. (To reproduce the issue, compare min(ordered(NA), na.rm = TRUE) with min(ordered(NA)).)

Related

Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table( id = seq(1,21),
type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
"2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
"2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
"2020-05-28", "2020-08-25", "2020-12-14")),
end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
"2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
"2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
"2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
geom_segment(size = 2) +
xlab("") +
ylab("") +
theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14 however, also intervention type 2 is happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
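The day counts quoted above are easy to verify in R; since the ranges are counted inclusively, each is the date difference plus one:

```r
# "pure" type 1 before the overlap: 2015-01-09 to 2015-04-13
as.integer(as.Date("2015-04-13") - as.Date("2015-01-09")) + 1  # 95
# overlap: 2015-04-14 to 2015-05-04
as.integer(as.Date("2015-05-04") - as.Date("2015-04-14")) + 1  # 21
# "pure" type 1 again: 2015-05-05 to 2015-06-18
as.integer(as.Date("2015-06-18") - as.Date("2015-05-05")) + 1  # 45
```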
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if also possible, I would like to organise the data such that I get all the separate time periods extracted, meaning that the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-14
2: 2 3 2015-04-15 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers using dplyr or just marking overlapping time periods, but I have not found an answer to my specific case.
Is there an efficient way to calculate this using data.table?
This method does a small explosion by looking at all dates in the range, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr) # the chain below uses the %>% pipe
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
  .[, .N, by = .(start_dt, type)] %>%
  .[!is.na(type), ] %>%
  dcast(start_dt ~ type, value.var = "N") %>%
  .[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt")] %>%
  .[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
        start_dt = min(start_dt), end_dt = max(start_dt)), by = r]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
It drops the id field; I don't know a good way to map it back to your original data.
@r2evans' solution is more complete, but if you want to explore the use of foverlaps you can start with something like this:
# split into two frames by type
data = split(data, by = "type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# create the rows that have overlaps
overlap = foverlaps(data[[1]], data[[2]], type = "any", nomatch = 0)
# get the overlapping time periods
overlap[, .(start_dt = max(start_dt, i.start_dt), end_dt = min(end_dt, i.end_dt)),
        by = 1:nrow(overlap)][, type := 3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days is 1492.
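For reference, that total can be reproduced from the overlap-periods table printed above. Note the chain above is not assigned to a variable, so the snippet assumes it has been stored first (overlap_periods is an illustrative name), and that the quoted figure uses the plain difference end_dt - start_dt, i.e. not counting the intervals inclusively:

```r
# assuming the chain above was captured, e.g.
# overlap_periods <- overlap[, .(start_dt = max(start_dt, i.start_dt),
#                                end_dt   = min(end_dt, i.end_dt)),
#                            by = 1:nrow(overlap)]
overlap_periods[, sum(as.integer(end_dt - start_dt))]
#> [1] 1492
```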

How to merge/join dataframes in R conditionally to dynamic time intervals

In R I have two data frames representing covid-19 patients and I would like to merge them into one table to be able to perform the desired analyses.
df1 contains metadata of each hospital stay
df2 contains observational data for different timepoints during each stay, usually one per day but for some there are more than one per day
some cases have multiple stays, and I find it difficult to merge these datasets so that the observational data is assigned to its respective metadata from the corresponding stay, since there is no variable other than the dates to indicate which stay the observational data belongs to
Sample data can be generated with this code
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5), in_date=c("2020-03-09", "2020-02-15" , "2020-04-16" , "2020-04-19", "2020-04-24", "2020-03-01" , "2020-03-15" , "2020-05-05") , location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ) )
df2 <- data.frame(id=c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5) , obs_time=c(
"2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",
"2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",
"2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" ,
"2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" ,
"2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30",
"2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" ,
"2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
temp_celsius=runif(25, min=35.8, max=42.0) )
The lubridate ymd_hm and ymd functions were used to convert the factors with dates into POSIX date variables.
Be aware the data is not completely sorted, and case id 3 has 3 stays, but they are all consecutive with no days between, and on the first day there are 2 observations. Case 4 has two stays with days in between.
When merging the two data frames I need to assign the different observations to the different stays, so that I can make plots whose starting point (time zero) is the time they came in for the actual stay. For example, plot the development of temperature along the x axis for all cases by location group, where the first obs_time is time zero.
Although I have found other related threads, they are not the same.
The desired output would be like this:
id obs_time temp_celsius stay_id stay_day location
1 1 2020-03-09 01:00 40.53805 1 1 a
2 1 2020-03-09 10:00 37.54832 1 1 a
3 1 2020-03-10 05:00 38.78600 1 2 a
4 2 2020-02-15 08:00 36.19048 1 1 a
5 2 2020-02-16 09:00 37.74323 1 2 a
6 2 2020-02-17 08:00 41.83050 1 3 a
7 3 2020-04-16 14:30 39.82978 1 1 a
8 3 2020-04-16 07:30 39.84554 1 1 a
9 3 2020-04-17 15:00 38.31164 1 2 a
10 3 2020-04-25 07:20 36.37992 3 2 b
11 3 2020-04-18 10:00 38.65261 1 3 a
12 3 2020-04-19 10:30 38.94991 2 1 b
13 3 2020-04-20 12:00 36.84384 2 2 b
14 3 2020-04-21 12:00 35.81786 2 3 b
15 3 2020-04-22 09:30 39.20979 2 4 b
16 3 2020-04-24 23:00 41.39876 3 1 b
17 3 2020-04-23 17:30 37.68251 2 5 b
18 4 2020-03-01 08:00 41.55690 1 1 a
19 4 2020-03-02 08:00 38.53060 1 2 a
20 4 2020-03-03 08:00 39.99385 1 3 a
21 4 2020-03-15 16:45 38.29500 2 1 a
22 4 2020-03-16 08:00 41.20947 2 2 a
23 5 2020-05-05 13:45 36.43556 1 1 a
24 5 2020-05-06 08:00 41.06712 1 2 a
25 5 2020-05-07 11:00 36.76612 1 3 a
I hope someone can help me with this issue.
I think this just about covers it. Rolling joins from data.table and a little manipulation should get you there.
set.seed(1)
library(data.table)
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
in_date=c("2020-03-09", "2020-02-15" , "2020-04-16" , "2020-04-19", "2020-04-24", "2020-03-01" , "2020-03-15" , "2020-05-05") ,
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ) )
df2 <- data.frame(id=c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5) , obs_time=c(
"2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",
"2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",
"2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" ,
"2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" ,
"2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30",
"2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" ,
"2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
temp_celsius=runif(25, min=35.8, max=42.0))
setDT(df1)
setDT(df2)
df1[, c('in_date') := .(lubridate::ymd(in_date))]
df1[, stay_id := 1:.N, by = id]
df2[, obs_time := lubridate::ymd_hm(obs_time)]
df2[, obs_date := as.Date(obs_time)]
df1[df2, .(id, in_date, obs_time, temp_celsius, stay_id, location), on = c('id', 'in_date' = 'obs_date'), roll = Inf][
, stay_day := as.numeric(difftime(as.Date(obs_time), min(as.Date(in_date)), units = 'day')) + 1, by = .(id, stay_id)][, in_date := NULL][]
#> id obs_time temp_celsius stay_id location stay_day
#> 1: 1 2020-03-09 01:00:00 37.44615 1 a 1
#> 2: 1 2020-03-09 10:00:00 38.10717 1 a 1
#> 3: 1 2020-03-10 05:00:00 39.35169 1 a 2
#> 4: 2 2020-02-15 08:00:00 41.43089 1 a 1
#> 5: 2 2020-02-16 09:00:00 37.05043 1 a 2
#> 6: 2 2020-02-17 08:00:00 41.37002 1 a 3
#> 7: 3 2020-04-16 14:30:00 41.65699 1 a 1
#> 8: 3 2020-04-16 07:30:00 39.89695 1 a 1
#> 9: 3 2020-04-17 15:00:00 39.70051 1 a 2
#> 10: 3 2020-04-25 07:20:00 36.18307 3 b 2
#> 11: 3 2020-04-18 10:00:00 37.07704 1 a 3
#> 12: 3 2020-04-19 10:30:00 36.89465 2 b 1
#> 13: 3 2020-04-20 12:00:00 40.05954 2 b 2
#> 14: 3 2020-04-21 12:00:00 38.18144 2 b 3
#> 15: 3 2020-04-22 09:30:00 40.57302 2 b 4
#> 16: 3 2020-04-24 23:00:00 38.88574 3 b 1
#> 17: 3 2020-04-23 17:30:00 40.24923 2 b 5
#> 18: 4 2020-03-01 08:00:00 41.94982 1 a 1
#> 19: 4 2020-03-02 08:00:00 38.15622 1 a 2
#> 20: 4 2020-03-03 08:00:00 40.62016 1 a 3
#> 21: 4 2020-03-15 16:45:00 41.59517 2 a 1
#> 22: 4 2020-03-16 08:00:00 37.11528 2 a 2
#> 23: 5 2020-05-05 13:45:00 39.84038 1 a 1
#> 24: 5 2020-05-06 08:00:00 36.57844 1 a 2
#> 25: 5 2020-05-07 11:00:00 37.45677 1 a 3
#> id obs_time temp_celsius stay_id location stay_day
Created on 2020-07-16 by the reprex package (v0.3.0)
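If roll = Inf is unfamiliar: for each observation it matches the metadata row with the most recent in_date at or before the observation date (last observation carried forward, with no limit on the gap). A tiny self-contained illustration on toy data (not the OP's):

```r
library(data.table)

meta <- data.table(id = 1L,
                   in_date = as.Date(c("2020-01-01", "2020-01-10")),
                   stay_id = 1:2)
obs <- data.table(id = 1L,
                  obs_date = as.Date(c("2020-01-02", "2020-01-11")))

# Each obs_date rolls back to the latest in_date <= obs_date,
# so the first observation gets stay 1 and the second gets stay 2.
meta[obs, on = .(id, in_date = obs_date), roll = Inf]
```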

Fill in missing rows for dates by group [duplicate]

This question already has answers here:
Efficient way to Fill Time-Series per group
(2 answers)
Filling missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 4 years ago.
I have a data table like this, just much bigger:
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- c(as.Date("2017-01-01", "%Y-%m-%d"), as.Date("2017-05-01", "%Y-%m-%d"),
          as.Date("2017-06-01", "%Y-%m-%d"), as.Date("2017-02-01", "%Y-%m-%d"),
          as.Date("2017-04-01", "%Y-%m-%d"), as.Date("2017-05-01", "%Y-%m-%d"),
          as.Date("2017-06-01", "%Y-%m-%d"), as.Date("2017-01-01", "%Y-%m-%d"),
          as.Date("2017-04-01", "%Y-%m-%d"), as.Date("2017-05-01", "%Y-%m-%d"))
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
my_data <- data.table(customer_id,account_id,time,tenor,variable_x)
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-02-01 1 120
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
in which, for each pair of customer_id and account_id, I should observe monthly observations from 2017-01-01 to 2017-06-01, but for some customer_id, account_id pairs some dates in this sequence of 6 months are missing. I would like to fill in those missing dates so that each customer_id, account_id pair has observations for all 6 months, just with NA for the variables tenor and variable_x. That is, it should look like this:
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-02-01 NA NA
1 11 2017-03-01 NA NA
1 11 2017-04-01 NA NA
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-01-01 NA NA
2 55 2017-02-01 1 120
2 55 2017-03-01 NA NA
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-02-01 NA NA
3 38 2017-03-01 NA NA
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
3 38 2017-06-01 NA NA
I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 by using
ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")
and then merge it to the original data with
ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by="time", all.x=TRUE)
but it is not working. Please, do you know how to add such rows with dates for each customer_id, account_id pair?
We can do a join. Create the sequence of 'time' from min to max by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and join on those columns and 'time':
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time = ts1), .(customer_id, account_id)],
        on = .(customer_id, account_id, time)]
# customer_id account_id time tenor variable_x
# 1: 1 11 2017-01-01 1 87
# 2: 1 11 2017-02-01 NA NA
# 3: 1 11 2017-03-01 NA NA
# 4: 1 11 2017-04-01 NA NA
# 5: 1 11 2017-05-01 2 90
# 6: 1 11 2017-06-01 3 100
# 7: 2 55 2017-01-01 NA NA
# 8: 2 55 2017-02-01 1 120
# 9: 2 55 2017-03-01 NA NA
#10: 2 55 2017-04-01 2 130
#11: 2 55 2017-05-01 3 150
#12: 2 55 2017-06-01 4 12
#13: 3 38 2017-01-01 1 13
#14: 3 38 2017-02-01 NA NA
#15: 3 38 2017-03-01 NA NA
#16: 3 38 2017-04-01 2 15
#17: 3 38 2017-05-01 3 14
#18: 3 38 2017-06-01 NA NA
Or using tidyverse
library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
mutate(time = list(ts1)) %>%
unnest %>%
left_join(my_data)
Or with complete from tidyr
my_data %>%
complete(nesting(customer_id, account_id), time = ts1)
A different data.table approach:
my_data2 <- my_data[, .(time = seq(as.Date("2017/01/01"), as.Date("2017/06/01"),
by = "month")), by = list(customer_id, account_id)]
merge(my_data2, my_data, all.x = TRUE)
customer_id account_id time tenor variable_x
1: 1 11 2017-01-01 1 87
2: 1 11 2017-02-01 NA NA
3: 1 11 2017-03-01 NA NA
4: 1 11 2017-04-01 NA NA
5: 1 11 2017-05-01 2 90
6: 1 11 2017-06-01 3 100
7: 2 55 2017-01-01 NA NA
8: 2 55 2017-02-01 1 120
9: 2 55 2017-03-01 NA NA
10: 2 55 2017-04-01 2 130
11: 2 55 2017-05-01 3 150
12: 2 55 2017-06-01 4 12
13: 3 38 2017-01-01 1 13
14: 3 38 2017-02-01 NA NA
15: 3 38 2017-03-01 NA NA
16: 3 38 2017-04-01 2 15
17: 3 38 2017-05-01 3 14
18: 3 38 2017-06-01 NA NA

Fill in missing cases till specific condition per group

I'm attempting to create a data frame that shows all of the in-between months in my data set, by subject. Here is an example of what the data looks like:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01'))
colnames(dat) <- c('id', 'value', 'date')
dat$Out.Of.Study <- c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
dat
id value date Out.Of.Study
1 1 30 2017-01-01
2 1 30 2017-02-01
3 1 25 2017-04-01 Out
4 2 25 2017-02-01 Out
5 3 25 2017-01-01
6 3 25 2017-02-01
7 3 25 2017-03-01 Out
8 4 20 2017-01-01
9 4 20 2017-02-01
10 4 20 2017-04-01 Out
If I want to show the in-between months where no data was collected (but the subject was still enrolled in the study), I can use the complete() function. However, the issue is that I get all missing months for each subject id, based on the min and max month identified in the data set:
## Add Dates by Group
library(tidyr)
complete(dat, id, date)
id date value Out.Of.Study
1 1 2017-01-01 30
2 1 2017-02-01 30
3 1 2017-03-01 NA <NA>
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA <NA>
6 2 2017-02-01 25 Out
7 2 2017-03-01 NA <NA>
8 2 2017-04-01 NA <NA>
9 3 2017-01-01 25
10 3 2017-02-01 25
11 3 2017-03-01 25 Out
12 3 2017-04-01 NA <NA>
13 4 2017-01-01 20
14 4 2017-02-01 20
15 4 2017-03-01 NA <NA>
16 4 2017-04-01 20 Out
The issue with this is that I don't want the missing months to exceed the subject's final observed month (essentially, I have subjects who are censored and would need to be removed from the study) or to show up prior to the month a subject started the study. For example, subject 2 was only a participant in the month '2017-02-01'. Therefore, I'd like the data to represent that this was the only month they were in the study, and not have them represented by the extra months after and the extra month before, as shown above. The same is the case with subject 3, who has an extra month even though they are out of the study.
Perhaps complete() isn't the best way to go about this?
This can be solved by creating a sequence of months individually for each id and by joining the sequences with dat to complete the missing months.
1. data.table
(The question is tagged with tidyr. But as I am more acquainted with data.table I have tried this first.)
library(data.table)
# coerce date strings to class Date
setDT(dat)[, date := as.Date(date)]
# create sequence of months for each id
sdt <- dat[, .(date = seq(min(date), max(date), "month")), by = id]
# join
dat[sdt, on = .(id, date)]
id value date Out.Of.Study
1: 1 30 2017-01-01
2: 1 30 2017-02-01
3: 1 NA 2017-03-01 <NA>
4: 1 25 2017-04-01 Out
5: 2 25 2017-02-01 Out
6: 3 25 2017-01-01
7: 3 25 2017-02-01
8: 3 25 2017-03-01 Out
9: 4 20 2017-01-01
10: 4 20 2017-02-01
11: 4 NA 2017-03-01 <NA>
12: 4 20 2017-04-01 Out
Note that there is only one row for id == 2 as requested by the OP.
This approach requires coercing date from factor to class Date to make sure that all missing months will be completed.
This is also safer than relying on the available date factors in the dataset. For illustration, let's assume that id == 4 is Out in month 2017-06-01 (June) instead of 2017-04-01 (April). Then there would be no month 2017-05-01 (May) anywhere in the dataset, and the final result would be incomplete.
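To see the point: the generated sequence contains every month in the range, regardless of which months actually appear in the data.

```r
# seq() over Date produces all six first-of-month dates, including any
# month that happens to be absent from the dataset.
seq(as.Date("2017-01-01"), as.Date("2017-06-01"), by = "month")
#> [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01" "2017-06-01"
```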
Without creating the temporary variable sdt the code becomes
library(data.table)
setDT(dat)[, date := as.Date(date)][
dat[, .(date = seq(min(date), max(date), "month")), by = id], on = .(id, date)]
2. tidyr / dplyr
library(dplyr)
library(tidyr)
# coerce date strings to class Date
dat <- dat %>%
mutate(date = as.Date(date))
dat %>%
# create sequence of months for each id
group_by(id) %>%
expand(date = seq(min(date), max(date), "month")) %>%
# join to complete the missing month for each id
left_join(dat, by = c("id", "date"))
# A tibble: 12 x 4
# Groups: id [?]
id date value Out.Of.Study
<dbl> <date> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-02-01 25 Out
6 3 2017-01-01 25 ""
7 3 2017-02-01 25 ""
8 3 2017-03-01 25 Out
9 4 2017-01-01 20 ""
10 4 2017-02-01 20 ""
11 4 2017-03-01 NA NA
12 4 2017-04-01 20 Out
There is a variant which does not update dat:
library(dplyr)
library(tidyr)
dat %>%
mutate(date = as.Date(date)) %>%
right_join(group_by(., id) %>%
expand(date = seq(min(date), max(date), "month")),
by = c("id", "date"))
I would still use complete (probably the right method to use here), but afterwards I would subset out the rows that come after the row containing "Out". You can do this with dplyr::between.
dat %>%
group_by(id) %>%
complete(date) %>%
# Filter rows that are between 1 and the one that has "Out"
filter(between(row_number(), 1, which(Out.Of.Study == "Out")))
id date value Out.Of.Study
<dbl> <fct> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA NA
6 2 2017-02-01 25 Out
7 3 2017-01-01 25 ""
8 3 2017-02-01 25 ""
9 3 2017-03-01 25 Out
10 4 2017-01-01 20 ""
11 4 2017-02-01 20 ""
12 4 2017-03-01 NA NA
13 4 2017-04-01 20 Out

R: na.locf not behaving as expected

I am trying to use the na.locf function inside a mutate and I am getting a strange answer. The data is ordered descending by date; if a column is NA, the result is taken from na.locf, and otherwise the value in the column is used. For most of the data the answer is returned as expected, but one row comes back not as the previous non-NA but as the next non-NA. If the data is ordered by date ascending and na.rm = F and fromLast = T are used, it works as expected, but I want to understand why the result is wrong when date is ordered descending.
The example is as follows:
example = data.frame(Date = factor(c("1/14/15", "1/29/15", "2/3/15",
"2/11/15", "2/15/15", "3/4/15","3/7/15", "3/7/15", "3/11/15",
"3/18/15", "3/21/15", "4/22/15", "4/22/15", "4/23/15", "5/6/15",
"5/13/15", "5/18/15", "5/24/15", "5/26/15", "5/28/15", "5/29/15",
"5/29/15", "6/25/15", "6/25/15","8/6/15", "8/15/15", "8/20/15",
"8/22/15", "8/22/15", "8/29/15")),
Scan = c(1, rep(NA, 21),2,rep(NA,7)),
Hours = c(rep(NA,3), rep(3,3), NA, 2, rep(3,3), NA, 2, 3, 2,
rep(3,5), NA, 2, rep(c(NA, 3),2), 3, NA, 2, 3)
)
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan),
Scan))
The issue in the result is in row 24, where scan_new comes back as 1 rather than 2:
Date Scan Hours date scan_date scan_new
23 3/7/15 NA 0 2015-03-07 <NA> 2
24 3/7/15 NA 2 2015-03-07 <NA> 1
25 3/4/15 NA 3 2015-03-04 <NA> 2
Interestingly, other rows with the same date are handled appropriately, for example lines 18-19:
Date Scan Hours date scan_date scan_new
18 4/22/15 NA 0 2015-04-22 <NA> 2
19 4/22/15 NA 2 2015-04-22 <NA> 2
For reference as noted above, the following provides the expected answer:
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan, na.rm = F, fromLast = T),
Scan))
Date Scan Hours date scan_date scan_new
6 3/4/15 NA 3 2015-03-04 <NA> 2
7 3/7/15 NA 0 2015-03-07 <NA> 2
8 3/7/15 NA 2 2015-03-07 <NA> 2
Can someone tell me why this is behaving this way?
In your first try, na.locf(Scan), the leading NAs are removed and the remaining values are recycled to the full length inside ifelse. You can see the results with na.rm = FALSE (or na.locf0; see comments) for reference:
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan, na.rm = FALSE),
Scan))
# Date Scan Hours date scan_date scan_new
# 1 8/29/15 NA 3 2015-08-29 <NA> NA
# 2 8/22/15 NA 0 2015-08-22 <NA> NA
# 3 8/22/15 NA 2 2015-08-22 <NA> NA
# 4 8/20/15 NA 3 2015-08-20 <NA> NA
# 5 8/15/15 NA 3 2015-08-15 <NA> NA
# 6 8/6/15 NA 0 2015-08-06 <NA> NA
# 7 6/25/15 2 0 2015-06-25 2015-06-25 2
# 8 6/25/15 NA 3 2015-06-25 <NA> 2
# 9 5/29/15 NA 0 2015-05-29 <NA> 2
# 10 5/29/15 NA 2 2015-05-29 <NA> 2
# 11 5/28/15 NA 3 2015-05-28 <NA> 2
# 12 5/26/15 NA 3 2015-05-26 <NA> 2
# 13 5/24/15 NA 3 2015-05-24 <NA> 2
# 14 5/18/15 NA 3 2015-05-18 <NA> 2
# 15 5/13/15 NA 3 2015-05-13 <NA> 2
# 16 5/6/15 NA 2 2015-05-06 <NA> 2
# 17 4/23/15 NA 3 2015-04-23 <NA> 2
# 18 4/22/15 NA 0 2015-04-22 <NA> 2
# 19 4/22/15 NA 2 2015-04-22 <NA> 2
# 20 3/21/15 NA 3 2015-03-21 <NA> 2
# 21 3/18/15 NA 3 2015-03-18 <NA> 2
# 22 3/11/15 NA 3 2015-03-11 <NA> 2
# 23 3/7/15 NA 0 2015-03-07 <NA> 2
# 24 3/7/15 NA 2 2015-03-07 <NA> 2
# 25 3/4/15 NA 3 2015-03-04 <NA> 2
# 26 2/15/15 NA 3 2015-02-15 <NA> 2
# 27 2/11/15 NA 3 2015-02-11 <NA> 2
# 28 2/3/15 NA 0 2015-02-03 <NA> 2
# 29 1/29/15 NA 0 2015-01-29 <NA> 2
# 30 1/14/15 1 0 2015-01-14 2015-01-14 1
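The recycling can be reproduced with a tiny vector, independent of the OP's data (assuming zoo is loaded for na.locf):

```r
library(zoo)

x <- c(NA, 1, NA, 2, NA)
na.locf(x)  # default na.rm = TRUE drops the leading NA: c(1, 1, 2, 2), length 4
# ifelse recycles the length-4 'yes' vector to length 5, so the NA positions
# 1, 3, 5 pick up elements 1, 3, 5 of rep(c(1, 1, 2, 2), length.out = 5):
ifelse(is.na(x), na.locf(x), x)
#> [1] 1 1 2 2 1
```

The last element comes back as 1 instead of the carried-forward 2, which is exactly the kind of misalignment seen in row 24 above.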
