Select rows based on multiple conditions from two independent databases - r

I have two independent datasets. The first contains event dates; each ID has only one "Eventdate", as follows:
data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In the second dataset, one ID can have multiple event codes (Eventcode), each with its own event date (Eventdate), as follows:
data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
The two datasets are linked by ID, but they do not contain exactly the same set of IDs.
I would like to select cases in data2 with conditions:
Match by ID
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!

You can try this approach :
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = {
inds <- Eventdate.y >= Eventdate.x
val <- sum(inds, na.rm = TRUE)
if(val == 1) Eventcode[inds]
else if(val > 1) sample(Eventcode[inds], 1)
else NA_real_
})
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
The more complicated logic for Eventcode is only there for the random selection. If you are OK with selecting the first value, as is done for Eventdate, you can simplify it to:
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
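One caveat about the randomised version above: sample() draws from every row with Eventdate.y >= Eventdate.x, so for an ID with several qualifying dates it can return a code that belongs to a later date than the reported Eventdate. Below is a minimal sketch of one way around this, assuming dplyr >= 1.0.0 (for slice_sample()) and dates stored as class Date. The names `candidates` and `result` are illustrative only.

```r
library(dplyr)

data1 <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Eventdate = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01",
                        "2019-04-01", "2019-05-01", "2019-06-01"))
)
data2 <- data.frame(
  ID = c(1, 1, 2, 3, 3, 3, 4, 4, 7),
  Eventcode = c(201, 202, 201, 204, 205, 206, 209, 208, 203),
  Eventdate = as.Date(c("2019-01-01", "2019-01-01", "2019-02-11",
                        "2019-02-15", "2019-03-01", "2019-03-15",
                        "2019-03-10", "2019-03-20", "2019-06-02"))
)

candidates <- data2 %>%
  rename(Eventdate.data2 = Eventdate) %>%
  inner_join(data1, by = "ID") %>%
  # keep only data2 events on or after the data1 event
  filter(Eventdate.data2 >= Eventdate) %>%
  group_by(ID) %>%
  # earliest qualifying date per ID (ties on the same date are kept)
  filter(Eventdate.data2 == min(Eventdate.data2)) %>%
  # random pick among codes that share that earliest date
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(ID, Eventdate.data2, Eventcode)

# IDs without a qualifying event get NA in the data2 columns
result <- left_join(data1, candidates, by = "ID")
result
```

With this ordering of steps, the random draw only ever happens among rows that share the earliest qualifying date, so ID 1 returns 201 or 202 and ID 3 always returns 205.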

Does this work:
library(dplyr)
data1 %>% rename(Eventdate_dat1 = Eventdate) %>% left_join(data2, by = 'ID') %>%
group_by(ID) %>% filter(Eventdate >= Eventdate_dat1) %>%
mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate), TRUE ~ Eventdate),
Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode), TRUE ~ Eventcode)) %>%
distinct() %>% right_join(data1, by = 'ID') %>% select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups: ID [6]
ID Eventdate Eventdate.data2 Eventcode
<dbl> <chr> <chr> <dbl>
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01 NA NA
5 5 2019-05-01 NA NA
6 6 2019-06-01 NA NA

Related

How to deduplicate date sequences across non-consecutive rows in R?

I want to flag the first date in every window of at least 31 days for each id in my data.
Data:
library(tidyverse)
library(lubridate)
library(tibbletime)
D1 <- tibble(id = c(12,12,12,12,12,12,10,10,10,10),
index_date=c("2019-01-01","2019-01-07","2019-01-21","2019-02-02",
"2019-02-09","2019-03-06","2019-01-05","2019-02-01","2019-02-02","2019-02-08"))
D1
# A tibble: 10 x 2
id index_date
<dbl> <chr>
1 12 2019-01-01
2 12 2019-01-07
3 12 2019-01-21
4 12 2019-02-02
5 12 2019-02-09
6 12 2019-03-06
7 10 2019-01-05
8 10 2019-02-01
9 10 2019-02-02
10 10 2019-02-08
The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:
temp <- D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start"),
keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)
Result:
id index_date keyed_to_index_date keep
<dbl> <date> <date> <lgl>
1 12 2019-01-01 2019-01-01 TRUE
2 12 2019-01-07 2019-01-01 FALSE
3 12 2019-01-21 2019-01-01 FALSE
4 12 2019-02-02 2019-02-02 TRUE
5 12 2019-02-09 2019-02-02 FALSE
6 12 2019-03-06 2019-03-06 TRUE
7 10 2019-01-05 2019-01-05 TRUE
8 10 2019-02-01 2019-02-01 TRUE
9 10 2019-02-02 2019-02-01 FALSE
10 10 2019-02-08 2019-02-01 FALSE
Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?
UPDATE: Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:
D2 <- tibble(id = c("A","A","A","B","B","B","B","B","C","C","C"),
index_date = c("2019-03-04","2019-03-05","2019-03-06",
"2019-03-01","2019-03-02","2019-03-04","2019-03-05","2019-03-06",
"2019-03-03","2019-03-04","2019-03-05"))
D2
id index_date
<chr> <chr>
1 A 2019-03-04
2 A 2019-03-05
3 A 2019-03-06
4 B 2019-03-01
5 B 2019-03-02
6 B 2019-03-04
7 B 2019-03-05
8 B 2019-03-06
9 C 2019-03-03
10 C 2019-03-04
11 C 2019-03-05
I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are Rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first `index_date` for a particular `id` or the first after a two-day skip.
Code:
t3 <- D2 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date,
period = '2 d',
side = "start",
start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(id, index_date)
Result:
> t3
# A time tibble: 11 x 4
# Index: index_date
# Groups: id [3]
id index_date keyed_to_index_date keep
<chr> <date> <date> <lgl>
1 A 2019-03-04 2019-03-04 TRUE
2 A 2019-03-05 2019-03-04 FALSE
3 A 2019-03-06 2019-03-06 TRUE
4 B 2019-03-01 2019-03-01 TRUE
5 B 2019-03-02 2019-03-01 FALSE
6 B 2019-03-04 2019-03-04 TRUE
7 B 2019-03-05 2019-03-05 TRUE
8 B 2019-03-06 2019-03-05 FALSE
9 C 2019-03-03 2019-03-03 TRUE
10 C 2019-03-04 2019-03-03 FALSE
11 C 2019-03-05 2019-03-05 TRUE
Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:
t4 <-
D2 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 |
accumulate(c(0, diff(index_date)), ~ if_else(.x >= 2,
.y,
.x + .y)
) >= 2
)
Result:
> t4
# A tibble: 11 x 3
# Groups: id [3]
id index_date keep
<chr> <date> <lgl>
1 A 2019-03-04 TRUE
2 A 2019-03-05 FALSE
3 A 2019-03-06 TRUE
4 B 2019-03-01 TRUE
5 B 2019-03-02 FALSE
6 B 2019-03-04 TRUE
7 B 2019-03-05 FALSE
8 B 2019-03-06 TRUE
9 C 2019-03-03 TRUE
10 C 2019-03-04 FALSE
11 C 2019-03-05 TRUE
What is wrong with the tibbletime approach in this example?
One option utilizing dplyr, lubridate and purrr could be:
D1 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31)
id index_date keep
<dbl> <date> <lgl>
1 12 2019-01-01 TRUE
2 12 2019-01-07 FALSE
3 12 2019-01-21 FALSE
4 12 2019-02-02 TRUE
5 12 2019-02-09 FALSE
6 12 2019-03-06 TRUE
7 10 2019-01-05 TRUE
8 10 2019-02-01 FALSE
9 10 2019-02-02 FALSE
10 10 2019-02-08 TRUE
You just need to add the start_date argument to collapse_index:
D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start", start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(desc(id), index_date)
#> # A time tibble: 10 x 4
#> # Index: index_date
#> # Groups: id [2]
#> id index_date keyed_to_index_date keep
#> <dbl> <date> <date> <lgl>
#> 1 12 2019-01-01 2019-01-01 TRUE
#> 2 12 2019-01-07 2019-01-01 FALSE
#> 3 12 2019-01-21 2019-01-01 FALSE
#> 4 12 2019-02-02 2019-02-02 TRUE
#> 5 12 2019-02-09 2019-02-02 FALSE
#> 6 12 2019-03-06 2019-03-06 TRUE
#> 7 10 2019-01-05 2019-01-05 TRUE
#> 8 10 2019-02-01 2019-01-05 FALSE
#> 9 10 2019-02-02 2019-01-05 FALSE
#> 10 10 2019-02-08 2019-02-08 TRUE
Created on 2020-09-11 by the reprex package (v0.3.0)
You can use accumulate() from purrr.
D1 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = index_date == accumulate(index_date, ~ if(.y - .x >= 31) .y else .x))
# id index_date keep
# <dbl> <date> <lgl>
# 1 12 2019-01-01 TRUE
# 2 12 2019-01-07 FALSE
# 3 12 2019-01-21 FALSE
# 4 12 2019-02-02 TRUE
# 5 12 2019-02-09 FALSE
# 6 12 2019-03-06 TRUE
# 7 10 2019-01-05 TRUE
# 8 10 2019-02-01 FALSE
# 9 10 2019-02-02 FALSE
# 10 10 2019-02-08 TRUE
The iteration rule is as follows:
1. 2019-01-07 - 2019-01-01 = 6 < 31 then return 2019-01-01
2. 2019-01-21 - 2019-01-01 = 20 < 31 then return 2019-01-01
3. 2019-02-02 - 2019-01-01 = 32 >= 31 then return (2019-02-02)*
4. 2019-02-09 - (2019-02-02)* = 7 < 31 then return 2019-02-02
5. etc.
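The rule listed above can also be written as a plain loop, which makes the "restart the window at each flagged date" behaviour explicit. This is a minimal base R sketch; `flag_window_starts` is a hypothetical helper name, and it assumes the dates are already sorted within one id. The tibbletime results printed in the question look consistent with collapse_index() cutting the index into fixed-width periods counted from start_date instead of restarting at each flagged date, but I have not checked that against the tibbletime source, so treat it as an assumption.

```r
# Hypothetical helper: flag the first date of each window, restarting the
# window at every flagged date. `dates` must be sorted in ascending order.
flag_window_starts <- function(dates, gap_days) {
  keep <- logical(length(dates))
  anchor <- as.Date(NA)  # last flagged date, none yet
  for (i in seq_along(dates)) {
    if (is.na(anchor) || as.numeric(dates[i] - anchor) >= gap_days) {
      keep[i] <- TRUE
      anchor <- dates[i]
    }
  }
  keep
}

# Group "B" from D2 with a 2-day window
b_dates <- as.Date(c("2019-03-01", "2019-03-02", "2019-03-04",
                     "2019-03-05", "2019-03-06"))
flag_window_starts(b_dates, 2)
# TRUE FALSE TRUE FALSE TRUE

# id 10 from D1 with a 31-day window
id10_dates <- as.Date(c("2019-01-05", "2019-02-01", "2019-02-02", "2019-02-08"))
flag_window_starts(id10_dates, 31)
# TRUE FALSE FALSE TRUE
```

Inside a grouped mutate() the helper could be called once per id, in the same place where the accumulate() expression is used.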

How to show missing dates in case of application of rolling function

Suppose I have a data df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
activation_dt = round(runif(10)*100,0) +d,
expiry_dt = d+round(runif(10)*100,0)+c(rep(180,5), rep(240,5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month. That I have done by the following method.
# Getting required result
df %>% arrange(activation_dt) %>%
pivot_longer(cols = c(activation_dt, expiry_dt),
names_to = "event",
values_to = "event_date") %>%
mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
mutate(dummy2 = floor_date(event_date, "month")) %>%
arrange(dummy2) %>% group_by(dummy2) %>%
summarise(dummy=sum(dummy)) %>%
mutate(dummy = cumsum(dummy)) %>%
select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
Now I am having a problem with how to deal with the missing months, e.g. April 2020 to June 2020.
A data.table solution:
1. Generate the sequence of months.
2. Use a non-equi join to find the policies active in each month and count them.
library(lubridate)
library(data.table)
setDT(df)
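# mindat was undefined in the original snippet; the earliest activation date is used here
months <- seq(lubridate::floor_date(min(df$activation_dt), 'month'), lubridate::floor_date(max(df$expiry_dt), 'month'), by = 'month')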
months <- data.table(months)
df[,c("activation_dt_month","expiry_dt_month"):=.(lubridate::floor_date(activation_dt,'month'),
lubridate::floor_date(expiry_dt,'month'))]
df[months, .(months),on = .(activation_dt_month<=months,expiry_dt_month>=months)][,.(nb=.N),by=months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results with gaps in months.
First use map2 to create a sequence of months between activation and expiration for each row of data. This will allow you to group by month/year to count number of active policies for each month.
library(tidyverse)
library(lubridate)
df %>%
mutate(month = map2(floor_date(activation_dt, "month"),
floor_date(expiry_dt, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
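The counting logic shared by both answers (a policy is active in a month if its activation month is on or before that month and its expiry month is on or after it) can be sketched in base R as well. This is only an illustration: the data frame is typed in from the printed df in the question instead of being regenerated with set.seed(), and the names `act_month`, `exp_month`, `months` and `counts` are my own.

```r
df <- data.frame(
  id = 1:10,
  activation_dt = as.Date(c("2020-03-12", "2020-02-14", "2020-01-21",
                            "2020-03-18", "2020-02-21", "2020-01-05",
                            "2020-03-11", "2020-03-06", "2020-01-05",
                            "2020-01-12")),
  expiry_dt = as.Date(c("2020-08-07", "2020-07-26", "2020-09-01",
                        "2020-07-07", "2020-07-27", "2020-11-04",
                        "2020-11-20", "2020-10-03", "2020-09-04",
                        "2020-09-14"))
)

# Floor each date to the first day of its month
act_month <- as.Date(format(df$activation_dt, "%Y-%m-01"))
exp_month <- as.Date(format(df$expiry_dt, "%Y-%m-01"))

# Every month from the first activation to the last expiry, with no gaps
months <- seq(min(act_month), max(exp_month), by = "month")

# A policy counts as active in a month if that month lies between its
# activation month and its expiry month (both inclusive)
counts <- vapply(
  seq_along(months),
  function(i) sum(act_month <= months[i] & exp_month >= months[i]),
  integer(1)
)

data.frame(month = months, count = counts)
# counts: 4 6 10 10 10 10 10 7 6 3 2
```

Because the month sequence is generated independently of the data, months in which nothing starts or ends (April to June here) still appear in the result.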

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, remove the temporary grouping column
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
library(dplyr)
myData %>%
  group_by(id, date_cat) %>%
  # rank on the Date column date1; the character column date would sort as text
  top_n(1, date1)
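In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A minimal sketch of the same grouping idea, assuming that version of dplyr and ranking on the Date column date1 so that the comparison is chronological (`last_rows` is an illustrative name):

```r
library(dplyr)

myData <- data.frame(
  id = c(3, 3, 6, 6, 4, 4, 3, 3),
  date = c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7",
           "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3"),
  date_cat = c(1, 1, 1, 1, 2, 2, 3, 3),
  measurement = c(10, 13, 14, 13, 12, 11, 14, 17)
)
myData$date1 <- as.Date(myData$date)

last_rows <- myData %>%
  group_by(id, date_cat) %>%
  # keep the row with the latest date1 in each id / date_cat group
  slice_max(date1, n = 1) %>%
  ungroup()

last_rows
```

Unlike the rleid() approach, this does not depend on the rows of one id and date_cat being next to each other in the data.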

Calculate each overlapping date ranges from two independent databases in r

I have two independent databases. The first contains follow-up data (start date and end date), as follows:
> data1 <- data.frame("ID" = c(1,1,1,1,2,2,2), "FUstart" = c("2019-01-01", "2019-04-01", "2019-07-01", "2019-10-01", "2019-04-01", "2019-07-01", "2019-10-01"), "FUend" = c("2019-03-31", "2019-06-30", "2019-09-30", "2019-12-31", "2019-06-30", "2019-09-30", "2019-12-31"))
> data1
ID FUstart FUend
1 1 2019-01-01 2019-03-31
2 1 2019-04-01 2019-06-30
3 1 2019-07-01 2019-09-30
4 1 2019-10-01 2019-12-31
5 2 2019-04-01 2019-06-30
6 2 2019-07-01 2019-09-30
7 2 2019-10-01 2019-12-31
Another contains drug use data (also start date and end date). As follows:
> data2 <- data.frame("ID" = c(1,1,1,2), "Drugstart" = c("2019-01-11", "2019-03-26", "2019-06-26", "2019-03-20"), "Drugend" = c("2019-01-20", "2019-04-05", "2019-10-05", "2019-10-10"))
> data2
ID Drugstart Drugend
1 1 2019-01-11 2019-01-20
2 1 2019-03-26 2019-04-05
3 1 2019-06-26 2019-10-05
4 2 2019-03-20 2019-10-10
The two databases are linked by "ID". The problem is that the rows for each ID may not be the same. I would like to calculate overlapping days and add it into the data1. I would expect to have the following results:
> data1
ID FUstart FUend Overlapping.Days
1 1 2019-01-01 2019-03-31 16
2 1 2019-04-01 2019-06-30 10
3 1 2019-07-01 2019-09-30 92
4 1 2019-10-01 2019-12-31 5
5 2 2019-04-01 2019-06-30 91
6 2 2019-07-01 2019-09-30 92
7 2 2019-10-01 2019-12-31 10
Note that data1 is the base dataset; the overlapping days from data2 should be added to data1. Many thanks for helping!
An option using data.table::foverlaps:
foverlaps(data1, data2)[,
sum(1L + pmin(Drugend, FUend) - pmax(Drugstart, FUstart)),
.(ID, FUstart, FUend)]
Output (I am also getting slightly different numbers from the OP's expected output):
ID FUstart FUend V1
1: 1 2019-01-01 2019-03-31 16
2: 1 2019-04-01 2019-06-30 10
3: 1 2019-07-01 2019-09-30 92
4: 1 2019-10-01 2019-12-31 5
5: 2 2019-04-01 2019-06-30 91
6: 2 2019-07-01 2019-09-30 92
7: 2 2019-10-01 2019-12-31 10
data:
library(data.table)
setDT(data1)
cols <- paste0("FU", c("start","end"))
data1[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
setkeyv(data1, c("ID", cols))
#too lazy to generalize and hence copy paste
setDT(data2)
cols <- paste0("Drug", c("start","end"))
data2[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
setkeyv(data2, c("ID", cols))
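The expression inside the foverlaps() call, 1L + pmin(Drugend, FUend) - pmax(Drugstart, FUstart), counts the days two inclusive date ranges share. A small base R sketch of that arithmetic, checked against the first follow-up period of ID 1; `overlap_days` is a hypothetical helper, and the clamp to zero is my addition for ranges that do not overlap at all (foverlaps() only returns overlapping pairs, so the answer does not need it):

```r
# Hypothetical helper: number of days shared by [start_a, end_a] and
# [start_b, end_b], with both end points counted as full days.
overlap_days <- function(start_a, end_a, start_b, end_b) {
  days <- as.numeric(pmin(end_a, end_b) - pmax(start_a, start_b)) + 1
  pmax(days, 0)  # non-overlapping ranges give 0 instead of a negative number
}

fu_start <- as.Date("2019-01-01")
fu_end   <- as.Date("2019-03-31")

# Drug periods of ID 1
drug_start <- as.Date(c("2019-01-11", "2019-03-26", "2019-06-26"))
drug_end   <- as.Date(c("2019-01-20", "2019-04-05", "2019-10-05"))

per_period <- overlap_days(drug_start, drug_end, fu_start, fu_end)
per_period       # 10 6 0
sum(per_period)  # 16, the value shown for ID 1, 2019-01-01 to 2019-03-31
```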

Fill in missing cases till specific condition per group

I'm attempting to create a data frame that shows all of the in between months for my data set, by subject. Here is an example of what the data looks like:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01'))
colnames(dat) <- c('id', 'value', 'date')
dat$Out.Of.Study <- c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
dat
id value date Out.Of.Study
1 1 30 2017-01-01
2 1 30 2017-02-01
3 1 25 2017-04-01 Out
4 2 25 2017-02-01 Out
5 3 25 2017-01-01
6 3 25 2017-02-01
7 3 25 2017-03-01 Out
8 4 20 2017-01-01
9 4 20 2017-02-01
10 4 20 2017-04-01 Out
If I want to show the in between months where no data was collected (but the subject was still enrolled in the study) I can use the complete() function. However, the issue is that I get all missing months for each subject id based on the min and max month identified in the data set:
## Add Dates by Group
library(tidyr)
complete(dat, id, date)
id date value Out.Of.Study
1 1 2017-01-01 30
2 1 2017-02-01 30
3 1 2017-03-01 NA <NA>
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA <NA>
6 2 2017-02-01 25 Out
7 2 2017-03-01 NA <NA>
8 2 2017-04-01 NA <NA>
9 3 2017-01-01 25
10 3 2017-02-01 25
11 3 2017-03-01 25 Out
12 3 2017-04-01 NA <NA>
13 4 2017-01-01 20
14 4 2017-02-01 20
15 4 2017-03-01 NA <NA>
16 4 2017-04-01 20 Out
The issue with this is that I don't want the missing months to extend past the subject's final observed month (essentially, I have subjects who are censored and need to be removed from the study) or to show up before the month in which a subject started the study. For example, subject 2 was only a participant in the month '2017-02-01'. Therefore, I'd like the data to show that this was the only month they were in the study, without the extra month before and the extra months after, as shown above. The same applies to subject 3, who has an extra month even though they are out of the study.
Perhaps the complete() isn't the best way to go about this?
This can be solved by creating a sequence of months individually for each id and by joining the sequences with dat to complete the missing months.
1. data.table
(The question is tagged with tidyr. But as I am more acquainted with data.table I have tried this first.)
library(data.table)
# coerce date strings to class Date
setDT(dat)[, date := as.Date(date)]
# create sequence of months for each id
sdt <- dat[, .(date = seq(min(date), max(date), "month")), by = id]
# join
dat[sdt, on = .(id, date)]
id value date Out.Of.Study
1: 1 30 2017-01-01
2: 1 30 2017-02-01
3: 1 NA 2017-03-01 <NA>
4: 1 25 2017-04-01 Out
5: 2 25 2017-02-01 Out
6: 3 25 2017-01-01
7: 3 25 2017-02-01
8: 3 25 2017-03-01 Out
9: 4 20 2017-01-01
10: 4 20 2017-02-01
11: 4 NA 2017-03-01 <NA>
12: 4 20 2017-04-01 Out
Note that there is only one row for id == 2 as requested by the OP.
This approach requires coercing date from factor to class Date to make sure that all missing months are completed.
This is also safer than relying on the available date factor levels in the dataset. For illustration, let's assume that id == 4 is Out in month 2017-06-01 (June) instead of 2017-04-01 (April). Then, there would be no month 2017-05-01 (May) in the whole dataset and the final result would be incomplete.
Without creating the temporary variable sdt the code becomes
library(data.table)
setDT(dat)[, date := as.Date(date)][
dat[, .(date = seq(min(date), max(date), "month")), by = id], on = .(id, date)]
2. tidyr / dplyr
library(dplyr)
library(tidyr)
# coerce date strings to class Date
dat <- dat %>%
mutate(date = as.Date(date))
dat %>%
# create sequence of months for each id
group_by(id) %>%
expand(date = seq(min(date), max(date), "month")) %>%
# join to complete the missing month for each id
left_join(dat, by = c("id", "date"))
# A tibble: 12 x 4
# Groups: id [?]
id date value Out.Of.Study
<dbl> <date> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-02-01 25 Out
6 3 2017-01-01 25 ""
7 3 2017-02-01 25 ""
8 3 2017-03-01 25 Out
9 4 2017-01-01 20 ""
10 4 2017-02-01 20 ""
11 4 2017-03-01 NA NA
12 4 2017-04-01 20 Out
There is a variant which does not update dat:
library(dplyr)
library(tidyr)
dat %>%
mutate(date = as.Date(date)) %>%
right_join(group_by(., id) %>%
expand(date = seq(min(date), max(date), "month")),
by = c("id", "date"))
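For completeness, the same "build a month sequence per id, then join" idea can be sketched in base R without data.table or tidyr. This is an illustration under the assumption that date is already of class Date; `grid` and `completed` are my own names.

```r
dat <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  value = c(rep(30, 2), rep(25, 5), rep(20, 3)),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01", "2017-02-01",
                   "2017-01-01", "2017-02-01", "2017-03-01", "2017-01-01",
                   "2017-02-01", "2017-04-01")),
  Out.Of.Study = c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
)

# One row per month between each id's first and last observed date
grid <- do.call(rbind, lapply(split(dat, dat$id), function(d) {
  data.frame(id = d$id[1],
             date = seq(min(d$date), max(d$date), by = "month"))
}))

# Left join: months without an observation get NA in value and Out.Of.Study
completed <- merge(grid, dat, by = c("id", "date"), all.x = TRUE)
completed
```

As in the answers above, id 2 keeps its single month, and only the gaps inside each subject's own date range (March 2017 for ids 1 and 4) are filled with NA.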
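I would still use complete() (probably the right tool here), and afterwards drop the rows that come after the row marked "Out". You can do this with dplyr::between(). Note that, as the output shows, this keeps months before a subject's first observation (for example 2017-01-01 for id 2).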
dat %>%
group_by(id) %>%
complete(date) %>%
# Filter rows that are between 1 and the one that has "Out"
filter(between(row_number(), 1, which(Out.Of.Study == "Out")))
id date value Out.Of.Study
<dbl> <fct> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA NA
6 2 2017-02-01 25 Out
7 3 2017-01-01 25 ""
8 3 2017-02-01 25 ""
9 3 2017-03-01 25 Out
10 4 2017-01-01 20 ""
11 4 2017-02-01 20 ""
12 4 2017-03-01 NA NA
13 4 2017-04-01 20 Out
