Aggregate/merge over date range using data.table

Suppose I have two data.tables:
library(data.table)
library(lubridate)

summary <- data.table(period = c("A", "B", "C", "D"),
                      from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
                      to_date = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01")))
log <- data.table(date = ymd(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                               "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
                  event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
                  event2 = c(1, 8, 7, 3, 8, 4, 6, 3))
which look like this:
> summary
period from_date to_date
1: A 2017-01-01 2017-01-31
2: B 2017-01-03 2017-04-01
3: C 2017-02-08 2017-03-08
4: D 2017-03-07 2017-05-01
> log
date event1 event2
1: 2017-01-03 4 1
2: 2017-01-20 8 8
3: 2017-02-01 8 7
4: 2017-03-03 4 3
5: 2017-03-15 3 8
6: 2017-03-28 4 4
7: 2017-04-03 7 6
8: 2017-04-23 3 3
I would like to get the sum of event1 and event2 for each time period in the table summary.
I know I can do this:
summary[, c("event1", "event2") := .(sum(log[date >= from_date & date <= to_date, event1]),
                                     sum(log[date >= from_date & date <= to_date, event2])),
        by = period][]
to get the desired result:
period from_date to_date event1 event2
1: A 2017-01-01 2017-01-31 12 9
2: B 2017-01-03 2017-04-01 31 31
3: C 2017-02-08 2017-03-08 4 3
4: D 2017-03-07 2017-05-01 17 21
Now, in my real-life problem, I have about 30 columns to be summed (a set I may want to change later); summary has ~35,000 rows and log has ~40,000,000 rows. Is there an efficient way to achieve this?
Note: This is my first post here. I hope my question is clear and specific enough, please do make suggestions if there is anything I should do to improve the question. Thanks!

Yes, you can perform a non-equi join.
(Note I've changed log and summary to Log and Summary as the originals are already functions in R.)
Log[Summary,
on = c("date>=from_date", "date<=to_date"),
nomatch=0L,
allow.cartesian = TRUE][, .(from_date = date[1],
to_date = date.1[1],
event1 = sum(event1),
event2 = sum(event2)),
keyby = "period"]
To sum over a pattern of columns, use lapply with .SD:
joined_result <-
Log[Summary,
on = c("date>=from_date", "date<=to_date"),
nomatch = 0L,
allow.cartesian = TRUE]
cols <- grep("event[a-z]?[0-9]", names(joined_result), value = TRUE)
joined_result[, lapply(.SD, sum),
.SDcols = cols,
keyby = .(period,
from_date = date,
to_date = date.1)]

With data.table, it is possible to aggregate during a non-equi join using by = .EACHI.
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
lapply(.SD, sum), by = .EACHI]
date date event1 event2
1: 2017-01-01 2017-01-31 12 9
2: 2017-01-03 2017-04-01 31 31
3: 2017-02-08 2017-03-08 4 3
4: 2017-03-07 2017-05-01 17 21
With some additional clean-up:
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
c(period = period, lapply(.SD, sum)), by = .EACHI][
, setnames(.SD, 1:2, c("from_date", "to_date"))]
from_date to_date period event1 event2
1: 2017-01-01 2017-01-31 A 12 9
2: 2017-01-03 2017-04-01 B 31 31
3: 2017-02-08 2017-03-08 C 4 3
4: 2017-03-07 2017-05-01 D 17 21
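If you have ~30 columns to sum and the set may change, the same clean-up pattern extends with .SDcols. A sketch (my addition, reusing the column-pattern idea from the first answer):
cols <- grep("event[a-z]?[0-9]", names(log), value = TRUE)  # columns to aggregate
log[summary, on = .(date >= from_date, date <= to_date), nomatch = 0L,
    c(period = period, lapply(.SD, sum)), by = .EACHI, .SDcols = cols][
  , setnames(.SD, 1:2, c("from_date", "to_date"))]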

Related

Efficient way of selecting rows with a minimum time spacing between dates while grouping

I want to select rows of data with dates such that the dates have a minimum time difference of 3 months.
Here is an example:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-01-05
5: 2 numser2 2020-01-08
6: 2 numser2 2020-01-20
7: 2 numser2 2020-03-15
8: 2 numser2 2020-03-18
9: 2 numser3 2020-03-13
10: 2 numser3 2020-03-18
11: 3 numser3 2020-01-22
12: 4 numser4 2018-01-02
I want, by patient and numsermed, to keep the dates that are at least 3 months apart. I cannot simply use the successive differences. The expected result is:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-03-15
5: 2 numser3 2020-03-13
6: 3 numser3 2020-01-22
7: 4 numser4 2018-01-02
Here, for numser2 and patient 2, after 2019-12-12 the next date at least 3 months later is 2020-03-15, which I keep. I thus remove 2020-01-05, 2020-01-08, 2020-01-20.
I then remove 2020-03-18, which is 3 days after 2020-03-15.
Here is my solution with data.table:
library(data.table)
library(lubridate)
setkeyv(test, c("numsermed", "patient", "date"))
test[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
test[, N := .N, by = .(numsermed, patient)]
Nmax <- max(test$N)
test[, supp := 0]
for (i in 1:Nmax) {
  test[N > 1, supp := ifelse(i < indx & date < date[i] + 90, 1, 0),
       by = .(numsermed, patient)]
  test <- test[supp != 1]
  test[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
}
The idea is, for each row, to test the condition and then perform the subset. It seems to work, but on a million-row table it is rather slow (a few hours). I am sure there is an efficient way with a non-equi join or a rolling join in data.table, but I did not manage to write it. Could someone come up with a more efficient solution? dplyr solutions are of course welcome too.
The data:
library(data.table)
library(lubridate)
test <- setDT(list(patient = c(1:3, 2),
                   numsermed = c(paste0("numser", 1:3), "numser2"),
                   date = as_date(c("2020-01-08", "2020-01-20", "2020-01-22", "2019-12-12"))))
test<-rbind(test,data.table(patient=4,numsermed="numser4",date=as_date("2018-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2015-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-15")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-05")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-08")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-13")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-18")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-18")))
Edit
I propose a comparison of the proposed solutions: #Ben's, #chinsoon12's, and #astrofunkswag's.
Here is the test data:
library(data.table)
library(lubridate)
library(magrittr)
set.seed(1234)
origin <- "1970-01-01"
dt <- data.table(numsermed = sample(paste0("numsermed",1:30),10000,replace = T))
dt[,patient := sample(1:10000,.N,replace = T),by = numsermed]
dt[,date := sample((dmy("01.01.2019") %>% as.numeric()):(dmy("01.01.2020") %>% as.numeric()),.N),by = .(patient)]
and here are the 4 functions, including mine:
ben = function(dt){
dt[, c("idx", "date2") := list(.I, date - 90L)]
dt_final <- unique(dt[dt, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
roll = -Inf][order(i.date)], by = "idx")
setorderv(dt_final, c("patient", "numsermed", "i.date"))
return(dt_final[,.(patient,numsermed,date = i.date)])
}
chinson = function(dt){
dt[, d := as.integer(date)]
setkey(dt,date)
return( dt[dt[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1][,.(patient,numsermed,date)])
}
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
}
}
mon_diff <- function(d1, d2){
12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
library(tidyverse); library(zoo)
astrofun = function(dt){
return(
dt %>%
group_by(patient, numsermed) %>%
mutate(diff1 = mon_diff(date, lag(date)),
diff1 = if_else(is.na(diff1), 300, diff1)) %>%
mutate(diff2 = sum_reset_at(3)(diff1)) %>%
filter(diff2 >= 3) %>%
select(-contains('diff'))
)
}
denis = function(dt){
df <- copy(dt)
setkeyv(dt,c("numsermed","patient","date"))
df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
df[,N := .N,by = .(numsermed,patient)]
Nmax <- max(df[,N])
df[,supp := 0]
for(i in 1:Nmax){
df[N>1 ,supp := ifelse(i < indx & date < date[i] + 90,1,0),
by = .(numsermed,patient)]
df <- df[supp != 1 ]
df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
}
return(df[,.(patient,numsermed,date)])
}
First, none of them produces the same result! denis(dt) outputs 9833 lines, ben(dt) 9928, chinson(dt) 9929, and #astrofunkswag's astrofun(dt) outputs 9990 lines. I am not sure why these do not produce the same output, nor which solution is the correct one (I would say mine, just to be pretentious, but I am not even sure).
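To see where any two results disagree, fsetdiff() on the returned rows is handy. A quick sketch (my addition; note that ben() and denis() modify dt by reference, hence the copy()):
res_ben <- ben(copy(dt))
res_denis <- denis(copy(dt))
fsetdiff(res_denis, res_ben)  # rows denis() keeps that ben() drops
fsetdiff(res_ben, res_denis)  # rows ben() keeps that denis() drops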
Then, a benchmark to compare efficiency:
library(microbenchmark)
microbenchmark(ben(dt),
chinson(dt),
astrofun(dt),
denis(dt),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
ben(dt) 17.3841 19.8321 20.88349 20.9609 21.8815 23.5125 10
chinson(dt) 230.8868 232.6298 275.16637 236.8482 239.0144 544.2292 10
astrofun(dt) 4460.2159 4565.9120 4795.98600 4631.3251 5007.8055 5687.7717 10
denis(dt) 68.0480 68.4170 88.88490 80.9636 90.0514 142.9553 10
#Ben's solution with the rolling join is of course the fastest. Mine is not that bad, and #astrofunkswag's solution is very slow, I guess because of the cumulative sum.
With data.table you could try the following. This would involve creating a second date 90 days prior and then doing a rolling join.
library(data.table)
setDT(test)[, c("idx", "date2") := list(.I, date - 90L)]
test_final <- unique(test[test, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
roll = -Inf][order(i.date)], by = "idx")
setorderv(test_final, c("patient", "numsermed", "i.date"))
test_final
Output
(i.date has the final date desired)
patient numsermed date idx date2 i.date i.idx
1: 1 numser1 2019-10-10 1 2019-10-10 2020-01-08 1
2: 2 numser2 2014-10-04 6 2014-10-04 2015-01-02 6
3: 2 numser2 2019-09-13 4 2019-09-13 2019-12-12 4
4: 2 numser2 2019-12-16 8 2019-10-07 2020-03-15 7
5: 2 numser3 2019-12-14 10 2019-12-14 2020-03-13 10
6: 3 numser3 2019-10-24 3 2019-10-24 2020-01-22 3
7: 4 numser4 2017-10-04 5 2017-10-04 2018-01-02 5
Here is a solution with dplyr and purrr. I use 2 helper functions, one to calculate the month difference and one to calculate a cumulative sum that resets when a threshold is reached, credit to this post.
I calculate the month difference with the lagging date value, but you want to include the first one, which will be NA. One weird part is that the easiest way for me to include NA was to convert it to some value 3 or greater; I arbitrarily made it 300. You could likely modify the sum_reset_at function to handle NA the way you want (one possibility is sketched after the output below). You might also want to condense the code, since I make multiple mutate calls and then deselect those columns, but I did it all in separate lines to make it clearer what is happening. I think this functional programming solution will be quicker, but I haven't tested it on a large dataset compared to your current solution.
library(tidyverse); library(zoo)
test <- test %>% arrange(patient, numsermed, date)
mon_diff <- function(d1, d2){
12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
}
}
test %>%
group_by(patient, numsermed) %>%
mutate(diff1 = mon_diff(date, lag(date)),
diff1 = if_else(is.na(diff1), 300, diff1)) %>%
mutate(diff2 = sum_reset_at(3)(diff1)) %>%
filter(diff2 >= 3) %>%
select(-contains('diff'))
patient numsermed date
<dbl> <chr> <date>
1 1 numser1 2020-01-08
2 2 numser2 2015-01-02
3 2 numser2 2019-12-12
4 2 numser2 2020-03-15
5 2 numser3 2020-03-13
6 3 numser3 2020-01-22
7 4 numser4 2018-01-02
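As hinted above, one way to avoid the arbitrary 300 (a variation of mine, not part of the original answer) is to substitute the threshold itself for the leading NA with coalesce(), which keeps the first row of each group without a magic value:
test %>%
  group_by(patient, numsermed) %>%
  mutate(diff1 = coalesce(mon_diff(date, lag(date)), 3)) %>%
  mutate(diff2 = sum_reset_at(3)(diff1)) %>%
  filter(diff2 >= 3) %>%
  select(-contains('diff'))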
Another option using findInterval to group:
library(data.table)
DT[, d := as.integer(date)]
DT[DT[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1]
output:
patient numsermed date d g
1: 1 numser1 2020-01-08 18269 1
2: 2 numser2 2015-01-02 16437 1
3: 2 numser2 2019-12-12 18242 21
4: 2 numser2 2020-03-15 18336 22
5: 2 numser3 2020-03-13 18334 1
6: 3 numser3 2020-01-22 18283 1
7: 4 numser4 2018-01-02 17533 1
If you have many groups of patient and numsermed, Ben's solution using a rolling join will be faster. Here is another way of coding the rolling join, by chaining:
DT[, .(patient, numsermed, date=date+90L)][
DT, on=.NATURAL, roll=-Inf, .(patient, numsermed, x.date, i.date)][,
.(date=i.date[1L]), .(patient, numsermed, x.date)][,
x.date := NULL][]
Or more succinctly:
DT[, c("rn", "qtrago") := .(.I, date - 90L)]
DT[DT[DT, on=.(patient, numsermed, date=qtrago), roll=-Inf, unique(rn)]]
data:
library(data.table)
DT <- fread("patient numsermed date
1 numser1 2020-01-08
2 numser2 2015-01-02
2 numser2 2019-12-12
2 numser2 2020-01-05
2 numser2 2020-01-08
2 numser2 2020-01-20
2 numser2 2020-03-15
2 numser2 2020-03-18
2 numser3 2020-03-13
2 numser3 2020-03-18
3 numser3 2020-01-22
4 numser4 2018-01-02")
DT[, date := as.IDate(date, format="%Y-%m-%d")]

How to generate a unique ID for each group based on relative date interval in R using dplyr?

I have a cohort of data with multiple person visits and want to group visits with a common ID based on person # and the time of the visit. The condition is: if a start is within 24 hours of the previous exit, then I want those visits to have the same ID.
Sample of what the data looks like:
dat <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start with:
library(dplyr)
library(lubridate)
dat1 <- dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID <- as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
Which is almost right, except it does not give the correct ID for visit 7 (person #3), since there are three visits, the second visit falls entirely within the first, and the third starts within 24 hours of the first but not of the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(tidyr)
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 1) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
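A side note, not part of the original answer: gather() and spread() are superseded in current tidyr, and the same reshape steps can be written with pivot_longer() and pivot_wider():
library(tidyr)
# Long format with event type in one column and timestamp in another
long <- pivot_longer(dat, Admit_Date_Time:Discharge_Date_Time,
                     names_to = "event", values_to = "time")
# ...and the inverse at the end of the pipeline
wide <- pivot_wider(long, names_from = event, values_from = time)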
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
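For completeness, here is a sketch of a join-free variation (my own, not from the answer above): after sorting by admit time, a visit opens a new Hosp_ID only when it starts more than 24 hours after the latest discharge seen so far for that person, which a running cummax() tracks.
library(data.table)
setDT(dat)
setorder(dat, Person_ID, Admit_Date_Time)
# TRUE when a visit starts after every earlier "discharge + 24h" for the person
dat[, new_epi := as.numeric(Admit_Date_Time) >
      shift(cummax(as.numeric(Discharge_Date_Time) + 86400), fill = -Inf),
    by = Person_ID]
dat[, Hosp_ID := cumsum(new_epi)]  # each TRUE starts a new ID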

Validate time series index

I am using a dataset grouped with the group_by function of the dplyr package.
Each group has its own time index, which is supposed to consist of 12-month sequences.
This means it can start in January and end in December, or it can start in June of one year and end in May of the next.
Here is the dataset example:
ID DATE
8 2017-01-31
8 2017-02-28
8 2017-03-31
8 2017-04-30
8 2017-05-31
8 2017-06-30
8 2017-07-31
8 2017-08-31
8 2017-09-30
8 2017-10-31
8 2017-11-30
8 2017-12-31
32 2017-01-31
32 2017-02-28
32 2017-03-31
32 2017-04-30
32 2017-05-31
32 2017-06-30
32 2017-07-31
32 2017-08-31
32 2017-09-30
32 2017-10-31
32 2017-11-30
32 2017-12-31
45 2016-09-30
45 2016-10-31
45 2016-11-30
45 2016-12-31
45 2017-01-31
45 2017-02-28
45 2017-03-31
45 2017-04-30
45 2017-05-31
45 2017-06-30
45 2017-07-31
45 2017-08-31
The problem is that, because of the dataset's dimensions, I can't visually confirm or validate whether there are so-called "jumps", in other words whether the dates are consistent. Is there any simple way in R to do that, perhaps some modification/combination of functions from the tibbletime package?
Any help will be appreciated.
Thank you in advance.
Here's how I would typically approach this problem using data.table -- the cut.Date() and seq.Date() functions from base R are the meat of the logic, so you could use the same approach with dplyr if desired (see the sketch after the data.table versions below).
library(data.table)
## Convert to data.table
setDT(df)
## Convert DATE to a date in case it wasn't already
df[,DATE := as.Date(DATE)]
## Order by ID and Date
setkey(df,ID,DATE)
## Create a column with the month of each date
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
## Generate a sequence of Dates by month for the number of observations
## in each group -- .N
df[,ExpectedMonth := seq.Date(from = min(Month),
by = "months",
length.out = .N), by = .(ID)]
## Create a summary table to test whether an ID had 12 observations where
## the actual month was equal to the expected month
Test <- df[Month == ExpectedMonth, .(Valid = ifelse(.N == 12L,TRUE,FALSE)), by = .(ID)]
print(Test)
# ID Valid
# 1: 8 TRUE
# 2: 32 TRUE
# 3: 45 TRUE
## Do a no-copy join of Test to df based on ID
## and create a column in df based on the 'Valid' column in Test
df[Test, Valid := i.Valid, on = "ID"]
## The final output:
head(df)
# ID DATE Month ExpectedMonth Valid
# 1: 8 2017-01-31 2017-01-01 2017-01-01 TRUE
# 2: 8 2017-02-28 2017-02-01 2017-02-01 TRUE
# 3: 8 2017-03-31 2017-03-01 2017-03-01 TRUE
# 4: 8 2017-04-30 2017-04-01 2017-04-01 TRUE
# 5: 8 2017-05-31 2017-05-01 2017-05-01 TRUE
# 6: 8 2017-06-30 2017-06-01 2017-06-01 TRUE
You could also do things a little more compactly if you really wanted to, using a self-join and skipping the creation of Test:
setDT(df)
df[,DATE := as.Date(DATE)]
setkey(df,ID,DATE)
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), keyby = .(ID)]
df[df[Month == ExpectedMonth,.(Valid = ifelse(.N == 12L,TRUE,FALSE)),keyby = .(ID)], Valid := i.Valid]
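Since the logic is carried entirely by cut.Date() and seq.Date(), a dplyr version is a near-literal translation. A sketch (my addition), assuming df has the ID and DATE columns from the question:
library(dplyr)
df %>%
  mutate(DATE = as.Date(DATE),
         Month = as.Date(cut.Date(DATE, breaks = "months"))) %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  mutate(ExpectedMonth = seq.Date(from = min(Month), by = "months",
                                  length.out = n()),
         Valid = sum(Month == ExpectedMonth) == 12L) %>%
  ungroup()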
You can use the summarise function from dplyr to return a logical value indicating whether there are any day differences greater than 31 within each ID. You do this by first constructing a temporary date using only the year and month, attaching "-01" as the fake day:
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
DATE_diff = c(0, diff(DATE2))) %>%
summarise(Valid = !any(DATE_diff > 31))
Result:
# A tibble: 3 x 2
ID Valid
<int> <lgl>
1 8 TRUE
2 32 TRUE
3 45 TRUE
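The first-of-month construction can also be done with lubridate's floor_date() instead of the sub() trick. A compact variation (my addition) on the same consecutive-months check:
library(dplyr)
library(lubridate)
df %>%
  group_by(ID) %>%
  arrange(DATE, .by_group = TRUE) %>%
  # consecutive months are 28-31 days apart; a skipped month shows up as > 31
  summarise(Valid = all(diff(floor_date(as.Date(DATE), "month")) <= 31))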
You can also visually check if there are any gaps by plotting your dates for each ID:
library(ggplot2)
df %>%
mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
ID = as.factor(ID)) %>%
ggplot(aes(x = DATE, y = ID, group = ID)) +
geom_point(aes(color = ID)) +
scale_x_date(date_breaks = "1 month",
date_labels = "%b-%Y") +
labs(title = "Time Line by ID")

R: time series monthly max adjusted by group

I have a df like this (head):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, for that reason I need the max of every period. The max it´s normally the last day of the month but sometime It could be some days after and before (28,29,30,31,01,02).
In order to calculate it properly I would like to assign to my reference date (the last day of the month) the max value of this group of days to be sure that the ratio reflects what it supossed to.
This could be a reproducible example:
library(data.table)
library(zoo)

Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start, End, by = '1 day'))
dt <- cbind(window, rep(rnorm(nrow(window))))
colnames(dt) <- c("date", "value")

# Create a date sequence
DateSeq <- function(st, en, freq) {
  st <- as.Date(as.yearmon(st))
  en <- as.Date(as.yearmon(en))
  as.Date(as.yearmon(seq(st, en, by = paste(as.character(12/freq), "months"))),
          frac = 1)
}
# df to be filled with the group max.
Value.Max.Month <- data.frame(DateSeq(Start, End, 12))
colnames(Value.Max.Month) <- c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start,End,by='1 day'))
dt <- cbind(window,rep(rnorm(nrow(window))))
colnames(dt) <- c("date","value")
dt <- data.table(dt)
dt[, period := as.Date(as.yearmon(date)) %m+% months(1) - 1][
   , maximum := max(value), by = period][
   , unique(maximum), by = period]
In the first expression we create a new column called period. Then we group by this new column and look for the maximum in value. In the last expression we just output these unique rows.
Notice that to get the last day of each period we add one month using lubridate and then subtract one day.
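Equivalently (a variation of mine, not part of the original answer), lubridate alone can produce the month end, avoiding the zoo round-trip:
# change_on_boundary = TRUE pushes a first-of-month date to the next month,
# so subtracting one day always lands on the last day of the current month
dt[, period := ceiling_date(date, "month", change_on_boundary = TRUE) - 1]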
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341

Vectorising iterative operation across rows

I've seen a lot of questions on here about vectorising for loops, but couldn't find any that involve vectorising a for loop to populate a cell based on the value of a cell in a row below (apologies if I'm just being blind though...).
I have a dataframe with 1.6 million rows of salaries and the date each person started earning that salary. Each person can have multiple salaries, and so multiple rows, each with a different date that it was updated.
Code for a dummy dataset is as follows:
df1 <- data.frame("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
"salary" = c(15456,16594,
17364,34564,
34525,33656,
23464,23467,
16794,27454,
40663,42743),
"start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))
df1 <- df1[order(df1$id,df1$start_date),]
I want to create a column with an end date for each salary, which is calculated as the day before the subsequent salary entry. If there is no subsequent salary entry, then it's set as today's date. This is my code, including a for loop, to do that:
df1$end_date <- Sys.Date()
for (i in 1:(nrow(df1)-1)){
if(df1[i,1]== df1[i+1,1]){
df1[i,4] <- df1[i+1,3]-1
}
print(i)
}
However, I know that for loops are not the most efficient way, so how would I go about vectorising this?
Using the dplyr package, you could do:
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(end_date=lead(start_date-1,default=Sys.Date()))
Which returns:
id salary start_date end_date
<dbl> <dbl> <date> <date>
1 1 15456 2016-02-14 2016-03-02
2 1 16594 2016-03-03 2017-05-22
3 2 17364 2016-01-17 2016-11-28
4 2 34564 2016-11-29 2017-05-22
5 3 33656 2016-08-17 2016-11-25
6 3 34525 2016-11-26 2017-05-22
7 4 23464 2016-01-20 2017-05-05
8 4 23467 2017-05-06 2017-05-22
9 5 27454 2016-02-29 2016-12-15
10 5 16794 2016-12-16 2017-05-22
11 6 42743 2016-03-14 2017-01-29
12 6 40663 2017-01-30 2017-05-22
You can use library(data.table):
setDT(df1)[, end_date := shift(start_date, type = "lead", fill = Sys.Date()), id][]
With data.table and shift, you can use the following:
df1 <- data.table("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
"salary" = c(15456,16594,
17364,34564,
34525,33656,
23464,23467,
16794,27454,
40663,42743),
"start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))
df1 <- df1[order(id,start_date),]
df1[, EndDate := shift(start_date, type="lead"), id]
df1[is.na(EndDate), EndDate := Sys.Date()]
If I understand your question, the following base R code will work.
df1$end <- ave(df1$start_date, df1$id, FUN=function(x) c(tail(x, -1) - 1, Sys.Date()))
ave is used to perform the group-level operation. The function takes the second through final dates and subtracts 1; this is concatenated with today's date (Sys.Date()).
This returns
df1
id salary start_date end
1 1 15456 2016-03-20 2016-12-06
2 1 16594 2016-12-07 2017-05-22
3 2 17364 2016-10-17 2016-07-27
4 2 34564 2016-07-28 2017-05-22
5 3 34525 2016-05-26 2016-05-01
6 3 33656 2016-05-02 2017-05-22
7 4 23464 2017-04-17 2016-01-19
8 4 23467 2016-01-20 2017-05-22
9 5 16794 2016-09-12 2016-05-06
10 5 27454 2016-05-07 2017-05-22
11 6 40663 2016-10-03 2016-03-28
12 6 42743 2016-03-29 2017-05-22
