Proximity to long weekends & holidays - r

I have a vector of dates in a tibble.
# A tibble: 10 x 1
1 2017-04-04
2 2017-04-05
3 2017-04-07
4 2017-04-10
5 2017-04-11
6 2017-04-12
7 2017-04-13
8 2017-04-14
9 2017-04-17
10 2017-04-18
Reproducible using:
structure(list(Date = structure(c(1491264000, 1491350400, 1491523200,
1491782400, 1491868800, 1491955200, 1492041600, 1492128000, 1492387200,
1492473600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L), .Names = "Date")
Two feature variables:
'Proximity to next holiday'
'Proximity to past holiday'
The intent is to determine whether my response variable depends on Date being close to a holiday or long weekend. For example, if 04-11 were a holiday, I would want:
Date ProxNxtHol ProxPastHol
1 2017-04-04 4 32
2 2017-04-05 3 33
3 2017-04-07 2 34
4 2017-04-10 1 35
5 2017-04-11 0 36
6 2017-04-12 58 1
7 2017-04-13 57 2
8 2017-04-14 56 3
9 2017-04-17 55 4
10 2017-04-18 54 5
While I can manually define all the holidays in a vector myself and calculate the difference between the two dates, this is cumbersome because the holidays vary by location globally. (I have a variable which can indicate location.)
Is there a predefined function which can indicate if a given date is a holiday or not, for a specified region?

I came up with a for loop that computes both proximities, as in your desired output. The steps are below.
First, convert your structure to a data frame and its Date column to class Date:
> qdates <- data.frame(qdates)
> qdates$Date <- as.Date(qdates$Date)
> qdates
1 2017-04-04
2 2017-04-05
3 2017-04-07
4 2017-04-10
5 2017-04-11
6 2017-04-12
7 2017-04-13
8 2017-04-14
9 2017-04-17
10 2017-04-18
Use library(timeDate) to build a data frame of US holidays. You can add or modify dates here, or use other built-in functions that might contain federal holidays.
> library(timeDate)
> hdates <- data.frame(Dates = c(USNewYearsDay(2017), USInaugurationDay(2017), USMLKingsBirthday(2017),
USLincolnsBirthday(2017), USWashingtonsBirthday(2017), USCPulaskisBirthday(2017),
USGoodFriday(2017), USMemorialDay(2017), USIndependenceDay(2017), USLaborDay(2017),
USColumbusDay(2017), USElectionDay(2017), USVeteransDay(2017), USThanksgivingDay(2017),
USChristmasDay(2017)))
> colnames(hdates) <- "HolidayDate"
> hdates$HolidayDate <- as.Date(hdates$HolidayDate)
> hdates
1 2017-01-01
2 2017-01-20
3 2017-01-16
4 2017-02-12
5 2017-02-22
6 2017-03-06
7 2017-04-14
8 2017-05-29
9 2017-07-04
10 2017-09-04
11 2017-10-09
12 2017-11-07
13 2017-11-11
14 2017-11-23
15 2017-12-25
A for loop then computes the date differences and populates the output:
for(i in 1:nrow(qdates)) {
# latest holiday on or before the date, earliest holiday on or after it
minDate <- max(hdates[which(hdates$HolidayDate <= qdates$Date[i]),])
maxDate <- min(hdates[which(hdates$HolidayDate >= qdates$Date[i]),])
qdates$ProxPastHol[i] <- abs(difftime(minDate, qdates$Date[i], units = "days"))
qdates$ProxNxtHol[i] <- abs(difftime(maxDate, qdates$Date[i], units = "days"))
}
> qdates
Date ProxPastHol ProxNxtHol
1 2017-04-04 29 10
2 2017-04-05 30 9
3 2017-04-07 32 7
4 2017-04-10 35 4
5 2017-04-11 36 3
6 2017-04-12 37 2
7 2017-04-13 38 1
8 2017-04-14 0 0
9 2017-04-17 3 42
10 2017-04-18 4 41
Hope this helps !!!
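For longer date vectors, the same two distances can be computed without a loop. The following is only a sketch in base R, not part of the answer above: it uses findInterval() on three of the holidays listed in hdates and four of the dates in qdates, and the variable names (holidays, dates, i_past, i_next, prox) are made up for illustration.

```r
# Three of the holidays from hdates and four of the dates from qdates
holidays <- sort(as.Date(c("2017-03-06", "2017-04-14", "2017-05-29")))
dates <- as.Date(c("2017-04-04", "2017-04-13", "2017-04-14", "2017-04-17"))

h <- as.numeric(holidays)
d <- as.numeric(dates)

# Position of the latest holiday on or before each date (0 means there is none)
i_past <- findInterval(d, h)
i_past[i_past == 0] <- NA

# Position of the earliest holiday on or after each date;
# left.open = TRUE counts only holidays strictly before the date
i_next <- findInterval(d, h, left.open = TRUE) + 1

prox <- data.frame(
  Date = dates,
  ProxPastHol = d - h[i_past],   # NA if no earlier holiday is known
  ProxNxtHol = h[i_next] - d     # NA if no later holiday is known
)
print(prox)
```

With these inputs the two columns should match the corresponding rows of the loop output above (for example 29 and 10 days for 2017-04-04).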


Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table( id = seq(1,21),
type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
"2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
"2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
"2020-05-28", "2020-08-25", "2020-12-14")),
end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
"2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
"2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
"2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
ggplot(data = data,
aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
geom_segment(size = 2) +
xlab("") +
ylab("")
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14, however, intervention type 2 is also happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
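The day counts in the worked example can be checked with a small day-by-day sketch in base R. It uses only the first two rows of the example data and the window 2015-01-09 to 2015-06-18 described above; the helper covered() and the other variable names are hypothetical.

```r
# First two rows of the example data (ids 1 and 2)
periods <- data.frame(
  type     = c("1", "2"),
  start_dt = as.Date(c("2015-01-09", "2015-04-14")),
  end_dt   = as.Date(c("2017-07-24", "2015-05-04"))
)

# Window covered by the worked example in the text
days <- seq(as.Date("2015-01-09"), as.Date("2015-06-18"), by = "day")

# TRUE for every day that falls inside at least one period of type tp
covered <- function(tp) {
  p <- periods[periods$type == tp, ]
  vapply(seq_along(days),
         function(i) any(p$start_dt <= days[i] & p$end_dt >= days[i]),
         logical(1))
}

in1 <- covered("1")
in2 <- covered("2")

totals <- c(pure_1  = sum(in1 & !in2),
            pure_2  = sum(in2 & !in1),
            overlap = sum(in1 & in2))
print(totals)
```

For this window the sketch should give 140 days of "pure" type 1 and 21 days of overlap, matching the arithmetic in the text.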
Alternatively, if possible, I would like to organise the data so that all the separate time periods are extracted, meaning that the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-13
2: 2 3 2015-04-14 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers using dplyr or just marking overlapping time periods, but I have not found an answer to my specific case.
Is there an efficient way to calculate this using data.table?
This method expands the range into one row per date, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr) # only for %>%
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
.[, .N, by = .(start_dt, type) ] %>%
.[ !is.na(type), ] %>%
dcast(start_dt ~ type, value.var = "N") %>%
.[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt") ] %>%
.[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
start_dt = min(start_dt), end_dt = max(start_dt)), by = r ]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
This drops the id field; I don't know a good way to map it back to your original data.
@r2evans' solution is more complete, but if you want to explore the use of foverlaps, you can start with something like this:
#split into two frames
data = split(data,by="type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# create the rows that have overlaps
overlap = foverlaps(data[[1]],data[[2]], type="any", nomatch=0)
# get the overlapping time periods
overlap[, .(start_dt = max(start_dt,i.start_dt), end_dt=min(end_dt,i.end_dt)), by=1:nrow(overlap)][,type:=3]
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
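The sum of those overlap days, computed as end_dt - start_dt, is 1492. Counting both endpoints of each period, as the question does, adds one day per row and gives 1510.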
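As a rough check of that total, the overlap periods printed above can be summed in base R under both counting conventions. This is only a sketch: the dates are copied from the foverlaps() output, and the names exclusive and inclusive are made up here.

```r
# Overlap periods copied from the foverlaps() result above
start_dt <- as.Date(c(
  "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24", "2016-08-03",
  "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09", "2019-01-31", "2019-03-22",
  "2019-05-16", "2019-11-04", "2020-02-06", "2020-05-28", "2020-08-25", "2020-12-14"))
end_dt <- as.Date(c(
  "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09", "2017-07-18",
  "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15", "2019-02-21", "2019-04-24",
  "2019-09-13", "2020-01-26", "2020-04-29", "2020-08-19", "2020-11-16", "2020-12-23"))

# Plain date difference (the end day itself is not counted)
exclusive <- sum(as.numeric(end_dt - start_dt))

# Both endpoints counted, as in the question's "21 days" example
inclusive <- sum(as.numeric(end_dt - start_dt) + 1)

print(c(exclusive = exclusive, inclusive = inclusive))
```

Which convention is appropriate depends on whether an intervention that starts and ends on the same day should count as one day or zero.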

How to print a date when the input is number of days since 01-01-60?

I received a set of dates, but it turns out that time is reported in days since 01-01-1960 in this specific data set.
1 20758
2 20856
3 21062
4 19740
5 21222
6 21203
The specific date of interest for Patient 1 is 20758 days since 01-01-60
I want to create a new covariate u$date containing the specific date of interest in %d%m%y format. I tried
u %>% mutate(,origin="1960-01-01")
But that did not solve it.
u <- structure(list(D_INDDTO = c(20758, 20856, 21062, 19740, 21222,
21203, 20976, 20895, 18656, 18746)), row.names = c(NA, 10L), class = "data.frame")
Try this:
#Code 1
u %>% mutate(date=as.Date("1960-01-01")+D_INDDTO)
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 2
u %>% mutate(date=as.Date(D_INDDTO,origin="1960-01-01"))
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 3
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d%m%y'))
1 20758 311016
2 20856 060217
3 21062 310817
4 19740 170114
5 21222 070218
6 21203 190118
7 20976 060617
8 20895 170317
9 18656 290111
10 18746 290411
If more customization is required:
#Code 4
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d-%m-%Y'))
1 20758 31-10-2016
2 20856 06-02-2017
3 21062 31-08-2017
4 19740 17-01-2014
5 21222 07-02-2018
6 21203 19-01-2018
7 20976 06-06-2017
8 20895 17-03-2017
9 18656 29-01-2011
10 18746 29-04-2011
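The same conversion also works without dplyr. Here is a minimal base R sketch using the first three values of D_INDDTO; the variable names are illustrative, and the last line converts a date back to a day count as a sanity check.

```r
# First three day counts from the example data
days_since_1960 <- c(20758, 20856, 21062)

# Numeric day counts become dates once the origin is given
dates <- as.Date(days_since_1960, origin = "1960-01-01")
print(dates)

# Character representation in day-month-year order
formatted <- format(dates, "%d%m%y")
print(formatted)

# Round trip: a date minus the origin gives the day count back
back <- as.numeric(as.Date("2016-10-31") - as.Date("1960-01-01"))
print(back)
```

Note that format() returns character strings, so keep the Date column as well if you need to sort or do date arithmetic later.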

Calculate maximum date interval - R

The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular and I'm trying to calculate the uninterrupted interval in days starting from the first start date per group.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would be a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap from row 6 to 7 and to take this point as the end of the maximum interval (34 days). The interval 2018-10-01 to 2018-10-01 would be counted as 1.
My usual lubridate approaches don't work with this example (interval %within% lag(interval)).
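Any idea?
library(data.table)
library(magrittr) # for %>%
setDT(data) # the by = id syntax below needs a data.table
first_int <- function(start, stop){
# TRUE for the rows before the first gap (a start later than the previous stop)
ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
list(start = min(start[ind]),
stop = max(stop[ind]))
}
newdata <-
data[, first_int(start, stop), by = id] %>%
.[, duration := stop - start + 1]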
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
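To see what the gap test does, here is a base R sketch for id 3 only. It is a simplified illustration rather than the answer's data.table code: a row counts as a gap when its start is later than the previous row's stop, and everything before the first gap forms the uninterrupted interval. The names gap and first_run are made up for this sketch.

```r
# Rows for id 3 from the example data
start <- as.Date(c("2015-04-10", "2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14"))
stop  <- as.Date(c("2015-04-13", "2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15"))

# A row opens a gap when it starts after the previous row has stopped;
# the first row can never be a gap
gap <- c(FALSE, start[-1] > head(stop, -1))

# Rows that belong to the first uninterrupted run
first_run <- cumsum(gap) == 0

# Inclusive length of that run in days
duration <- as.numeric(max(stop[first_run]) - min(start[first_run])) + 1
print(duration)
```

Like the answer, this compares each start only with the immediately preceding stop, and it treats a start on the day after the previous stop as a gap. If an earlier interval can extend past later ones, comparing against the running maximum of stop would be a safer variant.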

Calculate average number of individuals present on each date in R

I have a dataset that contains the residence period (start date to end date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = - + 1)
site ID total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to share a sample of the data in a friendlier format using dput(yourData), so that others can easily regenerate it. Here is the dput() output that would have been better to share:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"), = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame", row.names =
To do this easily, we first need to unpack each start-to-end range into individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)){
expand <- data.frame(site = dat$site[i],
ID = dat$ID[i],
Dates = seq.Date(dat$[i], dat$[i], 1))
newDat <- rbind(newDat, expand)
}
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
Then we calculate the number of other individuals present in each site in each day:
individualCount = newDat %>%
group_by(site, Dates) %>%
summarise(individuals = n_distinct(ID) - 1)
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 2.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 2.33
6 2 14 7 1.14
7 2 18 21 0.857
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat %>% left_join(newDat, by = c("site", "ID"))
site ID duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857
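The averages can be cross-checked without expanding to one row per day. The sketch below is a different technique from the answer above: for each individual it sums the number of days shared with every other individual at the same site and divides by the individual's own residence days. The column names start and end are assumptions, since the original column names are not shown here, and overlap_days() is a hypothetical helper.

```r
# Example data; `start` and `end` are assumed column names
dat <- data.frame(
  site  = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  ID    = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
  start = as.Date(c("2017-05-24", "2017-04-30", "2017-04-30", "2017-05-05", "2017-05-11",
                    "2017-05-04", "2017-05-09", "2017-05-24", "2017-05-05")),
  end   = as.Date(c("2017-06-05", "2017-05-20", "2017-05-23", "2017-05-13", "2017-05-14",
                    "2017-05-10", "2017-05-29", "2017-06-10", "2017-05-18"))
)

s <- as.numeric(dat$start)
e <- as.numeric(dat$end)

# Inclusive number of days two residence periods share (0 if they do not touch)
overlap_days <- function(s1, e1, s2, e2) pmax(0, pmin(e1, e2) - pmax(s1, s2) + 1)

dat$av.individuals <- vapply(seq_len(nrow(dat)), function(i) {
  others <- dat$site == dat$site[i] & dat$ID != dat$ID[i]
  shared <- overlap_days(s[i], e[i], s[others], e[others])
  sum(shared) / (e[i] - s[i] + 1)   # shared days per day of own residence
}, numeric(1))

print(dat[, c("site", "ID", "av.individuals")])
```

This avoids building the day-level table, at the cost of comparing every pair of individuals within a site, so it is mainly useful as a check on small data.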

Fixing dates that were coerced into the wrong format

I have a large df with dates that were accidentally coerced into the wrong format.
id <- c(1:12)
date <- c("2014-01-03","2001-08-14","2001-08-14","2014-06-02","2006-06-14", "2006-06-14",
"2014-08-08","2014-08-08","2008-04-14","2009-12-13","2010-09-14","2012-09-14")
df <- data.frame(id,date)
id date
1 1 2014-01-03
2 2 2001-08-14
3 3 2001-08-14
4 4 2014-06-02
5 5 2006-06-14
6 6 2006-06-14
7 7 2014-08-08
8 8 2014-08-08
9 9 2008-04-14
10 10 2009-12-13
11 11 2010-09-14
12 12 2012-09-14
The data set only includes, or rather should only include the years 2014 and 2013. The dates 2001-08-14 and 2006-06-14 are most likely 2014-08-01 and 2014-06-06, respectively.
id date
1 1 2014-01-03
2 2 2014-08-01
3 3 2014-08-01
4 4 2014-06-02
5 5 2014-06-06
6 6 2014-06-06
7 7 2014-08-08
8 8 2014-08-08
9 9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
How can I reconcile this mess?
Package lubridate has the convenient function year that will be useful here.
# Load lubridate for year()
library(lubridate)
# Convert date to proper date class variable
df$date <- as.Date(df$date)
# Isolate problematic indices; when year is not in 2013 or 2014,
# we'll go to and from character representation. We'll trim
# the "20" in front of the "false year" and then specify the
# proper format to read the character back into a Date class.
tmp.indices <- which(!year(df$date) %in% c("2013", "2014"))
df$date[tmp.indices] <- as.Date(substring(as.character(df$date[tmp.indices]),
first = 3), format = "%d-%m-%y")
id date
1 1 2014-01-03
2 2 2014-08-01
3 3 2014-08-01
4 4 2014-06-02
5 5 2014-06-06
6 6 2014-06-06
7 7 2014-08-08
8 8 2014-08-08
9 9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
We could convert the 'date' column to 'Date' class and extract the year to create a logical index ('indx') that flags rows whose year is not 2013 or 2014.
df$date <- as.Date(df$date)
indx <- !format(df$date, '%Y') %in% 2013:2014
Using lubridate, convert to 'Date' class with dmy after removing the first two characters.
library(lubridate)
df$date[indx] <- dmy(sub('^..', '', df$date[indx]))
# id date
#1 1 2014-01-03
#2 2 2014-08-01
#3 3 2014-08-01
#4 4 2014-06-02
#5 5 2014-06-06
#6 6 2014-06-06
#7 7 2014-08-08
#8 8 2014-08-08
#9 9 2014-04-08
#10 10 2013-12-09
#11 11 2014-09-10
#12 12 2014-09-12
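Both answers rely on the same idea: the wrong dates are day-month-year strings that were read as year-month-day, so re-reading the two-digit pieces in the intended order undoes the damage. Below is a base R sketch of that round trip; the raw string is a hypothetical example of what the original input may have looked like.

```r
# Hypothetical raw value, meant as day-month-year (1 August 2014)
raw <- "01-08-14"

# Reading it as year-month-day reproduces the wrong date from the question
wrong <- as.Date(raw, format = "%y-%m-%d")
print(wrong)

# Write the wrong date back out with a two-digit year, then parse it
# in the intended day-month-year order
fixed <- as.Date(format(wrong, "%y-%m-%d"), format = "%d-%m-%y")
print(fixed)
```

This only applies to rows that were actually misread, which is why both answers first restrict the fix to years outside 2013 and 2014.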
