Counting within a period of time - r

I have the following dataframe:
Person A took 5 vacations: the first ran from 2015-03-11 to 2015-03-15 and the last from 2016-02-04 to 2016-02-10.
Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05
I want to find all persons who took at least 5 vacations within 365 days.
In the example above, Person A went on vacation 5 times within 365 days. Person C went on 6 vacations, but no 5 of them fall within 365 days.
The result should be a dataframe like
Person at_least_five_vacations_within_365_days
A TRUE
B FALSE
C FALSE

Your data:
library(data.table)
library(lubridate)
library(dplyr) # provides %>% and mutate(), which are used below
df <- fread("Person\tfromDate\ttoDate
A\t2015-03-11\t2015-03-15
A\t2015-04-17\t2015-06-16
A\t2015-09-18\t2015-10-12
A\t2015-12-18\t2016-01-02
A\t2016-02-04\t2016-02-10
B\t2015-04-10\t2016-04-16
B\t2016-12-12\t2016-12-20
C\t2015-01-02\t2015-02-04
C\t2015-03-03\t2015-03-05
C\t2015-04-04\t2015-04-07
C\t2016-01-03\t2016-01-10
C\t2016-10-12\t2016-10-15
C\t2016-11-01\t2016-11-05",header="auto",sep="auto") %>%
as.data.frame() %>%
mutate(fromDate=ymd(fromDate), toDate=ymd(toDate))
Setting number of trips window:
numoftrips <- 5
Using dplyr, and assuming your dates are already sorted within each Person:
library(dplyr)
df1 <- df %>%
group_by(Person) %>%
mutate(toCompare=lead(toDate,(numoftrips-1))) %>% # Return date of the trip 4 rows ahead, i.e. the 5th trip counting from this one
mutate(within.year=(toCompare-fromDate)<=365) %>% # Check whether those 5 trips span at most 365 days
summarise(at_least_five_vacations_within_365_days=any(within.year,na.rm=TRUE)) # TRUE if any run of 5 trips fits within 365 days
Output
df1
Person at_least_five_vacations_within_365_days
1 A TRUE
2 B FALSE
3 C FALSE

This might work. But you should specify the expected output.
library(dplyr)
df %>% group_by(Person) %>%
mutate(diff = toDate - fromDate,instances = n())%>%
filter(instances >= 5 & diff < 365)
Here df is your dataset and instances is the number of trips per person.

The accepted answer uses data.table to read the data but continues with a dplyr approach.
The approach below uses read_table2() from the readr package but achieves the desired result with a data.table "one-liner":
library(data.table) # CRAN version 1.10.4 used
n_trips <- 5L
n_days <- 365L
DT[order(Person, fromDate),
any(fromDate <= shift(toDate, n_trips - 1L, , "lag") + n_days, na.rm = TRUE),
by = Person][]
Person V1
1: A TRUE
2: B FALSE
3: C FALSE
Explanation
The approach is similar to the accepted answer: within each person, toDate is lagged by the required number of trips minus one, and then it is checked whether the current fromDate falls within the given number of days after that lagged toDate. The any() function is used to determine whether there is at least one such occurrence for a particular person. The result of a shift operation depends on the order of rows, so the data.table is ordered beforehand.
The OP has requested to find all persons which made within 365 days at least 5 times vacations but he hasn't specified exactly how to count the vacations (by start date, by end date, or by a mixture of both?). So, it has been deliberately chosen to check the end date of the 4th previous vacation vs the start date of the actual vacation.
Data
DT <- readr::read_table2(
"Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05"
)
library(data.table)
setDT(DT)
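The lag-and-compare check described in the explanation above can also be sketched in base R. This is only an illustration, not the answerer's code: the helper name has_n_trips_within is hypothetical, and it assumes the same counting convention as this answer (the start of a trip is compared with the end of the trip n_trips - 1 positions earlier).

```r
# Sample data from the question, typed in directly so the sketch is self-contained
vac <- data.frame(
  Person   = rep(c("A", "B", "C"), times = c(5, 2, 6)),
  fromDate = as.Date(c("2015-03-11", "2015-04-17", "2015-09-18", "2015-12-18", "2016-02-04",
                       "2015-04-10", "2016-12-12",
                       "2015-01-02", "2015-03-03", "2015-04-04", "2016-01-03", "2016-10-12", "2016-11-01")),
  toDate   = as.Date(c("2015-03-15", "2015-06-16", "2015-10-12", "2016-01-02", "2016-02-10",
                       "2016-04-16", "2016-12-20",
                       "2015-02-04", "2015-03-05", "2015-04-07", "2016-01-10", "2016-10-15", "2016-11-05"))
)

# Hypothetical helper: TRUE if any run of n_trips consecutive trips fits in n_days
has_n_trips_within <- function(from, to, n_trips = 5L, n_days = 365L) {
  ord  <- order(from)                 # the comparison depends on row order
  from <- from[ord]
  to   <- to[ord]
  k <- length(from)
  if (k < n_trips) return(FALSE)      # too few trips to form a run
  first <- seq_len(k - n_trips + 1L)  # index of the earliest trip in each run
  last  <- first + n_trips - 1L       # index of the latest trip in each run
  # start of the latest trip versus end of the earliest trip, in days
  any(as.numeric(from[last] - to[first]) <= n_days)
}

res <- sapply(split(vac, vac$Person),
              function(d) has_n_trips_within(d$fromDate, d$toDate))
res
```

For the sample data this gives TRUE for A and FALSE for B and C. Swapping the comparison to to[last] - from[first] would reproduce the convention of the accepted answer instead.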

Related

R: merge Dataframes on date and unique IDs with conditions distributed across many rows in R

I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+ exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use leftjoin and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced at R and couldn't figure it out. I would really appreciate if someone could walk me through it - thanks!
A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
df$next.date = as.Date('2100-12-31')
if (nrow(df) > 1)
df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
df
}), df1$ID)
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
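The same exit-to-next-enter matching can be sketched without sqldf, using only base R. This is a minimal sketch under the same assumption as above (one far-future sentinel date for the last spell of each ID); the column name next.date follows the answer, everything else is illustrative.

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 4),
  enter.date = as.Date(c("2015-05-07", "2015-07-10", "2017-08-25", "2016-09-01",
                         "2018-01-05", "2016-05-01", "2017-04-08")),
  exit.date  = as.Date(c("2015-07-01", "2015-10-15", "2017-09-03", "2016-09-30",
                         "2019-06-01", "2017-05-01", "2017-06-08"))
)
df2 <- data.frame(
  ID = c(1, 2, 2, 3, 4),
  eval.date = as.Date(c("2015-10-30", "2016-10-10", "2019-09-10", "2018-05-15", "2015-01-19"))
)

# Sort so that "next row" means "next spell of the same ID"
df1 <- df1[order(df1$ID, df1$enter.date), ]
n <- nrow(df1)

# Next enter date within the same ID; the last spell of each ID gets a sentinel
nxt     <- c(df1$enter.date[-1], as.Date(NA))
same_id <- c(df1$ID[-1] == df1$ID[-n], FALSE)
nxt[!same_id] <- as.Date("2100-12-31")
df1$next.date <- nxt

# All spell/eval pairs per ID, then keep evals between exit and the next enter
both    <- merge(df1, df2, by = "ID")
keep    <- both$eval.date >= both$exit.date & both$eval.date <= both$next.date
matched <- both[keep, ]
matched$time.diff <- as.numeric(matched$exit.date - matched$eval.date)

# Average interval per ID, as in the question
avg <- aggregate(time.diff ~ ID, data = matched, FUN = mean)
avg
```

merge() builds every spell/eval combination per ID before filtering, so on large data this may use far more memory than a true interval join.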

Get the Date Difference in Data.table

I'd like to know how to get the date difference between two columns of a data.table using lapply.
library(data.table)
dt <- fread(" ID Date ME_Mes DOB
A 2017-02-20 0.0000 2016-08-19
B 2017-02-06 2.3030 2016-03-11
C 2017-03-20 0.4135 2016-08-19
D 2017-03-06 0.0480 2016-10-09
E 2017-04-20 2.4445 2016-05-04")
> dt
ID Date ME_Mes DOB
1: A 2017-02-20 0.0000 2016-08-19
2: B 2017-02-06 2.3030 2016-03-11
3: C 2017-03-20 0.4135 2016-08-19
4: D 2017-03-06 0.0480 2016-10-09
5: E 2017-04-20 2.4445 2016-05-04
### I'd like to calculate the difference in weeks between DOB and Date for every ID.
I tried the following:
dt[,lapply(.SD, diff.Date), .SDcols = c(4,2), ID] # but did not work!
You can use difftime to get the difference in weeks, although you need to convert your columns to a date-time class such as POSIXct first.
If you want to keep the class of your columns as it is, this works:
dt[, "DOB_Date" := difftime(strptime(dt$Date, format = "%Y-%m-%d"),
strptime(dt$DOB, format = "%Y-%m-%d"), units = "weeks")]
dt
## ID Date ME_Mes DOB DOB_Date
## 1: A 2017-02-20 0.0000 2016-08-19 26.43452 weeks
## 2: B 2017-02-06 2.3030 2016-03-11 47.42857 weeks
## 3: C 2017-03-20 0.4135 2016-08-19 30.42857 weeks
## 4: D 2017-03-06 0.0480 2016-10-09 21.14881 weeks
## 5: E 2017-04-20 2.4445 2016-05-04 50.14286 weeks
However, as @Frank suggested, it's better to convert ("overwrite") your date columns to POSIXct class first.
My hunch (and I will let others correct me) is that the following is faster on large datasets:
dt[,Date:=as.Date(Date)]
dt[,DOB:=as.Date(DOB)]
dt[,datediff:=as.integer(Date)-as.integer(DOB)]
datediff will contain date differences in days.
If you have a truly large data.table, you may consider fastPOSIXct from fasttime for string conversion.
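As a small illustration of the two ideas above (convert the strings once, then subtract), here is a base R sketch on the first two rows of the sample data. It uses the Date class rather than POSIXct, which is an assumption of this sketch and not what the answers prescribe; with Date there is no time-of-day component, so the result is an exact number of days divided by 7.

```r
# First two rows of the example, converted to Date once
dob  <- as.Date(c("2016-08-19", "2016-03-11"))
date <- as.Date(c("2017-02-20", "2017-02-06"))

# Difference in days via integer arithmetic, as in the second answer
days <- as.integer(date) - as.integer(dob)

# Difference in weeks via difftime, as in the first answer
weeks <- as.numeric(difftime(date, dob, units = "weeks"))

days
weeks
```

The first row comes out as 185/7 weeks here, slightly less than the 26.43452 weeks printed above. The gap is exactly one hour, which looks like a daylight-saving shift picked up by the local-time strptime() conversion; that is an inference from the numbers, not something the answer states.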

Using a rolling time interval to count rows in R and dplyr

Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate, for each ticket, the number of tickets sold within a certain time frame after it. For example, for each row I want the number of tickets sold in the 15 minutes starting at that row's Timestamp. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t),
.(counts=sum(ticket_count)), by=.EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row in t, all rows where df$Timestamp < that_row are fetched, and by=.EACHI instructs the expression sum(ticket_count) to run for each row in t. That gives your desired result.
Hope this helps.
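Note that the join condition above has only an upper bound, so each count includes every earlier row as well. If the window should also start at the row's own Timestamp, as the first two rows described in the question suggest, a plain base R sketch looks like the following. The data frame name tickets, the column next_15min and the inclusive upper bound are assumptions of this sketch.

```r
# Sample data from the question; UTC is assumed to avoid time zone surprises
tickets <- data.frame(
  Timestamp = as.POSIXct(c("2016-01-01 05:30:00", "2016-01-01 05:32:00",
                           "2016-01-01 05:38:00", "2016-01-01 05:46:00",
                           "2016-01-01 05:47:00", "2016-01-01 06:07:00",
                           "2016-01-01 06:13:00", "2016-01-01 06:21:00",
                           "2016-01-01 06:22:00", "2016-01-01 06:25:00"),
                         tz = "UTC"),
  ticket_count = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)
)

window_secs <- 15 * 60  # 15 minutes, in seconds

# For each row, sum the tickets sold from its Timestamp up to 15 minutes later
tickets$next_15min <- sapply(seq_len(nrow(tickets)), function(i) {
  t0 <- tickets$Timestamp[i]
  in_window <- tickets$Timestamp >= t0 & tickets$Timestamp <= t0 + window_secs
  sum(tickets$ticket_count[in_window])
})

tickets$next_15min
```

This compares every row with every other row, so it is quadratic in the number of rows; for several stores it could be applied per group, for example on the pieces returned by split().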
This is a simpler version of the ugly one I wrote earlier. Note that it sums tickets in fixed 15-minute bins rather than in a window that moves with each row:
# install.packages('dplyr')
library(dplyr)
your_data %>%
mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
ticket_count = as.numeric(ticket_count)) %>%
mutate(window = cut(timestamp, '15 min')) %>%
group_by(window) %>%
dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
Here is a solution using data.table. Also incorporating different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00")+seq(60,120000,by=60),
ticket_count = sample(1:9, 2000, T),
store = c(rep(c("A","B","C","D"), 500)))
Now apply the following:
ts <- dt$Timestamp
for(x in ts) {
end <- x+900
dt[Timestamp <= end & Timestamp >= x ,CS := sum(ticket_count),by=store]
}
This gives you
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6

R: sequence of days between dates

I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.
I have the following code
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.
Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use %in% which also makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
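One possible fully vectorized variant of this count, offered as a hedged sketch rather than as the answerer's method: the number of leases in effect on a day equals the number of leases started on or before that day minus the number that ended before it. Base R's findInterval() returns exactly such "how many values are <= x" counts on a sorted vector. The small data set below is made up for illustration, and the sketch assumes every EndDate is on or after its StartDate.

```r
days <- seq(as.Date("2012-01-01"), as.Date("2012-01-10"), by = "day")
leases <- data.frame(
  StartDate = as.Date(c("2012-01-01", "2012-01-03", "2012-01-08")),
  EndDate   = as.Date(c("2012-01-04", "2012-01-09", "2012-01-20"))
)

d <- as.numeric(days)  # days since 1970-01-01

# Leases with StartDate <= day
started <- findInterval(d, sort(as.numeric(leases$StartDate)))
# Leases with EndDate < day, i.e. EndDate <= day - 1 for whole-day dates
ended <- findInterval(d - 1, sort(as.numeric(leases$EndDate)))

lease_count <- started - ended
data.frame(AllDays = days, lease_count = lease_count)
```

No per-day loop or per-lease seq() is needed, so the cost is dominated by the two sorts.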
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on @MichaelChirico's data.
setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
# user system elapsed
# 0.114 0.018 0.136
# @MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation on how it works by @Arun, which got me started with data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates,Lease$id)) %>% # create combinations of dates and ranges
select(dates=Var1, id=Var2) %>%
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC") # interval() replaces new_interval(), which is deprecated in later lubridate versions
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1

How to subset the most recent 12 months of data for each ID in a data frame?

I have a data frame representing 15 years of follow-up data from several hundred patients. I want to create a subset of the data frame including the most recent 12 months of data for each patient.
Here is a representative example of my data (including one missing value, because missing data abound in my actual dataset):
# Create example dataset.
example.dat <- data.frame(
ID = c(1,1,1,1,2,2,2,3,3,3), # patient ID numbers
Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06", # follow-up dates
"2005-06-14", "2002-11-24", "2009-03-05",
"2009-07-20", "2005-09-02", "2006-01-15",
"2006-05-18")),
Cat = c("Yes", "Yes", "No", "Yes", "No", # responses to a categorical variable
"Yes", "Yes", NA, "No", "No")
)
example.dat
Which yields the following output:
ID Date Cat
1 1 2000-02-01 Yes
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
5 2 2002-11-24 No
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
I need to figure out how to subset, for each ID number, the most recent record and all records from the previous 12 months.
ID Date Cat
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
Several questions have already been asked about subsetting by date in R, but they are generally concerned with subsetting data from a specific date or range of dates, not subsetting by ((variable end date) - (time interval)).
For the sake of completeness, here are two data.table approaches using either subsetting by groups or a non-equi join. In addition, lubridate is used to ensure a period of 12 months is picked even in the case of leap years.
Subsetting by groups
This is essentially the data.table version of docendo discimus' dplyr answer. However, lubridate functions are used for date arithmetic because simply subtracting 365 days will not cover a period of 12 months, as requested by the OP, if the past year contains a leap day:
library(data.table)
library(lubridate)
setDT(example.dat)[, .SD[Date >= max(Date) %m-% years(1)], by = ID]
ID Date Cat
1: 1 2004-10-21 Yes
2: 1 2005-02-06 No
3: 1 2005-06-14 Yes
4: 2 2009-03-05 Yes
5: 2 2009-07-20 Yes
6: 3 2005-09-02 NA
7: 3 2006-01-15 No
8: 3 2006-05-18 No
Non-equi join
With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins:
library(data.table)
library(lubridate)
mDT <- setDT(example.dat)[, max(Date) %m-% years(1), by = ID]
example.dat[example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]]
ID Date Cat
1: 1 2004-10-21 Yes
2: 1 2005-02-06 No
3: 1 2005-06-14 Yes
4: 2 2009-03-05 Yes
5: 2 2009-07-20 Yes
6: 3 2005-09-02 NA
7: 3 2006-01-15 No
8: 3 2006-05-18 No
mDT contains the start dates of the 12 months period for each ID:
ID V1
1: 1 2004-06-14
2: 2 2008-07-20
3: 3 2005-05-18
The non-equi join returns the indices of the rows which fulfill the conditions
example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]
[1] 2 3 4 6 7 8 9 10
which are then used to finally subset example.dat.
Comparison of date arithmetic methods
The answers posted so far employed three different methods to find a date 12 months earlier:
docendo discimus subtracts 365 days,
G. Grothendieck uses seq.Date(),
this answer uses years() and %m-%
The three methods differ in case a leap day is included in the period:
library(data.table)
library(lubridate)
mseq <- Vectorize(function(x) seq(x, length = 2L, by = "-1 year")[2L])
data.table(Date = as.Date("2016-02-28") + 0:2)[
, minus_365d := Date -365][
, minus_1yr := Date - years()][
, minus_1yr_m := Date %m-% years()][
, seq.Date := as_date(mseq(Date))][]
Date minus_365d minus_1yr minus_1yr_m seq.Date
1: 2016-02-28 2015-02-28 2015-02-28 2015-02-28 2015-02-28
2: 2016-02-29 2015-03-01 <NA> 2015-02-28 2015-03-01
3: 2016-03-01 2015-03-02 2015-03-01 2015-03-01 2015-03-01
If there is no leap day in the past period, all three methods return the same result (row 1).
If a leap day is included in the past period, subtracting 365 days does not fully cover 12 months (row 3) as a leap year has 366 days.
If the reference date is a leap date, the seq.Date() approach picks the next day, 1 March 2015, as there is no 29 February in 2015. Using lubridate's %m-% rolls the date to the last day of February, 28 Feb 2015, instead.
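The row 3 case can be reproduced in base R alone, without lubridate. The sketch below compares only two of the three methods (subtracting 365 days and stepping back one year with seq()); the three sample dates are made up to sit right at the boundary.

```r
ref <- as.Date("2016-03-01")  # the past 12 months contain 29 Feb 2016

# Cutoff by subtracting a fixed 365 days
cutoff_365 <- ref - 365
# Cutoff by stepping back one calendar year
cutoff_seq <- seq(ref, length.out = 2, by = "-1 year")[2]

format(cutoff_365)  # "2015-03-02"
format(cutoff_seq)  # "2015-03-01"

# A record dated exactly one calendar year earlier is kept by only one cutoff
dates <- as.Date(c("2015-03-01", "2015-03-02", "2016-03-01"))
dates[dates >= cutoff_365]  # drops 2015-03-01
dates[dates >= cutoff_seq]  # keeps all three
```

Which cutoff is right depends on whether "12 months" is meant as a calendar year or as a fixed number of days.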
Here is a base solution. We have ave operate on dates as numbers since if we were to use raw "Date" values ave would try to return "Date" values. Instead, ave returns 0/1 values and !! converts those to FALSE/TRUE.
in_last_yr <- function(x) {
max_date <- as.Date(max(x), "1970-01-01")
x > seq(max_date, length = 2, by = "-1 year")[2]
}
subset(example.dat, !!ave(as.numeric(Date), ID, FUN = in_last_yr))
Update: improved the method of determining which days are in the last year.
A possible approach using dplyr
library(dplyr)
example.dat %>% group_by(ID) %>% filter(Date >= max(Date)-365)
#Source: local data frame [8 x 3]
#Groups: ID
#
# ID Date Cat
#1 1 2004-10-21 Yes
#2 1 2005-02-06 No
#3 1 2005-06-14 Yes
#4 2 2009-03-05 Yes
#5 2 2009-07-20 Yes
#6 3 2005-09-02 NA
#7 3 2006-01-15 No
#8 3 2006-05-18 No

Resources