Get the Date Difference in Data.table - r

I'd like to know how to get the date difference between two columns of a data.table using lapply.
library(data.table)
dt <- fread(" ID Date ME_Mes DOB
A 2017-02-20 0.0000 2016-08-19
B 2017-02-06 2.3030 2016-03-11
C 2017-03-20 0.4135 2016-08-19
D 2017-03-06 0.0480 2016-10-09
E 2017-04-20 2.4445 2016-05-04")
> dt
ID Date ME_Mes DOB
1: A 2017-02-20 0.0000 2016-08-19
2: B 2017-02-06 2.3030 2016-03-11
3: C 2017-03-20 0.4135 2016-08-19
4: D 2017-03-06 0.0480 2016-10-09
5: E 2017-04-20 2.4445 2016-05-04
I'd like to calculate the difference in weeks between DOB and Date for every ID.
I tried the following:
dt[,lapply(.SD, diff.Date), .SDcols = c(4,2), ID] # but did not work!

You can use difftime to get the difference in weeks, although you first need to convert your columns to a date-time class (strptime returns POSIXlt).
If you want to keep the classes of your columns as they are, this works:
dt[, DOB_Date := difftime(strptime(Date, format = "%Y-%m-%d"),
                          strptime(DOB, format = "%Y-%m-%d"), units = "weeks")]
dt
## ID Date ME_Mes DOB DOB_Date
## 1: A 2017-02-20 0.0000 2016-08-19 26.43452 weeks
## 2: B 2017-02-06 2.3030 2016-03-11 47.42857 weeks
## 3: C 2017-03-20 0.4135 2016-08-19 30.42857 weeks
## 4: D 2017-03-06 0.0480 2016-10-09 21.14881 weeks
## 5: E 2017-04-20 2.4445 2016-05-04 50.14286 weeks
However, as @Frank suggested, it's better to convert ("overwrite") your date columns to POSIXct class first.

My hunch (and I will let others correct me) is that the following is faster on large datasets:
dt[,Date:=as.Date(Date)]
dt[,DOB:=as.Date(DOB)]
dt[,datediff:=as.integer(Date)-as.integer(DOB)]
datediff will contain date differences in days.
If you have a truly large data.table, you may consider fastPOSIXct from fasttime for string conversion.
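The two answers can be combined: converting to Date first and then calling difftime gives weeks directly. Because Date carries no time of day or time zone, this also avoids what appear to be one-hour daylight-saving offsets in the strptime output above (26.43452 weeks for ID A, whereas 185 days is 185/7 ≈ 26.42857 weeks). A minimal sketch; the column name weeks_diff is made up for illustration:

```r
library(data.table)

dt <- data.table(
  ID   = c("A", "B", "C", "D", "E"),
  Date = c("2017-02-20", "2017-02-06", "2017-03-20", "2017-03-06", "2017-04-20"),
  DOB  = c("2016-08-19", "2016-03-11", "2016-08-19", "2016-10-09", "2016-05-04")
)

# Overwrite the character columns with Date columns
dt[, c("Date", "DOB") := lapply(.SD, as.Date), .SDcols = c("Date", "DOB")]

# Difference in weeks as a plain numeric column (weeks_diff is an illustrative name)
dt[, weeks_diff := as.numeric(difftime(Date, DOB, units = "weeks"))]
dt
```

No grouping by ID is needed here, since the subtraction is already vectorised over rows.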

Related

In R: creating a variable that shows the difference in months between two date variables

I have looked around for ages trying to find what I am looking for but none of the code has given me what I want. I need to create a variable that calculates the difference in months between two date variables.
For example, if I have the data below:
start_date end_date
2010-01-01 2010-12-31
2016-05-01 2016-12-31
2004-03-01 2004-10-31
1997-10-01 1998-08-31
I would like the outcome to look like the following:
start_date end_date month_count
2010-01-01 2010-12-31 12
2016-05-01 2016-12-31 8
2004-03-01 2004-10-31 8
1997-10-01 1998-08-31 11
That is, I would like the whole last month to be included. Many of the snippets I have tried give me 11 months for the first observation instead of 12, for example. Also, many of them hard-code the actual dates, but as I have a large dataset I can't do that and need to work from the variables instead.
Thank you in advance!
dplyr way
library(lubridate)
library(dplyr)
df %>% mutate(across(everything(), ~as.Date(.))) %>%
  # multiply the year difference (not just the start year) by 12
  mutate(months = (year(end_date) - year(start_date)) * 12 + month(end_date) - month(start_date) + 1)
Here is a possible way:
library(data.table)
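dtt <- fread(text = 'start_date end_date
2010-01-01 2010-12-31
2016-05-01 2016-12-31
2004-03-01 2004-10-31
1997-10-01 1998-08-31')
# include the year difference so that ranges crossing a year boundary are counted correctly
dtt[, month_count := (year(end_date) - year(start_date)) * 12L + month(end_date) - month(start_date) + 1L]
dtt
#    start_date   end_date month_count
# 1: 2010-01-01 2010-12-31          12
# 2: 2016-05-01 2016-12-31           8
# 3: 2004-03-01 2004-10-31           8
# 4: 1997-10-01 1998-08-31          11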
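The month arithmetic has to include the year difference, otherwise a range that crosses a year boundary (such as 1997-10-01 to 1998-08-31 from the question) comes out wrong. A small base-R sketch of the inclusive count; month_span is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: number of calendar months touched by [start, end]
month_span <- function(start, end) {
  s <- as.POSIXlt(as.Date(start))
  e <- as.POSIXlt(as.Date(end))
  # $year counts years since 1900 and $mon is 0-based; both offsets cancel in the difference
  (e$year - s$year) * 12 + (e$mon - s$mon) + 1
}

spans <- month_span(c("2010-01-01", "1997-10-01"),
                    c("2010-12-31", "1998-08-31"))
spans
```

The trailing + 1 is what makes the last month count in full, which is the behaviour the question asks for.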

R: time series monthly max adjusted by group

I have a df like that (head):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, and for that I need the max of every period. The max is normally on the last day of the month, but sometimes it can fall a few days before or after (28, 29, 30, 31, 01, 02).
To calculate it properly, I would like to assign to my reference date (the last day of the month) the max value of this group of days, so that the ratio reflects what it is supposed to.
This could be a reproducible example:
Start<-as.Date("2016-12-31")
End<-Sys.Date()
window<-data.table(seq(Start,End,by='1 day'))
dt<-cbind(window,rep(rnorm(nrow(window))))
colnames(dt)<-c("date","value")
# Create a Dateseq
DateSeq <- function(st, en, freq) {
st <- as.Date(as.yearmon(st))
en <- as.Date(as.yearmon(en))
as.Date(as.yearmon(seq(st, en, by = paste(as.character(12/freq),
"months"))), frac = 1)
}
# df to be filled with the group max.
Value.Max.Month<-data.frame(DateSeq(Start,End,12))
colnames(Value.Max.Month)<-c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(data.table)
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
dt <- data.table(date = seq(Start, End, by = "1 day"))
dt[, value := rnorm(.N)]
dt[, period := as.Date(as.yearmon(date)) %m+% months(1) - 1][
  , maximum := max(value), by = period][
  , unique(maximum), by = period]
In the first expression we create a new column called period. Then we group by this new column and look for the maximum in value. In the last expression we just output these unique rows.
Notice that to get the last day of each period we add one month using lubridate and then subtract 1 day.
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341
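The same "first of next month minus one day" idea can be written without lubridate or zoo, and the maximum can be aggregated in a single grouped step instead of adding a column and de-duplicating. A sketch under those assumptions; month_end is a hypothetical helper and the date range is shortened for illustration:

```r
library(data.table)

# Hypothetical helper: last calendar day of the month containing each date
month_end <- function(d) {
  lt <- as.POSIXlt(d)
  lt$mday <- rep(1L, length(d))  # first day of the current month
  lt$mon  <- lt$mon + 1L         # first day of the next month (month 12 rolls into January)
  as.Date(lt) - 1L               # step back one day
}

set.seed(1)  # only to make the random values repeatable
dt <- data.table(date = seq(as.Date("2016-12-31"), as.Date("2017-03-31"), by = "day"))
dt[, value := rnorm(.N)]

# One row per month-end period with the maximum value observed in that month
monthly_max <- dt[, .(maximum = max(value)), by = .(period = month_end(date))]
monthly_max
```

Like the answer above, this groups strictly by calendar month; it does not look at the first days of the following month that the question also mentions.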

Counting within a period of time

I have the following dataframe:
Person A took 5 vacations: the first from 2015-03-11 to 2015-03-15 and the last from 2016-02-04 to 2016-02-10.
Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05
I want to find all persons who took at least 5 vacations within 365 days.
In the example above, Person A took 5 vacations within 365 days. Person C took 6 vacations, but no 5 of them fall within 365 days.
The result should be a dataframe like
Person at_least_five_vacations_within_365_days
A TRUE
B FALSE
C FALSE
Your data:
library(data.table)
library(lubridate)
library(dplyr) # provides %>% and mutate(), which are used below
df <- fread("Person\tfromDate\ttoDate
A\t2015-03-11\t2015-03-15
A\t2015-04-17\t2015-06-16
A\t2015-09-18\t2015-10-12
A\t2015-12-18\t2016-01-02
A\t2016-02-04\t2016-02-10
B\t2015-04-10\t2016-04-16
B\t2016-12-12\t2016-12-20
C\t2015-01-02\t2015-02-04
C\t2015-03-03\t2015-03-05
C\t2015-04-04\t2015-04-07
C\t2016-01-03\t2016-01-10
C\t2016-10-12\t2016-10-15
C\t2016-11-01\t2016-11-05",header="auto",sep="auto") %>%
as.data.frame() %>%
mutate(fromDate=ymd(fromDate), toDate=ymd(toDate))
Setting number of trips window:
numoftrips <- 5
Using dplyr and assuming your dates are already sorted within each Person:
library(dplyr)
df1 <- df %>%
group_by(Person) %>%
mutate(toCompare=lead(toDate,(numoftrips-1))) %>% # Copy return date of 5th-trip-after as new column
mutate(within.year=(toCompare-fromDate)<=365) %>% # Check if difference is less than 365 days
summarise(at_least_five_vacations_within_365_days=ifelse(sum(within.year,na.rm=T)>0,TRUE,FALSE)) # If taken 5 trips in less than 365 days, return TRUE
Output
df1
Person at_least_five_vacations_within_365_days
1 A TRUE
2 B FALSE
3 C FALSE
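This might work, but you should specify the expected output.
library(dplyr)
df %>% group_by(Person) %>%
  mutate(diff = toDate - fromDate, instances = n()) %>%
  filter(instances >= 5 & diff < 365)
df is just your dataset and instances is the number of vacations per person. Note that diff is the length of each individual vacation, so this filters on trip length and trip count rather than on a rolling 365-day window.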
The accepted answer uses data.table to read the data but continues with a dplyr approach.
The approach below uses read_table2() from the readr package but achieves the desired result with a data.table "one-liner":
library(data.table) # CRAN version 1.10.4 used
n_trips <- 5L
n_days <- 365L
DT[order(Person, fromDate),
any(fromDate <= shift(toDate, n_trips - 1L, , "lag") + n_days, na.rm = TRUE),
by = Person][]
Person V1
1: A TRUE
2: B FALSE
3: C FALSE
Explanation
The approach is similar to the accepted answer: the toDate is lagged by the required number of trips of the person, and then it is checked whether the current fromDate is within the given range of days. The any() function is used to determine if there is at least one occurrence for a particular person. The result of shift operations depends on the order of rows, so the data.table is ordered beforehand.
The OP asked for all persons who took at least 5 vacations within 365 days but did not specify exactly how to count the vacations (by start date, by end date, or by a mixture of both). So, it was deliberately chosen to compare the end date of the 4th previous vacation with the start date of the current vacation.
Data
DT <- readr::read_table2(
"Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05"
)
library(data.table)
setDT(DT)
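For readers who prefer to stay within data.table, the lag-based check from the explanation above can be run end to end with fread instead of readr (as far as I know, read_table2() is deprecated in recent readr releases). This is a sketch of the same comparison, with the result column named at_least_five purely for illustration:

```r
library(data.table)

DT <- fread(text = "Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05")

# Make sure both columns are Date, whatever class fread picked
DT[, c("fromDate", "toDate") := lapply(.SD, as.Date), .SDcols = c("fromDate", "toDate")]

n_trips <- 5L
n_days  <- 365L

# Compare each start date with the end date of the 4th previous trip of the same person
res <- DT[order(Person, fromDate),
          .(at_least_five = any(fromDate <= shift(toDate, n_trips - 1L, type = "lag") + n_days,
                                na.rm = TRUE)),
          by = Person]
res
```

Persons with fewer than n_trips rows only produce NA comparisons, which any(..., na.rm = TRUE) turns into FALSE.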

Fastest way for filling-in missing dates for data.table

I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan will reuse the 02-Jan values and 06-Jan will reuse the 05-Jan values).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE, and expand.grid for missing weekdays - it may work for weekdays (since they are just 7 weekdays) - but not sure if that would be the right way to go about dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
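The rolling join above covers the "carry the last value forward" case. For the other option in the question, filling the gaps with a default of zero, a plain join followed by an NA replacement is one possibility. A minimal sketch, with the table rebuilt by hand and named day_orders for illustration:

```r
library(data.table)

day_orders <- data.table(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05", "2013-01-07")),
  orders = c(50L, 3L, 1L, 2L, 2L),
  amount = c(2272.55, 64.04, 18.81, 77.62, 35.82),
  guests = c(149L, 4L, 0L, 0L, 2L)
)

all_dates <- data.table(date = seq(min(day_orders$date), max(day_orders$date), by = "day"))

# Plain (non-rolling) join: dates without a match get NA in every value column
filled <- day_orders[all_dates, on = "date"]

# Replace the NAs column by column with the default value 0
for (col in c("orders", "amount", "guests")) {
  set(filled, i = which(is.na(filled[[col]])), j = col, value = 0)
}
filled
```

Because the date sequence is generated with seq(), the same code works unchanged for multi-year data.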
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)  # xts attaches zoo, which provides na.locf()
na.locf(dt)
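If an extra dependency is not wanted, the carry-forward step of the merge approach can also be done with data.table's own setnafill(). This assumes data.table 1.12.4 or later and numeric value columns; the small table below is rebuilt by hand for illustration:

```r
library(data.table)

day_orders <- data.table(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05", "2013-01-07")),
  orders = c(50L, 3L, 1L, 2L, 2L),
  amount = c(2272.55, 64.04, 18.81, 77.62, 35.82)
)
all_dates <- data.table(date = seq(min(day_orders$date), max(day_orders$date), by = "day"))

# Full outer merge introduces NA rows for the missing dates (result is sorted by date)
filled <- merge(day_orders, all_dates, by = "date", all = TRUE)

# Last observation carried forward, by reference, on the value columns only
setnafill(filled, type = "locf", cols = c("orders", "amount"))
filled
```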

Add months to IDate column of data.table in R

I have been using data.table for practically everything I was using data.frames for, as it is much, much faster on big in-memory data (several million rows). However, I'm not quite sure how to add days or months to an IDate column without using apply (which is very slow).
A minimal example:
dates = c("2003-01-01", "2003-02-01", "2003-03-01", "2003-06-01", "2003-12-01",
"2003-04-01", "2003-05-01", "2003-07-01", "2003-09-01", "2003-08-01")
dt = data.table(idate1=as.IDate(dates))
Now, let's say I want to create a column with dates 6 months ahead. Normally, for a single IDate, I would do this:
seq(dt$idate1[1],by="6 months",length=2)[2]
But this won't work as from= must be of length 1:
dt[,idate2:=seq(idate1,by="6 months",length=2)[2]]
Is there an efficient way of doing it to create column idate2 in dt?
Thanks a lot,
RR
One way is to use the mondate package: add the months to a mondate object and then convert it back to an IDate object.
require(mondate)
dt = data.table(idate1=as.IDate(dates))
dt[, idate2 := as.IDate(mondate(as.Date(idate1)) + 6)]
# idate1 idate2
# 1: 2003-01-01 2003-07-01
# 2: 2003-02-01 2003-08-02
# 3: 2003-03-01 2003-09-01
# 4: 2003-06-01 2003-12-02
# 5: 2003-12-01 2004-06-01
# 6: 2003-04-01 2003-10-02
# 7: 2003-05-01 2003-11-01
# 8: 2003-07-01 2004-01-01
# 9: 2003-09-01 2004-03-02
# 10: 2003-08-01 2004-02-01
Although, I suppose that there might be other better solutions.
You can use lubridate,
library(lubridate)
dt[, idate2 := as.IDate(idate1 %m+% months(6))]
idate1 idate2
1: 2003-01-01 2003-07-01
2: 2003-02-01 2003-08-01
3: 2003-03-01 2003-09-01
4: 2003-06-01 2003-12-01
5: 2003-12-01 2004-06-01
6: 2003-04-01 2003-10-01
7: 2003-05-01 2003-11-01
8: 2003-07-01 2004-01-01
9: 2003-09-01 2004-03-01
10: 2003-08-01 2004-02-01
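A vectorised alternative without extra packages is to shift the month field of a POSIXlt representation and convert back. This is only a sketch; add_months is a hypothetical helper name. One caveat I would expect: for days that do not exist in the target month (e.g. the 31st), this normalises forward into the following month, whereas %m+% rolls back to the last day of the month. For first-of-month dates, as in the question, the two agree.

```r
library(data.table)

# Hypothetical helper: shift dates by n calendar months using POSIXlt fields
add_months <- function(d, n) {
  lt <- as.POSIXlt(as.Date(d))
  lt$mon <- lt$mon + n      # out-of-range months are normalised on conversion
  as.IDate(as.Date(lt))
}

dt <- data.table(idate1 = as.IDate(c("2003-01-01", "2003-12-01", "2003-08-01")))
dt[, idate2 := add_months(idate1, 6L)]
dt
```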

Resources