I have a data frame where each row has a unique ID. I need to replicate each one of these rows based on the number of days between the start date and the max of the end date and the approval date.
ID <- c(1,2)
Value <- c(10,20)
StartDate <- c(as.Date("01/01/2015", '%d/%m/%Y'),
as.Date("01/01/2015", '%d/%m/%Y'))
EndDate <- c(as.Date("31/01/2015", '%d/%m/%Y'),
as.Date("15/01/2015", '%d/%m/%Y'))
AppDate <- c(as.Date("15/01/2015", '%d/%m/%Y'),
as.Date("15/02/2015", '%d/%m/%Y'))
df <- data.frame(ID, Value, StartDate, EndDate, AppDate)
df <- df[rep(row.names(df), ifelse(as.numeric(df$AppDate) >
as.numeric(df$EndDate),as.numeric(df$AppDate-df$StartDate),
as.numeric(df$EndDate-df$StartDate)) + 1),]
I then need to add a sequential list of dates from the start date to the max of the end date or approval date.
I've done this via 2 loops. The outer loop loops through the data frame for each unique ID. The second loop then goes through the ID and adds the date. Once the second loop has finished it passes the row to the outer loop as the new start point.
IDs <- unique(df$ID)
df$Days <- rep(as.Date("01/01/1999",'%d/%m/%Y'), nrow(df))
counter <- 1
for (i in 1:length(IDs)) {
ref <- IDs[i]
start <- 1
while (df$ID[counter] == ref) {
ifelse(start == 1, df$Days[counter] <- df$StartDate[counter],
df$Days[counter] <- df$StartDate[counter] + start -1)
ifelse (counter > nrow(df), break, counter <- counter + 1)
ifelse (counter > nrow(df), break, start <- start + 1)
}
}
My actual data set has over 6,000 IDs, and once the rows are replicated it grows to over 500,000 rows. The loop took over 15 minutes to run, so it is obviously very inefficient.
So I guess I have 2 questions.
1) What is the most efficient way to do this in R?
2) What would be the most efficient way of doing this in general, e.g. in something like C++?
thanks
Here is one solution that is vectorized. Note: Your code does not match the concept of taking the maximum of EndDate and AppDate, which I tried to do, but if that is not what you want, you can modify the code accordingly.
library(dplyr)
# Use the group's own AppDate; df$AppDate would take the maximum over all IDs
df <- df %>%
  group_by(ID) %>%
  mutate(Days = rep(seq(min(StartDate), max(EndDate, AppDate), 'days'),
                    ceiling(nrow(df) / n()))[1:n()])
Output will be as follows (just the first few rows):
head(df)
Source: local data frame [6 x 6]
Groups: ID [1]
ID Value StartDate EndDate AppDate Days
(dbl) (dbl) (date) (date) (date) (date)
1 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-01
2 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-02
3 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-03
4 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-04
5 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-05
6 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-06
tail(df)
Source: local data frame [6 x 6]
Groups: ID [1]
ID Value StartDate EndDate AppDate Days
(dbl) (dbl) (date) (date) (date) (date)
1 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-10
2 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-11
3 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-12
4 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-13
5 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-14
6 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-15
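The row replication and the date column can also be produced together in base R, with no loop and no grouping. The following is a minimal sketch on the sample data from the question; the names last_day, n_days and expanded are introduced here for illustration only.

```r
# Sample data from the question
df <- data.frame(
  ID        = c(1, 2),
  Value     = c(10, 20),
  StartDate = as.Date(c("2015-01-01", "2015-01-01")),
  EndDate   = as.Date(c("2015-01-31", "2015-01-15")),
  AppDate   = as.Date(c("2015-01-15", "2015-02-15"))
)

# Row-wise maximum of the two candidate end dates
last_day <- pmax(df$EndDate, df$AppDate)

# Number of calendar days each ID covers, both ends included
n_days <- as.integer(last_day - df$StartDate) + 1L

# Repeat each row n_days times, then add the offset 0, 1, 2, ... within each ID
expanded <- df[rep(seq_len(nrow(df)), n_days), ]
expanded$Days <- expanded$StartDate + sequence(n_days) - 1L
rownames(expanded) <- NULL
```

Both rep() and sequence() are vectorized, so the work should grow with the number of output rows rather than with a row-by-row loop.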
Normally, I would recommend a cross join SQL query, which returns a Cartesian product (all combinations between two sets). However, you can replicate the cross join in R using merge() without any by arguments and with all=TRUE. From there, filter on the EndDate cut-off:
# CALCULATE CONDITIONAL END DATE
df$TrueEndDate <- as.Date(ifelse(df$AppDate > df$EndDate,
df$AppDate,
df$EndDate), origin="1970-01-01")
# CREATE A SEQUENTIAL DATES DATA FRAME (HERE IS 60 DAYS FROM 2015-01-01)
dates <- data.frame(Date=as.Date(unlist(lapply(0:60, function(x)
as.Date("2015-01-01") + x)),
origin="1970-01-01"))
# RUN CROSS JOIN MERGE, PULLING ONLY NEEDED FIELDS
mergedf <- merge(df[c('ID', 'StartDate', 'TrueEndDate')], dates, all=TRUE)
# FILTER OUT DATES PAST ROW'S TRUE END DATE
mergedf <- mergedf[(mergedf$Date <= mergedf$TrueEndDate),]
# CLEANUP
mergedf <- mergedf[with(mergedf, order(ID)), ] # ORDER BY ID
row.names(mergedf) <- 1:nrow(mergedf) # RESET ROW NAMES
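The filter above only enforces the upper bound, which is enough when every row starts on the first date of the sequence. Below is a minimal sketch for rows with differing start dates. The sample values are made up for illustration, and by = NULL is used to ask merge() for the Cartesian product explicitly.

```r
# Hypothetical ranges with different start dates
ranges <- data.frame(
  ID          = c(1, 2),
  StartDate   = as.Date(c("2015-01-01", "2015-01-10")),
  TrueEndDate = as.Date(c("2015-01-31", "2015-02-15"))
)

# One row per calendar day between the earliest start and the latest end
dates <- data.frame(Date = seq(min(ranges$StartDate),
                               max(ranges$TrueEndDate), by = "day"))

# Cartesian product of ranges and dates
crossed <- merge(ranges, dates, by = NULL)

# Keep only the days that fall inside each row's own range
in_range <- crossed$Date >= crossed$StartDate & crossed$Date <= crossed$TrueEndDate
crossed <- crossed[in_range, ]
crossed <- crossed[order(crossed$ID, crossed$Date), ]
rownames(crossed) <- NULL
```

The intermediate product has one row per ID per day, so for thousands of IDs it may be worth checking memory use before running it on the full data.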
If you are curious, here is the equivalent cross join SQL (which R can run on an RDBMS engine and import as the final data frame; this may help with performance issues):
SELECT ID.ID, ID.Value, ID.StartDate,
CASE WHEN ID.AppDate > ID.EndDate
THEN ID.AppDate
ELSE ID.EndDate
END As TrueEndDate,
Dates.Dates
FROM ID, Dates
WHERE Dates.Dates <= CASE WHEN ID.AppDate > ID.EndDate
THEN ID.AppDate ELSE ID.EndDate
END
ORDER BY ID.ID, Dates.Dates
I have a data set with sales values for N products, running from some yyyy-mm-dd to some yyyy-mm-dd. I want to filter the data to the last 12 months for each product in the data set.
Eg:
Say I have values from 2016-01-01 to 2020-02-01.
I want to keep the sales values for the last 12 months, that is, from 2019-02-01 to 2020-02-01.
I cannot simply hard-code "filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01"))" because the end date changes as each month passes, so I need to automate this.
You can use:
library(dplyr)
library(lubridate)
data %>%
group_by(Product) %>%
filter(between(date, max(date) - years(1), max(date)))
#filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximum date per product minus 365 days:
library(dplyr)
df %>%
group_by(Products) %>%
filter(Date >= max(Date)-365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
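The same per-product cutoff can be sketched in base R with ave(), in case dplyr is not available. The sketch uses the same sample data and the same 365-day window; cutoff and recent are illustrative names.

```r
df <- data.frame(
  Products = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
                   "2000-06-01", "2002-02-24", "2002-02-10",
                   "2001-07-01", "2003-01-02", "2005-03-10",
                   "2002-05-01")))

# ave() returns, for every row, the latest date (as a day number)
# within that row's product; subtracting 365 gives the cutoff
cutoff <- ave(as.numeric(df$Date), df$Products, FUN = max) - 365

# Keep the rows on or after their product's cutoff
recent <- df[as.numeric(df$Date) >= cutoff, ]
```

Like the dplyr version, this counts 365 days rather than a calendar year, so the window is one day short when it spans 29 February.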
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year to get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.Date(start.date.lt) # back to Date so it compares cleanly with end.date
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$day = 1, and so forth if you need to adjust in different ways.
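The "today.last.year" idea can be sketched as follows. The function name today_last_year and its feb29 argument are hypothetical, chosen only to illustrate the explicit leap-day treatment described above.

```r
# Return the same calendar day one year earlier.
# feb29 controls what 29 February maps to in the previous (non-leap) year.
today_last_year <- function(date = Sys.Date(), feb29 = c("feb28", "mar1")) {
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(date)

  # Months are 0-based in POSIXlt, so mon == 1 is February
  is_feb29 <- !is.na(lt$mon) & lt$mon == 1L & lt$mday == 29L

  lt$year <- lt$year - 1L
  if (feb29 == "feb28") {
    lt$mday[is_feb29] <- 28L
  } else {
    lt$mon[is_feb29]  <- 2L   # March
    lt$mday[is_feb29] <- 1L
  }
  as.Date(lt)
}

# Usage: window covering the last year up to today
end.date <- Sys.Date()
start.date <- today_last_year(end.date)
```

Setting the day explicitly avoids depending on how an invalid 29 February is normalized during conversion.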
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
product = rep(c("a", "b"), 6),
date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
year(x) <- year(x) - 1
x
}
date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch=0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?
I have a big data frame with date ranges and I need to find, for each ID, the earliest start date of the most recent continuous run of periods, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the date ranges overlap, and I need to return, by ID (not ID_2), the row where the first gap (going backwards in time) appears. I've tried using a for loop but it's extremely slow (the data frame has 150k rows). I've been experimenting with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2008-07-26 1980-10-20
Is there a way of doing this using dplyr, tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate(across(c(END, BEG), as.Date)) %>% # funs() is deprecated in current dplyr
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the first row (in the given order) that satisfies your criteria, i.e. the first row which is either a gap (TRUE) or, if there is no gap, the last row of the group (lead(END) is missing there, which is why is.na(BEG > ( lead(END) + 1 )) is included).
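The steps above can also be sketched in base R on the sample data from the question. The helper name first_gap is hypothetical; the sketch sorts each ID by END in decreasing order itself rather than relying on the input order.

```r
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  ID_2 = c(55, 55, 88, 19, 33, 19, 33, 19, 19),
  END  = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26",
                   "2014-09-30", "2013-04-30", "2012-12-31",
                   "2010-12-31", "2007-12-31", "2005-12-31")),
  BEG  = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24",
                   "2013-05-01", "2011-01-01", "2011-01-01",
                   "2008-01-01", "2006-01-01", "1980-10-20"))
)

# For one ID: walk backwards in time and return the first row whose BEG
# is more than one day after the END of the next (older) period.
# The oldest row has no older period and therefore always qualifies.
first_gap <- function(g) {
  g <- g[order(g$END, decreasing = TRUE), ]
  next_end <- c(g$END[-1], NA)
  is_gap <- is.na(next_end) | g$BEG > next_end + 1
  g[which(is_gap)[1], ]
}

result <- do.call(rbind, lapply(split(df, df$ID), first_gap))
```

On 150k rows the split and lapply approach may be slower than the grouped dplyr version, so treat it as a readable reference rather than a tuned implementation.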
I would use the xts package: first create an xts object for each ID you have, then use the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet
I need to "split" a 15 million line df of the following form:
library(lubridate)
dateStart <- c(lubridate::ymd("2010-01-01"))
dateEnd <- c(lubridate::ymd("2010-03-06"))
length <- c(65)
Amt <- c(348.80)
df1 <- data.frame(dateStart, dateEnd, length, Amt)
df1
# dateStart dateEnd length Amt
# 1 2010-01-01 2010-03-06 65 348.8
into something like:
dateStart dateEnd length Amt
1 2010-01-01 2010-01-31 31 166.35
2 2010-02-01 2010-02-28 28 150.25
3 2010-03-01 2010-03-06 6 32.20
Where length is the number of days and Amt is the pro-rata amount for the number of days. Does anybody know how to do this? Someone mentioned the padr package to me but I do not know how to use it for this specific purpose.
Thank you in advance
I'm going to assume you have some sort of unique id field in your data set so that each record is unique. Otherwise this is not going to work. I also added 1 extra record so we can see that everything works on multiple records.
Data:
library(lubridate)
id <- c(1:2) # added id field needed for unique record and needed for grouping
dateStart <- c(lubridate::ymd("2010-01-01", "2011-01-09"))
dateEnd <- c(lubridate::ymd("2010-03-06", "2011-04-09"))
length <- c(65, 91)
Amt <- c(348.80, 468.70)
df1 <- data.frame(id , dateStart, dateEnd, length, Amt)
First create a data.frame which has the id and the missing months. We need dplyr, tidyr and padr. Group by the unique id and gather the dates so that start and end dates are in 1 column. For padr to extend the months we first need to thicken the data.frame. Then get rid of the columns that are not needed and fill in the missing months.
library(dplyr)
library(tidyr)
library(padr)
#create last_day function for later use
last_day <- function(date) {
ceiling_date(date, "month") - days(1)
}
dates <- df1 %>%
select(id, dateStart, dateEnd) %>%
group_by(id) %>%
gather(names, dates, -id) %>%
arrange(id, dates) %>%
thicken(interval = "month") %>% # need to thicken first for month interval
select(-c(names, dates)) %>%
pad(interval = "month")
dates
# A tibble: 7 x 2
# Groups: id [2]
id dates_month
<int> <date>
1 1 2010-01-01
2 1 2010-02-01
3 1 2010-03-01
4 2 2011-01-01
5 2 2011-02-01
6 2 2011-03-01
7 2 2011-04-01
Next join back the data to the original data.frame
df_extended <- inner_join(dates, df1, by = "id")
df_extended
# A tibble: 7 x 6
# Groups: id [2]
id dates_month dateStart dateEnd length Amt
<int> <date> <date> <date> <dbl> <dbl>
1 1 2010-01-01 2010-01-01 2010-03-06 65 349.
2 1 2010-02-01 2010-01-01 2010-03-06 65 349.
3 1 2010-03-01 2010-01-01 2010-03-06 65 349.
4 2 2011-01-01 2011-01-09 2011-04-09 91 469.
5 2 2011-02-01 2011-01-09 2011-04-09 91 469.
6 2 2011-03-01 2011-01-09 2011-04-09 91 469.
7 2 2011-04-01 2011-01-09 2011-04-09 91 469.
Now to get to the end result. We need to use case_when, because ifelse strips the Date class and returns plain numbers. The case_when calls set the correct start and end dates (I assume you need the exact start date, not the first of the month; otherwise adjust the code to use dates_month instead). I create an amount per day (amt_pd) variable and multiply it by the number of days in the month to get the pro-rata amount.
df_end <- df_extended %>%
mutate(dateEnd = case_when(last_day(dates_month) <= dateEnd ~ last_day(dates_month),
TRUE ~ dateEnd),
dateStart = case_when(dates_month <= dateStart ~ dateStart,
TRUE ~ dates_month),
amt_pd = Amt / length,
length = dateEnd - dateStart + 1,
Amt = amt_pd * length) %>%
select(-c(dates_month, amt_pd)) # get rid of not needed columns
df_end
# A tibble: 7 x 5
# Groups: id [2]
id dateStart dateEnd length Amt
<int> <date> <date> <time> <time>
1 1 2010-01-01 2010-01-31 31 166.350769230769
2 1 2010-02-01 2010-02-28 28 150.252307692308
3 1 2010-03-01 2010-03-06 6 32.1969230769231
4 2 2011-01-09 2011-01-31 23 118.462637362637
5 2 2011-02-01 2011-02-28 28 144.215384615385
6 2 2011-03-01 2011-03-31 31 159.667032967033
7 2 2011-04-01 2011-04-09 9 46.354945054945
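As a small illustration of why ifelse is awkward with dates, the sketch below shows that it returns the underlying day counts rather than Date objects, along with two base R ways to keep the class. The variable names are illustrative only.

```r
d1 <- as.Date(c("2010-01-31", "2010-03-31"))  # e.g. last day of the month
d2 <- as.Date(c("2010-03-06", "2010-03-06"))  # e.g. the record's end date

# ifelse() drops the Date class and returns days since 1970-01-01
raw <- ifelse(d1 <= d2, d1, d2)
class(raw)

# Option 1: convert the numbers back to Date
restored <- as.Date(raw, origin = "1970-01-01")

# Option 2: pmin() keeps the Date class and gives the earlier date directly
earlier <- pmin(d1, d2)
```

case_when, as used above, is a third option when the condition is more involved than a simple minimum or maximum.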
All of this could be done in one go. But if you have 15 million rows it might be better to check that the intermediate steps work. Also note that pad has a break_above option, documented as:
This is a numeric value that indicates the number of rows in millions
above which the function will break. Safety net for situations where
the interval is different than expected and padding yields a very
large dataframe, possibly overflowing memory.
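For comparison, the monthly split can also be sketched in base R for a single record, without padr. The helper split_by_month is hypothetical and not part of any package; it assumes start <= end and pro-rates the amount by calendar days.

```r
split_by_month <- function(start, end, amt) {
  # First day of every month touched by the period
  month_starts <- seq(as.Date(format(start, "%Y-%m-01")), end, by = "month")

  # First day of the following month, used to find each month's last day
  next_month <- seq(month_starts[1], by = "month",
                    length.out = length(month_starts) + 1)[-1]

  # Clip each month to the actual period
  seg_start <- pmax(month_starts, start)
  seg_end   <- pmin(next_month - 1, end)

  len <- as.integer(seg_end - seg_start) + 1L
  data.frame(dateStart = seg_start,
             dateEnd   = seg_end,
             length    = len,
             Amt       = amt * len / sum(len))
}

# Usage with the record from the question
res <- split_by_month(as.Date("2010-01-01"), as.Date("2010-03-06"), 348.80)
```

Applying this to 15 million rows one record at a time would likely be slow, so it is meant as a check of the logic rather than a replacement for the vectorized pipeline.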
Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07/03/2003 to 31/12/2016.
I need to split/filter the data into multiple time periods, as below.
Dates required in the new data frame:
07/03/2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012
Let's take the following data.frame with dates:
library(dplyr)
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd, dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
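To keep two separate ranges in one step, as the question asks, the two conditions can be combined with a logical OR. This is a minimal base R sketch with made-up boundary dates in the data; in_first, in_second and kept are illustrative names.

```r
# Hypothetical data sitting exactly on the boundaries of the two ranges
df <- data.frame(
  date  = as.Date(c("2003-03-07", "2005-03-06", "2005-03-07",
                    "2012-12-31", "2013-01-01", "2016-12-31")),
  value = 1:6
)

# One logical vector per wanted range, both ends included
in_first  <- df$date >= as.Date("2003-03-07") & df$date <= as.Date("2005-03-06")
in_second <- df$date >= as.Date("2013-01-01") & df$date <= as.Date("2016-12-31")

# A row is kept if it falls in either range
kept <- df[in_first | in_second, ]
```

The same idea carries over to dplyr by joining two between() calls with | inside a single filter().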
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N), numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after applying it:
plot(my_dt) # before
my_dt <- my_dt[!(date_sample %within% forbidden_dates)] # applying the temporal cut
plot(my_dt) # after
I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.
I have the following code
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
The date of interest (dateOfInt) is just the current date and is used simply to ensure that lease dates in the future are not counted.
Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use %in% which also makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
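One candidate for the fully vectorized solution mentioned above is to count, for each day, the leases that have started minus the leases that have already ended. The sketch below uses findInterval() from base R on a small made-up data set; count_active is a hypothetical helper and assumes every lease has StartDate <= EndDate.

```r
count_active <- function(days, start, end) {
  d <- as.numeric(days)
  # Number of leases with start <= day
  started <- findInterval(d, sort(as.numeric(start)))
  # Number of leases with end < day (left.open = TRUE makes the comparison strict)
  ended <- findInterval(d, sort(as.numeric(end)), left.open = TRUE)
  started - ended
}

# Small made-up example
days  <- seq(as.Date("2012-01-01"), as.Date("2012-02-01"), by = "day")
start <- as.Date(c("2012-01-01", "2012-01-08"))
end   <- as.Date(c("2012-01-10", "2012-01-21"))

counts <- count_active(days, start, end)

# Brute-force check, one comparison per day and lease
brute <- sapply(days, function(x) sum(start <= x & end >= x))
```

Sorting the two date vectors once and looking each day up in them avoids both the per-lease seq() and the per-day scan over all leases.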
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on @MichaelChirico's data.
setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
# user system elapsed
# 0.114 0.018 0.136
# @MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation of how it works by @Arun, which got me started with data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates,Lease$id)) %>% # create combinations of dates and ranges
select(dates=Var1, id=Var2) %>%
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC") # new_interval() is no longer available in lubridate
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1