I have a big data.table that looks like:
library(data.table)
dt<-data.table(start=c("2012-07-13 23:45:00", "2012-07-14 15:30:00",
"2012-07-14 23:57:00"),
end=c("2012-07-14 00:02:00", "2012-07-14 15:35:00",
"2012-07-15 00:05:00"), id=c(1,2,1),cat=c("a","b","a"))
dt
start end id cat
1: 2012-07-13 23:45:00 2012-07-14 00:02:00 1 a
2: 2012-07-14 15:30:00 2012-07-14 15:35:00 2 b
3: 2012-07-14 23:57:00 2012-07-15 00:05:00 1 a
I need to get an output that shows total minutes of event on each calendar day by id and category. Using the example above the output should be:
day id cat V1
1: 13.07.2012 1 a 15
2: 14.07.2012 1 a 5
3: 14.07.2012 2 b 5
4: 15.07.2012 1 a 5
I used the adply function from the plyr package to split each duration into one-minute intervals:
library(plyr)
fn<-function(x){
s<-seq(from = as.POSIXct(x$start),
to = as.POSIXct(x$end)-1,by = "mins")
# here s is a sequence of all minutes in the given interval
df<-data.table(x$id,x$cat,s)
# return new data.table that contains each calendar minute for each id
# and category of the original data
df
}
# run the function above for each row in the data.table
dd<-adply(dt,1,fn)
# extract the date from calendar minutes
dd[,day:=format(as.POSIXct(s), "%d.%m.%Y")]
#calculate sum of all minutes of event for each day, id and category
dd[,.N,by=c("day","id","cat")][order(day,id,cat)]
The solution above perfectly suits my needs except for the time it takes to compute. When adply is run on a very big data.table with several categories defined in the fn function, it feels like the CPU runs forever.
I will highly appreciate any hint on how to use pure data.table functionality in this problem.
I would suggest a few things:
Convert to POSIXct only once, instead of once per row.
Instead of adply, which creates a whole data.table in each iteration, just use by within the data.table scope.
In order to do so, simply create a row index using .I.
Here's a quick attempt (I've used substr because it will probably be faster than as.Date or as.POSIXct. If you want it to be of class Date again, use res[, Date := as.IDate(Date)] on the result instead of doing it by group).
dt[, `:=`(start = as.POSIXct(start), end = as.POSIXct(end), indx = .I)]
dt[, seq(start, end - 1L, by = "mins"), by = .(indx, id, cat)
][, .N, by = .(Date = substr(V1, 1L, 10L), id, cat)]
# Date id cat N
# 1: 2012-07-13 1 a 15
# 2: 2012-07-14 1 a 5
# 3: 2012-07-14 2 b 5
# 4: 2012-07-15 1 a 5
Try it and see if this is faster.
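If you want to quantify the difference on your own data, here's a minimal benchmarking sketch (the microbenchmark package is an assumption on my part; fn, dt and the data.table pipeline are as defined above):
library(plyr)
library(data.table)
library(microbenchmark)
dt2 <- copy(dt)   # fresh copy so := doesn't touch the table used by adply
microbenchmark(
  plyr = adply(dt, 1, fn),
  dtbl = {
    dt2[, `:=`(start = as.POSIXct(start), end = as.POSIXct(end), indx = .I)]
    dt2[, seq(start, end - 1L, by = "mins"), by = .(indx, id, cat)
        ][, .N, by = .(Date = substr(V1, 1L, 10L), id, cat)]
  },
  times = 10L)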
It's still data.table in the background, but I'm using dplyr syntax for the process.
library(data.table)
dt<-data.table(start=c("2012-07-13 23:45:00", "2012-07-14 15:30:00",
"2012-07-14 23:57:00"),
end=c("2012-07-14 00:02:00", "2012-07-14 15:35:00",
"2012-07-15 00:05:00"), id=c(1,2,1),cat=c("a","b","a"))
fn<-function(x){
s<-seq(from = as.POSIXct(x$start),
to = as.POSIXct(x$end)-1,by = "mins")
# here s is a sequence of all minutes in the given interval
df<-data.table(x$id,x$cat,s)
# return new data.table that contains each calendar minute for each id
# and category of the original data
df
}
library(dplyr)
dt %>%
rowwise() %>% # for each row
do(fn(.)) %>% # apply your function
select(day=s, id=V1, cat=V2) %>% # rename columns
mutate(day = substr(day,1,10)) %>% # keep only the day
ungroup %>%
group_by(day,id,cat) %>%
summarise(N=n()) %>%
ungroup
# Source: local data frame [4 x 4]
#
# day id cat N
# (chr) (dbl) (chr) (int)
# 1 2012-07-13 1 a 15
# 2 2012-07-14 1 a 5
# 3 2012-07-14 2 b 5
# 4 2012-07-15 1 a 5
I have a data set with sales values for N products, from some yyyy-mm-dd to some yyyy-mm-dd, and I want to filter the data to the last 12 months for each product in the data set.
Eg:
Say I have values from 2016-01-01 to 2020-02-01.
So now I want to filter the sales values for the last 12 months, that is, from 2019-02-01 to 2020-02-01.
I can't simply hard-code filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01")) because the end date keeps changing as every month passes, so I need to automate it.
You can use:
library(dplyr)
library(lubridate)
data %>%
group_by(Product) %>%
filter(between(date, max(date) - years(1), max(date)))
#filter(date >= (max(date) - years(1)) & date <= max(date))
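For illustration, here is the same idea on a small made-up data set (the column names follow the code above, the values are mine):
library(dplyr)
library(lubridate)
data <- data.frame(
  Product = rep(c("A", "B"), each = 4),
  date = as.Date(c("2016-01-01", "2019-03-01", "2019-09-01", "2020-02-01",
                   "2017-05-01", "2018-06-01", "2019-11-01", "2020-01-15")))
data %>%
  group_by(Product) %>%
  filter(between(date, max(date) - years(1), max(date)))
# keeps 2019-03-01, 2019-09-01, 2020-02-01 for A and 2019-11-01, 2020-01-15 for B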
You can test whether the date is greater than or equal to the maximal date per product minus 365 days:
library(dplyr)
df %>%
group_by(Products) %>%
filter(Date >= max(Date)-365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is just to capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.POSIXct(start.date.lt)
Now this does have one potential failure case: if today is February 29th. One way to deal with that would be to write a "today.last.year" function that does the above conversion but gives an explicit treatment for leap years, possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you want to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1, and so forth if you need to adjust in different ways.
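A minimal sketch of the "today.last.year" helper mentioned above (the name and the Feb-29 policy switch are my own additions):
today.last.year <- function(end.date = Sys.Date(), feb29 = c("feb28", "mar01")) {
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(end.date)
  lt$year <- lt$year - 1
  if (format(end.date, "%m-%d") == "02-29") {
    # Feb 29 has no counterpart one year back; apply the chosen policy
    if (feb29 == "feb28") {
      lt$mday <- 28
    } else {
      lt$mon  <- 2    # POSIXlt months are 0-based, so 2 = March
      lt$mday <- 1
    }
  }
  as.Date(lt)
}
today.last.year(as.Date("2020-02-29"))           # "2019-02-28"
today.last.year(as.Date("2020-02-29"), "mar01")  # "2019-03-01"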
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
product = rep(c("a", "b"), 6),
date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
year(x) <- year(x) - 1
x
}
date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- date_DT[DT, on = .(product, year_before <= date), nomatch = 0]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?
Let's say we have two time-series data.tables, one sampled by day, another by hour:
dtByDay
EURO TIME ... and some other columns
<num> <POSc>
1: 0.95 2017-01-20
2: 0.97 2017-01-21
3: 0.98 2017-01-22
...
dtByHour
TIME TEMP ... also some other columns
<POSc> <num>
1: 2017-01-20 00:00:00 22.45
2: 2017-01-20 01:00:00 23.50
3: 2017-01-20 02:00:00 23.50
...
and we need to merge them, so as to get all the columns together. What's a nice way of doing it?
Evidently dtByDay[dtByHour] does not produce the desired outcome (as one could have wished): you get NA in the EURO column ...
Seems like roll = TRUE might give you funny behavior if a date is present in one data frame but not the other. So I wanted to post this alternative:
Starting with your original data frames (ymd and ymd_hms are from lubridate):
library(lubridate)
dtbyday <- data.frame( EURO = c(0.95,0.97,0.98),
TIME = c(ymd("2017-01-20"),ymd("2017-01-21"),ymd("2017-01-22")))
dtbyhour <- data.frame( TEMP = c(22.45,23.50,23.40),
TIME = c(ymd_hms("2017-01-21 00:00:00"),ymd_hms("2017-01-21 01:00:00"),ymd_hms("2017-01-21 02:00:00")))
I converted byhour$TIME to the same format as byday$TIME using lubridate functions:
library(dplyr)
dtbyhour <- dtbyhour %>%
rowwise() %>%
mutate( TIME = ymd( paste( year(TIME), month(TIME), day(TIME), sep="-" ) ) )
dtbyhour
# A tibble: 3 x 2
TEMP TIME
<dbl> <date>
1 22.45 2017-01-20
2 23.50 2017-01-20
3 23.40 2017-01-20
NOTE: The date changed because of time zone issues.
Then use dplyr::full_join to join by TIME, which will keep all records and impute values whenever possible. You'll need to aggregate the by-hour values within each day... I calculated the mean TEMP below.
new.dt <- full_join( dtbyday, dtbyhour, by = c("TIME") ) %>%
group_by( TIME ) %>%
summarize( EURO = unique( EURO ),
TEMP = mean( TEMP, na.rm = TRUE ) )
# A tibble: 3 x 3
TIME EURO TEMP
<date> <dbl> <dbl>
1 2017-01-20 0.95 23.11667
2 2017-01-21 0.97 NaN
3 2017-01-22 0.98 NaN
Big thanks to the comments above! The solution is as easy as just adding the roll=Inf argument when joining:
dtByHour[dtByDay, roll=Inf]
That's exactly what I needed: it takes the dtByDay value and uses it for all hours of that day.
For other applications, you may also consider roll="nearest". This will take the closest (from midnight) dtByDay value for all hours before and after midnight:
dtByHour[dtByDay, roll="nearest"]
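For completeness, here is a self-contained sketch of the rolling join with toy values shaped like the question's tables (my reconstruction, not verified against the original application). Note that in data.table, x[i] keeps the rows of i, so to keep every hourly row and fill in that day's EURO, the daily table goes on the left:
library(data.table)
dtByDay  <- data.table(EURO = c(0.95, 0.97, 0.98),
                       TIME = as.POSIXct(c("2017-01-20", "2017-01-21", "2017-01-22"),
                                         tz = "UTC"))
dtByHour <- data.table(TIME = as.POSIXct(c("2017-01-20 00:00:00", "2017-01-20 01:00:00",
                                           "2017-01-20 02:00:00"), tz = "UTC"),
                       TEMP = c(22.45, 23.50, 23.50))
setkey(dtByDay, TIME)
setkey(dtByHour, TIME)
dtByDay[dtByHour, roll = Inf]   # every hourly row, with EURO rolled forward from midnight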
I have a data frame where each row has a unique ID. I need to replicate each one of these rows based on the number of days between the start date and the max of the end date and the approval date.
ID <- c(1,2)
Value <- c(10,20)
StartDate <- c(as.Date("01/01/2015", '%d/%m/%Y'),
as.Date("01/01/2015", '%d/%m/%Y'))
EndDate <- c(as.Date("31/01/2015", '%d/%m/%Y'),
as.Date("15/01/2015", '%d/%m/%Y'))
AppDate <- c(as.Date("15/01/2015", '%d/%m/%Y'),
as.Date("15/02/2015", '%d/%m/%Y'))
df <- data.frame(ID, Value, StartDate, EndDate, AppDate)
df <- df[rep(row.names(df),
             ifelse(df$AppDate > df$EndDate,
                    as.numeric(df$AppDate - df$StartDate),
                    as.numeric(df$EndDate - df$StartDate)) + 1), ]
I then need to add a sequential list of dates from the start date to the max of the end date or approval date.
I've done this via 2 loops. The outer loop iterates through the data frame for each unique ID. The inner loop then goes through that ID's rows and adds the dates. Once the inner loop has finished, it passes the row back to the outer loop as the new start point.
IDs <- unique(df$ID)
df$Days <- rep(as.Date("01/01/1999",'%d/%m/%Y'), nrow(df))
counter <- 1
for (i in 1:length(IDs)) {
ref <- IDs[i]
start <- 1
while (df$ID[counter] == ref) {
ifelse(start == 1, df$Days[counter] <- df$StartDate[counter],
df$Days[counter] <- df$StartDate[counter] + start -1)
ifelse (counter > nrow(df), break, counter <- counter + 1)
ifelse (counter > nrow(df), break, start <- start + 1)
}
}
My actual data set has over 6,000 IDs, and once I've replicated the rows it ends up being over 500,000 rows. The loop took over 15 minutes to run, so it's obviously very inefficient.
So I guess I have 2 questions:
1) What is the most efficient way to do this in R?
2) What would be the most efficient way of doing this in general, i.e. in something like C++?
Thanks
Here is one solution that is vectorized. Note: your code does not match the concept of taking the maximum of EndDate and AppDate, which is what I tried to do here; if that is not what you want, you can modify the code accordingly.
library(dplyr)
df <- df %>%
  group_by(ID) %>%
  mutate(Days = rep(seq(min(StartDate), max(EndDate, df$AppDate), 'days'),
                    ceiling(nrow(df) / n()))[1:n()])
Output will be as follows (just the first few rows):
head(df)
Source: local data frame [6 x 6]
Groups: ID [1]
ID Value StartDate EndDate AppDate Days
(dbl) (dbl) (date) (date) (date) (date)
1 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-01
2 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-02
3 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-03
4 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-04
5 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-05
6 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-06
tail(df)
Source: local data frame [6 x 6]
Groups: ID [1]
ID Value StartDate EndDate AppDate Days
(dbl) (dbl) (date) (date) (date) (date)
1 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-10
2 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-11
3 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-12
4 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-13
5 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-14
6 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-15
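As an aside, if you'd rather stay in data.table, here is a compact sketch of the same idea (my variant, not the answer above), starting from the original pre-replication vectors from the question. Grouping by every column makes each group a single row, so seq() both replicates the row and generates the Days column:
library(data.table)
orig <- data.table(ID, Value, StartDate, EndDate, AppDate)  # the rows before the rep() step
orig[, .(Days = seq(StartDate, max(EndDate, AppDate), by = "day")),
     by = .(ID, Value, StartDate, EndDate, AppDate)]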
Normally, I would recommend the cross join SQL query that returns a Cartesian product (all combinations between two sets). However, you can replicate a cross join in R using merge() without any by arguments and with all=TRUE. From there, filter on the EndDate cut-off:
# CALCULATE CONDITIONAL END DATE
df$TrueEndDate <- as.Date(ifelse(df$AppDate > df$EndDate,
df$AppDate,
df$EndDate), origin="1970-01-01")
# CREATE A SEQUENTIAL DATES DATA FRAME (HERE IS 60 DAYS FROM 2015-01-01)
dates <- data.frame(Date=as.Date(unlist(lapply(0:60, function(x)
as.Date("2015-01-01") + x)),
origin="1970-01-01"))
# RUN CROSS JOIN MERGE, PULLING ONLY NEEDED FIELDS
mergedf <- merge(df[c('ID', 'StartDate', 'TrueEndDate')], dates, all=TRUE)
# FILTER OUT DATES PAST ROW'S TRUE END DATE
mergedf <- mergedf[(mergedf$Date <= mergedf$TrueEndDate),]
# CLEANUP
mergedf <- mergedf[with(mergedf, order(ID)), ] # ORDER BY ID
row.names(mergedf) <- 1:nrow(mergedf) # RESET ROW NAMES
Should you be curious about the equivalent cross join in SQL (which you could have R run on an RDBMS engine and import as the final data frame; this may help with performance issues):
SELECT ID.ID, ID.Value, ID.StartDate,
CASE WHEN ID.AppDate > ID.EndDate
THEN ID.AppDate
ELSE ID.EndDate
END As TrueEndDate,
Dates.Dates
FROM ID, Dates
WHERE Dates.Dates <= CASE WHEN ID.AppDate > ID.EndDate
THEN ID.AppDate ELSE ID.EndDate
END
ORDER BY ID.ID, Dates.Dates
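If you'd like to run that SQL from within R itself, the sqldf package is one option (my suggestion, not part of the answer above), adapted here to the df and dates objects built earlier. Note that SQLite has no native Date type, so date columns may come back numeric and need as.Date(..., origin = "1970-01-01"):
library(sqldf)
mergedf2 <- sqldf("SELECT df.ID, df.Value, df.StartDate,
                          CASE WHEN df.AppDate > df.EndDate
                               THEN df.AppDate ELSE df.EndDate
                          END AS TrueEndDate,
                          dates.Date
                   FROM df, dates
                   WHERE dates.Date <= CASE WHEN df.AppDate > df.EndDate
                                            THEN df.AppDate ELSE df.EndDate END
                   ORDER BY df.ID, dates.Date")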
I need to do a fast aggregation of dates by id_cliente: min, max, difference of dates in months, and quantity of months.
Example table:
tbl<-data.frame(id_cliente=c(1,1,1,1,2,3,3,3),
fecha=c('2013-01-01', '2013-06-01','2013-05-01', '2013-04-01', '2013-01-01', '2013-01-01','2013-05-01','2013-04-01'))
Format dates:
tbl$fecha<-as.Date(as.character(tbl$fecha))
My first approach was ddply from the plyr package:
library(plyr)
tbl2<-ddply(tbl, .(id_cliente), summarize, cant=length(id_cliente),
            max=max(fecha), min=min(fecha),
            dif=length(seq(from=min, to=max, by='month')))
I got the desired result, but with my real table it takes too much time.
So I tried tapply, with a small helper secuencia (its definition was missing from the original post; from the ddply call above it is clearly the month-sequence length):
secuencia <- function(x){length(seq(from=min(x), to=max(x), by='month'))}
tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), secuencia),
hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
min=tapply(tbl$fecha, list(tbl$id_cliente), min),
max=tapply(tbl$fecha, list(tbl$id_cliente), max)
))
The result was:
> tbl3
dif hay min max
6 4 15706 15857
1 1 15706 15706
5 3 15706 15826
In this case I got numbers instead of dates. So, since the following works, I tried using as.Date inside tapply:
as.Date(15706, origin='1970-01-01')
MIN<-function(x){as.Date(min(x), origin='1970-01-01')}
The function works on its own, but inside tapply it doesn't:
tbl3<-data.frame(cbind(min=tapply(tbl$fecha, list(tbl$id_cliente), MIN)))
And I still got numbers instead of dates.
How can I solve this? Thanks.
I know this is a bit late, but I figured I would put this here for the people still googling this issue.
Interestingly, tapply returns the correct results when you keep the date column in text format, and you can then convert to Date class afterwards:
tbl<-data.frame(id_cliente=c(1,1,1,1,2,3,3,3),
fecha=c('2013-01-01', '2013-06-01','2013-05-01', '2013-04-01', '2013-01-01', '2013-01-01','2013-05-01','2013-04-01'))
tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), seq),
hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
min=tapply(tbl$fecha, list(tbl$id_cliente), min),
max=tapply(tbl$fecha, list(tbl$id_cliente), max)))
head(tbl3)
# dif hay min max
# 1, 2, 3, 4 4 2013-01-01 2013-06-01
# 1 1 2013-01-01 2013-01-01
# 1, 2, 3 3 2013-01-01 2013-05-01
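If you then want min and max back as Date class, a small follow-up sketch (the unlist guards against the cbind having produced list columns):
tbl3$min <- as.Date(unlist(tbl3$min))
tbl3$max <- as.Date(unlist(tbl3$max))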
With base R, the Date class gets converted to the number of days since Jan. 1, 1970. Try using dplyr or data.table to preserve the Date class:
dplyr
library(dplyr)
tbl %>% group_by(id_cliente) %>%
summarise(dif=length(seq(min(fecha), max(fecha), by='month')),
hay=length(fecha),
min=min(fecha),
max=max(fecha))
# Source: local data frame [3 x 5]
#
# id_cliente dif hay min max
# 1 1 6 4 2013-01-01 2013-06-01
# 2 2 1 1 2013-01-01 2013-01-01
# 3 3 5 3 2013-01-01 2013-05-01
data.table
library(data.table)
setDT(tbl)[,.(dif=length(seq(min(fecha), max(fecha), by='month')),
hay= .N,
min=min(fecha),
max=max(fecha)), by=id_cliente]
# id_cliente dif hay min max
# 1: 1 6 4 2013-01-01 2013-06-01
# 2: 2 1 1 2013-01-01 2013-01-01
# 3: 3 5 3 2013-01-01 2013-05-01
I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the AllDays data frame, calculate the number of leases that are in effect. E.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that row.
I have the following code
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
The date of interest (dateOfInt) is just the current date, and is used simply to ensure that lease dates in the future aren't counted.
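In code, that would simply be (a one-line sketch):
dateOfInt <- Sys.Date()  # upper bound so future lease days aren't counted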
Using seq is almost surely inefficient. Imagine you had a lease in your data that's 10,000 years long: seq will take forever and return 10000*365-1 days that don't matter to us, and we then have to use match/%in%, which makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution), but it gets closer to the heart of the problem.
Data
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on #MichaelChirico's data.
setkey(days[, AllDays1 := AllDays], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
# user system elapsed
# 0.114 0.018 0.136
# #MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation of how it works by #Arun, which got me started with data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays))
  AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date &
                         AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates,Lease$id)) %>% # create combinations of dates and ranges
select(dates=Var1, id=Var2) %>%
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Try the lubridate package. Create an interval for each lease, then count the lease intervals that each date falls within.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- new_interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1
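One small caveat for current installs: new_interval() has since been deprecated in lubridate in favour of interval(), so on a recent version the equivalent line would read:
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")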