I'm calculating price differences between trades that are a specific time apart (say 60 seconds). I need to do this for several assets and many trades, but I could not figure out a way to do it without a for loop that takes forever.
Let's create some random prices:
library(birk)
library(tictoc)
library(dplyr)
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-10-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
My main function is the following:
For each trade (in the TradePrice column) I need to find the closest trade that falls within the 60-second interval.
calc.spread <- function(df, c = 60){
  n <- length(df$PriorityDateTime)
  difft <- dspread <- spread <- rep(0, n)
  TimeF <- as.POSIXct(NA)
  for (k in 1:n){
    diffs <- as.POSIXct(df$PriorityDateTime) - as.POSIXct(df$PriorityDateTime[k])
    idx <- which.closest(diffs, x = c)
    TimeF[k] <- as.POSIXct(df$PriorityDateTime[idx])
    difft[k] <- difftime(time1 = TimeF[k], time2 = df$PriorityDateTime[k], units = 'sec')
    dspread[k] <- abs(df$TradePrice[k] - df$TradePrice[idx])
    spread[k] <- 2*abs(log(df$TradePrice[k]) - log(df$TradePrice[idx]))
  }
  df <- data.frame(spread, dspread, difft, TimeF,
                   PriorityDateTime = df$PriorityDateTime, id = df$id)
}
The which.closest function is just a wrapper for which.min(abs(vec - x)). Since I have a data frame with multiple assets, I run:
c=60
spreads <- ndf %>% group_by(InstrumentSymbol) %>% do(calc.spread(.,c=c))
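For reference, which.closest from the birk package behaves essentially like the following wrapper (a minimal sketch of the idea, not necessarily the package's exact implementation):
## index of the element of vec closest to x
which.closest <- function(vec, x) which.min(abs(vec - x))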
The problem is that I need to run this on data frames with about 3 million rows. I have searched the forum but couldn't find a way to make this code faster; plyr::ddply is a little slower than the dplyr version. Does anyone have a suggestion?
Being quite unsatisfied with my own previous answer, I asked for help, and it turns out there is at least one way in data.table which is clearly faster. I also asked a related dplyr question.
library(data.table)
library(lubridate)
library(dplyr)
s <- Sys.time()
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec');length(PriorityDateTime)
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.table()
res2 <- setDT(res)
res2 <- res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
res2, on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
, .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]
res2[]
e <- Sys.time()
> e-s
Time difference of 1.23701 mins
You can then apply your calc.spread function directly to the minpricewithin60 column.
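For instance, the spread formulas from calc.spread can then be applied column-wise on the joined result (a sketch, assuming res2 holds both TradePrice and minpricewithin60):
## vectorised spread calculations on the joined table
res2[, `:=`(dspread = abs(TradePrice - minpricewithin60),
            spread  = 2*abs(log(TradePrice) - log(minpricewithin60)))]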
You might have made a mistake: as described you are looking for the minimum price difference among trades within 60 seconds, but as written you are looking for the trade that took place as close as possible to 60 seconds away, in the past or future:
idx <- which.closest(diffs,x=c)
With this, a trade that took place 1 second ago would be discarded in favour of a trade that happened closer to 60 seconds away, which I don't think is what you want. You probably want the lowest price difference among all trades within 60 seconds, which can be done with:
res$idx[i] <<- which.min(pricediff)[1]
See the code below:
library(lubridate)
library(dplyr)
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.frame(stringsAsFactors = F)
res$dspread <- res$idx <- res$spread <- NA
sapply(1:nrow(res), function(i){
  within60 <- abs(difftime(ndf$datetime[i], ndf$datetime, units = "secs")) <= 60
  samesymbol <- res$InstrumentSymbol[i] == res$InstrumentSymbol
  isdifferenttrade <- 1:nrow(res) != i
  pricediff <- ifelse(within60 & samesymbol & isdifferenttrade,
                      abs(res$TradePrice[i] - res$TradePrice), Inf)
  res$dspread[i] <<- min(pricediff)
  res$idx[i] <<- which.min(pricediff)[1] # in case several elements have the same price
  res$spread[i] <<- 2*abs(log(res$TradePrice[i]) - log(res$TradePrice[res$idx[i]]))
})
head(res)
What I used was sapply, which is similar to (and can even be slower than) a for loop. If this is any faster on your real data, it is because the operations are done in a way that needs fewer steps.
Let me know; otherwise you can try the same logic in a plain for loop, or we would have to try data.table, which I am less familiar with. These approaches are generally time consuming because the conditions have to be evaluated for each row of the data.
PriorityDateTime TradePrice InstrumentSymbol id datetime spread idx
1 2018-10-27 10:00:00 1 asset1 1 2018-10-27 10:00:00 2.1972246 3
2 2018-10-27 10:00:30 2 asset2 2 2018-10-27 10:00:30 1.3862944 4
3 2018-10-27 10:01:00 3 asset1 3 2018-10-27 10:01:00 2.1972246 1
4 2018-10-27 10:01:30 4 asset2 4 2018-10-27 10:01:30 1.3862944 2
5 2018-10-27 10:02:00 5 asset1 5 2018-10-27 10:02:00 1.0216512 3
6 2018-10-27 10:02:30 6 asset2 6 2018-10-27 10:02:30 0.8109302 4
dspread
1 2
2 2
3 2
4 2
5 2
6 2
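For reference, the same logic written as a plain for loop, as suggested above (a sketch, not something I have benchmarked):
for (i in 1:nrow(res)) {
  within60 <- abs(difftime(ndf$datetime[i], ndf$datetime, units = "secs")) <= 60
  samesymbol <- res$InstrumentSymbol[i] == res$InstrumentSymbol
  isdifferenttrade <- 1:nrow(res) != i
  pricediff <- ifelse(within60 & samesymbol & isdifferenttrade,
                      abs(res$TradePrice[i] - res$TradePrice), Inf)
  res$dspread[i] <- min(pricediff)
  res$idx[i] <- which.min(pricediff)[1]
  res$spread[i] <- 2*abs(log(res$TradePrice[i]) - log(res$TradePrice[res$idx[i]]))
}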
I am looking to run tests on how long database queries take to execute. I am looping a query X number of times and would like to store the execution time (measured with Sys.time()) of each query in a data frame. I have tested the loops and functions, which work, but when I attempt to store the calculated time in the data frame I get the following error:
Error in `$<-.data.frame`(`*tmp*`, query_time, value = 0.359616041183472) : replacement has 1 row, data has 0
I am unsure what my issue is; any help is appreciated. Below is the code I am using.
Empty data frame to store the query results
# Empty data frame to store query results
query_results <- data.frame(query_date = as.POSIXct(character()),
query_type = character(),
query_records = numeric(),
query_time = as.POSIXct(character()))
Function to run database query
dbQueryTimes <- function(num_records){
  start_time <- Sys.time()
  fetch(dbSendQuery(con, "SELECT * FROM database_table"), n = num_records)
  end_time <- Sys.time()
  query_results$query_time <- end_time - start_time
}
Function to input number of iterations to run and number of records to return in query
funct_query_select_all <- function(num_iterations, num_records){
  for (i in 1:num_iterations) {
    dbQueryTimes(num_records)
  }
}
Running the query function
funct_query_select_all(5, 10000)
Here is a minimal example of something like what you're trying to do:
## empty dataframe to store results
query_results <- data.frame(query_date = as.POSIXct(character()),
query_type = character(),
query_records = numeric(),
query_time = as.POSIXct(character()))
## to make reproducible example
set.seed(4919)
## for 12 new "records"
for(i in 1:12){
  ## new results
  query_date <- Sys.Date()
  start_time <- Sys.time()
  query_type <- sample(LETTERS, size=1)
  query_records <- sample(0:99, size=1)
  end_time <- Sys.time()
  query_time <- end_time - start_time
  ## row bind results to existing dataframe
  query_results <- rbind(query_results,
                         data.frame(query_date, query_type,
                                    query_records, query_time))
}
output
query_date query_type query_records query_time
1 2021-03-22 P 44 3.838539e-05 secs
2 2021-03-22 A 55 2.694130e-05 secs
3 2021-03-22 J 79 3.147125e-05 secs
4 2021-03-22 D 84 2.264977e-05 secs
5 2021-03-22 U 38 2.312660e-05 secs
6 2021-03-22 K 13 2.384186e-05 secs
7 2021-03-22 T 72 2.527237e-05 secs
8 2021-03-22 M 55 2.479553e-05 secs
9 2021-03-22 N 9 2.527237e-05 secs
10 2021-03-22 X 64 2.598763e-05 secs
11 2021-03-22 B 91 2.574921e-05 secs
12 2021-03-22 Y 65 2.503395e-05 secs
There are likely more elegant ways to solve this problem, but the loop above builds a single-row data frame and then rbinds that row to the existing data frame. As pointed out in the comments, the way you added a single element in your example doesn't work: assigning a column to a zero-row data frame gives exactly the "replacement has 1 row, data has 0" error, and the assignment inside dbQueryTimes only modifies a local copy of query_results anyway.
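One such alternative (a sketch under the same setup, not a benchmarked recommendation) is to collect one row per iteration in a list and bind them once at the end, which avoids growing the data frame inside the loop:
## collect one row per iteration, then bind once at the end
set.seed(4919)
rows <- vector("list", 12)
for (i in 1:12) {
  start_time <- Sys.time()
  query_type <- sample(LETTERS, size = 1)
  query_records <- sample(0:99, size = 1)
  end_time <- Sys.time()
  rows[[i]] <- data.frame(query_date = Sys.Date(),
                          query_type = query_type,
                          query_records = query_records,
                          query_time = end_time - start_time)
}
query_results <- do.call(rbind, rows)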
Below is what I want to achieve, with a reproducible example.
I have a data.table with months as the time id. I want to make some computations on the data for the last 5 years, last 10 years, etc., up to the last month (i.e. the last 5*12 months, the last 10*12 months, and so on).
I have a way of doing it, but I suspect it goes through many unnecessary intermediate variables.
library(lubridate) #For easy creation of time-series
library(data.table)
set.seed(5)
DT <- data.table(
Month = as.Date(sapply(0:329, function(i)(as.Date('1990-01-01')%m+%months(i))), origin = '1970-01-01'),
Value = round(runif(330, min = 20, max = 40), digits = 2)
)
> DT
Month Value
1: 1990-01-01 24.00
2: 1990-02-01 33.70
3: 1990-03-01 38.34
4: 1990-04-01 25.69
5: 1990-05-01 22.09
---
326: 2017-02-01 20.91
327: 2017-03-01 38.96
328: 2017-04-01 28.91
329: 2017-05-01 26.09
330: 2017-06-01 35.16
## Create a vector of the first months marking the start of the 60 or 120 month period
last.month <- max(DT[['Month']])
first.months <- as.Date(sapply(seq(5, 25, by = 5), function(i)(last.month
%m-% months(i*12 - 1))), origin = '1970-01-01')
## Construction of table of interest
yrs <- paste0(seq(5, 25, by = 5), 'Yrs')
features <- data.table(
Period = factor(yrs, levels = yrs), Feature.1 = as.numeric(NA),
Feature.2 = as.numeric(NA)
)
for(i in 1:nrow(features)){
  DT_n <- DT[Month >= first.months[i], ]
  set(features, i, 'Feature.1', DT_n[, mean(Value)]) # mean used as an example operation
  set(features, i, 'Feature.2', DT_n[, var(Value)])  # var used as an example operation
}
Finally, this is the table I am interested in -
> features
Period Feature.1 Feature.2
1: 5Yrs 29.68817 35.80375
2: 10Yrs 29.25542 39.50981
3: 15Yrs 29.64950 37.41900
4: 20Yrs 29.63454 34.51793
5: 25Yrs 29.84373 35.90916
What might be the best way in the data.table parlance to achieve this goal? Any improvement in terms of unnecessary variable reduction or efficiency is appreciated.
Thank you!
Another approach:
rbindlist(lapply(first.months,
function(m) data.table(val_mean = mean(DT[Month >= m]$Value),
val_var = var(DT[Month >= m]$Value)))
)[, Period := yrs][]
which gives:
val_mean val_var Period
1: 29.68817 35.80375 5Yrs
2: 29.25542 39.50981 10Yrs
3: 29.64950 37.41900 15Yrs
4: 29.63454 34.51793 20Yrs
5: 29.84373 35.90916 25Yrs
Or a variation on the approach above with setNames and the idcol-parameter of rbindlist:
rbindlist(setNames(lapply(first.months,
function(m) data.table(val_mean = mean(DT$Value[DT$Month >= m]),
val_var = var(DT$Value[DT$Month >= m]))),
yrs),
idcol = 'Period')
which gives:
Period val_mean val_var
1: 5Yrs 29.68817 35.80375
2: 10Yrs 29.25542 39.50981
3: 15Yrs 29.64950 37.41900
4: 20Yrs 29.63454 34.51793
5: 25Yrs 29.84373 35.90916
Here's another data.table approach you can try out. After constructing the first.months and yrs vectors, you can put them into a separate data.table:
m <- data.table(firstmonths = first.months, yrs = yrs, key = "yrs")
And then use non-equi joins to compute the results:
rbindlist(lapply(yrs, function(y) {
DT[m[y], on = .(Month >= firstmonths), .(mean = mean(Value),
var = var(Value),
Period = y)]
}))
# mean var Period
#1: 29.68817 35.80375 5Yrs
#2: 29.25542 39.50981 10Yrs
#3: 29.64950 37.41900 15Yrs
#4: 29.63454 34.51793 20Yrs
#5: 29.84373 35.90916 25Yrs
I have a large file of time-series data, which looks as follows. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time<-c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id per consecutive timeflow. A consecutive timespan is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 2014-04-30 23:30 until 2014-05-01 00:15, hence 4 rows). When this flow of 15-minute iterations is disrupted (after 2014-05-01 00:15 there is no stamp at 00:30, so it is disrupted), the next timestamp should count as the start of a new consecutive timeflow, and the number of rows should again be counted until that flow is disrupted in turn. Time is POSIXct.
As you can see in above example; a consecutive timeflow may cover different days, different months, or different years. I have many unique ids (and as said, a very large file), so I'm looking for a way that my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table also using a time difference would be to make use of the data.table internal values for group number and number of rows in each group:
library(data.table)
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the sample data code you posted didn't quite match the subset you showed, I have included the data I used below:
Data
time<-as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by uniqueid and by the cumulative sum of an indicator of whether the time difference between consecutive rows is not equal to 15 minutes; that gives the flow id, and a count of rows per group then gives you what you need.
To justify the logic: whenever the time difference within a uniqueid is not equal to 15, a new flow should start, so we label that row TRUE; combined with cumsum, each run of consecutive 15-minute rows gets its own flow id:
library(dplyr)
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
# convert the time column to POSIXct class so that we can apply the diff function correctly
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time<-rep(as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")),times = 20000000)
uniqueid<-rep(as.character(c("a","a","a","a","a","a","b","b","b")),times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption of uniqueid's and times being in order is huge, and the other solutions might be able to take advantage of that better. order() is easy enough to do.
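For example, ordering first is a one-liner (a sketch, assuming the mydf columns used above):
## sort by id, then time, so the consecutive-flow logic applies within each id
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]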
I'm not sure about the impact of memory, or of the impact of different data sets that aren't so simple. It should be easy enough to break it into chunks and process if memory is an issue. It takes more code in Base R for sure.
Having both ordered "id" and "time" columns, we could build a single group to operate on by creating a logical vector of indices wherever either "id" changes or "time" is > 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id":
id_gr = c(TRUE, id[-1] != id[-length(id)])
and "time":
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
change and combine them in:
gr = id_gr | tm_gr
which shows wherever either "id" changed or "time" > 15.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300), sample(5e2:1e3, 1), TRUE)*60),
simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
id = as.character(dat$id); tm = dat$time
id_gr = c(TRUE, id[-1] != id[-length(id)])
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
gr = id_gr | tm_gr
tab = tabulate(cumsum(gr))
ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, included MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
by = .(id, cumsum(as.numeric(difftime(time,
shift(time, 1L, type = "lag", fill = 0),
unit = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE
I have a dataframe of time stamps which specify a categorical status. The status is valid until the next time stamp, at which time the category might change.
I'd like to be able to determine percentage of time spent in each category over regular time periods, like monthly, quarterly, or annually.
This seems like a common enough problem, but I've been unable to find an elegant solution or library to solve it.
For example, with the following sample dataframe:
date status
2016-02-20 09:11:00 a
2016-03-06 02:38:00 c
2016-03-10 15:20:00 b
2016-03-10 21:20:00 a
2016-03-11 11:51:00 b
2016-03-12 01:19:00 c
2016-03-22 14:39:00 c
2016-03-23 11:37:00 b
2016-03-25 17:38:00 c
2016-03-26 01:24:00 c
2016-03-26 12:40:00 a
2016-04-12 10:28:00 c
... I might want to report weekly, for 3/1-3/7, 3/8-3/14, and 3/15-3/21, the percentage of time in each week spent in the 'a', 'b', and 'c' statuses.
I started brute-force coding a solution to this (it's ugly...), when I decided maybe I should ask here whether there's a more elegant way to do it.
======== Edited to add an inelegant brute-force solution below ========
time_analysis <- function(df, starttime, endtime) {
  # - assumes sorted by date
  startindex <- sum(df$date <= starttime)  # find the index of the entry which contains the start time
  endindex <- sum(df$date <= endtime) + 1  # find the index of the entry which contains the end time
  if ( (startindex == 0) || (endindex > nrow(df)) ) {
    print("Date outside of available data")
    return(NULL)
  }
  df2 <- df[startindex:endindex, ]          # subset the dataframe to include the range, but still need to trim ends
  df2$date[1] <- starttime                  # trim to the start time
  df2$date[nrow(df2)] <- endtime            # trim back the end time
  df2$status[nrow(df2)] <- df2$status[nrow(df2)-1]  # status hasn't changed yet, so still the previous status
  duration <- diff(df2$date)                # vector of the time within each segment, 1 fewer elements than the dataframe
  units(duration) <- 'days'
  duration <- as.numeric(duration)          # need to convert to numeric, or else can't divide by total duration
  df2 <- df2[-nrow(df2), ]                  # remove the last row, to make length same as the duration vector
  df2$duration <- duration                  # add the duration column
  total <- sum(df2$duration)                # to allow calculations within the ddply
  return(ddply(df2[, c('status','duration')], 'status', function(x) {  # calculate by each status category
    return( c(
      date = starttime,
      totaldays = round(sum(x$duration), 2),
      fraction = round(sum(x$duration) / total, 3)) )
  } ))
}
And below would be a sample use that splits the reporting into roughly 2-week chunks. I hate manually coding the dates and using a loop in R, but I am too inexperienced to know a better way.
times <- c("2016-03-01","2016-03-15","2016-04-01","2016-04-15","2016-05-01","2016-05-15")
result <- data.frame()
for (i in 1:(length(times) - 1)) {
  result <- rbind( result, time_analysis(d, times[i], times[i+1]) )
}
print(result, row.names = FALSE)
Yielding (other than some errors for dates out of range):
status date totaldays fraction
a 2016-03-01 5.71 0.409
b 2016-03-01 0.81 0.058
c 2016-03-01 7.43 0.532
a 2016-03-15 5.47 0.322
b 2016-03-15 2.25 0.132
c 2016-03-15 9.28 0.546
=====
And after posting, I found a much nicer way to generate the times:
times <- as.character( seq( as.Date("2016-03-01"), as.Date("2016-05-15"), by = '2 weeks' ) )
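The reporting loop itself can be condensed similarly (a sketch, assuming the time_analysis function, the data frame d, and the times vector from above):
## bind all period reports at once instead of rbind-ing inside a for loop
result <- do.call(rbind, lapply(seq_len(length(times) - 1),
                                function(i) time_analysis(d, times[i], times[i + 1])))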
Here's an approach that combines the cut.POSIXt() S3 method with a nested data.table aggregation.
## define data
library(data.table);
dt <- data.table(date=as.POSIXct(c('2016-02-20 09:11:00','2016-03-06 02:38:00','2016-03-10 15:20:00','2016-03-10 21:20:00','2016-03-11 11:51:00','2016-03-12 01:19:00','2016-03-22 14:39:00','2016-03-23 11:37:00','2016-03-25 17:38:00','2016-03-26 01:24:00','2016-03-26 12:40:00','2016-04-12 10:28:00')),status=c('a','c','b','a','b','c','c','b','c','c','a','c'));
## solution
dt[,{ n1 <- .N; .SD[,.(pct=.N/n1*100),.(status)]; },.(month=cut(date,'month'))];
## month status pct
## 1: 2016-02-01 a 100
## 2: 2016-03-01 c 50
## 3: 2016-03-01 b 30
## 4: 2016-03-01 a 20
## 5: 2016-04-01 c 100
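The same pattern should extend to other reporting periods by changing the cut() interval, e.g. weekly buckets (a sketch along the same lines, not taken from the answer above):
## percentage of status records per calendar week instead of per month
dt[, { n1 <- .N; .SD[, .(pct = .N/n1*100), .(status)] }, .(week = cut(date, 'week'))];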
I have a data.table object with two date columns, from and to. I want to create a new column to determine if a specific weekday is in between the date range.
[Data]
library(data.table)
set.seed(1)
DT <- data.table(from=seq.Date(Sys.Date(), Sys.Date()+100, by="day"))
DT[, to:=from+sample(10, 1), by=1:nrow(DT)][, from_wd:=wday(from)][, to_wd:=wday(to)]
> head(DT)
from to from_wd to_wd
1: 2015-08-06 2015-08-10 5 2
2: 2015-08-07 2015-08-10 6 2
3: 2015-08-08 2015-08-18 7 3
4: 2015-08-09 2015-08-16 1 1
5: 2015-08-10 2015-08-13 2 5
6: 2015-08-11 2015-08-13 3 5
[My Approach]
In this case, I want to add a new boolean column flag, which returns TRUE if Wednesday is in the range of [from, to].
This is my attempt:
DT[, flag:=0][DT[, .I[4 %in% unique(wday(seq.Date(from, to, by="day")))], by=1:nrow(DT)][[1]], flag:=1]
> table(DT$flag)
0 1
21 80
[Question]
The code took some time to run, and as you can imagine, it will take more time if nrow(DT) gets larger.
My question is: Is there a better way to do this? Better in terms of speed and code readability (I believe my code is not intuitive at all).
Here's one approach:
next_wday <- function(d, wd = 4L){
  wddiff = wd - wday(d)
  d + wddiff + (wddiff < 0L)*7L
}
DT[, flag2 := +(next_wday(from) <= to)]
# test:
DT[,table(flag,flag2)]
# flag2
# flag 0 1
# 0 44 0
# 1 0 57
The idea is that you compare to against the next day whose wday equals 4**. The replacement line could be written a number of different ways.
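For instance, a couple of equivalent ways of writing that assignment (a sketch, not from the original answer; both should produce the same flag2):
## coerce the logical comparison to integer in different ways
DT[, flag2 := as.integer(next_wday(from) <= to)]
DT[, flag2 := ifelse(next_wday(from) <= to, 1L, 0L)]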
Benchmark
The OP mentioned that from and to could be up to 200 days apart so...
set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to <- from + pmin(200,rpois(length(from),1))
DT <- data.table(from,to)
system.time(DT[, flag2 := +(next_wday(from) <= to)])
# user system elapsed
# 2.11 0.03 2.14
# David Arenburg's solution
system.time({
DateDT <- DT[, {
temp <- seq(min(from), max(to), by = "day")
temp2 <- temp[wday(temp) == 4L]
list(from = temp2, to = temp2)
}
]
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
DT[, flag := 0L][indx, flag := 1L]
})
# user system elapsed
# 6.75 0.14 6.89
# check agreement
DT[,table(flag,flag2)]
# flag2
# flag 0 1
# 0 714666 0
# 1 0 285334
I'm using IDate because it is the date format that comes with the data.table package and is (?) faster to work with. There are a couple of ways one could make the code even faster:
First, it might be faster to restrict attention to rows where to-from is less than 6 (since any gap 6 or greater will have every weekday), like
DT[,flag2:=0L][to-from < 6, flag2 := +(next_wday(from) <= to)]
Second, because the computation only depends on one row at a time, parallelization may lead to some improvement, as illustrated in #grubjesic's answer.
Depending on one's real data, additional improvements might be found.
The OP's code isn't benchmarked here because it entails splitting the data by rows and enumerating up to 200 dates per row, which will certainly be slow.
** or whatever day wday(d) == 4 corresponds to; with data.table's wday convention (Sunday = 1), that is Wednesday.
You could also try the foverlaps approach
First we will create a data set of all the Wednesdays, starting from min(from) and ending at max(to).
DateDT <- DT[, {
  temp <- seq(min(from), max(to), by = "day")
  temp2 <- temp[wday(temp) == 4L]
  .(from = temp2, to = temp2)
}]
Then run foverlaps and extract desired rows
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
Then a simple update by reference will do
DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
# 0 1
# 44 57
Here's my example:
library(parallel)
process <- function(){
  from <- seq(as.Date("1950-01-01"), by = "day", length = 100000)
  to <- seq(as.Date("1950-01-04"), by = "day", length = 100000)
  DT <- data.frame(from, to)
  Ncores <- detectCores()
  flagList <- mclapply(1:nrow(DT), function(id){
    # %w counts days from Sunday = 0, so Wednesday is "3"
    "3" %in% strftime(seq(as.Date(DT[id,1]), as.Date(DT[id,2]), by="day"), format="%w")
  }, mc.cores = Ncores)
  flag <- unlist(flagList)
  return(cbind(DT, flag))
}
It takes just 15 sec for 100k rows on my i7 processor. Hope this helps.