I am trying to convert the timestamps in the stock data from Google Finance API to a more usable datetime format.
I have used data.table::fread to read the data here:
fread(<url>)
datetime open high low close volume
1: a1497619800 154.230 154.2300 154.2300 154.2300 500
2: 1 153.720 154.3200 153.7000 154.2500 1085946
3: 2 153.510 153.8000 153.2000 153.7700 34882
4: 3 153.239 153.4800 153.1400 153.4800 24343
5: 4 153.250 153.3000 152.9676 153.2700 20212
As you can see, the "datetime" format is rather strange. The format is described in this link:
The full timestamps are denoted by the leading 'a'. Like this: a1092945600. The number after the 'a' is a Unix timestamp. [...]
The numbers without a leading 'a' are "intervals". So, for example, the second row in the data set below has an interval of 1. You can multiply this number by our interval size [...] and add it to the last Unix Timestamp.
In my case, the "interval size" is 300 seconds (5 minutes). This format is restarted at the start of each new day and so trying to format it is quite difficult!
I can pull out the index positions of where the day starts are by using grep and searching for "a";
newDay <- grep(df$V1, pattern = "a")
Then my idea was to split the dataframe into chunks depending on index positions then extend the unix times on each day separately followed by combing them back to a data.table, before storing.
data.table::split looks like it will do the job, but I am unsure of how to supply it the day breaks to split by index positions, or if there is a more logical way to achieve the same result without having to break it down to each day.
Thanks.
You may use grepl to search for "a" in "datetime", which results in a boolean vector. cumsum the boolean to create a grouping variable - for each "a" (TRUE), the counter will increase by one.
Within each group, convert the first element to POSIXct, using an appropriate format and origin (and timezone, tz?). Add multiples of the 'interval size' (300 sec), using zero for the first element and the "datetime" multiples for the others.
d[ , time := {
t1 <- as.POSIXct(datetime[1], format = "a%s", origin = "1970-01-01")
.(t1 + c(0, as.numeric(datetime[-1]) * 300))
}
, by = .(cumsum(grepl("^a", datetime)))]
d
# datetime time
# 1: a1497619800 2017-06-16 15:30:00
# 2: 1 2017-06-16 15:35:00
# 3: 2 2017-06-16 15:40:00
# 4: 3 2017-06-16 15:45:00
# 5: 4 2017-06-16 15:50:00
# 6: a1500000000 2017-07-14 04:40:00
# 7: 3 2017-07-14 04:55:00
# 8: 5 2017-07-14 05:05:00
# 9: 7 2017-07-14 05:15:00
Some toy data:
d <- fread(input = "datetime
a1497619800
1
2
3
4
a1500000000
3
5
7")
With:
DT[grep('^a', date), datetime := as.integer(gsub('\\D+','',date))
][, datetime := zoo::na.locf(datetime)
][nchar(date) < 4, datetime := datetime + (300 * as.integer(date))
][, datetime := as.POSIXct(datetime, origin = '1970-01-01', tz = 'America/New_York')][]
you get:
date close high low open volume datetime
1: a1500298200 153.57 153.7100 153.57 153.5900 1473 2017-07-17 09:30:00
2: 1 153.51 153.8700 153.33 153.7500 205057 2017-07-17 09:35:00
3: 2 153.49 153.7800 153.34 153.5800 70023 2017-07-17 09:40:00
4: 3 153.68 153.7300 153.42 153.5400 53050 2017-07-17 09:45:00
5: 4 153.06 153.7500 153.06 153.7200 120899 2017-07-17 09:50:00
---
2348: 937 143.94 144.0052 143.91 143.9917 36651 2017-08-25 15:40:00
2349: 938 143.90 143.9958 143.90 143.9400 40769 2017-08-25 15:45:00
2350: 939 143.94 143.9500 143.87 143.8900 56616 2017-08-25 15:50:00
2351: 940 143.97 143.9700 143.89 143.9400 56381 2017-08-25 15:55:00
2352: 941 143.74 143.9700 143.74 143.9655 179811 2017-08-25 16:00:00
Used data:
DT <- fread('https://www.google.com/finance/getprices?i=300&p=30d&f=d,t,o,h,l,c,v&df=cpct&q=IBM', skip = 7, header = FALSE)
setnames(DT, 1:6, c('date','close','high','low','open','volume'))
Related
I have 2 data sets, each containing a date-time value in POSIXlt format, and some other numeric and character variables.
I want to combine both data sets based on the date-time column.
But the date stamps of both data sets do not match, so I need to combine them by nearest date (before or after).
In my example, data value "e" from 2016-03-01 23:52:00 needs to be combined with "binH" at 2016-03-02 00:00:00, not "binG".
Is there a function that would allow me to combine my data sets by nearest date-time value, even if it is after?
I have found ways of combining dates to the next previous date using the cut() function, or the roll=Inf function in data.tables. But I couldn't get my timestamps into any format roll='nearest' would accept.
>df1
date1 value
1 2016-03-01 17:52:00 a
2 2016-03-01 18:01:30 b
3 2016-03-01 18:05:00 c
4 2016-03-01 20:42:30 d
5 2016-03-01 23:52:00 e
>df2
date2 bin_name
1 2016-03-01 17:00:00 binA
2 2016-03-01 18:00:00 binB
3 2016-03-01 19:00:00 binC
4 2016-03-01 20:00:00 binD
5 2016-03-01 21:00:00 binE
6 2016-03-01 22:00:00 binF
7 2016-03-01 23:00:00 binG
8 2016-03-02 00:00:00 binH
9 2016-03-02 01:00:00 binI
data.table should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you're setting the key column before using roll.
(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):
new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )
date data.new
1: 2016-03-02 12:20:00 t
2: 2016-03-07 12:20:00 u
old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )
date data.old
1: 2016-03-02 12:20:00 a
2: 2016-03-07 12:20:00 b
setkey( new, date )
setkey( old, date )
combined <- new[ old, roll = "nearest" ]
combined
date data.new data.old
1: 2015-03-02 12:20:00 t d
2: 2016-03-02 12:20:00 t a
3: 2016-03-07 12:20:00 u b
4: 2016-04-02 12:20:00 v c
I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:
combined <- old[ new, roll = "nearest" ]
combined
date data.old data.new
1: 2016-03-02 12:20:00 a t
2: 2016-03-07 12:20:00 b u
3: 2016-04-02 12:20:00 c v
I had a similar problem, but instead of using data.table or tidyverse I created my own function amerge for "approximate merge". It takes 4 arguments:
two data frames,
a vector of column names for "firm" (not approximate) merge - these must exist in both data frames,
and the name of a single column (in both data frames) for approximate merge. It will work for any numeric values, including dates.
The idea was to merge rows 1-to-1 of best matches, and not loose any rows from any data frame. Here is my commented code with a working example.
amerge <- function(d1, d2, firm=NULL, approx=NULL) {
rt = Sys.time()
# Take care of conflicting column names
n2 = data.frame(oldname = names(d2), newname = names(d2))
n2$newname = as.character(n2$newname)
n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)] =
paste(n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)], "2", sep=".")
# Add unique row IDs
if (length(c(firm, approx))>1) {
d1$ID1 = factor(apply(d1[,c(approx,firm)], 1, paste, collapse=" "))
d2$ID2 = factor(apply(d2[,c(approx,firm)], 1, paste, collapse=" "))
} else {
d1$ID1 = factor(d1[,c(approx,firm)])
d2$ID2 = factor(d2[,c(approx,firm)])
}
# Perform initial merge on the 'firm' parameters, if any
# Otherwise match all to all
if (length(firm)>0) {
t1 = merge(d1, d2, by=firm, all=T, suff=c("",".2"))
} else {
names(d2)= c(n2$newname,"ID2")
t1 = data.frame()
for (i1 in 1:nrow(d1)) {
trow = d1[i1,]
t1 = rbind(t1, cbind(trow, d2))
}
}
# Match by the most approximate record
if (length(approx)==1) {
# Calculate the differential for approximate merging
t1$DIFF = abs(t1[,approx] - t1[,n2$newname[n2$oldname==approx]])
# Sort data by ascending DIFF, so that best matching records are used first
t1 = t1[order(t1$DIFF, t1$ID1, t1$ID2),]
t2 = data.frame()
d2$used = 0
# For each record of d1, find match from d2
for (i1 in na.omit(unique(t1$ID1))) {
tx = t1[!is.na(t1$DIFF) & t1$ID1==i1,]
# If there are non-missing records, get the one with minimum DIFF (top one)
if (nrow(tx)>0) {
tx = tx[1,]
# If matching record found, remove it from the pool, so it's not used again
t1[!is.na(t1$ID2) & t1$ID2==tx$ID2, c(n2$newname[!(n2$newname %in% firm)], "DIFF")] = NA
# And mark it as used
d2$used[d2$ID2==tx$ID2] = 1
} else {
# If there are no non-missing records, just get the first one from the top
tx = t1[!is.na(t1$ID1) & t1$ID1==i1,][1,]
}
t2 = rbind(t2,tx)
}
} else {
t2 = t1
}
# Make the records the same order as d1
t2 = t2[match(d1$ID1, t2$ID1),]
# Add unmatched records from d2 to the end of output
if (any(d2$used==0)) {
tx = t1[t1$ID2 %in% d2$ID2[d2$used==0], ]
tx = tx[!duplicated(tx$ID2),]
tx[, names(d1)[!(names(d1) %in% c(firm))]] = NA
t2 = rbind(t2,tx)
t2[is.na(t2[,approx]), approx] = t2[is.na(t2[,approx]), n2$newname[n2$oldname==approx]]
}
t2$DIFF = t2$ID1 = t2$ID2 = NULL
cat("* Run time: ", round(difftime(Sys.time(),rt, "secs"),1), " seconds.\n", sep="")
return(t2)
}
And the example:
new <- data.frame(ID=c(1,1,1,2), date = as.POSIXct( c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-04-12 11:03:00")), new = c("t","u","v","x"))
old <- data.frame(ID=c(1,1,1,1,1), date = as.POSIXct( c("2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-03-01 10:09:00", "2015-04-12 10:09:00","2016-03-03 12:20:00")), old = c("a","b","c","d","e"))
amerge(old, new, firm="ID", approx="date")
It outputs:
ID date old date.2 new
2 1 2016-03-07 12:20:00 a 2016-03-07 12:20:00 u
6 1 2016-04-02 12:20:00 b 2016-04-02 12:20:00 v
7 1 2016-03-01 10:09:00 c <NA> <NA>
10 1 2015-04-12 10:09:00 d <NA> <NA>
13 1 2016-03-03 12:20:00 e 2016-03-02 12:20:00 t
16 2 2016-04-12 11:03:00 <NA> 2016-04-12 11:03:00 x
So works for my purpose as intended - there is exactly one copy of each row from both data frames - matched by shortest time difference. One note: the function copies date.2 into date column where the date would be missing.
First of all, a similar problem:
Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop
The story
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event. In total we consider three events : AC, CO and MT.
The data
Edit 1:
Here are two example datasets that allow the execution of the code below.
The code runs just fine for these sets. Once I have data that generates the error I'll make a second edit. Note that event.GN in the example dataset below is a data.table instead of a list
emissions.GN <- data.table(date.time=seq(ymd_hms("2016-01-01 00:00:00"), by="min",length.out = 1000000))
event.GN <- data.table(dat=seq(ymd_hms("2016-01-01 00:00:00"), by="15 mins", length.out = 26383))
Edit 2:
I created a csv file containing the data event.GN that generates the error. The file has 26383 rows of one variable dat but only about 14000 are necessary to generate the error.
Edit 3:
Up until the dat "2017-03-26 00:25:20" the function works fine. Right after adding the next record with dat "2017-03-26 01:33:46" the error occurs. I noticed that between those points there is more than 60 minutes. This means that between those two event times one or several emission records won't have corresponding events. This in turn will generate NA's that somehow get caught up in the any() call of the foverlaps function. Am I looking in the right direction?
The fluor emissions are stored in a large datatable (~1 million rows) called emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.
example of emissions.GN:
date.time fluor hall period dt
1: 2016-01-01 00:17:04 0.3044254 GN [2016-01-01,2016-02-21] -16.07373
2: 2016-01-01 00:17:04 0.4368381 GN [2016-01-01,2016-02-21] -16.07373
3: 2016-01-01 00:18:04 0.5655382 GN [2016-01-01,2016-02-21] -16.07395
4: 2016-01-01 00:19:04 0.6542259 GN [2016-01-01,2016-02-21] -16.07417
5: 2016-01-01 00:21:04 0.6579384 GN [2016-01-01,2016-02-21] -16.07462
The data of the three events is stored in three smaller datatables (~20 thousand records) contained in a list called events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.
example of AC events (CO and MT are analogous):
events.GN[["AC"]]
dat hall numevt txtevt
1: 2016-01-01 00:04:54 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
2: 2016-01-01 00:09:21 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
3: 2016-01-01 00:38:53 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
4: 2016-01-01 02:30:33 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
5: 2016-01-01 02:34:11 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
The function
I have written a function that applies foverlaps on a given (large) x datatable and a given (small) y datatable. The function returns a datatable with two columns. The first column yid contains the indices of emissions.GN observations that overlap at least once with an event. The second column N contains the overlap count (i.e. the number of times an overlap occurs for that particular index). The index of emissions that have zero overlaps are omitted from the result.
# A function to compute the number of times an emission record falls between the defined starting point and end point of an event.
find_index_and_count <- function(hall,event, lower.margin=10, upper.margin=30){
# Define start and stop variables of the large emission dataset hall to be zero, i.e. each record is a single time point, not an interval.
hall$start <- hall$date.time
hall$stop <- hall$date.time
# Define the start and stop variables of the small event datatables equal to the defined margins oof 10 and 30 minutes respectively
event$start <- event$dat-minutes(lower.margin)
event$stop <- event$dat+minutes(upper.margin)
# Set they key of both datasets to be start and stop
setkey(hall,start,stop)
setkey(event,start,stop)
# Returns the index the of the emission record that falls N times within an event time interval. The call to na.omit is necessary to remove NA's introduced by x records that don't fall within any y interval.
foverlaps(event,hall,nomatch = NA, which = TRUE)[, .N, by=yid] %>% na.omit
}
The function executes succesfully for the events AC and CO
The function gives the desired result as discribed above when called on the events AC and CO:
find_index_and_count(emissions.GN,events.GN[["AC"]])
yid N
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
---
find_index_and_count(emissions.GN,events.GN[["CO"]])
yid N
1: 3 1
2: 4 1
3: 5 1
4: 6 1
5: 7 1
---
The function returns an error when called on the MT event
The following function call results in the error below:
find_index_and_count(emissions.GN,events.GN[["MT"]])
Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed
5.foverlaps(event, hall, nomatch = NA, which = TRUE)
4.eval(lhs, parent, parent)
3.eval(lhs, parent, parent)
2.foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1.find_index_and_count(emissions.GN, events.GN[["MT"]])
I assume the function returns an NA whenever a record in x (emissions.FN) has no overlap with any of the events in y (events.FN[["AC"]] etc.).
I don't understand why the function fails on the event MT when it works just fine for AC and CO. The data are exactly the same with the exception of the values and slightly different number of records.
What I have tried so far
Firstly, In the similar problem linked above, someone pointed out the following idea:
This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50
Hence , I modified the call to foverlaps to return 0 instead of NA whener no overlap between x and y is found, like this:
foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit
This did not change anything (the function works for AC and CO but fails for MT).
Secondly, I made absolutely sure that none of my datatables contained NA's.
More information
If required I can provide the SQL code that generates the emissions.FN data and all the events.FN data. Note that because all the events.FN date has the same origin, there should be no diffirences (other than the values) between the data of the events AC, CO and MT.
If anything else is required, please do feel free to ask !
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.
Just addressing this objective (since I don't know foverlaps well.)...
event.GN[, n :=
emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)], on=.(date.time >= d_dn, date.time <= d_up),
.N
, by=.EACHI]$N
]
dat n
1: 2016-01-01 00:00:00 31
2: 2016-01-01 00:15:00 41
3: 2016-01-01 00:30:00 41
4: 2016-01-01 00:45:00 41
5: 2016-01-01 01:00:00 41
---
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
To check/verify one of these counts...
> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
dat n
1: 2016-01-02 00:30:00 41
>
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
date.time
1: 2016-01-02 00:20:00
2: 2016-01-02 00:21:00
3: 2016-01-02 00:22:00
4: 2016-01-02 00:23:00
5: 2016-01-02 00:24:00
6: 2016-01-02 00:25:00
7: 2016-01-02 00:26:00
8: 2016-01-02 00:27:00
9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
date.time
I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count number of observations which falls within each interval. I have been using between from data.table, which has been working without any problem when dates are included.
However, now I only have time stamps, without date. This causes some problems for the times which occurs in the interval which spans midnight (20:00 - 05:59). These times are not counted in the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Row 4-8 & 10-11 falls in 'midnight interval', but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, and their helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but, there should only ever be one of those. And as you're interested in the number of observations per 'group' of intervals, you can treat this group as the equivalent of the 'Not' of the others.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval that each time falls within. Noting that those times that reutrn 'NA' are actually those that fall inside the midnight-spanning interval
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15.59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then to get the number of observatins for the groups of intervals, you just .N grouped by each time point, which can just be chained onto the end of the above statement
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
Where the NA group corresponds to the one that spans midnight
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y)
if(x[1] > x[2]) return(sum(y[, 1] %between% c(x[1], 23:59), y[, 1] %between% c(00:00, x[2]))) else return(sum(y[, 1] %between% c(x[1], x[2]))), y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
I am trying to subtract one hour to date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
Time<-as.POSIXct(c("2015-10-02 08:00:00","2015-11-02 11:00:00","2015-10-11 10:00:00","2015-11-11 09:00:00","2015-10-24 08:00:00","2015-10-27 08:00:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,01,02,02,03,03)
data<-data.frame(Time,ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value one hour should be subtracted from the value in data:
#create sample comparison data
Comparison<-as.POSIXct(c("2015-10-29 08:00:00","2015-11-02 08:00:00","2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,02,03)
ComparisonData<-data.frame(Comparison,ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this but I cannot understand how to also check the times using the right timing with that particular ID.
I think ddply seems quite a promising option but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID and then just modify the Times which are lower or equal to Comparison
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining using ifelse (not sure how efficient this though)
setDT(data)[ComparisonData,
Time := ifelse(Time <= i.Comparison,
Time - 3600L, Time),
on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
I am sure there is going to be a better solution than this, however, I think this works.
for(i in 1:nrow(data)) {
if(data$Time[i] < ComparisonData[data$ID[i], 1]){
data$Time[i] <- data$Time[i] - 3600
}
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This is going to iterate through every row in data.
ComparisonData[data$ID[i], 1] gets the time column in ComparisonData for the corresponding ID. If this is greater than the Time column in data then reduce the time by 1 hour.
I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.
I have the following code
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.
Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use %in% which also makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on #MichaelChirico's data.
setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
# user system elapsed
# 0.114 0.018 0.136
# #MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation on how it works by #Arun, which got me started with the data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates,Lease$id)) %>% # create combinations of dates and ranges
select(dates=Var1, id=Var2) %>%
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- new_interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1