R: how to keep legitimate NAs in a merged zoo object - r

I have multiple time-series objects with a regular interval of five minutes, but they can have different start and end times. They can also log at different offsets, not necessarily at minutes 5, 10, 15, etc.
I want to merge those objects, but I want to keep the legitimate NAs intact. For example, if one object starts logging at a later time, then the NAs at its beginning are legitimate. Likewise, if one object stops logging earlier, then the NAs at its end are legitimate.
But there is no option in na.locf to keep both kinds of NAs intact.
Here is an example of my problem:
library(zoo)
lines1="Index,x1
2014-01-01 00:00:00,73.06
2014-01-01 00:05:00,73.11
2014-01-01 00:10:00,73.16
2014-01-01 00:15:00,73.22"
lines2="Index,x2
2014-01-01 00:11:00,71.11
2014-01-01 00:16:00,70.12
2014-01-01 00:21:00,70.16
2014-01-01 00:26:00,70.19
2014-01-01 00:31:00,69.16"
lines3="Index,x3
2014-01-01 00:23:00,0
2014-01-01 00:28:00,1
2014-01-01 00:33:00,1
2014-01-01 00:38:00,0
2014-01-01 00:43:00,0"
df1=read.table(text = lines1, header = TRUE, sep = ",")
df2=read.table(text = lines2, header = TRUE, sep = ",")
df3=read.table(text = lines3, header = TRUE, sep = ",")
z1 = zoo(df1$x1, as.POSIXct(df1$Index))
z2 = zoo(df2$x2, as.POSIXct(df2$Index))
z3 = zoo(df3$x3, as.POSIXct(df3$Index))
z = merge(z1,z2,z3)
z
z.na.locf = na.locf(z)
z.na.locf
timesteps = seq(as.POSIXct("2014-01-01 00:00:00"),
as.POSIXct("2014-01-01 01:00:00"),
by = "5 min")
z.timesteps = na.locf(z, xout=timesteps)
z.timesteps
The merged object is this:
> z
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 NA 71.11 NA
2014-01-01 00:15:00 73.22 NA NA
2014-01-01 00:16:00 NA 70.12 NA
2014-01-01 00:21:00 NA 70.16 NA
2014-01-01 00:23:00 NA NA 0
2014-01-01 00:26:00 NA 70.19 NA
2014-01-01 00:28:00 NA NA 1
2014-01-01 00:31:00 NA 69.16 NA
2014-01-01 00:33:00 NA NA 1
2014-01-01 00:38:00 NA NA 0
2014-01-01 00:43:00 NA NA 0
Note that the NAs at the beginning of z1 are legitimate, as are those at the end of z3 and at the beginning and end of z2. The NAs that need to be replaced are the ones in the middle of the data. The problem is that if I try to fill in the missing values in the middle of the data, the legitimate NAs are gone too:
> z.na.locf
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 73.16 71.11 NA
2014-01-01 00:15:00 73.22 71.11 NA
2014-01-01 00:16:00 73.22 70.12 NA
2014-01-01 00:21:00 73.22 70.16 NA
2014-01-01 00:23:00 73.22 70.16 0
2014-01-01 00:26:00 73.22 70.19 0
2014-01-01 00:28:00 73.22 70.19 1
2014-01-01 00:31:00 73.22 69.16 1
2014-01-01 00:33:00 73.22 69.16 1
2014-01-01 00:38:00 73.22 69.16 0
2014-01-01 00:43:00 73.22 69.16 0
Note that for z1 and z2, the legitimate NAs at the end are gone.
Furthermore, if I want to resample the data onto the same regular timestamps, the NAs at both the beginning and the end are gone too.
> z.timesteps
z1 z2 z3
2014-01-01 00:00:00 73.06 71.11 0
2014-01-01 00:05:00 73.11 71.11 0
2014-01-01 00:10:00 73.16 71.11 0
2014-01-01 00:15:00 73.22 71.11 0
2014-01-01 00:20:00 73.22 70.12 0
2014-01-01 00:25:00 73.22 70.16 0
2014-01-01 00:30:00 73.22 70.19 1
2014-01-01 00:35:00 73.22 69.16 1
2014-01-01 00:40:00 73.22 69.16 0
2014-01-01 00:45:00 73.22 69.16 0
2014-01-01 00:50:00 73.22 69.16 0
2014-01-01 00:55:00 73.22 69.16 0
2014-01-01 01:00:00 73.22 69.16 0
Is there a way to achieve what I need? Thanks for your help.

na.fill can help here. The following line of code will preserve runs of NAs at the beginning and at the end but fill in the remaining NAs using na.locf:
zz <- na.locf(z, na.rm = FALSE) + 0 * na.fill(z, fill = c(NA, 0, NA))
giving:
> zz
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 73.16 71.11 NA
2014-01-01 00:15:00 73.22 71.11 NA
2014-01-01 00:16:00 NA 70.12 NA
2014-01-01 00:21:00 NA 70.16 NA
2014-01-01 00:23:00 NA 70.16 0
2014-01-01 00:26:00 NA 70.19 0
2014-01-01 00:28:00 NA 70.19 1
2014-01-01 00:31:00 NA 69.16 1
2014-01-01 00:33:00 NA NA 1
2014-01-01 00:38:00 NA NA 0
2014-01-01 00:43:00 NA NA 0
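To see why this works, here is the same trick on a short single series (the values here are made up purely for illustration):

```r
library(zoo)

x <- zoo(c(NA, 1, NA, 3, NA))
# na.fill(x, fill = c(NA, 0, NA)) keeps the leading and trailing NA runs
# but replaces interior NAs with 0; multiplying by 0 turns it into a
# mask that is NA outside the observed range and 0 inside it
mask <- 0 * na.fill(x, fill = c(NA, 0, NA))
# na.locf fills every NA forward; adding the mask restores the
# legitimate NAs at both ends
res <- na.locf(x, na.rm = FALSE) + mask
coredata(res)  # NA 1 1 3 NA
```

Adding NA to anything is NA, and adding 0 leaves the filled value unchanged, which is exactly the split between legitimate and fillable NAs.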
Note 1: We could reduce the read.table / zoo lines to three lines of the form:
z1 <- read.zoo(text = lines1, header = TRUE, sep = ",", tz = "")
Note 2: Perhaps what you want to do next is:
timesteps <- seq(start(zz), start(zz) + 3600, by = "5 min")
m <- merge(zz, zoo(, timesteps))
m.na <- na.locf(m, na.rm = FALSE) + 0 * na.fill(m, fill = c(NA, 0, NA))
window(m.na, timesteps)

Related

R: yes-no factor based on previous entries

I've got a time-series dataset from a weather station. There are 3 columns: time - time and date; p - rain, mm; h - water level, m.
I need to make a new column factor_rain with 1 and 0 values: 1 if the water level (df$h) was influenced by rain (df$p), which is the case if there was rain during the last 5 hours (5 entries).
In other cases, there should be 0.
A part of dataset is here:
df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
"2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
"2017-06-05 9:00:00", "2017-06-05 11:00:00",
"2017-06-05 13:00:00", "2017-06-05 16:00:00",
"2017-06-05 19:00:00", "2017-06-05 21:00:00", "2017-06-05 23:00:00",
"2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
"2017-06-06 16:00:00", "2017-06-06 17:00:00", "2017-06-06 18:00:00",
"2017-06-06 19:00:00"),
p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
NA, NA, NA, NA, NA, NA, NA, NA, NA),
h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
33,NA,NA,NA,29,NA,NA,NA,NA))
I was trying the simplest way I could think of, but unfortunately it works for only one case:
> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
> Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
longer object length is not a multiple of shorter object length
Is there any way to fix it? If you can suggest how to use real time (something from the xts library, for example), that would be great. I mean using a 5-hour threshold, not 5 values.
By the way I need to get this as a result:
> df
time p h factor_rain
1 2017-06-04 9:00:00 NA 23 0
2 2017-06-04 13:00:00 NA NA 0
3 2017-06-04 17:00:00 16.4 NA 0
4 2017-06-04 19:00:00 NA NA 0
5 2017-06-04 21:00:00 NA NA 0
6 2017-06-04 23:00:00 NA 32 1
7 2017-06-05 9:00:00 NA NA 0
8 2017-06-05 11:00:00 NA NA 0
9 2017-06-05 13:00:00 NA 28 0
10 2017-06-05 16:00:00 NA NA 0
11 2017-06-05 19:00:00 12.0 NA 0
12 2017-06-05 21:00:00 NA 33 1
13 2017-06-05 23:00:00 NA NA 0
14 2017-06-06 9:00:00 NA NA 0
15 2017-06-06 11:00:00 NA NA 0
16 2017-06-06 13:00:00 NA 29 0
17 2017-06-06 16:00:00 NA NA 0
18 2017-06-06 17:00:00 NA NA 0
19 2017-06-06 18:00:00 NA NA 0
20 2017-06-06 19:00:00 NA NA 0
You can use
df$factorrain = FALSE
df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE
# time p h factorrain
# 1 2017-06-04 9:00:00 NA 23 FALSE
# 2 2017-06-04 13:00:00 NA NA FALSE
# 3 2017-06-04 17:00:00 16.4 NA TRUE
# 4 2017-06-04 19:00:00 NA NA TRUE
# 5 2017-06-04 21:00:00 NA NA TRUE
# 6 2017-06-04 23:00:00 NA 32 TRUE
# 7 2017-06-05 9:00:00 NA NA TRUE
# 8 2017-06-05 11:00:00 NA NA FALSE
# 9 2017-06-05 13:00:00 NA 28 FALSE
# 10 2017-06-05 16:00:00 NA NA FALSE
# 11 2017-06-05 19:00:00 12.0 NA TRUE
# 12 2017-06-05 21:00:00 NA 33 TRUE
# 13 2017-06-05 23:00:00 NA NA TRUE
# 14 2017-06-06 9:00:00 NA NA TRUE
# 15 2017-06-06 11:00:00 NA NA TRUE
# 16 2017-06-06 13:00:00 NA 29 FALSE
# 17 2017-06-06 16:00:00 NA NA FALSE
# 18 2017-06-06 17:00:00 NA NA FALSE
# 19 2017-06-06 18:00:00 NA NA FALSE
# 20 2017-06-06 19:00:00 NA NA FALSE
Or, a similar approach with sapply,
df$factorrain = FALSE
df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
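Both one-liners work by turning the positions of the rain observations into the positions to flag: each index where p is not NA, plus the offsets 0 through 4. A toy sketch of that index arithmetic (with a bounds guard the one-liners above omit, needed if a rain entry falls within the last four rows):

```r
# made-up rain vector: observations at positions 2 and 8
p <- c(NA, 2.5, NA, NA, NA, NA, NA, 1.0, NA)
flag <- rep(FALSE, length(p))
# each rain position plus offsets 0:4 marks that row and the next four
idx <- as.vector(sapply(which(!is.na(p)), function(x) x + 0:4))
idx <- idx[idx <= length(p)]  # guard: offsets may run past the end
flag[idx] <- TRUE
flag
```

Without the guard, assigning past the end of flag would silently grow the vector with NAs.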
A solution can be achieved by using a non-equi join from data.table.
library(data.table)
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S")
setDT(df)
df[,timeLow := time-5*60*60]
df[df,.(time, p, h = i.h), on=.(time < time, time >= timeLow)][
,.(factor_rain = ifelse(!is.na(first(h)), any(!is.na(p)),FALSE)),by=.(time)][
df,.(time, p, h, factor_rain),on="time"]
# time p h factor_rain
# 1: 2017-06-04 09:00:00 NA 23 FALSE
# 2: 2017-06-04 13:00:00 NA NA FALSE
# 3: 2017-06-04 17:00:00 16.4 NA FALSE
# 4: 2017-06-04 19:00:00 NA NA FALSE
# 5: 2017-06-04 21:00:00 NA NA FALSE
# 6: 2017-06-04 23:00:00 NA 32 FALSE <-- There is no rain in last 5 hours
# 7: 2017-06-05 09:00:00 NA NA FALSE
# 8: 2017-06-05 11:00:00 NA NA FALSE
# 9: 2017-06-05 13:00:00 NA 28 FALSE
# 10: 2017-06-05 16:00:00 NA NA FALSE
# 11: 2017-06-05 19:00:00 12.0 NA FALSE
# 12: 2017-06-05 21:00:00 NA 33 TRUE
# 13: 2017-06-05 23:00:00 NA NA FALSE
# 14: 2017-06-06 09:00:00 NA NA FALSE
# 15: 2017-06-06 11:00:00 NA NA FALSE
# 16: 2017-06-06 13:00:00 NA 29 FALSE
# 17: 2017-06-06 16:00:00 NA NA FALSE
# 18: 2017-06-06 17:00:00 NA NA FALSE
# 19: 2017-06-06 18:00:00 NA NA FALSE
# 20: 2017-06-06 19:00:00 NA NA FALSE
Note: The solution can be optimized a bit. I'll take up optimization in a while.

foverlaps and within in data.table

I have 2 tables.
dums:
start end 10min
2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
2013-04-02 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45
maps: start and end are 10-minute interval blocks spanning 2013-04-01 00:00:00 to 2013-04-04.
I want to add column 3 of dums to map as long as the start and end times are within the 10-minute blocks, and keep appending the columns.
ideally the output should be
start end 10min
4/1/2013 0:00:00 4/1/2013 0:10:00 0.05 0
4/1/2013 0:10 4/1/2013 0:20 0.05 0
4/1/2013 0:20 4/1/2013 0:30 0.05 0
4/1/2013 0:30 4/1/2013 0:40 0.05 0
4/1/2013 0:40 4/1/2013 0:50 0.05 0.01
4/1/2013 0:50 4/1/2013 1:00 0.05 0.01
I tried
setkey(dums,start,end)
setkey(map,start,end)
foverlaps(map,dums,type="within",nomatch=0L)
I keep getting the error:
Error in foverlaps(map, dums, type = "within", nomatch = 0L) : All entries in column start should be <= corresponding entries in column end in data.table 'y'
Any pointers or alternative approaches?
Thanks
The error message
All entries in column start should be <= corresponding entries in column end in data.table 'y'
is probably caused by a typo in the dataset.
dums[start > end, which = TRUE]
returns 4, and row 4 of dums is:
start end min10
1: 2013-04-02 02:22:00 2013-04-01 04:33:12 0.2
After changing start to 2013-04-01 02:22:00 OP's code runs fine.
However, to achieve the expected output the result of foverlaps() needs to be reshaped from long to wide format.
This can be done in two ways:
dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ min10,
value.var = "min10")
i.start i.end 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45
1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05 NA NA NA NA NA NA NA NA
2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05 NA NA NA NA NA NA NA NA
3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05 NA NA NA NA NA NA NA NA
4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05 NA NA NA NA NA NA NA NA
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.1 NA NA NA NA NA NA NA
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 NA NA NA NA NA NA 0.35 NA 0.45
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 NA NA NA NA NA NA 0.35 NA 0.45
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 NA NA NA NA NA NA 0.35 NA 0.45
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 NA NA NA NA NA NA 0.35 NA 0.45
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 NA NA NA NA NA NA 0.35 NA 0.45
or, more in line with OP's expected result:
dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ rowid(i.start),
value.var = "min10")
i.start i.end 1 2 3 4 5
1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05 NA NA NA NA
2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05 NA NA NA NA
3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05 NA NA NA NA
4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05 NA NA NA NA
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.10 NA NA NA
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.35 0.45 NA NA NA
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 0.35 0.45 NA NA NA
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 0.35 0.45 NA NA NA
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 0.35 0.45 NA NA NA
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 0.35 0.45 NA NA NA
Note that the parameter type = "within" has been skipped for brevity.
Data
# corrected
dums <- fread(
" 2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
2013-04-01 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45"
)
dums <- dums[, .(start = as.POSIXct(paste(V1, V2, V3)),
end = as.POSIXct(paste(V4, V5, V6)),
min10 = V7)]
setkey(dums, start, end)
ts <- seq(as.POSIXct("2013-04-01 00:00:00 UTC"),
as.POSIXct("2013-04-04 00:00:00 UTC"),
by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
key = c("start", "end"))
That's a good catch with the POSIXct time being off for 1 row. I feel super silly to have glossed over such an error in the input data.
The ultimate goal is to have 3 column variables: YYYY-DD-MM; start time (POSIXct); end time (POSIXct).
The start and end times are 10-minute windows.
The number of days is 365, so effectively I'm looking at 365 * 144 (10-minute slices for a day). The catch is that I have 450k rows of "dums" data, and min10 is not evenly spaced discrete intervals; it is continuous data. If I have to aggregate (sum, mean, sd, etc.), is there any way to use dcast + aggregate + foverlaps within + grouping? I could do it with a for loop just placing the min10 value from start to end, but that looks super time-consuming and inefficient.
The output would be
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.15
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.80
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
key = c("start", "end"))
# plus do something along the lines of
dums[, .(count=.N, sum=sum(min10)), by = ID1]
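For the aggregation part, the foverlaps() result can be grouped per window directly instead of being spread into columns with dcast. A minimal sketch on a two-row cut of dums (the i.start/i.end column names come from foverlaps; the toy values are taken from the question's data):

```r
library(data.table)

# two-row cut of the corrected dums data
dums <- data.table(start = as.POSIXct(c("2013-04-01 00:00:54",
                                        "2013-04-01 00:40:26"), tz = "UTC"),
                   end   = as.POSIXct(c("2013-04-01 01:00:10",
                                        "2013-04-01 01:00:00"), tz = "UTC"),
                   min10 = c(0.05, 0.10),
                   key = c("start", "end"))
ts  <- seq(as.POSIXct("2013-04-01 00:00:00", tz = "UTC"),
           as.POSIXct("2013-04-01 01:00:00", tz = "UTC"), by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
                  key = c("start", "end"))
# one row per 10-minute window: count and sum of all overlapping
# min10 values, instead of one column per value
ov  <- foverlaps(map, dums, nomatch = 0L)
agg <- ov[, .(count = .N, sum = sum(min10)),
          by = .(start = i.start, end = i.end)]
```

For the 00:40-00:50 window this gives count 2 and sum 0.15 (0.05 + 0.10), matching row 5 of the expected output. Replacing sum() with mean() or sd() in the same j expression covers the other aggregates.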

how to take averaged diurnal for each month for two columns with ggplot2

I have time-series data with two columns, and I want a graph showing the averaged hourly pattern for each month, like the graph attached, but with two time series.
timestamp ET_control ET_treatment
1 2016-01-01 00:00:00 NA NA
2 2016-01-01 00:30:00 NA NA
3 2016-01-01 01:00:00 NA NA
4 2016-01-01 01:30:00 NA NA
5 2016-01-01 02:00:00 NA NA
6 2016-01-01 02:30:00 NA NA
7 2016-01-01 03:00:00 NA NA
8 2016-01-01 03:30:00 NA NA
9 2016-01-01 04:00:00 NA NA
10 2016-01-01 04:30:00 NA NA
11 2016-01-01 05:00:00 NA NA
12 2016-01-01 05:30:00 NA NA
13 2016-01-01 06:00:00 NA NA
14 2016-01-01 06:30:00 NA NA
15 2016-01-01 07:00:00 NA NA
16 2016-01-01 07:30:00 NA NA
17 2016-01-01 08:00:00 NA NA
18 2016-01-01 08:30:00 NA NA
19 2016-01-01 09:00:00 NA NA
20 2016-01-01 09:30:00 NA NA
21 2016-01-01 10:00:00 NA NA
22 2016-01-01 10:30:00 NA NA
23 2016-01-01 11:00:00 NA NA
24 2016-01-01 11:30:00 0.09863437 NA
25 2016-01-01 12:00:00 0.11465258 NA
26 2016-01-01 12:30:00 0.12356855 NA
27 2016-01-01 13:00:00 0.09246215 0.085398782
28 2016-01-01 13:30:00 0.08843156 0.072877001
29 2016-01-01 14:00:00 0.08536019 0.081885947
30 2016-01-01 14:30:00 0.08558541 NA
31 2016-01-01 15:00:00 0.05571436 NA
32 2016-01-01 15:30:00 0.04087248 0.038582547
33 2016-01-01 16:00:00 0.04233724 NA
34 2016-01-01 16:30:00 0.02150660 0.019560578
35 2016-01-01 17:00:00 0.01803765 0.019691155
36 2016-01-01 17:30:00 NA 0.005190489
37 2016-01-01 18:00:00 NA NA
38 2016-01-01 18:30:00 NA NA
39 2016-01-01 19:00:00 NA NA
40 2016-01-01 19:30:00 NA NA
41 2016-01-01 20:00:00 NA NA
42 2016-01-01 20:30:00 NA NA
43 2016-01-01 21:00:00 NA NA
44 2016-01-01 21:30:00 NA NA
45 2016-01-01 22:00:00 NA NA
46 2016-01-01 22:30:00 NA NA
47 2016-01-01 23:00:00 NA NA
48 2016-01-01 23:30:00 NA NA
49 2016-01-02 00:00:00 NA NA
50 2016-01-02 00:30:00 NA NA
given that t is your data.frame, and with the packages dplyr and ggplot2 loaded:
t <- t %>% mutate(
month = format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%b"),
hour=format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%H"))
tm <- t %>% group_by(month, hour) %>%
summarize(ET_control_mean=mean(ET_control, na.rm=T))
ggplot(tm, aes(x=hour, y=ET_control_mean)) + geom_point() + facet_wrap(~ month)
If you want to have both columns in your graph, you should transform your data into 'long' format.
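A sketch of that 'long'-format approach, using tidyr::pivot_longer alongside dplyr and ggplot2 (the three-row t below is a made-up stand-in for the real data frame):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# toy stand-in for the question's data frame
t <- data.frame(
  timestamp    = c("2016-01-01 13:00:00", "2016-01-01 13:30:00",
                   "2016-02-01 13:00:00"),
  ET_control   = c(0.092, 0.088, 0.100),
  ET_treatment = c(0.085, 0.073, NA))

# long format: one row per (timestamp, series, value), so ggplot can
# map the series name to colour
t_long <- t %>%
  pivot_longer(c(ET_control, ET_treatment),
               names_to = "series", values_to = "ET")
tm <- t_long %>%
  mutate(month = format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%b"),
         hour  = format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%H")) %>%
  group_by(month, hour, series) %>%
  summarize(ET_mean = mean(ET, na.rm = TRUE), .groups = "drop")

p <- ggplot(tm, aes(x = hour, y = ET_mean, colour = series, group = series)) +
  geom_point() + facet_wrap(~ month)
```

With both series in one column, a single colour aesthetic replaces the need for two separate layers.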

Flag first instance of an event occurring contingent on other variable's value

I'm new to R and to solving this kind of problem, so I'm not sure how to achieve certain functionality in particular instances.
I have a dataframe as such:
library(data.table)
df <- data.frame(DATETIME = seq(from = as.POSIXct('2014-01-01 00:00', tz = "GMT"), to = as.POSIXct('2014-01-01 06:00', tz = "GMT"), by='15 mins'),
Price = c(23,22,23,24,27,31,33,34,31,26,24,23,19,18,19,19,23,25,26,26,27,30,26,25,24),
TroughPriceFlag = c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0))
df <- data.table(df)
df
DATETIME Price TroughPriceFlag
1: 2014-01-01 00:00:00 23 0
2: 2014-01-01 00:15:00 22 1
3: 2014-01-01 00:30:00 23 0
4: 2014-01-01 00:45:00 24 0
5: 2014-01-01 01:00:00 27 0
6: 2014-01-01 01:15:00 31 0
7: 2014-01-01 01:30:00 33 0
8: 2014-01-01 01:45:00 34 0
9: 2014-01-01 02:00:00 31 0
10: 2014-01-01 02:15:00 26 0
11: 2014-01-01 02:30:00 24 0
12: 2014-01-01 02:45:00 23 0
13: 2014-01-01 03:00:00 19 0
14: 2014-01-01 03:15:00 18 1
15: 2014-01-01 03:30:00 19 0
16: 2014-01-01 03:45:00 19 0
17: 2014-01-01 04:00:00 23 0
18: 2014-01-01 04:15:00 25 0
19: 2014-01-01 04:30:00 26 0
20: 2014-01-01 04:45:00 26 0
21: 2014-01-01 05:00:00 27 0
22: 2014-01-01 05:15:00 30 0
23: 2014-01-01 05:30:00 26 0
24: 2014-01-01 05:45:00 25 0
25: 2014-01-01 06:00:00 24 0
What I wish to do is two things:
(1) From where we observe a TroughPrice, flag the first instance where the price has risen by 10 or more dollars. That is, find the first instance where deltaPrice >= 10 since the trough price.
As an example: from the trough price of 22 (row 2), in the next interval price is increased to 23 which is a change of 1 dollar, so no flag. From the trough price of 22 (again row 2, since always with reference to the trough price in question), two intervals later the price is 24 dollars, so the price has increased by 2 dollars since the trough, so again no flag. However, from the trough price of 22, 5 intervals later the price has increased to 33 dollars, which is an increase of 11 dollars and is the first time the price has increased above 10 dollars. Thus the flag is 1.
(2) Determine the number of 15 minute periods which have passed between the trough price and the first instance the price has risen by 10 or more dollars.
The resulting dataframe should look like this:
DATETIME Price TroughPriceFlag FirstOver10CentsFlag CountPeriods
1 2014-01-01 00:00:00 23 0 0 NA
2 2014-01-01 00:15:00 22 1 0 5
3 2014-01-01 00:30:00 23 0 0 NA
4 2014-01-01 00:45:00 24 0 0 NA
5 2014-01-01 01:00:00 27 0 0 NA
6 2014-01-01 01:15:00 31 0 0 NA
7 2014-01-01 01:30:00 33 0 1 NA
8 2014-01-01 01:45:00 34 0 0 NA
9 2014-01-01 02:00:00 31 0 0 NA
10 2014-01-01 02:15:00 26 0 0 NA
11 2014-01-01 02:30:00 24 0 0 NA
12 2014-01-01 02:45:00 23 0 0 NA
13 2014-01-01 03:00:00 19 0 0 NA
14 2014-01-01 03:15:00 18 1 0 8
15 2014-01-01 03:30:00 19 0 0 NA
16 2014-01-01 03:45:00 19 0 0 NA
17 2014-01-01 04:00:00 23 0 0 NA
18 2014-01-01 04:15:00 25 0 0 NA
19 2014-01-01 04:30:00 26 0 0 NA
20 2014-01-01 04:45:00 26 0 0 NA
21 2014-01-01 05:00:00 27 0 0 NA
22 2014-01-01 05:15:00 30 0 1 NA
23 2014-01-01 05:30:00 26 0 0 NA
24 2014-01-01 05:45:00 25 0 0 NA
25 2014-01-01 06:00:00 24 0 0 NA
I'm not really sure where to start, since the time gaps can be quite large and I've only used indexing in the context of a few steps forward/backward. Please help!
Thanks in advance
You can chain operations with the data.table package; the idea is to group by the cumsum of TroughPriceFlag:
library(data.table)
df[, col1:=pmatch(Price-Price[1]>10,T, nomatch=0), cumsum(TroughPriceFlag)][
, count:=which(col1==1)-1,cumsum(TroughPriceFlag)][
TroughPriceFlag==0, count:=NA]
#> df
# DATETIME Price TroughPriceFlag col1 count
# 1: 2014-01-01 00:00:00 23 0 0 NA
# 2: 2014-01-01 00:15:00 22 1 0 5
# 3: 2014-01-01 00:30:00 23 0 0 NA
# 4: 2014-01-01 00:45:00 24 0 0 NA
# 5: 2014-01-01 01:00:00 27 0 0 NA
# 6: 2014-01-01 01:15:00 31 0 0 NA
# 7: 2014-01-01 01:30:00 33 0 1 NA
# 8: 2014-01-01 01:45:00 34 0 0 NA
# 9: 2014-01-01 02:00:00 31 0 0 NA
#10: 2014-01-01 02:15:00 26 0 0 NA
#11: 2014-01-01 02:30:00 24 0 0 NA
#12: 2014-01-01 02:45:00 23 0 0 NA
#13: 2014-01-01 03:00:00 19 0 0 NA
#14: 2014-01-01 03:15:00 18 1 0 8
#15: 2014-01-01 03:30:00 19 0 0 NA
#16: 2014-01-01 03:45:00 19 0 0 NA
#17: 2014-01-01 04:00:00 23 0 0 NA
#18: 2014-01-01 04:15:00 25 0 0 NA
#19: 2014-01-01 04:30:00 26 0 0 NA
#20: 2014-01-01 04:45:00 26 0 0 NA
#21: 2014-01-01 05:00:00 27 0 0 NA
#22: 2014-01-01 05:15:00 30 0 1 NA
#23: 2014-01-01 05:30:00 26 0 0 NA
#24: 2014-01-01 05:45:00 25 0 0 NA
#25: 2014-01-01 06:00:00 24 0 0 NA
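The grouping hinges on cumsum(TroughPriceFlag): it labels every row with the number of troughs seen so far, so each run from one trough to the next becomes its own group (group 0 covers any rows before the first trough). On a made-up flag vector:

```r
# toy flag vector with troughs at positions 2 and 5
flag <- c(0, 1, 0, 0, 1, 0)
cumsum(flag)  # 0 1 1 1 2 2: group 0, then one group per trough
```

Within each group, Price[1] is then the trough price itself, which is why Price - Price[1] > 10 finds the first rise of more than 10 dollars.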

R: Compare data.table and pass variable while respecting key

I have two data.tables:
original <- data.frame(id = c(rep("RE01",5),rep("RE02",5)),date.time = head(seq.POSIXt(as.POSIXct("2015-11-01 01:00:00"),as.POSIXct("2015-11-05 01:00:00"),60*60*10),10))
compare <- data.frame(id = c("RE01","RE02"),seq = c(1,2),start = as.POSIXct(c("2015-11-01 20:00:00","2015-11-04 08:00:00")),end = as.POSIXct(c("2015-11-02 08:00:00","2015-11-04 20:00:00")))
setDT(original)
setDT(compare)
I would like to check the date in each row of original and see if it lies between the start and finish dates of compare whilst respecting the id. If it does lie between the two elements, a variable should be passed to original (compare$diff.seq). The output should look like this:
original
id date.time diff.seq
1 RE01 2015-11-01 01:00:00 NA
2 RE01 2015-11-01 11:00:00 NA
3 RE01 2015-11-01 21:00:00 1
4 RE01 2015-11-02 07:00:00 1
5 RE01 2015-11-02 17:00:00 NA
6 RE02 2015-11-03 03:00:00 NA
7 RE02 2015-11-03 13:00:00 NA
8 RE02 2015-11-03 23:00:00 NA
9 RE02 2015-11-04 09:00:00 2
10 RE02 2015-11-04 19:00:00 2
I've been reading the manual and SO for hours and trying "on", "by" and so on.. without any success. Can anybody point me in the right direction?
As said in the comments, this is very straightforward using data.table::foverlaps.
You basically have to create an additional column in the original data set in order to set the join boundaries, then key the two data sets by the columns you want to join on, and then simply run foverlaps and select the desired columns:
original[, end := date.time]
setkey(original, id, date.time, end)
setkey(compare, id, start, end)
foverlaps(original, compare)[, .(id, date.time, seq)]
# id date.time seq
# 1: RE01 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2
Alternatively, you can run foverlaps the other way around and then update the original data set by reference, selecting the correct rows to update:
indx <- foverlaps(compare, original, which = TRUE)
original[indx$yid, diff.seq := indx$xid]
original
# id date.time end diff.seq
# 1: RE01 2015-11-01 01:00:00 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2015-11-04 19:00:00 2
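In recent data.table versions (1.9.8 and later) the same result can also be had with a non-equi update join, which avoids the helper end column entirely. A minimal sketch on a cut-down version of the data (three rows, one id):

```r
library(data.table)

original <- data.table(id = rep("RE01", 3),
                       date.time = as.POSIXct(c("2015-11-01 01:00:00",
                                                "2015-11-01 21:00:00",
                                                "2015-11-02 17:00:00")))
compare  <- data.table(id = "RE01", seq = 1,
                       start = as.POSIXct("2015-11-01 20:00:00"),
                       end   = as.POSIXct("2015-11-02 08:00:00"))
# rows whose date.time falls inside [start, end] for the same id get
# the matching seq; everything else stays NA
original[compare, diff.seq := i.seq,
         on = .(id, date.time >= start, date.time <= end)]
original$diff.seq  # NA 1 NA
```

Only the middle row (21:00 on Nov 1) falls inside the compare interval, so it alone receives diff.seq = 1.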
