I have a data.table containing time series of hourly observations from different locations (sites). There are gaps -- missing hours -- in each sequence. I want to fill out the sequence of hours for each site, so that each sequence has a row for every hour (with x set to NA where no observation exists).
Example data:
library(data.table)
library(lubridate)
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
date = ymd_h(c("2017080101", "2017080103", "2017080105",
"2017080103", "2017080105", "2017080107")),
x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
key = c("site", "date"))
DT
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 03:00:00 1.2
# 3: A 2017-08-01 05:00:00 1.3
# 4: B 2017-08-01 03:00:00 2.1
# 5: B 2017-08-01 05:00:00 2.2
# 6: B 2017-08-01 07:00:00 2.3
The desired result DT2 would contain all the hours between the first (minimum) date and the last (maximum) date for each site, with x missing where the new rows are inserted:
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
I have tried to join DT with a date sequence constructed from min(date) and max(date). This is a step in the right direction, but the date range spans all sites rather than each individual site, the filled-in rows have a missing site, and the sort order (key) is wrong:
DT[.(seq(from = min(date), to = max(date), by = "hour")),
.SD, on="date"]
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: NA 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: B 2017-08-01 03:00:00 2.1
# 5: NA 2017-08-01 04:00:00 NA
# 6: A 2017-08-01 05:00:00 1.3
# 7: B 2017-08-01 05:00:00 2.2
# 8: NA 2017-08-01 06:00:00 NA
# 9: B 2017-08-01 07:00:00 2.3
So I naturally tried adding by = site:
DT[.(seq(from = min(date), to = max(date), by = "hour")),
.SD, on="date", by=.(site)]
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 03:00:00 1.2
# 3: A 2017-08-01 05:00:00 1.3
# 4: NA <NA> NA
# 5: B 2017-08-01 03:00:00 2.1
# 6: B 2017-08-01 05:00:00 2.2
# 7: B 2017-08-01 07:00:00 2.3
But this doesn't work either. Can anyone suggest the right data.table formulation to give the desired filled-out DT2 shown above?
One approach is to build the complete hourly sequence for each site and then left-join the original data onto it:
library(data.table)
library(lubridate)
setDT(DT)
test <- DT[, .(date = seq(min(date), max(date), by = 'hour')), by = 'site']
DT <- merge(test, DT, by = c('site', 'date'), all.x = TRUE)
DT
site date x
1: A 2017-08-01 01:00:00 1.1
2: A 2017-08-01 02:00:00 NA
3: A 2017-08-01 03:00:00 1.2
4: A 2017-08-01 04:00:00 NA
5: A 2017-08-01 05:00:00 1.3
6: B 2017-08-01 03:00:00 2.1
7: B 2017-08-01 04:00:00 NA
8: B 2017-08-01 05:00:00 2.2
9: B 2017-08-01 06:00:00 NA
10: B 2017-08-01 07:00:00 2.3
Thanks to both Frank and Wen for putting me on the right track. I found a compact data.table solution. The result DT2 is also keyed on site and date, as in the input table (which is desirable although I didn't request this in the OP). This is a reformulation of Wen's solution, in data.table syntax, which I assume will be slightly more efficient on large datasets.
DT2 <- DT[setkey(DT[, .(date = seq(from = min(date), to = max(date),
by = "hour")), by = site], site, date), ]
DT2
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT2)
# [1] "site" "date"
EDIT1: As mentioned by Frank, the on= syntax can also be used. The following DT3 formulation gives the correct answer, but DT3 is not keyed, whereas the DT2 result is keyed. That means an 'extra' setkey() would be needed if a keyed result were desired (shown for reference after the output below).
DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date),
by = "hour")), by = site], on = c("site", "date"), ]
DT3
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT3)
# NULL
all.equal(DT2, DT3)
# [1] "Datasets has different keys. 'target': site, date. 'current' has no key."
all.equal(DT2, DT3, check.attributes = FALSE)
# [1] TRUE
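For reference, the explicit fix referred to above is just one extra call:
setkey(DT3, site, date)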
Is there a way to write the DT3 expression to give a keyed result, other than expressly using setkey()?
EDIT2: Frank's comment suggests an additional formulation, DT4, using keyby = .EACHI. In this case .SD is supplied as j, which is required when by or keyby is used. This gives the correct answer, and the result is keyed like the DT2 formulation.
DT4 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")),
by = site], on = c("site", "date"), .SD, keyby = .EACHI]
DT4
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT4)
# [1] "site" "date"
identical(DT2, DT4)
# [1] TRUE
Related
Let's say I have a dataframe that contains a time series as below:
Date value
2000-01-01 00:00:00 4.6
2000-01-01 01:00:00 N/A
2000-01-01 02:00:00 5.3
2000-01-01 03:00:00 6.0
2000-01-01 04:00:00 N/A
2000-01-01 05:00:00 N/A
2000-01-01 06:00:00 N/A
2000-01-01 07:00:00 6.0
I want to find an efficient way to calculate the size of the gap (number of consecutive N/As) and add it to a new column of my dataframe to get the following:
Date value gap_size
2000-01-01 00:00:00 4.6 0
2000-01-01 01:00:00 N/A 1
2000-01-01 02:00:00 5.3 0
2000-01-01 03:00:00 6.0 0
2000-01-01 04:00:00 N/A 3
2000-01-01 05:00:00 N/A 3
2000-01-01 06:00:00 N/A 3
2000-01-01 07:00:00 6.0 0
My dataframe in reality has more than 6 million rows, so I am looking for the cheapest way in terms of computation. Note that my time series is equi-spaced over the whole dataset (1 hour).
You could try using rle in this case to generate run lengths. First, convert your value column to logical using is.na and apply rle, which gives the run lengths of the different values of the input vector. In this case the two categories are TRUE and FALSE, and you're counting how long each runs for. You can then replicate this by the run length to get the output you're looking for.
x = c(1,2,4,NA,NA,6,NA,19,NA,NA)
res = rle(is.na(x))
rep(res$values*res$lengths,res$lengths)
#> [1] 0 0 0 2 2 0 1 0 2 2
Convert to a data.table with setDT(), then:
dt[, gap := rep(rle(value)$lengths, rle(value)$lengths) * (value == "N/A")]
Date value gap
1: 2000-01-01 00:00:00 4.6 0
2: 2000-01-01 01:00:00 N/A 1
3: 2000-01-01 02:00:00 5.3 0
4: 2000-01-01 03:00:00 6.0 0
5: 2000-01-01 04:00:00 N/A 3
6: 2000-01-01 05:00:00 N/A 3
7: 2000-01-01 06:00:00 N/A 3
8: 2000-01-01 07:00:00 6.0 0
Data:
dt <- structure(list(Date = c("2000-01-01 00:00:00", "2000-01-01 01:00:00",
"2000-01-01 02:00:00", "2000-01-01 03:00:00", "2000-01-01 04:00:00",
"2000-01-01 05:00:00", "2000-01-01 06:00:00", "2000-01-01 07:00:00"
), value = c("4.6", "N/A", "5.3", "6.0", "N/A", "N/A", "N/A",
"6.0")), row.names = c(NA, -8L), class = c("data.table", "data.frame"
))
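A minimal sketch of the same idea for the case where value is numeric with genuine NA entries (rather than the literal string "N/A"), computing the rle only once; column names are assumed to match the example:
r <- rle(is.na(dt$value))
dt[, gap_size := rep(r$lengths * r$values, r$lengths)]  # FALSE runs contribute 0, NA runs contribute their run length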
Is there a more efficient way to row-bind (or an efficient merge) two or more massive time series with data.table? The time series have some different columns, so I use fill = TRUE.
I want all the rows in each time series to appear in the final data.table. I can do it as shown below, but the timestamps are not ordered in dt3, so I have to create dt4 to get them ordered.
I wonder if there is a more efficient way of doing a kind of rbind/time series merge in data.table?
library(data.table)
tm <- seq(as.POSIXct("2018-05-12 00:00"), as.POSIXct("2018-05-14"), by = "hours")
dt <- data.table(time = tm, x = seq(1, length(tm), by = 1))
set.seed(1)
dt2 <- data.table(time = tm[sample(length(tm), size = 8)] + rnorm(n = 8, 0, 60),
y = rnorm(8))
# Can a one liner here get me the output in `dt4` with some kind of row bind?
# Is there a way to do a row bind here instead that avoids the creation of a new object dt4 that takes the sorted rows?
dt3 <- rbind(dt, dt2, fill = TRUE)
dt4 <- dt3[order(time)]
tail(dt4, 20)
# time x y
# 1: 2018-05-13 08:00:00 33 NA
# 2: 2018-05-13 09:00:00 34 NA
# 3: 2018-05-13 10:00:00 35 NA
# 4: 2018-05-13 11:00:00 36 NA
# 5: 2018-05-13 12:00:00 37 NA
# 6: 2018-05-13 13:00:00 38 NA
# 7: 2018-05-13 14:00:00 39 NA
# 8: 2018-05-13 14:59:41 NA 0.94383621
# 9: 2018-05-13 15:00:00 40 NA
# 10: 2018-05-13 16:00:00 41 NA
# 11: 2018-05-13 16:01:30 NA 0.82122120
# 12: 2018-05-13 17:00:00 42 NA
# 13: 2018-05-13 17:00:44 NA -0.04493361
# 14: 2018-05-13 18:00:00 43 NA
# 15: 2018-05-13 19:00:00 44 NA
# 16: 2018-05-13 20:00:00 45 NA
# 17: 2018-05-13 21:00:00 46 NA
# 18: 2018-05-13 22:00:00 47 NA
# 19: 2018-05-13 23:00:00 48 NA
# 20: 2018-05-14 00:00:00 49 NA
If you have the time columns set as keys
setkey(dt, time)
setkey(dt2, time)
Then you can use merge.data.table
merge(dt, dt2, all = TRUE)
Note, if the time series are already known to be sorted (which dt is, but dt2 is not), you can speed up a bit more by just setting the 'sorted' attribute of the data.tables, rather than calling setkey.
attr(dt, 'sorted') = 'time'
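A fuller sketch of that shortcut, assuming dt really is already sorted by time (dt2 is not here, so it still needs setkey()); setattr() is data.table's by-reference attribute setter:
setattr(dt, "sorted", "time")  # marks the key without re-sorting or copying
setkey(dt2, time)
merge(dt, dt2, all = TRUE)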
I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the rows of column dt so that every hour in between the very minimum and maximum date-times is present, looking like:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are most preferred.
#DavidArenburg's comment is the way to go for a vector. However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, 60 ** 2))
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
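If only the bare vector of hours is needed (no other columns to carry along), a simple base sketch (which may or may not be what the referenced comment proposed) is:
seq(min(dat$dt), max(dat$dt), by = "hour")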
I have data
dt <- data.table(beg = as.POSIXct(c("2018-01-01 01:01:00", "2018-01-01 01:05:00", "2018-01-01 01:08:00")),
                 end = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:10:00", "2018-01-01 01:10:00")))
> dt
beg end
1: 2018-01-01 01:01:00 2018-01-01 01:10:00
2: 2018-01-01 01:05:00 2018-01-01 01:10:00
3: 2018-01-01 01:08:00 2018-01-01 01:10:00
and
times <- seq(from=min(dt$beg),to=max(dt$end),by="mins")
and I would like to count, as efficiently as possible, for each time in times how many intervals in dt include that time.
I understand that
count <- NA
for(i in 1:length(times)){
count[i] <- sum(dt$beg<times[i] & dt$end>times[i])
}
would yield the solution
> data.table(times, count)
time count
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
but I am wondering whether there is a more time-efficient solution, e.g., using data.table.
This can be a solution:
times = as.data.table(times)
ans = dt[times, .(x.beg, x.end, i.x), on = .(beg < x, end > x), allow.cartesian = TRUE]
ans[, sum(!is.na(x.end)), by = .(i.x)]
i.x V1
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
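A variant sketch that names the lookup column explicitly (times_dt is a hypothetical name) and counts in a single step with by = .EACHI; it relies on the x.-prefixed columns available in a join's j, and in the result the join columns beg/end carry the queried time:
times_dt <- data.table(x = times)
dt[times_dt, on = .(beg < x, end > x),
   .(count = sum(!is.na(x.beg))),  # 0 where no interval covers the time
   by = .EACHI]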
I'm trying to aggregate two data frames (df1 and df2).
The first contains 3 variables: ID, Date1 and Date2.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID, Date3 and Value.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1 row, the sum of df2$Value that have the same ID and for which Date3 is between Date1 and Date2:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
I know how to do this with a loop, but the data frames are huge! Does someone have an efficient solution? I have been exploring data.table, plyr and dplyr but could not find one.
A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Do the comparison in j using by = .EACHI.
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df1[ df2,
{
idx = Date1 <= i.Date3 & i.Date3 <= Date2
.(Date1 = Date1[idx],
Date2 = Date2[idx],
Date3 = i.Date3,
Value = i.Value)
},
on=c("ID"),
by=.EACHI][, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlaps() join (as suggested in the comments)
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
foverlaps(df2,
df1,
by.x=c("ID", "Date3", "Date4"),
type="within")[, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
Further reading
Rolling join on data.table with duplicate keys
foverlap joins in data.table
With the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows (here dt1 and dt2 are df1 and df2 converted with setDT()):
dt2[dt1, .(sum = sum(Value)), on=.(ID, Date3>=Date1, Date3<=Date2), by=.EACHI]
# ID Date3 Date3 sum
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
The column names need some fixing; I will work on it later.
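In the meantime, a hedged sketch of that cleanup: the two i-side bound columns come back with duplicated names, so they are renamed by position here.
ans <- dt2[dt1, .(sum = sum(Value)), on = .(ID, Date3 >= Date1, Date3 <= Date2), by = .EACHI]
setnames(ans, 2:3, c("Date1", "Date2"))  # positions 2 and 3 hold the Date1/Date2 bounds from dt1
ans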
Here's a base R solution using sapply():
df1 <- data.frame(ID=c(1L,1L,2L,2L,3L,3L),Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15')));
df2 <- data.frame(ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L));
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri) sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3])));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Note that your df1 and expected output have slightly different dates in some cases; I used the df1 dates.
Here's another approach that attempts to be more vectorized: precompute a Cartesian product of indexes into the two frames, then evaluate a single vectorized conditional expression over the index vectors to get the matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:
cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
tapply(df2$Value[i2[x]],i1[x],sum);
}));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3