R: fast counting of rows that match a vector of conditions

I have data
dt <- data.table(beg=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:08:00")), end=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:10:00")))
> dt
beg end
1: 2018-01-01 01:01:00 2018-01-01 01:10:00
2: 2018-01-01 01:05:00 2018-01-01 01:10:00
3: 2018-01-01 01:08:00 2018-01-01 01:10:00
and
times <- seq(from=min(dt$beg),to=max(dt$end),by="mins")
and I would like to count, as efficiently as possible, for each time in times, how many intervals in dt include that time.
I understand that
count <- integer(length(times))
for (i in seq_along(times)) {
  count[i] <- sum(dt$beg < times[i] & dt$end > times[i])
}
would yield the solution
> data.table(times, count)
times count
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
but I am wondering whether there is a more time-efficient solution, e.g., using data.table.

This can be a solution:
times_dt <- data.table(x = times)  # name the column explicitly so it matches the on= clause
ans <- dt[times_dt, .(x.beg, x.end, i.x), on = .(beg < x, end > x), allow.cartesian = TRUE]
ans[, sum(!is.na(x.end)), by = .(i.x)]
i.x V1
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
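An alternative that scales well if the cartesian join gets too large: a time t is covered exactly when beg < t and end > t, so, assuming beg < end for every interval (end <= t then implies beg < t), the count is #(beg < t) - #(end <= t). Two binary searches on the sorted endpoints give all counts at once. A minimal sketch under that assumption (helper names begs, ends, tt are ours):
begs <- sort(as.numeric(dt$beg))
ends <- sort(as.numeric(dt$end))
tt <- as.numeric(times)
# left.open = TRUE counts endpoints strictly below tt; the default counts <= tt
count <- findInterval(tt, begs, left.open = TRUE) - findInterval(tt, ends)
data.table(times, count)
This runs in O((n + m) log n) and never materializes the pairwise matches.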
Cheers!

Related

using data.table to create sequence from starting points and increments

I would like to use data.table to repeatedly add an increment to a starting point.
library(data.table)
dat <- data.table(time=seq(from=as.POSIXct("2018-01-01 01:00:01"),to=as.POSIXct("2018-01-01 01:00:10"), by="secs"), int=c(2,3,3,1,10,10,10,10,10,10), x=2*1:10)
> dat
time int x
1: 2018-01-01 01:00:01 2 2
2: 2018-01-01 01:00:02 3 4
3: 2018-01-01 01:00:03 3 6
4: 2018-01-01 01:00:04 1 8
5: 2018-01-01 01:00:05 10 10
6: 2018-01-01 01:00:06 10 12
7: 2018-01-01 01:00:07 10 14
8: 2018-01-01 01:00:08 10 16
9: 2018-01-01 01:00:09 10 18
10: 2018-01-01 01:00:10 10 20
That is, starting in row 1, I would like to add the value of int to time, yielding a new time. I then need to add the value of int at that new time to arrive at a third time, and so on. The result would then be
> res
time int x
1: 2018-01-01 01:00:01 2 2
2: 2018-01-01 01:00:03 3 6
3: 2018-01-01 01:00:06 10 12
I would probably know how to do this in a loop, but I wonder whether data.table can handle these sorts of problems as well.
Since the values in time are continuous, my idea was to use the cumulative values of int to index, along the lines of
index <- dat[...,cumsum(...int...),...]
dat[index]
but I cannot get cumsum() to ignore the values in between the points of interest. Perhaps this can be done in the i part of data.table, but I would not know how. Does anyone have an idea?
# start with finding the next time
dat[, next.time := time + int][!dat, on = .(next.time = time), next.time := NA]
# do this in a loop for the actual problem, and stop when final column is all NA
dat[dat, on = .(next.time = time), t1 := i.next.time]
dat[dat, on = .(t1 = time), t2 := i.next.time]
dat
# time int x next.time t1 t2
# 1: 2018-01-01 01:00:01 2 2 2018-01-01 01:00:03 2018-01-01 01:00:06 <NA>
# 2: 2018-01-01 01:00:02 3 4 2018-01-01 01:00:05 <NA> <NA>
# 3: 2018-01-01 01:00:03 3 6 2018-01-01 01:00:06 <NA> <NA>
# 4: 2018-01-01 01:00:04 1 8 2018-01-01 01:00:05 <NA> <NA>
# 5: 2018-01-01 01:00:05 10 10 <NA> <NA> <NA>
# 6: 2018-01-01 01:00:06 10 12 <NA> <NA> <NA>
# 7: 2018-01-01 01:00:07 10 14 <NA> <NA> <NA>
# 8: 2018-01-01 01:00:08 10 16 <NA> <NA> <NA>
# 9: 2018-01-01 01:00:09 10 18 <NA> <NA> <NA>
#10: 2018-01-01 01:00:10 10 20 <NA> <NA> <NA>
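To avoid writing one join per hop, the chaining can be wrapped in a loop that stops once the newest column is all NA, as the comment above suggests. A sketch of that loop (same column naming pattern t1, t2, ...):
dat[, next.time := time + int][!dat, on = .(next.time = time), next.time := NA]
cur <- "next.time"
for (k in seq_len(nrow(dat))) {            # a chain can have at most nrow(dat) hops
  nxt <- paste0("t", k)
  dat[dat, on = setNames("time", cur), (nxt) := i.next.time]
  if (all(is.na(dat[[nxt]]))) break        # no row reached any further time
  cur <- nxt
}
Row 1's chain is then its own time followed by the non-NA entries of next.time, t1, t2, ... in that row, which can be joined back on time to pull int and x.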

calculations in data.table syntax [closed]

I would like to execute a self-join in data.table to obtain the periods between time intervals.
Example data
active <- data.table(id=c(1,1,1,2,2,3), no=c(1,2,3,1,2,1), beg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:30:00","2018-01-01 01:50:00","2018-01-01 01:30:00","2018-01-01 01:50:00","2018-01-01 01:50:00")), end=as.POSIXct(c("2018-01-01 01:20:00","2018-01-01 01:40:00","2018-01-01 02:00:00","2018-01-01 01:40:00","2018-01-01 02:00:00","2018-01-01 02:00:00")))
> active
id no beg end
1: 1 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2 2018-01-01 01:30:00 2018-01-01 01:40:00
3: 1 3 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 2 1 2018-01-01 01:30:00 2018-01-01 01:40:00
5: 2 2 2018-01-01 01:50:00 2018-01-01 02:00:00
6: 3 1 2018-01-01 01:50:00 2018-01-01 02:00:00
What I want is to get the inactive periods in between the active ones,
> res
id no ibeg iend
1: 1 1 2018-01-01 01:20:00 2018-01-01 01:30:00
2: 1 2 2018-01-01 01:40:00 2018-01-01 01:50:00
3: 2 1 2018-01-01 01:40:00 2018-01-01 01:50:00
but my question is more general about the calculations in the syntax: When executing
res <- active[active, .(id=x.id, ibeg=i.end, iend=x.beg), on=.(no=(no-1), id=id)]
I match on on=.(no=no-1) but obtain an error message that column [no-1] cannot be found. I tried parentheses around no-1, but to no avail. Are calculations banned from the on= argument, or is there a trick?
You can use
inactive = active[, .(no=no[-.N], ibeg=end[-.N], iend=beg[-1]), by=id]
# id no ibeg iend
# 1: 1 1 2018-01-01 01:20:00 2018-01-01 01:30:00
# 2: 1 2 2018-01-01 01:40:00 2018-01-01 01:50:00
# 3: 2 1 2018-01-01 01:40:00 2018-01-01 01:50:00
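As for the general question: on= accepts only column names, not expressions, which is why [no-1] is reported as a missing column. If you do want the self-join, the computed key has to be materialized first; a sketch with a hypothetical helper column no1:
active[, no1 := no - 1]
res <- active[active, .(id = x.id, no = x.no, ibeg = x.end, iend = i.beg),
              on = .(id, no = no1), nomatch = NULL]  # drops each id's first interval
active[, no1 := NULL]  # remove the helper again
This gives the same three rows as the by=id solution above.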

data.table in R: creating variables from x into i

I have two data tables,
a <- data.table(id=c(1,2,1,2,1,2), time=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:11:00","2018-01-01 01:11:00","2018-01-01 01:12:00","2018-01-01 01:12:00")), beg=as.POSIXct(c("2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:01:00","2018-01-01 01:05:00")), end=as.POSIXct(c("2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00")))
> a
id time beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00
which has 650m lines by 4 columns, and
b <- data.table(id=c(1,2), abeg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:11:00")), aend=as.POSIXct(c("2018-01-01 01:11:00","2018-01-01 01:12:00")))
> b
id abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:11:00 2018-01-01 01:12:00
which has about 13m lines by 7 columns.
I would like to join b into a but keep all lines and columns of a. I understand that this is a left-join and would execute it as
b[a, .(id=i.id, time=i.time, beg=i.beg, end=i.end, abeg=x.abeg, aend=x.aend), on=.(id=id, abeg<=time, aend>=time)]
to obtain
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
However, executing this on a Mac took longer than 7 hours, at which point I aborted it. The same join on a 50m-row subset of a took about 8 minutes. I would like to avoid a loop over subsets, so I wonder whether I can make the join more efficient.
For example, I suspect the assignment operator := can be used somehow. In "data.table join then add columns to existing data.frame without re-copy" it is explained how this can be done when all variables in b are kept and amended by variables from a. However, I seem to have the reverse case: I want to keep all columns in a and amend them by columns from b.
Here's a join with update by reference, which I think does what you intend to do:
a[b, on=.(id=id, time>=abeg, time<=aend), `:=`(abeg = i.abeg, aend = i.aend)]
The resulting a is then:
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
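The reason this form is so much cheaper on 650m rows is that := adds abeg and aend to a by reference, so the big table is never copied, whereas the j expression in the question builds a full new table. A quick way to convince yourself no copy happens, using data.table's address():
before <- address(a)   # memory address of a before the update join
a[b, on = .(id = id, time >= abeg, time <= aend), `:=`(abeg = i.abeg, aend = i.aend)]
identical(before, address(a))   # TRUE: same object, updated in place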

How to generate a sequence of dates and times with a specific start date/time in R

I am looking to generate or complete a column of dates and times. I have a dataframe of four numeric columns and one POSIXct time column that looks like this:
CH_1 CH_2 CH_3 CH_4 date_time
1 -10096 -11940 -9340 -9972 2018-07-24 10:45:01
2 -10088 -11964 -9348 -9960 <NA>
3 -10084 -11940 -9332 -9956 <NA>
4 -10088 -11956 -9340 -9960 <NA>
5 -10084 -11944 -9332 -9976 <NA>
6 -10076 -11940 -9340 -9948 <NA>
7 -10088 -11956 -9352 -9960 <NA>
8 -10084 -11944 -9348 -9980 <NA>
9 -10076 -11964 -9348 -9976 <NA>
10 -10076 -11956 -9348 -9964 <NA>
I would like to sequentially generate dates and times for the date_time column, increasing by 1 second until the dataframe is filled (i.e., the next date/time should be 2018-07-24 10:45:02). This is meant to be reproducible for multiple datasets; the number of rows that need to be filled is not always known, but the start date/time will always be present in that first cell.
I know that the solution is likely within seq.Date (or similar), but the problem I have is that I won't always know the end date/time, which is what most examples I have found require. Any help would be appreciated!
Here's a tidyverse solution, using Zygmunt Zawadzki's example data:
library(lubridate)
library(tidyverse)
df %>% mutate(date_time = date_time[1] + seconds(row_number()-1))
Output:
date_time
1 2018-01-01 00:00:00
2 2018-01-01 00:00:01
3 2018-01-01 00:00:02
4 2018-01-01 00:00:03
5 2018-01-01 00:00:04
6 2018-01-01 00:00:05
7 2018-01-01 00:00:06
8 2018-01-01 00:00:07
9 2018-01-01 00:00:08
10 2018-01-01 00:00:09
11 2018-01-01 00:00:10
Data:
df <- data.frame(date_time = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
No need for lubridate, just base R code:
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
startDate <- x[["date"]][1]
x[["date2"]] <- startDate + (seq_len(nrow(x)) - 1)
x
# date date2
# 1 2018-01-01 2018-01-01 00:00:00
# 2 <NA> 2018-01-01 00:00:01
# 3 <NA> 2018-01-01 00:00:02
# 4 <NA> 2018-01-01 00:00:03
# 5 <NA> 2018-01-01 00:00:04
# 6 <NA> 2018-01-01 00:00:05
# 7 <NA> 2018-01-01 00:00:06
# 8 <NA> 2018-01-01 00:00:07
# 9 <NA> 2018-01-01 00:00:08
# 10 <NA> 2018-01-01 00:00:09
# 11 <NA> 2018-01-01 00:00:10
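On the concern in the question about not knowing the end date/time: seq() for date-times does not need it, because seq.POSIXt also accepts length.out. A minimal sketch on the same data:
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA, 10)))
x[["date"]] <- seq(from = x[["date"]][1], by = "1 sec", length.out = nrow(x))
This fills exactly nrow(x) timestamps from the start value, one second apart.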

R: data.table aggregate using external grouping vector

I have data
dt <- data.table(time=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:01:00")), y=c(1,10,9))
> dt
time y
1: 2018-01-01 01:01:00 1
2: 2018-01-01 01:05:00 10
3: 2018-01-01 01:01:00 9
and I would like to aggregate by time. Usually, I would do
dt[,list(sum=sum(y),count=.N), by="time"]
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:05:00 10 1
but this time, I would also like to get zero values for the minutes in between, i.e.,
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:02:00 0 0
3: 2018-01-01 01:03:00 0 0
4: 2018-01-01 01:04:00 0 0
5: 2018-01-01 01:05:00 10 1
Could this be done, for example, using an external vector
times <- seq(from=min(dt$time),to=max(dt$time),by="mins")
that can be fed to the data.table function as a grouping variable?
You would typically do this with a join (either before or after the aggregation). For example:
dt <- dt[J(times), on = "time"]
dt[,list(sum=sum(y, na.rm = TRUE), count= sum(!is.na(y))), by=time]
# time sum count
#1: 2018-01-01 01:01:00 10 2
#2: 2018-01-01 01:02:00 0 0
#3: 2018-01-01 01:03:00 0 0
#4: 2018-01-01 01:04:00 0 0
#5: 2018-01-01 01:05:00 10 1
Or in a "piped" version:
dt[J(times), on = "time"][
, .(sum = sum(y, na.rm = TRUE), count= sum(!is.na(y))),
by = time]
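For completeness, here is the "aggregate first, then join" variant mentioned above, starting again from the original 3-row dt:
agg <- dt[, .(sum = sum(y), count = .N), by = time]
res <- agg[J(times), on = "time"]   # right join onto the full minute grid
res[is.na(count), `:=`(sum = 0, count = 0L)]
res
On three rows the difference is invisible, but when dt has many duplicate timestamps, aggregating before expanding keeps the intermediate table small.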
