data.table in R: creating variables from x into i

I have two data tables,
a <- data.table(id=c(1,2,1,2,1,2), time=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:11:00","2018-01-01 01:11:00","2018-01-01 01:12:00","2018-01-01 01:12:00")), beg=as.POSIXct(c("2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:01:00","2018-01-01 01:05:00")), end=as.POSIXct(c("2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00")))
> a
id time beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00
which has 650 million rows and 4 columns, and
b <- data.table(id=c(1,2), abeg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:11:00")), aend=as.POSIXct(c("2018-01-01 01:11:00","2018-01-01 01:12:00")))
> b
id abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:11:00 2018-01-01 01:12:00
which has about 13 million rows and 7 columns.
I would like to join b into a but keep all rows and columns of a. I understand that this is a left join and would execute it as
b[a, .(id=i.id, time=i.time, beg=i.beg, end=i.end, abeg=x.abeg, aend=x.aend), on=.(id=id, abeg<=time, aend>=time)]
to obtain
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
However, running this on a Mac took more than 7 hours, at which point I had to abort. Joining on a 50-million-row subset of a took about 8 minutes. I would like to avoid a loop over subsets, so I wonder whether I can make the join more efficient.
For example, I suspect the assignment operator := can be used somehow. The question "data.table join then add columns to existing data.frame without re-copy" explains how this can be done when all variables in b are kept and amended by variables from a. However, I seem to have the reverse case: I want to keep all columns in a and add columns from b.

Here's a join with update by reference, which I think does what you intend:
a[b, on=.(id=id, time>=abeg, time<=aend), `:=`(abeg = i.abeg, aend = i.aend)]
The resulting a is then:
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
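As a quick sanity check, the sketch below runs both approaches on the six-row example and compares the results. It assumes data.table is installed and fixes tz = "UTC" for reproducibility (the question does not state a time zone); the names at, a_left and a_upd are made up for this illustration. One caveat: if a row of a matched several rows of b, the left join would return one row per match while the update join keeps a single match per row, so the two can only agree when each row of a matches at most one row of b.

```r
library(data.table)

# Helper (hypothetical): build timestamps on 2018-01-01; tz = "UTC" is an assumption
at <- function(x) as.POSIXct(paste("2018-01-01", x), tz = "UTC")

a <- data.table(
  id   = c(1, 2, 1, 2, 1, 2),
  time = at(c("01:10:00", "01:10:00", "01:11:00", "01:11:00", "01:12:00", "01:12:00")),
  beg  = at(c("01:00:00", "01:05:00", "01:00:00", "01:05:00", "01:01:00", "01:05:00")),
  end  = at(c("02:00:00", "02:05:00", "02:00:00", "02:05:00", "02:00:00", "02:05:00"))
)
b <- data.table(
  id   = c(1, 2),
  abeg = at(c("01:10:00", "01:11:00")),
  aend = at(c("01:11:00", "01:12:00"))
)

# Left join from the question: builds a completely new table
a_left <- b[a, .(id = i.id, time = i.time, beg = i.beg, end = i.end,
                 abeg = x.abeg, aend = x.aend),
            on = .(id = id, abeg <= time, aend >= time)]

# Update join from the answer, applied to a copy so that a itself stays untouched
a_upd <- copy(a)
a_upd[b, on = .(id = id, time >= abeg, time <= aend),
      `:=`(abeg = i.abeg, aend = i.aend)]

# Compare the two results column by column
same <- isTRUE(all.equal(a_left, a_upd, check.attributes = FALSE))
print(same)
```

On the full data the copy() would of course be dropped, since avoiding the copy is the point of updating by reference.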

Related

calculations in data.table syntax [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 4 years ago.
I would like to execute a self-join in data.table, to obtain the periods between time intervals.
Example data
active <- data.table(id=c(1,1,1,2,2,3), no=c(1,2,3,1,2,1), beg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:30:00","2018-01-01 01:50:00","2018-01-01 01:30:00","2018-01-01 01:50:00","2018-01-01 01:50:00")), end=as.POSIXct(c("2018-01-01 01:20:00","2018-01-01 01:40:00","2018-01-01 02:00:00","2018-01-01 01:40:00","2018-01-01 02:00:00","2018-01-01 02:00:00")))
> active
id no beg end
1: 1 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2 2018-01-01 01:30:00 2018-01-01 01:40:00
3: 1 3 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 2 1 2018-01-01 01:30:00 2018-01-01 01:40:00
5: 2 2 2018-01-01 01:50:00 2018-01-01 02:00:00
6: 3 1 2018-01-01 01:50:00 2018-01-01 02:00:00
What I want is to get the inactive periods between the active ones,
> res
id no ibeg iend
1: 1 1 2018-01-01 01:20:00 2018-01-01 01:30:00
2: 1 2 2018-01-01 01:40:00 2018-01-01 01:50:00
3: 2 1 2018-01-01 01:40:00 2018-01-01 01:50:00
but my question is more general about the calculations in the syntax: When executing
res <- active[active, .(id=x.id, ibeg=i.end, iend=x.beg), on=.(no=(no-1), id=id)]
I match on on=.(no=no-1) but get an error message that column [no-1] cannot be found. I tried parentheses around no-1, but to no avail. Are calculations banned from the on= argument, or is there a trick?
You can use
inactive = active[, .(no=no[-.N], ibeg=end[-.N], iend=beg[-1]), by=id]
# id no ibeg iend
# 1: 1 1 2018-01-01 01:20:00 2018-01-01 01:30:00
# 2: 1 2 2018-01-01 01:40:00 2018-01-01 01:50:00
# 3: 2 1 2018-01-01 01:40:00 2018-01-01 01:50:00
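On the more general question: on= matches column names rather than evaluating expressions, which is why no-1 is looked up as a column. One possible workaround, sketched below, is to materialize the shifted value as a helper column and join on that. The helper at and the column name no_prev are made up for this illustration, and tz = "UTC" is an assumption.

```r
library(data.table)

# Helper (hypothetical): build timestamps on 2018-01-01; tz = "UTC" is an assumption
at <- function(x) as.POSIXct(paste("2018-01-01", x), tz = "UTC")

active <- data.table(
  id  = c(1, 1, 1, 2, 2, 3),
  no  = c(1, 2, 3, 1, 2, 1),
  beg = at(c("01:10:00", "01:30:00", "01:50:00", "01:30:00", "01:50:00", "01:50:00")),
  end = at(c("01:20:00", "01:40:00", "02:00:00", "01:40:00", "02:00:00", "02:00:00"))
)

# Helper column (hypothetical name): the number of the preceding interval
active[, no_prev := no - 1]

# x is the preceding interval, i is the current one.
# nomatch = NULL drops the first interval of each id, which has no predecessor
# (older data.table versions use nomatch = 0L instead).
res <- active[active,
              .(id = x.id, no = x.no, ibeg = x.end, iend = i.beg),
              on = .(id = id, no = no_prev),
              nomatch = NULL]

active[, no_prev := NULL]  # remove the helper column again
print(res)
```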

How to generate a sequence of dates and times with a specific start date/time in R

I am looking to generate or complete a column of dates and times. I have a dataframe of four numeric columns and one POSIXct time column that looks like this:
CH_1 CH_2 CH_3 CH_4 date_time
1 -10096 -11940 -9340 -9972 2018-07-24 10:45:01
2 -10088 -11964 -9348 -9960 <NA>
3 -10084 -11940 -9332 -9956 <NA>
4 -10088 -11956 -9340 -9960 <NA>
5 -10084 -11944 -9332 -9976 <NA>
6 -10076 -11940 -9340 -9948 <NA>
7 -10088 -11956 -9352 -9960 <NA>
8 -10084 -11944 -9348 -9980 <NA>
9 -10076 -11964 -9348 -9976 <NA>
10 -10076 -11956 -9348 -9964 <NA>
I would like to sequentially generate dates and times for the date_time column, increasing by 1 second until the dataframe is filled (i.e. the next date/time should be 2018-07-24 10:45:02). This is meant to be reproducible for multiple datasets, and the number of rows that need to be filled is not always known, but the start date/time will always be present in the first cell.
I know that the solution is likely within seq.Date (or similar), but the problem I have is that I won't always know the end date/time, which is what most examples I have found require. Any help would be appreciated!
Here's a tidyverse solution, using Zygmunt Zawadzki's example data:
library(lubridate)
library(tidyverse)
df %>% mutate(date_time = date_time[1] + seconds(row_number()-1))
Output:
date_time
1 2018-01-01 00:00:00
2 2018-01-01 00:00:01
3 2018-01-01 00:00:02
4 2018-01-01 00:00:03
5 2018-01-01 00:00:04
6 2018-01-01 00:00:05
7 2018-01-01 00:00:06
8 2018-01-01 00:00:07
9 2018-01-01 00:00:08
10 2018-01-01 00:00:09
11 2018-01-01 00:00:10
Data:
df <- data.frame(date_time = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
No need for lubridate; base R is enough:
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
startDate <- x[["date"]][1]
x[["date2"]] <- startDate + (seq_len(nrow(x)) - 1)
x
# date date2
# 1 2018-01-01 2018-01-01 00:00:00
# 2 <NA> 2018-01-01 00:00:01
# 3 <NA> 2018-01-01 00:00:02
# 4 <NA> 2018-01-01 00:00:03
# 5 <NA> 2018-01-01 00:00:04
# 6 <NA> 2018-01-01 00:00:05
# 7 <NA> 2018-01-01 00:00:06
# 8 <NA> 2018-01-01 00:00:07
# 9 <NA> 2018-01-01 00:00:08
# 10 <NA> 2018-01-01 00:00:09
# 11 <NA> 2018-01-01 00:00:10
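Since the question mentions seq and the unknown end time, here is a minimal base R sketch that uses the length.out argument of seq() in place of an end point. The data frame is a shortened stand-in for the one in the question, and tz = "UTC" is an assumption.

```r
# Shortened stand-in for the question's data frame; tz = "UTC" is an assumption
df <- data.frame(
  CH_1 = c(-10096, -10088, -10084, -10088),
  date_time = as.POSIXct(c("2018-07-24 10:45:01", NA, NA, NA), tz = "UTC")
)

# length.out replaces the unknown end time: one value per row, 1 second apart
df$date_time <- seq(from = df$date_time[1], by = "1 sec", length.out = nrow(df))
print(df)
```

Because the length comes from nrow(df), the same line works unchanged for data frames of any size, as long as the first cell holds the start time.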

Expand all time steps between two dates

I have a sequence of events with start and end dates:
library(lubridate)
library(tibble)  # for tibble()
df<-tibble(StartDate=ymd_hm(c("2018-01-01 00:10","2018-01-02 00:20","2018-01-05 08:20"),tz="EET"),
EndDate=ymd_hm(c("2018-01-01 00:10","2018-01-02 01:30","2018-01-05 08:30"),tz="EET"),
Event=c("Event1","Event2","Event3"))
For each event I would like to have all 10 min occurrences. I can do this with loops and lists:
DateTime=list()
Event=list()
for (i in 1:nrow(df)){
DateTime[[i]]<-seq(df$StartDate[i],df$EndDate[i],by="10 min")
Event[[i]]<-rep(df$Event[i],times=length(DateTime[[i]]))
}
result<-tibble(DateTime=do.call("c",DateTime),Event=do.call("c",Event))
Desired output:
> result
# A tibble: 11 x 2
DateTime Event
<dttm> <chr>
1 2018-01-01 00:10:00 Event1
2 2018-01-02 00:20:00 Event2
3 2018-01-02 00:30:00 Event2
4 2018-01-02 00:40:00 Event2
5 2018-01-02 00:50:00 Event2
6 2018-01-02 01:00:00 Event2
7 2018-01-02 01:10:00 Event2
8 2018-01-02 01:20:00 Event2
9 2018-01-02 01:30:00 Event2
10 2018-01-05 08:20:00 Event3
11 2018-01-05 08:30:00 Event3
But I am looking for a more elegant way, perhaps using tidyverse functions.
Please note that you might need to replace "EET" with your system time zone for the example to be fully reproducible.
Thanks
An option would be to use map2 to get the sequence between corresponding elements of 'StartDate' and 'EndDate', and then unnest the result:
library(tidyverse)
df %>%
transmute(DateTime = map2(StartDate, EndDate, seq, by = "10 min"),
Event) %>%
unnest(cols = DateTime) %>%  # tidyr >= 1.0.0 expects the list column to be named
select(DateTime, Event)
# A tibble: 11 x 2
# DateTime Event
# <dttm> <chr>
# 1 2018-01-01 00:10:00 Event1
# 2 2018-01-02 00:20:00 Event2
# 3 2018-01-02 00:30:00 Event2
# 4 2018-01-02 00:40:00 Event2
# 5 2018-01-02 00:50:00 Event2
# 6 2018-01-02 01:00:00 Event2
# 7 2018-01-02 01:10:00 Event2
# 8 2018-01-02 01:20:00 Event2
# 9 2018-01-02 01:30:00 Event2
#10 2018-01-05 08:20:00 Event3
#11 2018-01-05 08:30:00 Event3
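For comparison, the same expansion can be sketched with data.table by grouping on the event. This is an alternative under assumptions, not part of the answer above: it assumes every Event value is unique (otherwise a row index would be needed as the grouping variable), uses tz = "UTC" instead of "EET", and the table name events is made up.

```r
library(data.table)

# Same events as in the question; tz = "UTC" is an assumption
events <- data.table(
  StartDate = as.POSIXct(c("2018-01-01 00:10", "2018-01-02 00:20", "2018-01-05 08:20"), tz = "UTC"),
  EndDate   = as.POSIXct(c("2018-01-01 00:10", "2018-01-02 01:30", "2018-01-05 08:30"), tz = "UTC"),
  Event     = c("Event1", "Event2", "Event3")
)

# One group per event, so StartDate and EndDate are single values inside each group
result <- events[, .(DateTime = seq(StartDate, EndDate, by = "10 min")), by = Event]

# Put the columns in the order of the desired output
setcolorder(result, c("DateTime", "Event"))
print(result)
```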

R: data.table aggregate using external grouping vector

I have data
dt <- data.table(time=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:01:00")), y=c(1,10,9))
> dt
time y
1: 2018-01-01 01:01:00 1
2: 2018-01-01 01:05:00 10
3: 2018-01-01 01:01:00 9
and I would like to aggregate by time. Usually, I would do
dt[,list(sum=sum(y),count=.N), by="time"]
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:05:00 10 1
but this time, I would also like to get zero values for the minutes in between, i.e.,
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:02:00 0 0
3: 2018-01-01 01:03:00 0 0
4: 2018-01-01 01:04:00 0 0
5: 2018-01-01 01:05:00 10 1
Could this be done, for example, using an external vector
times <- seq(from=min(dt$time),to=max(dt$time),by="mins")
that can be fed to the data.table function as a grouping variable?
You would typically do this with a join (either before or after the aggregation). For example:
dt <- dt[J(times), on = "time"]
dt[,list(sum=sum(y, na.rm = TRUE), count= sum(!is.na(y))), by=time]
# time sum count
#1: 2018-01-01 01:01:00 10 2
#2: 2018-01-01 01:02:00 0 0
#3: 2018-01-01 01:03:00 0 0
#4: 2018-01-01 01:04:00 0 0
#5: 2018-01-01 01:05:00 10 1
Or in a "piped" version:
dt[J(times), on = "time"][
, .(sum = sum(y, na.rm = TRUE), count= sum(!is.na(y))),
by = time]
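The other order mentioned above, aggregating first and joining afterwards, could look roughly like the sketch below. It keeps .N as the count and fills the missing minutes explicitly. The names at, agg and res are made up for this illustration, and tz = "UTC" is an assumption.

```r
library(data.table)

# Helper (hypothetical): build timestamps on 2018-01-01; tz = "UTC" is an assumption
at <- function(x) as.POSIXct(paste("2018-01-01", x), tz = "UTC")

dt <- data.table(time = at(c("01:01:00", "01:05:00", "01:01:00")), y = c(1, 10, 9))
times <- seq(from = min(dt$time), to = max(dt$time), by = "mins")

# Aggregate the observed minutes first
agg <- dt[, .(sum = sum(y), count = .N), by = time]

# Then join onto the full grid of minutes; minutes without data come back as NA
res <- agg[data.table(time = times), on = "time"]

# Replace the NAs of the unobserved minutes with zeros
res[is.na(count), `:=`(sum = 0, count = 0L)]
print(res)
```

Aggregating first keeps the joined table at one row per minute, which may matter when dt is much larger than the grid.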

R: fast counting of rows that match vector of conditional

I have data
dt <- data.table(beg=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:08:00")), end=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:10:00")))
> dt
beg end
1: 2018-01-01 01:01:00 2018-01-01 01:10:00
2: 2018-01-01 01:05:00 2018-01-01 01:10:00
3: 2018-01-01 01:08:00 2018-01-01 01:10:00
and
times <- seq(from=min(dt$beg),to=max(dt$end),by="mins")
and I would like to count, as efficiently as possible, for each time in times how many intervals in dt include the time.
I understand that
count <- NA
for(i in 1:length(times)){
count[i] <- sum(dt$beg<times[i] & dt$end>times[i])
}
would yield the solution
> data.table(times, count)
time count
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
but I am wondering whether there is a more time-efficient solution, e.g., using data.table.
One possible solution:
times = as.data.table(times)
ans = dt[times, .(x.beg, x.end, i.x),on = .(beg < x , end > x),allow.cartesian = TRUE]
ans[,sum(!is.na(x.end)), by = .(i.x)]
i.x V1
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
Cheers!
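If only the counts are needed, a variant with by = .EACHI may avoid building the intermediate table of all matching pairs. The sketch below is an assumption-laden illustration rather than a benchmarked improvement: the names at, grid and cnt are made up, tz = "UTC" is assumed, and the grid column is named x explicitly instead of relying on how as.data.table names it.

```r
library(data.table)

# Helper (hypothetical): build timestamps on 2018-01-01; tz = "UTC" is an assumption
at <- function(x) as.POSIXct(paste("2018-01-01", x), tz = "UTC")

dt <- data.table(
  beg = at(c("01:01:00", "01:05:00", "01:08:00")),
  end = at(c("01:10:00", "01:10:00", "01:10:00"))
)
grid <- data.table(x = seq(from = min(dt$beg), to = max(dt$end), by = "mins"))

# by = .EACHI evaluates j once per row of grid, so matches are counted
# group by group. x.beg is NA when no interval matches, giving a count of 0.
cnt <- dt[grid, on = .(beg < x, end > x),
          .(count = sum(!is.na(x.beg))),
          by = .EACHI, allow.cartesian = TRUE]

res <- data.table(times = grid$x, count = cnt$count)
print(res)
```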
