using data.table to create a sequence from starting points and increments

I would like to use data.table to repeatedly add an increment to a starting point.
library(data.table)
dat <- data.table(
  time = seq(from = as.POSIXct("2018-01-01 01:00:01"),
             to   = as.POSIXct("2018-01-01 01:00:10"), by = "secs"),
  int  = c(2, 3, 3, 1, 10, 10, 10, 10, 10, 10),
  x    = 2 * 1:10
)
> dat
time int x
1: 2018-01-01 01:00:01 2 2
2: 2018-01-01 01:00:02 3 4
3: 2018-01-01 01:00:03 3 6
4: 2018-01-01 01:00:04 1 8
5: 2018-01-01 01:00:05 10 10
6: 2018-01-01 01:00:06 10 12
7: 2018-01-01 01:00:07 10 14
8: 2018-01-01 01:00:08 10 16
9: 2018-01-01 01:00:09 10 18
10: 2018-01-01 01:00:10 10 20
That is, starting in row 1, I would like to add the value of int to time, yielding a new time. At that new time, I then need to add that row's int to arrive at a third time, and so on. The result would then be
> res
time int x
1: 2018-01-01 01:00:01 2 2
2: 2018-01-01 01:00:03 3 6
3: 2018-01-01 01:00:06 10 12
I would probably know how to do this in a loop, but I wonder whether data.table can handle these sorts of problems as well.
Since the values in time form a gapless one-second sequence, my idea was to use the cumulative values of int as an index, along the lines of
index <- dat[...,cumsum(...inc...),...]
dat[index]
but I cannot get cumsum() to ignore the values in between the points of interest. Perhaps this can be done in the i part of data.table, but I would not know how. Does anyone have an idea?

# start with finding the next time
dat[, next.time := time + int][!dat, on = .(next.time = time), next.time := NA]
# do this in a loop for the actual problem, and stop when final column is all NA
dat[dat, on = .(next.time = time), t1 := i.next.time]
dat[dat, on = .(t1 = time), t2 := i.next.time]
dat
# time int x next.time t1 t2
# 1: 2018-01-01 01:00:01 2 2 2018-01-01 01:00:03 2018-01-01 01:00:06 <NA>
# 2: 2018-01-01 01:00:02 3 4 2018-01-01 01:00:05 <NA> <NA>
# 3: 2018-01-01 01:00:03 3 6 2018-01-01 01:00:06 <NA> <NA>
# 4: 2018-01-01 01:00:04 1 8 2018-01-01 01:00:05 <NA> <NA>
# 5: 2018-01-01 01:00:05 10 10 <NA> <NA> <NA>
# 6: 2018-01-01 01:00:06 10 12 <NA> <NA> <NA>
# 7: 2018-01-01 01:00:07 10 14 <NA> <NA> <NA>
# 8: 2018-01-01 01:00:08 10 16 <NA> <NA> <NA>
# 9: 2018-01-01 01:00:09 10 18 <NA> <NA> <NA>
#10: 2018-01-01 01:00:10 10 20 <NA> <NA> <NA>
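
To run the suggested loop programmatically, something along these lines should work (a sketch, assuming next.time has been created as above; the t1, t2, ... column names are illustrative):

prev <- "next.time"
for (i in seq_len(nrow(dat))) {
  new <- paste0("t", i)
  # self-join: for each row, look up the row at its chained time and fetch that row's next.time
  dat[dat, on = setNames("time", prev), (new) := i.next.time]
  if (dat[, all(is.na(get(new)))]) break  # stop when the final column is all NA
  prev <- new
}

Alternatively, because the example's times form a gapless one-second grid, int seconds ahead is also int rows ahead, so the chain starting in row 1 can be collected with plain row indexing (repeatedly idx <- idx + dat$int[idx]) and then subset with dat[collected_rows].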

Related

aggregate per variable over which sums are calculated in R data.table

I have some time data
library(data.table); library(lubridate); set.seed(42)
dat <- rbind(
  data.table(time = as.POSIXct("2019-01-01 08:00:00") + round(runif(10, 60, 1e4)),
             val = runif(10), group = 1)[order(time), id := seq_len(.N)],
  data.table(time = as.POSIXct("2019-02-01 18:00:00") + round(runif(10, 60, 1e4)),
             val = runif(10), group = 2)[order(time), id := seq_len(.N)]
)
> dat[order(group,id)]
time val group id
1: 2019-01-01 08:23:19 0.117487362 1 1
2: 2019-01-01 08:48:24 0.934672247 1 2
3: 2019-01-01 09:27:00 0.940014523 1 3
4: 2019-01-01 09:47:19 0.462292823 1 4
5: 2019-01-01 09:49:51 0.474997082 1 5
6: 2019-01-01 09:57:48 0.560332746 1 6
7: 2019-01-01 10:03:02 0.978226428 1 7
8: 2019-01-01 10:18:35 0.255428824 1 8
9: 2019-01-01 10:32:33 0.457741776 1 9
10: 2019-01-01 10:36:15 0.719112252 1 10
11: 2019-02-01 18:14:39 0.003948339 2 1
12: 2019-02-01 18:23:59 0.811055141 2 2
13: 2019-02-01 19:05:39 0.007334147 2 3
14: 2019-02-01 19:15:03 0.906601408 2 4
15: 2019-02-01 19:26:11 0.832916080 2 5
16: 2019-02-01 20:19:30 0.611778643 2 6
17: 2019-02-01 20:30:46 0.737595618 2 7
18: 2019-02-01 20:31:03 0.207658973 2 8
19: 2019-02-01 20:37:50 0.685169729 2 9
20: 2019-02-01 20:44:50 0.388108283 2 10
and I would like to calculate the sum of val during the following hour for each value of time. For example, for ID 1, this would be the sum of val for IDs 1 and 2 (because time for ID 3 is more than one hour after ID 1), for ID 2, the sum of val for IDs 2 to 4, and so forth. This yields the desired output (for group 1 only)
> res
time val id new1 new2
1: 2019-01-01 08:23:19 0.1174874 1 1.052160 1.052160
2: 2019-01-01 08:48:24 0.9346722 2 2.336979 2.336979
3: 2019-01-01 09:27:00 0.9400145 3 3.671292 3.671292
4: 2019-01-01 09:47:19 0.4622928 4 3.908132 3.908132
5: 2019-01-01 09:49:51 0.4749971 5 3.445839 NA
6: 2019-01-01 09:57:48 0.5603327 6 2.970842 NA
7: 2019-01-01 10:03:02 0.9782264 7 2.410509 NA
8: 2019-01-01 10:18:35 0.2554288 8 1.432283 NA
9: 2019-01-01 10:32:33 0.4577418 9 1.176854 NA
10: 2019-01-01 10:36:15 0.7191123 10 0.7191123 NA
where two behaviors at the end of the sequence are possible:
the sequence is treated as is, so trailing windows are simply truncated;
sums are only calculated as long as there is an id with a time at least an hour later, and all others are set to NA (preferred).
I suspect that solving this requires subsetting within j, but this is a kind of problem I frequently run into and cannot solve; I have not yet understood the general approach to it.
This can be done with a loop and a join:
dat1 <- dat[order(id)]
out <- rbindlist(lapply(dat1$id, function(i) {
  d1 <- dat1[seq_len(.N) >= match(i, id)]
  d1[d1[, .(time = time %m+% hours(1))],
     .(time1 = time, val, new1 = sum(val)),
     on = .(time <= time), by = .EACHI][1]
}))[, time := NULL][]
setnames(out, 1, "time")
out[time < time[2] %m+% hours(1), new2 := new1]
out
# time val new1 new2
# 1: 2019-01-01 08:23:19 0.1174874 1.0521596 1.052160
# 2: 2019-01-01 08:48:24 0.9346722 2.3369796 2.336980
# 3: 2019-01-01 09:27:00 0.9400145 3.6712924 3.671292
# 4: 2019-01-01 09:47:19 0.4622928 3.9081319 3.908132
# 5: 2019-01-01 09:49:51 0.4749971 3.4458391 NA
# 6: 2019-01-01 09:57:48 0.5603327 2.9708420 NA
# 7: 2019-01-01 10:03:02 0.9782264 2.4105093 NA
# 8: 2019-01-01 10:18:35 0.2554288 1.4322829 NA
# 9: 2019-01-01 10:32:33 0.4577418 1.1768540 NA
#10: 2019-01-01 10:36:15 0.7191123 0.7191123 NA
Update
For the new data, we can split by group and apply the same method:
f1 <- function(data) {
  lst1 <- split(data, data[["group"]])
  rbindlist(lapply(lst1, function(.dat) {
    out <- rbindlist(lapply(.dat$id, function(i) {
      d1 <- .dat[seq_len(.N) >= match(i, id)]
      d1[d1[, .(time = time %m+% hours(1))],
         .(time1 = time, val, new1 = sum(val)),
         on = .(time <= time), by = .EACHI][1]
    }))[, time := NULL][]
    setnames(out, 1, "time")
    out[time[.N] - time > hours(1), new2 := new1][]
  }))
}
f1(dat1)
# time val new1 new2
#1: 2019-01-01 08:23:19 0.117487362 1.0521596 1.0521596
#2: 2019-01-01 08:48:24 0.934672247 2.3369796 2.3369796
#3: 2019-01-01 09:27:00 0.940014523 3.6712924 3.6712924
#4: 2019-01-01 09:47:19 0.462292823 3.9081319 3.9081319
#5: 2019-01-01 09:49:51 0.474997082 3.4458391 NA
#6: 2019-01-01 09:57:48 0.560332746 2.9708420 NA
#7: 2019-01-01 10:03:02 0.978226428 2.4105093 NA
#8: 2019-01-01 10:18:35 0.255428824 1.4322829 NA
#9: 2019-01-01 10:32:33 0.457741776 1.1768540 NA
#10: 2019-01-01 10:36:15 0.719112252 0.7191123 NA
#11: 2019-02-01 18:14:39 0.003948339 0.8223376 0.8223376
#12: 2019-02-01 18:23:59 0.811055141 1.7249907 1.7249907
#13: 2019-02-01 19:05:39 0.007334147 1.7468516 1.7468516
#14: 2019-02-01 19:15:03 0.906601408 1.7395175 1.7395175
#15: 2019-02-01 19:26:11 0.832916080 1.4446947 NA
#16: 2019-02-01 20:19:30 0.611778643 2.6303112 NA
#17: 2019-02-01 20:30:46 0.737595618 2.0185326 NA
#18: 2019-02-01 20:31:03 0.207658973 1.2809370 NA
#19: 2019-02-01 20:37:50 0.685169729 1.0732780 NA
#20: 2019-02-01 20:44:50 0.388108283 0.3881083 NA
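
For larger data, a single non-equi self-join per group may be faster than a join per id. A sketch (the upper helper column and the plain + 3600 in place of %m+% hours(1) are my additions, not part of the answer above):

dat[, upper := time + 3600]  # end of each row's one-hour window
ans <- dat[dat, on = .(group, time >= time, time <= upper),
           .(new1 = sum(val)), by = .EACHI]  # one sum per row of the inner dat
dat[, new1 := ans$new1][, upper := NULL]

With by = .EACHI the result keeps one row per row of the inner dat, in its original order, so new1 lines up with the original rows.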

creating a unique variable based on row differences of another variable considering groups

Using the data below, I want to create a new unique customer id based on each customer's contact dates.
Rule: a customer keeps the same id as long as each following contact date is within two days of the previous one; whenever more than two days pass between contacts, the customer gets a new unique id.
I couldn't go any further than calculating date differences.
The original dataset I work is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
threshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
"2019-01-08","2019-01-09","2019-02-02","2019-02-05",
"2019-02-05","2019-02-09","2019-02-12","2019-02-01",
"2019-02-01","2019-02-05")),
desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
class = "data.frame",
row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days := contact_date - shift(contact_date, type = "lag"), by = customer_id]
dt[, date_diff_in_days:=as.numeric(date_diff_in_days)]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: a customer keeps the same id as long as each following contact date is within two days of the previous one; whenever more than two days pass between contacts, the customer gets a new unique id.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check it on your actual data (comparing against results from, e.g., Gregor's approach) to be sure.
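
To see the run counter that the second by= expression builds (an illustration I added, not part of the answer):

dt[, .(customer_id, contact_date,
       run = cumsum(contact_date - shift(contact_date, fill = first(contact_date)) > thresh))]

run increments at every gap of more than thresh days, including across customer boundaries; pairing it with customer_id in by= is what makes the resulting groups customer-specific.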
Another option: use cumsum to increment whenever date_diff_in_days is NA (which marks each customer's first row) or the threshold is exceeded.
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > threshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
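
As with the first approach, a quick check against the example (my addition):

dt[, any(result != desired_output)]
# [1] FALSE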

How to generate a sequence of dates and times with a specific start date/time in R

I am looking to generate or complete a column of dates and times. I have a dataframe of four numeric columns and one POSIXct time column that looks like this:
CH_1 CH_2 CH_3 CH_4 date_time
1 -10096 -11940 -9340 -9972 2018-07-24 10:45:01
2 -10088 -11964 -9348 -9960 <NA>
3 -10084 -11940 -9332 -9956 <NA>
4 -10088 -11956 -9340 -9960 <NA>
5 -10084 -11944 -9332 -9976 <NA>
6 -10076 -11940 -9340 -9948 <NA>
7 -10088 -11956 -9352 -9960 <NA>
8 -10084 -11944 -9348 -9980 <NA>
9 -10076 -11964 -9348 -9976 <NA>
10 -10076 -11956 -9348 -9964 <NA>
I would like to sequentially generate dates and times for the date_time column, increasing by 1 second until the dataframe is filled. (i.e. the next date/time should be 2018-07-24 10:45:02). This is meant to be reproducible for multiple datasets and the number of rows that need filled is not always known, but the start date/time will always be present in that first cell.
I know that the solution is likely within seq.Date (or similar), but the problem I have is that I won't always know the end date/time, which is what most examples I have found require. Any help would be appreciated!
Here's a tidyverse solution, using Zygmunt Zawadzki's example data:
library(lubridate)
library(tidyverse)
df %>% mutate(date_time = date_time[1] + seconds(row_number()-1))
Output:
date_time
1 2018-01-01 00:00:00
2 2018-01-01 00:00:01
3 2018-01-01 00:00:02
4 2018-01-01 00:00:03
5 2018-01-01 00:00:04
6 2018-01-01 00:00:05
7 2018-01-01 00:00:06
8 2018-01-01 00:00:07
9 2018-01-01 00:00:08
10 2018-01-01 00:00:09
11 2018-01-01 00:00:10
Data:
df <- data.frame(date_time = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
No need for lubridate, just base R code:
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
startDate <- x[["date"]][1]
x[["date2"]] <- startDate + (seq_len(nrow(x)) - 1)
x
# date date2
# 1 2018-01-01 2018-01-01 00:00:00
# 2 <NA> 2018-01-01 00:00:01
# 3 <NA> 2018-01-01 00:00:02
# 4 <NA> 2018-01-01 00:00:03
# 5 <NA> 2018-01-01 00:00:04
# 6 <NA> 2018-01-01 00:00:05
# 7 <NA> 2018-01-01 00:00:06
# 8 <NA> 2018-01-01 00:00:07
# 9 <NA> 2018-01-01 00:00:08
# 10 <NA> 2018-01-01 00:00:09
# 11 <NA> 2018-01-01 00:00:10
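
Since the rest of this page is data.table-centric, here is the same fill in data.table (my sketch, not from either answer):

library(data.table)
setDT(df)[, date_time := date_time[1] + seq_len(.N) - 1]  # adds 0, 1, 2, ... seconds to the start time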

R: fast counting of rows that match a vector of conditions

I have data
dt <- data.table(
  beg = as.POSIXct(c("2018-01-01 01:01:00", "2018-01-01 01:05:00", "2018-01-01 01:08:00")),
  end = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:10:00", "2018-01-01 01:10:00"))
)
> dt
beg end
1: 2018-01-01 01:01:00 2018-01-01 01:10:00
2: 2018-01-01 01:05:00 2018-01-01 01:10:00
3: 2018-01-01 01:08:00 2018-01-01 01:10:00
and
times <- seq(from=min(dt$beg),to=max(dt$end),by="mins")
and I would like to count, for each time in times, how many intervals in dt contain that time, as efficiently as possible.
I understand that
count <- NA
for (i in seq_along(times)) {
  count[i] <- sum(dt$beg < times[i] & dt$end > times[i])
}
would yield the solution
> data.table(times, count)
times count
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
but I am wondering whether there is a more time-efficient solution, e.g., using data.table.
This could be one solution:
tms <- data.table(x = times)  # name the column explicitly so the join below can refer to x
ans <- dt[tms, .(x.beg, x.end, i.x),
          on = .(beg < x, end > x), allow.cartesian = TRUE]
ans[, sum(!is.na(x.end)), by = .(i.x)]
i.x V1
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
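If the join becomes a bottleneck, sorted event counting is another option: a time t lies inside #(beg < t) - #(end <= t) intervals. A base-R sketch (my addition, not part of the answer above; times is the original POSIXct vector from the question):

# left.open = TRUE makes findInterval count elements strictly less than each time
count2 <- findInterval(times, sort(dt$beg), left.open = TRUE) -
          findInterval(times, sort(dt$end))
data.table(times, count2)

This is O((n + m) log n) for n intervals and m query times, with no cartesian intermediate.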

Remove rows after a certain date based on a condition in R

There are similar questions I've seen, but none of them applies the condition to specific rows of a data.table or data.frame; rather, they apply it to the whole table.
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
dt <- data.table(
  ID = seq(1, 10, 1),
  OA = c(1, 0, 0, 1, 0, 0, 0, 1, 1, 0),
  oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005",
                      "01/01/2001", "01/01/2001", "01/01/2001", "05/06/2010",
                      "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
  stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005",
                        "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010",
                        "01/01/2011", "31/12/2011"), "%d/%m/%Y")
)
dt$oa.date[dt$OA==0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2007-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
Here is what I have tried:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.
This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0) which is coerced to logical (TRUE/FALSE) in the i-expression.
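A note on the NA dates (my illustration, not part of the original answer): rows with OA == 0 have oa.date == NA, yet they are kept, because FALSE & NA evaluates to FALSE:

FALSE & NA
# [1] FALSE

so !(OA & oa.date < start) is TRUE for every OA == 0 row, regardless of the missing date.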
You can try
dt <- dt[dt$OA == 0 | (dt$OA == 1 & !(dt$oa.date < start)), ]
