Aggregate per variable over which sums are calculated in R data.table

I have some time data:
library(data.table); library(lubridate); set.seed(42)
dat <- rbind(
  data.table(time = as.POSIXct("2019-01-01 08:00:00") + round(runif(10, 60, 1e4)),
             val = runif(10), group = 1)[order(time), id := seq_len(.N)],
  data.table(time = as.POSIXct("2019-02-01 18:00:00") + round(runif(10, 60, 1e4)),
             val = runif(10), group = 2)[order(time), id := seq_len(.N)])
> dat[order(group,id)]
time val group id
1: 2019-01-01 08:23:19 0.117487362 1 1
2: 2019-01-01 08:48:24 0.934672247 1 2
3: 2019-01-01 09:27:00 0.940014523 1 3
4: 2019-01-01 09:47:19 0.462292823 1 4
5: 2019-01-01 09:49:51 0.474997082 1 5
6: 2019-01-01 09:57:48 0.560332746 1 6
7: 2019-01-01 10:03:02 0.978226428 1 7
8: 2019-01-01 10:18:35 0.255428824 1 8
9: 2019-01-01 10:32:33 0.457741776 1 9
10: 2019-01-01 10:36:15 0.719112252 1 10
11: 2019-02-01 18:14:39 0.003948339 2 1
12: 2019-02-01 18:23:59 0.811055141 2 2
13: 2019-02-01 19:05:39 0.007334147 2 3
14: 2019-02-01 19:15:03 0.906601408 2 4
15: 2019-02-01 19:26:11 0.832916080 2 5
16: 2019-02-01 20:19:30 0.611778643 2 6
17: 2019-02-01 20:30:46 0.737595618 2 7
18: 2019-02-01 20:31:03 0.207658973 2 8
19: 2019-02-01 20:37:50 0.685169729 2 9
20: 2019-02-01 20:44:50 0.388108283 2 10
and I would like to calculate the sum of val during the following hour for each value of time. For example, for ID 1, this would be the sum of val for IDs 1 and 2 (because the time for ID 3 is more than one hour after ID 1); for ID 2, it would be the sum of val for IDs 2 to 4; and so forth. This yields the desired output (for group 1 only):
> res
time val id new1 new2
1: 2019-01-01 08:23:19 0.1174874 1 1.052160 1.052160
2: 2019-01-01 08:48:24 0.9346722 2 2.336979 2.336979
3: 2019-01-01 09:27:00 0.9400145 3 3.671292 3.671292
4: 2019-01-01 09:47:19 0.4622928 4 3.908132 3.908132
5: 2019-01-01 09:49:51 0.4749971 5 3.445839 NA
6: 2019-01-01 09:57:48 0.5603327 6 2.970842 NA
7: 2019-01-01 10:03:02 0.9782264 7 2.410509 NA
8: 2019-01-01 10:18:35 0.2554288 8 1.432283 NA
9: 2019-01-01 10:32:33 0.4577418 9 1.176854 NA
10: 2019-01-01 10:36:15 0.7191123 10 0.719112 NA
where two behaviors at the end are possible:
the sequence is treated as is;
sums are only calculated as long as there is an id with a time at least an hour later, and all remaining values are set to NA (preferred).
I suspect that solving this requires me to subset within j, but this is a problem I frequently run into and cannot solve; I have not yet understood the general approach to it.

One option is a loop with a join:
dat1 <- dat[order(id)]
out <- rbindlist(lapply(dat1$id, function(i) {
  d1 <- dat1[seq_len(.N) >= match(i, id)]
  d1[d1[, .(time = time %m+% hours(1))],
     .(time1 = time, val, new1 = sum(val)),
     on = .(time <= time), by = .EACHI][1]
}))[, time := NULL][]
setnames(out, 1, "time")
out[time < time[2] %m+% hours(1), new2 := new1]
out
# time val new1 new2
# 1: 2019-01-01 08:23:19 0.1174874 1.0521596 1.052160
# 2: 2019-01-01 08:48:24 0.9346722 2.3369796 2.336980
# 3: 2019-01-01 09:27:00 0.9400145 3.6712924 3.671292
# 4: 2019-01-01 09:47:19 0.4622928 3.9081319 3.908132
# 5: 2019-01-01 09:49:51 0.4749971 3.4458391 NA
# 6: 2019-01-01 09:57:48 0.5603327 2.9708420 NA
# 7: 2019-01-01 10:03:02 0.9782264 2.4105093 NA
# 8: 2019-01-01 10:18:35 0.2554288 1.4322829 NA
# 9: 2019-01-01 10:32:33 0.4577418 1.1768540 NA
#10: 2019-01-01 10:36:15 0.7191123 0.7191123 NA
Update
For the new data, we can split by group and apply the same method:
f1 <- function(data) {
  lst1 <- split(data, data[["group"]])
  rbindlist(lapply(lst1, function(.dat) {
    out <- rbindlist(lapply(.dat$id, function(i) {
      d1 <- .dat[seq_len(.N) >= match(i, id)]
      d1[d1[, .(time = time %m+% hours(1))],
         .(time1 = time, val, new1 = sum(val)),
         on = .(time <= time), by = .EACHI][1]
    }))[, time := NULL][]
    setnames(out, 1, "time")
    out[time[.N] - time > hours(1), new2 := new1][]
  }))
}
f1(dat1)
# time val new1 new2
#1: 2019-01-01 08:23:19 0.117487362 1.0521596 1.0521596
#2: 2019-01-01 08:48:24 0.934672247 2.3369796 2.3369796
#3: 2019-01-01 09:27:00 0.940014523 3.6712924 3.6712924
#4: 2019-01-01 09:47:19 0.462292823 3.9081319 3.9081319
#5: 2019-01-01 09:49:51 0.474997082 3.4458391 NA
#6: 2019-01-01 09:57:48 0.560332746 2.9708420 NA
#7: 2019-01-01 10:03:02 0.978226428 2.4105093 NA
#8: 2019-01-01 10:18:35 0.255428824 1.4322829 NA
#9: 2019-01-01 10:32:33 0.457741776 1.1768540 NA
#10: 2019-01-01 10:36:15 0.719112252 0.7191123 NA
#11: 2019-02-01 18:14:39 0.003948339 0.8223376 0.8223376
#12: 2019-02-01 18:23:59 0.811055141 1.7249907 1.7249907
#13: 2019-02-01 19:05:39 0.007334147 1.7468516 1.7468516
#14: 2019-02-01 19:15:03 0.906601408 1.7395175 1.7395175
#15: 2019-02-01 19:26:11 0.832916080 1.4446947 NA
#16: 2019-02-01 20:19:30 0.611778643 2.6303112 NA
#17: 2019-02-01 20:30:46 0.737595618 2.0185326 NA
#18: 2019-02-01 20:31:03 0.207658973 1.2809370 NA
#19: 2019-02-01 20:37:50 0.685169729 1.0732780 NA
#20: 2019-02-01 20:44:50 0.388108283 0.3881083 NA
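For larger data, the per-id loop can often be replaced by a single non-equi self-join. A minimal sketch of that idea (my rewrite, not from the thread; the one-hour window is hard-coded as 3600 seconds instead of lubridate's hours(1)):
# one lookup window per row: [time, time + 1 hour], within the same group
win <- dat[, .(group, start = time, end = time + 3600)]
# for each window, sum val over the rows of dat that fall inside it;
# by = .EACHI evaluates the sum once per window, in the order of win's rows
sums <- dat[win, on = .(group, time >= start, time <= end),
            .(new1 = sum(val)), by = .EACHI]
dat[, new1 := sums$new1]
This reproduces new1; the NA rule for new2 would still have to be applied per group afterwards, as above.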

Related

creating a unique variable based on row differences of another variable considering groups

By using the data below, I want to create a new unique customer id by considering their contact date.
Rule: I want each customer to get a new unique customer id every two days. The id is preserved on the following record if the next contact date for the same customer falls within the following two days; otherwise, a new id is assigned to that same customer.
I couldn't go any further than calculating date differences.
The original dataset I work is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
threshold <- 2
dt <- structure(list(
  customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
  contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
                           "2019-01-08","2019-01-09","2019-02-02","2019-02-05",
                           "2019-02-05","2019-02-09","2019-02-12","2019-02-01",
                           "2019-02-01","2019-02-05")),
  desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
  class = "data.frame", row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days := contact_date - shift(contact_date, type = "lag"), by = customer_id]
dt[, date_diff_in_days := as.numeric(date_diff_in_days)]
dt
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by = .(
  customer_id,
  cumsum(contact_date - shift(contact_date, fill = first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check it on your actual data (comparing against results from, e.g., Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or the threshold is exceeded:
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > threshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
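As a quick check against the provided desired_output column (the printed table above already shows the two agree):
dt[, all(result == desired_output)]
# [1] TRUE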

using data.table to create sequence from starting points and increments

I would like to use data.table to repeatedly add an increment to a starting point.
library(data.table)
dat <- data.table(time = seq(from = as.POSIXct("2018-01-01 01:00:01"),
                             to = as.POSIXct("2018-01-01 01:00:10"), by = "secs"),
                  int = c(2, 3, 3, 1, 10, 10, 10, 10, 10, 10), x = 2 * 1:10)
> dat
                  time int  x
1: 2018-01-01 01:00:01 2 2
2: 2018-01-01 01:00:02 3 4
3: 2018-01-01 01:00:03 3 6
4: 2018-01-01 01:00:04 1 8
5: 2018-01-01 01:00:05 10 10
6: 2018-01-01 01:00:06 10 12
7: 2018-01-01 01:00:07 10 14
8: 2018-01-01 01:00:08 10 16
9: 2018-01-01 01:00:09 10 18
10: 2018-01-01 01:00:10 10 20
That is, starting in row 1, I would like to add the value of int to time, yielding a new time. I then need to add the value of int at that new time to arrive at a third time, and so on. The result would then be
> res
                  time int  x
1: 2018-01-01 01:00:01   2  2
2: 2018-01-01 01:00:03   3  6
3: 2018-01-01 01:00:06  10 12
I would probably know how to do this in a loop, but I wonder whether data.table can handle these sorts of problems as well.
Since the values in time are continuous, my idea was to use the cumulative values of int to index, along the lines of
index <- dat[..., cumsum(...int...), ...]
dat[index]
but I cannot get cumsum() to ignore the values in between the points of interest. Perhaps this can be done in the i part of data.table, but I would not know how. Does anyone have an idea?
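(To see why plain cumsum cannot work here: it runs over every row, so the in-between increments are always included. My illustration:
cumsum(dat$int)[1:3]
# [1] 2 5 8
which already counts row 2's int even though row 2 is not on the chain.)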
# start by finding the next time
dat[, next.time := time + int][!dat, on = .(next.time = time), next.time := NA]
# do this in a loop for the actual problem, and stop when the final column is all NA
dat[dat, on = .(next.time = time), t1 := i.next.time]
dat[dat, on = .(t1 = time), t2 := i.next.time]
dat
# time int x next.time t1 t2
# 1: 2018-01-01 01:00:01 2 2 2018-01-01 01:00:03 2018-01-01 01:00:06 <NA>
# 2: 2018-01-01 01:00:02 3 4 2018-01-01 01:00:05 <NA> <NA>
# 3: 2018-01-01 01:00:03 3 6 2018-01-01 01:00:06 <NA> <NA>
# 4: 2018-01-01 01:00:04 1 8 2018-01-01 01:00:05 <NA> <NA>
# 5: 2018-01-01 01:00:05 10 10 <NA> <NA> <NA>
# 6: 2018-01-01 01:00:06 10 12 <NA> <NA> <NA>
# 7: 2018-01-01 01:00:07 10 14 <NA> <NA> <NA>
# 8: 2018-01-01 01:00:08 10 16 <NA> <NA> <NA>
# 9: 2018-01-01 01:00:09 10 18 <NA> <NA> <NA>
#10: 2018-01-01 01:00:10 10 20 <NA> <NA> <NA>
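A sketch of the loop hinted at in the comment above (my code, not from the thread; it assumes each step matches at least one row): keep joining next.time onto itself until a whole new column comes back NA.
prev <- "next.time"
k <- 1L
repeat {
  new <- paste0("t", k)
  # extend the chain one step: match the previous column against time
  dat[dat, on = setNames("time", prev), (new) := i.next.time]
  if (dat[, all(is.na(get(new)))]) break   # chain exhausted
  prev <- new
  k <- k + 1L
}
For the example data this creates t1 and then stops after t2, matching the table above.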

Melt or Replicate rows in a data table a certain number of times and include counter in R

I would like to "expand" a dataframe, duplicating the information in some columns the number of times indicated by a fifth column.
What would be the most efficient way to achieve this task in R? (Open to data.table, dplyr, or reshape solutions.)
Original Dataframe/DataTable:
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 2
4: 3 C 2016-01-04 <NA> NA
5: 4 D 2016-01-05 2016-01-02 5
Desired Dataframe/DataTable
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 1
4: 2 B 2016-01-03 2016-01-01 2
5: 3 C 2016-01-04 <NA> NA
6: 4 D 2016-01-05 2016-01-02 1
7: 4 D 2016-01-05 2016-01-02 2
8: 4 D 2016-01-05 2016-01-02 3
9: 4 D 2016-01-05 2016-01-02 4
10: 4 D 2016-01-05 2016-01-02 5
Reproducible data:
DT <- data.table(
  f_1 = factor(c(1, 2, 2, 3, 4)),
  f_2 = factor(c("A", "A", "B", "C", "D")),
  d_1 = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05")),
  d_2 = as.Date(c(NA, NA, "2016-01-01", NA, "2016-01-02")),
  i_1 = as.integer(c(NA, NA, 2, NA, 5)))
Thanks, and sorry if this is a duplicate. I am struggling with this kind of reshaping exercise.
Here is a data.table solution. Basically, group by the columns that you want to duplicate and generate a sequence of integers using the number in i_1:
DT[, .(i_1 = if (!is.na(i_1)) seq_len(i_1) else i_1),
   by = c(names(DT)[-ncol(DT)])]
output:
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 1
4: 2 B 2016-01-03 2016-01-01 2
5: 3 C 2016-01-04 <NA> NA
6: 4 D 2016-01-05 2016-01-02 1
7: 4 D 2016-01-05 2016-01-02 2
8: 4 D 2016-01-05 2016-01-02 3
9: 4 D 2016-01-05 2016-01-02 4
10: 4 D 2016-01-05 2016-01-02 5
Or another way using data.table: for each row, create a sequence of numbers using i_1, add the original data to that sequence with c(.SD[, -"i_1"], ...), and finally remove the by column:
DT[, c(.SD[, -"i_1"], .(i_1 = if (!is.na(i_1)) seq_len(i_1) else i_1)),
   by = seq_len(DT[, .N])][, -1L]
Are you OK replacing i_1 with 1 when it's NA? If so, the following would be slightly more readable.
First, repeat the rows the specified number of times (with ad hoc handling of the missing values of i_1, using replace, courtesy of @Frank):
DT_out = DT[rep(1:.N, replace(i_1, is.na(i_1), 1L))]
This could be just DT[rep(1:.N, i_1)] if we've already replaced DT[is.na(i_1), i_1 := 1L].
All that's left is to update the values of i_1. There are simpler versions of this, depending on your data's particulars. Here I think is the more general version:
DT_out[!is.na(i_1), i_1 := rowidv(.SD), .SDcols = !'i_1'][]
# f_1 f_2 d_1 d_2 i_1
# 1: 1 A 2016-01-01 <NA> NA
# 2: 2 A 2016-01-02 <NA> NA
# 3: 2 B 2016-01-03 2016-01-01 1
# 4: 2 B 2016-01-03 2016-01-01 2
# 5: 3 C 2016-01-04 <NA> NA
# 6: 4 D 2016-01-05 2016-01-02 1
# 7: 4 D 2016-01-05 2016-01-02 2
# 8: 4 D 2016-01-05 2016-01-02 3
# 9: 4 D 2016-01-05 2016-01-02 4
# 10: 4 D 2016-01-05 2016-01-02 5
rowid and rowidv give the row number within the groups defined by the variables they are passed. You can compare with rowid(f_2), rowid(f_1), and rowid(f_1, f_2) to get an idea of what I mean. rowidv(.SD) is shorthand for rowid(f_1, f_2, d_1, d_2), since we exclude i_1 from the columns in .SD.
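For intuition, a tiny standalone example (mine, not from the thread):
rowid(c("a", "a", "b"))
# [1] 1 2 1
rowidv(data.table(g = c("a", "a", "b"), h = c(1, 1, 2)))
# [1] 1 2 1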

Create Ranking based on Dates (Annual) in R

We will start with the following DataTable:
id date
1: 1 2015-12-31
2: 1 2014-12-31
3: 1 2013-12-31
4: 1 2012-12-31
5: 1 2011-12-31
6: 2 2015-12-31
7: 2 2014-12-31
8: 2 2014-01-25
9: 2 2013-01-25
10: 2 2012-01-25
library(data.table)
DT <- data.table(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 as.IDate(c("2015-12-31", "2014-12-31", "2013-12-31", "2012-12-31",
                            "2011-12-31", "2015-12-31", "2014-12-31", "2014-01-25",
                            "2013-01-25", "2012-01-25")))
setnames(DT, c("id", "date"))
For every unique id, I want to create a ranking. The most recent date for a given id should have rank 0. I then subtract one year from that date to get rank -1, and so on. If the month differs from the month of the rank-0 date, the ranking should stop. For example, at line 8, for id = 2, the month is not December, so the ranking stops.
We would get the following result:
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 NA
9: 2 2013-01-25 NA
10: 2 2012-01-25 NA
I have the following code so far (given by @Frank and @akrun):
DT <- DT[order(id, -date)]
DT <- DT[, rank_year := {
  z = month(date) + year(date) * 12
  as.integer((z - z[1L]) / 12)   # 12 months
}, by = id]
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 -1
9: 2 2013-01-25 -2
10: 2 2012-01-25 -3
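(The old code gives -1 at line 8 because, by my arithmetic, the offset there is (1 + 2014*12) - (12 + 2015*12) = -23 months, and as.integer(-23/12) truncates to -1 instead of stopping.)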
OK, I guess I would do it like this:
DT[, rank_year := replace(
  year(date) - year(date)[1L],
  month(date) != month(date[1L]),
  NA_integer_
), by = id]
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 NA
9: 2 2013-01-25 NA
10: 2 2012-01-25 NA
See ?replace for details on how this works.
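For instance, on a toy vector (my example):
replace(c(0L, -1L, -1L), c(FALSE, FALSE, TRUE), NA_integer_)
# [1]  0 -1 NA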
One way of extending the old answer is
DT[, r := {
  z = month(date) + year(date) * 12
  res = (z - z[1L]) / 12
  as.integer(replace(res, res %% 1 != 0, NA))
}, by = id]
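The res %% 1 != 0 test catches month changes because a month offset that is not a multiple of 12 leaves a fractional number of years, e.g. (my arithmetic):
(-11 / 12) %% 1 != 0
# [1] TRUE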

Remove rows after a certain date based on a condition in R

There are similar questions I've seen, but none of them applies the condition to specific rows of a data.table or data.frame; rather, they apply it to the whole table.
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
dt <- data.table(ID = seq(1, 10, 1), OA = c(1, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                 oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005",
                                     "01/01/2001", "01/01/2001", "01/01/2001", "05/06/2010",
                                     "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
                 stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005",
                                       "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010",
                                       "01/01/2011", "31/12/2011"), "%d/%m/%Y"))
dt$oa.date[dt$OA == 0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2007-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
My attempts are:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.
This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0) which is coerced to logical (TRUE/FALSE) in the i-expression.
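The same coercion in isolation (a toy example of mine):
!(c(1, 0) & TRUE)
# [1] FALSE  TRUE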
You can try
dt = dt[dt$OA == 0 | (dt$OA == 1 & !(dt$oa.date < start)), ]
Note that for the OA == 0 rows, oa.date is NA, so !(dt$oa.date < start) evaluates to NA; it is the first condition, dt$OA == 0, that keeps those rows.
