Date difference between end date to start date - r

I have data that looks like below.
id from_date to_date
1 2015-03-09 2015-03-14
2 2015-02-22 2015-02-24
2 2015-05-06 2015-05-17
3 2015-02-12 2015-02-16
4 2015-03-10 2015-03-16
4 2015-03-22 2015-04-07
4 2015-06-07 2015-07-07
4 2015-07-06 2015-07-07
4 2015-08-02 2015-08-07
I want to create a separate variable which is the difference between the to date and the next from date, grouped by id.
So the first row of each id will be NA. I tried the method below, based on another answer on Stack Overflow, but could not
achieve that.
library(data.table)
chf1 = data.table(id = chf$id,from_date = chf$f.date,to_date = chf$t.date)
setkey(chf1,id)
chf1[,diff:=c(NA,difftime(from_date, to_date, units = "days")),by=id]
The output should look like
id from_date to_date difference
1 2015-03-09 2015-03-14 NA
2 2015-02-22 2015-02-24 NA
2 2015-05-06 2015-05-17 71
3 2015-02-12 2015-02-16 NA
4 2015-03-10 2015-03-16 NA
4 2015-03-22 2015-04-07 6
4 2015-06-07 2015-06-10 64
4 2015-07-06 2015-07-07 26
4 2015-08-02 2015-08-07 26

There are three issues in the code:
1) chf1$from_date and chf1$to_date return the whole column, so grouping by 'id' has no effect.
2) difftime returns output of the same length as its inputs, so prepending NA with c() makes the result one element longer than the group.
3) As difftime takes the difference between each element of 'from_date' and the corresponding element of 'to_date', there is no need for by = id.
Therefore, the code can be
chf1[, diff1:=difftime(from_date, to_date, units = "days")]
chf1
# id from_date to_date diff1
#1: 1 2015-03-09 2015-03-14 -5 days
#2: 2 2015-02-22 2015-02-24 -2 days
#3: 2 2015-05-06 2015-05-17 -11 days
#4: 3 2015-02-12 2015-02-16 -4 days
#5: 4 2015-03-10 2015-03-16 -6 days
#6: 4 2015-03-22 2015-04-07 -16 days
#7: 4 2015-06-07 2015-07-07 -30 days
#8: 4 2015-07-06 2015-07-07 -1 days
#9: 4 2015-08-02 2015-08-07 -5 days
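Point 2 above can be checked in isolation. Here is a minimal base R sketch using the two rows of id 2 from the question; the variable names are only for illustration:

```r
# Two rows of id 2 from the question
from <- as.Date(c("2015-02-22", "2015-05-06"))
to   <- as.Date(c("2015-02-24", "2015-05-17"))

d <- difftime(from, to, units = "days")
length(d)        # same length as the inputs

padded <- c(NA, d)
length(padded)   # one element longer than the group it would be assigned to
```

This is why the c(NA, ...) form does not fit into a grouped assignment: the right-hand side no longer matches the number of rows in the group.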
Based on the description in OP's post, if we need the difference between the next value of 'from_date' and the current 'to_date' within each 'id', apply difftime to the shifted 'from_date' (type = "lead") and 'to_date', grouped by 'id', and assign (:=) the result to 'diff1'.
chf1[, diff1 := difftime(shift(from_date, type = "lead"), to_date,
units = "days") , by = id]
chf1
# id from_date to_date diff1
#1: 1 2015-03-09 2015-03-14 NA days
#2: 2 2015-02-22 2015-02-24 71 days
#3: 2 2015-05-06 2015-05-17 NA days
#4: 3 2015-02-12 2015-02-16 NA days
#5: 4 2015-03-10 2015-03-16 6 days
#6: 4 2015-03-22 2015-04-07 61 days
#7: 4 2015-06-07 2015-07-07 -1 days
#8: 4 2015-07-06 2015-07-07 26 days
#9: 4 2015-08-02 2015-08-07 NA days
Or it could be
chf1[, diff1 := difftime(from_date, shift(to_date), units = "days"), by = id]
data
chf <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L),
f.date = structure(c(16503,
16488, 16561, 16478, 16504, 16516, 16593, 16622, 16649), class = "Date"),
t.date = structure(c(16508, 16490, 16572, 16482, 16510, 16532,
16623, 16623, 16654), class = "Date")), .Names = c("id",
"f.date", "t.date"), row.names = c(NA, -9L), class = "data.frame")
chf1 = data.table(id = chf$id,from_date = chf$f.date,to_date = chf$t.date)
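For comparison, the same per-id gap (current from_date minus the previous to_date, NA on the first row of each id) can be sketched in base R. This is only an illustration: gap_by_id is a hypothetical helper, not part of data.table, and the data frame is retyped from the table in the question.

```r
# Sample data retyped from the question
chf <- data.frame(
  id = c(1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L),
  from_date = as.Date(c("2015-03-09", "2015-02-22", "2015-05-06", "2015-02-12",
                        "2015-03-10", "2015-03-22", "2015-06-07", "2015-07-06",
                        "2015-08-02")),
  to_date = as.Date(c("2015-03-14", "2015-02-24", "2015-05-17", "2015-02-16",
                      "2015-03-16", "2015-04-07", "2015-07-07", "2015-07-07",
                      "2015-08-07"))
)

# Hypothetical helper: for each id, subtract the previous row's to_date from
# the current from_date; the first row of each id stays NA.
gap_by_id <- function(from, to, id) {
  gap <- rep(NA_real_, length(from))
  for (rows in split(seq_along(id), id)) {
    n <- length(rows)
    if (n > 1) {
      gap[rows[-1]] <- as.numeric(from[rows[-1]] - to[rows[-n]], units = "days")
    }
  }
  gap
}

chf$gap_days <- gap_by_id(chf$from_date, chf$to_date, chf$id)
chf
```

The loop assumes the rows are already sorted by date within each id, as they are in the question.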

Related

Splitting a dateTime vector if time is greater than x between vector components

I have the following data:
df <- data.frame(index = 1:85,
times = c(seq(as.POSIXct("2020-10-03 21:31:00 UTC"),
as.POSIXct("2020-10-03 22:25:00 UTC"),
"min"),
seq(as.POSIXct("2020-11-03 10:10:00 UTC"),
as.POSIXct("2020-11-03 10:39:00 UTC"),
"min")
))
If we look at rows 55 and 56, there is a clear divide in times:
> df[55:56, ]
index times
55 55 2020-10-03 22:25:00
56 56 2020-11-03 10:10:00
I would like to add a third categorical column split based on the splits,
e.g. row df$split[55, ] = A and row df$split[56, ] = B
logic like
If time gap between rows is greater than 5 mins start new category for subsequent rows until the next instance where time gap > 5 mins.
thanks
You could use
library(dplyr)
df %>%
mutate(cat = 1 + cumsum(c(0, diff(times)) > 5))
which returns
index times cat
1 1 2020-10-03 21:31:00 1
2 2 2020-10-03 21:32:00 1
3 3 2020-10-03 21:33:00 1
4 4 2020-10-03 21:34:00 1
5 5 2020-10-03 21:35:00 1
6 6 2020-10-03 21:36:00 1
7 7 2020-10-03 21:37:00 1
8 8 2020-10-03 21:38:00 1
...
53 53 2020-10-03 22:23:00 1
54 54 2020-10-03 22:24:00 1
55 55 2020-10-03 22:25:00 1
56 56 2020-11-03 10:10:00 2
57 57 2020-11-03 10:11:00 2
58 58 2020-11-03 10:12:00 2
59 59 2020-11-03 10:13:00 2
If you need letters or something else, you could for example use
df %>%
mutate(cat = LETTERS[1 + cumsum(c(0, diff(times)) > 5)])
to convert the categories 1 and 2 into A and B.
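One thing to watch: as far as I know, diff() on date-times returns a difftime whose units are picked automatically from the data, so the threshold 5 only means "5 minutes" when the smallest gap happens to be measured in minutes. A base R sketch that fixes the units explicitly (the names gap_mins, group and split_label are made up for this example, and tz = "UTC" is an assumption):

```r
# Rebuild the two runs of one-minute timestamps from the question
times <- c(
  seq(as.POSIXct("2020-10-03 21:31:00", tz = "UTC"),
      as.POSIXct("2020-10-03 22:25:00", tz = "UTC"), by = "min"),
  seq(as.POSIXct("2020-11-03 10:10:00", tz = "UTC"),
      as.POSIXct("2020-11-03 10:39:00", tz = "UTC"), by = "min")
)

# Gap to the previous row in minutes, with the units stated explicitly
gap_mins <- c(0, as.numeric(diff(times), units = "mins"))

# Start a new group whenever the gap exceeds 5 minutes
group <- 1 + cumsum(gap_mins > 5)
split_label <- LETTERS[group]

split_label[55:56]
```

The same gap_mins expression can be dropped into the mutate() call in place of c(0, diff(times)).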

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea here is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
X = sample(1:8, 2*4*5, replace = TRUE),
Y = sample(1:8, 2*4*5, replace = TRUE))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[,next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
on = .(ID, Time),
roll = -Inf,
j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[,.('X' = min(X),
'Y' = min(Y),
'Steps' = sum(Steps)),
.(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get
all rows of pedometer_df, with missing values for x and y (which I assume are identifiers) wherever the time is not an EMA assessment time.
I then fill those values upwards from the next available row (so each row gets the x and y of the next EMA assessment),
and finally group_by ID, x and y and summarise to keep the datetime of the assessment (max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
left_join(ema_df, by = c("ID", "Time")) %>%
fill(x, y, .direction = "up") %>%
group_by(ID, x, y) %>%
summarise(
Time = max(Time),
Steps = sum(Steps)
)
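The "roll each reading up to the next assessment" idea can also be sketched in base R with findInterval. This is a rough illustration on the handful of rows shown in the question, not a replacement for either answer; the column layout and tz = "UTC" are assumptions, and the EMA rows are assumed to be sorted by Time within each ID.

```r
# Pedometer rows from the question (five readings per ID, one per minute)
ped <- data.frame(
  ID = rep(1:2, each = 5),
  Steps = c(15, 23, 76, 32, 45, 16, 17, 0, 5, 8),
  Time = rep(as.POSIXct("2020-02-04 08:32:00", tz = "UTC") + 60 * (0:4), 2)
)
# First two EMA assessments per ID from the question
ema <- data.frame(
  ID = c(1, 1, 2, 2),
  Time = as.POSIXct(c("2020-02-04 08:36:00", "2020-02-04 12:01:00",
                      "2020-02-04 08:35:00", "2020-02-04 12:05:00"), tz = "UTC")
)

ema$Steps <- 0
for (id in unique(ema$ID)) {
  e <- which(ema$ID == id)          # EMA rows for this ID
  p <- ped[ped$ID == id, ]
  # Index of the first assessment at or after each reading
  nxt <- findInterval(as.numeric(p$Time), as.numeric(ema$Time[e]),
                      left.open = TRUE) + 1
  ok <- nxt <= length(e)            # readings after the last assessment are dropped
  sums <- tapply(p$Steps[ok], factor(nxt[ok], levels = seq_along(e)), sum)
  sums[is.na(sums)] <- 0            # assessments with no readings get 0 steps
  ema$Steps[e] <- as.numeric(sums)
}
ema
```

With left.open = TRUE a reading taken exactly at an assessment time is counted towards that assessment, which matches the 191 and 38 in the desired output.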

Create new variable based on function of other variables

How can I pass column entries as arguments to a function, then create a new column which is a function of the other two? For example, taking this excellent function to add months to a date, and taking this example data frame:
df <- structure(
list(
date = structure(
c(
17135,
17105,
17105,
17074,
17286,
17317,
17317,
17347,
17105,
17317
),
class = "Date"
),
monthslater = c(10,
11, 13, 14, 3, 3, 3, 3, 4, NA)
),
.Names = c("date", "monthslater"),
row.names = c(NA, 10L),
class = "data.frame"
)
I would like to create a new column where I pass the entries from columns date and monthslater to the function add.months. I would have thought that something like this would work:
df$newdate <- add.months(df$date, df$monthslater)
But it doesn't.
The full code for the function is:
add.months <- function(date,n) seq(date, by = paste(n, "months"), length = 2)[2]
Using %m+% from the lubridate-package:
library(lubridate)
df$newdate <- df$date %m+% months(df$monthslater)
gives:
> df
date monthslater newdate
1 2016-11-30 10 2017-09-30
2 2016-10-31 11 2017-09-30
3 2016-10-31 13 2017-11-30
4 2016-09-30 14 2017-11-30
5 2017-04-30 3 2017-07-30
6 2017-05-31 3 2017-08-31
7 2017-05-31 3 2017-08-31
8 2017-06-30 3 2017-09-30
9 2016-10-31 4 2017-02-28
10 2017-05-31 4 2017-09-30
In a similar way you can also add days or years:
df$newdate2 <- df$date %m+% days(df$monthslater)
df$newdate3 <- df$date %m+% years(df$monthslater)
which gives:
> df
date monthslater newdate newdate2 newdate3
1 2016-11-30 10 2017-09-30 2016-12-10 2026-11-30
2 2016-10-31 11 2017-09-30 2016-11-11 2027-10-31
3 2016-10-31 13 2017-11-30 2016-11-13 2029-10-31
4 2016-09-30 14 2017-11-30 2016-10-14 2030-09-30
5 2017-04-30 3 2017-07-30 2017-05-03 2020-04-30
6 2017-05-31 3 2017-08-31 2017-06-03 2020-05-31
7 2017-05-31 3 2017-08-31 2017-06-03 2020-05-31
8 2017-06-30 3 2017-09-30 2017-07-03 2020-06-30
9 2016-10-31 4 2017-02-28 2016-11-04 2020-10-31
10 2017-05-31 4 2017-09-30 2017-06-04 2021-05-31
For your immediate, specific issue, consider mapply to pass those two vectors element-wise into the defined function. And since monthslater includes NA, add a tryCatch to the function.
add.months <- function(date, n) {
tryCatch(seq(date, by = paste(n, "months"), length = 2)[2],
warning = function(w) return(NA),
error = function(e) return(NA))
}
df$newdate <- as.Date(mapply(add.months, df$date, df$monthslater), origin="1970-01-01")
df
# date monthslater newdate
# 1 2016-11-30 10 2017-09-30
# 2 2016-10-31 11 2017-10-01
# 3 2016-10-31 13 2017-12-01
# 4 2016-09-30 14 2017-11-30
# 5 2017-04-30 3 2017-07-30
# 6 2017-05-31 3 2017-08-31
# 7 2017-05-31 3 2017-08-31
# 8 2017-06-30 3 2017-09-30
# 9 2016-10-31 4 2017-03-03
# 10 2017-05-31 NA <NA>
Also note the function author's caveat about the end of February: row 9 (2016-10-31 plus 4 months) overflows past the end of February and lands on 2017-03-03, 3 days into March.
Or with base R:
df$newdate <- mapply(add.months, df[[1]], df[[2]], SIMPLIFY = FALSE)
> df
date monthslater newdate
1 2016-11-30 10 2017-09-30
2 2016-10-31 11 2017-10-01
3 2016-10-31 13 2017-12-01
4 2016-09-30 14 2017-11-30
5 2017-04-30 3 2017-07-30
6 2017-05-31 3 2017-08-31
7 2017-05-31 3 2017-08-31
8 2017-06-30 3 2017-09-30
9 2016-10-31 4 2017-03-03
10 2017-05-31 4 2017-10-01
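The month-end overflow of the seq()-based function is easy to reproduce on a few dates. A small base R sketch, looping by index instead of using mapply; the explicit NA check is an assumption added here in place of the tryCatch, and the three dates are taken from the example data:

```r
# Same seq()-based idea as in the question, with an explicit NA guard
add.months <- function(date, n) {
  if (is.na(n)) return(as.Date(NA))
  seq(date, by = paste(n, "months"), length.out = 2)[2]
}

dates <- as.Date(c("2016-11-30", "2016-10-31", "2017-05-31"))
n <- c(10, 4, NA)

# Apply element by element; vapply returns day counts, so convert back to Date
days <- vapply(seq_along(dates),
               function(i) as.numeric(add.months(dates[i], n[i])),
               numeric(1))
out <- as.Date(days, origin = "1970-01-01")
format(out)
```

The second date shows the overflow: there is no 31 February, so the result is pushed into March, whereas %m+% from lubridate stops at the last day of February.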

Conditional (inequality) join in data.table

I'm just trying to figure out how to do a conditional join on two data.tables.
I've written a sqldf conditional join to give me the circuits whose start or finish times are within the other's start/finish times.
sqldf("select dt2.start, dt2.finish, dt2.counts, dt1.id, dt1.circuit
from dt2
left join dt1 on (
(dt2.start >= dt1.start and dt2.start < dt1.finish) or
(dt2.finish >= dt1.start and dt2.finish < dt1.finish)
)")
This gives me the correct result, but it's too slow for my large-ish data set.
What's the data.table way to do this without a vector scan?
Here's my data:
dt1 <- data.table(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
dt2 <- data.table(structure(list(start = structure(c(1393621200, 1393624800, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400), class = c("POSIXct",
"POSIXt"), tzone = ""), end = structure(c(1393624799, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), seconds = c(3599L,
1800L, 1319L, 480L, 3601L, 7200L, 7199L, 3900L, 5700L, 4501L,
5699L, 5401L, 1800L, 3600L), counts = c(1L, 1L, 0L, 1L, 2L, 1L,
0L, 1L, 2L, 3L, 2L, 3L, 2L, 1L)), .Names = c("start", "end",
"seconds", "counts"), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L), class = "data.frame"))
Using non-equi joins:
ans = dt1[dt2, on=.(start <= end, end > start),
.(i.start, i.end, counts, id, circuit, cndn = i.start < x.start & i.end >= x.end),
allow.cartesian=TRUE
][!cndn %in% TRUE]
The condition start <= end, end >= start (note the >= in both cases) checks whether two intervals overlap at all. The half-open interval on one side is obtained with end > start (> instead of >=). But it still also picks up intervals of these types:
dt1: start=================end
dt2: start--------------------------------end ## start < start, end > end
and
dt1: start=================end
dt2: start----------end ## end == end
The cndn column is there to check for and remove these cases. Hopefully there aren't many of them, so that we don't materialise a lot of unwanted rows unnecessarily.
PS: the solution in this case is still not as straightforward as I'd like, and that's because it requires an OR operation. It is possible to do two conditional joins and then bind them together, though.
Perhaps at some point we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.
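When experimenting with join conditions like these, it can help to have a brute-force reference on a tiny data set to compare against. The sketch below writes the OR condition from the question's SQL as a plain R predicate and checks every pair; overlaps is a hypothetical helper and the numbers are invented. This is exactly the kind of full scan the question wants to avoid, so it is only meant for checking results, not for speed.

```r
# Hypothetical helper: the OR condition from the question's SQL on plain numbers.
# (s1, e1) is an interval from dt1, (s2, e2) an interval from dt2.
overlaps <- function(s2, e2, s1, e1) {
  (s2 >= s1 & s2 < e1) | (e2 >= s1 & e2 < e1)
}

d1 <- data.frame(id = c("p", "q"), start = c(10, 40), end = c(20, 50))
d2 <- data.frame(start = c(5, 12, 5, 20), end = c(12, 15, 25, 30))

# All (d2 row, d1 row) combinations, then keep the ones that satisfy the condition
pairs <- expand.grid(i2 = seq_len(nrow(d2)), i1 = seq_len(nrow(d1)))
hit <- overlaps(d2$start[pairs$i2], d2$end[pairs$i2],
                d1$start[pairs$i1], d1$end[pairs$i1])
matched <- pairs[hit, ]
matched
```

Row 3 of d2 (5 to 25) completely encloses interval "p" and is not matched, which is the first kind of case the cndn column filters out of the non-equi join.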
No idea if this performs faster, but here's a shot at a data table method. I reshape dt1 and use findInterval to identify where the times in dt2 line up with times in dt1.
# > dt1
# circuit start end id
# 1: b 2014-02-28 16:00:00 2014-02-28 17:30:00 1001
# 2: a 2014-02-28 17:52:00 2014-02-28 18:51:59 1002
# 3: b 2014-02-28 18:00:00 2014-02-28 21:00:00 1003
# 4: a 2014-02-28 18:52:00 2014-02-28 19:00:00 1004
# 5: b 2014-03-01 00:05:00 2014-03-01 02:55:00 1005
# 6: c 2014-02-28 23:00:00 2014-03-01 06:30:00 1006
# 7: a 2014-03-01 01:40:00 2014-03-01 04:59:59 1007
# 8: a 2014-03-01 05:00:00 2014-03-01 06:00:00 1008
# 9: b 2014-03-01 04:30:00 2014-03-01 07:30:00 1009
# > dt2
# start end seconds counts
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1
## reshapes dt1 from wide to long
## puts start and end times into one column and sorts by time
## this is so that you can use findInterval later
dt3 <- dt1[,list(time = c(start,end)), by = "circuit,id"][order(time)]
dt3[,ntvl := seq_len(nrow(dt3))]
# circuit id time ntvl
# 1: b 1001 2014-02-28 16:00:00 1
# 2: b 1001 2014-02-28 17:30:00 2
# 3: a 1002 2014-02-28 17:52:00 3
# 4: b 1003 2014-02-28 18:00:00 4
# 5: a 1002 2014-02-28 18:51:59 5
# 6: a 1004 2014-02-28 18:52:00 6
# 7: a 1004 2014-02-28 19:00:00 7
# 8: b 1003 2014-02-28 21:00:00 8
# 9: c 1006 2014-02-28 23:00:00 9
# 10: b 1005 2014-03-01 00:05:00 10
# 11: a 1007 2014-03-01 01:40:00 11
# 12: b 1005 2014-03-01 02:55:00 12
# 13: b 1009 2014-03-01 04:30:00 13
# 14: a 1007 2014-03-01 04:59:59 14
# 15: a 1008 2014-03-01 05:00:00 15
# 16: a 1008 2014-03-01 06:00:00 16
# 17: c 1006 2014-03-01 06:30:00 17
# 18: b 1009 2014-03-01 07:30:00 18
## map interval to id
dt4 <- dt3[,list(ntvl = seq(from = min(ntvl), to = max(ntvl)-1, by = 1)),by = "circuit,id"]
setkey(dt4, ntvl)
# circuit id ntvl
# 1: b 1001 1
# 2: a 1002 3
# 3: a 1002 4
# 4: b 1003 4
# 5: b 1003 5
# 6: b 1003 6
# 7: a 1004 6
# 8: b 1003 7
# 9: c 1006 9
# 10: c 1006 10
# 11: b 1005 10
# 12: c 1006 11
# 13: b 1005 11
# 14: a 1007 11
# 15: c 1006 12
# 16: a 1007 12
# 17: c 1006 13
# 18: a 1007 13
# 19: b 1009 13
# 20: c 1006 14
# 21: b 1009 14
# 22: c 1006 15
# 23: b 1009 15
# 24: a 1008 15
# 25: c 1006 16
# 26: b 1009 16
# 27: b 1009 17
# circuit id ntvl
## finds intervals in dt2
dt2[,`:=`(ntvl_start = findInterval(start, dt3[["time"]], rightmost.closed = FALSE),
ntvl_end = findInterval(end, dt3[["time"]], rightmost.closed = FALSE))]
# start end seconds counts ntvl_start ntvl_end
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1 1 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1 1 2
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0 2 2
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1 2 3
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2 3 7
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1 7 8
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0 8 8
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1 8 9
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2 9 10
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3 10 12
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2 12 12
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3 12 16
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2 16 17
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1 17 18
## joins, by start time, then by end time
## the commented out lines may be a better alternative
## if there are many NA values
setkey(dt2, ntvl_start)
dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_start_na <- dt2[!dt4]
setkey(dt2, ntvl_end)
dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_end_na <- dt2[!dt4]
## bring them all together and remove duplicates
dt_ans <- unique(rbind(dt_ans_start, dt_ans_end), by = c("start", "id"))
dt_ans <- dt_ans[!(is.na(id) & counts > 0)]
dt_ans[,ntvl := NULL]
setkey(dt_ans,start)
# start end counts id circuit
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 1 1001 b
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1 1001 b
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 0 NA NA
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 1 1002 a
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1002 a
# 6: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1003 b
# 7: 2014-02-28 19:00:00 2014-02-28 21:00:00 1 1003 b
# 8: 2014-02-28 21:00:00 2014-02-28 22:59:59 0 NA NA
# 9: 2014-02-28 22:59:59 2014-03-01 00:04:59 1 1006 c
# 10: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1006 c
# 11: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1005 b
# 12: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1006 c
# 13: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1005 b
# 14: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1007 a
# 15: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1006 c
# 16: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1007 a
# 17: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1006 c
# 18: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1007 a
# 19: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1009 b
# 20: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1006 c
# 21: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1009 b
# 22: 2014-03-01 06:30:00 2014-03-01 07:30:00 1 1009 b
# start end counts id circuit
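For readers who have not used findInterval before, this is what it returns on plain numbers; a tiny base R sketch with made-up break points:

```r
breaks <- c(10, 20, 30)        # sorted boundaries, like dt3[["time"]] above
x <- c(5, 10, 19, 20, 35)      # values to locate

# For each x, the number of breaks that are <= x
idx <- findInterval(x, breaks)
idx
```

A result of 0 means the value lies before the first boundary, and a value at or beyond the last boundary gets the index of that last boundary.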

calculating differences in times, data grouped by rows

I have a data set in the following format
ID DATETIME VALUE
1 4/2/2012 10:00 300
1 5/2/2012 23:00 150
1 6/3/2012 10:00 650
2 1/2/2012 10:00 450
2 2/2/2012 13:00 240
3 6/5/2012 09:00 340
3 7/5/2012 23:00 240
I would like to first calculate the time difference from first instance per ID to each subsequent time.
ID DATETIME VALUE DIFTIME(days)
1 4/2/2012 10:00 300 0
1 5/2/2012 23:00 150 1.3
1 6/3/2012 10:00 650 33
2 1/2/2012 10:00 450 0
2 2/2/2012 13:00 240 1
3 6/5/2012 09:00 340 0
3 7/5/2012 23:00 240 1
And then I'd like to make this a wide format
ID 0 1 1.3 33
1 300 na 150 650
2 450 240 na na
3 340 240 na na
Here is a solution using the data.table and reshape2 packages:
library(data.table)
DT <- as.data.table(dat)
DT[, DIFTIME := c(0, diff(as.Date(DATETIME))), by = ID]
## ID VALUE DATETIME DIFTIME
## 1: 1 300 2012-02-04 10:00:00 0
## 2: 1 150 2012-02-05 23:00:00 1
## 3: 1 650 2012-03-06 10:00:00 30
## 4: 2 450 2012-02-01 10:00:00 0
## 5: 2 240 2012-02-02 13:00:00 1
## 6: 3 340 2012-05-06 09:00:00 0
## 7: 3 240 2012-05-07 23:00:00 1
library(reshape2)
dcast(formula = ID ~ DIFTIME, data = DT[, list(ID, DIFTIME, VALUE)])
## ID 0 1 30
## 1 1 300 150 650
## 2 2 450 240 NA
## 3 3 340 240 NA
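Note that diff() on as.Date() values gives whole-day gaps between consecutive rows, while the question asks for the elapsed time since the first row of each ID, with fractions of a day. A base R sketch of that variant, assuming the timestamps printed above and tz = "UTC" (both retyped here for illustration):

```r
dat <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  DATETIME = as.POSIXct(c("2012-02-04 10:00:00", "2012-02-05 23:00:00",
                          "2012-03-06 10:00:00", "2012-02-01 10:00:00",
                          "2012-02-02 13:00:00", "2012-05-06 09:00:00",
                          "2012-05-07 23:00:00"), tz = "UTC"),
  VALUE = c(300L, 150L, 650L, 450L, 240L, 340L, 240L)
)

# Seconds since the epoch, and the earliest timestamp within each ID
secs <- as.numeric(dat$DATETIME)
first_secs <- ave(secs, dat$ID, FUN = min)

# Elapsed days since the first observation of the same ID (fractions kept)
dat$DIFTIME <- (secs - first_secs) / 86400
dat
```

Rounding DIFTIME before reshaping to wide format keeps the number of resulting columns manageable.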
Data in a handy format.
Here is my dat:
dat <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), DATETIME = structure(c(1328346000,
1328479200, 1331024400, 1328086800, 1328184000, 1336287600, 1336424400
), class = c("POSIXct", "POSIXt"), tzone = ""), VALUE = c(300L,
150L, 650L, 450L, 240L, 340L, 240L)), .Names = c("ID", "DATETIME",
"VALUE"), class = "data.frame", row.names = c(NA, 7L))

Resources