Find un-arrangeable consecutive time intervals with exactly n days difference - r

I have data as follows and I need to group rows whose dates chain together, i.e. where one row's time_right + 1 day equals another row's time_left. The group id of each chain is the minimum id among the records that satisfy this condition.
input = data.frame(id = 1:6,
                   time_left = c("2016-01-01", "2016-09-05", "2016-09-06",
                                 "2016-09-08", "2016-09-12", "2016-09-15"),
                   time_right = c("2016-09-07", "2016-09-11", "2016-09-12",
                                  "2016-09-14", "2016-09-18", "2016-09-21"))
Input
id time_left time_right
1 1 2016-01-01 2016-09-07
2 2 2016-09-05 2016-09-11
3 3 2016-09-06 2016-09-12
4 4 2016-09-08 2016-09-14
5 5 2016-09-12 2016-09-18
6 6 2016-09-15 2016-09-21
Output:
id time_left time_right group_id
1 1 2016-01-01 2016-09-07 1
2 2 2016-09-05 2016-09-11 2
3 3 2016-09-06 2016-09-12 3
4 4 2016-09-08 2016-09-14 1
5 5 2016-09-12 2016-09-18 2
6 6 2016-09-15 2016-09-21 1
Is there any way to do it with dplyr?
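One sketch (not pure dplyr: the chain propagation needs a loop, and it assumes ids increase along each chain as in the example; for arbitrary orderings you would propagate in both directions or use connected components): start each row in its own group, then repeatedly let a row inherit the smaller group_id of the row whose time_right is one day before its time_left, until nothing changes.
library(dplyr)
out <- input %>%
  mutate(time_left = as.Date(time_left),
         time_right = as.Date(time_right),
         group_id = id)
repeat {
  prev <- out$group_id
  # row whose time_right is exactly one day before this row's time_left
  idx <- match(out$time_left - 1, out$time_right)
  out$group_id <- ifelse(is.na(idx), out$group_id,
                         pmin(out$group_id, out$group_id[idx]))
  if (identical(prev, out$group_id)) break
}
This reproduces the output above: rows 1, 4 and 6 chain into group 1, rows 2 and 5 into group 2, and row 3 stays alone.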

Related

How to print a date when the input is the number of days since 01-01-60?

I received a set of dates, but it turns out that time is reported in days since 01-01-1960 in this specific data set.
D_INDDTO
1 20758
2 20856
3 21062
4 19740
5 21222
6 21203
The specific date of interest for Patient 1 is 20758 days since 01-01-60
I want to create a new covariate u$date containing the specific date of interest in %d%m%y format. I tried
library(tidyverse)
u %>% mutate(date=as.date(D_INDDTO,origin="1960-01-01"))
But that did not solve it.
u <- structure(list(D_INDDTO = c(20758, 20856, 21062, 19740, 21222,
21203, 20976, 20895, 18656, 18746)), row.names = c(NA, 10L), class = "data.frame")
Try this (note that R's function is as.Date with a capital D; as.date does not exist):
#Code 1
u %>% mutate(date=as.Date("1960-01-01")+D_INDDTO)
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 2
u %>% mutate(date=as.Date(D_INDDTO,origin="1960-01-01"))
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 3
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d%m%y'))
Output:
D_INDDTO date
1 20758 311016
2 20856 060217
3 21062 310817
4 19740 170114
5 21222 070218
6 21203 190118
7 20976 060617
8 20895 170317
9 18656 290111
10 18746 290411
If more customization is required (note that format() returns a character vector, so the column is no longer a Date):
#Code 4
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d-%m-%Y'))
Output:
D_INDDTO date
1 20758 31-10-2016
2 20856 06-02-2017
3 21062 31-08-2017
4 19740 17-01-2014
5 21222 07-02-2018
6 21203 19-01-2018
7 20976 06-06-2017
8 20895 17-03-2017
9 18656 29-01-2011
10 18746 29-04-2011
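A quick way to sanity-check the origin (not part of the answers above) is to round-trip back to the day count:
u %>% mutate(date = as.Date(D_INDDTO, origin = "1960-01-01"),
             days_back = as.numeric(date - as.Date("1960-01-01")))
days_back reproduces D_INDDTO in every row.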

Calculate maximum date interval - R

The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular and I'm trying to calculate, per group, the length in days of the uninterrupted interval starting from the first start date.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would be a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap between rows 6 and 7 and to stop there, which gives a maximum uninterrupted interval of 34 days. A degenerate interval such as 2018-10-01 to 2018-10-01 would be counted as 1 day.
My usual lubridate approaches don't work with this example (e.g. interval %within% lag(interval)).
Any idea?
library(magrittr)
library(data.table)
setDT(data)
first_int <- function(start, stop){
  # TRUE where an interval starts after the previous one stopped (a gap);
  # rleid(...) == 1 keeps the first unbroken run of intervals
  ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
  list(start = min(start[ind]),
       stop = max(stop[ind]))
}
newdata <-
data[, first_int(start, stop), by = id] %>%
.[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
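For a dplyr version of the same idea (a sketch assuming rows are sorted by start within each id): flag the rows where a gap opens, keep only the leading gap-free run, then summarise.
library(dplyr)
data %>%
  group_by(id) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(gap = start - lag(stop, default = first(start)) > 0) %>%
  filter(cumsum(gap) == 0) %>%
  summarise(start = min(start), stop = max(stop),
            duration_from_start = as.integer(stop - start) + 1)
This yields the five-row summary shown in the second desired output.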

Fill in missing rows for dates by group [duplicate]

I have a data table like this, just much bigger:
library(data.table)
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- as.Date(c("2017-01-01","2017-05-01","2017-06-01",
                  "2017-02-01","2017-04-01","2017-05-01",
                  "2017-06-01","2017-01-01","2017-04-01",
                  "2017-05-01"), "%Y-%m-%d")
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
my_data <- data.table(customer_id, account_id, time, tenor, variable_x)
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-02-01 1 120
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
Each customer_id, account_id pair should have monthly observations from 2017-01-01 to 2017-06-01, but for some pairs some dates in this six-month sequence are missing. I would like to fill in those missing dates so that each pair has observations for all 6 months, just with tenor and variable_x missing. That is, it should look like this:
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-02-01 NA NA
1 11 2017-03-01 NA NA
1 11 2017-04-01 NA NA
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-01-01 NA NA
2 55 2017-02-01 1 120
2 55 2017-03-01 NA NA
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-02-01 NA NA
3 38 2017-03-01 NA NA
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
3 38 2017-06-01 NA NA
I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 by using
ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")
and then merge it to the original data with
ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by="time", all.x=TRUE)
but it is not working: the merge is by time alone, so the sequence is not expanded separately for each customer_id, account_id pair. Do you know how to add such rows with dates for each pair?
We can do a join. Create the sequence of 'time' from min to max by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and then join on those two columns plus 'time':
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time = ts1), .(customer_id, account_id)],
        on = .(customer_id, account_id, time)]
# customer_id account_id time tenor variable_x
# 1: 1 11 2017-01-01 1 87
# 2: 1 11 2017-02-01 NA NA
# 3: 1 11 2017-03-01 NA NA
# 4: 1 11 2017-04-01 NA NA
# 5: 1 11 2017-05-01 2 90
# 6: 1 11 2017-06-01 3 100
# 7: 2 55 2017-01-01 NA NA
# 8: 2 55 2017-02-01 1 120
# 9: 2 55 2017-03-01 NA NA
#10: 2 55 2017-04-01 2 130
#11: 2 55 2017-05-01 3 150
#12: 2 55 2017-06-01 4 12
#13: 3 38 2017-01-01 1 13
#14: 3 38 2017-02-01 NA NA
#15: 3 38 2017-03-01 NA NA
#16: 3 38 2017-04-01 2 15
#17: 3 38 2017-05-01 3 14
#18: 3 38 2017-06-01 NA NA
Or using tidyverse
library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
mutate(time = list(ts1)) %>%
unnest %>%
left_join(my_data)
Or with complete from tidyr
my_data %>%
complete(nesting(customer_id, account_id), time = ts1)
A different data.table approach:
my_data2 <- my_data[, .(time = seq(as.Date("2017/01/01"), as.Date("2017/06/01"),
by = "month")), by = list(customer_id, account_id)]
merge(my_data2, my_data, all.x = TRUE)
customer_id account_id time tenor variable_x
1: 1 11 2017-01-01 1 87
2: 1 11 2017-02-01 NA NA
3: 1 11 2017-03-01 NA NA
4: 1 11 2017-04-01 NA NA
5: 1 11 2017-05-01 2 90
6: 1 11 2017-06-01 3 100
7: 2 55 2017-01-01 NA NA
8: 2 55 2017-02-01 1 120
9: 2 55 2017-03-01 NA NA
10: 2 55 2017-04-01 2 130
11: 2 55 2017-05-01 3 150
12: 2 55 2017-06-01 4 12
13: 3 38 2017-01-01 1 13
14: 3 38 2017-02-01 NA NA
15: 3 38 2017-03-01 NA NA
16: 3 38 2017-04-01 2 15
17: 3 38 2017-05-01 3 14
18: 3 38 2017-06-01 NA NA

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row over values that occur in two columns before and after that point. In this case I have the volume of 2 incident types at every given minute over two days. I want to create a column which, for each row, adds all the incidents of one type that occurred before it and all of the other type that occurred after it. Excel's SUMIF comes to mind, but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
library(data.table)
set.seed(42)
master_min =
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-2 23:00", tz = "America/New_York"),
      by = "min"
    ))
  )
incident1 = round(runif(2821, min = 0, max = 10))
incident2 = round(runif(2821, min = 0, max = 10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic: for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after it? A data.table solution would be great, or dplyr otherwise, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
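Since a data.table solution was requested, the three lines above can also be written as a single := assignment (equivalent, using the same exclusive sums):
master_min[, new_column := cumsum(incident1) - incident1 +
             rev(cumsum(rev(incident2))) - incident2]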
*update
The following two lines can also do the job, with slightly different semantics: here sum1 includes the current row's incident1 and sum2 excludes the current row's incident2, so subtract incident1 from sum1 to reproduce new_column above.
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the example a bit to show a more comprehensive structure:
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0

R Removing earliest observations when duplicate IDs are present

I have a data frame which looks something like this:
ID = c(1,1,1,1,2,2,3,3,3,3,4,4)
TIME = as.POSIXct(c("2013-03-31 09:07:00", "2013-09-26 10:07:00", "2013-03-31 11:07:00",
"2013-09-26 12:07:00","2013-03-31 09:10:00","2013-03-31 11:11:00",
"2013-03-31 09:06:00","2013-09-26 09:04:00","2013-03-31 10:35:00",
"2013-09-26 09:07:00","2013-09-26 09:07:00","2013-09-26 10:07:00"))
var = c(0,0,1,1,0,1,0,0,1,1,0,1)
DF = data.frame(ID, TIME, var)
ID TIME var
1 1 2013-03-31 09:07:00 0
2 1 2013-09-26 10:07:00 0
3 1 2013-03-31 11:07:00 1
4 1 2013-09-26 12:07:00 1
5 2 2013-03-31 09:10:00 0
6 2 2013-03-31 11:11:00 1
7 3 2013-03-31 09:06:00 0
8 3 2013-09-26 09:04:00 0
9 3 2013-03-31 10:35:00 1
10 3 2013-09-26 09:07:00 1
11 4 2013-09-26 09:07:00 0
12 4 2013-09-26 10:07:00 1
I would like to remove the row containing the earliest TIME value whenever identical ID and var are present in the data, i.e. to end up with something like this:
ID2 = c(1,1,2,2,3,3,4,4)
TIME2 = as.POSIXct(c("2013-09-26 10:07:00","2013-09-26 12:07:00","2013-03-31 09:10:00",
"2013-03-31 11:11:00","2013-09-26 09:04:00","2013-09-26 09:07:00",
"2013-09-26 09:07:00","2013-09-26 10:07:00"))
var2 = c(0,1,0,1,0,1,0,1)
DF2 = data.frame(ID2, TIME2, var2)
ID2 TIME2 var2
1 1 2013-09-26 10:07:00 0
2 1 2013-09-26 12:07:00 1
3 2 2013-03-31 09:10:00 0
4 2 2013-03-31 11:11:00 1
5 3 2013-09-26 09:04:00 0
6 3 2013-09-26 09:07:00 1
7 4 2013-09-26 09:07:00 0
8 4 2013-09-26 10:07:00 1
As you can see, it is not simply about dropping the measurements performed in March 2013, since those are valid. Only the measurements that have "duplicates" performed again in September should be affected (see, for example, that ID = 2 remains in DF2).
Hope you can help.
Sincerely,
ykl
Here's an option with dplyr:
library(dplyr)
DF %>% group_by(ID, var) %>% filter(n() == 1L | !TIME %in% min(TIME))
#Source: local data frame [8 x 3]
#Groups: ID, var
#
# ID TIME var
#1 1 2013-09-26 10:07:00 0
#2 1 2013-09-26 12:07:00 1
#3 2 2013-03-31 09:10:00 0
#4 2 2013-03-31 11:11:00 1
#5 3 2013-09-26 09:04:00 0
#6 3 2013-09-26 09:07:00 1
#7 4 2013-09-26 09:07:00 0
#8 4 2013-09-26 10:07:00 1
What this does:
Take the data frame DF and group it by ID and var.
The filter function is used to subset by row: it takes a logical vector and returns the rows for which the vector is TRUE. The logic is:
1) if the group has only 1 row, i.e. n() == 1L, always return that row;
2) if the group has more than 1 row, i.e. n() > 1L, check whether the TIME value equals the minimum (earliest) TIME value of the group. By using ! we negate the vector so that it is FALSE whenever TIME is at its minimum.
Conditions 1) and 2) are combined with an OR (|).
An option using data.table
library(data.table)
setDT(DF)[ ,{if(.N==1) .SD else .SD[-which.min(TIME)]}, by=list(ID, var)]
# ID var TIME
#1: 1 0 2013-09-26 10:07:00
#2: 1 1 2013-09-26 12:07:00
#3: 2 0 2013-03-31 09:10:00
#4: 2 1 2013-03-31 11:11:00
#5: 3 0 2013-09-26 09:04:00
#6: 3 1 2013-09-26 09:07:00
#7: 4 0 2013-09-26 09:07:00
#8: 4 1 2013-09-26 10:07:00
Or a similar logical approach as shown by @docendo discimus (note that !TIME %in% min(TIME) drops every row tied for the earliest TIME, whereas which.min removes only the first one):
setDT(DF)[DF[,.N==1L|!TIME %in% min(TIME), by=list(ID, var)]$V1]
