Comparing dates inside group using same reference - r

I have a data table for different patients ("Spell") and several temperature ("Temp") measures for each patient ("Episode"). I also have the date and time at which each temperature was taken.
Spell Episode Date Temp
1 3 2-1-17 21:00 40
1 2 2-1-17 20:00 36
1 1 1-1-17 10:00 37
2 3 2-1-17 15:00 36
2 2 2-1-17 10:00 37
2 1 1-1-17 8:00 36
3 1 3-1-17 10:00 40
4 3 4-1-17 15:00 36
4 2 3-1-17 12:00 40
4 1 3-1-17 10:00 39
5 7 3-1-17 17:30 36
5 6 2-1-17 17:00 36
5 5 2-1-17 16:00 37
5 1 1-1-17 9:00 36
5 4 1-1-17 14:00 39
5 3 1-1-17 13:00 40
5 2 1-1-17 11:00 39
I am interested in keeping all the measurements taken within the 24 hours before the last one. I have grouped the observations by Spell and sorted them by descending date, but I am unsure how to do the in-group comparison against a common reference (in this case, the first row of each group). The result should be:
Spell Episode Date Temp
1 3 2-1-17 21:00 40
1 2 2-1-17 20:00 36
2 3 2-1-17 15:00 36
2 2 2-1-17 10:00 37
3 1 3-1-17 10:00 40
4 3 4-1-17 15:00 36
5 7 3-1-17 17:30 36
Would appreciate any ideas that point me in the right direction.
Edit: Date is in d-m-yy H:M format. Here's the dput of the data:
structure(list(Spell = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), Episode = c(3L, 2L, 1L, 3L,
2L, 1L, 1L, 3L, 2L, 1L, 7L, 6L, 5L, 1L, 4L, 3L, 2L), Date = c("2-1-17 21:00",
"2-1-17 20:00", "1-1-17 10:00", "2-1-17 15:00", "2-1-17 10:00",
"1-1-17 8:00", "3-1-17 10:00", "4-1-17 15:00", "3-1-17 12:00",
"3-1-17 10:00", "3-1-17 17:30", "2-1-17 17:00", "2-1-17 16:00",
"1-1-17 9:00", "1-1-17 14:00", "1-1-17 13:00", "1-1-17 11:00"
), Temp = c(40L, 36L, 37L, 36L, 37L, 36L, 40L, 36L, 40L, 39L,
36L, 36L, 37L, 36L, 39L, 40L, 39L)), .Names = c("Spell", "Episode",
"Date", "Temp"), class = c("data.table", "data.frame"), row.names = c(NA,
-17L))

A dplyr approach: parse Date, then keep the rows within 24 hours of each Spell's latest measurement.
library(dplyr)
df %>%
  mutate(Date2 = as.numeric(strptime(Date, "%d-%m-%y %H:%M"))) %>%
  group_by(Spell) %>%
  filter(Date2 >= (max(Date2) - 60*60*24)) %>%
  select(-Date2)

A solution using only data.table:
# convert Date column to POSIXct
DT[,Date:=as.POSIXct(Date,format='%d-%m-%y %H:%M',tz='GMT')]
# filter the data.table
filteredDT <- DT[, .SD[as.numeric(difftime(max(Date),Date,units='hours')) <= 24], by = Spell]
> filteredDT
Spell Episode Date Temp
1: 1 3 2017-01-02 21:00:00 40
2: 1 2 2017-01-02 20:00:00 36
3: 2 3 2017-01-02 15:00:00 36
4: 2 2 2017-01-02 10:00:00 37
5: 3 1 2017-01-03 10:00:00 40
6: 4 3 2017-01-04 15:00:00 36
7: 5 7 2017-01-03 17:30:00 36
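If the grouped .SD subsetting is slow on a large table, an equivalent sketch that collects row numbers with .I and subsets once (same DT, same 24-hour rule):
# per Spell, keep the row numbers within 24 hours of that Spell's latest Date
idx <- DT[, .I[as.numeric(difftime(max(Date), Date, units = 'hours')) <= 24], by = Spell]$V1
filteredDT <- DT[idx]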

A base R alternative using tapply():
mydata$Date <- as.POSIXct(mydata$Date, format = '%d-%m-%y %H:%M', tz = 'GMT')
mydata <- mydata[with(mydata, order(Spell, -as.numeric(Date))), ]
index <- with(mydata, tapply(Date, Spell, function(x) x >= max(x) - as.difftime(1, units = "days")))
mydata[unlist(index), ]
Spell Episode Date Temp
1: 1 3 2017-01-02 21:00:00 40
2: 1 2 2017-01-02 20:00:00 36
4: 2 3 2017-01-02 15:00:00 36
5: 2 2 2017-01-02 10:00:00 37
7: 3 1 2017-01-03 10:00:00 40
8: 4 3 2017-01-04 15:00:00 36
11: 5 7 2017-01-03 17:30:00 36
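A variant of the same base R idea uses ave() instead of tapply(), so the logical index comes back already aligned with the rows (a sketch, assuming Date has been converted to POSIXct as above):
# per Spell, flag rows within 24 hours of that Spell's latest Date
keep <- as.logical(with(mydata, ave(as.numeric(Date), Spell,
                                    FUN = function(x) x >= max(x) - 24*60*60)))
mydata[keep, ]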

The solution below uses two functions from Hadley Wickham's lubridate package. This package is very handy when dealing with dates and times, so I wonder why it hasn't been used in any of the other answers.
Furthermore, data.table is used because the OP has provided sample data of data.table class.
library(data.table) # if not already loaded
# coerce Date to POSIXct
DT[, Date := lubridate::dmy_hm(Date)][
  # for each Spell, pick measurements within the last 24 hours
  , .SD[Date > max(Date) - lubridate::dhours(24L)], by = Spell][
  # order, just for convenience
  order(Spell, -Date)]
Spell Episode Date Temp
1: 1 3 2017-01-02 21:00:00 40
2: 1 2 2017-01-02 20:00:00 36
3: 2 3 2017-01-02 15:00:00 36
4: 2 2 2017-01-02 10:00:00 37
5: 3 1 2017-01-03 10:00:00 40
6: 4 3 2017-01-04 15:00:00 36
7: 5 7 2017-01-03 17:30:00 36
Please note that the expected result given by the OP shows an additional row (Spell 5, Episode 6) which is outside of the 24 hrs window.
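For reference, lubridate::dmy_hm() parses the day-month-year strings (including the two-digit year) directly:
lubridate::dmy_hm("2-1-17 21:00")
# [1] "2017-01-02 21:00:00 UTC"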
Data
As provided by the OP
DT <- structure(list(Spell = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), Episode = c(3L, 2L, 1L, 3L,
2L, 1L, 1L, 3L, 2L, 1L, 7L, 6L, 5L, 1L, 4L, 3L, 2L), Date = c("2-1-17 21:00",
"2-1-17 20:00", "1-1-17 10:00", "2-1-17 15:00", "2-1-17 10:00",
"1-1-17 8:00", "3-1-17 10:00", "4-1-17 15:00", "3-1-17 12:00",
"3-1-17 10:00", "3-1-17 17:30", "2-1-17 17:00", "2-1-17 16:00",
"1-1-17 9:00", "1-1-17 14:00", "1-1-17 13:00", "1-1-17 11:00"
), Temp = c(40L, 36L, 37L, 36L, 37L, 36L, 40L, 36L, 40L, 39L,
36L, 36L, 37L, 36L, 39L, 40L, 39L)), .Names = c("Spell", "Episode",
"Date", "Temp"), class = c("data.table", "data.frame"), row.names = c(NA, -17L))

Related

Is there R code to remove duplicate events of the same id with two to three different column conditions?

I have a data frame with thousands of ids with several events per id and enrollment dates, course and record. Course is categorical: module1, module2, module3, module4, module5 and withdrawn (any module). For example, a few rows look like below
id event enrolment_date Enrolment_to course record
1 42 2012-07-01 2013-06-30 module 5 2
1 42 2018-07-01 2019-06-30 **module 4** 1
1 43 2012-07-01 2013-06-30 module 5 2
1 43 2018-07-01 2019-06-30 **module 4** 1
2 50 2017-04-01 2018-03-31 **module 5** 2
2 50 2017-07-01 2018-03-31 module 4 1
2 34 2017-04-01 2018-03-31 **module 5** 2
2 34 2017-07-01 2018-03-31 module 4 1
3 23 2014-08-20 2015-07-20 module 5 1
3 23 2014-08-20 2015-07-20 module 4 2
3 23 2015-07-04 2016-06-04 **withdrawn** 3
4 13 2017-09-01 2018-08-01 module 4 1
4 13 2017-09-01 2018-08-01 **module 5** 2
4 23 2017-09-01 2018-08-01 module 4 1
4 23 2017-09-01 2018-08-01 **module 5** 2
I would like to retain the 2nd, 4th, 5th, 7th, 11th, 13th, and 15th rows in the data frame (education). I tried factoring course, which wrongly assigns module 5 for events 42 & 43; if I go by max date instead, it wrongly assigns module 4 to events 50 & 34.
I would like the data to look like below
id event status_date Course record
1 42 2018-07-01 module 4 1
1 43 2018-07-01 module 4 1
2 50 2017-04-01 module 5 2
2 34 2016-04-01 module 5 2
3 23 2015-07-04 withdrawn 3
4 13 2017-09-01 module 5 2
4 23 2017-09-01 module 5 2
If I have understood all the requirements correctly, here is a function which selects the correct row in each group
library(dplyr)
select_dates <- function(start, end, course) {
  # If all start dates are the same, return the row with "module5"
  if (n_distinct(start) == 1)
    which.max(course == "module5")
  else {
    # Get courses which are currently enrolled
    inds <- max(start) < end
    # If any currently enrolled course is "module5" and none is "withdrawn"
    if (any(course[inds] == "module5") & all(course[inds] != "withdrawn"))
      # return the currently enrolled row with "module5"
      which.max(inds & course == "module5")
    else
      # otherwise return the currently enrolled row with the latest start date
      which.max(start == max(start[inds]))
  }
}
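The helper returns a single row position for slice(); the trick it relies on is that which.max() of a logical vector returns the position of the first TRUE:
which.max(c(FALSE, TRUE, TRUE))
# [1] 2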
We then apply it for each id and event
df %>%
  mutate_at(vars(enrolment_date, Enrolment_to), as.Date) %>%
  group_by(id, event) %>%
  slice(select_dates(enrolment_date, Enrolment_to, course))
# id event enrolment_date Enrolment_to course record
# <int> <int> <date> <date> <chr> <int>
#1 1 42 2018-07-01 2019-06-30 module4 1
#2 1 43 2018-07-01 2019-06-30 module4 1
#3 2 34 2017-04-01 2018-03-31 module5 2
#4 2 50 2017-04-01 2018-03-31 module5 2
#5 3 23 2015-07-04 2016-06-04 withdrawn 3
#6 4 13 2017-09-01 2018-08-01 module5 2
#7 4 23 2017-09-01 2018-08-01 module5 2
Note that you need to change the strings in the function ("module5" and "withdrawn") and the column names (enrolment_date and Enrolment_to) based on what you have in your data.
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 4L), event = c(42L, 42L, 43L, 43L, 50L, 50L,
34L, 34L, 23L, 23L, 23L, 13L, 13L, 23L, 23L), enrolment_date = c("2012-07-01",
"2018-07-01", "2012-07-01", "2018-07-01", "2017-04-01", "2017-07-01",
"2017-04-01", "2017-07-01", "2014-08-20", "2014-08-20", "2015-07-04",
"2017-09-01", "2017-09-01", "2017-09-01", "2017-09-01"), Enrolment_to = c("2013-06-30",
"2019-06-30", "2013-06-30", "2019-06-30", "2018-03-31", "2018-03-31",
"2018-03-31", "2018-03-31", "2015-07-20", "2015-07-20", "2016-06-04",
"2018-08-01", "2018-08-01", "2018-08-01", "2018-08-01"), course = c("module5",
"module4", "module5", "module4", "module5", "module4", "module5",
"module4", "module5", "module4", "withdrawn", "module4", "module5",
"module4", "module5"), record = c(2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 3L, 1L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -15L))
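If you prefer data.table here, the same select_dates() helper can be applied per id/event group (a sketch; dplyr still needs to be loaded for n_distinct() inside the function):
library(data.table)
setDT(df)[, c("enrolment_date", "Enrolment_to") := lapply(.SD, as.Date),
          .SDcols = c("enrolment_date", "Enrolment_to")][
  , .SD[select_dates(enrolment_date, Enrolment_to, course)], by = .(id, event)]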

How do you split a time series into separate events and assign an event ID?

I want to split an irregular time series into separate events and assign each event a unique numerical ID within each site.
Here is an example data frame:
structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AllenBrook", "Eastberk"), class =
"factor"),
timestamp = structure(c(10L, 13L, 8L, 4L, 5L, 6L, 7L, 9L,
11L, 12L, 1L, 2L, 3L), .Label = c("10/1/12 11:29", "10/1/12 14:29",
"10/1/12 17:29", "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30",
"10/21/12 4:30", "9/5/12 12:30", "9/5/12 4:14", "9/5/12 6:30",
"9/5/12 7:14", "9/5/12 7:44", "9/5/12 9:30"), class = "factor")), class
= "data.frame", row.names = c(NA,
-13L))
Events do not all have the same duration or number of timestamps, so I want to start a new event whenever more than 12 hours elapse between one timestamp and the next at a site. Each event at a site should receive a unique numerical ID. Here's the outcome I would like:
site timestamp eventid
1 AllenBrook 9/5/12 6:30 1
2 AllenBrook 9/5/12 9:30 1
3 AllenBrook 9/5/12 12:30 1
4 AllenBrook 10/20/12 16:30 2
5 AllenBrook 10/20/12 19:30 2
6 AllenBrook 10/21/12 1:30 2
7 AllenBrook 10/21/12 4:30 2
8 Eastberk 9/5/12 4:14 1
9 Eastberk 9/5/12 7:14 1
10 Eastberk 9/5/12 7:44 1
11 Eastberk 10/1/12 11:29 2
12 Eastberk 10/1/12 14:29 2
13 Eastberk 10/1/12 17:29 2
Any coding solution will do, but bonus points for a tidyverse or data.table solution. Thanks for any help you can provide!
Using data.table, you can perhaps do the following:
library(data.table)
setDT(tmp)[, timestamp := as.POSIXct(timestamp, format="%m/%d/%y %H:%M")][,
eventid := 1L+cumsum(c(0L, diff(timestamp)>720)), by=.(site)]
diff(timestamp) calculates the time difference between adjacent rows; we then check whether that difference is greater than 12 h (720 minutes; with these gaps, diff() reports the differences in minutes). A common trick in R is to use cumsum to mark when a new event starts in a series and to group subsequent elements with that event until the next one starts. Since diff() returns one element fewer than its input, we pad the beginning with 0L. The 1L + merely starts the numbering at 1 instead of 0.
output:
site timestamp eventid
1: AllenBrook 2012-09-05 06:30:00 1
2: AllenBrook 2012-09-05 09:30:00 1
3: AllenBrook 2012-09-05 12:30:00 1
4: AllenBrook 2012-10-20 16:30:00 2
5: AllenBrook 2012-10-20 19:30:00 2
6: AllenBrook 2012-10-21 01:30:00 2
7: AllenBrook 2012-10-21 04:30:00 2
8: Eastberk 2012-09-05 04:14:00 1
9: Eastberk 2012-09-05 07:14:00 1
10: Eastberk 2012-09-05 07:44:00 1
11: Eastberk 2012-10-01 11:29:00 2
12: Eastberk 2012-10-01 14:29:00 2
13: Eastberk 2012-10-01 17:29:00 2
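Since the question also mentions the tidyverse, here is a dplyr sketch of the same cumsum idea (assuming the tmp data below, already ordered by time within each site; lag() supplies the previous timestamp in the group):
library(dplyr)

tmp %>%
  mutate(timestamp = as.POSIXct(as.character(timestamp), format = "%m/%d/%y %H:%M")) %>%
  group_by(site) %>%
  # start a new event whenever the gap to the previous timestamp exceeds 12 hours
  mutate(eventid = 1L + cumsum(
    as.numeric(difftime(timestamp, lag(timestamp, default = first(timestamp)),
                        units = "hours")) > 12)) %>%
  ungroup()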
data:
tmp <- structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AllenBrook", "Eastberk"), class =
"factor"),
timestamp = structure(c(10L, 13L, 8L, 4L, 5L, 6L, 7L, 9L,
11L, 12L, 1L, 2L, 3L), .Label = c("10/1/12 11:29", "10/1/12 14:29",
"10/1/12 17:29", "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30",
"10/21/12 4:30", "9/5/12 12:30", "9/5/12 4:14", "9/5/12 6:30",
"9/5/12 7:14", "9/5/12 7:44", "9/5/12 9:30"), class = "factor")), class
= "data.frame", row.names = c(NA,
-13L))

How to filter rows that are 365 days (+/90) from the first date?

This is actually an adaptation from another question I posted previously. I hope other users can offer feedback on my code or suggest better alternatives. Thank you!
I have a dataset that contains lab tests, and I only want lab tests that are 365 days (+/- 90 days) from the first lab test.
If a patient does not have any values that fall within the 365 +/- 90 days from the first date, we only output the first date, like in PATIENT_ID == 1.
I want to output the (i) first date, the (ii) next date within the 365 +/- 90 days range and closest to the 365-day point from the first date, then the (iii) next date within the 365 +/- 90 days range and closest to the 365-day point from the second date and so on. For PATIENT_ID == 2, both 30/05/2016 and 01/08/2016 are within the 365 +/- 90 day-range from the first date, but only the latter is chosen as it is closer to the 365-day mark. The third date 27/07/2017 is chosen because it is within the 365 +/- 90 day-range from the second date and so on.
Data:
PATIENT_ID LAB_TEST_DATE LAB_TEST
1: 1 2012-11-19 31
2: 1 2012-11-21 30
3: 1 2012-11-23 31
4: 1 2012-11-26 30
5: 1 2012-11-28 30
6: 1 2012-12-01 30
7: 1 2012-12-05 29
8: 1 2012-12-06 30
9: 2 2015-07-23 43
10: 2 2015-08-05 41
11: 2 2015-08-19 44
12: 2 2015-09-02 41
13: 2 2015-09-30 40
14: 2 2015-12-23 45
15: 2 2016-03-16 46
16: 2 2016-05-30 40
17: 2 2016-08-01 46
18: 2 2017-07-27 44
19: 2 2018-10-15 49
20: 3 2011-08-11 30
...trunc...
Desired Output:
PATIENT_ID LAB_TEST_DATE LAB_TEST
1 19/11/2012 31
2 23/07/2015 43
2 01/08/2016 46
2 27/07/2017 44
2 15/10/2018 49
3 11/08/2011 30
3 13/08/2012 36
4 01/10/2014 41
4 26/08/2015 42
dput data:
df <- structure(list(PATIENT_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
LAB_TEST_DATE = structure(c(15663, 15665, 15667, 15670, 15672,
15675, 15679, 15680, 16639, 16652, 16666, 16680, 16708, 16792,
16876, 16951, 17014, 17374, 17819, 15197, 15202, 15217, 15300,
15335, 15357, 15405, 15413, 15434, 15453, 15565, 16344, 16352,
16364, 16379, 16414, 16442, 16505, 16589, 16673), class = "Date"),
LAB_TEST = c(31L, 30L, 31L, 30L, 30L, 30L, 29L, 30L, 43L,
41L, 44L, 41L, 40L, 45L, 46L, 40L, 46L, 44L, 49L, 30L, 31L,
34L, 34L, 36L, 36L, 33L, 36L, 33L, 35L, 36L, 41L, 43L, 43L,
40L, 39L, 42L, 40L, 40L, 42L)), class = "data.frame", .Names = c("PATIENT_ID",
"LAB_TEST_DATE", "LAB_TEST"), row.names = c(NA, -39L))
Code:
I wrote a recursive function: if a date is within the range and closest to the 365-day mark, I keep that date and repeat from there.
library(dplyr)

f <- function(d, ind = 1) {
  # days from the current reference date to every date
  datediff <- difftime(d, d[ind], units = "days")
  # candidates within 365 +/- 90 days of the reference
  ind.range <- which(datediff >= 275 & datediff <= 455)
  # position of the date closest to the 365-day mark
  ind.min <- which.min(abs(datediff - 365))
  ind.next <- first(intersect(ind.range, ind.min))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}
df %>% group_by(PATIENT_ID) %>% slice(f(LAB_TEST_DATE))
Here is a solution using data.table. Explanation inline.
library(data.table)
setDT(df)
#extract the first visit for each patient
firstDates <- df[, .SD[1L], by=PATIENT_ID]
#create the period for each lab test
df[, ':=' (STARTDATE=LAB_TEST_DATE+365-90, ENDDATE=LAB_TEST_DATE+365+90)]
#for each lab test, find the lab tests that are within 365 +/- 90 days
#after that lab test by performing a non-equi self join
withinPeriod <- df[
  df,
  .(PATIENT_ID, x.LAB_TEST, x.LAB_TEST_DATE, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST),
  by = .EACHI,
  on = .(PATIENT_ID, LAB_TEST_DATE >= STARTDATE, LAB_TEST_DATE <= ENDDATE)][
    !is.na(x.LAB_TEST), -3L:-1L]
#find the lab test that is closest to the 365 days after that lab test
#and extract only relevant columns
selected <- withinPeriod[, .SD[which.min(abs(i.LAB_TEST_DATE + 365 - x.LAB_TEST_DATE))],
  by = .(PATIENT_ID, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST)][,
    .(PATIENT_ID, LAB_TEST_DATE = x.LAB_TEST_DATE, LAB_TEST = x.LAB_TEST)]
#cbind first dates with those selected
ans <- rbindlist(list(firstDates, unique(selected)), use.names=TRUE)
setorder(ans, PATIENT_ID, LAB_TEST_DATE)
ans
# PATIENT_ID LAB_TEST_DATE LAB_TEST
#1: 1 2012-11-19 31
#2: 2 2015-07-23 43
#3: 2 2016-08-01 46
#4: 2 2017-07-27 44
#5: 2 2018-10-15 49
#6: 3 2011-08-11 30
#7: 3 2012-08-13 36
#8: 4 2014-10-01 41
#9: 4 2015-08-26 42
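As a quick sanity check, the recursive slice() approach from the question selects the same rows on this data (a sketch; note that the code above has added STARTDATE and ENDDATE columns to df):
df %>%
  group_by(PATIENT_ID) %>%
  slice(f(LAB_TEST_DATE)) %>%
  select(PATIENT_ID, LAB_TEST_DATE, LAB_TEST)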

Reformat dataframe by moving begin time and end time based on group to begin and end time in new dataframe without looping

I have a working set of code using a for loop below to process a dataframe but need to optimize it without a for loop if possible. I have searched for a while to find something like this but must not know the proper search terms. Thanks for any help.
The example dataframe (a longer version is at the bottom) has a datetime column and a bottle column. The bottle column begins at some number (1 below), repeats as samples are added, switches to 2 and so on up to bottle 7 (in this case), then RESTARTS at 1 and goes up to 14 (in this case), over and over. (Note that there are more than 2 times per bottle in the real file.)
datetime bottle
6/9/2016 0:00 1
6/9/2016 0:15 1
6/9/2016 0:30 1
6/9/2016 0:45 1
6/9/2016 1:00 2
6/9/2016 1:15 2
6/9/2016 1:30 2
6/9/2016 1:45 3
6/9/2016 2:00 3
6/9/2016 2:15 4
6/9/2016 2:30 4
6/9/2016 2:45 5
6/9/2016 3:00 5
6/9/2016 3:15 6
6/9/2016 3:30 6
6/9/2016 3:45 7
6/9/2016 4:00 7
6/9/2016 4:15 7
6/9/2016 4:30 1
6/9/2016 4:45 1
6/9/2016 5:00 1
6/9/2016 5:15 2
6/9/2016 5:30 2
6/9/2016 5:45 2
6/9/2016 6:00 3
6/9/2016 6:15 3
6/9/2016 6:30 3
I need to create a new dataframe with bottle begin and end times. Note that each sequence of bottles is repeated.
bottle begin end
1 6/9/2016 0:00 6/9/2016 0:45
2 6/9/2016 1:00 6/9/2016 1:30
3 6/9/2016 1:45 6/9/2016 2:00
4 6/9/2016 2:15 6/9/2016 2:30
5 6/9/2016 2:45 6/9/2016 3:00
6 6/9/2016 3:15 6/9/2016 3:30
7 6/9/2016 3:45 6/9/2016 4:15
1 6/9/2016 4:30 6/9/2016 5:00
2 6/9/2016 5:15 6/9/2016 5:45
3 6/9/2016 6:00 6/9/2016 6:30
What I have done so far is the annotated code below. This works well but takes a long time on the full dataframe.
library(data.table)
library(dplyr)
#create id number for each bottle using data.table
setDT(t2s_bottle_timing.df)[, id := .GRP, by = t2s_bottle]
#declare/set variables
x1 <- 1
x2 <- 1
x3 <- 1
i <- 1
N <- length(t2s_bottle_timing.df$t2s_bottle)
#renumber id column to have a unique id for each bottle run
for (i in 2:(N - 1)) {
  x1 <- t2s_bottle_timing.df[(i), 2]     #load bottle numbers
  x2 <- t2s_bottle_timing.df[(i + 1), 2] #load bottle numbers
  if (x2 == x1) { t2s_bottle_timing.df[(i), 3] <- x3 } #set id number
  if (x2 != x1) { x3 <- x3 + 1 }         #increment id number
  t2s_bottle_timing.df[(i + 1), 3] <- x3 #load new id number into table
}
# get rid of unused stuff
rm(x1, x2, i, N, x3)
# summarise the raw dataframe to produce the bottle, begin, end dataframe
t2s_timing_output.df <- t2s_bottle_timing.df %>%
  group_by(id, t2s_bottle) %>%
  summarize(
    begin = min(datetime),
    end = max(datetime))
So this works, but I am eager to learn an alternative and more efficient way to do this.
t2s_bottle_timing.df <- structure(list(datetime = structure(c(1465514100, 1465515000,
1465515900, 1465516800, 1465517700, 1465518600, 1465519500, 1465520400,
1465521300, 1465522200, 1465523100, 1465524000, 1465524900, 1465525800,
1465526700, 1465527600, 1465528500, 1465529400, 1465530300, 1465531200,
1465532100, 1465533000, 1465533900, 1465534800, 1465535700, 1465536600,
1465537500, 1465538400, 1465539300, 1465540200, 1465541100, 1465542000,
1465542900, 1465543800, 1465544700, 1465545600, 1465546500, 1465547400,
1465548300, 1465549200, 1465550100, 1465551000, 1465551900, 1465552800,
1465553700, 1465554600, 1465555500, 1465556400, 1465557300, 1465558200,
1465559100, 1465560000, 1465560900, 1465561800, 1465562700, 1465563600,
1465564500, 1465565400, 1465566300, 1465567200, 1465568100, 1465569000,
1465569900, 1465570800, 1465571700, 1465572600, 1465573500, 1465574400,
1465575300, 1465576200, 1465577100, 1465578000, 1465578900, 1465579800,
1465580700, 1465581600, 1465582500, 1465583400, 1465584300, 1465585200,
1465586100, 1465587000, 1465587900, 1465588800, 1465589700, 1465590600,
1465591500), tzone = "UTC", class = c("POSIXct", "POSIXt")),
t2s_bottle = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L,
7L, 7L, 7L, 7L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L)), .Names = c("datetime", "t2s_bottle"), row.names = c(NA,
-87L), spec = structure(list(cols = structure(list(datetime = structure(list(), class = c("collector_character",
"collector")), t2s_bottle = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("datetime", "t2s_bottle")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = c("tbl_df",
"tbl", "data.frame"))
Your example confuses me a little bit, but if what you want is to create an index, maybe cumsum over a logical could help:
t2s_bottle_timing.df %>%
  mutate(index = cumsum(t2s_bottle != dplyr::lag(t2s_bottle, default = 0))) %>%
  group_by(index, t2s_bottle) %>%
  summarise(begin = min(datetime), end = max(datetime))
index t2s_bottle begin end
<int> <int> <dttm> <dttm>
1 1 1 2016-06-09 23:15:00 2016-06-10 00:15:00
2 2 2 2016-06-10 00:30:00 2016-06-10 02:15:00
3 3 3 2016-06-10 02:30:00 2016-06-10 04:30:00
4 4 4 2016-06-10 04:45:00 2016-06-10 06:00:00
5 5 5 2016-06-10 06:15:00 2016-06-10 07:45:00
6 6 6 2016-06-10 08:00:00 2016-06-10 09:00:00
7 7 7 2016-06-10 09:15:00 2016-06-10 10:15:00
8 8 1 2016-06-10 10:30:00 2016-06-10 11:15:00
9 9 2 2016-06-10 11:30:00 2016-06-10 13:00:00
10 10 3 2016-06-10 13:15:00 2016-06-10 13:30:00
11 11 4 2016-06-10 13:45:00 2016-06-10 15:00:00
12 12 5 2016-06-10 15:15:00 2016-06-10 15:45:00
13 13 6 2016-06-10 16:00:00 2016-06-10 17:15:00
14 14 7 2016-06-10 17:30:00 2016-06-10 18:45:00
15 15 8 2016-06-10 19:00:00 2016-06-10 20:45:00
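If you prefer to stay in data.table (the question's code already calls setDT()), rleid() gives the same run index in one step, since it assigns a new id every time the bottle value changes; a sketch:
library(data.table)

setDT(t2s_bottle_timing.df)[, run := rleid(t2s_bottle)][
  , .(begin = min(datetime), end = max(datetime)), by = .(run, t2s_bottle)]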

Event dif_time fixing last date occurrence

I have some events identified by id, var1, var2 and date.
The desired output for dif_time is as follow:
id var1 var2 date1 date2 dif_time
1 120 1 2014-06-03 2014-06-30 26
1 120 1 2014-06-04 2014-06-30 26
1 120 4 2014-06-05 2014-06-30 25
2 220 1 2014-06-05 2014-06-30 23
2 220 1 2014-06-07 2014-06-30 23
3 120 2 2014-06-10 2014-06-30 15
3 120 2 2014-06-12 2014-06-30 15
3 120 1 2014-06-15 2014-06-30 15
5 220 3 2014-06-20 2014-06-30 10
I need to calculate the dif_time in days between date1 (the event date) and a control date date2.
The constraint is:
For each event (id, var1, var2) I need to find the last date1 (last.date1) and calculate
dif_time (days) = date2 - last.date1, then report that result for every row of the event.
I did not find a way to fix last.date1, so your help is appreciated.
You could do this
dd<-data.frame(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 5L),
var1 = c(120L, 120L, 120L, 220L, 220L, 120L, 120L, 120L, 220L),
var2 = c(1L, 1L, 4L, 1L, 1L, 2L, 2L, 1L, 3L),
date1 = structure(c(16224, 16225, 16226, 16226, 16228, 16231, 16233, 16236, 16241), class = "Date"),
date2 = structure(c(16251, 16251, 16251, 16251, 16251, 16251, 16251, 16251, 16251), class = "Date")
)
last.date1 <- with(dd, ave(date1, id, var1, var2, FUN = max, drop = TRUE))
dd$date2 - last.date1
dd$diff_time <- dd$date2 - last.date1
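For comparison, the same grouping logic in dplyr (a sketch; like the ave() call above, it takes the latest date1 within each id, var1, var2 group):
library(dplyr)

dd %>%
  group_by(id, var1, var2) %>%
  mutate(diff_time = as.numeric(date2 - max(date1), units = "days")) %>%
  ungroup()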
