R: Days since last event per ID

I am interested in finding the number of days since the last event per ID. The data looks like this:
df <- data.frame(date=as.Date(
c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
"06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"),
event=c(0,0,1,0,1, 1,0,0,0,1),id = c(rep(1,5),rep(2,5)))
date event id
1 2000-07-06 0 1
2 2000-09-15 0 1
3 2000-10-15 1 1
4 2001-01-03 0 1
5 2001-03-17 1 1
6 2010-08-06 1 2
7 2010-09-15 0 2
8 2010-10-15 0 2
9 2011-01-03 0 2
10 2011-03-17 1 2
I am borrowing heavily from a data.table solution here, but it does not consider IDs.
library(data.table)
setDT(df)
setkey(df, date,id)
df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
df[, tae := difftime(lastevent, shift(lastevent, 1L, "lag"), unit = "days")]
df[event == 0, tae:= difftime(date, lastevent, unit = "days")]
It generates the following output:
date lastevent event id tae
1: 2000-07-06 <NA> 0 1 NA days
2: 2000-09-15 <NA> 0 1 NA days
3: 2000-10-15 2000-10-15 1 1 NA days
4: 2001-01-03 2000-10-15 0 1 80 days
5: 2001-03-17 2001-03-17 1 1 153 days
6: 2010-08-06 2010-08-06 1 2 3429 days
7: 2010-09-15 2010-08-06 0 2 40 days
8: 2010-10-15 2010-08-06 0 2 70 days
9: 2011-01-03 2010-08-06 0 2 150 days
10: 2011-03-17 2011-03-17 1 2 223 days
My desired output however is as follows:
date lastevent event id tae
1: 2000-07-06 <NA> 0 1 NA days
2: 2000-09-15 <NA> 0 1 NA days
3: 2000-10-15 2000-10-15 1 1 NA days
4: 2001-01-03 2000-10-15 0 1 80 days
5: 2001-03-17 2001-03-17 1 1 153 days
6: 2010-08-06 2010-08-06 1 2 NA days
7: 2010-09-15 2010-08-06 0 2 40 days
8: 2010-10-15 2010-08-06 0 2 70 days
9: 2011-01-03 2010-08-06 0 2 150 days
10: 2011-03-17 2011-03-17 1 2 223 days
The only difference is the NA in row 6 of the tae column. This related post is unanswered. I have also looked here, but that solution does not work in my case. There are many other questions like this, but none do the calculation per ID. Thank you!

library(data.table)
df <- data.table(date = as.Date(
        c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
          "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"),
      event = c(0,0,1,0,1, 1,0,1,0,1), id = c(rep(1,5), rep(2,5)))
## time between events, computed among the event rows only
tempdt <- df[event == 1, ]
tempdt[, tae := date - shift(date), by = id]
## merge back onto the full data
df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)
## rows whose previous row was an event measure from that event
df[, tae := ifelse(shift(event) == 1, date - shift(date), tae), by = id]
EDIT
A more general solution:
df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001", "18/03/2001",
"06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011","19/03/2011"),
"%d/%m/%Y"),
event=c(1,0,0,0,0,0,1,1,1,0,1,0),id = c(rep(1,6),rep(5,6)))
##for event = 1 observations
tempdt <- df[event==1,]
tempdt[,tae := date - shift(date), by = id]
df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)
##for event = 0 observations
for (d in df[event == 0, date]) {
  # print(as.Date(d, origin = "1970-01-01"))
  df[date == d & event == 0,
     tae := as.Date(d, origin = "1970-01-01") - max(df[date < d & event == 1, date]),
     by = id]
}
EDIT 2
There must be a faster way to do this, but this version won't produce a warning if the first observation has event = 0
df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"),
"%d/%m/%Y"), event=c(0,0,1,0,1, 1,0,0,0,1),id = c(rep(1,5),rep(2,5)))
tempdt <- df[event==1,]
tempdt[,tae := date - shift(date), by = id]
df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)
for (i in unique(df[, id])) {
  # print(i)
  for (d in df[date > df[id == i & event == 1, min(date)] & event == 0, date]) {
    # print(as.Date(d, origin = "1970-01-01"))
    df[id == i & date == d & event == 0,
       tae := as.Date(d, origin = "1970-01-01") -
         max(df[id == i & date < d & event == 1, date])]  # restrict the max to the same id
  }
}
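For reference, here is a more compact sketch of my own (not part of the original answers): a single per-id rolling join. It assumes a reasonably recent data.table (on= joins and the x. prefix in j).
library(data.table)
df <- data.table(date = as.Date(
        c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
          "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"),
      event = c(0,0,1,0,1, 1,0,0,0,1), id = c(rep(1,5), rep(2,5)))
## roll the most recent event date forward within each id
df[, lastevent := df[event == 1][df, on = .(id, date), roll = TRUE, x.date]]
## non-event rows: days since the last event; rows before the first event stay NA
df[, tae := as.numeric(date - lastevent)]
## event rows: days since the previous event in the same id (NA for the first one)
df[event == 1, tae := as.numeric(date - shift(date)), by = id]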

Merge partially overlapping date ranges in data.table

Suppose that I have two tables (DT_sportA and DT_sportB) that measure time periods in which two children (id) played sport "A" and "B".
library(data.table)
library(lubridate)
DT_sportA <- data.table(id = rep(1:2,each=2),
start_date = ymd(c("2000-01-01","2002-01-15","2014-03-12","2016-10-14")),
end_date = ymd(c("2000-02-03","2003-03-01","2014-04-03","2017-05-19")))
DT_sportA
# id start_date end_date
# 1: 1 2000-01-01 2000-02-03
# 2: 1 2002-01-15 2003-03-01
# 3: 2 2014-03-12 2014-04-03
# 4: 2 2016-10-14 2017-05-19
DT_sportB <- data.table(id = c(1L,1L,2L),
start_date = ymd(c("2000-01-15","2002-01-15","2017-02-10")),
end_date = ymd(c("2000-02-01","2006-03-19","2017-02-20")))
DT_sportB
# id start_date end_date
# 1: 1 2000-01-15 2000-02-01
# 2: 1 2002-01-15 2006-03-19
# 3: 2 2017-02-10 2017-02-20
I would like to generate a new table with all of the unique and overlapping date ranges with two categorical indicators denoting the sport played by the children. The desired DT should look like this:
id start_date end_date sportA sportB
1: 1 2000-01-01 2000-01-14 1 0
2: 1 2000-01-15 2000-02-01 1 1
3: 1 2000-02-02 2000-02-03 1 0
4: 1 2002-01-15 2002-03-01 1 1
5: 1 2002-03-02 2002-03-19 0 1
6: 2 2014-03-12 2014-04-03 1 0
7: 2 2016-10-14 2017-02-09 1 0
8: 2 2017-02-10 2017-02-20 1 1
9: 2 2017-02-21 2017-05-19 1 0
This is a fairly trivial toy example. The real data spans several million rows and approximately 20 "sports", which is why I am looking for a data.table solution.
Notes:
when doing similar/same things to multiple tables, I find it is almost always preferable to operate on them as a list of tables instead of individual objects; while this solution will work in general without this (some adaptation required), I believe it makes many things worth the mind-shift;
further, I actually think a long-format is better than a list-of-tables here, as we can still differentiate id and sport with ease;
your expected output is a little inconsistent in how it avoids overlap between rows; for example, "2000-01-14" is not in the data, but it is an end_date, suggesting that "2000-01-15" was reduced because the next row starts on that date ... but there is a start on "2000-02-02" for apparently similar (but reversed) reasons; one way around this is to subtract a really low number from end_date so that no id/sport/date range will match multiple rows. I say "low number" and not 1 because Date-class objects are really numeric, and dates can be fractional: though not displayed fractionally, a Date is still fractional; compare Sys.Date()-0.1 with dput(Sys.Date()-0.1) (small demo below).
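A small demo of that fractional-Date point (mine, using a fixed date instead of Sys.Date() so the values are reproducible):
d <- as.Date("2000-01-15") - 0.1
d        # [1] "2000-01-14"  -- printed as if it were a whole day
dput(d)  # structure(10970.9, class = "Date") -- the fraction is still there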
sports <- rbindlist(mget(ls(pattern = "DT_sport.*")), idcol = "sport")
sports[, sport := gsub("^DT_", "", sport) ] # primarily aesthetics
# sport id start_date end_date
# <char> <int> <Date> <Date>
# 1: sportA 1 2000-01-01 2000-02-03
# 2: sportA 1 2002-01-15 2003-03-01
# 3: sportA 2 2014-03-12 2014-04-03
# 4: sportA 2 2016-10-14 2017-05-19
# 5: sportB 1 2000-01-15 2000-02-01
# 6: sportB 1 2002-01-15 2006-03-19
# 7: sportB 2 2017-02-10 2017-02-20
I tend to like piping data.table, and since I'm still on R-4.0.5, I use magrittr::%>% for this; it is not strictly required, but I feel it helps readability (and therefore maintainability, etc). (I don't know if this will work as easily in R-4.1's native |> pipe, as that has more restrictions on the RHS data placement.)
library(magrittr)
out <- sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
sports[., on = .(id, start_date <= sd, end_date >= ed) ] %>%
.[ !is.na(sport), ] %>%
.[, val := 1L ] %>%
dcast(id + start_date + end_date ~ sport, value.var = "val", fill = 0)
out
# id start_date end_date sportA sportB
# <int> <Date> <Date> <int> <int>
# 1: 1 2000-01-01 2000-01-14 1 0
# 2: 1 2000-01-15 2000-01-31 1 1
# 3: 1 2000-02-01 2000-02-02 1 0
# 4: 1 2002-01-15 2003-02-28 1 1
# 5: 1 2003-03-01 2006-03-19 0 1
# 6: 2 2014-03-12 2014-04-02 1 0
# 7: 2 2016-10-14 2017-02-09 1 0
# 8: 2 2017-02-10 2017-02-19 1 1
# 9: 2 2017-02-20 2017-05-19 1 0
Walk-through:
the first sports[, {...}] produces just the feasible date-ranges, per-id; it will produce more than needed, and these are filtered out a little later; I combine this with a slight offset to end_date so that rows are mutually exclusive (second note above); while they appear to be full-days separated, they are only separated by under 1 second; I add secdiff to show this here:
sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
.[, secdiff := c(as.numeric(sd[-1] - ed[-.N], units="secs"), NA), by = .(id) ]
# id sd ed secdiff
# <int> <Date> <Date> <num>
# 1: 1 2000-01-01 2000-01-14 0.8640000
# 2: 1 2000-01-15 2000-01-31 0.8640000
# 3: 1 2000-02-01 2000-02-02 0.8640000
# 4: 1 2000-02-03 2002-01-14 0.8640000 # will be empty post-join
# 5: 1 2002-01-15 2003-02-28 0.8640000
# 6: 1 2003-03-01 2006-03-19 NA
# 7: 2 2014-03-12 2014-04-02 0.8640001
# 8: 2 2014-04-03 2016-10-13 0.8640001 # will be empty post-join
# 9: 2 2016-10-14 2017-02-09 0.8640001
# 10: 2 2017-02-10 2017-02-19 0.8640001
# 11: 2 2017-02-20 2017-05-19 NA
btw, the first operation on sports[..] in the previous bullet is {-blockized for a slight boost in efficiency, choosing to not sort(unique(c(start_date, end_date))) twice;
left join sports onto this, on id and the date-ranges; this will produce NA values in the sport column, which indicates the date ranges that were programmatically made (with a simple sequence of dates) but no sports are assigned; these not-needed rows are removed by the !is.na(sport);
assigning val := 1L is purely so that we have a value column during reshaping;
dcast reshapes and fills the missing values with 0.
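As a tiny standalone illustration of that dcast step (toy data of my own, not from the question):
library(data.table)
toy <- data.table(id = c(1L, 1L, 2L), sport = c("sportA", "sportB", "sportA"), val = 1L)
dcast(toy, id ~ sport, value.var = "val", fill = 0)
#    id sportA sportB
# 1:  1      1      1
# 2:  2      1      0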

Fixing the error associated with creating a sequence of dates between start date and end date in R

***NOTE: please do not link a similar post. I have found several other similar postings, but their responses have not resolved the errors I get: "Error in seq.int(0, to0 - from, by) : 'to' must be a finite number" or "'from' must be of length 1".... I'm looking to understand why these error statements occur and how to prevent them from occurring... Thanks!
I have a data frame like the following
id startdate enddate
1 01/01/2011 01/05/2011
1 02/03/2012 02/05/2012
2 03/04/2013 03/06/2013
3 04/06/2014 04/09/2014
I want to transform the data frame so as to create the following:
id date
1 01/01/2011
1 01/02/2011
1 01/03/2011
1 01/04/2011
1 01/05/2011
1 02/03/2012
1 02/04/2012
1 02/05/2012
2 03/04/2013
2 03/05/2013
2 03/06/2013
.... and so on to fill in the sequence of dates between startdate and enddate
I have tried the following....
one <- as.data.table(one)
one[, startdate := as.character(startdate)]
one[, enddate := as.character(enddate)]
one[, startdate := as.Date(startdate, format = "%m/%d/%Y")]
one[, enddate := as.Date(enddate, format = "%m/%d/%Y")]
one <- as.data.frame(one)
one %>%
  rowwise() %>%
  do(data.frame(id = .$id, date = seq(.$startdate, .$enddate, by = "day")))
When I run this, I get the following error: Error in seq.int(0, to0 - from, by) : 'to' must be a finite number
Why is this? And how can I fix this piece of code?
With data.table, we can use Map. Convert 'startdate' and 'enddate' to Date class, use Map to get the sequence of dates for each pair of corresponding elements, replicate 'id' according to the lengths of the list of dates, and concatenate the list of dates to create the two-column output.
library(data.table)
one[, {lst1 <- Map(seq, as.IDate(startdate, "%m/%d/%Y"),
as.IDate(enddate, "%m/%d/%Y"),
MoreArgs = list(by = "day"))
.(id = rep(id, lengths(lst1)), date = do.call(c, lst1))}]
# id date
# 1: 1 2011-01-01
# 2: 1 2011-01-02
# 3: 1 2011-01-03
# 4: 1 2011-01-04
# 5: 1 2011-01-05
# 6: 1 2012-02-03
# 7: 1 2012-02-04
# 8: 1 2012-02-05
# 9: 2 2013-03-04
#10: 2 2013-03-05
#11: 2 2013-03-06
#12: 3 2014-04-06
#13: 3 2014-04-07
#14: 3 2014-04-08
#15: 3 2014-04-09
If there are multiple formats in the 'date' columns, one option is anydate from anytime, which automatically converts many common formats to Date class
library(anytime)
one[, {lst1 <- Map(seq, anydate(startdate),
anydate(enddate),
MoreArgs = list(by = "day"))
.(id = rep(id, lengths(lst1)), date = do.call(c, lst1))}]
Or using tidyverse
library(dplyr)
library(purrr)
library(tidyr)
library(lubridate)
one %>%
transmute(id, date = map2(mdy(startdate), mdy(enddate), seq, by = 'day')) %>%
unnest(c(date))
# A tibble: 15 x 2
# id date
# <int> <date>
# 1 1 2011-01-01
# 2 1 2011-01-02
# 3 1 2011-01-03
# 4 1 2011-01-04
# 5 1 2011-01-05
# 6 1 2012-02-03
# 7 1 2012-02-04
# 8 1 2012-02-05
# 9 2 2013-03-04
#10 2 2013-03-05
#11 2 2013-03-06
#12 3 2014-04-06
#13 3 2014-04-07
#14 3 2014-04-08
#15 3 2014-04-09
data
one <- structure(list(id = c(1L, 1L, 2L, 3L), startdate = c("01/01/2011",
"02/03/2012", "03/04/2013", "04/06/2014"), enddate = c("01/05/2011",
"02/05/2012", "03/06/2013", "04/09/2014")), class = "data.frame", row.names = c(NA,
-4L))
setDT(one)

RcppRoll: n = number of months instead of obs

I ran into an issue with the RcppRoll package. I want to use it to sum the values of the past 3 months; however, sometimes there is no data for one or more of those months. n = 3 considers the last three observations rather than the last 3 months. I couldn't find a solid solution, so I am trying my luck here. Thank you in advance for any suggestions.
P.S. I prefer to work with data.table and RcppRoll, as my dataset is large and I am familiar with those.
Code:
library("data.table")
library("RcppRoll")
test <- data.table(id = rep(1, 8),
                   date = c("2015-01","2015-02","2015-03","2015-04",
                            "2015-08","2015-09","2015-10","2015-11"),
                   value = 1:8)
test[, var := roll_sumr(value, n = 3, na.rm = TRUE), by = id]
id date value var
1: 1 2015-01 1 NA
2: 1 2015-02 2 NA
3: 1 2015-03 3 6
4: 1 2015-04 4 9
5: 1 2015-08 5 12
6: 1 2015-09 6 15
7: 1 2015-10 7 18
8: 1 2015-11 8 21
Expected output
prefered_outcome = data.table(id = rep(1, 8),date = c("2015-01","2015-02","2015-03","2015-04","2015-08","2015-09","2015-10","2015-11"), value = 1:8,var = c(NA, NA, 6, 9, NA, NA, 18, 21))
id date value var
1: 1 2015-01 1 NA
2: 1 2015-02 2 NA
3: 1 2015-03 3 6
4: 1 2015-04 4 9
5: 1 2015-08 5 NA
6: 1 2015-09 6 NA
7: 1 2015-10 7 18
8: 1 2015-11 8 21
Define ym of class yearmon, then check whether the prior and second-prior ym values are exactly one and two months back; if so, use roll_sumr, otherwise use NA.
library(zoo)
ym <- test[, as.yearmon(date)]
test[, roll := ifelse(ym - 1/12 == shift(ym) & ym - 2/12 == shift(ym, 2),
roll_sumr(value, 3, na.rm = TRUE), NA), by = id ]
giving:
> test
id date value roll
1: 1 2015-01 1 NA
2: 1 2015-02 2 NA
3: 1 2015-03 3 6
4: 1 2015-04 4 9
5: 1 2015-08 5 NA
6: 1 2015-09 6 NA
7: 1 2015-10 7 18
8: 1 2015-11 8 21
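An aside on why the 1/12 arithmetic works (my note): zoo stores a yearmon as year + (month - 1)/12, so subtracting 1/12 steps back exactly one month:
library(zoo)
unclass(as.yearmon("2015-03"))   # [1] 2015.167  (= 2015 + 2/12)
as.yearmon("2015-03") - 1/12 == as.yearmon("2015-02")
# [1] TRUE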
You can add the missing months first and then apply the function. Afterwards, the added months can be removed again
library(data.table)
library("RcppRoll")
library(zoo)
test <- data.table(id = rep(1, 8),
                   date = c("2015-01","2015-02","2015-03","2015-04",
                            "2015-08","2015-09","2015-10","2015-11"),
                   value = 1:8)
test$date <- as.yearmon(test$date)
allMonths <- seq.Date(from = as.Date(test$date[1]),
                      to = as.Date(test$date[length(test$date)]),
                      by = "month")
df2 <- data.frame(date = as.yearmon(allMonths))
df3 <- merge(test, df2, all = TRUE)
df3[, var := roll_sumr(value, n = 3, na.rm = TRUE), by = id]
df3
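The answer mentions that the added months can be removed again; a hedged completion of that last step (df3 is a data.table here, because merging a data.table returns one):
## keep only the original observations, dropping the padded months
df3 <- df3[!is.na(value)]
df3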

Count if between date range based on groups

I'm stuck trying to find a relatively simple way to count occurrences within a date range by group using R. I get the idea there has to be an easier way than what I'm trying.
I have over 6,000 groups, each group has anywhere from 1 to 100 IDs within, each with a start date and an end date anywhere from Jan 1, 1990 to today. I want to make a dataframe, one group per column and one day per row, counting the number of IDs active per day from April 1, 2013 until March 31, 2018. For obvious reasons, using COUNTIFS in Excel will not cut it.
I was trying to use this question as a starting point, as such:
df1 <- data.frame(group = c(1,1,2,3,3),
id = c(1,2,1,1,2),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01","2016-04-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05","2999-01-01"))
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 7))
report <- cbind(report,matrix(data=NA,nrow=7,ncol=3))
names(report) <- c('date',as.vector(unique(df1$group)))
daily <- function(i,...){
report[,i+1] <- sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$group == unique(df1$group)[i]))
}
for (i in unique(df1$group))
daily(i)
However, this doesn't seem to do anything (nor does it throw errors). Is there an easier way to do this? Am I way off base? Any help is appreciated for this non-programmer!
Additional help requested: I'm trying to modify Jaap's code in the answer below to include group start and group end times, so that the data table displays an NA when the group is not active.
Example data:
df2 <- data.frame(group = c(1,1,2,3,3),
groupopendate = c("2016-04-02","2016-04-02","2016-04-01","2016-04-02","2016-04-02"),
groupclosedate = c("2016-04-08","2016-04-08","2016-04-10","2016-04-09","2016-04-09"),
id = c(1,2,1,1,2),
startdate = c("2016-04-02","2016-04-04","2016-04-03","2016-04-02","2016-04-05"),
enddate = c("2016-04-04","2016-04-06","2016-04-10","2016-04-08","2016-04-08"))
Jaap's solution gives me this:
active grp1 grp2 grp3
1: 2016-04-02 1 0 1
2: 2016-04-03 1 1 1
3: 2016-04-04 1 1 1
4: 2016-04-05 1 1 2
5: 2016-04-06 0 1 2
6: 2016-04-07 0 1 2
However, what I want is such:
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 1 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 0
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA NA NA
Any help is appreciated!
A possible alternative solution using data.table:
# load the package & convert 'df1' to a data.table
library(data.table)
setDT(df1)
# convert the date columns to a date format
# (not needed if they already are Dates)
df1[, `:=` (startdate = as.Date(startdate), enddate = as.Date(enddate))]
# create a new data.table with the 'active' days
DT <- data.table(active = seq(from = as.Date("2016-04-01"), by = "day", length.out = 7))
# use a join and dcast to get the desired result
DT[df1
, on = .(active > startdate, active < enddate)
, allow = TRUE
, nomatch = 0
, .(active = x.active, group, id)
][, dcast(.SD, active ~ paste0("grp",group), value.var = "id", fun = length)]
which gives:
active grp1 grp2 grp3
1: 2016-04-01 1 1 0
2: 2016-04-02 1 1 1
3: 2016-04-03 1 1 1
4: 2016-04-04 0 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 1
7: 2016-04-07 1 1 1
NOTE: I've used paste0("grp",group) instead of just group in the dcast step as it leads to better column names (it is better not to use bare numeric values as column names)
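A hedged aside on the join itself: on = .(active > startdate, active < enddate) is a non-equi join, so each 'active' day matches every interval that contains it. A minimal form of the idea (toy data of my own):
library(data.table)
days  <- data.table(day = as.Date("2016-04-01") + 0:2)
spans <- data.table(s = as.Date("2016-04-01"), e = as.Date("2016-04-03"), id = 1L)
days[spans, on = .(day >= s, day <= e), .(day = x.day, id), nomatch = 0]
#           day id
# 1: 2016-04-01  1
# 2: 2016-04-02  1
# 3: 2016-04-03  1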
With regard to your additional example, you could solve that as follows:
setDT(df2)
df2[, c(2:3,5:6) := lapply(.SD, as.Date), .SDcols = c(2:3,5:6)]
DT <- data.table(active = seq(from = min(df2$groupopendate),
to = max(df2$groupclosedate),
by = "day"))
df2new <- df2[, .(active = seq.Date(startdate, enddate, by = "day"))
, by = .(group, id)
][, .N, by = .(group, active)
][df2[, .(active = seq.Date(groupopendate[1], groupclosedate[.N] - 1, by = "day"))
, by = .(group)]
, on = .(group, active)
][is.na(N), N := 0
][, dcast(.SD, active ~ paste0("grp",group))]
nms <- setdiff(names(df2new), "active")
DT[df2new
, on = .(active)
, (nms) := mget(paste0("i.",nms))][]
which gives:
> DT
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 2 1 1
5: 2016-04-05 1 1 2
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 2
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA 1 NA
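The last step above uses data.table's update-join idiom: columns from the i table are assigned by reference via the i. prefix (here through mget(paste0("i.", nms))). A hedged mini-example of the pattern:
library(data.table)
X <- data.table(k = 1:3, a = NA_integer_)
Y <- data.table(k = 2:3, a = c(20L, 30L))
X[Y, on = .(k), a := i.a]   # rows of X matched by Y receive Y's values
X
#    k  a
# 1: 1 NA
# 2: 2 20
# 3: 3 30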
I've figured it out! As usual, as soon as you post a question, you figure out the answer. I was overcomplicating it by wrapping the assignment in a function: report[, i+1] <- ... inside daily() only modified the function's local copy of report, so the global data frame never changed (and no error was thrown). Putting the sapply directly in the for loop assigns in the global environment and works.
If anyone is interested:
for (i in unique(df1$group)) {
  report[, i + 1] <-
    sapply(report$date, function(x)
      sum(as.Date(df1$startdate) < as.Date(x) &
          as.Date(df1$enddate) > as.Date(x) &
          df1$group == unique(df1$group)[i]))
}

Run length sequence by time and ID

This problem does not seem to have been put out here before.
I want to find the number of subjects that score 1 for 6 consecutive hours.
The subjects have not been scored every hour, so if an hour is missing the hours are not consecutive, and the output for that 6-hour period should be NA.
The reason for assigning NA is that we do not know how the subject scored during the missing hour. This approach can be used to count consecutive hits, but only counting them when a subject has actually participated.
My dataframe looks like this:
ID<-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
hour<-c(1,2,3,7,8,9,10,11,12,17,18,19,1,2,3,4,5,6,8,9,15)
A<-c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1)
df<-data.frame(ID,hour,A)
I have tried to use the rle function (I am sure it's possible) but I cannot get it to condition on both hour and ID.
The output would be like this:
ID<-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
hour<-c(1,2,3,7,8,9,10,11,12,17,18,19,1,2,3,4,5,6,8,9,15)
A<-c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1)
six<-c(NA,NA,NA,0,0,0,0,0,1,NA,NA,NA,0,0,0,0,0,1,NA,NA,NA)
df<-data.frame(ID,hour,A,six)
Thank you in advance.
I believe the original data set I gave was too small for the solutions to generalize.
I just tried the code with this larger dataset and found that it produces a wrong result.
ID<-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4)
hour<-c(1,2,3,7,8,9,10,12,13,17,18,19,1,2,3,4,5,6,8,9,15,1:23,27,28,29,30,31)
A<-c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,rep(1,28))
df<-data.frame(ID,hour,A)
For the new dataset the output should be:
ID<-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4)
hour<-c(1,2,3,7,8,9,10,12,13,17,18,19,1,2,3,4,5,6,8,9,15,1:23,27,28,29,30,31)
A<-c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,rep(1,28))
six<-c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,0,0,0,0,0,1,NA,NA,NA,0,0,0,0,0,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
df<-data.frame(ID,hour,A,six)
Here is an approach in tidyverse with the updated data set:
library(tidyverse)
df %>%
group_by(ID) %>%
expand(hour = seq(min(hour), max(hour))) %>%
left_join(df) %>%
mutate(rle = rep(rle(A)$lengths, times = rle(A)$lengths)) %>%
group_by(ID, rle) %>%
mutate(sum = cumsum(A),
six = ifelse(rle >= 6 & A == 1, 0, NA),
six = ifelse(sum == 6, 1, ifelse(sum > 6, NA, six))) %>%
filter(!is.na(A)) %>%
ungroup() %>%
select(ID, hour, A, six) %>%
as.data.frame() -> df_out2
Check against the requested output:
ID<-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4)
hour<-c(1,2,3,7,8,9,10,12,13,17,18,19,1,2,3,4,5,6,8,9,15,1:23,27,28,29,30,31)
A<-c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,rep(1,28))
six<-c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,0,0,0,0,0,1,NA,NA,NA,0,0,0,0,0,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
df<-data.frame(ID,hour,A,six)
all.equal(df, df_out2)
#output
TRUE
The old answer:
df %>%
mutate(rle = rep(rle(A)$lengths, times = rle(A)$lengths)) %>%
group_by(ID, rle) %>%
mutate(sum = cumsum(A),
six = ifelse(rle >= 6 & A == 1, 0, NA),
six = ifelse(sum == 6, 1, ifelse(sum > 6, NA, six))) %>%
ungroup() %>%
select(ID, hour, A, six) %>%
as.data.frame() -> df_out2
Let's check whether the result matches the requested output:
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
hour <- c(1,2,3,7,8,9,10,11,12,17,18,19,1,2,3,4,5,6,8,9,15)
A <- c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1)
six <- c(NA,NA,NA,0,0,0,0,0,1,NA,NA,NA,0,0,0,0,0,1,NA,NA,NA)
df1 <- data.frame(ID, hour, A, six)
df1 is the requested output:
all.equal(df1, df_out2)
#output
TRUE
some benchmarking:
library(microbenchmark)
library(data.table)
akrun <- function(df){
setDT(df)[, grp := rleid(A)][, Anew := A *((hour - shift(hour, fill = hour[1])) ==1), grp
][, sixnew :=if(sum(A)>=5) rep(c(0, 1), c(.N-1, 1)) else NA_real_,.(rleid(Anew), grp)]
i1 <- df[, .I[which(is.na(sixnew) & shift(sixnew == 0, type = 'lead'))], grp]$V1
df[i1, sixnew := 0][, c("Anew", "grp") := NULL][]
}
missuse <- function(df){
df %>%
mutate(rle = rep(rle(A)$lengths, times = rle(A)$lengths)) %>%
group_by(ID, rle) %>%
mutate(sum = cumsum(A),
six = ifelse(rle >= 6 & A == 1, 0, NA),
six = ifelse(sum == 6, 1, ifelse(sum > 6, NA, six))) %>%
ungroup() %>%
select(ID, hour, A, six)
}
Mike <- function(df){
ave(df$A,
cumsum(!(df$hour == shift(df$hour, fill = 0) + 1)),
FUN = function(x) {
if(all(x==1) & length(x) >= 6) return(c(rep(0, length(x) - 1), 1))
else return(rep(NA, length(x)))})
}
microbenchmark(Mike(df),
akrun(df),
missuse(df))
#output
Unit: microseconds
expr min lq mean median uq max neval
Mike(df) 491.291 575.7115 704.2213 597.7155 629.0295 9578.684 100
akrun(df) 6568.313 6725.5175 7867.4059 6843.5790 7279.2240 69790.755 100
missuse(df) 11042.822 11321.0505 12434.8671 11512.3200 12616.3485 43170.935 100
way to go Mike H.!
To get the groupings you can compare the current hour to the lagged hour to see if they are "consecutive" or 1 integer apart and then take the cumsum of that. Once you have the groupings you can use a simple ave to get the output you want.
library(data.table)
df$six <- ave(df$A,
cumsum(!(df$hour == shift(df$hour, fill = 0) + 1)),
FUN = function(x) {
if(all(x==1) & length(x) >= 6) return(c(rep(0, length(x) - 1), 1))
else return(rep(NA, length(x)))}
)
df
# ID hour A six
#1 1 1 0 NA
#2 1 2 1 NA
#3 1 3 0 NA
#4 1 7 1 0
#5 1 8 1 0
#6 1 9 1 0
#7 1 10 1 0
#8 1 11 1 0
#9 1 12 1 1
#10 1 17 0 NA
#11 1 18 0 NA
#12 1 19 0 NA
#13 2 1 1 0
#14 2 2 1 0
#15 2 3 1 0
#16 2 4 1 0
#17 2 5 1 0
#18 2 6 1 1
#19 2 8 1 NA
#20 2 9 1 NA
#21 2 15 1 NA
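A hedged aside: the grouping vector built here is worth seeing on its own; every jump in hour starts a new group:
library(data.table)   # for shift()
hour <- c(1, 2, 3, 7, 8)
cumsum(!(hour == shift(hour, fill = 0) + 1))
# [1] 0 0 0 1 1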
If you only want to select a patient at most once, this will select the last time period:
df$six_adj <- ave(df$six, df$ID, df$six, FUN = function(x) {
if(all(x==1)) return(c(rep(0, length(x) - 1), 1))
else return(x)}
)
We can use rleid from data.table. Create a grouping column with the run-length id of 'A' ('grp'). Take the difference of 'hour' with the previous value of 'hour', check whether it equals 1, and multiply by 'A' to create 'Anew'. Grouped by the run-length id of 'Anew' and 'grp', if the sum of 'A' is at least 5, fill with 0s followed by a single 1; otherwise return NA. In the last phase, set some spillover NAs to 0 by creating an index ('i1').
library(data.table)
setDT(df)[, grp := rleid(A)][, Anew := A *((hour - shift(hour, fill = hour[1])) ==1), grp
][, sixnew :=if(sum(A)>=5) rep(c(0, 1), c(.N-1, 1)) else NA_real_,.(rleid(Anew), grp)]
i1 <- df[, .I[which(is.na(sixnew) & shift(sixnew == 0, type = 'lead'))], grp]$V1
df[i1, sixnew := 0][, c("Anew", "grp") := NULL][]
# ID hour A six sixnew
# 1: 1 1 0 NA NA
# 2: 1 2 1 NA NA
# 3: 1 3 0 NA NA
# 4: 1 7 1 0 0
# 5: 1 8 1 0 0
# 6: 1 9 1 0 0
# 7: 1 10 1 0 0
# 8: 1 11 1 0 0
# 9: 1 12 1 1 1
#10: 1 17 0 NA NA
#11: 1 18 0 NA NA
#12: 1 19 0 NA NA
#13: 2 1 1 0 0
#14: 2 2 1 0 0
#15: 2 3 1 0 0
#16: 2 4 1 0 0
#17: 2 5 1 0 0
#18: 2 6 1 1 1
#19: 2 8 1 NA NA
#20: 2 9 1 NA NA
#21: 2 15 1 NA NA
Or slightly more compact option would be
i1 <- setDT(df)[, if(sum(A)>= 5) .I[.N] , rleid(c(TRUE, diff(hour)==1), A)]$V1
df[i1, sixnew := 1][unlist(Map(`:`, i1-5, i1-1)), sixnew := 0]
df
# ID hour A six sixnew
# 1: 1 1 0 NA NA
# 2: 1 2 1 NA NA
# 3: 1 3 0 NA NA
# 4: 1 7 1 0 0
# 5: 1 8 1 0 0
# 6: 1 9 1 0 0
# 7: 1 10 1 0 0
# 8: 1 11 1 0 0
# 9: 1 12 1 1 1
#10: 1 17 0 NA NA
#11: 1 18 0 NA NA
#12: 1 19 0 NA NA
#13: 2 1 1 0 0
#14: 2 2 1 0 0
#15: 2 3 1 0 0
#16: 2 4 1 0 0
#17: 2 5 1 0 0
#18: 2 6 1 1 1
#19: 2 8 1 NA NA
#20: 2 9 1 NA NA
#21: 2 15 1 NA NA
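A hedged aside on rleid(), which both data.table answers group on; it simply numbers consecutive runs, and with several vectors a new run starts when any of them changes:
library(data.table)
rleid(c(0, 1, 1, 0, 0, 1))
# [1] 1 2 2 3 3 4
rleid(c(TRUE, TRUE, FALSE), c(1, 1, 1))
# [1] 1 1 2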
Benchmarks
Created a slightly bigger dataset and tested the solutions
-data
ID<-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
hour<-c(1,2,3,7,8,9,10,11,12,17,18,19,1,2,3,4,5,6,8,9,15)
A<-c(0,1,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1)
ID <- rep(1:5000, rep(c(12, 9), 2500))
A <- rep(A, 2500)
hour <- rep(hour, 2500)
dftest <- data.frame(ID, hour, A)
-functions
akrun <- function(df){
df1 <- copy(df)
setDT(df1)[, grp := rleid(A)][, Anew := A *((hour - shift(hour, fill = hour[1])) ==1), grp
][, sixnew :=if(sum(A)>=5) rep(c(0, 1), c(.N-1, 1))
else NA_real_,.(rleid(Anew), grp)]
i1 <- df1[, .I[which(is.na(sixnew) & shift(sixnew == 0, type = 'lead'))], grp]$V1
df1[i1, sixnew := 0][, c("Anew", "grp") := NULL][]
}
akrun2 <- function(df) {
df1 <- copy(df)
i1 <- setDT(df1)[, if(sum(A)>= 5) .I[.N] , rleid(c(TRUE, diff(hour)==1), A)]$V1
df1[i1, sixnew := 1][unlist(Map(`:`, i1-5, i1-1)), sixnew := 0]
}
missuse <- function(df){
df %>%
mutate(rle = rep(rle(A)$lengths, times = rle(A)$lengths)) %>%
group_by(ID, rle) %>%
mutate(sum = cumsum(A),
six = ifelse(rle >= 6 & A == 1, 0, NA),
six = ifelse(sum == 6, 1, ifelse(sum > 6, NA, six))) %>%
ungroup() %>%
select(ID, hour, A, six)
}
Mike <- function(df){
ave(df$A,
cumsum(!(df$hour == shift(df$hour, fill = 0) + 1)),
FUN = function(x) {
if(all(x==1) & length(x) >= 6) return(c(rep(0, length(x) - 1), 1))
else return(rep(NA, length(x)))})
}
-benchmark
microbenchmark(Mike(dftest),
akrun(dftest),
akrun2(dftest),
missuse(dftest),
times = 10L, unit = 'relative')
-output
#Unit: relative
# expr min lq mean median uq max neval cld
# Mike(dftest) 1.682794 1.754494 1.673811 1.68806 1.632765 1.640221 10 a
# akrun(dftest) 13.159245 12.950117 12.176965 12.33716 11.856271 11.095228 10 b
# akrun2(dftest) 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 10 a
# missuse(dftest) 37.899905 36.773837 34.726845 34.87672 33.155939 30.665840 10 c
