How to add rows for the time periods in between a given time period? - r

I have a data set of time periods, which may overlap, showing whether somebody was present (example_df). I want a data set that splits a larger time period (from 2014-01-01 to 2014-10-31) into smaller time periods in which somebody was present (present = 1) and time periods in which nobody was present (present = 0).
The result should look like result_df.
Example data frame
example_df <- data.frame(
  ID = 1,
  start = c(as.Date("2014-01-01"), as.Date("2014-03-05"), as.Date("2014-06-13"), as.Date("2014-08-15")),
  end = c(as.Date("2014-04-07"), as.Date("2014-04-12"), as.Date("2014-08-05"), as.Date("2014-10-02")),
  present = 1
)
Result should look like this
result_df <- data.frame(
  ID = 1,
  start = c(as.Date("2014-01-01"), as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02")),
  end = c(as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02"), as.Date("2014-10-31")),
  present = c(1, 0, 1, 0, 1, 0)
)
I have no idea how to tackle this problem, as it requires splitting time periods or adding rows (or something else?). Any help is much appreciated!

I hope I can be helpful, as I have struggled with this as well.
As in IceCreamToucan's example, this assumes independence by person ID. The approach uses dplyr to find overlapping date ranges and flatten them; other examples of this approach on Stack Overflow also use dplyr. The end result contains the time ranges in which the person is present.
library(tidyr)
library(dplyr)
pres <- example_df %>%
  group_by(ID) %>%
  arrange(start) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end), present = 1) %>%
  select(-indx)
Then additional rows can be added to indicate the periods when nobody was present. For a given ID, this finds the gaps between an earlier end date and the next (more recent) start date. Finally, the result is ordered by ID and start date.
result <- pres
for (i in unique(pres$ID)) {
  pres_i <- subset(pres, ID == i)
  if (nrow(pres_i) > 1) {
    adding <- data.frame(ID = i,
                         start = pres_i$end[-nrow(pres_i)] + 1,
                         end = pres_i$start[-1] - 1,
                         present = 0)
    adding <- adding[adding$start <= adding$end, ]
    result <- bind_rows(result, adding)
  }
}
result[order(result$ID, result$start), ]
# A tibble: 5 x 4
# Groups:   ID [1]
     ID start      end        present
  <dbl> <date>     <date>       <dbl>
1     1 2014-01-01 2014-04-12       1
2     1 2014-04-13 2014-06-12       0
3     1 2014-06-13 2014-08-05       1
4     1 2014-08-06 2014-08-14       0
5     1 2014-08-15 2014-10-02       1
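Note that this stops at 2014-10-02, while the question's window runs to 2014-10-31. A minimal sketch to append the trailing not-present period (assuming a single ID and a known window end, both taken from the question) before the final ordering:
window_end <- as.Date("2014-10-31")  # assumed overall window end from the question
last_end <- max(pres$end)
if (last_end < window_end) {
  result <- bind_rows(result,
                      data.frame(ID = 1, start = last_end + 1,
                                 end = window_end, present = 0))
}
result[order(result$ID, result$start), ]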

Assuming you want to do it separately for each ID, you can create a data table of all dates on which someone was present and join it with a table of all dates in the covered period. The result is not exactly the same as result_df, because here the present and not-present periods don't share endpoint days.
library(data.table)
setDT(example_df)
example_df[, {
  pres <- unique(unlist(Map(`:`, start, end)))
  class(pres) <- 'Date'
  all <- min(pres):max(pres)
  class(all) <- 'Date'
  pres <- data.table(day = pres)
  all <- data.table(day = all)
  out.full <- pres[all, on = .(day), .(day = i.day, present = +!is.na(x.day))]
  out.full[, .(start = min(day), end = max(day)),
           by = .(present, rid = rleid(present))][, -'rid']
}, by = ID]
#    ID present      start        end
# 1:  1       1 2014-01-01 2014-04-12
# 2:  1       0 2014-04-13 2014-06-12
# 3:  1       1 2014-06-13 2014-08-05
# 4:  1       0 2014-08-06 2014-08-14
# 5:  1       1 2014-08-15 2014-10-02
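To cover the question's full window (2014-01-01 to 2014-10-31) rather than just the span between the first and last present day, one small tweak (assuming fixed window bounds, taken from the question) is to build all from those bounds instead:
all <- as.Date("2014-01-01"):as.Date("2014-10-31")  # assumed window bounds
class(all) <- 'Date'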

With R, is there a better way to collect the start and end days of records for each individual with several series of records? [duplicate]

For thousands of ids, I want to find the days when each one starts being recorded and the days when it stops, in a simple way. I currently use a loop, which works well but takes ages, as below.
An example of my dataset:
id date
1 2017-11-30
1 2017-12-01
1 2017-12-02
1 2017-12-03
1 2017-12-05
1 2017-12-06
1 2017-12-07
1 2017-12-08
1 2017-12-09
1 2017-12-10
Then I use this loop to find each date when an individual starts being recorded without a break between days, and each date when it stops. In my example it gives "2017-11-30" and "2017-12-05" for the starts, and "2017-12-03" and "2017-12-10" for the ends.
library(dplyr)  # for bind_rows()

nani <- unique(dat$id)
n <- length(nani)  # number of unique ids (length(dat$id) would loop over every row)
# SET THE NEW OBJECT WHERE TO SAVE RESULTS
NEWDAT <- NULL
for (i in 1:n) {
  # SELECT ANIMAL I WITHIN THE DATA.FRAME
  x <- which(dat$id == nani[i])
  # FIND THE POSITIONS IN THE DATA FRAME WHERE THE RECORD IS NOT CONTINUOUS
  diffx <- diff(diff(dat$date[x]))
  # FIND THE POSITION OF STARTS FOR EACH SESSION OF RECORDS
  starti <- which(diffx < 0) + 1
  # FIND THE POSITION OF ENDS FOR EACH SESSION OF RECORDS
  endi <- which(diffx > 0) + 1
  # FIND THE DATES OF STARTS FOR EACH SESSION OF RECORDS
  starts_records <- c(dat$date[x][1], dat$date[x][starti])
  # FIND THE DATES OF ENDS FOR EACH SESSION OF RECORDS
  ends_records <- c(dat$date[x][endi], dat$date[x][length(x)])
  # CREATE LABELS
  name_start <- rep("START_RECORDS_BY_SENSORS", length(starts_records))
  name_end <- rep("END_RECORDS_BY_SENSORS", length(ends_records))
  # CREATE THE NEW DATA.FRAME EXPECTED
  dat2 <- data.frame(event_start = c(starts_records, ends_records),
                     name = c(name_start, name_end))
  dat2 <- dat2[order(dat2$event_start), ]
  # SAVE RESULTS
  NEWDAT <- bind_rows(NEWDAT, dat2)
}
So far I have tried things like the line below, but have not found the right solution to avoid the loop.
NEWDAT <- dat %>% group_by(id) %>% summarize(diff_days = diff(diff(date)))
I still struggle to understand the syntax of dplyr.
You can try creating a new group at every break and then taking the first and last date in each group.
library(dplyr)
df %>%
  group_by(id, grp = cumsum(c(TRUE, diff(date) > 1))) %>%
  summarise(start = first(date), stop = last(date))
#      id   grp start      stop
#   <int> <int> <date>     <date>
# 1     1     1 2017-11-30 2017-12-03
# 2     1     2 2017-12-05 2017-12-10
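If you also want the long format produced by the original loop (one labelled row per event), here is a sketch building on the same grouping; the label strings come from the question, and the pivot_longer reshape is an assumption about the desired shape:
library(tidyr)
df %>%
  group_by(id, grp = cumsum(c(TRUE, diff(date) > 1))) %>%
  summarise(START_RECORDS_BY_SENSORS = first(date),
            END_RECORDS_BY_SENSORS = last(date), .groups = "drop") %>%
  pivot_longer(-c(id, grp), names_to = "name", values_to = "event_start") %>%
  arrange(id, event_start)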

Generate records in an R data frame between two dates

I have a data frame that consists of customers' scheduled subscription payments, as follows:
library(lubridate)  # assumed, for ymd()

CusID <- c(1, 2, 3)
FromDate <- c(ymd("2019-01-01"), ymd("2019-01-04"), ymd("2019-02-02"))
ToDate <- c(ymd("2019-01-16"), ymd("2019-01-15"), ymd("2019-04-03"))
Amount <- c(5, 10, 12)
Frequency <- c("Weekly", "Fortnightly", "Monthly")
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
For each row (customer), I wish to loop from FromDate to ToDate and output one row of data for each scheduled payment that falls between those dates, resulting in the following data frame:
CusID <- c(1, 1, 1, 2, 3, 3, 3)
PaymentDate <- c(ymd("2019-01-01"), ymd("2019-01-08"), ymd("2019-01-15"),
                 ymd("2019-01-04"), ymd("2019-02-02"), ymd("2019-03-02"), ymd("2019-04-02"))
Amount <- c(5, 5, 5, 10, 12, 12, 12)
Output <- data.frame(CusID, PaymentDate, Amount)
What is an efficient way to achieve this using R (and preferably using dplyr / tidyverse functions)?
In SAS my approach would be to use a DO WHILE loop and an OUTPUT statement to write a new line for each scheduled payment, e.g.
data Output;
  set Input;
  PaymentDate = FromDate;
  do while (PaymentDate < ToDate);
    Payment = Amount;
    PaymentDate = PaymentDate + (7 / 14 / 30 ~ logic based on Frequency);
    output;
  end;
run;
(The key here in SAS is the OUTPUT statement: it explicitly writes a new record each time it is invoked, so it can be used in a loop to write multiple output lines per input line.)
Is there an equivalent method available in R, or is a different approach recommended?
Another option using the tidyverse:
library(dplyr)
library(tidyr)  # for crossing()

Input %>%
  mutate(Frequency = case_when(Frequency == "Weekly" ~ 7L,
                               Frequency == "Fortnightly" ~ 14L,
                               Frequency == "Monthly" ~ 30L,
                               TRUE ~ 0L)) %>%
  group_by(CusID) %>%
  group_modify(~ {
    PaymentDate <- seq.Date(from = .x$FromDate, to = .x$ToDate, by = .x$Frequency)
    crossing(.x[, 1], PaymentDate)
  })
# A tibble: 7 x 3
# Groups:   CusID [3]
  CusID PaymentDate Amount
  <dbl> <date>       <dbl>
1     1 2019-01-01       5
2     1 2019-01-08       5
3     1 2019-01-15       5
4     2 2019-01-04      10
5     3 2019-02-02      12
6     3 2019-03-04      12
7     3 2019-04-03      12
Payment dates are a little different from your expected output because seq.Date here steps by a flat 30 days, which does not line up with the varying number of days in those months.
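A quick illustration of the drift, using the example's monthly row (which starts 2019-02-02):
seq.Date(as.Date("2019-02-02"), by = "30 days", length.out = 3)
# "2019-02-02" "2019-03-04" "2019-04-03"   (flat 30-day steps drift)
seq.Date(as.Date("2019-02-02"), by = "1 month", length.out = 3)
# "2019-02-02" "2019-03-02" "2019-04-02"   (calendar-aware steps)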
UPDATE:
Here is a solution that matches your expected output exactly:
library(tidyr)  # for expand()

Input %>%
  mutate(PaymentDate = FromDate,
         RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
                                Frequency == "Fortnightly" ~ '2 weeks',
                                Frequency == "Monthly" ~ '1 month')) %>%
  group_by(CusID, Amount) %>%
  expand(PaymentDate = seq.Date(FromDate, ToDate, by = RFrequency))
# A tibble: 7 x 3
# Groups:   CusID, Amount [3]
  CusID Amount PaymentDate
  <dbl>  <dbl> <date>
1     1      5 2019-01-01
2     1      5 2019-01-08
3     1      5 2019-01-15
4     2     10 2019-01-04
5     3     12 2019-02-02
6     3     12 2019-03-02
7     3     12 2019-04-02
I tweaked your Input data.frame so that the Frequency values are strings, not factors.
You could create a helper table, freq_mapping, to convert your Frequency values into the interval strings seq.Date understands. This avoids the 30-day issue that one of the other answers pointed out.
freq_mapping <- data.frame(Frequency = c('Weekly', 'Fortnightly', 'Monthly'),
                           RFrequency = c('1 week', '2 weeks', '1 month'),
                           stringsAsFactors = FALSE)
Then merge Input with this:
Input <- Input %>%
  inner_join(freq_mapping, by = 'Frequency')
Now you can create the payment dates:
library(tidyr)  # for complete() and fill()

Input$PaymentDate <- Input$FromDate
Input %>%
  group_by(CusID) %>%
  complete(PaymentDate = seq.Date(FromDate, ToDate, by = RFrequency)) %>%
  fill(PaymentDate, Amount) %>%
  select(CusID, PaymentDate, Amount)
Not an easy problem for me. The solution is not beautiful, but it should do the work. You'll see there is a problem with the monthly payment, since a month is not always 30 days, but otherwise it should work. Nicer solutions surely exist.
library(data.table)

Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
Input <- data.table(Input)
Input[Frequency == "Weekly", freq := 7][Frequency == "Fortnightly", freq := 14][Frequency == "Monthly", freq := 30]
Input[, Ratio := (ToDate - FromDate) / freq]
# What is the maximum number of rows for a customer?
NREP <- as.integer(max(ceiling(Input$Ratio)))
Input[, Rep := 1][, PaymentDate := FromDate]
for (i in 1:NREP) {
  Inputtemp <- copy(Input)
  Inputtemp[, FromDate := FromDate + freq]
  Input <- rbind(Input, Inputtemp)
}
# Remove invalid rows
Input <- unique(Input)
Input <- Input[!(FromDate > ToDate), ]
setorder(Input, CusID)
Input <- Input[, c("CusID", "FromDate", "Amount")]
setnames(Input, "FromDate", "PaymentDate")
Input == data.table(Output)  # element-wise comparison against the expected Output
A mashup of Humpelstielzchen's and user2474226's answers, bringing all the logic into a single dplyr step.
Output <- Input %>%
  mutate(PaymentDate = FromDate,
         RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
                                Frequency == "Fortnightly" ~ '2 weeks',
                                Frequency == "Monthly" ~ '1 month')) %>%
  group_by(CusID) %>%
  complete(PaymentDate = seq.Date(FromDate, ToDate, by = RFrequency)) %>%
  fill(PaymentDate, Amount) %>%
  select(CusID, PaymentDate, Amount)

R: how to avoid a loop. Counting weekends between two dates for each row in a dataframe

I have two columns of dates. Two example dates are:
Date1 = "2015-07-17"
Date2 = "2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates, each of which is in its own column (columns 5 and 7 in this example code). I need to repeat this process for each row of my dataframe. The end result will be one column with the number of Saturdays and Sundays within the date range defined by the two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1, 5], Date2[1, 7], "days")) %in% c("Saturday", "Sunday"))
The answer for this row is 3. But if I take out the "1" in the row position of Date1 and Date2, I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and get one vector that lists the number of Saturdays and Sundays between the dates in columns 5 and 7, without using a loop? Another issue is that I have 2 million rows, so I am looking for something faster than a loop.
Thank you!!
The map2_* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function in parallel. They're pretty fast too (see this previous post)!
Here's an example. Note that the _int suffix requests an integer vector back.
library(purrr)

# Example data
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)

# Wrapper function to compute the number of weekend days between two dates
n_weekend_days <- function(date_1, date_2) {
  sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday", "Sunday"))
}

# Iterate row-wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up; it's not too difficult to use with() to assign the columns in the dataframe directly instead.
Essentially: pick a reference date, calculate the number of full weeks since it (by flooring), count the weekend days implied by the leftover remainder, and then take the difference between the counts at the start and end dates. The code does not specially handle cases in which the start or end date itself falls on a Saturday or Sunday.
# weekdays(as.Date(0, origin = "1970-01-01")) -> "Thursday"
require(dplyr)
startDate = as.Date(0, origin = "1970-01-01")  # this is a Thursday (day 0)
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with() to define the columns directly instead of %>%
df <- df %>%
  mutate(originDate = startDate) %>%
  mutate(startDayDiff = as.numeric(start - originDate),
         endDayDiff = as.numeric(end - originDate)) %>%
  mutate(startWeekDiff = floor(startDayDiff / 7),
         endWeekDiff = floor(endDayDiff / 7)) %>%
  # day 0 is a Thursday, so a remainder of 2 reaches Saturday and 3 reaches Sunday
  mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 2, 1, 0),
         NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 3, 1, 0),
         NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2, 1, 0),
         NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3, 1, 0)) %>%
  mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
# for the example row: NumSats = 2 (Jul 18 and 25), NumSuns = 1 (Jul 19)
Dates are the number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date (day 0 is a Thursday, so remainders of 2 and 3 land on Saturday and Sunday):
f <- function(d) {
  d <- as.numeric(d)
  r <- d %% 7
  2 * (d %/% 7) + (r >= 2) + (r >= 3)
}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
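A quick sanity check on the question's example range:
g(as.Date("2015-07-17"), as.Date("2015-07-25"))
# 3  (Sat the 18th, Sun the 19th, Sat the 25th)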
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
Use it as follows:
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2

Count number of occurrences in date range in R

I have a dataframe with a number of accounts, their status, and the start and end time for each status. I would like to report the number of accounts in each of these statuses over a date range. The data looks like the df below, with the desired report shown as well. (The actual data contains more state values; N/A values are represented by a dummy date far in the future.)
df <- data.frame(account = c(1, 1, 2, 3),
                 state = c("Open", "Closed", "Open", "Open"),
                 startdate = c("2016-01-01", "2016-04-04", "2016-03-02", "2016-08-01"),
                 enddate = c("2016-04-04", "2999-01-01", "2016-05-02", "2016-08-05"))

report <- data.frame(date = seq(from = as.Date("2016-04-01"), by = "1 day", length.out = 6),
                     number.open = c(2, 2, 2, 1, 1, 1))
I have looked at options involving rowwise() and mutate() from dplyr and foverlaps() from data.table, but haven't been able to get them to work.
(See Checking if Date is Between two Dates in R)
We can use sapply to do this for us:
report$NumberOpen <-
  sapply(report$date, function(x)
    sum(as.Date(df1$startdate) < as.Date(x) &
        as.Date(df1$enddate) > as.Date(x) &
        df1$state == 'Open'))
# report
# date NumberOpen
# 1 2016-04-01 2
# 2 2016-04-02 2
# 3 2016-04-03 2
# 4 2016-04-04 1
# 5 2016-04-05 1
# 6 2016-04-06 1
data

df1 <- data.frame(account = c(1, 1, 2, 3),
                  state = c("Open", "Closed", "Open", "Open"),
                  startdate = c("2016-01-01", "2016-04-04", "2016-03-02", "2016-08-01"),
                  enddate = c("2016-04-04", "2999-01-01", "2016-05-02", "2016-08-05"))
report <- data.frame(date = seq(from = as.Date("2016-04-01"), by = "1 day", length.out = 6))
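Since the question mentions foverlaps, here is a minimal sketch of that route (my own assumption, not part of the answer above; note that foverlaps treats interval endpoints as inclusive, so 2016-04-04 counts 2 here rather than 1):
library(data.table)
# 'Open' intervals as a keyed data.table (foverlaps requires a keyed y)
open_dt <- as.data.table(df1)[state == "Open",
                              .(start = as.Date(startdate), end = as.Date(enddate))]
setkey(open_dt, start, end)
# each report date as a zero-length interval
qry <- data.table(start = report$date, end = report$date)
# count overlapping 'Open' intervals per date (unmatched dates get yid = NA)
hits <- foverlaps(qry, open_dt, type = "any", which = TRUE)
report$NumberOpenInclusive <- tabulate(hits$xid[!is.na(hits$yid)], nbins = nrow(report))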

Count rows based on multiple consecutive time flows

I have a large file of time-series data, which looks as follows. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time <- c("2014-04-30 23:30:00", "2014-04-30 23:45:00", "2014-05-01 00:00:00", "2014-05-01 00:15:00",
          "2014-05-12 13:45:00", "2014-05-12 14:00:00", "2014-05-12 13:45:00", "2014-05-12 14:00:00",
          "2014-05-12 14:30:00")
uniqueid <- c("a", "a", "a", "a", "a", "a", "b", "b", "b")
mydf <- data.frame(uniqueid, time)
My goal is to count the number of rows per unique id per consecutive time flow. A consecutive timespan is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 2014-04-30 23:30 until 2014-05-01 00:15, hence 4 rows). When this flow of 15-minute iterations is broken (after 2014-05-01 00:15 there is no stamp at 00:30), the next timestamp should start a new consecutive time flow, and the rows are counted again until that flow is broken in turn. Time is POSIX.
As you can see in the example above, a consecutive time flow may span different days, months, or years. I have many unique ids (and, as said, a very large file), so I'm looking for something my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge I don't even know where to begin.
I hope all is clear; if not, I'd be happy to clarify further. Thank you very much in advance!
Another way to do this with data.table, also using a time difference, is to make use of the data.table internal values for group number (.GRP) and number of rows in each group (.N):
library(data.table)
res <- setDT(mydf)[, list(number_rows = .N, flow = .GRP),
                   by = .(uniqueid,
                          cumsum(as.numeric(difftime(time, shift(time, 1L, type = "lag", fill = 0))) - 15))][, cumsum := NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the sample data you posted didn't align with the subset shown at the top of your question, I have included the data I used below:
Data
time <- as.POSIXct(c("2014-04-30 23:30:00", "2014-04-30 23:45:00", "2014-05-01 00:00:00", "2014-05-01 00:15:00",
                     "2014-05-12 13:45:00", "2014-05-12 14:00:00", "2014-05-12 13:45:00", "2014-05-12 14:00:00",
                     "2014-05-12 14:30:00"))
uniqueid <- c("a", "a", "a", "a", "a", "a", "b", "b", "b")
mydf <- data.frame(uniqueid, time)
You can group by uniqueid and the cumulative sum of the time differences between rows that are not equal to 15 minutes; that gives the flow id, and a count of rows then gives what you need.
A justification of the logic: whenever the time difference within a uniqueid is not 15, a new flow should start, so we label that row TRUE; combined with cumsum, this becomes a new flow id for the following consecutive rows:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
mydf %>%
  group_by(uniqueid, flow = 1 + cumsum(c(FALSE, diff(time) != 15))) %>%
  summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
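One caveat worth flagging (my observation, not part of the original answer): diff() on POSIXct picks its units automatically, so comparing against a bare 15 silently assumes minutes. A units-explicit variant of the same idea:
mydf %>%
  group_by(uniqueid,
           flow = 1 + cumsum(c(FALSE, as.numeric(diff(time), units = "mins") != 15))) %>%
  summarize(num_rows = n())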
Base R is pretty fast here. Using crude benchmarking, I found it finished in half the time of data.table, and I got tired of waiting for dplyr.
# estimated size of data: years x days x hours x 15-min intervals x unique ids
5 * 365 * 24 * 4 * 1000  # = approx 180M
# make data with POSIXct and characters, 180M rows; mydf is approx 2.5GB in memory
time <- rep(as.POSIXct(c("2014-04-30 23:30:00", "2014-04-30 23:45:00", "2014-05-01 00:00:00", "2014-05-01 00:15:00",
                         "2014-05-12 13:45:00", "2014-05-12 14:00:00", "2014-05-12 13:45:00", "2014-05-12 14:00:00",
                         "2014-05-12 14:30:00")), times = 20000000)
uniqueid <- rep(as.character(c("a", "a", "a", "a", "a", "a", "b", "b", "b")), times = 20000000)
mydf <- data.frame(uniqueid, time = time)
rm(time, uniqueid); gc()
Base R:
# assumes uniqueid's are grouped and in order, and that an 'a' row is never
# followed by a 'b' row that continues the 15-minute flow
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0, diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append the rle lengths and remove the separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time() - starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res <- setDT(mydf)[, list(number_rows = .N, flow = .GRP),
                   by = .(uniqueid,
                          cumsum(as.numeric(difftime(time, shift(time, 1L, type = "lag", fill = 0))) - 15))][, cumsum := NULL]
print(Sys.time() - starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
starttime <- Sys.time()
# the time column is already POSIXct here, so diff() applies directly
mydf %>%
  group_by(uniqueid, flow = 1 + cumsum(c(FALSE, diff(time) != 15))) %>%
  summarize(num_rows = n())
print(Sys.time() - starttime)
# too long, did not finish after a few minutes
I think the assumption that uniqueid's and times are already in order is a big one, and the other solutions might be better able to take advantage of unsorted data. order() is easy enough to apply first, as shown below. I'm not sure about the impact on memory, or how this behaves on data sets that aren't so simple. It should be easy enough to break the data into chunks and process them if memory is an issue. It certainly takes more code in base R.
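For example (a sketch, assuming the raw frame is unsorted):
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]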
Having both "id" and "time" columns ordered, we can build a single grouping vector: a logical that is TRUE wherever either "id" changes or the gap in "time" exceeds 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id" changes:
id_gr = c(TRUE, id[-1] != id[-length(id)])
and where the "time" gap exceeds 15 minutes:
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], units = "mins")) > 15
and combine them:
gr = id_gr | tm_gr
which is TRUE wherever either "id" changed or the "time" gap is > 15 minutes.
And to get the result:
tab = tabulate(cumsum(gr))  # basically the only operation per group: 'n' by group
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
#  id flow n
#1  a    1 4
#2  a    2 2
#3  b    3 2
#4  b    4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
                  cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300),
                                sample(5e2:1e3, 1), TRUE) * 60),
                simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
  id = as.character(dat$id); tm = dat$time
  id_gr = c(TRUE, id[-1] != id[-length(id)])
  tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], units = "mins")) > 15
  gr = id_gr | tm_gr
  tab = tabulate(cumsum(gr))
  ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
#   user  system elapsed
#   1.44    0.19    1.66
For comparison, MikeyMike's answer is included:
library(data.table)
dat2 = copy(dat)
system.time({
  ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
                     by = .(id, cumsum(as.numeric(difftime(time,
                                                           shift(time, 1L, type = "lag", fill = 0),
                                                           units = "mins")) > 15))][, cumsum := NULL]
})
#   user  system elapsed
#   3.95    0.22    4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE
