How to replace missing values in time series data by looping? - r

I'm trying to create a loop to replace missing time series data with value == 0.
This is my data:
df
Times value
05-03-2018 09:00:00 1
05-03-2018 09:01:26 2
05-03-2018 09:04:28 1
05-03-2018 09:07:05 2
05-03-2018 09:09:05 1
and my desired output is:
Times value
05-03-2018 09:00:00 1
05-03-2018 09:01:26 2
05-03-2018 09:02:00 0
05-03-2018 09:03:00 0
05-03-2018 09:04:28 1
05-03-2018 09:05:00 0
05-03-2018 09:06:00 0
05-03-2018 09:07:05 2
05-03-2018 09:08:00 0
05-03-2018 09:09:05 1
Missing minutes in the data should be created and assigned a value of 0.
What should I do? Create a dummy table with the missing minutes, or loop over a sequence?

You could do this with the dplyr and padr packages. padr is very useful for extending a datetime series between two dates or adding missing values.
library(dplyr)
library(padr)
df1 %>%
  thicken(interval = "min") %>%      # roll the time series up to minutes (adds a Times_min column)
  pad(by = "Times_min") %>%          # add rows for the missing minute intervals
  fill_by_value(value) %>%           # fill missing values with 0
  mutate(Times = if_else(is.na(Times), Times_min, Times)) %>%  # fill NAs in the Times column
  select(-Times_min)                 # drop the helper column
pad applied on the interval: min
Times value
1 2018-03-05 09:00:00 1
2 2018-03-05 09:01:26 2
3 2018-03-05 09:02:00 0
4 2018-03-05 09:03:00 0
5 2018-03-05 09:04:28 1
6 2018-03-05 09:05:00 0
7 2018-03-05 09:06:00 0
8 2018-03-05 09:07:05 2
9 2018-03-05 09:08:00 0
10 2018-03-05 09:09:05 1
Data:
df1 <- structure(list(Times = structure(c(1520240400, 1520240486, 1520240668,
1520240825, 1520240945), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
value = c(1, 2, 1, 2, 1)), row.names = c(NA, -5L), class = "data.frame")
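If you prefer to stay within the tidyverse, roughly the same fill can be sketched with tidyr::complete() plus a minute sequence (a sketch, assuming df1 as defined above; floor_date() comes from lubridate):
library(dplyr)
library(tidyr)
library(lubridate)

df1 %>%
  mutate(minute = floor_date(Times, "minute")) %>%                # bucket each timestamp by minute
  complete(minute = seq(min(minute), max(minute), by = "1 min"),
           fill = list(value = 0)) %>%                            # create the missing minutes with value 0
  mutate(Times = coalesce(Times, minute)) %>%                     # new rows get the on-the-minute time
  arrange(Times) %>%
  select(Times, value)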

You could create a second 'complete' data frame and merge them together.
dif <- diff(as.numeric(range(df1$Times)))
df1 <- merge(df1,
             data.frame(Times = as.POSIXct(0:(dif/60)*60,
                                           origin = df1[1, 1], tz = "UTC")),
             all = TRUE)
Then replace resulting NAs with 0.
df1[is.na(df1$value), 2] <- 0
Finally remove the duplicates: each generated on-the-minute row sits directly above an original observation in the same minute, and since unary minus binds tighter than +, -which(...) + 1 drops exactly the row before each duplicated minute (an equivalent, more explicit spelling appears after the Data block below).
df1 <- df1[-which(duplicated(strftime(df1$Times, format="%M"))) + 1, ]
Yields:
> df1
Times value
1 2018-03-05 09:00:00 1
3 2018-03-05 09:01:26 2
4 2018-03-05 09:02:00 0
5 2018-03-05 09:03:00 0
7 2018-03-05 09:04:28 1
8 2018-03-05 09:05:00 0
9 2018-03-05 09:06:00 0
11 2018-03-05 09:07:05 2
12 2018-03-05 09:08:00 0
14 2018-03-05 09:09:05 1
Data:
df1 <- structure(list(Times = structure(c(1520240400, 1520240486, 1520240668,
1520240825, 1520240945), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
value = c(1, 2, 1, 2, 1)), row.names = c(NA, -5L), class = "data.frame")
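An equivalent but more explicit spelling of the duplicate-removal step above, in case the precedence trick reads as line noise (a sketch; it replaces that line rather than following it):
dup <- duplicated(strftime(df1$Times, format = "%M"))  # TRUE on the second row seen in each minute
df1 <- df1[!c(dup[-1], FALSE), ]                       # drop the generated row just before each duplicate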

library(tidyverse)
library(lubridate)
library(magrittr)
Recreate your data
df <- tibble(
Times = c("05-03-2018 09:00:00", "05-03-2018 09:01:26",
"05-03-2018 09:04:28", "05-03-2018 09:07:05",
"05-03-2018 09:09:05"),
value = c(1, 2, 1, 2, 1)
)
Code
Parse your Times variable to datetime
df$Times %<>% parse_datetime("%d-%m-%Y %H:%M:%S")
Create a new variable join that is truncated to the minute
df %<>% mutate(join = floor_date(Times, unit = "minute"))
Create a new data frame with one variable also called join and containing every minute in your range
all <- tibble(
  join = seq(as_datetime(first(df$Times)), as_datetime(last(df$Times)), by = 60)
)
Join both data frames
result <- left_join(all, df)
Add the "missing minutes" to your Times variable
result$Times[is.na(result$Times)] <- result$join[is.na(result$Times)]
Replace the NAs with 0
result$value[is.na(result$value)] <- 0
Remove the join variable
result %<>%
  select(-join)
Result
# A tibble: 10 x 2
Times value
<dttm> <dbl>
1 2018-03-05 09:00:00 1
2 2018-03-05 09:01:26 2
3 2018-03-05 09:02:00 0
4 2018-03-05 09:03:00 0
5 2018-03-05 09:04:28 1
6 2018-03-05 09:05:00 0
7 2018-03-05 09:06:00 0
8 2018-03-05 09:07:05 2
9 2018-03-05 09:08:00 0
10 2018-03-05 09:09:05 1
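For reference, the same steps condense into a single pipe (a sketch reusing the df and all objects built above):
result <- df %>%
  right_join(all, by = "join") %>%          # keep every minute in the range
  mutate(Times = coalesce(Times, join),     # missing minutes get the on-the-minute time
         value = coalesce(value, 0)) %>%    # and a value of 0
  arrange(Times) %>%
  select(-join)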

Related

Beginner: set up time series in R

I am brand new to R, and am having trouble figuring out how to set up a simple time series.
Illustration: say I have three variables: Event (0 or 1), HR (heart rate), DT (datetime):
df = data.frame(Event = c(1,0,0,0,1,0,0),
                HR = c(100,120,115,105,105,115,100),
                DT = c("2020-01-01 09:00:00","2020-01-01 09:15:00","2020-01-01 10:00:00",
                       "2020-01-01 10:30:00","2020-01-01 11:00:00","2020-01-01 12:00:00",
                       "2020-01-01 13:00:00"),
                stringsAsFactors = F)
Event HR DT
1 1 100 2020-01-01 09:00:00
2 0 120 2020-01-01 09:15:00
3 0 115 2020-01-01 10:00:00
4 0 105 2020-01-01 10:30:00
5 1 105 2020-01-01 11:00:00
6 0 115 2020-01-01 12:00:00
7 0 100 2020-01-01 13:00:00
What I would like to do is calculate the elapsed time after each new event: so row1 = 0 min, row2 = 15, row3 = 60, ... row5 = 0, row6 = 60. Then I can do things like plot HR vs. elapsed time.
What might be a simple way to calculate elapsed time?
Apologies for such a low level question, but would be very grateful for any help!
Here is a one-line approach using data.table.
Data:
df <- structure(list(Event = c(1, 0, 0, 0, 1, 0, 0), HR = c(100, 120,
115, 105, 105, 115, 100), DT = structure(c(1577869200, 1577870100,
1577872800, 1577874600, 1577876400, 1577880000, 1577883600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -7L), class = "data.frame")
Code:
library(data.table)
dt <- as.data.table(df)
dt[, mins_since_last_event := as.numeric(difftime(DT,DT[1],units = "mins")), by = .(cumsum(Event))]
Output:
dt
Event HR DT mins_since_last_event
1: 1 100 2020-01-01 09:00:00 0
2: 0 120 2020-01-01 09:15:00 15
3: 0 115 2020-01-01 10:00:00 60
4: 0 105 2020-01-01 10:30:00 90
5: 1 105 2020-01-01 11:00:00 0
6: 0 115 2020-01-01 12:00:00 60
7: 0 100 2020-01-01 13:00:00 120
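The key is by = .(cumsum(Event)): the running sum increments at every row where Event is 1, so each event opens a new group and DT[1] is that event's timestamp. The same trick can be sketched in base R with ave(), assuming the df defined above (DT already POSIXct):
df$mins_since_last_event <- ave(
  as.numeric(df$DT),                  # POSIXct as seconds since the epoch
  cumsum(df$Event),                   # group id that bumps at every event row
  FUN = function(x) (x - x[1]) / 60   # minutes since the group's first row
)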
The following uses the chron package, converting your date/time column to chron objects so that time arithmetic and conversions work on it.
Example Data:
df <- data.frame(
Event=c(1,0,0,0,1,0,0),
HR=c(100,125,115,105,105,115,100),
DT=c("2020-01-01 09:00:00"
,"2020-01-01 09:15:00"
,"2020-01-01 10:00:00"
,"2020-01-01 10:30:00"
,"2020-01-01 11:00:00"
,"2020-01-01 12:00:00"
,"2020-01-01 13:00:00"))
Code:
library(chron)
# split "YYYY-MM-DD HH:MM:SS" into separate date and time strings for chron()
Dates <- lapply(strsplit(as.character(df$DT), " "), head, n = 1)
Times <- lapply(strsplit(as.character(df$DT), " "), tail, n = 1)
df$DT <- chron(as.character(Dates), as.character(Times), format = c(dates = "y-m-d", times = "h:m:s"))
df$TimeElapsed[1] <- 0  # create the column before the loop fills it
for (i in 1:nrow(df)) {
  if (df$Event[i] == 1) {TimeStart <- df$DT[i]}          # restart the clock at each event
  df$TimeElapsed[i] <- (df$DT[i] - TimeStart) * 24 * 60  # chron differences are in days; convert to minutes
}
output:
> df
Event HR DT TimeElapsed
1 1 100 (20-01-01 09:00:00) 0
2 0 125 (20-01-01 09:15:00) 15
3 0 115 (20-01-01 10:00:00) 60
4 0 105 (20-01-01 10:30:00) 90
5 1 105 (20-01-01 11:00:00) 0
6 0 115 (20-01-01 12:00:00) 60
7 0 100 (20-01-01 13:00:00) 120
Welcome to Stack Overflow @greyguy.
Here is an approach with the dplyr library, which is pretty good with large data sets (na.locf() comes from the zoo package):
library(dplyr)
library(zoo)  # for na.locf()
# Your data
df = data.frame(Event = c(1,0,0,0,1,0,0),
                HR = c(100,120,115,105,105,115,100),
                DT = c("2020-01-01 09:00:00","2020-01-01 09:15:00","2020-01-01 10:00:00",
                       "2020-01-01 10:30:00","2020-01-01 11:00:00","2020-01-01 12:00:00",
                       "2020-01-01 13:00:00"),
                stringsAsFactors = F)
# Convert DT to a time format (not a string) and order by time if not already ordered
df = df %>%
  mutate(DT = as.POSIXct(DT, format = "%Y-%m-%d %H:%M:%S")) %>%
  arrange(DT) %>%
  mutate(# little trick to carry the last event's DT forward
         last_DT = case_when(Event == 1 ~ DT),
         last_DT = na.locf(last_DT),
         Elapsed_min = as.numeric(difftime(DT, last_DT, units = "mins"))
  ) %>%
  select(-last_DT)
The output:
# Event HR DT Elapsed_min
# 1 100 2020-01-01 09:00:00 0
# 0 120 2020-01-01 09:15:00 15
# 0 115 2020-01-01 10:00:00 60
# 0 105 2020-01-01 10:30:00 90
# 1 105 2020-01-01 11:00:00 0
# 0 115 2020-01-01 12:00:00 60
# 0 100 2020-01-01 13:00:00 120

Complex conditional groupby in R

Here is the problem I am trying to solve.
I want to take table 1 to table 2.
Table 1 :
df
# icustay_id starttime endtime vaso_rate vaso_amount
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 3 293.0896
# 2 1 2019-09-11 13:30:00 2019-09-12 01:20:00 9 602.9983
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939
Table 2 :
df
# icustay_id starttime endtime vaso_rate vaso_amount
# 1 1 2019-09-10 13:20:00 2019-09-12 01:20:00 3 293.0896
# 2 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360
# 3 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494
# 4 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939
As you can see, I am trying to build a function that will:
For every single unique patient (unique icustay_id), group rows ONLY if the medication has been stopped for less than an hour.
When rows merge:
Some columns will retain the same value (i.e. the patient identifiers)
Some columns must be modified:
Keep the earlier starttime
Keep the later endtime
Average the vaso_rate
Sum the vaso_amount
To do so, I have decided to add another identifier column that takes the value 1 when the condition is met; once all rows are verified, group by icustay_id and that new column.
My code as written, however, does not assign the appropriate ID with respect to the condition.
Here is the sample df creation code:
set.seed(1)
df <- data.frame(
icustay_id = c(1, 1, 1, 2, 3),
starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-11 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
vaso_rate = sample(1:10, 5, replace = TRUE),
vaso_amount = runif(5, 0, 1000)
)
Here is the function code that I have right now:
merge_pressor_doses <- function(df){
  df %>% arrange(icustay_id, starttime)
  for (i in unique(df$icustay_id))
  {
    for (j in which(df$icustay_id == i))
    {
      start <- df$starttime[as.numeric(j) + 1]
      end <- df$endtime[as.numeric(j)]
      stopduration <- as.numeric(difftime(start, end, units = 'mins'))
      bool <- stopduration < 60
      df <- df %>% mutate(
        group = case_when(
          bool = TRUE ~ 1,
          bool = FALSE ~ 0)
      )
    }
  }
  return(df)
}
This should result in:
df
# icustay_id starttime endtime vaso_rate vaso_amount group
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 3 293.0896 1
# 2 1 2019-09-11 13:30:00 2019-09-12 01:20:00 9 602.9983 1
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360 0
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494 1
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939 1
But in my case the 3rd row is assigned a value of 1...
If I can manage to make this portion of the code work, I can proceed with the second portion to achieve my objective.
That second portion would be:
group_by(group, icustay_id) %>%
summarise(
starttime = min(starttime),
endtime = max(endtime),
vaso_rate = mean(vaso_rate),
sum_vaso_amount = sum(vaso_amount))
Thank you in advance!!
I'd create a new column pause which says how much time passed since the last medication. Then, using this column, we assign group ids to medications with cumsum(pause >= 1): the counter starts at 0 and increments whenever pause is >= 1 hour, so each long gap starts a different group.
set.seed(1)
df <- data.frame(
icustay_id = c(1, 1, 1, 2, 3),
starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-11 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
vaso_rate = sample(1:10, 5, replace = TRUE),
vaso_amount = runif(5, 0, 1000)
)
library(dplyr)
library(tidyr)
df <-
df %>%
group_by(icustay_id) %>%
mutate(pause = difftime(starttime, lag(endtime), units = "hours")) %>%
replace_na(list(pause = 0)) %>%
mutate(vaso_id = cumsum(pause >= 1))
# A tibble: 5 x 7
# Groups: icustay_id [3]
# icustay_id starttime endtime vaso_rate vaso_amount pause vaso_id
# <dbl> <dttm> <dttm> <int> <dbl> <drtn> <int>
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 9 898. 0.0000000 hours 0
# 2 1 2019-09-11 13:30:00 2019-09-11 01:20:00 4 945. 0.1666667 hours 0
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 7 661. 87.3333333 hours 1
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 1 629. 0.0000000 hours 0
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 2 61.8 0.0000000 hours 0
Then we can use the code you provided.
df %>%
group_by(icustay_id, vaso_id) %>%
summarise(
starttime = min(starttime),
endtime = max(endtime),
vaso_rate = mean(vaso_rate),
sum_vaso_amount = sum(vaso_amount)
)
# A tibble: 4 x 6
# Groups: icustay_id [3]
# icustay_id vaso_id starttime endtime vaso_rate sum_vaso_amount
# <dbl> <int> <dttm> <dttm> <dbl> <dbl>
# 1 1 0 2019-09-10 13:20:00 2019-09-11 13:20:00 6.5 1843.
# 2 1 1 2019-09-14 16:40:00 2019-09-15 16:40:00 7 661.
# 3 2 0 2019-09-10 12:40:00 2019-09-13 13:20:00 1 629.
# 4 3 0 2019-09-10 01:20:00 2019-09-11 13:20:00 2 61.8
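Both steps can also be chained into one pipeline starting from the original df (a sketch; vaso_id is computed inside group_by() and the helper pause column never needs to be kept):
library(dplyr)
library(tidyr)

df %>%
  group_by(icustay_id) %>%
  mutate(pause = difftime(starttime, lag(endtime), units = "hours")) %>%
  replace_na(list(pause = 0)) %>%
  group_by(icustay_id, vaso_id = cumsum(pause >= 1)) %>%
  summarise(starttime = min(starttime),
            endtime = max(endtime),
            vaso_rate = mean(vaso_rate),
            sum_vaso_amount = sum(vaso_amount),
            .groups = "drop")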

R aggregate second-level data to minutes more efficiently

I have a data.table, allData, containing data on roughly every (POSIXct) second from different nights. Some nights however are on the same date since data is collected from different people, so I have a column nightNo as an id for every different night.
timestamp nightNo data1 data2
2018-10-19 19:15:00 1 1 7
2018-10-19 19:15:01 1 2 8
2018-10-19 19:15:02 1 3 9
2018-10-19 18:10:22 2 4 10
2018-10-19 18:10:23 2 5 11
2018-10-19 18:10:24 2 6 12
I'd like to aggregate the data to minutes (per night) and using this question I've come up with the following code:
aggregate_minute <- function(df){
  df %>%
    group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
    summarise(data1 = mean(data1), data2 = mean(data2)) %>%
    as.data.table()
}
allData <- allData[, aggregate_minute(allData), by=nightNo]
However my data.table is quite large and this code isn't fast enough. Is there a more efficient way to solve this problem?
You can skip the helper function and do the aggregation in a single data.table call, grouping by nightNo and the minute bucket at the same time:
allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)),
                      nightNo = rep(1:2, c(3, 3)),
                      data1 = 1:6,
                      data2 = 7:12)
timestamp nightNo data1 data2
1: 2018-06-14 10:43:11 1 1 7
2: 2018-06-14 10:43:11 1 2 8
3: 2018-06-14 10:43:11 1 3 9
4: 2018-06-14 10:48:31 2 4 10
5: 2018-06-14 10:48:31 2 5 11
6: 2018-06-14 10:48:31 2 6 12
allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]
nightNo timestamp data1 data2
1: 1 2018-06-14 10:43:00 2 8
2: 2 2018-06-14 10:48:00 5 11
A quick benchmark (500 replications on the example data) shows the single call is roughly three times faster:
> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
user system elapsed
3.25 0.02 3.31
> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
user system elapsed
1.02 0.04 1.06
You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.
library(data.table)
library(lubridate)
Reproducible data:
text <- "timestamp nightNo data1 data2
'2018-10-19 19:15:00' 1 1 7
'2018-10-19 19:15:01' 1 2 8
'2018-10-19 19:15:02' 1 3 9
'2018-10-19 18:10:22' 2 4 10
'2018-10-19 18:10:23' 2 5 11
'2018-10-19 18:10:24' 2 6 12"
allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)
Create data.table:
setDT(allData)
Parse the timestamp column and floor it to the nearest minute:
allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
Change the type of the integer columns to numeric:
allData[, ':='(data1 = as.numeric(data1),
data2 = as.numeric(data2))]
Replace the data columns with their means by nightNo group:
allData[, ':='(data1 = mean(data1),
data2 = mean(data2)),
by = nightNo]
The result is:
timestamp nightNo data1 data2
1: 2018-10-19 19:15:00 1 2 8
2: 2018-10-19 19:15:00 1 2 8
3: 2018-10-19 19:15:00 1 2 8
4: 2018-10-19 18:10:00 2 5 11
5: 2018-10-19 18:10:00 2 5 11
6: 2018-10-19 18:10:00 2 5 11
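Note that := broadcasts the group means, keeping one row per second. If you want the table collapsed to one row per night and minute instead, replace the last step with an aggregation (a sketch):
allData[, .(data1 = mean(data1), data2 = mean(data2)),
        by = .(nightNo, timestamp)]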

Consolidating rows by max and min dates

I have a dataset that looks like this.
id1 = c(1,1,1,1,1,1,1,1,2,2)
id2 = c(3,3,3,3,3,3,3,3,3,3)
lat = c(-62.81559,-62.82330, -62.78693,-62.70136, -62.76476,-62.48157,-62.49064,-62.45838,42.06258,42.06310)
lon = c(-61.15518, -61.14885,-61.17801,-61.00363, -59.14270, -59.22009, -59.32967, -59.04125 ,154.70579, 154.70625)
start_date= as.POSIXct(c('2016-03-24 15:30:00', '2016-03-24 15:30:00','2016-03-24 23:40:00','2016-03-25 12:50:00','2016-03-29 18:20:00','2016-06-01 02:40:00','2016-06-01 08:00:00','2016-06-01 16:30:00','2016-07-29 20:20:00','2016-07-29 20:20:00'), tz = 'UTC')
end_date = as.POSIXct(c('2016-03-24 23:40:00', '2016-03-24 18:50:00','2016-03-25 03:00:00','2016-03-25 19:20:00','2016-04-01 03:30:00','2016-06-02 01:40:00','2016-06-01 14:50:00','2016-06-02 01:40:00','2016-07-30 07:00:00','2016-07-30 07:00:00'),tz = 'UTC')
speed = c(2.9299398, 2.9437502, 0.0220565, 0.0798409, 1.2824859, 1.8685429, 3.7927680, 1.8549291, 0.8140249,0.8287073)
df = data.frame(id1, id2, lat, lon, start_date, end_date, speed)
id1 id2 lat lon start_date end_date speed
1 1 3 -62.81559 -61.15518 2016-03-24 15:30:00 2016-03-24 23:40:00 2.9299398
2 1 3 -62.82330 -61.14885 2016-03-24 15:30:00 2016-03-24 18:50:00 2.9437502
3 1 3 -62.78693 -61.17801 2016-03-24 23:40:00 2016-03-25 03:00:00 0.0220565
4 1 3 -62.70136 -61.00363 2016-03-25 12:50:00 2016-03-25 19:20:00 0.0798409
5 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
6 1 3 -62.48157 -59.22009 2016-06-01 02:40:00 2016-06-02 01:40:00 1.8685429
7 1 3 -62.49064 -59.32967 2016-06-01 08:00:00 2016-06-01 14:50:00 3.7927680
8 1 3 -62.45838 -59.04125 2016-06-01 16:30:00 2016-06-02 01:40:00 1.8549291
9 2 3 42.06258 154.70579 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8140249
10 2 3 42.06310 154.70625 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8287073
The actual dataset is larger. What I would like to do is consolidate this dataset based on date ranges, grouped by id1 and id2, such that if the date/time range on one row is within 12 hours of the next row's range (ABS(end_date[1] - start_date[2]) < 12 hrs), the rows should be consolidated, with the new start_date being the earliest date and the end_date being the latest. All other values (lat, lon, speed) will be averaged. This is in some sense a 'deduping' effort, as rows that are within 12 hours of each other actually represent the same 'event'. For the above example the final result would be
id1 id2 lat lon start_date end_date speed
1 1 3 -62.7818 -61.12142 2016-03-24 15:30:00 2016-03-25 19:20:00 1.493897
2 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
3 1 3 -62.47686 -59.197 2016-06-01 02:40:00 2016-06-02 01:40:00 2.505413
4 2 3 42.06284 154.706 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8213661
With the first four rows consolidated (into row1), the 5 row left alone (row2), the 6-8 rows consolidated (row3), and the 9-10 rows consolidated (row4).
I have been trying to do this with dplyr group_by and summarize, but I can't seem to get the get the date ranges to come out correctly.
Hopefully someone can determine a simple means of solving the problem. Extra points if you know how to do it in SQL ;-) so I can dedupe before even pulling this into R.
Here is a first very naive implementation. Warning: it is slow, not pretty and still missing the start and end dates in the output! Note that it expects the rows to be ordered by date and time. If that's not the case in the data set, you can do it in R or SQL first. Sorry that I can't think of a dplyr or SQL solution. I'd also like to see those two, if anyone has got an idea.
dedupe <- function(df) {
  counter = 1
  temp_vector = unlist(df[1, ])
  summarized_df = df[0, c(1, 2, 3, 4, 7)]
  colnames(summarized_df) = colnames(df)[c(1, 2, 3, 4, 7)]
  summarized_df$counter = NULL
  for (i in 2:nrow(df)) {
    if (((abs(difftime(df[i, "start_date"], df[i - 1, "end_date"], units = "h")) < 12) ||
         abs(difftime(df[i, "start_date"], df[i - 1, "start_date"], units = "h")) < 12) &&
        df[i, "id1"] == df[i - 1, "id1"] &&
        df[i, "id2"] == df[i - 1, "id2"]) {
      # group events because the id is the same and the time ranges overlap:
      # sum up the columns and keep the maximum end_date
      temp_vector[c(3, 4, 7)] = temp_vector[c(3, 4, 7)] + unlist(df[i, c(3, 4, 7)])
      temp_vector["end_date"] = max(temp_vector["end_date"], df[i, "end_date"])
      counter = counter + 1
      if (i == nrow(df)) {
        # in the last iteration we need to close the current group
        summarized_df[nrow(summarized_df) + 1, c(1, 2)] = df[i, c(1, 2)]
        summarized_df[nrow(summarized_df), 3:5] = temp_vector[c(3, 4, 7)] / counter
        summarized_df[nrow(summarized_df), "counter"] = counter
      }
    } else {
      # new event, so we write the group statistics from temp_vector and reset it along with counter
      summarized_df[nrow(summarized_df) + 1, c(1, 2)] = df[i, c(1, 2)]
      summarized_df[nrow(summarized_df), 3:5] = temp_vector[c(3, 4, 7)] / counter
      summarized_df[nrow(summarized_df), "counter"] = counter
      counter = 1
      temp_vector[c(3, 4, 7)] = unlist(df[i, c(3, 4, 7)])
    }
  }
  return(summarized_df)
}
Function call
> dedupe(df)
id1 id2 lat lon speed counter
5 1 3 -62.78179 -61.12142 1.4938968 4
6 1 3 -62.76476 -59.14270 1.2824859 1
9 2 3 -62.47686 -59.19700 2.5054133 3
10 2 3 42.06284 154.70602 0.8213661 2
This can be easily achieved by using insurancerating::reduce():
df |>
  insurancerating::reduce(begin = start_date, end = end_date, id1, id2,
                          agg_cols = c(lat, lon, speed), agg = "mean",
                          min.gapwidth = 12 * 3600)
#> id1 id2 index end_date start_date lat lon
#> 1 1 3 0 2016-03-25 19:20:00 2016-03-24 15:30:00 -62.78180 -61.12142
#> 2 1 3 1 2016-04-01 03:30:00 2016-03-29 18:20:00 -62.76476 -59.14270
#> 3 1 3 2 2016-06-02 01:40:00 2016-06-01 02:40:00 -62.47686 -59.19700
#> 4 2 3 0 2016-07-30 07:00:00 2016-07-29 20:20:00 42.06284 154.70602
#> speed
#> 1 1.4938969
#> 2 1.2824859
#> 3 2.5054133
#> 4 0.8213661
Created on 2022-06-13 by the reprex package (v2.0.1)
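Roughly the same gap-and-island idea behind reduce() can be sketched in dplyr, starting a new group whenever the gap to the previous end_date reaches 12 hours (a sketch; assumes rows are sorted by start_date within each id):
library(dplyr)

df %>%
  arrange(id1, id2, start_date) %>%
  group_by(id1, id2) %>%
  mutate(gap = as.numeric(difftime(start_date, lag(end_date), units = "hours")),
         grp = cumsum(is.na(gap) | abs(gap) >= 12)) %>%  # NA marks each id's first row
  group_by(id1, id2, grp) %>%
  summarise(lat = mean(lat), lon = mean(lon),
            start_date = min(start_date), end_date = max(end_date),
            speed = mean(speed), .groups = "drop") %>%
  select(-grp)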

summarize by time interval not working

I have the following data: a list of POSIXct times spanning one month, each representing a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First all of the trips need to be summed and binned into intervals, then divided by the number of days. So far I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values and I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXct("2014-10-01 00:00:00"), as.POSIXct("2014-10-31 23:00:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
But this is impossible, because when I sum the counts
sum(test$count)
[1] 7494
I get 7494, whereas the total should be 1747.
I'm not sure where I went wrong and how to simplify this code to get the same result.
I've done what I can, but I can't reproduce your issue without your data.
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10 minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00", ## added
"2014-10-21 16:58:13",
"2014-10-07 10:14:22",
"2014-10-20 01:45:11",
"2014-10-17 08:16:17",
"2014-10-07 10:16:36", ## modified
"2014-10-28 17:32:34"))
trip_times <- data.frame(start_times) %>%
mutate(out_buckets = as.POSIXct(cut(start_times, breaks="10 mins")))
The start_times and all the 10 minute intervals can then be merged
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10 minute block and counted
trips_merged %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider time, not date
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
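To reach the OP's original goal (the average number of deliveries per 10-minute slot across the month), the per-slot counts just need dividing by the number of days observed. A sketch continuing from trips_merged2, where n_days is an assumption to adjust to the real span of the data:
n_days <- 31  # assumed: the data covers all of October 2014

trips_merged2 %>%
  filter(!is.na(start_times)) %>%
  group_by(out_buckets) %>%
  summarise(trip_count = n()) %>%
  mutate(avg_trips_per_day = trip_count / n_days)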
