Beginner: set up time series in R

I am brand new to R, and am having trouble figuring out how to set up a simple time series.
Illustration: say I have three variables: Event (0 or 1), HR (heart rate), DT (datetime):
df = data.frame(Event = c(1,0,0,0,1,0,0),
HR= c(100,120,115,105,105,115,100),
DT= c("2020-01-01 09:00:00","2020-01-01 09:15:00","2020-01-01 10:00:00","2020-01-01 10:30:00",
"2020-01-01 11:00:00","2020-01-01 12:00:00","2020-01-01 13:00:00"),
stringsAsFactors = F
)
Event HR DT
1 1 100 2020-01-01 09:00:00
2 0 120 2020-01-01 09:15:00
3 0 115 2020-01-01 10:00:00
4 0 105 2020-01-01 10:30:00
5 1 105 2020-01-01 11:00:00
6 0 115 2020-01-01 12:00:00
7 0 100 2020-01-01 13:00:00
What I would like to do is calculate the elapsed time after each new event: row 1 = 0 min, row 2 = 15, row 3 = 60, ..., row 5 = 0, row 6 = 60. Then I can do things like plot HR vs. elapsed time.
What might be a simple way to calculate elapsed time?
Apologies for such a low-level question, but I would be very grateful for any help!

Here is a one-line approach using data.table.
Data:
df <- structure(list(Event = c(1, 0, 0, 0, 1, 0, 0), HR = c(100, 120,
115, 105, 105, 115, 100), DT = structure(c(1577869200, 1577870100,
1577872800, 1577874600, 1577876400, 1577880000, 1577883600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -7L), class = "data.frame")
Code:
library(data.table)
dt <- as.data.table(df)
dt[, mins_since_last_event := as.numeric(difftime(DT,DT[1],units = "mins")), by = .(cumsum(Event))]
Output:
dt
Event HR DT mins_since_last_event
1: 1 100 2020-01-01 09:00:00 0
2: 0 120 2020-01-01 09:15:00 15
3: 0 115 2020-01-01 10:00:00 60
4: 0 105 2020-01-01 10:30:00 90
5: 1 105 2020-01-01 11:00:00 0
6: 0 115 2020-01-01 12:00:00 60
7: 0 100 2020-01-01 13:00:00 120
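The grouping works because cumsum(Event) increases by one at each event row, so every row up to the next event shares the same group label, and elapsed time is measured from that group's first timestamp. For illustration only (not part of the original answer), the same trick can be sketched in base R with ave(), using the POSIXct df defined above:
# Base R sketch: cumsum(Event) labels the event blocks, ave() works within each block
df$mins_since_last_event <- ave(as.numeric(df$DT), cumsum(df$Event),
                                FUN = function(x) (x - x[1]) / 60)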

The following uses the chron package and converts your date/time column to chron time objects so that the package can run calculations and conversions on it.
Example Data:
df <- data.frame(
Event=c(1,0,0,0,1,0,0),
HR=c(100,125,115,105,105,115,100),
DT=c("2020-01-01 09:00:00"
,"2020-01-01 09:15:00"
,"2020-01-01 10:00:00"
,"2020-01-01 10:30:00"
,"2020-01-01 11:00:00"
,"2020-01-01 12:00:00"
,"2020-01-01 13:00:00"))
Code:
library(chron)
# Split "YYYY-MM-DD HH:MM:SS" into separate date and time strings for chron()
Dates <- lapply(strsplit(as.character(df$DT), " "), head, n = 1)
Times <- lapply(strsplit(as.character(df$DT), " "), tail, n = 1)
df$DT <- chron(as.character(Dates), as.character(Times), format = c(dates = "y-m-d", times = "h:m:s"))
df$TimeElapsed[1] <- 0
# Walk through the rows, remembering the time of the most recent event;
# chron differences are in days, so multiply by 24*60 to get minutes
for (i in 1:nrow(df)) {
  if (df$Event[i] == 1) {TimeStart <- df$DT[i]}
  df$TimeElapsed[i] <- (df$DT[i] - TimeStart) * 24 * 60
}
Output:
> df
Event HR DT TimeElapsed
1 1 100 (20-01-01 09:00:00) 0
2 0 125 (20-01-01 09:15:00) 15
3 0 115 (20-01-01 10:00:00) 60
4 0 105 (20-01-01 10:30:00) 90
5 1 105 (20-01-01 11:00:00) 0
6 0 115 (20-01-01 12:00:00) 60
7 0 100 (20-01-01 13:00:00) 120
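If you don't need chron specifically, the string splitting can be skipped: as.POSIXct() parses the combined date/time string directly, and difftime() returns minutes. A minimal sketch starting from the character DT column of the example data above (not part of the original answer):
# Sketch without chron: parse the combined string and let difftime() return minutes
df$DT <- as.POSIXct(df$DT, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
df$TimeElapsed <- NA_real_
TimeStart <- df$DT[1]
for (i in 1:nrow(df)) {
  if (df$Event[i] == 1) TimeStart <- df$DT[i]
  df$TimeElapsed[i] <- as.numeric(difftime(df$DT[i], TimeStart, units = "mins"))
}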

Welcome to Stack Overflow #greyguy.
Here is an approach with the dplyr library, which handles large data sets well:
library(dplyr)
library(zoo)  # for na.locf()

# Your data
df = data.frame(Event = c(1,0,0,0,1,0,0),
                HR = c(100,120,115,105,105,115,100),
                DT = c("2020-01-01 09:00:00","2020-01-01 09:15:00","2020-01-01 10:00:00","2020-01-01 10:30:00",
                       "2020-01-01 11:00:00","2020-01-01 12:00:00","2020-01-01 13:00:00"),
                stringsAsFactors = F
)

# Convert DT to a datetime (not a string) and order by time if not already ordered
df = df %>%
  mutate(DT = as.POSIXct(DT, format = "%Y-%m-%d %H:%M:%S")) %>%
  arrange(DT) %>%
  mutate(# Little trick to get the DT of the last event
         last_DT = case_when(Event == 1 ~ DT),
         last_DT = na.locf(last_DT),
         Elapsed_min = as.numeric((DT - last_DT)/60)
  ) %>%
  select(-last_DT)
The output:
# Event HR DT Elapsed_min
# 1 100 2020-01-01 09:00:00 0
# 0 120 2020-01-01 09:15:00 15
# 0 115 2020-01-01 10:00:00 60
# 0 105 2020-01-01 10:30:00 90
# 1 105 2020-01-01 11:00:00 0
# 0 115 2020-01-01 12:00:00 60
# 0 100 2020-01-01 13:00:00 120
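If you would rather avoid the zoo dependency, tidyr::fill() can carry the last event time downward instead of na.locf(). A sketch of that variant, starting from the raw df defined at the top of this answer (character DT column):
library(tidyr)  # for fill()
df %>%
  mutate(DT = as.POSIXct(DT, format = "%Y-%m-%d %H:%M:%S"),
         last_DT = if_else(Event == 1, DT, as.POSIXct(NA))) %>%
  fill(last_DT) %>%   # carry the last event time forward (downward)
  mutate(Elapsed_min = as.numeric(difftime(DT, last_DT, units = "mins"))) %>%
  select(-last_DT)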

Related

Can I aggregate time series data between an on and off date using a data table join or the aggregate function?

I would like to efficiently summarize continuous meteorological data over the periods that discrete samples are being collected.
I currently do this with a time-consuming loop, but I imagine a better solution exists. I'm new to data.table syntax, but it seems like there should be a solution with joining.
continuous <- data.frame(Time = seq(as.POSIXct("2019-01-01 0:00:00"),
as.POSIXct("2019-01-01 9:00:00"),"hour"),
CO2 = sample(400:450,10),
Temp = sample(10:30,10))
> continuous
Time CO2 Temp
1 2019-01-01 00:00:00 430 11
2 2019-01-01 01:00:00 412 26
3 2019-01-01 02:00:00 427 17
4 2019-01-01 03:00:00 435 29
5 2019-01-01 04:00:00 447 23
6 2019-01-01 05:00:00 417 19
7 2019-01-01 06:00:00 408 12
8 2019-01-01 07:00:00 449 28
9 2019-01-01 08:00:00 445 20
10 2019-01-01 09:00:00 420 27
discrete <- data.frame(on = c(as.POSIXct("2019-01-01 0:00:00"),
as.POSIXct("2019-01-01 3:00:00")),
off = c(as.POSIXct("2019-01-01 3:00:00"),
as.POSIXct("2019-01-01 8:00:00")))
> discrete
on off
1 2019-01-01 00:00:00 2019-01-01 03:00:00
2 2019-01-01 03:00:00 2019-01-01 08:00:00
# which.closest() is not base R; it comes from an add-on package (e.g. birk)
discrete[, c("CO2.mean","Temp.mean")] <-
  lapply(seq(length(c("CO2","Temp"))), function(k)
    unlist(lapply(seq(length(discrete[, 1])), function(i)
      mean(continuous[
        which.closest(continuous$Time, discrete$on[i]):
          which.closest(continuous$Time, discrete$off[i]),
        c("CO2","Temp")[k]]))))
> discrete
on off CO2.mean Temp.mean
1 2019-01-01 00:00:00 2019-01-01 03:00:00 426.0 20.75000
2 2019-01-01 03:00:00 2019-01-01 08:00:00 433.5 21.83333
This works, but when aggregating tens of continuous variables into hundreds of sampling periods, it takes a very long time to run. Thank you for your help!
An option would be a non-equi join in data.table:
library(data.table)
setDT(continuous)[discrete, .(CO2mean = mean(CO2),
                              Tempmean = mean(Temp)),
                  on = .(Time >= on, Time <= off), by = .EACHI]
or with a rolling join
setDT(continuous)[discrete, .(CO2mean = mean(CO2),
                              Tempmean = mean(Temp)),
                  on = .(Time = on, Time = off),
                  by = .EACHI, roll = 'nearest']
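For reference, the same aggregation can also be expressed with foverlaps(), data.table's dedicated overlap join. This is only a sketch of the idea, not part of the original answer:
# foverlaps() needs explicit interval columns in x and a keyed y
setDT(continuous)[, `:=`(start = Time, end = Time)]  # each observation is a zero-length interval
setkey(setDT(discrete), on, off)
ov <- foverlaps(continuous, discrete, by.x = c("start", "end"))
ov[, .(CO2mean = mean(CO2), Tempmean = mean(Temp)), by = .(on, off)]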

Group by with summarise in date difference in R

I am trying to use group_by and then summarise with a date difference calculation. I am not sure if it's a runtime error or something wrong in what I am doing. Sometimes when I run the code I get the output in days and other times in seconds, and I am not sure what causes the change; I am not changing the dataset or the code. The dataset I am using is huge (2,304,433 rows and 40 columns). Both times the output value (digits) is the same, only the unit changes (days to secs). I would like to see the output in days.
This is the code that I am using:
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            Revenue = max(TOTAL_AMT + 0.000001/QUANTITY),
            No_Days = (max(ORDER_DT) - min(ORDER_DT) + 1)/n())
This is the output.
Can anyone please help me on this?
Use difftime(). You might need to specify the units.
set.seed(314)
data <- data.frame(PRODUCT = sample(1:10, size = 10000, replace = TRUE),
PERSON_ID = sample(1:10, size = 10000, replace = TRUE),
ORDER_DT = as.POSIXct(as.Date('2019/01/01') + sample(-300:+300, size = 10000, replace = TRUE)))
require(dplyr)
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            start = min(ORDER_DT),
            end = max(ORDER_DT)) %>%
  mutate(No_Days = (as.double(difftime(end, start, units = "days"), units = "days") + 1)/Freq)
gives:
PRODUCT PERSON_ID Freq start end No_Days
<int> <int> <int> <dttm> <dttm> <dbl>
1 1 1 109 2018-03-21 01:00:00 2019-10-27 02:00:00 5.38
2 1 2 117 2018-03-23 01:00:00 2019-10-26 02:00:00 4.98
3 1 3 106 2018-03-19 01:00:00 2019-10-28 01:00:00 5.56
4 1 4 109 2018-03-07 01:00:00 2019-10-26 02:00:00 5.50
5 1 5 95 2018-03-07 01:00:00 2019-10-16 02:00:00 6.2
6 1 6 79 2018-03-09 01:00:00 2019-10-04 02:00:00 7.28
7 1 7 83 2018-03-09 01:00:00 2019-10-28 01:00:00 7.22
8 1 8 114 2018-03-09 01:00:00 2019-10-16 02:00:00 5.15
9 1 9 100 2018-03-09 01:00:00 2019-10-13 02:00:00 5.84
10 1 10 91 2018-03-11 01:00:00 2019-10-26 02:00:00 6.54
# ... with 90 more rows
Why is the value divided by n()?
A simple as.integer(max(ORDER_DT) - min(ORDER_DT)) should work, but if it doesn't, please be more specific and update the question with more information.
Also, while working with datetime values it's good to know the lubridate library.
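If you go the lubridate route, time_length() makes the unit explicit, which sidesteps the days-vs-seconds ambiguity entirely. A sketch using the column names from the question:
library(lubridate)
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            # time_length() always returns the requested unit, here days
            No_Days = (time_length(interval(min(ORDER_DT), max(ORDER_DT)), unit = "day") + 1) / n())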

Finding each time of daily max variable in climate data

I have a large dataset over many years which has several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should hopefully illustrate my point; my dateTime didn't work out to be hourly data, but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1798,rep=TRUE)
WD <- sample(0:390,1798,rep=TRUE)
Temp <- sample(0:40,1798,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a POSIX date (minus time) to allow for day isolation; however, all the things I have tried have only returned a shortened data frame with date and WS (aggregate, splitting, xts). aggregate was the only one that didn't do this; however, it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
  mutate(date = as.Date(dateTime)) %>%
  left_join(
    df %>%
      mutate(date = as.Date(dateTime)) %>%
      group_by(date) %>%
      summarise(max_ws = max(WS, na.rm = TRUE)) %>%
      ungroup(),
    by = "date"
  ) %>%
  select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not at which hour that occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate)  # for hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  mutate(Hour = hour(dateTime),
         Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if there are several hours with the same maximal wind speed (in the example below: 15), only the first hour with max(WS) will be shown as the result, even though a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
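A more compact variant of the same idea returns one row per day with the timestamp of the daily maximum (again a sketch, with the same caveat that ties go to the first occurrence):
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  summarise(max_ws = max(WS, na.rm = TRUE),
            time_of_max = dateTime[which.max(WS)])  # first occurrence wins on ties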
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
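If the timestamp of the maximum is wanted rather than just the value, the same data.table call can instead return the whole row of each daily maximum (a sketch, first occurrence on ties):
setDT(df)[!is.na(WS), .SD[which.max(WS)], by = as.IDate(dateTime)]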

Consolidating rows by max and min dates

I have a dataset that looks like this.
id1 = c(1,1,1,1,1,1,1,1,2,2)
id2 = c(3,3,3,3,3,3,3,3,3,3)
lat = c(-62.81559,-62.82330, -62.78693,-62.70136, -62.76476,-62.48157,-62.49064,-62.45838,42.06258,42.06310)
lon = c(-61.15518, -61.14885,-61.17801,-61.00363, -59.14270, -59.22009, -59.32967, -59.04125 ,154.70579, 154.70625)
start_date= as.POSIXct(c('2016-03-24 15:30:00', '2016-03-24 15:30:00','2016-03-24 23:40:00','2016-03-25 12:50:00','2016-03-29 18:20:00','2016-06-01 02:40:00','2016-06-01 08:00:00','2016-06-01 16:30:00','2016-07-29 20:20:00','2016-07-29 20:20:00'), tz = 'UTC')
end_date = as.POSIXct(c('2016-03-24 23:40:00', '2016-03-24 18:50:00','2016-03-25 03:00:00','2016-03-25 19:20:00','2016-04-01 03:30:00','2016-06-02 01:40:00','2016-06-01 14:50:00','2016-06-02 01:40:00','2016-07-30 07:00:00','2016-07-30 07:00:00'),tz = 'UTC')
speed = c(2.9299398, 2.9437502, 0.0220565, 0.0798409, 1.2824859, 1.8685429, 3.7927680, 1.8549291, 0.8140249,0.8287073)
df = data.frame(id1, id2, lat, lon, start_date, end_date, speed)
id1 id2 lat lon start_date end_date speed
1 1 3 -62.81559 -61.15518 2016-03-24 15:30:00 2016-03-24 23:40:00 2.9299398
2 1 3 -62.82330 -61.14885 2016-03-24 15:30:00 2016-03-24 18:50:00 2.9437502
3 1 3 -62.78693 -61.17801 2016-03-24 23:40:00 2016-03-25 03:00:00 0.0220565
4 1 3 -62.70136 -61.00363 2016-03-25 12:50:00 2016-03-25 19:20:00 0.0798409
5 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
6 1 3 -62.48157 -59.22009 2016-06-01 02:40:00 2016-06-02 01:40:00 1.8685429
7 1 3 -62.49064 -59.32967 2016-06-01 08:00:00 2016-06-01 14:50:00 3.7927680
8 1 3 -62.45838 -59.04125 2016-06-01 16:30:00 2016-06-02 01:40:00 1.8549291
9 2 3 42.06258 154.70579 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8140249
10 2 3 42.06310 154.70625 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8287073
The actual dataset is larger. What I would like to do is consolidate this dataset based on date ranges, grouped by id1 and id2, such that if the date/time range on one row is within 12 hours of the next date/time range (ABS(end_date[1] - start_date[2]) < 12 hrs), the rows should be consolidated, with the new start_date being the earliest date and the end_date being the latest. All other values (lat, lon, speed) will be averaged. This is in some sense a 'deduping' effort, as rows that are within 12 hours actually represent the same 'event'. For the above example the final result would be:
id1 id2 lat lon start_date end_date speed
1 1 3 -62.7818 -61.12142 2016-03-24 15:30:00 2016-03-25 19:20:00 1.493897
2 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
3 1 3 -62.47686 -59.197 2016-06-01 02:40:00 2016-06-02 01:40:00 2.505413
4 2 3 42.06284 154.706 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8213661
That is, the first four rows are consolidated (into row 1), the 5th row is left alone (row 2), rows 6-8 are consolidated (row 3), and rows 9-10 are consolidated (row 4).
I have been trying to do this with dplyr group_by and summarize, but I can't seem to get the date ranges to come out correctly.
Hopefully someone can determine a simple means of solving the problem. Extra points if you know how to do it in SQL ;-) so I can dedupe before even pulling this into R.
Here is a first very naive implementation. Warning: it is slow, not pretty and still missing the start and end dates in the output! Note that it expects the rows to be ordered by date and time. If that's not the case in the data set, you can do it in R or SQL first. Sorry that I can't think of a dplyr or SQL solution. I'd also like to see those two, if anyone has got an idea.
dedupe <- function(df) {
  counter = 1
  temp_vector = unlist(df[1, ])
  summarized_df = df[0, c(1, 2, 3, 4, 7)]
  colnames(summarized_df) = colnames(df)[c(1, 2, 3, 4, 7)]
  summarized_df$counter = NULL
  for (i in 2:nrow(df)) {
    if (((abs(difftime(df[i, "start_date"], df[i - 1, "end_date"], units = "h")) < 12) ||
         abs(difftime(df[i, "start_date"], df[i - 1, "start_date"], units = "h")) < 12) &&
        df[i, "id1"] == df[i - 1, "id1"] &&
        df[i, "id2"] == df[i - 1, "id2"]) {
      # group events because id is the same and time ranges overlap:
      # sum up columns and select maximum end_date
      temp_vector[c(3, 4, 7)] = temp_vector[c(3, 4, 7)] + unlist(df[i, c(3, 4, 7)])
      temp_vector["end_date"] = max(temp_vector["end_date"], df[i, "end_date"])
      counter = counter + 1
      if (i == nrow(df)) {
        # in the last iteration we need to create a new group
        summarized_df[nrow(summarized_df) + 1, c(1, 2)] = df[i, c(1, 2)]
        summarized_df[nrow(summarized_df), 3:5] = temp_vector[c(3, 4, 7)] / counter
        summarized_df[nrow(summarized_df), "counter"] = counter
      }
    } else {
      # new event, so calculate group statistics for temp_vector and reset it as well as counter
      summarized_df[nrow(summarized_df) + 1, c(1, 2)] = df[i, c(1, 2)]
      summarized_df[nrow(summarized_df), 3:5] = temp_vector[c(3, 4, 7)] / counter
      summarized_df[nrow(summarized_df), "counter"] = counter
      counter = 1
      temp_vector[c(3, 4, 7)] = unlist(df[i, c(3, 4, 7)])
    }
  }
  return(summarized_df)
}
Function call
> dedupe(df)
id1 id2 lat lon speed counter
5 1 3 -62.78179 -61.12142 1.4938968 4
6 1 3 -62.76476 -59.14270 1.2824859 1
9 2 3 -62.47686 -59.19700 2.5054133 3
10 2 3 42.06284 154.70602 0.8213661 2
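Since the question also asked for a dplyr version: one way to express the grouping is to start a new event whenever the gap to the previous row's end_date is 12 hours or more, then aggregate per event. This is only a sketch of that idea and assumes rows are ordered by start_date within id1/id2, as in the example:
library(dplyr)
df %>%
  group_by(id1, id2) %>%
  arrange(start_date, .by_group = TRUE) %>%
  mutate(event = cumsum(is.na(lag(end_date)) |
                        abs(difftime(start_date, lag(end_date), units = "hours")) >= 12)) %>%
  group_by(id1, id2, event) %>%
  summarise(lat = mean(lat), lon = mean(lon),
            start_date = min(start_date), end_date = max(end_date),
            speed = mean(speed), .groups = "drop")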
This can be easily achieved by using insurancerating::reduce():
df |>
insurancerating::reduce(begin = start_date, end = end_date, id1, id2,
agg_cols = c(lat, lon, speed), agg = "mean",
min.gapwidth = 12 * 3600)
#> id1 id2 index end_date start_date lat lon
#> 1 1 3 0 2016-03-25 19:20:00 2016-03-24 15:30:00 -62.78180 -61.12142
#> 2 1 3 1 2016-04-01 03:30:00 2016-03-29 18:20:00 -62.76476 -59.14270
#> 3 1 3 2 2016-06-02 01:40:00 2016-06-01 02:40:00 -62.47686 -59.19700
#> 4 2 3 0 2016-07-30 07:00:00 2016-07-29 20:20:00 42.06284 154.70602
#> speed
#> 1 1.4938969
#> 2 1.2824859
#> 3 2.5054133
#> 4 0.8213661
Created on 2022-06-13 by the reprex package (v2.0.1)

R time aggregate with start/stop

I have a set of time series data that has a start and stop time. Each event can last from a few seconds to a few days. I need to calculate the sum (in this example, the total memory used) every hour for the jobs active at that time. Here is a sample of the data:
mem_used start_time stop_time
16 2015-10-24 17:24:41 2015-10-25 04:19:44
80 2015-10-24 17:24:51 2015-10-25 03:14:59
44 2015-10-24 17:25:27 2015-10-25 01:16:10
28 2015-10-24 17:25:43 2015-10-25 00:00:31
72 2015-10-24 17:30:23 2015-10-24 23:58:31
In this case it should give something like:
time total_mem
2015-10-24 17:00:00 240
2015-10-24 18:00:00 240
...
2015-10-25 00:00:00 168
2015-10-25 01:00:00 140
2015-10-25 02:00:00 96
2015-10-25 03:00:00 96
2015-10-25 04:00:00 16
I'm trying to do something with the aggregate function but I can not figure it out. Any ideas? Thanks.
Here's how I would do it, using lubridate.
First, make sure that your dates are in POSIXct format:
dat$start_time = as.POSIXct(dat$start_time, format = "%Y-%m-%d %H:%M:%S")
dat$stop_time = as.POSIXct(dat$stop_time, format = "%Y-%m-%d %H:%M:%S")
Then make an interval object with lubridate:
library(lubridate)
dat$interval <- interval(dat$start_time, dat$stop_time)
Now we can make a vector of times, replace these with your desired times:
z <- seq(from = dat$start_time[1], to = dat$stop_time[5], by = "hours")
And sum those where we have an overlap:
out <- data.frame(times = z,
mem_used = sapply(z, function(x) sum(dat$mem_used[x %within% dat$interval])))
times mem_used
1 2015-10-24 17:24:41 16
2 2015-10-24 18:24:41 240
3 2015-10-24 19:24:41 240
4 2015-10-24 20:24:41 240
5 2015-10-24 21:24:41 240
6 2015-10-24 22:24:41 240
7 2015-10-24 23:24:41 240
Here's the data used:
structure(list(mem_used = c(16L, 80L, 44L, 28L, 72L), start_time = structure(c(1445721881,
1445721891, 1445721927, 1445721943, 1445722223), class = c("POSIXct",
"POSIXt"), tzone = ""), stop_time = structure(c(1445761184, 1445757299,
1445750170, 1445745631, 1445745511), class = c("POSIXct", "POSIXt"
), tzone = "")), .Names = c("mem_used", "start_time", "stop_time"
), row.names = c(NA, -5L), class = "data.frame")
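One caveat: the output above starts at the first start_time (17:24:41) rather than on the hour, as in the desired result. A small tweak (a sketch, using lubridate's floor_date()/ceiling_date()) builds on-the-hour breakpoints instead:
z <- seq(from = floor_date(min(dat$start_time), "hour"),
         to = ceiling_date(max(dat$stop_time), "hour"),
         by = "hours")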
Here is another solution based on dplyr and lubridate.
Make sure first to have the data in the right format (e.g. dates in POSIXct).
library(dplyr)
library(lubridate)
glimpse(df)
## Observations: 5
## Variables: 3
## $ mem_used (int) 16, 80, 44, 28, 72
## $ start_time (time) 2015-10-24 17:24:41, 2015-10-24 17:24:51...
## $ end_time (time) 2015-10-25 04:19:44, 2015-10-25 03:14:59...
Then we will just keep the hour (removing minutes and seconds) since we want to aggregate per hour.
### Remove minutes and seconds
minute(df$start_time) <- 0
second(df$start_time) <- 0
minute(df$end_time) <- 0
second(df$end_time) <- 0
The most important step now is to create a new data.frame with one row for each hour between start_time and end_time. For example, if on the first line of the original data.frame we have 5 hours between start_time and end_time, we will end up with 5 rows and the value mem_used duplicated 5 times.
# Expand each job into one row per hour it was active
n <- nrow(df)
l <- lapply(1:n, function(i) {
  date <- seq.POSIXt(df$start_time[i], df$end_time[i], by = "hour")
  mem_used <- rep(df$mem_used[i], length(date))
  data.frame(time = date, mem_used = mem_used)
})
df <- Reduce(rbind, l)
glimpse(df)
## Observations: 47
## Variables: 2
## $ time (time) 2015-10-24 17:00:00, 2015-10-24 18:00:00, ...
## $ mem_used (int) 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,...
Finally, we can now aggregate using dplyr or aggregate (or other similar functions)
df %>%
group_by(time) %>%
summarise(tot = sum(mem_used))
## time tot
## (time) (int)
## 1 2015-10-24 17:00:00 240
## 2 2015-10-24 18:00:00 240
## 3 2015-10-24 19:00:00 240
## 4 2015-10-24 20:00:00 240
## 5 2015-10-24 21:00:00 240
## 6 2015-10-24 22:00:00 240
## 7 2015-10-24 23:00:00 240
## 8 2015-10-25 00:00:00 168
## 9 2015-10-25 01:00:00 140
## 10 2015-10-25 02:00:00 96
## 11 2015-10-25 03:00:00 96
## 12 2015-10-25 04:00:00 16
## Or aggregate
aggregate(mem_used ~ time, FUN = sum, data = df)
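The same expand-then-aggregate idea can also be sketched with data.table, which may be faster for many rows. This assumes the dat object from the first answer (start_time/stop_time columns) and lubridate for floor_date(); it is an illustration, not part of the original answer:
library(data.table)
dt <- as.data.table(dat)
# One row per hour each job was active, then sum memory per hour
expanded <- dt[, .(time = seq(floor_date(start_time, "hour"),
                              floor_date(stop_time, "hour"), by = "hour"),
                   mem_used = mem_used),
               by = seq_len(nrow(dt))]
expanded[, .(total_mem = sum(mem_used)), keyby = time]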
