Adding Date and Time values in R

I have the following kind of data in my datafile
DriveNo Date and Time
12 2017-01-31 23:00:00 //Start time of a trip for Driver12
134 2017-01-31 23:00:01
12 2017-01-31 23:10:00 //End time ( 10 min trip)
345 (some date/time)
12 2017-01-31 23:20:00 //Start Time
12 2017-01-31 23:35:00 //End Time (15 min trip)
.
.
.
millions of similar rows follow
The total number of rows is around 3 million. Now, I need to get the time driven by each of the drivers (there are around 500 drivers). My ideal output would be like
DriveNo TotalTimeDriven
12 25mins //10 min + 15 min
134 ........(in days/hours/mins)
.
.
(for all other Drivers as well)
Above, DriveNo 12 has four entries, suggesting the start and end of two rides. Is there an efficient R way to do this?

A data.table solution:
# Sample data
library(data.table)
df <- data.table(DriveNo = c(12, 134, 12, 134),
                 Time = c("2017-01-31 23:00:00", "2017-01-31 23:00:01",
                          "2017-01-31 23:10:00", "2017-01-31 23:20:01"))
df[, duration := max(as.POSIXct(Time)) - min(as.POSIXct(Time)), by = DriveNo]
df
   DriveNo                Time duration
1:      12 2017-01-31 23:00:00  10 mins
2:     134 2017-01-31 23:00:01  20 mins
3:      12 2017-01-31 23:10:00  10 mins
4:     134 2017-01-31 23:20:01  20 mins
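Note that max - min measures the span from each driver's first to last timestamp, including any gap between trips. To collapse the repeated rows to one per driver, a small follow-up (a sketch):
unique(df[, .(DriveNo, duration)])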

range returns the minimum and maximum, and diff subtracts successive elements of a vector, so you could just do
aggregate(DateTime ~ DriveNo, df, function(x){diff(range(x))})
## DriveNo DateTime
## 1 12 10
## 2 134 0
or in dplyr,
library(dplyr)
df %>% group_by(DriveNo) %>% summarise(TimeDriven = diff(range(DateTime)))
## # A tibble: 2 × 2
## DriveNo TimeDriven
## <int> <time>
## 1 12 10 mins
## 2 134 0 mins
or in data.table,
library(data.table)
setDT(df)[, .(TimeDriven = diff(range(DateTime))), by = DriveNo]
## DriveNo TimeDriven
## 1: 12 10 mins
## 2: 134 0 mins
To change the units, it may be simpler to call difftime directly.
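For example, a sketch reusing the dplyr pipeline above:
df %>% group_by(DriveNo) %>%
  summarise(TimeDriven = difftime(max(DateTime), min(DateTime), units = "hours"))
# TimeDriven is now reported in hours instead of the default minutes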
Data
df <- structure(list(DriveNo = c(12L, 134L, 12L),
                     DateTime = structure(c(1485921600, 1485921601, 1485922200),
                                          class = c("POSIXct", "POSIXt"), tzone = "")),
                class = "data.frame", row.names = c(NA, -3L),
                .Names = c("DriveNo", "DateTime"))
For the edit, you can make a variable identifying starts and stops, reshape, and summarise with difftime and sum.
library(tidyverse)
set.seed(47)
drives <- data_frame(DriveNo = sample(rep(1:5, 4)),
                     DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
                                    by = '10 min', length.out = 20))
drives %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 2 variables:
#> $ DriveNo : int 5 3 4 3 5 1 1 2 3 5 ...
#> $ DateTime: POSIXct, format: "2017-04-13 12:00:00" "2017-04-13 12:10:00" ...
elapsed <- drives %>%
    group_by(DriveNo) %>%
    mutate(event = rep(c('start', 'stop'), n() / 2),
           i = cumsum(event == 'start')) %>%
    spread(event, DateTime) %>%
    summarise(TimeDriven = sum(difftime(stop, start, units = 'mins')))
elapsed
#> # A tibble: 5 × 2
#> DriveNo TimeDriven
#> <int> <time>
#> 1 1 60 mins
#> 2 2 110 mins
#> 3 3 120 mins
#> 4 4 130 mins
#> 5 5 80 mins
It would be faster to index by recycled logical vectors, but in dplyr the datetimes get unclassed at some point. In data.table,
library(data.table)
set.seed(47)
drives <- data.table(DriveNo = sample(rep(1:5, 4)),
                     DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
                                    by = '10 min', length.out = 20))
elapsed <- drives[, .(TimeDriven = sum(difftime(DateTime[c(FALSE, TRUE)],
                                                DateTime[c(TRUE, FALSE)],
                                                units = 'mins'))),
                  keyby = DriveNo]
elapsed
#> DriveNo TimeDriven
#> 1: 1 60 mins
#> 2: 2 110 mins
#> 3: 3 120 mins
#> 4: 4 130 mins
#> 5: 5 80 mins
or in base,
set.seed(47)
drives <- data.frame(DriveNo = sample(rep(1:5, 4)),
                     DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
                                    by = '10 min', length.out = 20))
elapsed <- aggregate(DateTime ~ DriveNo, drives,
                     function(x){sum(difftime(x[c(FALSE, TRUE)], x[c(TRUE, FALSE)],
                                              units = 'mins'))})
elapsed
#> DriveNo DateTime
#> 1 1 60
#> 2 2 110
#> 3 3 120
#> 4 4 130
#> 5 5 80
All forms will likely have issues if a driver has an odd number of timestamps, which should not be possible under the assumptions given. If it does occur, more cleaning is necessary.
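A quick sanity check for odd counts before summarising (a sketch on the drives data above):
# drivers with an odd number of timestamps cannot be paired into start/stop rows
counts <- table(drives$DriveNo)
names(counts)[counts %% 2 == 1]
# character(0) here, since every driver has an even number of timestamps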

Related

How to calculate a time period until a condition is matched

I need to accumulate the time across consecutive dates until the difference between two consecutive dates is greater than 13 seconds.
For example, in the data frame created with the code shown below, the column test holds the time difference between each date and the next. What I need are the events of time between the lines where test > 13 seconds.
# Create a vector of dates with a random time difference in seconds between records
dates <- seq(as.POSIXct("2020-01-01 00:00:02"), as.POSIXct("2020-01-02 00:00:02"), by = "2 sec")
dates <- dates + sample(15, length(dates), replace = TRUE)
# Create a data.frame
data <- data.frame(id = 1:length(dates), dates = dates)
# Create a test field with the time difference between each date and the next
data$test <- c(diff(data$dates, lag = 1), 0)
# Delete the zero and negative time
data <- data[data$test > 0, ]
head(data)
What I want is something like the per-block totals shown in the answer below.
To get to your desired result we need to define 'blocks' of observations. Each block is split where test is greater than 13.
We start by identifying the split_point, then with the rle function we can assign an ID to each block.
Then we can filter out the split points and summarise the remaining blocks: once with the sum of seconds, and once with the min of the event dates.
split_point <- data$test <= 13
# Find continuous blocks
block_str <- rle(split_point)
# Create block IDs
data$block <- rep(seq_along(block_str$lengths), block_str$lengths)
data <- data[split_point, ] # Remove split points
# Summarize
final_df <- aggregate(test ~ block, data = data, FUN = sum)
dtevent <- aggregate(dates ~ block, data = data, FUN = min)
# Join the two summaries
final_df$DatetimeEvent <- dtevent$dates
head(final_df)
#> block test DatetimeEvent
#> 1 1 101 2020-01-01 00:00:09
#> 2 3 105 2020-01-01 00:01:11
#> 3 5 277 2020-01-01 00:02:26
#> 4 7 46 2020-01-01 00:04:58
#> 5 9 27 2020-01-01 00:05:30
#> 6 11 194 2020-01-01 00:05:44
Created on 2020-04-02 by the reprex package (v0.3.0)
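As a quick illustration of the rle step (a sketch): rle compresses runs of equal values, and rep(seq_along(lengths), lengths) expands a run ID back to one label per row.
r <- rle(c(TRUE, TRUE, FALSE, TRUE))
r$lengths                              # 2 1 1
rep(seq_along(r$lengths), r$lengths)   # 1 1 2 3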
Using dplyr for convenience's sake:
library(dplyr)
final_df <- data %>%
  mutate(split_point = test <= 13,
         block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
  group_by(block) %>%
  filter(split_point) %>%
  summarise(DateTimeEvent = min(dates), TotalTime = sum(test))
final_df
#> # A tibble: 1,110 x 3
#> block DateTimeEvent TotalTime
#> <int> <dttm> <drtn>
#> 1 1 2020-01-01 00:00:06 260 secs
#> 2 3 2020-01-01 00:02:28 170 secs
#> 3 5 2020-01-01 00:04:11 528 secs
#> 4 7 2020-01-01 00:09:07 89 secs
#> 5 9 2020-01-01 00:10:07 37 secs
#> 6 11 2020-01-01 00:10:39 135 secs
#> 7 13 2020-01-01 00:11:56 50 secs
#> 8 15 2020-01-01 00:12:32 124 secs
#> 9 17 2020-01-01 00:13:52 98 secs
#> 10 19 2020-01-01 00:14:47 83 secs
#> # … with 1,100 more rows
Created on 2020-04-02 by the reprex package (v0.3.0)
(results are different because reprex recreates the data each time)
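To make the two outputs comparable, one could fix the random draw before building the data (a sketch; the seed value is arbitrary):
set.seed(1)
dates <- dates + sample(15, length(dates), replace = TRUE)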

How to generate a unique ID for each group based on relative date interval in R using dplyr?

I have a cohort of data with multiple visits per person and want to group visits under a common ID based on person # and the time of the visit. The condition is: if an admission is within 24 hours of the previous discharge, then those visits should share the same ID.
Sample of what data looks like:
dat <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"),
                               format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"),
                                   format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start with:
dat1 <- dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID <- as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
This is almost right, except that it does not give the correct ID for visit 7 (person #3): that person has three visits, the second falls entirely within the first, and the third starts within 24 hours of the first but not of the second.
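One way to handle such nested visits (a sketch, separate from the answers below) is to compare each admission against the running maximum of all previous discharge times rather than only the immediately preceding one:
library(dplyr)
dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  # latest_discharge and new_visit are hypothetical helper columns
  mutate(latest_discharge = lag(cummax(as.numeric(Discharge_Date_Time))),
         new_visit = is.na(latest_discharge) |
           as.numeric(Admit_Date_Time) > latest_discharge + 24 * 3600,
         run = cumsum(new_visit)) %>%
  ungroup()
# an ID can then be built as before, e.g. paste0(Person_ID, "-", run)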
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(dplyr)
library(tidyr)
library(lubridate)  # for ddays()
dat1 <- dat %>%
  # Gather into long format with event type in one column, timestamp in another
  gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
  # I want discharges to have an effect up to 24 hours later. Sort using that.
  mutate(time_adj = if_else(event == "Discharge_Date_Time",
                            time + ddays(1),
                            time)) %>%
  arrange(Person_ID, time_adj) %>%
  # For each Person_ID, track cumulative admissions. 0 means a visit has completed.
  # (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
  group_by(Person_ID) %>%
  mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
  mutate(admissions_count = cumsum(admissions)) %>%
  ungroup() %>%
  # Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
  # completed visit (ie admissions_count was zero).
  mutate(Hosp_ID_chg = 1 *
           (Person_ID != lag(Person_ID, default = 1) |   # (a)
              lag(admissions_count, default = 1) == 0),  # (b)
         Hosp_ID = cumsum(Hosp_ID_chg)) %>%
  # Spread back into original format
  select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
  spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
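Note: gather and spread are superseded in current tidyr; the same reshape could be written with pivot_longer and pivot_wider (a sketch of the equivalent calls):
# equivalent of the gather step
pivot_longer(dat, Admit_Date_Time:Discharge_Date_Time,
             names_to = "event", values_to = "time")
# equivalent of the spread step at the end of the pipe
# ... %>% pivot_wider(names_from = event, values_from = time)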
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
  foverlaps( dat, dat,
             mult = "first",
             type = "any",
             nomatch = 0L ),
  Visit_ID )[, list( Visit_ID = i.Visit_ID,
                     Hosp_ID = .GRP ),
             by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00

How to rewrite an R loop taking averages of every 15 observations to same code but without a loop

I am dealing with a huge dataset (years of 1-minute-interval observations of energy usage). I want to convert it from 1-min-interval to 15-min-interval.
I have written a for loop which does this successfully (tested on a small subset of the data); however, when I tried running it on the main data, it was executing very slowly - and it would have taken me over 175 hours to run the full loop (I stopped it mid-execution).
The data to be converted to 15-minute intervals is the kWh usage; converting it simply requires taking the average of the first 15 observations, then the next 15, and so on. This is the loop that works:
# Opening the file
data <- read.csv("1.csv", colClasses = "character", na.strings = "?")
# Adding an index to each row
total <- nrow(data)
data$obsnum <- seq.int(nrow(data))
# Calculating 15 min kWh usage
data$use_15_min <- data$use
for (i in 1:total) {
  int_used <- floor((i - 1) / 15)
  obsNum <- 15 * int_used
  sum <- 0
  for (j in 1:15) {
    usedIndex <- as.numeric(obsNum + j)
    sum <- as.numeric(data$use[usedIndex]) + sum
  }
  data$use_15_min[i] <- sum / 15
}
I have been searching for a function that can do the same, but without using loops, as I imagine this should save much time. Yet, I haven't been able to find one. How is it possible to achieve the same functionality without using a loop?
Try data.table:
library(data.table)
DT <- data.table(data)
n <- nrow(DT)
DT[, use_15_min := mean(use), by = gl(n, 15, n)]
Note
The question is missing the input data so we used this:
data <- data.frame(use = 1:100)
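gl(n, 15, n) builds a factor whose level changes every 15 rows, which is what the by grouping keys on; a small illustration:
gl(2, 3, 6)
# [1] 1 1 1 2 2 2
# Levels: 1 2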
A potential solution is to calculate the running mean (e.g. using TTR::runMean) and then select every 15th observation. Note that runMean places the mean of rows 1-15 at row 15, so the selection should start there. Here is an example:
df <- data.frame(x = 1:100, y = runif(100))
df$runmean <- TTR::runMean(df$y, n = 15)  # NA for the first 14 rows
df_15 <- df[seq(15, nrow(df), 15), ]      # rows 15, 30, ... hold full-window means
I cannot test it, as I do not have your data, but perhaps:
total <- nrow(data)
data$use_15_min <- TTR::runMean(as.numeric(data$use), n = 15)
data_15_min <- data[seq(15, nrow(data), 15), ]
I would use lubridate::floor_date to create the 15-minute groupings.
library(tidyverse)
library(lubridate)
df <- tibble(
date = seq(ymd_hm("2019-01-01 00:00"), by = "min", length.out = 60 * 24 * 7),
value = rnorm(n = 60 * 24 * 7)
)
df
#> # A tibble: 10,080 x 2
#> date value
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 0.182
#> 2 2019-01-01 00:01:00 0.616
#> 3 2019-01-01 00:02:00 -0.252
#> 4 2019-01-01 00:03:00 0.0726
#> 5 2019-01-01 00:04:00 -0.917
#> 6 2019-01-01 00:05:00 -1.78
#> 7 2019-01-01 00:06:00 -1.49
#> 8 2019-01-01 00:07:00 -0.818
#> 9 2019-01-01 00:08:00 0.275
#> 10 2019-01-01 00:09:00 1.26
#> # ... with 10,070 more rows
df %>%
mutate(
nearest_15_mins = floor_date(date, "15 mins")
) %>%
group_by(nearest_15_mins) %>%
summarise(
avg_value_at_15_mins_int = mean(value)
)
#> # A tibble: 672 x 2
#> nearest_15_mins avg_value_at_15_mins_int
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 -0.272
#> 2 2019-01-01 00:15:00 -0.129
#> 3 2019-01-01 00:30:00 0.173
#> 4 2019-01-01 00:45:00 -0.186
#> 5 2019-01-01 01:00:00 -0.188
#> 6 2019-01-01 01:15:00 0.104
#> 7 2019-01-01 01:30:00 -0.310
#> 8 2019-01-01 01:45:00 -0.173
#> 9 2019-01-01 02:00:00 0.0137
#> 10 2019-01-01 02:15:00 0.419
#> # ... with 662 more rows

Calculating conditional cumulative time

Following the pointers from this question.
I'd like to calculate the cumulative time for all the Cats, by considering their respective last toggle status.
EDIT:
I'd also want to check if the FIRST Toggle status of a Cat is Off; if it is, then for that specific cat, the time from midnight (00:00:00) till this FIRST Off time should be added to its total conditional cumulative on-time.
Sample data:
Time Cat Toggle
1 05:12:09 36 On
2 05:12:12 26R Off # First Toggle of this Cat happens to be Off, Condition met
3 05:12:15 26R On
4 05:12:16 26R Off
5 05:12:18 99 Off # Condition met
6 05:12:18 99 On
7 05:12:24 36 Off
8 05:12:26 36 On
9 05:12:29 80 Off # Condition met
10 05:12:30 99 Off
11 05:12:31 95 Off # Condition met
12 05:12:32 36 Off
Desired sample output:
Cat Time(Secs)
1 36 21
2 26R 18733 # (=1+18732), 18732 secs to be added = total Sec from midnight till 05:12:12
3 99 18750 # (=12+18738), 18738 secs to be added = total Sec from midnight till 05:12:18
4 .. ..
Any sort of help is appreciated.
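As a quick check of the arithmetic in the desired output (a sketch): 05:12:12 is 5*3600 + 12*60 + 12 = 18732 seconds after midnight, matching the 18732 above:
as.numeric(as.difftime("05:12:12", units = "secs"))
# [1] 18732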
Using base R:
df$Time <- as.POSIXct(df$Time, format = "%H:%M:%S")
stack(by(df, df$Cat, function(x) sum(c(0, diff(x$Time)) * (x$Toggle == "Off"))))
values ind
1 1 26R
2 21 36
3 0 80
4 0 95
5 12 99
One can use the as.difftime function to convert time from H:M:S format to seconds. Then, for each On status, find the lead record in order to calculate the interval of time elapsed since On.
library(dplyr)
# Convert Time in seconds.
df %>% mutate(Time = as.difftime(Time, units = "secs")) %>%
  group_by(Cat) %>%
  mutate(TimeInterVal = ifelse(Toggle == "On", (lead(Time) - Time), 0)) %>%
  summarise(TimeInterVal = sum(TimeInterVal))
# # A tibble: 5 x 2
# Cat TimeInterVal
# <chr> <dbl>
# 1 26R 1.00
# 2 36 21.0
# 3 80 0
# 4 95 0
# 5 99 12.0
Note: one can consider arranging the data on Time to ensure rows are ordered chronologically.
Data:
df <- read.table(text ="
Time Cat Toggle
1 05:12:09 36 On
2 05:12:12 26R Off
3 05:12:15 26R On
4 05:12:16 26R Off
5 05:12:18 99 Off
6 05:12:18 99 On
7 05:12:24 36 Off
8 05:12:26 36 On
9 05:12:29 80 Off
10 05:12:30 99 Off
11 05:12:31 95 Off
12 05:12:32 36 Off",
header = TRUE, stringsAsFactors = FALSE)
A possible solution using data.table:
# load the 'data.table'-package, convert 'df' to a 'data.table'
# and 'Time'-column to a time-format
library(data.table)
setDT(df)[, Time := as.ITime(Time)]
# calculate the time-difference
df[, .(time.diff = sum((shift(Time, type = 'lead') - Time) * (Toggle == 'On'), na.rm = TRUE))
, by = Cat]
which gives:
Cat time.diff
1: 36 21
2: 26R 1
3: 99 12
4: 80 0
5: 95 0
In response to your question in the comments, you could do:
# create a new data.table with midnigth times for the categories where
# the first 'Toggle' is on "Off"
df0 <- df[, .I[first(Toggle) == "Off"], by = Cat
][, .(Time = as.ITime("00:00:00"), Cat = unique(Cat), Toggle = "On")]
# bind that to the original data.table; order on 'Cat' and 'Time'
# and then do the same calculation
rbind(df, df0)[order(Cat, Time)
][, .(time.diff = sum((shift(Time, type = 'lead') - Time) * (Toggle == 'On'), na.rm = TRUE))
, by = Cat]
which gives:
Cat time.diff
1: 26R 18733
2: 36 21
3: 80 18749
4: 95 18751
5: 99 18750
An alternative with base R (only original question):
df$Time <- as.POSIXct(df$Time, format = "%H:%M:%S")
stack(sapply(split(df, df$Cat),
             function(x) sum(diff(x[["Time"]]) * (head(x[["Toggle"]], -1) == 'On'))))
which gives:
values ind
1 1 26R
2 21 36
3 0 80
4 0 95
5 12 99
Or with the tidyverse (only original question):
library(dplyr)
library(lubridate)
df %>%
  mutate(Time = lubridate::hms(Time)) %>%
  group_by(Cat) %>%
  summarise(time.diff = sum(diff(Time) * (head(Toggle, -1) == 'On'),
                            na.rm = TRUE))

R time aggregate with start/stop

I have a set of time series data that has a start and stop time. Each event can last from a few seconds to a few days. I need to calculate the sum, in this example the total memory used, of the jobs active during each hour. Here is a sample of the data:
mem_used start_time stop_time
16 2015-10-24 17:24:41 2015-10-25 04:19:44
80 2015-10-24 17:24:51 2015-10-25 03:14:59
44 2015-10-24 17:25:27 2015-10-25 01:16:10
28 2015-10-24 17:25:43 2015-10-25 00:00:31
72 2015-10-24 17:30:23 2015-10-24 23:58:31
In this case it should give something like:
time total_mem
2015-10-24 17:00:00 240
2015-10-24 18:00:00 240
...
2015-10-25 00:00:00 168
2015-10-25 01:00:00 140
2015-10-25 02:00:00 96
2015-10-25 03:00:00 96
2015-10-25 04:00:00 16
I'm trying to do something with the aggregate function but I cannot figure it out. Any ideas? Thanks.
Here's how I would do it, using lubridate.
First, make sure that your dates are in POSIXct format:
dat$start_time = as.POSIXct(dat$start_time, format = "%Y-%m-%d %H:%M:%S")
dat$stop_time = as.POSIXct(dat$stop_time, format = "%Y-%m-%d %H:%M:%S")
Then make an interval object with lubridate:
library(lubridate)
dat$interval <- interval(dat$start_time, dat$stop_time)
Now we can make a vector of times, replace these with your desired times:
z <- seq(from = dat$start_time[1], to = dat$stop_time[5], by = "hours")
And sum those where we have an overlap:
out <- data.frame(times = z,
                  mem_used = sapply(z, function(x) sum(dat$mem_used[x %within% dat$interval])))
out
times mem_used
1 2015-10-24 17:24:41 16
2 2015-10-24 18:24:41 240
3 2015-10-24 19:24:41 240
4 2015-10-24 20:24:41 240
5 2015-10-24 21:24:41 240
6 2015-10-24 22:24:41 240
7 2015-10-24 23:24:41 240
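Note that z starts at the first start_time, so the sampled times fall at :24:41 past each hour. To align the grid to clock hours, one could floor and ceiling the endpoints first (a sketch using lubridate, already loaded above):
z <- seq(floor_date(min(dat$start_time), "hour"),
         ceiling_date(max(dat$stop_time), "hour"), by = "hours")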
Here's the data used:
dat <- structure(list(mem_used = c(16L, 80L, 44L, 28L, 72L),
                      start_time = structure(c(1445721881, 1445721891, 1445721927,
                                               1445721943, 1445722223),
                                             class = c("POSIXct", "POSIXt"), tzone = ""),
                      stop_time = structure(c(1445761184, 1445757299, 1445750170,
                                              1445745631, 1445745511),
                                            class = c("POSIXct", "POSIXt"), tzone = "")),
                 .Names = c("mem_used", "start_time", "stop_time"),
                 row.names = c(NA, -5L), class = "data.frame")
Here is another solution based on dplyr and lubridate.
Make sure first to have the data in the right format (e.g date in POSIXct)
library(dplyr)
library(lubridate)
glimpse(df)
## Observations: 5
## Variables: 3
## $ mem_used (int) 16, 80, 44, 28, 72
## $ start_time (time) 2015-10-24 17:24:41, 2015-10-24 17:24:51...
## $ end_time (time) 2015-10-25 04:19:44, 2015-10-25 03:14:59...
Then we will just keep the hour (removing minutes and seconds) since we want to aggregate per hour.
### Remove minutes and seconds
minute(df$start_time) <- 0
second(df$start_time) <- 0
minute(df$end_time) <- 0
second(df$end_time) <- 0
The most important step now is to create a new data.frame with one row for each hour between start_time and end_time. For example, if the first line of the original data.frame covers 5 hours between start_time and end_time, we will end up with 5 rows and the value mem_used duplicated 5 times.
# Expand each job into one row per active hour
n <- nrow(df)
l <- lapply(1:n, function(i) {
  date <- seq.POSIXt(df$start_time[i], df$end_time[i], by = "hour")
  mem_used <- rep(df$mem_used[i], length(date))
  data.frame(time = date, mem_used = mem_used)
})
df <- Reduce(rbind, l)
glimpse(df)
## Observations: 47
## Variables: 2
## $ time (time) 2015-10-24 17:00:00, 2015-10-24 18:00:00, ...
## $ mem_used (int) 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,...
Finally, we can now aggregate using dplyr or aggregate (or other similar functions)
df %>%
group_by(time) %>%
summarise(tot = sum(mem_used))
## time tot
## (time) (int)
## 1 2015-10-24 17:00:00 240
## 2 2015-10-24 18:00:00 240
## 3 2015-10-24 19:00:00 240
## 4 2015-10-24 20:00:00 240
## 5 2015-10-24 21:00:00 240
## 6 2015-10-24 22:00:00 240
## 7 2015-10-24 23:00:00 240
## 8 2015-10-25 00:00:00 168
## 9 2015-10-25 01:00:00 140
## 10 2015-10-25 02:00:00 96
## 11 2015-10-25 03:00:00 96
## 12 2015-10-25 04:00:00 16
## Or aggregate
aggregate(mem_used ~ time, FUN = sum, data = df)
