Summarize values for overlapping time periods - r

I'm trying to summarize values for overlapping time periods.
I can use only tidyr, ggplot2 and dplyr libraries. Base R is preferred though.
My data looks like this, but usually it has around 100 records:
df <- structure(list(Start = structure(c(1546531200, 1546531200, 546531200, 1546638252.6316, 1546549800, 1546534800, 1546545600, 1546531200, 1546633120, 1547065942.1053), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Stop = structure(c(1546770243.1579, 1546607400, 1547110800, 1546670652.6316, 1547122863.1579, 1546638252.6316, 1546878293.5579, 1546416000, 1546849694.4, 1547186400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(12610, 520, 1500, 90, 331380, 27300, 6072, 4200, 61488, 64372)), .Names = c("Start", "Stop", "Value"), row.names = c(41L, 55L, 25L, 29L, 38L, 28L, 1L, 20L, 14L, 31L), class = c("tbl_df", "tbl", "data.frame"))
head(df) and str(df) give:
Start Stop Value
2019-01-03 16:00:00 2019-01-06 10:24:03 12610
2019-01-03 16:00:00 2019-01-04 13:10:00 520
2019-01-03 16:00:00 2019-01-10 09:00:00 1500
2019-01-04 21:44:12 2019-01-05 06:44:12 90
2019-01-03 21:10:00 2019-01-10 12:21:03 331380
2019-01-03 17:00:00 2019-01-04 21:44:12 27300
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 3 variables:
$ Start: POSIXct, format: "2019-01-03 16:00:00" "2019-01-03 16:00:00" ...
$ Stop : POSIXct, format: "2019-01-06 10:24:03" "2019-01-04 13:10:00" ...
$ Value: num 12610 520 1500 90 331380 ...
So there are overlapping time periods, each with a "Start" and "Stop" date and an assigned value. A record's value applies between its df$Start and df$Stop; outside that range it contributes 0.
I want to create another dataframe from which I could show how these values sum up and change over time. The desired output would look like this (the "sum" column is made up):
> head(df2)
timestamp sum
"2019-01-02 09:00:00 CET" 14352
"2019-01-03 17:00:00 CET" 6253
"2019-01-03 18:00:00 CET" 23465
"2019-01-03 21:00:00 CET" 3241
"2019-01-03 22:10:00 CET" 23235
"2019-01-04 14:10:00 CET" 123321
To get unique timestamps:
timestamps <- sort(unique(c(df$`Start`, df$`Stop`)))
With the df2 dataframe I could easily draw a graph with ggplot, but how do I get these sums?
I think I should iterate over the df data frame, with either a custom function or some built-in summarize function that would work like this:
fnct <- function(date, min, max, value) {
  if (date >= min && date <= max) {
    a <- value
  } else {
    a <- 0
  }
  return(a)
}
...and for every given date from timestamps, iterate through df and give me the sum of values for that timestamp.
It looks really simple and I'm missing something very basic.
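A minimal base-R sketch of that iteration, for reference (for each timestamp, sum the values of every record whose interval covers it; timestamps as defined above):
df2 <- data.frame(
  timestamp = timestamps,
  sum = sapply(timestamps, function(ts) sum(df$Value[df$Start <= ts & ts <= df$Stop]))
)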

Here's a tidyverse solution, similar to my response to this recent question. I use gather() to bring the timestamps (Starts and Stops) into one column, with another column specifying which. The Starts add the value and the Stops subtract it, and then we just take the cumulative sum to get values at all the instants when the sum changes.
For 100 records, there won't be any perceivable speed improvement from using data.table; in my experience it starts to make more of a difference around 1M records, especially when grouping is involved.
library(dplyr); library(tidyr)
df2 <- df %>%
  gather(type, time, Start:Stop) %>%
  mutate(chg = if_else(type == "Start", Value, -Value)) %>%
  arrange(time) %>%
  mutate(sum = cumsum(chg)) # EDIT: corrected per OP comment
> head(df2)
# A tibble: 6 x 5
#   Value type  time                  chg   sum
#   <dbl> <chr> <dttm>              <dbl> <dbl>
# 1  1500 Start 1987-04-27 14:13:20  1500  1500
# 2  4200 Stop  2019-01-02 08:00:00 -4200 -2700
# 3 12610 Start 2019-01-03 16:00:00 12610  9910
# 4   520 Start 2019-01-03 16:00:00   520 10430
# 5  4200 Start 2019-01-03 16:00:00  4200 14630
# 6 27300 Start 2019-01-03 17:00:00 27300 41930
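Since df2 holds the running total at every instant it changes, the plot the question asks about can be drawn directly; a quick sketch, where geom_step() draws the cumulative sum as a step function:
library(ggplot2)
ggplot(df2, aes(x = time, y = sum)) +
  geom_step()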

In the past I have tried to solve similar problems using the tidyverse/base R, but nothing comes even remotely close to the speeds that data.table provides for these kinds of operations, so I encourage you to give it a try.
For questions like this, my favourite function is foverlaps() from the data.table package. With this function you can (fast!) perform an overlap join. If you want more flexibility in your joining than foverlaps() provides, a non-equi join (again using data.table) is probably the best (and fastest!) option. But foverlaps() will do here (I guess).
I used the sample data you provided, but filtered out rows where Stop <= Start (probably a typo in your sample data). When df$Start is not before df$Stop, foverlaps() gives a warning and won't execute.
library( data.table )
#create a data.table with the periods you wish to summarise on
#NB: UTC is used as timezone, since this is also the case in the sample data provided!!
dt.dates <- data.table( id = paste0( "Day", 1:31 ),
                        Start = seq( as.POSIXct( "2019-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
                                     as.POSIXct( "2019-01-31 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
                                     by = "1 days"),
                        Stop = seq( as.POSIXct( "2019-01-02 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 1,
                                    as.POSIXct( "2019-02-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 1,
                                    by = "1 days") )
If you do not want to summarise on a daily basis but by hour, minute, second, or year, just change the values (and step size) in the dt.dates data.table so that they match your periods; see the sketch below.
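For instance, an hourly grid for (roughly) the same window could be built the same way; a sketch:
#hourly periods instead of daily (a sketch)
dt.hours <- data.table( id = paste0( "Hour", 1:(30*24) ),
                        Start = seq( as.POSIXct( "2019-01-01 00:00:00", tz = "UTC" ),
                                     by = "1 hour", length.out = 30*24 ),
                        Stop = seq( as.POSIXct( "2019-01-01 01:00:00", tz = "UTC" ),
                                    by = "1 hour", length.out = 30*24 ) - 1 )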
#set df as data.table
dt <- as.data.table( df )
#filter out any row where Stop is smaller than Start
dt <- dt[ Start < Stop, ]
#perform overlap join
#first set keys
setkey(dt, Start, Stop)
#then perform join
result <- foverlaps( dt.dates, dt, type = "within" )
#summarise
result[, .( Value = sum( Value , na.rm = TRUE ) ), by = .(Day = i.Start) ]
output
# Day Value
# 1: 2019-01-01 1500
# 2: 2019-01-02 1500
# 3: 2019-01-03 1500
# 4: 2019-01-04 351562
# 5: 2019-01-05 413050
# 6: 2019-01-06 400440
# 7: 2019-01-07 332880
# 8: 2019-01-08 332880
# 9: 2019-01-09 332880
# 10: 2019-01-10 64372
# 11: 2019-01-11 0
# 12: 2019-01-12 0
# 13: 2019-01-13 0
# 14: 2019-01-14 0
# 15: 2019-01-15 0
# 16: 2019-01-16 0
# 17: 2019-01-17 0
# 18: 2019-01-18 0
# 19: 2019-01-19 0
# 20: 2019-01-20 0
# 21: 2019-01-21 0
# 22: 2019-01-22 0
# 23: 2019-01-23 0
# 24: 2019-01-24 0
# 25: 2019-01-25 0
# 26: 2019-01-26 0
# 27: 2019-01-27 0
# 28: 2019-01-28 0
# 29: 2019-01-29 0
# 30: 2019-01-30 0
# 31: 2019-01-31 0
# Day Value
plot
#summarise for plot
result.plot <- result[, .( Value = sum( Value , na.rm = TRUE ) ), by = .(Day = i.Start) ]
library( ggplot2 )
ggplot( data = result.plot, aes( x = Day, y = Value ) ) + geom_col()
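For reference, the non-equi join mentioned at the top would look roughly like this. It is only a sketch, and note the different semantics: it counts a record towards every day it overlaps at all, whereas foverlaps() with type = "within" above only counts records whose period covers the whole day.
#non-equi join: a record matches a day if the two periods overlap at all
result2 <- dt[ dt.dates,
               .( Day = i.Start, Value ),
               on = .( Start <= Stop, Stop >= Start ),
               allow.cartesian = TRUE ]
result2[ , .( Value = sum( Value, na.rm = TRUE ) ), by = Day ]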

Related

Add column in dataframe based on 3 columns from another dataframe using R

I have 2 dataframes which are as follows:
Dataframe 1: traffic_df, which is hourly data.
Date_Time             Traffic
2020-03-09 06:00:00        10
2020-03-09 07:00:00        20
2020-03-10 07:00:00        20
2020-03-24 08:00:00        15
Dataframe 2: Alert.level
Start             End               Alert.level
10/03/2020 13:30  23/03/2020 13:30            2
23/03/2020 13:30  25/03/2020 23:59            3
I want to add a 3rd column to traffic_df which is the associated Alert.level if the Date_Time falls within the Start and End Date_Time of the Alert.level df so that the resulting dataframe will look like this:
Dataframe 1: traffic_df
Date_Time             Traffic  Alert.Level
2020-03-09 06:00:00        10
2020-03-09 07:00:00        20
2020-03-10 07:00:00        20            2
2020-03-24 08:00:00        15            3
Is there any way to do this without having to make a matching hourly dataframe and then using a join?
I'm thinking somehow using the map function?
Code to produce the df:
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00", "2020-03-10 07:00:00",
"2020-03-24 08:00:00"), Traffic = c(10L, 20L, 20L, 15L)),
row.names = c(NA, -4L), class = "data.frame")
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
End = c("23/03/2020 13:30", "25/03/2020 23:59"),
Alert.level = c(2, 3))
You may try the fuzzyjoin package.
Data
library(dplyr)
library(lubridate)
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                             Traffic = c(10L, 20L, 20L, 15L)),
                        row.names = c(NA, -4L), class = "data.frame") %>%
  mutate(Date_Time = ymd_hms(Date_Time))
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
                         End = c("23/03/2020 13:30", "25/03/2020 23:59"),
                         Alert.level = c(2, 3)) %>%
  mutate(Start = dmy_hm(Start),
         End = dmy_hm(End))
Code
library(fuzzyjoin)
traffic_df %>%
fuzzy_left_join(Alert.Level,
match_fun = list(`>=`, `<=`),
by = list(x = c("Date_Time",
"Date_Time"),
y = c("Start",
"End"))) %>%
select(-Start, -End)
Output
In contrast to your expected output above, row three is not matched, because 7:00 o'clock is before the starting time of 13:30.
Date_Time Traffic Alert.level
1 2020-03-09 06:00:00 10 NA
2 2020-03-09 07:00:00 20 NA
3 2020-03-10 07:00:00 20 NA
4 2020-03-24 08:00:00 15 3
Here is a solution using sqldf. Note that I renamed the data.frame to use an underscore, for convenience with SQL.
library(sqldf)
Alert_level <- Alert.Level
sqldf("SELECT * FROM traffic_df
       LEFT JOIN Alert_level
       ON traffic_df.Date_Time BETWEEN Alert_level.Start AND Alert_level.End")
Output
Date_Time Traffic Start End Alert.level
1 2020-03-09 06:00:00 10 <NA> <NA> NA
2 2020-03-09 07:00:00 20 <NA> <NA> NA
3 2020-03-10 07:00:00 20 <NA> <NA> NA
4 2020-03-24 08:00:00 15 2020-03-23 13:30:00 2020-03-25 23:59:00 3
I like outer approaches in such cases. First, define a Vectorized FUNction that checks whether a specific x lies in a y interval. Put it in outer, which iterates each Date_Time over each start/end interval of Alert.Level. This gives a matrix o that shows which of the intervals is applicable (I use unname to avoid confusion). Then, in traffic_df we create an NA column alert_lv (it should just have a different name than "Alert.Level"), subset it where o indicates a match, and put in the corresponding levels of Alert.Level.
FUN <- Vectorize(function(x, y) x >= y[1] & x < y[2])
(o <- unname(outer(traffic_df$Date_Time, Alert.Level[-3], FUN)))
#       [,1]  [,2]  [,3]  [,4]
# [1,] FALSE FALSE  TRUE FALSE
# [2,] FALSE FALSE FALSE  TRUE
w <- unlist(apply(o, 1, which))
traffic_df <- within(traffic_df, {
  alert_lv <- NA
  alert_lv[rowSums(o) > 0] <- Alert.Level[w, 3]
})
traffic_df
#             Date_Time Traffic alert_lv
# 1 2020-03-09 06:00:00      10       NA
# 2 2020-03-09 07:00:00      20       NA
# 3 2020-03-10 07:00:00      20        2
# 4 2020-03-24 08:00:00      15        3
Note: to use this solution you first need the columns in the usual 'POSIXct' formats, so you should start with
traffic_df$Date_Time <- as.POSIXct(traffic_df$Date_Time)
Alert.Level[1:2] <- lapply(Alert.Level[1:2], strptime, format='%d/%m/%Y %H:%M')
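For completeness, the map()-based route the question wonders about could look like this; a sketch with purrr::map_dbl on the converted data from the note above (alert_lv is again just an arbitrary column name):
library(dplyr)
library(purrr)
traffic_df %>%
  mutate(alert_lv = map_dbl(Date_Time, function(t) {
    # index of the first interval containing t, if any
    hit <- which(t >= Alert.Level$Start & t <= Alert.Level$End)
    if (length(hit) > 0) Alert.Level$Alert.level[hit[1]] else NA_real_
  }))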

Is there a function for checking if a time interval overlaps in a single column and sort by group in R

I have a large dataset and I'm trying to find where time intervals overlap by group. To complicate things further I'm hoping that the code could be integrated with the 'dplyr' group_by function so the overlapping times don't get confused with other ids.
I've tried the overlap function int_overlaps(int1, int2) from lubridate, but this doesn't work for a single column, and other overlap functions don't appear to work with time intervals.
library(lubridate)
id <- c(1, 1, 1, 2, 2)
start <- as.POSIXct(c("2017-06-27 09:30:00", "2017-06-27 15:30:00", "2017-06-27 14:30:00",
                      "2017-06-28 09:30:00", "2017-06-28 15:00:00"), tz = "UTC")
end <- as.POSIXct(c("2017-06-27 10:30:00", "2017-06-27 17:30:00", "2017-06-27 18:30:00",
                    "2017-06-28 10:30:00", "2017-06-28 16:00:00"), tz = "UTC")
inter1 <- interval(start, end, tz = "UTC")
df <- data.frame(id, inter1)
overlap <- c(FALSE, TRUE, TRUE, FALSE, FALSE)
new_df <- data.frame(id, inter1, overlap)
The sample data doesn't have any overlapping periods. The following change was made:
start <-as.POSIXct(c("2017-06-27 09:30:00","2017-06-27 15:30:00",
"2017-06-27 14:30:00","2017-06-28 09:30:00","2017-06-28 15:00:00"), tz= "UTC")
Using lead() will return NA if it is the last record in a group:
library(dplyr)
new_df %>%
  group_by(id) %>%
  arrange(int_start(inter1), .by_group = TRUE) %>%
  mutate(overlap2 = lead(int_start(inter1)) < int_end(inter1))
# A tibble: 5 x 4
# Groups: id [2]
id inter1 overlap overlap2
<dbl> <Interval> <lgl> <lgl>
1 1 2017-06-27 09:30:00 UTC--2017-06-27 10:30:00 UTC FALSE FALSE
2 1 2017-06-27 14:30:00 UTC--2017-06-28 18:30:00 UTC TRUE TRUE
3 1 2017-06-27 15:30:00 UTC--2017-06-27 17:30:00 UTC TRUE NA
4 2 2017-06-28 09:30:00 UTC--2017-06-28 10:30:00 UTC FALSE FALSE
5 2 2017-06-28 15:00:00 UTC--2017-06-28 16:00:00 UTC FALSE NA
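If the trailing NA per group is unwanted, the comparison can be wrapped in coalesce() so that a missing lead counts as no overlap; a sketch:
new_df %>%
  group_by(id) %>%
  arrange(int_start(inter1), .by_group = TRUE) %>%
  mutate(overlap2 = coalesce(lead(int_start(inter1)) < int_end(inter1), FALSE))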
If you need to compare each row to all rows within the group:
library(tidyverse)
new_df %>%
  group_by(id) %>%
  arrange(int_start(inter1), .by_group = TRUE) %>%
  mutate(overlap2 = map_int(inter1, ~ sum(int_overlaps(.x, inter1))) > 1)
# A tibble: 5 x 4
# Groups: id [2]
id inter1 overlap overlap2
<dbl> <Interval> <lgl> <lgl>
1 1 2017-06-27 09:30:00 UTC--2017-06-27 10:30:00 UTC FALSE FALSE
2 1 2017-06-27 14:30:00 UTC--2017-06-28 18:30:00 UTC TRUE TRUE
3 1 2017-06-27 15:30:00 UTC--2017-06-27 17:30:00 UTC TRUE TRUE
4 2 2017-06-28 09:30:00 UTC--2017-06-28 10:30:00 UTC FALSE FALSE
5 2 2017-06-28 15:00:00 UTC--2017-06-28 16:00:00 UTC FALSE FALSE
1) sqldf Assuming you only want to overlap the times and not the dates, replace inter1 with start and end as well as the starting and ending times, time1 and time2, giving new_df1. Then do a self join on id and the overlap condition, grouping by rowid. overlap is TRUE if the count of matching rows exceeds 1 (since a row overlapping itself does not count).
library(dplyr)
library(lubridate)
library(sqldf)
new_df1 <- new_df %>%
  mutate(
    start = int_start(inter1),
    end = int_end(inter1),
    time1 = sub(".* ", "", start),
    time2 = sub(".* ", "", end),
    inter1 = NULL
  )
sqldf("select a.id, a.start, a.end, count(*) > 1 as overlap
       from new_df1 a
       join new_df1 b on a.id = b.id and
         (a.time1 between b.time1 and b.time2 or b.time1 between a.time1 and a.time2)
       group by a.rowid")
giving:
id start end overlap
1 1 2017-06-27 05:30:00 2017-06-27 06:30:00 FALSE
2 1 2017-06-27 11:30:00 2017-06-27 13:30:00 TRUE
3 1 2017-06-28 10:30:00 2017-06-28 14:30:00 TRUE
4 2 2017-06-28 05:30:00 2017-06-28 06:30:00 FALSE
5 2 2017-06-28 11:00:00 2017-06-28 12:00:00 FALSE
2) This forms the full ni x ni join for each id i and then filters it down and groups it in second and third steps, whereas (1) does these all at once, so depending on the SQL optimizations applied by the database software, (1) might be much more efficient. Anyway, this joins on id, then filters on the overlap condition, and finally does the counting. new_df1 is from (1).
new_df1 %>%
  mutate(rowid = 1:n()) %>%
  inner_join(new_df1, by = "id", suffix = c("", ".y")) %>%
  filter((time1 >= time1.y & time1 <= time2.y) |
         (time1.y >= time1 & time1.y <= time2)) %>%
  count(rowid, id, start, end) %>%
  mutate(overlap = n > 1) %>%
  select(id, start, end, overlap)
giving:
# A tibble: 5 x 4
rowid start end overlap
<int> <dttm> <dttm> <lgl>
1 1 2017-06-27 09:30:00 2017-06-27 10:30:00 FALSE
2 2 2017-06-27 15:30:00 2017-06-27 17:30:00 TRUE
3 3 2017-06-28 14:30:00 2017-06-28 18:30:00 TRUE
4 4 2017-06-28 09:30:00 2017-06-28 10:30:00 FALSE
5 5 2017-06-28 15:00:00 2017-06-28 16:00:00 FALSE
Note
The poster changed the question after it was already answered but in any case we used this as the input.
new_df <-
structure(list(id = c(1, 1, 1, 2, 2), inter1 = new("Interval",
.Data = c(3600, 7200, 14400, 3600, 3600), start = structure(c(1498555800,
1498577400, 1498660200, 1498642200, 1498662000), tzone = "UTC",
class = c("POSIXct",
"POSIXt")), tzone = "UTC"), overlap = c(FALSE, TRUE, TRUE,
FALSE, FALSE)), class = "data.frame", row.names = c(NA, -5L))

R data.table add column as function of another data.table

I have one data table which contains just a sequence of times. I have another data table containing two columns: start_time and end_time. I want to take the first data table and add a column whose value is the count of all rows in the second data table where the time from the first data table falls between the start and end time. Here is my code:
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
Here is what I want to do, but this is wrong and gives an error. What's the right way to write this?
all_dates[, BinCount := input_data[start_times < Bin & end_times > Bin, .N] ]
In the end I should get something like:
Bin BinCount
2017-01-31 17:00:00 1
2017-01-31 17:01:00 5
...
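As an aside on why the attempt errors: inside input_data[...], Bin is not evaluated row by row; it resolves to the whole Bin column at once, so the filter compares entire vectors in one shot. Evaluating the count per bin fixes it; a sketch (it works, though the join-based answers below scale much better):
all_dates[, BinCount := sapply(Bin, function(b)
  input_data[start_times < b & end_times > b, .N])]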
The problem can be solved very easily using sqldf, as it provides an easy way to join tables with range checking. Hence one solution could be:
The data from OP:
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
library(sqldf)
result <- sqldf("SELECT all_dates.bin, count() as BinCount
                 FROM all_dates, input_data
                 WHERE all_dates.bin > input_data.start_times AND
                       all_dates.bin < input_data.end_times
                 GROUP BY bin")
result
Bin BinCount
1 2017-01-31 17:01:00 1
2 2017-01-31 17:02:00 1
3 2017-01-31 17:03:00 1
4 2017-01-31 17:04:00 1
5 2017-01-31 17:05:00 1
6 2017-01-31 17:06:00 1
...........
...........
497 2017-02-01 01:17:00 6
498 2017-02-01 01:18:00 5
499 2017-02-01 01:19:00 5
500 2017-02-01 01:20:00 4
[ reached getOption("max.print") -- omitted 460 rows ]
In data.table you're after a range join.
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
set.seed(123)
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
## doing the range-join and calculating the number of items per bin in one chained step
input_data[
  all_dates
  , on = .(start_times < Bin, end_times > Bin)
  , nomatch = 0
  , allow.cartesian = TRUE
][, .N, by = start_times]
# start_times N
# 1: 2017-01-31 17:01:00 1
# 2: 2017-01-31 17:02:00 1
# 3: 2017-01-31 17:03:00 1
# 4: 2017-01-31 17:04:00 1
# 5: 2017-01-31 17:05:00 1
# ---
# 956: 2017-02-01 08:56:00 6
# 957: 2017-02-01 08:57:00 4
# 958: 2017-02-01 08:58:00 4
# 959: 2017-02-01 08:59:00 5
# 960: 2017-02-01 09:00:00 5
Note:
I've put the all_dates object on the right-hand side of the join, so the result contains the names of the input_data columns, even though they are your Bins (see this issue for the discussion on this topic)
I've used set.seed(), as you're taking samples
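Also note that nomatch = 0 drops bins with zero active intervals (the tidyverse answer below keeps them). A sketch that joins the counts back onto all_dates to restore the empty bins:
binned <- input_data[
  all_dates
  , on = .(start_times < Bin, end_times > Bin)
  , nomatch = 0
  , allow.cartesian = TRUE
][, .N, by = start_times]
## update-join: fill matched bins with their counts, leave the rest at 0
all_dates[, BinCount := 0L]
all_dates[binned, on = .(Bin = start_times), BinCount := i.N]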
It wasn't requested, but here is a compact alternative solution using the tidyverse. It uses the lubridate parsers, interval, and %within%, as well as purrr::map_int to generate the desired bin counts.
library(tidyverse)
library(lubridate)
start_date <- ymd_hms(x = "2017-01-31 17:00:00") # lubridate parsers
end_date <- ymd_hms(x = "2017-02-01 09:00:00")
all_dates <- tibble(seq(start_date, end_date, "min")) # tibble swap for data.table
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date, end_date, "min"), 100)
offsets <- sample(seq(60, 7200, 60), 100)
end_times <- start_times + offsets
input_data <- tibble(
  start_times,
  end_times,
  intvl = interval(start_times, end_times) # add interval column
)
all_dates %>% # check each Bin and count the intervals it lies within
  mutate(BinCount = map_int(.$Bin, ~ sum(. %within% input_data$intvl)))
# A tibble: 961 x 2
Bin BinCount
<dttm> <int>
1 2017-01-31 17:00:00 0
2 2017-01-31 17:01:00 0
3 2017-01-31 17:02:00 0
4 2017-01-31 17:03:00 0
5 2017-01-31 17:04:00 0
6 2017-01-31 17:05:00 0
7 2017-01-31 17:06:00 0
8 2017-01-31 17:07:00 1
9 2017-01-31 17:08:00 1
10 2017-01-31 17:09:00 1
# ... with 951 more rows

Interpolation over time

In a dataframe, I have wind speed data measured four times a day, at 00:00, 06:00, 12:00 and 18:00 o'clock. To combine these with other data, I need to fill the time in between up to a resolution of 15 minutes. I would like to fill the gaps by simple interpolation.
The following example produces two corresponding sample dataframes. df1 and df2 need to be merged. In the resulting merged dataframe, the gap values between the 6-hourly values (where var == NA) need to be filled by a simple mean interpolation. My problem is how to merge both and do the concrete interpolation between the given values.
First dataframe
Creation:
# create a corresponding sample data frame
df1 <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2015, 10, 1, 0, 0, 0, tz = "GMT"),
    to = ISOdatetime(2015, 10, 14, 23, 59, 0, tz = "GMT"),
    by = "6 hour"
  ),
  windspeed = abs(rnorm(14*4, 10, 4)) # abs() because windspeed should be positive
)
Resulting dataframe:
> # show the head of the dataframe
> head(df1)
date windspeed
1 2015-10-01 00:00:00 17.928217
2 2015-10-01 06:00:00 11.306025
3 2015-10-01 12:00:00 6.648131
4 2015-10-01 18:00:00 10.320146
5 2015-10-02 00:00:00 2.138559
6 2015-10-02 06:00:00 9.076344
Second dataframe
Creation:
# create a 2nd corresponding sample data frame
df2 <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2015, 10, 1, 0, 0, 0, tz = "GMT"),
    to = ISOdatetime(2015, 10, 14, 23, 59, 0, tz = "GMT"),
    by = "15 min"
  ),
  var = abs(rnorm(14*24*4, 300, 100))
)
Resulting dataframe:
> # show the head of the 2nd dataframe
> head(df2)
date var
1 2015-10-01 00:00:00 198.2657
2 2015-10-01 00:15:00 472.9041
3 2015-10-01 00:30:00 605.8776
4 2015-10-01 00:45:00 429.0949
5 2015-10-01 01:00:00 400.2390
6 2015-10-01 01:15:00 317.1503
This is a solution.
First merge them, using all = TRUE to get all values:
df3 <- merge(df1, df2, all = TRUE)
Then use approx for the interpolation:
df3$windspeed <- approx(x = df1$date, y = df1$windspeed, xout = df2$date)$y
The only problem is that the last values will be NA unless your last windspeed observation sits at the final timestamp, but everything in between will be filled.
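A possible tweak for those trailing NAs: approx() has a rule argument, and rule = 2 extends the nearest observed value instead of returning NA outside the range of df1$date; a sketch:
df3 <- merge(df1, df2, all = TRUE)
# rule = 2: outside the range of df1$date, repeat the closest observed windspeed
df3$windspeed <- approx(x = df1$date, y = df1$windspeed, xout = df3$date, rule = 2)$y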

R time aggregate with start/stop

I have a set of time series data that has a start and stop time. Each event can last from a few seconds to a few days. I need to calculate, for every hour, the sum (in this example, the total memory used) over the jobs active at that time. Here is a sample of the data:
mem_used start_time stop_time
16 2015-10-24 17:24:41 2015-10-25 04:19:44
80 2015-10-24 17:24:51 2015-10-25 03:14:59
44 2015-10-24 17:25:27 2015-10-25 01:16:10
28 2015-10-24 17:25:43 2015-10-25 00:00:31
72 2015-10-24 17:30:23 2015-10-24 23:58:31
In this case it should give something like:
time total_mem
2015-10-24 17:00:00 240
2015-10-24 18:00:00 240
...
2015-10-25 00:00:00 168
2015-10-25 01:00:00 140
2015-10-25 02:00:00 96
2015-10-25 03:00:00 96
2015-10-25 04:00:00 16
I'm trying to do something with the aggregate function but I cannot figure it out. Any ideas? Thanks.
Here's how I would do it, using lubridate.
First, make sure that your dates are in POSIXct format:
dat$start_time = as.POSIXct(dat$start_time, format = "%Y-%m-%d %H:%M:%S")
dat$stop_time = as.POSIXct(dat$stop_time, format = "%Y-%m-%d %H:%M:%S")
Then make an interval object with lubridate:
library(lubridate)
dat$interval <- interval(dat$start_time, dat$stop_time)
Now we can make a vector of times, replace these with your desired times:
z <- seq(from = dat$start_time[1], to = dat$stop_time[5], by = "hours")
And sum those where we have an overlap:
out <- data.frame(times = z,
                  mem_used = sapply(z, function(x) sum(dat$mem_used[x %within% dat$interval])))
times mem_used
1 2015-10-24 17:24:41 16
2 2015-10-24 18:24:41 240
3 2015-10-24 19:24:41 240
4 2015-10-24 20:24:41 240
5 2015-10-24 21:24:41 240
6 2015-10-24 22:24:41 240
7 2015-10-24 23:24:41 240
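To get bins that fall on the hour (as in the desired output) rather than starting at the first start_time, the sequence endpoints could be rounded first; a sketch using lubridate's floor_date()/ceiling_date():
z <- seq(from = floor_date(min(dat$start_time), "hour"),
         to = ceiling_date(max(dat$stop_time), "hour"),
         by = "hours")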
Here's the data used:
structure(list(mem_used = c(16L, 80L, 44L, 28L, 72L), start_time = structure(c(1445721881,
1445721891, 1445721927, 1445721943, 1445722223), class = c("POSIXct",
"POSIXt"), tzone = ""), stop_time = structure(c(1445761184, 1445757299,
1445750170, 1445745631, 1445745511), class = c("POSIXct", "POSIXt"
), tzone = "")), .Names = c("mem_used", "start_time", "stop_time"
), row.names = c(NA, -5L), class = "data.frame")
Here is another solution based on dplyr and lubridate.
First make sure the data is in the right format (e.g. dates as POSIXct):
library(dplyr)
library(lubridate)
glimpse(df)
## Observations: 5
## Variables: 3
## $ mem_used (int) 16, 80, 44, 28, 72
## $ start_time (time) 2015-10-24 17:24:41, 2015-10-24 17:24:51...
## $ end_time (time) 2015-10-25 04:19:44, 2015-10-25 03:14:59...
Then we will just keep the hour (removing minutes and seconds) since we want to aggregate per hour.
### Remove minutes and seconds
minute(df$start_time) <- 0
second(df$start_time) <- 0
minute(df$end_time) <- 0
second(df$end_time) <- 0
The most important step now is to create a new data.frame with one row for each hour between start_time and end_time. For example, if the first line of the original data.frame spans 5 hours between start_time and end_time, we end up with 5 rows, with the value mem_used duplicated 5 times.
n <- nrow(df)
l <- lapply(1:n, function(i) {
  date <- seq.POSIXt(df$start_time[i], df$end_time[i], by = "hour")
  mem_used <- rep(df$mem_used[i], length(date))
  data.frame(time = date, mem_used = mem_used)
})
df <- Reduce(rbind, l)
glimpse(df)
## Observations: 47
## Variables: 2
## $ time (time) 2015-10-24 17:00:00, 2015-10-24 18:00:00, ...
## $ mem_used (int) 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,...
Finally, we can now aggregate using dplyr or aggregate (or other similar functions)
df %>%
  group_by(time) %>%
  summarise(tot = sum(mem_used))
## time tot
## (time) (int)
## 1 2015-10-24 17:00:00 240
## 2 2015-10-24 18:00:00 240
## 3 2015-10-24 19:00:00 240
## 4 2015-10-24 20:00:00 240
## 5 2015-10-24 21:00:00 240
## 6 2015-10-24 22:00:00 240
## 7 2015-10-24 23:00:00 240
## 8 2015-10-25 00:00:00 168
## 9 2015-10-25 01:00:00 140
## 10 2015-10-25 02:00:00 96
## 11 2015-10-25 03:00:00 96
## 12 2015-10-25 04:00:00 16
## Or aggregate
aggregate(mem_used ~ time, FUN = sum, data = df)
