I have a selection of scattered timestamp data based on requests to a particular service. This data covers approximately 3.5-4 years of requests against this service.
I am looking to turn this selection of variable-interval timestamps into a frequency-binned timeseries in R.
How would I go about converting these timestamps into a frequency-binned timeseries, such as "between 1 and 1:15PM on this day, there were 7 requests, and between 1:15 and 1:30PM there were 2, and between 1:30 and 1:45, there were 0", being sure to also have a bin where there is nothing?
The data is just a vector of timestamps from a database dump, all of the format "2014-02-17 13:10:46". Just a big ol' vector with ~2 million elements in it.
You could use tools for handling time series data from xts and zoo. Note that you will need some artificial 'data':
library(xts)
# minutes must be 0-59; sample(60, 10) could draw 60, which yields NA
ts.index <- ISOdatetime(2018, 1, 8, 8:9, sample(0:59, 10), 0)
ts <- xts(rep(1, length(ts.index)), ts.index)
aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
#> 2018-01-08 08:15:00 1
#> 2018-01-08 08:30:00 3
#> 2018-01-08 08:45:00 1
#> 2018-01-08 09:00:00 1
#> 2018-01-08 09:15:00 1
#> 2018-01-08 09:45:00 3
Edit: If you want to include bins without observations, you can convert to a strictly regular ts object and replace the inserted NA values with zero:
raw <- aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
as.xts(na.fill(as.ts(raw), 0), dateFormat = "POSIXct")
#> zoo(coredata(x), tt)
#> 2018-01-08 08:15:00 1
#> 2018-01-08 08:30:00 3
#> 2018-01-08 08:45:00 1
#> 2018-01-08 09:00:00 1
#> 2018-01-08 09:15:00 1
#> 2018-01-08 09:30:00 0
#> 2018-01-08 09:45:00 3
Edit 2: It also works for the provided sample data:
data <- c(1228917812, 1245038910, 1245986979, 1268750482, 1281615510, 1292561113)
class(data) = c("POSIXct", "POSIXt")
attr(data, "tzone") <- "UTC"
#> structure(c(1228917812, 1245038910, 1245986979, 1268750482, 1281615510,
#> 1292561113), class = c("POSIXct", "POSIXt"), tzone = "UTC")
ts <- xts(rep(1, length(data)), data)
raw <- aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
head(as.xts(na.fill(as.ts(raw), 0), dateFormat = "POSIXct"))
#> zoo(coredata(x), tt)
#> 2008-12-10 15:00:00 1
#> 2008-12-10 15:15:00 0
#> 2008-12-10 15:30:00 0
#> 2008-12-10 15:45:00 0
#> 2008-12-10 16:00:00 0
#> 2008-12-10 16:15:00 0
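If loading xts and zoo is not an option, the same idea (floor each timestamp to the start of its 15-minute bin, then count, keeping the empty bins) can be sketched in base R. The sample timestamps and the names `stamps`, `bin_secs` and `binned` are made up for illustration, and the bins are aligned to UTC quarter hours; treat this as a minimal sketch under those assumptions rather than a tuned solution.

```r
# Hypothetical sample timestamps standing in for the database dump (UTC assumed)
stamps <- as.POSIXct(c("2014-02-17 13:10:46", "2014-02-17 13:12:03",
                       "2014-02-17 13:20:00", "2014-02-17 13:50:30"),
                     tz = "UTC")

bin_secs <- 15 * 60                      # bin width in seconds
secs <- as.numeric(stamps)               # seconds since the epoch

# Start of the first and last bin that contain data
first_bin <- floor(min(secs) / bin_secs) * bin_secs
last_bin  <- floor(max(secs) / bin_secs) * bin_secs
bin_starts <- seq(first_bin, last_bin, by = bin_secs)

# 1-based bin index of every timestamp
idx <- (secs - first_bin) %/% bin_secs + 1

# tabulate() counts per index and reports 0 for bins nobody fell into
binned <- data.frame(
  bin_start = as.POSIXct(bin_starts, origin = "1970-01-01", tz = "UTC"),
  n = tabulate(idx, nbins = length(bin_starts))
)
binned
```

With the four sample timestamps, the 13:30 bin has no requests and still shows up with a count of 0.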
Let's say I have a csv file. For example, this one, https://www.misoenergy.org/planning/generator-interconnection/GI_Queue/gi-interactive-queue/#
If I do
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-12 20:00:00 NA 2003-12-12 19:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-14 20:00:00 NA 2013-10-21 20:00:00 2015-12-31 19:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-07 19:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-11-30 19:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
I can do Sys.setenv(TZ="GMT") before I load the file and then that avoids the offset issue.
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-13 00:00:00 NA 2003-12-13 00:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-15 00:00:00 NA 2013-10-22 00:00:00 2016-01-01 00:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-08 00:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-12-01 00:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
While setting my session tz to GMT isn't really too onerous, I'm wondering if there's a way to have it either assume the file is the same as my local time zone and just keep it that way or if it wants to assume it's GMT in the file then just keep it in GMT regardless of my local timezone.
"It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern)."
Actually, the timezone conversion you are seeing just happens when you print. You can see this if you save the data frame to a variable and print it before and after you change your current timezone:
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
df <- miso_queue %>% collect()
df[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-12 17:00:00
# 2 2012-05-14 17:00:00
# 3 1995-11-07 16:00:00
# 4 1998-11-30 16:00:00
# 5 1998-11-30 16:00:00
# 6 1998-11-30 16:00:00
# 7 1999-02-14 16:00:00
# 8 1999-02-14 16:00:00
# 9 1999-07-29 17:00:00
# 10 1999-08-12 17:00:00
# # … with 3,333 more rows
# after changing the session time zone, e.g. to GMT
Sys.setenv(TZ = "GMT")
df[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-13 00:00:00
# 2 2012-05-15 00:00:00
# 3 1995-11-08 00:00:00
# 4 1998-12-01 00:00:00
# 5 1998-12-01 00:00:00
# 6 1998-12-01 00:00:00
# 7 1999-02-15 00:00:00
# 8 1999-02-15 00:00:00
# 9 1999-07-30 00:00:00
# 10 1999-08-13 00:00:00
# # … with 3,333 more rows
However, in the example you showed there is no time data, so you might be better off reading that column as a date instead of a timestamp. Unfortunately, I think Arrow currently only lets you parse a column as a date if you provide the schema for the whole table. One alternative would be to parse the date columns after reading.
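As a rough illustration of the last suggestion, the base R sketch below uses a hypothetical stand-in vector `queue_date` instead of the real Arrow output, and assumes the parsed values are instants at UTC midnight. It shows that only the displayed time zone changes, and that converting to `Date` in UTC recovers the calendar day from the file.

```r
# Hypothetical stand-in for a collected timestamp column (UTC midnight assumed)
queue_date <- as.POSIXct(c("2013-09-13", "2012-05-15"), tz = "UTC")

# The same instants, displayed in two different time zones
shown_utc     <- format(queue_date, "%Y-%m-%d %H:%M:%S", tz = "UTC")
shown_eastern <- format(queue_date, "%Y-%m-%d %H:%M:%S", tz = "America/New_York")

# The stored values are identical; only the display differs
as.numeric(queue_date)

# Convert to Date, taking the calendar day in UTC so nothing shifts
queue_day <- as.Date(queue_date, tz = "UTC")
queue_day
```

For a whole data frame, the same `as.Date(..., tz = "UTC")` call could be applied to each date column after `collect()`.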
I have a time series (xts) of rain gage data and I would like to be able to sum all the rain amounts between a beginning and end time point from a list. And then make a new data frame that is StormNumber and TotalRain over that time
> head(RainGage)
2019-07-01 00:00:00 0
2019-07-01 00:15:00 0
2019-07-01 00:30:00 0
2019-07-01 00:45:00 0
2019-07-01 01:00:00 0
2019-07-01 01:15:00 0
StormNumber RainStartTime RainEndTime
1 1 2019-07-21 20:00:00 2019-07-22 04:45:00
2 2 2019-07-22 11:30:00 2019-07-22 23:45:00
3 3 2019-07-11 09:15:00 2019-07-11 19:00:00
4 4 2019-05-29 17:00:00 2019-05-29 20:45:00
5 5 2019-06-27 14:30:00 2019-06-27 17:15:00
6 6 2019-07-11 06:15:00 2019-07-11 09:00:00
I have this code that I got from the SO community when I was trying to do something similar in the past (but extract data rather than sum it). However, I have no idea how it works so I am struggling to adapt it to this situation.
do.call(rbind, Map(function(x, y) RainGage[paste(x, y, sep="/")],
                   StormTimes$RainStartTime, StormTimes$RainEndTime))
In this case I would suggest just to write your own function and then use apply to achieve what you want, for example:
dates <- c('2019-07-01 00:00:00', '2019-07-01 00:15:00',
'2019-07-01 00:30:00', '2019-07-01 00:45:00',
'2019-07-01 01:00:00', '2019-07-01 01:15:00')
dates <- as.POSIXct(strptime(dates, '%Y-%m-%d %H:%M:%S'))
mm <- c(0, 10, 10, 20, 0, 0)
rain <- data.frame(dates, mm)
number <- c(1,2)
start <- c('2019-07-01 00:00:00','2019-07-01 00:18:00')
start <- as.POSIXct(strptime(start, '%Y-%m-%d %H:%M:%S'))
end <- c('2019-07-01 00:17:00','2019-07-01 01:20:00')
end <- as.POSIXct(strptime(end, '%Y-%m-%d %H:%M:%S'))
storms <- data.frame(number, start, end)
# Sum of rain
f = function(x, output) {
  # Get storm number
  number = x[1]
  # Get starting moment
  start = x[2]
  # Get ending moment
  end = x[3]
  # Calculate sum (this is the value the function returns)
  sum(rain[rain$dates >= start & rain$dates < end, 'mm'])
}
# Apply function to each row of the dataframe
storms$rain <- apply(storms, 1, f)
This yields:
number start end rain
1 1 2019-07-01 00:00:00 2019-07-01 00:17:00 10
2 2 2019-07-01 00:18:00 2019-07-01 01:20:00 30
So a column rain in storms now holds the sum of rain$mm, which is what you're after.
Hope that helps you out!
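One caveat with `apply` on a data frame is that each row is converted to character before it reaches the function. A variant that indexes the rows instead keeps the POSIXct columns intact. The sketch below rebuilds the same toy `rain` and `storms` data (with an assumed UTC time zone) and uses `vapply`; it is an illustration of the idea, not the only way to do it.

```r
# Same toy data as above, with an explicit (assumed) UTC time zone
rain <- data.frame(
  dates = as.POSIXct(c("2019-07-01 00:00:00", "2019-07-01 00:15:00",
                       "2019-07-01 00:30:00", "2019-07-01 00:45:00",
                       "2019-07-01 01:00:00", "2019-07-01 01:15:00"),
                     tz = "UTC"),
  mm = c(0, 10, 10, 20, 0, 0)
)
storms <- data.frame(
  number = 1:2,
  start  = as.POSIXct(c("2019-07-01 00:00:00", "2019-07-01 00:18:00"), tz = "UTC"),
  end    = as.POSIXct(c("2019-07-01 00:17:00", "2019-07-01 01:20:00"), tz = "UTC")
)

# For each storm, sum the rain readings with start <= time < end
storms$rain <- vapply(seq_len(nrow(storms)), function(i) {
  in_storm <- rain$dates >= storms$start[i] & rain$dates < storms$end[i]
  sum(rain$mm[in_storm])
}, numeric(1))

storms
```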
I am dealing with a huge dataset (years of 1-minute-interval observations of energy usage). I want to convert it from 1-min-interval to 15-min-interval.
I have written a for loop which does this successfully (tested on a small subset of the data); however, when I tried running it on the main data, it was executing very slowly - and it would have taken me over 175 hours to run the full loop (I stopped it mid-execution).
The data to be converted to 15-minute intervals is the kWh usage; converting it simply requires taking the average of the first 15 observations, then the second 15, etc. This is the loop that's working:
# Opening the file
data <- read.csv("1.csv",colClasses="character",na.strings="?")
# Adding an index to each row
total <- nrow(data)
data$obsnum <- seq.int(nrow(data))
# Calculating 15 min kwH usage
data$use_15_min <- data$use
for (i in 1:total) {
  int_used <- floor((i-1)/15)
  obsNum <- 15*int_used
  sum <- 0
  for (j in 1:15) {
    usedIndex <- as.numeric(obsNum+j)
    sum <- as.numeric(data$use[usedIndex]) + sum
  }
  data$use_15_min[i] <- sum/15
}
I have been searching for a function that can do the same, but without using loops, as I imagine this should save much time. Yet, I haven't been able to find one. How is it possible to achieve the same functionality without using a loop?
Try data.table:
library(data.table)
DT <- data.table(data)
n <- nrow(DT)
DT[, use_15_min := mean(use), by = gl(n, 15, n)]
The question is missing the input data so we used this:
data <- data.frame(use = 1:100)
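For comparison, the same fixed-size grouping can be sketched in base R without any packages. The grouping vector `block` is a name introduced here for illustration; `ave` fills every row with the mean of its block of 15 (mirroring what the loop writes into `use_15_min`), and `tapply` gives one value per block. Unlike the loop, a final incomplete block is averaged over the rows it actually has.

```r
# Stand-in data, as in the answer above
data <- data.frame(use = 1:100)

# Block id for each row: rows 1-15 -> 0, rows 16-30 -> 1, ...
block <- (seq_len(nrow(data)) - 1) %/% 15

# as.numeric() matters for the original data, which was read as character
use_num <- as.numeric(data$use)

# Every row gets the mean of its 15-row block
data$use_15_min <- ave(use_num, block)

# One mean per block (the actual 15-minute series)
block_means <- tapply(use_num, block, mean)
block_means
```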
A potential solution is to calculate the running mean (e.g. using TTR::runMean) and then select every 15th observation. Because runMean is a trailing mean (the first 14 values are NA), the selection starts at row 15. Here is an example:
df = data.frame(x = 1:100, y = runif(100))
df$runmean = TTR::runMean(df$y, n = 15)
df_15 = df[seq(15, nrow(df), 15), ]
I cannot test it, as I do not have your data, but perhaps:
# the question reads every column as character, hence as.numeric()
data$use_15_min = TTR::runMean(as.numeric(data$use), n = 15)
data_15_min = data[seq(15, nrow(data), 15), ]
I would use lubridate::floor_date to create the 15-minute groupings.
library(dplyr)
library(lubridate)
df <- tibble(
  date = seq(ymd_hm("2019-01-01 00:00"), by = "min", length.out = 60 * 24 * 7),
  value = rnorm(n = 60 * 24 * 7)
)
df
#> # A tibble: 10,080 x 2
#> date value
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 0.182
#> 2 2019-01-01 00:01:00 0.616
#> 3 2019-01-01 00:02:00 -0.252
#> 4 2019-01-01 00:03:00 0.0726
#> 5 2019-01-01 00:04:00 -0.917
#> 6 2019-01-01 00:05:00 -1.78
#> 7 2019-01-01 00:06:00 -1.49
#> 8 2019-01-01 00:07:00 -0.818
#> 9 2019-01-01 00:08:00 0.275
#> 10 2019-01-01 00:09:00 1.26
#> # ... with 10,070 more rows
df %>%
  mutate(
    # start of the 15-minute window each observation falls into
    nearest_15_mins = floor_date(date, "15 mins")
  ) %>%
  group_by(nearest_15_mins) %>%
  summarise(
    avg_value_at_15_mins_int = mean(value)
  )
#> # A tibble: 672 x 2
#> nearest_15_mins avg_value_at_15_mins_int
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 -0.272
#> 2 2019-01-01 00:15:00 -0.129
#> 3 2019-01-01 00:30:00 0.173
#> 4 2019-01-01 00:45:00 -0.186
#> 5 2019-01-01 01:00:00 -0.188
#> 6 2019-01-01 01:15:00 0.104
#> 7 2019-01-01 01:30:00 -0.310
#> 8 2019-01-01 01:45:00 -0.173
#> 9 2019-01-01 02:00:00 0.0137
#> 10 2019-01-01 02:15:00 0.419
#> # ... with 662 more rows
I am trying to calculate how many reports are running at a certain time.
The data is like:
ReportID StartTime Duration
1 2018-11-02 13:00:00 240 seconds
2 2018-11-02 14:00:00 300 seconds
3 2018-11-02 14:01:15 300 seconds
4 2018-11-02 14:00:00 5000 seconds
The ideal output will be:
Time #ReportsRunning
2018-11-01 13:00:00 0
2018-11-02 13:00:00 1
2018-11-02 14:00:00 2
2018-11-02 15:00:00 1
Is there any way to do something like this? I am thinking about adding a column for every timestamp I want to check, but that will make the table extremely wide.
Data in reproducible form:
df1 <- data.frame(
  ReportID = 1:4,
  StartTime = as.POSIXct(c("2018-11-02 13:00:00", "2018-11-02 14:00:00",
                           "2018-11-02 14:01:15", "2018-11-02 14:00:00")),
  Duration = as.difftime(c(240, 300, 300, 5000), units = "secs")
)
df2 <- data.frame(
  Time = as.POSIXct(c("2018-11-01 13:00:00", "2018-11-02 13:00:00",
                      "2018-11-02 14:00:00", "2018-11-02 15:00:00"))
)
Here is a base R solution:
df2$`#ReportsRunning` <- sapply(
  df2$Time,
  # a report is running at x if it started at or before x and has not yet ended
  function(x) sum(x >= df1$StartTime & x <= df1$StartTime + df1$Duration)
)
# Time #ReportsRunning
# 1 2018-11-01 13:00:00 0
# 2 2018-11-02 13:00:00 1
# 3 2018-11-02 14:00:00 2
# 4 2018-11-02 15:00:00 1
But if your data is large, it should be much more efficient to use the IRanges package from Bioconductor:
library(IRanges)
ranges <- IRanges(as.integer(df1$StartTime), width = as.integer(df1$Duration))
values <- as.integer(df2$Time)
df2$`#ReportsRunning` <- countOverlaps(values, ranges)
# Time #ReportsRunning
# 1 2018-11-01 13:00:00 0
# 2 2018-11-02 13:00:00 1
# 3 2018-11-02 14:00:00 2
# 4 2018-11-02 15:00:00 1
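If installing a Bioconductor package is not practical, a vectorised base R alternative is to count, for each check time, how many reports have started minus how many have already finished. The sketch below does this with `findInterval` on sorted start and end times; the names `check_times`, `n_started` and `n_finished` are introduced here for illustration, and UTC is assumed so the example is reproducible.

```r
df1 <- data.frame(
  ReportID  = 1:4,
  StartTime = as.POSIXct(c("2018-11-02 13:00:00", "2018-11-02 14:00:00",
                           "2018-11-02 14:01:15", "2018-11-02 14:00:00"),
                         tz = "UTC"),
  Duration  = as.difftime(c(240, 300, 300, 5000), units = "secs")
)
check_times <- as.POSIXct(c("2018-11-01 13:00:00", "2018-11-02 13:00:00",
                            "2018-11-02 14:00:00", "2018-11-02 15:00:00"),
                          tz = "UTC")

# Sorted start and end times as plain seconds since the epoch
starts <- sort(as.numeric(df1$StartTime))
ends   <- sort(as.numeric(df1$StartTime) + as.numeric(df1$Duration, units = "secs"))
t_num  <- as.numeric(check_times)

# Number of reports with start <= t
n_started <- findInterval(t_num, starts)
# Number of reports with end < t (left.open = TRUE makes the comparison strict)
n_finished <- findInterval(t_num, ends, left.open = TRUE)

running <- n_started - n_finished
running
```

This counts a report as running when start <= t <= end, the same condition as the `sapply` version, but it sorts once instead of comparing every check time against every report.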
How do you set 0:00 as the end of day instead of 23:00 in hourly data? I am struggling with this while using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following functions show that periods end at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp in your data, and 00:00:00 corresponds to a value in the next day "bin".
What you can do is shift all the timestamps back 1 hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV type data, so if you are generating, say, hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 when creating the bar. Then 00:00:00 to 00:59:59.9999... would form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
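The "shift back, then label with the end of the day" logic can also be checked without xts. The base R sketch below is only an illustration under assumed names (`times`, `values`, `day_end`) and an assumed UTC time zone: subtracting one second moves a midnight timestamp into the previous day, and adding one day turns that into the end-of-day label.

```r
# Hourly timestamps matching the example range; values 1..120 instead of rnorm()
times  <- seq(as.POSIXct("2018-02-01 00:00:00", tz = "UTC"),
              as.POSIXct("2018-02-05 23:00:00", tz = "UTC"), by = "hour")
values <- seq_along(times)

# Midnight belongs to the day that is ending:
# 2018-02-02 00:00:00 minus 1 second falls on 2018-02-01, plus 1 day gives 2018-02-02
day_end <- as.Date(times - 1, tz = "UTC") + 1

# Last observation of each "day ending at 00:00"
daily_last <- tapply(values, format(day_end), function(v) v[length(v)])
daily_last
```

The first group contains only the 2018-02-01 00:00:00 observation, which matches the first row of the align.time output above.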
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
Note that the value of 2 at 2018-02-02 00:00:00 did not become the last observation of the bar stamped 2018-02-02. That bar covers 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999, so its last observation is the value 1 at 23:59:59.999.
Of course, if you want the daily timestamp to be the start of the day, not the end of the day, which would be 2018-02-01 as start of bar for the first row, in x3.d above, you could shift back the day by one. You could do this relatively safely for most timezones, when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely because there are corner cases when clocks shift within a time zone, so be careful with daylight saving time. Simply subtracting 86400 seconds can be a problem when the subtraction crosses a daylight saving transition:
# e.g. bad: daylight saving time starts on this weekend (2018-03-11) in US Eastern time
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
# right (days() comes from the lubridate package)
> index(z) - days(1)
[1] "2018-03-11 EST"
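If lubridate is not available, base R offers a similar calendar-aware step through `seq` with `by = "DSTday"`, which keeps the clock time fixed across a daylight saving change. The sketch below contrasts it with the naive subtraction; the variable names are illustrative, and it assumes the system's time zone database includes "America/New_York".

```r
x <- as.POSIXct("2018-03-12", tz = "America/New_York")

# Naive: exactly 24 hours earlier, which lands an hour before midnight
# because 2018-03-11 was only 23 hours long in this zone
naive <- x - 86400

# Calendar-aware: one day earlier at the same clock time
safe <- seq(x, by = "-1 DSTday", length.out = 2)[2]

format(naive, "%Y-%m-%d %H:%M:%S")
format(safe, "%Y-%m-%d %H:%M:%S")
```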
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:00"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]