so i have a large data frame with a date time column of class POSIXct and a another column with price data of class numeric. the date time column has values of the form "1998-12-07 02:00:00 AEST" that are half hour observations across 20 years. a sample data set can be generated with the following code (vary the 100 to whatever number of observations are necessary):
data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"), as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))
i want to look at a typical year and typical week. so for the typical year i have the following code:
mean.year <- aggregate(df$price, by = list(format(df$date.time, "%m-%d %H:%M")), mean)
it seems to give me what i want:
Group.1 x
1 01-01 00:00 31.86200
2 01-01 00:30 34.20526
3 01-01 01:00 28.40105
4 01-01 01:30 26.01684
5 01-01 02:00 23.68895
6 01-01 02:30 23.70632
however the column "Group.1" is of class character and i would like it to be of class POSIXct. how can i do this?
for the typical week i have the following code
mean.week <- aggregate(df$price, by = list(format(df$date.time, "%wday %H:%M")), mean)
the output is as follows
Group.1 x
1 0day 00:00 33.05613
2 0day 00:30 30.92815
3 0day 01:00 29.26245
4 0day 01:30 29.47959
5 0day 02:00 29.18380
6 0day 02:30 25.99400
again, column "Group.1" is of class character and i would like POSIXct. also, i would like to have the day of the week as "Monday", "Tuesday", etc. instead of 0day. how would i do this?
Convert the datetime to a character string that can validly be converted back to POSIXct and then do so:
mean.year <- aggregate(df["price"],
by = list(time = as.POSIXct(format(df$date.time, "2000-%m-%d %H:%M"))), mean)
head(mean.year)
## time price
## 1 2000-12-07 02:00:00 -0.56047565
## 2 2000-12-07 02:30:00 -0.23017749
## 3 2000-12-07 03:00:00 1.55870831
## 4 2000-12-07 03:30:00 0.07050839
## 5 2000-12-07 04:00:00 0.12928774
## 6 2000-12-07 04:30:00 1.71506499
To get the day of the week use %a or %A -- see ?strptime for the list of percent codes.
mean.week <- aggregate(df["price"],
by = list(time = format(df$date.time, "%a %H:%M")), mean)
head(mean.week)
## time price
## 1 Mon 02:00 -0.56047565
## 2 Mon 02:30 -0.23017749
## 3 Mon 03:00 1.55870831
## 4 Mon 03:30 0.07050839
## 5 Mon 04:00 0.12928774
## 6 Mon 04:30 1.71506499
Note
The input df in reproducible form -- note that set.seed is needed to make it reproducible:
set.seed(123)
df <- data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"),
as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))
Related
How do you set 0:00 as end of day instead of 23:00 in an hourly data? I have this struggle while using period.apply or to.period as both return days ending at 23:00. Here is an example :
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following functions show periods ends at 23:00
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. so 23:00:00 is seen as the last timestamp in your data, and 00:00:00 corresponds to a value in the next day "bin".
What you can do is shift all the timestamps back 1 hour, use to.period get the daily data points from the hour points, and then using align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV type data, and so if you're say generating say hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 in the bar creation. then 00:00:00 to 00:59:59.9999.... would form the next hourly bar and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 on 00:00:00 did not form the last observation in the day for 2018-02-02 (00:00:00), which went from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999.
Of course, if you want the daily timestamp to be the start of the day, not the end of the day, which would be 2018-02-01 as start of bar for the first row, in x3.d above, you could shift back the day by one. You could do this relatively safely for most timezones, when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safetly, because there are corner cases when there are time shifts in a time zone. e.g. Be careful with day light savings. Simply subtracting -86400 can be a problem when going from Sunday to Saturday in time zones where day light saving occurs:
#e.g. bad: day light savings occurs on this weekend for US EST
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:99"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]
I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have from 00:00 to 23:59hours and with a counter per minute. I'd like to group the data in intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually but this is exhausting so I am sure there has to be a function or sth to do it easily but I haven't figured out yet how to do it.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time=seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by=60),
count=sample(1:50, 100, replace=TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>% group_by(by15=cut(time, "15 min")) %>%
summarise(count=sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to the nearest minute before the next interval (instead of the nearest second), you could to as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records.)
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
origin = "1970-01-01") { # defaults to minute rounding
if(!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
if(is.na(date_var)) return(as.POSIXct(NA)) else {
return(as.POSIXct(floor(as.numeric(date_var) /
(floor_seconds))*(floor_seconds), origin = origin))
}
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
bad1 = as.Date(Sys.time()),
bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2007-05-06 13:45:00 <NA>
You can do it in one line by using trs function from FQOAT, just like:
df_15mins=trs(df, "15 mins")
Below is a repeatable example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455
My problem is as follows: I've got a time series with 5-Minute precipitation data like:
Datum mm
1 2004-04-08 00:05:00 NA
2 2004-04-08 00:10:00 NA
3 2004-04-08 00:15:00 NA
4 2004-04-08 00:20:00 NA
5 2004-04-08 00:25:00 NA
6 2004-04-08 00:30:00 NA
With this structure:
'data.frame': 1098144 obs. of 2 variables:
$ Datum: POSIXlt, format: "2004-04-08 00:05:00" "2004-04-08 00:10:00" "2004-04-08 00:15:00" "2004-04-08 00:20:00" ...
$ mm : num NA NA NA NA NA NA NA NA NA NA ...
As you can see, the time series begins with a lot of NA's, but there is measured precipitation further down, although riddled with single, less common NA's due to malfunction of the measuring station.
What I'm trying to achieve, is summing up the measured precipitation to hourly sums, not considering NA's.
This is what I tried so far:
sums <- aggregate(precip["mm"],
list(cut(precip$Datum, "1 hour")), sum)
Even though the timestamps are correctly aggregated to hours, all sums are 0 or NA. The sums are not even calculated if there is no NA at all.
additionally to be taken into account:
Hourly precipitation sums in meteorology always describe the cumulative sum until a certain hour: The amount of precipitation at 0:00 o'clock describes the sum from 23:00 the previous day until 0:00. So I always need to sum up the previous hour.
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-03-08 23:00:00")
r <- seq(s, s+1e4, "30 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 6, T))
Datum mm
2004-03-08 23:00:00 4
2004-03-08 23:30:00 1
2004-03-09 00:00:00 2
2004-03-09 00:30:00 4
2004-03-09 01:00:00 1
2004-03-09 01:30:00 4
With the above example, the result I am looking for is:
Datum mm
2004-03-09 00:00:00 5
2004-03-09 01:00:00 6
2004-03-09 02:00:00 5
Try adding na.rm=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
# Group.1 mm
# 1 2004-04-08 00:00:00 26
# 2 2004-04-08 01:00:00 35
# 3 2004-04-08 02:00:00 25
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-04-08 00:05:00")
r <- seq(s, s+1e4, "5 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 34, T))
addendum
To your second question: If you would like measurements on the hour to be calculated with the lesser hour add right=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour", right=TRUE)), sum, na.rm=TRUE)
Further Explanation
We will create another more detailed explanation to show how the solution works:
p <- c("2004-04-07 23:48:20", "2004-04-08 00:00:00", "2004-04-08 00:03:20")
ptime <- as.POSIXlt(p)
#[1] "2004-04-07 23:48:20 EDT" "2004-04-08 00:00:00 EDT" "2004-04-08 00:03:20 EDT"
We have three dates to separate into groups. If we use cut without any extra arguments, the second entry "2004-04-08 00:00:00 EDT" will be grouped with the third entry for hour "00:00":
cut(ptime, "1 hour")
#[1] 2004-04-07 23:00:00 2004-04-08 00:00:00 2004-04-08 00:00:00
But if we add the argument right=FALSE we can group it with the "23:00" hour:
cut(ptime, "1 hour", right=TRUE)
#[1] 2004-04-07 23:00:00 2004-04-07 23:00:00 2004-04-08 00:00:00
We can specify the behavior of edge cases.
edit
With your new data the original solution produces the desired output:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
Group.1 mm
1 2004-03-08 23:00:00 5
2 2004-03-09 00:00:00 6
3 2004-03-09 01:00:00 5
You can use dplyr to calculate sum like :
precip$hour <- strftime(precip$Datum,"%Y-%m-%d %H")
library(dplyr)
sum_hour <- precip %>% group_by(hour) %>% summarise(sum_hour = sum(mm,na.rm = T))
I am quite new to R and have been struggling with trying to convert my data and could use some much needed help.
I have a dataframe which is approx. 70,000*2. This data covers a whole year (52 weeks/365 days). A portion of it looks like this:
Create.Date.Time Ticket.ID
1 2013-06-01 12:59:00 INCIDENT684790
2 2013-06-02 07:56:00 SERVICE684793
3 2013-06-02 09:39:00 SERVICE684794
4 2013-06-02 14:14:00 SERVICE684796
5 2013-06-02 17:20:00 SERVICE684797
6 2013-06-03 07:20:00 SERVICE684799
7 2013-06-03 08:02:00 SERVICE684839
8 2013-06-03 08:04:00 SERVICE684841
9 2013-06-03 08:04:00 SERVICE684842
10 2013-06-03 08:08:00 SERVICE684843
I am trying to get the number of tickets in every hour of the week (that is, hour 1 to hour 168) for each week. Hour 1 would start on Monday at 00.00, and hour 168 would be Sunday 23.00-23.59. This would be repeated for each week. I want to use the Create.Date.Time data to calculate the hour of the week the ticket is in, say for:
2013-06-01 12:59:00 INCIDENT684790 - hour 133,
2013-06-03 08:08:00 SERVICE684843 - hour 9
I am then going to do averages for each hour and plot those. I am completely at a loss as to where to start. Could someone please point me to the right direction?
Before addressing the plotting aspect of your question, is this the format of data you are trying to get? This uses the package lubridate which you might have to install (install.packages("lubridate",dependencies=TRUE)).
library(lubridate)
##
Events <- paste(
sample(c("INCIDENT","SERVICE"),20000,replace=TRUE),
sample(600000:900000,20000)
)
t0 <- as.POSIXct(
"2013-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="America/New_York")
Dates <- sort(t0 + sample(0:(3600*24*365-1),20000))
Weeks <- week(Dates)
wDay <- wday(Dates,label=TRUE)
Hour <- hour(Dates)
##
hourShift <- function(time,wday){
hShift <- sapply(wday, function(X){
if(X=="Mon"){
0
} else if(X=="Tues"){
24*1
} else if(X=="Wed"){
24*2
} else if(X=="Thurs"){
24*3
} else if(X=="Fri"){
24*4
} else if(X=="Sat"){
24*5
} else {
24*6
}
})
##
tOut <- hour(time) + hShift + 1
return(tOut)
}
##
weekHour <- hourShift(time=Dates,wday=wDay)
##
Data <- data.frame(
Event=Events,
Timestamp=Dates,
Week=Weeks,
wDay=wDay,
dayHour=Hour,
weekHour=weekHour,
stringsAsFactors=FALSE)
##
This gives you:
> head(Data)
Event Timestamp Week wDay dayHour weekHour
1 SERVICE 783405 2013-01-01 00:13:55 1 Tues 0 25
2 INCIDENT 860015 2013-01-01 01:06:41 1 Tues 1 26
3 INCIDENT 808309 2013-01-01 01:10:05 1 Tues 1 26
4 INCIDENT 835509 2013-01-01 01:21:44 1 Tues 1 26
5 SERVICE 769239 2013-01-01 02:04:59 1 Tues 2 27
6 SERVICE 762269 2013-01-01 02:07:41 1 Tues 2 27
I am quite new to R, and I am trying to find a way to average continuous data into a specific period of time.
My data is a month recording of several parameters with 1s time steps
The table via read.csv has a date and time in one column and several other columns with values.
TimeStamp UTC Pitch Roll Heave(m)
05-02-13 6:45 0 0 0
05-02-13 6:46 0.75 -0.34 0.01
05-02-13 6:47 0.81 -0.32 0
05-02-13 6:48 0.79 -0.37 0
05-02-13 6:49 0.73 -0.08 -0.02
So I want to average the data in specific intervals: 20 min for example in a way that the average for hour 7:00, takes all the points from hour 6:41 to 7:00 and returns the average in this interval and so on for the entire dataset.
The time interval will look like this :
TimeStamp
05-02-13 19:00 462
05-02-13 19:20 332
05-02-13 19:40 15
05-02-13 20:00 10
05-02-13 20:20 42
Here is a reproducible dataset similar to your own.
meteorological <- data.frame(
TimeStamp = rep.int("05-02-13", 1440),
UTC = paste(
rep(formatC(0:23, width = 2, flag = "0"), each = 60),
rep(formatC(0:59, width = 2, flag = "0"), times = 24),
sep = ":"
),
Pitch = runif(1440),
Roll = rnorm(1440),
Heave = rnorm(1440)
)
The first thing that you need to do is to combine the first two columns to create a single (POSIXct) date-time column.
library(lubridate)
meteorological$DateTime <- with(
meteorological,
dmy_hm(paste(TimeStamp, UTC))
)
Then set up a sequence of break points for your different time groupings.
breaks <- seq(ymd("2013-02-05"), ymd("2013-02-06"), "20 mins")
Finally, you can calculate the summary statistics for each group. There are many ways to do this. ddply from the plyr package is a good choice.
library(plyr)
ddply(
meteorological,
.(cut(DateTime, breaks)),
summarise,
MeanPitch = mean(Pitch),
MeanRoll = mean(Roll),
MeanHeave = mean(Heave)
)
Please see if something simple like this works for you:
myseq <- data.frame(time=seq(ISOdate(2014,1,1,12,0,0), ISOdate(2014,1,1,13,0,0), "5 min"))
myseq$cltime <- cut(myseq$time, "20 min", labels = F)
> myseq
time cltime
1 2014-01-01 12:00:00 1
2 2014-01-01 12:05:00 1
3 2014-01-01 12:10:00 1
4 2014-01-01 12:15:00 1
5 2014-01-01 12:20:00 2
6 2014-01-01 12:25:00 2
7 2014-01-01 12:30:00 2
8 2014-01-01 12:35:00 2
9 2014-01-01 12:40:00 3
10 2014-01-01 12:45:00 3
11 2014-01-01 12:50:00 3
12 2014-01-01 12:55:00 3
13 2014-01-01 13:00:00 4