I am interested in calculating averages over specific time periods in a time series data set.
Given a time series like this:
dtm=as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240)
Count<-c(1:240)
DF<-data.frame(dtm,Count)
In the past I have been able to calculate daily averages with
DF$Day<-cut(DF$dtm,breaks="day")
Day_Avg<-aggregate(DF$Count~Day,DF,mean)
But now I am trying to cut up the day into specific time periods and I'm not sure how to set my "breaks".
As opposed to a daily average from 0:00 to 24:00, how, for example, could I get a noon-to-noon average?
Or, fancier, how could I set up a noon-to-noon average excluding the night hours of 7PM to 6AM (or, conversely, including only the daylight hours of 6AM-7PM)?
xts is a great package for time series analysis:
library(xts)
originalTZ <- Sys.getenv("TZ")
Sys.setenv(TZ = "GMT")
data.xts <- as.xts(1:240, as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240))
head(data.xts)
## [,1]
## 2007-03-27 06:00:00 1
## 2007-03-27 07:00:00 2
## 2007-03-27 08:00:00 3
## 2007-03-27 09:00:00 4
## 2007-03-27 10:00:00 5
## 2007-03-27 11:00:00 6
# You can filter data using ISO-style subsetting
data.xts.filtered <- data.xts["T06:00/T19:00"]
# You can use builtin functions to apply any function FUN on daily data.
apply.daily(data.xts.filtered, mean)
## [,1]
## 2007-03-27 18:00:00 7.5
## 2007-03-28 18:00:00 31.5
## 2007-03-29 18:00:00 55.5
## 2007-03-30 18:00:00 79.5
## 2007-03-31 18:00:00 103.5
## 2007-04-01 18:00:00 127.5
## 2007-04-02 18:00:00 151.5
## 2007-04-03 18:00:00 175.5
## 2007-04-04 18:00:00 199.5
## 2007-04-05 18:00:00 223.5
# OR
# now let's say you want to find noon to noon average.
period.apply(data.xts, c(0, which(.indexhour(data.xts) == 11)), FUN = mean)
## [,1]
## 2007-03-27 11:00:00 3.5
## 2007-03-28 11:00:00 18.5
## 2007-03-29 11:00:00 42.5
## 2007-03-30 11:00:00 66.5
## 2007-03-31 11:00:00 90.5
## 2007-04-01 11:00:00 114.5
## 2007-04-02 11:00:00 138.5
## 2007-04-03 11:00:00 162.5
## 2007-04-04 11:00:00 186.5
## 2007-04-05 11:00:00 210.5
# now if you want to exclude time from 7 PM to 6 AM
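# (match on timestamps rather than on values, so duplicated values cannot be dropped by accident)
data.xts.filtered <- data.xts[!index(data.xts) %in% index(data.xts["T20:00/T05:00"])]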
head(data.xts.filtered, 20)
## [,1]
## 2007-03-27 06:00:00 1
## 2007-03-27 07:00:00 2
## 2007-03-27 08:00:00 3
## 2007-03-27 09:00:00 4
## 2007-03-27 10:00:00 5
## 2007-03-27 11:00:00 6
## 2007-03-27 12:00:00 7
## 2007-03-27 13:00:00 8
## 2007-03-27 14:00:00 9
## 2007-03-27 15:00:00 10
## 2007-03-27 16:00:00 11
## 2007-03-27 17:00:00 12
## 2007-03-27 18:00:00 13
## 2007-03-27 19:00:00 14
## 2007-03-28 06:00:00 25
## 2007-03-28 07:00:00 26
## 2007-03-28 08:00:00 27
## 2007-03-28 09:00:00 28
## 2007-03-28 10:00:00 29
## 2007-03-28 11:00:00 30
period.apply(data.xts.filtered, c(0, which(.indexhour(data.xts.filtered) == 11)), FUN = mean)
## [,1]
## 2007-03-27 11:00:00 3.50000
## 2007-03-28 11:00:00 17.78571
## 2007-03-29 11:00:00 41.78571
## 2007-03-30 11:00:00 65.78571
## 2007-03-31 11:00:00 89.78571
## 2007-04-01 11:00:00 113.78571
## 2007-04-02 11:00:00 137.78571
## 2007-04-03 11:00:00 161.78571
## 2007-04-04 11:00:00 185.78571
## 2007-04-05 11:00:00 209.78571
Sys.setenv(TZ = originalTZ)
Let me quickly repeat your code.
dtm <- as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240)
Count <- c(1:240)
DF<-data.frame(dtm,Count)
DF$Day<-cut(DF$dtm,breaks="day")
Day_Avg<-aggregate(DF$Count~Day,DF,mean)
If you offset each time by 12 hours in the function call, you can still use cut with breaks on "day". I will save the day that the noon to noon starts on, so I will subtract 12 hours.
# Get twelve hours in seconds
timeOffset <- 60*60*12
# Subtract the offset to get the start day of the noon to noon
DF$Noon_Start_Day <- cut((DF$dtm - timeOffset), breaks="day")
# Get the mean
NtN_Avg <- aggregate(DF$Count ~ Noon_Start_Day, DF, mean)
One way to exclude certain hours is to convert the dates to POSIXlt. Then you can access hour among other things.
# Indicate which times are good (use whatever boolean test is needed here);
# this keeps observations stamped 06:00 through 19:00
hr <- as.POSIXlt(DF$dtm)$hour
goodTimes <- hr >= 6 & hr <= 19
new_NtN_Avg <- aggregate(Count ~ Noon_Start_Day, data=subset(DF, goodTimes), mean)
I found some help at this question on stackoverflow: r-calculate-means-for-subset-of-a-group
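As a rough end-to-end sketch of the offset idea in base R: the snippet below rebuilds the question's data, labels every row with the day on which its noon-to-noon window starts, and averages with and without the night hours. The names `NtN_All` and `NtN_Daylight` are illustrative, and treating 06:00 through 19:00 inclusive as "daylight" is an assumption about what the question intends.

```r
# Hourly series from the question (GMT, values 1..240)
dtm <- as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240)
DF <- data.frame(dtm, Count = 1:240)

# Shift back 12 hours, then take the date: every row in one
# noon-to-noon window gets the date on which that window starts
DF$Noon_Start_Day <- as.Date(DF$dtm - 12 * 3600, tz = "GMT")

# Assumed daylight test: observations stamped 06:00 through 19:00
hr <- as.POSIXlt(DF$dtm, tz = "GMT")$hour
daylight <- hr >= 6 & hr <= 19

NtN_All <- aggregate(Count ~ Noon_Start_Day, data = DF, FUN = mean)
NtN_Daylight <- aggregate(Count ~ Noon_Start_Day, data = DF[daylight, ], FUN = mean)
```

Passing `tz = "GMT"` explicitly keeps the day boundaries independent of the session time zone.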
The noon-to-noon problem can easily be solved numerically. The key is that the start of a (GMT) day has a time_t value that is always divisible by 86400. This is specified by POSIX. For example, see: http://en.wikipedia.org/wiki/Unix_time
cuts <- unique(as.numeric(DF$dtm) %/% (86400/2)) * (86400/2) # half-days
cuts <- c(cuts, cuts[length(cuts)]+(86400/2)) # One more at the end
cuts <- as.POSIXct(cuts, tz="GMT", origin="1970-01-01") # Familiar format
DF$halfday <- cut(DF$dtm, cuts) # This is the cut you want.
Halfday_Avg <- aggregate(Count~halfday, data=DF, FUN=mean)
Halfday_Avg
## halfday Count
## 1 2007-03-27 00:00:00 3.5
## 2 2007-03-27 12:00:00 12.5
## 3 2007-03-28 00:00:00 24.5
## 4 2007-03-28 12:00:00 36.5
## 5 2007-03-29 00:00:00 48.5
## 6 2007-03-29 12:00:00 60.5
## 7 2007-03-30 00:00:00 72.5
## 8 2007-03-30 12:00:00 84.5
## 9 2007-03-31 00:00:00 96.5
## 10 2007-03-31 12:00:00 108.5
## 11 2007-04-01 00:00:00 120.5
## 12 2007-04-01 12:00:00 132.5
## 13 2007-04-02 00:00:00 144.5
## 14 2007-04-02 12:00:00 156.5
## 15 2007-04-03 00:00:00 168.5
## 16 2007-04-03 12:00:00 180.5
## 17 2007-04-04 00:00:00 192.5
## 18 2007-04-04 12:00:00 204.5
## 19 2007-04-05 00:00:00 216.5
## 20 2007-04-05 12:00:00 228.5
## 21 2007-04-06 00:00:00 237.5
The same approach extends to the rest of the problem. Here it is for the 6AM-7PM time range.
intraday <- as.numeric(DF$dtm) %% 86400
# Subset DF by the chosen range
New_Avg <- aggregate(Count~halfday, data=DF[intraday >= 6*3600 & intraday <= 19*3600,], FUN=mean)
New_Avg
## halfday Count
## 1 2007-03-27 00:00:00 3.5
## 2 2007-03-27 12:00:00 10.5
## 3 2007-03-28 00:00:00 27.5
## 4 2007-03-28 12:00:00 34.5
## 5 2007-03-29 00:00:00 51.5
## 6 2007-03-29 12:00:00 58.5
## 7 2007-03-30 00:00:00 75.5
## 8 2007-03-30 12:00:00 82.5
## 9 2007-03-31 00:00:00 99.5
## 10 2007-03-31 12:00:00 106.5
## 11 2007-04-01 00:00:00 123.5
## 12 2007-04-01 12:00:00 130.5
## 13 2007-04-02 00:00:00 147.5
## 14 2007-04-02 12:00:00 154.5
## 15 2007-04-03 00:00:00 171.5
## 16 2007-04-03 12:00:00 178.5
## 17 2007-04-04 00:00:00 195.5
## 18 2007-04-04 12:00:00 202.5
## 19 2007-04-05 00:00:00 219.5
## 20 2007-04-05 12:00:00 226.5
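The bins above are half-days (midnight to noon and noon to midnight). If a single noon-to-noon bin per day is wanted instead, the same integer arithmetic can probably be reused by shifting the clock back 12 hours before dividing. This is only a sketch; the names `noon_bin` and `Noon_Avg` are made up for illustration.

```r
dtm <- as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240)
DF <- data.frame(dtm, Count = 1:240)

secs <- as.numeric(DF$dtm)
# Subtract 12 h, floor to whole days, add the 12 h back:
# the result is the noon that opens each row's 24-hour window
bin_start <- ((secs - 43200) %/% 86400) * 86400 + 43200
DF$noon_bin <- as.POSIXct(bin_start, origin = "1970-01-01", tz = "GMT")

Noon_Avg <- aggregate(Count ~ noon_bin, data = DF, FUN = mean)
```

The first bin is partial (it only holds the six readings before the first noon), so its mean covers fewer hours than the others.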
I am trying to split rows in an Excel file based on day and time. The data come from a study in which participants wear a tracking watch. Each row of the data set starts when a participant puts on the watch (variable: 'Wear Time Start') and ends when they take it off (variable: 'Wear Time End').
I need to calculate how many hours each participant wore the device on each day (NOT for each time period in one row).
Data set before split:
ID WearStart WearEnd
1 01 2018-05-14 09:00:00 2018-05-14 20:00:00
2 01 2018-05-14 21:30:00 2018-05-15 02:00:00
3 01 2018-05-15 07:00:00 2018-05-16 22:30:00
4 01 2018-05-16 23:00:00 2018-05-16 23:40:00
5 01 2018-05-17 01:00:00 2018-05-19 15:00:00
6 02 ...
Some explanation about the data set before split: the data type of 'WearStart' and 'WearEnd' are POSIXlt.
Desired output after split:
ID WearStart WearEnd Interval
1 01 2018-05-14 09:00:00 2018-05-14 20:00:00 11
2 01 2018-05-14 21:30:00 2018-05-15 00:00:00 2.5
3 01 2018-05-15 00:00:00 2018-05-15 02:00:00 2
4 01 2018-05-15 07:00:00 2018-05-16 00:00:00 17
5 01 2018-05-16 00:00:00 2018-05-16 22:30:00 22.5
4 01 2018-05-16 23:00:00 2018-05-16 23:40:00 0.4
5 01 2018-05-17 01:00:00 2018-05-18 00:00:00 23
6 01 2018-05-18 00:00:00 2018-05-19 00:00:00 24
7 01 2018-05-19 00:00:00 2018-05-19 15:00:00 15
Then I need to accumulate hours based on day:
ID Wear_Day Total_Hours
1 01 2018-05-14 13.5
2 01 2018-05-15 19
3 01 2018-05-16 22.9
4 01 2018-05-17 23
5 01 2018-05-18 24
4 01 2018-05-19 15
So, I reworked the entire answer. Please, review the code. I am pretty sure this is what you want.
Short summary
The problem is that you need to split rows which start and end on different dates, and you need to do this recursively. So, I split the dataframe into a list of 1-row dataframes. For each one I check whether start and end are on the same day. If not, I make it a 2-row dataframe with the adjusted start and end times. This is then split up again into a list of 1-row dataframes, and so on.
In the end there is a nested list of 1-row dataframes where start and end is on the same day. And this list is then recursively bound together again.
# Load Packages ---------------------------------------------------------------------------------------------------
library(tidyverse)
library(lubridate)
df <- tribble(
~ID, ~WearStart, ~WearEnd
, 01, "2018-05-14 09:00:00", "2018-05-14 20:00:00"
, 01, "2018-05-14 21:30:00", "2018-05-15 02:00:00"
, 01, "2018-05-15 07:00:00", "2018-05-16 22:30:00"
, 01, "2018-05-16 23:00:00", "2018-05-16 23:40:00"
, 01, "2018-05-17 01:00:00", "2018-05-19 15:00:00"
)
df <- df %>% mutate_at(vars(starts_with("Wear")), ymd_hms)
# Helper Functions ------------------------------------------------------------------------------------------------
endsOnOtherDay <- function(df){
as_date(df$WearStart) != as_date(df$WearEnd)
}
split1rowInto2Days <- function(df){
df1 <- df
df2 <- df
df1$WearEnd <- as_date(df1$WearStart) + days(1) - milliseconds(1)
df2$WearStart <- as_date(df2$WearStart) + days(1)
rbind(df1, df2)
}
splitDates <- function(df){
if (nrow(df) > 1){
return(df %>%
split(f = 1:nrow(df)) %>%
lapply(splitDates) %>%
reduce(rbind))
}
if (df %>% endsOnOtherDay()){
return(df %>%
split1rowInto2Days() %>%
splitDates())
}
df
}
# The actual Calculation ------------------------------------------------------------------------------------------
df %>%
splitDates() %>%
mutate(wearDuration = difftime(WearEnd, WearStart, units = "hours")
, wearDay = as_date(WearStart)) %>%
group_by(ID, wearDay) %>%
summarise(wearDuration_perDay = sum(wearDuration))
ID wearDay wearDuration_perDay
<dbl> <date> <drtn>
1 1 2018-05-14 13.50000 hours
2 1 2018-05-15 19.00000 hours
3 1 2018-05-16 23.16667 hours
4 1 2018-05-17 23.00000 hours
5 1 2018-05-18 24.00000 hours
6 1 2018-05-19 15.00000 hours
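For comparison, here is a non-recursive sketch in base R that computes the overlap of each wear interval with every calendar day it touches. The helper `hours_per_day` is hypothetical (it is not part of the answer above), and everything is pinned to UTC for simplicity.

```r
wear <- data.frame(
  ID = 1,
  WearStart = as.POSIXct(c("2018-05-14 09:00:00", "2018-05-14 21:30:00",
                           "2018-05-15 07:00:00", "2018-05-16 23:00:00",
                           "2018-05-17 01:00:00"), tz = "UTC"),
  WearEnd = as.POSIXct(c("2018-05-14 20:00:00", "2018-05-15 02:00:00",
                         "2018-05-16 22:30:00", "2018-05-16 23:40:00",
                         "2018-05-19 15:00:00"), tz = "UTC")
)

# Hours of one wear interval that fall on each calendar day it touches
hours_per_day <- function(id, start, end) {
  days <- seq(as.Date(start, tz = "UTC"), as.Date(end, tz = "UTC"), by = "day")
  day_start <- as.POSIXct(format(days), tz = "UTC")  # midnight opening each day
  day_end <- day_start + 86400                       # midnight closing each day
  overlap <- pmin(as.numeric(end), as.numeric(day_end)) -
    pmax(as.numeric(start), as.numeric(day_start))
  data.frame(ID = id, wearDay = days, hours = overlap / 3600)
}

pieces <- lapply(seq_len(nrow(wear)), function(i) {
  hours_per_day(wear$ID[i], wear$WearStart[i], wear$WearEnd[i])
})
daily <- aggregate(hours ~ ID + wearDay, data = do.call(rbind, pieces), FUN = sum)
```

Because the day boundaries are exact midnights, no millisecond adjustment is needed, and an interval ending exactly at midnight simply contributes zero hours to the following day.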
Here is my solution to your question with just using basic functions in R:
#step 1: read data from file
d <- read.csv("dt.csv", header = TRUE)
d
ID WearStart WearEnd
1 1 2018-05-14 09:00:00 2018-05-14 20:00:00
2 1 2018-05-14 21:30:00 2018-05-15 02:00:00
3 1 2018-05-15 07:00:00 2018-05-16 22:30:00
4 1 2018-05-16 23:00:00 2018-05-16 23:40:00
5 1 2018-05-17 01:00:00 2018-05-19 15:00:00
6 2 2018-05-16 11:30:00 2018-05-16 11:40:00
7 2 2018-05-16 22:05:00 2018-05-22 22:42:00
#step 2: change class of WearStart and WearEnd to POSIXlt
d$WearStart <- as.POSIXlt(d$WearStart, tryFormats = "%Y-%m-%d %H:%M")
d$WearEnd <- as.POSIXlt(d$WearEnd, tryFormats = "%Y-%m-%d %H:%M")
#step 3: calculate time interval (days and hours) for each record
timeInt <- function(d) {
Days <- seq(as.Date(d$WearStart), as.Date(d$WearEnd), by = "day") # every calendar day the record touches
N_FullBTWDays <- length(Days) - 2
if (N_FullBTWDays >= 0) {
sd <- d$WearStart
sd_h <- 24 - sd$hour -1
sd_m <- (60 - sd$min)/60
sd_total <- sd_h + sd_m
hours <- sd_total
hours <- c(hours, rep(24,N_FullBTWDays))
ed <- d$WearEnd
ed_h <- ed$hour
ed_m <- ed$min/60
ed_total <- ed_h + ed_m
hours <- c(hours,ed_total)
} else {
hours <- as.numeric(difftime(d$WearEnd,d$WearStart, units = "hours"))
}
df <- data.frame(id = rep(d$ID, length(Days)), days = Days, hours = hours)
return(df)
}
df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df) <- c("id", "days", "hours")
for ( i in 1:nrow(d)) {
df <- rbind(df,timeInt(d[i,]))
}
id days hours
1 1 2018-05-14 11.0000000
2 1 2018-05-14 2.5000000
3 1 2018-05-15 2.0000000
4 1 2018-05-15 17.0000000
5 1 2018-05-16 22.5000000
6 1 2018-05-16 0.6666667
7 1 2018-05-17 23.0000000
8 1 2018-05-18 24.0000000
9 1 2018-05-19 15.0000000
10 2 2018-05-16 0.1666667
11 2 2018-05-16 1.9166667
12 2 2018-05-17 24.0000000
13 2 2018-05-18 24.0000000
14 2 2018-05-19 24.0000000
15 2 2018-05-20 24.0000000
16 2 2018-05-21 24.0000000
17 2 2018-05-22 22.7000000
#daily usage of device for each customer
res <- as.data.frame(tapply(df$hours, list(df$days,df$id), sum))
res[is.na(res)] <- 0
res$date <- rownames(res)
res
1 2 date
2018-05-14 13.50000 0.000000 2018-05-14
2018-05-15 19.00000 0.000000 2018-05-15
2018-05-16 23.16667 2.083333 2018-05-16
2018-05-17 23.00000 24.000000 2018-05-17
2018-05-18 24.00000 24.000000 2018-05-18
2018-05-19 15.00000 24.000000 2018-05-19
2018-05-20 0.00000 24.000000 2018-05-20
2018-05-21 0.00000 24.000000 2018-05-21
2018-05-22 0.00000 22.700000 2018-05-22
How can one split the following datetime into year-month-day-hour-minute-second? The date was created using:
datetime = seq.POSIXt(as.POSIXct("2015-04-01 0:00:00", tz = 'GMT'),
as.POSIXct("2015-11-30 23:59:59", tz = 'GMT'),
by="hour",tz="GMT")
The ultimate goal is to aggregate x which is at hourly resolution into 6-hourly resolution. Probably it is possible to aggregate datetime without needing to split it?
datetime x
1 2015-04-01 00:00:00 0.0
2 2015-04-01 01:00:00 0.0
3 2015-04-01 02:00:00 0.0
4 2015-04-01 03:00:00 0.0
5 2015-04-01 04:00:00 0.0
6 2015-04-01 05:00:00 0.0
7 2015-04-01 06:00:00 0.0
8 2015-04-01 07:00:00 0.0
9 2015-04-01 08:00:00 0.0
10 2015-04-01 09:00:00 0.0
11 2015-04-01 10:00:00 0.0
12 2015-04-01 11:00:00 0.0
13 2015-04-01 12:00:00 0.0
14 2015-04-01 13:00:00 0.0
15 2015-04-01 14:00:00 0.0
16 2015-04-01 15:00:00 0.0
17 2015-04-01 16:00:00 0.0
18 2015-04-01 17:00:00 0.0
19 2015-04-01 18:00:00 0.0
20 2015-04-01 19:00:00 0.0
21 2015-04-01 20:00:00 0.0
22 2015-04-01 21:00:00 0.0
23 2015-04-01 22:00:00 1.6
24 2015-04-01 23:00:00 0.2
25 2015-04-02 00:00:00 1.5
26 2015-04-02 01:00:00 1.5
27 2015-04-02 02:00:00 0.5
28 2015-04-02 03:00:00 0.0
29 2015-04-02 04:00:00 0.0
30 2015-04-02 05:00:00 0.0
31 2015-04-02 06:00:00 0.0
32 2015-04-02 07:00:00 0.5
33 2015-04-02 08:00:00 0.3
34 2015-04-02 09:00:00 0.0
35 2015-04-02 10:00:00 0.0
36 2015-04-02 11:00:00 0.0
37 2015-04-02 12:00:00 0.0
38 2015-04-02 13:00:00 0.0
39 2015-04-02 14:00:00 0.0
40 2015-04-02 15:00:00 0.0
41 2015-04-02 16:00:00 0.0
42 2015-04-02 17:00:00 0.0
43 2015-04-02 18:00:00 0.0
44 2015-04-02 19:00:00 0.0
45 2015-04-02 20:00:00 0.0
46 2015-04-02 21:00:00 0.0
47 2015-04-02 22:00:00 0.0
48 2015-04-02 23:00:00 0.0
....
The output should be very close to:
YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss
2015-04-01 00:00:00 2015-04-01 06:00:00 2015-04-01 12:00:00 2015-04-01 18:00:00
2015-04-02 00:00:00 2015-04-02 06:00:00 2015-04-02 12:00:00 2015-04-02 18:00:00
.....
I appreciate your thoughts on this.
EDIT
How can I apply @r2evans's answer to a list object such as:
x = runif(5856)
flst1=list(x,x,x,x)
flst1=lapply(flst1, function(x){x$datetime <- as.POSIXct(x$datetime, tz = "GMT"); x})
sixhours1=lapply(flst1, function(x) {x$bin <- cut(x$datetime,sixhours);x})
head(sixhours1[[1]],n=7)
ret=lapply(sixhours1, function(x) aggregate(x$precip, list(x$bin), sum,na.rm=T))
head(ret[[1]],n=20)
Your minimal data is incomplete, so I'll generate something random:
dat <- data.frame(datetime = seq.POSIXt(as.POSIXct("2015-04-01 0:00:00", tz = "GMT"),
as.POSIXct("2015-11-30 23:59:59", tz = "GMT"),
by = "hour",tz = "GMT"),
x = runif(5856))
# the "1+" ensures we extend at least to the end of the datetimes;
# without it, the last several rows in "bin" would be NA
sixhours <- seq.POSIXt(as.POSIXct("2015-04-01 0:00:00", tz = "GMT"),
1 + as.POSIXct("2015-11-30 23:59:59", tz = "GMT"),
by = "6 hours",tz = "GMT")
# this doesn't have to go into the data.frame (could be a separate
# vector), but I'm including it for easy row-wise comparison
dat$bin <- cut(dat$datetime, sixhours)
head(dat, n=7)
# datetime x bin
# 1 2015-04-01 00:00:00 0.91022534 2015-04-01 00:00:00
# 2 2015-04-01 01:00:00 0.02638850 2015-04-01 00:00:00
# 3 2015-04-01 02:00:00 0.42486354 2015-04-01 00:00:00
# 4 2015-04-01 03:00:00 0.90722845 2015-04-01 00:00:00
# 5 2015-04-01 04:00:00 0.24540085 2015-04-01 00:00:00
# 6 2015-04-01 05:00:00 0.60360906 2015-04-01 00:00:00
# 7 2015-04-01 06:00:00 0.01843313 2015-04-01 06:00:00
tail(dat)
# datetime x bin
# 5851 2015-11-30 18:00:00 0.5963204 2015-11-30 18:00:00
# 5852 2015-11-30 19:00:00 0.2503440 2015-11-30 18:00:00
# 5853 2015-11-30 20:00:00 0.9600476 2015-11-30 18:00:00
# 5854 2015-11-30 21:00:00 0.6837394 2015-11-30 18:00:00
# 5855 2015-11-30 22:00:00 0.9093506 2015-11-30 18:00:00
# 5856 2015-11-30 23:00:00 0.9197769 2015-11-30 18:00:00
nrow(dat)
# [1] 5856
The work:
ret <- aggregate(dat$x, list(dat$bin), mean)
nrow(ret)
# [1] 976
head(ret)
# Group.1 x
# 1 2015-04-01 00:00:00 0.5196193
# 2 2015-04-01 06:00:00 0.4770019
# 3 2015-04-01 12:00:00 0.5359483
# 4 2015-04-01 18:00:00 0.8140603
# 5 2015-04-02 00:00:00 0.4874332
# 6 2015-04-02 06:00:00 0.6139554
tail(ret)
# Group.1 x
# 971 2015-11-29 12:00:00 0.6881228
# 972 2015-11-29 18:00:00 0.4791925
# 973 2015-11-30 00:00:00 0.5793872
# 974 2015-11-30 06:00:00 0.4809868
# 975 2015-11-30 12:00:00 0.5157432
# 976 2015-11-30 18:00:00 0.7199298
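To apply the same binning to a list of data frames, as asked in the question's EDIT, something along these lines might work. It is a sketch under assumptions: the two-element list `stations` is made-up stand-in data, and the column names `datetime` and `precip` are taken from the EDIT rather than from real files.

```r
set.seed(42)
datetime <- seq(as.POSIXct("2015-04-01 00:00:00", tz = "GMT"),
                as.POSIXct("2015-04-02 23:00:00", tz = "GMT"), by = "hour")

# Hypothetical stand-in for the list of station tables
stations <- list(
  data.frame(datetime, precip = runif(length(datetime))),
  data.frame(datetime, precip = runif(length(datetime)))
)

# Breaks every 6 hours, extended past the last timestamp so no row falls outside
sixhours <- seq(min(datetime), max(datetime) + 6 * 3600, by = "6 hours")

six_hourly <- lapply(stations, function(d) {
  d$bin <- cut(d$datetime, sixhours)  # each label is the start of its 6-hour bin
  aggregate(precip ~ bin, data = d, FUN = sum, na.rm = TRUE)
})
```

Each element of `six_hourly` is then a data frame with one row per 6-hour bin, labelled by the bin's start time.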
I got a solution using:
library(xts)
flst<- list.files(pattern=".csv")
flst1<- lapply(flst,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE,
dec = ".",quote = "\"",colClasses=c('factor', 'numeric', 'NULL'))) # read each file, skipping the third column
head(flst1[[1]])
dat.xts=lapply(flst1, function(x) xts(x$precip,as.POSIXct(x$datetime)))
head(dat.xts[[1]])
ep.xts=lapply(dat.xts, function(x) endpoints(x, on="hours", k=6))#k=by .... see endpoints for "on"
head(ep.xts[[1]])
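# pair each series with its own endpoints (the original referenced an undefined object `ep`)
stations6hrly <- Map(function(x, ep) period.apply(x, INDEX = ep, FUN = sum), dat.xts, ep.xts)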
head(stations6hrly[[703]])
[,1]
2015-04-01 05:00:00 0.3
2015-04-01 11:00:00 1.2
2015-04-01 17:00:00 0.0
2015-04-01 23:00:00 0.2
2015-04-02 05:00:00 0.0
2015-04-02 11:00:00 1.4
The timestamps are not what I wanted (each one marks the end of its 6-hour block rather than the start), but the values are correct. I doubt there is a shifttime-style function in R like the one in CDO.
My dataset is a bit noisy at 1-minute intervals, so I'd like to compute, for every hour, the average of the values from minute 25 to minute 35 and let it stand for that hour at minute 30.
For example, an average at 00:30 (from 00:25 to 00:35), 01:30 (from 01:25 to 01:35), 02:30 (from 02:25 to 02:35), etc.
Can you suggest a good way to do this in R?
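Relabelling can likely be done with plain date-time arithmetic. The sketch below shifts a few period-end stamps, copied from the output above, back by five hours so that each marks the start of its block. The variable names are illustrative only.

```r
# Period-end labels as printed above (last hourly stamp of each 6-hour block)
end_labels <- as.POSIXct(c("2015-04-01 05:00:00", "2015-04-01 11:00:00",
                           "2015-04-01 17:00:00", "2015-04-01 23:00:00"),
                         tz = "GMT")

# Move each label back 5 hours so it marks the start of its block
start_labels <- end_labels - 5 * 3600
format(start_labels, "%Y-%m-%d %H:%M:%S")
```

For an xts object the same subtraction should be applicable to its index, e.g. `index(x) <- index(x) - 5 * 3600`, though that line is untested here.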
Here is my dataset:
set.seed(1)
DateTime <- seq(as.POSIXct("2010/1/1 00:00"), as.POSIXct("2010/1/5 00:00"), "min")
value <- rnorm(n=length(DateTime), mean=100, sd=1)
df <- data.frame(DateTime, value)
Thanks a lot.
Here's one way
library(dplyr)
df %>%
filter(between(as.numeric(format(DateTime, "%M")), 25, 35)) %>%
group_by(hour=format(DateTime, "%Y-%m-%d %H")) %>%
summarise(value=mean(value))
I think that the existing answers are not general enough as they do not take into account that a time interval could fall within multiple midpoints.
I would instead use shift from the data.table package.
library(data.table)
setDT(df)
First set the interval argument based on the sequence you chose above. This calculates an average over the eleven rows (minutes) centered on every row in your table, i.e. the five before it, the row itself, and the five after it:
df[, ave_val :=
Reduce('+',c(shift(value, 0:5L, type = "lag"),shift(value, 1:5L, type = "lead")))/11
]
Then generate the midpoints you want:
mids <- seq(as.POSIXct("2010/1/1 00:00"), as.POSIXct("2010/1/5 00:00"), by = 60*60) + 30*60 # every hour starting at 0:30
Then filter accordingly:
setkey(df,DateTime)
df[J(mids)]
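The same centered window can be sketched without data.table by using `stats::filter` from base R. This is an alternative, not what the answer above uses; the `tz = "UTC"` argument is added here only to make the example independent of the session time zone.

```r
set.seed(1)
DateTime <- seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"),
                as.POSIXct("2010-01-05 00:00", tz = "UTC"), by = "min")
value <- rnorm(length(DateTime), mean = 100, sd = 1)

# Centered 11-point moving average: 5 minutes either side of each row
# (stats:: prefix avoids clashing with dplyr::filter if dplyr is attached)
ave_val <- as.numeric(stats::filter(value, rep(1 / 11, 11), sides = 2))

# Keep only the rows stamped at half past each hour
at_half <- format(DateTime, "%M") == "30"
hourly <- data.frame(DateTime = DateTime[at_half], value = ave_val[at_half])
```

The first and last five rows of `ave_val` are `NA` because the window runs off the ends of the series, but none of the half-hour marks are affected.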
Since you want to average on just a subset of each period, I think it makes sense to first subset the data.frame, then aggregate:
aggregate(
value~cbind(time=strftime(DateTime,'%Y-%m-%d %H:30:00')),
subset(df,{ m <- strftime(DateTime,'%M'); m>='25' & m<='35'; }),
mean
);
## time value
## 1 2010-01-01 00:30:00 99.82317
## 2 2010-01-01 01:30:00 100.58184
## 3 2010-01-01 02:30:00 99.54985
## 4 2010-01-01 03:30:00 100.47238
## 5 2010-01-01 04:30:00 100.05517
## 6 2010-01-01 05:30:00 99.96252
## 7 2010-01-01 06:30:00 99.79512
## 8 2010-01-01 07:30:00 99.06791
## 9 2010-01-01 08:30:00 99.58731
## 10 2010-01-01 09:30:00 100.27202
## 11 2010-01-01 10:30:00 99.60758
## 12 2010-01-01 11:30:00 99.92074
## 13 2010-01-01 12:30:00 99.65819
## 14 2010-01-01 13:30:00 100.04202
## 15 2010-01-01 14:30:00 100.04461
## 16 2010-01-01 15:30:00 100.11609
## 17 2010-01-01 16:30:00 100.08631
## 18 2010-01-01 17:30:00 100.41956
## 19 2010-01-01 18:30:00 99.98065
## 20 2010-01-01 19:30:00 100.07341
## 21 2010-01-01 20:30:00 100.20281
## 22 2010-01-01 21:30:00 100.86013
## 23 2010-01-01 22:30:00 99.68170
## 24 2010-01-01 23:30:00 99.68097
## 25 2010-01-02 00:30:00 99.58603
## 26 2010-01-02 01:30:00 100.10178
## 27 2010-01-02 02:30:00 99.78766
## 28 2010-01-02 03:30:00 100.02220
## 29 2010-01-02 04:30:00 99.83427
## 30 2010-01-02 05:30:00 99.74934
## 31 2010-01-02 06:30:00 99.99594
## 32 2010-01-02 07:30:00 100.08257
## 33 2010-01-02 08:30:00 99.47077
## 34 2010-01-02 09:30:00 99.81419
## 35 2010-01-02 10:30:00 100.13294
## 36 2010-01-02 11:30:00 99.78352
## 37 2010-01-02 12:30:00 100.04590
## 38 2010-01-02 13:30:00 99.91061
## 39 2010-01-02 14:30:00 100.61730
## 40 2010-01-02 15:30:00 100.18539
## 41 2010-01-02 16:30:00 99.45165
## 42 2010-01-02 17:30:00 100.09894
## 43 2010-01-02 18:30:00 100.04131
## 44 2010-01-02 19:30:00 99.58399
## 45 2010-01-02 20:30:00 99.75524
## 46 2010-01-02 21:30:00 99.94079
## 47 2010-01-02 22:30:00 100.26533
## 48 2010-01-02 23:30:00 100.35354
## 49 2010-01-03 00:30:00 100.31141
## 50 2010-01-03 01:30:00 100.10709
## 51 2010-01-03 02:30:00 99.41102
## 52 2010-01-03 03:30:00 100.07964
## 53 2010-01-03 04:30:00 99.88183
## 54 2010-01-03 05:30:00 99.91112
## 55 2010-01-03 06:30:00 99.71431
## 56 2010-01-03 07:30:00 100.48585
## 57 2010-01-03 08:30:00 100.35096
## 58 2010-01-03 09:30:00 100.00060
## 59 2010-01-03 10:30:00 100.03858
## 60 2010-01-03 11:30:00 99.95713
## 61 2010-01-03 12:30:00 99.18699
## 62 2010-01-03 13:30:00 99.49216
## 63 2010-01-03 14:30:00 99.37762
## 64 2010-01-03 15:30:00 99.68642
## 65 2010-01-03 16:30:00 99.84921
## 66 2010-01-03 17:30:00 99.84039
## 67 2010-01-03 18:30:00 99.90989
## 68 2010-01-03 19:30:00 99.95421
## 69 2010-01-03 20:30:00 100.01276
## 70 2010-01-03 21:30:00 100.14585
## 71 2010-01-03 22:30:00 99.54110
## 72 2010-01-03 23:30:00 100.02526
## 73 2010-01-04 00:30:00 100.04476
## 74 2010-01-04 01:30:00 99.61132
## 75 2010-01-04 02:30:00 99.94782
## 76 2010-01-04 03:30:00 99.44863
## 77 2010-01-04 04:30:00 99.91305
## 78 2010-01-04 05:30:00 100.25428
## 79 2010-01-04 06:30:00 99.86279
## 80 2010-01-04 07:30:00 99.63516
## 81 2010-01-04 08:30:00 99.65747
## 82 2010-01-04 09:30:00 99.57810
## 83 2010-01-04 10:30:00 99.77603
## 84 2010-01-04 11:30:00 99.85140
## 85 2010-01-04 12:30:00 100.82995
## 86 2010-01-04 13:30:00 100.26138
## 87 2010-01-04 14:30:00 100.25851
## 88 2010-01-04 15:30:00 99.92685
## 89 2010-01-04 16:30:00 100.00825
## 90 2010-01-04 17:30:00 100.24437
## 91 2010-01-04 18:30:00 99.62711
## 92 2010-01-04 19:30:00 99.93999
## 93 2010-01-04 20:30:00 99.82477
## 94 2010-01-04 21:30:00 100.15321
## 95 2010-01-04 22:30:00 99.88370
## 96 2010-01-04 23:30:00 100.06657
I have some observed data by hour. I am trying to subset this data by the day or even week intervals. I am not sure how to proceed with this task in R.
The sample of the data is below.
date obs
2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11
First I entered the data with the multiple spaces replaced with tabs.
dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Notice these are not the same functions
#---------------------
$`296`
date obs
1 2011-10-24 01:00:00 12
2 2011-10-24 02:00:00 4
3 2011-10-24 19:00:00 18
4 2011-10-24 20:00:00 7
5 2011-10-24 21:00:00 4
6 2011-10-24 22:00:00 2
$`297`
date obs
7 2011-10-25 00:00:00 4
8 2011-10-25 01:00:00 2
9 2011-10-25 02:00:00 2
10 2011-10-25 15:00:00 12
11 2011-10-25 18:00:00 2
12 2011-10-25 19:00:00 3
13 2011-10-25 21:00:00 2
14 2011-10-25 23:00:00 9
$`298`
date obs
15 2011-10-26 00:00:00 13
16 2011-10-26 01:00:00 11
The POSIXlt class does not work well inside data frames, but it can be very handy for creating time-based groups. It's a list structure with these components: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdst'. The cut.POSIXt function adds divisions at other natural boundaries, e.g.:
?cut.POSIXt
split(dat , cut(dat$date, "week") )
If you wanted to sum within date:
tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298
47 36 24
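One caveat with `yday` is that it repeats every year, so data spanning several years would be merged into the same groups. A sketch that groups on the calendar date instead is shown below; it uses a shortened, hand-typed subset of the sample data, with UTC assumed.

```r
dat <- data.frame(
  date = as.POSIXct(c("2011-10-24 01:00:00", "2011-10-24 19:00:00",
                      "2011-10-25 00:00:00", "2011-10-25 23:00:00",
                      "2011-10-26 00:00:00", "2011-10-26 01:00:00"), tz = "UTC"),
  obs = c(12, 18, 4, 9, 13, 11)
)

day <- as.Date(dat$date, tz = "UTC")  # unique per calendar day, across years
by_day <- split(dat, day)             # list with one data frame per day
daily_sum <- tapply(dat$obs, day, sum)

week <- cut(dat$date, "week")         # weeks start on Monday by default
weekly_sum <- tapply(dat$obs, week, sum)
```

All six rows fall in the week beginning Monday 2011-10-24, so `weekly_sum` has a single entry here.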
I'd use a time series class such as xts
dat <- read.table(text="2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11", header=FALSE, stringsAsFactors=FALSE)
library(xts)
xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))
xts subsetting is very intuitive. For all data on "2011-10-25", do this
xobj["2011-10-25"]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)
xobj["2011-10-24/2011-10-25"]
Or, if you want all data from October 2011,
xobj["2011-10"]
If you want to get all data from any day that is between 19:00 and 20:00,
xobj['T19:00:00/T20:00:00']
# [,1]
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-25 19:00:00 3
You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)
endpoints(xobj, "days")
[1] 0 6 14 16
Or you can convert to a lower frequency
to.weekly(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26 12 18 2 11
to.daily(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25 12 18 2 2
#2011-10-26 4 12 2 9
#2011-10-26 13 13 11 11
Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE
to.daily(xobj, OHLC=FALSE)
# [,1]
#2011-10-25 2
#2011-10-26 9
#2011-10-26 11
For more basic subsetting, and much more, visit http://www.quantmod.com/examples/
As @JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.
You can split by day (or week, or month, etc), apply a function, then recombine
split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
# [,1]
#2011-10-24 01:00:00 12
#2011-10-24 02:00:00 4
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-24 21:00:00 4
#2011-10-24 22:00:00 2
#
#[[2]]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
#
#[[3]]
# [,1]
#2011-10-26 00:00:00 13
#2011-10-26 01:00:00 11
Suppose you want only the first value of each day. split by day, lapply the first function and rbind back together.
do.call(rbind, lapply(split(xobj, 'days'), first))
# [,1]
#2011-10-24 01:00:00 12
#2011-10-25 00:00:00 4
#2011-10-26 00:00:00 13
I have hourly rainfall and temperature data for a long period, and I would like to get daily values from the hourly data. I define a day as running from 07:00:00 to 07:00:00 the next day.
Could you tell me how to convert hourly data to daily data over a specific time interval
(for example, 07:00:00 to 07:00:00 or 12:00:00 to 12:00:00)?
Rainfall data looks like:
1970-01-05 00:00:00 1.0
1970-01-05 01:00:00 1.0
1970-01-05 02:00:00 1.0
1970-01-05 03:00:00 1.0
1970-01-05 04:00:00 1.0
1970-01-05 05:00:00 3.6
1970-01-05 06:00:00 3.6
1970-01-05 07:00:00 2.2
1970-01-05 08:00:00 2.2
1970-01-05 09:00:00 2.2
1970-01-05 10:00:00 2.2
1970-01-05 11:00:00 2.2
1970-01-05 12:00:00 2.2
1970-01-05 13:00:00 2.2
1970-01-05 14:00:00 2.2
1970-01-05 15:00:00 2.2
1970-01-05 16:00:00 0.0
1970-01-05 17:00:00 0.0
1970-01-05 18:00:00 0.0
1970-01-05 19:00:00 0.0
1970-01-05 20:00:00 0.0
1970-01-05 21:00:00 0.0
1970-01-05 22:00:00 0.0
1970-01-05 23:00:00 0.0
1970-01-06 00:00:00 0.0
First, create some reproducible data so we can help you better:
require(xts)
set.seed(1)
X = data.frame(When = seq(from = ISOdatetime(2012, 01, 01, 00, 00, 00),
length.out = 100, by="1 hour"),
Measurements = sample(1:20, 100, replace=TRUE))
We now have a data frame with 100 hourly observations where the dates start at 2012-01-01 00:00:00 and end at 2012-01-05 03:00:00 (time is in 24-hour format).
Second, convert it to an XTS object.
X2 = xts(X$Measurements, order.by=X$When)
Third, learn how to subset a specific time window.
X2['T04:00/T08:00']
# [,1]
# 2012-01-01 04:00:00 5
# 2012-01-01 05:00:00 18
# 2012-01-01 06:00:00 19
# 2012-01-01 07:00:00 14
# 2012-01-01 08:00:00 13
# 2012-01-02 04:00:00 18
# 2012-01-02 05:00:00 7
# 2012-01-02 06:00:00 10
# 2012-01-02 07:00:00 12
# 2012-01-02 08:00:00 10
# 2012-01-03 04:00:00 9
# 2012-01-03 05:00:00 5
# 2012-01-03 06:00:00 2
# 2012-01-03 07:00:00 2
# 2012-01-03 08:00:00 7
# 2012-01-04 04:00:00 18
# 2012-01-04 05:00:00 8
# 2012-01-04 06:00:00 16
# 2012-01-04 07:00:00 20
# 2012-01-04 08:00:00 9
Fourth, use that information with apply.daily and whatever function you want, as follows:
apply.daily(X2['T04:00/T08:00'], mean)
# [,1]
# 2012-01-01 08:00:00 13.8
# 2012-01-02 08:00:00 11.4
# 2012-01-03 08:00:00 5.0
# 2012-01-04 08:00:00 14.2
Update: Custom endpoints
After re-reading your question, I see that I misinterpreted what you wanted.
It seems that you want to take the mean of a 24 hour period, not necessarily from midnight to midnight.
For this, you should ditch apply.daily and instead, use period.apply with custom endpoints, like this:
# You want to start at 7AM. Find out which record is the first one at 7AM.
A = which(as.character(index(X2)) == "2012-01-01 07:00:00")
# Use that to create your endpoints.
# The ends of the endpoints should start at 0
# and end at the max number of records.
ep = c(0, seq(A, 100, by=24), 100)
period.apply(X2, INDEX=ep, FUN=function(x) mean(x))
# [,1]
# 2012-01-01 07:00:00 12.62500
# 2012-01-02 07:00:00 10.08333
# 2012-01-03 07:00:00 10.79167
# 2012-01-04 07:00:00 11.54167
# 2012-01-05 03:00:00 10.25000
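If the series length is not known in advance, the endpoints can probably be derived from the timestamps themselves instead of hard-coding 100. The sketch below does this in base R on a stand-in index. The resulting `ep` has the shape that `period.apply` expects, but the averaging here uses `tapply`, so xts is not required to try it.

```r
# Stand-in index: 100 hourly stamps starting at midnight (UTC assumed)
idx <- seq(as.POSIXct("2012-01-01 00:00:00", tz = "UTC"),
           by = "hour", length.out = 100)
x <- seq_along(idx)  # dummy measurements 1..100

# Endpoints: 0, every row stamped 07:00:00, and the last row
ep <- unique(c(0, which(format(idx, "%H:%M:%S") == "07:00:00"), length(idx)))

# Average each block (ep[k], ep[k + 1]] of row numbers
means <- tapply(x, cut(seq_along(x), breaks = ep), mean)
```

As in the answer above, each block ends on a 07:00 row, so the first and last blocks are shorter than 24 hours.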
You can use this code:
fun <- function(s,i,j) { sum(s[i:(i+j-1)]) }
sapply(X=seq(1,24*nb_of_days,24),FUN=fun,s=your_time_serie,j=24)
You just have to change 1 to another value to get a different time interval (assuming the series starts at 00:00:00): 8 for 07:00:00 to 07:00:00, or 13 for 12:00:00 to 12:00:00.
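A small runnable sketch of this index-based approach follows. It assumes a gap-free hourly series whose first value is stamped 00:00:00; the vector `hourly` is dummy data, and the start positions are capped so that no window runs past the end of the series.

```r
hourly <- 1:72  # three days of dummy hourly values, first one at 00:00
start <- 8      # position of the first 07:00 reading

# Window start positions, keeping only windows with a full 24 values
starts <- seq(start, length(hourly) - 24 + 1, by = 24)
sums <- sapply(starts, function(i) sum(hourly[i:(i + 23)]))
```

With 72 values and a 07:00 start there are two complete windows; the leading seven hours and the trailing partial window are left out.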
Step 1: transform date to POSIXct
ttt <- as.POSIXct("1970-01-05 08:00:00",tz="GMT")
ttt
#"1970-01-05 08:00:00 GMT"
Step 2: subtract a difftime of 7 hours
ttt <- ttt-as.difftime(7,units="hours")
ttt
#"1970-01-05 01:00:00 GMT"
Step 3: trunc to days
ttt<-trunc(ttt,"days")
ttt
#"1970-01-05 GMT"
Step 4: use plyr, data.table or whatever method you prefer, to calculate daily means
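Put together, the four steps might look like the following base R sketch. It uses `aggregate` rather than plyr or data.table, and the data frame `rain` is dummy data with one unit of rainfall per hour.

```r
rain <- data.frame(
  time = seq(as.POSIXct("1970-01-05 00:00:00", tz = "GMT"),
             by = "hour", length.out = 48),
  mm = 1
)

# Steps 2 and 3: shift back 7 hours, then drop the time of day, so that
# everything from 07:00 up to 06:59 the next morning shares one label
rain$day7 <- as.Date(rain$time - as.difftime(7, units = "hours"), tz = "GMT")

# Step 4: aggregate per shifted day (sum for rainfall; use mean for temperature)
daily <- aggregate(mm ~ day7, data = rain, FUN = sum)
```

The first and last groups are partial here because the dummy series starts at midnight rather than at 07:00.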
Using regular expressions should get you what you need. Select lines that match your needs and sum the values. Do this for each day within your hour range and you're set.