Compute column average based on date and time in R

I have a matrix, which looks a bit like this:
Date Time Data
15000 04/09/2014 05:45:00 0.908
15001 04/09/2014 06:00:00 0.888
15002 04/09/2014 06:15:00 0.976
15003 04/09/2014 06:30:00 1.632
15004 04/09/2014 06:45:00 1.648
15005 04/09/2014 07:00:00 1.164
15006 04/09/2014 07:15:00 0.568
15007 04/09/2014 07:30:00 1.020
15008 04/09/2014 07:45:00 1.052
15009 04/09/2014 08:00:00 0.920
15010 04/09/2014 08:15:00 0.656
15011 04/09/2014 08:30:00 1.172
15012 04/09/2014 08:45:00 1.000
15013 04/09/2014 09:00:00 1.420
15014 04/09/2014 09:15:00 0.936
15015 04/09/2014 09:30:00 0.996
15016 04/09/2014 09:45:00 1.100
15017 04/09/2014 10:00:00 0.492
It contains a year's worth of data, with each day having 96 rows (15-minute intervals from 00:00 to 23:45). I'd like to average the Data column for each day, over a time range I specify. For example, if I wanted to average over times 06:00 - 08:00 for each day, the data above should give an answer of 1.0964 for the date 04/09/2014.
I have no idea how to do this using the date and time columns as filters, and wondered if someone could help?
To make things even more complicated, I would also like to compute 45 minute rolling averages for each day, within a different time period, say 04:00 - 09:00. Again, as this is for each day, it would be good to get the result in a matrix for which each row is a certain date, then the columns would represent the rolling averages from say, 04:00 - 04:45, 04:15 - 05:00...
Any ideas?!

Check the following code and let me know if anything is unclear:
data = read.table(header = TRUE, stringsAsFactors = FALSE, text = "Index Date Time Data
15000 04/09/2014 05:45:00 0.908
15001 04/09/2014 06:00:00 0.888
15002 04/09/2014 06:15:00 0.976
15003 04/09/2014 06:30:00 1.632
15004 04/09/2014 06:45:00 1.648
15005 04/09/2014 07:00:00 1.164
15006 04/09/2014 07:15:00 0.568
15007 04/09/2014 07:30:00 1.020
15008 04/09/2014 07:45:00 1.052
15009 04/09/2014 08:00:00 0.920
15010 04/09/2014 08:15:00 0.656
15011 04/09/2014 08:30:00 1.172
15012 04/09/2014 08:45:00 1.000
15013 04/09/2014 09:00:00 1.420
15014 04/09/2014 09:15:00 0.936
15015 04/09/2014 09:30:00 0.996
15016 04/09/2014 09:45:00 1.100
15017 04/09/2014 10:00:00 0.492")
library("magrittr")
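# strptime() returns POSIXlt; convert to POSIXct, which is the safer class for a data frame column
data$parsed.timestamp = paste(data$Date, data$Time) %>% strptime(., format = "%d/%m/%Y %H:%M:%S") %>% as.POSIXct()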
# Hourly Average
desiredGroupingUnit = cut(data$parsed.timestamp, breaks = "hour") #You can use substr for that also
aggregate(data$Data, by = list(desiredGroupingUnit), FUN = mean )
# Group.1 x
# 1 2014-09-04 05:00:00 0.908
# 2 2014-09-04 06:00:00 1.286
# 3 2014-09-04 07:00:00 0.951
# 4 2014-09-04 08:00:00 0.937
# 5 2014-09-04 09:00:00 1.113
# 6 2014-09-04 10:00:00 0.492
# Moving average
getAvgBetweenTwoTimeStamps = function(data, startTime, endTime) {
  # rows whose timestamp lies in [startTime, endTime], both ends included
  averageTheseIndices = which(data$parsed.timestamp >= startTime & data$parsed.timestamp <= endTime)
  return(mean(data$Data[averageTheseIndices]))
}
movingAvgWindow = 45*60 # 45 minutes, expressed in seconds
movingAvgTimestamps = data.frame(from = data$parsed.timestamp, to = data$parsed.timestamp + movingAvgWindow)
movingAvgTimestamps$movingAvg =
  apply(movingAvgTimestamps, MARGIN = 1,
        FUN = function(x) getAvgBetweenTwoTimeStamps(data = data, startTime = x["from"], endTime = x["to"]))
print(movingAvgTimestamps)
# from to movingAvg
# 1 2014-09-04 05:45:00 2014-09-04 06:30:00 1.1010000
# 2 2014-09-04 06:00:00 2014-09-04 06:45:00 1.2860000
# 3 2014-09-04 06:15:00 2014-09-04 07:00:00 1.3550000
# 4 2014-09-04 06:30:00 2014-09-04 07:15:00 1.2530000
# 5 2014-09-04 06:45:00 2014-09-04 07:30:00 1.1000000
# 6 2014-09-04 07:00:00 2014-09-04 07:45:00 0.9510000
# 7 2014-09-04 07:15:00 2014-09-04 08:00:00 0.8900000
# 8 2014-09-04 07:30:00 2014-09-04 08:15:00 0.9120000
# 9 2014-09-04 07:45:00 2014-09-04 08:30:00 0.9500000
# 10 2014-09-04 08:00:00 2014-09-04 08:45:00 0.9370000
# 11 2014-09-04 08:15:00 2014-09-04 09:00:00 1.0620000
# 12 2014-09-04 08:30:00 2014-09-04 09:15:00 1.1320000
# 13 2014-09-04 08:45:00 2014-09-04 09:30:00 1.0880000
# 14 2014-09-04 09:00:00 2014-09-04 09:45:00 1.1130000
# 15 2014-09-04 09:15:00 2014-09-04 10:00:00 0.8810000
# 16 2014-09-04 09:30:00 2014-09-04 10:15:00 0.8626667
# 17 2014-09-04 09:45:00 2014-09-04 10:30:00 0.7960000
# 18 2014-09-04 10:00:00 2014-09-04 10:45:00 0.4920000
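Neither snippet above filters by time of day within each date, which is what the question asks for first. Here is a minimal base-R sketch, assuming the same Date (dd/mm/yyyy) and zero-padded Time (HH:MM:SS) character columns as in the question. The helper name windowMeanByDay is made up for illustration, and the data is a trimmed copy of the sample covering one day only.

```r
# A trimmed copy of the question's sample data (one day only).
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Date Time Data
04/09/2014 05:45:00 0.908
04/09/2014 06:00:00 0.888
04/09/2014 06:15:00 0.976
04/09/2014 06:30:00 1.632
04/09/2014 06:45:00 1.648
04/09/2014 07:00:00 1.164
04/09/2014 07:15:00 0.568
04/09/2014 07:30:00 1.020
04/09/2014 07:45:00 1.052
04/09/2014 08:00:00 0.920
04/09/2014 08:15:00 0.656")

# Hypothetical helper: mean of Data per Date, restricted to [from, to].
# Zero-padded "HH:MM:SS" strings sort in clock order, so plain string
# comparison is enough to select the window.
windowMeanByDay <- function(d, from, to) {
  inWindow <- d$Time >= from & d$Time <= to
  aggregate(Data ~ Date, data = d[inWindow, ], FUN = mean)
}

dailyAvg <- windowMeanByDay(data, "06:00:00", "08:00:00")
print(dailyAvg)  # one row per date; about 1.0964 for 04/09/2014

# 45-minute rolling means (4 consecutive 15-minute readings) inside a window.
# embed() puts each run of 4 readings in one row of a matrix.
inRoll <- data$Time >= "06:00:00" & data$Time <= "08:00:00"
rollAvg <- rowMeans(embed(data$Data[inRoll], 4))
names(rollAvg) <- head(data$Time[inRoll], -3)  # label each window by its start
print(rollAvg)
```

For a full year, the rolling part could be run once per date (for example over split(data, data$Date)) and the resulting vectors stacked with rbind to get one matrix row per day. This assumes every day has the same, complete set of 15-minute readings in the chosen window.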

Related

How to calculate distance and time between two locations

Here's a sample of some data
Tag.ID TimeStep.coa Latitude.coa Longitude.coa
<chr> <dttm> <dbl> <dbl>
1 1657 2017-08-17 12:00:00 72.4 -81.1
2 1657 2017-08-17 18:00:00 72.3 -81.1
3 1658 2017-08-14 18:00:00 72.3 -81.2
4 1658 2017-08-15 00:00:00 72.3 -81.3
5 1659 2017-08-14 18:00:00 72.3 -81.1
6 1659 2017-08-15 00:00:00 72.3 -81.2
7 1660 2017-08-20 18:00:00 72.3 -81.1
8 1660 2017-08-21 00:00:00 72.3 -81.2
9 1660 2017-08-21 06:00:00 72.3 -81.2
10 1660 2017-08-21 12:00:00 72.3 -81.3
11 1661 2017-08-28 12:00:00 72.4 -81.1
12 1661 2017-08-28 18:00:00 72.3 -81.1
13 1661 2017-08-29 06:00:00 72.3 -81.2
14 1661 2017-08-29 12:00:00 72.3 -81.2
15 1661 2017-08-30 06:00:00 72.3 -81.2
16 1661 2017-08-30 18:00:00 72.3 -81.2
17 1661 2017-08-31 00:00:00 72.3 -81.2
18 1661 2017-08-31 06:00:00 72.3 -81.2
19 1661 2017-08-31 12:00:00 72.3 -81.2
20 1661 2017-08-31 18:00:00 72.4 -81.1
I'm looking for a method to obtain distances travelled for each ID. I will be using the ComputeDistance function within VTrack package (could use a different function though). The function looks like this:
ComputeDistance( Lat1, Lat2, Lon1, Lon2)
This calculates a straight line distance between lat/lon coordinates.
I eventually want a dataframe with four columns Tag.ID, Timestep1, Timestep2, and distance. Here's an example:
Tag.ID Timestep1 Timestep2 Distance
1657 2017-08-17 12:00:00 2017-08-17 18:00:00 ComputeDistance(72.4,72.3,-81.1,-81.1)
1658 2017-08-14 18:00:00 2017-08-15 00:00:00 ComputeDistance(72.3,72.3,-81.2,-81.3)
1659 2017-08-14 18:00:00 2017-08-15 00:00:00 ComputeDistance(72.3,72.3,-81.1,-81.2)
1660 2017-08-20 18:00:00 2017-08-21 00:00:00 ComputeDistance(72.3,72.3,-81.1,-81.2)
1660 2017-08-21 00:00:00 2017-08-21 06:00:00 ComputeDistance(72.3,72.3,-81.2,-81.2)
And so on
EDIT:
This is the code I used (thanks AntoniosK). COASpeeds2 is exactly the same as the sample df above:
test <- COASpeeds2 %>%
group_by(Tag.ID) %>%
mutate(Timestep1 = TimeStep.coa,
Timestep2 = lead(TimeStep.coa),
Distance = ComputeDistance(Latitude.coa, lead(Latitude.coa),
Longitude.coa, lead(Longitude.coa))) %>%
ungroup() %>%
na.omit() %>%
select(Tag.ID, Timestep1, Timestep2, Distance)
This is the df I'm getting.
Tag.ID Timestep1 Timestep2 Distance
<fct> <dttm> <dttm> <dbl>
1 1657 2017-08-17 12:00:00 2017-08-17 18:00:00 2.76
2 1657 2017-08-17 18:00:00 2017-08-14 18:00:00 1.40
3 1658 2017-08-14 18:00:00 2017-08-15 00:00:00 6.51
4 1658 2017-08-15 00:00:00 2017-08-14 18:00:00 10.5
5 1659 2017-08-14 18:00:00 2017-08-15 00:00:00 7.51
6 1659 2017-08-15 00:00:00 2017-08-20 18:00:00 7.55
7 1660 2017-08-20 18:00:00 2017-08-21 00:00:00 3.69
8 1660 2017-08-21 00:00:00 2017-08-21 06:00:00 4.32
9 1660 2017-08-21 06:00:00 2017-08-21 12:00:00 3.26
10 1660 2017-08-21 12:00:00 2017-08-28 12:00:00 10.5
11 1661 2017-08-28 12:00:00 2017-08-28 18:00:00 1.60
12 1661 2017-08-28 18:00:00 2017-08-29 06:00:00 1.94
13 1661 2017-08-29 06:00:00 2017-08-29 12:00:00 5.22
14 1661 2017-08-29 12:00:00 2017-08-30 06:00:00 0.759
15 1661 2017-08-30 06:00:00 2017-08-30 18:00:00 1.94
16 1661 2017-08-30 18:00:00 2017-08-31 00:00:00 0.342
17 1661 2017-08-31 00:00:00 2017-08-31 06:00:00 0.281
18 1661 2017-08-31 06:00:00 2017-08-31 12:00:00 4.21
19 1661 2017-08-31 12:00:00 2017-08-31 18:00:00 8.77
library(tidyverse)
library(VTrack)
# example data
dt = read.table(text = "
Tag.ID TimeStep.coa Latitude.coa Longitude.coa
1 1657 2017-08-17_12:00:00 72.4 -81.1
2 1657 2017-08-17_18:00:00 72.3 -81.1
3 1658 2017-08-14_18:00:00 72.3 -81.2
4 1658 2017-08-15_00:00:00 72.3 -81.3
5 1659 2017-08-14_18:00:00 72.3 -81.1
6 1659 2017-08-15_00:00:00 72.3 -81.2
7 1660 2017-08-20_18:00:00 72.3 -81.1
8 1660 2017-08-21_00:00:00 72.3 -81.2
9 1660 2017-08-21_06:00:00 72.3 -81.2
10 1660 2017-08-21_12:00:00 72.3 -81.3
", header=T)
dt %>%
group_by(Tag.ID) %>%
mutate(Timestep1 = TimeStep.coa,
Timestep2 = lead(TimeStep.coa),
Distance = ComputeDistance(Latitude.coa, lead(Latitude.coa),
Longitude.coa, lead(Longitude.coa))) %>%
ungroup() %>%
na.omit() %>%
select(Tag.ID, Timestep1, Timestep2, Distance)
As a result you get this:
# # A tibble: 6 x 4
# Tag.ID Timestep1 Timestep2 Distance
# <int> <fct> <fct> <dbl>
# 1 1657 2017-08-17_12:00:00 2017-08-17_18:00:00 11.1
# 2 1658 2017-08-14_18:00:00 2017-08-15_00:00:00 3.38
# 3 1659 2017-08-14_18:00:00 2017-08-15_00:00:00 3.38
# 4 1660 2017-08-20_18:00:00 2017-08-21_00:00:00 3.38
# 5 1660 2017-08-21_00:00:00 2017-08-21_06:00:00 0.0000949
# 6 1660 2017-08-21_06:00:00 2017-08-21_12:00:00 3.38
You could use geosphere::distGeo in a by approach.
library(geosphere)
do.call(rbind.data.frame, by(dat, dat$Tag.ID, function(s) {
  # elapsed time from first to last fix, always expressed in hours
  t.diff <- as.numeric(difftime(s$TimeStep.coa[nrow(s)], s$TimeStep.coa[1], units = "hours"))
  # distGeo() expects (longitude, latitude), hence columns 4:3; metres -> km
  d.diff <- sum(mapply(function(x, y)
    distGeo(s[x, 4:3], s[y, 4:3]), x=1:(nrow(s)-1), y=2:nrow(s)))/1e3
  `colnames<-`(cbind(t.diff, d.diff), c("hours", "km"))
}))
# (km rounded to one decimal)
#      hours   km
# 1657     6 11.2
# 1658     6  3.4
# 1659     6  3.4
# 1660    18  6.8
# 1661    78 26.2
Data:
dat <- structure(list(Tag.ID = c(1657L, 1657L, 1658L, 1658L, 1659L,
1659L, 1660L, 1660L, 1660L, 1660L, 1661L, 1661L, 1661L, 1661L,
1661L, 1661L, 1661L, 1661L, 1661L, 1661L), TimeStep.coa = structure(c(1502964000,
1502985600, 1502726400, 1502748000, 1502726400, 1502748000, 1503244800,
1503266400, 1503288000, 1503309600, 1503914400, 1503936000, 1503979200,
1504000800, 1504065600, 1504108800, 1504130400, 1504152000, 1504173600,
1504195200), class = c("POSIXct", "POSIXt"), tzone = ""), Latitude.coa = c(72.4,
72.3, 72.3, 72.3, 72.3, 72.3, 72.3, 72.3, 72.3, 72.3, 72.4, 72.3,
72.3, 72.3, 72.3, 72.3, 72.3, 72.3, 72.3, 72.4), Longitude.coa = c(-81.1,
-81.1, -81.2, -81.3, -81.1, -81.2, -81.1, -81.2, -81.2, -81.3,
-81.1, -81.1, -81.2, -81.2, -81.2, -81.2, -81.2, -81.2, -81.2,
-81.1)), row.names = c(NA, -20L), class = "data.frame")
Assuming the start and end points are in order and come in matching pairs, here is another option:
#identify the start and end of each trip
df$leg<-rep(c("Start", "End"), nrow(df)/2)
#label each trip
df$trip <- rep(1:(nrow(df)/2), each=2)
#change the shape
library(tidyr)
output<-pivot_wider(df, id_cols = c(Tag.ID, trip),
names_from = leg,
values_from = c(TimeStep.coa, Latitude.coa, Longitude.coa))
#calculate distance in metres (use your package of choice)
library(geosphere)
output$distance<-distGeo(output[ ,c("Longitude.coa_Start", "Latitude.coa_Start")],
output[ ,c("Longitude.coa_End", "Latitude.coa_End")])
# #remove undesired columns
# output <- output[, -c(5, 6, 7, 8)]
output
> output[, -c(5, 6, 7, 8)]
# A tibble: 10 x 5
Tag.ID trip TimeStep.coa_Start TimeStep.coa_End distance
<int> <int> <fct> <fct> <dbl>
1 1657 1 2017-08-17 12:00:00 2017-08-17 18:00:00 11159.
2 1658 2 2017-08-14 18:00:00 2017-08-15 00:00:00 3395.
3 1659 3 2017-08-14 18:00:00 2017-08-15 00:00:00 3395.
4 1660 4 2017-08-20 18:00:00 2017-08-21 00:00:00 3395.
5 1660 5 2017-08-21 06:00:00 2017-08-21 12:00:00 3395.
6 1661 6 2017-08-28 12:00:00 2017-08-28 18:00:00 11159.
7 1661 7 2017-08-29 06:00:00 2017-08-29 12:00:00 0
8 1661 8 2017-08-30 06:00:00 2017-08-30 18:00:00 0
9 1661 9 2017-08-31 00:00:00 2017-08-31 06:00:00 0
10 1661 10 2017-08-31 12:00:00 2017-08-31 18:00:00 11661.
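The per-ID pairing of consecutive fixes can also be done in base R, which sidesteps any doubt about whether the grouping was applied (the asker's edit shows Timestep2 values crossing from one Tag.ID to the next). This is a rough sketch, assuming a data frame shaped like the sample. haversineKm and legsForTag are illustrative names, and the spherical haversine formula (Earth radius taken as 6371 km) stands in for ComputeDistance or distGeo, so the numbers differ slightly from an ellipsoidal calculation.

```r
# Small stand-in for the sample data (three of the tags only).
fixes <- data.frame(
  Tag.ID        = c(1657, 1657, 1658, 1658, 1660, 1660, 1660),
  TimeStep.coa  = as.POSIXct(c("2017-08-17 12:00:00", "2017-08-17 18:00:00",
                               "2017-08-14 18:00:00", "2017-08-15 00:00:00",
                               "2017-08-20 18:00:00", "2017-08-21 00:00:00",
                               "2017-08-21 06:00:00"), tz = "UTC"),
  Latitude.coa  = c(72.4, 72.3, 72.3, 72.3, 72.3, 72.3, 72.3),
  Longitude.coa = c(-81.1, -81.1, -81.2, -81.3, -81.1, -81.2, -81.2)
)

# Great-circle distance in km on a sphere (assumed radius 6371 km).
# Argument order mirrors ComputeDistance(Lat1, Lat2, Lon1, Lon2).
haversineKm <- function(lat1, lat2, lon1, lon2, radiusKm = 6371) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radiusKm * asin(pmin(1, sqrt(a)))
}

# One row per consecutive pair of fixes within a single tag.
legsForTag <- function(s) {
  n <- nrow(s)
  if (n < 2) return(NULL)
  data.frame(Tag.ID    = s$Tag.ID[-n],
             Timestep1 = s$TimeStep.coa[-n],
             Timestep2 = s$TimeStep.coa[-1],
             Distance  = haversineKm(s$Latitude.coa[-n], s$Latitude.coa[-1],
                                     s$Longitude.coa[-n], s$Longitude.coa[-1]))
}

# split() keeps each tag separate, so no leg can span two tags.
legs <- do.call(rbind, lapply(split(fixes, fixes$Tag.ID), legsForTag))
rownames(legs) <- NULL
print(legs)
```

Because each tag is processed on its own data frame, the last fix of one tag is never paired with the first fix of the next.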

Identify Weekdays and Time from a xts object R

I have an xts object x2 like this:
str(x2)
An ‘xts’ object on 2016-01-31 23:15:00/2016-02-26 22:55:00 containing:
Data: num [1:5700, 1] 1.08 1.08 1.08 1.08 1.08 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "close"
Indexed by objects of class: [POSIXct,POSIXt] TZ: America/New_York
xts Attributes:
NULL
head(x2, 10)
close
2016-01-31 23:15:00 1.083390
2016-01-31 23:20:00 1.083350
2016-01-31 23:25:00 1.083125
2016-01-31 23:30:00 1.083360
2016-01-31 23:35:00 1.083240
2016-01-31 23:40:00 1.083190
2016-01-31 23:45:00 1.083165
2016-01-31 23:50:00 1.083020
2016-01-31 23:55:00 1.082965
2016-02-01 00:00:00 1.082200
And now I would like to identify for example all Mondays from 8:00 to 10:00. Is there a smart way to get this? Thank you guys.
You can use .indexwday() and .indexhour(). A little example follows:
library(xts)
seqTime <- seq(as.POSIXct("2016-01-01"), by = 300, length.out = 1000)
myXts <- xts(rnorm(1000), seqTime)
myXts[.indexwday(myXts) == 1 & (.indexhour(myXts) %in% c(8, 9))]
with output:
[,1]
2016-01-04 08:00:00 0.74224022
2016-01-04 08:05:00 -0.50372235
2016-01-04 08:10:00 0.94655985
2016-01-04 08:15:00 -0.80261212
2016-01-04 08:20:00 0.90475246
2016-01-04 08:25:00 -0.72225021
2016-01-04 08:30:00 -0.32635167
2016-01-04 08:35:00 0.94919253
2016-01-04 08:40:00 0.33799147
2016-01-04 08:45:00 1.19636284
2016-01-04 08:50:00 0.13022675
2016-01-04 08:55:00 -0.61397227
2016-01-04 09:00:00 -2.14580209
2016-01-04 09:05:00 -0.02778257
2016-01-04 09:10:00 -0.73649967
2016-01-04 09:15:00 0.31217192
2016-01-04 09:20:00 -0.30923692
2016-01-04 09:25:00 0.64499992
2016-01-04 09:30:00 -1.84125238
2016-01-04 09:35:00 2.43008526
2016-01-04 09:40:00 -1.85907819
2016-01-04 09:45:00 0.31648160
2016-01-04 09:50:00 -0.02847419
2016-01-04 09:55:00 -0.09911078
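For reference, the two index helpers correspond to the wday and hour fields of base R's POSIXlt, where wday counts from 0 (Sunday), so 1 is Monday. Here is a small sketch without xts, using a made-up 5-minute series in UTC. It also shows one way to keep the 10:00 observation itself, which the %in% c(8, 9) test leaves out.

```r
# Made-up 5-minute timestamps, Friday 2016-01-01 to Monday 2016-01-04 (UTC).
stamps <- seq(as.POSIXct("2016-01-01", tz = "UTC"), by = 300, length.out = 1000)
parts <- as.POSIXlt(stamps)

isMonday  <- parts$wday == 1            # 0 = Sunday, 1 = Monday, ...
from8to10 <- parts$hour %in% c(8, 9)    # 08:00 up to 09:55
upTo10    <- from8to10 | (parts$hour == 10 & parts$min == 0)  # also keep 10:00

mondayMorning   <- stamps[isMonday & from8to10]
mondayInclusive <- stamps[isMonday & upTo10]

range(mondayMorning)
length(mondayInclusive)
```

The same logical vectors could be used to subset an xts object, since they line up one-to-one with its index.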

R XTS package to.minutes – Unable to create 15m and 30m time series from 5m correctly

I am trying to use R XTS package to.minutes to create 15m and 30m time series from 5m. I have an xts object, which is date-time followed by OHLC. Info about xts object x is below:
head(x) shows the following:
High Low Open Close Volume
2010-05-03 09:00:00 106.08 105.95 106.06 106.00 1055
2010-05-03 09:05:00 106.03 105.75 106.00 105.77 4369
2010-05-03 09:10:00 105.77 105.59 105.77 105.68 4125
2010-05-03 09:15:00 105.84 105.66 105.69 105.80 2457
2010-05-03 09:20:00 105.89 105.71 105.80 105.83 1788
2010-05-03 09:25:00 105.89 105.78 105.84 105.78 977
str(x) shows the following:
> str(x)
‘zoo’ series from 2010-05-03 09:00:00 to 2013-06-10 14:30:00
Data: num [1:222473, 1:5] 106 106 106 106 106 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "High" "Low" "Open" "Close" ...
Index: POSIXct[1:222473], format: "2010-05-03 09:00:00" "2010-05-03 09:05:00" "2010-05-03 09:10:00" ...
head(str(index(x))) shows the following:
> head(str(index(x)))
POSIXct[1:222473], format: "2010-05-03 09:00:00" "2010-05-03 09:05:00" "2010-05-03 09:10:00" ...
NULL
When I convert the time series to 15m, the series starts at 09:10:00 followed by 15 minutes increments instead of starting at 09:15:00 followed by 15 minutes increments
> head(to.minutes(x, k=15))
x.Open x.High x.Low x.Close x.Volume
2010-05-03 09:10:00 106.08 105.95 105.77 105.68 9549
2010-05-03 09:25:00 105.84 105.78 105.69 105.78 5222
2010-05-03 09:40:00 105.80 105.92 105.62 106.12 9727
2010-05-03 09:55:00 106.17 106.00 106.01 106.22 6320
2010-05-03 10:10:00 106.26 106.07 106.15 106.14 8422
2010-05-03 10:25:00 106.57 106.38 106.15 106.37 10422
The same issue when I convert the time series to 30m, the series starts at 09:25:00 followed by 30 minutes increments instead of starting at 09:30:00 followed by 30 minutes increments
> head(to.minutes(x, k=30))
x.Open x.High x.Low x.Close x.Volume
2010-05-03 09:25:00 106.08 105.95 105.69 105.78 14771
2010-05-03 09:55:00 105.80 106.00 105.62 106.22 16047
2010-05-03 10:25:00 106.26 106.38 106.15 106.37 18844
2010-05-03 10:55:00 106.37 106.27 106.01 106.00 17193
2010-05-03 11:25:00 106.04 106.20 105.95 106.29 9075
2010-05-03 11:55:00 106.34 106.35 106.24 106.39 8517
I also tried the same using 1-minute data and had the same issue. Any thoughts on what might be causing this and how to resolve it? Thanks
This is described on the ?to.minutes help page. By default each aggregated bar is stamped with the time of the last observation that falls inside it, so the bar covering 09:00 to 09:14 is labelled 09:10 rather than 09:00 or 09:15. However, you can explicitly set the indexAt= parameter to "startof". For example:
x <- zoo(runif(25), order.by=seq(as.POSIXct("2010-05-03 09:00:00"),
as.POSIXct("2010-05-03 11:00:00"), by="5 min"))
to.minutes15(x)
# x.Open x.High x.Low x.Close
# 2010-05-03 09:10:00 0.35570172 0.3557017 0.04524480 0.04524480
# 2010-05-03 09:25:00 0.78939084 0.7893908 0.44032175 0.44032175
# 2010-05-03 09:40:00 0.05272398 0.5381755 0.05272398 0.53817548
# 2010-05-03 09:55:00 0.02198503 0.1113298 0.02198503 0.11132980
# 2010-05-03 10:10:00 0.78785210 0.8804505 0.04152860 0.04152860
# 2010-05-03 10:25:00 0.79317091 0.9497044 0.54751546 0.94970444
# 2010-05-03 10:40:00 0.03886176 0.7425681 0.03886176 0.06614893
# 2010-05-03 10:55:00 0.58684500 0.5868450 0.02794687 0.14291696
# 2010-05-03 11:00:00 0.11713868 0.1171387 0.11713868 0.11713868
versus
to.minutes15(x, indexAt="startof")
# x.Open x.High x.Low x.Close
# 2010-05-03 09:00:00 0.35570172 0.3557017 0.04524480 0.04524480
# 2010-05-03 09:15:00 0.78939084 0.7893908 0.44032175 0.44032175
# 2010-05-03 09:30:00 0.05272398 0.5381755 0.05272398 0.53817548
# 2010-05-03 09:45:00 0.02198503 0.1113298 0.02198503 0.11132980
# 2010-05-03 10:00:00 0.78785210 0.8804505 0.04152860 0.04152860
# 2010-05-03 10:15:00 0.79317091 0.9497044 0.54751546 0.94970444
# 2010-05-03 10:30:00 0.03886176 0.7425681 0.03886176 0.06614893
# 2010-05-03 10:45:00 0.58684500 0.5868450 0.02794687 0.14291696
# 2010-05-03 11:00:00 0.11713868 0.1171387 0.11713868 0.11713868
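Why the default labels read 09:10, 09:25, and so on can be reproduced with plain arithmetic on the timestamps. Below is a sketch in base R, assuming 5-minute data starting at 09:00 in UTC; the variable names are illustrative. Each 15-minute bucket can be labelled by its last observation (what the default output above shows), by its start, or by its closing boundary (the 09:15, 09:30, ... labels the question expected).

```r
stamps <- seq(as.POSIXct("2010-05-03 09:00:00", tz = "UTC"),
              as.POSIXct("2010-05-03 10:00:00", tz = "UTC"), by = "5 min")

bucketSecs <- 15 * 60
# Seconds since the epoch, floored to the start of each 15-minute bucket.
bucketStart <- (as.numeric(stamps) %/% bucketSecs) * bucketSecs

toTime <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Timestamp of the last observation inside each bucket.
lastInBucket <- as.vector(tapply(as.numeric(stamps), bucketStart, max))

labels <- data.frame(
  lastObs  = toTime(lastInBucket),
  startOf  = toTime(unique(bucketStart)),
  boundary = toTime(unique(bucketStart) + bucketSecs)
)
print(labels)
```

In xts itself, align.time() is the usual way to round an index up to such a boundary. Its help page describes how timestamps that already sit on a boundary are treated.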

Cut a POSIXct by specific time for daily means

I am interested in calculating averages over specific time periods in a time series data set.
Given a time series like this:
dtm=as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240)
Count<-c(1:240)
DF<-data.frame(dtm,Count)
In the past I have been able to calculate daily averages with
DF$Day<-cut(DF$dtm,breaks="day")
Day_Avg<-aggregate(DF$Count~Day,DF,mean)
But now I am trying to cut up the day into specific time periods and I'm not sure how to set my "breaks".
As opposed to a daily average from 0:00 to 24:00, how, for example, could I get a Noon to Noon average?
Or more fancy, how could I set up a Noon to Noon average excluding the night times of 7PM to 6AM (or conversely only including the daylight hours of 6AM- 7PM).
xts is a very good package for time series analysis:
library(xts)
originalTZ <- Sys.getenv("TZ")
Sys.setenv(TZ = "GMT")
data.xts <- as.xts(1:240, as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240))
head(data.xts)
## [,1]
## 2007-03-27 06:00:00 1
## 2007-03-27 07:00:00 2
## 2007-03-27 08:00:00 3
## 2007-03-27 09:00:00 4
## 2007-03-27 10:00:00 5
## 2007-03-27 11:00:00 6
# You can filter data using ISO-style subsetting
data.xts.filtered <- data.xts["T06:00/T19:00"]
# You can use builtin functions to apply any function FUN on daily data.
apply.daily(data.xts.filtered, mean)
## [,1]
## 2007-03-27 18:00:00 7.5
## 2007-03-28 18:00:00 31.5
## 2007-03-29 18:00:00 55.5
## 2007-03-30 18:00:00 79.5
## 2007-03-31 18:00:00 103.5
## 2007-04-01 18:00:00 127.5
## 2007-04-02 18:00:00 151.5
## 2007-04-03 18:00:00 175.5
## 2007-04-04 18:00:00 199.5
## 2007-04-05 18:00:00 223.5
# OR
# now let's say you want to find noon to noon average.
period.apply(data.xts, c(0, which(.indexhour(data.xts) == 11)), FUN = mean)
## [,1]
## 2007-03-27 11:00:00 3.5
## 2007-03-28 11:00:00 18.5
## 2007-03-29 11:00:00 42.5
## 2007-03-30 11:00:00 66.5
## 2007-03-31 11:00:00 90.5
## 2007-04-01 11:00:00 114.5
## 2007-04-02 11:00:00 138.5
## 2007-04-03 11:00:00 162.5
## 2007-04-04 11:00:00 186.5
## 2007-04-05 11:00:00 210.5
# now if you want to exclude time from 7 PM to 6 AM
data.xts.filtered <- data.xts[!data.xts %in% data.xts["T20:00/T05:00"]]
head(data.xts.filtered, 20)
## [,1]
## 2007-03-27 06:00:00 1
## 2007-03-27 07:00:00 2
## 2007-03-27 08:00:00 3
## 2007-03-27 09:00:00 4
## 2007-03-27 10:00:00 5
## 2007-03-27 11:00:00 6
## 2007-03-27 12:00:00 7
## 2007-03-27 13:00:00 8
## 2007-03-27 14:00:00 9
## 2007-03-27 15:00:00 10
## 2007-03-27 16:00:00 11
## 2007-03-27 17:00:00 12
## 2007-03-27 18:00:00 13
## 2007-03-27 19:00:00 14
## 2007-03-28 06:00:00 25
## 2007-03-28 07:00:00 26
## 2007-03-28 08:00:00 27
## 2007-03-28 09:00:00 28
## 2007-03-28 10:00:00 29
## 2007-03-28 11:00:00 30
period.apply(data.xts.filtered, c(0, which(.indexhour(data.xts.filtered) == 11)), FUN = mean)
## [,1]
## 2007-03-27 11:00:00 3.50000
## 2007-03-28 11:00:00 17.78571
## 2007-03-29 11:00:00 41.78571
## 2007-03-30 11:00:00 65.78571
## 2007-03-31 11:00:00 89.78571
## 2007-04-01 11:00:00 113.78571
## 2007-04-02 11:00:00 137.78571
## 2007-04-03 11:00:00 161.78571
## 2007-04-04 11:00:00 185.78571
## 2007-04-05 11:00:00 209.78571
Sys.setenv(TZ = originalTZ)
Let me quickly repeat your code.
dtm <- as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240)
Count <- c(1:240)
DF<-data.frame(dtm,Count)
DF$Day<-cut(DF$dtm,breaks="day")
Day_Avg<-aggregate(DF$Count~Day,DF,mean)
If you offset each time by 12 hours in the function call, you can still use cut with breaks on "day". I will save the day that the noon to noon starts on, so I will subtract 12 hours.
# Get twelve hours in seconds
timeOffset <- 60*60*12
# Subtract the offset to get the start day of the noon to noon
DF$Noon_Start_Day <- cut((DF$dtm - timeOffset), breaks="day")
# Get the mean
NtN_Avg <- aggregate(DF$Count ~ Noon_Start_Day, DF, mean)
One way to exclude certain hours is to convert the dates to POSIXlt. Then you can access hour among other things.
# Indicate which times are good (use whatever boolean test is needed here).
# Here: keep hours 6 to 18, i.e. 06:00 up to but not including 19:00
hours <- as.POSIXlt(DF$dtm)$hour
goodTimes <- hours >= 6 & hours < 19
new_NtN_Avg <- aggregate(Count ~ Noon_Start_Day, data=subset(DF, goodTimes), mean)
I found some help at this question on stackoverflow: r-calculate-means-for-subset-of-a-group
The noon-to-noon problem can easily be solved numerically. The key is that the start of a (GMT) day has a time_t value that is always divisible by 86400. This is specified by POSIX. For example, see: http://en.wikipedia.org/wiki/Unix_time
cuts <- unique(as.numeric(DF$dtm) %/% (86400/2)) * (86400/2) # half-days
cuts <- c(cuts, cuts[length(cuts)]+(86400/2)) # One more at the end
cuts <- as.POSIXct(cuts, tz="GMT", origin="1970-01-01") # Familiar format
DF$halfday <- cut(DF$dtm, cuts) # This is the cut you want.
Halfday_Avg <- aggregate(Count~halfday, data=DF, FUN=mean)
Halfday_Avg
## halfday Count
## 1 2007-03-27 00:00:00 3.5
## 2 2007-03-27 12:00:00 12.5
## 3 2007-03-28 00:00:00 24.5
## 4 2007-03-28 12:00:00 36.5
## 5 2007-03-29 00:00:00 48.5
## 6 2007-03-29 12:00:00 60.5
## 7 2007-03-30 00:00:00 72.5
## 8 2007-03-30 12:00:00 84.5
## 9 2007-03-31 00:00:00 96.5
## 10 2007-03-31 12:00:00 108.5
## 11 2007-04-01 00:00:00 120.5
## 12 2007-04-01 12:00:00 132.5
## 13 2007-04-02 00:00:00 144.5
## 14 2007-04-02 12:00:00 156.5
## 15 2007-04-03 00:00:00 168.5
## 16 2007-04-03 12:00:00 180.5
## 17 2007-04-04 00:00:00 192.5
## 18 2007-04-04 12:00:00 204.5
## 19 2007-04-05 00:00:00 216.5
## 20 2007-04-05 12:00:00 228.5
## 21 2007-04-06 00:00:00 237.5
Now to extend this to solve the rest of the problem. Given here is the 6AM-7PM time range.
intraday <- as.numeric(DF$dtm) %% 86400
# Subset DF by the chosen range
New_Avg <- aggregate(Count~halfday, data=DF[intraday >= 6*3600 & intraday <= 19*3600,], FUN=mean)
New_Avg
## halfday Count
## 1 2007-03-27 00:00:00 3.5
## 2 2007-03-27 12:00:00 10.5
## 3 2007-03-28 00:00:00 27.5
## 4 2007-03-28 12:00:00 34.5
## 5 2007-03-29 00:00:00 51.5
## 6 2007-03-29 12:00:00 58.5
## 7 2007-03-30 00:00:00 75.5
## 8 2007-03-30 12:00:00 82.5
## 9 2007-03-31 00:00:00 99.5
## 10 2007-03-31 12:00:00 106.5
## 11 2007-04-01 00:00:00 123.5
## 12 2007-04-01 12:00:00 130.5
## 13 2007-04-02 00:00:00 147.5
## 14 2007-04-02 12:00:00 154.5
## 15 2007-04-03 00:00:00 171.5
## 16 2007-04-03 12:00:00 178.5
## 17 2007-04-04 00:00:00 195.5
## 18 2007-04-04 12:00:00 202.5
## 19 2007-04-05 00:00:00 219.5
## 20 2007-04-05 12:00:00 226.5
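The same arithmetic gives one group per noon-to-noon day rather than per half-day: shift every timestamp back by 12 hours before dividing by 86400. This is a sketch under the question's setup (hourly GMT data). noonDay and noonStart are names introduced here, and the 06:00 to 19:00 daylight filter is inclusive at both ends, as in the snippet above.

```r
dtm <- as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240)
DF <- data.frame(dtm, Count = 1:240)

secs <- as.numeric(DF$dtm)
# Day number of the noon-to-noon period each row falls in; the label is the
# calendar date on which that period starts (at 12:00 GMT).
noonDay <- (secs - 43200) %/% 86400
DF$noonStart <- as.Date(noonDay, origin = "1970-01-01")

NtN_Avg <- aggregate(Count ~ noonStart, data = DF, FUN = mean)

# Same grouping, daylight hours only (06:00 to 19:00 inclusive).
intraday <- secs %% 86400
daylight <- intraday >= 6 * 3600 & intraday <= 19 * 3600
NtN_Daylight <- aggregate(Count ~ noonStart, data = DF[daylight, ], FUN = mean)

head(NtN_Avg, 3)
head(NtN_Daylight, 3)
```

The first and last groups are partial, because the series neither starts nor ends at noon.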

Divide observation by period mean. Help to simplify code

Link to data:
http://dl.dropbox.com/u/56075871/data.txt
I want to divide each observation by mean for that hour. Example:
2012-01-02 10:00:00 5.23
2012-01-03 10:00:00 5.28
2012-01-04 10:00:00 5.29
2012-01-05 10:00:00 5.29
2012-01-09 10:00:00 5.28
2012-01-10 10:00:00 5.33
2012-01-11 10:00:00 5.42
2012-01-12 10:00:00 5.55
2012-01-13 10:00:00 5.68
2012-01-16 10:00:00 5.53
The mean of these is 5.388. Next I want to divide each observation by that mean, so 5.23/5.388, 5.28/5.388, ... down to the last one, 5.53/5.388.
I have hourly timeseries for 10 stocks:
S1.1h S2.1h S3.1h S4.1h S5.1h S6.1h S7.1h S8.1h S9.1h S10.1h
2012-01-02 10:00:00 64.00 110.7 5.23 142.0 20.75 34.12 32.53 311.9 7.82 5.31
2012-01-02 11:00:00 64.00 110.8 5.30 143.2 20.90 34.27 32.81 312.0 7.97 5.34
2012-01-02 12:00:00 64.00 111.1 5.30 142.8 20.90 34.28 32.70 312.4 7.98 5.33
2012-01-02 13:00:00 61.45 114.7 5.30 143.1 21.01 34.35 32.85 313.0 7.96 5.35
2012-01-02 14:00:00 61.45 116.2 5.26 143.7 21.10 34.60 32.99 312.9 7.95 5.36
2012-01-02 15:00:00 63.95 116.2 5.26 143.2 21.26 34.72 33.00 312.6 7.99 5.37
2012-01-02 16:00:00 63.95 117.3 5.25 143.3 21.27 35.08 33.04 312.7 7.99 5.36
2012-01-02 17:00:00 63.95 117.8 5.24 144.7 21.25 35.40 33.10 313.6 7.99 5.40
2012-01-02 18:00:00 63.95 117.9 5.23 145.0 21.20 35.50 33.17 312.5 7.98 5.35
2012-01-03 10:00:00 63.95 115.5 5.28 143.5 21.15 35.31 33.05 311.7 7.94 5.37
...
I want to divide each observation by the mean for its hour of the day, for every stock.
I have some code. Code to make means:
#10:00:00, 11:00:00, ... 18:00:00
times <- paste(seq(10, 18),":00:00", sep="")
#means - matrix of means for timeseries and hour
means <- matrix(ncol= ncol(time_series), nrow = length(times))
for (t in 1:length(times)) {
#t is time 10 to 18
for(i in 1:ncol(time_series)) {
#i is stock 1 to 10
# hour mean for each observation in data
means[t,i] <- mean(time_series[grep(times[t], index(time_series)), i])
}
}
And my function to get "things done":
for (t in 1:length(times)) {
# get all dates with times[t] hour
hours <- time_series[grep(times[t], index(time_series))]
ep <- endpoints(hours, "hours")
out <- rbind(out, period.apply(hours, INDEX=ep, FUN=function(x) {
x/means[t,]
}))
}
I know this is awful, but it works. How can I simplify the code?
Here's one way to do it:
# Split the xts object into chunks by hour
# .indexhour() returns the hourly portion for each timestamp
s <- split(time_series, .indexhour(time_series))
# Use sweep to divide each value of x by colMeans(x) for each group of hours
l <- lapply(s, function(x) sweep(x, 2, colMeans(x), FUN="/"))
# rbind everything back together
r <- do.call(rbind, l)
The scale function can do that. Used with ave, you could restrict the calculations to within each hour. Post the results of dput on that xts/zoo object and you will get rapid replies.
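Here is a small sketch of that ave() idea in base R, on a made-up data frame holding the S3 values from the question plus one invented 11:00 reading; the column names are assumptions. Plain division by the group mean is used here. scale() would need center = FALSE and a scale argument equal to the means, since it centres by default.

```r
prices <- data.frame(
  time = as.POSIXct(c("2012-01-02 10:00:00", "2012-01-03 10:00:00",
                      "2012-01-04 10:00:00", "2012-01-05 10:00:00",
                      "2012-01-09 10:00:00", "2012-01-10 10:00:00",
                      "2012-01-11 10:00:00", "2012-01-12 10:00:00",
                      "2012-01-13 10:00:00", "2012-01-16 10:00:00",
                      "2012-01-02 11:00:00", "2012-01-03 11:00:00"), tz = "UTC"),
  S3 = c(5.23, 5.28, 5.29, 5.29, 5.28, 5.33, 5.42, 5.55, 5.68, 5.53,
         5.30, 5.40)   # the final 11:00 value (5.40) is invented
)

hourOfDay <- as.POSIXlt(prices$time)$hour

# ave() applies FUN within each hour-of-day group and returns the results
# in the original row order.
prices$S3.rel <- ave(prices$S3, hourOfDay, FUN = function(v) v / mean(v))

print(prices)
```

With several stock columns, the same call can be repeated per column, for example with lapply over the column names.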

Resources