I have some observed data by hour. I am trying to subset this data by day or even by week intervals, and I am not sure how to proceed with this task in R.
A sample of the data is below.
date obs
2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11
First I read the data in, with the multiple spaces replaced by tabs.
dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Note that as.POSIXct and as.POSIXlt are not the same function
#---------------------
$`296`
date obs
1 2011-10-24 01:00:00 12
2 2011-10-24 02:00:00 4
3 2011-10-24 19:00:00 18
4 2011-10-24 20:00:00 7
5 2011-10-24 21:00:00 4
6 2011-10-24 22:00:00 2
$`297`
date obs
7 2011-10-25 00:00:00 4
8 2011-10-25 01:00:00 2
9 2011-10-25 02:00:00 2
10 2011-10-25 15:00:00 12
11 2011-10-25 18:00:00 2
12 2011-10-25 19:00:00 3
13 2011-10-25 21:00:00 2
14 2011-10-25 23:00:00 9
$`298`
date obs
15 2011-10-26 00:00:00 13
16 2011-10-26 01:00:00 11
The POSIXlt class does not work well inside dataframes, but it can be very handy for creating time-based groups. It's a list structure with these components: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdst'. The cut.POSIXt function adds divisions at other natural boundaries; e.g.:
?cut.POSIXt
split(dat , cut(dat$date, "week") )
If you wanted to sum within date:
tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298
47 36 24
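If you would rather see calendar dates than day-of-year numbers as the group labels, formatting the date column works the same way (a small variant of the above):
tapply(dat$obs, format(dat$date, "%Y-%m-%d"), sum)
#-------
#2011-10-24 2011-10-25 2011-10-26
#        47         36         24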
I'd use a time series class such as xts
dat <- read.table(text="2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11", header=FALSE, stringsAsFactors=FALSE)
library(xts)
xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))
xts subsetting is very intuitive. For all data on "2011-10-25", do this
xobj["2011-10-25"]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)
xobj["2011-10-24/2011-10-25"]
Or, if you want all data from October 2011,
xobj["2011-10"]
If you want to get all data from any day that is between 19:00 and 20:00,
xobj['T19:00:00/T20:00:00']
# [,1]
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-25 19:00:00 3
You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)
endpoints(xobj, "days")
[1] 0 6 14 16
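Those endpoints pair naturally with period.apply, which applies a function over each span between consecutive endpoints; for example, daily sums (a sketch using xobj from above):
period.apply(xobj, endpoints(xobj, "days"), sum)
#                    [,1]
#2011-10-24 22:00:00   47
#2011-10-25 23:00:00   36
#2011-10-26 01:00:00   24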
Or you can convert to a lower frequency
to.weekly(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26 12 18 2 11
to.daily(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25 12 18 2 2
#2011-10-26 4 12 2 9
#2011-10-26 13 13 11 11
Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE
to.daily(xobj, OHLC=FALSE)
# [,1]
#2011-10-25 2
#2011-10-26 9
#2011-10-26 11
For more basic subsetting, and much more, visit http://www.quantmod.com/examples/
As @JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.
You can split by day (or week, or month, etc), apply a function, then recombine
split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
# [,1]
#2011-10-24 01:00:00 12
#2011-10-24 02:00:00 4
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-24 21:00:00 4
#2011-10-24 22:00:00 2
#
#[[2]]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
#
#[[3]]
# [,1]
#2011-10-26 00:00:00 13
#2011-10-26 01:00:00 11
Suppose you want only the first value of each day: split by day, lapply the first function, and rbind back together.
do.call(rbind, lapply(split(xobj, 'days'), first))
# [,1]
#2011-10-24 01:00:00 12
#2011-10-25 00:00:00 4
#2011-10-26 00:00:00 13
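One caveat: first here is xts::first. If dplyr is also attached, it masks that name, so qualifying the call avoids surprises:
do.call(rbind, lapply(split(xobj, 'days'), xts::first))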
Related
This is a bit of a curious case for which I have been unable to find a solution on stackoverflow. I have a dataset with a date-time column and a column of values that indicate an event, as in the dat example below. The date-times are hourly; note, however, that occasional "missed" hours exist (2 hours are missing between rows 12 & 13).
dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),
max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6))
> dat
datetime event
1 2010-04-03 03:00:00 1
2 2010-04-03 04:00:00 NA
3 2010-04-03 05:00:00 NA
4 2010-04-03 06:00:00 NA
5 2010-04-03 07:00:00 NA
6 2010-04-03 08:00:00 NA
7 2010-04-03 09:00:00 NA
8 2010-04-03 10:00:00 NA
9 2010-04-03 11:00:00 NA
10 2010-04-03 12:00:00 NA
11 2010-04-03 13:00:00 2
12 2010-04-03 14:00:00 NA
13 2010-04-03 17:00:00 NA
14 2010-04-03 18:00:00 NA
15 2010-04-03 19:00:00 NA
16 2010-04-03 20:00:00 NA
17 2010-04-03 21:00:00 3
18 2010-04-03 22:00:00 4
19 2010-04-03 23:00:00 NA
20 2010-04-04 00:00:00 NA
21 2010-04-04 01:00:00 NA
22 2010-04-04 02:00:00 NA
23 2010-04-04 03:00:00 NA
24 2010-04-04 04:00:00 NA
25 2010-04-04 05:00:00 NA
26 2010-04-04 06:00:00 NA
27 2010-04-04 07:00:00 NA
28 2010-04-04 08:00:00 5
29 2010-04-04 09:00:00 NA
30 2010-04-04 10:00:00 6
I would like each row within an interval of 7 hours after the event occurs to be identified with a unique identifier, but with the following caveats (hence the "curious case"):
if a subsequent event occurs within the 7 hours of the event prior, that subsequent event is essentially ignored (i.e., "event" number does not equal assigned identifier value), and
missing times are accounted for (i.e., the rule is based on the time elapsed, not the number of rows).
The product would look like result:
library(dplyr)
result <- dat %>%
mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
> result
datetime event id
1 2010-04-03 03:00:00 1 1
2 2010-04-03 04:00:00 NA 1
3 2010-04-03 05:00:00 NA 1
4 2010-04-03 06:00:00 NA 1
5 2010-04-03 07:00:00 NA 1
6 2010-04-03 08:00:00 NA 1
7 2010-04-03 09:00:00 NA 1
8 2010-04-03 10:00:00 NA 1
9 2010-04-03 11:00:00 NA NA
10 2010-04-03 12:00:00 NA NA
11 2010-04-03 13:00:00 2 2
12 2010-04-03 14:00:00 NA 2
13 2010-04-03 17:00:00 NA 2
14 2010-04-03 18:00:00 NA 2
15 2010-04-03 19:00:00 NA 2
16 2010-04-03 20:00:00 NA 2
17 2010-04-03 21:00:00 3 3
18 2010-04-03 22:00:00 4 3
19 2010-04-03 23:00:00 NA 3
20 2010-04-04 00:00:00 NA 3
21 2010-04-04 01:00:00 NA 3
22 2010-04-04 02:00:00 NA 3
23 2010-04-04 03:00:00 NA 3
24 2010-04-04 04:00:00 NA 3
25 2010-04-04 05:00:00 NA NA
26 2010-04-04 06:00:00 NA NA
27 2010-04-04 07:00:00 NA NA
28 2010-04-04 08:00:00 5 4
29 2010-04-04 09:00:00 NA 4
30 2010-04-04 10:00:00 6 4
Ideally, this would be accomplished in a dplyr framework.
library(lubridate)
library(tidyverse)
dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),
max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)) %>%
mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
Events <- dat %>%
#Get only the rows with events
filter(!is.na(event)) %>%
#Get the duration of time between events
mutate(
EventLag = difftime(datetime, lag(datetime), units = "hours")) %>% # explicit units keep the > 7 test in hours
## Keep only events that occurred more than 7 hrs after the previous one,
## or whose lag is NA (i.e. the first event, which has no predecessor).
## In the real dataset the first point may never be an event, in which
## case the is.na() check could be dropped.
filter(as.numeric(EventLag) > 7 | is.na(EventLag)) %>%
as.data.frame()
## You now have all of the events that are of interest (i.e. those that occurred outside of the 7 hr buffer)
## Give the events a new ID so there are no gaps
## Join them with the rest of the datetime stamps
Events <- Events %>%
mutate(ID = row_number()) %>%
dplyr::select(datetime, ID)
## Expand each event by 7 hrs
Events <- Events %>%
group_by(ID) %>%
do(data.frame(ID= .$ID, datetime= seq(.$datetime, .$datetime + hours(7), by = '1 hour'), stringsAsFactors=FALSE)) %>%
as.data.frame()
## Join with initial data by datetime
DatJoin <- dat %>%
left_join(Events, by = "datetime")
DatJoin
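As a quick sanity check (using the id column built into dat above), the derived ID should agree with the expected identifier everywhere, NAs included:
identical(is.na(DatJoin$ID), is.na(DatJoin$id)) &&
  all(DatJoin$ID == DatJoin$id, na.rm = TRUE)
# [1] TRUE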
I have a data frame df1 with a datetime column in UTC. I need to merge this data frame with the data frame df2 by the column datetime. My problem is that df2 is in Europe/Paris time, and when I transform df2$datetime from Europe/Paris to UTC, I lose or duplicate data at the moments of the time change between either summer/winter or winter/summer. As an example:
df1<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var1= c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime<- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz= "UTC")
df2<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var2=c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime<- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I convert df2$datetime from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to transform df2$datetime from Europe/Paris to UTC that allows me to merge the two data frames without this problem of lost or duplicated data? I don't understand why I have to lose or duplicate info in df2.
Is the transformation I applied to df2$datetime the right one for merging this data frame with df1? What I've done so far to work around it is to add a new row in df2 on 2016-10-30 at 01:00:00, equal to the mean of the values at 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one of the rows on 2017-03-26 at 00:00:00.
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # local time in GMT+2; corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # local time in GMT+2; corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # local time in GMT+2; corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # local time in GMT+2; corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # local time in GMT+2; corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # local time in GMT+1; corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # local time in GMT+1; corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # local time in GMT+1; corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # local time in GMT+1; corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # local time in GMT+1; corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # local time in GMT+1; corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # local time in GMT+1; corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # local time in GMT+1; corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # local time in GMT+2; corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # local time in GMT+2; corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # local time in GMT+2; corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I apply the R code df2$datetime<-with_tz(df2$datetime,"UTC"), this happens:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplication at "02:00:00" on 30th October and a gap on 26th March between "01:00" and "03:00", I would get this with the R code df2$datetime<-with_tz(df2$datetime,"UTC"):
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change "00:00:00" to "01:00:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
# As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
# When you now assign the time zone, the content of df2 is already changed
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
# It looks like your recorded times don't observe daylight saving time,
# so you have to use e.g. "Etc/GMT-1" instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
# You can use dst() to see whether a datetime in a given time zone is on daylight saving time
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
# If your recorded times do observe daylight saving time, then you really do have a gap and an overlap.
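With the fixed-offset parsing in place, the merge that motivated the question becomes a plain one-to-one join (a sketch, assuming df1 from the question is still loaded):
df2$datetime <- with_tz(df2$datetimeG1, "UTC")
merged <- merge(df1, df2[, c("datetime", "Var2")], by = "datetime")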
My dataset is a bit noisy at a 1-min interval, so I'd like to take an average every hour, from 25 min to 35 min past, to stand for that hour at the 30-min mark.
For example, an average at 00:30 (averaging 00:25 to 00:35), 01:30 (averaging 01:25 to 01:35), 02:30 (averaging 02:25 to 02:35), etc.
Can you suggest a good way to do this in R?
Here is my dataset:
set.seed(1)
DateTime <- seq(as.POSIXct("2010/1/1 00:00"), as.POSIXct("2010/1/5 00:00"), "min")
value <- rnorm(n=length(DateTime), mean=100, sd=1)
df <- data.frame(DateTime, value)
Thanks a lot.
Here's one way
library(dplyr)
df %>%
filter(between(as.numeric(format(DateTime, "%M")), 25, 35)) %>%
group_by(hour=format(DateTime, "%Y-%m-%d %H")) %>%
summarise(value=mean(value))
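The same idea reads a little more directly with lubridate's helpers (an equivalent sketch):
library(lubridate)
df %>%
  filter(minute(DateTime) >= 25, minute(DateTime) <= 35) %>%
  group_by(hour = floor_date(DateTime, "hour")) %>%
  summarise(value = mean(value))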
I think that the existing answers are not general enough, as they do not take into account that an observation could fall within the window of more than one midpoint.
I would instead use shift from the data.table package.
library(data.table)
setDT(df)
First set the window based on the sequence you chose above. This calculates, for every row of your table, an average over eleven rows (the row itself plus the five minutes on either side):
df[, ave_val :=
Reduce('+',c(shift(value, 0:5L, type = "lag"),shift(value, 1:5L, type = "lead")))/11
]
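In more recent data.table versions, frollmean expresses the same centered 11-minute mean more compactly (a sketch; align = "center" centers the window on each row):
df[, ave_val := frollmean(value, n = 11, align = "center")]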
Then generate the midpoints you want:
mids <- seq(as.POSIXct("2010/1/1 00:00"), as.POSIXct("2010/1/5 00:00"), by = 60*60) + 30*60 # every hour starting at 0:30
Then filter accordingly:
setkey(df,DateTime)
df[J(mids)]
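If some of those midpoints have no exact match in the data (dropped minutes, say), a rolling join can snap each midpoint to the nearest observed timestamp:
df[J(mids), roll = "nearest"]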
Since you want to average on just a subset of each period, I think it makes sense to first subset the data.frame, then aggregate:
aggregate(
value~cbind(time=strftime(DateTime,'%Y-%m-%d %H:30:00')),
subset(df,{ m <- strftime(DateTime,'%M'); m>='25' & m<='35'; }),
mean
);
## time value
## 1 2010-01-01 00:30:00 99.82317
## 2 2010-01-01 01:30:00 100.58184
## 3 2010-01-01 02:30:00 99.54985
## 4 2010-01-01 03:30:00 100.47238
## 5 2010-01-01 04:30:00 100.05517
## 6 2010-01-01 05:30:00 99.96252
## 7 2010-01-01 06:30:00 99.79512
## 8 2010-01-01 07:30:00 99.06791
## 9 2010-01-01 08:30:00 99.58731
## 10 2010-01-01 09:30:00 100.27202
## 11 2010-01-01 10:30:00 99.60758
## 12 2010-01-01 11:30:00 99.92074
## 13 2010-01-01 12:30:00 99.65819
## 14 2010-01-01 13:30:00 100.04202
## 15 2010-01-01 14:30:00 100.04461
## 16 2010-01-01 15:30:00 100.11609
## 17 2010-01-01 16:30:00 100.08631
## 18 2010-01-01 17:30:00 100.41956
## 19 2010-01-01 18:30:00 99.98065
## 20 2010-01-01 19:30:00 100.07341
## 21 2010-01-01 20:30:00 100.20281
## 22 2010-01-01 21:30:00 100.86013
## 23 2010-01-01 22:30:00 99.68170
## 24 2010-01-01 23:30:00 99.68097
## 25 2010-01-02 00:30:00 99.58603
## 26 2010-01-02 01:30:00 100.10178
## 27 2010-01-02 02:30:00 99.78766
## 28 2010-01-02 03:30:00 100.02220
## 29 2010-01-02 04:30:00 99.83427
## 30 2010-01-02 05:30:00 99.74934
## 31 2010-01-02 06:30:00 99.99594
## 32 2010-01-02 07:30:00 100.08257
## 33 2010-01-02 08:30:00 99.47077
## 34 2010-01-02 09:30:00 99.81419
## 35 2010-01-02 10:30:00 100.13294
## 36 2010-01-02 11:30:00 99.78352
## 37 2010-01-02 12:30:00 100.04590
## 38 2010-01-02 13:30:00 99.91061
## 39 2010-01-02 14:30:00 100.61730
## 40 2010-01-02 15:30:00 100.18539
## 41 2010-01-02 16:30:00 99.45165
## 42 2010-01-02 17:30:00 100.09894
## 43 2010-01-02 18:30:00 100.04131
## 44 2010-01-02 19:30:00 99.58399
## 45 2010-01-02 20:30:00 99.75524
## 46 2010-01-02 21:30:00 99.94079
## 47 2010-01-02 22:30:00 100.26533
## 48 2010-01-02 23:30:00 100.35354
## 49 2010-01-03 00:30:00 100.31141
## 50 2010-01-03 01:30:00 100.10709
## 51 2010-01-03 02:30:00 99.41102
## 52 2010-01-03 03:30:00 100.07964
## 53 2010-01-03 04:30:00 99.88183
## 54 2010-01-03 05:30:00 99.91112
## 55 2010-01-03 06:30:00 99.71431
## 56 2010-01-03 07:30:00 100.48585
## 57 2010-01-03 08:30:00 100.35096
## 58 2010-01-03 09:30:00 100.00060
## 59 2010-01-03 10:30:00 100.03858
## 60 2010-01-03 11:30:00 99.95713
## 61 2010-01-03 12:30:00 99.18699
## 62 2010-01-03 13:30:00 99.49216
## 63 2010-01-03 14:30:00 99.37762
## 64 2010-01-03 15:30:00 99.68642
## 65 2010-01-03 16:30:00 99.84921
## 66 2010-01-03 17:30:00 99.84039
## 67 2010-01-03 18:30:00 99.90989
## 68 2010-01-03 19:30:00 99.95421
## 69 2010-01-03 20:30:00 100.01276
## 70 2010-01-03 21:30:00 100.14585
## 71 2010-01-03 22:30:00 99.54110
## 72 2010-01-03 23:30:00 100.02526
## 73 2010-01-04 00:30:00 100.04476
## 74 2010-01-04 01:30:00 99.61132
## 75 2010-01-04 02:30:00 99.94782
## 76 2010-01-04 03:30:00 99.44863
## 77 2010-01-04 04:30:00 99.91305
## 78 2010-01-04 05:30:00 100.25428
## 79 2010-01-04 06:30:00 99.86279
## 80 2010-01-04 07:30:00 99.63516
## 81 2010-01-04 08:30:00 99.65747
## 82 2010-01-04 09:30:00 99.57810
## 83 2010-01-04 10:30:00 99.77603
## 84 2010-01-04 11:30:00 99.85140
## 85 2010-01-04 12:30:00 100.82995
## 86 2010-01-04 13:30:00 100.26138
## 87 2010-01-04 14:30:00 100.25851
## 88 2010-01-04 15:30:00 99.92685
## 89 2010-01-04 16:30:00 100.00825
## 90 2010-01-04 17:30:00 100.24437
## 91 2010-01-04 18:30:00 99.62711
## 92 2010-01-04 19:30:00 99.93999
## 93 2010-01-04 20:30:00 99.82477
## 94 2010-01-04 21:30:00 100.15321
## 95 2010-01-04 22:30:00 99.88370
## 96 2010-01-04 23:30:00 100.06657
I am interested in calculating averages over specific time periods in a time series data set.
Given a time series like this:
dtm=as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240)
Count<-c(1:240)
DF<-data.frame(dtm,Count)
In the past I have been able to calculate daily averages with
DF$Day<-cut(DF$dtm,breaks="day")
Day_Avg<-aggregate(DF$Count~Day,DF,mean)
But now I am trying to cut up the day into specific time periods and I'm not sure how to set my "breaks".
As opposed to a daily average from 0:00 to 24:00, how, for example, could I get a noon-to-noon average?
Or, fancier still, how could I set up a noon-to-noon average excluding the night times of 7PM to 6AM (or, conversely, only including the daylight hours of 6AM to 7PM)?
xts is a perfect package for time series analysis:
library(xts)
originalTZ <- Sys.getenv("TZ")
Sys.setenv(TZ = "GMT")
data.xts <- as.xts(1:240, as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240))
head(data.xts)
## [,1]
## 2007-03-27 06:00:00 1
## 2007-03-27 07:00:00 2
## 2007-03-27 08:00:00 3
## 2007-03-27 09:00:00 4
## 2007-03-27 10:00:00 5
## 2007-03-27 11:00:00 6
# You can filter data using ISO-style subsetting
data.xts.filtered <- data.xts["T06:00/T19:00"]
# You can use built-in functions to apply any function FUN to daily data.
apply.daily(data.xts.filtered, mean)
## [,1]
## 2007-03-27 18:00:00 7.5
## 2007-03-28 18:00:00 31.5
## 2007-03-29 18:00:00 55.5
## 2007-03-30 18:00:00 79.5
## 2007-03-31 18:00:00 103.5
## 2007-04-01 18:00:00 127.5
## 2007-04-02 18:00:00 151.5
## 2007-04-03 18:00:00 175.5
## 2007-04-04 18:00:00 199.5
## 2007-04-05 18:00:00 223.5
# OR
# now let's say you want to find noon to noon average.
period.apply(data.xts, c(0, which(.indexhour(data.xts) == 11)), FUN = mean)
## [,1]
## 2007-03-27 11:00:00 3.5
## 2007-03-28 11:00:00 18.5
## 2007-03-29 11:00:00 42.5
## 2007-03-30 11:00:00 66.5
## 2007-03-31 11:00:00 90.5
## 2007-04-01 11:00:00 114.5
## 2007-04-02 11:00:00 138.5
## 2007-04-03 11:00:00 162.5
## 2007-04-04 11:00:00 186.5
## 2007-04-05 11:00:00 210.5
# now if you want to exclude time from 7 PM to 6 AM
data.xts.filtered <- data.xts[!data.xts %in% data.xts["T20:00/T05:00"]]
head(data.xts.filtered, 20)
## [,1]
## 2007-03-27 06:00:00 1
## 2007-03-27 07:00:00 2
## 2007-03-27 08:00:00 3
## 2007-03-27 09:00:00 4
## 2007-03-27 10:00:00 5
## 2007-03-27 11:00:00 6
## 2007-03-27 12:00:00 7
## 2007-03-27 13:00:00 8
## 2007-03-27 14:00:00 9
## 2007-03-27 15:00:00 10
## 2007-03-27 16:00:00 11
## 2007-03-27 17:00:00 12
## 2007-03-27 18:00:00 13
## 2007-03-27 19:00:00 14
## 2007-03-28 06:00:00 25
## 2007-03-28 07:00:00 26
## 2007-03-28 08:00:00 27
## 2007-03-28 09:00:00 28
## 2007-03-28 10:00:00 29
## 2007-03-28 11:00:00 30
period.apply(data.xts.filtered, c(0, which(.indexhour(data.xts.filtered) == 11)), FUN = mean)
## [,1]
## 2007-03-27 11:00:00 3.50000
## 2007-03-28 11:00:00 17.78571
## 2007-03-29 11:00:00 41.78571
## 2007-03-30 11:00:00 65.78571
## 2007-03-31 11:00:00 89.78571
## 2007-04-01 11:00:00 113.78571
## 2007-04-02 11:00:00 137.78571
## 2007-04-03 11:00:00 161.78571
## 2007-04-04 11:00:00 185.78571
## 2007-04-05 11:00:00 209.78571
Sys.setenv(TZ = originalTZ)
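As an aside, apply.daily is a thin convenience wrapper; the explicit form below is equivalent and makes the endpoints machinery visible (the same idea underlies apply.weekly, apply.monthly, etc.):
period.apply(data.xts.filtered, endpoints(data.xts.filtered, "days"), mean)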
Let me quickly repeat your code.
dtm <- as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240)
Count <- c(1:240)
DF<-data.frame(dtm,Count)
DF$Day<-cut(DF$dtm,breaks="day")
Day_Avg<-aggregate(DF$Count~Day,DF,mean)
If you offset each time by 12 hours in the function call, you can still use cut with breaks on "day". I want to label each period by the day the noon-to-noon window starts on, so I subtract 12 hours.
# Get twelve hours in seconds
timeOffset <- 60*60*12
# Subtract the offset to get the start day of the noon to noon
DF$Noon_Start_Day <- cut((DF$dtm - timeOffset), breaks="day")
# Get the mean
NtN_Avg <- aggregate(DF$Count ~ Noon_Start_Day, DF, mean)
One way to exclude certain hours is to convert the dates to POSIXlt. Then you can access hour among other things.
# Indicate which times are good (use whatever boolean test is needed here)
goodTimes <- !(as.POSIXlt(DF$dtm)$hour >= 19) & !(as.POSIXlt(DF$dtm)$hour <= 6)
new_NtN_Avg <- aggregate(Count ~ Noon_Start_Day, data=subset(DF, goodTimes), mean)
I found some help at this question on stackoverflow: r-calculate-means-for-subset-of-a-group
The noon-to-noon problem can easily be solved numerically. The key is that the start of a (GMT) day has a time_t value that is always divisible by 86400. This is specified by POSIX. For example, see: http://en.wikipedia.org/wiki/Unix_time
cuts <- unique(as.numeric(DF$dtm) %/% (86400/2)) * (86400/2) # half-days
cuts <- c(cuts, cuts[length(cuts)]+(86400/2)) # One more at the end
cuts <- as.POSIXct(cuts, tz="GMT", origin="1970-01-01") # Familiar format
DF$halfday <- cut(DF$dtm, cuts) # This is the cut you want.
Halfday_Avg <- aggregate(Count~halfday, data=DF, FUN=mean)
Halfday_Avg
## halfday Count
## 1 2007-03-27 00:00:00 3.5
## 2 2007-03-27 12:00:00 12.5
## 3 2007-03-28 00:00:00 24.5
## 4 2007-03-28 12:00:00 36.5
## 5 2007-03-29 00:00:00 48.5
## 6 2007-03-29 12:00:00 60.5
## 7 2007-03-30 00:00:00 72.5
## 8 2007-03-30 12:00:00 84.5
## 9 2007-03-31 00:00:00 96.5
## 10 2007-03-31 12:00:00 108.5
## 11 2007-04-01 00:00:00 120.5
## 12 2007-04-01 12:00:00 132.5
## 13 2007-04-02 00:00:00 144.5
## 14 2007-04-02 12:00:00 156.5
## 15 2007-04-03 00:00:00 168.5
## 16 2007-04-03 12:00:00 180.5
## 17 2007-04-04 00:00:00 192.5
## 18 2007-04-04 12:00:00 204.5
## 19 2007-04-05 00:00:00 216.5
## 20 2007-04-05 12:00:00 228.5
## 21 2007-04-06 00:00:00 237.5
Now extend this to solve the rest of the problem, here using the 6AM-7PM time range.
intraday <- as.numeric(DF$dtm) %% 86400
# Subset DF by the chosen range
New_Avg <- aggregate(Count~halfday, data=DF[intraday >= 6*3600 & intraday <= 19*3600,], FUN=mean)
New_Avg
## halfday Count
## 1 2007-03-27 00:00:00 3.5
## 2 2007-03-27 12:00:00 10.5
## 3 2007-03-28 00:00:00 27.5
## 4 2007-03-28 12:00:00 34.5
## 5 2007-03-29 00:00:00 51.5
## 6 2007-03-29 12:00:00 58.5
## 7 2007-03-30 00:00:00 75.5
## 8 2007-03-30 12:00:00 82.5
## 9 2007-03-31 00:00:00 99.5
## 10 2007-03-31 12:00:00 106.5
## 11 2007-04-01 00:00:00 123.5
## 12 2007-04-01 12:00:00 130.5
## 13 2007-04-02 00:00:00 147.5
## 14 2007-04-02 12:00:00 154.5
## 15 2007-04-03 00:00:00 171.5
## 16 2007-04-03 12:00:00 178.5
## 17 2007-04-04 00:00:00 195.5
## 18 2007-04-04 12:00:00 202.5
## 19 2007-04-05 00:00:00 219.5
## 20 2007-04-05 12:00:00 226.5
I just have a data file with one column of time series:
'2012-02-01 17:42:44'
'2012-02-01 17:42:44'
'2012-02-01 17:42:44'
...
I want to split the data up such that I have a count at the top of each hour. Say:
'2012-02-01 17:00:00' 20
'2012-02-01 18:00:00' 30
The '20' and '30' represent the number of time series entries for that hour period. I want to be able to graph the time vs. that count. How can I do this with R?
Here is my current line graph plot.
library(ggplot2)
req <- read.table("times1.dat")
summary(req)
da <- req$V2
db <- req$V1
time <- as.POSIXct(db)
png('time_data_errs.png', width=800, height=600)
gg <- qplot(time, da) + geom_line()
print(gg)
dev.off()
It sounds like you want to use cut to figure out how many values occur within an hour.
It's generally helpful if you can provide some sample data. Here's some:
set.seed(1) # So you can get the same numbers as I do
MyDates <- ISOdatetime(2012, 1, 1, 0, 0, 0, tz = "GMT") + sample(1:27000, 500)
head(MyDates)
# [1] "2012-01-01 01:59:29 GMT" "2012-01-01 02:47:27 GMT" "2012-01-01 04:17:46 GMT"
# [4] "2012-01-01 06:48:39 GMT" "2012-01-01 01:30:45 GMT" "2012-01-01 06:44:13 GMT"
You can use table and cut (with the argument breaks="hour"; see ?cut.Date for more info) to find the frequencies per hour.
MyDatesTable <- table(cut(MyDates, breaks="hour"))
MyDatesTable
#
# 2012-01-01 00:00:00 2012-01-01 01:00:00 2012-01-01 02:00:00 2012-01-01 03:00:00
# 59 73 74 83
# 2012-01-01 04:00:00 2012-01-01 05:00:00 2012-01-01 06:00:00 2012-01-01 07:00:00
# 52 62 64 33
# Or a data.frame if you prefer
data.frame(MyDatesTable)
# Var1 Freq
# 1 2012-01-01 00:00:00 59
# 2 2012-01-01 01:00:00 73
# 3 2012-01-01 02:00:00 74
# 4 2012-01-01 03:00:00 83
# 5 2012-01-01 04:00:00 52
# 6 2012-01-01 05:00:00 62
# 7 2012-01-01 06:00:00 64
# 8 2012-01-01 07:00:00 33
Finally, here's a line plot of the MyDatesTable object:
plot(MyDatesTable, type="l", xlab="Time", ylab="Freq")
cut can handle a range of time intervals. For example, if you wanted to tabulate for every 30 minutes, you can easily adapt the breaks argument to handle that:
data.frame(table(cut(MyDates, breaks = "30 mins")))
# Var1 Freq
# 1 2012-01-01 00:00:00 22
# 2 2012-01-01 00:30:00 37
# 3 2012-01-01 01:00:00 38
# 4 2012-01-01 01:30:00 35
# 5 2012-01-01 02:00:00 32
# 6 2012-01-01 02:30:00 42
# 7 2012-01-01 03:00:00 39
# 8 2012-01-01 03:30:00 44
# 9 2012-01-01 04:00:00 25
# 10 2012-01-01 04:30:00 27
# 11 2012-01-01 05:00:00 33
# 12 2012-01-01 05:30:00 29
# 13 2012-01-01 06:00:00 29
# 14 2012-01-01 06:30:00 35
# 15 2012-01-01 07:00:00 33
Update
Since you were trying to plot with ggplot2, here's one approach (not sure if it is the best since I usually use base R's graphics when I need to).
Create a data.frame of the table (as demonstrated above), add a dummy "group" variable, and plot as follows:
MyDatesDF <- data.frame(MyDatesTable, grp = 1)
ggplot(MyDatesDF, aes(Var1, Freq)) + geom_line(aes(group = grp))
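If you prefer bars to a line for hourly counts, geom_col on the same data frame works too:
ggplot(MyDatesDF, aes(Var1, Freq)) + geom_col()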