How to subset an xts series on each day of the week in R

I understand similar questions have been answered. My problem is that I have time series data for 2033 days at 15-minute intervals. I would like to plot the series for each day of the week (Mon-Sun), for instance what an average Monday looks like.
I tried to subset using .indexwday, but the series for the day starts at 13:00.
I am kind of novice, so please let me know if I need to provide additional details.
Sample data (xts)
2008-01-01 00:00:00 16
2008-01-01 00:15:00 56
2008-01-01 00:30:00 136
2008-01-01 00:45:00 170
2008-01-01 01:00:00 132
....
2013-07-25 22:30:00 95
2013-07-25 22:45:00 82
2013-07-25 23:00:00 66
2013-07-25 23:15:00 65
2013-07-25 23:30:00 66
2013-07-25 23:45:00 46
The plot below should make clearer what I want (it shows the average of all Mondays).
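For reference, .indexwday subsetting of an xts object typically looks like this (a sketch on synthetic data; .indexwday numbers weekdays 0-6 starting with Sunday, so Monday is 1):

```r
library(xts)
# synthetic 15-minute series covering two weeks (2008-01-01 is a Tuesday)
ix <- seq(as.POSIXct("2008-01-01 00:00:00", tz = "UTC"), by = "15 min",
          length.out = 96 * 14)
x <- xts(seq_along(ix), ix)
# keep only the observations that fall on a Monday
mondays <- x[.indexwday(x) == 1]
```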

Here's another solution, which does not depend on packages other than xts and zoo.
# example data
ix <- seq(as.POSIXct("2008-01-01"), as.POSIXct("2013-07-26"), by="15 min")
set.seed(21)
x <- xts(sample(200, length(ix), TRUE), ix)
# aggregate by 15-minute observations for each weekday
a <- lapply(split.default(x, format(index(x), "%A")),                   # split by weekday
            function(x) aggregate(x, format(index(x), "%H:%M"), mean))  # aggregate by 15-min bucket
# merge aggregated data into one zoo object, ordering columns
z <- do.call(merge, a)[,c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")]
# convert index to POSIXct to make plotting easier
index(z) <- as.POSIXct(index(z), format="%H:%M")
# plot
plot(z, type="l", nc=1, ylim=range(z), main="Average daily volume", las=1)
Setting ylim forces every panel to use the same y-axis range. Otherwise each panel's range would depend on its own series, which can make the panels difficult to compare if the values vary greatly.

Try this:
#Get necessary packages
install.packages("lubridate")
install.packages("magrittr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("scales")
#Import packages
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(ggplot2, warn.conflicts = FALSE)
library(scales, warn.conflicts = FALSE)
#Getting the data
tstart = as.POSIXct('2008-01-01 00:00:00')
tend = as.POSIXct('2013-07-25 23:45:00')
ttimes <- seq(from = tstart,to=tend,by='15 mins')
tvals <- sample(seq(1,200),length(ttimes),T)
tsdata <- data.frame(Dates=ttimes,Vals=tvals)
tsdata <- tsdata %>%
  mutate(DayofWeek = wday(Dates, label = TRUE),
         Hours = as.POSIXct(strftime(Dates, format = "%H:%M:%S"), format = "%H:%M:%S"))
#Pick a day at a time. I am using Mondays for this example.
tsdata_monday <- tsdata %>%
  filter(DayofWeek == 'Mon') %>%
  group_by(Hours) %>%
  summarise(meanVals = mean(Vals)) %>%
  as.data.frame()
#Plotting the graph of mean values versus times for Monday:
ggplot(tsdata_monday) + aes(x=Hours,y=meanVals) + geom_line() + scale_x_datetime(breaks=date_breaks("4 hour"), labels=date_format("%H:%M"))
#If you want, you can go ahead and plot all the days, but keep in mind
#that this does not look good at all: too many panels for the plot
#window to display nicely.
alltsdata <- tsdata %>%
  group_by(DayofWeek, Hours) %>%
  summarise(MeanVals = mean(Vals)) %>%
  as.data.frame()
ggplot(alltsdata) + aes(x=Hours,y=MeanVals) + geom_line() + scale_x_datetime(breaks=date_breaks("4 hour"), labels=date_format("%H:%M")) + facet_grid(.~DayofWeek)
I recommend plotting one day at a time, or using a for loop or one of the apply variants to generate the plots.
Also, when filtering by day of the week, keep in mind that the day labels are abbreviated as follows:
unique(tsdata$DayofWeek)
[1] Tues Wed Thurs Fri Sat Sun Mon
Hope it helps.
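The one-plot-per-day loop suggested above can be sketched like this (synthetic data shaped like tsdata; the exact day labels depend on your lubridate version and locale):

```r
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(scales)
set.seed(1)
# two weeks of synthetic 15-minute data, shaped like tsdata above
ttimes <- seq(as.POSIXct("2008-01-01 00:00:00"), by = "15 min", length.out = 96 * 14)
tsdata <- data.frame(Dates = ttimes, Vals = sample(200, length(ttimes), TRUE)) %>%
  mutate(DayofWeek = wday(Dates, label = TRUE),
         Hours = as.POSIXct(strftime(Dates, format = "%H:%M:%S"), format = "%H:%M:%S"))
# build one ggplot object per weekday in a loop
plots <- lapply(levels(tsdata$DayofWeek), function(d) {
  daydata <- tsdata %>%
    filter(DayofWeek == d) %>%
    group_by(Hours) %>%
    summarise(meanVals = mean(Vals), .groups = "drop")
  ggplot(daydata, aes(x = Hours, y = meanVals)) +
    geom_line() +
    ggtitle(as.character(d)) +
    scale_x_datetime(breaks = date_breaks("4 hour"), labels = date_format("%H:%M"))
})
```

Printing plots[[i]] then shows one day at a time instead of cramming seven facets into one window.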

apply.daily does exactly what you want (assuming your data is called d.xts and is an xts object):
apply.daily(d.xts, sum)
Another solution would be to use aggregate:
aggregate(d.xts,as.Date(index(d.xts)),sum)
Note that the results are slightly different: apply.daily computes its periods from start(d.xts) to end(d.xts), whereas aggregate groups by calendar day, midnight to midnight.
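A small sketch on synthetic data showing the two calls side by side (note the different index types of the results: apply.daily keeps observation timestamps, aggregate uses Date):

```r
library(xts)
# a series that starts mid-day: 8 observations, every 6 hours from midday
ix <- seq(as.POSIXct("2018-01-01 12:00:00", tz = "UTC"), by = "6 hours",
          length.out = 8)
d.xts <- xts(rep(1, 8), ix)
# apply.daily indexes each period by its last observation timestamp
apply.daily(d.xts, sum)
# aggregate groups by calendar date and indexes the result by Date
aggregate(d.xts, as.Date(index(d.xts)), sum)
```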


Can I specify the dates and times of a time series in R?

I have a dataset that contains times and dates in the first column, and the stock prices in the second column.
I used the following format.
Time Price
2015-02-01 10:00 50
I want to turn this into a time series object. I tried the ts(data) function, but when I plot the data I cannot see the dates on the x-axis. I also tried ts(data, start=). Because some hours have missing prices and those hours are not included in my data set, setting a start date and frequency would make the plot misleading.
Here is the sample data that I have. It is called df.
time price
1 2013-05-01 00:00:00 124.30
2 2013-05-01 01:00:00 98.99
3 2013-05-01 02:00:00 64.00
4 2013-05-01 03:00:00 64.00
This is the code that I used
Time1 <- ts(df)
autoplot(Time1)
Also tried this,
Time1 <- zoo(Time_series_data[,2], order.by = Time_series_data[,1])
Time_n <- ts(Time1)
autoplot(Time1)
However, when I plot the graph with autoplot(Time1), the x-axis doesn't show the times I specified, only numbers from 0 to 4. I want a plot of a ts object with the dates on the x-axis and the values on the y-axis.
Is there any way to convert it to a time series object in R? Thanks.
Try the following:
Create some data using the nifty tribble function from the tibble package.
library(tibble)
df <- tribble(~time, ~price,
              "2013-05-01 00:00:00", 124.30,
              "2013-05-01 01:00:00", 98.99,
              "2013-05-01 02:00:00", 64.00,
              "2013-05-01 03:00:00", 64.00)
The time column is of character class and cannot be plotted in the usual way, so convert it using as.POSIXct. I'll use the dplyr package here, but that's optional.
library(dplyr)
df <- df %>%
  mutate(time = as.POSIXct(time))
Next, convert the data to a time series object. This requires the xts package, although I'm sure there are other options including zoo.
library(xts)
df.ts <- xts(df[, -1], order.by=df$time)
Now you can visualise the data.
plot(df.ts) # This should call the `plot.xts` method
And if you prefer ggplot2.
library(ggplot2)
autoplot(df.ts)

How to conduct timeseries analysis on half-hourly data?

I have the dataset below with half hourly timeseries data.
Date <- c("2018-01-01 08:00:00", "2018-01-01 08:30:00",
"2018-01-01 08:59:59","2018-01-01 09:29:59")
Volume <- c(195, 188, 345, 123)
Dataset <- data.frame(Date, Volume)
I would like to know how to read this dataframe in order to conduct time series analysis. How should I define starting and ending date and what the frequency will be?
I'm not sure what exactly you mean by "half-hourly data", since the timestamps aren't exactly on half hours (e.g. 08:59:59). In case you want to round them to half hours, we can adapt this solution to your case.
Dataset$Date <- as.POSIXlt(round(as.double(as.POSIXct(Dataset$Date))/(30*60))*(30*60),
                           origin = as.POSIXlt('1970-01-01'))
In case you don't want to round it just do
Dataset$Date <- as.POSIXct(Dataset$Date)
Basically, your Date column should be in a date-time format such as "POSIXlt", so that e.g.:
> class(Dataset$Date)
[1] "POSIXlt" "POSIXt"
Then we can convert the data into time series with xts.
library(xts)
Dataset.xts <- xts(Dataset$Volume, order.by=Dataset$Date)
Result (rounded case):
> Dataset.xts
[,1]
2018-01-01 08:00:00 195
2018-01-01 08:30:00 188
2018-01-01 09:00:00 345
2018-01-01 09:30:00 123
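If lubridate is an option, round_date() does the same half-hour rounding without the manual epoch arithmetic (a sketch on the question's timestamps):

```r
library(lubridate)
# the question's timestamps, not all exactly on half hours
Date <- c("2018-01-01 08:00:00", "2018-01-01 08:30:00",
          "2018-01-01 08:59:59", "2018-01-01 09:29:59")
# round each timestamp to the nearest 30-minute boundary
rounded <- round_date(ymd_hms(Date), "30 minutes")
format(rounded, "%H:%M:%S")
```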
You can use dplyr and lubridate from the tidyverse to get the data into a POSIXct date format, then convert it to a time series with ts, within which you can define the parameters.
library(dplyr)
library(lubridate)
Dataset2 <- Dataset %>%
  mutate(Date = as.character(Date),
         Date = ymd_hms(Date)) %>%
  ts(start = c(2018, 1), end = c(2018, 2), frequency = 1)
Try ?ts for more details on the parameters. Personally, I think zoo and xts provide a better framework for time series analysis.
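For reference, if you do stick with ts(), half-hourly data with a daily cycle is usually given frequency = 48 (48 half-hours per day); this is a sketch with made-up values:

```r
# two days of half-hourly values; start = c(1, 1) means day 1, first half-hour
vals <- rep(c(195, 188, 345, 123), 24)
half.hourly <- ts(vals, start = c(1, 1), frequency = 48)
frequency(half.hourly)
```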

How to merge a large dataset and a small dataset on POSIXct and Date respectively?

Subject
I have two (simplified) datasets:
A dataset of 500 observations of some.value every hour (date.time variable as POSIXct)
A dataset of 10 daily temperatures (date variable as Date)
The objective is to add the temperature of the second dataset as a new variable to the first dataset where the variable date.time corresponds to the date variable.
I tried a data.table solution using setkey() and roll="nearest" according to : R – How to join two data frames by nearest time-date?
Unfortunately the temperature that gets merged is always the same value for the entire merged dataset.
A simplified example
Here is example code that illustrates my problem and my attempted solution:
Setting random seed
set.seed(10)
Generating the two datasets
library(lubridate)
observations <- data.frame(date.time = seq(from = ymd_hms("2017-02-01 00:00:00"),
                                           length.out = 500, by = 60*60),
                           some.value = runif(500, 0.0, 1.0))
daily.temperature <- data.frame(date = seq(from = as.Date("2017-02-01"),
                                           length.out = 10, by = 1),
                                temperature = runif(10, 10, 40))
Solution attempt using data.tables and roll="nearest"
# converting dataframes to datatables
library(data.table)
observations <- as.data.table(observations)
daily.temperature <- as.data.table(daily.temperature)
# setting the keys of the two datasets
setkey(observations,date.time)
setkey(daily.temperature,date)
# Combining the datasets
combined <- daily.temperature[observations, roll = "nearest" ]
combined
Note that the temperature variable in the combined dataset is always the same regardless of date.
Notes regarding the unsimplified (real) problem:
In my real problem the observations are recorded every minute instead of every hour.
In my real problem the daily.temperature dataset does not cover the entire range of observations. In that case, adding 'NA' or nothing at all as the temperature would be fine.
Do you want something like this?
set.seed(10)
library(dplyr)
library(lubridate)
observations <- data.frame(date.time = seq(from = ymd_hms("2017-02-01 00:00:00"),
                                           length.out = 500, by = 60*60),
                           some.value = runif(500, 0.0, 1.0))
daily.temperature <- data.frame(date = seq(from = as.Date("2017-02-01"),
                                           length.out = 10, by = 1),
                                temperature = runif(10, 10, 40))
observations$date <- as.Date(observations$date.time)
combined <- left_join(observations, daily.temperature, by = "date")
> head(combined)
date.time some.value date temperature
1 2017-02-01 00:00:00 0.8561467 2017-02-01 38.64702
2 2017-02-01 01:00:00 0.7820957 2017-02-01 38.64702
3 2017-02-01 02:00:00 0.2443390 2017-02-01 38.64702
4 2017-02-01 03:00:00 0.3138552 2017-02-01 38.64702
5 2017-02-01 04:00:00 0.1284753 2017-02-01 38.64702
6 2017-02-01 05:00:00 0.9299472 2017-02-01 38.64702
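For completeness, the data.table roll join from the question can also be made to work. The likely cause of the constant temperature is a key-type mismatch: POSIXct counts seconds since the epoch while Date counts days, so "nearest" degenerates when the two keys are compared. A hedged sketch of the fix is to put both keys on the same scale first:

```r
library(data.table)
library(lubridate)
set.seed(10)
observations <- data.table(
  date.time = seq(from = ymd_hms("2017-02-01 00:00:00"), length.out = 500, by = 60*60),
  some.value = runif(500, 0.0, 1.0))
daily.temperature <- data.table(
  date = seq(from = as.Date("2017-02-01"), length.out = 10, by = 1),
  temperature = runif(10, 10, 40))
# convert the Date key to POSIXct (anchored at midday) so both keys are in seconds
daily.temperature[, date := as.POSIXct(paste(date, "12:00:00"), tz = "UTC")]
setkey(observations, date.time)
setkey(daily.temperature, date)
combined <- daily.temperature[observations, roll = "nearest"]
```

Note that roll = "nearest" will still fill in the closest temperature outside the covered range rather than NA, so it does not by itself address the second note about the real problem.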

ggplot2 and chron barplot of time data scale_x_chron

I have a number of times and want to plot the frequency of each time in a barplot.
library(ggplot2)
library(chron)
test <- data.frame(times = c(rep("7:00:00",4), rep("8:00:00",3),
rep("12:00:00",1), rep("13:00:00",5)))
test$times <- times(test$times)
test
times
1 07:00:00
2 07:00:00
3 07:00:00
4 07:00:00
5 08:00:00
6 08:00:00
7 08:00:00
8 12:00:00
9 13:00:00
10 13:00:00
11 13:00:00
12 13:00:00
13 13:00:00
The value of binwidth is chosen to represent minutes:
p <- ggplot(test, aes(x = times)) + geom_bar(binwidth=1/24/60)
p + scale_x_chron(format="%H:%M")
As you can see, the x-axis scale is shifted by plus one hour:
I have the feeling it has something to do with the timezone, but I can't really place it:
Sys.timezone()
[1] "CET"
Edit:
Thanks #shadow for comment
UPDATE:
If I run Sys.setenv(TZ='GMT') first, it works perfectly. The problem is in the times() function: it automatically sets the timezone to GMT, and when plotting the x-axis, ggplot notices that my system timezone is CET and adds one hour on the plot.
If I set my system timezone to GMT, ggplot doesn't add the hour.
The problem is that times(...) assumes the timezone is GMT, and then ggplot compensates for your actual timezone. This is fair enough: times are meaningless unless you specify timezone. The bigger problem is that it does not seem possible to tell times(...) what the actual timezone is (if someone else knows how to do this I'd love to know).
A workaround is to use POSIXct and identify your timezone (mine is EST).
test <- data.frame(times = c(rep("7:00:00",4), rep("8:00:00",3),
rep("12:00:00",1), rep("13:00:00",5)))
test$times <- as.POSIXct(test$times,format="%H:%M:%S",tz="EST")
p <- ggplot(test, aes(x = times)) + geom_bar(binwidth=60,width=.01)
binwidth=60 is 60 seconds.
It has nothing to do with timezones; the only problem is that in format, %m represents the month and %M represents the minute. So the following will work:
p + scale_x_chron(format="%H:%M")

What is the best method to bin intraday volume figures from a stock price timeseries using XTS / ZOO etc in R?

For instance, let's say you have ~10 years of 1-minute data for the volume of instrument x, as follows (in xts format), from 9:30am to 4:30pm:
Date.Time Volume
2001-01-01 09:30:00 1200
2001-01-01 09:31:00 1110
2001-01-01 09:32:00 1303
All the way through to:
2010-12-20 16:28:00 3200
2010-12-20 16:29:00 4210
2010-12-20 16:30:00 8303
I would like to:
Get the average volume at each minute for the entire series (ie average volume over all 10 years at 9:30, 9:31, 9:32...16:28, 16:29, 16:30)
How should I best go about:
Aggregating the data into one minute buckets
Getting the average of those buckets
Reconstituting those "average" buckets back to a single xts/zoo time series?
I've had a good poke around with aggregate, sapply, period.apply functions etc, but just cannot seem to "bin" the data correctly.
It's easy enough to solve this with a loop, but that is very slow. I'd prefer to avoid an explicit loop and use a function backed by compiled code (i.e. an xts-based solution).
Can anyone offer some advice / a solution?
Thanks so much in advance.
First, let's create some test data:
library(xts) # also pulls in zoo
library(timeDate)
library(chron) # includes times class
# test data
x <- xts(1:3, timeDate(c("2001-01-01 09:30:00", "2001-01-01 09:31:00",
"2001-01-02 09:30:00")))
1) aggregate.zoo. Now try converting it to times class and aggregating using this one-liner:
aggregate(as.zoo(x), times(format(time(x), "%H:%M:%S")), mean)
1a) aggregate.zoo (variation). Or this variation, which converts the shorter aggregated series to times to avoid doing the conversion on the longer original series:
ag <- aggregate(as.zoo(x), format(time(x), "%H:%M:%S"), mean)
zoo(coredata(ag), times(time(ag)))
2) tapply. An alternative would be tapply which is likely faster:
ta <- tapply(coredata(x), format(time(x), "%H:%M:%S"), mean)
zoo(unname(ta), times(names(ta)))
EDIT: simplified (1) and added (1a) and (2)
Here is a solution with ddply, but you could probably also use sqldf, tapply, aggregate, by, etc.
# Sample data
minutes <- 10 * 60
days <- 250 * 10
d <- seq.POSIXt(
ISOdatetime( 2011,01,01,09,00,00, "UTC" ),
by="1 min", length=minutes
)
d <- outer( d, (1:days) * 24*3600, `+` )
d <- sort(d)
library(xts)
d <- xts( round(100*rlnorm(length(d))), d )
# Aggregate
library(plyr)
d <- data.frame(
minute=format(index(d), "%H:%M"),
value=coredata(d)
)
d <- ddply(
d, "minute",
summarize,
value=mean(value, na.rm=TRUE)
)
# Convert to zoo or xts
zoo(x=d$value, order.by=d$minute) # The index does not have to be a date or time
xts(x=d$value, order.by=as.POSIXct(sprintf("2012-01-01 %s:00",d$minute), "%Y-%m-%d %H:%M:%S") )
