In Julia, we can create a TimeArray with the following code:
d = [date(1980,1,1):date(2015,1,1)];
t = TimeArray(d,rand(length(d)),["test"])
This would give us daily data. What about getting quarterly or yearly time series?
Simply use the optional step argument of a range in combination with a Period type such as Month or Year:
julia> [Date(1980,1,1):Month(3):Date(2015,1,1)]
141-element Array{Date{ISOCalendar},1}:
1980-01-01
1980-04-01
1980-07-01
1980-10-01
1981-01-01
1981-04-01
...
And change the step as necessary
julia> [Date(1980,1,1):Year(1):Date(2015,1,1)]
36-element Array{Date{ISOCalendar},1}:
1980-01-01
1981-01-01
1982-01-01
...
0.3.x vs 0.4.x
In version 0.3.x, the Dates module is provided by the Dates package; in version 0.4.x, the Dates module is built in. Another (currently) subtle difference is that Year and Month must be accessed as Dates.Year and Dates.Month in version 0.4.x.
I know this question is a bit old, but it's worth adding that there is another time series package called Temporal* that has this functionality available.
Here's some example usage:
using Temporal, Base.Dates
date_array = collect(today()-Day(365):Day(1):today())
random_walk = cumsum(randn(length(date_array))) + 100.0
Construct the time series object (type TS). The last argument gives the column names; if it is not given, default column names are autogenerated.
ts_data = TS(random_walk, date_array, :RandomWalk)
# Index RandomWalk
# 2016-08-24 99.8769
# 2016-08-25 99.1643
# 2016-08-26 98.8918
# 2016-08-27 97.7265
# 2016-08-28 97.9675
# 2016-08-29 97.7151
# 2016-08-30 97.0279
# ⋮
# 2017-08-17 81.2998
# 2017-08-18 82.0658
# 2017-08-19 82.1941
# 2017-08-20 81.9021
# 2017-08-21 81.8163
# 2017-08-22 81.5406
# 2017-08-23 81.2229
# 2017-08-24 79.2867
Get the last observation of every quarter (similar logic exists for weeks, months, and years using eow, eom, and eoy respectively):
eoq(ts_data) # get the last observation at every quarter
# 4x1 Temporal.TS{Float64,Date}: 2016-09-30 to 2017-06-30
# Index RandomWalk
# 2016-09-30 88.5629
# 2016-12-31 82.1014
# 2017-03-31 84.9065
# 2017-06-30 92.1997
You can also use functions to aggregate the data by the same kinds of periods given above.
collapse(ts_data, eoq, fun=mean) # get the average value every quarter
# 4x1 Temporal.TS{Float64,Date}: 2016-09-30 to 2017-06-30
# Index RandomWalk
# 2016-09-30 92.5282
# 2016-12-31 86.8291
# 2017-03-31 89.1391
# 2017-06-30 90.3982
* (Disclaimer: I'm the package author.)
Quarterly isn't supported yet, but other time periods such as week, month, and year are. There is a method called collapse that is used to convert a TimeArray to a larger time frame.
d = [Date(1980,1,1):Date(2015,1,1)];
t = TimeArray(d,rand(length(d)),["test"])
c = collapse(t, last, period=year)
This returns the following:
36x1 TimeArray{Float64,1} 1980-12-31 to 2015-01-01
test
1980-12-31 | 0.94
1981-12-31 | 0.37
1982-12-31 | 0.12
1983-12-31 | 0.64
⋮
2012-12-31 | 0.43
2013-12-31 | 0.81
2014-12-31 | 0.88
2015-01-01 | 0.55
Also, note that date has been deprecated in favor of Date, as a new, updated package now provides the date/time functions underneath.
I am trying to convert the timestamps in the stock data from the Google Finance API to a more usable datetime format.
I have used data.table::fread to read the data here:
fread(<url>)
datetime open high low close volume
1: a1497619800 154.230 154.2300 154.2300 154.2300 500
2: 1 153.720 154.3200 153.7000 154.2500 1085946
3: 2 153.510 153.8000 153.2000 153.7700 34882
4: 3 153.239 153.4800 153.1400 153.4800 24343
5: 4 153.250 153.3000 152.9676 153.2700 20212
As you can see, the "datetime" format is rather strange. The format is described in this link:
The full timestamps are denoted by the leading 'a'. Like this: a1092945600. The number after the 'a' is a Unix timestamp. [...]
The numbers without a leading 'a' are "intervals". So, for example, the second row in the data set below has an interval of 1. You can multiply this number by our interval size [...] and add it to the last Unix Timestamp.
In my case, the "interval size" is 300 seconds (5 minutes). This format restarts at the start of each new day, so converting it is quite difficult!
I can pull out the index positions of the day starts by using grep to search for "a":
newDay <- grep(df$V1, pattern = "a")
Then my idea was to split the data frame into chunks based on those index positions, expand the Unix times within each day separately, and then combine the chunks back into a data.table before storing.
data.table::split looks like it will do the job, but I am unsure how to supply it the day breaks to split by index position, or whether there is a more logical way to achieve the same result without having to break the data down day by day.
Thanks.
You may use grepl to search for "a" in "datetime", which results in a logical vector. cumsum the logical vector to create a grouping variable: at each "a" (TRUE), the counter increases by one.
Within each group, convert the first element to POSIXct, using an appropriate format and origin (and time zone, tz?). Add multiples of the 'interval size' (300 sec), using zero for the first element and the "datetime" multiples for the others.
d[ , time := {
  # first row of each group holds the "a"-prefixed Unix timestamp
  t1 <- as.POSIXct(datetime[1], format = "a%s", origin = "1970-01-01")
  # remaining rows are interval counts, each worth 300 seconds
  .(t1 + c(0, as.numeric(datetime[-1]) * 300))
}
, by = .(cumsum(grepl("^a", datetime)))]  # a new group starts at each "a"
d
# datetime time
# 1: a1497619800 2017-06-16 15:30:00
# 2: 1 2017-06-16 15:35:00
# 3: 2 2017-06-16 15:40:00
# 4: 3 2017-06-16 15:45:00
# 5: 4 2017-06-16 15:50:00
# 6: a1500000000 2017-07-14 04:40:00
# 7: 3 2017-07-14 04:55:00
# 8: 5 2017-07-14 05:05:00
# 9: 7 2017-07-14 05:15:00
Some toy data:
d <- fread(input = "datetime
a1497619800
1
2
3
4
a1500000000
3
5
7")
With:
DT[grep('^a', date), datetime := as.integer(gsub('\\D+','',date))   # extract the Unix stamp from the 'a' rows
][, datetime := zoo::na.locf(datetime)                              # carry the last stamp forward
][nchar(date) < 4, datetime := datetime + (300 * as.integer(date))  # add the interval offsets (300 sec each)
][, datetime := as.POSIXct(datetime, origin = '1970-01-01', tz = 'America/New_York')][]
you get:
date close high low open volume datetime
1: a1500298200 153.57 153.7100 153.57 153.5900 1473 2017-07-17 09:30:00
2: 1 153.51 153.8700 153.33 153.7500 205057 2017-07-17 09:35:00
3: 2 153.49 153.7800 153.34 153.5800 70023 2017-07-17 09:40:00
4: 3 153.68 153.7300 153.42 153.5400 53050 2017-07-17 09:45:00
5: 4 153.06 153.7500 153.06 153.7200 120899 2017-07-17 09:50:00
---
2348: 937 143.94 144.0052 143.91 143.9917 36651 2017-08-25 15:40:00
2349: 938 143.90 143.9958 143.90 143.9400 40769 2017-08-25 15:45:00
2350: 939 143.94 143.9500 143.87 143.8900 56616 2017-08-25 15:50:00
2351: 940 143.97 143.9700 143.89 143.9400 56381 2017-08-25 15:55:00
2352: 941 143.74 143.9700 143.74 143.9655 179811 2017-08-25 16:00:00
Used data:
DT <- fread('https://www.google.com/finance/getprices?i=300&p=30d&f=d,t,o,h,l,c,v&df=cpct&q=IBM', skip = 7, header = FALSE)
setnames(DT, 1:6, c('date','close','high','low','open','volume'))
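A note on the key step above: zoo::na.locf carries the most recent "a" timestamp forward across the interval rows. A minimal illustration of that behaviour on a plain vector:
zoo::na.locf(c(1497619800, NA, NA, 1500000000, NA))
# [1] 1497619800 1497619800 1497619800 1500000000 1500000000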
So I have an xts time series over the year with time zone "UTC". The time interval between rows is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want hourly data to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoints vector, but it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003, but it was not: I don't want to change the names of my columns, I want to sum over a different interval.
Here is a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like to see only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use this startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
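As a quick sanity check on the toy data: every full hour now contains exactly four 15-minute observations, so the aggregated series should consist entirely of 4s, each indexed exactly on the hour:
head(agg_aligned, 3)           # expect rows at 2015-01-01 00:00:00, 01:00:00, 02:00:00, each equal to 4
unique(coredata(agg_aligned))  # expect a single value: 4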
I am finding this quite tricky. I have an R time series data frame consisting of a value for each day for about 50 years of data. I would like to compute the mean of only the last 5 values for each month. This would be simple if every month ended on the 31st, in which case I could just subset. However, as we all know, some months end on the 31st, some on the 30th, and then we have leap years. So, is there a simple way to do this in R without having to write a complex indexing function that takes account of all the possibilities, including leap years? Perhaps a function that works on zoo-type objects? The data frame is as follows:
Date val
1 2014-01-06 1.49
2 2014-01-03 1.38
3 2014-01-02 1.34
4 2013-12-31 1.26
5 2013-12-30 2.11
6 2013-12-26 3.20
7 2013-12-25 3.00
8 2013-12-24 2.89
9 2013-12-23 2.90
10 2013-12-22 4.5
tapply
Try this, where dd is your data frame and we have assumed that the Date column is of class "Date". (If dd is already sorted in descending order of Date, as it appears it might be in the question, then we can shorten it a bit by replacing the anonymous function with function(x) mean(head(x, 5)), as shown after the output below.)
> tapply(dd$val, format(dd$Date, "%Y-%m"), function(x) mean(tail(sort(x), 5)))
2013-12 2014-01
2.492000 1.403333
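If dd is sorted in descending order of Date, the shortened form gives the same result:
> tapply(dd$val, format(dd$Date, "%Y-%m"), function(x) mean(head(x, 5)))
 2013-12  2014-01
2.492000 1.403333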
aggregate.zoo
In terms of zoo, we can do this, which returns another zoo object whose index is of class "yearmon". (In the case of zoo it does not matter whether dd is sorted or not, since zoo will sort it automatically.)
> library(zoo)
> z <- read.zoo(dd)
> aggregate(z, as.yearmon, function(x) mean(tail(x, 5)))
Dec 2013 Jan 2014
2.492000 1.403333
I have a simple data set with a date column and a value column. I noticed that the date sometimes comes in mmddyy (%m/%d/%y) format and other times in mmddYYYY (%m/%d/%Y) format. What is the best way to standardize the dates so that I can do other calculations without this formatting causing issues?
I tried the answers provided here
Changing date format in R
and here
How to change multiple Date formats in same column
Neither of these were able to fix the problem.
Below is a sample of the data
Date, Market
12/17/09,1.703
12/18/09,1.700
12/21/09,1.700
12/22/09,1.590
12/23/2009,1.568
12/24/2009,1.520
12/28/2009,1.500
12/29/2009,1.450
12/30/2009,1.450
12/31/2009,1.450
1/4/2010,1.440
When I read it into a new vector using something like this:
dt <- as.Date(inp$Date, format="%m/%d/%y")
I get the following output for the above segment
dt Market
2009-12-17 1.703
2009-12-18 1.700
2009-12-21 1.700
2009-12-22 1.590
2020-12-23 1.568
2020-12-24 1.520
2020-12-28 1.500
2020-12-29 1.450
2020-12-30 1.450
2020-12-31 1.450
2020-01-04 1.440
As you can see, we skipped from 2009 to 2020 at 12/23 because of the change in formatting. Any help is appreciated. Thanks.
> # strip the century from 4-digit years so that every date is in %m/%d/%y form
> dat$Date <- gsub("[0-9]{2}([0-9]{2})$", "\\1", dat$Date)
> dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
> dat
#          Date Market
# 1 2009-12-17 1.703
# 2 2009-12-18 1.700
# 3 2009-12-21 1.700
# 4 2009-12-22 1.590
# 5 2009-12-23 1.568
# 6 2009-12-24 1.520
# 7 2009-12-28 1.500
# 8 2009-12-29 1.450
# 9 2009-12-30 1.450
# 10 2009-12-31 1.450
# 11 2010-01-04 1.440
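For completeness, here is a sketch of an alternative that parses each row with the format matching its year width instead of rewriting the strings; it assumes only the %m/%d/%y and %m/%d/%Y variants occur and starts from the raw character column:
d_long  <- as.Date(dat$Date, format = "%m/%d/%Y")  # correct for rows with 4-digit years
d_short <- as.Date(dat$Date, format = "%m/%d/%y")  # correct for rows with 2-digit years
has_long_year <- grepl("/[0-9]{4}$", dat$Date)
# ifelse drops the Date class, so re-apply it with an origin
dat$Date <- as.Date(ifelse(has_long_year, d_long, d_short), origin = "1970-01-01")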
I use an xts object. The index of the object is as below; there is one entry for every hour of the day, for a whole year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
In columns are values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 for the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.
You can do this by using interaction to create a factor from the combination of weekdays and .indexhour, then use split to select the relevant observations from your xts object.
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
# count sd
# Friday.0 60 1.0301030
# Monday.0 59 0.9204670
# Saturday.0 60 0.9842125
# Sunday.0 60 0.9500347
# Thursday.0 60 0.9506620
# Tuesday.0 59 0.8972697
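Note that the grouping above is at the weekday/hour level. Since the question asks about a specific minute (18:59), you could, assuming the stamps always fall on the same minute, fold .indexmin into the factor as well:
groups <- interaction(weekdays(index(x)), .indexhour(x), .indexmin(x))
# e.g. the group "Monday.18.59" then holds exactly the Monday 18:59 observations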
You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
fxts[.indexmon(fxts)==0]  # it's zero-based (!) and gives you all the January values
fxts[.indexmday(fxts)==1] # beginning of month
fxts[.indexwday(fxts)==1] # Mondays
require(quantmod)
> fxts
value
2011-01-02 19:58:00 1
2011-01-02 20:59:00 2
2011-01-03 18:59:00 3
2011-01-09 19:58:00 4
2011-01-09 20:59:00 5
2011-01-10 18:59:00 6
2011-01-16 18:59:00 7
2011-01-16 19:58:00 8
2011-01-16 20:59:00 9
fxts[.indexwday(fxts)==1] #this gives you all the Mondays
For subsetting by time of day, you use:
fxts["T19:30/T20:00"] # this will give you the time period you are looking for
And here you combine weekday and time period:
fxts["T18:30/T20:00"] & fxts[.indexwday(fxts)==1] # to get a logical vector or
fxts["T18:30/T21:00"][.indexwday(fxts["T18:30/T21:00"])==1] # to get the values
                    value
2011-01-03 18:59:00     3
2011-01-10 18:59:00     6
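Putting the two pieces together for the original question, a minimal sketch of the standard deviation of the Monday 18:59 values over the whole year (assuming the stamps always fall exactly on 18:59):
mon <- fxts["T18:59/T18:59"]      # all observations stamped 18:59
mon <- mon[.indexwday(mon) == 1]  # keep only the Mondays
sd(coredata(mon))                 # one number across the ~52 Monday values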