Performance problems when converting timestamped row data

I've written a function that takes a data.frame whose rows represent 1-minute intervals of data. The purpose of the function is to aggregate these 1-minute intervals into longer ones: for example, 1 minute becomes 5 minutes, 60 minutes, etc. The data set can have gaps (i.e. jumps in time), so the function must accommodate these bad-data occurrences. I've written the following code, which appears to work, but the performance is absolutely terrible on large data sets.
I'm hoping that someone could suggest how I might speed this up. See below.
compressMinute = function(interval, DAT) {
  #Grab all data which begins at the same interval length
  retSet = NULL
  intervalFilter = which(DAT$time$min %% interval == 0)
  barSet = NULL
  for (x in intervalFilter) {
    barEndTime = DAT$time[x] + 60*interval
    barIntervals = DAT[x,]
    x = x+1
    while(x <= nrow(DAT) & DAT[x,"time"] < barEndTime) {
      barIntervals = rbind(barIntervals,DAT[x,])
      x = x + 1
    }
    bar = data.frame(date=barIntervals[1,"date"],time=barIntervals[1,"time"],open=barIntervals[1,"open"],high=max(barIntervals[1:nrow(barIntervals),"high"]),
                     low=min(barIntervals[1:nrow(barIntervals),"low"]),close=tail(barIntervals,1)$close,volume=sum(barIntervals[1:nrow(barIntervals),"volume"]))
    if (is.null(barSet)) {
      barSet = bar
    } else {
      barSet = rbind(barSet, bar)
    }
  }
  return(barSet)
}
EDIT:
Below is a row of my data. Each row represents a 1-minute interval. I am trying to convert these rows into arbitrary buckets that aggregate the 1-minute intervals, e.g. 5 minutes, 15 minutes, 60 minutes, 240 minutes, etc.
date time open high low close volume
2005-09-06 2005-09-06 16:33:00 1297.25 1297.50 1297.25 1297.25 98

You probably want to re-use existing facilities, specifically the POSIXct time types, as well as existing packages.
For example, look at the xts package --- it already has a generic function to.period() as well as convenience wrappers to.minutes(), to.minutes3(), to.minutes10(), ....
Here is an example from the help page:
R> example(to.minutes)
t.mn10R> data(sample_matrix)
t.mn10R> samplexts <- as.xts(sample_matrix)
t.mn10R> to.monthly(samplexts)
samplexts.Open samplexts.High samplexts.Low samplexts.Close
Jan 2007 50.0398 50.7734 49.7631 50.2258
Feb 2007 50.2245 51.3234 50.1910 50.7709
Mar 2007 50.8162 50.8162 48.2365 48.9749
Apr 2007 48.9441 50.3378 48.8096 49.3397
May 2007 49.3457 49.6910 47.5180 47.7378
Jun 2007 47.7443 47.9413 47.0914 47.7672
t.mn10R> to.monthly(sample_matrix)
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
Jan 2007 50.0398 50.7734 49.7631 50.2258
Feb 2007 50.2245 51.3234 50.1910 50.7709
Mar 2007 50.8162 50.8162 48.2365 48.9749
Apr 2007 48.9441 50.3378 48.8096 49.3397
May 2007 49.3457 49.6910 47.5180 47.7378
Jun 2007 47.7443 47.9413 47.0914 47.7672
t.mn10R> str(to.monthly(samplexts))
An ‘xts’ object from Jan 2007 to Jun 2007 containing:
Data: num [1:6, 1:4] 50 50.2 50.8 48.9 49.3 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "samplexts.Open" "samplexts.High" "samplexts.Low" "samplexts.Close"
Indexed by objects of class: [yearmon] TZ:
xts Attributes:
NULL
t.mn10R> str(to.monthly(sample_matrix))
num [1:6, 1:4] 50 50.2 50.8 48.9 49.3 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:6] "Jan 2007" "Feb 2007" "Mar 2007" "Apr 2007" ...
..$ : chr [1:4] "sample_matrix.Open" "sample_matrix.High" "sample_matrix.Low" "sample_matrix.Close"
R>
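If pulling in xts is not an option, the same bucketing idea can be sketched in base R without the row-by-row rbind() that makes the original loop slow. This is only a sketch under stated assumptions: the function name compress_minutes is hypothetical, the rows are assumed to be sorted by time already, and the column names (time, open, high, low, close, volume) are taken from the sample row in the question. Unlike the original function, it emits a bar for every bucket that contains at least one row, and it labels each bar with the first timestamp actually observed in that bucket.

```r
# Sketch: aggregate 1-minute rows into `interval`-minute bars, vectorised.
# Assumes DAT is sorted by time and has columns time/open/high/low/close/volume.
compress_minutes <- function(interval, DAT) {
  # Bucket id = number of whole `interval`-minute blocks since the epoch.
  # Gaps in the data simply produce buckets with fewer rows (or no bucket at all).
  secs   <- as.numeric(as.POSIXct(DAT$time))
  bucket <- secs %/% (60 * interval)

  first <- !duplicated(bucket)                    # first row of each bucket
  last  <- !duplicated(bucket, fromLast = TRUE)   # last row of each bucket

  data.frame(
    time   = DAT$time[first],
    open   = DAT$open[first],
    high   = as.vector(tapply(DAT$high, bucket, max)),
    low    = as.vector(tapply(DAT$low, bucket, min)),
    close  = DAT$close[last],
    volume = as.vector(tapply(DAT$volume, bucket, sum))
  )
}
```

Because the buckets are counted from the epoch, they line up with the clock for intervals that divide an hour; for longer intervals such as 240 minutes the alignment depends on the time zone offset, so treat that case with care.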

Related

Python, adding a Water-Year time variable in an X-array

I have the following xarray Dataset named 'scatch', with the lat, long and lev coords eliminated and only the time coord left as a dimension. It has several variables and is now a multivariate daily time series from 2002 to 2014. I need to add a new variable, "water_year", that shows which water year each day belongs to. It could be done by adding another column to the variables with Xarray.assign or with Xarray.resample, but I am not sure and could use some help. Note: a "water year" starts on Oct 01 and ends on Sep 30 of the next year, so water year 2003 runs from 10-01-2002 to 09-30-2003.
See my Xarray here

I'll create a sample dataset with a single variable for this example:
In [1]: import numpy as np
...: import pandas as pd
...: import xarray as xr
In [2]: scratch = xr.Dataset(
...: {'Baseflow': (('time', ), np.random.random(4018))},
...: coords={'time': pd.date_range('2002-10-01', freq='D', periods=4018)},
...: )
In [3]: scratch
Out[3]:
<xarray.Dataset>
Dimensions: (time: 4018)
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
Data variables:
Baseflow (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
We can build a water_year array using the Datetime Components accessor .dt:
In [4]: water_year = (scratch.time.dt.month >= 10) + scratch.time.dt.year
...: water_year
Out[4]:
<xarray.DataArray (time: 4018)>
array([2003, 2003, 2003, ..., 2013, 2013, 2013])
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
Because water_year is a DataArray indexed by an existing dimension, we can just add it as a coordinate and xarray will understand that it's a non-dimension coordinate. This is important to make sure we don't create a new dimension in our data.
In [7]: scratch.coords['water_year'] = water_year
In [8]: scratch
Out[8]:
<xarray.Dataset>
Dimensions: (time: 4018)
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
water_year (time) int64 2003 2003 2003 2003 2003 ... 2013 2013 2013 2013
Data variables:
Baseflow (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
Because water_year is indexed by time, we still need to select from the arrays using the time dimension, but we can subset the arrays to specific water years:
In [9]: scratch.sel(time=(scratch.water_year == 2010))
Out[9]:
<xarray.Dataset>
Dimensions: (time: 365)
Coordinates:
* time (time) datetime64[ns] 2009-10-01 2009-10-02 ... 2010-09-30
water_year (time) int64 2010 2010 2010 2010 2010 ... 2010 2010 2010 2010
Data variables:
Baseflow (time) float64 0.441 0.7586 0.01377 ... 0.2656 0.1054 0.6964
Aggregation operations can use non-dimension coordinates directly, so the following works:
In [10]: scratch.groupby('water_year').sum()
Out[10]:
<xarray.Dataset>
Dimensions: (water_year: 11)
Coordinates:
* water_year (water_year) int64 2003 2004 2005 2006 ... 2010 2011 2012 2013
Data variables:
Baseflow (water_year) float64 187.6 186.4 184.7 ... 185.2 189.6 192.7

Convert a data frame with Product_Type, Date and Demand to timeseries object?

Product_Code Date Order_Demand
Product_1904 09-01-2017 4000
Product_0250 09-01-2017 148
Product_0471 09-01-2017 30
Product_1408 06-01-2017 1000
Product_0689 06-01-2017 200
Product_0689 06-01-2017 300
Product_1926 06-01-2017 2
Product_1938 06-01-2017 20
I am new to R. I want to convert the above data to a time series object ts, such that the rownames will be Product_Code and column names will be months or quarters. Kindly help me!!

I think this should work for you,
library(xts)
library(lubridate)
# dummy data
test_data <- data.frame(
Product_Code = c("Product_1904","Product_0250","Product_0471"),
Date = mdy(c("09-01-2017","09-02-2017","09-03-2017")),
Order_Demand = c(4000,148,30)
)
# convert dummy data into xts time series
xts::xts(test_data, order.by = test_data$Date) -> time_series_data
str(time_series_data)
An ‘xts’ object on 2017-09-01/2017-09-03 containing:
Data: chr [1:3, 1:3] "Product_1904" "Product_0250" "Product_0471" "2017-09-01" "2017-09-02" "2017-09-03" "4000" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "Product_Code" "Date" "Order_Demand"
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
NULL
Next time, please use dput() and paste its output from the R console to provide your data.
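If the goal is the layout described in the question (products as rows, months as columns), a base R cross-tabulation is one possible route. The following is a minimal sketch, not a full answer: it assumes the dates are month-day-year, as the mdy() call above does, and the object names orders and demand are made up for illustration. The result is a plain table, not a ts object, because months with no orders are simply absent rather than filled in.

```r
# Sketch: total demand per product and month, using a few rows from the question.
orders <- data.frame(
  Product_Code = c("Product_1904", "Product_0250", "Product_0689", "Product_0689"),
  Date         = as.Date(c("09-01-2017", "09-01-2017", "06-01-2017", "06-01-2017"),
                         format = "%m-%d-%Y"),   # assumed month-day-year
  Order_Demand = c(4000, 148, 200, 300)
)

# Month label such as "2017-06"; use quarters(orders$Date) for quarterly buckets.
orders$Month <- format(orders$Date, "%Y-%m")

# Rows are products, columns are months, cells are summed demand.
demand <- xtabs(Order_Demand ~ Product_Code + Month, data = orders)
print(demand)
```

Each row of demand could then be turned into its own ts once the missing months have been filled with zeros, which this sketch does not attempt.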

Zooreg frequency warning

Suppose I have the following set of data:
date <- structure(c(1986, 1986.08333333333, 1986.16666666667), class = "yearmon")
return <- structure(c(0.000827577426231287, 0.00386371801344005, 0.00382634819565989
), .Dim = 3L, .Dimnames = list(c("1986-01", "1986-02", "1986-03"
)))
I used the following to transform the return array into a zoo/zooreg object:
zooreg(return, order.by = date)
It provides the correct output with a warning:
Jan 1986 Feb 1986 Mar 1986
0.0008275774 0.0038637180 0.0038263482
Warning message:
In zoo(data, order.by, frequency) :
“order.by” and “frequency” do not match: “frequency” ignored
The series is strictly regular, so order.by and frequency should match, but I can't figure out why there is a warning.

According to the documentation (?yearmon):
The "yearmon" class is used to represent monthly data. Internally it holds the data as year plus 0 for January, 1/12 for February, 2/12 for March and so on in order that its internal representation is the same as ts class with frequency = 12.
Calling:
zooreg(return, order.by = date)
is equivalent to calling
zoo(return, order.by = date, frequency = 1)
According to the documentation to zoo under Arguments::frequency :
If specified, it is checked whether order.by and frequency comply.
Hence the warning. To get rid of the warning, use
z <- zooreg(return, order.by = date, frequency = 12)
or
z <- zoo(return, order.by = date, frequency = 12)
Both of these will return an object of class zooreg:
str(z)
‘zooreg’ series from Jan 1986 to Mar 1986
Data: Named num [1:3] 0.000828 0.003864 0.003826
- attr(*, "names")= chr [1:3] "1986-01" "1986-02" "1986-03"
Index: Class 'yearmon' num [1:3] 1986 1986 1986
Frequency: 12
which, according to the documentation (?zoo), is described as follows:
This is a subclass of "zoo" which relies on having a "zoo" series with an additional "frequency" attribute (which has to comply with the index of that series)
I believe this is what you want.
Note that calling with mismatched "order.by" and "frequency" using
z <- zooreg(return, order.by = date)
you get only a zoo object:
str(z)
‘zoo’ series from Jan 1986 to Mar 1986
Data: Named num [1:3] 0.000828 0.003864 0.003826
- attr(*, "names")= chr [1:3] "1986-01" "1986-02" "1986-03"
Index: Class 'yearmon' num [1:3] 1986 1986 1986
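To see why frequency = 12 is the value that complies with a monthly index, it helps to look at the numbers behind it. The sketch below uses a base R ts as a stand-in, relying on the ?yearmon statement quoted above that the internal representation is the same as a ts with frequency = 12; the variable names are illustrative only.

```r
# Sketch: the numeric index of a monthly series advances by 1/12 per step,
# which is what frequency = 12 means. frequency = 1 would imply steps of 1.
monthly <- ts(c(0.000827577426231287, 0.00386371801344005, 0.00382634819565989),
              start = c(1986, 1), frequency = 12)

idx      <- as.numeric(time(monthly))   # numeric time index of the series
expected <- 1986 + (0:2) / 12           # year + 0, 1/12, 2/12 as in ?yearmon

# TRUE if the index matches the yearmon-style representation
print(isTRUE(all.equal(idx, expected)))

# Number of steps that fit into one unit of time (one year)
steps_per_year <- round(1 / diff(idx))
print(steps_per_year)
```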

Issue with setting up time series correctly in R

I have been trying to do some basic analysis on some timeseries data. However, I keep getting this error on anything I am trying to do
Error in decompose(data_ts, type = c("additive")) :
time series has no or less than 2 periods
I assume the problem is that I am not setting the data correctly for time series analysis. I am working with data that runs M-F for about a year. Below is the code that I am using to convert the series to time series
train=xts(data$x,as.Date(data$Date,format='%m/%d/%Y'),frequency=365)
data_ts=as.ts(train)
attributes(data_ts)
$tsp
[1] 1 277 1
$class
[1] "ts"
But when I try to do any type of analysis on the time series data, I receive this:
dcomp=decompose(data_ts,type=c('additive'))
Error in decompose(data_ts, type = c("additive")) :
time series has no or less than 2 periods
Am I setting up the time series incorrectly? Is there a better period that I should pick for frequency because technically I don't have a full year worth of data?
Thank you!

I don't see the xts frequency argument doing the same thing as the ts frequency argument.
So, I assume you need to convert your data into a ts object before you use decompose. The way I got it to work is the following:
Using the following data:
data(sample_matrix)
df <- as.data.frame(sample_matrix )
df$date <- as.Date(row.names(df))
If I do the following:
dfxts <- xts(df[1:4], order.by=df$date, frequency=12)
decompose(dfts)
Error in decompose(dfts) : time series has no or less than 2 periods
I get the same error as you.
However if I convert it into a ts object and use that frequency argument:
#use as.ts to convert into ts
#make sure your data frame consists of numeric columns otherwise it will fail
#drop all the others
#in my case df[1:4] has numeric values. I use the date as a separate vector.
dfts <- as.ts(xts(df[1:4], order.by=df$date))
#I guess splitting by month would make sense in your case.
#I think ts works with frequency 4 (quarterly) or 12 (monthly)
#If you see your dfts now you ll see the difference
dfts <- ts(dfts, frequency=12)
And then it works:
dcomp <- decompose(dfts)
Output:
> str(dcomp)
List of 6
$ x : mts [1:180, 1:4] 50 50.2 50.4 50.4 50.2 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:4] "Open" "High" "Low" "Close"
..- attr(*, "tsp")= num [1:3] 1 15.9 12
..- attr(*, "class")= chr [1:3] "mts" "ts" "matrix"
$ seasonal: Time-Series [1:720] from 1 to 60.9: -0.00961 0.02539 0.06149 0.01773 -0.00958 ...
$ trend : mts [1:180, 1:4] NA NA NA NA NA ...
..- attr(*, "tsp")= num [1:3] 1 15.9 12
..- attr(*, "class")= chr [1:2] "mts" "ts"
$ random : mts [1:180, 1:4] NA NA NA NA NA ...
..- attr(*, "tsp")= num [1:3] 1 15.9 12
..- attr(*, "class")= chr [1:3] "mts" "ts" "matrix"
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:4] "x - seasonal.x.Open" "x - seasonal.x.High" "x - seasonal.x.Low" "x - seasonal.x.Close"
$ figure : num [1:12] -0.00961 0.02539 0.06149 0.01773 -0.00958 ...
$ type : chr "additive"
- attr(*, "class")= chr "decomposed.ts"

You're right about changing the frequency. Right now, you're saying there would be 365 observations per period, so you have less than 1 period of data. You could adjust it to set the frequency to 5, since you're getting 5 observations per week. That's not the only option, just an example that should work :)
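A minimal sketch of that suggestion, using simulated data because the original data set is not available: the 277 observations match the length shown in the question, while the weekday pattern and noise are invented for illustration. It also assumes every week has exactly five observations, which real Monday-to-Friday data with holidays will not satisfy exactly.

```r
# Sketch: decompose() needs at least 2 full periods, so frequency = 5 works
# for ~277 weekday observations while frequency = 365 does not.
set.seed(42)
n <- 277
weekday_effect <- rep(c(1, 2, 3, 2, 1), length.out = n)  # invented weekly pattern
x <- weekday_effect + rnorm(n, sd = 0.1)

weekly <- ts(x, frequency = 5)
dcomp  <- decompose(weekly, type = "additive")
print(dcomp$figure)   # one seasonal value per weekday position

# With frequency = 365 there is less than one period, so decompose() stops.
yearly <- ts(x, frequency = 365)
failed <- inherits(try(decompose(yearly), silent = TRUE), "try-error")
print(failed)
```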

Try this general approach with your data-set:
# Pre-loading the data
library(TSA)
library(mgcv)
#Choose the 'Temp Monthly.csv' dataset, wherever it is located on your computer
#Alternatively, skip this step and replace 'fname' with the file's path
fname <- file.choose()
#Load Data
data <- read.csv(fname)
data <- data[,2]
#Convert to TS data in proper frame
temp <- ts(data,start=c(1950,1),freq=12)
# Plotting Time Plots
ts.plot(temp,ylab="Temperature")
abline(a=mean(temp),b=0,col='red')
This might help. For more details about time series in R, this article on Medium might be helpful: https://medium.com/analytics-vidhya/time-series-analysis-101-in-r-and-python-1e1a7f7c3e51
Please feel free to comment if you need any help on this. :)

How to subtract these two time series objects?

I have one time series object that looks as follows (sorry, I don't know how to format it any more nicely):
             Jan         Feb         Mar         Apr         May         Jun         Jul          Aug         Sep
2010 0.051495184 0.012516017 0.029767280 0.046781229 0.041615717 0.002205329 0.056919026 -0.026339813 0.078932572 ...
It contains data from 2010m01 - 2014m12
And one that looks like this:
Time Series:
Start = 673
End = 732
Frequency = 1
[1] 0.01241940 0.01238126 0.01234626 0.01227542 ...
They have the same number of observations. However, when I try to subtract them I get the error:
Error in .cbind.ts(list(e1, e2), c(deparse(substitute(e1))[1L], deparse(substitute(e2))[1L]), :
not all series have the same frequency
Can anyone tell me what I can do to subtract the two?
Thanks in advance.
Edit:
str() gives:
Time-Series [1:60] from 2010 to 2015: 0.0515 0.0125 0.0298 0.0468 0.0416 ...
and
Time-Series [1:60] from 673 to 732: 0.0124 0.0124 0.0123 0.0123 0.0122 ...

There is no frequency<- function, but you can change the frequency of time-series objects using the ts function:
> x <- ts(1:10, frequency = 4, start = c(1959, 2))
> frequency(x) <- 12
Error in frequency(x) <- 12 : could not find function "frequency<-"
> y <- ts(x, frequency=12)
> frequency(y)
[1] 12
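Applied to the two series in the question, the idea might look like the sketch below. The values are placeholders, and it rests on an assumption the question only implies: that observation 673 of the second series corresponds to January 2010, so the two series cover the same 60 months in the same order.

```r
# Sketch: give the second series the same start and frequency as the first,
# then subtract. Values are placeholders, not the asker's data.
a <- ts(seq(0.01, 0.60, length.out = 60), start = c(2010, 1), frequency = 12)
b <- ts(seq(0.012, 0.010, length.out = 60), start = 673, frequency = 1)

# Drop b's own time attributes and rebuild it on a's monthly calendar.
# Assumes the observations line up one-to-one with a's months.
b_aligned <- ts(as.numeric(b), start = start(a), frequency = frequency(a))

difference <- a - b_aligned
print(frequency(difference))
print(start(difference))
```

If the alignment assumption does not hold, the subtraction will still run but will pair the wrong months, so it is worth checking a few values by hand.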
