Converting 1 minute data to 5 minutes using R

I am trying to convert 1-minute data to 5-minute data in R using POSIXct, but I am unable to convert it.
My data is in this format:
Date Time Price Volume No.of.trades
1 01-06-2012 09:15 4901.895 283550 1286
2 01-06-2012 09:16 4907.046 140000 831
3 01-06-2012 09:17 4904.140 96900 639
4 01-06-2012 09:18 4900.609 84350 553
5 01-06-2012 09:19 4900.067 76450 516
6 01-06-2012 09:20 4898.378 84900 551
library(xts)
dt_tm <- as.POSIXct(paste(x[,1], x[,2]),
                    format="%d-%m-%Y %H:%M", tz="UTC")
cable <- xts(x[,3:5], order.by=dt_tm)
Price Volume No.of.trades
2012-01-07 09:15:00 6054.890 139750 787
2012-01-07 09:16:00 6051.176 56550 335
2012-01-07 09:17:00 6045.232 127400 691
2012-01-07 09:18:00 6039.950 59950 374
2012-01-07 09:19:00 6042.292 55450 214
2012-01-07 09:20:00 6044.140 53600 246
After this step I get a series with a different set of dates, which are not in my data.
Further, I have to use this code to convert my data to 5 minutes:
colnames(cable)[1] <- "CLOSE"
trades5 <- to.minutes5(cable, indexAt='startof', name=NULL)
Please point out where I am going wrong, and suggest any other way of converting this type of data to 5 minutes.
Edit: I am still facing a problem with the dates. The date structure in my data is day-month-year. You suggested swapping the day and month in the format string; I did so and got the desired outcome in head(), but when I look at tail() there is a problem. Initially the tail was:
Date Time Price Volume No.of.trades
91561 31-05-2013 15:25 6004.504 86550 622
91562 31-05-2013 15:26 6003.709 117750 651
91563 31-05-2013 15:27 6000.656 160950 856
91564 31-05-2013 15:28 5997.516 215950 1191
91565 31-05-2013 15:29 5995.305 303200 1784
Now, with the following code:
dt_tm <- as.POSIXct(paste(x[,1], x[,2]),
                    format="%m-%d-%Y %H:%M", tz="UTC")
ct <- cut(dt_tm, breaks="5 mins")
ct_tm <- as.POSIXct(as.character(ct))
cable <- xts(x[,3:5], order.by=ct_tm)
head(cable)
Price Volume No.of.trades
2012-01-06 09:15:00 4901.895 283550 1286
2012-01-06 09:15:00 4907.046 140000 831
2012-01-06 09:15:00 4904.140 96900 639
2012-01-06 09:15:00 4900.609 84350 553
2012-01-06 09:15:00 4900.067 76450 516
2012-01-06 09:20:00 4898.378 84900 551
but when I look at the tail:
tail(cable)
Price Volume No.of.trades
<NA> 6004.504 86550 622
<NA> 6003.709 117750 651
<NA> 6000.656 160950 856
<NA> 5997.516 215950 1191
<NA> 5995.305 303200 1784
<NA> 5991.419 550 8
Kindly help me see where I am now going wrong.

I think you may be formatting your dates incorrectly - swapping day and month. Try:
dt_tm <- as.POSIXct(paste(x[,1], x[,2]),
                    format="%m-%d-%Y %H:%M", tz="UTC")
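A quick way to check whether the chosen format actually matches the data (my own aside, not part of the original answer) is to count failed parses, since as.POSIXct() returns NA for rows the format string cannot match:
# a non-zero count means the day/month order is wrong for some rows
sum(is.na(dt_tm))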
Converting to a 5-minute series can be achieved like this:
# cut dt_tm to 5 minutes intervals
ct <- cut(dt_tm, breaks="5 mins")
# convert to POSIXct
ct_tm <- as.POSIXct(as.character(ct))
# re-index the series by the 5-minute bucket start times
cable <- xts(x[,3:5], order.by=ct_tm)
Time Price Volume
2012-01-06 09:15:00 "09:15" "4901.895" "283550"
2012-01-06 09:15:00 "09:16" "4907.046" "140000"
2012-01-06 09:15:00 "09:17" "4904.140" " 96900"
2012-01-06 09:15:00 "09:18" "4900.609" " 84350"
2012-01-06 09:15:00 "09:19" "4900.067" " 76450"
2012-01-06 09:20:00 "09:20" "4898.378" " 84900"
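Note that cut() only re-labels each row with the start of its 5-minute bucket; there is still one row per minute. To collapse each bucket into a single observation (a sketch, assuming mean price and summed volume/trades are what is wanted), xts period.apply() can be used:
library(xts)
# endpoints of each 5-minute period
ep <- endpoints(cable, on = "minutes", k = 5)
# one value per bucket: mean price, total volume, total trades
price5  <- period.apply(cable[, "Price"], ep, mean)
volume5 <- period.apply(cable[, "Volume"], ep, sum)
trades5 <- period.apply(cable[, "No.of.trades"], ep, sum)
bars5 <- merge(price5, volume5, trades5)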

Related

How to calculate the sequential date diff in a dataframe and make it as another column for further analysis?

Please read my question carefully before marking it as a duplicate!
I am new to R and am trying to figure out how to calculate the sequential date difference from one row to the next, in weeks, and create another column for making a graph accordingly.
There are a couple of answers here (Q1, Q2, Q3), but none specifically addresses taking the difference in one column sequentially between rows, say from top to bottom.
Below is the example and the expected results:
Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234
Expected
Date Var1 week
2/6/2017 493 0
2/20/2017 558 2
3/6/2017 595 4
3/20/2017 636 6
4/6/2017 697 8
4/20/2017 566 10
5/5/2017 234 12
You can use a similar approach to that in your first linked answer by saving the difftime result as a new column in your data frame.
# Set up data
df <- read.table(text = "Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234", header = T)
df$Date <- as.Date(as.character(df$Date), format = "%m/%d/%Y")
# Create exact week variable (first() is from dplyr; df$Date[1] works in base R)
library(dplyr)
df$week <- difftime(df$Date, first(df$Date), units = "weeks")
# Create rounded week variable
df$week2 <- floor(difftime(df$Date, first(df$Date), units = "weeks"))
df
# Date Var1 week week2
# 2017-02-06 493 0.000000 weeks 0 weeks
# 2017-02-20 558 2.000000 weeks 2 weeks
# 2017-03-06 595 4.000000 weeks 4 weeks
# 2017-03-20 636 6.000000 weeks 6 weeks
# 2017-04-06 697 8.428571 weeks 8 weeks
# 2017-04-20 566 10.428571 weeks 10 weeks
# 2017-05-05 234 12.571429 weeks 12 weeks
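Since the end goal is a graph, one extra step worth considering (my suggestion, not part of the original answer) is dropping the difftime class so the week columns behave as plain numbers when plotting:
# difftime columns carry a "weeks" label; make them plain numeric for plotting
df$week  <- as.numeric(df$week)
df$week2 <- as.numeric(df$week2)
plot(df$week2, df$Var1, type = "b", xlab = "weeks since start", ylab = "Var1")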

How do I convert a character field to date/time format in R?

This is what my data looks like. I would like to convert date and time columns to a time stamp and put it in a single column.
Any help appreciated. Thanks
DATE TIME CLOSE HIGH LOW OPEN VOLUME
1 20150216 1520 2283.85 2284 2275.6 2275.6 48309
2 20150216 1530 2282 2284 2273.15 2283.85 108856
3 20150218 920 2276.1 2280.1 2260.6 2280.1 94279
4 20150218 930 2271.6 2277.95 2271 2276.1 65932
5 20150218 940 2270.35 2275 2268.2 2271.6 53595
6 20150218 950 2270.65 2271.2 2265.55 2270.5 34546
7 20150218 1000 2274.15 2274.25 2268.65 2270.6 35414
8 20150218 1010 2270.1 2274.9 2267.1 2274.25 37334
You can try
df$DateTime <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
                          format='%Y%m%d %H%M')
df1 <- df[-(1:2)]
head(df1,2)
# CLOSE HIGH LOW OPEN VOLUME DateTime
#1 2283.85 2284 2275.60 2275.60 48309 2015-02-16 15:20:00
#2 2282.00 2284 2273.15 2283.85 108856 2015-02-16 15:30:00
Update
If you need to convert to xts, instead of creating a new column, we can remove the columns that are not needed (df[-(1:2)]) and specify order.by as the datetime vector ('indx'):
library(xts)
indx <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
                   format='%Y%m%d %H%M')
xt1 <- xts(df[-(1:2)], order.by=indx)
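If the eventual goal is coarser bars (an assumption on my part), the xts object can go straight into the to.period() family, for example:
# hypothetical follow-up: 30-minute OHLC bars from the CLOSE column
xt30 <- to.period(xt1[, "CLOSE"], period = "minutes", k = 30)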

How can I apply "sapply" in R with multiple codes in one function?

I am a new R user. I have a simple sapply example that calculates the mean and sd for a split data frame. My data contains half-hourly wind speed and direction. I want the daily Weibull distribution for my 13-year study, which is why my dataset is split by time.
My data looks like this:
Time windspeed direction Date day_index
1 24/07/2000 13:00 31 310 2000-07-24 13:00:00 2000_206
2 24/07/2000 13:30 41 320 2000-07-24 13:30:00 2000_206
3 24/07/2000 14:30 37 290 2000-07-24 14:30:00 2000_206
4 24/07/2000 15:00 30 300 2000-07-24 15:00:00 2000_206
5 24/07/2000 15:30 24 320 2000-07-24 15:30:00 2000_206
6 24/07/2000 16:00 22 330 2000-07-24 16:00:00 2000_206
7 24/07/2000 16:30 37 270 2000-07-24 16:30:00 2000_206
The example R code I have for the split-apply to loop over the days is:
my.summary <- sapply(split(ballarat_alldata[1:200, ],
                           ballarat_alldata$day_index[1:200]),
                     function(x) {
                       return(c(my.mean=mean(x$windspeed),
                                my.sd=sd(x$windspeed)))
                     })
The Weibull distribution code to calculate shape and scale parameters is:
set1 <- createSet(height=10,
                  v.avg=ballarat_alldata[,2],
                  dir.avg=ballarat_alldata[,3])
time_ballarat <- strptime(ballarat_alldata[,1], "%d/%m/%Y %H:%M")
ballarat <- createMast(time.stamp=time_ballarat, set1)
ballarat <- clean(mast=ballarat)
ballarat.wb <- weibull(mast=ballarat, v.set=1, print=FALSE)
How can I combine these two set of R codes to calculate Weibull parameters each day and store in a matrix?
I tried many ways but it doesn't work out well.
If these two sets of R codes are combined, should I change wind speed and direction range in set1 <- createSet(height=10, v.avg=ballarat_alldata[,2], dir.avg=ballarat_alldata[,3]) too?
It seems as though you have two separate problems here: 1) aggregating your data, and 2) calculating the Weibull parameters. For the first, I can recommend something like:
library(plyr)
Wind <- ddply(Wind, .(as.Date(Date)), transform,
              Wind.mean = mean(windspeed), Wind.sd = sd(windspeed))
# windspeed direction Date2 Time2 day_index Wind.mean Wind.sd
# 1 31 310 2000-07-24 13:00:00 2000_206 36.33333 5.033223
# 2 41 320 2000-07-24 13:30:00 2000_206 36.33333 5.033223
# 3 37 290 2000-07-24 14:30:00 2000_206 36.33333 5.033223
# 4 30 300 2000-07-25 15:00:00 2000_206 28.25000 6.751543
# 5 24 320 2000-07-25 15:30:00 2000_206 28.25000 6.751543
# 6 22 330 2000-07-25 16:00:00 2000_206 28.25000 6.751543
# 7 37 270 2000-07-25 16:30:00 2000_206 28.25000 6.751543
If you give me a bit more of a hint on how you are calculating the parameters, you could also use summarise from the plyr library, something like:
ddply(Wind, .(Date2), summarise, ...)  # I'm not sure what the Weibull call should be here
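For instance, a minimal sketch of the per-day fit, assuming MASS::fitdistr() is an acceptable stand-in for the weibull() fit in the package that provides createMast(), and that windspeed is strictly positive:
library(plyr)
library(MASS)
# maximum-likelihood Weibull fit per calendar day: shape and scale
params <- ddply(Wind, .(Day = as.Date(Date)), summarise,
                shape = fitdistr(windspeed, "weibull")$estimate[["shape"]],
                scale = fitdistr(windspeed, "weibull")$estimate[["scale"]])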
Hope this helps.

Intraday high/low clustering

I am attempting to study the clustering of high/low points by time of day. I created a daily series with to.daily on the intraday data and merged the two using:
intraday.merge <- merge(intraday,daily)
intraday.merge <- na.locf(intraday.merge)
intraday.merge <- intraday.merge["T08:30:00/T16:30:00"] # remove record at 00:00:00
Next, I tried to obtain the records where the intraday high equals the daily high (or the intraday low equals the daily low) using:
intradayhi <- test[test$High == test$Daily.High]
intradaylo <- test[test$Low == test$Daily.Low]
Resulting data resembles the following:
Open High Low Close Volume Daily.Open Daily.High Daily.Low Daily.Close Daily.Volume
2012-06-19 08:45:00 258.9 259.1 258.5 258.7 1424 258.9 259.1 257.7 258.7 31523
2012-06-20 13:30:00 260.8 260.9 260.6 260.6 1616 260.4 260.9 259.2 260.8 35358
2012-06-21 08:40:00 260.7 260.8 260.4 260.5 493 260.7 260.8 257.4 258.3 31360
2012-06-22 12:10:00 255.9 256.2 255.9 256.1 626 254.5 256.2 253.9 255.3 50515
2012-06-22 12:15:00 256.1 256.2 255.9 255.9 779 254.5 256.2 253.9 255.3 50515
2012-06-25 11:55:00 254.5 254.7 254.4 254.6 1589 253.8 254.7 251.5 253.9 65621
2012-06-26 08:45:00 253.4 254.2 253.2 253.7 5849 253.8 254.2 252.4 253.1 70635
2012-06-27 11:25:00 255.6 256.0 255.5 255.9 973 251.8 256.0 251.8 255.2 53335
2012-06-28 09:00:00 257.0 257.3 256.9 257.1 601 255.3 257.3 255.0 255.1 23978
2012-06-29 13:45:00 253.0 253.4 253.0 253.4 451 247.3 253.4 246.9 253.4 52539
The subset contains duplicated results; how do I keep only the first record of each day? I would then be able to plot the count of records per period of the day.
Also, are there alternate methods to get the results I want? Thanks in advance.
Edit:
The sample output should look like this; the count could be either the first result per day or an aggregate (more than one occurrence in that day):
Time Count
08:40:00 60
08:45:00 54
08:50:00 60
...
14:00:00 20
14:05:00 12
14:10:00 30
You can get the first observation of each day via:
y <- apply.daily(x, first)
Then you can simply aggregate the count based on hours and minutes:
z <- aggregate(1:NROW(y), by=list(Time=format(index(y),"%H:%M")), length)
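An alternative sketch (not the method above) that keeps each day's first record with its own timestamp is !duplicated() on the calendar date:
library(xts)
# first record of each calendar day, preserving its original index
y <- x[!duplicated(as.Date(index(x))), ]
# tally how many days have their first record at each time of day
z <- aggregate(rep(1, NROW(y)),
               by = list(Time = format(index(y), "%H:%M")), FUN = sum)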

Extracting an index of dates and times, with varying opening and closing times, for minutely OHLC data in R

I would like to be able to get an index of dates and times that represent the opening and closing times of a financial stock-market index for each day.
However the opening and closing times vary due to changes in the rules from an exchange or due to daylight savings, therefore I would to be able to use this index to accurately get Open to Close returns.
I am currently looking at the Hang Seng futures index, which also has a lunch break in the middle, so I would like this to be noted in the index as well; i.e., I would have two open-to-close returns per day due to this lunch-break gap in the data. The lunch-break time is not always consistent, so the xts subsetting syntax xts["THH:MM/THH:MM"] would not work for getting open-to-close data for a specific day.
For example, the lunch-break times changed in March 2011, so comparing the 14th Feb 2011 lunch break to the 14th March 2011 lunch break you have the following data:
> HI.raw.sing['20110214']["T12:25/T14:35"]
HI.Open HI.High HI.Low HI.Close HI.Volume
2011-02-14 12:25:00 23020 23028 23018 23018 180
2011-02-14 12:26:00 23018 23023 23014 23019 108
2011-02-14 12:27:00 23020 23033 23016 23033 142
2011-02-14 12:28:00 23031 23038 23025 23026 173
2011-02-14 12:29:00 23026 23046 23026 23042 264
2011-02-14 12:30:00 23044 23059 23041 23042 314
2011-02-14 14:30:00 23044 23044 23044 23044 311
2011-02-14 14:31:00 23118 23129 23099 23117 781
2011-02-14 14:32:00 23117 23143 23113 23143 554
2011-02-14 14:33:00 23143 23156 23139 23139 762
2011-02-14 14:34:00 23139 23161 23138 23138 644
2011-02-14 14:35:00 23139 23149 23137 23144 326
Warning message:
timezone of object (Asia/Singapore) is different than current timezone ().
> HI.raw.sing['20110314']["T11:55/T13:35"]
HI.Open HI.High HI.Low HI.Close HI.Volume
2011-03-14 11:55:00 23060 23075 23059 23071 195
2011-03-14 11:56:00 23071 23071 23059 23064 187
2011-03-14 11:57:00 23064 23074 23063 23068 96
2011-03-14 11:58:00 23069 23075 23068 23075 116
2011-03-14 11:59:00 23075 23078 23069 23073 120
2011-03-14 12:00:00 23073 23098 23073 23089 231
2011-03-14 13:30:00 23090 23090 23090 23090 103
2011-03-14 13:31:00 23082 23112 23074 23108 326
2011-03-14 13:32:00 23108 23124 23100 23123 179
2011-03-14 13:33:00 23124 23133 23111 23111 326
2011-03-14 13:34:00 23110 23119 23103 23115 148
2011-03-14 13:35:00 23115 23139 23114 23129 284
Warning message:
timezone of object (Asia/Singapore) is different than current timezone ().
Notice how the lunch break started at 12:30 on the 14th Feb 2011 but started at 12:00 on the 14th March.
Basically, I am looking for a way to detect these breaks in the timestamps. However, looking for a single missing consecutive timestamp does not always work, as there are sometimes missing minutes in the middle of the trading day where nothing traded and so nothing was recorded. What I'm looking for is gaps in the xts time series greater than 5 minutes, output as a list that can be manipulated or used as an index, to help me subset the data easily.
You can use diff(index(x)) to identify holes exceeding 5 minutes.
# Sample data
library(xts)
k <- 100
x <- xts(rnorm(k), sort(Sys.time() + runif(k, 0, 5*3600)))
# Gaps to the next observation exceeding 5 minutes (compare in seconds)
i <- as.numeric(diff(index(x)), units = "secs") > 300
# Last observation before each break, plus the final observation
close <- x[c(which(i), nrow(x))]
# First observation overall, plus the first one after each break
open <- x[c(1, which(i) + 1)]
break_start <- index(close)
break_end <- index(open)
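To turn those positions into a table of gaps (a small follow-up sketch; which(i) marks the last observation before each gap), the bounding timestamps can be paired directly:
# one row per gap exceeding 5 minutes
gaps <- data.frame(start = index(x)[which(i)],
                   end   = index(x)[which(i) + 1])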
