I have a huge dataset in .csv format with two columns: Date_Time and Q.vanda.
This is what the head and tail of the data look like:
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between, and I want to create a for loop to subset them by (water) year (e.g. from 1969/10/01 to 1970/10/01, 1970/10/01 to 1971/10/01, etc.).
I wrote some code, but it's giving me an error that I am not able to resolve. I am pretty new to R, so feel free to suggest other code that you think is more efficient for my purpose.
code:
cut <- as.POSIXct(strptime(as.character(c('1969/10/01','1970/10/01','1971/10/01','1972/10/01','1973/10/01','1974/10/01','1975/10/01','1976/10/01','1977/10/01','1978/10/01','1979/10/01','1980/10/01','1981/10/01','1982/10/01','1983/10/01','1984/10/01','1985/10/01','1986/10/01','1987/10/01','1988/10/01','1989/10/01','1990/10/01','1991/10/01','1992/10/01','1993/10/01','1994/10/01','1995/10/01','1996/10/01','1997/10/01','1998/10/01',
'1999/10/01','2000/10/01','2001/10/01','2002/10/01','2003/10/01','2004/10/01',
'2005/10/01','2006/10/01','2007/10/01','2008/10/01','2009/10/01','2010/10/01',
'2011/10/01','2012/10/01','2013/10/01','2014/10/01','2015/10/01','2016/10/01')),format = "%Y/%m/%d"))
df.sub <- as.data.frame(matrix(data=NA,nrow=14496, ncol=96)) #nrow = (31+30+31+31+28)*(4*24)[days * readings/day] , ncol = (48*2)[Seasons*cols]
i.odd <- seq(1,49, by=2)
for (i in 1:48) {
  df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] & mdf.vanda$Date_Time < cut[i+1]]),
         i.odd[i]:(i.odd[i]+1)] <-
    subset(mdf.vanda, mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])
}
Error:
Error in [<-.data.frame(*tmp*, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
replacement element 1 has 1595 rows, need 1596
You can split your data as shown below:
split(mdf.vanda,
      findInterval(as.Date(mdf.vanda$Date_Time),
                   seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
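For example, you could store the result and inspect each yearly chunk (yearly is just an illustrative name; the list elements are keyed by the interval index that findInterval returns):
yearly <- split(mdf.vanda,
                findInterval(as.Date(mdf.vanda$Date_Time),
                             seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
# number of observations in each October-to-October year
sapply(yearly, nrow)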
There is no need for a loop here. Base R has the cut function, which performs this very operation and is significantly faster than a loop, and you already have the break points defined in your cut variable.
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'),format = "%Y/%m/%d")
mytime <- cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable mytime is a factor the same length as your data frame, labelling the bin that each row falls into.
You can then use the split function to break your data frame into a list of data frames, or use the group_by function from the dplyr package for additional data processing; both routes are sketched below.
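A minimal sketch of both options, assuming mytime has been computed as above (season and mean_Q are illustrative names, not from the original post):
# Base R: one data frame per bin defined by mytime
season_list <- split(mdf.vanda, mytime)

# dplyr: keep one data frame and summarise per bin
library(dplyr)
mdf.vanda %>%
  mutate(season = mytime) %>%
  group_by(season) %>%
  summarise(mean_Q = mean(Q.vanda), n = n())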
I suggest you have a look at the convenient quantmod package. Once you have time-series data, you can use the apply.yearly function to apply any function to every year of data.
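A rough sketch of that approach, assuming Date_Time is already POSIXct (apply.yearly comes from the xts package, which quantmod loads):
library(quantmod)  # loads xts

# build an xts time series from the data frame (assumes Date_Time is POSIXct)
q.xts <- xts(mdf.vanda$Q.vanda, order.by = mdf.vanda$Date_Time)

# apply any summary function (here the mean) to each year of data
apply.yearly(q.xts, mean)
Note that apply.yearly splits on calendar years; for the October-to-October seasons in the question, the cut/split approaches above are a closer fit.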
I am using xts to convert to hourly average data. I am starting with a year's worth of 10-minute data. Some hours have one 10-minute period (such as 'UTSP' in row 229) that is NA (missing).
For such hours I would still like the average of the data that are available; however, in the output I get NA for that variable for that hour.
Other hours may have no data (all data are missing). I want these completely missing hours to return NA, but where some data exist for an hour, I want that data to be used.
Here is a reproducible example of what I've been trying:
Lines <- "date,time,UTSP,UPM10,UPM25,UPM1,UWS,UWDT,PTSP,PPM10,PPM25,PPM1,PWS,PWDT
218,2014/10/15,22:00,9.7,4.9,4.66,1.54,6,152.56,102,53.6,33.71,10.34,NA,NA
219,2014/10/15,22:10,9.3,5.1,4.57,1.61,6.4,147.56,106.4,55.1,33.92,10.47,NA,NA
220,2014/10/15,22:20,8.9,5,4.7,1.55,6.4,147.56,108.3,54.8,33.19,10.53,NA,NA
221,2014/10/15,22:30,9.7,5.3,4.93,1.62,6.8,152.56,110.3,57.4,34.97,11.14,NA,NA
222,2014/10/15,22:40,9.1,5.2,4.76,1.54,6.8,152.56,118.9,62.3,37.58,11.63,NA,NA
223,2014/10/15,22:50,9.8,5.5,5.07,1.62,6.7,152.56,120.5,61.8,36.24,11.9,NA,NA
224,2014/10/15,23:00,11.1,5.6,5.2,1.59,6.4,152.56,108.6,57.1,34.93,11.66,NA,NA
225,2014/10/15,23:10,9.8,5.4,4.89,1.63,7.3,152.56,116,59.6,35.08,11.14,NA,NA
226,2014/10/15,23:20,9.1,5,4.95,1.63,7.1,152.56,122.6,63.8,38.28,12.17,NA,NA
227,2014/10/15,23:30,9.7,5.2,4.88,1.58,7.3,147.56,88.1,46.7,29.59,9.78,NA,NA
228,2014/10/15,23:40,9.2,5.2,4.79,1.66,7.1,152.56,92.4,48.8,30.11,9.69,NA,NA
229,2014/10/15,23:50,NA,NA,NA,NA,NA,NA,89.7,48.1,30.53,9.89,NA,NA
230,2014/10/16,00:00,9.8,5.5,5.03,1.6,7,147.56,91.2,47.5,30.09,9.38,NA,NA
231,2014/10/16,00:10,9.7,5.1,4.81,1.57,7.1,152.56,91.2,47.6,29.44,9.4,NA,NA
232,2014/10/16,00:20,9.9,5.4,5.09,1.61,7.4,147.56,91.1,48.3,29.78,9.23,NA,NA
233,2014/10/16,00:30,9.8,5.4,4.82,1.62,6.9,152.56,95.7,48.6,29.47,9.8,NA,NA
234,2014/10/16,00:40,10.6,5.7,4.99,1.58,6.8,147.56,91.3,47.9,29.57,9.94,NA,NA
235,2014/10/16,00:50,10.1,5.4,4.93,1.65,7,147.56,86.3,44.9,27.9,8.93,NA,NA"
conn <- textConnection(Lines)
dframe <- read.csv(conn)
close(conn)
library(xts)
USP_TSP.xts <- xts(dframe$UTSP,
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
I have also tried several variations of na.contiguous, na.omit, na.action.
My resultant output always seems to be the same (excerpt):
[,1]
2014-10-15 22:50:00 9.4166667
2014-10-15 23:50:00 NA
2014-10-16 00:50:00 9.9833333
... with the value for 2014-10-15 hr 23 being NA, even though there were 5 out of 6 non-missing values
Also, I am calculating all the columns separately, then combining them later. Is there an easier way - like calculating all the columns at once?
Calling na.exclude doesn't change the USP_TSP.xts object. You would need to assign the output of na.exclude to USP_TSP.xts to achieve that.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
But if you want to process all the columns in the object at once, using na.exclude is going to remove all rows that have at least one column with a missing value.
xData <- xts(dframe[,-(1:2)],
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(xData)
# UTSP UPM10 UPM25 UPM1 UWS UWDT PTSP PPM10 PPM25 PPM1 PWS PWDT
str(na.exclude(xData))
# An 'xts' object of zero-width
Instead, you should supply na.rm=TRUE to the call to mean inside the period.apply call. If you want to process all columns at the same time, you can use colMeans.
xDataMeans <- period.apply(xData, endpoints(xData, "hours"), colMeans, na.rm=TRUE)
xDataMeans
# UTSP UPM10 UPM25 UPM1 UWS UWDT
# 2014-10-15 22:50:00 9.416667 5.166667 4.781667 1.580 6.516667 150.8933
# 2014-10-15 23:50:00 9.780000 5.280000 4.942000 1.618 7.040000 151.5600
# 2014-10-16 00:50:00 9.983333 5.416667 4.945000 1.605 7.033333 149.2267
# PTSP PPM10 PPM25 PPM1 PWS PWDT
# 2014-10-15 22:50:00 111.06667 57.50000 34.93500 11.001667 NaN NaN
# 2014-10-15 23:50:00 102.90000 54.01667 33.08667 10.721667 NaN NaN
# 2014-10-16 00:50:00 91.13333 47.46667 29.37500 9.446667 NaN NaN
Your code works fine. You just need to assign USP_TSP.xts <- na.exclude(USP_TSP.xts). If you merely call na.exclude(USP_TSP.xts), then the output without NAs is printed, but it is not stored in any variable.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
# [,1]
#2014-10-15 22:50:00 9.416667
#2014-10-15 23:40:00 9.780000
#2014-10-16 00:50:00 9.983333
Alternatively you can use period.apply(USP_TSP.xts,ep,mean, na.rm=T) if you don't want to modify the original xts object.
I have some data where I aggregate the information on a unique-minute basis with the code below, based on a dataset for one day.
I would, however, like to be able to run this code on a data file that combines multiple days. I have a date column in the dataset, so I can use that as a unique identifier for each day. Is there a way to aggregate the data on a 1-minute basis, given that the dates aren't the same?
The problem is that the unique function extracts the unique events that occur on the first day and then adds all the same events that happen in that minute afterwards. If I base it on the date too, I believe I can create unique 1-minute entries for each day in one long dataset.
Below is the code that works for a single day's data.
novo <- read.csv("C:/Users/Morten/Desktop/data.csv", header = TRUE, stringsAsFactors=FALSE )
TimeStamp <- novo[,1]
price <- novo[, 2]
volume <- novo[,3]
nV <- sum(volume)
MinutesFloor <- unique(floor(TimeStamp))
nTradingMinutes <- length(MinutesFloor)
PriceMin <- rep(0, nTradingMinutes)
VolumeMin <- rep(0, nTradingMinutes)
for( j in 1:nTradingMinutes){
ThisMinutes <- (floor(TimeStamp) == MinutesFloor[j])
PriceMin[j] <- mean(price[ThisMinutes])
VolumeMin[j] <- sum(volume[ThisMinutes])
}
Thanks in advance
data format:
date,"ord","shares","finalprice","time","stock"
20100301,C,80,389,540.004,1158
20100301,C,77,389,540.004,1158
20100301,C,60,389,540.004,1158
20100301,C,28,389,540.004,1158
20100301,C,7,389,540.004,1158
20100302,C,25,394.7,540.00293333,1158
20100302,C,170,394.7,540.00293333,1158
20100302,C,40,394.7,540.00293333,1158
20100302,C,75,394.7,540.00293333,1158
20100302,C,100,394.7,540.00293333,1158
20100302,C,1,394.7,540.00293333,1158
I would like to suggest a radically simplified version of your code.
You are doing quite a few things rather inefficiently. R is made to compute summary statistics grouped by different data values, and we will use these methods heavily.
I assume your data are of the form you provided. On my system, this looks like:
novo <- read.csv("test.csv", header = TRUE, stringsAsFactors=FALSE )
This gives us:
> str(novo)
'data.frame': 11 obs. of 6 variables:
$ date : int 20100301 20100301 20100301 20100301 20100301 20100302 20100302 20100302 20100302 20100302 ...
$ ord : chr "C" "C" "C" "C" ...
$ shares : int 80 77 60 28 7 25 170 40 75 100 ...
$ finalprice: num 389 389 389 389 389 ...
$ time : num 540 540 540 540 540 ...
$ stock : int 1158 1158 1158 1158 1158 1158 1158 1158 1158 1158 ...
Now, I assume that your date is ordered year-month-day. If you have a different ordering, you would have to alter the format string below. Furthermore, your time is presumably in minutes.
Then we can create timestamps containing both the date and the time using the POSIXct datatype:
timestamps <- as.POSIXct(as.character(novo$date), format='%Y%m%d') + novo$time*60
Now we do the binning by minute, by creating a factor variable with the cut function:
timestampsByMinute <- droplevels(cut(timestamps, 'min'))
Note that the additional droplevels call just removes the minutes for which no data items are available.
Finally, we can compute the summary statistics you computed in the for loop.
tapply takes its first argument, divides it into groups defined by the second argument, and applies the function given as the third argument to each group. Thus we can just throw tapply at your data. (I have the feeling that the column numbers you used in your code do not match the column names in your example data; feel free to adapt to different columns if I interpreted your meaning the wrong way.)
PriceMin <- tapply(novo$finalprice, timestampsByMinute, mean)
VolumeMin <- tapply(novo$shares, timestampsByMinute, sum)
This gives us
> PriceMin
2010-03-01 09:00:00 2010-03-02 09:00:00
389.0 394.7
> VolumeMin
2010-03-01 09:00:00 2010-03-02 09:00:00
252 411
which is probably what you want.
Note that tapply is much faster than the loop you used. If you have huge data files, this may be important.
I hope there are no errors left in my code; testing was not easy given that you provided data for only one minute per day.
Edit:
As per request, here is a small modification that removes the time information from the data:
> unname(VolumeMin)
[1] 252 411
> unname(PriceMin)
[1] 389.0 394.7
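If you then want both summaries in a single table, a small sketch (minute_summary and its column names are just illustrative):
# combine the per-minute summaries into one data frame,
# keeping the minute label as a proper column
minute_summary <- data.frame(
  minute    = as.POSIXct(names(PriceMin)),
  PriceMin  = as.vector(PriceMin),
  VolumeMin = as.vector(VolumeMin)
)
minute_summary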
I am familiar with the zoo function rollapply, which allows you to do rolling computations on zoo or xts objects, and you can specify the rolling increment via the by parameter. I am specifically interested in applying a function every month but using all of the past daily data in the computation. For example, say my data set looks like this:
dte, val
1/01/2001, 10
1/02/2001, 11
...
1/31/2001, 2
2/01/2001, 54
2/02/2001, 34
...
2/28/2001, 29
I would like to select the end of each month and apply a function that uses all the daily data. This doesn't seem like it would work with rollapply since the by argument would be 30 sometimes, 29 other months, etc. My current idea is:
f <- function(xts_obj) { coef(lm(a ~ b, data=as.data.frame(xts_obj)))[1] }
month_end <- endpoints(my_xts, on="months", k=1)
rslt <- apply(month_end, 1, function(idx) { my_xts[paste0("/",idx)] })
Surely there is a better way to do this that would be quicker, no?
To clarify: I would like to use overlapping periods; it's just that the rolling should be done monthly.
If I understand correctly, you can get the dates of your endpoints, then for each endpoint (e.g. using lapply or a for loop), call rollapply using data up to that point.
getSymbols("SPY", src='yahoo', from='2012-01-01', to='2012-08-01')
idx <- index(SPY)[endpoints(SPY, 'months')]
out <- lapply(idx, function(i) {
as.xts(rollapplyr(as.zoo(SPY[paste0("/", i)]), 5,
function(x) coef(lm(x[, 4] ~ x[, 1]))[2], by.column=FALSE))
})
sapply(out, NROW)
#[1] 16 36 58 78 100 121 142 143
I temporarily coerce to zoo for the rollapplyr call to make sure the rollapply.zoo method is being used (as opposed to the unexported rollapply.xts method), then coerce back to xts.
As an answer to "Is the zoo/xts conversion needed?":
It isn't needed in this case, but rollapply won't work if you send it a data frame, as I recently discovered from this StackOverflow answer.
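A small sketch of that pitfall and the usual workaround (the data and variable names are made up for illustration):
library(zoo)

# a plain data frame of daily values
df <- data.frame(dte = as.Date("2001-01-01") + 0:9,
                 val = rnorm(10))

# passing df straight to rollapply is the problem noted above;
# convert to a zoo object indexed by date first
z <- zoo(df$val, order.by = df$dte)
rollapplyr(z, 3, mean)  # 3-day right-aligned rolling mean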
You want period.apply(), or its convenience helper apply.monthly(), both in xts.
Example:
R> foo <- xts(1:100, order.by=Sys.Date()+0:99)
R> apply.monthly(foo, sum)
[,1]
2012-08-31 105
2012-09-30 885
2012-10-31 1860
2012-11-25 2200
R>
or equally
R> apply.monthly(foo, quantile)
0% 25% 50% 75% 100%
2012-08-31 1 4.25 7.5 10.75 14
2012-09-30 15 22.25 29.5 36.75 44
2012-10-31 45 52.50 60.0 67.50 75
2012-11-25 76 82.00 88.0 94.00 100
R>
just to prove that functions returning more than one value can be used too.
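For instance, a user-defined function returning several named values works the same way; a quick sketch (rng is just an illustrative name):
rng <- function(x) c(min = min(x), max = max(x))
apply.monthly(foo, rng)
The per-month min and max should match the 0% and 100% columns of the quantile output above.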