Average xts object with missing values to hourly endpoints
I am using xts to convert a year's worth of 10-minute data to hourly averages. Some hours are missing a single 10-minute reading for a variable (for example, 'UTSP' in row 229 is NA).
For such hours, I would still like the average of the data that are available; however, in the output I get NA for that variable for the whole hour.
Other hours may have no data at all (every reading is missing). I want these completely missing hours to return NA, but where some data exist for an hour, I want those data used.
Here is a reproducible example of what I've been trying:
Lines <- "date,time,UTSP,UPM10,UPM25,UPM1,UWS,UWDT,PTSP,PPM10,PPM25,PPM1,PWS,PWDT
218,2014/10/15,22:00,9.7,4.9,4.66,1.54,6,152.56,102,53.6,33.71,10.34,NA,NA
219,2014/10/15,22:10,9.3,5.1,4.57,1.61,6.4,147.56,106.4,55.1,33.92,10.47,NA,NA
220,2014/10/15,22:20,8.9,5,4.7,1.55,6.4,147.56,108.3,54.8,33.19,10.53,NA,NA
221,2014/10/15,22:30,9.7,5.3,4.93,1.62,6.8,152.56,110.3,57.4,34.97,11.14,NA,NA
222,2014/10/15,22:40,9.1,5.2,4.76,1.54,6.8,152.56,118.9,62.3,37.58,11.63,NA,NA
223,2014/10/15,22:50,9.8,5.5,5.07,1.62,6.7,152.56,120.5,61.8,36.24,11.9,NA,NA
224,2014/10/15,23:00,11.1,5.6,5.2,1.59,6.4,152.56,108.6,57.1,34.93,11.66,NA,NA
225,2014/10/15,23:10,9.8,5.4,4.89,1.63,7.3,152.56,116,59.6,35.08,11.14,NA,NA
226,2014/10/15,23:20,9.1,5,4.95,1.63,7.1,152.56,122.6,63.8,38.28,12.17,NA,NA
227,2014/10/15,23:30,9.7,5.2,4.88,1.58,7.3,147.56,88.1,46.7,29.59,9.78,NA,NA
228,2014/10/15,23:40,9.2,5.2,4.79,1.66,7.1,152.56,92.4,48.8,30.11,9.69,NA,NA
229,2014/10/15,23:50,NA,NA,NA,NA,NA,NA,89.7,48.1,30.53,9.89,NA,NA
230,2014/10/16,00:00,9.8,5.5,5.03,1.6,7,147.56,91.2,47.5,30.09,9.38,NA,NA
231,2014/10/16,00:10,9.7,5.1,4.81,1.57,7.1,152.56,91.2,47.6,29.44,9.4,NA,NA
232,2014/10/16,00:20,9.9,5.4,5.09,1.61,7.4,147.56,91.1,48.3,29.78,9.23,NA,NA
233,2014/10/16,00:30,9.8,5.4,4.82,1.62,6.9,152.56,95.7,48.6,29.47,9.8,NA,NA
234,2014/10/16,00:40,10.6,5.7,4.99,1.58,6.8,147.56,91.3,47.9,29.57,9.94,NA,NA
235,2014/10/16,00:50,10.1,5.4,4.93,1.65,7,147.56,86.3,44.9,27.9,8.93,NA,NA"
conn <- textConnection(Lines)
dframe <- read.csv(conn)
close(conn)
library(xts)
USP_TSP.xts <- xts(dframe$UTSP,
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
I have also tried several variations of na.contiguous, na.omit, and na.action.
The resulting output always seems to be the same (excerpt):
                         [,1]
2014-10-15 22:50:00 9.4166667
2014-10-15 23:50:00        NA
2014-10-16 00:50:00 9.9833333
... with the value for 2014-10-15 hour 23 being NA, even though 5 of the 6 values were non-missing.
Also, I am calculating all the columns separately, then combining them later. Is there an easier way - like calculating all the columns at once?
Calling na.exclude doesn't change the USP_TSP.xts object. You would need to assign the output of na.exclude to USP_TSP.xts to achieve that.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
But if you want to process all the columns in the object at once, using na.exclude is going to remove all rows that have at least one column with a missing value.
xData <- xts(dframe[,-(1:2)],
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(xData)
# UTSP UPM10 UPM25 UPM1 UWS UWDT PTSP PPM10 PPM25 PPM1 PWS PWDT
str(na.exclude(xData))
# An 'xts' object of zero-width
Instead, you should supply na.rm=TRUE to the call to mean inside the period.apply call. If you want to process all columns at the same time, you can use colMeans.
xDataMeans <- period.apply(xData, endpoints(xData, "hours"), colMeans, na.rm=TRUE)
xDataMeans
# UTSP UPM10 UPM25 UPM1 UWS UWDT
# 2014-10-15 22:50:00 9.416667 5.166667 4.781667 1.580 6.516667 150.8933
# 2014-10-15 23:50:00 9.780000 5.280000 4.942000 1.618 7.040000 151.5600
# 2014-10-16 00:50:00 9.983333 5.416667 4.945000 1.605 7.033333 149.2267
# PTSP PPM10 PPM25 PPM1 PWS PWDT
# 2014-10-15 22:50:00 111.06667 57.50000 34.93500 11.001667 NaN NaN
# 2014-10-15 23:50:00 102.90000 54.01667 33.08667 10.721667 NaN NaN
# 2014-10-16 00:50:00 91.13333 47.46667 29.37500 9.446667 NaN NaN
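One wrinkle is visible in the PWS and PWDT columns above: when every value in a period (or column) is missing, colMeans with na.rm=TRUE returns NaN rather than NA, because it divides by zero remaining observations. Since the question asks for NA when an hour is completely missing, here is a minimal post-processing sketch (my suggestion, assuming the xDataMeans object just created):
# NaN marks periods where every value was NA; map those back to NA
coredata(xDataMeans)[is.nan(coredata(xDataMeans))] <- NA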
Your code works fine. You just need to assign USP_TSP.xts <- na.exclude(USP_TSP.xts). If you merely call na.exclude(USP_TSP.xts), then the output without NAs is printed, but it is not stored in any variable.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
# [,1]
#2014-10-15 22:50:00 9.416667
#2014-10-15 23:40:00 9.780000
#2014-10-16 00:50:00 9.983333
Alternatively, you can use period.apply(USP_TSP.xts, ep, mean, na.rm=TRUE) if you don't want to modify the original xts object.
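One subtle difference between the two approaches is visible in the output above: dropping the NA row first shifts the endpoint of hour 23 to 23:40, because 23:50 no longer exists in the index, whereas na.rm=TRUE on the unmodified object keeps the original 23:50 timestamp. A minimal sketch, assuming the objects from the question:
# On the original object, before any na.exclude assignment
ep0 <- endpoints(USP_TSP.xts, 'hours')
period.apply(USP_TSP.xts, ep0, mean, na.rm = TRUE)
#                          [,1]
# 2014-10-15 22:50:00 9.416667
# 2014-10-15 23:50:00 9.780000
# 2014-10-16 00:50:00 9.983333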
Related
apply.yearly() works with subset but not on full time series dataset in R
When I run the following code on my dataset, I get an output (partial one shown) like this:

all_countries_ts[, grepl("Muslims", colnames(all_countries_ts))]

           Senegal Muslims Serbia Muslims Seychelles Muslims
1970-01-01         3693807         200000                170
2000-01-01         8936283         529322                730
2010-01-01        11713126         527598                821
2015-01-01        13621382         471414                844

However, when I try to use the function apply.yearly on it to sum across the years, I just get an NA result:

apply.yearly(all_countries_ts[, grepl("Muslims", colnames(all_countries_ts))], FUN = sum)

1970-01-01 NA
2000-01-01 NA
2010-01-01 NA
2015-01-01 NA

The funny thing is that it works with some inputs but not others. For example, if I use the input "Agnostics" instead of "Muslims", I get a good result. There isn't an error, so I can't seem to figure out what exactly is happening here. all_countries_ts is stored as an xts object. One thing to note is that apply.yearly() always works on a subset of this dataset. I have written a function, which you can see below:

sum_by_category <- function(religious_group, dataset) {
  apply.yearly(dataset[, grepl(paste(religious_group), colnames(dataset))], FUN = sum)
}

country_search <- function(country_name, z) {
  z <- foreach(i = 1:length(country_name), .combine = merge.xts) %do% {
    all_countries_ts[, grepl(country_name[i], colnames(all_countries_ts))]
  }
  return(z)
}

When I type in the following, it works perfectly:

sum_by_category("Muslims", country_search("Senegal"))

           Senegal Muslims
1970-01-01         3693807
2000-01-01         8936283
2010-01-01        11713126
2015-01-01        13621382

I really can't figure out what's going on since it works with some inputs and not others. Thanks in advance for any help / insights!
xts::apply.yearly expects its x argument to be coercible to an xts object. Perhaps your data.frame is not an xts-compatible data frame. The help for apply.yearly explains:

Arguments
x    a time-series object coercible to xts
FUN  an R function

I have created sample data based on the data shared by the OP and converted it to the xts class. apply.yearly works correctly on it.

library(xts)

# Convert data.frame to xts class
all_countries_ts <- xts(df[, -1], order.by = df$Date)

# Now one can use `apply.yearly`
apply.yearly(all_countries_ts[, grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
#                [,1]
# 1970-01-01  3893977
# 2000-01-01  9466335
# 2010-01-01 12241545
# 2015-01-01 14093640

Edited: Review of the OP's data suggests that it contains NA in many columns, which is causing the total sum to be shown as NA. The fix is simple; the OP needs to use:

apply.yearly(all_countries_ts[, grepl("Muslims", colnames(all_countries_ts))],
             FUN = sum, na.rm = TRUE)
#                  [,1]
# 1970-01-01  570772699
# 2000-01-01 1292170756
# 2010-01-01 1571250533
# 2015-01-01 1734531709

Data:

df <- read.table(text = "Date 'Senegal Muslims' 'Serbia Muslims' 'Seychelles Muslims' Others
1970-01-01  3693807 200000 170 200
2000-01-01  8936283 529322 730 100
2010-01-01 11713126 527598 821 300
2015-01-01 13621382 471414 844 500",
                 header = TRUE, stringsAsFactors = FALSE)

# convert Date column to Date format
df$Date <- as.Date(df$Date)
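To see which columns are driving the NA result before reaching for na.rm, here is a quick diagnostic (a suggestion, not part of the answer above):
# Count missing values per column of the subset being summed;
# any column with a nonzero count will poison sum() without na.rm=TRUE
colSums(is.na(all_countries_ts[, grepl("Muslims", colnames(all_countries_ts))]))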
R: data.frame to xts object: inadvertent conversion of numerical data to strings
I am trying different ways to convert a data.frame to an xts time series. My current attempt gives me a result that I don't quite understand; I'd like to know why.

RX1 <- read.csv("RX1.csv")
RX1 <- setNames(RX1, c("Date", "PX_LAST"))
RX1 <- RX1[-1, ]  # just getting rid of first row as data not valid
RX1$Date <- as.Date(as.character(RX1$Date), format="%m/%d/%Y")

Now we should be good to go, as we have the following data.frame:

        Date PX_LAST
2 2006-07-26   89.57
3 2006-07-27   89.86
4 2006-07-28   90.15
5 2006-07-31   90.17
6 2006-08-01   90.06
7 2006-08-02   90.04

RX1.ts <- xts(RX1$PX_LAST, order.by = RX1$Date)

The result is

2006-07-26 "89.57"
2006-07-27 "89.86"
2006-07-28 "90.15"
2006-07-31 "90.17"
2006-08-01 "90.06"
2006-08-02 "90.04"

Can somebody help me understand what I did that caused the conversion of the price to characters?
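One likely cause, offered as an assumption rather than a confirmed diagnosis: if the invalid first row of RX1.csv contains non-numeric text, read.csv reads the whole PX_LAST column as character (or factor), and deleting the row afterwards does not change the column's class. xts then stores a character matrix, which is why the values print with quotes. A minimal sketch of the fix under that assumption:
# Coerce PX_LAST back to numeric before building the xts object;
# as.character() first guards against the factor case
RX1$PX_LAST <- as.numeric(as.character(RX1$PX_LAST))
RX1.ts <- xts(RX1$PX_LAST, order.by = RX1$Date)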
Map a list of events (instants) to a list of periods (intervals) in R (with or without lubridate)
I have two data frames: one containing time periods marked with unique character IDs, and another containing events with another set of unique IDs associated with them.

Period DF (code):

periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
                            "2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
                          "2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)

Period DF:

  periodID         periodStart           periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00

Event DF (code):

eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
                          "2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)

Event DF:

   eventID           eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01

I want to map the event times in the second DF to the time periods in the first DF, in order to match the ID of the event to the ID of the period. Essentially, the result table I want should look like:

   eventID periodID
1 E_UID_00       NA
2       NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5       NA P_UID_02
6       NA P_UID_03
7 E_UID_03       NA

I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals, and then using some form of apply and instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.

Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for the DFs to contain unassociatable events and time periods
- the solution must not include loops
- it does not have to be solved with lubridate; in fact, a solution in base R will be even more welcome
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it will be greatly appreciated!

First off, the start and end times in the period DF should be converted to lubridate intervals:

intervalsP <- as.interval(periodStart, periodEnd)

Step 2: A function should be created for checking if an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with apply:

PeriodAssign <- function(x, y){
  # x - instants
  # y - intervals
  variable1 <- mapply(`%within%`, x, y)
  if (length(y[variable1]) != 0) {
    as.character(y[variable1])
  } else {
    NA
  }
}

NOTE: I had to use the interval-to-character coercion, because otherwise intervals were coerced to their length in seconds by the apply function, and as such were not really useful for matching purposes - i.e. all four intervals in this example are the same length.

Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:

eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]

# adding in the unmatched Periods and Events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]

The eventual DF looks like:

> presentableDF
  periodID  eventID
3 P_UID_00     <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02     <NA>
5 P_UID_03     <NA>
6     <NA> E_UID_00
7     <NA> E_UID_03
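Since the question explicitly asks for a base R alternative, here is a minimal loop-free sketch. It is my own suggestion rather than part of the answer above, and it leans on the stated guarantee that periods never overlap:
# Sort periods by start time so findInterval can be used
o  <- order(periodDF$periodStart)
ps <- periodDF$periodStart[o]
pe <- periodDF$periodEnd[o]
# For each event, the index of the last period starting at or before it
i  <- findInterval(eventDF$eventTime, ps)
# A match also requires the event to fall before that period's end
ok <- i > 0 & eventDF$eventTime <= pe[pmax(i, 1)]
eventDF$periodID <- ifelse(ok, as.character(periodDF$periodID[o])[pmax(i, 1)], NA)
Unmatched periods can then be appended with the same rbind steps used above.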
Issues with Indexing and Merging XTS Objects in R
Apologies in advance if this is answered elsewhere. I have searched for roughly 24 hours and have come up empty at every turn. This is the data set I am working with:

Sys.setenv(TZ='GMT')
dat = read.csv("SPY_MINUTE_TRADE.csv", header = TRUE)  # QuantQuote sample minute data
dat[,2] <- sprintf('%04d', dat[,2])  # add a zero to front of time, IE 400 becomes 0400 aka 4AM

# Create a zoo object ordered by day and time from the dat dataframe
datzoo <- read.zoo(file=dat, sep=",", header=TRUE, index.column=1:2,
                   format="%Y%m%d %H%M", tz="",
                   colClasses = rep(c("character", "numeric"), c(2, 8)))
Spy <- as.xts(datzoo)

# Create regular series from 00:00 to 23:59 of 1 minute prints
y <- xts(seq(from = 1, to = 60*24, by = 1),
         as.POSIXlt((0), origin="2013-03-30 00:00", tz='GMT') +
           seq(from = 0, to = 60*60*24-1, by = 60))
colnames(y) <- "TempIndex"

# Merge the regular ts (y) with Spy and remove the original Spy column
SpyReg <- merge(y, Spy, join='left')
SpyReg$TempIndex <- NULL

# Capture the index of Spy
ISpy <- index(Spy)

I have a few questions about the above code...

1) SpyReg["2012-03-30 04:00:00 GMT"] returns

OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS

while Spy["2012-03-30 04:00:00 GMT"] returns the correct values of Spy for the given index:

                      OPEN   HIGH    LOW  CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
2012-03-30 04:00:00 140.66 140.66 140.66 140.66   2160      1        0         0

However, SpyReg["T04:00:00/T04:01:00"] returns

                    OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
2013-03-30 04:00:00   NA   NA  NA    NA     NA     NA       NA        NA
2013-03-30 04:01:00   NA   NA  NA    NA     NA     NA       NA        NA

Why is this, when both are xts objects of the same index type, month, and time? Shouldn't SpyReg["2012-03-30 04:00:00 GMT"] return:

                    OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
2013-03-30 04:00:00   NA   NA  NA    NA     NA     NA       NA        NA

2) Why did the merge not give SpyReg the Spy value for the same index (such as the 4AM print)? I tried all 4 "join" options, but none worked...

3) I assume there is a MUCH more elegant way to solve this problem than what I am trying to do. After creating Spy, it was not regular, minute by minute. I wanted to create a regular xts object that had no gaps and flowed continuously minute by minute from midnight to 23:59, add the entries from Spy into it, then do na.locf to replace the rest of the NAs with the original data.
Setting the index of an xts object to POSIXlt can cause some strange behaviors. I'd simply recommend you use POSIXct instead.

URL <- "http://quantquote.com/sample/SPY_MINUTE_TRADE.csv"
Spy <- read.zoo(URL, sep=",", header=TRUE, index.column=1:2,
                FUN=function(x) as.POSIXct(sprintf("%8d %04d", x[,1], x[,2]),
                                           format="%Y%m%d %H%M", tz=""))
Spy <- as.xts(Spy)

Now you can merge Spy with an 'empty' xts object that has the regular index values you want:

SpyReg <- merge(Spy, xts(, seq(start(Spy), end(Spy), by="1 min")), fill=na.locf)
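As a quick sanity check on the result (my addition, assuming the SpyReg object just built and that Spy's timestamps all fall on whole minutes), the merged index should now be strictly regular at one-minute spacing:
# .index() returns the raw index in seconds; any gap shows up as a diff != 60
all(diff(.index(SpyReg)) == 60)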
How to merge pairs of Dates and values contained in a single csv
We have a csv file with dates in Excel numeric format and NAVs for Manager A and Manager B, as follows:

Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521

We want to create a zoo object with the structure [Dates, Manager A Nav, Manager B Nav]. After reading the csv file with:

data = read.csv("...", header=TRUE, sep=",")

we set an index for splitting the object and use lapply to split:

INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))

I'm stuck on the fact that the dates are in Excel format and don't know how to convert them. Is the problem set up correctly?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since that origin. Usually this is Jan 0 1900! Go figure, but be careful, as I don't think this is always the case. You can try this:

# Excel's origin is day 0 on Jan 0 1900, but Excel wrongly treats 1900
# as a leap year, hence the 1899/12/30 origin...
data$Date <- as.Date(data$Date, origin = "1899/12/30")
data$Date.1 <- as.Date(data$Date.1, origin = "1899/12/30")
# For more info see ?as.Date

If you are interested in keeping the times as well, you can use as.POSIXct, but you must also specify the timezone (UTC by default). Note that as.POSIXct expects seconds rather than days, so multiply the raw values by 86400:

data$Date <- as.POSIXct(data$Date * 86400, origin = "1899/12/30", tz = "GMT")
head(data)
#                  Date Manager.A     Date.1 Manager.B
# 1 2013-03-13 16:00:00       100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00       100 2012-02-03  99.99999
# 3 2013-03-15 16:00:00       100 2012-02-06  99.99999
# 4 2013-03-18 16:00:00       100 2012-02-07  99.99999
# 5 2013-03-19 16:00:00       100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00       100 2012-02-09 100.04877
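To finish what the question asks for (a single zoo object [Dates, Manager A Nav, Manager B Nav]), here is a minimal sketch; it assumes the date columns have already been converted as above, and it drops the trailing rows where Manager A's fields are empty:
library(zoo)
# Keep only complete rows for each manager, then merge on the date index
okA <- !is.na(data$Date)
okB <- !is.na(data$Date.1)
zA  <- zoo(data$Manager.A[okA], data$Date[okA])
zB  <- zoo(data$Manager.B[okB], data$Date.1[okB])
nav.zoo <- merge(zA, zB)   # NA where one manager has no NAV for that date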