Issues with Indexing and Merging XTS Objects in R

Apologies in advance if this is answered elsewhere. I have searched for roughly 24 hours and have come up empty at every turn.
This is the data set I am working with:
Sys.setenv(TZ='GMT')
dat = read.csv("SPY_MINUTE_TRADE.csv", header = TRUE) #QuantQuote sample minute data
dat[,2] <- sprintf('%04d', dat[,2]) # pad the time with leading zeros, e.g. 400 becomes 0400 (4 AM)
#Create a zoo object ordered by day and time from the dat dataframe
datzoo <- read.zoo(file=dat, sep=",", header=TRUE,
index.column=1:2, format="%Y%m%d %H%M", tz="",
colClasses = rep(c("character", "numeric"), c(2, 8)))
Spy <- as.xts(datzoo)
# Create regular series from 00:00 to 23:59 of 1 minute prints
y <- xts(seq(from = 1, to = 60*24, by = 1), as.POSIXlt((0),
origin="2013-03-30 00:00", tz='GMT')+seq(from = 0, to = 60*60*24-1, by = 60))
colnames(y) <- "TempIndex"
#Merge the regular ts (y) with Spy and remove the original Spy column
SpyReg <- merge(y,Spy, join='left')
SpyReg$TempIndex <- NULL
#Capture the index of Spy
ISpy <- index(Spy)
I have a few questions about the above code...
1) SpyReg["2012-03-30 04:00:00 GMT"] returns only the column headers (zero rows):
OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
Spy["2012-03-30 04:00:00 GMT"] returns the correct values of Spy for the given index:
OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
2012-03-30 04:00:00 140.66 140.66 140.66 140.66 2160 1 0 0
However,
SpyReg["T04:00:00/T04:01:00"]
OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
2013-03-30 04:00:00 NA NA NA NA NA NA NA NA
2013-03-30 04:01:00 NA NA NA NA NA NA NA NA
Why is this, when both are xts objects with the same index type, month, and time? Shouldn't SpyReg["2012-03-30 04:00:00 GMT"] return:
OPEN HIGH LOW CLOSE VOLUME SPLITS EARNINGS DIVIDENDS
2013-03-30 04:00:00 NA NA NA NA NA NA NA NA
2) Why did the merge not give SpyReg the Spy value for the same index (such as the 4 AM print)? I tried all 4 "join" options, but none worked...
3) I assume there is a MUCH more elegant way to solve this problem than what I am trying to do. After creating Spy, it was not regular, minute by minute. I wanted to create a regular xts object with no gaps, flowing continuously minute by minute from midnight to 23:59, add the entries from Spy into it, and then use na.locf to fill the remaining NAs by carrying the last observation forward.

Setting the index of an xts object to POSIXlt can cause some strange behaviors. I'd simply recommend you use POSIXct instead.
URL <- "http://quantquote.com/sample/SPY_MINUTE_TRADE.csv"
Spy <- read.zoo(URL, sep=",", header=TRUE, index.column=1:2, FUN=function(x)
as.POSIXct(sprintf("%8d %04d",x[,1],x[,2]), format="%Y%m%d %H%M", tz=""))
Spy <- as.xts(Spy)
Now you can merge Spy with an 'empty' xts object that has the regular index values you want.
SpyReg <- merge(Spy, xts(, seq(start(Spy),end(Spy),by="1 min")), fill=na.locf)
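To see that the original symptoms go away, you can repeat the subsetting from the question on the rebuilt object. A quick check (sketch), assuming the sample file still contains the 2012-03-30 session shown above:
SpyReg["2012-03-30 04:00:00"]
SpyReg["T04:00:00/T04:01:00"]
Both forms should now return filled rows, because the index is POSIXct and every minute between start(Spy) and end(Spy) is present.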

to.period function keeps the 'missing values'

Very new to R, so I hope I don't frustrate anyone.
Putting together pieces from online searches and using the quantmod and purrr packages, I have the following code to create an xts object called stocks:
symbols <- c("RYCVX","AJA","IEMG")
start <- as.Date("2006-06-22")
end <- as.Date("2020-07-30")
# collect adjusted column of all symbols in one matrix
stocks <- getSymbols(symbols,src = "yahoo", from = start, to = end,
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>%
reduce(merge) %>%
`colnames<-`(symbols)
This is daily, but I want to have a monthly matrix, yet still keep the NA fields.
I tried this line of code:
mstocks <- to.monthly(stocks, indexAt = "last", OHLC = FALSE)
but my resulting data frame is shrunk down to the symbol with the least amount of data, since any row with any missing value is omitted, so I end up losing data on the more historically rich symbol.
Is there a way I could keep the missing values and have monthly data that, like my daily data, has rows where one symbol is NA?
So here is what I get:
RYCVX AJA IEMG
2018-12-30 29.3045 4.5523 33.2045 <- first date all symbols have data
...
2020-07-30 34.2344 5.6664 12.2234
What I get now with Walt's help:
V1
2006-06-30 NA
...
2020-07-29 52.66000
What I need:
RYCVX AJA IEMG
2006-06-30 29.3045 NA NA
....
2020-07-30 34.2344 5.6664 12.2234
All prices are made up
I'll assume that by monthly data you mean the adjusted-close price on the last trading day of the month. To simplify the code a bit, I've used auto.assign = FALSE so that getSymbols returns an xts time-series object rather than placing it in the environment. I've also used the function setNames rather than `colnames<-`(symbols), which works but is somewhat opaque. To convert to monthly, use apply.monthly(last), which takes the last observation of each month in the time series. Data for all months are returned, including months with NA in some of the series.
library(tidyverse)
library(quantmod)
symbols <- c("RYCVX","AJA","IEMG")
start <- as.Date("2006-06-22")
end <- as.Date("2020-07-30")
stocks <- symbols %>% map( ~Ad(getSymbols(.x, src = "yahoo", from = start, to = end,
auto.assign = FALSE,
warnings = FALSE))) %>%
reduce(merge) %>%
setNames(symbols) %>%
apply.monthly(last)
which gives:
> stocks
RYCVX AJA IEMG
2006-06-30 21.901295 5.810 NA
2006-07-31 21.892862 6.260 NA
2006-08-31 22.643713 18.400 NA
2006-09-29 23.732025 19.250 NA
2006-10-31 25.284351 6.160 NA
2006-11-30 25.908657 6.960 NA
2006-12-29 26.817636 20.900 NA
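If you would rather stay with to.period-style conversion, a variation that should also keep the NAs is to convert each adjusted-close series to monthly on its own and only then merge, so symbols that start later simply contribute NA rows for the earlier months. This is an untested sketch; indexAt = "lastof" puts every series on calendar month-ends so the merged rows line up:
mstocks <- symbols %>%
map(~Ad(getSymbols(.x, src = "yahoo", from = start, to = end,
auto.assign = FALSE, warnings = FALSE))) %>%
map(~to.monthly(.x, indexAt = "lastof", OHLC = FALSE)) %>%
reduce(merge) %>%
setNames(symbols)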

apply.yearly() works with subset but not on full time series dataset in R

When I run the following code on my dataset, I get an output (partial one shown) like this:
all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))]
Senegal Muslims Serbia Muslims Seychelles Muslims
1970-01-01 3693807 200000 170
2000-01-01 8936283 529322 730
2010-01-01 11713126 527598 821
2015-01-01 13621382 471414 844
However, when I try to use the function apply.yearly on it to sum across the years, I just get an NA result:
apply.yearly(all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
1970-01-01 NA
2000-01-01 NA
2010-01-01 NA
2015-01-01 NA
The funny thing is that it works with some inputs but not others. For example, if I use input "Agnostics" instead of "Muslims", I get a good result. There isn't an error, so I can't seem to figure out what exactly is happening here.
all_countries_ts is stored as an xts object. One thing to note is that apply.yearly() always works on a subset of this dataset. I have written a function and you can see it below:
sum_by_category <- function(religious_group, dataset) {
  apply.yearly(dataset[, grepl(paste(religious_group), colnames(dataset))], FUN = sum)
}
country_search <- function(country_name, z) {
  z <- foreach(i = 1:length(country_name), .combine = merge.xts) %do% {
    all_countries_ts[, grepl(country_name[i], colnames(all_countries_ts))]
  }
  return(z)
}
When I type in the following, it works perfectly:
sum_by_category("Muslims", country_search("Senegal"))
Senegal Muslims
1970-01-01 3693807
2000-01-01 8936283
2010-01-01 11713126
2015-01-01 13621382
I really can't figure out what's going on since it works with some inputs and not others. Thanks in advance for any help / insights!
xts::apply.yearly expects its x argument to be coercible to an xts object. Perhaps your data.frame is not an xts-compatible data frame.
The help for apply.yearly explains:
Arguments
x a time-series object coercible to xts
FUN an R function
I have created sample data based on the data shared by the OP and converted it to the xts class. apply.yearly works correctly on it.
library(xts)
# Convert data.frame to xts class
all_countries_ts <- xts(df[,-1], order.by = df$Date)
#Now one can use `apply.yearly`
apply.yearly(all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
# [,1]
# 1970-01-01 3893977
# 2000-01-01 9466335
# 2010-01-01 12241545
# 2015-01-01 14093640
Edit: A review of the OP's data suggests that it contains NA in many columns, which causes the total sum to be shown as NA. The fix is simple; the OP needs to use na.rm = TRUE:
apply.yearly(all_countries_ts[,grepl("Muslims",colnames(all_countries_ts))],
FUN = sum, na.rm = TRUE)
# [,1]
# 1970-01-01 570772699
# 2000-01-01 1292170756
# 2010-01-01 1571250533
# 2015-01-01 1734531709
Data:
df <- read.table(text =
" Date 'Senegal Muslims' 'Serbia Muslims' 'Seychelles Muslims' Others
1970-01-01 3693807 200000 170 200
2000-01-01 8936283 529322 730 100
2010-01-01 11713126 527598 821 300
2015-01-01 13621382 471414 844 500",
header = TRUE, stringsAsFactors = FALSE)
#convert Date column to Date format
df$Date <- as.Date(df$Date)
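As a quick diagnostic before aggregating, it can help to check which columns contain NAs at all, since sum() returns NA as soon as a single value in the period is missing; that is exactly why some religion inputs "work" and others do not. A minimal sketch on the object above:
colSums(is.na(all_countries_ts)) # per-column count of missing values
anyNA(all_countries_ts) # single yes/no answer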

Map a list of events (instants) to a list of periods (intervals) in R (with or without lubridate)

I have two data frames: one containing time periods marked with unique character IDs, and another containing events with their own set of unique IDs.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of the event to the ID of the period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals and then use some form of apply and instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate; in fact a solution in base R would be even more welcome.
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it would be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
intervalsP <- as.interval(periodStart, periodEnd)
Step 2: A function should be created for checking whether an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y) {
  # x - instants
  # y - intervals
  variable1 <- mapply(`%within%`, x, y)
  if (length(y[variable1]) != 0) {
    as.character(y[variable1])
  } else {
    NA
  }
}
NOTE: I had to use the interval-to-character coercion because otherwise intervals were coerced to their length in seconds by the apply function, which is not really useful for matching purposes, i.e. all four intervals in this example have the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched periods and events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The resulting DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
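Since the question also asks for a base R solution without explicit loops, here is a minimal sketch using findInterval(), assuming the periods are sorted by start time and never overlap (as stated above). The trick is to interleave the start and end times into one breakpoint vector: an event inside a period lands at an odd position, and an event between periods lands at an even position. Note that, with findInterval's default conventions, an event exactly at a period's end time is treated as outside it.
starts <- as.numeric(periodDF$periodStart)
ends <- as.numeric(periodDF$periodEnd)
brk <- as.numeric(rbind(starts, ends)) # interleaved: s1, e1, s2, e2, ...
pos <- findInterval(as.numeric(eventDF$eventTime), brk)
idx <- ifelse(pos %% 2 == 1, (pos + 1) / 2, NA) # odd position -> inside period idx
eventDF$periodID <- as.character(periodDF$periodID)[idx]
# full outer join keeps periods with no events and events with no period
merge(periodDF["periodID"], eventDF[c("eventID", "periodID")], by = "periodID", all = TRUE)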

Average xts object with missing values to hourly endpoints

I am using xts to convert 10-minute data to hourly averages, starting with a year's worth of 10-minute data. Some hours have one 10-minute period (such as 'UTSP' in row 229) that is NA (missing).
For such hours, I would still like the average of the data that are available; however, in the output I get NA for that variable for the hour.
Other hours may have no data (all data are missing). I want these completely missing hours to return NA, but where some data exist for an hour, I want that data to be used.
Here is a reproducible example of what I've been trying:
Lines <- "date,time,UTSP,UPM10,UPM25,UPM1,UWS,UWDT,PTSP,PPM10,PPM25,PPM1,PWS,PWDT
218,2014/10/15,22:00,9.7,4.9,4.66,1.54,6,152.56,102,53.6,33.71,10.34,NA,NA
219,2014/10/15,22:10,9.3,5.1,4.57,1.61,6.4,147.56,106.4,55.1,33.92,10.47,NA,NA
220,2014/10/15,22:20,8.9,5,4.7,1.55,6.4,147.56,108.3,54.8,33.19,10.53,NA,NA
221,2014/10/15,22:30,9.7,5.3,4.93,1.62,6.8,152.56,110.3,57.4,34.97,11.14,NA,NA
222,2014/10/15,22:40,9.1,5.2,4.76,1.54,6.8,152.56,118.9,62.3,37.58,11.63,NA,NA
223,2014/10/15,22:50,9.8,5.5,5.07,1.62,6.7,152.56,120.5,61.8,36.24,11.9,NA,NA
224,2014/10/15,23:00,11.1,5.6,5.2,1.59,6.4,152.56,108.6,57.1,34.93,11.66,NA,NA
225,2014/10/15,23:10,9.8,5.4,4.89,1.63,7.3,152.56,116,59.6,35.08,11.14,NA,NA
226,2014/10/15,23:20,9.1,5,4.95,1.63,7.1,152.56,122.6,63.8,38.28,12.17,NA,NA
227,2014/10/15,23:30,9.7,5.2,4.88,1.58,7.3,147.56,88.1,46.7,29.59,9.78,NA,NA
228,2014/10/15,23:40,9.2,5.2,4.79,1.66,7.1,152.56,92.4,48.8,30.11,9.69,NA,NA
229,2014/10/15,23:50,NA,NA,NA,NA,NA,NA,89.7,48.1,30.53,9.89,NA,NA
230,2014/10/16,00:00,9.8,5.5,5.03,1.6,7,147.56,91.2,47.5,30.09,9.38,NA,NA
231,2014/10/16,00:10,9.7,5.1,4.81,1.57,7.1,152.56,91.2,47.6,29.44,9.4,NA,NA
232,2014/10/16,00:20,9.9,5.4,5.09,1.61,7.4,147.56,91.1,48.3,29.78,9.23,NA,NA
233,2014/10/16,00:30,9.8,5.4,4.82,1.62,6.9,152.56,95.7,48.6,29.47,9.8,NA,NA
234,2014/10/16,00:40,10.6,5.7,4.99,1.58,6.8,147.56,91.3,47.9,29.57,9.94,NA,NA
235,2014/10/16,00:50,10.1,5.4,4.93,1.65,7,147.56,86.3,44.9,27.9,8.93,NA,NA"
conn <- textConnection(Lines)
dframe <- read.csv(conn)
close(conn)
library(xts)
USP_TSP.xts <- xts(dframe$UTSP,
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
I have also tried several variations of na.contiguous, na.omit, na.action.
My resultant output always seems to be the same (excerpt):
[,1]
2014-10-15 22:50:00 9.4166667
2014-10-15 23:50:00 NA
2014-10-16 00:50:00 9.9833333
... with the value for 2014-10-15 hour 23 being NA, even though 5 of the 6 ten-minute values were present.
Also, I am calculating all the columns separately, then combining them later. Is there an easier way - like calculating all the columns at once?
Calling na.exclude doesn't change the USP_TSP.xts object. You would need to assign the output of na.exclude to USP_TSP.xts to achieve that.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
But if you want to process all the columns in the object at once, using na.exclude is going to remove all rows that have at least one column with a missing value.
xData <- xts(dframe[,-(1:2)],
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(xData)
# UTSP UPM10 UPM25 UPM1 UWS UWDT PTSP PPM10 PPM25 PPM1 PWS PWDT
str(na.exclude(xData))
# An 'xts' object of zero-width
Instead, you should supply na.rm=TRUE to the call to mean inside the period.apply call. If you want to process all columns at the same time, you can use colMeans.
xDataMeans <- period.apply(xData, endpoints(xData, "hours"), colMeans, na.rm=TRUE)
xDataMeans
# UTSP UPM10 UPM25 UPM1 UWS UWDT
# 2014-10-15 22:50:00 9.416667 5.166667 4.781667 1.580 6.516667 150.8933
# 2014-10-15 23:50:00 9.780000 5.280000 4.942000 1.618 7.040000 151.5600
# 2014-10-16 00:50:00 9.983333 5.416667 4.945000 1.605 7.033333 149.2267
# PTSP PPM10 PPM25 PPM1 PWS PWDT
# 2014-10-15 22:50:00 111.06667 57.50000 34.93500 11.001667 NaN NaN
# 2014-10-15 23:50:00 102.90000 54.01667 33.08667 10.721667 NaN NaN
# 2014-10-16 00:50:00 91.13333 47.46667 29.37500 9.446667 NaN NaN
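If you also want to know how many 10-minute records actually went into each hourly mean (for example, to flag hours that are almost entirely missing), you can count the non-missing observations per hour in the same way. A small sketch building on xData above:
obsCount <- period.apply(xData, endpoints(xData, "hours"),
function(x) colSums(!is.na(x)))
obsCount["2014-10-15 23:50:00", "UTSP"] # 5 of the 6 UTSP prints present for hour 23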
Your code works fine. You just need to assign USP_TSP.xts <- na.exclude(USP_TSP.xts). If you merely call na.exclude(USP_TSP.xts), then the output without NAs is printed, but it is not stored in any variable.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
# [,1]
#2014-10-15 22:50:00 9.416667
#2014-10-15 23:40:00 9.780000
#2014-10-16 00:50:00 9.983333
Alternatively you can use period.apply(USP_TSP.xts,ep,mean, na.rm=T) if you don't want to modify the original xts object.

How to merge pairs of Dates and values contained in a single csv

We have a csv file with Dates in Excel format and NAVs for Manager A and Manager B, as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure: [Date, Manager A NAV, Manager B NAV].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split:
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the dates are in Excel format and don't know how to fix that. Is the problem set up in a correct way?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since that origin. Usually this is Jan 0 1900! Go figure, but be careful, as I don't think this is always the case. You can try this:
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct on the original numeric values instead (before the as.Date conversion above). Note that as.POSIXct treats numeric input as seconds, so the Excel day counts need to be multiplied by 86400, and you should also specify the timezone:
data$Date <- as.POSIXct(data$Date * 86400, origin = "1899/12/30", tz = "GMT")
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
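From there, building the zoo object the question asks for is mostly a matter of merging the two date/NAV pairs. A possible continuation (sketch), assuming both Date columns were converted with as.Date as above and dropping the trailing rows where Manager A has no data:
library(zoo)
okA <- !is.na(data$Date)
okB <- !is.na(data$Date.1)
zA <- zoo(data$Manager.A[okA], data$Date[okA])
zB <- zoo(data$Manager.B[okB], data$Date.1[okB])
navs <- merge(ManagerA = zA, ManagerB = zB) # NA where one manager has no print for a date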
