Merging xts in R - Converting Characters to NA

I have 3 xts objects:
logged <- xts::xts(x = loggedInUsers$loggedInUsers, order.by = Sys.time())
loadValue <- xts::xts(x = loadAvg, order.by = Sys.time())
hostname <- xts::xts(x = loadHost, order.by = Sys.time())
dput(hostname)
dput(loadValue)
dput(logged)
dput gives the following results:
structure("deliverforgoodportal", .Dim = c(1L, 1L), index = structure(1551088127.27724, tzone = "", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "", tzone = "")
structure(0, .Dim = c(1L, 1L), .Dimnames = list(NULL, "load"), index = structure(1551088127.27676, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", class = c("xts", "zoo"))
structure(1, .Dim = c(1L, 1L), index = structure(1551088127.27637, tzone = "", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "", tzone = "")
When I merge these three and print the result, the hostname is converted to NA:
tmp <- merge.xts(hostname, logged, loadValue, all = TRUE)
print(tmp)
Output is (hostname is NA):
                    hostname logged load
2019-02-25 09:48:47       NA      1   NA
2019-02-25 09:48:47       NA     NA    0
2019-02-25 09:48:47       NA     NA   NA
Why does this come out as NA?

You should realise that an xts object is a time series stored as a matrix, and a matrix can only hold one type of value: character or numeric, but not both. Your merge is trying to combine a character-valued matrix (hostname) with numeric ones (logged and load), so the hostname values are coerced to NA.
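To see the coercion in isolation, here is a minimal sketch (the object names are illustrative):
library(xts)
t0 <- Sys.time()
chr <- xts("host01", order.by = t0)  # character coredata
num <- xts(1, order.by = t0)         # numeric coredata
merge(chr, num)  # the character column is coerced to numeric, so "host01" becomes NA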
If you want to join this data, you have to use a data.frame (or data.table). Also note that your time values are not equal; they are off by milliseconds. So if you want to join on whole seconds, first use floor_date from the lubridate package. Below are two examples, with and without lubridate. I use the timetk package to convert the xts objects to a tibble, but depending on your source data that might not be necessary.
with full_join, no lubridate
library(timetk)
library(dplyr)
hostname <- tk_tbl(hostname)
loadValue <- tk_tbl(loadValue)
logged <- tk_tbl(logged)
hostname %>%
  full_join(loadValue) %>%
  full_join(logged,
            by = "index",
            suffix = c("_hostname", "_logged"))
Joining, by = "index"
# A tibble: 3 x 4
  index               value_hostname        load value_logged
  <dttm>              <chr>                <dbl>        <dbl>
1 2019-02-25 10:48:47 deliverforgoodportal    NA           NA
2 2019-02-25 10:48:47 NA                       0           NA
3 2019-02-25 10:48:47 NA                      NA            1
with lubridate and left join:
hostname %>%
  mutate(index = lubridate::floor_date(index, unit = "seconds")) %>%
  left_join(loadValue %>% mutate(index = lubridate::floor_date(index, unit = "seconds"))) %>%
  left_join(logged %>% mutate(index = lubridate::floor_date(index, unit = "seconds")),
            by = "index",
            suffix = c("_hostname", "_logged"))
Joining, by = "index"
# A tibble: 1 x 4
  index               value_hostname        load value_logged
  <dttm>              <chr>                <dbl>        <dbl>
1 2019-02-25 10:48:47 deliverforgoodportal     0            1
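Note that you cannot round-trip the joined result back into a single xts object: an xts is still one matrix, so the character hostname column would force the same coercion again. Keep the combined data as a tibble or data.frame.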

Related

Merge multiple xts objects with matching or nearest dates

I have two xts objects, each containing one observation per month.
The first is below; its dates are typically the last trading day of each month.
structure(c(-0.0329199999999997, 0.0874901766141374, 0.0545883292605231,
0.0687945180777207, 0.0550784545301166, 0.074678777314922, -0.0866534235058661,
0.161206236457536, 0.0704023794825748, 0.074691325661258), class = c("xts",
"zoo"), ret_type = "discrete", coredata_content = "discreteReturn", index = structure(c(1114732800,
1117497600, 1120089600, 1122595200, 1125446400, 1128038400, 1130716800,
1133308800, 1135900800, 1138665600), tzone = "UTC", tclass = "Date"), dim = c(10L,
1L), dimnames = list(NULL, "xts_left"))
The second xts file is:
structure(c(0.0052512320343876, 0.00540733325225928, 0.00580017750416384,
0.005701283061746, 0.00556285472234541, 0.00561113650865441,
0.00580424365658105, 0.005816988308881, 0.00571552920344676,
0.00574088497469671, 0.00574737930337577, 0.00589584054618375,
0.00592325487612455), class = c("xts", "zoo"), .CLASS = "double", index = structure(c(1107216000,
1109635200, 1112313600, 1114905600, 1117584000, 1120176000, 1122854400,
1125532800, 1128124800, 1130803200, 1133395200, 1136073600, 1138752000
), tzone = "UTC", tclass = "Date"), dim = c(13L, 1L))
This is how I want the merge to behave: pick the value from the right xts whose date is closest to each date in the left xts. For example, the value on 29-04-2005 should be matched with the nearest date, i.e. 01-05-2005 (dd-mm-yyyy format).
I have seen a possible way to do this using data.table with rolling joins, but I wanted to know if there is a way to do this within the xts (or similar) framework.
Using x1 and x2 from the Note at the end, define near, which, given a date tt, finds the nearest date in x2 and returns the corresponding data value. Then apply that to each date in x1.
near <- function(tt) x2[which.min(abs(time(x2) - tt))]
x12 <- transform(x1, xts_right = sapply(time(x1), near)); x12
giving:
xts_left xts_right
2005-04-29 -0.03292000 0.005701283
2005-05-31 0.08749018 0.005562855
2005-06-30 0.05458833 0.005611137
2005-07-29 0.06879452 0.005804244
2005-08-31 0.05507845 0.005816988
2005-09-30 0.07467878 0.005715529
2005-10-31 -0.08665342 0.005740885
2005-11-30 0.16120624 0.005747379
2005-12-30 0.07040238 0.005895841
2006-01-31 0.07469133 0.005923255
In the example shown in the question the nearest x2 is always at a strictly later date than x1 and x2 starts before x1. If those are general features of the problem it could alternately be expressed as:
transform(x1, xts_right = coredata(x2)[findInterval(time(x1), time(x2)) + 1])
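As a quick sanity check of that logic, using x1 and x2 from the Note: findInterval returns, for each x1 date, the position of the last x2 date at or before it, so adding 1 steps to the first x2 date strictly after it.
i <- findInterval(time(x1), time(x2))
all(time(x2)[i + 1] > time(x1))  # TRUE under the stated assumptions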
Note
x1 <-
structure(c(-0.0329199999999997, 0.0874901766141374, 0.0545883292605231,
0.0687945180777207, 0.0550784545301166, 0.074678777314922, -0.0866534235058661,
0.161206236457536, 0.0704023794825748, 0.074691325661258), .Dim = c(10L,
1L), class = c("xts", "zoo"), ret_type = "discrete",
coredata_content = "discreteReturn", index = structure(c(1114732800,
1117497600, 1120089600, 1122595200, 1125446400, 1128038400, 1130716800,
1133308800, 1135900800, 1138665600), tzone = "UTC", tclass = "Date"),
.Dimnames = list(NULL, "xts_left"))
x2 <-
structure(c(0.0052512320343876, 0.00540733325225928, 0.00580017750416384,
0.005701283061746, 0.00556285472234541, 0.00561113650865441,
0.00580424365658105, 0.005816988308881, 0.00571552920344676,
0.00574088497469671, 0.00574737930337577, 0.00589584054618375,
0.00592325487612455), .Dim = c(13L, 1L), class = c("xts", "zoo"
), .CLASS = "double", index = structure(c(1107216000, 1109635200,
1112313600, 1114905600, 1117584000, 1120176000, 1122854400, 1125532800,
1128124800, 1130803200, 1133395200, 1136073600, 1138752000),
tzone = "UTC", tclass = "Date"))

Is there a way to aggregate by time on xts regardless of the date

I'm trying to split by hour of day without the date affecting the results; I'm using an xts object indexed by date and time.
When I use split by hour I get the results grouped within each date, but I want to ignore the date and group by time of day only. I tried stripping the date and converting back to POSIXct, but that didn't help. I also tried period.apply with endpoints, with the same results.
lapply(split(temp[,"GROSS"] , f = "hour"), FUN = cumsum)
[[1]]
GROSS
2018-10-12 09:46:38 "11"
[[2]]
GROSS
2018-10-12 10:04:08 "-4"
2018-10-12 10:23:58 "5.2"
2018-10-12 10:24:08 "-1.1"
[[3]]
GROSS
2018-10-15 09:35:46 "20.7"
[[4]]
GROSS
2018-10-17 09:30:56 "-7.2"
[[5]]
GROSS
2018-10-17 10:44:48 "13.5"
I expect to get the results accumulated by hour of day, regardless of the date.
When dealing with date, time or datetime variables the lubridate package is really useful.
library(xts)
library(lubridate)
df_ts <- structure(
  c(" 11.00", " -4.00", " 9.20", " -6.30", " 20.70", " -7.20"),
  class = c("xts", "zoo"),
  .indexCLASS = c("POSIXlt", "POSIXt"),
  tclass = c("POSIXlt", "POSIXt"),
  .indexTZ = "UTC",
  tzone = "UTC",
  index = structure(c(1539337598, 1539338648, 1539339838, 1539339848, 1539596146, 1539768656),
                    tzone = "UTC",
                    tclass = c("POSIXlt", "POSIXt")),
  .Dim = c(6L, 1L),
  .Dimnames = list(NULL, "GROSS"))
# GROSS is stored as character in this data, so coerce to numeric before accumulating
lapply(split(df_ts,
             f = hour(as_datetime(attr(df_ts, "index")))),
       FUN = function(x) cumsum(as.numeric(x)))
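The grouping factor here is just the hour of day, independent of the date, so rows from different dates fall into the same group:
hour(as_datetime(attr(df_ts, "index")))
# [1]  9 10 10 10  9  9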

Change data to numeric type to determine which distribution fits better

I am trying to figure out which distribution fits best logarithmic stock returns. Here is my code:
library(TTR)
sign = "^GSPC"
start = 19900101
end = 20160101
x <- getYahooData(sign, start = start, end = end, freq = "daily")
x$logret <- log(x$Close) - lag(log(x$Close))
x = x[, 6]
I want to use the function descdist(x, discrete = FALSE), which I got from this amazing post: https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best Nonetheless, R gives me this error: Error in descdist(x, discrete = FALSE) : data must be a numeric vector. How do I transform my data to a numeric vector?
The output from dput(head(x)) is:
structure(c(NA, -0.00258888580664607, -0.00865029791190164, -0.00980414107803274,
0.00450431207515223, -0.011856706127011), class = c("xts", "zoo"
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", index = structure(c(631238400,
631324800, 631411200, 631497600, 631756800, 631843200), tzone = "UTC", tclass = "Date"), .Dim = c(6L,
1L), .Dimnames = list(NULL, "logret"))
Pre-process x using as.numeric(na.omit(x)), or simply run:
descdist(as.numeric(na.omit(x)), discrete = FALSE)
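For context, a minimal sketch assuming the fitdistrplus package (which provides descdist): na.omit drops the leading NA created by the lagged return, and as.numeric strips the xts class, leaving the plain numeric vector the function expects.
library(fitdistrplus)
ret <- as.numeric(na.omit(x))  # plain numeric vector of log returns
descdist(ret, discrete = FALSE)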

How to remove a row from zoo/xts object, given a timestamp

I was happily running with this code:
z = lapply(filename_list, function(fname) {
  read.zoo(file = fname, header = TRUE, sep = ",", tz = "")
})
xts(do.call(rbind, z))
until Dirty Data came along with this at the end of one file:
Open High Low Close Volume
2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105 1
and this at the start of the next file:
Open High Low Close Volume
2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.3702 2230
So rbind.zoo complains about a duplicate.
I can't use something like:
y <- x[ ! duplicated( index(x) ), ]
as they are in different zoo objects, inside a list. And I cannot use aggregate, as suggested here because they are a list of zoo objects, not one big zoo object. And I can't get one big object 'cos of the duplicates. Catch-22.
So, when the going gets tough, the tough hack together some for loops (excuse the prints and a stop, as this isn't working code yet):
indexes <- do.call("c", unname(lapply(z, index)))
dups = duplicated(indexes)
if (any(dups)) {
  duplicate_timestamps = indexes[dups]
  for (tix in 1:length(duplicate_timestamps)) {
    t = duplicate_timestamps[tix]
    print("We have a duplicate:"); print(t)
    for (zix in 1:length(z)) {
      if (t %in% index(z[[zix]])) {
        print(z[[zix]][t])
        if (z[[zix]][t]$Volume == 1) {
          print("-->Deleting this one")
          z[[zix]][t] = NULL  #<-- PROBLEM
        }
      }
    }
  }
  stop("There are duplicate bars!!")
}
The bit I'm stumped on is that assigning NULL to a zoo row doesn't delete it (Error in NextMethod("[<-") : replacement has length zero). OK, so I'll do a filter-copy without the offending item... but I'm tripping up on these:
> z[[zix]][!t,]
Error in Ops.POSIXt(t) : unary '!' not defined for "POSIXt" objects
> z[[zix]][-t,]
Error in `-.POSIXt`(t) : unary '-' is not defined for "POSIXt" objects
P.S. While high-level solutions to my real problem of "duplicates rows across a list of zoo objects" are very welcome, the question here is specifically about how to delete a row from a zoo object given a POSIXt index object.
A small bit of test data:
list(structure(c(1.36864, 1.367045, 1.370105, 1.36928, 1.37039,
1.370105, 1.36604, 1.36676, 1.370105, 1.367065, 1.37009, 1.370105,
5498, 3244, 1), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume")), index = structure(c(1316512800,
1316516400, 1316520000), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"),
structure(c(1.370105, 1.370115, 1.36913, 1.371045, 1.37023,
1.37075, 1.369685, 1.36847, 1.367885, 1.3702, 1.36917, 1.37061,
2230, 2909, 2782), .Dim = c(3L, 5L), .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), index = structure(c(1316520000,
1316523600, 1316527200), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"))
UPDATE: Thanks to G. Grothendieck for the row-deleting solution. In the actual code I followed the advice of Joshua and GSee to get a list of xts objects instead of a list of zoo objects. So my code became:
z = lapply(filename_list, function(fname) {
  xts(read.zoo(file = fname, header = TRUE, sep = ",", tz = ""))
})
x = do.call.rbind(z)
(As an aside, please note the call to do.call.rbind. This is because rbind.xts has some serious memory issues. See https://stackoverflow.com/a/12029366/841830 )
Then I remove duplicates as a post-process step:
dups = duplicated(index(x))
if (any(dups)) {
  duplicate_timestamps = index(x)[dups]
  to_delete = x[(index(x) %in% duplicate_timestamps) & x$Volume <= 1]
  if (nrow(to_delete) > 0) {
    # Next line keeps all rows that are not in the duplicate_timestamps group,
    # OR are in the duplicate timestamps but have a volume greater than 1.
    print("Will delete the volume=1 entry")
    x = x[!(index(x) %in% duplicate_timestamps) | x$Volume > 1]
  } else {
    stop("Duplicate timestamps, and we cannot easily remove them just based on low volume.")
  }
}
If z1 and z2 are your zoo objects then to rbind while removing any duplicates in z2:
rbind( z1, z2[ ! time(z2) %in% time(z1) ] )
Regarding deleting points in a zoo object having specified times, the above already illustrates this, but in general, if tt is a vector of times to delete:
z[ ! time(z) %in% tt ]
or, if we know there is a single element in tt, z[ time(z) != tt ].
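Applied to the two test zoo objects from the question (z1 <- z[[1]]; z2 <- z[[2]]), this keeps z1's version of the duplicated 2011-09-20 21:00:00 bar:
rbind(z1, z2[ ! time(z2) %in% time(z1) ])  # 5 rows; here z1's Volume 1 bar survives
Swap the roles of z1 and z2 if you want the other file's bar to win.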
rbind.xts will allow duplicate index values, so it could work if you convert to xts first.
x <- lapply(z, as.xts)
y <- do.call(rbind, x)
# keep last value of any duplicates
y <- y[!duplicated(index(y),fromLast=TRUE),]
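With fromLast = TRUE the later duplicate wins, so here the Volume 2230 bar from the second file is kept and the dirty Volume 1 bar is dropped.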
I think you'll have better luck if you convert to xts first.
a <- structure(c(1.370105, 1.370105, 1.370105, 1.370105, 1), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
b <- structure(c(1.370105, 1.371045, 1.369685, 1.3702, 2230), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
(comb <- rbind(a, b))
# Open High Low Close Volume
#2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105 1
#2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.370200 2230
dupidx <- index(comb)[duplicated(index(comb))] # indexes of duplicates
tail(comb[dupidx], 1) #last duplicate
# now rbind the last duplicated row with all non-duplicated data
rbind(comb[!index(comb) %in% dupidx], tail(comb[dupidx], 1))
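The final rbind therefore returns a single 2011-09-20 21:00:00 bar with Volume 2230; rbind.xts keeps the result ordered by index, so the retained duplicate lands back in time order.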

xts merge memory performance

I am trying to improve the memory performance for the following example:
baseline df with 4 rows:
df <- structure(list(sessionid = structure(c(1L, 2L, 3L, 4L), .Label =
c("AAA1", "AAA2","AAA3", "AAA4"), class = "factor"), bitrateinbps = c(10000000,
10000000, 10000000, 10000000), startdate = structure(c(1326758507, 1326758671,
1326759569, 1326760589), class = c("POSIXct", "POSIXt"), tzone = ""), enddate =
structure(c(1326765780, 1326758734, 1326760629, 1326761592), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("sessionid", "bitrateinbps", "startdate",
"enddate"), row.names = c(NA, 4L), class =
"data.frame")
alternate df with 8 rows:
df <- structure(list(sessionid = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L),
.Label = c("AAA1", "AAA2", "AAA3", "AAA4", "AAA5", "AAA6", "AAA7", "AAA8"),
class = "factor"), bitrateinbps =c(10000000, 10000000, 10000000, 10000000,
10000000, 10000000, 10000000, 10000000), startdate = structure(c(1326758507,
1326758671, 1326759569, 1326760589, 1326761589, 1326762589, 1326763589, 1326764589),
class = c("POSIXct",
"POSIXt"), tzone = ""), enddate = structure(c(1326765780, 1326758734, 1326760629,
1326761592, 1326767592,
1326768592, 1326768700, 1326769592), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("sessionid",
"bitrateinbps", "startdate", "enddate"), row.names = c(NA, 8L), class =
"data.frame")
Try the analysis and measure memory usage for the baseline df, then repeat for the alternate df:
library(xts)
fun0 <- function(i, d) {
  idx0 <- seq(d$startdate[i], d$enddate[i], 1)  # create a 1-second sequence for the index
  dat0 <- rep(1, length(idx0))                  # create data over the sequence
  xts(dat0, idx0, dimnames = list(NULL, d$sessionid[i]))  # xts object
}
# loop over each row and put each row into its own xts object
xl0 <- lapply(1:NROW(df), fun0, d = df)
# merge all the xts objects
xx0 <- do.call(merge, xl0)
# apply colSums to each 15-minute period; dividing by 900 seconds turns the
# count of active seconds into the fraction of the window the session was active
xa0 <- period.apply(xx0, endpoints(xx0, 'minutes', 15), colSums, na.rm = TRUE) / 900
xa1 <- t(xa0)
# convert from atomic vector to data frame
xa1 = as.data.frame(xa1)
# bind to df
out1 = cbind(df, xa1)
# print aggregate memory usage statistics (memory.size/memory.limit are Windows-only)
print(paste('R is using', memory.size(), 'MB out of limit', memory.limit(), 'MB'))
# create a function returning a matrix of memory consumption, largest first
object.sizes <- function() {
  return(rev(sort(sapply(ls(envir = .GlobalEnv), function(object.name)
    object.size(get(object.name))))))
}
# print to console in table format
object.sizes()
results as follows:
4 row df:
xx0 = 292104 Bytes .... do.call(merge, xl0)
xl0 = 154648 Bytes .... lapply(1:NROW(df), fun0, d=df)
8 row df:
xx0 = 799480 Bytes .... do.call(merge, xl0)
xl0 = 512808 Bytes .... lapply(1:NROW(df), fun0, d=df)
I'm looking for something a little more memory efficient for the merge and lapply functions, so I can scale out the number of rows, if anyone has any suggestions and can show the comparative results for alternatives.
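One direction worth exploring (a hedged sketch, not a drop-in replacement): the per-second expansion and wide merge can be skipped entirely, because the fraction of a 15-minute window covered by a session is just its overlap with the window divided by 900 seconds. The window boundaries below are illustrative and will not match endpoints() exactly.
# overlap of [s, e] with window [w0, w1], in seconds
overlap <- function(s, e, w0, w1) pmax(0, pmin(e, w1) - pmax(s, w0))
# 15-minute window start times spanning the data
w0 <- seq(from = as.numeric(min(df$startdate)) %/% 900 * 900,
          to   = as.numeric(max(df$enddate)), by = 900)
# fraction of each window covered, one column per session
frac <- sapply(seq_len(NROW(df)), function(i)
  overlap(as.numeric(df$startdate[i]), as.numeric(df$enddate[i]), w0, w0 + 900) / 900)
colnames(frac) <- df$sessionid
This needs only length(w0) x NROW(df) doubles, instead of one xts row per second per session.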
