How to remove a row from zoo/xts object, given a timestamp - r

I was happily running with this code:
z=lapply(filename_list, function(fname){
  read.zoo(file=fname,header=TRUE,sep = ",",tz = "")
})
xts( do.call(rbind,z) )
until Dirty Data came along with this at the end of one file:
                        Open     High      Low    Close Volume
2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105      1
and this at the start of the next file:
                        Open     High      Low  Close Volume
2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.3702   2230
So rbind.zoo complains about a duplicate.
I can't use something like:
y <- x[ ! duplicated( index(x) ), ]
as they are in different zoo objects, inside a list. And I cannot use aggregate, as suggested here, because they are a list of zoo objects, not one big zoo object. And I can't get one big object 'cos of the duplicates. Catch-22.
So, when the going gets tough, the tough hack together some for loops (excuse the prints and a stop, as this isn't working code yet):
indexes <- do.call("c", unname(lapply(z, index)))
dups=duplicated(indexes)
if(any(dups)){
  duplicate_timestamps=indexes[dups]
  for(tix in 1:length(duplicate_timestamps)){
    t=duplicate_timestamps[tix]
    print("We have a duplicate:");print(t)
    for(zix in 1:length(z)){
      if(t %in% index(z[[zix]])){
        print(z[[zix]][t])
        if(z[[zix]][t]$Volume==1){
          print("-->Deleting this one");
          z[[zix]][t]=NULL   #<-- PROBLEM
        }
      }
    }
  }
  stop("There are duplicate bars!!")
}
The bit I'm stumped on is that assigning NULL to a zoo row doesn't delete it (Error in NextMethod("[<-") : replacement has length zero). OK, so I'll do a filter-copy without the offending item... but I'm tripping up on these:
> z[[zix]][!t,]
Error in Ops.POSIXt(t) : unary '!' not defined for "POSIXt" objects
> z[[zix]][-t,]
Error in `-.POSIXt`(t) : unary '-' is not defined for "POSIXt" objects
P.S. While high-level solutions to my real problem of "duplicate rows across a list of zoo objects" are very welcome, the question here is specifically about how to delete a row from a zoo object given a POSIXt index object.
A small bit of test data:
list(structure(c(1.36864, 1.367045, 1.370105, 1.36928, 1.37039,
1.370105, 1.36604, 1.36676, 1.370105, 1.367065, 1.37009, 1.370105,
5498, 3244, 1), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume")), index = structure(c(1316512800,
1316516400, 1316520000), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"),
structure(c(1.370105, 1.370115, 1.36913, 1.371045, 1.37023,
1.37075, 1.369685, 1.36847, 1.367885, 1.3702, 1.36917, 1.37061,
2230, 2909, 2782), .Dim = c(3L, 5L), .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), index = structure(c(1316520000,
1316523600, 1316527200), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"))
UPDATE: Thanks to G. Grothendieck for the row-deleting solution. In the actual code I followed the advice of Joshua and GSee to get a list of xts objects instead of a list of zoo objects. So my code became:
z=lapply(filename_list, function(fname){
  xts(read.zoo(file=fname,header=TRUE,sep = ",",tz = ""))
})
x=do.call.rbind(z)
(As an aside, please note the call to do.call.rbind. This is because rbind.xts has some serious memory issues. See https://stackoverflow.com/a/12029366/841830 )
Then I remove duplicates as a post-process step:
dups=duplicated(index(x))
if(any(dups)){
  duplicate_timestamps=index(x)[dups]
  to_delete=x[ (index(x) %in% duplicate_timestamps) & x$Volume<=1]
  if(nrow(to_delete)>0){
    #Next line says all lines that are not in the duplicate_timestamp group
    # OR are in the duplicate timestamps, but have a volume greater than 1.
    print("Will delete the volume=1 entry")
    x=x[ !(index(x) %in% duplicate_timestamps) | x$Volume>1]
  }else{
    stop("Duplicate timestamps, and we cannot easily remove them just based on low volume.")
  }
}

If z1 and z2 are your zoo objects then to rbind while removing any duplicates in z2:
rbind( z1, z2[ ! time(z2) %in% time(z1) ] )
Regarding deleting points in a zoo object having specified times, the above already illustrates this but in general if tt is a vector of times to delete:
z[ ! time(z) %in% tt ]
or if we knew there were a single element in tt then z[ time(z) != tt ] .
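As a small self-contained illustration of the filter above, here is a sketch on made-up data. The object names (z1, tms, z1_clean) and the two-column layout are my own assumptions, chosen to resemble the question's test data. The trailing comma just makes the row selection explicit.

```r
library(zoo)

# Hypothetical three-bar series; the timestamps mirror the question's test data.
tms <- as.POSIXct(c(1316512800, 1316516400, 1316520000),
                  origin = "1970-01-01", tz = "UTC")
z1 <- zoo(cbind(Close  = c(1.36928, 1.37039, 1.370105),
                Volume = c(5498, 3244, 1)),
          order.by = tms)

# tt holds the timestamp(s) whose rows should be dropped.
tt <- tms[3]

# Keep every row whose time is NOT in tt.
z1_clean <- z1[!time(z1) %in% tt, ]
```

Because the result is a filtered copy, it has to be assigned back (for example z[[zix]] <- z[[zix]][!time(z[[zix]]) %in% t, ] inside the question's loop) for the deletion to stick.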

rbind.xts will allow duplicate index values, so it could work if you convert to xts first.
x <- lapply(z, as.xts)
y <- do.call(rbind, x)
# keep last value of any duplicates
y <- y[!duplicated(index(y),fromLast=TRUE),]

I think you'll have better luck if you convert to xts first.
a <- structure(c(1.370105, 1.370105, 1.370105, 1.370105, 1), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
b <- structure(c(1.370105, 1.371045, 1.369685, 1.3702, 2230), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
(comb <- rbind(a, b))
#                        Open     High      Low    Close Volume
#2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105      1
#2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.370200   2230
dupidx <- index(comb)[duplicated(index(comb))] # indexes of duplicates
tail(comb[dupidx], 1) #last duplicate
# now rbind the last duplicated row with all non-duplicated data
rbind(comb[!index(comb) %in% dupidx], tail(comb[dupidx], 1))

Related

R loop doesn't work while single command works

I am trying to convert many time series xts objects to tibbles, and the for loop I wrote does not work properly. I don't know why.
This happens not only with this particular task but with other tasks I perform too. I have a list called "code", which contains the names of all the xts objects I want to convert.
code <- c('ABT','BA','CL','ROK')
for (i in code)
{
  i <- tk_tbl(i, preserve_index = TRUE, rename_index = "index",
              timetk_idx = FALSE, silent = FALSE)
}
What is strange is that if I convert a single object without the loop, it works beautifully and converts the xts "ABT" to a tibble "ABT":
ABT <- tk_tbl(ABT, preserve_index = TRUE, rename_index = "index",
timetk_idx = FALSE, silent = FALSE)
The message I get from the loop version is
Warning: No index to preserve. Object otherwise converted to tibble
successfully.
38: In tk_tbl.data.frame(as.data.frame(data), preserve_index, ... :
Edit:
tk_tbl is a function from the package timetk, and it will "Coerce time-series objects to tibble."
And code is a vector containing names.
library(timetk)
code <- c('ABT','BA','CL','ROK')
> dput(head(ROK))
structure(c(8.14062, 8.15625, 8.03125, 7.78125, 7.6875, 7.71875,
8.25, 8.15625, 8.125, 7.90625, 7.71875, 7.75, 8.03125, 8.125,
7.90625, 7.65625, 7.625, 7.65625, 8.1875, 8.125, 7.90625, 7.71875,
7.65625, 7.6875, 109600, 80800, 138400, 151600, 96800, 258800,
0.684505, 0.67928, 0.660992, 0.645316, 0.640091, 0.642704),
class=c("xts", "zoo"), .indexCLASS = "Date", tclass = "Date",
.indexTZ = "UTC", tzone = "UTC", src = "yahoo",
updated = structure(1558826745.23035, class = c("POSIXct","POSIXt")),
index = structure(c(378604800, 378950400, 379036800,
379123200, 379209600, 379296000), tzone = "UTC", tclass = "Date"),
.Dim = c(6L, 6L), .Dimnames = list(NULL, c("ROK.Open", "ROK.High",
"ROK.Low", "ROK.Close", "ROK.Volume", "ROK.Adjusted")))
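It looks to me like you expect <- to do what assign does. Inside the loop, i is a character string such as "ABT", so i <- ... only overwrites the loop variable, and tk_tbl(i, ...) receives the string rather than the xts object, which is most likely why you see the "No index to preserve" warning.
I think you get your expected result when you change your loop to use get to fetch the object by name and assign to store the result under that name:
for (i in code) {
  assign(i, tk_tbl(get(i), preserve_index = TRUE, rename_index = "index", timetk_idx = FALSE, silent = FALSE))
}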
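To see the name-versus-object distinction without timetk, here is a base-R sketch. The convert function is a hypothetical stand-in for tk_tbl, and the two numeric vectors are made-up stand-ins for the xts objects. It shows get() and assign() working on names held as strings, plus an alternative that keeps the results in a named list instead of overwriting variables.

```r
# Two toy objects standing in for the xts series (hypothetical data).
ABT <- c(1.5, 2.5)
BA  <- c(10, 20)
code <- c("ABT", "BA")

# Hypothetical stand-in for tk_tbl(): wraps a vector in a data frame.
convert <- function(obj) data.frame(value = obj)

# Option 1: look the objects up by name and collect the results in a named list.
converted <- lapply(mget(code), convert)

# Option 2: overwrite each variable in place. get(i) fetches the object whose
# name is stored in i; assign() writes the result back under that same name.
for (i in code) {
  assign(i, convert(get(i)))
}
```

The list version is usually easier to work with afterwards, since the results can be passed around or combined without further calls to get().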

Merging xts in R - Converting Characters to NA

I have 3 xts objects
logged <- xts::xts(x = loggedInUsers$loggedInUsers, order.by = Sys.time())
loadValue <- xts::xts(x = loadAvg, order.by = Sys.time())
hostname <- xts::xts(x = loadHost, order.by = Sys.time())
dput(hostname)
dput(loadValue)
dput(logged)
dput gives the following result
structure("deliverforgoodportal", .Dim = c(1L, 1L), index = structure(1551088127.27724, tzone = "", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "", tzone = "")
structure(0, .Dim = c(1L, 1L), .Dimnames = list(NULL, "load"), index = structure(1551088127.27676, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", class = c("xts", "zoo"))
structure(1, .Dim = c(1L, 1L), index = structure(1551088127.27637, tzone = "", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "", tzone = "")
When I merge these three and print the result, the hostname is converted to NA
tmp <- merge.xts(hostname, logged, loadValue, all = TRUE)
print(tmp)
Output is: (hostname is NA)
                    hostname logged load
2019-02-25 09:48:47       NA      1   NA
2019-02-25 09:48:47       NA     NA    0
2019-02-25 09:48:47       NA     NA   NA
Why is this coming as NA?
You should realise that an xts object is a timeseries and a matrix. Now a matrix can only contain one type of values, either character or numeric. But not both. Your merge is trying to combine a character value matrix (hostname) with numeric values (logged and load). This results in the hostname values being coerced to NA.
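The one-type rule can be seen in isolation with a small base-R sketch (no xts involved; the column names are only labels for illustration). It shows that a matrix mixing text and numbers ends up entirely character, and that pushing the text into a numeric slot, which is presumably what the numeric merge has to do, gives NA.

```r
# A matrix has a single storage type, so mixing text and numbers promotes
# every cell to character.
m <- cbind(hostname = "deliverforgoodportal", load = 0)
m_type <- typeof(m)

# Going the other way, text that is forced into a numeric container
# cannot be represented and becomes NA (R also emits a coercion warning).
as_num <- suppressWarnings(as.numeric("deliverforgoodportal"))
host_is_na <- is.na(as_num)
```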
If you want to join this data, you have to use a data.frame (or data.table). Also note that your time values are not equal: they are off by a fraction of a millisecond. So if you want to join on seconds, first use floor_date from the lubridate package. See below two examples, with and without lubridate. I use the package timetk to convert the xts objects to tibbles, but depending on your source data that might not be necessary.
with full_join, no lubridate
library(timetk)
library(dplyr)
hostname <- tk_tbl(hostname)
loadValue <- tk_tbl(loadValue)
logged <- tk_tbl(logged)
hostname %>%
  full_join(loadValue) %>%
  full_join(logged,
            by = "index",
            suffix = c("_hostname", "_logged"))
Joining, by = "index"
# A tibble: 3 x 4
  index               value_hostname        load value_logged
  <dttm>              <chr>                <dbl>        <dbl>
1 2019-02-25 10:48:47 deliverforgoodportal    NA           NA
2 2019-02-25 10:48:47 NA                       0           NA
3 2019-02-25 10:48:47 NA                      NA            1
with lubridate and left join:
hostname %>%
  mutate(index = lubridate::floor_date(index, unit = "seconds")) %>%
  left_join(loadValue %>% mutate(index = lubridate::floor_date(index, unit = "seconds"))) %>%
  left_join(logged %>% mutate(index = lubridate::floor_date(index, unit = "seconds")),
            by = "index",
            suffix = c("_hostname", "_logged"))
Joining, by = "index"
# A tibble: 1 x 4
  index               value_hostname        load value_logged
  <dttm>              <chr>                <dbl>        <dbl>
1 2019-02-25 10:48:47 deliverforgoodportal     0            1
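Why the flooring matters can be checked in base R on the three index values taken from the dput output in the question. This sketch uses plain floor() on the epoch seconds instead of lubridate::floor_date, which is my substitution to keep the example dependency-free. The vector names are just labels.

```r
# The three index values from the dput() output, as seconds since the epoch.
secs <- c(hostname = 1551088127.27724,
          load     = 1551088127.27676,
          logged   = 1551088127.27637)

# All three differ below the second, so an exact join finds no common key.
n_keys_raw <- length(unique(secs))

# Dropping the fractional part leaves a single shared key.
floored <- as.POSIXct(floor(secs), origin = "1970-01-01", tz = "UTC")
n_keys_floored <- length(unique(floored))
```

Here n_keys_raw is 3 and n_keys_floored is 1, which matches the three-row and one-row join results shown above.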

Change data to numeric type to determine which distribution fits better

I am trying to figure out which distribution best fits logarithmic stock returns. Here is my code:
library(TTR)
sign="^GSPC"
start=19900101
end=20160101
x <- getYahooData(sign, start = start, end = end, freq = "daily")
x$logret <- log(x$Close) - lag(log(x$Close))
x=x[,6]
I want to use the function descdist(x, discrete = FALSE), which I got from this amazing post: https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best
Nonetheless, R gives me this error:
Error in descdist(x, discrete = FALSE) : data must be a numeric vector
How do I transform my data into a numeric vector?
The output from dput(head(x)) is:
structure(c(NA, -0.00258888580664607, -0.00865029791190164, -0.00980414107803274,
0.00450431207515223, -0.011856706127011), class = c("xts", "zoo"
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", index = structure(c(631238400,
631324800, 631411200, 631497600, 631756800, 631843200), tzone = "UTC", tclass = "Date"), .Dim = c(6L,
1L), .Dimnames = list(NULL, "logret"))
Pre-process x using as.numeric(na.omit(x)), or simply run
descdist(as.numeric(na.omit(x)), discrete = FALSE)
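What the two calls do can be seen in a base-R sketch that uses a plain one-column matrix as a stand-in for the xts series (an xts object is a matrix with a time index attached). The values are rounded, made-up returns, not the question's data.

```r
# Hypothetical one-column matrix standing in for the xts log-return series;
# the first value is NA because the first bar has no previous close.
x <- matrix(c(NA, -0.0026, -0.0087, -0.0098, 0.0045, -0.0119),
            ncol = 1, dimnames = list(NULL, "logret"))

# na.omit() drops the NA row; as.numeric() strips the matrix attributes,
# leaving a plain numeric vector.
v <- as.numeric(na.omit(x))

plain_vector <- is.vector(v) && is.numeric(v)
```

After this, v has five elements, no dim attribute and no NA, which is the shape of input the error message asks for.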

optimisation of a condition for loop in r

I have 2 datasets, one of which contains measurements of temperature at 30 min intervals
ordered.temp<-structure(list(time = structure(c(1385244720, 1385246520, 1385248320,
1385250120, 1385251920, 1385253720, 1385255520, 1385257320, 1385259120,
1385260920), class = c("POSIXct", "POSIXt"), tzone = ""), temp = c(30.419,
29.34, 28.965, 28.866, 28.891, 28.866, 28.692, 28.419, 28.122,
27.85), hoboID = c(2392890L, 2392890L, 2392890L, 2392890L, 2392890L,
2392890L, 2392890L, 2392890L, 2392890L, 2392890L)), .Names = c("time",
"temp", "hoboID"), row.names = c(NA, 10L), class = "data.frame")
The other I created to be able to assign each measurement to a 3-hour bin:
df<-structure(list(start = structure(c(1385182800, 1385193600, 1385204400,
1385215200, 1385226000, 1385236800, 1385247600, 1385258400, 1385269200,
1385280000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1385193600,
1385204400, 1385215200, 1385226000, 1385236800, 1385247600, 1385258400,
1385269200, 1385280000, 1385290800), class = c("POSIXct", "POSIXt"
), tzone = ""), b = 1:10), .Names = c("start", "end", "b"), row.names = c(NA,
10L), class = "data.frame")
For simplicity, I created a subset of the data, but in reality the temp data frame is 460k rows long and growing bigger every year. I wrote a for loop to compare each row in the temp data frame (ordered.temp) with the rows of the bin data frame (df) and assign it the corresponding b value from the bin data frame.
m <- length(ordered.temp$time)
b <- numeric(m)
n <- length(df$start)
for (i in 1:m){
  for (j in 1:n){
    if (df$start[j] < ordered.temp$time[i] & ordered.temp$time[i] <= df$end[j]){
      b[i] <- df$b[j]
      print(i/dim(ordered.temp)[1]*100)
    }
  }
}
Running this loop with 460k rows takes a very long time (I ran the loop for 1 minute and calculated that it would take about 277 hours to complete). Therefore, it is imperative to speed this loop up, or to find an alternative method if that is not possible. However, I have no idea how to achieve the desired result. Any help would be greatly appreciated. Thanks.
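One possible vectorised alternative, offered as a sketch rather than a tested replacement for the loop: when the bins are sorted and contiguous (as in the sample df, where each start equals the previous end), base R's findInterval() can look up all measurements at once. The argument left.open = TRUE reproduces the loop's start < time <= end condition. The names bins, obs_time and breaks are my own, the values are rebuilt from the sample data as epoch seconds, and unmatched times become NA here instead of the loop's 0.

```r
# Bin edges in seconds since the epoch, rebuilt from the question's sample df:
# ten contiguous 3-hour bins, so start[j + 1] == end[j].
start <- 1385182800 + 10800 * (0:9)
bins  <- data.frame(start = start, end = start + 10800, b = 1:10)

# Measurement times from the question's sample (30-minute spacing).
obs_time <- 1385244720 + 1800 * (0:9)

# One sorted vector of break points: every start plus the final end.
breaks <- c(bins$start, bins$end[nrow(bins)])

# left.open = TRUE gives intervals (start, end], matching the loop's test.
idx <- findInterval(obs_time, breaks, left.open = TRUE)

# Times at or before the first start (idx 0) or after the last end fall
# outside every bin.
idx[idx < 1 | idx > nrow(bins)] <- NA
b <- bins$b[idx]
```

With POSIXct columns, as.numeric(df$start) and as.numeric(ordered.temp$time) would supply the same epoch seconds. This does a binary search per measurement instead of scanning every bin, so it should scale far better than the nested loop, though I have not timed it on the full 460k rows.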

Applying a function to a few rows then the next few rows

I am trying to find the max of rows 2:5, then 3:6, then 4:7 and so on through nrow(df). However, I am having trouble working out how to do this because I have never successfully used a for loop before. Any help is greatly appreciated.
structure(c(76.89, 77.08, 77.05, 77.28, 77.28, 77.61, 77.03,
77.61, 77.28, 77.3, 77.37, 77.61, 76.7, 77, 76.98, 77.09, 77.21,
77.5, 76.74, 77.49, 76.98, 77.2, 77.29, 77.58, NA, 76.91, 77.27,
77.13, 77.24, 77.45, NA, 0.910154726303475, 0.0129416332341208,
0.220407104887854, 0.168306576903153, 0.20658489347966, NA, 0.117019893381879,
-0.3753073637893, -0.0518604952677195, -0.0388399792853642, 0.0645577792123914
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", class = c("xts",
"zoo"), index = structure(c(631324800, 631411200, 631497600,
631756800, 631843200, 631929600), tzone = "UTC", tclass = "Date"), .Dim = 6:7, .Dimnames = list(
NULL, c("open", "high", "low", "close", "avgco", "percenthigh",
"percentlow")))
Specifically I want to apply the max function over the AD1$high column for rows 2 through 5 then rows 3 through 6 and so on and have this in a new column.
Thank You
You could do it by making three copies of your column (i.e "high") and offsetting them so one starts ahead one value and one starts behind one value. Then just take the max as you iterate across them:
y <- yourdata
t <- y[,"high"]
tback <- t[2:length(t)]
tforward <- append(NA,t)
using a loop
# maxvals must exist before the loop can fill it
maxvals <- numeric(length(t))
for(i in 1:length(t)) {
  maxvals[i] <- max(c(t[i],tback[i],tforward[i]), na.rm=TRUE)
}
output
> maxvals
[1] 77.61 77.61 77.61 77.37 77.61 77.61
Or you could do it more efficiently without a loop by initializing maxvals to the proper length and filling its values.
Using the zoo function "rollapply" solved my problem.
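For readers who want to see what a rolling maximum over four consecutive rows computes, here is a base-R sketch on made-up numbers (the vector high and the width of 4 are assumptions for illustration). With zoo loaded, something like rollapply(high, width = 4, FUN = max, align = "left", fill = NA) should give the same padded result, though that call is not run here.

```r
# Hypothetical "high" values, chosen so the rolling maximum changes.
high  <- c(5, 9, 3, 4, 8, 2, 7)
width <- 4

# embed() lays out every run of `width` consecutive values as one row
# (in reverse order, which does not matter for max).
windows  <- embed(high, width)
roll_max <- apply(windows, 1, max)

# Left-aligned version: element i is the max of rows i..(i + width - 1),
# with NA where a full window no longer fits.
padded <- c(roll_max, rep(NA, width - 1))
```

Here roll_max holds the maxima of rows 1:4, 2:5, 3:6 and 4:7, and padded can be attached to the data as a new column of the same length.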
