Append a datetime object to a POSIXct vector - r

I need to append a datetime object to my POSIXct element. Some sample data:
my_chr <- c('2017-02-19 06:00', '2017-03-10 06:00', '2017-04-15 06:00')
myPSX <- as.POSIXct(my_chr, format = '%Y-%m-%d %H:%M', tz = 'UTC')
PSXappend <- as.POSIXct('2017-08-09 06:00', format = '%Y-%m-%d %H:%M', tz = 'UTC')
But somehow, if I try c(), it changes the time zone. If I try to coerce everything together with as.POSIXct, it drops the datetime object I need to append.

In this case you could append a value by indexing, which will neither change the time zone nor the class of myPSX:
myPSX[length(myPSX) + 1] <- PSXappend
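A quick check on the sample data (output shown for illustration) confirms that both the UTC time zone and the POSIXct class are preserved:
myPSX
# [1] "2017-02-19 06:00:00 UTC" "2017-03-10 06:00:00 UTC" "2017-04-15 06:00:00 UTC"
# [4] "2017-08-09 06:00:00 UTC"
attr(myPSX, 'tzone')
# [1] "UTC"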

Since I have to run this on quite a large dataset, I ran some benchmarks to compare the different possibilities. @Dan's solution is actually quite fast; however, using attr(dttm, 'tzone') <- 'UTC' is slightly faster.
myfun1 <- function() {
  myPSX[length(myPSX) + 1] <- PSXappend
}
myfun2 <- function() {
  dttm <- c(myPSX, PSXappend)
  attr(dttm, 'tzone') <- 'UTC'
}
library(lubridate)
myfun3 <- function() {
  dttm <- c(myPSX, PSXappend)
  with_tz(dttm, "UTC")
}
myfun4 <- function() {
  dttm <- as.POSIXct(c(my_chr, '2017-08-09 06:00'), format = '%Y-%m-%d %H:%M', tz = 'UTC')
}
microbenchmark::microbenchmark(myfun1(), myfun2(), myfun3(), myfun4())
Unit: microseconds
     expr    min     lq      mean  median      uq      max neval
 myfun1() 12.642 15.210  17.92005 16.9875 17.7780   59.654   100
 myfun2() 11.852 13.827  16.39909 14.4200 15.8025   43.062   100
 myfun3() 26.864 29.432 121.86874 30.8150 33.1850 5852.844   100
 myfun4() 31.605 34.766  61.66142 36.3460 40.2970 2182.323   100

Related

R speed of data.table

I have a specific performance issue that I wish to generalize, if possible.
Context:
I've been playing around on Google Colab with a Python code sample for a Q-learning agent, which associates a state and an action with a value using a defaultdict:
self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
return self._qvalues[state][action]
I am not an expert, but my understanding is that it returns the stored value, or inserts 0 and returns it if the key is not found.
I'm adapting part of this in R.
The problem is that I don't know how many state/action combinations I have, and technically I should not need to know the number of states in advance.
At first I went the wrong way, rbind-ing data.frames, and that was very slow.
I then replaced my R object with a data.frame(state, action, value = NA_real_).
It works, but it's still very slow. Another problem is that my data.frame object is pre-allocated at its maximum size, which might be problematic in the future.
I then changed my data.frame to a data.table, which gave me worse performance, until I finally indexed it by (state, action).
qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
                      qaction = rep(seq(nbactions), times = nbstates),
                      qvalue = NA_real_,
                      stringsAsFactors = FALSE)
setkey(qvalues, "qstate", "qaction")
Problem:
Comparing Google Colab/Python vs my local R implementation, Google performs 1000x10e4 accesses to the object in, let's say, 15 s, while my code performs 100x100 accesses in 28 s. I got a 2 s improvement by byte-compiling, but that's still far too slow.
Using profvis, I see most of the time is spent accessing the data.table on these two calls:
qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
self$qvalues[J(state, action)]$qvalue <- value
I don't really know what hardware Google runs on, but my desktop is a beast. I also saw some benchmarks stating that data.table is faster than pandas, so I suppose the problem lies in my choice of container.
Questions:
Is my use of data.table wrong, and can it be fixed to improve and match the Python implementation?
Is another design possible that avoids declaring all the state/action combinations, which could be a problem if the dimensions become too large?
I've seen mentions of the hash package; is it the way to go?
Thanks a lot for any pointer!
UPDATE:
Thanks for all the input.
So what I did was replace 3 accesses to my data.table using your suggestions:
#self$qvalues[J(state, action)]$qvalue <- value
self$qvalues[J(state, action), qvalue := value]
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
This dropped the runtime from 33 s to 21 s.
That's a massive improvement, but it's still extremely slow compared to the Python defaultdict implementation.
I noted the following:
Working in batch: I don't think I can, as each call to the function depends on the previous call.
peudospin> I see you are surprised that the get is time-consuming. So am I, but that's what profvis states:
And here is the code of the function, for reference:
QAgent$set("public", "get_qvalue", function(state, action) {
  #qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
  qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
  if (is.na(qval)) {
    #self$qvalues[self$qvalues$qstate == state & self$qvalues$qaction == action,]$qvalue <- 0
    #self$qvalues[J(state, action),]$qvalue <- 0
    self$qvalues[J(state, action), qvalue := 0]
    return(0)
  }
  return(qval)
})
At this point, if there are no more suggestions, I will conclude that data.table is just too slow for this kind of task, and that I should look into using an env or collections (as suggested there: R fast single item lookup from list vs data.table vs hash).
CONCLUSION:
I replaced the data.table with a collections::dict and the bottleneck completely disappeared.
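For readers who want to see roughly what such a replacement looks like, here is a minimal sketch assuming the collections package's dict() API ($set() and $get() with a default); the composite string key is an illustrative choice, not the OP's actual code:
library(collections)

qvalues <- dict()

state <- 3L; action <- 7L                 # example indices
key <- paste(state, action, sep = "_")    # hypothetical composite key
qvalues$set(key, 0.5)                     # store a Q-value
qvalues$get(key, default = 0)             # returns 0.5
qvalues$get("99_99", default = 0)         # unseen pair: defaultdict-like fallback to 0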
data.table is fast for doing lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one like Python dictionaries. I'd expect it to copy the whole table each time you add a row, which is clearly not what you want.
You can either try to use environments (which are something like a hashmap), or, if you really want to do this in R, you may need a specialist package; here's a link to an answer with a few options.
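As a rough sketch of the environment idea (illustrative only; the key format and the defaultdict-style fallback to 0 are assumptions, not code from this answer):
qvalues <- new.env(hash = TRUE)

get_qvalue <- function(state, action) {
  key <- paste(state, action, sep = "_")
  if (exists(key, envir = qvalues, inherits = FALSE)) {
    get(key, envir = qvalues, inherits = FALSE)
  } else {
    assign(key, 0, envir = qvalues)  # mimic defaultdict: insert 0 on first access
    0
  }
}

set_qvalue <- function(state, action, value) {
  assign(paste(state, action, sep = "_"), value, envir = qvalues)
}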
Benchmark
library(data.table)
Sys.setenv('R_MAX_VSIZE' = 32000000000)  # raise the memory limit
setDTthreads(threads = 0)                # use the maximum number of threads possible

nbstates <- 1e3
nbactions <- 1e5

cartesian <- function(nbstates, nbactions) {
  x <- data.table(qstate = 1:nbstates)
  y <- data.table(qaction = 1:nbactions)
  k <- NULL
  x <- x[, c(k = 1, .SD)]
  setkey(x, k)
  y <- y[, c(k = 1, .SD)]
  setkey(y, NULL)
  x[y, allow.cartesian = TRUE][, c("k", "qvalue") := list(NULL, NA_real_)][]
}
# comparing seq() with `:`
(bench <- microbenchmark::microbenchmark(
  1:1e9,
  seq(1e9),
  times = 1000L
))
#> Unit: nanoseconds
#>        expr  min   lq     mean median   uq   max neval
#>     1:1e+09  120  143  176.264  156.0  201  5097  1000
#>  seq(1e+09) 3039 3165 3333.339 3242.5 3371 21648  1000
ggplot2::autoplot(bench)
(bench <- microbenchmark::microbenchmark(
  "Cartesian product" = cartesian(nbstates, nbactions),
  "data.table assignment" = qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
                                                  qaction = rep(seq(nbactions), times = nbstates),
                                                  qvalue = NA_real_,
                                                  stringsAsFactors = FALSE),
  times = 100L
))
#> Unit: seconds
#>                   expr      min       lq     mean   median       uq      max neval
#>      Cartesian product 3.181805 3.690535 4.093756 3.992223 4.306766 7.662306   100
#>  data.table assignment 5.207858 5.554164 5.965930 5.895183 6.279175 7.670521   100
#>      data.table (1:nb) 5.006773 5.609738 5.828659 5.800340 5.979303 6.727074   100
ggplot2::autoplot(bench)
It is clear that using seq() consumes more time than calling 1:nb. In addition, using a cartesian product makes the code faster even when 1:nb is used.
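As an aside, data.table also provides a cross-join helper, CJ(), which is a more idiomatic way to build the full state/action grid; a minimal sketch (not benchmarked above):
library(data.table)

nbstates <- 1e3
nbactions <- 1e5

# CJ() builds the keyed cartesian product of its arguments directly
qvalues <- CJ(qstate = seq_len(nbstates), qaction = seq_len(nbactions))
qvalues[, qvalue := NA_real_]
key(qvalues)  # "qstate" "qaction" -- already keyed for J(state, action) lookups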

R string operation : How could i optimize this one?

TL;DR: I want to pad each string in a vector to a given size with a given character on the left. I want it fast. See the code and example below.
I have a very large vector of strings, containing... well, anything, but with a known maximum number of characters. I want to left-pad those strings with zeros to a given size (greater than the maximum number of characters).
suppose :
c("yop",NA,"1234567","19","12AN","PLOP","5689777")
Given, for example, a target size of 10, I want:
[1] "0000000yop" NA "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
as a result, as fast as possible.
I've tried to write my own, but it's not really fast... Could you help me make it faster? I have billions of those to process.
Here's my actual code:
library(purrr)
zero_left <- function(field, nb) {
  map2_chr(
    map(abs(nb - nchar(field)), ~ rep("0", .x)),
    field,
    ~ paste0(c(.x, .y), collapse = ""))
}
trial <- c("yop","1234567","19","12AN","PLOP","5689777")
zero_left(trial,10)
This code does not even handle the NA case... Without NAs it works, but it is too slow.
This relies on an external package but takes 1/30 of the time your zero_left() function takes:
nb <- 10
stringr::str_pad(trial, width=nb, pad="0")
[1] "0000000yop" "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
Edit 1:
A base R solution that probably isn't quite as fast:
gsub(pattern = " ", replacement = "0", sprintf("%*s", nb, trial), fixed = TRUE)
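To make the trick explicit: sprintf("%*s", nb, trial) first right-justifies each string to width nb with spaces, and gsub() then turns those spaces into zeros. For the trial vector above, the intermediate step looks like this (output shown for illustration):
sprintf("%*s", nb, trial)
# [1] "       yop" "   1234567" "        19" "      12AN" "      PLOP" "   5689777"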
Edit 2:
Remembering that stringr is just a wrapper around stringi functions, you can get another speed boost by using stringi directly:
stringi::stri_pad_left(trial, width = nb, pad = "0")
If speed is your concern, base R can be faster than stringr/stringi:
library(microbenchmark)
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base=paste(strrep("0", nb - nchar(trial)), trial, sep="")
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 21.292 22.747 24.87188 23.7070 24.4735 129.470 100
# stringi 10.473 12.359 13.15298 13.0180 13.5445 21.418 100
# base 7.848 9.392 10.83702 10.2035 10.8980 43.620 100
The only consequence is that the NA is turned into a literal "NANA" here
paste(strrep("0", nb - nchar(trial)), trial, sep="")
# [1] "0000000yop" "NANA" "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"
so the workaround is
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base={v=paste(strrep("0", nb - nchar(trial)), trial, sep="");v[is.na(trial)]=NA;}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 20.657 22.6440 23.99204 23.3870 24.6190 60.096 100
# stringi 10.980 12.1585 13.57061 13.0790 13.7800 64.135 100
# base 10.766 11.9185 13.69714 13.0665 13.8035 87.226 100
(Which makes base R about as fast as stringi and slightly faster than stringr, in this case.)
(I'm mildly annoyed that paste converts NA to "NA", though that's already been addressed here on SO.)

How is hashing done in environment in R? (for optimizing lookup performance)

When using a lookup by name in a list, it is possible to first turn the list into an environment with hashing. For example:
x <- 1:1e5
names(x) <- x
lx <- as.list(x)
elx <- list2env(lx, hash = TRUE) # takes some time
library(microbenchmark)
microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
With the following performance gain:
> microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
Unit: nanoseconds
expr min lq mean median uq max neval cld
x[[which(x == 1000)]] 547213 681609.5 1063382.25 720718.5 788538.5 5999776 100 b
x[["1000"]] 6518 6829.0 7961.83 7139.0 8070.0 22659 100 a
lx[["1000"]] 6518 6829.0 8284.63 7140.0 8070.5 33212 100 a
get("1000", envir = elx) 621 931.0 2477.22 1242.0 2794.0 20175 100 a
elx[["1000"]] 0 1.0 1288.47 311.0 1552.0 22659 100 a
When looking at the help page for list2env:
(for the case envir = NULL): logical indicating if the created
environment should use hashing, see new.env.
When looking at the help for new.env, it doesn't explain how the hash table is created, but it does say:
For the performance implications of hashing or not, see
https://en.wikipedia.org/wiki/Hash_table.
So it's obvious that hashing is done, and works well (at least for the example I gave), but seeing from the Wikipedia page, it is clear there are various ways of creating hash tables. Hence, my question is: how is the hash table created in list2env?

R package hash: saving R objects to disk is very slow

I am using the great hash package by Christopher Brown. In my use case, I have several thousand keys in the first level, and each related value may contain another 1-3 layers of hash objects. When I try to save it using save, it seems to take a really long time.
I then tried save and load on the exact same setup, except with a smaller use case of about 100 keys. The save and load work well, but they do seem to take longer than doing the same for ordinary R objects of similar size.
Is this a known problem, and is there any workaround for the speed issue?
My machine setup: Mac OSX 10.6, RStudio 0.98.1091, hash 3.0.1.
The code used to generate the data, and the outcome (in comment) is below:
library(hash)
library(microbenchmark)
create_hash = function(ahash, level1 = 10, level2 = 5, level3 = 2) {
  for (i in 1:level1) {
    ahash[[paste0('a', i)]] = hash()
    for (j in 1:level2) {
      ahash[[paste0('a', i)]][[paste0('b', j)]] = hash()
      for (k in 1:level3) {
        ahash[[paste0('a', i)]][[paste0('b', j)]][[paste0('c', k)]] = hash()
        ahash[[paste0('a', i)]][[paste0('b', j)]][[paste0('c', k)]][['key1']] = 'value1'
        ahash[[paste0('a', i)]][[paste0('b', j)]][[paste0('c', k)]][['key2']] = 'value2'
      }
    }
  }
}
base1 = hash()
create_hash(base1, 100, 10, 2)
microbenchmark(save(base1, file='base1.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base1, file = "base1.Robj") 4.962731 4.987589 5.212594 5.102403 5.316056 5.694193 5
# File size: 1.6 MB
base2 = hash()
create_hash(base2, 1000, 10, 2)
microbenchmark(save(base2, file='base2.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base2, file = "base2.Robj") 108.6682 109.2254 110.4126 109.3526 111.1013 113.7154 5
# File size: 16.1 MB

Reading large RDS files in R in a faster way

I have a large RDS file to read into R. However, it takes quite some time to read the file.
Is there a way to speed up the reading? I tried the data.table package with its fread function, but I get an error.
data <- readRDS("myData.rds")
data <- fread("myData.rds") # error
One way to speed up reading large text files is to read them in compressed form:
system.time(read.table("bigdata.txt", sep = ","))
#    user  system elapsed
# 170.901   1.996 192.137
Now trying the same read, but with a compressed file:
system.time(read.table("bigdata-compressed.txt.gz", sep = ","))
#   user  system elapsed
#  65.511   0.937  66.198
Compression also influences the reading speed of RDS files:
n <- 1000
m <- matrix(runif(n^2), ncol = n)
default <- tempfile()
unComp <- tempfile()
saveRDS(m, default)
saveRDS(m, unComp, compress = FALSE)
microbenchmark::microbenchmark(readRDS(default), readRDS(unComp))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> readRDS(default) 46.37050 49.54836 56.03324 56.19446 59.99967 96.16305 100
#> readRDS(unComp) 11.60771 13.16521 15.54902 14.01063 17.36194 27.35329 100
#> cld
#> b
#> a
file.info(default)$size
#> [1] 5326357
file.info(unComp)$size
#> [1] 8000070
require(qs)
#> Loading required package: qs
#> qs v0.25.1.
qs<-tempfile()
qsave(m, qs)
microbenchmark::microbenchmark(qread(qs), readRDS(unComp))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> qread(qs) 10.164793 12.26211 15.31887 14.71873 17.25536 27.08779 100
#> readRDS(unComp) 9.342042 12.59317 15.63974 14.44625 17.93492 35.12563 100
#> cld
#> a
#> a
file.info(qs)$size
#> [1] 4187017
However, as seen here, this comes at the cost of file size. The speed of the storage also plays a role: on slow storage (e.g. network drives or spinning disks) it might actually be better to use compression, since the smaller file is read from disk more quickly. It is thus worth experimenting. Specific packages might even provide slightly better performance; here, qs achieves the same read speed with a smaller file, combining the best of both worlds. For specific data formats other packages might work better; see this overview: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets
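If you want to push the experiment further, qs also exposes a preset argument (assuming the qs API as of roughly v0.25, with presets such as "fast", "balanced" and "high") that trades file size against speed; a minimal sketch:
library(qs)

m <- matrix(runif(1000^2), ncol = 1000)
f_fast <- tempfile()
f_high <- tempfile()

qsave(m, f_fast, preset = "fast")  # larger file, quicker (de)compression
qsave(m, f_high, preset = "high")  # smaller file, more CPU work

microbenchmark::microbenchmark(qread(f_fast), qread(f_high), times = 20)
file.info(c(f_fast, f_high))$size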
