R package: hash: save into R objects on disk very slow - r

I am using the great hash package by Christopher Brown. In my use case, I have several thousand keys in the first level, and each related value may save another 1-3 layers of hash objects. When I try to save it using save, it seems to take a really long time.
I then tried save and load on the exact same setup, except a smaller use case with about 100 keys. The save and load work well, except it does seem to take longer than doing the same for usual R objects of similar size.
Is this a known problem, and is there any work around of the speed issue?
My machine setup: Mac OSX 10.6, RStudio 0.98.1091, hash 3.0.1.
The code used to generate the data, and the outcome (in comment) is below:
library(hash)
library(microbenchmark)
create_hash = function(ahash, level1=10, level2=5, level3=2) {
for (i in 1:level1) {
ahash[[paste0('a',i)]] = hash()
for (j in 1:level2) {
ahash[[paste0('a',i)]][[paste0('b',j)]] = hash()
for (k in 1:level3) {
ahash[[paste0('a',i)]][[paste0('b',j)]][[paste0('c',k)]] = hash()
ahash[[paste0('a',i)]][[paste0('b',j)]][[paste0('c',k)]][['key1']] = 'value1'
ahash[[paste0('a',i)]][[paste0('b',j)]][[paste0('c',k)]][['key2']] = 'value2'
}
}
}
}
base1 = hash()
create_hash(base1, 100, 10, 2)
microbenchmark(save(base1, file='base1.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base1, file = "base1.Robj") 4.962731 4.987589 5.212594 5.102403 5.316056 5.694193 5
# File size: 1.6 MB
base2 = hash()
create_hash(base2, 1000, 10, 2)
microbenchmark(save(base2, file='base2.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base2, file = "base2.Robj") 108.6682 109.2254 110.4126 109.3526 111.1013 113.7154 5
# File size: 16.1 MB

Related

R string operation : How could i optimize this one?

TL;DR : I want to complete each string of a list to a given size by a given character on left. I want it fast. See code below and exemple
I have veeeeery large vector of strings, containing... well anything, but with a maximum (known) number of character. I want to complete thoose strings by left Zero's to a given size (superior to the maximum number of char)
suppose :
c("yop",NA,"1234567","19","12AN","PLOP","5689777")
Given for exemple an objective size of 10, i want :
[1] "0000000yop" NA "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
as a result, as fast as possible.
I've tried to write my own, but it's not really fast... Could you help me making it faster ? I have billions of thoose to treat.
Here's my actual code :
library(purrr)
zero_left <- function(field,nb){
map2_chr(
map(abs(nb-nchar(field)),~ rep("0",.x)),
field,
~ paste0(c(.x,.y),collapse=""))
}
trial <- c("yop","1234567","19","12AN","PLOP","5689777")
zero_left(trial,10)
This code does not even treat the NA case... But without it it works, but too slow.
This relies on an external package but takes 1/30 of the time your zero_left() function takes:
nb <- 10
stringr::str_pad(trial, width=nb, pad="0")
[1] "0000000yop" "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
Edit 1:
Base-R solution that is seems probably isn't just as fast:
gsub(pattern = " ", replacement = "0", sprintf("%*s", nb, trial), fixed = TRUE)
Edit 2:
Remembering that stringr is just a wrapper for stringi functions you can get another speedboost by using stringi directly:
stringi::stri_pad_left(trial, width = nb, pad = "0")
If speed is your concern, base R can be faster than stringr/stringi:
library(microbenchmark)
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base=paste(strrep("0", nb - nchar(trial)), trial, sep="")
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 21.292 22.747 24.87188 23.7070 24.4735 129.470 100
# stringi 10.473 12.359 13.15298 13.0180 13.5445 21.418 100
# base 7.848 9.392 10.83702 10.2035 10.8980 43.620 100
The only consequence is that the NA is turned into a literal "NANA" here
paste(strrep("0", nb - nchar(trial)), trial, sep="")
# [1] "0000000yop" "NANA" "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"
so the workaround is
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base={v=paste(strrep("0", nb - nchar(trial)), trial, sep="");v[is.na(trial)]=NA;}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 20.657 22.6440 23.99204 23.3870 24.6190 60.096 100
# stringi 10.980 12.1585 13.57061 13.0790 13.7800 64.135 100
# base 10.766 11.9185 13.69714 13.0665 13.8035 87.226 100
(Which makes base R about as fast as stringi and slightly faster than stringr, in this case.)
(I'm mildly annoyed that paste converts NA to "NA", though that's already been addressed here on SO.)

append a datetime object on POSIXct

I need to append a datetime object to my POSIXct element. Some sampledata:
my_chr<-c('2017-02-19 06:00','2017-03-10 06:00','2017-04-15 06:00')
myPSX<-as.POSIXct(my_chr,format='%Y-%m-%d %H:%M',tz='UTC')
PSXappend<-as.POSIXct('2017-08-09 06:00',format='%Y-%m-%d %H:%M',tz='UTC')
But somehow if I try c() it changes the timezone. If i try to coerce it together with as.POSIXct it drops the datetime object I need to append.
In this case you could append a value by indexing, which will neither change the time zone nor the class of myPSX:
myPSX[length(myPSX) + 1] <- PSXappend
Since I have to run this on quite a large dataset, I ran some benchmarks to compare the different possibilities. Actually #Dan's solution is quite fast. However using attr(dttm,'tzone')<-'UTC' is slightly faster.
myfun1<-function(){
myPSX[length(myPSX) + 1] <- PSXappend
}
myfun2<-function(){
dttm<-c(myPSX,PSXappend)
attr(dttm,'tzone')<-'UTC'
}
library(lubridate)
myfun3<-function(){
dttm<-c(myPSX,PSXappend)
with_tz(dttm, "UTC")
}
myfun4<-function(){
dttm<-as.POSIXct(c(my_chr,'2017-08-09 06:00'),format='%Y-%m-%d %H:%M',tz='UTC')
}
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4())
Unit: microseconds
expr min lq mean median uq max neval
myfun1() 12.642 15.210 17.92005 16.9875 17.7780 59.654 100
myfun2() 11.852 13.827 16.39909 14.4200 15.8025 43.062 100
myfun3() 26.864 29.432 121.86874 30.8150 33.1850 5852.844 100
myfun4() 31.605 34.766 61.66142 36.3460 40.2970 2182.323 100

How is hashing done in environment in R? (for optimizing lookup performance)

When using a lookup by name in a list, it is possible to first turn the list into an environment with hashing. For example:
x <- 1:1e5
names(x) <- x
lx <- as.list(x)
elx <- list2env(lx, hash = TRUE) # takes some time
library(microbenchmark)
microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
With the following performance gain:
> microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
Unit: nanoseconds
expr min lq mean median uq max neval cld
x[[which(x == 1000)]] 547213 681609.5 1063382.25 720718.5 788538.5 5999776 100 b
x[["1000"]] 6518 6829.0 7961.83 7139.0 8070.0 22659 100 a
lx[["1000"]] 6518 6829.0 8284.63 7140.0 8070.5 33212 100 a
get("1000", envir = elx) 621 931.0 2477.22 1242.0 2794.0 20175 100 a
elx[["1000"]] 0 1.0 1288.47 311.0 1552.0 22659 100 a
When looking at the help page for list2env:
(for the case envir = NULL): logical indicating if the created
environment should use hashing, see new.env.
When looking at the help for new.env, it doesn't explain how the hash table is created, but it does say:
For the performance implications of hashing or not, see
https://en.wikipedia.org/wiki/Hash_table.
So it's obvious that hashing is done, and works well (at least for the example I gave), but seeing from the Wikipedia page, it is clear there are various ways of creating hash tables. Hence, my question is: how is the hash table created in list2env?

How to compute large object's hash value in R?

I have large objects in R, that barely fits in my 16GB memory (a data.table database of >4M records, >400 variables).
I'd like to have a hash function that will be used to confirm, that the database loaded into R is not modified.
One fast way to do that is to calculate the database's hash with the previously stored hash.
The problem is that digest::digest function copies (serializes) the data, and only after all data are serialized it will calculate the hash. Which is too late on my hardware... :-(
Does anyone know about a way around this problem?
There is a poor's man solution: save the object into the file, and calculate the hash of the file. But it introduces large, unnecessary overhead (I have to make sure there is a spare on HDD for yet another copy, and need to keep track of all the files that may not be automatically deleted)
Similar problem has been described in our issue tracker here:
https://github.com/eddelbuettel/digest/issues/33
The current version of digest can read a file to compute the hash.
Therefore, at least on Linux, we can use a named pipe which will be read by the digest package (in one thread) and from the other side data will be written by another thread.
The following code snippet shows how we can compute a MD5 hash from 10 number by feeding the digester first with 1:5 and then 6:10.
library(parallel)
library(digest)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, "w")) # creates your pipe if does not exist
producer <- mcparallel({
mystream <- file(fname, "w")
writeLines(x[1:5], mystream)
writeLines(x[6:10], mystream)
close(mystream) # sends signal to the consumer (digester)
})
digester <- mcparallel({
digest(fname, file = TRUE, algo = "md5") # just reads the stream till signalled
})
# runs both processes in parallel
mccollect(list(producer, digester))
unlink(fname) # named pipe removed
UPDATE: Henrik Bengtsson provided a modified example based on futures:
library("future")
plan(multiprocess)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, open="wb")) # creates your pipe if does not exists
producer %<-% {
mystream <- file(fname, open="wb")
writeBin(x[1:5], endian="little", con=mystream)
writeBin(x[6:10], endian="little", con=mystream)
close(mystream) # sends signal to the consumer (digester)
}
# just reads the stream till signalled
md5 <- digest::digest(fname, file = TRUE, algo = "md5")
print(md5)
## [1] "25867862802a623c16928216e2501a39"
# Note: Identical on Linux and Windows
Following up on nicola's comment, here's a benchmark of the column-wise idea. It seems it doesn't help much, at least not for these at this size. iris is 150 rows, long_iris is 3M (3,000,000).
library(microbenchmark)
#iris
nrow(iris)
microbenchmark(
whole = digest::digest(iris),
cols = digest::digest(lapply(iris, digest::digest))
)
#long iris
long_iris = do.call(bind_rows, replicate(20e3, iris, simplify = F))
nrow(long_iris)
microbenchmark(
whole = digest::digest(long_iris),
cols = digest::digest(lapply(long_iris, digest::digest))
)
Results:
#normal
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 12.6 13.6 14.4 14.0 14.6 24.9 100 b
cols 12.5 12.8 13.3 13.1 13.5 23.0 100 a
#long
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 296 306 317 311 316 470 100 b
cols 261 276 290 282 291 429 100 a

Reading large RDS files in R in a faster way

I have a large RDS file to read in R. However, it takes quite some time to read the file.
Is there a way to speed up the reading ? I tried data.table library with its fread function, but I get an error.
data <- readRDS("myData.rds")
data <- fread("myData.rds") # error
One way to fasten the read operations of large files is to read it in a compressed mode
system.time(read.table("bigdata.txt", sep=","))
user: 170.901
system: 1.996
elapsed: 192.137
Now trying the same reading but with a compressed file
system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user: 65.511
system: 0.937
elapsed: 66.198
Compression can also influence the speed of reading for rds files:
n<-1000
m<-matrix(runif(n^2), ncol=n)
default<-tempfile()
unComp<-tempfile()
saveRDS(m,default)
saveRDS(m, unComp,compress = F)
microbenchmark::microbenchmark(readRDS(default), readRDS(unComp))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> readRDS(default) 46.37050 49.54836 56.03324 56.19446 59.99967 96.16305 100
#> readRDS(unComp) 11.60771 13.16521 15.54902 14.01063 17.36194 27.35329 100
#> cld
#> b
#> a
file.info(default)$size
#> [1] 5326357
file.info(unComp)$size
#> [1] 8000070
require(qs)
#> Loading required package: qs
#> qs v0.25.1.
qs<-tempfile()
qsave(m, qs)
microbenchmark::microbenchmark(qread(qs), readRDS(unComp))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> qread(qs) 10.164793 12.26211 15.31887 14.71873 17.25536 27.08779 100
#> readRDS(unComp) 9.342042 12.59317 15.63974 14.44625 17.93492 35.12563 100
#> cld
#> a
#> a
file.info(qs)$size
#> [1] 4187017
However as seen here it comes at the cost of file size. It might also be that the speed of storage has an influence. On slow storage (e.g. network, spinning disks) it might actually be better to use compression as the file is quicker read from disk. It is thus work experimenting. Specific packages might even provide slightly better performance here qs has the same speed but a smaller size combining the good of both worlds. For specific data formats other packages might work better see this overview: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets

Resources