Reading large RDS files in R in a faster way

I have a large RDS file to read into R, but it takes quite some time to read. Is there a way to speed up the reading? I tried the data.table library with its fread function, but I get an error.
data <- readRDS("myData.rds")
data <- fread("myData.rds") # error: fread reads delimited text files, not RDS
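fread() only parses delimited text, so it cannot read an .rds file; the error is expected. If the object is a data frame, one workaround (a sketch with made-up data, not from the question) is a one-time export to delimited text with fwrite(), after which fread() can read it quickly:

```r
library(data.table)

# Hypothetical stand-in for the real data: a data-frame-shaped object.
dt <- data.table(x = 1:1e5, y = runif(1e5))

f <- tempfile(fileext = ".csv")
fwrite(dt, f)     # one-time export to delimited text
dt2 <- fread(f)   # fread handles delimited files, not .rds
```

For non-tabular objects this does not apply, and readRDS() with a suitable compression setting remains the way to go.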

One way to speed up reading large files is to read them in compressed form:
system.time(read.table("bigdata.txt", sep = ","))
#>    user  system elapsed
#> 170.901   1.996 192.137
Now the same read, but from a compressed file:
system.time(read.table("bigdata-compressed.txt.gz", sep = ","))
#>   user  system elapsed
#> 65.511   0.937  66.198

Compression also influences read speed for RDS files:
n <- 1000
m <- matrix(runif(n^2), ncol = n)
default <- tempfile()
unComp <- tempfile()
saveRDS(m, default)
saveRDS(m, unComp, compress = FALSE)
microbenchmark::microbenchmark(readRDS(default), readRDS(unComp))
#> Unit: milliseconds
#>              expr      min       lq     mean   median       uq      max neval cld
#>  readRDS(default) 46.37050 49.54836 56.03324 56.19446 59.99967 96.16305   100   b
#>   readRDS(unComp) 11.60771 13.16521 15.54902 14.01063 17.36194 27.35329   100  a
file.info(default)$size
#> [1] 5326357
file.info(unComp)$size
#> [1] 8000070
require(qs)
#> Loading required package: qs
#> qs v0.25.1.
qs <- tempfile()
qsave(m, qs)
microbenchmark::microbenchmark(qread(qs), readRDS(unComp))
#> Unit: milliseconds
#>             expr       min       lq     mean   median       uq      max neval cld
#>        qread(qs) 10.164793 12.26211 15.31887 14.71873 17.25536 27.08779   100   a
#>  readRDS(unComp)  9.342042 12.59317 15.63974 14.44625 17.93492 35.12563   100   a
file.info(qs)$size
#> [1] 4187017
However, as seen here, this comes at the cost of file size. The speed of the storage also matters: on slow storage (e.g. network shares or spinning disks) compression may actually win, because less data has to be read from disk. It is thus worth experimenting. Specific packages may provide even better performance; here qs matches the uncompressed read speed at a smaller file size, combining the best of both worlds. For specific data formats, other packages might work better still; see this overview: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets
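To pick a setting for your own data, you can time each compression method saveRDS() supports directly (a sketch; the matrix here is a stand-in for your real object):

```r
# Compare read time and file size for each compression setting of saveRDS().
m <- matrix(runif(1000^2), ncol = 1000)  # stand-in object

for (comp in list("gzip", "bzip2", "xz", FALSE)) {
  f <- tempfile(fileext = ".rds")
  saveRDS(m, f, compress = comp)
  elapsed <- system.time(readRDS(f))["elapsed"]
  cat(sprintf("compress = %-5s  %7.3f s  %9.0f bytes\n",
              as.character(comp), elapsed, file.info(f)$size))
  unlink(f)
}
```

Run this on the storage you will actually read from; as noted above, the winner can flip between fast local disks and slow network shares.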

Related

Extract jpg name from a url using R

Could someone help me? This is my problem:
I have a list of URLs in a tbl and I have to extract the jpg name.
This is the URL:
https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2
and this is the part to extract:
13643048_612108275661958_805860992_n
Thanks for the help.
This requires two things:
parse the URL itself
get the filename from the path of the URL
You can do both manually, but it's much better to use existing tools. The first part is solved by the parseURI function from the XML package:
uri = 'https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2'
parts = XML::parseURI(uri)
And the second part is trivially solved by the basename function:
filename = basename(parts$path)
Googling for "R parse URL" could have saved you ~400 keystrokes (though I expect the URL was pasted).
In any event, you want to process a vector of these things, so there's a better way. In fact there are multiple ways to do this URL path extraction in R. Here are 3:
library(stringi)
library(urltools)
library(httr)
library(XML)
library(dplyr)
We'll generate 100 unique URLs that fit the same Instagram pattern (NOTE: scraping Instagram is a violation of their ToS and controlled by robots.txt. If your URLs did not come from the Instagram API, please let me know so I can delete this answer, as I don't help content thieves).
set.seed(0)
paste(
  "https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/",
  stri_rand_strings(100, 8, "[0-9]"), "_",
  stri_rand_strings(100, 15, "[0-9]"), "_",
  stri_rand_strings(100, 9, "[0-9]"), "_",
  stri_rand_strings(100, 1, "[a-z]"),
  ".jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2",
  sep = ""
) -> img_urls
head(img_urls)
## [1] "https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2"
## [2] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/66021637_359927357880233_471353444_q.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [3] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/47937926_769874508959124_426288550_z.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [4] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/12303834_440673970920272_460810703_n.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [5] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/54186717_202600346704982_713363439_y.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [6] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/48675570_402479399847865_689787883_e.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
Now, let's try to parse those URLs:
invisible(urltools::url_parse(img_urls))
invisible(httr::parse_url(img_urls))
## Error in httr::parse_url(img_urls): length(url) == 1 is not TRUE
DOH! httr can't do it.
invisible(XML::parseURI(img_urls))
## Error in if (is.na(uri)) return(structure(as.character(uri), class = "URI")): the condition has length > 1
DOH! XML can't do it either.
That means we need to use an sapply() crutch for httr and XML to get the path component (you can run basename() on any resultant vector as Konrad showed):
data_frame(
  urltools = urltools::url_parse(img_urls)$path,
  httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
  XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
) -> paths
glimpse(paths)
## Observations: 100
## Variables: 3
## $ urltools <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ httr <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ XML <chr> "/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_...
Note the nonstandard inclusion of the initial / in the path from XML. That's not important for this example, but it's worth noting the difference in general.
We'll process one of them since XML and httr have that woeful limitation:
microbenchmark::microbenchmark(
  urltools = urltools::url_parse(img_urls[1])$path,
  httr = httr::parse_url(img_urls[1])$path,
  XML = XML::parseURI(img_urls[1])$path
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 351.268 397.6040 557.09641 499.2220 618.5945 1309.454 100
## httr 550.298 619.5080 843.26520 717.0705 888.3915 4213.070 100
## XML 11.858 16.9115 27.97848 26.1450 33.9065 109.882 100
XML looks faster, but it's not in practice:
microbenchmark::microbenchmark(
  urltools = urltools::url_parse(img_urls)$path,
  httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
  XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 718.887 853.374 1093.404 918.3045 1146.540 2872.076 100
## httr 58513.970 64738.477 80697.548 68908.7635 81549.154 224157.857 100
## XML 1155.370 1245.415 2012.660 1359.8215 1880.372 26184.943 100
If you really want to go the regex route, you can read the RFC for the URL BNF, a naive regex for hacking bits out of one, and the seminal example with over a dozen regular expressions that handle not-so-well-formed URIs. Parsing, though, is generally a better strategy for diverse URL content. For your case, splitting and regexing might work just fine, but it isn't necessarily that much faster than parsing:
microbenchmark::microbenchmark(
  urltools = tools::file_path_sans_ext(basename(urltools::url_parse(img_urls)$path)),
  httr = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE))),
  XML = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE))),
  regex = stri_match_first_regex(img_urls, "/([[:digit:]]{8}_[[:digit:]]{15}_[[:digit:]]{9}_[[:alpha:]]{1})\\.jpg\\?")[,2]
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## urltools 1.140421 1.228988 1.502525 1.286650 1.444522 6.970044 100
## httr 56.563403 65.696242 77.492290 69.809393 80.075763 157.657508 100
## XML 1.513174 1.604012 2.039502 1.702018 1.931468 11.306436 100
## regex 1.137204 1.223683 1.337675 1.260339 1.397273 2.241121 100
As noted in that final example, you'll need to run tools::file_path_sans_ext() on the result to remove the .jpg (or sub() it away).
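Putting it together without any extra packages, the whole extraction can also be done vectorized in base R (a sketch using made-up URLs): strip the query string, take basename(), then drop the extension.

```r
# Hypothetical example URLs following the same pattern.
urls <- c(
  "https://content_xxx.xxx.com/vp/a/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=x.2",
  "https://content_xxx.xxx.com/vp/b/e35/99999999_111111111111111_222222222_z.jpg?ff_cache_key=y.2"
)

paths <- sub("\\?.*$", "", urls)                    # drop the query string
ids   <- tools::file_path_sans_ext(basename(paths)) # filename without ".jpg"
ids
#> [1] "13643048_612108275661958_805860992_n" "99999999_111111111111111_222222222_z"
```

This trades robustness for simplicity: it assumes the filename never contains a "?" and that there is no URL fragment, which a real parser would handle for you.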

How is hashing done in environment in R? (for optimizing lookup performance)

When using a lookup by name in a list, it is possible to first turn the list into an environment with hashing. For example:
x <- 1:1e5
names(x) <- x
lx <- as.list(x)
elx <- list2env(lx, hash = TRUE) # takes some time
library(microbenchmark)
microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
With the following performance gain:
> microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
Unit: nanoseconds
                     expr    min       lq       mean   median       uq     max neval cld
    x[[which(x == 1000)]] 547213 681609.5 1063382.25 720718.5 788538.5 5999776   100   b
              x[["1000"]]   6518   6829.0    7961.83   7139.0   8070.0   22659   100  a
             lx[["1000"]]   6518   6829.0    8284.63   7140.0   8070.5   33212   100  a
 get("1000", envir = elx)    621    931.0    2477.22   1242.0   2794.0   20175   100  a
            elx[["1000"]]      0      1.0    1288.47    311.0   1552.0   22659   100  a
When looking at the help page for list2env:
(for the case envir = NULL): logical indicating if the created
environment should use hashing, see new.env.
When looking at the help for new.env, it doesn't explain how the hash table is created, but it does say:
For the performance implications of hashing or not, see
https://en.wikipedia.org/wiki/Hash_table.
So it's obvious that hashing is done, and works well (at least for the example I gave), but seeing from the Wikipedia page, it is clear there are various ways of creating hash tables. Hence, my question is: how is the hash table created in list2env?
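While the exact hash function is an implementation detail of R's C sources, you can at least inspect the resulting table from R with base R's env.profile(), which reports the number of buckets, the number of non-empty buckets, and the per-bucket chain lengths (a sketch):

```r
e <- new.env(hash = TRUE, size = 29L)
for (i in 1:100) assign(paste0("key", i), i, envir = e)

p <- env.profile(e)
p$size             # number of buckets in the hash table
p$nchains          # number of non-empty buckets
summary(p$counts)  # chain lengths per bucket; long chains mean many collisions
sum(p$counts)      # total number of entries, here 100
```

Note that R may grow the table beyond the initial size as entries are added, so p$size need not equal the size you requested.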

How to compute large object's hash value in R?

I have large objects in R that barely fit in my 16 GB of memory (a data.table database of >4M records and >400 variables).
I'd like to have a hash function that will be used to confirm that the database loaded into R is not modified.
One fast way to do that is to compare the database's hash with a previously stored hash.
The problem is that the digest::digest function copies (serializes) the data, and only after all the data is serialized does it calculate the hash, which is too late on my hardware... :-(
Does anyone know about a way around this problem?
There is a poor man's solution: save the object to a file and calculate the hash of the file. But it introduces a large, unnecessary overhead (I have to make sure there is spare space on the HDD for yet another copy, and I need to keep track of all the files, which may not be automatically deleted).
A similar problem has been described in the digest issue tracker here:
https://github.com/eddelbuettel/digest/issues/33
The current version of digest can read a file to compute the hash.
Therefore, at least on Linux, we can use a named pipe that the digest package reads in one thread while another thread writes the data into it from the other side.
The following code snippet shows how we can compute an MD5 hash of 10 numbers by feeding the digester first with 1:5 and then with 6:10.
library(parallel)
library(digest)

x <- as.character(1:10)   # input
fname <- "mystream.fifo"  # choose a name for your named pipe
close(fifo(fname, "w"))   # creates the pipe if it does not exist

producer <- mcparallel({
  mystream <- file(fname, "w")
  writeLines(x[1:5], mystream)
  writeLines(x[6:10], mystream)
  close(mystream)  # sends signal to the consumer (digester)
})
digester <- mcparallel({
  digest(fname, file = TRUE, algo = "md5")  # just reads the stream till signalled
})
# runs both processes in parallel
mccollect(list(producer, digester))
unlink(fname)  # named pipe removed
UPDATE: Henrik Bengtsson provided a modified example based on futures:
library("future")
plan(multiprocess)

x <- as.character(1:10)   # input
fname <- "mystream.fifo"  # choose a name for your named pipe
close(fifo(fname, open = "wb"))  # creates the pipe if it does not exist

producer %<-% {
  mystream <- file(fname, open = "wb")
  writeBin(x[1:5], endian = "little", con = mystream)
  writeBin(x[6:10], endian = "little", con = mystream)
  close(mystream)  # sends signal to the consumer (digester)
}

# just reads the stream till signalled
md5 <- digest::digest(fname, file = TRUE, algo = "md5")
print(md5)
## [1] "25867862802a623c16928216e2501a39"
## Note: identical on Linux and Windows
Following up on nicola's comment, here's a benchmark of the column-wise idea. It seems it doesn't help much, at least not at this size: iris is 150 rows, long_iris is 3M (3,000,000).
library(microbenchmark)
library(dplyr)  # for bind_rows

# iris
nrow(iris)
microbenchmark(
  whole = digest::digest(iris),
  cols = digest::digest(lapply(iris, digest::digest))
)

# long iris
long_iris = do.call(bind_rows, replicate(20e3, iris, simplify = FALSE))
nrow(long_iris)
microbenchmark(
  whole = digest::digest(long_iris),
  cols = digest::digest(lapply(long_iris, digest::digest))
)
Results:
# normal
Unit: milliseconds
  expr  min   lq mean median   uq  max neval cld
 whole 12.6 13.6 14.4   14.0 14.6 24.9   100   b
  cols 12.5 12.8 13.3   13.1 13.5 23.0   100  a
# long
Unit: milliseconds
  expr min  lq mean median  uq max neval cld
 whole 296 306  317    311 316 470   100   b
  cols 261 276  290    282 291 429   100  a

R package: hash: save into R objects on disk very slow

I am using the great hash package by Christopher Brown. In my use case, I have several thousand keys at the first level, and each related value may contain another 1-3 layers of hash objects. When I try to save it using save, it seems to take a really long time.
I then tried save and load on the exact same setup, except a smaller use case with about 100 keys. The save and load work well, except it does seem to take longer than doing the same for usual R objects of similar size.
Is this a known problem, and is there any workaround for the speed issue?
My machine setup: Mac OSX 10.6, RStudio 0.98.1091, hash 3.0.1.
The code used to generate the data, and the outcome (in comments), is below:
library(hash)
library(microbenchmark)
create_hash = function(ahash, level1 = 10, level2 = 5, level3 = 2) {
  for (i in 1:level1) {
    ahash[[paste0('a', i)]] = hash()
    for (j in 1:level2) {
      ahash[[paste0('a', i)]][[paste0('b', j)]] = hash()
      for (k in 1:level3) {
        ahash[[paste0('a', i)]][[paste0('b', j)]][[paste0('c', k)]] = hash()
        ahash[[paste0('a', i)]][[paste0('b', j)]][[paste0('c', k)]][['key1']] = 'value1'
        ahash[[paste0('a', i)]][[paste0('b', j)]][[paste0('c', k)]][['key2']] = 'value2'
      }
    }
  }
}
base1 = hash()
create_hash(base1, 100, 10, 2)
microbenchmark(save(base1, file='base1.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base1, file = "base1.Robj") 4.962731 4.987589 5.212594 5.102403 5.316056 5.694193 5
# File size: 1.6 MB
base2 = hash()
create_hash(base2, 1000, 10, 2)
microbenchmark(save(base2, file='base2.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base2, file = "base2.Robj") 108.6682 109.2254 110.4126 109.3526 111.1013 113.7154 5
# File size: 16.1 MB
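One workaround to try (an assumption on my part, not confirmed in this thread): since the cost comes from serializing thousands of small environment-backed hash objects, convert the structure to plain nested lists before saving and rebuild the hashes after loading. The helper names hash_to_list / list_to_hash are invented for this sketch.

```r
library(hash)

# Recursively convert a hash (environment-backed) to a plain nested list.
hash_to_list <- function(h) {
  if (!is(h, "hash")) return(h)
  lapply(as.list(h), hash_to_list)
}

# Rebuild the nested hash structure from its list representation.
list_to_hash <- function(x) {
  if (!is.list(x)) return(x)
  h <- hash()
  for (k in names(x)) h[[k]] <- list_to_hash(x[[k]])
  h
}

h <- hash(a = hash(x = "v1"), b = "v2")  # tiny stand-in for the big structure
l <- hash_to_list(h)
f <- tempfile()
save(l, file = f)        # plain lists serialize much like ordinary R objects
h2 <- list_to_hash(get(load(f)))
h2[["a"]][["x"]]
#> [1] "v1"
```

Whether the round-trip pays off depends on the shape of your data, so benchmark it on your own structure before committing to it.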

Are locked environments faster than unlocked environments?

I noticed the lockEnvironment function and was wondering if/when I should use it for environments. I often use environments as lookup tables because, being hash tables, they're faster than lists. Can locking an environment improve performance?
I did some testing but couldn't find a difference:
> library(microbenchmark)
> lst = as.list(paste0(rep(letters,each=10),1:10))
> names(lst) = lst
> a = list2env(lst,hash=TRUE,parent=emptyenv())
> b = list2env(lst,hash=TRUE,parent=emptyenv())
> lockEnvironment(b,bindings=TRUE)
> microbenchmark(a$z1,b$z1)
Unit: nanoseconds
expr min lq median uq max neval
a$z1 612 615 623.5 679.0 6238 100
b$z1 613 615 619.5 675.5 1943 100
Is locking just a reliability feature or are there ever performance differences?
lockEnvironment is used primarily internally by R to lock package environments after loading. There shouldn't be any performance impact from locking an environment, either positive or negative.
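In other words, locking is a safety feature rather than an optimization: it turns modification into an error, which is handy for lookup tables you want to protect (a small sketch):

```r
e <- new.env(hash = TRUE)
assign("answer", 42, envir = e)
lockEnvironment(e, bindings = TRUE)

try(assign("answer", 0, envir = e))  # error: cannot change a locked binding
try(assign("other", 1, envir = e))   # error: cannot add to a locked environment
get("answer", envir = e)
#> [1] 42
```

Reads are unaffected, which matches the benchmark above showing identical lookup times for locked and unlocked environments.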