R string operation : How could i optimize this one? - r

TL;DR : I want to complete each string of a list to a given size by a given character on left. I want it fast. See code below and exemple
I have veeeeery large vector of strings, containing... well anything, but with a maximum (known) number of character. I want to complete thoose strings by left Zero's to a given size (superior to the maximum number of char)
suppose :
c("yop",NA,"1234567","19","12AN","PLOP","5689777")
Given for exemple an objective size of 10, i want :
[1] "0000000yop" NA "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
as a result, as fast as possible.
I've tried to write my own, but it's not really fast... Could you help me making it faster ? I have billions of thoose to treat.
Here's my actual code :
library(purrr)
zero_left <- function(field,nb){
map2_chr(
map(abs(nb-nchar(field)),~ rep("0",.x)),
field,
~ paste0(c(.x,.y),collapse=""))
}
trial <- c("yop","1234567","19","12AN","PLOP","5689777")
zero_left(trial,10)
This code does not even treat the NA case... But without it it works, but too slow.

This relies on an external package but takes 1/30 of the time your zero_left() function takes:
nb <- 10
stringr::str_pad(trial, width=nb, pad="0")
[1] "0000000yop" "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
Edit 1:
Base-R solution that is seems probably isn't just as fast:
gsub(pattern = " ", replacement = "0", sprintf("%*s", nb, trial), fixed = TRUE)
Edit 2:
Remembering that stringr is just a wrapper for stringi functions you can get another speedboost by using stringi directly:
stringi::stri_pad_left(trial, width = nb, pad = "0")

If speed is your concern, base R can be faster than stringr/stringi:
library(microbenchmark)
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base=paste(strrep("0", nb - nchar(trial)), trial, sep="")
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 21.292 22.747 24.87188 23.7070 24.4735 129.470 100
# stringi 10.473 12.359 13.15298 13.0180 13.5445 21.418 100
# base 7.848 9.392 10.83702 10.2035 10.8980 43.620 100
The only consequence is that the NA is turned into a literal "NANA" here
paste(strrep("0", nb - nchar(trial)), trial, sep="")
# [1] "0000000yop" "NANA" "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"
so the workaround is
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base={v=paste(strrep("0", nb - nchar(trial)), trial, sep="");v[is.na(trial)]=NA;}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 20.657 22.6440 23.99204 23.3870 24.6190 60.096 100
# stringi 10.980 12.1585 13.57061 13.0790 13.7800 64.135 100
# base 10.766 11.9185 13.69714 13.0665 13.8035 87.226 100
(Which makes base R about as fast as stringi and slightly faster than stringr, in this case.)
(I'm mildly annoyed that paste converts NA to "NA", though that's already been addressed here on SO.)

Related

R speed of data.table

I have a specific performance issue, that i wish to extend more generally if possible.
Context:
I've been playing around on google colab with a python code sample for a Q-Learning agent, which associate a state and an action to a value using a defaultdict:
self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
return self._qvalues[state][action]
Not an expert but my understanding is it returns the value or add and returns 0 if the key is not found.
i'm adapting part of this in R.
the problem is I don't how many state/values combinations I have, and technically i should not know how many states I guess.
At first I went the wrong way, with the rbind of data.frames and that was very slow.
I then replaced my R object with a data.frame(state, action, value = NA_real).
it works but it's still very slow. another problem is my data.frame object has the maximum size which might be problematic in the future.
then I chanded my data.frame to a data.table, which gave me worst performance, then I finally indexed it by (state, action).
qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
qaction = rep(seq(nbactions), times = nbstates),
qvalue = NA_real_,
stringsAsFactors = FALSE)
setkey(qvalues, "qstate", "qaction")
Problem:
Comparing googlecolab/python vs my local R implementation, google performs 1000x10e4 access to the object in let's say 15s, while my code performs 100x100 access in 28s. I got 2s improvements by byte compiling but that's still too bad.
Using profvis, I see most of the time is spent accessing the data.table on these two calls:
qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
self$qvalues[J(state, action)]$qvalue <- value
I don't really know what google has, but my desktop is a beast. Also I saw some benchmarks stating data.table was faster than pandas, so I suppose the problem lies in my choice of container.
Questions:
is my use of a data.table wrong and can be fixed to improve and match the python implementation?
is another design possible to avoid declaring all the state/actions combinations which could be a problem if the dimensions become too large?
i've seen about the hash package, is it the way to go?
Thanks a lot for any pointer!
UPDATE:
thanks for all the input.
So what I did was to replace 3 access to my data.table using your suggestions:
#self$qvalues[J(state, action)]$qvalue <- value
self$qvalues[J(state, action), qvalue := value]
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
this dropped the runtime from 33s to 21s
that's a massive improvement, but that's still extremely slow compared to the python defaultdict implementation.
I noted the following:
working in batch: I don't think I can do as the call to the function depends on the previous call.
peudospin> I see you are surprised the get is time consuming. so am I but that's what profvis states:
and here the code of the function as a reference:
QAgent$set("public", "get_qvalue", function( state, action) {
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
if (is.na(qval)) {
#self$qvalues[self$qvalues$qstate == state & self$qvalues$qaction == action,]$qvalue <- 0
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
return(0)
}
return(qval)
})
At this point, if no more suggestion, I will conclude the data.table is just too slow for this kind of task, and I should look into using an env or a collections. (as suggested there: R fast single item lookup from list vs data.table vs hash )
CONCLUSION:
I replaced the data.table for a collections::dict and the bottleneck completely disappeared.
data.table is fast for doing lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one like python dictionaries. I'd expect it would be copying the whole table each time you add a row which is clearly not what you want.
You can either try to use environments (which are something like a hashmap), or if you really want to do this in R you may need a specialist package, here's a link to an answer with a few options.
Benchmark
library(data.table)
Sys.setenv('R_MAX_VSIZE'=32000000000) # add to the ram limit
setDTthreads(threads=0) # use maximum threads possible
nbstates <- 1e3
nbactions <- 1e5
cartesian <- function(nbstates,nbactions){
x= data.table(qstate=1:nbactions)
y= data.table(qaction=1:nbstates)
k = NULL
x = x[, c(k=1, .SD)]
setkey(x, k)
y = y[, c(k=1, .SD)]
setkey(y, NULL)
x[y, allow.cartesian=TRUE][, c("k", "qvalue") := list(NULL, NA_real_)][]
}
#comparing seq with `:`
(bench = microbenchmark::microbenchmark(
1:1e9,
seq(1e9),
times=1000L
))
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> 1:1e+09 120 143 176.264 156.0 201 5097 1000
#> seq(1e+09) 3039 3165 3333.339 3242.5 3371 21648 1000
ggplot2::autoplot(bench)
(bench = microbenchmark::microbenchmark(
"Cartesian product" = cartesian(nbstates,nbactions),
"data.table assignement"=qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
qaction = rep(seq(nbactions), times = nbstates),
qvalue = NA_real_,
stringsAsFactors = FALSE),
times=100L))
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Cartesian product 3.181805 3.690535 4.093756 3.992223 4.306766 7.662306 100
#> data.table assignement 5.207858 5.554164 5.965930 5.895183 6.279175 7.670521 100
#> data.table (1:nb) 5.006773 5.609738 5.828659 5.80034 5.979303 6.727074 100
#>
#>
ggplot2::autoplot(bench)
it is clear the using seq consumes more time than calling the 1:nb. plus using a cartesian product makes the code faster even when 1:nb is used

Extract jpg name from a url using R

someone could help me, this is my problem:
I have a list of urls in a tbl and I have to extract the jpg nane.
this is the url
https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2
and this one the part to extract
13643048_612108275661958_805860992_n
thanks for helps
This requires two things:
parse the URL itself
get the filename from the path of the URL
You can do both manually but it’s much better to use existing tools. The first part is solved by the parseURI function from the ‹XML› package:
uri = 'https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2
parts = XML::parseURI(uri)
And the second part is trivially solved by the basename function:
filename = basename(parts$path)
Googling for "R parse URL" could have saved you from typing ~400 keystrokes (tho I expect the URL was pasted).
In any event, you want to process a vector of these things, so there's a better way. In fact there are multiple ways to do this URL path extraction in R. Here are 3:
library(stringi)
library(urltools)
library(httr)
library(XML)
library(dplyr)
We'll generate 100 unique URLs that fit the same Instagram pattern (NOTE: scraping instagram is a violation of their ToS & controlled by robots.txt. If your URLs did not come from the Instagram API, please let me know so I can delete this answer as I don't help content thieves).
set.seed(0)
paste(
"https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2",
stri_rand_strings(100, 8, "[0-9]"), "_",
stri_rand_strings(100, 15, "[0-9]"), "_",
stri_rand_strings(100, 9, "[0-9]"), "_",
stri_rand_strings(100, 1, "[a-z]"),
".jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2",
sep=""
) -> img_urls
head(img_urls)
## [1] "https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2"
## [2] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/66021637_359927357880233_471353444_q.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [3] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/47937926_769874508959124_426288550_z.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [4] "https://https://content_xxx.xxx.com/vp/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/12303834_440673970920272_460810703_n.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [5] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/54186717_202600346704982_713363439_y.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [6] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/48675570_402479399847865_689787883_e.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
Now, let's try to parse those URLs:
invisible(urltools::url_parse(img_urls))
invisible(httr::parse_url(img_urls))
## Error in httr::parse_url(img_urls): length(url) == 1 is not TRUE
DOH! httr can't do it.
invisible(XML::parseURI(img_urls))
## Error in if (is.na(uri)) return(structure(as.character(uri), class = "URI")): the condition has length > 1
DOH! XML can't do it either.
That means we need to use an sapply() crutch for httr and XML to get the path component (you can run basename() on any resultant vector as Konrad showed):
data_frame(
urltools = urltools::url_parse(img_urls)$path,
httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
) -> paths
glimpse(paths)
## Observations: 100
## Variables: 3
## $ urltools <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ httr <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ XML <chr> "/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_...
Note the not really standard inclusion of the initial, / in the path from XML. That's not important for you for this example, but it's important to note the difference in general.
We'll process one of them since XML and httr have that woeful limitation:
microbenchmark::microbenchmark(
urltools = urltools::url_parse(img_urls[1])$path,
httr = httr::parse_url(img_urls[1])$path,
XML = XML::parseURI(img_urls[1])$path
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 351.268 397.6040 557.09641 499.2220 618.5945 1309.454 100
## httr 550.298 619.5080 843.26520 717.0705 888.3915 4213.070 100
## XML 11.858 16.9115 27.97848 26.1450 33.9065 109.882 100
XML looks faster, but it's not in practice:
microbenchmark::microbenchmark(
urltools = urltools::url_parse(img_urls)$path,
httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 718.887 853.374 1093.404 918.3045 1146.540 2872.076 100
## httr 58513.970 64738.477 80697.548 68908.7635 81549.154 224157.857 100
## XML 1155.370 1245.415 2012.660 1359.8215 1880.372 26184.943 100
If you really want to go the regex route, you can read the RFC for the URL BNF and a naive regex for hacking bits out of one and Google for the seminal example that has over a dozen regular expressions that handle not-so-well-formed URIs, but parsing is generally a better strategy for diverse URL content. For your case, splitting and regex'ing might work just fine but it isn't necessarily going to be that much faster than parsing:
microbenchmark::microbenchmark(
urltools = tools::file_path_sans_ext(basename(urltools::url_parse(img_urls)$path)),
httr = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE))),
XML = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE))),
regex = stri_match_first_regex(img_urls, "/([[:digit:]]{8}_[[:digit:]]{15}_[[:digit:]]{9}_[[:alpha:]]{1})\\.jpg\\?")[,2]
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## urltools 1.140421 1.228988 1.502525 1.286650 1.444522 6.970044 100
## httr 56.563403 65.696242 77.492290 69.809393 80.075763 157.657508 100
## XML 1.513174 1.604012 2.039502 1.702018 1.931468 11.306436 100
## regex 1.137204 1.223683 1.337675 1.260339 1.397273 2.241121 100
As noted in that final example, you'll need to run tools::file_path_sans_ext() on the result to remove the .jpg (or sub() it away).

How is hashing done in environment in R? (for optimizing lookup performance)

When using a lookup by name in a list, it is possible to first turn the list into an environment with hashing. For example:
x <- 1:1e5
names(x) <- x
lx <- as.list(x)
elx <- list2env(lx, hash = TRUE) # takes some time
library(microbenchmark)
microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
With the following performance gain:
> microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
Unit: nanoseconds
expr min lq mean median uq max neval cld
x[[which(x == 1000)]] 547213 681609.5 1063382.25 720718.5 788538.5 5999776 100 b
x[["1000"]] 6518 6829.0 7961.83 7139.0 8070.0 22659 100 a
lx[["1000"]] 6518 6829.0 8284.63 7140.0 8070.5 33212 100 a
get("1000", envir = elx) 621 931.0 2477.22 1242.0 2794.0 20175 100 a
elx[["1000"]] 0 1.0 1288.47 311.0 1552.0 22659 100 a
When looking at the help page for list2env:
(for the case envir = NULL): logical indicating if the created
environment should use hashing, see new.env.
When looking at the help for new.env, it doesn't explain how the hash table is created, but it does say:
For the performance implications of hashing or not, see
https://en.wikipedia.org/wiki/Hash_table.
So it's obvious that hashing is done, and works well (at least for the example I gave), but seeing from the Wikipedia page, it is clear there are various ways of creating hash tables. Hence, my question is: how is the hash table created in list2env?

How to compute large object's hash value in R?

I have large objects in R, that barely fits in my 16GB memory (a data.table database of >4M records, >400 variables).
I'd like to have a hash function that will be used to confirm, that the database loaded into R is not modified.
One fast way to do that is to calculate the database's hash with the previously stored hash.
The problem is that digest::digest function copies (serializes) the data, and only after all data are serialized it will calculate the hash. Which is too late on my hardware... :-(
Does anyone know about a way around this problem?
There is a poor's man solution: save the object into the file, and calculate the hash of the file. But it introduces large, unnecessary overhead (I have to make sure there is a spare on HDD for yet another copy, and need to keep track of all the files that may not be automatically deleted)
Similar problem has been described in our issue tracker here:
https://github.com/eddelbuettel/digest/issues/33
The current version of digest can read a file to compute the hash.
Therefore, at least on Linux, we can use a named pipe which will be read by the digest package (in one thread) and from the other side data will be written by another thread.
The following code snippet shows how we can compute a MD5 hash from 10 number by feeding the digester first with 1:5 and then 6:10.
library(parallel)
library(digest)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, "w")) # creates your pipe if does not exist
producer <- mcparallel({
mystream <- file(fname, "w")
writeLines(x[1:5], mystream)
writeLines(x[6:10], mystream)
close(mystream) # sends signal to the consumer (digester)
})
digester <- mcparallel({
digest(fname, file = TRUE, algo = "md5") # just reads the stream till signalled
})
# runs both processes in parallel
mccollect(list(producer, digester))
unlink(fname) # named pipe removed
UPDATE: Henrik Bengtsson provided a modified example based on futures:
library("future")
plan(multiprocess)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, open="wb")) # creates your pipe if does not exists
producer %<-% {
mystream <- file(fname, open="wb")
writeBin(x[1:5], endian="little", con=mystream)
writeBin(x[6:10], endian="little", con=mystream)
close(mystream) # sends signal to the consumer (digester)
}
# just reads the stream till signalled
md5 <- digest::digest(fname, file = TRUE, algo = "md5")
print(md5)
## [1] "25867862802a623c16928216e2501a39"
# Note: Identical on Linux and Windows
Following up on nicola's comment, here's a benchmark of the column-wise idea. It seems it doesn't help much, at least not for these at this size. iris is 150 rows, long_iris is 3M (3,000,000).
library(microbenchmark)
#iris
nrow(iris)
microbenchmark(
whole = digest::digest(iris),
cols = digest::digest(lapply(iris, digest::digest))
)
#long iris
long_iris = do.call(bind_rows, replicate(20e3, iris, simplify = F))
nrow(long_iris)
microbenchmark(
whole = digest::digest(long_iris),
cols = digest::digest(lapply(long_iris, digest::digest))
)
Results:
#normal
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 12.6 13.6 14.4 14.0 14.6 24.9 100 b
cols 12.5 12.8 13.3 13.1 13.5 23.0 100 a
#long
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 296 306 317 311 316 470 100 b
cols 261 276 290 282 291 429 100 a

R package: hash: save into R objects on disk very slow

I am using the great hash package by Christopher Brown. In my use case, I have several thousand keys in the first level, and each related value may save another 1-3 layers of hash objects. When I try to save it using save, it seems to take a really long time.
I then tried save and load on the exact same setup, except a smaller use case with about 100 keys. The save and load work well, except it does seem to take longer than doing the same for usual R objects of similar size.
Is this a known problem, and is there any work around of the speed issue?
My machine setup: Mac OSX 10.6, RStudio 0.98.1091, hash 3.0.1.
The code used to generate the data, and the outcome (in comment) is below:
library(hash)
library(microbenchmark)
create_hash = function(ahash, level1=10, level2=5, level3=2) {
for (i in 1:level1) {
ahash[[paste0('a',i)]] = hash()
for (j in 1:level2) {
ahash[[paste0('a',i)]][[paste0('b',j)]] = hash()
for (k in 1:level3) {
ahash[[paste0('a',i)]][[paste0('b',j)]][[paste0('c',k)]] = hash()
ahash[[paste0('a',i)]][[paste0('b',j)]][[paste0('c',k)]][['key1']] = 'value1'
ahash[[paste0('a',i)]][[paste0('b',j)]][[paste0('c',k)]][['key2']] = 'value2'
}
}
}
}
base1 = hash()
create_hash(base1, 100, 10, 2)
microbenchmark(save(base1, file='base1.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base1, file = "base1.Robj") 4.962731 4.987589 5.212594 5.102403 5.316056 5.694193 5
# File size: 1.6 MB
base2 = hash()
create_hash(base2, 1000, 10, 2)
microbenchmark(save(base2, file='base2.Robj'), times=5, unit='s')
# Unit: seconds
# expr min lq mean median uq max neval
# save(base2, file = "base2.Robj") 108.6682 109.2254 110.4126 109.3526 111.1013 113.7154 5
# File size: 16.1 MB

Resources