R speed of data.table

I have a specific performance issue that I wish to extend to a more general question, if possible.
Context:
I've been playing around on Google Colab with a Python code sample for a Q-learning agent, which associates a state and an action to a value using a defaultdict:
self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
return self._qvalues[state][action]
Not an expert, but my understanding is that it returns the value, or inserts 0 and returns it if the key is not found.
I'm adapting part of this in R.
The problem is I don't know how many state/action combinations I have, and technically I should not need to know how many states there are.
At first I went the wrong way, with rbind of data.frames, and that was very slow.
I then replaced my R object with a data.frame(state, action, value = NA_real_).
It works, but it's still very slow. Another problem is that my data.frame object has its maximum size declared up front, which might be problematic in the future.
I then changed my data.frame to a data.table, which gave me worse performance, so I finally indexed it by (state, action).
qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
                      qaction = rep(seq(nbactions), times = nbstates),
                      qvalue = NA_real_,
                      stringsAsFactors = FALSE)
setkey(qvalues, "qstate", "qaction")
Problem:
Comparing Google Colab/Python against my local R implementation, Colab performs 1000x10e4 accesses to the object in, let's say, 15 s, while my code performs 100x100 accesses in 28 s. I got a 2 s improvement by byte-compiling, but that's still far too slow.
Using profvis, I see most of the time is spent accessing the data.table in these two calls:
qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
self$qvalues[J(state, action)]$qvalue <- value
I don't really know what hardware Google has, but my desktop is a beast. I also saw some benchmarks stating that data.table was faster than pandas, so I suppose the problem lies in my choice of container.
Questions:
Is my use of a data.table wrong, and can it be fixed to improve on and match the Python implementation?
Is another design possible that avoids declaring all the state/action combinations, which could be a problem if the dimensions become too large?
I've seen the hash package mentioned; is that the way to go?
Thanks a lot for any pointer!
UPDATE:
Thanks for all the input.
So what I did was replace three accesses to my data.table with your suggestions:
#self$qvalues[J(state, action)]$qvalue <- value
self$qvalues[J(state, action), qvalue := value]
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
This dropped the runtime from 33 s to 21 s.
That's a massive improvement, but it's still extremely slow compared to the Python defaultdict implementation.
I noted the following:
working in batch: I don't think I can, as each call to the function depends on the result of the previous call.
peudospin> I see you are surprised that the get is time-consuming. So am I, but that's what profvis states.
Here is the code of the function for reference:
QAgent$set("public", "get_qvalue", function(state, action) {
  #qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
  qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
  if (is.na(qval)) {
    #self$qvalues[self$qvalues$qstate == state & self$qvalues$qaction == action,]$qvalue <- 0
    #self$qvalues[J(state, action),]$qvalue <- 0
    self$qvalues[J(state, action), qvalue := 0]
    return(0)
  }
  return(qval)
})
At this point, if there are no more suggestions, I will conclude that data.table is just too slow for this kind of task, and that I should look into using an env or collections instead (as suggested here: R fast single item lookup from list vs data.table vs hash).
CONCLUSION:
I replaced the data.table with a collections::dict and the bottleneck completely disappeared.
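For reference, a minimal sketch of what that replacement can look like, assuming the collections dict API with $set() and $get(key, default =); this is not my exact code:

library(collections)

# Sketch of a Q-value store backed by collections::dict, keyed on a
# "state|action" string; get() falls back to 0 like Python's defaultdict.
qdict <- dict()

get_qvalue <- function(state, action) {
  qdict$get(paste(state, action, sep = "|"), default = 0)
}

set_qvalue <- function(state, action, value) {
  qdict$set(paste(state, action, sep = "|"), value)
}

get_qvalue(1, 2)       # 0 (key not present yet)
set_qvalue(1, 2, 0.5)
get_qvalue(1, 2)       # 0.5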

data.table is fast for doing lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one the way Python dictionaries are. I'd expect it to copy the whole table each time you add a row, which is clearly not what you want.
You can either try to use environments (which are something like a hashmap), or, if you really want to do this in R, you may need a specialist package; here's a link to an answer with a few options.
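For example, a minimal sketch (not from the linked answer) of an environment used as a hashmap, keyed on a pasted "state|action" string; get0() supplies the 0 default that Python's defaultdict gives for free:

# Sketch: environment as a hashmap; missing keys fall back to 0 via get0().
qenv <- new.env(hash = TRUE)
assign("s1|a2", 0.5, envir = qenv)
get0("s1|a2", envir = qenv, inherits = FALSE, ifnotfound = 0)  # 0.5
get0("s3|a1", envir = qenv, inherits = FALSE, ifnotfound = 0)  # 0 (missing key)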

Benchmark
library(data.table)
Sys.setenv('R_MAX_VSIZE' = 32000000000)  # raise the memory limit
setDTthreads(threads = 0)                # use all available threads
nbstates <- 1e3
nbactions <- 1e5
cartesian <- function(nbstates, nbactions) {
  x <- data.table(qstate = 1:nbstates)
  y <- data.table(qaction = 1:nbactions)
  k <- NULL
  x <- x[, c(k = 1, .SD)]
  setkey(x, k)
  y <- y[, c(k = 1, .SD)]
  setkey(y, NULL)
  x[y, allow.cartesian = TRUE][, c("k", "qvalue") := list(NULL, NA_real_)][]
}
#comparing seq with `:`
(bench = microbenchmark::microbenchmark(
  1:1e9,
  seq(1e9),
  times = 1000L
))
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> 1:1e+09 120 143 176.264 156.0 201 5097 1000
#> seq(1e+09) 3039 3165 3333.339 3242.5 3371 21648 1000
ggplot2::autoplot(bench)
(bench = microbenchmark::microbenchmark(
  "Cartesian product" = cartesian(nbstates, nbactions),
  "data.table assignement" = qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
                                                   qaction = rep(seq(nbactions), times = nbstates),
                                                   qvalue = NA_real_,
                                                   stringsAsFactors = FALSE),
  "data.table (1:nb)" = qvalues <- data.table(qstate = rep(1:nbstates, each = nbactions),
                                              qaction = rep(1:nbactions, times = nbstates),
                                              qvalue = NA_real_,
                                              stringsAsFactors = FALSE),
  times = 100L))
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Cartesian product 3.181805 3.690535 4.093756 3.992223 4.306766 7.662306 100
#> data.table assignement 5.207858 5.554164 5.965930 5.895183 6.279175 7.670521 100
#> data.table (1:nb) 5.006773 5.609738 5.828659 5.80034 5.979303 6.727074 100
#>
#>
ggplot2::autoplot(bench)
It is clear that using seq() consumes more time than calling 1:nb. Moreover, using a Cartesian product makes the code faster even when 1:nb is used.
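A small side check, not part of the benchmark above: seq_len() avoids seq()'s S3 dispatch and is generally about as cheap as 1:nb, while behaving more safely when nb happens to be 0.

library(microbenchmark)

nb <- 1e6
microbenchmark(
  colon   = 1:nb,         # no dispatch
  seq_len = seq_len(nb),  # no dispatch either, and safe when nb = 0
  seq     = seq(nb),      # S3 generic; dispatch adds per-call overhead
  times = 1000L
)

1:0         # c(1, 0), usually not what a loop or rep() expects
seq_len(0)  # integer(0)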

Related

R string operation: How could I optimize this one?

TL;DR: I want to pad each string in a vector to a given size with a given character on the left. I want it fast. See the code and example below.
I have a very large vector of strings, containing... well, anything, but with a (known) maximum number of characters. I want to pad those strings with zeros on the left to a given size (greater than the maximum number of characters).
Suppose:
c("yop",NA,"1234567","19","12AN","PLOP","5689777")
Given, for example, a target size of 10, I want:
[1] "0000000yop" NA "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
as a result, as fast as possible.
I've tried to write my own, but it's not really fast... Could you help me make it faster? I have billions of these to treat.
Here's my current code:
library(purrr)
zero_left <- function(field, nb) {
  map2_chr(
    map(abs(nb - nchar(field)), ~ rep("0", .x)),
    field,
    ~ paste0(c(.x, .y), collapse = "")
  )
}
trial <- c("yop","1234567","19","12AN","PLOP","5689777")
zero_left(trial,10)
This code does not even handle the NA case... Without NAs it works, but it's too slow.
This relies on an external package but takes 1/30 of the time your zero_left() function takes:
nb <- 10
stringr::str_pad(trial, width=nb, pad="0")
[1] "0000000yop" "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"
Edit 1:
A base-R solution that is probably not quite as fast:
gsub(pattern = " ", replacement = "0", sprintf("%*s", nb, trial), fixed = TRUE)
Edit 2:
Remembering that stringr is just a wrapper around stringi functions, you can get another speed boost by using stringi directly:
stringi::stri_pad_left(trial, width = nb, pad = "0")
If speed is your concern, base R can be faster than stringr/stringi:
library(microbenchmark)
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base=paste(strrep("0", nb - nchar(trial)), trial, sep="")
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 21.292 22.747 24.87188 23.7070 24.4735 129.470 100
# stringi 10.473 12.359 13.15298 13.0180 13.5445 21.418 100
# base 7.848 9.392 10.83702 10.2035 10.8980 43.620 100
The only consequence is that the NA is turned into a literal "NANA" here
paste(strrep("0", nb - nchar(trial)), trial, sep="")
# [1] "0000000yop" "NANA" "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"
so the workaround is
microbenchmark(
stringr=stringr::str_pad(trial, width=nb, pad="0"),
stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
base={v=paste(strrep("0", nb - nchar(trial)), trial, sep="");v[is.na(trial)]=NA;}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# stringr 20.657 22.6440 23.99204 23.3870 24.6190 60.096 100
# stringi 10.980 12.1585 13.57061 13.0790 13.7800 64.135 100
# base 10.766 11.9185 13.69714 13.0665 13.8035 87.226 100
(Which makes base R about as fast as stringi and slightly faster than stringr, in this case.)
(I'm mildly annoyed that paste converts NA to "NA", though that's already been addressed here on SO.)
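For completeness, a small sketch (not from the answers above) that wraps the base-R strrep() approach and the NA fix into a drop-in replacement for zero_left():

# Sketch: pad on the left with zeros, preserving NA and tolerating strings that
# are already at least nb characters long.
zero_left_base <- function(field, nb) {
  out <- paste0(strrep("0", pmax(nb - nchar(field), 0)), field)
  out[is.na(field)] <- NA_character_
  out
}

trial <- c("yop", NA, "1234567", "19", "12AN", "PLOP", "5689777")
zero_left_base(trial, 10)
# [1] "0000000yop" NA           "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"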

A faster way to generate a vector of UUIDs in R

The code below takes about 15 seconds to generate a vector of 10k UUIDs. I will need to generate 1M or more and I calculate that this will take 15 * 10 * 10 / 60 minutes, or about 25 minutes. Is there a faster way to achieve this?
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), UUIDgenerate )
end_time <- Sys.time()
end_time - start_time
# Time difference of 15.072 secs
Essentially, I'm searching for a method for R that manages to achieve the performance boost described here for Java: Performance of Random UUID generation with Java 7 or Java 6
They should be RFC 4122 compliant but the other requirements are flexible.
Bottom line up front: no, there is currently no way to speed up generation of a lot of UUIDs without compromising the core premise of uniqueness. (Using uuid, that is.)
In fact, your suggestion to use use.time=FALSE has significantly bad ramifications (on Windows). See below.
It is possible to get faster performance at scale, just not with uuid. See below.
uuid on Windows
Performance of uuid::UUIDgenerate should take into account the OS, and more specifically the source of randomness. It's important to look at raw performance, yes:
library(microbenchmark)
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 8.675561 9.330877 11.73299 10.14592 11.75467 66.2435 100
# rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411 100
# sf 8.570900 9.270524 11.28199 10.22779 12.06993 24.3583 100
# st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713 100
... so using use.time=FALSE is always faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is never slower. Use replicate here unless you feel you need the numeric argument for some reason.)
However, there is a problem:
R.version[1:3]
# _
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20
Given that a UUID is intended to be unique on each call, this is disturbing, and it is a symptom of insufficient randomness on Windows. (Does WSL provide a way out of this? Another research opportunity ...)
uuid on Linux
For comparison, the same results on a non-windows platform:
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 20.852227 21.48981 24.90932 22.30334 25.11449 74.20972 100
# rt 9.782106 11.03714 14.15256 12.04848 15.41695 100.83724 100
# sf 20.250873 21.39140 24.67585 22.44717 27.51227 44.43504 100
# st 9.852275 11.15936 13.34731 12.11374 15.03694 27.79595 100
R.version[1:3]
# _
# platform x86_64-pc-linux-gnu
# arch x86_64
# os linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000
(I'm slightly intrigued by the fact that use.time=FALSE on Linux takes twice as long as on Windows ...)
UUID generation with a SQL server
If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.
(Side note: there are "V4" (completely random), "V1" (time-based), and "V1mc" (time-based, including the system's MAC address) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's MAC address.)
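As a quick way to check which flavour you got (a small sketch, not from the original answer): in the canonical 8-4-4-4-12 layout, the version is the first hex digit of the third group, i.e. character 15 of the string.

library(uuid)

# The version nibble sits at position 15 of the canonical 36-character form.
uuid_version <- function(u) substr(u, 15, 15)

uuid_version(UUIDgenerate(use.time = FALSE))  # "4" -> random-based
uuid_version(UUIDgenerate(use.time = TRUE))   # "1" -> time-based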
Some performance comparisons on Windows (all times in seconds):
# n uuid postgres sqlite sqlserver
# 1 100 0 1.23 1.13 0.84
# 2 1000 0.05 1.13 1.21 1.08
# 3 10000 0.47 1.35 1.45 1.17
# 4 100000 5.39 3.10 3.50 2.68
# 5 1000000 63.48 16.61 17.47 16.31
The use of SQL has some overhead that does not take long to overcome when done at scale.
PostgreSQL needs the uuid-ossp extension, installable with
CREATE EXTENSION "uuid-ossp"
Once installed/available, you can generate n UUIDs with:
n <- 3
pgcon <- DBI::dbConnect(...)
DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
# uuid
# 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
# 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
# 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f
Other UUID functions exists. https://www.postgresql.org/docs/9.6/static/uuid-ossp.html
SQLite includes only limited ability to do this, but the following hack works well enough to produce a vector of n V4-style UUIDs:
sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
DBI::dbGetQuery(sqlitecon, sprintf("
WITH RECURSIVE cnt(x) as (
select 1 union all select x+1 from cnt limit %d
)
select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid
from cnt", n))
# uuid
# 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
# 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
# 3 EAC159D6-7986-F42C-C5F5-35764544C105
(This takes a little pain to format the same way, a nicety at best. You might find small performance improvements by not clinging to this format.)
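For example, here is a sketch of that unformatted variant; note the result is just 16 random bytes per row rendered as 32 hex characters, not an RFC 4122 compliant UUID, so only use it if that is acceptable.

library(DBI)

n <- 3
sqlitecon <- dbConnect(RSQLite::SQLite(), ":memory:")
# Skip the 8-4-4-4-12 formatting entirely: hex-encode 16 random bytes per row.
ids <- dbGetQuery(sqlitecon, sprintf("
  WITH RECURSIVE cnt(x) AS (
    SELECT 1 UNION ALL SELECT x + 1 FROM cnt LIMIT %d
  )
  SELECT hex(randomblob(16)) AS id FROM cnt", n))
ids
dbDisconnect(sqlitecon)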
SQL Server requires temporarily creating a table (with newsequentialid()), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (No code offered, it doesn't add much.)
Other considerations
In addition to execution time and sufficient randomness, there are various discussions around (uncited for now) regarding database tables that indicate performance impacts from using non-consecutive UUIDs. This has to do with index pages and such, outside the scope of this answer.
However, assuming this is true ... with the assumption that rows inserted at around the same time (temporally correlated) are often grouped together (directly or sub-grouped), then it is a good thing to keep same-day data with UUID keys in the same db index-page, so V4 (completely random) UUIDs may decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.
Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.
(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having PostgreSQL nearby. Fortunately, I use PostgreSQL enough for other things that I keep an instance around with Docker on Windows.)
Providing the option use.time will significantly speed up the process. It can be set to either TRUE or FALSE, to determine if the UUIDs are time-based or not. In both cases, it will be significantly faster than not specifying this option.
For 10k UUIDs,
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01399994 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01100016 secs
Even scaling up to 100M still gives a faster run-time than the original 15 seconds.
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 1.154 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 3.7586 secs

Is there a way to count rows in R without loading data first? [duplicate]

I have a CSV file of ~1 GB, and as my laptop has a basic configuration, I'm not able to open the file in Excel or R. But out of curiosity, I would like to get the number of rows in the file. How can I do it, if it can be done at all?
For Linux/Unix:
wc -l filename
For Windows:
find /c /v "A String that is extremely unlikely to occur" filename
Option 1:
Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.
length(count.fields(filename))
If you have a header row, you can skip it with skip = 1
length(count.fields(filename, skip = 1))
There are other arguments that you can adjust for your specific needs, like skipping blank lines.
args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE,
# comment.char = "#")
# NULL
See help(count.fields) for more.
It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.
nrow(data.table::fread("Batting.csv"))
# [1] 99846
system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
# user system elapsed
# 0.528 0.000 0.503
l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740
(The more efficient) Option 2: Another idea is to use data.table::fread() to read the first column only, then take the number of rows. This would be very fast.
system.time(nrow(fread("Batting.csv", select = 1L)))
# user system elapsed
# 0.063 0.000 0.063
Estimate the number of lines based on the size of the first 1000 lines:
size1000 <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000)))
sizetotal <- file.size("dgrp2.tgeno")
1000 * sizetotal / size1000
This is usually good enough for most purposes - and is a lot faster for huge files.
Here is something I used:
testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines
Check out this post for more:
https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/
Implementing Tony's answer in R:
file <- "/path/to/file"
cmd <- paste("wc -l <", file)
as.numeric(system(cmd, intern = TRUE))
This is about 4x faster than data.table for a file with 100k lines
> microbenchmark::microbenchmark(
+ nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)),
+ as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))
+ )
Unit: milliseconds
                                                               expr       min        lq      mean   median        uq      max neval
                 nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)) 128.06701 131.12878 150.43999 135.1366 142.99937 629.4880   100
 as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))  27.70863  28.42997  34.83877  29.5070  33.32973 270.3104   100
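Another base-R route, offered as a sketch rather than one of the answers above: count newline bytes in fixed-size binary chunks with readBin, which never parses lines at all and keeps memory use constant.

# Sketch: count '\n' (byte 0x0A) in 1 MB chunks. Assumes the file ends with a
# newline; otherwise add 1 for the final unterminated line.
count_lines <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n <- n + sum(chunk == as.raw(10L))
  }
  n
}

count_lines("Batting.csv")  # total number of lines, header included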

How is hashing done in environment in R? (for optimizing lookup performance)

When using a lookup by name in a list, it is possible to first turn the list into an environment with hashing. For example:
x <- 1:1e5
names(x) <- x
lx <- as.list(x)
elx <- list2env(lx, hash = TRUE) # takes some time
library(microbenchmark)
microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
With the following performance gain:
> microbenchmark(x[[which(x==1000)]], x[["1000"]], lx[["1000"]], get("1000", envir = elx), elx[["1000"]])
Unit: nanoseconds
expr min lq mean median uq max neval cld
x[[which(x == 1000)]] 547213 681609.5 1063382.25 720718.5 788538.5 5999776 100 b
x[["1000"]] 6518 6829.0 7961.83 7139.0 8070.0 22659 100 a
lx[["1000"]] 6518 6829.0 8284.63 7140.0 8070.5 33212 100 a
get("1000", envir = elx) 621 931.0 2477.22 1242.0 2794.0 20175 100 a
elx[["1000"]] 0 1.0 1288.47 311.0 1552.0 22659 100 a
When looking at the help page for list2env:
(for the case envir = NULL): logical indicating if the created
environment should use hashing, see new.env.
When looking at the help for new.env, it doesn't explain how the hash table is created, but it does say:
For the performance implications of hashing or not, see
https://en.wikipedia.org/wiki/Hash_table.
So it's obvious that hashing is done and works well (at least for the example I gave), but as the Wikipedia page makes clear, there are various ways of creating hash tables. Hence my question: how is the hash table created in list2env?
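Not a full answer to the internals question, but a pointer worth knowing (a small sketch, assuming base R's env.profile(), which summarizes a hashed environment's table): you can at least inspect how many buckets list2env() allocated and how the entries are distributed across them.

x <- 1:1e5
names(x) <- x
elx <- list2env(as.list(x), hash = TRUE)

prof <- env.profile(elx)
str(prof)             # list with $size, $nchains, $counts
prof$size             # number of buckets in the hash table
prof$nchains          # buckets actually in use
summary(prof$counts)  # distribution of chain lengths (collisions per bucket)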

How to compute large object's hash value in R?

I have large objects in R that barely fit in my 16 GB of memory (a data.table database of >4M records, >400 variables).
I'd like to have a hash function that will be used to confirm that the database loaded into R has not been modified.
One fast way to do that is to calculate the database's hash and compare it with a previously stored hash.
The problem is that the digest::digest function copies (serializes) the data, and only after all the data are serialized will it calculate the hash, which is already too much for my hardware... :-(
Does anyone know of a way around this problem?
There is a poor man's solution: save the object to a file and calculate the hash of the file. But this introduces a large, unnecessary overhead (I have to make sure there is spare space on the HDD for yet another copy, and I need to keep track of all the files, which may not be automatically deleted).
A similar problem has been described in our issue tracker here:
https://github.com/eddelbuettel/digest/issues/33
The current version of digest can read a file to compute the hash.
Therefore, at least on Linux, we can use a named pipe which will be read by the digest package (in one thread) while the data are written into it by another thread on the other side.
The following code snippet shows how we can compute an MD5 hash from 10 numbers by feeding the digester first with 1:5 and then with 6:10.
library(parallel)
library(digest)

x <- as.character(1:10)   # input
fname <- "mystream.fifo"  # choose a name for your named pipe
close(fifo(fname, "w"))   # creates your pipe if it does not exist

producer <- mcparallel({
  mystream <- file(fname, "w")
  writeLines(x[1:5], mystream)
  writeLines(x[6:10], mystream)
  close(mystream)  # sends signal to the consumer (digester)
})

digester <- mcparallel({
  digest(fname, file = TRUE, algo = "md5")  # just reads the stream till signalled
})

# runs both processes in parallel
mccollect(list(producer, digester))

unlink(fname)  # named pipe removed
UPDATE: Henrik Bengtsson provided a modified example based on futures:
library("future")
plan(multiprocess)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, open="wb")) # creates your pipe if does not exists
producer %<-% {
mystream <- file(fname, open="wb")
writeBin(x[1:5], endian="little", con=mystream)
writeBin(x[6:10], endian="little", con=mystream)
close(mystream) # sends signal to the consumer (digester)
}
# just reads the stream till signalled
md5 <- digest::digest(fname, file = TRUE, algo = "md5")
print(md5)
## [1] "25867862802a623c16928216e2501a39"
# Note: Identical on Linux and Windows
Following up on nicola's comment, here's a benchmark of the column-wise idea. It seems it doesn't help much, at least not at this size. iris is 150 rows, long_iris is 3M (3,000,000) rows.
library(microbenchmark)
library(dplyr)  # for bind_rows()

# iris
nrow(iris)
microbenchmark(
  whole = digest::digest(iris),
  cols  = digest::digest(lapply(iris, digest::digest))
)

# long iris
long_iris <- do.call(bind_rows, replicate(20e3, iris, simplify = FALSE))
nrow(long_iris)
microbenchmark(
  whole = digest::digest(long_iris),
  cols  = digest::digest(lapply(long_iris, digest::digest))
)
Results:
#normal
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 12.6 13.6 14.4 14.0 14.6 24.9 100 b
cols 12.5 12.8 13.3 13.1 13.5 23.0 100 a
#long
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 296 306 317 311 316 470 100 b
cols 261 276 290 282 291 429 100 a
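If the column-wise route is still attractive, for example to localize which column changed rather than to save time, here is a hedged sketch (the helper names are made up) of storing and re-checking per-column digests:

# Sketch: store one digest per column, then verify later; a mismatch also tells
# you which columns changed. Assumes column order and names stay the same.
column_digests <- function(df) vapply(df, digest::digest, character(1))

verify_digests <- function(df, stored) {
  current <- column_digests(df)
  same <- identical(current, stored)
  if (!same) {
    changed <- names(stored)[current != stored]
    message("Changed columns: ", paste(changed, collapse = ", "))
  }
  same
}

stored <- column_digests(iris)
verify_digests(iris, stored)   # TRUE

iris2 <- iris
iris2$Sepal.Length[1] <- 999
verify_digests(iris2, stored)  # FALSE, naming Sepal.Length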
