Unexpected results in benchmark of read.csv / fread [duplicate]

Unexpected results in benchmark of read.csv / fread [duplicate] - r

I can run a piece of code for 5 or 10 seconds using the following code:
period <- 10 ## minimum time (in seconds) that the loop should run for
tm <- Sys.time() ## starting data & time
while(Sys.time() - tm < period) print(Sys.time())
The code runs just fine for 5 or 10 seconds. But when I replace the period value by 60 for it to run for a minute, the code never stops. What is wrong?

As soon as elapsed time exceeds 1 minute, the default unit changes from seconds to minutes. So you want to control the unit:
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period)
From ?difftime
If ‘units = "auto"’, a suitable set of units is chosen, the
largest possible (excluding ‘"weeks"’) in which all the absolute
differences are greater than one.
Subtraction of date-time objects gives an object of this class, by
calling ‘difftime’ with ‘units = "auto"’.
Alternatively use proc.time, which measures various times ("user", "system", "elapsed") since you started your R session in seconds. We want "elapsed" time, i.e., the wall clock time, so we retrieve the 3rd value of proc.time().
period <- 10
tm <- proc.time()[[3]]
while (proc.time()[[3]] - tm < period) print(proc.time())
If you are confused by the use of [[1]] and [[3]], please consult:
How do I extract just the number from a named number (without the name)?
How to get a matrix element without the column name in R?
Let me add some user-friendly reproducible examples. Your original code with print inside a loop is quite annoying as it prints thousands of lines onto the screen. I would use Sys.sleep.
test.Sys.time <- function(sleep_time_in_secs) {
t1 <- Sys.time()
Sys.sleep(sleep_time_in_secs)
t2 <- Sys.time()
## units = "auto"
print(t2 - t1)
## units = "secs"
print(difftime(t2, t1, units = "secs"))
## use '[[1]]' for clean output
print(difftime(t2, t1, units = "secs")[[1]])
}
test.Sys.time(5)
#Time difference of 5.005247 secs
#Time difference of 5.005247 secs
#[1] 5.005247
test.Sys.time(65)
#Time difference of 1.084357 mins
#Time difference of 65.06141 secs
#[1] 65.06141
The "auto" units is very clever. If sleep_time_in_secs = 3605 (more than an hour), the default unit will change to "hours".
Be careful with time units when using Sys.time, or you may be fooled in a benchmarking. Here is a perfect example: Unexpected results in benchmark of read.csv / fread. I had answered it with a now removed comment:
You got a problem with time units. I see that fread is more than 20 times faster. If fread takes 4 seconds to read a file, read.csv takes 80 seconds = 1.33 minutes. Ignoring the units, read.csv is "faster".
Now let's test proc.time.
test.proc.time <- function(sleep_time_in_secs) {
t1 <- proc.time()
Sys.sleep(sleep_time_in_secs)
t2 <- proc.time()
## print user, system, elapsed time
print(t2 - t1)
## use '[[3]]' for clean output of elapsed time
print((t2 - t1)[[3]])
}
test.proc.time(5)
# user system elapsed
# 0.000 0.000 5.005
#[1] 5.005
test.proc.time(65)
# user system elapsed
# 0.000 0.000 65.057
#[1] 65.057
"user" time and "system" time are 0, because both CPU and the system kernel are idle.

Related

subtracting posixCT objects and not getting seconds

I'm getting weird units after I subtract POSIXct objects, which are returned from two calls to Sys.time(). I'm using Sys.time() to time some call to system()--something like this:
start <- Sys.time()
system("./something_complicated_that_takes_a_while")
end <- Sys.time()
cat(end - start, "seconds\n")
I get 1.81494815872775 seconds, which is very strange. The runtime was closer to 1.8 hours, though. Just to check, I can do this:
start <- Sys.time()
system("/bin/sleep 2")
end <- Sys.time()
cat(end - start, "seconds\n")
and I get 2.002262 seconds, so it's working fine here. Any idea what's going on here?

Your first code is ok , it is 1.8 hours not seconds , here is explanation
a <- Sys.time()
b <- Sys.time() + 2 * 60 *60 # i add 2 hours here
b - a
#> Time difference of 2.000352 hours
above the deference b - a gives the answer in hours not seconds , so if you want to use cat try
cat(b-a , attr(b - a , "units"))
#> 2.000352 hours
and if you want your output in seconds try this
difftime(b , a , units = "secs")
#> Time difference of 7201.266 secs

R paste time difference with unit (sec, min etc)

In R I want to get the timing in a character string keeping the unit (e.g., if it is sec or min). Please see example code below.
T1 <- Sys.time()
T2 <- Sys.time()
duration <- T2-T1
# Looking at duration show unit:
duration
time_description <- paste("it took: ", round(duration, 2), sep="", col="")
# However int time description the unit is removed
time_description
Preferably without using additional packages.
Thanks in advance.

You can use units to extract the unit from difftime object.
time_description <- sprintf('it took %.2f %s', duration, units(duration))
time_description
#[1] "it took 0.39 secs"

In R, image processing loop takes an order of magnitude longer to process after ~50 iterations

EDIT: changed the image names from Image1-Image11 to Image50-Image60 for clarity.
EDIT2: Solved by adding a garbage collection command after removing the image file in each loop iteration. Code is updated.
I have 400+ jpeg images in a folder. I'm trying write a script to: read each image, identify some text in the image, and then write the file name and that text into a data frame.
When I run the script below, the first ~50 iterations print a time of .1-.3 seconds. Then, for a few iterations, the iteration will take 1-3 seconds. Then, this bumps up to 1-5 minutes, after which I kill the script.
library(dplyr)
library(magick)
fileList3 = list.files(path = filePath)
printJobXRes = data.frame(
jobName = as.character(),
xRes = as.numeric(),
stringsAsFactors = FALSE
)
i = 0
for (fileName in fileList3){
img = paste0(filePath, '/', fileName, '_TestImage.jpg')
start_time = Sys.time()
temp.xRes = image_read(img, strip = T) %>%
image_rotate(270) %>%
image_crop('90x150+1750') %>%
image_negate %>%
image_convert(type = 'Bilevel') %>%
image_ocr %>%
as.numeric
stop_time = Sys.time()
i = i+1
print(paste(fileName,'first attempt, item #', i))
print(stop_time-start_time)
temp.df3 = data.frame(
jobName = fileName,
xRes = temp.xRes,
stringsAsFactors = FALSE
)
printJobXRes = rbind(printJobXRes, temp.df3)
rm(temp.xRes)
rm(temp.df3)
rm(img)
gc() #This solved the issue
}
Here's a couple lines of the output:
#Images 1-49 process in .1-.3 seconds each
[1] "Image50.jpg first attempt, item # 50"
Time difference of 0.2320111 secs
[1] "Image51.jpg first attempt, item # 51"
Time difference of 0.213742 secs
[1] "Image52.jpg first attempt, item # 52"
Time difference of 0.2536581 secs
[1] "Image53.jpg first attempt, item # 53"
Time difference of 1.253844 secs
[1] "Image54.jpg first attempt, item # 54"
Time difference of 1.149764 secs
[1] "Image55.jpg first attempt, item # 55"
Time difference of 1.171134 secs
[1] "Image56.jpg first attempt, item # 56"
Time difference of 1.397093 secs
[1] "Image57.jpg first attempt, item # 57"
Time difference of 1.201915 secs
[1] "Image58.jpg first attempt, item # 58"
Time difference of 1.455768 secs
[1] "Image59.jpg first attempt, item # 59"
Time difference of 1.618744 secs
[1] "Image60.jpg first attempt, item # 60"
Time difference of 4.527751 mins
Can anyone offer suggestions as to why the loop doesn't continue to take ~.1-.3 seconds? All jpgs are roughly the same size, resolution, and all generated from the same source.

I was able to solve my issue based on Mark's suggestion. I was removing the image file from memory in each loop iteration, but the freed up memory was never realized by R. I added a garbage collection command (gc()) into the loop to fix this issue, and the loop then ran as expected.

Timing R code with Sys.time()

I can run a piece of code for 5 or 10 seconds using the following code:
period <- 10 ## minimum time (in seconds) that the loop should run for
tm <- Sys.time() ## starting data & time
while(Sys.time() - tm < period) print(Sys.time())
The code runs just fine for 5 or 10 seconds. But when I replace the period value by 60 for it to run for a minute, the code never stops. What is wrong?

As soon as elapsed time exceeds 1 minute, the default unit changes from seconds to minutes. So you want to control the unit:
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period)
From ?difftime
If ‘units = "auto"’, a suitable set of units is chosen, the
largest possible (excluding ‘"weeks"’) in which all the absolute
differences are greater than one.
Subtraction of date-time objects gives an object of this class, by
calling ‘difftime’ with ‘units = "auto"’.
Alternatively use proc.time, which measures various times ("user", "system", "elapsed") since you started your R session in seconds. We want "elapsed" time, i.e., the wall clock time, so we retrieve the 3rd value of proc.time().
period <- 10
tm <- proc.time()[[3]]
while (proc.time()[[3]] - tm < period) print(proc.time())
If you are confused by the use of [[1]] and [[3]], please consult:
How do I extract just the number from a named number (without the name)?
How to get a matrix element without the column name in R?
Let me add some user-friendly reproducible examples. Your original code with print inside a loop is quite annoying as it prints thousands of lines onto the screen. I would use Sys.sleep.
test.Sys.time <- function(sleep_time_in_secs) {
t1 <- Sys.time()
Sys.sleep(sleep_time_in_secs)
t2 <- Sys.time()
## units = "auto"
print(t2 - t1)
## units = "secs"
print(difftime(t2, t1, units = "secs"))
## use '[[1]]' for clean output
print(difftime(t2, t1, units = "secs")[[1]])
}
test.Sys.time(5)
#Time difference of 5.005247 secs
#Time difference of 5.005247 secs
#[1] 5.005247
test.Sys.time(65)
#Time difference of 1.084357 mins
#Time difference of 65.06141 secs
#[1] 65.06141
The "auto" units is very clever. If sleep_time_in_secs = 3605 (more than an hour), the default unit will change to "hours".
Be careful with time units when using Sys.time, or you may be fooled in a benchmarking. Here is a perfect example: Unexpected results in benchmark of read.csv / fread. I had answered it with a now removed comment:
You got a problem with time units. I see that fread is more than 20 times faster. If fread takes 4 seconds to read a file, read.csv takes 80 seconds = 1.33 minutes. Ignoring the units, read.csv is "faster".
Now let's test proc.time.
test.proc.time <- function(sleep_time_in_secs) {
t1 <- proc.time()
Sys.sleep(sleep_time_in_secs)
t2 <- proc.time()
## print user, system, elapsed time
print(t2 - t1)
## use '[[3]]' for clean output of elapsed time
print((t2 - t1)[[3]])
}
test.proc.time(5)
# user system elapsed
# 0.000 0.000 5.005
#[1] 5.005
test.proc.time(65)
# user system elapsed
# 0.000 0.000 65.057
#[1] 65.057
"user" time and "system" time are 0, because both CPU and the system kernel are idle.

A faster way to generate a vector of UUIDs in R

The code below takes about 15 seconds to generate a vector of 10k UUIDs. I will need to generate 1M or more and I calculate that this will take 15 * 10 * 10 / 60 minutes, or about 25 minutes. Is there a faster way to achieve this?
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), UUIDgenerate )
end_time <- Sys.time()
end_time - start_time
# Time difference of 15.072 secs
Essentially, I'm searching for a method for R that manages to achieve the performance boost described here for Java: Performance of Random UUID generation with Java 7 or Java 6
They should be RFC 4122 compliant but the other requirements are flexible.

Bottom line up front: no, there is currently no way to speed up generation of a lot of UUIDs with uuid without compromising the core premise of uniqueness. (Using uuid, that is.)
In fact, your suggestion to use use.time=FALSE has significantly bad ramifications (on windows). See below.
It is possible to get faster performance at scale, just not with uuid. See below.
uuid on Windows
Performance of uuid::UUIDgenerate should take into account the OS. More specifically, the source of randomness. It's important to look at performance, yes, where:
library(microbenchmark)
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 8.675561 9.330877 11.73299 10.14592 11.75467 66.2435 100
# rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411 100
# sf 8.570900 9.270524 11.28199 10.22779 12.06993 24.3583 100
# st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713 100
... so using use.time=FALSE is always faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is never slower. Use replicate here unless you feel you need the numeric argument for some reason.)
However, there is a problem:
R.version[1:3]
# _
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20
Given that a UUID is intended to be unique each time called, this is disturbing, and is a symptom of insufficient randomness on windows. (Does WSL provide a way out for this? Another research opportunity ...)
uuid on Linux
For comparison, the same results on a non-windows platform:
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 20.852227 21.48981 24.90932 22.30334 25.11449 74.20972 100
# rt 9.782106 11.03714 14.15256 12.04848 15.41695 100.83724 100
# sf 20.250873 21.39140 24.67585 22.44717 27.51227 44.43504 100
# st 9.852275 11.15936 13.34731 12.11374 15.03694 27.79595 100
R.version[1:3]
# _
# platform x86_64-pc-linux-gnu
# arch x86_64
# os linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000
(I'm slightly intrigued by the fact that use.time=FALSE on linux takes twice as long as on windows ...)
UUID generation with a SQL server
If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.
(Side note: there are "V4" (completely random), "V1" (time-based), and "V1mc" (time-based and includes the system's mac address) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's mac address.)
Some performance comparisons on windows (all times in seconds):
# n uuid postgres sqlite sqlserver
# 1 100 0 1.23 1.13 0.84
# 2 1000 0.05 1.13 1.21 1.08
# 3 10000 0.47 1.35 1.45 1.17
# 4 100000 5.39 3.10 3.50 2.68
# 5 1000000 63.48 16.61 17.47 16.31
The use of SQL has some overhead that does not take long to overcome when done at scale.
PostgreSQL needs the uuid-ossp extension, installable with
CREATE EXTENSION "uuid-ossp"
Once installed/available, you can generate n UUIDs with:
n <- 3
pgcon <- DBI::dbConnect(...)
DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
# uuid
# 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
# 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
# 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f
Other UUID functions exists. https://www.postgresql.org/docs/9.6/static/uuid-ossp.html
SQLite includes limited ability to do it, but this hack works well enough for a V4-style UUID (length n):
sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
DBI::dbGetQuery(sqlitecon, sprintf("
WITH RECURSIVE cnt(x) as (
select 1 union all select x+1 from cnt limit %d
)
select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid
from cnt", n))
# uuid
# 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
# 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
# 3 EAC159D6-7986-F42C-C5F5-35764544C105
This takes a little pain to format it the same, a nicety at best. You might find small performance improvements by not clinging to this format.)
SQL Server requires temporarily creating a table (with newsequentialid()), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (No code offered, it doesn't add much.)
Other considerations
In addition to execution time and sufficient-randomness, there are various discussions around (uncited for now) with regards to database tables that indicate performance impacts by using non-consecutive UUIDs. This has to do with index pages and such, outside the scope of this answer.
However, assuming this is true ... with the assumption that rows inserted at around the same time (temporally correlated) are often grouped together (directly or sub-grouped), then it is a good thing to keep same-day data with UUID keys in the same db index-page, so V4 (completely random) UUIDs may decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.
Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.
(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having postgresql nearby. Fortunately, I use postgresql enough for other things that I keep an instance around with docker on windows.)

Providing the option use.time will significantly speed up the process. It can be set to either TRUE or FALSE, to determine if the UUIDs are time-based or not. In both cases, it will be significantly faster than not specifying this option.
For 10k UUIDs,
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01399994 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01100016 secs
Even scaling up to 100M, still gives a faster run-time than the original 15 seconds.
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 1.154 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 3.7586 secs

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Unexpected results in benchmark of read.csv / fread [duplicate] - r

Related

subtracting posixCT objects and not getting seconds

R paste time difference with unit (sec, min etc)

In R, image processing loop takes an order of magnitude longer to process after ~50 iterations

Timing R code with Sys.time()

A faster way to generate a vector of UUIDs in R

Categories

Resources