Optimizing file write speed in R - r

I am wondering whether it is possible to speed up writing to a file.
With my SSD and Core i5 vPro I am getting the following results for a file of 5234 KB:
system.time({
write(reportData, "aaa.txt")
})
user system elapsed
1.42 3.56 12.28
as well as
system.time({
fileConn <- file("aaa.txt")
writeLines(reportData, fileConn)
close(fileConn)
})
user system elapsed
1.43 3.46 13.61
and
system.time({
fileConn <- file("aaa.txt", "w")
cat(reportData, file = fileConn, sep = "")
close(fileConn)
})
user system elapsed
1.50 4.13 14.12
All of them seem to be implemented in a similar manner, since the execution times are almost identical.
Is it possible to use the Rcpp library, since C++ could surely do this much faster?
EDIT
Without using Rcpp, writeChar seems to be the fastest.
system.time({
fileConn <- file("aaa.txt")
writeChar(reportData, fileConn, nchar(reportData, type = "chars"))
close(fileConn)
})
user system elapsed
0.01 0.14 1.31
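For the record, a minimal Rcpp-based sketch would look something like the following. It is untested here, it assumes reportData is a single character string (as the writeChar call above implies), and write_cpp is just an illustrative name:
library(Rcpp)
cppFunction('
void write_cpp(std::string text, std::string path) {
  // open the file in binary mode and write the whole string in one call
  std::ofstream out(path.c_str(), std::ios::binary);
  out << text;
}
', includes = "#include <fstream>")
system.time({
  write_cpp(reportData, "aaa_cpp.txt")
})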

Related

Need to inspect memoised package function for memoisation to work

Let's say I have a package with the following function:
foo <- function() {
  Sys.sleep(1) # really expensive operation
  return(1)
}
The value of the function is always the same per run, so I would like to use memoisation.
I thought I could simply do
foo <- memoise::memoise(function() {
  Sys.sleep(1) # really expensive operation
  return(1)
})
However, this doesn't work.
To be clear, when the function is defined in the global environment, memoisation does work:
foo <- memoise::memoise(function() {
  Sys.sleep(1)
  return(1)
})
system.time(foo())
#> user system elapsed
#> 0 0 1
system.time(foo())
#> user system elapsed
#> 0.01 0.00 0.01
Created on 2019-12-23 by the reprex package (v0.3.0)
Inside a package, though, I get really weird behavior: memoisation doesn't kick in and each call costs the same. But if I print the function definition, it starts working!
system.time(bar::foo())
#> user system elapsed
#> 0.47 0.08 2.55
system.time(bar::foo())
#> user system elapsed
#> 0 0 2
system.time(bar::foo())
#> user system elapsed
#> 0.02 0.00 2.02
system.time(bar::foo())
#> user system elapsed
#> 0.01 0.00 2.02
bar::foo
#> Memoised Function:
#> function() {
#> Sys.sleep(2)
#> return (1)
#> }
#> <environment: namespace:bar>
system.time(bar::foo())
#> user system elapsed
#> 0 0 2
system.time(bar::foo())
#> user system elapsed
#> 0 0 0
system.time(bar::foo())
#> user system elapsed
#> 0 0 0
system.time(bar::foo())
#> user system elapsed
#> 0 0 0
For the record, here are the relevant parts of the NAMESPACE and DESCRIPTION files:
# NAMESPACE
export(foo)
importFrom(memoise,memoise)
# DESCRIPTION [...]
Imports:
memoise
What's going on here, and what should I do to make memoisation work from the start in my package?
This looks like a bug in the memoise package. When you are working on your own package, R may add debug information (called srcrefs) to functions. Something about those srcrefs causes the hash to come out differently every time you call the function, so memoise never recognizes that you are calling with the same arguments.
A simple workaround is to remove the install option "--with-keep.source" when you install your own package. (If you're using RStudio, this option is added automatically under Project Options | Build Tools | Install and Restart.) This stops R from adding the srcrefs, so the bug in memoise isn't triggered. Unfortunately, it also cripples the debugger in RStudio and other front-ends, so it's not ideal.
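For example (a sketch; the exact flag depends on how you install the package, and the devtools call is just one option):
# from the command line
R CMD INSTALL --without-keep.source bar

# or from R, if you happen to install with devtools
devtools::install(args = "--without-keep.source")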
Another workaround that doesn't mess with the debugger (except for that one function) is to use removeSource on the target that is being memoised. For example,
foo <- memoise::memoise(removeSource(function() {
  Sys.sleep(1) # really expensive operation
  return(1)
}))
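With the srcref stripped, repeated calls should hit the cache straight away (indicative timings, matching the pattern shown earlier in the question):
system.time(foo())  # pays the ~1 s cost once
system.time(foo())  # returns almost immediately from the cache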

Copying files over the network is MUCH slower with `file.copy` than `system(mv ...)`

I have been having some issues with R becoming very sluggish when accessing files over our corporate network. So I dropped back and did some testing, and I was shocked to discover that R's file.copy() is much slower than the equivalent file copy using system(mv ...). Is this a known issue, or am I doing something wrong here?
Here's my test:
I have three files:
large_random.txt - ~100 MB
medium_random.txt - ~10 MB
small_random.txt - ~1 MB
I created these on my Mac like so:
dd if=/dev/urandom of=small_random.txt bs=1048576 count=1
dd if=/dev/urandom of=medium_random.txt bs=1048576 count=10
dd if=/dev/urandom of=large_random.txt bs=1048576 count=100
But the following R tests were all done using Windows running in a virtual machine. The J: drive is local and the N: drive is 700 miles (1100 km) away.
library(tictoc)
test_copy <- function(source, des){
  tic('r file.copy')
  file.remove(des)
  file.copy(source, des)
  toc()

  tic('system call')
  system(paste('rm', des, sep = ' '))
  system(paste('cp', source, des, sep = ' '))
  toc()
}
source <- 'J:\\tidy_examples\\dummyfiles\\small_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\small_random.txt'
test_copy(source, des)
source <- 'J:\\tidy_examples\\dummyfiles\\medium_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\medium_random.txt'
test_copy(source, des)
source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
test_copy(source, des)
Which results in the following:
> source <- 'J:\\tidy_examples\\dummyfiles\\small_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\small_random.txt'
> test_copy(source, des)
r file.copy: 6.49 sec elapsed
system call: 2.12 sec elapsed
>
> source <- 'J:\\tidy_examples\\dummyfiles\\medium_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\medium_random.txt'
> test_copy(source, des)
r file.copy: 56.86 sec elapsed
system call: 4.65 sec elapsed
>
> source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
> test_copy(source, des)
r file.copy: 562.94 sec elapsed
system call: 31.01 sec elapsed
>
So what's going on that makes the system call so much faster? At the large file size it's more than 18 times slower!
I ran into the same problem with poor file.copy performance over network share drives. My solution was to use fs::file_copy() instead, which performed even slightly better than the direct system call to copy.
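As a sketch of how it drops into the test above (using the same source and des paths and the same tictoc timing as in the question):
library(fs)
library(tictoc)

tic('fs::file_copy')
fs::file_copy(source, des, overwrite = TRUE)
toc()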

Spark R 2.0 dapply very slow

I just started testing SparkR 2.0 and I find the execution of dapply very slow.
For example, the following code
set.seed(2)
random_DF<-data.frame(matrix(rnorm(1000000),100000,10))
system.time(dummy_res<-random_DF[random_DF[,1]>1,])
user system elapsed
0.005 0.000 0.006
is executed in 6 ms.
Now, if I create a Spark DataFrame with 4 partitions and run it on 4 cores, I get:
sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF),4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark<-collect(subset_DF_Spark))
user system elapsed
2.003 0.119 62.919
That is about a minute, which is abnormally slow. Am I missing something?
I also get a warning (TaskSetManager: Stage 64 contains a task of very large size (16411 KB). The maximum recommended task size is 100 KB.). Why is this 100 KB limit so low?
I am using R 3.3.0 on Mac OS 10.10.5
Any insight welcome!

Fast test if directory is empty

What is the fastest way to test if a directory is empty?
Of course I can check the length of
list.files(path, all.files = TRUE, include.dirs = TRUE, no.. = TRUE)
but this requires enumerating the entire contents of the directory, which I'd rather avoid.
EDIT: I'm looking for portable solutions.
EDIT^2: Some timings for a huge directory (run this in a directory that's initially empty; it will create 100,000 empty files):
system.time(file.create(as.character(0:99999)))
# user system elapsed
# 0.720 12.223 14.948
system.time(length(dir()))
# user system elapsed
# 2.419 0.600 3.167
system.time(system("ls | head -n 1"))
# 0
# user system elapsed
# 0.788 0.495 1.312
system.time(system("ls -f | head -n 3"))
# .
# ..
# 99064
# user system elapsed
# 0.002 0.015 0.019
The -f switch is crucial for ls; it avoids the sorting that would otherwise take place.
How about if (length(dir(all.files = TRUE)) == 0)?
I'm not sure what you qualify as "fast," but if dir takes a long time, someone is abusing your filesystem :-(.
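If portability can be relaxed, the OP's own ls -f timing suggests a wrapper along these lines (a sketch only; it assumes a Unix-like shell with ls and head available, and dir_is_empty is a made-up helper name):
dir_is_empty <- function(path) {
  # ls -f avoids sorting; head stops after the first few entries
  out <- system(sprintf("ls -f %s | head -n 3", shQuote(path)), intern = TRUE)
  length(setdiff(out, c(".", ".."))) == 0
}
dir_is_empty(".")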

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code, and it appears to slow down as the iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
library(fields)        # image.plot
library(RColorBrewer)  # brewer.pal

StreamFlux <- function(data, NoR, NTS){
  ### Read in data to display points ###
  WLX = c(8, 19, 29, 20, 13, 20, 21)
  WLY = c(25, 28, 25, 21, 17, 14, 12)
  WLY = 34 - WLY
  WLX = WLX / 44
  WLY = WLY / 33
  timedata = NULL
  mf <- function(i){
    b = (NoR + 8) * (i - 1) + 8
    ### I read in data one section at a time to avoid headers
    mydata = read.table(data, skip = b, nrows = NoR, header = FALSE)
    rows = 34 - mydata[, 2]
    cols = 45 - mydata[, 3]
    flows = mydata[, 7]
    rows = as.numeric(rows)
    cols = as.numeric(cols)
    rm(mydata)
    ### Create flux matrix ###
    flow_mat <- matrix(0, 44, 33)
    ### Populate matrix ###
    flow_mat[(rows - 1) * 44 + (45 - cols)] <- flows + flow_mat[(rows - 1) * 44 + (45 - cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows)
    rm(rows)
    rm(cols)
    timestep = i
    ### Specify jpeg info ###
    jpeg(paste("Steamflow", timestep, ".jpg", sep = ''),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1, 1),
               col = brewer.pal(11, "RdBu"), yaxt = "n",
               xaxt = "n", main = paste("Stress Period ",
                                        timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }
  ST <- function(x){
    functiontime = system.time(mf(x))
    print(functiontime)
  }
  lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) so it stops printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up running this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about dlply from the plyr package; it seems to have worked here, but would it help in my case? What about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down because it repeatedly reads in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
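A rough sketch of the read-once approach, based on the block layout implied by the question's b = (NoR + 8) * (i - 1) + 8 formula (NoR = 687 and NTS = 598 in the calls above); the helper get_block is made up for illustration:
NoR <- 687
NTS <- 598

### Read the whole file into memory once ###
all_lines <- readLines("stream_out.txt")

### Block i starts after the same b lines that skip = b jumped over ###
get_block <- function(i) {
  b <- (NoR + 8) * (i - 1) + 8
  read.table(text = all_lines[(b + 1):(b + NoR)], header = FALSE)
}

mydata <- get_block(1)  # same rows as read.table(data, skip = b, nrows = NoR)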
