...besides the fact that Rscript is invoked with #!/usr/bin/env Rscript and littler with #!/usr/local/bin/r (on my system) in the first line of the script file. I've also found differences in execution speed (it seems littler is a bit slower).
I've created two dummy scripts, ran each 1000 times and compared average execution time.
Here's the Rscript file:
#!/usr/bin/env Rscript
btime <- proc.time()
x <- rnorm(100)
print(x)
print(plot(x))
etime <- proc.time()
tm <- etime - btime
sink(file = "rscript.r.out", append = TRUE)
cat(paste(tm[1:3], collapse = ";"), "\n")
sink()
print(tm)
and here's the littler file:
#!/usr/local/bin/r
btime <- proc.time()
x <- rnorm(100)
print(x)
print(plot(x))
etime <- proc.time()
tm <- etime - btime
sink(file = "little.r.out", append = TRUE)
cat(paste(tm[1:3], collapse = ";"), "\n")
sink()
print(tm)
As you can see, they are almost identical (only the first line and the sink file argument differ). The output is sunk to a text file and then imported into R with read.table. I created a bash script to execute each script 1000 times, then calculated the averages.
Here's the bash script:
for i in `seq 1000`
do
./$1
echo "####################"
echo "Iteration #$i"
echo "####################"
done
And the results are:
# littler script
> mean(lit)
user system elapsed
0.489327 0.035458 0.588647
> sapply(lit, median)
L1 L2 L3
0.490 0.036 0.609
# Rscript
> mean(rsc)
user system elapsed
0.219334 0.008042 0.274017
> sapply(rsc, median)
R1 R2 R3
0.220 0.007 0.258
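For reference, the summaries above could have been produced roughly like this (a sketch; the separator and column names are assumptions based on the cat() call in the scripts):
lit <- read.table("little.r.out", sep = ";", col.names = c("user", "system", "elapsed"))
rsc <- read.table("rscript.r.out", sep = ";", col.names = c("user", "system", "elapsed"))
colMeans(lit)
sapply(lit, median)
colMeans(rsc)
sapply(rsc, median)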
Long story short: besides the (obvious) execution-time difference, is there any other difference? And the more important question: why should or shouldn't you prefer littler over Rscript (or vice versa)?
A couple of quick comments:
The path /usr/local/bin/r is arbitrary; you can use /usr/bin/env r as well, as we do in some examples. As I recall, that limits what other arguments you can give to r, because env accepts only one argument on the shebang line.
I don't understand your benchmark, or why you'd do it that way. We have timing comparisons in the sources; see tests/timing.sh and tests/timing2.sh. You may want to split the test between startup and graph creation, or whatever it is you are after.
Whenever we ran those tests, littler won. (It still won when I re-ran those just now.) That made sense to us: if you look at the sources of Rscript.exe, it works differently, setting up the environment and a command string before eventually calling execv(cmd, av). littler can start a little quicker.
The main price is portability. The way littler is built, it won't make it to Windows. Or at least not easily. OTOH we have RInside ported so if someone really wanted to...
Littler came first in September 2006 versus Rscript which came with R 2.5.0 in April 2007.
Rscript is now everywhere where R is. That is a big advantage.
Command-line options are a little more sensible for littler in my view.
Both work with CRAN packages getopt and optparse for option parsing.
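To illustrate that last point, a minimal sketch of option parsing that works the same way under either launcher (the --n option is made up for the example):
#!/usr/bin/env Rscript
library(optparse)
parser <- OptionParser(option_list = list(
    make_option("--n", type = "integer", default = 100, help = "sample size")
))
opt <- parse_args(parser)
print(summary(rnorm(opt$n)))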
So it's a personal preference. I co-wrote littler and learned a lot doing it (e.g. for RInside), and I still find it useful -- so I use it dozens of times each day. It drives CRANberries. It drives cran2deb. Your mileage may, as they say, vary.
Disclaimer: littler is one of my projects.
Postscriptum: I would have written the test as
fun <- function() { x <- rnorm(100); print(x); print(plot(x)) }
replicate(N, system.time( fun() )["elapsed"])
or even
mean( replicate(N, system.time( fun() )["elapsed"]), trim = 0.05)
to get rid of the outliers. Moreover, you essentially measure only I/O (a print, and a plot), both of which come from the R library, so I would expect little difference.
I have an R function that loads, processes, and saves many files. Here is a dummy version:
load_process_saveFiles <- function(onlyFiles = c()){
    allFiles <- paste(LETTERS, '.csv', sep = '')
    # If desired, only include certain files
    if(length(onlyFiles) > 0){
        allFiles <- allFiles[allFiles %in% onlyFiles]
    }
    for(file in allFiles){
        # load file
        rawFile <- file
        # Run a super long function
        processedFile <- rawFile
        # Save file
        # write.csv(processedFile, paste('./Other/Path/', file, sep = ''), row.names = FALSE)
        cat('\nDone with file ', file, sep = '')
    }
}
It has to run through about 30 files, and each one takes about 3 minutes. It can be very time consuming to loop through the entire thing. What I'd like to do is run each one separately at the same time so that it would take 3 minutes all together instead of 3 x 30 = 90 minutes.
I know I can achieve this by creating a bunch of RStudio sessions or many terminal tabs, but I can't handle having that many sessions or tabs open at once.
Ideally, I'd like to have all of the files with separate functions listed in one batchRun.R file which I can run from the terminal:
source('./PathToFunction/load_process_saveFiles.R')
load_process_saveFiles(onlyFiles = 'A.csv')
load_process_saveFiles(onlyFiles = 'B.csv')
load_process_saveFiles(onlyFiles = 'C.csv')
load_process_saveFiles(onlyFiles = 'D.csv')
load_process_saveFiles(onlyFiles = 'E.csv')
load_process_saveFiles(onlyFiles = 'F.csv')
So then run $ Rscript batchRun.R from the terminal.
I've tried looking up different examples on SO trying to accomplish something similar, but each has some unique features and I just can't get it to work. Is what I'm trying to do possible? Thanks!
Package parallel gives you a number of options. One option is to parallelize the calls to load_process_saveFiles and have the loop inside of the function run serially. Another option is to parallelize the loop and have the calls run serially. The best way to assess which approach is more suitable for your job is to time them both yourself.
Evaluating the calls to load_process_saveFiles in parallel is relatively straightforward with mclapply, the parallel version of the base function lapply (see ?lapply):
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)
Here, x is a list of values of the argument onlyFiles, and mc.cores = 2L indicates that you want to divide the calls among two R processes.
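For example, under the file-naming scheme in the question, x could be built like this (a sketch; the first six files are just an illustration):
x <- as.list(paste0(LETTERS[1:6], ".csv"))
results <- parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)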
Evaluating the loop inside of load_process_saveFiles in parallel would involve replacing the entire for statement with something like
f <- function(file) {
    cat("Processing file", file, "...")
    x <- read(file)
    y <- process(x)
    write(y, file = file.path("path", "to", file))
    cat(" done!\n")
}
parallel::mclapply(allFiles, f, ...)
and redefining load_process_saveFiles to allow optional arguments:
load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
## body
}
Then you could do, for example, load_process_saveFiles(onlyFiles, mc.cores = 2L).
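A sketch of what that redefinition could look like, forwarding ... to mclapply and reusing the hypothetical per-file helper f from above:
load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
    allFiles <- paste0(LETTERS, ".csv")
    if (length(onlyFiles) > 0L) {
        allFiles <- allFiles[allFiles %in% onlyFiles]
    }
    # '...' carries arguments such as mc.cores through to mclapply
    invisible(parallel::mclapply(allFiles, f, ...))
}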
I should point out that mclapply is not supported on Windows. On Windows you can use parLapply instead, but there are some extra steps involved. These are described in the parallel vignette, which can be opened from R with vignette("parallel", "parallel"). The vignette acts as a general introduction to parallelism in R, so it could be worth reading anyway.
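For completeness, a rough sketch of the parLapply route (assuming the function and file names from above; on a PSOCK cluster the function has to be shipped to the workers explicitly):
library(parallel)
cl <- makeCluster(2L)
clusterExport(cl, "load_process_saveFiles")  # make the function available on the workers
x <- as.list(paste0(LETTERS[1:6], ".csv"))
res <- parLapply(cl, x, load_process_saveFiles)
stopCluster(cl)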
The parallel package is useful in this case. And if you are on Linux, I would recommend the doMC package instead of parallel; doMC is useful even for looping over big data in machine learning projects.
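A minimal sketch of that approach, assuming a Unix-like system and the file names used in the question:
library(doMC)
library(foreach)
registerDoMC(cores = 2)                      # register two worker processes
files <- paste0(LETTERS[1:6], ".csv")
foreach(f = files) %dopar% load_process_saveFiles(onlyFiles = f)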
See below for a reprex of my issues with source(), <-, <<-, environments, etc.
There are three files: testrun.R, which calls inputs.R and CODE.R.
# testrun.R (file 1)
today <<- "abcdef"
source("inputs.R")
for (DC in c("a", "b")) {
    usedlater_3 <- paste("X", DC, used_later2)
    print(usedlater_3)
    source("CODE.R", local = TRUE)
}
final_output <- paste(OD_output, used_later2, usedlater_3)
print(final_output)
# #---- file 2
# # inputs.R
# used_later1 <- paste(today, "_later")
# used_later2 <- "l2"
#
# #---- file 3
# # CODE.R
# OD_output <- paste(DC, today, used_later1, usedlater_2, usedlater_3)
I'm afraid I didn't learn R or CS in a proper way so I'm trying to catch up now. Any bigger picture lessons would be helpful. Previously, I've been relying on a global environment where I keep everything (and save/keep between sessions), but now I'm trying to make everything reproducible, so I'm using RStudio to run local jobs that start from scratch.
I've been trying different combinations of <-, <<-, and source(local = TRUE) (instead of local = FALSE). I do use functions for pieces of code where I know the inputs I need and the outputs I want, but as you can see, CODE.R uses variables from testrun.R, from the loop inside testrun.R, and from inputs.R. Converting some of the code into functions might help, but I'd like to know about alternatives as well, given this case.
Finally you can see my own troubleshooting log to see my thought process:
first run: the variable today wasn't found, so I made it a double-arrow assignment: today <<- "abcdef"
second run: DC not found, so I will switch to local = TRUE
third run: but now usedlater_2 is not found, so I will change usedlater_2 to <<-. (What about usedlater_1? Why didn't that show up as an error? We'll see...)
result of third run: usedlater_2 is still not found when CODE.R needs it. Out of ideas. Note: used_later2 was found to create used_later3 in the for loop in testrun.R.
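For reference, a minimal sketch (not a fix for the files above) of the assignment and source() semantics in play here:
f <- function() {
    x <- 1     # '<-' creates a binding local to f(); it is gone when f() returns
    y <<- 2    # '<<-' searches enclosing environments and, failing that,
               # assigns in the global environment
}
f()
exists("x")    # FALSE
exists("y")    # TRUE

# source(local = TRUE) evaluates the sourced file in the calling environment;
# the default, local = FALSE, evaluates it in the global environment.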
Is it possible to iterate over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250-400MB of JSON output.
EDIT:
Here are some code samples I have been playing around with. To my surprise, parallel processing did not win; basic lapply did. But this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.
## test on first 100 rows of 1 twitter file
library(rjson)
library(parallel)
library(foreach)
library(plyr)
N = 100
library(rbenchmark)
mc.cores <- detectCores()
benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON,
                   mc.cores=mc.cores),
          foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
          replications=100)
Here is a second code sample
parseData <- function(x) {
    x <- tryCatch(fromJSON(x),
                  error=function(e) return(list())
                  )
    ## need to do a test to see if valid data; if so, save out the files
    if (!is.null(x$id_str)) {
        x$created_at <- strptime(x$created_at, "%a %b %e %H:%M:%S %z %Y")
        fname <- paste("rdata/",
                       format(x$created_at, "%m"),
                       format(x$created_at, "%d"),
                       format(x$created_at, "%Y"),
                       "_",
                       x$id_str,
                       sep="")
        saveRDS(x, fname)
        rm(x, fname)
        gc(verbose=FALSE)
    }
}
t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.
Reading in parallel
You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:
result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)
Registering a backend can probably be done using the parallel package which is now integrated in base R. Or you can use the doSNOW package, see this post on my blog for details.
Processing in parallel
In this scenario your best bet is to read the entire dataset into a vector of characters, split the data and then use a parallel backend combined with e.g. the plyr functions.
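A rough sketch of that, assuming a doParallel backend and the rjson parser used in the question ("tweets.json" is a placeholder file name):
library(rjson)
library(plyr)
library(doParallel)

registerDoParallel(cores = 2)
lines <- readLines("tweets.json", warn = FALSE)     # read everything into memory (serially)
parsed <- llply(lines, fromJSON, .parallel = TRUE)  # parse the lines in parallel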
Probably not with readLines(), due to the nature of non-parallel file-system I/O. Of course, if you're using a parallel NFS or something like HDFS, then this restriction won't apply. But assuming you're on a "standard" architecture, it won't be feasible to parallelize your readLines() calls.
Your best bet would probably be to read in the entire file, seeing as <500MB will probably fit in memory, and then parallelize the processing once your object is already read in.
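In terms of the code in the question, that could look roughly like this (a sketch reusing the parseData function and FILES object defined above):
lines <- readLines(FILES[1], n = -1, warn = FALSE)             # serial read
res <- parallel::mclapply(lines, parseData,
                          mc.cores = parallel::detectCores())  # parallel processing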
I sometimes work with lots of objects and it would be nice to have a fresh start because of memory issues between chunks. Consider the following example:
warning: I have 8GB of RAM. If you don't have much, this might eat it all up.
<<chunk1>>=
a <- 1:200000000
@
<<chunk2>>=
b <- 1:200000000
@
<<chunk3>>=
c <- 1:200000000
@
The solution in this case is:
<<chunk1>>=
a <- 1:200000000
@
<<chunk2>>=
rm(a)
gc()
b <- 1:200000000
@
<<chunk3>>=
rm(b)
gc()
c <- 1:200000000
@
However, in my real example (which I can't post because it relies on a large dataset), even after I remove all of the objects and run gc(), R does not clear all of the memory (only some of it). The reason is found in ?gc:
However, it can be useful to call ‘gc’ after a large object has
been removed, as this may prompt R to return memory to the
operating system.
Note the important word may. R has a lot of situations where it specifies may like this and so it is not a bug.
Is there a chunk option according to which I can have knitr start a new R session?
My recommendation would be to create an individual .Rnw file for each of the major tasks, knit them to .tex files, and then use \include or \input in a parent .Rnw file to build the full project. Control the building of the project via a makefile.
However, to address the specific question of using a fresh R session for each chunk, you could use the R package subprocess to spawn an R session, run the needed code, extract the results, and then kill the spawned session.
A simple example .Rnw file
\documentclass{article}
\usepackage{fullpage}
\begin{document}
<<include = FALSE>>=
knitr::opts_chunk$set(collapse = FALSE)
@
<<>>=
library(subprocess)
# define a function to identify the R binary
R_binary <- function() {
    R_exe <- ifelse(tolower(.Platform$OS.type) == "windows", "R.exe", "R")
    return(file.path(R.home("bin"), R_exe))
}
@
<<>>=
# Start a subprocess running vanilla R.
subR <- subprocess::spawn_process(R_binary(), c("--vanilla --quiet"))
Sys.sleep(2) # wait for the process to spawn
# write to the process
subprocess::process_write(subR, "y <- rnorm(100, mean = 2)\n")
subprocess::process_write(subR, "summary(y)\n")
# read from the process
subprocess::process_read(subR, PIPE_STDOUT)
# kill the process before moving on.
subprocess::process_kill(subR)
@
<<>>=
print(sessionInfo(), locale = FALSE)
@
\end{document}
This generates a PDF containing the chunks above together with the captured subprocess output.
Is there a way to run an executable from R and capture its output?
As an example:
output <- run_exe("my.exe")
Yes, look at system() and its options. Hence
R> res <- system("echo 4/3 | bc -l", intern=TRUE)
R> res
[1] "1.33333333333333333333"
R>
would be one way to divide four by three in case you mistrust the R engine itself.
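A closely related option is system2(), which captures standard output directly as a character vector; a small sketch (using ls purely as an example executable):
output <- system2("ls", args = "-l", stdout = TRUE)  # capture stdout instead of printing it
head(output)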