Run an R script in parallel sessions in the background

I have a script test.R that takes arguments arg1, arg2 and outputs an arg1-arg2.csv file.
I would like to run test.R in 6 parallel sessions (I am on a 6-core CPU) and in the background. How can I do it?
I am on Linux.

I suggest using the doParallel backend for the foreach package. The foreach package provides a nice syntax for writing loops and takes care of combining the results. doParallel connects it to the parallel package that has shipped with R since version 2.14. On other setups (older versions of R, clusters, whatever) you could simply change the backend without touching any of your foreach loops. The foreach package in particular has excellent documentation, so it is really easy to use.
If you are going to write the results to individual files, then the result-combining features of foreach won't be of much use to you, so people might argue that direct use of parallel would be better suited to your application. Personally, I find the way foreach expresses looping concepts much easier to use.
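For concreteness, here is a minimal sketch of what such a loop could look like for your case; the argument pairs and the loop body below are made up and merely stand in for whatever test.R actually does:
library(foreach)
library(doParallel)
registerDoParallel(cores = 6)        # one worker per core
# made-up argument pairs standing in for your arg1/arg2 combinations
args_list <- list(c("a", "b"), c("c", "d"), c("e", "f"))
written <- foreach(a = args_list, .combine = c) %dopar% {
  out <- sprintf("%s-%s.csv", a[1], a[2])
  write.csv(matrix(rnorm(9), 3, 3), file = out)  # stand-in for the real work
  out                                            # return the file name written
}
stopImplicitCluster()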

You did not provide a reproducible example, so I am making one up. As you are on Linux, I am also switching to littler, which was after all written for the very purpose of scripting with R.
#!/usr/bin/env r
#
# a simple example: write a small CSV named after the two arguments
if (is.null(argv) | length(argv) != 2) {
  cat("Usage: myscript.r arg1 arg2\n")
  q()
}
filename <- sprintf("%s-%s.csv", argv[1], argv[2])
Sys.sleep(60) # do some real work here instead
write.csv(matrix(rnorm(9), 3, 3), file=filename)
and you can then launch this either from the command line as I do here, or from another (shell) script. The key is the & at the end, which sends it to the background:
edd@max:/tmp/tempdir$ ../myscript.r a b &
[1] 19575
edd@max:/tmp/tempdir$ ../myscript.r c d &
[2] 19590
edd@max:/tmp/tempdir$ ../myscript.r e f &
[3] 19607
edd@max:/tmp/tempdir$
The [n] indicates the job number of the process launched in the background; the number that follows is its process id, which you can use to monitor or kill it. After a little while we get the results:
edd@max:/tmp/tempdir$
[1]   Done                    ../myscript.r a b
[2]-  Done                    ../myscript.r c d
[3]+  Done                    ../myscript.r e f
edd@max:/tmp/tempdir$ ls -ltr
total 12
-rw-rw-r-- 1 edd edd 192 Jun 24 09:39 a-b.csv
-rw-rw-r-- 1 edd edd 193 Jun 24 09:40 c-d.csv
-rw-rw-r-- 1 edd edd 193 Jun 24 09:40 e-f.csv
edd@max:/tmp/tempdir$
You may want to read up on Unix shells to learn more about &, fg and bg, job control, and so on.
Lastly, all this can a) also be done with Rscript, though picking up the arguments is slightly different, and b) there are CRAN packages getopt and optparse to facilitate working with command-line arguments.
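For reference, a rough Rscript counterpart of the littler script above could look like the following sketch (the file name myscript.R is just an example); the main difference is that the arguments arrive via commandArgs() rather than argv:
#!/usr/bin/env Rscript
# sketch of an Rscript version of the same task
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 2) {
  cat("Usage: myscript.R arg1 arg2\n")
  quit(status = 1)
}
filename <- sprintf("%s-%s.csv", args[1], args[2])
Sys.sleep(60)                        # do some real work here instead
write.csv(matrix(rnorm(9), 3, 3), file = filename)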

The state of the art would be to use the parallel package, but when I am lazy, I simply start 6 batch files (cmd, assuming Windows) that call Rscript.
You can set parameters in the cmd file:
SET ARG1="myfile"
Rscript test.R
and read them in R with
Sys.getenv("ARG1")
Using 6 batch files, I can also chain multiple runs in one batch file to make sure the cores are always busy.
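On the R side, picking up that parameter could look something like this sketch (test.R and the output name are just examples):
# test.R (sketch): pick up the parameter set by the calling cmd file
arg1 <- Sys.getenv("ARG1")
if (!nzchar(arg1)) stop("environment variable ARG1 is not set")
write.csv(matrix(rnorm(9), 3, 3), file = paste0(arg1, ".csv"))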

Related

Julia shell mode in a .jl file

In REPL mode, Julia lets you type a semicolon and run shell commands, i.e.
;
cd ~
And then pressing backspace returns to the normal Julia REPL.
Is there a way to do something similar in a .jl file? The closest I found was run(…) and that has many caveats. This is a Linux environment, so I'm not concerned about the caveats of shell mode on Windows machines.
The broader topic of interest is doing this for other REPL modes, like the R mode provided by RCall.
As you mentioned, the default way to do this is via the run command. If you have not already, check out the docs on running external programs (https://docs.julialang.org/en/v1/manual/running-external-programs/#Running-External-Programs), which go into some of the caveats.
I am not sure I follow what you are getting at with RCall, but it may be worth opening a separate question for that.
You can find the code for this at https://github.com/JuliaLang/julia/tree/master/stdlib/REPL/test.
Seems there is no API, just lots of typing.
Here is a minimal working example (the code is mostly copied from different places in the folder above):
using REPL
mutable struct FakeTerminal <: REPL.Terminals.UnixTerminal
    in_stream::Base.IO
    out_stream::Base.IO
    err_stream::Base.IO
    hascolor::Bool
    raw::Bool
    FakeTerminal(stdin, stdout, stderr, hascolor=true) =
        new(stdin, stdout, stderr, hascolor, false)
end
REPL.Terminals.hascolor(t::FakeTerminal) = t.hascolor
REPL.Terminals.raw!(t::FakeTerminal, raw::Bool) = t.raw = raw
REPL.Terminals.size(t::FakeTerminal) = (24, 80)
input = Pipe()
output = Pipe()
err = Pipe()
Base.link_pipe!(input, reader_supports_async=true, writer_supports_async=true)
Base.link_pipe!(output, reader_supports_async=true, writer_supports_async=true)
Base.link_pipe!(err, reader_supports_async=true, writer_supports_async=true)
repl = REPL.LineEditREPL(FakeTerminal(input.out, output.in, err.in, false), false)
repltask = @async REPL.run_repl(repl)
Now you can do:
julia> println(input,";ls -la *.jld2")
-rw-r--r-- 1 pszufe 197121 5506 Jul 5 2020 file.jld2
-rw-r--r-- 1 pszufe 197121 5506 Jul 5 2020 myfile.jld2

How do I download images from a server and then upload them to a website using R?

Okay, so I have approximately 2 GB worth of files (images and what not) stored on a server (I'm using Cygwin right now since I'm on Windows), and I was wondering whether I can get all of this data into R and then eventually move it onto a website where people can view/download those images.
I currently have installed the ssh package and have logged into my server using:
ssh::ssh_connect("name_and_server_ip_here")
I've been able to successfully connect; however, I am not particularly sure how to locate the files on the server through R. I assume I would use something like scp_download to download the files from the server, but as mentioned before, I am not sure how to locate them, so I wouldn't be able to download them anyway (yet)!
Any sort of feedback and help would be appreciated! Thanks :)
You can use ssh::ssh_exec_internal and the shell's find command to locate files on the server.
sess <- ssh::ssh_connect("r2@myth", passwd="...")
out <- ssh::ssh_exec_internal(sess, command = "find /home/r2/* -maxdepth 3 -type f -iname '*.log'")
str(out)
# List of 3
# $ status: int 0
# $ stdout: raw [1:70] 2f 68 6f 6d ...
# $ stderr: raw(0)
The stdout/stderr components are raw vectors (it's feasible that the remote command did not produce ASCII data), so we can use rawToChar to convert. (This may not be console-safe if you have non-ASCII data, but it is here, so I'll go with it.)
rawToChar(out$stdout)
# [1] "/home/r2/logs/dns.log\n/home/r2/logs/ping.log\n/home/r2/logs/status.log\n"
remote_files <- strsplit(rawToChar(out$stdout), "\n")[[1]]
remote_files
# [1] "/home/r2/logs/dns.log" "/home/r2/logs/ping.log" "/home/r2/logs/status.log"
For downloading, scp_download is not vectorized, so we can only download one file at a time.
for (rf in remote_files) ssh::scp_download(sess, files = rf, to = ".")
# 4339331 C:\Users\r2\.../dns.log
# 36741490 C:\Users\r2\.../ping.log
# 17619010 C:\Users\r2\.../status.log
For uploading, scp_upload is vectorized, so we can send all in one shot. I'll create a new directory (just for this example, and to not completely clutter my remote server :-), and then upload them.
ssh::ssh_exec_wait(sess, "mkdir '/home/r2/newlogs'")
# [1] 0
ssh::scp_upload(sess, files = basename(remote_files), to = "/home/r2/newlogs/")
# [100%] C:\Users\r2\...\dns.log
# [100%] C:\Users\r2\...\ping.log
# [100%] C:\Users\r2\...\status.log
# [1] "/home/r2/newlogs/"
(I find it odd that scp_upload is vectorized while scp_download is not. If this were done in a shell/terminal, each call to scp would need to connect, authenticate, copy, and disconnect, which is a bit inefficient; since we're using a saved session, I believe (unverified) that little efficiency is lost by the R function not being vectorized ... though it is still really easy to vectorize it yourself.)
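If you do want a vectorised feel on the download side, a tiny wrapper over the saved session does it; scp_download_many below is a hypothetical helper, not part of the ssh package:
# hypothetical helper: download several remote files over one saved session
scp_download_many <- function(sess, files, to = ".") {
  for (f in files) ssh::scp_download(sess, files = f, to = to)
  invisible(files)
}
scp_download_many(sess, remote_files, to = ".")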

How to run multiple Julia files in parallel?

I have a pretty simple question but I can't seem to find an answer anywhere. I want to run two .jl files parallel using the Julia Terminal.
I tried include("file1.jl" & "file2.jl") and include("file1.jl") & include("file2.jl") but this doesn't work.
I'm not sure exactly what you want to do, but if you wanted to run these two files on two different workers from the Julia terminal, you could do, for example:
using Distributed  # provides addprocs and pmap on Julia >= 0.7
addprocs(1)        # add a worker
pmap(include, ["file1.jl", "file2.jl"]) # apply include to each element
                                        # of the array in parallel
But I'm pretty sure there will be a better way of doing whatever you want to accomplish.
While you can probably wrangle your code into the Julia parallel computing paradigms, it seems like the simplest solution is to execute your Julia scripts from the command line. Here I assume that you are comfortable letting the operating system handle the task scheduling, which may or may not result in parallel execution.
What follows below is a skeleton pipeline to get you started. Replace task.jl with your file1.jl, file2.jl, etc.
task.jl
println("running like a cheetah")
run_script.sh
#!/bin/bash
echo `date`
julia task.jl
julia task.jl
echo `date`
run_script_parallel.sh
#!/bin/bash
echo `date`
julia task.jl &
julia task.jl &
wait # do not return before background tasks are complete
echo `date`
From the command line, ensure that your Bash scripts are executable:
chmod +rwx run_script.sh run_script_parallel.sh
Try running the scripts now. Note that my example Julia script task.jl returns almost immediately, so this particular comparison is a little silly:
./run_script.sh
./run_script_parallel.sh
My output
Thu Jan 5 14:24:57 PST 2017
running like a cheetah
running like a cheetah
Thu Jan 5 14:24:57 PST 2017
Thu Jan 5 14:25:05 PST 2017
running like a cheetahrunning like a cheetah
Thu Jan 5 14:25:06 PST 2017
The first output shows the print statements in clean serial order. But observe in the second case that the text runs together; that is common behavior for parallel print statements.

Opening a new instance of R and sourcing a script within that instance

Background/Motivation:
I am running a bioinformatics pipeline that, if executed from beginning to end linearly takes several days to finish. Fortunately, some of the tasks don't depend upon each other so they can be performed individually. For example, Task 2, 3, and 4 all depend upon the output from Task 1, but do not need information from each other. Task 5 uses the output of 2, 3, and 4 as input.
I'm trying to write a script that will open new instances of R for each of the three tasks and run them simultaneously. Once all three are complete I can continue with the remaining pipeline.
What I've done in the past, for more linear workflows, is have one "master" script that sources (source()) each task's subscript in turn.
I've scoured SO and google and haven't been able to find a solution for this particular problem. Hopefully you guys can help.
From within R, you can use system() to invoke commands in a terminal and the open command to open a file or application. For example, the following will open a new terminal instance:
system("open -a Terminal .",wait=FALSE)
Similarly, I can start a new R session by using
system("open -a r .")
What I can't figure out for the life of me is how to set the "input" argument so that it sources one of my scripts. For example, I would expect the following to open a new terminal instance, call r within the new instance, and then source the script.
system("open -a Terminal .",wait=FALSE,input=paste0("r; source(\"/path/to/script/M_01-A.R\",verbose=TRUE,max.deparse.length=Inf)"))
Answering my own question in the event someone else is interested down the road.
After a couple of days of working on this, I think the best way to carry out this workflow is to not limit myself to working just in R. Writing a bash script offers more flexibility and is probably a more direct solution. The following example was suggested to me on another website.
#!/bin/bash
# Run task 1
Rscript Task1.R
# now run the three jobs that use Task1's output
# we can fork these using '&' to run in the background in parallel
Rscript Task2.R &
Rscript Task3.R &
Rscript Task4.R &
# wait until background processes have finished
wait %1 %2 %3
Rscript Task5.R
You might be interested in the future package (I'm the author). It allows you to write your code as:
library("future")
v1 %<-% task1(args_1)
v2 %<-% task2(v1, args_2)
v3 %<-% task3(v1, args_3)
v4 %<-% task4(v1, args_4)
v5 %<-% task5(v2, v3, v4, args_5)
Each of those v %<-% expr statements creates a future based on the R expression expr (and all of its dependencies) and assigns it to a promise v. Only when v is used will it block and wait for the value of v to become available.
How and where these futures are resolved is decided by the user of the above code. For instance, by specifying:
library("future")
plan(multiprocess)
at the top, then the futures (= the different tasks) are resolved in parallel on your local machine. If you use,
plan(cluster, workers = c("n1", "n3", "n3", "n5"))
they're resolved on those machines (where n3 accepts two concurrent jobs).
This works on all operating systems (including Windows).
If you have access to an HPC cluster with a scheduler such as Slurm, SGE, or TORQUE / PBS, you can use the future.BatchJobs package, e.g.
plan(future.BatchJobs::batchjobs_torque)
PS. One reason for creating future was to do large-scale bioinformatics in a parallel / distributed fashion.

Parallelizing on a supercomputer and then combining the parallel results (R)

I've got access to a big, powerful cluster. I'm a halfway decent R programmer, but totally new to shell commands (and terminal commands in general besides basic things that one needs to do to use ubuntu).
I want to use this cluster to run a bunch of parallel processes in R, and then I want to combine them. Specifically, I have a problem analogous to:
my.function <- function(data, otherdata, N){
  mod = lm(y~x, data=data)
  a = predict(mod, newdata = otherdata, se.fit=TRUE)
  b = rnorm(N, a$fit, a$se.fit)
  b
}
r1 = my.function(data, otherdata, N)
r2 = my.function(data, otherdata, N)
r3 = my.function(data, otherdata, N)
r4 = my.function(data, otherdata, N)
...
r1000 = my.function(data, otherdata, N)
results = list(r1, r2, r3, r4, ... r1000)
The above is just a dumb example, but basically I want to do something 1000 times in parallel, and then do something with all of the results from the 1000 processes.
How do I submit 1000 jobs simultaneously to the cluster, and then combine all the results, like in the last line of the code?
Any recommendations for well-written manuals/references for me to go RTFM with would be welcome as well. Unfortunately, the documents that I've found aren't particularly intelligible.
Thanks in advance!
You can combine plyr with the doMC package (a parallel backend for the foreach package) as follows:
require(plyr)
require(doMC)
registerDoMC(20) # for 20 processors
results <- llply(1:1000, function(idx) {
  my.function(data, otherdata, N)
}, .parallel = TRUE)
Edit: If you're talking about submitting simultaneous jobs, then don't you have an LSF license? You can then use bsub to submit as many jobs as you need, and it also takes care of load-balancing and what not...!
Edit 2: A small note on load-balancing (example using LSF's bsub):
What you mention is similar to what I wrote here => LSF. You can submit jobs in batches. For example, with LSF you can use bsub to submit a job to the cluster like so:
bsub -m <nodes> -q <queue> -n <processors> -o <output.log>
-e <error.log> Rscript myscript.R
and this will place your job in the queue and allocate the requested number of processors; if and when resources are available, your job will start running. You can pause, restart, and suspend your jobs, and much more. qsub is similar in concept. The learning curve may be a bit steep, but it is worth it.
We wrote a survey paper on the State of the Art in Parallel Computing with R in the Journal of Statistical Software (which is an open journal). You may find it useful as an introduction.
The Message Passing Interface (MPI) does what you want to do, and it is very easy to use. After compiling, you need to run:
mpirun -np [no.of.process] [executable]
You select where to run it with a simple text file listing host names and IP addresses, like:
node01 192.168.0.1
node02 192.168.0.2
node03 192.168.0.3
Here are more examples of MPI.
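If you would rather drive MPI from R instead of compiled code, one common route is the Rmpi/doMPI pair, which plugs MPI into foreach. Below is a minimal sketch, assuming both packages and a working MPI installation are available; my.function is the placeholder from the question, and the exact mpirun invocation depends on your MPI setup:
# myscript.R (sketch): run a foreach loop across MPI workers via doMPI
library(doMPI)
cl <- startMPIcluster()              # workers come from the mpirun launch
registerDoMPI(cl)
results <- foreach(i = 1:1000) %dopar% {
  my.function(data, otherdata, N)    # placeholder for the real task
}
closeCluster(cl)
mpi.quit()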
