Write to log file using doMPI - r

I am running doMPI on an HPC cluster and I would like to log output from the workers. Using doParallel, I was able to use makeCluster(outfile='myfile.log'). With doMPI, there does not seem to be an outfile argument in any of the methods. I tried using sinkWorkerOutput(). This works, but it only wrote the log for one of the workers. I suspect that each worker is overwriting the others. Is there an analog of outfile for doMPI?
A related question: from inside a worker, can I find the worker number?
EDIT: here is a link to an answer discussing how to use outfile: How can I print when using %dopar%
Thank you for your help,
Ben

To send worker output to a file in the doMPI package, set the startMPIcluster "verbose" option to TRUE:
cl <- startMPIcluster(verbose=TRUE)
This creates one file per worker with names of the form "MPI_1_steve_41747.log". The MPI rank, user name, and process ID are used to make the file names unique. You can also specify the log directory via the "logdir" option.
To get a worker number, you can simply call the mpi.comm.rank function.
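A minimal sketch putting both pieces together (the log directory name and the cat() message are illustrative, not part of the original answer):
library(doMPI)

# one log file per worker, written under "logs/"
cl <- startMPIcluster(verbose = TRUE, logdir = "logs")
registerDoMPI(cl)

results <- foreach(i = 1:8) %dopar% {
  rank <- mpi.comm.rank()   # Rmpi function: this worker's MPI rank (worker number)
  cat("worker", rank, "processing task", i, "\n")   # appears in that worker's log file
  i^2
}

closeCluster(cl)
mpi.quit()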

Related

Restart R session without interrupting the for loop

In my for loop, I need to free RAM. So I delete some objects with the rm() command. Then I call gc(), but the RAM usage stays the same.
So I use .rs.restartR() instead of gc() and it works: a sufficient amount of RAM is freed after the R session restarts.
My problem is that the for loop is interrupted by the restart. Do you have an idea of how to automatically resume the for loop after the .rs.restartR() command?
I just stumbled across this post as I'm having a similar problem with rm() not clearing memory as expected. Like you, if I kill the script, remove everything using rm(list=ls(all.names=TRUE)) and restart, the script takes longer than it did initially. However, restarting the session using .rs.restartR() then sourcing again works as expected. As you say, there is no way to 'refresh' the session while inside a loop.
My solution was to write a simple bash script which calls my .r file.
Say you have a loop in R that runs from 1 to 3 and you want to restart the session after each iteration. My bash script 'runR.sh' could read as follows:
#!/bin/bash
for i in {1..3}
do
  echo "Rscript myRcode.r $i"  # check the call to the R script is as expected
  Rscript myRcode.r $i
done
Then at the top of 'myRcode.r':
args <- commandArgs()
print(args) #list the command line arguments.
myvar <- as.numeric(args[6])
and remove your for (myvar in...){}, keeping just the contents of the loop.
You'll see from print(args) that your input from your shell script is the 6th element of the array, hence args[6] in the following line when assigning your variable. If you're passing in a string, e.g. a file name, then you don't need as.numeric of course.
Running ./runR.sh will then call your script and hopefully solve your memory problem. The only minor issue is that you have to reload your packages each time, unlike when using .rs.restartR(), and may have to repeat other bits that ordinarily would only be run once.
It works in my case, I would be interested to hear from other more seasoned R/bash users whether there are any issues with this solution...
By saving the iteration counter in an external file, and writing an R script which calls itself, the session can be restarted within a for loop from within RStudio. This example requires the following steps.
# Save the iteration counter as a separate .RData file in the working directory.
iter <- 1
save(iter, file="iter.RData")
Create a script which calls itself for a certain number of iterations. Save the following script as "test_script.R"
###load iteration
library(rstudioapi)
load("iter.RData")
###insert function here.
time_now <- Sys.time()
###save output of function to a file.
save(time_now, file=paste0("time_", iter, ".Rdata"))
###update iteration
iter <- iter+1
save(iter, file="iter.RData")
###restart session calling the script again
if (iter < 5) {
  restartSession(command = 'source("test_script.R")')
}
Do you have an idea to automatically go on the for loop after the .rs.restartR() command ?
It is not possible.
Okay, you could configure your R system to do something like this, but it sounds like a bad idea. I'm not really sure if you want to restart the for loop from the beginning or pick it up where it left off. (I'm also very confused that you seem to have been able to enter commands in the R console while a for loop was executing. I think there's more that you are not telling us.)
You can use your Rprofile.site file to automatically run commands when R starts. You could set it up to automatically run your for loop code whenever R starts. But this seems like a bad idea. I think you should find a different sort of fix for your problem.
Some of the things you could do to help the situation: have your for loop write output for each iteration to disk and also write some sort of log to disk so you know where you left off. Maybe write a function around your for loop that takes an argument of where to start, so that you can "jump in" at any point.
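A minimal sketch of that idea (file names and the per-iteration work are placeholders):
run_from <- function(start = 1, n = 100) {
  for (i in start:n) {
    res <- sqrt(i)                               # stand-in for the real work
    saveRDS(res, sprintf("result_%03d.rds", i))  # write each iteration's output to disk
    writeLines(as.character(i), "progress.log")  # record where we left off
  }
}
# After a restart, read progress.log and jump back in:
last <- as.numeric(readLines("progress.log"))
run_from(start = last + 1)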
With this approach, rather than "restarting R and automatically picking up the loop", a better bet would be to use Rscript (or similar) and use R or the command line to sequentially run each iteration (or batch of iterations) in its own R session.
The best fix would be to solve the memory issue without restarting. There are several questions on SO about memory management - try the answers out and if they don't work, make a reproducible example and ask a new question.
You can make your script recursive by having it source itself after restarting the session.
Make sure the script takes into account the initial status of the loop, so you might have to save the current status of the loop in a .rds file before restarting the session. Then read that .rds file from inside the loop after restarting the session. This will help you start the loop where it was before the R session was restarted.
I just found out about this command 'restartSession'. I'm using it because I was also running into memory consumption issues as the garbage collector will not give back the RAM to the OS (Linux).
library(rstudioapi)
restartSession(command = "print('x')")
An approach independent of RStudio:
If you want to run this in RStudio, use the terminal rather than the R console; otherwise use rstudioapi::restartSession() as in the other answers (not recommended, as it can crash).
Create the iterator and load the script (in a system terminal this would be):
R -e 'saveRDS(1,"i.rds"); source("Script.R")'
Script.R file:
# read files and iterator
i<-readRDS("i.rds")
print(i)
# open process id of previous loop to kill it
tryCatch(pid <- readRDS(file="pid.rds"), error=function(e){NA} )
if (exists("pid")){
library(tools)
tools::pskill(pid, SIGKILL)
}
# update objects and iterator
i <- i+1
# process
pid <- Sys.getpid()
# save files and iterator
saveRDS(i, file="i.rds")
# process ID to close it in next loop
saveRDS(pid, file="pid.rds")
### restart session calling the script again
if (i <= 20) {
  print(paste("Processing of", i-1, "ended, restarting"))
  assign('.Last', function() { system('Rscript Script.R') })
  q(save = 'no')
}

Wait for child process to terminate in R

I have multiple R scripts which I source from one script. I want to run them one at a time, such that the second script waits until the first one has finished.
Is there a function like wait(), as used in the Python subprocess package?
Is there a similar package in R?
Running the following should work in the way you describe -- file_2.R will not run until the first source() is complete.
source('file_1.R')
source('file_2.R')
Note that, by default, objects created by the called script will be available in the global environment (and therefore to anything you source thereafter); you can disable this behavior with the local argument (e.g. local=TRUE).
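A small sketch of the sequential behaviour (file names are placeholders; the timing line is only illustrative):
t0 <- Sys.time()
source('file_1.R')                     # blocks until file_1.R has finished
cat("file_1.R took", format(Sys.time() - t0), "\n")
source('file_2.R', local = new.env())  # local can also be given an environment, keeping file_2's objects out of the global workspace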

How can I get the filename from a streaming mapreduce job in R?

I am streaming an R mapreduce job and I need to get the filename. I know that Hadoop sets environment variables for the current job before it starts, and I can access env vars in R with Sys.getenv().
I found :
Get input file name in streaming hadoop program
and Sys.getenv("mapred_job_id") works fine, but it is not what I need. I just need the filename and not the job id or name. I also found: How to get filename when running mapreduce job on EC2?
But this isn't helpful either. What is the easiest way to get the current filename while streaming from R? Thank you
I have not tried this, but from the second link you provided, it seems that this is available in an environment variable called map.input.file. Then, this should work:
Sys.getenv("map.input.file")
EDIT:
Upon further investigation, I learned that you need to replace the dots with underscores, so this is the way to do it:
Sys.getenv("map_input_file")
However, the map.input.file property has been deprecated in YARN (Hadoop 2.x), so the new name should be used instead:
Sys.getenv("mapreduce_map_input_file")

Executing an R script in a way other than using source() in JRI

I am new to R and have been trying to use JRI. Through JRI, I have used the "eval()" function to get certain results. If I want to execute an R script, I have used "source()". However, I am now in a situation where I need to execute a script on continuously incoming data. While I can still use "source()", I don't think that would be an optimal way from a performance perspective.
What I did was to read the entire R script into memory and then try and use "eval()" passing the script - but this does not seem to work. I have ensured that the script has been correctly loaded into memory - that is because if I write this script (loaded into the memory) into a file and source this newly created file, it does produce the expected results.
Is there a way for me to not keep sourcing the same file over and over again and execute it from memory? Each of my data units are independent and have to be processed independently and as soon as they become available. I cannot wait to collect a bunch of data units and then pass them on to the R script.
I have searched a lot and not found anything related to this. Any pointers which could help me in this direction would be really helpful.
The way I handled this is as below -
I enclosed the entire script into a function.
I sourced the script file (which now contains the function) at the start of the execution of my program.
Where I was previously sourcing the file, I now just call the function which contains the script itself, i.e.:
REXP result = rengine.eval("retVal<-" + getFunctionName() + "()");
Here, getFunctionName() gives me the name of the function which contains the script.
Since this is loaded into memory and available, I do not have to source the script file every time I want to execute the script. Any arguments to the script are passed as environment variables.
This seems to be a workaround, but solves my problem. Any better options are welcome.
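For reference, a minimal sketch of this pattern (the file, function and variable names are illustrative, not the poster's actual code):
# -- my_analysis.R, sourced once at startup via rengine.eval("source('my_analysis.R')") --
runAnalysis <- function() {
  input <- Sys.getenv("DATA_UNIT")  # arguments are passed as environment variables
  result <- nchar(input)            # stand-in for the real processing
  result
}
# Each incoming data unit is then processed with something like:
#   rengine.eval("retVal <- runAnalysis()")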

Submit jobs to a slave node from within an R script?

My goal is to get myscript.R to run on a cluster slave node using a job scheduler (specifically, PBS).
Currently, I submit an R script to a slave node using the following command
qsub -S /bin/bash -p -1 -cwd -pe mpich 1 -j y -o output.log ./myscript.R
Are there functions in R that would allow me to run myscript.R on the head node and send individual tasks to the slave nodes? Something like:
foreach(i=c('file1.csv', 'file2.csv'), pbsoptions = list()) %do% read.csv(i)
Update: an alternative to the qsub command above is to remove #!/usr/bin/Rscript from the first line of myscript.R and call it directly, as pointed out by @Josh:
qsub -S /usr/bin/Rscript -p -1 -cwd -pe mpich 1 -j y -o output.log myscript.R
If you want to submit jobs from within an R script, I suggest that you look at the "BatchJobs" package. Here is a quote from the DESCRIPTION file:
Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine.
BatchJobs appears to be more sophisticated than previous, similar packages, such as Rsge and Rlsf. There are functions for registering, submitting, and retrieving the results of jobs. Here's a simple example:
library(BatchJobs)
reg <- makeRegistry(id='test')
batchMap(reg, sqrt, x=1:10)
submitJobs(reg)
y <- loadResults(reg)
You need to configure BatchJobs to use your batch queueing system. The submitJobs "resources" argument can be used to request appropriate resources for the jobs.
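For example, a hedged sketch of requesting per-job resources (the field names depend on your BatchJobs configuration/template, so treat them as placeholders):
submitJobs(reg, resources = list(walltime = 3600, memory = 2048))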
This approach is very useful if your cluster doesn't allow very long running jobs, or if it severely restricts the number of long running jobs. BatchJobs allows you to get around those restrictions by breaking up your work into multiple jobs while hiding most of the work associated with doing that manually.
Documentation and examples are available at the project website.
For most of our work we do run multiple R sessions in parallel using qsub (instead).
If it is for multiple files I normally do:
while read infile rest
do
  qsub -v infile=$infile call_r.sh
done < list_of_infiles.txt
call_r.sh:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile=args[5]
outfile=paste(infile,".out",sep="")...
Then I combine all the output afterwards...
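A hedged sketch of that combining step (it assumes each job wrote a whitespace-delimited .out file next to its input; adjust the reader to your actual output format):
out_files <- list.files(pattern = "\\.out$")
combined <- do.call(rbind, lapply(out_files, read.table, header = TRUE))
write.table(combined, "combined_results.txt", row.names = FALSE)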
The R package Rsge allows job submission to SGE managed clusters. It basically saves the required environment to disk, builds job submission scripts, executes them via qsub and then collates the results and returns them to you.
Because it basically wraps calls to qsub, it should work with PBS too (although since I don't know PBS, I can't guarantee it). You can alter the qsub command and the options used by altering the Rsge associated global options (prefixed sge. in the options() output)
It is no longer on CRAN, but it is available from GitHub: https://github.com/bodepd/Rsge, although it doesn't look like it's maintained any more.
To use it, call one of the apply-type functions supplied with the package: sge.apply, sge.parRapply, sge.parCapply, sge.parLapply and sge.parSapply, which are parallel equivalents to apply, rapply, rapply(t(x),…), lapply and sapply respectively. In addition to the standard parameters passed to the non-parallel functions, a couple of other parameters are needed:
njobs: Number of parallel jobs to use
global.savelist: Character vector giving the names of variables from the global environment that should be imported.
function.savelist: Character vector giving the variables to save from the local environment.
packages: List of library packages to be loaded by each worker process before computation is started.
The two savelist parameters and the packages parameter basically specify what variables, functions and packages should be loaded into the new instances of R running on the cluster machines before your code is executed. The different components of X (either list items or data.frame rows/columns) are divided between njobs different jobs and submitted as a job array to SGE. Each node starts an instance of R, loads the specified variables, functions and packages, executes the code, and saves the results to a tmp file. The controlling R instance checks when the jobs are complete, loads the data from the tmp files and joins the results back together to get the final results.
For example, computing a statistic on 10,000 random samples of a gene list:
library(Rsge)
library(some.bioc.library)
gene.list <- read.delim("gene.list.tsv")
compute.sample <- function(gene.list) {
  gene.list.sample <- sample(gene.list, 1000)
  statistic <- some.slow.bioc.function(gene.list.sample)
  return(statistic)
}
results <- sge.parSapply(1:10000, function(x) compute.sample(gene.list),
                         njobs = 100,
                         global.savelist = c("gene.list"),
                         function.savelist = c("compute.sample"),
                         packages = c("some.bioc.library"))
If you would like to send tasks to slave nodes as you go along with a script running on the head node, I believe your options are the following:
Pre-allocate all slave nodes and keep them on standby when they are not needed (as I suggested in my first answer).
Launch new jobs when the slave nodes are needed and have them save their results to disk. Put the main process on hold until the slaves have completed their tasks and then assemble their output files (see the sketch below).
Option 2 is definitely possible but will take a lot longer to implement (I've actually done it myself several times). @pallevillesen's answer is pretty much spot on.
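A hedged sketch of option 2 (the qsub call, submission script and file names are all illustrative):
files <- c("file1.csv", "file2.csv")
for (f in files) {
  system(sprintf("qsub -v infile=%s analyse_file.sh", f))  # one job per input file
}
expected <- paste0(files, ".out")
while (!all(file.exists(expected))) {
  Sys.sleep(30)                        # hold the main process until all slaves have finished
}
results <- lapply(expected, read.csv)  # assemble the output files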
Original answer, with misinterpreted question
I have never worked with PBS myself, but it appears that you can use it to submit MPI jobs. You might need to load an MPI module before executing the R script, having a shell script along these lines sent to qsub.
#!/bin/bash
#PBS -N my_job
#PBS -l cput=10:00:00,ncpus=4,mem=2gb
module load openmpi
module load R
R -f myscript.R
You should then be able to use doSNOW to execute your foreach loop in parallel.
n.slaves <- 4
library(doSNOW)
cl <- makeMPIcluster(n.slaves)
registerDoSNOW(cl)
foreach(i=c('file1.csv', 'file2.csv')) %dopar% read.csv(i)
