How to write foreach %dopar% logs to separate files

I use R to run Ant Colony Optimization and usually repeat the same optimization several times to cross-validate my results. I want to save time by running the processes in parallel with the foreach and doParallel packages.
A reproducible example of my code would be very long so I'm hoping this is sufficient. I think I managed to get the code running like this:
result <- list()
short <- function(n){
  for(i in 1:10){  # use a different index; `for(n in 1:10)` would shadow the argument n
    result[[i]] <- ACO(data, ...)
  }
}
foreach(n = 1:50) %dopar% short(n)
Within the ACO() function I continuously create objects with intermediate results (e.g. the current pheromone levels) which I save using write.table(..., append=TRUE) to keep track of the iterations and their results. Now that I'm running the processes in parallel, the file I write contains results from all processes and I'm not able to tell which process the data belongs to. Therefore, I'd like to write different files for each process.
What's the best way, in general, to save intermediate results when using parallel processing?

You can use the log4r package to write the information you need to a log file; see the package documentation for more details.
An example of the code to put in your short() function:
# Import the log4r package.
library('log4r')
# Create a new logger object with create.logger().
logger <- create.logger()
# Set the logger's file output.
logfile(logger) <- 'base.log'
# Set the current level of the logger.
level(logger) <- 'INFO'
# Try logging messages with different priorities.
# At priority level INFO, a call to debug() won't print anything.
debug(logger, 'Iteration and result info')
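To get one file per process, make the log file name unique to the worker, e.g. by embedding the task index and the worker's process ID. A minimal sketch along those lines (the file-naming scheme and messages are illustrative; the ACO() call is the question's placeholder):
short <- function(n){
  library('log4r')  # load the package on the worker
  # one log file per task, tagged with the task index and the worker's PID
  logger <- create.logger(logfile = sprintf('aco_task%02d_pid%d.log', n, Sys.getpid()),
                          level = 'INFO')
  for(i in 1:10){
    # result <- ACO(data, ...)  # as in the question
    info(logger, sprintf('task %d, repetition %d finished', n, i))
  }
}
foreach(n = 1:50) %dopar% short(n)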

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")
cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)
m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
# get extent of subset
cell <- raster::cellFromRowCol(grid, row, col)
ext <- raster::extentFromCells(grid, cell)
# crop main raster to subset extent
subset <- raster::crop(data, ext)
# ...
# perform some processing steps on the raster subset
# ...
# save results to a separate file
saveRDS(subset, paste0("output_folder/", row, "_", col))
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file every time it is called. This seems to be standard behavior of the raster package, but it becomes a problem because these temp files are only deleted after the whole code has been executed, and they take up way too much disk space in the meantime (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset@file@name). However, this no longer works when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this: removeTmpFiles().
You should be able to use f <- filename(subset); avoid reading from slots (@). I do not see why you would not be able to remove it, but perhaps it needs some fiddling with the path?
Temp files are only created when the raster package deems it necessary, based on the RAM available and required; see canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory).
Another approach is to provide a filename argument to crop(). Then you know what the filename is, and you can delete it. Of course you need to take care not to overwrite data from different tasks, so you may need to use some unique id associated with each task.
saveRDS() won't work if the raster is backed by a temp file (as it will disappear).
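Putting the filename suggestion into code, a minimal sketch of the loop body, assuming the grid and data objects from the question (the naming scheme is illustrative; readAll() pulls the values into memory so the saved object no longer depends on the file, addressing the saveRDS() caveat above):
m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)
  # write the crop to a known, task-specific file instead of an anonymous temp file
  f <- file.path(tempdir(), sprintf("crop_%d_%d.grd", row, col))
  subset <- raster::crop(data, ext, filename = f, overwrite = TRUE)
  # ... processing ...
  saveRDS(raster::readAll(subset), paste0("output_folder/", row, "_", col))
  # the native .grd format writes a .grd header plus a .gri data file
  file.remove(f, sub("\\.grd$", ".gri", f))
}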

How to restart R and continue a benchmark script from previous line (on Windows)?

I want to benchmark the time and profile memory used by several functions (regression with random effects and other analysis) applied to different dataset sizes.
My computer has 16 GB of RAM and I want to see how R behaves with large datasets and where the limit is.
To do this I have been using a loop and the bench package.
After each iteration I clean the memory with gc(reset=TRUE).
But when the dataset is very large, the garbage collector doesn't work properly; it only frees part of the memory.
At the end all the memory stays filled, and I need to restart my R session.
My full dataset is called allDT and I do something like this:
for (NN in (1:10) * 100000) {
  gc(reset = TRUE)
  myDT <- allDT[sample(.N, NN)]
  assign(paste0("time", NN), mark(
    model1 = glmer(Out ~ var1 + var2 + var3 + (1 | City/ID), data = myDT),
    model2 = glmer(Out ~ var1 + var2 + var3 + (1 | ID), data = myDT),
    iterations = 1, check = FALSE))
}
That way I can get the results for each size.
The method is not fair, though, because the memory never gets properly cleaned between iterations.
I've thought of an alternative: restart the whole R program after every iteration (exit R and start it again; this is the only way I've found to get the memory fully cleaned), loading the data again and continuing from the last step.
Is there any simple way to do it or any alternative?
Maybe I need to save the results to disk every time, but it will be difficult to keep track of the last executed line, especially if R hangs.
I may need to create an external batch file and run a loop calling R at every iteration, though I'd prefer to do everything from R without any external scripting.
One thing I do for benchmarks like this is to launch another instance of R and have that other R instance return the results to stdout (or, simpler, just save them to a file).
Example:
times <- c()
for (i in seq_along(param)) {
  system(sprintf("Rscript functions/mytest.r %s", param[i]))
  times[i] <- readRDS("/tmp/temp.rds")
}
In the mytest.r file, read in the parameters and save the results to a file:
args <- commandArgs(trailingOnly = TRUE)
NN <- as.numeric(args[1])  # command-line arguments arrive as character strings
allDT <- readRDS("mydata.rds")
...
# save results
saveRDS(myresult, file = "/tmp/temp.rds")
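To make such a run restartable after a crash or hang, the driver loop can skip sizes whose result file already exists. A minimal sketch, assuming mytest.r is extended to take the output path as its second argument (file names are illustrative):
# driver: one fresh R process per dataset size; finished sizes are skipped on restart
for (NN in (1:10) * 100000) {
  outfile <- sprintf("results/time_%d.rds", NN)
  if (file.exists(outfile)) next  # already computed in a previous run
  system(sprintf("Rscript functions/mytest.r %d %s", NN, outfile))
}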

Is there a way to transfer R objects to separate R sessions on linux?

I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e4^2), 1e4)  # ~800 MB; big enough that loading takes noticeable time
saveRDS(big_data, file = "big_data.Rds")
# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data+rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`
# run this program a bunch
for (i in 1:1000) {
  system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R` open a terminal and do
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data+1
print("yahoo!")
# save this as `my_improved_program.R`
# now do the following:
for (i in 1:1000) {
  system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, that I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
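For data like the random matrix above, the compression step usually dominates the save/load time; a quick sketch to check this on your own machine (file names are illustrative):
x <- matrix(rnorm(2e3^2), 2e3)                      # ~32 MB of random doubles
system.time(saveRDS(x, "c.Rds"))                    # compressed (the default)
system.time(saveRDS(x, "u.Rds", compress = FALSE))  # uncompressed
system.time(readRDS("c.Rds")); system.time(readRDS("u.Rds"))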
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)

Calling external program in parallel using foreach and doSNOW: How to import results?

I'm using R to call an external program in parallel on a cluster with multiple nodes and multiple cores. The external program requires three input data files and produces one output file (all files are stored in the same subfolder).
To run the program in parallel (or rather call it in a parallel fashion) I've initially used the foreach function together with the doParallel library. This works fine as long as I'm just using multiple cores on a single node.
However, I wanted to use multiple nodes with multiple cores. Therefore I modified my code accordingly to use the doSNOW library in conjunction with foreach (I tried Rmpi and doMPI, but I did not manage to run the code on multiple nodes with those libraries).
This works fine, i.e. the external program is now indeed run on multiple nodes (with multiple cores), and the cluster logfile shows that it produces the required results. The problem I'm facing now, however, is that the external program no longer stores the results/output files on the master node, in the specified subfolder of the working directory (it did so when I was using doParallel). This makes it impossible for me to import the results into R.
Indeed, if I check the content of the relevant folder it does not contain any output files, despite the logfile clearly showing that the external program ran successfully. I guess they are stored on the different nodes (?).
What modifications do I have to make to either my foreach function or the way I set up my cluster, to get those files saved on the master node/in the specified subfolder in my working directory?
Here is some example R code to showcase what I'm doing:
# #Set working directory in non-interactive mode
setwd(system("pwd", intern = T))
# #Load some libraries
library(foreach)
library(parallel)
library(doParallel)
# ####Parallel tasks####
# #Create doSNOW cluster for parallel tasks
library(doSNOW)
nCoresPerNode <- as.numeric(Sys.getenv("PBS_NUM_PPN"))
nodeNames <- system("cat $PBS_NODEFILE | uniq", intern=TRUE)
machines <- rep(nodeNames, each = nCoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
# #How many workers are we using?
getDoParWorkers()
#####DUMMY CODE#####
# #The following 3 lines of code are just dummy code:
# #The idea is to create input files for the external program "myprogram"
external_Command_Script.cmd # #command file necessary for external program "myprogram" to run
startdata # #some input data for "myprogram"
enddata # #additional input data for "myprogram"
####DUMMY CODE######
# #Write necessary command and data files for external program: THIS WORKS!
for(i in 1:100){
write(external_Command_Script.cmd[[i]], file=paste("./mysubfolder/external_Command_Script.",i,".cmd", sep=""))
write.table(startdata, file=paste("./mysubfolder/","startdata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
write.table(enddata, file=paste("./mysubfolder/","enddata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
}
# #Run external program "myprogram" in parallel: THIS WORKS!
foreach(i = 1:100) %dopar% {
system(paste0('(cd ./mysubfolder && myprogram enddata.', i, '.txt startdata.', i, '.txt < external_Command_Script.', i, '.cmd)'))
}
# #Import results of external program: THIS DOES NOT WORK WHEN RUN ON MULTIPLE NODES!
results <- list()
for(i in 1:100){
results[[i]] = read.table(paste("./mysubfolder/","enddata.txt.",i,".log.txt", sep=""), sep = "\t", quote="\"", header = TRUE)
}
# #The import does NOT work as the files created by the external program are NOT stored on the master node/in the
# #subfolder of the working directory!
# #Instead I get the following error message:
# #sh: line 0: cd: ./mysubfolder: No such file or directory
# #Error in { : task 6 failed - "cannot open the connection"
My PBS script for the cluster looks something like this:
#!/bin/bash
# request resources:
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
module add languages/R-3.3.3-ATLAS
export PBS_O_WORKDIR="/panfs/panasas01/gely/xxxxxxx/workingdirectory"
# on compute node, change directory to 'submission directory':
cd $PBS_O_WORKDIR
# run your program and time it:
time Rscript ./R_script.R
I'd like to suggest that you look into the batchtools package. It provides methods for interacting with TORQUE / PBS from R.
If you're OK with using its predecessor BatchJobs for a while, I'd also recommend trying that, and when you understand how it works, look into the doFuture foreach adaptor. This will allow you to use the future.BatchJobs package. The combination of doFuture, future.BatchJobs, and BatchJobs allows you to do everything from within R, without having to worry about creating temporary R scripts, etc. (Disclaimer: I'm the author of both.)
Example what it'll look like when you've got it set up:
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS with help from BatchJobs
library("future.BatchJobs")
plan(batchjobs_torque)
and then you use:
res <- foreach(i = 1:100) %dopar% {
  my_function(pathname[i], arg1, arg2)
}
This will evaluate each iteration in a separate PBS job, i.e. you'll see 100 jobs added to the queue.
The future.BatchJobs vignettes have more examples and info.
UPDATE 2017-07-30: The future.batchtools package is on CRAN since May 2017. This package is now recommended over future.BatchJobs. The usage is very similar to the above, e.g. instead of plan(batchjobs_torque) you now use plan(batchtools_torque).
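For reference, the future.batchtools equivalent of the setup above looks like this (a sketch, assuming a standard TORQUE template is available on the cluster):
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS via batchtools
library("future.batchtools")
plan(batchtools_torque)
res <- foreach(i = 1:100) %dopar% {
  my_function(pathname[i], arg1, arg2)
}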
Problem solved:
I made a mistake: the external program is actually NOT running; I misinterpreted the log file. The reason the external program does not run is that the subfolder (containing the necessary input data) is not found. It seems that the cluster defaults to the user directory instead of the working directory specified in the PBS submission script. This behaviour is different from clusters created with doParallel, which do indeed recognize the working directory. The problem is therefore solved by adding the relative path to the working directory and subfolder in the R script, i.e. ./workingdirectory/mysubfolder/ instead of just ./mysubfolder/. Alternatively, you can also use the full path to the folder.
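In code, the fix amounts to anchoring the path at the submission directory (a sketch using the question's naming):
# path relative to where the workers actually start, not to the submission directory
foreach(i = 1:100) %dopar% {
  system(paste0('(cd ./workingdirectory/mysubfolder && myprogram enddata.', i,
                '.txt startdata.', i, '.txt < external_Command_Script.', i, '.cmd)'))
}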

Saving output from parallel jobs in R into one file

I am running a rather lengthy job that I need to replicate 100 times, so I have turned to the foreach capability in R, which I then run on an 8-core cluster through a shell script. I am trying to write all of the results from each run to the same file. I have included a simplified version of my code.
cl <- makeCluster(core - 1)
registerDoParallel(cl, cores = core)
SigEpsilonSq <- list()
SigLSq <- list()
RatioMat <- list()
foreach(p = 1:100) %dopar% {
  # ... functions defining my variables ...
  for (i in 1:fMaxInd) {
    rhoSqjMatr[, i] <- 1/(1 + Bb[i]) * (CbAdj + AbAdj*XjBarAdj + BbAdj[i]*XjSqBarAdj)/(dataZ*dataZ)
    sigmaEpsSqV[i] <- mean(rhoSqjMatr[, i])
    rhoSqjMatr[, i] <- rhoSqjMatr[, i]/sigmaEpsSqV[i]
    biasCorrV[, i] <- sigmaEpsSqV[i]/L * gammaQl(rhoSqjMatr[, i])
    Qcbar[, i] <- Qflbar - biasCorrV[, i]
    sigmaExtSq[, i] <- sigmaSqExt(sigmaEpsSqV[i], rhoSqjMatr[, i])
    ratioMatr[, i] <- sigmaExtSq[, i]/(sigmaL*sigmaL)  # ratio (sigma_l^e)^2/(sigmaL)^2
  }
  sigmaEpsSqV <- as.matrix(sigmaEpsSqV)
  SigEpsilonSq[[p]] <- sigmaEpsSqV
  SigLSq[[p]] <- sigmaExtSq
  RatioMat[[p]] <- ratioMatr
} # End of the dopar loop
stopCluster(cl)
write.csv(SigEpsilonSq,file="Sigma_Epsilon_Sq.csv")
write.csv(SigLSq,file="Sigma_L_Sq.csv")
write.csv(RatioMat,file="Ratio_Matrix.csv")
When the job completes, my .csv files are empty. I believe I'm not quite understanding how foreach returns results and how I can access them. I would like to avoid having to merge files manually. Also, do I need to call
stopCluster(cl)
at the end of my foreach loop, or do I wait until the very end? Any help would be much appreciated.
This is not how foreach works; you should look at some examples. foreach returns a value, and you need to capture it (optionally using .combine to control how the per-iteration results are aggregated) if you want output from your parallelized jobs.
Also, instead of this:
sigmaEpsSqV<-as.matrix(sigmaEpsSqV)
SigEpsilonSq[[p]]<-sigmaEpsSqV
SigLSq[[p]]<-sigmaExtSq
RatioMat[[p]]<-ratioMatr
You have to return something like this as the last expression of the loop body:
list(as.matrix(sigmaEpsSqV),sigmaEpsSqV,sigmaExtSq,ratioMatr)
You can also use rbind, cbind, c, ... to aggregate the results into one final output.
You can even write your own combine function, for example (rbindlist() here comes from the data.table package):
.combine = function(x, y) rbindlist(list(x, y))
The solution below should work. The output will be a list of lists, which might be painful to retrieve and save in the correct format; if so, you should design your own .combine function. (Note that if you omit .combine entirely, foreach returns a flat list with one element per iteration.)
cl <- makeCluster(core - 1)
registerDoParallel(cl, cores = core)
results <- foreach(p = 1:100, .combine = list) %dopar% {
  # ... functions defining my variables ...
  for (i in 1:fMaxInd) {
    rhoSqjMatr[, i] <- 1/(1 + Bb[i]) * (CbAdj + AbAdj*XjBarAdj + BbAdj[i]*XjSqBarAdj)/(dataZ*dataZ)
    sigmaEpsSqV[i] <- mean(rhoSqjMatr[, i])
    rhoSqjMatr[, i] <- rhoSqjMatr[, i]/sigmaEpsSqV[i]
    biasCorrV[, i] <- sigmaEpsSqV[i]/L * gammaQl(rhoSqjMatr[, i])
    Qcbar[, i] <- Qflbar - biasCorrV[, i]
    sigmaExtSq[, i] <- sigmaSqExt(sigmaEpsSqV[i], rhoSqjMatr[, i])
    ratioMatr[, i] <- sigmaExtSq[, i]/(sigmaL*sigmaL)  # ratio (sigma_l^e)^2/(sigmaL)^2
  }
  # return the per-iteration results; assigning into global lists has no effect in workers
  list(as.matrix(sigmaEpsSqV), sigmaEpsSqV, sigmaExtSq, ratioMatr)
} # End of the dopar loop
stopCluster(cl)
#Then you extract and save results
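For example, if you drop the .combine argument so that results is a flat list with one four-element list per iteration, the original CSV files can be written like this (a sketch; it assumes the per-iteration matrices are conformable so they can be bound column-wise):
SigEpsilonSq <- lapply(results, `[[`, 1)  # as.matrix(sigmaEpsSqV) from each task
SigLSq <- lapply(results, `[[`, 3)        # sigmaExtSq from each task
RatioMat <- lapply(results, `[[`, 4)      # ratioMatr from each task
write.csv(do.call(cbind, SigEpsilonSq), file = "Sigma_Epsilon_Sq.csv")
write.csv(do.call(cbind, SigLSq), file = "Sigma_L_Sq.csv")
write.csv(do.call(cbind, RatioMat), file = "Ratio_Matrix.csv")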
