Parallel processing and temporary files - r

I'm using the mclapply function in the multicore package to do parallel processing. It seems that all child processes started produce the same names for temporary files given by the tempfile function. i.e. if I have four processors,
library(multicore)
mclapply(1:4, function(x) tempfile())
will give four exactly same filenames. Obviously I need the temporary files to be different so that the child processes don't overwrite each others' files. When using tempfile indirectly, i.e. calling some function that calls tempfile I have no control over the filename.
Is there a way around this? Do other parallel processing packages for R (e.g. foreach) have the same problem?
Update: This is no longer an issue since R 2.14.1.
CHANGES IN R VERSION 2.14.0 patched:
[...]
o tempfile() on a Unix-alike now takes the process ID into account.
This is needed with multicore (and as part of parallel) because
the parent and all the children share a session temporary
directory, and they can share the C random number stream used to
produce the uniaue part. Further, two children can call
tempfile() simultaneously.

I believe multicore spins off a separate process for each subtask. If that assumption is correct, then you should be able to use Sys.getpid() to "seed" tempfile:
tempfile(pattern=paste("foo", Sys.getpid(), sep=""))

Use the x in your function:
mclapply(1:4, function(x) tempfile(pattern=paste("file",x,"-",sep=""))

Because the parallel jobs all run at the same time, and because the random seed comes from the system time, running four instances of tempfile in parallel will typically produce the same results (if you have 4 cores, that is. If you only have two cores, you'll get two pairs of identical temp file names).
Better to generate the tempfile names first and give them to your function as an argument:
filenames <- tempfile( rep("file",4) )
mclapply( filenames, function(x){})
If you're using someone else's function that has a tempfile call in it, then working the PID into the tempfile name by modifying the tempfile function, as previously suggested, is probably the simplest plan:
tempfile <- function( pattern = "file", tmpdir = tempdir(), fileext = ""){
.Internal(tempfile(paste("pid", Sys.getpid(), pattern, sep=""), tmpdir, fileext))}
mclapply( 1:4, function(x) tempfile() )

At least for now, I chose to monkey-patch my way around this by using the following code in my .Rprofile following Daniel's advice to use PID values.
assignInNamespace("tempfile.orig", tempfile, ns="base")
.tempfile = function(pattern="file", tmpdir=tempdir())
tempfile.orig(paste(pattern, Sys.getpid(), sep=""), tmpdir)
assignInNamespace("tempfile", .tempfile, ns="base")
Obviously it's not a good option for any package you'd distribute, but for a single user's need it's the best option thus far since it works in all cases.

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")
cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)
m <- foreach(col=ncol(grid)) %:% foreach(row=nrow(grid)) %dopar% {
# get extent of subset
cell <- raster::cellFromRowCol(grid, row, col)
ext <- raster::extentFromCells(grid, cell)
# crop main raster to subset extent
subset <- raster::crop(data, ext)
# ...
# perform some processing steps on the raster subset
# ...
# save results to a separate file
saveRDS(subset, paste0("output_folder/", row, "_", col)
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file everytime it is called. This seems to be standard behavior of the raster package, but it becomes a problem, because these temp files are only deleted after the whole code has been executed, and take up way too much disk space in the meantime (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset#file#name). However, this does not work anymore when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this removeTmpFiles.
You should be able to use f <- filename(subset), avoid reading from slots (#). I do not see why you would not be able to remove it. But perhaps it needs some fiddling with the path?
temp files are only created when the raster package deems it necessary, based on RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory)
Another approach is to provide a filename argument to crop. Then you know what the filename is, and you can delete it. Of course you need to take care of not overwriting data from different tasks, so you may need to use some unique id associated with it.
saveRDS( ) won't work if the raster is backed up by a tempfile (as it will disappear).

How to restart R and continue a benchmark script from previous line (on Windows)?

I want to benchmark the time and profile memory used by several functions (regression with random effects and other analysis) applied to different dataset sizes.
My computer has 16GB RAM and I want to see how R behaves with large datasets and what is the limit.
In order to do it I was using a loop and the package bench.
After each iteration I clean the memory with gc(reset=TRUE).
But when the dataset is very large the garbage collector doesn't work properly, it just frees part of the memory.
At the end all the memory stays filled, and I need to restar my R session.
My full dataset is called allDT and I do something like this:
for (NN in (1:10)*100000) {
gc(reset=TRUE)
myDT <- allDT[sample(.N,NN)]
assign(paste0("time",NN), mark(
model1 = glmer(Out~var1+var2+var3+(1|City/ID),data=myDT),
model2 = glmer(Out~var1+var2+var3+(1|ID),data=myDT),
iterations = 1, check=F))
}
That way I can get the results for each size.
The method is not fair because at the end the memory doesn't get properly cleaned.
I've thought an alternative is to restart the whole R program after every iteration (exit R and start it again, this is the only way I've found you can have the memory cleaned), loading again the data and continuing from the last step.
Is there any simple way to do it or any alternative?
Maybe I need to save the results on disk every time but it will be difficult to keep track of the last executed line, specially if R hangs.
I may need to create an external batch file and run a loop calling R at every iteration. Though I prefer to do it everything from R without any external scripting/batch.
One thing I do for benchmarks like this is to launch another instance of R and have that other R instance return the results to stdout (or simpler, just save it as a file).
Example:
times <- c()
for( i in 1:length(param) ) {
system(sprintf("Rscript functions/mytest.r %s", param[i]))
times[i] <- readRDS("/tmp/temp.rds")
}
In the mytest.r file read in parameters and save results to a file.
args <- commandArgs(trailingOnly=TRUE)
NN <- args[1]
allDT <- readRDS("mydata.rds")
...
# save results
saveRDS(myresult, file="/tmp/temp.rds")

Calling external program in parallel using foreach and doSNOW: How to import results?

I'm using R to call an external program in parallel on a cluster with multiple nodes and multiple cores. The external program requires three input data files and produces one output file (all files are stored in the same subfolder).
To run the program in parallel (or rather call it in a parallel fashion) I've initially used the foreach function together with the doParallel library. This works fine as long as I'm just using multiple cores on a single node.
However, I wanted to use multiple nodes with multiple cores. Therefore I modified my code accordingly to use the doSNOW library in conjunction with foreach (I tried Rmpi and doMPI, but I did not manage to run the code on multiple nodes with those libraries).
This works fine, i. e. the external program is now indeed run on multiple nodes (with multiple cores) and the cluster logfile shows, that it produces the required results. The problem I'm facing now, however, is that the external program no longer stores the results/output files on the master node/in the specified subfolder of the working directory (it did so, when I was using doParallel). This makes it impossible for me to import the results into R.
Indeed, if I check the content of the relevant folder it does not contain any output files, despite the logfile clearly showing that the external program ran successfully. I guess they are stored on the different nodes (?).
What modifications do I have to make to either my foreach function or the way I set up my cluster, to get those files saved on the master node/in the specified subfolder in my working directory?
Here some example R code, to showcase, what I'm doing:
# #Set working directory in non-interactive mode
setwd(system("pwd", intern = T))
# #Load some libraries
library(foreach)
library(parallel)
library(doParallel)
# ####Parallel tasks####
# #Create doSNOW cluster for parallel tasks
library(doSNOW)
nCoresPerNode <- as.numeric(Sys.getenv("PBS_NUM_PPN"))
nodeNames <- system("cat $PBS_NODEFILE | uniq", intern=TRUE)
machines <- rep(nodeNames, each = nCoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
# #How many workers are we using?
getDoParWorkers()
#####DUMMY CODE#####
# #The following 3 lines of code are just dummy code:
# #The idea is to create input files for the external program "myprogram"
external_Command_Script.cmd # #command file necessary for external program "myprogram" to run
startdata # #some input data for "myprogram"
enddata # #additional input data for "myprogram"
####DUMMY CODE######
# #Write necessary command and data files for external program: THIS WORKS!
for(i in 1:100)){
write(external_Command_Script.cmd[[i]], file=paste("./mysubfolder/external_Command_Script.",i,".cmd", sep=""))
write.table(startdata, file=paste("./mysubfolder/","startdata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
write.table(enddata, file=paste("./mysubfolder/","enddata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
}
# #Run external program "myprogram" in parallel: THIS WORKS!
foreach(i = 1:100)) %dopar% {
system(paste('(cd ./mysubfolder && ',"myprogram",' ' ,"enddata.",i,".txt ", "startdata.",i,".txt", sep="",' < external_Command_Script.',i,'.cmd)'))
}
# #Import results of external program: THIS DOES NOT WORK WHEN RUN ON MULTIPLE NODES!
results <- list()
for(i in 1:100)){
results[[i]] = read.table(paste("./mysubfolder/","enddata.txt.",i,".log.txt", sep=""), sep = "\t", quote="\"", header = TRUE)
}
# #The import does NOT work as the files created by the external program are NOT stored on the master node/in the
# #subfolder of the working directory!
# #Instead I get the following error message:
# #sh: line 0: cd: ./mysubfolder: No such file or directory
# #Error in { : task 6 failed - "cannot open the connection"
My pbs script for the cluster looks something like this:
#!/bin/bash
# request resources:
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
module add languages/R-3.3.3-ATLAS
export PBS_O_WORKDIR="/panfs/panasas01/gely/xxxxxxx/workingdirectory"
# on compute node, change directory to 'submission directory':
cd $PBS_O_WORKDIR
# run your program and time it:
time Rscript ./R_script.R
I'd like to suggest that you look into batchtools package. It provides methods for interacting with TORQUE / PBS from R.
If you're ok to use it's predecessor BatchJobs for a while, I'd also recommend to try that and when you understand how that works, look into the doFuture foreach adaptor. This will allow you to use the future.BatchJobs package. This combination of doFuture, future.BatchJobs, and BatchJobs allows you do everything from within R and you don't have to worry about creating temporary R scripts etc. (Disclaimer: I'm the author of both).
Example what it'll look like when you've got it set up:
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS with help from BatchJobs
library("future.BatchJobs")
plan(batchjobs_torque)
and then you use:
res <- foreach(i = 1:100) %dopar% {
my_function(pathname[i], arg1, arg2)
}
This will evaluate each iteration in a separate PBS job, i.e. you'll see 100 jobs added to the queue.
The future.BatchJobs vignettes have more examples and info.
UPDATE 2017-07-30: The future.batchtools package is on CRAN since May 2017. This package is now recommended over future.BatchJobs. The usage is very similar to the above, e.g. instead of plan(batchjobs_torque) you now use plan(batchtools_torque).
Problem solved:
I made a mistake: The external program is actually NOT running - I misinterpreted the log file. The reason for the external program to not run is that the subfolder (containing the necessary input data) is not found. It seems that the cluster defaults to the user directory instead of the working directory specified in the pbs submission script. This behaviour is different from clusters created with doParallel, which do indeed recognize the working directory. The problem is therefore solved by just adding the relative path to working directory and subfolder in the R script, i. e. ./workingdirectory/mysubfolder/ instead of just ./mysubfolder/. Alternatively, you can also use the full path to the folder.

Iteratively resave a directory tree of Excel files

I am regularly receiving data from a source that is producing a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply The files can, however, be opened just fine by XLConnect::loadWorkbook. But unfortunately, even with allocating huge amounts of memory for the Java virtual machine, it always crashes after reading a few files. I tried adding these three lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel and then they read in just fine from readxl::read_excel. I'm hoping I could also resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, using Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process.
Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern="xlsx$")
filesPerChunk <- 5
clustSize <- 4 # or how ever many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files)%/%runSize + (length(files)%%runSize != 0)
library(parallel)
sheets <- lapply(seq(runs), function(i) {
runStart <- (i - 1) * runSize + 1
runEnd <- min(length(files), runStart + runSize - 1)
runFiles <- files[runStart:runEnd]
# periodically restart and stop the cluster to deal with memory leaks
cl <- makeCluster(clustSize)
on.exit(stopCluster(cl))
parLapply(cl, runFiles, function(f) {
require(XLConnect)
loadWorkbook(f, ...)
})
})
sheets <- unlist(sheets, recursive=FALSE) # convert list of lists to a simple list

Downloading multiple file as parallel in R

I am trying to download 460,000 files from ftp server ( which I got from the TRMM archive data). I made a list of all files and separated them into different jobs, but can any one help me how to run those jobs at the same time in R. Just an example what I have tried to do
my.list <-readLines("1998-2010.txt") # lists the ftp address of each file
job1 <- for (i in 1: 1000) {
download.file(my.list[i], name[i], mode = "wb")
}
job2 <- for (i in 1001: 2000){
download.file(my.list[i], name[i], mode = "wb")
}
job3 <- for (i in 2001: 3000){
download.file(my.list[i], name[i], mode = "wb")
}
Now I m stuck on how to run all of the Jobs at the same time.
Appreciate your help
Dont do that. Really. Dont. It won't be any faster because the limiting factor is going to be the network speed. You'll just end up with a large number of even slower downloads, and then the server will just give up and throw you off, and you'll end up with a large number of half-downloaded files.
Downloading multiple files will also increase the disk load since now your PC is trying to save a large number of files.
Here's another solution.
Use R (or some other tool, its one line of awk script starting from your list) to write an HTML file which just looks like this:
file-1.dat
file-2.dat
and so on. Now open this file in your web browser and use a download manager (eg DownThemAll for Firefox) and tell it to download all the links. You can specify how many simultaneous downloads, how many times to retry fails and so on with DownThemAll.
A good option is to use the mclapply or parLapply from the builtin parallel package. You then make a function that accepts a list of files that need to be downloaded:
library(parallel)
dowload_list = function(file_list) {
return(lapply(download.file(file_list)))
}
list_of_file_lists = c(my_list[1:1000], my_list[1001:2000], etc)
mclapply(list_of_file_lists, download_list)
I think it is wise to first split up the big list of files into a set a sublists, as for each entry in the list fed to mclapply a process is spawned. If this list is big, and the processing time per item in the list small, the overhead of parallelisation is probably going to make the downloading slower in stead of faster.
Do note that mclapply only works on Linux, parLapply should also work fine under Windows.
First make a while loop which looks for all the destination file. If the current predefined destination file is in the existing destination file, the script creates a new destination file. This will create many destination files which correspond to each one downloading. Next, I parallel the script. If I have 5 cores on my machine, I will get 5 destination files on my disk. I can also use lapply function to do that.
For example:
id <- 0
newDestinationFile <- "File.xlsx"
while(newDestinationFile %in% list.files(path =getwd(),pattern ="[.]xlsx"))
{
newDestinationFile <- paste0("File",id,".xlsx")
id <- id+1
download.file(url = URLS,method ="libcurl",mode ="wb",quiet = TRUE,destfile =newDestinationFile)
}

Resources