I am running R version 3.4.1 on Windows 7.
I need to run a certain program using N=500 bootstrapped data files that I have stored in a working directory. The executable produces 3 more data files for each of the N runs, all of which end up in the same working directory. I'd like to use parallel computing to speed this process up.
The loop requires one starting file ('starter') that gets modified at every iteration. This step is essential and I can't get around it. At every iteration any pre-existing output files are removed and the executable is called with system(). The executable produces the three data files called Report, CompReport, and covar.
In a normal loop the process looks something like this:
starter <- SS_readstarter(file="starter.ss") # read starter file
for(iboot in 1:N){
  # point the starter file at the current bootstrap data file
  starter$datfile <- paste("BootData", iboot, ".ss", sep="")
  # replace the starter file with the modified version
  SS_writestarter(starter, overwrite=TRUE)
  # any old files are removed
  file.remove("Report.sso", paste("Report_",iboot,".sso",sep=""))
  file.remove("CompReport.sso", paste("CompReport_",iboot,".sso",sep=""))
  file.remove("covar.sso", paste("covar_",iboot,".sso",sep=""))
  # then the program is called
  system("ss3")
  # the files produced get copied to iteration-specific names in the working directory
  file.copy("Report.sso", paste("Report_",iboot,".sso",sep=""))
  file.copy("CompReport.sso", paste("CompReport_",iboot,".sso",sep=""))
  file.copy("covar.sso", paste("covar_",iboot,".sso",sep=""))
}
The process above takes 5 days to complete, so I've been exploring options for parallelization. What I've tried so far doesn't work, so any help or suggestions are appreciated. My idea was to use a foreach loop while making sure the iterations don't read and overwrite the same starter file. This is what I have:
starter_strt <- SS_readstarter(file="starter.ss") # read starter file
iter <- 2 #just for an example here even though I said N=500 above
cores <- parallel::detectCores()-1
cl <- makeCluster(cores)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())
foreach(iboot=1:iter, .packages='r4ss', .export="starter_strt", system("ss3")) %dopar% {
  # code to create the starter file so I don't overwrite it
  datfile <- data.frame(rep(paste("BootData",iboot,".ss",sep=""),2))
  colnames(datfile) <- "datfile"
  starter_list <- cbind(starter_strt, datfile)
  # replace starter file with modified version
  SS_writestarter(starter_list, overwrite=TRUE)
  # delete files as above
  # run program
  system("ss3")
  # the files produced get pushed to the working directory
  file.copy("Report.sso", paste("Report_",iboot,".sso",sep=""))
  file.copy("CompReport.sso", paste("CompReport_",iboot,".sso",sep=""))
  file.copy("covar.sso", paste("covar_",iboot,".sso",sep=""))
}
The program runs through once but then I get a warning (see below) and the expected files are missing from the directory.
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): starter_strt
I'm pretty stumped at this point. I saw the example "Running multiple R scripts using the system() command", where the system command is called inside a parLapply function. Maybe that is what I need? Help and/or suggestions are greatly appreciated.
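For illustration, here is roughly what I imagine such a parLapply version would look like, with each iteration running in its own subdirectory so that starter.ss and the fixed output file names never collide. This is an untested sketch: the boot_runs directory layout is made up, and it assumes ss3 is on the PATH and that the files copied below are all it needs.
# Untested sketch: run each bootstrap in its own subdirectory via parLapply
library(parallel)
library(r4ss)

run_one_boot <- function(iboot){
  rundir <- file.path("boot_runs", iboot)
  dir.create(rundir, recursive=TRUE, showWarnings=FALSE)
  # copy the inputs the executable needs into the run directory
  # (add any other model files ss3 requires here)
  file.copy(c("starter.ss", paste0("BootData", iboot, ".ss")), rundir, overwrite=TRUE)
  olddir <- setwd(rundir)
  on.exit(setwd(olddir))
  # point the local starter file at this iteration's bootstrap data
  starter <- SS_readstarter(file="starter.ss")
  starter$datfile <- paste0("BootData", iboot, ".ss")
  SS_writestarter(starter, overwrite=TRUE)
  system("ss3")  # assumes ss3 is on the PATH
  # copy the outputs back to the main directory under iteration-specific names
  file.copy("Report.sso", file.path(olddir, paste0("Report_", iboot, ".sso")))
  file.copy("CompReport.sso", file.path(olddir, paste0("CompReport_", iboot, ".sso")))
  file.copy("covar.sso", file.path(olddir, paste0("covar_", iboot, ".sso")))
  iboot
}

cl <- makeCluster(parallel::detectCores() - 1)
clusterEvalQ(cl, library(r4ss))
parLapply(cl, 1:500, run_one_boot)  # N = 500
stopCluster(cl)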
I use R to run Ant Colony Optimization and usually repeat the same optimization several times to cross-validate my results. I want to save time by running the processes in parallel with the foreach and doParallel packages.
A reproducible example of my code would be very long so I'm hoping this is sufficient. I think I managed to get the code running like this:
result <- list()
short <- function(n){
  for(n in 1:10){
    result[[n]] <- ACO(data, ...)
  }
}
foreach(n=1:50) %dopar% short(n)
Within the ACO() function I continuously create objects with intermediate results (e.g. the current pheromone levels) which I save using write.table(..., append=TRUE) to keep track of the iterations and their results. Now that I'm running the processes in parallel, the file I write contains results from all processes and I'm not able to tell which process the data belongs to. Therefore, I'd like to write different files for each process.
What's the best way, in general, to save intermediate results when using parallel processing?
You can use the log4r package to write the information you need to a log file. More info about the package is available here.
An example of the code you would put in your short() function:
# Import the log4r package.
library('log4r')
# Create a new logger object with create.logger().
logger <- create.logger()
# Set the logger's file output.
logfile(logger) <- 'base.log'
# Set the current level of the logger.
level(logger) <- 'INFO'
# Try logging messages with different priorities.
# At priority level INFO, a call to debug() won't print anything.
debug(logger, 'Iteration and result info')
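Note that if every worker writes to the same 'base.log' you run into the same interleaving problem as with write.table. One simple workaround (a sketch, not a fixed recipe) is to build the log file name from the worker's process id inside short(), so each parallel process gets its own file:
# Sketch: one log file per worker process, named after its process id
library(log4r)
logger <- create.logger()
logfile(logger) <- paste0("aco_worker_", Sys.getpid(), ".log")
level(logger) <- 'INFO'
info(logger, paste('finished repetition', n))  # n comes from the surrounding loop in short()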
I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e4^2), 1e4)  # ~800 MB, big enough that loading it is slow
saveRDS(big_data, file = "big_data.Rds")
# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data + rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`
# run this program a bunch
for (i in 1:1000){
  system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R`, then open a terminal and do
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data+1
print("yahoo!")
# save this as `my_improved_program.R`
# now do the following:
for (i in 1:1000){
system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, which I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
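For the example above that would be something like (object and file names taken from the question):
# save without compression; readRDS() detects the format automatically
saveRDS(big_data, file = "big_data.Rds", compress = FALSE)
big_data <- readRDS("big_data.Rds")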
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)
I'm using R to call an external program in parallel on a cluster with multiple nodes and multiple cores. The external program requires three input data files and produces one output file (all files are stored in the same subfolder).
To run the program in parallel (or rather, call it in a parallel fashion) I initially used the foreach function together with the doParallel library. This works fine as long as I'm just using multiple cores on a single node.
However, I wanted to use multiple nodes with multiple cores. Therefore I modified my code accordingly to use the doSNOW library in conjunction with foreach (I tried Rmpi and doMPI, but I did not manage to run the code on multiple nodes with those libraries).
This works fine, i.e. the external program is now indeed run on multiple nodes (with multiple cores) and the cluster logfile shows that it produces the required results. The problem I'm facing now, however, is that the external program no longer stores the results/output files on the master node/in the specified subfolder of the working directory (it did so when I was using doParallel). This makes it impossible for me to import the results into R.
Indeed, if I check the content of the relevant folder, it does not contain any output files, despite the logfile clearly showing that the external program ran successfully. I guess they are stored on the other nodes (?).
What modifications do I have to make to either my foreach function or the way I set up my cluster, to get those files saved on the master node/in the specified subfolder in my working directory?
Here is some example R code to showcase what I'm doing:
# Set working directory in non-interactive mode
setwd(system("pwd", intern = TRUE))
# Load some libraries
library(foreach)
library(parallel)
library(doParallel)
#### Parallel tasks ####
# Create doSNOW cluster for parallel tasks
library(doSNOW)
nCoresPerNode <- as.numeric(Sys.getenv("PBS_NUM_PPN"))
nodeNames <- system("cat $PBS_NODEFILE | uniq", intern=TRUE)
machines <- rep(nodeNames, each = nCoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
# How many workers are we using?
getDoParWorkers()
#### DUMMY CODE ####
# The following 3 lines of code are just dummy code:
# the idea is to create input files for the external program "myprogram"
external_Command_Script.cmd # command file necessary for external program "myprogram" to run
startdata # some input data for "myprogram"
enddata # additional input data for "myprogram"
#### DUMMY CODE ####
# Write necessary command and data files for external program: THIS WORKS!
for(i in 1:100){
  write(external_Command_Script.cmd[[i]], file=paste("./mysubfolder/external_Command_Script.",i,".cmd", sep=""))
  write.table(startdata, file=paste("./mysubfolder/","startdata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
  write.table(enddata, file=paste("./mysubfolder/","enddata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
}
# Run external program "myprogram" in parallel: THIS WORKS!
foreach(i = 1:100) %dopar% {
  system(paste0('(cd ./mysubfolder && myprogram enddata.', i, '.txt startdata.', i, '.txt < external_Command_Script.', i, '.cmd)'))
}
# Import results of external program: THIS DOES NOT WORK WHEN RUN ON MULTIPLE NODES!
results <- list()
for(i in 1:100){
  results[[i]] <- read.table(paste("./mysubfolder/","enddata.txt.",i,".log.txt", sep=""), sep = "\t", quote="\"", header = TRUE)
}
# The import does NOT work as the files created by the external program are NOT stored
# on the master node/in the subfolder of the working directory!
# Instead I get the following error message:
# sh: line 0: cd: ./mysubfolder: No such file or directory
# Error in { : task 6 failed - "cannot open the connection"
My PBS script for the cluster looks something like this:
#!/bin/bash
# request resources:
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
module add languages/R-3.3.3-ATLAS
export PBS_O_WORKDIR="/panfs/panasas01/gely/xxxxxxx/workingdirectory"
# on compute node, change directory to 'submission directory':
cd $PBS_O_WORKDIR
# run your program and time it:
time Rscript ./R_script.R
I'd like to suggest that you look into the batchtools package. It provides methods for interacting with TORQUE / PBS from R.
If you're OK with using its predecessor BatchJobs for a while, I'd also recommend trying that and, when you understand how it works, looking into the doFuture foreach adaptor. This will allow you to use the future.BatchJobs package. The combination of doFuture, future.BatchJobs, and BatchJobs allows you to do everything from within R, and you don't have to worry about creating temporary R scripts etc. (Disclaimer: I'm the author of both.)
Example of what it'll look like when you've got it set up:
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS with help from BatchJobs
library("future.BatchJobs")
plan(batchjobs_torque)
and then you use:
res <- foreach(i = 1:100) %dopar% {
my_function(pathname[i], arg1, arg2)
}
This will evaluate each iteration in a separate PBS job, i.e. you'll see 100 jobs added to the queue.
The future.BatchJobs vignettes have more examples and info.
UPDATE 2017-07-30: The future.batchtools package has been on CRAN since May 2017 and is now recommended over future.BatchJobs. The usage is very similar to the above; e.g. instead of plan(batchjobs_torque) you now use plan(batchtools_torque).
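With future.batchtools the setup would then presumably look like this (mirroring the block above):
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS via batchtools
library("future.batchtools")
plan(batchtools_torque)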
Problem solved:
I made a mistake: the external program was actually NOT running - I misinterpreted the log file. The reason the external program did not run is that the subfolder (containing the necessary input data) was not found. It seems that the cluster defaults to the user's home directory instead of the working directory specified in the PBS submission script. This behaviour differs from clusters created with doParallel, which do recognize the working directory. The problem is therefore solved by adding the relative path to the working directory and subfolder in the R script, i.e. ./workingdirectory/mysubfolder/ instead of just ./mysubfolder/. Alternatively, you can also use the full path to the folder.
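In code, that just means building the path to the subfolder before the foreach loop, for instance like this (a sketch that reuses the PBS_O_WORKDIR variable exported in the PBS script above; a hard-coded absolute path works just as well):
# Evaluated on the master node and then exported to the workers by foreach,
# so every worker uses the same absolute path to the subfolder.
subfolder <- file.path(Sys.getenv("PBS_O_WORKDIR"), "mysubfolder")
foreach(i = 1:100) %dopar% {
  system(paste0('(cd ', subfolder, ' && myprogram enddata.', i, '.txt startdata.', i,
                '.txt < external_Command_Script.', i, '.cmd)'))
}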
I am regularly receiving data from a source that produces a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply. The files can, however, be opened just fine by XLConnect::loadWorkbook. But unfortunately, even with allocating huge amounts of memory for the Java virtual machine, it always crashes after reading a few files. I tried adding these three lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel and then they read in just fine from readxl::read_excel. I'm hoping I could also resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, using Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process (a rough sketch of this follows after the parLapply example below).
Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern="xlsx$")
filesPerChunk <- 5
clustSize <- 4 # or how ever many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files)%/%runSize + (length(files)%%runSize != 0)
library(parallel)
sheets <- lapply(seq(runs), function(i) {
  runStart <- (i - 1) * runSize + 1
  runEnd <- min(length(files), runStart + runSize - 1)
  runFiles <- files[runStart:runEnd]
  # periodically restart and stop the cluster to deal with memory leaks
  cl <- makeCluster(clustSize)
  on.exit(stopCluster(cl))
  parLapply(cl, runFiles, function(f) {
    require(XLConnect)
    loadWorkbook(f, ...)
  })
})
sheets <- unlist(sheets, recursive=FALSE) # convert list of lists to a simple list
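And a rough sketch of the former (the helper script name read_one.R is made up for illustration, and it uses saveRDS/readRDS rather than .RData files for simplicity):
## --- read_one.R: read a single workbook with XLConnect and save it as .rds ---
args <- commandArgs(trailingOnly = TRUE)
library(XLConnect)
wb <- loadWorkbook(args[1])
dat <- readWorksheet(wb, sheet = 1)   # adjust to however many sheets you need
saveRDS(dat, file = paste0(args[1], ".rds"))

## --- in the master R session: one fresh JVM per file, then collect the results ---
files <- list.files(pattern = "xlsx$")
for (f in files) system(paste("Rscript read_one.R", shQuote(f)))
sheets <- lapply(paste0(files, ".rds"), readRDS)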
I have a directory containing ~40000 csv files, each ranging in size from ~400 bytes to ~11 MB. I have written a function that reads in a csv file and calculates some basic numbers for each one (e.g. how many values of "female" are in each csv file). This code ran successfully for the same number of csv files when the files were smaller.
I'm using the packages parallel and doParallel to run this on my machine and receive the following error:
Error in unserialize(node$con) : error reading from connection
I suspect I'm running out of memory but I am not sure how best to handle the increased size of the files I'm working with. My code is as follows:
Say that 'fpath' is the path to my directory where all these csvs live:
fpath<-"../Desktop/bigdirectory"
filelist <- as.vector(list.files(fpath,pattern="site.csv"))
(f <- file.path(fpath,filelist))
# demographics csv in specified directory
dlist <- as.vector(list.files(fpath,pattern="demographics.csv"))
(d <- file.path(fpath,dlist))
demos <- fread(d,header=T,sep=",")
cl <- makeCluster(4)
registerDoParallel(cl)
setDefaultCluster(cl)
clusterExport(NULL,c('transit.grab'))
clusterEvalQ(NULL,library(data.table))
sdemos <- demo.grab(dl[[1]],demos)
stmp <- parLapply(cl,f,FUN=transit.grab,sdemos)
And the function 'transit.grab' is the following:
transit.grab <- function(sitefile, selected.demos){
  require(sqldf)
  demos <- selected.demos
  # Renaming sitefile columns
  sf <- fread(sitefile, header=T, sep=",")
  names(sf)[1] <- c("id")
  # Selecting transits from site file using list of selected users
  sdat <- sqldf('select sf.* from sf inner join demos on sf.id=demos.id')
  return(sdat)
}
I'm not looking for someone to debug code, as I know it runs properly for a smaller amount of data, but rather I desperately need suggestions on how to implement this code for ~6.7 GB of data. Any and all feedback is welcome, thanks!
UPDATE:
As suggested, I replaced sqldf() with merge(), which reduced my computation time by half when tested on a smaller directory. When observing my memory usage via Activity Monitor, the trend is pretty flat. But now, when I try running my code on the large directory, my R session crashes.
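For reference, a merge()-based version of transit.grab as described in the update might look roughly like this (a sketch; note that unlike the sqldf query, merge() also keeps the demographics columns unless you drop them afterwards):
# Sketch: sqldf join replaced by merge(), i.e. an inner join on "id"
transit.grab <- function(sitefile, selected.demos){
  require(data.table)
  # Renaming sitefile columns
  sf <- fread(sitefile, header = TRUE, sep = ",")
  names(sf)[1] <- "id"
  # Inner join on id, replacing the earlier sqldf() call
  sdat <- merge(sf, selected.demos, by = "id")
  return(sdat)
}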