Using rm(list=ls()) in a parallel environment in R

I am running code in R that runs a function in parallel. The code first sets some parameters and loads libraries, then calls a function (called Calibrate) that runs on several workers in parallel, with different input parameters on each worker, and returns the results back to the centre. It works, and a number of iterations take place (sometimes more than 100 over a couple of hours), but it crashes after a while, and I suspect that it is a memory resource issue. Hence I want to include an rm-type command to reduce memory usage.
At first the function looked like this:
Calibrate <- function() {
  rm(list = ls())
  gc()
  # ... rest of code calling other functions
}
but this had very little effect. When looking closely and running the code line by line I realised that rm(list=ls()) will do very little inside a function.
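A minimal sketch of why this happens: called without arguments inside a function, ls() lists only that function's own frame, so rm(list = ls()) clears local variables and nothing else. The names `x`, `top`, `local_only` and `clear_inside` are hypothetical and exist only for this illustration.

```r
# Hypothetical variable defined outside the function; we expect it to survive.
x <- 42
top <- environment()  # the environment this script is evaluated in

clear_inside <- function() {
  local_only <- 1   # lives in the function's own frame
  rm(list = ls())   # ls() here lists only that frame, so only local_only goes
  c(local_exists = exists("local_only", inherits = FALSE),
    outer_exists = exists("x", envir = top, inherits = FALSE))
}

result <- clear_inside()
print(result)
```

The local variable is removed, while `x` outside the function is untouched, which matches the observation that the call has very little effect on overall memory use.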
So, I thought I would change the code to:
Calibrate <- function() {
  ENV <- globalenv()
  ll <- ls(envir = ENV)
  lf <- lsf.str(envir = ENV)
  ll <- setdiff(ll, as.character(lf))  # keep only the names that are not functions
  rm(list = ll, envir = ENV)
  # ... rest of code calling other functions
}
This will now get rid of all the variables but not the functions. However, I am worried that this will get rid of all the variables on all the other workers which will still be running. The code runs in parallel but the code does not necessarily run on all the workers at the same speed. So the code is effectively staggered. I only want to remove the variables for an individual worker when the calibrate function is called.
So my question, what should I be doing to clear the variables (rm) for one worker and not the whole system when running in parallel?
Any help is really appreciated.
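The question does not say which parallel backend is used. Assuming a PSOCK cluster from the parallel package, each worker is a separate R process with its own global environment, so clearing globalenv() from code running on one worker should not touch the others. The sketch below illustrates this under that assumption; `big` is a hypothetical object.

```r
library(parallel)

# Assumption: a PSOCK cluster, where every worker is its own R process.
cl <- makeCluster(2)

# Give every worker a (hypothetical) object in its own global environment.
invisible(clusterEvalQ(cl, { big <- rnorm(1e4); NULL }))

# Clear all non-function objects, but only on the first worker.
invisible(clusterEvalQ(cl[1], local({
  env  <- globalenv()
  vars <- setdiff(ls(envir = env), as.character(utils::lsf.str(envir = env)))
  rm(list = vars, envir = env)
  gc()
  NULL
})))

# Ask each worker whether it still has the object.
still_there <- unlist(clusterEvalQ(cl, exists("big", envir = globalenv())))
stopCluster(cl)
print(still_there)
```

Only the first worker loses `big`; the second keeps it. The same reasoning applies when the rm() call sits inside the function that each worker runs.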

Related

RStudio: Statement to clear memory [duplicate]

I was hoping to make a global function that would clear my workspace and dump my memory. I called my function "cleaner" and want it to execute the following code:
remove(list = ls())
gc()
I tried making the function in the global environment but when I run it, the console just prints the text of the function. In my function file to be sourced:
cleaner <- function() {
  remove(list = ls())
  gc()
  # I tried adding return(invisible()) to see if that helped but no luck
}
cleaner()
Even when I make the function in the script I want it to run (cutting out any potential errors with sourcing), the storage dump seems to work, but it still doesn't clear the workspace.
Two thoughts about this: Your code does not delete all objects, to also remove the hidden ones use
rm(list = ls(all.names = TRUE))
There is also the command gctorture() which provokes garbage collection on (nearly) every memory allocation (as the man page said). It's intended for R developers to ferret out memory protection bugs:
cleaner <- function() {
  # Turn it on
  gctorture(TRUE)
  # Clear workspace
  rm(list = ls(all.names = TRUE, envir = sys.frame(-1)),
     envir = sys.frame(-1))
  # Turn it off (important or it gets very slow)
  gctorture(FALSE)
}
If this procedure is used within a function, there is the following problem: Since the function has its own stack frame, only the objects within this stack frame are deleted. They still exist outside. Therefore, it must be specified separately with sys.frame(-1) that only the higher-level stack frame should be considered. The variables are then only deleted within the function that calls cleaner() and in cleaner itself when the function is exited.
But this also means that the function may only be called from the top level in order to work correctly. If really necessary, you can use sys.frames(), which lists all higher-level stack frames, to build something that avoids this restriction.
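As an alternative to sys.frame(-1), one option is to pass the target environment explicitly, so the function no longer depends on where it is called from. This is a sketch of a different technique from the answer above, not the answerer's code; the `envir` argument and the `scratch` environment are assumptions made for illustration.

```r
# Clear every object (including hidden ones) from a given environment.
cleaner <- function(envir = globalenv()) {
  rm(list = ls(envir = envir, all.names = TRUE), envir = envir)
  invisible(gc())
}

# Demonstrate on a throwaway environment so the real workspace is untouched.
scratch <- new.env()
assign("a", 1:10, envir = scratch)
assign(".hidden", "x", envir = scratch)

cleaner(scratch)
remaining <- ls(envir = scratch, all.names = TRUE)
print(length(remaining))
```

Calling cleaner() with no argument would clear the global environment regardless of how deeply nested the call is.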

Working with environments in the parallel package in R

I have an R API that makes use of 5 different R files that define different metrics that I use. Each of those files has a number of tasks that I run using the parallel package since they all use the same data, but with different groupings. To avoid having to create and close the clusters in each file, I took out those commands and put them into a cluster.R file. So the structure I have is basically:
cluster.R —
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  library('dplyr')
  source('helpers.R')
})
.Last <- function() {
stopCluster(cl)
}
Metric1.R —
metric1.function <- function(x, y, z) {
  # dplyr transformations
}
some_date <- date_from_api_input
tasks <- list(
  job1 = function() {metric1.function(data, grouping1, some_date)},
  job2 = function() {metric1.function(data, grouping2, some_date)},
  job3 = function() {metric1.function(data, grouping3, some_date)}
)
clusterExport(cl, c('data', 'metric1.function', 'some_date'), envir = environment())
out <- clusterApplyLB(
  cl,
  tasks,
  function(f) f()
)
bind_rows(out)
This API just creates different metrics that then fills a database table that holds them all. So each metric file contains different functions and inputs but output the same columns and groupings.
Metrics 2-5 are all the same, except that the custom function is different for each file and is defined at the beginning of that file. The problem I’m having is that all metrics are also run in parallel and I’m having issues working with the environments. What ends up happening is that the job will say that some_date isn’t found or that metric2.function isn’t found in metric5.R.
I use plumber to expose R and each time it starts, it sources the cluster.R file, starts up the clusters with their initializations, and listens for any requests that come in.
When running in series, it works just fine for testing and everything passes as expected but in production when our server runs all the scripts in parallel, the variables and functions I've exported in the clusterExport function either don't get passed in or are getting mixed up.
Should I be structuring it in a different fashion or am I using the parallel package incorrectly for my purpose?
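One possible restructuring, offered as a sketch rather than a confirmed fix: clusterExport() assigns into each worker's global environment, so concurrent requests that export the same names (for example `data` or `some_date`) can overwrite one another. Passing everything a task needs as arguments avoids shared names on the workers. The function body, data and grouping values below are hypothetical stand-ins, and `metric_fun` is an argument name chosen for this example.

```r
library(parallel)

cl <- makeCluster(2)

# Hypothetical stand-ins for the objects described in the question.
metric1.function <- function(data, grouping, some_date) {
  data.frame(grouping = grouping, date = some_date, n = nrow(data))
}
data      <- data.frame(v = 1:5)
some_date <- "2020-01-01"
groupings <- c("grouping1", "grouping2", "grouping3")

# Everything the task needs travels with the call as arguments,
# so nothing has to be exported to the workers' global environments.
out <- clusterApplyLB(
  cl, groupings,
  function(g, metric_fun, data, some_date) metric_fun(data, g, some_date),
  metric_fun = metric1.function, data = data, some_date = some_date
)
stopCluster(cl)

result <- do.call(rbind, out)
print(result)
```

With this layout each metric file can send its own function and inputs without depending on what another file exported earlier.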

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
What they all seem to assume is that I have some kind of loop function operating on e.g. every Nth row of a data set divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that what I have is of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage I guess). How can I separate the first 8 jobs out to 8 cores, and whenever a core finishes its job and returns a treated dataset to the global environment it'll simply take whichever job is next in line.
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
list(filePath1, stringArgs11, stringArgs21),
list(filePath2, stringArgs12, stringArgs22),
...
list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.
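A hedged sketch of the parLapply variant: because each element of args is itself a list of arguments, it has to be unpacked with do.call() before being handed to the function. Here `myFunction` is a hypothetical stand-in that just returns its inputs, and `worker_fun` is an argument name chosen for this example.

```r
library(parallel)

# Hypothetical stand-in for myFunction(filepath, string.arg1, string.arg2).
myFunction <- function(filepath, string.arg1, string.arg2) {
  data.frame(file = filepath, arg1 = string.arg1, arg2 = string.arg2)
}

# One inner list of arguments per job.
args <- list(
  list("a.csv", "x1", "y1"),
  list("b.csv", "x2", "y2"),
  list("c.csv", "x3", "y3")
)

cl <- makeCluster(2)
# Unpack each inner list into positional arguments on the worker.
pieces <- parLapply(cl, args,
                    function(a, worker_fun) do.call(worker_fun, a),
                    worker_fun = myFunction)
stopCluster(cl)

# Stack the per-job results into one data frame.
df <- do.call(rbind, pieces)
print(df)
```

If the jobs differ a lot in run time, parLapplyLB() or clusterApplyLB() hand out the next job as soon as a worker becomes free, which is closer to the behaviour asked for in the question.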

R Parallelisation Error unserialize(socklist[[n]])

In a nutshell I am trying to parallelise my whole script over dates using Snow and adply but continually get the below error.
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
I have set up the parallelisation process in the following way:
Cores = detectCores(all.tests = FALSE, logical = TRUE)
cl = makeCluster(Cores, type="SOCK")
registerDoSNOW(cl)
clusterExport(cl, c("Var1","Var2","Var3","Var4"), envir = environment())
exposureDaily <- adply(.data = dateSeries,.margins = 1,.fun = MainCalcFunction,
.expand = TRUE, Var1, Var2, Var3,
Var4,.parallel = TRUE)
stopCluster(cl)
Where dateSeries might look something like
> dateSeries
marketDate
1 2016-04-22
2 2016-04-26
MainCalcFunction is a very long script with several of my own functions contained within it. As the script is so long, reproducing it wouldn't be practical, and a hypothetical small function would defeat the purpose, as I have already got this methodology to work with other, smaller functions. I can say that within MainCalcFunction I call all my libraries, necessary functions, and a file containing all other variables aside from those exported above, so that I don't have to export a long list of libraries and other objects.
MainCalcFunction can run successfully in its entirety over 2 dates using adply but not parallelisation, which tells me that it is not a bug in the code that is causing the parallelisation to fail.
Initially I thought (from experience) that the parallelisation over dates was failing because there was another function within the code that utilised parallelisation, however I have subsequently rebuilt the whole code to make sure that there was no such function.
I have pored over the script with a fine-tooth comb to see if there was any place where I accidentally didn't export something that I needed, and I can't find anything.
Some ideas as to what could be causing the code to fail are:
The use of various option valuation functions in fOptions and RQuantLib
The use of type sock
I am aware of this question already asked and also this question, and while the first question has helped me, it hasn't yet helped solve the problem. (Note: that may be because I haven't used it correctly, having mainly used loginfo("text") to track where the code is. Potentially, there is a way to change that such that I log warning and/or error messages instead?)
Please let me know if there is any other information I can provide to help in solving this. I would be so appreciative if someone could provide some guidance, as the code takes close to 40 minutes to run for a day and I need to run it for close to a year, therefore parallelisation is essential!
EDIT
I have tried to implement the suggestion in the first question included above by utilising the outfile option. Given I am using Windows, I have done this by including the following lines before the exporting of the key objects and running MainCalcFunction :
reportLogName <- paste("logout_parallel.txt", sep="")
addHandler(writeToFile,
file = paste(Save_directory,reportLogName, sep="" ),
level='DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting log file", getwd()))
mc<-detectCores()
cl<-makeCluster(mc, outfile="")
registerDoParallel(cl)
Similarly, at the beginning of MainCalcFunction, after having sourced my libraries and functions I have included the following to print to file:
reportLogName <- paste(testDate,"_logout.txt", sep="")
addHandler(writeToFile,
file = paste(Save_directory,reportLogName, sep="" ),
level='DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting test function ",getwd(), sep = ""))
In the MainCalcFunction function I have then put loginfo("text") statements at key junctures to inform me of where the code is at.
This has resulted in some text files being available after the code fails due to the aforementioned error. However, these text files provide no more information on the cause of the error aside from at what point. This is despite having a tryCatch statement embedded in MainCalcFunction where at the end, on any instance of error I have added the line logerror(e)
I am posting this answer in case it helps anyone else with a similar problem in the future.
Essentially, the error unserialize(socklist[[n]]) doesn't tell you a lot, so to solve it it's a matter of narrowing down the issue.
Firstly, be absolutely sure the code runs over several dates in non-parallel with no errors.
Ensure the parallelisation is set up correctly. There are some obvious initial errors that many other questions respond to, e.g., hidden parallelisation inside the code, which means parallelisation is occurring twice.
Once you are sure that there is no problem with the code and the parallelisation is set up correctly, start narrowing down. The issue is likely (unless something has been missed above) something in the code that isn't a problem when it is run in serial but becomes a problem when run in parallel. The easiest way to narrow down is to set outfile = "Log.txt" in whichever makeCluster function you use, e.g., cl<-makeCluster(cores-1, outfile="Log.txt"). Then add as many print("Point in code") statements to your function as you need to narrow down where the issue is occurring.
In my case, the problem was the line jj = closeAllConnections(). This line works fine in non-parallel but breaks the code when in parallel. I suspect it has something to do with the function closing all connections including socket connections that are required for the parallelisation.
Try running using plain R instead of running in RStudio.
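The outfile approach described above can be sketched as follows, assuming the parallel package's makeCluster(). The log location and the `step_fun` worker function are hypothetical and stand in for the real MainCalcFunction.

```r
library(parallel)

# Hypothetical log location; the answer above uses "Log.txt".
log_file <- tempfile(fileext = ".txt")

# Worker output (print, message) is redirected to this file.
cl <- makeCluster(2, outfile = log_file)

# Hypothetical worker function with progress markers at key points.
step_fun <- function(i) {
  print(paste("Point in code: start of task", i))
  res <- i^2
  print(paste("Point in code: end of task", i))
  res
}

out <- parLapply(cl, 1:4, step_fun)
stopCluster(cl)

# Inspect the markers afterwards, e.g. with readLines(log_file).
# Workers may only flush their output once they have shut down.
print(unlist(out))
```

The last marker that appears in the log before a crash shows roughly where in the function the failing call sits.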

%dopar% parallel foreach loop fails to exit when called from inside a function (R)

I have written the following code (running in RStudio for Windows) to read a long list of very large text files into memory using a parallel foreach loop:
open.raw.txt <- function() {
files <- choose.files(caption="Select .txt files for import")
cores <- detectCores() - 2
registerDoParallel(cores)
data <- foreach(file.temp = files[1:length(files)], .combine = cbind) %dopar%
as.numeric(read.table(file.temp)[, 4])
stopImplicitCluster()
return(data)
}
Unfortunately, however, the function fails to complete, and debugging shows that it gets stuck at the foreach loop stage. Oddly, Windows Task Manager indicates that I am at close to full processor capacity (I have 32 cores, and this should use 30 of them) for around 10 seconds, then usage drops back to baseline. However, the loop never completes, which suggests that it is doing the work and then getting stuck.
Even more bizarrely, if I remove the 'function' bit and just run each step one-by-one as follows:
files <- choose.files(caption="Select .txt files for import")
cores <- detectCores() - 2
registerDoParallel(cores)
data <- foreach(file.temp = files[1:length(files)], .combine = cbind) %dopar%
as.numeric(read.table(file.temp)[, 4])
stopImplicitCluster()
Then it all works fine. What is going on?
Update: I ran the function and then left it for a while (around an hour) and finally it completed. I am not quite sure how to interpret this, given that multiple cores are still only used for the first 10 seconds or so. Could the issue be with how the tasks are being shared out? Or maybe memory management? I'm new to parallelism, so not sure how to investigate this.
The problem is that you have multiple processes opening and closing the same file. Usually, when a file is opened by one process it is locked to other processes, which prevents reading the file in parallel.
