Best way to pass local variables to ipyparallel cluster - jupyter-notebook

I'm running a simulation in an ipython notebook that is composed of seven functions that are dependent of each other, and requires 13 different parameters. Some of the functions are called within other functions to allow one function to run the entire simulation. The simulation involves manipulating two parameters for a total of >20k iterations. Two simulations can be run asynchronously. Since each iteration is taking ~1.5 seconds, I'm investigating parallel processing.
When I first tried ipyparallel, I got a global name not defined error. Makes sense that local objects can't been found a worker. In an effort to avoid spending quite a bit of time going down a rabbit hole, what would be the easiest way to pass a whole bunch of objects to all of the workers? Are there other gotchas to consider when using ipyparallel in this way?

There is a bit more detail in this related question, but the gist is: interactively defined modules resolve in the interactive namespace (__main__), which is different on the engine and client. You can send functions to the engine with view.push(dict(func=func, func2=func2)), in which case they will be found. The alternative is to define your functions in a module or package that you ensure is installed on all the engines.
For instance, in a script:
def bar(x):
return x * x
def foo(y):
return bar(y)
view.apply(foo, 5) # NameError on bar
view.push(dict(bar=bar)) # send bar
view.apply(foo, 5) # 25
Often when using IPython parallel from a notebook or larger script, one of the early steps is seeding the namespace of the engines:
rc[:].push(dict(
f1=f1,
f2=f2,
const=const,
))
If you have more than a few names to push this way, it might be time to consider defining these functions in a module, and distributing that instead.

Related

Julia parallel processing on PBS multiple nodes

I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between process) across multiple nodes in a PBS cluster.
Currently I am able to run it on a single node setting the number of threads with an environment variable in the PBS script, and using a for loop with #thread.threads
I have found references to clustermanager.jl, but no clear working example on how to use it on PBS.
For example: does addprocs_pbs in the file take care also of the script part, or do I still need to run a pbs script as usual, and this function is called inside the julia file?
This is the code structure I am using now. Ideally, it would stay more or less the same but parallel process could run across multiple nodes.
using JLD
include("path/to/library/with/function.jl")
seed = 342;
n = 18; # number of simulations
changing_parameter = [1,2,3,4];
input_file = "some file"
CSV.read(string(input_files_folder,input_file));
# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter
Random.seed!(seed);
seeds = rand(1:100000,n)
results = []
Threads.#threads for i = 1:n
push!(results,function(some_fixed_parameters, P=P, seed=seeds[i]);)
end
# get the results
# save the results
JLD.save(filename,to_save,compress=true)
end
For distributed computing you normally need to use multiprocessing rather than multi-threading (although it is OK to have multi-threaded parallel processes if you need).
Hence, what you need to do is to use the ClustersManagers library to use the cluster manager to allocate processes for your Julia cluster.
I have been using Julia with Cray clusters using SLURM so not exactly PBS, however I since your question remain unanswered here is my working code. You will use addprocs_pbs that looks to have a very similar structure.
using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")
Once you add the worker processes all what remains is to use the Distributed package to orchestrate your workload.

Is a function like `purrr::map()` within `future.apply::future_apply()` also being ran in parallel?

Sorry if these are a dumb questions, but I know next to nothing about how parallel processing works in practice.
My questions are:
- Q1. Is a function like purrr::map() within future.apply::future_apply() also being ran in parallel?
- Q2. What happens if I run furrr::future_map() inside of a future.apply() function?
- Q3. Assuming I did the above, would I include another plan(multiprocess) call before furrr::future_map()?
Author of the future framework here.
Q1. Is a function like purrr::map() within future.apply::future_apply() also being ran in parallel?
No. There is nothing in 'purrr' that runs in parallel.
Q2. What happens if I run furrr::future_map() inside of a future.apply() function?
It will fall back to run sequentially, which is plan(sequential). The reason for this is to protect against recursive, nested parallelism, which is rarely wanted. This is explained in the future vignette 'A Future for R: Future Topologies'. In some cases it is reasonable to nested parallelism, e.g. distributed processing on multiple machines where you in turn parallel across multiple cores on each machine. This can be done by using
plan(list(tweak(cluster, workers = c("n1", "n2", "n3")), multisession))
Q3. Assuming I did the above, would I include another plan(multiprocess) call before furrr::future_map()?
You don't want to set plan() "inside" you code / functions. Leave the control of plan() to whoever will use your code/call your functions. Also, one doesn't want to for a nested number of cores such as in plan(list(tweak(multisession, workers = ncores), tweak(multisession, workers = ncores))) because that will use ncores^2 cores which will overload you computer. Using the default number of cores as plan(list(multisession, multisession)) will not have this problem, because in the second layer there will be only one core available anyway.

R: clarification on memory management

Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as say svm. The random subset of the matrix will only be known at runtime. Additionally there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function (bigm, inTrain, moreParamsList) {
parsList = c(list(data=bigm[inTrain, ]), moreParamsList)
do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Any functions and heuristics of tracking memory location and consumption of objects being made in different cores?
Thanks.
I am just going to put in here what I find from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm based on this from the manual for multicore:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel
to the master (parent) process. At the point of forking both processes share exactly the
same state including the workspace, global options, loaded packages etc. Forking is
relatively cheap in modern operating systems and no real copy of the used memory is
created, instead both processes share the same memory and only modified parts are copied.
This makes fork an ideal tool for parallel processing since there is no need to setup the
parallel working environment, data and code is shared automatically from the start.
For your first part of the question, you can use tracemem :
This function marks an object so that a message is printed whenever the internal code copies the object
Here an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very comfortable, see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather like to be able to start running N tasks on N workers, and then, as each tasks finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serial in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
you have at least 2 possibilities:
As mentioned above you can use mcparallel's parameters "mc.cores" or "mc.affinity".
On AMD platforms "mc.affinity" is preferred since two cores share same clock.
For example an FX-8350 has 8 cores, but core 0 has same clock as core 1. If you start a task for 2 cores only it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" makes that. The price is loosing load balancing.
"mc.affinity" is present in recent versions of the package. See changelog to find when introduced.
Also you can use OS's tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
Here you make your script to run on cores 0 and 1 only.
Keep in mind Linux numbers its cores starting from "0". Package parallel conforms to R's indexing and first core is core number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to TRUE and the mc.cores parameter set to the number of threads you want to use at a time. Each time a task finishes and a thread closes, a new thread will be created, operating on the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
wrapper <- function(f,inx){f(inx)}
output <- mclapply(params,FUN=calling,mc.preschedule=TRUE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.

How can I label my sub-processes for logging when using multicore and doMC in R

I have started using the doMC package for R as the parallel backend for parallelised plyr routines.
The parallelisation itself seems to be working fine (though I have yet to properly benchmark the speedup), my problem is that the logging is now asynchronous and messages from different cores are getting mixed in together. I could created different logfiles for each core, but I think I neater solution is to simply add a different label for each core. I am currently using the log4r package for my logging needs.
I remember when using MPI that each processor got a rank, which was a way of distinguishing each process from one another, so is there a way to do this with doMC? I did have the idea of extracting the PID, but this does seem messy and will change for every iteration.
I am open to ideas though, so any suggestions are welcome.
EDIT (2011-04-08): Going with the suggestion of one answer, I still have the issue of correctly identifying which subprocess I am currently inside, as I would either need separate closures for each log() call so that it writes to the correct file, or I would have a single log() function, but have some logic inside it determining which logfile to append to. In either case, I would still need some way of labelling the current subprocess, but I am not sure how to do this.
Is there an equivalent of the mpi_rank() function in the MPI library?
I think having multiple process write to the same file is a recipe for a disaster (it's just a log though, so maybe "disaster" is a bit strong).
Often times I parallelize work over chromosomes. Here is an example of what I'd do (I've mostly been using foreach/doMC):
foreach(chr=chromosomes, ...) %dopar% {
cat("+++", chr, "+++\n")
## ... some undoubtedly amazing code would then follow ...
}
And it wouldn't be unusual to get output that tramples over each other ... something like (not exactly) this:
+++chr1+++
+++chr2+++
++++chr3++chr4+++
... you get the idea ...
If I were in your shoes, I think I'd split the logs for each process and set their respective filenames to be unique with respect to something happening in that process's loop (like chr in my case above). Collate them later if you must ... ie. map/reduce your log files :-)

Resources