What is the difference between the warmups attribute of @Fork and the @Warmup annotation in JMH?

I'm learning JMH benchmarking using this tutorial.
I noticed that there are two warmup-related settings for the function benchMurmur3_128 there.
So my question is: what is the difference between the warmups attribute of the @Fork annotation and the @Warmup annotation with its iterations attribute?

With a JMH benchmark you run one or more forks sequentially, and one or more iterations of your benchmark code within each fork. There are two forms of warmup associated with this:
At the fork level, the warmups attribute of @Fork specifies how many warmup forks to run before the measured forks. Warmup forks are ignored when creating the benchmark results.
The @Warmup annotation lets you specify warmup characteristics within a fork, including how many warmup iterations to run. Warmup iterations are likewise ignored when creating the benchmark results.
For example:
@Fork(value = 3, warmups = 2) means that 5 forks will be run sequentially. The first two will be warmup runs which will be ignored, and the final 3 will be used for benchmarking.
@Warmup(iterations = 5, time = 55, timeUnit = TimeUnit.MILLISECONDS) means that there will be 5 warmup iterations within each fork. The timings from these runs will be ignored when producing the benchmark results.
@Measurement(iterations = 4, time = 44, timeUnit = TimeUnit.MILLISECONDS) means that your benchmark iterations will be run 4 times (after the 5 warmup iterations).
So the overall impact of the warmup settings shown above is that:
Only the final three of the five forks will be used for the benchmark results.
Only the final four iterations within each non-warmup fork will be used for the benchmark results.
That is why JMH output from a run using those annotations against the benchmarked method shows Cnt 12 at the end of the run: 3 forks x 4 iterations = 12.

Related

How to parallelize future_pmap() across multiple slurm nodes

I have access to a large computing cluster with many nodes each of which has >16 cores, running Slurm 20.11.3. I want to run a job in parallel using furrr::future_pmap(). I can parallelize across multiple cores on a single node but I have not been able to figure out the correct syntax to take advantage of cores on multiple nodes. See this related question.
Here is a reproducible example where I made a function that sleeps for 5 seconds and returns the starting time, ending time, and the node name.
library(furrr)

# Set up parallel processing
options(mc.cores = 64)
plan(
  list(tweak(multicore, workers = 16),
       tweak(multicore, workers = 16),
       tweak(multicore, workers = 16),
       tweak(multicore, workers = 16))
)

fake_fn <- function(x) {
  t1 <- Sys.time()
  Sys.sleep(x)
  t2 <- Sys.time()
  hn <- system2('hostname', stdout = TRUE)
  data.frame(start = t1, end = t2, hostname = hn)
}

stuff <- data.frame(x = rep(5, 64))
output <- future_pmap_dfr(stuff, function(x) fake_fn(x))
I ran the job using salloc --nodes=4 --ntasks=64 and then ran the above R script interactively.
The script runs in about 20 seconds and returns the same hostname for all rows, indicating that it is running 16 iterations simultaneously on one node but not 64 iterations simultaneously split across 4 nodes as intended. How should I change the plan() syntax so that I can take advantage of the multiple nodes?
edit: I also tried a couple other things:
I replaced multicore with multisession, but saw no difference in output.
I replaced the plan(list(...)) with plan(cluster(workers = availableWorkers())) but it just hangs.
options(mc.cores = 64)
plan(
  list(tweak(multicore, workers = 16),
       tweak(multicore, workers = 16),
       tweak(multicore, workers = 16),
       tweak(multicore, workers = 16))
)
Sorry, this does not work. When you specify a list of future strategies like this, you are specifying what should be used in nested future calls. In your future_pmap_dfr() example, only the first level in this list is used; the other three levels are never used. See https://future.futureverse.org/articles/future-3-topologies.html for more details.
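If your goal is one level of parallelization across nodes and a second level across the cores within each node, a nested topology would look something like the sketch below (assuming the node names returned by availableWorkers() are reachable, e.g. via SSH; note that future_pmap_dfr() itself will still only use the outer level):

library(future)

# Sketch of a two-level topology: outer level is one cluster worker per
# node, inner level is 16 R sessions per node. The inner level only
# applies to futures created inside the function you map over.
plan(list(
  tweak(cluster, workers = unique(availableWorkers())),
  tweak(multisession, workers = 16)
))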
I replaced ... with plan(cluster(workers = availableWorkers())) ...
Yes,
plan(cluster, workers = availableWorkers())
which is equivalent to the default,
plan(cluster)
is the correct attempt here.
... but it just hangs.
There could be two things going on here. The first is that the workers are set up one by one, so if you have lots of them it will take quite a while for plan() to complete. I recommend trying with only two workers first to confirm whether or not it works. You can also turn on debug output to see what happens, i.e.
library(future)
options(parallelly.debug = TRUE)
plan(cluster)
Second, using a PSOCK cluster across nodes requires that you have SSH access to those parallel workers. Not all HPC environments support that; for example, they might prevent users from SSHing into compute nodes. This could also be what you're experiencing. As above, turn on debugging to figure out where it stalls.
Now, even if you get this working, you will face a limitation in R that caps you at 125 parallel workers, and typically a bit fewer. You can read more about this limit at https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28. It also shows that one can tweak the R source code and recompile to increase this limit to thousands.
An alternative to the above is to use the future.batchtools package:
plan(future.batchtools::batchtools_slurm, workers = availableCores())
This would result in the tasks in future_pmap_dfr() being resolved via n = availableCores() Slurm jobs. Of course, this comes with the extra overhead of the scheduler, e.g. queueing, launching, running, finishing, and reading the data back.
BTW, the best place to discuss these things is on https://github.com/HenrikBengtsson/future/discussions.

Julia parallel processing on PBS multiple nodes

I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between processes) across multiple nodes in a PBS cluster.
Currently I am able to run it on a single node by setting the number of threads with an environment variable in the PBS script and using a for loop with Threads.@threads.
I have found references to ClusterManagers.jl, but no clear working example of how to use it on PBS.
For example: does addprocs_pbs in the Julia file also take care of the script part, or do I still need to submit a PBS script as usual, with this function called inside the Julia file?
This is the code structure I am using now. Ideally, it would stay more or less the same but parallel process could run across multiple nodes.
using JLD, CSV, Random

include("path/to/library/with/function.jl")

seed = 342;
n = 18; # number of simulations
changing_parameter = [1,2,3,4];
input_file = "some file"
CSV.read(string(input_files_folder, input_file));

# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter
    Random.seed!(seed);
    seeds = rand(1:100000, n)
    results = []
    # NB: push! to a shared array from multiple threads is not thread-safe;
    # my_function stands in for the actual simulation function
    Threads.@threads for i = 1:n
        push!(results, my_function(some_fixed_parameters, P=P, seed=seeds[i]))
    end
    # get the results
    # save the results
    JLD.save(filename, to_save, compress=true)
end
For distributed computing you normally need multiprocessing rather than multithreading (although it is fine for the parallel processes themselves to be multithreaded if needed).
Hence, what you need is the ClusterManagers library, which lets the cluster manager allocate processes for your Julia cluster.
I have been using Julia on Cray clusters with SLURM, so not exactly PBS; however, since your question remains unanswered, here is my working code. You would use addprocs_pbs, which looks to have a very similar structure.
using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")
Once you add the worker processes, all that remains is to use the Distributed package to orchestrate your workload.

Foreach in R: optimise RAM & CPU use by sorting tasks (objects)?

I have ~200 .Rds datasets that I perform various operations on in a pipeline of multiple scripts. In most of these scripts I began with a for loop and have upgraded to foreach. My problem is that the dataset objects are very different sizes (their on-disk sizes in MB vary widely):
so if I optimise core usage (I have a 12-core/16 GB RAM machine at the office and a 16-core/32 GB RAM machine at home), it whips through the first 90 without incident, but then the larger files bunch up and max out the total RAM allocation (remember that .Rds files are compressed, so these are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash, typically leaving me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the core count based on the number and RAM size of the workers and the total available RAM, and I figured others might have been in a similar dilemma and developed strategies to address it. Some approaches I've thought of but not tried:
Approach 1: first loop (or list) through the files on disk, potentially by opening and closing them, and use object.size() to get their sizes in RAM; sort largest to smallest, cut the list halfway, reverse the order of the second half, and intersperse the two: smallest, biggest, 2nd smallest, 2nd biggest, etc. Two workers (or any even multiple) should therefore be working at the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then take job 3, the 2nd smallest, likely finish that quickly too, and then take job 4, the 2nd largest, while worker 2 is still on the largest. By job 4, this approach has the machine processing the two largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until the combined usage would exceed what's available; foreach on that batch, then repeat (see the sketch after this list). This would work but requires some convoluted coding (probably a for loop wrapped around the foreach, passing it its task list each time). Also, if there are many tasks that won't exceed the RAM (as in my example), the batching process means all 12 or 16 have to complete before the next 12 or 16 start, introducing inefficiency.
Approach 3: sort small to large as per 2. Run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance the remaining workers can continue. Conceptually this means cores-1 tasks fail and need to be re-run, but the code is easy and should work fast. I already have code that checks the output directory and removes completed tasks from the jobs list, which means I could just re-run this approach; however, I should anticipate further losses, and therefore reruns, unless I lower the core count.
Approach 4: as 3, but somehow close the worker (reduce the core count) BEFORE the task is assigned, so a task doesn't have to trigger a RAM overrun and fail in order to reduce the worker count. This would also mean not having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do all this for me, but beggars can't be choosers! Conceptually this would be similar to 4: for each worker, don't start the next task until there's sufficient RAM available.
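To make approach 2 concrete, here is a rough, untested sketch of what I have in mind (the data directory, RAM budget, and processing step are placeholders):

library(foreach)
library(doParallel)

files <- list.files("data", pattern = "\\.Rds$", full.names = TRUE)
# measure each object's in-RAM size (loads each file once, as in approach 1)
sizes <- vapply(files, function(f) as.numeric(object.size(readRDS(f))), numeric(1))

ord <- order(sizes)  # smallest to largest
budget <- 12e9       # bytes; leave headroom below total RAM
# cut the sorted tasks into batches whose cumulative size fits the budget
batches <- split(ord, cumsum(sizes[ord]) %/% budget)

registerDoParallel(cores = 12)
for (b in batches) {
  foreach(i = b, .errorhandling = "pass") %dopar% {
    x <- readRDS(files[i])
    # ... process x and write the output ...
    NULL
  }
}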
Any thoughts appreciated from folks who've run into similar issues. Cheers!
I've thought a bit about this too.
My problem is slightly different: I don't get crashes, but rather slowdowns due to swapping when there is not enough RAM.
Things that may work:
randomize the iterations so that the load is approximately evenly distributed (without needing to know the timings in advance)
similar to approach 5, add barriers (workers waiting in a while loop with Sys.sleep()) while there is not enough memory (e.g. as determined via package {memuse}); see the barrier sketch below
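For example, such a barrier could look like the sketch below (the free-RAM helper is a Linux-only placeholder that reads /proc/meminfo; {memuse} could back it instead):

# placeholder helper: free system RAM in GB (Linux-only)
free_ram_gb <- function() {
  line <- grep("^MemAvailable:", readLines("/proc/meminfo"), value = TRUE)
  kb <- as.numeric(gsub("[^0-9]", "", line))  # value is reported in kB
  kb / 1024^2
}

# barrier: make a worker wait until at least `min_free` GB are free
wait_for_ram <- function(min_free = 4, poll_sec = 5) {
  while (free_ram_gb() < min_free) Sys.sleep(poll_sec)
}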
Things I do in practice:
always store the results of iterations in foreach loops and test whether a result has already been computed (i.e. the RDS file already exists); see the sketch after this list
skip some iterations if needed
rerun the "intensive" iterations using fewer cores
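For instance, the store-and-skip pattern looks roughly like this (the output layout and the process_one() helper are placeholders):

foreach(i = seq_along(files), .errorhandling = "pass") %dopar% {
  out <- file.path("results", sprintf("result_%03d.rds", i))
  if (!file.exists(out)) {                # skip work already done in an earlier run
    saveRDS(process_one(files[i]), out)   # process_one() stands in for the real work
  }
  out
}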

load-balancing in R foreach loops

Is there a way to modify how an R foreach loop does load balancing with the doParallel backend? When parallelizing tasks that have very different execution times, it can happen that all nodes but one have finished their tasks while the last one still has several tasks to do. Here is a toy example:
library(foreach)
library(doParallel)
registerDoParallel(4)

waittime = c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)
w = iter(waittime)
foreach(i = w) %dopar% {
  message(paste("waiting", i, "on", Sys.getpid()))
  Sys.sleep(i)
}
Basically, the code registers 4 cores. For each loop index i, the task is to wait waittime[i] seconds. However, foreach's default load balancing seems to split the total number of tasks into as many sets as there are registered cores, so in the above example the first core receives all the tasks with waittime = 10 while the 3 others receive tasks with waittime = 1; those 3 cores finish all their tasks before the first one has finished its first.
Is there a way to make foreach() distribute tasks one at a time, i.e. in the above case, distribute the first 4 tasks among the 4 cores and then give each subsequent task to the next available core?
Thanks.
I haven't tested it myself, but the doParallel backend provides a preschedule option akin to the mc.preschedule argument in mclapply(). (See section 7 of the doParallel vignette.)
You might try:
mcoptions <- list(preschedule = FALSE)
# same loop body as in the question
foreach(i = w, .options.multicore = mcoptions) %dopar% Sys.sleep(i)
Apologies for posting as an answer, but I have insufficient rep to comment. Could you rewrite your code to make use of parLapplyLB or parSapplyLB?
parLapplyLB, parSapplyLB are load-balancing versions, intended for use when applying FUN to different elements of X takes quite variable amounts of time, and either the function is deterministic or reproducible results are not required.
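For your toy example, that could look roughly like this (untested sketch):

library(parallel)

cl <- makeCluster(4)
waittime <- c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)

# parLapplyLB hands out one element at a time as workers become free
res <- parLapplyLB(cl, waittime, function(t) {
  Sys.sleep(t)
  sprintf("waited %s on PID %s", t, Sys.getpid())
})
stopCluster(cl)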

JMH measurement iterations

I'm using JMH and I find something hard to understand: I have one method annotated with @Benchmark, and I set measurementIterations(3). The method is called 3 times, but within each iteration the function runs a rather large and seemingly random number of times.
My question is: is that number completely random? Is there a way to control how many times the function runs within an iteration? And what is the point of setting measurementIterations if, either way, the function runs a random number of times?
measurementIterations defines how many measured iterations of the benchmark you want. I don't know which parameters you have specified, but by default JMH runs benchmarks time-based (the default is, I believe, 1 second per iteration). This means the benchmark method is invoked as often as possible within that time frame. There are ways to specify how often the method should be called in one iteration (-> batching).
I would recommend studying the JMH samples provided with JMH: http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
They are a very good introduction to JMH and cover pitfalls that are easy to fall into when writing benchmarks.
The number of iterations depends on the JMH mode. I think you must be using AverageTime mode, which performs multiple iterations.
Mode.Throughput: calculates the number of operations in a unit of time.
Mode.AverageTime: calculates the average running time.
Mode.SampleTime: calculates how long it takes for a method to run (including percentiles).
Mode.SingleShotTime: just runs a method once (useful for cold-testing mode).
For example, with Mode.SingleShotTime, the iteration is performed exactly the number of times you specify in the run (see below).
// Example runner class
public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
            .include(JMHSample_01_HelloWorld.class.getSimpleName())
            .warmupIterations(1)      // number of warmup iterations
            .measurementIterations(1) // number of measured iterations
            .forks(1)
            .shouldDoGC(true)
            .build();
    new Runner(opt).run();
}
JMH performs warm-up iterations that are not measured but are necessary for valid results.
measurementIterations defines how many iterations should be measured. This does not include warm-up, because warm-up is not measured.
Yes, within every iteration the number of times the method runs is variable (it is the maximum number of times the method can run in the iteration's time window). That count is not what matters; what matters is the average time per call.
Besides, you can control how many iterations to run with measurementIterations() and the duration of each iteration with measurementTime().
For example, if you want to run your method with only 1 iteration lasting 1 ms and no warmup, set warmupIterations to 0, measurementTime to 1 ms, and measurementIterations to 1, like below:
Options opt = new OptionsBuilder()
        .include(xxx.class.getSimpleName())
        .warmupIterations(0)
        .measurementTime(TimeValue.milliseconds(1))
        .measurementIterations(1)
        .forks(1)
        .build();
Significance of multiple iterations: the more you run, the more reliable the results.
