Parallel computing in Julia and mis-allocating # of cores

I used the pmap() function in Julia to write a parallel code.
I then secured four cores on the cluster and ran a script:
julia -p 12 my_parallel_program.jl
Should I now cancel my job? What exactly is happening now that I told Julia there are 12 cores when there are really 4? Will it run just as fast as if I had run julia -p 4 my_parallel_program.jl?

The option -p 12 launches a total of 13 processes (one master process plus 12 slave/worker processes).
Depending on your problem setting, this may or may not be what you want.
The rule of thumb is to launch a number of slave processes that matches the number of cores on your machine (this can be checked via the nproc bash command).
However, in scenarios that depend heavily on some external resource (e.g. waiting for network IO), creating more processes than there are cores can make complete sense. On the other hand, in compute-intensive cases (such as numerical simulations) the number of slave processes should equal the number of available cores. Last but not least, note that if you run a compute-intensive job on a laptop you can easily cause it to overheat (theoretically this should not happen, but I have managed to melt down 3 laptop CPUs this way). Hence, for numerical simulations on a laptop, use no more than 75% of the available cores.
Having said that, please note that there are many alternatives to multiprocessing in Julia:
green threads with @async - these are great for network-intensive scenarios such as web scraping
multi-threading with Threads.@threads - this requires setting the JULIA_NUM_THREADS environment variable. This option allows common data to be shared across threads. The advantage is that there are no inter-process communication issues (compared to -p); the disadvantage is that a locking mechanism is required to prevent two threads from changing the same piece of data simultaneously.
multiprocessing (the one with the -p option that you used). Here Julia spawns the given number of slave processes. If you need to communicate across processes you should use the ParallelDataTransfer.jl package.
distributed multiprocessing (using the --machinefile Julia launch option). This is very powerful because it allows you to run -p-style code across huge clusters of machines running Julia. All that is needed to configure it is passwordless SSH and open TCP/IP connections between the machines in the cluster.
The optimal choice depends on your computational scenario but some hints have been given above.

You told Julia to start 12 worker processes and Julia obeyed. You can check this by calling the nworkers function. The number of logical cores Julia sees can be checked via Base.Sys.CPU_CORES.
In general, starting more processes than you have cores will most likely degrade the performance of computations, as the 12 processes compete for the time of 4 cores.
Here is an example session to confirm this (run on a machine with 4 logical cores):
$ julia -p 12
(Julia startup banner: Version 0.7.0-alpha.0 (2018-05-31 00:07 UTC), official http://julialang.org/ release, x86_64-w64-mingw32)
julia> nworkers()
12
julia> Base.Sys.CPU_CORES
4

Related

HuggingFace Trainer() does nothing - only on Vertex AI workbench, works on colab

I am having issues getting the Trainer() function in HuggingFace to actually do anything on Vertex AI Workbench notebooks.
I'm totally stumped and have no idea how to even begin to debug this.
I made this small notebook: https://github.com/andrewm4894/colabs/blob/master/huggingface_text_classification_quickstart.ipynb
If you set framework=pytorch and run it in Colab, it runs fine.
I wanted to move from Colab to something more persistent, so I tried Vertex AI Workbench notebooks on GCP. I created a user-managed notebook (PyTorch:1.11, 8 vCPUs, 30 GB RAM, NVIDIA Tesla T4 x 1), and if I try to run the same example notebook in JupyterLab it just seems to hang on the Trainer() call and do nothing.
It looks like the GPU is not doing anything either for some reason (it might not be supposed to, since I think Trainer() is some pre-training step):
(base) jupyter@pytorch-1-11-20220819-104457:~$ nvidia-smi
Fri Aug 19 09:56:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 41C P8 9W / 70W | 3MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I found this thread that seems like maybe a similar problem, so I played with as many Trainer() args as I could, but no luck.
So I'm kind of totally blocked here. I refactored the code to be able to use TensorFlow, which does work for me (after I installed TensorFlow on the notebook), but it's much slower for some reason.
Basically this was all working great (in my actual real code I'm working on) on Colab, but when I tried to move to Vertex AI Notebooks I seem to be blocked by this strange issue.
Any help or advice is much appreciated. I'm new to HuggingFace and PyTorch too, so I'm not even sure what things to try or how to run things in debug mode.
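Before the workaround below, one sanity check worth running first (a minimal sketch, not from the original post) is to confirm that PyTorch can see and use the GPU at all; if this hangs or reports no CUDA device, the problem is the environment rather than the Trainer() call itself:

# Hypothetical debug step: verify PyTorch sees the T4 before blaming Trainer().
import torch

print(torch.__version__)                  # PyTorch version in the notebook image
print(torch.cuda.is_available())          # should print True on a T4 instance
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
    x = torch.randn(1000, 1000, device="cuda")
    print((x @ x).sum().item())           # tiny matmul to confirm the GPU works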
Workaround
I noticed that if I make a new notebook from the NumPy/SciPy/scikit-learn image (4 vCPUs, 15 GB RAM, NVIDIA Tesla T4 x 1) instead of the official PyTorch one from the dropdown, and install PyTorch myself with conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch, it all works.

Unable to use all cores with mpirun

I'm testing a simple MPI program on my desktop (Ubuntu LTS 16.04 / Intel® Core™ i3-6100U CPU @ 2.30GHz × 4 / gcc 4.8.5 / OpenMPI 3.0.0) and mpirun won't let me use all of the cores on my machine (4). When I run:
$ mpirun -n 4 ./test2
I get the following error:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
./test2
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
But if I run with:
$ mpirun -n 2 ./test2
everything works fine.
I've seen from other answers that I can check the number of processors with
cat /proc/cpuinfo | grep processor | wc -l
and this tells me that I have 4 processors. I'm not interested in oversubscribing, I'd just like to be able to use all my processors. Can anyone help?
Your processor has 4 hyperthreads but only 2 physical cores (see Intel's specs for the i3-6100U).
By default, Open MPI does not run more than one MPI task per core.
You can have Open MPI run up to one MPI task per hyperthread with the following option
mpirun --use-hwthread-cpus ...
FWIW
The command you mentioned reports the number of hyperthreads.
A better way to figure out the topology of a machine is via the lstopo command from the hwloc package.
MPI tasks are not bound to cores nor threads on OS X, so if you are running on a Mac, --oversubscribe -np 4 would lead to the same result.
To resolve your problem, you can use the --use-hwthread-cpus command line argument for mpirun, as already pointed out by Gilles Gouaillardet. In this case, Open MPI will treat a thread provided by hyperthreading as an Open MPI processor; otherwise it treats a CPU core as an Open MPI processor, which is the default behavior.
When using --use-hwthread-cpus, Open MPI will correctly determine the total number of processors available to you, that is, all processors available on all hosts specified in the Open MPI host file, so you do not need to specify the -n parameter. With this option, Open MPI refers to the threads provided by hyperthreading as "hardware threads". Using this technique you will not oversubscribe, and if an Open MPI processor runs on a virtual machine, it will use the correct number of threads assigned to that virtual machine. If your processor has more than two threads per core, as on a Xeon Phi (Knights Mill, Knights Landing, etc.), it will take all four threads per core as Open MPI processors.
Use lscpu: the number of cores per socket × the number of sockets gives the number of physical cores (the ones you can use for MPI), whereas cores per socket × sockets × threads per core gives the number of logical cores (the count you get from cat /proc/cpuinfo | grep processor | wc -l).
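The same distinction can be checked programmatically; here is a minimal Python sketch (psutil is a third-party package and an assumption here, not something mentioned in the answers above):

# Logical vs. physical CPU counts, mirroring the lscpu arithmetic above.
import os
import psutil

print(os.cpu_count())                   # logical CPUs (hyperthreads), e.g. 4
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 2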

MPI job submission on an LSF cluster

I usually process data on the university's cluster. Most of my previous jobs were based on parallel batch shell scripts (divide the job into several batches, then submit them in parallel). An example of such a script is shown below:
#! /bin/bash
#BSUB -J model_0001
#BSUB -o z_output_model_0001.o
#BSUB -n 8
#BSUB -e z_output_model_0001.e
#BSUB -q general
#BSUB -W 5:00
#BSUB -B
#BSUB -N
some command
This time, I am testing an MPI job (based on mpi4py). The code has been tested on my laptop on a single task (1 task using 4 processors). Now I need to submit multi-task (30) jobs on the cluster (1 task using 8 processors). My design is like this: prepare 30 similar shell files as above, where the command in each shell file is my mpi command (something like "mpiexec -n 8 mycode.py args"), and each shell file reserves 8 processors.
I submitted the jobs, but I am not sure if I am doing it correctly. It's running, but I am not sure if it is actually running under MPI. How can I check? Here are 2 more questions:
1) For normal parallel jobs, there is usually a limit on the number of processors I can reserve for a single task: 16. Above 16, I never succeeded. If I use MPI, can I reserve more? Because MPI is different; basically I do not need contiguous memory.
2) I think there is a priority rule on the cluster. For normal parallel jobs, when I reserve more processors per task (say 10 tasks and 16 processors per task), it requires much more waiting time in the queue than reserving fewer processors per task (say dividing each task into 8 sub-tasks, 80 sub-tasks in total, with 2 processors per sub-task). If I can reserve more processors for MPI, does it affect this rule? I worry that I am going to wait forever...
Well, increasing "#BSUB -n" is exactly what you need to do. That option tells LSF how many execution "slots" you are reserving. So if you want to run an MPI job with 20 ranks, you need
#BSUB -n 20
IIRC the execution slots do not need to be allocated on the same node; LSF will allocate slots from as many nodes as are required to satisfy the request. But it's been a while since I've used LSF, and I currently don't have access to a system using it, so I could be wrong (and it might depend on the local cluster's LSF configuration).
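As for checking whether the job really runs under MPI: a minimal mpi4py sketch (the file name check_mpi.py is hypothetical) is to have every process print its rank; if you see ranks 0 through 7, MPI is working:

# check_mpi.py - verify the job is really running under MPI.
# Submit with the same BSUB script, using the command:
#   mpiexec -n 8 python check_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Every rank prints its id, the total rank count, and the host it runs on.
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")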

MPI: mpiexec third parameter not clear

What exactly is the third parameter in the following MPI command
mpiexec -n 2 cpi
Is it the number of cores? So if I am running on a Pentium 4, shall I make it 1?
-n 2: spawn two processes.
cpi: the executable.
Experiment with what is faster, one or two or more processes. Some codes run best with one process per core, some codes benefit from oversubscription.
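A quick way to run that experiment, as a rough Python sketch (it assumes cpi is in the current directory and an MPI runtime is on the PATH):

# Rough wall-clock comparison of different process counts for the cpi example.
import subprocess
import time

for n in (1, 2, 4):
    start = time.perf_counter()
    subprocess.run(["mpiexec", "-n", str(n), "./cpi"], check=True)
    print(f"-n {n}: {time.perf_counter() - start:.2f} s")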

Pinning parallel processes using Unix system calls

for i in `seq 1 8` ; do
(./runProgram &)
done
Dear fellows,
I know how to create parallel processes by creating 8 independent processes; the next thing I am looking for is how to:
i - Run 8 copies concurrently with processor pinning (each copy on its own processor core)
ii - Run 16 copies concurrently with processor pinning (2 copies per core)
iii - Run 8 copies concurrently with processor pinning as per "i", flipping to the furthest processor core after a particular function call in the code.
My CPU currently has 8 cores and it is running Fedora. I don't know the process IDs in advance.
Please suggest.
Thanks in advance.
The easiest way to achieve i and ii is to use the taskset command:
Case i:
for i in `seq 0 7`; do
taskset -c $i ./runProgram &
done
Case ii:
for i in `seq 0 7`; do
taskset -c $i ./runProgram &
taskset -c $i ./runProgram &
done
Case iii: See the manual pages for sched_getaffinity(2) and sched_setaffinity(2) on how to change the pinning in code.
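Those same calls are exposed in Python's os module on Linux, which gives a compact illustration of case iii; note the "furthest core" rule below (core i to core (i + 4) mod 8) is an assumption for illustration, not something specified in the question:

# Sketch of re-pinning the current process after a particular function call,
# using Python's wrappers around sched_getaffinity(2)/sched_setaffinity(2).
import os

def flip_to_furthest_core(num_cores: int = 8) -> None:
    current = min(os.sched_getaffinity(0))   # core this process is pinned to
    target = (current + num_cores // 2) % num_cores
    os.sched_setaffinity(0, {target})        # re-pin to the "furthest" core

# ... do the first phase of work, then after the particular function call:
flip_to_furthest_core()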
