What's the need of using MPI pinning - mpi

Can someone explain whats MPI pinning and why is it needed ?
Description from our cluster says this: but I dint understood much
mpirun -npernode 2 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above however decreases performance a lot. This is currently under investigation.

Related

How to increase processing power in SLURM? (nodes/cores/tasks?)

I would like to increase the processing power of my jobs but am not sure how to go about this. At the moment I am requesting 1 node on SLURM (#SBATCH --nodes 1) but am not sure whether I should request more cores or more nodes? I know that my workplace HPC has 44 cores to each node, so am I currently using all 44 nodes and need to use an additional 44? Or does this command just request one core from this node by default and I need to find a way to request more cores from that node?
I also know that commands like --ntasks=1, --ntasks-per-node 10 and --cpus-per-task=4 modify number of tasks, but I think all my code is run sequentially (I'm not using threading modules or anything like that) so is there any use in doing this?
EDIT: I've changed my code from
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 10
(originally copied from someone else, no idea what it's doing)
to
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 10
Any advice appreciated
Increasing the number of node only worth it if your application benefit from distributed computing (using MPI for example). This is the case of most HPC applications. Whether increasing the number of node or core is better is very dependent of the target application (and low-level details of the target platform). Hybrid applications doing a lot of communications tends to better perform with more cores while memory bound ones require more nodes to execute faster. Note that using more core generally helps to better use available HPC resources as remaining cores are generally unused and wasted (HPC cluster/supercomputers allowing multiple users to use cores of the same node simultaneously are very rare). A lot of HPC applications do not scale well in shared memory though (often due to IO/memory saturation or a poor support of NUMA platforms). This is a complex topic where researcher worked since several decades.
I think all my code is run sequentially (I'm not using threading modules or anything like that) so is there any use in doing this?
You cannot speed up your application by using more cores or more nodes if your application use only one process with one thread (and do not use accelerators like GPUs). You need to parallelize your application first. There are many tools for that starting from OpenMP and MPI (for the basics). There is no free lunch.

Slowdown at increased number of processes for HPC with fat-tree architecture

I've noticed something particularly odd about a simple program I've been running on a HPC with a fat tree architecture and I'm not exactly sure why I'm getting the results I'm getting.
The program I've created simply prints the runtime of a program on a varying number of processes (using MPI). I experimented by varying the number of processes by 2^n from 2 to 256, and while the execution time for each process tends to decrease as the number of processes increases from 2 to 8 processes, this time jumps dramatically at 64 processes.
Could this be because of the architecture, itself? I'd imagine that the execution time would decrease with respect to the number of processes, but this doesn't seem to be the case past a certain threshold of processes.
I figured out the issue a while ago after reading the documentation (go figure) and wanted to post the solution here in case anyone had a similar issue. On the HPC I was using (AFRL's Mustang), I was executing my programs with mpirun on the login node. The documentation clearly states that jobs need to be submitted via a batch script per section 6 of the user guide:
https://www.afrl.hpc.mil/docs/mustangQuickStartGuide.html#jobSubmit

Single-CPU programs running on Hyper-Threading-enabled quadcore CPU

I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs to run on two logical (virtual) cores belonging to the same physical core, because the OS supports SMP (Simultaneous Multi-Threading).
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.
This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question, no, one thread will not use half a core.
One thread can work inside the core at a time, but that one thread can saturate the whole core processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power, while thread 2 waits for the other thread to end its execution. It's a serialized execution, not parallel.
At a glance, it looks like that extra thread is useless. I mean the core can process 1 thread at once right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled. It's not doing anything. This is not noticeable because all these components work at a ridiculous speed anyway and you wouldn't notice it through some CPU load software, but it stacks additively. One cache miss after another and another hampers the overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if another cache miss happens, but the likelihood of 2 threads missing the cache in a row instead of 1 thread is much lower.
Branch misprediction
Branch prediction is when you have a code path with more than one possible result. The most basic branching code would be an if statement.
Modern CPUs have branch prediction algorithms embedded into their microcode which try to predict the execution path of a piece of code. These predictors are actually quite sophisticated and although I don't have solid data on prediction rate, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it practically chooses one path (path which the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see if the branch predictor was indeed right or not. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline which was processing the wrong branch has to be flushed and start over from the correct branch.
OR, the readily available thread can come in and simply execute while the mess caused by the misprediction is resolved. This is the second use of hyperthreading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch prediction is not a major factor of performance degradation since, like I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
From my experience hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30% depending on the size of the scenes and materials/textures required. Huge scenes use huge amounts of RAM making cache misses far more likely. Hyperthreading helps a lot in overcoming these misses.
Since you are running on a Linux kernel you are in luck because the scheduler is smart enough to make sure your tasks is divided on between your physical cores.
Linux became hyperthredding aware in kernel 2.4.17 ( ref: http://kerneltrap.org/node/391 )
Note that the reference is from the old O(1) scheduler. Linux now uses the CFS scheduling algorithm which was introduced in kernel 2.6.23 and should be even better.
But as already suggested you can experiment by disabling hyper threading in bios and see if your particular workload runs faster or slower with or without hyperthreading enabled. If you start 8 tasks instead of 4 you will probably find that the total executing time for 8 tasks on hyperthreading is faster than two separate runs with 4 tasks but again the best thing to do is to experiment. Good luck!
If you are really want just 4 dedicated cores, you should be able to disable hyperthreading in your BIOS page. Also, and this part I'm less clear on, I believe that the processor is smart enough to do more work on a single thread if its second logical core is idle.
No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.

How do I select the no. of processors/cores to run my MPI program on?

I am using mpich2 1.2.1p1 version which has MPD as its default process manager.
When we run mpiexec, we can mention the no. of processes we want to spawn, but I also want to mention/select the no. of processors/cores I want to use. How do i do it?
Also, when we simply spawn n no. of processes, how do we know how many processors/cores are being used??
Please help.
Any sensible operating system will use as many cores as possible on each machine. You should not have to worry about that. When spawning 4 mpi processes on a quad core machine, it is safe to assume that all 4 cores will be used. If not, there is something seriously wrong with the configuration. Anyway, if you really want to be sure, check the CPU usage with for example 'top'.
The number of processes is the number of cores used. Mpi will put at least one process on each core. If you want to make sure you are always using the maximum number of cores on your machine then use the OS resources on your system to get the number of cores and pass that to the mpiexec call.

mpirun actual number of processors used

I am starting programming on an OpenMPI managed cluster.
I use the following command to run my executable:
mpirun -np 32 file
Now what I understand is that 32 specifies the number of processes that should be created. They may be created on the same processor. Am I right?
I am noticing increasing time for execution with increase in the number of processes. Could the above be a reason for this?
How do I find out the execution and scheduling policy of the cluster?
Is it correct to assume that typically the cluster I am working on will have many processes running on each node just as they run on my PC.
I would expect your job management system (which is ?) to allocate 1 MPI process per core. But that is a configuration matter and your cluster may not be configured as I expect. Can you see what processes are running on the various cores of your cluster at run time ?
There are many explanations for increasing execution time with increasing numbers of processes, several good ones which include the possibility of one-process-per-core. But multiple processes per core is a potential explanation.
You find out about the policies of your cluster by asking the cluster administrator.
No, I think it is atypical for cluster processors (or cores) to execute multiple MPI processes simultaneously.

Resources