MPI_Spawn without binding to cores - mpi

I am developing an application that spawns new MPI processes at runtime using MPI_Comm_spawn. The spawning itself works fine, but I am having a problem with binding: I want every MPI process to be "not bound (or bound to all available processors)".
When I run the application:
mpirun --oversubscribe --allow-run-as-root --bind-to none --report-bindings -np 1 ./MyApp
and then spawn new processes at runtime, I get the following binding reports:
Main mpi process report: MCW rank 0 is not bound (or bound to all available processors)
First spawned MPI process report: MCW rank 0 bound to socket 0[core 2[hwt 0-1]]: [../../BB/..]
Second spawned MPI process report: MCW rank 0 bound to socket 0[core 3[hwt 0-1]]: [../../../BB]
Third spawned MPI process report: MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..]
Fourth spawned MPI process report: MCW rank 0 not bound
How can I make it so that none of the processes are bound?
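One workaround worth trying (a minimal sketch, assuming Linux and that it is acceptable to undo the binding from inside the spawned processes themselves; the helper name unbind_from_cores is made up) is to clear the CPU affinity mask right after MPI_Init in the spawned binary using sched_setaffinity:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <mpi.h>

/* Clear the inherited affinity mask so this process may run on every online CPU. */
static void unbind_from_cores(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    for (long c = 0; c < ncpus; ++c)
        CPU_SET((int)c, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = the calling process */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    unbind_from_cores();   /* spawned children undo the binding applied by the launcher */
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}

Note that --report-bindings reflects the binding chosen by the launcher at start-up, so the reports above would likely still show a binding even after the mask has been reset inside the process.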

Related

DPDK MLX5 driver - QP creation failure

I am developing a DPDK program using a Mellanox ConnectX-5 100G.
My program starts N workers (one per core), and each worker deals with its own dedicated TX and RX queue, so I need to set up N TX and N RX queues.
I am using flow director and rte_flow APIs to send ingress traffic to the different queues.
For each RX queue I create an mbuf pool with:
n = 262144
cache size = 512
priv_size = 0
data_room_size = RTE_MBUF_DEFAULT_BUF_SIZE
For N<=4 everything works fine, but with N=8, rte_eth_dev_start returns:
Unknown error -12
and the following log message:
net_mlx5: port 0 Tx queue 0 QP creation failure
net_mlx5: port 0 Tx queue allocation failed: Cannot allocate memory
I tried:
increasing the number of hugepages (up to 64x1G)
changing the pool size in different ways
both DPDK 18.05 and 18.11
changing the number of TX/RX descriptors from 32768 to 16384
but with no success.
You can see my port_init function here (for DPDK 18.11).
Thanks for your help!
The issue is related to the TX inlining feature of the MLX5 driver, which is only enabled when the number of queues is >=8.
TX inlining uses DMA to send the packet directly to the host memory buffer.
With TX inlining, there are some checks that fail in the underlying verbs library (which is called from DPDK during QP Creation) if a large number of descriptors is used. So a workaround is to use fewer descriptors.
I was using 32768 descriptors, since the advertised value in dev_info.rx_desc_lim.nb_max is higher.
The issue is solved using 1024 descriptors.
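For reference, here is a minimal sketch of what the queue-setup part of a port_init-style function could look like with the reduced descriptor count (the names setup_queues, port_id, nb_queues and mbuf_pools are assumptions, not taken from the linked code; DPDK 18.11 API):

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Assumes rte_eth_dev_configure() was already called for port_id. */
static int setup_queues(uint16_t port_id, uint16_t nb_queues,
                        struct rte_mempool **mbuf_pools)
{
    uint16_t nb_rxd = 1024;   /* was 32768; 1024 avoids the QP creation failure */
    uint16_t nb_txd = 1024;
    int ret;

    /* Let the PMD round the counts to values it actually supports. */
    ret = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);
    if (ret != 0)
        return ret;

    for (uint16_t q = 0; q < nb_queues; q++) {
        ret = rte_eth_rx_queue_setup(port_id, q, nb_rxd,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL, mbuf_pools[q]);
        if (ret < 0)
            return ret;
        ret = rte_eth_tx_queue_setup(port_id, q, nb_txd,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL);
        if (ret < 0)
            return ret;
    }
    return rte_eth_dev_start(port_id);
}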

SMP affinity vs XPS on paired queues and TX queue selection control

I have a Solarflare NIC with paired RX and TX queues (8 sets, on an 8-core machine; a real machine, no hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (TX) interrupts are also handled by the same core? And how does this interact with XPS?
For instance, let's say the IRQ number is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, and tx-2 also happens to have IRQ number 115. If I have an XPS setting saying tx-2 should be handled by CPU 4, which one takes precedence: XPS or smp_affinity?
Also, is there a way to see or set which TX queue is being used for a particular app/TCP connection? I have an app that receives UDP data, processes it, and sends TCP packets, in a very latency-sensitive environment. I want to handle the TX interrupts for the outgoing traffic on the same CPU (or one on the same NUMA node) as the app creating this traffic; however, I have no idea how to find out which TX queue this app is using. While the receive side has indirection tables for setting up rules, I do not know whether there is a way to control TX queue selection and thereby pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting its CPU affinity (taskset) or NUMA node affinity, and you can also set the IRQ affinities (via /proc/irq/270/smp_affinity, or with the old Intel set_irq_affinity.sh script that floats around on GitHub). This won't completely guarantee which IRQ/CPU is used, but it gives you a good head start. If all that fails, to improve latency you may want to enable packet steering so that packets reach the correct CPU sooner (/sys/class/net/<iface>/queues/rx-<n>/rps_cpus and tx-<n>/xps_cpus). There is also the irqbalance program, and more; it is a broad subject and I am still learning much of it myself.
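As a concrete illustration of the xps_cpus knob, here is a minimal sketch (the interface name eth0, queue tx-2 and CPU 4 are assumptions matching the example above) that steers TX queue 2 to CPU 4; it is equivalent to echoing the hex mask into the sysfs file by hand:

#include <stdio.h>

int main(void)
{
    /* bit 4 set -> CPU 4; the mask is written as comma-separated 32-bit hex words */
    const char *path = "/sys/class/net/eth0/queues/tx-2/xps_cpus";
    FILE *f = fopen(path, "w");   /* needs root */
    if (!f) { perror(path); return 1; }
    fprintf(f, "10\n");
    fclose(f);
    return 0;
}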

Shared memory Vs QPI

Assume there is a node with two processors, each having 8 cores (Intel Sandy Bridge E5-2670). Each processor sits in a separate socket, so there are 8 cores per socket. Suppose I have two processes, one on socket 1 and one on socket 2, and they communicate. How can I find out whether they use QPI (QuickPath Interconnect) for the communication or some form of shared memory?
(Each socket has a shared L3 cache of 20 MB and 16 GB of RAM.)

Processor/socket affinity in openMPI?

I know there are some basic options in the Open MPI implementation for mapping different processes to different cores of different sockets (if the system has more than one socket):
--bind-to-socket (first come, first served)
--bysocket (round-robin, based on load balancing)
--npersocket N (assign N processes to each socket)
--npersocket N --bysocket (assign N processes to each socket, but on a round-robin basis)
--bind-to-core (bind one process to each core in a sequential fashion)
--bind-to-core --bysocket (assign one process to each core, but never leave any socket less utilized)
--cpus-per-proc N (bind processes to more than one core)
--rankfile (write a complete description of the preference of each process)
I am running my Open MPI program on a server with 8 sockets (10 cores each), and since multithreading is on, there are 160 hardware threads available.
I need to analyze how my Open MPI program behaves for different combinations of sockets/cores and processes. I expect the case where all sockets are used and the code does some data transfer to be the slowest, since memory transfer is fastest when both processes execute on cores of the same socket.
So my questions are as follows:
What are the worst/best case mappings between processes and sockets (each process has a sleep duration and a data transfer to the root process)?
Is there any way to print the socket and core details on which a process is being executed? (I will use it to check whether the processes are really distributing themselves among the sockets.)
It depends on so many factors that it's impossible for a single "silver bullet" answer to exist. Among the factors are the computational intensity (FLOPS/byte) and the ratio of the amount of local data to the amount of data passed between the processes. It also depends on the architecture of the system. Computational intensity can be estimated analytically or measured with a profiling tool like PAPI, Likwid, etc. The system's architecture can be examined using the lstopo utility, part of the hwloc library, which comes with Open MPI. Unfortunately lstopo cannot tell you how fast each memory channel is or how fast/latent the links between the NUMA nodes are.
Yes, there is: --report-bindings makes each rank print to its standard error the affinity mask that applies to it. The output varies a bit among the different Open MPI versions:
Open MPI 1.5.x shows the hexadecimal value of the affinity mask:
mpiexec --report-bindings --bind-to-core --bycore
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],0] to cpus 0001
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],1] to cpus 0002
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],2] to cpus 0004
[hostname:00599] [[10634,0],0] odls:default:fork binding child [[10634,1],3] to cpus 0008
This shows that rank 0 has its affinity mask set to 0001 which allows it to run on CPU 0 only. Rank 1 has its affinity mask set to 0002 which allows it to run on CPU 1 only. And so on.
mpiexec --report-bindings --bind-to-socket --bysocket
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],0] to socket 0 cpus 003f
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],1] to socket 1 cpus 0fc0
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],2] to socket 0 cpus 003f
[hostname:21302] [[30955,0],0] odls:default:fork binding child [[30955,1],3] to socket 1 cpus 0fc0
In that case the affinity mask alternates between 003f and 0fc0. 003f in binary is 0000000000111111, and such an affinity mask allows each even rank to execute on CPUs 0 to 5. 0fc0 is 0000111111000000, so the odd ranks are only scheduled on CPUs 6 to 11.
Open MPI 1.6.x uses a nicer graphical display instead:
mpiexec --report-bindings --bind-to-core --bycore
[hostname:39646] MCW rank 0 bound to socket 0[core 0]: [B . . . . .][. . . . . .]
[hostname:39646] MCW rank 1 bound to socket 0[core 1]: [. B . . . .][. . . . . .]
[hostname:39646] MCW rank 2 bound to socket 0[core 2]: [. . B . . .][. . . . . .]
[hostname:39646] MCW rank 3 bound to socket 0[core 3]: [. . . B . .][. . . . . .]
mpiexec --report-bindings --bind-to-socket --bysocket
[hostname:13888] MCW rank 0 bound to socket 0[core 0-5]: [B B B B B B][. . . . . .]
[hostname:13888] MCW rank 1 bound to socket 1[core 0-5]: [. . . . . .][B B B B B B]
[hostname:13888] MCW rank 2 bound to socket 0[core 0-5]: [B B B B B B][. . . . . .]
[hostname:13888] MCW rank 3 bound to socket 1[core 0-5]: [. . . . . .][B B B B B B]
Each socket is represented graphically as a set of square brackets with each core represented by a dot. The core(s) that each rank is bound to is/are denoted by the letter B. Processes are bound to the first hardware thread only.
Open MPI 1.7.x is a bit more verbose and also knows about hardware threads:
mpiexec --report-bindings --bind-to-core
[hostname:28894] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[hostname:28894] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[hostname:28894] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[hostname:28894] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
mpiexec --report-bindings --bind-to-socket
[hostname:29807] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
[hostname:29807] MCW rank 1 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
[hostname:29807] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
[hostname:29807] MCW rank 3 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
Open MPI 1.7.x also replaces the --bycore and --bysocket options with the more general --rank-by <policy> option.
1. If there is equal communication between each node and the root and no other communication pattern, then communication will not influence the performance of a specific process-to-socket mapping. (This assumes a regular symmetric interconnect topology between the sockets.) Otherwise you usually try to place process pairs with heavy communication close to each other in the communication topology. With MPI on shared memory systems that may not be relevant, but on clusters it certainly is.
However load balancing may also have an effect on the performance of the mapping. If some processes wait for a message/barrier, the other cores on that socket may be able to utilize a higher turbo frequency. This heavily depends on the runtime behavior of the application. An application only consisting of sleep and transfer does not really make sense.
2. You can use libnuma / sched_getaffinity to confirm your process pinning manually.
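For example, a minimal sketch (assuming Linux and glibc) that each rank could call to print the CPUs it is allowed to run on, as a manual cross-check of --report-bindings:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* Hypothetical helper: print the affinity mask of the calling MPI rank. */
static void print_my_affinity(void)
{
    int rank;
    cpu_set_t mask;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return;
    }
    printf("rank %d may run on CPUs:", rank);
    for (int c = 0; c < CPU_SETSIZE; ++c)
        if (CPU_ISSET(c, &mask))
            printf(" %d", c);
    printf("\n");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    print_my_affinity();
    MPI_Finalize();
    return 0;
}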
There are a number of performance analysis tools that would help answer your questions. For example, Open MPI comes with VampirTrace, which produces a trace containing information about the MPI communication and more; you can view it with Vampir.

Number of tcp connections used by MPI program (MPICH2+nemesis+tcp)

How many TCP connections will an MPI program use for sending data if the MPI implementation is MPICH2? If you also know about PMI connections, count them separately.
For example, suppose I have 4 processes and 2 additional communicators (COMM1 for the 1st and 2nd processes, COMM2 for the 3rd and 4th); data is sent between each possible pair of processes, in every possible communicator.
I use a recent MPICH2 + Hydra + the default PMI. The OS is Linux, the network is switched Ethernet, and every process is on a separate PC.
So here are the paths of the data (as pairs of processes):
1 <-> 2 (in MPI_COMM_WORLD and COMM1)
1 <-> 3 (only in MPI_COMM_WORLD)
1 <-> 4 (only in MPI_COMM_WORLD)
2 <-> 3 (only in MPI_COMM_WORLD)
2 <-> 4 (only in MPI_COMM_WORLD)
3 <-> 4 (in MPI_COMM_WORLD and COMM2)
I think there can be:
Case 1:
Only 6 TCP connections are used; data sent in COMM1 and MPI_COMM_WORLD is mixed over the single TCP connection.
Case 2:
8 TCP connections: 6 in MPI_COMM_WORLD (all-to-all = full mesh) + 1 for 1 <-> 2 in COMM1 + 1 for 3 <-> 4 in COMM2.
Or some other variant that I didn't think about.
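For concreteness, here is a minimal sketch of how the scenario above could be set up (the split logic and the MPI_Sendrecv exchange are assumptions for illustration, not the asker's code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 4) {                      /* scenario assumes exactly 4 ranks */
        if (rank == 0) fprintf(stderr, "run with mpiexec -n 4\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* color 0 -> ranks 0,1 (COMM1); color 1 -> ranks 2,3 (COMM2) */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &subcomm);

    /* one example exchange with the neighbouring rank in MPI_COMM_WORLD */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    int sendbuf = rank, recvbuf = -1;
    MPI_Sendrecv(&sendbuf, 1, MPI_INT, partner, 0,
                 &recvbuf, 1, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d exchanged with rank %d\n", rank, recvbuf);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}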
Which communicators are being used doesn't affect the number of TCP connections that are established. For --with-device=ch3:nemesis:tcp (the default configuration), you will use one bidirectional TCP connection between each pair of processes that directly communicate via point-to-point MPI routines. In your example, this means 6 connections. If you use collectives then under the hood additional connections may be established. Connections will be established lazily, only as needed, but once established they will stay established until MPI_Finalize (and sometimes also MPI_Comm_disconnect) is called.
Off the top of my head I don't know how many connections are used by each process for PMI, although I'm fairly sure it should be one per MPI process connecting to the hydra_pmi_proxy processes, plus some other number (probably logarithmic) of connections among the hydra_pmi_proxy and mpiexec processes.
I can't answer your question completely, but here's something to consider. In MVAPICH2 we developed a tree-based connection mechanism for the PMI, so each node would have at most log(n) TCP connections. Since opening a socket subjects you to the open-file-descriptor limit on most OSes, it's probable that the MPI library would use a logical topology over the ranks to limit the number of TCP connections.
