Number of TCP connections used by MPI program (MPICH2+nemesis+tcp)

How many TCP connections will be used for sending data by an MPI program if the MPI implementation is MPICH2? If you also know about PMI connections, count them separately.
For example, if I have 4 processes and 2 additional communicators (COMM1 for the 1st and 2nd processes, COMM2 for the 3rd and 4th); data is sent between each possible pair of processes, in every possible communicator.
I use a recent MPICH2 + hydra + the default PMI. The OS is Linux, the network is switched Ethernet. Every process is on a separate PC.
So, here are the paths of data (as pairs of processes):
1 <-> 2 (in MPI_COMM_WORLD and COMM1)
1 <-> 3 (only in MPI_COMM_WORLD)
1 <-> 4 (only in MPI_COMM_WORLD)
2 <-> 3 (only in MPI_COMM_WORLD)
2 <-> 4 (only in MPI_COMM_WORLD)
3 <-> 4 (in MPI_COMM_WORLD and COMM2)
I think the possibilities are:
Case 1:
Only 6 TCP connections will be used; data sent in COMM1 and MPI_COMM_WORLD will be mixed over a single TCP connection.
Case 2:
8 TCP connections: 6 in MPI_COMM_WORLD (all-to-all = full mesh) + 1 for 1 <-> 2 in COMM1 + 1 for 3 <-> 4 in COMM2.
Case 3:
Some other variant that I didn't think about.

Which communicators are being used doesn't affect the number of TCP connections that are established. For --with-device=ch3:nemesis:tcp (the default configuration), you will use one bidirectional TCP connection between each pair of processes that directly communicate via point-to-point MPI routines. In your example, this means 6 connections. If you use collectives then under the hood additional connections may be established. Connections will be established lazily, only as needed, but once established they will stay established until MPI_Finalize (and sometimes also MPI_Comm_disconnect) is called.
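A rough way to see this for yourself (a sketch assuming MPICH2 built with ch3:nemesis:tcp; the program and the lsof check are my own illustration, not anything from MPICH2's documentation) is to send one message between every pair of ranks and then pause, so the sockets each rank holds can be counted from another shell with something like lsof -i tcp -a -p <pid>:

    /* pairwise_conns.c: exercise every pair of ranks, then pause so the
       established TCP connections can be inspected externally. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size, src, dst, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One message between every pair of ranks; with nemesis:tcp the
           pairwise connections are set up lazily, on first use. */
        for (src = 0; src < size; src++) {
            for (dst = src + 1; dst < size; dst++) {
                token = 0;
                if (rank == src)
                    MPI_Send(&token, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
                else if (rank == dst)
                    MPI_Recv(&token, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            }
        }

        /* Pause so the sockets can be counted from another shell.
           Repeating the sends on a sub-communicator (COMM1/COMM2)
           should not change the count. */
        printf("rank %d: pid %d, sleeping 60 s\n", rank, (int)getpid());
        fflush(stdout);
        sleep(60);

        MPI_Finalize();
        return 0;
    }

Note that the socket each rank keeps to its local hydra_pmi_proxy will show up in that count as well, so the PMI connections really do need to be counted separately.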
Off the top of my head I don't know how many connections are used by each process for PMI, although I'm fairly sure it should be one per MPI process connecting to the hydra_pmi_proxy processes, plus some other number (probably logarithmic) of connections among the hydra_pmi_proxy and mpiexec processes.

I can't answer your question completely, but here's something to consider. In MVAPICH2, we developed a tree-based connection mechanism for the PMI, so each node would have at most log(n) TCP connections. Since opening a socket subjects you to the open file descriptor limit on most OSes, it's probable that the MPI library would use a logical topology over the ranks to limit the number of TCP connections.

Related

How are packets delivered from A to B in the same subnet?

Sorry, this is a basic question, but I was wondering what would happen in this case. Firstly, is the case below a valid example?
Let's say I have 100 servers configured to be in the same subnet. I want each of these servers to be able to communicate with one another. Let's say these servers are connected to some network switches. As an example --
                  ---- Switch 4 --- Server 3
                 /                /
                /                /
           --- Switch 3         /
          /                    /
         /                    /
Server1 -- Switch 1 -- Switch 2 -- Server 2.
Based on what I read online, if the servers are within the same subnet, then there is no routing involved. The packets from Server 1 are sent out of the Ethernet interface that belongs to this subnet, and at the Ethernet (layer 2) level the frames are sent with the destination MAC set.
So, for the above diagram, if Server 1 wants to send a packet to Server 3, then it will determine that the packet needs to be sent down the interface that is associated with Switch 1. When Switch 1 tries to decide where to forward the packet, it realizes that frames with this particular MAC address should be forwarded to Switch 3 (not Switch 2). Switch 3 then forwards to Switch 4, and Switch 4 then delivers it to Server 3.
Furthermore, my understanding is that, at layer 2, no sophisticated routing algorithm is used by Switch 1 to decide whether it should send to Switch 2 or Switch 3. It is based on a dumb MAC table that tells it whether it has ever seen this MAC on the port toward Switch 2 or Switch 3, and it forwards accordingly.
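To make that concrete, here is a toy sketch of such a learning table (made-up MAC addresses and port numbers; nothing like a real switch implementation, just the "have I seen this MAC, and on which port" idea):

    /* Toy "dumb MAC table": remember which port each source MAC was last
       seen on; forward to that port, or flood if the MAC is unknown. */
    #include <stdio.h>
    #include <string.h>

    #define TABLE_SIZE 64
    #define FLOOD      -1   /* unknown destination: send out every port */

    struct mac_entry { char mac[18]; int port; };
    static struct mac_entry table[TABLE_SIZE];
    static int entries;

    static void learn(const char *src_mac, int in_port)
    {
        int i;
        for (i = 0; i < entries; i++)
            if (strcmp(table[i].mac, src_mac) == 0) { table[i].port = in_port; return; }
        if (entries < TABLE_SIZE) {
            strcpy(table[entries].mac, src_mac);
            table[entries++].port = in_port;
        }
    }

    static int lookup(const char *dst_mac)
    {
        int i;
        for (i = 0; i < entries; i++)
            if (strcmp(table[i].mac, dst_mac) == 0) return table[i].port;
        return FLOOD;   /* never seen: flood, no path computation at all */
    }

    int main(void)
    {
        /* A frame from Server 1 arrives on port 1: learn its source MAC. */
        learn("aa:aa:aa:aa:aa:01", 1);
        /* A frame for Server 3 whose MAC was never seen is flooded; once
           Server 3 replies, its port is learned and used from then on. */
        printf("to Server 3: port %d\n", lookup("aa:aa:aa:aa:aa:03")); /* -1 */
        learn("aa:aa:aa:aa:aa:03", 3);
        printf("to Server 3: port %d\n", lookup("aa:aa:aa:aa:aa:03")); /* 3 */
        return 0;
    }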
Is my understanding above correct? If so, then what I fail to understand is this --
How is the above MAC-based forwarding within the same subnet any better than using a layer 3 routing protocol like OSPF or LSP, which seems to be a lot smarter about routing efficiently?
If no routing algorithms are used, how will it realize that it is better to go via Switch 2 to reach Server 3 than via Switch 3?
If no routing is present, the path will be determined by MAC address. Server 3 must have 2 network cards (2 MACs, 2 IPs), one connected to Switch 4 and one to Switch 2. When you send a packet you use the IP address; an ARP broadcast is then made to get the MAC address of the device, and once you have it you create the packet with the appropriate headers (layer 2 MAC, layer 3 IP).
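To make the IP-to-MAC step concrete, here is a sketch that asks the Linux kernel's ARP cache what MAC a destination IP resolved to (the interface name and IP address are placeholders; in practice you would just look at arp -n or ip neigh):

    /* Query the kernel ARP cache for a neighbour's MAC address: the
       IP -> MAC resolution that happens before the layer 2 frame is built. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <net/if_arp.h>

    int main(void)
    {
        struct arpreq req;
        struct sockaddr_in *sin = (struct sockaddr_in *)&req.arp_pa;
        int fd;

        memset(&req, 0, sizeof(req));
        sin->sin_family = AF_INET;
        inet_pton(AF_INET, "192.0.2.13", &sin->sin_addr);      /* e.g. Server 3's IP */
        strncpy(req.arp_dev, "eth0", sizeof(req.arp_dev) - 1); /* example interface  */

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (ioctl(fd, SIOCGARP, &req) == 0 && (req.arp_flags & ATF_COM)) {
            unsigned char *m = (unsigned char *)req.arp_ha.sa_data;
            printf("MAC: %02x:%02x:%02x:%02x:%02x:%02x\n",
                   m[0], m[1], m[2], m[3], m[4], m[5]);
        } else {
            printf("no completed ARP entry; send some traffic (e.g. ping) first\n");
        }
        close(fd);
        return 0;
    }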
If you have a routing protocol and routers instead of switches, different cost metrics can be taken into consideration to determine the fastest route, for example hop count and speed. You can even force the route via one path no matter which one is the best.
BONUS: If your switches run STP (Spanning Tree Protocol) and have multiple connections between them, they will determine the best path between switches and give you redundancy in case one link fails. In that case the route may differ from the diagram that you posted. But remember, switches are connected to each other with more than one link, 2 or more for each of them.

SMP affinity vs XPS on paired queues and TX queue selection control

I have a Solarflare NIC with paired RX and TX queues (8 sets, on an 8-core machine, real cores, not hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (TX) interrupts are also handled by the same core? How will this work with XPS?
For instance, let's say the IRQ number is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, which also happens to have IRQ number 115. If I have an XPS setting saying tx-2 should be used by CPU 4, then which one takes precedence: XPS or smp_affinity?
Also, is there a way to see/set which TX queue is being used for a particular app/TCP connection? I have an app that receives UDP data, processes it and sends TCP packets, in a very latency-sensitive environment. I want the TX interrupts for the outgoing traffic handled on the same CPU (or one on the same NUMA node) as the app creating this traffic; however, I have no idea how to find which TX queue is being used by this app for this purpose. While the receive side has indirection tables to set up rules, I do not know if there is a way to control TX-queue selection and therefore pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting the CPU affinity (taskset) or NUMA node affinity, and you can also set the IRQ affinities (in /proc/irq/270/smp_affinity, or by using the old Intel script floating around, 'set_irq_affinity.sh', which is on GitHub). This won't completely guarantee which IRQ / CPU is being used, but it will give you a good head start. If all that fails, to improve latency you might want to enable packet steering on the receive queues so the packets get to the correct CPU more quickly (/sys/class/net/<device>/queues/rx-#/rps_cpus and tx-#/xps_cpus). There is also the irqbalance program, and more... it is a broad subject and I am just learning much of it myself.
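As a sketch of how those pieces fit together (the device name eth2 is a placeholder, the queue/IRQ/CPU numbers are the example values from the question, and the /proc and /sys writes need root):

    /* Keep the app, the paired rx/tx IRQ and XPS for tx-2 all on CPU 2. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static void write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fprintf(f, "%s\n", val);
        fclose(f);
    }

    int main(void)
    {
        cpu_set_t set;

        /* 1. Pin this process to CPU 2 (same effect as taskset -c 2). */
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        /* 2. Point IRQ 115 (the paired rx-2/tx-2 interrupt) at CPU 2;
              the value is a hex CPU mask, 0x4 = CPU 2. */
        write_str("/proc/irq/115/smp_affinity", "4");

        /* 3. Tell XPS that traffic sent from CPU 2 should use tx-2, so the
              transmit completions and the app stay on the same core. */
        write_str("/sys/class/net/eth2/queues/tx-2/xps_cpus", "4");

        return 0;
    }

With XPS set this way, packets sent by a task running on CPU 2 should prefer tx-2, which is the closest thing to user-controllable TX-queue selection that I know of; a driver that implements its own queue-selection hook can still override it.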

Shared memory Vs QPI

Assume there is a node with two processors having 8 cores each (Intel Sandy Bridge E5-2670). Each processor is fitted in a separate socket, hence there are 8 cores in each socket. Suppose I have two processes, one on socket 1 and one on socket 2, and they communicate. How can I find out whether they use QPI (QuickPath Interconnect) for communication or some form of shared memory?
(Each socket has a shared L3 cache of 20 MB and 16 GB of RAM.)

How does 802.11 MAC unfairness impact TCP performance?

One of the major factors that affect TCP performance in 802.11 ad hoc networks is the unfairness in the MAC. Could someone please illustrate for me what this "unfairness" means?
In ad hoc networks, you usually are trying to do multihop routing. 802.11 CSMA/CA can manifest the "exposed terminal problem" in these situations. Consider a linear topology
... A <---> T <---> B ...
A and B are not in CSMA range. Suppose T is already transmitting a data stream to A. Now suppose a TCP stream starts getting routed through B. Because B CSMAs with T, it will essentially be locked out of the channel. The TCP connection being routed through B will eventually time out.
Another possibility is the "hidden terminal problem". Consider the topology
A ---> X <--- B
A and B cannot CSMA. Suppose A and B each try to send a TCP stream to X. Because they cannot CSMA with each other, both win their respective channel contention rounds and transmit, only to have their frames collide at X. This can be solved to some extent with RTS/CTS. But in general, the reason for poor TCP performance in wireless environments has to do with the fact that TCP uses a dropped packet as a congestion signal, i.e., a TCP source will cut its window and thereby drop its throughput. In wireless networks, dropped packets can be due to any number of transient things (e.g., collision, interference). A TCP source misinterprets these packet drops as a congestion signal and will throttle its send rate and underutilize the channel.
Another problem that can arise is due to the "capture effect". Again, consider
A ----> X <-------------B
Both A and B are in range of X, but B is farther away and thus has a lower received signal strength at X. Again, A and B cannot CSMA. In this case, X may "capture" the stronger transmitter A, i.e., its radio will decode A's frames but consider B's as noise (even though in the absence of A, X would go right ahead and decode B's frames). This sets up an unfair advantage for A if both are trying to route a TCP stream through X.
802.11's DCF also favors the last winner of the channel contention round. As a result, this gives a slight advantage to long-lived, bulk TCP transfers.
Note that all these problems affect all transport protocols, not just TCP. It's just that the way TCP is designed makes it react particularly poorly to these scenarios.

How does TCP slow start increase throughput?

TCP slow start came about at a time when the Internet was beginning to experience "congestion collapses". The anecdotal example from Van Jacobson and Michael Karels' paper goes:
During this period, the data throughput from LBL to UC Berkeley (sites separated
by 400 yards and two IMP hops) dropped from 32 Kbps to 40 bps.
The congestion problem is often described as being caused by the transition from a high-speed link to a slow-speed link, and packet build up/dropping at the buffer at this bottleneck.
What I'm trying to understand is how such a build up would cause a drop in end-to-end throughput, as opposed to simply causing superfluous activity/retransmits on the high-speed portion of the link leading into the full buffer. As an example, consider the following network:
    fast       slow       fast
A ======== B -------- C ======== D
A and D are the endpoints and B and C are the packet buffers at a transition from a high-speed to a low-speed network. So e.g. the links between A/B and C/D are 10Mbps, and the link between B/C is 56Kbps. Now if A transmits a large (let's say theoretically infinite) message to D, what I'm trying to understand is why it would take any longer to get through if A just hammered the TCP connection with data versus adapting to the slower link speed in the middle of the connection. I'm envisaging B as just being something whose buffer drains at a fixed rate of 56Kbps, regardless of how heavily its buffer is being hammered by A, and regardless of how many packets it has to discard because of a full buffer. So if A is always keeping B's buffer full (or overfull as may be the case), and B is always transmitting at its maximum rate of 56Kbps, how would the throughput get any better by using slow start instead?
The only thing I could think of was if the same packets D had already received were having to be retransmitted over the slow B/C link under congestion, and this was blocking new packets. But wouldn't D have typically ACK'd any packets it had received, so retransmitted packets should be mostly those which legitimately hadn't been received by D because they were dropped at B's buffer?
Remember that networks involve sharing resources between multiple computers. Very simplistically, slow start is required to avoid router buffer exhaustion by a small number of TCP sessions (in your diagram, this is most likely at points B and C).
From RFC 2001, Section 1:
Old TCPs would start a connection with the sender injecting multiple
segments into the network, up to the window size advertised by the
receiver. While this is OK when the two hosts are on the same LAN,
if there are routers and slower links between the sender and the
receiver, problems can arise. Some intermediate router must queue
the packets, and it's possible for that router to run out of space.
[2] shows how this naive approach can reduce the throughput of a TCP
connection drastically.
...
[2] V. Jacobson, "Congestion Avoidance and Control," Computer
Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.
Routers must have finite buffers. The larger the speed mismatch between links, the greater the chance of buffer exhaustion without slow start. Once you have buffer exhaustion, your average TCP throughput goes down, because buffering is what lets TCP keep links utilized (it prevents unnecessary drops during instantaneous link saturation).
Note that RFC 2001 above has been superseded by RFC 5681; however, RFC 2001 offers a more quotable answer to your question.
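As a purely illustrative sketch (segment counts made up) of what slow start does instead of injecting a full window at once:

    /* Toy model of slow start / congestion avoidance in units of segments.
       Not a real TCP implementation; it only shows the ramp-up shape. */
    #include <stdio.h>

    int main(void)
    {
        double cwnd = 1.0;      /* congestion window, in segments        */
        double ssthresh = 64.0; /* assumed slow-start threshold          */
        int rtt;

        for (rtt = 1; rtt <= 12; rtt++) {
            printf("RTT %2d: cwnd = %.0f segments\n", rtt, cwnd);
            if (cwnd < ssthresh)
                cwnd *= 2.0;    /* slow start: roughly doubles every RTT */
            else
                cwnd += 1.0;    /* congestion avoidance: linear growth   */
            /* On a loss, a real sender would set ssthresh = cwnd / 2 and
               back off, instead of keeping B's buffer permanently full. */
        }
        return 0;
    }

The point is that the sender probes its way up to the rate the bottleneck at B can actually sustain, rather than dumping an entire window into B's buffer up front.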
From your OP...
Now if A transmits a large (let's say theoretically infinite) message to D, what I'm trying to understand is why it would take it any longer to get through if it just hammered the TCP connection with data versus adapting to the slower link speed in the middle of the connection.
First, there is no such thing as an infinite message in TCP. Before slow start came along, TCP was limited by the window size advertised by the receiver.
So, let's say the initial TCP window was 64KB. If that entire window's worth of data fills the router's tx buffer at B, TCP utilizes less of the link over time due to the dynamics of packet loss, ACKs and TCP back-off. Let's look at the individual situations:
B's tx_buffer < 64KB: You automatically lose time to retransmissions, because A's TCP is sending faster than B can dequeue packets.
B's tx_buffer >= 64KB: As long as A is the only station transmitting, there are no negative effects (as long as D is ACK-ing correctly); however, if there are multiple hosts on A's LAN trying to transit across the 56K link, there are probably problems, because it takes roughly 200 milliseconds to dequeue a single 1500-byte packet at 56K. If you have 44 1500-byte packets from A's 64KB initial window (44*1460=64KB; you only get 1460 bytes of TCP payload per packet), the router's link is saturated for roughly 9 seconds handling A's traffic alone.
The second situation is neither fair nor wise. TCP backs off when it sees any packet loss... multiple hosts sharing a single link must use slow start to keep the situation sane.
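To sanity-check the 56K arithmetic above (a throwaway back-of-the-envelope calculation assuming 1500 bytes on the wire per packet):

    /* Serialization delay and burst drain time on the 56K bottleneck. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps  = 56000.0; /* 56 kbit/s bottleneck link    */
        const double pkt_bytes = 1500.0;  /* bytes on the wire per packet */
        const int    pkts      = 44;      /* 44 * 1460 B payload ~= 64 KB */

        double per_pkt_s = pkt_bytes * 8.0 / link_bps; /* ~0.21 s per packet */
        double drain_s   = per_pkt_s * pkts;           /* ~9 s for the burst */

        printf("per-packet serialization: %.0f ms\n", per_pkt_s * 1000.0);
        printf("time to drain %d packets: %.1f s\n", pkts, drain_s);
        return 0;
    }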
BTW, I have never seen a router with 9 seconds of buffering on an interface. No user would tolerate that kind of latency. Most routers have about 1-2 seconds max, and that was years ago at T-1 speeds. For a number of reasons, buffers today are even smaller.
