What is the common topology of GPU within clusters? - networking

I am a novice in the field of high performance computing and I am learning the Allreduce operation for GPU cards. I find the efficient collective operation called ring-Allreduce which requires the physical topology of GPU cards to be the tree topology. However, I check the topology of my own server as the following.
It seems that the GPU cards are connected by several local PCIe bus and PCIe host bridge. Is it the hierarchy of bus topology?

A two-socket system has several PCIe root ports on each socket. A PCIe bridge is attached to each root port, and the GPUs are attached to the bridges.
Connections labeled PIX are between GPUs attached to the same bridge.
Connections labeled NODE are between GPUs attached to two different bridges (on two different root ports).
Connections labeled SYS are connected to root ports on different sockets.

Related

Can an L2 switch send its MAC table data to another device/switch connected in the same network via a port?

My requirement is, there are four ethernet switches connected in a straight daisy chain manner. But no ring/loop is formed. To each switch's PHY ports, there are I/O modules connected. A main CPU is interfaced to the first switch.
My problem is how will the main CPU the location of these ports(I/O modules)? Is there a way to send the MAC table data to another device/switch connected in the same network via a port.
Thanks
Need any other possible ways to implement this.

Does the vhost-user driver ensures the distribution of traffic between multiple RX queues?

I have a question for you. I know that vhost-user NICs can be configured with many RX/TX queues, but does the vhost-user driver ensures the distribution of traffic between RX queues?
I used the sample application l3fdw to switch traffic between two vhost-user NICs, each with 4 queues. The traffic was generated using TREX (and testpmd also), running inside a VM. When I traced my experiment, I noticed that the traffic was only received in queue "0", while the other RX queues were empty.
The l3fdw app tells me that "Port 0 modified RSS hash function based on hardware support,requested:0xa38c configured:0". For offloading capabilities, testpmd indicates that the vhost-user nic NIC has only support for VLAN STRIP (and not for RSS)!
I appreciate any clarification on this matter.
Thank you,
PS:
DPDK version: 19.08
Qemu version: 4.2.1
Adele
Answer for the original question does the vhost-user driver ensures the distribution of traffic between RX queues? is
There is no mechanism like RSS or RTE_FLOW from DPDK Libraries which will ensure software packet distribution on the RX queues of VHOST NIC.
#AdelBelkhiri there are multiple aspects to be clarified to understand this better.
Features supported by VHOST PMD do not advertise either RTE_FLOW or RSS.
Driver code for vhost pmd in file rte_eth_vhost.c does not advertise RSS or RTE_FLOW capability
there is an article which describes the use of OVS and Multiple queues. The RSS is configured on the Physical NIC with 2 RX queues. The RSS is done on the Physical NIC, 2 separate threads picks the packets from the Physical RX queue and puts the same in VHOST queues. Thus achieving pass-through RSS.
hence in your case where you have 2 VM with 2 NIC ports each having 4 queues, please try 8 PMD threads on OVS to concurrently fwd packets between queues. Where the TREX (TX) VM will ensure to put appropriate packets into each queue seperately.
But the simple answer is there is no RSS or RTE_FLOW logic to distribute traffic

how CPU distributes data from network?

I'm learning network communications and already familiar with TCP/IP networking layers (physical, data link ... and application layers) and how data moves through this nodes. But I have some questions about what happens inside a machine, when data is received by a Network Interface Card(NIC).
Questions:
How CPU knows that data from other machine is arrived?
How CPU informs OS that data from other machine is arrived?
How OS knows which application the data is for?
Please, give me some deep explanation for this topic, or advice some useful materials to make it clear.
To give you a general view from Linux point(should be similar for other OS):
The packets arrive in NIC. These packets are copied into circular queues in RAM via DMA. The arrival of packets will generate an interrupt to let the system know that their are packets in RAM. Corresponding to the interrupt there will be an interrupt handler routine registered with the Operating System via the network driver. (To keep things simple didn't talk about softirq's). Each CPU has a poll function whose job is to harvest packets from these queue's and pass them onto upper n/w layers. So answering your queries:
How CPU knows that data from other machine is arrived?
When interrupt occurs and poll loop is not running on the CPU, the OS(via network driver)
will ask the CPU to start the poll loop for harvesting the packets.
How CPU informs OS that data from other machine is arrived?
CPU doesn't need to inform OS. The OS will know when the interrupt occurs as the interrupt handler is a part of the network driver which is part of OS. Infact in a way OS will tell the CPU to start harvesting packets.
How OS knows which application the data is for?
The communication is done via sockets which will have a port number. The packets arrived will have a port number which will guide the OS to take the packet to the required application.

Simulating wireless network transmission

I have an embedded wireless driver for wireless communication that I would like to run in a simulated environment so that I don't have to own hundreds of the devices to test my code.
Ultimately I would like to be able to specify a network connection between nodes, which would be an instance of the C++ code that is used in the embedded device. The connection should simulate the wireless medium such that no packets arrive to the node whilst it is transmitting and no packets arrive at the node if two or more nodes are trying to transmit. Connections modelled by a basic connectivity graph.
I wonder if such a model is possible in the Python Twisted framework. If it is not are there any networking frameworks that would make such a task simple.
Language preferences are Python or Java.
Thanks.
You can use Network Simulator 2 (NS2). Its a awesome network simulator. Its uses C++ and TCL.

TCP - LRO/TSO techniques

Why is it must that all interfaces (routers and bridges) involved support LRO/TSO technique ?
Routers don't. Bridges do.
External routers, hubs, switches or anything else that is externally connected to the network will not see the effects of TSO, only interfaces inside the device with TSO will experience any effects - it's a software thing.
A router is an external device which is connected to the network by ethernet cables, fibre optic cables, wireless comms etc. These communication mediums adhere to internation standards such as 803.2 for ethernet or 803.11 for wireless - they're hardware devices, and hardware devices have very strict rules on how they communicate.
A bridge is an internal software construct and is specific to your OS.
Let's use 803.2 (ethernet) and a linux host for an example.
An application calls for a socket to be created and then pushes a large data chunk into the socket. The linux kernel determines which interface this data should be transitted on. The kernel will next interrogate the driver for this interface to determine its capabilities, if the interface is TSO capable the kernel will pass an sk_buff with a single "template" header and a huge chunk of data (more than 1 packets worth) to the interface driver.
Let's consider a standard interface straight to a hardware NIC first:
Some interfaces have fake TSO (they segment the packet in the driver) and some have true TSO (the template header and data are passed to the hardware with minimal alterations). At this point ether the driver or the NIC hardware will convert this large segment of data into multiple, standard compliant, 803.2 ethernet frames, it is these compliant frames that an external device, such as a router, hub, switch, modem or other host will see on the wire.
Now let's consider several NICs behind a software bridge:
Although the kernel is aware of each NIC at a low level, the network stack is only aware of the bride, thus only capabilities that ALL of the underlying NICs have should be passed up to the bridge. If an sk_buff is passed to a bridge, then ALL the interfaces in the bridge will receive the same sk_buff. We'll assume that the kernal has once again passed our large TSO sk_buff to a bridge, if any of the underlying interfaces does not support TSO then the packet will most likely be dropped by the hardware NIC in question.
In summary:
Worst case scenario is the bridge will repeatedly retry to send the same data chunk on the broken interface and the whole bridge will lock up until the application decides to give up. Best case scenario, the non TSO NIC will simply appear to be dead.
That said, if the NIC has unsafe code in its driver then this could cause a segmentation fault that could bring the whole system down.

Resources