I am able to run an Open MPI job on multiple nodes over SSH. Everything looks good, but I find that I do not know much about what is really happening. So, how do nodes communicate in Open MPI? It runs on multiple nodes, hence it cannot be shared memory. It also does not seem to be TCP or UDP, because I have not configured any ports. Can anyone describe what happens when a message is sent and received between two processes on two nodes? Thanks!
Open MPI is built on top of a framework of frameworks called Modular Component Architecture (MCA). There are frameworks for different activities such as point-to-point communication, collective communication, parallel I/O, remote process launch, etc. Each framework is implemented as a set of components that provide different implementations of the same public interface.
Whenever the services of a specific framework are requested for the first time, e.g., those of the Byte Transfer Layer (BTL) or the Matching Transport Layer (MTL), both of which transfer messages between the ranks, MCA enumerates the various components capable of fulfilling the request and tries to instantiate them. Some components have specific requirements of their own, e.g., they require specific hardware to be present, and fail to instantiate if those aren't met. All components that instantiate successfully are scored, and the one with the best score is chosen to carry out the request and other similar requests. Thus, Open MPI is able to adapt itself to different environments with very little configuration on the user side.
For communication between different ranks, the BTL and MTL frameworks provide multiple implementations and the set depends heavily on the Open MPI version and how it was built. The ompi_info tool can be used to query the library configuration. This is an example from one of my machines:
$ ompi_info | grep 'MCA [mb]tl'
MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v2.1.1)
MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.1)
MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v2.1.1)
MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v2.1.1)
MCA btl: self (MCA v2.1.0, API v3.0.0, Component v2.1.1)
MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v2.1.1)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v2.1.1)
The different components listed here are:
openib -- uses InfiniBand verbs to communicate over InfiniBand networks, one of the most widespread high-performance communication fabrics for clusters nowadays, and over other RDMA-capable networks such as iWARP or RoCE
sm -- uses shared memory to communicate on the same node
tcp -- uses TCP/IP to communicate over any network that provides a sockets interface
vader -- similarly to sm, provides shared memory communication on the same node
self -- provides efficient self-communication
psm -- uses the PSM library to communicate over networks derived from PathScale's InfiniBand variant, such as Intel Omni-Path (r.i.p.)
ofi -- alternative InfiniBand transport that uses OpenFabrics Interfaces (OFI) instead of verbs
The first time rank A on hostA wants to talk to rank B on hostB, Open MPI will go through the list of modules. self only provides self-communication and will be excluded. sm and vader will get excluded since they only provide communication on the same node. If your cluster is not equipped with a high-performance network, the most likely candidate to remain is tcp, because there is literally no cluster node that doesn't have some kind of Ethernet connection to it.
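You can also constrain this selection yourself by restricting the list of BTLs on the command line. A hedged example (the component names must match whatever ompi_info reports for your build, and ./my_mpi_program is just a placeholder):
$ mpiexec --mca btl self,vader,tcp -n 2 ./my_mpi_program
With the list restricted like this, tcp is effectively the only candidate left for inter-node traffic, while vader still handles traffic within a node.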
The tcp component probes all network interfaces that are up and notes their network addresses. It opens listening TCP ports on all of them and publishes this information in a central repository (usually managed by the mpiexec process used to launch the MPI program). When the MPI_Send call in rank A requests the services of tcp in order to send a message to rank B, the component looks up the information published by rank B and selects all IP addresses that are in any of the networks that hostA is part of. It then tries to create one or more TCP connections and, upon success, the messages start flowing.
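Note that none of this machinery is visible in the application code; a minimal send/receive pair looks the same no matter which component ends up carrying the bytes. A small sketch in C (standard MPI calls only; build with mpicc and launch with mpiexec -n 2):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* Whatever BTL/MTL the MCA selected carries this message. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}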
In most cases, you do not need to configure anything and the tcp component Just Works™. Sometimes, though, it may need some additional configuration. For example, the default TCP port range may be blocked by a firewall and you may need to tell it to use a different one. Or it may select network interfaces that are in the same network range but do not provide physical connectivity; a typical case is the virtual interfaces created by various hypervisors or container services. In that case, you have to tell tcp to exclude those interfaces.
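For instance, something like this excludes the loopback and a virtual bridge interface (hedged: virbr0 is just a stand-in for whatever virtual interface exists on your nodes, and setting the parameter replaces its default value, so keep the loopback in the list):
$ mpiexec --mca btl_tcp_if_exclude 127.0.0.1/8,virbr0 -n 16 ./my_mpi_program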
Configuring the various MCA components is done by passing MCA parameters with the --mca param_name param_value command-line argument of mpiexec. You may query the list of parameters that a given MCA component has, and their default values, with ompi_info --param framework component. For example:
$ ompi_info --param btl tcp
MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v2.1.1)
MCA btl tcp: ---------------------------------------------------
MCA btl tcp: parameter "btl_tcp_if_include" (current value: "",
data source: default, level: 1 user/basic, type:
string)
Comma-delimited list of devices and/or CIDR
notation of networks to use for MPI communication
(e.g., "eth0,192.168.0.0/16"). Mutually exclusive
with btl_tcp_if_exclude.
MCA btl tcp: parameter "btl_tcp_if_exclude" (current value:
"127.0.0.1/8,sppp", data source: default, level: 1
user/basic, type: string)
Comma-delimited list of devices and/or CIDR
notation of networks to NOT use for MPI
communication -- all devices not matching these
specifications will be used (e.g.,
"eth0,192.168.0.0/16"). If set to a non-default
value, it is mutually exclusive with
btl_tcp_if_include.
MCA btl tcp: parameter "btl_tcp_progress_thread" (current value:
"0", data source: default, level: 1 user/basic,
type: int)
Parameters have different levels and by default ompi_info only shows parameters of level 1 (user/basic parameters). This can be changed with the --level N argument to show parameters up to level N. The levels go all the way up to 9 and those with higher levels are only required in very advanced cases, such as fine-tuning the library or debugging issues. For example, btl_tcp_port_min_v4 and btl_tcp_port_range_v4, which are used in tandem to specify the port range for TCP connections, are parameters of level 2 (user/detail).
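For instance, to inspect those parameters and then pin the TCP ports to a specific range (a hedged example; pick a range that your firewall actually allows):
$ ompi_info --param btl tcp --level 2
$ mpiexec --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 -n 16 ./my_mpi_program
This would make the tcp component open its listening sockets on ports 10000 through 10099.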
Related
The Gossip protocol is used by many distributed systems, e.g. Cassandra, to communicate with other nodes in the ring. So, does it use HTTP or TCP?
Also, what are the pros of choosing one over the other in distributed systems?
You can use any protocol you want (TCP, HTTP, DNS, etc.) to broadcast information regarding the state of the nodes in a cluster. In my opinion, you should focus on the gossip algorithm itself and not get hung up on the word "protocol" in the name. At its core, it's all about spreading information between nodes: each node sends its own view of the cluster state to a subgroup of nodes, and the broadcast keeps going until all nodes share the same view. There are multiple ways of implementing such a broadcasting algorithm, so research it further or try your own model :)
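To make the idea concrete, here is a very rough sketch in C of one gossip "push" round. Everything here (the View structure, the FANOUT value, the printf standing in for the transport) is made up for illustration; the transport behind it could equally be TCP, UDP or HTTP:
#include <stdio.h>
#include <stdlib.h>

#define FANOUT    3      /* size of the random subgroup gossiped to each round */
#define MAX_NODES 16

/* A toy "view" of the cluster: which nodes we currently believe are alive. */
typedef struct { int alive[MAX_NODES]; } View;

/* Stand-in for the real transport; a real node would serialize the view and send it. */
static void send_view_to(int peer_id, const View *view)
{
    (void)view;
    printf("pushing my view to node %d\n", peer_id);
}

/* One gossip round: push our view to FANOUT randomly chosen peers. */
static void gossip_round(const View *my_view, const int *peers, int peer_count)
{
    for (int i = 0; i < FANOUT && peer_count > 0; i++)
        send_view_to(peers[rand() % peer_count], my_view);
}

int main(void)
{
    View v = { .alive = { 1, 1, 1 } };   /* we believe nodes 0..2 are alive */
    int peers[] = { 1, 2 };              /* nodes we know how to reach */
    gossip_round(&v, peers, 2);
    return 0;
}
On the receiving side a node would merge the incoming view into its own and push the merged result onward in its next round, so the information spreads epidemically until every node converges on the same view.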
Here is some nice info and pseudocode about gossip models/algorithms.
HTTP and TCP are fundamentally different things as they work on different layers of the network stack:
https://en.wikipedia.org/wiki/OSI_model
If you look at the OSI model, TCP works at the transport layer (layer 4) and HTTP works at the application layer (layer 7). The two perform different jobs. The transport layer is responsible for providing the functional mechanisms for transferring data. The application layer is built on top of the transport (and other) layers and provides items such as partner negotiation, availability and communication syncing.
The two are not interchangeable with one another.
Do remote procedure calls support bi-directional communication?
I.e., is it possible to build a communication mechanism using "pure" RPC (without any protocols on top of it like XML-RPC, JSON-RPC, Thrift, etc.) which allows two machines to exchange messages in both directions (from machine 1 to machine 2 and vice versa)?
The old, well-known RPC (ONC RPC / Sun RPC) does allow bi-directional connections. Of course, in the end you need an implementation which supports that. There is a fork of TI-RPC maintained by LinuxBox, https://github.com/mattbenjamin/libtirpc-lbx, and a Java implementation, http://code.google.com/p/nio-jrpc/, with bi-directional RPC support. Both libraries are used in NFSv4.1 server/client implementations, which require bi-directional RPC.
I'd like to know where the Transport Layer of the OSI model is running in a computer system. Is it part of the Operating System? Does it run in its own process or thread? How does it pass information up to other applications or down to other layers?
I'd like to know where the Transport Layer of the OSI model is running in a computer system.
It isn't. The OSI model applies to the OSI protocol suite, which is defunct, and not running anywhere AFAICS. However TCP/IP has its own model, which also includes a transport layer. I will assume that's what you mean hereafter.
Is it part of the Operating System?
Yes.
Does it run in its own process or thread?
No, it runs as part of the operating system.
How does it pass information up to other applications
Via system calls, e.g. the Berkeley Sockets API, WinSock, etc.
or down to other layers?
Via internal kernel APIs.
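To illustrate that boundary, here is a minimal sketch in C using the Berkeley sockets API. The application only calls socket(), connect() and send(); building TCP segments, retransmission and routing all happen inside the kernel's transport and network layers (192.0.2.1 is just a documentation placeholder address):
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* socket(), connect() and send() are the system-call boundary where */
    /* the application hands its data over to the OS transport layer.    */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);                        /* destination port */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* placeholder address */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        const char *msg = "hello\n";
        send(fd, msg, strlen(msg), 0);   /* data enters the kernel's TCP here */
    }
    close(fd);
    return 0;
}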
What the OSI model calls the transport layer corresponds fairly closely to the TCP layer in TCP/IP. That is, it gives guaranteed delivery/error recovery, and transparent transfers between hosts -- you don't need to pay attention to how the data is routed from one host to another -- you just specify a destination, and the network figures out how to get it there.
As far as where that's implemented: well, mostly in the TCP/IP stack, which is typically part of the OS. Modern hardware can implement at least a few bits and pieces in the hardware though (e.g., TCP checksum and flow control). The network stack will offload those parts of the TCP operation to the hardware via the device driver.
The transport layer is available as a library usually shipping with the operating system.
The logical part is implemented in the library; interaction with the transport medium goes through drivers.
The transport layer exists between two or more devices, in this example a client and a host machine (virtual or real). Transport is invoked by the operating system on both ends; both the client and the host machine have instances of an operating system and underlying hardware managing transport.
Transport control coordinates delivery assurance for both the client's and the host machine's OS. Some machines, where necessary, shift some of the workload from the CPU or kernel down to the underlying chipsets to lighten the load. Transport duty is essentially commodity work not typically appropriate for the kernel or main CPU, but the OS is where transport functionality evolved from as the grid modernized.
In the classroom, the duty is done by the OS; in the industrial control systems I design and implement, we always consider hardware acceleration and efficiency.
In the design of the management API of a network element, we often include support for commonly used CLIs such as the Cisco-style CLI and the Juniper-style CLI. But to support those commands, we need to know how each command breaks down into a sequence of operations on the MIB tables and the objects therein.
For example:
A CLI command:
router bgp 4711 neighbor 3.3.3.3
And its MIB object operations (as in SNMP) would be:
bgpRmEntIndex 4711
bgpPeerLocalAddrType unica
bgpPeerLocalAddr 2.2.2.2
bgpPeerLocalPort 179
bgpPeerRemoteAddrType uni
bgpPeerRemoteAddr 3.3.3.3
bgpPeerRemotePort 179
Is there some resource which can help us understand this breakdown?
In general on the types of devices that you mention, you will find that there is no simple mapping between CLI operations and (SNMP) operations on MIB variables. The CLIs are optimized for "user-friendly" configuration and on-line diagnostics, SNMP is optimized for giving machine-friendly access to "instrumentation", mostly for monitoring. Within large vendors (such as Cisco or Juniper) CLI and SNMP are typically developed by different specialized groups.
For something that is closer to CLIs but more friendly towards programmatic use (an API), have a look at the IETF NETCONF protocol, which provides XML-based RPC read and write access to device configuration (and state). Juniper pioneered this concept through their Junoscript APIs and later helped define the IETF standard, so you will find good support there. Cisco has also added NETCONF capabilities to their systems, especially the newer ones such as IOS XR.
The MIB documents themselves, such as this one, may also be helpful:
http://www.icir.org/fenner/mibs/extracted/BGP4-V2-MIB-idr-00.txt
I want to develop a simple serverless LAN chat program just for fun. How can I do this? What type of architecture should I use?
Last year I worked on a TCP/UDP client/server application project. It was simple (the server listens on a certain port/socket and the client connects to the server's port, etc.). But I have no idea how to develop a "serverless" LAN chat program. How can I do this? UDP, TCP, multicast, broadcast? Or should the program behave as both server and client?
The simplest way would be to use UDP and simply broadcast your messages all over the network.
A slightly more advanced version would be to use the broadcast only to discover other nodes in the network:
Every node maintains a list of known peers.
Messages are sent with TCP to all known peers.
When a node starts up, it sends out a UDP broadcast to discover other nodes (see the sketch after this list).
When a node receives a discovery broadcast, it sends "itself" to the source of the broadcast, in order to make itself known. The receiving node adds the broadcaster to its own list of known peers.
When a node drops out of the network, it sends another broadcast in order to inform the remaining nodes that they should remove the dropped client from their list.
You would also have to consider handling nodes that drop out of the network without informing the rest of it.
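A minimal sketch in C of the discovery broadcast from the list above (the port number 4711 and the message text are arbitrary choices; a real program would also need the matching receive path and the TCP side):
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Broadcasting has to be enabled explicitly on the socket. */
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &yes, sizeof(yes));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(4711);                    /* arbitrary discovery port */
    addr.sin_addr.s_addr = htonl(INADDR_BROADCAST); /* 255.255.255.255 */

    /* "HELLO" announces us; peers that hear it reply with their own address, */
    /* which we then add to our list of known peers.                          */
    const char *msg = "HELLO";
    sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&addr, sizeof(addr));

    close(fd);
    return 0;
}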
The Spread toolkit may be a bit of an overkill for what you want, but it is an interesting starting point.
From the blurb:
Spread is an open source toolkit that provides a high performance messaging service that is resilient to faults across local and wide area networks. Spread functions as a unified message bus for distributed applications, and provides highly tuned application-level multicast, group communication, and point to point support. Spread services range from reliable messaging to fully ordered messages with delivery guarantees.
Spread can be used in many distributed applications that require high reliability, high performance, and robust communication among various subsets of members. The toolkit is designed to encapsulate the challenging aspects of asynchronous networks and enable the construction of reliable and scalable distributed applications.
Spread consists of a library that user applications are linked with, a binary daemon which runs on each computer that is part of the processor group, and various utility and demonstration programs.
Some of the services and benefits provided by Spread:
Reliable and scalable messaging and group communication.
A very powerful but simple API simplifies the construction of distributed architectures.
Easy to use, deploy and maintain.
Highly scalable from one local area network to complex wide area networks.
Supports thousands of groups with different sets of members.
Enables message reliability in the presence of machine failures, process crashes and recoveries, and network partitions and merges.
Provides a range of reliability, ordering and stability guarantees for messages.
Emphasis on robustness and high performance.
Completely distributed algorithms with no central point of failure.
Apple's iChat is an example of the very product you are envisioning. It uses Bonjour (Apple's zero-conf networking protocol) to identify peers on a LAN. You can then chat or audio/video chat with them.
I'm not entirely sure how Bonjour works inside, but I know it uses multicast. Clients "register" services on the LAN, and the Bonjour protocol allows each host to pull up a directory of hosts for a given service (all without central management).