Why does MPI_Barrier not complete at the same time for workers across different nodes? - mpi

I used the following piece of code to synchronize the worker processes running on different boxes:
MPI_Barrier(MPI_COMM_WORLD);
gettimeofday(&time[0], NULL);
printf("RANK: %d starts at %lld sec, %lld usec\n", rank, (long long)time[0].tv_sec, (long long)time[0].tv_usec);
When I run two tasks on the same node, the starting times are quite close:
RANK: 0 starts at 1379381886 sec, 27296 usec
RANK: 1 starts at 1379381886 sec, 27290 usec
However, when I run two tasks on two different nodes, I end up with starting times that differ more:
RANK: 0 starts at 1379381798 sec, 720113 usec
RANK: 1 starts at 1379381798 sec, 718676 usec
Is this difference reasonable? Or does it imply some communication issue between the nodes?

A barrier means that the different nodes synchronize. They do this by exchanging messages. However, once a node has received a message from all the other nodes saying they have reached the barrier, that node continues. There is no reason to wait before executing further code, since barriers are mainly used to guarantee, for instance, that all nodes have processed their data, not to synchronize the nodes in time.
One can never synchronize nodes in time exactly. Only by using protocols like the Simple Network Time Protocol (SNTP) can one guarantee that clocks are set approximately equal.
A barrier guarantees that the code before the barrier has been executed on every node before any node starts executing what comes after it.
For instance let's say all nodes execute the following code:
MethodA();
MPI_Barrier(MPI_COMM_WORLD);
MethodB();
Then you can be sure that if a node executes MethodB, all other nodes have executed MethodA; however, you don't know anything about how far they have already progressed in MethodB.
Latency has a strong influence on the observed times. Say, for instance, machineA is somehow faster than machineB (assume a WORLD with two machines; the time difference can be caused by caching, ...).
If machineA reaches the barrier, it sends a message to machineB and waits for a message from machineB saying that machineB has reached the barrier as well. Some time later machineB reaches the barrier and sends its message to machineA. However, machineB can immediately continue processing data, since it has already received machineA's message; machineA, on the other hand, must wait until machineB's message arrives. Of course this message arrives quite soon, but it causes some time difference. Furthermore, it is not guaranteed that the message is received correctly the first time: if machineA does not acknowledge it, machineB will resend it after a while, causing more and more delay. In a LAN, however, packet loss is very unlikely.
One can thus say that the transmission time (latency) accounts for the difference in exit times.
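To make this concrete, here is a minimal self-contained sketch of the pattern (the sleep standing in for MethodA and the gathering of exit times are my additions, not part of the original question): each rank records MPI_Wtime() right after the barrier returns and rank 0 prints all the values. Across nodes, differences on the order of hundreds of microseconds are expected and reflect the latency of the barrier's final messages rather than a communication problem.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Placeholder for "MethodA": ranks finish their work at different times. */
    sleep(rank % 2);

    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks have finished MethodA here */
    double t_exit = MPI_Wtime();   /* time at which this rank leaves the barrier */

    /* Gather the exit times on rank 0 and print them.
       Note: MPI_Wtime() is not necessarily synchronized across nodes
       (check MPI_WTIME_IS_GLOBAL), so differences also include clock offset. */
    double *times = NULL;
    if (rank == 0)
        times = malloc(size * sizeof *times);
    MPI_Gather(&t_exit, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("rank %d left the barrier at t = %.6f s\n", i, times[i]);
        free(times);
    }

    MPI_Finalize();
    return 0;
}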

Related

BizTalk send port retry interval and retry count

There is one dynamic send port (Req/response) in my orchestration.
The request is sent to an external system and the response is received in the orchestration. There is a chance the external system has a monthly maintenance window of 2 days. To handle that scenario:
If I set the retry interval to 2 days, will it impact performance? Is it a good idea?
I wouldn't think it is a good idea, as even a transitory error of another type would then mean that the message would be delayed by two days.
As maintenance is usually scheduled, either stop the send port (but don't unenlist) or stop the receive port that picks up the messages to send (preferable, especially if it is high volume), and start them again after the maintenance period.
The other option would be to build that logic into the orchestration, so that if it catches an exception it increases the retry interval on each retry. However, as above, if it is high volume you might be better off switching off the receive location, as otherwise you will end up with a high number of running instances.
Set a service window on the send port if you know when the receiving system will be down. If the schedule is unknown I would rather set:
retry count = 290
retry interval = 10 minutes
so that delivery keeps being retried for just over two days (290 retries × 10 minutes ≈ 48 hours).

Is buffering time at the transmitting end included in RTT?

Good day!
I know this is a simple question, but I can't find its answer; whenever I look up RTT, it is usually loosely defined. So, is the buffering time at the transmitting node included in the RTT reported by ping?
RTT simply means "round-trip time." I'm not sure what "buffering" you're concerned about. The exact points of measurement depend on the exact ping program you're using, and there are many. For BusyBox, the ping implementation can be found here. Reading it shows that the outgoing time is stamped when the outgoing ICMP packet is prepared shortly before sendto() is called, and the incoming time is stamped when the incoming ICMP packet is parsed shortly after recvfrom() is called. (Look for the calls to monotonic_us().) The difference between the two is what's printed. Thus the printed value includes all time spent in the kernel's networking stack, NIC handling and so on. It also, at least for this particular implementation, includes time the ping process may have been waiting for a time slice. For a heavily loaded system with scheduling contention this could be significant. Other implementations may vary.
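To illustrate where the two timestamps are taken, here is a stripped-down sketch (not a real ICMP ping; it assumes a UDP echo service at ECHO_HOST/ECHO_PORT that you would have to provide, and it omits error handling). It stamps the clock just before sendto() and just after recvfrom(), so the printed RTT includes time spent in the local networking stack, the NIC and any scheduling delay of this process, exactly as described above.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define ECHO_HOST "127.0.0.1"   /* hypothetical echo server address */
#define ECHO_PORT 7777          /* hypothetical echo server port */

/* Monotonic clock in microseconds, analogous to BusyBox's monotonic_us(). */
static long long monotonic_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(ECHO_PORT) };
    inet_pton(AF_INET, ECHO_HOST, &dst.sin_addr);

    char payload[] = "probe", reply[64];

    long long t_send = monotonic_us();                    /* stamp just before send */
    sendto(sock, payload, sizeof payload, 0,
           (struct sockaddr *)&dst, sizeof dst);
    recvfrom(sock, reply, sizeof reply, 0, NULL, NULL);   /* blocks until the echo returns */
    long long t_recv = monotonic_us();                    /* stamp just after receive */

    printf("RTT = %lld us (includes kernel stack, NIC and scheduling time)\n",
           t_recv - t_send);
    close(sock);
    return 0;
}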

Implementing TCP keep alive at the application level

We have a shell script setup on one Unix box (A) that remotely calls a web service deployed on another box (B). On A we just have the scripts, configurations and the Jar file needed for the classpath.
After the batch job is kicked off, control is passed from A to B for the transactions to happen on B. Usually the processing finishes on B in less than an hour, but in some cases (when we receive larger data for processing) the process continues for more than an hour. In those cases the firewall tears down the connection between the two hosts after 1 hour of inactivity. Thus, control is never returned from B to A and we are not notified that the batch job has ended.
To tackle this, our network team has suggested implementing keep-alives at the application level.
My question is: where should I implement those, and how? Will that be in the web service code, or some parameters passed from the shell script, or something else? I tried to google around but could not find much.
You basically send an application-level message and wait for a response to it. That is, your applications must support sending, receiving and replying to those heartbeat messages. See the FIX Heartbeat message for example:
The Heartbeat monitors the status of the communication link and identifies when the last of a string of messages was not received.
When either end of a FIX connection has not sent any data for [HeartBtInt] seconds, it will transmit a Heartbeat message. When either end of the connection has not received any data for (HeartBtInt + "some reasonable transmission time") seconds, it will transmit a Test Request message. If there is still no Heartbeat message received after (HeartBtInt + "some reasonable transmission time") seconds then the connection should be considered lost and corrective action be initiated....
Additionally, the message you send should include a local timestamp and the reply to this message should contain that same timestamp. This allows you to measure the application-to-application round-trip time.
Also, some NATs close your TCP connection after N minutes of inactivity (e.g. after 30 minutes). Sending heartbeat messages allows you to keep a connection up for as long as required.
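As an illustration only, here is a sketch of the timing rules described above in C-like form. HEARTBT_INT, TRANSMISSION_SLACK and the send_* stubs are assumptions to be replaced by your own interval and transport code, and in a real application last_received would be updated whenever anything arrives from the peer.
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HEARTBT_INT 30          /* seconds; negotiated heartbeat interval (assumption) */
#define TRANSMISSION_SLACK 5    /* "some reasonable transmission time", in seconds (assumption) */

/* Placeholder transport hooks: replace with your real send code. */
static void send_heartbeat(void)    { puts("-> Heartbeat"); }
static void send_test_request(void) { puts("-> Test Request"); }

int main(void)
{
    time_t last_sent = time(NULL);      /* when we last sent anything */
    time_t last_received = time(NULL);  /* when we last received anything */
    bool test_request_pending = false;

    for (;;) {
        time_t now = time(NULL);

        /* Rule 1: if we have been silent for HEARTBT_INT seconds, send a Heartbeat. */
        if (now - last_sent >= HEARTBT_INT) {
            send_heartbeat();
            last_sent = now;
        }

        /* Rule 2: if the peer has been silent too long, send a Test Request once. */
        if (!test_request_pending &&
            now - last_received >= HEARTBT_INT + TRANSMISSION_SLACK) {
            send_test_request();
            last_sent = now;
            test_request_pending = true;
        }

        /* Rule 3: still nothing after another interval -> consider the link lost. */
        if (test_request_pending &&
            now - last_received >= 2 * (HEARTBT_INT + TRANSMISSION_SLACK)) {
            puts("connection considered lost; initiate corrective action");
            return 1;
        }

        /* In a real application, poll the connection here and update
           last_received whenever any data (including Heartbeats) arrives. */
        sleep(1);
    }
}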

Sending message between hosts, without probe or ask every host for new messages

My problem is sending messages about the status of the calculation and of the program. Every host gets one chunk of work. When a host finishes its work, it should send the result to the receiver. The receiver can change while the calculation runs. For debugging purposes, the status on every host should also be transferred to the host with rank 0.
From that point on I get a lot of messages. But it is not clear to me how I should send the messages between the hosts.
One possibility is a message transport like a ring, where every host sends the message on to its next neighbor.
Non-blocking communication methods like MPI_Isend and MPI_Irecv could be the solution, but every host needs to be both sender and receiver.
The easy way is for every host to broadcast its messages, but that is a lot of traffic.
I need a function like broadcast, where every host can be receiver and sender, and which acts only when a message is actually there!
regards
Based on "I need a function like broadcast, where every host could be reciever and sender.", MPI_Alltoall fits the bill. Please refer to this link for an actual example.
If you don't know that there is going to be a message, one way to handle this is to have the master process act as a message queue, basically sitting in an endless receive loop until all tasks have sent an exit signal.
This way, you don't have to worry about mismatched send/receive counts between neighboring tasks, and each task can independently poll the master occasionally to see whether there is work, or a message, waiting for it.
This can get a bit complicated to handle, especially the bookkeeping of the message queue and making sure that, when a message is waiting, the two tasks then set up a short send/receive session between themselves, but it avoids a lot of broadcast / all-to-all traffic, which is notoriously expensive time-wise.
If the master task is mostly just handling message status, it should put very little load on the interconnect.
Basically for Task 0:
while (1) {
    MPI_Recv(args);
    if (any task has a message to send) { add it to the master's queue; }
    if (any task is waiting to contact the task that is now polling for messages) { tell the current task to initiate a Recv wait and signal the master that it is waiting; }
    if (any other task is waiting for the current task to send to it) { initiate a send to that task; }
}
Each task:
if (work needs to be sent to a neighbor){ contact master; master adds to queue; }
if (a neighbor wants to send work){ enter receive loop ; }
if (a neighbor is ready to receive work according to the master){ send work to it; }
Or:
Task 3 has work for Task 8.
Task 3 contacts Master, and says so.
Task 3 continues its business.
Task 8 contacts Master, and sees it has work from Task 3 pending.
Task 8 enters a receive.
Task 3 (at some polling interval) again contacts master to check on work, and sees there is a task awaiting work.
Task 3 initiates a send.
Task 3 initiates a send for any remaining tasks waiting on work.
Program continues.
Now, this has plenty of its own caveats. Whether or not it is efficient depends on how frequently messages are passed between tasks, and how long a given task sits in wait for its neighbor to send.
And in the end, this may not be better than simply using Alltoall(). It's also entirely possible that this solution is downright terrible.
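A minimal sketch of the "master as message queue" skeleton described above (the tag names and payloads are assumptions, and the worker-to-worker handoff negotiated through the master is omitted for brevity): rank 0 sits in a receive loop on MPI_ANY_SOURCE and dispatches on the message tag until every worker has sent an exit signal.
#include <mpi.h>
#include <stdio.h>

/* Message tags used between workers and the rank-0 "queue" (illustrative). */
enum { TAG_RESULT = 1, TAG_STATUS = 2, TAG_EXIT = 3 };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: endless receive loop until every worker has sent TAG_EXIT. */
        int workers_running = size - 1;
        while (workers_running > 0) {
            int payload;
            MPI_Status st;
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            switch (st.MPI_TAG) {
            case TAG_RESULT:
                printf("master: result %d from rank %d\n", payload, st.MPI_SOURCE);
                break;   /* here the master would queue or forward the result */
            case TAG_STATUS:
                printf("master: status %d from rank %d\n", payload, st.MPI_SOURCE);
                break;
            case TAG_EXIT:
                workers_running--;
                break;
            }
        }
    } else {
        /* Worker: do a chunk of work, report status and result, then sign off. */
        int status = rank;          /* illustrative status value */
        int result = rank * rank;   /* illustrative "calculation" */
        MPI_Send(&status, 1, MPI_INT, 0, TAG_STATUS, MPI_COMM_WORLD);
        MPI_Send(&result, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
        MPI_Send(&result, 1, MPI_INT, 0, TAG_EXIT, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}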

Call to slow service over HTTP from within message-driven bean (MDB)

I have a message-driven bean which serves messages in the following way:
1. It takes data from incoming message.
2. Calls an external service via HTTP (literally, sends GET requests using HttpURLConnection), using the data from step 1. No matter how long the call takes, the message MUST NOT be dropped.
3. Uses the outcome from step 2 to persist data (using entity beans).
Rate of incoming messages is:
I. Low most of the time: on the order of units/tens per day.
II. Sometimes high: on the order of hundreds in a few minutes.
QUESTION:
Given that the service in step (2) is relatively slow (20 seconds per request, and degrading as the workload increases), what is the best way to deal with situation II?
WHAT I TRIED:
1. Letting the MDB wait until the service call completes, no matter how long it takes. This tends to roll back the MDB transaction on timeout and re-deliver the message, increasing the workload and making things even worse.
2. Setting a timeout on the HttpURLConnection gives some guarantees about the completion time of the MDB's onMessage() method, but leaves an open question: what to do with the 'timed out' messages.
Any ideas are very much appreciated.
Thank you!
In that case you can just increase the transaction timeout for your message-driven beans.
This is what I ended up with (mostly, this is application server configuration):
1. A relatively short (compared to the transaction timeout) timeout for the HTTP call. The rationale: in my experience, long-running transactions tend to have adverse side effects, such as threads that look "hung" from the application server's point of view, extra attention needed for database configuration, etc. I chose 80 seconds as the timeout value.
2. A re-delivery interval for failed messages increased to several minutes.
3. Careful adjustment of the number of threads that handle messages simultaneously; I balanced this value against the throughput of the HTTP service.
