Have to MPI_Iprobe several times before receiving messages - mpi

The application I'm working on involves multiple processes all working on similar tasks and only occasionally sharing information. I have a working implementation using Open MPI, but I'm having issues with messages sometimes being received much later than they are sent.
At the moment, the main loop of each process begins by processing any waiting messages, then does a whole bunch of computation, then sends the results to each of the other processes using MPI_Isend. Something like:
while problem is unsolved:
    int flag;
    MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
    while flag:
        MPI_Recv(&message, ...);
        // update local information based on message contents
        MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
    result = doComputation(); // takes between 2 s and 1 min
    MPI_Request request;
    for dest in other_processes:
        MPI_Isend(result, ..., dest, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE); // doesn't seem to make any difference
This works OK, but the following problem often occurs: Process X sends a message, but the next time Process Y probes, it finds no message. Often it is only one or two loops (and many seconds) later that Process Y gets the message sent by Process X. The messages always get through eventually, but Process Y shouldn't have to wait until the second or third time it probes to actually receive the message.
I don't have a solid understanding of how MPI works, but from reading other questions I think the problem has something to do with MPI not having a chance to progress the message, because in my program the MPI functions are only called very occasionally, rather than within a tight loop. Trying to give MPI some extra time to progress, I added a few dummy calls to Iprobe:
int flag;
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
while flag:
    MPI_Recv(&message, ...);
    // update local information based on message contents
    MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
And it works - any sent messages are always received the very next time a process probes for them.
But this is ugly and I suspect it may give different results on different platforms. So is there an alternative way to allow MPI some time to complete message progression before probing? I don't want to use a blocking receive without probing, because Process Y should be able to continue computing when there are no messages waiting.
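For illustration, here is a sketch of the kind of alternative I'm imagining: pre-posting a non-blocking receive and polling it with MPI_Test instead of probing (the message type, tag, and count below are made up, not my actual code):
// Sketch only: keep one non-blocking receive posted and poll it with
// MPI_Test, which both checks for completion and drives MPI's progress
// engine in most implementations.
Message message;                 // hypothetical message struct
MPI_Request recv_req;
MPI_Irecv(&message, sizeof(message), MPI_BYTE,
          MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &recv_req);

while (!solved) {
    int flag = 0;
    MPI_Test(&recv_req, &flag, MPI_STATUS_IGNORE);
    while (flag) {
        // update local information based on message contents, then re-post
        MPI_Irecv(&message, sizeof(message), MPI_BYTE,
                  MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &recv_req);
        MPI_Test(&recv_req, &flag, MPI_STATUS_IGNORE);
    }
    result = doComputation();
    // send result to the other processes with MPI_Isend as before
}
// at shutdown, cancel or complete the outstanding receive (MPI_Cancel / MPI_Wait)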
Thanks.

Related

Obvious deadlock situation does not occur

I'm completely stuck. This code
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if(myrank==0) i=1;
if(myrank==1) i=0;
MPI_Send(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD);
MPI_Recv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&status);
does work when run on two processes. Why is there no deadlock?
The same thing happens with the non-blocking version:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if(myrank==0) i=1;
if(myrank==1) i=0;
MPI_Isend(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&req);
MPI_Wait(&req, &status);
MPI_Irecv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&req);
MPI_Wait(&req, &status);
Logically there should be a deadlock, but there is not. Why?
How do I force MPI to block?
MPI has two kinds of point-to-point operations - blocking and non-blocking. This might be a bit confusing initially, especially when it comes to sending messages, but the blocking behaviour does not relate to the physical data transfer operation. It rather relates to the time period in which MPI might still access the buffer (either the send or the receive buffer) and therefore the application is not supposed to modify its content (with send operations) or might still read old garbage (with receive operations).
When you make a blocking call to MPI, it only returns once it has finished using the send or the receive buffer. With receive operations this means that the message has been received and stored in the buffer, but with send operations things are more complex. There are several modes for the send operation:
buffered -- data isn't sent immediately, but rather copied to a local user-supplied buffer and the call returns right after. The actual message transfer happens at some point in the future when MPI decides to do so. There is a special MPI call, MPI_Bsend, which always behaves this way. There are also calls for attaching and detaching the user-supplied buffer.
synchronous -- MPI waits until a matching receive operation is posted and the data transfer is in progress. It returns once the entire message is in transit and the send buffer is free to be reused. There is a special MPI call, MPI_Ssend, which always behaves this way.
ready -- MPI tries to send the message and it only succeeds if the receive operation has already been posted. The idea here is to skip the handshake between the ranks, which may reduce the latency, but it is unspecified what exactly happens if the receiver is not ready. There is a special call for that, MPI_Rsend, but it is advised to not use it unless you really know what you are doing.
MPI_Send invokes what is known as the standard send mode, which could be any combination of the synchronous and the buffered mode, with the latter using a buffer supplied by the MPI system and not the user-supplied one. The actual details are left to the MPI implementation and hence differ wildly.
It is most often the case that small messages are buffered while larger messages are sent synchronously, which is what you observe in your case, but one should never rely on this behaviour, and the definition of "small" varies with the implementation and the network kind. A correct MPI program will not deadlock if all MPI_Send calls are replaced with MPI_Ssend, which means you must never assume that small messages are buffered. But a correct program also must not expect MPI_Send to be synchronous for larger messages and rely on that for synchronisation between the ranks, i.e., replacing MPI_Send with MPI_Bsend and providing a large enough buffer should not alter the program behaviour.
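To make the buffered variant concrete, here is a sketch of MPI_Bsend with an attached user buffer, reusing the question's sendbuf/recvbuf/i (the sizing follows the MPI_Pack_size + MPI_BSEND_OVERHEAD recipe from the standard):
int pack_size, buf_size;
MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &pack_size);
buf_size = pack_size + MPI_BSEND_OVERHEAD;

void *buf = malloc(buf_size);
MPI_Buffer_attach(buf, buf_size);           /* MPI may now buffer sends here */

MPI_Bsend(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD);   /* returns right away */
MPI_Recv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD, &status);

MPI_Buffer_detach(&buf, &buf_size);         /* blocks until buffered sends complete */
free(buf);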
There is a very simple solution that always works and it frees you from having to remember to not rely on any assumptions -- simply use MPI_Sendrecv. It is a combined send and receive operation that never deadlocks, except when the send or the receive operation (or both) is unmatched. With send-receive, your code will look like this:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank==0) i=1;
if (myrank==1) i=0;
MPI_Sendrecv(sendbuf, 1, MPI_INT, i, 99,
recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD, &status);

Issue with #pragma acc host_data use_device

I'd like the MPI function MPI_Sendrecv() to run on the GPU. Normally I use something like:
#pragma acc host_data use_device(send_buf, recv_buf)
{
MPI_Sendrecv (send_buf, N, MPI_DOUBLE, proc[0], 0,
recv_buf, N, MPI_DOUBLE, proc[0], 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
And it works fine. However now, I call MPI_Sendrecv() inside a loop. If I try to accelerate this loop (with #pragma acc parallel loop) or even accelerate the whole routine (#pragma acc routine) where the loop and the MPI call are situated, I get an error:
64, Accelerator restriction: loop contains unsupported statement type
78, Accelerator restriction: unsupported statement type: opcode=ACCHOSTDATA
How can I make the call run on the device if, as in this case, the call is in an accelerated region?
An alternative might be not to accelerate the routine and the loop and just use #pragma acc host_data use_device(send_buf, recv_buf) on its own, but that would defeat the goal of having everything on the GPU.
EDIT
I removed the #pragma. Anyway, the application now runs hundreds of times slower and I cannot figure out why.
I'm using Nsight Systems to check. Do you have any idea why MPI_Sendrecv is slowing down the app? Now the whole routine where it's called runs on the host. If I hover over the NVTX (MPI) row, it says "ranges on this row have been projected from the CPU on the GPU". What does this mean?
Sorry if this is not clear, but I have little experience with Nsight and I don't know how to analyze the results properly. If you need more details I'm happy to provide them.
It also seems weird to me that the MPI calls appear in the GPU section.
You can't make MPI calls from within device code.
Also, the "host_data" is saying to use a device pointer within host code so can't be used within device code. Device pointers are used by default in device code, hence no need for the "host_data" construct.
Questions after edit:
Do you have any idea why MPI_Sendrecv is slowing down the app?
Sorry, no idea. I don't know what you're comparing it to, or anything about your app, so it's hard for me to tell. MPI_Sendrecv is a blocking call, though, so putting it in a loop makes each send/receive wait for the previous one before proceeding. Are you able to rewrite the code to use MPI_Isend and MPI_Irecv instead?
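If such a restructuring is possible, a non-blocking exchange might look roughly like this (a sketch with simplified request handling, same device-buffer and CUDA-aware-MPI assumptions as above):
MPI_Request reqs[2];
#pragma acc host_data use_device(send_buf, recv_buf)
{
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, proc[0], 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, proc[0], 0, MPI_COMM_WORLD, &reqs[1]);
}
// ... independent work (host or device) can overlap with the transfer here ...
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);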
"ranges on this row have been projected from the CPU on the GPU". What
does this mean?
I haven't seen this before, but I presume it just means that even though these are host calls, the NVTX instrumentation is able to project them onto the GPU timeline, most likely so that the CUDA-aware MPI device-to-device data transfers can be correlated with the MPI region.

Pcap Dropping Packets

// Open the ethernet adapter
handle = pcap_open_live("eth0", 65536, 1, 0, errbuf);
// Make sure it opens correctly
if(handle == NULL)
{
printf("Couldn't open device : %s\n", errbuf);
exit(1);
}
// Compile filter
if(pcap_compile(handle, &bpf, "udp", 0, PCAP_NETMASK_UNKNOWN))
{
printf("pcap_compile(): %s\n", pcap_geterr(handle));
exit(1);
}
// Set Filter
if(pcap_setfilter(handle, &bpf) < 0)
{
printf("pcap_setfilter(): %s\n", pcap_geterr(handle));
exit(1);
}
// Set signals
signal(SIGINT, bailout);
signal(SIGTERM, bailout);
signal(SIGQUIT, bailout);
// Setup callback to process the packet
pcap_loop(handle, -1, process_packet, NULL);
The process_packet function strips off the header and does a bit of processing on the data. However, when that takes too long, I think packets are being dropped.
How can I use pcap to listen for UDP packets and do some processing on the data without losing packets?
Well, you don't have infinite storage, so if you continuously run slower than the packets arrive, you will lose data at some point.
Of course, if you have a decent amount of storage and, on average, you don't run behind (for example, you may run slow during bursts but there are quiet times where you can catch up), that will alleviate the problem.
Some network sniffers do this, simply writing the raw data to a file for later analysis.
It's a trick you can use too, though not necessarily with a file. It's possible to use a large in-memory structure like a circular buffer, where one thread (the capture thread) writes raw data and another thread (the analysis thread) reads and interprets it. And, because each thread only handles one end of the buffer, you can even architect it without locks (or with very short locks).
That also makes it easy to detect if you've run out of buffer and raise an error of some sort rather than just losing data at your application level.
Of course, this all hinges on your "simple and quick as possible" capture thread being able to keep up with the traffic.
Clarifying what I mean, modify your process_packet function so that it does nothing but write the raw packet to a massive circular buffer (detecting overflow and acting accordingly). That should make it as fast as possible, avoiding pcap itself dropping packets.
Then, have an analysis thread that takes stuff off the queue and does the work formerly done in process_packet (the "gets rid of header and does a bit of processing on the data" bit).
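A rough sketch of that split, assuming POSIX threads (the names, sizes, and drop handling are illustrative):
/* Capture/analysis split: the pcap callback only copies the raw packet into
   a ring buffer under a short lock; a separate thread does the real work. */
#include <pcap.h>
#include <pthread.h>
#include <string.h>

#define RING_SLOTS 1024
#define SNAP_LEN   65536

struct slot { struct pcap_pkthdr hdr; unsigned char data[SNAP_LEN]; };

static struct slot ring[RING_SLOTS];
static unsigned long head, tail;            /* slots written / slots consumed */
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

/* pcap callback: keep it as cheap as possible */
static void process_packet(u_char *user, const struct pcap_pkthdr *h,
                           const u_char *bytes)
{
    pthread_mutex_lock(&ring_lock);
    if (head - tail < RING_SLOTS) {         /* otherwise: count/report a drop */
        struct slot *s = &ring[head % RING_SLOTS];
        s->hdr = *h;
        memcpy(s->data, bytes, h->caplen < SNAP_LEN ? h->caplen : SNAP_LEN);
        head++;
    }
    pthread_mutex_unlock(&ring_lock);
}

/* analysis thread: does the work formerly done in process_packet */
static void *analysis_thread(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&ring_lock);
        int have = (tail != head);
        struct slot *s = have ? &ring[tail % RING_SLOTS] : NULL;
        pthread_mutex_unlock(&ring_lock);
        if (!have) continue;                /* or sleep / wait on a condition variable */
        /* ... strip headers and process s->data (s->hdr.caplen bytes) ... */
        pthread_mutex_lock(&ring_lock);
        tail++;                             /* slot can now be reused by the capture thread */
        pthread_mutex_unlock(&ring_lock);
    }
    return NULL;
}
The capture thread never writes into a slot that hasn't been consumed yet, so the analysis thread can safely read its slot outside the lock.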
Another possible solution is to bump up the pcap internal buffer size. As per the man page:
Packets that arrive for a capture are stored in a buffer, so that they do not have to be read by the application as soon as they arrive.
On some platforms, the buffer's size can be set; a size that's too small could mean that, if too many packets are being captured and the snapshot length doesn't limit the amount of data that's buffered, packets could be dropped if the buffer fills up before the application can read packets from it, while a size that's too large could use more non-pageable operating system memory than is necessary to prevent packets from being dropped.
The buffer size is set with pcap_set_buffer_size().
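Note that pcap_set_buffer_size() only works on a handle obtained from pcap_create() and must be called before pcap_activate(), so the pcap_open_live() call above would turn into something like this (the 16 MiB figure is just an example):
char errbuf[PCAP_ERRBUF_SIZE];
pcap_t *handle = pcap_create("eth0", errbuf);
if (handle == NULL) {
    printf("pcap_create(): %s\n", errbuf);
    exit(1);
}
pcap_set_snaplen(handle, 65536);
pcap_set_promisc(handle, 1);
pcap_set_timeout(handle, 1000);                    /* read timeout in ms */
pcap_set_buffer_size(handle, 16 * 1024 * 1024);    /* bigger kernel buffer */
if (pcap_activate(handle) < 0) {
    printf("pcap_activate(): %s\n", pcap_geterr(handle));
    exit(1);
}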
The only other possibility that springs to mind is to ensure that the processing you do on each packet is as optimised as it can be.
The splitting of processing into collection and analysis should alleviate a problem of not keeping up but it still relies on quiet time to catch up. If your network traffic is consistently more than your analysis can handle, all you're doing is delaying the problem. Optimising the analysis may be the only way to guarantee you'll never lose data.

Qt application crashes when making 2 network requests from 2 threads

I have a Qt application that launches two threads from the main thread at start up. Both these threads make network requests using distinct instances of the QNetworkAccessManager object. My program keeps crashing about 50% of the times and I'm not sure which thread is crashing.
There is no data sharing or signalling occurring directly between the two threads. When a certain event occurs, one of the threads signals the main thread, which may in turn signal the second thread. However, by printing logs, I am pretty certain the crash doesn't occur during the signalling.
The structure of both threads is as follows. There's hardly any difference between the threads except for the URL etc.
MyThread::MyThread() : QThread() {
moveToThread(this);
}
MyThread::~MyThread() {
delete m_manager;
delete m_request;
}
void MyThread::run() {
m_manager = new QNetworkAccessManager();
m_request = new QNetworkRequest(QUrl("..."));
makeRequest();
exec();
}
void MyThread::makeRequest() {
m_reply = m_manager->get(*m_request);
connect(m_reply, SIGNAL(finished()), this, SLOT(processReply()));
// my log line
}
void MyThread::processReply() {
if (!m_reply->error()) {
QString data = QString(m_reply->readAll());
emit signalToMainThread(data);
}
m_reply->deleteLater();
exit(0);
}
Now the weird thing is that if I don't start one of the threads, the program runs fine, or at least doesn't crash in around 20 invocations. If both threads run one after the other, the program doesn't crash. The program only crashes about half the times if I start and run both the threads concurrently.
Another interesting thing I gathered from logs is that whenever the program crashes, the line labelled with the comment my log line is the last to be executed by both the threads. So I don't know which thread causes the crash. But it leads me to suspect that QNetworkAccessManager is somehow to blame.
I'm pretty blank about what's causing the crash. I will appreciate any suggestions or pointers. Thanks in advance.
First of all, you're doing it wrong! Fix your threading first.
// EDIT
From my own experience with this pattern, I know that it can lead to many unclear crashes. I would start by clearing that up, as it may straighten some things out and make the problem easier to find. I also don't know how you invoke makeRequest. As for QNetworkRequest: it is only a data structure, so you don't need to create it on the heap; stack construction would be enough. You should also protect against overwriting the m_reply pointer. Do you call makeRequest more than once? If you do, it may lead to the reply that is currently being processed having already been deleted when another request finished.
What happens if you call makeRequest twice:
1. The first call of makeRequest assigns the m_reply pointer.
2. The second call of makeRequest assigns the m_reply pointer again (replacing the stored pointer but not deleting the object it pointed to).
3. The second request finishes before the first, so processReply is called and deleteLater is queued for the second reply.
4. Somewhere in the event loop the second reply is deleted, so from now on the m_reply pointer points at freed memory.
5. The first reply finishes, so processReply is called again, but it operates on an m_reply that points at garbage, so any access through m_reply can crash.
This is one possible scenario, which is why you don't get a crash every time.
I'm also not sure why you call exit(0) when a reply finishes. That is likewise incorrect if you call makeRequest more than once. Remember that QThread is an interface to a single thread, not a thread pool, so you can't call start() a second time on a thread instance while it is still running. Also, if you create the network access manager in the entry point run(), you should delete it in the same place, after exec(). Remember that exec() is blocking, so your objects won't be deleted before your thread exits.
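For reference, here is a rough sketch of the worker-object pattern the "fix your threading" advice refers to: no QThread subclass and no moveToThread(this). The class, slot, and signal names are illustrative, not the poster's code:
#include <QThread>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QUrl>
#include <QString>

// Worker lives in the secondary thread; the manager is created there too,
// so the reply's signals are delivered in that thread's event loop.
class Worker : public QObject {
    Q_OBJECT
public slots:
    void start() {
        m_manager = new QNetworkAccessManager(this);
        QNetworkReply *reply = m_manager->get(QNetworkRequest(QUrl("...")));
        connect(reply, &QNetworkReply::finished, this, [this, reply]() {
            if (reply->error() == QNetworkReply::NoError)
                emit resultReady(QString::fromUtf8(reply->readAll()));
            reply->deleteLater();
        });
    }
signals:
    void resultReady(const QString &data);
private:
    QNetworkAccessManager *m_manager = nullptr;
};

// In the main thread:
QThread *thread = new QThread;
Worker *worker = new Worker;
worker->moveToThread(thread);
QObject::connect(thread, &QThread::started, worker, &Worker::start);
QObject::connect(thread, &QThread::finished, worker, &QObject::deleteLater);
thread->start();
With this layout the manager, the request, and the reply all live in the worker's thread, and cleanup happens through deleteLater when the thread finishes.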

event driven MPI

I am interested in implementing a sort of event driven dispatch queue using MPI (message passing interface). The basic problem I want to solve is this: I have a master process which inserts jobs into a global queue, and each available slave process retrieves the next job in the queue if there is one.
From what I've read about MPI, it seems like sending and receiving processes must be in agreement about when they send and receive. That is, if one process sends a message but the other process doesn't know it needs to receive one (or vice versa), everything deadlocks. Is there any way to make every process a bit more independent?
You can do that as follows:
Declare one master node (rank 0) that is going to distribute the tasks. In this node the pseudocode is:
int sendTo;
for task in tasks:
    MPI_Recv(&sendTo, 1, MPI_INT, MPI_ANY_SOURCE, ...)   // a worker reports that it is free
    MPI_Send(task, ..., sendTo, ...)                      // hand it the next job
for node in nodes:
    MPI_Recv(&sendTo, 1, MPI_INT, MPI_ANY_SOURCE, ...)
    MPI_Send(job_null, ..., sendTo, ...)                  // tell the worker to stop
In the slave nodes the code would be:
while (true):
    MPI_Send(&myNodeNum, 1, MPI_INT, 0, ...)   // tell the master this node is free
    MPI_Recv(&job, ..., 0, ...)                // receive the next job (or job_null)
    if job == job_null:
        break
    else:
        execute job
I think this should work.
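To make the pattern above concrete, here is a minimal, self-contained sketch; the tags, the job representation, and the "work" itself are made up for illustration:
/* Master/worker job queue sketch (C, MPI). */
#include <mpi.h>
#include <stdio.h>

#define TAG_READY 1
#define TAG_JOB   2
#define JOB_NULL  -1   /* sentinel telling a worker to stop */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master */
        int num_jobs = 100, next_job = 0, stopped = 0, worker;
        MPI_Status status;
        while (stopped < size - 1) {
            /* wait for any worker to say it is free */
            MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, TAG_READY,
                     MPI_COMM_WORLD, &status);
            int job = (next_job < num_jobs) ? next_job++ : JOB_NULL;
            if (job == JOB_NULL) stopped++;
            MPI_Send(&job, 1, MPI_INT, worker, TAG_JOB, MPI_COMM_WORLD);
        }
    } else {                               /* worker */
        int job;
        while (1) {
            MPI_Send(&rank, 1, MPI_INT, 0, TAG_READY, MPI_COMM_WORLD);
            MPI_Recv(&job, 1, MPI_INT, 0, TAG_JOB, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (job == JOB_NULL) break;
            printf("rank %d executing job %d\n", rank, job);  /* do the work */
        }
    }
    MPI_Finalize();
    return 0;
}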
You might want to use Charm++; it's not explicitly an event-driven framework, but it does provide an abstraction mechanism for performing tasks and distributing them dynamically.
