MPI_Bsend and MPI_Isend. How do they work? - asynchronous

using buffered send and non blocking send I was wondering how and if they implement a new level of parallelism in my application eventually generating a thread.
Imagine that a slave process generates a large amount of data and want to send it to the master. My idea was to start a buffered or non blocking send then immediately begin to compute the next result.
Just when I would have to send the new data I wold check if I can reuse the buffer. This would introduce a new level of parallelism in my application between CPU and communication. Does anybody knows how this is done in MPI ? Does MPI generate a new thread to handle the Bsend or Isend ?
Thanks.

What you're looking for is a nonblocking send using your own buffer (MPI_Isend).There is no need to worry about threading -- ISend should return immediately to let you continue your own code. You would then continue your work, and post an MPI_Wait request on the MPI_Request that you passed to Isend. This will then block until the buffer is free to use again. If you have space for several buffers, you could improve parallelism by allocating several buffers and using whichever becomes available through MPI_Waitany.

Related

Does MPI_Probe return as soon as possible?

Suppose my MPI process is waiting for a very big message, and I am waiting for it with MPI_Probe. Is it correct to suppose the MPI_Probe call will return as soon as the process receives the first notice of the message from the network (like a header with the size or something like)?
I.e., will it return much faster than if I was waiting for the message with MPI_Recv, because it wouldn't need to receive the full message?
The standard is fairly silent on this matter (MPI-3.0, section 3.8.1), but does offer this:
The MPI implementation of MPI_PROBE and MPI_IPROBE needs to guarantee progress:
if a call to MPI_PROBE has been issued by a process, and a send that matches the probe
has been initiated by some process, then the call to MPI_PROBE will return, unless the
message is received by another concurrent receive operation (that is executed by another
thread at the probing process).
Since both MPI_PROBE and MPI_RECV will engage the progress engine, I would doubt there is much difference between the two functions, aside from a memory copy. By engaging the progress engine, it's likely the message will be received (internally) by the MPI implementation. The last step of copying it into the user's buffer can be avoided in MPI_PROBE.
If you are worried about performance, then avoiding MPI_ANY_SOURCE and MPI_ANY_TAG if possible will help most implementations (certainly MPICH) take a faster path.

How non-blocking web server works?

I'm trying to understand the idea of non-blocking web server and it seems like there is something I miss.
I can understand there are several reasons for "block" web request(psuedocode):
CPU bound
string on_request(arg)
{
DO_SOME_HEAVY_CPU_CALC
return "done";
}
IO bound
string on_request(arg)
{
DO_A_CALL_TO_EXTERNAL_RESOURCE_SUCH_AS_WEB_IO
return "done";
}
sleep
string on_request(arg)
{
sleep(VERY_VERY_LONG_TIME);
return "done";
}
are all the three can benefit from non-blocking server?
how the situation that do benefit from the non-blocking web server really do that?
I mean, when looking at the Tornado server documentation, it seems
like it "free" the thread. I know that a thread can be put to sleep
and wait for a signal from the operation system (at least in Linux),
is this the meaning of "freeing" the thread? is this some higher
level implementation? something that actually create a new thread
that is waiting for new request instead of the "sleeping" one?
Am I missing something here?
Thanks
Basically the way the non-blocking sockets I/O work is by using polling and the state machine. So your scheme for many connections would be something like that:
Create many sockets and make them nonblocking
Switch the state of them to "connect"
Initiate the connect operation on each of them
Poll all of them until some events fire up
Process the fired up events (connection established or connection failed)
Switch the state those established to "sending"
Prepare the Web request in a buffer
Poll "sending" sockets for WRITE operation
send the data for those who got the WRITE event set
For those which have all the data sent, switch the state to "receiving"
Poll "receiving" sockets for READ operation
For those which have the READ event set, perform read and process the read data according to the protocol
Repeat if the protocol is bidirectional, or close the socket if it is not
Of course, at each stage you need to handle errors, and that the state of each socket is different (one may be connecting while another may be already reading).
Regarding polling I have posted an article about how different polling methods work here: http://www.ulduzsoft.com/2014/01/select-poll-epoll-practical-difference-for-system-architects/ - I suggest you check it.
To benefit from a non-blocking server, your code must also be non-blocking - you can't just run blocking code on a non-blocking server and expect better performance. For example, you must remove all calls to sleep() and replace them with non-blocking equivalents like IOLoop.add_timeout (which in turn involves restructuring your code to use callbacks or coroutines).
How To Use Linux epoll with Python http://scotdoyle.com/python-epoll-howto.html may give you some points about this topic.

Erlang accept incoming tcp connections dynamically

What I am trying to solve: have an Erlang TCP server that listens on a specific port (the code should reside in some kind of external facing interface/API) and each incoming connection should be handled by a gen_server (that is even the gen_tcp:accept should be coded inside the gen_server), but I don't actually want to initially spawn a predefined number of processes that accepts an incoming connection). Is that somehow possible ?
Basic Procedure
You should have one static process (implemented as a gen_server or a custom process) that performs the following procedure:
Listens for incoming connections using gen_tcp:accept/1
Every time it returns a connection, tell a supervisor to spawn of a worker process (e.g. another gen_server process)
Get the pid for this process
Call gen_tcp:controlling_process/2 with the newly returned socket and that pid
Send the socket to that process
Note: You must do it in that order, otherwise the new process might use the socket before ownership has been handed over. If this is not done, the old process might get messages related to the socket when the new process has already taken over, resulting in dropped or mishandled packets.
The listening process should only have one responsibility, and that is spawning of workers for new connections. This process will block when calling gen_tcp:accept/1, which is fine because the started workers will handle ongoing connections concurrently. Blocking on accept ensure the quickest response time when new connections are initiated. If the process needs to do other things in-between, gen_tcp:accept/2 could be used with other actions interleaved between timeouts.
Scaling
You can have multiple processes waiting with gen_tcp:accept/1 on a single listening socket, further increasing concurrency and minimizing accept latency.
Another optimization would be to pre-start some socket workers to further minimize latency after accepting the new socket.
Third and final, would be to make your processes more lightweight by implementing the OTP design principles in your own custom processes using proc_lib (more info). However, this you should only do if you benchmark and come to the conclusion that it is the gen_server behavior that slows you down.
The issue with gen_tcp:accept is that it blocks, so if you call it within a gen_server, you block the server from receiving other messages. You can try to avoid this by passing a timeout but that ultimately amounts to a form of polling which is best avoided. Instead, you might try Kevin Smith's gen_nb_server instead; it uses an internal undocumented function prim_inet:async_accept and other prim_inet functions to avoid blocking.
You might want to check out http://github.com/oscarh/gen_tcpd and use the handle_connection function to convert the process you get to a gen_server.
You should use "prim_inet:async_accept(Listen_socket, -1)" as said by Steve.
Now the incoming connection would be accepted by your handle_info callback
(assuming you interface is also a gen_server) as you have used an asynchronous
accept call.
On accepting the connection you can spawn another ger_server(I would recommend
gen_fsm) and make that as the "controlling process" by calling
"gen_tcp:controlling_process(CliSocket, Pid of spwned process)".
After this all the data from socket would be received by that process
rather than by your interface code. Like that a new controlling process
would be spawned for another connection.

How to use MPI to organize asynchronous communication?

I plan to use MPI to build a solver that supports asynchronous communication. The basic idea is as follows.
Assume there are two parallel processes. Process 1 wants to send good solutions it finds periodically to process 2, and ask for good solutions from process 2 when it needs diversification.
At some point, process 1 uses MPI_send to send a solution to process 2. How to guarantee there is an MPI_Rev matching this MPI_Send, since this send is triggered dynamically?
When process 1 needs a solution, how can it send a request to process 2, and process 2 will notice its request in time?
There are three ways to achieve what you want, although it is not truly asynchronous communication.
1) Use non-blocking send/recvs. Replace your send/recv calls with irecv/isend and wait. The sender can issue an isend and continue working on the next problem. At some point, you will have to issue a mpi-wait to make sure your previous send was received. Your process2 can issue a recv ahead of time using irecv and continue doing its work. Again, at some point you will call mpi-wait to make sure your irecv was received. this may be a bit cumbersome if I understand you requirement correctly.
2) A Elegant way would be to use One-Sided communication. MPI_Put, Get.
3) Restructure your algorithm in such a way that at certain intervals of time, process 1 & 2 exchange information and state.
Depending on the nature of the MPI_* function you call, the send will block until a matching receive has been called by another process, so you need to make sure that's going to happen in your code.
There are also non-blocking function calls MPI_Isend f.ex, which gives you a request-handle which you can check on later to see if the process' send has been received by a matching receive.
Regarding your issue, you could issue a non-blocking receive (MPI_Irecv being the most basic) and check on the status every n seconds depending on your application. The status will then be set to complete when a message has been received and is ready to be read.
If it's time sensitive, use a blocking call while waiting for a message. The blocking mechanism (in OpenMPI at least) uses a spinning poll however, so the waiting process will be eating 100% cpu.

unix network process

I was wondering how tcp/ip communication is implemented in unix. When you do a send over the socket, does the tcp/level work (assembling packets, crc, etc) get executed in the same execution context as the calling code?
Or, what seems more likely, a message is sent to some other daemon process responsible for tcp communication? This process then takes the message and performs the requested work of copying memory buffers and assembling packets etc.? So, the calling code resumes execution right away and tcp work is done in parallel? Is this correct?
Details would be appreciated. Thanks!
The TCP/IP stack is part of your kernel. What happens is that you call a helper method which prepares a "kernel trap". This is a special kind of exception which puts the CPU into a mode with more privileges ("kernel mode"). Inside of the trap, the kernel examines the parameters of the exception. One of them is the number of the function to call.
When the function is called, it copies the data into a kernel buffer and prepares everything for the data to be processed. Then it returns from the trap, the CPU restores registers and its original mode and execution of your code resumes.
Some kernel thread will pick up the copy of the data and use the network driver to send it out, do all the error handling, etc.
So, yes, after copying the necessary data, your code resumes and the actual data transfer happens in parallel.
Note that this is for TCP packets. The TCP protocol does all the error handling and handshaking for you, so you can give it all the data and it will know what to do. If there is a problem with the connection, you'll notice only after a while since the TCP protocol can handle short network outages by itself. That means you'll have "sent" some data already before you'll get an error. That means you will get the error code for the first packet only after the Nth call to send() or when you try to close the connection (the close() will hang until the receiver has acknowledged all packets).
The UDP protocol doesn't buffer. When the call returns, the packet is on it's way. But it's "fire and forget", so you only know that the driver has put it on the wire. If you want to know whether it has arrived somewhere, you must figure out a way to achieve that yourself. The usual approach is have the receiver send an ack UDP packet back (which also might get lost).
No - there is no parallel execution. It is true that the execution context when you're making a system call is not the same as your usual execution context. When you make a system call, such as for sending a packet over the network, you must switch into the kernel's context - the kernel's own memory map and stack, instead of the virtual memory you get inside your process.
But there are no daemon processes magically dispatching your call. The rest of the execution of your program has to wait for the system call to finish and return whatever values it will return. This is why you can count on return values being available right away when you return from the system call - values like the number of bytes actually read from the socket or written to a file.
I tried to find a nice explanation for how the context switch to kernel space works. Here's a nice in-depth one that even focuses on architecture-specific implementation:
http://www.ibm.com/developerworks/linux/library/l-system-calls/

Resources