Can ZeroMQ provide grounds for a bidirectional non-blocking asynchronous transmission? - asynchronous

I have a system which consists of two applications. Currently, two applications communicate using multiple ZeroMQ PUB/SUB patterns generated for each specific type of transmission. Sockets are programmed in C.
For example, AppX uses a SUB formal-socket archetype for receiving an information struct from AppY and uses another PUB formal-socket archetype for transmitting raw bit blocks to AppY and same applies to AppY. It uses PUB/SUB patterns for transmission and reception.
To be clear AppX and AppY perform the following communications:
AppX -> AppY :- Raw bit blocks of 1 kbits (continous),- integer command (not continuous, depends on user)
AppY -> AppX :Information struct of 10kbits (continuous)
The design target:
a) My goal is to use only one socket at each side for bidirectional communication in nonblocking mode.
b) I want two applications to process queued received packets without an excess delay.
c) I don't want AppX to crash after a crashed AppY.
Q1: Would it be possible with ZeroMQ?
Q2: Can I use ROUTER/DEALER or any other pattern for this job?
I have read the guide but I could not figure out some aspects.
Actually I'm not well experienced with ZeroMQ. I would be pleased to hear about additional tips on this problem.

A1: Yes, this is possible in ZeroMQ or nanomsg sort of tools
Both the ZeroMQ and it's younger sister nanomsg share the vision of Scaleable ( which you did not emphasise yet )Formal ( hard-wired formal behaviour )Communication ( yes, it's about this )
Pattern ( that are wisely carved and ready to re-use and combine as needed )
This said, if you prefer to have just one socket-pattern on each "side", then you have to choose such a Formal Pattern, that would leave you all the freedom from any hard-wired behaviour, so as to meet your goal.
So, a) "...only one" is doable -- by a solo of zmq.PAIR (which some parts of documentation flag as a still an experimental device) or NN.BUS or a pair of PUSH/PULL if you step back from allowing just a single one ( which in fact does eliminate all the cool powers of the sharing of the zmq.Context() instantiated IO-thread(s) for re-using the low-level IO-engine. If you spend a few minutes with examples referred to below, you will soon realise, that the very opposite policy is quite common and beneficial to the design targets, when one uses more, even many, patterns in a system architecture.
The a) "...non-blocking" is doable, by stating proper directives zmq.NOBLOCK for respective .send() / .recv() functions and by using fast, non-blocking .poll() loops in your application design architecture.
On b) "...without ... delay" is related to the very noted remark on application design architecture, as you may loose this just by relying on a poor selection and/or not possible tuning of the event-handler's internal timings and latency penalties. If you shape your design carefully, you might remain in a full control of the delay/latency your system will experience and not bacoming a victim of any framework's black-box event-loop, where you can nothing but wait for it's surprises on heavy system or traffic loads.
On c) "... X crash after a Y crashed" is doable on { ZeroMQ | nanomsg }-grounds, by a carefull combination of non-blocking mode of all functions + by your design beeing able to handle exceptions in the situations it does not receive any POS_ACK from the intended { local | remote }-functionality. In this very respect, it is fair to state, that some of the Formal Communication Patters do not have this very flexibility, due to some sort of a mandatory internal behaviour, that is "hard wired" internally, so a due care is to be taken in selecting a proper FCP-archetype for each such still scaleable but fault-resilient role.
Q2: No.
The best next step:
You might feel interested in other ZeroMQ posts here and also do not miss the link to the book, referred there >>>

Q1: yes
Q2: no, ZMQ_DEALER should be used by both AppX and AppY.
See http://zguide.zeromq.org/c:asyncsrv. Notice ZMQ_ROUTER in this example just aim to distribute request from multi-client to different thread where ZMQ_DEALER do real work.

Related

MPI_Lock_win / passive synchronization usage confusion

I'm trying to convert an application from using standard point-to-point MPI calls (e.g., MPI_Isend, MPI_Irecv) to using MPI-3's one-sided calls. My goal is to improve performance on my hardware, which is a system that has Infiniband hardware support and an MPI implementation that's optimized for RDMA calls. I've been told that the hardware performs particularly well with passive synchronization mode, as opposed to active synchronization (i.e., Post-Start-Complete-Wait).
However, even after reading through the MPI standard documentation and examples, I'm confused on how to actually use the calls. For context, my program has a setup phase where I will know the communication pattern and even the buffers of the send data and ultimate buffer of the receiver. So, it's straightforward to set up a window and use it.
Specifically, with passive synchronization, I'm confused about when the "receiver" knows the data in the window has been written by the sender. What I want to do is have the sender produce the message data, then call MPI_Win_lock on the window and then do an MPI_Put and then wait for completion with a MPI_Win_Unlock. But, what is an efficient / recommended way for the "receiver" (window target) of the data to know when the message data has been written? Similarly, given that the communication pattern is iterated and the same receive buffer (the target's buffer) is used multiple times, how do I know that the receiver is done consuming the buffer and it can be reused?
I can envision a couple of approaches:
I can use an MPI_Barrier after the MPI_Win_unlock and before the receiver accesses the data. (This seems that it would work but I'm skeptical that this would yield better performance than active synchronization.)
I can possibly use MPI_Lock and MPI_Unlock on the receiver (target), locking the window when the target is actually using the data so the access epoch can't start on the origin (but, is that the way it works? I've read that lock and unlock don't create critical sections in the traditional sense).
Some sort of home-grown approach where the receiver polls for some sort of a nonce to be written, knowing the data is available when that happens.
Docs for MPI_Win_lock: https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_lock.3.php
In general, how does a programmer synchronize with MPI_Lock and 'MPI_Unlock` in a way that's any more efficient than the active synchronization approach? It does feel like I need to just use post-start-complete-wait, but I'm hoping you can help me find a way to try passive synchronization as well.

What is lockstep in Peer-to-Peer gaming?

I am researching about Peer-To-Peer network architecture for games.
What i have read from multiples sources is that Peer-To-Peer model makes it easy for people to hack. Sending incorrect data about your game character, whether it is your wrong position or the amount of health point you have.
Now I have read that one of the things to make Peer-To-Peer more secure is to put an anti-cheat system into your game, which controls some thing like: how fast has someone moved from spot A to spot B, or controls if someones health points did not change drastically without a reason.
I have also read about Lockstep, which is described as a "handshake" between all the clients in Peer-to-Peer network, where clients promise not to do certain things, for instance "move faster than X or not to be able to jump higher than Y" and then their actions are compared to the rules set in the "handshake".
To me this seems like an anti-cheat system.
What I am asking in the end is: What is Lockstep in Peer-To-Peer model, is it an Anti-Cheat system or something else and where should this system be placed in Peer-To-Peer. In every players computer or could it work if it is not in all of the players computer, should this system control the whole game, or only a subset?
Lockstep was designed primarily to save on bandwidth (in the days before broadband).
Question: How can you simulate (tens of) thousands of units, distributed across multiple systems, when you have only a vanishingly small amount of bandwidth (14400-28800 baud)?
What you can't do: Send tens of thousands of positions or deltas, every tick, across the network.
What you can do: Send only the inputs that each player makes, for example, "Player A orders this (limited size) group ID=3 of selected units to go to x=12,y=207".
However, the onus of responsibility now falls on each client application (or rather, on developers of P2P client code) to transform those inputs into exactly the same gamestate per every logic tick. Otherwise you get synchronisation errors and simulation failure, since no peer is authoritative. These sync errors can result from a great deal more than just cheaters, i.e. they can arise in many legitimate, non-cheating scenarios (and indeed, when I was a young man in the '90s playing lockstepped games, this was a frequent frustration even over LAN, which should be reliable).
So now you are using only a tiny fraction of the bandwidth. But the meticulous coding required to be certain that clients do not produce desync conditions makes this a lot harder to code than an authoritative server, where non-sane inputs or gamestate can be discarded by the server.
Cheating: It is easy to see things you shouldn't be able to see: every client has all the simulation data available. It is very hard to modify the gamestate without immediately crashing the game.
I've accidentally stumbled across this question in google search results, and thought I might as well answer years later. For future generations, you know :)
Lockstep is not an anti-cheat system, it is one of the common p2p network models used to implement online multiplayer in games (most notably in strategy games). The base concept is fairly straightforward:
The game simulation is split into fairly short time frames.
After each frame players collect input commands from that frame and send those commands over the network
Once all the players receive the commands from all the other players, they apply them to their local game simulation during the next time frame.
If simulation is deterministic (as it should be for lockstep to work), after applying the commands all the players will have the same world state. Implementing the determinism right is arguably the hardest part, especially for cross-platform games.
Being a p2p model lockstep is inherently weak to cheaters, since there is no agent in the network that can be fully trusted. As opposed to, for example, server-authoritative network models, where developer can trust a privately-owned server that hosts the game. Lockstep does not offer any special protection against cheaters by itself, but it can certainly be designed to be less (or more) vulnerable to cheating.
Here is an old but still good write-up on lockstep model used in Age of Empires series if anyone needs a concrete example.

MPI and hierarchical collectives - missing 'hierarch' coll module from OpenMPI 2.x?

I'm working on an application that is critically dependent on the performance of MPI_Alltoall calls with very small messages (less than 4KB) flying among a large number of processes (currently about 200, while the target is in the thousands and more).
I was under the impression, and reading this paper seems to corroborate that impression, that it would be sensible to exploit the hierarchy of a modern PC cluster (as the one I have at my disposal) by segregating processes belonging to a single cluster node within separate communicators (OpenMPI even has a function, from the MPI 3 standard, that does just that) and then, instead of an MPI_Alltoall over MPI_COMM_WORLD, using a sequence MPI_Gather - MPI_Alltoall - MPI_Scatter whereas Gather and Scatter are restricted within those communicators while the Alltoall is over another communicator including one and one only 'gatherer' process per node, in this way exploiting the supposedly faster transfers in internal memory for the Gather and Scatter while hopefully increasing the efficiency of the Alltoall among nodes by having fewer, larger messages passing through the network interfaces (ConnectX InfiniBand NICs by Mellanox, in my case).
I'm trying to verify the assertions of the paper, and if someone is interested I can share my findings, but what I wanted to know is this: perusing the 1.10.x OpenMPI sources, I clearly see a 'hierarch' component among the 'coll' modules, which is also mentioned in the README that makes it seem like such a hierarchical implementation had been pulled in already.
Nevertheless, I was never able to make it work and it seems that it vanished altogether from the 2.x branch (there is none in the 'ompi_info' output).
Is there anyone that succeeded in using it? Can you tell about any improvements, if any, compared to a regular MPI_Alltoall?

How to write with a single node in MPI

I want to implement some file io with the routines provided by MPI (in particular Open MPI).
Due to possible limitations of the environment, I wondered, if it is possible to limit the nodes, which are responsible for IO, so that all other nodes are required to perform a hidden mpi_send to this group of processes, to actually write the data. This would be nice in cases, where e.g. the master node is placed on a node with high-performance filesystem and the other nodes have only access to a low-performance filesystem, where the binaries are stored.
Actually, I already found some information, which might be helpful, but I couldn't find further information, how to actually implement these things:
1: There is an info key MPI_IO belonging to the communicator, which tells which ranks provide standard-conforming IO-routines. As this is listed as an environmental inquiry, I don't see, where I could modify this.
2: There is an info key io_nodes_list which seems to belong to file-related info-objects. Unfortunately, the possible values for this key are not documented and Open MPI doesn't seem to implement them in any way. Actually, I can't even get the filename from the info-object which is returned by mpi_file_get_info...
As a workaround, I could imagine two things: On the one hand, I could perform the IO with standard Fortran routines, or on the other hand, create a new communicator, which is responsible for IO. But in both cases, the processes, which are responsible for IO have to check for possible IO from the other processes to perform manual communication and file interaction.
Is there a nice and automatic way to restrict the IO to certain nodes? If yes, how could I implement this?
You explicitly asked about OpenMPI, but there are two MPI-IO implementations in OpenMPI. The old workhorse is ROMIO, the MPI-IO implementation shared among just about every MPI implementation. OpenMPI also has OMPIO, but I don't know a whole lot about tuning that one.
Next, if you want things to happen automatically for you, you'll have to use collective i/o. The independent I/O routines cannot send a message to anyone else -- they are independent and there's no way to know if the other side will be listening.
With those preliminaries out of the way...
You are asking about "i/o aggregaton". There is a bit of information here in the context of another optimization called "deferred open" (and which OMPIO calls Lazy Open)
https://press3.mcs.anl.gov/romio/2003/08/05/deferred-open/
In short, you can definitely say "only these N processes should do I/O", and then the collective I/O library will exchange data and make sure that happens. The optimization was developed some 15-odd years ago for just the situation you proposed: some nodes being better connected to storage than others (as was the case on the old ASCI Red machine, to give you a sense for how old this optimization is...)
I don't know where you got io_nodes_list. You probably want to use the MPI-IO info keys cb_config_list and cb_nodes
So, you've got a cluster with master1, master2, master3, and compute1, compute2, compute3 (or whatever the hostnames actually are). You can do something like this (in c, sorry. I'm not proficient in Fortran):
MPI_Info info;
MPI_File fh;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE|MPI_MODE_WRONLY, info, &fh)
With these hints, MPI_File_write_all will aggregate all the I/O through the MPI processes on master1, master2, and master3. ROMIO won't blow up your memory because it will chunk up the I/O into a smaller working set (specified with the "cb_buffer_size" hint: cranking this up, if you have the memory, is a good way to get better performance).
There is a ton of information about the hints you can set in the ROMIO users guide:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html

Asynchronous MPI with SysV shared memory

We have a large Fortran/MPI code-base which makes use of system-V shared memory segments on a node. We run on fat nodes with 32 processors, but only 2 or 4 NICs, and relatively little memory per CPU; so the idea is that we set up a shared memory segment, on which each CPU performs its calculation (in its block of the SMP array). MPI is then used to handle inter-node communications, but only on the master in the SMP group. The procedure is double-buffered, and has worked nicely for us.
The problem came when we decided to switch to asynchronous comms, for a bit of latency hiding. Since only a couple of CPUs on the node communicate over MPI, but all of the CPUs see the received array (via shared memory), a CPU doesn't know when the communicating CPU has finished, unless we enact some kind of barrier, and then why do asynchronous comms?
The ideal, hypothetical solution would be to put the request tags in an SMP segment and run mpi_request_get_status on the CPU which needs to know. Of course, the request tag is only registered on the communicating CPU, so it doesn't work! Another proposed possibility was to branch a thread off on the communicating thread and use it to run mpi_request_get_status in a loop, with the flag argument in a shared memory segment, so all the other images can see. Unfortunately, that's not an option either, since we are constrained not to use threading libraries.
The only viable option we've come up with seems to work, but feels like a dirty hack. We put an impossible value in the upper-bound address of the receive buffer, that way once the mpi_irecv has completed, the value has changed and hence every CPU knows when it can safely use the buffer. Is that ok? It seems that it would only work reliably if the MPI implementation can be guaranteed to transfer data consecutively. That almost sounds convincing, since we've written this thing in Fortran and so our arrays are contiguous; I would imagine that the access would be also.
Any thoughts?
Thanks,
Joly
Here's a pseudo-code template of the kind of thing I'm doing. Haven't got the code as a reference at home, so I hope I haven't forgotten anything crucial, but I'll make sure when I'm back to the office...
pseudo(array_arg1(:,:), array_arg2(:,:)...)
integer, parameter : num_buffers=2
Complex64bit, smp : buffer(:,:,num_buffers)
integer : prev_node, next_node
integer : send_tag(num_buffers), recv_tag(num_buffers)
integer : current, next
integer : num_nodes
boolean : do_comms
boolean, smp : safe(num_buffers)
boolean, smp : calc_complete(num_cores_on_node,num_buffers)
allocate_arrays(...)
work_out_neighbours(prev_node,next_node)
am_i_a_slave(do_comms)
setup_ipc(buffer,...)
setup_ipc(safe,...)
setup_ipc(calc_complete,...)
current = 1
next = mod(current,num_buffers)+1
safe=true
calc_complete=false
work_out_num_nodes_in_ring(num_nodes)
do i=1,num_nodes
if(do_comms)
check_all_tags_and_set_safe_flags(send_tag, recv_tag, safe) # just in case anything else has finished.
check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
safe(current)=true
else
wait_until_true(safe(current))
end if
calc_complete(my_rank,current)=false
calc_complete(my_rank,current)=calculate_stuff(array_arg1,array_arg2..., buffer(current), bounds_on_process)
if(not calc_complete(my_rank,current)) error("fail!")
if(do_comms)
check_all_tags_and_set_safe(send_tag, recv_tag, safe)
check_tags_and_wait_if_need_be(next, send_tag, recv_tag)
recv(prev_node, buffer(next), recv_tag(next))
safe(next)=false
wait_until_true(all(calc_complete(:,current)))
check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
send(next_node, buffer(current), send_tag(current))
safe(current)=false
end if
work_out_new_bounds()
current=next
next=mod(next,num_buffers)+1
end do
end pseudo
So ideally, I would have liked to have run "check_all_tags_and_set_safe_flags" in a loop in another thread on the communicating process, or even better: do away with "safe flags" and make the handle to the sends / receives available on the slaves, then I could run: "check_tags_and_wait_if_need_be(current, send_tag, recv_tag)" (mpi_wait) before the calculation on the slaves instead of "wait_until_true(safe(current))".
"...unless we enact some kind of barrier, and then why do asynchronous comms?"
That sentence is a bit confused. The purpose of asynchrononous communications is to overlap communications and computations; that you can hopefully get some real work done while the communications is going on. But this means you now have two tasks occuring which eventually have to be synchronized, so there has to be something which blocks the tasks at the end of the first communications phase before they go onto the second computation phase (or whatever).
The question of what to do in this case to implement things nicely (it seems like what you've got now works but you're rightly concerned about the fragility of the result) depends on how you're doing the implementation. You use the word threads, but (a) you're using sysv shared memory segments, which you wouldn't need to do if you had threads, and (b) you're constrained not to be using threading libraries, so presumably you actually mean you're fork()ing processes after MPI_Init() or something?
I agree with Hristo that your best bet is almost certainly to use OpenMP for on-node distribution of computation, and would probably greatly simplify your code. It would help to know more about your constraint to not use threading libraries.
Another approach which would still avoid you having to "roll your own" process-based communication layer that you use in addition to MPI would be to have all the processes on the node be MPI processes, but create a few communicators - one to do the global communications, and one "local" communicator per node. Only a couple of processes per node would be a part of a communicator which actually does off-node communications, and the others do work on the shared memory segment. Then you could use MPI-based methods for synchronization (Wait, or Barrier) for the on-node synchronization. The upcoming MPI3 will actually have some explicit support for using local shared memory segments this way.
Finally, if you're absolutely bound and determined to keep doing things through what's essentially your own local-node-only IPC implementation --- since you're already using SysV shared memory segments, you might as well use SysV semaphores to do the synchronization. You're already using your own (somewhat delicate) semaphore-like mechanism to "flag" when the data is ready for computation; here you could use a more robust, already-written semaphore to let the non-MPI processes know when the data is ready for computation (and a similar mechanism to let the MPI process know when the others are done with the computation).

Resources