Where can I find documentation of GASNet collectives?

I am writing a distributed shared-memory library using GASNET_SEGMENT_EVERYTHING, and for that I need to communicate the address of an allocation from some root node to all other nodes, like an MPI_Bcast. However, I am having a tough time understanding how to implement this. Can someone give me an example of how to mimic MPI_Send with active messages, or explain how the undocumented gasnet_coll_broadcast from gasnet_coll.h works?

The best source of information for the GASNet collectives API in the current GASNet-EX release is the GASNet-EX specification (search for the section entitled // Collectives (Coll)).
Here's a simple example of initiating a non-blocking broadcast operation and then synchronously awaiting its completion:
gex_Event_Wait(gex_Coll_BroadcastNB(myteam, root_rank, dstmem, srcmem, payloadsz, 0));
The example above is taken from the GASNet test testcollperf, and there are other examples in the GASNet test testcoll. Note that these tests are written as a performance microbenchmark and a correctness validation test respectively, and are not really intended to serve as example code.


.NET/C# - What is the best option instead of an ActionBlock<T> (or Channel<T>) for speed?

corefxlab has something called a Channel which is a really nice implementation of an async P-C queue and definitely does what I'm looking for. I'm curious if there's an implementation that ultimately had a similar API to ActionBlock<T>:
Must be able to accept/deny from multiple producers.
Only needs to have one consuming task but would be preferable that it continue processing until empty. Then 'wait' for new items.
A Channel<T> is much faster than a BufferBlock<T>, but I'm curious whether, given these specific requirements, there is something even faster.
According to a readme by Stephen Toub, Channels might end up being the underlying implementation of some Dataflow blocks. Channels win for async P-C queue speed.

How to use non-blocking point-to-point MPI routines instead of collectives

In my program, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented that using collective communication (gather, scatter etc.), but I noticed that with these synchronizing routines the slowest of all processors dominates the execution time and heavily reduces overall throughput, as fast processors spend a lot of time waiting.
So I decided to switch to a scheme where one (master) processor is dedicated to receiving chunks of results and handling the file output, and all the other processors calculate these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code. Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both methods need to be non-blocking? How is this typically done?
MPI has provided non-blocking collectives since version 3.0 (for example MPI_Igather and MPI_Iscatter). I would strongly recommend using those instead of implementing this on your own.
However, it may not help you after all. Eventually you need the data from all processes, even the slow ones. So you are likely to wait at some point again. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalances.
Update (more or less a long clarification comment)
There are several layers to your question, and I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather
How do I implement a centralized work queue in MPI?
This pops up regularly. But it is often undesirable, because a central component quickly becomes a bottleneck in large-scale programs. So the actual problem you have is that your work decomposition and mapping is imbalanced. So the more fundamental "X-question" is
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and its current implementation, preferably in the form of a minimal, complete and verifiable example (MCVE). Again, there is no standard solution. Load balancing is a huge research area. It may even be a topic for CS.SE rather than SO.
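If you do go with a dedicated master despite the bottleneck caveat, the usual answer to "how do I know when to stop the loop?" is to have every worker send one final "done" message and let the master count them. Below is a minimal sketch of only that bookkeeping, in plain C with no MPI in it so that it compiles anywhere. scripted_recv is a made-up stand-in for a blocking MPI_Recv posted with MPI_ANY_SOURCE and MPI_ANY_TAG, and TAG_RESULT, TAG_DONE, struct message and master_loop are hypothetical names, not part of MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags. With real MPI these would be the tag argument of the
 * worker's send and would be read back from status.MPI_TAG on the master. */
enum { TAG_RESULT = 1, TAG_DONE = 2 };

struct message { int source; int tag; double value; };

/* Stand-in for MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...): replays a
 * fixed script of messages. Returns 1 and fills *out, or 0 when the script
 * is exhausted (a real blocking receive would simply keep waiting). */
struct scripted_source { const struct message *msgs; size_t count; size_t next; };

int scripted_recv(struct scripted_source *src, struct message *out)
{
    if (src->next >= src->count)
        return 0;
    *out = src->msgs[src->next++];
    return 1;
}

/* Master loop: handle results until every worker has reported TAG_DONE.
 * Returns the number of result messages handled. *sum accumulates the
 * values as a placeholder for the real file output. */
size_t master_loop(struct scripted_source *src, int num_workers, double *sum)
{
    int workers_running = num_workers;
    size_t results = 0;
    struct message m;

    assert(num_workers >= 0 && sum != NULL);
    *sum = 0.0;
    while (workers_running > 0 && scripted_recv(src, &m)) {
        if (m.tag == TAG_DONE) {
            workers_running--;   /* this worker will send nothing more */
        } else if (m.tag == TAG_RESULT) {
            *sum += m.value;     /* real code would write to the file here */
            results++;
        }
    }
    return results;
}
```

With real MPI the tag and the sender would come from status.MPI_TAG and status.MPI_SOURCE after each MPI_Recv. Workers can use MPI_Isend for their results as long as they wait on the request (MPI_Wait) before reusing the buffer, and the master's receive can stay blocking, since receiving is all it does. None of this removes the load imbalance itself.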

What exactly are MPI, MPICH, and Open MPI? What does "implementation" mean in this context?

My question might seem silly to those who have been in the field for a long time, but I appreciate your patience in elaborating on it for me.
When they say MPICH is an "implementation" of MPI, what does it mean?
Is the following analogy correct?
If we think of MPI as a set of standards for a Fortran compiler, then MPICH and Open MPI are different Fortran compilers, like Intel Fortran, Compaq Fortran, GNU Fortran, and so on.
MPI is a standard: it outlines a particular model for message passing in a distributed system. However, it only gives a series of requirements: it does not actually include any code, nor does it specify how exactly these requirements need to be fulfilled. For example, take a look at this excerpt from the official MPI 2.2 spec (as of today):
A valid MPI implementation guarantees certain general properties of
point-to-point communication, which are described in this section.
Order. Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same
receive, then this operation cannot receive the second message if the
first one is still pending.
It then goes on to explain the rationale behind this requirement and provide an example, but says nothing more about the requirement itself.
An MPI implementation is a library that fulfills every requirement - like the one above - in the MPI specification. However, the standard contains absolutely no requirements as to what language constructs, OS calls, 3rd party libraries, etc can/can't/should be used. Occasionally, it will give advice to implementors, like this:
Advice to implementors. The implementation may keep a reference count
of active communications that use the datatype, in order to decide
when to free it. Also, one may implement constructors of derived
datatypes so that they keep pointers to their datatype arguments,
rather then copying them. In this case, one needs to keep track of
active datatype definition references in order to know when a datatype
object can be freed. (End of advice to implementors.)
However, these are still vague, very language-agnostic, and only recommendations: an implementation can ignore every single piece of this advice and still conform to the standard.
So yes, in essence it's similar to various implementations of a compiler. If a program takes valid source code for a language, and produces binary code that does everything that the language specification says it should do given the original source code, it's a conforming compiler for that language. Similarly, if you can use a library to pass messages in a way that doesn't break any rules of the MPI spec, then that's a valid MPI implementation.
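To make the analogy concrete, here is a toy sketch in C. It has nothing to do with real MPI internals: struct box_ops plays the role of the standard (it only promises that values are received in the order they were sent, much like the non-overtaking rule quoted above), while ring_ops and shift_ops are two unrelated implementations of it. All names here are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define BOX_CAPACITY 8

/* The "standard": what callers may rely on. It demands FIFO order and
 * says nothing about how the values are stored. */
struct box_ops {
    void (*init)(void *box);
    int  (*send)(void *box, int value);   /* 1 on success, 0 if full  */
    int  (*recv)(void *box, int *value);  /* 1 on success, 0 if empty */
};

/* Implementation A: a ring buffer. */
struct ring_box { int data[BOX_CAPACITY]; int head; int count; };

void ring_init(void *p) { struct ring_box *b = p; b->head = 0; b->count = 0; }

int ring_send(void *p, int value)
{
    struct ring_box *b = p;
    if (b->count == BOX_CAPACITY) return 0;
    b->data[(b->head + b->count) % BOX_CAPACITY] = value;
    b->count++;
    return 1;
}

int ring_recv(void *p, int *value)
{
    struct ring_box *b = p;
    if (b->count == 0) return 0;
    *value = b->data[b->head];
    b->head = (b->head + 1) % BOX_CAPACITY;
    b->count--;
    return 1;
}

/* Implementation B: a flat array that shifts everything on each receive. */
struct shift_box { int data[BOX_CAPACITY]; int count; };

void shift_init(void *p) { struct shift_box *b = p; b->count = 0; }

int shift_send(void *p, int value)
{
    struct shift_box *b = p;
    if (b->count == BOX_CAPACITY) return 0;
    b->data[b->count++] = value;
    return 1;
}

int shift_recv(void *p, int *value)
{
    struct shift_box *b = p;
    if (b->count == 0) return 0;
    *value = b->data[0];
    b->count--;
    memmove(b->data, b->data + 1, (size_t)b->count * sizeof b->data[0]);
    return 1;
}

const struct box_ops ring_ops  = { ring_init,  ring_send,  ring_recv  };
const struct box_ops shift_ops = { shift_init, shift_send, shift_recv };

/* A conformance check written only against the "standard". */
int conforms(const struct box_ops *ops, void *box)
{
    int a = 0, b = 0;
    assert(ops != NULL && box != NULL);
    ops->init(box);
    if (!ops->send(box, 1) || !ops->send(box, 2)) return 0;
    if (!ops->recv(box, &a) || !ops->recv(box, &b)) return 0;
    return a == 1 && b == 2 && !ops->recv(box, &a);
}
```

conforms() returns 1 for both, even though one is a ring buffer and the other shifts a flat array on every receive. A caller that sticks to box_ops cannot tell them apart, in the same sense that a program written against the MPI spec can be built with either MPICH or Open MPI.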

Rewriting network packets on the fly using libnetfilter_queue

I am attempting to write a userspace application that can hook into an OS's network stack, sniff packets flying past and edit the ones that it's interested in.
After much Googling, it appears to me that the simplest (yet reasonably robust) method of doing so (on any platform) is Linux's libnetfilter_queue project. However, I'm having trouble finding any reasonable documentation for the project, outside of the limited official documentation. Its main features (as stated by the first link) are:
receiving queued packets from the kernel nfnetlink_queue subsystem
issuing verdicts and/or reinjecting altered packets to the kernel nfnetlink_queue subsystem
Emphasis is my own. How exactly am I meant to go about this? I've tried modifying the sample code provided, but perhaps I am misunderstanding something. The code is operating in NFQNL_COPY_PACKET mode, so I am receiving the whole packet, but my modifications to it seem to be restricted to my own application, as one would expect given the "copy" semantics.
My feeling is that I am meant to make use of NF_QUEUE somehow, but I haven't quite grokked it. Any pointers?
(If there is a simpler mechanism for doing this, which is also cross-platform, I'd love to hear about it!)
I can't believe I missed this previously. As reticent as I am to post questions on SO, I thought I would never work this one out myself. :)
I didn't look at the function prototype properly. It turns out in the "verdict" function (outlined below),
int nfq_set_verdict(struct nfq_q_handle *qh,
                    u_int32_t id,
                    u_int32_t verdict,
                    u_int32_t data_len,
                    const unsigned char *buf);
The last two parameters are for the data to be returned to the network stack. Obvious in hindsight, but I missed it completely as the print_pkt function doesn't take the packet data as a parameter, but extracts it from the struct nfq_data.
The key is to NF_ACCEPT the packet and pass the suitably modified packet back to the kernel.
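"Suitably modified" usually also means keeping the packet internally consistent. As far as I know the kernel does not recompute checksums for a payload you hand back, so treat that as your job. Here is a minimal sketch, in plain C without the libnetfilter_queue headers so that it compiles on its own. rewrite_ttl is a hypothetical edit (changing the IPv4 TTL) and ipv4_header_checksum is the standard RFC 1071 one's-complement sum. Neither function is part of libnetfilter_queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 one's-complement checksum over an IPv4 header of `len` bytes.
 * To compute a fresh checksum, bytes 10-11 must be zero on entry. Run over
 * a header whose checksum is already correct, it returns 0. */
uint16_t ipv4_header_checksum(const unsigned char *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len % 2 == 0);   /* IPv4 headers are a multiple of 4 bytes */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)       /* fold the carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical edit: overwrite the TTL of the IPv4 packet in `pkt` and
 * refresh the header checksum so the result is still a valid packet.
 * Returns 0 on success, -1 if `pkt` does not hold a full IPv4 header. */
int rewrite_ttl(unsigned char *pkt, size_t len, uint8_t new_ttl)
{
    size_t ihl;
    uint16_t csum;

    if (len < 20 || (pkt[0] >> 4) != 4)
        return -1;
    ihl = (size_t)(pkt[0] & 0x0F) * 4;   /* header length in bytes */
    if (ihl < 20 || ihl > len)
        return -1;

    pkt[8] = new_ttl;        /* TTL lives at byte offset 8 */
    pkt[10] = 0;             /* zero the checksum field before recomputing */
    pkt[11] = 0;
    csum = ipv4_header_checksum(pkt, ihl);
    pkt[10] = (unsigned char)(csum >> 8);
    pkt[11] = (unsigned char)(csum & 0xFF);
    return 0;
}
```

In the queue callback you would copy the payload into your own buffer, call something like rewrite_ttl on it, and then pass that buffer and its length as the data_len and buf arguments of nfq_set_verdict together with NF_ACCEPT. If you change TCP or UDP payload bytes instead, the transport checksum needs the same treatment.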
Just a wild guess from digging around the source code: try explicitly adding the mangled payload using nfnl_addattr_l(…, NFQA_PAYLOAD, …)?

mpi under the hood

I need to deliver a presentation on programming in MPI. I need to add a segment on how MPI works under the hood. For example, what happens when I call MPI_Init?
Do you know of any good source from where I can learn these details?
The MPI Spec contains the description of the knobs, sliders, and displays that are on the outside of the "black box" of each API.
The interior details of the black boxes will be implementation dependent...and will also depend on the interconnect (e.g. TCP, IBV, DAPL, etc), the OS (e.g. is the implementation using LSB, or native libraries, etc), and on many other factors to a lesser degree (e.g. message size thresholds will trigger different code paths, and so on). Using "strace" and "ltrace" on the a.out may provide some insight into the actual goings on inside the blackbox.
The best recommendation is to pick an open source implementation and examine the code to determine the internal details.
MPI is a specification, not a particular implementation. The observable behavior is given in the MPI spec. How it works under the hood depends on the particular implementation. If you'd like to take a look at an example implementation, you might be interested in looking at MPICH2 and browsing their source code.
Complement your study of the source code of an implementation of MPI with consideration of how you would implement MPI_Init on your platform of choice. MPI sits on top of already available O/S functionality. I don't mean to suggest that you can figure out how a particular version of MPI is implemented by this approach, but to suggest that you can learn better what is going on under the hood by tackling the problem from another angle.
MPI is only a spec. MPI spec is implemented by various groups and organizations. You will want to pick one implementation, say, MPICH, and you can find their design documentation. That will tell you how the MPI spec is implemented by that group.
If you just want to describe what happens when an application written in MPI is started, you can read about MPI and MPI programming. I highly recommend http://www.citutor.org
