MPI_Scatter: order of scatter

In my work, I noticed that even if I scatter the same amount of data to each process, it takes more time to transfer data from the root to the highest-rank process. I tested this on a distributed-memory machine. If an MWE is needed I will prepare one, but before that I would like to know whether MPI_Scatter gives priority to lower-rank processes.

The MPI standard does not say such a thing, so MPI libraries are free to implement MPI_Scatter() the way they want regarding which task might return earlier than others.
Open MPI, for example, can do either a linear or a binomial scatter (by default, the algorithm is chosen based on communicator and message sizes).
That being said, all data has to be sent from the root process to the other nodes, so obviously some nodes will be served first. If the root process has rank zero, I would expect the highest-rank process to receive its data last (I am not aware of any MPI library implementing a topology-aware MPI_Scatter(), but that might come some day). If the root process does not have rank zero, then MPI might internally renumber the ranks (so that the root is always virtual rank zero), and if this pattern is implemented, the last process to receive its data would be (root + size - 1) % size.
If this is suboptimal from your application's point of view, you always have the option to re-implement MPI_Scatter() your own way (your version can call the library-provided PMPI_Scatter() if needed). Another approach would be to call MPI_Comm_split() (with a single color) in order to renumber the ranks, and use the new communicator for MPI_Scatter().
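To make the renumbering pattern above concrete, here is a minimal Python sketch. It is only a model of the idea, not code from any MPI library: the function names are made up for illustration, and it assumes a simple linear scatter in which the root serves virtual ranks 1, 2, ..., size - 1 in that order.

```python
def virtual_rank(rank, root, size):
    """Shift the numbering so that the root becomes virtual rank 0."""
    return (rank - root + size) % size


def linear_scatter_order(root, size):
    """Real ranks in the order a linear scatter would serve them,
    assuming the root walks the virtual ranks 1 .. size - 1."""
    return [(root + v) % size for v in range(1, size)]
```

With root 0 and 4 processes the order is 1, 2, 3. With root 2 it is 3, 0, 1, so the last rank served is (2 + 4 - 1) % 4 = 1, which matches the formula above. A binomial scatter would serve the ranks in a different order.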


Using MPI_Gatherv with only a subset of all processes

I want to use MPI_Gatherv to collect data onto a single process. The thing is, I don't need data from all other processes, just a subset of them. The only communicator in the code is MPI_COMM_WORLD. So, does every process in MPI_COMM_WORLD have to call MPI_Gatherv? I have looked at the MPI standard and can't really make out what it is saying. If all processes must make the call, can some of the values in MPI_Gatherv's "recvcounts" array be zero? Or would there be some other way to signal to the root process which processes should be ignored? I guess I could introduce another communicator, but for this particular problem that would be coding overkill.
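As an illustration of the zero-count idea raised in the question, the following Python sketch builds the recvcounts and displs arrays the root would pass. It assumes that every rank of MPI_COMM_WORLD still makes the call (MPI_Gatherv is a collective operation) and that the ignored ranks pass a send count of zero. The helper name and the contributions dictionary are hypothetical, and no MPI library is called here.

```python
def gatherv_layout(size, contributions):
    """Build recvcounts and displs for the root of a gatherv.

    contributions maps rank -> number of elements it sends;
    ranks missing from the mapping contribute zero elements.
    Returns (recvcounts, displs, total number of elements).
    """
    recvcounts = [contributions.get(rank, 0) for rank in range(size)]
    displs = []
    offset = 0
    for count in recvcounts:
        displs.append(offset)  # where this rank's data starts in the receive buffer
        offset += count
    return recvcounts, displs, offset
```

For 6 ranks where only ranks 1 and 4 contribute 3 and 2 elements, this gives recvcounts [0, 3, 0, 0, 2, 0] and a receive buffer of 5 elements.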

How does OpenMPI's gather work?

I'm new to MPI and I'm trying to understand how MPI (and specifically OpenMPI) works in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start to send its own?
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
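The tree-shaped aggregation can be modelled in a few lines of Python. This is a simulation of the general binomial-tree idea, not Open MPI's actual code: it assumes the root is rank 0, and the function name and the dictionary-based buffers are choices made for this sketch.

```python
def binomial_gather(chunks):
    """Simulate a binomial-tree gather towards rank 0.

    chunks[r] is the data owned by rank r.
    Returns (list of chunks as seen by the root, number of rounds).
    """
    size = len(chunks)
    # Every rank starts with a buffer holding only its own chunk, keyed by origin rank.
    buffers = [{rank: chunks[rank]} for rank in range(size)]
    active = [True] * size
    mask = 1
    rounds = 0
    while mask < size:
        for rank in range(size):
            # A rank whose current bit is set forwards everything it has
            # collected so far to its parent and then drops out.
            if active[rank] and (rank & mask):
                parent = rank - mask
                buffers[parent].update(buffers[rank])
                active[rank] = False
        mask <<= 1
        rounds += 1
    gathered = [buffers[0][rank] for rank in range(size)]
    return gathered, rounds
```

For 8 ranks the root has everything after 3 rounds, whereas the basic linear algorithm needs 7 receives one after another at the root. The price is that the chunks forwarded up the tree grow larger in each round.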
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of the gather operation: the total network transfer to the root, as well as the memory requirements, grow linearly with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Once you have your program, you should measure and see where time is spent, or wasted. For that you should use an MPI performance analysis tool. If you have identified that your gather has a significant impact on performance, you can go ahead and try to optimize it. But to do so, first consider whether you can structure your communication conceptually better, e.g. by somehow removing the communication altogether or using a clever reduction instead. If you still need to stick to the gather, go ahead and tune your MPI implementation. Afterwards, verify that your optimization did indeed improve performance on your specific system.

Broadcasting a Better Value in MPI

I want to write a small program in MPI (Java implementation).
A variable x (of type double) is declared. Threads try to modify the variable (let's say by a random modification). When a thread i finds a new value of x which is smaller than the old one, it broadcasts it to the other threads so that they can update their copy of x.
I have looked at the Bcast function in MPI ... but in all examples it was called by all threads whether the variable is modified or not.
This is one of those scenarios that are quite easy to implement in a multithreaded environment (e.g. OpenMP or Java threads) and very hard to impossible to implement efficiently in MPI. The usual approach is to refactor your algorithm in such a way that the best value could be communicated every N steps (with N possibly equal to 1, but that could be very inefficient due to the communication overhead) and then use Intracomm.Allreduce with the reduce operation set to MPI.MIN. Each process provides its own minimum value and the reduction returns the global minimum. If you would also like to know the rank of the process that holds the global minimum value, MPI.MINLOC should be used instead.
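The "synchronise every N steps" pattern can be sketched without any MPI binding. The Python model below keeps one copy of x per simulated process and replaces the real Allreduce call with a plain function that has the same result semantics as a MINLOC reduction. All names, the starting value and the step counts are assumptions made for illustration.

```python
import random


def allreduce_minloc(local_values):
    """Model of an all-reduce with a MINLOC-style operation: every process
    gets the same (global minimum, rank that holds it) pair.
    Ties go to the lowest rank."""
    owner = min(range(len(local_values)), key=lambda rank: local_values[rank])
    return local_values[owner], owner


def search(num_procs=4, steps=20, sync_every=5, seed=0):
    rng = random.Random(seed)
    x = [100.0] * num_procs  # one private copy of x per simulated process
    for step in range(1, steps + 1):
        for rank in range(num_procs):
            candidate = x[rank] + rng.uniform(-1.0, 1.0)  # the random modification
            if candidate < x[rank]:
                x[rank] = candidate
        if step % sync_every == 0:
            # Every process takes part in the reduction, whether or not
            # it improved its own value since the last synchronisation.
            best, _owner = allreduce_minloc(x)
            x = [best] * num_procs
    return x
```

Between synchronisation points the copies drift apart, and after each reduction all processes agree on the best value found so far. A smaller sync_every spreads improvements faster at the cost of more communication.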
If you are trying to implement parallel genetic optimisation, there are some C++ libraries that might give you some inspiration.

How to Implement embarrassingly parallel task (FOR loop) WITHOUT MPI-IO?

I have a very large array (one-dimensional) and need to solve an evolution equation (a wave-like equation). I need to calculate an integral at each value of this array, store the resulting array of integrals, apply the integration again to this array, and so on (in simple words, I apply an integral on a grid of values, store this new grid, apply the integration again, and so on).
I used MPI-IO to spread over all nodes: there is a shared .dat file on my disc, each MPI copy reads this file (as a source for integration), performs integration and writes again to this shared file. This procedure repeats again and again. It works fine. The most time consuming part was the integration and file reading-writing was negligible.
Current problem:
Now I have moved to a 1024-CPU (16x64) HPC cluster and I'm facing the opposite problem: the calculation time is NEGLIGIBLE compared to the read-write process!!!
I tried to reduce the number of MPI processes: I use only 16 MPI processes (to spread over the nodes) + 64 threads with OpenMP to parallelize my computation inside each node.
Again, reading and writing is now the most time-consuming part.
How should I modify my program, in order to utilize the full power of 1024 CPUs with minimal loss?
The important point, is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading and writing, I can ask my rank=0 (master rank) to send the entire array to, and receive it from, all nodes (MPI_Bcast). So, instead of every node doing I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
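A rough Python sketch of such a decomposition follows. It only computes index ranges and does not call MPI; the function name, the half-open range convention and the default overlap width of one element are assumptions. How wide the overlap has to be depends on how far the computation at one grid point reaches into its neighbours.

```python
def block_with_halo(n, size, rank, halo=1):
    """Index ranges (half-open) for one rank of a 1D block decomposition.

    Returns (own_start, own_stop, read_start, read_stop): the rank updates
    [own_start, own_stop) and needs to read [read_start, read_stop),
    which includes the overlap with its neighbours.
    """
    base, extra = divmod(n, size)
    # The first `extra` ranks get one additional element each.
    own_start = rank * base + min(rank, extra)
    own_stop = own_start + base + (1 if rank < extra else 0)
    read_start = max(0, own_start - halo)
    read_stop = min(n, own_stop + halo)
    return own_start, own_stop, read_start, read_stop
```

For an array of 10 elements on 3 ranks, rank 1 owns indices 4 to 6 and additionally reads indices 3 and 7 from its neighbours. After each step only these overlap cells would have to be exchanged, not the whole array.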
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.

time-based simulation with actors model

we have a single-threaded application that simulates the interaction of hundreds of thousands of objects over time with the shared memory model.
obviously, it suffers from its inability to scale over multi CPU hardware.
after reading a little about agent based modeling and functional programming/actor model I was considering a rewrite with the message-passing paradigm.
the idea is very simple - each object will be an actor and their interactions will be messages so that the simulation could happen in parallel. given a configuration of objects at a certain time - its future consequences can be easily computed.
the question is how to model the time:
for example, let's assume the behavior of object X depends on A and B. as the order in which actors and messages are processed is not guaranteed, it could be that when X is to be computed, A has already sent its message to X but B hasn't.
how to make sure the computation happens correctly?
I hope the question is clear
thanks in advance.
Your approach of using message passing to parallelize a (discrete-event?) simulation is well-known and does not require a functional style per se (although, of course, this does not prevent you from implementing it like that).
The basic problem you describe w.r.t. the timing of events is also known as the local causality constraint (see, for example, this textbook). Basically, you need to use a synchronization protocol to ensure that each object (or agent) processes its messages in the right order. In the domain of parallel discrete-event simulation, such objects are called logical processes, and they communicate via events (i.e. time-stamped messages).
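To give a feeling for what such a protocol has to do, here is a minimal Python sketch of one conservative rule: a logical process only handles an event while every one of its input channels has something pending, and it always takes the event with the smallest timestamp. The class and method names are invented for this example, and it assumes that each sender delivers its events in non-decreasing timestamp order.

```python
from collections import deque


class LogicalProcess:
    def __init__(self, name, inputs):
        self.name = name
        # One FIFO channel per process this one depends on.
        self.channels = {src: deque() for src in inputs}
        self.processed = []

    def receive(self, src, timestamp, payload):
        # Assumption: each sender delivers in non-decreasing timestamp order.
        self.channels[src].append((timestamp, payload))

    def step(self):
        """Process all events that are currently safe; return how many."""
        count = 0
        # An event is safe only while no channel is empty: an empty channel
        # could still deliver an event with an earlier timestamp.
        while self.channels and all(self.channels.values()):
            src = min(self.channels, key=lambda s: self.channels[s][0][0])
            timestamp, payload = self.channels[src].popleft()
            self.processed.append((timestamp, src, payload))
            count += 1
        return count
```

In the scenario from the question, X = LogicalProcess("X", ["A", "B"]) processes nothing while only A has sent its events. Once B's event arrives, X handles the events in timestamp order and stops again as soon as one channel runs empty. A complete protocol needs more than this, for example a way to keep processes from waiting on each other forever.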
Correctly implementing a synchronization protocol for these events is challenging and the right choice of protocol is highly application-specific. For example, one important factor is the average amount of computation required per event: if there is little computation required, the communication costs dominate the overall execution time and it will be hard to scale the simulation.
I would therefore recommend looking for existing solutions/libraries on top of the actor framework you intend to use before starting from scratch.
