At each step of my algorithm I have to gather a lot of data at the root from the other processes, and the amount of data contributed by some processes may be zero.
So I am wondering which approach is more efficient and faster: MPI_Igatherv, or MPI_Isend & MPI_Irecv issued only when the buffer is not empty.
Does MPI internally take care of zero-sized buffers in MPI_Igatherv?
Would it be possible, and better, to use one-sided communication instead of MPI_Igatherv or MPI_Isend?
A lot of this you're just going to have to implement and see what happens. Implementations can provide a lot of optimizations for different cases, and the network hardware/topology might affect the results.
In general, there is nothing wrong with passing zero bytes to a collective operation. The process contributing zero bytes might still be well-situated (topologically speaking) and could participate in the collective operation.
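For instance, a gather where some ranks contribute nothing can be expressed directly with MPI_Igatherv by passing a send count of 0 on those ranks and zero entries in the root's recvcounts array. A minimal sketch, assuming the root already knows how much each rank contributes (the buffer names and the even/odd pattern are made up for illustration):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Even ranks contribute `rank` ints, odd ranks contribute nothing. */
    int sendcount = (rank % 2 == 0) ? rank : 0;
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (int i = 0; i < sendcount; ++i) sendbuf[i] = rank;

    int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
    if (rank == 0) {
        recvcounts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        int total = 0;
        for (int r = 0; r < size; ++r) {
            recvcounts[r] = (r % 2 == 0) ? r : 0;  /* zero-sized contributions are fine */
            displs[r] = total;
            total += recvcounts[r];
        }
        recvbuf = malloc((total > 0 ? total : 1) * sizeof(int));
    }

    MPI_Request req;
    MPI_Igatherv(sendbuf, sendcount, MPI_INT,
                 recvbuf, recvcounts, displs, MPI_INT,
                 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

Whether this beats hand-rolled MPI_Isend/MPI_Irecv calls that skip the empty ranks is exactly the kind of thing you have to benchmark on your own system.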
One-sided operations might be better, or might not. Depends again on many factors.
I am working on my bachelor's final project, a comparison between Apache Spark Streaming and Apache Flink (streaming only), and I have just arrived at "Physical partitioning" in Flink's documentation. The problem is that the documentation doesn't explain well how these two transformations work. Directly from the documentation:
shuffle(): Partitions elements randomly according to a uniform distribution.
rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.
Source: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning
Both are done automatically, so what I understand is that both redistribute the data evenly and randomly (shuffle() via a uniform distribution, rebalance() via round-robin). I then deduce that rebalance() distributes the data in a better way ("equal load per partition"), so the tasks have to process the same amount of data, whereas shuffle() may create bigger and smaller partitions. In which cases, then, might you prefer shuffle() over rebalance()?
The only thing that comes to mind is that rebalance() probably requires some processing time, so in some cases the rebalancing might cost more time than it saves in the later transformations.
I have been looking into this and nobody seems to have discussed it, except on a Flink mailing list, but there they don't explain how shuffle() works.
Thanks to Sneftel, who helped me improve my question by asking things that made me rethink what I wanted to ask, and to Till, who answered my question quite well. :D
As the documentation states, shuffle will distribute the data randomly whereas rebalance will distribute it in a round-robin fashion. The latter is more efficient since you don't have to compute a random number. Moreover, depending on the randomness, you might end up with a somewhat non-uniform distribution.
On the other hand, rebalance will always start sending the first element to the first channel. Thus, if you have only a few elements (fewer elements than subtasks), then only some of the subtasks will receive elements, because you always start by sending the first element to the first subtask. In the streaming case this should not matter in the end, because you usually have an unbounded input stream.
The actual reason why both methods exist is historical: shuffle was introduced first, and rebalance was introduced later to make the batch and streaming APIs more similar.
This statement by Flink is misleading:
Useful for performance optimisation in the presence of data skew.
Since it's used to describe rebalance but not shuffle, it suggests that this is the distinguishing factor. My understanding of it was that, if some items are slow to process and some fast, the partitioner would send the next item to the next free channel. But that is not the case; compare the code for rebalance and shuffle. rebalance just advances to the next channel regardless of how busy it is.
// rebalance
nextChannelToSendTo = (nextChannelToSendTo + 1) % numberOfChannels;
// shuffle
nextChannelToSendTo = random.nextInt(numberOfChannels);
The statement can also be understood differently: the "load" doesn't mean actual processing time, just the number of items. If your original partitioning is skewed (vastly different numbers of items per partition), the operation will assign items to partitions uniformly. However, in that reading it applies to both operations.
My conclusion: shuffle and rebalance do the same thing, but rebalance does it slightly more efficiently. The difference is so small, though, that you're unlikely to notice it; java.util.Random can generate 70m random numbers in a single thread on my machine.
The major difference between RISC and CISC is that in RISC we must use registers for any arithmetic or logic operation, whereas in CISC we can perform such operations directly on memory locations. So what is the advantage of implementing register banking in microcontroller architectures? The question is not about the advantages of RISC; it is about why registers are needed in a RISC architecture at all. Since in CISC an operation can be performed directly on memory locations, we don't need to load the values into registers and then move the result back to memory. Below is an example:
CISC: MUL A,B
RISC:
LDA R0,A
LDA R1,B
MUL R0,R1
STR A,R0
So, in the above example, what is the advantage of using R0 and R1, i.e. registers? What is the advantage of a load/store architecture?
Register banking is something else; I assume you are simply asking about using a register directly or not. Memory access takes an eternity, even when cached: several to hundreds of clock cycles for each of the operands. In RISC you would assume a pure register-based scheme, which not all are; the lines are getting fuzzy. With CISC, if the instruction is microcoded the operands are going into registers anyway and then the operation happens; if it is not microcoded, they still get latched into internal temporary storage (registers) before the operation can begin. With RISC you have a couple-three extra, simpler instructions, and latching values into registers takes the same amount of time as it does in CISC. Now if the algorithm never uses that result, or does not use it for a while, it might be a win for CISC (if not microcoded); but if the value is an intermediate value in an algorithm, it is a clear win for RISC. Even if everything is cached, it takes half a dozen to a dozen clock cycles to fetch each operand and write the result back, and any cache miss is an eternity. The same applies to RISC, but with more registers and significantly faster access to them: zero or one clock to read each value and to store the result back, for some percentage of the algorithm if not the whole of it.
As with any benchmarking it is trivial to show a RISC winning case and to show a CISC winning case.
The major difference between RISC and CISC is that CISC instructions are complicated and time-consuming, whereas RISC instructions are much simpler: you arrange the tasks you need to do yourself, have tighter control over those tasks, and don't have a lot of waste per step. One could argue that caches were created to deal with the inefficiencies of CISC, or at least of one popular CISC. Both benefit, sure, but one relies on them while the other doesn't as much. It is trivial to show CISC-winning code and trivial to show RISC-winning code. The same goes for VLIW and others.
RISC designs are simpler and smaller, pipelines can be shorter, the compiler has more control over performance, and so on. So with microcontrollers you can have a very nice processor core with a 3-stage pipeline that is really low power and still quite efficient. The 6502, Z80, 8051, etc. have mostly died off; you still see a lot of 8051s if you look (the desktop/laptop you might be reading this on probably has one, but that is due to royalties and not because of its size or performance), and you probably have several to dozens of ARM cores for every x86, within the same box or certainly around the house. A CISC core is going to be relatively massive and inefficient. It might be possible to get its power consumption down to RISC levels, and that may just be a matter of design rather than CISC vs RISC, but the RISC implementations are doing a much better job at watts per MHz than the CISC implementations.
Using registers simplifies the operand-fetching logic of the functional units. With CISC, functional units must be able to fetch data from memory. With RISC, all functional units operate on registers, and it is guaranteed that the data will be there, so the logic is less complicated.
Also, think of a case where you have multiple MUL operations, some using data at location A and some using data at location B, as shown below.
MUL A, B
MUL C, B
When you perform these operations in CISC, you read B from memory twice. In RISC, you load it into a register once and can use it multiple times, so there are fewer memory (cache) accesses.
Also think of the number of bits needed to encode that MUL in CISC. Since A, B, and C are memory locations, they could be anywhere within your address space. With registers in RISC, fewer bits are needed to encode the operands, hence a less complicated instruction set.
From the above responses, we can conclude that using registers instead of direct memory locations gives a benefit in efficiency in terms of clock cycles, and therefore power consumption. Registers also give a benefit in terms of the complexity of the instructions.
I'm new to MPI and I'm trying to understand how MPI (and specifically Open MPI) works in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle the incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start sending its own?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
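Conceptually it corresponds to something like the following point-to-point sketch (an illustration of the ordering only, not Open MPI's actual code):

#include <mpi.h>
#include <string.h>

/* "Basic linear" gather: the root receives from the other ranks one after another. */
void gather_basic_linear(const int *sendbuf, int count, int *recvbuf,
                         int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != root) {
        MPI_Send(sendbuf, count, MPI_INT, root, 0, comm);
    } else {
        memcpy(recvbuf + rank * count, sendbuf, count * sizeof(int));
        for (int r = 0; r < size; ++r) {
            if (r == root) continue;
            /* Blocking receive from each rank in turn: rank 2 is only served
               after the message from rank 1 has been fully received. */
            MPI_Recv(recvbuf + r * count, count, MPI_INT, r, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}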
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of the gather operation: the total network transfer to the root, as well as the memory requirements, grow linearly with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now, when you have your program, you should measure and see where time is spent, or wasted. For that you should use MPI performance analysis tools. If you have identified that your gather has a significant impact on performance, you can go ahead and try to optimize it: but to do so, first consider whether you can structure your communication conceptually better, e.g. by somehow removing it altogether or by using a clever reduction instead. If you still need to stick with the gather, go ahead and tune your MPI implementation. Afterwards, verify that your optimization did indeed improve performance on your specific system.
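As a very rough first measurement, before reaching for a full analysis tool, you can bracket the call with MPI_Wtime and look at the slowest rank; a minimal sketch with made-up buffer sizes:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, count = 1024;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = calloc(count, sizeof(int));
    int *recvbuf = (rank == 0) ? calloc((size_t)size * count, sizeof(int)) : NULL;

    double t0 = MPI_Wtime();
    MPI_Gather(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, 0, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0, slowest;

    /* The slowest rank is what limits the step. */
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("slowest rank spent %f s in MPI_Gather\n", slowest);

    MPI_Finalize();
    return 0;
}

A dedicated MPI profiler or tracing tool will give far more insight than this, but it is enough to tell whether the gather is worth optimizing at all.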
I want to use MPI to make my program parallel, and I want to send something to other computers. I want to know which is better: sending one huge buffer once, or splitting it into smaller messages sent at different times during the execution instead of all at once?
It's almost always going to be faster to send one big message than several smaller ones. Each time you do a Send/Receive pair, the two processes have to go through the entire process of sending a message to each other, including at least 6 roundtrip messages. If you just send one larger message, there is a minimum of 2 roundtrip messages. Each of those messages can be very expensive (compared to doing things locally, like packing all of your data into one buffer).
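A hedged sketch of the two variants (N and the buffer contents are made up for illustration; run with at least two ranks):

#include <mpi.h>
#include <stdlib.h>

#define N 100000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *data = calloc(N, sizeof(double));

    /* Variant A: one message carrying the whole buffer. */
    if (rank == 0)
        MPI_Send(data, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(data, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Variant B: N messages of one element each; every message pays the full latency cost. */
    for (int i = 0; i < N; ++i) {
        if (rank == 0)
            MPI_Send(&data[i], 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&data[i], 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Timing both variants on your own machines (e.g. with MPI_Wtime around each) will show how large the per-message overhead is for your network.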
I'd encourage you to try it out both ways though to be sure that this applies to your application. It could be different if you're doing something unexpected.
Depending on your problem, sending all the data at once may be more efficient, because otherwise the nodes have to be synchronized for every message, and that may cause a delay.
I always try to send as much data as I can in a single MPI call. In my experience, sending many small bits of data greatly increases the overhead and network traffic, and I have even run into problems where I overwhelmed the computers' ability to keep up with the number of requests, because I was sending a large member of a complicated class, one integer at a time, to many workers. Therefore, when possible, send the entire data at once, unless you have some reason to believe it is too large.
Further, I strive to use 100% of all the CPUs my program claims. When you are working on shared resources, if you claim a CPU, you need to actually use it. Otherwise, someone else who wants to use that core or node is blocked out while your program sits and does nothing. For example, on a Cray I have used, even if you ask for only two 'cores', the manager will reserve a full bank of 24 cores, essentially wasting 22. Or perhaps one worker has nothing to do while another chugs away; again, wasting time. Hopefully there is a way to balance the load, so to speak, to avoid unintentional waste of resources.
Back to the topic at hand. Demonstrate the timing and efficiency of vector sending for yourself: write a program which breaks the vector up into packets of varying sizes and does the sends/receives. Test it with varying numbers of workers and, if you can, on several different configurations of computers. Before writing production code, do a proof of concept and whatever optimization you can. Test and time it!
Suppose that I have two big functions. Is it better to write them as separate kernels and call them sequentially, or is it better to write only one kernel? (I don't want to read the data back and forth between host and device in between.) What about the speedup if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (ie. GPUs) have very finite register file sizes and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), less opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule of thumb for this sort of thing. Benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readability of your code. Copying buffers is not an issue as long as you keep them within the same context. E.g. you could set an output buffer of one kernel to be an input buffer of the next kernel, which would not involve any copying.
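For example (host-side C; the kernel handles, argument order, and buffer sizes here are purely illustrative, and the context, queue, and kernels are assumed to have been created already), the output buffer of the first kernel is simply set as an argument of the second, so the data never leaves the device:

#include <CL/cl.h>

/* Chain two kernels on device-resident buffers without copying back to the host. */
void run_two_phase(cl_context ctx, cl_command_queue queue,
                   cl_kernel k1, cl_kernel k2,
                   size_t n, cl_int *err) {
    cl_mem mid = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, err);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, err);

    /* Output of kernel 1 ... */
    clSetKernelArg(k1, 0, sizeof(cl_mem), &mid);
    /* ... is the input of kernel 2; no clEnqueueReadBuffer in between. */
    clSetKernelArg(k2, 0, sizeof(cl_mem), &mid);
    clSetKernelArg(k2, 1, sizeof(cl_mem), &out);

    clEnqueueNDRangeKernel(queue, k1, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k2, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(mid);
    clReleaseMemObject(out);
}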
The proper way to code in OpenCL is to separate your code into parallel tasks, each of which is a kernel. That is, each "for loop" should be a kernel. Sometimes a single CPU function can end up as a 4-kernel implementation in OCL.
If you need to keep data between kernel executions, just use OpenCL buffers and do not copy to the host (this avoids the DEVICE<->HOST bottleneck).
If both functions act on different data, you could probably write a single kernel, but that depends on the complexity of the operation being run.