Difference between All-to-All Reduction and All-Reduce in MPI

Trying to figure out the difference between All-to-All Reduction and All-Reduce in Open MPI. From my understanding, All-to-One Reduction takes a piece m (integer, array, etc.) from all processes and combines all the pieces together with an operator (min, max, sum, etc.), storing the result in the selected process. From this I assume that All-to-All Reduction is the same, but the result is stored in all the processes instead of just one. From this document it seems like All-Reduce is basically doing the same as All-to-All Reduction. Is this right, or am I getting it wrong?

The all-reduce (MPI_Allreduce) is a combined reduction and broadcast (MPI_Reduce, MPI_Bcast). They might have called it MPI_Reduce_Bcast. It is important to note that an MPI reduction does not reduce each process's buffer down to a single value; it reduces element-wise across processes. So if you have 10 numbers each on 5 processes, after an MPI_Reduce one process has 10 numbers. After MPI_Allreduce, all 5 processes have the same 10 numbers.
In contrast, the all-to-all reduction performs a reduction followed by a scatter, hence it is called MPI_Reduce_scatter[_block]. So if you have 10 numbers each on 5 processes, after an MPI_Reduce_scatter_block the 5 processes have 2 numbers each. Note that MPI itself doesn't use the terminology "all-to-all reduction", probably because the term is ambiguous.
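To make the difference concrete, here is a minimal C sketch of the 5-process, 10-number example above; the buffer contents and the choice of MPI_SUM are arbitrary illustrations, not part of the original answer.

    /* Run with exactly 5 ranks: after MPI_Allreduce every rank holds all 10
     * element-wise sums; after MPI_Reduce_scatter_block each rank holds a
     * 2-element slice of those same sums. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 5) {
            if (rank == 0) fprintf(stderr, "run with exactly 5 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        int send[10];
        for (int i = 0; i < 10; i++)
            send[i] = rank * 10 + i;        /* arbitrary per-rank data */

        int allred[10];                     /* every rank gets all 10 sums */
        MPI_Allreduce(send, allred, 10, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        int slice[2];                       /* every rank gets 2 of the sums */
        MPI_Reduce_scatter_block(send, slice, 2, MPI_INT, MPI_SUM,
                                 MPI_COMM_WORLD);

        printf("rank %d: allreduce[0]=%d, scattered slice = {%d, %d}\n",
               rank, allred[0], slice[0], slice[1]);

        MPI_Finalize();
        return 0;
    }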

Related

How to distribute a number range to each process?

I am doing a learning exercise. I want to calculate the number of primes in a range from 0 to N. What MPI function can I use to distribute ranges of numbers to each process? In other words, each process calculates the number of primes within a sub-range of the main range.
You could simply use a for loop and MPI_Send on the root (and MPI_Recv on receivers) to send to each process the number at which it should start and how many numbers it should check.
Another possibility, even better, is to send N to each process with MPI_Bcast (on root and receivers) and let each process compute which numbers it should check using its own rank (something like start=N/MPI_Comm_size*MPI_Comm_rank and length=N/MPI_Comm_size, plus some adequate rounding, etc.).
You can probably optimize load balancing even more but you should get it working first.
At the end you should call MPI_Reduce with a sum.
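A minimal C sketch of that broadcast-then-reduce structure follows; the is_prime helper, the hard-coded N, and the simple split where the last rank absorbs the rounding remainder are illustrative assumptions, not part of the answer.

    #include <mpi.h>
    #include <stdio.h>

    /* Naive trial-division primality test, good enough for the exercise. */
    static int is_prime(long n)
    {
        if (n < 2) return 0;
        for (long d = 2; d * d <= n; d++)
            if (n % d == 0) return 0;
        return 1;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long N = 0;
        if (rank == 0)
            N = 1000000;                    /* root chooses (or reads) N */
        MPI_Bcast(&N, 1, MPI_LONG, 0, MPI_COMM_WORLD);

        /* Each rank derives its own sub-range from N and its rank;
         * the last rank picks up the rounding remainder. */
        long chunk = N / size;
        long start = chunk * rank;
        long end   = (rank == size - 1) ? N : start + chunk;

        long local_count = 0;
        for (long i = start; i < end; i++)
            local_count += is_prime(i);

        long total = 0;
        MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("primes below %ld: %ld\n", N, total);

        MPI_Finalize();
        return 0;
    }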

How would I normalize a float array to the range [0.0, 1.0] in parallel?

I want to design a kernel in which I can pass an array of floats and have them all come out with the maximum being 1.0 and the minimum being 0.0. Theoretically, each element would be mapped to something like (x-min)/(max-min). How can I parallelize this?
A simple solution would be to split the problem into 2 kernels:
Reduction kernel
Divide your array into chunks of N * M elements each, where N is the number of work-items per group, and M is the number of array elements processed by each work-item.
Each work-item computes the min() and max() of its M items.
Within the workgroup, perform a parallel reduction of min and max across the N work-items, giving you the min/max for each chunk.
With those values obtained, one work-item in the group can use atomics to update the global min/max values. Given that you are using floats, you will need the well-known workaround for the lack of atomic min/max/CAS operations on floats (typically a compare-and-swap loop on the value's bit pattern reinterpreted as an integer).
Application
After your first kernel has completed, you know that the global min and max values must be correct. You can compute your scale factor and normalisation offset, and then kick off as many work items as your array has elements, to multiply/add each array element to adjust it.
Tweak your values for N and M to find an optimum for a given OpenCL implementation and hardware combination. (Note that M = 1 may be the optimum, i.e. launching straight into the parallel reduction.)
Having to synchronise between the two kernels is not ideal but I don't really see a way around that. If you have multiple independent arrays to process, you can hide the synchronisation overhead by submitting them all in parallel.
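The kernel code itself depends on your OpenCL setup, but the two-pass logic it implements can be sketched serially. The plain C sketch below mirrors the chunked min/max pass and the per-element rescaling pass; the chunk size, the sample data, and the omission of the workgroup reduction and the float-atomic workaround are all simplifications.

    #include <float.h>
    #include <stdio.h>

    #define CHUNK 4   /* stands in for the N * M elements per workgroup */

    static void normalize(float *a, int n)
    {
        float gmin = FLT_MAX, gmax = -FLT_MAX;

        /* Pass 1: per-chunk min/max folded into a global min/max
         * (in OpenCL, this fold is where the atomic update happens). */
        for (int base = 0; base < n; base += CHUNK) {
            float cmin = FLT_MAX, cmax = -FLT_MAX;
            for (int i = base; i < base + CHUNK && i < n; i++) {
                if (a[i] < cmin) cmin = a[i];
                if (a[i] > cmax) cmax = a[i];
            }
            if (cmin < gmin) gmin = cmin;
            if (cmax > gmax) gmax = cmax;
        }

        /* Pass 2: one independent operation per element, so it is
         * trivially parallel across work-items. */
        float scale = (gmax > gmin) ? 1.0f / (gmax - gmin) : 0.0f;
        for (int i = 0; i < n; i++)
            a[i] = (a[i] - gmin) * scale;
    }

    int main(void)
    {
        float a[] = { 3.0f, -1.0f, 7.5f, 2.25f, 0.0f, 5.0f };
        normalize(a, 6);
        for (int i = 0; i < 6; i++)
            printf("%g ", a[i]);            /* min maps to 0, max to 1 */
        printf("\n");
        return 0;
    }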

Encoding DNA strand in Binary

Hey guys I have the following question:
Suppose we are working with strands of DNA, each strand consisting of
a sequence of 10 nucleotides. Each nucleotide can be any one of four
different types: A, G, T or C. How many bits does it take to encode a
DNA strand?
Here is my approach to it and I want to know if that is correct.
We have 10 spots. Each spot can have 4 different symbols. This means our binary encoding must be able to represent 4^10 distinct combinations.
4^10 = 1048576.
We will then find the log base 2 of that. What do you guys think of my approach?
Each nucleotide (aka base-pair) takes two bits (one of four states -> 2 bits of information). 10 base-pairs thus take 20 bits. Reasoning that way is easier than doing the log2(4^10), but gives the same answer.
It would be fewer bits of information if there were any combinations that couldn't appear. e.g. some codons (sequence of three base-pairs) that never appear. But ten independent 2-bit pieces of information sum to 20 bits.
If some sequences appear more frequently than others, and a variable-length representation is viable, then Huffman coding or other compression schemes could save bits most of the time. This might be good in a file-format, but unlikely to be good in-memory when you're working with them.
Densely packing your data into an array of 2bit fields makes it slower to access a single base-pair, but comparing the whole chunk for equality with another chunk is still efficient. (memcmp).
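As a concrete illustration (not from the answer itself), here is a small C sketch that packs a 10-nucleotide strand into the low 20 bits of a uint32_t, two bits per symbol; the A/C/G/T-to-0..3 mapping is an arbitrary choice.

    #include <stdint.h>
    #include <stdio.h>

    /* Pack exactly 10 nucleotides, 2 bits each, into bits 0..19. */
    static uint32_t pack_strand(const char *s)
    {
        uint32_t packed = 0;
        for (int i = 0; i < 10; i++) {
            uint32_t code;
            switch (s[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break;      /* 'T' */
            }
            packed |= code << (2 * i);
        }
        return packed;
    }

    /* Recover the i-th nucleotide from the packed representation. */
    static char unpack_nucleotide(uint32_t packed, int i)
    {
        static const char sym[4] = { 'A', 'C', 'G', 'T' };
        return sym[(packed >> (2 * i)) & 3u];
    }

    int main(void)
    {
        uint32_t p = pack_strand("ACGTACGTAC");
        printf("packed = 0x%05X, first = %c, last = %c\n",
               p, unpack_nucleotide(p, 0), unpack_nucleotide(p, 9));
        return 0;
    }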
20 bits is unfortunately just slightly too large for a 16bit integer (which computers are good at). Storing in an array of 32bit zero-extended values wastes a lot of space. On hardware with good unaligned support, storing 24bit zero-extended values is ok (do a 32bit load and mask the high 8 bits. Storing is even less convenient though: probably a 16b store and an 8b store, or else load the old value and merge the high 8, then do a 32b store. But that's not atomic.).
This is a similar problem for storing codons (groups of three base-pairs that code for an amino acid): 6 bits of information doesn't fill a byte. Only wasting 2 of every 8 bits isn't that bad, though.
Amino-acid sequences (where you don't care about mutations between different codons that still code for the same AA) have about 20 symbols per position, which means a symbol doesn't quite fit into a 4bit nibble.
I used to work for the phylogenetics research group at Dalhousie, so I've sometimes thought about having a look at DNA-sequence software to see if I could improve on how they internally store sequence data. I never got around to it, though. The real CPU intensive work happens in finding a maximum-likelihood evolutionary tree after you've already calculated a matrix of the evolutionary distance between every pair of input sequences. So actual sequence comparison isn't the bottleneck.
do the maths:
4^10 = (2^2)^10 = 2^20
Answer: 20 bits

MPI One sided communication, Remote memory access for low memory

I have a matrix A and a bunch of vectors (b0..bn) which I want to multiply to produce vectors c0..cn.
The standard approach with MPI would be to make a copy of the matrix A on every process and distribute the vectors among the processes accordingly.
For example, with 100 vectors b0..b99 among 10 processes (MPI ranks):
rank0:
c0=A.b0, c1=A.b1, ..., c9=A.b9
rank1:
c10=A.b10, c11=A.b11, ..., c19=A.b19
...
rank9:
c90=A.b90, c91=A.b91, ..., c99=A.b99
My problem is that A is really big and sparse, and making 10 or more copies would be a bad idea or simply impractical: it won't fit (say I have a quad core, i.e. on my host I would have to copy A four times, when one copy barely fits).
Would it be possible to use MPI one-sided communication to keep A only in process 0 and have the other 9 processes just read A using RMA, and still produce c0..c99? Like:
rank0:
c0=A.b0, c1=A.b1, ..., c9=A.b9
rank1:
c10=(remote A in Rank 0).b10, c11=(remote A in Rank 0).b11, ..., c19=(remote A in Rank 0).b19
...
rank9:
c90=(remote A in Rank 0).b90, c91=(remote A in Rank 0).b91, ..., c99=(remote A in Rank 0).b99
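This is not an answer from the thread, but a sketch of what that pattern could look like with MPI-3 one-sided calls: rank 0 backs a window with a dense stand-in for A, and every other rank pulls one row at a time with MPI_Get under a shared lock, so only a single row is ever resident on the reading ranks. The dense layout, the placeholder fill values, and the omitted b vectors and multiply are all simplifications.

    #include <mpi.h>

    #define ROWS 8
    #define COLS 8

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *A = NULL;
        MPI_Win win;
        MPI_Aint bytes = (rank == 0) ? (MPI_Aint)ROWS * COLS * sizeof(double) : 0;

        /* Collective call; only rank 0 actually backs the window with memory. */
        MPI_Win_allocate(bytes, sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &A, &win);

        if (rank == 0) {
            /* Fill A inside a self-lock so the data is visible to later Gets. */
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            for (int i = 0; i < ROWS * COLS; i++)
                A[i] = (double)i;               /* placeholder values */
            MPI_Win_unlock(0, win);
        }
        MPI_Barrier(MPI_COMM_WORLD);            /* A is now readable by all */

        if (rank != 0) {
            double row[COLS];
            for (int r = 0; r < ROWS; r++) {
                MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
                MPI_Get(row, COLS, MPI_DOUBLE, 0,
                        (MPI_Aint)r * COLS, COLS, MPI_DOUBLE, win);
                MPI_Win_unlock(0, win);         /* the Get completes here */
                /* ... multiply `row` against this rank's b vectors ... */
            }
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }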

Generalization of MPI rank number to MPI groups?

Is there any generalization of the rank number to group numbers? For my code I would like to create a hierarchical decomposition of MPI::COMM_WORLD. Assume we make use of 16 processes. I use MPI::COMM_WORLD.Split to create 4 communicators, each having 4 ranks. Is there an MPI function that provides a unique id for each of the corresponding four groups?
Well, you can still refer to each process by its original rank in MPI_COMM_WORLD. You also have complete control over what rank each process receives in its new communicator via the color and key arguments of MPI_Comm_split(). That is plenty of information to create a mapping between old ranks and new groups/ranks.
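For instance, a minimal C sketch of that mapping (assuming the question's 16 ranks split into 4 groups of 4) could use the color as the group id and the key as the rank inside the group; the /4 and %4 split below is an assumption matching those numbers. It is written against the C API, onto which the deprecated C++ bindings in the question map directly.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        int group_id   = world_rank / 4;   /* color: which sub-communicator */
        int local_rank = world_rank % 4;   /* key: rank within that group   */

        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, group_id, local_rank, &sub);

        int sub_rank;
        MPI_Comm_rank(sub, &sub_rank);     /* equals local_rank here */

        printf("world rank %2d -> group %d, local rank %d\n",
               world_rank, group_id, sub_rank);

        MPI_Comm_free(&sub);
        MPI_Finalize();
        return 0;
    }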
If you don't like suszterpatt's answer (I do) you could always abuse a Cartesian communicator and pretend that the process at index (2,3) in the communicator is process 3 in group 2 of your hierarchical decomposition.
But don't read this and take away the impression that I recommend such abuse, it's just a thought.
