Why should I use MPI_REDUCE instead of MPI_ALLREDUCE? - mpi

I am very new to MPI programming (about two days in) and this is the first time I post a question on Stack Overflow. I am now dealing with MPI_ALLREDUCE and MPI_REDUCE. I understand that the difference between the two is that with the former the final value of the reduced variable is passed to all the processes, while with the latter it goes only to a target process. Even though in many cases you do not need to pass the updated value of a variable to the child processes, I do not understand what the benefits of not doing it (or doing it) are. I initially thought it might be better to use MPI_REDUCE so that no time is wasted broadcasting the value of the variable, but I did not see any difference between the two cases in my code. I ran the code using a number of processes between 2 and 6.
The code takes a value n and the task of every process is to add 1 to the variable mypartialsum n/num_procs times, where num_procs is the number of processes. After the reduction the values of mypartialsum are gathered in sum and the final result is sum = n.
program test
  use mpi
  implicit none
  !include 'mpif.h'
  integer :: ierr, num_procs, my_id, root
  integer :: i, n
  !real :: sum=0., partialsum=0., mypartialsum=0.
  integer :: sum = 0, partialsum = 0, mypartialsum = 0
  real :: starttime, endtime

  root = 0
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)
  starttime = MPI_WTIME()
  if (my_id .eq. root) then
    print*, "Running in process 0."
    print*, "Total number of processes is", num_procs
    n = 1000000000
  endif
  call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  !print*, "running process", my_id
  mypartialsum = 0
  do i = my_id+1, n, num_procs
    mypartialsum = mypartialsum + 1
  enddo
  partialsum = mypartialsum
  print*, "Running process", my_id, "Partial sum is ", partialsum
  call MPI_REDUCE(partialsum, sum, 1, MPI_INTEGER, MPI_SUM, root, MPI_COMM_WORLD, ierr)
  !call MPI_ALLREDUCE(partialsum, sum, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
  endtime = MPI_WTIME()
  if (my_id .eq. 0) then
    print*, "sum is", sum, "time spent processing", endtime-starttime
  ! else if (my_id .gt. 0) then
  !   print*, "sum on process", my_id, "is", sum, "time spent processing is", endtime-starttime
  endif
  call MPI_FINALIZE(ierr)
end program

First: there are no "child" processes in MPI. In your case you arbitrarily designate one process as the root of your collectives, but MPI is in principle symmetric: all processes are identical.
Second: your observation is correct: Allreduce runs in exactly the same time as Reduce.
So why do you need Allreduce? Because it is the common case in practice. Say you have a vector of billions of elements that is distributed over your processes, and it is too large to fit on a single process, which is why you use a distributed programming model. Now say that you want to normalize that vector. That means you 1. calculate the norm with some sort of reduction, 2. make sure that every process knows that norm, so that 3. every process divides its own elements by that norm.
That sort of scenario is very common in scientific applications, which is why I consider MPI_Allreduce to be the most basic collective routine.
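To make that concrete, here is a minimal sketch of the normalization scenario (in C rather than your Fortran, and with made-up names like local and local_n): each rank owns a chunk of a distributed vector, and every rank needs the global norm, so a single MPI_Allreduce replaces a Reduce followed by a Bcast.

#include <math.h>
#include <mpi.h>

/* Normalize the locally owned chunk of a distributed vector. */
void normalize(double *local, int local_n, MPI_Comm comm)
{
    double local_sq = 0.0, global_sq = 0.0;
    int i;

    for (i = 0; i < local_n; i++)
        local_sq += local[i] * local[i];

    /* Every rank needs the norm, so reduce and redistribute in one call. */
    MPI_Allreduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM, comm);

    for (i = 0; i < local_n; i++)
        local[i] /= sqrt(global_sq);
}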

Related

MPI communicator for sub-range of MPI_COMM_WORLD

What is a simple way to create a (sub)communicator containing consecutive ranks [rStart, ..., last rank of MPI_COMM_WORLD] of MPI_COMM_WORLD?
rStart is >= 0, i.e., the first rStart ranks need to be excluded.
The simplest code is to have
MPI_Comm_split(MPI_COMM_WORLD, rank < rStart, rank, &new_comm);
run on all ranks of MPI_COMM_WORLD. It will create two communicators: all ranks starting with rStart will get the one you desire; the others can just MPI_Comm_free their communicator.
If you cannot easily have the excluded ranks run the same code, you can use MPI_Comm_create_group, but then you have to also create the group first.
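Putting the split approach together, a small sketch (with an assumed helper name make_subcomm) might look like this:

#include <mpi.h>

/* Returns the [rStart, ..., size-1] subcommunicator on the ranks that keep it,
   and MPI_COMM_NULL on the excluded ranks. */
MPI_Comm make_subcomm(int rStart)
{
    int rank;
    MPI_Comm new_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Excluded ranks get color 1, the rest color 0; using "rank" as the key
       keeps the original ordering, so the new ranks stay consecutive. */
    MPI_Comm_split(MPI_COMM_WORLD, rank < rStart, rank, &new_comm);

    if (rank < rStart) {
        MPI_Comm_free(&new_comm);   /* excluded ranks just discard theirs */
        return MPI_COMM_NULL;
    }
    return new_comm;
}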

Time complexity of this recursive block

int recursiveFunc(int n) {
    if (n == 1) return 0;
    for (int i = 2; i < n; i++)
        if (n % i == 0) return i + recursiveFunc(n / i);
    return n;
}
I know Complexity = length of tree from root node to leaf node * number of leaf nodes, but I'm having a hard time coming up with an equation.
This one is tricky, because the runtime is highly dependent on what number you provide as input, in a way that most recursive functions are not.
For starters, notice that the way this recursion works, it takes in a number and then either
returns without making any further calls if the number is prime, or
finds the smallest divisor d ≥ 2 of the number and recursively calls itself on the number divided by d.
This means that in one case, the function, called on a number n, will do Θ(n) work and make no calls (which happens if the number is prime), and in the other case will do Θ(d) work and then make a recursive call on the number n / d, which happens if n is composite and d is its smallest divisor greater than 1.
One useful fact we'll use to analyze this function is that given a composite number n, the smallest factor d of n is never any greater than √n. If it were, then we would have that n = d·f for some other factor f, and since d is the smallest proper divisor, we'd have that f ≥ d, so d·f > √n · √n = n, which would be impossible.
With that in mind, we can argue that the worst-case runtime of this function is O(n), and in fact that happens when n is prime. Here's how to see this. Imagine the worst-case amount of time this function can take if it ends up making a recursive call. In that case, the function will do at most Θ(√n) work (let's assume our smallest divisor is as large as possible), then recursively makes a call on a number whose size is at most n / 2 (which is the absolute largest number we could get as part of the recursive call). In that case, we'd get this recurrence relation under the pessimistic assumption that we do the maximum work possible:
T(n) = T(n / 2) + √n
This solves, by the Master Theorem, to Θ(√n), which is less work than what we'd do if we had a prime number as an input.
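(To see this without the Master Theorem, just unroll the recurrence: T(n) = √n + √(n/2) + √(n/4) + ... = √n · (1 + 1/√2 + 1/2 + 1/(2√2) + ...), and the series in parentheses is a convergent geometric series with ratio 1/√2, so the total is Θ(√n).)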
But what happens if, instead, we do the maximum amount of work possible for some number of iterations, and then end up with a prime number and stop? In that case, using the iteration method, we'd see that the work done would be
√n + √(n/2) + √(n/4) + ... + √(n/2^(k-1)) + n/2^k,
which would happen if we stopped after k iterations. In this case, notice that this expression is maximized when we pick k to be as small as possible - which would correspond to stopping as soon as possible, which happens if we pick a prime number for n.
So in this sense, the worst-case runtime of this function is Θ(n), which happens for n being a prime number, with composite numbers terminating much faster than this.
So how fast can this function be? Well, imagine, for example, that we have a number of the form p^k, where p is some prime number. In that case, this function will do Θ(p) work to discover p as a prime factor, then recursively call itself on the number p^(k-1). If you think about what this will look like, this function will end up doing Θ(p) work Θ(k) times for a total runtime of Θ(pk). And since n = p^k, we'd have k = log_p n, so the runtime would be Θ(p log_p n). That's minimized at either p = 2 or p = 3, and in either case gives us a runtime of Θ(log n) in this case.
I strongly suspect that's the best case here, though I'm not entirely sure. But what this does mean is that
the worst-case runtime is definitely Θ(n), occurring at prime numbers, and
the best-case runtime is O(log n), which I'm fairly certain is a tight bound, but I'm not 100% sure how to prove it.
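One quick way to see the gap empirically is to instrument the function with a counter for loop iterations (the counter and the test inputs below are my own additions, purely for illustration) and compare a prime input with a power of two:

#include <stdio.h>

static long long work = 0;   /* counts loop iterations across all calls */

int recursiveFunc(int n) {
    if (n == 1) return 0;
    for (int i = 2; i < n; i++) {
        work++;
        if (n % i == 0) return i + recursiveFunc(n / i);
    }
    return n;
}

int main(void) {
    int r;

    work = 0;
    r = recursiveFunc(1000003);          /* prime input */
    printf("n = 1000003 (prime): result %d, work %lld\n", r, work);
    /* roughly n iterations */

    work = 0;
    r = recursiveFunc(1 << 20);          /* n = 2^20, a power of two */
    printf("n = 2^20: result %d, work %lld\n", r, work);
    /* roughly log2(n) = 20 iterations */
    return 0;
}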

OpenCL clEnqueueNDRangeKernel work_dim VS global_work array elements

I'm new in OpenCL and I'm trying to understand this piece of code:
size_t global_work1[3] = {BLOCK_SIZE, 1, 1};
size_t local_work1[3] = {BLOCK_SIZE, 1, 1};
err = clEnqueueNDRangeKernel(cmd_queue, diag, 2, NULL, global_work1, local_work1, 0, 0, 0);
So, in the clEnqueueNDRangeKernel call 2 dimensions are specified for the kernel (the work_dim argument), and this means that:
the dimension 0 kernel gets a number of threads equal to BLOCK_SIZE and only one group (I guess the number of groups can be calculated as global_work1[0] / local_work1[0]);
the dimension 1 kernel gets a number of threads equal to 1 and only one group.
I also don't understand why a dimension of 2 is specified in the enqueue call when there are three elements in global_work1 and local_work1.
You are telling CL:
"Run this kernel, in this queue, with 2D and these global/local sizes"
CL just reads the first 2 entries of each array and ignores the 3rd one.
As for the difference between 1D and 2D: there is none. OpenCL kernels launched as 1D do not fail on get_global_id(1) or get_global_id(2) calls; those will just return 0. So there is no difference at all, apart from the hint that the kernel will probably support bigger sizes for the 2nd dimension argument (e.g. 512x128).
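As a small illustration (a made-up OpenCL C kernel, not from the code in the question), the unused dimension simply reads as 0 inside the kernel:

// Hypothetical kernel: with work_dim = 2 and the sizes above,
// get_global_id(0) runs over 0..BLOCK_SIZE-1, get_global_id(1) is
// always 0 (size 1), and get_global_id(2) is always 0 (unused dim).
__kernel void show_ids(__global int *out) {
    size_t gx = get_global_id(0);
    size_t gy = get_global_id(1);   // 0 for this launch
    size_t gz = get_global_id(2);   // 0 because work_dim < 3
    out[gx] = (int)(gx + gy + gz);  // equals gx here
}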

How to accumulate results with SSA in Erlang?

This is a total functional newbie question.
I'm trying to learn some Erlang and have created a (hopefully concurrent) Monte Carlo Simulation where multiple processes are spawned, which report their local results to the parent process via message passing.
So in the parent process I have something like
parent(NumIterations, NumProcs) ->
    random:seed(),
    % spawn NumProcs processes
    lists:foreach(
        fun(_) -> spawn(moduleName, workerFunction,
                        [self(), NumIterations div NumProcs, 0, 0]) end,
        lists:seq(0, NumProcs - 1)),
    % accumulate results
    receive
        {N, M} -> ???; % how to accumulate this into global results?
        _ -> io:format("error")
    end.
Let's say I want to sum up all the Ns and Ms received from the spawned processes.
I understand that accumulating values is usually done via recursion in functional programming, but how do I do that within a receive statement?
You will have to receive the results in a separate process that acts as the "target" for the calculations. Here is a complicated way of doing multiplications that shows the principle:
-module(example).
-export([multiply/2, loop/2]).
multiply(X, Y) ->
    Pid = spawn(example, loop, [0, Y]),
    lists:foreach(fun(_) -> spawn(fun() -> Pid ! X end) end, lists:seq(1, Y)).

loop(Result, 0) ->
    io:format("Result: ~w~n", [Result]);
loop(Result, Count) ->
    receive
        X -> loop(Result + X, Count - 1)
    end.
The multiply function multiplies X and Y by first starting a new process running the loop function and then starting Y processes whose only task is to send X to the loop process.
The loop process will receive the X:s, add them up, and call itself again with the new sum as its state. It will do this Y times and then print the result. This is basically Erlang's server pattern.
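Calling example:multiply(6, 7) from the shell, for instance, spawns seven processes that each send 6 to the accumulator, which eventually prints Result: 42.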

MPI several broadcast at the same time

I have a 2D processor grid (3*3):
P00, P01, P02 are in R0, P10, P11, P12, are in R1, P20, P21, P22 are in R2.
P*0 are on the same computer, and likewise P*1 and P*2.
Now I would like to let R0, R1, R2 call MPI_Bcast at the same time to broadcast from P*0 to P*1 and P*2.
I find that when I use MPI_Bcast this way, it takes three times the time needed to broadcast in only one row.
For example, if I only call MPI_Bcast in R0, it takes 1.00 s.
But if I call three MPI_Bcast in all R[0, 1, 2], it takes 3.00 s in total.
This means the MPI_Bcast calls cannot work in parallel.
Are there any methods to make the MPI_Bcast calls broadcast at the same time?
(ONE node broadcasting over three channels at the same time.)
Thanks.
If I understand your question right, you would like to have simultaneous row-wise broadcasts:
P00 -> P01 & P02
P10 -> P11 & P12
P20 -> P21 & P22
This could be done using subcommunicators, e.g. one that only has processes from row 0 in it, another one that only has processes from row 1 in it and so on. Then you can issue simultaneous broadcasts in each subcommunicator by calling MPI_Bcast with the appropriate communicator argument.
Creating row-wise subcommunicators is extremely easy if you use a Cartesian communicator in the first place. MPI provides the MPI_CART_SUB operation for that. It works like this:
// Create a 3x3 non-periodic Cartesian communicator from MPI_COMM_WORLD
int dims[2] = { 3, 3 };
int periods[2] = { 0, 0 };
MPI_Comm comm_cart;
// We do not want MPI to reorder our processes
// That's why we set reorder = 0
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
// Split the Cartesian communicator row-wise
int remaindims[2] = { 0, 1 };
MPI_Comm comm_row;
MPI_Cart_sub(comm_cart, remaindims, &comm_row);
Now comm_row will contain a handle to a new subcommunicator that will only span the same row that the calling process is in. It now takes only a single call to MPI_Bcast to perform three simultaneous row-wise broadcasts:
MPI_Bcast(&data, data_count, MPI_DATATYPE, 0, comm_row);
This works because comm_row as returned by MPI_Cart_sub will be different in processes located at different rows. 0 here is the rank of the first process in the comm_row subcommunicator, which will correspond to P*0 because of the way the topology was constructed.
If you do not use a Cartesian communicator but operate on MPI_COMM_WORLD instead, you can use MPI_COMM_SPLIT to split the world communicator into three row-wise subcommunicators. MPI_COMM_SPLIT takes a color that is used to group processes into new subcommunicators - processes with the same color end up in the same subcommunicator. In your case the color should equal the number of the row that the calling process is in. The splitting operation also takes a key that is used to order processes in the new subcommunicator. It should equal the number of the column that the calling process is in, e.g.:
// Compute grid coordinates based on the rank
int proc_row = rank / 3;
int proc_col = rank % 3;
MPI_Comm comm_row;
MPI_Comm_split(MPI_COMM_WORLD, proc_row, proc_col, &comm_row);
Once again comm_row will contain the handle of a subcommunicator that only spans the same row as the calling process.
The MPI-3.0 draft includes a non-blocking MPI_Ibcast collective. While the non-blocking collectives aren't officially part of the standard yet, they are already available in MPICH2 and (I think) in OpenMPI.
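A rough sketch of how that could look, assuming (as in the threading suggestion below) that the caller takes part in broadcasts on three different communicators R0, R1 and R2, with buf0..buf2 and count as placeholder names:

MPI_Request reqs[3];

/* Post all three broadcasts without blocking... */
MPI_Ibcast(buf0, count, MPI_DOUBLE, 0, R0, &reqs[0]);
MPI_Ibcast(buf1, count, MPI_DOUBLE, 0, R1, &reqs[1]);
MPI_Ibcast(buf2, count, MPI_DOUBLE, 0, R2, &reqs[2]);

/* ...then wait on all of them, so they can progress concurrently. */
MPI_Waitall(3, reqs, MPI_STATUSES_IGNORE);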
Alternatively, you could start the blocking MPI_Bcast calls from separate threads (I'm assuming R0, R1 and R2 are different communicators).
A third possibility (which may or may not be possible) is to restructure the data so that only one broadcast is needed.
