MPI - broadcast to all processes to printf something

I have declared an int value in my main, and all of the processes have initialized this value. Each of them stores a value that I want to print to the screen after the computation is finished. Is broadcast the solution? If so, how do I implement it?
int i;
int value = 0;                 /* initialised by every process */
int numtasks, myrank, left, right;
MPI_Status status;

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

/* neighbours in the ring */
left  = (myrank - 1); if (left < 0)          left  = numtasks - 1;
right = (myrank + 1); if (right >= numtasks) right = 0;

if (myrank == 0) {
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
}
else if (myrank == (numtasks - 1)) {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
else {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
These should form a logical ring. I do one computation (the sum of all ranks), and process 0 ends up with the result. For 4 processes this result is 6, and I want each process to print it after the computation. But I don't see exactly how and where to use a barrier.
There is one more thing: after all N-1 sends (where N is the number of processes) I should have the sum of all ranks in every process. With my code I only get this sum in process 0... It's probably a bad approach :-(

Some more detail about the structure of your code would help, but it sounds like you can just use MPI_Barrier. Your processes don't need to exchange any data; they just have to wait until everyone has reached the point in your code where you want the printing to happen, which is exactly what Barrier does.
EDIT: In the code you posted, the Barrier would go at the very end (after the if statement), followed by a printf of value.
However, your code will not compute the total sum of all ranks in all processes, since process i only receives the summed ranks of the first i-1 processes. If you want each process to have the total sum at the end, then replacing the Barrier with a Broadcast (rooted at process 0, which is the one that ends up with the full sum) is indeed the best option. (In fact, the entire if statement could be replaced by a single MPI_Reduce() call, and the if statement AND the Broadcast together by a single MPI_Allreduce(), but that wouldn't really help you learn MPI. :) )
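A minimal sketch of what that could look like, assuming the ring code above leaves the total in value on rank 0:

/* After the ring exchange, rank 0 holds the sum: broadcast it to everyone. */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("rank %d: sum of all ranks = %d\n", myrank, value);

/* Or replace the whole ring (and the broadcast) with one collective: */
int contribution = myrank, total;
MPI_Allreduce(&contribution, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);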

Related

Why does my attempt at global synchronization not work?

I tried to use this code, but the kernel exits after executing the loop only once.
If I remove the "while(...)" line, the loop runs, but the results are of course a mess.
If I declare the argument as "volatile __global uint *g_barrier", the PC freezes with a black screen for a while and then the program deadlocks.
__kernel void Some_Kernel(__global uint *g_barrier)
{
    uint i, t;
    for (i = 1; i < MAX; i++) {
        // some useful code here
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_local_id(0) == 0) atomic_add(g_barrier, 1);
        t = i * get_num_groups(0);
        while (*g_barrier < t); // try to sync it all
    }
}
You seem to be expecting all work groups to be scheduled to run in parallel. OpenCL does not guarantee this to happen. Some work groups may not start until some other work groups have entirely completed running the kernel.
Moreover, barriers only synchronise within a work group. Atomic operations on global memory are atomic with regard to other work groups too, but there is no guarantee about order.
If you need other work groups to complete some code before running some other code, you will need to enqueue each of those chunks of work separately on a serial command queue (or appropriately connect them using events on an out-of-order queue). So for your example code, you need to remove your for and while loops, enqueue your kernel MAX-1 times, and pass i as a kernel argument.
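A rough host-side sketch of that restructuring (the queue, kernel, and size names here are assumptions, and error checking is omitted; the point is only that the for loop moves to the host and i becomes a kernel argument):

// Host side: the loop that used to live inside the kernel.
// 'queue' is an in-order cl_command_queue, 'kernel' is Some_Kernel with its
// for/while loops removed, and MAX is the same constant used in the kernel.
for (cl_uint i = 1; i < MAX; i++) {
    clSetKernelArg(kernel, 0, sizeof(cl_uint), &i);   // pass i explicitly
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    // No per-iteration clFinish() needed: an in-order queue already
    // guarantees iteration i completes before iteration i+1 starts.
}
clFinish(queue);   // wait for the last iteration before using the results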
Depending on the capabilities of your device and the size of your data set, your other option is to submit only one large work group, though this is unlikely to give you good performance unless you have a lot of such smaller tasks which are independent from one another.
(I will point out that there is a good chance your question suffers from the XY problem - you have not stated the overall problem your code is trying to solve. So there may be better solutions than the ones I have suggested.)

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
Why are multiple passes used? The most passes I have been able to make the reduction require is two, and the second pass only processes a very small number of elements, so it seems poorly suited to an OpenCL kernel (i.e. wouldn't it be better to stick to a single pass and then process its results on the CPU?).
When I set the 'count' of elements to a very high number (24M and up) and the type to float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n) {
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
    int x = ((__local int*)(s))[(size_t)(i)]; \
    int y = ((__local int*)(s))[(size_t)(j)]; \
    ((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%, but the final step is quite tricky. So, instead of doing everything in one shot and leaving most threads idle, Apple's implementation does a first-pass reduction, then adapts the work items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not be one for C++.
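A hedged host-side sketch of the multi-pass idea (the kernel name, buffer names, and work sizes below are made up; the point is only that each pass shrinks the problem to one partial result per work-group until it is small enough to finish cheaply):

// Each pass: n elements -> 'groups' partial results (one per work-group).
// Repeat on the partial results until the remaining count is tiny.
size_t n = N;                                  // total element count
cl_mem in = input_buf, out = partials_buf;     // assumed pre-created buffers
while (n > 1) {
    size_t local  = 256;                       // assumed work-group size
    size_t groups = (n + 2 * local - 1) / (2 * local);
    size_t global = groups * local;
    cl_uint count = (cl_uint)n;
    clSetKernelArg(reduce_kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(reduce_kernel, 1, sizeof(cl_mem), &out);
    clSetKernelArg(reduce_kernel, 2, sizeof(cl_uint), &count);
    clSetKernelArg(reduce_kernel, 3, local * sizeof(cl_int), NULL);  // local scratch
    clEnqueueNDRangeKernel(queue, reduce_kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
    n = groups;                                // next pass reduces the partial sums
    cl_mem tmp = in; in = out; out = tmp;      // ping-pong the buffers
}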
when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
A float32 has a 23-bit mantissa, so integers are only represented exactly up to 2^24. Values higher than 24M = 1.43 x 2^24 (in float representation) have a rounding error in the range +/-(2^24/2^23)/2 ~= 1.
That means, if you do:
float A = 24000000;
float B = A + 1; // ~1 of error here: 24000001 is not exactly representable
The per-operation error is of the same order as the spacing between representable values, therefore... big errors if you repeat that in a loop!
This often does not show up on a CPU, because 32-bit float math is frequently carried out internally at higher precision, which pushes the problem out to much larger values; get close enough to that larger limit and the same errors appear. But that is not the typical case for normal "counting" integers.
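A tiny C demonstration of just the precision effect (not the reduction itself):

#include <stdio.h>

int main(void) {
    float a = 16777216.0f;   /* 2^24: above this, consecutive integers are no
                                longer representable as 32-bit floats */
    float b = a + 1.0f;      /* rounds back down to 16777216.0f */
    printf("a = %.1f, a + 1 = %.1f, equal: %d\n", a, b, a == b);

    /* Summing 24M ones in a float hits the same wall: with plain scalar
       accumulation the sum stops growing once it reaches 2^24. */
    float sum = 0.0f;
    for (int i = 0; i < 24000000; i++)
        sum += 1.0f;
    printf("sum of 24M ones in float: %.1f\n", sum);
    return 0;
}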
The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

opencl atomic operation doesn't work when total work-items is large

I've been working with OpenCL lately. I created a kernel that basically takes one global variable shared by all the work-items in the kernel. The kernel couldn't be simpler: each work-item increments the value of result, which is the global variable. The code is shown below.
__kernel void accumulate(__global int* result) {
    *result = 0;
    atomic_add(result, 1);
}
Everything goes fine when the total number of work-items is small. On my Mac Pro Retina, the result is correct when the number of work-items is around 400.
However, as I increase the global size to, say, 10000, instead of getting 10000 back from result, the value is around 900, which suggests more than one work-item is accessing the global variable at the same time.
I wonder what the possible solutions for this type of problem are? Thanks for the help!
*result = 0; looks like the problem. For small global sizes, every work item does this and then atomically increments, leaving you with the correct count. However, when the global size becomes larger than the number of work-items that can run at the same time (which means they run in batches), the subsequent batches reset result back to 0. That is why you're not getting the full count. Solution: initialize the buffer from the host side instead and you should be good. Alternatively, to do the initialization on the device, you can initialize it only from global_id == 0, do a barrier, then your atomic increment (keeping in mind that a barrier only synchronizes within a single work-group, so this is only reliable if everything runs in one work-group).
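A hedged sketch of the host-side initialization (the queue and buffer names are assumptions; error checking omitted):

cl_int zero = 0;
/* Write the initial value before enqueueing the kernel... */
clEnqueueWriteBuffer(queue, result_buf, CL_TRUE, 0, sizeof(cl_int),
                     &zero, 0, NULL, NULL);
/* ...or, on OpenCL 1.2+, fill the buffer with a pattern instead:
   clEnqueueFillBuffer(queue, result_buf, &zero, sizeof(zero), 0,
                       sizeof(cl_int), 0, NULL, NULL);               */

/* The kernel then only needs the atomic increment:
   __kernel void accumulate(__global int* result) { atomic_add(result, 1); } */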

How to avoid reading back in OpenCL

I am implementing an algorithm with OpenCL. I loop in C++ many times and call the same OpenCL kernel each time. The kernel generates the input data for the next iteration along with the count of those data items. Currently, I read back this count in each loop iteration for two purposes:
I use this number to decide how many work items I need for next loop; and
I use this number to decide when to exit the loop (when the number is 0).
I found that this read takes most of the time of each loop iteration. Is there any way to avoid it?
Generally speaking, if you need to call a kernel repeatedly, and the exit condition depends on the result generated by the kernel (not on a fixed number of iterations), how can you do it efficiently? Is there anything like the occlusion query in OpenGL, where you can just issue a query instead of reading back from the GPU?
Reading a number back from a GPU kernel will always take tens to thousands of microseconds or more.
If the controlling number is always decreasing, you can keep it in global memory, test it against the global id, and have each work-item decide whether it does any work on each iteration. Use a global memory barrier to sync all the threads ...
kernel void x(global int * the_number, const int max_iterations, ... )
{
    int index = get_global_id(0);
    int count = 0; // stops an infinite loop
    while( index < the_number[0] && count < max_iterations )
    {
        count++;
        // loop code follows
        ....
        // Use one thread to decide what to do next
        if ( index == 0 )
        {
            the_number[0] = ... next value
        }
        barrier( CLK_GLOBAL_MEM_FENCE ); // Barrier to sync threads
    }
}
You have a couple of options here:
If possible, you can simply move the loop and the conditional into the kernel. Use a scheme where additional work items do nothing, depending on the input for the current iteration.
If 1. isn't possible, I would recommend that you store the data generated by the "decision" kernel in a buffer and use that buffer to "direct" your other kernels.
Both these options will allow you to skip the readback.
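A hedged kernel-side sketch of that idea (the names are made up): the count produced by the previous iteration stays in a device buffer, and surplus work-items simply exit.

// 'work_count' is written by the previous kernel/iteration and never read
// back to the host; launch with a global size large enough for the worst case.
kernel void process(global const int *work_count,
                    global const float *input,
                    global float *output)
{
    int gid = get_global_id(0);
    if (gid >= work_count[0])
        return;                 // surplus work-items do nothing this iteration
    // ... real per-item work here, using input[gid] / output[gid] ...
}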
I'm just finishing up some research where we had to tackle this exact problem!
We discovered a couple of things:
Use two (or more) buffers! Have the first iteration of the kernel operate on data in b1, then the next on b2, then on b1 again. In between each kernel call, read back the result of the other buffer and check to see if it's time to stop iterating. This works best when the kernel takes longer than a read; use a profiling tool to make sure you aren't waiting on reads (and if you are, increase the number of buffers).
Overshoot! Add a finishing check to each kernel, and call it several (hundreds of) times before copying data back. If your kernel is low-cost, this can work very well.
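A hedged host-side sketch of the overshoot approach (the queue, kernel, and flag buffer names are assumptions; the kernel is expected to set done_flag itself once its stopping condition is met, and to do nothing on later calls):

cl_int done = 0;
while (!done) {
    // Enqueue a batch of iterations with no readback in between.
    for (int i = 0; i < 200; i++)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    // One blocking read per batch instead of one per iteration.
    clEnqueueReadBuffer(queue, done_flag_buf, CL_TRUE, 0, sizeof(cl_int),
                        &done, 0, NULL, NULL);
}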

MPI - receive multiple int from task 0 (root)

I am solving this problem. I am implementing cyclic mapping: I have 4 processors, so one task is mapped onto processor 1 (the root) and the other three are workers. My input is a set of integers, e.g. 0-40, and I want each worker to receive part of them (in this case 10 integers per worker), do some computation on them, and save the result.
I am using MPI_Send to send the integers from the root, but I don't know how to receive multiple numbers from the same process (the root). Also, I send each int with the buffer size fixed to 1; when there is a number such as 12, it would do bad things. How do I check the length of an int?
Any advice would be appreciated. Thanks.
I'll assume you're working in C++, though your question doesn't say. Anyway, let's look at the arguments of MPI_Send:
MPI_SEND(buf, count, datatype, dest, tag, comm)
The second argument specifies how many data items you want to send. This call basically means "buf points to a point in memory where there are count number of values, all of them of type datatype, one after the other: send them". This lets you send the contents of an entire array, like this:
int values[10];
for (int i = 0; i < 10; i++)
    values[i] = i;
MPI_Send(values, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
This will start reading memory at the start of values, and keep reading until 10 MPI_INTs have been read.
For your case of distributing numbers between processes, this is how you do it with MPI_Send:
int values[40];
for (int i = 0; i < 40; i++)
    values[i] = i;
for (int i = 1; i < 4; i++) // start at rank 1: don't send to ourselves
    MPI_Send(values + 10*i, 10, MPI_INT, i, 0, MPI_COMM_WORLD);
However, this is such a common operation in distributed computing that MPI gives it its very own function, MPI_Scatter. Scatter does exactly what you want: it takes one array and divides it up evenly between all processes who call it. This is a collective communication call, which is a slightly advanced topic, so if you're just learning MPI (which it sounds like you are), then feel free to skip it until you're comfortable using MPI_Send and MPI_Recv.
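For when you do get there, a minimal sketch of the MPI_Scatter version (assuming, as above, that the count divides evenly and that the root keeps a chunk for itself):

int values[40];          // only meaningful on the root
int chunk[10];           // every process receives its own 10 values here
if (myrank == 0)
    for (int i = 0; i < 40; i++)
        values[i] = i;
/* The root scatters 10 ints to each process (including itself). */
MPI_Scatter(values, 10, MPI_INT,   /* send buffer, per-process count, type */
            chunk,  10, MPI_INT,   /* receive buffer, count, type */
            0, MPI_COMM_WORLD);    /* root rank, communicator */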
