I am solving this problem: I am implementing cyclic mapping with 4 processors, so one task is mapped on processor 1 (the root) and the other three are workers. The input is a range of integers, e.g. 0-40. I want each worker to receive its share (in this case 10 integers per worker), do some computation on it, and save the result.
I am using MPI_Send to send the integers from the root, but I don't know how to receive several numbers in a row from the same process (the root). Also, I currently send each int with the count fixed at 1; when the value is e.g. 12, I'm afraid something bad will happen. How do I check the size of an int?
Any advice would be appreciated. Thanks
I'll assume you're working in C++, though your question doesn't say. Anyway, let's look at the arguments of MPI_Send:
MPI_SEND(buf, count, datatype, dest, tag, comm)
The second argument specifies how many data items you want to send. This call basically means: "buf points to a location in memory where count values, all of type datatype, are stored one after the other: send them". This lets you send the contents of an entire array, like this:
int values[10];
for (int i = 0; i < 10; i++)
    values[i] = i;
MPI_Send(values, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
This will start reading memory at the start of values, and keep reading until 10 MPI_INTs (i.e. 10 C ints) have been read.
For your case of distributing numbers between processes, this is how you do it with MPI_Send:
int values[40];
for (int i = 0; i < 40; i++)
    values[i] = i;
for (int i = 1; i < 4; i++)   // start at rank 1: don't send to ourselves
    MPI_Send(values + 10*i, 10, MPI_INT, i, 0, MPI_COMM_WORLD);
However, this is such a common operation in distributed computing that MPI gives it its very own function, MPI_Scatter. Scatter does exactly what you want: it takes one array and divides it up evenly between all processes who call it. This is a collective communication call, which is a slightly advanced topic, so if you're just learning MPI (which it sounds like you are), then feel free to skip it until you're comfortable using MPI_Send and MPI_Recv.
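For reference, a minimal sketch of the same distribution using MPI_Scatter (illustrative only; it assumes rank holds the result of MPI_Comm_rank, and it also gives the root its own block of 10):
int values[40];   /* only filled in on the root */
int mine[10];     /* each rank's share, including the root's */
if (rank == 0)
    for (int i = 0; i < 40; i++)
        values[i] = i;
MPI_Scatter(values, 10, MPI_INT,    /* send 10 ints to each rank, from root 0 */
            mine,   10, MPI_INT,    /* receive 10 ints on every rank          */
            0, MPI_COMM_WORLD);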
I have seen many tutorials about configuring work dimensions, in which the number of work items is conveniently easy to divide into 3 dimensions. I have a large number of work items, say 164052. What is the best way to configure an arbitrary number of work items? Since the number of work items in my program might vary, I need a way to calculate the configuration automatically.
What should I do when the number is prime, say 7879?
First off, by default you should only be using 1 dimension for your kernels. Some tasks require 2 or 3 dimensions (generally, image processing), but unless you are expressly working on one of those tasks, it probably doesn't benefit you to try to divide stuff up among multiple dimensions, since the benefits are largely about code organization, not about performance.
So that leaves the question of how to divide up work items among local groups. Given a task size of N work items, you have a few options for dividing them up into local groups.
The simplest solution is to simply specify N work items, and let the driver decide for you how to divide those work items among the groups.
size_t work_items = 164052;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &work_items, nullptr, 0, nullptr, nullptr);
If you're programming for a specific environment where you know in advance the ideal number of local work items (often 32 or 64 for NVidia/AMD architectures), you might get better performance by forcing your work item count to align to a multiple of that number.
size_t work_items = 164052;
size_t LOCAL_SIZE = 64;
// round up to the next multiple of LOCAL_SIZE (skip if already a multiple)
if (work_items % LOCAL_SIZE != 0)
    work_items += LOCAL_SIZE - (work_items % LOCAL_SIZE);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &work_items, &LOCAL_SIZE, 0, nullptr, nullptr);
Note, however, that this requires that you add a check to your kernel code to prevent processing on work items that don't actually exist, or that you pad your buffers to include space for the dummy items.
kernel void main(..., int N) {
if(get_global_id(0) >= N) return;
...
}
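The padding alternative on the host side might look roughly like this (a sketch; ctx, err and a float output buffer are assumptions, not part of the original code):
// allocate room for the rounded-up global size so the extra "dummy" work items
// write into valid, simply ignored, memory
size_t padded_items = work_items;   // already rounded up to a multiple of LOCAL_SIZE
cl_mem out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                            padded_items * sizeof(cl_float), NULL, &err);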
Using MPI and C, I'm looking to distribute (scatter and gather) a 2D array of complex double values (i.e. every element in the 2D array is of type complex double, so it has a creal and a cimag component). If I use a regular declaration of a 2D array of size n-by-n:
double complex grid[n][n];
Everything works just fine for small sizes, BUT my program fails depending on the size of n, giving a "segmentation fault" error. Anything above, say, 256 will immediately produce a "segmentation fault". This is the problem I'm having and failing miserably to figure out.
After browsing through numerous similar issues, I'm guessing my problem is that I'm overloading the stack memory (something I honestly don't fully understand), meaning that I need to dynamically allocate my 2D arrays using malloc or calloc.
However, in my understanding, allocating a 2D array that you can index like grid[n][n] won't work, since the allocated memory is not necessarily contiguous, meaning that MPI_Scatter fails.
double complex **alloc_2d_complex(int rows, int cols) {
    /* one contiguous block for the data, plus an array of row pointers into it */
    double complex *data = (double complex *) malloc(rows * cols * sizeof(double complex));
    double complex **array = (double complex **) malloc(rows * sizeof(double complex *));
    for (int i = 0; i < rows; i++)
        array[i] = &(data[cols * i]);
    return array;
}
int main(int argc, char *argv[]) {
    double complex **grid;
    grid = alloc_2d_complex(n, n);
    /* Continue to initialize MPI and attempt Scatter... */
}
I've tried initializing a 2D array by this method, and Scatter does fail for me, giving "memcpy argument memory ranges overlap" errors, since something in memory apparently doesn't line up right.
This means I must allocate everything in 1D arrays in row-major order, like:
grid[y][x] ==> grid[y*n + x]
I'm really, really trying to avoid this because I'm dealing with numerous transposed and untransposed matrices (which is hard enough to keep track of in [y][x] logic) and it's going to make things difficult to keep track of for my purpose, but fine, if it's what I have to do then let's get it over with. But this ALSO doesn't work with MPI_Scatter, giving me once again "memcpy" errors, which I am utterly dumbfounded by. Below is an example of how I'm trying to do everything using 1D arrays. Since I'm getting the same error for this and the 2D allocated array, maybe the 2D allocation will work and I'm just missing something here. I'm only using a number of processors, numProcs, that can evenly divide n.
int n = 128;
double complex *grid = malloc(n*n*sizeof(complex double));
/* ... Initialize MPI ... */
stepSize = (int) n/numProcs;
double complex *gridChunk = malloc(stepSize*n*sizeof(complex double));
/* ... Initialize grid[y*n+x] Values... */
MPI_Scatter(&grid, n*stepSize, MPI_C_DOUBLE_COMPLEX,
&gridChunk, n*stepSize, MPI_C_DOUBLE_COMPLEX,
0, MPI_COMM_WORLD);
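For reference, MPI_Scatter is normally called with the data pointers themselves (grid and gridChunk) rather than the addresses of the pointer variables (&grid and &gridChunk). A minimal sketch reusing the names from the snippet above (illustrative only, not a verified fix for the error described):
/* grid points to n*n elements on the root; gridChunk to stepSize*n elements on every rank */
MPI_Scatter(grid,      n*stepSize, MPI_C_DOUBLE_COMPLEX,
            gridChunk, n*stepSize, MPI_C_DOUBLE_COMPLEX,
            0, MPI_COMM_WORLD);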
This is the moment I feel I need something like MPI_Neighbor_allreduce, but I know it doesn't exist.
Foreword
Given a 3D MPI cartesian topology describing how a 3D physical domain is distributed among processes, I wrote a function probe that asks for a scalar value (which is supposed to be put in a simple REAL :: val) given the 3 coordinates of a point inside the domain.
There can only be 1, 2, 4, or 8 process(es) that are actually involved in the computation of val.
1 if the point is internal to a process subdomain (and it has no neighbors involved),
2 if the point is on a face between 2 processes' subdomains (and each of them has 1 neighbor involved),
4 if the point is on a side between 4 processes' subdomains (and each of them has 2 neighbors involved),
8 if the point is a vertex between 8 processes' subdomains (and each of them has 3 neighbors involved).
After the call to probe as it is now, each process holds val, which is some value for involved processes, 0 or NaN (I decide by (de)commenting the proper lines) for not-involved processes. And each process knows if it is involved or not (through a LOGICAL :: found variable), but does not know if it is the only one involved, nor who are the involved neighbors if it is not.
In the case of 1 involved process, the single value held by that single process is enough, and the process can write it, use it, or do whatever is needed.
In the latter three cases, the sum of the different scalar values of the processes involved must be computed (and divided by the number of neighbors +1, i.e. self included).
The question
What is the best strategy to accomplish this communication and computation?
What solutions I'm thinking about
I'm thinking about the following possibilities.
Every process executes val = 0 before the call to probe, and then MPI_(ALL)REDUCE can be used (the involved processes participating with val /= 0 in general, all others with val == 0). But this would mean that if val is requested at several points, those points would be treated serially, even if the set of process(es) involved for each of them does not overlap with the other sets (a rough sketch of this option follows the list below).
Every process calls MPI_Neighbor_allgather to share found among neighboring processes, so that each involved process knows which of its 6 neighbors participate(s) in the sum, and then performs individual MPI_SENDs and MPI_RECVs to communicate val. But this would still involve every process (even though each communicates only with its 6 neighbors).
Maybe the best choice is that each process defines a communicator made up of itself plus its 6 neighbors and then uses a reduction restricted to that communicator.
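(For illustration only, here is a rough sketch of the first possibility above, written in C with assumed names comm, an integer flag found, and own_contribution; it is not part of the original question, whose code is Fortran.)
/* possibility 1: one global reduction, not-involved processes contribute zero */
double val = found ? own_contribution : 0.0;
double sum;
int n_involved;
MPI_Allreduce(&val,   &sum,        1, MPI_DOUBLE, MPI_SUM, comm);
MPI_Allreduce(&found, &n_involved, 1, MPI_INT,    MPI_SUM, comm);  /* how many processes were involved */
if (n_involved > 0) sum = sum / n_involved;                        /* average over the involved processes */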
EDIT
Regarding the risk of deadlock mentioned by @JorgeBellón, I initially solved it by calling MPI_SEND before MPI_RECV for communications in the positive direction, i.e. those corresponding to even indices in who_is_involved, and vice versa in the negative direction. As a special case, this could not deal with a periodic direction having only two processes along it (since each of the two would see the other as a neighbor in both the positive and negative directions, so both processes would call MPI_SEND and MPI_RECV in the same order, causing a deadlock); the solution to this special case was the following ad-hoc edit to who_is_involved (which I called found_neigh in my code):
DO id = 1, ndims
    IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id - 1 + mycoords(id)) = .FALSE.
END DO
As a reference for the readers, the solution that I implemented so far (a solution I'm not so satisfied with) is the following.
found = ... ! .TRUE. or .FALSE. depending whether the process is/isn't involved in computation of val
IF ( found) val = ... ! compute own contribution
IF (.NOT. found) val = NaN
! share found among neighbors
found_neigh(:) = .FALSE.
CALL MPI_NEIGHBOR_ALLGATHER(found, 1, MPI_LOGICAL, found_neigh, 1, MPI_LOGICAL, procs_grid, ierr)
found_neigh = found_neigh .AND. found
! modify found_neigh to deal with special case of TWO processes along PERIODIC direction
DO id = 1, ndims
    IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id - 1 + mycoords(id)) = .FALSE.
END DO
! exchange contribution with neighbors
val_neigh(:) = NaN
IF (found) THEN
    DO id = 1, ndims
        IF (found_neigh(2*id)) THEN
            CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idp(id), 999, MPI_COMM_WORLD, ierr)
            CALL MPI_RECV(val_neigh(2*id), 1, MPI_DOUBLE_PRECISION, idp(id), 666, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        END IF
        IF (found_neigh(2*id - 1)) THEN
            CALL MPI_RECV(val_neigh(2*id - 1), 1, MPI_DOUBLE_PRECISION, idm(id), 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idm(id), 666, MPI_COMM_WORLD, ierr)
        END IF
    END DO
END IF
! combine own contribution with others
val = somefunc(val, val_neigh)
As you said, MPI_Neighbor_allreduce does not exist.
You can create derived communicators that only include your adjacent processes and then perform a regular MPI_Allreduce on them. Each process can have up to 7 communicators in a 3D grid.
One communicator in which the process itself is placed at the center of the stencil.
One communicator for each of the adjacent processes (in which that neighbor is the center).
This can be a quite expensive process, but that does not mean it cannot be worthwhile (HPLinpack makes extensive use of derived communicators, for example).
If you already have a cartesian topology, a good approach is to use MPI_Neighbor_allgather. This way you will not only know how many neighbors are involved but also who they are.
int found; // logical: either 1 or 0
int num_neighbors; // how many neighbors i got
int who_is_involved[num_neighbors]; // unknown, to be received
MPI_Neighbor_allgather( &found, ..., who_is_involved, ..., comm );
int actually_involved = 0;
int r = 0;
MPI_Request reqs[2*num_neighbors];
for( int i = 0; i < num_neighbors; i++ ) {
if( who_is_involved[i] != 0 ) {
actually_involved++;
MPI_Isend( &val, ..., reqs[r++]);
MPI_Irecv( &val, ..., reqs[r++]);
}
}
MPI_Waitall( r, reqs, MPI_STATUSES_IGNORE );
Note that I'm using non-blocking point-to-point routines. This is important in most cases because MPI_Send may wait for the receiver to call MPI_Recv; unconditionally calling MPI_Send and then MPI_Recv in all processes may cause a deadlock (see MPI 3.1 standard, Section 3.4).
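Filling in the ellipses, a complete version of that exchange might look like the following sketch (not the code above verbatim: it assumes a 3D cartesian communicator cart_comm and an integer flag found, and it receives each neighbor's value into a separate val_neigh slot; slot 2*d is the negative-side neighbor in dimension d and 2*d+1 the positive side, matching the neighbor ordering used by the cartesian neighbor collectives):
#include <mpi.h>

void exchange_with_involved_neighbors(MPI_Comm cart_comm, int ndims,
                                      int found, double val,
                                      double *val_neigh /* size 2*ndims */)
{
    int num_neighbors = 2 * ndims;       /* assumes ndims <= 3 below */
    int nbr[6];                          /* neighbor ranks, possibly MPI_PROC_NULL */
    int who_is_involved[6] = {0};
    MPI_Request reqs[12];
    int r = 0;

    for (int d = 0; d < ndims; d++)
        MPI_Cart_shift(cart_comm, d, 1, &nbr[2*d], &nbr[2*d + 1]);

    /* every process reports its own flag to its 2*ndims neighbors */
    MPI_Neighbor_allgather(&found, 1, MPI_INT,
                           who_is_involved, 1, MPI_INT, cart_comm);

    for (int i = 0; i < num_neighbors; i++) {
        if (found && who_is_involved[i]) {
            MPI_Isend(&val, 1, MPI_DOUBLE, nbr[i], 0, cart_comm, &reqs[r++]);
            MPI_Irecv(&val_neigh[i], 1, MPI_DOUBLE, nbr[i], 0, cart_comm, &reqs[r++]);
        }
    }
    MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);
}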
Another possibility is to send both the real value and the found flag in a single communication, so that the number of transfers is reduced. Since all processes are involved in the MPI_Neighbor_allgather anyway, you could use it to get everything done (for a small increase in the amount of data transferred, it really pays off).
INTEGER :: neighbor, num_neighbors, found
REAL :: val
REAL :: sendbuf(2)
REAL :: recvbuf(2,num_neighbors)
sendbuf(1) = found
sendbuf(2) = val
CALL MPI_Neighbor_allgather( sendbuf, 1, MPI_2REAL, recvbuf, 1, MPI_2REAL, ...)  ! recvcount is the count per neighbor
DO neighbor = 1, num_neighbors
    IF (recvbuf(1,neighbor) == 1.0) THEN
        ! use neighbor's val, placed in recvbuf(2,neighbor)
    END IF
END DO
I'm just getting started with OpenCL, so I'm sure there are a dozen things I can do to improve this code, but one thing in particular stands out to me: if I sum columns rather than rows (basically contiguous versus strided, because all buffers are linear) in a 2D array of data, I get run times that differ by a factor of anywhere from 2 to 10x. Strangely, the contiguous accesses appear slower.
I'm using PyOpenCL to test.
Here's the two kernels of interest (reduce and reduce2), and another that's generating the data feeding them (forcesCL):
kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2, float k, global float4 *frcs)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int idx = i + get_global_size(0)*j;
    float3 c = chrgs[i].xyz - chrgs2[j].xyz;
    float l = length(c);
    frcs[idx].xyz = (l == 0 ? 0
                     : ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
    frcs[idx].w = 0;
}
kernel void reduce(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    for (ulong i = 0; i < k; i++)
        tmp += frcs[gi + i*gs].xyz;   // stride gs per iteration: adjacent work-items read adjacent elements
    result[gi].xyz = tmp;
}
kernel void reduce2(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    for (ulong i = 0; i < k; i++)
        tmp += frcs[gi*gs + i].xyz;   // contiguous per work-item: adjacent work-items read gs elements apart
    result[gi].xyz = tmp;
}
It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Lorenz force between two charges where the position of each is encoded in the xyz component of a float4, and the charge in the w component. The physics isn't important, it's just a toy to play with OpenCL.
I won't go through the PyOpenCL setup unless asked, other than to show the build step:
program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
I then setup of the arrays with random positions and the elementary charge:
a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)
Setup scratch space and allocations for return values:
c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
Then I try to run this in each of two ways-- once using reduce(), and once reduce2(). The only difference should be in which axis I'm summing:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint32(b.shape[0:1]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)
Notice I've swapped the arguments to forcesCL so I can check the results against the first method:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint32(a.shape[0:1]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)
The version using the reduce() kernel gives me times on the order of 140ms, the version using the reduce2() kernel gives me times on the order of 360ms. The values returned are the same, save a sign change because they're being subtracted in the opposite order.
If I do the forcesCL step once, and run the two reduce kernels, the difference is much more pronounced-- on the order of 30ms versus 250ms.
I wasn't expecting any difference, but if I were I would have expected the contiguous accesses to perform better, not worse.
Can anyone give me some insight into what's happening here?
Thanks!
This is a clear example of memory coalescing. It is not about how the index is used (rows or columns), but about how the memory is accessed by the hardware. It is just a matter of following, step by step, how the real accesses are performed and in which order.
Let's analyze it properly:
Suppose the work-items are divided into local blocks of size N.
For the first case:
WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...
Since each of these WIs runs in parallel, their memory accesses occur at the same time. So the memory controller is asked for:
First iteration: 0, 1, 2, 3 ... N-1 -> grouped into a few memory accesses
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1 -> grouped into a few memory accesses
...
In that case, in each iteration the memory controller groups all N WI requests into big 256-bit reads/writes to global memory. There is no need to cache anything, since the data can be discarded after processing.
For the second case:
WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)
...
The memory controller is asked for:
First iteration: 0, Gs, 2Gs, 3Gs -> Scattered read, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 ->Scattered read, no grouping
...
In this case, the memory controller cannot work in its efficient mode. It would still be fine if the cache memory were infinite, but it is not. It can cache some of the reads, because the memory requested by different work-items sometimes overlaps (due to the size-k for loop), but not all of them.
When you reduce the value of k, you reduce the amount of cache reuse that is possible, and this leads to even bigger differences between the column and row access modes.
I have declared int value in my main, and all of the processes have initialized this value. All of them store value, which I want to write to the screen after the computation is finished. Is broadcast a solution? And if so, how do I implement it?
int numtasks, myrank, left, right;
int value = 0;                       /* all processes start from the same initial value */
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
left = myrank - 1;  if (left < 0) left = numtasks - 1;
right = myrank + 1; if (right >= numtasks) right = 0;
if (myrank == 0) {
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
}
else if (myrank == numtasks - 1) {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
else {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
These should form a logical ring. I do one computation (the sum of all ranks), and in process 0 I get the result. I want this result (for 4 processes it will be 6) to be printed by each of the processes after the computation. But I don't see exactly how and where to use a barrier.
There is also one more thing: after all N-1 sends (where N is the number of processes) I should have the sum of all ranks in each of the processes. In my code I get this sum only into process 0... It might be a bad approach :-(
Some more detail about the structure of your code would help, but it sounds like you can just use MPI_Barrier. Your processes don't need to exchange any data; they just have to wait until everyone has reached the point in your code where you want the printing to happen, which is exactly what Barrier does.
EDIT: In the code you posted, the Barrier would go at the very end (after the if statement), followed by printf(value).
However, your code will not compute the total sum of all ranks in all nodes, since process i only receives the summed ranks of the first i-1 processes. If you want each process to have the total sum at the end, then replacing Barrier with Broadcast is indeed the best option. (In fact, the entire if statement AND the Broadcast could be replaced by a single MPI_Allreduce() call, but that wouldn't really help you learn MPI. :) )
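For concreteness, a rough sketch of both options against the code in the question (variable names follow the question; illustrative only):
/* Option 1: keep the ring, then broadcast rank 0's total and print everywhere. */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* placed after the if/else chain */
printf("rank %d: sum of ranks = %d\n", myrank, value);

/* Option 2: replace the whole ring with a single collective call. */
int contribution = myrank, sum;
MPI_Allreduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);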