Say I have a comm for 64 ranks. How can I create a group in mpi4py consisting of the first x ranks, a second group consisting of the remaining 64-x ranks, and comms for each group?
MPI_Comm_split creates new communicators by splitting a communicator into a group of sub-communicators based on the input values color and key.
All processes that pass in the same value for color are assigned to the same communicator. In your case, the first x processes should pass one value for color and the remaining processes a different one.
key determines the ordering (rank) within each new communicator. The process which passes in the smallest value for key will be rank 0, the next smallest will be rank 1, and so on. If you don't need to change the original order of processes, you can use their rank as the key.
Combining these, here is an example in C:
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int x = 10;
int color = rank < x;
MPI_Comm new_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &new_comm);
Source and further information: http://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/
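Since the question asks about mpi4py, here is a minimal sketch of the same split in Python. It assumes mpi4py is installed and the script is launched under mpiexec; the helper names color_for_rank, split_first_x and demo are made up for illustration. The group object behind each new communicator can be obtained with Get_group().

```python
def color_for_rank(rank, x):
    """Return color 0 for the first x ranks and color 1 for the rest."""
    return 0 if rank < x else 1


def split_first_x(comm, x):
    """Split comm into two communicators: ranks [0, x) and ranks [x, size)."""
    rank = comm.Get_rank()
    # key=rank keeps the original relative order inside each new communicator
    return comm.Split(color=color_for_rank(rank, x), key=rank)


def demo(x=10):
    # Imported here so the helpers above can be used without MPI present
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    sub = split_first_x(world, x)
    group = sub.Get_group()  # the MPI group underlying the new communicator
    print(f"world rank {world.Get_rank()} -> "
          f"sub rank {sub.Get_rank()} of {sub.Get_size()}")
    group.Free()
    sub.Free()
```

To try it, call demo() at the bottom of the script and run something like mpiexec -n 64 python script.py (the script name is a placeholder).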
What is a simple way to create a (sub)communicator containing consecutive ranks [rStart, ..., last rank of MPI_COMM_WORLD] of MPI_COMM_WORLD?
rStart is >= 0, i.e., first rStart ranks need to be excluded.
The simplest code is to have
MPI_Comm_split(MPI_COMM_WORLD, rank < rStart, rank, &new_comm);
run on all ranks of MPI_COMM_WORLD. It will create two communicators - all ranks starting with rStart will get the one you desire, the others can just MPI_Comm_free their communicator.
If you cannot easily have the excluded ranks run the same code, you can use MPI_Comm_create_group, but then you have to also create the group first.
Is there a way to do weighted randomization in System Verilog based on runtime data. Say, I have a queue of integers and a queue of weights (unsigned integers) and wish to select a random integer from the first queue as per the weights in the second queue.
int data[$] = '{10, 20, 30};
uint_t weights[$] = '{100, 200, 300};
Any random construct expects the weights hardcoded as in
constraint range { Var dist { [0:1] := 50 , [2:7] := 50 }; }
But in my case, I need to pick an element from an unknown number of elements.
PS: Assume the number of elements and weights will be the same always.
Unfortunately, the dist constraint only lets you choose from a fixed number of values.
Two approaches I can think of are
Push each data value into a queue using the weight as a repetition count. In your example, you wind up with a queue of 600 values. Randomly pick an index into the queue. The selected element has the distribution you want. An example is posted here.
Create an array of ranges for each weight. For your example the array would be uint_t ranges[][2] = '{'{0,99},'{100,299},'{300,599}}. Then you could do the following in a constraint
index inside {[0:weights.sum()-1]};
foreach (data[ii])
  index inside {[ranges[ii][0]:ranges[ii][1]]} -> value == data[ii];
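The range idea in the second approach can be sketched outside of SystemVerilog to check the arithmetic. The following Python sketch is only an illustration, not SystemVerilog code; the function names are made up. It stores the upper bound of each range as a cumulative sum and maps a uniform random index in [0, sum of weights) back to a data element.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def build_upper_bounds(weights):
    """Cumulative sums of the weights, e.g. [100, 200, 300] -> [100, 300, 600]."""
    return list(accumulate(weights))


def pick_by_index(data, upper_bounds, index):
    """Map an index in [0, total) to the element whose range contains it."""
    return data[bisect_right(upper_bounds, index)]


def weighted_pick(data, weights, rng=random):
    """Pick one element of data with probability proportional to its weight."""
    upper_bounds = build_upper_bounds(weights)
    index = rng.randrange(upper_bounds[-1])  # uniform in [0, sum(weights))
    return pick_by_index(data, upper_bounds, index)
```

With data = [10, 20, 30] and weights = [100, 200, 300], indices 0..99 map to 10, 100..299 map to 20, and 300..599 map to 30, which matches the ranges array above.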
I'm new in OpenCL and I'm trying to understand this piece of code:
size_t global_work1[3] = {BLOCK_SIZE, 1, 1};
size_t local_work1[3] = {BLOCK_SIZE, 1, 1};
err = clEnqueueNDRangeKernel(cmd_queue, diag, 2, NULL, global_work1, local_work1, 0, 0, 0);
So, in the clEnqueueNDRangeKernel call, 2 dimensions are specified for the kernel (the work_dim field). Does this mean that:
in dimension 0 the kernel gets a number of threads equal to BLOCK_SIZE and only one group (I guess the number of groups can be calculated as global_work1[0] / local_work1[0]),
in dimension 1 the kernel gets a number of threads equal to 1 and only one group?
Also, why is a dimension of 2 specified in the enqueue call when global_work1 and local_work1 each have three elements?
You are telling CL:
"Run this kernel, in this queue, with 2D and these global/local sizes"
CL just reads the first 2 elements of each size array and ignores the 3rd one.
As for the difference between 1D and 2D here: there is none. OpenCL kernels launched as 1D do not fail on get_global_id(1) and get_global_id(2) calls; those calls just return 0. So there is no difference at all, apart from the hint that the kernel will probably support bigger sizes for the 2nd dimension argument (e.g. 512x128).
I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear that's partly because a well formed question requires bits of the answer which I don't have.
How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2 then I can
get_global_id(0);
get_global_id(1);
I can divide those two dimensions up into n Work Groups using global_work_size, right? So if I make the global_work_size like so
size_t global_work_size[] = { 4 };
Then each dimension would have 4 work groups for a total of 8? But, as a beginner, I am only using global_id for my indices so only the global id's matter anyway. As you can tell I am pretty confused about all of this so any help you can offer would ...help.
[Image: diagram I made to try to understand this question]
[Image: diagram describing work groups, found on Google]
Since you stated yourself that you are a bit confused about the concepts involved in the execution space, I'll try to summarize them and give some examples before answering your question.
The threads/workitems are organized in an NDRange, which can be viewed as a grid of 1, 2, or 3 dimensions.
The NDRange is mainly used to map each thread to the piece of data it will have to manipulate. Therefore each thread should be uniquely identified, and a thread should know which one it is and where it stands in the NDRange. This is where the work-item built-in functions come in. These functions can be called by all threads to get information about themselves and the NDRange they are in.
The dimensions:
As already stated, an NDRange can have up to 3 dimensions. So if you set the dimensions this way:
size_t global_work_size[2] = { 4, 4 };
It doesn't mean that each dimension would have 4 work groups for a total of 8, but that you'll have 4 * 4 i.e. 16 threads in your NDRange. These threads will be arranged in a "square" with sides of 4 units. The workitems can know how many dimensions the NDRange is made of, using the uint get_work_dim () function.
The global size:
Threads can also query how big the NDRange is in a specific dimension with size_t get_global_size (uint D). Therefore they can know how big the "line/square/rectangle/cube" NDRange is.
The global unique identifiers:
Thanks to that organization, each thread can be uniquely identified with indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to a thread that is in the 3rd column and the second row of a 2D range. The function size_t get_global_id (uint D) is used in the kernel to query the id of the threads.
The workgroup (or local) size:
The NDRange can be split into smaller groups called workgroups. This is the local_work_size you were referring to, which also (logically) has up to 3 dimensions. Note that for OpenCL versions below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. So, to keep your example, since in dimension 0 we have 4 threads, the workgroup size in dimension 0 can be 1, 2, or 4 but not 3. As with the global size, threads can query the local size with size_t get_local_size (uint D).
The local unique identifiers:
Sometimes it is important that a thread can be uniquely identified within a workgroup. Hence the function size_t get_local_id (uint D). Note the "within" in the previous sentence: a thread with a local id (1, 0) will be the only one to have this id in its (2D) workgroup, but there will be as many threads with a local id (1, 0) as there are workgroups in the NDRange.
The number of groups:
Speaking of groups, sometimes a thread might need to know how many groups there are. That's why the function size_t get_num_groups (uint D) exists. Note that again you have to pass the dimension you are interested in as a parameter.
Each group has also an id:
...that you can query within a kernel with the function size_t get_group_id (uint D). Note that the format of the group ids will be similar to those of the threads: tuples of up to 3 elements.
Summary:
To wrap things up a bit, if you have a 2D NDRange of a global work size of (4, 6) and a local work size of (2, 2) it means that:
the global size in the dimension 0 will be 4
the global size in the dimension 1 will be 6
the local size (or workgroup size) in the dimension 0 will be 2
the local size (or workgroup size) in the dimension 1 will be 2
the thread global ids in the dimension 0 will range from 0 to 3
the thread global ids in the dimension 1 will range from 0 to 5
the thread local ids in the dimension 0 will range from 0 to 1
the thread local ids in the dimension 1 will range from 0 to 1
The total number of threads in the NDRange will be 4 * 6 = 24
The total number of threads in a workgroup will be 2 * 2 = 4
The total number of workgroups will be (4/2) * (6/2) = 6
the group ids in the dimension 0 will range from 0 to 1
the group ids in the dimension 1 will range from 0 to 2
there will be only one thread with the global id (0, 0) but there will be 6 threads with the local id (0, 0) because there are 6 groups.
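The numbers in this summary can be checked with a small host-side simulation. This Python sketch is not OpenCL code; it simply enumerates every work-item of the (4, 6) NDRange with a (2, 2) local size and derives the local and group ids the way the built-in functions would report them, assuming a global offset of zero. The name work_items is made up.

```python
from collections import Counter
from itertools import product

global_size = (4, 6)
local_size = (2, 2)


def work_items(global_size, local_size):
    """Yield (global_id, local_id, group_id) for every work-item in the NDRange."""
    for gid in product(*(range(n) for n in global_size)):
        lid = tuple(g % l for g, l in zip(gid, local_size))
        grp = tuple(g // l for g, l in zip(gid, local_size))
        yield gid, lid, grp


items = list(work_items(global_size, local_size))
num_groups = len({grp for _, _, grp in items})
local_id_counts = Counter(lid for _, lid, _ in items)

# total work-items, number of workgroups, work-items with local id (0, 0)
print(len(items), num_groups, local_id_counts[(0, 0)])  # prints: 24 6 6
```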
Example:
Here is a dummy example that uses all these concepts together (note that performance would be terrible; it's just a toy example).
Let's say you have a 2D array of int with 6 rows and 4 columns. You want to group these elements into squares of 2 by 2 elements and sum them up, so that for instance the elements (0, 0), (0, 1), (1, 0), (1, 1) will be in one group (hope it's clear enough). Because you'll have 6 "squares" you'll have 6 results for the sums, so you'll need an array of 6 elements to store these results.
To solve this, you use the 2D NDRange detailed just above. Each thread fetches one element from global memory and stores it in local memory. Then, after a synchronization, only one thread per workgroup, let's say the thread with local id (0, 0), sums the elements up (in local memory) and stores the result at a specific place in a 6-element array (in global memory).
//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
    //use vectors for conciseness
    int2 globalId    = (int2)(get_global_id(0),   get_global_id(1));
    int2 localId     = (int2)(get_local_id(0),    get_local_id(1));
    int2 groupId     = (int2)(get_group_id(0),    get_group_id(1));
    int2 globalSize  = (int2)(get_global_size(0), get_global_size(1));
    int2 localSize   = (int2)(get_local_size(0),  get_local_size(1));
    int2 numberOfGrp = (int2)(get_num_groups(0),  get_num_groups(1));
    //Read from global and store to local
    temp[localId.x + localId.y * localSize.x] = in[globalId.x + globalId.y * globalSize.x];
    //Sync: every thread of the workgroup must have written its element
    barrier(CLK_LOCAL_MEM_FENCE);
    //Only the threads with local id (0, 0) sum elements up
    if(localId.x == 0 && localId.y == 0){
        int sum = 0;
        for(int i = 0; i < localSize.x * localSize.y; i++){
            sum += temp[i];
        }
        //store result in global
        result[groupId.x + numberOfGrp.x * groupId.y] = sum;
    }
}
And finally, to answer your question: do global_work_size and local_work_size have any effect on application logic?
Usually yes, because they are part of the way you design your algorithm. Note that the size of the workgroup is not chosen arbitrarily but matches my need (here a 2 by 2 square).
Note also that if you decide to use an NDRange of 1 dimension with a size of 24 and a local size of 4 in 1 dimension, it'll screw things up too, because the kernel was designed to use 2 dimensions.
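To check what such a kernel should produce, a host-side reference can be handy. The following Python sketch is a plain sequential reference (not OpenCL) under the same assumptions: a row-major array of 6 rows and 4 columns, 2 by 2 blocks, and results indexed as group x + number of groups in x * group y. The name block_sums is made up.

```python
def block_sums(flat, width, height, block_w, block_h):
    """Sum each block_w x block_h tile of a row-major width x height array."""
    groups_x = width // block_w
    groups_y = height // block_h
    result = [0] * (groups_x * groups_y)
    for y in range(height):
        for x in range(width):
            # Same indexing as the kernel: groupId.x + numberOfGrp.x * groupId.y
            group = (x // block_w) + groups_x * (y // block_h)
            result[group] += flat[x + y * width]
    return result


data = list(range(24))  # 6 rows x 4 columns, values 0..23
print(block_sums(data, width=4, height=6, block_w=2, block_h=2))
# prints: [10, 18, 42, 50, 74, 82]
```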
I have a 2D processor grid (3*3):
P00, P01, P02 are in R0, P10, P11, P12, are in R1, P20, P21, P22 are in R2.
P*0 are on the same computer. The same goes for P*1 and P*2.
Now I would like to let R0, R1, R2 call MPI_Bcast at the same time to broadcast from P*0 to P*1 and P*2.
I find that when I use MPI_Bcast, it takes three times the time I need to broadcast in only one row.
For example, if I only call MPI_Bcast in R0, it takes 1.00 s.
But if I call three MPI_Bcast in all R[0, 1, 2], it takes 3.00 s in total.
It means the MPI_Bcast calls do not run in parallel.
Is there any way to make the MPI_Bcast calls broadcast at the same time?
(ONE node broadcast with three channels at the same time.)
Thanks.
If I understand your question right, you would like to have simultaneous row-wise broadcasts:
P00 -> P01 & P02
P10 -> P11 & P12
P20 -> P21 & P22
This could be done using subcommunicators, e.g. one that only has processes from row 0 in it, another one that only has processes from row 1 in it and so on. Then you can issue simultaneous broadcasts in each subcommunicator by calling MPI_Bcast with the appropriate communicator argument.
Creating row-wise subcommunicators is extremely easy if you use a Cartesian communicator in the first place. MPI provides the MPI_CART_SUB operation for that. It works like this:
// Create a 3x3 non-periodic Cartesian communicator from MPI_COMM_WORLD
int dims[2] = { 3, 3 };
int periods[2] = { 0, 0 };
MPI_Comm comm_cart;
// We do not want MPI to reorder our processes
// That's why we set reorder = 0
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
// Split the Cartesian communicator row-wise
int remaindims[2] = { 0, 1 };
MPI_Comm comm_row;
MPI_Cart_sub(comm_cart, remaindims, &comm_row);
Now comm_row will contain a handle to a new subcommunicator that only spans the row that the calling process is in. It only takes a single call to MPI_Bcast now to perform three simultaneous row-wise broadcasts:
MPI_Bcast(&data, data_count, MPI_DATATYPE, 0, comm_row);
This works because comm_row as returned by MPI_Cart_sub will be different in processes located at different rows. 0 here is the rank of the first process in comm_row subcommunicator which will correspond to P*0 because of the way the topology was constructed.
If you do not use a Cartesian communicator but operate on MPI_COMM_WORLD instead, you can use MPI_COMM_SPLIT to split the world communicator into three row-wise subcommunicators. MPI_COMM_SPLIT takes a color that is used to group processes into new subcommunicators: processes with the same color end up in the same subcommunicator. In your case color should equal the number of the row that the calling process is in. The splitting operation also takes a key that is used to order processes in the new subcommunicator. It should equal the number of the column that the calling process is in, e.g.:
// Compute grid coordinates based on the rank
int proc_row = rank / 3;
int proc_col = rank % 3;
MPI_Comm comm_row;
MPI_Comm_split(MPI_COMM_WORLD, proc_row, proc_col, &comm_row);
Once again comm_row will contain the handle of a subcommunicator that only spans the same row as the calling process.
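For readers using mpi4py, the same row-wise split can be sketched as follows. This assumes mpi4py is installed, the job runs on exactly 9 ranks, and ranks are laid out row by row; the helper names grid_coords, make_row_comm and demo are made up for illustration.

```python
def grid_coords(rank, ncols=3):
    """Return (row, col) of a rank in a row-major process grid."""
    return rank // ncols, rank % ncols


def make_row_comm(comm, ncols=3):
    """Split comm so that each new communicator spans one grid row."""
    row, col = grid_coords(comm.Get_rank(), ncols)
    # color = row groups the processes, key = col orders them inside the row
    return comm.Split(color=row, key=col)


def demo():
    # Imported here so the helpers above can be used without MPI present
    from mpi4py import MPI

    comm_row = make_row_comm(MPI.COMM_WORLD)
    # Rank 0 of each row communicator is P*0; each row broadcasts independently
    payload = MPI.COMM_WORLD.Get_rank() if comm_row.Get_rank() == 0 else None
    payload = comm_row.bcast(payload, root=0)
    print(f"world rank {MPI.COMM_WORLD.Get_rank()} received {payload}")
    comm_row.Free()
```

To try it, call demo() at the bottom of the script and launch it with 9 processes under mpiexec.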
MPI-3.0 includes a non-blocking MPI_Ibcast collective. Non-blocking collectives became part of the standard with MPI-3.0 and are available in current MPICH and Open MPI releases.
Alternatively, you could start the blocking MPI_Bcast calls from separate threads (I'm assuming R0, R1 and R2 are different communicators). This requires MPI to be initialized with the MPI_THREAD_MULTIPLE thread support level.
A third possibility (which may or may not be possible) is to restructure the data so that only one broadcast is needed.