MPI: Broadcasting by k-ary Replication

I have to implement a broadcasting algorithm that uses k-ary replication.
void mpi_mbroadcast(int fromp, int multi, int degree, char * from, char * to, int nbytes);
1) fromp is the source processor that holds the message to be broadcast.
2) multi is the length of the data structure that is being copied (the structure to be broadcast is an array of multi entries, the size of each entry being nbytes long).
3) degree is the degree of replication. Each processor in each superstep would send the received message to at most degree - 1 other processors.
4) from is the base address of the first byte that will be broadcast.
5) to is the base address of the first position that will receive the broadcast message.
6) nbytes is the size in bytes of the elementary data type (array element) that is being copied.
I have implemented a program that calculates partial sums and sends the sums to Processor 0. Now I have to implement a program that broadcasts the sums by k-ary replication.
Link: MPI Summation
How do I use the function above to send and receive data? (I think it is an all-to-all broadcast.)
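One possible reading of the degree parameter can be sketched without any MPI calls: in superstep s, every processor that already holds the message forwards it to at most degree - 1 others, so the number of holders grows by a factor of degree per superstep. The helper names kary_supersteps and kary_targets are hypothetical, and the rank numbering relative to fromp is an assumption, not part of the assignment.

```c
#include <stddef.h>

/* Number of supersteps until all nprocs processors hold the message.
   Returns -1 for invalid input (degree must be at least 2). */
int kary_supersteps(int nprocs, int degree)
{
    if (nprocs < 1 || degree < 2)
        return -1;
    int steps = 0;
    long holders = 1;               /* only the source holds the message */
    while (holders < nprocs) {
        holders *= degree;          /* each holder reaches degree-1 new ranks */
        steps++;
    }
    return steps;
}

/* Fills targets[] with the ranks that `rank` sends to in superstep `step`
   (0-based) and returns how many there are (at most degree-1).
   Ranks are renumbered relative to the source fromp (an assumption). */
int kary_targets(int rank, int fromp, int nprocs, int degree, int step,
                 int *targets)
{
    int rel = (rank - fromp + nprocs) % nprocs;   /* source becomes 0 */
    long stride = 1;
    for (int s = 0; s < step; s++)
        stride *= degree;                         /* stride = degree^step */
    if (rel >= stride)
        return 0;                                 /* no message yet, nothing to send */
    int n = 0;
    for (int j = 1; j < degree; j++) {
        long dest = rel + (long)j * stride;
        if (dest >= nprocs)
            break;
        targets[n++] = (int)((dest + fromp) % nprocs);
    }
    return n;
}
```

In an actual mpi_mbroadcast, each target returned here would presumably become one MPI_Send of multi * nbytes bytes, with a matching MPI_Recv on the target in the same superstep.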

Related

Code never runs for arrays larger than 8000 entries with errors

I just started building a code for parallel computation with OpenCL.
As far as I understand, the data generated on the CPU side (host) is transferred through buffers (clCreateBuffer -> clEnqueueWriteBuffer -> clSetKernelArg) and then processed by the device.
I mainly have to deal with arrays (or matrices) of large size with double precision.
However, I found that the code fails with errors for any array larger than 8000 entries.
(This makes sense because 64 KB is equivalent to 8000 double-precision numbers.)
The error codes were either -6 (CL_OUT_OF_HOST_MEMORY) or -31 (CL_INVALID_VALUE).
One more thing: when I set the argument to a 2-dimensional array, I could set the size up to 8000 x 8000.
So far, I guess the maximum data size for double precision is 8000 entries (64 KB) for 1D arrays, but I have no idea what happens for 2D or 3D arrays.
Is there any other way to transfer data larger than 64 KB?
If I did something wrong for OpenCL setup in data transfer, what would be recommended?

I appreciate your kind answer.
The hardware that I'm using is Tesla V100 which is installed on the HPC cluster in my school.
The following is a part of my code snippet that I'm testing the data transfer.
bfr(0) = clCreateBuffer(context,
& CL_MEM_READ_WRITE + CL_MEM_COPY_HOST_PTR,
& sizeof(a), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err=clEnqueueWriteBuffer(queue,bfr(0),CL_TRUE,0_8,
& sizeof(a),c_loc(a),0,C_NULL_PTR,C_NULL_PTR)
err = clSetKernelArg(kernel, 0,
& sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0)then
print *, "clSetKernelArg kernel"
print*, err
stop
endif
The code was built in Fortran using the clfortran module.
Thank you again for your answer.

You can do much larger arrays in OpenCL, as large as the available memory allows. For example, I'm commonly working with linearized 4D arrays of 2 billion floats in a CFD application.
Arrays need to be 1D only; if you have 2D or 3D arrays, linearize them, for example with n=x+y*size_x for 2D->1D coordinates. Some older devices only allow arrays 1/4 the size of the device memory. However modern devices typically have an extension to the OpenCL specification to enable larger buffers.
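The linearization formula above can be sketched in C as follows. The function names are hypothetical, and the 3D variant is an assumed extension of the same scheme. Indices use size_t so that large arrays do not overflow a 32-bit integer.

```c
#include <stddef.h>

/* 2D -> 1D: n = x + y*size_x, as in the formula above. */
size_t linear_index_2d(size_t x, size_t y, size_t size_x)
{
    return x + y * size_x;
}

/* 3D -> 1D: the same idea with one more dimension (assumed layout:
   x varies fastest, then y, then z). */
size_t linear_index_3d(size_t x, size_t y, size_t z,
                       size_t size_x, size_t size_y)
{
    return x + (y + z * size_y) * size_x;
}
```

For example, element (x=3, y=2) of a grid with size_x = 10 lands at linear index 23.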
Here is a quick overview of what the OpenCL C bindings do:
clCreateBuffer allocates memory on the device side (video memory for GPUs, RAM for CPUs). Buffers can be as large as host/device memory allows or on some older devices 1/4 of device memory.
clEnqueueWriteBuffer copies memory over PCIe from RAM to video memory. Both on CPU and GPU side buffers must be allocated beforehand. There is no limit on transfer size; it can be as large as the entire buffer or only a subrange of a buffer.
clSetKernelArg links the GPU buffers to the input parameters of the kernel, so it knows which kernel parameter corresponds to which buffer. Make sure the data types of the buffers and kernel arguments match, as you won't get an error if they don't. Also make sure the order of kernel arguments matches.
In your case there are several possible causes for the error:
1) Maybe you have an integer overflow when computing the array size. In this case, use 64-bit integers to compute the array size and indices.
2) You are out of memory because other buffers already take up too much memory. Do some bookkeeping to keep track of total (video) memory utilization.
3) You have selected the wrong device, for example integrated graphics instead of the dedicated GPU, in which case much less memory is available and you end up with cause 2.
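The overflow cause can be ruled out by computing the byte count in size_t and checking the multiplication. This is a minimal sketch; buffer_bytes is a hypothetical helper, not an OpenCL function.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns n_elements * elem_size in bytes,
   or 0 if the product would not fit in size_t. */
size_t buffer_bytes(size_t n_elements, size_t elem_size)
{
    if (elem_size != 0 && n_elements > SIZE_MAX / elem_size)
        return 0;                   /* would overflow */
    return n_elements * elem_size;
}
```

With 8000 doubles this gives 64000 bytes. For 2 billion floats the result is 8000000000 bytes, which no longer fits in a 32-bit integer but is fine in a 64-bit size_t.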
To give you a more definitive answer, please provide some additional details:
What hardware do you use?
Show a code snippet of how you allocate device memory.
UPDATE
I see some errors in your code:
The length argument in clCreateBuffer and clEnqueueWriteBuffer requires the number of bytes that your array a has. If a is of type double, then this is a_length*sizeof(double), where a_length is the number of elements in the array a. sizeof(double) returns the number of bytes for one double number which is 8. So the length argument is 8 bytes times the number of elements in the array.
For multiple flags, you typically use bitwise or | instead of +. Shouldn't be an issue here, but is unconventional.
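You had "0_8" as the buffer offset. In Fortran this is a zero literal of integer kind 8, so the offset is already zero; just make sure its kind matches what the binding expects for the offset argument.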
const int a_length = 8000;
double* a = new double[a_length];
bfr(0) = clCreateBuffer(context, CL_MEM_READ_WRITE|CL_MEM_COPY_HOST_PTR, a_length*sizeof(double), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err = clEnqueueWriteBuffer(queue, bfr(0), CL_TRUE, 0, a_length*sizeof(double), c_loc(a), 0, C_NULL_PTR, C_NULL_PTR);
err = clSetKernelArg(kernel, 0, sizeof(bfr(0)), C_LOC(bfr(0)));
print*, err;
if(err.ne.0) {
print *, "clSetKernelArg kernel"
print*, err
stop
}

What's the purpose of passing MPI_Gather a recvtype not equal to the sendtype?

Consider the MPICH docs for the function MPI_Gather quoted below. It takes the arguments sendtype and recvtype.
When does it make sense to not pass the same type, e.g. MPI_FLOAT or MPI_DOUBLE, for both?
I ask because it seems useless to have to pass the same argument twice, so MPI probably has a reason for accepting both a receive- and a sendtype.
MPI_Gather
Gathers together values from a group of processes
Synopsis
int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype,
int root, MPI_Comm comm)
Input Parameters
sendbuf
starting address of send buffer (choice)
sendcount
number of elements in send buffer (integer)
sendtype
data type of send buffer elements (handle)
recvcount
number of elements for any single receive (integer, significant only at root)
recvtype
data type of recv buffer elements (significant only at root) (handle)
root
rank of receiving process (integer)
comm
communicator (handle)
Output Parameters
recvbuf
address of receive buffer (choice, significant only at root)

MPI only requires matching signatures.
For example you could send 10 MPI_INT and receive 1 derived datatype which is a vector of 10 MPI_INT.

For completeness' sake, I'm also quoting Gilles' answer.
MPI only requires matching signatures.
For example you could send 10 MPI_INT and receive 1 derived datatype
which is a vector of 10 MPI_INT.
Gilles Gouaillardet's answer contains the very helpful remark about a "derived datatype" which led me to this explanation by Victor Eijkhout at the Texas Advanced Computing Center.
A vector type describes a series of blocks, all of equal size, spaced
with a constant stride. [...] The vector datatype gives the first
non-trivial illustration that datatypes can be different on the sender
and receiver. If the sender sends b blocks of length l each, the
receiver can receive them as b*l contiguous elements, either as a
contiguous datatype, or as a contiguous buffer of an elementary type;
see figure . In this case, the receiver has no knowledge of the
stride of the datatype on the sender.
[...] In this example a vector type is created only on the sender, in
order to send a strided subset of an array; the receiver receives the
data as a contiguous block.
As an example of this datatype, consider the example of transposing a
matrix, for instance to convert between C and Fortran arrays [...]
Suppose that a processor has a matrix stored in C, row-major, layout,
and it needs to send a column to another processor. If the matrix is
declared as int M,N; double mat[M][N], then a column has M blocks
of one element, spaced N locations apart. In other words:
MPI_Datatype MPI_column;
MPI_Type_vector(
/* count= */ M, /* blocklength= */ 1, /* stride= */ N,
MPI_DOUBLE, &MPI_column );
Sending the first column is easy:
MPI_Send( mat, 1,MPI_column, ... );
The second column is just a little trickier: you now need to pick out
elements with the same stride, but starting at mat[0][1].
MPI_Send( &(mat[0][1]), 1,MPI_column, ... );
You can make this marginally more efficient (and harder to read) by
replacing the index expression by mat+1.
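To see which elements a vector type with a given count, blocklength and stride selects, the same traversal can be written as a plain C loop with no MPI calls. This is only an illustration of the layout; pack_vector is a hypothetical helper, and MPI does this packing internally.

```c
#include <stddef.h>

/* Copies `count` blocks of `blocklength` doubles each, whose starts are
   `stride` elements apart in src, into the contiguous buffer dst.
   Returns the number of elements copied (count * blocklength). */
size_t pack_vector(const double *src, size_t count, size_t blocklength,
                   size_t stride, double *dst)
{
    size_t n = 0;
    for (size_t b = 0; b < count; b++)
        for (size_t i = 0; i < blocklength; i++)
            dst[n++] = src[b * stride + i];
    return n;
}
```

For a row-major M x N matrix stored in a flat array, pack_vector(mat + 1, M, 1, N, col) collects the second column, which is what the receiver sees as M contiguous doubles.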

Serial point to point protocol but with 8 bytes instead of 16

I was looking at the answers in Simple serial point-to-point communication protocol, and they don't help me enough with my issue. I am trying to communicate data between a computer and an 8-bit microcontroller at first; eventually I want that one microcontroller to communicate with about 40 others via wireless radio modules. Basically, one is designated as the master and the rest are slaves.
speed is an issue
The issue at hand is speed, because every packet needs to be communicated at least 4x a second back and forth between the master and each slave.
Let's assume baud rate for data is 9600bps. That's 960 bytes a second.
If I used 16-byte packets then: 40 (slaves) times 16 (bytes) times 2 (ways) = 1280. Divide that by 960 and one round takes about 1.3 seconds. Not good.
If I used 8-byte packets then: 40 (slaves) times 8 (bytes) times 2 (ways) = 640. Divide that by 960 and one round takes about 2/3 of a second. It's so-so.
But I need to watch my baud rate, because too high a baud rate might mean missed data at larger distances. Still, you can see the speed difference between an 8-byte and a 16-byte packet.
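The timing estimate can be sketched as a small calculation. It assumes 8N1 framing (10 bits on the wire per byte, which is where 960 bytes per second at 9600 bps comes from) and ignores radio turnaround and processing time. The function names are hypothetical.

```c
/* Bits on the wire per payload byte with 8N1 framing:
   1 start bit + 8 data bits + 1 stop bit (an assumption). */
#define BITS_PER_BYTE_8N1 10

/* Seconds for one full round: master to every slave and back. */
double round_seconds(int slaves, int packet_bytes, long baud)
{
    double bytes_per_round = (double)slaves * packet_bytes * 2;
    double bytes_per_second = (double)baud / BITS_PER_BYTE_8N1;
    return bytes_per_round / bytes_per_second;
}

/* Minimum baud rate needed for the given number of rounds per second. */
long required_baud(int slaves, int packet_bytes, int rounds_per_second)
{
    return (long)slaves * packet_bytes * 2 * rounds_per_second
           * BITS_PER_BYTE_8N1;
}
```

Under these assumptions, 4 rounds per second with 40 slaves needs 25600 bps for 8-byte packets and 51200 bps for 16-byte packets.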
packet format idea
In my design, I may have a need to transmit a number in the low millions so that will use 24-bits which fits in my idea. But here's my initial idea:
Byte 1: Recipient address 0-255
Byte 2: Sender address 0-255
Byte 3: Command
Byte 4-6: Data
Byte 7-8: 16-bit fletcher checksum of above data
I don't mind if the above format is adjusted, just as long as I have at least 6 bits to identify the sender and receiver (since I'll only deal with 40 units), and the data with command included should be at least 4 bytes total.
How should I modify my data packet idea so that even the device that just turned on in the middle of reception can be in sync with the next set of data? Is there a way without stripping a bit from each data byte?

Rely on the checksum! My packet would consist of:
Recipient's address (0..40) XORed with 0x55
Sender's address (0..40) XORed with 0xAA
Command Byte
Data Byte 0
Data Byte 1
Data Byte 2
CRC8 sum, as suggested by Vroomfondel
Every receiver should keep a sliding window of the last seven received bytes. Whenever a byte is shifted in, the window should be checked for validity:
Are the two addresses in the valid range?
Is it a valid command?
Is the CRC correct?
The last check in particular should safely reject packets that the receiver joined out of sync.
If you have less than 32 command codes, you may go down to six bytes per packet: 40[Senders] times 40[Receivers] times 32[Commands] evaluates to 51200, which would fit into 16 bits instead of 24.
Don't forget to turn off the parity bit!
Update 2017-12-09: Here is a receiving function:
typedef uint8_t U8;
void ByteReceived(U8 Byte)
{
static U8 Buf[7]; //Bytes received so far
static U8 BufBC=0;
Buf[BufBC++] = Byte;
if (BufBC<7) return; //Msg incomplete
/*** Seven Byte Message received ***/
//Check Addresses
U8 Adr;
Adr = Buf[0] ^ 0x55; if (Adr >= 40) goto Fail;
Adr = Buf[1] ^ 0xAA; if (Adr >= 40) goto Fail;
if (Buf[2] > ???) goto Fail; //Check Cmd
if (CalcCRC8(Buf, 6) != Buf[6]) goto Fail;
Evaluate(...);
BufBC=0; //empty Buf[]
return;
Fail:
//Seven Byte Msg invalid -> chop off first byte, could use memmove()
Buf[0] = Buf[1];
Buf[1] = Buf[2];
Buf[2] = Buf[3];
Buf[3] = Buf[4];
Buf[4] = Buf[5];
Buf[5] = Buf[6];
BufBC = 6;
}
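The receiver above calls CalcCRC8 without defining it. A minimal sketch of such a function and of the matching sender side follows. The polynomial 0x07 with initial value 0 is an assumption (the answer does not say which CRC-8 variant is meant), and calc_crc8 and build_packet are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0,
   no reflection. The choice of polynomial is an assumption. */
uint8_t calc_crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80)
                crc = (uint8_t)((crc << 1) ^ 0x07);
            else
                crc = (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Builds the seven-byte packet described above: XORed addresses,
   command, three data bytes, then the CRC over the first six bytes. */
void build_packet(uint8_t out[7], uint8_t to, uint8_t from, uint8_t cmd,
                  const uint8_t data[3])
{
    out[0] = to ^ 0x55;
    out[1] = from ^ 0xAA;
    out[2] = cmd;
    out[3] = data[0];
    out[4] = data[1];
    out[5] = data[2];
    out[6] = calc_crc8(out, 6);
}
```

Sender and receiver must of course use the same CRC routine. A table-driven version would be faster on an 8-bit microcontroller, at the cost of 256 bytes of flash.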

How to calculate a 256-modulo checksum on arduino

I am writing a computer program which utilizes input from some equipment which I seldom have available in my office. In order to develop and test this program I am trying to use an Arduino board to simulate the communication from the actual equipment. To this end I create data packets on the Arduino and send them to my computer over the serial port. The packets are formatted as a header and a hexadecimal integer, representing some sensor data.
The header is supposed to contain a checksum (2's complement 256-modulo). I am however not sure how to calculate it. In the datasheet of the equipment (whose communication I am trying to simulate), it is stated that I should first compute the sum of all bytes in the packet, and then take the 256-modulo of the sum and perform an 8-bit two's complement on the result.
However, as I am a newbie to bits, bytes and serial communication, I do not understand the following:
1) Let's say that I want to send the value 5500 as two bytes (high byte and low byte). Then the high byte is '15' and the low byte is '7c' in hexadecimal encoding, which corresponds to 21 and 124 in decimal values. Do I then add the contributions 21 and 124 to the checksum before taking the 256-modulo, or do I have to do some bit-magic beforehand?
2) How do I perform a two's complement?
Here is some code which should illustrate how I think. The idea is to send a packet with a header containing a byte which states the length of the packet, a byte which states the type of the packet, and a byte for the checksum. Then a two-byte integer value representing some sensor value is divided into a high byte and a low byte, and transmitted low byte first.
int intVal;
byte Len = 5;
byte checksum;
byte Type = 2;
byte intValHi;
byte intValLo;
void setup(){
Serial.begin(9600);
}
void loop(){
intVal = 5500; //assume that this is a sensor value
intValHi = highByte(intVal);
intValLo = lowByte(intVal);
//how to calculate the checksum? I unsuccessfully tried the following
checksum = 0;
checksum = (Len+checksum+Type+intValHi+intValLo) % 256;
//send header
Serial.write(Len);
Serial.write(checksum);
Serial.write(Type);
//send sensor data
Serial.write(intValLo);
Serial.write(intValHi);
}
Thanks!

The first thing you should understand is that mod 256 is the same thing as looking at the bottom log2(256) = 8 bits.
To understand this you have to first realize what the 'mod' operation does and how digits are represented in hardware.
Mod is the remainder after an old-school division (ie only with whole numbers).
eg 5%2 = 1
Digits in hardware are stored in 'bits' which can be interpreted as base 2 mathematics.
Thus if you want to take the mod operation of a power of 2 you don't actually have to do any math.
This is just like taking the remainder modulo a power of 10: you just take the lower digits.
ie. 157 % 100 = 57.
This can be sped up by using the fact that bytes should overflow by themselves. This means that all you have to do to take %256 of a bunch of numbers is to add them to a single byte and the arduino will take care of the rest.
For two's complement, see this question:
What is “2's Complement”?
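Putting the two steps together, a minimal sketch of the checksum could look like this. It assumes the checksum byte itself is left out of the sum (or counted as zero), which the datasheet would have to confirm; checksum_256 is a hypothetical name.

```c
#include <stdint.h>
#include <stddef.h>

/* Sums all bytes in an 8-bit accumulator (which wraps modulo 256 on its
   own) and returns the 8-bit two's complement of that sum. */
uint8_t checksum_256(const uint8_t *bytes, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + bytes[i]);    /* overflow = mod 256 */
    return (uint8_t)(~sum + 1u);            /* invert all bits, add 1 */
}
```

For the packet in the question (Len = 5, Type = 2, low byte 0x7C, high byte 0x15) the sum is 152 and the checksum is 104 (0x68). A receiver can verify a packet by adding all bytes including the checksum and checking that the low 8 bits are zero.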

How to determine MPI rank/process number local to a socket/node

Say, I run a parallel program using MPI. Execution command
mpirun -n 8 -npernode 2 <prg>
launches 8 processes in total, that is, 2 processes per node on 4 nodes (OpenMPI 1.5). A node comprises 1 CPU (dual core), and the network interconnect between nodes is InfiniBand.
Now, the rank number (or process number) can be determined with
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
This returns a number between 0 and 7.
But how can I determine the node number (in this case a number between 0 and 3) and the process number within a node (a number between 0 and 1)?

I believe you can achieve that with MPI-3 in this manner:
MPI_Comm shmcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
MPI_INFO_NULL, &shmcomm);
int shmrank;
MPI_Comm_rank(shmcomm, &shmrank);

It depends on the MPI implementation - and there is no standard for this particular problem.
Open MPI has some environment variables that can help. OMPI_COMM_WORLD_LOCAL_RANK will give you the local rank within a node - ie. this is the process number which you are looking for. A call to getenv will therefore answer your problem - but this is not portable to other MPI implementations.
See this for the (short) list of variables in OpenMPI.
I don't know of a corresponding "node number".

This exact problem is discussed on Markus Wittmann's Blog, MPI Node-Local Rank determination.
There, three strategies are suggested:
A naive, portable solution employs MPI_Get_processor_name or gethostname to create an unique identifier for the node and performs an MPI_Alltoall on it. [...]
[Method 2] relies on MPI_Comm_split, which provides an easy way to split a communicator into subgroups (sub-communicators). [...]
Shared memory can be utilized, if available. [...]
For some working code (presumably LGPL licensed?), Wittmann links to MpiNodeRank.cpp from the APSM library.
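The core of the hostname-based strategy can be sketched without MPI calls. It assumes every rank already has the list of all processor names indexed by rank, for example collected from MPI_Get_processor_name with a collective such as MPI_Allgather. The function name and the node numbering (in order of first appearance) are assumptions, and this is not Wittmann's code.

```c
#include <string.h>

/* names[r] is the processor name of rank r, for r = 0..rank at least.
   *local = number of lower ranks on the same host (node-local rank).
   *node  = number of distinct hosts that first appear before this host. */
void node_and_local_rank(const char *const *names, int rank,
                         int *node, int *local)
{
    int first = rank;               /* lowest rank on the same host */
    *local = 0;
    for (int r = 0; r < rank; r++) {
        if (strcmp(names[r], names[rank]) == 0) {
            if (*local == 0)
                first = r;
            (*local)++;
        }
    }
    *node = 0;
    for (int r = 0; r < first; r++) {
        int seen_before = 0;
        for (int q = 0; q < r; q++) {
            if (strcmp(names[q], names[r]) == 0) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before)
            (*node)++;              /* r is the first rank on a new host */
    }
}
```

The quadratic search is fine for a few hundred ranks. For larger jobs, sorting the names or using MPI_Comm_split_type is likely a better fit.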

Alternatively you can use
int MPI_Get_processor_name( char *name, int *resultlen )
to retrieve the node name, then map it to a non-negative integer and use that as the color in
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
This is not as simple as MPI_Comm_split_type; however, it offers a bit more freedom to split your communicator the way you want.
