I would like to send an array of strings from the master to a slave thread using the Message Passing Interface (MPI).
i.e. String [] str = new String [10]
str[0]= "XXX" ... etc
How can I do that without sending each element of the array separately as a sequence of characters?
I managed to send an array of integers in a single send operation, but I don't know how to do the same with an array of strings.
I don't know Java, but I'll give you the C answer. The concepts, particularly the two approaches one might take to solve this, are the same in any language, though.
Imagine if this were a simple c-string (some characters terminated with '\0'). There are two approaches:
over-provision memory and receive up to some limit,
or send a message indicating how much data to expect.
Do you have a maximum length (e.g. PATH_MAX or something like that)? If you can afford to over-allocate on the receiving side, you could do
MPI_Send(str, strlen(str) + 1, MPI_CHAR, slave_rank, slave_tag, MPI_COMM_WORLD); /* +1 sends the terminating NUL too */
and you'd pair that with
MPI_Recv(str, MAX_LENGTH, MPI_CHAR, master_rank, slave_tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
If you don't like having slop at the end, you'll have to do it in two messages:
len = (int)strlen(str) + 1; /* +1 for the terminating NUL byte */
MPI_Send(&len, 1, MPI_INT, slave_rank, slave_tag, MPI_COMM_WORLD);
MPI_Send(str, len, MPI_CHAR, slave_rank, slave_tag, MPI_COMM_WORLD);
and you'd match that with
MPI_Recv(&len, 1, MPI_INT, master_rank, slave_tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
payload = malloc(len);
MPI_Recv(payload, len, MPI_CHAR, master_rank, slave_tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
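For an array of strings, the over-provisioning approach could be sketched roughly as below. This is only an illustration under the assumption that every string is shorter than some fixed width; flatten_strings and flat_entry are hypothetical helper names, not part of MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n strings into one contiguous block of
 * n * width chars. Each slot is zero-padded, so every entry stays
 * NUL-terminated. Returns NULL if allocation fails or a string does
 * not fit. The caller frees the result. */
char *flatten_strings(const char *const *strs, size_t n, size_t width)
{
    assert(strs != NULL && width > 0);
    char *flat = calloc(n, width);
    if (flat == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(strs[i]);
        if (len >= width) {          /* no room for the terminator */
            free(flat);
            return NULL;
        }
        memcpy(flat + i * width, strs[i], len);
    }
    return flat;
}

/* Hypothetical helper: entry i of a flattened block. */
const char *flat_entry(const char *flat, size_t i, size_t width)
{
    return flat + i * width;
}
```

The flattened block can then go out in a single MPI_Send with a count of n * width and MPI_CHAR as the datatype, at the cost of sending the padding as well.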
Sending arrays of strings, especially of varying sizes, is quite an involved process. There are several options, but the most MPI-friendly one is to use the packing and unpacking facilities of MPI, exposed in mpiJava as Comm.Pack, Comm.Unpack, and Comm.Pack_size.
You could do something of the sort:
byte[][] bytes = new byte[nStr][];
int[] lengths = new int[nStr];
int bufLen = MPI.COMM_WORLD.Pack_size(1, MPI.INT);
bufLen += MPI.COMM_WORLD.Pack_size(nStr, MPI.INT);
for (int i = 0; i < nStr; i++) {
    bytes[i] = str[i].getBytes(Charset.forName("UTF-8"));
    lengths[i] = bytes[i].length;
    bufLen += MPI.COMM_WORLD.Pack_size(lengths[i], MPI.BYTE);
}
byte[] buf = new byte[bufLen];
int position = 0;
int nStrArray[] = new int[1];
nStrArray[0] = nStr;
position = MPI.COMM_WORLD.Pack(nStrArray, 0, 1, MPI.INT,
buf, position);
position = MPI.COMM_WORLD.Pack(lengths, 0, nStr, MPI.INT,
buf, position);
for (int i = 0; i < nStr; i++) {
    position = MPI.COMM_WORLD.Pack(bytes[i], 0, lengths[i], MPI.BYTE,
                                   buf, position);
}
MPI.COMM_WORLD.Send(buf, 0, bufLen, MPI.PACKED, rank, 0);
Having string lengths in an auxiliary array and packing it at the beginning of the message simplifies the receiver logic.
The receiver code below assumes that the sender is rank 0:
Status status = MPI.COMM_WORLD.Probe(0, 0);
int bufLen = status.Get_count(MPI.PACKED);
byte[] buf = new byte[bufLen];
MPI.COMM_WORLD.Recv(buf, 0, bufLen, MPI.PACKED, status.source, status.tag);
int position = 0;
int nStrArray[] = new int[1];
position = MPI.COMM_WORLD.Unpack(buf, position,
nStrArray, 0, 1, MPI.INT);
int nStr = nStrArray[0];
int lengths[] = new int[nStr];
position = MPI.COMM_WORLD.Unpack(buf, position,
lengths, 0, nStr, MPI.INT);
String[] str = new String[nStr];
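for (int i = 0; i < nStr; i++) {
    byte[] bytes = new byte[lengths[i]];
    position = MPI.COMM_WORLD.Unpack(buf, position,
                                     bytes, 0, lengths[i], MPI.BYTE);
    // Same charset as on the sending side; this overload throws no checked exception
    str[i] = new String(bytes, Charset.forName("UTF-8"));
}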
Disclaimer: I don't have MPJ Express installed and my Java knowledge is very limited. The code is based on the mpiJava specification, the MPJ Express JavaDocs, and some examples found on the Internet.
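The message layout used above (string count, then all lengths, then the raw bytes) does not depend on Java. As a rough illustration, here is the same layout done by hand in plain C, without MPI_Pack. pack_strings and unpack_strings are hypothetical names, and the sketch assumes sender and receiver share the same int size and byte order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: serialize n strings as
 * [n][len_0 .. len_{n-1}][bytes_0 .. bytes_{n-1}].
 * Stores the buffer size in *out_size; the caller frees the buffer. */
unsigned char *pack_strings(const char *const *strs, int n, size_t *out_size)
{
    assert(strs != NULL && n >= 0 && out_size != NULL);
    size_t header = (size_t)(n + 1) * sizeof(int);
    size_t total = header;
    for (int i = 0; i < n; i++)
        total += strlen(strs[i]);

    unsigned char *buf = malloc(total);
    if (buf == NULL)
        return NULL;

    memcpy(buf, &n, sizeof(int));
    size_t pos = header;
    for (int i = 0; i < n; i++) {
        int len = (int)strlen(strs[i]);
        memcpy(buf + (size_t)(i + 1) * sizeof(int), &len, sizeof(int));
        memcpy(buf + pos, strs[i], (size_t)len);
        pos += (size_t)len;
    }
    *out_size = total;
    return buf;
}

/* Hypothetical helper: rebuild the strings from a packed buffer.
 * Stores the count in *out_n; the caller frees each string and the array. */
char **unpack_strings(const unsigned char *buf, int *out_n)
{
    assert(buf != NULL && out_n != NULL);
    int n;
    memcpy(&n, buf, sizeof(int));
    char **strs = malloc((size_t)(n > 0 ? n : 1) * sizeof(char *));
    if (strs == NULL)
        return NULL;
    size_t pos = (size_t)(n + 1) * sizeof(int);
    for (int i = 0; i < n; i++) {
        int len;
        memcpy(&len, buf + (size_t)(i + 1) * sizeof(int), sizeof(int));
        strs[i] = malloc((size_t)len + 1);
        if (strs[i] == NULL) {       /* undo the partial result */
            while (i > 0)
                free(strs[--i]);
            free(strs);
            return NULL;
        }
        memcpy(strs[i], buf + pos, (size_t)len);
        strs[i][len] = '\0';
        pos += (size_t)len;
    }
    *out_n = n;
    return strs;
}
```

Because the lengths sit at the front, the receiver knows every string's size before it touches the character data, which is the same simplification the Java version relies on.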
I am currently trying to implement an OpenCL kernel. The kernel is supposed to output a count of previously calculated elements, divided by the total number of elements and remapped to a value from 0 to 255.
The kernel runs in a single work group with 256 work items where LX is the local ID:
#define LX get_local_id(0)
kernel void reduceStatistic(global int *inout, int nr_workgroups, int nr_pixels)
{
    int i = 1;
    for (; i < nr_workgroups; i++)
        inout[LX] += inout[LX + i * 256];
    inout[LX] = (int)floor(((float)inout[LX] / (float)nr_pixels) * 256.0f);
}
The calculation before the remapping operation is for clean up after a previous calculation on the same buffer.
The first item of inout[LX] after the cleanup is 17176, the nr_pixels is 160000 so this should result in a value of 27 using the calculation above. The code, however, returns 6.
The relevant host-side code is as follows:
// nr_workgroups is of type int
cl_mem outputBuffer = clCreateBuffer(mgr->context, CL_MEM_READ_WRITE, nr_workgroups * 256 * sizeof(cl_int), NULL, NULL);
// another kernel writes into outputBuffer
// set kernel arguments
clSetKernelArg(mgr->reduceStatisticKernel, 0, sizeof(outputBuffer), &outputBuffer);
clSetKernelArg(mgr->reduceStatisticKernel, 1, sizeof(cl_int), &nr_workgroups);
clSetKernelArg(mgr->reduceStatisticKernel, 2, sizeof(cl_int), &imgSeqSize);
size_t global_work_size_statistics[1] = { 256 };
size_t local_work_size_statistics[1] = { 256 };
// run the kernel
clEnqueueNDRangeKernel(mgr->commandQueue, mgr->reduceStatisticKernel, 1, NULL, global_work_size_statistics, local_work_size_statistics, 0, NULL, NULL);
// read result
cl_int *reducedResult = new cl_int[256];
clEnqueueReadBuffer(mgr->commandQueue, outputBuffer, CL_TRUE, 0, 256 * sizeof(cl_int), reducedResult, 0, NULL, NULL);
Help much appreciated! (:
We established in the comments that the global buffer index calculation is wrong:
inout[LX] += inout[LX + i * 265];
The multiplier should be 256, not 265.
Going out of range on a buffer leads to undefined behaviour, so this is always one of the prime culprits to look for.
I have a kernel function that only writes a number to a __global int* c.
To be specific it looks like this:
__kernel void Add1(__global int* c)
{
    *c = 3;
}
In the host code I allocate memory for the c value:
cl_mem bufferC[deviceNumber]; // deviceNumber = 8
for(int i = 0; i< deviceNumber; i++){
    bufferC[i] = clCreateBuffer(context[i], CL_MEM_WRITE_ONLY, sizeof(cl_int) * global_size, NULL, &error);
}
for(int i = 0; i< deviceNumber; i++){
    error = clSetKernelArg(kernel[i], 0, sizeof(cl_mem), (void*)&bufferC[i]);
}
for(int i = 0; i< deviceNumber; i++){
    error = clEnqueueReadBuffer(commandQueue[i], bufferC[i], CL_TRUE, 0, sizeof(cl_int) * global_size, &c[i], 0, NULL, NULL);
}
and I print it like:
for (size_t i = 0; i < deviceNumber; ++i)
std::cout<< "delta = " << c[i] << std::endl;
and output:
delta = 3
delta = 11165
delta = -1329524360
delta = 11165
delta = 0
delta = 0
delta = -1329520352
delta = 11165
So the first value is fine and the rest is garbage. Do you know what mistake I made?
This is only part of the code, of course, but I think I pasted all the lines that touch the 'c' value. The global size is set to 1.
Well, my mistake was that I created several contexts but passed a single device as the argument instead of an array of devices. I found it by printing the error codes in the program, so try that if you run into problems! Cheers
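Printing the error codes can be as simple as the sketch below. The helper names cl_status_name and cl_check are made up for illustration, and the numeric values are the ones I know from the Khronos CL/cl.h header; in a real program you would include that header and use the CL_* constants rather than the literals.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: names for a few OpenCL status codes. The numbers
 * mirror the Khronos CL/cl.h header; prefer the CL_* constants from that
 * header when it is available. */
const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -30: return "CL_INVALID_VALUE";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    default:  return "unrecognized status";
    }
}

/* Hypothetical helper: report a failed call on stderr.
 * Returns 1 on success and 0 on failure. */
int cl_check(int status, const char *what)
{
    assert(what != NULL);
    if (status == 0)
        return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_status_name(status), status);
    return 0;
}
```

Calling cl_check(error, "clCreateBuffer") after each API call, inside the per-device loops, would show which device index first reports a failure.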
The thing I am still not certain about is what happens with the root process in MPI_Scatter / MPI_Scatterv.
If I divide an array as I try to in my code, do I need to include the root process in the number of receivers (hence making sendcounts of size nproc), or is it excluded?
In my example code for matrix multiplication, one of the processes still misbehaves and terminates the program prematurely:
void readMatrix();
double StartTime;
int rank, nproc, proc;
//double matrix_A[N_ROWS][N_COLS];
double **matrix_A;
//double matrix_B[N_ROWS][N_COLS];
double **matrix_B;
//double matrix_C[N_ROWS][N_COLS];
double **matrix_C;
int low_bound = 0; //low bound of the number of rows of each process
int upper_bound = 0; //upper bound of the number of rows of [A] of each process
int portion = 0; //portion of the number of rows of [A] of each process
int main (int argc, char *argv[]) {
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
matrix_A = (double **)malloc(N_ROWS * sizeof(double*));
for(int i = 0; i < N_ROWS; i++) matrix_A[i] = (double *)malloc(N_COLS * sizeof(double));
matrix_B = (double **)malloc(N_ROWS * sizeof(double*));
for(int i = 0; i < N_ROWS; i++) matrix_B[i] = (double *)malloc(N_COLS * sizeof(double));
matrix_C = (double **)malloc(N_ROWS * sizeof(double*));
for(int i = 0; i < N_ROWS; i++) matrix_C[i] = (double *)malloc(N_COLS * sizeof(double));
int *counts = new int[nproc](); // array to hold number of items to be sent to each process
// -------------------> If we have more than one process, we can distribute the work through scatterv
if (nproc > 1) {
// -------------------> Process 0 initializes matrices and scatters the portions of the [A] Matrix
if (rank==0) {
StartTime = MPI_Wtime();
int counter = 0;
for (int proc = 0; proc < nproc; proc++) {
counts[proc] = N_ROWS / nproc ;
counter += N_ROWS / nproc ;
counter = N_ROWS - counter;
counts[nproc-1] = counter;
//set bounds for each process
low_bound = rank*(N_ROWS/nproc);
portion = counts[rank];
upper_bound = low_bound + portion;
printf("I am process %i and my lower bound is %i and my portion is %i and my upper bound is %i \n",rank,low_bound, portion,upper_bound);
//scatter the work among the processes
int *displs = new int[nproc]();
displs[0] = 0;
for (int proc = 1; proc < nproc; proc++) displs[proc] = displs[proc-1] + (N_ROWS/nproc);
MPI_Scatterv(matrix_A, counts, displs, MPI_DOUBLE, &matrix_A[low_bound][0], portion, MPI_DOUBLE, 0, MPI_COMM_WORLD);
//broadcast [B] to all the slaves
// -------------------> Everybody does their work
for (int i = low_bound; i < upper_bound; i++) {//iterate through a given set of rows of [A]
for (int j = 0; j < N_COLS; j++) {//iterate through columns of [B]
for (int k = 0; k < N_ROWS; k++) {//iterate through rows of [B]
matrix_C[i][j] += (matrix_A[i][k] * matrix_B[k][j]);
// -------------------> Process 0 gathers the work
The root process also takes part on the receiving side, so sendcounts needs nproc entries. If you do not want the root to receive anything, just set sendcounts[root] = 0.
See the MPI_Scatterv documentation for the exact values you have to pass.
However, be careful with what you are doing. I strongly suggest that you allocate your matrix as a one-dimensional array, using a single malloc like this:
double* matrix = (double*) malloc( N_ROWS * N_COLS * sizeof(double) );
If you still want to use a two-dimensional array, then you may need to define your types as a MPI derived datatype.
The datatype you are passing is not valid if you want to send more than a row in a single MPI transfer.
With MPI_DOUBLE you are telling MPI that the buffer contains a contiguous array of count MPI_DOUBLE values.
Since you are allocating a two-dimensional array using multiple malloc calls, your data is not contiguous.
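Putting the two points together, a sketch of the setup could look like this. It is an illustration only: split_rows and alloc_matrix are hypothetical helpers, counts and displacements are expressed in doubles (not rows), and the leftover rows are spread over the first ranks, which is one possible policy among several.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: split n_rows rows of n_cols doubles across nproc
 * ranks, root included. counts and displs are in units of doubles, as
 * needed when the datatype is MPI_DOUBLE. The first n_rows % nproc
 * ranks get one extra row. */
void split_rows(int n_rows, int n_cols, int nproc, int *counts, int *displs)
{
    assert(n_rows >= 0 && n_cols > 0 && nproc > 0);
    assert(counts != NULL && displs != NULL);
    int base = n_rows / nproc;
    int extra = n_rows % nproc;
    int offset = 0;
    for (int p = 0; p < nproc; p++) {
        int rows = base + (p < extra ? 1 : 0);
        counts[p] = rows * n_cols;
        displs[p] = offset;
        offset += counts[p];
    }
}

/* Hypothetical helper: one contiguous, zero-initialized block for the
 * whole matrix; element (i, j) lives at m[i * n_cols + j].
 * The caller frees the result. */
double *alloc_matrix(int n_rows, int n_cols)
{
    assert(n_rows > 0 && n_cols > 0);
    return calloc((size_t)n_rows * (size_t)n_cols, sizeof(double));
}
```

Every rank needs the same counts and displs, so it is simplest to compute them on all ranks instead of only on the root. The root then passes the contiguous matrix as the send buffer, and each rank receives counts[rank] doubles into its own contiguous block.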
I'm trying to understand a simple OpenCL example, which is vector addition. The kernel is the following:
__kernel void addVec(__global double* a, __global double* b, __global double* c)
{
    size_t id = get_global_id(0);
    c[id] = a[id] + b[id];
}
For example, my input arrays have a size of 1 million elements each.
In my host program, I set global_work_size to exactly the size of the input vectors (1 million).
But when I set it to a smaller value, for example 1000, it also works with this kernel!
I don't understand why global_work_size can be smaller than the problem dimension while the OpenCL program still computes every element of the input arrays.
Could someone clarify this?
EDIT: here is the code where I copy the data:
size_t arraySize = 1000000;
const size_t global_work_size[1] = {512};
double *host_a = malloc(arraySize*sizeof(double));
double *host_b = malloc(arraySize*sizeof(double));
double *host_c = calloc(arraySize, sizeof(double));
// Create the input and output arrays in device memory for our calculation
device_a = clCreateBuffer(context, CL_MEM_READ_ONLY, arraySize*sizeof(double), NULL, NULL);
device_b = clCreateBuffer(context, CL_MEM_READ_ONLY, arraySize*sizeof(double), NULL, NULL);
device_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, arraySize*sizeof(double), NULL, NULL);
// Copy data set into the input array in device memory. [host --> device]
status = clEnqueueWriteBuffer(command_queue, device_a, CL_TRUE, 0, arraySize*sizeof(double), host_a, 0, NULL, NULL);
status |= clEnqueueWriteBuffer(command_queue, device_b, CL_TRUE, 0, arraySize*sizeof(double), host_b, 0, NULL, NULL);
// Copy-back the results from the device [host <-- device]
clEnqueueReadBuffer(command_queue, device_c, CL_TRUE, 0, arraySize*sizeof(double), host_c, 0, NULL, NULL );
printf("checking result validity ...\n");
for (size_t i=0; i<arraySize; ++i)
if(host_c[i] - 1 > 1e-6) // the array is supposed to be 1 everywhere
printf("*** ERROR! Invalid results ! host_c[%zi]=%.9lf\n", i, host_c[i]);
Your test doesn't look right: it passes for any value below 1. It should be like this:
for (size_t i=0; i<arraySize; ++i){
    cl_double val = host_c[i] - 1; // the array is supposed to be 1 everywhere
    if((val > 1e-6) || (val < -1e-6))
        printf("*** ERROR! Invalid results ! host_c[%zu]=%.9lf\n", i, host_c[i]);
}
Uninitialized values on the GPU are likely to be 0, which passes your original check.
Additionally, remember that if you run the program once with the full size, later runs may still read the correctly processed data (even if you close and reopen the application), since GPU memory is not cleared when a buffer is created or destroyed.
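To see the effect without a GPU, here is a small CPU stand-in, offered as a sketch rather than a model of any real device. add_vec_simulated runs one loop iteration per work item, and the two counting functions apply the one-sided check from the question and the two-sided check above; all three names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* CPU stand-in for the addVec kernel: one iteration per work item, so
 * only indices below global_work_size are ever written. */
void add_vec_simulated(const double *a, const double *b, double *c,
                       size_t global_work_size)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t id = 0; id < global_work_size; id++)
        c[id] = a[id] + b[id];
}

/* The one-sided check from the question: only values above 1 count. */
size_t count_errors_one_sided(const double *c, size_t n)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; i++)
        if (c[i] - 1 > 1e-6)
            errors++;
    return errors;
}

/* The two-sided check: values below 1 count as errors too. */
size_t count_errors_two_sided(const double *c, size_t n)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; i++) {
        double val = c[i] - 1;
        if (val > 1e-6 || val < -1e-6)
            errors++;
    }
    return errors;
}
```

With 1000 elements, inputs that sum to 1, a zeroed output array, and a global work size of 512, the one-sided check reports no errors, while the two-sided check flags the 488 elements that were never computed.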
I'm trying to create a log file from each processor and then send that to the root as a char array. I first send the length and then I send the data. The length sends fine, but the data is always garbage! Here is my code:
string out = "";
MPI_Status status[2];
MPI_Request reqs[num_procs];
string log = "TEST";
int length = log.length();
char* temp = (char *) malloc(length+1);
strcpy(temp, log.c_str());
if (my_id != 0) {
    MPI_Send (&length, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Send (&temp, length+1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
}
else {
    int length;
    for (int i = 1; i < num_procs; i++) {
        MPI_Recv (&length, 2, MPI_INT, i, 1, MPI_COMM_WORLD, &status[0]);
        char* rec_buf;
        rec_buf = (char *) malloc(length+1);
        MPI_Recv (rec_buf, length+1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &status[1]);
        out += rec_buf;
    }
}
You are passing a char** to MPI_Send instead of a char*. This causes memory corruption or, as in your case, garbled output. Everything should be fine if you use
MPI_Send (temp, length+1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
(note the removed & in front of the first argument, temp.)
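The difference is easy to see without MPI. In the sketch below, memcpy stands in for a send call, on the assumption that a send simply reads count bytes starting at the address it is given; copy_like_send and wire_holds_text are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a send call: copies count bytes starting at buf into
 * the "wire" buffer, the way a send reads from the address it gets. */
void copy_like_send(const void *buf, size_t count, unsigned char *wire)
{
    assert(buf != NULL && wire != NULL);
    memcpy(wire, buf, count);
}

/* Returns 1 if the wire buffer starts with text, terminator included. */
int wire_holds_text(const unsigned char *wire, const char *text)
{
    assert(wire != NULL && text != NULL);
    return memcmp(wire, text, strlen(text) + 1) == 0;
}
```

Passing temp copies the characters, so the wire holds "TEST". Passing &temp copies the bytes of the pointer variable itself, which is where the garbage on the receiving side comes from.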