How to accelerate MPI calls?

I am trying to accelerate a simple MPI program with OpenMP. I am using MPICH2 and a 4-core Intel processor. I have this simple code:
int main(int argc, char** argv) {
int size, rank, provided;
const int root = 0;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int cubeCount = StrToDouble(argv[1]);
int matrixSize = StrToDouble(argv[2]);
WorkNode node(rank, size, cubeCount, matrixSize);
time_t t0 = time(0);
node.work();
time_t t1 = time(0);
time_t total = t1 - t0;
The WorkNode class is also very simple; it contains only an array of Cube and the work method.
class Cube {
public:
Cube(int matrixSize);
double *matrix;
int matrixSize;
};
Cube::Cube(int matrixSize) {
matrix = new double[matrixSize];
this->matrixSize = matrixSize;
}
Finally, the work method:
// double *inBuffer = new double[cubes[0]->matrixSize];
MPI_Status status;
for (int i = 0; i < processorCount; i++) {
int nodeToSend = this->id + i;
int nodeRecv = this->id - i;
if (nodeToSend >= processorCount) {
nodeToSend -= processorCount;
}
if (nodeRecv < 0) {
nodeRecv += processorCount;
}
#pragma omp parallel for num_threads(2)
for (int i = 0; i < cubeCount; i++) {
Cube *cube = cubes[i];
if (nodeToSend != this->id) {
MPI_Bsend(cube->matrix, cube->matrixSize, MPI_DOUBLE, nodeToSend, _MY_MPI_ANY_TAG, MPI_COMM_WORLD);
}
if (nodeRecv != this->id) {
double *inBuffer = new double[cubes[0]->matrixSize];
MPI_Recv(inBuffer, cube->matrixSize, MPI_DOUBLE, nodeRecv, _MY_MPI_ANY_TAG, MPI_COMM_WORLD, &status);
delete[] inBuffer;
}
}
}
//delete inBuffer
Unfortunately, OpenMP does not accelerate the program (even when the number of MPI processes is 2), and sometimes it even slows it down. Can I somehow accelerate the MPI calls?
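For reference, here is a minimal sketch (not the original WorkNode program; the matrixSize value and buffer size below are placeholders) of two API details the code above relies on: checking the thread support level actually granted by MPI_Init_thread, and attaching a buffer before using MPI_Bsend:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* The library will not allow concurrent MPI calls from OpenMP threads. */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* MPI_Bsend needs an explicitly attached buffer large enough for all
       outstanding buffered sends, plus MPI_BSEND_OVERHEAD per message. */
    int matrixSize = 1000;                                  /* placeholder size */
    int bufSize = matrixSize * (int)sizeof(double) + MPI_BSEND_OVERHEAD;
    void *bsendBuf = malloc(bufSize);
    MPI_Buffer_attach(bsendBuf, bufSize);

    /* ... sends and receives as in the question ... */

    MPI_Buffer_detach(&bsendBuf, &bufSize);
    free(bsendBuf);
    MPI_Finalize();
    return 0;
}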

Related

Error in MPI program when using MPI_Reduce with MPI_MAXLOC

I want to use the MPI_Reduce function to find the largest value and its PID (rank) at the same time, but the result is wrong and I don't know how to fix it. Result:
PID:1, loc_num:2
PID:2, loc_num:3
PID:3, loc_num:4
global data: 1
coresponding PID: 0
my program:
#include <stdio.h>
#include <string.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
//init MPI
int PID, P;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &P);
MPI_Comm_rank(MPI_COMM_WORLD, &PID);
struct{
int value;
int PID;
} in, out;
int value = 1;
in.value = value;
in.PID = PID;
for(int i = 1; i <= P; i++){
if (PID == i){
value = value + i;
printf("PID:%d, loc_num:%d \n",PID, value);
}
}
MPI_Reduce(&in, &out, 1, MPI_2INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
int max_PID = out.PID;
int max_num = out.value;
if (PID == 0){
printf("global data: %d \n", max_num);
printf("coresponding PID: %d \n",max_PID);
}
MPI_Finalize();
return 0;
}
I just follow the structure of in.value = value and in.PID = PID, and then every PID calculates value = value + PID, so the result should be: when PID = 1, loc = 2; when PID = 2, loc = 3; and so on. Next, compare all of them with the max operation and send the result to PID = 0.
There is no error in the MPI_Reduce of your example. As #Gilles pointed out, the issue is that you are not assigning the newly calculated value to in.value.
If you put the assignment statements after the calculation, as below, then everything works as expected.
for(int i = 1; i <= P; i++){
if (PID == i){
value = value + i;
printf("PID:%d, loc_num:%d \n",PID, value);
}
}
in.value = value;
in.PID = PID;
MPI_Reduce(&in, &out, 1, MPI_2INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
In your example below, you are not assigning the calculated values to the in struct object.
in.value = value; // value is set as 1
in.PID = PID;
for(int i = 1; i <= P; i++){
if (PID == i){
value = value + i; // calculating the value but not assigning to in.value
printf("PID:%d, loc_num:%d \n",PID, value);
}
}
// uses the old value for in.value (i.e 1) for reduction
MPI_Reduce(&in, &out, 1, MPI_2INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
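For reference, a condensed, self-contained sketch of the corrected program (same logic as the fix above; the layout of the struct, value first and rank second, is exactly what MPI_MAXLOC with MPI_2INT expects):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int PID, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &PID);

    /* MPI_MAXLOC on MPI_2INT works on (value, index) pairs:
       the value to compare comes first, the location (rank) second. */
    struct { int value; int PID; } in, out;
    in.value = 1 + PID;   /* same per-rank values as in the question */
    in.PID = PID;

    MPI_Reduce(&in, &out, 1, MPI_2INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (PID == 0) {
        printf("global data: %d\n", out.value);
        printf("corresponding PID: %d\n", out.PID);
    }
    MPI_Finalize();
    return 0;
}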

mpi scatterv and gatherv

When I run my code with the command "mpirun -hosts o251-12,o251-13 ./matrixmult" and type ijk R 4, it gives an error. Worse, if I run it again the error sometimes changes.
If I use 1 or 2 instead of 4 it works well, but with a bigger number it gives an error. I tried changing my code, but it doesn't work.
Here is my code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#include <string.h>
void putValue(int *A, int *B, char *flag, int n);
void ijk(int *A, int *B, int *result, int n, int tmpAn);
void ikj(int *A, int *B, int *result, int n, int tmpAn);
void kij(int *A, int *B, int *result, int n, int tmpAn);
int main(){
int *A;
int *B;
int *result;
int n;
char flag[2];
char form[4];
int my_rank;
int comm_sz;
double time1;
double time2;
double time;
int i;
int j;
int *tmpA;
int *tmpResult;
int *sc;
int *displs;
int alpha;
int d;
MPI_Init(NULL,NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD,&comm_sz);
MPI_Barrier(MPI_COMM_WORLD);
if(my_rank==0){
time1=MPI_Wtime();
}
if(my_rank==0){
scanf("%s", form);
scanf("%s", flag);
scanf("%d", &n);
A = (int*)malloc(n * n * sizeof(int));
B = (int*)malloc(n * n * sizeof(int));
result = (int*)malloc(n * n * sizeof(int));
putValue(A,B,flag,n);
printf("running on %d processors\n", comm_sz);
}
if(my_rank){
B = (int*)malloc(n * n * sizeof(int));
}
MPI_Bcast(&n, 1, MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(B,n*n,MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(&form, 4, MPI_CHAR,0,MPI_COMM_WORLD);
sc = (int*)malloc(comm_sz * sizeof(int));
displs = (int*)malloc(comm_sz * sizeof(int));
alpha = n%comm_sz;
d=0;
for(i=0;i<comm_sz;i++){
sc[i] = n/comm_sz;
if(i<alpha){
sc[i] = sc[i]+1;
}
sc[i]=sc[i]*n;
displs[i] = d;
d+=sc[i];
}
tmpA = (int*)malloc(n * sc[my_rank] * sizeof(int));
tmpResult = (int*)malloc(n * sc[my_rank] * sizeof(int));
for(i=0; i<sc[my_rank]; i++){
tmpResult[i]=0;
}
MPI_Scatterv(A,sc,displs,MPI_INT,tmpA,sc[my_rank],MPI_INT,0,MPI_COMM_WORLD);
if(strcmp(form,"ijk")==0){
ijk(tmpA,B,tmpResult,n,sc[my_rank]);
}
else if(strcmp(form,"ikj")==0){
ikj(tmpA,B,tmpResult,n,sc[my_rank]);
}
else if(strcmp(form,"kij")==0){
kij(tmpA,B,tmpResult,n,sc[my_rank]);
}
MPI_Gatherv(tmpResult,sc[my_rank],MPI_INT,result,sc,displs,MPI_INT,0,MPI_COMM_WORLD);
if(my_rank==0){
time2=MPI_Wtime();
time=time2-time1;
printf("elapsed time = %.6e secondes\n",time);
if(strcmp(flag,"I")==0){
for(i=0;i<n;i++){
for(j=0;j<n;j++){
printf("%d ",result[i*n+j]);
}
printf("\n");
}
}
}
MPI_Finalize();
return 0;
}
void putValue(int *A, int *B, char *flag, int n){
int i,j;
srand((unsigned)time(NULL));
if(strcmp(flag,"R")==0){
for(i=0; i<n; i++){
for(j=0; j<n; j++){
A[i*n+j] = (int)rand()%101;
B[i*n+j] = (int)rand()%101;
printf("%d\n",A[i*n+j]);
printf("%d\n",B[i*n+j]);
}
}
}
else if(strcmp(flag,"I")==0){
for(i=0; i<n; i++){
for(j=0; j<n; j++){
int x;
scanf("%d", &x);
A[i*n+j]=x;
}
}
for(i=0; i<n; i++){
for(j=0; j<n; j++){
int x;
scanf("%d", &x);
B[i*n+j]=x;
}
}
}
}
void ijk(int *A, int *B, int *result, int n, int tmpAn){
int i,j,k;
for(i=0;i<(tmpAn/n);i++){
for(j=0;j<n;j++){
for(k=0;k<n;k++){
result[i*n+j] += (A[i*n+k] * B[k*n+j]);
}
}
}
}
void ikj(int *A, int *B, int *result, int n, int tmpAn){
int i,j,k;
for(i=0;i<(tmpAn/n);i++){
for(k=0;k<n;k++){
for(j=0;j<n;j++){
result[i*n+j] += (A[i*n+k] * B[k*n+j]);
}
}
}
}
void kij(int *A, int *B, int *result, int n, int tmpAn){
int i,j,k;
for(k=0;k<n;k++){
for(i=0;i<(tmpAn/n);i++){
for(j=0;j<n;j++){
result[i*n+j] += (A[i*n+k] * B[k*n+j]);
}
}
}
}
Sometimes the error is
rank = 3, revents = 25, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 1
and sometimes the error is
Fatal error in PMPI_Gatherv: Unknown error class, error stack:
PMPI_Gatherv(1001)....................: MPI_Gatherv failed(sbuf=0x231f140,
scount=4, MPI_INT, rbuf=0x231f060, rcnts=0x231f0b0, displs=0x231f0d0, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gatherv_impl(545)................: fail failed
I_MPIR_Gatherv_intra(617).............: Failure during collective
I_MPIR_Gatherv_intra(590).............: fail failed
MPIR_Gatherv_advanced(720)............: fail failed
MPIDU_Complete_posted_with_error(1710): Process failed
====================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 17870 RUNNING AT o251-13
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
This line is executed only by the root process:
scanf("%d", &n);
This line is executed by the rest before n is broadcast:
B = (int*)malloc(n * n * sizeof(int));
Move that allocation after the broadcast of n.
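A minimal sketch of the reordered section, using the same variables as the question (only the order of the broadcast and the allocation changes):

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every rank now knows n */
if (my_rank != 0) {
    /* allocate B only after n holds a valid value on non-root ranks */
    B = (int*)malloc(n * n * sizeof(int));
}
MPI_Bcast(B, n * n, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&form, 4, MPI_CHAR, 0, MPI_COMM_WORLD);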

Using ScatterV to split an array to multiple processes

I am working with MPI and I have to send parts of an array to different processes. As an example, consider 3 processes. Then I need to send the red elements to the first process, the green to the second, and the black to the third process.
I know I could use Scatterv twice, but I want to minimize the communication between processes and the real array that I'm splitting apart is huge. Does anyone have a suggestion on how I can accomplish this?
Here is my attempt with a derived data type:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
void print_array(int *array,int n){
int i;
printf("\t[");
for (i=0; i<n; i++) {
printf(" %d",array[i]);
}
printf("]\n");
}
int main(int argc, char **argv){
int rank,world_size,i,n = 16, block_count = 2;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int *array = malloc(n * sizeof(int));
for (i=0; i<n; i++) { array[i]=i;}
if (rank==0) { print_array(array,n);}
int *sendcounts = malloc(world_size * sizeof(int));
int *reccounts = malloc(world_size * sizeof(int));
int *displs = malloc(world_size * sizeof(int));
sendcounts[0]=3; sendcounts[1]=3; sendcounts[2]=2;
displs[0]=0; displs[1]=3; displs[2]=6;
for (i=0; i<world_size; i++) {
reccounts[i] = sendcounts[i]*block_count;
}
int root = 0;
int *recvbuf = malloc(reccounts[rank] * sizeof(int));
MPI_Datatype newtype;
MPI_Type_contiguous(block_count, MPI_INT, &newtype);
MPI_Type_commit(&newtype);
if (rank==0) {
MPI_Scatterv(array, sendcounts, displs,
newtype, recvbuf, sendcounts[rank],
newtype, root, MPI_COMM_WORLD);
}
else {
MPI_Scatterv(NULL, sendcounts, displs,
newtype, recvbuf, reccounts[rank],
newtype, root, MPI_COMM_WORLD);
}
MPI_Type_free (&newtype);
print_array(recvbuf,reccounts[rank]);
free(array);array = NULL;
free(sendcounts);sendcounts = NULL;
free(displs);displs = NULL;
free(recvbuf);recvbuf = NULL;
MPI_Finalize();
return 0;
}
There is a way, but it is a bit convoluted.
The idea is to create a derived datatype with two elements at offsets 0 and 8, and then resize this datatype so that its upper bound is the size of one element.
Then you can call MPI_Scatterv() once with counts={3,3,2} and displs={0,3,6}.
Note you also need to create a derived datatype on the receive side, otherwise MPI task 1 would receive {3, 11, 4, 12, 5, 13} when I guess you expect {3, 4, 5, 11, 12, 13}.
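A sketch of that approach, assuming the 16-element array and the {3, 3, 2} split from the code above (elements i and i+8 belong to the same rank; the exact counts are illustrative and the program must be run with 3 ranks):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Elements i and i+8 of the 16-element array end up on the same rank,
   delivered contiguously as {first halves, then second halves}. */
int main(int argc, char **argv)
{
    int rank, i, n = 16;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *array = malloc(n * sizeof(int));
    for (i = 0; i < n; i++) array[i] = i;

    int sendcounts[3] = {3, 3, 2};
    int displs[3]     = {0, 3, 6};

    /* Send side: two ints 8 elements apart, resized so consecutive
       blocks start one int apart. */
    MPI_Datatype vec, sendtype;
    MPI_Type_vector(2, 1, 8, MPI_INT, &vec);
    MPI_Type_create_resized(vec, 0, sizeof(int), &sendtype);
    MPI_Type_commit(&sendtype);

    /* Receive side: two ints sendcounts[rank] elements apart, so the
       first halves land first, followed by the second halves. */
    MPI_Datatype vec2, recvtype;
    MPI_Type_vector(2, 1, sendcounts[rank], MPI_INT, &vec2);
    MPI_Type_create_resized(vec2, 0, sizeof(int), &recvtype);
    MPI_Type_commit(&recvtype);

    int *recvbuf = malloc(2 * sendcounts[rank] * sizeof(int));
    MPI_Scatterv(array, sendcounts, displs, sendtype,
                 recvbuf, sendcounts[rank], recvtype, 0, MPI_COMM_WORLD);

    printf("rank %d:", rank);            /* rank 1 prints: 3 4 5 11 12 13 */
    for (i = 0; i < 2 * sendcounts[rank]; i++) printf(" %d", recvbuf[i]);
    printf("\n");

    MPI_Type_free(&vec); MPI_Type_free(&vec2);
    MPI_Type_free(&sendtype); MPI_Type_free(&recvtype);
    free(array); free(recvbuf);
    MPI_Finalize();
    return 0;
}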

Removing MPI_Bcast()

So I have some code where I am using MPI_Bcast to send information from the root node to all nodes, but instead I want my P0 to send chunks of the array to individual processes.
How do I do this with MPI_Send and MPI_Recv?
I've never used them before, and I don't know if I need to loop my MPI_Recv to effectively send everything or what.
I've put giant caps-lock comments in the code where I need to replace my MPI_Bcast(); sorry in advance for the waterfall of code.
Code:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
#define MAXSIZE 10000000
int add(int *A, int low, int high)
{
int res = 0, i;
for(i=low; i<=high; i++)
res += A[i];
return(res);
}
int main(int argc, char *argv[])
{
int myid, numprocs, x;
int data[MAXSIZE];
int i, low, high, myres, res;
double elapsed_time;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
if (myid == 0)
{
for(i=0; i<MAXSIZE; i++)
data[i]=1;
}
/* start the timer */
elapsed_time = -MPI_Wtime();
//THIS IS WHERE I GET CONFUSED ABOUT MPI_SEND AND MPI_RECIEVE!!!
MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);
x = MAXSIZE/numprocs;
low = myid * x;
high = low + x - 1;
if (myid == numprocs - 1)
high = MAXSIZE-1;
myres = add(data, low, high);
printf("I got %d from %d\n", myres, myid);
MPI_Reduce(&myres, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
/* stop the timer*/
elapsed_time += MPI_Wtime();
if (myid == 0)
printf("The sum is %d, time taken = %f.\n", res,elapsed_time);
MPI_Barrier(MPI_COMM_WORLD);
printf("The sum is %d at process %d.\n", res,myid);
MPI_Finalize();
return 0;
}
You need MPI_Scatter. A good intro is here: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
I think in your code it could look like this:
int elements_per_proc = MAXSIZE / numprocs;
// Create a buffer that will hold a chunk of the global array
int *data_chunk = malloc(sizeof(int) * elements_per_proc);
MPI_Scatter(data, elements_per_proc, MPI_INT, data_chunk,
elements_per_proc, MPI_INT, 0, MPI_COMM_WORLD);
If you really want to use MPI_Send and MPI_Recv, then you can use something like this:
int x = MAXSIZE / numprocs;
int *procData = malloc(x * sizeof(int));
MPI_Status status;
if (myid == 0) {
    /* rank 0 keeps its own chunk (data[0..x-1]) and ships one chunk to each other rank */
    for (int i = 1; i < numprocs; i++) {
        MPI_Send(data + i*x, x, MPI_INT, i, 0, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(procData, x, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
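Putting the MPI_Scatter suggestion into the structure of the original program might look roughly like this (a sketch only; it assumes MAXSIZE is divisible by numprocs, reuses add() and the variables from the question's main(), and needs <stdlib.h> for malloc):

/* Sketch of the section between the timer start and stop, with the
   broadcast replaced by a scatter. If MAXSIZE % numprocs != 0, the
   remainder handling from the original code is still needed. */
int elements_per_proc = MAXSIZE / numprocs;
int *data_chunk = malloc(elements_per_proc * sizeof(int));

/* Each rank receives only its own chunk instead of the whole array. */
MPI_Scatter(data, elements_per_proc, MPI_INT,
            data_chunk, elements_per_proc, MPI_INT, 0, MPI_COMM_WORLD);

/* Sum the local chunk, then combine the partial sums at rank 0. */
myres = add(data_chunk, 0, elements_per_proc - 1);
MPI_Reduce(&myres, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (myid == 0)
    printf("The sum is %d.\n", res);

free(data_chunk);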

Spellcheck program using MPI

So, my assignment is to write a spell check program and then parallelize it using Open MPI. My approach was to load the words from a text file into an array called dict[], which serves as my dictionary. Next, I get input from the user, and the program is supposed to go through the dictionary array and check whether the current word is within the threshold percentage; if it is, print it out. But I'm only supposed to print out a certain number of words. My problem is that my suggestions[] array doesn't seem to fill up the way I need it to, and it gets a lot of blank spots in it, whereas the way I wrote it, it should only be filled when a word is within the threshold, so it shouldn't get any blanks until there are no more words being added. I think it's close to being finished but I can't seem to figure this part out. Any help is appreciated.
#include <stdio.h>
#include <mpi.h>
#include <string.h>
#include <stdlib.h>
#define SIZE 30
#define max(x,y) (((x) > (y)) ? (x) : (y))
char *dict[50000];
char *suggestions[50000];
char enterWord[50];
char *myWord;
int wordsToPrint = 20;
int threshold = 40;
int i;
int words_added = 0;
int levenshtein(const char *word1, int len1, const char *word2, int len2){
int matrix[len1 + 1][len2 + 1];
int a;
for(a=0; a<= len1; a++){
matrix[a][0] = a;
}
for(a=0;a<=len2;a++){
matrix[0][a] = a;
}
for(a = 1; a <= len1; a++){
int j;
char c1;
c1 = word1[a-1];
for(j = 1; j <= len2; j++){
char c2;
c2 = word2[j-1];
if(c1 == c2){
matrix[a][j] = matrix[a-1][j-1];
}
else{
int delete, insert, substitute, minimum;
delete = matrix[a-1][j] +1;
insert = matrix[a][j-1] +1;
substitute = matrix[a-1][j-1] +1;
minimum = delete;
if(insert < minimum){
minimum = insert;
}
if(substitute < minimum){
minimum = substitute;
}
matrix[a][j] = minimum;
}//else
}//for
}//for
return matrix[len1][len2];
}//levenshtein
void prompt(){
printf("Enter word to search for: \n");
scanf("%s", &enterWord);
}
int p0_compute_output(int num_processes, char *word1){
int totalNumber = 0;
int k = 0;
int chunk = 50000 / num_processes;
for(i = 0; i < chunk; i++){
int minedits = levenshtein(word1, strlen(word1), dict[i], strlen(dict[i]));
int thresholdPercentage = (100 * minedits) / max(strlen(word1), strlen(dict[i]));
if(thresholdPercentage < threshold){
suggestions[totalNumber] = dict[i];
totalNumber = totalNumber + 1;
}
}//for
return totalNumber;
}//p0_compute_output
void p0_receive_output(int next_addition){
int num_to_add;
MPI_Comm comm;
MPI_Status status;
MPI_Recv(&num_to_add,1,MPI_INT,MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("--%d\n", num_to_add);
suggestions[next_addition] = dict[num_to_add];
next_addition = next_addition + 1;
}
void compute_output(int num_processes, int me, char *word1){
int chunk = 0;
int last_chunk = 0;
MPI_Comm comm;
if(50000 % num_processes == 0){
chunk = 50000 / num_processes;
last_chunk = chunk;
int start = me * chunk;
int end = me * chunk + chunk;
for(i = start; i < end;i++){
int minedits = levenshtein(word1, strlen(word1), dict[i], strlen(dict[i]));
int thresholdPercentage = (100 * minedits) / max(strlen(word1), strlen(dict[i]));
if(thresholdPercentage < threshold){
int number_to_send = i;
MPI_Send(&number_to_send, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
}
}
else{
chunk = 50000 / num_processes;
last_chunk = 50000 - ((num_processes - 1) * chunk);
if(me != num_processes){
int start = me * chunk;
int end = me * chunk + chunk;
for(i = start; i < end; i++){
int minedits = levenshtein(word1, strlen(word1), dict[i], strlen(dict[i]));
int thresholdPercentage = (100 * minedits) / max(strlen(word1), strlen(dict[i]));
if(thresholdPercentage < threshold){
int number_to_send = i;
MPI_Send(&number_to_send, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}//if
}//for
}//if me != num_processes
else{
int start = me * chunk;
int end = 50000 - start;
for(i = start; i < end; i++){
int minedits = levenshtein(word1, strlen(word1), dict[i], strlen(dict[i]));
int thresholdPercentage = (100 * minedits) / max(strlen(word1), strlen(dict[i]));
if(thresholdPercentage < threshold){
int number_to_send = i;
MPI_Send(&number_to_send, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
}
}//me == num_processes
}//BIG else
return;
}//COMPUTE OUTPUT
void set_data(){
prompt();
MPI_Bcast(&enterWord,20 ,MPI_CHAR, 0, MPI_COMM_WORLD);
}//p0_send_inpui
//--------------------------MAIN-----------------------------//
main(int argc, char **argv){
int ierr, num_procs, my_id, loop;
FILE *myFile;
loop = 0;
for(i=0;i<50000;i++){
suggestions[i] = calloc(SIZE, sizeof(char));
}
ierr = MPI_Init(NULL, NULL);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
printf("Check in from %d of %d processors\n", my_id, num_procs);
set_data();
myWord = enterWord;
myFile = fopen("words", "r");
if(myFile != NULL){
for(i=0;i<50000;i++){
dict[i] = calloc(SIZE, sizeof(char));
fscanf(myFile, "%s", dict[i]);
}//for
fclose(myFile);
}//read word list into dictionary
else printf("File not found");
if(my_id == 0){
words_added = p0_compute_output(num_procs, enterWord);
printf("words added so far: %d\n", words_added);
p0_receive_output(words_added);
printf("Threshold: %d\nWords To print: %d\n%s\n", threshold, wordsToPrint, myWord);
ierr = MPI_Finalize();
}
else{
printf("my word %s*\n", enterWord);
compute_output(num_procs, my_id, enterWord);
// printf("Process %d terminating...\n", my_id);
ierr = MPI_Finalize();
}
for(i=0;i<wordsToPrint;i++){
printf("*%s\n", suggestions[i]);
}//print suggestions
return (0);
}//END MAIN
Here are a few problems I see with what you're doing:
- prompt() should only be called by rank 0.
- The dictionary file should be read only by rank 0, then the array broadcast out to the other ranks. Alternatively, have rank 1 read the file while rank 0 is waiting for input, and broadcast the input and the dictionary afterwards.
- You're making the compute_output step overly complex. You can merge p0_compute_output and compute_output into one routine.
- Store an array of indices into dict in each rank. This array will not be the same size in every rank, so the simplest way to handle it is to send from each rank a single integer indicating the size of the array, then send the array with that size; the receiving rank must know how much data to expect (see the sketch below). You could also use the sizes for MPI_Gatherv, but I expect this is more than you're wanting to do right now.
- Once you have a single array of indices in rank 0, use it to fill suggestions.
- Save the MPI_Finalize call until immediately before the return call.
- For the final printf loop, only rank 0 should be printing. I suspect this is causing a large part of the "incorrect" result: as you have it, all ranks print suggestions, but it is only filled in rank 0, so the others will all print blank entries.
Try some of these changes, especially the last one, and see if that helps.
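A minimal sketch of the "send the size, then send the data" pattern mentioned in the list (my_id and num_procs come from the question's main(); matches and match_count are illustrative names, not from the original code):

/* Each non-root rank sends how many dictionary indices it found, then the
   indices themselves; rank 0 collects everything into one array. */
if (my_id != 0) {
    MPI_Send(&match_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Send(matches, match_count, MPI_INT, 0, 1, MPI_COMM_WORLD);
} else {
    int total = match_count;   /* rank 0's own matches come first */
    for (int src = 1; src < num_procs; src++) {
        int incoming;
        MPI_Recv(&incoming, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(matches + total, incoming, MPI_INT, src, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        total += incoming;
    }
    /* matches[0..total-1] now holds indices into dict[] for filling suggestions[] */
}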
