Parallelize a for loop using Boost.MPI

I am learning to use Boost.MPI to parallelize a large amount of computation. Below is a simple test to see whether I have the MPI logic right, but I could not get it to work. I ran it with world.size() = 10; there are 50 elements in total in the data array, so each process does 5 iterations. I want each process to send its updated data array to the root process, which receives the updates and then prints the result. However, only a few elements end up updated.
Thanks for helping me.
#include <boost/mpi.hpp>
#include <iostream>
#include <cstdlib>
namespace mpi = boost::mpi;
using namespace std;
#define max_rows 100
int data[max_rows];
int modifyArr(const int index, const int arr[]) {
return arr[index]*2+1;
}
int main(int argc, char* argv[])
{
mpi::environment env(argc, argv);
mpi::communicator world;
int num_rows = 50;
int my_number;
if (world.rank() == 0) {
for ( int i = 0; i < num_rows; i++)
data[i] = i + 1;
}
broadcast(world, data, 0);
for (int i = world.rank(); i < num_rows; i += world.size()) {
my_number = modifyArr(i, data);
data[i] = my_number;
world.send(0, 1, data);
//cout << "i=" << i << " my_number=" << my_number << endl;
if (world.rank() == 0)
for (int j = 1; j < world.size(); j++)
mpi::status s = world.recv(boost::mpi::any_source, 1, data);
}
if (world.rank() == 0) {
for ( int i = 0; i < num_rows; i++)
cout << "i=" << i << " results = " << data[i] << endl;
}
return 0;
}

Your problem is probably here:
mpi::status s = world.recv(boost::mpi::any_source, 1, data);
This is the only way data can get back to the master node.
However, you do not tell the master node where in data to store the answers it is getting. Since data is the address of the array, every received message is written starting at the zeroth element, overwriting whatever is already there.
Interleaving which elements of the array you are processing on each node is a pretty bad idea. You should assign blocks of the array to each node so that you can send entire chunks of the array at once. That will reduce communication overhead significantly.
Also, if your issue is simply speeding up for loops, you should consider OpenMP, which can do things like this:
#pragma omp parallel for
for(int i=0;i<100;i++)
data[i]*=4;
Bam! I just split that for loop up between all of my threads with no further work needed (apart from enabling OpenMP in the compiler, e.g. -fopenmp with GCC).

Related

How can I reverse a series of numbers as quickly as possible?

I am trying to solve a question called 'smurf' on codebreaker.xyz, but my code is too slow. The task is to take in a series of numbers, one at a time, and reverse the array each time.
So for this input:
6
0 1 2 1 2 0
it should give the output 0 1 1 0 2 2.
The logic is correct and the code works, but how can I make it execute under 1 second for s <= 200,000...
This is the code in C++17:
#include <algorithm>
#include <bits/stdc++.h>
using namespace std;
#define int unsigned int
vector<int> S;
int s, t;
int32_t main(){
ios_base :: sync_with_stdio(false); cin.tie(0); cout.tie(0);
cin >> s;
for(int i = 0; i < s; i++){
cin >> t;
S.push_back(t);
reverse(S.begin(), S.end());
}
for(int j = 0; j < s; j++){
cout << S[j] << " ";
}
}
As I understand your question, you need to reverse your container every time you add a new element. We can avoid that by adding the new number to the back or the front of the container, depending on whether the container is currently "reversed" or not. Since adding elements to the front of a vector is quite costly, I would suggest using a list instead. An example would look like this:
#include <iostream>
#include <list>
#include <ranges> //c++20
int main() {
std::list<int> numbers{};
bool reversed = false;
int N;
std::cin >> N;
while (N>0) {
int inputNumber;
std::cin >> inputNumber;
if(reversed){
numbers.push_front(inputNumber);
}
else{
numbers.push_back(inputNumber);
}
reversed = not reversed;
--N;
}
if(reversed){
for(const int i : numbers | std::views::reverse){
std::cout << i << ' ';
}
}
else{
for(const int i : numbers){
std::cout << i << ' ';
}
}
}

Base case condition in quick sort algorithm

In the recursive quick sort algorithm, every call checks the condition if(p < r). Please correct me if I am wrong: as far as I know, every recursive algorithm checks a condition when it enters the routine, and this condition is what identifies the base case. But I still cannot understand how to set and test this condition correctly.
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
For my entire code, please refer to the following:
/*
filename : main.c
description: quickSort algorithm
*/
#include<iostream>
using namespace std;
void exchange(int* val1, int* val2)
{
int temp = *val1;
*val1 = *val2;
*val2 = temp;
}
int partition(int* arr, int p, int r)
{
int x = arr[r];
int j = p;
int i = j-1;
while(j<=r-1)
{
if(arr[j] <= x)
{
i++;
// exchange arr[i] with arr[j]
exchange(&arr[i],&arr[j]);
}
j++;
}
exchange(&arr[i+1],&arr[r]);
return i+1;
}
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
// driver program to test the quick sort algorithm
int main(int argc, const char* argv[])
{
int arr1[] = {13,19,9,5,12,8,7,4,21,2,6,11};
cout <<"The original array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
quickSort(arr1,0,11);
//print out the sorted array
cout <<"The sorted array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
cin.get();
return 0;
}
Your question is not quite clear, but I will try to answer.
Quicksort works by sorting smaller and smaller arrays. The base case is an array with fewer than 2 elements, because no sorting is required.
At each step it picks a partition value (the pivot) and rearranges the array so that all the values to the left of the pivot are smaller than or equal to it and all values to the right are larger. In other words, it puts the pivot in its correct place. Then it recursively sorts the array to the left of the pivot and the array to the right of the pivot.
In your code, p is the index of the first element and r is the index of the last element, so the predicate p < r is true only for an array of at least 2 elements. In other words, if p >= r then you have an array of size 1 (p == r) or size 0 (p > r), and there is no work to do.

MPI datatype for structure of arrays

In an MPI application I have a distributed array of floats and two "parallel" arrays of integers: for each float value there are two associated integers that describe the corresponding value. For the sake of cache-efficiency I want to treat them as three different arrays, i.e. as a structure of arrays, rather than an array of structures.
Now, I have to gather all these values into the first node. I can do this in just one communication instruction, by defining an MPI type, corresponding to a structure, with one float and two integers. But this would force me to use the array of structures pattern instead of the structure of arrays one.
So, I can choose between:
Performing three different communications, one for each array and keep the efficient structure of arrays arrangement
Defining an MPI type, perform a single communication, and deal with the resulting array of structures by adjusting my algorithm or rearranging the data
Do you know a third option that would allow me to have the best of both worlds, i.e. having a single communication and keeping the cache-efficient configuration?
You can take a look at Packing and Unpacking.
http://www.mpi-forum.org/docs/mpi-11-html/node62.html
However, I think that if you need to pass the same "structure" often, you should define your own MPI derived datatype,
e.g. by using the array_of_blocklengths parameter of MPI_Type_create_struct:
// #file mpi_compound.cpp
#include <iterator>
#include <cstdlib> // for rng
#include <ctime> // for rng inits
#include <iostream>
#include <algorithm>
#include <mpi.h>
const std::size_t N = 10;
struct Asset {
float f[N];
int m[N], n[N];
void randomize() {
srand(time(NULL));
srand48(time(NULL));
std::generate(&f[0], &f[0] + N, drand48);
std::generate(&n[0], &n[0] + N, rand);
std::generate(&m[0], &m[0] + N, rand);
}
};
int main(int argc, char* argv[]) {
MPI_Init(&argc,&argv);
int rank,comm_size;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&comm_size);
Asset a;
MPI_Datatype types[3] = { MPI_FLOAT, MPI_INT, MPI_INT };
int bls[3] = { N, N, N };
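MPI_Aint base, disps[3];
// let MPI compute each member's byte offset from the start of the struct
MPI_Get_address(&a, &base);
MPI_Get_address(&a.f[0], &disps[0]);
MPI_Get_address(&a.m[0], &disps[1]);
MPI_Get_address(&a.n[0], &disps[2]);
for (int i = 0; i < 3; ++i) disps[i] -= base;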
MPI_Datatype MPI_USER_ASSET;
MPI_Type_create_struct(3, bls, disps, types, &MPI_USER_ASSET);
MPI_Type_commit(&MPI_USER_ASSET);
if(rank==0) {
a.randomize();
std::copy(&a.f[0], &a.f[0] + N, std::ostream_iterator<float>(std::cout, " "));
std::cout << std::endl;
std::copy(&a.m[0], &a.m[0] + N, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::copy(&a.n[0], &a.n[0] + N, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
MPI_Send(&a.f[0],1,MPI_USER_ASSET,1,0,MPI_COMM_WORLD);
} else {
MPI_Recv(&a.f[0],1,MPI_USER_ASSET,0,0,MPI_COMM_WORLD, &stat);
std::cout << "\t=> ";
std::copy(&a.f[0], &a.f[0] + N, std::ostream_iterator<float>(std::cout, " "));
std::cout << std::endl << "\t=> ";
std::copy(&a.m[0], &a.m[0] + N, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl << "\t=> ";
std::copy(&a.n[0], &a.n[0] + N, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
}
MPI_Type_free(&MPI_USER_ASSET);
MPI_Finalize();
return 0;
}
This worked with
mpirun -n 2 ./mpi_compound
using mpich2 v1.5 (HYDRA) on x86_64 Linux with a g++ 4.4.5-8 based mpic++.

Printing the contents of an array to a file

Pointer related question. I'm going through some example code that currently reads in data from a file called dataFile into a buffer. The reading is done inside a loop as follows:
unsigned char* buffer = (unsigned char*)malloc(1024*768*);
fread(buffer,1,1024*768,dataFile);
redPointer = buffer;
bluePointer = buffer+1024;
greenPointer = buffer+768;
Now, I want to try and write the entire contents of the array buffer to a file, so that I can save just those discrete images (and not have a large file). However, I am not entirely sure how to go about doing this.
I tried using cout statements, but I get a print-out of garbage characters on the console and a beep from the PC, so I end my program.
Is there an alternative method other than this:
for (int i=0; i < (1024*768); i++) {
fprintf(myFile, "%6.4f , ", buffer[i]);
}
By declaring your buffer as an unsigned char*, any pointer arithmetic or array indexing uses sizeof(unsigned char) to calculate the offset. An unsigned char is 1 byte (8 bits on virtually all current platforms).
I'm not sure what you are trying to do with the data in your buffer. Here are some ideas:
Print the value of each byte in decimal, encoded as ASCII text:
for (int i=0; i < (1024*768); i++) {
fprintf(myFile, "%d , ", buffer[i]);
}
Print the value of each byte in hexadecimal, encoded in ASCII text:
for (int i=0; i < (1024*768); i++) {
fprintf(myFile, "%x , ", buffer[i]);
}
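Print the value of each floating point number, in decimal, encoded in ASCII text. This assumes the buffer really holds floats in the machine's native format; each group of sizeof(float) adjacent, non-overlapping bytes is copied into a float before printing, because passing buffer[i] directly to %f is undefined behavior:
for (int i=0; i + (int)sizeof(float) <= (1024*768); i += sizeof(float)) {
float value;
memcpy(&value, &buffer[i], sizeof(float)); // needs #include <string.h>
fprintf(myFile, "%6.4f , ", value);
}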
Split the buffer into three files, each one from a non-overlapping section of the buffer:
fwrite(redPointer, sizeof(char), 768, file1);
fwrite(greenPointer, sizeof(char), 1024-768, file2);
fwrite(bluePointer, sizeof(char), (1024*768)-1024, file3);
Reference for fwrite. Note that for the count parameter I simply hard-coded the offsets that you had hard-coded in your question. One could also subtract certain of the pointers to calculate the number of bytes in each region. Note also that the contents of these three files will only be sensible if those are sensibly independent sections of the original data.
Maybe this gives you some ideas.
Updated: I created a complete program to compile and test the formatting behavior. It prints only the first few items from the buffer (each loop breaks once i exceeds 20). It compiles (with gcc -std=c99) and runs. I created the file /tmp/data using ghex and simply filled in some random data.
#include <stdlib.h>
#include <stdio.h>
int main()
{
FILE* dataFile = fopen("/tmp/data", "rb");
if (dataFile == NULL)
{
printf("fopen() failed");
return -2;
}
unsigned char* buffer = (unsigned char*)malloc(1024*768);
if (buffer == NULL)
{
printf("malloc failed");
return -1;
}
const int bytesRead = fread(buffer,1,1024*768,dataFile);
printf("fread() read %d bytes\n", bytesRead);
// release file handle
fclose(dataFile); dataFile = NULL;
printf("\nDecimal:\n");
for (int i=0; i < (1024*768); i++) {
printf("%hd , ", buffer[i]);
if (i > 20) { break; }
}
printf("\n");
printf("\nHexadecimal:\n");
for (int i=0; i < (1024*768); i++) {
printf("%#0hx , ", buffer[i]);
if (i > 20) { break; }
}
printf("\n");
printf("\nFloat:\n");
for (int i=0; i < (1024*768); i += sizeof(float)) {
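// note: this converts the value of a single byte to float;
// it does not reinterpret 4 adjacent bytes as one float
printf("%6.4f , ", (float)buffer[i]);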
if (i > 20) { break; }
}
printf("\n");
free(buffer); // release the buffer allocated with malloc
return 0;
}

Runtime allocation of multidimensional array

So far I thought that the following syntax was invalid,
int B[ydim][xdim];
But today I tried and it worked! I ran it many times to make sure it did not work by chance, even valgrind didn't report any segfault or memory leak!! I am very surprised. Is it a new feature introduced in g++? I always have used 1D arrays to store matrices by indexing them with correct strides as done with A in the program below. But this new method, as with B, is so simple and elegant that I have always wanted. Is it really safe to use? See the sample program.
PS. I am compiling it with g++-4.4.3, if that matters.
#include <cstdlib>
#include <iostream>
int test(int ydim, int xdim) {
// Allocate 1D array
int *A = new int[xdim*ydim](); // with C++ new operator
// int *A = (int *) malloc(xdim*ydim * sizeof(int)); // or with C style malloc
if (A == NULL)
return EXIT_FAILURE;
// Declare a 2D array of variable size
int B[ydim][xdim];
// populate matrices A and B
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++) {
A[y*xdim + x] = y*xdim + x;
B[y][x] = y*xdim + x;
}
}
// read out matrix A
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++)
std::cout << A[y*xdim + x] << " ";
std::cout << std::endl;
}
std::cout << std::endl;
// read out matrix B
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++)
std::cout << B[y][x] << " ";
std::cout << std::endl;
}
delete []A;
// free(A); // or in C style
return EXIT_SUCCESS;
}
int main() {
return test(5, 8);
}
int B[ydim][xdim] declares a 2-D array on the stack. new, on the other hand, allocates the array on the heap.
For any non-trivial array size, it is almost certainly better to put it on the heap, both to avoid running out of stack space and because a stack array cannot be passed back to anything outside the current scope.
This is a C99 'variable length array' (VLA). VLAs are not part of standard C++; g++ accepts them as a GNU extension.
Nice, aren't they?

Resources