Image processing in MPI - mpi

This is my attempt to code the classical pixel-averaging smoothing algorithm in MPI. I almost have it working, but something odd happens with the halo exchange: lines are visible right at the edges between processes. I can't find the bug. Am I exchanging halos properly? Which section of the final array should I gather?
https://pastebin.com/4rtFnSJ5
int next = rank + 1;
int prev = rank - 1;
if (next >= size) {
next = MPI_PROC_NULL;
}
if (prev < 0) {
prev = MPI_PROC_NULL;
}
int rows = y / px;
int cols = x;
int d = 1;
for (int iter = 0; iter < TotalIter; iter++) {
for (int i = 0; i < rows + 2; i++)
for (int j = 0; j < cols + 2; j++)
for (int k = 0; k < rgb; k++)
new[i][j * rgb + k] = 0;
for (int i = 1; i < rows + 1; i++) {
int iMin = -min(d, i - 1);
int iMax = min(d, (rows + 1 - i - 1));
for (int j = 1; j < cols + 1; j++) {
int jMin = -min(d, j - 1);
int jMax = min(d, (cols + 1 - j - 1));
int counter = 0;
for (int p = iMin; p <= iMax; p++)
for (int q = jMin; q <= jMax; q++) {
counter = counter + 1;
for (int k = 0; k < rgb; k++) {
new[i][j * rgb + k] += old[i + p][(j + q) * rgb + k];
}
}
for (int k = 0; k < rgb; k++) {
new[i][j * rgb + k] -= old[i][j * rgb + k];
new[i][j * rgb + k] /= (counter - 1);
}
}
}
for (int i = 2; i < rows; i++)
for (int j = 2; j < cols; j++)
for (int k = 0; k < rgb; k++) {
old[i][j * rgb + k] = new[i][j * rgb + k];
}
MPI_Sendrecv(&old[rows][1], cols * rgb, MPI_INT, next, 1, &old[0][1],
cols * rgb, MPI_INT, prev, 1, MPI_COMM_WORLD, &status);
MPI_Sendrecv(&old[1][1], cols * rgb, MPI_INT, prev, 2, &old[rows + 1][1],
cols * rgb, MPI_INT, next, 2, MPI_COMM_WORLD, &status);
}
for (int i = 1; i< rows+1; i++)
for (int j = 1; j< cols+1; j++)
for (int k = 0; k< rgb; k++) {
buf[i-1][(j-1)*rgb+k] = old[i][j*rgb+k] ;
}
MPI_Gather(&buf[0][0], rows *cols *rgb, MPI_INT, &Finalbuffer[0][0],
rows *cols *rgb, MPI_INT, 0, MPI_COMM_WORLD);
The output looks like this when run on 8 MPI processes. I can clearly see delimiting lines. For that reason I thought I was not doing halo exchanges properly.

OK, so there are a bunch of issues here.
First, your code could only ever work with d=1 since you only swap halos of depth 1. If you want to process neighbours of distance d, you need to swap halos of depth d.
Second, you do the first halo swap after your first sweep through the arrays, so you are reading junk halo data on iteration 1. You need to do a halo swap before you start processing your arrays.
Third, when you copy new back to old you start from index 2. You need to include all the pixels from 1 to lrows and 1 to lcols.
Finally, your logic for iMin, iMax etc. seems wrong. You don't want to truncate the range at the edges in the parallel program: you need to go off the edges to pick up the halo data. I just set iMin = -d, iMax = d, and likewise for jMin and jMax.
With these fixes the code seems to run OK, i.e. there are no obvious halo effects, but it still gives different results on different numbers of processes.
PS I was also flattered to see you used the "arraymalloc2d" code from one of my own MPI examples - http://www.archer.ac.uk/training/course-material/2018/07/intro-epcc/exercises/cfd.tar.gz ; I'm glad to see that these training codes are proving useful to people!
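The ordering described above (halo swap first, then a sweep with a fixed window, then a copy-back of every interior cell) can be illustrated without MPI. The following is only a sketch in plain C under several assumptions: a single colour channel, d = 1, strips held in one process, and memcpy standing in for the two MPI_Sendrecv calls. The names Strip, sweep and smooth, and the sizes NX, NY and MAXSTRIPS, are hypothetical and do not come from the original code. Halo cells on the physical image boundary are simply left at 0.

```c
#include <string.h>

enum { NX = 6, NY = 8, MAXSTRIPS = 4 };

/* One strip: up to NY interior rows plus a one-cell halo on every side. */
typedef int Strip[NY + 2][NX + 2];

/* One smoothing sweep over `rows` interior rows of a strip. */
static void sweep(int rows, Strip old, Strip tmp)
{
    for (int i = 1; i <= rows; i++)
        for (int j = 1; j <= NX; j++) {
            int sum = 0;
            /* Fixed -1..1 window: it reads halo cells at the strip edges. */
            for (int p = -1; p <= 1; p++)
                for (int q = -1; q <= 1; q++)
                    sum += old[i + p][j + q];
            tmp[i][j] = (sum - old[i][j]) / 8;   /* mean of the 8 neighbours */
        }
    /* Copy back every interior cell, starting at index 1 (not 2). */
    for (int i = 1; i <= rows; i++)
        for (int j = 1; j <= NX; j++)
            old[i][j] = tmp[i][j];
}

/* Smooth img in place using nstrips strips (nstrips must divide NY). */
void smooth(int img[NY][NX], int iters, int nstrips)
{
    static Strip old[MAXSTRIPS], tmp[MAXSTRIPS];
    int rows = NY / nstrips;
    size_t rowbytes = NX * sizeof(int);

    memset(old, 0, sizeof old);   /* halos on the physical boundary stay 0 */
    for (int s = 0; s < nstrips; s++)
        for (int i = 0; i < rows; i++)
            memcpy(&old[s][i + 1][1], img[s * rows + i], rowbytes);

    for (int it = 0; it < iters; it++) {
        /* Halo exchange first, so the sweep never reads stale halos. */
        for (int s = 0; s + 1 < nstrips; s++) {
            memcpy(&old[s + 1][0][1], &old[s][rows][1], rowbytes);
            memcpy(&old[s][rows + 1][1], &old[s + 1][1][1], rowbytes);
        }
        for (int s = 0; s < nstrips; s++)
            sweep(rows, old[s], tmp[s]);
    }

    for (int s = 0; s < nstrips; s++)
        for (int i = 0; i < rows; i++)
            memcpy(img[s * rows + i], &old[s][i + 1][1], rowbytes);
}
```

In this simplified setting, calling smooth on the same image with 1, 2 or 4 strips yields identical output, which is a convenient check that the exchange and the copy-back ranges are consistent. For a neighbourhood of distance d, the halo would need d rows instead of one.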

Related

Game Of Life ends quickly (Java)

I've created a basic version of the Game Of Life: each turn, the board is simulated by a 2D array of 1's and 0's, after which another class creates a drawing of it for me using the 2d array
I've read all the other questions here regarding this game, but no answer seems to work for me. Sorry if I'm beating a dead horse here.
I think I have a problem with my algorithm, so maybe the board gets filled with the wrong number of dead and alive cells and the game ends rather quickly (5-10 turns).
I've found an algorithm here to scan all the neighbors, and even added a count = -1 in case a cell in the grid counts itself as its own neighbor, but I think I'm missing something here.
public static void repaint(board game, int size,int[][] alive, int[][] newGeneration)
{
int MIN_X = 0, MIN_Y = 0, MAX_X =9, MAX_Y =9, count;
for ( int i = 0; i < size; i++ )
{
for (int j = 0; j < size; j++) //here we check for each matrix cell's neighbors to see if they are alive or dead
{
count = 0;
if (alive[i][j] == 1)
count = -1;
int startPosX = (i - 1 < MIN_X) ? i : i - 1;
int startPosY = (j - 1 < MIN_Y) ? j : j - 1;
int endPosX = (i + 1 > MAX_X) ? i : i + 1;
int endPosY = (j + 1 > MAX_Y) ? j : j + 1;
for (int rowNum = startPosX; rowNum <= endPosX; rowNum++)
{
for (int colNum = startPosY; colNum <= endPosY; colNum++)
{
if (alive[rowNum][colNum] == 1)
count++;
}
}
if (alive[i][j] == 0 && count == 3) //conditions of the game of life
newGeneration[i][j] = 1; //filling the new array for the next life
if (alive[i][j] == 1 && count < 2)
newGeneration[i][j] = 0;
if (alive[i][j] == 1 && count >= 4)
newGeneration[i][j] = 0;
if (alive[i][j] == 1 && count == 3)
newGeneration[i][j] = 1;
}
}
game.setAlive(newGeneration); //we created a new matrix with the new lives, now we set it
SetupGUI(game,size); //re drawing the panel
}
}
What am I doing wrong? Thanks for the help.
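One thing that stands out in the posted code is that newGeneration is only assigned in four cases. A live cell with exactly two neighbours, and a dead cell whose count is not 3, are never written, so those cells keep whatever value newGeneration already held. For comparison, here is a minimal sketch of the standard rules in which every cell of the next board is assigned on every step. It is written in C rather than the asker's Java, and the names N, live_neighbours and step are hypothetical.

```c
enum { N = 10 };

/* Count live neighbours of (i, j); cells outside the board count as dead. */
static int live_neighbours(int board[N][N], int i, int j)
{
    int count = 0;
    for (int r = i - 1; r <= i + 1; r++)
        for (int c = j - 1; c <= j + 1; c++) {
            if (r < 0 || r >= N || c < 0 || c >= N)
                continue;               /* off the board */
            if (r == i && c == j)
                continue;               /* skip the cell itself */
            count += board[r][c];
        }
    return count;
}

/* Compute the next generation; every cell of next is written. */
void step(int cur[N][N], int next[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int n = live_neighbours(cur, i, j);
            if (cur[i][j])
                next[i][j] = (n == 2 || n == 3);   /* survival */
            else
                next[i][j] = (n == 3);             /* birth */
        }
}
```

A horizontal row of three live cells (a "blinker") should turn into a vertical row of three after one call to step, which makes a quick sanity test.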

Arduino: float function returns inf

I have a function (shown below) that I need some advice on. The function returns the slope of a line which is fit (via the least squares method) to n data points. To give you a context, my project is a barometric pressure based altimeter which uses this function to determine velocity based on the n most recent altitude-time pairs. These altitude-time pairs are stored in 2 global arrays(times[] and alts[]).
My problem is not that this method doesn't work. It usually does. But sometimes I will run the altimeter and this function will return the value 'inf' interspersed with a bunch of other wrong values (I have also seen 'NaN' but that is more rare). There are a few areas of suspicion I have at this point but I would like a fresh perspective. Here is some further contextual information that may or may not be of use:
I am using interrupts for a quadrature encoder
The times[] array is of type unsigned long
The alts[] array is of type float
n is a const int, in this case n = 9
On the ATmega328 a double is the same as a float (see the Arduino reference for double)
float velF() { // uses the last n data points, fits a line to them,
// and uses the slope of that line as the velocity at that moment
float sumTY = 0, sumT = 0, sumY = 0, sumT2 = 0;
for (int i = 0; i < n; i++) {
sumTY += (float)times[i] * alts[i] / 1000;
sumT += (float)times[i] / 1000;
sumY += alts[i];
sumT2 += (float)times[i] * times[i] / 1000000;
}
return (n*sumTY - sumT*sumY) / (n*sumT2 - sumT*sumT);
}
Any help or advice would be greatly appreciated!
The code is certainly performing division by zero.
For a variety of reasons, n*sumT2 - sumT*sumT can be zero (@John Bollinger). In most of these cases, the top (dividend) of the division will also be zero, and a return value of zero would be acceptable.
float velF(void) {
float sumTY = 0, sumT = 0, sumY = 0, sumT2 = 0;
for (size_t i = 0; i < n; i++) {
// ensure values are reasonable
assert(alts[i] >= ALT_MIN && alts[i] <= ALT_MAX);
assert(times[i] >= TIME_MIN && times[i] <= TIME_MAX);
sumTY += (float)times[i] * alts[i] / 1000;
sumT += (float)times[i] / 1000;
sumY += alts[i];
sumT2 += (float)times[i] * times[i] / 1000000;
}
float d = n*sumT2 - sumT*sumT;
if (d == 0) return 0;
return (n*sumTY - sumT*sumY) / d;
}
Side note: the division could be factored out of the loop for improved accuracy and speed. I suggest performing the last calculation as double.
float velF(void) {
float sumTY = 0, sumT = 0, sumY = 0, sumT2 = 0;
for (size_t i = 0; i < n; i++) {
float tf = (float) times[i];
sumTY += tf * alts[i];
sumT += tf;
sumY += alts[i];
sumT2 += tf * tf;
}
double nd = n;
double sumTd = sumT;
double d = nd*sumT2 - sumTd*sumTd;
if (d == 0) return 0;
return (nd*sumTY - sumTd*sumY)*1000 / d;
}
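One plausible contributor to the zero denominator (an assumption, not something stated above) is loss of precision. A float has a 24-bit significand, so once the millisecond timestamps reach the millions, squaring them discards the small differences between samples and n*sumT2 - sumT*sumT can cancel to zero. A minimal sketch that subtracts the first timestamp before summing follows. The name slope_shifted and its parameters are hypothetical, and it assumes t_ms[0] is the oldest sample, so that the unsigned subtraction does not wrap.

```c
#include <stddef.h>

/* Least-squares slope of y against time, with times measured in seconds
   relative to the first sample so the sums stay small. */
float slope_shifted(const unsigned long *t_ms, const float *y, size_t n)
{
    float sumT = 0, sumY = 0, sumTY = 0, sumT2 = 0;
    for (size_t i = 0; i < n; i++) {
        /* Subtract in integer arithmetic first, then convert to float. */
        float t = (float)(t_ms[i] - t_ms[0]) / 1000.0f;
        sumT  += t;
        sumY  += y[i];
        sumTY += t * y[i];
        sumT2 += t * t;
    }
    float d = (float)n * sumT2 - sumT * sumT;
    if (d == 0)
        return 0;          /* same guard as above: avoid dividing by zero */
    return ((float)n * sumTY - sumT * sumY) / d;
}
```

Shifting the time origin does not change the slope of the fitted line, only the intercept, so the returned velocity has the same meaning as before.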

OpenCL: Move data between __global memory

I am trying to move some data between 2 global memory before running a kernel on it.
Here buffer contains data that needs to be written in array, but sadly not contiguously:
void exchange_2_halo_write(
__global float2 *array,
__global float *buffer,
const unsigned int im,
const unsigned int jm,
const unsigned int km
) {
const unsigned int v_dim = 2;
unsigned int i, j, k, v, i_buf = 0;
// Which vector component, ie along v_dim
for (v = 0; v < v_dim; v++) {
// top halo
for (k = 0; k < km; k++) {
for (i = 0; i < im; i++) {
((__global float*)&array[i + k*im*jm])[v] = buffer[i_buf];
i_buf++;
}
}
// bottom halo
for (k = 0; k < km; k++) {
for (i = 0; i < im; i++) {
((__global float*)&array[i + k*im*jm + im*(jm-1)])[v] = buffer[i_buf];
i_buf++;
}
}
// left halo
for (k = 0; k < km; k++) {
for (j = 1; j < jm-1; j++) {
((__global float*)&array[j*im + k*im*jm])[v] = buffer[i_buf];
i_buf++;
}
}
// right halo
for (k = 0; k < km; k++) {
for (j = 1; j < jm-1; j++) {
((__global float*)&array[j*im + k*im*jm + (im-1)])[v] = buffer[i_buf];
i_buf++;
}
}
}
}
This works fine in C (with a few minor changes), and for the data size I need (im = 150, jm = 150, km = 90, buf_sz = 107280), it runs in about 0.02s.
I had expected the same code to be slower on the GPU, but not that much slower: it actually takes about 90 minutes to do the same thing (that's about 250000x slower!).
Simply doing a straight element-by-element copy takes about 15 minutes, which clearly shows it is not the way to go.
for (i = 0; i < buf_sz; i++) {
array[i] = buffer[i];
}
In that case, I have seen that I can do something like this:
int xid = get_global_id(0);
array[xid] = buffer[xid];
which seems to work fine/quickly.
However, I do not know how to adapt this to the index conditions in the first code.
The top and bottom halo parts each have im contiguous elements to transfer to array, which I think means they should be easy to transfer. Sadly, the left and right halos don't.
Also, with better code, can I expect to get somewhat close to the CPU time? If it is impossible to do it in, say, under 1s, it's probably going to be a waste.
Thank you.
Before the answer, one remark. When you do a for loop inside a kernel, like this:
for (i = 0; i < buf_sz; i++) {
array[i] = buffer[i];
}
and you launch, say, 512 work items, every work item runs the whole loop: you are doing the copy 512 times, not doing it once in parallel with 512 threads. So obviously it is going to be even slower, possibly more than 512x slower!
That said, you can split it in this way:
2D Global size: km x max(im,jm)
void exchange_2_halo_write(
__global float2 *array,
__global float *buffer,
const unsigned int im,
const unsigned int jm
) {
const unsigned int v_dim = 2;
const unsigned int k = get_global_id(0);
const unsigned int i = get_global_id(1);
const unsigned int km = get_global_size(0);
// Which vector component, ie along v_dim
for (unsigned int v = 0; v < v_dim; v++) {
if(i < im){
// top halo
((__global float*)&array[i + k*im*jm])[v] = buffer[v*(2*km*im + 2*km*(jm-2)) + k*im + i];
// bottom halo
((__global float*)&array[i + k*im*jm + im*(jm-1)])[v] = buffer[v*(2*km*im + 2*km*(jm-2)) + km*im + k*im + i];
}
if(i < jm-1 && i > 0){
// left halo (here i plays the role of j)
((__global float*)&array[i*im + k*im*jm])[v] = buffer[v*(2*km*im + 2*km*(jm-2)) + 2*km*im + k*(jm-2) + (i-1)];
// right halo
((__global float*)&array[i*im + k*im*jm + (im-1)])[v] = buffer[v*(2*km*im + 2*km*(jm-2)) + 2*km*im + km*(jm-2) + k*(jm-2) + (i-1)];
}
}
}
Other options are possible, like using local memory, but that is tedious work.
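When each work item computes its own position in buffer, the closed-form offsets have to reproduce the order in which the serial loop of the question advanced i_buf. That can be checked on the host before touching the GPU. The following is a sketch in plain C. All function names are hypothetical, and the layout assumed is the one from the question's loop nest (per vector component: top, bottom, left, right, each with k outermost).

```c
#include <stddef.h>

/* Buffer elements per vector component: top + bottom + left + right. */
static size_t per_component(size_t im, size_t jm, size_t km)
{
    return 2 * km * im + 2 * km * (jm - 2);
}

size_t top_offset(size_t v, size_t k, size_t i,
                  size_t im, size_t jm, size_t km)
{
    return v * per_component(im, jm, km) + k * im + i;
}

size_t bottom_offset(size_t v, size_t k, size_t i,
                     size_t im, size_t jm, size_t km)
{
    return v * per_component(im, jm, km) + km * im + k * im + i;
}

size_t left_offset(size_t v, size_t k, size_t j,
                   size_t im, size_t jm, size_t km)
{
    return v * per_component(im, jm, km) + 2 * km * im
         + k * (jm - 2) + (j - 1);
}

size_t right_offset(size_t v, size_t k, size_t j,
                    size_t im, size_t jm, size_t km)
{
    return v * per_component(im, jm, km) + 2 * km * im + km * (jm - 2)
         + k * (jm - 2) + (j - 1);
}

/* Walk the serial loop nest; return 1 if every closed-form offset
   equals the running counter, 0 otherwise. */
int offsets_match_serial(size_t im, size_t jm, size_t km)
{
    size_t i_buf = 0;
    for (size_t v = 0; v < 2; v++) {
        for (size_t k = 0; k < km; k++)
            for (size_t i = 0; i < im; i++)
                if (top_offset(v, k, i, im, jm, km) != i_buf++) return 0;
        for (size_t k = 0; k < km; k++)
            for (size_t i = 0; i < im; i++)
                if (bottom_offset(v, k, i, im, jm, km) != i_buf++) return 0;
        for (size_t k = 0; k < km; k++)
            for (size_t j = 1; j + 1 < jm; j++)
                if (left_offset(v, k, j, im, jm, km) != i_buf++) return 0;
        for (size_t k = 0; k < km; k++)
            for (size_t j = 1; j + 1 < jm; j++)
                if (right_offset(v, k, j, im, jm, km) != i_buf++) return 0;
    }
    return i_buf == 2 * per_component(im, jm, km);
}
```

For im = 150, jm = 150, km = 90 the total is 2 * (27000 + 26640) = 107280 elements, which agrees with the buf_sz quoted in the question.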

Unhandled exception error with two dimensional array

This dynamic programming algorithm throws an unhandled exception, probably because of the two-dimensional arrays I am using for a varying (and very large) number of inputs. I can't figure out the issue. The complete program is as follows:
// A Dynamic Programming based solution for 0-1 Knapsack problem
#include<stdio.h>
#include<stdlib.h>
#include<time.h>   // time(), used to seed rand()
#include<conio.h>  // _getch() (Windows-specific)
#define MAX 10000
int size;
int Weight;
int p[MAX];
int w[MAX];
// A utility function that returns maximum of two integers
int maximum(int a, int b) { return (a > b) ? a : b; }
// Returns the maximum value that can be put in a knapsack of capacity W
int knapSack(int W, int wt[], int val[], int n)
{
int i, w;
int retVal;
int **K;
K = (int**)calloc(n+1, sizeof(int*));
for (i = 0; i < n + 1; ++i)
{
K[i] = (int*)calloc(W + 1, sizeof(int));
}
// Build table K[][] in bottom up manner
for (i = 0; i <= n; i++)
{
for (w = 0; w <= W; w++)
{
if (i == 0 || w == 0)
K[i][w] = 0;
else if (wt[i - 1] <= w)
K[i][w] = maximum(val[i - 1] + K[i - 1][w - wt[i - 1]], K[i - 1][w]);
else
K[i][w] = K[i - 1][w];
}
}
retVal = K[n][W];
for (i = 0; i < size + 1; i++)
free(K[i]);
free(K);
return retVal;
}
int random_in_range(unsigned int min, unsigned int max)
{
int base_random = rand();
if (RAND_MAX == base_random) return random_in_range(min, max);
int range = max - min,
remainder = RAND_MAX % range,
bucket = RAND_MAX / range;
if (base_random < RAND_MAX - remainder) {
return min + base_random / bucket;
}
else {
return random_in_range(min, max);
}
}
int main()
{
srand(time(NULL));
int val = 0;
int i, j;
//each input set is contained in an array
int batch[] = { 10, 20, 30, 40, 50, 5000, 10000 };
int sizeOfBatch = sizeof(batch) / sizeof(batch[0]);
//algorithms are called per size of the input array
for (i = 0; i < sizeOfBatch; i++){
printf("\n");
//dynamic array allocation (variable length, to avoid stack overflow)
//calloc is used to avoid garbage values
int *p = (int*)calloc(batch[i], sizeof(int));
int *w = (int*)calloc(batch[i], sizeof(int));
for (j = 0; j < batch[i]; j++){
p[j] = random_in_range(1, 500);
w[j] = random_in_range(1, 100);
}
size = batch[i];
Weight = batch[i] * 25;
printf("| %d ", batch[i]);
printf(" %d", knapSack(Weight, w, p, size));
free(p);
free(w);
}
_getch();
return 0;
}
Change this:
for (i = 0; i < size + 1; i++)
free(K[i]);
free(K);
return K[size][Weight];
To this:
int retVal;
...
retVal = K[size][Weight];
for (i = 0; i < size + 1; i++)
free(K[i]);
free(K);
return retVal;
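Separately from the use-after-free, it may be worth checking the allocations. For the largest batch in the question (10000 items, Weight = 250000) the full table has 10001 x 250001 entries, roughly 10 GB with a 4-byte int, so calloc can return NULL. The following is a sketch of the common one-row formulation of 0-1 knapsack, which needs only W + 1 entries. The name knapsack_1d and the -1 error value are hypothetical choices, not part of the original program.

```c
#include <stdlib.h>

/* 0-1 knapsack using a single row of W + 1 entries.
   Returns the best value, or -1 if the allocation fails. */
int knapsack_1d(int W, const int wt[], const int val[], int n)
{
    int *best = calloc((size_t)W + 1, sizeof *best);
    if (best == NULL)
        return -1;

    for (int i = 0; i < n; i++) {
        /* Descending w, so each item is counted at most once. */
        for (int w = W; w >= wt[i]; w--) {
            int take = val[i] + best[w - wt[i]];
            if (take > best[w])
                best[w] = take;
        }
    }

    int result = best[W];   /* read the answer before freeing */
    free(best);
    return result;
}
```

With values {60, 100, 120}, weights {10, 20, 30} and capacity 50, the function returns 220 (the second and third items).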

Running out of memory for 2D arrays in C++/CLI?

I'm dealing with 10 arrays, some of which are 1024x1392 arrays of doubles.
I've tried to dynamically allocate them on the heap with:
double **x_array;
x_array = new double*[NUM_ROWS];
for(int i=0; i < NUM_ROWS; i++) {
x_array[i] = new double[NUM_COLS];
}
for(int ix=0; ix < NUM_COLS; ix++) {
for(int iy=0; iy < NUM_ROWS; iy++) {
x_array[ix][iy]=(x1y1*(ix+1) + x2y1*(iy+1) + x3y1);
//y_array[ix][iy]=(x1y2*(ix+1) + x2y2*(iy+1) + x3y2);
}
}
}
but I still get errors saying
unhandled exception: System.Runtime.InteropServices.SEHException: External Component has thrown an exception. at line 106
and 106 is where I begin initializing the array in the code above:
x_array = new double*[NUM_ROWS];
Am I really running out of space, or am I doing something wrong?
You have your array indices transposed:
for(int ix=0; ix < NUM_COLS; ix++) {
for(int iy=0; iy < NUM_ROWS; iy++) {
x_array[ix][iy]=(x1y1*(ix+1) + x2y1*(iy+1) + x3y1);
should be:
for(int iy=0; iy < NUM_ROWS; iy++) {
for(int ix=0; ix < NUM_COLS; ix++) {
x_array[iy][ix]=(x1y1*(ix+1) + x2y1*(iy+1) + x3y1);
or if you really have to keep the cache-hostile loop ordering for some reason:
for(int ix=0; ix < NUM_COLS; ix++) {
for(int iy=0; iy < NUM_ROWS; iy++) {
x_array[iy][ix]=(x1y1*(ix+1) + x2y1*(iy+1) + x3y1);
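An alternative that makes this kind of index mix-up harder to write is to allocate the whole array as one block and compute the position explicitly. This is a sketch in plain C rather than C++/CLI. The names alloc_grid and fill_grid and the coefficients c1, c2, c3 are hypothetical stand-ins for x1y1, x2y1 and x3y1.

```c
#include <stdlib.h>

/* Allocate rows * cols doubles in one block.
   Element (iy, ix) lives at a[iy * cols + ix]. Returns NULL on failure. */
double *alloc_grid(size_t rows, size_t cols)
{
    return malloc(rows * cols * sizeof(double));
}

/* Fill with c1*(ix+1) + c2*(iy+1) + c3, row index in the outer loop
   so that memory is visited in order. */
void fill_grid(double *a, size_t rows, size_t cols,
               double c1, double c2, double c3)
{
    for (size_t iy = 0; iy < rows; iy++)
        for (size_t ix = 0; ix < cols; ix++)
            a[iy * cols + ix] = c1 * (double)(ix + 1)
                              + c2 * (double)(iy + 1) + c3;
}
```

A 1024x1392 array of doubles is about 11 MB, so ten of them fit comfortably in memory on most machines; the caller still needs to check the pointer for NULL and free it afterwards.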

Resources