Quicksort implementation on MPI

We have a quicksort algorithm implemented with MPI. A question came up about the code: why are we doing this?
quicksort(chunk, 0, own_chunk_size);

for (int step = 1; step < number_of_process; step = 2 * step) {
    // Senders: processes whose rank is an odd multiple of step pass
    // their sorted chunk to the partner step ranks below and drop out.
    if (rank_of_process % (2 * step) != 0) {
        MPI_Send(chunk, own_chunk_size, MPI_INT,
                 rank_of_process - step, 0, MPI_COMM_WORLD);
        break;
    }
    // Receivers: merge in the partner's chunk, if that partner exists.
    if (rank_of_process + step < number_of_process) {
        int received_chunk_size =
            (number_of_elements >= chunk_size * (rank_of_process + 2 * step))
                ? (chunk_size * step)
                : (number_of_elements - chunk_size * (rank_of_process + step));
        int* chunk_received =
            (int*)malloc(received_chunk_size * sizeof(int));
        MPI_Recv(chunk_received, received_chunk_size, MPI_INT,
                 rank_of_process + step, 0, MPI_COMM_WORLD, &status);
        data = merge(chunk, own_chunk_size,
                     chunk_received, received_chunk_size);
        free(chunk);
        free(chunk_received);
        chunk = data;
        own_chunk_size = own_chunk_size + received_chunk_size;
    }
}
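(For reference, merge() itself is not shown here. Judging by how it is called, it would be a standard two-way merge of two sorted int arrays, along these lines; this is a sketch, not the poster's actual code:)

int* merge(int* a, int na, int* b, int nb) {
    // Merge two sorted arrays into a newly allocated one
    int* out = (int*)malloc((na + nb) * sizeof(int));
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return out;
}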
Questions arose about these lines:
if (rank_of_process % (2 * step) != 0)
and
if (rank_of_process + step < number_of_process)
Why do we need these conditions, and what do they do? As I understand it, the sorting is implemented using the block method (each process sorts its own block of data).

Your code does a local quicksort on each process. That's the first line, what you call the "block method". After that you have a tree-wise implementation of a merge: first you merge with step 1, that is, neighboring processes, then with step 2, step 4, et cetera. In essence this is mergesort. Your two conditions implement that tree: the modulo test picks out, at each level, the processes that act as senders, which ship their chunk to the partner step ranks below them and then leave the loop; the rank_of_process + step < number_of_process test keeps a receiver from waiting on a partner that doesn't exist when the number of processes is not a power of two.
Time-wise, there are log2(P) merge steps, of lengths n, 2n, 4n, ..., N/2, where N is the total number of elements and n = N/P the local count, so the merge phase takes essentially N time. In the limiting case of P = N (and therefore n = 1) the parallel running time is N, so your speedup over the sequential complexity of N log(N) is log(N) = log(P), which is very far from optimal.
A correct parallel quicksort (which is quite hard to code) would use a prefix operation for the initial red/white/blue split, so that step has complexity log(N) instead of the N of your local merge. Parallel quicksort then has a complexity of log(N) squared, giving a speedup of P/log(P), which is a lot better than your mergesort-like code.
The problem with your code is that you are not properly thinking in distributed-memory terms: you start out with everything on process zero and you bring everything back to process zero. Thus you are mixing a true distributed implementation with a manager/worker model. This is not optimally parallel, so you get complexity that is closer to sequential than parallel.
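To make the prefix idea concrete, here is a minimal, hedged sketch (my example, not part of the original code) of the scan a prefix-based split relies on: each process counts its local elements below the pivot, and an exclusive scan tells it where its block starts in the globally partitioned order.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local_below_pivot = rank + 1;  /* stand-in for a real local count */
    int global_offset = 0;
    /* Exclusive prefix sum over ranks: global_offset becomes the number of
       below-pivot elements held by all lower-ranked processes, i.e. the
       global start index of this process's below-pivot block. */
    MPI_Exscan(&local_below_pivot, &global_offset, 1, MPI_INT,
               MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) global_offset = 0;  /* MPI_Exscan leaves rank 0 undefined */

    printf("rank %d: below-pivot block starts at global index %d\n",
           rank, global_offset);
    MPI_Finalize();
    return 0;
}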

Related

How do I calculate the time complexity of this recursive function, which divides the input value by three, or divides it and then adds the input value?

I am having difficulties determining the time complexity of the code below:
int func(int n) { // n > 0
    if (n < 2) {
        return 1;
    } else if (n % 2 == 0) {
        return func(n / 3);
    } else {
        return func(n / 3) + n;
    }
}
I have attempted to approach this question using the Master Theorem, trying to break it down into:
n = size of input
a = number of sub-problems in the recursion = 3
n/b = size of each sub-problem
f(n) = cost of the work done outside the recursive call
However, I am struggling to understand how to determine the size of each sub-problem and f(n), the cost of the work done outside the recursive call. At the moment I am just assuming that we take the greater time complexity of the if/else branches, so the time complexity would be O(log n).
Also, does the '+ n' in the else statement affect the time complexity of this function?
Any help to understand this would be greatly appreciated!

Time complexity of reversing a stack with recursion

I have tried to find the time complexity of the following code.
int insert_at_bottom(int x, stack st) {
    if (st is empty) {
        st.push(x)
    } else {
        int a = st.top()
        st.pop()
        insert_at_bottom(x, st)
        st.push(a)
    }
}

void reverse(stack st) {
    if (st is not empty) {
        int x = st.top()
        st.pop()
        reverse(st)
        insert_at_bottom(x, st)
    }
}

// driver function
int[] reverseStack(int[] st) {
    reverse(st)
    return st
}
For each element on top of the stack, we pop the whole stack out and place that top element at the bottom, which takes O(n) operations. And these O(n) operations are performed for every element in the stack, so the time complexity should be O(n^2).
However, I want to derive the time complexity mathematically. I tried to find the recurrence relation, and I got T(n) = 2T(n-1) + 1. This is probably wrong, as the time complexity of the second function call should not be taken as T(n-1).
Your reasoning is generally correct: if insertion at the bottom takes O(n) time, then the reverse function takes O(n^2) time, because it performs a linear-time operation on the stack for each element.
The recurrence relation for reverse() would look a bit different, though. In each step, you do three things:
Call itself on n-1
An O(n) time operation (insert_at_bottom())
Some constant-time stuff
Thus, you can just sum these together. So I would argue it can be written as:
T(n) = T(n-1) + n + c, where c is a constant.
You will find that, due to the recursion, T(n-1) = T(n-2) + (n-1) + c. So, if you keep expanding the series in this fashion down to n = 0, you obtain:
T(n) = 1 + 2 + ... + (n-1) + n + nc
Since 1 + 2 + ... + n = n(n+1)/2, we obtain
T(n) = n(n+1)/2 + nc = n^2/2 + n/2 + nc = O(n^2). □
The O(n) time of insert_at_bottom() can be shown in a similar way.
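For completeness, here is a runnable C++ transcription of the pseudocode above (mine, using std::stack), handy if you want to check the behavior or count operations:

#include <cstdio>
#include <initializer_list>
#include <stack>

void insert_at_bottom(int x, std::stack<int>& st) {
    if (st.empty()) {
        st.push(x);
    } else {
        int a = st.top();
        st.pop();
        insert_at_bottom(x, st);  // recurses to the bottom: O(size) work
        st.push(a);
    }
}

void reverse(std::stack<int>& st) {
    if (!st.empty()) {
        int x = st.top();
        st.pop();
        reverse(st);              // T(n-1)
        insert_at_bottom(x, st);  // + n + c
    }
}

int main() {
    std::stack<int> st;
    for (int v : {1, 2, 3, 4}) st.push(v);
    reverse(st);
    while (!st.empty()) { std::printf("%d ", st.top()); st.pop(); }
    std::printf("\n");  // prints "1 2 3 4": the top was 4 before, 1 after
}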

Gaussian Elimination Parallelism

I have successfully implemented a single-threaded program in CUDA for Gaussian elimination and would now like to achieve parallelism. Up to this point the parallel code looks like:
__global__ void ParallelGaussian(double* A)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    if (index < ROWS) // Skip additional threads
    {
        for (unsigned int r = index; r < ROWS; r += stride)
        {
            // Forward elimination to reduce to row echelon form
            for (unsigned int k = r + 1; k < ROWS; ++k)
            {
                double c = -A[(ROWS + 1) * k + r] / A[(ROWS + 1) * r + r];
                for (unsigned int j = r; j < ROWS + 1; ++j)
                {
                    if (r == j)
                        A[(ROWS + 1) * k + j] = 0.0;
                    else
                        A[(ROWS + 1) * k + j] += c * A[(ROWS + 1) * r + j];
                }
            }
        }
    }
}
As we can see, the code on the GPU transforms the 1D array (matrix) into an upper triangular matrix, and then on the CPU I will continue with back substitution to get the final result. No pivoting is done in this approach; it is not strictly needed here, though it would improve the numerical stability of the algorithm.
Launching the kernel with a single thread and a single block works and transforms the matrix into row echelon form:
ParallelGaussian<<<1, 1>>>(dev_a);
However, if I increase the number of threads, like
ParallelGaussian<<<1, 32>>>(dev_a);
it fails to produce the triangular matrix. Adding __syncthreads() calls to the code in order to synchronize the threads in a block doesn't improve the situation whatsoever, and I can't figure out why.
Consider your inner loop. Every thread accesses A, and since k and j run from r to the end of the matrix, there is the potential for multiple threads to modify the same A[(ROWS + 1) * k + j] value.
You also potentially have some threads accessing A[(ROWS + 1) * r + j] while other threads are updating that value.
One possible solution is to have each thread accumulate into its own result array and combine those at the end, but this is memory-intensive.
Another is to restructure the computation so that only one thread ever writes a particular value, storing results in a new matrix (so that you don't change any value that might still be needed by a different thread).
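One hedged sketch of that second idea (my code, not a drop-in fix): make the pivot loop sequential and give each thread its own disjoint set of rows below the pivot, so no element is ever written by two threads. This only synchronizes correctly within a single block, so it assumes a launch like ParallelGaussianRows<<<1, 128>>>(dev_a) and the same ROWS constant as in the question:

__global__ void ParallelGaussianRows(double* A) // hypothetical variant
{
    // The pivot loop is sequential; all threads agree on r.
    for (unsigned int r = 0; r < ROWS; ++r)
    {
        // Each thread eliminates a disjoint set of rows below the pivot,
        // so every element of A has exactly one writer.
        for (unsigned int k = r + 1 + threadIdx.x; k < ROWS; k += blockDim.x)
        {
            double c = -A[(ROWS + 1) * k + r] / A[(ROWS + 1) * r + r];
            for (unsigned int j = r; j < ROWS + 1; ++j)
                A[(ROWS + 1) * k + j] = (r == j)
                    ? 0.0
                    : A[(ROWS + 1) * k + j] + c * A[(ROWS + 1) * r + j];
        }
        // Row r + 1 must be fully updated before it becomes the next pivot
        // (valid because all threads are in a single block).
        __syncthreads();
    }
}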

Dynamic programming problems using iteration

I have spent a lot of time learning to implement/visualize dynamic programming problems using iteration, but I find it very hard to understand. I can implement the same problems using recursion with memoization, but that is slow compared to iteration.
Can someone explain this with an example of a hard problem or by using some basic concepts? Like matrix chain multiplication, longest palindromic subsequence, and others. I can understand the recursion process and then memoize the overlapping sub-problems for efficiency, but I can't understand how to do the same using iteration.
Thanks!
Dynamic programming is all about solving the sub-problems in order to solve the bigger one. The difference between the recursive approach and the iterative approach is that the former is top-down and the latter is bottom-up. In other words, using recursion, you start from the big problem you are trying to solve and chop it down into slightly smaller sub-problems, on which you repeat the process until you reach a sub-problem small enough to solve directly. This has the advantage that you only have to solve the sub-problems that are absolutely needed, using memoization to remember the results as you go. The bottom-up approach first solves all the sub-problems, using tabulation to remember the results. If we are not doing extra work solving sub-problems that are not needed, this is the better approach.
For a simpler example, let's look at the Fibonacci sequence. Say we'd like to compute F(101). When doing it recursively, we will start with our big problem - F(101). For that, we notice that we need to compute F(99) and F(100). Then, for F(99) we need F(97) and F(98). We continue until we reach the smallest solvable sub-problem, which is F(1), and memoize the results. When doing it iteratively, we start from the smallest sub-problem, F(1) and continue all the way up, keeping the results in a table (so essentially it's just a simple for loop from 1 to 101 in this case).
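In code, the bottom-up version is literally that loop. A minimal C++ sketch of my own, using F(1) = F(2) = 1 as above (beware that the true value of F(n) overflows 64 bits past n of roughly 93; this only illustrates the loop structure):

#include <cstdint>
#include <vector>

// Bottom-up: fill the table from the smallest sub-problem upward.
std::uint64_t fib_bottom_up(int n) {
    std::vector<std::uint64_t> f(n + 1, 0);
    f[1] = 1;
    if (n >= 2) f[2] = 1;
    for (int i = 3; i <= n; ++i)
        f[i] = f[i - 1] + f[i - 2]; // each entry needs only the previous two
    return f[n];
}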
Let's take a look at the matrix chain multiplication problem, which you requested. We'll start with a naive recursive implementation, then recursive DP, and finally iterative DP. It's going to be implemented in a C/C++ soup, but you should be able to follow along even if you are not very familiar with them.
/* Solve the problem recursively (naive)
   p - matrix dimensions
   n - size of p
   i..j - state (sub-problem): range of parenthesis */
int solve_rn(int p[], int n, int i, int j) {
    // A matrix multiplied by itself needs no operations
    if (i == j) return 0;
    // A minimal solution for this sub-problem; we
    // initialize it with the maximal possible value
    int min = std::numeric_limits<int>::max();
    // Recursively solve all the sub-problems
    for (int k = i; k < j; ++k) {
        int tmp = solve_rn(p, n, i, k) + solve_rn(p, n, k + 1, j)
                + p[i - 1] * p[k] * p[j];
        if (tmp < min) min = tmp;
    }
    // Return the solution for this sub-problem
    return min;
}
To compute the result, we start with the big problem:
solve_rn(p, n, 1, n - 1)
The key to DP is to remember all the solutions to the sub-problems instead of forgetting them, so we don't need to recompute them. It's trivial to make a few adjustments to the above code to achieve that:
/* Solve the problem recursively (DP)
   p - matrix dimensions
   n - size of p
   i..j - state (sub-problem): range of parenthesis */
int solve_r(int p[], int n, int i, int j) {
    /* We need to remember the results for state i..j.
       This can be done in a matrix, which we call dp,
       such that dp[i][j] is the best solution for the
       state i..j. We initialize everything to 0 first.
       The static keyword here is just a C/C++ thing for
       keeping the matrix between function calls; you can
       also make it global or pass it as a parameter each
       time. MAXN is here because the array size has to be
       a constant in C/C++ when declared like this. I set
       it to 100 here, but you can do it some other way
       if you don't like it. */
    static int dp[MAXN][MAXN] = {{0}};
    /* A matrix multiplied by itself needs 0 operations, so we
       can just return 0. Also, if we already computed the result
       for this state, just return that. */
    if (i == j) return 0;
    else if (dp[i][j] != 0) return dp[i][j];
    // A minimal solution for this sub-problem; we
    // initialize it with the maximal possible value
    dp[i][j] = std::numeric_limits<int>::max();
    // Recursively solve all the sub-problems
    for (int k = i; k < j; ++k) {
        int tmp = solve_r(p, n, i, k) + solve_r(p, n, k + 1, j)
                + p[i - 1] * p[k] * p[j];
        if (tmp < dp[i][j]) dp[i][j] = tmp;
    }
    // Return the solution for this sub-problem
    return dp[i][j];
}
We start with the big problem as well:
solve_r(p, n, 1, n - 1)
The iterative solution just, well, iterates over all the states, instead of starting from the top:
/* Solve the problem iteratively
   p - matrix dimensions
   n - size of p
   We don't need to pass the state, because we iterate over the states. */
int solve_i(int p[], int n) {
    // But we do need our table, just like before
    static int dp[MAXN][MAXN];
    // Multiplying a matrix by itself needs no operations
    for (int i = 1; i < n; ++i)
        dp[i][i] = 0;
    // L represents the length of the chain. We go from smallest to
    // biggest. L is capital to distinguish the letter l from the digit 1.
    for (int L = 2; L < n; ++L) {
        // This double loop goes through all the states with the current
        // chain length; i <= n - L keeps j = i + L - 1 within the valid
        // states 1..n-1.
        for (int i = 1; i <= n - L; ++i) {
            int j = i + L - 1;
            dp[i][j] = std::numeric_limits<int>::max();
            for (int k = i; k <= j - 1; ++k) {
                int tmp = dp[i][k] + dp[k + 1][j] + p[i - 1] * p[k] * p[j];
                if (tmp < dp[i][j])
                    dp[i][j] = tmp;
            }
        }
    }
    // Return the result of the biggest problem
    return dp[1][n - 1];
}
To compute the result, just call it:
solve_i(p, n)
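To make the calls concrete, a small usage example of my own (the dimensions are not from the question), assuming the three functions above and MAXN defined, e.g. as 100:

#include <cstdio>

int main() {
    int p[] = {5, 10, 3, 12, 5};  // A(5x10), B(10x3), C(3x12), D(12x5)
    int n = 5;                    // size of p: 4 matrices, 5 dimensions
    // All three agree: 405 scalar multiplications, achieved by (A B)(C D)
    std::printf("%d %d %d\n",
                solve_rn(p, n, 1, n - 1),
                solve_r(p, n, 1, n - 1),
                solve_i(p, n));
}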
Explanation of the loop counters in the last example:
Let's say we need to optimize the multiplication of 4 matrices: A B C D. In the iterative approach, we first compute the chains of length two: (A B) C D, A (B C) D, and A B (C D). Then the chains of length three: (A B C) D and A (B C D). And finally the full chain A B C D. That is what L, i and j are for.
L represents the chain length; it goes from 2 to n - 1 (here n is 5, the size of p, so L goes up to 4, the full chain).
i and j represent the starting and ending position of the chain. In case L = 2, i goes from 1 to 3, and j goes from 2 to 4:
(A B) C D    A (B C) D    A B (C D)
 ^ ^            ^ ^            ^ ^
 i j            i j            i j
In case L = 3, i goes from 1 to 2, and j goes from 3 to 4:
(A B C) D    A (B C D)
 ^   ^          ^   ^
 i   j          i   j
So generally, i goes from 1 to n - L (with n the size of p, as in the code), and j is i + L - 1.
Now, let's continue with the algorithm, assuming that we are at the step where we have (A B C) D. We now need to take into account the sub-problems, which are already calculated: ((A B) C) D and (A (B C)) D. That is what k is for: it goes through all the split positions between i and j and computes the sub-problems.
I hope I helped.
The problem with recursion is the high number of stack frames that need to be pushed and popped. This can quickly become the bottleneck.
The Fibonacci series can be calculated with iterative DP or with recursion plus memoization. If we calculate F(100) with DP, all we need is an array of length 100, e.g. int[100], and that's the bulk of our memory use. We calculate all entries of the array, pre-filling f[0] and f[1] as they are defined to be 1; each subsequent value just depends on the previous two.
If we use a recursive solution, we start at fib(100) and work down. Every method call from 100 down to 0 is pushed onto the stack AND checked against the memo. These operations add up, and iteration doesn't suffer from either of them. In iteration (bottom-up) we already know all of the previous answers are valid. The bigger impact is probably the stack frames: given a larger input, you may get a StackOverflowException for what was otherwise trivial with an iterative DP approach.
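For contrast, a minimal sketch of my own of the top-down version described here, with the same f[0] = f[1] = 1 convention (as before, the exact value overflows 64 bits for large n, so this only illustrates the call structure):

#include <cstdint>
#include <vector>

// Every call pushes a stack frame and consults the memo table,
// which is exactly the overhead discussed above.
std::uint64_t fib_memo(int n, std::vector<std::uint64_t>& memo) {
    if (n <= 1) return 1;              // f[0] = f[1] = 1
    if (memo[n] != 0) return memo[n];  // already computed: reuse it
    return memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo);
}
// usage: std::vector<std::uint64_t> memo(n + 1, 0); fib_memo(n, memo);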

Race condition in opencl kernel threads

If multiple threads simultaneously write to a single memory location, there will be a race condition, right?
In my case that is exactly what is happening.
Consider this fragment from 'reduce.cl':
int i = get_global_id(0);
int n, j;
n = keyMobj[i];  // n is the key; it can be either 0 or 1
for (j = 0; j < 2; j++)
    sumMobj[n*2 + j] += dataMobj[i].dattr[j];  // summing operation
Here, the memory locations sumMobj[0] and sumMobj[1] are accessed by 4 threads simultaneously, and sumMobj[2] and sumMobj[3] are accessed by 6 threads simultaneously.
Is there any way to still do this in parallel, e.g. using locking or a semaphore? This summing is a very big part of my algorithm...
I can give you some hints, as I was facing a similar problem.
I can think of three different methods for achieving this goal.
Consider a simple kernel, assuming you launched 4 threads (ids 0-3):
__kernel void addition(__global int *p)
{
    int i = get_local_id(0);
    p[4] += p[i];  // every thread writes p[4]: a race condition
}
You want to add the values p[0], p[1], p[2], p[3], p[4] and store the final sum in p[4], right? I.e.:
p[4] = p[0] + p[1] + p[2] + p[3] + p[4]
Method 1 (no parallelism)
Assign the whole job to a single thread (no parallelism):
int i = get_local_id(0);
if (i == 0)
{
    for (int k = 0; k < 4; k++)
        p[4] += p[k];  // only thread 0 runs this, so there is no race
}
Method 2 (with parallelism)
Express your problem as follows:
p[4] = p[0] + p[1] + p[2] + p[3] + p[4] + 0
This is a reduction problem.
So launch 3 threads, i = 0 to i = 2. In the first iteration,
i=0 computes p[0] + p[1]
i=1 computes p[2] + p[3]
i=2 computes p[4] + 0
Now you have three numbers; apply the same logic to add those (with suitable padding of zeros to make the count a power of two), as sketched below.
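The same pairing written as an ordinary sequential C++ loop (my illustration; on the GPU, each inner-loop iteration would be one thread, with a barrier between passes):

#include <cstdio>
#include <vector>

int main() {
    // Five values padded with zeros up to a power of two, as suggested above
    std::vector<int> p = {3, 1, 4, 1, 5, 0, 0, 0};
    // Each pass halves the number of active partial sums: log2(8) = 3 passes
    for (std::size_t stride = p.size() / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)
            p[i] += p[i + stride];
    std::printf("sum = %d\n", p[0]);  // prints "sum = 14"
}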
Method 3 (atomic operations)
If you still need to update the sums concurrently, you can use atomic_add():
int atomic_add(volatile __global int *p, int val)
Description
Read the 32-bit value (referred to as old) stored at the location pointed
to by p. Compute (old + val) and store the result at the location pointed
to by p. The function returns old.
This is assuming the data is of int type; for other types, see the corresponding atomic functions in the OpenCL documentation.
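Applied to the reduce.cl fragment above, the summing line would then become something like atomic_add(&sumMobj[n*2 + j], dataMobj[i].dattr[j]), assuming sumMobj and dattr really are of int type (the int atomics don't apply directly to other types).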
