Race condition in OpenCL kernel threads


If multiple threads write to a single memory location simultaneously, there will be a race condition, right?
That is what is happening in my case.
Consider this fragment from 'reduce.cl':
int i = get_global_id(0);
int n, j;
n = keyMobj[i]; // n is the key; it can be either 0 or 1
for (j = 0; j < 2; j++)
    sumMobj[n*2+j] += dataMobj[i].dattr[j]; // summing operation
Here, the memory locations
sumMobj[0] and sumMobj[1] are accessed by 4 threads simultaneously, and
sumMobj[2] and sumMobj[3] are accessed by 6 threads simultaneously.
Is there any way to still do this in parallel, for example with locking or a semaphore? This summation is a very big part of my algorithm.

I can give you some hints, as I faced a similar problem.
I can think of three different methods for achieving this goal.
Consider a simple kernel, assuming you launched 4 threads (0-3):
__kernel void addition(__global int *p)
{
    int i = get_local_id(0);
    p[4] += p[i];
}
You want to add the values p[0], p[1], p[2], p[3] and p[4], and store the final sum in p[4], right? That is:
p[4] = p[0] + p[1] + p[2] + p[3] + p[4]
Method 1 (no parallelism)
Assign this job to only one thread:
int i = get_local_id(0);
if (i == 0)
{
    // only thread 0 does the work, so nothing else writes to p[4]
    for (int k = 0; k < 4; k++)
        p[4] += p[k];
}
Method 2 (with parallelism)
Express your problem as follows:
p[4] = p[0] + p[1] + p[2] + p[3] + p[4] + 0
This is a reduction problem.
So launch 3 threads, i=0 to i=2. In the first iteration:
i=0 finds p[0] + p[1]
i=1 finds p[2] + p[3]
i=2 finds p[4] + 0
Now you have three numbers. Apply the same logic again to add them (padding with 0 as needed so that the count is a power of two).
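The pairwise scheme of Method 2 can be sketched in plain C, with an ordinary loop standing in for the threads of each pass. This is only a sequential illustration, not OpenCL code: the name reduce_sum is made up here, the result ends up in p[0] rather than p[4], and a real kernel would need a barrier between passes.

```c
#include <stddef.h>

/* Sum n ints in place by repeated pairwise addition.
   Within one pass the inner-loop iterations touch disjoint elements,
   so on a GPU each of them could be given to its own thread. */
int reduce_sum(int *p, size_t n)
{
    if (n == 0)
        return 0;
    /* The distance between partners doubles every pass: 1, 2, 4, ... */
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            /* An element without a partner is left alone,
               which has the same effect as padding with 0. */
            p[i] += p[i + stride];
        }
    }
    return p[0];
}
```

For five elements this takes three passes (strides 1, 2 and 4) instead of four sequential additions by a single thread; the gain grows with the input size.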
Method 3 (atomic operations)
If you still need to do this atomically, you can use atomic_add():
int atomic_add(volatile __global int *p, int val)
Description
Read the 32-bit value (referred to as old) stored at location pointed
by p. Compute (old + val) and store result at location pointed by p.
The function returns old.
This is assuming the data is int type. Otherwise you can see the link as suggested above.
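As a host-side analogue, C11's atomic_fetch_add has the same read, add, store, return-old behaviour. The sketch below applies it to the per-key sums from the question; the function name accumulate_by_key and the flat data layout are assumptions made for illustration, and the loop over i stands in for the work-items. In the kernel itself the corresponding call would presumably be atomic_add(&sumMobj[n*2+j], dataMobj[i].dattr[j]), provided sumMobj is a __global int buffer.

```c
#include <stdatomic.h>
#include <stddef.h>

/* sums holds 2 slots per key (keys are 0 or 1), like sumMobj in the question.
   data holds 2 ints per item, stored back to back (an assumed layout). */
void accumulate_by_key(atomic_int *sums, const int *keys,
                       const int *data, size_t count)
{
    for (size_t i = 0; i < count; ++i) {        /* one iteration per work-item */
        for (int j = 0; j < 2; ++j) {
            /* Atomic read-modify-write: safe even if several threads hit
               the same slot. The returned old value is not needed here. */
            atomic_fetch_add(&sums[keys[i] * 2 + j], data[i * 2 + j]);
        }
    }
}
```

Atomics serialize the writers that collide on a slot, so with only four distinct slots this is correct but may not scale well; the reduction in Method 2 avoids that contention.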

Related

Gaussian Elimination Parallelism

I have successfully implemented a single-threaded program in CUDA for Gaussian elimination and would like to achieve parallelism. Up to this point the parallel code looks like:
__global__ void ParallelGaussian(double* A)
{
int index = threadIdx.x;
int stride = blockDim.x;
if (index < ROWS) //Skip additional threads
{
for (unsigned int r = index; r < ROWS; r += stride)
{
//Forward elimination to reduce to row echelon form
for (unsigned int k = r + 1; k < ROWS; ++k)
{
double c = -A[(ROWS + 1) * k + r] / A[(ROWS + 1) * r + r];
for (unsigned int j = r; j < ROWS + 1; ++j)
{
if (r == j)
A[(ROWS + 1) * k + j] = 0.0;
else
A[(ROWS + 1) * k + j] += c * A[(ROWS + 1) * r + j];
}
}
}
}
}
As we can see, the code on the GPU transforms the 1D array (matrix) into an upper triangular matrix, and then on the CPU I continue with back substitution to get the final result. No pivoting is done in this approach; it is not strictly needed, although it does improve the numerical stability of the algorithm.
Launching the kernel with a single block and a single thread works and transforms the matrix into row echelon form:
ParallelGaussian<<<1, 1>>>(dev_a);
However, if I increase the number of threads, like
ParallelGaussian<<<1, 32>>>(dev_a);
it fails to produce the upper triangular matrix. Adding __syncthreads() calls to the code in order to synchronize the threads in a block doesn't improve the situation whatsoever, and I can't figure out why.

Consider your inner loop. Every thread accesses A, and since k and j run from r to the end of the matrix, there is the potential for multiple threads to modify the same A[(ROWS + 1) * k + j] value.
You also potentially have some threads accessing A[(ROWS + 1) * r + j] while other threads are updating that value.
One possible solution is to have each thread accumulate into individual result arrays, then combine those at the end. This is memory intensive.
Another would be to restructure this so that only one thread writes to a particular value, and to store those values in a new matrix (so that you don't change any value that might be needed by a different thread).
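One way to read the second suggestion is sketched below in plain C. It is a variant, not the answer's exact proposal: it works in place instead of writing to a new matrix, keeps the pivot loop sequential, and relies on the fact that for a fixed pivot r every row k is written by exactly one loop iteration. The function name forward_eliminate is made up for illustration.

```c
#include <stddef.h>

/* Forward elimination on an augmented rows x (rows + 1) matrix, row-major.
   No pivoting is done, so every pivot a[r][r] is assumed to be non-zero. */
void forward_eliminate(double *a, size_t rows)
{
    size_t cols = rows + 1;
    for (size_t r = 0; r + 1 < rows; ++r) {      /* pivots: strictly in order */
        /* For a fixed r, iteration k writes only row k and reads only
           row r, which no iteration modifies. These iterations are
           therefore independent and are the part that could be spread
           over threads, with a barrier before the next pivot. */
        for (size_t k = r + 1; k < rows; ++k) {
            double c = -a[cols * k + r] / a[cols * r + r];
            a[cols * k + r] = 0.0;
            for (size_t j = r + 1; j < cols; ++j)
                a[cols * k + j] += c * a[cols * r + j];
        }
    }
}
```

The kernel in the question distributes the pivot index r itself across threads, so a thread can eliminate with a pivot row that other threads have not finished updating. Distributing k instead removes the overlapping writes described above.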

grid unique paths - recursion

I am trying to solve the grid unique paths problem. The problem involves finding the number of possible unique paths in a 2D grid starting from top left (0,0) to the bottom right (say A,B). One can only move right or down. Here is my initial attempt:
#include <stdio.h>
int count=0;
void uniquePathsRecur(int r, int c, int A, int B){
if(r==A-1 & c==B-1){
count++;
return;
}
if(r<A-1){
return uniquePathsRecur(r++,c,A,B);
}
if(c<B-1){
return uniquePathsRecur(r,c++,A,B);
}
}
int uniquePaths(int A, int B) {
if(B==1 | A==1){
return 1;
}
uniquePathsRecur(0,0,A,B);
return count;
}
int main(){
printf("%d", uniquePaths(5,3));
return 0;
}
I end up getting segmentation fault: 11 with my code. I tried to debug it in lldb and I get the following:
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) r
Process 12171 launched: '<path to process>/a.out' (x86_64)
Process 12171 stopped
* thread #1: tid = 0x531b2e, 0x0000000100000e38 a.out`uniquePathsRecur + 8, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7fff5f3ffffc)
frame #0: 0x0000000100000e38 a.out`uniquePathsRecur + 8
a.out`uniquePathsRecur:
-> 0x100000e38 <+8>: movl %edi, -0x4(%rbp)
0x100000e3b <+11>: movl %esi, -0x8(%rbp)
0x100000e3e <+14>: movl %edx, -0xc(%rbp)
0x100000e41 <+17>: movl %ecx, -0x10(%rbp)
(lldb)
What is wrong with the above code?

I don't know the problem of your code. But you can solve the problem without using recursion.
Method 1: We can solve this problem with simple math. You can only move either down or right at any point, so it takes exactly (m + n) steps to get from the start S to the destination D, where m is the number of moves right and n is the number of moves down. Choosing which n of the (m + n) steps go down fixes the path. Thus, the answer is C(m + n, n).
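A small C sketch of Method 1, assuming m and n are the numbers of right and down moves (for the question's A x B grid that would be m = B - 1 and n = A - 1). The function name binomial_paths is made up, and the 64-bit result can overflow for large grids.

```c
#include <stdint.h>

/* Computes C(m + n, n) incrementally. After step i the running value is
   C(m + i, i), itself a binomial coefficient, so each division is exact. */
uint64_t binomial_paths(unsigned m, unsigned n)
{
    uint64_t result = 1;
    for (unsigned i = 1; i <= n; ++i)
        result = result * (m + i) / i;
    return result;
}
```

For uniquePaths(5, 3) from the question this corresponds to binomial_paths(2, 4), i.e. C(6, 4) = 15.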
Method 2: Let us solve the issue in the computer science way. This is a typical dynamic programming problem. Let us assume the robot is standing at (i, j). How did the robot arrive at (i, j)? The robot could move down from (i - 1, j) or move right from (i, j - 1). So the number of paths to (i, j) is equal to the sum of the number of paths to (i - 1, j) and the number of paths to (i, j - 1). We can use another array to store the number of paths to every node and use the equations below:
paths(i, j) = 1 // i == 0 or j == 0
paths(i, j) = paths(i - 1, j) + paths(i, j - 1) // i != 0 and j != 0
However, given more thought, you will find out that you don't actually need a 2D array to record all the values, since when the robot is at row i, you only need the paths at row (i - 1). So the equations are:
paths(j) = 1 // j == 0 for any i
paths(j) = paths(j - 1) + paths(j) // j != 0 for any i
For more information, please see here: https://algorithm.pingzhang.io/DynamicProgramming/unique_path.html
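The one-row recurrence from Method 2 might look like this in C. The function name unique_paths_dp is made up, the arguments are taken to be the grid's row and column counts (A and B in the question), and returning 0 on invalid input or allocation failure is an arbitrary choice for the sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Number of right/down paths from the top-left to the bottom-right cell
   of a rows x cols grid, using a single row of storage. */
uint64_t unique_paths_dp(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0)
        return 0;
    uint64_t *paths = malloc(cols * sizeof *paths);
    if (paths == NULL)
        return 0;
    for (size_t j = 0; j < cols; ++j)
        paths[j] = 1;                       /* row 0: one path to every cell */
    for (size_t i = 1; i < rows; ++i) {
        for (size_t j = 1; j < cols; ++j) {
            /* paths[j] still holds the value for row i - 1 (from above),
               paths[j - 1] already holds the value for row i (from the left). */
            paths[j] += paths[j - 1];
        }
    }
    uint64_t result = paths[cols - 1];
    free(paths);
    return result;
}
```

There is no recursion here, so the stack depth stays constant regardless of the grid size.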

Dynamic programming problems using iteration

I have spent a lot of time to learn about implementing/visualizing dynamic programming problems using iteration but I find it very hard to understand, I can implement the same using recursion with memoization but it is slow when compared to iteration.
Can someone explain the same by a example of a hard problem or by using some basic concepts. Like the matrix chain multiplication, longest palindromic sub sequence and others. I can understand the recursion process and then memoize the overlapping sub problems for efficiency but I can't understand how to do the same using iteration.
Thanks!

Dynamic programming is all about solving the sub-problems in order to solve the bigger one. The difference between the recursive approach and the iterative approach is that the former is top-down, and the latter is bottom-up. In other words, using recursion, you start from the big problem you are trying to solve and chop it down into slightly smaller sub-problems, on which you repeat the process until you reach a sub-problem so small that you can solve it directly. This has the advantage that you only have to solve the sub-problems that are absolutely needed, using memoization to remember the results as you go. The bottom-up approach first solves all the sub-problems, using tabulation to remember the results. As long as that does not mean extra work solving sub-problems that are not needed, it is the better approach.
For a simpler example, let's look at the Fibonacci sequence. Say we'd like to compute F(101). When doing it recursively, we will start with our big problem - F(101). For that, we notice that we need to compute F(99) and F(100). Then, for F(99) we need F(97) and F(98). We continue until we reach the smallest solvable sub-problem, which is F(1), and memoize the results. When doing it iteratively, we start from the smallest sub-problem, F(1) and continue all the way up, keeping the results in a table (so essentially it's just a simple for loop from 1 to 101 in this case).
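The two Fibonacci strategies can be put side by side in C. This is a sketch with made-up names (fib_memo, fib_iter), using F(1) = F(2) = 1; it stops at F(93), the largest value that fits in an unsigned 64-bit integer, so F(101) from the text would need a big-number type.

```c
#include <stdint.h>

#define FIB_MAX 93   /* F(93) is the last Fibonacci number below 2^64 */

/* Top-down: start from the big problem, recurse, remember results.
   A memo entry of 0 means "not computed yet". */
uint64_t fib_memo(unsigned n)
{
    static uint64_t memo[FIB_MAX + 1];
    if (n == 0 || n > FIB_MAX)
        return 0;                      /* out of range for this sketch */
    if (n <= 2)
        return 1;
    if (memo[n] == 0)
        memo[n] = fib_memo(n - 1) + fib_memo(n - 2);
    return memo[n];
}

/* Bottom-up: fill the table from the smallest sub-problem upwards. */
uint64_t fib_iter(unsigned n)
{
    uint64_t table[FIB_MAX + 1];
    if (n == 0 || n > FIB_MAX)
        return 0;                      /* out of range for this sketch */
    table[1] = 1;
    table[2] = 1;
    for (unsigned i = 3; i <= n; ++i)
        table[i] = table[i - 1] + table[i - 2];
    return table[n];
}
```

Both functions fill the same table entries; the only difference is the order in which they are visited and whether the call stack is involved.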
Let's take a look at the matrix chain multiplication problem, which you requested. We'll start with a naive recursive implementation, then recursive DP, and finally iterative DP. It's going to be implemented in a C/C++ soup, but you should be able to follow along even if you are not very familiar with them.
/* Solve the problem recursively (naive)
p - matrix dimensions
n - size of p
i..j - state (sub-problem): range of parenthesis */
int solve_rn(int p[], int n, int i, int j) {
// A chain consisting of a single matrix needs no multiplications
if (i == j) return 0;
// A minimal solution for this sub-problem, we
// initialize it with the maximal possible value
int min = std::numeric_limits<int>::max();
// Recursively solve all the sub-problems
for (int k = i; k < j; ++k) {
int tmp = solve_rn(p, n, i, k) + solve_rn(p, n, k + 1, j) + p[i - 1] * p[k] * p[j];
if (tmp < min) min = tmp;
}
// Return solution for this sub-problem
return min;
}
To compute the result, we start with the big problem:
solve_rn(p, n, 1, n - 1)
The key of DP is to remember all the solutions to the sub-problems instead of forgetting them, so we don't need to recompute them. It's trivial to make a few adjustments to the above code in order to achieve that:
/* Solve the problem recursively (DP)
p - matrix dimensions
n - size of p
i..j - state (sub-problem): range of parenthesis */
int solve_r(int p[], int n, int i, int j) {
/* We need to remember the results for state i..j.
This can be done in a matrix, which we call dp,
such that dp[i][j] is the best solution for the
state i..j. We initialize everything to 0 first.
static keyword here is just a C/C++ thing for keeping
the matrix between function calls, you can also either
make it global or pass it as a parameter each time.
MAXN is here too because the array size when doing it like
this has to be a constant in C/C++. I set it to 100 here.
But you can do it some other way if you don't like it. */
static int dp[MAXN][MAXN] = {{0}};
/* A chain consisting of a single matrix needs 0 operations, so we
can just return 0. Also, if we already computed the result
for this state, just return that. */
if (i == j) return 0;
else if (dp[i][j] != 0) return dp[i][j];
// A minimal solution for this sub-problem, we
// initialize it with the maximal possible value
dp[i][j] = std::numeric_limits<int>::max();
// Recursively solve all the sub-problems
for (int k = i; k < j; ++k) {
int tmp = solve_r(p, n, i, k) + solve_r(p, n, k + 1, j) + p[i - 1] * p[k] * p[j];
if (tmp < dp[i][j]) dp[i][j] = tmp;
}
// Return solution for this sub-problem
return dp[i][j];
}
We start with the big problem as well:
solve_r(p, n, 1, n - 1)
Iterative solution is only to, well, iterate all the states, instead of starting from the top:
/* Solve the problem iteratively
p - matrix dimensions
n - size of p
We don't need to pass state, because we iterate the states. */
int solve_i(int p[], int n) {
// But we do need our table, just like before
static int dp[MAXN][MAXN];
// A chain consisting of a single matrix needs no multiplications
for (int i = 1; i < n; ++i)
dp[i][i] = 0;
// L represents the length of the chain. We go from smallest, to
// biggest. Made L capital to distinguish letter l from number 1
for (int L = 2; L < n; ++L) {
// This double loop goes through all the states in the current
// chain length.
for (int i = 1; i <= n - L; ++i) { // keeps j = i + L - 1 within 1..n-1, so p[j] stays in bounds
int j = i + L - 1;
dp[i][j] = std::numeric_limits<int>::max();
for (int k = i; k <= j - 1; ++k) {
int tmp = dp[i][k] + dp[k+1][j] + p[i-1] * p[k] * p[j];
if (tmp < dp[i][j])
dp[i][j] = tmp;
}
}
}
// Return the result of the biggest problem
return dp[1][n-1];
}
To compute the result, just call it:
solve_i(p, n)
Explanation of the loop counters in the last example:
Let's say we need to optimize the multiplication of 4 matrices: A B C D. Then p holds 5 dimensions, so n is 5. We are doing an iterative approach, so we will first compute the chains with the length of two: (A B) C D, A (B C) D, and A B (C D). Then the chains of three: (A B C) D and A (B C D). Finally the full chain of four, (A B C D), which is the answer. That is what L, i and j are for.
L represents the chain length; it goes from 2 to n - 1 (n is 5 in this case, so that is 4).
i and j represent the starting and ending position of the chain. In case L = 2, i goes from 1 to 3, and j goes from 2 to 4:
(A B) C D    A (B C) D    A B (C D)
 ^ ^            ^ ^            ^ ^
 i j            i j            i j
In case L = 3, i goes from 1 to 2, and j goes from 3 to 4:
(A B C) D    A (B C D)
 ^   ^          ^   ^
 i   j          i   j
So generally, i goes from 1 to n - L, and j is i + L - 1.
Now, let's continue with the algorithm assuming that we are at the step where we have (A B C) D. We now need to take into account the sub-problems (which are already calculated): ((A B) C) D and (A (B C)) D. That is what k is for. It goes through all the positions between i and j and computes the sub problems.
I hope I helped.

The problem with recursion is the high number of stack frames that need to be pushed and popped. This can quickly become the bottleneck.
The Fibonacci series can be calculated with iterative DP or with recursion plus memoization. If we calculate F(100) with DP, all we need is an array of length 100, e.g. int[100], and that's the bulk of the memory we use. We pre-fill f[0] and f[1], as they are defined to be 1, and then calculate the remaining entries, each of which depends only on the previous two.
If we use a recursive solution we start at fib(100) and work down. Every method call from 100 down to 0 is pushed onto the stack, AND checked if it's memoized. These operations add up and iteration doesn't suffer from either of these. In iteration (bottom-up) we already know all of the previous answers are valid. The bigger impact is probably the stack frames; and given a larger input you may get a StackOverflowException for what was otherwise trivial with an iterative DP approach.

create a random sequence, skip to any part of the sequence

On Linux there is an srand() function: you supply a seed, and it guarantees the same sequence of pseudorandom numbers from subsequent calls to rand().
Let's say I want to store this pseudorandom sequence by remembering the seed value.
Furthermore, let's say I later want the 100 thousandth number in this sequence.
One way would be to supply the seed using srand(), call rand() 100 thousand times, and remember that number.
Is there a better way, one that skips the other 99,999 numbers in the pseudorandom list and gets the 100 thousandth number directly?
thanks,
m

I'm not sure there's a defined standard for implementing rand on any platform; however, picking this one from the GNU Scientific Library:
— Generator: gsl_rng_rand
This is the BSD rand generator. Its sequence is
x_{n+1} = (a x_n + c) mod m
with a = 1103515245, c = 12345 and m = 2^31. The seed specifies the initial value, x1. The period of this generator is 2^31, and it uses 1 word of storage per generator.
So to "know" x_n requires you to know x_{n-1}. Unless there's some obvious pattern I'm missing, you can't jump to a value without computing all the values before it. (But that's not necessarily the case for every rand implementation.)
If we start with x1...
x2 = (a * x1 + c) % m
x3 = (a * ((a * x1 + c) % m) + c) % m
x4 = (a * ((a * ((a * x1 + c) % m) + c) % m) + c) % m
x5 = (a * ((a * ((a * ((a * x1 + c) % m) + c) % m) + c) % m) + c) % m
It gets out of hand pretty quickly. Is that function easily reducible? I don't think it is.
(There's a statistics phrase for a series where x_n depends on x_{n-1} -- can anyone remind me what that word is?)

If they're available on your system, you can use rand_r instead of rand & srand, or use initstate and setstate with random. rand_r takes an unsigned * as an argument, where it stores its state. After calling rand_r numerous times, save the value of this unsigned integer and use it as the starting value the next time.
For random(), use initstate rather than srandom. Save the contents of the state buffer for any state that you want to restore. To restore a state, fill a buffer with the saved contents and call setstate. If that buffer is already the current state buffer, you can skip the call to setstate.

This is developed from @Mark's answer using the BSD rand() function.
rand1() computes the nth random number, starting at seed, by stepping through n times.
rand2() computes the same using a shortcut. It can step up to 2^24-1 steps in one go. Internally it requires only 24 steps.
If the BSD random number generator is good enough for you then this will suffice:
#include <stdio.h>
const unsigned int m = (1U<<31)-1; /* used as a bit mask, i.e. arithmetic mod 2^31 */
unsigned int a[24] = {
1103515245, 1117952617, 1845919505, 1339940641, 1601471041,
187569281 , 1979738369, 387043841 , 1046979585, 1574914049,
1073647617, 285024257 , 1710899201, 1542750209, 2011758593,
1876033537, 1604583425, 1061683201, 2123366401, 2099249153,
2051014657, 1954545665, 1761607681, 1375731713
};
unsigned int b[24] = {
12345, 1406932606, 1449466924, 1293799192, 1695770928, 1680572000,
422948032, 910563712, 519516928, 530212352, 98880512, 646551552,
940781568, 472276992, 1749860352, 278495232, 556990464, 1113980928,
80478208, 160956416, 321912832, 643825664, 1287651328, 427819008
};
unsigned int rand1(unsigned int seed, unsigned int n)
{
int i;
for (i = 0; i<n; ++i)
{
seed = (1103515245U*seed+12345U) & m;
}
return seed;
}
unsigned int rand2(unsigned int seed, unsigned int n)
{
int i;
for (i = 0; i<24; ++i)
{
if (n & (1<<i))
{
seed = (a[i]*seed+b[i]) & m;
}
}
return seed;
}
int main()
{
printf("%u\n", rand1 (10101, 100000));
printf("%u\n", rand2 (10101, 100000));
}
It's not hard to adapt to any linear congruential generator. I computed the tables in a language with a proper integer type (Haskell), but I could have computed them another way in C using only a few lines more code.
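The tables do not have to be precomputed. As a sketch (the names lcg_step_n and lcg_skip are made up here, and the constants are the BSD ones quoted earlier), the multiplier and increment for 2^i steps can be derived on the fly by composing the map x -> a*x + c with itself, which also lifts the 24-bit limit on n:

```c
#include <stdint.h>

#define LCG_A    1103515245u
#define LCG_C    12345u
#define LCG_MASK 0x7fffffffu          /* reduces modulo 2^31 */

/* Reference version: advance the generator n times, one step at a time. */
uint32_t lcg_step_n(uint32_t seed, uint32_t n)
{
    while (n-- > 0)
        seed = (LCG_A * seed + LCG_C) & LCG_MASK;
    return seed;
}

/* Skip-ahead: at most 32 iterations, whatever n is.
   Invariant: x -> a*x + c advances the generator by 2^i steps,
   where i is the number of loop iterations completed so far. */
uint32_t lcg_skip(uint32_t seed, uint32_t n)
{
    uint32_t a = LCG_A, c = LCG_C;
    while (n != 0) {
        if (n & 1u)
            seed = (a * seed + c) & LCG_MASK;
        /* Compose the map with itself: a*(a*x + c) + c = a^2*x + (a + 1)*c */
        c = ((a + 1u) * c) & LCG_MASK;
        a = (a * a) & LCG_MASK;
        n >>= 1;
    }
    return seed;
}
```

Unsigned 32-bit overflow wraps modulo 2^32, and since 2^31 divides 2^32 the final mask still gives the correct residue. The same idea works for other linear congruential generators as long as the modular arithmetic is done without losing bits.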

If you always want the 100,000th item, just store it for later.
Or you could generate the sequence and store that, then query for a particular element by index later.

OR-multiplication on big integers

Multiplication of two n-bit numbers A and B can be understood as a sum of shifts:
(A << i1) + (A << i2) + ...
where i1, i2, ... are the positions of the bits that are set to 1 in B.
Now let's replace PLUS with OR to get the new operation I actually need:
(A << i1) | (A << i2) | ...
This operation is quite similar to regular multiplication, for which there exist many faster algorithms (Schönhage-Strassen, for example).
Is there a similar algorithm for the operation I presented here?
The size of the numbers is 6000 bits.
edit:
For some reason I have no link/button to post comments (any idea why?), so I will edit my question instead.
I am indeed looking for a faster-than-O(n^2) algorithm for the operation defined above.
And yes, I am aware that it is not ordinary multiplication.

Is there a similar algorithm? I think probably not.
Is there some way to speed things up beyond O(n^2)? Possibly. If you consider a number A to be the analogue of A(x) = Σ a_n x^n, where a_n are the binary digits of A, then your operation with bitwise ORs (let's call it A ⊕ B) can be expressed as follows, where "⇔" means "analogue":
A ⇔ A(x) = Σ a_n x^n
B ⇔ B(x) = Σ b_n x^n
C = A ⊕ B ⇔ C(x) = f(A(x)B(x)) = f(V(x)), where f(V(x)) = f(Σ v_n x^n) = Σ u(v_n) x^n, and u(v_n) = 0 if v_n = 0, u(v_n) = 1 otherwise.
Basically you are doing the equivalent of taking two polynomials and multiplying them together, then identifying all the nonzero terms. From a bit-string standpoint, this means treating the bitstring as an array of samples of zeros or ones, convolving the two arrays, and collapsing the resulting samples that are nonzero. There are fast convolution algorithms that are O(n log n), using FFTs for instance, and the "collapsing" step here is O(n)... but somehow I wonder if the O(n log n) evaluation of fast convolution treats something (like multiplication of large integers) as O(1) so you wouldn't actually get a faster algorithm. Either that, or the constants for orders of growth are so large that you'd have to have thousands of bits before you got any speed advantage. ORing is so simple.
edit: there appears to be something called "binary convolution" (see this book for example) that sounds awfully relevant here, but I can't find any good links to the theory behind it and whether there are fast algorithms.
edit 2: maybe the term is "logical convolution" or "bitwise convolution"... here's a page from CPAN (bleah!) talking a little about it along with Walsh and Hadamard transforms which are kind of the bitwise equivalent to Fourier transforms... hmm, no, that seems to be the analog for XOR rather than OR.

You can do this in O(#1-bits in A * #1-bits in B). In Python, with A and B as ints:
a_bitnums = [x for x in range(A.bit_length()) if (A >> x) & 1]  # positions of the 1 bits in A
b_bitnums = [x for x in range(B.bit_length()) if (B >> x) & 1]  # positions of the 1 bits in B
c_set = 0
for a_bit in a_bitnums:
    for b_bit in b_bitnums:
        c_set |= 1 << (a_bit + b_bit)
This might be worthwhile if A and B are sparse in the number of 1 bits present.
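The same idea in C, restricted to 32-bit operands so that the result fits in 64 bits. This is a sketch only: the names set_bits and or_mul_sparse are made up, and a 6000-bit version would need arrays of words instead of single integers.

```c
#include <stdint.h>

/* Stores the positions of the 1 bits of v in pos and returns how many there are. */
static unsigned set_bits(uint32_t v, unsigned pos[32])
{
    unsigned count = 0;
    for (unsigned i = 0; i < 32; ++i)
        if ((v >> i) & 1u)
            pos[count++] = i;
    return count;
}

/* OR of (a << i) over every set bit i of b. The double loop runs
   (#1-bits in a) * (#1-bits in b) times. */
uint64_t or_mul_sparse(uint32_t a, uint32_t b)
{
    unsigned pa[32], pb[32];
    unsigned na = set_bits(a, pa);
    unsigned nb = set_bits(b, pb);
    uint64_t c = 0;
    for (unsigned x = 0; x < na; ++x)
        for (unsigned y = 0; y < nb; ++y)
            c |= (uint64_t)1 << (pa[x] + pb[y]);   /* largest shift is 62 */
    return c;
}
```

For example, or_mul_sparse(0x13, 0x11) combines binary 10011 and 10001 into 100110011, touching only 3 * 2 bit pairs.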

I presume, you are asking the name for the additive technique you have given
when you write "Is a similar algorithm for operation I presented here?"...
Have you looked at the Peasant multiplication technique?
Please read up the Wikipedia description if you do not get the 3rd column in this example.
 B   X     A
27   X    15 : 1
13        30 : 1
 6        60 : 0
 3       120 : 1
 1       240 : 1
B is 27 == binary form 11011b
27x15 = 15 + 30 + 120 + 240
= (15<<0) + (15<<1) + (15<<3) + (15<<4)
= 405
Sounds familiar?
Here is your algorithm.
Choose the smaller number as your A
Initialize C as your result area
while B is not zero,
if lsb of B is 1, add A to C
left shift A once
right shift B once
C has your multiplication result (unless you rolled over sizeof C)
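The steps above, written as C for 32-bit operands. The names peasant_mul and peasant_or are made up; the second function is the same loop with the addition replaced by OR, which gives the operation asked about in the question.

```c
#include <stdint.h>

/* Peasant multiplication: add the shifted copy of a for every 1 bit of b. */
uint64_t peasant_mul(uint32_t a, uint32_t b)
{
    uint64_t shifted = a;    /* a << (number of iterations done so far) */
    uint64_t c = 0;
    while (b != 0) {
        if (b & 1u)
            c += shifted;
        shifted <<= 1;
        b >>= 1;
    }
    return c;
}

/* Same loop, but the partial results are ORed instead of added. */
uint64_t peasant_or(uint32_t a, uint32_t b)
{
    uint64_t shifted = a;
    uint64_t c = 0;
    while (b != 0) {
        if (b & 1u)
            c |= shifted;
        shifted <<= 1;
        b >>= 1;
    }
    return c;
}
```

With the numbers from the table, peasant_mul(15, 27) gives 405, while peasant_or(15, 27) gives 15 | 30 | 120 | 240 = 255. Both run in time proportional to the bit length of b per word, so this is still the O(n^2) method for n-bit big integers.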
Update If you are trying to get a fast algorithm for the shift and OR operation across 6000 bits,
there might actually be one. I'll think a little more on that.
It would appear like 'blurring' one number over the other. Interesting.
A rather crude example here,
110000011 X 1010101 would look like
      110000011
    110000011
  110000011
110000011
---------------
111111111111111
The number of 1s in the two numbers will decide the amount of blurring towards a number with all its bits set.
Wonder what you want to do with it...
Update2 This is the nature of the shift+OR operation with two 6000 bit numbers.
The result will be 12000 bits of course
the operation can be done with two bit streams; but, need not be done to its entirety
the 'middle' part of the 12000 bit stream will almost certainly be all 1s (provided both numbers are non-zero)
the problem will be in identifying the depth to which we need to process this operation to get both ends of the 12000 bit stream
the pattern at the two ends of the stream will depend on the largest consecutive 1s present in both the numbers
I have not yet got to a clean algorithm for this yet. Have updated for anyone else wanting to recheck or go further from here. Also, describing the need for such an operation might motivate further interest :-)

The best I could come up with is to use a fast out on the looping logic. Combined with the possibility of using the non-zero approach as described by themis, you can answer your question by inspecting less than 2% of the N^2 problem.
Below is some code that gives the timing for numbers that are between 80% and 99% zero.
When the numbers get around 88% zero, using themis' approach switches to being better (was not coded in the sample below, though).
This is not a highly theoretical solution, but it is practical.
OK, here is some "theory" of the problem space:
Basically, each bit for X (the output) is the OR summation of the bits on the diagonal of a grid constructed by having the bits of A along the top (MSB to LSB left to right) and the bits of B along the side (MSB to LSB from top to bottom). Since the bit of X is 1 if any on the diagonal is 1, you can perform an early out on the cell traversal.
The code below does this and shows that even for numbers that are ~87% zero, you only have to check ~2% of the cells. For more dense (more 1's) numbers, that percentage drops even more.
In other words, I would not worry about tricky algorithms and would just do some efficient logic checking. I think the trick is to look at the bits of your output as the diagonals of the grid, as opposed to the bits of A shift-ORed with the bits of B. The trickiest thing in this case is keeping track of the bits you can look at in A and B and how to index the bits properly.
Hopefully this makes sense. Let me know if I need to explain this a bit further (or if you find any problems with this approach).
NOTE: If we knew your problem space a bit better, we could optimize the algorithm accordingly. If your numbers are mostly non-zero, then this approach is better than themis' since his would result in more computations and more storage space needed (sizeof(int) * NNZ).
NOTE 2: This assumes the data is basically bits, and I am using .NET's BitArray to store and access the data. I don't think this would cause any major headaches when translated to other languages. The basic idea still applies.
using System;
using System.Collections;
namespace BigIntegerOr
{
class Program
{
private static Random r = new Random();
private static BitArray WeightedToZeroes(int size, double pctZero, out int nnz)
{
nnz = 0;
BitArray ba = new BitArray(size);
for (int i = 0; i < size; i++)
{
ba[i] = (r.NextDouble() < pctZero) ? false : true;
if (ba[i]) nnz++;
}
return ba;
}
static void Main(string[] args)
{
// make sure there are enough bytes to hold the 6000 bits
int size = (6000 + 7) / 8;
int bits = size * 8;
Console.WriteLine("PCT ZERO\tSECONDS\t\tPCT CELLS\tTOTAL CELLS\tNNZ APPROACH");
for (double pctZero = 0.8; pctZero < 1.0; pctZero += 0.01)
{
// fill the "BigInts"
int nnzA, nnzB;
BitArray a = WeightedToZeroes(bits, pctZero, out nnzA);
BitArray b = WeightedToZeroes(bits, pctZero, out nnzB);
// this is the answer "BigInt" that is at most twice the size minus 1
int xSize = bits * 2 - 1;
BitArray x = new BitArray(xSize);
int LSB, MSB;
LSB = MSB = bits - 1;
// stats
long cells = 0;
DateTime start = DateTime.Now;
for (int i = 0; i < xSize; i++)
{
// compare using the diagonals
for (int bit = LSB; bit < MSB; bit++)
{
cells++;
x[i] |= (b[MSB - bit] && a[bit]);
if (x[i]) break;
}
// update the window over the bits
if (LSB == 0)
{
MSB--;
}
else
{
LSB--;
}
//Console.Write(".");
}
// stats
TimeSpan elapsed = DateTime.Now.Subtract(start);
double pctCells = (cells * 100.0) / (bits * bits);
Console.WriteLine(pctZero.ToString("p") + "\t\t" +elapsed.TotalSeconds.ToString("00.000") + "\t\t" +
pctCells.ToString("00.00") + "\t\t" + cells.ToString("00000000") + "\t" + (nnzA * nnzB).ToString("00000000"));
}
Console.ReadLine();
}
}
}

Just use any FFT polynomial multiplication algorithm and transform all resulting coefficients that are greater than or equal to 1 into 1.
Example:
10011 * 10001
[1 x^4 + 0 x^3 + 0 x^2 + 1 x^1 + 1 x^0] * [1 x^4 + 0 x^3 + 0 x^2 + 0 x^1 + 1 x^0]
== [1 x^8 + 0 x^7 + 0 x^6 + 1 x^5 + 2 x^4 + 0 x^3 + 0 x^2 + 1 x^1 + 1 x^0]
-> [1 x^8 + 0 x^7 + 0 x^6 + 1 x^5 + 1 x^4 + 0 x^3 + 0 x^2 + 1 x^1 + 1 x^0]
-> 100110011
For an example of the algorithm, check:
http://www.cs.pitt.edu/~kirk/cs1501/animations/FFT.html
BTW, it is of linearithmic complexity, i.e., O(n log(n))
Also see:
http://everything2.com/title/Multiplication%2520using%2520the%2520Fast%2520Fourier%2520Transform
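The multiply-then-clamp structure can be sketched in C. Note the assumption: the convolution below is the direct O(n^2) one, used only to keep the example short and exact; the speed-up described above comes from replacing that loop with an FFT-based polynomial multiplication. The name or_mul_convolve and the one-bit-per-byte layout (least significant bit first) are made up for illustration.

```c
#include <stddef.h>

/* a holds na bits and b holds nb bits, one bit (0 or 1) per element,
   least significant first; na and nb must be at least 1.
   out must have room for na + nb - 1 elements. */
void or_mul_convolve(const unsigned char *a, size_t na,
                     const unsigned char *b, size_t nb,
                     unsigned char *out)
{
    size_t nout = na + nb - 1;
    for (size_t k = 0; k < nout; ++k) {
        /* Coefficient of x^k in the product polynomial A(x) * B(x). */
        unsigned count = 0;
        for (size_t i = 0; i < na; ++i)
            if (k >= i && k - i < nb)
                count += a[i] * b[k - i];
        /* Clamp: any coefficient >= 1 becomes a 1 bit. */
        out[k] = (count >= 1);
    }
}
```

For the example above, 10011 and 10001 give the coefficients 1 0 0 1 2 0 0 1 1 (highest power first) before clamping and the bits 100110011 after.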
