How is uint overflow defined in OpenCL?


What happens when the result of a multiplication or sum in OpenCL overflows? Does it wrap?
In particular I'd like to know if I can catch an overflow in
uint4 x = ( get_global_id( 0 ) * 4 + (uint4)(0, 1, 2, 3) ) * q + r;
with
int4 invalid = x < get_global_id( 0 ) * 4;
or how else that would be possible. (Assuming r >= 0 && q > r && q < (1 << 20) and the id will be at most just big enough to cause an overflow.)
Context: I want to check every 32 bit uint x for which x % q == r , where q and r are known. With vectors I can check 4 at a time, but the number of tests may not be divisible by 4.
I'm targeting the GPU, but that shouldn't be relevant, right?

The OpenCL 1.2 standard (section 6.2.3.3) refers to the C99 standard (section 6.3.1.3):
...if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
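The same modulo rule governs unsigned arithmetic itself (C99 section 6.2.5), so a uint sum or product that overflows simply wraps around. Also, get_global_id returns size_t, so a narrowing conversion to uint is a bad idea IMO, though I have never faced an NDRange big enough to exceed the uint range.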
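As a rough illustration of the wrap-around, here is a minimal Python sketch that emulates 32-bit unsigned arithmetic with a mask. The helper names mad_u32 and would_overflow are made up for this example and are not OpenCL functions. It also suggests that comparing the wrapped result against the id is not guaranteed to catch a wrap when q is large, whereas a range check done before multiplying is.

```python
MASK32 = 0xFFFFFFFF  # largest value a 32-bit unsigned integer can hold


def mad_u32(a, q, r):
    """Return a*q + r reduced modulo 2**32, as uint arithmetic would give."""
    return (a * q + r) & MASK32


def would_overflow(a, q, r):
    """True if a*q + r does not fit in 32 bits (assumes q > 0 and r <= MASK32).

    Only uses values that themselves fit in 32 bits, so the same test
    could be written with uint operands.
    """
    return a > (MASK32 - r) // q


# Hypothetical values: q just below 1 << 20, small r.
q, r = 1000003, 7
a = 4295                      # first multiplier for which a*q + r wraps
x = mad_u32(a, q, r)
print(x, x < a, would_overflow(a, q, r))
```

Here the wrapped value is larger than a, so a test of the form x < a would miss the overflow, while the pre-check reports it.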

Related

Implementing equality function with basic arithmetic operations

Given positive-integer inputs x and y, is there a mathematical formula that will return 1 if x==y and 0 otherwise? I am in the unfortunate position of having to use a tool that only allows me to use the following symbols: numerals 0-9; decimal point .; parentheses ( and ); and the four basic arithmetic operations +, -, /, and *.
Currently I am relying on the fact that the tool that evaluates division by zero to be zero. (I can't tell if this is a bug or a feature.) Because of this, I have been able to use ((x-y)/(y-x))+1. Obviously, this is ugly and unideal, especially in the case that it is a bug and they fix it in a future version.

Taking advantage of the fact that integer division in C truncates toward 0, the following works well. No multiplication overflow. Well defined for all "positive-integer inputs x and y".
(x/y) * (y/x)
#include <stdio.h>
#include <limits.h>
void etest(unsigned x, unsigned y) {
unsigned ref = x == y;
unsigned z = (x/y) * (y/x);
if (ref != z) {
printf("%u %u %u %u\n", x,y,z,ref);
}
}
void etests(void) {
unsigned list[] = { 1,2,3,4,5,6,7,8,9,10,100,1000, UINT_MAX/2 , UINT_MAX - 1, UINT_MAX };
for (unsigned x = 0; x < sizeof list/sizeof list[0]; x++) {
for (unsigned y = 0; y < sizeof list/sizeof list[0]; y++) {
etest(list[x], list[y]);
}
}
}
int main(void) {
etests();
printf("Done\n");
return 0;
}
Output (No difference from x == y)
Done

If division is truncating and the numbers are not too big, then:
((x - y) ^ 2 + 2) / ((x - y) ^ 2 + 1) - 1
The division has the value 2 if x = y and otherwise truncates to 1.
(Here x^2 is an abbreviation for x*x.)
This will fail if (x-y)^2 overflows. In that case, you need to independently check x/k = y/k and x%k = y%k where (k-1)*(k-1) doesn't overflow (which will work if k is ceil(sqrt(INT_MAX))). x%k can be computed as x-k*(x/k) and A&&B is simply A*B.
That will work for any x and y in the range [-k*k, k*k].
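A minimal sketch of the basic formula, before any splitting into high and low parts. It is written in Python, whose integers never overflow, so it only illustrates the truncation argument and not the overflow concern; the function name equals_small is made up for this example. Numerator and denominator are both positive, so Python's floor division gives the same result as truncating division here.

```python
def equals_small(x, y):
    """Return 1 if x == y and 0 otherwise, using only +, -, * and integer division."""
    d = x - y
    sq = d * d                     # (x - y)^2, always >= 0
    # (sq + 2) // (sq + 1) is 2 when sq == 0 and 1 for every sq >= 1.
    return (sq + 2) // (sq + 1) - 1


print(equals_small(5, 5), equals_small(5, 6), equals_small(-3, -3))
```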
A slightly incorrect computation, using lots of intermediate values, which assumes that x - y won't overflow (or at least that the overflow won't produce a false 0).
int delta = x - y;
int delta_hi = delta / K;
int delta_lo = delta - K * delta_hi;
int equal_hi = (delta_hi * delta_hi + 2) / (delta_hi * delta_hi + 1) - 1;
int equal_lo = (delta_lo * delta_lo + 2) / (delta_lo * delta_lo + 1) - 1;
int equals = equal_hi * equal_lo;
or written out in full:
((((x-y)/K)*((x-y)/K)+2)/(((x-y)/K)*((x-y)/K)+1)-1)*
((((x-y)-K*((x-y)/K))*((x-y)-K*((x-y)/K))+2)/
(((x-y)-K*((x-y)/K))*((x-y)-K*((x-y)/K))+1)-1)
(For signed 31-bit integers, use K=46341; for unsigned 32-bit integers, 65536.)
Checked with @chux's test harness, adding the 0 case: live on coliru and with negative values also on coliru.
On a platform where integer subtraction might produce something other than the 2s-complement wraparound, a similar technique could be used, but dividing the numbers into three parts instead of two.

So the problem is that if they fix division by zero, it means that you cannot use any divisor that contains input variables anymore (you'd have to check that the divisor != 0, and implementing that check would solve the original x-y == 0 problem!); hence, division cannot be used at all.
Ergo, only +, -, * and the association operator () can be used. It's not hard to see that with only these operators, the desired behaviour cannot be implemented.

Determining the big Oh for (n-1)+(n-1)

I have been trying to get my head around this particular complexity computation. Everything I read about this type of recursion says it is O(2^n), but if I add a counter to the code and check how many times it iterates for a given n, it seems to follow the curve of 4^n instead. Maybe I just misunderstood, as I placed a count++; inside the scope.
Is this not of type big O(2^n)?
public int test(int n)
{
if (n == 0)
return 0;
else
return test(n-1) + test(n-1);
}
I would appreciate any hints or explanation on this! I am completely new to complexity calculation and this one has thrown me off track.
//Regards

int test(int n)
{
printf("%d\n", n);
if (n == 0) {
return 0;
}
else {
return test(n - 1) + test(n - 1);
}
}
With a printout at the top of the function, running test(8) and counting the number of times each n is printed yields this output, which clearly shows 2^n growth.
$ ./test | sort | uniq -c
256 0
128 1
64 2
32 3
16 4
8 5
4 6
2 7
1 8
(uniq -c counts the number of times each line occurs. 0 is printed 256 times, 1 128 times, etc.)
Perhaps you mean you got a result of O(2^(n+1)), rather than O(4^n)? If you add up all of these numbers you'll get 511, which for n=8 is 2^(n+1) - 1.
If that's what you meant, then that's fine. O(2^(n+1)) = O(2⋅2^n) = O(2^n)
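The total can also be checked with a counter instead of a printout. This is a small Python sketch of the same recursion; the function is renamed recurse here and the counter argument is an addition for the example.

```python
def recurse(n, counter):
    """Same shape as the question's test(n); counter[0] tracks the number of calls."""
    counter[0] += 1
    if n == 0:
        return 0
    return recurse(n - 1, counter) + recurse(n - 1, counter)


def total_calls(n):
    """Run the recursion once from a fresh counter and return the call count."""
    counter = [0]
    recurse(n, counter)
    return counter[0]


for n in range(9):
    print(n, total_calls(n), 2 ** (n + 1) - 1)
```

Starting from a fresh counter on every run matters: a counter that keeps accumulating across runs would overstate the growth.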

First off: the 'else' statement is redundant, since the if already returns when it evaluates to true.
On topic: every call forks 2 further calls, which fork 2 calls themselves, and so on. So for n=1 the function is called 2 times plus the originating call (3 in total), for n=2 it is 4+2+1 = 7 times, for n=3 it is 15 times, and in general 2^(n+1) - 1 times. The complexity is therefore O(2^n), since the constant factor of 2 is dropped.
I suspect your counter wasn't properly reset between calls.

Let x(n) be the total number of calls of test.
x(0) = 1
x(n) = 1 + 2 * x(n - 1) = 1 + 2 + 4 * x(n - 2) = ... = 1 + 2 + 4 + ... + 2^n
That sums to 2^(n+1) - 1 calls, which is O(2^n).

The complexity T(n) of this function can be easily shown to equal c + 2*T(n-1). The recurrence given by
T(0) = 0
T(n) = c + 2*T(n-1)
Has as its solution c*(2^n - 1), or something like that. It's O(2^n).
Now, if you take the input size of your function to be m = lg n, as might be acceptable in this scenario (the number of bits to represent n, the true input size), then this is, in fact, an O(2^(2^m)) algorithm, since n = 2^m and therefore 2^n = 2^(2^m).
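For reference, the closed form quoted for the recurrence can be obtained by unrolling it, using T(0) = 0 as stated above:

```latex
\begin{aligned}
T(n) &= c + 2\,T(n-1) \\
     &= c + 2c + 4\,T(n-2) \\
     &= c\,(1 + 2 + 4 + \dots + 2^{n-1}) + 2^{n}\,T(0) \\
     &= c\,(2^{n} - 1) + 2^{n}\cdot 0 \\
     &= c\,(2^{n} - 1) \in O(2^{n}).
\end{aligned}
```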

Calculate bessel function in MATLAB using Jm+1=2mj(m) -j(m-1) formula

I tried to implement bessel function using that formula, this is the code:
function result=Bessel(num);
if num==0
result=bessel(0,1);
elseif num==1
result=bessel(1,1);
else
result=2*(num-1)*Bessel(num-1)-Bessel(num-2);
end;
But when I compare it with MATLAB's built-in bessel function, the values differ enormously.
For example if I type Bessel(20) it gives me 3.1689e+005 as result, if instead I type bessel(20,1) it gives me 3.8735e-025 , a totally different result.

such recurrence relations are nice in mathematics but numerically unstable when implementing algorithms using limited precision representations of floating-point numbers.
Consider the following comparison:
x = 0:20;
y1 = arrayfun(@(n)besselj(n,1), x); %# builtin function
y2 = arrayfun(@Bessel, x); %# your function
semilogy(x,y1, x,y2), grid on
legend('besselj','Bessel')
title('J_\nu(z)'), xlabel('\nu'), ylabel('log scale')
So you can see how the computed values start to differ significantly after 9.
According to MATLAB:
BESSELJ uses a MEX interface to a Fortran library by D. E. Amos.
and gives the following as references for their implementation:
D. E. Amos, "A subroutine package for Bessel functions of a complex
argument and nonnegative order", Sandia National Laboratory Report,
SAND85-1018, May, 1985.
D. E. Amos, "A portable package for Bessel functions of a complex
argument and nonnegative order", Trans. Math. Software, 1986.

The forward recurrence relation you are using is not stable. To see why, consider that the values of BesselJ(n,x) become smaller and smaller with each order, by about a factor of x/(2n), which is 1/(2n) at x = 1. You can see this by looking at the first term of the Taylor series for J.
So, what you're doing is subtracting a large number from a multiple of a somewhat smaller number to get an even smaller number. Numerically, that's not going to work well.
Look at it this way. We know the result is of the order of 10^-25. You start out with numbers that are of the order of 1. So in order to get even one accurate digit out of this, we have to know the first two numbers with at least 25 digits precision. We clearly don't, and the recurrence actually diverges.
Using the same recurrence relation to go backwards, from high orders to low orders, is stable. When you start with correct values for J(20,1) and J(19,1), you can calculate all orders down to 0 with full accuracy as well. Why does this work? Because now the numbers are getting larger in each step. You're subtracting a very small number from an exact multiple of a larger number to get an even larger number.
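The contrast between the two directions can be sketched in a few lines. The following is a Python illustration, not MATLAB, and it makes one assumption that is not in the answer: the seed values are taken from the power series of J_n(x), which is accurate enough for small x such as x = 1. The function names are made up for the example.

```python
from math import factorial


def bessel_j_series(n, x, terms=30):
    """J_n(x) from its power series; adequate for small |x| (assumed seed source)."""
    return sum((-1) ** k * (x / 2) ** (2 * k + n) / (factorial(k) * factorial(n + k))
               for k in range(terms))


def forward(nmax, x):
    """Upward recurrence J_{m+1} = (2m/x) J_m - J_{m-1}, starting from J_0 and J_1."""
    j = [bessel_j_series(0, x), bessel_j_series(1, x)]
    for m in range(1, nmax):
        j.append(2 * m / x * j[m] - j[m - 1])
    return j


def backward(nmax, x):
    """The same recurrence run downward, starting from J_nmax and J_{nmax-1}."""
    j = [0.0] * (nmax + 1)
    j[nmax] = bessel_j_series(nmax, x)
    j[nmax - 1] = bessel_j_series(nmax - 1, x)
    for m in range(nmax - 1, 0, -1):
        j[m - 1] = 2 * m / x * j[m] - j[m + 1]
    return j


print("series   J_20(1):", bessel_j_series(20, 1.0))
print("forward  J_20(1):", forward(20, 1.0)[20])
print("backward J_0(1): ", backward(20, 1.0)[0], "vs", bessel_j_series(0, 1.0))
```

The forward value for order 20 is dominated by amplified rounding error, while the downward pass reproduces J_0(1) closely.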

You can just modify the code below which is for the Spherical bessel function. It is well tested and works for all arguments and order range. I am sorry it is in C#
public static Complex bessel(int n, Complex z)
{
if (n == 0) return sin(z) / z;
if (n == 1) return sin(z) / (z * z) - cos(z) / z;
if (n <= System.Math.Abs(z.real))
{
Complex h0 = bessel(0, z);
Complex h1 = bessel(1, z);
Complex ret = 0;
for (int i = 2; i <= n; i++)
{
ret = (2 * i - 1) / z * h1 - h0;
h0 = h1;
h1 = ret;
if (double.IsInfinity(ret.real) || double.IsInfinity(ret.imag)) return double.PositiveInfinity;
}
return ret;
}
else
{
double u = 2.0 * abs(z.real) / (2 * n + 1);
double a = 0.1;
double b = 0.175;
int v = n - (int)System.Math.Ceiling((System.Math.Log(0.5e-16 * (a + b * u * (2 - System.Math.Pow(u, 2)) / (1 - System.Math.Pow(u, 2))), 2)));
Complex ret = 0;
while (v > n - 1)
{
ret = z / (2 * v + 1.0 - z * ret);
v = v - 1;
}
Complex jnM1 = ret;
while (v > 0)
{
ret = z / (2 * v + 1.0 - z * ret);
jnM1 = jnM1 * ret;
v = v - 1;
}
return jnM1 * sin(z) / z;
}
}

Divide by 10 using bit shifts?

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.

Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182. It is exact for smaller inputs, though, which may be sufficient for some uses.
Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.
See Why does GCC use multiplication by a strange number in implementing integer division? for examples of the actual x86 asm gcc, clang, MSVC, ICC, and other modern compilers make.
This is a fast approximation that's inexact for large inputs
It's even faster than the exact division via multiply + right-shift that compilers use.
You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
What's going on here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.
This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in eax and quotient returned in eax)
div10 proc
mov edx,1999999Ah ; load 1/10 * 2^32
imul eax ; edx:eax = dividend / 10 * 2 ^32
mov eax,edx ; eax = dividend / 10
ret
endp
Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.
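To see where the approximation stops being exact, here is a small Python sketch for non-negative inputs only. The second function uses the constant 0xCCCCCCCD with a shift of 35, which is the pair commonly seen in compiler output for unsigned 32-bit division by 10; it is included as an assumption for comparison and is not part of the answer above. Both function names are made up.

```python
def div10_approx(n):
    """The answer's method: high half of n * 0x1999999A (non-negative n only)."""
    return (n * 0x1999999A) >> 32


def div10_exact_u32(n):
    """Assumed exact variant for 0 <= n < 2**32: multiply by ceil(2**35 / 10)."""
    return (n * 0xCCCCCCCD) >> 35


n = 1073741829  # first failing input according to the editor's note
print(div10_approx(n), div10_exact_u32(n), n // 10)
```

In C the second variant needs a 64-bit intermediate product, just like the first.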

Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.
unsigned divu10(unsigned n) {
unsigned q, r;
q = (n >> 1) + (n >> 2);
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
q = q >> 3;
r = n - (((q << 2) + q) << 1);
return q + (r > 9);
}
I think that this is the best solution for architectures that lack a multiply instruction.
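A quick way to gain some confidence in the shift-only routine is to port it and compare against ordinary division. This Python sketch assumes 32-bit unsigned inputs; no masking is needed because the intermediate q never exceeds n.

```python
def divu10(n):
    """Shift-and-add division by 10 for 0 <= n < 2**32, ported from the C above."""
    q = (n >> 1) + (n >> 2)        # q ~ 0.75 * n
    q = q + (q >> 4)
    q = q + (q >> 8)
    q = q + (q >> 16)              # q ~ 0.8 * n, slightly underestimated
    q = q >> 3                     # q ~ n / 10, possibly one too small
    r = n - (((q << 2) + q) << 1)  # remainder n - 10 * q
    return q + (r > 9)             # correct the underestimate


samples = list(range(200000)) + [2**31 - 1, 2**31, 2**32 - 6, 2**32 - 1]
print(all(divu10(n) == n // 10 for n in samples))
```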

Of course you can if you can live with some loss in precision. If you know the value range of your input values you can come up with a bit shift and a multiplication which is exact.
Some examples how you can divide by 10, 60, ... like it is described in this blog to format time the fastest way possible.
temp = (ms * 205) >> 11; // 205/2048 is nearly the same as /10

to expand Alois's answer a bit, we can expand the suggested y = (x * 205) >> 11 for a few more multiples/shifts:
y = (ms * 1) >> 3 // first error 8
y = (ms * 2) >> 4 // 8
y = (ms * 4) >> 5 // 8
y = (ms * 7) >> 6 // 19
y = (ms * 13) >> 7 // 69
y = (ms * 26) >> 8 // 69
y = (ms * 52) >> 9 // 69
y = (ms * 103) >> 10 // 179
y = (ms * 205) >> 11 // 1029
y = (ms * 410) >> 12 // 1029
y = (ms * 820) >> 13 // 1029
y = (ms * 1639) >> 14 // 2739
y = (ms * 3277) >> 15 // 16389
y = (ms * 6554) >> 16 // 16389
y = (ms * 13108) >> 17 // 16389
y = (ms * 26215) >> 18 // 43699
y = (ms * 52429) >> 19 // 262149
y = (ms * 104858) >> 20 // 262149
y = (ms * 209716) >> 21 // 262149
y = (ms * 419431) >> 22 // 699059
y = (ms * 838861) >> 23 // 4194309
y = (ms * 1677722) >> 24 // 4194309
y = (ms * 3355444) >> 25 // 4194309
y = (ms * 6710887) >> 26 // 11184819
y = (ms * 13421773) >> 27 // 67108869
each line is a single, independent, calculation, and you'll see your first "error"/incorrect result at the value shown in the comment. you're generally better off taking the smallest shift for a given error value as this will minimise the extra bits needed to store the intermediate value in the calculation, e.g. (x * 13) >> 7 is "better" than (x * 52) >> 9 as it needs two less bits of overhead, while both start to give wrong answers above 68.
if you want to calculate more of these, the following (Python) code can be used:
def mul_from_shift(shift):
    mid = 2**shift + 5.
    return int(round(mid / 10.))
and I did the obvious thing for calculating when this approximation starts to go wrong with:
def first_err(mul, shift):
    i = 1
    while True:
        y = (i * mul) >> shift
        if y != i // 10:
            return i
        i += 1
(note that // is "integer" division; in Python it rounds down, which for the non-negative values used here is the same as truncating towards zero)
the reason for the "3/1" pattern in errors (i.e. 8 repeats 3 times followed by 9) seems to be due to the change in bases, i.e. log2(10) is ~3.32. if we plot the errors we get the following:
where the relative error is given by: mul_from_shift(shift) / (1<<shift) - 0.1

Considering Kuba Ober’s response, there is another one in the same vein.
It uses iterative approximation of the result, but I wouldn’t expect any surprising performances.
Let's say we have to find x where x = v / 10.
We’ll use the inverse operation v = x * 10 because it has the nice property that when x = a + b, then x * 10 = a * 10 + b * 10.
Let's use x as the variable holding the best approximation of the result so far. When the search ends, x will hold the result. We’ll set each bit b of x from the most significant to the least significant, one by one, and compare (x + b) * 10 with v. If it is smaller than or equal to v, then the bit b is set in x. To test the next bit, we simply shift b one position to the right (divide by two).
We can avoid the multiplication by 10 by holding x * 10 and b * 10 in other variables.
This yields the following algorithm to divide v by 10.
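uint16_t x = 0, x10 = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    // same test as x10 + b10 <= v, but the sum could overflow 16 bits;
    // x10 <= v always holds, so v - x10 cannot underflow
    if (b10 <= v - x10) {
        x10 += b10;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10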
Edit: to get the algorithm of Kuba Ober, which avoids the need for the variable x10, we can subtract b10 from v instead. The algorithm becomes
uint16_t x = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
if (b10 <= v) {
v -= b10;
x |= b;
}
b10 >>= 1;
b >>= 1;
}
// x = (original v) / 10, and v now holds the remainder
The loop may be unrolled and the different values of b and b10 may be precomputed as constants.

On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution from Hacker's Delight. Assuming a 16-bit dividend:
uint16_t div10(uint16_t dividend) {
uint16_t quotient = 0;
#define div10_step(n) \
do { if (dividend >= (n*10)) { quotient += n; dividend -= n*10; } } while (0)
div10_step(0x1000);
div10_step(0x0800);
div10_step(0x0400);
div10_step(0x0200);
div10_step(0x0100);
div10_step(0x0080);
div10_step(0x0040);
div10_step(0x0020);
div10_step(0x0010);
div10_step(0x0008);
div10_step(0x0004);
div10_step(0x0002);
div10_step(0x0001);
#undef div10_step
if (dividend >= 5) ++quotient; // round the result (optional)
return quotient;
}

Well division is subtraction, so yes. Shift right by 1 (divide by 2). Now subtract 5 from the result, counting the number of times you do the subtraction until the value is less than 5. The result is number of subtractions you did. Oh, and dividing is probably going to be faster.
A hybrid strategy of shift right then divide by 5 using the normal division might get you a performance improvement if the logic in the divider doesn't already do this for you.

I've designed a new method in AVR assembly, with lsr/ror and sub/sbc only. It divides by 8, then subtracts the number divided by 64 and 128, then subtracts the 1,024th and the 2,048th, and so on. It works very reliably (including exact rounding) and quickly (370 microseconds at 1 MHz).
The source code is here for 16-bit-numbers:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/DIV10.html
I hope that it helps, even though the question is ten years old.
brgs, gsc

elemakil's comments' code can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf
page 233. "Unsigned divide by 10 [and 11.]"

OR-multiplication on big integers

Multiplication of two n-bit numbers A and B can be understood as a sum of shifts:
(A << i1) + (A << i2) + ...
where i1, i2, ... are the positions of the bits that are set to 1 in B.
Now lets replace PLUS with OR to get new operation I actually need:
(A << i1) | (A << i2) | ...
This operation is quite similar to regular multiplication, for which there exist many faster algorithms (Schönhage-Strassen for example).
Is there a similar algorithm for the operation I presented here?
The size of the numbers is 6000 bits.
edit:
For some reason I have no link/button to post comments (any idea why?) so I will edit my question instead.
I indeed search for faster than O(n^2) algorithm for the operation defined above.
And yes, I am aware that it is not ordinary multiplication.

Is there a similar algorithm? I think probably not.
Is there some way to speed things up beyond O(n^2)? Possibly. If you consider a number A to be the analogue of A(x) = Σ a_n x^n, where the a_n are the binary digits of A, then your operation with bitwise ORs (let's call it A ⊕ B) can be expressed as follows, where "⇔" means "analogue":
A ⇔ A(x) = Σ a_n x^n
B ⇔ B(x) = Σ b_n x^n
C = A ⊕ B ⇔ C(x) = f(A(x)B(x)) = f(V(x)), where f(V(x)) = f(Σ v_n x^n) = Σ u(v_n) x^n and u(v_n) = 0 if v_n = 0, u(v_n) = 1 otherwise.
Basically you are doing the equivalent of taking two polynomials and multiplying them together, then identifying all the nonzero terms. From a bit-string standpoint, this means treating the bitstring as an array of samples of zeros or ones, convolving the two arrays, and collapsing the resulting samples that are nonzero. There are fast convolution algorithms that are O(n log n), using FFTs for instance, and the "collapsing" step here is O(n)... but somehow I wonder if the O(n log n) evaluation of fast convolution treats something (like multiplication of large integers) as O(1) so you wouldn't actually get a faster algorithm. Either that, or the constants for orders of growth are so large that you'd have to have thousands of bits before you got any speed advantage. ORing is so simple.
edit: there appears to be something called "binary convolution" (see this book for example) that sounds awfully relevant here, but I can't find any good links to the theory behind it and whether there are fast algorithms.
edit 2: maybe the term is "logical convolution" or "bitwise convolution"... here's a page from CPAN (bleah!) talking a little about it along with Walsh and Hadamard transforms which are kind of the bitwise equivalent to Fourier transforms... hmm, no, that seems to be the analog for XOR rather than OR.

You can do this in O(#1-bits in A * #1-bits in B).
a-bitnums = set(x : ((1<<x) & A) != 0)
b-bitnums = set(x : ((1<<x) & B) != 0)
c-set = 0
for a-bit in a-bitnums:
    for b-bit in b-bitnums:
        c-set |= 1 << (a-bit + b-bit)
This might be worthwhile if A and B are sparse in the number
of 1 bits present.
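The pseudocode above translates almost directly into runnable Python, since Python integers have arbitrary size. This is a minimal sketch; the names bit_positions and or_mul_sparse are made up for the example.

```python
def bit_positions(v):
    """Indices of the bits that are set in the non-negative integer v."""
    return [i for i in range(v.bit_length()) if (v >> i) & 1]


def or_mul_sparse(a, b):
    """OR together 1 << (i + j) for every set bit i of a and every set bit j of b."""
    b_bits = bit_positions(b)      # computed once, reused for every bit of a
    c = 0
    for i in bit_positions(a):
        for j in b_bits:
            c |= 1 << (i + j)
    return c


print(bin(or_mul_sparse(0b10011, 0b10001)))
```

The work is proportional to the product of the two popcounts, so this only pays off when both operands are sparse.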

I presume, you are asking the name for the additive technique you have given
when you write "Is a similar algorithm for operation I presented here?"...
Have you looked at the Peasant multiplication technique?
Please read up the Wikipedia description if you do not get the 3rd column in this example.
B X A
27 X 15 : 1
13 30 : 1
6 60 : 0
3 120 : 1
1 240 : 1
B is 27 == binary form 11011b
27x15 = 15 + 30 + 120 + 240
= 15<<0 + 15<<1 + 15<<3 + 15<<4
= 405
Sounds familiar?
Here is your algorithm.
Choose the smaller number as your A
Initialize C as your result area
while B is not zero,
if lsb of B is 1, add A to C
left shift A once
right shift B once
C has your multiplication result (unless you rolled over sizeof C)
Update If you are trying to get a fast algorithm for the shift and OR operation across 6000 bits,
there might actually be one. I'll think a little more on that.
It would appear like 'blurring' one number over the other. Interesting.
A rather crude example here,
110000011 X 1010101 would look like
      110000011
    110000011
  110000011
110000011
---------------
111111111111111
The number of 1s in the two numbers will decide the amount of blurring towards a number with all its bits set.
Wonder what you want to do with it...
Update2 This is the nature of the shift+OR operation with two 6000 bit numbers.
The result will be 12000 bits of course
the operation can be done with two bit streams; but, need not be done to its entirety
the 'middle' part of the 12000 bit stream will almost certainly be all 1s (provided both numbers are non-zero)
the problem will be in identifying the depth to which we need to process this operation to get both ends of the 12000 bit stream
the pattern at the two ends of the stream will depend on the largest consecutive 1s present in both the numbers
I have not yet got to a clean algorithm for this yet. Have updated for anyone else wanting to recheck or go further from here. Also, describing the need for such an operation might motivate further interest :-)

The best I could come up with is to use a fast out on the looping logic. Combined with the possibility of using the non-zero approach as described by themis, you can answer your question by inspecting less than 2% of the N^2 problem.
Below is some code that gives the timing for numbers that are between 80% and 99% zero.
When the numbers get around 88% zero, using themis' approach switches to being better (was not coded in the sample below, though).
This is not a highly theoretical solution, but it is practical.
OK, here is some "theory" of the problem space:
Basically, each bit for X (the output) is the OR summation of the bits on the diagonal of a grid constructed by having the bits of A along the top (MSB to LSB left to right) and the bits of B along the side (MSB to LSB from top to bottom). Since the bit of X is 1 if any on the diagonal is 1, you can perform an early out on the cell traversal.
The code below does this and shows that even for numbers that are ~87% zero, you only have to check ~2% of the cells. For more dense (more 1's) numbers, that percentage drops even more.
In other words, I would not worry about tricky algorithms and just do some efficient logic checking. I think the trick is to look at the bits of your output as the diagonals of the grid, as opposed to the bits of A shift-ORed with the bits of B. The trickiest thing in this case is keeping track of the bits you can look at in A and B and how to index the bits properly.
Hopefully this makes sense. Let me know if I need to explain this a bit further (or if you find any problems with this approach).
NOTE: If we knew your problem space a bit better, we could optimize the algorithm accordingly. If your numbers are mostly non-zero, then this approach is better than themis', since his would result in more computations and more storage space needed (sizeof(int) * NNZ).
NOTE 2: This assumes the data is basically bits, and I am using .NET's BitArray to store and access the data. I don't think this would cause any major headaches when translated to other languages. The basic idea still applies.
using System;
using System.Collections;
namespace BigIntegerOr
{
class Program
{
private static Random r = new Random();
private static BitArray WeightedToZeroes(int size, double pctZero, out int nnz)
{
nnz = 0;
BitArray ba = new BitArray(size);
for (int i = 0; i < size; i++)
{
ba[i] = (r.NextDouble() < pctZero) ? false : true;
if (ba[i]) nnz++;
}
return ba;
}
static void Main(string[] args)
{
// make sure there are enough bytes to hold the 6000 bits
int size = (6000 + 7) / 8;
int bits = size * 8;
Console.WriteLine("PCT ZERO\tSECONDS\t\tPCT CELLS\tTOTAL CELLS\tNNZ APPROACH");
for (double pctZero = 0.8; pctZero < 1.0; pctZero += 0.01)
{
// fill the "BigInts"
int nnzA, nnzB;
BitArray a = WeightedToZeroes(bits, pctZero, out nnzA);
BitArray b = WeightedToZeroes(bits, pctZero, out nnzB);
// this is the answer "BigInt" that is at most twice the size minus 1
int xSize = bits * 2 - 1;
BitArray x = new BitArray(xSize);
int LSB, MSB;
LSB = MSB = bits - 1;
// stats
long cells = 0;
DateTime start = DateTime.Now;
for (int i = 0; i < xSize; i++)
{
// compare using the diagonals
for (int bit = LSB; bit < MSB; bit++)
{
cells++;
x[i] |= (b[MSB - bit] && a[bit]);
if (x[i]) break;
}
// update the window over the bits
if (LSB == 0)
{
MSB--;
}
else
{
LSB--;
}
//Console.Write(".");
}
// stats
TimeSpan elapsed = DateTime.Now.Subtract(start);
double pctCells = (cells * 100.0) / (bits * bits);
Console.WriteLine(pctZero.ToString("p") + "\t\t" +elapsed.TotalSeconds.ToString("00.000") + "\t\t" +
pctCells.ToString("00.00") + "\t\t" + cells.ToString("00000000") + "\t" + (nnzA * nnzB).ToString("00000000"));
}
Console.ReadLine();
}
}
}

Just use any FFT Polynomial Multiplication Algorithm and transform all resulting coefficients that are greater than or equal 1 into 1.
Example:
10011 * 10001
[1 x^4 + 0 x^3 + 0 x^2 + 1 x^1 + 1 x^0] * [1 x^4 + 0 x^3 + 0 x^2 + 0 x^1 + 1 x^0]
== [1 x^8 + 0 x^7 + 0 x^6 + 1 x^5 + 2 x^4 + 0 x^3 + 0 x^2 + 1 x^1 + 1 x^0]
-> [1 x^8 + 0 x^7 + 0 x^6 + 1 x^5 + 1 x^4 + 0 x^3 + 0 x^2 + 1 x^1 + 1 x^0]
-> 100110011
For an example of the algorithm, check:
http://www.cs.pitt.edu/~kirk/cs1501/animations/FFT.html
BTW, it is of linearithmic complexity, i.e., O(n log(n))
Also see:
http://everything2.com/title/Multiplication%2520using%2520the%2520Fast%2520Fourier%2520Transform
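The "multiply as polynomials, then collapse nonzero coefficients" idea can be tried without writing an FFT. The sketch below swaps in a different technique from the one named above: it packs each bit into its own wide field of a big integer (sometimes called Kronecker substitution) and lets Python's built-in big-integer multiplication do the convolution. All function names are made up, and the collapse loop is written for clarity rather than speed.

```python
def or_mul_naive(a, b):
    """Reference implementation: OR together a << i for every set bit i of b."""
    result, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            result |= a << i
        i += 1
    return result


def spread(v, width):
    """Move bit i of v to bit position i * width, leaving zero padding between."""
    out = 0
    for i in range(v.bit_length()):
        if (v >> i) & 1:
            out |= 1 << (i * width)
    return out


def or_mul_via_product(a, b):
    """OR-multiplication via one ordinary multiplication of padded operands."""
    if a == 0 or b == 0:
        return 0
    # No coefficient of the product polynomial can exceed the shorter bit length,
    # so fields of this width never carry into their neighbours.
    width = min(a.bit_length(), b.bit_length()).bit_length()
    product = spread(a, width) * spread(b, width)
    mask = (1 << width) - 1
    result = 0
    for i in range(a.bit_length() + b.bit_length() - 1):
        if (product >> (i * width)) & mask:   # nonzero coefficient -> bit set
            result |= 1 << i
    return result


print(bin(or_mul_via_product(0b10011, 0b10001)))
```

Whether this beats the plain shift-and-OR loop at 6000 bits would have to be measured; the padding multiplies the operand size by the field width.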
