I'm new to OpenCL.
I know how to sum a 1D array. My question is how to compute an array of running sums (prefix sums) from a 1D array in OpenCL.
int a[1000];
int b[1000];
.... //save data to a
for(int i = 0 ;i < 1000; i ++){
int sum = 0;
for(int j = 0 ;j < i; j ++){
sum += a[j];
}
b[i] = sum;
}
Any suggestion is welcome.
As others have mentioned, what you want is a parallel prefix sum (scan). Your loop computes the exclusive variant, since b[i] does not include a[i]. If you're allowed to use OpenCL 2.0, it has built-in work-group functions for this (work_group_scan_exclusive_add and work_group_scan_inclusive_add). They should have been there from the start given how often scans are used; as it is, everybody implements it themselves, often poorly in one way or another.
See Parallel Prefix Sum (Scan) with CUDA for the typical algorithms for teaching this.
At the size you mention it makes no sense to use multiple compute units, so you would attack it with a single compute unit: just repeat the loop twice or so. With a work-group size of 64-256, you'll have the sums of that many elements very quickly. Building on the work-group functions to get generic reduction functions for any size is left as an exercise for the reader.
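As a rough illustration of the scan idea, here is a minimal C++ sketch of the doubling (Hillis-Steele) scheme run on the host. It is not OpenCL code, and the function name exclusive_scan_doubling is made up for this example. Within each pass the iterations over i do not depend on each other, so on a device each i would map to one work-item, with a barrier between passes.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum (b[i] = a[0] + ... + a[i-1]) using doubling passes.
std::vector<int> exclusive_scan_doubling(const std::vector<int>& a) {
    const std::size_t n = a.size();
    // Shift the input right by one, so that an inclusive scan of `cur`
    // equals the exclusive scan of `a`.
    std::vector<int> cur(n, 0);
    for (std::size_t i = 1; i < n; ++i) cur[i] = a[i - 1];

    std::vector<int> next(n);
    for (std::size_t d = 1; d < n; d *= 2) {
        // Every i in this pass is independent of the others.
        for (std::size_t i = 0; i < n; ++i)
            next[i] = (i >= d) ? cur[i] + cur[i - d] : cur[i];
        cur.swap(next);
    }
    return cur;
}
```

For 1000 elements this takes 10 passes, each of which could run fully in parallel within one work-group.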
This is a sequential problem. Expressed another way:
b[1] = a[0]
b[2] = b[1] + a[1]
b[3] = b[2] + a[2]
...
b[1000] = b[999] + a[999]
Therefore, having multiple threads won't help you at all.
The best way to do this is on a single CPU core, not with OpenCL/CUDA/OpenMP.
This problem is completely different from a reduction, where every step can be divided into 2 smaller steps that can be run in parallel.
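The recurrence above can be sketched as a single pass in C++. The function name running_sums is only illustrative; b[0] is 0 here, which matches the loop in the question, where the inner loop does not run for i = 0.

```cpp
#include <cstddef>
#include <vector>

// b[i] = a[0] + ... + a[i-1], built from the previous entry in one pass.
std::vector<int> running_sums(const std::vector<int>& a) {
    std::vector<int> b(a.size(), 0);
    for (std::size_t i = 1; i < a.size(); ++i)
        b[i] = b[i - 1] + a[i - 1];
    return b;
}
```

This does n - 1 additions, compared with roughly n*n/2 for the nested loops in the question.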
Original problem:
Let N be a positive integer (actually, N <= 2000) and P - set of all possible partitions of the N, where with and . Let A be the number of partitions . Find the A.
Input: N. Output: A - the number of partitions .
What have I tried:
I think that this problem can be solved by a dynamic programming algorithm. Let p(n,a,b) be the function that returns the number of partitions of n using only the numbers a..b. Then we can compute A with code like:
int Ans = 2; // the 1+1+...+1=N & N=N partitions
for(int a = 2; a <= N/2; a += 1){ //a - from 2 to N/2
int b = a*2-1;
Ans += p[N][a][b]; // add all partitions using a..b to Answer
if(a < (a-1)*2-1){ // if a < previous b [ (a-1)*2-1 ]
Ans -= p[N][a][(a-1)*2-1]; // then we counted number of partitions
} // using numbers a..prev_b twice.
}
Next I tried to find a dynamic programming algorithm computing p(n,a,b) for any integers a <= b <= n. This paper (.pdf) provides the following algorithm:
, where I(n<=b) = 1 if n<=b and 0 otherwise.
Question(s):
How should I implement the algorithm from the paper? I'm new to DP problems and, as far as I can see, this problem has 3 dimensions (n, a and b), which is quite tricky for me.
How does that algorithm actually work? I know how the algorithms for computing p(n,0,b) or p(n,a,n) work, but a short explanation for p(n,a,b) would be very helpful.
Does the original problem have a simpler solution? I'm quite sure that there's another clean solution, but I haven't found it.
I calculated all of A(1)-A(600) in 23 seconds with a memoization approach (top-down dynamic programming). The 3D table requires 1.7 GB of memory.
For reference: A(50) = 278, A(200) = 465202, A(600) = 38860513616
N = 2000 requires too large a table for a 32-bit environment, and a map-based approach was too slow.
I can make a 2D table of reasonable size, but this approach requires zeroing the table at every iteration of the outer loop, which is slow again.
A(1000) = 107292471486730 in 131 sec. I think that arbitrary-precision arithmetic might be needed for larger values to avoid Int64 overflow.
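For the p(n,a,b) subproblem on its own, one possible sketch is the standard "coin change" style DP over a one-dimensional table. This is not the recurrence from the paper mentioned in the question; the function name count_partitions is made up, and it assumes n >= 0 and that the result fits in 64 bits.

```cpp
#include <cstdint>
#include <vector>

// Number of partitions of n whose parts all lie in [a, b].
std::uint64_t count_partitions(int n, int a, int b) {
    if (a < 1) a = 1;                         // parts must be positive
    std::vector<std::uint64_t> ways(n + 1, 0);
    ways[0] = 1;                              // the empty partition of 0
    // Adding the allowed parts one at a time counts each multiset once.
    for (int part = a; part <= b && part <= n; ++part)
        for (int m = part; m <= n; ++m)
            ways[m] += ways[m - part];
    return ways[n];
}
```

For example, count_partitions(6, 2, 3) counts 2+2+2 and 3+3. Each call needs only n + 1 table entries, so memory is not the bottleneck with this layout, although the table is recomputed for every (a, b) pair.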
I am trying to write a program to check whether a number N can be expressed as the sum of two cubes i.e. N = a^3 + b^3
This is my code with complexity O(n):
#include <iostream>
#include<math.h>
#define ll unsigned long long
using namespace std;
int main()
{
ios_base::sync_with_stdio(false);
bool flag=false;
ll t,N;
cin>>t;
while(t--)
{
cin>>N;
flag=false;
for(int i=1; i<=(ll)cbrtl(N/2); i++)
{
if(!(cbrtl(N-i*i*i)-(ll)cbrtl(N-i*i*i))) {flag=true; break;}
}
if(flag) cout<<"Yes\n"; else cout<<"No\n";
}
return 0;
}
As the time limit for the code is 2 s, this program gets TLE. Can anyone suggest a faster approach?
There is a faster algorithm to check whether a given integer is a sum (or difference) of two cubes, n = a^3 + b^3.
I don't know if this algorithm is already known (probably yes, but I can't find it in books or on the internet). I discovered it and use it to compute integers up to n < 10^18.
The process uses a single trick:
4(a^3+b^3)/(a+b) = (a+b)^2 + 3(a-b)^2
We don't know in advance what a and b are, and so we don't know (a+b) either, but we do know that (a+b) must divide (a^3+b^3). So if you have a fast prime factorization routine, you can quickly compute each divisor of (a^3+b^3) and then check whether
(4(a^3+b^3)/divisor - divisor^2)/3 = square
If you find a square, you have divisor = (a+b) and sqrt(square) = (a-b), so you have a and b.
If no square is found, the number is not a sum of two cubes.
We know that divisor <= (4(a^3+b^3))^(1/3), and this limit speeds up the task: while assembling the divisors of (a^3+b^3), immediately discard those greater than the limit.
Now some comparisons with other algorithms: for n = 10^18, brute force has to test all numbers below 10^6 to know the answer. On the other hand, to build all divisors of 10^18 you need primes up to 10^9.
The maximum number of distinct primes you could fit into 10^9 is about 10 (2*3*5*7*11*13*17*19*23*29 is about 6.5*10^9), so in the worst case we have 2^10-1 different combinations of primes (which assemble the divisors) to check, many of them discarded because of the limit.
To compute prime factors I use a table of the first 60,000,000 primes, which works very well in this range.
Miguel Velilla
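A minimal sketch of the divisor test described above, restricted to sums with a >= b >= 1. To stay short it swaps the factorization-based divisor enumeration for a plain loop over candidate divisors up to (4n)^(1/3), so it only illustrates the identity, not the speed-up. The names isqrt and is_sum_of_two_cubes are made up for this example, and it assumes n <= 10^18 so that 4*n fits in 64 bits.

```cpp
#include <cmath>
#include <cstdint>

// Floor of the square root, corrected for floating-point rounding.
static std::uint64_t isqrt(std::uint64_t v) {
    std::uint64_t s = static_cast<std::uint64_t>(std::sqrt(static_cast<long double>(v)));
    while (s * s > v) --s;
    while ((s + 1) * (s + 1) <= v) ++s;
    return s;
}

// True if n = a^3 + b^3 for some integers a >= b >= 1 (assumes n <= 10^18).
bool is_sum_of_two_cubes(std::uint64_t n) {
    const std::uint64_t four_n = 4 * n;
    for (std::uint64_t d = 1; d * d * d <= four_n; ++d) {  // d plays the role of a + b
        if (n % d != 0) continue;                          // a + b must divide n
        std::uint64_t rest = four_n / d - d * d;           // 3 * (a - b)^2 for a valid d
        if (rest % 3 != 0) continue;
        std::uint64_t sq = rest / 3;
        std::uint64_t s = isqrt(sq);                       // candidate a - b
        if (s * s == sq && s < d) return true;             // s < d gives b = (d - s) / 2 >= 1
    }
    return false;
}
```

For n = 1729 the divisor 13 gives (4*1729/13 - 13^2)/3 = 121 = 11^2, so a + b = 13 and a - b = 11, that is 12^3 + 1^3.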
To find all the pairs of integers x and y whose cubes sum to n, set x to the largest integer not exceeding the cube root of n and set y to 0. Then repeat: add 1 to y if the sum of the cubes is less than n, subtract 1 from x if the sum of the cubes is greater than n, and otherwise output the pair and move on. Stop when x and y cross. If you only want to know whether or not such a pair exists, you can stop as soon as you find one.
Let us know if you have trouble coding this algorithm.
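One way the procedure above might be coded, as a sketch: the function name cube_pairs is made up, n is assumed to be non-negative and at most about 10^18 so that the sum of two cubes fits in a signed 64-bit integer, and y starts at 0 as described, so a perfect cube n is reported with y = 0.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// All pairs (x, y) with x >= y >= 0 and x^3 + y^3 == n.
std::vector<std::pair<std::int64_t, std::int64_t>> cube_pairs(std::int64_t n) {
    std::vector<std::pair<std::int64_t, std::int64_t>> pairs;
    // Floor of the cube root, corrected for floating-point rounding.
    std::int64_t x = static_cast<std::int64_t>(std::cbrt(static_cast<long double>(n)));
    while (x * x * x > n) --x;
    while ((x + 1) * (x + 1) * (x + 1) <= n) ++x;

    std::int64_t y = 0;
    while (y <= x) {
        const std::int64_t sum = x * x * x + y * y * y;
        if (sum < n) {
            ++y;                       // too small: grow the smaller term
        } else if (sum > n) {
            --x;                       // too large: shrink the larger term
        } else {
            pairs.emplace_back(x, y);  // found one; any other pair has smaller x and larger y
            --x;
            ++y;
        }
    }
    return pairs;
}
```

The loop runs at most about 2 * n^(1/3) times, so for n near 10^18 that is on the order of a couple of million steps per query.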
I'm working on an embedded project where I have to write a time-out value into two byte registers of some micro-chip.
The time-out is defined as:
timeout = REG_a * (REG_b +1)
I want to program these registers using an integer in the range of 256 to, let's say, 60000. I am looking for an algorithm which, given a timeout value, calculates REG_a and REG_b.
If an exact solution is impossible, I'd like to get the next possible larger time-out value.
What have I done so far:
My current solution calculates:
temp = integer_square_root (timeout) +1;
REG_a = temp;
REG_b = temp-1;
This results in values that work well in practice. However I'd like to see if you guys could come up with a more optimal solution.
Oh, and I am memory constrained, so large tables are out of question. Also the running time is important, so I can't simply brute-force the solution.
You could adapt the trial-division idea from the answer to Algorithm to find the factors of a given Number.. Shortest Method? to find a factor of timeout. Counting the divisors is not enough here; you need an actual divisor, and one for which both values fit into byte registers.
n = timeout;
found = 0;
for (i = (n + 255) / 256; i <= 255; ++i) // smallest i for which n / i <= 256, up to the largest byte value
{
if (n % i == 0) // i divides timeout exactly
{
REG_A = i; // first factor that fits
REG_B = n / i - 1; // so that REG_A * (REG_B + 1) == timeout
found = 1;
break;
}
}
The first factor found this way will be your REG_A, and REG_B + 1 is the other value that, multiplied by it, equals timeout.
If found is still 0 (for example when timeout is prime), there is no exact solution; repeat the search with timeout + 1 to get the next larger value.
Interesting problem, Nils!
Suppose you start by fixing one of the values, say Reg_a, then compute Reg_b by division with roundup: Reg_b = ((timeout + Reg_a-1) / Reg_a) -1.
Then you know you're close, but how close? Well the upper bound on the error would be Reg_a, right? Because the error is the remainder of the division.
If you make one of the factors as small as possible and then compute the other factor, you make that upper bound on the error as small as possible.
On the other hand, by making the two factors close to the square root, you're making the divisor as large as possible, and therefore making the error as large as possible!
So:
First, what is the minimum value for Reg_a? (timeout + 255) / 256;
Then compute Reg_b as above.
This won't be the absolute minimum combination in all cases, but it should be better than using the square root, and faster, too.
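The two steps above can be sketched as follows, assuming both registers are 8 bits wide and 1 <= timeout <= 255 * 256; the struct and function names are made up for this example.

```cpp
#include <cstdint>

struct TimeoutRegs {
    std::uint8_t reg_a;
    std::uint8_t reg_b;
};

// Picks the smallest Reg_a for which Reg_b still fits in a byte,
// then rounds the division up so the programmed timeout is never too short.
TimeoutRegs split_timeout(std::uint32_t timeout) {
    const std::uint32_t a = (timeout + 255) / 256;    // minimum Reg_a
    const std::uint32_t b = (timeout + a - 1) / a - 1; // division with round-up, minus 1
    return { static_cast<std::uint8_t>(a), static_cast<std::uint8_t>(b) };
}
```

For timeout = 1000 this gives Reg_a = 4 and Reg_b = 249, which is exact; for timeout = 60000 it gives Reg_a = 235 and Reg_b = 255, that is 60160. The overshoot is always less than Reg_a.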
I am trying to find the nth (n <= 2000000) square-free semiprime. I have the following code to do so.
int k = 0;
for(int i = 0; i <= 1000; i++)
{
for(int j = i +1 ; j <= 2500; j++ )
{
semiprimes[k++] = (primes[i]*primes[j]);
}
}
sort(semiprimes,semiprimes+k);
primes[] is a list of primes.
My problem is that I get different values for n = 2000000 with different limits on the for loops. Could someone tell me how to calculate these limits correctly?
Thanks in advance..
You want to calculate the first n square-free semiprimes. "First" means that you have to generate all of them below a certain value. Your method consists of generating a lot of those numbers, sorting them, and extracting the first n values.
This can be a good approach but you must have all the numbers generated. Having two different limits in your nested loops is a good way to miss some of them (in your example, you are not calculating primes[1001]*primes[1002] which should be in semiprimes).
To avoid this problem, you have to compute all the semiprimes in a square, say [1,L]*[1,L], where L is your limit for both loops.
To determine L, all you need to do is count.
Let N be the number of semi-prime square-free numbers under primes[L-1]*primes[L-1].
N = (L * L - L) / 2
L*L is the total number of pairwise multiplications. L is the number of squares. The difference has to be divided by two to get the right number (because primes[i]*primes[j] = primes[j]*primes[i]).
You want to pick L such that n<=N. So for n = 2000000 :
int L = 2001, k = 0;
for(int i = 0; i < L; i++)
{
for(int j = i+1 ; j < L; j++ )
{
semiprimes[k++] = (primes[i]*primes[j]);
}
}
sort(semiprimes,semiprimes+k);
I don't believe an approach that works by computing all semiprimes inside a box will work in any reasonable amount of time. Say we graph the factors (p,q) of the first 2 million semiprimes. To make the graph more symmetric, let's plot a point for both (p,q) and (q,p). The graph does not form a nice rectangular region, but instead looks more like the hyperbola y=1/x. This hyperbola stretches out quite far, and iterating over the entire rectangle containing these will be a lot of wasted computation.
You may want to consider first solving the problem "how many semiprimes are there below N?" and then using a binary search. Each query can be done in about sqrt(N) steps (hint: binary search strikes again). You will need a fairly large table of primes, certainly at least up to 1 million, probably more. Although this can be trimmed down by an arbitrarily large constant factor with some precomputation.
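A sketch of how the counting query and the outer binary search could look. The number of square-free semiprimes up to x is the sum, over primes p with p*p < x, of the number of primes q with p < q <= x/p, and that inner count is a binary search in the prime table. All function names are made up for this example, the table is assumed to contain every prime up to x/2, and the caller is assumed to supply an upper bound below which at least n such numbers exist.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sieve of Eratosthenes: all primes <= limit, in increasing order.
std::vector<std::int64_t> sieve_primes(std::int64_t limit) {
    std::vector<bool> composite(limit + 1, false);
    std::vector<std::int64_t> primes;
    for (std::int64_t i = 2; i <= limit; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (std::int64_t j = i * i; j <= limit; j += i) composite[j] = true;
    }
    return primes;
}

// Number of products p*q <= x with p < q, both prime.
std::int64_t count_squarefree_semiprimes(std::int64_t x,
                                         const std::vector<std::int64_t>& primes) {
    std::int64_t count = 0;
    for (std::size_t i = 0; i < primes.size(); ++i) {
        const std::int64_t p = primes[i];
        if (p * p >= x) break;  // no prime q > p can satisfy p*q <= x
        // Index one past the last prime q <= x / p (binary search).
        const std::size_t hi =
            std::upper_bound(primes.begin(), primes.end(), x / p) - primes.begin();
        count += static_cast<std::int64_t>(hi - (i + 1));
    }
    return count;
}

// Smallest x in [1, upper] with at least n square-free semiprimes <= x.
std::int64_t nth_squarefree_semiprime(std::int64_t n, std::int64_t upper,
                                      const std::vector<std::int64_t>& primes) {
    std::int64_t lo = 1, hi = upper;
    while (lo < hi) {
        const std::int64_t mid = lo + (hi - lo) / 2;
        if (count_squarefree_semiprimes(mid, primes) >= n) hi = mid;
        else lo = mid + 1;
    }
    return lo;
}
```

With a table of primes up to 50 and upper = 100, the first value is 6 and the eighth is 33. For n = 2000000 a suitable upper bound and table size would have to be chosen, for example by doubling the bound until the count reaches n.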
Given the range [1, 2 Million], for each number in this range I need to generate
and store the number of the divisors of each integer in an array.
So if x=p1^(a1)*p2^a2*p3^a3, where p1, p2, p3 are primes,
the total number of divisors of x is given by (p1+1)(p2+1)(p3+1). I generated all
the primes below 2000 and for each integer in the range, I did trial division
to get the power of each prime factor and then used the formula above to calculate
the number of divisors and stored in an array.
But doing this is quite slow and takes around 5 seconds to generate the number of divisors
for all the numbers in the given range.
Can we do this sum in some other, more efficient way, maybe without factorizing each
of the numbers?
Below is the code that I use now.
typedef unsigned long long ull;
void countDivisors(){
ull PF_idx=0, PF=0, ans=1, N=0, power;
for(ull i=2; i<MAX; ++i){
if (i<SIEVE_SIZE and isPrime[i]) factors[i]=2;
else{
PF_idx=0;
PF=primes[PF_idx];
ans=1;
N=i;
while(N!=1 and (PF*PF<=N)){
power = 0;
while(N%PF==0){ N/=PF; ++power;}
ans*=(power+1);
PF = primes[++PF_idx];
}
if (N!=1) ans*=2;
factors[i] = ans;
}
}
}
First of all, your formula is wrong. According to your formula, the sum of the divisors of 12 should be 12. In fact it is 28. The correct formula is (p1^(a1+1) - 1) * (p2^(a2+1) - 1) * ... * (pk^(ak+1) - 1) / ( (p1 - 1) * (p2 - 1) * ... * (pk - 1) ).
That said, the easiest approach is probably just to do a sieve. One can get clever with offsets, but for simplicity just make an array of 2,000,001 integers, from 0 to 2 million. Initialize it to 0s. Then:
for (ull i = 1; i < MAX; ++i) {
for (ull j = i; j < MAX; j += i) {
factors[j] += i;
}
}
This may feel inefficient, but it is not that bad. The total work taken for the numbers up to N is N + N/2 + N/3 + ... + N/N = O(N log(N)) which is orders of magnitude less than trial division. And the operations are all addition and comparison, which are fast for integers.
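The question stores the number of divisors rather than their sum; the same sieve handles that by adding 1 instead of i. A minimal sketch, with a made-up function name:

```cpp
#include <cstddef>
#include <vector>

// counts[j] = number of divisors of j, for 1 <= j <= limit (counts[0] stays 0).
std::vector<int> divisor_counts(std::size_t limit) {
    std::vector<int> counts(limit + 1, 0);
    for (std::size_t i = 1; i <= limit; ++i)
        for (std::size_t j = i; j <= limit; j += i)
            ++counts[j];  // i divides j, so it contributes one divisor
    return counts;
}
```

Calling divisor_counts(2000000) fills the whole range in one go with the same O(N log(N)) amount of work.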
If you want to proceed with your original idea and formula, you can make that more efficient by using a modified sieve of Eratosthenes to create an array from 1 to 2 million listing a prime factor of each number. Building that array is fairly fast, and you can take any number and factorize it much, much more quickly than you could with trial division.
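A sketch of that modified sieve, assuming the goal is the number of divisors as in the question; the function names are made up for this example. The sieve records the smallest prime factor of every number, and each number is then factorized by repeatedly dividing by its recorded factor.

```cpp
#include <vector>

// spf[j] = smallest prime factor of j for 2 <= j <= limit (spf[0] and spf[1] stay 0).
std::vector<int> smallest_prime_factors(int limit) {
    std::vector<int> spf(limit + 1, 0);
    for (int i = 2; i <= limit; ++i) {
        if (spf[i] != 0) continue;           // already has a smaller prime factor
        for (int j = i; j <= limit; j += i)
            if (spf[j] == 0) spf[j] = i;     // i is the first prime to reach j
    }
    return spf;
}

// Number of divisors of n (1 <= n <= limit), using (a1+1)(a2+1)...(ak+1).
int count_divisors(int n, const std::vector<int>& spf) {
    int result = 1;
    while (n > 1) {
        const int p = spf[n];
        int power = 0;
        while (n % p == 0) { n /= p; ++power; }
        result *= power + 1;
    }
    return result;
}
```

Each factorization takes at most about log2(n) divisions, instead of trial division by every prime up to the square root.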