OpenCL modulo of large numbers - math

I'm trying to calculate a mod b in OpenCL, where a is an array of ulong elements, and is twice the length of b.
__kernel void mod(__global ulong *a, __global ulong *b, __global ulong length) {
// length = len(a) = 2 * len(b)
...
}
What I want is something like a %= b, but with arrays. The arrays represent numbers of course, with their last element representing the least significant bits.
Is it possible to do this in-place (i.e. without allocating extra memory)? What is a good algorithm for calculating the medulus for large numbers?
Note that neither of the two numbers can be easily represented in another way (e.g. using exponents). Most of the times they will be pseudoprimes. Also, having some concurrency would be nice.
Pointers to any useful material on this are welcome.
EDIT: if that helps, length can be known at compile time.
EDIT: I'm sorry I wasn't clear here. I'm not working on an array of integers, I'm working on two big integers, for example a is 8Mb (a 67108864-bit number) and b is 4Mb (a 33554432-bit number). I work them in base 2^64, hence the arrays of ulong integers. Basically, those are just the digits of the number.

You just do:
__kernel void mod(__global ulong *a, __global ulong *b, __global ulong length) {
ulong id = get_global_id(0) ;
a[id] = a[id] % b[id];
}
I don't really understand your problem, the arrays size difers? Or maybe you want a more special calculation?

Related

Can this parallelism be implemented in OpenCL

This is my first post. I'll try to keep it short because I value your time. This community has been incredible to me.
I am learning OpenCL and want to extract a little bit of parallelism from the below algorithm. I will only show you the part that I am working on, which I've also simplified as much as I can.
1) Inputs: Two 1D arrays of length (n): A, B, and value of n. Also values C[0], D[0].
2) Outputs: Two 1D arrays of length (n): C, D.
C[i] = function1(C[i-1])
D[i] = function2(C[i-1],D[i-1])
So these are recursive definitions, however the calculation of C & D for a given i value can be done in parallel (they are obviously more complicated, so as to make sense). A naive thought would be creating two work items for the following kernel:
__kernel void test (__global float* A, __global float* B, __global float* C,
__global float* D, int n, float C0, float D0) {
int i, j=get_global_id(0);
if (j==0) {
C[0] = C0;
for (i=1;i<=n-1;i++) {
C[i] = function1(C[i-1]);
[WAIT FOR W.I. 1 TO FINISH CALCULATING D[i]];
}
return;
}
else {
D[0] = D0;
for (i=1;i<=n-1;i++) {
D[i] = function2(C[i-1],D[i-1]);
[WAIT FOR W.I. 0 TO FINISH CALCULATING C[i]];
}
return;
}
}
Ideally each of the two work items (numbers 0,1) would do one initial comparison and then enter their respective loop, synchronizing for each iteration. Now given the SIMD implementation of GPUs, I assume that this will NOT work (work items would be waiting for all of the kernel code), however is it possible to assign this type of work to two CPU cores and have it work as expected? What will the barrier be in this case?
This can be implemented in opencl, but like the other answer says, you're going to be limited to 2 threads at best.
My version of your function should be called with a single work group having two work items.
__kernel void test (__global float* A, __global float* B, __global float* C, __global float* D, int n, float C0, float D0)
{
int i;
int gid = get_global_id(0);
local float prevC;
local float prevD;
if (gid == 0) {
C[0] = prevC = C0;
D[0] = prevD = D0;
}
barrier(CLK_LOCAL_MEM_FENCE);
for (i=1;i<=n-1;i++) {
if(gid == 0){
C[i] = function1(prevC);
}else if (gid == 1){
D[i] = function2(prevC, prevD);
}
barrier(CLK_LOCAL_MEM_FENCE);
prevC = C[i];
prevD = D[i];
}
}
This should run on any opencl hardware. If you don't care about saving all of the C and D values, you can simply return prevC and prevD in two floats rather than the entire list. This would also make it much faster due to sticking to a lower cache level (ie local memory) for all reading and writing of the intermediate values. The local memory boost should also apply to all opencl hardware.
So is there a point to running this on a GPU? Not for the parallelism. You are stuck with 2 threads. But if you don't need all values of C and D returned, you would probably see a significant speed up because of the much faster memory of GPUs.
All of this assumes that function1 and function2 aren't overly complex. If they are, just stick to CPUs -- and probably another multiprocessing technique such as OpenMP.
Dependency in your case is completely linear/recursive (i needs i-1). Not even logaritmic like other problems (reduction, sum, sort, etc.). And therefore this problem does not fit well in a SIMD device.
The best you can do is go a 2 threads approach in CPU. Thread 1 will "produce" data (C value), for thread 2.
A very naive approach for example:
Thread 1:
for(){
ProcessC(i);
atomic_inc(counter); //This function should unlock
}
Thread 2:
for(){
atomic_dec(counter); //This function should lock
ProcessD(i);
}
Where atomic_inc and atomic_dec can be implemented with counting semaphores for example.

Conversion with Pointsers in C

I need to implement but I am not sure how can I as I am completely new into this. A function called get_values that has the prototype:
void get_values(unsigned int value, unsigned int *p_lsb, unsigned int *p_msb,
unsigned int *p_combined)
The function computes the least significant byte and the most significant byte of the value
parameter. In addition, both values are combined. For this problem:
a. You may not use any loop constructs.
b. You may not use the multiplication operator (* or *=).
c. Your code must work for unsigned integers of any size (4 bytes, 8 bytes, etc.).
d. To combine the values, append the least significant byte to the most significant one.
e. Your implementation should be efficient.
The following driver (and associated output) provides an example of using the function you are
expected to write. Notice that in this example an unsigned int is 4 bytes, but your function
needs to work with an unsigned int of any size.
Driver
int main() {
unsigned int value = 0xabcdfaec, lsb, msb, combined;
get_values(value, &lsb, &msb, &combined);
printf("Value: %x, lsb: %x, msb: %x, combined: %x\n", value, lsb, msb, combined);
return 0;
}
Output
Value: abcdfaec, lsb: ec, msb: ab, combined: abec
I think you want to look into bitwise and and bit shifting operators. The last piece of the puzzle might be the sizeof() operator if the question is asking that the code should work with platforms with different sized int types.

OpenCL matrix vector multiplication code gives correct and incorrect solutions from run to run

I am working on OpenCL code for sparse matrix operations and I find that it works when the code including the kernel is executed once or twice. But every few runs the answer is slightly off. Here is the very simple kernel I am using:
__kernel void dsmv( int N, __global int * IA,
__global int * JA, __global float * A,
__global float * X, __global float * Y){
int IBGN, ICOL, IEND, ii;
ICOL = get_global_id(0);
if(ICOL < N)
{
IBGN = JA[ICOL]-1;
IEND = JA[ICOL+1]-1-1;
for (ii = IBGN; ii <= IEND; ii++)
{
Y[IA[ii]-1] += A[ii]*X[ICOL];
}
}
}
I can also post the fortran code that uses this kernel. I am using FortranCL.
What could cause the multiplication to give different answers from run to run?
This line looks suspicious:
Y[IA[ii]-1] += A[ii]*X[ICOL];
It seems that two work items may increment the same memory location, so there is a potential race condition here, and since += is not an atomic operation this is a problem.
Unfortunately you can't use the built-in atomic_add instead because it doesn't support floats, but atomic_cmpxchg does, so you can use it to implement a floating-point atomic add - or just look at this existing implementation of an atomic add for floats.

Avoiding data alignment in OpenCL

I need to pass a complex data type to OpenCL as a buffer and I want (if possible) to avoid the buffer alignment.
In OpenCL I need to use two structures to differentiate the data passed in the buffer casting to them:
typedef struct
{
char a;
float2 position;
} s1;
typedef struct
{
char a;
float2 position;
char b;
} s2;
I define the kernel in this way:
__kernel void
Foo(
__global const void* bufferData,
const int amountElements // in the buffer
)
{
// Now I cast to one of the structs depending on an extra value
__global s1* x = (__global s1*)bufferData;
}
And it works well only when I align the data passed in the buffer.
The question is: Is there a way to use _attribute_ ((packed)) or _attribute_((aligned(1))) to avoid the alignment in data passed in the buffer?
If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since you have data types that are 9 and 10 bytes, it may be worth a try padding them both out to 12 bytes depending on how many of them you read within your kernel.
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html
update:
I didn't realize you were passing a mixed array, I thought It was uniform in type. If you want to track the type on a per-element basis, you should pass a list of the types (or codes). Using float2 on its own in bufferData would probably be faster as well.
__kernel void
Foo(
__global const float2* bufferData,
__global const char* bufferTypes,
const int amountElements // in the buffer
)

OpenCL scalar vs vector

I have simple kernel:
__kernel vecadd(__global const float *A,
__global const float *B,
__global float *C)
{
int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Why when I change float to float4, kernel runs more than 30% slower?
All tutorials says, that using vector types speeds up computation...
On host side, memory alocated for float4 arguments is 16 bytes aligned and global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.
Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For smaller global_work_size (8196 for float / 2048 for float4), vectorized version is faster, but I would like to know, why?
I don't know what are the tutorials you refer to, but they must be old.
Both ATI and NVIDIA use scalar gpu architectures for at least half-decade now.
Nowdays using vectors in your code is only for syntactical convenience, it bears no performance benefit over plain scalar code.
It turns out scalar architecture is better for GPUs than vectored - it is better at utilizing the hardware resources.
I am not sure why the vectors would be that much slower for you, without knowing more about workgroup and global size. I would expect it to at least the same performance.
If it is suitable for your kernel, can you start with C having the values in A? This would cut down memory access by 33%. Maybe this applies to your situation?
__kernel vecadd(__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
C[idx] += B[idx];
}
Also, have you tired reading in the values to a private vector, then adding? Or maybe both strategies.
__kernel vecadd(__global const float4 *A,
__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
float4 tmp = A[idx] + B[idx];
C[idx] = tmp;
}

Resources