This page says that vloadn(size_t offset, const gentype *p) "returns sizeof (gentypen) bytes of data read from address (p + (offset * n))". Does it imply that short4 m = vload4(1920, p) will read four 16-bit values starting from address p+1920*4 or will it read one 16-bit value each from locations p+1920*0, p+1920*1, p+1920*2 and p+1920*3?
Reading one value each from p+1920*0, p+1920*1, p+1920*2 and p+1920*3 would be a strided (gather) pattern, but the definition says it is a vector load and says nothing about sparse or gather access, so it has to read four 16-bit values contiguously starting from address p + 1920*4.
So it shouldn't be different from loading a struct, except perhaps for the alignment handling.
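To make the contiguous behaviour concrete, here is a small sketch (the kernel and argument names are made up for illustration); both loads below produce the same short4:

__kernel void vload_demo(__global const short *p, __global short4 *out)
{
    const size_t offset = 1920;

    // vload4 reads 4 contiguous shorts starting at element index offset * 4.
    short4 a = vload4(offset, p);

    // Manual equivalent of the same contiguous read.
    short4 b = (short4)(p[offset * 4 + 0],
                        p[offset * 4 + 1],
                        p[offset * 4 + 2],
                        p[offset * 4 + 3]);

    out[0] = a;
    out[1] = b;
}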
For strided copies, you can use
event_t async_work_group_strided_copy ( __local gentype *dst,
const __global gentype *src,
size_t num_gentypes,
size_t src_stride,
event_t event)
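For example, a sketch of pulling every fourth element of a short array into local memory with it (the buffer size, stride and names are only illustrative):

__kernel void gather_strided(__global const short *src, __global short *dst)
{
    __local short tile[64];

    // Copy 64 shorts into tile, reading every 4th element of src.
    event_t e = async_work_group_strided_copy(tile, src, 64, 4, 0);
    wait_group_events(1, &e);

    // Each work-item writes one element back out, just to use the data.
    size_t lid = get_local_id(0);
    if (lid < 64)
        dst[get_group_id(0) * 64 + lid] = tile[lid];
}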
Related
I have two functions:
void sum1(short * a, short * b, short * res, int size);
void sum2(float * a, float * b, float * res, int size);
and I have a single generic kernel
__kernel void sum(__global const T * a, __global const T * b, __global T * res, int size)
{
int x = get_global_id(0);
if (x < size) res[x] = a[x] + b[x];
}
Is it safe to invoke this generic kernel from the functions presented above with compile options -D T=short and -D T=float respectively? Do I need to handle alignment myself, or does OpenCL automatically align the kernel arguments in this case to 2 and 4 bytes respectively?
In general, when I pass a cl_mem object to a kernel, OpenCL does not know the data type stored in that cl_mem object, and I can't understand how OpenCL "transforms" the cl_mem object into the appropriate pointer in the kernel argument. I need help with that.
1- Yes, it is safe to pass -D T=short or -D T=float at compile time; it will generate two proper kernels.
2- OpenCL (like other programming languages with pointers) understands that the pointer you are passing has a type, and it sticks to that type when addressing memory.
At least in plain C this is not a problem, since automatic casting between incompatible pointer types is not allowed and the programmer gets an error if the pointer type doesn't match.
However, in OpenCL the memory of a buffer is treated as a generic (void) region. When you set it as a kernel argument, the cast to the kernel's pointer type is implicit - but that doesn't mean it is correct!
For example, if you create a float buffer, fill it with floats and pass it to the short kernel, the result will be wrong, since the kernel will misinterpret the buffer. If you do it the other way around and pass a short buffer to the float kernel, the kernel will read past the end of the buffer and the result will be a segmentation fault.
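For completeness, a minimal host-side sketch of what this looks like (assuming a context, device and source string already exist; error checking omitted):

/* Build the same source twice, once per element type. */
cl_int err;
cl_program prog_short = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(prog_short, 1, &device, "-D T=short", NULL, NULL);
cl_kernel sum_short = clCreateKernel(prog_short, "sum", &err);

cl_program prog_float = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(prog_float, 1, &device, "-D T=float", NULL, NULL);
cl_kernel sum_float = clCreateKernel(prog_float, "sum", &err);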
Could somebody tell me how to convert double-precision values into network byte order?
I tried
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
functions and they worked well, but none of them does the double (float) conversion, because these types are represented differently on different architectures. Through XDR I found the double-precision format representation (http://en.wikipedia.org/wiki/Double_precision), but nothing about its byte ordering.
So I would much appreciate it if somebody could help me out with this (C code would be great!).
NOTE: The OS is Linux (kernel 2.6.29), ARMv7 CPU architecture.
You could look at the IEEE 754 interchange formats for floating point.
But the key is to define a network order yourself, e.g. byte 1 holds the sign and exponent, and bytes 2 to n hold the mantissa in MSB-first order.
Then you can declare your functions
uint64_t htond(double hostdouble);
double ntohd(uint64_t netdouble);
The implementation then only depends on your compiler/platform.
The best approach is to pick a natural definition, so that on the ARM platform only simple transformations are needed.
EDIT:
From the comment
static void htond (double &x)
{
int *Double_Overlay;
int Holding_Buffer;
Double_Overlay = (int *) &x;
Holding_Buffer = Double_Overlay [0];
Double_Overlay [0] = htonl (Double_Overlay [1]);
Double_Overlay [1] = htonl (Holding_Buffer);
}
This could work, but obviously only if both platforms use the same encoding for double and if int has the same size as long.
By the way, returning the value in place through the argument like this is a bit odd.
But you could write a more stable version, like this (pseudo code)
void htond (const double hostDouble, uint8_t result[8])
{
result[0] = signOf(hostDouble);
result[1] = exponentOf(hostDouble);
result[2..7] = mantissaOf(hostDouble);
}
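For example, if it is acceptable to assume IEEE 754 binary64 on both ends and to define the network format simply as the big-endian byte order of the double's bit pattern, a sketch matching the declarations above could be:

#include <stdint.h>
#include <string.h>

/* Serialize the IEEE 754 bit pattern of a double into big-endian order. */
uint64_t htond(double hostdouble)
{
    uint64_t bits;
    memcpy(&bits, &hostdouble, sizeof bits);   /* reinterpret without pointer casts */

    unsigned char out[8];
    for (int i = 0; i < 8; ++i)
        out[i] = (unsigned char)(bits >> (56 - 8 * i));  /* most significant byte first */

    memcpy(&bits, out, sizeof bits);           /* memory now holds network byte order */
    return bits;
}

double ntohd(uint64_t netdouble)
{
    unsigned char in[8];
    memcpy(in, &netdouble, sizeof in);

    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | in[i];            /* rebuild the host-order bit pattern */

    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

As with htonl, the returned uint64_t is meant to be copied into the packet byte for byte; on a big-endian host both functions degenerate into plain copies.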
This might be hacky (the char* hack), but it works for me:
double Buffer::get8AsDouble(){
    // Reinterpret the 8 bytes at the cursor as a (little-endian) double.
    double little_endian = *(double*)this->cursor;
    double big_endian;
    int x = 0;
    char *little_pointer = (char*)&little_endian;
    char *big_pointer = (char*)&big_endian;
    // Reverse the byte order into the result.
    while( x < 8 ){
        big_pointer[x] = little_pointer[7 - x];
        ++x;
    }
    return big_endian;
}
For brevity, I've not included the range guards, though you should include them when working at this level.
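If the unaligned double* dereference worries you, a memcpy-based variant of the same swap avoids both the alignment and the aliasing concerns; this is only a sketch, written as a free function with a made-up name:

#include <string.h>

/* Reverse the 8 bytes at 'cursor' and reinterpret them as a double. */
double get8_as_double(const unsigned char *cursor)
{
    unsigned char swapped[8];
    for (int x = 0; x < 8; ++x)
        swapped[x] = cursor[7 - x];

    double value;
    memcpy(&value, swapped, sizeof value);  /* no pointer-cast tricks needed */
    return value;
}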
I need to pass a complex data type to OpenCL as a buffer, and I want (if possible) to avoid aligning/padding the data in the buffer.
In OpenCL I need to use two structures to interpret the data passed in the buffer, casting to one of them:
typedef struct
{
char a;
float2 position;
} s1;
typedef struct
{
char a;
float2 position;
char b;
} s2;
I define the kernel in this way:
__kernel void
Foo(
__global const void* bufferData,
const int amountElements // in the buffer
)
{
// Now I cast to one of the structs depending on an extra value
__global s1* x = (__global s1*)bufferData;
}
And it works well only when I align the data passed in the buffer.
The question is: is there a way to use __attribute__((packed)) or __attribute__((aligned(1))) to avoid the alignment of the data passed in the buffer?
If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since your data types are 9 and 10 bytes, it may be worth trying to pad them both out to 12 bytes, depending on how many of them you read within your kernel.
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html
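To illustrate the first suggestion, the type (or element size) could be passed as an extra scalar argument; a minimal sketch, where the elementKind codes 0 and 1 are made up for illustration:

__kernel void
Foo(
    __global const void* bufferData,
    const int amountElements,  // in the buffer
    const int elementKind      // hypothetical code: 0 = s1, 1 = s2
)
{
    if (elementKind == 0) {
        __global const s1* x = (__global const s1*)bufferData;
        // ... process the buffer as s1 elements ...
    } else {
        __global const s2* y = (__global const s2*)bufferData;
        // ... process the buffer as s2 elements ...
    }
}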
update:
I didn't realize you were passing a mixed array; I thought it was uniform in type. If you want to track the type on a per-element basis, you should pass a list of the types (or type codes). Using float2 on its own in bufferData would probably be faster as well.
__kernel void
Foo(
__global const float2* bufferData,
__global const char* bufferTypes,
const int amountElements // in the buffer
)
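{
    // One possible body (a sketch): the type codes 0 and 1 below are
    // placeholders, not something defined in the original answer.
    int i = get_global_id(0);
    if (i < amountElements) {
        float2 p = bufferData[i];
        if (bufferTypes[i] == 0) {
            // treat p as the position of an "s1"-style element
        } else {
            // treat p as the position of an "s2"-style element
        }
    }
}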
I have a simple kernel:
__kernel void vecadd(__global const float *A,
__global const float *B,
__global float *C)
{
int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Why, when I change float to float4, does the kernel run more than 30% slower?
All the tutorials say that using vector types speeds up computation...
On the host side, the memory allocated for the float4 arguments is 16-byte aligned and the global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.
Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For smaller global_work_size (8196 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.
I don't know which tutorials you are referring to, but they must be old.
Both ATI and NVIDIA have been using scalar GPU architectures for at least half a decade now.
Nowadays using vectors in your code is mostly a syntactic convenience; it bears no performance benefit over plain scalar code.
It turns out a scalar architecture is better for GPUs than a vector one - it is better at utilizing the hardware resources.
I am not sure why the vectors would be that much slower for you without knowing more about the workgroup and global sizes; I would expect at least the same performance.
If it is suitable for your kernel, can you start with C already holding the values of A? This would cut down memory access by 33%. Maybe this applies to your situation?
__kernel void vecadd(__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
C[idx] += B[idx];
}
Also, have you tried reading the values into a private vector first, then adding? Or maybe both strategies.
__kernel void vecadd(__global const float4 *A,
__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
float4 tmp = A[idx] + B[idx];
C[idx] = tmp;
}
I'm trying to calculate a mod b in OpenCL, where a is an array of ulong elements, and is twice the length of b.
__kernel void mod(__global ulong *a, __global ulong *b, const ulong length) {
    // length = len(a) = 2 * len(b)
    ...
}
What I want is something like a %= b, but with arrays. The arrays represent numbers of course, with their last element representing the least significant bits.
Is it possible to do this in place (i.e. without allocating extra memory)? What is a good algorithm for calculating the modulus of large numbers?
Note that neither of the two numbers can easily be represented in another way (e.g. using exponents). Most of the time they will be pseudoprimes. Also, having some concurrency would be nice.
Pointers to any useful material on this are welcome.
EDIT: if that helps, length can be known at compile time.
EDIT: I'm sorry I wasn't clear here. I'm not working on an array of integers; I'm working with two big integers, for example a is 8 MB (a 67108864-bit number) and b is 4 MB (a 33554432-bit number). I work with them in base 2^64, hence the arrays of ulong integers. Basically, those are just the digits of the numbers.
You just do:
__kernel void mod(__global ulong *a, __global ulong *b, const ulong length) {
    ulong id = get_global_id(0);
    a[id] = a[id] % b[id];
}
I don't really understand your problem - do the array sizes differ? Or do you want a more special calculation?