Avoiding data alignment in OpenCL - opencl

I need to pass a complex data type to OpenCL as a buffer and I want (if possible) to avoid the buffer alignment.
In OpenCL I need to use two structures to differentiate the data passed in the buffer casting to them:
typedef struct
{
char a;
float2 position;
} s1;
typedef struct
{
char a;
float2 position;
char b;
} s2;
I define the kernel in this way:
__kernel void
Foo(
__global const void* bufferData,
const int amountElements // in the buffer
)
{
// Now I cast to one of the structs depending on an extra value
__global s1* x = (__global s1*)bufferData;
}
And it works well only when I align the data passed in the buffer.
The question is: Is there a way to use __attribute__ ((packed)) or __attribute__((aligned(1))) to avoid the alignment in data passed in the buffer?

If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since you have data types that are 9 and 10 bytes, it may be worth a try padding them both out to 12 bytes depending on how many of them you read within your kernel.
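The 9, 10 and 12 byte figures can be checked on the host with a small sketch. It assumes a GCC or Clang compiler (for the __attribute__((packed)) syntax) and a 4-byte float; float position[2] stands in for float2, and the names s1_packed, s2_packed, s1_padded, s2_padded and check_layouts are made up for illustration. It only checks sizes: position still sits at offset 1, and whether a device can read it there is device-dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Packed layouts: no padding between members (GCC/Clang attribute syntax).
   float position[2] is a host-side stand-in for OpenCL's float2. */
typedef struct __attribute__((packed)) {
    char  a;
    float position[2];
} s1_packed;

typedef struct __attribute__((packed)) {
    char  a;
    float position[2];
    char  b;
} s2_packed;

/* Both records padded explicitly to a common 12-byte stride. */
typedef struct __attribute__((packed)) {
    char  a;
    float position[2];
    char  pad[3];
} s1_padded;

typedef struct __attribute__((packed)) {
    char  a;
    float position[2];
    char  b;
    char  pad[2];
} s2_padded;

/* Aborts if any layout differs from the sizes discussed above. */
void check_layouts(void)
{
    assert(sizeof(s1_packed) == 9);
    assert(sizeof(s2_packed) == 10);
    assert(sizeof(s1_padded) == 12);
    assert(sizeof(s2_padded) == 12);
}
```

With a common 12-byte stride, element i starts at byte 12 * i regardless of which of the two types it holds.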
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html
update:
I didn't realize you were passing a mixed array; I thought it was uniform in type. If you want to track the type on a per-element basis, you should pass a list of the types (or type codes) alongside the data. Using float2 on its own in bufferData would probably be faster as well.
__kernel void
Foo(
__global const float2* bufferData,
__global const char* bufferTypes,
const int amountElements // in the buffer
)
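A host-side sketch of how a kernel with this signature might use the two buffers. It is plain C, not OpenCL: each loop iteration plays the role of one work-item, positions holds two floats per element in place of float2, and the type codes TYPE_S1 / TYPE_S2 and the function name sum_x_of_type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { TYPE_S1 = 1, TYPE_S2 = 2 };   /* hypothetical per-element type codes */

/* Sums the x component of every element whose type code equals `wanted`.
   positions: 2 floats (x, y) per element; types: one code per element. */
float sum_x_of_type(const float *positions, const char *types,
                    size_t amountElements, char wanted)
{
    assert(positions != NULL && types != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < amountElements; ++i) {   /* i ~ get_global_id(0) */
        if (types[i] == wanted)
            sum += positions[2 * i];                /* x of element i */
    }
    return sum;
}
```

Because the positions are stored on their own, every element is naturally aligned and the one-byte type codes never disturb the float layout.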

Related

Using async_work_group_copy() with pointer?

__kernel void kmp(__global char pattern[1*4], __global char* string, __global int failure[1*4], __global int ret[1], int g_length, int l_length, int thread_num){
int pattern_num = 1;
int pattern_size = 4;
int gid = get_group_id(0);
int glid = get_global_id(0);
int lid = get_local_id(0);
int i, j, x = 0;
__local char *tmp_string;
event_t event;
if(l_length < pattern_size){
return;
}
event = async_work_group_copy(tmp_string, string+gid*g_length, g_length, 0);
wait_group_events(1, &event);
This is part of my code.
I want to find the matching pattern in the text.
First, I initialize all my patterns and the string on the CPU side (I read the string from a text file and, for now, use only one pattern).
Second, I transfer them to the kernel named kmp.
(The parameters l_length and g_length are the sizes of the pieces of the string that will be copied for lid and glid respectively.)
Lastly, I want to copy the divided string to local memory.
But there is a problem: I cannot get any valid result when I copy the pieces using async_work_group_copy().
When I change __local char *tmp_string to an array, the problem remains.
What I want to do is 1) divide the string, 2) copy the pieces to each thread, and 3) compute the number of matches.
I wonder what's wrong with this code. Thanks!
OpenCL spec has this:
The async copy is performed by all work-items in a work-group and this
built-in function must therefore be encountered by all work-items in a
work-group executing the kernel with the same argument values;
otherwise the results are undefined.
So no work-item in a group should return before the copy. An early return suits CPUs better anyway. On a GPU, pad the input and output buffers so that the last, partially filled group runs the same code path.
Alternatively, the whole group can return early (this should work, because then no work-item reaches the async copy) and the remaining work can be done on the CPU, unless the device performs the async copy with a dedicated pipeline rather than with work-items.
You could also enqueue a second kernel (concurrently, in another queue) that computes the remaining items with a work-group size equal to the remainder, instead of adding extra buffer space or control logic.
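A small host-side sketch of the padding arithmetic, assuming the goal is a global size that is a whole multiple of the work-group size so that every group is full. The function names round_up_global_size and padding_items are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of group_size that is >= work_items. */
size_t round_up_global_size(size_t work_items, size_t group_size)
{
    assert(group_size > 0);
    return ((work_items + group_size - 1) / group_size) * group_size;
}

/* Number of extra (padding) items the buffers must hold so that the
   last work-group can run the same code path as all the others. */
size_t padding_items(size_t work_items, size_t group_size)
{
    return round_up_global_size(work_items, group_size) - work_items;
}
```

For 1000 items and a group size of 256 this gives a global size of 1024 with 24 padding items; the same remainder logic tells you how many items a second, smaller launch would have to cover.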
tmp_string must refer to allocated local memory before you copy anything to or from it, so you will probably need the array version.
async_work_group_copy is not a synchronization point, so it needs a barrier before it to make sure the latest local-memory writes are visible when copying from local to global memory.
__kernel void foo(__global int *a, __global int *b)
{
int i=get_global_id(0);
int g=get_group_id(0);
int l=get_local_id(0);
int gs=get_local_size(0);
__local int tmp[256];
event_t evt=async_work_group_copy(tmp,&a[g*gs],gs,0);
// compute foobar here in async to copies
wait_group_events(1,&evt);
tmp[l]=tmp[l]+3; // compute foobar2 using local memory
barrier(CLK_LOCAL_MEM_FENCE);
event_t evt2=async_work_group_copy(&b[g*gs],tmp,gs,0);
// compute foobar3 here in async to copies
wait_group_events(1,&evt2);
}

memcpy in opencl 1.1

Is there any mechanism like memcpy in OpenCL? I want to copy a struct of pointers to the GPU, and as we cannot copy pointers directly from host to device, I decided to create a separate buffer for each pointer and then put all of them together in one struct on the device. So I wanted a mechanism like memcpy to copy the data from one buffer to another. Is there anything like that?
struct Grid_gpu {
cl_uint3 dims;
cl_uint* elements_beg_index;
cl_uint n_element_cells;
cl_uint* element_cells;
};
So, to do that I defined a struct on my device as follows:
typedef struct {
uint3 dims;
__global uint* element_begIndices;
uint n_element_cells; // plain member: an address space qualifier is only valid on the pointed-to type
__global uint* element_cells;
} Grid;
Then I just used the following kernel to fill the memory which is of type Grid:
// a kernel to fill a grid acceleration structure.
__kernel void fill_grid_accel(
uint3 dims,
__global uint* element_begIndices,
uint n_element_cells,
__global uint* element_cells,
__global Grid* grid){
grid->dims.x = dims.x;
grid->dims.y = dims.y;
grid->dims.z = dims.z;
grid->element_begIndices = element_begIndices;
grid->n_element_cells = n_element_cells;
grid->element_cells = element_cells;
}
Now the grid memory, which contains the pointers, is filled with the required data.

OpenCL kernels arguments ambiguity

I have two functions:
void sum1(short * a, short * b, short * res, int size);
void sum2(float * a, float * b, float * res, int size);
and I have a single generic kernel
__kernel void sum(__global const T * a, __global const T * b, __global T * res, int size)
{
int x = get_global_id(0);
if (x < size) res[x] = a[x] + b[x];
}
Is it safe to invoke this generic kernel from the functions presented above with the compile options -D T=short and -D T=float respectively? Do I need to handle alignment myself, or does OpenCL automatically align kernel arguments in this case to 2 and 4 bytes respectively?
In general, when I pass a cl_mem object to a kernel, OpenCL does not know the data type stored in that cl_mem object, and I cannot understand how OpenCL "transforms" the cl_mem object into the appropriate pointer in the kernel argument. I need help with that.
1- Yes. It is safe to use -D T=short or -D T=float directly at compile time, since each option generates a proper kernel of its own.
2- OpenCL (like other programming languages with pointers) treats the pointer you declare as having a type, and it sticks to that type when addressing the memory.
In C this is rarely a problem, because pointers of different types are not converted implicitly and the compiler reports a mismatch.
In OpenCL, however, the memory behind a buffer is generic, like a void pointer. When you set a buffer as a kernel argument, the cast to the kernel's pointer type is implicit. But this doesn't mean it is correct!
For example, if you create a float buffer, fill it with floats, and use it as an argument to the short kernel, the result will be wrong, since the kernel misinterprets the bytes. In the opposite case, passing a short array to the float kernel with the same element count, the kernel reads past the end of the buffer, which is undefined behavior and can crash (for example with a segmentation fault).
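Both failure modes can be illustrated on the host in plain C. This is a sketch, not OpenCL code: it assumes a 4-byte IEEE 754 float, and the helper names element_count and float_bytes_as_shorts are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of whole elements of elem_size bytes that fit in a buffer. */
size_t element_count(size_t buffer_bytes, size_t elem_size)
{
    assert(elem_size > 0);
    return buffer_bytes / elem_size;
}

/* Copies the raw bytes of one float into two 16-bit integers, which is
   what a short kernel effectively sees when it is handed a float buffer. */
void float_bytes_as_shorts(float value, uint16_t out[2])
{
    assert(sizeof(float) == 2 * sizeof(uint16_t));
    memcpy(out, &value, sizeof value);
}
```

The float 1.0f has the bit pattern 0x3F800000, so the short kernel would see the halves 0x0000 and 0x3F80 (in an order that depends on endianness) instead of 1. Conversely, a buffer of 100 shorts holds only 50 floats, so a float kernel launched with size 100 would index well past the end of the allocation.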

With the Arduino Ethernet shield, what's the difference between `write()`, `print()`, and `println()`?

Using the Arduino Ethernet Server Library, what is the difference between:
server.write(data);,
server.print(data);, and
server.println(data);
I know that println adds a new line, where print does not. I cannot find any examples for server.write();.
(Long answer, skip to TL;DR at the bottom if unwieldy)
Where print() and write() come from
To find out, we can look at the source. Server is an instance of the EthernetServer class defined in arduino/libraries/Ethernet/EthernetServer.h (selected lines only)
#include "Server.h"
class EthernetClient;
class EthernetServer :
public Server {
private:
public:
virtual size_t write(uint8_t);
virtual size_t write(const uint8_t *buf, size_t size);
using Print::write;
};
Ok, so it is a Server. Server is defined in /usr/share/arduino/hardware/arduino/cores/arduino/Server.h, and there is very little to it:
class Server : public Print {
public:
virtual void begin() =0;
};
This means that server is a subclass of Print so we can look for differences between write() and print() there.
print() and write() parameters
We see that this class (i.e. Print) defines a number of overloaded print() methods:
size_t print(const __FlashStringHelper *);
size_t print(const String &);
size_t print(const char[]);
size_t print(char);
size_t print(unsigned char, int = DEC);
size_t print(int, int = DEC);
size_t print(unsigned int, int = DEC);
size_t print(long, int = DEC);
size_t print(unsigned long, int = DEC);
size_t print(double, int = 2);
size_t print(const Printable&);
and three overloaded write() methods:
virtual size_t write(uint8_t) = 0;
size_t write(const char *str) { return write((const uint8_t *)str, strlen(str)); }
virtual size_t write(const uint8_t *buffer, size_t size);
As you can see, the C-string write uses the block write (the third method), and in the default implementation the block write uses the byte write (the first method), which is a pure virtual method: virtual size_t write(uint8_t) = 0;. It must be overridden in every class that derives from Print. Additionally, the block write() may be overridden as well in order to write multi-byte data more efficiently.
So, parameter-wise:
write(): on bytes (uint8_t), byte buffers, and char array pointers (= regular C strings)
print(): Arduino Strings, ints and longs (in whatever base), floats, and any class derived from Printable, in addition to chars and C strings.
As you can see, formally, there is little overlap between the parameters write() and print() take. For instance, only write() takes uint8_t, but only print() can take a char. The only area of overlap is the C-style strings: there is print(const char[]); and write(const char *str);. However, even in cases like char, the print() function simply calls write(uint8_t):
size_t Print::print(char c)
{
return write(c);
}
The same is true for print(char[])
write() in EthernetServer
The EthernetServer class introduces a block write method
size_t EthernetServer::write(const uint8_t *buffer, size_t size)
and in the EthernetServer the write(uint8_t) simply thunks to the block write:
size_t EthernetServer::write(uint8_t b)
{
return write(&b, 1);
}
Since all the print() calls and non-uint8_t write() calls use either write(uint8_t) or write(uint8_t*, size_t), in the EthernetServer class every print/write call is made using the block write.
Performance and choosing between print() and write()
The thunking print() functions (such as print(char c)) will most likely be inlined by the gcc compiler, though if you are worried about this you can call write() instead of print().
One case where you might want to call write() instead of print() is when you are holding a byte/uint8_t and want to send it as is. print() first widens the value to a 4-byte unsigned long and then formats it as decimal digits, which takes more code and sends the number's text rather than the raw byte. In this case write() will be a tiny bit faster.
On the other hand, code consistency is probably worth something too. From this perspective it might make sense to make all print() calls.
Most of the time, however, your types will dictate calling the print() function: write can only take three types of input.
TL;DR: The answer to your question then is that there isn't much difference between print() and write() except:
The write() methods (byte or block) are the methods that do the actual work of sending characters somewhere, in every case.
write() can take bytes (uint8_t), byte buffers, and char array pointers (= regular C strings) as parameters, whereas print() takes Arduino Strings, ints and longs (in whatever base), floats, and any class derived from Printable, in addition to chars and C strings. So we might say that write() is lower level than print(), given the fact that it only takes low-level types.
Most of the time your output types will dictate which one to use. To make the fastest code use write() for printing byte/uint8_t types, but print everywhere makes your code look a teensy bit better IMHO (mainly because it doesn't raise the print() versus write() questions).
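The layering can be modelled in a few lines of plain C. This is only a sketch of the idea, not the Arduino API: sink_t, sink_write and sink_print_uint are hypothetical names, with sink_write playing the role of write(uint8_t) and sink_print_uint the role of print() for an unsigned number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory sink standing in for the network client. */
typedef struct {
    char   data[64];
    size_t len;
} sink_t;

/* Lowest-level operation: append one raw byte. Returns bytes written. */
size_t sink_write(sink_t *s, uint8_t b)
{
    assert(s != NULL);
    if (s->len >= sizeof s->data)
        return 0;                       /* sink is full */
    s->data[s->len++] = (char)b;
    return 1;
}

/* Formats n in decimal and emits each digit through sink_write. */
size_t sink_print_uint(sink_t *s, unsigned long n)
{
    char   digits[3 * sizeof n];        /* 3 decimal digits per byte is enough */
    size_t count = 0, written = 0;
    do {                                /* collect digits, least significant first */
        digits[count++] = (char)('0' + n % 10);
        n /= 10;
    } while (n != 0);
    while (count > 0)                   /* emit them most significant first */
        written += sink_write(s, (uint8_t)digits[--count]);
    return written;
}
```

In this model, writing the value 65 stores the single byte 'A', while printing 65 stores the two characters '6' and '5', each sent through the byte-level write.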

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here: http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
The kernel they have listed is this, which computes the sum of two numbers and stores it in the output variable:
__kernel void vector_add_gpu (__global const float* src_a,
__global const float* src_b,
__global float* res,
const int num)
{
/* get_global_id(0) returns the ID of the thread in execution.
As many threads are launched at the same time, executing the same kernel,
each one will receive a different ID, and consequently perform a different computation.*/
const int idx = get_global_id(0);
/* Now each work-item asks itself: "is my ID inside the vector's range?"
If the answer is YES, the work-item performs the corresponding computation*/
if (idx < num)
res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes, it is possible. You just have to remember that OpenCL is based on C99 with some caveats. You can create other functions either inside the same kernel file or in a separate file that you include at the beginning. Auxiliary functions do not need to be declared inline; keep in mind, however, that OpenCL implementations typically inline them at the call site. Function pointers are not available, so an auxiliary function cannot be passed around and called through a pointer.
Example
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
//logic to detect if the ray intersects a triangle
return (float4)(0.0f); // placeholder so the function returns a value
}
__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
int gid = get_global_id(0);
float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel; see "OpenCL user defined inline functions". You cannot pass function pointers into the kernel.
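Since function pointers cannot be passed in, one common workaround for the second part of the question is to pass an integer operation code and switch on it inside the helper. The sketch below is plain host-side C in which each loop iteration stands in for one work-item; the codes OP_ADD / OP_MUL and the names complex_op and vector_op are assumptions for illustration, not part of the tutorial.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_ADD = 0, OP_MUL = 1 };   /* hypothetical operation codes */

/* Auxiliary function: selects the operation by code instead of by pointer. */
float complex_op(int op, float in1, float in2)
{
    switch (op) {
    case OP_ADD: return in1 + in2;
    case OP_MUL: return in1 * in2;
    default:     return 0.0f;       /* unknown code: defined fallback */
    }
}

/* Host-side stand-in for the kernel body; idx ~ get_global_id(0). */
void vector_op(int op, const float *src_a, const float *src_b,
               float *res, size_t num)
{
    assert(src_a != NULL && src_b != NULL && res != NULL);
    for (size_t idx = 0; idx < num; ++idx)
        res[idx] = complex_op(op, src_a[idx], src_b[idx]);
}
```

In an actual kernel the operation code would be an extra scalar argument, and the helper would sit above the kernel in the same source file.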
