Calling OpenCL kernel from another OpenCL kernel - opencl

I have seen in one post here that we can call a function from an OpenCL kernel. But in my situation, I need that complex function to be parallelized (run by all available threads) as well. Do I have to make that function a kernel too and call it directly, like a function, from the main kernel? Or what is a possible solution for this situation? Thanks in advance.

You can call helper functions from your kernel and they will be parallelized in the same manner as the kernel; imagine them as inlined inside your kernel code. Each work-item invokes the helper function for the part of the working set it handles.
/* Sums the four components, so the result is a scalar float. */
float helper_function(float4 input)
{
    return input.x + input.y + input.z + input.w;
}
__kernel void kernel_function(__global const float4 *arr, __global float *out)
{
    size_t id = get_global_id(0);
    out[id] = helper_function(arr[id]);
}
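To make the "imagine them as inlined" picture concrete, here is a minimal host-side sketch in plain C (not OpenCL C) that emulates the NDRange with a loop. The names vec4, sum_components and run_range are hypothetical stand-ins; on a real device each loop iteration would be a separate work-item and the index would come from get_global_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

/* The helper: an ordinary function, no kernel qualifier needed. */
float sum_components(vec4 v)
{
    return v.x + v.y + v.z + v.w;
}

/* Emulates launching n work-items; each iteration plays one work-item. */
void run_range(const vec4 *arr, float *out, size_t n)
{
    assert(arr != NULL && out != NULL); /* host-only sanity check */
    for (size_t id = 0; id < n; ++id) {
        out[id] = sum_components(arr[id]);
    }
}
```

The helper never needs to know how many work-items exist; it only sees the element handed to it, which is why it parallelizes for free.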

The OpenCL 2.0 spec added a new feature for dynamic parallelism.
6.13.17 Enqueuing Kernels
OpenCL 2.0 allows a kernel to independently enqueue to the same device, without host
interaction. ...
In the example below my_func_B enqueues my_func_A on the device:
kernel void
my_func_A(global int *a, global int *b, global int *c)
{
...
}
kernel void
my_func_B(global int *a, global int *b, global int *c)
{
ndrange_t ndrange;
// build ndrange information
...
// example – enqueue a kernel as a block
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, ^{my_func_A(a, b, c);});
...
}

If I understand your question correctly, you want to do a separate full pass over a buffer from inside the kernel. I don't think that is possible from within the kernel, so you'd have to create the code for the "inner" pass as a separate kernel and also call that kernel separately from your host code. The output from that kernel doesn't have to be read back to the host memory, but can stay in device memory between your kernel calls.

Related

OpenCL: Passing a pointer to local memory

I have the following example code:
int compute_stuff(int *array)
{
/* do stuff with array */
...
return x;
}
__kernel void my_kernel()
{
__local int local_mem_block[LENGTH*MY_LOCAL_WORK_SIZE];
int result;
/* do stuff with local memory block */
result = compute_stuff(local_mem_block + (LENGTH*get_local_id(0)));
...
}
The above example compiles and executes fine on my NVIDIA card (RTX 2080).
But when I try to compile on a Macbook with AMD card, I get the following error:
error: passing '__local int *' to parameter of type '__private int *' changes address space of pointer
OK, so then I change the "compute_stuff" function to the following:
int compute_stuff(__local int *array)
Now both NVIDIA and AMD compile it fine, no problem...
But then I have one more test, to compile it on the same Macbook using WINE (rather than boot to Windows in bootcamp), and it gives the following error:
error: parameter may not be qualified with an address space
So it seems as though one is not supposed to qualify a function parameter with an address space. Fair enough. But if I do not do that, then the AMD on native Windows thinks that I am trying to change the address space of the pointer to private (I guess because it assumes that all function arguments will be private?).
What is a good way to handle this so that all three environments are happy to compile it? As a last resort, I am thinking of simply having the program check to see if the build failed without qualifier, and if so, substitute in the "__local" qualifier and build a second time... Seems like a hack, but it could work.
I agree with ProjectPhysX that it appears to be a bug with the WINE implementation. I also found the following appears to satisfy all three environments:
int compute_stuff(__local int * __private array)
{
...
}
__kernel void my_kernel()
{
    __local int local_mem_block[LENGTH*MY_LOCAL_WORK_SIZE];
    __local int * __private samples;
    int result;
    samples = local_mem_block + (LENGTH*get_local_id(0));
    result = compute_stuff(samples);
}
The above is explicitly stating that the pointer itself is private while the memory it is pointing to is kept in local address space. So this removes any ambiguity.
In OpenCL 2.0, the int* in int compute_stuff(int *array) is in the __generic address space. The call result = compute_stuff(local_mem_block+...); passes a __local pointer, which is implicitly converted to __generic; the OpenCL 2.0 Khronos specification allows this.
It could be that AMD defaults to OpenCL 1.2. Maybe explicitly set -cl-std=CL2.0 in clBuildProgram() or clCompileProgram().
To keep the code compatible with OpenCL 1.2, you can explicitly set the pointer in the function to __local: int compute_stuff(__local int *array). OpenCL allows pointer parameters to be qualified with the __global and __local address spaces. WINE seems to have a bug here. Maybe inlining the function can solve it: int __attribute__((always_inline)) compute_stuff(__local int *array).
As a last resort, you can do your proposed method. You can detect if it runs on WINE system like this. With that, you could switch between the two code variants without compiling twice and detecting the error.
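One way to switch between the two variants without editing the kernel source is to let the host pick a preprocessor define and pass it as the options string to clBuildProgram(). The following is a minimal sketch in plain host C; the macro name USE_LOCAL_QUALIFIER and the function make_build_options are hypothetical, and how you decide the flag (for example, a WINE check) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fills buf with the build options for the kernel program.
 * The kernel source is assumed (hypothetically) to contain:
 *   #ifdef USE_LOCAL_QUALIFIER
 *   #define ARRAY_AS __local
 *   #else
 *   #define ARRAY_AS
 *   #endif
 *   int compute_stuff(ARRAY_AS int *array) { ... }
 * Returns 0 on success, -1 if buf is too small (buf is left untouched).
 */
int make_build_options(int qualify_local, char *buf, size_t buf_size)
{
    assert(buf != NULL);
    const char *opts = qualify_local ? "-D USE_LOCAL_QUALIFIER" : "";
    size_t len = strlen(opts);
    if (len + 1 > buf_size) {
        return -1;
    }
    memcpy(buf, opts, len + 1); /* copies the terminating NUL as well */
    return 0;
}
```

The resulting string would go into the options parameter of clBuildProgram(), so each environment compiles the variant it accepts on the first attempt.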

CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism

Kernel codes that produce the error:
__kernel void testDynamic(__global int *data)
{
int id=get_global_id(0);
atomic_add(&data[1],2);
}
__kernel void test(__global int * data)
{
int id=get_global_id(0);
atomic_add(&data[0],2);
if (id == 0) {
queue_t q = get_default_queue();
ndrange_t ndrange = ndrange_1D(1,1);
void (^my_block_A)(void) = ^{testDynamic(data);};
enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
ndrange,
my_block_A);
}
}
I tested the code below to be sure the OpenCL 2.0 compiler is working.
__kernel void test2(__global int *data)
{
int id=get_global_id(0);
data[id]=work_group_scan_inclusive_add(id);
}
The scan function gives 0, 1, 3, 6 as outputs, so the OpenCL 2.0 reduction functions are working.
Is dynamic parallelism an extension to OpenCL 2.0? If I remove the enqueue_kernel command, the results equal the expected values (omitting the child kernel).
Device: AMD RX550, driver: 17.6.2
Is there a special command that needs to be run on the host side to run the child kernel on the get_default_queue() queue? For now, the command queue is created the OpenCL 1.2 way, as below:
commandQueue = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
Does get_default_queue() have to be the same command queue that launches the parent kernel? I ask because I'm using the same command queue to upload data to the GPU and then download the results, in a single synchronization.
Moved solution from question to answer:
Edit: the API call below was the solution:
commandQueue = cl::CommandQueue(context, device,
CL_QUEUE_ON_DEVICE|
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
CL_QUEUE_ON_DEVICE_DEFAULT, &err);
After creating this queue (only one per device), I didn't use it for anything else, and the parent kernel is enqueued on a different host queue, so it looks like get_default_queue() doesn't have to be the queue that launched the parent kernel.
The documentation says CL_INVALID_QUEUE_PROPERTIES will be returned if CL_QUEUE_ON_DEVICE is specified, but on my machine dynamic parallelism works with it and that error does not occur (using the commandQueue constructor parameters above).

When I invoke an asynchronous CUDA kernel, how are its arguments copied?

Say I want to invoke a CUDA kernel, like this:
struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.
Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?
Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a final call then turns those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:
dot_product<<<1,n>>>(n, d_a, d_b);
becomes this:
(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);
where the host stub function dot_product is expanded into this:
void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
{
volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(int, float *, float *))dot_product));
(void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product)));
};
}
void dot_product( int __cuda_0,float *__cuda_1,float *__cuda_2)
{
__device_stub__Z11dot_productiPfS_( __cuda_0,__cuda_1,__cuda_2);
}
cudaSetupArgument is the API call which is pushing arguments onto the call stack. Interestingly, this is actually deprecated in the API documentation for CUDA 7.5, even though the compiler is using it. I would, therefore, expect this to change in the future, but the idea will be the same.
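The consequence for the lifetime question can be illustrated with a toy model in plain C. This is not the CUDA runtime; pending_launch, push_argument and PARAM_BUF_SIZE are hypothetical names that only mimic the copy-by-value behaviour of the cudaSetupArgument calls in the generated stub above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PARAM_BUF_SIZE 256 /* arbitrary size chosen for this toy model */

/* Same layout as the struct in the question. */
struct foo { int a; int b; float c; double d; };

/* Toy model of a pending launch: it owns a private copy of the arguments. */
typedef struct {
    unsigned char params[PARAM_BUF_SIZE];
    size_t used;
} pending_launch;

/* Copies one argument by value into the launch record at the given offset.
   Returns 0 on success, -1 if the argument would not fit. */
int push_argument(pending_launch *launch, const void *arg,
                  size_t size, size_t offset)
{
    assert(launch != NULL && arg != NULL);
    if (offset > PARAM_BUF_SIZE || size > PARAM_BUF_SIZE - offset) {
        return -1;
    }
    memcpy(launch->params + offset, arg, size);
    if (offset + size > launch->used) {
        launch->used = offset + size;
    }
    return 0;
}
```

Because the bytes are copied when the argument is pushed, changing or destroying the original struct afterwards does not affect what the launch record holds, which is the property that makes it safe for arg to go out of scope.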
The parameters of the kernel call are copied prior to execution, so the scope should be of no concern. But please note that the total size of all kernel parameters cannot exceed a maximum size in bytes. If you want larger structs or blobs of data, you need to allocate memory on the device using cudaMalloc, copy the content of the host struct to the device struct using cudaMemcpy, and call the kernel with a pointer to the new device struct.
Your code would look something like this:
struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here
cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
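Whether a struct is small enough to pass by value can be checked on the host before choosing between the two approaches. The sketch below is plain C and assumes a 4 KB limit, the figure the CUDA C Programming Guide has long documented; newer toolkits and architectures may allow more, so treat ASSUMED_KERNEL_PARAM_LIMIT as an assumption to verify. The struct big_blob and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limit for the total size of kernel parameters, in bytes.
   Verify against the programming guide for your toolkit and GPU. */
#define ASSUMED_KERNEL_PARAM_LIMIT 4096u

struct foo { int a; int b; float c; double d; };

/* Hypothetical oversized argument, used only for contrast. */
struct big_blob { float samples[2048]; };

/* Adds up the sizes of all arguments. This ignores alignment padding
   between arguments, so the real footprint can be slightly larger. */
size_t total_param_bytes(const size_t *sizes, size_t count)
{
    assert(sizes != NULL || count == 0);
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += sizes[i];
    }
    return total;
}

/* Returns 1 if the arguments can be passed by value, 0 if they should
   go through cudaMalloc/cudaMemcpy and a device pointer instead. */
int fits_in_kernel_params(size_t total_arg_bytes)
{
    return total_arg_bytes <= ASSUMED_KERNEL_PARAM_LIMIT;
}
```

Under that assumption the small struct foo from the question passes by value without trouble, while something like big_blob would need the device-pointer route shown above.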

CUDA streams, texture binding and async memcpy

Writing some signal processing in CUDA I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I previously tried transaction aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared cache bank association (I think)).
So now I'm facing the question of how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel
texture<...> mytexture;
__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams
extern void *sourcedata;
#define N_CUDA_STREAMS ...
cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];
for (int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);
    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}
/* ... */
for (int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
    /* each stream writes to its own output buffer */
    mykernel<<<..., stream[k_stream]>>>((float *)d_pOut[k_stream]);
}
Now what I wonder about is this: since there is only one texture reference, what happens when I bind a buffer to a texture while other streams' kernels access that texture? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing it, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately CUDA doesn't allow arrays of textures on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Indeed you cannot unbind the texture while still using it in a different stream.
Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;
template<int TexSel> __device__ float myTex1Dfetch(int x);
template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }
template<int TexSel> __global__ void mykernel(float *pOut)
{
pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}
int main(void)
{
float *out_d[2];
// ...
// third launch parameter is the dynamic shared memory size, the stream is the fourth
mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
// ...
}

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here: http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
The kernel they have listed is this, which computes the sum of two numbers and stores it in the output variable:
__kernel void vector_add_gpu (__global const float* src_a,
__global const float* src_b,
__global float* res,
const int num)
{
/* get_global_id(0) returns the ID of the thread in execution.
As many threads are launched at the same time, executing the same kernel,
each one will receive a different ID, and consequently perform a different computation.*/
const int idx = get_global_id(0);
/* Now each work-item asks itself: "is my ID inside the vector's range?"
If the answer is YES, the work-item performs the corresponding computation*/
if (idx < num)
res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes, it is possible. You just have to remember that OpenCL is based on C99 with some caveats. You can create other functions either inside the same kernel file or in a separate file that you include at the beginning. Auxiliary functions do not need to be declared inline; however, keep in mind that OpenCL compilers will typically inline them when called. Function pointers are also not available in OpenCL C.
Example
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
//logic to detect if the ray intersects a triangle
}
__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
int gid = get_global_id(0);
float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel; see "OpenCL user defined inline functions". You cannot pass function pointers into the kernel.
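Since function pointers are unavailable, a common workaround for the "generic function" case is to pass an integer code and dispatch with a switch. The following is a minimal sketch in plain C; the enum values and function names are hypothetical, and the assert is a host-only check that would be dropped in OpenCL C. In a kernel, op would arrive as an ordinary int argument.

```c
#include <assert.h>

/* Hypothetical operation codes the host would pass as an int argument. */
enum binary_op { OP_ADD = 0, OP_MUL = 1, OP_MAX = 2 };

float op_add(float a, float b) { return a + b; }
float op_mul(float a, float b) { return a * b; }
float op_max(float a, float b) { return a > b ? a : b; }

/* Selects the operation by code instead of through a function pointer. */
float apply_op(int op, float a, float b)
{
    assert(op == OP_ADD || op == OP_MUL || op == OP_MAX); /* host-only check */
    switch (op) {
    case OP_ADD: return op_add(a, b);
    case OP_MUL: return op_mul(a, b);
    default:     return op_max(a, b);
    }
}
```

Every work-item in a launch receives the same op value, so the switch takes the same branch everywhere and should not cause divergence within a work-group.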

Resources