I have some general parameters declared as a global (__constant) struct, like so:
typedef struct
{
    int a;
    int b;
    float c;
    /// blah blah
} SomeParams;

__constant SomeParams Parameters;
In the kernel, I need to use it like so:
__kernel void Foo()
{
    int a = Parameters.a;
    /// do something useful...
}
I'm not sure how I can initialize the value of Parameters from the host before I execute the kernel.
I have no problem creating buffers, etc., for kernel arguments, but since this isn't a kernel argument, what do I need to do?
I'm using the Cloo C#/OpenCL bindings, but even a raw CL API would be helpful.
As far as I know (though I wouldn't swear by this), you can't initialize variables declared that way from the host code (with one exception, see below). You can, however, declare a variable and initialize it like this:
__constant float pi = 3.14f;
You could also do something like this:
Kernel: __constant float width = WIDTH;
Host: Build the kernel with a -D build parameter defining the value of WIDTH.
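For example, a minimal sketch using the raw C API (program and device are assumed to already exist):
/* Define WIDTH at kernel-compile time via a -D build option. */
const char *options = "-D WIDTH=1024.0f";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);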
What I have done in the past is pass the constant struct as a kernel argument (note that a __constant kernel argument must be a pointer):
__kernel void Foo(__constant SomeParams* Parameters)
{
    int a = Parameters->a;
    /// do something useful...
}
Then you can allocate and set the value just like any other kernel argument.
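For example, a rough sketch in the raw C API (context and kernel are assumed to already exist; Cloo wraps the same calls):
/* Copy the host-side struct into a read-only buffer and bind it to
   the kernel's __constant pointer parameter. Error checking omitted. */
SomeParams host_params = { 1, 2, 3.0f };

cl_mem params_buf = clCreateBuffer(context,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof(SomeParams), &host_params, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &params_buf);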
I have the following example code:
int compute_stuff(int *array)
{
    /* do stuff with array */
    ...
    return x;
}

__kernel void my_kernel()
{
    __local int local_mem_block[LENGTH*MY_LOCAL_WORK_SIZE];
    int result;

    /* do stuff with local memory block */
    result = compute_stuff(local_mem_block + (LENGTH*get_local_id(0)));
    ...
}
The above example compiles and executes fine on my NVIDIA card (RTX 2080).
But when I try to compile on a Macbook with AMD card, I get the following error:
error: passing '__local int *' to parameter of type '__private int *' changes address space of pointer
OK, so then I change the "compute_stuff" function to the following:
int compute_stuff(__local int *array)
Now both NVIDIA and AMD compile it fine, no problem...
But then I have one more test: compiling it on the same MacBook under WINE (rather than booting into Windows via Boot Camp), and that gives the following error:
error: parameter may not be qualified with an address space
So it seems as though one is not supposed to qualify a function parameter with an address space. Fair enough. But if I do not qualify it, then the AMD compiler on native Windows thinks I am trying to change the address space of the pointer to private (I guess because it assumes that unqualified function arguments are private?).
What is a good way to handle this so that all three environments are happy to compile it? As a last resort, I am thinking of simply having the program check whether the build failed without the qualifier, and if so, substitute in the "__local" qualifier and build a second time... Seems like a hack, but it could work.
I agree with ProjectPhysX that it appears to be a bug in the WINE implementation. I also found that the following appears to satisfy all three environments:
int compute_stuff(__local int * __private array)
{
    ...
}

__kernel void my_kernel()
{
    __local int local_mem_block[LENGTH*MY_LOCAL_WORK_SIZE];
    __local int * __private samples;
    int result;

    samples = local_mem_block + (LENGTH*get_local_id(0));
    result = compute_stuff(samples);
}
The above explicitly states that the pointer itself is private while the memory it points to stays in the local address space, which removes the ambiguity.
The int* in int compute_stuff(int *array) is in the __generic address space. The call result = compute_stuff(local_mem_block + ...); implicitly converts it to __local, which is allowed according to the OpenCL 2.0 Khronos specification.
It could be that AMD defaults to OpenCL 1.2. Maybe explicitly set -cl-std=CL2.0 in clBuildProgram() or clCompileProgram().
To keep the code compatible with OpenCL 1.2, you can explicitly set the pointer parameter to __local: int compute_stuff(__local int *array). OpenCL allows function parameters to be qualified with the __global and __local address spaces. WINE seems to have a bug here. Maybe inlining the function can work around it: int __attribute__((always_inline)) compute_stuff(__local int *array).
As a last resort, you can use your proposed method: detect at runtime whether the program is running under WINE, and switch between the two code variants without compiling twice and checking for the error.
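The usual detection trick (a sketch for a Windows host build; real Windows' ntdll.dll does not export the WINE-specific wine_get_version) looks like this:
#include <windows.h>

/* Returns nonzero when running under WINE. */
static int running_under_wine(void)
{
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    return ntdll != NULL && GetProcAddress(ntdll, "wine_get_version") != NULL;
}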
I would like to have a variable with read access from all kernels/functions inside a CL program. For this I have created a variable at the top of the file and prefixed it with __global.
typedef struct {
    /* whatever */
} GlobalParameters;

__global GlobalParameters params;
How can I set the values inside that struct from the host code? Is that even possible, and if not, how else can I modify it? Or do I have to pass it as a parameter to the kernel every time I need it?
Program-scope variables must live in the __constant address space (prior to OpenCL 2.0) and need to be initialized in the source.
So, this works:
typedef struct {
    float whatever;
} GlobalParameters;

__constant GlobalParameters params = (GlobalParameters){ 3.14f };
Then you can use it anywhere. But if the OpenCL compile time is acceptable, you can alter it by string replacement after preparing the host-side constant values:
typedef struct {
    float whatever;
} GlobalParameters;

__constant GlobalParameters params = (GlobalParameters){ ##replace_0## };
If each parameter set is in use for minutes at a time, you can redo the string replacement and rebuild the program whenever the values change. If there are a few non-changing sets, you can compile N program variants up front and switch between them (even using different contexts).
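A minimal host-side sketch of that idea, with the ##replace_0## token expressed as a printf format specifier for brevity (context, device, and err are assumed to already exist):
#include <stdio.h>

const char *source_template =
    "typedef struct {\n"
    "    float whatever;\n"
    "} GlobalParameters;\n"
    "__constant GlobalParameters params = (GlobalParameters){ %ff };\n"
    /* ... the rest of the kernel source ... */;

char source[65536];
snprintf(source, sizeof source, source_template, 3.14f);

const char *src = source;
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);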
When writing code for kernels, is it possible to specify a generic data type so that copying the kernel for each used data type is not necessary? Currently I'm using preprocessor macros to define the whole function with various data types:
#define REDUCTION(type) __kernel void reduce_##type##_f(__global __read_only type* a) \
{ \
    /* do something */ \
}
REDUCTION(float)
REDUCTION(float2)
This, however, is not very convenient. Is there some type specifier like gentype available?
You should be able to do that starting with OpenCL 2.1, which lets you use C++ and templates in kernel code (see Khronos's OpenCL page).
With that, you can simply write:
template <class T>
void reduce_f(__global __read_only T* a) {
    // do something
}
However, I am not 100% sure templates are available in the definition of __kernel functions. If that is not the case, you would still need to wrap the kernel declaration in preprocessor macros, like so:
#define REDUCTION(type) __kernel void reduce_##type##_f(__global __read_only type* a) \
{ \
    reduce_f(a); \
}
REDUCTION(float)
Say I want to invoke a CUDA kernel, like this:
struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.
Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?
Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a call then finalises those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:
dot_product<<<1,n>>>(n, d_a, d_b);
becomes this:
(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);
where the host stub function dot_product is expanded into this:
void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void ( *)(int, float *, float *))dot_product));
        (void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product)));
    };
}

void dot_product(int __cuda_0, float *__cuda_1, float *__cuda_2)
{
    __device_stub__Z11dot_productiPfS_(__cuda_0, __cuda_1, __cuda_2);
}
cudaSetupArgument is the API call that pushes arguments onto the call stack. Interestingly, it is actually deprecated in the API documentation for CUDA 7.5, even though the compiler uses it. I would, therefore, expect this to change in the future, but the idea will be the same.
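To illustrate the practical consequence, a sketch reusing the question's my_kernel, foo, and stream:
{
    foo arg;
    // fill in elements of `arg` here
    my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
}   // `arg` goes out of scope here: safe, because the launch already
    // made its own copy of the argument values.
cudaStreamSynchronize(stream);  // the asynchronous kernel still runs correctly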
The parameters of the kernel call are copied prior to execution, so their scope should be of no concern. But please note that the total size of all kernel parameters cannot exceed a maximum (4 KB at the time of writing). If you want larger structs or blobs of data, you need to allocate the memory on the device using cudaMalloc, copy the contents of the host struct to the device struct using cudaMemcpy, and call the kernel with a pointer to the new device struct.
Your code would look something like this:
struct foo { int a; int b; float c; double d; };

foo arg;
foo *arg_d;
// fill in elements of `arg` here
cudaMalloc((void **)&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
I have seen in a post here that we can call a function from an OpenCL kernel. But in my situation, I need that complex function to be parallelized (run by all available threads) as well. Do I have to make that function a kernel too and call it straight away, like a function, from the main kernel? Or what is a possible solution for this situation? Thanks in advance.
You can call helper functions from your kernel and they will be parallelized in the same manner as the kernel; imagine them as inlined inside your kernel code. Each work item will invoke the helper function for the portion of the working set it handles.
float helper_function(float4 input)
{
    return input.x + input.y + input.z + input.w;
}

__kernel void kernel_function(__global const float4* arr, __global float* out)
{
    int id = get_global_id(0);
    out[id] = helper_function(arr[id]);
}
The OpenCL 2.0 spec added a new feature for dynamic parallelism:
6.13.17 Enqueuing Kernels
OpenCL 2.0 allows a kernel to independently enqueue to the same device, without host
interaction. ...
In the example below, my_func_B enqueues my_func_A on the device:
kernel void
my_func_A(global int *a, global int *b, global int *c)
{
    ...
}

kernel void
my_func_B(global int *a, global int *b, global int *c)
{
    ndrange_t ndrange;
    // build ndrange information
    ...
    // example: enqueue a kernel as a block
    enqueue_kernel(get_default_queue(), ndrange, ^{my_func_A(a, b, c);});
    ...
}
If I understand your question correctly, you want to do a separate full pass over a buffer from inside the kernel. I don't think that is possible from within the kernel, so you'd have to create the code for the "inner" pass as a separate kernel and also call that kernel separately from your host code. The output from that kernel doesn't have to be read back to the host memory, but can stay in device memory between your kernel calls.
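In host code that amounts to two enqueues sharing a device buffer. A rough sketch (queue, kernel_pass1, kernel_pass2, n, and global are placeholder names):
/* The intermediate buffer never leaves device memory between launches. */
cl_mem intermediate = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                     n * sizeof(float), NULL, NULL);

clSetKernelArg(kernel_pass1, 0, sizeof(cl_mem), &intermediate);
clEnqueueNDRangeKernel(queue, kernel_pass1, 1, NULL, &global, NULL, 0, NULL, NULL);

clSetKernelArg(kernel_pass2, 0, sizeof(cl_mem), &intermediate);
clEnqueueNDRangeKernel(queue, kernel_pass2, 1, NULL, &global, NULL, 0, NULL, NULL);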