When I invoke an asynchronous CUDA kernel, how are its arguments copied?

Say I want to invoke a CUDA kernel, like this:
struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.
Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?

Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a final call then turns those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:
dot_product<<<1,n>>>(n, d_a, d_b);
becomes this:
(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);
where the host stub function dot_product is expanded into this:
void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
    {
        volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(int, float *, float *))dot_product));
        (void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product)));
    };
}
void dot_product( int __cuda_0,float *__cuda_1,float *__cuda_2)
{
    __device_stub__Z11dot_productiPfS_( __cuda_0,__cuda_1,__cuda_2);
}
cudaSetupArgument is the API call that pushes arguments onto the call stack. Interestingly, it is marked as deprecated in the API documentation for CUDA 7.5, even though the compiler still uses it. I would therefore expect this mechanism to change in the future, but the idea will stay the same.
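To make the lifetime consequence concrete, here is a host-only C++ analogy of the copy-at-launch behaviour described above. It does not use the CUDA runtime at all: `FakeStream`, `sum_kernel` and `launch_then_drop_arg` are hypothetical names invented for this sketch, and the queue only imitates the idea that the argument bytes are captured before the launch call returns.

```cpp
#include <cassert>
#include <functional>
#include <queue>

struct foo { int a; int b; float c; double d; };

// Hypothetical stand-in for a CUDA stream: a FIFO of deferred work items.
class FakeStream {
public:
    // Models the launch path: the argument is copied into storage owned by the
    // queued work item before launch() returns to the caller.
    void launch(void (*kernel)(const foo&, double*), const foo& arg, double* out) {
        assert(kernel != nullptr && out != nullptr);
        foo snapshot = arg;  // synchronous copy at "launch" time
        pending_.push([kernel, snapshot, out] { kernel(snapshot, out); });
    }

    // Runs all queued work, standing in for the asynchronous execution.
    void synchronize() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Toy "kernel": sums the fields of its argument.
void sum_kernel(const foo& f, double* out) { *out = f.a + f.b + f.c + f.d; }

// Launches, then lets the caller's argument die before the work runs.
double launch_then_drop_arg() {
    FakeStream stream;
    double result = 0.0;
    {
        foo arg{1, 2, 3.0f, 4.0};
        stream.launch(sum_kernel, arg, &result);
        arg = foo{0, 0, 0.0f, 0.0};  // later changes do not reach the snapshot
    }  // arg is out of scope here, before the queued work executes
    stream.synchronize();
    return result;
}
```

Calling `launch_then_drop_arg()` yields the sum of the original field values, because the work item only ever sees its own snapshot. Anything the argument points to is not covered by this copy and still has to outlive the real kernel.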

The parameters of the kernel call are copied before execution, so the scope of arg should be of no concern. Note, however, that the total size of all kernel parameters cannot exceed a fixed limit (historically 4 KB; check the programming guide for your toolkit and GPU). For larger structs or blobs of data, allocate device memory with cudaMalloc, copy the contents of the host struct to the device struct with cudaMemcpy, and call the kernel with a pointer to the new device struct.
Your code would look something like this:
struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here
cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
// my_kernel must take a foo* parameter in this version
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
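Whether a struct can be passed by value or needs the cudaMalloc/cudaMemcpy route can be checked at compile time. The sketch below is plain C++ and assumes a 4 KB parameter limit, which is an assumption to verify against your toolkit; `kAssumedMaxParamBytes`, `fits_as_kernel_param`, `fits_in_param_space` and `big_blob` are hypothetical names, and the check looks at a single argument only, ignoring alignment padding between several arguments.

```cpp
#include <cassert>
#include <cstddef>

struct foo { int a; int b; float c; double d; };

// Assumption: a 4 KB parameter limit; newer toolkits and GPUs may allow more.
constexpr std::size_t kAssumedMaxParamBytes = 4096;

// Hypothetical helper: true if T alone fits within the assumed limit.
template <typename T>
constexpr bool fits_as_kernel_param() {
    return sizeof(T) <= kAssumedMaxParamBytes;
}

// Runtime variant for sizes that are only known while the program runs.
bool fits_in_param_space(std::size_t bytes) {
    assert(bytes > 0);
    return bytes <= kAssumedMaxParamBytes;
}

// A payload that is too large to pass by value under the assumed limit.
struct big_blob { float samples[2048]; };

static_assert(fits_as_kernel_param<foo>(), "foo can be passed by value");
static_assert(!fits_as_kernel_param<big_blob>(),
              "big_blob should go through cudaMalloc/cudaMemcpy instead");
```

A `static_assert` next to the kernel declaration turns a silent launch failure into a build error when someone later grows the struct.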

Related

Can I have boolean buffer in OpenCL and change its value during kernel execution, example to break while loop

I want to experiment with OpenCL and would like to know whether the host code can change state during kernel execution through a buffer.
I tried to end a while loop in the kernel by modifying the buffer value from the host code, but the execution hangs.
void my_kernel(
__global bool *in,
__global int *out)
{
int i = get_global_id(0);
while(1) {
if(1 == *in) {
printf("while loop is finished");
break;
}
}
printf("out[0] = %d\n", out[0]);
}
I then call clEnqueueWriteBuffer() a second time to change the input value.
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
CL_TRUE, 0, sizeof(int), (void*)input,
0, NULL,NULL);
At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, therefore its behaviour is undefined.
I am not certain whether the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience with it. Perhaps someone else can contribute an answer for that.

Memory leak, Pointer changing reference

I'm writing a signal processing routine using the PortAudio library. I'm using a structure which contains a pointer to float that is intended to be used as a buffer. I then pass it to an audio callback function.
My problem is that after callback processing is finished, my pointer refers to a different address and thus cannot be freed. This is not such a big deal, but I don't understand when and how the pointer is changed, and I have the feeling that I'm missing something important.
Here is a simplified version of the code:
typedef struct{
float* tmp;
//other stuff
} Data;
Data data;
data.tmp = NULL;
data.tmp = (float*) calloc(N,sizeof(float));// N is the size of the buffer
Pa_OpenDefaultStream(some args, //opens a PortAudio stream and passes tmp to callback
callback,
&data );
A stream is then started in another high-priority thread, and the callback is executed as many times as needed. During the callback, tmp is used as a ring buffer and new data is constantly copied into it.
static int callback(args,void* data){
Data* x = (Data*) tmp;
x->tmp = update();
}
where update() returns a pointer to a float which is initialized the same way as tmp is (calloc).
float* update(){
//do stuff
return m_tmp2;
}
float* m_tmp2 = (float*) calloc(N,sizeof(float));//same N as before
But after the stream is closed I get an error when calling free before quitting.
free(data.tmp);//throws a SIGABRT error
Some breakpoint debugging showed me that the pointer is being changed during callback processing, but I don't see when and how it happens, because everything else runs smoothly. It must be something during the callback execution, but I'm sure update() returns a pointer to a buffer of the same size as tmp. Or is it linked to PortAudio?
Any clues, please?
I'm not sure I understand this correctly. You reassign the float pointer (x->tmp) every time the callback function is called:
static int callback(args,void* data){
Data* x = (Data*) tmp;
x->tmp = update();
}
I assume the above is a typo and you actually mean:
static int callback(args,void* data){
Data* x = (Data*) data;
x->tmp = update();
}
You are changing the pointer value of tmp by assigning it the result of update(), which returns a different memory location on the heap, so tmp no longer points to the original allocation:
float* update(){
//do stuff
return m_tmp2;
}
data.tmp therefore points to whatever update() returned each time the callback function is called, so the behaviour you describe is exactly what I would expect.
Maybe I am missing something?
You should also provide a mechanism to keep track of the buffers, so that every tmp (float *) you allocate for your circular buffer can be freed, not just the first one allocated before the first callback is called.
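One way to sidestep the bookkeeping is to leave the pointer alone and copy the fresh samples into the buffer it already points to. The following host-only sketch assumes that this fits the ring-buffer design; it is not PortAudio code, and the `n` field, `refresh` and `pointer_survives_refresh` are hypothetical names added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct Data {
    float* tmp;     // buffer owned by the code that allocated it
    std::size_t n;  // number of floats in tmp (hypothetical extra field)
};

// What the callback could do instead of `x->tmp = update();`:
// overwrite the contents and keep the pointer value unchanged.
void refresh(Data* x, const float* fresh) {
    assert(x != nullptr && x->tmp != nullptr && fresh != nullptr);
    std::memcpy(x->tmp, fresh, x->n * sizeof(float));
}

// Checks that the pointer seen after the "callback" is the one that was allocated.
bool pointer_survives_refresh() {
    const std::size_t N = 8;
    Data data{static_cast<float*>(std::calloc(N, sizeof(float))), N};
    float* scratch = static_cast<float*>(std::calloc(N, sizeof(float)));
    if (data.tmp == nullptr || scratch == nullptr) {
        std::free(data.tmp);
        std::free(scratch);
        return false;
    }
    float* const original = data.tmp;
    scratch[0] = 1.5f;       // stands in for the data produced by update()
    refresh(&data, scratch);
    const bool ok = (data.tmp == original) && (data.tmp[0] == 1.5f);
    std::free(scratch);      // each allocation is freed exactly once
    std::free(data.tmp);
    return ok;
}
```

With this layout every allocation has one owner and one matching `free`, so the pointer handed to the stream stays valid for its whole lifetime.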

CUDA streams, texture binding and async memcpy

Writing some signal processing in CUDA I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I previously tried transaction aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared cache bank association (I think)).
So now I'm facing the question of how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel
texture<...> mytexture;
__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams
extern void *sourcedata;
#define N_CUDA_STREAMS ...
cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];
for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);
    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}
/* ... */
for(int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
    mykernel<<<..., stream[k_stream]>>>((float*)d_pOut[k_stream]);
}
Now what I wonder is: since there is only one texture reference, what happens when I bind a buffer to the texture while other streams' kernels are accessing it? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing it, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately, CUDA doesn't allow arrays of textures on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Indeed, you cannot unbind the texture while it is still being used in a different stream.
Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;
template<int TexSel> __device__ float myTex1Dfetch(int x);
template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }
template<int TexSel> __global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}
int main(void)
{
    float *out_d[2];
    // ...
    // the third launch parameter is the dynamic shared memory size; the stream is the fourth
    mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
    mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
    // ...
}

Copying a struct containing pointers to CUDA device

I'm working on a project where I need my CUDA device to make computations on a struct containing pointers.
typedef struct StructA {
int* arr;
} StructA;
When I allocate memory for the struct and then copy it to the device, it will only copy the struct and not the content of the pointer. Right now I'm working around this by allocating the pointer first and then setting the host struct to use that new pointer (which resides on the GPU). The following code sample describes this approach using the struct from above:
#define N 10
int main() {
    int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
    StructA *h_a = (StructA*)malloc(sizeof(StructA));
    StructA *d_a;
    int *d_arr;
    // 1. Allocate device struct.
    cudaMalloc((void**) &d_a, sizeof(StructA));
    // 2. Allocate device pointer.
    cudaMalloc((void**) &(d_arr), sizeof(int)*N);
    // 3. Copy pointer content from host to device.
    cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
    // 4. Point to device pointer in host struct.
    h_a->arr = d_arr;
    // 5. Copy struct from host to device.
    cudaMemcpy(d_a, h_a, sizeof(StructA), cudaMemcpyHostToDevice);
    // 6. Call kernel.
    kernel<<<N,1>>>(d_a);
    // 7. Copy struct from device to host.
    cudaMemcpy(h_a, d_a, sizeof(StructA), cudaMemcpyDeviceToHost);
    // 8. Copy pointer from device to host.
    cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
    // 9. Point to host pointer in host struct.
    h_a->arr = h_arr;
}
My question is: Is this the way to do it?
It seems like an awful lot of work, and I remind you that this is a very simple struct. If my struct contained a lot of pointers, or structs that themselves contain pointers, the code for allocation and copying would be quite extensive and confusing.
Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.
Don't forget that you can pass structures by value to kernels. This code works:
// pass struct by value (may not be efficient for complex structures)
__global__ void kernel2(StructA in)
{
    in.arr[threadIdx.x] *= 2;
}
Doing so means you only have to copy the array to the device, not the structure:
int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA h_a;
int *d_arr;
// 1. Allocate device array.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);
// 2. Copy array contents from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
// 3. Point to device pointer in host struct.
h_a.arr = d_arr;
// 4. Call kernel with host struct as argument
kernel2<<<N,1>>>(h_a);
// 5. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
// 6. Point to host pointer in host struct
// (or do something else with it if this is not needed)
h_a.arr = h_arr;
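Passing the struct by value works because the copy is shallow: only the pointer value is duplicated, and both copies refer to the same array. The host-only C++ sketch below shows that aliasing without any CUDA calls; `double_all` and `shallow_copy_demo` are hypothetical names, and a `std::vector` stands in for the device allocation.

```cpp
#include <cassert>
#include <vector>

struct StructA { int* arr; };

// Takes the struct by value: `in` is a copy, but in.arr aliases the caller's array.
void double_all(StructA in, int n) {
    assert(in.arr != nullptr && n >= 0);
    for (int i = 0; i < n; ++i) {
        in.arr[i] *= 2;
    }
}

// Shows that changes made through the copy are visible through the original.
std::vector<int> shallow_copy_demo() {
    std::vector<int> storage{1, 2, 3, 4};  // stands in for the device array
    StructA a{storage.data()};
    double_all(a, static_cast<int>(storage.size()));
    return storage;
}
```

The same reasoning explains why the pointer stored in the struct must already be a device pointer when the real kernel is launched: the launch copies the pointer, never the data behind it.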
As pointed out by Mark Harris, structures can be passed by value to CUDA kernels. However, some care is needed to set up a proper destructor, since the destructor is also called for the by-value copy when the kernel call exits.
Consider the following example
#include <stdio.h>
#include "Utilities.cuh"
#define NUMBLOCKS 512
#define NUMTHREADS (512 * 2)
/***************/
/* TEST STRUCT */
/***************/
struct Lock {
int *d_state;
// --- Constructor
Lock(void) {
int h_state = 0; // --- Host side lock state initializer
gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int))); // --- Allocate device side lock state
gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
}
// --- Destructor (wrong version)
//~Lock(void) {
// printf("Calling destructor\n");
// gpuErrchk(cudaFree(d_state));
//}
// --- Destructor (correct version)
// __host__ __device__ ~Lock(void) {
//#if !defined(__CUDACC__)
// gpuErrchk(cudaFree(d_state));
//#else
//
//#endif
// }
// --- Lock function
__device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }
// --- Unlock function
__device__ void unlock(void) { atomicExch(d_state, 0); }
};
/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCounterLocked(Lock lock, int *nblocks) {
if (threadIdx.x == 0) {
lock.lock();
*nblocks = *nblocks + 1;
lock.unlock();
}
}
/********/
/* MAIN */
/********/
int main(){
int h_counting, *d_counting;
Lock lock;
gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));
// --- Locked case
h_counting = 0;
gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));
blockCounterLocked<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
printf("Counting in the locked case: %i\n", h_counting);
gpuErrchk(cudaFree(d_counting));
}
with the destructor marked "wrong version" uncommented (do not pay too much attention to what the code actually does). If you run that code, you will get the following output:
Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37
There are two calls to the destructor: one when the kernel call exits and one when main exits. The error message arises because, once the memory pointed to by d_state has been freed at the kernel exit, it cannot be freed again at the main exit. Accordingly, the destructor must behave differently for host and device executions. This is what the commented-out "correct version" destructor in the above code does.
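A different way to avoid the double release, not the conditional destructor used above, is to separate ownership from the object that gets copied into the kernel. The host-only sketch below models this with ordinary heap memory instead of cudaMalloc/cudaFree; `LockOwner`, `LockView`, `fake_kernel` and `run_and_count_releases` are hypothetical names.

```cpp
#include <cassert>

static int g_release_calls = 0;  // counts how often the resource is released

// Trivially copyable handle: safe to pass by value, owns nothing.
struct LockView {
    int* state;
};

// Sole owner of the allocation; copying is disabled so it cannot be passed by value.
class LockOwner {
public:
    LockOwner() : state_(new int(0)) {}           // stands in for cudaMalloc
    ~LockOwner() { delete state_; ++g_release_calls; }  // stands in for cudaFree
    LockOwner(const LockOwner&) = delete;
    LockOwner& operator=(const LockOwner&) = delete;

    LockView view() const { return LockView{state_}; }

private:
    int* state_;
};

// Stand-in for a kernel that receives the lock by value.
void fake_kernel(LockView lock) {
    assert(lock.state != nullptr);
    *lock.state += 1;
}

// Two by-value "launches", yet the resource is released exactly once.
int run_and_count_releases() {
    g_release_calls = 0;
    {
        LockOwner owner;
        fake_kernel(owner.view());
        fake_kernel(owner.view());
    }  // owner is destroyed here
    return g_release_calls;
}
```

Because the view has no destructor, destroying the by-value copy is a no-op, and the compiler rejects any attempt to pass the owner itself by value.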
A struct of arrays is a nightmare in CUDA. You will have to copy each of the pointers to a new struct which the device can use. Could you perhaps use an array of structs instead? If not, the only way I have found is to attack it the way you do, which is in no way pretty.
EDIT:
Since I can't comment on the top post: step 9 is redundant, since you can change steps 8 and 9 into
// 8. Copy pointer from device to host.
cudaMemcpy(h->arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

Calling OpenCL kernel from another OpenCL kernel

I have seen in one post here that we can call a function from an OpenCL kernel. In my situation, however, I need that complex function to be parallelized (run by all available threads) as well. Do I have to make that function a kernel too and call it like a function from the main kernel, or what is a possible solution for this situation? Thanks in advance.
You can call helper functions from your kernel, and they will be parallelized in the same manner as the kernel; think of them as inlined inside your kernel code. Each work item will thus invoke the helper function for the working set it handles.
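// the sum of the four components is a scalar, so the helper returns float
float helper_function(float4 input)
{
    return input.x + input.y + input.z + input.w;
}
__kernel void kernel_function(__global const float4* arr, __global float* out)
{
    size_t id = get_global_id(0);
    out[id] = helper_function(arr[id]);
}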
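The "one helper call per work item" picture can be modelled on the host with a plain loop over global IDs. This is a serial C++ stand-in, not OpenCL: the `float4` struct, `kernel_body` and `run_all_work_items` are hypothetical names, and the helper is assumed to return the component sum as a single float.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct float4 { float x, y, z, w; };  // stand-in for the OpenCL float4 type

// Helper function: runs once per work item, on that item's element only.
float helper_function(float4 in) { return in.x + in.y + in.z + in.w; }

// Body of the kernel for one work item with global ID `id`.
void kernel_body(std::size_t id, const float4* arr, float* out) {
    assert(arr != nullptr && out != nullptr);
    out[id] = helper_function(arr[id]);
}

// Serial model of the NDRange: a real device would run these bodies in parallel.
std::vector<float> run_all_work_items(const std::vector<float4>& arr) {
    std::vector<float> out(arr.size());
    for (std::size_t id = 0; id < arr.size(); ++id) {
        kernel_body(id, arr.data(), out.data());
    }
    return out;
}
```

Since every iteration touches only its own `id`, the iterations are independent, which is what lets the device execute the helper for all work items concurrently.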
The OpenCL 2.0 spec added a new feature for dynamic parallelism.
6.13.17 Enqueuing Kernels
OpenCL 2.0 allows a kernel to independently enqueue to the same device, without host
interaction. ...
In the example below my_func_B enqueues my_func_A on the device:
kernel void
my_func_A(global int *a, global int *b, global int *c)
{
...
}
kernel void
my_func_B(global int *a, global int *b, global int *c)
{
ndrange_t ndrange;
// build ndrange information
...
// example – enqueue a kernel as a block
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, ^{my_func_A(a, b, c);});
...
}
If I understand your question correctly, you want to do a separate full pass over a buffer from inside the kernel. I don't think that is possible from within the kernel, so you'd have to create the code for the "inner" pass as a separate kernel and also call that kernel separately from your host code. The output from that kernel doesn't have to be read back to the host memory, but can stay in device memory between your kernel calls.
