CL_OUT_OF_RESOURCES GPU error - opencl

I am running multiple iterations of an OpenCL program, and after a few, I get the following error:
ERROR: Read Result (-5)
CL_OUT_OF_RESOURCES
when running this command
err = clEnqueueReadBuffer( commands, d_c, CL_TRUE, 0,
sizeof(char) * result_size,
result_buffer, 0, NULL, NULL );
checkErr(err,"Read Result");
The kernel allocates 3 global memory buffers, which I release
clReleaseMemObject(d_a);
clReleaseMemObject(d_b);
clReleaseMemObject(d_c);
clReleaseKernel(ko_smat);
But I also allocate local and private memory. The private memory is allocated inside the kernel (char tmp_array), and the local memory is passed in as a kernel argument.
My kernel has definition:
__kernel void mmul(
__global char* C,
__global char* A,
__global char* B,
const int rA,
const int rB,
const int cC,
__local char* local_mem)
The size of the local memory is set on the host via
clSetKernelArg(ko_smat,6, sizeof(char) * local_mem_size, NULL);
I'm guessing that the out-of-resources error is caused by my failing to free either the private memory or the local memory, but I don't know how to do that.

Since I don't have enough reputation to comment, I have to use an answer.
To properly address your problem, it would help if you posted a working example of your code.
How much local memory do you actually allocate? It may well be that you allocate more than your device can provide. If your "local_mem_size" variable is not fixed but calculated dynamically, work out the worst-case value.
You can query how much local memory your device provides by calling clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE.
As DarkZeros already mentioned, CL_OUT_OF_RESOURCES is an error that typically occurs on NVIDIA GPUs when a kernel addresses memory out of range. This can happen for both local and global memory. Local and private memory, by contrast, exist only for the duration of a kernel execution and need no explicit release, so a leak there is unlikely to be the cause.

Device-side enqueue causes CL_OUT_OF_RESOURCES

I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
Allocates 16 kilobytes of floating point memory on the device and zeros it out.
Builds the OpenCL program below, and creates a kernel of masterKernel()
Sets the first argument of masterKernel() (heap) to the allocated memory in step 1
Enqueues that masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1. (So it only runs once, with get_global_id(0) always being zero)
Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}
//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
if(get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), 0, ndRange,
^{ childKernel(heap); });
}
}
The program builds successfully. However, when I try to run masterKernel(), the call to enqueue_kernel() causes the host-side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on whether the block enqueues successfully. It does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change, so I know that it's not a problem with how I'm feeding the arguments in.
Modifying the if statement so that it reads something impossible like if(get_global_id(0) == 6000) still causes the error. This tells me that the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely that it exists in the program at all.
Modifying the if statement to if(0) does make the error not happen.
Making it so childKernel() actually does something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test it on.
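I ended up figuring this one out. The error occurred because I had not created a device-side default queue on the host, that is, one created with the properties CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT. In OpenCL 2.0 such a queue is created on the host with clCreateCommandQueueWithProperties(); without it, get_default_queue() in the kernel has no valid queue to return.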

read_only vs const on non-image OpenCL parameters

Reading the OpenCL documentation, I know that the access qualifiers read_only and write_only are intended for image memory.
However, I'm noticing some people use these qualifiers on regular, non-image-memory, parameters, e.g.:
void foo(unsigned n, __global read_only int* data)
Note the lack of const.
My questions:
Does read_only imply, in particular, const?
Does read_only imply anything other than const? Something else that an OpenCL compiler can utilize?
... or is it just meaningless for non-image-memory, and ignored?

OpenCL - Storing a large array in private memory

I have a large array of floats called source_array with around 50,000 elements. I am currently trying to apply a collection of modifications to the array and evaluate the result. Basically, in pseudocode:
__kernel void doSomething (__global float *source_array, __global boolean *res, __global int *mod_value) {
// Modify values of source_array with mod_value;
// Evaluate the modified array.
}
So in the process I need a variable to hold the modified array, because source_array should stay constant for all work-items; if I modify it directly, it might interfere with another work-item (I'm not sure if I am right here).
The problem is that the array is too big for private memory, so I can't declare it in kernel code. What should I do in this case?
I considered adding another parameter to the kernel to serve as a placeholder for the modified array, but again it would interfere with other work-items.
Private "memory" on GPUs literally consists of registers, which generally are in short supply. So the __private address space in OpenCL is not suitable for this as I'm sure you've found.
Victor's answer is correct - if you really need temporary memory for each work item, you will need to create a (global) buffer object. If all work items need to independently mutate it, it will need a size of <WORK-ITEMS> * <BYTES-PER-ITEM> and each work-item will need to use its own slice of the buffer. If it's only temporary, you never need to copy it back to host memory.
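The slicing described above can be sketched on the host in plain C++. This only emulates the index arithmetic and is not OpenCL code; the helper names scratch_bytes and slice_index are hypothetical, and inside a real kernel the work-item number would come from get_global_id(0).

```cpp
#include <cassert>
#include <cstddef>

// Total size in bytes of a scratch buffer that gives every work-item
// its own full copy of the array (hypothetical helper).
std::size_t scratch_bytes(std::size_t work_items, std::size_t elems_per_item,
                          std::size_t elem_size) {
    return work_items * elems_per_item * elem_size;
}

// Index of element `i` inside the slice owned by `work_item`.
// In a kernel, `work_item` would be get_global_id(0).
std::size_t slice_index(std::size_t work_item, std::size_t elems_per_item,
                        std::size_t i) {
    assert(i < elems_per_item);  // stay inside this work-item's own slice
    return work_item * elems_per_item + i;
}
```

For example, with 50,000 four-byte floats per work-item and an assumed 1,024 work-items, scratch_bytes(1024, 50000, 4) comes to 204,800,000 bytes, which shows how quickly this approach consumes global memory.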
However, this sounds like an access pattern that will work very inefficiently on GPUs. You will do much better if you decompose your problem differently. For example, you may be able to make whole work-groups coordinate on some subrange of the array: copy the subrange into local (group-shared) memory, divide the work between the work-items in the group, write the results back to global memory, then read the next subrange into local memory, and so on. Coordinating between work-items in a group is much more efficient than having each work-item access a huge range of global memory. We can only help you with this algorithmic approach if you are more specific about the computation you are trying to perform.
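As a rough illustration of the subrange idea, here is a host-side C++ emulation. Each pass of the outer loop stands in for one work-group handling one subrange, and the small `tile` vector stands in for __local memory. The actual computation is not given in the question, so "add delta, then sum" is purely a placeholder assumption, as are the names process_in_tiles, delta and tile_size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates tiled processing: the source array is only ever read,
// and all modifications happen in a small reusable buffer.
float process_in_tiles(const std::vector<float>& source, float delta,
                       std::size_t tile_size) {
    assert(tile_size > 0);
    std::vector<float> tile(tile_size);  // stand-in for __local memory
    float total = 0.0f;
    for (std::size_t start = 0; start < source.size(); start += tile_size) {
        const std::size_t count = std::min(tile_size, source.size() - start);
        // 1. Copy the current subrange into the small buffer.
        std::copy(source.data() + start, source.data() + start + count,
                  tile.data());
        // 2. Modify the copy; `source` itself is never written.
        for (std::size_t i = 0; i < count; ++i) tile[i] += delta;
        // 3. Evaluate the modified subrange and fold it into the result.
        for (std::size_t i = 0; i < count; ++i) total += tile[i];
    }
    return total;
}
```

The scratch space needed is one tile rather than one full copy of the array per work-item, which is the main point of the decomposition.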
Why not initialize this array in a host memory buffer and create an OpenCL buffer from it? For example:
const size_t buffer_size = 50000 * sizeof(float);
/* allocate with malloc (needs <stdlib.h>), new float[50000], or a static initializer {0.1f, 0.2f, ...} */
float *host_array_ptr = (float*)malloc(buffer_size);
/*
put your data into host_array_ptr here
*/
cl_int err_code;
cl_mem my_array = clCreateBuffer( my_cl_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, buffer_size, host_array_ptr, &err_code );
You can then pass this cl_mem my_array to your OpenCL kernel as a __global float* argument with clSetKernelArg.

Arduino Zero - Region Ram overflowed with stack

I have some code that uses nested structs to store device parameters; see below.
This is running on an Arduino Zero (Atmel SAMD21).
The code declares storage for up to 3 networks, each with 64 devices.
I would like to use 5 networks, but when I increase the count to 4 the code no longer compiles.
I get "region RAM overflowed with stack" / "RAM overflowed by 4432 bytes".
I understand this means it needs more RAM than I have? Is there a different way to achieve the same thing that would fit?
struct device {
int stat;
bool changed;
char data[51];
char state[51];
char atime[14];
char btime[14];
};
struct outputs {
device fitting[64];
};
struct storage {
int deviceid =0;
int addstore =0;
bool set;
bool run_events = false;
char authkey[10];
outputs network[3];
} ;
storage data_store;
Well, the usual approaches are:
Consider if all or any of the data is actually read-only, and thus can be made const (which should move it to read-only memory, if that fails you can usually force it by adding compiler-specific magic).
Figure out a way of representing the data using fewer bits. For instance, using 14 bytes for each of the two timestamps might seem excessive; switching these to 32-bit timestamps and generating the strings when needed would save around 70% of that space.
If there are duplicates, then perhaps each storage doesn't need three unique outputs, but can instead store pointers into a shared "pool" of unique configurations.
If not all 64 fittings are used, that array could also be refactored into having non-constant length.
It's hard to be more specific since I don't know your data or application well enough.
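The timestamp suggestion above might look roughly like the following sketch. It assumes the two time fields can be stored as seconds counted from some epoch and that an HH:MM:SS rendering is enough; the question does not say what the 14-character strings contain, so both points are assumptions, and the names TimesAsText, TimesAsSeconds and format_time are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Layout of the two time fields as in the question: 2 * 14 = 28 bytes per device.
struct TimesAsText {
    char atime[14];
    char btime[14];
};

// Compact layout: 2 * 4 = 8 bytes per device (assumed to hold seconds).
struct TimesAsSeconds {
    std::uint32_t atime;
    std::uint32_t btime;
};

static_assert(sizeof(TimesAsText) == 28, "two 14-byte strings");
static_assert(sizeof(TimesAsSeconds) == 8, "two 32-bit counters");

// Produce the text form only when it is actually needed.
void format_time(std::uint32_t seconds, char* out, std::size_t out_size) {
    assert(out_size >= 9);  // "HH:MM:SS" plus the terminating NUL
    const unsigned h = static_cast<unsigned>((seconds / 3600u) % 24u);
    const unsigned m = static_cast<unsigned>((seconds / 60u) % 60u);
    const unsigned s = static_cast<unsigned>(seconds % 60u);
    std::snprintf(out, out_size, "%02u:%02u:%02u", h, m, s);
}
```

Going from 28 to 8 bytes per device is a saving of about 71% on these two fields, in line with the estimate above.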
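Your struct takes too much space, that's all. Even assuming chars, ints and bools are 1 byte each, your device struct takes 132 bytes, so each outputs struct takes 8448 bytes (8.25 KB) and three of them take about 25 KB. On the SAMD21 an int is actually 4 bytes, so the real figures are slightly higher. Your board has 32 KB of RAM in total, so a fourth network of more than 8 KB cannot fit.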
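The arithmetic can be checked with sizeof. The sketch below copies the structs from the question, with int written as a fixed 32-bit type so the numbers match a 32-bit ARM target even when compiled on a desktop. The template parameter, the constant kSamd21RamBytes and the helper ram_left are additions for illustration, and the sizes in the comments assume a typical ABI with 1-byte bool and 4-byte alignment for 32-bit integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct device {
    std::int32_t stat;  // `int` in the question; 4 bytes on the SAMD21
    bool changed;
    char data[51];
    char state[51];
    char atime[14];
    char btime[14];
};  // 135 bytes of fields, padded to 136

struct outputs {
    device fitting[64];  // 64 * 136 = 8704 bytes
};

template <std::size_t Networks>
struct storage {
    std::int32_t deviceid = 0;
    std::int32_t addstore = 0;
    bool set = false;
    bool run_events = false;
    char authkey[10];
    outputs network[Networks];
};

constexpr std::size_t kSamd21RamBytes = 32 * 1024;  // total SRAM on the board

// Three networks fit; four already exceed the whole RAM.
static_assert(sizeof(storage<3>) <= kSamd21RamBytes, "3 networks fit");
static_assert(sizeof(storage<4>) > kSamd21RamBytes, "4 networks do not fit");

// Bytes left for the stack and every other variable in the sketch.
std::size_t ram_left(std::size_t storage_bytes) {
    assert(storage_bytes <= kSamd21RamBytes);
    return kSamd21RamBytes - storage_bytes;
}
```

Under these assumptions, storage<3> occupies 26,132 bytes and leaves 6,636 bytes for everything else, while storage<4> needs 34,836 bytes on its own.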

CUDA/C++: Passing __device__ pointers in C++ code

I am developing a Windows 64-bit application that will manage concurrent execution of different CUDA-algorithms on several GPUs.
My design requires a way of passing pointers to device memory around C++ code (e.g. storing them as members in my C++ objects).
I know that it is impossible to declare class members with __device__ qualifiers.
However I couldn't find a definite answer whether assigning __device__ pointer to a normal C pointer and then using the latter works. In other words: Is the following code valid?
__device__ float *ptr;
cudaMalloc(&ptr, size);
float *ptr2 = ptr;
some_kernel<<<1,1>>>(ptr2);
For me it compiled and behaved correctly but I would like to know whether it is guaranteed to be correct.
No, that code isn't strictly valid. While it might work on the host side (more or less by accident), if you tried to dereference ptr directly from device code, you would find it would have an invalid value.
The correct way to do what your code implies would be like this:
__device__ float *ptr;
__global__ void some_kernel()
{
float val = ptr[threadIdx.x];
....
}
float *ptr2;
cudaMalloc(&ptr2, size);
cudaMemcpyToSymbol("ptr", &ptr2, sizeof(float *)); // copies the pointer value itself, hence &ptr2
some_kernel<<<1,1>>>();
For CUDA 4.x or newer, change the cudaMemcpyToSymbol call to:
cudaMemcpyToSymbol(ptr, &ptr2, sizeof(float *));
If the static device symbol ptr is really superfluous, you can just do something like this:
float *ptr2;
cudaMalloc(&ptr2, size);
some_kernel<<<1,1>>>(ptr2);
But I suspect that what you are probably looking for is something like the thrust library device_ptr class, which is a nice abstraction wrapping the naked device pointer and makes it absolutely clear in code what is in device memory and what is in host memory.
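To illustrate the idea of a typed wrapper without depending on CUDA or Thrust, here is a minimal plain C++ sketch. DevicePtr and GpuJob are hypothetical names, and this is not the thrust::device_ptr API; it only shows how a host-side class can hold a device address as a member while making accidental host dereferences a compile error.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Wraps an address that is only meaningful on the GPU.
template <typename T>
class DevicePtr {
public:
    DevicePtr() = default;
    explicit DevicePtr(T* raw) : raw_(raw) {}

    // Raw address, intended for kernel launches or cudaMemcpy-style calls.
    T* get() const { return raw_; }

    // Address arithmetic is safe on the host; dereferencing is not,
    // so no operator* or operator-> is provided.
    DevicePtr operator+(std::ptrdiff_t n) const {
        assert(raw_ != nullptr);
        return DevicePtr(raw_ + n);
    }

    explicit operator bool() const { return raw_ != nullptr; }

private:
    T* raw_ = nullptr;
};

// A host-side object can store the handle as an ordinary member.
struct GpuJob {
    DevicePtr<float> input;
    std::size_t count = 0;
};

// Raw pointers never convert implicitly into or out of the wrapper.
static_assert(!std::is_convertible<float*, DevicePtr<float>>::value,
              "construction must be explicit");
static_assert(!std::is_convertible<DevicePtr<float>, float*>::value,
              "unwrapping must go through get()");
```

In a real program the pointer handed to the constructor would come from cudaMalloc, and get() would be used at the kernel launch, as in some_kernel<<<1,1>>>(job.input.get()).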
