read_only vs const on non-image OpenCL parameters

Reading the OpenCL documentation, I know that the access qualifiers read_only and write_only are intended for image memory.
However, I'm noticing some people use these qualifiers on regular, non-image-memory parameters, e.g.:
void foo(unsigned n, __global read_only int* data)
Note the lack of const.
My questions:
Does read_only imply, in particular, const?
Does read_only imply anything other than const? Something else that an OpenCL compiler can utilize?
... or is it just meaningless for non-image-memory, and ignored?

Related

OpenCL - Storing a large array in private memory

I have a large array of float called source_array with a size of around 50,000. I am currently trying to implement a collection of modifications on the array and evaluate it. Basically, in pseudocode:
__kernel void doSomething (__global float *source_array, __global boolean *res, __global int *mod_value) {
// Modify values of source_array with mod_value;
// Evaluate the modified array.
}
So in the process I would need a variable to hold the modified array, because source_array should be constant for all work items; if I modify it directly it might interfere with other work items (not sure if I am right here).
The problem is the array is too big for private memory, so I can't initialize it in kernel code. What should I do in this case?
I considered putting another parameter into the method to serve as a placeholder for the modified array, but again it would interfere with other work items.
Private "memory" on GPUs literally consists of registers, which generally are in short supply. So the __private address space in OpenCL is not suitable for this as I'm sure you've found.
Victor's answer is correct - if you really need temporary memory for each work item, you will need to create a (global) buffer object. If all work items need to independently mutate it, it will need a size of <WORK-ITEMS> * <BYTES-PER-ITEM> and each work-item will need to use its own slice of the buffer. If it's only temporary, you never need to copy it back to host memory.
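A minimal sketch of that layout, with hypothetical names (ITEM_SIZE stands for the number of floats each work item needs; it is not from the question):
#define ITEM_SIZE 50000  /* hypothetical: scratch floats per work item */

__kernel void doSomething(__global const float *source_array,
                          __global float *scratch,  /* get_global_size(0) * ITEM_SIZE floats */
                          __global int *res,
                          __global const int *mod_value)
{
    size_t gid = get_global_id(0);
    /* This work item's private slice of the shared scratch buffer. */
    __global float *my_copy = scratch + gid * ITEM_SIZE;

    for (int i = 0; i < ITEM_SIZE; ++i)
        my_copy[i] = source_array[i];  /* apply mod_value here */

    /* ... evaluate my_copy and record the verdict ... */
    res[gid] = 0;
}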
However, this sounds like an access pattern that will work very inefficiently on GPUs. You will do much better if you decompose your problem differently. For example, you may be able to make whole work-groups coordinate work on some subrange of the array: copy the subrange into local (group-shared) memory, divide the work between the work items in the group, write the results back to global memory, then read the next subrange into local memory, and so on. Coordinating between work items in a group is much more efficient than each work item accessing a huge range of global memory. We can only help you with this algorithmic approach if you are more specific about the computation you are trying to perform.
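To illustrate, a hedged sketch of that work-group pattern (names are made up; it assumes tile_size floats fit in local memory):
__kernel void processChunks(__global const float *source_array,
                            __global float *out,
                            __local float *tile,   /* one subrange per work-group */
                            const int tile_size)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    int base = get_group_id(0) * tile_size;

    /* Cooperative copy: the group strides over its subrange together. */
    for (int i = lid; i < tile_size; i += lsz)
        tile[i] = source_array[base + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... work on tile[] here, dividing indices between work items ... */

    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = lid; i < tile_size; i += lsz)
        out[base + i] = tile[i];
}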
Why not initialize this array in an OpenCL host memory buffer? I.e.
const size_t buffer_size = 50000 * sizeof(float);
/* allocate with malloc (or new float[50000] in C++, or a static initializer {0.1f, 0.2f, ...}) */
float *host_array_ptr = (float*)malloc(buffer_size);
/*
put your data into host_array_ptr here
*/
cl_int err_code;
cl_mem my_array = clCreateBuffer( my_cl_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, buffer_size, host_array_ptr, &err_code );
Then you can use this cl_mem my_array in your OpenCL kernel.
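To hand the buffer to a kernel, bind it to the matching argument slot (a minimal sketch; my_kernel stands for whatever kernel object you have built):
err_code = clSetKernelArg( my_kernel, 0, sizeof(cl_mem), &my_array );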

Pointer to a register on a 16 bit controller

How do you declare a pointer on a 16 bit Renesas RL78 microcontroller using IAR's EWB RL78 compiler to a register which has a 20 bit address?
Ex:
static int *ptr = (int *)0xF1000;
The above does not work because pointers are 16-bit addresses.
If the register in question is an on-chip peripheral, then it is likely that your toolchain already includes a processor header with all registers declared, in which case you should use that. If for some reason you cannot or do not wish to do that, then you could at least look at that to see how it declares such registers.
In any event you should at least declare the address volatile, since it is not a regular memory location and may change beyond the control and knowledge of your code as part of normal peripheral behaviour. Moreover, you should use explicitly sized data types, and it is unlikely that this register is signed.
#include <stdint.h>
...
static volatile uint16_t* ptr = (uint16_t*)0xF1000u ;
Added following clarification of target architecture:
The IAR RL78 compiler supports two data models - near and far. From the IAR compiler manual:
● The Near data model can access data in the highest 64 Kbytes of data memory.
● The Far data model can address data in the entire 1 Mbytes of data memory.
The near model is the default. The far model may be set using the compiler option --data_model=far; this will globally change the pointer type to allow 20-bit addressing (pointers are 3 bytes long in this case).
Even without specifying the data model globally it is possible to override the default pointer type by explicitly specifying the pointer type using the keywords __near and __far. So in the example in the question the correct declaration would be:
static volatile uint16_t __far* ptr = (volatile uint16_t __far*)0xF1000u ;
Note the position of the __far keyword is critical. Its position can be used to declare a pointer to far memory, or a pointer in far memory (or you can even declare both to and in far memory).
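To make the placement rule concrete, some illustrative declarations (not from the question):
uint16_t __far *p1;        /* near-located pointer to far data  */
uint16_t *__far p2;        /* far-located pointer to near data  */
uint16_t __far *__far p3;  /* far-located pointer to far data   */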
On an RL78, 0xF1000 in fact refers to the start of data flash rather than a register as stated in the question. Typically a pointer to a register would not be subject to alteration (which would mean it referred to a different register), so it might reasonably be declared const:
static volatile uint16_t __far* const ptr = (volatile uint16_t __far*)0xF1000u ;
Similarly to __far, the position of const is critical to the semantics. The above prevents ptr itself from being modified but allows the pointed-to data to be modified. Being flash memory, that may not always be desirable or possible, so it could reasonably be declared a const pointer to a const value.
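That variant would look like this (same address as above):
static const volatile uint16_t __far* const ptr = (const volatile uint16_t __far*)0xF1000u ;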
Note that for RL78 Special Function Registers (SFR) the IAR compiler has a keyword __sfr specifically for addressing registers in the area 0xFFF00-0xFFFFF:
Example:
#pragma location=0xFFF20
__no_init volatile uint8_t __sfr PORT1; // PORT1 is located at address 0xFFF20
Alternative syntax using an IAR-specific compiler extension:
__no_init volatile uint8_t __sfr PORT1 # 0xFFF20 ;

Specifying multiple address spaces for a type in the list of arguments of function

I wrote a function in OpenCL:
void sort(int* array, int size)
{
}
and I need to call the function once on a __private array and once on a __global array. Apparently, it's not allowed in OpenCL to specify multiple address spaces for a type. Therefore, I have to duplicate the function declaration, even though both versions have exactly the same body:
void sort_g(__global int* array, int size)
{
}
void sort_p(__private int* array, int size)
{
}
This is very inconvenient for maintaining the code, and I am wondering whether there is a better way to manage multiple address spaces in OpenCL.
P.S.: I don't see why OpenCL doesn't allow multiple address spaces for a type. The compiler could generate multiple instances of the function (one per address space) and use the appropriate one wherever the function is called in a kernel.
For OpenCL < 2.0, this is how the language is designed and there is no getting around it, regrettably.
For OpenCL >= 2.0, with the introduction of the generic address space, your first piece of code works as you would expect.
In short, upgrading to 2.0 would solve your problem (and bring in other niceties); otherwise you're out of luck (you could perhaps wrap your function in a macro, but ew, macros). A sketch of the macro workaround follows.
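The body is written once and stamped out per address space (the insertion sort here is just a placeholder, not from the question):
#define DEFINE_SORT(NAME, AS)                      \
void NAME(AS int* array, int size)                 \
{                                                  \
    for (int i = 1; i < size; ++i) {               \
        int key = array[i];                        \
        int j = i - 1;                             \
        while (j >= 0 && array[j] > key) {         \
            array[j + 1] = array[j];               \
            --j;                                   \
        }                                          \
        array[j + 1] = key;                        \
    }                                              \
}

DEFINE_SORT(sort_g, __global)
DEFINE_SORT(sort_p, __private)
Under OpenCL 2.0 you would instead build with -cl-std=CL2.0 and keep the single unqualified sort; the generic address space covers both call sites.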

CL_OUT_OF_RESOURCES GPU error

I am running multiple iterations of an OpenCL program, and after a few, I get the following error:
ERROR: Read Result (-5)
CL_OUT_OF_RESOURCES
when running this command
err = clEnqueueReadBuffer( commands, d_c, CL_TRUE, 0,
sizeof(char) * result_size,
result_buffer, 0, NULL, NULL );
checkErr(err,"Read Result");
The kernel uses 3 global memory buffers, which I release:
clReleaseMemObject(d_a);
clReleaseMemObject(d_b);
clReleaseMemObject(d_c);
clReleaseKernel(ko_smat);
But I also allocate local and private memory: the private memory is allocated in the kernel (char tmp_array), and the local memory is passed in as a kernel argument.
My kernel has definition:
__kernel void mmul(
__global char* C,
__global char* A,
__global char* B,
const int rA,
const int rB,
const int cC,
__local char* local_mem)
The local memory is allocated on the host side via
clSetKernelArg(ko_smat,6, sizeof(char) * local_mem_size, NULL);
I'm guessing that the out-of-resources error is caused by me failing to free either the private memory or the local memory, but I don't know how to do that.
Since I don't have enough reputation to comment, I have to use an answer.
To properly address your problem, it would be helpful if you posted a working example of your code.
How much local memory do you actually allocate? It may very well be that you allocate more than your device is capable of. If your local_mem_size variable is not fixed but calculated dynamically, find out the worst-case scenario.
You can query how much local memory your device can provide: just call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE.
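A minimal sketch of that query (device is the cl_device_id you are running on):
cl_ulong local_mem_size = 0;
err = clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                      sizeof(local_mem_size), &local_mem_size, NULL);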
As DarkZeros already mentioned, CL_OUT_OF_RESOURCES is an error that occurs on NVIDIA GPUs when addressing memory out of range. This can happen for both local and global memory.

CUDA/C++: Passing __device__ pointers in C++ code

I am developing a Windows 64-bit application that will manage concurrent execution of different CUDA-algorithms on several GPUs.
My design requires a way of passing pointers to device memory around C++ code (e.g. remembering them as members in my C++ objects).
I know that it is impossible to declare class members with __device__ qualifiers.
However, I couldn't find a definite answer on whether assigning a __device__ pointer to a normal C pointer and then using the latter works. In other words: is the following code valid?
__device__ float *ptr;
cudaMalloc(&ptr, size);
float *ptr2 = ptr;
some_kernel<<<1,1>>>(ptr2);
For me it compiled and behaved correctly but I would like to know whether it is guaranteed to be correct.
No, that code isn't strictly valid. While it might work on the host side (more or less by accident), if you tried to dereference ptr directly from device code, you would find it would have an invalid value.
The correct way to do what your code implies would be like this:
__device__ float *ptr;
__global__ void some_kernel()
{
float val = ptr[threadIdx.x];
....
}
float *ptr2;
cudaMalloc(&ptr2, size);
cudaMemcpyToSymbol("ptr", ptr2, sizeof(float *));
some_kernel<<<1,1>>>();
For CUDA 4.x or newer, change the cudaMemcpyToSymbol call to:
cudaMemcpyToSymbol(ptr, ptr2, sizeof(float *));
If the static device symbol ptr is really superfluous, you can just do something like this:
float *ptr2;
cudaMalloc(&ptr2, size);
some_kernel<<<1,1>>>(ptr2);
But I suspect that what you are probably looking for is something like the thrust library's device_ptr class, which is a nice abstraction wrapping the naked device pointer and makes it absolutely clear in code what is in device memory and what is in host memory.
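A hedged sketch of what that could look like (MyObject is an illustrative class, not from the question):
#include <thrust/device_ptr.h>
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>

struct MyObject {
    thrust::device_ptr<float> data;   // unambiguously device memory

    explicit MyObject(size_t n) : data(thrust::device_malloc<float>(n)) {}
    ~MyObject() { thrust::device_free(data); }
};

// When launching a kernel, unwrap it:
// some_kernel<<<1,1>>>(thrust::raw_pointer_cast(obj.data));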
