Extra 32 bytes of local memory allocated for OpenCL Kernel

I'd like to figure out why I'm receiving the following error for an OpenCL kernel that I'm trying to run:
Context error: [CL_OUT_OF_RESOURCES] :
OpenCL Error : clEnqueueNDRangeKernel failed: local memory usage (16416 bytes) is more than available on the device (16384 bytes)
The kernel is defined as:
__kernel void kernelFun(__read_only image2d_t src,
__global __write_only uchar8 *dst,
__global uchar4 *endpointBuffer,
__local uchar4 *pixelBuffer)
{
...
}
And I'm allocating the local memory using the standard clSetKernelArg routine:
clSetKernelArg(gKernel, 3, kPixelBufferBytes, NULL);
where kPixelBufferBytes is equal to 16384.
My question is, where are these extra 32 bytes coming from?

Some OpenCL implementations are known to store kernel arguments using the same physical memory that is used for local memory. You have 32 bytes worth of kernel arguments, which would explain where this discrepancy is coming from.
For example, NVIDIA GPUs definitely used to do this (see page 25 of NVIDIA's original OpenCL best practices guide).
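The arithmetic behind this explanation can be sketched in C. This is only an illustration: the 8-byte size per argument is an assumption (it makes the four arguments of kernelFun add up to the reported 32 bytes), and the function names are made up for this example. The real layout is implementation-defined.

```c
#include <stddef.h>

/* ASSUMPTION: every kernel argument occupies 8 bytes of the local-memory pool. */
enum { ASSUMED_BYTES_PER_ARG = 8 };

/* Bytes of local memory consumed by the kernel arguments themselves. */
size_t arg_overhead_bytes(size_t num_args)
{
    return num_args * ASSUMED_BYTES_PER_ARG;
}

/* Largest __local buffer that still fits next to the arguments.
   Returns 0 if the arguments alone exceed the device limit. */
size_t max_local_buffer_bytes(size_t device_local_bytes, size_t num_args)
{
    size_t overhead = arg_overhead_bytes(num_args);
    return overhead >= device_local_bytes ? 0 : device_local_bytes - overhead;
}
```

Under that assumption, max_local_buffer_bytes(16384, 4) gives 16352, so passing a size no larger than that as kPixelBufferBytes would probably avoid the error on such a device.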

Related

IN/OUT using STS LDS commands

How do I address (read/write) the I/O registers in the lower I/O block (0x00 - 0x63) using only the STS/LDS (or equivalent) instructions?
Thanks
Kris
The main question is "where are those registers mapped in the memory visible to STS/LDS?"
If you want to change IN/OUT to LDS/STS, you have to add an offset of 0x20 to the address used with IN/OUT. The answer is in the datasheet:
7.5 I/O Memory
The I/O space definition of the ATmega328P is shown in Section “” on page 275. All ATmega328P I/Os and peripherals are placed
in the I/O space. All I/O locations may be accessed by the LD/LDS/LDD
and ST/STS/STD instructions, transferring data between the 32 general
purpose working registers and the I/O space. I/O registers within the
address range 0x00 - 0x1F are directly bit-accessible using the SBI
and CBI instructions. In these registers, the value of single bits can
be checked by using the SBIS and SBIC instructions. Refer to the
instruction set section for more details. When using the I/O specific
commands IN and OUT, the I/O addresses 0x00 - 0x3F must be used. When
addressing I/O registers as data space using LD and ST instructions,
0x20 must be added to these addresses. The ATmega328P is a complex
microcontroller with more peripheral units than can be supported
within the 64 location reserved in opcode for the IN and OUT
instructions. For the extended I/O space from 0x60 - 0xFF in SRAM,
only the ST/STS/STD and LD/LDS/LDD instructions can be used. For
compatibility with future devices, reserved bits should be written to
zero if accessed. Reserved I/O memory addresses should never be
written. Some of the status flags are cleared by writing a logical one
to them. Note that, unlike most other AVR®, the CBI and SBI
instructions will only operate on the specified bit, and can therefore
be used on registers containing such status flags. The CBI and SBI
instructions work with registers 0x00 to 0x1F only. The I/O and
peripherals control registers are explained in later sections.
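The address translation described in the datasheet text can be sketched as a small C helper. The function and macro names are hypothetical and exist only for this illustration. The 0x20 offset and the 0x00 - 0x3F range come from the quoted section.

```c
/* Offset between IN/OUT addresses and data-space (LDS/STS) addresses,
   as stated in the quoted datasheet section. */
#define IO_TO_DATA_OFFSET 0x20
/* Highest address that the IN and OUT instructions can encode. */
#define LAST_IN_OUT_ADDRESS 0x3F

/* Converts an I/O address (as used with IN/OUT) to the data-space
   address to use with LDS/STS. Returns -1 if the input is outside
   the 0x00 - 0x3F range that IN/OUT can reach. */
int io_to_data_address(int io_address)
{
    if (io_address < 0 || io_address > LAST_IN_OUT_ADDRESS) {
        return -1;
    }
    return io_address + IO_TO_DATA_OFFSET;
}
```

For example, a register at I/O address 0x05 would be accessed with LDS/STS at data address 0x25.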

OpenCL eats host memory and device memory at the same time

My OpenCL program is like this:
1. init_opencl
2. process
3. release_opencl
In step 1 (init_opencl), I pre-create all the buffers that will be used, like this:
g_mem_under_down_lap_pyr_32FC3 = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE,
g_sizeMgr->getPyrCapacity() * 3 * sizeof(float), nullptr, &err);
g_mem_mean_down_lap_pyr_32FC3 = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE,
g_sizeMgr->getPyrCapacity() * 3 * sizeof(float), nullptr, &err);
Then, in step 2 (process), I first use clEnqueueWriteBuffer to transfer data from host memory to the buffers, then run the kernels. Finally, I read the data back from the buffers with clEnqueueReadBuffer.
At the end, in step 3 (release_opencl), I call clReleaseMemObject to release all the buffers.
But when I run the program, I observe that during step 2 (process), host memory and device memory both increase by about 600MB at the same time.
[Screenshot: host memory]
[Screenshot: device memory]
My question is:
Why does the OpenCL program occupy host memory when it has already occupied device memory?
And how can I make my OpenCL program not occupy host memory?
My test platform is a GTX 750.
On a Snapdragon 820, I also find that system memory increases by 600MB. But on this platform, I don't see any change in GPU memory.

Do any GPUs support fine grain system SVM?

OpenCL 2.0 introduced Shared Virtual Memory (SVM), allowing virtual memory addresses to be shared between hosts and devices.
There are a number of different SVM capabilities, see this extract from cl.h:
/* cl_device_svm_capabilities */
#define CL_DEVICE_SVM_COARSE_GRAIN_BUFFER (1 << 0)
#define CL_DEVICE_SVM_FINE_GRAIN_BUFFER (1 << 1)
#define CL_DEVICE_SVM_FINE_GRAIN_SYSTEM (1 << 2)
#define CL_DEVICE_SVM_ATOMICS (1 << 3)
According to this article from Intel, the CL_DEVICE_SVM_FINE_GRAIN_SYSTEM capability means that an OpenCL device can share the operating system's address space, without creating an SVM buffer for it.
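As a rough sketch of how these bits are tested, the C snippet below mirrors the flag values from the cl.h extract under local names, so that it compiles without the OpenCL headers. The SVM_* names and the helper functions are hypothetical. In a real program the bitfield would come from clGetDeviceInfo with CL_DEVICE_SVM_CAPABILITIES.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the bit values shown in the cl.h extract above. */
#define SVM_COARSE_GRAIN_BUFFER (1u << 0)
#define SVM_FINE_GRAIN_BUFFER   (1u << 1)
#define SVM_FINE_GRAIN_SYSTEM   (1u << 2)
#define SVM_ATOMICS             (1u << 3)

/* True if the capability bitfield has the fine-grain system bit set. */
bool supports_fine_grain_system(uint64_t caps)
{
    return (caps & SVM_FINE_GRAIN_SYSTEM) != 0;
}

/* True if every bit in `wanted` is also set in `caps`. */
bool supports_all(uint64_t caps, uint64_t wanted)
{
    return (caps & wanted) == wanted;
}
```

A device that reports only coarse-grain and fine-grain buffer support (value 3) would fail the supports_fine_grain_system check, which matches the behaviour described for the normal Intel platform.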
Supporting fine grained SVM with a CPU device should be relatively simple. My (6th gen, Skylake) system reports that it supports CL_DEVICE_SVM_FINE_GRAIN_SYSTEM using the Intel Experimental OpenCL 2.1 CPU Only Platform. However, the Skylake GPU and CPU do not support CL_DEVICE_SVM_FINE_GRAIN_SYSTEM using the normal Intel(R) OpenCL platform.
I can imagine that it is very hard (if not impossible!) for a GPU on a graphics card to support fine grained SVM. However, it should be possible for a GPU on an APU, such as an Intel i7 or an AMD A10 to support it.
Do any GPUs support fine grained system Shared Virtual Memory?

How to integrate CPP applications with OpenCL code

I'm developing on Linux in C++ (using the Eclipse SDK).
I'm a novice at OpenCL (GPU programming).
I want to execute some of my code on the GPU (rewrite some functions with OpenCL and run them on the GPU).
I'm a little bit confused: if I write some code (.cl files), how can I call it from my C++ application?
I haven't seen any examples that cover this.
There are two parts to the code if you want to use OpenCL.
A. The kernel code
This consists of one or more kernel functions that perform your calculations on a device.
B. The host code
This is normal C/C++ code. What happens here:
Pick a device for the kernel to be computed on (GPU/CPU/iGPU/Xeon Phi/...).
In OpenCL you have a set of platforms, each of which can contain several different devices, so you pick a platform AND a device.
Example:
platform: Intel CPU+GPU OpenCL 1.2
device: CPU OR iGPU
Build your kernel (after creating the program, clBuildProgram and clCreateKernel are called in the same way):
const char * code = load_program_source("kernel.cl");
cl_program program = clCreateProgramWithSource(context, 1, (const char **)&code, NULL, &err);
errWrapper("clCreateProgramWithSource", err);
Create a buffer for memory transfers to the device:
cl_mem devInput1 = clCreateBuffer(context, CL_MEM_READ_ONLY, variable1 * sizeof(int), NULL, &err);
Set the kernel arguments (the data itself is copied to the device with clEnqueueWriteBuffer):
errWrapper("setKernel", clSetKernelArg(countKeyCardinality, 0, sizeof (cl_mem), &devInput1));
Launch the kernel:
errWrapper("clEnqueueNDRangeKernel", clEnqueueNDRangeKernel(command_queue, kernel_function1, 1, NULL, &tasksize, NULL, 0, NULL, NULL));
Wait for termination:
clFinish(command_queue);
Fetch your result from the device using clEnqueueReadBuffer.
Proceed with your C++ code using the result created by the OpenCL calculations.
That's the basic idea of using OpenCL in your code.
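The load_program_source helper used in the build step is not shown in the answer. A minimal sketch of what such a helper might look like, assuming it simply reads the whole .cl file into a NUL-terminated heap buffer that the caller frees, uses only the C standard library:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical implementation of the helper named in the answer.
   Reads the entire file at `path` into a newly allocated,
   NUL-terminated buffer. Returns NULL on any failure.
   The caller owns the buffer and must free() it. */
char *load_program_source(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) {
        return NULL;
    }
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return NULL;
    }
    long size = ftell(f);
    if (size < 0) {
        fclose(f);
        return NULL;
    }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf) {
        fclose(f);
        return NULL;
    }
    size_t n = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (n != (size_t)size) {
        free(buf);
        return NULL;
    }
    buf[size] = '\0';
    return buf;
}
```

The returned pointer can be passed to clCreateProgramWithSource as in the snippet above. Checking it for NULL before use avoids handing an invalid source string to the OpenCL runtime.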
It is best to start by working through a full OpenCL tutorial (just search for one; there are plenty).
Concepts you should be familiar with:
OpenCL host API
command queue
kernel arguments
work group
local size
local memory
global memory
cl_mem object
Debugging OpenCL is possible but painful. I would suggest debugging with normal C code first and porting it to OpenCL once it works.
The main source for all commands is the official API documentation, which can be found here: OpenCL 1.2 API
Edit: you do not need a special IDE to write OpenCL code.

Is It Possible To Edit OpenCL Kernels On The GPU?

I am looking to edit (or generate new) OpenCL kernels on the GPU from an executing kernel.
Right now, the only way I can see of doing this is to create a buffer full of numbers representing code on the device, send that buffer to an array on the host, generate/compile a new kernel on the host, and enqueue the new kernel back to the device.
Is there any way to avoid a round trip to the host, and just edit kernels on the device?
Can a kernel access the registers in any way?
