OpenCL eats host memory and device memory at the same time

My OpenCL program is like this:
1. init_opencl
2. process
3. release_opencl
In 1. init_opencl, I pre-create all buffers that will be used, like this:
g_mem_under_down_lap_pyr_32FC3 = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE,
g_sizeMgr->getPyrCapacity() * 3 * sizeof(float), nullptr, &err);
g_mem_mean_down_lap_pyr_32FC3 = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE,
g_sizeMgr->getPyrCapacity() * 3 * sizeof(float), nullptr, &err);
Then in 2. process, I use clEnqueueWriteBuffer to transfer data from host memory to the buffers first, then run the kernels. At the end, I read the data back from the buffers with clEnqueueReadBuffer.
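In outline, step 2 looks like this (a simplified sketch; the queue, kernel, host pointers, and work size are placeholders, and error checking is omitted):
size_t n = g_sizeMgr->getPyrCapacity() * 3 * sizeof(float);
// host -> device (host_src is a placeholder host pointer)
clEnqueueWriteBuffer(g_queue, g_mem_under_down_lap_pyr_32FC3, CL_TRUE,
    0, n, host_src, 0, nullptr, nullptr);
// run the kernels (arguments are set beforehand with clSetKernelArg)
clEnqueueNDRangeKernel(g_queue, g_kernel, 1, nullptr,
    &global_work_size, nullptr, 0, nullptr, nullptr);
// device -> host (host_dst is a placeholder host pointer)
clEnqueueReadBuffer(g_queue, g_mem_mean_down_lap_pyr_32FC3, CL_TRUE,
    0, n, host_dst, 0, nullptr, nullptr);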
Finally, in 3. release_opencl, I call clReleaseMemObject to release all the buffers.
But when I run the program, I observe that during 2. process, host memory and device memory both increase by about 600 MB at the same time.
[Screenshots: host memory and device memory usage]
My questions are:
Why does the OpenCL program occupy host memory when it has already occupied device memory?
And how can I make my OpenCL program not occupy host memory?
My test platform is a GTX 750.
On a Snapdragon 820, I also see system memory increase by about 600 MB, but on that platform I don't see any change in GPU memory.

Related

How to find memory buffer size limits in SONiC OS?

As per the info from Juniper about memory buffers here, a switch's memory is divided into two main parts: a Dedicated buffer and a Shared buffer. The memory allocated for the Dedicated buffer is equally distributed among all ports, while Shared buffer memory is available for shared consumption among ports based on specific traffic types. There is also a Shared headroom buffer that is used to hold traffic received during the PFC pause time window.
I want to understand how to get the following buffer limits in SONiC OS:
Dedicated buffer size limit for a port
Shared buffer size limit for a given port's queue
Shared headroom size limit

How to integrate CPP applications with OpenCL code

I'm developing on Linux with C++ (using the Eclipse SDK).
I'm a novice at OpenCL (GPU programming).
I want to execute some of my code on the GPU (rewrite some functions with OpenCL and run them on the GPU).
I'm a little bit confused: if I write some code (.cl files), how can I call it from my C++ application?
I haven't seen any examples for this.
There are two parts to the code if you want to use OpenCL.
A. The kernel code
It consists of one to many kernel functions which perform your calculations on a device.
B. The host code
Normal C/C++ code. What happens here:
pick a device for the kernel to be computed on (GPU/CPU/iGPU/Xeon Phi/...)
in OpenCL you have a set of platforms, each of which can contain several different devices, so you pick a platform AND a device.
example:
platform: Intel CPU+GPU, OpenCL 1.2
device: CPU or iGPU
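A minimal sketch of that selection (error handling omitted; it simply takes the first platform and its first GPU device, and creates the context and command_queue used in the snippets below):
cl_platform_id platform;
cl_device_id device;
cl_int err;
// take the first available platform...
clGetPlatformIDs(1, &platform, NULL);
// ...and its first GPU device (use CL_DEVICE_TYPE_CPU to target the CPU instead)
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
// context and command queue for that device
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &err);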
build your kernel
const char * code = load_program_source("kernel.cl");
cl_program program = clCreateProgramWithSource(context, 1, (const char **)&code, NULL, &err);
errWrapper("clCreateProgramWithSource", err);
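The program then still has to be compiled, and a kernel object extracted from it; roughly like this (assuming the .cl file contains a __kernel function named kernel_function1):
// compile for the chosen device; on failure, query CL_PROGRAM_BUILD_LOG via clGetProgramBuildInfo
errWrapper("clBuildProgram", clBuildProgram(program, 1, &device, NULL, NULL, NULL));
// create a kernel object for one __kernel function in the program
cl_kernel kernel_function1 = clCreateKernel(program, "kernel_function1", &err);
errWrapper("clCreateKernel", err);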
create a buffer for memory transfers to the device:
cl_mem devInput1 = clCreateBuffer(context, CL_MEM_READ_ONLY, variable1 * sizeof(int), NULL, &err);
transfer your data to the device and pass the buffer to the kernel as an argument:
errWrapper("clSetKernelArg", clSetKernelArg(kernel_function1, 0, sizeof(cl_mem), &devInput1));
launch kernel
errWrapper("clEnqueueNDRangeKernel", clEnqueueNDRangeKernel(command_queue, kernel_function1, 1, NULL, &tasksize, NULL, 0, NULL, NULL));
wait for termination
clFinish(command_queue)
Fetch your result from the device using clEnqueueReadBuffer.
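A sketch (devOutput1 and hostResult are hypothetical: an output buffer created like devInput1 but with CL_MEM_WRITE_ONLY, and a host array of the same size):
// blocking read: copy the device result back into host memory
errWrapper("clEnqueueReadBuffer",
    clEnqueueReadBuffer(command_queue, devOutput1, CL_TRUE, 0,
        variable1 * sizeof(int), hostResult, 0, NULL, NULL));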
Proceed with your C++ code using the result produced by the OpenCL calculations.
That's the basic idea of using OpenCL in your code.
It is better to start with a full OpenCL tutorial (just Google it, you will drown in OpenCL tutorials).
Concepts you should be familiar with:
opencl host api
command queue
kernel arguments.
work group
local size
local memory
global memory
cl_mem object
Debugging OpenCL is possible but painful. I would suggest debugging with normal C code and porting it to OpenCL once it works.
The main source for all commands is the official API documentation, which can be found here: OpenCL 1.2 API
Edit: you do not need a special IDE to code OpenCL.

Extra 32 bytes of local memory allocated for OpenCL Kernel

I'd like to figure out why I'm receiving the following error for an OpenCL kernel that I'm trying to run:
Context error: [CL_OUT_OF_RESOURCES] :
OpenCL Error : clEnqueueNDRangeKernel failed: local memory usage (16416 bytes) is more than available on the device (16384 bytes)
The kernel is defined as:
__kernel void kernelFun(__read_only image2d_t src,
__global __write_only uchar8 *dst,
__global uchar4 *endpointBuffer,
__local uchar4 *pixelBuffer)
{
...
}
And I'm allocating the local memory using the standard clSetKernelArg routine:
clSetKernelArg(gKernel, 3, kPixelBufferBytes, NULL);
where kPixelBufferBytes is equal to 16384.
My question is, where are these extra 32 bytes coming from?
Some OpenCL implementations are known to store kernel arguments in the same physical memory that is used for local memory. You have 32 bytes worth of kernel arguments (presumably the four arguments at 8 bytes each), which would explain where the discrepancy is coming from.
For example, NVIDIA GPUs definitely used to do this (see page 25 of NVIDIA's original OpenCL best practices guide).
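One defensive way to size the dynamic __local argument is to query what the device offers and what the kernel already consumes, and request only the remainder. A sketch (gDevice is assumed to be the device handle; whether the reported value already includes the per-argument overhead is implementation-dependent, hence the extra margin):
cl_ulong device_local = 0, kernel_local = 0;
// total local memory on the device (16384 bytes in this case)
clGetDeviceInfo(gDevice, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(device_local), &device_local, NULL);
// local memory the kernel already uses (static __local variables plus any implementation-reserved space)
clGetKernelWorkGroupInfo(gKernel, gDevice, CL_KERNEL_LOCAL_MEM_SIZE, sizeof(kernel_local), &kernel_local, NULL);
// request what is left, keeping a small margin for argument storage
size_t kPixelBufferBytes = (size_t)(device_local - kernel_local) - 64;
clSetKernelArg(gKernel, 3, kPixelBufferBytes, NULL);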

Is It Possible To Edit OpenCL Kernels On The GPU?

I am looking to edit (or generate new) OpenCL kernels on the GPU from an executing kernel.
Right now, the only way I can see of doing this is to create a buffer full of numbers representing code on the device, send that buffer to an array on the host, generate/compile a new kernel on the host, and enqueue the new kernel back to the device.
Is there any way to avoid a round trip to the host, and just edit kernels on the device?
Can a kernel access the registers in any way?

OpenCL data transfer and DMA

In the AMD APP programming guide it is written (p. 4-15) that:
For transfers <=32 kB: For transfers from the host to device, the data is copied by the CPU
to a runtime pinned host memory buffer, and the DMA engine transfers the
data to device memory. The opposite is done for transfers from the device to
the host.
Is the DMA mentioned above performed by the CPU's DMA engine or the GPU's DMA engine?
I believe it is the GPU's DMA engine, since on some cards (e.g., NVIDIA) you can do simultaneous reads and writes, so this is a GPU capability rather than a CPU capability.
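For what it's worth, applications can also allocate their own pinned (page-locked) host memory so that the transfer can skip the runtime's internal staging copy. A sketch of that pattern (names are placeholders; the exact behaviour is runtime-specific):
// pinned host-side staging buffer
cl_mem pinned = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
// map it to obtain a host pointer, fill it, then unmap before the transfer
void *host_ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
memcpy(host_ptr, src, size);   // src: ordinary host data
clEnqueueUnmapMemObject(queue, pinned, host_ptr, 0, NULL, NULL);
// device-side buffer; the copy from pinned memory to it can be done by the DMA engine
cl_mem device_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &err);
clEnqueueCopyBuffer(queue, pinned, device_buf, 0, 0, size, 0, NULL, NULL);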
