I'm developing on Linux in C++ (using the Eclipse SDK).
I'm a novice at OpenCL (GPU programming).
I want to execute some of my code on the GPU (rewrite some functions in OpenCL and run them on the GPU).
I'm a little bit confused: if I write some code (.cl files), how can I call it from my C++ application?
I haven't seen any examples for this.
There are two parts to the code if you want to use OpenCL.
A. The kernel code
Consists of one to many kernel functions which perform your calculations on a device.
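For example, a minimal kernel file (kernel.cl) might look like this; the name kernel_function1 matches the host code below, but the arguments are purely illustrative:

__kernel void kernel_function1(__global const int *a,
                               __global const int *b,
                               __global int *result)
{
    // one work-item per element: add two int arrays element-wise
    int gid = get_global_id(0);
    result[gid] = a[gid] + b[gid];
}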
B. The host code
Normal C/C++ code. What happens here:
Pick a device for the kernel to be computed on (GPU/CPU/iGPU/Xeon Phi/...).
In OpenCL you have a set of platforms, each of which can contain several different devices, so you pick a platform AND a device.
Example:
platform: Intel CPU+GPU, OpenCL 1.2
device: CPU or iGPU
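A sketch of that selection with the C host API (this simply takes the first platform and its first GPU; error checks omitted):

cl_platform_id platform;
cl_device_id device;
cl_int err;
clGetPlatformIDs(1, &platform, NULL);                            // first available platform
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  // or CL_DEVICE_TYPE_CPU
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &err);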
Build your kernel:
const char *code = load_program_source("kernel.cl");
cl_program program = clCreateProgramWithSource(context, 1, (const char **)&code, NULL, &err);
errWrapper("clCreateProgramWithSource", err);
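Note that clCreateProgramWithSource only registers the source; a sketch of the compile step that follows (the kernel name string must match the function in kernel.cl):

errWrapper("clBuildProgram", clBuildProgram(program, 1, &device, NULL, NULL, NULL));
cl_kernel kernel_function1 = clCreateKernel(program, "kernel_function1", &err);
errWrapper("clCreateKernel", err);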
Create a buffer for memory transfers to the device:
cl_mem devInput1 = clCreateBuffer(context, CL_MEM_READ_ONLY, variable1 * sizeof(int), NULL, &err);
Pass the buffer to the kernel as an argument (note: this does not copy your data; the actual host-to-device transfer is done with clEnqueueWriteBuffer):
errWrapper("clSetKernelArg", clSetKernelArg(kernel_function1, 0, sizeof(cl_mem), &devInput1));
Launch the kernel:
errWrapper("clEnqueueNDRangeKernel", clEnqueueNDRangeKernel(command_queue, kernel_function1, 1, NULL, &tasksize, NULL, 0, NULL, NULL));
Wait for termination:
clFinish(command_queue);
Fetch your result from the device using clEnqueueReadBuffer.
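A sketch of that read, assuming devOutput is a result buffer created like devInput1 (but with CL_MEM_WRITE_ONLY) and hostResult is a host array of resultSize ints; all three names are illustrative:

errWrapper("clEnqueueReadBuffer",
           clEnqueueReadBuffer(command_queue, devOutput, CL_TRUE, 0,
                               resultSize * sizeof(int), hostResult, 0, NULL, NULL));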
Proceed with your C++ code using the results created by the OpenCL calculations.
That's the basic idea of using OpenCL in your code.
Better start by doing a full OpenCL tutorial (just Google it; you will drown in OpenCL tutorials).
Concepts you should be familiar with:
OpenCL host API
command queue
kernel arguments
work group
local size
local memory
global memory
cl_mem object
Debugging OpenCL is possible but painful. I would suggest debugging with normal C code first and porting it to OpenCL once it works.
The main source for all commands is the official API documentation, which can be found here: OpenCL 1.2 API
Edit: you do not need a special IDE to code OpenCL.
I am trying to find an equivalent function in ESP-IDF that works like Arduino's Stream().
What I am trying to do is write an MSP function to communicate over the MSP protocol through the ESP's UART. I am using ESP-IDF and FreeRTOS in an Ubuntu environment, with CMake to build.
https://www.arduino.cc/reference/en/language/functions/communication/stream/
https://github.com/yajo10/MSP-Arduino/blob/master/MSP.cpp
I tried to use std::ostringstream*, but obviously it doesn't do the same job.
ostringstream is a specialised ostream where the "device" is a memory buffer. Stream is an unspecialised base class for a number of device sub-classes, and it is bi-directional. The standard library equivalent, given a suitable stream I/O driver, would be std::istream and std::ostream for input and output respectively, each opened on the specific device. In most cases you would use the derived std::ifstream and std::ofstream classes and open the device as a "file stream".
If you require an identical interface in order to use the code unmodified, then implementing Stream as a wrapper around iostream is feasible.
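A minimal sketch of such a wrapper (the class name and the chosen subset of the Stream interface are my own illustration, not ESP-IDF or Arduino API):

#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical adapter exposing an Arduino-Stream-like interface on top
// of a std::istream/std::ostream pair.
class StreamWrapper {
    std::istream& in;
    std::ostream& out;
public:
    StreamWrapper(std::istream& i, std::ostream& o) : in(i), out(o) {}

    int available() { return static_cast<int>(in.rdbuf()->in_avail()); } // bytes ready to read
    int read()  { int c = in.get();  return in.good() ? c : -1; }        // -1 on EOF/error, like Stream
    int peek()  { int c = in.peek(); return in.good() ? c : -1; }
    std::size_t write(std::uint8_t b) { out.put(static_cast<char>(b)); return out.good() ? 1 : 0; }
    void flush() { out.flush(); }
};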
My OpenCL program is structured like this:
1. init_opencl
2. process
3. release_opencl
In 1. init_opencl, I pre-create all the buffers that will be used, like this:
g_mem_under_down_lap_pyr_32FC3 = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE,
g_sizeMgr->getPyrCapacity() * 3 * sizeof(float), nullptr, &err);
g_mem_mean_down_lap_pyr_32FC3 = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE,
g_sizeMgr->getPyrCapacity() * 3 * sizeof(float), nullptr, &err);
Then in 2. process, I use clEnqueueWriteBuffer to transfer data from host memory to the buffers first, then run the kernels. At the end, I read the data back from the buffers with clEnqueueReadBuffer.
Finally, in 3. release_opencl, I call clReleaseMemObject to release all the buffers.
But when I run the program, I observe that in 2. process, the host memory and the device memory both increase by about 600 MB at the same time:
(screenshots: host memory and device memory usage)
My questions are:
Why does the OpenCL program occupy host memory when it has already occupied device memory?
And how can I make my OpenCL program not occupy host memory?
My test platform is a GTX 750. On a Snapdragon 820, I also see the system memory increase by 600 MB, but on that platform I don't see the GPU memory change.
I want to run some threads on my CPU (the localhost) and some others on a connected portable device (like a USB device).
I know that OpenCL supports parallelization, but how do I distribute work onto a portable device using OpenCL?
Any idea for doing this other than OpenCL would also help.
Any device which might run an OpenCL task must have an Installable Client Driver (ICD) associated with it, which can be picked up by the OpenCL driver on the computer in question. Graphics cards (especially if they're no older than half a decade) are nearly guaranteed to have a valid ICD, provided their drivers are up to date, and many consumer-level CPUs have ICDs provided by their drivers.
However, other devices like a network device or a USB device are considerably less likely to have a valid ICD unless they've been specifically designed for use in a heterogeneous compute system. If they do have a valid ICD, then it's a mere matter of querying for their platform at runtime, choosing it when constructing your OpenCL context, and then using it the same way you'd use OpenCL normally:
// C++ OpenCL API
#include <string>
#include <vector>
#include <CL/cl.hpp>

cl::Platform target_platform;
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
for (cl::Platform &platform : platforms) {
    std::string name = platform.getInfo<CL_PLATFORM_NAME>();
    if (name == /*Whatever the name of the platform is*/) {
        target_platform = platform;
        break;
    }
}

std::vector<cl::Device> devices;
target_platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
cl::Device target_device;
for (cl::Device &device : devices) {
    if (device.getInfo</*...*/>() == /*...*/) { // Whatever properties you need
        target_device = device;
        break;
    }
}
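From there, constructing the context against the chosen device works as usual; a one-line sketch:

cl::Context context(target_device); // programs and command queues are then created from this context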
I'd like to figure out why I'm receiving the following error for an OpenCL kernel that I'm trying to run:
Context error: [CL_OUT_OF_RESOURCES] :
OpenCL Error : clEnqueueNDRangeKernel failed: local memory usage (16416 bytes) is more than available on the device (16384 bytes)
The kernel is defined as:
__kernel void kernelFun(__read_only image2d_t src,
__global __write_only uchar8 *dst,
__global uchar4 *endpointBuffer,
__local uchar4 *pixelBuffer)
{
...
}
And I'm allocating the local memory using the standard clSetKernelArg routine:
clSetKernelArg(gKernel, 3, kPixelBufferBytes, NULL);
where kPixelBufferBytes is equal to 16384.
My question is, where are these extra 32 bytes coming from?
Some OpenCL implementations are known to store kernel arguments using the same physical memory that is used for local memory. You have 32 bytes worth of kernel arguments, which would explain where this discrepancy is coming from.
For example, NVIDIA GPUs definitely used to do this (see page 25 of NVIDIA's original OpenCL best practices guide).
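If you want to size the __local buffer defensively rather than hard-coding 16384, one hedged approach (assuming the implementation reports its argument overhead through CL_KERNEL_LOCAL_MEM_SIZE, which is not guaranteed) is to query what is left:

cl_ulong device_local = 0, kernel_local = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(device_local), &device_local, NULL);
clGetKernelWorkGroupInfo(gKernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(kernel_local), &kernel_local, NULL);
size_t kPixelBufferBytes = (size_t)(device_local - kernel_local); // what remains for pixelBuffer
clSetKernelArg(gKernel, 3, kPixelBufferBytes, NULL);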
I am looking to edit (or generate new) OpenCL kernels on the GPU from an executing kernel.
Right now, the only way I can see of doing this is to create a buffer full of numbers representing code on the device, send that buffer to an array on the host, generate/compile a new kernel on the host, and enqueue the new kernel back to the device.
Is there any way to avoid a round trip to the host, and just edit kernels on the device?
Can a kernel access the registers in any way?