I'm creating an Image2d object on the host using the flag CL_MEM_READ_WRITE. This image is the output of one kernel and I want it to be used as an input to a different kernel. I'm also using cl_image_format = {CL_INTENSITY, CL_FLOAT};
Is this possible in OpenCL 1.2? I've found nothing that says you can't do this, yet when I try it, my second kernel returns all zeros but no error.
I've also tried using clEnqueueCopyImage to copy the output of the first kernel to a different Image2d (also created using CL_MEM_READ_WRITE) and using that as input to the second kernel, but that also does not work.
I've verified the output of my first kernel is correct.
Thanks for any insight.
Yes, the output image from one kernel can be used as input to a subsequent kernel.
As long as the image is created with CL_MEM_READ_WRITE, it can be declared either __read_only or __write_only in a kernel in OpenCL 1.x.
OpenCL 2.0 further allows images to be declared __read_write, but special rules must be followed (such as barriers) to get correct results.
For more information on read/write images, please see https://software.intel.com/en-us/articles/using-opencl-20-read-write-images
Don't try to cheat by passing the same image2d_t twice to get both read and write access in a single kernel (see: OpenCL - Pass image2d_t twice to get both read and write from kernel?).
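For what it's worth, a minimal sketch of that pattern using the C++ bindings; the kernel and variable names here are placeholders, not taken from the question:

// Host: one CL_MEM_READ_WRITE image, written by the first kernel, read by the second.
cl::ImageFormat format(CL_INTENSITY, CL_FLOAT);
cl::Image2D img(context, CL_MEM_READ_WRITE, format, width, height);
producer.setArg(0, img); // declared __write_only in the first kernel
queue.enqueueNDRangeKernel(producer, cl::NullRange, cl::NDRange(width, height));
consumer.setArg(0, img); // declared __read_only in the second kernel
queue.enqueueNDRangeKernel(consumer, cl::NullRange, cl::NDRange(width, height));

And the matching kernel signatures:

__kernel void producer(__write_only image2d_t out) {
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    write_imagef(out, pos, (float4)(1.0f)); // CL_INTENSITY stores a single channel
}

__kernel void consumer(__read_only image2d_t in) {
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 v = read_imagef(in, pos); // samplerless reads are fine in OpenCL 1.2
    // ... use v ...
}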
I'm building an OpenCL program using NVIDIA CUDA 11.2's OpenCL library (and its C++ bindings). After invoking cl::Program::build() successfully for a single device (passing a vector with that one device), I obtain the sizes of the generated "binaries" using built_program.getInfo<CL_PROGRAM_BINARY_SIZES>(), which also succeeds, but gives me 3 values: one non-zero value and two zeros. When I print the first binary, I see the PTX code I expect.
My question: Why am I given two (empty) extra binaries?
Even though the program is built only for the devices you specify (see the documentation for clBuildProgram), the binaries are made available for each device in the context. In your case, you probably have three GPUs on your system; you built the program for a single device, so you see a non-empty PTX for only one of the three devices.
Confusing? Sure. Convoluted? Yes. But is it entirely senseless? Admittedly, not really.
Digging around a bit further, it seems this is even officially documented (emphasis mine):
Returns an array that contains the size in bytes of the program binary (could be an executable binary, compiled binary or library binary) for each device associated with program. The size of the array is the number of devices associated with program. If a binary is not available for a device(s), a size of zero is returned.
Not every device for which you built, but every device associated with the program, which is probably every device in the OpenCL context with which you created the program.
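If you want to see the mapping for yourself, here is a small sketch using the C++ bindings (the header name and the built_program variable are assumptions; adjust them to your setup):

#include <CL/cl.hpp> // or <CL/opencl.hpp>, depending on your bindings version
#include <iostream>
#include <vector>

void print_binary_sizes(const cl::Program& built_program) {
    // Devices associated with the program - typically every device in the context.
    std::vector<cl::Device> devices = built_program.getInfo<CL_PROGRAM_DEVICES>();
    std::vector<size_t> sizes = built_program.getInfo<CL_PROGRAM_BINARY_SIZES>();
    for (size_t i = 0; i < sizes.size(); ++i) {
        std::cout << devices[i].getInfo<CL_DEVICE_NAME>() << ": " << sizes[i]
                  << (sizes[i] == 0 ? " bytes (not built for this device)" : " bytes")
                  << "\n";
    }
}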
I'm using a GPU driver that is optimized to work with 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it on the host as, for example, cl_float16, with the array 16 times smaller than the original float array?
What is the best way to access this type in the OpenCL kernel?
Thanks in advance.
In host code you can use the cl_float16 host type. Access it like an array (e.g., value.s[5]). Pass it as a kernel argument. In the kernel, access it like value.s5.
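A brief sketch of both sides, with illustrative names (host_data, buf, scale):

// Host: an array of N floats is viewed as N/16 cl_float16 elements.
std::vector<cl_float16> host_data(N / 16);
host_data[0].s[5] = 42.0f; // host-side component access uses the .s array
cl::Buffer buf(context, CL_MEM_READ_WRITE, sizeof(cl_float16) * host_data.size());
queue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float16) * host_data.size(), host_data.data());
scale.setArg(0, buf);

// Kernel: the matching device type is float16; components are .s0-.s9 and .sa-.sf.
__kernel void scale(__global float16* data) {
    size_t i = get_global_id(0); // global size = N / 16
    data[i].s5 = data[i].s5 * 2.0f; // the element the host wrote as .s[5]
}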
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only matters if you plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because the memory needs to be properly aligned for GPU zero-copy; otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter - the driver will allocate a proper buffer on the GPU.
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. IOW, every part of your algorithm needs to work with float16 types. If you just use simple floats, or you declare the buffer as __global float16* X but then use element access (X.s0, X.w and such) and work with those scalars, the performance will be the same as if you had declared the buffer __global float* X - very likely crap.
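To make that concrete, a hedged sketch: the first kernel below does real vector work per work-item, the second declares float16 but still computes one lane at a time, which is exactly the trap described above:

// Properly vectorized: each work-item computes on all 16 lanes at once.
__kernel void vectorized(__global const float16* in, __global float16* out) {
    size_t i = get_global_id(0); // global size = N / 16
    out[i] = in[i] * 2.0f + 1.0f; // whole-vector arithmetic
}

// Declared as float16 but effectively scalar - expect plain-float performance.
__kernel void not_really_vectorized(__global const float16* in, __global float16* out) {
    size_t i = get_global_id(0);
    out[i].s0 = in[i].s0 * 2.0f + 1.0f; // only one lane does useful work
}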
I want to implement an algorithm in OpenCL which needs to apply a certain transformation on a 3D grayscale image several times. I have an input and an output image for my kernel. Now I would like to simply swap the input and output image and apply the kernel again. However, one image was created with read_only and the other one with write_only. Does this mean I have to use conventional buffers, or is there some trick to flip the two images without first copying them from the device back to the host and then back to the device again?
You say: "However, one image was created with read_only and the other one with write_only". The obvious answer is: don't do that, and you'll be fine.
The less obvious subtext is: There's a difference between creating an image with writeonly/readonly flags (which is done on the host-side via clCreateImage(...,CL_MEM_WRITE_ONLY/CL_MEM_READ_ONLY)) and the access-type inside a particular kernel (which is specified with the __read_only/__write_only qualifiers in the kernel's arguments definition).
Unless I'm totally mistaken, you can safely create your image with no restrictions (i.e. CL_MEM_READ_WRITE), then use it as a kernel's input parameter, and for the next kernel run, use it as the output parameter. You just can't mix read/write accesses during a single kernel run.
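A minimal ping-pong sketch along those lines, using the C++ bindings; every identifier here is a placeholder:

// Two unrestricted 3D images; their roles swap on every pass.
cl::Image3D imgA(context, CL_MEM_READ_WRITE, format, w, h, d);
cl::Image3D imgB(context, CL_MEM_READ_WRITE, format, w, h, d);
for (int pass = 0; pass < iterations; ++pass) {
    bool even = (pass % 2 == 0);
    transform.setArg(0, even ? imgA : imgB); // __read_only parameter in the kernel
    transform.setArg(1, even ? imgB : imgA); // __write_only parameter in the kernel
    queue.enqueueNDRangeKernel(transform, cl::NullRange, cl::NDRange(w, h, d));
}
// The final result ends up in imgA if iterations is even, in imgB otherwise.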
In the first stage of my project I generate some vertices; in the second stage I read these vertices and create a connectivity array. For my vertices I have used CL_MEM_READ_WRITE. I wanted to know: would I get a performance increase if I used a CL_MEM_WRITE_ONLY buffer in the first stage and then copied it into a CL_MEM_READ_ONLY buffer for the second stage? Each of them presumably has its own optimizations to get maximum performance.
The flag passed as the 2nd argument of clCreateBuffer only specifies how the kernel side can access the memory space.
Probably not. I expect the buffer copy to be far more costly than any optimization.
Also, I looked at the AMD APP OpenCL Programming Guide and I didn't find any indication about optimizations when using a READ_ONLY or WRITE_ONLY buffer.
My understanding is that the access flag is only used by the OpenCL runtime to decide when it needs to copy buffer data between the different memory spaces/areas.
I am implementing a solution using OpenCL and I want to do the following: say you have a large array of data that you want to copy to the GPU once, then have many kernels process batches of it and store the results in their specific output buffers.
The actual question here is: which way is faster? Enqueue each kernel with only the portion of the array it needs, or pass the whole array beforehand and let each kernel (in the same context) process its required batch, since they would share the same address space and could each access the array concurrently? Of course, the array is read-only, but it is not constant, as it changes every time I execute the kernel(s)... (so I could cache it using a global memory buffer).
Also, if the second way is actually faster, could you point me in the right direction on how it could be implemented, as I haven't found anything concrete yet (although I am still searching :)).
Cheers.
I normally use the second method. Sharing the memory is easy. Just pass the same buffer to each kernel. I do this in my real-time ray tracer. I render with one kernel and post-process (image process) with another.
Using the C++ bindings, it looks something like this:
cl_input_mem = cl::Buffer(context, CL_MEM_READ_WRITE, sizeof(cl_uchar4)*npixels, NULL, &err); // READ_WRITE: the render kernel writes it, the post-process kernel reads it
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want one kernel to operate on a different segment of the array/memory, you can pass an offset value as a kernel argument and add it to the global memory index inside each kernel.
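For example, a sketch of that offset idea (the kernel and argument names are hypothetical):

// Kernel: each enqueue processes one batch of the shared input, selected by an offset.
__kernel void process_batch(__global const float* data, uint offset, __global float* out) {
    size_t i = get_global_id(0);
    out[i] = data[offset + i] * 2.0f; // touch only this batch's segment
}

// Host, per batch:
kernel_batch.setArg(0, cl_shared_mem); // the whole array, uploaded once
kernel_batch.setArg(1, (cl_uint)batch_start); // where this batch begins
kernel_batch.setArg(2, cl_batch_out); // this batch's own output buffer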
I would use the first method if the array (actually the sum of each buffer, including output) does not fit in memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when I render on multiple devices. For example, I have one GTX 580 render the upper half of the screen and the other GTX 580 render the lower half (actually I do this dynamically, so one device may render 30% while the other renders 70%, but that's beside the point). I have each device render only its fraction of the output and then I assemble the output on the CPU. With PCIe 3.0, the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate, even for 1920x1080 images.
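A rough sketch of that split, assuming one command queue per device and a render kernel that derives its pixel coordinates from the global IDs:

// Device A renders the top half, device B the bottom half, via global work offsets.
queueA.enqueueNDRangeKernel(kernel_render, cl::NDRange(0, 0), cl::NDRange(width, height / 2));
queueB.enqueueNDRangeKernel(kernel_render, cl::NDRange(0, height / 2), cl::NDRange(width, height / 2));
// Read each half back to the host and assemble the full frame there.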