OpenCL clEnqueueCopyImageToBuffer with stride - opencl

I have an OpenCL buffer containing an 2D image.
This image have stride bigger than its width.
I need to make OpenCL image from this buffer.
The problem is that function clEnqueueCopyImageToBuffer does not contain stride as an input parameter.
Is it possible to make OpenCL image from OpenCL buffer(with stride bigger than width), with only one copying or faster?
The one way to solve this problem is to write own kernel, but maybe there are much more neat solutions?

Unfortunately, there is no method in the OpenCL specification which allows you to directly create an image from a buffer when the buffer data has a stride not equal to the image width. The most efficient solution would probably be to write your own kernel to do this.
The simplest solution that doesn't involve writing your own kernel would be to copy one line at a time with clEnqueueCopyBufferToImage. If your image is big enough, it might be that the performance of this technique would be reasonably comparable to the hand-written kernel, but you would have to try it out to see.
I didn't include the clEnqueueCopyBufferRect approach in my original answer because my first instinct was that the extra copy would kill performance. However, the comments above got me thinking about it further, and I was interested enough to implement all three approaches to see what the performance was actually like.
As I suspected, the fastest approach was to implement a kernel to do this directly. However, copying the data over line-by-line was significantly slower than I had anticipated. Copying the buffer into an intermediate buffer with clEnqueueCopyBufferRect is actually a pretty good compromise of performance and simplicity, although is still a couple of times slower than the kernel implementation.
The source code for this little experiment can be found here. I was copying a 1020x1020 image with a stride of 1024, and the timings are averaged over 8 runs.

Related

The best way to write code in Julia working on GPU's via ArrayFire

In Julia, I saw principally that to acelerate and optimizing codes when I work on a matrix, es better e.g.
-work by columns instead of by rows, this is for the way which Julia store the matrix.
-On loops could use #inbounds and #simd macros
-any function, macros or methods you could recommend it's welcome :D
But it seems that the above examples do not work when I use the ArrayFire package with a matrix stored on the GPU, similar codes in the CPU and GPU do not seem to favor the GPU that runs much slower in some cases, I think it shouldn't be like that, I think the problem is in the way of writing the code. Any help will be welcome
GPU computing should be done on optimized GPU kernels as much as possible. Indexing a GPU array is a small kernel that copies one value back to the CPU. This is really really bad for performance, so you should almost never index a GPUArray unless you have to (this is true for any implementation! It's just a hardware problem!)
Thus, instead of writing looping code for GPUs, you should write broadcasting ("vectorized") code. With the v0.6 broadcast changes, broadcasted operations are nearly as efficient as loops anyways (unless you hit this bug), so there's no reason to avoid them in generic code. However, there are cases where broadcasting is faster than looping, and GPUs is one big case.
Let me explain a little bit why. When you do the code:
#. A = B*C + D*E
it lowers to
A .= B.*C .+ D.*E
which then lowers to:
broadcast!((b,c,d,e)->b*c + d*e,A,B,C,D,E)
Notice that in there you have a fused anonymous function for the entire broadcast. For GPUArrays, this is then overwritten so that way a single GPU kernel is automatically created that performs this fused operation element-wise. Thus only one GPU kernel is required to do this whole operation! Notice that this is even more efficient than the R/Python/MATLAB way of doing GPU computing since those vectorized forms have temporaries and would require 4 kernels here, but this has no temporary arrays and is a single kernel, which is pretty much exactly how you'd write it if you were writing the kernel yourself. So if you exploit broadcast, then your GPU calculations will be fast.

Is there a way to treat a cl_image as a cl_mem?

I've been working on speeding up some image processing code written in OpenCL, and I've found that for my kernel, buffers (cl_mem) are significantly faster than images (cl_image).
So I want to process my images as cl_mem, but unfortunately I'm stuck with an API that only spits out cl_images. I'm using a OS X API clCreateImageFromIOSurface2DAPPLE that creates an image for me.
Is there any way to take a cl_image and treat it as a cl_mem? When I've tried to do that I get an error when running my kernel.
I've tried copying the image to a buffer using clEnqueueCopyImageToBuffer but that's also too slow. Any ideas? Thanks in advance
PS: I believe my kernel operating on a buffer is much faster because I can do a vload4 and load 4 pixels at a time, vs read_imagei which does just one.
You cannot treat an OpenCL image as memory. The memory layout of an image is private to the implementation and should be considered unknown.
If your code creates the image, however, you could create a buffer and then use cl_khr_image2d_from_buffer. Otherwise write a kernel that copies the data from image to buffer and see if it is faster than clEnqueueCopyImageToBuffer (unlikely).

Reduce 1024 images to one

I read the paper about reducing a 1d array to one value in openCL ( http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/ ) and I understood the concept of associative operators. Extending this concept to ONE 2d array should also be possible.
But my problem is somewhat different: I have ~1000 images of 256x256 pixels with 16bit each and I would like to sum all these images to finally have the average image of them all. The usual GPU should have enough memory (~130Mb) to perform this task, but I don't really see how to implement the kernel.
Just as the 1D problem extends to 2D, it can also extend to 3D (which is what you have: 1000x256x256).
Exactly the same principles would apply:
1. Try to do as much work in parallel as you can without contention with other work groups.
2. Do the reduction in stages so each can be parallel.
Your likely going to be bandwidth limited, churning through 131 MB of memory, but that's not really a problem. Just write the kernels to do coalesced reads for maximum performance.

OpenCl and power iteration method (eigendecomposition)

I'm new in OpenCL and I'm trying to implement power iteration method (described over here)
matrix sizes over 100000x100000!
Actually I have no idea how to implement this.
It's because workgroup have restriction CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one workgoup with 1000000 work-items)
But on each step of iterating I need to synchronize and normalize vector.
1) So is it possible to make all calculations inside one kernel? (I think that answer is no if matrix sizes is more than CL_DEVICE_MAX_WORK_GROUP_SIZE)
2) Can I make "while" loop in the host code? and is it still profitable to use GPU in this case?
something like:
while (condition)
{
kernel calling
synchronization
}
2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining to kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (Matrix Multiplation and Vector Normalization) I'd personally start with two different kernels that are invoked from the host in a while-loop.
1: Since a 100000x100000 matrix with float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on Matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ) because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: Good luck ;-) and ... the CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually infinitely large. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256, and you want to handle 10000000000 elements, then you create 10000000000/256 work groups and let OpenCL care about how they are actually dispatched and executed. For matrix operations, the CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): The size of the work groups thus implicitly defines how large your chunks of local memory may be.

returning one result from a pyopencl kernel

My pyopencl kernel program is started with global size of (512,512), I assume it will run 512x512=262,144 times. I want to find the minimum value of a function in my 512x512 image but I don't want to return 262,144 floats to my CPU to calculate the min. I want to run another kernel (possibly waiting in the queue ) to find the min value of all 262,144 pixels then just send that one float to the CPU. I think this would be faster. Should my waiting kernel's global size be (1,1), ? I hope the large 262,144 Buffer of floats that I created using mf.COPY_HOST_PTR will not cross the GPU/CPU bus before I call the next kernel.
Thanks
Tim
Andreas is right: reduction is the solution. Here is a nice article from AMD explaining how to implement simple reduction. It discusses different approaches and the gain in terms of performance they bring. The example in the article is about summing all the elements and not to find the minimum, but it's fairly trivial to modify the given codes.
BTW, maybe I don't understand well you first sentence, but a kernel launched with a global size of (512, 512) will not run 262,144 times but only one time with 262,144 threads scheduled.
Use a reduction kernel to find the minimum.

Resources