How would I handle extra-large images in OpenCL?

I've been working on a PyOpenCL program that will take in an OpenCL kernel (representing an image filter) and an image and apply said filter to generate an output image. The issue is that I need to make this program run on an image of any size.
I've written a similar program before with C# and OpenCL using the Cloo (http://sourceforge.net/projects/cloo/) framework, but I wanted to make something more portable (since the Cloo framework fails to run properly on Linux).
Now, in my C# implementation, I simply split the image up into chunks and ran the kernel on each chunk. I did this by handling the images as plain byte arrays in my kernel. However, the issue I'm having now is that I'm attempting to use the image2d_t datatype in my PyOpenCL implementation, and I'm not sure how to go about breaking the image into chunks and passing them to the kernel.
Does the image2d_t type add padding to the returned images (that I would need to post-process), or does it offer some automatic mechanism that would handle this for me?
Any resources that would point me in the right direction are greatly appreciated!
Edit: I figured I should mention that the reason why I want to do this is because I run into memory allocation exceptions with my current build (because the images are too large).

I managed to work around it by using the Python Imaging Library's crop and paste functionality to split the image into subimages, run the kernel on each one, and paste the results back into the output image.
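A rough sketch of that crop/process/paste workaround, assuming a prebuilt PyOpenCL program with a kernel named filter_image that reads one image2d_t and writes a same-sized output image (the kernel name, RGBA format, and tile size are illustrative, not from the original post):

import numpy as np
import pyopencl as cl
from PIL import Image

def filter_in_tiles(ctx, queue, program, src, tile=1024):
    out = Image.new("RGBA", src.size)
    fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
    for top in range(0, src.height, tile):
        for left in range(0, src.width, tile):
            box = (left, top, min(left + tile, src.width), min(top + tile, src.height))
            chunk = np.asarray(src.crop(box).convert("RGBA"))
            h, w = chunk.shape[:2]
            d_src = cl.image_from_array(ctx, chunk, 4)  # upload tile as an image2d_t
            d_dst = cl.Image(ctx, cl.mem_flags.WRITE_ONLY, fmt, shape=(w, h))
            program.filter_image(queue, (w, h), None, d_src, d_dst)
            result = np.empty_like(chunk)
            cl.enqueue_copy(queue, result, d_dst, origin=(0, 0), region=(w, h))
            out.paste(Image.fromarray(result), box)  # stitch the tile into the output
    return out

Each tile stays small enough to allocate on the device, which sidesteps the out-of-memory errors on full-size images.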

Related

Is there a way to simplify OpenCL kernel usage?

To use an OpenCL kernel, the following steps are needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (once per argument)
call clEnqueueNDRangeKernel
This needs to be done for each kernel. Is there a way to do this with less repeated code per kernel?
There is no way to shorten the process; you need to go step by step, as you listed.
But it is important to know why these steps are needed, to understand how flexible the chain is.
clCreateProgramWithSource: Lets you combine strings from different sources to generate the program. Some strings might be static, but some might be downloaded from a server or loaded from disk. This allows the CL code to be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so you need to call this multiple times. Each device will produce a different binary.
clCreateKernel: Creates a kernel. But a kernel is just an entry point into a binary, so it is possible to create multiple kernels from one program (for different functions). The same kernel might also be created multiple times, since a kernel object holds its arguments; this is useful for having ready-to-launch instances with their parameters already set.
clSetKernelArg: Changes one parameter in a kernel instance. (It is stored there, so the instance can be reused multiple times in the future.)
clEnqueueNDRangeKernel: Launches the kernel, configuring the size of the launch and the chain of dependencies with other operations.
So even if there were a single getKernelFromString() call, its functionality would be very limited and not very flexible.
You can have a look at wrapper libraries:
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
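For comparison, this is roughly what the whole chain looks like through PyOpenCL, one such wrapper (a minimal sketch; the kernel is just a placeholder):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# clCreateProgramWithSource + clBuildProgram
prg = cl.Program(ctx, """
    __kernel void double_it(__global float *a) {
        int i = get_global_id(0);
        a[i] *= 2.0f;
    }
""").build()

a = np.arange(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)

# clCreateKernel + clSetKernelArg + clEnqueueNDRangeKernel in one call
prg.double_it(queue, a.shape, None, buf)
cl.enqueue_copy(queue, a, buf)

The same five host-API steps still happen underneath; the wrapper just hides the repetition.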
I suggest you look into SYCL. The build steps are performed offline, saving execution time by skipping clCreateProgramWithSource. The argument setting is done automatically by the runtime, which extracts the information from the user's lambda.
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.

Render YUV in JavaFX

I need to render a yuyv422 stream in JavaFX with minimum latency. If I convert it to RGB, I can use an ImageView with a WritableImage and a PixelFormat instance, and it works, but the RGB conversion consumes a lot of CPU, especially at high resolutions. I saw this exact feature request:
https://bugs.openjdk.java.net/browse/JDK-8091933
but it seems it will not be implemented in Java 9. And even if it were, I wonder whether it would introduce latency or demand too much CPU. Is there another way using JavaFX?
In General:
Image processing is always expensive, which is why vectorization or hardware acceleration is used for these tasks. Simply looping over an image with a single thread is already really slow, especially in Java. On top of that, people tend to use Color objects for color modifications, which is tremendously slow.
Pure Java:
If you want to keep your code in pure Java, you should check which internal format is used for the WritableImage by calling:
myImage.getPixelWriter().getPixelFormat().getType()
If the internal format isn't RGB, adapt your color conversion to the given format to avoid a double conversion.
Additionally, make sure that your code is optimized as much as possible:
- Don't use any objects except arrays
- Minimize the use of local variables
You can also try to multithread the conversion process via parallel loops.
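For reference, the conversion itself is simple per-pixel arithmetic. Here is a vectorized sketch of yuyv422-to-RGB in Python/NumPy, purely to illustrate the data layout and the math (BT.601 coefficients assumed); the same structure ports to a Java int[] loop:

import numpy as np

def yuyv422_to_rgb(frame, width, height):
    # Each 4-byte group encodes two pixels: Y0 U Y1 V
    yuyv = np.frombuffer(frame, dtype=np.uint8).reshape(height, width // 2, 4).astype(np.float32)
    y = yuyv[:, :, (0, 2)].reshape(height, width)    # per-pixel luma
    u = np.repeat(yuyv[:, :, 1], 2, axis=1) - 128.0  # chroma shared by each pixel pair
    v = np.repeat(yuyv[:, :, 3], 2, axis=1) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)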
JNI:
Moving away from Java opens up a lot of possibilities. There are several platform-independent libraries for converting YUV to RGB and back:
OpenCV:
Easy to use, and it already comes with a Java API:
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

byte[] myYuvImage = null; //your image here
byte[] myRgbImage = new byte[width * height * 3]; //the output image
Mat yuvMat = new Mat(height, width, CvType.CV_8UC2); //YUV422 should be 2 channel
Mat rgbMat = new Mat(height, width, CvType.CV_8UC3); //3-channel RGB output
yuvMat.put(0, 0, myYuvImage); //copy the raw bytes into the input Mat
Imgproc.cvtColor(yuvMat, rgbMat, Imgproc.COLOR_YUV2RGB_Y422); //convert in native code
rgbMat.get(0, 0, myRgbImage); //copy the converted pixels back out
Intel IPP:
Only available via JNI. You would use ippiRGBToYUV422_8u_C3C2R; see RGBToYUV422 for more information.
SwScale as part of FFmpeg:
Only available via JNI. See this answer and adapt the example.
My personal experience is that IPP offers by far the best performance, even on AMD machines. However, while the license it comes with may be free, it prohibits decompiling, which might not be compatible with LGPL libraries.

What should replace "memcpy" inside OpenCL kernels?

The OpenCL language, which extends C99, does not provide the memcpy function. What should be used instead?
As far as I know, nothing like that is defined in OpenCL. OpenCL does not provide a concept like dynamic memory, and therefore such functionality is not needed.
You could just run over your array with a for loop and copy the data element by element. Note that the target array is of fixed size, because the array length has to be specified at compile time.
More generally, OpenCL (and OpenGL, as a kind of ancestor) was designed in a rather static way: the data needs to be provided to the GPU, and the result size needs to be defined up front. The GPU computes from the input into the predefined output location. It is not meant to spawn further work on the GPU, and it is also not meant to allocate memory dynamically, so as not to disturb the host, which handles that.
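A minimal sketch of such an element-by-element copy, shown here as a complete PyOpenCL host program so it can be run as-is (the kernel body is plain OpenCL C; kernel and buffer names are illustrative):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
    __kernel void copy_buf(__global const float *src,
                           __global float *dst,
                           const int n) {
        // each work-item copies one element; a single work-item
        // could equally loop over a fixed-size range
        int i = get_global_id(0);
        if (i < n)
            dst[i] = src[i];
    }
""").build()

src = np.arange(1024, dtype=np.float32)
dst = np.empty_like(src)
mf = cl.mem_flags
d_src = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=src)
d_dst = cl.Buffer(ctx, mf.WRITE_ONLY, dst.nbytes)

prg.copy_buf(queue, src.shape, None, d_src, d_dst, np.int32(len(src)))
cl.enqueue_copy(queue, dst, d_dst)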

Implement IP camera

We have a device that has an analog camera. We have a card that samples it and digitizes it. This is all done in DirectX. At this point in time, replacing hardware is not an option, but we need to code such that we can see this video feed in real time regardless of any hardware or underlying operating system changes that occur in the future.
Along this line, we've chosen Qt to implement a GUI to view this camera feed. However, if we move to Linux or another embedded platform in the future and change other hardware (including the physical device where the camera/video sampler lives), we will need to change the camera display software as well, and that's going to be a pain because we need to integrate it into our GUI.
What I proposed was migrating to a more abstract model where data is sent over a socket to the GUI, and the video is displayed live after being parsed from the socket stream.
First, is this a good idea or a bad idea?
Secondly, how would you implement such a thing? How do the video samplers usually deliver usable output? How can I push this output over a socket? Once I am on the receiving end parsing the output, how do I know what to do with it (that is, how do I get it to render)? The only thing I can think of would be to write each sample to a file and then display the contents of the file every time a new sample arrives. That seems like an inefficient solution, if it would work at all.
How do you recommend I handle this? Are there any cross-platform libraries available for such a thing?
Thank you.
Edit: I am willing to accept suggestions for something different from what is listed above.
Have you looked at QVision? It is a Qt-based framework for managing video and video processing. You don't need the processing part, but I think it will do what you want.
Anything that duplicates the video stream is going to cost you in performance, especially in an embedded space. In most situations for video, I think you're better off trying to use local hardware acceleration to blast the video directly to the screen. With some proper encapsulation, you should be able to use Qt for the GUI surrounding the video, and have a class that is platform specific that you use to control the actual video drawing to the screen (where to draw, and how big, etc.).
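If you do go the socket route despite that cost, the receiving end mainly needs a way to tell where one frame ends and the next begins. A length-prefixed protocol is the usual minimal answer; here is a rough sketch in Python, purely for illustration (all names are hypothetical):

import socket
import struct

def send_frame(sock, frame):
    # prefix each frame with its 4-byte big-endian length
    sock.sendall(struct.pack(">I", len(frame)) + frame)

def recv_exact(sock, n):
    # read exactly n bytes, or fail if the peer closes mid-frame
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def recv_frame(sock):
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

The receiver then hands each complete frame to whatever decodes and draws it.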
Edit:
You may also want to look at the Phonon library. I haven't looked at it much, but it appears to support showing video that may be acquired from a range of different sources.

Actionscript PNGEncoder performance and UI blocking

I'm trying to use PNGEncoder to encode a bitmapData object into a PNG ByteArray so I can send the data to the server. Everything would be peachy except that the bitmapData is 4000x4000 px, and when I run the PNGEncoder.encode function on it, the whole app stops (the UI is blocked) for 5-8 seconds while it runs. Does anybody have any suggestions on how to make it not block so badly? I read about chunking up the process (since you can't multithread in AS3) but can't find any sample code for chunking it up.
Thanks,
Sam
In addition to Arthur's comment, you could also write it in C/C++ for Alchemy, since Alchemy supports green threads. Like Pixel Bender, Alchemy also requires Flash 10.
There are mainly two ways to do this.
a) Use Pixel Bender:
You can offload the work to Pixel Bender (a shader-like language usable from AS3). This has the advantage of using the GPU in some cases, and it is also asynchronous and non-blocking (it runs on another thread). But it does require Player 10+. I haven't seen a Pixel Bender PNG encoder, and to be honest, it may not be possible (I am not familiar enough with PNG encoding to tell), but it might be an option. This is, performance-wise, the best you can get. More info here
b) Use chunking. Basically, you rewrite the encoder to encode blocks (lines, columns, or a smaller area), and hook that to an enter-frame event; each frame you'd call next on your encoder, until there is no more encoding to do. Zeh has a neat LZW chunked encoder with source code that might give you insights into the details.
Cheers
Arthur
Another shameless plug!
You can use my recently completed PNGEncoder2 library (also requires Flash 10+), which handily supports gigantic images. It does proper asynchronous encoding, with no single compression step at the end. Additionally, it's really fast ;-)
Grab it from GitHub (README), and check out the benchmark comparing it with other encoders on my blog post.
It's highly tuned for speed, and uses the Alchemy opcodes and domain memory to speed it up (thanks to Haxe), so it should be comparable to anything you compile using Alchemy.
You could encode multiple PNG files separately and send them to the server. Once on the server you can reconstruct the larger image.
It's for JPEG encoding, but it should be useful - have a look at this post: http://segfaultlabs.com/blog/post/asynchronous-jpeg-encoding/
As Arthur Debert said, you can use chunking. I'd suggest that instead of encoding once per frame, you try a setTimeout( chunkingFunction, 0 ); approach. A timeout with a 0 ms delay will fire as soon as possible, allowing the chunking to proceed quickly but without crushing the UI.
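The core of the chunking idea, sketched in Python for illustration only (PNG's IDAT data is zlib-compressed, so incremental compression maps onto it; in AS3 the yield points would be ENTER_FRAME or setTimeout callbacks):

import zlib

def chunked_compress(data, chunk_size=1 << 16):
    # compress incrementally, yielding between chunks so the caller's
    # event loop can keep the UI responsive
    comp = zlib.compressobj()
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        out.extend(comp.compress(data[start:start + chunk_size]))
        yield None            # hand control back to the event loop
    out.extend(comp.flush())
    yield bytes(out)          # final compressed result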
