FFT and Fast Conv on OpenCL without copying data to host - opencl

I want to perform FFT, fast convolution, and cross-correlation on the GPU and pass the results to other OpenCL kernels without copying the results to host memory.
Can you recommend an OpenCL implementation of FFT, fast convolution, and cross-correlation that can be called as kernels, so the data never has to leave the GPU?

Many of the OpenCL vendors (e.g., Apple, AMD, NVIDIA) have FFT samples that include kernel source. You can use these to process buffers already on the device and leave results on the device.
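A sketch of the pattern, assuming hypothetical kernel names `fft_radix2` and `pointwise_mul` (any vendor FFT kernel compiled into your program is used the same way): the host allocates `cl_mem` buffers once and chains the kernels on them, reading nothing back until the very end.

```c
/* Sketch: chaining FFT -> pointwise multiply -> inverse FFT entirely on the
 * device. Kernel names (fft_radix2, pointwise_mul) are placeholders for
 * whatever the vendor FFT sample provides. Error checking omitted. */
cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

/* Upload the input signals once. */
clEnqueueWriteBuffer(queue, a, CL_FALSE, 0, bytes, host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, b, CL_FALSE, 0, bytes, host_b, 0, NULL, NULL);

/* Forward FFT of both signals -- results stay in a and b on the device. */
clSetKernelArg(fft_radix2, 0, sizeof(cl_mem), &a);
clEnqueueNDRangeKernel(queue, fft_radix2, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
clSetKernelArg(fft_radix2, 0, sizeof(cl_mem), &b);
clEnqueueNDRangeKernel(queue, fft_radix2, 1, NULL, &gsize, &lsize, 0, NULL, NULL);

/* Pointwise complex multiply (convolution theorem), then inverse FFT,
 * then any further kernels of your pipeline -- all on the same buffers. */
clSetKernelArg(pointwise_mul, 0, sizeof(cl_mem), &a);
clSetKernelArg(pointwise_mul, 1, sizeof(cl_mem), &b);
clEnqueueNDRangeKernel(queue, pointwise_mul, 1, NULL, &gsize, &lsize, 0, NULL, NULL);

/* Only the final result ever crosses the bus to the host. */
clEnqueueReadBuffer(queue, a, CL_TRUE, 0, bytes, host_result, 0, NULL, NULL);
```

Cross-correlation fits the same pipeline: conjugate one spectrum in the pointwise kernel instead of plain multiplication.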

Related

Is it possible for OpenCL to cache some data while running between kernels?

I'm working on graph-computation tasks where I repeatedly update vertex data on the host side and iterate the computation to get results. Throughout this process, the edge data never changes. In a loop of write data → run kernel → read data, is there a way in OpenCL to keep the unchanged data resident on the device to reduce communication costs? By the way, I am currently only able to use OpenCL version 1.2.
Question 1:
Is it possible for Opencl to cache some data while running between kernels
Yes, it is possible in the OpenCL programming model. See Buffer Objects, Image Objects, and Pipes in the official OpenCL documentation. Buffer objects reside in device memory and can be manipulated by the host through OpenCL API calls.
The following StackOverflow posts will further clarify caching in OpenCL:
OpenCL execution strategy for tree like dependency graph
OpenCL Buffer caching behaviour
Memory transfer between host and device in OpenCL?
You should also look into caching techniques such as double buffering in OpenCL.
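For the graph scenario above, the OpenCL 1.2 pattern is simply to create the edge buffer once and never touch it from the host again; only the vertex buffer is rewritten each iteration. A hedged sketch (buffer and kernel names are assumptions):

```c
/* Sketch (OpenCL 1.2): edge data is uploaded once at buffer-creation time via
 * CL_MEM_COPY_HOST_PTR and stays resident on the device; only the mutable
 * vertex data crosses the bus each iteration. Error checking omitted. */
cl_mem edges = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              edge_bytes, host_edges, &err);
cl_mem verts = clCreateBuffer(ctx, CL_MEM_READ_WRITE, vert_bytes, NULL, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &edges);  /* set once, reused */
clSetKernelArg(kernel, 1, sizeof(cl_mem), &verts);

for (int it = 0; it < iterations; ++it) {
    clEnqueueWriteBuffer(queue, verts, CL_FALSE, 0, vert_bytes, host_verts,
                         0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, verts, CL_TRUE, 0, vert_bytes, host_verts,
                        0, NULL, NULL);
    /* ... host-side update of host_verts ... */
}
clReleaseMemObject(edges);
clReleaseMemObject(verts);
```

If the host-side vertex update can itself be expressed as a kernel, the per-iteration write/read of `verts` disappears as well and only the final result needs to be read back.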
Question 2:
I want to know if there is a way that I can use OpenCL to repeatedly write data, run the kernel, and read the data, some unchanged data can be saved on the device side to reduce communication costs
Yes, it is possible, either through batch processing or data tiling. Because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately. There are many examples of batching and data tiling; one is:
OpenCL Kernel implementing im2col with batch
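The batching idea can be sketched host-side: instead of one `clEnqueueWriteBuffer` per small array, pack the small arrays into a contiguous staging buffer and issue a single transfer. The packing helper below is pure C and illustrative; the OpenCL call it replaces many copies of is shown in a comment.

```c
#include <string.h>
#include <stddef.h>

/* Pack `count` small host arrays (chunks[i], sizes[i] bytes each) into one
 * contiguous staging buffer. Returns the total number of bytes packed.
 * A single clEnqueueWriteBuffer of the staging buffer then replaces `count`
 * separate small transfers, paying the per-transfer overhead only once. */
size_t pack_chunks(char *staging, const void **chunks,
                   const size_t *sizes, size_t count) {
    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        memcpy(staging + offset, chunks[i], sizes[i]);
        offset += sizes[i];
    }
    /* clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, offset, staging,
     *                      0, NULL, NULL);   -- one transfer, not `count` */
    return offset;
}
```

The kernel then indexes into the packed buffer using the same offsets, which is exactly what tiling schemes such as the im2col example do.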
Miscellaneous:
If possible, use a newer version of OpenCL; version 1.2 is quite old.
Also note that, since you have not mentioned your target hardware, the programming model can differ between accelerators such as FPGAs and GPUs.

OpenCL AMD S10000 dual GPU execution

I have the S10000 AMD GPU, which has 2 GPUs inside. When I run clinfo the output looks like these are treated as separate GPUs. To run my kernel across both of these GPUs do I need to create 2 separate openCL queues and partition my work-groups? Do these two GPUs share memory?
Yes, you will need to create separate command queues for each GPU and manually partition the workload between them. The GPUs do not share memory, so you will also have to make sure data is transferred to both GPUs as necessary. If you create a single context containing both GPUs, the implementation will automatically deal with moving buffers between the GPUs as and when needed. However, in my experience it is often better to do this explicitly, as sometimes the implementation will generate false dependencies between kernels that both use the same buffer and will serialise kernel execution.
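The manual partitioning can be sketched as follows (an illustrative helper, not a fixed API): split the global range between the two devices, then enqueue the same kernel on each device's queue with the matching global work offset.

```c
#include <stddef.h>

/* Split `global` work-items between `ndev` devices: device d gets the range
 * [offsets[d], offsets[d] + counts[d]). Any remainder goes to the first
 * devices, so the split stays valid when global is not divisible by ndev. */
void split_range(size_t global, size_t ndev, size_t *offsets, size_t *counts) {
    size_t base = global / ndev, rem = global % ndev, off = 0;
    for (size_t d = 0; d < ndev; ++d) {
        counts[d] = base + (d < rem ? 1 : 0);
        offsets[d] = off;
        off += counts[d];
    }
}

/* Usage sketch with two queues (one per GPU, e.g. in the same context):
 *   size_t off[2], cnt[2];
 *   split_range(global_size, 2, off, cnt);
 *   clEnqueueNDRangeKernel(queue0, kernel, 1, &off[0], &cnt[0], NULL, 0, NULL, NULL);
 *   clEnqueueNDRangeKernel(queue1, kernel, 1, &off[1], &cnt[1], NULL, 0, NULL, NULL);
 */
```

Passing the offset through `clEnqueueNDRangeKernel`'s `global_work_offset` parameter means the kernel's `get_global_id(0)` already yields the right index for each half, so the same kernel source serves both devices.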

OpenCL. How to identify which compute device is free and submit jobs accordingly?

I am running my OpenCL C codes on our institution's GPU cluster, which has 8 nodes, each with an Intel Xeon 8-core processor and 3 NVIDIA Tesla M2070 GPUs (24 GPUs in total). I need a way, from my host code, to identify which of the GPUs are already occupied and which are free, and to submit my jobs to the available GPUs. The closest answers that I could find were:
How to programmatically discover specific GPU on platform with multiple GPUs (OpenCL 1.1)?
How to match OpenCL devices with a specific GPU given PCI vendor, device and bus IDs in a multi-GPU system?
Can anyone help me figure out how to choose a node and a GPU that is free for computation? I am writing in OpenCL C.
Unfortunately, there is no standard way to do such a thing.
If you want to squeeze the full power of the GPUs for computation and your problem is not a memory hog, I can suggest using two contexts per device: while kernels in the first context finish their computation, kernels in the second are still working, giving you time to fill buffers with data and start the next task in the first context, and vice versa. In my case (AMD GPU, OpenCL 1.2) it saves from 0 to 20 % of computation time. Three contexts are sometimes slower and sometimes faster, so I do not recommend that as a standard technique, but you can try it. Four or more contexts are useless, in my experience.
Have a command queue for each device, then use OpenCL Events with each kernel submission, and check the state of them before submitting a new kernel for execution. Whichever command queue has the least unfinished kernels is the one you should enqueue to.
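The event-counting idea can be sketched with an illustrative helper (names are assumptions, not a fixed API); in the real host code, each queue's counter would be decremented once `clGetEventInfo` reports a kernel's event as complete.

```c
#include <stddef.h>

/* Given the number of unfinished kernels per command queue, return the index
 * of the least-loaded queue -- the one to enqueue the next kernel on.
 * In a real host program, pending[q] would be maintained by polling each
 * submitted kernel's cl_event with
 *   clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
 *                  sizeof(cl_int), &status, NULL);
 * and decrementing the counter once status reaches CL_COMPLETE. */
size_t least_loaded(const int *pending, size_t nqueues) {
    size_t best = 0;
    for (size_t q = 1; q < nqueues; ++q)
        if (pending[q] < pending[best])
            best = q;
    return best;
}
```

With one command queue per GPU, this gives a simple greedy scheduler: poll events, update the counters, and submit each new job to `least_loaded(pending, nqueues)`.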

Can a GPU be the host of an OpenCL program?

Little disclaimer: This is more the kind of theoretical / academic question than an actual problem I've got.
The usual way of setting up a parallel program in OpenCL is to write a C/C++ program, which sets up the devices (GPU and/or other CPUs), kernel and data buffers for executing the kernel on the device.
This program gets launched from the host, which used to be a CPU.
Would it be possible to write a OpenCL program where the host is a GPU and the devices other GPUs and/or CPUs?
What would be the prerequisites for such a scenario?
Does one need a special GPU, or would it be possible to use any OpenCL-capable GPU?
Are you looking for a complete host or just a kernel launcher?
The upcoming CUDA release (v5.0) introduces a feature to launch a kernel from inside another kernel, so a device can launch kernels on itself. This feature may be supported by OpenCL too in the near future.

How to implement a program in OpenCL using MPI on a single CPU machine

I'm new to GPU programming. I have a laptop without a dedicated graphics card, and I want to develop a matrix multiplication program with Intel OpenCL and then implement the application using MPI.
Any guidelines and helpful links would be appreciated.
I'm confused about the MPI part: do we have to write the MPI code ourselves, or do we use an existing MPI implementation to run the application?
This is the project proposal of what I want to do:
GPU cluster computation (C++, OpenCL and MPI)
Study MPI for distributing the problem
Implement OpenCL apps on a single machine (matrix multiplication/ 2D image processing)
Implement apps with MPI (e.g. large 2D image processing)
So the thing to understand is that MPI and OpenCL are, for your purposes, completely orthogonal. MPI is for communicating between your nodes; OpenCL is for accelerating the local computation on a single node by using the GPU (or multiple CPU cores). For any of these problems, you'd start by writing a serial C++ version of the code. The next step would be to (in either order) work on an OpenCL implementation for a single node, and work on an MPI version which decomposes the problem (you don't want to use master-slave for any of the problems listed above) onto multiple processes, with each process doing its local part of the computation that contributes to the global solution. Once both of those parts are done, you'd merge the two and have a distributed-memory (the MPI part) GPU (the OpenCL part) version of a code that solves the problem.
It won't quite be that easy, of course, and combining the two will take a fair bit of work, but that's the basic approach to keep in mind. Start with one problem, get it working on a single processor in C++, then try it with one or the other. Don't try to do everything at once or you'll never get anywhere.
For problems like matrix multiplication, there are many many examples on the internet of both GPU and MPI implementations to learn from.
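For the matrix-multiplication case, the MPI decomposition boils down to deciding which rows of the result each rank owns. A minimal, testable sketch in pure C (the MPI and OpenCL calls are indicated in comments; the helper name is illustrative):

```c
#include <stddef.h>

/* Block-row decomposition of C = A * B across `nranks` MPI processes:
 * rank r computes rows [row0, row0 + nrows) of C. The first `rem` ranks get
 * one extra row when nrows_total is not divisible by nranks. */
void my_rows(size_t nrows_total, int nranks, int rank,
             size_t *row0, size_t *nrows) {
    size_t base = nrows_total / (size_t)nranks;
    size_t rem  = nrows_total % (size_t)nranks;
    *nrows = base + ((size_t)rank < rem ? 1 : 0);
    *row0  = (size_t)rank * base + ((size_t)rank < rem ? (size_t)rank : rem);
}

/* Per-rank flow (sketch):
 *   MPI_Init(&argc, &argv);
 *   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *   MPI_Comm_size(MPI_COMM_WORLD, &nranks);
 *   my_rows(N, nranks, rank, &row0, &nrows);
 *   -- distribute A's rows (MPI_Scatterv), broadcast B (MPI_Bcast),
 *   -- multiply the local row block with an OpenCL kernel,
 *   -- gather the finished C rows (MPI_Gatherv); MPI_Finalize();
 */
```

This is the "merge" step described above: the MPI layer only decides ownership and moves row blocks, while the OpenCL layer multiplies whatever block it is handed, so each part can be developed and tested on its own first.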
Simplified:
MPI is a library for communication between processes, but also a platform for running applications on a cluster. You write a program that uses the MPI library, and that program is then launched through MPI, which forks the application N times across the cluster and lets the application instances communicate via messages.
Which tasks the instances perform, whether they are identical or specialized workers, and the topology are all up to you.
I can think of three ways to combine OpenCL and MPI:
MPI starts (k+1) instances: one master and k slaves. The master splits the data into chunks, and the slaves process the chunks on the GPUs using OpenCL. All slaves are identical.
MPI starts (k+1) instances: one master and k slaves. Each slave computes a specialized problem (slave 1 matrix multiplication, slave 2 block compression, etc.), and the master directs the data through a workflow of tasks.
MPI starts (k+1) instances: one master and k slaves. Same as case 1, but the master also sends the slaves the OpenCL program used to process the data.
