OpenCL kernel queueing delays

I have a gigantic pile of data, 100 GB, but only 1 GB of video memory. I need to enqueue my kernel many times on chunks of the maximum work-group size. That's going to be ~10000 kernel enqueues and 100 memory transfers. How badly will this affect my run time? Also, is there a faster way of processing this much data? Would I be better off just running on my CPU with 8 threads, since then there are no data transfers or kernel launch delays? I'm asking before I code the thing because I want to make sure I have the right approach.

It depends on the nature of the work. GPUs are SIMD machines. If you are typically doing the same thing for each item (e.g. branches normally go the same way for every work item), then that bodes well for a GPU. That said, there are OpenCL implementations for an 8-thread CPU as well. Also, on integrated GPUs like Intel's (AMD's too?), you should consider the CL_MEM_USE_HOST_PTR flag on the memory buffer. You can use it to get zero-copy access, which avoids the transfer overhead.

Enqueueing the same kernel multiple times doesn't impose a performance hit per enqueue compared to a single kernel run. What's more, it can even become a little faster due to caching.
Also, you can run your code on the CPU and GPU simultaneously, as both are OpenCL-compatible devices.
Your device can use memory objects allocated from the host's RAM (the CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR flags of clCreateBuffer()). In any case, memory transfers may not be the bottleneck.
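
To put rough numbers on the question before writing any OpenCL, a back-of-the-envelope cost model can help. The sketch below is only that: the 5 GB/s transfer rate and the 20 microsecond per-enqueue overhead are made-up placeholders, not measurements, and `transfer_plan` is a hypothetical helper name. Plug in figures measured on your own hardware.

```python
import math

GB = 1024 ** 3

def transfer_plan(total_bytes, device_bytes, launches_per_chunk,
                  bus_bytes_per_s, launch_overhead_s):
    """Rough cost model for data that does not fit in device memory."""
    chunks = math.ceil(total_bytes / device_bytes)   # one transfer per chunk
    launches = chunks * launches_per_chunk           # total kernel enqueues
    return {
        "chunks": chunks,
        "launches": launches,
        "transfer_s": total_bytes / bus_bytes_per_s,      # time spent copying
        "launch_overhead_s": launches * launch_overhead_s,
    }

# Assumed placeholder figures: 5 GB/s over the bus, 20 us per enqueue.
plan = transfer_plan(100 * GB, 1 * GB, launches_per_chunk=100,
                     bus_bytes_per_s=5 * GB, launch_overhead_s=20e-6)
print(plan)
```

Under these assumed figures the copies take far longer than the enqueue overhead, so the number of kernel launches matters much less than whether the kernel does enough work per byte to justify moving the data at all.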


OpenCL shared memory optimisation

I am solving a 2D Laplace equation using OpenCL.
The version that accesses global memory directly runs faster than the one using shared memory.
The algorithm used for shared memory is the same as the one in the OpenCL Game of Life code.
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory version really runs faster than your local-memory version (assuming both are equally well optimized for the memory space they use), maybe this paper could answer your question.
Here's a summary of what it says:
Using local memory in a kernel adds another constraint on the number of concurrent workgroups that can run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVIDIA parlance; each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall-clock time to complete, but your GPU will be completely busy because it is running more of them concurrently.
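
The constraint described above can be illustrated with a toy occupancy model. This is a sketch under stated assumptions: the 32 KiB of local memory per compute unit and the hardware cap of 8 resident workgroups are invented example figures, `resident_workgroups` is a hypothetical helper, and real devices have further limits (registers, wavefront slots) that are ignored here.

```python
def resident_workgroups(local_mem_per_cu, local_mem_per_group, hw_max_groups):
    """How many workgroups fit on one compute unit at the same time."""
    if local_mem_per_group == 0:
        # No local memory used: only the hardware cap applies.
        return hw_max_groups
    # Otherwise the local memory budget may be the tighter limit.
    return min(hw_max_groups, local_mem_per_cu // local_mem_per_group)

# Assumed example device: 32 KiB local memory per compute unit, cap of 8 groups.
global_only = resident_workgroups(32 * 1024, 0, 8)          # kernel without local memory
tiled = resident_workgroups(32 * 1024, 16 * 1024, 8)        # kernel with a 16 KiB tile
print(global_only, tiled)
```

With these numbers the tiled kernel keeps only a quarter as many workgroups resident, which leaves the scheduler with fewer wavefronts to switch to while one is waiting on memory.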
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that global accesses in your kernel are being coalesced which yields better performance.
Using shared memory (memory shared with the CPU) isn't always going to be faster. On a modern graphics card it would only be faster in the situation where the GPU and CPU are both performing operations on the same data and need to share information with each other, as memory wouldn't have to be copied from the card to the system and vice versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in the card's own memory (GDDR5) exclusively: not only is the GPU's memory likely to be much faster than your system's, there will also be no latency caused by reading memory over the PCI-E bus.
Think of the graphics card's memory as a type of "L3 cache" and your system's memory as a resource shared by the entire system: you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, and I've never even written Hello World in either. I've only read a few white papers; it's just common sense (or maybe my Computer Science degree is useful after all).

OpenCL and multiple video cards

My understanding of the differences between CPUs and GPUs is that GPUs are not general-purpose processors: if a video card contains 10 GPUs, each GPU actually shares the same program pointer, and to optimize parallelism on the card I need to ensure each GPU is actually running the same code.
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
My question is, how does this work on multiple cards? At the speed at which they operate, doesn't the hardware make a slight difference in execution times, such that a calculation on one GPU on one card may end sooner or later than the same calculation on another GPU on another card?
This is not true. Different threads on a GPU may complete at different times due to differences in memory access latency, for example. That is why there are synchronization primitives in OpenCL such as the barrier command. You can never assume that your threads are running precisely in parallel.
The same is true for multiple GPUs. There is no guarantee that they are in sync, so you will need to rely on API calls such as clFinish to explicitly synchronize their work.
I think you may be confused about how threads work on a GPU. First to address the issue of multiple GPUs. Multiple GPUs NEVER share the program pointer, so they will almost never complete a kernel at the same time.
On a single GPU, only threads that are executing ON THE SAME COMPUTE UNIT (or SM in NVIDIA parlance) AND are part of the same warp/wavefront are guaranteed to execute in sync.
You can never really count on this, but for some devices the compiler can determine that it will be the case (I am specifically thinking about some AMD devices, as long as the workgroup size is hardcoded to 64).
In any case, as @vocaro pointed out, that's why you need to use a barrier for local memory.
To emphasize, even on the same GPU, threads are not executing in parallel across the whole device - only within each compute unit.
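
The role of the barrier mentioned in these answers can be shown with a host-side analogy. The sketch below uses ordinary Python threads, not OpenCL: `threading.Barrier` stands in for `barrier(CLK_LOCAL_MEM_FENCE)`, the `local` list stands in for a `__local` array, and `work_item` is a hypothetical name. It illustrates the pattern only, not how a GPU schedules work items.

```python
import threading

N = 4
local = [0] * N                 # stands in for a __local array
result = [0] * N
barrier = threading.Barrier(N)  # stands in for barrier(CLK_LOCAL_MEM_FENCE)

def work_item(lid):
    local[lid] = lid * 10                  # phase 1: write own slot
    barrier.wait()                         # wait until every item has written
    result[lid] = local[(lid + 1) % N]     # phase 2: read a neighbour's slot

threads = [threading.Thread(target=work_item, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result)
```

Without the `barrier.wait()` call, a thread could read its neighbour's slot before the neighbour has written it, which is the same hazard the answers describe for work items sharing local memory.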

OpenCL - waste of host computing power

I am new to OpenCL. Please tell me: can the host CPU only be used for allocating memory for the device, or can we also use it as an OpenCL device? (After the allocation is done, the host CPU would otherwise be idle.)
You can use a CPU as a compute device. OpenCL even allows multicore/multiprocessor systems to partition their cores into separate sub-devices. I like to use this feature to divide the CPUs on my system into groups based on NUMA nodes. It is also possible to divide a CPU into sub-devices whose cores all share the same level of cache memory (L1, L2, L3 or L4).
You need a platform that supports it, such as AMD's SDK. I know there are ways to have NVIDIA and AMD platforms on the same machine, but I have never had to do so myself.
Also, the OpenCL event/callback system allows you to use your CPU as you normally would while the GPU kernels are executing. In this way, you can use OpenMP or any other code on the host while you wait for the GPU kernel to finish.
There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
1. Get on with something else, like preparing the next set of work, or performing calculations on something completely different.
2. Have the CPU set up as another compute device, and submit a piece of work to it.
Personally I tend to find myself needing the first case more often, as it's rare that I have two tasks that are independent and both lend themselves to the OpenCL style. The trick is keeping things balanced so you're not waiting a long time for the GPU task to finish, or leaving the GPU idle while the CPU is getting on with other work.
It's the same problem OpenGL coders had to conquer: avoiding being CPU or GPU bound, and balancing between the two for best performance.
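
The first option (getting on with other host work while the device is busy) can be sketched with a host-side analogy. Nothing here is OpenCL: `fake_device_kernel` is a hypothetical stand-in that just sleeps, the future returned by `submit` plays the role of an OpenCL event, and `prepare_next_batch` is an invented example of useful host work.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_device_kernel(data):
    """Stand-in for a GPU kernel; the sleep simulates device run time."""
    time.sleep(0.05)
    return [x * x for x in data]

def prepare_next_batch(n):
    """Host-side work that can proceed while the 'device' is busy."""
    return list(range(n, 2 * n))

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fake_device_kernel, [1, 2, 3])  # returns immediately, like an enqueue
    next_batch = prepare_next_batch(3)                    # overlaps with the 'kernel'
    squares = pending.result()                            # block only when the result is needed

print(squares, next_batch)
```

The point of the pattern is that the host blocks only at the moment it actually needs the result, so the time in between is available for other work.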

How to "stream" data from and to global memory?

The showcase Part 2: OpenCL™ – Memory Spaces states that Global memory should be considered as streaming memory [...] and that the best performance will be achieved when streaming contiguous memory addresses or memory access patterns that can exploit the full bandwidth of the memory subsystem.
My understanding of this sentence is that, for optimal performance, one should constantly fill and read global memory while the GPU is working on the kernels. But I have no idea how I would implement such a concept, and I am not able to recognize it in the (rather simple) examples and tutorials I've read.
Do you know a good example, or can you link to one?
Bonus question: Is this analogous in the CUDA framework?
I agree with talonmies' interpretation of that guideline: sequential memory accesses are fastest. It's pretty obvious (to any OpenCL-capable developer) that sequential memory accesses are the fastest though, so it's funny that NVidia explicitly spells it out like that.
Your interpretation, although not what that document is saying, is also correct. If your algorithm allows it, it is best to upload in reasonably sized chunks asynchronously so it can get started on the compute sooner, overlapping compute with DMA transfers to/from system RAM.
It is also helpful to have more than one wavefront/warp, so the device can interleave them to hide memory latency. Good GPUs are heavily optimized to be able to do this switching extremely fast to stay busy while blocked on memory.
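
The benefit of overlapping uploads with compute can be estimated with a simple two-stage pipeline model. This is a sketch with assumed inputs: the chunk count and the per-chunk times in milliseconds are invented, both function names are hypothetical, every chunk is assumed to take the same time, and result downloads are ignored.

```python
def serial_time(n_chunks, t_copy, t_compute):
    """Upload a chunk, compute on it, then start the next upload."""
    return n_chunks * (t_copy + t_compute)

def overlapped_time(n_chunks, t_copy, t_compute):
    """Upload chunk i+1 while chunk i is being computed (double buffering).

    Only the first upload and the last compute run alone; in between,
    the slower of the two stages sets the pace.
    """
    return t_copy + (n_chunks - 1) * max(t_copy, t_compute) + t_compute

# Assumed example: 100 chunks, 200 ms to upload and 300 ms to compute each.
serial_ms = serial_time(100, 200, 300)
overlapped_ms = overlapped_time(100, 200, 300)
print(serial_ms, overlapped_ms)
```

In this model the total approaches the cost of the slower stage alone, so overlapping helps most when copy and compute times are of similar size.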
"My understanding of this sentence is, that for optimal performance one should constantly fill and read global memory while the GPU is working on the kernels."
That isn't really a correct interpretation.
Typical OpenCL devices (i.e. GPUs) have extremely high bandwidth, high latency global memory systems. This sort of memory system is highly optimized for contiguous or linear memory access. What the piece you quote is really saying is that OpenCL kernels should be designed to access global memory in the sort of contiguous fashion that is optimal for GPU memory. NVIDIA call this sort of optimal, contiguous memory access "coalesced", and discuss memory access pattern optimization for their hardware in some detail in both their CUDA and OpenCL guides.
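
To make "contiguous" concrete, one can count how many memory segments a group of work items touches for a given access stride. The sketch below is a simplified model, not a description of any particular GPU: the group of 32 work items, 4-byte elements and 128-byte segments are assumed example values, and `segments_touched` is a hypothetical helper.

```python
def segments_touched(stride_elems, work_items=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments hit when work item i reads element i * stride."""
    addresses = [lid * stride_elems * elem_bytes for lid in range(work_items)]
    return len({addr // segment_bytes for addr in addresses})

contiguous = segments_touched(1)    # neighbouring work items read neighbouring elements
strided = segments_touched(32)      # each work item lands in a different segment
print(contiguous, strided)
```

In this model the contiguous pattern is served by a single segment fetch, while the strided one needs a separate fetch for every work item, which is the kind of difference the coalescing guidelines are about.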

Sharing the GPU between OpenCL capable programs

Is there a method to share the GPU between two separate OpenCL capable programs, or more specifically between two separate processes that simultaneously both require the GPU to execute OpenCL kernels? If so, how is this done?
It depends what you call sharing.
In general, you can have 2 processes that each create an OpenCL context on the same GPU device. It's then the driver/OS/GPU's responsibility to make sure things just work.
That said, most implementations will time-slice the GPU execution to make that happen (just like it happens for graphics).
I sense this is not exactly what you're after though. Can you expand your question with a use case?
Current GPUs (except NVidia's Fermi) do not support simultaneous execution of more than one kernel. Moreover, to this date GPUs do not support preemptive multitasking; it's completely cooperative! A kernel's execution cannot be suspended and continued later on. So the granularity of any time-based GPU sharing depends on the kernels' execution times.
If you have multiple programs running that require GPU access, you should therefore make sure that your kernels have short runtimes (< 100 ms is a rule of thumb), so that GPU time can be time-sliced among the kernels that want GPU cycles. This also matters because otherwise the host system's graphics will become very unresponsive, as they need GPU access too. It can go so far that a kernel stuck in an endless or very long loop will appear to crash the system.
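
One way to keep each kernel launch short is to split a large index range into several smaller launches. The sketch below only computes the (offset, count) pairs; the throughput figure is an assumed placeholder you would have to measure, and `split_range` is a hypothetical helper. In OpenCL each pair could then be passed as the global_work_offset and global_work_size of a separate clEnqueueNDRangeKernel call.

```python
def split_range(total_items, items_per_s, budget_ms=100):
    """Split a 1D range so that each launch should fit in the time budget."""
    # Items one launch can process within the budget (at least one).
    per_launch = max(1, items_per_s * budget_ms // 1000)
    return [(offset, min(per_launch, total_items - offset))
            for offset in range(0, total_items, per_launch)]

# Assumed measurement: the kernel processes about 4 million items per second.
launches = split_range(1_000_000, 4_000_000)
print(launches)
```

Each launch then covers at most 400000 items under the assumed throughput, which leaves gaps between launches where other programs or the display can get GPU time.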
