Is it possible for OpenCL to cache some data while running between kernels? - opencl

I am working on graph computation tasks in which I always need to update my vertex data on the host side, iterating through the computations to get the results. During this process the edge data is unchanged. I want to know whether, when I use OpenCL to repeatedly write data, run the kernel, and read the data back, the unchanged data can stay on the device side to reduce communication costs. By the way, I am currently limited to OpenCL version 1.2.

Question 1:
Is it possible for OpenCL to cache some data while running between kernels?
Yes, this is possible in the OpenCL programming model. A memory object keeps its contents between kernel launches until it is released, so data that does not change only has to be written once. See Buffer Objects, Image Objects and Pipes in the official OpenCL documentation (pipes were introduced with OpenCL 2.0, so they are not available under 1.2). Buffer objects can be manipulated by the host using OpenCL API calls.
The following Stack Overflow posts on caching in OpenCL may clarify this further:
OpenCL execution strategy for tree like dependency graph
OpenCL Buffer caching behaviour
Memory transfer between host and device in OpenCL?
You should also look into techniques such as double buffering in OpenCL.
Question 2:
I want to know if there is a way that I can use OpenCL to repeatedly write data, run the kernel, and read the data, some unchanged data can be saved on the device side to reduce communication costs
Yes, it is possible. You can do it through either batch processing or data tiling. Because each transfer carries a fixed overhead, batching many small transfers into one larger transfer performs significantly better than making each transfer separately. There are many examples of batching or data tiling. One of them is this:
OpenCL Kernel implementing im2col with batch
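The batching argument above can be put into numbers with a small cost model. This is only a sketch: the function name `transfer_time`, the 10 microsecond per-transfer overhead and the 8 GB/s bandwidth are assumptions chosen for illustration, not measurements of any device.

```python
def transfer_time(num_transfers, bytes_each, overhead_s, bandwidth_bytes_per_s):
    """Total time for `num_transfers` copies, each paying a fixed overhead
    plus a size-dependent cost (hypothetical linear cost model)."""
    return num_transfers * (overhead_s + bytes_each / bandwidth_bytes_per_s)


# Assumed figures, for illustration only.
OVERHEAD_S = 10e-6   # fixed cost per transfer
BANDWIDTH = 8e9      # bytes per second

# The same 4,096,000 bytes, moved as 1000 small copies or as one large copy.
many_small = transfer_time(1000, 4096, OVERHEAD_S, BANDWIDTH)
one_large = transfer_time(1, 4096 * 1000, OVERHEAD_S, BANDWIDTH)
```

Under these assumed figures the fixed overhead dominates the small copies, which is why a single large copy comes out far cheaper in the model.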
Miscellaneous:
If possible, please use the latest version of OpenCL. Version 1.2 is old.
Since you have not mentioned your hardware, note that the programming model can differ between accelerators such as FPGAs and GPUs.
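The pattern the question asks about (upload the edges once, move only the vertex data each iteration) can be sketched with a toy model. `FakeDevice` and its methods are hypothetical stand-ins written for this illustration. In a real host program the corresponding calls would be clCreateBuffer, clEnqueueWriteBuffer, clEnqueueNDRangeKernel and clEnqueueReadBuffer, all of which exist in OpenCL 1.2. The "kernel" here is an assumed toy update, not the asker's algorithm.

```python
class FakeDevice:
    """Toy device whose buffers persist across kernel launches (not an OpenCL API)."""

    def __init__(self):
        self.buffers = {}           # name -> list, resident until replaced
        self.bytes_transferred = 0  # host<->device traffic, 4 bytes per element

    def write(self, name, data):
        # Stand-in for clEnqueueWriteBuffer (host -> device copy).
        self.buffers[name] = list(data)
        self.bytes_transferred += 4 * len(data)

    def read(self, name):
        # Stand-in for clEnqueueReadBuffer (device -> host copy).
        data = list(self.buffers[name])
        self.bytes_transferred += 4 * len(data)
        return data

    def run_kernel(self):
        # Stand-in for a kernel launch: each vertex receives the sum of the
        # values of its in-neighbours. Edges are stored flat: src0, dst0, src1, ...
        edges = self.buffers["edges"]
        old = self.buffers["vertices"]
        new = [0] * len(old)
        for k in range(0, len(edges), 2):
            new[edges[k + 1]] += old[edges[k]]
        self.buffers["vertices"] = new


def iterate(edges_flat, vertices, rounds):
    dev = FakeDevice()
    dev.write("edges", edges_flat)       # uploaded once, before the loop
    for _ in range(rounds):
        dev.write("vertices", vertices)  # only the changing data moves
        dev.run_kernel()
        vertices = dev.read("vertices")
        # a host-side update of `vertices` would go here
    return vertices, dev.bytes_transferred


result, traffic = iterate([0, 1, 1, 2, 2, 0], [1, 2, 3], rounds=3)
```

In this model the edge list contributes to the traffic counter once, whatever the number of rounds. Re-uploading it inside the loop would add its size to every iteration.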

Related

OpenCL 2.x pipes - how do they actually work?

I've read this description of the OpenCL 2.x pipe API and leafed through the Pipe API pages at khronos.org. I felt kind of jealous, working in CUDA almost exclusively, of this nifty feature available only in OpenCL (and sorry that CUDA functionality has not been properly subsumed by OpenCL, but that's a different issue), so I thought I'd ask "How come CUDA doesn't have a pipe mechanism?". But then I realized I don't even know what that would mean exactly. So, instead, I'll ask:
How do OpenCL pipes work on AMD discrete GPUs / APUs? ...
What info gets written where?
How is the scheduling of kernel workgroups to cores affected by the use of pipes?
Do piped kernels get compiled together (say, their SPIR forms)?
Does the use of pipes allow passing data between different kernels via the core-specific cache ("local memory" in OpenCL parlance, "shared memory" in CUDA parlance)? That would be awesome.
Is there a way pipes are "supposed" to work on a GPU, generally? i.e. something the API authors envisioned or even put in writing?
How do OpenCL pipes work in CPU-based OpenCL implementations?
OpenCL pipes were introduced with OpenCL 2.0. On GPUs, an OpenCL pipe is just like a global memory buffer with controlled access, i.e. you can limit the number of workgroups that are allowed to write to or read from a pipe simultaneously. This lets us reuse the same buffer or pipe without worrying about conflicting reads or writes from multiple workgroups. As far as I know, OpenCL pipes do not use GPU local memory. But if you carefully adjust the size of the pipe, you can increase cache hits and so achieve better overall performance. There is no general rule as to when pipes should be used. I use pipes to pass data between two concurrently running kernels so that my program achieves better overall performance due to a better cache hit ratio. OpenCL pipes work the same way on CPUs (a pipe is just a global buffer, which might fit in the system cache if it is small enough). On devices like FPGAs they work differently: there the pipes use local memory instead of global memory and hence achieve considerably higher performance than a global memory buffer.
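The "buffer with controlled access" description can be illustrated with a toy model of a fixed-capacity packet FIFO. `ToyPipe` and `run` are hypothetical names made up for this sketch, and it models only the queueing behaviour, not how any driver implements pipes. The return-code convention (0 on success, negative on failure) is meant to mirror that of the OpenCL 2.0 `write_pipe` and `read_pipe` built-ins.

```python
from collections import deque


class ToyPipe:
    """Fixed-capacity FIFO of packets (a model, not the OpenCL API)."""

    def __init__(self, max_packets):
        self.max_packets = max_packets
        self.packets = deque()

    def write(self, packet):
        # 0 on success, negative if the pipe is full.
        if len(self.packets) >= self.max_packets:
            return -1
        self.packets.append(packet)
        return 0

    def read(self):
        # Returns (status, packet); status is negative if the pipe is empty.
        if not self.packets:
            return -1, None
        return 0, self.packets.popleft()


def run(items, capacity):
    """Interleave a producer and a consumer that communicate through one pipe."""
    pipe, out, pending = ToyPipe(capacity), [], deque(items)
    while pending or pipe.packets:
        # Producer step: push packets until the pipe reports it is full.
        while pending and pipe.write(pending[0]) == 0:
            pending.popleft()
        # Consumer step: take one packet and process it (here: double it).
        status, packet = pipe.read()
        if status == 0:
            out.append(packet * 2)
    return out


doubled = run([1, 2, 3, 4, 5], capacity=2)
```

Because the pipe holds at most `capacity` packets, the data in flight stays small, which is the property the answer relies on for better cache behaviour.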

accessing file system using cpu device in opencl

I am a newbie to OpenCL. I have a question about how OpenCL works when a kernel is running on a CPU device. Suppose we have a kernel running on a CPU device: can it read from a file on the disk? If yes, then how? If no, then why not?
Can you please suggest a source for detailed information?
Thanks in advance.
It can't. Simply because not every OpenCL device has a file system, or a disk respectively.
You can't. OpenCL tries to unify access to computing power, whereas file systems depend on the OS. If you want this feature on a CPU, threads (C++11 threads, pthreads, ...) or OpenMP should be able to handle it, because it is a CPU-only thing.
It doesn't make sense to allow device kernels to access the filesystem, because most of the semantics of filesystem access are essentially incompatible with the massively parallel nature of device kernels.
There are two ways to work around this, considering you're only asking about the CPU:
if you intend to use OpenCL as a way to do multithreading on the CPU, consider using what OpenCL calls "native kernels", which are essentially just plain C functions called within an OpenCL context;
a more general approach that might work on a GPU too would be to mmap the files you want to operate on, and pass the resulting pointers to clCreateBuffer with the CL_MEM_USE_HOST_PTR flag.
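The mmap half of the second workaround can be sketched with the Python standard library alone. This shows only how a file becomes an ordinary memory region on the host. Handing that region to clCreateBuffer with CL_MEM_USE_HOST_PTR is not shown, and the helper names `sum_bytes_via_mmap` and `demo` are made up for the illustration.

```python
import mmap
import os
import tempfile


def sum_bytes_via_mmap(path):
    """Map a file read-only and process it like an in-memory buffer."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # In a real host program this mapped region is what would back
            # the OpenCL buffer; here the host simply sums its bytes.
            return sum(view[:])


def demo():
    # Create a small throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes([1, 2, 3, 4]))
        path = tmp.name
    try:
        return sum_bytes_via_mmap(path)
    finally:
        os.remove(path)


total = demo()
```

The kernel itself never touches the file system in this scheme. The operating system pages the file in on the host side, and the device only ever sees memory.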

Memory Object Assignation to Context Mechanism In OpenCL

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices that are associated with the context?
I'd be thankful if you help me understand this issue :-)
Generally and typically the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel integrated GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.

OpenCL: Sending same cl_mem to multiple devices

I am writing a multi-GPU parallel algorithm. One of the issues I am facing is to find out what would happen if I push one cl_mem to multiple devices, and let them run the same kernel at the same time. The kernel will make change to the memory passed to device.
It is very time consuming to code and debug OpenCL code, so before I start I want to take some advice from fellow Stack Overflow users. I want to know the consequences of doing this in both of the scenarios below (e.g. will any exception be raised during execution? Is the data synchronized? When CL_MEM_COPY_HOST_PTR is used, does the region of memory pointed to by this cl_mem get properly copied to each device? etc.):
The memory is created with CL_MEM_COPY_HOST_PTR
The memory is created with CL_MEM_USE_HOST_PTR
I don't see anything explicit in the OpenCL specifications that guarantees that data will be synchronised across devices. I don't see how the OpenCL implementation would know how to distribute a buffer across multiple devices and how to aggregate those buffers again later.
The approach I've adopted is to create a separate context, read, write and kernel exec queues for each device. I then create separate buffers on each device and enqueue writes/reads to move data to/from the devices. Hence I explicitly handle all of that myself.
I'd like a better solution, but at least the above method works and doesn't rely on anything that is implementation specific.
Appendix A of the OpenCL Specification explains the required synchronization for objects shared between different command queues.
Basically it says you should use OpenCL events and clFlush to synchronize execution between the command queues. The OpenCL implementation will synchronize the contents of the memory objects between the different devices of the OpenCL context. Choosing CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR does not make any difference here, but CL_MEM_USE_HOST_PTR will avoid a couple of extra copies of the data in host memory. Use clEnqueueMapBuffer to synchronize the contents with the host at the end.

How to "stream" data from and to global memory?

The codeproject.com showcase Part 2: OpenCL™ – Memory Spaces states that Global memory should be considered as streaming memory [...] and that the best performance will be achieved when streaming contiguous memory addresses or memory access patterns that can exploit the full bandwidth of the memory subsystem.
My understanding of this sentence is that, for optimal performance, one should constantly fill and read global memory while the GPU is working on the kernels. But I have no idea how I would implement such a concept, and I am not able to recognize it in the (rather simple) examples and tutorials I've read.
Do you know a good example, or can you link to one?
Bonus question: Is this analogous in the CUDA framework?
I agree with talonmies about his interpretation of that guideline: sequential memory accesses are fastest. It's pretty obvious (to any OpenCL-capable developer) that sequential memory accesses are the fastest though, so it's funny that NVIDIA explicitly spells it out like that.
Your interpretation, although not what that document is saying, is also correct. If your algorithm allows it, it is best to upload in reasonably sized chunks asynchronously so it can get started on the compute sooner, overlapping compute with DMA transfers to/from system RAM.
It is also helpful to have more than one wavefront/warp, so the device can interleave them to hide memory latency. Good GPUs are heavily optimized to be able to do this switching extremely fast to stay busy while blocked on memory.
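The benefit of uploading in chunks and overlapping transfers with compute can be estimated with a simple timeline model. This is a sketch under stated assumptions: every chunk takes the same transfer time and the same compute time, only uploads are modelled (result downloads are ignored), and the function names are made up for the illustration.

```python
def serial_time(n_chunks, t_transfer, t_compute):
    """Each chunk is uploaded, then processed, with nothing overlapping."""
    return n_chunks * (t_transfer + t_compute)


def overlapped_time(n_chunks, t_transfer, t_compute):
    """The upload of chunk i+1 runs while chunk i is being processed.

    The first upload cannot be hidden, the last compute cannot be hidden,
    and in between the slower of the two stages sets the pace.
    """
    if n_chunks == 0:
        return 0
    return t_transfer + (n_chunks - 1) * max(t_transfer, t_compute) + t_compute


# Four chunks, 2 time units to upload and 3 to process each (assumed figures).
serial = serial_time(4, 2, 3)
overlapped = overlapped_time(4, 2, 3)
```

In the model the gain is largest when transfer and compute times are similar. When one stage dominates, overlapping hides only the cheaper stage.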
My understanding of this sentence is, that for optimal performance one should constantly fill and read global memory while the GPU is working on the kernels
That isn't really a correct interpretation.
Typical OpenCL devices (i.e. GPUs) have extremely high-bandwidth, high-latency global memory systems. This sort of memory system is highly optimized for contiguous or linear memory access. What the piece you quote is really saying is that OpenCL kernels should be designed to access global memory in the contiguous fashion that is optimal for GPU memory. NVIDIA calls this sort of optimal, contiguous memory access "coalesced", and discusses memory access pattern optimization for its hardware in some detail in both its CUDA and OpenCL guides.
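Why contiguous access is cheaper can be seen by counting how many aligned memory segments a group of threads touches. The sketch below assumes 32 threads, 4-byte elements and 128-byte segments. These are illustrative figures, and real hardware differs in its transaction sizes and rules. The function name `segments_touched` is hypothetical.

```python
def segments_touched(num_threads, stride_elems, elem_bytes=4, segment_bytes=128):
    """Count distinct aligned segments hit when thread i reads element i * stride."""
    addresses = (i * stride_elems * elem_bytes for i in range(num_threads))
    return len({addr // segment_bytes for addr in addresses})


contiguous = segments_touched(32, 1)   # neighbouring threads read neighbouring elements
strided = segments_touched(32, 32)     # each thread lands in a different segment
```

With contiguous indexing the whole group is served from a single segment in this model, whereas the strided pattern needs one segment per thread for the same amount of useful data.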

Resources