I'm new to Xeon Phi programming and I'm currently trying to learn explicit offload programming. I have been going through some tutorials provided by Intel, but I couldn't properly understand the meaning of the nocopy clause. If anyone knows about it, please explain it with examples of its usage in different scenarios. It would also be a great help if you could point me to any interactive tutorials on the web.
For a default #pragma offload, these five things happen:
1. allocate space on the Xeon Phi
2. move data to the Xeon Phi
3. do the math
4. move data from the Xeon Phi
5. free the allocated buffers
The nocopy clause tells the pragma to skip steps 2 and 4.
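For reference, here is a minimal sketch of a default synchronous offload where all five steps happen implicitly. This assumes the Intel compiler's offload pragmas (icc); the array and the reduction are just placeholders:

    #include <stdio.h>
    #define N 1024

    int main(void)
    {
        float a[N], sum = 0.0f;
        for (int i = 0; i < N; i++) a[i] = 1.0f;

        /* Default behavior: the runtime allocates `a` and `sum` on the
         * coprocessor, copies them in, runs the block, copies them back,
         * and frees the coprocessor buffers. */
        #pragma offload target(mic) in(a) inout(sum)
        {
            for (int i = 0; i < N; i++)
                sum += a[i];
        }
        printf("sum = %f\n", sum);  /* prints 1024.0 */
        return 0;
    }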
One use case for nocopy is when you are doing an asynchronous offload.
Moving data across PCIe (1st gen Xeon Phi) or the fabric (2nd gen Xeon Phi) has latency, especially for large arrays. It would be more efficient if you could do something else on your host machine while the offload transfer is in progress.
Asynchronous offload is when you use a combination of #pragma offload_transfer, which only moves data without doing any calculation, and #pragma offload, which does your calculations, and of course you do something on your host machine between the two pragmas.
You specify the nocopy clause on the #pragma offload because you have already transferred the data to the Xeon Phi with the earlier #pragma offload_transfer.
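Putting it together, a minimal sketch of an asynchronous offload, assuming the Intel compiler's offload extensions (`other_host_work` and the doubling loop are placeholders):

    #include <stdio.h>
    #define N (1 << 20)

    /* Variables used on the coprocessor need the target(mic) attribute. */
    __attribute__((target(mic))) static float data[N];
    __attribute__((target(mic))) static float result[N];

    static void other_host_work(void) { /* overlap useful host work here */ }

    int main(void)
    {
        int sig;  /* tag linking the transfer to the later compute */
        for (int i = 0; i < N; i++) data[i] = (float)i;

        /* Steps 1+2 only: allocate on the card, copy `data` over, keep
         * the buffer alive (free_if(0)), and return immediately (signal). */
        #pragma offload_transfer target(mic:0) signal(&sig) \
                in(data : length(N) alloc_if(1) free_if(0))

        other_host_work();  /* runs while the transfer is in flight */

        /* nocopy skips steps 2 and 4 for `data`, which is already on the
         * card; wait(&sig) blocks until the transfer above completes. */
        #pragma offload target(mic:0) wait(&sig) \
                nocopy(data : length(N) alloc_if(0) free_if(1)) \
                out(result : length(N))
        {
            for (int i = 0; i < N; i++)
                result[i] = 2.0f * data[i];
        }
        printf("result[1] = %f\n", result[1]);
        return 0;
    }

Note how alloc_if/free_if control steps 1 and 5 the same way nocopy controls steps 2 and 4: the offload_transfer allocates but does not free, and the later offload reuses the allocation without re-copying.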
Related
I currently have a problem scenario where I'm doing graph computation tasks: I always need to update my vertex data on the host side, iterating through the computations to get the results. But throughout this process the edge data is unchanged. I want to know if there is a way to use OpenCL to repeatedly write data, run the kernel, and read the data back, while keeping the unchanged data on the device side to reduce communication costs. By the way, I am currently only able to run OpenCL version 1.2.
Question 1:
Is it possible for OpenCL to cache some data while running between kernels?
Yes, it is possible in the OpenCL programming model. See Buffer Objects, Image Objects, and Pipes in the official OpenCL documentation (note that pipes require OpenCL 2.0). Buffer objects remain valid for the lifetime of their context and can be manipulated by the host using OpenCL API calls.
Also the following OpenCL StackOverflow posts will further clarify your concept regarding caching in OpenCL:
OpenCL execution strategy for tree like dependency graph
OpenCL Buffer caching behaviour
Memory transfer between host and device in OpenCL?
You should also look into caching techniques such as double buffering in OpenCL.
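For example, since buffer objects stay valid across kernel launches, you can upload the unchanged edge data once and only move the vertex data each iteration. A minimal sketch for OpenCL 1.2 (the function and variable names are placeholders and error checking is omitted):

    #include <CL/cl.h>

    void iterate(cl_context ctx, cl_command_queue q, cl_kernel kernel,
                 const float *edge_data, size_t n_edges,
                 float *vert_data, size_t n_verts, int iterations)
    {
        /* Upload the edge data ONCE; it stays in device memory afterwards. */
        cl_mem edges = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      n_edges * sizeof(float),
                                      (void *)edge_data, NULL);
        cl_mem verts = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                      n_verts * sizeof(float), NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &edges);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &verts);

        for (int it = 0; it < iterations; it++) {
            /* Only the vertex data crosses the bus each iteration. */
            clEnqueueWriteBuffer(q, verts, CL_TRUE, 0,
                                 n_verts * sizeof(float), vert_data,
                                 0, NULL, NULL);
            size_t gws = n_verts;
            clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL,
                                   0, NULL, NULL);
            clEnqueueReadBuffer(q, verts, CL_TRUE, 0,
                                n_verts * sizeof(float), vert_data,
                                0, NULL, NULL);
        }
        clReleaseMemObject(edges);
        clReleaseMemObject(verts);
    }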
Question 2:
I want to know if there is a way to use OpenCL to repeatedly write data, run the kernel, and read the data back, while keeping the unchanged data on the device side to reduce communication costs
Yes, it is possible. You can do it through batch processing or data tiling. Because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately. There are many examples of batching and data tiling; one is this:
OpenCL Kernel implementing im2col with batch
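As an illustration of the batching idea, the sketch below packs many small host arrays into one staging buffer and issues a single transfer instead of one transfer per piece (names are placeholders; error checking omitted):

    #include <CL/cl.h>
    #include <string.h>

    void batched_upload(cl_command_queue q, cl_mem dev_buf,
                        float *pieces[], size_t piece_len, int n_pieces,
                        float *staging /* n_pieces * piece_len floats */)
    {
        /* One memcpy per piece on the host is cheap ... */
        for (int i = 0; i < n_pieces; i++)
            memcpy(staging + i * piece_len, pieces[i],
                   piece_len * sizeof(float));

        /* ... and a single large transfer amortizes the per-transfer
         * overhead that many small writes would each pay. */
        clEnqueueWriteBuffer(q, dev_buf, CL_TRUE, 0,
                             n_pieces * piece_len * sizeof(float),
                             staging, 0, NULL, NULL);
    }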
Miscellaneous:
If it is possible, please use the latest version of OpenCL. Version 1.2 is old.
Also, since you have not mentioned your target hardware: the programming model can differ between hardware accelerators such as FPGAs and GPUs.
I want to perform FFT, FastConv and cross-correlation on GPU and pass the results to other OpenCL kernels without copying the results to host memory.
Can you advise me on an OpenCL implementation of FFT, FastConv and cross-correlation that can be called as kernels without transferring data from the GPU to the host?
Many OpenCL vendors (e.g., Apple, AMD, NVIDIA) have FFT samples that include kernel source. You can use these to process buffers already on the device and leave the results on the device.
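The key point is that the intermediate cl_mem buffers never need to cross the PCIe bus: you simply pass the same buffer to the next kernel. A minimal sketch (the two kernels are placeholders for whatever FFT and cross-correlation kernels you build from such samples; error checking omitted):

    #include <CL/cl.h>

    void chain(cl_command_queue q, cl_kernel fft_kernel, cl_kernel xcorr_kernel,
               cl_mem signal_buf, cl_mem result_buf, size_t n)
    {
        /* Pass 1: FFT writes in place to signal_buf, which stays on the device. */
        clSetKernelArg(fft_kernel, 0, sizeof(cl_mem), &signal_buf);
        clEnqueueNDRangeKernel(q, fft_kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* Pass 2: cross-correlation consumes the device-resident FFT output;
         * no clEnqueueReadBuffer happens between the two kernels. */
        clSetKernelArg(xcorr_kernel, 0, sizeof(cl_mem), &signal_buf);
        clSetKernelArg(xcorr_kernel, 1, sizeof(cl_mem), &result_buf);
        clEnqueueNDRangeKernel(q, xcorr_kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* Read back only the final result, once, after the whole chain. */
    }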
I want to run a program on an Intel Xeon Phi coprocessor. How can I know whether my machine has an Intel Xeon Phi coprocessor or not?
Well, if it has one you should probably know: it is large (see pictures here), costly (several times the price of a desktop), generates a lot of heat (you need special fans), and draws a lot of electrical power (you need a special PSU).
Otherwise, lspci | grep Co-processor will tell you whether you have one or not.
If you have a MIC (Xeon Phi) card attached to your machine, you should be able to ssh into it with the ssh mic0 command. This will only work if you have installed MPSS, though.
I am running my OpenCL C codes on our institution's GPU cluster, which has 8 nodes; each node has an Intel Xeon 8C processor and 3 NVIDIA Tesla M2070 GPUs (24 GPUs in total). I need to find a way, from my host code, to identify which of the GPUs are already occupied and which are free, and to submit my jobs to the available GPUs. The closest answers that I could find were:
How to programmatically discover specific GPU on platform with multiple GPUs (OpenCL 1.1)?
How to match OpenCL devices with a specific GPU given PCI vendor, device and bus IDs in a multi-GPU system?.
Can anyone help me out with how to choose a node and a GPU that is free for computation? I am writing in OpenCL C.
Unfortunately, there is no standard way to do such a thing.
If you want to squeeze the full power of GPUs for computation and your problem is not a memory hog, I can suggest using two contexts per device: as the kernels in the first context finish computing, the kernels in the second are still working, so you have time to fill the buffers with data and start the next task in the first context, and vice versa. In my case (AMD GPU, OpenCL 1.2) it saves from 0 to 20% of computation time. Three contexts are sometimes slower and sometimes faster, so I do not recommend that as a standard technique, but you can try. Four or more contexts are useless, in my experience.
Have a command queue for each device, then use OpenCL events with each kernel submission, and check their state before submitting a new kernel for execution. Whichever command queue has the fewest unfinished kernels is the one you should enqueue to.
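A minimal sketch of that bookkeeping, assuming you keep one array of events per device (all names and sizes are placeholders; error checking omitted):

    #include <CL/cl.h>

    #define NUM_DEVICES 3
    #define MAX_INFLIGHT 16

    /* events[d][i] holds the events of kernels submitted to device d;
     * counts[d] is how many are being tracked. */
    int count_unfinished(cl_event events[][MAX_INFLIGHT], int counts[], int d)
    {
        int busy = 0;
        for (int i = 0; i < counts[d]; i++) {
            cl_int status;
            clGetEventInfo(events[d][i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                           sizeof(status), &status, NULL);
            if (status != CL_COMPLETE)
                busy++;
        }
        return busy;
    }

    int pick_least_busy(cl_event events[][MAX_INFLIGHT], int counts[])
    {
        int best = 0, best_busy = count_unfinished(events, counts, 0);
        for (int d = 1; d < NUM_DEVICES; d++) {
            int busy = count_unfinished(events, counts, d);
            if (busy < best_busy) { best = d; best_busy = busy; }
        }
        return best;  /* enqueue the next kernel on this device's queue */
    }

Remember to release events with clReleaseEvent once they report CL_COMPLETE, or the tracking lists will grow without bound.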
I am new to OpenCL. Please tell me whether the host CPU can be used only for allocating memory to the device, or whether we can also use it as an OpenCL device. (Because after the allocation is done, the host CPU will be idle.)
You can use a CPU as a compute device. OpenCL even allows multicore/multiprocessor systems to segment cores into separate sub-devices (device fission). I like to use this feature to divide the CPUs on my system into groups based on NUMA nodes. It is possible to divide a CPU into compute devices which all share the same level of cache memory (L1, L2, L3 or L4).
You need a platform that supports it, such as AMD's SDK. I know there are ways to have Nvidia and AMD platforms on the same machine, but I have never had to do so myself.
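A minimal sketch of the NUMA split using OpenCL 1.2 device fission via clCreateSubDevices (error checking omitted; assumes the first platform exposes a CPU device):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id cpu;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);

        /* Split the CPU into one sub-device per NUMA node. */
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
            CL_DEVICE_AFFINITY_DOMAIN_NUMA, 0
        };
        cl_uint n = 0;
        clCreateSubDevices(cpu, props, 0, NULL, &n);  /* query the count */

        cl_device_id *subs = malloc(n * sizeof(cl_device_id));
        clCreateSubDevices(cpu, props, n, subs, NULL);
        printf("created %u NUMA sub-devices\n", n);
        /* Each sub-device can now get its own context and command queue. */
        free(subs);
        return 0;
    }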
Also, the OpenCL event/callback system allows you to use your CPU as you normally would while the GPU kernels are executing. In this way, you can run OpenMP or any other code on the host while you wait for the GPU kernel to finish.
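A minimal sketch of the callback approach; clSetEventCallback is available from OpenCL 1.1, and the kernel and work size here are placeholders (error checking omitted):

    #include <CL/cl.h>
    #include <stdio.h>

    static void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data)
    {
        printf("GPU kernel finished with status %d\n", status);
    }

    void arm_callback(cl_command_queue q, cl_kernel k, size_t gws)
    {
        cl_event ev;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, &ev);
        clFlush(q);  /* make sure the kernel is actually submitted */
        clSetEventCallback(ev, CL_COMPLETE, on_done, NULL);
        /* The host thread is now free to run OpenMP or other work here. */
    }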
There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
Get on with something else, like preparing the next set of work, or performing calculations on something completely different.
Have the CPU set up as another compute device, and so submit a piece of work to it.
Personally, I tend to find myself needing the first case more often, as it's rare that I have two independent tasks that both lend themselves to the OpenCL style. The trick is keeping things balanced, so you're not waiting a long time for the GPU task to finish, or leaving the GPU idle while the CPU is getting on with other work.
It's the same problem OpenGL coders had to conquer: avoiding being CPU- or GPU-bound, and balancing between the two for best performance.