On NVIDIA GPUs we can have multiple kernels running concurrently by using streams. How about the Xeon Phi? If I offload two parts of the computation code from different threads, will they run concurrently on the Xeon Phi?
Yes, you can have concurrent offload executions on the Xeon Phi, up to 64 by default.
See the --max-connections parameter of the Coprocessor Offload Infrastructure (COI) daemon running on the Xeon Phi (/bin/coi_daemon):
--max-connections=<int> The maximum number of connections we allow from host
processes. If this is exceeded, new connections
are temporarily blocked. Defaults to 64.
I can view the Intel HD Graphics command queue with VTune, but not the CPU command queue. Why? Is it the expected behavior to capture only GPU "events" and not those from the CPU that are independent of the GPU?
The same OpenCL program (a simple vector addition) shows the events (NDRange, etc.) when run on the GPU, but not when run on the CPU (you only see clWriteBuffer/clReadBuffer and clBuildProgram). Also, you cannot see any info in the region where the CPU is working with OpenCL (clWaitForEvents).
(VTune timeline screenshots of the CPU and GPU runs omitted.)
Is there a possibility to exclusively reserve the GPU for an OpenCL host program?
No other process shall have access to this device via OpenCL or OpenGL.
The background is that my software calculates real-time data on the GPU, so it is bad for performance if the GPU is doing other work as well.
Can anyone help? I am executing an OpenCL kernel program for image processing and want to know whether the kernel executes faster on the CPU or the GPU.
I used the tool GPU Caps Viewer to capture the specifications of the CPU and GPU, and I want to know which is better for running the OpenCL kernel code, the CPU or the GPU, and why.
(GPU Caps Viewer screenshots of the GPU info, the AMD CPU info, and the Intel CPU info omitted.)
I also want to know why these options differ between the CPU (Intel, AMD) info:
constant buffer: AMD 64 KB, Intel 128 KB
max samplers: AMD 16, Intel 480
OpenCL extensions: AMD 16, Intel 14
work-group size: AMD 16, Intel 14
Any help is appreciated. Thanks.
I am trying to measure PCIe bandwidth on an ATI FirePro 8750. The AMD APP sample PCIeBandwidth in the SDK measures the bandwidth of transfers from:
Host to device, using clEnqueueWriteBuffer().
Device to host, using clEnqueueReadBuffer().
On my system (Windows 7, Intel Core 2 Duo, 32-bit) the output looks like this:
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : ATI RV770
Host to device : 0.412435 GB/s
Device to host : 0.792844 GB/s
This particular card has 2 GB of DRAM and a max clock frequency of 750 MHz.
1- Why is the bandwidth different in each direction?
2- Why is the bandwidth so small?
Also, I understand that this communication takes place through DMA, so the bandwidth should not be limited by the CPU.
This paper from Microsoft Research gives some inkling of why there is asymmetric PCIe data transfer bandwidth between the GPU and CPU. The paper describes performance metrics for FPGA - GPU data transfer bandwidth over PCIe, and it also includes metrics for CPU - GPU data transfer bandwidth over PCIe.
To quote the relevant section:
'it should also be noted that the GPU-CPU transfers themselves also show some degree of asymmetric behavior. In the case of a GPU to CPU transfer, where the GPU is initiating bus master writes, the GPU reaches a maximum of 6.18 GByte/Sec. In the opposite direction from CPU to GPU, the GPU is initiating bus master reads and the resulting bandwidth falls to 5.61 GByte/Sec. In our observations it is typically the case that bus master writes are more efficient than bus master reads for any PCIe implementation due to protocol overhead and the relative complexity of implementation. While a possible solution to this asymmetry would be to handle the CPU to GPU direction by using CPU initiated bus master writes, that hardware facility is not available in the PC architecture in general.'
The answer to the second question, on why the bandwidth is so small, likely comes down to the size of the data transfers.
See Figs. 2, 3, 4 and 5 of the paper. I have also seen graphs like this at the first AMD Fusion conference. The explanation is that a PCIe data transfer has overheads due to the protocol and the device latency. The overheads are more significant for small transfer sizes and become less significant for larger ones.
What levers do you have to control or improve performance?
Getting the right combination of chip/motherboard and GPU is the hardware lever. Chips with the maximum number of PCIe lanes are better, and a higher-spec PCIe protocol helps: PCIe 3.0 is better than PCIe 2.0. All components need to support the higher standard.
As a programmer, controlling the data transfer size is a very important lever.
Transfer sizes of 128 KB - 256 KB get approximately 50% of the maximum bandwidth; transfers of 1 MB - 2 MB get over 90% of the maximum bandwidth.
In the AMD APP programming guide (p. 4-15) it is written that:
    For transfers <= 32 kB: For transfers from the host to device, the data is
    copied by the CPU to a runtime pinned host memory buffer, and the DMA
    engine transfers the data to device memory. The opposite is done for
    transfers from the device to the host.
Is the DMA engine mentioned above the CPU's DMA engine or the GPU's DMA engine?
I believe it is the GPU's DMA engine, since on some cards (e.g., NVIDIA's) you can do a simultaneous read and write, and that is a GPU capability, not a CPU capability.