Memory allocation Nvidia vs AMD - opencl

I know there is a 128 MB limit for a single block of GPU memory on AMD GPUs. Is there a similar limit on Nvidia GPUs?

You can query this information at runtime using clGetDeviceInfo and CL_DEVICE_MAX_MEM_ALLOC_SIZE.
See clGetDeviceInfo Man Page for more information.

On a GTX 560, clGetDeviceInfo returns 256 MiB for CL_DEVICE_MAX_MEM_ALLOC_SIZE; however, I can allocate slightly less than 1 GiB in practice. See this thread discussing the issue.
On AMD, however, this limit is enforced. You can raise it by changing the GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_SIZE environment variables (see this thread).
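If the limit is enforced on your device, one workaround is to split a large logical buffer into several allocations that each stay within CL_DEVICE_MAX_MEM_ALLOC_SIZE. Below is a minimal sketch of just the splitting arithmetic. The function name split_allocation is hypothetical, and max_alloc_bytes is assumed to be the value you obtained from clGetDeviceInfo; no OpenCL call is made here.

```python
def split_allocation(total_bytes, max_alloc_bytes):
    """Split a logical buffer into (offset, size) pieces no larger than max_alloc_bytes.

    max_alloc_bytes is assumed to come from CL_DEVICE_MAX_MEM_ALLOC_SIZE;
    it is passed in as a plain integer so the sketch runs without OpenCL.
    """
    if total_bytes < 0 or max_alloc_bytes <= 0:
        raise ValueError("sizes must be non-negative and the limit positive")
    pieces = []
    offset = 0
    while offset < total_bytes:
        # The last piece may be smaller than the limit.
        size = min(max_alloc_bytes, total_bytes - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


if __name__ == "__main__":
    MiB = 1 << 20
    # Example values only: a 600 MiB request against a 256 MiB limit.
    for offset, size in split_allocation(600 * MiB, 256 * MiB):
        print(offset // MiB, size // MiB)
```

Each piece would then become its own buffer, and the kernel would need to be written so that it can work on one piece at a time.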

Related

Vectorized Code on GPU

I am using OpenCL to execute a procedure on different GPUs and CPUs simultaneously to get high-performance results. The Intel OpenCL implementation always shows a message that the kernel is not vectorized, so it will run on different cores but will not use SIMD instructions. My question is: if I rewrite the code so that SIMD instructions can be exploited by the OpenCL code, will it also increase GPU performance?
Yes, but beware that this is not necessary for good performance on AMD GCN-based APUs/GPUs or on Nvidia Fermi or newer GPU hardware; they execute scalar operations with great utilization. CPUs and Intel's GPUs, however, can benefit greatly from SIMD instructions, which is what the vector operations boil down to.

OpenCL kernel queueing delays

I have a gigantic pile of data, 100 GB. I only have 1 GB of video memory. I need to queue my kernel many times with MaxWorkgroupSize chunks. That's going to be ~10000 kernel enqueues and 100 memory transfers. How badly will this affect my run time? Also, is there a faster way of processing so much data? Would I just be better off running on my CPU with 8 threads, because then there are no data transfers or kernel delays? I'm asking before I code the thing because I want to make sure I have the right approach.
It depends on the nature of the work. GPUs are SIMD machines. If you are typically doing the same thing for each item (e.g. branches normally go the same way for each work item), then that bodes well for a GPU. That said, OpenCL implementations exist for an 8-thread CPU as well. Also, in environments with an integrated GPU such as Intel's (AMD too?), you should consider the CL_MEM_USE_HOST_PTR flag on the memory buffer. You can use it to get zero-copy access and avoid the transfer overhead.
Enqueueing the same kernel multiple times doesn't impose any performance hit per enqueue compared to a single kernel run. If anything, it becomes a little faster due to caching.
Also, you can run your code on the CPU and GPU simultaneously, as both are OpenCL-compatible devices.
Your device can use memory objects allocated from the host's RAM (the CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR flags of the clCreateBuffer() function). In any case, memory transfers may not be the bottleneck.
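To get a feel for the numbers in the question before writing any OpenCL code, you can count the transfers and kernel enqueues a chunked run would need. The sketch below is only that bookkeeping; the function name plan_workload and the parameters item_bytes and items_per_enqueue are hypothetical, and it assumes the total size is a multiple of the item size.

```python
def plan_workload(total_bytes, chunk_bytes, item_bytes, items_per_enqueue):
    """Count host-to-device transfers and kernel enqueues for a chunked run.

    chunk_bytes is the device buffer size used per transfer, and
    items_per_enqueue is how many work items one kernel enqueue covers.
    All four parameters are illustrative assumptions, not OpenCL values.
    """
    transfers = 0
    enqueues = 0
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_bytes, remaining)   # last chunk may be partial
        items = size // item_bytes           # work items in this chunk
        transfers += 1
        # Integer ceiling division: enqueues needed to cover this chunk.
        enqueues += -(-items // items_per_enqueue)
        remaining -= size
    return transfers, enqueues


if __name__ == "__main__":
    # Small example values: 1000 bytes of data, 100-byte chunks,
    # 4-byte items, 5 items per enqueue.
    print(plan_workload(1000, 100, 4, 5))
```

Multiplying these counts by the per-transfer and per-enqueue times you measure on your own hardware gives a rough overhead estimate to compare against the CPU-only approach.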

Can we read and program the microcode of AMD processors?

We know that the microcode in Intel processors is encrypted (as stated in the "Intel® 64 and IA-32 Architectures Software Developer’s Manual"), so one cannot program Intel microcode as one wishes.
So, does anyone know about AMD microcode? Is the microcode of AMD CPUs encrypted?
Does anyone know how to program microcode? The question is not limited to AMD or Intel CPUs.
Thank you in advance!
(PS: I mean the microcode in CPUs, not in GPUs.)
This article provides information on the microcode of AMD's Opteron (K8) family. It claims that it is not encrypted and provides information on the microcode format and updating the microcode.
"Does anyone know how to program microcode? The question is not limited to AMD or Intel CPUs."
Not many people do that kind of work. It's often written with a C compiler tweaked to generate the necessary microcode.
To answer your question regarding "are there other processors accepting microcode?": FPGAs are programmed only in this way. They are not CPUs; what you program into them is written at the hardware level. The microcode changes the gates, and the result is your program. It can become very tedious, as everything runs in parallel (true hardware parallelism).
AMD microcode for recent processors is, indeed, encrypted and authenticated, much like Intel's. You need to have the proper crypto key to sign a microcode update the processor will accept.
Intel does it by embedding a hash of the valid key(s?) in the processor's mask (hardware read-only) microcode: the key itself is too large to bother embedding in the processor, so it is present in the update data itself, as seen here. Also, the Intel microcode update is actually a unified processor-package update: it updates more than just the microcode for the decode unit. It can update all sorts of internal processor parameters, as well as control sequences for units other than the decoder. It also contains both opcodes (and likely microcode) that the processor runs before(?)/after applying the update.

HyperQ support in OpenCL

I want to run heterogeneous kernels that execute asynchronously on a single GPU. I think this is possible on the Nvidia Kepler K20 (or any device with compute capability 3.5+) by launching each of these kernels in a different stream, with the runtime system mapping them to different hardware queues based on resource availability.
Is this feature accessible in OpenCL?
If so, what is the equivalent of a CUDA 'Stream' in OpenCL?
Do Nvidia drivers support such execution on their K20 cards through OpenCL?
Is there any AMD GPU that has a similar feature (or is anything in development)?
An answer to any of these questions will help me a lot.
In principle, you can use OpenCL command queues to achieve CKE (Concurrent Kernel Execution). You can launch them from different CPU threads. Here are a few links that might help you get started:
How do I know if the kernels are executing concurrently?
http://devgurus.amd.com/thread/142485
I am not sure how it would work with NVIDIA Kepler GPUs, as we are having strange issues using OpenCL on a K20 GPU.

GPU in-use memory in OpenCL

Is there any way to query a GPU device to find its in-use memory with OpenCL? I want to allocate as much memory as I can.
There is no standard way of getting such information. Some alternatives (pretty poor alternatives, but anyway):
CUDA provides such functionality via cuMemGetInfo
GL_ATI_meminfo and NVX_gpu_memory_info OpenGL extensions
nvidia-smi application
