OpenCL Accuracy and Performance Issues when using a Mac Pro (FirePro D500)

I have run into a strange issue while running the same OpenCL kernel on multiple machines. Please see below:
OS      OpenCL version    GPU                 Output accuracy
Linux   2.0               AMD R9 290X         Good
Mac     1.2               Nvidia GT 750M      Good
Mac     1.2               AMD FirePro D500    Incorrect
Linux   1.1               Nvidia Tesla K20    Good
I posted on Apple forums, and the only reply I have received is that I should disable fast path math. I am not enabling it anywhere.
In terms of performance, the code runs about two times slower on the FirePro than on the other discrete GPUs (the Tesla and the R9) in the list.
Can someone please tell me what could be going on? I am happy to share the code if needed.
Here is the OpenCL kernel (some of the variable/function names are not well chosen): http://pastebin.com/Kt4TinXt
Here is how it is called from the host:
sentence_length = 1024
num_sentences   = 6
count = 0
for (sentence in textfile)
{
    sentences += sentence
    count++
    if (count == num_sentences)    // a full batch of sentences is ready
    {
        enqueuekernel(sentences)
        sentences = empty          // start the next batch
        count = 0
    }
}
A sentence is basically a group of 1024 words, and the parallelism is at the word level. I chose to use 128 work-items per word because that allowed me to keep neu1 and neu1e in local (shared) memory. I tried other combinations, like 'layer1_size' work-items per word or one wavefront per word, but those did not perform well at all. Even now the performance is not great, but it gives me around a 2.8x speedup (compared to a 6-core Xeon) on the R9 and the Tesla.
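For reference, here is roughly how the work sizes are set up on the host. This is only a sketch; 'queue', 'kernel' and 'sentences_buf' are placeholder names, not the exact identifiers from my code:

/* Sketch: 128 work-items per word -> one work-group per word, so the
 * per-word neu1/neu1e arrays fit in local (shared) memory.
 * 'queue', 'kernel' and 'sentences_buf' stand for the objects created
 * during the usual OpenCL setup. */
const size_t sentence_length = 1024;            /* words per sentence  */
const size_t num_sentences   = 6;               /* sentences per batch */
const size_t local_size      = 128;             /* work-items per word */
const size_t global_size     = sentence_length * num_sentences * local_size;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &sentences_buf);
/* ... remaining kernel arguments ... */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);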
Please let me know if more detail is needed!

Related

OpenCL - Same code, correct on Apple + Xcode, incorrect on Win XP + MSVS 2008 + Nvidia CUDA 5

I am running the same OpenCL code on a Mac Pro equipped with a Nvidia GTX 580 running either of the following:
OS X 10.8.2 with Xcode 4.6
Windows XP 32 bit with Visual C++ 2008 enterprise and Nvidia CUDA toolkit 5.0
However I get the wrong results in Win XP.
To define the number of work items used I specify the work group size (192), the number of workgroups (256) and set the global number of work items used as work group size x workgroups (192 x 256 = 49152).
When I run this on the Apple platform all my results are correct; however, when I run it on the Win XP platform I get a result that is out by a factor of 1/8.
Doing some checks, I got the GPU to store what it thinks the global size is, and it reports the expected 49152. However, if I instead get the first work item of each work-group to atomically add the local size to a counter, I only get 6144, exactly 1/8 of the global size.
This problem seems to be a function of the number of work items set: if I set the number of work-groups to 32 or 64 I get the correct answer (with the work-group size held constant at 192). For any other values I have this problem, and my result may be off by a factor of 1/8, 1/4 or 1/2 depending on the number of work items used.
Is there any reason for this to occur, like 32-bit addressing limits or aggressive optimisations in the Nvidia library?
In the Apple OpenCL library, global write-only memory is initialised to 0 for numeric data types; in the Windows Nvidia library, global write-only memory is not initialised.
Therefore, when counters in global memory were incremented, their starting value was undefined. When this was suspected, a quick and foolish initialisation loop was put at the start of the kernel; this of course resulted in work-groups executed later zeroing the results of work-groups executed earlier, giving the observed reduction in the result in proportion to the number of work-groups executed on each compute unit.
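A more robust fix is to zero such buffers from the host before every launch rather than relying on driver behaviour. A minimal sketch, where 'queue', 'counter_buf' and 'buffer_bytes' are placeholders for your own objects:

/* Sketch: explicitly zero a global counter/result buffer before each
   kernel launch so correctness never depends on driver initialisation. */
cl_uint zero = 0;

/* OpenCL 1.2+: fill the buffer with a zero pattern on the device. */
clEnqueueFillBuffer(queue, counter_buf, &zero, sizeof(zero),
                    0, buffer_bytes, 0, NULL, NULL);

/* OpenCL 1.1 fallback: write a zeroed host array instead.            */
/* clEnqueueWriteBuffer(queue, counter_buf, CL_TRUE, 0, buffer_bytes,
                        zeroed_host_array, 0, NULL, NULL);             */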

Code profiling to improve performance: see CPU cycles inside mscorlib.dll?

I made a small test benchmark comparing .NET's System.Security.Cryptography AES implementation vs BouncyCastle.Org's AES.
Link to GitHub code: https://github.com/sidshetye/BouncyBench
I'm particularly interested in AES-GCM since it's a 'better' crypto algorithm and .NET is missing it. What I noticed was that while the AES implementations are very comparable between .NET and BouncyCastle, the GCM performance is quite poor (see the extra background below for more). I suspect it's due to many buffer copies or something. To look deeper, I tried profiling the code (VS2012 => Analyze menu bar option => Launch performance wizard) and noticed that there was a LOT of CPU burn inside mscorlib.dll.
Question: How can I figure out what's eating most of the CPU in such a case? Right now all I know is "some lines/calls in Init() burn 47% of CPU inside mscorlib.ni.dll" - but without knowing which specific lines, I don't know where to (try and) optimize. Any clues?
Extra background:
Based on the "The Galois/Counter Mode of Operation (GCM)" paper by David A. McGrew, I read "Multiplication in a binary field can use a variety of time-memory tradeoffs. It can be implemented with no key-dependent memory, in which case it will generally run several times slower than AES. Implementations that are willing to sacrifice modest amounts of memory can easily realize speeds greater than that of AES."
If you look at the results, the basic AES-CBC engine performances are very comparable. AES-GCM adds the GCM layer and reuses the AES engine beneath it in CTR mode (faster than CBC). However, GCM also adds multiplication in the GF(2^128) field on top of the CTR mode, so there could be other areas of slowdown. Anyway, that's why I tried profiling the code.
For the interested, here is my quick performance benchmark. It runs inside a Windows 8 VM, so YMMV. The test is configurable, but currently it simulates the crypto overhead of encrypting many cells of a database (=> many but small plaintext inputs).
Creating initial random bytes ...
Benchmark test is : Encrypt=>Decrypt 10 bytes 100 times
Name                time (ms)   plain (bytes)   encrypted (bytes)   byte overhead

.NET ciphers
AES128                 1.5969        10              32                220 %
AES256                 1.4131        10              32                220 %
AES128-HMACSHA256      2.5834        10              64                540 %
AES256-HMACSHA256      2.6029        10              64                540 %

BouncyCastle Ciphers
AES128/CBC             1.3691        10              32                220 %
AES256/CBC             1.5798        10              32                220 %
AES128-GCM            26.5225        10              42                320 %
AES256-GCM            26.3741        10              42                320 %
R - Rerun tests
C - Change size(10) and iterations(100)
Q - Quit
This is a rather lame move from Microsoft, as they broke a feature that worked well before Windows 8, as explained in this MSDN blog post:
On Windows 8 the profiler uses a different underlying technology than
what it does on previous versions of Windows, which is why the
behavior is different on Windows 8. With the new technology, the
profiler needs the symbol file (PDB) to know what function is
currently executing inside NGEN’d images.
(...)
It is however on our backlog to implement in the next version of Visual Studio.
The post gives directions to generate the PDB files yourself (thanks!).

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, Nvidia and Intel GPUs?
Should I analyse the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) determine good values simply by testing many combinations of local/global work sizes and choosing the best one.
NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, their atomic unit of execution, meaning that 32 threads/work-items are scheduled together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I am unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and you have 2300 work-items (the global size); 2300 is not divisible by 64 or 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilisation. Thus, it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if(globalID >= realNumberOfThreads)
globalID = 0;
This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
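On the host side, rounding the global size up could look like this. It is only a sketch; 'queue' and 'kernel' stand for your own objects, and the real element count is passed to the kernel as realNumberOfThreads:

size_t real_work_items = 2300;
size_t local_size      = 64;   /* a multiple of 32 also satisfies NVIDIA */
size_t global_size     = ((real_work_items + local_size - 1) / local_size)
                         * local_size;   /* 2304: every work-group is full */

/* pass real_work_items to the kernel so it can clamp the extra IDs */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);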
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. to store kernel-private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo; this in turn will determine your global size.
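Querying that limit takes only a couple of calls. A sketch, where 'kernel' and 'device' are your own objects:

size_t max_wg_size = 0, preferred_multiple = 0;

/* Largest local size this kernel can use on this device
   (limited by its register/local-memory usage).           */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);

/* Preferred multiple for the local size (warp/wavefront width). */
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);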

GPU programming via JOCL uses only 6 out of 80 shader cores?

I am trying to run a program on my GPU and, to start with an easy sample, I modified the first sample on http://www.jocl.org/samples/samples.html to run the following little script: I run n simultaneous "threads" (what's the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is far from what I expected:
for n=1 it takes 12.2 seconds
for n=2 it takes 6.3 seconds
for n=3 it takes 4.4 seconds
for n=4 it takes 3.4 seconds
for n=5 it takes 3.1 seconds
for n=6 and beyond, it takes 2.7 seconds.
So after n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no performance increase, which means only 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 only scales all the execution times by a linear factor.
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now, where to search for answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?
You are hard-coding the local work size to 1. Use a larger size or let the driver choose one for you.
Also, your kernel is not written in an OpenCL style. You should take out the for loop and let the driver handle the iteration for you.
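As a sketch of what that restructuring means (illustrative only, not the poster's actual kernel):

// Loop style (roughly what the question does): each work-item grinds
// through its own chunk of iterations, so only n work-items exist.
__kernel void tanh_loop(__global float *out, const int iters_per_item)
{
    int gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < iters_per_item; ++i)
        acc += tanh((float)(gid + i));
    out[gid] = acc;
}

// OpenCL style: one work-item per element; launch with a global size
// equal to the element count and pass NULL as the local work size so
// the driver picks one, letting all shader cores stay busy.
__kernel void tanh_per_element(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = tanh(in[gid]);
}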

Limit the number of compute units used by OpenCL

I need to limit the number of compute units used by my OpenCL application.
I'm running it on a CPU that has 8 compute units; I've confirmed that with CL_DEVICE_MAX_COMPUTE_UNITS.
The speedup I get with OpenCL is far more than the 8x that 8 compute units alone would explain (it is something like 600 times faster than the plain algorithm without OpenCL). I want to use just 1 compute unit because I need to see the real improvement from the same code being optimized by OpenCL.
It's just for testing; the real application will continue to use all the compute units.
Thanks for your help
If you are using a CPU, why don't you try the OpenCL device fission extension?
Device fission allows you to split a compute device into sub-devices. You can then create a command queue on a sub-device and enqueue kernels only to that subset of your CPU cores.
For example, you can divide your 8-core device into 8 sub-devices of 1 core each.
Take a look at the device fission example in the AMD APP SDK.
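A minimal sketch with the OpenCL 1.2 core API (the pre-1.2 extension works the same way via clCreateSubDevicesEXT and the _EXT-suffixed property names); 'cpu_device' is a placeholder for your CPU device id:

/* Sketch: split the CPU into 1-core sub-devices and build a context
   and queue on just the first one.  Error checking omitted. */
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_EQUALLY, 1, 0      /* 1 compute unit per sub-device */
};
cl_uint num_sub = 0;
clCreateSubDevices(cpu_device, props, 0, NULL, &num_sub);     /* query count */

cl_device_id sub_devices[64];
clCreateSubDevices(cpu_device, props, num_sub, sub_devices, NULL);

cl_context       ctx = clCreateContext(NULL, 1, &sub_devices[0], NULL, NULL, NULL);
cl_command_queue q   = clCreateCommandQueue(ctx, sub_devices[0], 0, NULL);
/* Kernels enqueued on 'q' now run on a single compute unit. */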
