Code profiling to improve performance : see CPU cycles inside mscorlib.dll? - encryption

I made a small test benchmark comparing .NET's System.Security.Cryptography AES implementation vs BouncyCastle.Org's AES.
Link to GitHub code: https://github.com/sidshetye/BouncyBench
I'm particularly interested in AES-GCM since it's a 'better' crypto algorithm and .NET is missing it. What I noticed is that while the plain AES implementations are very comparable between .NET and BouncyCastle, the GCM performance is quite poor (see the extra background below for more). I suspect it's due to many buffer copies or something similar. To look deeper, I tried profiling the code (VS2012 => Analyze menu => Launch Performance Wizard) and noticed that there was a LOT of CPU burn inside mscorlib.dll
Question: How can I figure out what's eating most of the CPU in such a case? Right now all I know is "some lines/calls in Init() burn 47% of CPU inside mscorlib.ni.dll" - but without knowing which specific lines, I don't know where to (try and) optimize. Any clues?
Extra background:
Based on the "The Galois/Counter Mode of Operation (GCM)" paper by David A. McGrew, I read "Multiplication in a binary field can use a variety of time-memory tradeoffs. It can be implemented with no key-dependent memory, in which case it will generally run several times slower than AES. Implementations that are willing to sacrifice modest amounts of memory can easily realize speeds greater than that of AES."
If you look at the results, the basic AES-CBC engine performances are very comparable. AES-GCM adds GCM on top and reuses the AES engine beneath it in CTR mode (faster than CBC). However, GCM also adds multiplication in the GF(2^128) field on top of the CTR mode, so there could be other areas of slowdown. Anyway, that's why I tried profiling the code.
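To give a feel for where the extra cycles go, here is a minimal C sketch of the table-free multiplication in GF(2^128) described in the GCM spec (an illustration only, not BouncyCastle's or .NET's code; the name gf128_mul is mine). With no precomputed tables it needs up to 128 shift/XOR passes over a 16-byte block, which is exactly the "several times slower than AES" case from the quote above; table-driven implementations trade memory for speed, as McGrew describes.

#include <stdint.h>
#include <string.h>

/* Table-free GF(2^128) multiply in GCM bit order: bit 0 of each byte is
 * the most significant bit, and R = 0xE1 || 0^120 is the reduction
 * constant. Computes Z = X * Y. */
static void gf128_mul(uint8_t z[16], const uint8_t x[16], const uint8_t y[16])
{
    uint8_t v[16];
    memcpy(v, y, 16);
    memset(z, 0, 16);

    for (int i = 0; i < 128; i++) {
        /* If bit i of X (MSB first) is set, accumulate V into Z. */
        if (x[i / 8] & (0x80 >> (i % 8))) {
            for (int j = 0; j < 16; j++)
                z[j] ^= v[j];
        }
        /* V = V >> 1; if a bit fell off the end, reduce by XORing in R. */
        int carry = v[15] & 1;
        for (int j = 15; j > 0; j--)
            v[j] = (uint8_t)((v[j] >> 1) | ((v[j - 1] & 1) << 7));
        v[0] >>= 1;
        if (carry)
            v[0] ^= 0xE1;
    }
}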
For the interested, here is my quick performance benchmark. It runs inside a Windows 8 VM, so YMMV. The test is configurable, but currently it simulates the crypto overhead of encrypting many cells of a database (i.e. many but small plaintext inputs).
Creating initial random bytes ...
Benchmark test is : Encrypt=>Decrypt 10 bytes 100 times
Name                 time (ms)   plain (bytes)   encrypted (bytes)   byte overhead
.NET ciphers
AES128                  1.5969              10                  32           220 %
AES256                  1.4131              10                  32           220 %
AES128-HMACSHA256       2.5834              10                  64           540 %
AES256-HMACSHA256       2.6029              10                  64           540 %
BouncyCastle Ciphers
AES128/CBC              1.3691              10                  32           220 %
AES256/CBC              1.5798              10                  32           220 %
AES128-GCM             26.5225              10                  42           320 %
AES256-GCM             26.3741              10                  42           320 %
R - Rerun tests
C - Change size(10) and iterations(100)
Q - Quit

This is a rather lame move from Microsoft: they broke a feature that worked well before Windows 8, as explained in this MSDN blog post:
On Windows 8 the profiler uses a different underlying technology than what it does on previous versions of Windows, which is why the behavior is different on Windows 8. With the new technology, the profiler needs the symbol file (PDB) to know what function is currently executing inside NGEN'd images.
(...)
It is however on our backlog to implement in the next version of Visual Studio.
The post gives directions to generate the PDB files yourself (thanks!).
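If you follow those directions, generating the missing symbols boils down to one command per native image, roughly of this form (run from an elevated Visual Studio command prompt; the paths are placeholders and differ per machine):

ngen createPdb <path to mscorlib.ni.dll> <folder to write the PDB into>

Once the profiler's symbol settings point at that folder, the samples inside mscorlib.ni.dll should resolve to actual function names instead of one opaque block.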

Related

Rank limit error on GNU APL 64-bit Windows

I have an old APL application that runs on DOS and performs FFT and IFFT. It generates a rank limit error on GNU APL. With a 55 GB workspace there should be no limit on rank or symbols; the only limit that makes sense is a user-settable workspace size limit, so we don't max out the memory on a 64-bit machine.
To test this, one can observe that a←(n⍴2)⍴2 fails with n>8 on 64-bit GNU APL, whereas a 16-bit APL for DOS runs until it is out of memory.
Is there a way to change the rank limit?
The default rank limit is indeed 8, but it can be configured using the MAX_RANK configuration parameter. You can either use a GNU APL configuration file to do so, or simply pass it on the command line, e.g. MAX_RANK=64.
By the way, all APL implementations I know of have a maximum rank, and I believe your old DOS APL has such a limit too; you just happen to run out of memory before you hit it, since each added axis doubles the number of elements (and each element is at least one byte) when you do a←(n⍴2)⍴2. Try a←(n⍴1)⍴2, which doesn't add additional elements, and you're likely to find that the maximum rank is 15 or 63 or something like that.

Generating CPU utilization levels

First, I would like to let you know that I recently asked this question already; however, it was considered unclear, see Linux: CPU benchmark requiring longer time and different CPU utilization levels. This is a new attempt to formulate the question using a different approach.
What I need: In my research, I look at the CPU utilization of a computer and analyze the CPU utilization pattern within a period of time. For example, a CPU utilization pattern within the time period 0 to 10 has the following form:
time, % CPU used
0 , 21.1
1 , 17
2 , 18
3 , 41
4 , 42
5 , 60
6 , 62
7 , 62
8 , 61
9 , 50
10 , 49
I am interested in finding a simple representation for a given CPU utilization pattern. For the evaluation part, I need to create some CPU utilization patterns on my laptop, which I will then record and analyse. These CPU utilization patterns should
- cover a time period of more than 5 minutes, ideally about 20 minutes;
- show some kind of dynamic behavior, i.e. the % CPU used should not be (almost) constant over time, but should vary.
My question: How can I create such a utilization pattern? Of course, I could just run an arbitrary program on my laptop and obtain some pattern, but that solution is not ideal: a reader of my work would have no way to repeat the experiment, because they have no access to the program I used. It would therefore be much better to use something standard instead of an arbitrary program (in my previous post I was thinking of open source CPU benchmarks, for example). Can anyone recommend something?
Many thanks!
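One way to make such a pattern reproducible for readers is a tiny load generator they can compile themselves. Below is a minimal single-core C sketch for Linux/POSIX (the 100 ms slot length, the sine-shaped profile and all names are illustrative choices, not something prescribed by the question or answers): it busy-spins for a fraction of each slot and sleeps for the rest, so the CPU usage drifts between roughly 10% and 90% over a 20-minute run. Run one instance per core if the whole machine should follow the pattern.

/* Duty-cycle CPU load generator; compile with: cc -O2 loadgen.c -o loadgen -lm */
#include <math.h>
#include <time.h>

/* Busy-spin until the given absolute CLOCK_MONOTONIC time. */
static void spin_until(const struct timespec *deadline)
{
    struct timespec now;
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while (now.tv_sec < deadline->tv_sec ||
            (now.tv_sec == deadline->tv_sec && now.tv_nsec < deadline->tv_nsec));
}

int main(void)
{
    const double pi = 3.141592653589793;
    const double slot_s = 0.1;                  /* 100 ms control slot */
    const int slots = (int)(20 * 60 / slot_s);  /* 20-minute run       */

    for (int i = 0; i < slots; i++) {
        double t = i * slot_s;
        /* Target utilization: slow sine with a 5-minute period. */
        double duty = 0.5 + 0.4 * sin(2.0 * pi * t / 300.0);

        struct timespec busy_end;
        clock_gettime(CLOCK_MONOTONIC, &busy_end);
        busy_end.tv_nsec += (long)(duty * slot_s * 1e9);
        busy_end.tv_sec += busy_end.tv_nsec / 1000000000L;
        busy_end.tv_nsec %= 1000000000L;
        spin_until(&busy_end);                  /* burn CPU for duty * 100 ms    */

        struct timespec idle = { 0, (long)((1.0 - duty) * slot_s * 1e9) };
        nanosleep(&idle, NULL);                 /* idle for the rest of the slot */
    }
    return 0;
}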
I suggest a moving average. Select a window size and use it to average over. You'll need to decide what type of patterns you want to identify since the wider the window, the more smoothing you get and the fewer "features" you'll see. And CPU activity is very bursty. For example, if you are trying to identify cache bottlenecks, you'll want a small window, probably in the 10ms to 100ms range. If instead you want to correlate to longer term features, such as energy or load, you'll want a larger window, perhaps 10sec to minutes.
It looks like you are using OS provided CPU usage and not hardware registers. This means that the OS is already doing some smoothing. It may also be doing estimation for some performance values. Try to find documentation on this if you are integrating over a smaller window. A word of warning: this level of information can be hard to find. You may have to do a lot of digging. Depending upon your familiarity with kernel code, it may be easier to look at the code.
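As a sketch of that smoothing step (assuming one utilization sample per line on stdin and an arbitrary window of 10 samples; neither detail comes from the answer above):

#include <stdio.h>

#define WINDOW 10   /* arbitrary window size; widen it for more smoothing */

int main(void)
{
    double ring[WINDOW] = { 0 };
    double sum = 0.0, value;
    int n = 0;

    while (scanf("%lf", &value) == 1) {
        sum -= ring[n % WINDOW];          /* drop the oldest sample in the window */
        ring[n % WINDOW] = value;
        sum += value;
        n++;
        int filled = n < WINDOW ? n : WINDOW;
        printf("%d %.2f\n", n - 1, sum / filled);   /* sample index, moving average */
    }
    return 0;
}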

How to make the most of SIMD in OpenCL?

In the optimization guide of Beignet, an open source implementation of OpenCL targeting Intel GPUs, it says:
Work group size should be larger than 16 and be a multiple of 16. As the two possible SIMD lane widths on Gen are 8 and 16, we need to follow this rule to not waste SIMD lanes.
Also mentioned in the Compute Architecture of Intel Processor Graphics Gen7.5:
For Gen7.5 based products, each EU has seven threads for a total of 28 Kbytes of general purpose register file (GRF).
...
On Gen7.5 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for SIMD-32 x 7 threads = 224 kernel instances executing concurrently on a single EU.
If I understand it correctly, using SIMD-16 x 7 threads = 112 kernel instances as an example: in order to run 112 kernel instances on one EU, the work group size needs to be 16. The OpenCL compiler will then fold 16 kernel instances into a 16-lane SIMD thread, do this 7 times for 7 work groups, and run them on a single EU?
Question 1: Am I correct so far?
However, the OpenCL spec also provides vector data types, so it seems feasible to make full use of the SIMD-16 compute resources in an EU by conventional SIMD programming (as in NEON and SSE).
Question 2: If so, using a vector-16 data type already makes explicit use of the SIMD-16 resources, and hence removes the at-least-16-items-per-work-group restriction. Is this the case?
Question 3: If all of the above is true, how do the two approaches compare with each other: 1) 112 threads folded into 7 SIMD-16 threads by the OpenCL compiler; 2) 7 native threads coded to explicitly use vector-16 data types and SIMD-16 operations?
Almost. You are making the assumption that there is one thread per work group (N.B. a thread in this context is what CUDA calls a "warp"; in Intel GPU speak, a work item is a SIMD channel of a GPU thread). Without subgroups, there is no way to force a work group to be exactly one thread. For instance, if you choose a WG size of 16, the compiler is still free to compile SIMD8 and spread it across two SIMD8 threads. Keep in mind that the compiler chooses the SIMD width before the WG size is known to it (clCompileProgram precedes clEnqueueNDRangeKernel). The subgroups extension might allow you to force the SIMD width, but it is definitely not implemented on Gen7.5.
OpenCL vector types are an optional explicit vectorization step on top of the implicit vectorization that already happens automatically. Were you to use float16, for example, each work item would process 16 floats, but the compiler would still compile at least SIMD8, so each GPU thread would be processing 8 * 16 floats (in parallel, though). That might be a bit of overkill. Ideally we don't want to have to explicitly vectorize our CL by using explicit OpenCL vector types, but it can be helpful sometimes if the kernel is not doing enough work (kernels that are too short can be bad). Somewhere it says float4 is a good rule of thumb.
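As a concrete (hypothetical) illustration of the explicit route, here is a small OpenCL C kernel using float4, so each work item handles four floats on top of whatever SIMD width the compiler picks for the work items themselves:

__kernel void scale4(__global const float4 *in,
                     __global float4 *out,
                     const float k)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * k;   /* one float4 multiply = 4 scalar multiplies per work item */
}

With a SIMD-8 compile, one GPU thread would then be working on 8 x 4 floats at a time, the float16 arithmetic above scaled down to the float4 rule of thumb.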
I think you meant 112 work items? By native thread do you mean CPU threads or GPU threads?
If you meant CPU threads, the usual arguments about GPUs apply: GPUs are good when your program doesn't diverge much (all instances take similar paths) and when you use the data enough times to mitigate the cost of transferring it to and from the GPU (arithmetic intensity).
If you meant GPU threads (the Gen SIMD8 or SIMD16 critters), there is no (publicly visible) way to program them explicitly at the moment (EDIT: see the subgroups extension, not available on Gen7.5). If you were able to, it would be a similar trade-off to assembly language: the job is harder, and the compiler sometimes just does a better job than we can, but when you are solving a specific problem and have better domain knowledge, you can generally do better with enough programming effort (until the hardware changes and your clever program's assumptions become invalidated).

OpenCL Accuracy and Performance Issues when using MacPro (Firepro D500)

I have run into a strange issue while running the same OpenCL kernel on multiple machines. Please see below:
OS      OpenCL version   GPU                Output accuracy
Linux   2.0              AMD R9 290X        Good
Mac     1.2              Nvidia GT-750M     Good
Mac     1.2              AMD FirePro D500   Incorrect
Linux   1.1              Nvidia Tesla K20   Good
I posted on Apple forums, and the only reply I have received is that I should disable fast path math. I am not enabling it anywhere.
In terms of performance, the code runs two times slower on the Firepro when compared to the other discrete GPUs (Tesla and R9) in the list.
Can someone please tell what could be going on? I am happy to share the code if needed.
Here is the OpenCL kernel (some of the variable/function names are not proper): http://pastebin.com/Kt4TinXt
Here is how it is called from the host:
sentence_length = 1024
num_sentences = 6
count = 0
for (sentence in textfile)
{
    sentences += sentence
    count++
    if (count == num_sentences - 1)
        enqueuekernel(sentences)
}
A sentence is basically a group of 1024 words, and the parallelism is at the word level. I chose to use 128 work items per word because that allowed me to keep neu1 and neu1e in shared memory. I tried other combinations, such as 'layer1_size' work items per word or one wavefront per word, but those did not give good performance at all. Even now the performance is not that great, but it gives me around 2.8x (compared to a 6-core Xeon) on the R9 and Tesla.
Please let me know if more detail is needed!
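One thing worth ruling out on the fast-math point: -cl-fast-relaxed-math only takes effect if it is passed as a build option on the host, so a sketch of the relevant call looks like this (program and device are placeholders, not the poster's actual code):

/* Empty options string: per the spec, -cl-fast-relaxed-math stays off. */
cl_int err = clBuildProgram(program, 1, &device, "", NULL, NULL);

/* This, by contrast, would opt in to the relaxed math paths:
   clBuildProgram(program, 1, &device, "-cl-fast-relaxed-math", NULL, NULL); */

If the options string is already empty and the D500 still produces different results, the difference is more likely inside the driver's own compiler, which is hard to rule out without comparing intermediate values.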

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL), to find good values, simply test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) workgroup size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work items are scheduled together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So suppose you are doing some computation and have, say, 2300 work items (the global size); 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you end up with idle threads, which leads to poor device utilization. It can therefore be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)   /* one of the padded-on dummy work items */
    globalID = 0;
This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
Hope that answered your question. GL HF!
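On the host side, the padding described above is just a round-up of the global size before the enqueue; a sketch (queue, kernel and the 2300-item problem size are placeholders):

size_t local  = 64;                                      /* one AMD wavefront      */
size_t global = ((2300 + local - 1) / local) * local;    /* rounds 2300 up to 2304 */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);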
If your processing essentially uses little memory (e.g. to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further: first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, and let that in turn determine your global size.
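For reference, that query looks roughly like this (kernel and device are placeholders):

size_t max_wg_size = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);
/* max_wg_size is the largest local size this kernel can use on this
 * device, given its register and local memory usage. */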
