Rank limit error on GNU APL 64-bit Windows

I have an old APL application that runs on DOS and performs FFT and IFFT. It generates a rank limit error on GNU APL. With a workspace of 55 GB there should be no limit on rank or symbols; the only limit that makes sense is a user-settable workspace size limit so we don't max out the memory on a 64-bit machine.
To test this, one can observe that a←(n⍴2)⍴2 fails with n>8 on 64-bit GNU APL, whereas a 16-bit APL for DOS runs until it is out of memory.
Is there a way to change the rank limit?

The default rank limit is indeed 8, but it can be configured using the MAX_RANK configuration parameter. You can either use a GNU APL configuration file to do so, or simply use a command-line parameter, e.g. MAX_RANK=64.
By the way, all APL implementations I know of have a maximum rank, and I believe your old DOS APL has such a limit too; you just happen to run out of memory before you hit it, because every additional axis doubles the number of elements (and the elements are at least one byte each) when you do a←(n⍴2)⍴2. Try a←(n⍴1)⍴2, which doesn't add additional elements, and you're likely to find that the maximum rank is 15 or 63 or something like that.
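To make the doubling concrete, here is a small illustrative sketch (plain C, not APL; it just prints 2^n with one byte per element) of how quickly memory is exhausted as the rank of (n⍴2)⍴2 grows:
#include <stdio.h>

/* Illustrative sketch only: element count for (n⍴2)⍴2, i.e. a rank-n
   array in which every axis has length 2, at one byte per element. */
int main(void)
{
    for (unsigned n = 8; n <= 40; n += 8) {
        unsigned long long elements = 1ULL << n;   /* 2^n elements */
        printf("rank %2u: %llu elements, at least %llu bytes\n",
               n, elements, elements);
    }
    return 0;
}
At rank 40 that is already about a terabyte, which is why a memory limit is normally hit long before any reasonable rank limit.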

Related

OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

I'm confused by CL_DEVICE_MAX_COMPUTE_UNITS. For instance, for my Intel GPU on a Mac this number is 48. Does this mean the maximum number of parallel tasks that can run at the same time is 48, or some multiple of 48, maybe 96, 144, ...? (I know each compute unit is composed of one or more processing elements and each processing element is actually in charge of a "thread". What if each of the 48 compute units is composed of more than one processing element?) In other words, for my Mac, is the "ideal" speedup, although impossible in reality, 48 times faster than a CPU core (assuming the single-"core" computation speed of the CPU and the GPU is the same), or some multiple of 48, maybe 96, 144, ...?
Summary: Your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput is 768 FLOPs per (GPU) clock and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of this depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky with CL_DEVICE_MAX_COMPUTE_UNITS in its GPU driver.
The clGetDeviceInfo documentation (OpenCL 2.0) says for CL_DEVICE_MAX_COMPUTE_UNITS:
The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.
However, the Intel graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units) --- an EU is a grouping of the SIMD ALUs plus slots for 7 different SIMD threads (registers and whatnot). Each SIMD thread represents 8, 16, or 32 work-items depending on what the compiler picks (we want higher, but register pressure can force us lower).
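For reference, the value being discussed is queried on the host roughly like this (a minimal sketch; it assumes a cl_device_id called device has already been obtained, and omits error checking):
#include <stdio.h>
#include <CL/cl.h>   /* on macOS: #include <OpenCL/opencl.h> */

/* Sketch: print what the driver reports as "compute units"
   (on Intel's GPU driver this is the EU count, as described above). */
static void print_compute_units(cl_device_id device)
{
    cl_uint cu = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cu), &cu, NULL);
    printf("CL_DEVICE_MAX_COMPUTE_UNITS: %u\n", (unsigned)cu);
}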
A work-group is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs in recent hardware; pick the GEN8 or GEN9 documents. Each slice has its own SLM, barriers, and L3. Given that your Apple machine is reporting 48 EUs, I'd say that you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and with those arch docs above). For "speedup" I'm comparing against a single-threaded FP32 calculation on the CPU. With good parallelization etc. on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-adds (so really two ops each), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply-add)
= 192 SIMD4 ops per clock
= 192 * 4 lanes = 768 FLOPs per clock for single-precision floating point
So your ideal speedup is actually ~768. But there's a bunch of things that chip into this ideal number.
Setup and teardown time. Let's ignore this (assume the workload time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in (crudely 1/3 maybe? 3 GHz on the CPU vs 1 GHz on the GPU).
If the computation is not heavily multiply-adds ("mads"), divide by 2 since I doubled above. Many important workloads are "mad"-dominated though.
The execution must be mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8, 16, or 32 work-items) has to execute that code.
Register bank collision delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent depending on register pressure).
Buffer address calculation can chew off a few percent too (the EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or too many barriers, the GPU must leave some of the EUs idle so that there's enough SLM for each work-item on the machine. (You can tweak your algorithm to fix this.)
We must keep the workload compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
I'm probably forgetting other things that can go wrong.
We call efficiency the percentage of the theoretical peak that we achieve. So if our workload runs at ~530 FLOPs per clock, we are about 70% efficient relative to the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
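Putting those numbers together, a rough back-of-the-envelope sketch (the 3 GHz CPU, 1 GHz GPU and 70% efficiency figures are just the illustrative assumptions used above, and the CPU core is assumed to retire roughly one FLOP per clock):
#include <stdio.h>

/* Back-of-the-envelope sketch using the figures from this answer.
   Clock speeds and efficiency are illustrative assumptions only. */
int main(void)
{
    double eus = 48.0;            /* EUs reported by the driver          */
    double simd4_issues = 2.0;    /* SIMD4 ops issued per EU per clock   */
    double fma = 2.0;             /* multiply + add counted separately   */
    double lanes = 4.0;           /* lanes per SIMD4 op                  */
    double peak = eus * simd4_issues * fma * lanes;   /* 768 FLOPs/clock */

    double gpu_ghz = 1.0, cpu_ghz = 3.0;   /* assumed clocks             */
    double efficiency = 0.70;              /* ~530 of 768 FLOPs/clock    */

    printf("peak GPU FLOPs per clock: %.0f\n", peak);
    printf("ideal speedup vs one CPU core: ~%.0f x\n", peak * gpu_ghz / cpu_ghz);
    printf("at %.0f%% efficiency: ~%.0f x\n",
           efficiency * 100.0, peak * gpu_ghz / cpu_ghz * efficiency);
    return 0;
}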
The ideal speedup you can get is the total number of processing elements, which in your case corresponds to 48 * the number of processing elements per compute unit. I do not know of a way to get the number of processing elements from OpenCL (that does not mean it is not possible); however, you can just google it for your GPU.
To my knowledge, a compute unit consists of one or more processing elements (for GPUs usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction, multiple data) fashion. This means that the threads of a compute unit all execute the same operation, but on different data.
Also, the speedup you get depends on how you execute a kernel function. Since a single work-group cannot run on multiple compute units, you need a sufficient number of work-groups in order to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
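For what it's worth, that preferred multiple can be queried per kernel roughly like this (a sketch; kernel and device are assumed to exist already, and error checking is omitted):
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: query the preferred work-group size multiple for a compiled kernel. */
static void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t preferred = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);
    printf("preferred work-group size multiple: %zu\n", preferred);
}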

OpenCL - Same code, correct on Apple + Xcode, incorrect on Win XP + MSVS 2008 + Nvidia CUDA 5

I am running the same OpenCL code on a Mac Pro equipped with an Nvidia GTX 580 running either of the following:
OS X 10.8.2 with Xcode 4.6
Windows XP 32 bit with Visual C++ 2008 enterprise and Nvidia CUDA toolkit 5.0
However, I get wrong results on Win XP.
To define the number of work-items used, I specify the work-group size (192) and the number of work-groups (256), and set the global number of work-items as work-group size x work-groups (192 x 256 = 49152).
When I run this on the Apple platform all my results are correct; however, when I run it on the Win XP platform I get a result which is out by a factor of 1/8.
Doing some checks, I got the GPU to store what it thinks is the global size and it reports the expected number, 49152; however, if I instead get the first work-item of each work-group to atomically add the local size to a counter, I only get 6144, exactly 1/8 of the global size.
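The counting check described above looks roughly like this as an OpenCL kernel (a sketch; the buffer name is a placeholder):
__kernel void count_scheduled_items(__global int *counter)
{
    /* One work-item per work-group adds that group's size to a global
       counter; the counter must be zeroed by the host beforehand. */
    if (get_local_id(0) == 0)
        atomic_add(counter, (int)get_local_size(0));
}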
This problem seems to be a function of the number of work-items set: if I set the number of work-groups to 32 or 64 I get the correct answer (with the work-group size held constant at 192). However, for any other values I have this problem, and my result may be off by a factor of 1/8, 1/4 or 1/2 depending on the number of work-items used.
Is there any reason for this to occur, like 32-bit addressing limits or aggressive optimisations in the Nvidia library?
In the Apple OpenCL library, global write-only memory is initialised to 0 for numeric data types; in the Windows Nvidia library, global write-only memory is not initialised.
Therefore, when counters in global memory were incremented, their starting value was undefined. When this was suspected, a quick and foolish initialisation loop was put at the start of the kernel; this of course resulted in work-groups executed later zeroing the results of work-groups executed earlier, giving the observed reduction in the result in proportion to the number of work-groups executed on each compute unit.
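One way to avoid depending on undefined initial buffer contents is to zero the buffer explicitly from the host before the kernel runs, for example (a sketch; queue, buf and size_in_bytes are assumed to exist, and clEnqueueFillBuffer requires OpenCL 1.2):
#include <CL/cl.h>

/* Sketch: zero a device buffer before launching the kernel, so results
   never depend on whatever the driver left in freshly allocated memory.
   size_in_bytes must be a multiple of the pattern size (here sizeof(cl_int)). */
static void zero_buffer(cl_command_queue queue, cl_mem buf, size_t size_in_bytes)
{
    const cl_int zero = 0;
    clEnqueueFillBuffer(queue, buf, &zero, sizeof(zero),
                        0, size_in_bytes, 0, NULL, NULL);
}
On pre-1.2 platforms, enqueueing a write of a zeroed host array achieves the same thing.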

Code profiling to improve performance: see CPU cycles inside mscorlib.dll?

I made a small test benchmark comparing .NET's System.Security.Cryptography AES implementation vs BouncyCastle.Org's AES.
Link to GitHub code: https://github.com/sidshetye/BouncyBench
I'm particularly interested in AES-GCM since it's a 'better' crypto algorithm and .NET is missing it. What I noticed was that while the AES implementations are very comparable between .NET and BouncyCastle, the GCM performance is quite poor (see the extra background below for more). I suspect it's due to many buffer copies or something. To look deeper, I tried profiling the code (VS2012 => Analyze menu bar option => Launch performance wizard) and noticed that there was a LOT of CPU burn inside mscorlib.dll.
Question: How can I figure out what's eating most of the CPU in such a case? Right now all I know is "some lines/calls in Init() burn 47% of CPU inside mscorlib.ni.dll" - but without knowing what specific lines, I don't know where to (try and) optimize. Any clues?
Extra background:
Based on the "The Galois/Counter Mode of Operation (GCM)" paper by David A. McGrew, I read "Multiplication in a binary field can use a variety of time-memory tradeoffs. It can be implemented with no key-dependent memory, in which case it will generally run several times slower than AES. Implementations that are willing to sacrifice modest amounts of memory can easily realize speeds greater than that of AES."
If you look at the results, the basic AES-CBC engine performances are very comparable. AES-GCM adds the GCM and reuses the AES engine beneath it in CTR mode (faster than CBC). However, GCM also adds multiplication in the GF(2^128) field in addition to the CTR mode, so there could be other areas of slowdown. Anyway, that's why I tried profiling the code.
For the interested, here is my quick performance benchmark. It's run inside a Windows 8 VM and YMMV. The test is configurable, but currently it's set up to simulate the crypto overhead of encrypting many cells of a database (=> many but small plaintext inputs).
Creating initial random bytes ...
Benchmark test is : Encrypt=>Decrypt 10 bytes 100 times
Name time (ms) plain(bytes) encrypted(bytes) byte overhead
.NET ciphers
AES128 1.5969 10 32 220 %
AES256 1.4131 10 32 220 %
AES128-HMACSHA256 2.5834 10 64 540 %
AES256-HMACSHA256 2.6029 10 64 540 %
BouncyCastle Ciphers
AES128/CBC 1.3691 10 32 220 %
AES256/CBC 1.5798 10 32 220 %
AES128-GCM 26.5225 10 42 320 %
AES256-GCM 26.3741 10 42 320 %
R - Rerun tests
C - Change size(10) and iterations(100)
Q - Quit
This is a rather lame move from Microsoft, as they obviously broke a feature that worked well before Windows 8 but no longer does, as explained in this MSDN blog post:
On Windows 8 the profiler uses a different underlying technology than
what it does on previous versions of Windows, which is why the
behavior is different on Windows 8. With the new technology, the
profiler needs the symbol file (PDB) to know what function is
currently executing inside NGEN’d images.
(...)
It is however on our backlog to implement in the next version of Visual Studio.
The post gives directions to generate the PDB files yourself (thanks!).

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL), in order to find good values, just test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD on the other hand recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and you have 2300 work-items (the global size); 2300 is not divisible by 64 nor by 32. If you don't specify the local size, OpenCL will choose a bad local size for you. What happens when you don't have a local size which is a multiple of the atomic unit of execution is that you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that you get a global size which is a multiple of 32/64 and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;   /* padding work-items redo the work of work-item 0 */
This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
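On the host side, rounding the global size up to the next multiple of the local size (2300 to 2304 in the example above) can be done with something like this sketch:
#include <stddef.h>

/* Sketch: round the global work size up to the next multiple of the local
   work size, e.g. round_up(2300, 32) == 2304 and round_up(2300, 64) == 2304. */
static size_t round_up(size_t global_size, size_t local_size)
{
    return ((global_size + local_size - 1) / local_size) * local_size;
}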
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. to store kernel-private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, which in turn will determine your global size.
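That per-kernel maximum local size can be obtained roughly as follows (a sketch; kernel and device are assumed to exist, error checking omitted):
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: the largest work-group size this particular kernel can be
   enqueued with on this device, limited by its register/local memory use. */
static void print_max_work_group_size(cl_kernel kernel, cl_device_id device)
{
    size_t max_wg = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    printf("max work-group size for this kernel: %zu\n", max_wg);
}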

Increasing (or decreasing) the memory available to R processes

I would like to increase (or decrease) the amount of memory available to R. What are the methods for achieving this?
From:
http://gking.harvard.edu/zelig/docs/How_do_I2.html (mirror)
Windows users may get the error that R has run out of memory.
If you have R already installed and subsequently install more RAM, you may have to reinstall R in order to take advantage of the additional capacity.
You may also set the amount of available memory manually. Close R, then right-click on your R program icon (the icon on your desktop or in your programs directory). Select "Properties", and then select the "Shortcut" tab. Look for the "Target" field and, after the closing quotes around the location of the R executable, add
--max-mem-size=500M
You may increase this value up to 2GB or the maximum amount of physical RAM you have installed.
If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field:
--max-vsize=500M
or as appropriate. You can always check how much memory R has available by typing at the R prompt
memory.limit()
which gives you the amount of available memory in MB. In previous versions of R you needed to use: round(memory.limit()/2^20, 2).
Use memory.limit(). You can increase the default using this command, memory.limit(size=2500), where the size is in MB. You need to be using 64-bit R in order to take real advantage of this.
One other suggestion is to use memory efficient objects wherever possible: for instance, use a matrix instead of a data.frame.
For Linux/Unix, I can suggest the unix package.
To increase the memory limit in linux:
install.packages("unix")
library(unix)
rlimit_as(1e12)  # raises the address-space limit to 1e12 bytes (~1 TB)
You can also check the memory with this:
rlimit_all()
For detailed information:
https://rdrr.io/cran/unix/man/rlimit.html
You can also find further info here:
limiting memory usage in R under linux
Microsoft Windows grants memory requests from processes as long as it can satisfy them.
There is no limit on the memory that can be provided to a process, except the virtual address space size.
On 32-bit systems the virtual address space is 4GB per process, no matter how many applications you are running; in practice a 32-bit process can typically use about 2GB of it for its own allocations.
In practice, Windows backs allocated memory with RAM or the page file, depending on the process's requests and the paging-file mechanism.
But another limit is the size of the paging file. If you have a small paging file, you cannot allocate large amounts of memory. You can increase the size of the paging file, following Microsoft's documentation, to have more memory space.
Buy more RAM.
Switch to a 64-bit OS. Combine with point 1.
To increase the amount of memory allocated to R you can use memory.limit
memory.limit(size = ...)
Or
memory.size(max = ...)
About the arguments
size - numeric. If NA report the memory limit, otherwise request a new limit, in Mb. Only values of up to 4095 are allowed on 32-bit R builds, but see ‘Details’.
max - logical. If TRUE the maximum amount of memory obtained from the OS is reported, if FALSE the amount currently in use, if NA the memory limit.
In RStudio, to increase:
file.edit(file.path("~", ".Rprofile"))
then in .Rprofile type this and save
invisible(utils::memory.limit(size = 60000))
To decrease:
open .Rprofile
invisible(utils::memory.limit(size = 30000))
save and restart RStudio.
