How does ATLAS tuning work? - linear-algebra

It's fairly well-known that ATLAS uses "blocking" or "tiling" versions of matrix computation algorithms, which substantially improve performance.
It also appears that ATLAS has some architectural defaults, which have been computed manually. And it's possible to do a search to determine other values for NB (a #define macro which I believe stands for number of blocks).
But how does it work? How are the values determined? Do the algorithms just run a bunch of times with different values, Monte Carlo style, until some kind of optimum is found?
Here's a hypothetical, too. Let's say you copied a blocked ATLAS algorithm into C++ templates and had a 128-bit rational type. Could you derive NB for the rational version of the algorithm in some way from an ATLAS-tuned NB value from the double version of the algorithm?
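For illustration only, here is a minimal Python sketch (not ATLAS's generated C kernels) of the blocking/tiling the question refers to. The function name block_matmul and the parameter nb are hypothetical; nb plays the role of the tunable blocking parameter that an autotuner would sweep over, timing the kernel for several candidate values and keeping the fastest on the target machine.

import numpy as np

def block_matmul(A, B, nb):
    # C = A @ B computed block by block; each nb x nb update is meant
    # to stay resident in cache while it is being accumulated.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                C[i0:i0+nb, j0:j0+nb] += A[i0:i0+nb, k0:k0+nb] @ B[k0:k0+nb, j0:j0+nb]
    return C

# Sanity check against numpy's own multiply:
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(block_matmul(A, B, nb=64), A @ B)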

Related

Understanding the complex-step in a physical sense

I think I understand what complex step is doing numerically/algorithmically.
But some questions still linger. The first two might have the same answer.
1- I replaced the partial derivative calculations of the 'Betz_limit' example with complex step and removed the analytical gradients. Looking at the recorded design_var evolution, none of the values are complex. Aren't they supposed to show up somehow as a+bi? Or does it always step in real space?
2- I'm trying to picture 'cs' used in a physical context. For example, a design variable of beam length (m), an objective of mass (kg) and a constraint of loads (Nm). I could be using an explicit component to calculate these (pure Python) or an external code component (pure Fortran). Numerically they can all handle complex numbers, but obviously the mass is a real value. So when we say a code is capable of handling complex numbers, is it just a matter of handling a+bi (where the actual mass is always 'a' and b is always equal to 0)?
3- How about the step size? I understand there won't be any subtractive cancellation errors, but what if I have a design variable normalized/scaled to 1 and a range of 0.8 to 1.2? Decreasing the step to 1e-10 does not make sense. I am a bit confused there.
The ability to use complex arithmetic to compute derivative approximations is based on the mathematics of complex arithmetic.
You should read about the theory to get a better understanding of why it works and how the step size issue is resolved with complex-step vs finite-difference.
There is no physical interpretation that you can make for the complex-step method. You are simply taking advantage of the mathematical properties of complex arithmetic to approximate a derivative more accurately than FD can. So the key is that your code is set up to do complex arithmetic correctly.
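As a concrete illustration of that mathematical property (a minimal sketch in plain Python/numpy, not OpenMDAO's machinery): for a real-valued analytic function f, the approximation is f'(x) ≈ Im(f(x + ih))/h, and the design variable x itself never leaves the real axis.

import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    # The perturbation lives only in the imaginary part; x stays real,
    # which is why recorded design variable values never show up as a+bi.
    return np.imag(f(complex(x, h))) / h

f = lambda x: np.exp(x) * np.sin(x)                  # any analytic test function
exact = np.exp(1.0) * (np.sin(1.0) + np.cos(1.0))    # analytic derivative at x = 1
approx = complex_step_derivative(f, 1.0)
print(approx, exact)                                 # agree to machine precision despite h = 1e-20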
Sometimes, engineering analyses do actually leverage complex numbers. One aerospace example of this is the Joukowski transformation. In electrical engineering, complex numbers come up all the time in load-flow analysis of AC circuits. If you have such an analysis, then you cannot easily use complex-step to approximate derivatives, since the analysis itself is already complex. In these cases it is technically possible to use a more general class of numbers called hyper-dual numbers, but those are not supported in OpenMDAO, so with an analysis like this you could not use complex-step.
Also, occasionally there are implementations of methods that are not complex-step safe, which will prevent you from using it unless you define a new complex-step-safe version. The simplest example of this is the np.absolute() function in the numpy library for Python. The implementation of this, when passed a complex number, will return the magnitude of the number:
abs(a+bj) = sqrt(a^2 + b^2), e.g. abs(1+1j) = sqrt(1^2 + 1^2) ≈ 1.4142
While not mathematically incorrect, this implementation would mess up the complex-step derivative approximation.
Instead you need an alternate version that gives:
abs(a+bj) = abs(a) + sign(a)*b*j
So, in summary, you need to watch out for these kinds of functions that are not implemented correctly for use with complex-step. If you have such functions, you need to use alternate, complex-step-safe versions of them. Also, if your analysis itself uses complex numbers, then you cannot use complex-step derivative approximations either.
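A minimal sketch of what a complex-step-safe absolute value could look like (the name cs_safe_abs is hypothetical; OpenMDAO ships its own utilities for this):

import numpy as np

def cs_safe_abs(x):
    # Keep the magnitude of the real part and let the imaginary
    # (derivative-carrying) part follow the slope of |x|, i.e. sign(Re(x)).
    x = np.asarray(x)
    return x * np.sign(x.real)   # equals abs(a) + sign(a)*b*j for x = a + b*j

h = 1e-20
for x0 in (2.0, -2.0):
    deriv = np.imag(cs_safe_abs(x0 + 1j * h)) / h
    print(x0, deriv)   # +1.0 at x0 = 2.0 and -1.0 at x0 = -2.0, matching d|x|/dx
# np.absolute(x0 + 1j*h) would instead return the magnitude and destroy this information.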
With regard to your step-size question, again I refer you to this paper for greater detail. The basic idea is that, without subtractive cancellation, you are free to use a very small step size with complex-step without fear of losing accuracy to numerical issues. So typically you will use a step of 1e-20 or smaller. Since the complex-step error scales with the square of the step, such a small step gives effectively exact results. You need not worry about scaling issues in most cases if you just take a small enough step.
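To make the step-size point concrete, here is a small comparison (a sketch, using the same hypothetical test function as above): forward finite differences break down as the step shrinks because of subtractive cancellation, while the complex step can be driven down to 1e-20 and below.

import numpy as np

f = lambda x: np.exp(x) * np.sin(x)
x0 = 1.0
exact = np.exp(x0) * (np.sin(x0) + np.cos(x0))

for h in (1e-4, 1e-8, 1e-12, 1e-16, 1e-20):
    fd = (f(x0 + h) - f(x0)) / h                 # forward finite difference
    cs = np.imag(f(x0 + 1j * h)) / h             # complex step
    print(f"h={h:.0e}  FD error={abs(fd - exact):.2e}  CS error={abs(cs - exact):.2e}")
# FD error first improves, then blows up from cancellation; CS keeps improving (error ~ h**2).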

How correlated are i.i.d. normal numbers in Julia

While doing numerical simulations, I noticed a pattern in my data when I use normally distributed random numbers in Julia.
I have an ensemble of random matrices. To make my calculations reproducible, I set the seed per realization: each time I use the function randn(n,n), I initialize it with srand(j), where j is the number of the realization.
I would like to know how the normal numbers are generated, and whether, by doing this, I introduce accidental correlations.
Ideally, not at all. If you have any counterexamples, please file them as bugs on the Julia issue tracker. Julia uses the state-of-the-art Mersenne Twister library, dSFMT. This library is very fast and is generally considered to follow best practices for pseudo-random number generation. It has, however, recently come to my attention that there may be subtle statistical issues with PRNGs like MT in general – in particular with using small, consecutive seed values. To mitigate this, if you're really worried about potential correlations, you could do something like this:
julia> using SHA
julia> srand(reinterpret(UInt32,sha256(string(1))))
MersenneTwister(UInt32[0x73b2866b,0xe1fc34ff,0x4e806b9d,0x573f5aff,0xeaa4ad47,0x491d2fa2,0xdd521ec0,0x4b5b87b7],Base.dSFMT.DSFMT_state(Int32[660235548,1072895699,-1083634456,1073365654,-576407846,1073066249,1877594582,1072764549,-1511149919,1073191776 … -710638738,1073480641,-1040936331,1072742443,103117571,389938639,-499807753,414063872,382,0]),[1.5382,1.36616,1.06752,1.17428,1.93809,1.63529,1.74182,1.30015,1.54163,1.05408 … 1.67649,1.66725,1.62193,1.26964,1.37521,1.42057,1.79071,1.17269,1.37336,1.99576],382)
julia> srand(reinterpret(UInt32,sha256(string(2))))
MersenneTwister(UInt32[0x3a5e73d4,0xee165e26,0x71593fe0,0x035d9b8b,0xd8079c01,0x901fc5b6,0x6e663ada,0x35ab13ec],Base.dSFMT.DSFMT_state(Int32[-1908998566,1072999344,-843508968,1073279250,-1560550261,1073676797,1247353488,1073400397,1888738837,1073180516 … -450365168,1073182597,1421589101,1073360711,670806122,388309585,890220451,386049800,382,0]),[1.5382,1.36616,1.06752,1.17428,1.93809,1.63529,1.74182,1.30015,1.54163,1.05408 … 1.67649,1.66725,1.62193,1.26964,1.37521,1.42057,1.79071,1.17269,1.37336,1.99576],382)
In other words, hash a string representation of a small integer seed value using a strong cryptographic hash like SHA2-256, and use the resulting hash data to seed the Mersenne Twister state. Ottoboni, Rivest & Stark suggest using a strong cryptographic hash for each random number generation, but that's going to be a massive slowdown (on current hardware) and is probably overkill unless you have an application that is really very sensitive to imperfect statistical randomness.
I should perhaps point out that Julia's behavior here is no worse than that of other languages, some of which use far worse random number generators by default, due to backwards-compatibility considerations. This is a very recent research result (not yet even published). The technique I've suggested could be used to mitigate this issue in other languages as well.

Fast, portable C++ math library for matrix and vector manipulations

I have my own game engine, written in OpenGL and C++. I also have my own math library for matrix and vector manipulations. I always had doubts about the performance of my math library, so I recently decided to search for a popular math library used by many game/graphics developers. I was surprised that I couldn't find anything.
People on Stack Overflow suggested the GLM and Eigen libraries in similar posts, so I ran some performance tests. I multiplied two 4x4 matrices 1,000,000 times, and here are the results:
GLM: 4.23 seconds
Eigen: 12.57 seconds
My library: 0.25 seconds
I was surprised by these results, because my implementation of matrix multiplication is taken from Wikipedia. I checked the code from GLM and Eigen and found a lot of typedefs, assertions and other type checking, unnecessary code that decreases performance a lot.
So, my question is:
Do you know any FAST math library with nice API for gamedev / graphics purpose? I need functionality like: creating translation, rotation, projection, matrix * matrix, inverse, look at, matrix * vector, quaternions, etc...
"I checked the code from GLM and Eigen and found a lot of typedefs, assertions and other type checking, unnecessary code that decreases performance a lot."
Are you sure that you ran all these benchmarks with compiler optimizations turned on, and not, for example, with Debug settings?
Another alternative would be MathFu from Google:
http://google.github.io/mathfu/

Parallel inner product of a vector, using scan algorithm

I have optimized code for a parallel (exclusive) scan algorithm, written in OpenCL.
I've read that the inner (dot) product of a vector is based on parallel reduction, but I was wondering whether it is somehow possible to use this already-finished scan algorithm for that purpose?
The dot product is by definition a reduction. A reduction algorithm is not too hard to implement, and even a moderately optimized version is much faster than a scan. It is best to write a fast reduction algorithm that you can reuse.
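For illustration, here is the pairwise (tree) reduction pattern that a GPU dot-product kernel follows, sketched in plain Python rather than OpenCL; the name tree_dot is hypothetical. Each pass halves the number of partial sums, so there are about log2(n) passes, whereas a scan computes every prefix sum and does more work than the dot product needs.

import numpy as np

def tree_dot(x, y):
    partial = x * y                        # elementwise products (one parallel pass)
    n = partial.size
    while n > 1:
        half = n // 2
        # In a kernel, each work-item would perform one of these additions in parallel.
        partial[:half] += partial[half:2 * half]
        if n % 2:                          # carry the odd element forward
            partial[0] += partial[n - 1]
        n = half
    return partial[0]

x = np.random.rand(1000)
y = np.random.rand(1000)
assert np.isclose(tree_dot(x, y), np.dot(x, y))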

Has your pseudo-random number generator (PRNG) ever not been random enough?

Have you ever written simulations or randomized algorithms where you've run into trouble because of the quality of the (pseudo)-random numbers you used?
What was happening?
How did you detect / realize that your PRNG was the problem?
Was switching PRNGs enough to fix the problem, or did you have to switch to a source of true randomness?
I'm trying to figure out what types of applications require one to worry about the quality of their source of randomness and how one realizes when this becomes a problem.
The dated random number generator RANDU was infamous in the seventies for producing "bad" random numbers. My PhD supervisor mentioned that it affected his PhD and he had to rerun simulations. A Google search for "RANDU linear congruential generator" brings up other examples.
When I run simulations on multiple machines, I've sometimes been tempted to generate "random seeds" rather than just use a proper parallel random number generator, for example by generating the seed from the current time in seconds (see the sketch after this answer). This has caused me enough problems that I now avoid it at all costs.
This is mainly due to my particular interests, but other than parallel computing, the thought of creating my own random number generator would never cross my mind. Calling a well tested random number function is trivial in most languages.
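A small sketch of why time-based seeding bites on multiple machines (plain Python/numpy, hypothetical job names): two jobs that start within the same second get the same seed and therefore identical "random" streams.

import time
import numpy as np

seed_job_1 = int(time.time())    # e.g. two cluster jobs launched together
seed_job_2 = int(time.time())    # almost certainly the same value

stream_1 = np.random.RandomState(seed_job_1).standard_normal(5)
stream_2 = np.random.RandomState(seed_job_2).standard_normal(5)
print(np.array_equal(stream_1, stream_2))   # True whenever the seeds collide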
It is good practice to run your PRNG against the Diehard test suite. Very good and fast PRNGs exist nowadays (see the work of Marsaglia); see Numerical Recipes, 3rd edition, for a good introduction.
