BLAS sdot operation implementation in MKL library

I tested the BLAS sdot interface for single-precision floating-point dot products. I found that the results from the Intel MKL library are slightly different from those of the reference BLAS Fortran code given at http://netlib.org/blas/. The MKL results appear to be more accurate.
I just wonder: does MKL apply some optimization here? Or how does MKL implement sdot to make it more accurate?

Well, since MKL is written by a specific CPU vendor for its own products, I guess they can use a bit more knowledge about the underlying machine than the reference implementation can.
My first thought is that they use optimized assembly and keep the running sum on the x87 80-bit floating-point stack, without rounding it down to 32 bits on each iteration. Or maybe they use SSE(2) and accumulate the whole sum in double precision (which shouldn't make much of a difference performance-wise for addition and multiplication). Or maybe they use a completely different summation scheme, or whatever other machine-level tricks.
The point is that these routines are far more heavily optimized for specific hardware than the basic reference implementation, but without seeing their implementation we cannot say in which way. The ideas above are just simple possibilities.
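To make the double-precision-accumulator idea concrete, here is a minimal C sketch. It only illustrates the general technique; it is not MKL's actual (closed-source) implementation, and the function names are made up.

```c
#include <stdio.h>

/* Naive single-precision dot product: the running sum is rounded back
 * to 32-bit float after every addition, so rounding error accumulates. */
static float sdot_naive(int n, const float *x, const float *y)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Same dot product, but accumulated in double precision and rounded
 * only once at the end -- one way a library could end up "more
 * accurate" than the reference Fortran code. */
static float sdot_double_acc(int n, const float *x, const float *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += (double)x[i] * (double)y[i];
    return (float)sum;
}

int main(void)
{
    enum { N = 1000000 };
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 0.1f; y[i] = 0.1f; }

    printf("float accumulator : %.8f\n", sdot_naive(N, x, y));
    printf("double accumulator: %.8f\n", sdot_double_acc(N, x, y));
    /* The exact result is N * 0.1f * 0.1f, roughly 10000; the two
     * accumulators typically disagree visibly in the low-order digits. */
    return 0;
}
```

Whether MKL does exactly this is speculation; a wider accumulator, a different summation order from vectorization, or both can each explain small differences from the reference result.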

Related

Why is the division-by-constant optimization not implemented in LLVM IR?

According to the source code linked below (*1) and my experiments, LLVM implements a transform that changes division by a constant into a multiplication and a right shift.
In my experiments this optimization is applied in the backend (I saw the change in the generated x86 assembly, not in the LLVM IR).
I understand this change may be tied to the hardware: on some targets, the multiplication and right shift may be more expensive than a single division instruction, which would explain why the optimization is implemented in the backend.
But when I searched DAGCombiner.cpp I found a function named isIntDivCheap(), and the comments in its definition say that the cheap-or-expensive decision depends on whether we are optimizing for code size or for speed.
That is, if I always optimize for speed, the division is converted to a multiplication and right shift; otherwise it is not converted.
On the other hand, either a single division is always slower than a multiplication plus shift, or the function does more work than this to decide the cost.
So why is this optimization NOT implemented in LLVM IR if a single division is always slower?
*1: https://llvm.org/doxygen/DivisionByConstantInfo_8cpp.html
Interesting question. From my experience working on LLVM front ends for High-Level Synthesis (HLS) compilers, the answer lies in understanding the LLVM IR and the scope and limitations of optimizations at the IR stage.
The LLVM Intermediate Representation (IR) is the backbone that connects frontends and backends, allowing LLVM to parse multiple source languages and generate code for multiple targets. Hence, at the LLVM IR stage it is often about capturing intent rather than full-fledged, target-specific performance optimization.
Divide-by-constant optimization is very much performance driven. That is not to say that IR-level optimizations have little or nothing to do with performance; however, there are inherent limits to what can be decided at the IR stage, and divide-by-constant runs into one of them.
To be more precise, the IR is not entrenched in low-level machine details and instructions. If you look at the optimizations applied to LLVM IR, they are composed of analysis and transform passes, and as far as I know there is no divide-by-constant pass among them.
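For concreteness, here is a small C sketch of the kind of strength reduction the backend performs; it is my own illustration, not LLVM's code. It rewrites unsigned division by the constant 5 as a multiply-high plus shift, using the precomputed magic number 0xCCCCCCCD = ceil(2^34 / 5).

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned division by the constant 5, lowered the way a backend might:
 * floor(n / 5) == (n * 0xCCCCCCCD) >> 34 for every 32-bit n, because
 * 0xCCCCCCCD = ceil(2^34 / 5) and the excess term n / (5 * 2^34) is too
 * small to push the quotient past the next integer while n < 2^32. */
static uint32_t div5_by_magic(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);
}

int main(void)
{
    /* Spot-check the rewrite against the hardware division instruction. */
    for (uint32_t n = 0; n < (1u << 20); ++n)
        assert(div5_by_magic(n) == n / 5);
    assert(div5_by_magic(0xFFFFFFFFu) == 0xFFFFFFFFu / 5);
    return 0;
}
```

Whether this rewrite pays off depends on how fast the target's divider is relative to its multiplier and on whether you are optimizing for size or speed, which is exactly the target-specific knowledge (exposed through hooks such as isIntDivCheap()) that lives in the backend rather than in the IR.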

Understanding the reasons behind the OpenMDAO design

I am reading about MDO and I find OpenMDAO really interesting. However, I have trouble understanding/justifying the reasons behind some basic design choices.
Why gradient-based optimization? Since a gradient-based optimizer can never guarantee a global optimum, why is it preferred? I understand that finding a global minimum is really hard for MDO problems with numerous design variables, and that a local optimum is far better than a human design. But considering that the applications are generally expensive systems like aircraft or satellites, why settle for a local minimum? Wouldn't it be better to use meta-heuristics, or meta-heuristics on top of gradient methods, to converge to a global optimum? The computation time would be higher, but now that almost every university and leading company has access to supercomputers, I would say that is an acceptable trade-off.
Speaking of computation time, why Python? I agree that Python makes scripting convenient and can be interfaced with compiled languages, but does that alone tip the scales in favor of Python? If computation time is one of the primary reasons that finding the global minimum is so hard, wouldn't it be better to use C++ or another more efficient language?
To clarify, the only intention of this post is to justify (to myself) using OpenMDAO, as I am just starting to learn about MDO.
No algorithm can guarantee that it finds a global optimum in finite time, but gradient-based methods generally find local optima faster than gradient-free methods do. OpenMDAO concentrates on gradient-based methods because they are able to traverse the design space much more rapidly than gradient-free methods.
Gradient-free methods are generally good for exploring the design space more broadly in search of better local optima, and there's nothing to prevent users from wrapping the gradient-based optimization drivers under a gradient-free caller (see the literature on algorithms like monotonic basin hopping, for instance).
Python was chosen because, while it's not the most efficient in run-time, it considerably reduces the development time. Since using OpenMDAO means writing code, the relatively low learning curve, ease of access, and cross-platform nature of Python made it attractive. There's also a LOT of open-source code out there that's written in Python, which makes it easier to incorporate things like 3rd party solvers and drivers. OpenMDAO is only possible because we stand on a lot of shoulders.
Despite being written in Python, we achieve relatively good performance because the algorithms involved are very efficient and we attempt to minimize the performance issues of Python by doing things like using vectorization via NumPy rather than Python loops.
Also, the calculations that Python handles at the core of OpenMDAO are generally very low cost. For complex engineering calculations like PDE solvers (e.g. CFD or FEA), the expensive parts of the code can be written in C, C++, Fortran, or even Julia. These languages are easy to interface with Python, and many OpenMDAO users do just that.
OpenMDAO is actively used in a number of applications, and the needs of those applications drive its design. While we don't have a built-in monotonic-basin-hopping capability right now (for instance), if that were determined to be a need by our stakeholders we'd look to add it. As our development continues, if we were to hit roadblocks that could be overcome by switching to a different language, we would consider it, but backwards compatibility (the ability of users to keep using their existing Python-based models) would be a requirement.

Differences between clBLAS and ViennaCL?

Looking at the OpenCL libraries out there, I am trying to get a complete grasp of each one. One library in particular is clBLAS. Its website states that it implements BLAS level 1, 2, and 3 routines. That is great, but ViennaCL also provides BLAS routines, as well as linear algebra solvers, supports both OpenCL and CUDA backends, and is header-only. At the moment it seems to me that there is no reason to use clBLAS over ViennaCL, but I was wondering whether anyone has reasons why one would.
Although similar, this is meant to be an extension of this previous question comparing VexCL, Thrust, and Boost.Compute.
clBLAS is implemented by AMD, so one could hope that it would be faster on AMD hardware; that is usually the main advantage of vendor BLAS implementations. Unfortunately, this does not seem to be the case here.
In this talk the ViennaCL authors report that, thanks to their autotuning framework, they are able to either outperform clBLAS or show similar performance.

Tuning Mathematical Parallel Codes

Assuming that I am interested in the performance rather than the portability of my multi-threaded iterative linear algebra solver, and that I have profiling results for my code in hand, how do I go about tuning the code to run optimally on the machine of my choice?
The algorithm involves matrix-vector multiplications, norms, and dot products. (FWIW, I am working on CG and GMRES.)
I am working on codes whose matrix size is roughly the full size of the RAM (~6 GB). I'll be working on an Intel i3 laptop and linking my codes against Intel MKL.
Specifically,
Is there a good resource (PDF/book/paper) for learning manual tuning? There are numerous things I have learnt by doing, for instance that manual unrolling isn't always optimal, or which compiler flags to use, but I would prefer a centralized resource.
I need something that helps me translate profiler information into improved performance. For instance, my profiler tells me that one processor's stack is being accessed by another processor, or that my mulpd instructions are taking too much time. I have no clue what these mean or how I could use this information to improve my code.
My intention is to spend as much time as needed to squeeze out as much compute power as possible. It's more of a learning experience than something for actual use or distribution for now.
(I am concerned with manual tuning, not auto-tuning.)
Misc Details:
This differs from usual performance tuning since major portions of the code are linked against Intel's proprietary MKL library.
Because of memory-bandwidth issues in the O(N^2) matrix-vector multiplications, and because of the dependencies between operations, there is a limit to what I can manage on my own through simple observation.
I write in C and Fortran and I have tried both; as discussed a million times on SO, I found no difference between them once each is tweaked appropriately.
Gosh, this still has no answers. After you've read this you'll still have no useful answers ...
You imply that you've already done all the obvious and generic things to make your codes fast. Specifically you have:
chosen the fastest algorithm for your problem (either that, or your problem is to optimise the implementation of an algorithm rather than to optimise the finding of a solution to a problem);
worked your compiler like a dog to squeeze out the last drop of execution speed;
linked in the best libraries you can find which are any use at all (and tested to ensure that they do in fact improve the performance of your program);
hand-crafted your memory access to optimise r/w performance;
done all the obvious little tricks that we all do (e.g. when comparing the norms of two vectors you don't need to take a square root to determine that one is 'larger' than the other; see the sketch just after this list);
hammered the parallel scalability of your program to within a gnat's whisker of the S==P line on your performance graphs;
always executed your program on the right size of job, for a given number of processors, to maximise some measure of performance;
and still you are not satisfied!
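To spell out that 'obvious little trick' about norms, here is a C sketch with made-up function names: since sqrt() is monotonic, comparing squared norms gives the same ordering as comparing the norms themselves.

```c
#include <stddef.h>

/* Sum of squares of a vector, i.e. ||v||^2. */
static double norm_squared(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += v[i] * v[i];
    return s;
}

/* Nonzero iff ||x|| > ||y||: comparing the squared norms gives the same
 * answer and skips two square roots per comparison. */
static int norm_greater(const double *x, const double *y, size_t n)
{
    return norm_squared(x, n) > norm_squared(y, n);
}
```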
Now, unfortunately, you are close to the bleeding edge and the information you seek is not to be found easily in books or on web-sites. Not even here on SO. Part of the reason for this is that you are now engaged in optimising your code on your platform and you are in the best position to diagnose problems and to fix them. But these problems are likely to be very local indeed; you might conclude that no-one outside your immediate research group would be interested in what you do, and I know you wouldn't be interested in any of the micro-optimisations I do on my code on my platform.
The second reason is that you have stepped into an area that is still an active research front and the useful lessons (if any) are published in the academic literature. For that you need access to a good research library, if you don't have one nearby then both the ACM and IEEE-CS Digital Libraries are good places to start. (Post or comment if you don't know what these are.)
In your position I'd be looking at journals on 2 topics: peta- and exa-scale computing for science and engineering, and compiler developments. I trust that the former is obvious, the latter may be less obvious: but if your compiler already did all the (useful) cutting-edge optimisations you wouldn't be asking this question and compiler-writers are working hard so that your successors won't have to.
You're probably looking for optimisations which, like loop unrolling, say, were relatively difficult to find implemented in compilers 25 years ago and were therefore bleeding-edge back then, and which will themselves be old and established in another 25 years.
EDIT
First, let me make explicit something that was originally only implicit in my 'answer': I am not prepared to spend long enough on SO to guide you through even a summary of the knowledge I have gained in 25+ years in scientific/engineering and high-performance computing. I am not given to writing books, but many are and Amazon will help you find them. This answer was way longer than most I care to post before I added this bit.
Now, to pick up on the points in your comment:
on 'hand-crafted memory access': start at the Wikipedia article on 'loop tiling' (see, you can't even rely on me to paste the URL here) and read outwards from there; you should be able to quickly pick up the terms you can use in further searches. A minimal tiling sketch appears at the end of this answer.
on 'working your compiler like a dog' I do indeed mean becoming familiar with its documentation and gaining a detailed understanding of the intentions and realities of the various options; ultimately you will have to do a lot of testing of compiler options to determine which are 'best' for your code on your platform(s).
on 'micro-optimisations', well here's a start: Performance Optimization of Numerically Intensive Codes. Don't run away with the idea that you will learn all (or even much) of what you want to learn from this book. It's now about 10 years old. The take away messages are:
performance optimisation requires intimacy with machine architecture;
performance optimisation is made up of 1001 individual steps and it's generally impossible to predict which ones will be most useful (and which ones actually harmful) without detailed understanding of a program and its run-time environment;
performance optimisation is a participation sport, you can't learn it without doing it;
performance optimisation requires obsessive attention to detail and good record-keeping.
Oh, and never write a clever piece of optimisation that you can't easily un-write when the next compiler release implements a better approach. I spend a fair amount of time removing clever tricks from 20-year old Fortran that was justified (if at all) on the grounds of boosting execution performance but which now just confuses the programmer (it annoys the hell out of me too) and gets in the way of the compiler doing its job.
Finally, one piece of wisdom I am prepared to share: these days I do very little optimisation that is not under one of the items in my first list above; I find that the cost/benefit ratio of micro-optimisations is unfavourable to my employers.
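On the loop-tiling point made in reply to the 'hand-crafted memory access' comment above, here is a minimal C sketch of a blocked matrix-matrix multiply. It is a generic illustration of tiling, not anything MKL-specific, and the tile size BS is an assumption you would tune to your cache.

```c
#include <stddef.h>

enum { BS = 64 };   /* tile size: a guess; tune it to your cache sizes */

/* Blocked (tiled) dense multiply, C += A * B, all matrices n x n, row-major.
 * Each BS x BS tile of A and B is reused many times while it is still
 * resident in cache, instead of being streamed from main memory on every
 * pass of the naive triple loop. */
static void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t k = kk; k < kk + BS && k < n; ++k) {
                        double aik = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

In practice you would measure this against simply calling the vendor BLAS (dgemm), which will almost always win; the sketch is only there to turn 'loop tiling' from a buzzword into something concrete.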

Fast interpreted language for memory constrained microcontroller

I'm looking for a fast interpreted language for a microcontroller.
The requirements are:
should be fast (not crucial but would be nice)
should be light on data memory (small overhead <8KB, excludes program variable space)
preferably would be small in program size and the language would be compact
preferably, human readable (for example, BASIC)
Thanks!
Some AVR interpreters:
http://www.cqham.ru/tbcgroup/index_eng.htm
http://www.jcwolfram.de/projekte/avr/chipbasic2/main.php
http://www.jcwolfram.de/projekte/avr/chipbasic8/main.php
http://www.jcwolfram.de/projekte/avr/main.php
http://code.google.com/p/python-on-a-chip/
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=688&item_type=project
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=626&item_type=project
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=460&item_type=project
This is a bit generic: there are many kinds of microcontrollers, and thanks to technologies like Jazelle it is possible to run hardware-accelerated Java on microcontrollers (if, that is, your microcontroller supports it).
For a generic answer: Forth is commonly referenced. But really, you need to be far more specific with your question.
Micro-controllers come in a vast variety of architectures. There are small 8-bit families, 32-bit families with simple architectures, and 32-bit families with MMU support suitable for running a modern OS. If you don't state which family you are targeting, it is impossible to answer your question.
Anyway, for the 8-bit families the best you can get is a BASIC variant; see Bascom, for example. Note that this would be a compiled version of the "interpreted" language. If you actually want a runtime or an interpreter that executes your code on the chip, then you most probably need to install an operating system on your microcontroller.
There were a variety of interpreted languages for small micros back in the late 1970s and 1980s, but they seem to have mostly fallen out of fashion. I'd like to have a p-code-based C compiler for the PIC18 that could coexist nicely with my other C compiler; for much of my code I'd be willing to accept a 100-fold slowdown for a 50% space reduction (so long as I could keep the important stuff in native code). I would think that would be achievable, but I'm not about to implement such a thing from scratch myself.
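To give a flavour of what such a p-code layer amounts to, here is a toy byte-code interpreter in C; the opcodes, encoding, and stack size are made-up assumptions for illustration, not any existing PIC18 tool. The size/speed trade-off comes from the fact that every virtual operation goes through one small dispatch loop.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy p-code: a tiny stack machine with a handful of invented opcodes. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

static void run(const uint8_t *code)
{
    int16_t stack[16];
    uint8_t sp = 0;               /* stack pointer */
    const uint8_t *pc = code;     /* program counter */

    for (;;) {
        switch (*pc++) {
        case OP_PUSH:             /* next two bytes: little-endian value */
            stack[sp++] = (int16_t)(pc[0] | (pc[1] << 8));
            pc += 2;
            break;
        case OP_ADD:
            sp--; stack[sp - 1] = (int16_t)(stack[sp - 1] + stack[sp]);
            break;
        case OP_MUL:
            sp--; stack[sp - 1] = (int16_t)(stack[sp - 1] * stack[sp]);
            break;
        case OP_PRINT:
            printf("%d\n", stack[--sp]);
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void)
{
    /* Byte code for (2 + 3) * 10; prints 50. */
    static const uint8_t prog[] = {
        OP_PUSH, 2, 0,
        OP_PUSH, 3, 0,
        OP_ADD,
        OP_PUSH, 10, 0,
        OP_MUL,
        OP_PRINT,
        OP_HALT
    };
    run(prog);
    return 0;
}
```

On an actual part you would replace printf with whatever I/O the chip provides and keep the dispatch loop and the byte code in flash; the point is only how little code the interpreter core itself needs.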

Resources