Why the optimization, division-by-constant is not implemented in LLVM IR? - math

According the source code *1 give below and my experiment, LLVM implements a transform that changes the division to multiplication and shift right.
In my experiment, this optimization is applied at the backend (because I saw that change on X86 assembly code instead of LLVM IR).
I know this change may be associated with the hardware. In my point, in some hardware, the multiplication and shift right maybe more expensive than a single division operator. So this optimization is implemented in backend.
But when I search the DAGCombiner.cpp, I saw a function named isIntDivCheap(). And in the definition of this function, there are some comments point that the decision of cheap or expensive depends on optimizing base on code size or the speed.
That is, if I always optimize the code base on the speed, the division will convert to multiplication and shift right. On the contrary, the division will not convert.
In the other hand, a single division is always slower than multiplication and shift right or the function will do more thing to decide the cost.
So, why this optimization is NOT implemented in LLVM IR if a single division always slower?
*1: https://llvm.org/doxygen/DivisionByConstantInfo_8cpp.html

Interesting question. According to my experience of working on LLVM front ends for High-level Synthesis (HLS) compilers, the answer to your questions lies in understanding the LLVM IR and the limitations/scope of the optimizations at LLVM IR stage.
The LLVM Intermediate Representation (IR) is the backbone that connects frontends and backends, allowing LLVM to parse multiple source languages and generate code to multiple targets. Hence, at the LLVM IR stage, it's often about intent rather than full-fledge performance optimizations.
Divide-by-constant optimization is very much performance driven. Not saying at all that optimizations at IR level have less or nothing to do with performance, however, there are inherent limitations in terms of optimizations at IR stage and divide-by-constant is one of those limitations.
To be more precise, the IR is not entrenched enough in low-level machine details and instructions. If you observe that the optimizations at LLVM IR are usually composed of analysis and transform passes. And as per my knowledge, you don't see divide-by-constant pass at the IR stage.

Related

Float related numerical stability issues for parallel reduction

I have been looking at some online resources related to float summation and the related accuracy issues.
E.g.:
https://devtalk.nvidia.com/default/topic/1044661/cuda-programming-and-performance/how-to-improve-float-array-summation-precision-and-stability-/
https://hal.archives-ouvertes.fr/hal-00949355v4/document
Most of them recommend using some form of manual intervention when handling floating-point summation for any modern hardware. E.g.
(1) to use Kahan’s algorithm for float summation, or (2) Sort and sum closer magnitude numbers together, etc.
Are these kind of nuances handled by MPI_AllReduce or OpenMP reduction kernels?
Speaking only for OpenMP: the standard says nothing about the order in which reduction operations are applied, and, indeed, that can even differ at each execution of the code. (Some OpenMP runtimes, such as the LLVM/Intel one implement a deterministic reduction*, but only guarantee determinism between runs with the same number of threads).
If you want to sort, or perform reduction in other ways, you will need to implement it yourself...
See https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-supported-environment-variables and search for KMP_DETERMINISTIC_REDUCTION for details.

Frama-c execution time & heap memory bounds proof

Does Frama-C provide any tools for proving the run-time characteristics of a function such as execution time (possibly as instruction count) and heap memory space (counted as bytes allocated)?
Concerning execution time estimation
Frama-C works at the C level. The Metrics plug-in can provide a few metrics (such as statement count) on a version of the source very close to the original one (-metrics -metrics-ast cabs), or on the normalized source (often referred to as Cil code) that it uses. However, it does not have any knowledge of assembly code, therefore it cannot provide precise information about execution time at this level.
Since compiler optimizations impact code generation, the numbers given by Frama-C may or may not be close to what will be produced by a compiler, depending on which optimizations are enabled, what is known about the compiler and the target architecture, etc. In the general case, Frama-C cannot give any guarantees; in specific situations, it is possible to develop plug-ins to provide some of this information (e.g. the Cost plug-in, mentioned here uses annotations to try and maintain some correspondence between source and compiled code, and then uses them to provide some execution time information).
Concerning memory size estimation
There is an option, -metrics-locals-size, which does a rough estimation of the stack memory usage by a function. As in the previous case, this is only an estimation based on the source code. Compilers are likely to stack-allocate temporary variables for computing temporary subexpressions, or for register spilling, so the numbers given by Frama-C cannot be used in a worst-case stack estimation.
Dynamic memory allocation is supported in ACSL, so in theory it is possible to write annotations concerning it. However, current plug-ins do not provide a direct way to handle this precisely; it might require writing a new plug-in or, at least, an abstract domain for Eva.
Eva currently handles dynamic allocation, but probably not precisely enough for estimating heap size in an interesting way. It is possible to develop an abstract domain for Eva that would keep track of this information (adding mallocs and subtracting frees) and compute an overapproximation of the heap memory space, but this would require being able to bound the number of iterations of loops containing allocations (otherwise the upper bound would be infinite). Precision would depend on the complexity of the program.
For runtime verification, the E-ACSL plug-in already tracks some stack/heap information usage (even though it is not currently exported to the user), so in theory one could write an assertion similar to //# assert heap_size <= \old(heap_size) + 42;, and have it checked at runtime, when running the instrumented program.
To complement anol's answer, the PathCrawler plug-in (online version can be used freely, but the plugin itself is proprietary) has been used to generate sets of test cases covering all paths of C functions. This article explains under which assumptions this can be used as the basis for WCET measurement, but basically the issues are the one already mentioned by anol: without a precise knowledge of the work done by the compiler and of the underlying hardware, which is not something Frama-C provides natively, things are going to be quite rough.
There has been apparently some recent work taking the same route of using PathCrawler for generating execution traces covering a sufficiently large proportion of the search space as a bachelor project in Amsterdam.

How much can MPI_Alltoall outperform MPI_Alltoallv?

I wonder what is the difference in terms of running time between executing the MPI_Alltoallv and MPI_Alltoall functions when the amount of transferred data is approximately the same? I couldn't find any such benchmark results. I am interested in large-scale instances, where tens of thousands or better hundreds of thousand of MPI processes are used and where these processes correspond to a substantial part of a given HPC system (considering at best some modern ones, such as BG/Q, Cray XC30, Cray XE6, ...).
Overview
One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine if, for example, the communication is nearly homogeneous, highly sparse, or some other pattern.
The other issue is that MPI_Alltoall can easily use the output buffer as scratch space because every process provides and consumes the same amount of data. For MPI_Alltoallv, it's not practical to do all the bookkeeping, so any scratch space is going to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
Implementation Skeletons
There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:
Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:
Pad your arrays and use MPI_Alltoall directly.
Use MPI_Alltoall for the subset of processes that have homogeneous communication and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.
Write your own implementation of Bruck that handles the cases where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile this one is.
Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall, because that's likely the best the MPI library will ever do and doing it yourself allows you to tune the batch size if you want.
Papers
MPI on a Million Processors describes the scalabillity issue associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers to not touch the vector arguments more than necessary.
HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is rather difficult to implement inside of an MPI library, because it may be rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)
Code
You can discover some interesting things by comparing the implementations of these operations in MPICH (https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoall.c and https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoallv.c). Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV on https://software.intel.com/en-us/node/528906. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.
Practical Experience
When MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv (source code), so there was no difference relative to MPI_Alltoallv, and the latter might have even been better since the application pre-populated the vector arguments.
I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now to be absolutely sure of the details.
However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonly networks on systems where the CPU is much faster than the network will be quite different.
I suggest you write a benchmark and measure it where you intend to run your application. Once you have some data, it will be much easier to figure out what optimizations may be missed.

Is there a general binary intermediate representation for OpenCL kernel programming?

as I understood, the OpenCL uses a modified C language (by adding some keywords like __global) as the general purpose for defining kernel function. And now I am doing a front-end inside F# language, which has a code quotation feature that can do meta programming (you can think it as some kind of reflection tech). So I would like to know if there is a general binary intermediate representation for the kernel instead of C source file.
I know that CUDA supports LLVM IR for the binary intermediate representation, so we can create kernel programmatically, and I want to do the same thing with OpenCL. But the document says that the binary format is not specified, each implementation can use their own binary format. So is there any general purpose IR which can be generated by program and can also run with NVIDIA, AMD, Intel implementation of OpenCL?
Thansk.
No, not yet. Khronos is working on SPIR (the spec is still provisional), which would hopefully become this. As far as I can tell, none of the major implementations support it yet. Unless you want to bet your project on its success and possibly delay your project for a year or two, you should probably start with generating code in the C dialect.

BLAS sdot operation implementation in MKL library

I tested the BLAS sdot interface for single precise floating point dot operations. I found that the results of Intel MKL library are a little different from that of the BLAS fortran code given in http://netlib.org/blas/. The MKL ones appear more accurate.
I just wonder is there any optimization made by MKL? Or how does MKL implement it to make it more accurate?
Well, since the MKL is especially written by a specific CPU vendor for their own products, I guess they can use a bit more knowledge about the underlying machine than the reference implementation can.
First thoughts may be that they use optimized assembly and always keep the running sum on the x87 80bit floating point stack without rounding it down to 32bit in each iteration. Or maybe they use SSE(2) and compute the whole sum in double precision (which shouldn't make much of a difference for addition and multiplication, performance-wise). Or maybe they use a completely different computation or what black magic machine tricks ever.
The point is that these routines are far more optimized for a specific hardware than the basic reference implementation, but without seeing their implementation we cannot say in which way. The above mentioned ideas are just simple approaches.

Resources