I do not understand the reason for OpenCL native_ prefixed functions. The documentation says:
Functions with the native_ prefix may map to one or more native device
instructions and will typically have better performance compared to
the corresponding functions (without the native_ prefix). The
accuracy (and in some cases the input range(s)) of these functions is
implementation-defined.
Ok, so I get that native_ functions might be slightly faster and slightly less accurate. Are there any other pros and cons? In what use case might I want to use something like log() versus native_log()?
Apologies if this is a dumb question. I want to make sure I grasp the underlying reason that native_ functions exist.
If you want to release software that runs on all possible devices, you should use the normal functions, because you can never know what to expect from the native_ ones. Alternatively, you can make a simple test that decides whether to use native_ or not: calculate a bunch of values within the range of interest and see if they are close enough to the reference results.
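For example, here is a minimal kernel-side sketch of such a check (the buffer names are made up, and the host code that enqueues the kernel and scans rel_err for the maximum is omitted):

    __kernel void check_native_log(__global const float *x,
                                   __global float *rel_err,
                                   const int n)
    {
        int i = get_global_id(0);
        if (i >= n) return;
        float ref  = log(x[i]);         /* conformant log(), accuracy guaranteed by the spec */
        float fast = native_log(x[i]);  /* accuracy is implementation-defined */
        /* relative error, guarded against ref == 0 at x == 1 */
        rel_err[i] = fabs(fast - ref) / fmax(fabs(ref), FLT_MIN);
    }

If the worst relative error over your range of interest is acceptable, switch to the native_ variant; otherwise stick with the normal function.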
The most important thing about native_ functions is that their precision is completely implementation-defined. This matters because some parts of the OpenCL numerical precision spec are rather stupid, to be honest. As an example, the spec requires sin to have 4 ULP of relative precision, which is quite funny for an oscillating function and makes the implementation really hard when the input's ULP value grows large.
In practice the native_ implementations are usually pretty much what you would expect from the device. On GPUs the native_ functions are generally what is defined in the DirectX spec, and some have even better implementations. As an example, on AMD hardware the native_sin function is blazingly fast compared to the normal sin, and in my experience it has an absolute error on the order of 1 ULP of the input value. On Intel integrated GPUs, native_sin is quite imprecise but still within the DirectX spec.
The reason native_ functions are in the spec is to allow users who do not care about the extreme precision required by the OpenCL spec to use something that performs better.
tl;dr: If you can use native_ functions, do it; they are in general much faster than the normal functions.
According to the source code (*1, given below) and my experiment, LLVM implements a transform that changes division by a constant into a multiplication and a right shift.
In my experiment, this optimization is applied in the backend (I saw the change in the X86 assembly code rather than in the LLVM IR).
I understand this change may depend on the hardware: on some hardware, the multiplication and right shift may be more expensive than a single division, which would explain why the optimization is implemented in the backend.
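For concreteness, here is a small standalone C sketch of the kind of rewrite I observed; the magic constant 0xCCCCCCCD is ceil(2^34 / 5), so this is the multiply-and-shift form of an unsigned divide by 5 (the constant is just for illustration):

    #include <assert.h>
    #include <stdint.h>

    /* x / 5 rewritten as a multiply by a "magic" reciprocal plus a right
     * shift: 0xCCCCCCCD == ceil(2^34 / 5), so (x * M) >> 34 == x / 5 for
     * every 32-bit unsigned x. */
    static uint32_t div5(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
    }

    int main(void)
    {
        for (uint32_t x = 0; x < 1000000u; ++x)
            assert(div5(x) == x / 5);
        return 0;
    }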
But when I searched DAGCombiner.cpp, I saw a function named isIntDivCheap(). In its definition there are comments pointing out that the decision of cheap versus expensive depends on whether we are optimizing for code size or for speed.
That is, if I always optimize for speed, the division is converted to a multiplication and a right shift; otherwise it is not.
On the other hand, isn't a single division always slower than a multiplication and a right shift, or does the function need to do more work to decide the cost?
So, why is this optimization NOT implemented in LLVM IR if a single division is always slower?
*1: https://llvm.org/doxygen/DivisionByConstantInfo_8cpp.html
Interesting question. In my experience working on LLVM front ends for High-Level Synthesis (HLS) compilers, the answer to your question lies in understanding the LLVM IR and the limitations/scope of optimizations at the LLVM IR stage.
The LLVM Intermediate Representation (IR) is the backbone that connects frontends and backends, allowing LLVM to parse multiple source languages and generate code for multiple targets. Hence, at the LLVM IR stage, it's often about intent rather than full-fledged performance optimization.
The divide-by-constant optimization is very much performance driven. That's not to say that optimizations at the IR level have little or nothing to do with performance; however, there are inherent limitations on what can be optimized at the IR stage, and divide-by-constant is one of them.
To be more precise, the IR is not entrenched enough in low-level machine details and instructions. If you look at the optimizations applied to LLVM IR, they are usually composed of analysis and transform passes, and to my knowledge there is no divide-by-constant pass at the IR stage.
Looking into the class, I see that by default the partials appear to be complex-stepped. Is there a way to specify an analytic partial?
I've got some code with a lot of essentially one-liner explicit components that have analytic partials specified. Is there any real performance benefit to that over an ExecComp? Or, for simple functions, does it work out to be roughly the same?
There's currently no way to specify analytic partials for ExecComps and you're right that they're complex-stepped.
The short answer to your next question is that for simple functions there's no meaningful performance benefit using explicit components over ExecComp. This is because complex-step computes derivatives within machine precision when using an adequately small step size, which OpenMDAO does. The actual computational cost of performing the complex-step, for one-liners, is generally trivial.
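To see why complex step is essentially exact, here is a bare-bones standalone sketch of the idea in C (not OpenMDAO code), using the scalar form of the compute from the figure below:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* Complex-step derivative of f(x) = log(x)/x**2 + 3*log(x): perturb x
     * by i*h and take imag(f)/h. There is no subtractive cancellation, so
     * an extremely small h recovers the derivative to machine precision. */
    int main(void)
    {
        double x = 2.0;
        double h = 1e-30;
        double complex xc = x + h * I;
        double complex fc = clog(xc) / (xc * xc) + 3.0 * clog(xc);

        double d_cs    = cimag(fc) / h;                                /* complex step */
        double d_exact = (1.0 - 2.0 * log(x)) / (x * x * x) + 3.0 / x; /* hand-derived */

        printf("complex-step: %.16e\n", d_cs);
        printf("analytic:     %.16e\n", d_exact);
        return 0;
    }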
The longer answer involves a few considerations, such as the sizes of the component's input and output arrays, the sparsity pattern of the Jacobian, and the cost of the actual compute function. If you want, I can go into more detail about these considerations and suggest which method to use for your problems.
[Edit: I've updated the figure with results for this compute: y = sum(log(x)/x**2 + 3*log(x))]
I've added a figure below showing the cost for computing derivatives of a component as we change the size of the input array to that component. The analytic component is slightly faster across the board, but requires more lines of code.
Basically, whichever method is easier to implement is probably advantageous as there's not a huge cost difference. For this extremely simple compute function, because it's so inexpensive, the framework overhead probably has a larger impact on cost than the actual derivative computation. Of course these trends are also problem dependent.
I have a problem composed of around 6 nested mathematical expressions, i.e. f(g(z(y(x)))), where x consists of two independent arrays.
I can split this expression into multiple explicit components with analytic derivatives, or use an algorithmic differentiation method to get the derivatives, which reduces the system to a single explicit component.
As far as I understand, it is not easy to tell in advance what the computational performance difference between these two approaches will be.
It might depend on the algorithmic differentiation tool's capabilities in the reverse-mode case, but maybe the system with multiple explicit components would be large enough that it would still be OK to use algorithmic differentiation.
My question is:
Is algorithmic differentiation a common tool used by any of the developers/users?
I found AlgoPY, but I'm not sure about other Python tools.
As of OpenMDAO v2.4, the OpenMDAO development team has not heavily used AD tools on any pure-Python components. We have experimented with them a bit and found roughly a 2x increase in computational cost versus hand-differentiated components. While some increase in computational cost is expected, I do not want to suggest that 2x is the final rule of thumb. We simply don't have enough data to provide such an estimate.
Python-based AD tools are much less well developed than those for compiled languages. Dynamic typing and the general flexibility of the language both make it much more challenging to write good AD tools.
We have interfaced OpenMDAO with compiled codes that use AD, such as CFD and FEA tools. In these cases you're always using the matrix-free derivative APIs for OpenMDAO (apply_linear and compute_jacvec_product).
If your component is small enough to fit in memory and fast enough to run on a single process, I suggest you hand differentiate your code. That will give you the best overall performance for now.
AD support for small serial components is something we'll look into in the future, but we don't have anything to offer you in the near term (as of OpenMDAO v2.4).
In OpenCL, which is based on C99, we have two options for creating something function-like:
macros
functions
[edit: well, or use a templating language, new option 3 :-) ]
I heard somewhere (I can't find any official reference to this, just a comment I saw once on Stack Overflow) that functions are almost always inlined in practice, and that therefore using functions is OK performance-wise?
But macros are basically guaranteed to be inlined, by the nature of macros. On the other hand, they are susceptible to bugs (e.g. if you don't add parentheses around everything) and are not type-safe.
In practice, what works well? What is most standard? What is likely to be most portable?
I suppose my requirements are some combination of:
as fast as possible
as little register pressure as possible
when used with compile-time constants, should ideally be guaranteed to be optimized-away to another constant
easy to maintain...
standard, not too weird, since I'm looking at using this for an opensource project, that I hope other people will contribute to
But macros are basically guaranteed to be inlined, by the nature of macros.
On GPUs at least, OpenCL compilers aggressively inline pretty much everything, with the exception of recursive functions (OpenCL 2.0). This is due to both hardware constraints and performance reasons.
While this is indeed implementation dependent, I have yet to see a GPU binary that is not aggressively inlined. I do not work much with CPU OpenCL but I believe the optimizer strategy may be similar, although the hardware constraints are not the same.
But as far as the standard is concerned, there are no guarantees.
Let's go through your requirements:
as fast as possible
as little register pressure as possible
when used with compile-time constants, should ideally be guaranteed to be optimized-away to another constant
Inline functions are as fast as macros, do not use more registers, and will be optimized away when possible.
easy to maintain...
Functions are much easier to maintain than macros. They are type safe, they can be refactored easily, etc.; the list goes on forever.
standard, not too weird, since I'm looking at using this for an opensource project, that I hope other people will contribute to
I believe that is very subjective. I personally hate macros with a passion and avoid them like the plague. But I know some very successful projects that use them extensively (for example Boost.Compute). It's up to you.
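To make the maintainability point concrete, here is a small sketch (the names are made up) of the classic macro pitfalls next to a plain function that, per the above, the compilers will inline anyway:

    #define SQUARE_BAD(x) x * x          /* SQUARE_BAD(a + 1) expands to a + 1 * a + 1 */
    #define SQUARE(x)     ((x) * (x))    /* parenthesized, but still evaluates x twice */

    float square(float x) { return x * x; }   /* type checked, evaluates x once */

    __kernel void demo(__global float *out, const float a)
    {
        /* square() gets inlined; with a literal argument it folds to a constant */
        out[get_global_id(0)] = square(a) + square(3.0f);
    }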
How do I declare the types of the parameters in order to circumvent type checking?
How do I optimize for speed, i.e. tell the compiler to run the function as fast as possible, like (optimize speed (safety 0))?
How do I make an inline function in Scheme?
How do I use an unboxed representation of a data object?
And finally are any of these important or necessary? Can I depend on my compiler to make these optimizations?
thanks,
kunjaan.
You can't do any of these in any portable way.
You can get a "sort of" inlining using macros, but it's almost always to try to do that. People who write Scheme (or any other language) compilers are usually much better than you in deciding when it is best to inline a function.
You can't make values unboxed; some Scheme compilers will do that as an optimization, but not in any way that is visible (because it is an optimization -- so it should preserve the semantics).
As for your last question, the answer is very subjective. Some people cannot sleep at night without knowing exactly how many CPU cycles some function uses. Some people don't care and are fine with trusting the compiler to optimize things reasonably well. At least at the stage where you're more of a student of the language and less of an implementor, it is better to stick with the latter group.
If you want to help out the compiler, consider reducing top level definitions where possible.
If the compiler sees a function at top-level, it's very hard for it to guess how that function might be used or modified by the program.
If a function is defined within the scope of a function that uses it, the compiler's job becomes much simpler.
There is a section about this in the Chez Scheme manual:
http://www.scheme.com/csug7/use.html#./use:h4
Apparently Chez is one of the fastest Scheme implementations there is. If it needs this sort of "guidance" to make good optimizations, I suspect other implementations can't live without it either (or they just ignore it altogether).