I read an article which stated that "Kernels can invoke a broader number of functions than shaders." How far is this true?
The link to that article is http://www.dyn-lab.com/articles/cl-gl.html
The difference is quite the opposite actually. If you compare Section 8 of the GLSL specification with Section 6.12 of the OpenCL specification, you can see that there is a large overlap concerning mathematical operations.
However, GLSL has far more bit- and image-related operations, and provides matrix operations which do not exist in OpenCL 1.2. On the other hand, OpenCL has more synchronization primitives and work-group management functions, which are not necessary in GLSL. Moreover, OpenCL provides smaller and larger integer types than GLSL.
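To make the synchronization point concrete, here is a minimal sketch of the kind of work-group coordination OpenCL exposes and classic GLSL vertex/fragment shaders do not. The kernel is an invented example (it assumes a power-of-two work-group size), not something from the article:

    // Work-group sum reduction: each work-group reduces its slice of the
    // input in local memory, synchronizing with barrier() between passes.
    __kernel void wg_reduce(__global const float *in,
                            __global float *out,
                            __local float *scratch) {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);          // wait for all loads to finish
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);      // sync after each halving step
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0]; // one partial sum per group
    }

barrier(), get_local_id(), and explicitly managed __local memory have no counterpart in GLSL's classic shader stages (compute shaders, which arrived later in GLSL 4.3, are the exception).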
Also, in Appendix C of the AMD APP OpenCL Programming Guide, the number and types of available functions are not listed as a major difference between a shader and a kernel.
According to the source code (*1, linked below) and my experiments, LLVM implements a transform that changes division by a constant into a multiplication and a shift right.
In my experiments, this optimization is applied in the backend (I saw the change in the x86 assembly code, not in the LLVM IR).
I know this transform may depend on the hardware: on some targets, a multiplication and a shift right may be more expensive than a single division, which would explain why the optimization is implemented in the backend.
But when I searched DAGCombiner.cpp, I found a function named isIntDivCheap(), and the comments in its definition point out that the cheap/expensive decision depends on whether you are optimizing for code size or for speed.
That is, when optimizing for speed, the division is converted into a multiplication and a shift right; otherwise, it is not converted.
On the other hand, either a single division is always slower than a multiplication plus a shift right, or the function has to consider more factors to decide the cost.
So why is this optimization NOT implemented at the LLVM IR level, if a single division is always slower?
*1: https://llvm.org/doxygen/DivisionByConstantInfo_8cpp.html
Interesting question. Based on my experience working on LLVM front ends for high-level synthesis (HLS) compilers, the answer to your question lies in understanding the LLVM IR and the limitations/scope of optimizations at the LLVM IR stage.
The LLVM Intermediate Representation (IR) is the backbone that connects front ends and backends, allowing LLVM to parse multiple source languages and generate code for multiple targets. Hence, at the LLVM IR stage, it is often about intent rather than full-fledged performance optimization.
Divide-by-constant optimization is very much performance-driven. That is not to say that optimizations at the IR level have little or nothing to do with performance; however, there are inherent limitations on what can be optimized at the IR stage, and divide-by-constant runs into them.
To be more precise, the IR does not encode enough low-level machine detail. If you look at the optimizations performed on LLVM IR, they are composed of analysis and transform passes, and as far as I know there is no divide-by-constant pass among them.
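To make the transform itself concrete, here is a sketch in plain C of what a backend typically emits for unsigned division by the constant 10. The helper name div10 is mine, not LLVM's; the magic constant 0xCCCCCCCD is ceil(2^35 / 10), following the classic magic-number construction:

    #include <assert.h>
    #include <stdint.h>

    /* For every 32-bit n, n / 10 == (n * 0xCCCCCCCD) >> 35, because
     * 0xCCCCCCCD = ceil(2^35 / 10) and the rounding error stays small
     * enough for all n < 2^32. */
    static uint32_t div10(uint32_t n) {
        return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
    }

    int main(void) {
        for (uint32_t n = 0; n < 1000000u; ++n)
            assert(div10(n) == n / 10u); /* spot check; holds for all n */
        return 0;
    }

Whether this rewrite actually pays off depends on the target's multiplier latency, and that is exactly the kind of information isIntDivCheap() consults in the backend but the IR does not carry.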
I have an OpenCL 1.2 application that I would like to run on iOS, so my only choice for GPGPU is Metal. I am curious: what is missing in Metal relative to OpenCL? My current app makes heavy use of OpenCL images, and of compute features such as popcnt.
I don't know OpenCL, but I doubt Metal is missing much, since it was designed much later. You can see from the Metal Shading Language Specification (PDF) that it provides the popcount() function.
Compute functions in Metal can read from and write to textures as well as buffers, if that's what OpenCL images are used for.
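For reference, here is roughly what the question's OpenCL 1.2 usage looks like on the OpenCL side; Metal's popcount() and texture read/write cover the same ground. The kernel is an invented sketch, not taken from the app in question:

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    // Read an unsigned-integer image and write per-pixel population counts.
    __kernel void bit_count(__read_only image2d_t src,
                            __write_only image2d_t dst) {
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        uint4 px = read_imageui(src, smp, pos);
        write_imageui(dst, pos, (uint4)(popcount(px.x), 0, 0, 1));
    }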
As warrenm points out, Metal does not support double-precision floating-point types.
Recently I have been seeing OpenCL EP support on some development boards like the Odroid XU. One thing I know is that OpenCL EP is for ARM processors, but in what features does it differ from desktop OpenCL?
The main differences are enumerated below (as of OpenCL 1.2):
64-bit integer support is optional.
Support for 3D images is optional.
Support for writes to 2D image arrays is optional. If the cles_khr_2d_image_array_writes extension is supported by the embedded profile, writes to 2D image arrays are supported.
There are some limitations on the available channel data types for images and image arrays (in particular, images with channel data types of CL_FLOAT and CL_HALF_FLOAT only support the CL_FILTER_NEAREST sampler filter mode).
There are limitations on the sampler addressing modes available to images and image arrays.
There are some floating-point rounding changes that you may need to take into account. Floating-point addition, subtraction, and multiplication will always be correctly rounded; other operations, such as division and square root, have varying accuracies. There are tons of other floating-point things to watch out for as well.
Conversions between integer data types and floating-point values are limited in precision (but there are exceptions).
In short, the main differences here are in floating-point accuracy. In other words, the embedded profile need not adhere to the IEEE 754 floating-point specification, which may be a problem if you are doing lots of numerical calculations which rely on it. Quoted from the specification:
This relaxation of the requirement to adhere to IEEE 754 requirements
for basic floating-point operations, though extremely undesirable, is
to provide flexibility for embedded devices that have lot stricter
requirements on hardware area budgets.
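Incidentally, a quick way to check at run time which profile you are dealing with is to query the device profile string. A short sketch (error handling omitted; is_embedded is my own helper name):

    #include <CL/cl.h>
    #include <string.h>

    // Returns 1 if the device implements the embedded profile.
    int is_embedded(cl_device_id dev) {
        char profile[64] = {0};
        clGetDeviceInfo(dev, CL_DEVICE_PROFILE,
                        sizeof(profile) - 1, profile, NULL);
        return strcmp(profile, "EMBEDDED_PROFILE") == 0;
    }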
There is also something that is not mentioned in section 10 but is worth noting: while desktop profiles must have a compiler available to compile OpenCL kernels, embedded profiles need not provide one. This can be seen through the clGetDeviceInfo documentation, which states:
CL_DEVICE_COMPILER_AVAILABLE: Return type: cl_bool
Is CL_FALSE if the implementation does not have a compiler available
to compile the program source. Is CL_TRUE if the compiler is available.
This can be CL_FALSE for the embededed (sic) platform profile only.
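In practice this means a portable host program should query that flag and fall back to a precompiled binary when no compiler is present. A minimal sketch, where build_for() is a hypothetical helper of mine (not part of the OpenCL API) and the binary is assumed to have been produced offline for this exact device:

    #include <CL/cl.h>
    #include <stddef.h>

    cl_program build_for(cl_context ctx, cl_device_id dev, const char *src,
                         const unsigned char *bin, size_t bin_len) {
        cl_bool has_compiler = CL_FALSE;
        clGetDeviceInfo(dev, CL_DEVICE_COMPILER_AVAILABLE,
                        sizeof(has_compiler), &has_compiler, NULL);

        cl_int err;
        cl_program prog;
        if (has_compiler) {
            /* Desktop profile: the driver can compile OpenCL C source. */
            prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        } else {
            /* Embedded profile without a compiler: load an offline binary. */
            prog = clCreateProgramWithBinary(ctx, 1, &dev, &bin_len,
                                             &bin, NULL, &err);
        }
        if (err != CL_SUCCESS)
            return NULL;
        err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        return err == CL_SUCCESS ? prog : NULL;
    }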
For the complete and detailed OpenCL Embedded Profile specification, fire up your PDF reader, download the OpenCL spec (whichever version you are developing for), and find the relevant section.
Section 10 of the standard answers your question. That section is entirely dedicated to the OpenCL embedded profile, and starts by enumerating the restrictions that this profile implies.
As I understand it, OpenCL uses a modified C language (adding keywords such as __global) for defining kernel functions. I am now writing a front end in F#, which has a code quotation feature for metaprogramming (you can think of it as a kind of reflection). So I would like to know whether there is a general binary intermediate representation for kernels, instead of C source files.
I know that CUDA supports LLVM IR as a binary intermediate representation, so kernels can be created programmatically, and I want to do the same thing with OpenCL. But the documentation says that the binary format is not specified; each implementation can use its own. So is there any general-purpose IR that can be generated by a program and will run on the NVIDIA, AMD, and Intel implementations of OpenCL?
Thanks.
No, not yet. Khronos is working on SPIR (the spec is still provisional), which will hopefully become exactly that. As far as I can tell, none of the major implementations support it yet. Unless you want to bet your project on its success, and possibly delay it for a year or two, you should probably start by generating code in the C dialect.
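To illustrate that fallback: since every desktop implementation ships a compiler behind clCreateProgramWithSource(), a front end can simply emit OpenCL C text at run time. A trivial sketch of such an emitter in C (emit_map_kernel and its parameters are invented for illustration):

    #include <stdio.h>

    /* Emit an element-wise map kernel whose body applies `expr` to x. */
    void emit_map_kernel(char *buf, size_t len,
                         const char *name, const char *expr) {
        snprintf(buf, len,
                 "__kernel void %s(__global const float *in,\n"
                 "                 __global float *out) {\n"
                 "    size_t i = get_global_id(0);\n"
                 "    float x = in[i];\n"
                 "    out[i] = %s;\n"
                 "}\n", name, expr);
    }

For example, emit_map_kernel(buf, sizeof buf, "square", "x * x") produces source that can be handed directly to clCreateProgramWithSource().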
Is there any advantage to using floatN instead of float in OpenCL?
For example:
float3 position;
and
float posX, posY, posZ;
Thank you
It depends on the hardware.
NVidia GPUs have a scalar architecture, so vectors provide little advantage on them over writing purely scalar code. Quoting the NVidia OpenCL best practices guide (PDF link):
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience. It is also in general better to have more work-items than fewer using
large vectors.
With CPUs and ATI GPUs, you will gain more benefit from using vectors, as these architectures have vector instructions (though I've heard this might be different on the latest Radeons; I wish I had a link to the article where I read this).
Quoting the ATI Stream OpenCL programming guide (PDF link), for CPUs:
The SIMD floating point resources in a CPU (SSE) require the use of
vectorized types (float4) to enable packed SSE code generation and extract
good performance from the SIMD hardware.
This article provides a performance comparison on ATI GPUs of a kernel written with vectors vs pure scalar types.
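As a concrete point of comparison, the two kernels below do the same work. On SSE-style vector hardware the float4 version maps to packed instructions, while a scalar architecture simply splits it back into four independent operations. This is an invented sketch (it assumes the buffer length is a multiple of four), not the kernel from the linked article:

    // Scalar version: one float per work-item.
    __kernel void scale_scalar(__global const float *in,
                               __global float *out, float k) {
        size_t i = get_global_id(0);
        out[i] = k * in[i];
    }

    // Vectorized version: one float4 per work-item, a quarter of the
    // work-items, and a single packed multiply on SSE-like hardware.
    __kernel void scale_vec4(__global const float4 *in,
                             __global float4 *out, float k) {
        size_t i = get_global_id(0);
        out[i] = k * in[i];
    }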
In both Nvidia and AMD architectures, the memory is divided into banks of 128 bits. Often, reading a single float3 or float4 value is going to be faster for the memory controller than reading 3 separate floats.
When you read float values from consecutive memory addresses, you are relying heavily on the compiler to combine the reads for you. There is no guarantee that posX, posY, and posZ are in the same bank. Declaring it as float3 usually forces the locations of the component floats to fall within the same bank.
How the GPUs handle vector computations varies between the vendors, but the memory accesses on both platforms will benefit from vectorization.
I'm not terribly familiar with OpenCL, but in GLSL doing math with vectors is more efficient because the GPU can apply the same operation to all N components concurrently. GLSL vectors also support operations like dot products as built-in language features.
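For what it's worth, OpenCL C exposes the same style of vector built-ins, so the point carries over. A small invented sketch:

    // dot() collapses three multiplies and two adds into one built-in call;
    // fmax() clamps the result, as in a simple diffuse-lighting step.
    __kernel void lighting(__global const float4 *normals,
                           __global float *intensity, float4 light_dir) {
        size_t i = get_global_id(0);
        intensity[i] = fmax(dot(normals[i].xyz, light_dir.xyz), 0.0f);
    }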