Why the byte-addressable stores limitation in OpenCL 1.0?

I'm quite new to OpenCL. Reading the OpenCL 1.0 specification, I saw that in order to perform writes through pointers to types less than 32 bits in size, the following directive must be included:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
My question is:
Why was this limitation introduced?

The limitation exists because there is/was hardware (mostly GPUs) that only supports 32-bit memory writes, meaning that packed 32-bit vector types (char4, for example) were the only way to write out small types. So the standard was engineered to the lowest common denominator, and 32-bit write sizes were mandated.
However, as some OpenCL-compatible devices did have memory controllers supporting 16- or 8-bit writes, the extension was added to allow small types to be written directly to memory without requiring packing into vector containers.
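To illustrate the difference, here is a minimal kernel sketch (the kernel and argument names are just illustrative). Without the extension, small values have to be packed into a 32-bit-wide vector before the store; with it, individual bytes can be stored directly:

    // Without the extension: pack four 8-bit values into a uchar4 so the
    // store is a single 32-bit-wide write.
    __kernel void write_bytes_packed(__global uchar4 *out)
    {
        size_t i = get_global_id(0);
        out[i] = (uchar4)(0, 1, 2, 3);
    }

    // With the extension enabled, sub-32-bit stores become legal.
    #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

    __kernel void write_bytes(__global uchar *out)
    {
        size_t i = get_global_id(0);
        out[i] = (uchar)(i & 0xFF);  // direct 8-bit store
    }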

Related

Shipping reliable OpenCL applications - Tools/Techniques/Tips?

I want to ship OpenCL code that should work on all OpenCL 1.1 compatible GPUs. Rather than buying a bunch of GPUs and testing on them, are there any tools that can help ensure reliability?
If anyone has experience shipping OpenCL applications to a wide hardware base, I'd be interested in knowing about any other methods for testing reliability.
I have a bit of knowledge on this. Unfortunately, the answer is: it depends on what the kernel is doing.
My biggest gripe is with NVIDIA and OpenCL, since they don't seem to support vectors (float2, float4, etc.) and global offsets. Kind of obnoxious. Intel and ATI are both solid, but even then preferred vector sizes can differ. The above doesn't really matter if you are doing image convolution.
It does matter if you want to run AMD's FFT on an NVIDIA card, are doing matrix math, etc. To address the vector issue, you can write multiple kernels, each with a different vector size, and call the right one at runtime: MatrixMult_float4(...).
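A sketch of how that dispatch might look on the host side, assuming the program defines MatrixMult_float, MatrixMult_float2, and MatrixMult_float4 with identical argument lists (all names here are hypothetical):

    /* Pick a kernel variant based on the device's preferred vector width. */
    cl_uint width = 0;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof width, &width, NULL);

    const char *name = (width >= 4) ? "MatrixMult_float4"
                     : (width >= 2) ? "MatrixMult_float2"
                                    : "MatrixMult_float";
    cl_int err;
    cl_kernel kernel = clCreateKernel(program, name, &err);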
You can check whether your code compiles by using the AMD KernelAnalyzer2, although this needs some component of the Catalyst drivers, so it only works for me on PCs with AMD GPUs. There is also the Intel Kernel Builder, which works for devices with Intel OpenCL SDK support. Nvidia's implementation has bugs, especially on newer GPUs in my experience, so the best approach there is to test on one GPU from each generation.
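A cheap, portable complement to these vendor tools is to test-compile your source on every device you can get access to and dump the build log; a minimal sketch (includes and most error handling omitted):

    /* Test-compile `source` for `device` in `context`; print the log on failure. */
    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    err = clBuildProgram(program, 1, &device, "-cl-std=CL1.1", NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        printf("Build failed:\n%s\n", log);
        free(log);
    }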
To avoid extensions and validate CL language versions, one could try test-compiling the code using LLVM, or just use the grammar for validation, e.g. as BNF.
There's a promising open source project, which probably contains useful stuff: http://bazaar.launchpad.net/~pocl/pocl/master/files/head:/lib/CL/
However, the problems I encountered were:
Newline characters (CR, LF, CRLF) in OpenCL source files caused build breaks on certain implementations. Specifying one of these as the only valid line ending would be just stupid. If one is editing source files on different platforms in conjunction with an SCM, it can get inconvenient. So I remove comments and clean up line breaks before compilation (see the sketch after this list).
Performance: feeding the GPU efficiently using multithreading; different hardware configurations have different bottlenecks. Here I needed a client-side pipeline with multiple dispatcher threads. Of course, the amount of work that remains on the CPU depends on the task and on the capabilities, number, and resources of the compute devices. Tasks that needed serialized execution or dynamic loop counts were such candidates.
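For the line-ending problem above, a small normalization pass is enough; a sketch in C:

    /* Normalize CR and CRLF line endings to LF, in place. */
    void normalize_newlines(char *s)
    {
        char *dst = s;
        for (char *src = s; *src; ++src) {
            if (*src == '\r') {
                *dst++ = '\n';
                if (src[1] == '\n') ++src;  /* collapse CRLF into one LF */
            } else {
                *dst++ = *src;
            }
        }
        *dst = '\0';
    }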

Main difference between OpenCL and OpenCL Embedded profile

Recently I have seen OpenCL EP support on some development boards like the ODROID-XU. One thing I know is that OpenCL EP is for embedded processors such as ARM, but in what features does it differ from the main desktop-based OpenCL?
The main differences are enumerated below (as of OpenCL 1.2):
64-bit integer support is optional.
Support for 3D images is optional.
Support for 2D image array writes is optional; if the cles_khr_2d_image_array_writes extension is supported by the embedded profile, writes to 2D image arrays are supported.
There are some limitations on the available channel data types for images and image arrays (in particular, images with channel data types of CL_FLOAT and CL_HALF_FLOAT only support the CL_FILTER_NEAREST sampler filter mode).
There are limitations on the sampler addressing modes available to images and image arrays.
There are some floating-point rounding changes that you may need to take into account.
Floating-point addition, subtraction, and multiplication will always be correctly rounded; other operations, such as division and square root, have varying accuracies. There are tons of other floating-point things to watch out for as well.
Conversions between integer data types and floating-point types are limited in precision (but there are exceptions).
In short, the main differences here are in floating-point accuracy. In other words, the embedded profile need not adhere to the IEEE 754 floating-point specification, which may be a problem if you are doing lots of numerical calculations which rely on it. Quoted from the specification:
This relaxation of the requirement to adhere to IEEE 754 requirements for basic floating-point operations, though extremely undesirable, is to provide flexibility for embedded devices that have lot stricter requirements on hardware area budgets.
There is also something that is not mentioned in section 10 but is worth noting: while desktop profiles must have a compiler available to compile OpenCL kernels, embedded profiles need not provide one. This can be seen through the clGetDeviceInfo documentation, which states:
CL_DEVICE_COMPILER_AVAILABLE: Return type: cl_bool
Is CL_FALSE if the implementation does not have a compiler available to compile the program source. Is CL_TRUE if the compiler is available. This can be CL_FALSE for the embededed (sic) platform profile only.
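Both properties can be checked at runtime with clGetDeviceInfo; a short sketch (error handling omitted):

    /* Query the profile string and compiler availability of a device. */
    char profile[64];
    cl_bool has_compiler = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_PROFILE, sizeof profile, profile, NULL);
    clGetDeviceInfo(device, CL_DEVICE_COMPILER_AVAILABLE,
                    sizeof has_compiler, &has_compiler, NULL);
    /* profile is "FULL_PROFILE" or "EMBEDDED_PROFILE"; if has_compiler is
       CL_FALSE, kernels must be shipped as precompiled binaries and loaded
       with clCreateProgramWithBinary. */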
For a complete and detailed list of the OpenCL Embedded Profile specification, fire up your PDF reader, download the OpenCL spec (whichever version you are developing for), and find the relevant section.
Section 10 of the standard answers your question. This section is entirely dedicated to the OCL embedded profile, and starts by enumerating the restrictions that this profile implies.

Is there a general binary intermediate representation for OpenCL kernel programming?

As I understand it, OpenCL uses a modified C language (adding some keywords like __global) for defining kernel functions. Now I am writing a front-end in F#, which has a code quotation feature for metaprogramming (you can think of it as a kind of reflection technique). So I would like to know whether there is a general binary intermediate representation for kernels instead of a C source file.
I know that CUDA supports LLVM IR as a binary intermediate representation, so kernels can be created programmatically, and I want to do the same thing with OpenCL. But the documentation says that the binary format is not specified; each implementation can use its own. So is there any general-purpose IR that can be generated programmatically and will also run on the NVIDIA, AMD, and Intel implementations of OpenCL?
Thanks.
No, not yet. Khronos is working on SPIR (the spec is still provisional), which would hopefully become this. As far as I can tell, none of the major implementations support it yet. Unless you want to bet your project on its success and possibly delay your project for a year or two, you should probably start with generating code in the C dialect.

Radeon HD 4850 and OpenCL: will cl_khr_fp64 work on this video card?

This video card (Radeon HD 4850) conforms only to OpenCL 1.0, per AMD's compatibility table. I need some hardware to conduct intensive financial calculations with doubleN types (no floats at all!). According to this card table, the card is able to work with double types. Now I have the chance to buy it at quite an attractive price.
I'd greatly appreciate it if an answerer has real experience working with this card in OpenCL with the fp64 extension. Of course, if there are problems with this card, please note them here.
Thank you and sorry for my English.
I haven't used this card with DP before, but if the spec says it is supported, then it's worth a try.
In my opinion, you should go with a newer model card though. There are a lot of cheap cards out that will outperform the 4850, and they will support some new features as well.
This card supports double precision, but the 4xxx series doesn't have on-chip local memory. As the standard mandates local memory support, it is emulated with global memory and is very slow. Many algorithms require local memory to obtain a good speed-up. So a newer card, 5xxx and higher, is a lot better.
In addition, some combinations of older cards and older SDK versions only support double precision through the cl_amd_fp64 extension (not the official cl_khr_fp64 extension) because some small requirements of the standard are not met. For the most part this doesn't matter much, except that you need to change the extension name in your code to make it work with doubles.
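A common way to handle the extension-name difference without maintaining two kernel sources is to choose the pragma at compile time, since the compiler predefines a macro for each supported extension. A minimal sketch:

    // Prefer the official extension; fall back to AMD's variant if needed.
    #if defined(cl_khr_fp64)
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #elif defined(cl_amd_fp64)
    #pragma OPENCL EXTENSION cl_amd_fp64 : enable
    #else
    #error "No double precision support on this device."
    #endif

    __kernel void scale(__global double *x, double a)
    {
        size_t i = get_global_id(0);
        x[i] *= a;
    }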
As a general tip, I would avoid the 4xxx series if you intend to do serious GPGPU development. Keep in mind also that the newer 7xxx series is much better optimized for GPU computation than both the 5xxx and 6xxx series, closing much of the gap with NVIDIA cards. So, if you can, aim for a 7xxx with double precision support.

float VS floatN

Is there any advantage to using floatN instead of float in OpenCL?
For example:
float3 position;
and
float posX, posY, posZ;
Thank you
It depends on the hardware.
NVidia GPUs have a scalar architecture, so vectors provide little advantage on them over writing purely scalar code. Quoting the NVidia OpenCL best practices guide (PDF link):
The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience. It is also in general better to have more work-items than fewer using large vectors.
With CPUs and ATI GPUs, you will gain more benefit from using vectors, as these architectures have vector instructions (though I've heard this might be different on the latest Radeons; wish I had a link to the article where I read this).
Quoting the ATI Stream OpenCL programming guide (PDF link), for CPUs:
The SIMD floating point resources in a CPU (SSE) require the use of vectorized types (float4) to enable packed SSE code generation and extract good performance from the SIMD hardware.
This article provides a performance comparison on ATI GPUs of a kernel written with vectors vs pure scalar types.
In both Nvidia and AMD architectures, the memory is divided into banks of 128 bits. Often, reading a single float3 or float4 value is going to be faster for the memory controller than reading 3 separate floats.
When you read float values from consecutive memory addresses, you are relying heavily on the compiler to combine the reads for you. There is no guarantee that posX, posY, and posZ are in the same bank. Declaring it as float3 usually forces the locations of the component floats to fall within the same bank.
How GPUs handle vector computations varies between vendors, but memory accesses on both platforms will benefit from vectorization.
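As a concrete illustration of the memory-access point, here is a sketch of the same computation with three scalar reads versus one vector-wide read (kernel names are just illustrative):

    // Scalar version: three separate loads the compiler must try to combine.
    __kernel void lengths_scalar(__global const float *x,
                                 __global const float *y,
                                 __global const float *z,
                                 __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
    }

    // Vector version: one float4-wide load per work-item (x, y, z, padding).
    __kernel void lengths_vector(__global const float4 *pos,
                                 __global float *out)
    {
        size_t i = get_global_id(0);
        float4 p = pos[i];
        out[i] = sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
    }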
I'm not terribly familiar with OpenCL, but in GLSL doing math with vectors is more efficient because the GPU can apply the same operation to all N components concurrently. Also, in GLSL vectors support operations like dot products as built-in language features.
