float vs floatN - OpenCL

Is there any advantage to using floatN instead of float in OpenCL?
for example
float3 position;
and
float posX, posY, posZ;
Thank you

It depends on the hardware.
NVidia GPUs have a scalar architecture, so vectors provide little advantage on them over writing purely scalar code. Quoting the NVidia OpenCL best practices guide (PDF link):
The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience. It is also in general better to have more work-items than fewer using large vectors.
With CPUs and ATI GPUs, you will gain more benefits from using vectors as these architectures have vector instructions (though I've heard this might be different on the latest Radeons - wish I had a link to the article where I read this).
Quoting the ATI Stream OpenCL programming guide (PDF link), for CPUs:
The SIMD floating point resources in a CPU (SSE) require the use of vectorized types (float4) to enable packed SSE code generation and extract good performance from the SIMD hardware.
This article provides a performance comparison on ATI GPUs of a kernel written with vectors vs pure scalar types.
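To make the distinction concrete, here is a minimal sketch of the same multiply-add written scalar and vectorized (the kernel names and the saxpy-style operation are my own illustration, not from any of the guides quoted above). On NVidia's scalar architecture both typically compile to equivalent machine code; on SSE-class CPUs the float4 version makes packed code generation much easier:

/* Scalar version: one float per work-item. */
__kernel void saxpy_scalar(__global const float *x,
                           __global const float *y,
                           __global float *out,
                           const float a) {
    size_t i = get_global_id(0);
    out[i] = a * x[i] + y[i];
}

/* Vector version: four floats per work-item, so enqueue it
 * with a global size of N/4. */
__kernel void saxpy_float4(__global const float4 *x,
                           __global const float4 *y,
                           __global float4 *out,
                           const float a) {
    size_t i = get_global_id(0);
    out[i] = a * x[i] + y[i];
}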

In both Nvidia and AMD architectures, the memory is divided into banks of 128 bits. Reading a single float3 or float4 value is often faster for the memory controller than reading 3 separate floats.
When you read float values from consecutive memory addresses, you are relying heavily on the compiler to combine the reads for you. There is no guarantee that posX, posY, and posZ will be in the same bank. Declaring the position as a float3 usually forces the locations of the component floats to fall within the same bank.
How the GPUs handle the vector computations varies between the vendors, but the memory accesses on both platforms will benefit from the vectorization.
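As an illustration of the layout difference described above (the kernel and parameter names are mine), note that OpenCL aligns a float3 to 16 bytes, the same as a float4, so each position occupies one aligned slot and moves in a single vector access:

/* One vector load and one vector store per work-item. */
__kernel void shift_vec(__global float3 *pos, const float3 delta) {
    size_t i = get_global_id(0);
    pos[i] += delta;
}

/* Three independent load/store pairs that the compiler must
 * coalesce on its own. */
__kernel void shift_scalar(__global float *posX,
                           __global float *posY,
                           __global float *posZ,
                           const float dx, const float dy, const float dz) {
    size_t i = get_global_id(0);
    posX[i] += dx;
    posY[i] += dy;
    posZ[i] += dz;
}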

I'm not terribly familiar with OpenCL, but in GLSL doing math with vectors is more efficient because the GPU can apply the same operation to all N components concurrently. In addition, GLSL vectors support operations like dot products as built-in language features.
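For what it's worth, OpenCL C exposes very similar vector built-ins; a hedged sketch (the lighting computation and all names are my own example) using dot(), normalize(), and fmax() on float4 values:

/* Each built-in operates on all four components at once. */
__kernel void lighting(__global const float4 *normals,
                       __global float *intensity,
                       const float4 light_dir) {
    size_t i = get_global_id(0);
    intensity[i] = fmax(dot(normalize(normals[i]), light_dir), 0.0f);
}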

Related

Why is the division-by-constant optimization not implemented in LLVM IR?

According to the source code *1 given below and my experiment, LLVM implements a transform that changes a division by a constant into a multiplication and a shift right.
In my experiment, this optimization is applied in the backend (I saw the change in the x86 assembly code, not in the LLVM IR).
I know this change may be associated with the hardware. My thinking is that on some hardware a multiplication plus a shift right may be more expensive than a single division, so the optimization is implemented in the backend.
But when I searched DAGCombiner.cpp, I found a function named isIntDivCheap(), and the comments in its definition point out that the decision of whether division is cheap or expensive depends on whether we optimize for code size or for speed.
That is, if I always optimize for speed, the division will be converted to a multiplication and a shift right; otherwise, it will not be converted.
On the other hand, either a single division is always slower than a multiplication and a shift right, or the function has to do more work to decide the cost.
So, why is this optimization NOT implemented in LLVM IR if a single division is always slower?
*1: https://llvm.org/doxygen/DivisionByConstantInfo_8cpp.html
Interesting question. Based on my experience working on LLVM front ends for High-Level Synthesis (HLS) compilers, the answer to your question lies in understanding the LLVM IR and the limitations/scope of the optimizations at the LLVM IR stage.
The LLVM Intermediate Representation (IR) is the backbone that connects frontends and backends, allowing LLVM to parse multiple source languages and generate code for multiple targets. Hence, at the LLVM IR stage, it's often about intent rather than full-fledged performance optimization.
Divide-by-constant optimization is very much performance driven. That is not to say that optimizations at the IR level have little or nothing to do with performance; however, there are inherent limitations to what can be optimized at the IR stage, and divide-by-constant runs into those limitations.
To be more precise, the IR is not entrenched deeply enough in low-level machine details and instructions. If you look, the optimizations at the LLVM IR level are composed of analysis and transform passes, and as far as I know there is no divide-by-constant pass at the IR stage.
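To make the transform concrete, here is a small C illustration of the rewrite in question (a hand-derived instance, not LLVM's actual code): unsigned division by the constant 3 becomes a widening multiply by the magic number 0xAAAAAAAB = ceil(2^33 / 3), followed by a shift. Only a backend with target knowledge can decide per-CPU whether this sequence actually beats the hardware divider, which is exactly what isIntDivCheap() is asking.

#include <assert.h>
#include <stdint.h>

/* Holds for every 32-bit x: floor(x * ceil(2^33/3) / 2^33) == x / 3. */
static uint32_t div3_magic(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

int main(void) {
    for (uint32_t x = 0; x < 10000000u; x++)
        assert(div3_magic(x) == x / 3);
    return 0;
}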

How much can MPI_Alltoall outperform MPI_Alltoallv?

I wonder what the difference is in terms of running time between executing the MPI_Alltoallv and MPI_Alltoall functions when the amount of transferred data is approximately the same? I couldn't find any such benchmark results. I am interested in large-scale instances, where tens of thousands (or better, hundreds of thousands) of MPI processes are used and where these processes correspond to a substantial part of a given HPC system (considering at best some modern ones, such as BG/Q, Cray XC30, Cray XE6, ...).
Overview
One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine if, for example, the communication is nearly homogeneous, highly sparse, or some other pattern.
The other issue is that MPI_Alltoall can easily use the output buffer as scratch space because every process provides and consumes the same amount of data. For MPI_Alltoallv, it's not practical to do all the bookkeeping, so any scratch space is going to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
Implementation Skeletons
There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:
Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:
Pad your arrays and use MPI_Alltoall directly.
Use MPI_Alltoall for the subset of processes that have homogeneous communication and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.
Write your own implementation of Bruck's algorithm that handles the cases where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile this one is.
Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall (see the sketch below), because that's likely the best the MPI library will ever do, and doing it yourself allows you to tune the batch size if you want.
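A minimal sketch of the sparse case (the function and parameter names are mine; the count and displacement vectors are the ones you would otherwise have handed to MPI_Alltoallv):

#include <mpi.h>
#include <stdlib.h>

void sparse_alltoallv(const double *sendbuf, const int *sendcounts,
                      const int *sdispls, double *recvbuf,
                      const int *recvcounts, const int *rdispls,
                      MPI_Comm comm) {
    int np, nreq = 0;
    MPI_Comm_size(comm, &np);
    MPI_Request *reqs = malloc(2 * (size_t)np * sizeof *reqs);

    /* Post all receives first, then the sends; skip zeros entirely. */
    for (int p = 0; p < np; p++)
        if (recvcounts[p] > 0)
            MPI_Irecv(recvbuf + rdispls[p], recvcounts[p], MPI_DOUBLE,
                      p, 0, comm, &reqs[nreq++]);
    for (int p = 0; p < np; p++)
        if (sendcounts[p] > 0)
            MPI_Isend(sendbuf + sdispls[p], sendcounts[p], MPI_DOUBLE,
                      p, 0, comm, &reqs[nreq++]);

    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}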
Papers
MPI on a Million Processors describes the scalability issue associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers not to touch the vector arguments more than necessary.
HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is rather difficult to implement inside of an MPI library, because it may be rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)
Code
You can discover some interesting things by comparing the implementations of these operations in MPICH (https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoall.c and https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoallv.c). Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV on https://software.intel.com/en-us/node/528906. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.
Practical Experience
MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv (source code), so there was no difference relative to MPI_Alltoallv, and the latter might even have been better since the application pre-populated the vector arguments.
I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now to be absolutely sure of the details.
However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonfly networks, on systems where the CPU is much faster than the network, will be quite different.
I suggest you write a benchmark and measure it where you intend to run your application. Once you have some data, it will be much easier to figure out what optimizations may be missed.
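As a starting point, here is a minimal benchmark sketch along those lines (buffer size and iteration count are placeholders to tune for your system); it times both collectives with identical, homogeneous counts, so any gap is pure implementation overhead:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 1024   /* elements per process pair; tune this */
#define ITERS 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int np, me;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    double *sbuf = malloc((size_t)np * COUNT * sizeof *sbuf);
    double *rbuf = malloc((size_t)np * COUNT * sizeof *rbuf);
    int *counts = malloc(np * sizeof *counts);
    int *displs = malloc(np * sizeof *displs);
    for (int i = 0; i < np; i++) { counts[i] = COUNT; displs[i] = i * COUNT; }
    for (int i = 0; i < np * COUNT; i++) sbuf[i] = (double)me;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++)
        MPI_Alltoall(sbuf, COUNT, MPI_DOUBLE,
                     rbuf, COUNT, MPI_DOUBLE, MPI_COMM_WORLD);
    double t_a2a = (MPI_Wtime() - t0) / ITERS;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++)
        MPI_Alltoallv(sbuf, counts, displs, MPI_DOUBLE,
                      rbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
    double t_a2av = (MPI_Wtime() - t0) / ITERS;

    if (me == 0)
        printf("alltoall: %.3e s  alltoallv: %.3e s\n", t_a2a, t_a2av);

    free(sbuf); free(rbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}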

Radeon HD 4850 and OpenCL: will cl_khr_fp64 work on this videocard?

This video card (Radeon HD 4850) conforms only to OpenCL 1.0, per the AMD compatibility table. I need some hardware to conduct intensive financial calculations with doubleN types (no floats at all!). According to this card table, this card is able to work with double types. Now I have the chance to buy it at quite an attractive price.
I'd greatly appreciate it if an answerer has real experience working with this card for OpenCL with the fp64 extension. Of course, if there are problems with this card, please drop a couple of lines here.
Thank you and sorry for my English.
I haven't used this card with DP before, but if the spec says it is supported, then it's worth a try.
In my opinion, you should go with a newer model card though. There are a lot of cheap cards out there that will outperform the 4850, and they will support some new features as well.
This card supports double precision, but the 4xxx series doesn't have on-chip local memory. Since the standard mandates local memory support, it is emulated in global memory and is very slow. Many algorithms require local memory to obtain a good speed-up. So a newer card, 5xxx or higher, is a lot better.
In addition, some combinations of older cards and older SDK versions only support double precision through the cl_amd_fp64 extension (not the official cl_khr_fp64 extension), because a few small things from the standard are not supported. For the most part this doesn't matter much, except that you need to change the extension name in your code to make it work with doubles (see the snippet below).
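For illustration, kernel source typically guards the pragma on the extension macros the OpenCL compiler defines, so switching names is a one-line change (a sketch; the scale kernel is my own example):

/* Enable whichever double-precision extension the device reports. */
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif

__kernel void scale(__global double *data, const double factor) {
    size_t i = get_global_id(0);
    data[i] *= factor;
}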
As a general tip, I would avoid the 4xxx series if you intend to do serious GPGPU development. Keep in mind also that the newer 7xxx series is much better optimized for GPU computation than both the 5xxx and 6xxx series, closing much of the gap with NVIDIA cards. So, if you can, aim for a 7xxx with double precision support.

Kernels can invoke a broader number of functions than shaders

I read an article which stated that "Kernels can invoke a broader number of functions than shaders". How far is this true?
link for that article is http://www.dyn-lab.com/articles/cl-gl.html
The difference is quite the opposite actually. If you compare Section 8 of the GLSL specification with Section 6.12 of the OpenCL specification, you can see that there is a large overlap concerning mathematical operations.
However, GLSL has far more bit- and image-related operations and provides matrix operations which do not exist in OpenCL 1.2. On the other hand, OpenCL has more synchronization primitives and work-group management functions that are not necessary in GLSL. Moreover, OpenCL provides smaller and larger integer types than GLSL.
Also, in Appendix C of the AMD APP OpenCL Programming Guide, the number and types of available functions are not listed as a major difference between a shader and a kernel.

Is there any benefit in nVidia Tesla cards?

I'm planning to buy a serious GPU for running a parallel algorithm on (budget 2k-4k). Now I see everywhere supercomputers featuring nVidia Tesla GPU cards "made especially for GPGPU".
While this seems very nice at first sight, a closer reading gives me serious second thoughts: compared to e.g. a Radeon HD 7970, its performance (in terms of FLOPS) is significantly lower, its price is significantly higher, and I can't seem to find any benchmark comparison between the Tesla and normal gaming GPUs.
I have found that the Tesla features ECC memory. Is this the only difference? Or am I missing a deeper architectural difference between the two? Perhaps relevant info: I will be using OpenCL, not CUDA.
There are two technical differences I know of between the brands when you compare similar cards.
1) Nvidia cards tend to have better double precision FLOPS than AMD - by a factor of 2 sometimes. AMD usually does better for single precision FLOPS.
2) ECC memory is available from both brands for the GDDR5 memory. The difference is that Nvidia uses ECC on the internal memory (registers and such) as well, whereas AMD does not.
In my opinion, choose the card based on your application. If you use more single than double precision, go AMD; otherwise, Nvidia. If you need ECC for high fault tolerance, maybe Nvidia is your best choice. Sometimes many cheaper cards do better than one or two top-of-the-line cards; think of PCI-e bandwidth. Read up on benchmarks, and try to determine which card is best suited for your needs.
I don't know if your problem is similar to mining bitcoins, but there is a LOT of info on parallel GPU setups here...
https://en.bitcoin.it/wiki/Mining_hardware_comparison
