I use my GPU to perform lots of integer arithmetic. mul24() and mad24() are very helpful for getting significant integer performance boosts. Sadly, some of my kernels need more than 24-bit integers, forcing me to fall back on compiler-generated code, which is not always optimal. If I could access a hardware instruction equivalent to mul_hi() but for 24-bit integers, call it mul24_hi(), I would get better performance from my GPUs.
Is there any equivalent to mul_hi() but for 24-bit integers, or any pattern/idiom/workaround to reliably instruct the compiler to emit it?
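For what it's worth, no mul24_hi() exists in standard OpenCL, so the name below is made up. One workaround is to split each unsigned operand into 12-bit halves so that every partial product fits exactly into mul24()'s 32-bit result, then reassemble the high bits by hand. Whether this beats the compiler-generated 32-bit path depends on the hardware; it's only a sketch:

// Emulated "mul24_hi" for unsigned operands a, b < 2^24.
// Splits into 12-bit halves so each mul24() partial product is exact.
uint mul24_hi_emulated(uint a, uint b)
{
    uint a0 = a & 0xFFFu, a1 = a >> 12;
    uint b0 = b & 0xFFFu, b1 = b >> 12;
    uint p0 = mul24(a0, b0);                  // bits  0..23
    uint p1 = mul24(a1, b0) + mul24(a0, b1);  // bits 12..36, at most 25 bits
    uint p2 = mul24(a1, b1);                  // bits 24..47
    uint lo  = p0 + (p1 << 12);               // low 32 bits, first partial sum
    uint c0  = (lo < p0) ? 1u : 0u;           // carry out of the first add
    uint lo2 = lo + (p2 << 24);
    uint c1  = (lo2 < lo) ? 1u : 0u;          // carry out of the second add
    return (p1 >> 20) + (p2 >> 8) + c0 + c1;  // bits 32..47 of the 48-bit product
}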
In terms of SIMD and parallelization, what is the difference between AVX2 and AVX-512? Are they the same thing or different? All I can see is that double8 is used with AVX-512 and double4 with AVX2.
I am using PyOpenCL to write kernel code in C and am not sure what the difference would be.
AVX2 is a 256-bit vector instruction set: you have 256-bit registers which can be interpreted several ways (8 floats, 4 doubles, 32 bytes, etc.). AVX itself supports only floating-point operations on those registers; AVX2 adds 256-bit integer operations. AVX-512 is a set of 512-bit vector instructions. There are only two flavors of AVX, plain old AVX and AVX2, while AVX-512 comes in many different flavors. You may find Intel's Intrinsics Guide interesting.
The biggest difference is simply getting twice as many operations processed per instruction. That said, there are certain instructions in AVX-512 which may make some specific things more efficient (exponent approximations, for example).
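To make the width difference concrete, here is a minimal host-side C sketch using Intel intrinsics (the function names add4/add8 are mine; note this is plain C with immintrin.h, not PyOpenCL kernel code):

#include <immintrin.h>

// AVX2: one 256-bit register holds 4 doubles.
void add4(const double *a, const double *b, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
}

// AVX-512: one 512-bit register holds 8 doubles, so one
// instruction processes twice as many elements.
void add8(const double *a, const double *b, double *out)
{
    __m512d va = _mm512_loadu_pd(a);
    __m512d vb = _mm512_loadu_pd(b);
    _mm512_storeu_pd(out, _mm512_add_pd(va, vb));
}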
I would like to ask whether the functions MPI_Send and MPI_Recv introduce any rounding error similar to MPI_Reduce. I thought they should not, since the rounding error of MPI_Reduce comes from differences in the order in which the processors execute the reduction, and MPI_Send and MPI_Recv have no similar procedure.
I would also like to ask whether it is sound to verify the calculation of a parallel code that uses only MPI_Send and MPI_Recv by comparing its results with those of a serial code.
Thank you for your time.
MPI_Send and MPI_Recv do not perform rounding per se. But there could still be differences between the results of the serial code and the parallel one on systems where higher internal precision is used. A typical example is x86 when the x87 FPU is used (mostly in 32-bit code). x87 operates on a small stack of 80-bit values, and all operations, even those involving values of lesser precision, are performed with 80-bit internal precision. Whenever an intermediate value has to be transferred to another MPI rank, it first gets rounded to either float or double (unless the non-standard extended-precision type is used), which removes significant bits that would otherwise be there if the value were to remain in the x87 stack. This is not an MPI-specific problem, as it can also manifest in serial code as different results depending on the level of register optimisation performed by the compiler.
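As a small illustration of the point about values being rounded at the communication boundary, here is a sketch (the 1.0/3.0 value is just a placeholder) where the sender's partial result is transmitted as a 64-bit double, regardless of any extended precision it may have had in a register:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double partial = 1.0 / 3.0;   /* stand-in for a partial result */
    if (rank == 1) {
        /* the value is rounded to a 64-bit double on the wire */
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        double received;
        MPI_Recv(&received, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %.17g\n", received);
    }
    MPI_Finalize();
    return 0;
}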
I read a couple of questions on SO about this topic (SIMD mode), but I still need some clarification/confirmation of how things work.
Why use SIMD if we have GPGPU?
SIMD intrinsics - are they usable on gpus?
CPU SIMD vs GPU SIMD?
Are the following points correct if I compile the code in SIMD-8 mode?
1) It means that instructions from 8 different work items are executed in parallel.
2) Does it mean that all work items execute the same instruction only?
3) If each work item's code contains only a vload16 load, then float16 operations, and then a vstore16 store, will SIMD-8 mode still work? That is, is it true that the GPU still executes the same instruction (either vload16, a float16 operation, or vstore16) for all 8 work items?
How should I understand this concept?
In the past, many OpenCL vendors required the use of vector types to exploit SIMD. Nowadays, OpenCL vendors pack work items into SIMD lanes, so there is no need to use vector types. Whether vector types are preferred can be checked by querying CL_DEVICE_PREFERRED_VECTOR_WIDTH_<CHAR, SHORT, INT, LONG, FLOAT, DOUBLE>.
On Intel, if a vector type is used, the vectorizer first scalarizes it and then re-vectorizes the code to make use of the wide instruction set. This is probably similar on other platforms.
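A minimal host-side sketch of that query, assuming device already holds a valid cl_device_id (a preferred width of 1 usually means the compiler prefers scalar code and packs work items into SIMD lanes itself):

#include <stdio.h>
#include <CL/cl.h>

void print_preferred_float_width(cl_device_id device)
{
    cl_uint width = 0;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);
    printf("preferred float vector width: %u\n", width);
}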
So far I have learned that a processor has registers; for a 32-bit processor they are 32 bits, and for a 64-bit processor they are 64 bits. So can someone explain what happens if I give the processor a value larger than its register size? How is the calculation performed?
It depends.
Assuming x86 for the sake of discussion, 64-bit integers can still be handled "natively" on a 32-bit architecture. In this case, the program often uses a pair of 32-bit registers to hold the 64-bit value. For example, the value 0xDEADBEEF2B84F00D might be stored in the EDX:EAX register pair:
eax = 0x2B84F00D
edx = 0xDEADBEEF
The CPU actually expects 64-bit numbers in this format in some cases (IDIV, for example).
Math operations are done in multiple instructions. For example, a 64-bit add on a 32-bit x86 CPU is done with an add of the lower DWORDs, and then an adc of the upper DWORDs, which takes into account the carry flag from the first addition.
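In C, that carry chain looks roughly like this (a sketch; a compiler targeting 32-bit x86 emits the add/adc pair directly):

#include <stdint.h>

// 64-bit addition out of 32-bit halves: 'add' on the low words,
// then 'adc' on the high words using the detected carry.
void add64(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi,
           uint32_t *r_lo, uint32_t *r_hi)
{
    uint32_t lo = a_lo + b_lo;
    uint32_t carry = (lo < a_lo);  /* unsigned wraparound == carry flag */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}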
For even bigger integers, an arbitrary-precision arithmetic (or "big int") library is used. Here, a dynamically-sized array of bytes is used to represent the integer, with additional information (like the number of bits used). GMP is a popular choice.
Mathematical operations on big integers are done iteratively, typically one native word at a time. For the gory details, I suggest you have a look through the source code of one of these open-source libraries.
The key to all of this is that numeric operations are carried out in manageable pieces and combined to produce the final result.
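For example, schoolbook addition over an array of 32-bit "limbs" might look like this (a sketch, not GMP's actual representation):

#include <stddef.h>
#include <stdint.h>

// r = a + b over n limbs, least significant limb first.
// The carry propagates one word at a time, exactly as described above.
void bigint_add(const uint32_t *a, const uint32_t *b, uint32_t *r, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        uint32_t c = (s < carry);   /* wrap while adding the carry in */
        r[i] = s + b[i];
        carry = c | (r[i] < s);     /* at most one carry out per limb */
    }
}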
I'm setting up software to work with multiplication of large numbers, and I'd like to compare the speed between several techniques, one of which is OpenCL. How can I pass in and multiply two 256-bit unsigned integers? What are the performance implications of this? What's the practical limit of how large numbers can get before the performance becomes terrible?
OpenCL only has intrinsic support for 8-, 16-, 32- and 64-bit integers. There might be vendor extensions for big integers on some platforms, but at least on the ones I am familiar with, there are none. If you want larger integer types, you will have to implement them yourself.
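One common approach is to represent a 256-bit unsigned integer as eight 32-bit limbs and do schoolbook multiplication in the kernel. A sketch in OpenCL C (names are mine; this keeps only the low 256 bits of the product, and schoolbook cost grows quadratically with the limb count, which is what eventually makes very wide integers slow):

/* 256-bit unsigned integer, limb[0] least significant */
typedef struct { uint limb[8]; } u256;

u256 u256_mul_lo(u256 a, u256 b)
{
    u256 r;
    for (int k = 0; k < 8; k++)
        r.limb[k] = 0;
    for (int i = 0; i < 8; i++) {
        uint carry = 0;
        for (int j = 0; i + j < 8; j++) {
            /* 32x32 -> 64-bit partial product, plus accumulator and carry */
            ulong t = (ulong)a.limb[i] * b.limb[j] + r.limb[i + j] + carry;
            r.limb[i + j] = (uint)t;
            carry = (uint)(t >> 32);
        }
    }
    return r;
}

On the host side, each operand can then be passed in as a buffer of eight 32-bit words.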