How to make ARGB transparency using bitwise operators - math

I need to blend two pixels for transparency:
pixel1: {A, R, G, B} - foreground pixel
pixel2: {A, R, G, B} - background pixel
A, R, G, B are byte values (each channel is represented by one byte).
Currently I'm calculating the blend as:
newR = pixel2_R * alpha / 255 + pixel1_R * (255 - alpha) / 255
newG = pixel2_G * alpha / 255 + pixel1_G * (255 - alpha) / 255
newB = pixel2_B * alpha / 255 + pixel1_B * (255 - alpha) / 255
but it is too slow.
I need to do it with bitwise operators (AND, OR, XOR, negation, bit shifts).
I want to do it on Windows Phone 7 XNA.
Attached C# code:
public static uint GetPixelForOpacity(uint reduceOpacityLevel, uint pixelBackground, uint pixelForeground, uint pixelCanvasAlpha)
{
    byte surfaceR = (byte)((pixelForeground & 0x00FF0000) >> 16);
    byte surfaceG = (byte)((pixelForeground & 0x0000FF00) >> 8);
    byte surfaceB = (byte)((pixelForeground & 0x000000FF));

    byte sourceR = (byte)((pixelBackground & 0x00FF0000) >> 16);
    byte sourceG = (byte)((pixelBackground & 0x0000FF00) >> 8);
    byte sourceB = (byte)((pixelBackground & 0x000000FF));

    uint newR = sourceR * pixelCanvasAlpha / 256 + surfaceR * (255 - pixelCanvasAlpha) / 256;
    uint newG = sourceG * pixelCanvasAlpha / 256 + surfaceG * (255 - pixelCanvasAlpha) / 256;
    uint newB = sourceB * pixelCanvasAlpha / 256 + surfaceB * (255 - pixelCanvasAlpha) / 256;

    return (uint)255 << 24 | newR << 16 | newG << 8 | newB;
}

You can't do an 8 bit alpha blend using only bitwise operations, unless you basically re-invent multiplication with basic ops (8 shift-adds).
You can do two methods as mentioned in other answers: use 256 instead of 255, or use a lookup table. Both have issues, but you can mitigate them. It really depends on what architecture you're doing this on: the relative speed of multiply, divide, shift, add and memory loads. In any case:
Lookup table: a trivial 256x256 lookup table is 64KB. This will thrash your data cache and end up being very slow. I wouldn't recommend it unless your CPU has an abysmally slow multiplier, but does have low latency RAM. You can improve performance by throwing away some alpha bits, e.g A>>3, resulting in 32x256=8KB of lookup, which has a better chance of fitting in cache.
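As an illustration of the reduced-table idea just described, here is a rough sketch (my own; lut32 and init_lut32 are hypothetical names, not from any answer here) where only the top 5 alpha bits are kept, giving a 32x256 = 8 KB table:
/* Reduced lookup table: index by (A >> 3) and the 8-bit channel value.
 * lut32[(A >> 3) * 256 + C] approximates C * A / 255. */
static unsigned char lut32[32 * 256];

static void init_lut32(void)
{
    for (int a = 0; a < 32; a++)
        for (int c = 0; c < 256; c++) {
            int alpha = (a << 3) | 0x4;                     /* midpoint of the 8-value alpha bucket */
            lut32[a * 256 + c] = (unsigned char)(c * alpha / 255);
        }
}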
Use 256 instead of 255: the idea is that dividing by 256 is just a shift right by 8. This will be slightly off and tends to round down, darkening the image slightly: e.g. if R=255, A=255, then (R*A)/256 = 254. You can cheat a little and do (R*A+R+A)/256, or just (R*A+R)/256 or (R*A+A)/256, which give 255 in that case. Or, scale A to 0..256 first, e.g. A = (256*A)/255. That's just one expensive divide-by-255 per pixel instead of six. Then, with R=255 and A=256, (R*A)/256 = 255.
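A rough sketch of that "scale alpha to 0..256" trick (my own, in C rather than C#; blend_channel is a hypothetical helper, and it follows the question's convention of weighting the background by alpha and the foreground by its complement):
static unsigned int blend_channel(unsigned int fg, unsigned int bg, unsigned int alpha /* 0..255 */)
{
    unsigned int a = (alpha * 256) / 255;    /* rescale alpha to 0..256 with one divide */
    return (bg * a + fg * (256 - a)) >> 8;   /* the per-channel divides become a shift  */
}
In practice you would compute a once per pixel and reuse it for R, G and B, so there is one divide per pixel instead of six.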

I don't think it can be done with the same precision using only those operators. Your best bet is, I reckon, using a LUT (as long as the LUT can fit in the CPU cache, otherwise it might even be slower)
// allocate the LUT (64KB)
unsigned char lut[256*256] __cacheline_aligned; // __cacheline_aligned is a GCC-ism

// macro to access the LUT
#define LUT(pixel, alpha) (lut[(alpha)*256+(pixel)])

// precompute the LUT
for (int alpha_value = 0; alpha_value < 256; alpha_value++) {
    for (int pixel_value = 0; pixel_value < 256; pixel_value++) {
        LUT(pixel_value, alpha_value) = (unsigned char)((double)(pixel_value) * (double)(alpha_value) / 255.0);
    }
}

// in the loop
unsigned char ialpha = 255 - alpha;
newR = LUT(pixel2_R, alpha) + LUT(pixel1_R, ialpha);
newG = LUT(pixel2_G, alpha) + LUT(pixel1_G, ialpha);
newB = LUT(pixel2_B, alpha) + LUT(pixel1_B, ialpha);
Otherwise you should try vectorizing your code. But to do that, you should at least provide us with more info on your CPU architecture and compiler. Keep in mind that your compiler might be able to vectorize automatically, if given the right options.

Related

math behind normal compression

The code below (rewritten in C#) is used to compress a unit normal vector; it is taken from Wild Magic 5.17. Could someone explain some of the math behind it, or share some related references? I can figure out the octant bit setting, but the mantissa packing and unpacking seem complex ...
Code gist
Some of the code is here:
// ...
public static ushort CompressNormal(Vector3 normal)
{
    var x = normal.x;
    var y = normal.y;
    var z = normal.z;
    Debug.Assert(MathUtil.IsSame(x * x + y * y + z * z, 1));

    // Determine octant.
    ushort index = 0;
    if (x < 0.0)
    {
        index |= 0x8000;
        x = -x;
    }
    if (y < 0.0)
    {
        index |= 0x4000;
        y = -y;
    }
    if (z < 0.0)
    {
        index |= 0x2000;
        z = -z;
    }

    // Determine mantissa.
    ushort usX = (ushort)Mathf.Floor(gsFactor * x);
    ushort usY = (ushort)Mathf.Floor(gsFactor * y);
    ushort mantissa = (ushort)(usX + ((usY * (255 - usY)) >> 1));
    index |= mantissa;
    return index;
}
// ...
The author wanted to use 13 bits.
The trivial way: 6 bits for the x component + 6 bits for y occupies only 12 bits, so he invented an approach that assigns ~90 (LSB) units to x and ~90 (MSB) units to y (90*90 ~ 2^13).
I have no idea why he uses a quadratic formula for the y component - this way gives a slightly different distribution of approximated values between smaller and larger values - but why specifically for y?
I've asked Mr. Eberly (the author of Wild Magic) and he gave me the reference. In short, the code above maps (x, y) to an index into a triangular array (the index runs from 0 to N * (N + 1) / 2 - 1).
More details are in the related doc here.
Btw, another solution here uses a different compression method.
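As a sketch of that triangular-array mapping (a generic illustration with a hypothetical tri_index helper, not the exact Wild Magic constants): row y of the triangle holds N - y entries, so the rows before it account for y*N - y*(y-1)/2 slots, which is where a quadratic term in y naturally appears.
/* Linearize the triangular domain 0 <= x, 0 <= y, x + y < N into one index. */
static unsigned int tri_index(unsigned int x, unsigned int y, unsigned int N)
{
    /* rows 0..y-1 hold N + (N-1) + ... + (N-y+1) = y*N - y*(y-1)/2 entries */
    return y * N - (y * (y - 1)) / 2 + x;   /* ranges over 0 .. N*(N+1)/2 - 1 */
}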

GPU driver does not respond after NDRangeKernel increase

I am new to OpenCL and I want to parallelise this Sieve of Atkin; the C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/
Somehow I don't get good results out of it - the CPU version is actually much faster. I tried to use NDRangeKernel to avoid writing the nested loops and hopefully increase performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is not right - could anyone help with it? I probably don't understand NDRange properly. Here is the info about my GPU:
CL_DEVICE_NAME: GeForce GT 740M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 397.31
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:
-CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
here is my NDRange code
queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);
and my kernel code:
__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    //printf("%d \n", x);

    int n = (4 * x * x) + (y * y);
    if (n <= limit && (n % 12 == 1 || n % 12 == 5))
        sieve[n] ^= true;

    n = (3 * x * x) + (y * y);
    if (n <= limit && n % 12 == 7)
        sieve[n] ^= true;

    n = (3 * x * x) - (y * y);
    if (x > y && n <= limit && n % 12 == 11)
        sieve[n] ^= true;

    for (int r = 5; r * r < limit; r++) {
        if (sieve[r]) {
            for (int i = r * r; i < limit; i += r * r)
                sieve[i] = false;
        }
    }
}
You have lots of branching in that code, and I suspect that's what may be killing your performance on GPUs. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.
I'm not sure how possible it is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
As for the lock-up, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking for errors on every OpenCL function you call?
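For example, a minimal error-checking sketch using the plain C API (my own; check() is a hypothetical helper, and queue, kernel and gsize in the usage comment are placeholders for your own handles):
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Abort with the error code if an OpenCL call failed. */
static void check(cl_int err, const char *what)
{
    if (err != CL_SUCCESS) {
        fprintf(stderr, "%s failed with error %d\n", what, err);
        exit(EXIT_FAILURE);
    }
}

/* usage:
 *   cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gsize,
 *                                       NULL, 0, NULL, NULL);
 *   check(err, "clEnqueueNDRangeKernel");
 *   check(clFinish(queue), "clFinish");
 */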
Regardless of how good or bad your algorithm or implementation is - the driver should always respond. Non-response is quite possibly a bug. File a bug report at http://developer.nvidia.com/ .

OpenCL: does the size of local memory have an impact on speed?

I am new to OpenCL and I am trying to compute the histogram of a grayscale image. I am performing this computation on an NVIDIA GT 330M GPU.
The code is:
__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
    int local_offset = get_local_id(0) * 256;
    int histogram_global_offset = get_global_id(0) * 256;
    int offset = get_global_id(0) * 1920;
    int value;

    for (unsigned int i = 0; i < 256; i++) {
        histogram[local_offset + i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int i = 0; i < 1920; i++) {
        value = input[offset + i].i;
        histogram[local_offset + value]++;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int i = 0; i < 256; i++) {
        global_hist[histogram_global_offset + i] = histogram[local_offset + i];
    }
}
This computation is performed on image 1920*1080.
I am firing kernels with
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));
When the local size of histogram is set to 256 * sizeof(cl_int), the time for this computation is (through NVIDIA Nsight performance analysis) 11 675 microseconds.
Because the local work-group size is set to one, I tried increasing the work-group size to 8. But when I increase the local size of histogram to 256 * 8 * sizeof(cl_int) and still compute with a local work-group size of 1, I get 85 177 microseconds.
So when I fire it with 8 work-items per work-group, I don't get a speedup relative to 11 ms but relative to 85 ms; the final time with 8 work-items per work-group is 13 714 microseconds.
But when I introduce a computation bug - setting local_offset to zero, keeping the local histogram size at 256 * sizeof(cl_int) and using 8 work-items per work-group - I get a much better time: 3 854 microseconds.
Does anybody have some ideas to speed up this computation?
Thanks!
This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
__kernel void histogram(__global struct gray * input, __global int * global_hist){
    int group_id = get_group_id(0);
    int num_groups = get_num_groups(0);
    int local_id = get_local_id(0);
    int local_size = get_local_size(0);

    volatile __local int histogram[256];

    int i;
    for (i = local_id; i < 256; i += local_size) {
        histogram[i] = 0;
    }
    // make sure all bins are zeroed before any work-item starts counting
    barrier(CLK_LOCAL_MEM_FENCE);

    int rowNum, colNum, value, global_hist_offset;

    for (rowNum = group_id; rowNum < 1080; rowNum += num_groups) {
        for (colNum = local_id; colNum < 1920; colNum += local_size) {
            value = input[rowNum * 1920 + colNum].i;
            atomic_inc(&histogram[value]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    global_hist_offset = group_id * 256;
    for (i = local_id; i < 256; i += local_size) {
        global_hist[global_hist_offset + i] = histogram[i];
    }
}
Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, calculated using the num_groups value. This will work well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the fourth group) will start on row 3 of the image, and then process every 7th row thereafter. Group 3 would compute 154 rows in total, and its final row would be row 1074. Some work groups may compute one more row -- groups 0 and 1 in this example.
The same interlacing is done within the work group when looking at the columns of the image. In the colNum loop, the Nth work item starts at column N and skips ahead by local_size columns. The remainder in this loop shouldn't come into play as often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.
One final point about this kernel: the results are not identical to your original kernel's. Your global_hist array is 1080 * 256 integers; the one I have needs to be num_groups * 256 integers. This helps if you want a full reduction, because there is much less to add after the kernel executes.
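For completeness, a rough host-side sketch (my own, with a hypothetical reduce_histograms helper) of summing the num_groups partial histograms read back from global_hist into the final 256 bins:
/* partial holds num_groups histograms of 256 bins each, laid out back to back. */
void reduce_histograms(const int *partial, int num_groups, int *final_hist)
{
    for (int bin = 0; bin < 256; bin++) {
        int sum = 0;
        for (int g = 0; g < num_groups; g++)
            sum += partial[g * 256 + bin];
        final_hist[bin] = sum;
    }
}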

OpenCL AMD vs NVIDIA performance

I implemented a simple kernel which is some sort of a convolution. I measured it on an NVIDIA GT 240. It took 70 ms when written in CUDA and 100 ms when written in OpenCL. Ok, I thought, the NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong).
I need to run it on AMD GPUs, so I migrated to the AMD APP SDK. Exactly the same kernel code.
I made two tests and their results were discouraging for me: 200 ms on an HD 6670 and 70 ms on an HD 5850 (the same time as for the GT 240 + CUDA). And I am very interested in the reasons for such strange behaviour.
All projects were built in VS2010 using settings from the sample projects of NVIDIA and AMD respectively.
Please do not consider my post as an NVIDIA advertisement. I fully understand that the HD 5850 is more powerful than the GT 240. The only thing I wish to know is why there is such a difference and how to fix the problem.
Update. Below is the kernel code, which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered as a possible origin of one of the templates and is processed by a separate thread. The kernel compares the R, G, B values of each pixel of the base image and of the template, and if at least one difference exceeds the diff parameter, the corresponding pixel is counted as non-matching. If the number of non-matching pixels is less than maxNonmatchQt, the corresponding template is considered a hit.
__constant int tOffset = 8196;  // one template size in memory (in bytes)

__kernel void matchImage6( __global unsigned char* image,      // pointer to the base image
                           int imgWidth,                        // base image width
                           int imgHeight,                       // base image height
                           int imgPitch,                        // base image pitch (in bytes)
                           int imgBpp,                          // base image bytes (!) per pixel
                           __constant unsigned char* templates, // pointer to the array of templates
                           int tWidth,                          // templates width (the same for all)
                           int tHeight,                         // templates height (the same for all)
                           int tPitch,                          // templates pitch (in bytes, the same for all)
                           int tBpp,                            // templates bytes (!) per pixel (the same for all)
                           int diff,                            // max allowed difference of intensity
                           int maxNonmatchQt,                   // max number of nonmatched pixels
                           __global int* result )               // results
{
    int x0 = (int)get_global_id(0);
    int y0 = (int)get_global_id(1);
    if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
        return;

    int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
    for( int y = 0; y < tHeight; y++) {
        int ind = y * tPitch;
        int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
        for( int x = 0; x < tWidth; x++) {
            unsigned char c0 = image[baseImgInd];
            unsigned char c1 = image[baseImgInd + 1];
            unsigned char c2 = image[baseImgInd + 2];
            for( int i = 0; i < 6; i++)
                if( abs( c0 - templates[i * tOffset + ind]) > diff ||
                    abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
                    abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                    nonmatchQt[i]++;
            ind += tBpp;
            baseImgInd += imgBpp;
        }
        if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
            return;
    }

    for( int i = 0; i < 6; i++)
        if( nonmatchQt[i] < maxNonmatchQt) {
            unsigned int pos = atom_inc( &result[0]) * 3;
            result[pos + 1] = i;
            result[pos + 2] = x0;
            result[pos + 3] = y0;
        }
}
Kernel run configuration:
Global work size = (1900, 1200)
Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
Execution time:
HD 5850 - 69 ms,
HD 6670 - 200 ms,
GT 240 - 100 ms.
Any remarks about my code are also highly appreciated.
The difference in execution times is caused by compilers.
Your code can be easily vectorized. Consider image and templates as arrays of the vector type uchar4 (the fourth component of each uchar4 vector is always 0). Instead of 3 memory reads:
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
use only one:
uchar4 c = image[baseImgInd];
Instead of bulky if:
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
abs( c2 - templates[i * tOffset + ind + 2]) > diff)
nonmatchQt[i]++;
use fast:
uchar4 t = templates[i * tOffset + ind];
nonmatchQt[i] += any(abs_diff(c,t)>diff);
Thus you increase the performance of your code up to 3 times (if the compiler doesn't vectorize the code by itself). I suppose that the AMD OpenCL compiler does not perform such vectorization and other optimizations. In my experience, OpenCL on NVIDIA GPUs can usually be made faster than CUDA, because it is more low-level.
There can be no exact answer for this. OpenCL performance depends on many parameters: the number of accesses to global memory, the efficiency of the code, etc. Moreover, it is very difficult to compare two devices, since they may have different local, global and constant memories, number of cores, frequency, memory bandwidth and, more importantly, different hardware architectures.
Each piece of hardware provides its own performance boost, for example native_ functions on NVIDIA. So you need to explore more regarding the hardware you are working on; that might actually help. But what I would personally recommend is not to use such hardware-specific optimizations, as they might affect the flexibility of your code.
You can also find published papers showing that CUDA performance is much better than OpenCL performance on the same NVIDIA hardware.
So it is always better to write code which provides good flexibility, rather than device-specific optimizations.

Inverse sqrt for fixed point

I am looking for the best inverse square root algorithm for fixed-point 16.16 numbers. The code below is what I have so far (basically it takes the square root and then divides by the original number, and I would like to get the inverse square root without a division). If it changes anything, the code will be compiled for armv5te.
uint32_t INVSQRT(uint32_t n)
{
    uint64_t op, res, one;

    // integer square root of the 16.16 argument (digit-by-digit method)
    op = ((uint64_t)n << 16);
    res = 0;
    one = (uint64_t)1 << 46;
    while (one > op) one >>= 2;
    while (one != 0)
    {
        if (op >= res + one)
        {
            op -= (res + one);
            res += (one << 1);
        }
        res >>= 1;
        one >>= 2;
    }

    // invert: 1/sqrt(n) = sqrt(n)/n; this division is what I want to avoid
    res <<= 16;
    res /= n;
    return (res);
}
The trick is to apply Newton's method to the problem x - 1/y^2 = 0. So, given x, solve for y using an iterative scheme.
y_(n+1) = y_n * (3 - x*y_n^2) / 2
The divide by 2 is just a bit shift, or at worst, a multiply by 0.5. This scheme converges to y=1/sqrt(x), exactly as requested, and without any true divides at all.
The only problem is that you need a decent starting value for y. As I recall there are limits on the estimate y for the iterations to converge.
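As a minimal sketch of one such step in 16.16 fixed point (my own illustration, assuming 64-bit intermediates to avoid overflow; rsqrt_step_16_16 is a hypothetical name):
#include <stdint.h>

/* One Newton-Raphson step y' = y * (3 - x*y*y) / 2 in 16.16 fixed point.
 * x is the input, y the current estimate of 1/sqrt(x); the seed must be
 * close enough that 3 - x*y*y stays positive. */
static uint32_t rsqrt_step_16_16(uint32_t x, uint32_t y)
{
    uint64_t y2  = ((uint64_t)y * y) >> 16;             /* y*y   in 16.16 */
    uint64_t xy2 = ((uint64_t)x * y2) >> 16;            /* x*y*y in 16.16 */
    uint64_t t   = ((uint64_t)3 << 16) - xy2;           /* 3 - x*y*y      */
    return (uint32_t)((((uint64_t)y * t) >> 16) >> 1);  /* y * t / 2      */
}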
ARMv5TE processors provide a fast integer multiplier, and a "count leading zeros" instruction. They also typically come with moderately sized caches. Based on this, the most suitable approach for a high-performance implementation appears to be a table lookup for an initial approximation, followed by two Newton-Raphson iterations to achieve fully accurate results. We can speed up the first of these iterations further with additional pre-computation that is incorporated into the table, a technique used by Cray computers forty years ago.
The function fxrsqrt() below implements this approach. It starts out with an 8-bit approximation r to the reciprocal square root of the argument a, but instead of storing r, each table element stores 3*r (in the lower ten bits of the 32-bit entry) and r**3 (in the upper 22 bits of the 32-bit entry). This allows the quick computation of the first iteration as
r1 = 0.5 * (3*r - a*r**3). The second iteration is then computed in the conventional way as r2 = 0.5 * r1 * (3 - r1 * (r1 * a)).
To be able to perform these computations accurately, regardless of the magnitude of the input, the argument a is normalized at the start of the computation, in essence representing it as a 2.32 fixed-point number multiplied by a scale factor of 2**scal. At the end of the computation the result is denormalized according to the formula 1/sqrt(2**(2n)) = 2**(-n). By rounding up results whose most significant discarded bit is 1, accuracy is improved, resulting in almost all results being correctly rounded. The exhaustive test reports: results too low: 639, too high: 1454, not correctly rounded: 2093.
The code makes use of two helper functions: __clz() determines the number of leading zero bits in a non-zero 32-bit argument. __umulhi() computes the 32 most significant bits of a full 64-bit product of two unsigned 32-bit integers. Both functions should be implemented either via compiler intrinsics, or by using a bit of inline assembly. In the code below I am showing portable implementations well suited to ARM CPUs, along with inline assembly versions for x86 platforms. On ARMv5TE platforms __clz() should map to the CLZ instruction, and __umulhi() should map to UMULL.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

#define USE_OWN_INTRINSICS 1

#if USE_OWN_INTRINSICS
__forceinline int __clz (uint32_t a)
{
    int r;
    __asm__ ("bsrl %1,%0\n\t" : "=r"(r): "r"(a));
    return 31 - r;
}

uint32_t __umulhi (uint32_t a, uint32_t b)
{
    uint32_t r;
    __asm__ ("movl %1,%%eax\n\tmull %2\n\tmovl %%edx,%0\n\t"
             : "=r"(r) : "r"(a), "r"(b) : "eax", "edx");
    return r;
}
#else // USE_OWN_INTRINSICS
int __clz (uint32_t a)
{
    uint32_t r = 32;
    if (a >= 0x00010000) { a >>= 16; r -= 16; }
    if (a >= 0x00000100) { a >>=  8; r -=  8; }
    if (a >= 0x00000010) { a >>=  4; r -=  4; }
    if (a >= 0x00000004) { a >>=  2; r -=  2; }
    r -= a - (a & (a >> 1));
    return r;
}

uint32_t __umulhi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
#endif // USE_OWN_INTRINSICS
/*
* For each sub-interval in [1, 4), use an 8-bit approximation r to reciprocal
* square root. To speed up subsequent Newton-Raphson iterations, each entry in
* the table combines two pieces of information: The least-significant 10 bits
* store 3*r, the most-significant 22 bits store r**3, rounded from 24 down to
* 22 bits such that accuracy is optimized.
*/
uint32_t rsqrt_tab [96] =
{
0xfa0bdefa, 0xee6af6ee, 0xe5effae5, 0xdaf27ad9,
0xd2eff6d0, 0xc890aec4, 0xc10366bb, 0xb9a71ab2,
0xb4da2eac, 0xadce7ea3, 0xa6f2b29a, 0xa279a694,
0x9beb568b, 0x97a5c685, 0x9163027c, 0x8d4fd276,
0x89501e70, 0x8563da6a, 0x818ac664, 0x7dc4fe5e,
0x7a122258, 0x7671be52, 0x72e44a4c, 0x6f68fa46,
0x6db22a43, 0x6a52623d, 0x67041a37, 0x65639634,
0x622ffe2e, 0x609cba2b, 0x5d837e25, 0x5bfcfe22,
0x58fd461c, 0x57838619, 0x560e1216, 0x53300a10,
0x51c72e0d, 0x50621a0a, 0x4da48204, 0x4c4c2e01,
0x4af789fe, 0x49a689fb, 0x485a11f8, 0x4710f9f5,
0x45cc2df2, 0x448b4def, 0x421505e9, 0x40df5de6,
0x3fadc5e3, 0x3e7fe1e0, 0x3d55c9dd, 0x3d55d9dd,
0x3c2f41da, 0x39edd9d4, 0x39edc1d4, 0x38d281d1,
0x37bae1ce, 0x36a6c1cb, 0x3595d5c8, 0x3488f1c5,
0x3488fdc5, 0x337fbdc2, 0x3279ddbf, 0x317749bc,
0x307831b9, 0x307879b9, 0x2f7d01b6, 0x2e84ddb3,
0x2d9005b0, 0x2d9015b0, 0x2c9ec1ad, 0x2bb0a1aa,
0x2bb0f5aa, 0x2ac615a7, 0x29ded1a4, 0x29dec9a4,
0x28fabda1, 0x2819e99e, 0x2819ed9e, 0x273c3d9b,
0x273c359b, 0x2661dd98, 0x258ad195, 0x258af195,
0x24b71192, 0x24b6b192, 0x23e6058f, 0x2318118c,
0x2318718c, 0x224da189, 0x224dd989, 0x21860d86,
0x21862586, 0x20c19183, 0x20c1b183, 0x20001580
};
/* This function computes the reciprocal square root of its 16.16 fixed-point
* argument. After normalization of the argument it uses the most significant
* bits of the argument for a table lookup to obtain an initial approximation
* accurate to 8 bits. This is followed by two Newton-Raphson iterations with
* quadratic convergence. Finally, the result is denormalized and some simple
* rounding is applied to maximize accuracy.
*
* To speed up the first NR iteration, for the initial 8-bit approximation r0
* the lookup table supplies 3*r0 along with r0**3. A first iteration computes
* a refined estimate r1 = 1.5 * r0 - x * r0**3. The second iteration computes
* the final result as r2 = 0.5 * r1 * (3 - r1 * (r1 * x)).
*
* The accuracy for all arguments in [0x00000001, 0xffffffff] is as follows:
* 639 results are too small by one ulp, 1454 results are too big by one ulp.
* A total of 2093 results deviate from the correctly rounded result.
*/
uint32_t fxrsqrt (uint32_t a)
{
    uint32_t s, r, t, scal;

    /* handle special case of zero input */
    if (a == 0) return ~a;

    /* normalize argument */
    scal = __clz (a) & 0xfffffffe;
    a = a << scal;

    /* initial approximation */
    t = rsqrt_tab [(a >> 25) - 32];

    /* first NR iteration */
    r = (t << 22) - __umulhi (t, a);

    /* second NR iteration */
    s = __umulhi (r, a);
    s = 0x30000000 - __umulhi (r, s);
    r = __umulhi (r, s);

    /* denormalize and round result */
    r = ((r >> (18 - (scal >> 1))) + 1) >> 1;
    return r;
}
/* reference implementation, 16.16 reciprocal square root of non-zero argument */
uint32_t ref_fxrsqrt (uint32_t a)
{
    double arg = a / 65536.0;
    double rsq = sqrt (1.0 / arg);
    uint32_t r = (uint32_t)(rsq * 65536.0 + 0.5);
    return r;
}
int main (void)
{
    uint32_t arg = 0x00000001;
    uint32_t res, ref;
    uint32_t err, lo = 0, hi = 0;
    do {
        res = fxrsqrt (arg);
        ref = ref_fxrsqrt (arg);
        err = 0;
        if (res < ref) {
            err = ref - res;
            lo++;
        }
        if (res > ref) {
            err = res - ref;
            hi++;
        }
        if (err > 1) {
            printf ("!!!! arg=%08x res=%08x ref=%08x\n", arg, res, ref);
            return EXIT_FAILURE;
        }
        arg++;
    } while (arg);
    printf ("results too low: %u too high: %u not correctly rounded: %u\n",
            lo, hi, lo + hi);
    return EXIT_SUCCESS;
}
I have a solution that I characterize as "fast inverse sqrt, but for 32bit fixed points". No table, no reference, just straight to the point with a good guess.
If you want, jump to the source code below, but beware of a few things.
(x * y)>>16 can be replaced with any fixed-point multiplication scheme you want.
This does not require 64-bit [long-words], I just use that for the ease of demonstration. Long words are used to prevent overflow in multiplication. A fixed-point math library will have fixed-point multiplication functions that handle this better.
The initial guess is pretty good, so you get relatively precise results in the first incantation.
The code is more verbose than needed for demonstration.
Values less than 65536 (<1) and greater than 32767<<16 cannot be used.
This is generally not faster than using a square root table and division if your hardware has a division function. If it does not, this avoids divisions.
int fxisqrt(int input){
    if(input <= 65536){
        return 1;
    }

    long xSR = input>>1;
    long pushRight = input;
    long msb = 0;
    long shoffset = 0;
    long yIsqr = 0;
    long ysqr = 0;
    long fctrl = 0;
    long subthreehalf = 0;

    while(pushRight >= 65536){
        pushRight >>= 1;
        msb++;
    }
    shoffset = (16 - ((msb)>>1));
    yIsqr = 1<<shoffset;

    //y = (y * (98304 - ( ( (x>>1) * ((y * y)>>16 ) )>>16 ) ) )>>16; x2
    //Incantation 1
    ysqr = (yIsqr * yIsqr)>>16;
    fctrl = (xSR * ysqr)>>16;
    subthreehalf = 98304 - fctrl;
    yIsqr = (yIsqr * subthreehalf)>>16;

    //Incantation 2 - Increases precision greatly, but may not be necessary
    ysqr = (yIsqr * yIsqr)>>16;
    fctrl = (xSR * ysqr)>>16;
    subthreehalf = 98304 - fctrl;
    yIsqr = (yIsqr * subthreehalf)>>16;

    return yIsqr;
}
