Bitwise operation on floating point numbers (for graphics)? [duplicate]

Bitwise operation on floating point numbers (for graphics)? [duplicate] - math

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
how to perform bitwise operation on floating point numbers
Hello, everyone!
Background:
I know that it is possible to apply bitwise operation on graphics (for example XOR). I also know, that in graphic programs, graphic data is often stored in floating point data types (to be able for example to "multiply" the data with 1.05). So it must be possible to perform bitwise operations on floating point data, right?
I need to be able to perform bitwise operations on floating point data. I do not want to cast the data to long, bitwise manipulate it, and cast back to float.
I assume, there exist a mathematical way to achieve this, which is more elegant (?) and/or faster (?).
I've seen some answers but they could not help, including this one.
EDIT:
That other question involves void-pointer casting, which would rely on deeper-level data representation. So it's not such an "exact duplicate".

By the time the "graphics data" hits the screen, none of it is floating point. Bitwise operations are really done on bit strings. Bitwise operations only make sense on numbers because of consistent encoding scheme to binary. Trying to get any kind of logical bitwise operations on floats other than extracting the exponent or mantissa is a road to hell.
Basically, you probably don't want to do this. Why do you think you do?

A floating point number is just another representation of a binary in memory, so you could:
measure the size of the data type (e.g. 32 bits), e.g. sizeof(pixel)
get a pointer to it - choose an integer type of the same size for that, e.g. UINT *ptr = &pixel
use the pointer's value, e.g. newpixel=(*ptr)^(*ptr)
This should at least work with non-negative values and should have no considerable calculative overhead, at least in an unmanaged context like C++. Maybe you have to mask out some bits when doing your operation, and - depending of the type - you may have to treat exponent and base separately.

Related

Are there any conventional rules for offset binary arithmetic with fixed size bit-width registers?

Let's say we have two fixed sized binary numbers (8-bit), e.g.
00000101 (5) and 00100000 (32)
The task is to add them in offset binary (excess of 128). Are there any specific rules concerning how to go about this?
Would I for instance first convert both the numbers into offset binary notation, then add them and afterwards subtract the offset (because I added it twice)? But if so, what about overflow, given that the imaginary registers are only 8 bit wide?
Or would I first subtract the excess and then add the second number? Are there any conventional rules when it comes to offset binary arithmetic?
I'm preparing for an exam in computer architecture and computer data arithmetic. This has been a task on an exercise sheet in a previous term. I'v already searched the net extensively for answers but can't seem to find a solid one.

I do not know what the "conventional rules" are for this operation, but I can tell you how I did this operation back when I did machine code.
This method works well when the offset is half the first number that overflows the register. That is the case for you, since the offset is 128 and the 8-bit register overflows on 256. This works especially well when the two numbers you want to add are already in the offset format.
The method is: add the two offset numbers, as unsigned addition and ignoring any overflow, then flip the most significant bit.
In your case, you are adding 10000101 (5 in offset) and 10100000 (32 in offset). Adding those results in 00100101, since there is overflow out of the most significant bit. Flipping the most-significant bit results in 10100101, which is indeed 37 in offset format.
This method may result in overflow, but only when the result is too positive or too negative to fit into the offset format anyway. And in most CPUs the two operations (unsigned addition and flipping the MSB) are practically trivial.

Optimize dataset for floating point add/sub/mul/div

Suppose we have a data set of numbers, with which we want to do some calculations using addition/subtraction/multiplication/division using a computer.
The coverage of the real numbers by the floating point representation varies a lot, depending on the number being represented:
In terms of absolute precision in the real->FP mapping the "holes" grow towards the bigger numbers, with a weird hole around 0, depending on the architecture. Due to this, the add/sub precision towards the bigger numbers will drop.
If we divide 2 consecutive numbers which are represented in our floating point representation, the result of the division will be bigger both while going to the bigger numbers and when going to smaller and smaller fractions.
So, my question is:
Is there a "sweet interval" for floats on an ordinary PC today, where the results for the arithmetics with the said operators (add/sub/mul/div) are just more precise?
If I have a data set of many-significant-digit numbers like "123123123123123", "134534513412351151", etc., with which I want to do some arithmetics, which floating point interval should it be converted to, to have the best precision for the result?
Since floating points are something like 1.xxx*10^yyy, 2.xxx*10^yyy, ..., 9.xxx*10^yyy, I would assume, converting my numbers into the [1, 9] interval would give the best results for the memory consumed, but I may be terribly wrong...
Suppose I use C, can such conversion even be made? Is there a best-practice to do that? Before an operation, C will convert the operands to the same format, so I guess I would have to use a string representation, inject a "." somewhere and parse that as float.
Please note:
This is a theoretical question, I don't have an actual data set on my hand that would decide what is best. On the same note, the mentioning of C was random, I am also interested in responses like "forget C, I would use this and this, BECAUSE it supports this and this".
Please spare me from answers like "this cannot be answered, because it depends on the actual operations, since the results may be in another magnitude range than the original data, etc., etc.". Let's suppose that the results of the calculation is more or less in the same interval, as the operands. Sure, when dividing the "more-or-less the same magnitude" operands, the result will be somewhere between 1-10, maybe 0.1-100, ... , but that is probably exactly the best interval they can be in.
Of course, if the answer includes some explanation, other than a brush-off, I will be happy to read it!

The absolute precision of floating-point numbers changes with the magnitude of the numbers because the exponent changes. The relative precision does not change, except for numbers near the bottom of the exponent range, where underflow occurs. If you multiply binary floating-point numbers by a power of two, perform arithmetic (suitably adjusted for the scaling), and reverse the scaling, the results will be identical to doing the arithmetic without scaling, barring effects from overflow and underflow. If your arithmetic does involve underflow or overflow, then scaling could help avoid that. For example, if your precision is suffering because your numbers are so small that some intermediate results are below the normal range of the floating-point format, then scaling by a power of two can avoid the loss of precision from underflow.
If you scale by something other than a power of two, the results can be different, due to changes in the significands. The effects will generally be tiny, and whether the results are better or worse will effectively be random chance, except in carefully engineered special situations.

Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).
Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.
But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:
Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?
Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.

I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, all the more frequent that you are rounding from a 52-bit significand to a significand with only a few bits removed. Borrowing one bit from the significand as you suggest in an edit to your question would be the worst, with half the operations statistically producing double-rounded results that are different from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5ULP, the basic operations would be accurate to 3/4ULP, a dramatic loss of accuracy that would derail many of the existing, finely-designed numerical algorithms that expect 0.5ULP.
Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).
Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:
if ((repr & 7) == 4)
repr += (repr & 8) >> 1); /* midpoint case */
else
repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */
Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.

This is a summary of my own research and information from the excellent answer by #Pascal Cuoq.
There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.
Truncating the mantissa
We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.
Portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARD_ZERO), and then setting the last bit if the FE_INEXACT exception occurs. After computing the final double this way, we simply round to nearest for storage.
The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).
Truncating the exponent
We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).
The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10-308...10308 to about 10-38 to 1038. Seems OK for computation, but we still lose a lot.
Comparison
Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:
the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range
Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

Addition and multiplication in a Galois Field

I am attempting to generate QR codes on an extremely limited embedded platform. Everything in the specification seems fairly straightforward except for generating the error correction codewords. I have looked at a bunch of existing implementations, and they all try to implement a bunch of polynomial math that goes straight over my head, particularly with regards to the Galois fields. The most straightforward way I can see, both in mathematical complexity and in memory requirements is a circuit concept that is laid out in the spec itself:
With their description, I am fairly confident I could implement this with the exception of the parts labeled GF(256) addition and GF(256) Multiplication.
They offer this help:
The polynomial arithmetic for QR Code shall be calculated using bit-wise modulo 2 arithmetic and byte-wise
modulo 100011101 arithmetic. This is a Galois field of 2^8
with 100011101 representing the field's prime modulus
polynomial x^8+x^4+x^3+x^2+1.
which is all pretty much greek to me.
So my question is this: What is the easiest way to perform addition and multiplication in this kind of Galois field arithmetic? Assume both input numbers are 8 bits wide, and my output needs to be 8 bits wide also. Several implementations precalculate, or hardcode in two lookup tables to help with this, but I am not sure how those are calculated, or how I would use them in this situation. I would rather not take the 512 byte memory hit for the two tables, but it really depends on what the alternative is. I really just need help understanding how to do a single multiplication and addition operation in this circuit.

In practice only one table is needed. That would be for the GP(256) multiply. Note that all arithmetic is carry-less, meaning that there is no carry-propagation.
Addition and subtraction without carry is equivalent to an xor.
So in GF(256), a + b and a - b are both equivalent to a xor b.
GF(256) multiplication is also carry-less, and can be done using carry-less multiplication in a similar way with carry-less addition/subtraction. This can be done efficiently with hardware support via say Intel's CLMUL instruction set.
However, the hard part, is reducing the modulo 100011101. In normal integer division, you do it using a series of compare/subtract steps. In GF(256), you do it in a nearly identical manner using a series of compare/xor steps.
In fact, it's bad enough where it's still faster to just precompute all 256 x 256 multiplies and put them into a 65536-entry look-up table.
page 3 of the following pdf has a pretty good reference on GF256 arithmetic:
http://www.eecs.harvard.edu/~michaelm/CS222/eccnotes.pdf

(I'm following up on the pointer to zxing in the first answer, since I'm the author.)
The answer about addition is exactly right; that's why working in this field is convenient on a computer.
See http://code.google.com/p/zxing/source/browse/trunk/core/src/com/google/zxing/common/reedsolomon/GenericGF.java
Yes multiplication works, and is for GF256. a * b is really the same as exp(log(a) + log(b)). And because GF256 has only 256 elements, there are only 255 unique powers of "x", and same for log. So these are easy to put in a lookup table. The tables would "wrap around" at 256, so that is why you see the "% size". "/ size" is slightly harder to explain in a sentence -- it's because really 1-255 "wrap around", not 0-255. So it's not quite just a simple modulus that's needed.
The final piece perhaps is how you reduce modulo an irreducible polynomial. The irreducibly polynomial is x^8 plus some lower-power terms, right -- call it I(x) = x^8 + R(x). And the polynomial is congruent to 0 in the field, by definition; I(x) == 0. So x^8 == -R(x). And, conveniently, addition and subtraction are the same, so x^8 == -R(x) == R(x).
The only time we need to reduce higher-power polynomials is when constructing the exponents table. You just keep multiplying by x (which is a shift left) until it gets too big -- gets an x^8 term. But x^8 is the same as R(x). So you take out the x^8 and add in R(x). R(x) merely has powers up to x^7 so it's all in a byte still, all in GF(256). And you know how to add in this field.
Helps?

How do programming languages handle huge number arithmetic

For a computer working with a 64 bit processor, the largest number that it can handle would be 264 = 18,446,744,073,709,551,616. How does programming languages, say Java or be it C, C++ handle arithmetic of numbers higher than this value. Any register cannot hold it as a single piece. How was this issue tackled?

There are lots of specialized techniques for doing calculations on numbers larger than the register size. Some of them are outlined in this wikipedia article on arbitrary precision arithmetic
Low level languages, like C and C++, leave large number calculations to the library of your choice. One notable one is the GNU Multi-Precision library. High level languages like Python, and others, integrate this into the core of the language, so normal numbers and very large numbers are identical to the programmer.

You assume the wrong thing. The biggest number it can handle in a single register is a 64-bits number. However, with some smart programming techniques, you could just combined a few dozens of those 64-bits numbers in a row to generate a huge 6400 bit number and use that to do more calculations. It's just not as fast as having the number fit in one register.
Even the old 8 and 16 bits processors used this trick, where they would just let the number overflow to other registers. It makes the math more complex but it doesn't put an end to the possibilities.
However, such high-precision math is extremely unusual. Even if you want to calculate the whole national debt of the USA and store the outcome in Zimbabwean Dollars, a 64-bits integer would still be big enough, I think. It's definitely big enough to contain the amount of my savings account, though.

Programming languages that handle truly massive numbers use custom number primitives that go beyond normal operations optimized for 32, 64, or 128 bit CPUs. These numbers are especially useful in computer security and mathematical research.
The GNU Multiple Precision Library is probably the most complete example of these approaches.
You can handle larger numbers by using arrays. Try this out in your web browser. Type the following code in the JavaScript console of your web browser:
The point at which JavaScript fails
console.log(9999999999999998 + 1)
// expected 9999999999999999
// actual 10000000000000000 oops!
JavaScript does not handle plain integers above 9999999999999998. But writing your own number primitive is to make this calculation work is simple enough. Here is an example using a custom number adder class in JavaScript.
Passing the test using a custom number class
// Require a custom number primative class
const {Num} = require('./bases')
// Create a massive number that JavaScript will not add to (correctly)
const num = new Num(9999999999999998, 10)
// Add to the massive number
num.add(1)
// The result is correct (where plain JavaScript Math would fail)
console.log(num.val) // 9999999999999999
How it Works
You can look in the code at class Num { ... } to see details of what is happening; but here is a basic outline of the logic in use:
Classes:
The Num class contains an array of single Digit classes.
The Digit class contains the value of a single digit, and the logic to handle the Carry flag
Steps:
The chosen number is turned into a string
Each digit is turned into a Digit class and stored in the Num class as an array of digits
When the Num is incremented, it gets carried to the first Digit in the array (the right-most number)
If the Digit value plus the Carry flag are equal to the Base, then the next Digit to the left is called to be incremented, and the current number is reset to 0
... Repeat all the way to the left-most digit of the array
Logistically it is very similar to what is happening at the machine level, but here it is unbounded. You can read more about about how digits are
carried here; this can be applied to numbers of any base.

Ada actually supports this natively, but only for its typeless constants ("named numbers"). For actual variables, you need to go find an arbitrary-length package. See Arbitrary length integer in Ada

More-or-less the same way that you do. In school, you memorized single-digit addition, multiplication, subtraction, and division. Then, you learned how to do multiple-digit problems as a sequence of single-digit problems.
If you wanted to, you could multiply two twenty-digit numbers together using nothing more than knowledge of a simple algorithm, and the single-digit times tables.

In general, the language itself doesn't handle high-precision, high-accuracy large number arithmetic. It's far more likely that a library is written that uses alternate numerical methods to perform the desired operations.
For example (I'm just making this up right now), such a library might emulate the actual techniques that you might use to perform that large number arithmetic by hand. Such libraries are generally much slower than using the built-in arithmetic, but occasionally the additional precision and accuracy is called for.

As a thought experiment, imagine the numbers stored as a string. With functions to add, multiply, etc these arbitrarily long numbers.
In reality these numbers are probably stored in a more space efficient manner.

Think of one machine-size number as a digit and apply the algorithm for multi-digit multiplication from primary school. Then you don't need to keep the whole numbers in registers, just the digits as they are worked on.

Most languages store them as array of integers. If you add/subtract two to of these big numbers the library adds/subtracts all integer elements in the array separately and handles the carries/borrows.
It's like manual addition/subtraction in school because this is how it works internally.
Some languages use real text strings instead of integer arrays which is less efficient but simpler to transform into text representation.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex