integer stored as float - math

I have some questions regarding integers and floats:
Can I store every 32-bit unsigned integer value into a 64-bit IEEE floating point value (such that when I assign the double value back to an int the int will contain the original value)?
What are the smallest (magnitude wise) positive and negative integer values that cannot be stored in a 32-bit IEEE floating point value (by the same definition as in 1)?
Do the answers to these questions depend on language used?
//edit: I know these questions sound a bit like they're from some test, but I'm asking because I need to make some decisions on a data format definition

Yes, you can store any 32-bit integer in a 64-bit double without information loss. The mantissa has 53 bits of precision, which is more than enough.
A 32-bit float has a 24-bit mantissa, so every integer with magnitude up to 2^24 (16777216) survives the round trip. The smallest-magnitude integers that do not are 16777217 and -16777217 (that is, 2^24 + 1): for example, (float)16777217 rounds to 16777216, so 16777216 == (float)16777217.
If you assume the language follows IEEE-754, it doesn't depend on the language.
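A quick sketch in Java (chosen just for illustration; any IEEE-754 language behaves the same, and the variable names are mine) demonstrating both points:
// Any 32-bit unsigned value round-trips through a 64-bit double, because 2^32 - 1 < 2^53:
long unsigned32 = 0xFFFFFFFFL;                     // largest 32-bit unsigned value, held in a long
double asDouble = unsigned32;
System.out.println((long) asDouble == unsigned32); // true for every 32-bit value

// A 32-bit float runs out of integer precision just above 2^24:
System.out.println((float) 16777216 == (float) 16777217); // true: both round to 16777216.0f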

Related

sqlite3 shortens float value

I do a simple:
latitude:String = String.fromCString(UnsafePointer(sqlite3_column_text(statement, 11)))!
The value in the Database is "real".
In the database I have
51.234183426424316 (verified using Firefox's SQLite Manager)
With the above I get in my String only:
51.2341834264243
(the last two digits are missing, which is not acceptable when working with coordinates)
Any explanations? Solutions?
SQLite stores such numbers as 64-bit IEEE floating-point numbers, which have a significand precision of 53 bits, corresponding to about 15-17 decimal digits.
How to format such a number for display is a different question.
If you want to have control over it, get the original value with sqlite3_column_double(), and convert it to a string yourself.
(And you are complaining about a difference that is smaller than the wavelength of visible light ...)
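For illustration only, here is a rough Java sketch of the same idea (the latitude literal is taken from the question); what you see depends entirely on the format you ask for, not on what the double holds:
double lat = 51.234183426424316;          // the full-precision value stored in the double
System.out.println(lat);                  // prints enough digits to round-trip the value
System.out.printf("%.13g%n", lat);        // only 13 significant digits: trailing digits are cut off
System.out.printf("%.17g%n", lat);        // forces up to 17 significant digits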

Two's Complement -- How are negative numbers handled?

It is my understanding that numbers are negated using two's complement, which to my understanding is: !num + 1.
So my question is: does this mean that, for a variable 'foo' = 1, a negated 'foo' will be exactly the same as a variable 'bar' = 255?
If we were to check whether -'foo' == 'bar' or whether -'foo' == 255, would we get that they are equal?
I know that some languages, such as Java, keep a sign bit -- so the comparisons would yield false. What of languages that do not? And I'm assuming that assembler/native machine does not have a sign bit.
In addition to all of this, I read about a zero flag or a carry-over flag that is set when a 'negative' number is added to another number (of any sign). This flag gets set because of the way two's complement works: 0x01 + 0xff = 0x00 (with the leading 1 truncated). What exactly is this flag used for?
And my last question, for other math operations (such as multiplication), would I have to re-negate the number (so it is now positive), perform the operation, and negate the result? E.g., !((!neg + 1) * pos) + 1.
Edit
Finished the question, so feel free to fire away.
Yes, in two’s complement, the number -x is represented as ~x+1, where ~x is the bitwise complement of the binary numeral for x in some fixed number of bits. E.g., for eight bits and x = 1, the binary numeral for x is 00000001, so the bitwise complement is 11111110, and adding one produces 11111111, which is the representation of -1.
There is no way to distinguish -1 in eight-bit two’s complement from 255 in eight-bit binary (with no sign). They both have the same representation in bits: 11111111. If you are using both of these numbers, you must either separately remember which one is eight-bit two’s complement and which one is plain eight-bit binary or you must use more than eight bits. In other words, at the raw bit level, 11111111 is just eight bits; it has no value until we decide how to interpret it.
Java and typical other languages do not maintain a sign bit separate from the value of a number; the sign is part of the encoding of the number. Also, typical languages do not allow you to compare values of different types directly. If you have a two’s complement x and an unsigned y, then either one must be converted to the type of the other before comparison or they must both be converted to a third type. Thus, if you compare x and y and one is converted to the type of the other, the conversion will overflow or wrap, and you cannot expect to get the correct mathematical result. To compare these two numbers, we might convert each of them to a wider integer, such as 32 bits, then compare. Converting the eight-bit two’s complement 11111111 to a 32-bit integer produces -1, converting the eight-bit plain binary 11111111 to a 32-bit integer produces 255, and the comparison then reports they are unequal.
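A small Java sketch of that last step (the eight-bit pattern widened two different ways; the variable names are mine):
byte bits = (byte) 0b11111111;             // the raw eight-bit pattern 11111111
int asTwosComplement = bits;               // sign-extended to 32 bits: -1
int asUnsigned = bits & 0xFF;              // zero-extended to 32 bits: 255
System.out.println(asTwosComplement == asUnsigned); // false: -1 != 255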
The zero flag and the carry flag you read about are flags that are set when a comparison instruction is executed in a computer processor. Most high-level languages do not give you direct access to these flags. Many processors have an instruction with a form like this:
cmp a, b
That instruction subtracts b from a and discards the difference but remembers several flags that describe the subtraction: Was the result zero (zero flag)? Did a borrow occur (borrow flag)? Was the result negative (sign flag)? Did an overflow occur (overflow flag)?
The compare instruction requires that the two things being compared be the same type (two’s complement or unsigned), but it does not care which type. The results can be tested later by checking particular combinations of the flags depending on the type. That is, the information recorded in the flags can distinguish whether one two’s complement number was greater than another or whether one unsigned number was greater than another, depending on what tests are made. There are conditional branch instructions that test the desired flag properties.
There is generally no need to “un-negate” a number to perform arithmetic operations. Processors include arithmetic instructions that work on two’s complement numbers. Usually the add and subtract instructions are type-agnostic, the same way the compare instruction is, but the multiply and divide instructions are not (except for some forms of multiply that return partial results). The add and subtract instructions can be type-agnostic because the wrapping that occurs in the arithmetic works for both two’s complement and unsigned. However, that wrapping does not work for multiplication and division.
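A sketch, again in Java and masking to eight bits to mimic an 8-bit machine word, of why add/subtract can be type-agnostic while division cannot:
// Addition wraps the same way under both interpretations of 11111111:
System.out.println((0xFF + 0x01) & 0xFF);  // 0: correct as -1 + 1 = 0 and as 255 + 1 = 256 mod 256

// Division must know the type, because the two views give different answers:
System.out.println(((byte) 0xFF) / 2);     //   0  (two's complement view: -1 / 2)
System.out.println((0xFF & 0xFF) / 2);     // 127  (unsigned view: 255 / 2)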

Integer divide by Zero and Float (Real no.) divide by Zero

If I run the following line of code, I get a DIVIDE BY ZERO error
1. System.out.println(5/0);
which is the expected behavior.
Now I run the below line of code
2. System.out.println(5/0F);
here there is no DIVIDE BY ZERO error; rather, it shows INFINITY
In the first line I am dividing two integers and in the second two real numbers.
Why does dividing by zero give a DIVIDE BY ZERO error for integers, while for real numbers it gives INFINITY?
I am sure it is not specific to any programming language.
(EDIT: The question has been changed a bit - it specifically referred to Java at one point.)
The integer types in Java don't have representations of infinity, "not a number" values etc - whereas IEEE-754 floating point types such as float and double do. It's as simple as that, really. It's not really a "real" vs "integer" difference - for example, BigDecimal represents real numbers too, but it doesn't have a representation of infinity either.
EDIT: Just to be clear, this is language/platform specific, in that you could create your own language/platform which worked differently. However, the underlying CPUs typically work the same way - so you'll find that many, many languages behave this way.
EDIT: In terms of motivation, bear in mind that for the infinity case in particular, there are ways of getting to infinity without dividing by zero - such as dividing by a very, very small floating point number. In the case of integers, there's obviously nothing between zero and one.
Also bear in mind that the cases in which integers (or decimal floating point types) are used typically don't need the concept of infinity or "not a number" results - whereas in scientific applications (where float/double are more typically useful), "infinity" (or at least "a number which is too large to sensibly represent") is still a potentially valid result.
This is specific to one programming language or a family of languages. Not all languages allow integers and floats to be used in the same expression. Not all languages have both types (for example, ECMAScript implementations like JavaScript have no notion of an integer type externally). Not all languages have syntax like this to convert values inline.
However, there is an intrinsic difference between integer arithmetic and floating-point arithmetic. In integer arithmetic, you must define that division by zero is an error, because there are no values to represent the result. In floating-point arithmetic, specifically that defined in IEEE-754, there are additional values (combinations of sign bit, exponent and mantissa) for the mathematical concept of infinity and meta-concepts like NaN (not a number).
So we can assume that the / operator in this programming language is generic: it performs integer division if both operands are of the language's integer type, and it performs floating-point division if at least one of the operands is of a float type of the language, in which case the other operand is implicitly converted to that float type for the purpose of the operation.
In real-number math, dividing a number by a number close to zero is equivalent to multiplying the first number by a number whose absolute value is very large (x / (1 / y) = x * y). So it is reasonable to define the result of dividing by zero as infinity, since the magnitude of the result exceeds what the floating-point format can represent.
Implementation details are to be found in that programming language's specification.
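A short Java demonstration of both behaviors (a sketch only; the exception type is Java's name for the integer case):
try {
    System.out.println(5 / 0);             // integer division: throws ArithmeticException
} catch (ArithmeticException e) {
    System.out.println("integer division by zero: " + e.getMessage());
}
System.out.println(5 / 0F);                // floating-point division: prints Infinity
System.out.println(0F / 0F);               // prints NaN, another value integers simply don't have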

what is a multiple precision integer?

"A is a multiple precision n-bit integer represented in radix r"
What does this statement mean?
In particular, what does A being a multiple precision n-bit integer mean?
Please help.
Thank you
Let me answer my own question
If the processor is an X-bit processor, any integer that can be represented within X bits (i.e., 0 to 2^X - 1 for unsigned ints) is a single-precision number. If the integer needs more than X bits to be represented, it is a multiple-precision number. Usually such a number is represented using a number of bits that is a multiple of X, since an X-bit processor has registers X bits wide; if a number requires even one bit more than X, it still needs two X-bit registers to be stored (hence such a number would be double precision).
Another example: a single-precision floating point number in C requires 4 bytes of space. So something like
float x;
is a single-precision floating point number, while
double x;
requires twice the space of a float and is hence a double-precision floating point number.
(And if you believe this is wrong please leave your suggestions in the comments)
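To make the word-based idea concrete, here is a hedged Java sketch of double-precision (two-word) unsigned addition, treating each 64-bit long as one machine word of a 128-bit number in radix 2^64; the variable names are mine:
long aHi = 0x1L, aLo = 0xFFFFFFFFFFFFFFFFL;   // a = 1 * 2^64 + (2^64 - 1) = 2^65 - 1
long bHi = 0x0L, bLo = 0x1L;                  // b = 1
long sumLo = aLo + bLo;                       // low word wraps around to 0
long carry = Long.compareUnsigned(sumLo, aLo) < 0 ? 1 : 0;  // did the low word overflow?
long sumHi = aHi + bHi + carry;               // propagate the carry into the high word
// result: sumHi = 2, sumLo = 0, i.e. a + b = 2^65 -- exact, even though it never fits in one word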
It's very difficult to say without any context.
But if I had to guess, I'd say it's probably referring to arbitrary-precision arithmetic, i.e. a type with no constraints on storage (and therefore no constraints on the number of digits).
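For a concrete picture of the arbitrary-precision reading, here is a small Java sketch using BigInteger (only as an example of such a type): once the value stops fitting in one machine word, the library spreads it across several words and does the arithmetic in software.
import java.math.BigInteger;

BigInteger a = BigInteger.ONE.shiftLeft(200);                    // 2^200, far more than 64 bits
BigInteger b = new BigInteger("123456789012345678901234567890"); // parsed from a radix-10 string
System.out.println(a.add(b));                                    // exact result, no overflow
System.out.println(a.bitLength());                               // 201 bits, spread over several words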

How to get around some rounding errors?

I have a method that deals with some geographic coordinates in .NET, and I have a struct that stores a coordinate pair such that if 256 is passed in for one of the coordinates, it becomes 0. However, in one particular instance a value of approximately 255.99999998 is calculated, and thus stored in the struct. When it's printed in ToString(), it becomes 256, which should not happen - 256 should be 0. I wouldn't mind if it printed 255.9999998 but the fact that it prints 256 when the debugger shows 255.99999998 is a problem. Having it both store and display 0 would be even better.
Specifically there's an issue with comparison. 255.99999998 is sufficiently close to 256 that it should be treated as equal to it. What should I do when comparing doubles? Use some sort of epsilon value?
EDIT: Specifically, my problem is that I take a value, perform some calculations, then perform the opposite calculations on that number, and I need to get back the original value exactly.
This sounds like a problem with how the number is printed, not how it is stored. A double has about 15 significant figures, so it can tell 255.99999998 from 256 with precision to spare.
You could use the epsilon approach, but the epsilon is typically a fudge to get around the fact that floating-point arithmetic is lossy.
You might consider avoiding binary floating point altogether and using a nice Rational class.
The calculation above would probably have come out as exactly 256 if you had been doing lossless arithmetic, as you would get with a Rational type.
Rational types can go by the name of a Ratio or Fraction class, and are fairly simple to write.
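As an illustration of how little code such a type needs, here is a minimal sketch in Java (the class and method names are mine; a .NET version would look much the same):
import java.math.BigInteger;

// A minimal exact rational: numerator/denominator, always kept in lowest terms.
final class Rational {
    final BigInteger num, den;

    Rational(BigInteger num, BigInteger den) {
        if (den.signum() == 0) throw new ArithmeticException("zero denominator");
        if (den.signum() < 0) { num = num.negate(); den = den.negate(); }
        BigInteger g = num.gcd(den);
        this.num = num.divide(g);
        this.den = den.divide(g);
    }

    static Rational of(long n, long d) {
        return new Rational(BigInteger.valueOf(n), BigInteger.valueOf(d));
    }

    Rational add(Rational o)      { return new Rational(num.multiply(o.den).add(o.num.multiply(den)), den.multiply(o.den)); }
    Rational multiply(Rational o) { return new Rational(num.multiply(o.num), den.multiply(o.den)); }

    @Override public String toString() { return num + "/" + den; }
}
With such a type, Rational.of(1, 100) stays exactly 1/100, so the "destined to be 256" calculation above would come out as exactly 256/1.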
Edit....
To understand your problem, consider that when the decimal value 0.01 is converted to a binary representation it cannot be stored exactly in finite memory. The hexadecimal representation of this value is 0.028F5C28F5C..., where the "28F5C" repeats infinitely. So even before doing any calculations, you lose exactness just by storing 0.01 in binary format.
Rational and Decimal classes are used to overcome this problem, albeit with a performance cost. Rational types avoid it by storing a numerator and a denominator to represent your value. Decimal types use a scaled decimal format, which can still be lossy in division but can store common decimal values exactly.
For your purpose I still suggest a Rational type.
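To see the 0.01 point concretely, here is a short Java sketch (BigDecimal stands in here for any decimal or rational type; .NET's decimal makes the same point):
import java.math.BigDecimal;

System.out.println(new BigDecimal(0.01));    // the exact value the double actually stores: a long
                                             // run of digits slightly above 0.01, because 1/100
                                             // has no finite binary expansion
System.out.println(new BigDecimal("0.01"));  // exactly 0.01 when kept in a decimal representation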
You can choose format strings which should let you display as much of the number as you like.
The usual way to compare doubles for equality is to subtract them and see if the absolute value is less than some predefined epsilon, maybe 0.000001.
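A sketch of that comparison (Java syntax for illustration; C# is analogous). The epsilon here is an assumption to be tuned to your data:
static boolean nearlyEqual(double a, double b) {
    final double EPSILON = 1e-6;             // tolerance chosen for this domain; adjust as needed
    return Math.abs(a - b) < EPSILON;
}
// nearlyEqual(255.99999998, 256.0) is true, so the pair can be treated as equal (and wrapped to 0)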
You have to decide yourself on a threshold under which two values are equal. This amounts to using so-called fixed-point numbers (as opposed to floating point). Then you have to perform the rounding manually.
I would go with some unsigned type of known size (e.g. uint32 or uint64 if they're available; I don't know .NET) and treat it as a fixed-point number type mod 256.
E.g., in C:
#include <math.h>     /* fmod */
#include <stdint.h>   /* uint32_t */

typedef uint32_t fixed;   /* 8.24 fixed point: 8 integer bits, 24 fractional bits */

static inline fixed to_fixed(double d)
{
    return (fixed)(fmod(d, 256.0) * (double)(1 << 24));   /* truncates the fractional remainder */
}

static inline double to_double(fixed f)
{
    return (double)f / (double)(1 << 24);
}
or something more elaborate to suit a rounding convention (to nearest, to lower, to higher, to odd, to even). The highest 8 bits of a fixed value hold the integer part, the 24 lower bits hold the fractional part. Absolute precision is 2^-24.
Note that adding and subtracting such numbers naturally wraps around at 256. For multiplication you need to be careful: the raw product of two 8.24 values carries 48 fractional bits and has to be rescaled (shifted right by 24) in a wider intermediate type.

Resources