Arithmetic with only 16-bit signed words - math

I am trying to perform arithmetic using only 16-bit signed words. I need to be able to perform addition, multiplication, etc.
For example, I need to subtract or add two data values such as:
7269.554688-46.8
or 4385.6616210938 + 32.2
However, these values need to be converted into 16-bit words and then the subtraction, multiplication, or addition can be performed.
I could also use multiple 16-bit words to store one value.
How would I go about performing operations like addition, subtraction and multiplication and how would I convert all of my input values appropriately so that the decimal points always line up properly?

What platform are you coding for? To perform the operations you've given as an example, you would need a floating point unit. Floating point numbers are usually represented through 32 bits or 64 bits, rarely 16 bit.
If you don't have one and all you have are simple operations on 16 bit integers, you could emulate a floating point unit, but that is not a trivial task.
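If emulating a full floating point unit is more than you need, fixed-point arithmetic is a simpler alternative. Note this is a swapped-in technique, not what the answer above describes. The sketch below is only an illustration under stated assumptions: each value is scaled by 2^16 and stored as a 32-bit two's complement number split into two 16-bit words, so the usable range is roughly -32768 to 32767. The names FRAC_BITS, to_words, from_words, add_words, negate_words and sub_words are hypothetical.

```python
FRAC_BITS = 16        # assumption: 16 fractional bits, so the binary point always lines up
WORD_MASK = 0xFFFF    # one 16-bit word

def to_words(value):
    """Scale a decimal value by 2^16 and split it into (high, low) 16-bit words."""
    scaled = round(value * (1 << FRAC_BITS)) & 0xFFFFFFFF  # 32-bit two's complement
    return (scaled >> 16) & WORD_MASK, scaled & WORD_MASK

def from_words(high, low):
    """Rebuild the decimal value from its two words."""
    scaled = (high << 16) | low
    if scaled & 0x80000000:          # sign bit set: negative value
        scaled -= 1 << 32
    return scaled / (1 << FRAC_BITS)

def add_words(a, b):
    """Add two word pairs, propagating the carry from the low word to the high word."""
    low = a[1] + b[1]
    carry = low >> 16
    high = (a[0] + b[0] + carry) & WORD_MASK
    return high, low & WORD_MASK

def negate_words(a):
    """Two's complement negation: invert both words, then add one."""
    low = (~a[1] & WORD_MASK) + 1
    high = ((~a[0] & WORD_MASK) + (low >> 16)) & WORD_MASK
    return high, low & WORD_MASK

def sub_words(a, b):
    return add_words(a, negate_words(b))

print(from_words(*sub_words(to_words(7269.554688), to_words(46.8))))
print(from_words(*add_words(to_words(4385.6616210938), to_words(32.2))))
```

Each conversion is accurate to about 2^-17, so sums and differences are good to a few decimal places. Multiplication would need the full 64-bit product shifted right by 16 bits, which is not shown here.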

Related

Is it possible to remove the max value of a number

I have a mathematical formula which generates prime numbers. The numbers grow exponentially, and in 7 iterations the value hits inf and then nan.
Is there a way to remove those limits or is there a language that doesn't have limits?
Many languages such as Python 3 can handle arbitrarily large integers (limited only by RAM), so you can certainly play around with integers having thousands of digits. For example, it took less than a second to compute 10,000! = 284625968091705451890641321211... (35,660 digits in total, most of them hidden in the ...). In most languages, floating point numbers tend to be limited to what you can represent with 64 bits, though there are various libraries for arbitrary-precision floating point numbers. In no case can you exceed all limits.
If you are using C or C++ the GNU MP Bignum Library allows you to do arbitrary precision integer and floating point arithmetic.
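To make the difference concrete, here is a small Python 3 sketch contrasting exact integers with 64-bit floats. The guarded sys.set_int_max_str_digits call is only there because newer CPython versions cap the length of int-to-string conversions by default.

```python
import math
import sys

# Newer CPython versions limit int-to-str conversion length; lift the cap if it exists.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

big = math.factorial(10000)        # exact, arbitrary-precision integer
digits = str(big)
print(digits[:30] + "...", len(digits), "digits")

x = 1e308 * 10                     # exceeds the range of a 64-bit float
print(x, x - x)                    # inf, then nan
```

The integer keeps growing as long as memory allows, while the float saturates to inf and then produces nan as soon as inf is subtracted from inf.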

How to perform mathematical operations on large numbers

I have a question about working with very big numbers. I'm trying to run the RSA algorithm, and let's pretend I have a 512-bit number d and a 1024-bit number n. decrypted_word = crypted_word^d mod n, isn't it? But d and n are very large numbers! None of the standard variable types can hold my 512-bit numbers. Everywhere it is written that RSA needs 512-bit prime numbers at least, but how can I actually perform any mathematical operations on such a number?
And one more thing: I can't use extra libraries. I generate my prime numbers with Java, using BigInteger, but on my system I have only basic variable types, and STRING256 is the biggest.
Suppose your maximal integer size is 64 bit. Strings are not that useful for doing math in most languages, so disregard string types. Now choose an integer of half that size, i.e. 32 bit. An array of these can be interpreted as digits of a number in base 2^32. With these, you can do long addition and multiplication, just like you are used to with base 10 and pen and paper. In each elementary step, you combine two 32-bit quantities, to produce both a 32-bit result and possibly some carry. If you do the elementary operation in 64-bit arithmetic, you'll have both of these as part of a single 64-bit variable, which you'll then have to split into the 32-bit result digit (via bit mask or simple truncating cast) and the remaining carry (via bit shift).
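The schoolbook procedure can be sketched as follows. This is only an illustration: Python integers are already unbounded, so the code deliberately restricts every elementary step to values that fit in 64 bits, mimicking what you would do on a limited platform. The helper names (to_limbs, from_limbs, add_limbs, mul_limbs) are hypothetical, and digits are stored least significant first.

```python
BASE_BITS = 32
MASK = (1 << BASE_BITS) - 1

def to_limbs(n):
    """Split a non-negative integer into base-2^32 digits, least significant first."""
    limbs = []
    while n:
        limbs.append(n & MASK)
        n >>= BASE_BITS
    return limbs or [0]

def from_limbs(limbs):
    return sum(d << (BASE_BITS * i) for i, d in enumerate(limbs))

def add_limbs(a, b):
    """Long addition: each step yields a 32-bit digit plus a carry."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        t = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        result.append(t & MASK)      # low 32 bits: the result digit
        carry = t >> BASE_BITS       # remaining bits: the carry
    if carry:
        result.append(carry)
    return result

def mul_limbs(a, b):
    """Long multiplication: every intermediate t fits in 64 bits."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            t = result[i + j] + x * y + carry
            result[i + j] = t & MASK
            carry = t >> BASE_BITS
        result[i + len(b)] = carry
    return result

a, b = 2**200 + 12345, 2**130 - 7
print(from_limbs(mul_limbs(to_limbs(a), to_limbs(b))) == a * b)
```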
Division is harder. But if the divisor is known, then you may get away with doing a division by constant using multiplication instead. Consider an example: division by 7. The inverse of 7 is 1/7=0.142857…. So you can multiply by that to obtain the same result. Obviously we don't want to do any floating point math here. But you can also simply multiply by 14286 then omit the last five digits of the result. This will be exactly the right result if your dividend is small enough. How small? Well, you compute x/7 as x*14286/100000, so the error will be x*(14286/100000 - 1/7)=x/350000. Since the fractional part of x/7 can be as large as 6/7, the truncated result is guaranteed to be exact only while that error stays below 1/7, so you are on the safe side as long as x<50000. As long as the modulus in your RSA setup is known, i.e. as long as the key pair remains the same, you can use this approach to do integer division, and can also use that to compute the remainder. Remember to use base 2^32 instead of base 10, though, and check how many digits you need for the inverse constant.
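A quick way to check how far such a constant can be trusted is to search for the first dividend where it fails. The sketch below does this for the decimal example (div7 and mod7 are hypothetical helpers). For the truncated quotient to be exact, the error x/350000 has to stay below 1/7 rather than merely below 1, so the first mismatch shows up at x = 50000.

```python
def div7(x):
    """Approximate x // 7 by multiplying with 14286 and dropping five decimal digits."""
    return (x * 14286) // 100000

def mod7(x):
    """Remainder derived from the approximate quotient (valid while div7 is exact)."""
    return x - 7 * div7(x)

# Smallest non-negative x for which the shortcut disagrees with true integer division.
first_bad = next(x for x in range(10**6) if div7(x) != x // 7)
print(first_bad)
```

The same search, run with a power-of-two scale and your real divisor, tells you how many digits the inverse constant needs.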
There is an alternative you might want to consider, to do modulo reduction more easily, perhaps even if n is variable. Instead of expressing your remainders as numbers 0 through n-1, you could also use 2^1024-n through 2^1024-1. So if your initial number is smaller than 2^1024-n, you add n to convert to this new encoding. The benefit of this is that you can do the reduction step without performing any division at all. 2^1024 is equivalent to 2^1024-n in this setup, so an elementary modulo reduction would start by splitting some number into its lower 1024 bits and its higher rest. The higher rest will be right-shifted by 1024 bits (which is just a change in your array indexing), then multiplied by 2^1024-n and finally added to the lower part. You'll have to do this until you can be sure that the result has no more than 1024 bits. How often that is depends on n, so for fixed n you can precompute that (and for large n I'd expect it to be two reduction steps after addition but three steps after multiplication, but please double-check that) whereas for variable n you'll have to check at runtime. At the very end, you can go back to the usual representation: if the result is not smaller than n, subtract n. All of this should work as described if n>2^512. If not, i.e. if the top bit of your modulus is zero, then you might have to make further adjustments. Haven't thought this through, since I only used this approach for fixed moduli close to a power of two so far.
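The folding step can be tried out on a small scale. The sketch below is a simplified illustration, not the full scheme: it uses a 16-bit word size instead of 1024 bits, assumes a modulus with its top bit set, and skips the shifted encoding by simply folding until the value fits in K bits and then subtracting n once. K, N, C and reduce_mod are hypothetical names.

```python
K = 16                     # assumed width for the demo (the text uses 1024)
N = 40001                  # assumed modulus with its top bit set: 2^15 < N < 2^16
C = (1 << K) - N           # 2^K is congruent to C modulo N
MASK = (1 << K) - 1

def reduce_mod(x):
    """Reduce a non-negative x modulo N using only shifts, masks, multiplies and adds."""
    while x >> K:                        # value still has more than K bits
        x = (x >> K) * C + (x & MASK)    # fold the high part back into the low part
    if x >= N:                           # back to the usual range 0..N-1
        x -= N
    return x

print(reduce_mod(40000 * 40000), (40000 * 40000) % N)
```

Each fold replaces x by x minus a multiple of N, so the value shrinks until it fits in K bits. Because N is larger than 2^(K-1) here, a single final subtraction is enough.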
Now for that exponentiation. I very much suggest you do the binary approach for that. When computing x^d, you start with x, x^2=x*x, x^4=x^2*x^2, x^8=…, i.e. you compute all power-of-two exponents. You also maintain some intermediate result, which you initialize to one. In every step, if the corresponding bit is set in the exponent d, then you multiply the corresponding power into that intermediate result. So let's say you have d=11. Then you'd compute 1*x^1*x^2*x^8 because d=11=1+2+8=1011 in binary. That way, you'll need only about 1024 multiplications max if your exponent has 512 bits. Half of them for the powers-of-two exponentiation, the other half to combine the right powers of two. Every single multiplication in all of this should be immediately followed by a modulo reduction, to keep memory requirements low.
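The binary approach can be sketched in a few lines. This illustration uses Python's built-in integers and the % operator for the reduction, so it shows only the control flow; on a restricted platform the multiplications and reductions would be replaced by your own multi-word routines. mod_pow is a hypothetical name, and the tiny key (n=3233, e=17, d=2753) is a toy example, far too small for real use.

```python
def mod_pow(x, d, n):
    """Compute x^d mod n by scanning the bits of d from least to most significant."""
    result = 1
    power = x % n                          # x^(2^0)
    while d:
        if d & 1:                          # this bit of the exponent is set
            result = (result * power) % n  # multiply the current power in
        power = (power * power) % n        # next power of two: square and reduce
        d >>= 1
    return result

cipher = mod_pow(65, 17, 3233)             # toy RSA encryption
print(mod_pow(cipher, 2753, 3233))         # decrypts back to the original message
```

As noted in the text, the running time depends on the bits of d, so this simple form is not hardened against timing side channels.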
Note that the speed of the above exponentiation process will, in this simple form, depend on how many bits in d are actually set. So this might open up a side channel attack which might give an attacker access to information about d. But if you are worried about side channel attacks, then you really should have an expert develop your implementation, because I guess there might be more of those that I didn't think about.
You can write macros, to be executed under Microsoft, for functions like +, -, x, /, modulo and x to the power y that work for any integer of fewer than ten or a hundred thousand digits (the practical, not theoretical, limit being the memory of your machine). Please note the logic is exactly the same as the one you learned at elementary school.
E.g.: p= 1819181918953471 divider of (2^8091) - 1, q = ((2^8091) - 1)/p, mod(2^8043 ; q ) = 233225049958594489297642487352160527465088733631637179020483553367609406976159908715897287655088134346657328040319280454485827759404751268378805196413090186685926225334347451870049183927154428744934254443850937186054612404823712615148867040751866198781942354903962026677334226414362517398771254734371914537723525272500632139167682048449368982786333508866621411419635621571844016474674514040364550433338016668909256596081980092846379236917235898011306231439819482384406356911821215433421870926772596749117444009734540322095023599354574371679373102508760023261017381079306370251839506508217700876602000752668620753831306695191309990299205276562349113924219914717570681877473628541487207289232055343412361464994499108965303597290773003668048464392254830869014842093332365958032633132197254697156995460411629235227841703501045897165445297514394380219147277726203912625341055996886039509233210088831794334748980343182858891291155565414796707610403880753529341373268832872458218889994744210011557215665478139704968095559963138546311374907742975648819018776876281761067719182069454343508735096796381098878319322794706310976040189398557889905426270726260492817841528070976594852388385609583168882381372375485905284508903287800802868440387963251014889779885496395239880028250552864697402278423885387518709716916175431416581423130599343269248678461517497775752793103942965621915306028170145494646142538868438326459468664663629504846295542588557144017854729877278410408058162244136570364999591177012490284351913277572766442729447434792962687498289275655599514419451432696568663552103104822355202205802135334250162989939036157537143434560145774792254359150312258635519116051170293930856329473738726353301817188206698368301473129489660286829605182252139602188672078254178300162810361219593847073917183338928496652485128029266016762511997116989787253990489543258874103170604006204127972401297871588391649693824985377425792
33544463501470239575760940937130926062252501116458281610468726777710383038372260777522143500312913040987942762244940009811450966646527814576364565964518092955053720983465333258335601691477534154940549197873199633313223848155047098569827560014018412679602636286195283270106917742919383395056306107175539370483171915774381614222806960872813575048014729965930007408532959309197608469115633821869206793759322044599554551057140046156235152048507130125695763956991351137040435703946195318000567664233417843805257728.
The last step took about 0.1 sec.
wpjo (willibrord oomen on academia.edu)

Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).
Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.
But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:
Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?
Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.
I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, and these become more frequent the fewer bits you remove from the 52-bit significand. Borrowing one bit from the significand as you suggest in an edit to your question would be the worst, with half the operations statistically producing double-rounded results that are different from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5ULP, the basic operations would be accurate to 3/4ULP, a dramatic loss of accuracy that would derail many of the existing, finely-designed numerical algorithms that expect 0.5ULP.
Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).
Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may be useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:
if ((repr & 7) == 4)
    repr += (repr & 8) >> 1; /* midpoint case */
else
    repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */
Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.
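For experimenting with this rounding step outside of C, the same integer operations can be applied to the bit pattern of a double obtained through Python's struct module. This is only a sketch of the final rounding to 49 bits; it does not emulate round-to-odd for the preceding computation, and the helper names (double_to_bits, bits_to_double, round_to_49_bits) are hypothetical.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def double_to_bits(x):
    """Reinterpret a double as its 64-bit unsigned integer representation."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(bits):
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def round_to_49_bits(bits):
    """Round to nearest, ties to even, then clear the three freed low bits."""
    if (bits & 7) == 4:
        bits += (bits & 8) >> 1    # midpoint: round up only if bit 3 is set
    else:
        bits += 4
    return bits & ~7 & MASK64      # MASK64 mimics uint64_t wrap-around

largest = 1.7976931348623157e308   # largest finite double
print(bits_to_double(round_to_49_bits(double_to_bits(largest))))
```

The last line illustrates the normal-to-infinite case mentioned above: the carry propagates from the significand into the exponent field and yields infinity.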
This is a summary of my own research and information from the excellent answer by @Pascal Cuoq.
There are two places where we can truncate the 3 bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.
Truncating the mantissa
We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.
A portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARDZERO), and then setting the last bit if the FE_INEXACT exception occurs. After computing the final double this way, we simply round to nearest for storage.
The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).
Truncating the exponent
We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).
The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10^-308...10^308 to about 10^-38...10^38. Seems OK for computation, but we still lose a lot.
Comparison
Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:
the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range
Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

How to get around some rounding errors?

I have a method that deals with some geographic coordinates in .NET, and I have a struct that stores a coordinate pair such that if 256 is passed in for one of the coordinates, it becomes 0. However, in one particular instance a value of approximately 255.99999998 is calculated, and thus stored in the struct. When it's printed in ToString(), it becomes 256, which should not happen - 256 should be 0. I wouldn't mind if it printed 255.9999998 but the fact that it prints 256 when the debugger shows 255.99999998 is a problem. Having it both store and display 0 would be even better.
Specifically there's an issue with comparison. 255.99999998 is sufficiently close to 256 such that it should equal it. What should I do when comparing doubles? use some sort of epsilon value?
EDIT: Specifically, my problem is that I take a value, perform some calculations, then perform the opposite calculations on that number, and I need to get back the original value exactly.
This sounds like a problem with how the number is printed, not how it is stored. A double has about 15 significant figures, so it can tell 255.99999998 from 256 with precision to spare.
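The effect is easy to reproduce in any language. The question is about .NET, but the following Python sketch shows the same behaviour: the stored double is distinct from 256, and only a formatting choice with too few significant digits makes it look like 256.

```python
x = 255.99999998

short = f"{x:.6g}"     # six significant digits: the value is rounded for display
full = repr(x)         # shortest string that round-trips to the same double

print(short)           # 256
print(full)            # 255.99999998
print(x == 256.0)      # False: the stored value is not 256
```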
You could use the epsilon approach, but the epsilon is typically a fudge to get around the fact that floating-point arithmetic is lossy.
You might consider avoiding binary floating-points altogether and use a nice Rational class.
The calculation above was probably destined to be 256 if you were doing lossless arithmetic as you would get with a Rational type.
Rational types can go by the name of Ratio or Fraction class, and are fairly simple to write.
Here's one example.
Here's another
Edit....
To understand your problem, consider that when the decimal value 0.01 is converted to a binary representation, it cannot be stored exactly in finite memory. The hexadecimal representation for this value is 0.028F5C28F5C where the "28F5C" repeats infinitely. So even before doing any calculations, you lose exactness just by storing 0.01 in binary format.
Rational and Decimal classes are used to overcome this problem, albeit with a performance cost. Rational types avoid this problem by storing a numerator and a denominator to represent your value. Decimal types use a binary encoded decimal format, which can be lossy in division, but can store common decimal values exactly.
For your purpose I still suggest a Rational type.
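As an illustration of the difference, Python ships a rational type in its standard library (fractions.Fraction). The sketch below is not .NET code, only a demonstration of the idea that a numerator/denominator pair stays exact where a binary double does not.

```python
from fractions import Fraction

float_sum = 0.1 + 0.2                          # binary doubles: inexact
exact_sum = Fraction(1, 10) + Fraction(2, 10)  # numerator/denominator: exact

print(float_sum == 0.3)                        # False
print(exact_sum == Fraction(3, 10))            # True

# The double nearest to 0.01 is not exactly 1/100.
print(Fraction(0.01) == Fraction(1, 100))      # False

# Dividing and multiplying back returns the original value exactly.
original = Fraction(256)
print(original / 7 * 7 == original)            # True
```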
You can choose format strings which should let you display as much of the number as you like.
The usual way to compare doubles for equality is to subtract them and see if the absolute value is less than some predefined epsilon, maybe 0.000001.
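A minimal sketch of that comparison, with the tolerance 0.000001 taken from the sentence above and everything else assumed: nearly_equal and wrap_coordinate are hypothetical helpers, and the wrap-to-zero rule is the one described in the question.

```python
EPSILON = 0.000001     # tolerance from the text; tune it to your data

def nearly_equal(a, b, eps=EPSILON):
    """True if a and b differ by less than eps in absolute terms."""
    return abs(a - b) < eps

def wrap_coordinate(value):
    """Treat anything within eps of 256 as 256, which wraps to 0."""
    return 0.0 if nearly_equal(value, 256.0) else value

print(wrap_coordinate(255.99999998))   # 0.0
print(wrap_coordinate(255.9))          # 255.9
```

An absolute tolerance like this suits values of a known, bounded magnitude such as these coordinates; for values spanning many orders of magnitude a relative tolerance is usually the better fit.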
You have to decide yourself on a threshold under which two values are equal. This amounts to using so-called fixed point numbers (as opposed to floating point). Then, you have to perform the round up manually.
I would go with some unsigned type with known size (eg. uint32 or uint64 if they're available, I don't know .NET) and treat it as a fixed point number type mod 256.
Eg.
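#include <math.h>    /* fmod */
#include <stdint.h>  /* uint32_t */

typedef uint32_t fixed;

/* 8 integer bits, 24 fractional bits; assumes d >= 0 */
static inline fixed to_fixed(double d)
{
    return (fixed)(fmod(d, 256.) * (double)(1 << 24));
}

static inline double to_double(fixed f)
{
    return (double)f / (double)(1 << 24);
}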
or something more elaborate to suit a rounding convention (to nearest, to lower, to higher, to odd, to even). The highest 8 bits of fixed hold the integer part, the 24 lower bits hold the fractional part. Absolute precision is 2^{-24}.
Note that adding and subtracting such numbers naturally wraps around at 256. Beware with multiplication, though.
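The same 8.24 fixed-point idea can be tried out quickly in Python. This is a sketch under assumptions: it rounds to nearest instead of truncating (one of the rounding conventions mentioned above), masks to 32 bits to imitate an unsigned 32-bit type, and the names to_fixed, to_double and add_fixed are hypothetical.

```python
import math

FRAC_BITS = 24
ONE = 1 << FRAC_BITS
MASK32 = 0xFFFFFFFF

def to_fixed(d):
    """Convert to 8.24 fixed point, rounding to nearest and wrapping mod 256."""
    return round(math.fmod(d, 256.0) * ONE) & MASK32

def to_double(f):
    return f / ONE

def add_fixed(a, b):
    """Addition wraps around at 256 because only 32 bits are kept."""
    return (a + b) & MASK32

print(to_double(to_fixed(255.99999998)))                          # 0.0
print(to_double(add_fixed(to_fixed(255.5), to_fixed(1.0))))       # 0.5
```

With rounding to nearest, 255.99999998 is closer to 256 than to the largest representable value below it, so it is stored as 0, which is the behaviour the question asks for.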

Bitwise operation on floating point numbers (for graphics)? [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
how to perform bitwise operation on floating point numbers
Hello, everyone!
Background:
I know that it is possible to apply bitwise operation on graphics (for example XOR). I also know, that in graphic programs, graphic data is often stored in floating point data types (to be able for example to "multiply" the data with 1.05). So it must be possible to perform bitwise operations on floating point data, right?
I need to be able to perform bitwise operations on floating point data. I do not want to cast the data to long, bitwise manipulate it, and cast back to float.
I assume there exists a mathematical way to achieve this, which is more elegant (?) and/or faster (?).
I've seen some answers but they could not help, including this one.
EDIT:
That other question involves void-pointer casting, which would rely on deeper-level data representation. So it's not such an "exact duplicate".
By the time the "graphics data" hits the screen, none of it is floating point. Bitwise operations are really done on bit strings. Bitwise operations only make sense on numbers because of consistent encoding scheme to binary. Trying to get any kind of logical bitwise operations on floats other than extracting the exponent or mantissa is a road to hell.
Basically, you probably don't want to do this. Why do you think you do?
A floating point number is just another interpretation of the bits in memory, so you could:
measure the size of the data type (e.g. 32 bits), e.g. sizeof(pixel)
get a pointer to it - choose an unsigned integer type of the same size for that, e.g. UINT *ptr = (UINT *)&pixel
combine the pointed-to values, e.g. newbits = (*ptrA) ^ (*ptrB) for two pixels (XORing a value with itself, (*ptr)^(*ptr), would always give 0)
This should at least work with non-negative values and should have no considerable computational overhead, at least in an unmanaged context like C++. You may have to mask out some bits when doing your operation, and, depending on the type, you may have to treat exponent and mantissa separately.
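To see what such a reinterpretation actually produces, here is a small Python sketch using the struct module to view a 32-bit float as an unsigned integer and back. The helper names float_to_bits and bits_to_float are hypothetical. It also shows why the first answer warns against this: the XOR of two ordinary values is a valid bit pattern but a numerically meaningless float, whereas masking a specific field (here the sign bit) is well defined.

```python
import struct

def float_to_bits(x):
    """Reinterpret a 32-bit float as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

a, b = 1.5, 0.75
xored = float_to_bits(a) ^ float_to_bits(b)
print(hex(float_to_bits(a)), hex(float_to_bits(b)), hex(xored))
print(bits_to_float(xored))                 # a tiny number unrelated to 1.5 or 0.75

# Masking a single field is meaningful: clearing the sign bit gives the absolute value.
magnitude = bits_to_float(float_to_bits(-2.5) & 0x7FFFFFFF)
print(magnitude)                            # 2.5
```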

Resources