What's wrong with my division? - math

I have a 16-bit sample between -32768 and 32767.
To save space I want to convert it to an 8-bit sample, so I divide the sample by 256 and add 128.
-32768 / 256 = -128 + 128 = 0
32767 / 256 = 127.99 + 128 = 255.99
Now, the 0 will fit perfectly in a byte, but the 255.99 has to be rounded down to 255, causing me to lose precision, because when converting back I'll get 32512 instead of 32767.
How can I do this without losing the original min/max values? I know I'm making a very obvious error in my thinking, but I can't figure out where the mistake lies.
And yes, of course I'm fully aware that I lose precision by dividing and will not be able to deduce the original values from the 8-bit samples, but I just wonder why I don't get the original maximum.

The answers for down-sampling have already been provided.
This answer relates to up-sampling using the full range. Here is a C99 snippet demonstrating how you can spread the error across the full range of your values:
#include <stdio.h>
int main(void)
{
for( int i = 0; i < 256; i++ ) {
unsigned short scaledVal = ((unsigned short)i << 8) + (unsigned short)i;
printf( "%8d%8hu\n", i, scaledVal );
}
return 0;
}
It's quite simple. You shift the value left by 8 and then add the original value back. That means every increase by 1 in the [0,255] range corresponds to an increase by 257 in the [0,65535] range.
I would like to point out that this might give worse results than you began with. For example, if you downsampled 65280 (0xff00) you would get 255, but then upsampling that would give 65535 (0xffff), which is a total error of 255. You will have similarly large errors across most of the higher end of your data range.
You might do better to abandon the notion of going back to the [0,65535] range, and instead round your values by half. That is, shift left and add 127. This means the error is uniform instead of skewed. Because you don't actually know what the original value was, the best you can do is estimate it with a value right in the centre.
To summarize, I think this is more mathematically correct:
unsigned short scaledVal = ((unsigned short)i << 8) + 127;
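To make the trade-off between the two reconstructions concrete, here is a rough sketch that scans every 16-bit value, down-samples it with a plain shift, and records the worst-case error of each up-sampling rule. The function names are made up for this illustration, and the input is assumed to be an unsigned value in [0,65535].

```c
#include <stdlib.h>

/* Worst-case error when the byte is repeated: restored = (b << 8) + b. */
int max_error_replicate(void)
{
    int worst = 0;
    for (long x = 0; x <= 65535; x++) {
        long b = x >> 8;                 /* down-sample: keep the high byte */
        long restored = (b << 8) + b;    /* up-sample across the full range */
        int err = (int)labs(restored - x);
        if (err > worst) worst = err;
    }
    return worst;
}

/* Worst-case error when the low byte is estimated as 127. */
int max_error_midpoint(void)
{
    int worst = 0;
    for (long x = 0; x <= 65535; x++) {
        long b = x >> 8;
        long restored = (b << 8) + 127;  /* estimate near the centre */
        int err = (int)labs(restored - x);
        if (err > worst) worst = err;
    }
    return worst;
}
```

Under these assumptions the repeated-byte rule can be off by as much as 255, while the midpoint rule is never off by more than 128.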

You don't get the original maximum because you can't represent the number 256 as an 8-bit unsigned integer.

If you're trying to compress your 16-bit integer value into an 8-bit integer range, you keep the most significant 8 bits and throw out the least significant 8 bits. Normally this is accomplished by shifting the bits. The >> operator shifts bits from the most significant end toward the least significant end, so shifting right by one bit eight times, or simply using >>8, would work. You can also mask out the low byte and divide off the zeros, doing your rounding before the division, with something like 8BitInt = (16BitInt & 65280)/256; [65280 a.k.a. 0xFF00]
Every bit you shift off of a value halves it, like division by 2, and rounds down.
All of the above is somewhat complicated by the fact that you're dealing with a signed integer.
Finally I'm not 100% certain I got everything right here because really, I haven't tried doing this.
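For what it's worth, here is a minimal sketch of the shift approach that also deals with the sign, assuming the samples are int16_t and the 8-bit form is an unsigned byte with 128 as the midpoint. The helper names are invented for this example, and the expansion back to 16 bits uses the multiply-by-257 idea from the other answer so that the original minimum and maximum survive the round trip.

```c
#include <stdint.h>

/* 16-bit signed sample -> 8-bit unsigned sample. */
uint8_t sample16_to_8(int16_t s)
{
    /* Add the offset in a wider type so nothing overflows,
       then keep only the most significant 8 bits. */
    uint16_t u = (uint16_t)((int32_t)s + 32768);
    return (uint8_t)(u >> 8);
}

/* 8-bit unsigned sample -> 16-bit signed sample, using the full range. */
int16_t sample8_to_16(uint8_t b)
{
    /* b * 257 == (b << 8) + b, so 0 -> 0 and 255 -> 65535. */
    int32_t u = (int32_t)b * 257;
    return (int16_t)(u - 32768);
}
```

With this pairing -32768 maps to 0 and back to -32768, and 32767 maps to 255 and back to 32767; values in between are still only approximated.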

Related

is it a bug scaling 0.0-1.0 float to byte by multiplying by 255?

this is something that has always bugged me when I look at code around the web and in so much of the literature: why do we multiply by 255 and not 256?
sometimes you'll see something like this:
float input = some_function(); // returns 0.0 to 1.0
byte output = input * 255.0;
(i'm assuming that there's an implicit floor going on during the type conversion).
am i not correct in thinking that this is clearly wrong?
consider this:
what range of input gives an output of 0 ? (0 -> 1/255], right?
what range of input gives an output of 1 ? (1/255 -> 2/255], great!
what range of input gives an output of 255 ? only 1.0 does. any significantly smaller value of input will return a lower output.
this means that input is not evenly mapped onto the output range.
ok. so you might think: ok use a better rounding function:
byte output = round(input * 255.0);
where round() is the usual mathematical rounding to zero decimal places. but this is still wrong. ask the same questions:
what range of input gives an output of 0 ? (0 -> 0.5/255]
what range of input gives an output of 1 ? (0.5/255 -> 1.5/255], twice as much as for 0 !
what range of input gives an output of 255 ? (254.5/255 -> 1.0), again half as much as for 1
so in this case the input range isn't evenly mapped either!
IMHO. the right way to do this mapping is this:
byte output = min(255, input * 256.0);
again:
what range of input gives an output of 0 ? (0 -> 1/256]
what range of input gives an output of 1 ? (1/256 -> 2/256]
what range of input gives an output of 255 ? (255/256 -> 1.0)
all those ranges are the same size and constitute 1/256th of the input.
i guess my question is this: am i right in considering this a bug, and if so, why is this so prevalent in code?
edit: it looks like i need to clarify. i'm not talking about random numbers here or probability. and i'm not talking about colors or hardware at all. i'm talking about converting a float in the range [0,1] evenly to a byte [0,255] so each range in the input that corresponds to each value in the output is the same size.
You are right. Assuming that valueBetween0and1 can take values 0.0 and 1.0, the "correct" way to do it is something like
byteValue = (byte)(min(255, valueBetween0and1 * 256))
Having said that, one could also argue that the desired quality of the software can vary: does it really matter whether you get 16777216 or 16581375 colors in some throw-away plot?
It is one of those "trivial" tasks which is very easy to get wrong by +1/-1. Is it worth it to spend 5 minutes trying to get the 255-th pixel intensity, or can you apply your precious attention elsewhere? It depends on the situation: (byte)(valueBetween0and1 * 255) is a pragmatic solution which is simple, cheap, close enough to the truth, and also immediately, obviously "harmless" in the sense that it definitely won't produce 256 as output. It's not a good solution if you are working on some image manipulation tool like Photoshop or if you are working on some rendering pipeline for a computer game. But it is perfectly acceptable in almost all other contexts. So, whether it is a "bug" or merely a minor improvement proposal depends on the context.
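The two mappings being compared can be sketched in C as follows. This is only an illustration: the function names are made up, the input is assumed to already lie in [0,1], and the cast to an integer type is what performs the implicit floor.

```c
#include <stdint.h>

/* floor(x * 255): only x == 1.0 produces 255. */
uint8_t to_byte_255(float x)
{
    return (uint8_t)(x * 255.0f);
}

/* min(255, floor(x * 256)): every output value covers 1/256 of the input. */
uint8_t to_byte_256(float x)
{
    int v = (int)(x * 256.0f);
    return (uint8_t)(v > 255 ? 255 : v);   /* clamp the single case x == 1.0 */
}
```

For an input such as 0.999 the first version yields 254 and the second yields 255, which is the uneven top bucket the question describes.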
Here is a variant of your problem, which involves random number generators:
Generate random numbers in specified range - various cases (int, float, inclusive, exclusive)
Notice that e.g. Math.random() in Java or Random.NextDouble in C# return values greater or equal to 0, but strictly smaller than 1.0.
You want the case "Integer-B: [min, max)" (inclusive-exclusive) with min = 0 and max = 256.
If you follow the "recipe" Int-B exactly, you obtain the code:
0 + floor(random() * (256 - 0))
If you remove all the zeros, you are left with just
floor(random() * 256)
and you don't need to & with 0xFF, because you never get 256 (as long as your random number generator guarantees to never return 1).
I think your question is misguided. It looks like you start by assuming that there is some "fairness rule" that enforces the "right way" of translation. Unfortunately, in practice this is not the case. If you just want to generate a random color, then you may use whatever logic suits you. But if you do actual image processing, there is no rule that says that each integer value has to be mapped onto an interval of the same size in the float range. On the contrary, what you really want is a mapping between two inclusive intervals, [0;1] and [0;255]. And often you don't know how many real discretization steps there will be in the [0;1] range down the line when the color is actually shown. (On modern monitors there are probably all 256 different levels for each color, but on other output devices there might be significantly fewer choices, and the total number might not be a power of 2.) The real mapping rule is that if the red component values of two colors are R1 and R2, then the ratio of the actual colors' red component brightness should be as close to R1:R2 as possible. This rule automatically implies multiplying by 255 when you want to map onto [0;255], and thus this is what everybody does.
Note that what you suggest is most probably introducing a bug rather than fixing one. For example, the proportion rule actually means that you can calculate a mix of two colors R1 and R2 with mixing coefficients k1 and k2 as
Rmix = (k1*R1 + k2*R2)/(k1+k2)
Now let's try to calculate 3:1 mix of 100% Red with 100% Black (i.e. 0% Red) two ways:
using [0-255] integers Rmix = (255*3+1*0)/(3+1) = 191.25 ≈ 191
using [0;1] floating range and then converting it to [0-255] Rmix_float = (1.0*3 + 1*0.0)/(3+1) = 0.75 so Rmix_converted_256 = 256*0.75 = 192.
It means your "multiply by 256" logic has actually introduced inconsistency of different results depending on which scale you use for image processing. Obviously if you used "multiply by 255" logic as everyone else does, you'd get a consistent answer Rmix_converted_255 = 255*0.75 = 191.25 ≈ 191.
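The arithmetic in this example can be checked with a small C sketch. The helper names are invented for the illustration, the integer version truncates the way the ≈ 191 above does, and the 256 variant includes the min(255, ...) clamp from the question.

```c
/* Mix on the [0-255] integer scale, truncating the result. */
int mix_int(int r1, int k1, int r2, int k2)
{
    return (k1 * r1 + k2 * r2) / (k1 + k2);
}

/* Mix on the [0;1] floating-point scale. */
double mix_float(double r1, double k1, double r2, double k2)
{
    return (k1 * r1 + k2 * r2) / (k1 + k2);
}

/* Convert [0;1] to [0-255] by multiplying by 255. */
int scale_255(double v)
{
    return (int)(v * 255.0);
}

/* Convert [0;1] to [0-255] by multiplying by 256 and clamping. */
int scale_256(double v)
{
    int r = (int)(v * 256.0);
    return r > 255 ? 255 : r;
}
```

Here mix_int(255, 3, 0, 1) and scale_255(mix_float(1.0, 3, 0.0, 1)) agree on 191, while scale_256 of the same float mix gives 192.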

Basic math calculation issue with Arduino

I am doing a basic operation in Arduino and for some reason (this is why I need you) it gives me a totally inappropriate result. Below is the code:
long init_H_top; //I am declaring it a long to make sure I got enough bytes
init_H_top=251*255/360; //gives me -4 and it should be 178
Any idea why it does that?
I am very confused... Thanks!
Your variable may be a long but your constants (251, 255, and 360) are not.
They are int types so will calculate giving an int result which will then be put into the long variable, after any overflow has already done the damage.
Since Arduino has a 16-bit int type, 251 * 255 (64005) will exceed the maximum integer of 32767 and result in behaviour like you're seeing. The value 64005 is -1531 in 16-bit two's complement and, when you divide that by 360, you get about -4.25 which truncates to -4.
You should be using long constants to avoid this:
init_H_top = 251L * 255L / 360L;
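The wrap-around can be reproduced on a desktop compiler with a rough simulation. The helper below is hypothetical and simply reinterprets the low 16 bits of a value as a two's complement number, which is an assumption about how the overflow plays out on the Arduino rather than something the C language guarantees.

```c
/* Hypothetical helper: treat the low 16 bits of v as a two's complement int. */
int wrap_to_int16(long v)
{
    long low = v & 0xFFFFL;                       /* keep the low 16 bits */
    return (int)(low >= 0x8000L ? low - 0x10000L : low);
}

/* What the original expression computes with a 16-bit int. */
long top_with_16bit_int(void)
{
    return wrap_to_int16(251L * 255L) / 360;      /* -1531 / 360 */
}

/* The same expression evaluated with long constants. */
long top_with_long(void)
{
    return 251L * 255L / 360L;                    /* 64005 / 360 */
}
```

Note that the long version yields 177 rather than 178, because integer division truncates the exact quotient of about 177.79.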

Advantages and disadvantages of single numeric (float) data type [closed]

Why do we use various data types in programming languages? Why not use float everywhere? I have heard some arguments like:
Arithmetic on int is faster (but why?)
It takes more memory to store float. (I get it.)
What are the additional benefits of using various types of numeric data types?
Arithmetic on integers has traditionally been faster because it's a simpler operation. It can be implemented in logic gates and, if properly designed, the whole thing can happen in a single clock cycle.
On most modern PCs floating-point support is actually quite fast, because loads of time has been invested into making it fast. It's only on lower-end processors (like Arduino, or some versions of the ARM platform) where floating point seriously suffers, or is absent from the CPU altogether.
A floating point number contains a few different pieces of data: there's a sign bit, and the mantissa, and the exponent. To put those three parts together to determine the value they represent, you do something like this:
value = sign * mantissa * 2^exponent
It's a little more complicated than that because floating point numbers optimize how they store the mantissa a bit. For instance, the first bit of the mantissa is assumed to be 1, so it doesn't actually need to be stored. But this also means zero has to be stored in a particular way, and there are various "special values" that can be stored in floats, like "not a number" and infinity, that have to be handled correctly when working with floats.
So to store the number "3" you'd have a mantissa of 0.75 and an exponent of 2. (0.75 * 2^2 = 3).
But then to add two floats together, you first have to align them. For instance, 3 + 10:
m3 = 0.75 (stored as binary (1)1000000... the first (1) implicit and not actually stored)
e3 = 2
m10 = .625 (stored as binary (1)010000...)
e10 = 4 (.625 * 2^4 = 10)
You can't just add m3 and m10 together, 'cause you'd get the wrong answer. You first have to shift m3 over by a couple bits to get e3 and e10 to match, then you can add the mantissas together and reassemble the result into a new floating point number. A CPU with good floating-point implementation will do all that for you, of course, and do it fast.
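The alignment step can be imitated in software with the standard frexp and ldexp functions, which happen to use the same convention as above (a mantissa in [0.5, 1) and a power-of-two exponent). This is only a sketch of the idea for positive, finite values; real hardware works on the raw bits and also handles rounding and special values. The function name is made up for this example.

```c
#include <math.h>

/* Add two doubles by explicitly aligning their exponents first. */
double add_by_alignment(double a, double b)
{
    int ea, eb;
    double ma = frexp(a, &ea);      /* a == ma * 2^ea, e.g. 3  -> 0.75,  2 */
    double mb = frexp(b, &eb);      /* b == mb * 2^eb, e.g. 10 -> 0.625, 4 */

    int e = ea > eb ? ea : eb;      /* common (larger) exponent */
    ma = ldexp(ma, ea - e);         /* shift the smaller mantissa right */
    mb = ldexp(mb, eb - e);

    return ldexp(ma + mb, e);       /* reassemble the result */
}
```

For 3 + 10 the mantissa 0.75 is shifted to 0.1875, added to 0.625 to give 0.8125, and 0.8125 * 2^4 is 13.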
So why else would you not want to use floating point values for everything? Well, for starters there's the problem of exactness. If you add or multiply two integers to get another integer, as long as you don't exceed the limits of your integer size, the answer you get will be exactly correct. This isn't the case with floating-point. For instance:
x = 1000000000.0
y = .0000000001
for (cc = 0; cc < 1000000000; cc++) { x += y; }
Logically you'd expect the final value of (x) to be 1000000000.1, but that's almost certainly not what you're going to get. When you add (y) to (x), the change to (x)'s mantissa may be so small that it doesn't even fit into the float, and so (x) may not change at all. And even if that's not the case, (y)'s value is not exact. There are no two integers (a, b) such that (a * 2^b = 10^-10). That's true for many common decimal values, actually. Even something simple like 0.3 can't be stored as an exact value in a binary floating-point number.
So (y) isn't exactly 10^-10, it's actually off by some small amount. For a 64-bit floating point number it'll be off by about 10^-26 (for a 32-bit one, about 10^-18):
y = 10^-10 + error, error is about 10^-26
Then if you add (y) together a billion times, as the loop above does, the error is magnified by about a billion times as well, so your final error is around 10^-17.
A good floating-point implementation will try to minimize these errors, but it can't always get it right. The problem is fundamental to how the numbers are stored, and to some extent unavoidable. As a result, for instance, even though it seems natural to store a money value in a float, it might be preferable to store it as an integer instead, to get that assurance that the value is always exact.
The "exactness" issue also means that when you test the value of a floating point number, generally speaking, you can't use exact comparisons. For instance:
x = 11.0 / 500
if (x * 50 == 1.1) { ... It doesn't!
for (float x = 0.0; x < 1.0; x += 0.01) { print x; }
// prints 101 values instead of 100, the last one being 0.9999999...
The test fails because (x) isn't exactly the value we specified, and 1.1, when encoded as a float, isn't exactly the value we specified either. They're both close but not exact. So you have to do inexact comparisons:
if (abs(x - expected_value) < small_value) {...
Choosing the correct "small_value" is a problem unto itself. It can depend on what you're doing with the values, what kind of behavior you're trying to achieve.
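A minimal C version of such an inexact comparison might look like this. The name nearly_equal and the idea of passing the tolerance as a parameter are choices made for this sketch; a fixed absolute tolerance like this is only reasonable when the rough magnitude of the values is known.

```c
#include <stdbool.h>

/* True when a and b differ by less than the given absolute tolerance. */
bool nearly_equal(double a, double b, double tolerance)
{
    double diff = a - b;
    if (diff < 0) {
        diff = -diff;               /* absolute value without needing math.h */
    }
    return diff < tolerance;
}
```

A call such as nearly_equal(0.1 + 0.2, 0.3, 1e-9) succeeds even though the two sides are not bit-for-bit identical.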
Finally, if you look at the "it takes more memory" issue, you can also turn that around and think of it in terms of what you get for the memory you use.
If you can work with integer math for your problem, a 32-bit unsigned integer lets you work with (exact) values between 0 and around 4 billion.
If you're using 32-bit floats instead of 32-bit integers, you can store larger values than 4 billion, but you're still limited by the representation: of those 32 bits, one is used for the sign bit, and eight for the exponent, so you get 23 bits (24, effectively) of mantissa. Once (x >= 2^24), you're beyond the range where integers are stored "exactly" in that float, so (x + 1 == x). So a loop like this:
float i;
for (i = 16000000; i < 17000000; i += 1);
would never terminate: (i) would reach (2^24 = 16777216), and the least-significant bit of its mantissa would be of a magnitude greater than 1, so adding 1 to (i) would cease to have any effect.
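The 2^24 limit is easy to probe directly. The sketch below assumes an IEEE 754 single-precision float; the function name is made up, and the volatile variable is there only to force the sum to be rounded to float before the comparison.

```c
#include <stdbool.h>

/* True when adding 1 to x has no effect in 32-bit float arithmetic. */
bool increment_is_lost(float x)
{
    volatile float next = x + 1.0f;   /* force rounding to float precision */
    return next == x;
}
```

Under that assumption the increment still works at 16777215 but is lost at 16777216, which is where a counting loop of this kind would get stuck.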

ATMega performance for different operations

Does anyone have experience replacing floating point operations on ATMega (2560) based systems? There are a couple of very common situations which happen every day.
For example:
Are comparisons faster than divisions/multiplications?
Is a float-to-int type cast followed by a multiplication/division faster than pure floating point operations without a type cast?
I hope I don't have to write a benchmark just for myself.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
Is that faster than:
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
Is that faster than:
iSign = fVal / fabs(fVal);
The questions could be answered just by thinking about them for a moment.
AVRs do not have an FPU, so all floating-point work is done in software; a floating-point multiplication involves much more than a simple int multiplication.
Since AVRs also do not have an integer division unit, a simple branch is also much faster than a software division. Dividing floating-point values is the worst case of all :)
But please note that your first two snippets produce very different results.
This is an old answer but I will submit this elaborated answer for the curious.
Just typecasting a float will truncate it, i.e. 3.7 will become 3; there is no rounding.
The fastest math on a 2560 will be +, - and *, with division being the slowest because there is no hardware divide. Multiplying all operands by a pseudo decimal point (a scale constant) that suits the range of fractional numbers (1) your floats are expected to see, then typecasting to an unsigned long int and tracking the sign as a bool, will give the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division; multiply by a pseudo fraction instead, and then typecast back into a float with myFloat (defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the Show info page in the playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) and shows the actual compute times for various operations. The Mega 2560 will be very close to an ATmega328 as far as FLOPs go, up to 12.5K/s (80 µs per float divide). Typecasting would likely handicap the CPU more as it introduces more overhead and might even give erroneous results due to rounding errors and lack of precision.
(1) i.e. 543.509291 * 1000000 = 543,509,291 moves the decimal point 6 places, to the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant, like 1000 or 100000, the decimal point is preserved, and you cast back to a float by dividing by that constant when you are ready to print or store the value.
float f = 3.1428;
int x;
x = f * 10000;
x now contains 31428
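The pseudo decimal point idea can be sketched as a tiny fixed-point helper set. Everything here is illustrative: the SCALE constant and function names are made up, the conversion rounds to nearest instead of truncating so that values such as 3.1428 do not land one step low, and the 64-bit intermediate in the multiply is an assumption that trades speed for safety on an 8-bit AVR.

```c
#include <stdint.h>

#define SCALE 10000L   /* hypothetical pseudo decimal point: 4 decimal places */

/* float -> scaled integer, rounding to the nearest step. */
int32_t to_fixed(float f)
{
    return (int32_t)(f * SCALE + (f < 0 ? -0.5f : 0.5f));
}

/* Multiply two scaled integers; the product carries SCALE twice. */
int32_t fixed_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) / SCALE);
}

/* Scaled integer -> float, for printing or storing. */
float to_float(int32_t v)
{
    return (float)v / SCALE;
}
```

Note that fixed_mul still performs one division to remove the extra scale factor; choosing a power-of-two scale would let that become a shift, at the cost of a less readable decimal representation.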

Does the 6502 use signed or unsigned 8 bit registers (JAVA)?

I'm writing an emulator for the 6502, and basically, there are some instructions where there's an offset saved in one of the registers (mostly X and Y) and I'm wondering, since branch instructions use signed 8 bit integers, do the registers keep their values as 8 bit signed? Meaning this:
switch(opcode) {
//Bunch of opcodes
case 0xD5:
//Read the memory area with final address being address + x offset
int rempResult = a - readMemory(address + x);
//Comparing some things, setting/disabling flags
//Incrementing program counter and cycles/ticks
break;
//More opcodes
}
Let's say in this situation that x = 0xEE. In regular binary, this would mean that x = 238. In the 6502 however, the branch instruction uses signed offset for jumping to memory addresses, so I'm wondering, is the 238 interpreted as -18 in this case, or is it just regular unsigned 8 bit value?
It varies.
They're not explicitly signed or unsigned for arithmetic, logical, shift, or load and store operations.
The conditional branches (and the unconditional one on the later 6502 descendants) all take the argument as signed; otherwise loops would be extremely awkward.
zero, x addressing is achieved by performing an 8-bit addition of x to the zero page address, ignoring carry, and reading from the zero page. So e.g.
LDX #-126 ; which is +130 if unsigned
LDA 23, x
Would read from address 23+130 = 153. But had it been 223+130 then the end read would have been from (223 + 130) MOD 256 = 97.
absolute, x/y is unsigned and carry works correctly (but costs an extra cycle)
(zero, x) is much like the direct version in that the offset is signed but the result is always within the zero page. Then the real address is read from there.
(zero), y is unsigned with carry working and costing.
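For an emulator, the addressing rules above might be expressed along these lines. This is a sketch in C rather than the question's Java, the function names are invented, and next_pc is assumed to be the address of the instruction following the branch.

```c
#include <stdint.h>

/* zero, x: 8-bit addition, carry ignored, result stays in the zero page. */
uint16_t zero_page_x(uint8_t base, uint8_t x)
{
    return (uint8_t)(base + x);
}

/* absolute, x: x is an unsigned offset and the carry reaches the high byte. */
uint16_t absolute_x(uint16_t base, uint8_t x)
{
    return (uint16_t)(base + x);
}

/* Relative branch: the operand byte is a signed two's complement offset. */
uint16_t branch_target(uint16_t next_pc, uint8_t operand)
{
    int offset = operand < 0x80 ? operand : operand - 0x100;
    return (uint16_t)(next_pc + offset);
}
```

So with x = 0xEE, an indexed read treats it as 238 (wrapping within the zero page where applicable), whereas the same byte used as a branch operand means -18.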
The "sign" is simply the value of the most significant bit (aka bit 7) in an 8-bit byte.
6502 has support for signed values in these ways:
The N bit in .P - but it really just tells you if the last instruction turned on or off bit 7 of a memory location or register. It was common to use BPL/BMI to do stuff based on bit 7 in a memory location for flag or "boolean" like use.
The V bit of .P which is flipped "when the result of adding two positive numbers overflows and ends up negative, and when the result of adding two negative numbers overflows and ends up positive"
And of course obeying the sign bit for relative branch instructions only, e.g. BEQ with a value with bit 7 set will move to a lower memory location, not a higher one.
Beyond that, whether that bit means anything is completely up to you and your program. What really makes numbers signed or unsigned is how you display the numbers.
The linked article above goes into what one's complement and two's complement is and how it makes the mathematics work without the 6502 having to care too much about the sign.

Resources