This TensorFlow guide gives some insights on 8 bit representation of the neural network weight and activations. It maps the range from min-max in float32 to 8bit format by mapping min value in float32 to 0 in int8 and max value to 255. This means the addition identity (0) is mapped to non-zero value and even the multiplication identity (1) may be mapped to value other than 1 in the int8 representation. My questions are,
After loosing these identities, how the arithmetic is performed in the new representation? In case of addition/sub, we can get back the approx float32 number after appropriate scaling and offseting.
How to convert the result of multiplication in int8 format to the native float32 format?
There are some more details of the quantization process in practice here:
http://www.oreilly.com/data/free/building-mobile-applications-with-tensorflow.csp
We'll be updating the tensorflow.org documentation soon too. To specifically answer #2, you have a new min/max float range for your 32-bit accumulated result which you can use to convert back to floats.
Related
There is a lot of questions on rounding that i have looked at but tey all involve rounding a number to its nearest whole, or to a certain number of points. What i want to do is simply convert a string to a double without any added digits on the right of the decimal point. Here is my code and result as of now:
Convert the string 0.78240 to a double, which should be 0.78240 but instead is 0.78239999999999998 when i look at it in the debugger.
The string value is a QString and is converted to a double simply using the toDouble() function.
I don't understand how or where these extra numbers are coming from, but any help on converting from QString to double directly would be greatly appreciated!
The extra digits are there because you are converting a decimal real number to binary floating point.
Unlike real numbers, floating-point representations have infinite resolution and finite range, and also binary floating-point values do not exactly coincide with all (or even most) decimal real values.
The simple fact is that binary floating-point cannot exactly represent 0.7824010, your debugger is showing you all the available digits after round-tripping the binary value back to decimal.
It is not necessarily a problem, because the error is infinitesimally small compared to the magnitude of the value, and in any event the original 0.78240 value is no doubt some approximation of a real-world value - they are both approximations, just binary or decimal approximations.
The issue is normally dealt with at presentation rather then representation. For example, in this case, unlike your debugger which necessarily shows the full precision of the internal representation (you would not want it any other way in a debugger), the standard means of presenting such a value will limit itself to a small, or caller defined number of decimal places and this value presented to even 15 decimal places will be correctly presented as 0.782400000000000 (by default standard output methods will show just 0.7824).
Any double value presented at 15 significant decimal figures or fewer will display as expected, for a float this reduces to just 6 significant figures. I imagine your debugger is displaying more digits that can accurately be presented in an IEEE 754 64-bit FP (double) value because internally the x86 FPU uses an 80bit representation.
You are quite literally sweating the small stuff.
One place where this difference in representation does matter is in financial applications. For those, it is common to use decimal floating point and normally to many more significant figures than double can provide. However decimal floating-point is not normally implemented in hardware, so is much slower. Moreover decimal floating point is not directly supported in most programming languages, and requires library support. C# is an example of a language with built-in support for decimal floating-point; its decimal type is good for 28 significant figures.
I am trying to read a .tif-file in julia as a Floating Point Array. With the FileIO & ImageMagick-Package I am able to do this, but the Array that I get is of the Type Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2}.
I can convert this FixedPoint-Array to Float32-Array by multiplying it with 255 (because UInt8), but I am looking for a function to do this for any type of FixedPointNumber (i.e. reinterpret() or convert()).
using FileIO
# Load the tif
obj = load("test.tif");
typeof(obj)
# Convert to Float32-Array
objNew = real.(obj) .* 255
typeof(objNew)
The output is
julia> using FileIO
julia> obj = load("test.tif");
julia> typeof(obj)
Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2}
julia> objNew = real.(obj) .* 255;
julia> typeof(objNew)
Array{Float32,2}
I have been looking in the docs quite a while and have not found the function with which to convert a given FixedPoint-Array to a FloatingPont-Array without multiplying it with the maximum value of the Integer type.
Thanks for any help.
edit:
I made a small gist to see if the solution by Michael works, and it does. Thanks!
Note:I don't know why, but the real.(obj) .* 255-code does not work (see the gist).
Why not just Float32.()?
using ColorTypes
a = Gray.(convert.(Normed{UInt8,8}, rand(5,6)));
typeof(a)
#Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2}
Float32.(a)
The short answer is indeed the one given by Michael, just use Float32.(a) (for grayscale). Another alternative is channelview(a), which generally performs channel separation thus also stripping the color information from the array. In the latter case you won't get a Float32 array, because your image is stored with 8 bits per pixel, instead you'll get an N0f8 (= FixedPointNumbers.Normed{UInt8,8}). You can read about those numbers here.
Your instinct to multiply by 255 is natural, given how other image-processing frameworks work, but Julia has made some effort to be consistent about "meaning" in ways that are worth taking a moment to think about. For example, in another programming language just changing the numerical precision of an array:
img = uint8(255*rand(10, 10, 3)); % an 8-bit per color channel image
figure; image(img)
imgd = double(img); % convert to double-precision, but don't change the values
figure; image(imgd)
produces the following surprising result:
That second "all white" image represents saturation. In this other language, "5" means two completely different things depending on whether it's stored in memory as a UInt8 vs a Float64. I think it's fair to say that under any normal circumstances, a user of a numerical library would call this a bug, and a very serious one at that, yet somehow many of us have grown to accept this in the context of image processing.
These new types arise because in Julia we've gone to the effort to implement new numerical types (FixedPointNumbers) that act like fractional values (e.g., between 0 and 1) but are stored internally with the same bit pattern as the "corresponding" UInt8 (the one you get by multiplying by 255). This allows us to work with 8-bit data and yet allow values to always be interpreted on a consistent scale (0.0=black, 1.0=white).
I'm working on a query that adds the following numbers and it's returning a puzzling result.
>SELECT ((36.7300 - 20.4300) - 16.3) AS Amount;
>-3.5527136788005e-15
I've tried everything I can think of like casting the values to different data types, but I'm still getting the same result.
When I separate out the steps it returns the right answer, but I can't get the two arthetic steps to work together for some reason.
>SELECT 36.7300 - 20.4300 AS Amount;
>16.3
>SELECT 16.3 - 16.3 AS Amount;
>0.0
This happens because SQLite uses binary arithmetic instead of base-10 arithmetic.
From the SQLite FAQ:
(16) Why does ROUND(9.95,1) return 9.9 instead of 10.0? Shouldn't 9.95 round up?
SQLite uses binary arithmetic and in binary, there is no way to write
9.95 in a finite number of bits. The closest to you can get to 9.95 in a 64-bit IEEE float (which is what SQLite uses) is
9.949999999999999289457264239899814128875732421875. So when you type "9.95", SQLite really understands the number to be the much longer
value shown above. And that value rounds down.
This kind of problem comes up all the time when dealing with floating
point binary numbers. The general rule to remember is that most
fractional numbers that have a finite representation in decimal (a.k.a
"base-10") do not have a finite representation in binary (a.k.a
"base-2"). And so they are approximated using the closest binary
number available. That approximation is usually very close, but it
will be slightly off and in some cases can cause your results to be a
little different from what you might expect.
There is no direct solution for this, but there are several workarounds to face this problem:
Store your amount without decimals and compute INTEGER operations
Returning your REALs and do your math out of the database
Round your result out of the database
It depends on what are you going to do with your data. For example if you are managing currency you can store your amount in cents instead of dollars, but if you are storing scientific data this solution is far from valid and you may need to retrieve the operands and compute the results out of SQLite.
I used oracle dictionary views to find out column differences if any between two schema's. While syncing data type discrepancies I found that both NUMBER and INTEGER data types stored in all_tab_columns/user_tab_columns/dba_tab_columns as NUMBER only so it is difficult to sync data type discrepancies where one schema/column has number datatype and another schema/column has integer data type.
While comparison of schema's it show datatype mismatch. Please suggest if there is any other alternative apart form using dictionary views or if any specific properties from dictionary views can be used to identify if data type is integer.
the best explanation i've found is this:
What is the difference betwen INTEGER and NUMBER? When should we use NUMBER and when should we use INTEGER? I just wanted to update my comments here...
NUMBER always stores as we entered. Scale is -84 to 127. But INTEGER rounds to whole number. The scale for INTEGER is 0. INTEGER is equivalent to NUMBER(38,0). It means, INTEGER is constrained number. The decimal place will be rounded. But NUMBER is not constrained.
INTEGER(12.2) => 12
INTEGER(12.5) => 13
INTEGER(12.9) => 13
INTEGER(12.4) => 12
NUMBER(12.2) => 12.2
NUMBER(12.5) => 12.5
NUMBER(12.9) => 12.9
NUMBER(12.4) => 12.4
INTEGER is always slower then NUMBER. Since integer is a number with added constraint. It takes additional CPU cycles to enforce the constraint. I never watched any difference, but there might be a difference when we load several millions of records on the INTEGER column. If we need to ensure that the input is whole numbers, then INTEGER is best option to go. Otherwise, we can stick with NUMBER data type.
Here is the link
Integer is only there for the sql standard ie deprecated by Oracle.
You should use Number instead.
Integers get stored as Number anyway by Oracle behind the scenes.
Most commonly when ints are stored for IDs and such they are defined with no params - so in theory you could look at the scale and precision columns of the metadata views to see of no decimal values can be stored - however 99% of the time this will not help.
As was commented above you could look for number(38,0) columns or similar (ie columns with no decimal points allowed) but this will only tell you which columns cannot take decimals, and not what columns were defined so that INTS can be stored.
Suggestion:
do a data profile on the number columns. Something like this:
select max( case when trunc(column_name,0)=column_name then 0 else 1 end ) as has_dec_vals
from table_name
This is what I got from oracle documentation, but it is for oracle 10g release 2:
When you define a NUMBER variable, you can specify its precision (p) and scale (s) so that it is sufficiently, but not unnecessarily, large. Precision is the number of significant digits. Scale can be positive or negative. Positive scale identifies the number of digits to the right of the decimal point; negative scale identifies the number of digits to the left of the decimal point that can be rounded up or down.
The NUMBER data type is supported by Oracle Database standard libraries and operates the same way as it does in SQL. It is used for dimensions and surrogates when a text or INTEGER data type is not appropriate. It is typically assigned to variables that are not used for calculations (like forecasts and aggregations), and it is used for variables that must match the rounding behavior of the database or require a high degree of precision. When deciding whether to assign the NUMBER data type to a variable, keep the following facts in mind in order to maximize performance:
Analytic workspace calculations on NUMBER variables is slower than other numerical data types because NUMBER values are calculated in software (for accuracy) rather than in hardware (for speed).
When data is fetched from an analytic workspace to a relational column that has the NUMBER data type, performance is best when the data already has the NUMBER data type in the analytic workspace because a conversion step is not required.
I have a method that deals with some geographic coordinates in .NET, and I have a struct that stores a coordinate pair such that if 256 is passed in for one of the coordinates, it becomes 0. However, in one particular instance a value of approximately 255.99999998 is calculated, and thus stored in the struct. When it's printed in ToString(), it becomes 256, which should not happen - 256 should be 0. I wouldn't mind if it printed 255.9999998 but the fact that it prints 256 when the debugger shows 255.99999998 is a problem. Having it both store and display 0 would be even better.
Specifically there's an issue with comparison. 255.99999998 is sufficiently close to 256 such that it should equal it. What should I do when comparing doubles? use some sort of epsilon value?
EDIT: Specifically, my problem is that I take a value, perform some calculations, then perform the opposite calculations on that number, and I need to get back the original value exactly.
This sounds like a problem with how the number is printed, not how it is stored. A double has about 15 significant figures, so it can tell 255.99999998 from 256 with precision to spare.
You could use the epsilon approach, but the epsilon is typically a fudge to get around the fact that floating-point arithmetic is lossy.
You might consider avoiding binary floating-points altogether and use a nice Rational class.
The calculation above was probably destined to be 256 if you were doing lossless arithmetic as you would get with a Rational type.
Rational types can go by the name of Ratio or Fraction class, and are fairly simple to write
Here's one example.
Here's another
Edit....
To understand your problem consider that when the decimal value 0.01 is converted to a binary representation it cannot be stored exactly in finite memory. The Hexidecimal representation for this value is 0.028F5C28F5C where the "28F5C" repeats infinitely. So even before doing any calculations, you loose exactness just by storing 0.01 in binary format.
Rational and Decimal classes are used to overcome this problem, albeit with a performance cost. Rational types avoid this problem by storing a numerator and a denominator to represent your value. Decimal type use a binary encoded decimal format, which can be lossy in division, but can store common decimal values exactly.
For your purpose I still suggest a Rational type.
You can choose format strings which should let you display as much of the number as you like.
The usual way to compare doubles for equality is to subtract them and see if the absolute value is less than some predefined epsilon, maybe 0.000001.
You have to decide yourself on a threshold under which two values are equal. This amounts to using so-called fixed point numbers (as opposed to floating point). Then, you have to perform the round up manually.
I would go with some unsigned type with known size (eg. uint32 or uint64 if they're available, I don't know .NET) and treat it as a fixed point number type mod 256.
Eg.
typedef uint32 fixed;
inline fixed to_fixed(double d)
{
return (fixed)(fmod(d, 256.) * (double)(1 << 24))
}
inline double to_double(fixed f)
{
return (double)f / (double)(1 << 24);
}
or something more elaborated to suit a rounding convention (to nearest, to lower, to higher, to odd, to even). The highest 8 bits of fixed hold the integer part, the 24 lower bits hold the fractional part. Absolute precision is 2^{-24}.
Note that adding and substracting such numbers naturally wraps around at 256. For multiplication, you should beware.