Unexpected value counts using awk - unix

I have a text file called "test.txt" containing multiple lines with the fields separated by a semicolon. I'm trying to take the value of field 3, strip out everything but the number in the field, compare it to the value of field 3 on the previous line and, if the value has changed, redirect the field 3 value and the difference between it and the last value to a file called "differences.txt".
So far, I have the following code:
awk -F';' '
BEGIN { d = 0 }
{ gsub(/^.*=/, "", $3)                        # strip everything up to the "=", leaving only the number
  if (d > 0 && $3 - d > 0) print $3, $3 - d   # value went up: print it and the difference
  d = $3 }                                    # remember this value for the next line
' test.txt > differences.txt
This works absolutely fine when I run it against the following text:
field1=xxx;field2=xxx;field3=111222222;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222222;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222333;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222444;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222555;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222555;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222777;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222888;field4=xxx;field5=xxx
Output, as expected:
111222333 111
111222444 111
111222555 111
111222777 222
111222888 111
However, when I run the following text through it, I get completely different, unexpected numbers. I'm not sure if it's due to the increased length of the field or something?
test:
test=none;test=20170606;test=1111111111111111111;
test=none;test=20170606;test=2222222222222222222;
test=none;test=20170606;test=3333333333333333333;
test=none;test=20170606;test=4444444444444444444;
test=none;test=20170606;test=5555555555555555555;
test=none;test=20170606;test=5555555555555555555;
test=none;test=20170606;test=6666666666666666666;
test=none;test=20170606;test=7777777777777777777;
test=none;test=20170606;test=8888888888888888888;
test=none;test=20170606;test=9999999999999999999;
test=none;test=20170606;test=100000000000000000000;
test=none;test=20170606;test=11111111111111111111;
Output, with unexpected values:
2222222222222222222 1111111111111111168
3333333333333333333 1111111111111111168
4444444444444444444 1111111111111111168
5555555555555555555 1111111111111110656
6666666666666666666 1111111111111111680
7777777777777777777 1111111111111110656
8888888888888888888 1111111111111111680
9999999999999999999 1111111111111110656
100000000000000000000 90000000000000000000
Can anyone see where I'm going wrong, as I'm obviously missing something... and it's driving me mental!!
Many thanks! :)

The numbers in the second example input are too large. The logic of the program is correct, but precision is lost when doing arithmetic on very large integers: 2222222222222222222 - 1111111111111111111 comes out as 1111111111111111168 instead of the expected 1111111111111111111.
See a detailed explanation in The GNU Awk User’s Guide:
As has been mentioned already, awk uses hardware double precision with 64-bit IEEE binary floating-point representation for numbers on most systems. A large integer like 9,007,199,254,740,997 has a binary representation that, although finite, is more than 53 bits long; it must also be rounded to 53 bits. The biggest integer that can be stored in a C double is usually the same as the largest possible value of a double. If your system double is an IEEE 64-bit double, this largest possible value is an integer and can be represented precisely. What more should one know about integers?
If you want to know what is the largest integer, such that it and all smaller integers can be stored in 64-bit doubles without losing precision, then the answer is 2^53. The next representable number is the even number 2^53 + 2, meaning it is unlikely that you will be able to make gawk print 2^53 + 1 in integer format. The range of integers exactly representable by a 64-bit double is [-2^53, 2^53]. If you ever see an integer outside this range in awk using 64-bit doubles, you have reason to be very suspicious about the accuracy of the output.
As EdMorton pointed out in a comment, you can have arbitrary-precision arithmetic if your Awk was compiled with MPFR support and you specify the -M flag.
For more details, see 15.3 Arbitrary-Precision Arithmetic Features.
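To see the rounding for yourself outside of Awk, here is a small Python sketch (used purely for illustration: Python's integers are exact, while its floats are the same 64-bit doubles Awk uses):
print(2222222222222222222 - 1111111111111111111)                       # 1111111111111111111 -- exact integer arithmetic
print(int(float(2222222222222222222) - float(1111111111111111111)))   # 1111111111111111168 -- the value awk printed
print(2 ** 53)                                                          # 9007199254740992 -- largest integer doubles track exactly
print(float(2 ** 53) + 1 == float(2 ** 53))                             # True -- adding 1 is lost to rounding
If your gawk was built with MPFR/GMP support, running the original one-liner as gawk -M -F';' '...' test.txt should produce the exact differences, since -M switches integer arithmetic to arbitrary precision.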

Related

Is there a way to display more than 25 digits when outputting in Scilab?

I'm working with Scilab 5.5.2 and when using the format command I can display at most 25 digits of a number. Is there a way to somehow display more than this number?
Scilab operates with double precision floating point numbers; it does not support variable-precision arithmetic. Double precision means a relative error of %eps, which is 2^-52, approximately 2e-16.
This means you can't even get 25 correct decimal digits: when using format(25) you get garbage at the end. For example,
format(25); sqrt(3)
returns 1.732050807568877 1931766
I separated the last 7 digits here because they are wrong; the correct value of sqrt(3) begins with
1.732050807568877 2935274
Of course, if you don't mind the digits being wrong, you can have as many as you want:
strcat([sprintf('%.15f', sqrt(3)), "1111111111111111111111111111111"])
returns 1.7320508075688771111111111111111111111111111111.
But if you want arbitrary precision for real numbers, Scilab is not the right tool for the job (correction: phuclv pointed out the Multiple Precision Arithmetic Toolbox, which might work for you). Out of free software packages, the mpmath Python library implements arbitrary precision of real numbers: it can be used directly or via Sagemath or SymPy. Commercial packages (Matlab, Maple, Mathematica) support variable precision too.
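To illustrate the mpmath option (assuming the library is installed, e.g. via pip install mpmath), a few lines give as many correct digits as you like:
from mpmath import mp
mp.dps = 30                # work with 30 significant decimal digits
print(mp.sqrt(3))          # 1.73205080756887729352744634151, correct to 30 significant digits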
As for Scilab, I recommend using formatted print commands such as fprintf or sprintf, because they actually care about the output being meaningful. Example: printf('%.25f', sqrt(3)) returns
1.7320508075688772000000000
with garbage replaced by zeros. The last nonzero digit is still off by 1, but at least it's not meaningless.
Scilab uses the double-precision floating-point type, which has 53 bits of mantissa and can only be precise to ~15-17 digits. There's no reason to print digits beyond that.
If 25 digits of accuracy are needed, then you can use a quadruple-precision or double-double arithmetic library such as the Multiple Precision Arithmetic Toolbox on ATOMS.
If you need even more precision, then the only way is to use an arbitrary precision library like mpscilab, Xnum, etc...

sqlite3 shortens float value

I do a simple:
latitude:String = String.fromCString(UnsafePointer(sqlite3_column_text(statement, 11)))!
The value in the Database is "real".
In the database I have
51.234183426424316 (verified using Firefox'SQLite Manager)
With the above I get in my String only:
51.2341834264243
(the last two digits are missing with is not acceptable working with coordinates)
Any explanations? Solutions?
SQLite stores such numbers as 64-bit IEEE floating-point numbers, which have a significand precision of 53 bits, corresponding to about 15-17 decimal digits.
How to format such a number for display is a different question.
If you want to have control over it, get the original value with sqlite3_column_double(), and convert it to a string yourself.
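For illustration only (the question uses Swift, but the idea is the same in any language), here is how it looks with Python's sqlite3 module, which hands the REAL column back as a 64-bit double and leaves the formatting to you; the database, table and column names below are made up:
import sqlite3
conn = sqlite3.connect("places.db")                                       # hypothetical database file
(lat,) = conn.execute("SELECT latitude FROM places LIMIT 1").fetchone()   # hypothetical table/column
print(f"{lat:.15g}")   # 51.2341834264243 -- 15 significant digits, like the truncated output above
print(repr(lat))       # e.g. 51.234183426424316 -- the shortest string that round-trips the double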
(And you are complaining about a difference that is smaller than the wavelength of visible light ...)

Why is zsh returning the difference of two floating point numbers in engineering notation?

When I subtract certain numbers whose difference is rather small, zsh doesn't output a floating point number like I expect. Instead, it outputs the difference like this:
% echo $((-78.44335 - -78.4433))
-5.0000000001659828e-05
This is causing unexpected behavior in a script which deals with arbitrary numbers: except when the difference is very small, there is no problem.
Why is zsh doing this? How can I make it always output a normal floating point number instead?
Edit:
My actual application is closer to this:
var=$((-78.44335 - -78.4433))
var2=$var
var=$((var * var3))
etc.
Concerning the engineering notation, this is normal when the exponent ≤ −5, and often the preferred way to represent floating-point numbers. If you don't like that, you can use printf with %f; for instance:
$ printf "%.24f\n" $((-78.44335 - -78.4433))
-0.000050000000001659827831
Alternatively, to assign the result to a variable without having to use a command substitution (thus a subshell):
$ ((var = -78.44335 - -78.4433))
$ echo $var
-0.0000500000
But only 10 digits are output after the decimal point (like printf "%.10f"). This may not be sufficient for all applications.
An additional note about the trailing digits: floating-point numbers are represented in base 2. This means that when converting a decimal number such as -78.44335 or -78.4433 to base 2, a rounding error generally occurs (because the decimal number cannot be represented exactly in the destination format, generally double precision). The effect of these rounding errors is what you can see in the output. In particular, when you subtract two inexact numbers that are very close to each other, a catastrophic cancellation occurs, so that the relative error is quite large.
Note: this is not specific to zsh. You'll have similar problems with all software that uses base 2 internally.
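For instance, the same subtraction in Python, which also uses 64-bit binary doubles, shows the same cancellation:
a = -78.44335
b = -78.4433
diff = a - b
print(diff)               # printed in scientific notation, just as zsh does
print(f"{diff:.24f}")     # -0.000050000000001659827831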

Two's Complement -- How are negative numbers handled?

It is my understanding that numbers are negated using the two's complement, which to my understanding is: !num + 1.
So my question is: does this mean that, for a variable 'foo'=1, a negated 'foo' will be exactly the same as a variable 'bar'=255?
If we were to check if -'foo' == 'bar' or if -'foo' == 255, would we get that they are equal?
I know that some languages, such as Java, keep a sign bit -- so the comparisons would yield false. What of languages that do not? And I'm assuming that assembler/native machine does not have a sign bit.
In addition to all of this, I read about a zero flag or a carry-over flag that is set when a 'negative' number is added to another number (of any sign). This flag is set whenever such an addition happens because of the way two's complement works: 0x01 + 0xff = 0x00 (with the leading 1 truncated). What exactly is this flag used for?
And my last question, for other math operations (such as multiplication), would I have to re-negate the number (so it is now positive), perform the operation, and negate the result? E.g., !((!neg + 1) * pos) + 1.
Edit
Finished the question, so feel free to fire away.
Yes, in two’s complement, the negation of x, that is -x, is represented as ~x+1, where ~x is the bitwise complement of the binary numeral for x in some fixed number of bits. E.g., for eight bits, the binary numeral for 1 is 00000001, so the bitwise complement is 11111110, and adding one produces 11111111.
There is no way to distinguish -1 in eight-bit two’s complement from 255 in eight-bit binary (with no sign). They both have the same representation in bits: 11111111. If you are using both of these numbers, you must either separately remember which one is eight-bit two’s complement and which one is plain eight-bit binary or you must use more than eight bits. In other words, at the raw bit level, 11111111 is just eight bits; it has no value until we decide how to interpret it.
Java and typical other languages do not maintain a sign bit separate from the value of a number; the sign is part of the encoding of the number. Also, typical languages do not allow you to compare different types. If you have a two’s complement x and an unsigned y, then either one must be converted to the type of the other before comparison or they must both be converted to a third type. Thus, if you compare x and y, and one is converted to the other, then the conversion will overflow or wrap, and you cannot expect to get the correct mathematical result. To compare these two numbers, we might convert each of them to a wider integer, such as 32-bits, then compare. Converting the eight-bit two’s complement 11111111 to a 32-bit integer produces -1, and converting the eight-bit plain binary 11111111 to a 32-bit integer produces 255, and then the comparison reports they are unequal.
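A rough way to see this with ordinary integers (Python here, purely for illustration, masking everything down to eight bits) is:
BITS = 8
MASK = (1 << BITS) - 1              # 0xFF
x = 1
neg_x = (~x + 1) & MASK             # two's complement negation, kept to 8 bits
print(neg_x)                        # 255 -- the same bit pattern as -1
signed = neg_x - (1 << BITS) if neg_x & (1 << (BITS - 1)) else neg_x
print(signed)                       # -1 -- the same pattern reinterpreted as signed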
The zero flag and the carry flag you read about are flags that are set when a comparison instruction is executed in a computer processor. Most high-level languages do not give you direct access to these flags. Many processors have an instruction with a form like this:
cmp a, b
That instruction subtracts b from a and discards the difference but remembers several flags that describe the subtraction: Was the result zero (zero flag)? Did a borrow occur (borrow flag)? Was the result negative (sign flag)? Did an overflow occur (overflow flag)?
The compare instruction requires that the two things being compared be the same type (two’s complement or unsigned), but it does not care which type. The results can be tested later by checking particular combinations of the flags depending on the type. That is, the information recorded in the flags can distinguish whether one two’s complement number was greater than another or whether one unsigned number was greater than another, depending on what tests are made. There are conditional branch instructions that test the desired flag properties.
There is generally no need to “un-negate” a number to perform arithmetic operations. Processors include arithmetic instructions that work on two’s complement numbers. Usually the add and subtract instructions are type-agnostic, the same way the compare instruction is, but the multiply and divide instructions are not (except for some forms of multiply that return partial results). The add and subtract instructions can be type-agnostic because the wrapping that occurs in the arithmetic works for both two’s complement and unsigned. However, that wrapping does not work for multiplication and division.
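A small sketch of that last point (again Python, masking to eight bits to mimic the hardware wrap-around):
MASK = 0xFF
a = 0xFE                      # 254 as unsigned, or -2 as 8-bit two's complement
print((a + 1) & MASK)         # 255 (0xFF): right as 254 + 1, and 0xFF is also -2 + 1 = -1
print((a // 2) & MASK)        # 127 (0x7F): correct if the bits meant unsigned 254
print((-2 // 2) & MASK)       # 255 (0xFF): correct if the bits meant signed -2
Addition produces the same bit pattern under either interpretation, so one instruction serves both; division does not, so separate signed and unsigned instructions are needed.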

AS3 adding 1 (+1) not working on string cast to Number?

Just learning AS3 for Flex. I am trying to do this:
var someNumber:String = "10150125903517628"; //this is the actual number i noticed the issue with
var result:String = String(Number(someNumber) + 1);
I've tried different ways of putting the expression together, and no matter what I seem to do, the result is always equal to 10150125903517628 rather than 10150125903517629.
Anyone have any ideas??! thanks!
All numbers in JavaScript/ActionScript are effectively double-precision IEEE-754 floats. These use a 64-bit binary number to represent your decimal, and have a precision of roughly 16 or 17 decimal digits.
You've run up against the limit of that format with your 17-digit number. The internal binary representation of 10150125903517628 is no different to that of 10150125903517629 which is why you're not seeing any difference when you add 1.
If, however, you add 2 then you will (should?) see the result as 10150125903517630 because that's enough of a "step" that the internal binary representation will change.
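You can reproduce the effect in any language that uses 64-bit doubles; here is a quick Python check (the float() conversions are explicit because Python's own integers are exact):
n = 10150125903517628
print(float(n) + 1 == float(n))   # True -- adding 1 is lost in rounding
print(int(float(n) + 2))          # 10150125903517630 -- a step of 2 lands on a representable value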
