tSQLt Assert Failure Message Numeric Precision - tsqlt

How can I increase the precision of FLOAT failed assertion messages in tSQLt?
For example
DECLARE @Expected FLOAT = -5.4371511392520810
PRINT STR(@Expected, 40, 20)
DECLARE @Actual FLOAT = @Expected - 0.0000000001
PRINT STR(@Actual, 40, 20)
EXEC tSQLt.AssertEquals @Expected, @Actual
gives
-5.4371511392520810
-5.4371511393520811
[UnitTest].[test A] failed: Expected: <-5.43715> but was: <-5.43715>

In most programming languages (including T-SQL), floating-point values are approximate, so comparing FLOAT variables for equality is often a bad idea, especially after doing arithmetic on them. For example, a FLOAT variable is (by default) only accurate to about 15 significant digits.
You can see this by adding the following line at the end of your sample code:
PRINT STR((@Actual - @Expected) * 1000000, 40, 20)
which returns -0.0001000000082740
So you could either:
Use the built-in SQL function ROUND so that numbers that are approximately the same are treated as equal (EXEC accepts only constants and variables as arguments, so round into the variables first):
SET @Expected = ROUND(@Expected, 14); SET @Actual = ROUND(@Actual, 14)
EXEC tSQLt.AssertEquals @Expected, @Actual
Use an exact type for the variables, like NUMERIC (38, 19). Replacing every FLOAT in your example with NUMERIC (38, 19) seems to give the same result, but when you add the PRINT STR((@Actual - @Expected) * 1000000, 40, 20) mentioned above, it now prints exactly -0.0001000000000000, showing that there was an inaccuracy in the earlier PRINT output as well.
Of course, your tSQLt.AssertEquals test will still fail, since the values differ in the 10th digit after the decimal point (one number is ...925... and the other is ...935...). If you want it to pass even then, round the values to 9 digits with ROUND.
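The same effect can be reproduced outside SQL Server. The following is a minimal Python sketch, not tSQLt code: Python floats are IEEE 754 doubles like FLOAT(53), and the tolerance of 1e-9 and the rounding to 9 decimals are illustrative assumptions. It shows that a six-digit display hides the difference, that 17 significant digits reveal it, and that rounding or a tolerance makes the comparison pass:

```python
import math

expected = -5.4371511392520810
actual = expected - 0.0000000001

# Six significant digits (as in the failure message) hide the difference.
short_expected = f"{expected:.6g}"
short_actual = f"{actual:.6g}"

# Seventeen significant digits are enough to tell any two doubles apart.
full_expected = f"{expected:.17g}"
full_actual = f"{actual:.17g}"

# Exact equality fails, but rounding to 9 decimals or using a tolerance passes.
equal_exact = expected == actual
equal_rounded = round(expected, 9) == round(actual, 9)
equal_tolerant = math.isclose(expected, actual, abs_tol=1e-9)

print(short_expected, short_actual)
print(full_expected, full_actual)
print(equal_exact, equal_rounded, equal_tolerant)
```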
Further information:
See David Goldberg's excellent article What Every Computer Scientist Should Know About Floating-Point Arithmetic here or here under the heading Rounding Errors.
http://msdn.microsoft.com/en-us/library/ms173773.aspx
http://www.informit.com/library/content.aspx?b=STY_Sql_Server_7&seqNum=93

Related

R Precision for Double - Why code returns negative why positive outcome expected?

I am testing 2 ways of calculating Prod(b-a), where a and b are vectors of length n. Prod(b-a)=(b1-a1)(b2-a2)(b3-a3)*...*(bn-an), where b_i>a_i>0 for all i=1,2,3,...,n. For some special cases, another way (Method 2) of calculating this prod(b-a) is more efficient. It expands the product into signed terms and sums them.
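Judging from the test code below (ftest1case), the expansion in question is presumably the usual subset form, where S is the set of indices at which a_i rather than b_i is chosen:

```latex
\prod_{i=1}^{n} (b_i - a_i)
  \;=\; \sum_{S \subseteq \{1,\dots,n\}} (-1)^{|S|}
        \prod_{i \in S} a_i \prod_{i \notin S} b_i
```

The sum has 2^n terms, which matches the loop over 1:N with N = 2^n in the code.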
My question is this: when a_i is very close to b_i, the true result can be very close to 0, something like 10^(-16). Method 1 (subtract and multiply) always returns a positive result. Method 2 (using the formula) sometimes returns a negative result (about 7~8% of the time in my experiment). Mathematically, these 2 methods should return exactly the same value, but on a computer they apparently produce different outputs.
Here is my code to run the test. When I run the testing code 10000 times, about 7~8% of the runs of Method 2 return a negative result. According to the official documentation, the R double has a precision of "2.225074e-308", as indicated by the R parameter ".Machine$double.xmin". Why does the result become negative when the differences are between 10^(-16) and 10^(-18)? Any help that sheds light on this will be appreciated. I would also love some suggestions on how to practically increase the precision to the level indicated by the R documentation.
########## Testing code 1.
ftest1case<-function(a,b) {
  n<-length(a)
  if (length(b)!=n) stop("--------- length a and b are not right.")
  if ( any(b<a) ) stop("---------- b has to be greater than a all the time.")
  out1<-prod(b-a)
  out2<-0
  N<-2^n
  for ( i in 1:N ) {
    tidx<-rev(as.integer(intToBits(x=i-1))[1:n])
    tsign<-ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
    out2<-out2+tsign*prod(b[tidx==0])*prod(a[tidx==1])
  }
  c(out1,out2)
}
########## Testing code 2.
ftestManyCases<-function(N,printFreq=1000,smallNum=10^(-20))
{
  tt<-matrix(0,nrow=N,ncol=2)
  n<-12
  for ( i in 1:N) {
    a<-runif(n,0,1)
    b<-a+runif(n,0,1)*0.1
    tt[i,]<-ftest1case(a=a,b=b)
    if ( (i%%printFreq)==0 ) cat("----- i = ",i,"\n")
    if ( tt[i,2]< smallNum ) cat("------ i = ",i, " ---- Negative summation found.\n")
  }
  tout<-apply(tt,2,FUN=function(x) { round(sum(x<smallNum)/N,6) } )
  names(tout)<-c("PerLess0_Method1","PerLess0_Method2")
  list(summary=tout, data=tt)
}
######## Step 1. Test for 1 case.
n<-12
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
ftest1case(a=a,b=b)
######## Step 2 Test Code 2 for multiple cases.
N<-300
tt<-ftestManyCases(N=N,printFreq = 100)
tt[[1]]
It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(n,0,1)  ## n draws (the original runif(a,0,1) used a before it was defined)
b <- a + 0.01
prod(a-b) ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for ( i in 1:N ) {
  tidx <- rev(as.integer(intToBits(x=i-1))[1:n])
  tsign <- ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
  j <- as.logical(tidx)
  out2v[i] <- tsign*prod(b[!j])*prod(a[j])
}
sum(out2v) ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest positive (normalized) floating-point value that the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value x such that 1+x > 1, i.e. the smallest relative value that can be added without catastrophic cancellation (values a little bit bigger than this magnitude will experience severe, but not catastrophic, cancellation).
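The distinction between the two constants can be checked directly. A small Python sketch (Python is used here only for illustration; sys.float_info.epsilon and sys.float_info.min are assumed to play the roles of .Machine$double.eps and .Machine$double.xmin, since both languages use IEEE 754 doubles):

```python
import sys

eps = sys.float_info.epsilon   # about 2.2e-16, analogous to .Machine$double.eps
tiny = sys.float_info.min      # about 2.2e-308, analogous to .Machine$double.xmin

# eps limits relative precision: next to 1.0, anything much smaller is lost.
eps_visible = 1.0 + eps > 1.0
half_eps_lost = 1.0 + eps / 2 == 1.0

# tiny only limits how small a value can be when nothing larger is involved.
tiny_is_positive = tiny > 0.0
small_lost_next_to_one = 1.0 + 1e-18 == 1.0   # far above tiny, yet invisible next to 1.0

print(eps, tiny)
print(eps_visible, half_eps_lost, tiny_is_positive, small_lost_next_to_one)
```

So a result near 1e-18 is perfectly storable, but it cannot survive being added to terms of size around 1.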
Now look at the distribution of the values in out2v:
hist(out2v)
There are clusters of negative and positive numbers of similar magnitude. If our summation happens to add a bunch of values that almost cancel (so that the result is very close to 0), then add that to another value that is not nearly zero, we'll get bad cancellation.
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.
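To separate the mathematics from the rounding, the same experiment can be repeated with exact rational arithmetic. This is a Python sketch rather than R, and the helper name expansion_terms, the seed, and the use of fractions.Fraction as the exact type are my own choices for illustration. With exact rationals the expanded sum agrees with the direct product; with doubles only the direct product can be trusted:

```python
import random
from fractions import Fraction
from math import prod

def expansion_terms(a, b):
    """Yield the 2^n signed terms of the expanded product prod(b - a)."""
    n = len(a)
    for mask in range(2 ** n):
        # Bit i of mask set means "take a[i]"; each a contributes a factor of -1.
        term = -1 if bin(mask).count("1") % 2 else 1
        for i in range(n):
            term = term * (a[i] if (mask >> i) & 1 else b[i])
        yield term

random.seed(1001)  # arbitrary seed, only for repeatability
n = 12
a = [random.random() for _ in range(n)]
b = [x + 0.01 for x in a]

direct = prod(y - x for x, y in zip(a, b))   # Method 1 in doubles
expanded = sum(expansion_terms(a, b))        # Method 2 in doubles

# Same inputs converted exactly to rationals: no rounding anywhere.
a_q = [Fraction(x) for x in a]
b_q = [Fraction(y) for y in b]
direct_q = prod(y - x for x, y in zip(a_q, b_q))
expanded_q = sum(expansion_terms(a_q, b_q))

print(direct, expanded)
print(float(direct_q), float(expanded_q))
```

As with Rmpfr, exact arithmetic is far slower than doubles, so it is a diagnostic tool here rather than a fix.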

ord() Function or ASCII Character Code of String with Z3 Solver

How can I convert a z3.String to a sequence of ASCII values?
For example, here is some code that I thought would check whether the ASCII values of all the characters in the string add up to 100:
import z3
def add_ascii_values(password):
    return sum(ord(character) for character in password)
password = z3.String("password")
solver = z3.Solver()
ascii_sum = add_ascii_values(password)
solver.add(ascii_sum == 100)
print(solver.check())
print(solver.model())
Unfortunately, I get this error:
TypeError: ord() expected string of length 1, but SeqRef found
It's apparent that ord doesn't work with z3.String. Is there something in Z3 that does?
The accepted answer dates back to 2018, and things have changed in the meantime, so the proposed solution no longer works with z3. In particular:
Strings are now formalized by SMTLib. (See https://smtlib.cs.uiowa.edu/theories-UnicodeStrings.shtml)
Unlike the previous version (where strings were simply sequences of bit vectors), strings are now sequences of Unicode characters. So, the encoding used in the previous answer no longer applies.
Based on this, the following would be how this problem would be coded, assuming a password of length 3:
from z3 import *
s = Solver()
# Ord of character at position i
def OrdAt(inp, i):
    return StrToCode(SubString(inp, i, 1))
# Adding ascii values for a string of a given length
def add_ascii_values(password, len):
    return Sum([OrdAt(password, i) for i in range(len)])
# We'll have to force a constant length
length = 3
password = String("password")
s.add(Length(password) == length)
ascii_sum = add_ascii_values(password, length)
s.add(ascii_sum == 100)
# Also require characters to be printable so we can view them:
for i in range(length):
    v = OrdAt(password, i)
    s.add(v >= 0x20)
    s.add(v <= 0x7E)
print(s.check())
print(s.model()[password])
Note: Due to https://github.com/Z3Prover/z3/issues/5773, to be able to run the above, you need a version of z3 that you downloaded on Jan 12, 2022 or afterwards! As of this date, none of the released versions of z3 contain the functions used in this answer.
When run, the above prints:
sat
" #!"
You can check that it satisfies the given constraint, i.e., the ord of characters add up to 100:
>>> sum(ord(c) for c in " #!")
100
Note that we no longer have to worry about modular arithmetic, since OrdAt returns an actual integer, not a bit-vector.
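For a problem this small, the solver's answer can be cross-checked without z3 at all. A brute-force sketch in plain Python (an independent check, not part of the z3 solution; the constant names are mine), enumerating every printable 3-character string under the same bounds:

```python
from itertools import product

PRINTABLE = range(0x20, 0x7F)  # same bounds as the solver constraints: 0x20..0x7E
TARGET = 100
LENGTH = 3

solutions = [
    "".join(map(chr, codes))
    for codes in product(PRINTABLE, repeat=LENGTH)
    if sum(codes) == TARGET
]

print(len(solutions))
print(" #!" in solutions)
```

The solver is free to return any one of these strings, so a different model from another z3 version is not a sign of a problem.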
2022 Update
The answer below, written back in 2018, no longer applies: strings in SMTLib received a major update, so the code given is outdated. I am keeping it here for archival purposes, and in case you happen to have a really old z3 that you cannot upgrade for some reason. See the other answer for a variant that works with the new Unicode strings in SMTLib: https://stackoverflow.com/a/70689580/936310
Old Answer from 2018
You're conflating Python strings and Z3 Strings; and unfortunately the two are quite different types.
In Z3py, a String is simply a sequence of 8-bit values. And what you can do with a Z3 String is actually quite limited; for instance, you cannot iterate over the characters like you did in your add_ascii_values function. See this page for what the allowed functions are: https://rise4fun.com/z3/tutorialcontent/sequences (This page lists the functions in SMTLib parlance; but the equivalent ones are available from the z3py interface.)
There are a few important restrictions/things that you need to keep in mind when working with Z3 sequences and strings:
You have to be very explicit about the lengths; in particular, you cannot sum over strings of arbitrary symbolic length. There are a few things you can do without specifying the length explicitly, but these are limited. (Like regex matches, substring extraction etc.)
You cannot extract a character out of a string. This is an oversight in my opinion, but SMTLib just has no way of doing so for the time being. Instead, you get a list of length 1. This causes a lot of headaches in programming, but there are workarounds. See below.
Anytime you loop over a string/sequence, you have to go up to a fixed bound. There are ways to program so you can cover "all strings up to length N" for some constant "N", but they do get hairy.
Keeping all this in mind, I'd go about coding your example like the following; restricting password to be precisely 10 characters long:
from z3 import *
s = Solver()
# Work around the fact that z3 has no way of giving us an element at an index. Sigh.
ordHelperCounter = 0
def OrdAt(inp, i):
    global ordHelperCounter
    v = BitVec("OrdAtHelper_%d_%d" % (i, ordHelperCounter), 8)
    ordHelperCounter += 1
    s.add(Unit(v) == SubString(inp, i, 1))
    return v
# Your original function, but note the addition of len parameter and use of Sum
def add_ascii_values(password, len):
    return Sum([OrdAt(password, i) for i in range(len)])
# We'll have to force a constant length
length = 10
password = String("password")
s.add(Length(password) == length)
ascii_sum = add_ascii_values(password, length)
s.add(ascii_sum == 100)
# Also require characters to be printable so we can view them:
for i in range(length):
    v = OrdAt(password, i)
    s.add(v >= 0x20)
    s.add(v <= 0x7E)
print(s.check())
print(s.model()[password])
The OrdAt function works around the problem of not being able to extract characters. Also note how we use Sum instead of sum, and how all "loops" are of fixed iteration count. I also added constraints to make all the ascii codes printable for convenience.
When you run this, you get:
sat
":X|@`y}@@@"
Let's check it's indeed good:
>>> len(":X|@`y}@@@")
10
>>> sum(ord(character) for character in ":X|@`y}@@@")
868
So, we did get a length 10 string; but how come the ord's don't sum up to 100? Now, you have to remember sequences are composed of 8-bit values, and thus the arithmetic is done modulo 256. So, the sum actually is:
>>> sum(ord(character) for character in ":X|@`y}@@@") % 256
100
To avoid the overflows, you can either use larger bit-vectors, or more simply use Z3's unbounded Integer type Int. To do so, use the BV2Int function, by simply changing add_ascii_values to:
def add_ascii_values(password, len):
    return Sum([BV2Int(OrdAt(password, i)) for i in range(len)])
Now we'd get:
unsat
That's because each of our characters has at least value 0x20 and we wanted 10 characters; so there's no way to make them all sum up to 100. And z3 is precisely telling us that. If you increase your sum goal to something more reasonable, you'd start getting proper values.
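Both observations are plain arithmetic and can be checked in ordinary Python, with no solver involved. The sketch below only restates the numbers from this answer (the variable names are mine):

```python
BITS = 8
MODULUS = 1 << BITS            # 8-bit values wrap around at 256

true_sum = 868                 # the plain Python sum reported above
wrapped_sum = true_sum % MODULUS

length = 10
smallest_printable = 0x20
lowest_possible_sum = length * smallest_printable

print(wrapped_sum)             # what the 8-bit bit-vector sum sees
print(lowest_possible_sum)     # already above 100, hence unsat with unbounded integers
```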
Programming with z3py is different than regular programming with Python, and z3 String objects are quite different than those of Python itself. Note that the sequence/string logic isn't even standardized yet by the SMTLib folks, so things can change. (In particular, I'm hoping they'll add functionality for extracting elements at an index!).
Having said all this, going over the https://rise4fun.com/z3/tutorialcontent/sequences would be a good start to get familiar with them, and feel free to ask further questions.

Best way to correct the modulus error in R?

The core R engine has a serious flaw with the way it expresses output from the Modulus operation:
ceiling((1.99 %% 1) * 100)
Returns: 99 (correct)
ceiling((2.99 %% 1) * 100)
Returns: 100 (incorrect)
The behavior will manifest in any integer value N + 2.99 (e.g. 3.99, etc.). If this is tied to a floating point representation, the system is not expressing the full details of the difference. This is especially disturbing because:
Both (1.99 %% 1) and (2.99 %% 1) appear to return 0.99.
Both ((1.99 %% 1) * 100) and ((2.99 %% 1) * 100) appear to return 99.
However, if you do any rounding or similar mathematical operations, the invisible residual value for 2.99 flips things in an unexpected way.
While solving this problem for my current application is trivial:
floor((2.99 - floor(2.99)) * 100)
Returns: 99 (correct)
sprintf("%.22f", floor((2.99 - floor(2.99)) * 100))
Returns: 99.0000000000000000000000 (correct)
... I wonder how many other instances that Modulus returns bad values without the underlying detail to show the floating point delta. Is there a way to expose the underlying residual value which Modulus seems to attach? It's otherwise invisible.
EDIT: As per the generous example from andrew.punnett below, print(1.99, digits = 22) returns 1.99 (no float expansion), while print(1.99 %% 1, digits = 22) returns 0.98999999999999999. As per the astute eye of Aaron, this appears to be version and / or system dependent.
Thanks!
This isn't really a bug in R. It is really a property of floating-point arithmetic.
The problem arises because neither 1.99 nor 2.99 can be represented exactly as a floating-point number. The closest value to 2.99 that can be stored in a double-precision (64-bit) floating-point number is 2.99000000000000021316282072803 (try the conversion here)
Therefore the expression evaluates as:
ceiling((2.99 %% 1) * 100) = ceiling(99.000000000000021316282072803)
= 100
Contrastingly, the nearest representation of 1.99 is 1.989999999999999991118215803 which happens to give the answer you expect:
ceiling((1.99 %% 1) * 100) = ceiling(98.9999999999999991118215803)
= 99
Both results are correct with respect to IEEE 754 floating-point arithmetic, but as you have seen only one agrees with the result you would get by applying the rules of real-number arithmetic.
This problem is compounded by the fact that the default behaviour in R is to show only 7 significant digits of every floating-point number you print(). If you want to see more digits, then you must supply a digits parameter:
print(1.99, digits = 22)
However, even this doesn't give you the correct number of digits on all platforms, so a more reliable way to accurately view a floating-point number is:
cat(sprintf("%.22f\n", 1.99))
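The same behaviour shows up in any language that uses IEEE 754 doubles. As an illustration in Python rather than R (Python's % on floats stands in for R's %%, and decimal.Decimal is my choice of exact type), the stored values and one way around the surprise look like this:

```python
import math
from decimal import Decimal

# Decimal(float) shows the exact binary value that is actually stored.
stored_299 = Decimal(2.99)
stored_199 = Decimal(1.99)
print(stored_299)   # slightly above 2.99
print(stored_199)   # slightly below 1.99

# Same expressions as in the question, evaluated in doubles.
ceil_199 = math.ceil((1.99 % 1) * 100)
ceil_299 = math.ceil((2.99 % 1) * 100)
print(ceil_199, ceil_299)

# Working in exact decimal arithmetic avoids the surprise.
cents = int((Decimal("2.99") % 1) * 100)
print(cents)
```

Starting from the string "2.99" rather than the float 2.99 is what keeps the decimal computation exact.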

Base conversion error in matlab code

I created the following simple matlab functions to convert a number from an arbitrary base to decimal and back
this is the first one
function decNum = base2decimal(vec, base)
    decNum = vec(1);
    for d = 1:1:length(vec)-1
        decNum = decNum*base + vec(d+1);
    end
and here is the other one
function baseNum = decimal2base(num, base, Vlen)
    ii = 1;
    if num == 0
        baseNum = 0;
    end
    while num ~= 0
        baseNum(ii) = mod(num, base);
        num = floor(num./base);
        ii = ii+1;
    end
    baseNum = fliplr(baseNum);
    if Vlen>(length(baseNum))
        baseNum = [zeros(1,(Vlen)-(length(baseNum))) baseNum ];
    end
Because there are limits to how big a number can be, these functions can't successfully convert very big vectors, but while testing them I noticed the following bug
Let's use the following testing function
num = 201;
pCount = 7
x=base2decimal(repmat(num-1, 1, pCount), num)
repmat(num-1, 1, pCount)
y=decimal2base(x, num, 1)
isequal(repmat(num-1, 1, pCount),y)
A supposed vector with seven (7) digits in base201 works fine, but the same vector with base200 does not return the expected result even though it is smaller and theoretically should be converted successfully.
(One preliminary comment: calling base2decimal won't result in a decimal number but rather in a number :-D)
This is due to the limited precision of floating-point numbers (in our case, double). To test it, just type at the MATLAB Command Window:
>> 200^7 - 1 == 200^7
ans =
1
>> mod(200^7 - 1, 200)
ans =
0
which means that the value of your number in base 200 (which is precisely 200^7 - 1) is stored as 200^7, and the "true" value of the representation is 200^7.
On the other hand:
>> 201^7 - 1 == 201^7
ans =
1
so still the two numbers are represented the same, but
>> mod(201^7 - 1, 201)
ans =
200
which means that the two values share the "true" representation of 201^7 - 1, which, by accident, is the value that you expected.
TL;DR
When stored in a double, 200^7 - 1 is inaccurately represented as 200^7, while 201^7 - 1 is accurately represented.
"Bigger numbers are less accurately represented than smaller numbers" is a misconception: if it was true, there would be no big numbers that could be exactly represented.
Judging from your own observations:
The code works fine in most cases
The code can give small errors for large numbers
The suspect is apparent:
Rounding issues seem to give you headaches here. This is also illustrated by @RTL in the comments.
The first question should now be:
1. Do you need perfect accuracy for such large numbers? Or is it ok if it is off by a relatively small amount sometimes?
If you do need perfect accuracy, I would recommend that you try a different storage format.
The simple solution would be to use big integers:
uint64
The alternative would be to make your own storage format. This is required if you need even bigger numbers. I think you can cover a huge range with a cell array and some tricks, but of course it is going to be hard to combine those numbers afterwards without losing the accuracy that you worked so hard for.
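To show what an exact integer representation buys you, here is the same pair of conversions as a Python sketch (Python rather than MATLAB, because its built-in integers are arbitrary precision; the names base_to_int and int_to_base are hypothetical stand-ins for base2decimal and decimal2base):

```python
def base_to_int(digits, base):
    """Horner's rule over exact integers (most significant digit first)."""
    value = 0
    for d in digits:
        value = value * base + d
    return value

def int_to_base(value, base, min_len=1):
    """Inverse of base_to_int, left-padded with zeros to min_len digits."""
    digits = []
    while value:
        value, d = divmod(value, base)
        digits.append(d)
    digits.reverse()
    return [0] * (min_len - len(digits)) + digits

# A double cannot tell 200^7 - 1 from 200^7 (both are above 2^53) ...
doubles_collide = float(200**7 - 1) == float(200**7)
print(doubles_collide)

# ... but the integer round trip is exact for both bases.
round_trips = []
for base in (200, 201):
    digits = [base - 1] * 7
    round_trips.append(int_to_base(base_to_int(digits, base), base) == digits)
print(round_trips)
```

In MATLAB, the analogous approach would be to keep every intermediate value in uint64, which covers 200^7 comfortably but still has a hard upper limit.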

How do computers evaluate huge numbers?

If I enter a value, for example
1234567 ^ 98787878
into Wolfram Alpha it can provide me with a number of details. This includes decimal approximation, total length, last digits etc. How do you evaluate such large numbers? As I understand it a programming language would have to have a special data type in order to store the number, let alone add it to something else. While I can see how one might approach the addition of two very large numbers, I can't see how huge numbers are evaluated.
10^2 could be calculated through repeated addition. However a number such as the example above would require a gigantic loop. Could someone explain how such large numbers are evaluated? Also, how could someone create a custom large datatype to support large numbers in C# for example?
Well, it's quite easy, and you could have done it yourself.
The number of digits can be obtained via the logarithm:
since `A^B = 10 ^ (B * log(A, 10))`
we can compute (A = 1234567; B = 98787878) in our case that
`B * log(A, 10) = 98787878 * log(1234567, 10) = 601767807.4709646...`
integer part + 1 (601767807 + 1 = 601767808) is the number of digits
The first, say, five digits can be obtained via the logarithm as well;
now we should analyze the fractional part of
B * log(A, 10) = 98787878 * log(1234567, 10) = 601767807.4709646...
f = 0.4709646...
first digits are 10^f (decimal point removed) = 29577...
The last, say, five digits can be obtained as the corresponding remainder:
last five digits = A^B rem 10^5
A rem 10^5 = 1234567 rem 10^5 = 34567
A^B rem 10^5 = ((A rem 10^5)^B) rem 10^5 = (34567^98787878) rem 10^5 = 45009
last five digits are 45009
You may find BigInteger.ModPow (C#) very useful here
Finally
1234567^98787878 = 29577...45009 (601767808 digits)
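All three steps fit in a few lines. A sketch in Python (used here only for illustration; the answer itself points to BigInteger.ModPow for C#), relying on math.log10 and the built-in three-argument pow for modular exponentiation:

```python
import math

A, B = 1234567, 98787878

# A^B = 10^(B * log10(A)); the integer part gives the digit count.
exponent10 = B * math.log10(A)
num_digits = math.floor(exponent10) + 1

# The fractional part gives the leading digits.
fraction = exponent10 - math.floor(exponent10)
leading = str(10 ** fraction).replace(".", "")[:5]

# Three-argument pow does modular exponentiation without ever building A^B.
trailing = pow(A, B, 10 ** 5)

print(num_digits, leading, f"{trailing:05d}")
```

The leading digits come from double-precision arithmetic, so only the first handful can be trusted; more of them would need a higher-precision logarithm.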
There are usually libraries providing a bignum datatype for arbitrarily large integers (e.g. mapping digits k*n...(k+1)*n-1, k=0..<some m depending on n and number magnitude> to a machine word of size n and redefining the arithmetic operations). For C#, you might be interested in BigInteger.
Exponentiation can be recursively broken down:
pow(a,2*b) = pow(a,b) * pow(a,b);
pow(a,2*b+1) = pow(a,b) * pow(a,b) * a;
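Written out as code, the recursion needs only a logarithmic number of multiplications. A minimal Python sketch (the function name power and the multiplication counter are mine, added to make the saving visible):

```python
def power(a, b):
    """Return (a**b, number of multiplications) using the recursion above."""
    if b == 0:
        return 1, 0
    half, count = power(a, b // 2)
    result = half * half          # pow(a, 2*k) = pow(a, k) * pow(a, k)
    count += 1
    if b % 2:
        result *= a               # an odd exponent needs one extra factor
        count += 1
    return result, count

value, multiplications = power(3, 1000)
print(value == 3 ** 1000, multiplications)
```

The count stays below 2 * log2(b) + 2, compared with b - 1 multiplications for the naive loop.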
There are also number-theoretic results that have engendered special algorithms to determine properties of large numbers without actually computing them (to be precise: their full decimal expansion).
To compute how many digits there are, one uses the following expression:
decimal_digits(n) = 1 + floor(log_10(n))
This gives:
decimal_digits(1234567^98787878) = 1 + floor(log_10(1234567^98787878))
= 1 + floor(98787878 * log_10(1234567))
= 1 + floor(98787878 * 6.0915146640862625)
= 1 + floor(601767807.4709647)
= 601767808
The trailing k digits are computed by doing exponentiation mod 10^k, which keeps the intermediate results from ever getting too large.
The approximation will be computed using a (software) floating-point implementation that effectively evaluates a^(98787878 log_a(1234567)) to some fixed precision for some number a that makes the arithmetic work out nicely (typically 2 or e or 10). This also avoids the need to actually work with millions of digits at any point.
There are many libraries for this, and the capability is built in in the case of Python. You seem primarily concerned with the size of such numbers and the time it may take to do computations like the exponentiation in your example. So I'll explain a bit.
Representation
You might use an array to hold all the digits of large numbers. A more efficient way would be to use an array of 32 bit unsigned integers and store "32 bit chunks" of the large number. You can think of these chunks as individual digits in a number system with 2^32 distinct digits or characters. I used an array of bytes to do this on an 8-bit Atari800 back in the day.
Doing math
You can obviously add two such numbers by looping over all the digits and adding elements of one array to the other and keeping track of carries. Once you know how to add, you can write code to do "manual" multiplication by multiplying digits and putting the results in the right place and a lot of addition - but software will do all this fairly quickly. There are faster multiplication algorithms than the one you would use manually on paper as well. Paper multiplication is O(n^2) where other methods are O(n*log(n)). As for the exponent, you can of course multiply by the same number millions of times but each of those multiplications would be using the previously mentioned function for doing multiplication. There are faster ways to do exponentiation that require far fewer multiplies. For example you can compute x^16 by computing (((x^2)^2)^2)^2 which involves only 4 actual (large integer) multiplications.
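The representation and the addition loop can be sketched as follows. This is a toy Python version for illustration only (Python already has big integers built in, which is what makes the result easy to check); the helper names and the least-significant-first limb order are my assumptions:

```python
BASE = 2 ** 32  # each list element is one 32-bit "digit" (limb), least significant first

def to_limbs(value):
    """Split a non-negative integer into base-2^32 limbs."""
    limbs = []
    while value:
        value, limb = divmod(value, BASE)
        limbs.append(limb)
    return limbs or [0]

def from_limbs(limbs):
    """Reassemble the integer, used here only to check the result."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))

def add_limbs(x, y):
    """Schoolbook addition with carry propagation."""
    result, carry = [], 0
    for i in range(max(len(x), len(y))):
        total = carry
        total += x[i] if i < len(x) else 0
        total += y[i] if i < len(y) else 0
        carry, limb = divmod(total, BASE)
        result.append(limb)
    if carry:
        result.append(carry)
    return result

p, q = 2 ** 100 - 1, 2 ** 64 + 12345
print(from_limbs(add_limbs(to_limbs(p), to_limbs(q))) == p + q)
```

Multiplication follows the same pattern, with one limb-by-limb product per pair of positions plus the carries.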
In practice
It's fun and educational to try writing these functions yourself, but in practice you will want to use an existing library that has been optimized and verified.
I think a part of the answer is in the question itself :) To store these expressions, you can store the base (or mantissa), and exponent separately, like scientific notation goes. Extending to that, you cannot possibly evaluate the expression completely and store such large numbers, although, you can theoretically predict certain properties of the consequent expression. I will take you through each of the properties you talked about:
Decimal approximation: Can be calculated by evaluating simple log values.
Total number of digits for expression a^b, can be calculated by the formula
Digits = floor(1 + Log10(a^b)), where floor gives the largest integer that does not exceed its argument. E.g. the number of digits in 10^5 is 6.
Last digits: These can be calculated from the fact that the last digits of successive powers repeat in a cycle. E.g. in the units place, 7, 9, 3, 1 repeats for 7^x (x = 1, 2, 3, ...). So, you can calculate that if x%4 is 0, the last digit is 1.
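The cycle is easy to confirm, and it turns the last digit into a table lookup. A short Python sketch (the function name is hypothetical):

```python
# Last digits of 7^1, 7^2, ... repeat with period 4.
cycle = [pow(7, x, 10) for x in range(1, 9)]
print(cycle)

def last_digit_of_power_of_7(x):
    """Look up the last digit of 7^x (x >= 1) from the cycle instead of computing 7^x."""
    return [1, 7, 9, 3][x % 4]

print(last_digit_of_power_of_7(98787878))
```

Since 1234567 ends in 7, the same lookup gives the final digit of 1234567^98787878.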
Whether someone can create a custom datatype for large numbers, I can't say, but I am sure the number won't be evaluated and stored.

Resources