Test equality of sets in Erlang - functional-programming

Given a data structure for sets, testing two sets for equality seems like a desirable operation, and indeed many implementations allow it (e.g. the built-in sets in Python).
There are different set implementations in Erlang: sets, ordsets, gb_sets. Their documentation does not indicate whether it is possible to test equality using term comparison ("=="), nor do they provide explicit functions for testing equality.
Some naive cases seem to allow equality testing with "==", but I have a larger application in which I can produce sets and gb_sets that are equal (tested with the function below) but do not compare equal with "==". ordsets always compare equal. Unfortunately I haven't found a way to reduce this to a minimal example of equal sets that do not compare equal with "==".
For reliably testing equality I use the following function, based on the theorem that two sets are equal if and only if each is a subset of the other:
%% @doc Compare two sets for equality.
-spec sets_equal(sets:set(), sets:set()) -> boolean().
sets_equal(Set1, Set2) ->
    sets:is_subset(Set1, Set2) andalso sets:is_subset(Set2, Set1).
My questions:
Is there a rationale why the Erlang set implementations do not offer explicit equality testing?
How can the differences in testing set equality with "==" across the different set implementations be explained?
How can I produce a minimal example of sets that do not compare equal with "==" but are equal according to the code above?
Some thoughts on question 2:
The documentation for sets states that "The representation of a set is not defined.", whereas the documentation for ordsets states that "An ordset is a representation of a set". The documentation for gb_sets gives no comparable indication.
The following comment from the source code of the sets implementation seems to reiterate the statement from the documentation:
Note that as the order of the keys is undefined we may freely reorder keys within in a bucket.
My interpretation is that term comparison with "==" in Erlang works on the representation of the sets, i.e. two sets only compare equal if their representations are identical. This would explain the different behavior of the different set implementations, but it also reinforces the question of why there is no explicit equality comparison.

ordsets are implemented as a sorted list, and the implementation is fairly open and meant to be visible. They are going to compare equal (==), although == means that 1.0 is equal to 1. They won't compare as strictly equal (=:=).
sets are implemented as a form of hash table, and its internal representation does not lend itself to any form of direct comparison; as hash collisions happen, the last element added is prepended to the list for the given hash entry. This prepend operation is sensitive to the order in which the elements are added.
gb_sets are implemented as a general balanced tree, and the structure of the tree depends on the order in which elements were inserted and on when rebalancing took place. They are not safe to compare directly.
To compare two sets of the same type together, an easy way is to call Mod:is_subset(A,B) andalso Mod:is_subset(B,A) -- two sets can only be subsets of each other when they're equal.

Related

Eva method to compute intervals [frama-c]

My goal is to understand how Eva shrinks the intervals for a variable. for example:
unsigned int nondet_uint(void);

int main()
{
    unsigned int x = nondet_uint();
    unsigned int y = nondet_uint();
    //@ assert x >= 20 && x <= 30;
    //@ assert y <= 60;
    //@ assert x >= y;
    return 0;
}
So we have x = [20,30] and y = [0,60]. However, Eva shrinks y to [0,30], which is the part of the domain where the last assertion can still hold.
[eva] ====== VALUES COMPUTED ======
[eva:final-states] Values at end of function main:
x ∈ [20..30]
y ∈ [0..30]
__retres ∈ {0}
I tried some options for the Eva plugin, but none showed the steps for it. Could you point me to the method, or to a publication, describing how these values are computed?
Showing values during abstract interpretation
I tried some options for the Eva plugin, but none showed the steps for it.
The most efficient way to follow the evaluation is not via command-line options, but by adding Frama_C_show_each(exp) statements in the code. These are special function calls which, during the analysis, emit the values of the expressions they contain. They are especially useful in loops, for instance to see when a widening is triggered, or what happens to the values of a loop counter.
Note that displaying all of the intermediary evaluation and reduction steps would be very verbose, even for very small programs. By default, this information is not exposed, since it is too dense and rarely useful.
For starters, try adding Frama_C_show_each statements, and use the Frama-C GUI to see the result. It allows focusing on any expression in the code and, in the Values tab, shows the values for the given expression, at the selected statement, for each callstack. You can also press Ctrl+E and type an arbitrary expression to have its value evaluated at that statement.
If you want more details about the values, their reductions, and the overall mechanism, see the section below.
Detailed information about values in Eva
Your question is related to the values used by the abstract interpretation engine in Eva.
Chapter 3 of the Eva User Manual describes the abstractions used by the engine, which are, succinctly:
sets of integers, which are maximally precise but limited to a number of elements (modified by option -eva-ilevel, which on Frama-C 22 is set to 8 by default);
integer intervals with periodicity information (also called modulo, or congruence), e.g. [2..42],2%10 being the set containing {2, 12, 22, 32, 42} (a quick check of this notation is sketched right after this list). In the simple case, e.g. [2..42], all integers between 2 and 42 are included;
sets of addresses (for pointers), with offsets represented using the above values (sets of integers or intervals);
intervals of floating-point variables (unlike integers, there are no small sets of floating-point values).
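To make the interval-with-congruence notation concrete, here is the quick check mentioned above, in Python (my own illustration, not Frama-C output), of which integers [2..42],2%10 denotes:
# Integers in [2, 42] that are congruent to 2 modulo 10.
values = [v for v in range(2, 43) if v % 10 == 2]
print(values)  # [2, 12, 22, 32, 42]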
Why is all of this necessary? Because without knowing some of these details, you'll have a hard time understanding why the analysis is sometimes precise, sometimes imprecise.
Note that the term reduction is used in the documentation, instead of shrinkage. So look for words related to reduce in the Eva manual when searching for clues.
For instance, in the following code:
int a = Frama_C_interval(-5, 5);
if (a != 0) {
    //@ assert a != 0;
    int b = 5 / a;
}
By default, the analysis will not be able to remove the 0 from the interval inside the if, because [-5..-1];[1..5] is not an interval, but a disjoint union of intervals. However, if the number of elements drops to -eva-ilevel or below, the analysis will convert the value into a small set and get a precise result. Therefore, changing some analysis options will result in different ranges, and different results.
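As a toy model of this behaviour (my own sketch in Python, not Frama-C code; ILEVEL and the helper names are invented for illustration), consider a value that is kept as an explicit set only while it has at most ILEVEL elements, and is otherwise collapsed to an interval:
ILEVEL = 8  # stand-in for the default -eva-ilevel mentioned above

def abstract(values, ilevel=ILEVEL):
    # Keep an explicit (precise) set while it is small enough,
    # otherwise collapse it to an interval [min, max].
    values = set(values)
    if len(values) <= ilevel:
        return ("set", values)
    return ("interval", min(values), max(values))

def refine_not_zero(av):
    # Toy reduction by the condition `x != 0`.
    if av[0] == "set":
        return ("set", av[1] - {0})   # precise: 0 really disappears
    _, lo, hi = av
    if lo == 0:
        return ("interval", 1, hi)    # an interval can only trim its bounds
    if hi == 0:
        return ("interval", lo, -1)
    return av                         # 0 is strictly inside: nothing to trim

print(refine_not_zero(abstract(range(-5, 6))))             # ('interval', -5, 5): 0 stays
print(refine_not_zero(abstract(range(-5, 6), ilevel=16)))  # ('set', ...): the ten remaining values, 0 removed
With the small default threshold, [-5..5] has 11 elements and stays an interval, so the 0 cannot be removed; raising the threshold (as raising -eva-ilevel does) lets the value be tracked as a small set, where removing 0 is exact.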
In some cases, you can force Eva to compute using disjunctions, for instance by adding the split ACSL annotation, e.g. //@ split a < b || a >= b;. But you still need to give the analysis some "fuel" for it to evaluate both branches separately. The easiest way to do so is to use -eva-precision N, with N being an integer between 0 and 11. The higher N is, the more splitting is allowed to happen, but the longer the analysis may take.
Note that, to ensure termination of the analysis, some mechanisms such as widening are used. Without it, a simple loop might require billions of evaluation steps to terminate. This mechanism may introduce extra values which lead to a less precise analysis.
Finally, there are also some abstract domains (option -eva-domains) which allow other kinds of values besides the default ones mentioned above. For instance, the sign domain allows splitting values between negative, zero and positive, and would avoid the imprecision in the above example. The Eva user manual contains examples of usage of each of the domains, indicating when they are useful.

Correct way of writing recursive functions in CLP(R) with Prolog

I am very confused about how CLP works in Prolog. Not only do I find it hard to see the benefits (I do see them in specific cases, but I find it hard to generalise), but, more importantly, I can hardly work out how to correctly write a recursive predicate. Which of the following would be the correct form in a CLP(R) way?
factorial(0, 1).
factorial(N, F) :- {
    N > 0,
    PrevN = N - 1,
    factorial(PrevN, NewF),
    F = N * NewF }.
or
factorial(0, 1).
factorial(N, F) :- {
    N > 0,
    PrevN = N - 1,
    F = N * NewF },
    factorial(PrevN, NewF).
In other words, I am not sure when I should write code outside the constraints. To me, the first case would seem more logical, because PrevN and NewF belong to the constraints. But if that's true, I am curious to see in which cases it is useful to use predicates outside the constraints in a recursive function.
There are several overlapping questions and issues in your post, probably too many to coherently address to your complete satisfaction in a single post.
Therefore, I would like to state a few general principles first, and then—based on that—make a few specific comments about the code you posted.
First, I would like to address what I think is most important in your case:
LP ⊆ CLP
This means simply that CLP can be regarded as a superset of logic programming (LP). Whether it is to be considered a proper superset, or whether it in fact makes more sense to regard them as denoting the same concept, is somewhat debatable. In my personal view, logic programming without constraints is much harder to understand and much less usable than with constraints. Given that even the very first Prolog systems had a constraint like dif/2, and that essential built-in predicates like (=)/2 perfectly fit the notion of "constraint", the boundaries, if they exist at all, seem at least somewhat artificial to me, suggesting that:
LP ≈ CLP
Be that as it may, the key concept when working with CLP (of any kind) is that the constraints are available as predicates, and used in Prolog programs like all other predicates.
Therefore, whether you have the goal factorial(N, F) or { N > 0 } is, at least in principle, the same concept: Both mean that something holds.
Note the syntax: The CLP(ℛ) constraints have the form { C }, which is {}(C) in prefix notation.
Note that the goal factorial(N, F) is not a CLP(ℛ) constraint! Neither is the following:
?- { factorial(N, F) }.
ERROR: Unhandled exception: type_error({factorial(_3958,_3960)},...)
Thus, { factorial(N, F) } is not a CLP(ℛ) constraint either!
Your first example therefore cannot work for this reason alone. (In addition, you have a syntax error in the clause head: factorial (, so it also does not compile at all.)
When you are learning to work with a constraint solver, check out the predicates it provides. For example, CLP(ℛ) provides {}/1 and a few other predicates, and has a dedicated syntax for stating relations that hold over floating point numbers (in this case).
Other constraint solvers provide their own predicates for describing the entities of their respective domains. For example, CLP(FD) provides (#=)/2 and a few other predicates to reason about integers. dif/2 lets you reason about any Prolog term. And so on.
From the programmer's perspective, this is exactly the same as using any other predicate of your Prolog system, whether it is built-in or stems from a library. In principle, it's all the same:
A goal like list_length(Ls, L) can be read as: "The length of the list Ls is L."
A goal like { X = A + B } can be read as: The number X is equal to the sum of A and B. For example, if you are using CLP(Q), it is clear that we are talking about rational numbers in this case.
In your second example, the body of the clause is a conjunction of the form (A, B), where A is a CLP(ℛ) constraint, and B is a goal of the form factorial(PrevN, NewF).
The point is: The CLP(ℛ) constraint is also a goal! Check it out:
?- write_canonical({a,b,c}).
{','(a,','(b,c))}
true.
So, you are simply using {}/1 from library(clpr), which is one of the predicates it exports.
You are right that PrevN and NewF belong to the constraints. However, factorial(PrevN, NewF) is not part of the mini-language that CLP(ℛ) implements for reasoning over floating point numbers. Therefore, you cannot pull this goal into the CLP(ℛ)-specific part.
From a programmer's perspective, a major attraction of CLP is that it blends in completely seamlessly into "normal" logic programming, to the point that it can in fact hardly be distinguished at all from it: The constraints are simply predicates, and written down like all other goals.
Whether you label a library predicate a "constraint" or not hardly makes any difference: All predicates can be regarded as constraints, since they can only constrain answers, never relax them.
Note that both examples you post are recursive! That's perfectly OK. In fact, recursive predicates will likely be the majority of situations in which you use constraints in the future.
However, for the concrete case of factorial, your Prolog system's CLP(FD) constraints are likely a better fit, since they are completely dedicated to reasoning about integers.

R Constrained Combination generation

Say I have a set of strings:
x=c("a1","b2","c3","d4")
If I have a set of rules that must be met:
if "a1" and "b2" are together in group, then "c3" cannot be in that group.
if "d4" and "a1" are together in a group, then "b2" cannot be in that group.
I was wondering what sort of efficient algorithms are suitable for generating all combinations that meet those rules? What research, papers, or anything else discusses this type of constrained combination generation problem?
In the above problem, assume it's combn(x, 3).
I don't know anything about R, so I'll just address the theoretical aspect of this question.
First, the constraints are really boolean predicates of the form "a1 ^ b2 -> ¬c3" and so on. That means that all valid combinations can be represented by one binary decision diagram, which can be created by taking each of the constraints and ANDing them together. In theory you might make an exponentially large BDD that way (that usually doesn't happen, but depends on the structure of the problem), but that would mean that you can't really list all combinations anyway, so it's probably not too bad.
But since this is really about a family of sets, a ZDD probably works even better. The difference, roughly, between a BDD and a ZDD is that a BDD compresses nodes that have equal sub-trees (in the total tree of all possibilities), while the ZDD compresses nodes where the solid edge (ie "set this variable to 1") goes to False. Both re-use equal sub-trees and thus form a DAG.
I find ZDDs a bit easier to manipulate in code, because any time a variable can be set, it will appear in the ZDD. In contrast, in a BDD, "skipped" nodes have to be detected, including "between the last node and the leaf", so for a BDD you have to keep track of your universe. For a ZDD, most operations are independent of the universe (except complement, which is rarely needed in the family-of-sets scenario). A downside is that you have to be aware of the universe when constructing the constraints, because they have to contain "don't care" paths for all the variables not mentioned in the constraint.
You can find more information about both BDDs and ZDDs in The Art of Computer Programming volume 4A, chapter 7.1.4, there is an old version available for free here.
These methods are particularly nice for representing large numbers of such combinations, and for manipulating them before generating all the possibilities. So this will also work when there are many items and many constraints (such that the final count of combinations is not too large), (usually) without creating intermediate results of exponential size.
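That said, for an instance as small as combn(x, 3), a direct filter over all combinations is enough, and it can serve as a reference for any fancier approach. Here is a rough sketch in Python (the question uses R, but the idea carries over directly; the allowed helper is simply the AND of the two rules):
import itertools

x = ["a1", "b2", "c3", "d4"]

def allowed(group):
    g = set(group)
    if {"a1", "b2"} <= g and "c3" in g:   # rule 1
        return False
    if {"d4", "a1"} <= g and "b2" in g:   # rule 2
        return False
    return True

valid = [g for g in itertools.combinations(x, 3) if allowed(g)]
print(valid)  # [('a1', 'c3', 'd4'), ('b2', 'c3', 'd4')]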

Math question regarding Python's uuid4

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is double_str "squared as unique" as unique_str, or just somewhat more unique? Also, is there any negative implication in doing something like this (like some birthday problem situation, etc.)? This may sound ignorant, but I simply would not know, as my math spans algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate, it is far more likely to be a programming error, such as a failure to correctly initialize the random number generator, than genuine bad luck. In that case the approach you are using will not make things any better: an incorrectly initialized random number generator can still produce duplicates, even with your approach.
If you use the default implementation, random.seed(None), you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an issue you would have to solve first. Also, if the OS doesn't provide a source of randomness, the system time will be used, which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids n you need to generate before the probability of generating a duplicate exceeds some probability p. An approximate formula for this (the usual birthday-bound estimate) is:
n ≈ sqrt(2 · d · ln(1 / (1 − p)))
where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d while keeping p constant, then you also square n.
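Plugging the numbers in makes the effect visible; here is a quick check of the approximation above in Python (the function name is mine):
import math

def ids_before_collision(d, p):
    # Approximate number of random ids drawn from d possible values before
    # the probability of at least one duplicate reaches p (birthday bound).
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

p = 0.5
print(ids_before_collision(2 ** (16 * 8), p))      # one uuid4: roughly 2.2e19 ids
print(ids_before_collision(2 ** (16 * 2 * 8), p))  # two uuid4s: roughly 4.0e38, about the square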
Since uuid4 is based on a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.

Do cryptographic hash functions reach every possible value, i.e., are they surjective?

Take a commonly used binary hash function - for example, SHA-256. As the name implies, it outputs a 256 bit value.
Let A be the set of all possible 256 bit binary values. A is extremely large, but finite.
Let B be the set of all possible binary values. B is infinite.
Let C be the set of values obtained by running SHA-256 on every member of B. Obviously this can't be done in practice, but I'm guessing we can still do mathematical analysis of it.
My Question: By necessity, C ⊆ A. But does C = A?
EDIT: As was pointed out by some answers, this is wholly dependent on the hash function in question. So, if you know the answer for any particular hash function, please say so!
First, let's point out that SHA-256 does not accept all possible binary strings as input. As defined by FIPS 180-3, SHA-256 accepts as input any sequence of bits of length lower than 2^64 bits (i.e. no more than 18446744073709551615 bits). This is very common; all hash functions are somehow limited in formal input length. One reason is that the notion of security is defined with regard to computational cost; there is a threshold on the computational power that any attacker may muster. Inputs beyond a given length would require more than that maximum computational power simply to evaluate the function. In brief, cryptographers are very wary of infinities, because infinities tend to prevent security from being even defined, let alone quantified. So your input set B should be restricted to sequences of up to 2^64-1 bits.
That being said, let's see what is known about hash function surjectivity.
Hash functions try to emulate a random oracle, a conceptual object which selects outputs at random under the only constraint that it "remembers" previous inputs and outputs, and, if given an already seen input, returns the same output as previously. By definition, a random oracle can be proven surjective only by trying inputs and exhausting the output space. If the output has size n bits, then it is expected that about 2^(2n) distinct inputs will be needed to exhaust the output space of size 2^n. For n = 256, this means that hashing about 2^512 messages (e.g. all messages of 512 bits) ought to be enough (on average). SHA-256 accepts inputs very much longer than 512 bits (indeed, it accepts inputs up to 18446744073709551615 bits), so it seems highly plausible that SHA-256 is surjective.
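To get a feel for the "exhaust the output space" argument, here is a toy Python experiment (my own illustration, not from the answer): truncate SHA-256 to its first byte, so the output space has only 2^8 = 256 values, and count how many inputs it takes to hit them all. For the real 256-bit output, the same experiment is hopelessly out of reach.
import hashlib

seen = set()
count = 0
while len(seen) < 256:                      # 256 possible one-byte outputs
    digest = hashlib.sha256(str(count).encode()).digest()
    seen.add(digest[0])                     # keep only the first byte
    count += 1
print(f"all 256 outputs reached after {count} inputs")  # typically around 1500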
However, it has not been proven that SHA-256 is surjective, and that is expected. As shown above, a surjectivity proof for a random oracle requires an awful lot of computing power, substantially more than mere attacks such as preimages (2^n) and collisions (2^(n/2)). Consequently, a good hash function "should not" allow a property such as surjectivity to be actually proven. It would be very suspicious: the security of hash functions stems from the intractability of their internal structure, and such intractability should firmly resist any attempt at mathematical analysis.
As a consequence, surjectivity is not formally proven for any decent hash function, and not even for "broken" hash functions such as MD4. It is only "highly suspected" (a random oracle with inputs much longer than the output should be surjective).
Not necessarily. The pigeonhole principle states that once more hashes have been generated than the size of A, a collision is certain (probability 1), but it does not state that every single element of A has been generated.
It really depends on the hash function. If you use this valid hash function:
Int256 Hash (string input) {
return 0;
}
then it is obvious that C != A. So the "for example, SHA256" is a pretty important note to consider.
To answer your actual question: I believe so, but I'm just guessing. Wikipedia does not provide any meaningful info on this.
Not necessarily. That would depend on the hash function.
It would probably be ideal if the hash function were surjective, but there are things that are usually more important, such as a low likelihood of collisions.
It is not always the case. However, the qualities required of a hash algorithm are:
Cardinality of the output space
Repartition of hashes over the output space (every possible output value should have the same probability of being a hash)
