How is the seed chosen if not set by the user?

For the purpose of reproducibility, one has to choose a seed. In R, we can use set.seed().
My question is, when the seed is not set explicitly, how does the computer choose the seed?
Why is there no default seed?

A pseudo-random number generator (PRNG) needs a starting value, which you can set with set.seed(). If none is given, the generator typically derives one from machine state, such as the current time, CPU temperature, or something similar. If you want a more random start value, you can use physical sources such as white noise or nuclear decay, but these generally require an external source of entropy.
The documentation mentions that R uses the current time and the process ID:
Initially, there is no seed; a new one is created from the current time and the process ID when one is required. Hence different sessions will give different simulation results, by default. However, the seed might be restored from a previous session if a previously saved workspace is restored.
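As a sketch of what that description means (Python, not R's actual implementation; the exact mixing of time and PID here is an illustrative assumption):

    import os
    import time
    import random

    # Illustrative sketch (not R's actual code): derive a default seed
    # from the current time and the process ID, as R's documentation
    # describes for its own default seeding.
    def default_seed() -> int:
        # Mix wall-clock time (microsecond resolution) with the process
        # ID so two sessions started at almost the same moment differ.
        return int(time.time() * 1_000_000) ^ os.getpid()

    rng = random.Random(default_seed())
    print(rng.random())  # differs between sessions, like R without set.seed()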
A default seed would be a bad idea, since the generator would then always produce the same sequence of numbers. With a fixed seed the output is no longer random in any practical sense: you would just get the same fixed data sample every session, which is not the intended output of a PRNG. You could of course override a default seed (if there were one), but the primary purpose is to generate a different stream of data each time, not a fixed one.
For statistical work this matters for validation and verification reasons, and it becomes even more important in cryptography, where a good PRNG is mandatory.

Related

Why isn't it possible to decode a server seed using nonces

I have zero knowledge of how SHA-256 or SHA-512 work, and I am also unfamiliar with the similarities and differences between them.
I'm hoping to grasp the most basic understanding of why a server seed can't be cracked using the server seed hash, the client seed, and, most importantly, a bunch of nonced outcomes.
For example,
Let's say you are playing a provably fair game, which provides a server seed hash and a client seed.
Each game round's outcome is determined by a publicly available equation which factors in the "unhashed" / "original" (pardon my terminology) server seed, the client seed, and a "nonce".
Each new game round adds 1 to the nonce.
So after let's say 1000 rounds, the server and client seeds remain the same, and the nonce is 1000, increasing by 1 every round.
The game is considered provably fair because you are provided the server seed hash, and once you change to a new seed pair, the last secret server seed is revealed so that you can verify all the previous rounds.
So once you change to a new seed, a new server seed hash is provided, and the nonce increases by 1 every round until you want to verify again, at which point the seed pair is changed again and the previous unhashed server seed is revealed.
Sorry for the long explanation, but I thought it might help you understand what I'm trying to grasp in my question.
Q: If you have the equation which uses the seed pair to determine the round outcomes, (a static equation), and you have the client seed, and a list of previous round outcomes which are all based on an incrementing nonce,
Why then could you not "brute force" the unhashed server seed?
I know I said the server seed is hidden until after you change to a new seed pair, but okay, in the most basic way possible,
let's say the equation to determine the round outcomes is
server seed (X) * client seed (Y) * nonce (Z) = round outcome
To phrase my question another way,
Let's say you're 1000 rounds into the same seed pair, or 10,000 rounds, or whatever. Why can't you "brute force" the value of the server seed by throwing every possible seed into the equation until you come across the one that matches the outcomes for those 1,000 nonced rounds, and then use it to predetermine the outcome for round 1001 and beyond?
Hopefully you understand my long question.
Edit: you are given everything but the server seed. So after thousands of rounds, why couldn't the outcomes of those previous rounds be used to determine what the seed is?
I get that there's an incredibly large number of possibilities, but it's not an infinite amount.
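For scale, a back-of-the-envelope estimate (assuming, as is typical for such games, a 256-bit server seed): there are 2^256, roughly 1.2 x 10^77, possible seeds. Checking even 10^12 candidates per second would take on the order of 10^57 years, so "finite" is nowhere near "feasible".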
So the main thing I'm trying to understand is why SHA-256 / SHA-512 is described as "uncrackable".
I understand that something like an account password can't just be brute forced as long as the server has some kind of failed-attempt lockout, because you only get so many tries before the account is locked and no more attempts are permitted.
In that case, I can understand why it might be considered uncrackable, or at least not brute-forceable.
But if you have a list of final values which were determined using a client seed, server seed, and nonce, and you have everything but the server seed,
you could hypothetically "guess" the server seed that results in the same round outcomes, and disregard the server seed hash.
There's no account-lockout type of dilemma, so why is this hashing impossible to decode?
The way I see it, if you had enough "luck", or processing power, it would be very much possible.
Edit 2: maybe not to "decode" the hash, but rather to "determine" the seed, right? The impossibility of decoding the hash is one thing, but surely, with the other given info, the original seed isn't "undeterminable"?
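A sketch of the usual construction (hedged: the exact equation varies by site, and HMAC-SHA256 with the names below is an illustrative assumption, not any specific game's formula):

    import hashlib
    import hmac

    def round_outcome(server_seed: bytes, client_seed: str, nonce: int) -> int:
        # One common provably-fair pattern: HMAC the client seed and
        # nonce with the secret server seed, then reduce the digest to
        # a small game number.
        msg = f"{client_seed}:{nonce}".encode()
        digest = hmac.new(server_seed, msg, hashlib.sha256).digest()
        return int.from_bytes(digest[:4], "big") % 10_000

    # Everything the player sees: outcomes for nonces 0..999.
    server_seed = b"hypothetical-secret-the-site-keeps"
    outcomes = [round_outcome(server_seed, "my-client-seed", n)
                for n in range(1000)]

    # A brute force would have to try candidate seeds until one
    # reproduces all 1000 outcomes: about 2**256 candidates for a
    # 256-bit seed. The outcomes only let you confirm a guess; they
    # give no shortcut for inverting SHA-256.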

References or Standardization of "Value Updating" in Constraint Satisfaction

Constraint Satisfaction Problems (CSPs) are basically this: you have a set of constraints over variables, and domains of values for those variables. Then, given some configuration of the variables (an assignment of variables to values in their domains), you check to see whether the constraints are "satisfied". That is, you check that evaluating all of the constraints returns a Boolean "true".
What I would like to do is sort of the reverse. Instead of this Boolean "testing" if the constraints are true, I would like to instead take the constraints and enforce them on the variables. That is, set the variables to whatever values they need to be in order to satisfy the constraints. An example of this would be like in a game, you say "this box's right side is always to the left of its containing box's right side," or, box.right < container.right. Then the constraint solving engine (like Cassowary for the game example) would take the box and set its "right" property to whatever number value it resolved to. So instead of the constraint solver giving you a Boolean value "yes the variable configuration satisfies the constraints", it instead updates the variables' configuration with appropriate values, "you have updated the variables". I think Cassowary uses the Simplex Algorithm for solving its constraints.
I am a bit confused because Wikipedia says:
constraint satisfaction is the process of finding a solution to a set of constraints that impose conditions that the variables must satisfy. A solution is therefore a set of values for the variables that satisfies all constraints—that is, a point in the feasible region.
That seems different from the constraint satisfaction problem, of which it says:
An evaluation is consistent if it does not violate any of the constraints.
That's why it seems CSPs return Boolean values, while in constraint satisfaction you can set the values. The distinction isn't quite clear to me.
Anyway, I am looking for general techniques for constraint solving, in the sense of setting variables as in the simplex algorithm. However, I would like to apply it to any situation, not just linear programming. Some standard and simple example constraints are:
All variables are different.
box.right < container.right
The sum of all variables < 10
Variable a goes before variable b in evaluation.
etc.
For the first case, testing whether the constraint is satisfied (Boolean true) is pretty easy: iterate through the pairs of variables, and if any pair is equal, return false; otherwise return true after processing all pairs.
However, doing the equivalent of setting the variables doesn't seem possible at first glance: iterate through the pairs of variables, and if two are equal, perhaps you change one of them. You might have to do some fixed-point iteration, processing some pairs more than once. And the way I just chose the new value seems arbitrary. Maybe instead you need some further (nested) constraints defining how to set the values (e.g. "if a equals b, increase a"). The possibilities are customizable.
In addition, even simpler cases like box.right < container.right are complicated. You could say at first that if box.right >= container.right, then set box.right = container.right. But maybe you don't actually want that; instead you want some iPhone-like physics "bounce" where it overextends and then bounces back with momentum. So again, the possibilities are large, and you should probably have additional constraints.
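To make the contrast concrete, a tiny sketch (my own illustrative Python, using the box.right < container.right example; the clamping policy is one arbitrary choice):

    def satisfied(box_right: float, container_right: float) -> bool:
        # CSP-style test: only report whether the constraint holds.
        return box_right < container_right

    def enforce(box_right: float, container_right: float) -> float:
        # Enforcement style: move the variable so the constraint holds.
        # Clamping just inside the bound is one arbitrary policy; the
        # physics "bounce" described above would be another.
        return min(box_right, container_right - 1)

    print(satisfied(120, 100))   # False: the test merely reports
    print(enforce(120, 100))     # 99: the solver updates the variable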
So my question is: similar to how testing the constraints (for a Boolean value) is standardized as the CSP, are there any references or standardizations for setting the values used by the constraints?
The only thing I have seen so far is that Cassowary simplex algorithm example which works well for an array of linear inequalities on real-numbered variables. I would like to see something that can handle the "All variables are different" case, and the other cases listed, as well as the standard CSP example problems like for scheduling, box packing, etc. I am not sure why I haven't encountered more on setting/updating constraint variables instead of the Boolean "yes constraints are satisfied" problem.
The only limits I have are that the constraints work on finite domains.
If it turns out there is no standardization at all and that every different constraint listed requires its own entire field of research, that would be good to know. Then I at least know what the situation is and why I haven't really seen much about it.
CSP is a research field with many publications each year. I suggest reading one of the books on the subject, like Rina Dechter's.
For standardized CSP languages, check MiniZinc on one hand, and XCSP3 on the other.
There are two main approaches to CSP solving: systematic and stochastic (also known as local search). I have worked on three different CSP solvers, one of them stochastic, but I understand systematic solvers better.
There are many different approaches to systematic solvers. It is possible to fill a whole book covering all of them, so I will explain only the two approaches I believe in the most:
(G)AC3, which propagates constraints until all global constraints (hyper-arcs) are consistent.
Reducing the problem to SAT and letting the SAT solver do the hard work. There is a great algorithm that creates the CNF lazily, on demand, while the solver is already working. In a sense, this is a hybrid SAT/CSP algorithm.
To get the AC3 approach going you need to maintain a domain for each variable. A domain is basically a set of possible assignments.
For example, consider the domains of a and b: D(a)={1,2}, D(b)={0,1} and the constraint a <= b. The algorithm checks one constraint at a time, and when it reaches a <= b, it sees that a=2 is impossible, and also b=0 is impossible, so it removes them from the domains. The new domains are D'(a)={1}, D'(b)={1}.
This process is called domain propagation. Using a queue of "dirty" constraints, or "dirty" variables, the solver knows which constraint to propagate next. When the queue is empty, all constraints (hyper-arcs) are consistent (this is where the name AC3 comes from).
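To make the propagation loop concrete, here is a minimal sketch (my own illustrative Python, not any real solver's code), run on the a <= b example above:

    from collections import deque

    # domains: var -> set of ints; constraints: (x, y) -> predicate(vx, vy).
    def ac3(domains, constraints):
        queue = deque(constraints.keys())       # all arcs start "dirty"
        while queue:
            x, y = queue.popleft()
            pred = constraints[(x, y)]
            # Remove values of x that have no support in D(y).
            pruned = {vx for vx in domains[x]
                      if not any(pred(vx, vy) for vy in domains[y])}
            if pruned:
                domains[x] -= pruned
                if not domains[x]:
                    return None                 # conflict: empty domain
                # Arcs pointing at x may now prune further; re-enqueue.
                queue.extend(arc for arc in constraints if arc[1] == x)
        return domains

    doms = {"a": {1, 2}, "b": {0, 1}}
    cons = {("a", "b"): lambda va, vb: va <= vb,
            ("b", "a"): lambda vb, va: va <= vb}
    print(ac3(doms, cons))   # {'a': {1}, 'b': {1}}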
When all arcs are consistent, the solver picks a free variable (one with more than one value in its domain) and restricts it to a single value. In SAT, this is called a decision. It adds the variable to the queue and propagates the constraints. If it reaches a conflict (a constraint that can't be satisfied), it goes back and undoes an earlier decision.
There are a lot of things going on here:
First, how the domains are represented. Some solvers only hold a pair of bounds for each domain; others have a set of integers. My solver holds an interval set, or a bit vector.
Then, how does the solver know which constraint to propagate? Some solvers, such as SAT solvers, Minion, and HaifaCSP, use watches to avoid propagating irrelevant constraints. This has a significant performance impact on clauses.
Then there is the issue of making decisions. Usually it is good to choose a variable that has a small domain and high connectivity. There are many papers comparing different strategies. I prefer a dynamic strategy that resembles the VSIDS heuristic of SAT solvers. This strategy is auto-tuned according to conflicts.
Deciding on the value is also important. Many solvers simply take the smallest value in the domain. Sometimes this can be suboptimal, for instance if there is a constraint that limits a sum from below. Another option is to choose randomly between the max and min values. I tune it further and use the last assigned value.
After everything, there is the matter of backtracking. This is a whole can of worms. The problem with simple backtracking is that sometimes the cause of a conflict lies at the first decision, but it is detected only at the 100th. The best thing is to analyze the conflict and work out where its cause lies. SAT solvers have been doing this for decades. But CSP representations are not as trivial as CNF, so not many solvers can do it efficiently enough.
This is a nontrivial subject that can fill at least two university courses. Just the subject of conflict analysis can take half of a course.

Failsafe simulations

I need to simulate failsafes or mechanical failures as described in
Diagnosing problems.
Is this possible using SITL/DroneKit?
On the ArduPilot pages for SITL, you will find various parameters that can be set to control the virtual environment for a simulated drone. Here are the examples: http://ardupilot.org/dev/docs/using-sitl-for-ardupilot-testing.html
The command param show sim* will show all parameters whose names begin with "sim"; these are the parameters SITL uses to simulate various conditions. Apart from these, it might also be possible to directly alter vehicle parameters to simulate a deviation between the vehicle's state and the desired outcome (like pitch/roll suddenly changing to simulate a rotor failure).
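For example, a hedged DroneKit sketch (the connection string and the SIM_GPS_DISABLE/SIM_RC_FAIL parameter names are from memory and may differ per firmware version; confirm them with param show sim* first):

    from dronekit import connect

    # Connect to a running SITL instance (address is an assumption).
    vehicle = connect("udp:127.0.0.1:14550", wait_ready=True)

    # Flip simulated failure conditions on, then observe the failsafe.
    vehicle.parameters["SIM_GPS_DISABLE"] = 1   # simulate total GPS loss
    vehicle.parameters["SIM_RC_FAIL"] = 1       # simulate RC link failure

    # ... watch the vehicle's failsafe behaviour, then restore:
    vehicle.parameters["SIM_GPS_DISABLE"] = 0
    vehicle.parameters["SIM_RC_FAIL"] = 0
    vehicle.close()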

Load factor of hash tables with tombstones

So the question came up about whether tombstones should be included when calculating the load factor of a hash table.
I thought that, given that the load factor is used to determine when to expand capacity, tombstones should not be included. An obvious example: if you almost fill a hash table and then remove every value, insertions are super easy (no collisions), so I believe the load factor shouldn't include them.
But you could also look at this and think that, with all the tombstones, lookups will be slow (potentially searching almost the entire table).
So I thought I'd ask the question. Should the load factor of a hashtable include tombstones in the calculation?
Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for the dynamic system (a growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this way is oversimplified, and the dynamic system behaves suboptimally. Its advantages:
Well, simplicity of understanding and implementation.
The hash table data structure then doesn't need to store many threshold numbers, likely only one. This is meaningful when the hash table is very small and the size of the header affects the data structure's total memory efficiency (in bytes per entry).
In a certain (and common) case, the append/update-only hash table, more complex behaviour models degenerate to the "just load factor" model; in other words, the load factor model defines relatively optimal behaviour.
See also my answer on the load factor model. I prefer the [min load, target load, max load] + growth factor frame model.
If you are developing a general-purpose hash table with tombstones, I think you can just pick up my results (below). I spent maybe several weeks solely on developing this model. Maybe you can make some improvements or do further research; I would be glad.
Two main hash table dynamic behaviour patterns are targeted:
growing hash table (maybe in a growing phase), with little or no removals
  e.g. the initial fill of a hash table, when the proper capacity was not specified (or is unknown)
hash table that remains the same or nearly the same size, where the number of removals is equal or nearly equal to the number of insertions
  e.g. caches with an upper size bound, LRUs, tables with entry expiry
Two thresholds are defined:
max size (i.e. the number of alive entries): table size * max load
min number of free slots (i.e. slots that are empty, holding neither an alive entry nor a tombstone), computed by a magic formula.
If the hash table's size exceeds max size, we assume we are in the "growing pattern" and rehash so the table can hold current size * growth factor entries, i.e. choose the table size closest to current size * growth factor / target load.
If the number of free slots falls below the min number of free slots, we are in the "cache pattern" and rehash "to the current size", i.e. to the table size closest to current size / target load.
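A hedged sketch of this two-threshold policy (the constants and the stand-in for the "magic formula" are illustrative assumptions, not the actual tuned values):

    GROWTH_FACTOR = 2.0
    TARGET_LOAD = 0.5
    MAX_LOAD = 0.66

    def maybe_rehash(table_size: int, alive: int, free_slots: int):
        # Stand-in for the "magic formula" mentioned above.
        min_free = max(1, int(table_size * 0.1))
        if alive > table_size * MAX_LOAD:
            # "Growing pattern": size the new table for future growth.
            return round(alive * GROWTH_FACTOR / TARGET_LOAD)
        if free_slots < min_free:
            # "Cache pattern": same logical size, purging tombstones.
            return round(alive / TARGET_LOAD)
        return None   # no rehash needed

    print(maybe_rehash(1024, 700, 200))  # growing pattern: new size 2800
    print(maybe_rehash(1024, 400, 50))   # cache pattern: new size 800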
Read the source, where all the above logic is coded.
Also, the article Tombstones purge from hashtable: theory and practice sheds some light.
If you are developing a special-purpose hash table whose dynamic properties are known (or could be studied), I recommend developing your own model fitting your case. Don't rely on pure math and CS theory; evaluate your model in benchmarks.

Unique random number sequence using qrand() and qsrand()

I want to generate a unique random number sequence in Qt. Using QDateTime::currentDateTime().toTime_t() as the seed value, will qrand() generate unique random numbers?
No. qrand can only generate as many unique numbers as fit into an integer, so -- whatever the implementation -- you cannot count on uniqueness.
Also, if a different seed were guaranteed to produce a different random integer, that would yield a level of predictability that effectively makes qrand not random anymore.
Edit: I swear I'm not trying to make fun of you by posting a cartoon; I think it is quite a good explanation of the problem:
(source: dilbert.com)
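A quick sketch of why uniqueness fails even with a good seed (Python rather than Qt, purely illustrative): when drawing 32-bit values, the birthday bound puts the expected first duplicate near sqrt(2**32), i.e. tens of thousands of draws.

    import random

    rng = random.Random()            # seeded from OS entropy here
    seen = set()
    for i in range(1, 500_000):
        value = rng.getrandbits(32)  # 32-bit values, like qrand's range
        if value in seen:
            print(f"first duplicate after {i} draws")
            break
        seen.add(value)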
Depending on how you store your session IDs, you can generate a (mostly) guaranteed-unique identifier by using a UUID. See the documentation for QUuid. Also be aware of this (bold added):
You can also use createUuid(). UUIDs generated by createUuid() are of the random type. Their QUuid::Version bits are set to QUuid::Random, and their QUuid::Variant bits are set to QUuid::DCE. The rest of the UUID is composed of random numbers. Theoretically, this means there is a small chance that a UUID generated by createUuid() will not be unique. But it is a very small chance.
I can vouch for the fact that those generated UUIDs won't necessarily be unique, so if you do need them to be unique, look into libuuid or something similar.
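For comparison, the same idea in Python (uuid.uuid4() is the random-UUID analogue of QUuid::createUuid()):

    import uuid

    # A version-4 (random) UUID. Collisions are astronomically
    # unlikely, though, as noted above, not strictly impossible.
    session_id = uuid.uuid4()
    print(session_id)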
According to the Qt documentation, qrand is just a thread-safe version of the standard rand(); I wouldn't assume the method used is any more secure or superior to rand() based on that description.
I think you need to use different terminology than 'unique' random numbers (no pseudo-random number generator will produce a unique stream, as input X will always produce output Y). What's the actual situation?
