Are there functions which produce "infinite" amounts of high entropy data? Moreover, do functions exist which produce the same random data (sequentially) time after time?
I kind of know that they exist, but do they have a specific name?
Use case examples:
Using the function to generate 100 bits of random data, while maintaining high entropy. (Great!)
Using the same function to generate 10,000 bits of random data, where the first 100 bits generated are the same as the 100 bits generated before, while still maintaining high entropy.
Further, how would I go about building these functions myself?
You are most likely looking for Pseudo-Random Number Generators.
They are initialized by a seed, thus taking in a finite amount of entropy.
Good generators have decent entropy coming out, provided you judge it only from the output (that is, you ignore the seed and the algorithm used to generate the numbers; otherwise the entropy is obviously 0).
Most PRNG algorithms produce sequences which are uniformly distributed by any of several tests. It is an open question, and one central to the theory and practice of cryptography, whether there is any way to distinguish the output of a high-quality PRNG from a truly random sequence without knowing the algorithm(s) used and the state with which it was initialized.
All PRNGs have a period, after which a generated sequence will restart.
The period of a PRNG is defined thus: the maximum, over all starting states, of the length of the repetition-free prefix of the sequence. The period is bounded by the number of possible states, which is determined by the size of the state, usually measured in bits. However, since the number of states (and hence the maximum period) potentially doubles with each bit of state added, it is easy to build PRNGs with periods long enough for many practical applications.
Thus, to have two sequences of different lengths where one is the prefix of the other, you just have to run a PRNG with the same seed both times.
Building a good one yourself would be pretty tricky, but a rather good and relatively simple one is the Mersenne Twister, which dates back only to 1998 and is defined in a paper by Matsumoto and Nishimura [1].
A trivial example would be a linear congruential generator.
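As a concrete illustration, here is a minimal linear congruential generator sketched in R (the constants are the widely used "Numerical Recipes" parameters; this is only meant to show the repeatable-prefix property, not a generator to rely on for anything serious):

lcg <- function(n, seed) {
  a <- 1664525; inc <- 1013904223; m <- 2^32   # classic LCG parameters
  state <- seed
  out <- numeric(n)
  for (i in seq_len(n)) {
    state <- (a * state + inc) %% m            # next internal state
    out[i] <- state / m                        # scale to [0, 1)
  }
  out
}

lcg(5, seed = 42)         # the same five numbers on every run
lcg(10, seed = 42)[1:5]   # identical to the run above: the shorter sequence is a prefix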
[1] Matsumoto, M.; Nishimura, T. (1998). "Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator". ACM Transactions on Modeling and Computer Simulation 8 (1): 3–30. doi:10.1145/272991.272995.
I am using genetic matching in R using GenMatch in order to find comparable treatment and control groups to estimate a treatment effect. The default code for matching looks as follows:
GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
pop.size = 100, max.generations=100,...)
The description for the pop.size argument in the package is:
Population Size. This is the number of individuals genoud uses to
solve the optimization problem. The theorems proving that genetic
algorithms find good solutions are asymptotic in population size.
Therefore, it is important that this value not be small. See genoud
for more details.
Looking at genoud, the additional description is:
...There are several restrictions on what the value of this number can
be. No matter what population size the user requests, the number is
automatically adjusted to make certain that the relevant restrictions
are satisfied. These restrictions originate in what is required by
several of the operators. In particular, operators 6 (Simple
Crossover) and 8 (Heuristic Crossover) require an even number of
individuals to work on—i.e., they require two parents. Therefore, the
pop.size variable and the operators sets must be such that these three
operators have an even number of individuals to work with. If this
does not occur, the population size is automatically increased until
this constraint is satisfied.
I want to know how genoud (resp. GenMatch) incorporates the population size argument. Does the algorithm randomly select n individuals from the population for the optimization?
I had a look at the package description and the source code, but did not find a clear answer.
The word "individuals" here does not refer to individuals in the sample (i.e., individual units your dataset), but rather to virtual individuals that the genetic algorithm uses. These individuals are individual draws of a set of the variables to be optimized. They are unrelated to your sample.
The goal of genetic matching is to choose a set of scaling factors (which the Matching documentation calls weights), one for each covariate, that weight the importance of that covariate in a scaled Euclidean distance match. I'm no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors, keeps the ones that "do the best" in the sense of optimizing the criterion (which is determined by fit.func in GenMatch()), and creates new guesses as slight perturbations of the kept guesses. It then repeats this process many times, simulating what natural selection does to optimize traits in living things. Each guess is what the word "individual" refers to in the description for pop.size, which corresponds to the number of guesses at each generation of the algorithm.
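To make the idea concrete, here is a deliberately tiny, purely illustrative R sketch of that guess/keep/perturb loop. It is not the Matching/rgenoud code; fitness() is a hypothetical stand-in for the balance criterion, and the "optimum" is made up:

fitness <- function(w) -sum((w - c(2, 0.5, 1))^2)   # hypothetical criterion; pretend the optimum is (2, 0.5, 1)

pop.size    <- 100                                   # number of "individuals" (guesses) per generation
generations <- 50
pop <- matrix(runif(pop.size * 3, 0, 5), nrow = pop.size)   # initial random guesses at 3 scaling factors

for (g in seq_len(generations)) {
  scores <- apply(pop, 1, fitness)
  keep   <- pop[order(scores, decreasing = TRUE)[1:(pop.size / 2)], , drop = FALSE]  # keep the best half
  kids   <- keep + matrix(rnorm(length(keep), sd = 0.1), nrow = nrow(keep))          # perturb the keepers
  pop    <- rbind(keep, kids)
}
pop[which.max(apply(pop, 1, fitness)), ]   # best set of scaling factors found

Each row of pop is one "individual" in the sense used by pop.size; none of this touches the units in your dataset.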
GenMatch() always uses your entire sample (unless you have provided a restriction like a caliper, exact matching requirement, or common support rule); it does not sample units from your sample to form each guess (which is what bagging is in other machine learning contexts).
Results will change over many runs because the genetic algorithm itself is a stochastic process. It may converge to a solution asymptotically, but because it is optimizing over a lumpy surface, it will find different solutions each time in finite samples with finite generations and a finite population size (i.e., pop.size).
I am currently working with a small dataset of training values, no more than 20, and am getting a large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables vary a lot from exemplar to exemplar. That is to say, two exemplars might appear to be the same except for two parameters in which they differ, one being a binary variable and the other a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) is a continuous variable, OR, depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead, with 12 classes.
I have long been researching neural networks other than my current feed-forward MLP implementation, and have read about Stochastic NNs, Ladder NNs, and many forms of recurrent NNs. I am stuck on which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup is rendered impracticable by the sheer difficulty of predicting correct outputs from such a small set of highly variable training values, which network would work best if my dataset were expanded to the order of thousands of exemplars (at the cost of having a significantly more redundant, seemingly homogeneous set of input values)?
Any help is most certainly appreciated.
20 samples is very small, especially if you have 16 input variables. It will be hard to determine which of those inputs is responsible for your output value. If you keep your network simple (fewer layers), you may be able to get by with roughly as many samples as you would need for traditional regression.
I would like to compare the "death penalty method" with other penalty methods proposed in the Genetic Algorithms' literature.
I'm using the R software, so I need to write the code for these penalty methods. I'm having a lot of difficulty because I have not understood one thing about the death penalty function: how do I handle infeasible offspring, given that the population size is usually fixed in genetic algorithms?
I mean, I understand that, in order to use the death penalty appropriately, I have to initialize the genetic algorithm with all feasible solutions. But even if I have all feasible solutions in the first population (t=0), I could have infeasible solutions in the next generation, since crossover and mutation are "blind" operators.
So, since the death penalty rejects all infeasible solutions, what happens then?
Will the next generation have a smaller population (original size minus the number of infeasible solutions)? Or do I have to select more parents for the mating pool until the next generation contains the original number of feasible offspring? Or do I have to reapply the genetic operators until all individuals at t+1 are feasible?
I do not know R, but the theory of the death penalty implies that you should generate more offspring.
I would do something like the following pseudo-code (translate to R):
n = <desired population size>
while (n > 0) {
    generate n offspring
    eliminate the infeasible ones
    add the feasible ones to the new generation
    n = <desired population size> - <current size of the new generation>
}
The only problem with this loop is the risk that it may go on forever (if we never generate feasible solutions). Even though that risk is usually quite small, if you want to protect yourself from it, you can limit the number of iterations allowed in the while loop with a simple counter.
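A possible R translation of that loop, with the iteration cap included; generate_offspring() and is_feasible() are hypothetical placeholders for your own crossover/mutation operators and your constraint check:

pop_size  <- 100
max_tries <- 1000
new_gen   <- list()
n         <- pop_size
tries     <- 0
while (n > 0 && tries < max_tries) {
  offspring <- generate_offspring(n)            # hypothetical: produce n candidate children
  feasible  <- Filter(is_feasible, offspring)   # death penalty: drop the infeasible ones
  new_gen   <- c(new_gen, feasible)             # keep only feasible children
  n         <- pop_size - length(new_gen)       # how many are still missing
  tries     <- tries + 1                        # guard against looping forever
}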
There is a pretty interesting article on this by Michalewicz. Have a look.
How can I find the total number of nodes in a distributed hash table (DHT) in an efficient way?
You generally do that by estimating from a small sample of the network, since enumerating all nodes of a large network is prohibitively expensive for most use-cases, and would still be inaccurate due to NAT anyway. So you have to consider that you are only sampling the reachable nodes.
Assuming that nodes are randomly distributed throughout the keyspace and you have some sort of distance metric in your DHT (e.g. the XOR metric in Kademlia's case), you can take the median of the distances between neighboring nodes in a sample and then estimate the node count as the keyspace size divided by that typical inter-node distance.
If you use the median you may have to compensate by some factor due to the skewness of the distribution. But my statistics are rusty; maybe someone else can chime in on that.
The result will be very noisy, so you'll want to keep enough samples around for averaging, especially given the skewed distribution and the fact that everything happens at an exponential scale (twiddle one bit to the left and the population estimate suddenly doubles or halves).
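Purely as an illustration of that idea, here is a toy R simulation on a linear keyspace (not the XOR metric, and with made-up sizes): drop a known number of node IDs, pretend a lookup returns the k nodes closest to a random target, and estimate the network size from the gaps between them.

set.seed(1)
keyspace <- 2^32          # toy keyspace size
true_n   <- 50000         # true network size we pretend not to know
ids      <- sort(runif(true_n, 0, keyspace))   # stand-in for node IDs

estimate_once <- function(k = 16) {
  target  <- runif(1, 0, keyspace)
  closest <- sort(ids[order(abs(ids - target))[1:k]])  # what a lookup near target would return
  gaps    <- diff(closest)                             # distances between neighboring nodes
  keyspace / median(gaps)                              # keyspace / typical spacing
}

estimates <- replicate(200, estimate_once())
median(estimates)            # biased upward: gap lengths are skewed (roughly exponential)
log(2) * median(estimates)   # multiplying by log(2) roughly compensates; compare with true_n
true_n

The log(2) factor is exactly the kind of skewness compensation mentioned above, and the spread of the 200 estimates shows why averaging over many samples is needed.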
I would also suggest to only base estimates on outgoing queries that you control, not on incoming traffic, as incoming traffic may be biased by some implementation details.
Another, crude way to get rough estimates is simply extrapolating from your routing table structure, assuming it scales with the network size.
Depending on your statistics prowess, you might want to do some of the following: read scientific papers describing the network, steal code from existing implementations that already do estimation, or run simulations over broad ranges of population sizes - simply fitting a few million random node addresses into RAM and doing some calculations on them shouldn't be too difficult.
Maybe also talk to developers of existing implementations.
I was glancing through the contents of Concrete Mathematics online. I had at least heard of most of the functions and tricks mentioned, but there is a whole section on Special Numbers. These numbers include Stirling Numbers, Eulerian Numbers, Harmonic Numbers, and so on. Now I have never encountered any of these weird numbers. How do they aid in computational problems? Where are they generally used?
Harmonic Numbers appear almost everywhere! Musical Harmonies, analysis of Quicksort...
Stirling Numbers (first and second kind) arise in a variety of combinatorics and partitioning problems.
Eulerian Numbers also occur in several places, most notably in permutations and in the coefficients of polylogarithm functions.
A lot of the numbers you mentioned are used in the analysis of algorithms. You may not have these numbers in your code, but you'll need them if you want to estimate how long it will take for your code to run. You might see them in your code too. Some of these numbers are related to combinatorics, counting how many ways something can happen.
Sometimes it's not enough to know how many possibilities there are because you need to enumerate over the possibilities. Volume 4 of Knuth's TAOCP, in progress, gives the algorithms you need.
Here's an example of using Fibonacci numbers as part of a numerical integration problem.
Harmonic numbers are a discrete analog of logarithms and so they come up in difference equations just like logs come up in differential equations. Here's an example of physical applications of harmonic means, related to harmonic numbers. See the book Gamma for many examples of harmonic numbers in action, especially the chapter "It's a harmonic world."
These special numbers can help out in computational problems in many ways. For example:
You want to find out when your program to compute the GCD of 2 numbers is going to take the longest amount of time: Try 2 consecutive Fibonacci Numbers (see the sketch after this list).
You want to have a rough estimate of the factorial of a large number, but your factorial program is taking too long: Use Stirling's Approximation.
You're testing for prime numbers, but for some numbers you always get the wrong answer: It could be you're using Fermat's primality test, in which case the Carmichael numbers are your culprits.
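As a quick, purely illustrative R sketch of the first two items above (not production code): Euclid's algorithm takes the most steps, for numbers of a given size, on consecutive Fibonacci numbers, and Stirling's approximation gives a cheap estimate of large factorials.

gcd_steps <- function(a, b) {
  steps <- 0
  while (b != 0) {
    t <- b; b <- a %% b; a <- t   # one division step of Euclid's algorithm
    steps <- steps + 1
  }
  steps
}
gcd_steps(89, 55)    # consecutive Fibonacci numbers: worst case, 9 steps
gcd_steps(90, 55)    # a similar-sized non-Fibonacci pair finishes in fewer steps

stirling <- function(n) sqrt(2 * pi * n) * (n / exp(1))^n   # n! ~ sqrt(2*pi*n) * (n/e)^n
stirling(20); factorial(20)                                 # close agreement already at n = 20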
The most common general case I can think of is in looping. Most of the time you specify a loop using a (start;stop;step) type of syntax, in which case it may be possible to reduce the execution time by using properties of the numbers involved.
For example, summing up all the numbers from 1 to n when n is large in a loop is definitely slower than using the identity sum = n*(n + 1)/2.
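A tiny sketch of that point (timings will vary by machine, but the gap is obvious):

n <- 1e7
system.time({ total <- 0; for (i in 1:n) total <- total + i })  # explicit loop: noticeably slow
system.time(total2 <- n * (n + 1) / 2)                          # closed form: effectively free
total == total2                                                  # TRUE; both are exact for this n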
There are a large number of examples like these. Many of them are in cryptography, where the security of information systems sometimes depends on tricks like these. They can also help you with performance and memory issues, because when you know the formula, you may find a faster or more efficient way to compute the things you actually care about.
For more information, check out Wikipedia, or simply try Project Euler. You'll start finding patterns pretty fast.
Most of these numbers count certain kinds of discrete structures (for instance, Stirling Numbers count Subsets and Cycles). Such structures, and hence these sequences, implicitly arise in the analysis of algorithms.
There is an extensive list at OEIS that lists almost all sequences that appear in Concrete Math. A short summary from that list:
Golomb's Sequence
Binomial Coefficients
Rencontres Numbers
Stirling Numbers
Eulerian Numbers
Hyperfactorials
Genocchi Numbers
You can browse the OEIS pages for the respective sequences to get detailed information about the "properties" of these sequences (though not exactly applications, if that's what you're most interested in).
Also, if you want to see real-life uses of these sequences in analysis of algorithms, flip through the index of Knuth's Art of Computer Programming, and you'll find many references to "applications" of these sequences. John D. Cook already mentioned applications of Fibonacci & Harmonic numbers; here are some more examples:
Stirling Cycle Numbers arise in the analysis of the standard algorithm that finds the maximum element of an array (TAOCP Sec. 1.2.10): How many times must the current maximum value be updated when finding the maximum value? It turns out that the probability that the maximum will need to be updated k times when finding a maximum in an array of n elements is p[n][k] = StirlingCycle[n, k+1]/n!. From this, we can derive that on the average, approximately Log(n) updates will be necessary.
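A quick simulation sketch of that claim in R; the exact mean number of updates is the harmonic number H(n) minus 1, which is approximately log(n):

count_updates <- function(x) {
  cur <- x[1]; updates <- 0
  for (v in x[-1]) if (v > cur) { cur <- v; updates <- updates + 1 }   # update the running maximum
  updates
}
n <- 100
mean(replicate(10000, count_updates(sample(n))))  # simulated average over random permutations
sum(1 / (2:n))                                    # H(n) - 1, the exact mean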
Genocchi Numbers arise in connection with counting the number of BDDs that are "thin" (TAOCP 7.1.4 Exercise 174).
Not necessarily a magic number from the reference you mentioned, but nonetheless --
0x5f3759df
-- the notorious magic number used to calculate the inverse square root of a number by giving a good first estimate to Newton's method for approximating roots, often attributed to the work of John Carmack - more info here.
Not programming related, huh? :)
Is this directly programming related? Surely related, but I don't know how closely.
Special numbers, such as e, pi, etc., come up all over the place. I don't think that anyone would argue about these two. The golden ratio also appears with amazing frequency, in everything from art to other special numbers themselves (look at the ratio between successive Fibonacci numbers).
Various sequences and families of numbers also appear in many places in mathematics and therefore in programming too. A beautiful place to look is the On-Line Encyclopedia of Integer Sequences.
I'll suggest this is an experience thing. For example, when I took linear algebra, many, many years ago, I learned about the eigenvalues and eigenvectors of a matrix. I'll admit that I did not at all appreciate their significance until I saw them in use in a variety of places: in statistics, where they tell you about the uncertainty of an estimate from a covariance matrix, the size and shape of a confidence ellipse, principal component analysis, and the long-term state of a Markov process; in numerical methods, where they tell you about the convergence of a method, be it in optimization or an ODE solver; and in mechanical engineering, where you see them as principal stresses and strains.