Has your pseudo-random number generator (PRNG) ever not been random enough? - math

Have you ever written simulations or randomized algorithms where you've run into trouble because of the quality of the (pseudo)-random numbers you used?
What was happening?
How did you detect / realize your prng was the problem?
Was switching PRNGs enough to fix the problem, or did you have to switch to a source of true randomness?
I'm trying to figure out what types of applications require one to worry about the quality of their source of randomness and how one realizes when this becomes a problem.

The dated random number generator RANDU was infamous in the seventies for producing "bad" random numbers. My PhD supervisor mentioned that it affected his PhD and he had to rerun simulations. A search on Google for RANDU linear congrunetial generator brings up other examples.
When I run simulations on multiple machines, I've sometimes been tempted to generate "random seeds", rather than just use a proper parallel random number generator. For example, generate the seed using the current time in seconds. This has caused me enough problems that I avoid this at all costs.
This is mainly due to my particular interests, but other than parallel computing, the thought of creating my own random number generator would never cross my mind. Calling a well tested random number function is trivial in most languages.

It is a good practice to run your prng against DieHard. Very good and fast PRNG exist nowadays (see the work of Marsaglia), see Numerical Recipes edition 3 for a good introduction.

Related

Time complexity of nlm-package in R?

I'm estimating a Non-Linear system (via seemingly unrelated regressions - SUR), using systemfit (nlsystemfit() function) package with 4 equations, 32 parameters to estimate (!) and 412 observations. But my code is taking forever (my laptop it's not a super-powerful one tho). So far, the process was on a 13 hours run. I'm not an expert in computational stuff, but someone explained me some time ago the concept of Time Complexity of the algorithms (or big-o), then depending on this concept the time to compute a certain algorithm could rely on specific functional relation on the number of observations and/or coefficients.
Hence, I'm thinking of just stopping my process, and trying to simplify the model (temporarily) and trying to run something simpler, only to check-up if the estimated parameters had sens so far. And then, run a whole model.
But all this has a sense if I can change key elements in my model, which can reduce the time of processing significantly. That's why I was looking on google about the time complexity of nlm-package (nlsystemfit() function relies on nlm) but unsuccessfully. So, this is my question: Anybody knows where I can find that info, or at least give me advice on how test non-linear systems before run a whole model?
Since you didn't provide any substantial information regarding your model or some code for the same, its hard to express a betterment for your situation.
From what you said:
Hence, I'm thinking of just stopping my process, and trying to simplify the model (temporarily) and trying to run something simpler, only to check-up if the estimated parameters had sens so far. And then, run a whole model.
It seems you require benchmarking or to obtain the measured time taken to execute, as in your case. (although it can deal with memory usage or some other performance metric as well)
There are quite a few ways to benchmark code in R, which include the use of Sys.time() or system.time() just before and right after your algorithm/function executes, or libraries such as rbenchmark (which is a simple wrapper around the system.time function), tictoc, bench and microbenchmark.
Among these the last two are preferable options, as bench::mark includes system_time(), a higher precision alternative to system.time() and microbenchmark is known to be a reliable source to accurately measure and compare the execution time of R expressions/algorithms.

How correlated are i.d.d. normal numbers in julia

I noticed while doing numerical simulations a pattern in my data when I use normal numbers in julia.
I have an ensemble of random matrices. In order to do my calculations reproducible, I set the srand function per-realization. That is, each time I use the function randn(n,n) I initialize it with srand(j), where j is the number of the realization.
I would like to know how the normal numbers are generated, and if it has meaning that doing what I do, I introduce accidental correlations.
Ideally, not at all. If you have any counterexamples, please file them as bugs on the Julia issue tracker. Julia uses the state-of-the-art Mersenne Twister library, dSFMT. This library is very fast and has been considered to use best practices for pseudo-random number generation. It has, however, recently come to my attention that there may be subtle statistical issues with PRNGs like MT in general – in particular with using small, consecutive seed values. To mitigate this if you're really worried about potential correlations, you could do something like this:
julia> using SHA
julia> srand(reinterpret(UInt32,sha256(string(1))))
MersenneTwister(UInt32[0x73b2866b,0xe1fc34ff,0x4e806b9d,0x573f5aff,0xeaa4ad47,0x491d2fa2,0xdd521ec0,0x4b5b87b7],Base.dSFMT.DSFMT_state(Int32[660235548,1072895699,-1083634456,1073365654,-576407846,1073066249,1877594582,1072764549,-1511149919,1073191776 … -710638738,1073480641,-1040936331,1072742443,103117571,389938639,-499807753,414063872,382,0]),[1.5382,1.36616,1.06752,1.17428,1.93809,1.63529,1.74182,1.30015,1.54163,1.05408 … 1.67649,1.66725,1.62193,1.26964,1.37521,1.42057,1.79071,1.17269,1.37336,1.99576],382)
julia> srand(reinterpret(UInt32,sha256(string(2))))
MersenneTwister(UInt32[0x3a5e73d4,0xee165e26,0x71593fe0,0x035d9b8b,0xd8079c01,0x901fc5b6,0x6e663ada,0x35ab13ec],Base.dSFMT.DSFMT_state(Int32[-1908998566,1072999344,-843508968,1073279250,-1560550261,1073676797,1247353488,1073400397,1888738837,1073180516 … -450365168,1073182597,1421589101,1073360711,670806122,388309585,890220451,386049800,382,0]),[1.5382,1.36616,1.06752,1.17428,1.93809,1.63529,1.74182,1.30015,1.54163,1.05408 … 1.67649,1.66725,1.62193,1.26964,1.37521,1.42057,1.79071,1.17269,1.37336,1.99576],382)
In other words, hash a string representation of a small integer seed value using a strong cryptographic hash like SHA2-256, and use the resulting hash data to seed the Mersenne Twister state. Ottoboni, Rivest & Stark suggest using a strong cryptographic hash for each random number generation, but that's going to be a massive slowdown (on current hardware) and is probably overkill unless you have an application that is really very sensitive to imperfect statistical randomness.
I should perhaps point out that Julia's behavior here is not worse than other languages, some of which use far worse random number generators by default, due to backwards compatibility considerations. This is an very recent research result (not yet published even). The technique I've suggested could be used to mitigate this issue in other languages as well.

Can any existing Machine Learning structures perfectly emulate recursive functions like the Fibonacci sequence?

To be clear I don't mean, provided the last two numbers in the sequence provide the next one:
(2, 3, -> 5)
But rather given any index provide the Fibonacci number:
(0 -> 1) or (7 -> 21) or (11 -> 144)
Adding two numbers is a very simple task for any machine learning structure, and by extension counting by ones, twos or any fixed number is a simple addition rule. Recursive calculations however...
To my understanding, most learning networks rely on forwards only evaluation, whereas most programming languages have loops, jumps, or circular flow patterns (all of which are usually ASM jumps of some kind), thus allowing recursion.
Sure some networks aren't forwards only; But can processing weights using the hyperbolic tangent or sigmoid function enter any computationally complete state?
i.e. conditional statements, conditional jumps, forced jumps, simple loops, complex loops with multiple conditions, providing sort order, actual reordering of elements, assignments, allocating extra registers, etc?
It would seem that even a non-forwards only network would only find a polynomial of best fit, reducing errors across the expanse of the training set and no further.
Am I missing something obvious, or did most of Machine Learning just look at recursion and pretend like those problems don't exist?
Update
Technically any programming language can be considered the DNA of a genetic algorithm, where the compiler (and possibly console out measurement) would be the fitness function.
The issue is that programming (so far) cannot be expressed in a hill climbing way - literally, the fitness is 0, until the fitness is 1. Things don't half work in programming, and if they do, there is no way of measuring how 'working' a program is for unknown situations. Even an off by one error could appear to be a totally different and chaotic system with no output. This is exactly the reason learning to code in the first place is so difficult, the learning curve is almost vertical.
Some might argue that you just need to provide stronger foundation rules for the system to exploit - but that just leads to attempting to generalize all programming problems, which circles right back to designing a programming language and loses all notion of some learning machine at all. Following this road brings you to a close variant of LISP with mutate-able code and virtually meaningless fitness functions that brute force the 'nice' and 'simple' looking code-space in attempt to follow human coding best practices.
Others might argue that we simply aren't using enough population or momentum to gain footing on the error surface, or make a meaningful step towards a solution. But as your population approaches the number of DNA permutations, you are really just brute forcing (and very inefficiently at that). Brute forcing code permutations is nothing new, and definitely not machine learning - it's actually quite common in regex golf, I think there's even an xkcd about it...
The real problem isn't finding a solution that works for some specific recursive function, but finding a solution space that can encompass the recursive domain in some useful way.
So other than Neural Networks trained using Backpropagation hypothetically finding the closed form of a recursive function (if a closed form even exists, and they don't in most real cases where recursion is useful), or a non-forwards only network acting like a pseudo-programming language with awful fitness prospects in the best case scenario, plus the virtually impossible task of tuning exit constraints to prevent infinite recursion... That's really it so far for machine learning and recursion?
According to Kolmogorov et al's On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition, a three layer neural network can model arbitrary function with the linear and logistic functions, including f(n) = ((1+sqrt(5))^n - (1-sqrt(5))^n) / (2^n * sqrt(5)), which is the close form solution of Fibonacci sequence.
If you would like to treat the problem as a recursive sequence without a closed-form solution, I would view it as a special sliding window approach (I called it special because your window size seems fixed as 2). There are more general studies on the proper window size for your interest. See these two posts:
Time Series Prediction via Neural Networks
Proper way of using recurrent neural network for time series analysis
Ok, where to start...
Firstly, you talk about 'machine learning' and 'perfectly emulate'. This is not generally the purpose of machine learning algorithms. They make informed guesses given some evidence and some general notions about structures that exist in the world. That typically means an approximate answer is better than an 'exact' one that is wrong. So, no, most existing machine learning approaches aren't the right tools to answer your question.
Second, you talk of 'recursive structures' as some sort of magic bullet. Yet they are merely convenient ways to represent functions, somewhat analogous to higher order differential equations. Because of the feedbacks they tend to introduce, the functions tend to be non-linear. Some machine learning approaches will have trouble with this, but many (neural networks for example) should be able to approximate you function quite well, given sufficient evidence.
As an aside, having or not having closed form solutions is somewhat irrelevant here. What matters is how well the function at hand fits with the assumptions embodied in the machine learning algorithm. That relationship may be complex (eg: try approximating fibbonacci with a support vector machine), but that's the essence.
Now, if you want a machine learning algorithm tailored to the search for exact representations of recursive structures, you could set up some assumptions and have your algorithm produce the most likely 'exact' recursive structure that fits your data. There are probably real world problems in which such a thing would be useful. Indeed the field of optimisation approaches similar problems.
The genetic algorithms mentioned in other answers could be an example of this, especially if you provided a 'genome' that matches the sort of recursive function you think you may be dealing with. Closed form primitives could form part of that space too, if you believe they are more likely to be 'exact' than more complex genetically generated algorithms.
Regarding your assertion that programming cannot be expressed in a hill climbing way, that doesn't prevent a learning algorithm from scoring possible solutions by how many much of your evidence it's able to reproduce and how complex they are. In many cases (most? though counting cases here isn't really possible) such an approach will find a correct answer. Sure, you can come up with pathological cases, but with those, there's little hope anyway.
Summing up, machine learning algorithms are not usually designed to tackle finding 'exact' solutions, so aren't the right tools as they stand. But, by embedding some prior assumptions that exact solutions are best, and perhaps the sort of exact solution you're after, you'll probably do pretty well with genetic algorithms, and likely also with algorithms like support vector machines.
I think you also sum things up nicely with this:
The real problem isn't finding a solution that works for some specific recursive function, but finding a solution space that can encompass the recursive domain in some useful way.
The other answers go a long way to telling you where the state of the art is. If you want more, a bright new research path lies ahead!
See this article:
Turing Machines are Recurrent Neural Networks
http://lipas.uwasa.fi/stes/step96/step96/hyotyniemi1/
The paper describes how a recurrent neural network can simulate a register machine, which is known to be a universal computational model equivalent to a Turing machine. The result is "academic" in the sense that the neurons have to be capable of computing with unbounded numbers. This works mathematically, but would have problems pragmatically.
Because the Fibonacci function is just one of many computable functions (in fact, it is primitive recursive), it could be computed by such a network.
Genetic algorithms should do be able to do the trick. The important this is (as always with GAs) the representation.
If you define the search space to be syntax trees representing arithmetic formulas and provide enough training data (as you would with any machine learning algorithm), it probably will converge to the closed-form solution for the Fibonacci numbers, which is:
Fib(n) = ( (1+srqt(5))^n - (1-sqrt(5))^n ) / ( 2^n * sqrt(5) )
[Source]
If you were asking for a machine learning algorithm to come up with the recursive formula to the Fibonacci numbers, then this should also be possible using the same method, but with individuals being syntax trees of a small program representing a function.
Of course, you also have to define good cross-over and mutation operators as well as a good evaluation function. And I have no idea how well it would converge, but it should at some point.
Edit: I'd also like to point out that in certain cases there is always a closed-form solution to a recursive function:
Like every sequence defined by a linear recurrence with constant coefficients, the Fibonacci numbers have a closed-form solution.
The Fibonacci sequence, where a specific index of the sequence must be returned, is often used as a benchmark problem in Genetic Programming research. In most cases recursive structures are generated, although my own research focused on imperative programs so used an iterative approach.
There's a brief review of other GP research that uses the Fibonacci problem in Section 3.4.2 of my PhD thesis, available here: http://kar.kent.ac.uk/34799/. The rest of the thesis also describes my own approach, which is covered a bit more succinctly in this paper: http://www.cs.kent.ac.uk/pubs/2012/3202/
Other notable research which used the Fibonacci problem is Simon Harding's work with Self-Modifying Cartesian GP (http://www.cartesiangp.co.uk/papers/eurogp2009-harding.pdf).

Parallelize Solve() for Ax=b?

Crossposted with STATS.se since this problem could straddle both STATs.se/SO
https://stats.stackexchange.com/questions/17712/parallelize-solve-for-ax-b
I have some extremely large sparse matrices created using spMatrix function from the matrix package.
Using the solve() function works for my Ax=b issue, but it takes a very long time. Several days.
I noticed that http://cran.r-project.org/web/packages/RScaLAPACK/RScaLAPACK.pdf
appears to have a function that can parallelize the solve function, however, it can take several weeks to get new packages installed on this particular server.
The server already has the snow package installed it.
So
Is there a way of using snow to parallelize this operation?
If not, are there other ways to speed up this type of operation?
Are there other packages like RScaLAPACK? My search on RScaLAPACK seemed to suggest people had a lot of issues with it.
Thanks.
[EDIT] -- Additional details
The matrices are about 370,000 x 370,000.
I'm using it to solve for alpha centrality, http://en.wikipedia.org/wiki/Alpha_centrality. I was originally using the alpha centrality function in the igraph package, but it would crash R.
More details
This is on a single machine with 12 cores and 96 gigs of memory (I believe)
It's a directed graph along the lines of paper citation relationships.
Calculating condition number and density will take awhile. Will post as it comes available.
Will crosspost on stat.SE and will add a link back to here
[Update 1: For those just tuning in: The original question involved parallelizing computations to solving a regression problem; given that the underlying problem is related to alpha centrality, some of the issues, such as bagging and regularized regression may not be as immediately applicable, though that leads down the path of further statistical discussions.]
There are a bundle of issues to address here, from the infrastructural to the statistical.
Infrastructure
[Updated - also see Update #2 below.]
Regarding parallelized linear solvers, you can replace R's BLAS / LAPACK library with one that supports multithreaded computations, such as ATLAS, Goto BLAS, Intel's MKL, or AMD's ACML. Personally, I use AMD's version. ATLAS is irritating, because one fixes the number of cores at compilation, not at run-time. MKL is commercial. Goto is not well supported anymore, but is often the fastest, but only by a slight margin. It's under the BSD license. You can also look at Revolution Analytics's R, which includes, I think, the Intel libraries.
So, you can start using all of the cores right away, with a simple back-end change. This could give you a 12X speedup (b/c of the number of cores) or potentially much more (b/c of better implementation). If that brings down the time to an acceptable range, then you're done. :) But, changing the statistical methods could be even better.
You've not mentioned the amount of RAM available (or the distribution of it per core or machine), but A sparse solver should be pretty smart about managing RAM accesses and not try to chew on too much data at once. Nonetheless, if it is on one machine and if things are being done naively, then you may encounter a lot of swapping. In that case, take a look at packages like biglm, bigmemory, ff, and others. The former addresses solving linear equations (or GLMs, rather) in limited memory, the latter two address shared memory (i.e. memory mapping and file-based storage), which is handy for very large objects. More packages (e.g. speedglm and others) can be found at the CRAN Task View for HPC.
A semi-statistical, semi-computational issue is to address visualization of your matrix. Try sorting by the support per row & column (identical if graph is undirected, else do one then the other, or try a reordering method like reverse Cuthill-McKee), and then use image() to plot the matrix. It would be interesting to see how this is shaped, and that affects which computational and statistical methods one could try.
Another suggestion: Can you migrate to Amazon's EC2? It is inexpensive, and you can manage your own installation. If nothing else, you can prototype what you need and migrate it in-house once you have tested the speedups. JD Long has a package called segue that apparently makes life easier for distributing jobs on Amazon's Elastic MapReduce infrastructure. No need to migrate to EC2 if you have 96GB of RAM and 12 cores - distributing it could speed things up, but that's not the issue here. Just getting 100% utilization on this machine would be a good improvement.
Statistical
Next up are multiple simple statistical issues:
BAGGING You could consider sampling subsets of your data in order to fit the models and then bag your models. This can give you a speedup. This can allow you to distribute your computations on as many machines & cores as you have available. You can use SNOW, along with foreach.
REGULARIZATION The glmnet supports sparse matrices and is very fast. You would be wise to test it out. Be careful about ill-conditioned matrices and very small values of lambda.
RANK Your matrices are sparse: are they full rank? If they are not, that could be part of the issue you're facing. When matrices are either singular or very nearly so (check your estimated condition number, or at least look at how your 1st and Nth eigenvalues compare - if there's a steep drop off, you're in trouble - you might check eval1 versus ev2,...,ev10,...). Again, if you have nearly singular matrices, then you need to go back to something like glmnet to shrink out the variables are either collinear or have very low support.
BOUNDING Can you reduce the bandwidth of your matrix? If you can block diagonalize it, that's great, but you'll likely have cliques and members of multiple cliques. If you can trim the most poorly connected members, then you may be able to estimate their alpha centrality as being upper bounded by the lowest value in the same clique. There are some packages in R that are good for this sort of thing (check out Reverse Cuthill-McKee; or simply look to see how you'd convert it into rectangles, often relating to cliques or much smaller groups). If you have multiple disconnected components, then, by all means, separate the data into separate matrices.
ALTERNATIVES Are you wedded to the Alpha Centrality? There may be other measures that are monotonically correlated (i.e. have high rank correlation) with the same value that could be calculated more cheaply or at least implemented quite efficiently. If those will work, then your analyses could proceed with a lot less effort. I have a few ideas, but SO isn't really the place to go about that discussion.
For more statistical perspectives, appropriate Q&A should occur on the stats.stackexchange.com, Cross-Validated.
Update 2: I was a bit too quick in answering and didn't address this from the long-term perspective. If you are planning to do research on such systems for the long-term, you should look at other solvers that may be more applicable to your type of data and computing infrastructure. Here is a very nice directory of the options for both solvers and pre-conditioners. It seems this doesn't include IBM's "Watson" solver suite. Although it may take weeks to get software installed, it's quite possible that one of the packages is already installed if you have a good HPC administrator.
Also, keep in mind that R packages can be installed to the user directory - you need not have a package installed in the general directory. If you need to execute something as a user other than yourself, you could also download a package to the scratch or temporary space (if you're running within just 1 R instance, but using multiple cores, check out tempdir).

Articles on analysis of mixed precision numerical algorithms?

Many numerical algorithms tend to run on 32/64bit floating points.
However, what if you had access to lower precision (and less power hungry) co-processors? How can then be utilized in numerical algorithms?
Does anyone know of good books/articles that address these issues?
Thanks!
Numerical analysis theory uses methods to predict the precision error of operations, independent of the machine they are running on. There are always cases where even on the most advanced processor operations may lose accuracy.
Some books to read about it:
Accuracy and Stability of Numerical Algorithms by N.J. Higham
An Introduction to Numerical Analysis by E. Süli and D. Mayers
If you cant find them or are too lazy to read them tell me and i will try to explain some things to you. (Well im no expert in this because im a Computer Scientist, but i think i can explain you the basics)
I hope you understand what i wrote (my english is not the best).
Most of what you are likely to find will be about doing floating-point arithmetic on computers irrespective of the size of the representation of the numbers themselves. The basic issues surround f-p arithmetic apply whatever the number of bits. Off the top of my head these basic issues will be:
range and accuracy of numbers that are represented;
careful selection of algorithms which are robust and reliable on f-p numbers rather than on real numbers;
the perils and pitfalls of iterative and lengthy calculations in which you run the risk of losing precision and accuracy.
In general, the fewer bits you have the sooner you run into problems, but just as there are algorithms which are useful in 32 bits, there are algorithms which are useful in 8 bits. Sometimes the same algorithm is useful however many bits you use.
As #George suggested, you should probably start with a basic text on numerical analysis, though I think the Higham book is not a basic text.
Regards
Mark

Resources