Random number analysis - R

Given a series of randomly generated data, how can I figure out how random it actually is? Is R a good tool for this, or MATLAB? What other questions can these tools answer about randomly generated data? Is there another tool better suited for this?

The DieHarder test battery by Robert G. Brown -- which reimplements and extends the old DIEHARD by Marsaglia et al. -- has been wrapped into the R package RDieHarder, which you could start with.
Note that RDieHarder versions need their particular matching DieHarder releases -- and we're not there yet for the most recent development version of the latter.
Edit: Also, for the subset of cryptographic tests, the NIST suite (which is included in DieHarder) should be appropriate, as that is what it was designed for.
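To get started, a minimal sketch of driving one DieHarder test from R -- assuming RDieHarder installs cleanly against a matching DieHarder release, and noting that argument and test names can differ between versions:

```r
## Minimal sketch: run one DieHarder test against R's Mersenne-Twister
## (assumes install.packages("RDieHarder") succeeded; names of arguments
##  and tests may differ slightly between RDieHarder releases)
library(RDieHarder)

dieharderGenerators()   # list the generators the wrapper knows about
dieharderTests()        # list the available tests

dh <- dieharder(rng = "mt19937", test = "diehard_runs")
print(dh)               # overall p-value(s) for the chosen test
plot(dh)                # visual check of the p-value distribution
```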

First you need to decide what kind of randomness you're testing for. Do you have in mind a uniform distribution inside some range? That's usually what people have in mind, though you may have some other flavor of randomness such as a normal distribution.
Once you have a candidate distribution, you can test the goodness of fit to that distribution. The Kolmogorov-Smirnov test is a good general-purpose test. I believe it's called ks.test in R. But I also believe it assumes distinct values, so that could be a problem if you're sampling from such a small range of values that the same value appears more than once.
S. Lott mentioned Knuth's Seminumerical Algorithms in the comments. That book has a good introduction to the chi-squared test and the Kolmogorov-Smirnov tests for goodness of fit.
If you do suspect you have uniform random values, the DIEHARD test that Dirk Eddelbuettel mentioned is a standard test.
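For example, a short sketch of both checks in R, assuming the values are supposed to be Uniform(0, 1):

```r
## Sketch: goodness-of-fit checks for suspected Uniform(0, 1) draws
x <- runif(1000)                       # replace with your own series

## Kolmogorov-Smirnov test against the uniform CDF
## (ks.test warns about ties, as noted above)
ks.test(x, "punif", min = 0, max = 1)

## Chi-squared test: bin the values and compare observed vs expected counts
bins     <- cut(x, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
observed <- table(bins)
chisq.test(observed, p = rep(1/10, 10))
```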

According to Wikipedia (Randomness):
The central idea is that a string of bits is random if and only if it is shorter than any computer program that can produce that string (Kolmogorov randomness) — this means that random strings are those that cannot be compressed.
Therefore, given the random stream of numbers, save it to a file and compress it using your favorite tool (zip, rar, ...). The compression ratio can be interpreted as a measure of randomness. Even better, I would use it as a relative score to compare the randomness of two data series.
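As a rough illustration of that idea, a sketch using base R's memCompress; the absolute ratio depends on the compressor and on serialization overhead, so treat it only as a relative score:

```r
## Sketch: crude "compressibility" score for a numeric series using base R
## (zip/rar on a saved file works just as well)
compress_ratio <- function(x, type = "gzip") {
  raw_bytes <- serialize(x, connection = NULL)
  length(memCompress(raw_bytes, type = type)) / length(raw_bytes)
}

good <- runif(10000)                  # should compress poorly (ratio near 1)
bad  <- rep(runif(100), times = 100)  # repeating pattern compresses well

compress_ratio(good)
compress_ratio(bad)                   # noticeably smaller ratio
```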

I recommend reading Chapter 10 of Beautiful Testing: Testing a Random Number Generator. It's a little more approachable than most texts on the topic. Maybe, if we're nice, the author of that chapter, John Cook, might stop by and give his input.

There is, as always, a toolbox for it.
For theory, the above-mentioned reference by Knuth is useful, and to connect with Amro's response, there is related work by Li & Vitányi.

Related

Demonstration Code for Nested Dirichlet Process

My question is about how to implement the nested Dirichlet process (NDP) with R code.
The NDP is suitable for clustering over distributions and simultaneously clustering within a distribution. Rodriguez et al. (2008) provided a simulation example to demonstrate the ability of the NDP to distinguish different distributions. I am trying to learn this approach by reproducing the results for this example, but I failed to do so because I do not understand well how the base distribution is related to the mixture components.
The simulation example used a normal inverse-gamma distribution, NIG(0, 0.01, 3, 1), as the base distribution. But the four different distributions are:
The algorithm provided in Section 4 (Rodriguez et al., 2008, p. 1135) was used to do the simulation. I have trouble understanding and executing this algorithm, especially step 5:
Can you please provide sample code to demonstrate this algorithm? Your help is highly appreciated!
I have not been able to do the coding myself, but I have found a recent paper which does the simulation using exact inference instead of a truncation approximation. I think it might help someone else who is interested, just like me, so I post the link to that paper here.
What I like about this paper is that it is well written and has source code (in R), which helps me understand the methodology better.
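This is not the full nested sampler of Rodriguez et al., but a minimal sketch of the building block the question hinges on: a truncated stick-breaking draw from a single DP whose base measure is the NIG(0, 0.01, 3, 1) mentioned above. The NIG parameterization, the truncation level K, and the concentration alpha are illustrative assumptions:

```r
## Sketch: truncated stick-breaking draw from a Dirichlet process whose base
## measure is NIG(0, 0.01, 3, 1), then sampling observations from the implied
## normal mixture.  K and alpha are illustrative choices, and the NIG
## parameterization (mu0, kappa, shape, scale) is an assumption.
set.seed(1)

rnig <- function(mu0 = 0, kappa = 0.01, shape = 3, scale = 1) {
  sigma2 <- 1 / rgamma(1, shape = shape, rate = scale)   # inverse-gamma variance
  mu     <- rnorm(1, mean = mu0, sd = sqrt(sigma2 / kappa))
  c(mu = mu, sigma2 = sigma2)
}

rdp_stick <- function(K = 50, alpha = 1) {
  v  <- rbeta(K, 1, alpha)                 # stick-breaking proportions
  w  <- v * cumprod(c(1, 1 - v[-K]))       # mixture weights
  th <- t(replicate(K, rnig()))            # atoms drawn from the base measure
  list(w = w / sum(w), mu = th[, "mu"], sigma2 = th[, "sigma2"])
}

## Draw one random distribution and sample 500 observations from it
G <- rdp_stick()
k <- sample(length(G$w), size = 500, replace = TRUE, prob = G$w)
y <- rnorm(500, mean = G$mu[k], sd = sqrt(G$sigma2[k]))
hist(y, breaks = 40, main = "One draw from a (truncated) DP mixture")
```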

Is there any Python equivalent of R's biglm?

I have used biglm in R and found it very useful. Now I need the same type of functionality in Python. Any ideas? I have seen that patsy/statsmodels has an incremental mode, but have not been able to find any samples to copy/adapt. Any pointers would be much appreciated.
From a related answer by Nathaniel Smith on the statsmodels mailing list:
My incremental LS code might be useful here, it's basically the same problem:
https://github.com/njsmith/pyrerp/blob/master/pyrerp/incremental_ls.py#L330
The new X'X is the sum of the old X'Xs, then you have to re-do the scaling and inversion to get the new vcov matrix for the estimates. Should be doable so long as you know how many data points are in each and the various sums-of-squares. (The code I linked has some extra complexity because of handling a particular sort of heteroskedasticity via FGLS, but it can pretty much be ignored.)
statsmodels doesn't have anything in this area yet.
There is an incremental OLS function in statsmodels; however, that was written as a helper function for CUSUM tests (in memory) and hasn't been used or checked for any other purpose:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.diagnostic.recursive_olsresiduals.html
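The bookkeeping Nathaniel describes is language-agnostic; here is a minimal sketch of the same idea written in R (to match the rest of this page), accumulating X'X and X'y chunk by chunk -- a numpy port would look essentially the same:

```r
## Sketch: incremental ordinary least squares by accumulating X'X and X'y
## chunk by chunk (the same bookkeeping biglm does; vcov/FGLS details omitted)
incremental_ols <- function(chunks) {
  XtX <- NULL; Xty <- NULL; n <- 0
  for (chunk in chunks) {
    X <- cbind(1, as.matrix(chunk[, -1]))   # intercept + predictors
    y <- chunk[[1]]                         # first column is the response
    if (is.null(XtX)) {
      XtX <- crossprod(X); Xty <- crossprod(X, y)
    } else {
      XtX <- XtX + crossprod(X); Xty <- Xty + crossprod(X, y)
    }
    n <- n + nrow(X)
  }
  list(coef = drop(solve(XtX, Xty)), n = n)
}

## Check against lm() on data split into three chunks
set.seed(7)
x1 <- rnorm(300); x2 <- rnorm(300)
d  <- data.frame(y = 1 + 2 * x1 - x2 + rnorm(300), x1 = x1, x2 = x2)
chunks <- split(d, rep(1:3, each = 100))

incremental_ols(chunks)$coef
coef(lm(y ~ x1 + x2, data = d))             # should agree
```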

How does R calculate the p-values in logistic regression

What type of p-values does R calculate in a binomial logistic regression, and where is this documented?
When I read the documentation for ?glm(), I find no reference to the calculation of the p-values.
The p-values are calculated by the function summary.glm. See ?summary.glm for a (very brief) bit about how those are calculated.
For more information, look at the source code by typing
summary.glm
at the R command prompt. There you will find the lines of code where an object pvalue is created. Follow the code back to see how the components of the p-value calculation are (conditionally) calculated.
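To make that concrete, a short sketch that reproduces the Pr(>|z|) column of summary.glm by hand for a binomial fit (fixed dispersion, so a normal rather than t reference distribution is used):

```r
## Sketch: reproduce the Pr(>|z|) column of summary(glm(...)) by hand
## for a binomial model (fixed dispersion => normal reference distribution)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- est / se
p   <- 2 * pnorm(-abs(z))             # Wald test p-values

cbind(z, p)
summary(fit)$coefficients             # same "z value" / "Pr(>|z|)" columns
```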
The authors of R wrote the help system with several principles in mind: compactness (don't write more than is needed, it's not a textbook), accuracy, and a curious and well-educated audience. It really was written for other statisticians. The "curious" part of that opening sentence was included to raise the question of why you did not also follow the various links on the ?glm page: to summary.glm, where you would have found one answer to your ambiguous question, or to anova.glm, where you would have found another possible answer. The help authors do expect that you will follow those links, read the whole page, and execute the examples. You will notice that even after you get to summary.glm there is no mention of "binary logistic regression", since they pretty much assume that you are well grounded in statistics and have a copy of McCullagh and Nelder handy, or, if not, that you will go read the references.
The other principle: sometimes it is the code itself (given the open-source nature of R) that serves as the documentation. Technically glm doesn't print anything, and print.glm doesn't print p-values. It would be print.summary.glm or print.anova.glm that would be doing any printing. Part of learning R is learning that the results printed to the console will have gone through an eval-print loop and that output can be tailored with object-class-specific functions.
These assumptions are just part of what many people see as a "steep learning curve for R" (although I would have called it a shallow curve if plotted with time/effort on the x-axis).

Has your pseudo-random number generator (PRNG) ever not been random enough?

Have you ever written simulations or randomized algorithms where you've run into trouble because of the quality of the (pseudo)-random numbers you used?
What was happening?
How did you detect / realize your prng was the problem?
Was switching PRNGs enough to fix the problem, or did you have to switch to a source of true randomness?
I'm trying to figure out what types of applications require one to worry about the quality of their source of randomness and how one realizes when this becomes a problem.
The dated random number generator RANDU was infamous in the seventies for producing "bad" random numbers. My PhD supervisor mentioned that it affected his PhD and he had to rerun simulations. A search on Google for "RANDU linear congruential generator" brings up other examples.
When I run simulations on multiple machines, I've sometimes been tempted to generate "random seeds" rather than just use a proper parallel random number generator -- for example, generating each seed from the current time in seconds. This has caused me enough problems that I now avoid it at all costs.
This is mainly due to my particular interests, but other than parallel computing, the thought of creating my own random number generator would never cross my mind. Calling a well tested random number function is trivial in most languages.
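For the parallel case, base R already ships such a generator; a sketch using independent L'Ecuyer-CMRG streams instead of time-based seeds:

```r
## Sketch: independent random-number streams for parallel runs using the
## L'Ecuyer-CMRG generator shipped with base R (no time-based seeds needed)
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(42)

## Pre-compute one independent stream per worker/machine
streams <- vector("list", 4)
streams[[1]] <- .Random.seed
for (i in 2:4) streams[[i]] <- nextRNGStream(streams[[i - 1]])

## Each worker restores its own stream before simulating
run_one <- function(stream) {
  assign(".Random.seed", stream, envir = .GlobalEnv)
  mean(runif(1e5))
}
sapply(streams, run_one)

## (parallel::mclapply() handles this stream bookkeeping automatically
##  on Unix when the L'Ecuyer-CMRG kind is set)
```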
It is good practice to run your PRNG against DieHard. Very good and fast PRNGs exist nowadays (see the work of Marsaglia); see Numerical Recipes, 3rd edition, for a good introduction.

Finding the Formula for a Curve

Is there a program that will take "response curve" values from me, and provide a formula that approximates the response curve?
It would be cool if such a program would take a numeric "percent correct" (perhaps with a standard deviation) so that it returns simplified formulas when laxity is permissible, and more precise (viz. complex) formulas when the curve needs to be approximated closely.
My interest is to play with the response curve values and "laxity" factor, until such a tool spits out a curve-fit formula simple enough that I know it will be high performance during machine computations.
Check out Eureqa, a free (as in beer) utility from Cornell University.
What's particularly interesting about Eureqa is that it uses genetic algorithms to fit the input curve you specify, and you can say what functions to allow or not in the fit. So if you wanted to stay away from sine and cosine, for instance, it wouldn't even consider those. It will also show you the best approximation with the fewest steps, and the most accurate approximation (regardless of steps). You can also run the fitting tool across multiple networked computers to speed up getting your results.
It's a very interesting tool -- check out their how-to videos.
MATLAB, Mathematica, Octave, Maple, NumPy, and Scilab are just six among thousands of programs that will do this.
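In R, for instance, the question's "laxity" idea can be mimicked by increasing model complexity only until the fit is within a chosen tolerance; a rough sketch, where the polynomial basis and the stopping rule are just illustrative choices:

```r
## Sketch: fit polynomials of increasing degree and stop at the simplest one
## whose worst-case error is within a user-chosen tolerance ("laxity")
fit_simplest <- function(x, y, tol, max_degree = 10) {
  for (d in seq_len(max_degree)) {
    fit <- lm(y ~ poly(x, d, raw = TRUE))
    if (max(abs(residuals(fit))) <= tol)
      return(list(degree = d, fit = fit))
  }
  stop("no polynomial up to max_degree meets the tolerance")
}

## Example response curve with noise
set.seed(123)
x <- seq(0, 10, length.out = 200)
y <- 2 + 0.5 * x - 0.03 * x^2 + rnorm(200, sd = 0.1)

loose <- fit_simplest(x, y, tol = 1.0)   # lax tolerance: a lower-degree formula passes
tight <- fit_simplest(x, y, tol = 0.5)   # tighter tolerance: requires a higher degree
c(loose = loose$degree, tight = tight$degree)
coef(loose$fit)                          # the "simple enough" formula
```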
SigmaPlot does exactly what you're looking for: statistics and visualization of data.
