set.seed() over different OS with RNGkind() [closed] - r

This question is similar (but not the same!) to the following questions...
Different sample results using set.seed command?
Is set.seed consistent over different versions of R (and Ubuntu)?
Same seed, different OS, different random numbers in R
... in which RNGkind() is recommended in scripts to guarantee consistency across OSes / R versions when setting the seed with set.seed().
However, I have found that in order to reproduce results on the Unix and Windows systems I'm using, I have to set RNGkind(sample.kind = "Rounding") when running on Windows but not on Unix. If I set it on both, I can't reproduce the result.
Can anyone explain this discrepancy in the systems?
And how does one share code with set.seed() and ensure it's reproducible without knowing the end users' OS?
Many thanks
EDIT: I am having this problem with the kmeans() function. I call set.seed(1) prior to each use of kmeans().

The random number generators in R are consistent across operating systems, but have been modified a few times over the history of R, so are not consistent by default across R versions. However, you can always reproduce the random streams from earlier R versions by setting set.seed() and RNGkind() to match what was previously used.
The RNGversion() function will set the RNG defaults in a newer version of R to those of any previous version. If you look at its source, you can see that the defaults changed in 0.99, 1.7.0, and 3.6.0.
One difficulty in reproducing random number results is that people don't always report the value of RNGkind(). If you change to a non-default setting and save the workspace, you'll return to that non-default setting when you reload it.
Generally speaking, each of the changes has been an improvement, so advice to use code like RNGkind(sample.kind = "Rounding") is probably bad advice: it restores buggy behaviour that was fixed by default in R 3.6.0. (Though it's a pretty subtle bug unless you're using the sample() function with really huge populations.)
You are generally better off encouraging people to use the most recent release of R (except occasionally x.y.0 releases, which sometimes introduce new bugs). It's also a bad idea to save the workspace, because that will cause R to retain the old or non-default RNGs.
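For example, here is a minimal sketch of pinning the RNG state explicitly in a shared script (the kmeans() call on iris with k = 3 is an assumption standing in for the asker's data):
RNGversion("3.6.0")                    # current defaults: Mersenne-Twister, Inversion, Rejection
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
km$cluster[1:10]
# To reproduce results that were generated before R 3.6.0, restore the old sampler
# instead; this is what RNGkind(sample.kind = "Rounding") does, and R will warn
# that it is the old, buggy rounding method.
RNGversion("3.5.0")
set.seed(1)
sample(10, 3)
RNGkind()   # report the kinds alongside any saved results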

Related

How to move beyond the limits of R in terms of memory

I am new to advanced R programming. Unfortunately, I am running into memory issues. On Stack Overflow I could only find somewhat older contributions, and none of the solutions there worked for me. Machine learning has been on the rise in recent years, so I think newer solutions are available. So I am wondering what the best practices are nowadays. Is everyone working on an Azure server, or are there other options that can be used locally?
(For completeness I will explain my context, though I think it is not strictly relevant to the question itself. My context:
I perform my analysis on a Linux server. The dataset is small enough to load into R without any problems. Details: 42 MB, 225,000 rows, 25 columns. However, the problems start when I run a random forest classification model with 5-fold cross-validation using the caret package. To improve computation time I'm running it over 8 clusters/cores. One of the errors I get is "Error : cannot allocate vector of size ** GB".)
Do you have suggestions for packages or information sources that I can try out? If you need more information to provide a suitable answer please let me know :)
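For reference, a minimal sketch of the kind of setup described above (the doParallel backend, the object names my_data / target, and method = "rf" are assumptions, not the asker's code); note that each of the 8 workers holds its own copy of the training data, which is often what exhausts memory:
library(caret)
library(doParallel)
cl <- makeCluster(8)          # 8 workers; each gets its own copy of the data
registerDoParallel(cl)
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit <- train(target ~ ., data = my_data,   # my_data: ~225,000 rows x 25 columns (hypothetical name)
             method = "rf", trControl = ctrl)
stopCluster(cl)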

Error in RStudio while running decision tree (mac)

I am running a CART decision tree on a training set which I've tokenized using quanteda for a routine text analysis task. The resulting DFM from tokenizing was turned into a dataframe and appended with the class attribute I am predicting for.
Like many DFMs, the table is very wide (33k columns), but only contains about 5,500 rows of documents. Calling rpart on my training set returns a stack overflow error.
If it matters, to help increase the speed of calculations, I am using the doSNOW library so I can run the model on 3 out of 4 of my cores in parallel.
I've looked at this answer but can't figure out how to do the equivalent on my Mac workstation to see if the same solution would work for me. There is a chance that even if I increase the ppsize of RStudio, I may still run into this error.
So my question is: how do I increase the maxppsize of RStudio on a Mac, or, more generally, how can I fix this stack overflow so I can run my model?
Thanks!
In the end, I found that Macs don't have this same command-line option, since the Mac version of RStudio uses all available memory by default.
So the way I fixed this is by decreasing the complexity of the task by reducing the sparsity. I cleaned the document-term matrix by removing all tokens that did not occur in at least 5% of the corpus. This was enough to take a matrix with 33k columns down to a much more manageable 3k columns while still leading to a highly representative DFM.
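A minimal sketch of that trimming step with quanteda's dfm_trim() (the object names are assumptions):
library(quanteda)
toks <- tokens(my_corpus)            # my_corpus: the ~5,500 documents (hypothetical name)
dfm_full <- dfm(toks)                # ~33k features
# keep only tokens that appear in at least 5% of the documents
dfm_small <- dfm_trim(dfm_full, min_docfreq = 0.05, docfreq_type = "prop")
train_df <- convert(dfm_small, to = "data.frame")   # then append the class label and call rpart()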

R CMD check fails, devtools::test() works fine

Sometimes R CMD check fails even though all your tests run fine when you run them manually (or using devtools::test()).
I ran into one such issue as well, when I wanted to compare results from bootstrapping using the boot package.
I went down a rabbit hole looking for issues caused by parallel computation (done by boot) and random number generators (RNGs).
None of these were the answer.
In the end, the issue was trivial.
I used base::sort() to create the levels of a factor (to ensure that they would always align, even if the data came in a different order).
The problem is that the default sort method depends on the locale of your system, and R CMD check uses a different locale than my interactive session.
The issue resided in this:
R interactively used: LC_COLLATE=en_US.UTF-8;
R CMD check used: LC_COLLATE=C;
In the details of base::sort this is mentioned:
Except for method ‘"radix"’, the sort order for character vectors
will depend on the collating sequence of the locale in use:
see ‘Comparison’. The sort order for factors is the order of their
levels (which is particularly appropriate for ordered factors).
I resolved the issue by specifying the radix sort method.
Now all works fine.
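A minimal sketch of why the radix method sidesteps the locale difference:
x <- c("b", "A", "a", "B")
sort(x)                     # order depends on LC_COLLATE (en_US.UTF-8 and C disagree here)
sort(x, method = "radix")   # byte-wise comparison in the C locale, identical everywhere
f <- factor(x, levels = sort(unique(x), method = "radix"))
levels(f)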

Numerical method produces platform dependent results

I have a rather complicated issue with my small package. Basically, I'm building a GARCH(1,1) model with the rugarch package, which is designed exactly for this purpose. It uses a chain of solvers (provided by Rsolnp and nloptr, general-purpose nonlinear optimization) and works fine. I'm testing my method with testthat by providing a benchmark solution, which was obtained previously by manually running the code under Windows (the main platform the package is to be used on).
Now, I initially had some issues when the solution was not consistent across several consecutive runs. The difference was within the tolerance I specified for the solver (default solver = 'hybrid', as recommended by the documentation), so my guess was that it uses some sort of randomization. So I removed both the random seed and parallelization ("legitimate" reasons for the differences), and the issue was solved: I get identical results every time under Windows, so I run R CMD check and testthat succeeds.
After that I decided to automate a little bit, and now the build process is controlled by Travis. To my surprise, the result under Linux is different from my benchmark; the log states that
read_sequence(file_out) not equal to read_sequence(file_benchmark)
Mean relative difference: 0.00000014688
Rebuilding several times yields the same result, and the difference is always the same, which means that under Linux the solution is also consistent. As a temporary fix, I'm setting a tolerance limit depending on the platform, and the test passes (see latest builds).
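A minimal sketch of that platform-dependent tolerance workaround (the threshold values are assumptions):
tol <- if (.Platform$OS.type == "windows") 1e-8 else 1e-6   # looser limit off Windows
testthat::expect_equal(read_sequence(file_out),
                       read_sequence(file_benchmark),
                       tolerance = tol)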
So, to sum up:
A numeric procedure produces identical output on both Windows and Linux platforms separately;
However, these outputs are different and are not caused by random seeds and/or parallelization;
I generally only care about supporting Windows and do not plan to make a public release, so this is not a big deal for my package per se. But I'm bringing this to attention as there may be an issue with one of the solvers, which are used quite widely.
And no, I'm not asking you to fix my code: a platform-dependent tolerance is quite ugly, but it does the job so far. The questions are:
Is there anything else that can "legitimately" (or "naturally") lead to the described difference?
Are low-level numeric routines required to produce identical results on all platforms? Can it happen I'm expecting too much?
Should I care a lot about this? Is this a common situation?

Why is R slowing down as time goes on, when the computations are the same?

So I think I don't quite understand how memory works in R. I've been running into problems where the same piece of code gets slower later in the week (using the same R session; sometimes even when I clear the workspace). I've tried to develop a toy problem that I think reproduces the "slowing down effect" I have been observing when working with large objects. Note that the code below is somewhat memory intensive (don't blindly run it without adjusting n and N to match what your setup can handle). Note that it will likely take about 5-10 minutes before you start to see this slowing-down pattern (possibly even longer).
N=4e7 #number of simulation runs
n=2e5 #number of simulation runs between calculating time elapsed
meanStorer=rep(0,N);
toc=rep(0,N/n);
x=rep(0,50);
for (i in 1:N) {
  if (i %% n == 1) { tic = proc.time()[3] }
  x[] = runif(50)
  meanStorer[i] = mean(x)
  if (i %% n == 0) { toc[i/n] = proc.time()[3] - tic; print(toc[i/n]) }
}
plot(toc)
meanStorer is certainly large, but it is pre-allocated, so I am not sure why the loop slows down as time goes on. If I clear my workspace and run this code again, it starts out just as slow as the last few iterations! I am using RStudio (in case that matters). Also, here is some of my system information:
OS: Windows 7
System Type: 64-bit
RAM: 8gb
R version: 2.15.1 ($platform yields "x86_64-pc-mingw32")
Here is a plot of toc, prior to using pre-allocation for x (i.e. using x=runif(50) in the loop)
Here is a plot of toc, after using pre-allocation for x (i.e. using x[]=runif(50) in the loop)
Is ?rm not doing what I think it's doing? What's going on under the hood when I clear the workspace?
Update: with the newest version of R (3.1.0), the problem no longer persists even when increasing N to N=3e8 (note R doesn't allow vectors too much larger than this)
It is quite unsatisfying that the fix is just updating R to the newest version, though, because I can't figure out why there were problems in version 2.15. It would still be nice to know what caused them, so I am going to leave this question open.
As you state in your updated question, the high-level answer is that you were using an old version of R with a bug: with the newest version of R (3.1.0), the problem no longer persists.
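As an aside on the ?rm question: rm() only removes the bindings from the workspace; the memory itself is reclaimed by the garbage collector, which R runs automatically (gc() just forces a collection and reports usage):
rm(list = ls())   # drop every object in the workspace
gc()              # force a garbage collection and print memory statistics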
