R CMD check fails, devtools::test() works fine

Sometimes R CMD check fails even though all your tests run fine when you run them manually (or with devtools::test()).
I ran into one such issue when I wanted to compare results from bootstrapping with the boot package.
I went down a rabbit hole looking for problems caused by parallel computation (done by boot) and random number generators (RNGs).
None of these turned out to be the answer.

In the end, the issue was trivial.
I used base::sort() to create the levels of a factor (to ensure that they would always align, even if the data arrived in a different order).
The problem is that the default sort method depends on the locale of your system, and R CMD check uses a different locale than my interactive session.
Concretely:
R used interactively: LC_COLLATE=en_US.UTF-8
R CMD check used: LC_COLLATE=C
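A minimal sketch of the effect (assuming a system where both locales are available): in the C locale, uppercase letters collate before all lowercase letters, which changes the sort order.
x <- c("apple", "Banana", "cherry")
Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
sort(x)  # "apple"  "Banana" "cherry"
Sys.setlocale("LC_COLLATE", "C")
sort(x)  # "Banana" "apple"  "cherry"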
This is mentioned in the Details section of ?base::sort:
Except for method ‘"radix"’, the sort order for character vectors
will depend on the collating sequence of the locale in use:
see ‘Comparison’. The sort order for factors is the order of their
levels (which is particularly appropriate for ordered factors).
I resolved the issue by specifying the radix sort method, which does not depend on the locale.
Now everything works fine.
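A sketch of the fix, assuming the factor levels are built from the data roughly like this (dat$group is illustrative):
# radix sorting is locale-independent, so the levels are identical everywhere
lvls <- sort(unique(dat$group), method = "radix")
dat$group <- factor(dat$group, levels = lvls)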

Related

set.seed() over different OS with RNGkind() [closed]

This question is similar (but not the same!) to the following questions...
Different sample results using set.seed command?
Is set.seed consistent over different versions of R (and Ubuntu)?
Same seed, different OS, different random numbers in R
... in which RNGkind() is recommended in scripts to guarantee consistency across OSes / R versions when setting the seed with set.seed().
However, I have found that in order to reproduce results on the Unix and Windows systems I'm using, I have to set RNGkind(sample.kind = "Rounding") when running on Windows but not on Unix. If I set it on both, I can't reproduce the result.
Can anyone explain this discrepancy in the systems?
And how does one share code with set.seed() and ensure it's reproducible without knowing the end users' OS?
Many thanks
EDIT: I am having this problem with the kmeans() function. I call set.seed(1) prior to each use of kmeans().
The random number generators in R are consistent across operating systems, but have been modified a few times over the history of R, so are not consistent by default across R versions. However, you can always reproduce the random streams from earlier R versions by setting set.seed() and RNGkind() to match what was previously used.
The RNGversion() function will set newer versions of R to the defaults from any previous version. If you look at its source, you can see that the defaults changed in 0.99, 1.7.0, and 3.6.0.
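For example (a sketch; the version string and seed are arbitrary), to reproduce on a current R a stream generated under the R 3.5.x defaults:
RNGversion("3.5.0")  # restore pre-3.6.0 RNG defaults; warns about the old sample() behaviour
set.seed(1)
sample(10)           # matches what R 3.5.0 would have produced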
One difficulty in reproducing random number results is that people don't always report the value of RNGkind(). If you change to a non-default setting and save the workspace, you'll return to that non-default setting when you reload it.
Generally speaking, each of the changes has been an improvement, so advice to use code like RNGkind(sample.kind = "Rounding") is probably bad advice: it restores buggy behaviour that was fixed by default in R 3.6.0. (Though it's a pretty subtle bug unless you're using the sample() function with really huge populations.)
You are generally better off encouraging people to use the most recent release of R (except occasionally x.y.0 releases, which sometimes introduce new bugs). It's also a bad idea to save the workspace, because that will cause R to retain the old or non-default RNGs.

R version 4.0.1 does not analyse my data properly [duplicate]

I've recently updated to R 4.0.0 from R 3.5.1. The behaviour of read.csv seems to have changed - when I load .csv files in R 4.0.0 factors are not automatically detected, instead being recognised as characters. I'm also still running 3.5.1 on my machine, and when loading the same files in 3.5.1 using the same code, factors are recognised as factors. This is somewhat suboptimal.
Any suggestions?
I'm running Windows 10 Pro and create .csv files in Excel 2013.
As Ronak Shah said in a comment to your question, R 4.0.0 changed the default behaviour of how read.table() (and its wrappers, including read.csv()) treats character vectors. There has been a long debate over that issue, but basically stringsAsFactors = TRUE had been the default since the inception of R because it helped to save memory, due to the way factor variables are implemented in R (essentially an integer vector with factor-level information added on top). There is less of a reason to do that nowadays, since memory is much more abundant, and this option often produced unintended side effects.
You can read more about your particular issue, and also about other peculiarities of vectors in R, in Chapter 3 of Advanced R by Hadley Wickham. There he links two articles that go into great detail on why the default behaviour was the way it was.
Here is one and here is another. I would also suggest that you check out Hadley's book if you already have some experience with R; it helped me very much to learn some of the less obvious features of the language.
As everyone here said, the default behaviour has changed in R 4.0.0 and strings aren't automatically converted to factors anymore. This affects various functions, including read.csv() and data.frame(). However, some functions that are explicitly made to work with factors are not affected. These include expand.grid() and as.data.frame.table().
One way you can bypass this change is by setting a global option:
options(stringsAsFactors = TRUE)
But this option is also deprecated, and eventually you will have to convert strings to factors manually.
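A sketch of the manual alternatives (the file name and column are illustrative):
# opt in explicitly for a single read
df <- read.csv("data.csv", stringsAsFactors = TRUE)
# or read as character and convert the relevant columns yourself
df <- read.csv("data.csv")
df$group <- factor(df$group)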
The main reason for this decision seems to be reproducibility. Automatic string-to-factor conversion produces factor levels, and those levels can depend on the locale used by the system. Hence if you are in Russia and share a script with automatically converted factors with a friend in Japan, they might end up with a different order of factor levels.
You can read more about this in the stringsAsFactors post by Kurt Hornik on The R Blog.

RStudio keeps on running code chunk with no output

I was running spatstat's envelope() to generate simulation samples; however, it got stuck and did not run. So I attempted to close the application, but failed.
RStudio diagnostic log
Additional error message:
This application has requested the Runtime to terminate it in an
unusual way. Please contact the application's support team for more
information
There are several typing errors in the command shown in the question. The argument rank should be nrank and the argument glocal should be global. I will assume that these were typed correctly when you ran the command.
Since global=TRUE this command will generate 2 * nsim = 198 realisations of a completely random pattern and calculate the L function for each one of them. In my experience it should take only a minute or two to compute this, unless the geometry of the window is very complicated. One hour is really extraordinary.
So I'm guessing either you have a very complicated window (so that the edge correction calculation is taking a long time) or that RStudio is hanging somehow.
Try setting correction="border" or correction="none" and see if that makes it run faster. (These are the fastest choices.) If that works, then read the help for Lest or Kest about edge corrections, and choose an edge correction that you like. If not, then try running the same command in R instead of RStudio.
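A sketch of that suggestion (assuming a point pattern X and the nsim = 99 implied by the question; extra arguments such as correction are passed through to Lest):
library(spatstat)
E <- envelope(X, Lest, nsim = 99, global = TRUE, correction = "border")
plot(E)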

Behaviour space in Netlogo crashes when using extension R

I am making a simulation with NetLogo and extension R. I have made a supply chain model, where I have distributors and consumers. Consumers provide orders to distributors and distributors forecast future demand and places orders to suppliers in advance to fulfill market demand. The forecast is implemented with extension R (https://ccl.northwestern.edu/netlogo/docs/r.html) by calling elmNN package. The model works fine when simply using "go".
However, when I want to conduct experiments using BehaviorSpace, I keep getting errors. If I set only a few ticks with BehaviorSpace, the model works fine. But when I want to run a few hundred ticks, BehaviorSpace keeps crashing. For example: "Extension exception: Error in R-extension: error in eval, operator is invalid for atomic vector", "Extension exception: Error in R-extension: error in eval: cannot have attributes on CHARSXP". Sometimes BehaviorSpace simply crashes without any error.
I assume that the errors are related to compatibility issues between NetLogo, R, the R extension and Java. I am using NetLogo 5.3.1, 64-bit; R 3.3.3, 64-bit; rJava 0.9-8.
Model example: https://www.youtube.com/watch?v=zjQpPBgj0A8
A similar question was posted previously, but it has no answer: NetLogo BehaviorSpace crashing when using R extension
The problem was a programming style that is not suitable for BehaviorSpace. BehaviorSpace runs simulations in parallel, so some variables were being overwritten with new information mid-run. When I set "Simultaneous runs in parallel" to 1 in BehaviorSpace, everything worked fine.

Numerical method produces platform dependent results

I have a rather complicated issue with my small package. Basically, I'm building a GARCH(1,1) model with the rugarch package, which is designed exactly for this purpose. It uses a chain of solvers (provided by Rsolnp and nloptr, general-purpose nonlinear optimisation) and works fine. I'm testing my method with testthat against a benchmark solution, which was obtained previously by manually running the code under Windows (the main platform the package will be used on).
Now, I initially had some issues where the solution was not consistent across several consecutive runs. The difference was within the tolerance I specified for the solver (default solver = 'hybrid', as recommended by the documentation), so my guess was that it uses some sort of randomization. So I eliminated both the random seed and parallelization ("legitimate" reasons for differences), and the issue was solved: I'm getting identical results every time under Windows, so I run R CMD check and testthat succeeds.
After that I decided to automate a little, and now the build process is controlled by Travis. To my surprise, the result under Linux is different from my benchmark; the log states that
read_sequence(file_out) not equal to read_sequence(file_benchmark)
Mean relative difference: 0.00000014688
Rebuilding several times yields the same result, and the difference is always the same, which means that under Linux the solution is also consistent. As a temporary fix, I'm setting a tolerance limit depending on the platform, and the test passes (see latest builds).
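A minimal sketch of that platform-dependent tolerance (read_sequence, file_out and file_benchmark are the helpers from the log above; the tolerance values are illustrative):
tol <- if (.Platform$OS.type == "windows") 1e-12 else 1e-6
testthat::expect_equal(read_sequence(file_out), read_sequence(file_benchmark), tolerance = tol)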
So, to sum up:
A numeric procedure produces identical output on both Windows and Linux platforms separately;
However, these outputs are different and are not caused by random seeds and/or parallelization;
I generally only care about supporting Windows and do not plan a public release, so this is not a big deal for my package per se. But I'm bringing it to attention, as there may be an issue with one of these widely used solvers.
And no, I'm not asking you to fix my code: a platform-dependent tolerance is quite ugly, but it does the job so far. The questions are:
Is there anything else that can "legitimately" (or "naturally") lead to the described difference?
Are low-level numeric routines required to produce identical results on all platforms? Can it happen I'm expecting too much?
Should I care a lot about this? Is this a common situation?
