R package development: tests pass in console, but fail via devtools::test() - r

I am developing an R package that calls functions from the package rstan. As a MWE, my test file is currently set up like this, using code taken verbatim from rstan's example:
library(testthat)
library(rstan)
# stan's own example
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
# I added this line, and it's the culprit
summary(fit)$summary
When I run this code in the console or via the "Run Tests" button in RStudio, no errors are thrown. However, when I run devtools::test(), I get:
Error (test_moments.R:11:1): (code run outside of `test_that()`)
Error in `summary(fit)$summary`: $ operator is invalid for atomic vectors
and this error is definitely not occurring upstream of that final line of code, because removing the final line allows devtools::test() to run without error. I am running up-to-date packages devtools and rstan.

It seems that devtools::test evaluates the test code in a setting where S4 dispatch does not work in the usual way, at least for packages that you load explicitly in the test file (in this case rstan). As a result, summary dispatches to summary.default instead of the S4 method implemented in rstan for class "stanfit".
The behaviour that you're seeing might relate to this issue on the testthat repo, which seems unresolved.
Here is a minimal example that tries to illuminate what is happening, showing one possible (admittedly inconvenient) work-around.
pkgname <- "foo"
usethis::create_package(pkgname, rstudio = FALSE, open = FALSE)
setwd(pkgname)
usethis::use_testthat()
path_to_test <- file.path("tests", "testthat", "test-summary.R")
text <- "test_that('summary', {
library('rstan')
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
expect_identical(class(fit), structure('stanfit', package = 'rstan'))
expect_true(existsMethod('summary', 'stanfit'))
x <- summary(fit)
expect_error(x$summary)
expect_identical(x, summary.default(fit))
print(x)
f <- selectMethod('summary', 'stanfit')
y <- f(fit)
str(y)
})
"
cat(text, file = path_to_test)
devtools::test(".") # all tests pass
If your package actually imports rstan (in the NAMESPACE sense, not in the DESCRIPTION sense), then S4 dispatch seems to work fine, presumably because devtools loads your package and its dependencies in a "proper" way before running any tests.
cat("import(rstan)\n", file = "NAMESPACE")
newtext <- "test_that('summary', {
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
x <- summary(fit)
f <- selectMethod('summary', 'stanfit')
y <- f(fit)
expect_identical(x, y)
})
"
cat(newtext, file = path_to_test)
## You must restart your R session here. The current session
## is contaminated by the previous call to 'devtools::test',
## which loads packages without cleaning up after itself...
devtools::test(".") # all tests pass
If your test is failing and your package imports rstan, then something else may be going on, but it is difficult to diagnose without a minimal version of your package.
Disclaimer: Going out of your way to import rstan to get around a relatively obscure devtools issue should be considered more of a hack than a fix, and documented accordingly...

Related

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress with each function call. So I instead decided to use pbmclapply() from the pbmcapply package in order to have a progress bar each time I run my simulations (pbmclapply() is specifically created as a wrapper for mclapply(), so they should have the same functionality except for the progress bar).
I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.
For example, this is using mcapply():
## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
And this is the same code using pbmclapply():
## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).
What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!
The issue is that in the utils.R file within the pbmcapply package it runs the following line:
if (isTRUE(mc.set.seed))
mc.set.stream()
If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:
if (mc.set.seed)
mc.reset.stream()
This affects the results as reset stream will allow the code to be run from the globally set seed, whereas running set stream sets it to the a new random starting value using the initial seed. We can see this in the functions attached below:
mc.reset.stream <- function ()
{
if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
sample.int(1L)
# HERE! sets the seed to the global seed variable we set
assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv,
inherits = FALSE), envir = RNGenv)
}
}
mc.set.stream <- function ()
{
if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv),
envir = .GlobalEnv)
}
else {
if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
}
}
I believe this change may be due to an issue with mclapply when you want to call the mclapply function more than once after setting the seed it will use the same random numbers. (i.e. by resetting the r session you should get the same results in the same order with pbmclapply so first time I get 0.143 then 0.064 and then -0.015). This is usually the preferred behaviour so you can call the funciton multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.
The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel funciton definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks also) and the change in setting the random seed. (Additionally note these functions will no longer run on a Windows machine only Linux or MacOS).
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function() {
pkg <- asNamespace('pbmcapply')
.cleanup <- get('.cleanup', pkg)
progressMonitor <- .customized_mcparallel({
mclapply(1:100, function(i) {
rnorm(1)
}, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
mc.cleanup = TRUE, mc.allow.recursive = TRUE)
})
# clean up processes on exit
on.exit(.cleanup(progressMonitor$pid), add = T)
# Retrieve the result
results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])
return(results)
}
.customized_mcparallel <- function (expr, name, detached = FALSE){
# loading hidden functions
pkg <- asNamespace('parallel')
mcfork <- get('mcfork', pkg)
mc.advance.stream <- get('mc.advance.stream', pkg)
mcexit <- get('mcexit', pkg)
mcinteractive <- get('mcinteractive', pkg)
sendMaster <- get('sendMaster', pkg)
mc.set.stream <- get('mc.set.stream', pkg)
mc.reset.stream <- get('mc.reset.stream', pkg)
f <- mcfork(F)
env <- parent.frame()
mc.advance.stream()
if (inherits(f, "masterProcess")) {
mc.set.stream()
# reset the group process id of the forked process
mcinteractive(FALSE)
sendMaster(try(eval(expr, env), silent = TRUE))
mcexit(0L)
}
f
}
x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)
For a complete remedy my best suggestion would be to either reimplement the functions in your own code (I copy pasted with some minor modifications to the functions from pbmcapply) or by forking the package and replacing the mc.set.seed in the utils.R file with mc.reset.seed. I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.
Great question and excellent answer by Joel Kandiah!
Another solution would be to put your code into an R-Markdown-File. Knitting the file will always gives the same result. But showing progress is more complicated. You could also simply run your code from the command line via Rscript:
Rscript yourfile.R
This will also give the same result every time because you always start fresh. It will display the progress and you can also redirect the output into a file. For a long running simulations calling Rscript is also more robust than working with a GUI.
Not sure if this is adequate for your needs, but still wanted to share this as it works very well for me and does not require changing pbmclapply.

GLM model runs in interactive code but not when I use knitr

I'm having issues with knitr.
Specifically, I have a model which runs absolutely fine in the console but when I try and knit the document, R throws an error.
Load the dataset (available here to facilitate replication )
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE))
scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
scabies <- scabies %>% mutate(scabies = case_when(scabies_infestation=="yes"~1,
scabies_infestation=="no"~0)) %>%
mutate(impetigo = case_when(impetigo_active=="yes" ~1,
impetigo_active=="no" ~0))
fit the model
scabiesrisk <- glm(scabies~agegroups+gender+house_cat,data=scabies,family=binomial())
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint(scabiesrisk)))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
This code runs absolutely fine in the Console.
But when I try knitr I get:
Error in model.frame.default(formula = scabies ~ agegroups + gender +
: invalid type(list) for variable 'scabies Calls: ... glm
-> eval -> eval -> -> model.frame.default
I was able to reproduce the problem you describe, but haven't yet fully understood what happens under the hood.
This Markdown chunck is interesting :
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```
If I manually quickly execute the lines in the chunck one after another (ctrl+Enter x 4), sometimes I get two profiling messages:
Waiting for profiling to be done...
Waiting for profiling to be done...
In this case, summary(scabiesrisk) is a matrix:
> class(scabiesrisk_summary)
[1] "matrix" "array"
If I manually slowly execute the lines in the chunk, I get only one profiling message:
Waiting for profiling to be done...
summary(scabiesrisk) is a summary.glm :
> class(scabiesrisk_summary)
[1] "summary.glm"
Looks like profiling is launched on a separate thread, and depending on whether it was finished or not, summary function doesn't have the same behaviour. If profiling is finished, it returns the expected summary.glm object, but if it isn't the case it launches another profiling and returns a matrix.
In particular, with a matrix scabiesrisk_summary$coefficients isn't available and I get in this situation the following error message:
Error in scabiesrisk_summary$coefficients :
$ operator is invalid for atomic vectors
This could possibly also happen while knitting : does knitting overhead make profiling slower so that the problem occurs?
With the workaround found here (use confint.defaultinstead of confint), I wasn't able to reproduce the above problem:
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint.default((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
OR 2.5 % 97.5 % Estimate Std. Error
(Intercept) 0.09357141 0.06984512 0.1253575 -2.3690303 0.1492092
agegroups0-10 2.20016940 1.60953741 3.0075383 0.7885344 0.1594864
agegroups11-20 2.53291768 1.79985894 3.5645415 0.9293719 0.1743214
gendermale 1.44749159 1.13922803 1.8391682 0.3698321 0.1221866
house_cat6-10 1.30521927 1.02586104 1.6606512 0.2663710 0.1228792
house_cat10+ 1.17003712 0.67405594 2.0309692 0.1570355 0.2813713
z value Pr(>|z|)
(Intercept) -15.8772359 9.110557e-57
agegroups0-10 4.9442116 7.645264e-07
agegroups11-20 5.3313714 9.747386e-08
gendermale 3.0267824 2.471718e-03
house_cat6-10 2.1677478 3.017788e-02
house_cat10+ 0.5581076 5.767709e-01
So you could also probably try this in your case.
Contrary to confint.defaut which is a directly readable R function, confint is a S3 dispatch method (thanks #Ben Bolker for the internal references in comments), and I didn't yet investigate further what could explain this surprising behaviour.
Another option seems to save scabiesrisk_summary in another variable.
I tried hard but was never able to reproduce the problem after doing so :
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_final <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_final
```
I strongly suspect that you forgot to include library(tidyverse) in your script. If tidyverse is loaded, then your code works fine. If it's not:
the step where you try to mutate() (and use %>%) fails, so the scabies variable is never created within the scabies data set
glm(scabies ~ ...) then interprets the response variable scabies as being the whole data set, and complains that the response variable is "invalid type(list)".
For this reason it's good practice to avoid having variables within data frames that have the same name as the data frames themselves ...
Your data transformation steps can be cleaned up a little bit (as.factor() is redundant; you can do all of the transformations as steps within a single mutate() call; as.numeric(x=="yes") is a shorter way to turn a string into a 0/1 variable ...) If I were going to do a lot more of this I would write a custom mycut() function that took breakpoints and a desired reference level as input arguments, constructed custom labels, and did the releveling.
library(tidyverse)
scabies <- (read.csv(file = "S1-Dataset_CSV.csv") %>%
mutate(agegroups <- cut(age, c(0,10,20,Inf),
labels = c("0-10","11-20","21+"),
include.lowest = TRUE),
agegroups = relevel(agegroups, ref = "21+"),
house_cat = cut(house_inhabitants, c(0,5,10,Inf),
labels = c("0-5","6-10","10+"),
include.lowest = TRUE),
house_cat = relevel(house_cat, ref = "0-5"),
scabies = as.numeric(scabies_infestation=="yes"),
impetigo = as.numeric(impetigo_active=="yes"))
)

R mlogit package: use LAPACK instead of LINPACK

I am estimating a fairly simple McFadden choice model using a very large data set (101.6 million unit-alternatives). I can estimate this model just fine in Stata using the asclogit command, but when I try to use the mlogit package in R, I get the following error:
region1 <- mlogit(chosen ~ mean_log.wage + mean_log.rent + bornNear + Dim.1 + regionFE | 0,
shape= "long", chid.var = "chid", alt.var = "alternatives", data = ready)
Error in qr.default(na.omit(X)) : too large a matrix for LINPACK
Calls: mlogit ... model.matrix -> model.matrix.mFormula -> qr -> qr.default
If I look at the source code of qr.R it's clear that the number of elements in my design matrix is too big relative to the LINPACK limit of 2,147,483,647. However, no such limit exists for LAPACK (that I can tell, at least).
From qr.R:
qr.default <- function(x, tol = 1e-07, LAPACK = FALSE, ...)
{
x <- as.matrix(x)
if(is.complex(x))
return(structure(.Internal(La_qr_cmplx(x)), class = "qr"))
## otherwise :
if(LAPACK)
return(structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr"))
## else "Linpack" case:
p <- as.integer(ncol(x))
if(is.na(p)) stop("invalid ncol(x)")
n <- as.integer(nrow(x))
if(is.na(n)) stop("invalid nrow(x)")
if(1.0 * n * p > 2147483647) stop("too large a matrix for LINPACK")
...
qr() appears to be called in the mFormula method of mlogit, when model.matrix is being created, and probably while checking NAs. But I can't tell if there is a way to pass LAPACK = TRUE to mlogit, or if there is a way to skip the NA checking.
I'm hoping #YvesCroissant will see this.
As I mentioned, I can estimate this model just fine in Stata, so it's not a question of resources. My Stata license is not portable, however, which is why I would like to use R.
Thanks to Julius' comment and this post on namespaces in R, I figured out the answer. I added the following code right after my library statements:
source("mymFormula.R")
tmpfun <- get("model.matrix.mFormula", envir = asNamespace("mlogit"))
environment(mymFormula) <- environment(tmpfun)
attributes(mymFormula) <- attributes(tmpfun) # don't know if this is really needed
assignInNamespace("model.matrix.mFormula", mymFormula, ns="mlogit")
mymFormula.R is an R script where I copy/pasted the contents of mlogit:::model.matrix.mFormula and added mymFormula <- before the function invocation at the top of the file.
I viewed the contents of mlogit:::model.matrix.mFormula by typing trace(mlogit:::model.matrix.mFormula, edit=TRUE) in RStudio. (Thanks to this answer for help on how to do that.)

getting error in mclust-package while working on univariate fit

While working on a univariate fit using Mclust I am getting following error:
Error in mstepE(data = as.matrix(data)[initialization$subset, ], z = z, :
row dimension of z should equal data length
I am using the code mentioned in:
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html#initialisation
This is the code section where I am getting error:
df1 <- dataSample
BIC <- NULL
for(j in 1:20){
rBIC <- mclustBIC(df1, verbose = T,
initialization = list(hcPairs = randomPairs(df1)))
BIC <- mclustBICupdate(BIC, rBIC)
}
summary(BIC)
Following link contains data to be passed to variable 'df1' (file name:dataSample.csv)
https://drive.google.com/open?id=0Bzau9RsRnQreYk9XOWVBSm91b2o4NTQ4RlA2UFdWbDBVOVpR
This is the solution I get from one of the Authors (Prof. Luca Scrucca) for 'mclust' library:
"there was a bug due to the use of automatic subset that clash when hcPairs are provided. I have fixed it in the current dev version of mclust.
Since submission to CRAN won't happen shortly, you may use the following code to avoid the error with the current release of mclust:
rBIC <- mclustBIC(df1, verbose = T,
initialization = list(hcPairs = randomPairs(df1),
subset = 1:NROW(df1)))
When the bug fix will be released, the subset argument could be omitted as it is redundant."
Now, the code is working fine.

Locked package namespace

I am using the "BMA" package in R 3.1.0, and get an error when running one of the functions in the package, iBMA.glm. When running the example in the package documentation:
## Not run:
############ iBMA.glm
library("MASS")
library("BMA")
data(birthwt)
y<- birthwt$lo
x<- data.frame(birthwt[,-1])
x$race<- as.factor(x$race)
x$ht<- (x$ht>=1)+0
x<- x[,-9]
x$smoke <- as.factor(x$smoke)
x$ptl<- as.factor(x$ptl)
x$ht <- as.factor(x$ht)
x$ui <- as.factor(x$ui)
### add 41 columns of noise
noise<- matrix(rnorm(41*nrow(x)), ncol=41)
colnames(noise)<- paste('noise', 1:41, sep='')
x<- cbind(x, noise)
iBMA.glm.out<- iBMA.glm( x, y, glm.family="binomial",
factor.type=FALSE, verbose = TRUE,
thresProbne0 = 5 )
summary(iBMA.glm.out)
I get the error:
Error in registerNames(names, package, ".__global__", add) :
The namespace for package "BMA" is locked; no changes in the global variables list may be made.
I get the error in RStudio running R 3.1.0 on Ubuntu.
on Windows 7, from RStudio and the R console I get a similar error:
Error in utils::globalVariables(c("nastyHack_glm.family", "nastyHack_x.df")) :
The namespace for package "BMA" is locked; no changes in the global variables list may be made.
I also get the same error when running my own data in the function. I'm not clear on what this error means and how to work around the error to be actually able to use the function. Any advice would be appreciated!

Resources