Package rgp in R gives "NaNs produced" output that cannot be suppressed

I've been using the package rgp (genetic programming) in R to predict survival on the Titanic. The input data frame, train, is the training data from Kaggle with the Sex variable recoded to 0 for females and 1 for males. Given this, the code is:
fs <- functionSet("+", "-", "*", "exp", "sin", "cos")
ivs <- inputVariableSet("Sex")
cfs <- constantFactorySet(function() rnorm(1))
ff <- function(f) {
  err <- rmse(as.numeric(f(train$Sex)), as.numeric(train$Survived))
  if (is.na(err)) return(1e12) else return(err)
}
set.seed(42)
gp <- geneticProgramming(functionSet = fs,
                         inputVariables = ivs,
                         constantSet = cfs,
                         fitnessFunction = ff,
                         stopCondition = makeTimeStopCondition(60),
                         verbose = FALSE)
The issue is that the geneticProgramming function outputs "NaNs produced" frequently. In fact, the output is enough to hang R and RStudio if I run it long enough (e.g. overnight). (The actual code I want to run is more involved and "NaNs produced" prints much more frequently, but this shortened version suffices to show the issue.)
I would like to suppress all output from geneticProgramming. The verbose = FALSE option doesn't address the "NaNs produced" output.
I have tried using capture.output, sink, and invisible. I've tried echo = FALSE in an R Markdown chunk. All to no avail. I would have thought sink("/dev/null") would solve the problem, but it doesn't.
Where is this output coming from, and how can I suppress it if the usual methods don't work?
Thanks very much.

In your functionSet, remove "sin" and "cos". With those removed, the NaNs are no longer produced.
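For reference, that is the only change needed (everything else as in your snippet). The "NaNs produced" text is an R warning rather than regular output: the evolved expressions compose these primitives, and something like sin(exp(x)) overflows exp() to Inf, after which sin(Inf) returns NaN with exactly that warning. A minimal sketch:
fs <- functionSet("+", "-", "*", "exp")
# sanity check of the mechanism:
sin(Inf)  # NaN, with warning "NaNs produced"
Because warnings go to the message stream rather than stdout, sink("/dev/null") and capture.output don't intercept them; if you ever need to keep "sin" and "cos", wrapping the call as suppressWarnings(geneticProgramming(...)) should silence them instead.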


R package development: tests pass in console, but fail via devtools::test()

I am developing an R package that calls functions from the package rstan. As an MWE, my test file is currently set up like this, using code taken verbatim from rstan's example:
library(testthat)
library(rstan)
# stan's own example
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
# I added this line, and it's the culprit
summary(fit)$summary
When I run this code in the console or via the "Run Tests" button in RStudio, no errors are thrown. However, when I run devtools::test(), I get:
Error (test_moments.R:11:1): (code run outside of `test_that()`)
Error in `summary(fit)$summary`: $ operator is invalid for atomic vectors
and this error is definitely not occurring upstream of that final line of code, because removing the final line allows devtools::test() to run without error. I am running up-to-date packages devtools and rstan.
It seems that devtools::test evaluates the test code in a setting where S4 dispatch does not work in the usual way, at least for packages that you load explicitly in the test file (in this case rstan). As a result, summary dispatches to summary.default instead of the S4 method implemented in rstan for class "stanfit".
The behaviour that you're seeing might relate to this issue on the testthat repo, which seems unresolved.
Here is a minimal example that tries to illuminate what is happening, showing one possible (admittedly inconvenient) work-around.
pkgname <- "foo"
usethis::create_package(pkgname, rstudio = FALSE, open = FALSE)
setwd(pkgname)
usethis::use_testthat()
path_to_test <- file.path("tests", "testthat", "test-summary.R")
text <- "test_that('summary', {
library('rstan')
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
expect_identical(class(fit), structure('stanfit', package = 'rstan'))
expect_true(existsMethod('summary', 'stanfit'))
x <- summary(fit)
expect_error(x$summary)
expect_identical(x, summary.default(fit))
print(x)
f <- selectMethod('summary', 'stanfit')
y <- f(fit)
str(y)
})
"
cat(text, file = path_to_test)
devtools::test(".") # all tests pass
If your package actually imports rstan (in the NAMESPACE sense, not in the DESCRIPTION sense), then S4 dispatch seems to work fine, presumably because devtools loads your package and its dependencies in a "proper" way before running any tests.
cat("import(rstan)\n", file = "NAMESPACE")
newtext <- "test_that('summary', {
stancode <- 'data {real y_mean;} parameters {real y;} model {y ~ normal(y_mean,1);}'
mod <- stan_model(model_code = stancode, verbose = TRUE)
fit <- sampling(mod, data = list(y_mean = 0))
x <- summary(fit)
f <- selectMethod('summary', 'stanfit')
y <- f(fit)
expect_identical(x, y)
})
"
cat(newtext, file = path_to_test)
## You must restart your R session here. The current session
## is contaminated by the previous call to 'devtools::test',
## which loads packages without cleaning up after itself...
devtools::test(".") # all tests pass
If your test is failing and your package imports rstan, then something else may be going on, but it is difficult to diagnose without a minimal version of your package.
Disclaimer: Going out of your way to import rstan to get around a relatively obscure devtools issue should be considered more of a hack than a fix, and documented accordingly...

How to input matrix data into brms formula?

I am trying to pass matrix data into the brm() function to run a signal regression. brm() is from the brms package, which provides an interface for fitting Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot supply my matrices through the 'data' parameter because it only accepts a single data.frame object.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
# 100 rows, 4 columns; each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10), runif(100,1,10), runif(100,1,10), runif(100,1,10))
# Assign names to the columns
names(Data) <- c("d0_10", "d0_100", "d0_1000", "d0_10000")
Data$Density <- as.matrix(Data) %*% c(-1, 10, 5, 1)
# the coefficients we are modelling
d <- c(-1, 10, 5, 1)
# A matrix with 4 columns holding the evaluation points 10, 100, 1000, 10000; every row repeats the same values
Bins <- 10^matrix(rep(1:4, times = dim(Data)[1]), ncol = 4, byrow = TRUE)
Bins
As mentioned above, since 'data' only accepts one data.frame object, I've tried other ways of supplying my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins, by = as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")])), data = Data)
#Error in is(sexpr, "try-error") :
argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins, by = donuts), data = Data,
                             data2 = list(Bins = 10^matrix(rep(1:4, times = dim(Data)[1]), ncol = 4, byrow = TRUE),
                                          donuts = as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")])))
#Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had its own errors, and troubleshooting them was difficult because I couldn't find answers or examples online of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins, by = as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")]), k = 3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms <- brm(
  Density ~ s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
              by = cbind(d0_10, d0_100, d0_1000, d0_10000),
              k = 3),
  data = Data %>%
    mutate(d0_10_bin = 10,
           d0_100_bin = 100,
           d0_1000_bin = 1000,
           d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
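For example (an untested sketch; bin_values and Data_binned are names I've made up), the *_bin columns could be generated programmatically instead of typed out:
# one "<col>_bin" column per signal column, filled with its evaluation point
bin_values <- c(d0_10 = 10, d0_100 = 100, d0_1000 = 1000, d0_10000 = 10000)
Data_binned <- Data
for (nm in names(bin_values)) {
  Data_binned[[paste0(nm, "_bin")]] <- bin_values[[nm]]
}
Passing data = Data_binned to brm() should then be equivalent to the mutate() call above.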
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

GLM model runs in interactive code but not when I use knitr

I'm having issues with knitr.
Specifically, I have a model which runs absolutely fine in the console, but when I try to knit the document, R throws an error.
Load the dataset (available here to facilitate replication):
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE))
scabies$agegroups <- relevel(scabies$agegroups, ref = "21+")
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
scabies <- scabies %>%
  mutate(scabies = case_when(scabies_infestation == "yes" ~ 1,
                             scabies_infestation == "no"  ~ 0)) %>%
  mutate(impetigo = case_when(impetigo_active == "yes" ~ 1,
                              impetigo_active == "no"  ~ 0))
Fit the model:
scabiesrisk <- glm(scabies ~ agegroups + gender + house_cat, data = scabies, family = binomial())
scabiesrisk_OR <- exp(cbind(OR = coef(scabiesrisk), confint(scabiesrisk)))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
This code runs absolutely fine in the Console.
But when I try knitr I get:
Error in model.frame.default(formula = scabies ~ agegroups + gender + :
  invalid type (list) for variable 'scabies'
Calls: ... glm -> eval -> eval -> <Anonymous> -> model.frame.default
I was able to reproduce the problem you describe, but haven't yet fully understood what happens under the hood.
This R Markdown chunk is interesting:
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```
If I manually execute the lines in the chunk quickly, one after another (Ctrl+Enter x 4), I sometimes get two profiling messages:
Waiting for profiling to be done...
Waiting for profiling to be done...
In this case, summary(scabiesrisk) returns a matrix:
> class(scabiesrisk_summary)
[1] "matrix" "array"
If I execute the lines in the chunk slowly, I get only one profiling message:
Waiting for profiling to be done...
summary(scabiesrisk) returns a summary.glm:
> class(scabiesrisk_summary)
[1] "summary.glm"
It looks as though profiling is launched on a separate thread, and the summary function behaves differently depending on whether profiling has finished. If it has, summary returns the expected summary.glm object; if it hasn't, it launches another profiling run and returns a matrix.
In particular, with a matrix, scabiesrisk_summary$coefficients isn't available, and in this situation I get the following error message:
Error in scabiesrisk_summary$coefficients :
$ operator is invalid for atomic vectors
This could possibly also happen while knitting: does knitting overhead slow profiling down enough for the problem to occur?
With the workaround found here (use confint.default instead of confint), I wasn't able to reproduce the above problem:
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint.default((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
OR 2.5 % 97.5 % Estimate Std. Error
(Intercept) 0.09357141 0.06984512 0.1253575 -2.3690303 0.1492092
agegroups0-10 2.20016940 1.60953741 3.0075383 0.7885344 0.1594864
agegroups11-20 2.53291768 1.79985894 3.5645415 0.9293719 0.1743214
gendermale 1.44749159 1.13922803 1.8391682 0.3698321 0.1221866
house_cat6-10 1.30521927 1.02586104 1.6606512 0.2663710 0.1228792
house_cat10+ 1.17003712 0.67405594 2.0309692 0.1570355 0.2813713
z value Pr(>|z|)
(Intercept) -15.8772359 9.110557e-57
agegroups0-10 4.9442116 7.645264e-07
agegroups11-20 5.3313714 9.747386e-08
gendermale 3.0267824 2.471718e-03
house_cat6-10 2.1677478 3.017788e-02
house_cat10+ 0.5581076 5.767709e-01
So you could also probably try this in your case.
Unlike confint.default, which is a directly readable R function, confint is an S3 generic that dispatches on the class of its argument (thanks @Ben Bolker for the internal references in comments), and I haven't yet investigated further what could explain this surprising behaviour.
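You can see the difference at the prompt; a quick sketch of the dispatch involved:
confint            # the generic: its body is essentially just UseMethod("confint")
confint.default    # a readable function computing plain normal-approximation intervals
methods(confint)   # lists the class-specific methods; for a glm object, dispatch goes
                   # to a profiling method, hence "Waiting for profiling to be done..."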
Another option seems to be saving the combined result in a different variable instead of overwriting scabiesrisk_summary.
I tried hard but was never able to reproduce the problem after doing so:
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_final <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_final
```
I strongly suspect that you forgot to include library(tidyverse) in your script. If tidyverse is loaded, then your code works fine. If it's not:
- the step where you try to mutate() (and use %>%) fails, so the scabies variable is never created within the scabies data set;
- glm(scabies ~ ...) then interprets the response variable scabies as being the whole data set, and complains that the response variable is "invalid type (list)".
For this reason it's good practice to avoid having variables within data frames that have the same name as the data frames themselves ...
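A minimal illustration of that failure mode, with hypothetical names (d and resp are made up):
d <- data.frame(x = rnorm(10))
resp <- data.frame(z = 1:10)   # a data frame in the workspace, not a column of d
try(glm(resp ~ x, data = d, family = binomial()))
## Error in model.frame.default(...) : invalid type (list) for variable 'resp'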
Your data transformation steps can be cleaned up a little bit: as.factor() is redundant; you can do all of the transformations as steps within a single mutate() call; and as.numeric(x == "yes") is a shorter way to turn a string into a 0/1 variable. If I were going to do a lot more of this I would write a custom mycut() function that took breakpoints and a desired reference level as input arguments, constructed custom labels, and did the releveling (see the sketch after the code below).
library(tidyverse)
scabies <- (read.csv(file = "S1-Dataset_CSV.csv") %>%
    mutate(agegroups = cut(age, c(0, 10, 20, Inf),
                           labels = c("0-10", "11-20", "21+"),
                           include.lowest = TRUE),
           agegroups = relevel(agegroups, ref = "21+"),
           house_cat = cut(house_inhabitants, c(0, 5, 10, Inf),
                           labels = c("0-5", "6-10", "10+"),
                           include.lowest = TRUE),
           house_cat = relevel(house_cat, ref = "0-5"),
           scabies = as.numeric(scabies_infestation == "yes"),
           impetigo = as.numeric(impetigo_active == "yes"))
)
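Here is a sketch of the mycut() helper mentioned above (hypothetical, with the label construction kept deliberately simple):
mycut <- function(x, breaks, ref) {
  lo <- head(breaks, -1)
  hi <- tail(breaks, -1)
  ## build labels like "0-10", "10-20", "20+" from the breakpoints
  labs <- ifelse(is.finite(hi), paste0(lo, "-", hi), paste0(lo, "+"))
  relevel(cut(x, breaks, labels = labs, include.lowest = TRUE), ref = ref)
}
scabies$agegroups <- mycut(scabies$age, c(0, 10, 20, Inf), ref = "20+")
scabies$house_cat <- mycut(scabies$house_inhabitants, c(0, 5, 10, Inf), ref = "0-5")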

R mlogit package: use LAPACK instead of LINPACK

I am estimating a fairly simple McFadden choice model using a very large data set (101.6 million unit-alternatives). I can estimate this model just fine in Stata using the asclogit command, but when I try to use the mlogit package in R, I get the following error:
region1 <- mlogit(chosen ~ mean_log.wage + mean_log.rent + bornNear + Dim.1 + regionFE | 0,
                  shape = "long", chid.var = "chid", alt.var = "alternatives", data = ready)
Error in qr.default(na.omit(X)) : too large a matrix for LINPACK
Calls: mlogit ... model.matrix -> model.matrix.mFormula -> qr -> qr.default
If I look at the source code of qr.R it's clear that the number of elements in my design matrix is too big relative to the LINPACK limit of 2,147,483,647. However, no such limit exists for LAPACK (that I can tell, at least).
From qr.R:
qr.default <- function(x, tol = 1e-07, LAPACK = FALSE, ...)
{
    x <- as.matrix(x)
    if(is.complex(x))
        return(structure(.Internal(La_qr_cmplx(x)), class = "qr"))
    ## otherwise :
    if(LAPACK)
        return(structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr"))
    ## else "Linpack" case:
    p <- as.integer(ncol(x))
    if(is.na(p)) stop("invalid ncol(x)")
    n <- as.integer(nrow(x))
    if(is.na(n)) stop("invalid nrow(x)")
    if(1.0 * n * p > 2147483647) stop("too large a matrix for LINPACK")
    ...
qr() appears to be called in the mFormula method of mlogit, when model.matrix is being created, and probably while checking NAs. But I can't tell if there is a way to pass LAPACK = TRUE to mlogit, or if there is a way to skip the NA checking.
I'm hoping @YvesCroissant will see this.
As I mentioned, I can estimate this model just fine in Stata, so it's not a question of resources. My Stata license is not portable, however, which is why I would like to use R.
Thanks to Julius' comment and this post on namespaces in R, I figured out the answer. I added the following code right after my library statements:
source("mymFormula.R")
tmpfun <- get("model.matrix.mFormula", envir = asNamespace("mlogit"))
environment(mymFormula) <- environment(tmpfun)
attributes(mymFormula) <- attributes(tmpfun) # don't know if this is really needed
assignInNamespace("model.matrix.mFormula", mymFormula, ns="mlogit")
mymFormula.R is an R script where I copy/pasted the contents of mlogit:::model.matrix.mFormula, added mymFormula <- before the function definition at the top of the file, and changed the internal qr() call to pass LAPACK = TRUE, which is what sidesteps the LINPACK limit.
I viewed the contents of mlogit:::model.matrix.mFormula by typing trace(mlogit:::model.matrix.mFormula, edit=TRUE) in RStudio. (Thanks to this answer for help on how to do that.)
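For concreteness, here is a hypothetical skeleton of mymFormula.R; the body is mlogit's own code obtained as just described, so only the assignment and the QR step differ:
## paste the body (and exact signature) printed by
## trace(mlogit:::model.matrix.mFormula, edit = TRUE) here
mymFormula <- function(object, ...) {
  ## ... copied body, unchanged except that the QR decomposition
  ## requests LAPACK, i.e. qr(na.omit(X), LAPACK = TRUE)
  ## instead of qr(na.omit(X)), avoiding the LINPACK element limit ...
}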

Error message in R: if (nomZ %in% coded) { : argument is of length zero

I'm very new to R (and Stack Overflow). I've been trying to conduct a simple slopes analysis for my continuous x dichotomous regression model using lmres() and simpleSlope() from the pequod package.
My variables:
SLS - continuous DV
csibdiff - continuous predictor (I already manually centered this variable in an earlier step)
culture - dichotomous moderator
newmod <- lmres(SLS ~ csibdiff*culture, data = sibdat2)
newmodss <- simpleSlope(newmod, pred = "csibdiff", mod1 = "culture")
However, after running the simpleSlope function, I get this error message:
Error in if (nomZ %in% coded) { : argument is of length zero
I don't understand the nomZ part, but I assume something is wrong with my variables. What does this mean? I don't have anything named nomZ in my data at all. None of my variables are null (I checked them with is.null()), and I don't seem to have accidentally deleted the contents of the variable (I checked with table()).
If anyone can suggest another function/package I could use for a simple slopes analysis, I'd appreciate that as well. I've been stuck on this problem for a few days now.
EDIT: I subsetted the relevant variables into a csv file.
https://www.dropbox.com/s/6j82ky457ctepkz/sibdat2.csv?dl=0
tl;dr it looks like the authors of the package were thinking primarily about continuous moderators; if you specify mod1="cultureEuropean" (i.e. to match the name of the corresponding parameter in the output) the function returns an answer (I have no idea if it's sensible or not ...)
It would be a service to the community to let the maintainers of the pequod package (maintainer("pequod")) know about this issue ...
Read data and replicate error:
sibdat2 <- read.csv("sibdat2.csv")
library(pequod)
newmod <- lmres(SLS ~ csibdiff*culture, data=sibdat2)
newmodss <- simpleSlope(newmod, pred="csibdiff", mod1="culture")
Check the data:
summary(sibdat2)
We do have some NA values in csibdiff, so try removing these ...
sibdat2B <- na.omit(sibdat2)
But that doesn't actually help (same error as before).
Plot the data to check for other strangeness:
library(ggplot2); theme_set(theme_bw())
ggplot(sibdat2B, aes(csibdiff, SLS, colour = culture)) +
  stat_sum(aes(size = factor(..n..))) +
  geom_smooth(method = "lm")
There's not much going on here, but nothing obviously wrong either ...
Use traceback() to see approximately where the problem is:
traceback()
3: simple.slope(object, pred, mod1, mod2, coded)
2: simpleSlope.default(newmod, pred = "csibdiff", mod1 = "culture")
1: simpleSlope(newmod, pred = "csibdiff", mod1 = "culture")
We could use options(error=recover) to jump right to the scene of the crime, but let's try step-by-step debugging instead ...
debug(pequod:::simple.slope)
As we go through we can see this:
nomZ <- names(regr$coef)[pos_mod]
nomZ ## character(0)
And looking a bit farther back we can see that pos_mod is also a zero-length integer. Farther back, we see that the code is looking through the parameter names (row names of the variance-covariance matrix) for the name of the modifier ... but it's not there.
debug: pos_pred_mod1 <- fI + grep(paste0("\\b", mod1, "\\b"), jj[(fI +
1):(fI + fII)])
Browse[2]> pos_mod
## integer(0)
Browse[2]> jj[1:fI]
## [[1]]
## [1] "(Intercept)"
##
## [[2]]
## [1] "csibdiff"
##
## [[3]]
## [1] "cultureEuropean"
Browse[2]> mod1
## [1] "culture"
The solution is to tell simpleSlope to look for a variable that is there ...
(newmodss <- simpleSlope(newmod, pred="csibdiff", mod1="cultureEuropean"))
## Simple Slope:
## simple slope standard error t-value p.value
## Low cultureEuropean (-1 SD) -0.2720128 0.2264635 -1.201133 0.2336911
## High cultureEuropean (+1 SD) 0.2149291 0.1668690 1.288011 0.2019241
We do get some warnings about NaNs produced -- you'll have to dig farther yourself to see if you need to worry about them.
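More generally, a quick way to find the string simpleSlope expects is to inspect the fitted parameter names first; a sketch, assuming the lmres fit stores its coefficients in $coef as the debugging trace above suggests:
names(newmod$coef)   # should include "cultureEuropean"; pass that exact string as mod1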
