possible bug in `rbinom()` for large numbers of trials

I've been writing some code that iteratively performs binomial draws (using rbinom), and for some input arguments I can end up with the size being very large, which causes R (3.1.1, both the official and a Homebrew build tested, so it is unlikely to be compiler related) to return an unexpected NA. For example:
rbinom(1,2^32,0.95)
is what I'd expect to work, but it gives NA back. However, running it with size=2^31, or with prob ≤ 0.5, works.
The fine manual mentions that inversion is used when size < .Machine$integer.max is false; could this be the issue?

Looking at the source, rbinom does the equivalent (in C code) of the following for such large sizes:
qbinom(runif(n), size, prob, FALSE)
And indeed:
set.seed(42)
rbinom(1,2^31,0.95)
#[1] 2040095619
set.seed(42)
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619
However:
set.seed(42)
rbinom(1,2^32,0.95)
#[1] NA
set.seed(42)
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349
As @BenBolker points out, rbinom returns an integer, and if the return value is larger than .Machine$integer.max (e.g., larger than 2147483647 on my machine), NA gets returned. In contrast, qbinom returns a double. I don't know why, and it doesn't seem to be documented.
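For example, a quick way to see this integer coercion for yourself (a minimal illustration with base R, not from the original post):
.Machine$integer.max
## [1] 2147483647
as.integer(2^31 - 1)   # still representable as an integer
as.integer(2^31)       # NA, with a "NAs introduced by coercion" warning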
So, it seems like there is at least undocumented behavior and you should probably report it.

I agree that (in the absence of documentation saying otherwise) this is a bug. A reasonable workaround would be to use the Normal approximation, which should be very good indeed (and faster) for such large values. (I originally meant this to be short and simple, but it ended up getting a little out of hand.)
rbinom_safe <- function(n, size, prob, max.size = 2^31) {
    maxlen <- max(length(size), length(prob), n)
    prob <- rep(prob, length.out = maxlen)
    size <- rep(size, length.out = maxlen)
    res <- numeric(n)
    ## use the Normal approximation for sizes too big for rbinom
    bigvals <- size > max.size
    if (nbig <- sum(bigvals)) {
        m <- (size * prob)[bigvals]
        sd <- sqrt(size * prob * (1 - prob))[bigvals]
        res[bigvals] <- round(rnorm(nbig, mean = m, sd = sd))
    }
    ## everything else goes through rbinom as usual
    if (nbig < n) {
        res[!bigvals] <- rbinom(n - nbig, size[!bigvals], prob[!bigvals])
    }
    return(res)
}
set.seed(101)
size <- c(1,5,10,2^31,2^32)
rbinom_safe(5,size,prob=0.95)
rbinom_safe(5,3,prob=0.95)
rbinom_safe(5,2^32,prob=0.95)
The Normal approximation should work reasonably well whenever the mean count is many standard deviations away from both 0 and size, which for large sizes will be the case unless prob is very extreme. For example:
n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)
rr <- rbinom_safe(10000,n,prob=p)
hist(rr,freq=FALSE,col="gray",breaks=50)
curve(dnorm(x,mean=m,sd=sd),col=2,add=TRUE)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
lines(midpts, c(diff(sapply(dd, pbinom, size=n, prob=p))/diff(dd)[1]),
      col="blue", lty=2)

This is the intended behaviour, but there are two issues:
1) The NA induced by coercion should raise a warning
2) The fact that discrete random variables have storage mode integer should be documented.
I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.


How to solve nonlinear equations in R with Controls

I am trying to solve a system of nonlinear equations with constraints (controls) on the variables.
Here is my code:
library(rootSolve)  # for multiroot()
fun <- function(x) {
    b0 <- (0.64*1+(1-0.64)*x[1]*(x[2]*x[1]-1)+x[1]*1*(1-x[2])*x[3])/(x[1]-1) - 1805*2.85*0.64
    b1plus <- (0.64*1+(1-0.64)*x[1]*(x[2]*x[1]-1.01)+x[1]*1.01*(1-x[2])*x[3])/(1.01*(x[1]-1)) - 1805*2.85*0.64*(1+0.00235)
    b1minus <- (0.64*1+(1-0.64)*x[1]*(x[2]*x[1]-0.99)+x[1]*0.99*(1-x[2])*x[3])/(0.99*(x[1]-1)) - 1805*2.85*0.64*(1-0.00235)
    return(c(b0, b1plus, b1minus))
}
multiroot(fun, c(1.5, 0, 0))
However, the result I get is far from the actual solution. I want to constrain x[1] to the range (1.5, 4), x[2] to (0, 1), and x[3] to (0, 10000). How can I do that?
Thank you!!
Methods like 'Newton-Raphson' in multiroot or nleqslv do not work well together with bounds constraints. One possible approach is to square and sum the components of your function
fun1 <- function(x) sum(fun(x)^2)
and then treat this as a global optimization problem where you hope for a minimum value of 0.0. For example, GenSA provides an implementation of "simulated annealing" that works reasonably well in low dimensions.
library(GenSA)
res <- GenSA(par=NULL, fn=fun1,
             lower=c(1.5, 0, 0), upper=c(4, 1, 10000),
             control=list(maxit=10e5))
res$value; res$par
## [1] 119.7869
## [1] 4.00 0.00 2469.44
Several tries did not find a lower function value than this one, which makes me think there is no common root in the constraint box you requested.
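If you want to double-check that candidate, one option is to polish it with a box-constrained local optimiser and see whether the objective drops any further (a quick sketch on top of the GenSA result above; whether it improves anything will depend on the problem):
## Refine the GenSA candidate with L-BFGS-B inside the same bounds
res2 <- optim(par=res$par, fn=fun1, method="L-BFGS-B",
              lower=c(1.5, 0, 0), upper=c(4, 1, 10000))
res2$value; res2$par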

Vectorising two similar functions in R works for one

Today, I came across a problem: two almost identical functions work as intended before vectorisation, but after it, one works fine, and another one returns an error.
I am examining the robustness of various estimators with respect to different transformations of residuals and aggregating functions. Quantile Regression and Least Median of Squares are particular cases of what I am doing.
So I wrote the following code to see how the Least Trimean of Squares is going to work, and found out that it works fine if the model parameters are supplied as separate arguments, but fails if they come in a vector. For instance, I need the first form for plotting (it is convenient to use outer(...) to get a value matrix for persp, or simply to supply f(x, y) to persp3d from library(rgl)), and the second form for estimation (R optimisers expect a vector of inputs as the first argument, over which the minimisation is done).
MWE:
set.seed(105)
N <- 204
x <- rlnorm(N)
y <- 1 + x + rnorm(N)*sqrt(.1+.2*x+.3*x^2)
# A simple linear model with heteroskedastic errors
resfun <- function(x) return(x^2)
# Going to minimise a function of squared residuals...
distfun <- function(x) return(mean(quantile(x, c(0.25, 0.5, 0.5, 0.75))))
# ...which in this case is the trimean
penalty <- function(theta0, theta1) {
    r <- y - theta0 - theta1*x
    return(distfun(resfun(r)))
}
pen2 <- function(theta) {
    r <- y - theta[1] - theta[2]*x
    return(distfun(resfun(r)))
}
penalty(1, 1) # 0.5352602
pen2(c(1, 1)) # 0.5352602
vpenalty <- Vectorize(penalty)
vpen2 <- Vectorize(pen2)
vpenalty(1, 1) # 0.5352602
vpen2(c(1, 1))
Error in quantile.default(x, c(0.25, 0.5, 0.5, 0.75)) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Why does vpen2, being vectorised pen2, choke even on a single input?
As jogo pointed out, vpen2 vectorises over the elements of its input vector, so pen2 ends up being called once per element; inside each such call theta is a single number, theta[2] is NA, and quantile() then fails on the resulting missing values. The right way to go is to use something like
a <- matrix(..., ncol=2)
apply(a, 1, pen2)
This will return a vector of pen2 values, one for each row of the matrix.
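For example (the parameter values below are just made-up illustrations; pen2 picks up x and y from the MWE above):
## Evaluate pen2 for several (theta0, theta1) pairs at once
a <- matrix(c(1,   1,
              0.5, 1.2,
              2,   0.8), ncol=2, byrow=TRUE)
apply(a, 1, pen2)   # one trimean-of-squared-residuals value per row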

What is going on with floating point precision here?

This question is in reference to an observation from a code-golf challenge.
The submitted R solution is a working solution, but a few of us (maybe just me) seem to be dumbfounded as to why the initial X=m assignment is necessary.
The code was golfed down a bit by @Giuseppe, so I'll write a few comments for the reader.
function(m){
    X = m
    # Re-assign input m as X
    while (any(X - (X = X %*% m))) 0
    # Instead of doing the meat of the calculation in the code block after `while`,
    # the OP exploited its looping properties to perform the
    # calculation within the condition check.
    # `-` here is an abuse of the inequality check and relies on `any` to coerce
    # the numeric to logical. See `as.logical(.Machine$double.xmin)`.
    # The code repeatedly multiplies the matrix `X` by the starting matrix `m`
    # until the condition is met: X == X %*% m
    X
    # Return result
}
Well, as far as I can tell, multiplying X %*% m is equivalent to X %*% X, since X is just an iteratively self-multiplied version of m. Once the matrix has converged, multiplying in additional copies of m or X does not change its value. See a linear algebra textbook, or try v(m)%*%v(m)%*%v(m)%*%v(m)%*%v(m)%*%m%*%m after defining the above function as v. Fun, right?
So the question is, why does @CodesInChaos's implementation of this idea not work?
function(m){while(any(m!=(m=m%*%m)))0
m}
Is this caused by a floating-point precision issue? Or is it caused by a function in the code, such as the inequality check or .Primitive("any")? I do not believe it is caused by as.logical, since R seems to coerce errors smaller than .Machine$double.xmin to 0.
Here is a demonstration of the above. We simply loop and take the difference between m and m%*%m. This error should become 0 as the stochastic matrix converges; instead it seems to converge and then eventually go to 0/Inf, depending on the input.
mat = matrix(c(7/10, 4/10, 3/10, 6/10), 2, 2, byrow = T)
m = mat
for (i in 1:25) {
    m = m %*% m
    cat("Mean Error:", mean(m - (m = m %*% m)),
        "\n Float to Logical:", as.logical(m - (m = m %*% m)),
        "\n iter", i, "\n")
}
Some additional thoughts on why this is a floating-point math issue:
1) The loop indicates that this is probably not a problem with any or with the logical check/conversion step, but rather something to do with floating-point matrix math.
2) @user202729's comment in the original thread, that this issue persists in Jelly (a code-golf language), gives more credence to the idea that this is perhaps a floating-point issue.
The different methods iterate different functions, both starting with seed value m. Function iteration only converges to a given fixed point if that fixed point is stable and the seed is within the basin of attraction of that fixed point.
In the original code, you are iterating the function
f <- function(X) X %*% m
The limit matrix is a stable fixed point under the assumption (stated in the Code Golf problem) that a well-defined limit exists. Since the function definition depends on m, it isn't surprising that the fixed point is a function of m.
On the other hand, the proposed variation using m = m %*% m is obtained by iterating the function
g <- function(X) X %*% X
Note that all idempotent matrices are fixed points of this function, but clearly they can't all be stable fixed points. Apparently the limiting matrix of the original iteration is not a stable fixed point of g, even though it is a fixed point (see the small check below).
To really nail this down, you would need to get into the theory of matrix fixed points under function iteration to show why the fixed point in the case of g is unstable.
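As a small sanity check of the claim that idempotent matrices are fixed points of g (my own example, not from the original answer): any projection matrix is idempotent, and hence a fixed point.
v <- c(1, 2)
P <- v %*% t(v) / sum(v^2)   # projection onto span(v), so P %*% P equals P
g <- function(X) X %*% X     # the squaring iteration from above
isTRUE(all.equal(g(P), P))   # TRUE: P is a fixed point of g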
This is indeed a floating-point math issue. To see it, look at the results of this function:
test2 <- function(m) {
    c <- 0
    res <- list()
    while (any(m != (m = m %*% m))) {
        c <- c + 1
        res[[c]] <- m
    }
    print(c)
    res
}
To test equality with some tolerance, you can use:
test3 <- function(m) {
    while (!isTRUE(all.equal(m, m <- m %*% m))) 0
    m
}
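For instance, with the matrix from the question (a usage sketch; the exact printed values depend on floating-point details):
mat <- matrix(c(7/10, 4/10, 3/10, 6/10), 2, 2, byrow = TRUE)
test3(mat)   # terminates and returns the (approximate) limiting matrix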

alternative for wilcox.test in R

I'm trying to run a significance test using wilcox.test in R. Basically, I want to test whether a value x lies significantly within/outside a distribution d.
I'm doing the following:
d = c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
wilcox.test(60,d)
Wilcoxon rank sum test with continuity correction
data: 60 and d
W = 4.5, p-value = 0.5347
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(60, d) : cannot compute exact p-value with ties
and basically the p-value is the same for a big range of the numbers I test.
I've tried wilcox_test() from the coin package, but I can't get it to work testing a single value against a distribution.
Is there an alternative to this test that does the same and knows how to deal with ties?
How worried are you about the non-exact results? I would guess that the approximation is reasonable for a data set this size. (I did manage to get coin::wilcox_test working, and the results are not hugely different ...)
d <- c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
pfun <- function(x) {
    suppressWarnings(w <- wilcox.test(x, d)$p.value)
    return(w)
}
testvec <- 30:120
p1 <- sapply(testvec,pfun)
library("coin")
pfun2 <- function(x) {
    dd <- data.frame(y=c(x,d), f=factor(c(1, rep(2, length(d)))))
    return(pvalue(wilcox_test(y~f, data=dd)))
}
p2 <- sapply(testvec,pfun2)
library("exactRankTests")
pfun3 <- function(x) {wilcox.exact(x,d)$p.value}
p3 <- sapply(testvec,pfun3)
Picture:
par(las=1,bty="l")
matplot(testvec, cbind(p1,p2,p3), type="s",
        xlab="value", ylab="p value of wilcoxon test", lty=1,
        ylim=c(0,1), col=c(1,2,4))
legend("topright", c("stats::wilcox.test", "coin::wilcox_test",
                     "exactRankTests::wilcox.exact"),
       lty=1, col=c(1,2,4))
(exactRankTests added by request, but given that it's not maintained any more and recommends the coin package, I'm not sure how reliable it is. You're on your own for figuring out what the differences among these procedures are and which would be best to use ...)
The results make sense here -- the problem is just that your power is low. If your value is completely outside the range of the data, for n=15, that will be a probability of something like 2*(1/16)=0.125 [i.e. probability of your sample ending up as the first or the last element in a permutation], which is not quite the same as the minimum value here (wilcox.test: p=0.105, wilcox_test: p=0.08), but that might be an approximation issue, or I might have some detail wrong. Nevertheless, it's in the right ballpark.
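If you want to check that minimum yourself, something along these lines works (a quick sketch; 1000 is just an arbitrary value far outside the range of d):
## Smallest attainable p-value when the single observation lies beyond the range of d
suppressWarnings(wilcox.test(1000, d)$p.value)   # stats::wilcox.test
pfun2(1000)                                      # coin::wilcox_test, via pfun2 from above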
You can do this:
wilcox.test(60,d, exact=FALSE)

Why does glm.nb throw a "missing value" error only on very specific inputs

glm.nb throws an unusual error on certain inputs. While there are a variety of values that cause this error, changing the input even very slightly can prevent the error.
A reproducible example:
library(MASS)  # for glm.nb()
set.seed(11)
pop <- rnbinom(n=1000, size=1, mu=0.05)
glm.nb(pop~1, maxit=1000)
Running this code throws the error:
Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
missing value where TRUE/FALSE needed
At first I assumed that this had something to do with the algorithm not converging. However, I was surprised to find that changing the input even very slightly can prevent the error. For example:
pop[1000] <- pop[1000] + 1
glm.nb(pop~1,maxit=1000)
I've found that it throws this error on 19.4% of the seeds between 1 and 500:
fit.with.seed = function(s) {
    set.seed(s)
    pop <- rnbinom(n=1000, size=1, mu=0.05)
    m = glm.nb(pop~1, maxit=1000)
}
errors = sapply(1:500, function(s) {
    is.null(tryCatch(fit.with.seed(s), error=function(e) NULL))
})
mean(errors)
I've found only one mention of this error anywhere, on a thread with no responses.
What could be causing this error, and how can it be fixed (other than randomly permuting the inputs every time glm.nb throws an error)?
ETA: Setting control=glm.control(maxit=200, trace=3) shows that the theta.ml algorithm breaks down: theta gets very large, then becomes -Inf, then NaN:
theta.ml: iter67 theta =5.77203e+15
theta.ml: iter68 theta =5.28327e+15
theta.ml: iter69 theta =1.41103e+16
theta.ml: iter70 theta =-Inf
theta.ml: iter71 theta =NaN
It's a bit crude, but in the past I have been able to work around problems with glm.nb by resorting to straight maximum likelihood estimation (i.e. no clever iterative estimation algorithms as used in glm.nb).
Some poking around/profiling indicates that the MLE for the theta parameter is effectively infinite. I decided to fit it on the inverse scale, so that I could put a boundary at 0 (a fancier version would set up a log-likelihood function that reverts to Poisson at theta=zero, but that would defeat the point of trying to come up with a quick, canned solution).
With two of the bad examples from the original question, this works reasonably well, although it does warn that the parameter fit is on the boundary ...
library(bbmle)
## d1 (with response Y and predictors X1, X2, X3) comes from an example not reproduced here
m1 <- mle2(Y~dnbinom(mu=exp(logmu), size=1/invk),
           data=d1,
           parameters=list(logmu~X1+X2+offset(X3)),
           start=list(logmu=0, invk=1),
           method="L-BFGS-B",
           lower=c(rep(-Inf,12), 1e-8))
The second example is actually more interesting because it demonstrates numerically that the MLE for theta is essentially infinite even though we have a good-sized data set that is exactly generated from negative binomial deviates (or else I'm confused about something ...)
set.seed(11);pop <- rnbinom(n=1000,size=1,mu=0.05);glm.nb(pop~1,maxit=1000)
m2 <- mle2(pop~dnbinom(mu=exp(logmu), size=1/invk),
           data=data.frame(pop),
           start=list(logmu=0, invk=1),
           method="L-BFGS-B",
           lower=c(-Inf, 1e-8))
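To read the fit back on the theta scale, you can invert the fitted invk parameter (my own addition; the exact numbers will be whatever your fit returns):
## mle2 estimated invk = 1/theta, so the implied theta is 1/invk
cc <- coef(m2)
cc["invk"]      # expected to sit at (or near) the lower bound of 1e-8 ...
1/cc["invk"]    # ... i.e. an effectively infinite theta estimate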
Edit: The code and answer have been simplified to one sample, as in the question.
Yes, theta can approach Inf in small samples and with sparse data (many zeroes, a small mean and large skew). I have found that fitting glm.nb fails when the data are all zeroes and returns:
Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
missing value where TRUE/FALSE needed
The following code simulates small samples with a small mean and theta. To prevent the loop from crashing, glm.nb is not fitted when the data are all zeroes.
en1 <- 10
mu1 <- 0.5
size1 <- 0.5
temp <- matrix(nrow=10000, ncol=2)
# theta == Inf is rare so use a large number of reps
for (r in 1:10000){
    dat1 <- rnbinom(n=en1, size=size1, mu=mu1)
    temp[r, 1:2] <- c(mean(dat1), ifelse(max(dat1)!=0, glm.nb(dat1~1)$theta, NA))
}
temp <- as.data.frame(temp)
names(temp) <- c("mean1","theta1")
temp[which(is.na(temp$theta1)),]
# note that it's rare to get all zeroes in the sample
sum(is.na(temp$theta1))/dim(temp)[1]
# a log scale helps see what's happening
with(temp, plot(mean1, log10(theta1)))
# estimated thetas should equal size1 = 0.5
abline(h=log10(0.5), col="red")
text(2.5, 5, "n1 = n2 = 10", col="red", cex=2, adj=1)
text(1, 4, "extreme thetas", col="red", cex=2)
See that estimated thetas can be extremely large when the sample size is small (in the first plot below):
Lesson learnt: don't expect high quality results from glm.nb for small samples and sparse data; get larger samples (e.g. in the second plot below).
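For comparison, here is a quick, purely illustrative check with a much larger sample, where glm.nb should have no trouble and the estimated theta should come out close to the true size of 0.5 (the seed and sample size are my own choices):
set.seed(1)
dat2 <- rnbinom(n=10000, size=0.5, mu=0.5)   # same sparse setting, much larger n
MASS::glm.nb(dat2~1)$theta                   # estimate should be near 0.5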
