fisher.test crashes R with *** caught segfault *** error - r

As the title says, fisher.test crashes R with a *** caught segfault *** error. Here is code that reproduces it:
d <- matrix(c(1, 0, 5, 2, 1, 90, 0, 0, 0, 1, 0, 14, 0, 0, 0, 0, 0, 5, 0,
              0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 2, 1, 0, 2, 3, 89),
            nrow = 6, byrow = TRUE)
fisher.test(d, simulate.p.value = FALSE)
I discovered this because I use fisher.test inside some functions; running them on my data caused R to crash with the aforementioned error.
I understand that the table passed to fisher.test is ill behaved, but that kind of thing should not crash R.
I would appreciate suggestions on which conditions the contingency table should meet to avoid this kind of crash, and on which other arguments to set in fisher.test to prevent it. I did a quick test in which
fisher.test(d, simulate.p.value = TRUE)
does not crash and produces a result.
I am asking because I will have to implement a workaround to avoid future crashes in my pipeline.

I can confirm that this is a bug in R 4.2 and that it is now fixed in the development branch of R (with this commit on 7 May). I wouldn't be surprised if it were ported to a patch release sometime soon, but that's unknown/up to the R developers. Running your example above no longer segfaults, but it does throw an error:
Error in fisher.test(d, simulate.p.value = FALSE) :
FEXACT[f3xact()] error: hash key 5e+09 > INT_MAX, kyy=203, it[i (= nco = 6)]= 0.
Rather set 'simulate.p.value=TRUE'
So this makes your workflow better (you can handle these errors with try()/tryCatch()), but it doesn't necessarily satisfy you if you really want to perform an exact Fisher test on these data. (Exact tests on large tables with large entries are extremely computationally difficult, as they essentially have to do computations over the set of all possible tables with given marginal values.)
I don't have any brilliant ideas for detecting the exact conditions that will cause this problem (maybe you can come up with a rough rubric based on the dimensions of the table and the sum of the counts in the table, e.g. if (prod(dim(d)) > 30 && sum(d) > 200) ... ?)
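If you go that route, a minimal sketch combining such a rubric with a tryCatch() fallback might look like the following; safe_fisher and both thresholds are illustrative, not a recommendation:
safe_fisher <- function(d, B = 10000) {
  big <- prod(dim(d)) > 30 && sum(d) > 200   # the rough rubric from above
  if (big) return(fisher.test(d, simulate.p.value = TRUE, B = B))
  tryCatch(
    fisher.test(d, simulate.p.value = FALSE),
    error = function(e) fisher.test(d, simulate.p.value = TRUE, B = B)
  )
}
safe_fisher(d)   # for the table in the question, this goes straight to the simulated test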
Setting simulate.p.value=TRUE is the most sensible approach. However, if you expect precise results for extreme tables (e.g. you are working in bioinformatics and are going to apply a huge multiple-comparisons correction to the results), you're going to be disappointed. For example:
dd <- matrix(0, 6, 6)
dd[5, 5] <- dd[6, 6] <- 100
fisher.test(dd)$p.value
## 2.208761e-59, reported as "< 2.2e-16"
fisher.test(dd, simulate.p.value = TRUE, B = 10000)$p.value
## 9.999e-05
fisher.test(..., simulate.p.value = TRUE) will never return a value smaller than 1/(B+1): this is what happens when none of the simulated tables is more extreme than the observed table (technically, the p-value ought to be reported as "<= 9.999e-05"). Therefore you will never, in the lifetime of the universe, be able to compute a p-value like 1e-59; you can only set a bound based on how large you're willing to make B.

Related

Why does lsoda (in R) fail to complete running duration, with warning messages?

I am writing a numerical model in R, for an ecological system, and solving it using "lsoda" from package deSolve.
My model has 14 state variables.
I define the model, set it up without problems, and specify the simulation duration as follows:
nyears <- 60
ndays  <- nyears * 365 + 1
times  <- seq(0, nyears * 365, by = 1)
Rates of change of the state variables (e.g. the rate of change of variable "A1" is "dA1") are calculated from the current values of the state variables (at time = t) and a set of parameters.
Simplified example:
dA1 <- Tf * A1 * (ImaxA * p_sub)
where Tf, ImaxA and p_sub are parameters, and A1 is the state variable at time = t.
When I solve the model, I call the lsoda solver like this:
out <- as.data.frame(lsoda(start, times, model, parms))
Sometimes (depending on my parameter combinations) the model run completes over the entire duration I have specified; sometimes, however, it stops short of the mark (still giving me output up until the point where the solver "crashes"). When it "crashes", this message is displayed:
DLSODA- At current T (=R1), MXSTEP (=I1) steps
taken on this call before reaching TOUT
In above message, I1 = 5000
In above message, R1 = 11535.5
Warning messages:
1: In lsoda(start, times, model, parms) :
an excessive amount of work (> maxsteps ) was done, but integration was not successful - increase maxsteps
2: In lsoda(start, times, model, parms) :
Returning early. Results are accurate, as far as they go
It commonly appears when one of the state variables is growing exponentially or is tending very close to zero; however, sometimes it crashes when seemingly not much change is happening. I may be wrong, but is it due to the rate of change of the state variables becoming too large? If so, why might it also "crash" when there is no fast rate of change?
Is there a way to make the solver complete its task with the specified parameter values, perhaps with a more relaxed error tolerance?
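(For reference, the settings that the warning itself points at can be passed directly to lsoda; the values below are purely illustrative.)
library(deSolve)
out <- as.data.frame(lsoda(start, times, model, parms,
                           maxsteps = 50000,  # allow more internal steps per output interval
                           rtol = 1e-6,       # relative error tolerance (the default)
                           atol = 1e-8))      # absolute tolerance: how small a state is treated as negligible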
Thank you all for your contributions. I looked at some of the rates and found that, at the point of crashing, the model was switching between two metabolic states; the fast rate of this binary switch caused the solver to stop, rejecting the solution because the rate of change was too large. I have fixed my model by introducing a gradual switch between states (with a logistic curve) instead of the binary switch. I acknowledge that I didn't give enough information in the original question, so thanks for the help you offered!
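As an illustration of that fix, here is a hedged sketch of smoothing a binary switch with a logistic curve; the function name and all argument values below are made up for the example:
smooth_switch <- function(A1, threshold, k, rate_on, rate_off) {
  w <- plogis(k * (A1 - threshold))   # logistic weight; larger k = sharper transition
  w * rate_on + (1 - w) * rate_off    # blends the two rates instead of jumping between them
}
# compare with the discontinuous version: ifelse(A1 > threshold, rate_on, rate_off)
smooth_switch(A1 = seq(0, 2, by = 0.5), threshold = 1, k = 10, rate_on = 0.8, rate_off = 0.1)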

Diagnosing opaque errors and stabilizing / robustifying a simulation in R

Apologies: this question is somewhat vague and general, and it is not reproducible because the code is too complex. However, I suspect it can be answered with equally general strategies that are instructive and helpful.
I have coded a simulator with a main, parallelized loop that iterates over parameter values, loads them into the model and runs it n times.
The issue: while the code generally works well at smaller problem dimensions, it fails at a significant frequency at higher dimensions (particularly higher n); most parameter values execute fine and output is produced, but once in a while no file is produced. The 'post processing' then fails because of the missing files.
What I know: when I rerun the function, different parameter values are affected, so this is not due to invalid parameter values but is seemingly a random failure. There have also been some runs without any problems. There was once an error message about failure to allocate a vector of size xyz.
What I tried: traceback() seems to focus on the failure at the end of the sim (a symptom) but doesn't find the real cause. I also tried adding a while loop conditional on the existence of the output file, which would rerun the parameter value if it failed (see below, commented out). This seemed to help a little, but not completely.
The above leads me to suspect that some threads crash somehow and then fail to output any of the parameter sets assigned to them.
Questions: What strategies would you use to diagnose this issue? What methods can one implement to make such a simulation more robust to errors (diagnosed or otherwise)? What kind of operations might I be doing that can cause such failures?
Sketch of the Sim. Loop:
library(foreach)
library(doMC)
library(abind)   # needed for abind() below

Simulator <- function(params, ...)
{
  [... Pre Processing ...]
  times <- foreach(i = 1:length(params)) %dopar%
  {
    # while (!file.exists(paste0("output", i, ".rds"))) {
    run <- list()
    run$par  <- params[[i]]
    run$data <- list()
    foreach(j = 1:n) %do%   # run the simulation n times with these parameters
    {
      run$data[[j]] <- SimRun(params[[i]], ...)
    }
    # Combine into a single array and label dimensions
    run$data <- abind(run$data, along = 4)
    dimnames(run$data) <- headers
    # Compute statistics and save them
    run$stats <- Stats(run$data, params[[i]])
    saveRDS(run, paste0("output", i, ".rds"))
    # }
    [... etc ...]
  }
  [... Post Processing ...]
}
Thanks for your patience!
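One way to see why individual workers die, sketched here with a placeholder RunOneParameterSet standing in for the loop body above (the file names are also placeholders), is to wrap each worker in tryCatch() and save the error message instead of losing it:
library(foreach)
library(doMC)
registerDoMC()

status <- foreach(i = 1:length(params)) %dopar% {
  tryCatch({
    run <- RunOneParameterSet(params[[i]])          # placeholder for the body above
    saveRDS(run, paste0("output", i, ".rds"))
    "ok"
  }, error = function(e) {
    # keep the message (e.g. "cannot allocate vector of size ...") for later diagnosis
    saveRDS(conditionMessage(e), paste0("error", i, ".rds"))
    conditionMessage(e)
  })
}
Afterwards, status (or the errorN.rds files) shows which parameter sets failed and why.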

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20).
I get the following error: Quick-TRANSfer stage steps exceeded maximum (= 31834400). Although one can view the source at http://svn.r-project.org/R/trunk/src/library/stats/R/kmeans.R, I am unsure what is going wrong. I assume the problem has to do with the size of my dataset, but I would be grateful if someone could clarify once and for all what I can do to mitigate the issue.
I just had the same issue.
See the documentation of kmeans in R via ?kmeans:
The Hartigan-Wong algorithm
generally does a better job than either of those, but trying
several random starts (‘nstart’> 1) is often recommended. In rare
cases, when some of the points (rows of ‘x’) are extremely close,
the algorithm may not converge in the “Quick-Transfer” stage,
signalling a warning (and returning ‘ifault = 4’). Slight
rounding of the data may be advisable in that case.
In these cases, you may need to switch to the Lloyd or MacQueen algorithms.
The nasty thing about R here is that it continues with a warning that may go unnoticed. For my benchmark purposes, I consider this to be a failed run, and thus I use:
if (kms$ifault==4) { stop("Failed in Quick-Transfer"); }
Depending on your use case, you may want to do something like
if (kms$ifault==4) { kms = kmeans(X, kms$centers, algorithm="MacQueen"); }
instead, to continue with a different algorithm.
If you are benchmarking k-means, note that R uses iter.max = 10 by default. It may take many more than 10 iterations to converge.
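Putting these suggestions together, a hedged sketch (dataset is the data from the question; the 100 iterations are an arbitrary choice): run Hartigan-Wong first and fall back only if ifault signals the Quick-TRANSfer problem.
kms <- kmeans(dataset, centers = 100, nstart = 25, iter.max = 100)  # well above the default of 10
if (kms$ifault == 4) {
  # MacQueen (and Lloyd) do not have a Quick-TRANSfer stage
  kms <- kmeans(dataset, centers = kms$centers, algorithm = "MacQueen", iter.max = 100)
}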
I had the same problem; it seems to have something to do with available memory.
Running garbage collection before the function call worked for me:
gc()
or reference:
Increasing (or decreasing) the memory available to R processes
#jlhoward's comment:
Try
kmeans(dataset, algorithm="Lloyd", ..)
I got the same error message, but in my case it helped to increase the number of iterations, iter.max. That contradicts the memory-overload theory.

What's the lowest number R will present before rounding to 0?

I'm doing some statistical analysis with R software (bootstrapped Kolmogorov-Smirnov tests) of very large data sets, meaning that my p values are all incredibly small. I've Bonferroni corrected for the large number of tests that I've performed meaning that my alpha value is also very small in order to reject the null hypothesis.
The problem is that R presents me with p-values of 0 in some cases where the p-value is presumably so small that it cannot be represented (usually for the very large sample sizes). While I can happily reject the null hypothesis for these tests, the data are for publication, so I'll need to write p < ....., but I don't know what the lowest reportable value in R is.
I'm using the ks.boot function in case that matters.
Any help would be much appreciated!
.Machine$double.xmin gives you the smallest non-zero normalized floating-point number. On most systems that's 2.225074e-308. However, I don't believe this is a sensible limit.
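For context, you can query these limits directly; the second constant is also why R prints tiny p-values as "< 2.2e-16":
.Machine$double.xmin  # smallest positive normalized double, typically 2.225074e-308
.Machine$double.eps   # machine epsilon, typically 2.220446e-16 (the eps used by format.pval)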
Instead, I suggest that in Matching::ks.boot you change the line
ks.boot.pval <- bbcount/nboots
to
ks.boot.pval <- log(bbcount) - log(nboots)
and work on the log scale.
Edit:
You can use trace to modify the function.
Step 1: Look at the function body, to find out where to add additional code.
as.list(body(ks.boot))
You'll see that element 17 is ks.boot.pval <- bbcount/nboots, so we need to add the modified code directly after that.
Step 2: trace the function.
trace(ks.boot, quote(ks.boot.pval <- log(bbcount) - log(nboots)), at = 18)
Step 3: Now you can use ks.boot and it will return the logarithm of the bootstrap p-value as ks.boot.pvalue. Note that you cannot use summary.ks.boot since it calls format.pval, which will not show you negative values.
Step 4: Use untrace(ks.boot) to remove the modifications.
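A minimal end-to-end sketch with made-up data, assuming the at = 18 position from step 2 still matches your version of Matching (the sample sizes and nboots here are arbitrary):
library(Matching)
trace(ks.boot, quote(ks.boot.pval <- log(bbcount) - log(nboots)), at = 18)  # step 2
set.seed(1)
x <- rnorm(200)                  # made-up samples
y <- rnorm(200, mean = 0.1)
bb <- ks.boot(x, y, nboots = 500)
bb$ks.boot.pvalue                # now on the log scale (-Inf if no bootstrap statistic beats the observed one)
untrace(ks.boot)                 # step 4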
I don't know whether ks.boot has methods in the Rmpfr or gmp packages, but if it does, or if you feel like rolling your own code, you can work with arbitrary-precision, arbitrary-size numbers.

Ignoring errors in R

I'm running a complex but relatively quick simulation in R (takes about 5-10 minutes per simulation) and I'm beginning to run it in parallel with various input values in order to test the robustness of some of my algorithms.
There seems to be one problem: some arrangements of inputs cause a fatal error within the simulation and the whole code comes crashing down, causing the simulations to end. Is there an easy way to catch the error (which may come from a variety of locations) and have it just ignore those input values and move on to the next?
It's frustrating when I set an array of inputs to check that should take 5-6 hours to run through all the simulations and I come back to find that it crashed in the first 45 minutes.
While I work on trying to fix the bug / identify inputs that push me to that error, any ideas on how to ignore / catch the errors as they come?
Thanks
I don't know how you organized your simulations, but I guess you have a loop where you use new arguments at each step.
You can use tryCatch. Here I am throwing an error when the input is bad:
step.simul <- function(x) {
  stopifnot(x %% 2 == 1, x > 0)
  (x - 1) / 2
}
Then, using tryCatch, I flag the bad inputs with a code that tells you which input was bad:
sapply(1:5, function(i) tryCatch(step.simul(i), error = function(e) -1000 - i))
[1] 0 -1002 1 -1004 2
As you can see, the simulation runs over the whole loop index without stopping.
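If you would rather keep the actual error objects than a numeric flag, one possible variant is:
results <- lapply(1:5, function(i) tryCatch(step.simul(i), error = function(e) e))
failed  <- vapply(results, inherits, logical(1), what = "error")
which(failed)                                             # which inputs errored
vapply(results[failed], conditionMessage, character(1))   # and why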
