How to avoid R fisher.test workspace errors

I am performing a Fisher's exact test on a large number of contingency tables and saving the p-value for a bioinformatics problem. Some of these contingency tables are large, so I've increased the workspace as much as I can, but when I run the following code I get an error:
result <- fisher.test(data,workspace=2e9)
LDSTP is too small for this problem. Try increasing the size of the workspace.
If I increase the size of the workspace further, I get another error:
result <- fisher.test(data,workspace=2e10)
cannot allocate memory block of size 134217728Tb
Now I could just simulate p-values:
result <- fisher.test(data, simulate.p.value = TRUE, B = 1e5)
but I'm afraid I'll need a huge number of simulations to get accurate results, since my p-values may be extremely small in some cases.
Thus my question: is there some way to preemptively check whether a contingency table is too complex to calculate exactly? In those cases alone I could switch to a large number of simulations with B=1e10 or something, or at least just skip those tables with a value of NA so that my job actually finishes.

Maybe you could use tryCatch to get the desired behaviour when fisher.test fails? Something like this, maybe:
tryCatchFisher <- function(...) {
  # Return the exact p-value, or a marker string if fisher.test() errors out
  tryCatch(fisher.test(...)$p.value,
           error = function(e) "too big")
}
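For example, you could then map this over your tables and fall back to simulated p-values only where the exact test is infeasible. A rough sketch, where tables is assumed to be a list holding your contingency matrices and the B value is purely illustrative:
pvals <- sapply(tables, function(tab) {
  p <- tryCatchFisher(tab, workspace = 2e9)
  if (identical(p, "too big")) {
    # fall back to simulation only for the problematic tables
    p <- fisher.test(tab, simulate.p.value = TRUE, B = 1e6)$p.value
  }
  p
})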

Related

How to remove the models that failed convergence from a set of random questions?

I want to include some random replications of model estimations (e.g., a GARCH model) in the question. The code randomly uses a different data series each time. In this process, the GARCH estimation for some random data series may not achieve numerical convergence. Therefore, I need to code the question/problem in such a way that it removes the models that failed convergence from the set of questions. How can I code this when I use R-exams?
Basic idea
In general, when using random data in the generation of exercises, there is a chance that something occasionally goes wrong, e.g., the solution does not fall into a desired range (i.e., becomes too large or too small), or the solution does not even exist due to mathematical intractability or numerical problems (as you point out), etc.
Of course, it is best to avoid such problems in the data-generating process so that they do not occur at all. However, that is not always possible, or not worth the effort when the problems occur very rarely. In such situations I typically use a while() loop to re-generate the random data if necessary. As this might potentially run for several iterations, it is important, though, to make the probability that it is needed sufficiently small.
Worked example
A worked example can be found in the fourfold exercise that ships with the package. It randomly generates a fourfold table with probabilities that should subsequently be reconstructed from partial information in the actual exercise. For the exercise to be well-defined, all entries of the table must be (strictly) between 0 and 1 and they must sum to 1. The simulation code actually tries to ensure that, but edge cases might occur. Rather than writing more code to avoid these edge cases, a simple while() loop tries to catch them and sample a new table if needed:
ok <- FALSE
while(!ok) {
  [...generate probabilities...]
  tab <- cbind(c(prob1, prob3), c(prob2, prob4))
  [...compute solutions...]
  ok <- sum(tab) == 1 & all(tab > 0) & all(tab < 1)
}
Application to catching errors
The same type of strategy can also be used for other problems, such as the ones you describe. You can wrap the model estimation in code like
fit <- try(mymodel(...), silent = TRUE)
and then use something like
ok <- !inherits(fit, "try-error")
In addition to not producing an error you might require, say, that all coefficients are positive (or something like that). Then you would do:
ok <- !inherits(fit, "try-error") && all(coef(fit) > 0)
Analogously, you could check the convergence of the model etc.
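Putting the pieces together for your setting could look roughly like the sketch below; mymodel() and converged() are placeholders for your actual GARCH estimation call and convergence check, and rnorm() merely stands in for your data-generating step:
ok <- FALSE
while (!ok) {
  y <- rnorm(500)                        # ...generate a random data series...
  fit <- try(mymodel(y), silent = TRUE)  # estimation may fail numerically
  # accept only if the fit did not error and the convergence check passes
  ok <- !inherits(fit, "try-error") && converged(fit)
}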

nTrials must not be greater... issue on conjoint design

I'm trying to create a list of conjoint cards using R.
I have followed the professor's introduction with my own dataset, but I'm stuck on the following issue, which I have no idea how to solve.
library(conjoint)
experiment <- expand.grid(
  ServiceRange = c("RA", "Active", "Passive", "Basic"),
  IdentProce   = c("high", "mid", "low"),
  Fee          = c(1000, 500, 100),
  Firm         = c("KorFin", "KorComp", "KorStrt", "ForComp")
)
print(experiment)
design <- caFactorialDesign(data = experiment, type = "orthogonal")
print(design)
at the "design" line, I'm keep getting the following error message:
Error in optFederov(~., data, nTrials = i, approximate = FALSE, nRepeats = 50) :
nTrials must not be greater than the number of rows in data
How do I address this issue?
You're getting this error because you have 144 rows in experiment, but the nTrials mentioned in the error gets bigger than 144. This causes an error for optFederov(), which is called inside caFactorialDesign(). The problem stems from the fact that your Fee column has relatively large values.
I'm not familiar with how the conjoint package is set up, but I can show you how to troubleshoot this error. You can read the conjoint documentation for more on how to select appropriate experimental data.
(Note that the example data in the documentation always has very low numeric values, usually values between 1-10. Compare that with your Fee vector, which has values up to 1000.)
You can see the source code for a function loaded into your RStudio namespace by highlighting the function name (e.g. caFactorialDesign) and hitting Command-Return (on a Mac - probably something similar on PC). You can also just look at the source code on GitHub.
The caFactorialDesign function is implemented here. That link highlights the line (26) that is throwing the error for you:
temp.design<-optFederov(~., data, nTrials=i, approximate=FALSE, nRepeats=50)
Recall the error message:
nTrials must not be greater than the number of rows in data
You've passed in experiment as the data parameter, so nrow(experiment) will tell us what the upper limit on nTrials is:
nrow(experiment) # 144
We can actually just think of the error for this dataset as:
nTrials must not be greater than 144
OK, so how is the value of nTrials determined? We can see that nTrials is an argument to optFederov(), and its value is set to i, which is often a sign that there's a for-loop wrapping the operation. And in fact, that's what we see:
for (i in ca.number:profiles.number)
{
  temp.design <- optFederov(~., data, nTrials = i, approximate = FALSE, nRepeats = 50)
  ...
}
This tells us that optFederov() is going to get called for each value of i in the loop, which will start at ca.number and will go up to profiles.number (inclusive).
How are these two variables assigned? If we look a little higher up in the caFactorialDesign() definition, ca.number is defined on lines 5-9:
num <- data.frame(data.matrix(data))
vars.number <- length(num)
levels.number <- 0
for (i in 1:length(num)) levels.number <- levels.number + max(num[i])
ca.number <- levels.number - vars.number + 1
You can run these calculations outside of the function; just remember that data == experiment. So change the first line to num <- data.frame(data.matrix(experiment)) and then run that chunk of code. You will see that ca.number == 1008!
In other words, the very first value of i in the for-loop which calls optFederov() is already way bigger than the max limit: 1008 >> 144.
It's possible you could include these numeric values as factors or strings in your definition of experiment (see the sketch below); I'm not sure whether that is an appropriate way to do this analysis. But I hope it's clear that you won't be able to use such large numeric values in caFactorialDesign() unless you have a much larger number of total observations in your data.
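To illustrate that route, here is a sketch that re-runs the ca.number arithmetic with Fee recoded as a factor (experiment2 is just an illustrative copy; whether this is statistically appropriate for your conjoint analysis is a separate question):
# Recode Fee as a factor so data.matrix() uses level codes (1-3)
# rather than the raw prices, then redo the ca.number arithmetic.
experiment2 <- experiment
experiment2$Fee <- factor(experiment2$Fee)

num <- data.frame(data.matrix(experiment2))
vars.number <- length(num)
levels.number <- 0
for (i in 1:length(num)) levels.number <- levels.number + max(num[i])
ca.number <- levels.number - vars.number + 1
ca.number  # 4 + 3 + 3 + 4 - 4 + 1 = 11, comfortably below nrow(experiment2) == 144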

apcluster in R: Memory limitation

I am trying to run a clustering exercise in R. The algorithm I used is apcluster(). The script I used is:
s1 <- negDistMat(df, r=2, method="euclidean")
apcluster <- apcluster(s1)
My data set has around 0.1 million (100,000) rows. When I ran the script, I got the following error:
Error in simpleDist(x[, sapply(x, is.numeric)], sel, method = method, :
negative length vectors are not allowed
When I searched online, I found that the negative length vector error occurs due to the memory limit of my RAM. My question is whether there is any workaround to run apcluster() on my dataset of 0.1 million rows with the available RAM, or whether I am missing something that I need to take care of while running apcluster in R.
I have a machine with 8 GB of RAM.
The standard version of affinity propagation implemented in the apcluster() method will never ever run successfully on data of that size. On the one hand, the similarity matrix (s1 in your code sample) would have 100K x 100K = 10G entries (roughly 80 GB at 8 bytes per value). On the other hand, computation times would be excessive. I suggest you use apclusterL() instead.
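For illustration, a minimal sketch of the leveraged call, assuming the apclusterL() interface described in the package documentation; the frac and sweeps values are only placeholders that you would need to tune:
library(apcluster)

# Leveraged affinity propagation: only a fraction of the similarities is
# computed, so the full 100K x 100K matrix is never materialized.
# frac and sweeps are illustrative; see ?apclusterL for the exact arguments
# (you may also need to set the preference p explicitly).
apres <- apclusterL(s = negDistMat(r = 2), x = as.matrix(df),
                    frac = 0.01, sweeps = 5)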

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to do: I just need to calculate the correlation of the columns (or of the rows, if transposed). Simple enough? Yet I have been unable to get the results for a whole week, and I have looked through most of the solutions here.
My laptop has 4 GB of RAM. I do have access to a server with 32 nodes. My data cannot be loaded here as it is huge (411k columns and 100 rows). If you need any other information, or maybe part of the data, I can try to put it up here, but the problem can easily be explained without actually seeing the data. I simply need to get a correlation matrix of size 411k x 411k, which means I need to compute the correlation among the rows of my data.
Concepts I have tried to code (all of them either give me memory issues or run forever):

1. The simplest way: one row against all the others, writing the result out using append = T. (Runs forever.)
2. bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (Unable to allocate memory to assign the corMAT matrix using ff() on my server.)
3. Splitting the data into sets (every 10,000 continuous rows form a set) and correlating each set against the others (same logic as bigcorPar), but I am unable to find a way to store the pieces together to finally generate the full 411k x 411k matrix.
4. What I am attempting now: bigcorPar.r on 10,000 rows against 411k (so the 10,000 are divided into blocks), saving the results in separate csv files.
5. I am also attempting to run every 1000 vs 411k on one node of my server; today is my 3rd day and I am still on row 71.
I am not an R pro, so this is as much as I could attempt. Either my code runs forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this is what you want and you are OK with the storage required, I can provide my code for such calculations and storage.
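If you do decide to compute and store everything yourself, one rough sketch (not the MatrixEQTL code) is to build the correlation matrix block by block from a column-standardized matrix and write each block to disk; the block size and file names below are purely illustrative:
block_cor_to_disk <- function(X, block = 10000, out_dir = "cor_blocks") {
  dir.create(out_dir, showWarnings = FALSE)
  Xs <- scale(X)                               # standardize every column
  n  <- nrow(Xs)
  starts <- seq(1, ncol(Xs), by = block)
  for (i in starts) {
    ii <- i:min(i + block - 1, ncol(Xs))
    for (j in starts[starts >= i]) {           # upper triangle only
      jj <- j:min(j + block - 1, ncol(Xs))
      # Pearson correlation of standardized columns = crossprod / (n - 1)
      cb <- crossprod(Xs[, ii], Xs[, jj]) / (n - 1)
      saveRDS(cb, file.path(out_dir, sprintf("cor_%d_%d.rds", i, j)))
    }
  }
}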

fast perl t-test function

I'm using Perl + R to analyze a large dataset of samples. For each pair of samples, I calculate the t-test p-value. Currently, I'm using the Statistics::R module to export values from Perl to R, and then use the t.test function. However, this process is extremely slow. I was wondering if someone knows of a Perl function that will do the same procedure in a more efficient manner.
Thanks!
The volume of data, the number of dataset pairs, and perhaps even the code you have written would probably help us identify why your code is slow. For instance, sending many small datasets to R would be slow, but can probably be sped up simply by sending all the data at once.
For a pure Perl solution, you first need to compute the test statistic (that is easy, and already done in Statistics::TTest, for instance), and then convert it to a p-value (you need something like R's pt function, but I am not sure it is readily available in Perl; you could send the t-values to R, in one block, at the end, to convert them to p-values).
You can also try PDL, in particular PDL::Stats.
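If you go the route of sending the t statistics back to R in one block, the conversion itself is a single vectorized call to pt(); the file names in this sketch are made up:
# Hypothetical follow-up on the R side: turn a vector of t statistics and
# their degrees of freedom into two-sided p-values in one call.
tvals <- scan("tvalues.txt")   # illustrative file of t statistics
dfs   <- scan("dfs.txt")       # matching degrees of freedom
pvals <- 2 * pt(-abs(tvals), df = dfs)
write.table(pvals, "pvalues.txt", row.names = FALSE, col.names = FALSE)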
The Statistics::TTest module gives you a p-value.
use feature 'say';
use Statistics::TTest;

my @r1 = map { rand(10) } 1..32;      # first sample
my @r2 = map { rand(10) - 2 } 1..32;  # second sample

my $ttest = Statistics::TTest->new;
$ttest->load_data(\@r1, \@r2);
say "p-value = prob > |T| = ", $ttest->{t_prob};
Playing around a bit, I find that the p-values this gives you are slightly lower than what you get from R. R is apparently doing something that reduces the degrees of freedom; I believe this is the Welch adjustment that t.test() applies by default when variances are unequal, but I cannot say for sure that this accounts for the whole difference. (In the above example, the difference is about 1%. If you use samples of 320 floats instead of 32, the difference is 50% or even more, but it is a difference between 1e-12 and 1.5e-12.) If you need precise p-values, you will want to take care.
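For what it's worth, the degrees-of-freedom difference is easy to see on the R side; the snippet below only illustrates R's own behaviour and makes no claim about what Statistics::TTest computes internally:
# R's t.test() defaults to the Welch test, which reduces the degrees of
# freedom when the two samples have unequal variances, giving slightly
# different p-values than a pooled-variance test.
set.seed(1)
x <- runif(32, 0, 10)
y <- runif(32, 0, 10) - 2
t.test(x, y)$parameter                    # Welch df, usually a bit below 62
t.test(x, y, var.equal = TRUE)$parameter  # pooled-variance df = 62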
