apcluster in R: Memory limitation

I am trying to run a clustering exercise in R using apcluster(). The script that I used is:
s1 <- negDistMat(df, r=2, method="euclidean")
apcluster <- apcluster(s1)
My data set has around 100,000 rows. When I ran the script, I got the following error:
Error in simpleDist(x[, sapply(x, is.numeric)], sel, method = method, :
negative length vectors are not allowed
When I searched online, I found that the "negative length vectors" error occurs when the requested object is far too large for my RAM. My question is whether there is any workaround to run apcluster() on my dataset of 100,000 rows with the available RAM, or whether I am missing something that I need to take care of when running apcluster in R.
I have a machine with 8 GB of RAM.

The standard version of affinity propagation implemented in the apcluster() method will never run successfully on data of that size. On the one hand, the similarity matrix (s1 in your code sample) would have 100K x 100K = 10G entries, which at 8 bytes per entry is roughly 80 GB. On the other hand, computation times would be excessive. I suggest you use apclusterL() instead.
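For what it's worth, here is a rough sketch of how the leveraged variant could be called; the frac and sweeps values are placeholders, so check ?apclusterL for the exact arguments and tune them for your data:
library(apcluster)
# leveraged affinity propagation: only similarities to a random sample of the
# data are computed, so memory grows with nrow(df) * (frac * nrow(df)) instead
# of nrow(df)^2; with 100K rows, even frac = 0.01 already means about
# 100K x 1K x 8 bytes = 800 MB
apres <- apclusterL(s = negDistMat(r = 2, method = "euclidean"),
                    x = df,
                    frac = 0.01,   # fraction of samples used for the similarity columns
                    sweeps = 5)    # number of repetitions with different samples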

Related

How to fix "Error: cannot allocate vector of size 265.6 Mb"

I have a very large list of dataframes (300 dataframes, each with 2 columns and 300~600 rows), and I want to join all of them with
final <- subset %>% reduce(full_join, by = "Frame_times")
When I try to do this, however, I get the following error:
Error: cannot allocate vector of size 265.6 Mb
I am operating on 64-bit Windows 10 with the latest installation of 64-bit R (4.0.0).
I have 8 GB of RAM, and
> memory.limit()
[1] 7974
> memory.size(max = TRUE)
[1] 7939.94
I have also tried the gc() function, but it did not help.
It appears that I have enough space and memory to run this, so why am I getting this error?
And how can I fix it?
Thank you very much!
You are running out of RAM. A first troubleshooting step might be to run this code on a smaller subset of dataframes (say, 3). Are the results (in particular, the number of rows) what you were expecting? If yes and it's really doing the right thing, then it might help to do the joins in batches (say, 3 batches of 100), as sketched below. The most likely scenario is that for some reason the number of rows or columns is blowing up to a much bigger number than you're expecting.
The 265.6 Mb mentioned in the error is just the final straw, not the total memory you're using.
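To make the batching idea concrete, here is one way it could look; the batch size of 100 is arbitrary, and subset is the list of data frames from the question:
library(purrr)
library(dplyr)
# join in batches first, then join the (much smaller number of) batch results
batches <- split(subset, ceiling(seq_along(subset) / 100))
batch_joined <- lapply(batches, function(b) reduce(b, full_join, by = "Frame_times"))
final <- reduce(batch_joined, full_join, by = "Frame_times")
# checking sapply(batch_joined, nrow) here is a quick way to spot a row blow-up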

how to avoid R fisher.test workspace errors

I am performing a Fisher's exact test on a large number of contingency tables and saving the p-values for a bioinformatics problem. Some of these contingency tables are large, so I've increased the workspace as much as I can; but when I run the following code I get an error:
result <- fisher.test(data,workspace=2e9)
LDSTP is too small for this problem. Try increasing the size of the workspace.
If I increase the size of the workspace, I get another error:
result <- fisher.test(data,workspace=2e10)
cannot allocate memory block of size 134217728Tb
Now I could just simulate pvals:
result <- fisher.test(data, simulate.p.value = TRUE, B = 1e5)
but I'm afraid I'll need a huge number of simulations to get accurate results, since my p-values may be extremely small in some cases.
Thus my question: is there some way to preemptively check whether a contingency table is too complex to calculate exactly? In those cases alone I could switch to a large number of simulations with B=1e10 or something, or at least skip those tables and record "NA" so that my job actually finishes.
Maybe you could use tryCatch to get the desired behaviour when fisher.test fails? Something like this, maybe:
tryCatchFisher <- function(...) {
  tryCatch(fisher.test(...)$p.value,
           error = function(e) {'too big'})
}
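For example, applied over a list of tables (the name tables is just a placeholder for however you store your contingency tables):
# first pass: exact test where feasible, 'too big' where the workspace blows up
pvals <- sapply(tables, function(tb) tryCatchFisher(tb, workspace = 2e9))
# sapply coerces everything to character because of the 'too big' entries
redo <- pvals == 'too big'
# second pass: re-run only the failed tables with simulated p-values
pvals[redo] <- sapply(tables[redo], function(tb)
  fisher.test(tb, simulate.p.value = TRUE, B = 1e5)$p.value)
pvals <- as.numeric(pvals)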

R Parallel Processing - Node Choice

I am attempting to process a large amount of data in R on Windows using the parallel package on a computer with 8 cores. I have a large data.frame that I need to process row-by-row. For each row, I can estimate how long it will take for that row to be processed and this can vary wildly from 10 seconds to 4 hours per row.
I don't want to run the entire program at once under the clusterApplyLB function (I know this is probably the optimal method) because if it hits an error, then my entire set of results might be lost. My first attempt to run my program involved breaking it up into Blocks and then running each Block individually in parallel, saving the output from that parallel run and then moving on to the next Block.
The problem is that as it ran through the rows, rather than running at 7x "real" time (I have 8 cores, but I wanted to keep one spare), it only seems to be running at about 2x. I've guessed that this is because the allocation of rows to each core is inefficient.
For example, take four rows of data with 2 cores: two of the rows could take 4 hours each and the other two 10 seconds each. Theoretically this could take 4 hours and 10 seconds to run, but if allocated inefficiently, it could take 8 hours. (Obviously this is an exaggeration, but a similar situation can happen when estimates are incorrect with more cores and more rows.)
If I estimate these times and submit them to clusterApplyLB in what I estimate to be the correct order (so that the estimated times are spread across cores to minimize the total time taken), they might not be sent to the cores that I want them to be, because they might not finish in the time that I estimate. For example, if I estimate two processes to take 10 minutes and 12 minutes and they actually take 11.6 minutes and 11.4 minutes, then the order in which the rows are submitted to clusterApplyLB won't be what I anticipated. This kind of error might seem small, but if I have optimised the placement of multiple long-running rows, then this mix-up of order could cause two 4-hour rows to go to the same node rather than to different nodes (which could almost double my total time).
TL;DR. My question: Is there a way to tell an R parallel processing function (e.g. clusterApplyLB, clusterApply, parApply, or any sapply, lapply or foreach variants) which rows should be sent to which core/node? Even without the situation I find myself in, I think this would be a very useful and interesting thing to provide information on.
I would say there are 2 different possible solution approaches to your problem.
The first one is a static optimization of the job-to-node mapping according to the expected per-job computation time. You would assign each job (i.e., row of your dataframe) a node before starting the calculation. Code for a possible implementation of this is given below.
The second solution is dynamic and you would have to make your own load balancer based on the code given in clusterApplyLB. You would start out the same as in the first approach, but as soon as a job is done, you would have to recalculate the optimal job-to-node mapping. Depending on your problem, this may add significant overhead due to the constant re-optimization that takes place. I think that as long as you do not have a bias in your expected computation times, it's not necessary to go this way.
Here is the code for the first solution approach:
library(parallel)
#set seed for reproducible example
set.seed(1234)
#let's say you have 100 calculations (i.e., rows)
#each of them takes between 0 and 1 second computation time
expected_job_length=runif(100)
#this is your data
#real_job_length is unknown but we use it in the mock-up function below
df=data.frame(job_id=seq_along(expected_job_length),
              expected_job_length=expected_job_length,
              #real_job_length=expected_job_length + some noise
              real_job_length=expected_job_length+
                runif(length(expected_job_length),-0.05,0.05))
#we might have a negative real_job_length; fix that
df=within(df,real_job_length[real_job_length<0]<-
            real_job_length[real_job_length<0]+0.05)
#detectCores() gives in my case 4
cluster_size=4
Prepare the job-to-node mapping optimization:
#x will give the node_id (between 1 and cluster_size) for each job
total_time=function(x,expected_job_length) {
  #in the calculation below, x will be a vector of reals
  #we have to translate it into integers in order to use it as index vector
  x=as.integer(round(x))
  #return max of sum of node-binned expected job lengths
  max(sapply(split(expected_job_length,x),sum))
}
#now optimize the distribution of jobs amongst the nodes
#Genetic algorithm might be better for the optimization
#but Differential Evolution is good for now
library(DEoptim)
#pick large differential weighting factor (F) ...
#... to get out of local minima due to rounding
res=DEoptim(fn=total_time,
            lower=rep(1,nrow(df)),
            upper=rep(cluster_size,nrow(df)),
            expected_job_length=expected_job_length,
            control=DEoptim.control(CR=0.85,F=1.5,trace=FALSE))
#wait for a minute or two ...
#inspect optimal solution
time_per_node=sapply(split(expected_job_length,
                           unname(round(res$optim$bestmem))),sum)
time_per_node
# 1 2 3 4
#10.91765 10.94893 10.94069 10.94246
plot(time_per_node,ylim=c(0,15))
abline(h=max(time_per_node),lty=2)
#add node-mapping to df
df$node_id=unname(round(res$optim$bestmem))
Now it's time for the calculation on the cluster:
#start cluster
workers=parallel::makeCluster(cluster_size)
start_time=Sys.time()
#distribute jobs according to optimal node-mapping
clusterApply(workers,split(df,df$node_id),function(x) {
  for (i in seq_along(x$job_id)) {
    #use tryCatch to do the error handling for jobs that fail
    tryCatch({Sys.sleep(x[i,"real_job_length"])},
             error=function(err) {print("Do your error handling")})
  }
})
end_time=Sys.time()
#how long did it take
end_time-start_time
#Time difference of 11.12532 secs
#add to plot
abline(h=as.numeric(end_time-start_time),col="red",lty=2)
stopCluster(workers)
Based on the input, it seems you are already saving the output of a task within that task.
Assuming each parallel task saves its output as a file, you probably need an initial function that predicts the time for a particular row.
In order to do that:
generate a structure with the estimated time and the row number;
sort by the estimated time, reorder the rows accordingly, and run the parallel process over the reordered rows (see the sketch below).
This would automatically balance the workload.
We had a similar problem where the process had to be done column-wise and each column took 10-200 seconds. So we generated a function to estimate the time, reordered the columns based on that, and ran the parallel process for each column.
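A minimal sketch of that reorder-then-balance idea, where estimate_time() and process_row() are placeholders for your own per-row functions:
library(parallel)
# estimate, then hand out the longest jobs first so the load balancer
# cannot end up with two 4-hour rows queued at the end behind each other
est  <- sapply(seq_len(nrow(df)), function(i) estimate_time(df[i, ]))
ord  <- order(est, decreasing = TRUE)
rows <- split(df[ord, ], seq_len(nrow(df)))   # list of one-row data frames, longest first
workers <- makeCluster(7)                     # keep one core spare
res <- clusterApplyLB(workers, rows, process_row)
stopCluster(workers)
# res is in the reordered order; res[order(ord)] restores the original row order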

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to be done. I just need to calculate the correlation of the columns (or rows, if transposed). Simple enough? I have been unable to get the results for a whole week and I have looked through most of the solutions here.
My laptop has 4 GB of RAM. I do have access to a server with 32 nodes. My data cannot be loaded here as it is huge (411k columns and 100 rows). If you need any other information or maybe part of the data, I can try to put it up here, but the problem can easily be explained without really having to see the data. I simply need to get a correlation matrix of size 411k x 411k, which means I need to compute the correlation among the 411k variables in my data.
Concepts I have tried to code: (all of them in some way give me memory issues or run forever)
The simplest way: one row against all, writing the result out with append=T. (Runs forever.)
bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (I am unable to allocate memory to assign the corMAT matrix using ff() on my server.)
Splitting the data into sets (every 10000 continuous rows is a set) and correlating each set against the others (same logic as bigcorPar), but I am unable to find a way to store them all together to finally generate the 411k x 411k matrix.
I am attempting this now: bigcorPar.r on 10000 rows against 411k (so the 10000 are divided into blocks), saving the results in separate csv files.
I am also attempting to run every 1000 vs 411k on one node of my server; today is my 3rd day and I am still on row 71.
I am not an R pro, so I could attempt only this much. Either my code runs forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this is what you want and you are OK with the storage required for that, I can provide my code for such calculations and storage.
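For reference, a rough sketch of the block-against-block idea described in the question could look like this; dat is assumed to be the 100 x 411k matrix, and the block size of 5000 is arbitrary:
block_size <- 5000
blocks <- split(seq_len(ncol(dat)), ceiling(seq_len(ncol(dat)) / block_size))
for (i in seq_along(blocks)) {
  for (j in i:length(blocks)) {
    # one tile of the full correlation matrix: columns of block i vs block j
    cij <- cor(dat[, blocks[[i]]], dat[, blocks[[j]]])
    saveRDS(cij, sprintf("cor_block_%03d_%03d.rds", i, j))
  }
}
# the tile for (j, i) is just t() of the tile for (i, j), so only the upper
# triangle is stored; that is still roughly half of the 1.3 TB mentioned above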

How to avoid the memory limit in R

I'm trying to replace values in a matrix, specifically "t"->1 and "f"->0, but I keep getting the error messages:
Error: cannot allocate vector of size 2.0 Mb
...
Reached total allocation of 16345Mb: see help(memory.size)
I'm using a Win7 computer with 16 GB of memory on the 64-bit version of R in RStudio.
What I'm currently running is
a <- matrix( dataset, nrow=nrow(dataset), ncol=ncol(dataset), byrow=TRUE)
memory.size()
a[a=="t"] <- 1
where dataset is a roughly 525000x300 data frame. The memory.size() line gives me less than 4 GB used, and memory.limit() is 16 GB. Why does the replacement line require so much memory to execute? Is there any way to do the replacement without hitting the memory limit (and are there any good tips on avoiding it in general), and if so, will it cost me a lot of time to run? I'm still pretty new to R, so I don't know if it makes a difference depending on the data class I use and how R allocates memory...
When you call this line
a[a=="t"] <- 1
R has to create a whole new logical matrix to index into a. If a is huge, this logical matrix will also be huge.
Maybe you can try working on smaller sections of the matrix instead of trying to do it all in one shot.
for (i in 1:ncol(a)) {
  ix <- (a[, i] == "t")
  a[ix, i] <- 1
}
It's not fast or elegant, but it might get around the memory problem.
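An alternative sketch in the same spirit: recode the original data frame one column at a time before building the matrix (this assumes the relevant columns contain only "t"/"f" values), so only one column-sized logical vector is alive at any moment:
# recode "t" -> 1 and "f" -> 0 column by column in the data frame itself
for (j in seq_along(dataset)) {
  dataset[[j]] <- ifelse(dataset[[j]] == "t", 1L, 0L)
}
a <- as.matrix(dataset)   # now an integer matrix, no huge character copy needed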
