Running clustering analysis on a cluster causes one of the nodes to crash

I am a user of a Rocks 4.3 cluster with 22 nodes. I am using it to run a clustering function, parPvclust, on a dataset of 2 million rows and 100 columns (it clusters the sample names in the columns). To run parPvclust, I am using a C-shell script in which I've embedded some R code. Using the R code below on the full dataset of 2 million rows and 100 columns, I always crash one of the nodes.
library("Rmpi")
library("pvclust")
library("snow")
cl <- makeCluster()
load("dataset.RData") # dataset.m: 2 million rows x 100 columns
# subset.m <- dataset.m[1:200000,] # 200 000 rows x 100 columns
output <- parPvclust(cl, dataset.m, method.dist="correlation", method.hclust="ward",nboot=500)
save(output,"clust.RData")
I know that the C-shell script works, and I know that the R code itself works with a smaller dataset: if I use a subset of the dataset (commented out above), the code runs fine and I get an output. Likewise, if I use the non-parallelized version (i.e. just pvclust), that also works, although running the non-parallelized version defeats the speed gain of running it in parallel.
The parPvclust function requires the Rmpi and snow R packages (for parallelization) and the pvclust package.
The following can produce a reasonable approximation of the dataset I'm using:
dataset <- matrix(unlist(lapply(rnorm(n = 2000, 0, 1), rep, sample.int(1000, 1))), ncol = 100, nrow = 2000000)  # values are recycled as needed to fill the 2 million x 100 matrix
Are there any ideas as to why I always crash a node with the larger dataset and not the smaller one?
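For a rough sense of the memory involved (a back-of-envelope sketch, assuming, as is typical for snow-based parallelism, that every worker node ends up holding its own full copy of the data matrix):
2e6 * 100 * 8 / 1024^3  # ~1.5 GB of doubles for the full matrix alone
# Add the bootstrap copies pvclust makes internally plus R's temporary allocations,
# and a node with only a few GB of RAM can plausibly be exhausted, while the
# 200 000-row subset (~0.15 GB) stays comfortably within bounds.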

Related

FactoMineR PCA takes a really long time

I'm trying to run a PCA on a really large dataset (160 000 x 20 000 variables, approx. 6.3 GB on disk but much more once loaded into R) on a cluster. However, it is taking a very long time (my job was killed after 90 hours), whereas it usually finished in a few hours on datasets half the size.
I'm using the most basic R code possible:
library(FactoMineR)  # PCA() comes from FactoMineR
data <- read.table("dataset.csv", header = TRUE, sep = ",", row.names = 1, fill = TRUE)
y <- PCA(data, ncp = 100, graph = FALSE)
Is there something wrong with what I'm doing or should I try a PCA from another package?
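One possible workaround (a sketch, not a drop-in replacement: it assumes the matrix fits in memory and that only the first 100 components are needed, as ncp=100 suggests) is a truncated PCA via the irlba package, together with a faster reader such as data.table::fread:
library(data.table)  # fread() reads a multi-GB csv far faster than read.table()
library(irlba)       # prcomp_irlba() computes only the first n principal components
dt <- fread("dataset.csv", header = TRUE, sep = ",")
m <- as.matrix(dt[, -1, with = FALSE])  # drop the row-name column
rownames(m) <- dt[[1]]
pca <- prcomp_irlba(m, n = 100, center = TRUE, scale. = TRUE)  # FactoMineR::PCA scales variables by default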

How to resolve error in running KModes algorithm in R

I'm working on a segmentation problem and I have a dataframe with 49 variables and 500000 observations, which can be continuous, binary or categorical. I'm reading in only those variables which do not have any NA values in them; to be on the safe side I'm also using the na.omit option.
Now, as the dataset is too large, I was trying to run it incrementally and sampled 1000, 10000 and 50000 rows. It ran successfully on 1000 and 10000 rows with the following code:
t1c <- t1[sample(nrow(t1), 50000), -c(5, 23, 25, 26, 28, 55)]  # sample rows, drop unwanted columns
library(klaR)
segments <- kmodes(na.omit(t1c), 4, iter.max = 5)
where t1 is my original dataframe. When I ran this with 50000 rows, I got the following error:
Error in match.arg(useNA) : 'arg' must be of length 1
Any idea as to what might be the issue here?
P.S. I'm also looking into running PAM on a daisy() dissimilarity, which might be a better fit for this mix of data types (I'm still researching that), but I was still wondering: if kmodes runs with 10000 samples, what could be the issue with 50000?
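I can't see the data, but since the error only appears at 50000 rows, a reasonable first diagnostic (a sketch only, reusing the same t1c construction as above; it does not pinpoint where klaR's match.arg call fails) is to look for columns that behave differently in the larger sample:
t1c <- t1[sample(nrow(t1), 50000), -c(5, 23, 25, 26, 28, 55)]
sapply(t1c, class)                                   # does any column change type in the bigger sample?
colSums(is.na(t1c))                                  # are NAs sneaking in despite the filtering?
sapply(t1c, function(x) length(unique(na.omit(x))))  # does any column collapse to a single level?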

Big data. Store and analyze objects without using the RAM

I'm trying to get a dissimilarity matrix with the function beta.pair of the package betapart, from a database with 11000 rows x 6000 columns.
However, when I execute the function R gives me the error: "cannot allocate a vector of size 10 GB".
I'm using a Linux system with 65 GB of RAM and 64-bit R, so I shouldn't be running into this limit, and R should be able to use all the machine's available memory, but it doesn't.
Do you know how I can use a package (e.g. bigmemory, ff) to store the dissimilarity matrix on disk, and later use it to perform a UPGMA analysis keeping the result on disk rather than in RAM?
Exactly what I would like to do is:
library(betapart)
data_base  # database with 6000 columns and 11000 rows
distance <- beta.pair(data_base, index.family = "sorensen")  # object containing the dissimilarity matrices
beta_similarity <- distance$beta.sim  # select one of the dissimilarity matrices returned by beta.pair
library(cluster)
UPGMA <- agnes(beta_similarity)  # UPGMA analysis
Thanks in advance,
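A rough size check, plus one lighter-weight route (a sketch only: it assumes the "dist" object distance$beta.sim itself fits in RAM, and it swaps agnes() for hclust(), whose method = "average" is also UPGMA):
11000 * 10999 / 2 * 8 / 1024^3  # a dist over 11000 sites is only ~0.45 GB, so the 10 GB
                                # allocation presumably comes from intermediate matrices
UPGMA <- hclust(beta_similarity, method = "average")  # UPGMA directly on the dist object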

Very slow raster::sampleRandom, what can I do as a workaround?

tl;dr: why is raster::sampleRandom taking so much time? e.g. to extract 3k cells from 30k cells (over 10k timesteps). Is there anything I can do to improve the situation?
EDIT: workaround at bottom.
Consider an R script in which I have to read a big file (usually more than 2-3 GB) and perform a quantile calculation over the data. I use the raster package to read the (netCDF) file. I'm using R 3.1.2 under 64-bit GNU/Linux with 4 GB of RAM, 3.5 GB of which are available most of the time.
As the files are often too big to fit into memory (even 2GB files for some reason will NOT fit into 3GB of available memory: unable to allocate vector of size 2GB) I cannot always do this, which is what I would do if I had 16GB of RAM:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(getValues(pr)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
But instead I can sample a smaller number of cells in my files using the function sampleRandom() from the raster package, and still get good statistics.
e.g.:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(sampleRandom(pr, cnsample)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
I perform this over 6 different files (i goes from 1 to 6) which all have about 30k cells and 10k timesteps (so 300M values). Files are:
1.4GB, 1 variable, filesystem 1
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
1.2GB, 1 variable, filesystem 3
1.2GB, 1 variable, filesystem 3
Note that:
files are on three different nfs filesystem, whose performance I'm not sure of. I cannot rule out the fact that the nfs filesystems can greatly vary in performance from one moment to the other.
RAM usage is at 100% all the time while the script runs, but the system does not use all of its swap.
sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep), and reads their content. Does so for the same N cells for each layer. If you visualize the dataset as a 3D matrix, with Z as timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have the NAs in the same positions, so it has to check that any column it chooses does not have NAs in it.
When using the same commands on files with 8393 cells (about 340 MB in total) and reading all the cells, the computing time is a fraction of what it takes to read 1000 cells from a file with 30k cells.
The full script which produces the output below is here, with comments etc.
If I try to read all the 30k cells:
cannot allocate vector of size 2.6 Gb
If I read 1000 cells:
5 minutes
45 m
30 m
30 m
20 m
20 m
If I read 3000 cells:
15 minutes
18 m
35 m
34 m
60 m
60 m
If I try to read 5000 cells:
2.5 h
22 h
for the files beyond the second one I had to stop after 18 h, as I needed the workstation for other tasks
With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).
Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.
EDIT:
The working workaround is:
cells <- sampleRandom(pr[[1]], cnsample, cells = TRUE)  # extract cnsample random cells from the first layer, excluding NAs
cells[, 1]  # column 1 holds the cell indices
prvals <- pr[cells[, 1]]  # read those cells from all layers
qs <- quantile(prvals, probs = qprobs, na.rm = TRUE, type = 8, names = FALSE)  # compute the quantiles
This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.
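Wrapped up for the six files (a hypothetical helper only; filename, var, gain, cnsample and qprobs are the objects already used above):
sample_quantiles <- function(fname, vname, g, cnsample, qprobs) {
  pr    <- brick(fname, varname = vname)
  cells <- sampleRandom(pr[[1]], cnsample, cells = TRUE)[, 1]  # cell indices from layer 1, NAs excluded
  vals  <- pr[cells] * g                                       # read those cells from every layer
  quantile(vals, probs = qprobs, na.rm = TRUE, type = 8, names = FALSE)
}
qs <- lapply(seq_along(filename), function(i)
  sample_quantiles(filename[i], var[i], gain[i], cnsample, qprobs))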

Parallel processing in R

I'm working with a custom random forest function that requires both a starting and ending point in a set of genomic data (about 56k columns).
I'd like to split the column numbers into subgroups and allow each subgroup to be processed individually to speed things up. I tried this (unsuccessfully) with the following code:
library(foreach)
library(doMC)
foreach(startMrk=(markers$start), endMrk=(markers$end)) %dopar%
rfFunction(genoA,genoB,0.8,ntree=100,startMrk=startMrk,endMrk=endMrk)
Where startMrk is a numeric vector: 1 4 8 12 16, and endMrk is another: 3 7 11 15 19.
For this example, I'd want one core to run samples 1:3, another to run 4:7, etc. I'm new to the idea of parallel processing in R, so I'm more than willing to study any documentation available. Does anyone have advice on things I'm missing for parallel-wise processing or for the above code?
The basic point is that you're splitting your columns into chunks. First, it might be better to chunk the dataset appropriately at each iteration and feed each chunk into the random forest. Also, foreach works much like for, except that it collects and returns the result of each iteration, so the code can be:
library(foreach)
library(randomForest)
# a parallel backend must be registered first (e.g. registerDoMC()); see the note below
rfs <- foreach(i = 1:4) %dopar% {
  ind <- markers$start[i]:markers$end[i]
  randomForest(genoA[, ind], genoB[, ind], 0.8, ntree = 100)  # each iteration's result is collected into the list rfs
}
I've written this with regular randomForest, but you can wrap your custom function in the same way.
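As the comment above notes, a parallel backend must be registered before %dopar% will actually run in parallel; without one it runs sequentially with a warning, which may be why the original attempt appeared not to parallelize. A sketch combining the question's original call with an explicit registration (assuming rfFunction returns an object worth collecting):
library(foreach)
library(doMC)
registerDoMC(cores = 4)  # register 4 workers for %dopar%; pick a count that fits the machine
results <- foreach(startMrk = markers$start, endMrk = markers$end) %dopar%
  rfFunction(genoA, genoB, 0.8, ntree = 100, startMrk = startMrk, endMrk = endMrk)
# results is a list with one element per (startMrk, endMrk) chunk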
