Consensus clustering with diceR package - r

I need to combine K-means and Gaussian mixture models to determine a set of consensus clusters for a fixed number of clusters (k = 4). My data consist of 231 cells from 4 different tumor types, with a total of 19,177 variables (genes in this case).
I have never done this before, so I tried to follow the instructions for this R package: https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However, I must have done something wrong, because when I run this code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes far too long and ends with this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly the vector being generated is too large, and I must have misunderstood something in the documentation.
Is anyone familiar with the diceR package who could explain whether there is a way to make this work?

During its execution, consensus_cluster "eats up" the memory of the R session. You have so many variables that the objects needed to handle them cannot be allocated in memory.
So you have two choices: add physical memory, or work on a partial sample of the data rather than all of it. Let's assume that adding physical memory is not feasible. Then you should use the prep.data = "sample" option. However, you will need to wait: on my simulated data, the estimated run time for the GMM step was 8 hours.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms =c("gmm", "km"), progress = TRUE,
prep.data = "sample")
Output (I was not patient enough to wait for it to finish):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h
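A back-of-the-envelope estimate offers one possible reading of the 11.0 Gb in the error message. The sketch below assumes that the GMM step stores one full p x p double-precision matrix per cluster. Nothing in this thread confirms that assumption, and cov_array_gib is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper: size in GiB of k full p x p matrices of doubles.
# ASSUMPTION: the GMM step keeps one such matrix per cluster.
cov_array_gib <- function(p, k, bytes_per_double = 8) {
  p^2 * k * bytes_per_double / 2^30
}

size_full  <- cov_array_gib(p = 19177, k = 4)  # all genes from the question
size_small <- cov_array_gib(p = 2000,  k = 4)  # an illustrative smaller gene set

cat(sprintf("19177 variables: %.2f GiB\n", size_full))
cat(sprintf(" 2000 variables: %.2f GiB\n", size_small))
```

Under that assumption the size grows with the square of the number of variables, so reducing the number of genes before clustering would shrink the allocation much faster than reducing the number of cells.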

Related

Octave Error: out of memory or dimension too large for Octave's index type

I am trying to run the following code in Octave. The variable "data" consists of 864 rows and 25333 columns.
clc; clear all; close all;
pkg load statistics
GEO = load("GSE59739.mat");
GEOT = tabulate(GEO.class)
data = GEO.data;
clear GEO
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
xlabel('Silhouette Value')
ylabel('Cluster')
This is the error I get when trying to run the silhouette function:
"error: out of memory or dimension too large for Octave's index type". Any idea on how I can fix it?
It appears the problem is not necessarily with your data but with the way Octave's statistics package has implemented pdist. It uses an expansion that results in an array with dimensions that do exceed the system limits, just as the error message says.
Running through your example with some dummy data of the same size, on Octave 6.4.0 and statistics 1.4.3, I get:
pkg load statistics
data = rand(864,25333);
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
error: out of memory or dimension too large for Octave's index type
error: called from
pdist at line 164 column 14
silhouette at line 125 column 16
pdist is a function that calculates the "distance" between any two rows in a matrix, using one of several methods. silhouette is called here with the cosine metric, and the error occurs in that section of the calculation:
pdist, lines 163-166 cosine block:
case "cosine"
prod = X(:,Xi) .* X(:,Yi);
weights = sumsq (X(:,Xi), 1) .* sumsq (X(:,Yi), 1);
y = 1 - sum (prod, 1) ./ sqrt (weights);
The first line calculating prod causes the error, as X = data' is 25333x864, and Xi and Yi are each 372816x1, and were formed by running nchoosek(1:rows(data),2) (producing 372816 sets of all 2 element combinations of 1:864).
X(:,Xi) and X(:,Yi) each request creation of a rows(X) x rows(Xi) array, or 25333x372816, or 9,444,547,728 elements, which for double precision data requires 75,556,381,824 Bytes or 75.6GB. Odds are your machine can't handle this.
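The arithmetic above can be reproduced in a few lines. This is only a check of the numbers quoted in the answer, written in R rather than Octave; the variable names are illustrative.

```r
# Dimensions of the data in the question
n_rows <- 864      # observations
n_cols <- 25333    # variables

# Number of unordered row pairs, as produced by nchoosek(1:864, 2)
n_pairs <- choose(n_rows, 2)

# X(:,Xi) has one column per pair and one row per variable
elements <- n_cols * n_pairs

# Double precision uses 8 bytes per element
bytes <- elements * 8

cat("pairs:   ", format(n_pairs, big.mark = ","), "\n")
cat("elements:", format(elements, big.mark = ","), "\n")
cat("size:    ", sprintf("%.1f GB", bytes / 1e9), "\n")
```

Note that the code requests two arrays of this size (one for Xi and one for Yi), plus a third for their product.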
Checking with Matlab 2022a, it runs those lines in a few seconds without any out-of-memory errors, and the test1 output is only 864x1. So it appears that this excessive memory overhead is specific to Octave's implementation and not inherent to the technique.
I've filed a bug report regarding this behavior at https://savannah.gnu.org/bugs/index.php?62495, but for now the answer appears to be that the 'cosine' metric, and perhaps others as well, simply cannot be used with input data of this size.
Update: as of 19 JUN 2022, a fix for this pdist memory problem has been pushed to the statistics package repository, and will be included in the next major package release. In the meantime the updated function can be found at https://github.com/gnu-octave/statistics/blob/main/inst/pdist.m
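To illustrate why the technique itself does not need the huge intermediate arrays, here is a minimal sketch in R (not Octave, and not the code of the fix linked above). It computes all pairwise cosine distances from a single matrix product of row-normalized data. The function name cosine_dist_matrix is hypothetical.

```r
# Pairwise cosine distance between the rows of X via one matrix product.
# The largest object created is the n x n result, not an n_cols x n_pairs array.
cosine_dist_matrix <- function(X) {
  row_norms <- sqrt(rowSums(X^2))
  Xn <- X / row_norms          # recycling divides row i by row_norms[i]
  D <- 1 - tcrossprod(Xn)      # 1 - cosine similarity for every row pair
  D[D < 0] <- 0                # clamp tiny negative values from rounding
  diag(D) <- 0                 # distance of a row to itself
  D
}

set.seed(42)
X_small <- matrix(runif(6 * 10), nrow = 6)  # small dummy data
D_small <- cosine_dist_matrix(X_small)
print(dim(D_small))
```

For 864 rows the result is an 864 x 864 matrix of doubles, which is about 6 MB.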

bartMachine in caret train error : incorrect number of dimensions

I encounter a strange problem when trying to train a model in R using caret :
> bart <- train(x = cor_data, y = factor(outcome), method = "bartMachine")
Error in tuneGrid[!duplicated(tuneGrid), , drop = FALSE] :
incorrect number of dimensions
However, when using rf, xgbTree, glmnet, or svmRadial instead of bartMachine, no error is raised.
Moreover, dim(cor_data) and length(outcome) return [1] 3056 134 and [1] 3056 respectively, which indicates that there is indeed no issue with the dimensions of my dataset.
I have tried changing the tuneGrid parameter in train, which resolved the problem but caused this issue instead :
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-89-thread-1"
My dataset includes no NA, and all variables are either numerical or binary.
My goal is to extract the most important variables in the bart model. For example, I use for random forests:
rf <- train(x = cor_data, y = factor(outcome), method = "rf")
rfImp <- varImp(rf)
rf_select <- row.names(rfImp$importance[order(- rfImp$importance$Overall)[1:43], , drop = FALSE])
Thank you in advance for your help.
Since your goal is to extract the most important variables in the bart model, I will assume you are willing to bypass the caret wrapper and do it directly in R bartMachine, which is the only way I could successfully run it.
For my system, solving the memory issue required 2 further things:
Restart R and, before loading anything, allocate 8 GB of memory to Java like so:
options(java.parameters = "-Xmx8g")
When running bartMachineCV, turn off mem_cache_for_speed:
library(bartMachine)
set_bart_machine_num_cores(16)
bart <- bartMachineCV(X = cor_data, y = factor(outcome), mem_cache_for_speed = F)
This will iterate through 3 values of k (2, 3 and 5) and 2 values of m (50 and 200), running 5-fold cross-validation each time, and then build a bartMachine using the best hyperparameter combination. You may also have to reduce the number of cores depending on your system. For reference, this took about an hour on a training set of 20,000 observations x 12 variables using 16 cores. You could also reduce the number of hyperparameter combinations it tests using the k_cvs and num_tree_cvs arguments.
Then to get the variable importance:
vi <- investigate_var_importance(bart, num_replicates_for_avg = 20)
print(vi)
You can also use it as a predictive model with predict(bart, new_data=new), similar to the object normally returned by caret::train(). This worked on R 4.0.5, bartMachine_1.2.6 and rJava_1.0-4.

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data (by a friend who needs help). The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7 GHz and 36 GB of RAM.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
x = as.vector(M1[i,])
#Taking one row of genetic data to be used in the regression
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
#Running the model
c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
#Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
setTkProgressBar(pb, i)
#This is some code I found here to keep track of my progress
res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, Future, or even Microsoft R Open might work better, but I haven't had much success with any of them (likely due to my own lack of know-how). I've also read a bit about the "bigmemory" package as a way to share the large matrix across cores more efficiently, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional oomph, if anyone knows more about this.
Any advice would be very appreciated!

Neuralnet package in R with simple structure taking very long time, what is the issue here?

I have a problem with neuralnet function from neuralnet package in R.
I designed a simple structure with 82 features as input, a single hidden layer with 10 neurons, and 20 output classes. I left the line below, which calls the neuralnet function, running for more than 4 hours and it did not finish!
This is the code :
nn=neuralnet(f, data = train, hidden = 10, err.fct = "sse",threshold = 1,
learningrate=.05,rep = 1, linear.output = FALSE)
Training a neural network can take arbitrarily long. What affects the training time?
Complexity of the network (not a problem here, as your network is quite small)
Size of the training data - even a few thousand samples can take quite a while, and the number of features also significantly increases computation time
Training algorithm and its hyperparameters - in particular, for SGD-based solutions, a learning rate that is too small (or too big, as that causes oscillation)
Type of stopping criterion - there are many ways of checking whether to stop training a NN, some more expensive (validation score) than others (amplitude of gradient/number of epochs).
In your particular example, training takes at most 100,000 steps and uses rprop+ learning. Thus the most probable problem is the size of the training data. You can try setting stepmax to a much smaller value to see how much time it needs and how good the model is.
In general - neural networks are hard and slow to train, you have to deal with it or switch to other models.
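To put a number on "your network is quite small", the trainable weights can be counted directly. This is a rough sketch that assumes a fully connected network with one bias (intercept) term per neuron; the variable names are illustrative.

```r
# Architecture from the question: 82 inputs, 10 hidden neurons, 20 outputs
n_inputs  <- 82
n_hidden  <- 10
n_outputs <- 20

# ASSUMPTION: fully connected layers, one bias term per neuron
hidden_weights <- (n_inputs + 1) * n_hidden    # input -> hidden, incl. bias
output_weights <- (n_hidden + 1) * n_outputs   # hidden -> output, incl. bias
n_weights <- hidden_weights + output_weights

cat("weights to train:", n_weights, "\n")
```

A little over a thousand weights is tiny, which supports the point that the cost comes from the amount of training data rather than from the network itself.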
You can easily predict the computation time and complexity of your code before running it on the full data with the GuessCompx package.
Create fake data with the same characteristics as yours, including a 20-class Y vector, and a wrapper function:
train = data.frame(matrix(rnorm(300000*82, 3), ncol=82))
train['Y'] = as.character(round(runif(300000, 1,20)))
nn_test = function(data) {
nn=neuralnet(formula=Y~., data=data, hidden = 10, err.fct = "sse",threshold = 1,
learningrate=.05,rep = 1, linear.output = FALSE)
}
And then do the audit:
library(GuessCompx) # get it by running: install.packages("GuessCompx")
library(neuralnet)
CompEst(train, nn_test)
#### $`TIME COMPLEXITY RESULTS`$best.model
#### [1] "NLOGN"
#### $`TIME COMPLEXITY RESULTS`$computation.time.on.full.dataset
#### [1] "1M 4.86S"
#### $`MEMORY COMPLEXITY RESULTS`$best.model
#### [1] "LINEAR"
#### $`MEMORY COMPLEXITY RESULTS`$memory.usage.on.full.dataset
#### [1] "55535 Mb"
#### $`MEMORY COMPLEXITY RESULTS`$system.memory.limit
#### [1] "16282 Mb"
So the computation time is not the problem, but the memory usage is: the estimated requirement exceeds the system limit, which may be what is slowing your computer down and causing the long delay. The nn output object alone takes more than 4 GB to store!

R problem with randomForest classification with raster package

I am having an issue with randomForest and the raster package. First, I create the classifier:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
outraster = "classified.pix"
training_band = 2
validation_band = 1
original_classes = c(125,126,136,137,151,152,159,170)
reclassd_classes = c(122,122,136,137,150,150,150,170)
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
# Reclass the training data classes as required
training_class = subs(training_class, data.frame(original_classes,reclassd_classes))
# Find pixels that have training data and prepare the data used to create the classifier
is_training = Which(training_class != 0, cells=TRUE)
training_predictors = extract(myraster, is_training)[,3:nlayers(myraster)]
training_response = as.factor(extract(training_class, is_training))
remove(is_training)
# Create and save the forest, use odd number of trees to avoid breaking ties at random
r_tree = randomForest(training_predictors, y=training_response, ntree = 201, keep.forest=TRUE) # Runs out of memory, does not allow more trees than this...
remove(training_predictors, training_response)
Up to this point, all is good. I can see that the forest was created correctly by looking at the error rates, confusion matrix, etc. When I try to classify some data, however, I run into trouble with the following, which returns all NA's in predictions:
# Classify the whole image
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictions = predict(predictor_data, r_tree, type='response', progress='text')
And gives this warning:
Warning messages:
1: In `[<-.factor`(`*tmp*`, , value = c(1, 1, 1, 1, 1, 1, ... :
invalid factor level, NAs generated
(keeps going like this)...
However, calling predict.randomForest directly works fine and returns the expected predictions (this is not a good option for me because the image is large, and I cannot store the whole matrix in memory):
# Classify the whole image and write it to file
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictor_data = extract(predictor_data, extent(predictor_data))
predictions = predict(r_tree, newdata=predictor_data)
How can I get it to work directly with the "raster" version? I know that this is possible, as shown in the examples of predict{raster}.
You could try nesting predict.randomForest within the writeRaster function and writing the matrix as a raster in chunks, as described in the PDF included with the raster package. Before that, try the argument 'na.rm=TRUE' when calling predict in the raster function. You might also assign dummy values to the NAs in the predictor rasters, then later rewrite them as NAs using functions in the raster package.
As for memory problems when calling RFs, I've had a plethora of memory issues dealing with BRTs. They're immense on disk and in memory! (Should a model be more complex than the data?) I've not had them run reliably on 32-bit machines (WinXp or Linux). Sometimes tweaking Windows memory allotment to applications has helped, and moving to Linux has helped more, but I get the most from 64-bit Windows or Linux machines, since they impose a higher (or no) limit on the amount of memory applications can take. You may be able to increase the number of trees you can use by doing this.
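The chunked approach suggested above can be sketched in base R. This is only an illustration of the pattern: it uses lm() as a stand-in for the random forest, an in-memory data frame as a stand-in for the raster, and a hypothetical helper predict_in_chunks. In the real case each chunk would be read from the image and written to the output file instead of being kept in memory.

```r
# Stand-in model (ASSUMPTION: lm() replaces randomForest for illustration)
set.seed(1)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- 2 * train$x1 - train$x2 + rnorm(200, sd = 0.1)
model <- lm(y ~ x1 + x2, data = train)

# Stand-in for the pixels to classify
newdata <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))

# Hypothetical helper: predict a block of rows at a time so that only
# one chunk of predictors is handed to predict() at once
predict_in_chunks <- function(model, newdata, chunk_size = 250) {
  n <- nrow(newdata)
  out <- numeric(n)
  for (start in seq(1, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1, n)
    out[start:end] <- predict(model, newdata = newdata[start:end, , drop = FALSE])
  }
  out
}

chunked <- predict_in_chunks(model, newdata)
```

The chunk size trades memory for the number of predict calls, so it can be tuned to whatever fits on the machine.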
