Octave error: out of memory or dimension too large for Octave's index type

I am trying to run the following code in Octave. The variable "data" consists of 864 rows and 25333 columns.
clc; clear all; close all;
pkg load statistics
GEO = load("GSE59739.mat");
GEOT = tabulate(GEO.class)
data = GEO.data;
clear GEO
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
xlabel('Silhouette Value')
ylabel('Cluster')
This is the error I get when trying to run the silhouette function:
"error: out of memory or dimension too large for Octave's index type". Any idea on how I can fix it?

It appears the problem is not necessarily with your data but with the way Octave's statistics package has implemented pdist. It uses an expansion that results in an array whose dimensions exceed the system limits, just as the error message says.
Running through your example with some dummy data of the same size, on Octave 6.4.0 and statistics 1.4.3, I get:
pkg load statistics
data = rand(864,25333);
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
error: out of memory or dimension too large for Octave's index type
error: called from
pdist at line 164 column 14
silhouette at line 125 column 16
pdist is a function that calculates the "distance" between any two rows of a matrix, using one of several methods. silhouette is called here with the cosine metric, and the error occurs in that calculation section:
pdist, lines 163-166 cosine block:
case "cosine"
prod = X(:,Xi) .* X(:,Yi);
weights = sumsq (X(:,Xi), 1) .* sumsq (X(:,Yi), 1);
y = 1 - sum (prod, 1) ./ sqrt (weights);
The first line, calculating prod, causes the error: X = data' is 25333x864, and Xi and Yi are each 372816x1, formed by running nchoosek(1:rows(data),2) (producing all 372816 2-element combinations of 1:864).
X(:,Xi) and X(:,Yi) each request creation of a rows(X) x rows(Xi) array, i.e. 25333x372816, or 9,444,547,728 elements, which for double-precision data requires 75,556,381,824 bytes, or about 75.6 GB. Odds are your machine can't handle this.
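Spelling out that figure (8 bytes per double-precision element; the expression runs as-is in either Octave or R):
25333 * 372816 * 8 / 1e9   # ≈ 75.6 GB for X(:,Xi) alone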
Just checking with Matlab 2022a, it is able to run those lines without any out-of-memory errors in a few seconds, and the test1 output is only 864x1. So it appears this excessive memory overhead is an issue specific to Octave's implementation and not inherent to the technique.
I've filed a bug report regarding this behavior at https://savannah.gnu.org/bugs/index.php?62495, but for now the answer appears to be that the 'cosine' metric, and perhaps others as well, simply cannot be used with input data of this size.
Update: as of 19 JUN 2022, a fix for this pdist memory problem has been pushed to the statistics package repository, and will be included in the next major package release. In the meantime the updated function can be found at https://github.com/gnu-octave/statistics/blob/main/inst/pdist.m

Related

Consensus clustering with diceR package

I am supposed to perform combined K-means + Gaussian mixture model clustering to determine a set of consensus clusters for a fixed number of clusters (k = 4). My data is composed of 231 cells from 4 different types of tumor, with a total of 19,177 variables (genes in this case).
I have never done this before, so I tried to follow the instructions for this R package: https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However I must have done something wrong since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes far too long and ends with this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly the object being allocated is too large, and I must have misunderstood something in the tutorial.
Is someone familiar with diceR package and could explain to me if there is a way to make it work?
During its execution, consensus_cluster eats up the memory of the R session. You have so many variables that the objects it builds cannot be allocated in memory.
So you have two choices: increase the physical memory, or use only a partial sample of the data rather than the full set. Let's assume that increasing physical memory is not feasible. Then you should use the prep.data = "sample" option. However, you'll need to wait: with simulated data of the same size, the estimated wait for the GMM step alone was about 8 hours.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms = c("gmm", "km"),
                        progress = TRUE, prep.data = "sample")
Output (I was not patient enough to wait for it to finish):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h
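If sampling still leaves the run too slow or too large, another common workaround (not specific to diceR; the cutoff of 2,000 genes below is purely illustrative) is to pre-filter to the most variable genes before calling consensus_cluster:
# Keep only the most variable genes before clustering (cutoff is illustrative)
gene.var  <- apply(data, 2, var)
top.genes <- order(gene.var, decreasing = TRUE)[seq_len(2000)]
cc <- consensus_cluster(data[, top.genes], nk = 4,
                        algorithms = c("gmm", "km"), progress = FALSE)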

Error in .local(x, ...): x and y don't match

I'm new to R and am trying to fit a model using kernlab with some data that I just loaded in. However, when I try to fit the model I get the error message in the title. I assume this means the data types of x and y are not compatible.
Here's some sample code:
data = read.delim("my-sample-file.txt")
model = ksvm(data[, 1:10], data[, 11])
When I call data[, 11] I just get the raw values in the column returned to me, and I notice that the typeof function returns integer, which I found strange. I am not using any additional packages, just trying to get something basic to work.
Thank you.
Reading the help page for ksvm shows that the Usage section says that calling it with x and y as the input parameters requires a matrix for x, so this should be more successful (assuming that the data object has all-numeric columns; you really should look at your data carefully before reaching for analysis tools):
model = ksvm(x = data.matrix(data[, 1:10]), y = data[, 11])
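A quick way to look at the data before modelling (a minimal sketch, assuming data is the data frame read in above):
str(data)              # structure and class of every column
sapply(data, class)    # the first 10 columns should all be numeric or integer
table(data[, 11])      # distribution of the response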
Note that you can get exactly the same error with the iris data.frame:
ksvm(x=iris[-5], y=iris$Species)
Error in .local(x, ...) : x and y don't match.
Whereas converting to matrix results in success:
ksvm(x=data.matrix(iris[-5]), y=iris$Species)
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.484488222038106
Number of Support Vectors : 57
Objective Function Value : -3.7021 -3.8304 -21.7405
Training error : 0.026667
Morals of the story: pay attention to the 'Usage' section for guidance on the different forms that generic functions may take. And always assume that the authors of the help page are excruciatingly correct in their description of the arguments in the 'Arguments' section. If they say matrix, don't assume they mean anything that is merely sort of like a matrix. (But if you mutter under your breath that this seems like something that should have been anticipated, and a more informative error message emitted, I would not disagree.)

Error related to randomisation test within lapply() function in R

I have 30 datasets that are combined in a data list. I want to analyze the spatial point patterns with the L function, along with a randomisation test. The code is below.
The first block of code works well for a single dataset (data1), but once it is applied to a list of datasets with the lapply() function, as shown in the second block, it gives me a very long error like so:
"Error in Kcross(X, i, j, ...) : No points have mark i = Acoraceae
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim =
nsim, : Exceeded maximum number of errors"
Can anybody tell me what is wrong with the second block of code?
grp <- factor(data1$species)
window <- ripras(data1$utmX, data1$utmY)
pp.grp <- ppp(data1$utmX, data1$utmY, window=window, marks=grp)
L.grp <- alltypes(pp.grp, Lest, correction = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
plot(L.grp)
plot(LE.grp)
L.LE.sp <- lapply(data.list, function(x) {
  grp <- factor(x$species)
  window <- ripras(x$utmX, x$utmY)
  pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
  L.grp <- alltypes(pp.grp, Lest, correction = "Ripley")
  LE.grp <- alltypes(pp.grp, Lcross, envelope = TRUE)
  result <- list(L.grp = L.grp, LE.grp = LE.grp)
  return(result)
})
plot(L.LE.sp$LE.grp[1])
This question is about the R package spatstat.
It would help if you could add a minimal working example including data which demonstrate this problem.
If that is not available, please generate the error on your computer, then type traceback() and capture the output and post it here. This will trace the location of the error.
Without this information, my best guess is the following:
The error message says No points have mark i=Acoraceae. That means that the code is expecting a point pattern to include points of type Acoraceae but found that there were none. This can happen because in alltypes(... envelope=TRUE) the code generates random point patterns according to complete spatial randomness. In the simulated patterns, the number of points of type Acoraceae (say) will be random according to a Poisson distribution with a mean equal to the number of points of type Acoraceae in the observed data. If the number of Acoraceae in the actual data is small then there is a reasonable chance that the simulated pattern will contain no Acoraceae at all. This is probably what is causing the error message No points have mark i=Acoraceae.
If this interpretation is correct then you should be able to suppress the error by including the argument fix.marks=TRUE, that is,
alltypes(pp.grp, Lcross, envelope=TRUE, fix.marks=TRUE, nsim=99)
I'm not suggesting this is necessarily appropriate for your application, but this should remove the error message if my guess is correct.
In the latest development version of spatstat, available on github, the code for envelope has been tweaked to detect this error.
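As a quick diagnostic before re-running the randomisation test, you can check whether some species occur only a handful of times in each dataset; a short sketch (assuming data.list is the list of data frames from the question):
# Tabulate the number of points per species in each dataset;
# species with very few points are the likely cause of the error.
lapply(data.list, function(x) sort(table(factor(x$species))))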

Simulated Annealing in R: GenSA running time

I am using simulated annealing, as implemented in R's GenSA package (function GenSA), to search for values of input variables that result in "good values" (compared to some baseline) of a high-dimensional function. I noticed that setting the maximum number of calls of the objective function has no effect on the running time. Am I doing something wrong, or is this a bug?
Here is a modification of the example given in GenSA help file.
library(GenSA)
Rastrigin <- local({
  index <- 0
  function(x) {
    index <<- index + 1
    if (index %% 1000 == 0) {
      cat(index, " ")
    }
    sum(x^2 - 10*cos(2*pi*x)) + 10*length(x)
  }
})
set.seed(1234)
dimension <- 1000
lower <- rep(-5.12, dimension)
upper <- rep(5.12, dimension)
out <- GenSA(lower = lower, upper = upper, fn = Rastrigin, control = list(max.call = 10^4))
Even though max.call is specified to be 10,000, GenSA calls the objective function more than 46,000 times (note that the objective is defined within a local environment in order to track the number of calls). The same problem arises when trying to specify the maximum running time via max.time.
This is an answer from the package maintainer:
max.call and max.time are soft limits that do not include local searches that are performed before reaching these limits. The algorithm does not stop the local search strategy loop before its end, and this may exceed the limitation that you have set, but it will stop after that last search. We have designed the algorithm that way to make sure that the algorithm isn't stopped in the middle of searching a valley. Such an option to stop anywhere will be implemented in the next release of the package.
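If a hard cap on the number of evaluations is needed before that release, one possible workaround (not part of GenSA's API; the helper below is just a sketch) is to enforce the limit inside the objective function itself and abort the run once it is hit, while keeping track of the best value seen so far:
library(GenSA)
# Sketch: wrap the objective so it aborts after hard.limit calls,
# remembering the best value and parameters seen so far.
make_limited_rastrigin <- function(hard.limit) {
  index <- 0
  best <- Inf
  best.par <- NULL
  fn <- function(x) {
    index <<- index + 1
    val <- sum(x^2 - 10*cos(2*pi*x)) + 10*length(x)
    if (val < best) {
      best <<- val
      best.par <<- x
    }
    if (index >= hard.limit) stop("hard call limit reached")
    val
  }
  list(fn = fn, result = function() list(calls = index, value = best, par = best.par))
}
set.seed(1234)
dimension <- 1000
lower <- rep(-5.12, dimension)
upper <- rep(5.12, dimension)
obj <- make_limited_rastrigin(hard.limit = 10^4)
try(GenSA(lower = lower, upper = upper, fn = obj$fn,
          control = list(max.call = 10^4)), silent = TRUE)
obj$result()   # the number of calls stops at (or just past) the hard limit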

R problem with randomForest classification with raster package

I am having an issue with randomForest and the raster package. First, I create the classifier:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
outraster = "classified.pix"
training_band = 2
validation_band = 1
original_classes = c(125,126,136,137,151,152,159,170)
reclassd_classes = c(122,122,136,137,150,150,150,170)
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
# Reclass the training data classes as required
training_class = subs(training_class, data.frame(original_classes,reclassd_classes))
# Find pixels that have training data and prepare the data used to create the classifier
is_training = Which(training_class != 0, cells=TRUE)
training_predictors = extract(myraster, is_training)[,3:nlayers(myraster)]
training_response = as.factor(extract(training_class, is_training))
remove(is_training)
# Create and save the forest, use odd number of trees to avoid breaking ties at random
r_tree = randomForest(training_predictors, y=training_response, ntree = 201, keep.forest=TRUE) # Runs out of memory, does not allow more trees than this...
remove(training_predictors, training_response)
Up to this point, all is good. I can see that the forest was created correctly by looking at the error rates, confusion matrix, etc. When I try to classify some data, however, I run into trouble with the following, which returns all NA's in predictions:
# Classify the whole image
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictions = predict(predictor_data, r_tree, type='response', progress='text')
And gives this warning:
Warning messages:
1: In `[<-.factor`(`*tmp*`, , value = c(1, 1, 1, 1, 1, 1, ... :
invalid factor level, NAs generated
(keeps going like this)...
However, calling predict.randomForest directly works fine and returns the expected predictions (this is not a good option for me because the image is large, and I cannot store the whole matrix in memory):
# Classify the whole image and write it to file
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictor_data = extract(predictor_data, extent(predictor_data))
predictions = predict(r_tree, newdata=predictor_data)
How can I get it to work directly with the "raster" version? I know that this is possible, as shown in the examples of predict{raster}.
You could try nesting predict.randomForest within the writeRaster function and writing the result to a raster in chunks, as described in the PDF vignette included with the raster package. Before that, try the argument na.rm=TRUE when calling raster's predict. You might also assign dummy values to the NAs in the rasters used for prediction, then later rewrite them as NAs using functions in the raster package.
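For the chunked approach, a sketch along the lines of the raster package's writeStart/writeValues/writeStop pattern might look like the following (it reuses r_tree, predictor_data and outraster from above; the NA handling and the conversion of factor predictions back to numeric codes are assumptions you may need to adapt):
# Predict block by block and write the result to disk as we go,
# so the full predictor matrix never has to fit in memory.
out <- raster(predictor_data)
bs  <- blockSize(predictor_data)
out <- writeStart(out, filename = outraster, overwrite = TRUE)
for (i in seq_len(bs$n)) {
  vals <- as.data.frame(getValues(predictor_data, row = bs$row[i], nrows = bs$nrows[i]))
  pred <- rep(NA_real_, nrow(vals))
  ok   <- complete.cases(vals)
  if (any(ok)) {
    # predict.randomForest returns a factor; the class labels are numeric codes here
    pred[ok] <- as.numeric(as.character(predict(r_tree, newdata = vals[ok, , drop = FALSE])))
  }
  out <- writeValues(out, pred, bs$row[i])
}
out <- writeStop(out)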
As for memory problems when calling RFs, I've had a plethora of memory issues dealing with BRTs. They're immense on disk and in memory! (Should a model be more complex than the data?) I've not had them run reliably on 32-bit machines (WinXp or Linux). Sometimes tweaking Windows memory allotment to applications has helped, and moving to Linux has helped more, but I get the most from 64-bit Windows or Linux machines, since they impose a higher (or no) limit on the amount of memory applications can take. You may be able to increase the number of trees you can use by doing this.
