How to chunk large dissimilarity / distance matrices in R?

I would like to cluster mixed-type data with 50k rows and 10 features/columns. I am using R on my 64-bit PC. When I calculate the dissimilarity / distance matrix with the "daisy" function, I get an "Error: cannot allocate vector of size X GB" error.
gower_dist <- daisy(df, metric = "gower")
This is the command that generates the distance matrix. How can I run this computation in chunks to avoid the RAM error?
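One possible sketch of a chunked computation, assuming the gower.dist() function from the StatMatch package instead of daisy(): each block of rows is compared against the full data frame and written to disk, so the full 50k x 50k matrix is never held in memory at once (note the blocks on disk still add up to roughly 20 GB).
library(StatMatch)   # gower.dist(data.x, data.y) computes Gower distances for mixed data

chunk_size <- 1000
blocks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))

for (k in seq_along(blocks)) {
  # Distances between one block of rows and the whole data frame
  d_block <- gower.dist(data.x = df[blocks[[k]], ], data.y = df)
  saveRDS(d_block, file = sprintf("gower_block_%03d.rds", k))
}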

Related

Efficient allocation of array in R, igraph

I want to create 2 matrices with the same dimensions as the adjacency matrix of the graph I have. The problem is that the graph is way too large.
Here is my code:
AjM <- as_adjacency_matrix(g, attr = "weight")
dim(AjM)
77500 77500
Alpha <- array(0, dim(AjM))
Error: cannot allocate vector of size 44.6 Gb
AjM itself is only 18.8 Mb.
How can I do that? (I have a 64-bit machine with 12 GB of RAM.)
Notice that as_adjacency_matrix has an argument sparse. Assuming that your matrix has fewer than 77500 * 77500 / 2 edges, setting sparse=TRUE may reduce the size of the resulting adjacency matrix significantly. To use this option, you must have the Matrix package installed.
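A minimal sketch of that approach, assuming g from the question; the all-zero companion matrix can also be created in sparse form with the Matrix package, so the dense 77500 x 77500 array is never materialised:
library(igraph)
library(Matrix)

# Sparse adjacency matrix: only the non-zero weights are stored
AjM <- as_adjacency_matrix(g, attr = "weight", sparse = TRUE)

# An all-zero matrix of the same dimensions, also stored sparsely
Alpha <- Matrix(0, nrow = nrow(AjM), ncol = ncol(AjM), sparse = TRUE)
object.size(Alpha)   # tiny compared with the ~44.6 Gb dense array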

Compute dissimilarity matrix on parallel cores [duplicate]

I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally, it would also be great if someone could help me run the function on parallel cores. Below is the call that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)
I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
I ended up using the kmeans algorithm in the h2o package, which parallelizes the computation and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before plugging it into h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large nominal differences (since it is minimizing a distance calculation). I used the scale() function.
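For instance, a minimal sketch of that scaling step, assuming df mixes numeric and categorical columns, so only the numeric ones are centred and scaled:
num_cols <- sapply(df, is.numeric)
df[num_cols] <- scale(df[num_cols])   # each numeric column to mean 0, sd 1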
After installing h2o:
library(h2o)
h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)
# 'vars' is the vector of column names you want to cluster on
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)

Linear optimisation and limitation in R

We are trying to solve the following linear optimization problem:
We have:
Pij, i = 1:3, j = 1:30000, all Pij positive
Bi, i = 1:3, positive integers
We are looking for a 3 x 30000 matrix of binary values Xij subject to:
Constraints:
For each j = 1:30000: sum over i = 1:3 of Xij = 1
For each i = 1:3: sum over j = 1:30000 of Xij ≤ Bi
Objective:
Maximize the sum over i = 1:3 and j = 1:30000 of Pij * Xij
Our approach was to reduce this to a linear programming problem. We construct one matrix of dimension 3 x 3*j for the Bi constraints and one matrix of dimension j x 3*j for the Xij constraints. The two matrices are then stacked vertically to form the constraint matrix of the problem, which is (3+j) x 3*j. The objective vector is built from Pij, reshaped as a 3*j x 1 vector. The right-hand side is the combination of Bi (1 x 3) and a vector of ones (1 x j), i.e. a 1 x (3+j) vector.
This worked with lp or Rglpk_solve_LP.
I checked it with several combinations: it worked for j = 5000, but it did not work for j = 10000, and we need it for 30000 cases. The matrices become too large.
Is it possible to solve this task in another way?
My computer has 8 GB of RAM. The size of the constraint matrix is 15.6 GB, and the returned error is:
Error: cannot allocate vector of size 15.6 Gb
What are the limitations of the linear programming procedure?
Do they come only from the computer's RAM and the size of the matrices?
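A hedged sketch of the same formulation with a sparse constraint matrix, assuming the slam and Rglpk packages and a version of Rglpk_solve_LP that accepts slam's simple_triplet_matrix for the constraints; the (3+j) x 3*j matrix then never has to be stored densely, since it has only 6*j non-zero entries.
library(slam)    # simple_triplet_matrix
library(Rglpk)

J <- 30000
# P (3 x J) and B (length 3) below are placeholders for your real data
P <- matrix(runif(3 * J), nrow = 3)
B <- c(12000, 10000, 8000)

# Variables Xij vectorised column-major: column k = (j - 1) * 3 + i
obj <- as.vector(P)

# Rows 1..J: for each j, X1j + X2j + X3j = 1
# Rows J+1..J+3: for each i, sum over j of Xij <= Bi
mat <- simple_triplet_matrix(
  i = c(rep(seq_len(J), each = 3), J + rep(1:3, times = J)),
  j = c(seq_len(3 * J), seq_len(3 * J)),
  v = rep(1, 6 * J),
  nrow = J + 3, ncol = 3 * J
)
dir <- c(rep("==", J), rep("<=", 3))
rhs <- c(rep(1, J), B)

sol <- Rglpk_solve_LP(obj, mat, dir, rhs, types = rep("B", 3 * J), max = TRUE)
X <- matrix(sol$solution, nrow = 3)   # 3 x J binary assignment matrix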

R dynamic time warping for long time series

I'm trying to calculate the DTW distance for very long time series, but I get an error showing that I cannot allocate memory for the matrix.
Here is what I do:
library(dtw)
set.seed(1234)
N <- 300000
x <- rnorm(N)
y <- rnorm(N)
dtw(x,y,distance.only=TRUE)$distance
Error: cannot allocate vector of size 670.6 Gb
Is there an alternative way to calculate the dtw distance that does not need to allocate so much memory?
I don't know this package, but from the companion paper of the package you have:
Larger problems may be addressed by approximate strategies, e.g.,
computing a preliminary alignment between downsampled time series
(Salvador and Chan 2004); indexing (Keogh and Ratanamahatana 2005); or
breaking one of the sequences into chunks and then iterating
subsequence matches.
The latter option can be implemented by something like:
lapply(split(y, ceiling(seq_along(y) / (length(y) / 100))), ## split y into 100 contiguous chunks
       function(z) dtw(x, z, distance.only = TRUE)$distance)
PS: By "larger" here, the paper means problems that exceed 8000 × 8000 points (close to the virtual memory limit), which is your case here.
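As an illustration of the first suggestion, a rough sketch that downsamples both series before the alignment (keeping every 100th point here; the result is only an approximation of the full DTW distance):
library(dtw)
step <- 100
x_ds <- x[seq(1, length(x), by = step)]
y_ds <- y[seq(1, length(y), by = step)]
dtw(x_ds, y_ds, distance.only = TRUE)$distance   # 3000 x 3000 cost matrix, roughly 70 MB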

How to create a Large Distance Matrix?

How can one allocate a huge distance matrix in an appropriate way to avoid the "cannot allocate vector" error? Imagine you have 100,000 points randomly spread over some space. How can one cleverly create a matrix or "dist" object that represents the lower half of DistMatrix? Maybe it should be another kind of object, one that can store the large number of distances efficiently.
You can get the polygonal object from the following link:
https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l
# Load required packages
library(sp)
library(maptools)
library(maps)
# Load the polygonal object
x <- readShapePoly("vg250_gem.shp")
# Sample or pick up a large number of points
# (this command takes a few minutes to run)
# "coord" is a SpatialPoints object
n <- 1e5
coord <- spsample(x, n, "random")
# Try to measure the distances by dist()
DistMatrix <- dist(coord@coords)
Error: negative length vectors are not allowed
# Try to measure the distances by spDists()
DistMatrix <- spDists(coord)
Error: cannot allocate vector of size (some number) MB
# It seems that the problem lies in the large matrix that has to be created.
How is this problem solvable in R for large values of n?
At this point R cannot allocate the requested number of megabytes of RAM: your computer is using all of its memory elsewhere and there just isn't that much left for your process to continue. You have several options, among them: get a machine with more RAM, close other programs, or do your distance calculations in smaller batches. Try a smaller n, and when it works, repeat the process several times until you have your whole matrix of distances.
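A hedged sketch of the batch idea, assuming coord is the SpatialPoints object from the question: each block of points is compared against all points with spDists() and written to disk instead of being held in one giant matrix.
library(sp)

block_size <- 1000
blocks <- split(seq_len(length(coord)), ceiling(seq_len(length(coord)) / block_size))

for (k in seq_along(blocks)) {
  d_block <- spDists(coord[blocks[[k]], ], coord)   # block_size x n matrix
  saveRDS(d_block, file = sprintf("dist_block_%03d.rds", k))
}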
