Efficient allocation of arrays in R (igraph)

I want to create two matrices with the same dimensions as the adjacency matrix of the graph I have. The problem is that the graph is very large.
Here is my code:
AjM <- as_adjacency_matrix(g, attr = "weight")
dim(AjM)
77500 77500
Alpha <- array(0, dim(AjM))
Error: cannot allocate vector of size 44.6 Gb
AjM itself is only 18.8 Mb.
How can I do this? (I have a 64-bit machine with 12 Gb of RAM.)

Notice that as_adjacency_matrix has an argument sparse. Assuming that your matrix has fewer than 77500 * 77500 / 2 edges, setting sparse=TRUE may reduce the size of the resulting adjacency matrix significantly. To use this option, you must have the Matrix package installed.
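For example, here is a minimal sketch, assuming the Matrix package is installed and that the two extra matrices (called Alpha and Beta here for illustration) can also live in sparse form:
library(igraph)
library(Matrix)
## Sparse adjacency matrix: only non-zero entries are stored
AjM <- as_adjacency_matrix(g, attr = "weight", sparse = TRUE)
## Two all-zero sparse matrices with the same dimensions as AjM;
## they take almost no memory until entries are actually filled in
Alpha <- Matrix(0, nrow = nrow(AjM), ncol = ncol(AjM), sparse = TRUE)
Beta  <- Matrix(0, nrow = nrow(AjM), ncol = ncol(AjM), sparse = TRUE)
Note that this only helps as long as Alpha and Beta stay mostly zero; if they end up dense, the 44.6 Gb allocation problem returns.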

Related

How to chunk large dissimilarity / distance matrices in R?

I would like to cluster mixed-type data with 50k rows and 10 features/columns. I am using R on my 64-bit PC. When I calculate the dissimilarity / distance matrix with the "daisy" function, I get an "Error: cannot allocate vector of size X GB" error.
gower_dist <- daisy(df, metric = "gower")
This is the command that generates the distance matrix. How can I process it in chunks to avoid the RAM error?

R dynamic time warping for long time series

I'm trying to calculate the DTW distance for very long time series, but I get an error indicating that the matrix cannot be allocated.
Here is what I do:
library(dtw)
set.seed(1234)
N <- 300000
x <- rnorm(N)
y <- rnorm(N)
dtw(x,y,distance.only=TRUE)$distance
Error: cannot allocate vector of size 670.6 Gb
Is there an alternative way to calculate the dtw distance that does not need to allocate so much memory?
I don't know this package, but from its companion paper:
Larger problems may be addressed by approximate strategies, e.g.,
computing a preliminary alignment between downsampled time series
(Salvador and Chan 2004); indexing (Keogh and Ratanamahatana 2005); or
breaking one of the sequences into chunks and then iterating
subsequence matches.
The latter option can be implemented with something like:
## split y into 100 contiguous chunks and match each against x
chunks <- split(y, rep(1:100, each = length(y) / 100))
lapply(chunks, function(z) dtw(x, z, distance.only = TRUE)$distance)
PS: By "larger" the paper means problems that exceed 8000 × 8000 points (close to the virtual memory limit), which is your case here.
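The downsampling strategy from the same quote can be sketched just as roughly, reusing x, y and N from the snippet above (the factor of 100 is an arbitrary placeholder; coarser sampling trades alignment accuracy for memory):
library(dtw)
## keep every 100th point so each series has ~3000 samples,
## giving a ~3000 x 3000 cost matrix instead of 300000 x 300000
keep <- seq(1, N, by = 100)
dtw(x[keep], y[keep], distance.only = TRUE)$distance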

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed of one email and 30 qualitative variables.
Each qualitative variable has 4 classes: 0, 1, 2 and 3.
So the first thing I do is load the FactoMineR library and my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running an MCA on this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
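A quick back-of-the-envelope calculation (a sketch; the exact figure depends on how many rows actually reach HCPC, which is why it differs slightly from the 1296.0 Gb in the error) shows why:
n <- 600000
## lower triangle of the distance matrix, stored as 8-byte doubles
n * (n - 1) / 2 * 8 / 2^30   ## roughly 1341 GiB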
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
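For instance, a minimal sketch with clara from the cluster package (the choice of 3 clusters and 50 samples are arbitrary placeholders):
library(cluster)
## cluster the individuals' MCA coordinates rather than the raw factors
coords <- mca.res$ind$coord
set.seed(1)
cl <- clara(coords, k = 3, samples = 50, pamLike = TRUE)
table(cl$clustering)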
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on the 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30) for the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
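A rough sketch of that more efficient route, using base kmeans in place of bigkmeans for brevity (1000 centres and 3 final clusters are arbitrary placeholders):
library(fastcluster)
coords <- mca.res$ind$coord
## 1. compress the individuals into a manageable number of k-means centres
set.seed(1)
km <- kmeans(coords, centers = 1000, iter.max = 50)
## 2. hierarchical clustering on the centres only (1000 x 1000 distances)
hc <- fastcluster::hclust(dist(km$centers), method = "ward.D2")
centre_clusters <- cutree(hc, k = 3)
## 3. each individual inherits the cluster of its k-means centre
ind_clusters <- centre_clusters[km$cluster]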
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this within 32-bit R, possibly under Windows? If that is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate solution is generally to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; see also:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html

R cluster package: daisy() error "long vectors (argument 11) are not supported in .C"

I'm trying to convert a data.frame with numeric, nominal, and NA values into a dissimilarity matrix using the daisy function from the cluster package in R, as a step before applying k-means clustering for customer segmentation. The data.frame has 133,153 rows and 36 columns. Here's my machine:
sessionInfo()
R version 3.1.0 (2014-04-10)
Platform x86_64-w64-mingw32/x64 (64-bit)
How can I fix the daisy error?
Since the Windows computer has 3 GB of RAM, I increased the virtual memory to 100 GB, hoping that would be enough to create the matrix, but it didn't work. I still got a couple of memory errors. I've looked into other R packages for solving the memory problem, but they don't work: I cannot use bigmemory with the biganalytics package because it only accepts numeric matrices. The clara and ff packages also accept only numeric matrices.
CRAN's cluster package suggests the gower similarity coefficient as a distance measure before applying k-means. The gower coefficient takes numeric, nominal, and NA values.
Store1 <- read.csv("/Users/scdavis6/Documents/Work/Client1.csv", head=FALSE)
df <- as.data.frame(Store1)
save(df, file="df.Rda")
library(cluster)
daisy1 <- daisy(df, metric = "gower", type = list(ordratio = c(1:35)))
#Error in daisy(df, metric = "gower", type = list(ordratio = c(1:35))) :
#long vectors (argument 11) are not supported in .C
EDIT: I have RStudio linked to Amazon Web Services' (AWS) r3.8xlarge with 244 GB of memory and 32 vCPUs. I tried creating the daisy matrix on my computer, but did not have enough RAM.
EDIT 2: I used the clara function to cluster the dataset.
#50 samples
clara2 <- clara(df, 3, metric = "euclidean", stand = FALSE, samples = 50,
rngR = FALSE, pamLike = TRUE)
If you have a lot of data, use algorithms that do not require O(n^2) memory. Swapping to disk will kill performance; it is not a sensible option.
Instead, try either to reduce your data set size or to use index acceleration to avoid the O(n^2) memory cost. (And it's not only O(n^2) memory but also O(n^2) distance computations, which will take a long time!)
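One way to apply the "reduce your data set size" advice, as a sketch reusing df from the question (the sample size of 5,000 rows and 3 clusters are arbitrary placeholders; whether a sample is representative of your customers is for you to judge):
library(cluster)
## daisy's Gower dissimilarities on a random subsample that fits in RAM
set.seed(42)
idx   <- sample(nrow(df), 5000)
d_sub <- daisy(df[idx, ], metric = "gower")
## PAM on the subsample; remaining rows can later be assigned
## to their nearest medoid
pam_sub <- pam(d_sub, k = 3)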

How to create a Large Distance Matrix?

How can one allocate a huge distance matrix in a way that avoids the "cannot allocate" error? Imagine you have 100,000 points randomly spread over some space. How can one cleverly create a matrix or "dist" object that represents just one half of DistMatrix? Maybe it should be another kind of object, one able to allocate the large number of distances efficiently.
You can get the polygonal object from the following link:
https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l
# Load required packages
library(sp)
library(maptools)
library(maps)
# Load the polygonal object
x <- readShapePoly("vg250_gem.shp")
# Sample a large number of points
# (this command takes a few minutes;
# "coord" is a SpatialPoints object)
n <- 1e5
coord <- spsample(x, n, "random")
# Try to measure the distances with dist()
DistMatrix <- dist(coord@coords)
Error: negative length vectors are not allowed
# Try to measure the distances with spDists()
DistMatrix <- spDists(coord)
Error: cannot allocate vector of size (some number) MB
# It seems the problem lies in the large matrix that has to be created.
How can this problem be solved in R for large values of n?
R cannot allocate the requested number of megabytes of RAM: your computer is using all of its memory elsewhere and there just isn't (some number) MB available for your process to continue. You have several options at this point: get a machine with more RAM, close other programs, or do your distance calculations in smaller batches. Try a smaller n, and when that works, repeat the process several times until you have your whole matrix of distances.
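A sketch of the "smaller batches" idea, reusing coord from the question (the block size of 500 rows is an arbitrary placeholder; note that the full matrix is still on the order of 80 GB, so the batches have to go to disk or be summarised rather than kept in RAM):
library(sp)
pts    <- coord@coords                 ## 100,000 x 2 coordinate matrix
block  <- 500                          ## rows handled per batch
starts <- seq(1, nrow(pts), by = block)
for (s in starts) {
  rows <- s:min(s + block - 1, nrow(pts))
  ## distances from this block of points to all points: block x n matrix
  d <- spDists(pts[rows, , drop = FALSE], pts)
  ## store each batch on disk instead of holding everything in memory
  saveRDS(d, file = sprintf("dist_block_%06d.rds", s))
}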
