R running out of memory during time series distance computation

Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, these 7.6 Gb are probably beyond some hard limit that cannot be extended? In any case, the system has 16 GB of memory and the same amount of swap. By "Gb", R seems to mean gigabytes, not gigabits, so 7.6 Gb already puts us dangerously close to a hard limit (see the quick size calculation at the end of this Ideas section).
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts needed for the lower triangle) that can later be reassembled? (This resembles the chunk-wise solution proposed for a similar problem here.) It seems to be a rather messy solution, though. Further, I will need the 45K x 45K matrix in the end anyway, and that alone seems to hit the limit: the system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems; R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
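For what it is worth, the reported sizes match a simple back-of-the-envelope calculation at 8 bytes per double (my own arithmetic, added for clarity; it is not part of the error messages):
N <- 45000
(N^2 - N) / 2 * 8 / 1024^3   # dist object: ~1.01e9 doubles, ~7.5 GiB ("7.6 Gb")
N^2 * 8 / 1024^3             # full N x N double matrix: ~15.1 GiB ("15.1 Gb")
30000^2 * 8 / 1024^3         # the 30K x 30K test matrix: ~6.7 GiB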
Related questions
Answers to other questions about R running out of memory, like this one, suggest using more memory-efficient methods of computation.
This very helpful answer suggests deleting other large objects to make room for the memory-intensive operation.
Here, the idea of splitting the data set and computing distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/, for which the system by default (depending on configuration) allows half the size of the RAM. You can increase this at runtime by doing (for instance) mount -o remount,size=12Gb /dev/shm, but this may still not allow usage of 12 Gb (I do not know why; maybe the memory management configuration is then inconsistent). Also, you may end up crashing your system if you are not careful.
R apparently does allow access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are partly based on speculation, so I do not know to what degree this is right.

R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before, as described in the original question, but with a matrix of NAs instead of rnorm random variates. Reassigning one of the values in the matrix as a floating-point number (memorytestmatrix[1,1] <- 0.5) still works and recasts the matrix as a numeric (double) matrix.
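A plausible reason why the NA matrix is cheaper to create in the first place (my own check, not part of the original answer): matrix(nrow = N, ncol = N) is logical, i.e. 4 bytes per cell, and is only coerced to 8-byte doubles once a numeric value is assigned.
N <- 45000
print(object.size(matrix(nrow = N, ncol = N)), units = "auto")  # logical NA matrix: ~7.5 Gb
# after memorytestmatrix[1,1] <- 0.5 it is coerced to double, i.e. ~15.1 Gb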
Consequently, I suppose, you can have a matrix of that size, but you cannot build it the way the dist function attempts to. A possible explanation is that the function operates on multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa) # for the vec2dist function

euclidian <- function(series1, series2) {
  sqrt(sum((series1 - series2)^2))
}

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)
for (coli in 1:(mx - 1)) {
  for (rowi in (coli + 1):mx) {
    # Element indices in dist objects count down the rows, column by column,
    # from left to right in the lower triangular matrix without the main diagonal.
    # From the row and column indices, the element index is computed like so:
    element <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi, ], ctab2[coli, ])
  }
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
Also note that it may be possible to speed this up by using vectorised operations (sapply, rowSums, and the like) to compute more than one value at a time; see the sketch below.
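For instance, something along these lines should fill one column of distances at a time (a sketch under the assumption that ctab2 is a plain numeric matrix; I have not benchmarked it against the loop above):
for (coli in 1:(mx - 1)) {
  rows <- (coli + 1):mx
  # distances from series coli to all later series in one vectorised step
  d <- sqrt(rowSums(sweep(ctab2[rows, , drop = FALSE], 2, ctab2[coli, ])^2))
  # the corresponding element indices, using the same formula as above
  elements <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rows - coli
  distMatrixE[elements] <- d
}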

There are good clustering algorithms that do not need a full distance matrix in memory, for example SLINK, DBSCAN, and OPTICS.
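In R, DBSCAN and OPTICS are available in the dbscan package, which works directly on the data matrix (using a kd-tree), so the 45K x 45K matrix is never materialised. A minimal sketch, with placeholder parameter values that would need tuning for the actual data:
library(dbscan)
cl  <- dbscan(ctab2, eps = 0.5, minPts = 5)   # DBSCAN on the raw series
opt <- optics(ctab2, minPts = 5)              # OPTICS reachability ordering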

Related

How to improve igraph single_paths calculation efficiency

I'm using the function all_simple_paths from the igraph R package: (1) to generate the list of all simple paths in networks (object List_paths_Mp); and (2) to calculate the total number of simple paths (object n_paths).
I'm using the function in the form:
pathsMp <- unlist(lapply(V(graphMp), function(x) all_simple_paths(graphMp, from = x)), recursive = FALSE)
List_paths_Mp <- lapply(1:length(pathsMp), function(x) as_ids(pathsMp[[x]]))
n_paths <- length(List_paths_Mp)
Where:
Mp is a square matrix with either 1 or 0 values, and graphMp is the igraph graph object obtained through the function graph_from_adjacency_matrix.
The function does what I need, but as the number of variables and interactions increases, the processing time to identify and store the different simple paths in the network grows too much, and it takes very long to get the results.
In particular, for a network with 11 variables and 60 interactions, there is a total of 146338 possible simple paths, and this already takes a long time to compute. Using a bigger network, with 13 variables and 91 interactions, takes even longer (after 2 hours the function still had not finished, and when I tried to stop it, R crashed).
Is there a way to increase the efficiency of the task (i.e. to get results faster)? Has anyone encountered a similar problem and found a solution? I know I could use a CPU with more processing power, but the point is to have the function run as efficiently as possible on a normal personal computer.
Edit: here I do the calculations from the graph object, but if someone else has any idea of doing the same from the adjacency matrix, I would welcome it too!

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64 bit, and I'm running Microsoft R Open, which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I have heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself or whether it could also speed up an existing package function such as optimumLHS. I don't have to use the lhs package necessarily; my only requirement is to generate an LHS design which is optimized in terms of the S-optimality criterion, maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)

performance <- c()
for (i in 1:100) {
  ptm <- proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time <- print(proc.time() - ptm)[[3]]
  performance <- rbind(performance, data.frame(time = time, n = i))
}

ggplot(performance, aes(x = n, y = time)) +
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up will be to take a smaller sample or to get access to a faster computer. It strikes me that this is something that only really has to be done once; is there perhaps a resource where you could just get a properly sampled and optimized design that has already been computed?
So it looks like roughly 650 hours on my machine, which is very comparable to yours, to compute the n = 400 case.
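One possible workaround (my suggestion, not part of the answer above): if the maximin criterion is acceptable, maximinLHS() in the same lhs package builds the design point by point instead of iteratively optimising a full design, and typically finishes far faster than optimumLHS() at this size.
library(lhs)
# point-by-point maximin construction; much cheaper than optimumLHS(n = 400, k = 7)
lhs_design <- maximinLHS(n = 400, k = 7)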

Creating distance matrix in R for a matrix in a higher dimensional space

I have created a Euclidean distance matrix using the dist() function in R.
Below is my R script. As the dimensions of the matrix would be 16809 * 16809, while running this script in R I got the error message:
Error: cannot allocate vector of size 1.1 Gb
So is there any way to get rid of this error?
I haven't used parallelization in R previously. Can it be done using parallelization?
rnd.points = matrix(runif(3 * 16809), ncol = 3)
rnd.points <- rnd.points[1:5,]
ds <- dist(rnd.points)
as.matrix(ds) -> nt
nt
As #Gopola said: dist(.) computes all pairwise distances and hence needs O(n^2) memory. Indeed, dist() is efficient and stores only half of the symmetric n x n matrix.
If I compute dist() on a computer with enough RAM, it works nicely, and indeed creates an object ds of size 1.1 Gb ... which is not so large for today's computers.
rnd.points <- matrix(runif(3 * 16809), ncol = 3)
ds <- dist(rnd.points)
object.size(ds)
Note, however, that your
as.matrix(ds) -> nt
is not such a good idea, as the resulting matrix nt is (almost) twice the size of ds, since nt is of course a full n x n matrix.
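The sizes line up with simple arithmetic at 8 bytes per double (my own check, added for clarity):
n <- 16809
(n^2 - n) / 2 * 8 / 1024^3   # dist object ds: ~1.05 GiB ("1.1 Gb")
n^2 * 8 / 1024^3             # full n x n matrix nt: ~2.1 GiB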
The O/S has a principal limit on RAM addressing (smaller for a 32-bit system, larger for a 64-bit system).
The O/S also has a design-based limit on the maximum RAM a single process may allocate (and kills the process afterwards).
I had the same in-RAM constraints in Python and went beyond them. Sure, at some cost, but it was a worthwhile piece of experience.
Python's numpy has a wonderful feature built in seamlessly for this very scenario: .memmap(). The word seamlessly is emphasised intentionally, as this is of core importance for the costs of re-formulating / re-designing your problem. There are tools available, but it will take your time to master them and to re-design your algorithm (libraries et al.) so that they can use the new tools, guess what, seamlessly. This is the hidden part of the iceberg.
Handy R tools are available (a chunk-wise sketch using the first of them follows this list):
filebacked.big.matrix, which also supports HPC cluster-wide sharing for distributed processing (thus addressing both the PSPACE and PTIME dimensions of the HPC processing challenge, unless you hit the filesystem file-size ceiling)
ff, which allows
library(ff)
pt_coords <- ff( vmode = "double", dim = c(16809, 3), initdata = 0 )
pt_dists <- ff( vmode = "double", dim = c(16809, 16809), initdata = -1 )
and lets you work with it in a simple matrix-like [row, column] fashion to fill in the points and process their pairwise distances et al.,
?ffsave for further details on saving your resulting distance data
and last, but not least
mmap + indexing
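For the 45K x 45K case in the original question, a chunk-wise fill of a file-backed matrix could look roughly like this (my own sketch, not tested at full scale; the chunk size and file names are arbitrary, and the backing file will be on the order of 15 GiB on disk):
library(bigmemory)
library(proxy)

n <- nrow(ctab2)
D <- filebacked.big.matrix(nrow = n, ncol = n, type = "double",
                           backingfile = "distmat.bin",
                           descriptorfile = "distmat.desc")

chunk  <- 5000
starts <- seq(1, n, by = chunk)
for (i in starts) {
  ri <- i:min(i + chunk - 1, n)
  for (j in starts[starts <= i]) {
    rj <- j:min(j + chunk - 1, n)
    # cross-distances between two row blocks; each block result is small enough for RAM
    D[ri, rj] <- as.matrix(proxy::dist(ctab2[ri, ], ctab2[rj, ], method = "euclidean"))
  }
}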
Parallel? No. Distributed? Yes, that might help with PTIME:
As noted for filebacked.big.matrix, there are chances to segment the computational PSPACE into smaller segments for distributed processing and a reduction of PTIME, but the concept is in principle just a concurrent (re-)use of available resources, not truly [ PARALLEL ] system behaviour (and it has to be admitted that a lot of marketing text, sadly including technology marketing, misuses the word parallel / parallelism in places where merely concurrent system behaviour is observed; there are not many real, true-PARALLEL systems).
Conclusion:
Big matrices are doable in R well beyond the in-RAM limits; select the tools most suitable for your problem domain and harness all the HPC resources you can.
Error: cannot allocate vector of size 1.1 Gb is solved.
There is nothing but resources that impose limits and delays on our computing-ready tasks, so do not hesitate to make your move while computing resources are still available for your project; otherwise you will find yourself with all the re-engineered software ready, but waiting in a queue for the computing resources.

R k-means and hierarchical clustering take forever to finish

I have a data set (after normalising and preprocessing) containing a data frame with 5 columns and 133763 rows.
I am trying to apply the k-means algorithm and hierarchical clustering in order to do the clustering. However, my problem is that RStudio keeps trying to do the calculation and then throws an out-of-memory exception, even though I am using a MacBook Pro with an i7 and 16 GB of RAM.
My code for hierarchical clustering is:
dist.cards <- dist(cardsNorm)
As I said, that takes forever to run. However, if I do this
dist.cards <- dist(cardsNorm[1:10])
it works fine, because I only use 10 rows.
For k-means, this is my code:
cardsKMS <- kmeans(cardsNorm, centers = 3, iter.max = 100, nstart = 25)
It works fine, but when I try to evaluate the model using this code
a <- silhouette(cardsKMS$cluster, dist(cardsNorm))
it takes forever and never finishes calculating.
Help please.
Creating a distance matrix between n = 133763 observations requires (n^2 - n)/2 pairwise comparisons. At 8 bytes per double-precision value, the resulting distance object alone requires roughly 70 GB of RAM, so unfortunately you don't have enough.
Algorithms based on distance matrices scale very poorly with increasing data set size (since they are inherently quadratic in memory and CPU), so I am afraid you need to try some other clustering algorithm.
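One candidate (my suggestion, with illustrative parameters): clara() from the cluster package performs k-medoids on repeated subsamples, so the 133763 x 133763 distance matrix is never built, and it also returns silhouette information for the best subsample.
library(cluster)
# k-medoids on 50 random subsamples of 1000 rows each; no full distance matrix is formed
cardsCL <- clara(cardsNorm, k = 3, samples = 50, sampsize = 1000)
cardsCL$silinfo   # silhouette widths computed on the best subsample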

matrix calculation error

I am using R to calculate the SVD (svd(m)), and it works on a small matrix, but when I pass it a 20Kx20K matrix, it gives the following error after some processing:
Error in svd(m) : infinite or missing values in 'x'
I checked, and there is no row or column with all 0 values and no duplicates among the rows and columns. All columns have values.
I cannot paste a 20Kx20K matrix here :(
I am guessing that your problem is not related to memory size, although I am not able to process a 20Kx20K matrix on my 4GB memory machine.
The reason for this guess is that the first line of code inside svd() is the following:
if (any(!is.finite(x)))
stop("infinite or missing values in 'x'")
In other words, the svd() function first tests whether there are any non-finite values in your data.
This happens before any further processing, so if you had memory problems, they would be apparent even before your call to svd().
I suggest you check for infinite values:
x <- c(0, Inf, NA, NULL)
which(!is.finite(x))
[1] 2 3
This indicates that the second and third values are considered not finite. In other words, any NA, NaN, or Inf values in your data will cause this error.
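Applied to your matrix m, a quick way to locate the offending entries (a sketch; whether to drop, zero, or impute them depends on your data):
bad <- which(!is.finite(m), arr.ind = TRUE)  # row/column indices of NA/NaN/Inf entries
head(bad)
# m[!is.finite(m)] <- 0   # one possible fix, if replacing them makes sense for your data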
Possibly the svd calculation itself also uses a lot of memory. If we compare with MATLAB, its svd calculation allocates about as much memory as the matrix itself uses, so if you already use 3 GB of memory, the svd calculation possibly allocates another 3 GB, which gives 6 GB in total.
If you're storing doubles, which take 8 bytes each, a 20Kx20K matrix means 8 * 20,000 * 20,000 bytes, i.e. roughly 3 GB of RAM to hold the whole thing in memory.
I don't know how much RAM you've got available, but I'd look into what R can do to serialize the matrix out to disk as needed.
Is the matrix sparse or banded? Can you do something to decrease the amount of memory you need?
How large is the null space of your matrix? What's the condition number (ratio of largest to smallest eigenvalue)? A large condition number can be an indication of difficulties in solving. A matrix need not have a zero row or column to be nearly singular.
UPDATE:
Based on your comment, I'd say that RAM is the least of your problems. It sounds like it is possible to hold the entire matrix in memory, provided you can address it all, and you can: you're running on a 64-bit OS. Is your version of R 64-bit as well?
Unfortunately, one of the byproducts of SVD is to get the size of the null space.
You can get the minimum eigenvalue for your matrix using Jacobi iteration. Lanczos might be a good choice for getting the maximum eigenvalue. It'd be a lot of work to get all of them; you might just want the five smallest and largest to assess.
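If only a few extreme singular values are needed, a truncated SVD is another route (my addition, not part of the answer above): the irlba package computes just the leading singular triplets at a fraction of the memory and time of a full 20Kx20K svd().
library(irlba)
sv <- irlba(m, nv = 5)   # five largest singular values and the corresponding vectors
sv$d                     # the singular values themselves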
Anytime I experience an error with some software I immediately paste it into a Google search. At least it's comforting to know that I'm not the first to experience a particular problem:
http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=Error+in+svd(m)+:+infinite+or+missing+values+in+'x'
