I am trying R package apcluster on a set of objects that I want to cluster, but I'm running into performance/memory problems, and I suspect I'm not doing it right. I'd like to hear your opinion, please.
In short: I have a set of about 13000 objects. Each object is associated with a set of 2 to 5 'features'. The similarity (by which I want to cluster, eventually) between any two objects i and j is equal to the number of features they have in common divided by the total number of distinct features they 'span'. E.g. if i = {a,b,c} and j = {c,d}, then sim[i,j] = 1/4 = 0.25, because they have only 1 feature in common ({c}) and in total they describe 4 distinct features ({a,b,c,d}).
Calculating my NxN similarity matrix is not a problem in theory: it can be done using set operations if each object's features are stored as a list; or features can be pivoted to a matrix of 1's and 0's, where each column is a feature, and then R's function dist with method="binary" does the trick.
In practice however, the first problem is that such similarity calculations are extremely slow. For 13 K objects, there are about 84.5 M similarities to calculate, but this doesn't sound so bad for a modern computer. I don't understand why it should take a few hours to do that. And the set operation version, that should be quicker as far as I can tell, is actually much slower than dist. [Another package called fingerprint is supposed to deal with such cases more efficiently, but so far I haven't been able to make it work, it gives a lot of errors when trying to make what they call 'featvec' objects].
The other thing to consider is that the 2-5 features per object are not very repetitive. There may be a group of 100 or so objects with at least one feature in common between them, but then none of the other 12.9 K objects has any feature in common with these 100 objects. The consequence is that the pivoted feature matrix is very sparse (if we consider 0's as empty). There are about 4000 columns in the pivoted matrix, and each row has at most 5 1's. I wonder if this is negatively impacting the performance of dist, in that it has to multiply through a lot of 0's that could instead be ignored.
Does it seem normal to you that it should take a few hours to apply dist to a matrix like the one I described? Can you suggest a different way to calculate the similarity that takes advantage of the sparseness of the matrix?
Anyway, I managed to get the output from dist, which however had class 'dist', and was a distance matrix, not a similarity one, so I had to use 1 - as.matrix(distance_matrix) to be able to make the similarity matrix apcluster needs as input.
That's when I got the first 'memory' problem. R said the vector could not be allocated due to its size. I tried the usual tricks, but in the end I could not get more than 4 GB, and my matrices are (apparently) bigger.
I overcame this by assigning each time new matrices to their old 'self'.
And then when I submitted this painstakingly put together similarity matrix to apcluster, again the vector size error popped up, as if the first thing apcluster did was create some other large object from what I had fed it.
I had a look at as.Sparse... in apcluster, but it does not seem to help a lot, considering that you have to calculate the full matrix first anyway.
In the end the only thing that worked a little bit was 'leveraged affinity propagation' by apclusterL, which however is an approximation.
Does anybody know if and how I could do this better? E.g. is it wise to pivot the data first, or should I stick to list and set operations? Or, can the fact that the initial matrix is sparse be used to compute directly a sparse similarity matrix, rather than compute it fully and reduce it to sparse later?
Any advice would be greatly appreciated. Thanks!
BTW, yes, I saw this thread: Cluster Analysis in R on large sparse matrix ; which does not seem to have been answered conclusively.
The R interpreter is really slow.
So you should use R mostly to "drive" your program, but implement all the computations heavy stuff in C or FORTRAN.
You didn't show the code you are using, but I guess it involves nested for loops? Try to rewrite it without any for loops in R, or rewrite it in C.
But no matter what, AP clustering will always remain very slow. It involves many passes over O(n²) matrixes, i.e. it scales very badly.
Related
I'm using the function all_simple_paths from the igraph R package: (1) to generate the list of all simple paths in networks (object List_paths_Mp); and (2) to calculate the total number of simple paths (object n_paths).
I'm using the function in the form:
pathsMp <- unlist(lapply(V(graphMp), function(x)all_simple_paths(graphMp, from = x)), recursive =FALSE)
List_paths_Mp <- lapply(1:length(pathsMp), function(x)as_ids(pathsMp[[x]]))
n_paths<-length(List_paths_Mp)
Where:
Mp is a square matrix with either 1 or 0 values, and graphMp is the igraph graph objected obtained through the function graph_from_adjacency_matrix.
The function does what I need, but with the increase in the number of variables and interactions the processing time to identify and store the different single paths in the network grows too much and it takes very long to get the results.
In particular, using a network with 11 variables and 60 interactions, there is a total of 146338 possible simple paths. And this already takes a long time to compute. Using a bigger network, with 13 variables and 91 interactions, causes the program to take even longer times to process (after 2 hours the function still didn't run its course, and when called to stop it crashed R).
Is there a way to increase the efficiency of the task (i.e. to get results in a faster way)? Has anyone ever encountered a similar problem and found a solution? And, I know, I could use a CPU with higher processing power, but the point is to have the function to run efficiently (as much as possible) in a normal personal computer.
Edit: here I do the calculations from the graph object, but if someone else has any idea of doing the same from the adjacency matrix, I would welcome it too!
I have a 3 million x 9 million sparse matrix with several billion non-zero entries. R and Python do not allow sparse matrices with more than MAXINT non-zero entries, thus why I found myself using Julia.
While scaling this data with the standard deviation is trivial, demeaning is of course a no-go in a naive manner as that would create a dense, 200+ terabyte matrix.
The relevant code for doing svd is julia can be found at https://github.com/JuliaLang/julia/blob/343b7f56fcc84b20cd1a9566fd548130bb883505/base/linalg/arnoldi.jl#L398
From my reading, a key element of this code is the AtA_or_AAt struct and several of the functions around those, specifically A_mul_B!. Copied below for your convenience
struct AtA_or_AAt{T,S} <: AbstractArray{T, 2}
A::S
buffer::Vector{T}
end
function AtA_or_AAt(A::AbstractMatrix{T}) where T
Tnew = typeof(zero(T)/sqrt(one(T)))
Anew = convert(AbstractMatrix{Tnew}, A)
AtA_or_AAt{Tnew,typeof(Anew)}(Anew, Vector{Tnew}(max(size(A)...)))
end
function A_mul_B!(y::StridedVector{T}, A::AtA_or_AAt{T}, x::StridedVector{T}) where T
if size(A.A, 1) >= size(A.A, 2)
A_mul_B!(A.buffer, A.A, x)
return Ac_mul_B!(y, A.A, A.buffer)
else
Ac_mul_B!(A.buffer, A.A, x)
return A_mul_B!(y, A.A, A.buffer)
end
end
size(A::AtA_or_AAt) = ntuple(i -> min(size(A.A)...), Val(2))
ishermitian(s::AtA_or_AAt) = true
This is passed into the eigs function, where some magic happens, and the output is then processed in to the relevant components for SVD.
I think the best way to make this work for a 'centering on the fly' type setup is to do something like subclass AtA_or_AAT with a AtA_or_AAT_centered version that more or less mimics the behavior but also stores the column means, and redefines the A_mul_B! function appropriately.
However, I do not use Julia very much and have run in to some difficulty modifying things already. Before I try to dive into this again, I was wondering if I could get feedback if this would be considered an appropriate plan of attack, or if there is simply a much easier way of doing SVD on such a large matrix (I haven't seen it, but I may have missed something).
edit: Instead of modifying base Julia, I've tried writing a "Centered Sparse Matrix" package that keeps the sparsity structure of the input sparse matrix, but enters the column means where appropriate in various computations. It's limited in what it has implemented, and it works. Unfortunately, it is still too slow, despite some pretty extensive efforts to try to optimize things.
After much fuddling with the sparse matrix algorithm, I realized that distributing the multiplication over the subtraction was dramatically more efficient:
If our centered matrix Ac is formed from the original nxm matrix A and its vector of column means M, with a nx1 vector of ones that I will just call 1. We are multiplying by a mxk matrix X
Ac := (A - 1M')
AcX = X
= AX - 1M'X
And we are basically done. Stupidly simple, actually.
AX is can be carried out with the usual sparse matrix multiplication function, M'X is a dense vector-matrix inner product, and the vector of 1's "broadcasts" (to use Julia's terminology) to each row of the AX intermediate result. Most languages have a way of doing that broadcasting without realizing the extra memory allocation.
This is what I've implemented in my package for AcX and Ac'X. The resulting object can then be passed to algorithms, such as the svds function, which only depend on matrix multiplication and transpose multiplication.
I have a difficult R computation to do, and I have an option of 2 computers, called V and L, to run the code. V is supposed to be faster than L, but I did not experience this. So I decided to test it out.
As a simple test, I decided to ask them invert a 3000*3000 matrice 500 times, and record the time.
set.seed(123)
I=500
n=3000
time=matrix(NA,ncol=3,nrow=I)
for(i in 1:I){
t0<-proc.time()
x<-solve(matrix(runif(n^2),n))
mt1<-proc.time()
time[i,]<-(mt1-t0)[1:3]
}
The problem is that during a particular iteration, it got stuck. I don't know why but I suspect it is because the matrix generated was near singular. So I would like to improve the code. I can think of 3 ways:
make sure the matrix generated is easily invertible. But how do i enforce this??? Of course, any solution needs to be computationally inexpensive, otherwise the exercise becomes meaningless.
ask R to skip that iteration if solve takes too long? But again, how do I do that?
assign them a different computation task instead, any recommendation?
A random matrix is invertible with probability 1, meaning that, in practice, the probability of generating a singular (i.e. non-invertible) matrix is infinitesimally small.
Moreover, from the point of view of the algorithm that R uses to invert matrices, there is no such thing as an "easily invertible" matrix. Either the algorithm succeeds, or it determines that a matrix is singular and fails. But there is no scenario under which it tries "really hard" and takes a long time to invert a matrix. It's a deterministic algorithm which either runs into a 0 (or a value smaller than some given epsilon), in which case if fails, or else it doesn't.
On which iteration do you get stuck? Are you sure you are getting stuck on the inversion of the matrix, and it's not something like garbage collection that is taking a long time?
I can't reproduce the problem you describe. Starting with random seed 123, I can invert 500 random 3000x3000 matrices in a row, using your code, without any significant timing discrepancies. Can you find a random seed that generates a "hard to invert matrix" directly?
Sorry if this is dumb but I was just thinking I should give a shot. Say I have a graph thats huge(for example, 100 billion nodes). Neo4J supports 32 Billion and others support more or less the same, so say I cannot have the entire dataset in a database at the same time, can I run pagerank on it if its a directed graph(no loops) and each set of nodes connect to the next set of nodes(so no new links will be created backwards, only new links are created to new sets of data).
Is there a way I can somehow take the previous pagerank scores and apply them to new datasets(I only care about the pagerank for the most recent set of data but need the previous set's pagerank to derive the last sets data)?
Does that make sense? If so, is it possible to do?
You need to compute the principle eigenvector of a 100 billion by 100 billion matrix. Unless it's extremely sparse, you can not fit that inside your machine. So, you need a way to compute the leading eigenvector of a matrix when you can only look at a small part of your matrix at a time.
Iterative methods to compute eigenvectors only require that you store a few vectors at each iteration (they'll each have 100 billion elements). Those may fit on your machine (with 4 byte floats you'll need around 375GB per vector). Once you have a candidate vector of rankings you can (very slowly) apply your giant matrix to it by reading the matrix in chunks (since you can look at 32 billion rows at a time you'll need just over 3 chunks). Repeat this process and you'll have the basics of the power method which is what gets used in pagerank. cf http://www.ams.org/samplings/feature-column/fcarc-pagerank and http://en.wikipedia.org/wiki/Power_iteration
Of course the limiting factor here is how many times you need to examine the matrix. It turns out that by storing more than one candidate vector and using some randomized algorithms you can get good accuracy with fewer reads of your data. This is a current research topic in the applied math world. You can find more information here http://arxiv.org/abs/0909.4061 , here http://arxiv.org/abs/0909.4061 , and here http://arxiv.org/abs/0809.2274 . There's code available here: http://code.google.com/p/redsvd/ but you can't just use that off-the-shelf for the data sizes you're talking about.
Another way you may go about this is to look into "incremental svd" which may suit your problem better but is a bit more complicated. Consider this note: http://www.cs.usask.ca/~spiteri/CSDA-06T0909e.pdf and this forum: https://mathoverflow.net/questions/32158/distributed-incremental-svd
I am using R tool to calculate SVD (svd(m)) and it works on small matrix but as I pass it 20Kx20X matrix. After processing, it gives the following error
Error in svd(m) : infinite or missing values in 'x'
I checked and there is no row or column with all 0 values and no duplicate in row and
column. All columns have values.
I cannot past 20Kx20K matrix here :(
I am guessing that your problem is not related to memory size, although I am not able to process a 20Kx20K matrix on my 4GB memory machine.
The reason for this guess is that the first line of code inside svd() is the following:
if (any(!is.finite(x)))
stop("infinite or missing values in 'x'")
In other words, the svd() function test first whether there are any infinite values in your data.
This happens before any further processing. So, if you had memory problems, these would be apparent even before your call to svd().
I suggest you check for infinite values:
x <- c(0, Inf, NA, NULL)
which(!is.finite(x))
[1] 2 3
This indicates that the second and third values are considered to be not finite. In other words, any NA values in your data will cause your error.
Possibly the svd calculation itself also uses a lot of memory. If we compare to MATLAB, we see that the svd calculation allocates just as much memory as the matrix itself uses, so if you already ise 3GB of memory, the svd calculation possibly allocates another 3GB, which gives 6GB of memory.
If you're storing doubles that are 8 bytes, 20Kx20K means 8*20,000*20,000/1024/1024 ~ 3GB of RAM to hold the whole thing in memory.
I don't know how much RAM you've got available, but I'd look into what R can do to serialize the matrix out to disk as needed.
Is the matrix sparse or banded? Can you do something to decrease the amount of memory you need?
How large is the null space for you matrix? What's the condition number (ratio of largest-to-smallest eigenvalue)? A large condition number can be an indication of difficulties in solving. A matrix need not have a zero row or column to be nearly singular.
UPDATE:
Based on your comment, I'd say that RAM is the least of your problems. Sounds like it's possible to hold the entire matrix in memory - if you can address it all. You can address the entire matrix. You're running on a 64-bit OS - is your version of R 64-bit as well?
Unfortunately, one of the byproducts of SVD is to get the size of the null space.
You can get the minimum eigenvalue for your matrix using Jacobi iteration. Lanczos might be a good choice for getting the maximum eigenvalue. It'd be a lot of work to get all of them; you might just want the five smallest and largest to assess.
Anytime I experience an error with some software I immediately paste it into a Google search. At least it's comforting to know that I'm not the first to experience a particular problem:
http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=Error+in+svd(m)+:+infinite+or+missing+values+in+'x'