I'd like to ask what is the maximum amount of data R igraph package can handle. Is it possible to write hunderds of million rows of data? I could not find anything specific on the document. The document says that it can handle huge amounts of data.
Thanks
At the moment, R/igraph does not check if graph size limits are exceeded, however, if you do exceed certain values, it may misbehave. Checks will hopefully be added for version 2.0.
To be on the safe side, follow these guidelines:
Avoid vectors or matrices with more than 2^31 - 1 ~ 2.1 billion elements. That refers to the total number of elements in the matrix, i.e. a square matrix should not be larger than 46340 by 46340.
Graphs should have fewer than 2^31 - 1 ~ 2.1 billion vertices.
Graphs should have fewer than 2^30 - 1 ~ 1 billion edges.
These are likely to remain the size limits for version 2.0 of R/igraph, except that when they are exceeded, a user-friendly error will be shown. The reason for these limits is that R still does not support 64-bit integers.
The C, Python and Mathematica interfaces of igraph will be able to handle much larger graphs as they will be using 64-bit integers on systems that support it.
Related
As far as I understand one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed ROpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
Now let's create the LSH buckets for this object (64 bands) and assign it to l, I will have:
object.size(l$buckets)
> 6704 bytes
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses a md5 digest to create the bucket hashes.
But isn't this too wasteful / overkill, and can't I improve it? Is it normal that our data reduction technique ends up bloating up to this extent? And isn't it more efficacious to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.
Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)
The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliable detect potential matches that only share a partial overlap (i.e., with a Jaccard score that is closer to 0), then you need more hashes/bands.
Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected; while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, these 7.6G are probably beyond some hard limit that cannot be extended? In any case, the system has 16GB memory and the same amount of swap. By "Gb", R seems to mean Gigabyte, not Gigabit, so 7.6Gb puts us already dangerously close to a hard limit.
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled? (This might look similar to the solution to a similar problem proposed here.) It seems to be a rather messy solution, though. Further, I will need the 45K x 45K matrix in the end anyway. However, this seems to hit the limit. The system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems, R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions related to R running out of memory, like this one, suggest to use a more memory efficient methods of computation.
This very helpful answer suggests to delete other large objects to make room for the memory intensive operation.
Here, the idea to split the data set and compute distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/ of which the system by default (but depending on configuration) allows half the size of the RAM. You can increase this at runtime by doing (for instance) mount -o remount,size=12Gb /dev/shm, but this may still not allow usage of 12Gb. (I do not know why, maybe the memory management configuration is inconsistent then). Also, you may end up crashing your system if you are not careful.
R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are a bit based on speculation, so I do not know to what degree this is right.
R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before as described in the original question, but with a matrix of NA's instead of rnorm random variates. Reassigning one of the values in the matrix as float (memorytestmatrix[1,1]<-0.5) still works and recasts the matrix as a float matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa) # for the vec2dist function
euclidian <- function(series1, series2) {
return((sum((series1 - series2)^2))^.5)
}
mx = nrow(ctab2)
distMatrixE <- vec2dist(0.0, size=mx)
for (coli in 1:(mx-1)) {
for (rowi in (coli+1):mx) {
# Element indices in dist objects count the rows down column by column from left to righ in lower triangular matrices without the main diagonal.
# From row and column indices, the element index for the dist object is computed like so:
element <- (mx^2-mx)/2 - ((mx-coli+1)^2 - (mx-coli+1))/2 + rowi - coli
# ... and now, we replace the distances in place
distMatrixE[element] <- euclidian(ctab2[rowi,], ctab2[coli,])
}
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
Also note that it may be possible to speed this up by making use of sapply by computing more than one value at a time.
There exist good algorithms that do not need a full distance matrix in memory.
For example, SLINK and DBSCAN and OPTICS.
I am trying R package apcluster on a set of objects that I want to cluster, but I'm running into performance/memory problems, and I suspect I'm not doing it right. I'd like to hear your opinion, please.
In short: I have a set of about 13000 objects. Each object is associated with a set of 2 to 5 'features'. The similarity (by which I want to cluster, eventually) between any two objects i and j is equal to the number of features they have in common divided by the total number of distinct features they 'span'. E.g. if i = {a,b,c} and j = {c,d}, then sim[i,j] = 1/4 = 0.25, because they have only 1 feature in common ({c}) and in total they describe 4 distinct features ({a,b,c,d}).
Calculating my NxN similarity matrix is not a problem in theory: it can be done using set operations if each object's features are stored as a list; or features can be pivoted to a matrix of 1's and 0's, where each column is a feature, and then R's function dist with method="binary" does the trick.
In practice however, the first problem is that such similarity calculations are extremely slow. For 13 K objects, there are about 84.5 M similarities to calculate, but this doesn't sound so bad for a modern computer. I don't understand why it should take a few hours to do that. And the set operation version, that should be quicker as far as I can tell, is actually much slower than dist. [Another package called fingerprint is supposed to deal with such cases more efficiently, but so far I haven't been able to make it work, it gives a lot of errors when trying to make what they call 'featvec' objects].
The other thing to consider is that the 2-5 features per object are not very repetitive. There may be a group of 100 or so objects with at least one feature in common between them, but then none of the other 12.9 K objects has any feature in common with these 100 objects. The consequence is that the pivoted feature matrix is very sparse (if we consider 0's as empty). There are about 4000 columns in the pivoted matrix, and each row has at most 5 1's. I wonder if this is negatively impacting the performance of dist, in that it has to multiply through a lot of 0's that could instead be ignored.
Does it seem normal to you that it should take a few hours to apply dist to a matrix like the one I described? Can you suggest a different way to calculate the similarity that takes advantage of the sparseness of the matrix?
Anyway, I managed to get the output from dist, which however had class 'dist', and was a distance matrix, not a similarity one, so I had to use 1 - as.matrix(distance_matrix) to be able to make the similarity matrix apcluster needs as input.
That's when I got the first 'memory' problem. R said the vector could not be allocated due to its size. I tried the usual tricks, but in the end I could not get more than 4 GB, and my matrices are (apparently) bigger.
I overcame this by assigning each time new matrices to their old 'self'.
And then when I submitted this painstakingly put together similarity matrix to apcluster, again the vector size error popped up, as if the first thing apcluster did was create some other large object from what I had fed it.
I had a look at as.Sparse... in apcluster, but it does not seem to help a lot, considering that you have to calculate the full matrix first anyway.
In the end the only thing that worked a little bit was 'leveraged affinity propagation' by apclusterL, which however is an approximation.
Does anybody know if and how I could do this better? E.g. is it wise to pivot the data first, or should I stick to list and set operations? Or, can the fact that the initial matrix is sparse be used to compute directly a sparse similarity matrix, rather than compute it fully and reduce it to sparse later?
Any advice would be greatly appreciated. Thanks!
BTW, yes, I saw this thread: Cluster Analysis in R on large sparse matrix ; which does not seem to have been answered conclusively.
The R interpreter is really slow.
So you should use R mostly to "drive" your program, but implement all the computations heavy stuff in C or FORTRAN.
You didn't show the code you are using, but I guess it involves nested for loops? Try to rewrite it without any for loops in R, or rewrite it in C.
But no matter what, AP clustering will always remain very slow. It involves many passes over O(n²) matrixes, i.e. it scales very badly.
Sorry if this is dumb but I was just thinking I should give a shot. Say I have a graph thats huge(for example, 100 billion nodes). Neo4J supports 32 Billion and others support more or less the same, so say I cannot have the entire dataset in a database at the same time, can I run pagerank on it if its a directed graph(no loops) and each set of nodes connect to the next set of nodes(so no new links will be created backwards, only new links are created to new sets of data).
Is there a way I can somehow take the previous pagerank scores and apply them to new datasets(I only care about the pagerank for the most recent set of data but need the previous set's pagerank to derive the last sets data)?
Does that make sense? If so, is it possible to do?
You need to compute the principle eigenvector of a 100 billion by 100 billion matrix. Unless it's extremely sparse, you can not fit that inside your machine. So, you need a way to compute the leading eigenvector of a matrix when you can only look at a small part of your matrix at a time.
Iterative methods to compute eigenvectors only require that you store a few vectors at each iteration (they'll each have 100 billion elements). Those may fit on your machine (with 4 byte floats you'll need around 375GB per vector). Once you have a candidate vector of rankings you can (very slowly) apply your giant matrix to it by reading the matrix in chunks (since you can look at 32 billion rows at a time you'll need just over 3 chunks). Repeat this process and you'll have the basics of the power method which is what gets used in pagerank. cf http://www.ams.org/samplings/feature-column/fcarc-pagerank and http://en.wikipedia.org/wiki/Power_iteration
Of course the limiting factor here is how many times you need to examine the matrix. It turns out that by storing more than one candidate vector and using some randomized algorithms you can get good accuracy with fewer reads of your data. This is a current research topic in the applied math world. You can find more information here http://arxiv.org/abs/0909.4061 , here http://arxiv.org/abs/0909.4061 , and here http://arxiv.org/abs/0809.2274 . There's code available here: http://code.google.com/p/redsvd/ but you can't just use that off-the-shelf for the data sizes you're talking about.
Another way you may go about this is to look into "incremental svd" which may suit your problem better but is a bit more complicated. Consider this note: http://www.cs.usask.ca/~spiteri/CSDA-06T0909e.pdf and this forum: https://mathoverflow.net/questions/32158/distributed-incremental-svd
I am using R tool to calculate SVD (svd(m)) and it works on small matrix but as I pass it 20Kx20X matrix. After processing, it gives the following error
Error in svd(m) : infinite or missing values in 'x'
I checked and there is no row or column with all 0 values and no duplicate in row and
column. All columns have values.
I cannot past 20Kx20K matrix here :(
I am guessing that your problem is not related to memory size, although I am not able to process a 20Kx20K matrix on my 4GB memory machine.
The reason for this guess is that the first line of code inside svd() is the following:
if (any(!is.finite(x)))
stop("infinite or missing values in 'x'")
In other words, the svd() function test first whether there are any infinite values in your data.
This happens before any further processing. So, if you had memory problems, these would be apparent even before your call to svd().
I suggest you check for infinite values:
x <- c(0, Inf, NA, NULL)
which(!is.finite(x))
[1] 2 3
This indicates that the second and third values are considered to be not finite. In other words, any NA values in your data will cause your error.
Possibly the svd calculation itself also uses a lot of memory. If we compare to MATLAB, we see that the svd calculation allocates just as much memory as the matrix itself uses, so if you already ise 3GB of memory, the svd calculation possibly allocates another 3GB, which gives 6GB of memory.
If you're storing doubles that are 8 bytes, 20Kx20K means 8*20,000*20,000/1024/1024 ~ 3GB of RAM to hold the whole thing in memory.
I don't know how much RAM you've got available, but I'd look into what R can do to serialize the matrix out to disk as needed.
Is the matrix sparse or banded? Can you do something to decrease the amount of memory you need?
How large is the null space for you matrix? What's the condition number (ratio of largest-to-smallest eigenvalue)? A large condition number can be an indication of difficulties in solving. A matrix need not have a zero row or column to be nearly singular.
UPDATE:
Based on your comment, I'd say that RAM is the least of your problems. Sounds like it's possible to hold the entire matrix in memory - if you can address it all. You can address the entire matrix. You're running on a 64-bit OS - is your version of R 64-bit as well?
Unfortunately, one of the byproducts of SVD is to get the size of the null space.
You can get the minimum eigenvalue for your matrix using Jacobi iteration. Lanczos might be a good choice for getting the maximum eigenvalue. It'd be a lot of work to get all of them; you might just want the five smallest and largest to assess.
Anytime I experience an error with some software I immediately paste it into a Google search. At least it's comforting to know that I'm not the first to experience a particular problem:
http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=Error+in+svd(m)+:+infinite+or+missing+values+in+'x'