Is it possible to do pagerank without the entire dataset? - graph

Sorry if this is dumb but I was just thinking I should give a shot. Say I have a graph thats huge(for example, 100 billion nodes). Neo4J supports 32 Billion and others support more or less the same, so say I cannot have the entire dataset in a database at the same time, can I run pagerank on it if its a directed graph(no loops) and each set of nodes connect to the next set of nodes(so no new links will be created backwards, only new links are created to new sets of data).
Is there a way I can somehow take the previous pagerank scores and apply them to new datasets(I only care about the pagerank for the most recent set of data but need the previous set's pagerank to derive the last sets data)?
Does that make sense? If so, is it possible to do?

You need to compute the principle eigenvector of a 100 billion by 100 billion matrix. Unless it's extremely sparse, you can not fit that inside your machine. So, you need a way to compute the leading eigenvector of a matrix when you can only look at a small part of your matrix at a time.
Iterative methods to compute eigenvectors only require that you store a few vectors at each iteration (they'll each have 100 billion elements). Those may fit on your machine (with 4 byte floats you'll need around 375GB per vector). Once you have a candidate vector of rankings you can (very slowly) apply your giant matrix to it by reading the matrix in chunks (since you can look at 32 billion rows at a time you'll need just over 3 chunks). Repeat this process and you'll have the basics of the power method which is what gets used in pagerank. cf http://www.ams.org/samplings/feature-column/fcarc-pagerank and http://en.wikipedia.org/wiki/Power_iteration
Of course the limiting factor here is how many times you need to examine the matrix. It turns out that by storing more than one candidate vector and using some randomized algorithms you can get good accuracy with fewer reads of your data. This is a current research topic in the applied math world. You can find more information here http://arxiv.org/abs/0909.4061 , here http://arxiv.org/abs/0909.4061 , and here http://arxiv.org/abs/0809.2274 . There's code available here: http://code.google.com/p/redsvd/ but you can't just use that off-the-shelf for the data sizes you're talking about.
Another way you may go about this is to look into "incremental svd" which may suit your problem better but is a bit more complicated. Consider this note: http://www.cs.usask.ca/~spiteri/CSDA-06T0909e.pdf and this forum: https://mathoverflow.net/questions/32158/distributed-incremental-svd

Related

Why does textreuse packge in R make LSH buckets way larger than the original minhashes?

As far as I understand one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed ROpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
Now let's create the LSH buckets for this object (64 bands) and assign it to l, I will have:
object.size(l$buckets)
> 6704 bytes
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses a md5 digest to create the bucket hashes.
But isn't this too wasteful / overkill, and can't I improve it? Is it normal that our data reduction technique ends up bloating up to this extent? And isn't it more efficacious to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.
Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)
The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliable detect potential matches that only share a partial overlap (i.e., with a Jaccard score that is closer to 0), then you need more hashes/bands.
Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected; while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.

affinity propagation clustering on a large sparse matrix

I am trying R package apcluster on a set of objects that I want to cluster, but I'm running into performance/memory problems, and I suspect I'm not doing it right. I'd like to hear your opinion, please.
In short: I have a set of about 13000 objects. Each object is associated with a set of 2 to 5 'features'. The similarity (by which I want to cluster, eventually) between any two objects i and j is equal to the number of features they have in common divided by the total number of distinct features they 'span'. E.g. if i = {a,b,c} and j = {c,d}, then sim[i,j] = 1/4 = 0.25, because they have only 1 feature in common ({c}) and in total they describe 4 distinct features ({a,b,c,d}).
Calculating my NxN similarity matrix is not a problem in theory: it can be done using set operations if each object's features are stored as a list; or features can be pivoted to a matrix of 1's and 0's, where each column is a feature, and then R's function dist with method="binary" does the trick.
In practice however, the first problem is that such similarity calculations are extremely slow. For 13 K objects, there are about 84.5 M similarities to calculate, but this doesn't sound so bad for a modern computer. I don't understand why it should take a few hours to do that. And the set operation version, that should be quicker as far as I can tell, is actually much slower than dist. [Another package called fingerprint is supposed to deal with such cases more efficiently, but so far I haven't been able to make it work, it gives a lot of errors when trying to make what they call 'featvec' objects].
The other thing to consider is that the 2-5 features per object are not very repetitive. There may be a group of 100 or so objects with at least one feature in common between them, but then none of the other 12.9 K objects has any feature in common with these 100 objects. The consequence is that the pivoted feature matrix is very sparse (if we consider 0's as empty). There are about 4000 columns in the pivoted matrix, and each row has at most 5 1's. I wonder if this is negatively impacting the performance of dist, in that it has to multiply through a lot of 0's that could instead be ignored.
Does it seem normal to you that it should take a few hours to apply dist to a matrix like the one I described? Can you suggest a different way to calculate the similarity that takes advantage of the sparseness of the matrix?
Anyway, I managed to get the output from dist, which however had class 'dist', and was a distance matrix, not a similarity one, so I had to use 1 - as.matrix(distance_matrix) to be able to make the similarity matrix apcluster needs as input.
That's when I got the first 'memory' problem. R said the vector could not be allocated due to its size. I tried the usual tricks, but in the end I could not get more than 4 GB, and my matrices are (apparently) bigger.
I overcame this by assigning each time new matrices to their old 'self'.
And then when I submitted this painstakingly put together similarity matrix to apcluster, again the vector size error popped up, as if the first thing apcluster did was create some other large object from what I had fed it.
I had a look at as.Sparse... in apcluster, but it does not seem to help a lot, considering that you have to calculate the full matrix first anyway.
In the end the only thing that worked a little bit was 'leveraged affinity propagation' by apclusterL, which however is an approximation.
Does anybody know if and how I could do this better? E.g. is it wise to pivot the data first, or should I stick to list and set operations? Or, can the fact that the initial matrix is sparse be used to compute directly a sparse similarity matrix, rather than compute it fully and reduce it to sparse later?
Any advice would be greatly appreciated. Thanks!
BTW, yes, I saw this thread: Cluster Analysis in R on large sparse matrix ; which does not seem to have been answered conclusively.
The R interpreter is really slow.
So you should use R mostly to "drive" your program, but implement all the computations heavy stuff in C or FORTRAN.
You didn't show the code you are using, but I guess it involves nested for loops? Try to rewrite it without any for loops in R, or rewrite it in C.
But no matter what, AP clustering will always remain very slow. It involves many passes over O(n²) matrixes, i.e. it scales very badly.

How to find a point on 2-d weighted map, which will have equidistant (as close as possible) paths to multiple endpoints?

So let's say I got a matrix with two types of cells: 0 and 1. 1 is not passable.
I want to find a point, from which I can run paths (say, A*) to a bunch of destinations (don't expect it to be more than 4). And I want the length of these paths to be such that l1/l2/l3/l4 = 1 or as close to 1 as possible.
For two destinations is simple: run a path between them and take the midpoint. For more destinations, I imagine I can run paths between each pair, then they will create a sort of polygon, and I could grab the centroid (or average of all path point coordinates)? Or would it be better to take all midpoints of paths between each pair and then use them as vertices in a polygon which will contain my desired point?
It seems you want to find the point with best access to multiple endpoints. For other readers, this is like trying to found an ideal settlement to trade with nearby cities; you want them to be as accessible as possible. It appears to be a variant of the Weber Problem applied to pathfinding.
The best solution, as you can no longer rely on exploiting geometry (imagine a mountain path or two blocking the way), is going to be an iterative approach. I don't imagine it will be easy to find an optimal solution because you'll need to check every square; you can't guess by pathing between endpoints anymore. In nearly any large problem space, you will need to path from each possible centroid to all endpoints. A suboptimal solution will be fairly fast. I recommend these steps:
try to estimate the centroid using geometry, forming a search area
Use a modified A* algorithm from each point S in the search area to all your target points T to generate a perfect path from S to each T.
Add the length of each path S -> T together to get Cost (probably stored in a matrix for all sample points)
Select the lowest Cost from all your samples in the matrix (or the entire population if you didn't cull the search space).
The algorithm above can also work without estimating a centroid and limiting solutions. If you choose to search the entire space, the search will be much longer, but you can find a perfect solution even in a labyrinth. If you estimate the centroid and start the search near it, you'll find good answers faster.
I mentioned earlier that you should use a modified A* algorithm... Rather than repeating a generic A* search S->Tn for every T, code A* so that it seeks multiple target locations, storing the paths to each one and stopping when it has found them all.
If you really want a perfect solution to the problem, you'll be waiting a long time, so I recommend that you use any exploit you can to reduce wasteful calculations. Even go so far as to store found paths in a lookup table for each T, and see if a point already exists along any of those paths.
To put it simply, finding the point is easy. Finding it fast-enough might take lots of clever heuristics (cost-saving measures) and stored data.

Find solution minimum spanning tree (with conditions) when extending graph

I have a logic question, therefore chose from two explanations:
Mathematical:
I have a undirected weighted complete graph over 2-14 nodes. The nodes always come in pairs (startpoint to endpoint). For this I already have the minimum spanning tree, which considers that the pairs startpoint always comes before his endpoint. Now I want to add another pair of nodes.
Real life explanation:
I already have a optimal taxi route for 1-7 people. Each joins (startpoint) and leaves (endpoint) at different places. Now I want to find the optimal route when I add another person to the taxi. I have already the calculated subpaths from each point to each point in my database (therefore this is a weighted graph). All calculated paths are real value, not heuristics.
Now I try to find the most performant solution to solve this. My current idea:
Find the point nearest to the new startpoint. Add it a) before and b) after this point. Choose the faster one.
Find the point nearest to the new endpoint. Add it a) before and b) after this point. Choose the faster one.
Ignoring the case that the new endpoint comes before the new start point, this seams feasible.
I expect that the general direction of the taxi is one direction, this eliminates the following edge case.
Is there any case I'm missing in which this algorithm wouldn't calculate the optimal solution?
There are definitely many cases were this algorithm (which is a First Fit construction heuristic) won't find the optimal solution. Given a reasonable sized dataset, in my experience, I would guess to get improvements of 10-20% by simply taking that result and adding metaheuristics (or other optimization algo's).
Explanation:
If you have multiple taxis with a limited person capacity, it has an inherit bin packing problem, which is NP-complete (which is proven to be suboptimally solved by all known construction heuristics in P).
But even if you have just 1 taxi, it is similar to TSP: if you have the optimal solution for 10 locations and add 1 location, it can create a snowball effect in the optimal solution to make the optimal solution look completely different. (sorry, no visual image of this yet)
And if you need to any additional constraints on top of that later on, you need to be aware of these false assumptions.

Graph library implementation

I' m trying to implement a weighted graph. I know that there are two ways to implement a weighted graph. Either with a two dimensional array(adjacency matrix) or with an array of linked lists(adjacency list). Which one of the two is more efficient and faster?
Which one of the two is more efficient and faster?
That depends on your usage and the kinds of graphs you want to store.
Let n be the number of nodes and m be the number of edges. If you want to know whether two nodes u and v are connected (and the weight of the edge), an adjacency matrix allows you to determine this in constant time (in O-notation, O(1)), simply by retrieving the entry A[u,v]. With an adjacency list, you will have to look at every entry in u's list, or v's list - in the worst case, there could be n entries. So edge lookup for an adjacency list is in O(n).
The main downside of an adjacency matrix is the memory required. Alltogether, you need to store n^2 entries. With an adjacency list, you need to store only the edges that actually exist (m entries, asuming a directed graph). So if your graph is sparse, adjacency lists clearly occupy much less memory.
My conclusion would be: Use an adjacency matrix if your main operation is retrieving the edge weight for two specific nodes; under the condition that your graphs are small enough so that n^2 entries fit in memory. Otherwise, use the adjacency list.
Personally I'd go for the linked lists approach, assuming that it will often be a sparse graph (i.e. most of the array cells are a waste of space).
Went to wikipedia to read up on adjacency lists (been ages since I used graphs) and it has a nice section on the trade-offs between the 2 approaches. Ultimately, as with many either/or choices it will come down to 'it depends', based on what are the likely use cases for your library.
After reading the wiki article, I think another point in favor of using lists would be attaching data to each directed segment (or even different weights, think of walk/bike/car distance between 2 points etc.)

Resources