How to handle subgraph isomorphism on graphs with millions of vertices and tens of millions of edges? - graph

I implemented the VF3P algorithm, but when the size of the graph reaches millions of vertices and tens of millions of edges (e.g., the soc_liveJournal dataset), finding matches for a pattern graph of a specified size becomes very time-consuming (it ran for a whole night on a server with 16 CPU cores and 32 GB of memory and did not finish).
Is there any good algorithm that can solve subgraph isomorphism on ultra-large-scale graphs?
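For reference, a minimal sketch of the kind of query involved, using networkx's VF2 matcher (GraphMatcher) as a stand-in for VF3P; the file name and the 4-cycle pattern are placeholders:

    import networkx as nx
    from networkx.algorithms.isomorphism import GraphMatcher

    # Load the data graph from a SNAP-style edge list (placeholder file name),
    # treating it as undirected for simplicity.
    G = nx.read_edgelist("soc-LiveJournal1.txt", nodetype=int)

    # A small pattern graph of a specified size (a 4-cycle here, as a placeholder).
    pattern = nx.cycle_graph(4)

    # GraphMatcher(G, pattern) searches for embeddings of `pattern` inside `G`;
    # even enumerating the first few matches can take a very long time at this scale.
    matcher = GraphMatcher(G, pattern)
    for i, mapping in enumerate(matcher.subgraph_isomorphisms_iter()):
        print(mapping)
        if i >= 4:
            break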

Related

All-pairs shortest paths in a directed graph with non-negative edge weights

I have a directed graph with non-negative edge weights in which there can be multiple edges between a pair of vertices.
I need to compute all-pairs shortest paths.
The graph is very big (20 million vertices and 100 million edges).
Is Floyd–Warshall the best algorithm? Is there a good library or tool for this task?
There exist several all-to-all shortest-path algorithms for directed graphs without negative cycles, Floyd–Warshall probably being the most famous, but with the figures you gave I think you will in any case run into memory issues (time could also be an issue, but there are all-to-all algorithms that can be easily and massively parallelized).
Independently of the algorithm you use, you will have to store the result somewhere, and storing 20,000,000² = 400,000,000,000,000 path lengths (let alone the full paths themselves) would take hundreds of terabytes at the very least.
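A quick back-of-the-envelope check of that storage claim:

    # 20 million vertices => 20,000,000^2 pairwise distances to store.
    n = 20_000_000
    entries = n * n                    # 400,000,000,000,000 distances
    # Even at only 4 bytes per distance, that is about 1.6 petabytes:
    print(entries * 4 / 1e12, "TB")    # ~1600 TB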
Accessing any of these precomputed results would probably take longer than computing one shortest path from scratch (the memory wall), which can be done in less than a millisecond (depending on the graph structure, there are techniques much, much faster than Dijkstra or any priority-queue-based algorithm).
To be honest, I think you should look for an alternative in which computing all-to-all shortest paths is not required. Or study the structure of your graph (DAG, well-structured graph that is easy to partition/cluster, geometric/geographic information, ...) in order to apply different algorithms, because in the general case I do not see any way around it.
For example, with the figures you gave, an average degree of about 5 makes for a decently sparse graph considering its dimensions, so graph-partitioning approaches could be very useful.
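As an illustration of the "compute on demand" alternative, a minimal sketch that computes single-source distances only when needed, assuming the graph fits in memory as a SciPy sparse matrix; the input file name is a placeholder, and parallel edges are assumed to have already been reduced to their minimum weight:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import dijkstra

    # Placeholder input: one "source target weight" triple per line, 0-based ids.
    # Note: csr_matrix sums duplicate (i, j) entries, so parallel edges must be
    # reduced to their minimum weight beforehand.
    edges = np.loadtxt("edges.txt")
    src = edges[:, 0].astype(int)
    dst = edges[:, 1].astype(int)
    w = edges[:, 2]
    n = int(max(src.max(), dst.max())) + 1

    adj = csr_matrix((w, (src, dst)), shape=(n, n))

    # Distances from a single source (vertex 0), computed when needed,
    # instead of materialising the full 20e6 x 20e6 distance matrix.
    dist = dijkstra(adj, directed=True, indices=0)
    print(dist[:10])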

DAG-shortest path vs Dijkstra algorithm

I have implemented Dijkstra's algorithm from the pseudocode in Introduction to Algorithms, 3rd edition, by Cormen et al., for the single-source shortest path problem.
My implementation is in Python and uses linked lists to represent graphs in an adjacency-list representation: the list of nodes is a linked list, and each node has a linked list representing its edges. Furthermore, I did not implement or use a binary heap or Fibonacci heap for the min-priority queue that the algorithm needs, so when the procedure has to extract the node with the smallest distance from the source, I search for it in O(V) time inside the linked list of nodes.
On the other hand, the reference also provides an algorithm for DAG's (which I have implemented) using a topological sort before applying the relaxation procedure to all the edges.
With all this context, I have a Dijkstra implementation with a complexity of
O(V^2)
And a DAG-shortest path algorithm with a complexity of
O(V+E)
By using the timeit.default_timer() function to measure the running times of the algorithms, I have found that the Dijkstra algorithm is faster than the DAG algorithm when applied to DAGs with positive edge weights and different graph densities, for both 100 and 1000 nodes.
Shouldn't the DAG-shortest path algorithm be faster than Dijkstra for DAGs?
Your running-time analysis for both algorithms is correct, and it is true that the DAG shortest-path algorithm is asymptotically faster than Dijkstra's algorithm on DAGs.
However, there are three possible reasons for your test results:
The graphs you used for testing are very dense. When a graph is very dense, E ≈ V^2, so the running time of both algorithms approaches O(V^2).
The number of vertices is still not large enough. To check this, you can use much larger graphs for further testing.
The initialization of the DAG algorithm (building the graph and computing the topological sort) takes a large share of the running time.
In any case, the DAG shortest-path algorithm should theoretically be faster than Dijkstra's algorithm.
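For reference, a minimal sketch of the O(V+E) procedure being discussed (a topological sort via Kahn's algorithm followed by relaxation in topological order); the adjacency-dict format and the tiny example graph are illustrative:

    from collections import deque
    import math

    def dag_shortest_paths(adj, source):
        """adj: dict mapping every vertex u -> list of (v, weight) edges."""
        # Kahn's algorithm for a topological order, O(V + E).
        indeg = {u: 0 for u in adj}
        for u in adj:
            for v, _ in adj[u]:
                indeg[v] += 1
        queue = deque(u for u in adj if indeg[u] == 0)
        topo = []
        while queue:
            u = queue.popleft()
            topo.append(u)
            for v, _ in adj[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)

        # Relax edges in topological order, O(V + E) overall.
        dist = {u: math.inf for u in adj}
        dist[source] = 0
        for u in topo:
            if dist[u] < math.inf:
                for v, w in adj[u]:
                    if dist[u] + w < dist[v]:
                        dist[v] = dist[u] + w
        return dist

    # Example: a small DAG with every vertex present as a key.
    g = {0: [(1, 2), (2, 6)], 1: [(2, 3), (3, 1)], 2: [(3, 1)], 3: []}
    print(dag_shortest_paths(g, 0))   # {0: 0, 1: 2, 2: 5, 3: 3}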

Igraph R Community Detection

We are using igraph in R to detect communities in a large dataset of 7 million nodes and around 100 million edges.
We used several methods (cluster_walktrap, cluster_edge_betweenness, cluster_fast_greedy), and with every method we have no result after 3-4 hours of processing. We are using a server with 64 GB of memory, and during the run there is still free memory (so it is not a memory problem), but there is no result at the end.
Do you have any idea what the problem might be?
Thank you
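One commonly suggested route at this scale is to switch to the multilevel (Louvain) method, which is far cheaper than walktrap or edge betweenness. A minimal sketch, shown here with igraph's Python bindings rather than R, with a placeholder file name:

    import igraph as ig

    # Placeholder edge list: one "source target" pair per line.
    g = ig.Graph.Read_Edgelist("edges.txt", directed=False)

    # cluster_edge_betweenness is roughly O(|V||E|^2) and cluster_walktrap is
    # also expensive; the multilevel (Louvain) method typically handles graphs
    # with ~10^8 edges in minutes.
    communities = g.community_multilevel()
    print(len(communities), "communities, modularity =", communities.modularity)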

Use Apache Giraph as Neo4j with Big Amount of Data

I have been running some tests on Neo4j, calculating the shortest path between 2 nodes.
With 100k nodes and 10 million edges (100 edges per node), the shortest-path algorithm ran in 0.4-3 s.
With 200k nodes and 40 million edges (200 edges per node), it takes at least 40 s or more.
My computer obviously isn't intended for big-data analysis, but I don't even know whether buying a server with 128 GB of RAM and a bunch more processors could solve the second test in a reasonable time. (Do you think it could?)
Certainly with 1 million nodes or more, Neo4j will not help me anymore.
I have spent many hours looking online for some way to use Giraph like Neo4j: having some sort of API (even in Java) through which I can run a query and get a result. But nothing.
Thanks in advance
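For context, a hedged sketch of the kind of query being benchmarked, using the Neo4j Python driver and Cypher's shortestPath; the node label, property name, connection details, and hop limit are all placeholders:

    from neo4j import GraphDatabase

    # Placeholder connection details.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = (
        "MATCH (a:Node {id: $src}), (b:Node {id: $dst}), "
        "p = shortestPath((a)-[*..15]-(b)) "
        "RETURN length(p) AS hops"
    )

    with driver.session() as session:
        record = session.run(query, src=1, dst=2).single()
        print(record["hops"] if record else "no path within 15 hops")

    driver.close()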

Betweenness centrality for relatively large scale data

Using R, I am trying to calculate betweenness centrality for about 1 million nodes and more than 20 million edges. For this I have a pretty decent machine with 128 GB of RAM, 4 × 2.40 GHz CPUs, and 64-bit Windows.
Yet using betweenness() from igraph takes ages. I am wondering whether there is any quick solution. Would it be faster if I used Gephi?
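One common workaround is to approximate rather than compute exact betweenness. A minimal sketch with igraph's Python bindings, bounding the considered shortest-path length with the cutoff parameter (the file name and cutoff value are placeholders):

    import igraph as ig

    # Placeholder edge list file.
    g = ig.Graph.Read_Edgelist("edges.txt", directed=False)

    # Exact (Brandes) betweenness is roughly O(|V||E|), which is infeasible at
    # ~10^6 nodes / 2*10^7 edges; cutting shortest paths off at a maximum length
    # gives an approximation at a fraction of the cost.
    approx_bc = g.betweenness(cutoff=4)
    print(max(approx_bc))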

Resources