How to implement Floyd warshall algorithm without adjacency list? - graph

I have a sparse graph in form of dictionary and i want to calculate the shortest path between all pairs of nodes in graph. However the graph is sparse in nature and thus its representation in adjacency matrix form will be really inefficient. So i was wondering if there is way through which we can use the dictionary in Floyd warshall theorem.
dict format
d={parent:[childnode1, childnode2,...], parent2[child....]}

Related

All pairs shortest paths in graph directed with non-negative weighted edges

I have a directed graph with non-negative weighted edges where there are multiple edges between two vertices.
I need to compute all pairs shortest path.
This graph is very big (20 milion vertices and 100 milion of edges).
Is Floyd–Warshall the best algorithm ? There is a good library or tool to complete this task ?
There exists several all-to-all shortest paths algorithms for directed graphs with non-negative cycles, Floyd-Warshall being probably the most famous, but with the figures you gave, I think you will have in any case memory issues (time could be an issue, but you can find all-to-all algorithm that can be easily and massively parallelized).
Independently of the algorithm you use, you will have to store the result somewhere. And storing 20,000,000² = 400,000,000,000,000 paths length (if not the full paths themselves) would use hundreds of terabytes, at the very least.
Accessing any of these results would probably be longer than calculating one shortest path (memory wall), which can be done in less than a milisecond (depending on the graph structure, you can find techniques that are much, much faster than Dijkstra or any priority queue based algorithm).
I think you should look for an alternative where computing all-to-all shortest paths is not required, to be honnest. Or, to study the structure of your graph (DAG, well structured graph easy to partition/cluster, geometric/geographic information ...) in order to apply different algorithms, because in the general case, I do not see any way around.
For example, with the figures you gave, an average degree of about 5 makes for a decently sparse graph, considering its dimensions. Graph partitioning approaches could then be very useful.

MST on Complete graph to cluster them (for cosine similarity)

I need to cluster (let's say given as parameter k), words (that I
store in array List) according to their cosine similarity. I have stored my all words as vertexes in list in a complete ,weighed, and undirected graph (that uses adjacency list), and put their cosine similarity values on edges. As I understand I need to use MST (Kruskals Algorithm) for clustering process.
However since, my graph is complete graph and MST used for connected graphs, I am kind of confused how to use it on complete graph? Or I am doing wrong by using complete graph?
This is my wordList:
[directors, producers, film, movie, black, white, man, woman, person, man, young, woman, science, fiction, thrilling, realistic, lovely, stunning, criminals, zombies, father, son, girlfriend, boyfriend, nurse, soldier, professor, college]
And I need to cluster them by MST so that if k (number of clusters) is 2 it will be like this (2 clusters according to their similarities):
boyfriend,college,father,girlfriend,man,nurse,person,professor,son,woman,young
criminals,directors,fiction,film,lovely,movie,producers,science,stunning,thrilling,zombies
It's standard to use minimum spanning trees on complete graphs.
You will often find the runtime complexity given for this case separately. You may want to check if Prim's is faster than Kruskal's on a complete graph.
Clustering with the minimum spanning tree is also known as Single-Link clustering, and the fast SLINK algorithm is closely related to Prim's MST algorithm. But the output format is more suitable for clustering.

Graph Querying on edge

Attributed Graphs are most commonly represented as an adjacency matrix or a list where nodes are considered first class citizens. There are many graph queries such as neighborhood, shortest path, page rank, connected component that operate on these matrix and list structures on nodes. The attributes of the node/edge can also be stored apart from the connections.
Another representation of the graph is an incidence matrix where the incident edges of a node are recorded in a matrix. I understand they represent exactly the same information as previous node-based methods.
My question is, are there any graph queries/workloads/algorithms that can benefit from the incidence matrix structure rather than using the node-based structures i.e. favoring an edge-based structure? When exactly are the incidence matrix used?
I can think of only one case where incidence matrix may prove faster:
Finding the degree of a node or finding adjacent nodes is an operation with complexity O(V) when using an adjacency matrix and O(E) when using an incidence matrix.
Usually E>V, but this may not be the case if the graph has many 0-degree nodes. Since finding adjacent nodes is a basic operation, many algorithms may prove to be faster on such graphs.

graphs representation : adjacency list vs matrix

I'm preparing for a coding interview, and was refreshing my mind on graphs. I was wondering the following : in all places I've seen, it is assumed that adjacency lists are more memory efficient than adjacency matrices for large sparse graphs, and should thus be preferred in that case. In addition, computing the number of outgoing edges from a node requires O(N) in a matrix while it's O(1) in a list, as well as which are the adjacent nodes in O(num adjacent nodes) for the list instead of O(N) for the matrix.
Such places include Cormen et al.'s book, or StackOverFlow : Size of a graph using adjacency list versus adjacency matrix? or Wikipedia.
However, using a sparse matrix representation like with Compressed Row Storage representation, the memory requirement is just in O(number of non-zeros) = O(number of edges), which is the same as using lists. The number of outgoing edges from a node is O(1) (it is directly stored in CRS), and the adjacent nodes can be listed in O(num adjacent nodes).
Why isn't it discussed ? Should I assume that CSR is a kind of adjacency list representation of the graph represented by the matrix ? Or is the argument that matrices are memory intensive flawed because they don't consider sparse matrix representations ?
Thanks!
Not everyone uses sparse matrix representations every day (I just happen to do so :), so I guess nobody thought of them. They are a kind of intermediate between adjacency lists and adjacency matrices, with performance similar to the first if you pick the right representation, and are very convenient for some graph algorithms.
E.g., to get a proximity matrix over two hops, you just square the matrix. I've successfully done this with sparse matrix representations of the Wikipedia link structure in modest amounts of CPU time.

JUNG graphs: vertex similarity?

I have a JUNG graph containing about 10K vertices and 100K edges, and I'd like to get a measure of similarity between any pair of vertices.
The vertices represent concepts (e.g. dog, house, etc), and the links represent relations between concepts (e.g. related, is_a, is_part_of, etc).
The vertices are densely inter-linked, so a shortest-path approach doesn't give good results (the shortest paths are always very short).
What approaches would you recommend to rank the connectivity between vertices?
JUNG has some algorithms to score the importance of vertices, but I don't understand if there are measures of similarity between 2 vertices.
SimPack seems also promising.
Any hints?
The centrality scores don't measure similarity of pairs of vertices, but some kind of (depending on the method) centrality of single nodes of the network in general. Therefore this approach is possibly not what you want.
SimPack indeed has a nice goal set out, but for graphs it implements isomorphism-based comparations, which rather compare multiple graphs for similarity than pairs of nodes of one given graph. Therefore this is out of scope for now.
What you are seeking are so-called graph clustering methods (also called network module determination or network community determination methods), which divide the graph (network) into multiple partitions so that the nodes in each partition are more strongly interconnected with each other than with nodes of other partitions.
The most classic method is maybe the betweenness centrality clustering of Newman & Girvan where you can exploit the dendrogram for similarity calculation, and it is in JUNG. Of course there are throngs of methods nowadays. You may want to try (shameless plug) our ModuLand method, or read the fine table of module detection algorithms at the end of the Electronic Supplementary Material. That is an overlapping graph clustering method family, that is its result for each node is a vector containing the strengths of belonging to any respective cluster of the network. Pairwise node similarity is easy to derive from pairs of these node-to-cluster vectors.
Graph clustering is non-trivial, and possible you would need to adapt any method for very precise domain-specific results, but that's up to the reader ;) Good luck!

Resources