Reduce openstreetmap graph size in networkx

I have a graph (converted from OSMnx) of London's walking paths, containing 667.588 edges with different highway attributes (the street types in OpenStreetMap). Running a shortest_path algorithm is quite slow (4 seconds). To improve the speed, I want to substantially reduce the number of edges in a systematic way without losing the main connections and city structure, but I am not sure how to do it. Any suggestions? Is there a way to group nearby nodes into a more important one, and so reduce the size?

You can extract edges with the desired highway types from your main graph G:
import networkx as nx
highways_to_keep = ['motorway', 'trunk', 'primary']
H = nx.MultiDiGraph()
for u, v, attr in G.edges(data=True):
    # keep only edges whose 'highway' tag is in the list
    if attr.get('highway') in highways_to_keep:
        H.add_edge(u, v, attr_dict=attr)
        # copy the node attributes (NetworkX 1.x-style API)
        H.node[u] = G.node[u]
        H.node[v] = G.node[v]
Here, we first initialize an empty MultiDiGraph, which is the graph type used by OSMnx, then populate it with the edges of the main graph G whose 'highway' attribute is in our list highways_to_keep. You can find more about highway types on the OpenStreetMap wiki page for the highway key.
Our graph is a valid NetworkX graph, but you need to do one more thing before you can use OSMnx functionality on it as well. If you inspect G.graph, you will see the graph attributes, which contain the crs (coordinate reference system) and some other things. You should add this information to your newly created graph:
H.graph = G.graph
Here is the plot of H, produced with osmnx.plot_graph(H):

It depends on what type of network you're working with (e.g., walk, bike, drive, drive_service, all, etc.). The drive network type would be the smallest and prioritize major routes, but at the expense of pedestrian paths and passageways.
OSMnx also provides the ability to simplify the graph's topology with a built-in function. This is worth doing if you haven't already, as it can sometimes reduce graph size by 90% while faithfully retaining all intersection and dead-end nodes, as well as edge geometries.

The above solution does not work anymore since the networkx library has changed. Specifically
H.node[u] = G.node[u]
is not supported anymore.
The following solution relies on osmnx.geo_utils.induce_subgraph and passes a node list as an argument to this function.
highways_to_keep = ['motorway', 'trunk', 'primary', 'secondary', 'tertiary']
H = nx.MultiDiGraph()  # new graph
Hlist = []  # node list
for u, v, attr in G.edges(data=True):
    # some edges may have no 'highway' tag at all
    if "highway" in attr.keys():
        if attr['highway'] in highways_to_keep:
            Hlist.append(G.nodes[u]['osmid'])
H = ox.geo_utils.induce_subgraph(G, Hlist)
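The same filtering can also be sketched with plain NetworkX only, so that no OSMnx helper is assumed. The function name filter_by_highway and the tiny hand-made graph are illustrative and not part of either library; the sketch also assumes that an edge's 'highway' value may be a list, which OSMnx can produce when it merges edges during simplification.

```python
import networkx as nx

def filter_by_highway(G, highways_to_keep):
    """Return a copy of G holding only edges whose 'highway' tag is wanted."""
    keep = set(highways_to_keep)

    def wanted(tag):
        # A single edge may carry one highway value or a list of them.
        tags = tag if isinstance(tag, list) else [tag]
        return any(t in keep for t in tags)

    edges = [
        (u, v, k)
        for u, v, k, attr in G.edges(keys=True, data=True)
        if wanted(attr.get("highway"))
    ]
    # edge_subgraph keeps node, edge and graph attributes; copy() detaches the view.
    return G.edge_subgraph(edges).copy()


# Tiny hand-made graph standing in for an OSMnx MultiDiGraph (illustrative data).
G = nx.MultiDiGraph(crs="epsg:4326")
G.add_node(1, x=0.0, y=0.0)
G.add_node(2, x=1.0, y=0.0)
G.add_node(3, x=1.0, y=1.0)
G.add_edge(1, 2, highway="primary", length=100.0)
G.add_edge(2, 3, highway="footway", length=50.0)
G.add_edge(3, 1, highway=["trunk", "primary"], length=120.0)

H = filter_by_highway(G, ["motorway", "trunk", "primary"])
print(sorted(H.edges(keys=True)))
```

Because the subgraph is built from edges rather than nodes, an unwanted street between two kept nodes is not pulled back in, and the crs stored in G.graph carries over to H without a separate assignment.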

The osmnx simplification module worked for me in this case https://osmnx.readthedocs.io/en/stable/osmnx.html#module-osmnx.simplification:
osmnx.simplification module
Simplify, correct, and consolidate network topology.
osmnx.simplification.consolidate_intersections(G, tolerance=10, rebuild_graph=True, dead_ends=False, reconnect_edges=True)
Consolidate intersections comprising clusters of nearby nodes.
osmnx.simplification.simplify_graph(G, strict=True, remove_rings=True)
Simplify a graph’s topology by removing interstitial nodes.
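To illustrate what "removing interstitial nodes" means, here is a minimal pure-Python sketch. It is not the OSMnx implementation: it assumes an undirected graph stored as a dict of dicts, ignores edge geometry and one-way streets, and the name contract_degree_two is made up for this example.

```python
def contract_degree_two(adj):
    """Remove nodes with exactly two neighbours, joining those neighbours.

    adj maps node -> {neighbour: length} for an undirected graph.
    """
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}  # work on a copy
    for node in list(adj):
        nbrs = adj[node]
        if len(nbrs) != 2:
            continue  # intersections and dead ends are kept
        a, b = nbrs
        if b in adj[a]:
            continue  # merging would create a parallel edge; keep the node
        total = nbrs[a] + nbrs[b]
        adj[a][b] = adj[b][a] = total
        del adj[a][node], adj[b][node], adj[node]
    return adj


# A and B are intersections; x and y only bend the street between them.
streets = {
    "A": {"x": 40.0, "C": 10.0, "D": 10.0},
    "x": {"A": 40.0, "y": 25.0},
    "y": {"x": 25.0, "B": 35.0},
    "B": {"y": 35.0, "E": 10.0, "F": 10.0},
    "C": {"A": 10.0},
    "D": {"A": 10.0},
    "E": {"B": 10.0},
    "F": {"B": 10.0},
}
simplified = contract_degree_two(streets)
print(sorted(simplified), simplified["A"]["B"])
```

The street A-x-y-B collapses into a single edge A-B whose length is the sum of its parts, so shortest-path distances between the remaining nodes are unchanged while the search has fewer nodes to visit.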

Related

Convert a map with houses into a graph

I am curious how map software (Google/Bing maps) convert a map into a graph in the backend.
Now, if we add houses between intersections 1 and 2, how would the graph change? How does map software keep track of where the houses are?
Do they index the intersection nodes and also have smaller "subnodes" (between 1 and 2 in this case)? Or do they do this by having multiple layers? So when a user enters a house number, the software looks up where the house is (i.e. between which vertices it is located). After that, it simply applies a shortest path algorithm between those two nodes, and at the beginning and the end it basically makes the home node go to one of the main vertices.
Could someone please give me a detailed explanation of how this works? Ultimately I would like to understand how the shortest path is determined given the "addresses" of two "homes" (or "subnodes").

I can only speak for GraphHopper, not for the closed source services you mentioned ;)
GraphHopper has nodes (junctions) and edges (connections between those junctions), nearly exactly how your sketch looks like. This is very fast for the routing algorithms as it avoids the massive traversal overhead of subnodes. E.g. in an early version we used subnodes every time the connection was not straight (e.g. a curved street), and this was 8 times slower, so we avoided those 'pillar' nodes and only used the 'tower' nodes for routing.
Still you have to deal with two problems:
How to deal with queries starting on the edge at e.g. house number 1? This is solved via introducing virtual nodes for every query (which can contain multiple locations), and you also need the additional virtual edges and hide some real edges. In GraphHopper we create a lightweight wrapper graph around the original graph (called QueryGraph) which handles all this. It then behaves exactly like a normal 'Graph' for every 'RoutingAlgorithm' like Dijkstra or A*. Also it becomes a bit hairy when you have multiple query locations on one edge, e.g. for a route with multiple via points. But I hope you get the main idea. Another idea would be to do the routing for two sources and two targets but initialized with the actual distance not with 0 like it is normally done for the first nodes. But this makes the routing algorithms more complex I guess.
And as already stated, most of the connections between junctions are not straight, so you have to store this geometry somewhere and use it to draw the route, but also to 'snap a location to the closest road' before doing the actual routing. See LocationIndexTree for code.
Regarding directed graphs: GraphHopper stores the graph via undirected edges. To handle one-ways, it stores the access properties for every edge and for every vehicle separately. So we avoid storing two directed edges and all of their properties (name/geometry/..), and make the use case "one-way for car and two-way for bike" etc. possible. It additionally allows traversing an edge in the reverse direction, which is important for some algorithms, e.g. the bidirectional Dijkstra. This would not be possible if the graph structure itself were used to model the access property.
Regarding 'nearly exactly how your sketch looks like': node 1, 3, 7 and 8 would not exist as they are 'pillar' nodes. Instead they would only 'exist' in the geometry of the edge.
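The alternative mentioned above (routing from two sources and to two targets, initialized with the actual distance instead of 0) can be sketched as follows. This is not GraphHopper code: the function names, the (u, v, offset) way of describing a house position and the toy street data are assumptions for illustration, the graph is treated as undirected, and the special case of both houses lying on the same edge is ignored.

```python
import heapq

def dijkstra_seeded(adj, seeds):
    """Dijkstra where several start nodes begin with non-zero distances.

    adj: node -> {neighbour: length}; seeds: node -> initial distance.
    """
    dist = dict(seeds)
    heap = [(d, n) for n, d in seeds.items()]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, length in adj[node].items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def route_length(adj, start, end):
    """start/end are (u, v, offset): a point `offset` metres from u on edge u-v."""
    su, sv, s_off = start
    eu, ev, e_off = end
    # Seed both junctions of the start edge with the distance from the house.
    dist = dijkstra_seeded(adj, {su: s_off, sv: adj[su][sv] - s_off})
    # Finish at whichever junction of the end edge gives the shorter total.
    to_eu = dist[eu] + e_off
    to_ev = dist[ev] + (adj[eu][ev] - e_off)
    return min(to_eu, to_ev)


# Four junctions in a row: 1 -100- 2 -200- 3 -50- 4 (illustrative lengths).
adj = {1: {2: 100}, 2: {1: 100, 3: 200}, 3: {2: 200, 4: 50}, 4: {3: 50}}
# House A is 30 m from junction 1 on edge 1-2; house B is 20 m from 3 on edge 3-4.
print(route_length(adj, (1, 2, 30), (3, 4, 20)))
```

The graph itself never changes, which avoids virtual nodes, but the start and end handling moves into the routing code, which is the added complexity the answer warns about.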

To represent the connectivity of a road network, you want your directed road segments to be the graph nodes and your intersections to be collections of directed edges. There is a directed edge from X to Y if you can drive along X and then turn onto or continue on Y.
Consider the following example.
a====b====c
     |
     |  <--one way street, down
     |
     d
An example connectivity graph for this picture follows.
Nodes
ab
ba
bc
cb
bd
Edges
ab -> bc
ab -> bd
cb -> ba
cb -> bd
Note that this encodes the following information:
No U-turns are allowed at the intersection,
because the edges ab -> ba and cb -> bc are omitted.
When coming from the right a left turn onto the vertical road is allowed,
because the edge cb -> bd is included.
With this representation, each node (directed road segment) has as an attribute all of the addresses along its span, each marked at some distance along the directed road segment.
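The connectivity graph above can be derived mechanically from the list of directed segments. The following is a small sketch of that construction; the function name build_turn_graph and the allow_u_turns flag are made up for this example, and real data would also need per-intersection turn restrictions.

```python
def build_turn_graph(segments, allow_u_turns=False):
    """Map each directed segment to the segments that may follow it.

    segments: iterable of (from_point, to_point) pairs, one per driving direction.
    """
    segments = list(segments)
    turns = {seg: [] for seg in segments}
    for x in segments:
        for y in segments:
            continues = x[1] == y[0]        # y starts where x ends
            is_u_turn = (y[1], y[0]) == x   # y is x driven backwards
            if continues and (allow_u_turns or not is_u_turn):
                turns[x].append(y)
    return turns


# Directed segments from the example: the vertical road is one-way b -> d.
segments = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("b", "d")]
turns = build_turn_graph(segments)
for seg, following in turns.items():
    for nxt in following:
        print("".join(seg), "->", "".join(nxt))
```

Running it prints the same four edges as the list above. The double loop is quadratic in the number of segments, which is fine for a sketch; grouping segments by their start point would avoid that on a real network.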

Clustering in Gephi (Louvain Method)

I have started to work with gephi to help me display a dataset.
The dataset contains:
tags (terms for a certain picture) as nodes
Normalized Google Similarity Distance between those tags as edges with a weight (between 0 and 1)
Every tag is connected to every other tag, as long as they both belong to the same picture. So I have one cluster of nodes and edges for every picture.
I have now imported this dataset to gephi in the following format:
nodes: id, label
edges: target, source, weight (between 0 and 1)
There are about 500 nodes and 6000 edges.
My problem now is that after importing all those nodes and edges the graph looks bunched up with no real order. The cluster of each picture is mixed into the clusters of other pictures.
Using Modularity as the partition algorithm (which should use the Louvain method) colors the graph, with each color representing a picture. I can then untangle this mess using the Force Atlas 2 layout.
I now have a colored graph with about 15 clusters (every cluster represents one picture).
Now I want to cluster those clusters again, grouping tags (nodes) according to their Normalized Google Distance (the edge weights), which should then yield tags that are roughly equal in meaning.
I hope you guys understand what I want to accomplish.
I can also upload a picture to clarify it.
Thanks a lot

I don't think you can do that with the standard version of Gephi. You would need to develop a plugin to implement the very last step of your process.
Gephi is good for visualizing and browsing graphs, but (for now) there are more complete tools when it comes to processing topological properties. For instance, the igraph library (available in C, R and Python) might be more appropriate for you. Note that you can use a file format compatible with both Gephi and igraph, which allows you to use both tools on the same data.

I was able to solve my problem. I had to import each of these 15 clusters on its own. That way I could use the Modularity method on just those few.

NetworkX-style spring model layout for directed graphs in Graphviz / PyGraphviz

NetworkX is mostly for graph analysis, PyGraphviz mostly for drawing, and they're designed to work together. However, there's at least one respect in which NetworkX's graph drawing (via Matplotlib) is superior to PyGraphviz's graph drawing (via Graphviz), namely that NetworkX has a spring layout algorithm (accessible via the spring_layout function) specifically for directed graphs, while PyGraphviz has several spring layout algorithms (accessible via the neato program, and others) that lay out directed graphs as if they were undirected graphs. The only Graphviz / PyGraphviz layout program that really handles direction in a graph is dot, but dot creates hierarchical layouts, not force-directed layouts.
Here is an example that shows the difference between NetworkX and PyGraphviz for spring layouts of directed graphs:
import networkx as nx
import pygraphviz as pgv
import matplotlib.pyplot as ppt
edgelist = [(1,2),(1,9),(3,2),(3,9),(4,5),(4,6),(4,9),(5,9),(7,8),(7,9)]
nxd = nx.DiGraph()
nxu = nx.Graph()
gvd = pgv.AGraph(directed=True)
gvu = pgv.AGraph()
nxd.add_edges_from(edgelist)
nxu.add_edges_from(edgelist)
gvd.add_edges_from(edgelist)
gvu.add_edges_from(edgelist)
pos1 = nx.spring_layout(nxd)
nx.draw_networkx(nxd,pos1)
ppt.savefig('1_networkx_directed.png')
ppt.clf()
pos2 = nx.spring_layout(nxu)
nx.draw_networkx(nxu,pos2)
ppt.savefig('2_networkx_undirected.png')
ppt.clf()
gvd.layout(prog='neato')
gvd.draw('3_pygraphviz_directed.png')
gvu.layout(prog='neato')
gvu.draw('4_pygraphviz_undirected.png')
1_networkx_directed.png:(http://farm9.staticflickr.com/8516/8521343506_0c5d62e013.jpg)
2_networkx_undirected.png:(http://farm9.staticflickr.com/8246/8521343490_06ba1ec8e7.jpg)
3_pygraphviz_directed.png:(http://farm9.staticflickr.com/8365/8520231171_ef7784d983.jpg)
4_pygraphviz_undirected.png:(http://farm9.staticflickr.com/8093/8520231231_80c7eab443.jpg)
The third and fourth figures drawn are basically identical but for the arrowheads (the whole figure has been rotated, but apart from that, there's no difference). However, the first and second figures are differently laid out - and not just because NetworkX's layout algorithm introduces an element of randomness.
Repeatedly running the code above shows that this is not a chance occurrence. NetworkX's spring_layout function was apparently written on the assumption that if there is an arc from one node to another, the second node should be closer to the centre of the graph than the first (i.e. that if the graph described in edgelist is directed, node 2 should be closer to node 9 than nodes 1 and 3 are, node 6 should be closer to node 9 than node 4 is, and node 8 should be closer to node 9 than node 7 is; this doesn't always work perfectly as we see from nodes 4 and 5 in the first figure above, but that's a small issue compared to getting both 2 and 9 near the centre and the 'error' from my point of view is very slight). In other words, NetworkX's spring_layout is both hierarchical and force-directed.
That is a nice feature, because it makes core/periphery structures more obvious in directed graphs (where, depending on the assumptions you're working with, nodes without incoming arcs can be considered to be part of the periphery even if they have large numbers of outgoing arcs). skyebend has explained below why most layout algorithms treat directed graphs as if they were undirected, but the graphs above show (a) that NetworkX treats them differently, and (b) that it does so in a principled way that is helpful for analysis.
Can this be replicated using PyGraphviz / Graphviz?
Unfortunately the documentation and the commented source code for NetworkX's spring_layout (actually fruchterman_reingold_layout) function provide no clue as to why NetworkX produces the result that it does.
This is the result of using PyGraphviz to draw the network using the NetworkX spring_layout function (see my own answer to this question below).
5_pygraphviz_plus_networkx.png:
(http://farm9.staticflickr.com/8378/8520231183_e7dfe21ab4.jpg)

Okay, I think I figured it out so I'm going to answer my own question. I don't think it can be done in PyGraphviz per se. However, one can instruct PyGraphviz to take the node positions from NetworkX but peg them (using !) so that the neato program is prevented from actually doing anything except rubber-stamping the node positions calculated by spring_layout. Add the following lines of code to the above:
for k, v in pos1.items():
    # '!' pins the node so that neato keeps the given position
    gvd.get_node(k).attr['pos'] = '{},{}!'.format(v[0]*10, v[1]*10)
gvd.layout(prog='neato')
gvd.draw('5_pygraphviz_plus_networkx.png')
The result is not perfect -- I had to multiply the co-ordinates by 10 in order to stop the nodes from being drawn on top of each other, which is (obviously) a kludge -- but it's an improvement, i.e. the nodes with 0 indegree are on the outside (benefit of laying out with NetworkX) and there are proper arrowheads that don't get swallowed up by the nodes themselves (benefit of drawing with PyGraphviz).
I am aware that this isn't strictly what I asked for, though (i.e. a solution using PyGraphviz / Graphviz itself).
If somebody can provide a better solution I'll be happy!
EDIT: Nobody's provided a better solution to the problem as articulated above, so I'm going to accept my own answer to signal that it actually works. However, I'm also voting up skyebend's answer because - although it doesn't solve the problem - it's a very useful contribution to understanding the underlying issues.

Graphviz also has an fdp and sfdp layout mode for doing force directed placement of nodes which is analogous to a spring layout. I'm not familiar with NetworkX, but it seems gvu.layout(prog='fdp') might work? If NetworkX allows passing additional arguments to Graphviz there are a number of configurable layout parameters you could tweak that may give you a layout closer to what you want. See Graphviz docs: http://www.graphviz.org/content/attrs
However, the fdp layouts treat the network as an undirected graph. Most 'spring' layouts I know of also treat networks as undirected because they must transform them into a Euclidean space (the screen) in which distances are symmetric. One exception would be 'magnetic' spring layouts, which also attempt to align arcs so they point in a similar direction to convey hierarchy, as a sort of neato/dot hybrid.
Algorithm implementations may also differ in how they transform the network distances in a directed network into the undirected weights/distances to be optimized by the layout. You may want to do this step explicitly yourself if you want more control over the way directed arcs are interpreted.
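One way to do that step explicitly is to collapse the directed arcs into undirected weighted edges before handing them to a layout program. This is only a sketch under assumptions: the function name symmetrize, the choice of combining rule and the sample arcs are illustrative and do not describe what any particular layout implementation does internally.

```python
def symmetrize(arcs, combine=sum):
    """Collapse directed arcs into undirected weighted edges.

    arcs: iterable of (source, target, weight).
    combine: how the weights of u->v and v->u are merged (sum, max, min, ...).
    """
    grouped = {}
    for u, v, w in arcs:
        if u == v:
            continue  # self-loops carry no layout information here
        grouped.setdefault(frozenset((u, v)), []).append(w)
    return {tuple(sorted(pair)): combine(ws) for pair, ws in grouped.items()}


arcs = [(1, 2, 1.0), (2, 1, 0.5), (3, 2, 2.0)]
print(symmetrize(arcs))               # reciprocated arcs are summed
print(symmetrize(arcs, combine=max))  # or only the stronger direction counts
```

Choosing sum makes reciprocated ties pull nodes closer together than one-way ties, while max or min treats a pair the same regardless of how many arcs connect it.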

shortest path search in a map represented as 2d shapes

I have a small library of a few shortest path search algorithms. They were developed for simple undirected graphs (the normal representation - vertices and edges). Now I'd like to somehow apply them on a bit different scenario - where the maps are represented as 2-dimensional shapes, connected by shared edges (edges of the polygons, that is). In this scenario, the search can start/end either at a map object or some point (x,y). What would be the best approach? Try to apply the algorithms onto shapes? or try to extract a 'normal' graph out of the shapes (I have preprocessing time available)? Any advice would be much appreciated, as I'm really not sure which way to go, and I don't have enough time (and skill) to explore many options...
Thanks a lot

What's the "path" you're looking for? A list of the shapes to traverse? (Otherwise you just draw a straight line between start+end points.)
It's easy to preprocess it into a format where the shapes are vertices and are connected by edges when the shapes share a polygon side. Then, just pass it off to your existing library to get the answer.
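A minimal sketch of that preprocessing step, assuming each shape is given as an ordered list of vertices and that a shared side has exactly the same two end points in both shapes. The names shape_graph and shortest_shape_path and the three sample squares are made up for this example; the breadth-first search stands in for whatever search routine the existing library provides.

```python
from collections import defaultdict, deque

def shape_graph(shapes):
    """Connect shapes that share a polygon side.

    shapes: name -> list of (x, y) vertices in order around the polygon.
    """
    owners = defaultdict(list)
    for name, pts in shapes.items():
        for p, q in zip(pts, pts[1:] + pts[:1]):
            owners[frozenset((p, q))].append(name)  # side, ignoring direction
    graph = {name: set() for name in shapes}
    for names in owners.values():
        for a in names:
            for b in names:
                if a != b:
                    graph[a].add(b)
    return graph


def shortest_shape_path(graph, start, goal):
    """Breadth-first search: fewest shapes crossed from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        shape = queue.popleft()
        if shape == goal:
            path = []
            while shape is not None:
                path.append(shape)
                shape = previous[shape]
            return path[::-1]
        for nbr in graph[shape]:
            if nbr not in previous:
                previous[nbr] = shape
                queue.append(nbr)
    return None  # goal not reachable


# Three unit squares in a row: A | B | C.
shapes = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(2, 0), (3, 0), (3, 1), (2, 1)],
}
graph = shape_graph(shapes)
print(shortest_shape_path(graph, "A", "C"))
```

For a search that starts or ends at a point (x, y), the point would first have to be mapped to the shape containing it; that lookup is outside this sketch.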

Clustered Graphs Visualization Techniques

I need to visualize a relatively large graph (6K nodes, 8K edges) that has the following properties:
Distinct Clusters. Approximately 50-100 Nodes per cluster and moderate interconnectivity at the cluster level
Minimal (5-10 inter-cluster edges per cluster) interconnectivity between clusters
Let global edge overlap = The edge overlaps caused by directly visualizing a graph of Clusters = {A, B, C, D, E}, Edges = {Pentagram of those clusters, which is non-planar by the way and will definitely generate edge overlap if you draw it out directly}
Let Local Edge Overlap = the above but { A, B, C, D, E } are just nodes.
I need to visualize graphs with the above in a way that satisfies the following requirements
No global edge overlap (i.e. edge overlaps caused by inter-cluster properties is not okay)
Local edge overlap within a cluster is fine
Anyone have thoughts on how to best visualize a graph with the requirements above?
One solution I've come up with to deal with the global edge overlap is to make sure a cluster A can only have a max of 1 direct edge to another cluster (B) during visualization. Any additional inter-cluster edges between cluster A -> C, A -> D, ... are disconnected and additional node/edges A -> A_C, C -> C_A, A -> A_D, D -> D_A... are created.
Anyone have any thoughts?

Prefuse has some good graph drawing algorithms built in and it seems to handle fairly large graphs relatively well. You might try Flow Map Layout, which is built on top of Prefuse.

Given your objectives, I think that the Fruchterman-Reingold algorithm does a pretty decent job of preventing edge overlap. See for example this screenshot of a network consisting of multiple components drawn using the Fruchterman-Reingold algorithm. igraph has built-in support for this algorithm (as does NetworkX, I believe) and is really fast.

There is a program built on top of Prefuse called SocialAction. You have to request the code from the author, but it does a lot of statistical analysis on the graph for you, such as identifying subgraphs. I've used it on a graph with more than 18,000 nodes, and although it is very slow at that scale it still works.

Although it may be silly to ask at this point, have you tried http://www.graphviz.org/ ?

I haven't seen too many graph visualization tools that support separating clusters within a graph visually. One option might be to take a look at WilmaScope. It looks to have some support for cluster based layouts.

Organic layout in the yFiles framework handles clustered graphs fairly well. Try it first in yEd to see if it does what you need. It is probably reasonable to use nested graphs (also called groups) for each cluster. Organic layout has a feature called Group Layout Policy, which can be used if the layout needs to follow different principles for inter-cluster and intra-cluster edges, with incremental layout. With some effort, you can translate the graph into GraphML to avoid manual work.
