Routing Engine Using OpenStreetMap Data

As part of my academic project, I have to build a routing engine based on data supplied from OSM. I have looked at the data model of OSM and I'm all fine with it. However, I'm having trouble converting an OSM XML file into a graph structure (nodes and edges) that I can use to apply search algorithms (Dijkstra, A* etc.) on. I would like the graph to be stored in memory to allow fast read/write.
So can anyone shed some light on how this can be done, suggest techniques, or even provide pointers for further research?
Please note that I'm not allowed to re-use existing routing engines as this would defeat the purpose of doing the project.

All you need to do is:
create a node for every <node> item
every <way> entry is a sequenced list of <nd> items, each of which is a backreference to a node. So for each <way>, you iterate pairwise through its <nd>s and create an arc between the two nodes referenced.
You can do this in one pass using a streaming XML parser, since the XML data defines all the nodes before the ways.
The data doesn't intrinsically include distances, so you need to calculate them from the lat/lon of each node. You should also take account of the road type (highway=*) and the access info (access=*) in your routing, and you probably also want to ignore ways that are not traversable (e.g. waterway=stream), but that's all down to your specific situation.
http://wiki.openstreetmap.org/wiki/Elements
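For concreteness, here is a rough one-pass sketch of that recipe in Python (my own illustration, not part of the answer above). It uses a streaming parser and the haversine formula for edge lengths; keeping only highway=* ways and treating every segment as bidirectional are simplifying assumptions you would refine with one-way, access and road-type handling.

```python
# One-pass OSM XML -> adjacency-list graph with a streaming parser.
# Illustrative assumptions: only highway=* ways are kept, every segment is
# treated as bidirectional, and edge weights are haversine distances in metres.
import math
import xml.etree.ElementTree as ET
from collections import defaultdict

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def build_graph(osm_path):
    coords = {}              # node id -> (lat, lon)
    adj = defaultdict(list)  # node id -> [(neighbour id, distance in metres), ...]
    in_way, way_nodes, way_tags = False, [], {}

    for event, elem in ET.iterparse(osm_path, events=("start", "end")):
        if event == "start" and elem.tag == "way":
            in_way, way_nodes, way_tags = True, [], {}
        elif event == "end":
            if elem.tag == "node":
                coords[elem.get("id")] = (float(elem.get("lat")), float(elem.get("lon")))
            elif elem.tag == "nd" and in_way:
                way_nodes.append(elem.get("ref"))
            elif elem.tag == "tag" and in_way:
                way_tags[elem.get("k")] = elem.get("v")
            elif elem.tag == "way":
                in_way = False
                if "highway" in way_tags:  # skip waterways, buildings, etc.
                    for a, b in zip(way_nodes, way_nodes[1:]):
                        if a in coords and b in coords:  # clipped extracts may miss nodes
                            d = haversine_m(*coords[a], *coords[b])
                            adj[a].append((b, d))
                            adj[b].append((a, d))  # bidirectional for simplicity
            elem.clear()  # keep memory bounded while streaming
    return coords, adj
```

The resulting adjacency list is exactly what Dijkstra or A* needs, and for A* the same haversine function can double as the heuristic.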

Related

Understanding Neo4j Algo write back option

I've been looking into Neo4j graph algorithms and I've seen that a number of algorithms are only available in the write back format while others have both stream and write back implementations. However, I haven't been able to find anything explaining the difference between the two.
So my questions are:
When and why is write back a better implementation than stream? (Basically, what are the advantages and disadvantages of write back?)
How does write back handle graph alterations? (If we add/remove nodes or edges from the graph after running the algorithm, is there any way to tell that the property is now invalid?)
From what I see, all graph algorithms have both the stream and write-back behaviour, except for some that only have the stream variant (pretty much all the path algorithms).
Graph algorithms consume a lot of resources (they work on the entire graph), so if you have a big dataset they can take a long time.
That's why it's really useful to write back the result: it allows you to run Cypher queries based on the output of your graph algorithms.
For your question about invalidation, there is no internal mechanism to do it, but with APOC you can create a trigger to invalidate the result when a node is created/deleted or when you add/remove a relationship.
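To make the stream versus write-back distinction concrete, here is a hedged sketch using the official Neo4j Python driver. The algo.pageRank procedures shown are from the older Graph Algorithms plugin this question refers to, and exact procedure names and signatures depend on your plugin version (current GDS releases use gds.pageRank.stream / gds.pageRank.write); the Page label, LINKS relationship type, URL and credentials are placeholders.

```python
# Hedged sketch: stream vs. write-back, assuming the legacy algo.* procedures.
# Procedure names/signatures vary by plugin version; adapt to your install.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Stream: results are returned to the client and are gone once consumed.
    for record in session.run(
        "CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20}) "
        "YIELD nodeId, score RETURN nodeId, score ORDER BY score DESC LIMIT 10"
    ):
        print(record["nodeId"], record["score"])

    # Write-back: the score is persisted as a node property, so later Cypher
    # queries can reuse it without re-running the (expensive) algorithm.
    session.run(
        "CALL algo.pageRank('Page', 'LINKS', "
        "{iterations: 20, write: true, writeProperty: 'pagerank'})"
    ).consume()
    for record in session.run(
        "MATCH (p:Page) RETURN p.name AS name, p.pagerank AS rank "
        "ORDER BY rank DESC LIMIT 10"
    ):
        print(record["name"], record["rank"])

driver.close()
```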

Practical usage for linked data

I've been reading about linked data and I think I understand the basics of publishing linked data, but I'm trying to find real world practical (and best practise) usage for linked data. Many books and online tutorials talk a lot about RDF and SPARQL but not about dealing with other people's data.
My question is, if I have a project with a bunch of data that I output as RDF, what is the best way to enhance (or correctly use) other people's data?
If I create an application for animals and I want to use data from the BBC wildlife page (http://www.bbc.co.uk/nature/life/Snow_Leopard), what should I do? Crawl the BBC wildlife page for RDF and save the contents to my own triplestore, query the BBC with SPARQL (I'm not sure that this is actually possible with the BBC), or take the URI for my animal (owl:sameAs) and curl the content from the BBC website?
This also asks the question, can you programmatically add linked data? I imagine you would have to crawl the BBC wildlife page unless they provide an index of all the content.
If I wanted to add extra information such as location for these animals (http://www.geonames.org/2950159/berlin.html) again what is considered the best approach? owl:habitat (fake predicate) Brazil? and curl the RDF for Brazil from the geonames site?
I imagine that linking to the original author is the best way because your data can then be kept up to date, which, judging from these slides from a BBC presentation (http://www.slideshare.net/metade/building-linked-data-applications), is what the BBC does. But what if the author's website goes down or is too slow? And if you were to index the author's RDF, I imagine your owl:sameAs would point to a local RDF copy.
Here's one potential way of creating and consuming linked data.
If you are looking for an entity (i.e., a 'Resource' in Linked Data terminology) online, see if there is a Linked Data description of it. One easy place to find this is DBpedia. For Snow Leopard, one URI that you can use is http://dbpedia.org/page/Snow_leopard. As you can see from the page, there are several object and property descriptions. You can use them to create a rich information platform.
You can use SPARQL in two ways. Firstly, you can directly query a SPARQL endpoint on the web where there might be some data. BBC had one for music; I'm not sure if they do for other information. DBpedia can be queried using snorql. Secondly, you can retrieve the data you need from these endpoints and load it into your triplestore using the INSERT and INSERT DATA features of SPARQL 1.1. To access the SPARQL endpoints from your triplestore, you will need to use the SERVICE feature of SPARQL. The second approach protects you from being unable to execute your queries when a publicly available endpoint is down for maintenance.
To programmatically add the data to your triplestore, you can use one of the predesigned libraries. In Python, RDFlib is useful for such applications.
To enrich the data with that sourced from elsewhere, there can again be two approaches. The standard way of doing it is using existing vocabularies. So, you'd have to look for the habitat predicate and just insert this statement:
dbpedia:Snow_leopard prefix:habitat geonames:Berlin .
If no appropriate ontologies are found to contain the property (which is unlikely in this case), one needs to create a new ontology.
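As a concrete (and hedged) illustration of this fetch-and-enrich workflow, here is a small rdflib sketch of my own: it pulls the DBpedia description into a local graph and adds one enrichment triple. The EX.habitat predicate is a made-up placeholder, exactly like the "fake predicate" above, and the GeoNames URI is the RDF form of the Berlin page mentioned in the question.

```python
# Hedged sketch: fetch a Linked Data description with rdflib and enrich it
# locally. EX.habitat is a placeholder predicate, not a real vocabulary term.
from rdflib import Graph, Namespace, URIRef

g = Graph()
# DBpedia serves RDF for resource URIs via content negotiation; if that fails,
# point parse() at the explicit data document, e.g. http://dbpedia.org/data/Snow_leopard.ttl
g.parse("http://dbpedia.org/resource/Snow_leopard")

EX = Namespace("http://example.org/vocab#")  # placeholder vocabulary
snow_leopard = URIRef("http://dbpedia.org/resource/Snow_leopard")
berlin = URIRef("http://sws.geonames.org/2950159/")  # GeoNames RDF URI for Berlin

# Add the enrichment triple to the local copy.
g.add((snow_leopard, EX.habitat, berlin))

print(len(g), "triples in the local graph")
g.serialize(destination="snow_leopard_enriched.ttl", format="turtle")
```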
If you want to keep your information current, then it makes sense to periodically run your queries. Using something such as DBpedia Live is useful in this regard.

Mapping GPS coordinates to an area

I have devices moving across the entire country that report their GPS positions back to me. What I would like to do is have a system that maps these coordinates to a named area.
I see two approaches to this:
Have a database that defines areas as polygons stretching between various GPS coords.
Use some form of webservice that can provide the info for me.
Either will be fine. It doesn't have to be very accurate at all, as I only need to know the region involved so that I know which regional office to call if something goes wrong with the device.
In the first approach, how would you build an SQL table that contained the data? And what would be your approach for matching a GPS coordinate to one of the defined areas? There wouldn't be many areas to define, and they'd be quite large, so manually inputting the values defining the areas wouldn't be a problem.
In the case of the second approach, does anyone know a way of programmatically pulling this info off the web on demand? (I'd probably go for Perl WWW::Mechanize in this case). "close to Somecity" would be enough.
PS: This is not a "do the work for me" kind of question, but more of a brainstorming request. Pseudo-code is fine. General theorizing on the subject is also fine.
In the first approach, how would you build an SQL table that contained the data? And what would be your approach for matching a GPS coordinate to one of the defined areas?
Assume an area is defined as a closed polygon.
You match the GPS coordinate by simply calling a point-in-polygon method, like
boolean isInside = polygon.contains(latitude, longitude);
If you have few polygons you can do a brute-force search through all existing polygons.
If you have many of them, each with (tens of) thousands of points, then you want to use a spatial index, such as a quadtree or k-d tree, to reduce the search to the relevant polygons.
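A minimal sketch of that test in Python (ray casting, standing in for the polygon.contains call above), assuming each area is stored as a list of (lat, lon) vertices loaded from your SQL table; in practice a spatial database type (e.g. PostGIS) or a library such as shapely would do this for you.

```python
# Ray-casting point-in-polygon test, plus a brute-force lookup over a few
# named areas. Polygons are assumed to be lists of (lat, lon) vertices.
def point_in_polygon(lat, lon, polygon):
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        lat_i, lon_i = polygon[i]
        lat_j, lon_j = polygon[j]
        # Does a horizontal ray from (lat, lon) cross the edge (j, i)?
        if (lat_i > lat) != (lat_j > lat):
            lon_cross = lon_j + (lat - lat_j) / (lat_i - lat_j) * (lon_i - lon_j)
            if lon < lon_cross:
                inside = not inside
        j = i
    return inside

def find_area(lat, lon, areas):
    """areas: dict mapping area name -> vertex list; brute force is fine here."""
    for name, polygon in areas.items():
        if point_in_polygon(lat, lon, polygon):
            return name
    return None  # not inside any defined region
```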
This process is called reverse geocoding. Many service providers, such as Google, Yahoo, and Esri, offer services that allow you to do exactly this.
They will return the closest point of interest or address, but you can keep just the administrative level you are interested in.
Check the terms of use to see which service is compatible with your intended usage.
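As one hedged example of such a service (a free alternative to the providers named above), OpenStreetMap's Nominatim exposes a reverse-geocoding endpoint; the sketch below assumes its JSON output and is subject to Nominatim's usage policy and rate limits.

```python
# Hedged sketch: reverse geocoding via OpenStreetMap's Nominatim service.
# Check the usage policy; it is rate-limited and requires an identifying User-Agent.
import requests

def region_for(lat, lon):
    resp = requests.get(
        "https://nominatim.openstreetmap.org/reverse",
        params={"lat": lat, "lon": lon, "format": "jsonv2"},
        headers={"User-Agent": "device-region-lookup-example/0.1"},
        timeout=10,
    )
    resp.raise_for_status()
    address = resp.json().get("address", {})
    # Keep only the administrative level you care about.
    return address.get("state") or address.get("county") or address.get("city")
```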

Working with google maps api

I am trying to build a map-based query interface for my website and I am having difficulty finding a starting point besides http://developer.google.com. I assume this is a rather simple task but I feel as though I am on a wild goose chase. Anyway, the problem is that the existing site places people into a category based on their address (primarily the zip code); this is not working out because of odd shapes and user density, so I would like to solve the problem by creating custom zones.
I am not looking for a proprietary solution because I would really like to accomplish this on my own, I just need some better places to start or better suggestions for searches.
I understand that I will need to create a map with my predetermined polygons.
I understand how to create a map with polygons via js.
I do not understand how the data will determine which zone it is within and how it will be returned as a hash I can store, e.g. user=>####, zone=>####, section=>#####
http://blog.appdelegateinc.com./point-in-polygon-checking-with-google-maps.html has some JS you can add to give the ability to test whether a point is within a polygon (sample: http://blog.appdelegateinc.com./static/samples/point_in_polygon.html) using this approach: http://en.wikipedia.org/wiki/Point_in_polygon#Ray_casting_algorithm
I think as you place the markers, you'll hold them in an array (of objects), then loop through, doing some sort of reduction of which polygons to test and testing those that remain; if inPoly is true, set marker.zone and marker.section to whatever suits your needs.

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy.
So I wrote a script that, for any given date, generates a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant too -- except for the graph-generation part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4j and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-from and valid-to timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st, then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, adding temporal information to the edges to represent the structure of the graph for a given date; you can then traverse the structure of the graph as it was at that time.
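As a hedged sketch of that option (my addition, using the Python driver): put valid_from/valid_to timestamps on each relationship and filter by a reference date at query time. The Contact label, KNOWS relationship type and property names are illustrative, not anything Neo4j prescribes.

```python
# Hedged sketch: query the graph "as of" a reference date by filtering on
# relationship timestamps. ISO date strings compare correctly as plain strings.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def connections_on(session, contact_id, as_of, min_strength):
    query = (
        "MATCH (a:Contact {contact_id: $cid})-[r:KNOWS]-(b:Contact) "
        "WHERE r.valid_from <= $as_of "
        "  AND (r.valid_to IS NULL OR r.valid_to > $as_of) "
        "  AND r.strength >= $min_strength "
        "RETURN b.contact_id AS other, r.strength AS strength"
    )
    return [dict(record) for record in
            session.run(query, cid=contact_id, as_of=as_of, min_strength=min_strength)]

with driver.session() as session:
    # e.g. all connections of strength 3 or better as of the end of January 2011
    jan_2011 = connections_on(session, contact_id=42, as_of="2011-01-31", min_strength=3)

driver.close()
```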
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes themselves also change (e.g. changing properties), you could either duplicate them (effectively creating different subgraphs) or create a linked list of history nodes on each node that contains just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it into memory for the query, and closes it after you get your answer. You could also configure a proxy server and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing and doing some analytics over data sets, it can definitely answer your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
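For completeness, here is a hedged sketch of that 4.x multi-database setup from the Python driver; the database names are placeholders and creating additional databases requires Neo4j Enterprise Edition.

```python
# Hedged sketch: one database per snapshot, selected per session (Neo4j 4.x+,
# Enterprise Edition for multiple databases). Names and credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Administrative commands run against the built-in "system" database.
with driver.session(database="system") as admin:
    admin.run("CREATE DATABASE graph20110101 IF NOT EXISTS").consume()

# Subsequent queries target whichever snapshot you want.
with driver.session(database="graph20110101") as session:
    node_count = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]

driver.close()
```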

Resources