Store graph nodes with edges in relational database - graph

What is the best way to store a graph and its edges in a relational database? I am working on a school project where I need to store cities and the distances between them. Later I need to find the shortest path between point A and point B.
My solution is:
Store the cities in one table:
Nodes(city_Id, city_name, ...)
A second table represents the graph edges, with each pair of cities unique:
Edges(cityA_Id, cityB_Id, distance, time)
Is this a good approach, or is there something better? Thanks.

Your approach is fine. To prevent undirected edges from being recorded twice, I would ensure that cityA_Id < cityB_Id.
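As a rough illustration of the two-table layout with the cityA_Id < cityB_Id rule, here is a minimal sketch using Python's built-in sqlite3 module. The column types, the add_edge and shortest_path helpers, and the sample cities are assumptions made for the example, not part of the question. The shortest path is computed in application code with Dijkstra's algorithm after loading the edges.

```python
import heapq
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nodes (
    city_Id   INTEGER PRIMARY KEY,
    city_name TEXT NOT NULL
);
CREATE TABLE Edges (
    cityA_Id INTEGER NOT NULL REFERENCES Nodes(city_Id),
    cityB_Id INTEGER NOT NULL REFERENCES Nodes(city_Id),
    distance REAL NOT NULL,
    time     REAL,
    PRIMARY KEY (cityA_Id, cityB_Id),
    CHECK (cityA_Id < cityB_Id)  -- one row per undirected edge
);
""")

def add_edge(a, b, distance, time=None):
    """Store an undirected edge with the smaller id first (hypothetical helper)."""
    lo, hi = min(a, b), max(a, b)
    conn.execute("INSERT INTO Edges VALUES (?, ?, ?, ?)", (lo, hi, distance, time))

def shortest_path(source, target):
    """Dijkstra over the Edges table; returns (total_distance, [city ids]) or None."""
    adjacency = {}
    for a, b, d in conn.execute("SELECT cityA_Id, cityB_Id, distance FROM Edges"):
        # Each stored row is usable in both directions.
        adjacency.setdefault(a, []).append((b, d))
        adjacency.setdefault(b, []).append((a, d))
    best = {source: 0.0}
    parent = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]

# Sample data (made up for the example).
conn.executemany("INSERT INTO Nodes VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
add_edge(2, 1, 5.0)   # stored as (1, 2)
add_edge(2, 3, 4.0)
add_edge(1, 3, 20.0)
```

With this data, shortest_path(1, 3) goes through city 2. Inserting the same pair again in either order fails on the primary key, and inserting a row with the larger id first fails on the CHECK constraint.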

Related

Gremlin query to find the entire sub-graph that a specific node is connected in any way to

I am brand new to Gremlin and am using gremlin-python to traverse my graph. The graph is made up of many clusters or sub-graphs which are intra-connected, and not inter-connected with any other cluster in the graph.
A simple example of this is a graph with 5 nodes and 3 edges:
Customer_1 is connected to CreditCard_A with 1_HasCreditCard_A edge
Customer_2 is connected to CreditCard_B with 2_HasCreditCard_B edge
Customer_3 is connected to CreditCard_A with 3_HasCreditCard_A edge
I want a query that will return a sub-graph object of all nodes and edges connected (in or out) to the queried node. I can then store this sub-graph as a variable and then run different traversals on it to calculate different things.
This query would need to be recursive as these clusters could be made up of nodes which are many (inward or outward) hops away from each other. There are also many different types of nodes and edges, and they all must be returned.
For example:
If I specified Customer_1 in the query, the resulting sub-graph would contain Customer_1, Customer_3, CreditCard_A, 1_HasCreditCard_A, and 3_HasCreditCard_A.
If I specified Customer_2, the returned sub-graph would consist of Customer_2, CreditCard_B, 2_HasCreditCard_B.
If I queried Customer_3, the exact same subgraph object as returned from the Customer_1 query would be returned.
I have used both Neo4J with Cypher and Dgraph with GraphQL and found this task quite easy in these two languages, but am struggling a bit more with understanding Gremlin.
EDIT:
From this question, the selected answer should achieve what I want if I drop the edge type by changing .both('created') to just .both().
However, the loop syntax: .loop{true}{true} is invalid in Python of course. Is this loop function available in gremlin-python? I cannot find anything.
EDIT 2:
I have tried this and it seems to be working as expected, I think.
g.V(node_id).repeat(bothE().otherV().simplePath()).emit()
Is this a valid solution to what I am looking for? Is it also possible to include the queried node in this result?
Regarding the second edit, this looks like a valid solution that returns all the vertices connected to the starting vertex.
Some small fixes:
You can change bothE().otherV() to both().
To also get the starting vertex, move the emit step before the repeat.
I would add a dedup step to remove duplicate vertices (there can be more than one path to a vertex).
g.V(node_id).emit().repeat(both().simplePath()).dedup()
Example: https://gremlify.com/jngpuy3dwg9
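To make it concrete what that traversal collects, here is a plain-Python sketch (not gremlin-python, and no graph server involved) that walks the example graph from the question in both directions and gathers every reachable vertex. It also records the edge labels it crosses, since the question asked for edges as well. The EDGES list and the connected_subgraph function are names made up for this illustration.

```python
from collections import deque

# Example graph from the question: (edge label, out vertex, in vertex).
EDGES = [
    ("1_HasCreditCard_A", "Customer_1", "CreditCard_A"),
    ("2_HasCreditCard_B", "Customer_2", "CreditCard_B"),
    ("3_HasCreditCard_A", "Customer_3", "CreditCard_A"),
]

def connected_subgraph(start, edges=EDGES):
    """Return (vertices, edge labels) reachable from start, ignoring edge direction."""
    neighbors = {}
    for label, out_v, in_v in edges:
        # Index every edge from both ends, like both() does.
        neighbors.setdefault(out_v, []).append((label, in_v))
        neighbors.setdefault(in_v, []).append((label, out_v))
    seen_vertices = {start}  # the start vertex is included, like emit() before repeat()
    seen_edges = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for label, other in neighbors.get(vertex, []):
            seen_edges.add(label)
            if other not in seen_vertices:  # plays the role of dedup()
                seen_vertices.add(other)
                queue.append(other)
    return seen_vertices, seen_edges
```

Calling connected_subgraph("Customer_1") and connected_subgraph("Customer_3") gives the same cluster, while "Customer_2" gives the other one, which matches the behaviour described in the question.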

Gremlin query to get in and out edges for a given Vertex

I’m just playing with the Graph API in Cosmos DB, which uses the Gremlin syntax for queries.
I have a number of users (vertices) in the graph and each has ‘knows’ edges to other users. Some of these are out edges (outE) and others are in edges (inE), depending on how the relationship was created.
I’m now trying to create a query which will return all ‘knows’ relationships for a given user (Vertex).
I can easily get the ID of either inE or outE via:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').inE('knows')
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').outE('knows')
where '7112138f-fae6-4272-92d8-4f42e331b5e1' is the Id of the user I’m querying, but I don’t know ahead of time whether this is an in or out edge, so want to get both (e.g. if the user has in and out edges with the ‘knows’ label).
I’ve tried using a projection and OR operator and various combinations of things e.g.:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').where(outE('knows').or().inE('knows'))
but it’s not getting me back the data I want.
All I want out is a list of the IDs of all inE and outE that have the label ‘knows’ for a given vertex.
Or is there a simpler/better way to model bi-directional associations such as ‘knows’ or ‘friendOf’?
Thanks
You can use the bothE step in this case:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').bothE('knows')

How to reduce the number of same edge label between two Vertex in Titan

Let's say we have two types of vertex, LOGIN_USER (property: user_id) and IP (property: ip), and the edge between them is LOGIN (properties: session_id, login_time).
The problem with this model is that there are too many edges between one USER and one IP (there can be thousands).
Is there any way to reduce the number of edges between the two vertices while still keeping the session_id and login_time properties? We want to filter on these two properties in some queries.
Edge properties do not support cardinality list, which vertex properties do.
If we put all the edge properties into the vertex, does that affect the performance of fetching the vertex?
When does Titan load the properties of a vertex? When traversing to a vertex, say with g.V(1).next(), does Titan load all of the vertex's properties?
When you say "thousands" of edges between USER and IP, do you think it could actually be "millions" or "tens of millions" or more? If not, then "thousands" should not be a problem for Titan with vertex centric indices. Index your edge properties and you should have fast ordering and traversals.
When you start to get deep into "millions", you might start to experience some problems - for me that has always been with processing global queries with titan-hadoop as the Vertex and its edges must be held in memory. That can make for some trouble spots when you're doing global analytics. From an operational perspective, Titan was always happy to keep writing edges into the millions on a vertex, but I'd tend to avoid it. Of course, much of my experience with this came before vertex cutting in Titan 1.0:
Cutting a vertex means storing a subset of that vertex’s adjacency
list on each partition in the graph. In other words, the vertex and
its adjacency list is partitioned thereby effectively distributing the
load on that single vertex across all of the instances in the cluster
and removing the hot spot.
which you might experiment with as you start to grow supernodes into the millions.
I suppose the other option for supernodes in the millions of edges would be to model around it. Perhaps you introduce some structure between USER and IP. Convert that single LOGIN edge to some vertices/edges that might introduce a time concept between them like:
USER -> LOGIN_YEAR -> LOGIN_MONTH -> IP
So now, instead of creating just one edge between USER and IP you create a LOGIN_YEAR vertex and a LOGIN_MONTH vertex.
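A minimal sketch of that restructuring in plain Python (not Titan API code) is below. It assumes the LOGIN_YEAR and LOGIN_MONTH vertices are created per user, which the answer does not spell out. The build_bucketed_graph and logins_in_month names and the sample records are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

def build_bucketed_graph(logins):
    """Turn (user_id, ip, session_id, login_time) records into
    USER -> LOGIN_YEAR -> LOGIN_MONTH -> IP edges."""
    structural = set()               # USER->YEAR and YEAR->MONTH edges, one of each
    login_edges = defaultdict(list)  # (month vertex, ip vertex) -> list of edge properties
    for user_id, ip, session_id, login_time in logins:
        user = ("USER", user_id)
        # Assumption: year and month vertices are scoped to a single user.
        year = ("LOGIN_YEAR", user_id, login_time.year)
        month = ("LOGIN_MONTH", user_id, login_time.year, login_time.month)
        structural.add((user, year))
        structural.add((year, month))
        login_edges[(month, ("IP", ip))].append(
            {"session_id": session_id, "login_time": login_time}
        )
    return structural, login_edges

def logins_in_month(login_edges, user_id, year, month):
    """Collect the login properties hanging off one LOGIN_MONTH vertex."""
    wanted = ("LOGIN_MONTH", user_id, year, month)
    return [props
            for (month_vertex, _ip), props_list in login_edges.items()
            if month_vertex == wanted
            for props in props_list]

# Made-up sample data.
sample = [
    ("u1", "10.0.0.1", "s1", datetime(2015, 1, 5)),
    ("u1", "10.0.0.1", "s2", datetime(2015, 1, 20)),
    ("u1", "10.0.0.1", "s3", datetime(2015, 2, 1)),
]
structural, login_edges = build_bucketed_graph(sample)
```

The point of the sketch is that the USER vertex only fans out to one edge per year, and a query restricted to a month only touches the logins under that month vertex. The session_id and login_time properties stay on the edges into IP.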

Hadoop MapReduce implementation of shortest PATH in a graph, not just the distance

I have been looking for "MapReduce implementation of Shortest path search algorithms".
However, all the instances I could find computed the shortest distance from node x to y, and none actually output the shortest path itself, like x-a-b-c-y.
What I am trying to achieve is this: I have graphs with hundreds of thousands of nodes and I need to perform frequent pattern analysis on the shortest paths among the various nodes. This is for a research project I am working on.
It would be a great help if someone could point me to an implementation (if one exists) or give some pointers on how to hack the existing SSSP implementations to generate the paths along with the distances.
Basically these implementations work with some kind of messaging, so messages are sent to HDFS between the map and reduce stages.
In the reducer they are grouped and filtered by distance: the lowest distance wins. When the distance is updated, you also have to record the vertex (well, probably some ID) that the message came from.
So you have an additional space requirement per vertex, but you can reconstruct every possible shortest path in the graph.
Based on your comment:
yes probably
I will need to write another class of the vertex object to hold this
additional information. Thanks for the tip, though it would be very
helpful if you could point out where and when I can capture this
information of where the minimum weight came from, anything from your blog maybe :-)
Yea, could be a quite cool theme, also for Apache Hama. Most of the implementations are just considering the costs not the real path. In your case (from the blog you've linked above) you will have to extract a vertex class which actually holds the adjacent vertices as LongWritable (maybe a list instead of this split on the text object) and simply add a parent or source id as field (of course also LongWritable).
You will set this when propagating in the mapper, that is the for loop that is looping over the adjacent vertices of the current key node.
In the reducer you update the lowest distance somewhere while iterating over the grouped values. At that point you have to set the source vertex of the key vertex to the vertex whose message produced the new minimum.
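The mapper and reducer roles described above can be sketched in plain Python, simulating the map, group and reduce phases in memory instead of on Hadoop. This is only an illustration of the parent-tracking idea, not the code from the blog. The state layout, the function names and the sample graph are assumptions, and it expects positive edge weights and every neighbor to appear as a key in the graph.

```python
INF = float("inf")

def mapper(vertex_id, state):
    """Emit the vertex itself plus one (distance, sender) message per neighbor."""
    yield vertex_id, ("VERTEX", state)
    if state["distance"] < INF:
        for neighbor, weight in state["adjacent"]:
            # The sender id travels with the distance so the reducer can keep it.
            yield neighbor, ("MSG", state["distance"] + weight, vertex_id)

def reducer(vertex_id, values):
    """Keep the lowest distance and remember which vertex sent it."""
    state = None
    best_distance, best_parent = INF, None
    for value in values:
        if value[0] == "VERTEX":
            state = dict(value[1])
        elif value[1] < best_distance:
            best_distance, best_parent = value[1], value[2]
    changed = best_distance < state["distance"]
    if changed:
        state["distance"], state["parent"] = best_distance, best_parent
    return state, changed

def run_sssp(graph, source):
    """Repeat map/group/reduce rounds until no distance improves."""
    states = {v: {"adjacent": adj, "distance": 0 if v == source else INF, "parent": None}
              for v, adj in graph.items()}
    while True:
        grouped = {}
        for vertex_id, state in states.items():
            for key, value in mapper(vertex_id, state):
                grouped.setdefault(key, []).append(value)
        any_changed = False
        for vertex_id, values in grouped.items():
            states[vertex_id], changed = reducer(vertex_id, values)
            any_changed = any_changed or changed
        if not any_changed:
            return states

def path_to(states, target):
    """Follow parent ids back to the source to rebuild the path."""
    if states[target]["distance"] == INF:
        return None
    path = [target]
    while states[path[-1]]["parent"] is not None:
        path.append(states[path[-1]]["parent"])
    return path[::-1]

# Made-up directed graph: vertex -> [(neighbor, weight), ...]
graph = {"x": [("a", 1), ("y", 10)], "a": [("b", 1)], "b": [("c", 1)],
         "c": [("y", 1)], "y": []}
states = run_sssp(graph, "x")
```

After the run, path_to(states, "y") walks the stored parent ids backwards, which is the path reconstruction the extra field per vertex buys you.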
You can actually get some of the vertices classes here from my blog:
Source
or directly from the repository:
Source
Maybe it helps you, it is quite unmaintained so please come back to me if you have a specific question.
Here is the same algorithm in BSP with Apache Hama:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/SSSP.java

How to model airport/flight data in a graph database like neo4j

I need to model airline flight data in a graph database (I am specifically working with neo4j, though I will consider others if that becomes problematic). My question is more about how to model this data in a way that will ease traversal and discovery of different flight options. A few specific examples of the type of data I would like to both store and later query:
1) A direct flight scenario like JFK->LAX. Seems straightforward, simple two node relationship. But there are many flights that may be of interest between these two nodes. So, if I need to store individual flight detail, is that best in an array on the relationship between the JFK and LAX nodes?
2) A flight scenario with multiple stops, like JFK->LAX->SAN. In this scenario, it seems like modeling the relationship between the three nodes may be of limited utility if I'm only interested in the departure and arrival cities? i.e. I could have a relationship from JFK->SAN and the fact that there is a layover in LAX could be a property on that relationship?
If I need to query or traverse the graph based on arrays of data in relationships between nodes, and those arrays become large (e.g. 100 different flights between JFK and LAX), will that introduce performance or scalability problems?
Hopefully this question isn't too open-ended - I'm just trying to avoid building something that works for a small example model with ~5 nodes but can't scale to hundreds of airports and tens of thousands of flights.
Hundreds of airports and tens of thousands of flights is still a very small data set and I'd be surprised if that would be a problem in neo4j.
Off the top of my head you could perhaps have all the airports as their own nodes and each route could be its own node with relationships to all the airports it touches, maybe with an "order" property on each relationship which is local to the route.
          (ROUTE1)-----------
           /     \           \
  *order=1/       \*order=2   \*order=3
         v         v           v
       (JFK)     (LAX)       (SAN)
I'm sure there are better solutions.
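To show how the route-node idea could be queried, here is a small plain-Python sketch (not Cypher or the neo4j driver). Each tuple stands for one route-to-airport relationship with its "order" property. ROUTE2, the STOPS list and the routes_between function are made up for the example, and it assumes an airport appears at most once per route.

```python
# Each entry is one (route, airport, order) relationship, as in the diagram.
STOPS = [
    ("ROUTE1", "JFK", 1),
    ("ROUTE1", "LAX", 2),
    ("ROUTE1", "SAN", 3),
    ("ROUTE2", "JFK", 1),  # hypothetical direct route
    ("ROUTE2", "SAN", 2),
]

def routes_between(origin, destination, stops=STOPS):
    """Return {route: [airports from origin to destination]} for every
    route that visits origin before destination."""
    by_route = {}
    for route, airport, order in stops:
        by_route.setdefault(route, {})[airport] = order
    result = {}
    for route, orders in by_route.items():
        if origin in orders and destination in orders and orders[origin] < orders[destination]:
            # Keep only the stops between origin and destination, in order.
            leg = [airport
                   for airport, order in sorted(orders.items(), key=lambda item: item[1])
                   if orders[origin] <= order <= orders[destination]]
            result[route] = leg
    return result
```

Here routes_between("JFK", "SAN") finds both the one-stop route and the direct one, so layovers fall out of the "order" values rather than needing a separate JFK->SAN relationship.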
Check out Neo4J's contribution page
One of the winners of their contest was a gist describing US flights and airports; it is very well done.
These links may be useful for you: http://maxdemarzi.com/?s=flights, http://gist.neo4j.org/?6619085
