I'm trying to wrap my mind around graph data right now, and I'm finding it difficult to think in terms of property graphs. On the vertex-centric indices docs page, there is an example involving Twitter data. The Gremlin code is:
g = TitanFactory.open(conf)
// graph schema construction
g.makeKey('name').dataType(String.class).indexed(Vertex.class).make()
time = g.makeKey('time').dataType(Long.class).make()
if(useVertexCentricIndices)
  g.makeLabel('tweets').sortKey(time).make()
else
  g.makeLabel('tweets').make()
g.commit()
// graph instance construction
g.addVertex([name:'v1000']);
g.addVertex([name:'v10000']);
g.addVertex([name:'v100000']);
g.addVertex([name:'v1000000']);
for(i=1000; i<1000001; i=i*10) {
  v = g.V('name','v' + i).next();
  (1..i).each {
    v.addEdge('tweets',g.addVertex(),[time:it])
    if(it % 10000 == 0) g.commit()
  }; g.commit()
}
The explanation is that each edge represents someone tweeting a tweet vertex. This doesn't make sense to me as a schema. Why should any two nodes be connected? If the answer is that the edge connects different tweets that a user has tweeted, then one edge connects more than one node. This would mean that Titan is a hypergraph, which I thought it wasn't.
In short, can someone explain this example better than the docs?
The example in the wiki is a bit over-simplified and is designed to convey the concept of vertex-centric indices. On its own, it might not be the best thing to use for understanding how to model a schema in general. That said, I think the model still makes basic sense (at least in that light).
If the answer is that the edge connects different tweets that a user
has tweeted, then one edge connects more than one node.
I'm not sure where you see that in the code. I see 4 user vertices who are doing the tweeting (v1000, v10000, etc). The for loop iterates each user and adds tweet edges for each. On each creation of an edge a new vertex is created to represent the tweet. Perhaps I'm misunderstanding you but in that sense an edge does not connect more than two vertices. It only connects from user vertex into tweet vertex.
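To make the shape of that model concrete, here is a minimal in-memory sketch in Python. It is not Titan's API: the class `TinyGraph` and its methods are hypothetical names made up for illustration. It shows that every 'tweets' edge joins exactly two vertices (one user, one tweet), and why keeping a vertex's edges sorted by time, which is roughly the idea behind a sort key, makes time-range lookups cheap.

```python
import bisect
from collections import defaultdict


class TinyGraph:
    """Hypothetical stand-in for the example schema; not Titan's API."""

    def __init__(self):
        self.next_id = 0
        # user vertex id -> list of (time, tweet vertex id), kept sorted by time
        self.tweets = defaultdict(list)

    def add_vertex(self):
        self.next_id += 1
        return self.next_id

    def add_tweet_edge(self, user, tweet, time):
        # One edge joins exactly two vertices: user -> tweet.
        bisect.insort(self.tweets[user], (time, tweet))

    def tweets_between(self, user, start, end):
        # The per-vertex edge list is sorted on time, so a range lookup
        # is two binary searches instead of a scan over every edge.
        edges = self.tweets[user]
        lo = bisect.bisect_left(edges, (start, 0))
        hi = bisect.bisect_right(edges, (end, float("inf")))
        return [tweet for _, tweet in edges[lo:hi]]


g = TinyGraph()
user = g.add_vertex()
for t in range(1, 1001):
    # Each tweet is its own new vertex, attached to the user by one edge.
    g.add_tweet_edge(user, g.add_vertex(), t)

recent = g.tweets_between(user, 991, 1000)
```

A user with many tweets is simply one vertex with many outgoing edges, each ending at a different tweet vertex, so no hyperedges are needed.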
Context:
I have a graph with about 2,000 vertices and 6,000 edges; over time this might grow to 10,000 vertices and 100,000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if it is found, update its model property; otherwise create a new vertex and set the model property. This is issued twice: once for the source vertex and once for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is already an edge between the two vertices, the model property on the edge is updated; otherwise the edge is created.
The pseudocode that does this is roughly as follows:
for each edge in the list of new edges:
    // upsert source and target vertices:
    execute queryVertex for edge.source
    execute queryVertex for edge.target
    // upsert edge:
    execute queryEdge
This works, but it is highly inefficient: for the graph size mentioned it takes several minutes to finish, and in-app concurrency reduces that by only a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might see some optimizations to the approach you described. From a Gremlin perspective, I imagine you would see some savings by submitting a single Gremlin request per edge, combining your existing traversals:
g.V().has(label, name, foo).fold().
  coalesce(unfold(),
           addV(label).property(name, foo)).
  property(model, 2).as('source').
  V().has(label, name, bar).fold().
  coalesce(unfold(),
           addV(label).property(name, bar)).
  property(model, 2).as('target').
  coalesce(inE(edge_label).where(outV().as('source')),
           addE(edge_label).from('source').to('target')).
  property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. If you continue with Gremlin-style bulk loading, you might try other tactics as well, such as ordering your edges so that you can batch together more edge loads to reduce the number of vertex lookups, and submitting vertex/edge data in a more dynamic fashion as described here.
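As a rough illustration of the ordering and batching idea, here is a client-side sketch in Python. It makes no Gremlin or Neptune calls; the function name `plan_batches` and the `(source, target)` edge tuples are assumptions for the example. It only plans the work so that each vertex is upserted once rather than once per edge that touches it.

```python
from typing import Iterable, Iterator

Edge = tuple[str, str]  # (source name, target name); hypothetical shape


def plan_batches(
    edges: Iterable[Edge], batch_size: int
) -> Iterator[tuple[list[str], list[Edge]]]:
    """Yield (new_vertices, edges) batches.

    Each vertex name appears in new_vertices exactly once across all
    batches, so the vertex upsert runs once per vertex instead of once
    per edge endpoint.
    """
    seen: set[str] = set()
    batch: list[Edge] = []
    fresh: list[str] = []
    # Sorting groups edges that share a source vertex next to each other.
    for source, target in sorted(edges):
        for name in (source, target):
            if name not in seen:
                seen.add(name)
                fresh.append(name)
        batch.append((source, target))
        if len(batch) == batch_size:
            yield fresh, batch
            batch, fresh = [], []
    if batch:
        yield fresh, batch


new_edges = [("b", "c"), ("a", "b"), ("a", "c"), ("c", "d")]
batches = list(plan_batches(new_edges, batch_size=2))
```

Each batch could then be turned into one request: upsert the listed new vertices first, then the edges. How large a batch the server handles comfortably is something to measure rather than assume.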
I am planning to use ArangoDB and I am faced with a problem I don't know how to solve. I would like to do simple traversals, but in my case there are two requirements that I don't know how to meet:
1. I will not know in advance the type of vertices that an edge will connect to. I want to be able to connect an edge of one type to any vertex on either side.
2. For one vertex, I want to retrieve all connected vertices (depth 1) no matter the edge type.
For requirement 1, an example would be a Tag vertex (to tag some entity with some information): I want to be able to tag any vertex using, e.g., a HasTag edge in a named graph. From what I currently see, I need to define the "From" collections (the "To" collection is the Tag collection), and this is limited to 10 collections. Since I could have 100 or more From collections, I don't see how to solve this with named graphs.
One option would be to use anonymous graphs, but then I have a problem with the second requirement. Given a vertex, I also want to be able to find all connected vertices (depth = 1) no matter the type of the edge. In an anonymous graph I would need to specify all of the edge collections in the query and, again, there could be 100 or more of them. I don't know if there is a limit to this number, but I would assume there is one; maybe I'm mistaken, since I haven't tried it out yet.
Does anyone have an idea how to solve this with ArangoDB? I really like the database, but I would like it to be more "typeless", that is, I would rather not have to define the types of vertex collection an edge can connect to.
Best regards
Tomaz
You can have more than 10 vertex collections in a named graph. The limit of 10 exists only in the web UI; creating the named graph from the ArangoShell or the server console will work.
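If you prefer to script it, the edge definition can be generated rather than typed out. The Python sketch below only builds the request body; it assumes the `name` / `edgeDefinitions` / `collection` / `from` / `to` shape used by ArangoDB's graph management HTTP API, so check that against the documentation for your version. The collection names (`Entity0` ... `Entity99`, `HasTag`, `Tag`) and the graph name are made up for the example.

```python
import json


def edge_definition(edge_collection, from_collections, to_collections):
    """Build one edge definition (assumed shape, see the note above)."""
    return {
        "collection": edge_collection,
        "from": sorted(from_collections),
        "to": sorted(to_collections),
    }


# Hypothetical: 100 vertex collections that may all carry tags.
from_collections = [f"Entity{i}" for i in range(100)]

graph_body = {
    "name": "tagging",  # hypothetical graph name
    "edgeDefinitions": [
        edge_definition("HasTag", from_collections, ["Tag"]),
    ],
}

# JSON text that could be sent when creating the named graph.
payload = json.dumps(graph_body)
```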
I need help writing a resilient mapping (graph-building) algorithm. Here is the problem:
Imagine you have a text-oriented virtual reality (TORG/MUD) where you can send movement commands (n, s, w, e, up, down, in, out, etc.) through telnet to move your character from room to room. The server sends back the corresponding room description after each movement step. Your task is to generate a graph that represents the underlying map structure, so that you can simply do a DFS on the client side to figure out how to get from one room to another. You also want to design the system so that minimal user input is required.
Here are the assumptions:
The underlying graph topology on the server never changes.
Room descriptions are not unique. Most of the rooms have distinct descriptions, but some of the rooms have exactly the same description. Room descriptions change slightly once in a while (days or weeks).
Your movement may fail randomly with a small probability, in which case you will get an error message instead of the new room description, such as "You stop to wait for the wagon to pass" or "The door is locked", and your character will still be in the current room.
You cannot assume a unit spatial distance for each movement. For example, you may have a topology like the one shown below, so assuming unit distance between neighboring rooms and assigning a hard coordinate to each room is not going to work. However, you may assume that relative directions are consistent, that is, there will be no loop in a topological sort along X (west, east) or Y (south, north).
Objective: given a destination that you have visited before, the algorithm guarantees to eventually move you to that location, and will find the shortest path most of the time. Mistakes are allowed, but the algorithm should be able to detect and correct the mistakes on the fly.
Example graph:
A--B---B
|      | <- k
C--D-E-F
I have already implemented a very simple solution that records the room descriptions and constructs a graph. The following is an example of the graph representation my program generates, in JSON. The "exits" map each movement direction to a node id; -1 represents an unmapped room. If the user walks in a direction and the algorithm finds a -1 in the graph representation, it looks for nodes already in the graph with the same description. If any are found, it prompts the user to decide whether the newly seen room is one of the old nodes. If not, it adds a new node and connects it to the graph.
"node": [
  {
    "description": "You are standing in the heart of the Example city. There is a fountain with large marble statue in it...",
    "exits": {
      "east": -1,
      "north": 31,
      "south": 574,
      "west": 42
    },
    "id": 0,
    "name": "cot",
    "tags": [],
    "title": "Center of Town",
    "title_color": "\u001b[1m\u001b[36m Center of Town\u001b[0;37;40m"
  },
  {
    ...
This simple solution requires human input to detect loops when building the graph. For example, in the graph shown above, assume that the same letter represents the same room description. If you start mapping at the first B and go left, down, right... until you perform movement k, you now see B again, but the mapper cannot determine whether it is the B it has seen before.
In short, I want to write a resilient graph-building algorithm that takes a walk (possibly infinite) in a hidden target graph and generates (and keeps updating) a graph that is, hopefully, as similar as possible to the target graph. We then use the generated graph to help navigate in the target graph. Is there an existing algorithm for this category of problems?
I also thought about applying some machine learning techniques to this problem, but I am unable to write out a concrete model. I am thinking along the lines of defining a list of features for each room we see (room description, exits, neighboring nodes); each time we see a room, we attempt to find the graph node that best fits the features and, using some update rule (like Winnow or Perceptron), update the stored description based on some mistake-detection metric.
Any thoughts/suggestions would be very much appreciated!
Many MU*s will give you a way to get a unique identifier for rooms. (On MUSH and its offshoots, that’s think %L.) Others might be set up to describe rooms you’ve already been to in an abbreviated form. If not, you need some way to determine whether you have been in a room before. A simple way would be to compute a hash of enough information about each room to get a unique key. However, a maze might be deliberately set up to trick you into thinking you’re in a different location. Wizardry in particular was designed to make old-school players mapping the dungeon by hand tear their hair out when they realized their map had to be wrong, and the Zork series had a puzzle where objects you dropped to mark your place in the maze would get moved around while you were elsewhere. In practice, coding your tool to solve these puzzles is unlikely to be worth it.
You should be able to memoize a table of all-pairs shortest paths and update it to match the subgraph you’ve explored whenever you search a new exit. This could be implemented as an N×N table where row i, column j tells you the next step on the shortest path from location i to location j. For a directed graph, even running Dijkstra’s algorithm each time should normally suffice, but in practice each move adds one room to the map and doesn’t add a shorter path between many other rooms. You would want to automatically map connections between rooms you’ve already been to (unless they’re supposed to be hidden) and not force the explorer to tediously walk through each individual exit and back to see where it goes.
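A minimal sketch of such a next-step table, in Python, assuming the map is stored the way the question's JSON stores it (room id mapped to an "exits" dictionary, with -1 for an unmapped exit). The function names are made up for the example. Since every move costs the same, it uses a breadth-first search from each room instead of Dijkstra, and it rebuilds the whole table rather than updating it incrementally, which is the simplest thing that works on a small map.

```python
from collections import deque

UNMAPPED = -1  # same marker the question's JSON uses


def next_step_table(exits):
    """exits: {room_id: {direction: room_id or -1}}.

    Returns {(src, dst): direction}, the first move on a shortest
    path from src to dst, for every dst reachable from src.
    """
    table = {}
    for src in exits:
        # BFS from src, remembering the first move that led to each room.
        first_move = {src: None}
        queue = deque([src])
        while queue:
            room = queue.popleft()
            for direction, nxt in sorted(exits.get(room, {}).items()):
                if nxt == UNMAPPED or nxt in first_move:
                    continue
                first_move[nxt] = direction if room == src else first_move[room]
                queue.append(nxt)
        for dst, move in first_move.items():
            if dst != src:
                table[(src, dst)] = move
    return table


def route(table, exits, src, dst):
    """Follow the table one step at a time and return the moves."""
    moves = []
    while src != dst:
        move = table[(src, dst)]
        moves.append(move)
        src = exits[src][move]
    return moves


# Tiny made-up map: three rooms in a row, one unexplored exit.
rooms = {
    0: {"east": 1},
    1: {"west": 0, "east": 2, "south": UNMAPPED},
    2: {"west": 1},
}
table = next_step_table(rooms)
```

Here `route(table, rooms, 0, 2)` walks east twice. After exploring a new exit you would update `rooms` and rebuild (or patch) the table.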
If you can design the map, you can also lay out the areas so that they’re easy to navigate between, and then you can keep your tables small: each only needs to contain maps of individual areas you’ve deliberately laid out as mazes to explore. That is, if you want to go from one dungeon to another, you just need to look up the nearest exit, and then directions between the dungeons on the world map, not one huge table that grows quadratically with the number of locations in the entire game. For example, if you’ve laid out the world as a nested hierarchy where a room is in a building on a street in a neighborhood of a city in a region of a country on a continent on a planet, you can just store locations as a tree, and navigating from one area to the others is just a matter of walking up the tree until you reach a branch on the path to your destination.
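The tree idea can be sketched in a few lines of Python. This assumes every location has at most one parent and that everything hangs off a single root; the place names and the `parent` dictionary are invented for the example.

```python
def path_to_root(parent, node):
    """Return [node, its parent, its grandparent, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_route(parent, src, dst):
    """Walk up from src to the nearest shared ancestor, then down to dst."""
    up = path_to_root(parent, src)
    down = path_to_root(parent, dst)
    ancestors = set(up)
    # Nearest ancestor of dst that is also an ancestor of src.
    meet = next(n for n in down if n in ancestors)
    return up[: up.index(meet) + 1] + list(reversed(down[: down.index(meet)]))


# Hypothetical hierarchy: room -> street -> neighborhood -> city.
parent = {
    "tavern": "main street",
    "main street": "old town",
    "smithy": "market row",
    "market row": "old town",
    "old town": "city",
}
trip = tree_route(parent, "tavern", "smithy")
```

Only the parent links are stored, so memory grows linearly with the number of locations instead of quadratically.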
I’m not sure how machine learning with neural networks or such would be helpful here; if the kind of trick you’re looking out for is the possibility that a room that appears to be the same as one you’ve seen before is really a duplicate, the way to handle that would be to maintain multiple possible maps at once on the assumption that apparently-identical rooms are or are not duplicates, a garden of forked paths.
I have been tinkering with Graphs for some time, with the objective that I implement appropriate portions of the server-side stack using them. I have used Scala-Graph and Neo4J, and I am learning Spark GraphX. In almost all the applications I have implemented, the model has been that of a Property Graph (Node -> Edge -> Node, with attributes).
When designing the graph (DAGs to be precise), if I spot a strong and directed relationship between two nodes, I set up an edge from one node to the other. This is obvious and intuitive. If a Person likes a Site, an edge with property 'likes' connects them. Thus:
[Nirmalya] -- (Likes) --> [StackOverFlow]
[John] -- (Likes) --> [StackOverFlow]
[Ted] -- (Likes) --> [GoogleGroups ]
[Nirmalya] -- (Likes) --> [Neo4J]
Now, using outgoing edges, I can easily find out which sites Nirmalya likes.
But when I want to find out who else likes what Nirmalya likes (i.e., John), I tend to think that I should also create an edge from the Site-type Node to the Person-type Node (with property 'isLikedBy'), so that the path is obvious and the traversal is intuitive. Every Person and Site would then be connected in both directions, so that I can reach either one from the other to answer queries like this one.
[Nirmalya] -- (Likes) --> [StackOverFlow] -- (IsLikedBy) --> [John]
But from many examples given by experts, I see that this is not prescribed. Instead, this is achieved by making use of operators like incoming. In other words, if two Nodes have an edge set up between them, I don't need to set both the directions of the edge explicitly (just 'likes' is sufficient, 'isLikedBy' is superfluous). Implementation of adjacency matrix makes this possible perhaps but I get a bit confused because I am being allowed to derive a contra-direction even when that direction is not explicit in the DAG.
My question is where is the gap in my understanding? Is it that 'IsLikedBy' direction should ideally be present, but we are optimizing? Alternatively, is it that there can be UseCases where such bidirectional edges are necessary and I need to spot them? Am I completely missing a theoretical underpinning?
I will be glad to become wiser.
I think it depends on the software. I can speak for Neo4j, but not for the other tools that you mentioned ;)
In Neo4j, relationships are designed to be traversable both forwards and backwards without a performance cost. This applies both to traversing with the Java APIs and to using Cypher. You can query by specifying a direction (incoming or outgoing) or query for relationships without regard to direction, and the performance characteristics should be the same.
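To see why a second 'isLikedBy' edge is not needed, here is a small Python illustration. It is only a sketch of the general idea (one logical edge that can be looked up from either endpoint) and not a description of how Neo4j stores relationships internally; the class and method names are invented.

```python
from collections import defaultdict


class LikesGraph:
    """One directed 'likes' edge per pair, indexed at both endpoints."""

    def __init__(self):
        self.out_edges = defaultdict(set)  # person -> sites they like
        self.in_edges = defaultdict(set)   # site -> people who like it

    def add_like(self, person, site):
        # A single logical edge; the reverse lookup is an index entry,
        # not a second edge in the model.
        self.out_edges[person].add(site)
        self.in_edges[site].add(person)

    def co_likers(self, person):
        """Who else likes anything this person likes?"""
        others = set()
        for site in self.out_edges[person]:    # follow edges forwards
            others |= self.in_edges[site]      # follow them backwards
        others.discard(person)
        return others


likes = LikesGraph()
likes.add_like("Nirmalya", "StackOverFlow")
likes.add_like("John", "StackOverFlow")
likes.add_like("Ted", "GoogleGroups")
likes.add_like("Nirmalya", "Neo4J")

also_like = likes.co_likers("Nirmalya")
```

The model still says only "Person likes Site"; walking the edge in the opposite direction is a property of the storage, not an extra fact in the graph.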
I am exploring ArangoDB and I am not sure I correctly understand how to use the graph concept in ArangoDB.
Let's say I am modelling a social network. Should I create a single graph for the whole social network, or should I create a graph for every person and their connections?
I've got the feeling I should use a single graph... But is there any performance or functionality issue related to that choice?
Maybe the underlying question is this: should I consider the graph concept in ArangoDB as a technical concept or as a business-related one?
Thanks
You should not use a graph per person. The quick first answer is to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "find the friends of a user?", "find the friends of the friends of a user?" or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, like for example in the shortest path example.
Coming back to the original question: you should start by looking at the queries you will run against your data. You then want to arrange things so that these queries map conveniently onto the standard graph operations (or indeed other queries) your data store can answer. This then tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship-relations, so the edges in these chains must be in the same graph. However, in more complicated cases it is for example conceivable that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.
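As a rough illustration in plain Python (not ArangoDB code; the people and edges are invented), two graphs can share the same vertices while keeping separate edge sets, and the same query gives different answers depending on which graph it runs against:

```python
from collections import deque

# Same vertices, two separate edge sets (adjacency by person).
friendship = {"ann": {"bob"}, "bob": {"ann", "cat"}, "cat": {"bob"}, "dan": set()}
follows = {"ann": {"dan"}, "bob": set(), "cat": {"ann"}, "dan": set()}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of vertices or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


via_friends = shortest_path(friendship, "ann", "cat")
via_follows = shortest_path(follows, "ann", "cat")
```

A chain of friendships connects ann to cat, but no chain of "follows" edges does, which is why edges that must be traversed together belong in the same graph.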