I'm looking at using Titan to create a scalable geospatial data store (I'm thinking R-trees). The documentation describes a GeoShape query and says that Titan can handle geo data with Lucene or ElasticSearch. However, it seems like this would be very slow, because traversing nodes in Cassandra is essentially doing join queries in Cassandra, which is a really bad idea. I think I might be misunderstanding the data representation.
I read the Titan Data Model doc, and I still don't quite get it. If all the edges are stored in a Cassandra row, then Titan would still have to "join" on a vertex table. One way to solve this would be to make the column value equal to the edge property data, and then you could neatly package the vertex data and the edge data into the row. However, this breaks down when you want to do queries deeper than 1 node, and we're back to the joining problem again.
So: is Titan emulating join queries in Cassandra? And how performant is it at geo lookups under these conditions?
I think the question conflates edge traversal with geospatial index lookups. These are separate at both the API and implementation levels. The index is not illustrated in the data model pictures.
Let's make this a little more specific. Say I run Titan with ES and Cassandra using Murmur3Partitioner or RandomPartitioner. I declare an ES geospatial index over edges called "place", as documented in the Getting Started page. Looking up edges by geospatial queries, such as the "WITHIN" example in the Getting Started docs, first hits ES. ES returns IDs that Titan can use to look up the associated vertex/edge data in Cassandra quickly, without doing an analog to relational joins.
The cost of these edge lookups by geospatial data should be roughly equivalent to the cost of ES's WITHIN implementation (which I think is delegated to Spatial4j), plus the lookups Titan makes on Cassandra after getting IDs, which should be roughly linear in the number of edges found by ES. This is just back-of-the-envelope estimation, so please take it with a big grain of salt.
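To make the two-phase shape of that lookup concrete, here is a toy Python sketch. It does not use any Titan, ES or Cassandra API: a brute-force haversine scan stands in for the ES geo index, a dict stands in for Cassandra's key lookups, and all names, IDs and coordinates are made up for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical stand-in for the external geo index: edge id -> (lat, lon).
GEO_INDEX = {
    "e1": (37.97, 23.72),
    "e2": (38.10, 23.70),
    "e3": (48.85, 2.35),
}

# Hypothetical stand-in for the storage backend: edge id -> edge data.
EDGE_STORE = {
    "e1": {"label": "battled", "out": "v1", "in": "v2"},
    "e2": {"label": "battled", "out": "v1", "in": "v3"},
    "e3": {"label": "battled", "out": "v4", "in": "v5"},
}

def ids_within(lat, lon, radius_km):
    """Phase 1: the index answers the geo predicate and returns only IDs."""
    return sorted(
        eid for eid, (elat, elon) in GEO_INDEX.items()
        if haversine_km(lat, lon, elat, elon) <= radius_km
    )

def edges_within(lat, lon, radius_km):
    """Phase 2: one key lookup per returned ID, so cost grows with the hit count."""
    return [EDGE_STORE[eid] for eid in ids_within(lat, lon, radius_km)]
```

The point of the sketch is only that phase 2 is a series of direct key lookups on the IDs from phase 1, not a scan or join over the whole edge set.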
After I get place edges by geo matching, if I then want to run arbitrary traversals in the neighborhood of each edge in the set, then I would have a look at rooting a MultiQuery on the head/tail vertices and enabling database-level caching. If the query misses cache or cache is cold/disabled, then Titan will still attempt to retrieve all edges the traversal cares about in a single Cassandra slice per vertex, when possible. If you're concerned about Titan's edge traversal efficiency, then you might find Boutique Graph Data with Titan interesting.
HTH
Context:
I have a graph with about 2000 vertices and 6000 edges; over time this might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
    unfold(),
    addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if it is found, update its model property; otherwise create a new vertex and set the model property. This is issued twice: once for the source vertex and once for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
    outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
    addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is already an edge between the two vertices, the model property on the edge is updated; otherwise the edge is created.
The pseudocode that does this is roughly as follows:
for each edge in the list of new edges:
    // upsert source and target vertices:
    execute queryVertex for edge.source
    execute queryVertex for edge.target
    // upsert edge:
    execute queryEdge
This works, but it is highly inefficient: for the graph size mentioned above it takes several minutes to finish, and adding some in-app concurrency only reduces the time by a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider-specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might see some optimizations to the approach you described. From a Gremlin perspective, I imagine you would see some savings by combining your existing traversals and submitting a single Gremlin request per edge:
g.V().has(label, name, foo).fold().
  coalesce(unfold(),
           addV(label).property(name, foo)).
  property(model, 2).as('source').
  V().has(label, name, bar).fold().
  coalesce(unfold(),
           addV(label).property(name, bar)).
  property(model, 2).as('target').
  coalesce(inE(edge_label).where(outV().as('source')),
           addE(edge_label).from('source').to('target')).
  property(model, 2)
I think I got that right. It is untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. If you continue with Gremlin-style bulk loading, you might try other tactics as well, such as ordering your edges so that you can batch together more edge loads per request and reduce the number of vertex lookups, and submitting vertex/edge data in a more dynamic fashion as described here.
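The client-side part of that ordering and batching idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: edges arrive as (source_name, target_name) pairs, `plan_upserts` and `batch_size` are hypothetical names, and actually submitting each batch to the graph is left out.

```python
def plan_upserts(edges, batch_size=50):
    """Plan vertex and edge upserts for a list of (source, target) pairs.

    Returns (vertices, batches): each vertex name appears once, so it only
    needs one upsert, and edges are deduplicated, ordered by source and
    cut into batches of at most batch_size.
    """
    # dict preserves insertion order, so vertices come out in first-seen order.
    seen = {}
    for source, target in edges:
        seen.setdefault(source, None)
        seen.setdefault(target, None)
    vertices = list(seen)

    # Drop duplicate edges, then group edges that share a source together.
    unique_edges = list(dict.fromkeys(edges))
    unique_edges.sort(key=lambda edge: edge[0])  # stable sort

    batches = [
        unique_edges[i:i + batch_size]
        for i in range(0, len(unique_edges), batch_size)
    ]
    return vertices, batches


vertices, batches = plan_upserts(
    [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")], batch_size=2
)
```

With the per-edge loop, a vertex that appears in many edges is upserted once per appearance; planning first means each vertex is upserted once, and each batch can then be sent as one request instead of one request per edge.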
I am researching graph databases. I stumbled upon SQL Server 2017 and learned that it added the option to use a graph database, but I have some uncertainties about its performance. I have watched several YouTube videos and read tutorials and papers about SQL Server 2017 Graph. For example this page.
With the image above in mind: when I try to find a node, is it true that the time complexity is O(n)? And is the performance of other graph databases like Neo4j similar? I am only talking about node lookup, not shortest path algorithms etc.
I also have a feeling that the graph functionality in SQL Server is just a relational database in disguise. Is this correct?
Thanks in advance.
There is a big difference between a graph database and a relational database with graph capabilities, in the sense of how the data is stored.
To summarise simply, when a triple (i.e. 2 nodes connected by a relationship) is stored, the difference in the underlying database is:
* Neo4j: the triple is stored as a graph on disk. Nodes have pointers to their relationships, so retrieval is just pointer chasing from nodes.
* SQL-like: one node is stored in one table and the other node in another table. You can still query it as a graph, but the operation underneath is really a JOIN.
Based on those two facts, we can say that in native graphs the join is performed at write time compared to having joins at query time in non-native graphs.
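A toy Python illustration of that difference, under heavy simplification: this is neither Neo4j's storage format nor SQL Server's execution plan, and a real relational engine would probe an index rather than scan the whole edge table. All names and data are made up.

```python
# "Native" style: every node directly holds references to its neighbours,
# so one hop is a single lookup followed by pointer chasing.
ADJACENCY = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def neighbours_native(node):
    return ADJACENCY[node]

# "Relational" style: edges live in their own table and every hop has to
# match node keys against it, i.e. a join evaluated at query time.
EDGE_TABLE = [("alice", "bob"), ("alice", "carol"), ("bob", "dave")]

def neighbours_join(node):
    return [dst for src, dst in EDGE_TABLE if src == node]

def two_hops(node, neighbours):
    """Nodes reachable in exactly two hops, using the given one-hop function."""
    return [far for near in neighbours(node) for far in neighbours(near)]
```

Both functions give the same answers; what differs is the work per hop. The adjacency version does work proportional to the node's own degree, while the unindexed edge-table version touches every edge for every hop.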
Be very careful when you hear about distributed graphs, partitions, planet scale and the like. If you have relationships that have to be traversed over the network, you will always suffer performance issues. Most distributed graph platforms also note that for maximum performance you have to store everything on one partition (which defeats the purpose of partitioning).
I have 3 nodes in DataStax Enterprise and have loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query on my graph, the query is too slow. I defined every kind of index and tested again, but it had no effect.
When I run a query such as "g.V().count()", CPU usage and CPU load average do not change much, whereas if I run a CQL query, it is distributed across all nodes and CPU usage and load average change significantly on all of them.
What are the best practices or best configuration for efficient Gremlin queries in this case?
count()-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you are using standard OLTP-based traversals, you can expect long wait times for this type of query.
Note that this rule holds true for any graph computation that must do a "table scan" (i.e. touch all or a very large portion of vertices/edges in the graph). This issue is not specific to DSE Graph either and will apply to virtually any graph database.
After many tests on different queries, I came to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices, whereas if you define an index on a vertex property and look up a specific vertex, for example g.V().hasLabel('member').has('C_ID','4242833'), the query takes less than 1 second, which is acceptable. So the question is: why does Gremlin have a problem with a count query over millions of vertices?
I have an HBase database that stores adjacency lists for a directed graph, with the edges in each direction stored in a pair of column families, where each row denotes a vertex. I am writing a mapreduce job, which takes as its input all nodes which also have an edge pointing from the same vertices as have an edge pointed at some other vertex (nominated as the subject of the query). This is a little difficult to explain, but in the following diagram, the set of nodes taken as the input, when querying on vertex 'A', would be {A, B, C}, by virtue of their all having edges from vertex '1':
To perform this query in HBase, I first look up the vertices with edges to 'A' in the reverse edges column family, yielding {1}, and then, for every element in that set, look up the vertices with edges from that element in the forward edges column family.
This should yield a set of key-value pairs: {1: {A,B,C}}.
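In plain Python, with dicts standing in for the two column families, the two-step lookup I have in mind looks roughly like this. This is not HBase client code; `REVERSE_EDGES`, `FORWARD_EDGES` and `co_targets` are hypothetical names, and the data is just the example above.

```python
# Stand-in for the reverse-edges column family: vertex -> vertices pointing at it.
REVERSE_EDGES = {
    "A": {"1"},
    "B": {"1"},
    "C": {"1"},
}

# Stand-in for the forward-edges column family: vertex -> vertices it points at.
FORWARD_EDGES = {
    "1": {"A", "B", "C"},
}

def co_targets(subject):
    """For each vertex with an edge to `subject`, return everything it points at."""
    # Step 1: one row read in the reverse family.
    sources = REVERSE_EDGES.get(subject, set())
    # Step 2: one row read in the forward family per source found in step 1.
    return {source: set(FORWARD_EDGES.get(source, set())) for source in sources}

result = co_targets("A")  # {"1": {"A", "B", "C"}}
```

The second step depends on the output of the first, which is exactly the chaining that I can't express as a single TableMapper input.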
Now, I would like to take the output of this set of queries and pass it to a Hadoop mapreduce job. However, I can't find a way of 'chaining' HBase queries together to provide the input to a TableMapper in the HBase mapreduce API. So far, my only idea has been to provide another initial mapper which takes the results of the first query (on the reverse edges table), performs the query on the forward edges table for each result, and yields the results to be passed to a second map job. However, performing IO from within a map job makes me uneasy, as it seems rather counter to the mapreduce paradigm (and could lead to a bottleneck if several mappers are all trying to access HBase at once). Therefore, can anyone suggest an alternative strategy for performing this sort of query, or offer any advice about best practices for working with HBase and mapreduce in such a way? I'd also be interested to know if there are any improvements to my database schema that could mitigate this problem.
Thanks,
Tim
Your problem does not fit the Map/Reduce paradigm very well. I've seen the shortest path problem solved by chaining many M/R jobs together. This is not very efficient, but it is needed to get the global view at the reducer level.
In your case, it seems that you could perform all the requests within your mapper by following the edges and keeping a list of seen nodes.
"However, performing IO from within a map job makes me uneasy"
You should not worry about that. Your data model has essentially random access patterns, and achieving data locality would be extremely hard, so you don't have much choice but to query this data across the network. HBase is designed to handle large numbers of parallel queries. Having multiple mappers query disjoint data will give a good distribution of requests and high throughput.
Make sure to keep a small block size in your HBase tables to optimize your reads, and have as few HFiles as possible for your regions. I'm assuming your data is fairly static, so running a major compaction will merge the HFiles together and reduce the number of files to read.
I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process that stores the whole graph (500MB to 1GB) in memory. The clients could then query the server process, which could traverse the graph very quickly since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
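A minimal sketch of the in-memory idea, leaving out the server part entirely. It assumes the edges fit in memory as (source, target) pairs and that "including incoming and outgoing edges" means edge direction is ignored when searching; `build_index` and `all_shortest_paths` are hypothetical names, and this is plain breadth-first search rather than anything tuned for 30 million edges.

```python
from collections import defaultdict, deque

def build_index(edges):
    """Map each node to its neighbours, ignoring edge direction."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return neighbours

def all_shortest_paths(neighbours, start, goal):
    """Return every shortest path from start to goal as a sorted list of node lists."""
    if start == goal:
        return [[start]]
    dist = {start: 0}
    preds = defaultdict(list)  # node -> predecessors lying on some shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break  # every predecessor of goal has already been expanded
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt].append(node)
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)  # another equally short way in
    if goal not in dist:
        return []

    paths = []
    def walk(node, suffix):
        # Follow predecessor links back to start, collecting complete paths.
        if node == start:
            paths.append([start] + suffix)
            return
        for pred in preds[node]:
            walk(pred, [node] + suffix)
    walk(goal, [])
    return sorted(paths)

index = build_index([("a", "b"), ("c", "b"), ("a", "d"), ("d", "c"), ("c", "e")])
paths = all_shortest_paths(index, "a", "e")
```

For the example edges, `paths` contains the two three-hop routes a-b-c-e and a-d-c-e. Since the graph is read-only, the index only has to be built once at start-up.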
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open source document-graph DBMS with a commercial-friendly license (Apache 2). It has a simple API, an SQL-like language, ACID transactions and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friend" is followed recursively up to 7 levels deep.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
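The layout is easy to picture with a toy inverted index in plain Python. This is not the Lucene API, just the same idea in miniature; `EdgeIndex` and the field names "source" and "target" are made up for the illustration.

```python
from collections import defaultdict

class EdgeIndex:
    """One 'document' per edge, searchable by either endpoint field."""

    def __init__(self):
        self.docs = []  # document id -> {"source": ..., "target": ...}
        self.by_field = {
            "source": defaultdict(list),  # node -> ids of edges leaving it
            "target": defaultdict(list),  # node -> ids of edges entering it
        }

    def add_edge(self, source, target):
        doc_id = len(self.docs)
        self.docs.append({"source": source, "target": target})
        self.by_field["source"][source].append(doc_id)
        self.by_field["target"][target].append(doc_id)

    def search(self, field, value):
        """Return all edge documents whose given field equals value."""
        return [self.docs[i] for i in self.by_field[field].get(value, [])]

index = EdgeIndex()
index.add_edge("a", "b")
index.add_edge("a", "c")
index.add_edge("b", "c")
outbound = index.search("source", "a")  # edges leaving "a"
inbound = index.search("target", "c")   # edges entering "c"
```

Fetching the outbound or inbound links of a node is then a single term lookup on the matching field.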
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user of a web-based tool. But for computationally intensive batch calculations, where I am running hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is list of the linked nodes, seems to me a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
* It's optimized for reading, and there's memcached if you want it even faster
* It's distributed, so the size doesn't affect efficiency
Of course if you somehow rely on Relational DB to find the path, it won't work for you...
And I just noticed that the q is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.