I'm running Nebula Graph database on AWS with the Twitter dataset (3 graph spaces), and each space has a data volume of around 500GB.
I know the compaction process is quite time-consuming. Can I run compaction for all 3 graph spaces at the same time? I don't want to hit an OOM. I don't see any caution against it in their docs, but I still have doubts.
I'm trying to figure out how to find outliers in our graph, in particular nodes with more than N edges, where N could be some high number. Our graph has over 2 billion nodes. Is there an efficient way to do this?
At that scale you are probably going to want to multi-thread the queries and send requests to the server in batches. A good approximation for the number of client threads is 2 times the number of vCPUs on the server. If you are able to send lists of IDs, that will be most efficient; otherwise you will need to do a lot of range steps. Each thread would then run something like the query below for multiple sets of ID ranges:
g.V(<list of IDs>).filter(out().count().is(gt(x)))
You would then collect all the outliers in the application. I think you should approach this as a bit of a batch task that may take a while to complete.
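As a rough illustration, here is a minimal Python sketch of that batching approach using gremlinpython. The endpoint URL, the ID batches, and the threshold are placeholders, and the worker count follows the 2x-vCPU rule of thumb above; treat it as a starting point, not a tuned implementation.

from concurrent.futures import ThreadPoolExecutor
from gremlin_python.driver import client

ENDPOINT = 'wss://your-neptune-endpoint:8182/gremlin'  # placeholder
THRESHOLD = 100000  # "N": the out-degree that counts as an outlier

def find_outliers(id_batch):
    # One connection per batch keeps the sketch simple; pool them in practice.
    c = client.Client(ENDPOINT, 'g')
    try:
        return c.submit(
            "g.V(ids).filter(out().count().is(gt(n))).id()",
            {'ids': id_batch, 'n': THRESHOLD}).all().result()
    finally:
        c.close()

id_batches = [...]  # fill in: lists of vertex IDs, one list per request
outliers = []
with ThreadPoolExecutor(max_workers=16) as pool:  # ~2x server vCPUs
    for batch in pool.map(find_outliers, id_batches):
        outliers.extend(batch)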
The alternative would be to use Neptune Export to export the graph and load it into Spark and run a degree query using something like GraphFrames.
With a reasonably large instance I think the technique of using multiple threads will work, especially if you are able to easily generate the lists of vertex IDs you are looking for in each query. Spreading the queries across multiple read replicas will also speed things up.
I have a dataset that contains a few hundred paths representing a series of recorded events. I want to find out how long the average event path is. It's stored in a fairly simple graph with about 50k nodes on Azure Cosmos DB (graph API).
I have a gremlin query that looks like this:
g.V().hasLabel('user').out('has_event').repeat(out('next').simplePath()).until(out().count().is(0))
It traverses all events until the end of the event chain.
However, I'm getting the following error: Gremlin Query Execution Error: Exceeded maximum number of loops on a repeat() step. Cannot exceed 32 loops. I wasn't aware of the 32-loop limit; the analysis I have in mind will produce paths longer than 32 steps many times over.
Is there a way to achieve what I'm trying to do, or does Cosmos DB really 'stop' after 32 loops? How about other graph DBs like Neo4j?
That is a limit of Cosmos DB. I'm not aware of other TinkerPop implementations that have this limit, so choosing another graph database would likely solve your problem. I suspect the limit is in place to prevent runaway fan-out of a query, since the structure of your graph greatly affects performance.
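If you need to stay on Cosmos DB, one workaround is to drive the traversal from the client in chunks that stay under the cap, resuming each chain from where the previous chunk ended. A minimal Python sketch with gremlinpython, assuming the chains are linear (no cycles, so simplePath isn't needed); the account, database, collection, and key are placeholders:

from gremlin_python.driver import client, serializer

c = client.Client(
    'wss://your-account.gremlin.cosmos.azure.com:443/', 'g',
    username='/dbs/your-db/colls/your-graph', password='your-key',
    message_serializer=serializer.GraphSONSerializersV2d0())  # all placeholders

# First event of each user's chain.
starts = c.submit("g.V().hasLabel('user').out('has_event').id()").all().result()

lengths = []
for start in starts:
    length, frontier = 0, start
    while True:
        # Walk at most 30 hops, safely under the 32-loop cap.
        step = c.submit(
            "g.V('%s').repeat(out('next')).emit().times(30).id()" % frontier
        ).all().result()
        length += len(step)
        if len(step) < 30:   # the chain ended inside this chunk
            break
        frontier = step[-1]  # resume from the last vertex reached
    lengths.append(length)

print('average path length:', sum(lengths) / len(lengths))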
I want to ask a question about graph databases.
At first I used NetworkX in Python and created the graph in memory, but as the node count grew, my RAM was not enough.
So next I tried Neo4j. It's nice and writes the graph to disk, but it seems slow (at least to me; even with indexes and other tuning, slower than NetworkX). I created 500k nodes and 2,000,000 relationships, tried to find a path between two nodes, and Neo4j just got stuck on my server.
I've heard about OrientDB, but haven't tried it yet.
So I need advice: what is the best graph database that can store the graph on disk?
Big thanks to you.
PS: I only want an open-source graph database.
First of all, there are native graph databases and non-native graph databases. Native graph databases really organize your data in a graph structure and connect the nodes to each other, while non-native ones use some other model to store your graph representation. You could simply represent a graph as an adjacency matrix, which is a table, and that could even be stored in a row-key store with columns (though in my opinion that would be inefficient and a poor fit). So first you need to ask yourself whether you really need a graph database. Second, you need to think about the read and write operations you want to perform.
There is no best (graph) database. There are many different databases for many different use cases, so you need to identify your exact use case, and then you can think about the database.
For your experiments with Neo4j: writing to Neo4j is indeed very slow if you do it wrong. You may want to have a look at this question and answer about Neo4j's write performance.
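The usual fix is to batch many records into each transaction instead of committing one node at a time. A minimal sketch with the official Neo4j Python driver, where the URI, credentials, label, and batch size are all placeholders:

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

def create_nodes(tx, rows):
    # One Cypher statement inserts the whole batch.
    tx.run("UNWIND $rows AS row CREATE (:Node {id: row.id})", rows=rows)

nodes = [{'id': i} for i in range(500000)]
with driver.session() as session:
    for i in range(0, len(nodes), 10000):  # 10k rows per transaction
        session.execute_write(create_nodes, nodes[i:i + 10000])
driver.close()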
Almost all graph databases can store the graph on disk.
But if you're doing heavy computation, such as shortest-path searches at great depth (dozens of hops), memory is much, much more important than disk.
I've been experimenting with Titan over the past few weeks and would like some pointers on the way forward, plus a few specific questions. The purpose of the project is to store log data on a Cassandra cluster (for this question let's use the example of web traffic) and represent relationships in a Titan graph. All nodes are modelled as having an entity value and type (e.g. "google.com","hostname"), and edges have a label (e.g. "connects") as well as several attributes of the relationship (timestamp, flow length and so on).
Once this data is stored in cassandra and represented as a Titan graph, I plan to use d3 code to generate visualisations. At the end of the tunnel I am hoping to be able to build large-scale, interactive, complex graph networks that look something like this: http://goo.gl/CVEd55
My current setup is as follows:
A python script to convert log files into vertices.csv and edges.csv files for Gremlin to load in
Titan Server 0.4 (using CassandraThrift as the storage backend) - gremlin script to load converted data into Titan
Python script that uses NetworkX to open a RexPro connection, allowing the analyst to enter a custom Gremlin query, outputting the result as a JSON
Local web front-end that uses the generated JSON and d3 to display the results of the query as a graph
Ideally as a test base case, I would like the user to be able to type a Gremlin query into the web front-end and be directed to a page containing an interactive d3 graph of the result.
My specific questions are as follows:
What is the process for assigning attributes to edges? I have had trouble finding sample code that helps me represent the graph using the model listed above.
My gremlin script to load data into Titan uses bg.commit() to create a batch graph which is later referenced in the RexPro connection conn = RexProConnection('localhost', 8184, 'bg'). This was working originally, but after changing my load script, clearing the graph in Gremlin and then reloading, the RexPro connection cannot be opened because the graph bg apparently does not exist. What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
Apologies for the long post, but if anyone could share their expertise that would be much appreciated!
For d3 visualization, you can use a force-directed graph. There are a few variations of it.
Relationship Graph
https://vida.io/documents/qZ5SJdRJfj3XmSXYJ
Force Layout Tree
https://vida.io/documents/sy7vzWW7BJEvKdZeL
If your network contains a large number of nodes and edges, you'll need to cluster the data before visualizing it. You can use tools like Gephi or NodeXL to perform the clustering, then use the clustered data to build the force-directed visualization.
What is the process for assigning attributes to edges?
The process is the same as adding properties to vertices. Get an Edge instance, then do:
// v1 and v2 are existing Vertex instances; 0.1d is a Groovy double literal
Edge e = g.addEdge(v1, v2, 'label')
e.setProperty('weight', 0.1d)
As for:
What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
You don't want to keep a reference to a BatchGraph after loading, as it comes with limitations that will prevent you from querying. It sounds like you should just configure "yourgraph" in rexster.xml; when you load through your script, simply wrap the rexster.xml-configured Graph in a BatchGraph in your code and perform your load operations against that. When you want to query, simply reference "yourgraph" instead of "bg":
conn = RexProConnection('localhost', 8184, 'yourgraph')
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
It's hard to say whether that's "easy", as that depends on factors outside of just the technology. I'll say that it's possible to build an interface that accepts Gremlin queries (you wrote SQL, but I assume you meant Gremlin), passes them to Rexster and gets back an answer. What you do with that answer is up to you, but as far as Rexster's part plays into it, I don't see why that would be a problem.
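For instance, a rough Python sketch that forwards a query string to Rexster's REST Gremlin extension and hands the JSON back to the front end; host, port, graph name, and the 'results' key are assumptions based on Rexster's REST API, and you'd want authentication and validation before exposing arbitrary script execution to analysts:

import requests

def run_gremlin(script, graph='yourgraph'):
    # Rexster's Gremlin extension lives under /graphs/{graph}/tp/gremlin
    resp = requests.get(
        'http://localhost:8182/graphs/%s/tp/gremlin' % graph,
        params={'script': script})
    resp.raise_for_status()
    return resp.json()['results']

print(run_gremlin('g.V[0..4]'))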
I'm trying to use rrdtool to monitor access points, and what I'd like is to have a separate rrd file for each access point, which is something I'm not sure how to do. If I can do that, then for each site I'd be able to get a graph from different rrd databases according to site location. However, when I want to see a company-level graph, I'd like to aggregate data across multiple rrd databases and show it on one graph. So if bandwidth is measured for two devices in two separate rrd databases, I would like to get an "average" of these two data sources and show it in my graph for the site that has these access points. Is this possible? I'm quite new to thinking the RRD way and to rrdtool, so please do let me know if there are better ways of doing this.
Also, how does RRD use space internally? From what I've read so far, some people say the file size of an RRD database never grows; on the other hand, people ask how much file size it would accumulate over the years. So I'm kind of confused here. I thought it would hold stuff in memory and write to disk based on the consolidation functions.
Can I generate pie charts from rrdtool as well? I need to find the number of users connected to an access point, and it would be good if I could show that as a pie chart of the total number of users connected per access point at any given time for a given site. For instance,
access point 1: 20
access point 2: 40
access point 3: 1
If I can generate a pie chart for that, it would be sliced according to the number of users.
Sorry, it's quite a few questions. If rrdtool doesn't make a big difference, then I might as well use MySQL, as I already have a MySQL server running in production, and I can produce graphs on the fly using some funky Flash stuff too. If someone could enlighten me on the pros and cons of using RRD over an RDBMS for time-series data, that would be amazing.
Many Thanks guys!!
You can aggregate data from multiple RRDs into one graph; you'd use a CDEF in your rrdgraph statement to combine DEFs from the individual databases.
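A minimal sketch using the Python rrdtool bindings; the file names and the DS name 'bandwidth' are made up, and the CDEF is RPN, so 'a,b,+,2,/' computes (a + b) / 2:

import rrdtool

rrdtool.graph(
    'site-average.png',
    '--start', '-1d',
    '--title', 'Site bandwidth (average of two APs)',
    'DEF:a=ap1.rrd:bandwidth:AVERAGE',
    'DEF:b=ap2.rrd:bandwidth:AVERAGE',
    'CDEF:avg=a,b,+,2,/',       # RPN: (a + b) / 2
    'LINE1:avg#0000FF:average bandwidth')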
rrd files stay the same size unless you explicitly resize them by adding rows. Older data is aged out and replaced with new data (hence the name "round robin database").
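The size is fixed when the file is created, because you declare up front exactly how many rows each archive keeps. A sketch, again with the Python bindings, where the step, DS name, and row counts are just illustrative:

import rrdtool

rrdtool.create(
    'ap1.rrd',
    '--step', '300',               # expect one sample every 5 minutes
    'DS:bandwidth:GAUGE:600:0:U',  # data source with a 10-minute heartbeat
    'RRA:AVERAGE:0.5:1:600',       # keep 600 raw samples (~2 days)
    'RRA:AVERAGE:0.5:288:700')     # keep 700 daily averages (~2 years)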
pie charts...I dunno. :) I've never seen it, but that certainly doesn't mean it's not possible.
Have you read the basic tutorial? http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html That might help you decide what to do.
Cacti is what you are after, I would say.
It is a web front end to rrdtool (and much more). You can create devices, add them, set up graphs and it will poll them for data into RRD files. You can have all kinds of graphs, and create aggregate ones etc. You can also query against rrd files for monthly/weekly/yearly/any-time-frame statistics you like.
Everything you have asked for can be done with Cacti except for pie charts.