I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that generates a graph in memory for any given date. I could then run whatever stat I wanted against that graph. This is much easier to understand and, in general, much more performant -- except for the graph generation itself.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: e.g. for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached, which worked great until the graphs grew beyond 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4J and retrieve/interact with them separately? I would then create and store a separate social graph for each date.
or
add "valid from" and "valid to" timestamps to each edge and filter the graph appropriately? If I wanted a graph for "May 1st", I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st, those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, adding temporal information to the edges. The edges then represent the structure of the graph for any given date, and you can traverse the graph as it was at that time.
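As a rough illustration of filtering by edge timestamps, here is a minimal plain-Python sketch, not Neo4j code. The `EDGES` rows and the `snapshot` helper are hypothetical; they only mirror the (contact_id, other_contact_id, strength, recorded_at) columns from the question.

```python
from datetime import datetime

# Hypothetical rows: (contact_id, other_contact_id, strength, recorded_at).
EDGES = [
    (1, 2, 2, datetime(2011, 1, 5)),
    (1, 2, 4, datetime(2011, 3, 1)),
    (2, 3, 3, datetime(2011, 6, 10)),
]

def snapshot(edges, as_of):
    """Return {(a, b): strength}, keeping the newest edge per pair recorded before as_of."""
    latest = {}
    for a, b, strength, recorded_at in edges:
        if recorded_at >= as_of:
            continue  # this edge did not exist yet at the reference time
        key = (a, b)
        if key not in latest or recorded_at > latest[key][1]:
            latest[key] = (strength, recorded_at)
    return {key: strength for key, (strength, _) in latest.items()}

# Graph as it looked on May 1st 2011: only the pair (1, 2) is connected.
may_first = snapshot(EDGES, datetime(2011, 5, 1))
```

In a graph database the same test would be applied to a relationship property during traversal instead of to rows in a list.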
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes also differ over time (e.g. changing properties), you could either duplicate them, effectively creating different subgraphs, or create a linked list of history nodes on each node that contain just the changes (or full snapshots, depending on your requirements).
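The "history nodes that contain just the changes" idea can be sketched in plain Python as follows. This is an illustration only: `HISTORY` and `properties_at` are made-up names, and in a real graph each entry would be a node linked to the previous one.

```python
from datetime import date

# Hypothetical history chain for one contact; each entry holds only the changed properties.
HISTORY = [
    (date(2011, 1, 1), {"name": "Ann", "group": "sales"}),
    (date(2011, 4, 1), {"group": "support"}),
]

def properties_at(history, when):
    """Replay change records dated up to and including `when` to rebuild the properties."""
    props = {}
    for changed_on, changes in sorted(history, key=lambda entry: entry[0]):
        if changed_on > when:
            break  # later changes are not visible at the reference date
        props.update(changes)
    return props

in_february = properties_at(HISTORY, date(2011, 2, 1))
in_may = properties_at(HISTORY, date(2011, 5, 1))
```

Storing full snapshots instead of deltas would trade storage for a cheaper lookup, since no replay is needed.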
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it into memory for the query, and closes it after you get your answer. You could also configure a proxy server and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing data sets and running some analytics over them, it can definitely meet your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
Related
I read a section in the Neo4j documentation about how to make queries that depend on time more efficient:
One way to model time-specific data and relationships is by including
data in the relationship type. Because Neo4j is optimized specifically
for traversing relationships between entities, you can often improve
query performance by specifying a date as the relationship type and
only traversing particular dated relationships.
But I was wondering: with this technique you have to repeat the same modelling every time you want to make a time-based query more efficient. For example, to query the posts created by a specific user on a specific date, you have to add something like UserDay (similar to AirportDay).
My question: is there a way to model time universally in your graph, so that time becomes the main entry point for querying events and activities?
There's no answer to modelling time universally in your graph. It depends on your use cases.
The example in your post is one way to optimise non-performant queries that traverse too many relationships of the same type from a node.
You could also store time as a property on the node, and index it.
And then there's the option of a timetree https://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
To summarise, it depends on your use cases; there is usually no need to optimise prematurely.
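To give a feel for the time-tree option, here is a minimal sketch in plain Python. It is not the GraphAware TimeTree API; `make_tree`, `attach` and `events_in_month` are hypothetical helpers that mimic year, month and day nodes with events hanging off the day.

```python
from collections import defaultdict
from datetime import date

def make_tree():
    """Nested mapping standing in for year -> month -> day nodes, each day holding events."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def attach(tree, when, event):
    """Hang an event off the day node for `when`, creating intermediate nodes as needed."""
    tree[when.year][when.month][when.day].append(event)

def events_in_month(tree, year, month):
    """Enter the tree at a month node and collect events from its days in date order."""
    days = tree[year][month]
    return [event for day in sorted(days) for event in days[day]]

tree = make_tree()
attach(tree, date(2014, 8, 20), "post-1")
attach(tree, date(2014, 8, 3), "post-2")
attach(tree, date(2014, 9, 1), "post-3")
august = events_in_month(tree, 2014, 8)
```

The point of the structure is that time becomes a shared entry point: any kind of event can be attached to the same day nodes without a per-entity construct such as UserDay.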
I am working on an application that uses both relational and graph databases (sqlite and neo4j). I am trying to see whether I can get rid of sqlite and use only neo4j, and I am confronted with a redundancy problem.
Let's say I have nodes that represent audio tracks. I want to store the musical genre of each track. With hundreds of thousands of nodes, I don't think repeating "South-African Psytrance" as a string property is a good idea, and I am pretty sure that creating a "South-African Psytrance" node and linking it to all concerned nodes is an even worse idea (bottleneck?).
Am I right if I say that using 1) properties takes too much space, and using 2) relationships is a bad design for this particular problem?
The current code uses the sqlite db to store a set of musical genres, and their indexes as properties in nodes (which are converted to their string representation in the application layer).
Is there a way to use only neo4j and avoid bottlenecks and redundancy?
Option 1 is definitely NOT the way to go, as it will waste space and is antithetical to good graph DB design.
Option 2 is the classic way you would do this with a graph DB. There are many examples of neo4j DBs with very large numbers of relationships per node. And neo4j currently supports up to 34 billion relationships in a DB, so there is little danger that you will exceed a capacity limit. So, I would recommend that you at least try using this approach.
There are also a few blogs about people using neo4j for storing similar data. For example:
http://neo4j.com/blog/musicbrainz-in-neo4j-part-1/
http://neo4j.com/blog/fun-with-music-neo4j-and-talend/
http://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/
[EDITED]
As the slides mentioned by @Pawamoy imply, there is actually a third option. That is, you can create a specific node label for each genre, and apply the appropriate genre label (a node can have more than one) to every track node. This would allow you to avoid using relationships for genres. However, it would tend to "muddy" the label space, since labels at least feel like "node types", and a "music genre" is not an "album track". Also, neo4j supports a very limited number of labels per node, and the maximum number of labels in a DB is also relatively small. So, I would not use this approach unless there was a definite advantage to doing so and the capacity limits are not an issue.
I'm building a system which allows the user to call any number of different graphs through an API. Currently I have a working prototype which pulls graphs from CouchDB. However, for obvious reasons, I would like to move to a graph DB. My understanding is that Neo4J can only handle one graph at a time, or requires some sort of tagging system to avoid mixing graphs. Neither of those approaches seems optimal. What's the best-practice approach for this?
A few more things to note: I will be calling these graphs and manipulating them with something like networkx, and I've considered storing the graphs in a "regular" DB then moving them to Neo4J as the requests come in, which seems pretty intense.
Neo4j does not have a concept of multiple databases like most relational databases do using CREATE DATABASE. In Neo4j there is one graph space which you can use.
So you have 2 options:
use separate Neo4j instances (single or clustered) for each graph; using Neo4j in embedded mode may be helpful here
use one Neo4j instance (single or clustered) and store your data in distinct subgraphs. If the subgraphs need some interconnections you can use labels to identify to which subgraph a certain node belongs.
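The second option can be pictured with a small plain-Python sketch. It is a toy stand-in for one graph space, not Neo4j code; `GraphStore` and its methods are hypothetical, and the label strings are invented for the example.

```python
from collections import defaultdict

class GraphStore:
    """Toy single graph space; each node carries a label naming its subgraph."""

    def __init__(self):
        self.labels = {}              # node id -> subgraph label
        self.adj = defaultdict(set)   # node id -> neighbouring node ids

    def add_node(self, node, label):
        self.labels[node] = label

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def subgraph(self, label):
        """Return adjacency lists restricted to nodes carrying `label`."""
        members = {n for n, l in self.labels.items() if l == label}
        return {n: sorted(self.adj[n] & members) for n in sorted(members)}

store = GraphStore()
store.add_node("a1", "graph_a")
store.add_node("a2", "graph_a")
store.add_node("b1", "graph_b")
store.add_edge("a1", "a2")
store.add_edge("a2", "b1")  # an interconnection between the two subgraphs
graph_a = store.subgraph("graph_a")
```

A subgraph extracted this way could then be handed to a library such as networkx for manipulation, while the interconnecting edge stays available in the shared store.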
When using graph databases (in my case, Neo4j), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to the attribute list of a node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the database from different nodes up to a depth of 4 and examining the information through connected nodes or attributes (depending on the approach).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since the relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index these values, instead of having to fall back on filtering by property values or indexing the property with an expensive index lookup.
The first one is much better since you're querying on entities such as Stanford, and that entity is related to many person nodes. My opinion is that modeling as nodes is more intuitive and easier to query. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if the year is really important and drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
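To make the "place to start traversing from" point concrete, here is a minimal plain-Python sketch of the node-based model. The `RELS` triples and the `neighbours` helper are hypothetical and only imitate a traversal; the names other than John and Stanford are invented.

```python
# Hypothetical data: relationships kept as (start, type, end) triples.
RELS = [
    ("John", "FRIEND", "Mary"),
    ("John", "FRIEND", "Raj"),
    ("Mary", "ATTENDED", "Stanford"),
    ("Raj", "ATTENDED", "MIT"),
    ("Lee", "ATTENDED", "Stanford"),
]

def neighbours(node, rel_type):
    """Nodes connected to `node` by `rel_type`, ignoring direction."""
    outgoing = {end for start, t, end in RELS if t == rel_type and start == node}
    incoming = {start for start, t, end in RELS if t == rel_type and end == node}
    return outgoing | incoming

# Both John and Stanford are entry points, so the answer is a simple intersection.
friends_at_stanford = neighbours("John", "FRIEND") & neighbours("Stanford", "ATTENDED")
```

With the university stored only as a string attribute on each person, there would be no Stanford node to start from, and the second set would require scanning every person.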
I'm trying to use rrdtool to monitor Access Points and what I'd like is to have separate rrd file for each access point, which is something I'm not sure how to do. Anyway if I can do that then for each site I'd be able to get a graph from different rrd databases according to site location. However when I want to see a company level graph I'd like to aggregate data across multiple rrd databases and get that to show on one graph, so if bandwidth is measured for two devices in two separate rrd databases then I would like to get an "average" of these two data sources and show it in my graph for the site that has these access points. Is this possible? I'm quite new to thinking in RRD way and rrdtool so please do let me know if there are better ways of doing this.
Also, how does RRD use space internally? From what I have read so far, some people say the file size of an RRD database never grows. On the other hand, other people ask how much file size it would accumulate over the years. So I'm kind of confused here. I thought it would hold data in memory and write to disk based on consolidation functions.
Can I generate pie charts from rrdtool as well? I need to find the number of users connected to an access point, and it would be good if I could show the total number of users connected to each access point at any given time for a given site as a pie chart. For instance,
access point 1: 20
access point 2: 40
access point 3: 1
If I can generate a pie chart for that it would be sliced according to the number of users.
Sorry it's quite a few questions. If rrdtool doesn't make a big difference then I might as well use Mysql as I have running mysql server in production. And I can produce graphs on the fly using some funky flash stuff too. If someone can enlighten me on pros and cons of using RRD over any RDBMS for time series data that would be amazing.
Many Thanks guys!!
You can aggregate data from multiple RRDs into one graph; you'd use a CDEF in your rrdgraph statement to combine DEFs from the individual databases.
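As a sketch of what such a statement could look like, the Python helper below only builds an argument list for `rrdtool graph`; it does not run anything. The file names, the data source name `bandwidth`, the output file and the helper `average_graph_args` are all assumptions for illustration. The DEF, CDEF and LINE1 forms follow the rrdgraph syntax, with the CDEF written in RPN.

```python
def average_graph_args(rrd_files, ds_name="bandwidth", out="site.png"):
    """Build an `rrdtool graph` argument list plotting the mean of one DS across files."""
    if not rrd_files:
        raise ValueError("need at least one RRD file")
    args = ["rrdtool", "graph", out]
    names = []
    for i, path in enumerate(rrd_files):
        name = f"ap{i}"  # hypothetical variable name, one per access point
        names.append(name)
        args.append(f"DEF:{name}={path}:{ds_name}:AVERAGE")
    # RPN: push every value, add them up, then divide by the count.
    rpn = names + ["+"] * (len(names) - 1) + [str(len(names)), "/"]
    args.append("CDEF:avg=" + ",".join(rpn))
    args.append("LINE1:avg#0000FF:Average")
    return args

args = average_graph_args(["ap1.rrd", "ap2.rrd"])
```

The resulting list could be passed to `subprocess.run` on a machine with rrdtool installed, or joined into a shell command.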
rrd files stay the same size unless you explicitly resize them by adding rows. Older data is aged out and replaced with new data. (Hence the name "round robin database".)
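The fixed-size behaviour can be imitated in a few lines of Python. This is a toy model of the round-robin idea, not rrdtool's file format; `RoundRobinArchive` is a hypothetical name.

```python
from collections import deque

class RoundRobinArchive:
    """Fixed number of rows; once full, each new value pushes out the oldest one."""

    def __init__(self, rows):
        self.rows = deque(maxlen=rows)

    def update(self, value):
        self.rows.append(value)

archive = RoundRobinArchive(rows=3)
for sample in range(5):
    archive.update(sample)
# Only the three most recent samples remain; the storage never grew past three rows.
kept = list(archive.rows)
```

This is why the size on disk is decided when the database is created (by the number of rows and data sources) rather than by how long it has been collecting.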
pie charts...I dunno. :) I've never seen it, but that certainly doesn't mean it's not possible.
Have you read the basic tutorial? http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html That might help you decide what to do.
Cacti is what you are after, I would say.
It is a web front end to rrdtool (and much more). You can create devices, add them, set up graphs and it will poll them for data into RRD files. You can have all kinds of graphs, and create aggregate ones etc. You can also query against rrd files for monthly/weekly/yearly/any-time-frame statistics you like.
Everything you have asked for can be done with Cacti except for pie charts.