Hi all, I'm playing around with OrientDB to evaluate its inclusion in a new project.
Here is my problem.
Looking at the use cases, I'm going to have a lot of super nodes (nodes that will have at least 5-10k outgoing relations), and I think those nodes could become an irritating bottleneck under highly concurrent access.
The entire database must serve 20 departments; each department owns a partition of the data, and those "blocks" are not accessible from the other departments.
Every department's partition shares about 60% of the data structure schema; the other 40% of the schema is department-specific...
At the system level I have a couple of agents with complete read access to the graph, for data analysis and profiling, and each department can have its own profiling agent, which profiles only its partition's data.
Now, my question is:
Is it possible to create independent subgraphs in an OrientDB graph database?
Thanks to everybody for your time and help.
Marco
You can model this use case inside your domain as a graph:
root -> * departments -> other nodes
In this way, each department traverses the graph starting from its own department node.
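As a rough sketch of the pattern (written in Cypher-style syntax purely for illustration, since later examples in this thread use it; OrientDB exposes the same property-graph model through its Blueprints/Gremlin API, and all labels and names here are made up):

CREATE (root:Root)-[:HAS_DEPARTMENT]->(d1:Department {name: 'Dept1'}),
       (root)-[:HAS_DEPARTMENT]->(d2:Department {name: 'Dept2'}),
       (d1)-[:OWNS]->(:Record {id: 1})

// a department-scoped traversal starts at the department node,
// so it can never reach another department's partition
MATCH (d:Department {name: 'Dept1'})-[:OWNS]->(r:Record)
RETURN r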
To use something already done, look at this post by Marko Rodriguez (the main author of Blueprints and the Gremlin language): http://thinkaurelius.com/2012/04/06/multitenant-graph-applications/
And this recent project that runs a partitioned graph on top of OrientDB's Blueprints implementation: https://github.com/tinkerpop/blueprints/wiki/Partition-Implementation
I am researching graph databases. I stumbled upon SQL Server 2017 and learned that it added graph database capabilities. But I have some uncertainties about the performance. I have watched several YouTube videos and read tutorials and papers about SQL Server 2017's graph support, for example this page.
With the image above in mind: when I try to find a node, is it true that the time complexity is O(n)? And is the performance in other graph databases like Neo4j similar? I am only talking about node lookup, not shortest-path algorithms etc.
I also have a feeling that the graph functionality in SQL Server is just a relational database in disguise. Is this correct?
Thanks in advance.
There is a big difference between a graph database and a relational database with graph capabilities, in the sense of how the data is stored.
To summarise simply, when a triple (i.e. two nodes connected by a relationship) is stored, the underlying difference is:
Neo4j: the triple is stored as a graph on disk; nodes have pointers to their relationships, so retrieval is just pointer chasing from the nodes.
SQL-like: one node is stored in one table, the other node in another table; you can still query it as a graph, but the operation is really performing a JOIN.
Based on those two facts, we can say that native graphs perform the join at write time, whereas non-native graphs perform joins at query time.
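For example, a one-hop retrieval in Cypher is a pattern match that just follows the stored pointers, with no join (the labels and names here are made up):

// follow Alice's outgoing KNOWS pointers directly; a relational store
// would resolve the same pattern by JOINing a nodes table with an edges table
MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person)
RETURN b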
Be very careful when you hear about distributed graphs, partitions, planet scale and the like. If you start having relationships that must be traversed over the network, you will always suffer performance issues. Most distributed graph platforms also note that for maximum performance you have to store everything on one partition (which defeats the purpose of partitioning).
I'm currently working on my first application that uses a graph database (Neo4j). I'm in the process of modelling my graph on a whiteboard. My colleague and I are in a pickle over whether we should introduce a 'collection node'.
We have something like this (Cypher syntax, fictitious example):
(parking:Parking) - Parking node
(car:Car) - Car node
Obviously, a Parking can have multiple Cars; let's say it can have up to 1 million cars.
Is it, in this case, better to introduce a new node:
(carCollection:CarCollection) - Car collection node?
A Parking could have a relationship to the car collection node, which in turn holds all the cars. This should keep a simple query performed on the Parking node itself (say you want to query the number of available spots) from losing performance.
Is this a good idea? Or is this bogus and should you model it as it is? Does this even influence performance?
If anyone can provide a link or book with some graph modelling best practices, that would be awesome as well :).
Thx in advance.
Gr
Kwinten
Anyhow, there is no performance enhancer once you need to have a million nodes, one for each car.
If you simply query your parking node for just one car, it will be as fast as if there were only one car in the collection.
If you need to return all 1 million cars, then there is no enhancer (the main problem, however, would simply be the network connection streaming all that data).
You can play with labels, but I suggest keeping the millions of relationships directly on the parking node. If you could provide us with an example scenario with a query, then maybe we can figure something out.
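For illustration, here is roughly what I mean, with made-up labels, relationship types, and properties:

// fetching one specific car stays cheap regardless of how many cars are parked
MATCH (p:Parking {name: 'P1'})-[:CONTAINS]->(c:Car {plate: 'ABC-123'})
RETURN c

// counting everything parked has to touch every relationship,
// with or without a collection node in between
MATCH (p:Parking {name: 'P1'})-[:CONTAINS]->(c:Car)
RETURN count(c) AS parkedCars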
I am designing a graph database for eligibility rules. Some eligibility rules require that a user select 2 particular products (Product A and Product B) to qualify for Product C.
Is it possible to create a graph edge with 2 starting nodes?
I would think this would break what is the fundamental building block of a graph DB: its adjacency list. But if it were possible, it would be very powerful for my application.
Update 6/16
More specifically, I'm looking to create a directed edge with 2 starting nodes, and 1 ending node. So, in biz rules terms: IF Node=A AND Node=B THEN Node=C. The real world relationship is this: If customer buys Product A and Product B, then customer qualifies for Product C.
Usually, to model a hypergraph in Neo4j, you end up creating an intermediate "group node" that connects all of the nodes you want to connect, then bridging off of that node to the other node. It's not a true hypergraph, but rather a representation of a hypergraph using the tools provided.
Here's an example:
http://www.markhneedham.com/blog/2013/10/22/neo4j-modelling-hyper-edges-in-a-property-graph/
Yes, you can have multiple starting nodes in Neo4j; I'm not sure about other graph DBs.
START a=node(0), b=node(1)
RETURN a,b
You should refer to http://docs.neo4j.org/chunked/stable/query-start.html for more details. Starting from Neo4j 2.0, the START clause is optional; Cypher will try to infer starting points from your query based on labels and WHERE clauses.
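For example, in Neo4j 2.0+ the same multiple starting points can be written without START, letting Cypher infer them (the label and property here are made up):

MATCH (a:Person {name: 'A'}), (b:Person {name: 'B'})
RETURN a, b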
Edit
I have edited the answer based on the updated question. What you need is a hypergraph. As Wes Freeman mentioned, to model a hypergraph in Neo4j you will need to create an intermediate node that connects your two nodes and the third node. In your scenario, a user will have a PURCHASED relationship with the two products (A and B), kinda like (:User {Id: 1})-[:PURCHASED]->(:Product {Name: 'A'}). Then you will have to create an intermediate node like ProductQualifier (I am very bad at naming things) with a relationship from the user, like (:User {Id: 1})-[:QUALIFIES]->(:ProductQualifier {Id: 1}). The ProductQualifier will then have 3 relationships: two to Products A and B respectively, and a third to Product C:
(:Product {Name: 'B'})<-[:HAS]-(:ProductQualifier {Id:1})-[:HAS]->(:Product {Name: 'A'})
(:ProductQualifier {Id:1})-[:ELIGIBLE]->(:Product {Name: 'C'})
This ought to do what you want.
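For instance, a sketch of the eligibility check against this model (using the hypothetical labels above) could be:

// find users who purchased every product a qualifier requires,
// then follow the qualifier to the product they become eligible for
MATCH (u:User)-[:PURCHASED]->(p:Product)<-[:HAS]-(q:ProductQualifier)
WITH u, q, count(DISTINCT p) AS purchased
MATCH (q)-[:HAS]->(required:Product)
WITH u, q, purchased, count(required) AS needed
WHERE purchased = needed
MATCH (q)-[:ELIGIBLE]->(c:Product)
RETURN u, c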
A second approach you can take is to use a database that inherently supports hypergraphs, something like HyperGraphDB, thus discarding the burden of creating extra nodes. I haven't had any occasion to use it, though I have wanted to try it for quite some time, so I don't know much about its APIs or limitations; however, it is a fairly well-known graph database.
Note: as mentioned, I am very bad at naming things. You should probably change the label names to something more suitable to your business model.
I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011?' or (assuming each contact is part of a group) 'which group has the most connections to other groups?', etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that, for any given date, will generate a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant as well -- except for the graph-generation part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: e.g. for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached, which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4j.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4j and retrieve/interact with them separately? I would then create and store a separate social graph for each date.
or
add valid-from and valid-to timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st", I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st, then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graph DB can contain as many different subgraphs as you like. You only have to keep that in mind when doing global operations (like index queries), but there you can do compound queries that include the timestamped properties as well to limit the results.
One way of doing that is, as you said, adding temporal information to the edges to represent the structure of the graph for a given date; you can then traverse the structure of the graph as it was back then.
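A sketch of such a time-scoped traversal in Cypher (the relationship type and the valid_from/valid_to properties are hypothetical):

// neighbours of contact 42 as the graph looked on a given date
MATCH (a:Contact {id: 42})-[r:CONNECTED]->(b:Contact)
WHERE r.valid_from <= $date AND (r.valid_to IS NULL OR $date < r.valid_to)
RETURN b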
The reference node has a different meaning in Neo4j.
Using category nodes per day (linking them together, and also aggregating them into higher-level timespans) is a more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries.)
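A minimal sketch of that kind of in-graph time index, with made-up labels and relationship types (two separate statements):

// day nodes aggregated under a month and chained in order
CREATE (m:Month {value: '2011-05'})-[:HAS_DAY]->(d1:Day {value: '2011-05-01'}),
       (m)-[:HAS_DAY]->(d2:Day {value: '2011-05-02'}),
       (d1)-[:NEXT]->(d2)

// attach whatever was recorded on a day to its day node, so traversals
// can start from the day node instead of an index lookup
MATCH (d:Day {value: '2011-05-01'}), (c:Contact {id: 42})
CREATE (d)-[:RECORDED]->(c)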
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes are also different (e.g. with changing properties), you could either duplicate them (effectively creating different subgraphs) or create a linked list of history nodes on each node that contains just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph database is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after you get your answer. You could also configure a proxy server and send 2 parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing and doing some analytics over data sets, it can definitely answer your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
I have a huge directed graph: it consists of 1.6 million nodes and 30 million edges. I want users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have the graph stored in a PostgreSQL database. But that solution is not very efficient or elegant; I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested that I use a graph DB like Neo4j or AllegroGraph. However, the free version of AllegroGraph is limited to 50 million nodes, and it also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j, on the other hand, has only a very low-level API (and the Python interface is not mature yet). Both of them seem better suited to problems where nodes and edges are frequently added to or removed from a graph. For a simple search on a graph, these graph DBs seem too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process storing the whole graph (500 MB to 1 GB) in memory. The clients could then query the server process and traverse the graph very quickly, since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge read-only graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open-source document-graph DBMS with a commercially friendly license (Apache 2). It has a simple API, a SQL-like language, ACID transactions, and support for the Gremlin graph language.
Its SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friends" are traversed recursively up to the 7th level of depth.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations, where I am doing many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB-based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will do so.
However, my data requirements are much larger than yours, at > 3 GB, which is more than could fit in my available memory. As a result, the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index, in which case the whole thing is stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is essentially a list of its linked nodes, it seems to me that a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading, and there's memcached if you want it even faster.
It's distributed, so the size doesn't affect efficiency.
Of course, if you somehow rely on a relational DB to find the path, it won't work for you...
And I just noticed that the question is 4 months old.
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.