I am designing a graph database for eligibility rules. Some eligibility rules require that a user select 2 particular products (Product A and Product B) to qualify for Product C.
Is it possible to create a graph edge with 2 starting nodes?
I would expect this to break what I understand to be the fundamental building block of a graph db: its adjacency list. But if it were possible, it would be very powerful for my application.
Update 6/16
More specifically, I'm looking to create a directed edge with 2 starting nodes, and 1 ending node. So, in biz rules terms: IF Node=A AND Node=B THEN Node=C. The real world relationship is this: If customer buys Product A and Product B, then customer qualifies for Product C.
Usually, to model a hypergraph in Neo4j, you end up creating an intermediate "group node" that connects all of the nodes you want to connect, then bridging off of that node to the other node. It's not a true hypergraph, but rather a representation of a hypergraph using the tools provided.
Here's an example:
http://www.markhneedham.com/blog/2013/10/22/neo4j-modelling-hyper-edges-in-a-property-graph/
Yes, you can have multiple starting nodes in Neo4j; I am not sure about other graph databases.
START a=node(0), b=node(1)
RETURN a,b
You should refer to http://docs.neo4j.org/chunked/stable/query-start.html for more details. Starting with Neo4j 2.0, the START clause is optional: Cypher will try to infer starting points from your query based on labels and the WHERE clause.
Edit
I have edited the answer based on the updated question. What you need is a hypergraph. As Wes Freeman mentioned, to model a hypergraph in Neo4j you need to create an intermediate node that connects your two starting nodes to the third node. In your scenario a user will have a PURCHASED relationship with each of the two products (A and B), like (:User {Id: 1})-[:PURCHASED]->(:Product {Name: 'A'}). Then you create an intermediate node such as ProductQualifier (I am very bad at naming things) with a relationship from the user, like (:User {Id: 1})-[:QUALIFIES]->(:ProductQualifier {Id: 1}). The ProductQualifier will then have 3 relationships: two to Product A and Product B respectively, and a third to Product C:
(:Product {Name: 'B'})<-[:HAS]-(:ProductQualifier {Id:1})-[:HAS]->(:Product {Name: 'A'})
(:ProductQualifier {Id:1})-[:ELIGIBLE]->(:Product {Name: 'C'})
This ought to do what you want.
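The intermediate-node idea can be sketched outside any database. The following Python is a minimal in-memory illustration, not Neo4j code: the names `QUALIFIERS` and `eligible_products` are hypothetical, and the single rule mirrors the "A and B unlock C" example from the question.

```python
# Each qualifier node stands in for one hyperedge:
# it HAS a set of required products and makes one product ELIGIBLE.
QUALIFIERS = [
    {"id": 1, "has": {"A", "B"}, "eligible": "C"},
]


def eligible_products(purchased):
    """Return the products unlocked by the given set of purchased products."""
    purchased = set(purchased)
    return {
        q["eligible"]
        for q in QUALIFIERS
        if q["has"] <= purchased  # every required product was purchased
    }


print(eligible_products({"A", "B"}))  # {'C'}
print(eligible_products({"A"}))       # set()
```

Adding another rule is just another qualifier entry, which is the same reason the extra node is convenient in the graph: the AND condition lives in one place.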
A second approach is to use a database that inherently supports hypergraphs, such as HyperGraphDB, which removes the burden of creating the extra node. I haven't had occasion to use it, though I have wanted to try it for quite some time, so I don't know much about its APIs or its limitations; however, it is a fairly well-known graph database.
Note: As mentioned, I am very bad at naming things. You should probably change the label names to something more suitable for your business model.
Related
Hi, I am new to graph database modeling and have some doubts about expressing an endorsement for a service provided by a Person. The use case is the following: PersonA gives an Endorsement to a Service provided by PersonB.
The key point is that if I am the recipient of the endorsement, I would like to know who has endorsed me. I have come up with several scenarios for how I could potentially do that, but because of my lack of experience I have doubts about which would be the best approach.
Scenario 1:
The endorsement is expressed directly as a relationship, and the service is a property of that relationship. So it will look like:
PersonA-------ENDORSE{service}--->PersonB
Scenario 2:
I model an entity named Service. The problem is that when I create the "ENDORSE" relationship to the Service, I lose the information about who I am endorsing, so I would have to keep a property on the relationship saying who I am endorsing. PersonB would then acquire the endorsement for the Service, but he would not know who actually gave it unless that is stored as a property too. So it will look like this:
PERSONA----ENDORSE{personB}--->Service------ENDORSMENT{personA}--->PERSONB
Does this make sense?
Scenario 3:
I normalize the second relationship "ENDORSMENT" and drop personA as a property, but then I need to query every Person to find out whom they have endorsed.
How would you model this kind of relationship?
Two important principles for validating a data model for a graph database:
if an entity or fact can be used more than once, then it should be stored as a node
if the relationship between two nodes needs to store node identifiers, then this relationship must be turned into a node
So @Raj pointed in the right direction, in which case the model might look like this:
I recommend you read this:
https://neo4j.com/graph-databases-book/
http://patterns.dataincubator.org/book/
The second approach looks good, and you don't have to add these properties to the relationships.
It's possible to get the person A who endorsed person B for service S.
The only issue is that there will be multiple nodes for any given service S. If that's not acceptable, you can replace the Service node in the second approach with an Endorse node E and connect E to a single service node S.
So there will be four types of nodes.
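As a rough illustration of why the Endorse node helps, here is a small in-memory Python sketch (not Neo4j code). The `Endorsement` tuple, the `endorsers_of` helper and the service names are hypothetical sample data; the point is only that one endorsement record links endorser, recipient and service, so no person identifiers need to live on relationships.

```python
from collections import namedtuple

# One Endorsement node links endorser, recipient and service,
# so each Service exists once and no person ids are stored on relationships.
Endorsement = namedtuple("Endorsement", ["endorser", "recipient", "service"])

endorsements = [
    Endorsement("PersonA", "PersonB", "Plumbing"),
    Endorsement("PersonC", "PersonB", "Plumbing"),
    Endorsement("PersonA", "PersonC", "Painting"),
]


def endorsers_of(recipient, service=None):
    """Who endorsed `recipient`, optionally restricted to one service."""
    return sorted(
        e.endorser
        for e in endorsements
        if e.recipient == recipient and (service is None or e.service == service)
    )


print(endorsers_of("PersonB"))  # ['PersonA', 'PersonC']
```

In the graph, the same lookup would start at the recipient, hop to the Endorse nodes pointing at them, and then hop to the endorsing persons.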
EDIT:
Adding an image for clarification.
Rename REL1 and REL2 as you wish.
@Stdob suggested some good names for these relationships.
I'm currently working on my first application that uses a Graph database (Neo4J). I'm in the process of modelling my graph on a whiteboard. My colleague and I are in a pickle on whether or not we should introduce a 'collection node'.
We have something like this (Cypher syntax, fictitious example):
(parking:Parking) - Parking node
(car:Car) - Car node
Obviously, a Parking can have multiple Cars; let's say it can have up to 1 million cars.
Is it, in this case, better to introduce a new node:
(carCollection:CarCollection) - Car collection node?
A Parking could have a relationship to the 'Car collection node', which in turn can have a lot of cars. This should prevent a simple query on the Parking node itself (let's say you want to query the number of available seats) from losing performance.
Is this a good idea? Or is this bogus and should you model it as it is, and does this not influence performance?
If anyone can provide a link or book with some graph modelling best practices, that would be awesome as well :).
Thx in advance.
Gr
Kwinten
In any case, a collection node cannot improve performance when you still need 1 million nodes, one for each car.
If you query your Parking node for just one car, it will be as fast as if there were only 1 car in the car collection.
If you need to return all 1 million cars, then nothing will speed that up (the main problem, however, would simply be the network connection needed to stream all the data).
You can play with labels, but I suggest keeping the millions of relationships directly on the Parking node. If you could provide us with an example scenario with a query, then maybe we can figure something out.
I have an interesting situation. I am allowing users to provide their own data sources to be imported into neo4j. The data sources could be the same across different users, but I would like cypher queries to only query nodes and relations specified by a particular user's sources.
I can think of several ways to do this:
Separate neo4j instances for each user
Tag nodes and relationships by user
Currently node duplicates are prevented by indexes, so I would have to alter that approach, since a node which already exists simply gets a new relationship to it. The number of relationships to a node is used in my analysis, so separating relationships by user is important.
I will have to update an existing graph database to account for these new attributes. I'm thinking that tagging relationships might be the way to go. Any thoughts pro/con against this approach? This way I can include the user tag as a relationship parameter.
Thoughts?
Henry
You can tag all your users with labels and use these even to tag the source:
http://docs.neo4j.org/chunked/preview/query-match.html#match-get-all-nodes-with-a-label
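To make the relationship-tagging idea from the question concrete, here is a small in-memory Python sketch (not Neo4j code). The tuple layout, the user names and the `degree_for_user` helper are all hypothetical; it only shows that shared nodes plus a per-relationship user tag are enough to compute per-user relationship counts.

```python
from collections import Counter

# Shared nodes, with every relationship tagged by the user whose source produced it.
relationships = [
    ("n1", "n2", "alice"),
    ("n1", "n3", "alice"),
    ("n1", "n2", "bob"),
]


def degree_for_user(user):
    """Count relationships per node, looking only at one user's relationships."""
    counts = Counter()
    for source, target, owner in relationships:
        if owner == user:
            counts[source] += 1
            counts[target] += 1
    return dict(counts)


print(degree_for_user("alice"))  # {'n1': 2, 'n2': 1, 'n3': 1}
print(degree_for_user("bob"))    # {'n1': 1, 'n2': 1}
```

The trade-off is that every query has to carry the user filter; separate instances avoid that at the cost of duplicating shared nodes.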
While using graph databases (in my case Neo4j), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to the attribute list of a node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the database from different nodes up to a depth of 4 and examining the information through connected nodes or attributes (depending on which approach is used).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since the relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index these values, instead of having to fall back on filtering by property values or indexing the property with an expensive index lookup.
The first one is much better since you're querying on entities such as Stanford, and that entity is related to many person nodes. In my opinion, modeling as nodes is more intuitive and easier to query. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important and drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
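A small in-memory Python sketch (not Neo4j code) of the node-based model, using the example query from the question. The `FRIEND` and `ATTENDED` maps, the person names and the `friends_who_attended` helper are hypothetical; the point is that with Stanford as a node, the query becomes an intersection of two neighbourhoods rather than a scan over string properties.

```python
# Relationships stored as adjacency sets; "Stanford" is a node, not a string property.
FRIEND = {
    "John": {"Mary", "Paul", "Ann"},
}
ATTENDED = {
    "Stanford": {"Mary", "Ann", "Zoe"},  # university node -> people who attended
}


def friends_who_attended(person, university):
    """Intersect the person's friends with the people linked to the university node."""
    return sorted(FRIEND.get(person, set()) & ATTENDED.get(university, set()))


print(friends_who_attended("John", "Stanford"))  # ['Ann', 'Mary']
```

With the attribute-based model, the equivalent lookup would have to read a property on every friend, and a query that starts from Stanford would have no node to start at.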
I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that generates a graph in memory for any given date. I could then run whatever stat I wanted against that graph. This is much easier to understand and, in general, much more performant as well, except for the part that generates the graph.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4J and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-to and valid-from timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, to add temporal information to the edges to represent the structure of the graph at a given date; you can then traverse the graph as it was at that time.
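Here is a minimal in-memory Python sketch of that filtering step (not Neo4j code). The edge layout, the sample dates and the `graph_at` helper are hypothetical, and it assumes each edge carries a valid-from date and an optional valid-to date, with the valid-to date treated as exclusive.

```python
from datetime import date

# (source, target, valid_from, valid_to); valid_to=None means still valid.
EDGES = [
    ("c1", "c2", date(2011, 1, 10), date(2011, 4, 1)),
    ("c1", "c3", date(2011, 3, 5), None),
    ("c2", "c3", date(2011, 6, 1), None),
]


def graph_at(when):
    """Return the edges that were valid on the given date."""
    return [
        (src, dst)
        for src, dst, valid_from, valid_to in EDGES
        if valid_from <= when and (valid_to is None or when < valid_to)
    ]


print(graph_at(date(2011, 5, 1)))  # [('c1', 'c3')]
print(graph_at(date(2011, 2, 1)))  # [('c1', 'c2')]
```

In a traversal, the same predicate would be applied to each relationship before following it, so a single stored graph can answer questions about any reference date.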
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes are also different (e.g. changing properties), you could either duplicate them, effectively creating different subgraphs, or create a connected list of history nodes on each node that contain just the changes (or the full snapshot, depending on your requirements).
Your domain sounds like a very good fit for a graph database. If you have more detailed questions, feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it into memory for the query, and closes it after you get your answer. You could also configure a proxy server and send 2 parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing data sets and running some analytics over them, it can definitely meet your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).