Preventing cycles in a Neo4j graph where relationships are dated

Note: I apologize for not being able to include the images directly in the post; I don't have enough reputation on Stack Overflow yet.
I'm brand new to graph databases, but I'm trying to understand if a graph database is suited for my problem. Here we go!
I have a set of users who can relate to each other via a "Parent" relationship (i.e., they can be built into a tree / hierarchy). The "Parent" relationship from one user to another is said to "begin" as of a certain date, and the relationship only "ends" if/when another "Parent" relationship exists for the same child user with a later date.
Every user can have <= 1 parent as of any particular date and >= 0 children, i.e., at most one parent at a time and no limit on the number of children.
I've read blog posts dealing with dated relationships, but they don't seem to address this level of complexity:
https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
https://maxdemarzi.com/2017/05/24/flight-search-with-neo4j/
My challenge:
For a given set of users with existing "Parent" relationships, determine whether a new "Parent" relationship can be added "as of" a certain date WITHOUT creating a cycle anywhere in the timeline.
To help visualize an example, imagine we have four users Alice, Bob, Carlos, and David.
| User   | Date (MM/DD/YYYY) | Parent |
|--------|-------------------|--------|
| Alice  | 09/13/2012        | Bob    |
| Alice  | 04/01/2021        | David  |
| Bob    | 01/31/2020        | Carlos |
| Carlos | 02/14/2008        | David  |
Here is a (highly abstract) picture representing the current state of the data (time flows to the right):
[Initial state of the data as timeline]
https://i.stack.imgur.com/qdcbL.png
So in this example Alice has Bob as a parent from 9/13/2012 until 4/1/2021, at which point she begins to have David as a parent. Bob has no parent until 1/31/2020, at which point he gets Carlos as a parent. Etc.
I need to be able to determine whether an update/insert will create a cycle in the "parent" hierarchy at any point in time. So, for example, I'd like to be able to determine that it would be INVALID to set Carlos's parent to be Alice as of 10/22/2020 because then there would be a cycle in the hierarchy for the period between 10/22/2020 and 4/1/2021 (i.e., Alice-->Bob-->Carlos-->Alice). To help visualize it:
[Invalid addition creates a cycle in the timeline]
https://i.stack.imgur.com/xA2vv.png
But I also need to be able to determine that it would be VALID to set Carlos's parent to Alice as of 10/22/2021, as drawn here:
[Valid addition with no cycles in timeline]
https://i.stack.imgur.com/9u0P4.png
In terms of modeling the data, I started by thinking of two different models.
First:
I tried having my only nodes be "Users," and having my "Parent" relationships include a date in the relationship type. Since the range of dates is huge and the dates themselves are not known in advance, I'm not sure this is a good idea but decided to give it a shot anyway.
Diagram:
[graph diagram with dated relationships]
https://i.stack.imgur.com/ZuPDR.png
Cypher:
CREATE (alice:User {name: "Alice"})-[:P_2012_09_13]->(bob:User {name: "Bob"}),
       (bob)-[:P_2020_01_31]->(carlos:User {name: "Carlos"}),
       (carlos)-[:P_2008_02_14]->(david:User {name: "David"}),
       (alice)-[:P_2021_04_01]->(david)
Second:
I tried creating "UserDay" nodes to capture the date element, thereby reducing the range of relationship types to only two (i.e., a 1:many "HAS" relationship from User to UserDay, then a 1:1 "P" relationship from UserDay to User).
Diagram:
[graph diagram with user-days]
https://i.stack.imgur.com/W60bp.png
Cypher:
CREATE (alice:User {name: "Alice"}), (bob:User {name: "Bob"}),
       (carlos:User {name: "Carlos"}), (david:User {name: "David"}),
       (alice)-[:HAS]->(:UserDay {date: "2012-09-13"})-[:P]->(bob),
       (alice)-[:HAS]->(:UserDay {date: "2021-04-01"})-[:P]->(david),
       (bob)-[:HAS]->(:UserDay {date: "2020-01-31"})-[:P]->(carlos),
       (carlos)-[:HAS]->(:UserDay {date: "2008-02-14"})-[:P]->(david)
Given a source User, destination User, and start date, I need to be able to determine if a cycle would be created in the hierarchy for any time in the timeline.
Carlos, Alice, 10/22/2020 ----> should be invalid
Carlos, Alice, 10/22/2021 ----> should be valid
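To make the expected behaviour concrete, here is a minimal Python sketch of the check outside of any database. It assumes the existing data is already cycle-free and that each user's entries are sorted by date; the names `history`, `parent_at`, and `creates_cycle` are made up for illustration. The idea is that parent pointers only change on start dates, so it should be enough to test the new start date plus every existing start date that falls inside the interval the new relationship would cover.

```python
from datetime import date

# Example data from the question: user -> [(start_date, parent), ...], sorted by date.
history = {
    "Alice": [(date(2012, 9, 13), "Bob"), (date(2021, 4, 1), "David")],
    "Bob": [(date(2020, 1, 31), "Carlos")],
    "Carlos": [(date(2008, 2, 14), "David")],
}


def parent_at(history, user, when):
    """Return the parent in effect for `user` on `when`, or None if there is none."""
    current = None
    for start, parent in history.get(user, []):
        if start > when:
            break
        current = parent
    return current


def creates_cycle(history, child, new_parent, start):
    """Check whether child -> new_parent starting on `start` ever closes a loop."""
    # The new relationship ends when the child's next later relationship begins.
    later = [d for d, _ in history.get(child, []) if d > start]
    end = min(later) if later else None

    # Parent pointers only change on start dates, so test the new start date
    # plus every existing start date that falls inside the new interval.
    checkpoints = {start}
    for entries in history.values():
        for d, _ in entries:
            if d > start and (end is None or d < end):
                checkpoints.add(d)

    for when in sorted(checkpoints):
        node, seen = new_parent, set()
        while node is not None and node not in seen:
            if node == child:
                return True  # walking up from the new parent leads back to the child
            seen.add(node)
            node = parent_at(history, node, when)
    return False


print(creates_cycle(history, "Carlos", "Alice", date(2020, 10, 22)))  # True
print(creates_cycle(history, "Carlos", "Alice", date(2021, 10, 22)))  # False
```

Because every user has at most one parent at a time, each check is a walk up a single chain rather than a general graph search.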
I've been spinning my wheels reading through the Neo4j docs and googling, and finally decided to ask my very first question on Stack Overflow! Please let me know if you have any questions or if anything I've said is unclear.
Thanks in advance!

Related

How to query Gremlin when multiple connections between nodes are present

I'm trying to build a suggestion engine using Gremlin but I'm having a hard time trying to understand how to create a query when multiple nodes are connected by different intermediate nodes.
Playground:
https://gremlify.com/alxrvpfnlo9/2
Graph:
In this simple example I have two users, both like cheese and bread. But User2 also likes sandwiches, which seems a good suggestion for User1 as he shares some common interests with User2
The question I'm trying to answer is: "What can I suggest to User1 based on what other users like?"
The answer should be: everything liked by other users who like the same things as User1, excluding what User1 already likes. In this case it should return Sandwich.
So far I have this query:
g.V(2448600).as('user1').
  out().as('user1Likes').
  in().where(neq('user1')).       // to get to User2
  out().where(neq('user1Likes'))  // to get to what User2 likes, excluding items that User1 likes
Which returns:
Sandwich, bread, Sandwich (again), cheese
I think it returns that data because it walks through the graph via the Cheese node first, so Bread is not included in 'user1Likes' on that path and is therefore not excluded from the final result. Then it walks through the Bread node, so on that path Cheese looks like a good suggestion.
Any ideas or suggestions on how to write that query? Keep in mind that it should scale to multiple users and ingredients.
I suggest that you model your problem differently. Normally the vertex label is used to determine the type of the entity, not to identify the entity. In your case, I think you need two vertex labels: "user" and "product".
Here is the code that creates the graph.
g.addV('user').property('name', 'User1').as('user1').
addV('user').property('name', 'User2').as('user2').
addV('product').property('name', 'Cheese').as('cheese').
addV('product').property('name', 'Bread').as('bread').
addV('product').property('name', 'Sandwiches').as('sandwiches').
addE('likes').from('user1').to('cheese').
addE('likes').from('user1').to('bread').
addE('likes').from('user2').to('cheese').
addE('likes').from('user2').to('bread').
addE('likes').from('user2').to('sandwiches')
And here is the traversal that gets the recommended products for "User1".
g.V().has('user', 'name', 'User1').as('user1').
out('likes').aggregate('user1Likes').
in('likes').
where(neq('user1')).
dedup().
out('likes').
where(without('user1Likes')).
dedup()
The aggregate step aggregates all the products liked by "User1" into a collection named "user1Likes".
The without predicate passes only the vertices that are not within the collection "user1Likes".
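The same logic can be sketched with plain Python sets, which may make the difference from the original query easier to see: the user's likes are collected as a whole first and then subtracted from what similar users like, instead of being compared one path at a time. The `likes` dictionary and the `recommend` function are illustrative names only, not part of any Gremlin API.

```python
# user -> set of liked products, mirroring the sample graph above
likes = {
    "User1": {"Cheese", "Bread"},
    "User2": {"Cheese", "Bread", "Sandwiches"},
}


def recommend(likes, user):
    """Suggest products liked by users who share at least one like with `user`."""
    own = likes[user]  # plays the role of aggregate('user1Likes')
    # Other users with at least one product in common (the in('likes') hop).
    similar = [u for u, items in likes.items() if u != user and items & own]
    suggestions = set()
    for other in similar:
        # Keep only products outside the whole collection, like without('user1Likes').
        suggestions |= likes[other] - own
    return suggestions


print(recommend(likes, "User1"))  # {'Sandwiches'}
```

Using a set for the result also covers the two dedup() steps, since a product reached through several shared likes is only kept once.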

Janusgraph - How to hide the edge relation between two vertices and establish / retrieve again based on a condition?

I'm new to the JanusGraph database. I have a requirement where I need to hide the relation (edge) between two vertices without dropping it, and later I should be able to retrieve or re-establish the same relation between those vertices based on a condition.
I only know how to drop edges, but I don't know how to retrieve or restore the relation again. Could you please help me out here?
Thanks a lot for your time.
If you want to 'restore' the connections, I think you shouldn't drop them at all.
Just keep a property on the edge that indicates the edge state (active/inactive), or maybe keep a start and end date on the edge.
This way, when you traverse your graph you need to make sure to use only the active edges, but the old ones can still easily be found if you want to restore them.
For example:
g.addV('person').property('id', 'bob').property('name', 'Bob')
g.addV('person').property('id', 'alice').property('name', 'Alice')
g.addV('person').property('id', 'eve').property('name', 'Eve')
g.V('bob').addE('friend').to(g.V('alice'))
g.V('bob').addE('friend').to(g.V('eve'))
So Bob is friends with Alice and Eve:
g.V('bob').out('friend').values("name")
==>Alice
==>Eve
Let's say Bob and Alice had a falling-out and are no longer friends:
g.V('bob').outE('friend').where(inV().hasId('alice')).property('status', 'inactive')
Now you can query only Bob's active friends, without dropping the old edges:
g.V('bob').outE('friend').not(has('status', 'inactive')).inV().values("name")
==>Eve
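The hide-and-restore round trip can be sketched in plain Python to show that nothing is lost when an edge is only flagged. This is an in-memory illustration, not JanusGraph code; `edges`, `set_status`, and `active_friends` are hypothetical names.

```python
# Each edge keeps a status instead of being deleted.
edges = [
    {"from": "bob", "to": "alice", "label": "friend", "status": "active"},
    {"from": "bob", "to": "eve", "label": "friend", "status": "active"},
]


def set_status(edges, src, dst, status):
    """Flag the edge src -> dst as 'active' or 'inactive' without removing it."""
    for edge in edges:
        if edge["from"] == src and edge["to"] == dst:
            edge["status"] = status


def active_friends(edges, src):
    """Return the targets of src's edges that are currently active."""
    return [e["to"] for e in edges if e["from"] == src and e["status"] == "active"]


set_status(edges, "bob", "alice", "inactive")  # hide the relation
hidden = active_friends(edges, "bob")

set_status(edges, "bob", "alice", "active")    # restore it later
restored = active_friends(edges, "bob")

print(hidden)    # ['eve']
print(restored)  # ['alice', 'eve']
```

Restoring is just the same property update with the opposite value, which is why keeping the edge is simpler than dropping and recreating it.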

Separate tables vs map lists - DynamoDB

I need your help. I am quite new to databases.
I'm trying to set up a table in DynamoDB to store info about TV shows. It seems pretty simple and straightforward, but I am not sure if what I am doing is correct.
So far I have this structure. I am trying to fit everything about the TV shows into one table. Seasons and episodes are contained within a list of maps within a list of maps.
Is this too much layering?
Would this present a problem in the future where some items are huge?
Should I separate some of these lists of maps to another table?
Shows table
Ideally, you should not put a potentially unbounded list in a single item in DynamoDB because you could run into the item size limit of 400 KB. Also, if you read or write one episode of one show, you consume capacity as if you were reading or writing all the episodes in the show.
Take a look at the adjacency list pattern. It’s a good choice because it will allow you to easily find the seasons in a show and the episodes in a season. You can also take a look at this slide deck. Part of the way through, it talks about hierarchical data, which is exactly what you’re dealing with.
If you can provide more information about your query patterns, I can give you more guidance on how to model your data in the table.
Update (2018-11-26)
Based on your comments, it sounds like you should use composite keys to establish hierarchical 1-N relationships.
By using a composite sort key of DataType:ItemId where ItemId is a different format depending on the data type, you have a lot of flexibility.
This approach will allow you to easily get the seasons in the show, get all episodes in all seasons, get all episodes in a particular season, or even get all episodes between season 1, episode 5 and season 2 episode 5.
hash_key | sort_key | data
----------|-----------------|----------------------------
SHOW_1234 | SHOW:SHOW_1234 | {name:"Some TV Show", ...
SHOW_1234 | SEASON:SE_01 | {descr:"In this season, the main character...
SHOW_1234 | EPISODE:S01_E01 | {...
SHOW_1234 | EPISODE:S01_E02 | {...
Here are the various key condition expressions for the queries I mentioned:
hash_key = "SHOW_1234" and begins_with(sort_key, "SEASON:") – gets all seasons
hash_key = "SHOW_1234" and begins_with(sort_key, "EPISODE:") – gets all episodes in all seasons
hash_key = "SHOW_1234" and begins_with(sort_key, "EPISODE:S02_") – gets all episodes in season 2
hash_key = "SHOW_1234" and sort_key between "EPISODE:S01_E05" and "EPISODE:S02_E05" – gets all episodes between season 1, episode 5 and season 2, episode 5
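These conditions depend on sort keys being compared as strings, so the behaviour can be sketched locally without DynamoDB. The snippet below is a plain-Python approximation: the sample keys beyond those in the table are made up, and `begins_with` and `between` are local helper functions, not SDK calls. It also shows why the range bounds need the same zero padding as the stored keys.

```python
# Sort keys for one show (hash_key = "SHOW_1234"), kept in sorted order.
sort_keys = sorted([
    "SHOW:SHOW_1234", "SEASON:SE_01", "SEASON:SE_02",
    "EPISODE:S01_E01", "EPISODE:S01_E05", "EPISODE:S01_E10",
    "EPISODE:S02_E01", "EPISODE:S02_E05", "EPISODE:S02_E06",
])


def begins_with(keys, prefix):
    """Local stand-in for a begins_with key condition."""
    return [k for k in keys if k.startswith(prefix)]


def between(keys, low, high):
    """Local stand-in for a BETWEEN key condition (inclusive on both ends)."""
    return [k for k in keys if low <= k <= high]


print(begins_with(sort_keys, "SEASON:"))
# ['SEASON:SE_01', 'SEASON:SE_02']

print(between(sort_keys, "EPISODE:S01_E05", "EPISODE:S02_E05"))
# ['EPISODE:S01_E05', 'EPISODE:S01_E10', 'EPISODE:S02_E01', 'EPISODE:S02_E05']

# An unpadded bound such as "EPISODE:S01_E5" sorts after "EPISODE:S01_E10",
# so it silently drops the season 1 episodes and lets S02_E06 through.
print(between(sort_keys, "EPISODE:S01_E5", "EPISODE:S02_E5"))
# ['EPISODE:S02_E01', 'EPISODE:S02_E05', 'EPISODE:S02_E06']
```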

Doctrine Inheritance Mapping

I have Vehicle and Person classes. I have items that can be assigned to EITHER a person or a vehicle. I looked into inheritance mapping and I am visualizing my assignment table to be the one that would use the inheritance mapping but I'm not sure if I am correct. I would expect my assignment table to look like:
ID | item_id | type (vehicle/person) | entityId (the ID of the vehicle or person)
---|---------|-----------------------|-------------------------------------------
1  | 1       | person                | 1
2  | 2       | vehicle               | 1
Can someone explain the correct mapping to use and maybe an example?
Inheritance mapping is for when you have (for example) an Address which could be a Person Address or a Business Address. You can have a Doctrine superclass Address that PersonAddress and BusinessAddress inherit from. This isn't the proper approach for your situation.
The proper solution would be to make a ManyToMany association (many people have many items) from Person to Item and another ManyToMany association (many vehicles have many items) from Vehicle to Item. If you don't want an item assigned to both a person and a vehicle, you'll have to work that logic into your controllers.
It's possible to have your Item be sort of the primary object with associations to either a person or vehicle and include the object type. This can and does work, yet it really complicates your application significantly. I've modeled a current project in that fashion, but I had good reason to go that route.

Optimize complex scenario in Cucumber

I have been working on an automation project where I have to write Cucumber tests for a search filter. The search filter works dynamically and its parameters are nested: the options for each parameter are populated based on the previous selection. For example, on selecting "Subscribers", the next dropdown offers "Name", "City", and "Network". Likewise, on selecting "Service Desk", the subsequent dropdown offers "Status", "Ticket no.", and "Assignee". I am using a Scenario Outline as below:
Scenario Outline: As a user, I can search records
  Given I am on search page
  When I search on "<category>" and "<nestedfilter>"
  Then I see records having "<category>" category

  Examples:
    | category     | nestedfilter |
    | Subscribers  | Name         |
    | Subscribers  | City         |
    | Subscribers  | Network      |
    | Service Desk | Status       |
    | Service Desk | Ticket no.   |
    | Service Desk | Assignee     |
The filter could be more complex as there could be more nested filters based on previous nested filters.
All I need to know is whether there is a more efficient way to handle this problem, for example passing a data table to the step definition, which I am not too sure about.
Thanks
If you really need the order of your items to be preserved, use a data table instead of a scenario outline.
A scenario outline is a shorthand notation for multiple scenarios. The execution order of those scenarios is not guaranteed, or at least it would be a mistake to assume a specific execution order. The order of the items in a data table will not change if you use a List as the argument, so it is a lot safer in your case.
A common mistake with Cucumber is to use Scenario Outline and example tables to do some sort of semi-exhaustive testing. This tends to hide lots of interesting things about the functionality being developed.
I would start writing single features for the searches you are working with and explore what those searches are and why they are important. So if we start with your first one we get ...
Note: all of the following assumes a background step Given I am searching
When I search on subscribers and name
Then I should see records for subscribers
and with the second one
When I search on subscribers and city
Then I should see records for subscribers
Now it becomes clear that there is a serious flaw in these scenarios, as both scenarios are looking for the same result.
So what you are actually testing is that
The subscribers search has name and city filters
A subscriber search should return subscriber results
Now you can refactor and get
When I do a subscriber search
Then I should see city, name, network filters
When I do a subscriber search
Then I should only see subscriber results
Note: This is already much more efficient, as you have reduced the number of scenarios from 3 to 2 and the number of searches you have to do from 3 to 1.
Now I have no idea if this is what you want to do, but this is what your current scenario is doing. However because you are using an Outline and Example tables you can't see this.
The fact that you have a drop-down and nested filters is an implementation detail, which describes how the user is trying to achieve what they want to achieve.
If you think of what you're trying to do as examples of how the system behaves, rather than tests, it might be easier. You're not looking for something exhaustive. You also want your scenarios to be specific, so that you're illustrating them with realistic data and concrete examples. If you would commonly have some typical data available, that's a perfect thing to set up using Background.
So for instance, I might have scenarios like:
Background:
  Given I have subscribers
    | Name | City   | Network | Status | etc.
    | Bob  | Rome   | ABC     | Alive  | ...
    | Sam  | Berlin | ABC     | Dead   | ...
    | Sue  | Berlin | DEF     | Dead   | ...
    | Ann  | Berlin | DEF     | Alive  | ...
    | Jon  | London | DEF     | Dead   | ...

Scenario: First level search
  Given I'm on the search page
  When I search for Subscribers who are in Rome
  Then I should see Bob
  But not Sue or Jon

Scenario: Second level search
  Given I'm on the search page
  When I search for Subscribers in Berlin on the ABC network
  Then I should see Sam
  But not Sue or Ann

etc.
The full-system scenarios should be just enough to understand what's going on. Don't use BDD for regression. It can help with that, but scenarios will rapidly become slow and unmaintainable if you try to cover every case. Delegate to integration and unit tests where appropriate (see "the testing pyramid").
