How To Exclude Certain Relationship for Real-Time Prediction in Neo4j - graph

I'm working on a real-time prediction problem. For example, I have a node (A) labeled (:Person) who has friends, and nodes (B) labeled (:Game).
Node (A) has liked a certain game and his friends have liked other games, so I recommend those other games to him. The problem is that I need to exclude the games he has already liked or played.
It seems like it should be easy with NOT, but I haven't found the right query yet, although I've tried many approaches.
The closest I've come is:
match (A:Person)-[:Friend]-(n:Person)
where A <> n
with distinct n
match (n)-[:LIKED]-(B:Game)-[:ON]-(:steam), (k:Person{name:'John'})
where not ((k)-[:LIKED]-(:Game)-[:ON]-(:steam))
return B
This is supposed to recommend the games John's friends liked, excluding the games John has already liked.
However, when I run this, the graph just freezes for a while and then shuts down, which is another problem I'd like to ask about.
Thanks for your help.

The last WHERE clause has very few constraints on it, which probably explains the hang/timeout. In particular, its pattern never references B, so it cannot filter individual games. It may help to have a variable name for each label, either to constrain the query or to receive the nodes, more like this:
where not ((k:Person{name:'John'})-[:LIKED]->(B:Game)-[:ON]->(C:steam))
return B
Specify relationship directions with -> (as above) in Cypher queries where possible; this usually gives the answer you want and is faster.
Adding variable names and relationship directions also makes the query easier to read and understand, and helps when you need to inspect node or relationship values while debugging.
I may be wrong, but the :steam label doesn't look right to me. What are example values? I'm wondering if you meant to have a :service node, and steam would be a node instance?
Note: if you provide a script that creates a small example of this database (e.g. a dozen nodes with these relationships), it would be easier to provide a working Cypher example.

If you want to find the distinct games on Steam that John's friends liked but John has not yet liked or played, something like this should work:
MATCH (j:Person{name:'John'})-[:Friend]-(:Person)-[:LIKED]->(g:Game)-[:ON]-(:steam)
WHERE NOT (j)-[:LIKED|PLAYED]->(g)
RETURN DISTINCT g
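The logic of that query (collect what friends liked, then subtract what John already liked or played) can be illustrated outside the database. This is only a minimal Python sketch on a made-up in-memory graph; the names, games, and the `recommend` helper are assumptions for illustration and not part of the asker's data.

```python
# Toy in-memory graph; every name and game below is a hypothetical example.
friends = {
    "John": {"Alice", "Bob"},
    "Alice": {"John"},
    "Bob": {"John"},
}
liked = {
    "John": {"Portal"},
    "Alice": {"Portal", "Hades"},
    "Bob": {"Celeste", "Hades"},
}
played = {"John": {"Celeste"}}


def recommend(person):
    """Games liked by the person's friends, minus games the person liked or played."""
    candidates = set()
    for friend in friends.get(person, set()):
        candidates |= liked.get(friend, set())
    # Same idea as WHERE NOT (j)-[:LIKED|PLAYED]->(g): the exclusion is tied to each game.
    already_seen = liked.get(person, set()) | played.get(person, set())
    return sorted(candidates - already_seen)  # a set gives DISTINCT for free


print(recommend("John"))
```

The key point is that the exclusion is checked per candidate game, which is what binding `g` in the WHERE NOT pattern achieves in Cypher.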

Related

Gremlin: Shortest logistical route between A and B while respecting schedules + other constraints

Preface
I'm new to Gremlin and working through Kelvin Lawrence's awesome eBook on the topic in order to solve a specific use-case.
Due to the sheer amount to learn, I'm asking this question to get recommendations on how I might approach the challenge so that, as I read the eBook, I'll better know the sections to which to pay extra attention.
I intend to use AWS Neptune in the pursuit of solving this, so I tagged that topic as well.
Question
Respecting departure/arrival times of legs + other constraints, can the shortest path (the real-world, logistical meaning of "path") between origin and destination be "queried" (i.e., can I use the Gremlin console with a single statement)? Or is the use-case of such complexity that I will effectively need to write a program to accomplish it?
Use-Case / Detail
I hope to answer the question:
Starting at ORIGIN on DAY, can I get to DESTINATION while respecting [CONDITIONS]?
The good news is that I only need a true/false response (so limit(1)?) and a lack of a result (e.g., []) suffices for "no".
What are the conditions?
Flight schedules need to be respected. Instead of simple flight routes (i.e., a connection exists between BOSton and DALlas), I have actual flight schedules (i.e., on Wednesday, 9 Nov 2022 at 08:40, flight XYZ will depart BOSton and then arrive DALlas at 13:15) ... consequently, if/when there are connections, I need to respect arrival and departure times + some sort of buffer (i.e., a path for which a Traveler would arrive at 13:05 and depart on another leg at 13:06 isn't actually a valid path);
Aggregate travel time / cost limits. The answer to the question needs to be "No" if a path's aggregate travel time or aggregate cost exceeds specified limits. (Here, I believe I'll need to use sack() to track the cost - financial and time - of each leg and bail out of the repeat() until loop when either is hit?)
I apologize b/c I know this isn't a good StackOverflow question, since it's not technically specific -- my hope is that, at least, some specific technical recommendations might result.
The use-case seems like the varsity / pro version of the flight routes example presented in the eBook, which is perfect for someone brand-new to Gremlin ... 😅
There are a number of ways you might model this. One way I have seen used effectively is to essentially have two graphs. The first just knows about routes; you use it to find ways to get from A to Z in x hops. Then, using the results of that first search, you look in the second graph, which tracks actual flights, for flights within the time constraints you need to impose. So there is really a data modeling question and then a query writing part. Obviously the data model should enable the queries to be as efficient as possible.
There are a couple of useful blog posts related to your question. They mention Neo4j but are really quite generic and mainly focus on the data modeling aspects of your question.
https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
https://maxdemarzi.com/2017/05/24/flight-search-with-neo4j/
I would focus on the data model, and once you have that, focus on the Gremlin queries. Amazon Neptune also now supports openCypher as an alternative property graph query language.
If you already have a data model worked out and can share a sample, I'm happy to update the answer with an example query or two.
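Until a concrete data model is available, the two-graph idea described above can be sketched in plain Python rather than Gremlin. Everything here is an assumption for illustration: the `FLIGHTS` sample data, the function names, and the choice to measure total travel time from the first departure. It is a rough sketch of the approach, not a Neptune implementation.

```python
from datetime import datetime, timedelta

# Hypothetical sample schedule: (origin, destination, departure, arrival, cost).
FLIGHTS = [
    ("BOS", "DAL", datetime(2022, 11, 9, 8, 40), datetime(2022, 11, 9, 13, 15), 200),
    ("DAL", "LAX", datetime(2022, 11, 9, 13, 20), datetime(2022, 11, 9, 15, 0), 150),
    ("DAL", "LAX", datetime(2022, 11, 9, 14, 30), datetime(2022, 11, 9, 16, 10), 180),
]


def candidate_routes(origin, destination, max_hops):
    """Stage 1: airport sequences from the routes-only graph, at most max_hops legs."""
    routes = {}
    for src, dst, *_ in FLIGHTS:
        routes.setdefault(src, set()).add(dst)
    stack = [[origin]]
    while stack:
        path = stack.pop()
        if path[-1] == destination:
            yield path
            continue
        if len(path) - 1 < max_hops:
            for nxt in routes.get(path[-1], ()):
                if nxt not in path:  # never revisit an airport
                    stack.append(path + [nxt])


def feasible(route, earliest, buffer, max_duration, max_cost):
    """Stage 2: can this route be flown with scheduled flights within the limits?"""

    def search(leg, ready_at, first_departure, cost):
        if leg == len(route) - 1:
            return True  # every leg has been covered
        for src, dst, dep, arr, price in FLIGHTS:
            if (src, dst) != (route[leg], route[leg + 1]) or dep < ready_at:
                continue  # wrong leg, or departs before the traveler is ready
            start = first_departure or dep
            if arr - start > max_duration or cost + price > max_cost:
                continue  # bail out once a time or cost limit is exceeded
            if search(leg + 1, arr + buffer, start, cost + price):
                return True
        return False

    return search(0, earliest, None, 0)


def can_travel(origin, destination, earliest, buffer, max_duration, max_cost, max_hops=3):
    """True if any candidate route is feasible (the true/false answer asked for)."""
    return any(
        feasible(route, earliest, buffer, max_duration, max_cost)
        for route in candidate_routes(origin, destination, max_hops)
    )
```

For example, `can_travel("BOS", "LAX", datetime(2022, 11, 9), timedelta(minutes=45), timedelta(hours=10), 500)` checks whether a connection with a 45-minute buffer exists within 10 hours and a budget of 500. In Gremlin, the running cost and time would presumably be carried with sack() as the question suggests.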

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but so can edge labels. In some cases you may find a label such as DIRECTED-IN-2005 is a useful shortcut to avoid checking both a label and a property on an edge.
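To make the "how much data is touched" point concrete, here is a small Python sketch comparing a label-plus-property filter with a compound label such as DIRECTED-IN-2005. The edge data and function names are invented for illustration, and a real graph database indexes edges differently; the sketch only shows why fewer edges may need to be examined.

```python
from collections import defaultdict

# Hypothetical edges: (person, label, year, movie).
edges = [
    ("Ann", "DIRECTED", 2005, "Movie A"),
    ("Ann", "DIRECTED", 2010, "Movie B"),
    ("Ann", "ACTED_IN", 2005, "Movie C"),
]

# Model 1: edges grouped by (person, label); the year is an edge property.
by_label = defaultdict(list)
for person, label, year, movie in edges:
    by_label[(person, label)].append((year, movie))

# Model 2: the year is folded into the label, e.g. DIRECTED-IN-2005.
by_compound = defaultdict(list)
for person, label, year, movie in edges:
    by_compound[(person, f"{label}-IN-{year}")].append(movie)


def directed_with_property(person, year):
    """Return (movies, edges touched) when the year is checked as a property."""
    touched = by_label.get((person, "DIRECTED"), [])
    return [movie for y, movie in touched if y == year], len(touched)


def directed_with_compound_label(person, year):
    """Return (movies, edges touched) when the year is part of the label."""
    touched = by_compound.get((person, f"DIRECTED-IN-{year}"), [])
    return list(touched), len(touched)


print(directed_with_property("Ann", 2005))
print(directed_with_compound_label("Ann", 2005))
```

Both functions return the same movies; the second count is smaller because the label alone already selects the right edges.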

Firestore data modeling for library books + wishlist data

I'm working on a library app, and am using Firestore with the following (simplified) two collections books and wishes:
Book
- locationIds[] # Libraries where the book is in stock
Wish
- userId # User who has wishlisted a book
- bookId # Book that was wishlisted
The challenge: I would like to be able to make a query which gets a list of all Book IDs which have been wishlisted by a user AND are currently available in a library.
I can imagine two ways to solve this:
APPROACH 1
Copy the locationIds[] array to each Wish, containing the IDs of every location having a copy of that book.
My query would then be (pseudocode):
collection('wishes')
.where('userId' equals myUserId)
.where('locationIds' contains myLocationId)
But I expect my Wishes collection to be pretty large, and I don't like the idea of having to update the locationIds[] of all (maybe thousands) of wishes whenever a book's location changes.
APPROACH 2
Add a wishers[] array to each Book, containing the IDs of every user who has wishlisted it.
Then the query would look something like:
collection('books')
.where('locationIds' contains myLocationId)
.where('wishers' contains myUserId)
The problem with this is that the wishers array for a particular book may grow pretty huge (I'd like to support thousands of wishes on each book), and then this becomes a mess.
Help needed
In my opinion, neither of these approaches is ideal. If I had to pick one, I would probably go with Approach 1, simply because I don't want my Book object to contain such a huge array.
I'm sure I'm not the first person to come across this sort of problem. Is there a better way?
You could try dividing the query into two separate requests. For instance, in pseudocode:
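# 1) all wishes of the current user
wishes = db.collection('wishes').where('userId', '==', myUserId).stream()
book_ids = [wish.get('bookId') for wish in wishes]
# 2) the wishlisted books, keeping only those with at least one location
books = db.collection('books').where('bookId', 'in', book_ids).stream()
result = [book.get('bookId') for book in books if book.get('locationIds')]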
Note that this is just an example; this code probably doesn't work as-is, since I haven't tested it and the in operator only supports 10 values per query. But you get the idea. A good idea would be to store the length of locationIds, or whether it is empty, in a separate attribute, so you could skip the final filtering step by querying the books with:
books = db.collection('books').where('bookId', 'in', book_ids).where('hasLocations', '==', True)
Although you would still have to iterate to only get the bookId.
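Because of the 10-value limit on in mentioned above, the book IDs would have to be queried in batches. Below is a minimal Python sketch of that batching. `fetch_books_by_ids` is a hypothetical callable standing in for the Firestore `in` query, and the in-memory `BOOKS` data exists only so the example runs on its own.

```python
def chunked(items, size=10):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def available_wishlisted_books(wished_ids, fetch_books_by_ids):
    """Collect IDs of wishlisted books that have at least one location."""
    result = []
    for batch in chunked(wished_ids):
        # One request per batch, so each "in" filter stays within the limit.
        for book in fetch_books_by_ids(batch):
            if book.get("locationIds"):
                result.append(book["bookId"])
    return result


# Hypothetical stand-in for the books collection: even-numbered books are in stock.
BOOKS = {
    f"b{i}": {"bookId": f"b{i}", "locationIds": ["loc1"] if i % 2 == 0 else []}
    for i in range(25)
}


def fake_fetch(ids):
    """Pretend to be the Firestore query; refuses batches that are too large."""
    assert len(ids) <= 10
    return [BOOKS[i] for i in ids if i in BOOKS]


print(available_wishlisted_books(list(BOOKS), fake_fetch))
```

With the real client, `fake_fetch` would be replaced by a function that runs the `in` query for one batch and returns the matching documents as dictionaries.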
Also, you should avoid using arrays in Firestore since it doesn't have native support for them, as explained in their blog.
Is it mandatory to use NoSQL? Maybe you could do this M:M relation better in SQL. Bear in mind that I'm no database expert though.

Graph database design: Should I add relationships, or just traverse

I have recently started exploring graph databases and Neo4J, and would like to work with my own data. At the moment I've hit some confusion. I've created an example image to illustrate my issue. In terms of efficiency, I'm wondering which option is better (and I want to get it right now in early days before I start handling larger amounts).
Option A: Using only the blue relationships, I can work out whether things are related to, or come under, the Ancient group. This process will be done many many times, however it is unlikely to be more than ~6 generations.
Option B: I implement the red relationships, so that it is much faster to work out if young structures belong to the Ancient group.
I'm trying not to use labels in this scenario, as I'm trying to use labels for a specific purpose to simplify my life (linking structures across separate networks), and I'm not sure if I should have a label to represent a node that already exists.
In summary, I'm wondering whether adding a whole new bunch of relationships, whilst taking more space, is worth it, or whether traversing to find all relatives is such a simple/inexpensive task that it isn't worth doing so. Or alternatively, both options are viable and this isn't a real issue at all. Thanks for reading.
I'd go with Option A. One of the strengths of Neo4j is that it traverses relationships very efficiently and quickly, and so, there is no need to materialise relationships (sometimes, relationships are materialised in complex and/or extremely large graphs, but this is not your case).
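To give a feel for how little work Option A involves, here is a Python sketch of the bounded walk along the existing (blue) relationships. The `PARENT` map, the node names, and the single-parent assumption are all hypothetical, since the original image is not available; in Cypher this would correspond to a variable-length pattern with an upper bound of about 6 hops.

```python
# Hypothetical child -> parent links standing in for the blue relationships.
PARENT = {
    "Young1": "Mid1",
    "Mid1": "Old1",
    "Old1": "Ancient",
    "Young2": "Mid2",
    "Mid2": "Other",
}


def belongs_to(node, group, max_depth=6):
    """Walk up the parent links, at most max_depth hops, looking for `group`."""
    current = node
    for _ in range(max_depth):
        current = PARENT.get(current)
        if current is None:
            return False  # reached a root without meeting the group
        if current == group:
            return True
    return False  # group not found within the allowed number of generations


print(belongs_to("Young1", "Ancient"))
print(belongs_to("Young2", "Ancient"))
```

Each check touches at most `max_depth` links, which is why materialising the red relationships is unlikely to pay off at this depth.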
Not sure why you don't want to use labels? Labels serve to group nodes into sets of the same type, and are also index-backed, which makes it much faster to find the starting point of your query (an index lookup instead of a full database scan).

Neo4j data modelling for frequent updates

Currently I have 50 000 Book nodes ((:Book {views:1, likes:1})); each Book has the properties likes and views. I am updating those counters every 10 minutes, but on each update I create a new node ((:Update {time:timestamp(), views: 1, likes: 1})) and link it to the Book ((:Update)-[:UPDATED]->(:Book)). I do that because I want to show visually how views and likes change over time.
I am storing views and likes as properties because this data comes from 3rd party source and I can't do (:User)-[:READ]->(:Book).
I am struggling to come up with a better solution for handling views and likes updates. Updates happen every 10 minutes and my graph grows very quickly. Queries like 'get last month's updates' become very slow because Neo4j doesn't use indexes if the Cypher query where clause was used with range operators > or <.
Example query 1. It takes 2sec to complete because it does over a million DB hits.
MATCH (u:Update)-[r:UPDATED]->(b:Book)
WITH b, count(r) as totalUpdates
RETURN b.id, totalUpdates
ORDER BY totalUpdates DESC
LIMIT 5
Example query 2. Takes ages, not usable at all
MATCH (u:Update)-[r:UPDATED]->(b:Book)
WHERE u.time<1448928000000 AND u.time>1446336000000
RETURN b, collect(u) AS updates
I was looking for data modelling solutions to solve this mess and was not able to find any. Now I am confused about whether Neo4j is a bad choice for this, whether I need more CPU power to handle it, or whether there is a much better way to structure a graph like this.
Any suggestions or links to useful info are very welcome! Thank you.
