I'm trying to build a suggestion engine using Gremlin but I'm having a hard time trying to understand how to create a query when multiple nodes are connected by different intermediate nodes.
Playground:
https://gremlify.com/alxrvpfnlo9/2
Graph:
In this simple example I have two users, both like cheese and bread. But User2 also likes sandwiches, which seems a good suggestion for User1 as he shares some common interests with User2
The question I'm trying to answer is: "What can I suggest to User1 based on what other users like?"
The answer should be: everything liked by other users who like the same things as User1, excluding what User1 already likes. In this case it should return Sandwich.
So far I have this query:
g.V(2448600).as('user1')
.out().as('user1Likes')
.in().where(neq('user1')) // to get to User2
.out().where(neq('user1Likes')) // to get to what User2 likes but excluding items that User1 likes
Which returns:
Sandwich, bread, Sandwich (again), cheese
I think that it returns that data because it walks through the graph by the Cheese node first, so Bread is not included in the 'user1Likes' list, thus not excluded in the final result. Then it walks through the Bread node, so cheese in this case is a good suggestion.
Any ideas/suggestions on how to write that query? Take into consideration that it should scale to multiple users and ingredients.
I suggest that you model your problem differently. Normally the vertex label is used to determine the type of the entity, not to identify the entity. In your case, I think you need two vertex labels: "user" and "product".
Here is the code that creates the graph.
g.addV('user').property('name', 'User1').as('user1').
addV('user').property('name', 'User2').as('user2').
addV('product').property('name', 'Cheese').as('cheese').
addV('product').property('name', 'Bread').as('bread').
addV('product').property('name', 'Sandwiches').as('sandwiches').
addE('likes').from('user1').to('cheese').
addE('likes').from('user1').to('bread').
addE('likes').from('user2').to('cheese').
addE('likes').from('user2').to('bread').
addE('likes').from('user2').to('sandwiches')
And here is the traversal that gets the recommended products for "User1".
g.V().has('user', 'name', 'User1').as('user1').
out('likes').aggregate('user1Likes').
in('likes').
where(neq('user1')).
dedup().
out('likes').
where(without('user1Likes')).
dedup()
The aggregate step aggregates all the products liked by "User1" into a collection named "user1Likes".
The without predicate passes only the vertices that are not within the collection "user1Likes".
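To make the difference concrete, here is a minimal plain-Python sketch (not Gremlin, and not how any graph engine is implemented) that mimics both behaviours on the same toy data. The dict `likes` and the helper names are made up for illustration. `per_path` imitates the original where(neq('user1Likes')), which only compares against the product labelled on the current path; `with_aggregate` imitates aggregate plus without, assuming the collection is fully filled before the later steps read it.

```python
# Toy copy of the graph above: user -> liked products (illustrative data only).
likes = {
    "User1": ["Cheese", "Bread"],
    "User2": ["Cheese", "Bread", "Sandwiches"],
}


def liked_by(product):
    """Users that have a 'likes' edge to the given product."""
    return [user for user, products in likes.items() if product in products]


def per_path(start):
    """Mimic where(neq('user1Likes')): each path only knows its own labelled product."""
    results = []
    for own_like in likes[start]:                # out().as('user1Likes')
        for other in liked_by(own_like):         # in()
            if other == start:                   # where(neq('user1'))
                continue
            # out().where(neq('user1Likes')): only this path's product is excluded
            results += [p for p in likes[other] if p != own_like]
    return results


def with_aggregate(start):
    """Mimic aggregate('user1Likes') ... where(without('user1Likes')).dedup()."""
    user1_likes = set(likes[start])              # the whole collection, not one label
    others = []
    for own_like in likes[start]:
        for other in liked_by(own_like):
            if other != start and other not in others:   # where(neq('user1')).dedup()
                others.append(other)
    results = []
    for other in others:
        for product in likes[other]:
            if product not in user1_likes and product not in results:
                results.append(product)
    return results


print(per_path("User1"))        # Bread and Cheese leak through, Sandwiches twice
print(with_aggregate("User1"))  # only Sandwiches
```

The first function reproduces the four results from the question, the second returns only the sandwich suggestion.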
Related
I have a person vertex, a has_vehicle edge, and a vehicle vertex, which model a vehicle ownership use case. The graph path is person -> has_vehicle -> vehicle.
I want to implement a Gremlin query which associates a vehicle to a person only if
The person does not have a vehicle
AND
The input vehicle is not associated with a person yet.
I followed the fold-coalesce-unfold pattern and came up with the following Gremlin query with nested coalesce steps:
g.V().hasLabel('person').has('name', 'Tom').as('Tom').outE('has_vehicle').fold().coalesce(
__.unfold(), // check if Tom already have a vehicle
g.V().has('vehicle', 123).as('Vehicle').inE('has_vehicle').fold().coalesce(
__.unfold(), // check if vehicle 123 is already associated with a person
__.addE('has_vehicle').from('Tom').to('Vehicle') // associate the vehicle to Tom
)
)
Is there a way to eliminate the nested coalesce? If I have multiple criteria, it would be too complex to write the query.
This might be a case where a couple of where(not(...)) patterns work better than nesting coalesce steps. For example, we might change the query as shown below.
g.V().hasLabel('person').has('name', 'Tom').as('Tom').
where(not(outE('has_vehicle'))).
V().has('vehicle', 123).as('Vehicle').
where(not(inE('has_vehicle'))).
addE('has_vehicle').from('Tom').to('Vehicle')
So long as the V steps do not fan out and yield multiple Tom or Vehicle nodes, that should work, and it is easy to extend by adding more to the where filters as needed.
As a side note, the not steps used above should work even if not wrapped by where steps, but I tend to find it just reads better as written.
This rewrite does assume that you can tolerate the query simply ending when Tom already has a vehicle (or when the vehicle is already associated with a person). In that case no vertex or edge is returned; if you ran the query with toList, you would get back an empty list, indicating that nothing was done.
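The control flow of that chain of filters can be sketched in plain Python. This is only an illustration of the "every guard must pass, otherwise return an empty list" idea, not Gremlin; the function `assign_vehicle` and the edge set are hypothetical stand-ins for the graph.

```python
def assign_vehicle(edges, person, vehicle):
    """Add a (person, vehicle) edge only if every guard passes.

    `edges` is a set of (person, vehicle) tuples standing in for has_vehicle edges.
    Returns a list holding the new edge, or an empty list if any guard fails,
    similar to what toList() would give back for the filtered traversal.
    """
    guards = [
        # where(not(outE('has_vehicle'))): the person owns nothing yet
        lambda: not any(p == person for p, _ in edges),
        # where(not(inE('has_vehicle'))): the vehicle has no owner yet
        lambda: not any(v == vehicle for _, v in edges),
    ]
    if not all(guard() for guard in guards):
        return []
    edge = (person, vehicle)
    edges.add(edge)
    return [edge]


edges = set()
print(assign_vehicle(edges, "Tom", 123))   # [('Tom', 123)]
print(assign_vehicle(edges, "Tom", 456))   # [] because Tom already has a vehicle
```

Adding another criterion means appending one more entry to the list of guards, which mirrors adding another where(not(...)) to the query instead of another level of nesting.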
I am using Gremlin on an AWS Neptune database to generate recommendations for a user to try new food items. The problem I am facing is that the recommendations need to be in the same cuisine that the user likes.
We are given three different types of nodes: "User", "the cuisine he likes" and "the category of the cuisine" that it belongs to.
In the picture above, the recommendations for "User 2" would be "Node 1" and "Node 2". However "Node 1" belongs to a different category which is why we cannot recommend that node to "User2". We can only recommend "Node 2" to the user since that is the only node that belongs to the same category as the user likes. How do I write a gremlin query to achieve the same?
Note- There are multiple nodes for a user and multiple categories that these nodes belong to.
Here's a sample dataset that we can use:
g.addV('user').property('name','ben').as('b')
.addV('user').property('name','sally').as('s')
.addV('food').property('foodname','chicken marsala').as('fvm')
.addV('food').property('foodname','shrimp diavolo').as('fsd')
.addV('food').property('foodname','kung pao chicken').as('fkpc')
.addV('food').property('foodname','mongolian beef').as('fmb')
.addV('cuisine').property('type','italian').as('ci')
.addV('cuisine').property('type','chinese').as('cc')
.addE('hasCuisine').from('fvm').to('ci')
.addE('hasCuisine').from('fsd').to('ci')
.addE('hasCuisine').from('fkpc').to('cc')
.addE('hasCuisine').from('fmb').to('cc')
.addE('eats').from('b').to('fvm')
.addE('eats').from('b').to('fsd')
.addE('eats').from('b').to('fkpc')
.addE('eats').from('b').to('fmb')
.addE('eats').from('s').to('fmb')
Let's start with the user Sally...
g.V().has('name','sally').
Then we want to find all food item nodes that Sally likes.
(Note: It is best to add edge labels to your edges here to help with navigation.)
Let's call the edge from a user to a food item, "eats". Let's also assume that the direction of the edge (they must have a direction) goes from a user to a food item. So let's traverse to all foods that they like. We'll save this to a temporary list called 'liked' that we'll use later in the query to filter out the foods that Sally already likes.
out('eats').aggregate('liked').
From this point in the graph, we need to diverge and fetch two downstream pieces of data. First, we want to go fetch the cuisines related to food items that Sally likes. We want to "hold our place" in the graph while we go fetch these items, so we use the sideEffect() step which allows us to go do something but come back to where we currently are in the graph to continue our traversal.
sideEffect(
out('hasCuisine').
dedup().
aggregate('cuisineschosen')).
Inside of the sideEffect() we want to traverse from food items to cuisines, deduplicate the list of related cuisines, and save the list of cuisines in a temporary list called 'cuisineschosen'.
Once we fetch the cuisines, we'll come back to where we were previously at the food items. We now want to go find the related users to Sally based on common food items. We also want to make sure we're not traversing back to Sally, so we'll use simplePath() here. simplePath() filters out any traverser whose path revisits a vertex it has already visited, which here drops the paths that lead back to Sally.
in('eats').
simplePath().
From here we want to find all food items that our related users like and only return the ones with a cuisine that Sally already likes. We also remove the foods that Sally already likes.
out('eats').
where(without('liked')).
where(
out('hasCuisine').
where(
within('cuisineschosen'))).
values('foodname')
NOTE: You may also want to add a dedup() here after out('eats') to only return a distinct list of food items.
Putting it all together...
g.V().has('name','sally').
out('eats').aggregate('liked').
sideEffect(
out('hasCuisine').
dedup().
aggregate('cuisineschosen')).
in('eats').
simplePath().
out('eats').
where(without('liked')).
where(
out('hasCuisine').
where(
within('cuisineschosen'))).
values('foodname')
Results:
['kung pao chicken']
At scale, you may need to use the sample() or coin() steps in Gremlin when finding related users as this can fan out really fast. Query performance is going to be based on how many objects each query needs to traverse.
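If it helps to check the logic without a graph database, the same filtering can be mimicked in plain Python on the sample dataset. This is a rough sketch, not Gremlin; the dicts `eats` and `cuisine` and the function `recommend` are illustrative names, and the sketch deduplicates results, which the query above only does if you add dedup().

```python
# Sample dataset from above as plain dicts (illustrative stand-in for the graph).
eats = {
    "ben": ["chicken marsala", "shrimp diavolo", "kung pao chicken", "mongolian beef"],
    "sally": ["mongolian beef"],
}
cuisine = {
    "chicken marsala": "italian",
    "shrimp diavolo": "italian",
    "kung pao chicken": "chinese",
    "mongolian beef": "chinese",
}


def recommend(user):
    """Foods eaten by related users, in a cuisine the user already eats."""
    liked = set(eats[user])                              # aggregate('liked')
    cuisines_chosen = {cuisine[food] for food in liked}  # aggregate('cuisineschosen')
    # in('eats').simplePath(): other users sharing at least one food
    related = sorted(
        other for other, foods in eats.items()
        if other != user and liked & set(foods)
    )
    results = []
    for other in related:
        for food in eats[other]:                         # out('eats')
            if food in liked:                            # where(without('liked'))
                continue
            if cuisine[food] not in cuisines_chosen:     # where(... within('cuisineschosen'))
                continue
            if food not in results:                      # dedup()
                results.append(food)
    return results


print(recommend("sally"))  # ['kung pao chicken']
```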
I am querying my graph where it has the following nodes:
Customer
Account
Fund
Stock
With the following relationships:
HAS (a customer HAS an account)
PURCHASED (an account PURCHASES a fund or stock)
HOLDS (a fund HOLDS a stock)
The query I am trying to achieve is returning all Customers that have accounts that hold Microsoft through a fund. The following is my query:
MATCH (c:Customer)-[h:HAS]->(a:Account)-[p:PURCHASED]-(f:Fund)-[holds:HOLDS]->(s:Stock {ticker: 'MSFT'})
WHERE exists((f)-[:HOLDS]->(s:Stock))
AND exists ((f:Fund)-[holds]->(s:Stock))
AND NOT exists((a:Account {account_type: 'Individual'})-[p:PURCHASED]->(s:Stock))
RETURN *
This almost gets me the desired results but I keep getting 2 relationships out of the Microsoft stock that is tied to an Individual account where I do not want those included.
Any help would be greatly appreciated!
Result:
Desired Result:
There are duplications in your query. Lines 2 and 3 are the same. Line 2 is a subgraph of Line 1. You are also using the variables a, p and s more than once, in line 1 and line 4. The query below is not tested, but give it a try. Please tell me whether it works for you.
MATCH (c:Customer)-[h:HAS]->(a:Account)-[p:PURCHASED]-(f:Fund)-[holds:HOLDS]->(s:Stock {ticker: 'MSFT'})
WHERE NOT exists((:Account{account_type: 'Individual'})-[:PURCHASED]->(:Stock))
RETURN *
It seems to me that you should just uncheck the "Connect result nodes" option in the Neo4j Browser:
I'm wondering if it's possible to use a sub-selection as an exclusion query in OrientDB (v2.0), or if it's necessary to export separate queries and process them in Java/PHP/etc.
For instance, say we have the following graph for Hogwarts.
Vertices
People, Houses, Classes
Edges
is_at (subclasses is_student, is_faculty), was_at (alumni), is_taking, is_teaching, belongs_to
How would we find all the alumni who aren't also faculty? Is it possible to do so as a single query or using LET somehow?
How would we find all the faculty who are teaching a course on, say, time travel, that have no students who belong to the house gryffindor?
Thanks,
Lindsay
The .size() operator should work: http://orientdb.com/docs/2.0/orientdb.wiki/SQL-Methods.html#size
select from People where out('is_faculty').size() = 0
Use out('...') or in('...') based on your graph.
How would we find all the faculty who are teaching a course on, say, time travel, that have no students who belong to the house gryffindor?
I don't have much information on your graph and classes, but that could be something like:
select from Classes where ClassName='time travel' and in('is_teaching')[Id=yourId] and in('is_taking').out('belongs_to')[Name='gryffindor'].size() = 0
Again, use in() or out() according to your graph.
A total Neo4j noob here.
I'd like to create a graph to store a set of users. A typical user is as follows:
CREATE
(node_1 {FullName:"Peter Parker",FirstName:"peter",FamilyName:"parker"}),
(node_2 {Address:"Newyork",CountryCode:"US"}),
(node_3 {Location:"Hidden"}),
(node_4 {phoneNumber:11111}),
(node_5 {InternetEmailAddress:"peter#peterland.com"})
now the problem is,
Every time I execute this I add 5 more nodes.
I know I need to use a unique key, but all the examples I saw use a unique key for a specific node. So how can I make sure a user doesn't get added if it already exists? (I can use the email address as the unique key.)
How do I update the nodes if some changes occur? For example, after a week I want to update the graph to contain the following instead of the previous one (no duplicates).
CREATE(node_1 {FullName:"Peter Parker",FirstName:"peter",FamilyName:"parker"}),(node_2 {Address:"Newyork",CountryCode:"US"}),(node_3 {Location:"public"}),(node_4 {phoneNumber:11111}),(node_5 {InternetEmailAddress:"peter#peterland.com"}),(node_6 {status:"Jailed"})
(NOTE: the new update changed location to "public" and added a new node for peter.)
Seeing as you had a load of nodes anyway.
Some of the data you have modelled as nodes are probably properties, as the other answer suggests; some are possibly correctly modelled as nodes; and one could probably form the relationship, or part of it.
Location public/hidden can be modelled in one of three ways: as a property on the Person, as a property on the relationship between the Person and the Location, or as the relationship type. To understand that, you first need to have a relationship.
Your address at the moment is another Node, I think this is correct, but possibly you would want two nodes, related something like this:
(s:State)-[:IN_COUNTRY]->(c:Country)
YMMV, and clearly that's a US-centric model, but you can extend it easily enough.
Now you could create Peter with a LIVES_IN relationship:
CREATE (p:Person{fullName:"Peter Parker"}), (s:State{name:"New York"}), (c:Country{code:"US"}),
(p)-[:LIVES_IN]->(s), (s)-[:IN_COUNTRY]->(c)
For speed, you are better off modelling two relationship types, such as LIVES_IN_PUBLIC and LIVES_IN_HIDDEN. To perform the update you want above, you then have to delete one and create the other. However, if speed is not of the essence, it is also common to use properties on the relationship.
CREATE (p:Person{fullName:"Peter Parker"}), (s:State{name:"New York"}), (c:Country{code:"US"}),
(p)-[:LIVES_IN{public:false}]->(s), (s)-[:IN_COUNTRY]->(c)
So your complete Q&A:
CREATE (p:Person {fullName:"Peter Parker",firstName:"peter",familyName:"parker", phoneNumber:11111, internetEmailAddress:"peter#peterland.com"}),
(s:State {name:"New York"}), (c:Country {code:"US"}),
(p)-[:LIVES_IN{public:false}]->(s), (s)-[:IN_COUNTRY]->(c)
MATCH (p:Person {internetEmailAddress:"peter#peterland.com"})-[li:LIVES_IN]->()
SET li.public = true, p.status = "jailed"
When adding other People you probably do not want to recreate States and Countries, rather you want to match them, and possibly Merge them, but we'll stick to Create.
MATCH (s:State{name:"New York"})
CREATE (p:Person{name:"John Smith", internetEmailAddress:"john#google.com"})-[:LIVES_IN{public:false}]->(s)
John Smith now implicitly lives in the US too as you can follow the relationship through the State Node.
Treatise complete.
I think you're modeling your data incorrectly here - you're setting up each property of the person as a separate node, which is not a good idea. You don't have any linkages between those nodes, so with this data pattern, later on you won't be able to tell what Peter Parker's address is. You're also not using node labels, which I think could really help here.
The quick answer to your question about updating nodes is that you have to MATCH them, then use SET to modify a property. So if you had a person, you might do this:
MATCH (p:Person { FullName: "Peter Parker" })
SET p.Address = "123 Fake Street"
RETURN p;
But notice I'm making assumptions about the way your data is structured. Taking the same data you provided, this might be a better way of creating it:
CREATE (node_1:Person {FullName:"Peter Parker",
FirstName:"peter",
FamilyName:"parker",
Address:"Newyork",CountryCode:"US",
Location:"Hidden",
phoneNumber:11111,
InternetEmailAddress:"peter#peterland.com"});
The difference with this suggestion is that I'm putting all the properties into a single node (instead of one property per node) and I'm applying the Person label to the node.
If you structured the data like this, then the update query I provided would work. With the data structured the way you have it, it's not possible to update Peter Parker's address, because there's no relationship between your node_1 and node_2.
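As for not adding the same user twice, the idea of one node per person, looked up by a unique key and then updated in place, can be sketched in plain Python. This is only an analogy for the match-then-set behaviour, not Neo4j code; the dict `people` and the function `upsert_person` are hypothetical, and using the email address as the key is the assumption taken from the question. In Cypher, the MERGE clause is the usual tool for this kind of create-or-match.

```python
# email -> property dict; a stand-in for Person nodes keyed by a unique email.
people = {}


def upsert_person(email, **props):
    """Create the person if the email is new, otherwise update the existing one."""
    node = people.setdefault(email, {"InternetEmailAddress": email})
    node.update(props)   # comparable to SET on the matched node
    return node


# First load: creates the node.
upsert_person("peter#peterland.com", FullName="Peter Parker", Location="Hidden")
# A week later: same key, so the existing node is updated instead of duplicated.
upsert_person("peter#peterland.com", Location="public", status="Jailed")

print(len(people))                    # 1
print(people["peter#peterland.com"])
```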