Order results by number of coincidences in edge properties - gremlin

I'm working on a recommendation system that recommends other users. The first results should be the users most "similar" to the "searcher" user. Users respond to questions, and similarity is measured by the number of questions two users answered in the same way.
The problem is that I don't know how to write the query.
In technical terms, I need to sort the users by the number of edges that have specific property values. I tried the following query and expected it to work, but it doesn't:
let query = g.V().hasLabel('user');
let search = __;
for (const question of searcher.questions) {
search = search.outE('response')
.has('questionId', question.questionId)
.has('answerId', question.answerId)
.aggregate('x')
.cap('x')
}
query = query.order().by(search.unfold().count(), order.asc);
Throws this gremlin internal error:
org.apache.tinkerpop.gremlin.process.traversal.step.util.BulkSet cannot be cast to org.apache.tinkerpop.gremlin.structure.Vertex
I also tried using a separate .by() for each question, but the result was not ordered by the number of coincidences.
How can I write this query?

When you cap() an aggregate(), it returns a BulkSet, which is a Set that keeps a count of how many times each object exists in it. It behaves like a List when you iterate through it, unrolling each object as many times as its count. You get your error because the output of cap('x') is a BulkSet, and since you are building search in a loop, you are basically calling outE('response') on that BulkSet. That's not valid, because outE() expects a Vertex, as indicated by the error.
I think you would prefer something more like:
let query = g.V().hasLabel('user').
outE('response');
let search = [];
for (const question of searcher.questions) {
search.push(__.has('questionId', question.questionId).
has('answerId', question.answerId));
}
// scope, column and order are assumed to come from gremlin.process
query = query.or(...search).
groupCount().
by(__.outV()).
order(scope.local).by(column.values, order.asc);
I may not have the javascript syntax exactly right (and I used spread syntax in my or() to just convey the idea quickly of what needs to happen) but basically the idea here is to filter edges that match your question criteria and then use groupCount() to count up those edges.
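To make the filter-then-count idea concrete, here is a small in-memory sketch in Python rather than Gremlin. The edge tuples, user names, and the rank_by_matches helper are all made up for illustration; the sketch only mirrors what the or() filter followed by groupCount() and order(local) is meant to compute.

```python
from collections import Counter

# Hypothetical 'response' edges: (user, questionId, answerId).
edges = [
    ("alice", 1, "a"), ("alice", 2, "b"), ("alice", 3, "c"),
    ("bob", 1, "a"), ("bob", 2, "x"),
    ("carol", 1, "z"),
]

# Answers given by the searching user, as (questionId, answerId) pairs.
searcher_answers = {(1, "a"), (2, "b"), (3, "c")}


def rank_by_matches(edges, wanted):
    """Count matching edges per user and sort ascending by that count."""
    # Like or(has(...).has(...), ...): keep edges matching any wanted pair.
    matching = [user for user, q, a in edges if (q, a) in wanted]
    # Like groupCount().by(outV()): one counter entry per user.
    counts = Counter(matching)
    # Like order(local).by(values, asc).
    return sorted(counts.items(), key=lambda kv: kv[1])


ranking = rank_by_matches(edges, searcher_answers)
print(ranking)
```

Users with no matching edge, like carol here, do not appear in the result at all.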
If you need to count users who have no connection then perhaps you could switch to project() - maybe like:
let query = g.V().hasLabel('user').
project('user','count').
by();
let search = [];
for (const question of searcher.questions) {
search.push(__.has('questionId', question.questionId).
has('answerId', question.answerId));
}
query = query.by(__.outE('response').or(...search).count()).
order().by('count', order.asc);
fwiw, I think you might consider a different schema for your data that might make this recommendation algorithm a bit more graph-like. A thought might be to make the question/answer a vertex (a "qa" label perhaps) and have edges go from the user vertex to the "qa" vertex. Then users directly link to the question/answers they gave. You can easily see by way of edges, a direct relationship, which users gave the same question/answer combination. That change allows the query to flow much more naturally when asking the question, "What users answered questions in the same way user 'A' did?"
g.V().has('person','name','A').
out('responds').
in('responds').
groupCount().
order(local).by(values)
With that change, we can get rid of all those has() filters because they are implied by the "responds" edges, which encode them into the graph data itself.
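As a rough illustration of why the "qa" vertex schema makes the question easier to ask, here is a Python sketch of the same user -> qa -> user walk over a plain list of edges. The data and the similar_users helper are hypothetical and only stand in for out('responds').in('responds').groupCount().

```python
from collections import Counter

# Hypothetical 'responds' edges from a user to a question/answer ("qa") vertex.
responds = [
    ("A", "q1:yes"), ("A", "q2:no"),
    ("B", "q1:yes"), ("B", "q2:no"),
    ("C", "q1:yes"), ("C", "q2:yes"),
]


def similar_users(start, responds):
    """Walk user -> qa -> user and count how often each user is reached."""
    # out('responds'): the qa vertices the start user points to.
    qa_of_start = [qa for user, qa in responds if user == start]
    counts = Counter()
    for qa in qa_of_start:
        # in('responds'): every user pointing at the same qa vertex.
        for user, other_qa in responds:
            if other_qa == qa:
                counts[user] += 1
    # order(local).by(values): ascending by count.
    return sorted(counts.items(), key=lambda kv: kv[1])


result = similar_users("A", responds)
print(result)
```

Note that the start user "A" is reached as well, since "A" also points at its own qa vertices; a real query would probably want to filter that entry out.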

Related

Upsert fails when using as() and coalesce()

I'm trying to create an upsert traversal in Gremlin. Update an edge if it exists, otherwise add a new edge.
g.V("123")
.as("user")
.V("456")
.as("post")
.inE("like")
.fold()
.coalesce(
__.unfold()
.property("likeCount", 1),
__.addE("like")
.from("user")
.to("post")
)
This returns an error.
The provided traverser does not map to a value: []->[SelectOneStep(last,post)]
I've narrowed this down to the to("post") step. From within coalesce it can't see post from as("post"). It is also unable to see user.
This is strange to me because the following does work:
g.V("123")
.as("user")
.V("456")
.as("post")
.choose(
__.inE("like"),
__.inE("like")
.property("likeCount", 1),
__.addE("like")
.from("user")
.to("post")
)
From within the choose() step I do have access to user and post.
I'd like to use the more efficient upsert pattern but can't get past this issue. I could just look up the user and post from within coalesce like so:
g.V("123")
.as("user")
.V("456")
.as("post")
.inE("like")
.fold()
.coalesce(
__.unfold()
.property("likeCount", 1),
__.V("456")
.as("post")
.V("123")
.addE("like")
.to("post")
)
But repeating that traversal seems inefficient. I need post and user in the outer traversal for other reasons.
Why can't I access user and post from within a coalesce in my first example?
The issue you are running into is that as soon as you hit the fold() step in your code you lose the path history, which means the traversal no longer knows what user or post refer to. fold() is what is known as a ReducingBarrierStep, which means that many results are collected into a single result. The way I think about it is that because you have converted many results to one, anything like aliases that were added (e.g. user and post) no longer really have meaning, as they have all been collected into a single element.
However you can rewrite your query as shown here to achieve the desired result:
g.V("456")
.inE("like")
.fold()
.coalesce(
__.unfold()
.property("likeCount", 1),
__.addE("like")
.from(V("123"))
.to(V("456"))
)
I am also not sure whether you meant to set the like count only on an existing edge, or whether you wanted to set it on the edge in either case, which would look like this:
g.V("456")
.inE("like")
.fold()
.coalesce(
__.unfold(),
__.addE("like")
.from(V("123"))
.to(V("456"))
).property("likeCount", 1)
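For readers less familiar with the fold()/coalesce()/unfold() idiom, the following Python sketch shows the get-or-create logic it expresses, using a plain list of dicts as a stand-in for the graph. The upsert_like helper and the edge layout are assumptions made for illustration only.

```python
# Hypothetical in-memory edge store: each edge is a dict.
edges = []


def upsert_like(edges, user_id, post_id):
    """Return the existing 'like' edges into post_id, or create one."""
    # inE("like").fold(): collect existing edges into a possibly empty list.
    existing = [e for e in edges if e["label"] == "like" and e["to"] == post_id]
    if existing:
        # First coalesce() branch, unfold(): use the edges already there.
        targets = existing
    else:
        # Second coalesce() branch, addE(): nothing found, so create the edge.
        new_edge = {"label": "like", "from": user_id, "to": post_id}
        edges.append(new_edge)
        targets = [new_edge]
    # property("likeCount", 1) placed after coalesce() applies in both cases.
    for e in targets:
        e["likeCount"] = 1
    return targets


upsert_like(edges, "123", "456")
upsert_like(edges, "123", "456")
print(len(edges))
```

Calling the helper twice leaves a single edge in the store, which is the behavior an upsert is after.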

TinkerPop not getting correct count type with traversal

I want to retrieve a specific range of users (required for pagination) along with the total count. The query below retrieves the list of user vertices as expected, but the total count is returned as a BulkSet:
Map<String, Object> result = gt.V().hasLabel("user").sideEffect(__.count().store("total"))
.order().by("name", Order.desc)
.range(0, 10).fold().as("list")
.select("list","total").next();
The output is as below
How do I get the correct count as a Long value instead of the BulkSet?
Paging with Gremlin is discussed here and references this blog post which provides additional information on the topic. Those resources should help you with your paging strategy.
You framed this question in terms of BulkSet, so it isn't quite a duplicate of the answer I referenced, and I will try to answer that much for you. BulkSet allows for an important traversal optimization in TinkerPop which helps reduce object propagation, thus reducing memory requirements for a particular query. It does this by holding the traverser object and its count, where the count is the number of times that object has been added to the BulkSet. Calling size() or longSize() (where the latter returns a long and the former returns an int) will return the sum of the counts and therefore the "correct" or actual count of the objects. A call to uniqueSize() will return the actual size of the set, which is the number of unique objects within it.
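The difference between the total size and the unique size can be sketched with a toy class in Python. This is not TinkerPop's implementation; BulkSetSketch and its method names are hypothetical and only imitate the behavior described above.

```python
from collections import Counter


class BulkSetSketch:
    """Toy stand-in for a bulked set: each object is stored once with a count."""

    def __init__(self):
        self._counts = Counter()

    def add(self, obj, bulk=1):
        self._counts[obj] += bulk

    def size(self):
        # Sum of all counts, i.e. how many objects were added in total.
        return sum(self._counts.values())

    def unique_size(self):
        # Number of distinct objects.
        return len(self._counts)

    def __iter__(self):
        # Iterating unrolls each object as many times as its count.
        return iter(self._counts.elements())


bs = BulkSetSketch()
for name in ["marko", "marko", "josh"]:
    bs.add(name)

print(bs.size(), bs.unique_size(), list(bs))
```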
If you want the size of the BulkSet you just need to count() it:
gt.V().hasLabel("user").sideEffect(__.count().store("total"))
.order().by("name", Order.desc)
.range(0, 10).fold().as("list")
.select("list","total")
.by()
.by(count(local))
That said, I don't think your traversal is really doing what you want. The sideEffect() is just counting the current traverser, which will simply return "1", and then you store that "1" in the list "total". At least that's what I see with TinkerGraph:
gremlin> g.V().hasLabel("person").sideEffect(count().store("total")).range(0,1).fold().as('list').select('list','total').by().by(count(local))
==>[list:[v[1]],total:1]
gremlin> g.V().hasLabel("person").sideEffect(count().store("total")).range(0,10).fold().as('list').select('list','total').by().by(count(local))
==>[list:[v[1],v[2],v[4],v[6]],total:4]
Interesting that JanusGraph somehow gives you 114 rather than 10 for the "total". I'd not expect that. I'd consider avoiding reliance on that "feature" in the case it is a "bug" that is later "fixed". Instead, please consider the posts I'd provided and look at them for inspiration.
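One possible reading of these numbers, offered as an assumption rather than a confirmed explanation of JanusGraph's behavior: the side effect runs once per traverser that actually flows through it, and a step such as order() has to consume everything upstream before it can emit anything. The Python sketch below imitates that with generators; the names are hypothetical.

```python
from itertools import islice

people = ["marko", "vadas", "josh", "peter"]  # hypothetical vertices


def with_side_effect(items, total):
    """Yield items lazily, recording a 1 for each item that passes through."""
    for item in items:
        total.append(1)  # one entry per traverser, like count().store("total")
        yield item


# No barrier: a range of (0, 1) only pulls one item through the side effect.
total_lazy = []
page_lazy = list(islice(with_side_effect(people, total_lazy), 0, 1))

# With a barrier such as sorting, every item is pulled before the range applies.
total_sorted = []
page_sorted = list(islice(sorted(with_side_effect(people, total_sorted)), 0, 1))

print(len(total_lazy), len(total_sorted))
```

In this sketch the stored "total" depends on how many items happened to be pulled, which is why it is safer to compute the total count separately from the page.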

How to get a path from one node to another including all other nodes and relationships involved in between

I have designed a model in Neo4j in order to get paths from one station to another, including the platforms/legs involved. The model is depicted below. Basically, I need a query that takes me from NBW to RD and also shows the platforms and legs involved. I am struggling with the query: I get no result. I'd appreciate any help.
Here is my cypher statement:
MATCH p = (a:Station)-[r:Goto|can_board|can_alight|has_platfrom*0..]->(c:Station)
WHERE (a.name='NBW')
AND c.name='RD'
RETURN p
Model:
As mentioned in the comments, in Cypher you can't use a directed variable-length relationship that uses differing directions for some of the relationships.
However, APOC Procedures just added the ability to expand based on sequences of relationships. You can give this a try:
MATCH (start:station), (end:station)
WHERE start.name='NBW' AND end.name='THT'
CALL apoc.path.expandConfig(start, {terminatorNodes:[end], limit:1,
relationshipFilter:'has_platform>, can_board>, goto>, can_alight>, <has_platform'}) YIELD path
RETURN path
I added a limit so that only the first (and shortest) path to your end station will be returned. Removing the limit isn't advisable, since this will continue to repeat the relationships in the expansion, going from station to station, until it finds all possible ways to get to your end station, which could hang your query.
EDIT
Regarding the new model changes, the reason the above will not work is because relationship sequences can't contain a variable-length sequence within them. You have 2 goto> relationships to traverse, but only one is specified in the sequence.
Here's an alternative that doesn't use sequences, just a whitelisting of allowed relationships. The spanningTree() procedure uses NODE_GLOBAL uniqueness so there will only be a single unique path to each node found (paths will not backtrack or revisit previously-visited nodes).
MATCH (start:station), (end:station)
WHERE start.name='NBW' AND end.name='RD'
CALL apoc.path.spanningTree(start, {terminatorNodes:[end], limit:1,
relationshipFilter:'has_platform>|can_board>|goto>|can_alight>|<has_platform'}) YIELD path
RETURN path
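To illustrate what a whitelist of directed relationships combined with node-global uniqueness does, here is a small breadth-first search in Python. It is only a sketch of the idea, not how APOC is implemented, and the station, platform, and leg names are hypothetical.

```python
from collections import deque

# Hypothetical directed relationships: (start_node, type, end_node).
rels = [
    ("NBW", "has_platform", "NBW_P1"),
    ("NBW_P1", "can_board", "Leg1"),
    ("Leg1", "goto", "Leg2"),
    ("Leg2", "can_alight", "RD_P1"),
    ("RD", "has_platform", "RD_P1"),
]

# Allowed (type, direction) pairs: ">" follows the arrow, "<" goes against it.
allowed = {
    ("has_platform", ">"), ("can_board", ">"), ("goto", ">"),
    ("can_alight", ">"), ("has_platform", "<"),
}


def first_path(start, end, rels, allowed):
    """Breadth-first search that visits each node at most once."""
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for a, rel_type, b in rels:
            if a == node and (rel_type, ">") in allowed:
                nxt = b
            elif b == node and (rel_type, "<") in allowed:
                nxt = a
            else:
                continue
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


print(first_path("NBW", "RD", rels, allowed))
```

Because every node is visited at most once, the search cannot loop from station to station, and the final hop into RD is only possible because has_platform is also allowed against its direction.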
Your query is directed --> and not all of the relationships between your two stations run in the same direction. If you remove the relationship direction you will get a result.
Once you have a result, I think something like this could point you in the right direction for extracting the particular details from the resulting path.
Essentially, I am assuming that everything you are interested in is in the returned path, and you just need to filter out the different pieces.
As @InverseFalcon points out, this query should be limited in a larger graph or it could easily run away.
MATCH p = (a:Station)-[r:Goto|can_board|can_alight|has_platfrom*0..]-(c:Station)
WHERE (a.name='NBW')
AND c.name='THT'
RETURN filter( n in nodes(p) WHERE 'Platform' in labels(n)) AS Platforms

Avoiding salesforce governing limits on soql queries getting group members for each group?

I am working in apex on salesforce platform. I have this loop to grab all group names, Ids, and their respective group members, place them in an object to collect all this info, then put that in a list to have a list of all groups and all information I need:
List<groupInfo> memberList = new List<groupInfo>();
for(Id key : groupMap.keySet()){
groupInfo newGroup = new groupInfo();
Group g = groupMap.get(key);
if(g.Name != null){
set<Id> memberSet = getGroupEventRelations(new set<Id>{g.Id});
if(memberSet.size() != 0){
newGroup.groupId = g.Id;
newGroup.groupName = g.Name;
newGroup.groupMemberIds = memberSet;
memberList.add(newGroup);
}
}
}
My getGroupEventRelations method is as such:
global static set<Id> getGroupEventRelations(set<Id> groupIds){
set<Id> nestedIds = new set<Id>();
set<Id> returnIds = new set<Id>();
List<GroupMember> members = [SELECT Id, GroupId, UserOrGroupId FROM GroupMember WHERE GroupId IN :groupIds];
for(GroupMember member : members){
if(Schema.Group.SObjectType == member.UserOrGroupId.getSObjectType()){
nestedIds.add(member.UserOrGroupId);
} else{
returnIds.add(member.UserOrGroupId);
}
}
if(nestedIds.size() > 0){
returnIds.addAll(getGroupEventRelations(nestedIds));
}
return returnIds;
}
getGroupEventRelations contains a SOQL query, and it is called inside a loop over groups. If someone has over 100 groups with group members, or a series of 100 groups nested inside groups, this is going to hit the Salesforce governor limits on SOQL queries pretty quickly.
I am wondering if anyone knows a way to get rid of the SOQL query inside getGroupEventRelations, so that no query runs inside the loop. When I want group members for a specific group, I don't really see a way around this without more loops inside loops, where I could risk running into the Salesforce CPU timeout governor limit :(
Thank you in advance for any help!
At large enough numbers there's no solution, you'll run into SOME governor limit. But you can certainly make your code work with bigger numbers than it does now. Here's a quick little cheat you could do to cut nesting 5-fold. Instead of just looking at the immediate parent (single level of children) look for parent, grandparent, great grandparent, etc, all in one query.
[SELECT Id, GroupId, UserOrGroupId FROM GroupMember WHERE (GroupId IN :groupIds OR Group.GroupId IN :groupIds OR Group.Group.GroupId IN :groupIds OR Group.Group.Group.GroupId IN :groupIds OR Group.Group.Group.Group.GroupId IN :groupIds OR Group.Group.Group.Group.Group.GroupId IN :groupIds) AND Id NOT IN :returnIds];
You just got 5 (or is it 6?) levels of children in one SOQL call, so you can support that many times more nest levels now. Note that I added a 'NOT IN' clause to make sure you don't repeat children that you already have, since you won't know which Ids came from the bottom level.
You can also make your very first call for all groups instead of each group at a time. So if someone has 100 groups you'll make just one call instead of 100.
List<Group> groups = groupMap.values();
List<GroupMember> allMembers = [SELECT Id, GroupId, UserOrGroupId FROM GroupMember WHERE GroupId IN :groups];
Lastly, you could query all GroupMembers in a single SOQL call and then iterate yourself. Like you said, you risk running into the 10 second limit here, but if the number of groups isn't in the millions you'll likely be just fine, especially if you do some O(n) analysis and choose good data structures and algorithms. On the plus side, you won't have to worry about SOQL limits regardless of the nesting and the tree complexity. This answer should be very helpful, since it does almost exactly what you'd have to do if you pulled all members in one call: "How to efficiently build a tree from a flat structure?"
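The "query everything once, then iterate in memory" approach could look roughly like the Python sketch below. It is not Apex, and the row layout, ids, and helper names are hypothetical; the point is only that one pass builds a lookup and nested groups are then resolved without further queries.

```python
from collections import defaultdict

# Hypothetical rows from one bulk query: (group_id, member_id, member_is_group).
rows = [
    ("G1", "U1", False),
    ("G1", "G2", True),
    ("G2", "U2", False),
    ("G2", "G3", True),
    ("G3", "U3", False),
    ("G3", "G1", True),  # an artificial cycle, only to show the guard below
]


def index_children(rows):
    """Build group -> direct members once, so lookups need no further queries."""
    children = defaultdict(list)
    for group_id, member_id, is_group in rows:
        children[group_id].append((member_id, is_group))
    return children


def users_in_group(group_id, children):
    """Collect all users under a group, following nested groups iteratively."""
    users, seen, stack = set(), {group_id}, [group_id]
    while stack:
        current = stack.pop()
        for member_id, is_group in children.get(current, []):
            if not is_group:
                users.add(member_id)
            elif member_id not in seen:
                seen.add(member_id)
                stack.append(member_id)
    return users


children = index_children(rows)
print(sorted(users_in_group("G1", children)))
```

Each membership row is touched once while building the index, and the seen set keeps the walk from revisiting a group.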

How do I get all nodes in the graph for a certain relationship type

I have built a small graph where all the screens are connected and the flow between screens varies based on the system/user. So the system/user is the relationship type.
I am looking to fetch all nodes that are linked by a certain relationship from a starting screen. I don't care about the depth, since I don't know the depth of the graph.
I tried something like the query below, but it takes forever to return, and it returns incorrect connections that don't match the attribute {path:'CC'}:
match (n:screen {isStart:true})-[r:NEXT*0..{path:'CC'}]-()
return r,n
A few suggestions:
Make sure you have created an index for :screen(isStart):
CREATE INDEX ON :screen(isStart);
Are you sure you want to include 0-length paths? If not, take out 0.. from your query.
You did not specify the directionality of the :NEXT relationships, so the DB has to look at both incoming and outgoing :NEXT relationships. If appropriate, specify the directionality.
To minimize the number of result rows, add a WHERE clause that ensures that the current path cannot be extended further.
Here is a proposed query that combines the last 3 suggestions (fix it up to suit your needs):
MATCH (n:screen {isStart:true})-[r:NEXT* {path:'CC'}]->(x)
WHERE NOT (x)-[:NEXT {path:'CC'}]->()
return r,n;
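As a rough picture of what the proposed query asks for, the Python sketch below walks outgoing relationships with a given path property and keeps only paths that cannot be extended further. The screen names and the maximal_paths helper are hypothetical, and in a graph with cycles this sketch and the Cypher query may not behave identically.

```python
# Hypothetical NEXT relationships: (from_screen, to_screen, path_property).
next_rels = [
    ("start", "s1", "CC"),
    ("s1", "s2", "CC"),
    ("s1", "s3", "DD"),
    ("start", "s4", "DD"),
]


def maximal_paths(start, rels, path_value):
    """Return outgoing paths over matching relationships that cannot be extended."""
    results = []

    def walk(path):
        node = path[-1]
        # Only outgoing relationships with the wanted property count.
        outgoing = [b for a, b, p in rels if a == node and p == path_value]
        # Like the WHERE NOT clause: keep the path if nothing leads onward.
        if not outgoing and len(path) > 1:
            results.append(path)
        for nxt in outgoing:
            if nxt not in path:  # avoid walking in circles
                walk(path + [nxt])

    walk([start])
    return results


print(maximal_paths("start", next_rels, "CC"))
```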

Resources