Gremlin repeat until the combined weight is greater than x - graph

Hello, I'm messing around with Gremlin to find paths from one node to another. I have a weighted graph and I need to be able to find all the paths that don't exceed a combined weight.
For instance, suppose I want all the paths from [A] to [D] that don't exceed a combined weight of 20:
[A] -5-> [B] -15-> [C] -20-> [D] - Would not be valid as it exceeds a combined weight of 20
[A] -5-> [B] -15-> [D] - Would return as its combined weight does not exceed 20.
This is my current query
g.V('A').repeat(bothE().otherV().hasLabel('test'))
.until(hasId('D')
.or().loops().is(5)
.or().map(unfold().coalesce(values("weight"),constant(0)).sum().is(gt(20))))
.hasId('D').path().by(valueMap(true))
If I remove the below section of the query it returns the same data so there is something wrong with my logic here.
.or().map(unfold().coalesce(values("weight"),constant(0)).sum().is(gt(20))))
I have considered just filtering this out in the backend API, but that doesn't seem like good practice, as a lot of computation may be wasted as the graph gets larger.

This is a case where the sack step can help a lot. As the query progresses you can continually add the weights found to the current sack value.
Using the air-routes data set, where edges have a distance (dist) property, we can write a query similar to yours that ends when it finds the target, when the sack exceeds the allowed maximum, or when the loop limit is reached. I added all the parts that start with local just to show some nicely formatted results. Note the second hasId check. It is needed to make sure we found the intended target in cases where we exited the repeat because the sack limit was exceeded.
g.withSack(0).
V('3').
repeat(outE().sack(sum).by('dist').inV().simplePath()).
until(hasId('49').or().sack().is(gt(6000)).or().loops().is(3)).
hasId('49').
local(
union(
path().
by('code').
by('dist'),
sack()).
fold()).
limit(3)
which yields
[path[AUS, 4901, LHR], 4901]
[path[AUS, 748, MEX, 5529, LHR], 6277]
[path[AUS, 1209, PIT, 3707, LHR], 4916]
To filter out the results where the total exceeds the target, we can move the test to be after the until
g.withSack(0).
V('3').
repeat(outE().sack(sum).by('dist').inV().simplePath()).
until(hasId('49').or().loops().is(3)).
hasId('49').
where(sack().is(lte(6000))).
local(
union(
path().
by('code').
by('dist'),
sack()).
fold()).
limit(3)
This time all the results are within the limit of 6000.
[path[AUS, 4901, LHR], 4901]
[path[AUS, 1209, PIT, 3707, LHR], 4916]
[path[AUS, 809, ATL, 4198, LHR], 5007]
Using this template you can hopefully build a version of your query.
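To see the logic of the sack approach outside Gremlin, here is a rough Python sketch of the same idea on the small graph from the question. The GRAPH dictionary, the function name paths_within, and the early pruning of overweight branches are illustrative assumptions, not part of the Gremlin query. Pruning early is only safe when weights are non-negative.

```python
# Toy weighted graph from the question: (neighbor, weight) pairs per vertex.
GRAPH = {
    "A": [("B", 5)],
    "B": [("C", 15), ("D", 15)],
    "C": [("D", 20)],
    "D": [],
}


def paths_within(graph, start, target, max_weight, max_hops):
    """Return (path, total) pairs from start to target with total <= max_weight."""
    results = []
    # Each stack entry plays the role of a traverser: its path plus its "sack".
    stack = [([start], 0)]
    while stack:
        path, sack = stack.pop()
        node = path[-1]
        if node == target:
            results.append((path, sack))
            continue  # comparable to until(hasId(target)): stop extending
        if len(path) - 1 >= max_hops:
            continue  # comparable to loops().is(max_hops)
        for neighbor, weight in graph[node]:
            total = sack + weight
            # Comparable to simplePath(): never revisit a vertex.
            # Overweight branches are dropped early (assumes weights >= 0).
            if neighbor not in path and total <= max_weight:
                stack.append((path + [neighbor], total))
    return sorted(results)


print(paths_within(GRAPH, "A", "D", 20, 5))  # [(['A', 'B', 'D'], 20)]
```

The running total travels with each partial path, so no path has to be re-summed once the target is reached.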
UPDATED after discussion in the comments. In the absence of sack, you can try something like this. It essentially builds a path of distances, substituting a 0 for where the vertices are, adds that list up, and filters on the sum. It's pretty ugly, but it works!
g.V('3').
repeat(outE().as('d').inV().simplePath()).
until(hasId('49').or().loops().is(3)).
hasId('49').
where(path().by(constant(0)).by('dist').unfold().sum().is(lte(6000))).
path().
by('code').
by('dist').
limit(5)
The results are
path[AUS, 4901, LHR]
path[AUS, 1209, PIT, 3707, LHR]
path[AUS, 809, ATL, 4198, LHR]
path[AUS, 755, BNA, 4168, LHR]
path[AUS, 1690, BOS, 3254, LHR]
Not having support for sack is going to make everything a lot more complex but this at least is a possible workaround.
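The "path of distances" trick can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the helper name path_total is hypothetical, and the sample paths are copied from the results shown earlier in this answer.

```python
def path_total(path):
    """Sum an alternating vertex/edge-weight path, counting vertices as 0."""
    # Mirrors by(constant(0)).by('dist'): vertex entries contribute 0,
    # edge entries contribute their distance.
    return sum(item if isinstance(item, (int, float)) else 0 for item in path)


paths = [
    ["AUS", 4901, "LHR"],
    ["AUS", 748, "MEX", 5529, "LHR"],
    ["AUS", 1209, "PIT", 3707, "LHR"],
]

# Keep only the paths whose total distance is at most 6000.
within = [p for p in paths if path_total(p) <= 6000]
print(within)
```

The path through MEX totals 6277 and is dropped, which matches the filtered results above.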

Create a group vertex for each group, and create outgoing edges to the group vertices

I have a gremlin query which groups vertices based on two properties
g.V().hasLabel("PERSON").
group().
by(values('favorite_brand', 'favorite_color').fold()).
next()
It returns a list of each group mapped to the list of the vertices in group
{('adidas', 'blue'): [v[123], v[456]]}
{('nike', 'red'): [v[789]]}
How can I, for each group, create a vertex with an outgoing edge to every vertex in that group, and also set the new group vertex's properties to match the group key?
For example for the above, I would create two new vertices. Group Vertex 1 would have 'favorite_brand' as adidas and 'favorite_color' as blue and would have two outgoing edges to the two vertices 123 and 456.
Same for Group Vertex 2
Is there a way in Gremlin to carry out this query, or do I have to store the returned hashmap in a variable and loop over it in my lambda to create new vertices? I'm familiar with the addV step, but how would I iterate through each element in the hashmap and then access the list value? Thanks!
I have looked at the official TinkerPop documentation to understand the group step, but didn't find enough information on how to iterate through the results and perform actions.
Not having your data, the answer below is built using the air-routes data set. The initial group can be built using:
g.V().hasId('3','8','12','13','14').
group().
by(values('region', 'country').fold()).
unfold()
which yields
{('US-CA', 'US'): [v[13]]}
{('US-NY', 'US'): [v[12], v[14]]}
{('US-TX', 'US'): [v[3], v[8]]}
From there we can build a query to unroll the group while creating the new nodes and edges.
g.V().hasId('3','8','12','13','14').
group().
by(values('region', 'country').fold()).
unfold().as('grp').
addV('group').as('new-node').
property('region',select(keys).limit(local,1)).
property('country',select(keys).tail(local)).
sideEffect(select('grp').select(values).unfold().addE('new-edge').from('new-node'))
which shows us the nodes created but will also have created the edges inside the sideEffect.
v[26c2d653-5e9c-3ec9-0854-6ed2a212c63b]
v[80c2d653-5e9c-b838-063c-82f6d21cd6e5]
v[42c2d653-5e9d-31c7-2c02-c7bf72fe8e38]
We can use the query below to verify everything has worked.
g.V().hasLabel('group').out().path().by(valueMap()).by(id())
Which returns
path[{'region': ['US-CA'], 'country': ['US']}, 13]
path[{'region': ['US-NY'], 'country': ['US']}, 12]
path[{'region': ['US-NY'], 'country': ['US']}, 14]
path[{'region': ['US-TX'], 'country': ['US']}, 3]
path[{'region': ['US-TX'], 'country': ['US']}, 8]
I used Amazon Neptune to build this answer but it should work on other TinkerPop compliant stores.
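For readers who find the group/unfold/addV chain hard to follow, the same steps can be sketched in plain Python. This is an in-memory illustration only: the vertices dictionary restates the grouped output shown above, while build_groups and the "group-N" ids are hypothetical names, not anything the graph store generates.

```python
from collections import defaultdict
from itertools import count

# (region, country) per vertex id, taken from the grouped output above.
vertices = {
    13: ("US-CA", "US"),
    12: ("US-NY", "US"),
    14: ("US-NY", "US"),
    3: ("US-TX", "US"),
    8: ("US-TX", "US"),
}


def build_groups(vertices):
    """Group vertices by key, then create one group vertex and its edges per key."""
    # Step 1: the equivalent of group().by(values(...).fold()).
    groups = defaultdict(list)
    for vertex_id, key in vertices.items():
        groups[key].append(vertex_id)

    # Step 2: the equivalent of unfold() + addV() + property() + addE().
    new_ids = count(1)  # hypothetical id generator for the new group vertices
    group_vertices, edges = {}, []
    for (region, country), members in groups.items():
        group_id = f"group-{next(new_ids)}"
        group_vertices[group_id] = {"region": region, "country": country}
        edges.extend((group_id, member) for member in members)
    return group_vertices, edges


group_vertices, edges = build_groups(vertices)
print(group_vertices)
print(edges)
```

Each map entry becomes one new vertex whose properties come from the key, and each value in the entry's list becomes the target of one outgoing edge.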

Gremlin: How to traverse backward and/or forward based on a vertex id

Use case: Given a person (with a known id), find all of that person's ancestors and descendants.
Example:
Vertex:
Person1 (id=11) -> Person2 (id=22) -> Person3 (id=33) -> Person4 (id=44) -> Person5 (id=55)
Edge:
Every vertex can have at most two edges denoting the relationship:
Person1 (isParentOf) -> Person2 (isParentOf) -> Person3 (isParentOf) -> Person4 (isParentOf) -> Person5
Person1 <- (isChildOf) Person2 <- (isChildOf) Person3 <- (isChildOf) Person4 <- (isChildOf) Person5
E.g. Query 1:
Given Person1 (id=11), find all of Person1's ancestors and descendants.
Expected response: [isParentOf: 22 (Person2), 33 (Person3), 44 (Person4), 55 (Person5)]
E.g. Query 2:
Given Person3 (id=33), find all of Person3's ancestors and descendants.
Expected response: [isChildOf: 22 (Person2), 11 (Person1), isParentOf: 44 (Person4), 55 (Person5)]
E.g. Query 3:
Given Person5 (id=55), find all of Person5's ancestors and descendants.
Expected response: [isChildOf: 44 (Person4), 33 (Person3), 22 (Person2), 11 (Person1)]
The response should keep the order of the ancestors and descendants to help with other follow-up queries. The response should also contain the name property of the persons.
Test data populated at https://gremlify.com/pk9z0s6gv4g/5
g.addV('PERSON').property(id, '11').
property('name', 'person1').as('p1').
addV('PERSON').property(id, '22').
property('name', 'person2').as('p2').
addV('PERSON').property(id, '33').
property('name', 'person3').as('p3').
addV('PERSON').property(id, '44').
property('name', 'person4').as('p4').
addV('PERSON').property(id, '55').
property('name', 'person5').as('p5').
addE('isChildOf').from('p2').to('p1').
addE('isParentOf').from('p1').to('p2').
addE('isChildOf').from('p3').to('p2').
addE('isParentOf').from('p2').to('p3').
addE('isChildOf').from('p4').to('p3').
addE('isParentOf').from('p3').to('p4').
addE('isChildOf').from('p5').to('p4').
addE('isParentOf').from('p4').to('p5')
I tried the following example query:
g.V('33').repeat(out('isParentOf')).times(2).path()
Response
path[v[33], v[44], v[55]]
My query has the following issues:
It can only check for 'isParentOf' or 'isChildOf', not both.
The search is limited by the times(2) parameter. I want to search for all ancestors and all descendants.
It returns only the IDs and not the properties of the Person (like the name).
With my limited understanding of GraphDB and all the examples I have seen, the query must contain a limiting id and/or source+destination nodes (vertex) to traverse. I have not seen an open-ended query that traverses from a source node to a leaf node and vice-versa.
Using the Gremlin Console, here is an example that uses a union step to search for both children and parents. I am a little curious why you modeled these both as outgoing relationships. If you modeled one as incoming and the other as outgoing, you could just have isParentOf edges, and depending on the direction you travel (in or out), you could infer if someone is a child or a parent. Anyway, using the sample data:
gremlin> g.V('33').
......1> union(repeat(out('isChildOf')).until(__.not(out('isChildOf'))),
......2> repeat(out('isParentOf')).until(__.not(out('isParentOf')))).
......3> path().
......4> by('name')
==>[person3,person2,person1]
==>[person3,person4,person5]
Changing the query a bit, we can annotate the results to show the relationship type:
gremlin> g.V('33').
......1> union(project('child-of').by(repeat(out('isChildOf')).until(__.not(out('isChildOf'))).path().by('name')),
......2> project('parent-of').by(repeat(out('isParentOf')).until(__.not(out('isParentOf'))).path().by('name')))
==>[child-of:[person3,person2,person1]]
==>[parent-of:[person3,person4,person5]]
Pretty printed the query looks like this:
g.V('33').
union(
project('child-of').
by(
repeat(out('isChildOf')).until(__.not(out('isChildOf'))).
path().by('name')),
project('parent-of').
by(
repeat(out('isParentOf')).until(__.not(out('isParentOf'))).
path().by('name')))
Finally, if you do not want the starting person to appear in the results:
gremlin> g.V('33').
......1> union(
......2> project('child-of').
......3> by(
......4> repeat(out('isChildOf')).until(__.not(out('isChildOf'))).
......5> path().by('name').range(local,1,-1)),
......6> project('parent-of').
......7> by(
......8> repeat(out('isParentOf')).until(__.not(out('isParentOf'))).
......9> path().by('name').range(local,1,-1)))
==>[child-of:[person2,person1]]
==>[parent-of:[person4,person5]]
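The repeat/until pattern above amounts to "keep following the edge until there is no further edge of that label". A small Python sketch of that walk over the sample data follows. The EDGES and NAMES dictionaries restate the test data, the function names are hypothetical, and the sketch assumes each person has at most one outgoing edge per label, which holds for this sample but not for real family trees.

```python
# Outgoing edges from the sample data, keyed by edge label.
EDGES = {
    "isParentOf": {"11": "22", "22": "33", "33": "44", "44": "55"},
    "isChildOf": {"22": "11", "33": "22", "44": "33", "55": "44"},
}
NAMES = {
    "11": "person1",
    "22": "person2",
    "33": "person3",
    "44": "person4",
    "55": "person5",
}


def follow(start, label):
    """Walk out-edges with the given label until none remain; return names in order."""
    chain = []
    seen = {start}
    current = start
    while current in EDGES[label]:
        current = EDGES[label][current]
        if current in seen:
            break  # guard against cycles in bad data
        seen.add(current)
        chain.append(NAMES[current])
    return chain


def relatives(start):
    """Collect both directions, excluding the starting person."""
    return {
        "child-of": follow(start, "isChildOf"),
        "parent-of": follow(start, "isParentOf"),
    }


print(relatives("33"))
# {'child-of': ['person2', 'person1'], 'parent-of': ['person4', 'person5']}
```

The order of each list reflects the distance from the starting person, which is the ordering the question asks to preserve.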

Gremlin arithmetic on where predicate?

In the question "Gremlin graph traversal that uses previous edge property value to filter later edges", where is used to compare the properties of two edges. I don't want to use just a simple neq, eq, or gt. Does Gremlin support arithmetic on the two edges, such as gtv('firstEdge', 0.2) or g.V(1).outE().has('weight',1.0).as('firstEdge').inV().outE().as('secondEdge').filter((secondEdge-firstEdge) > 0.2)?
I can't seem to find such a thing in the documentation.
There are a couple of ways to approach a situation like this. For simple queries where you just want a+b < some value, using sack works well. For example, using the air-routes data set:
g.withSack(0).
V('44').
repeat(outE('route').sack(sum).by('dist').inV().simplePath()).
times(2).
where(sack().is(lt(500))).
path().
by('code').
by('dist').
limit(2)
which yields:
path[SAF, 369, PHX, 110, TUS]
path[SAF, 369, PHX, 119, FLG]
To use the math step requires just a little more work:
Let's first just see how the math step works in such a case via a query to take the difference between some route distances:
g.V('44').
outE('route').as('a').inV().
outE('route').as('b').inV().
project('b','a','diff').
by(select('b').values('dist')).
by(select('a').values('dist')).
by(math('b - a').by('dist')).
limit(3)
which yields:
{'b': 1185, 'a': 549, 'diff': 636.0}
{'b': 6257, 'a': 549, 'diff': 5708.0}
{'b': 8053, 'a': 549, 'diff': 7504.0}
We can now refine the query to find routes where the difference is less than 100.
g.V('44').
outE('route').as('a').inV().
outE('route').as('b').inV().
where(math('b - a').by('dist').is(lt(100))).
path().
by('code').
by('dist').
limit(3)
which gives us:
path[SAF, 549, DFW, 430, MEM]
path[SAF, 549, DFW, 461, MCI]
path[SAF, 549, DFW, 550, STL]
You can also use the absolute value in the calculation if preferred:
g.V('44').
outE('route').as('a').inV().
outE('route').as('b').inV().
where(math('abs(b - a)').by('dist').is(lt(100))).
path().
by('code').
by('dist').
limit(3)
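The signed and absolute versions of the filter can give different answers, which a short Python sketch makes concrete. The routes list reuses the three two-hop routes from the results above, and filter_routes is a hypothetical helper, not a Gremlin step.

```python
# Two-hop routes as (origin, first_dist, via, second_dist, destination),
# copied from the sample results above.
routes = [
    ("SAF", 549, "DFW", 430, "MEM"),
    ("SAF", 549, "DFW", 461, "MCI"),
    ("SAF", 549, "DFW", 550, "STL"),
]


def filter_routes(routes, limit, absolute=False):
    """Keep destinations where (b - a), or abs(b - a), is below the limit."""
    kept = []
    for route in routes:
        a, b = route[1], route[3]  # distances of the first and second edge
        diff = abs(b - a) if absolute else b - a
        if diff < limit:
            kept.append(route[-1])
    return kept


print(filter_routes(routes, 100))        # ['MEM', 'MCI', 'STL']
print(filter_routes(routes, 100, True))  # ['MCI', 'STL']
```

With the signed difference, MEM passes because 430 - 549 is negative. With the absolute value, its difference of 119 exceeds the limit and it is dropped.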

Gremlin - finding connected nodes with several boolean conditions on both nodes and edges properties

I want to find nodes that should be linked to a given node, where the link is defined by logic that uses the attributes of the nodes and of existing edges:
A) (The pair has the same zip (node attribute) and name_similarity (edge attribute) > 0.3 OR
B) The pair has a different zip and name_similarity > 0.5 OR
C) The pair has an edge type "external_info" with value = "connect")
D) AND (the pair doesn't have an edge type with "external info" with value = "disconnect")
In short:
(A | B | C) & (~D)
I'm still a newbie to gremlin, so I'm not sure how I can combine several conditions on edges and nodes.
Below is the code for creating the graph, as well as the expected results for that graph:
# creating nodes
(g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate())
node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()
# creating name similarity edges
g.V(node1).addE('name_similarity').from_(node1).to(node2).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node3).property('score', 0.2).next() # under threshold
g.V(node1).addE('name_similarity').from_(node1).to(node4).property('score', 0.4).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node5).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node6).property('score', 0).next() # under threshold
# creating external output edges
g.V(node1).addE('external_info').from_(node1).to(node5).property('decision', 'connect').next()
g.V(node1).addE('external_info').from_(node1).to(node6).property('decision', 'disconnect').next()
The expected output for input node A is nodes B (due to condition A), D (due to condition B), and F (due to condition C). Node E should not be linked due to condition D.
I'm looking for a Gremlin query that will retrieve these results.
Something seemed wrong in your data given the output you expected so I had to make corrections:
Vertex D wouldn't appear in the results because "score" was less than 0.5
"external_info" edges seemed reversed
Here's the data I used:
g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate()
node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()
g.V(node1).addE('name_similarity').from(node1).to(node2).property('score', 1).next()
g.V(node1).addE('name_similarity').from(node1).to(node3).property('score', 0.2).next()
g.V(node1).addE('name_similarity').from(node1).to(node4).property('score', 0.6).next()
g.V(node1).addE('name_similarity').from(node1).to(node5).property('score', 1).next()
g.V(node1).addE('name_similarity').from(node1).to(node6).property('score', 0).next()
g.V(node1).addE('external_info').from(node1).to(node6).property('decision', 'connect').next()
g.V(node1).addE('external_info').from(node1).to(node5).property('decision', 'disconnect').next()
I went with the following approach:
gremlin> g.V().has('person','name','A').as('a').
......1> V().as('b').
......2> where('a',neq('b')).
......3> or(where('a',eq('b')). // A
......4> by('zip').
......5> bothE('name_similarity').has('score',gt(0.3)).otherV().where(eq('a')),
......6> bothE('name_similarity').has('score',gt(0.5)).otherV().where(eq('a')), // B
......7> bothE('external_info'). // C
......8> has('decision','connect').otherV().where(eq('a'))).
......9> filter(__.not(bothE('external_info'). // D
.....10> has('decision','disconnect').otherV().where(eq('a')))).
.....11> select('a','b').
.....12> by('name')
==>[a:A,b:B]
==>[a:A,b:D]
==>[a:A,b:F]
I think this contains all the logic you were looking for, but I didn't spend a lot of time optimizing it, as I don't think any optimization will get around the pain of the full graph scan of V().as('b'). So either your situation involves a relatively small graph (in-memory perhaps) and this query will work, or you would need to find another method altogether. Perhaps you have ways to further limit "b", which might help? If something along those lines is possible, I'd probably try to better define the directionality of edge traversals to avoid bothE() and instead limit to outE() or inE(), which would get rid of otherV(). Hopefully you use a graph that allows for vertex-centric indices, which would speed up those edge lookups on "score" as well (not sure if that would help much on "decision", as it has low selectivity).
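As a cross-check of the (A | B | C) & (~D) logic, here is a minimal Python sketch that evaluates the four conditions directly on the corrected data. The dictionaries only model edges that start at A, and the names ZIPS, SCORES, DECISIONS, and should_link are hypothetical, chosen for this illustration.

```python
# Corrected sample data from this answer, restricted to edges that start at A.
ZIPS = {"A": "123", "B": "123", "C": "456", "D": "456", "E": "123", "F": "999"}
SCORES = {"B": 1, "C": 0.2, "D": 0.6, "E": 1, "F": 0}  # name_similarity score
DECISIONS = {"F": "connect", "E": "disconnect"}        # external_info decision


def should_link(source, other):
    """Evaluate (A or B or C) and not D for one candidate pair."""
    score = SCORES.get(other, 0)
    same_zip = ZIPS[source] == ZIPS[other]
    cond_a = same_zip and score > 0.3
    cond_b = (not same_zip) and score > 0.5
    cond_c = DECISIONS.get(other) == "connect"
    cond_d = DECISIONS.get(other) == "disconnect"
    return (cond_a or cond_b or cond_c) and not cond_d


linked = [name for name in ZIPS if name != "A" and should_link("A", name)]
print(linked)  # ['B', 'D', 'F']
```

E satisfies condition A but is excluded by the disconnect edge, and C fails every condition, so the result agrees with the traversal output above.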

R filter produces NAs

I see posts on how the filter() function can deal with NAs, but I have the opposite problem.
I have a fully complete dataset that I process through a long algorithm. One piece of that algorithm is a series of FIR filters. After going through the filters, SOMETIMES my data comes back with NAs at the beginning and end. They are not padding the original dimensions; they "replace" the values that would otherwise have come out of the filter.
I've come up with a couple of hacks to remove the NAs after they are created, but I'm wondering if I can prevent the NAs from showing up in the first place?
set.seed(500)
xoff <- sample(-70:70, 8200, replace=TRUE)
filt <- c(-75, -98, -130, -174, -233, -312, -412, -524, -611, -574, -246, 485, 1503, 2545, 3446, 4174, 4749, 5189, 5502, 5689, 5750, 5689, 5502, 5189, 4749, 4174, 3446, 2545, 1503, 485, -246, -574, -611, -524, -412, -312, -233, -174, -130, -98, -75)
xs = filter(xoff,filt)
40 NAs are returned. 20 at head, 20 at the tail.
sum(is.na(xs)==TRUE)
head(xs, n=21)
tail(xs, n=21)
The length of the original data and the filtered vector are identical.
length(xoff)==length(xs)
Another filter I use sometimes produces 20 NAs, 10 at the head, 10 at the tail. So it is filter dependent.
Things I've thought about:
-Is the length of the data indivisible by the length of the filter? No, the filter is length=41, and the data is length=8200. 8200/41 = 200.
-Unlike a moving average which would need n observations prior to starting the smooth, this FIR filter provides the filter values and doesn't rely on prior observations.
So, any help debugging this is much appreciated! Thanks!
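The counts reported above (20 NAs at each end for a 41-tap filter, 10 at each end for another filter) look consistent with a centered, non-circular convolution, which is what R's stats::filter does with its default sides = 2 and circular = FALSE. The Python sketch below imitates that behaviour under the assumption of an odd-length filter. The function centered_filter and the test signal are made up for illustration and do not reproduce the original data.

```python
def centered_filter(x, coeffs):
    """Centered FIR convolution; positions without a full window become None."""
    half = len(coeffs) // 2  # samples needed on each side of position i
    out = []
    for i in range(len(x)):
        lo, hi = i - half, i + half
        if lo < 0 or hi >= len(x):
            out.append(None)  # the window runs off the data, so no value exists
        else:
            window = x[lo:hi + 1]
            # Convolution pairs the first coefficient with the newest sample.
            out.append(sum(c * v for c, v in zip(coeffs, reversed(window))))
    return out


x = [i % 7 for i in range(8200)]   # stand-in signal, same length as xoff
out = centered_filter(x, [1] * 41)  # stand-in 41-tap filter
print(len(out), sum(v is None for v in out))  # 8200 40
```

Under this reading, the number of missing values depends only on the filter length (half of it, rounded down, at each end), not on whether the data length is divisible by the filter length.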
