Gremlin arithmetic on where predicate?

In the question "Gremlin graph traversal that uses previous edge property value to filter later edges" we can use where to compare a property of an edge. I don't want just a simple neq, eq, or gt comparison. Can Gremlin do arithmetic on the properties of two edges? Something like gtv('firstEdge', 0.2), or g.V(1).outE().has('weight',1.0).as('firstEdge').inV().outE().as('secondEdge').filter((secondEdge-firstEdge) > 0.2)?
I can't find anything like this in the documentation.

There are a couple of ways to approach a situation like this. For simple queries where you just want a+b < some value, using sack works well. For example, using the air-routes data set:
g.withSack(0).
V('44').
repeat(outE('route').sack(sum).by('dist').inV().simplePath()).
times(2).
where(sack().is(lt(500))).
path().
by('code').
by('dist').
limit(2)
which yields:
1 path[SAF, 369, PHX, 110, TUS]
2 path[SAF, 369, PHX, 119, FLG]
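The sack mechanics above can be illustrated outside Gremlin: the sack starts at 0, each traversed edge adds its dist to it, and a path is kept only if the running total stays under the limit. A minimal Python sketch of that accumulation, using a made-up route fragment rather than the real air-routes data:

```python
# Tiny stand-in for a weighted route graph: vertex -> [(neighbor, dist), ...]
# (illustrative values only, not the actual air-routes data)
routes = {
    "SAF": [("PHX", 369)],
    "PHX": [("TUS", 110), ("FLG", 119), ("LAX", 370)],
}

def paths_under(graph, start, max_dist, hops):
    """Enumerate simple paths of exactly `hops` edges whose summed
    distance (the 'sack') stays below max_dist."""
    results = []
    def walk(vertex, path, sack):
        if len(path) // 2 == hops:          # path alternates v, d, v, d, v ...
            if sack < max_dist:
                results.append(path)
            return
        for nbr, dist in graph.get(vertex, []):
            if nbr not in path:             # simplePath(): no revisits
                walk(nbr, path + [dist, nbr], sack + dist)
    walk(start, [start], 0)
    return results

print(paths_under(routes, "SAF", 500, 2))
# Keeps SAF->PHX->TUS (479) and SAF->PHX->FLG (488); drops SAF->PHX->LAX (739)
```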
Using the math step requires a little more work.
Let's first see how the math step behaves in such a case, via a query that takes the difference between some route distances:
g.V('44').
outE('route').as('a').inV().
outE('route').as('b').inV().
project('b','a','diff').
by(select('b').values('dist')).
by(select('a').values('dist')).
by(math('b - a').by('dist')).
limit(3)
which yields:
1 {'b': 1185, 'a': 549, 'diff': 636.0}
2 {'b': 6257, 'a': 549, 'diff': 5708.0}
3 {'b': 8053, 'a': 549, 'diff': 7504.0}
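Conceptually, math('b - a').by('dist') just evaluates the expression against the dist values bound to the two labels. The same projection in plain Python, over hypothetical first-hop/second-hop distance pairs:

```python
# Hypothetical (first leg, second leg) distance pairs, not real route data
pairs = [(549, 1185), (549, 6257), (549, 8053)]

# Mirror project('b','a','diff') with math('b - a')
rows = [{"b": b, "a": a, "diff": float(b - a)} for a, b in pairs]
for row in rows:
    print(row)
# First row: {'b': 1185, 'a': 549, 'diff': 636.0}, matching the Gremlin output shape
```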
We can now refine the query to find routes where the difference is less than 100:
g.V('44').
outE('route').as('a').inV().
outE('route').as('b').inV().
where(math('b - a').by('dist').is(lt(100))).
path().
by('code').
by('dist').
limit(3)
which gives us:
1 path[SAF, 549, DFW, 430, MEM]
2 path[SAF, 549, DFW, 461, MCI]
3 path[SAF, 549, DFW, 550, STL]
You can also use the absolute value in the calculation if preferred:
g.V('44').
outE('route').as('a').inV().
outE('route').as('b').inV().
where(math('abs(b - a)').by('dist').is(lt(100))).
path().
by('code').
by('dist').
limit(3)
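The only difference is that abs() makes the test symmetric: a second leg much shorter than the first no longer passes. A quick sanity check of the two predicates in Python, with illustrative distances:

```python
# (first leg, second leg) distance pairs; values are illustrative only
legs = [(549, 430), (549, 461), (549, 660), (549, 1185)]

signed   = [(a, b) for a, b in legs if b - a < 100]       # math('b - a')
absolute = [(a, b) for a, b in legs if abs(b - a) < 100]  # math('abs(b - a)')

print(signed)    # any shorter second leg passes the signed test
print(absolute)  # only pairs within 100 of each other pass the abs() test
```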

Related

Gremlin repeat until the combined weight is greater than x

Hello, I'm experimenting with Gremlin to find paths from one node to another. I have a weighted graph and I need to find all the paths whose combined weight does not exceed a limit.
For instance, if I want all the paths from [A] to [D] that don't exceed the weight of 20
[A] -5-> [B] -15-> [C] -20-> [D] - Would not be valid as it exceeds a combined weight of 20
[A] -5-> [B] -15-> [D] - Would return as its combined weight does not exceed 20.
This is my current query
g.V('A').repeat(bothE().otherV().hasLabel('test'))
.until(hasId('D')
.or().loops().is(5)
.or().map(unfold().coalesce(values("weight"),constant(0)).sum().is(gt(20))))
.hasId('D').path().by(valueMap(true))
If I remove the below section of the query it returns the same data so there is something wrong with my logic here.
.or().map(unfold().coalesce(values("weight"),constant(0)).sum().is(gt(20))))
I have considered just filtering this out in the backend API, but this doesn't seem like good practice, as a lot of computing may be wasted as the graph gets larger.
This is a case where the sack step can help a lot. As the query progresses you can continually add the weights found to the current sack value.
Using the air-routes data set, where edges have a distance (dist) property, we can write a query similar to yours that ends when it finds the target, when the sack exceeds the allowed maximum, or when the loop limit is reached. I added the parts that start with local just to produce nicely formatted results. Note the second hasId check: it is needed to make sure we actually found the intended target, since we may have exited the repeat because the sack limit was exceeded.
g.withSack(0).
V('3').
repeat(outE().sack(sum).by('dist').inV().simplePath()).
until(hasId('49').or().sack().is(gt(6000)).or().loops().is(3)).
hasId('49').
local(
union(
path().
by('code').
by('dist'),
sack()).
fold()).
limit(3)
which yields
1 [path[AUS, 4901, LHR], 4901]
2 [path[AUS, 748, MEX, 5529, LHR], 6277]
3 [path[AUS, 1209, PIT, 3707, LHR], 4916]
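The three-way until condition can be mimicked outside Gremlin: a branch stops expanding when it reaches the target, when its running total exceeds the cap, or when the hop limit is hit; a final check (the second hasId) keeps only branches that actually arrived. A Python sketch over a toy graph (distances are illustrative, not the real air-routes values):

```python
# Toy weighted digraph standing in for air-routes: vertex -> [(neighbor, dist)]
graph = {
    "AUS": [("LHR", 4901), ("MEX", 748), ("PIT", 1209)],
    "MEX": [("LHR", 5529)],
    "PIT": [("LHR", 3707)],
}

def routes_to(graph, start, target, max_sack, max_hops):
    """Stop expanding a branch when the target is reached, the sack exceeds
    max_sack, or max_hops edges were traversed; then keep only branches that
    actually ended at the target (the second hasId equivalent)."""
    found = []
    def walk(v, path, sack, hops):
        if v == target or sack > max_sack or hops == max_hops:
            if v == target:
                found.append((path, sack))
            return
        for nbr, dist in graph.get(v, []):
            if nbr not in path:              # simplePath()
                walk(nbr, path + [dist, nbr], sack + dist, hops + 1)
    walk(start, [start], 0, 0)
    return found

for path, total in routes_to(graph, "AUS", "LHR", 6000, 3):
    print(path, total)
# Note AUS->MEX->LHR (6277) still appears: it hit the target on the same
# step its sack went over, which is why the later where() filter is needed.
```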
To filter out the results where the total exceeds the target, we can move the test to after the until step:
g.withSack(0).
V('3').
repeat(outE().sack(sum).by('dist').inV().simplePath()).
until(hasId('49').or().loops().is(3)).
hasId('49').
where(sack().is(lte(6000))).
local(
union(
path().
by('code').
by('dist'),
sack()).
fold()).
limit(3)
This time all of the results are below the target of 6000:
1 [path[AUS, 4901, LHR], 4901]
2 [path[AUS, 1209, PIT, 3707, LHR], 4916]
3 [path[AUS, 809, ATL, 4198, LHR], 5007]
Using this template you can hopefully build a version of your query.
UPDATED after discussion in the comments. In the absence of sack, you can try something like this. It essentially builds a path of distances, substituting a 0 for where the vertices are, adds that list up, and filters on the sum. It's pretty ugly, but it works!
g.V('3').
repeat(outE().as('d').inV().simplePath()).
until(hasId('49').or().loops().is(3)).
hasId('49').
where(path().by(constant(0)).by('dist').unfold().sum().is(lte(6000))).
path().
by('code').
by('dist').
limit(5)
The results are
1 path[AUS, 4901, LHR]
2 path[AUS, 1209, PIT, 3707, LHR]
3 path[AUS, 809, ATL, 4198, LHR]
4 path[AUS, 755, BNA, 4168, LHR]
5 path[AUS, 1690, BOS, 3254, LHR]
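The workaround relies on the fact that a Gremlin path alternates vertex, edge, vertex, edge, ...; substituting 0 for each vertex with by(constant(0)).by('dist') means summing the whole list yields just the edge distances. The trick in miniature, with illustrative values:

```python
# A path rendered as numbers by by(constant(0)).by('dist'):
# vertices become 0, edges keep their dist
path_as_numbers = [0, 1209, 0, 3707, 0]   # AUS -1209-> PIT -3707-> LHR

total = sum(path_as_numbers)   # the unfold().sum() step
print(total)                   # 4916
print(total <= 6000)           # True, so this path survives the where() filter
```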
Not having support for sack is going to make everything a lot more complex but this at least is a possible workaround.

Finding the indexes of the smallest 3 elements in a subsetted vector

I have the following vector a:
a<-c(100, 84, 126, 336, 544, 0, 2176)
I want to subset a using the following index vector b:
b<-c(1,2,4,5,7)
In this case, the subset would be:
a[b] = c(100, 84, 336, 544, 2176)
From this subset of a I want to take the smallest three numbers. I then want to know what indexes of a these smallest three numbers are.
The smallest 3 numbers in this subset would be:
c(84,100,336)
So the indexes of these numbers in a would be:
result<-c(2,1,4)
How can I get to this final result?
If efficiency is not important:
match(sort(a[b])[1:3], a)
# [1] 2 1 4
A bit faster:
match(sort(a[b], partial = 1:3)[1:3], a)
A bit cleaner:
intersect(order(a), b)[1:3]
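For readers outside R, the intersect(order(a), b)[1:3] idiom translates directly: rank all positions of a by value, keep only the positions that appear in b (preserving rank order), and take the first three. An equivalent sketch in Python, keeping R's 1-based indices for comparability:

```python
a = [100, 84, 126, 336, 544, 0, 2176]
b = [1, 2, 4, 5, 7]   # 1-based indices, as in the R example

# R's order(a): positions of a, sorted by the value at each position
order = sorted(range(1, len(a) + 1), key=lambda i: a[i - 1])

bset = set(b)
result = [i for i in order if i in bset][:3]   # intersect(order(a), b)[1:3]
print(result)  # [2, 1, 4]
```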

How to find edge lists of people who attended the same workshop on the same day in Gremlin?

I'd like to create an edge list that shows connections and connection strength. This sample graph contains 4 people and information about their attendance at workshops A and B, including the day attended and the number of hours they stayed. I'd like to form connections through the workshop node, where I would consider two people to be connected if they attended the same workshop on the same day, and the connection strength would be the minimum number of hours spent at the workshop.
Here is the sample graph:
g.addV('person').property(id, '1').property('name', 'Alice').next()
g.addV('person').property(id, '2').property('name', 'Bob').next()
g.addV('person').property(id, '3').property('name', 'Carol').next()
g.addV('person').property(id, '4').property('name', 'David').next()
g.addV('workshop').property(id, '5').property('name', 'A').next()
g.addV('workshop').property(id, '6').property('name', 'B').next()
g.V('1').addE('attended').to(g.V('5')).property('hours', 2).property('day', 'Monday').next()
g.V('1').addE('attended').to(g.V('6')).property('hours', 2).property('day', 'Monday').next()
g.V('2').addE('attended').to(g.V('5')).property('hours', 5).property('day', 'Monday').next()
g.V('3').addE('attended').to(g.V('6')).property('hours', 5).property('day', 'Monday').next()
g.V('4').addE('attended').to(g.V('5')).property('hours', 4).property('day', 'Tuesday').next()
g.V('4').addE('attended').to(g.V('6')).property('hours', 4).property('day', 'Monday').next()
g.V('2').addE('attended').to(g.V('6')).property('hours', 1).property('day', 'Monday').next()
Step 1 would show, for each pair who attended the same workshop on the same day, the minimum hours either of them spent at that workshop.
Note that David doesn't have any connections through workshop A because he attended it on a different day than Alice and Bob.
We can then find the total strength of the relationship by adding up the hours across workshops for each pair (so Alice and Bob have 3 total hours together, across workshops A and B).
I'm struggling with how to write this in a Neptune graph using Gremlin. I'm more familiar with Cypher, and could find this type of edge list using something like this:
match (p:Person)-[a:ATTENDED]->(w:Workshop)<-[a2:ATTENDED]-(other:Person)
where a.day = a2.day
and p.name <> other.name
unwind [a.hours, a2.hours] as hrs
with p, w, other, a, min(hrs) as hrs
return p.name, other.name, sum(hrs) as total_hours
This is as far as I've gotten with Gremlin, but I'm not sure how to finish up the summarization:
g.V().
hasLabel('person').as('p').
outE().as('e').
inV().as('ws').
inE('attended').
where(eq('e')).by('day').as('e2').
otherV().
where(neq('p')).as('other').
select('p','e','other','e2','ws').
by(valueMap('name','hours','day'))
Would anyone be able to help?
Given more time I am fairly sure the query can be simplified. However, given where you have got to so far, we can extract the details for each person:
g.V().
hasLabel('person').as('p').
outE().as('e').
inV().as('ws').
inE('attended').
where(eq('e')).by('day').as('e2').
otherV().
where(neq('p')).as('other').
select('p','e','other','e2','ws').
by(valueMap('name','hours','day').
by(unfold())).
project('p1','p2','shared').
by(select('p').select('name')).
by(select('other').select('name')).
by(union(select('e').select('hours'),
select('e2').select('hours')).min())
This gives us the time each person spent together but not yet the grand total
==>[p1:Alice,p2:Bob,shared:2]
==>[p1:Alice,p2:Carol,shared:2]
==>[p1:Alice,p2:David,shared:2]
==>[p1:Alice,p2:Bob,shared:1]
==>[p1:Bob,p2:Alice,shared:2]
==>[p1:Bob,p2:Alice,shared:1]
==>[p1:Bob,p2:Carol,shared:1]
==>[p1:Bob,p2:David,shared:1]
==>[p1:Carol,p2:Alice,shared:2]
==>[p1:Carol,p2:David,shared:4]
==>[p1:Carol,p2:Bob,shared:1]
==>[p1:David,p2:Alice,shared:2]
==>[p1:David,p2:Carol,shared:4]
==>[p1:David,p2:Bob,shared:1]
All that is left is to produce the final results. One way to do this is to use a group step.
gremlin> g.V().
......1> hasLabel('person').as('p').
......2> outE().as('e').
......3> inV().as('ws').
......4> inE('attended').
......5> where(eq('e')).by('day').as('e2').
......6> otherV().
......7> where(neq('p')).as('other').
......8> select('p','e','other','e2','ws').
......9> by(valueMap('name','hours','day').
.....10> by(unfold())).
.....11> project('p1','p2','shared').
.....12> by(select('p').select('name')).
.....13> by(select('other').select('name')).
.....14> by(union(select('e').select('hours'),
.....15> select('e2').select('hours')).min()).
.....16> group().
.....17> by(union(select('p1'),select('p2')).fold()).
.....18> by(select('shared').sum())
==>[[Bob,Carol]:1,[David,Alice]:2,[Carol,Alice]:2,[Carol,Bob]:1,[Alice,Bob]:3,[Carol,David]:4,[Bob,Alice]:3,
[David,Bob]:1,[Bob,David]:1,[David,Carol]:4,[Alice,Carol]:2,[Alice,David]:2]
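The final group().by(...).by(...) is just a keyed aggregation: the key is the (p1, p2) pair and the value is the running sum of shared hours. The same aggregation over a few of the projected rows, sketched in Python:

```python
from collections import defaultdict

# A few rows in the shape produced by the project('p1','p2','shared') step
rows = [
    ("Alice", "Bob", 2), ("Alice", "Bob", 1),
    ("Alice", "Carol", 2), ("Carol", "David", 4),
]

totals = defaultdict(int)
for p1, p2, shared in rows:
    totals[(p1, p2)] += shared   # group().by(pair).by(sum of 'shared')

print(dict(totals))
# {('Alice', 'Bob'): 3, ('Alice', 'Carol'): 2, ('Carol', 'David'): 4}
```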
Adding an unfold makes the results a little easier to read. I did not try to factor out duplicates such as Bob-Alice and Alice-Bob. If you need to do that in the query, an order step could be added after the group is created, followed by a dedup.
gremlin> g.V().
......1> hasLabel('person').as('p').
......2> outE().as('e').
......3> inV().as('ws').
......4> inE('attended').
......5> where(eq('e')).by('day').as('e2').
......6> otherV().
......7> where(neq('p')).as('other').
......8> select('p','e','other','e2','ws').
......9> by(valueMap('name','hours','day').
.....10> by(unfold())).
.....11> project('p1','p2','shared').
.....12> by(select('p').select('name')).
.....13> by(select('other').select('name')).
.....14> by(union(select('e').select('hours'),
.....15> select('e2').select('hours')).min()).
.....16> group().
.....17> by(union(select('p1'),select('p2')).fold()).
.....18> by(select('shared').sum()).
.....19> unfold()
==>[Bob, Carol]=1
==>[David, Alice]=2
==>[Carol, Alice]=2
==>[Carol, Bob]=1
==>[Alice, Bob]=3
==>[Carol, David]=4
==>[Bob, Alice]=3
==>[David, Bob]=1
==>[Bob, David]=1
==>[David, Carol]=4
==>[Alice, Carol]=2
==>[Alice, David]=2

Gremlin - finding connected nodes with several boolean conditions on both nodes and edges properties

I want to find the nodes that should be linked to a given node, where a link is defined by the following logic over node and existing-edge attributes:
A) (The pair has the same zip (node attribute) and name_similarity (edge attribute) > 0.3 OR
B) The pair has a different zip and name_similarity > 0.5 OR
C) The pair has an edge type "external_info" with value = "connect")
D) AND the pair does not have an edge of type "external_info" with value = "disconnect"
In short, writing D for the presence of a "disconnect" edge:
(A | B | C) & (~D)
I'm still a newbie to gremlin, so I'm not sure how I can combine several conditions on edges and nodes.
Below is the code for creating the graph, as well as the expected results for that graph:
# creating nodes
(g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate())
node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()
# creating name similarity edges
g.V(node1).addE('name_similarity').from_(node1).to(node2).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node3).property('score', 0.2).next() # under threshold
g.V(node1).addE('name_similarity').from_(node1).to(node4).property('score', 0.4).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node5).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node6).property('score', 0).next() # under threshold
# creating external output edges
g.V(node1).addE('external_info').from_(node1).to(node5).property('decision', 'connect').next()
g.V(node1).addE('external_info').from_(node1).to(node6).property('decision', 'disconnect').next()
The expected output for input node A is nodes B (due to condition A), D (due to condition B), and F (due to condition C). Node E should not be linked due to condition D.
I'm looking for a Gremlin query that will retrieve these results.
Something seemed wrong in your data given the output you expected, so I made two corrections:
Vertex D wouldn't appear in the results because "score" was less than 0.5
"external_info" edges seemed reversed
Here's the data I used:
g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate()
node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()
g.V(node1).addE('name_similarity').from(node1).to(node2).property('score', 1).next()
g.V(node1).addE('name_similarity').from(node1).to(node3).property('score', 0.2).next()
g.V(node1).addE('name_similarity').from(node1).to(node4).property('score', 0.6).next()
g.V(node1).addE('name_similarity').from(node1).to(node5).property('score', 1).next()
g.V(node1).addE('name_similarity').from(node1).to(node6).property('score', 0).next()
g.V(node1).addE('external_info').from(node1).to(node6).property('decision', 'connect').next()
g.V(node1).addE('external_info').from(node1).to(node5).property('decision', 'disconnect').next()
I went with the following approach:
gremlin> g.V().has('person','name','A').as('a').
......1> V().as('b').
......2> where('a',neq('b')).
......3> or(where('a',eq('b')). // A
......4> by('zip').
......5> bothE('name_similarity').has('score',gt(0.3)).otherV().where(eq('a')),
......6> bothE('name_similarity').has('score',gt(0.5)).otherV().where(eq('a')), // B
......7> bothE('external_info'). // C
......8> has('decision','connect').otherV().where(eq('a'))).
......9> filter(__.not(bothE('external_info'). // D
.....10> has('decision','disconnect').otherV().where(eq('a')))).
.....11> select('a','b').
.....12> by('name')
==>[a:A,b:B]
==>[a:A,b:D]
==>[a:A,b:F]
I think this contains all the logic you were looking for, but I didn't spend a lot of time optimizing it. I don't think any optimization will get around the pain of the full graph scan from V().as('b'): either your situation involves a relatively small graph (in-memory, perhaps) and this query will work, or you will need to find another method altogether. Perhaps you have ways to further limit "b", which might help? If so, I'd try to better define the directionality of the edge traversals so as to avoid bothE() and instead use outE() or inE(), which would also remove the need for otherV(). Hopefully you use a graph that supports vertex-centric indices, which would speed up the edge lookups on "score" (it likely won't help much on "decision", which has low selectivity).
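Stripped of the traversal mechanics, the filter the query implements is the pure predicate (A | B | C) & ~D. As a Python sketch over a hypothetical pair record (the function and field names here are made up for illustration):

```python
def should_link(same_zip, similarity, external_decision):
    """(A | B | C) & ~D, where external_decision is the 'decision' value of
    an 'external_info' edge between the pair, or None if there isn't one."""
    a = same_zip and similarity > 0.3          # A: same zip, similarity > 0.3
    b = (not same_zip) and similarity > 0.5    # B: different zip, similarity > 0.5
    c = external_decision == "connect"         # C: explicit connect edge
    d = external_decision == "disconnect"      # D: explicit disconnect edge
    return (a or b or c) and not d

print(should_link(True, 1.0, None))           # vertex B: linked via A
print(should_link(False, 0.6, None))          # vertex D: linked via B
print(should_link(False, 0.0, "connect"))     # vertex F: linked via C
print(should_link(True, 1.0, "disconnect"))   # vertex E: excluded by D
```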

R filter produces NAs

I see posts on how the filter() function can deal with NAs, but I have the opposite problem.
I have a fully complete dataset that I process through a long algorithm. One piece of that algorithm is a series of FIR filters. After going through the filters, SOMETIMES my data comes back with NAs at the beginning and end. They don't pad the original dimensions; they "replace" the values that would otherwise have come out of the filter.
I've come up with a couple of hacks to remove the NAs after they are created, but I'm wondering if I can prevent the NAs from showing up in the first place?
set.seed(500)
xoff <- sample(-70:70, 8200, replace=TRUE)
filt <- c(-75, -98, -130, -174, -233, -312, -412, -524, -611, -574, -246, 485, 1503, 2545, 3446, 4174, 4749, 5189, 5502, 5689, 5750, 5689, 5502, 5189, 4749, 4174, 3446, 2545, 1503, 485, -246, -574, -611, -524, -412, -312, -233, -174, -130, -98, -75)
xs = filter(xoff,filt)
40 NAs are returned: 20 at the head, 20 at the tail.
sum(is.na(xs)==TRUE)
head(xs, n=21)
tail(xs, n=21)
The length of the original data and the filtered vector are identical.
length(xoff)==length(xs)
Another filter I use sometimes produces 20 NAs, 10 at the head, 10 at the tail. So it is filter dependent.
Things I've thought about:
-Is the length of the data indivisible by the length of the filter? No, the filter is length=41, and the data is length=8200. 8200/41 = 200.
-Unlike a moving average which would need n observations prior to starting the smooth, this FIR filter provides the filter values and doesn't rely on prior observations.
So, any help debugging this is much appreciated! Thanks!
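For what it's worth, this edge effect is inherent to a centered (two-sided) convolution filter, which is what stats::filter does by default: with a length-41 kernel, the output at position t needs inputs from t-20 through t+20, so the first and last 20 positions have no complete window and come back NA. The behavior can be reproduced with a plain convolution in Python (an illustrative re-implementation, not R's filter itself):

```python
def fir_filter(x, coeffs):
    """Centered FIR filter, like R's stats::filter(..., sides = 2):
    positions whose window runs off either end get None instead of a value."""
    half = len(coeffs) // 2            # 20 for a length-41 kernel
    out = []
    for t in range(len(x)):
        if t < half or t >= len(x) - half:
            out.append(None)           # incomplete window -> NA in R
        else:
            window = x[t - half : t + half + 1]
            out.append(sum(c * v for c, v in zip(coeffs, window)))
    return out

signal = list(range(100))   # toy data standing in for the 8200-sample vector
kernel = [1] * 41           # toy kernel; any length-41 filter behaves the same
result = fir_filter(signal, kernel)

print(result.count(None))  # 40: 20 at the head and 20 at the tail, as observed
```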