How can I find groups of nodes sharing common traits in a graph?

Let's say I have a graph that relates food items to traits such as sour, sweet, spicy, tangy, and so on.
How can I query the graph to give me a set of food items matching each possible combination of traits?
For example:
all foods that are sweet and spicy
all foods that are sweet and sour
all foods that are sweet, sour, and spicy
The graph tuples would look as follows:
F1 > Spicy
F1 > Sweet
F2 > Sour
F2 > Sweet
F3 > Sour
...
The query should output sets of food matching each possible combination of traits.
Spicy => F1, F2, F3, F4, F5
Spicy & Sweet => F1, F3, F5
Spicy & Sweet & Sour => F3
Spicy & Sweet & Sour & Tangy => F3
Spicy & Sour => ...
Spicy & Sour & Tangy => ...
Spicy & Tangy => ...

1) Assume the following inputs:
UNWIND [ {name: 'F1', traits: ['Spicy', 'Sweet' ]},
{name: 'F2', traits: ['Sour' , 'Sweet' ]},
{name: 'F3', traits: ['Tangy', 'Sour', 'Spicy' ]},
{name: 'F4', traits: ['Tangy', 'Sour', 'Spicy', 'Tart']} ] AS food
MERGE (F:Food {name: food.name}) WITH F, food
UNWIND food.traits as trait
MERGE (T:Trait {name: trait})
MERGE (F)-[:hasTrait]->(T)
RETURN F, T
2) Now we need to get all combinations of traits. For this we need the APOC library:
MATCH (T:Trait)
WITH collect(T) as traits
// The number of non-empty trait combinations is 2^n - 1
WITH traits, toInt(round(exp( log(2) * size(traits) )))-1 as combCount
// Go through all the combinations
UNWIND RANGE(1, combCount) as combIndex
UNWIND RANGE(0, size(traits)-1 ) as p
// Check whether the trait is present in the combination
CALL apoc.bitwise.op( toInt(round( exp(log(2) * p) )),'&',combIndex) YIELD value
WITH combIndex, collect(CASE WHEN value > 0 THEN traits[p] END) as comb
// Return all combinations of traits
RETURN comb ORDER BY size(comb)
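For reference, the bitmask trick above is the usual power-set enumeration; here is the same idea as a minimal Python sketch (the trait list is made up for illustration):
traits = ['Spicy', 'Sweet', 'Sour', 'Tangy']
# every non-empty subset corresponds to a bit pattern from 1 to 2^n - 1
for comb_index in range(1, 2 ** len(traits)):
    comb = [t for p, t in enumerate(traits) if comb_index & (1 << p)]
    print(comb)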
3) Now, for each combination, we need to find the foods common to all of its traits:
MATCH (T:Trait)
WITH collect(T) as traits
// The number of non-empty trait combinations is 2^n - 1
WITH traits, toInt(round(exp( log(2) * size(traits) )))-1 as combCount
// Go through all the combinations
UNWIND RANGE(1, combCount) as combIndex
UNWIND RANGE(0, size(traits)-1 ) as p
// Check whether the trait is present in the combination
CALL apoc.bitwise.op( toInt(round( exp(log(2) * p) )),'&',combIndex) YIELD value
WITH combIndex, collect(CASE WHEN value > 0 THEN traits[p] END) as comb
// Take foods for the first trait:
WITH comb, head(comb) as ft
OPTIONAL MATCH (ft)<-[:hasTrait]-(F:Food)
// For each candidate food, check which of the combination's traits it has
WITH comb, collect(F) as testFoods
UNWIND testFoods as food
UNWIND comb as trait
OPTIONAL MATCH p = (food)-[:hasTrait]->(trait)
WITH comb, food, trait, size(collect(p)) as pairs
// Keep only the foods that match every trait in the combination
WITH comb, food, collect(CASE WHEN pairs > 0 THEN trait END) as pairs
WITH comb, collect(CASE WHEN size(pairs)=size(comb) THEN food END) as pairs
// Return combinations where there is a common food
WITH comb, pairs WHERE size(pairs)>0
RETURN comb, pairs ORDER BY size(comb)

Keep in mind that the format of Neo4j query output is designed for rows and columns, not your desired output format, so this makes things a little tricky.
I would highly recommend just outputting your food items one per row, with boolean columns for membership in each distinct trait. Then, in your application code, insert the food objects into sets for each trait, calculate all possible combinations of traits you need, and perform set intersections to generate the groups (see the sketch after the query below).
This would make the neo4j query very easy:
MATCH (f:Food)
WITH f
RETURN f.name, EXISTS((f)-[:IS]->(:Trait{name:'tangy'})) AS tangy,
EXISTS((f)-[:IS]->(:Trait{name:'sweet'})) AS sweet,
EXISTS((f)-[:IS]->(:Trait{name:'sour'})) AS sour,
EXISTS((f)-[:IS]->(:Trait{name:'spicy'})) AS spicy
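For illustration, the application-side part could then look like this Python sketch (the food_traits dict stands in for the rows returned by the query above; the food names are made up):
from itertools import combinations

food_traits = {
    'F1': {'Spicy', 'Sweet'},
    'F2': {'Sour', 'Sweet'},
    'F3': {'Tangy', 'Sour', 'Spicy'},
}

# one set of foods per trait
foods_by_trait = {}
for food, traits in food_traits.items():
    for trait in traits:
        foods_by_trait.setdefault(trait, set()).add(food)

# intersect the per-trait sets for every combination of traits
all_traits = sorted(foods_by_trait)
for size in range(1, len(all_traits) + 1):
    for combo in combinations(all_traits, size):
        common = set.intersection(*(foods_by_trait[t] for t in combo))
        if common:
            print(' & '.join(combo), '=>', ', '.join(sorted(common)))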
That said, if you're determined to do the entire thing with a Neo4j query, it's going to be messy, since you'll need to track and generate all the combinations yourself. For the intersection operations, you'll want to install the APOC procedures library.
Seems to me that the best start is to create sets of food nodes according to each individual trait.
MATCH (f:Food)-[:IS]->(:Trait{name:'spicy'})
WITH COLLECT(f) AS spicyFood
MATCH (f:Food)-[:IS]->(:Trait{name:'sour'})
WITH COLLECT(f) AS sourFood, spicyFood
MATCH (f:Food)-[:IS]->(:Trait{name:'sweet'})
WITH COLLECT(f) AS sweetFood, sourFood, spicyFood
MATCH (f:Food)-[:IS]->(:Trait{name:'tangy'})
WITH COLLECT(f) AS tangyFood, sweetFood, sourFood, spicyFood
Now that you have these, you can do your intersections with every combination you're interested in.
CALL apoc.coll.intersection(tangyFood, sweetFood) YIELD value AS tangySweetFood
CALL apoc.coll.intersection(tangyFood, sourFood) YIELD value AS tangySourFood
CALL apoc.coll.intersection(tangyFood, spicyFood) YIELD value AS tangySpicyFood
CALL apoc.coll.intersection(tangySweetFood, sourFood) YIELD value AS tangySweetSourFood
CALL apoc.coll.intersection(tangySweetFood, spicyFood) YIELD value AS tangySweetSpicyFood
CALL apoc.coll.intersection(tangySourFood, spicyFood) YIELD value AS tangySourSpicyFood
CALL apoc.coll.intersection(tangySweetSourFood, spicyFood) YIELD value AS tangySweetSourSpicyFood
CALL apoc.coll.intersection(sweetFood, sourFood) YIELD value AS sweetSourFood
CALL apoc.coll.intersection(sweetFood, spicyFood) YIELD value AS sweetSpicyFood
CALL apoc.coll.intersection(sweetSourFood, spicyFood) YIELD value AS sweetSourSpicyFood
CALL apoc.coll.intersection(sourFood, spicyFood) YIELD value AS sourSpicyFood
RETURN tangyFood, sweetFood, sourFood, spicyFood,
tangySweetFood, tangySourFood, tangySpicyFood,
tangySweetSourFood, tangySweetSpicyFood, tangySourSpicyFood,
tangySweetSourSpicyFood,
sweetSourFood, sweetSpicyFood,
sweetSourSpicyFood,
sourSpicyFood

Related

How is the number of random walks determined in GDS/Neo4j?

I am running the random walk algorithm on my Neo4j graph named 'example', with the minimum allowed walk length (2) and walks per node (1). Namely,
CALL gds.beta.randomWalk.stream(
'example',
{
walkLength: 2,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS event_name
And I get 41 walks. How is this number determined? I checked the graph and it contains 161 nodes and 574 edges. Any insights?
Added later: Here is more info on the projected graph that I am constructing. Basically, I am filtering on nodes and relationships and just projecting the subgraph and doing nothing else. Here is the code -
// Filter for only IDH Codel recurrent events
WITH [path=(m:IDHcodel)--(n:Tissue)
WHERE (m.node_category = 'molecular' AND n.event_class = 'Recurrence')
AND NOT EXISTS((m)--(:Tissue{event_class:'Primary'})) | m] AS recur_events
// Obtain the sub-network with 2 or more patients in edges
MATCH p=(m1)-[r:hasIDHcodelPatients]-(m2)
WHERE (m1 IN recur_events AND m2 IN recur_events AND r.total_common_patients >= 2)
WITH COLLECT(p) AS all_paths
WITH [p IN all_paths | nodes(p)] AS path_nodes, [p IN all_paths | relationships(p)] AS path_rels
WITH apoc.coll.toSet(apoc.coll.flatten(path_nodes)) AS subgraph_nodes, apoc.coll.flatten(path_rels) AS subgraph_rels
// Form the GDS Cypher projection
CALL gds.graph.create.cypher(
'example',
'MATCH (n) where n in $sn RETURN id(n) as id',
'MATCH ()-[r]-() where r in $sr RETURN id(startNode(r)) as source , id(endNode(r)) as target, { LINKS: { orientation: "UNDIRECTED" } }',
{parameters: {sn: subgraph_nodes, sr: subgraph_rels} }
)
YIELD graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels
RETURN graph, nodes, rels
Thanks.
It seems that the documentation is missing the description for the sourceNodes parameter, which would tell you how many walks will be created.
We don't know the default value, but we can use the parameter to set the source nodes the walks should start from.
For example, you could treat every node in the graph as a source node (a random walk will start from each of them):
MATCH (n)
WITH collect(n) AS nodes
CALL gds.beta.randomWalk.stream(
'example',
{ sourceNodes:nodes,
walkLength: 2,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS event_name
This way you should get 161 walks as there are 161 nodes in your graph and the walksPerNode is set to 1, so a single random walk will start from every node in the graph. In essence, the number of source nodes times the walks per node will determine the number of random walks.

Is there a less verbose way to unwrap Maybe values in Elm?

I've been running into a frequent issue in Elm where I have a function that depends on multiple Maybe values being Just. Is there a less verbose way to write this code:
commandIf apples bananas oranges =
    case apples of
        Just apples_ ->
            case bananas of
                Just bananas_ ->
                    case oranges of
                        Just oranges_ ->
                            someCommand apples_ bananas_ oranges_

                        Nothing ->
                            Cmd.none

                Nothing ->
                    Cmd.none

        Nothing ->
            Cmd.none
If you need all three values at the same time, you can match them together as a tuple and leave all the other combinations (when one or several of them are Nothing) to the fallback case:
commandIf apples bananas oranges =
    case (apples, bananas, oranges) of
        (Just apples_, Just bananas_, Just oranges_) ->
            someCommand apples_ bananas_ oranges_

        _ ->
            Cmd.none
@laughedelic's answer is very good. I just wanted to offer some alternative and more generic solutions too, since verbose Maybe unwrapping is an issue I also ran into when I started out in Elm.
If you have a fixed number of Maybe values, you can use map2, map3 etc to do what you want (docs here):
commandIf apples bananas oranges =
    Maybe.map3 someCommand apples bananas oranges
        |> Maybe.withDefault Cmd.none
Here, someCommand is your function that takes 3 arguments and returns a command.
Maybe.map3 applies this function only if all 3 variables are Just x, and wraps the result in one Maybe type. So the result is Just (someCommand apples bananas oranges) if all 3 have a value; otherwise, it is Nothing.
This result is then "piped" into Maybe.withDefault, which returns Cmd.none if the input is Nothing, and otherwise returns the value (your command) without the Just.
If you had a list of Maybe values of unknown length, you could do something like this:
keepOnlyJusts : List (Maybe a) -> List a
keepOnlyJusts listOfMaybes =
    listOfMaybes
        |> List.filterMap identity

newList = keepOnlyJusts [ Just 1, Nothing, Just 3 ] -- == [1,3]
where the result is a (possibly empty) list in which only the values are kept.
Maybe.map3 solves your particular case, but this answer is about the general pattern of chaining maybe values using Maybe.andThen.
commandIf a_ b_ c_ =
    a_ |> Maybe.andThen (\a ->
    b_ |> Maybe.andThen (\b ->
    c_ |> Maybe.andThen (Just << someCommand a b)))
        |> Maybe.withDefault Cmd.none

Dictionary key from pdb file

I'm trying to go through a .pdb file, calculate distance between alpha carbon atoms from different residues on chains A and B of a protein complex, then store the distance in a dictionary, together with the chain identifier and residue number.
For example, if the first alpha carbon ("CA") is found on residue 100 on chain A and the one it binds to is on residue 123 on chain B, I would want my dictionary to look something like d={(A, 100):[B, 123, distance_between_atoms]}
from Bio.PDB.PDBParser import PDBParser

parser=PDBParser()
struct = parser.get_structure("1trk", "1trk.pdb")

def getAlphaCarbons(chain):
    vec = []
    for residue in chain:
        for atom in residue:
            if atom.get_name() == "CA":
                vec = vec + [atom.get_vector()]
    return vec

def dist(a,b):
    return (a-b).norm()

chainA = struct[0]['A']
chainB = struct[0]['B']
vecA = getAlphaCarbons(chainA)
vecB = getAlphaCarbons(chainB)

t={}
model=struct[0]
for model in struct:
    for chain in model:
        for residue in chain:
            for a in vecA:
                for b in vecB:
                    if dist(a,b)<=8:
                        t={(chain,residue):[(a, b, dist(a, b))]}
                        break
print t
The programme has been running for ages and I had to abort the run (have I made an infinite loop somewhere?).
I was also trying to do this:
t = {i:[((a, b, dist(a,b)) for a in vecA) for b in vecB if dist(a, b) <= 8] for i in chainA}
print t
But it's printing info about residues in the following format:
<Residue PHE het= resseq=591 icode= >: []
It's not printing anything related to the distance.
Thanks a lot, I hope everything is clear.
I would strongly suggest using C-backed libraries for calculating distances. I use mdtraj for this sort of thing and it works much more quickly than all the for loops in BioPython.
To get all pairs of alpha carbons:
import itertools
import mdtraj as md

def get_CA_pairs(pdbfile):
    # indices of every alpha carbon, then all unordered pairs of them
    traj = md.load_pdb(pdbfile)
    topology = traj.topology
    CA_index = [atom.index for atom in topology.atoms if atom.name == 'CA']
    pairs = list(itertools.combinations(CA_index, 2))
    return pairs
Then, for quick computation of distances:
def get_distances(pdbfile):
    # returns a dict mapping each (CA, CA) atom-index pair to its distance (in nm)
    traj = md.load_pdb(pdbfile)
    pairs = get_CA_pairs(pdbfile)
    dist = md.compute_distances(traj, pairs)[0]
    # make the dictionary you desire
    distances = dict(zip(pairs, dist))
    return distances
This includes all alpha carbons. mdtraj also has a chain identifier you can use to select the CAs from each chain.
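For example, here is a rough sketch of the per-chain selection using mdtraj's atom-selection language; it assumes chains A and B are chain ids 0 and 1 in the file, and note that mdtraj reports distances in nanometres (8 Å = 0.8 nm):
import itertools
import mdtraj as md

traj = md.load_pdb("1trk.pdb")
ca_A = traj.topology.select("chainid 0 and name CA")
ca_B = traj.topology.select("chainid 1 and name CA")

# distances for every A-B alpha-carbon pair, first (and only) frame
pairs = list(itertools.product(ca_A, ca_B))
dists = md.compute_distances(traj, pairs)[0]

# keep only the close contacts (<= 8 Angstrom)
close = {pair: d for pair, d in zip(pairs, dists) if d <= 0.8}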

Iterate over a list in a Match query

I have a relationship that has a list of ids, s_ids, as a property. Each id in the list corresponds to another node that holds the sentence for that id. I used:
MATCH (c: term)-[r: semrel]->(t: term), (b: Sentence)
Where r.source = "xyz" And b.sentence_id IN r.s_id
return r,b
to return all sentences corresponding to the relationship.
The result looks like:
r   b
w   abc
w   rty
w   zxv
e   nmx
e   qrt
The relationship r is repeated for every sentence. How can I group the list of sentences corresponding to each relationship to get:
r   b
w   abc, rty, zxv
e   nmx, qrt
Thanks
This should return each r and its collection of sentences:
MATCH (c: term)-[r: semrel]->(t: term), (b: Sentence)
WHERE r.source = "xyz" AND b.sentence_id IN r.s_i
RETURN r, COLLECT(b) AS sentences;
For better performance, if you create an index on :Sentence(sentence_id), like this:
CREATE INDEX ON :Sentence(sentence_id);
then this query (which adds a hint to use the index) should be faster (as the b nodes can be found using the index):
MATCH (c: term)-[r: semrel]->(t: term), (b: Sentence)
USING INDEX b:Sentence(sentence_id)
WHERE r.source = "xyz" AND b.sentence_id IN r.s_i
RETURN r, COLLECT(b) AS sentences;

What does the lambda calculus have to say about return values?

It is by now a well known theorem of the lambda calculus that any function taking two or more arguments can be written through currying as a chain of functions taking one argument:
# Pseudo-code for currying
f(x,y) -> f_curried(x)(y)
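As a concrete (if trivial) illustration, the same currying step written out in Python:
def f(x, y):
    return x + y

def f_curried(x):
    # returns a function of one argument that closes over x
    return lambda y: x + y

assert f(2, 3) == f_curried(2)(3)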
This has proven to be extremely powerful not just in studying the behavior of functions but in practical use (Haskell, etc.).
Functions returning values, however, seem to not be discussed. Programmers typically deal with their inability to return more than one value from a function by returning some meta-object (lists in R, structures in C++, etc.). It has always struck me as a bit of a kludge, but a useful one.
For instance:
# R code for "faking" multiple return values
uselessFunc <- function(dat) {
    model1 <- lm( y ~ x , data=dat )
    return( list( coef=coef(model1), form=formula(model1) ) )
}
Questions
Does the lambda calculus have anything to say about a multiplicity of return values? If so, do any surprising conclusions result?
Similarly, do any languages allow true multiple return values?
According to the Wikipedia page on lambda calculus:
Lambda calculus, also written as λ-calculus, is a formal system for function
definition, function application and recursion
And a function, in the mathematical sense:
Associates one quantity, the argument of the function, also known as the input,
with another quantity, the value of the function, also known as the output
So, answering your first question: no, lambda calculus (or any other formalism based on mathematical functions) cannot have multiple return values.
As for your second question: as far as I know, programming languages that implement multiple return values do so by packing multiple results into some kind of data structure (be it a tuple, an array, or even the stack) and then unpacking it later. That is where the differences lie: some languages make the packing/unpacking transparent to the programmer (for instance, Python uses tuples under the hood), while others make the programmer do the job explicitly. For example, Java programmers can simulate multiple return values to some extent by packing multiple results into a returned Object array and then extracting and casting the results by hand.
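For instance, Python's "transparent" version of that packing and unpacking looks like this minimal sketch:
def min_max(values):
    # looks like two return values, but it is really one tuple being returned
    return min(values), max(values)

lo, hi = min_max([3, 1, 4, 1, 5])   # the tuple is unpacked on assignment
print(lo, hi)                       # 1 5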
A function returns a single value. This is how functions are defined in mathematics. You can return multiple values by packing them into one compound value, but then it is still a single value. I'd call it a vector, because it has components. There are vector functions in mathematics, so there are also vector functions in programming languages. The only difference is the level of support from the language itself, and whether it facilitates this or not.
Nothing prevents you from having multiple functions, each one returning one of the multiple results that you would like to return.
For example, say you had the following function in Python, returning a list.
def f(x):
    L = []
    for i in range(x):
        L.append(x * i)
    return L
It returns [0, 3, 6] for x=3 and [0, 5, 10, 15, 20] for x=5. Instead, you can totally have
def f_nth_value(x, n):
    L = []
    for i in range(x):
        L.append(x * i)
    if n < len(L):
        return L[n]
    return None
Then you can request any of the outputs for a given input, and get it, or get None, if there aren't enough outputs:
In [11]: f_nth_value(3, 0)
Out[11]: 0
In [12]: f_nth_value(3, 1)
Out[12]: 3
In [13]: f_nth_value(3, 2)
Out[13]: 6
In [14]: f_nth_value(3, 3)
In [15]: f_nth_value(5, 2)
Out[15]: 10
In [16]: f_nth_value(5, 5)
Computational resources may be wasted if you have to do some of the same work, as in this case. Theoretically, it can be avoided by returning another function that holds all the results inside itself.
def f_return_function(x):
    L = []
    for i in range(x):
        L.append(x * i)
    holder = lambda n: L[n] if n < len(L) else None
    return holder
So now we have
In [26]: result = f_return_function(5)
In [27]: result(3)
Out[27]: 15
In [28]: result(4)
Out[28]: 20
In [29]: result(5)
Traditional untyped lambda calculus is perfectly capable of expressing this idea. (After all, it is Turing complete.) Whenever you want to return a bunch of values, just return a function that can give the n-th value for any n.
In regard to the second question, Python allows for such a syntax if you know exactly how many values the function is going to return.
def f(x):
    L = []
    for i in range(x):
        L.append(x * i)
    return L
In [39]: a, b, c = f(3)
In [40]: a
Out[40]: 0
In [41]: b
Out[41]: 3
In [42]: c
Out[42]: 6
In [43]: a, b, c = f(2)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-43-5480fa44be36> in <module>()
----> 1 a, b, c = f(2)
ValueError: need more than 2 values to unpack
In [44]: a, b, c = f(4)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-44-d2c7a6593838> in <module>()
----> 1 a, b, c = f(4)
ValueError: too many values to unpack
Lastly, here is an example from this Lisp tutorial:
;; in this function, the return result of (+ x x) is not assigned so it is essentially
;; lost; the function body moves on to the next form, (* x x), which is the last form
;; of this function body. So the function call only returns (* 10 10) => 100
* ((lambda (x) (+ x x) (* x x)) 10)
=> 100
;; in this function, we capture the return values of both (+ x x) and (* x x), as the
;; lexical variables SUM and PRODUCT; using VALUES, we can return multiple values from
;; a form instead of just one
* ((lambda (x) (let ((sum (+ x x)) (product (* x x))) (values sum product))) 10)
=> 20 100
I write this as a late response to the accepted answer since it is wrong!
Lambda calculus does have multiple return values, but it takes a bit to understand what returning multiple values means.
Lambda calculus has no inherent definition of a collection of stuff, but it does let you invent one using products and Church-encoded selectors.
Pure functional JavaScript will be used for this example.
let's define a product as follows:
const product = a => b => callback => callback(a)(b);
Then we can define church_0 and church_1, aka true/false, left/right, car/cdr, first/rest, as follows:
const church_0 = a => b => a;
const church_1 = a => b => b;
Let's start by making a function that returns two values, 20 and "Hello".
const product = a => b => callback => callback(a)(b);
const church_0 = a => b => a;
const church_1 = a => b => b;
const returns_many = () => product(20)("Hello");
const at_index_zero = returns_many()(church_0);
const at_index_one = returns_many()(church_1);
console.log(at_index_zero);
console.log(at_index_one);
As expected, we got 20 and "Hello".
To return more than 2 values, it gets a bit tricky:
const product = a => b => callback => callback(a)(b);
const church_0 = a => b => a;
const church_1 = a => b => b;
const returns_many = () => product(20)(
product("Hello")(
product("Yes")("No")
)
);
const at_index_zero = returns_many()(church_0);
const at_index_one = returns_many()(church_1)(church_0);
const at_index_two = returns_many()(church_1)(church_1)(church_0);
console.log(at_index_zero);
console.log(at_index_one);
console.log(at_index_two);
As you can see, a function can return an arbitrary number of values, but to access these values you cannot simply use result()[0], result()[1], or result()[2]; you must use functions that select the position you want.
This is strikingly similar to electrical circuits: circuits have no "0", "1", "2", "3", but they do have the means to make decisions, and we can abstract away our circuitry into a byte (a reverse list of 8 inputs) or a word (a reverse list of 16 inputs). In this language, 0 as a byte would be [0, 0, 0, 0, 0, 0, 0, 0], which is equivalent to:
const Byte = a => b => c => d => e => f => g => h => callback =>
callback(a)(b)(c)(d)(e)(f)(g)(h);
const Byte_one = Byte(0)(0)(0)(0)(0)(0)(0)(1); // the byte 00000001, i.e. the number 1
const Bit_zero = Byte_one(b7 => b6 => b5 => b4 => b3 => b2 => b1 => b0 => b0);
After inventing a number, we can write an algorithm that, given a byte-indexed array and a byte representing the index we want from that array, takes care of the boilerplate.
Anyway, what we call arrays is nothing more than the following, expressed at a higher level to show the point:
// maps a nested list of bits (an address)
// to a nested list of bits (bytes) interpreted as a string
const MyArray = function(index) {
return (index == 0)
? "0th"
: (index == 1)
? "first"
: "second"
;
};
except it doesn't do 2^32 - 1 if statements; it only does 8, recursively narrowing down to the specific element you want. Essentially it acts exactly like a multiplexer (except the "select" signal is actually a fixed number of bits (coproducts, choices) needed to uniquely address the elements).
My point is that arrays, maps, associative arrays, lists, bits, bytes, and words are all fundamentally functions, both at the circuit level (where we can represent complex universes with nothing but wires and switches) and at the mathematical level (where everything is ultimately products (sequences, difficult to manage without nesting, e.g. lists), coproducts (types, sets), and exponentials (free functors (lambdas), forgetful functors)).
