Best way to count downstream with edge data - graph

I have a NetworkX problem. I create a DiGraph from a pandas DataFrame and attach data to each edge. For a given node, I now need to group the edges among its descendants by an edge attribute and count the unique source nodes in each group.
This is my code. It works for one node, but I need to pass a lot of nodes to it and get the unique counts for each.
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes).copy()
domain_sources = {}
for s, t, v in subgraph.edges(data=True):
    if v["domain"] in domain_sources:
        domain_sources[v["domain"]].append(s)
    else:
        domain_sources[v["domain"]] = [s]
down_count = {}
for k, v in domain_sources.items():
    down_count[k] = len(list(set(v)))
It works, and for one node the running time is not a big deal, but I'm feeding this routine at least 40 to 50 nodes. Is this the best way? Is there something else I can do that groups by an edge attribute and counts the unique nodes?

Two possible enhancements:
Remove the copy() from the line that creates the subgraph. You are not changing anything, so the copy is redundant.
Use a defaultdict whose values are sets, so duplicate sources are dropped as they are added.
from collections import defaultdict
import networkx as nx
# missing part of df creation
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
# a subgraph view is enough, since nothing is modified
subgraph = graph.subgraph(downstream_nodes)
domain_sources = defaultdict(set)
for s, t, v in subgraph.edges(data=True):
    domain_sources[v["domain"]].add(s)
down_count = {}
for k, v in domain_sources.items():
    # v is already a set, so its length is the unique count
    down_count[k] = len(v)
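For the 40 to 50 start nodes mentioned in the question, the same idea can be wrapped in a function that is called once per node. The sketch below is deliberately library-free so that it runs on its own: `edges` is an assumed list of (source, target, domain) tuples standing in for the DataFrame rows, and `domain_counts` is a hypothetical helper name. With a NetworkX DiGraph, `nx.descendants(graph, node)` would take the place of the hand-written traversal.

```python
from collections import defaultdict

def domain_counts(edges, start_nodes):
    """For each start node, count the unique edge sources per domain downstream of it."""
    # adjacency list: source -> [(target, domain), ...]
    out_edges = defaultdict(list)
    for source, target, domain in edges:
        out_edges[source].append((target, domain))

    counts = {}
    for start in start_nodes:
        seen = {start}
        stack = [start]
        sources = defaultdict(set)
        while stack:
            node = stack.pop()
            # every out-edge of a reachable node lies inside the downstream subgraph
            for target, domain in out_edges.get(node, ()):
                sources[domain].add(node)
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        counts[start] = {domain: len(nodes) for domain, nodes in sources.items()}
    return counts

# made-up example data
edges = [("a", "b", "x"), ("a", "c", "y"), ("b", "d", "x"), ("c", "d", "x")]
result = domain_counts(edges, ["a", "b"])
print(result)
```

Because every target of a reachable node is itself reachable, collecting the out-edges of the visited nodes gives the same edge set as the induced subgraph, so no subgraph object is needed.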

Related

All path *lengths* from source to target in Directed Acyclic Graph

I have a graph given as an adjacency matrix (adj_mat.shape = (4000, 4000)). I need the list of path lengths over all paths from the source (row = 0) to the target (col = adj_mat.shape[0] - 1); the node sequences themselves are not important.
I only want to propagate the path lengths, so this differs from enumerating all simple paths from source to target and then scoring each path, which would be too slow. Is there a performant way to do this?
DFS is suggested as one possible strategy (noted here). My current implementation (below) is far from optimal:
import networkx as nx
import numpy as np
# create graph (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it)
G = nx.from_numpy_array(adj_mat, create_using=nx.DiGraph())
# initialize nodes
for node in G.nodes:
    G.nodes[node]['cprob'] = []
# set starting node value
G.nodes[0]['cprob'] = [0]
def propagate_prob(G, node):
    # find incoming edges to node
    predecessors = list(G.predecessors(node))
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = G.get_edge_data(prev_node, node)['weight']
        # get predecessor node value
        if len(G.nodes[prev_node]['cprob']) == 0:
            G.nodes[prev_node]['cprob'] = propagate_prob(G, prev_node)
        prev_node_arr = G.nodes[prev_node]['cprob']
        # add incoming edge weight to prev_node arr
        curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(prev_node_arr)])
    # update current node array
    G.nodes[node]['cprob'] = curr_node_arr
    return G.nodes[node]['cprob']
# calculate all path lengths from source to sink (the last node has index shape[0] - 1)
part_func = propagate_prob(G, adj_mat.shape[0] - 1)
I don't have a large example at hand (e.g. >300 nodes), but I found a non-recursive solution:
import networkx as nx
g = nx.DiGraph()
nx.add_path(g, range(7))
g.add_edge(0, 3)
g.add_edge(0, 5)
g.add_edge(1, 4)
g.add_edge(3, 6)
# first step retrieve topological sorting
sorted_nodes = nx.algorithms.topological_sort(g)
start = 0
target = 6
path_lengths = {start: [0]}
for node in sorted_nodes:
    if node == target:
        print(path_lengths[node])
        break
    if node not in path_lengths or g.out_degree(node) == 0:
        continue
    new_path_length = path_lengths[node]
    new_path_length = [i + 1 for i in new_path_length]
    for successor in g.successors(node):
        if successor in path_lengths:
            path_lengths[successor].extend(new_path_length)
        else:
            path_lengths[successor] = new_path_length.copy()
    if node != target:
        del path_lengths[node]
Output: [2, 4, 2, 4, 4, 6]
If you are only interested in the number of paths of each length, e.g. {2: 2, 4: 3, 6: 1} for the example above, you could even reduce the lists to dicts.
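A minimal sketch of that dict-based variant, assuming unit edge lengths as in the example. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) in place of NetworkX so that it runs on its own; `path_length_counts` is a hypothetical name, and the edge list mirrors the graph built above.

```python
from collections import Counter, defaultdict
from graphlib import TopologicalSorter

def path_length_counts(edges, start, target):
    """Return {path length: number of start->target paths of that length} for a DAG."""
    # map every node to the set of its predecessors
    preds = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
        preds.setdefault(u, set())

    counts = defaultdict(Counter)
    counts[start][0] = 1  # one path of length 0: the start node itself
    # static_order() yields each node after all of its predecessors
    for node in TopologicalSorter(preds).static_order():
        for pred in preds[node]:
            for length, n_paths in counts[pred].items():
                counts[node][length + 1] += n_paths
    return dict(counts[target])

# same graph as above: the path 0..6 plus four shortcut edges
edges = [(i, i + 1) for i in range(6)] + [(0, 3), (0, 5), (1, 4), (3, 6)]
length_counts = path_length_counts(edges, 0, 6)
print(length_counts)
```

Each node then stores one counter entry per distinct length instead of one list entry per path, which keeps memory bounded by the number of distinct lengths. With real-valued edge weights the distinct sums may not collapse as nicely.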
Background
Some explanation of what I'm doing (which I hope works for larger examples as well). The first step is to retrieve a topological ordering. Why? It tells me in which "direction" the edges flow, so I can process the nodes in that order without "missing any edge" and without the "backtracking" of a recursive variant. Next, I initialise the start node with a list containing the current path length ([0]). This list is copied to all successors with every element incremented by 1. After each iteration, the dict path_lengths holds the lengths of all paths from the start node to the nodes reached so far. The loop stops once it reaches the target node.
With igraph I can handle up to 300 nodes in about 1 second. I also found that reading the adjacency matrix directly (rather than calling igraph functions to retrieve edges/vertices) saves time. The two key bottlenecks are 1) extending a long list efficiently (while also limiting memory use) and 2) finding a way to parallelize. The running time grows exponentially past ~300 nodes, so I would love to see a faster solution that also fits into memory.
import igraph
import numpy as np
# create graph from adjacency matrix
G = igraph.Graph.Adjacency((trans_mat_pad > 0).tolist())
# add edge weights
G.es['weight'] = trans_mat_pad[trans_mat_pad.nonzero()]
# initialize nodes
for node in range(trans_mat_pad.shape[0]):
    G.vs[node]['cprob'] = []
# set starting node value
G.vs[0]['cprob'] = [0]
def propagate_prob(G, node, trans_mat_pad):
    # find incoming edges to node
    predecessors = trans_mat_pad[:, node].nonzero()[0]  # G.get_adjlist(mode='IN')[node]
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = trans_mat_pad[prev_node, node]  # G.es[prev_node]['weight']
        # get predecessor node value
        if len(G.vs[prev_node]['cprob']) == 0:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + propagate_prob(G, prev_node, trans_mat_pad)])
        else:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(G.vs[prev_node]['cprob'])])
    ## NB: If memory constraint, uncomment below
    # set max size
    # if len(curr_node_arr) > 100:
    #     curr_node_arr = np.sort(curr_node_arr)[:100]
    # update current node array
    G.vs[node]['cprob'] = curr_node_arr
    return G.vs[node]['cprob']
# calculate path lengths
path_len = propagate_prob(G, trans_mat_pad.shape[0]-1, trans_mat_pad)
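The size cap in the commented-out lines above can be sketched without recursion. The version below assumes the graph is a DAG given as (source, target, weight) tuples and keeps only the k smallest path lengths per node; `k_smallest_path_lengths` is a hypothetical name and the edge data is made up. It relies on the standard library's `graphlib` (Python 3.9+) and `heapq` rather than igraph.

```python
import heapq
from collections import defaultdict
from graphlib import TopologicalSorter

def k_smallest_path_lengths(weighted_edges, start, target, k):
    """Return the k smallest start->target path lengths in a weighted DAG."""
    # preds[v][u] = weight of edge u -> v
    preds = defaultdict(dict)
    for u, v, w in weighted_edges:
        preds[v][u] = w
        preds.setdefault(u, {})

    best = {start: [0]}
    # visit nodes so that every predecessor is finished before its successors
    for node in TopologicalSorter(preds).static_order():
        if node == start:
            continue
        candidates = (
            length + w
            for pred, w in preds[node].items()
            for length in best.get(pred, ())
        )
        # truncate to the k smallest lengths before moving on
        best[node] = heapq.nsmallest(k, candidates)
    return best.get(target, [])

weighted_edges = [(0, 1, 1), (1, 2, 1), (0, 2, 5), (2, 3, 1), (0, 3, 10)]
shortest_two = k_smallest_path_lengths(weighted_edges, 0, 3, 2)
print(shortest_two)
```

Truncating at every node does not lose any of the k smallest lengths at the target, because adding an edge weight to a list of lengths preserves their order. It does discard the longer paths, so it only helps if the smallest k are what is needed.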

DFS to get all possible solutions?

I have these circles:
I want to get the list of all possible solutions of maximum non-intersecting circles. This is an illustration of the solution I want from node A.
Therefore the possible solutions from node A are:
1 = [A,B,C], 2 = [A,B,E], 3 = [A,C,B], 4 = [A,E,B] ..etc
I want to store all of the possibilities in a list, which will then be used for weighting and selecting the best result. However, I'm still trying to create the list of all possibilities.
I've tried to code the structure here, but I'm still confused about backtracking and recursion. Could anyone help?
# List of circle
list_of_circle = ['A','B','C','D','E']
# List of all possible solutions
result = []
# List of possible nodes
ways = []
for k in list_of_circle:
    if len(list_of_circle)==0:
        result.append(ways)
    else:
        ways.append(k)
        list_of_circle.remove(k)
        for j in list_of_circle:
            if k.intersects(j):
                list_of_circle.remove(j)
return result
Here is a possible solution (a sketch; intersects is assumed to be a function you supply that tests whether two circles overlap). It returns a list of selections, one per way of choosing non-intersecting circles.
def get_max_non_intersect(selected_circles, current_circle_idx, all_circles):
    if current_circle_idx == len(all_circles):  # final case: one complete selection
        return [selected_circles]
    # recursively collect the selections in which the current circle is not selected
    list_without_current_circle = get_max_non_intersect(selected_circles, current_circle_idx + 1, all_circles)
    # now we check if we can add the current circle to the ones selected
    current_intersects_selected = False
    current_circle = all_circles[current_circle_idx]
    for selected_circle in selected_circles:
        if intersects(current_circle, selected_circle):
            current_intersects_selected = True
            break
    if current_intersects_selected:  # we cannot add the current circle
        return list_without_current_circle
    else:  # we can add the current circle
        list_with_current_circle = get_max_non_intersect(selected_circles + [current_circle], current_circle_idx + 1, all_circles)
        return list_with_current_circle + list_without_current_circle
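To make the recursion concrete, here is a self-contained sketch under stated assumptions: circles are (x, y, r) tuples keyed by name, two circles count as intersecting when they overlap or touch, and a selection counts as a solution when no further circle can be added to it. The names `non_intersecting_sets` and `maximal_sets` and the coordinates are made up for illustration.

```python
from math import hypot

def intersects(a, b):
    """True if circles a and b, each given as (x, y, r), overlap or touch."""
    return hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]

def non_intersecting_sets(circles, names=None, idx=0, chosen=()):
    """Yield every selection of circle names that is pairwise non-intersecting."""
    if names is None:
        names = list(circles)
    if idx == len(names):
        yield frozenset(chosen)
        return
    name = names[idx]
    # branch 1: skip the current circle
    yield from non_intersecting_sets(circles, names, idx + 1, chosen)
    # branch 2: take it, but only if it is compatible with everything chosen so far
    if not any(intersects(circles[name], circles[c]) for c in chosen):
        yield from non_intersecting_sets(circles, names, idx + 1, chosen + (name,))

def maximal_sets(circles):
    """Keep only the selections that are not contained in a larger valid selection."""
    all_sets = list(non_intersecting_sets(circles))
    return [s for s in all_sets if not any(s < other for other in all_sets)]

# C sits between A and B and overlaps both; A and B are far apart
circles = {"A": (0, 0, 1), "B": (3, 0, 1), "C": (1.5, 0, 1)}
solutions = maximal_sets(circles)
print(solutions)
```

The selections are returned as sets, so [A,B,C] and [A,C,B] from the question would count as one solution. If the visiting order matters for the weighting step, each set can be expanded into its permutations afterwards.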

Find neighbors of a node

How can I find the neighbors of a node in an undirected graph using Scala?
I created a graph from a node list and an edge list, and now I need the neighbor list of each node. Is there a library for this, or any idea how to build the neighbor list?
val graphx = Graph(nodes,routes)
val label = sc.textFile("label.csv")
val getgdata2 = label.map(line=>line.split(","))
val node11 = getgdata2.map(line=>((line(0)))).distinct
val verticesWithSuccessors: VertexRDD[Array[VertexId]] =
  graphx.ops.collectNeighborIds(EdgeDirection.Out)
val successorGraph = Graph(verticesWithSuccessors, routes)
val res = successorGraph.vertices.collect()
res.take(5)
--------------------------
Output shows :
(384,[J#38d17d80)
(454,[J#6ede46f6)
(1084,[J#66273da0)
(1410,[J#2127e66e)
(772,[J#1229a2b7)
Answer should be:
(384 - 1084, 984,2013)
(454 - 924)
(1084 - 2302,354)
I need to see the adjacency list for each node. Can anyone help me?

Dictionary key from pdb file

I'm trying to go through a .pdb file, calculate distance between alpha carbon atoms from different residues on chains A and B of a protein complex, then store the distance in a dictionary, together with the chain identifier and residue number.
For example, if the first alpha carbon ("CA") is found on residue 100 on chain A and the one it binds to is on residue 123 on chain B I would want my dictionary to look something like d={(A, 100):[B, 123, distance_between_atoms]}
from Bio.PDB.PDBParser import PDBParser
parser=PDBParser()
struct = parser.get_structure("1trk", "1trk.pdb")
def getAlphaCarbons(chain):
    vec = []
    for residue in chain:
        for atom in residue:
            if atom.get_name() == "CA":
                vec = vec + [atom.get_vector()]
    return vec
def dist(a,b):
    return (a-b).norm()
chainA = struct[0]['A']
chainB = struct[0]['B']
vecA = getAlphaCarbons(chainA)
vecB = getAlphaCarbons(chainB)
t={}
model=struct[0]
for model in struct:
    for chain in model:
        for residue in chain:
            for a in vecA:
                for b in vecB:
                    if dist(a,b)<=8:
                        t={(chain,residue):[(a, b, dist(a, b))]}
                        break
print t
The program has been running for ages and I had to abort the run (have I made an infinite loop somewhere?).
I was also trying to do this:
t = {i:[((a, b, dist(a,b)) for a in vecA) for b in vecB if dist(a, b) <= 8] for i in chainA}
print t
But it's printing info about residues in the following format:
<Residue PHE het= resseq=591 icode= >: []
It's not printing anything related to the distance.
Thanks a lot, I hope everything is clear.
I would strongly suggest using a library backed by C code for the distance calculations. I use mdtraj for this sort of thing, and it works much quicker than all the for loops in BioPython.
To get all pairs of alpha-carbons:
import itertools
import mdtraj as md
def get_CA_pairs(pdbfile):
    traj = md.load_pdb(pdbfile)
    topology = traj.topology
    # atom indices of all alpha-carbons
    CA_index = [atom.index for atom in topology.atoms if atom.name == 'CA']
    pairs = list(itertools.combinations(CA_index, 2))
    return pairs
Then, for quick computation of distances:
def get_distances(pdbfile, pairs):
    # returns a dict mapping (atom index 1, atom index 2) to the CA-CA distance
    traj = md.load_pdb(pdbfile)
    # result has shape (n_frames, n_pairs); mdtraj reports distances in nanometers
    dist = md.compute_distances(traj, pairs)
    # make the dictionary you desire (first frame only)
    distances = dict(zip(pairs, dist[0]))
    return distances
This includes all alpha-carbons. mdtraj also exposes the chain of each atom (atom.residue.chain), which can be used to select the CAs from each chain.
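As for the dictionary itself: in the question's loop, `t = {...}` rebinds the dictionary on every hit instead of adding a key, and the three outer loops repeat the same vecA x vecB comparison for every residue. A minimal sketch of the intended structure follows, assuming the CA coordinates have already been extracted into plain `{residue_number: (x, y, z)}` dicts per chain; the helper name `ca_contacts` and the coordinates are made up, and the cutoff is in the same units as the coordinates.

```python
from math import dist

def ca_contacts(ca_a, ca_b, cutoff=8.0):
    """Map ("A", residue) to a list of ("B", residue, distance) within the cutoff."""
    contacts = {}
    for res_a, xyz_a in ca_a.items():
        for res_b, xyz_b in ca_b.items():
            d = dist(xyz_a, xyz_b)
            if d <= cutoff:
                # add to the existing entry rather than replacing the whole dict
                contacts.setdefault(("A", res_a), []).append(("B", res_b, d))
    return contacts

# made-up coordinates for two residues per chain
ca_chain_a = {100: (0.0, 0.0, 0.0), 101: (20.0, 0.0, 0.0)}
ca_chain_b = {123: (3.0, 4.0, 0.0), 124: (30.0, 0.0, 0.0)}
contacts = ca_contacts(ca_chain_a, ca_chain_b)
print(contacts)
```

Each chain-A residue keeps a list, so a residue with several partners within the cutoff retains all of them.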

networkx, undirected graph: for one source node, find directly connected neighbors from target nodes list

In an undirected graph, for a given source node ('sa' in the code below) and a list of target nodes (tlist=['ta','tb','tc','td','te','tf']), I am trying to find the subset of target nodes that are directly connected to the source, i.e. targets that can only be reached via another target node do not go into the subset.
So for undirected graph G:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
# Graph.add_path was removed in NetworkX 2.4; nx.add_path is the replacement
nx.add_path(G, ['te','og','oe','sa','oa','ta','tb'])
#G = nx.Graph()
nx.add_path(G, ['tf','oe'])
nx.add_path(G, ['sa','of','td','od'])
nx.add_path(G, ['sa','ob','tc','oc','td'])
val_map = {'sa': 1.0,
           'ta': 0.5714285714285714,
           'tb': 0.5714285714285714,
           'tc': 0.5714285714285714,
           'td': 0.5714285714285714,
           'te': 0.5714285714285714,
           'tf': 0.5714285714285714
           }
values = [val_map.get(node, 0.25) for node in G.nodes()]
nx.draw(G, cmap=plt.get_cmap('jet'), node_color=values,with_labels=True)
plt.show()
The resulting subset of target nodes should be ['ta','tc','td','te','tf'].
Thanks in advance!
OK, sorry for my bad programming style, but this is only a draft. Anyhow, it seems to work; please test it on other undirected graphs and improve it if necessary:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
# Graph.add_path was removed in NetworkX 2.4; nx.add_path is the replacement
nx.add_path(G, ['te','og','oe','sa','oa','ta','tb'])
nx.add_path(G, ['tf','oe'])
nx.add_path(G, ['sa','of','td','od'])
nx.add_path(G, ['sa','ob','tc','oc','td'])
tlist=['ta','tb','tc','td','te','tf']
def deduplicate_list(seq):
    # drop duplicates while keeping the first occurrence of each item
    seen = set()
    seen_add = seen.add
    return [ x for x in seq if not (x in seen or seen_add(x))]
def nearest_connected_neighbors(graph,sourcenode,targetnodes):
    templist=[]
    endendlist=[]
    searchlist=[]
    tlist=targetnodes
    G=graph
    # neighbors() returns an iterator in NetworkX 2.x and later, so make a list
    nlist=list(G.neighbors(sourcenode))
    donelist=[sourcenode]
    while len(nlist)>0:
        for n in nlist:
            donelist.append(n)
            if n in tlist:
                endendlist.append(n)
        endendlist=deduplicate_list(endendlist)
        # do not search beyond nodes that are targets themselves
        searchlist = list(set(nlist) - set(endendlist))
        for n in searchlist:
            templist.extend(G.neighbors(n))
        templist=deduplicate_list(templist)
        nlist=[]
        nlist=list(set(templist) - set(donelist))
    return endendlist
print(nearest_connected_neighbors(G,'sa',tlist))
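The same search can be written as a breadth-first traversal that simply refuses to expand target nodes. The sketch below is library-free so that it runs on its own: `adjacency` is an assumed dict of neighbor sets built from the same paths as above, and `directly_connected_targets` is a hypothetical name. With a NetworkX graph, `G.neighbors(node)` would supply the neighbors instead.

```python
from collections import defaultdict, deque

def directly_connected_targets(adjacency, source, targets):
    """Targets reachable from source without passing through another target."""
    targets = set(targets)
    seen = {source}
    found = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if neighbor in targets:
                found.append(neighbor)  # record it, but do not search beyond it
            else:
                queue.append(neighbor)
    return found

# undirected adjacency built from the same paths as in the question
paths = [
    ['te', 'og', 'oe', 'sa', 'oa', 'ta', 'tb'],
    ['tf', 'oe'],
    ['sa', 'of', 'td', 'od'],
    ['sa', 'ob', 'tc', 'oc', 'td'],
]
adjacency = defaultdict(set)
for path in paths:
    for u, v in zip(path, path[1:]):
        adjacency[u].add(v)
        adjacency[v].add(u)

tlist = ['ta', 'tb', 'tc', 'td', 'te', 'tf']
nearest = directly_connected_targets(adjacency, 'sa', tlist)
print(sorted(nearest))
```

Each node is visited at most once, and 'tb' is never reached because the search stops at 'ta'.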
