All path *lengths* from source to target in Directed Acyclic Graph - graph

I have a graph given as an adjacency matrix (adj_mat.shape = (4000, 4000)). I need the list of path lengths (the sequence of nodes is not important) for all paths from the source (row = 0) to the target (col = adj_mat.shape[0] - 1).
I am not interested in the path sequences themselves; I only want to propagate the path lengths. This is therefore different from finding all simple paths, which would be too slow (i.e. find all paths from source to target, then score each path). Is there a performant way to do this?
DFS has been suggested as one possible strategy (noted here). My current implementation (below) is far from optimal:
import networkx as nx
import numpy as np

# create graph (adj_mat is the weighted adjacency matrix)
# note: from_numpy_matrix was removed in NetworkX 3.0; use nx.from_numpy_array there
G = nx.from_numpy_matrix(adj_mat, create_using=nx.DiGraph())
# initialize nodes
for node in G.nodes:
    G.nodes[node]['cprob'] = []
# set starting node value
G.nodes[0]['cprob'] = [0]

def propagate_prob(G, node):
    # find incoming edges to node
    predecessors = list(G.predecessors(node))
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = G.get_edge_data(prev_node, node)['weight']
        # get predecessor node value (computed recursively if still empty)
        if len(G.nodes[prev_node]['cprob']) == 0:
            G.nodes[prev_node]['cprob'] = propagate_prob(G, prev_node)
        prev_node_arr = G.nodes[prev_node]['cprob']
        # add incoming edge weight to prev_node arr
        curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(prev_node_arr)])
    # update current node array
    G.nodes[node]['cprob'] = curr_node_arr
    return G.nodes[node]['cprob']

# calculate all path lengths from source to sink
# (nodes are numbered 0..3999, so the sink is the last index, not 4000)
part_func = propagate_prob(G, adj_mat.shape[0] - 1)

I don't have a large example at hand (e.g. > 300 nodes), but I found a non-recursive solution:
import networkx as nx

g = nx.DiGraph()
nx.add_path(g, range(7))
g.add_edge(0, 3)
g.add_edge(0, 5)
g.add_edge(1, 4)
g.add_edge(3, 6)

# first step retrieve topological sorting
sorted_nodes = nx.algorithms.topological_sort(g)
start = 0
target = 6
path_lengths = {start: [0]}
for node in sorted_nodes:
    if node == target:
        print(path_lengths[node])
        break
    if node not in path_lengths or g.out_degree(node) == 0:
        continue
    new_path_length = path_lengths[node]
    new_path_length = [i + 1 for i in new_path_length]
    for successor in g.successors(node):
        if successor in path_lengths:
            path_lengths[successor].extend(new_path_length)
        else:
            path_lengths[successor] = new_path_length.copy()
    if node != target:
        del path_lengths[node]
Output: [2, 4, 2, 4, 4, 6]
If you are only interested in the number of paths of each length, e.g. {2: 2, 4: 3, 6: 1} for the example above, you could even replace the lists with dicts.
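A minimal sketch of that dict-based variant, assuming unit edge lengths as in the example. To avoid third-party dependencies it takes the graph as a plain dict of successor lists and orders the nodes with Kahn's algorithm instead of nx.topological_sort. The names path_length_counts and successors are illustrative and not part of NetworkX.

```python
from collections import Counter, defaultdict, deque


def path_length_counts(successors, start, target):
    """Count paths from start to target by length in a DAG with unit edges.

    successors maps every node to a list of its successor nodes.
    Returns a dict {path_length: number_of_paths}.
    """
    # In-degrees drive Kahn's algorithm (a topological ordering).
    indegree = defaultdict(int)
    for succs in successors.values():
        for s in succs:
            indegree[s] += 1
    queue = deque(n for n in successors if indegree[n] == 0)

    counts = {start: Counter({0: 1})}  # one path of length 0 to the start
    while queue:
        node = queue.popleft()
        here = counts.pop(node, Counter())  # drop the entry once processed
        if node == target:
            return dict(here)
        # Every path into `node` becomes one edge longer at each successor.
        shifted = Counter({length + 1: n for length, n in here.items()})
        for s in successors.get(node, ()):
            counts.setdefault(s, Counter()).update(shifted)
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return {}


# Same graph as in the NetworkX example.
example = {0: [1, 3, 5], 1: [2, 4], 2: [3], 3: [4, 6], 4: [5], 5: [6], 6: []}
result = path_length_counts(example, 0, 6)
print(result)
```

For the example graph this gives the counts {2: 2, 4: 3, 6: 1} quoted above. The memory per node then depends on the number of distinct path lengths rather than on the number of paths.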
Background
Some explanation of what I'm doing (which I hope works for larger examples as well). The first step is to retrieve the topological sorting. Why? Because then I know in which "direction" the edges flow, and I can simply process the nodes in that order without missing any edge and without the backtracking of a recursive variant. Afterwards, I initialise the start node with a list containing the current path length ([0]). This list is copied to all successors, with every path length increased by 1. In each iteration, the path lengths from the start node to all nodes processed so far are calculated and stored in the dict path_lengths. The loop stops when it reaches the target node.

With igraph I can handle up to 300 nodes in about 1 second. I also found that reading the adjacency matrix directly (rather than calling igraph functions to retrieve edges/vertices) saves time. The two key bottlenecks are 1) appending to a long list efficiently (while also limiting memory use) and 2) finding a way to parallelize. The run time grows exponentially past ~300 nodes. I would love to see if someone has a faster solution (that also fits into memory).
import igraph
import numpy as np

# create graph from adjacency matrix
G = igraph.Graph.Adjacency((trans_mat_pad > 0).tolist())
# add edge weights
G.es['weight'] = trans_mat_pad[trans_mat_pad.nonzero()]
# initialize nodes
for node in range(trans_mat_pad.shape[0]):
    G.vs[node]['cprob'] = []
# set starting node value
G.vs[0]['cprob'] = [0]

def propagate_prob(G, node, trans_mat_pad):
    # find incoming edges to node
    predecessors = trans_mat_pad[:, node].nonzero()[0]  # G.get_adjlist(mode='IN')[node]
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = trans_mat_pad[prev_node, node]  # G.es[prev_node]['weight']
        # get predecessor node value
        if len(G.vs[prev_node]['cprob']) == 0:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + propagate_prob(G, prev_node, trans_mat_pad)])
        else:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(G.vs[prev_node]['cprob'])])
    ## NB: If memory constraint, uncomment below
    # set max size
    # if len(curr_node_arr) > 100:
    #     curr_node_arr = np.sort(curr_node_arr)[:100]
    # update current node array
    G.vs[node]['cprob'] = curr_node_arr
    return G.vs[node]['cprob']

# calculate path lengths
path_len = propagate_prob(G, trans_mat_pad.shape[0]-1, trans_mat_pad)
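For comparison, here is a hedged sketch that combines the topological-order idea with edge weights and the optional "keep only the smallest lengths" cap from the commented-out lines. It is not the igraph code above: it uses only the standard library, takes the graph as a dict of weighted edges, and the names smallest_path_lengths, edges and k are assumptions made for this illustration. Adding the same edge weight to every length at a node preserves their order, so truncating to the k smallest at each node should still give the k smallest lengths at the target.

```python
import heapq
from collections import defaultdict, deque


def smallest_path_lengths(edges, source, target, k=None):
    """Path lengths from source to target in a weighted DAG, sorted ascending.

    edges maps (u, v) pairs to edge weights. If k is given, only the k
    smallest lengths are kept at every node, which bounds memory use.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for (u, v), weight in edges.items():
        successors[u].append((v, weight))
        indegree[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: a node is processed after all of its predecessors.
    queue = deque(n for n in nodes if indegree[n] == 0)
    lengths = {source: [0]}
    while queue:
        u = queue.popleft()
        here = lengths.pop(u, [])
        if k is not None:
            here = heapq.nsmallest(k, here)
        if u == target:
            return sorted(here)
        for v, weight in successors[u]:
            lengths.setdefault(v, []).extend(d + weight for d in here)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return []


# Small weighted DAG (hypothetical data) with three s -> t paths.
weighted = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "t"): 5.0,
            ("b", "t"): 0.5, ("a", "b"): 0.25}
all_lengths = smallest_path_lengths(weighted, "s", "t")
best_only = smallest_path_lengths(weighted, "s", "t", k=1)
print(all_lengths, best_only)
```

Without k the number of stored lengths still grows with the number of paths, so the cap is what keeps memory bounded on large graphs.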

Related

DFS to get all possible solutions?

I have these circles:
I want to get the list of all possible solutions of maximum non-intersecting circles. This is the illustration of the solution I want starting from node A.
Therefore the possible solutions from node A are:
1 = [A,B,C], 2 = [A,B,E], 3 = [A,C,B], 4 = [A,E,B], etc.
I want to store all of the possibilities in a list, which will then be used for weighting and selecting the best result. However, I'm still trying to create the list of all possibilities.
I've tried to code the structure here, but I am still confused about backtracking and recursion. Could anyone help?
# List of circle
list_of_circle = ['A','B','C','D','E']
# List of all possible solutions
result = []
# List of possible nodes
ways = []
for k in list_of_circle:
    if len(list_of_circle)==0:
        result.append(ways)
    else:
        ways.append(k)
        list_of_circle.remove(k)
        for j in list_of_circle:
            if k.intersects(j):
                list_of_circle.remove(j)
return result
Here is a possible solution (a Python-style sketch; intersects(a, b) is a helper you have to supply). It returns every selection of mutually non-intersecting circles.
def get_max_non_intersect(selected_circles, current_circle_idx, all_circles):
    if current_circle_idx == len(all_circles):  # final case: one complete selection
        return [selected_circles]
    # recursively collect the selections in which the current circle is not selected
    lists_without_current_circle = get_max_non_intersect(selected_circles, current_circle_idx + 1, all_circles)
    # now check whether the current circle can be added to the ones already selected
    current_intersects_selected = False
    current_circle = all_circles[current_circle_idx]
    for selected_circle in selected_circles:
        if intersects(current_circle, selected_circle):
            current_intersects_selected = True
            break
    if current_intersects_selected:  # we cannot add the current circle
        return lists_without_current_circle
    else:  # we can add the current circle
        lists_with_current_circle = get_max_non_intersect(selected_circles + [current_circle], current_circle_idx + 1, all_circles)
        return lists_with_current_circle + lists_without_current_circle
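A runnable sketch of the same include/exclude recursion, under stated assumptions: circles are (x, y, r) tuples, two circles count as intersecting when their discs overlap or touch, and "maximum" is read as "no further circle can be added". The helper names intersects, all_non_intersecting and maximal_only, and the sample coordinates, are made up for this illustration.

```python
import math


def intersects(a, b):
    """Circles are (x, y, r); they intersect if the discs overlap or touch."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]


def all_non_intersecting(selected, idx, circles):
    """Return every selection of mutually non-intersecting circles."""
    if idx == len(circles):
        return [selected]
    # Branch 1: skip the current circle.
    results = all_non_intersecting(selected, idx + 1, circles)
    # Branch 2: take it, if it is compatible with everything selected so far.
    current = circles[idx]
    if not any(intersects(current, s) for s in selected):
        results = all_non_intersecting(selected + [current], idx + 1, circles) + results
    return results


def maximal_only(selections, circles):
    """Keep only selections to which no further circle could be added."""
    def can_extend(sel):
        return any(c not in sel and not any(intersects(c, s) for s in sel)
                   for c in circles)
    return [sel for sel in selections if not can_extend(sel)]


# Three circles in a row: the middle one overlaps both neighbours.
circles = [(0, 0, 1), (1.5, 0, 1), (3, 0, 1)]
all_selections = all_non_intersecting([], 0, circles)
maximal = maximal_only(all_selections, circles)
print(maximal)
```

The recursion visits up to 2^n selections, so this is only practical for a small number of circles. The resulting list can then be weighted to pick the best selection.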

Best way to count downstream with edge data

I have a NetworkX problem. I create a digraph from a pandas DataFrame, and there is data that I set on the edges. I now need to count the number of unique source nodes among a node's descendants, grouped by an edge attribute.
This is my code. It works for one node, but I need to pass a lot of nodes to it and get the unique counts.
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes).copy()
domain_sources = {}
for s, t, v in subgraph.edges(data=True):
    if v["domain"] in domain_sources:
        domain_sources[v["domain"]].append(s)
    else:
        domain_sources[v["domain"]] = [s]
down_count = {}
for k, v in domain_sources.items():
    down_count[k] = len(list(set(v)))
It works, and for one node the time is not a big deal, but I'm feeding this routine at least 40 to 50 nodes. Is this the best way? Is there something else I can do that groups by an edge attribute and counts the unique nodes?
Two possible enhancements:
Remove the copy() call from the line that creates the subgraph. You are not changing anything, so the copy is redundant.
Use a defaultdict whose values are sets. Read more here.
from collections import defaultdict
import networkx as nx

# missing part of df creation
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes)  # a view, no copy needed
domain_sources = defaultdict(set)
for s, t, v in subgraph.edges(data=True):
    domain_sources[v["domain"]].add(s)
down_count = {}
for k, v in domain_sources.items():
    down_count[k] = len(v)  # v is already a set of unique sources
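Since the routine is called for 40 to 50 nodes, one further option is to skip the subgraph entirely. Every edge that starts at the node or at one of its descendants also ends at a descendant, so the edges of the subgraph are exactly the outgoing edges of the reachable nodes. The sketch below illustrates this with a plain edge list and the standard library instead of NetworkX. The function names and the sample edges are assumptions for the illustration only.

```python
from collections import defaultdict, deque


def build_out_edges(edges):
    """Index (source, target, domain) triples by source node, built once."""
    out_edges = defaultdict(list)
    for s, t, domain in edges:
        out_edges[s].append((t, domain))
    return out_edges


def downstream_domain_counts(out_edges, start):
    """Count unique source nodes per domain over all edges reachable from start."""
    seen = {start}
    queue = deque([start])
    sources = defaultdict(set)
    while queue:
        s = queue.popleft()
        for t, domain in out_edges.get(s, ()):
            sources[domain].add(s)
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return {domain: len(nodes) for domain, nodes in sources.items()}


# Hypothetical edge list: (source, target, domain)
edges = [("a", "b", "x"), ("a", "c", "y"), ("b", "d", "x"),
         ("c", "d", "x"), ("e", "a", "x")]
out_edges = build_out_edges(edges)
counts_by_node = {n: downstream_domain_counts(out_edges, n) for n in ["a", "d", "e"]}
print(counts_by_node)
```

The edge index is built once and reused for every start node, so each query costs one breadth-first traversal of the part of the graph reachable from that node.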

Given two lists (A, B) of the same length, how can I find the index i that makes max(sum(A[:i] + B[i:]), sum(A[i:] + B[:i])) smallest?

I am working on an online challenge problem. I can solve it with brute force, but the run time increases significantly when the lists become very long. I believe there must be a better algorithm, but I cannot find it. I would appreciate any ideas.
If you are allowed to use numpy, you can use numpy.cumsum to store the values sum(A[:i]), sum(A[i:]), sum(B[:i]) and sum(B[i:]) for every i in four arrays, as follows:
import numpy as np
A = []  # List A
B = []  # List B (same length as A)
# Prefix sums with a leading 0, so that A_start_to_i[i] == sum(A[:i]) for i = 0..len(A)
A_start_to_i = np.concatenate(([0], np.cumsum(A)))
B_start_to_i = np.concatenate(([0], np.cumsum(B)))
# Suffix sums are the total minus the prefix, so that A_i_to_end[i] == sum(A[i:])
A_i_to_end = A_start_to_i[-1] - A_start_to_i
B_i_to_end = B_start_to_i[-1] - B_start_to_i
Now all you need to do is build sum(A[:i]) + sum(B[i:]) and sum(A[i:]) + sum(B[:i]) for every i, take the larger of the two at each i, and find the index where that maximum is smallest.
first_array = A_start_to_i + B_i_to_end   # first_array[i] = sum(A[:i]) + sum(B[i:])
second_array = A_i_to_end + B_start_to_i  # second_array[i] = sum(A[i:]) + sum(B[:i])
# For every i keep the larger of the two sums
worst_case = np.maximum(first_array, second_array)
# The answer is the index where that larger sum is smallest
i = int(np.argmin(worst_case))
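The same prefix-sum idea can be sketched without numpy and checked against brute force. This is an illustration under the assumption that i ranges over 0..len(A) inclusive and that ties go to the smallest i. The names best_split and brute_force are hypothetical.

```python
from itertools import accumulate


def best_split(A, B):
    """Return (i, value) minimising max(sum(A[:i]+B[i:]), sum(A[i:]+B[:i]))."""
    prefix_a = [0, *accumulate(A)]  # prefix_a[i] == sum(A[:i])
    prefix_b = [0, *accumulate(B)]  # prefix_b[i] == sum(B[:i])
    total_a, total_b = prefix_a[-1], prefix_b[-1]
    best_i, best_value = 0, None
    for i in range(len(A) + 1):
        first = prefix_a[i] + (total_b - prefix_b[i])   # sum(A[:i]) + sum(B[i:])
        second = (total_a - prefix_a[i]) + prefix_b[i]  # sum(A[i:]) + sum(B[:i])
        worst = max(first, second)
        if best_value is None or worst < best_value:
            best_i, best_value = i, worst
    return best_i, best_value


def brute_force(A, B):
    """Reference implementation that recomputes every sum from scratch."""
    value, i = min((max(sum(A[:i] + B[i:]), sum(A[i:] + B[:i])), i)
                   for i in range(len(A) + 1))
    return i, value


A = [3, 1, 4, 1, 5]
B = [9, 2, 6, 5, 3]
print(best_split(A, B), brute_force(A, B))
```

Each candidate i is evaluated in constant time after the two prefix-sum passes, so the whole search is linear in the list length, whereas the brute-force version is quadratic.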

Using method source(edge) of Package Graphs.jl in Julia on Juliabox.org

Take a look at the following simple code example:
Pkg.add("Graphs")
using Graphs
gd = simple_graph(20, is_directed=true)  # directed graph with 20 nodes
nodeTo = 2
for nodeFrom in vertices(gd)  # add some edges...
    if (nodeTo != 20)
        add_edge!(gd, nodeFrom, nodeTo)
        nodeTo += 1
    end
end
for i in edges(gd)  # print source and target for every edge in gd
    println("Target: ", target(i))
    println("Source: ", source(i))
end
It works sometimes and prints the sources and targets of the edges, but most of the time when I run the cell (after programming in this or other cells, or after doing nothing) I get the following error:
type: anonymous: in apply, expected Function, got Int64
while loading In[11], in expression starting on line 14
in anonymous at no file:16
I have not changed any code concerning the method or the cell, but it does not work anymore. The method target(edge) works fine, but the method source(edge) causes problems most of the time.
http://graphsjl-docs.readthedocs.org/en/latest/graphs.html#graph
What should I do? I would be grateful for some help.
After some thought, I found out that the mistake has to be in the code between the lines of hash signs:
Pkg.add("JuMP")
Pkg.add("Cbc")
# Pkg.update()
using Graphs
using JuMP
using Cbc
function createModel(graph, costs, realConnections)
    m = Model(solver = CbcSolver())
    #defVar(m, 0 <= x[i=1:realConnections] <= 1, Int)
    #setObjective(m, Min, dot(x,costs[i=1:realConnections]))
    println(m)
    for vertex in vertices(graph)
        edgesIn = Int64[]   # array of edges going into the vertex
        edgesOut = Int64[]  # array of edges going out of the vertex
        for edge in edges(graph)
            if (target(edge) == vertex)  # works fine
                push!(edgesIn, edge_index(edge))
            end
            if (source(edge) == vertex)  # does not work
                push!(edgesOut, edge_index(edge))
                print(source(edge), " ")
            end
        end
        # #addConstraint(m, sum{x[edgesIn[i]], i=1:length(edgesIn)} - sum{x[edgesOut[j]], j=1:length(edgesOut)} == 0)
    end
    return m
end
file = open("csp50.txt")
g = createGraph(file)  # g[1] = simpleGraph object with 50 nodes and 173 arcs, g[2] = number of nodes, g[3]/g[4] = start/end time, g[5] = costs of each arc
# After running this piece of code, the source(edge) method does not work anymore
########################################################################################
# adding a source and sink node and adding edges between every node of the original graph and the source and sink node
realConnections = length(g[5])  # store the number of edges
source = (num_vertices(g[1])+1)
sink = (num_vertices(g[1])+2)
add_vertex!(g[1], source)
add_vertex!(g[1], sink)
push!(g[3], 0)
push!(g[3], 0)
push!(g[4], 0)
push!(g[4], 0)
for i in vertices(g[1])
    if (i != source)
        add_edge!(g[1], source, i)  # edge from source to i
        push!(g[5], 0)
    end
    if (i != sink)
        add_edge!(g[1], i, sink)  # edge from i to sink
        push!(g[5], 0)  # no cost/time for this edge
    end
end
######################################################################################
numEdges = num_edges(g[1]);
createModel(g[1], g[5], realConnections)
From Julia's Manual:
Julia will even let you redefine built-in constants and functions if
needed:
julia> pi
π = 3.1415926535897...
julia> pi = 3
Warning: imported binding for pi overwritten in module Main
3
julia> pi
3
julia> sqrt(100)
10.0
julia> sqrt = 4
Warning: imported binding for sqrt overwritten in module Main
4
However, this is obviously not recommended to avoid potential
confusion.
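So reusing source as a variable name rebinds it, and it no longer refers to the function from Graphs.jl. That is why the error reads "expected Function, got Int64": once the cell that assigns source has run, source(edge) tries to call an integer. Using a different variable name (for example sourceNode) should preserve the Graphs.jl definition of source.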

C Tree in R : Get all leaf node split variable in form of list in a non binary tree

The decision tree we are using in our current project uses the Conditional Inference Tree (C Tree) algorithm. I can extract the split variables for binary c-trees using the code below:
#develop ctree decision tree
prod_discount_data_ctree <- ctree(Discount~Prod, data=prod_discount_data, controls = ctree_control(minsplit=30))
plot(prod_discount_data_ctree)
#extract the left and right terminal node split rule (the tree is stored in the S4 slot @tree)
lvls <- levels(prod_discount_data_ctree@tree$psplit$splitpoint)
#left leaf node split variable
left.df = lvls[prod_discount_data_ctree@tree$psplit$splitpoint == 1]
#right leaf node split variable
right.df = lvls[prod_discount_data_ctree@tree$psplit$splitpoint == 0]
This works fine if the tree has only one inner node (depth = 1) that splits into 2 leaf nodes. But if the root (node 1) splits into further inner nodes (nodes 2 and 5), which in turn split into leaf nodes (node 2 into {3, 4}, node 5 into {6, 7}), how should I traverse deeper and get the split variables of the leaf nodes?
Based on this example, I want the split variables for nodes 3, 4, 6 and 7 in the form of 4 lists.
I tried all possible options and finally found a way to traverse inside a C-tree and get the split variables for each leaf node. I am pasting the code snippet in case anyone wants to refer to it in future.
if (nrow(SubBrandright_total) > 200) {
  sec_discount_data <- subset(SubBrandright_total, select=c(Discount,Sector))
  sec_discount_data_ctree <- ctree(Discount~Sector, data=sec_discount_data, controls = ctree_control(minsplit=30))
  sec_lvls_r <- levels(sec_discount_data_ctree@tree$psplit$splitpoint)
  #Testing if the node is terminal [TRUE] or not [FALSE]
  #print(sec_discount_data_ctree@tree$terminal)
  #print(sec_discount_data_ctree@tree$left$terminal)
  #print(sec_discount_data_ctree@tree$left$left$terminal)
  #print(sec_discount_data_ctree@tree$left$right$terminal)
  sec_left_left.df = sec_lvls_r[sec_discount_data_ctree@tree$left$psplit$splitpoint == 1]
  sec_left.df = sec_lvls_r[sec_discount_data_ctree@tree$psplit$splitpoint == 1]
  #Using setdiff to get right leaf node from Node minus left leaf node
  sec_left_right.df = setdiff(sec_left.df,sec_left_left.df)
  print("Sector Segmentation")
  print(sec_left_left.df)
  print(sec_left_right.df)
  sec_right.df = sec_lvls_r[sec_discount_data_ctree@tree$psplit$splitpoint == 0]
  sec_right_right.df = sec_lvls_r[sec_discount_data_ctree@tree$right$psplit$splitpoint == 0]
  #Using setdiff to get left leaf node from Node minus right leaf node
  sec_right_left.df = setdiff(sec_right.df,sec_right_right.df)
  print(sec_right_left.df)
  print(sec_right_right.df)
}

Resources