Find neighbors of a node - graph

How do I find the neighbors of a node in an undirected graph using Scala?
I created a graph from a node list and an edge list, and now I need to find the neighbor list of each node. Is there a library, or any idea how to build the neighbor list?
val graphx = Graph(nodes, routes)
val label = sc.textFile("label.csv")
val getgdata2 = label.map(line => line.split(","))
val node11 = getgdata2.map(line => line(0)).distinct
val verticesWithSuccessors: VertexRDD[Array[VertexId]] =
  graphx.ops.collectNeighborIds(EdgeDirection.Out)
val successorGraph = Graph(verticesWithSuccessors, routes)
val res = successorGraph.vertices.collect()
res.take(5)
--------------------------
The output shows:
(384,[J@38d17d80)
(454,[J@6ede46f6)
(1084,[J@66273da0)
(1410,[J@2127e66e)
(772,[J@1229a2b7)
The answer should be:
(384 - 1084, 984, 2013)
(454 - 924)
(1084 - 2302, 354)
I need to see the adjacency list for each node; the [J@… values above are just the default toString of the neighbor-ID arrays, so their contents are not displayed. Can anyone help me?

Related

All path *lengths* from source to target in Directed Acyclic Graph

I have a graph with an adjacency matrix of shape adj_mat.shape = (4000, 4000). My current problem involves finding the list of path lengths (the sequence of nodes is not so important) from the source (row = 0) to the target (col = adj_mat.shape[0] - 1).
I am not interested in finding the path sequences; I am only interested in propagating the path lengths. As a result, this is different from finding all simple paths, which would be too slow (i.e. find all paths from source to target, then score each path). Is there a performant way to do this?
DFS is suggested as one possible strategy (noted here). My current implementation (below) is simply not optimal:
import networkx as nx
import numpy as np

# create graph from the adjacency matrix
G = nx.from_numpy_matrix(adj_mat, create_using=nx.DiGraph())  # renamed nx.from_numpy_array in NetworkX >= 3.0
# initialize every node with an empty list of path lengths
for node in G.nodes:
    G.nodes[node]['cprob'] = []
# set starting node value
G.nodes[0]['cprob'] = [0]

def propagate_prob(G, node):
    # find incoming edges to node
    predecessors = list(G.predecessors(node))
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = G.get_edge_data(prev_node, node)['weight']
        # get predecessor node value, computing it recursively if missing
        if len(G.nodes[prev_node]['cprob']) == 0:
            G.nodes[prev_node]['cprob'] = propagate_prob(G, prev_node)
        prev_node_arr = G.nodes[prev_node]['cprob']
        # add incoming edge weight to prev_node arr
        curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(prev_node_arr)])
    # update current node array
    G.nodes[node]['cprob'] = curr_node_arr
    return G.nodes[node]['cprob']

# calculate all path lengths from source to sink (the last valid index, not 4000)
part_func = propagate_prob(G, adj_mat.shape[0] - 1)
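For a quick sanity check of the code above, here is a toy input (my own example, not from the question) that can be defined before running the snippet:

import numpy as np

# hypothetical toy matrix: two paths 0 -> 1 -> 3 and 0 -> 2 -> 3, all edge weights 1
adj_mat = np.array([[0, 1, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=float)
# running the snippet above on this matrix yields array([2., 2.]):
# both paths from node 0 to node 3 have total weight 2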
I don't have a large example at hand (e.g. > 300 nodes), but I found a non-recursive solution:
import networkx as nx

g = nx.DiGraph()
nx.add_path(g, range(7))
g.add_edge(0, 3)
g.add_edge(0, 5)
g.add_edge(1, 4)
g.add_edge(3, 6)

# first step: retrieve a topological sorting
sorted_nodes = nx.algorithms.topological_sort(g)
start = 0
target = 6
path_lengths = {start: [0]}
for node in sorted_nodes:
    if node == target:
        print(path_lengths[node])
        break
    if node not in path_lengths or g.out_degree(node) == 0:
        continue
    # every path to a successor is one edge longer than the path to this node
    new_path_length = [i + 1 for i in path_lengths[node]]
    for successor in g.successors(node):
        if successor in path_lengths:
            path_lengths[successor].extend(new_path_length)
        else:
            path_lengths[successor] = new_path_length.copy()
    if node != target:
        del path_lengths[node]
Output: [2, 4, 2, 4, 4, 6]
If you are only interested in the number of paths of each length, e.g. {2: 2, 4: 3, 6: 1} for the above example, you could even reduce the lists to dicts.
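A minimal sketch of that reduction, assuming g, start, and target are defined as in the snippet above; propagating collections.Counter objects instead of lists keeps only {length: count} pairs, which bounds memory by the number of distinct lengths:

from collections import Counter

path_length_counts = {start: Counter({0: 1})}
for node in nx.topological_sort(g):
    if node == target:
        print(dict(path_length_counts[node]))  # {2: 2, 4: 3, 6: 1}
        break
    if node not in path_length_counts:
        continue
    # every path to a successor is one edge longer, multiplicities preserved
    bumped = Counter({length + 1: count for length, count in path_length_counts[node].items()})
    for successor in g.successors(node):
        path_length_counts.setdefault(successor, Counter()).update(bumped)
    del path_length_counts[node]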
Background
Some explanation of what I'm doing (which I hope works for larger examples as well). The first step is to retrieve a topological sorting. Why? It tells me in which "direction" the edges flow, so I can simply process the nodes in that order without missing any edge or doing any backtracking as in a recursive variant. Afterwards, I initialise the start node with a list containing the current path length ([0]). This list is copied to all successors while updating the path length (all elements + 1). The goal is that in each iteration the path lengths from the starting node to all processed nodes are calculated and stored in the dict path_lengths. The loop stops after reaching the target node.
With igraph I can calculate up to 300 nodes in ~1 second, but the time grows exponentially past ~300 nodes. I also found that accessing the adjacency matrix directly (rather than calling igraph functions to retrieve edges/vertices) saves time. The two key bottlenecks are 1) appending to a long list efficiently (while also keeping memory bounded) and 2) finding a way to parallelize. I would love to see if someone has a faster solution (that also fits into memory).
import igraph
import numpy as np

# create graph from adjacency matrix
G = igraph.Graph.Adjacency((trans_mat_pad > 0).tolist())
# add edge weights
G.es['weight'] = trans_mat_pad[trans_mat_pad.nonzero()]
# initialize nodes
for node in range(trans_mat_pad.shape[0]):
    G.vs[node]['cprob'] = []
# set starting node value
G.vs[0]['cprob'] = [0]

def propagate_prob(G, node, trans_mat_pad):
    # find incoming edges to node
    predecessors = trans_mat_pad[:, node].nonzero()[0]  # G.get_adjlist(mode='IN')[node]
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = trans_mat_pad[prev_node, node]  # G.es[prev_node]['weight']
        # get predecessor node value, computing it recursively if missing
        if len(G.vs[prev_node]['cprob']) == 0:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + propagate_prob(G, prev_node, trans_mat_pad)])
        else:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(G.vs[prev_node]['cprob'])])
        ## NB: if memory-constrained, uncomment below to cap the array size
        # if len(curr_node_arr) > 100:
        #     curr_node_arr = np.sort(curr_node_arr)[:100]
    # update current node array
    G.vs[node]['cprob'] = curr_node_arr
    return G.vs[node]['cprob']

# calculate path lengths
path_len = propagate_prob(G, trans_mat_pad.shape[0] - 1, trans_mat_pad)
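One way to attack the memory bottleneck is to combine the direct matrix access above with the topological-order idea from the earlier answer: propagate {length: count} Counters instead of ever-growing arrays. This is my own sketch, assuming trans_mat_pad is the padded weighted adjacency matrix (a NumPy array) from the snippet above:

import igraph
import numpy as np
from collections import Counter

G = igraph.Graph.Adjacency((trans_mat_pad > 0).tolist())
target = trans_mat_pad.shape[0] - 1
counts = {0: Counter({0.0: 1})}
for node in G.topological_sorting():
    if node == target:
        break
    if node not in counts:
        continue
    for succ in np.nonzero(trans_mat_pad[node])[0]:
        # shift every known path length by the edge weight, keeping multiplicities
        shifted = Counter({length + trans_mat_pad[node, succ]: n
                           for length, n in counts[node].items()})
        counts.setdefault(succ, Counter()).update(shifted)
    del counts[node]
path_len_counts = counts.get(target, Counter())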

DFS to get all possible solutions?

I have these circles:
I want to get the list of all possible solutions of maximum non-intersecting circles. This is the illustration of the solution I want, starting from node A.
Therefore the possible solutions from node A:
1 = [A,B,C], 2 = [A,B,E], 3 = [A,C,B], 4 = [A,E,B], etc.
I want to store all of the possibilities in a list, which will then be used for weighting and selecting the best result. However, I'm still trying to build the list of all possibilities.
I've tried to code the structure here, but I'm still confused about backtracking and recursion. Could anyone help?
# List of circles
list_of_circle = ['A', 'B', 'C', 'D', 'E']
# List of all possible solutions
result = []
# List of possible nodes
ways = []
for k in list_of_circle:
    if len(list_of_circle) == 0:
        result.append(ways)
    else:
        ways.append(k)  # was ways.append[k], a syntax error
        list_of_circle.remove(k)  # note: mutating the list while iterating over it skips elements
        for j in list_of_circle:
            if k.intersects(j):
                list_of_circle.remove(j)
return result  # note: this return sits outside any function
Here is a possible solution (pseudocode, fixed up into runnable Python; intersects is assumed to be a predicate on two circles):
def get_max_non_intersect(selected_circles, current_circle_idx, all_circles):
    if current_circle_idx == len(all_circles):  # final case
        # wrap in a list so the recursive results concatenate as lists of selections
        return [selected_circles]
    # we recursively get the selections of circles if the current circle is not selected
    list_without_current_circle = get_max_non_intersect(selected_circles, current_circle_idx + 1, all_circles)
    # now we check if we can add the current circle to the ones selected
    current_intersects_selected = False
    current_circle = all_circles[current_circle_idx]
    for selected_circle in selected_circles:
        if intersects(current_circle, selected_circle):
            current_intersects_selected = True
            break
    if current_intersects_selected:  # we cannot add the current circle
        return list_without_current_circle
    else:  # we can add the current circle
        list_with_current_circle = get_max_non_intersect(selected_circles + [current_circle], current_circle_idx + 1, all_circles)
        return list_with_current_circle + list_without_current_circle
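A hypothetical harness for the sketch above; the (x, y, radius) tuple representation and the intersects predicate are my own assumptions, not from the question:

import math

def intersects(c1, c2):
    # circles overlap if the distance between centres is less than the sum of radii
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return math.hypot(x2 - x1, y2 - y1) < r1 + r2

circles = [(0, 0, 1), (1, 0, 1), (4, 0, 1), (6, 0, 1)]
solutions = get_max_non_intersect([], 0, circles)
# keep only the largest selections, since the question asks for maximum sets
best = max(len(s) for s in solutions)
print([s for s in solutions if len(s) == best])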

Best way to count downstream with edge data

I have a NetworkX problem. I create a digraph from a pandas DataFrame, with data set along the edges. I now need to count the number of unique sources among a node's descendants and access the edge attribute.
This is my code, and it works for one node, but I need to pass a lot of nodes to it and get unique counts for each.
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes).copy()
domain_sources = {}
for s, t, v in subgraph.edges(data=True):
    if v["domain"] in domain_sources:
        domain_sources[v["domain"]].append(s)
    else:
        domain_sources[v["domain"]] = [s]
down_count = {}
for k, v in domain_sources.items():
    down_count[k] = len(list(set(v)))
It works, but while the time is not a big deal for one node, I'm feeding this routine at least 40 to 50 nodes. Is this the best way? Is there something else I can do to group by an edge attribute and count the nodes uniquely?
Two possible enhancements:
Remove copy from the line creating the subgraph. You are not changing anything, so the copy is redundant.
Use a defaultdict of sets. Read more here.
from collections import defaultdict
import networkx as nx

# missing part of df creation
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes)
domain_sources = defaultdict(set)
for s, t, v in subgraph.edges(data=True):
    domain_sources[v["domain"]].add(s)
down_count = {}
for k, v in domain_sources.items():
    down_count[k] = len(v)  # v is already a set, no need to rebuild it
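For the 40 to 50 nodes, one option is to wrap the routine in a function and loop over the nodes of interest; a minimal sketch of that, with nodes_of_interest as a hypothetical name for your node list:

from collections import defaultdict
import networkx as nx

def domain_counts_for(graph, node):
    # unique edge sources per domain within the node's downstream subgraph
    nodes = nx.descendants(graph, node) | {node}
    domain_sources = defaultdict(set)
    for s, t, v in graph.subgraph(nodes).edges(data=True):
        domain_sources[v["domain"]].add(s)
    return {domain: len(sources) for domain, sources in domain_sources.items()}

# hypothetical batch call:
# counts = {n: domain_counts_for(graph, n) for n in nodes_of_interest}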

Why do I get a type error with an argument that is used twice for the same purpose

First of all, I have these types:
type position = float * float
type node = position
type path = position list
Here are the two pieces of code causing the error:
let build_path map source target =
  let rec build_aux acc map source x initial_target =
    if (((DistMap.find_opt x map) = None) || x = source) then acc @ [initial_target]
    else build_aux ((DistMap.find x map) :: acc) map source (DistMap.find x map) initial_target
  in build_aux [] map source target target

let shortest_path graph source target : path =
  build_path (snd (dijkstra graph source target)) source target
path has type position list for clarity.
Here's the error:
361 | build_path (snd (dijkstra graph source target)) source target
^^^^^^
Error: This expression has type position list
but an expression was expected of type position = float * float
I just don't get it. I've tried the build_path function in Utop, with a map filled like this:
DistMap.bindings prevMap;;
- : (node * (float * float)) list =
[((1., 1.), (7., 7.)); ((2., 2.), (1., 1.)); ((3., 3.), (2., 2.));
((4., 4.), (3., 3.)); ((5., 5.), (4., 4.))]
let l = build_list prevMap (1.,1.) (5.,5.);;
val l : node list = [(1., 1.); (2., 2.); (3., 3.); (4., 4.); (5., 5.)]
shortest_path has to, with 100% certainty, receive target with type node. The thing is, no error is raised when target is used as an argument to the dijkstra function, which requires a graph and two nodes, source and target.
So I'm really confused why target suddenly has the wrong type for build_path but not for dijkstra.
Any way to resolve this issue?
Thanks to the help of @Pierre G., we determined that target's type was constrained to position list by the dijkstra function, because I was comparing a list with target inside dijkstra. Once that mistake was fixed and target was compared with another node, the issue was solved.

C Tree in R: get all leaf-node split variables in the form of a list in a non-binary tree

The decision tree we are using in our current project uses the Conditional Inference tree (ctree) algorithm. I can extract the split variables for binary ctrees using the code below:
# develop ctree decision tree
prod_discount_data_ctree <- ctree(Discount ~ Prod, data = prod_discount_data, controls = ctree_control(minsplit = 30))
plot(prod_discount_data_ctree)
# extract the left and right terminal node split rule
lvls <- levels(prod_discount_data_ctree@tree$psplit$splitpoint)
# left leaf node split variable
left.df = lvls[prod_discount_data_ctree@tree$psplit$splitpoint == 1]
# right leaf node split variable
right.df = lvls[prod_discount_data_ctree@tree$psplit$splitpoint == 0]
This works fine if the tree has only one node (depth = 1) which splits into 2 leaf nodes. But if the tree has one node (node 1) that splits into multiple nodes (nodes 2 and 5), which further split into leaf nodes (node 2 into {3, 4}, node 5 into {6, 7}), how should I traverse deeper and get the leaf-node split variables?
Based on the example, I would want the split variables for nodes 3, 4, 6 and 7 in the form of 4 lists.
I tried all possible options and finally found a way to traverse a ctree and get the split variables for each leaf node. Pasting the code snippet in case anyone wants to refer to it in future.
if (nrow(SubBrandright_total) > 200) {
  sec_discount_data <- subset(SubBrandright_total, select = c(Discount, Sector))
  sec_discount_data_ctree <- ctree(Discount ~ Sector, data = sec_discount_data, controls = ctree_control(minsplit = 30))
  sec_lvls_r <- levels(sec_discount_data_ctree@tree$psplit$splitpoint)

  # Testing if the node is terminal [TRUE] or not [FALSE]
  # print(sec_discount_data_ctree@tree$terminal)
  # print(sec_discount_data_ctree@tree$left$terminal)
  # print(sec_discount_data_ctree@tree$left$left$terminal)
  # print(sec_discount_data_ctree@tree$left$right$terminal)

  sec_left_left.df = sec_lvls_r[sec_discount_data_ctree@tree$left$psplit$splitpoint == 1]
  sec_left.df = sec_lvls_r[sec_discount_data_ctree@tree$psplit$splitpoint == 1]
  # Using setdiff to get the right leaf node: node minus left leaf node
  sec_left_right.df = setdiff(sec_left.df, sec_left_left.df)
  print("Sector Segmentation")
  print(sec_left_left.df)
  print(sec_left_right.df)

  sec_right.df = sec_lvls_r[sec_discount_data_ctree@tree$psplit$splitpoint == 0]
  sec_right_right.df = sec_lvls_r[sec_discount_data_ctree@tree$right$psplit$splitpoint == 0]
  # Using setdiff to get the left leaf node: node minus right leaf node
  sec_right_left.df = setdiff(sec_right.df, sec_right_right.df)
  print(sec_right_left.df)
  print(sec_right_right.df)
}
