Recursive tree sum without defining a tree structure - recursion

I have a dictionary (1) of nodes and child nodes, Dictionary<int,int[]>, and a list (2) of weights associated with each node. The dictionary can be interpreted as follows: key 1 has values 3,4, which means that the node with id=1 is the parent of nodes 3 and 4; key 3 has values 5,6,8, which means that the node with id=3 is the parent of nodes 5, 6 and 8, and so on. The second list is just a list of weights, where the index is the id of the node the weight belongs to.
For each key node of dictionary (1), I want to calculate the sum of the weights of all its child nodes.
I think this problem is similar to a recursive tree sum, although my lists aren't set up as tree structures.
How should I proceed?

Here is a Python version of a solution to what you want to achieve:
dctNodeIDs_vs_Childs = {}
dctNodeIDs_vs_Childs[1] = (2,3,4)
dctNodeIDs_vs_Childs[2] = (13,14,15)
dctNodeIDs_vs_Childs[3] = (5,6,7,8)
dctNodeIDs_vs_Childs[4] = (9,10,11,12)
lstNodeIDs_vs_Weight = [None,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
def getSumOfWeights(currNodeID, lstNodeIDs_vs_Weight = lstNodeIDs_vs_Weight, dctNodeIDs_vs_Childs = dctNodeIDs_vs_Childs):
    sumOfWeights = 0
    print("#currNodeID", currNodeID)
    if currNodeID not in dctNodeIDs_vs_Childs:
        sumOfWeights += lstNodeIDs_vs_Weight[currNodeID]
    else:
        for childNodeID in dctNodeIDs_vs_Childs[currNodeID]:
            print("## childNodeID", childNodeID)
            if childNodeID not in dctNodeIDs_vs_Childs:
                sumOfWeights += lstNodeIDs_vs_Weight[childNodeID]
            else:
                sumOfWeights += lstNodeIDs_vs_Weight[childNodeID] + sum( [ getSumOfWeights(nodeID) for nodeID in dctNodeIDs_vs_Childs[childNodeID] ] )
    return sumOfWeights
lstNodeIDs_vs_WeightsOfChildNodes = [None for _ in range(len(lstNodeIDs_vs_Weight)+1)]
for nodeID in dctNodeIDs_vs_Childs.keys():
    print("nodeID =", nodeID)
    lstNodeIDs_vs_WeightsOfChildNodes[nodeID] = getSumOfWeights(nodeID)
print("---")
print(lstNodeIDs_vs_WeightsOfChildNodes)
which gives the following output:
nodeID = 1
#currNodeID 1
## childNodeID 2
#currNodeID 13
#currNodeID 14
#currNodeID 15
## childNodeID 3
#currNodeID 5
#currNodeID 6
#currNodeID 7
#currNodeID 8
## childNodeID 4
#currNodeID 9
#currNodeID 10
#currNodeID 11
#currNodeID 12
nodeID = 2
#currNodeID 2
## childNodeID 13
## childNodeID 14
## childNodeID 15
nodeID = 3
#currNodeID 3
## childNodeID 5
## childNodeID 6
## childNodeID 7
## childNodeID 8
nodeID = 4
#currNodeID 4
## childNodeID 9
## childNodeID 10
## childNodeID 11
## childNodeID 12
---
[None, 119, 42, 26, 42, None, None, None, None, None, None, None, None, None, None, None, None]
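For comparison, here is a more compact sketch of the same recursion. The names descendant_weight_sum, children and weights are my own and not part of the question. The function above adds a node's own weight only for leaves and for direct children, so a parent node sitting three or more levels below the starting node would have its own weight skipped; the variant below adds every descendant's weight exactly once, whatever the depth.

```python
def descendant_weight_sum(node_id, children, weights):
    """Sum the weights of every descendant of node_id (node_id itself is excluded)."""
    total = 0
    # A node without an entry in `children` is a leaf and contributes nothing here.
    for child_id in children.get(node_id, ()):
        total += weights[child_id] + descendant_weight_sum(child_id, children, weights)
    return total


# Same sample data as in the answer above.
children = {1: (2, 3, 4), 2: (13, 14, 15), 3: (5, 6, 7, 8), 4: (9, 10, 11, 12)}
weights = [None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

sums = {node_id: descendant_weight_sum(node_id, children, weights) for node_id in children}
print(sums)  # {1: 119, 2: 42, 3: 26, 4: 42}
```

This assumes the data really is a tree (no cycles and no shared children); with a cycle the recursion would never terminate.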

A colleague at work came up with this elegant solution (two dictionaries are needed). It might not be the most efficient one, though.
double MethodName(int Id) => FirstDic.ContainsKey(Id) ? FirstDic[Id].Sum(n => MethodName(n)) : SecondDic.Where(y => y.Key == Id).Select(x => x.Value).Sum();
Note that this adds up only the weights of the leaf nodes below Id; the weights of intermediate nodes are not included.

regex for searching through dataframe in R

I have a list of barcodes with the format: AAACCTGAGCGTCAAG-1
The letters can be A, C, G or T and the number after the dash can be 1 - 16.
barcode = c('AAACCTGAGCGTCAAG-1',
'AAACCTGAGTACCGGA-1',
'AAACCTGCAGCTGCTG-1',
'AAACCTGCATCACGAT-3',
'AAACCTGCATTGGGCC-5',
'AAACCTGGTATAGTAG-10',
'AAACCTGGTCGCGTGT-1',
'AAACCTGGTTTCCACC-16',
'AAACCTGTCATGCATG-14',
'AAACCTGTCGCAGGCT-15',
'AAACGGGAGAACTCGG-1')
cluster = c(6,3,6,16,17,11,14,18,9,8,14)
df <- data.frame(Barcode = barcode, Cluster = cluster)
I need to subset this dataframe based on the -# at the end of the barcode. I have been using the following, which works for every number except 1.
> df[grep("([ACGT]-10){1}", df$Barcode),]
Barcode Cluster
6 AAACCTGGTATAGTAG-10 11
When I use the following, it will include all the barcodes that end in -1, as well as -10, -11, -12, -13, -14, -15 and -16.
> df[grep("([ACGT]-1){1}", df$Barcode),]
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
6 AAACCTGGTATAGTAG-10 11
7 AAACCTGGTCGCGTGT-1 14
8 AAACCTGGTTTCCACC-16 18
9 AAACCTGTCATGCATG-14 9
10 AAACCTGTCGCAGGCT-15 8
11 AAACGGGAGAACTCGG-1 14
>
Is there a regex that will include barcodes ending in -1, but exclude all other barcodes that end in numbers from 10 - 16?
I want to subset the dataframe so that I only get this:
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
7 AAACCTGGTCGCGTGT-1 14
11 AAACGGGAGAACTCGG-1 14
>
Thanks!
How about:
df[grep("-1$", df$Barcode),]
This matches -1 only at the end of the string. Because the hyphen must come immediately before the final 1, barcodes ending in -10, -11, ... -16 are not matched:
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
7 AAACCTGGTCGCGTGT-1 14
11 AAACGGGAGAACTCGG-1 14
I think you can just use df[grep("([ACGT]-1$){1}", df$Barcode),]
The $ anchors the match to the end of the string. For more information on the use of "pattern", see: http://www.jdatalab.com/data_science_and_data_mining/2017/03/20/regular-expression-R.html
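To see the effect of the end-of-string anchor outside R, here is a small Python sketch using the standard re module and the barcodes from the question. The variable names are mine; the patterns are the same ones used above.

```python
import re

barcodes = [
    "AAACCTGAGCGTCAAG-1", "AAACCTGAGTACCGGA-1", "AAACCTGCAGCTGCTG-1",
    "AAACCTGCATCACGAT-3", "AAACCTGCATTGGGCC-5", "AAACCTGGTATAGTAG-10",
    "AAACCTGGTCGCGTGT-1", "AAACCTGGTTTCCACC-16", "AAACCTGTCATGCATG-14",
    "AAACCTGTCGCAGGCT-15", "AAACGGGAGAACTCGG-1",
]

# Without the anchor, "-1" is also found inside "-10", "-14", "-15", "-16".
unanchored = [b for b in barcodes if re.search(r"[ACGT]-1", b)]

# With "$", the 1 must be the last character of the barcode.
anchored = [b for b in barcodes if re.search(r"[ACGT]-1$", b)]

print(len(unanchored))  # 9
print(anchored)
```

The anchored list contains only the five barcodes ending in -1, which is the subset asked for in the question.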

Build new adjacency matrix after graph partitioning

I have an adjacency matrix stored in CSR format, e.g.
xadj = 0 2 5 8 11 13 16 20 24 28 31 33 36 39 42 44
adjncy = 1 5 0 2 6 1 3 7 2 4 8 3 9 0 6 10 1 5 7 11 2 6 8 12 3 7 9 13 4 8 14 5 11 6 10 12 7 11 13 8 12 14 9 13
I am now partitioning this graph using METIS. This gives me the partition vector part of the graph, basically a list that tells me which partition each vertex is in. Is there an efficient way to build the new adjacency matrix for this partitioning, such that I can partition the new graph again? For example, a function rebuildAdjacency(xadj, adjncy, part), if possible reusing xadj and adjncy.
I'm assuming that what you mean by "rebuild" is removing the edges between vertices that have been assigned different partitions? If so, the (probably) best you can do is iterate your CSR list, generate a new CSR list, and skip all edges that are between partitions.
In Python (a sketch):
def rebuildAdjacency(xadj, adjncy, part):
    new_xadj = []
    new_adjncy = []
    n = len(xadj) - 1  # number of vertices
    for row in range(n):
        row_index = xadj[row]
        next_row_index = xadj[row + 1]
        # New row index for the row we are currently building
        new_xadj.append(len(new_adjncy))
        for col in adjncy[row_index:next_row_index]:
            if part[row] == part[col]:
                # Same partition: put the row->col edge into the new CSR list
                new_adjncy.append(col)
    # Last entry in the row index field is the number of entries
    new_xadj.append(len(new_adjncy))
    return new_xadj, new_adjncy
I don't think that you can do this very efficiently re-using the old xadj and adjncy arrays. However, if you are doing this recursively, you can save memory allocation / deallocation by keeping exactly two copies of xadj and adjncy and alternating between them.
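If the goal is to partition each part again as a graph of its own, it may be more convenient to extract one part at a time and renumber its vertices from 0, since a CSR graph expects consecutive vertex ids. The following is only a sketch under that assumption; extract_part and its return layout are hypothetical names of mine, not part of METIS.

```python
def extract_part(xadj, adjncy, part, p):
    """Return (sub_xadj, sub_adjncy, old_ids) for the vertices assigned to part p."""
    # Vertices of part p, in their original order.
    old_ids = [v for v in range(len(xadj) - 1) if part[v] == p]
    # Map each original vertex id to its new, consecutive id.
    new_id = {old: new for new, old in enumerate(old_ids)}

    sub_xadj, sub_adjncy = [0], []
    for old in old_ids:
        for nbr in adjncy[xadj[old]:xadj[old + 1]]:
            if part[nbr] == p:  # drop edges that leave the part
                sub_adjncy.append(new_id[nbr])
        sub_xadj.append(len(sub_adjncy))
    return sub_xadj, sub_adjncy, old_ids


# Toy example: path graph 0-1-2-3 split into two halves.
xadj = [0, 1, 3, 5, 6]
adjncy = [1, 0, 2, 1, 3, 2]
part = [0, 0, 1, 1]
print(extract_part(xadj, adjncy, part, 1))  # ([0, 1, 2], [1, 0], [2, 3])
```

The returned old_ids list lets you translate the result of the next partitioning round back to the original vertex numbers.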

Generate list of strings

Here's what is inside my CSV file
Symbol
0 AACAF
1 AACAY
2 AACTF
3 AAGC
4 AAGIY
5 AAIGF
6 AAMAF
7 AAPH
8 AAPT
9 AAST
10 AATDF
11 AATGF
12 AATRL
13 AAUKF
14 AAWC
15 ABBY
16 ABCAF
17 ABCCF
18 ABCE
19 ABCFF
20 ABCZF
21 ABCZY
22 ABEPF
23 ABHD
24 ABHI
25 ABLT
26 ABLYF
27 ABNAF
28 ABNK
29 ABNRY
I would like to build a function that creates strings in batches of three symbols, e.g.
'AACAF,AACAY,AACTF'
'AAGC,AAGIY,AAIGF'
'AAMAF,AAPH,AAPT'
'AAST,AATDF,AATGF'
'AATRL,AAUKF,AAWC'
'ABBY,ABCAF,ABCCF'
'ABCE,ABCFF,ABCZF'
'ABCZY,ABEPF,ABHD'
'ABHI,ABLT,ABLYF'
'ABNAF,ABNK,ABNRY'
I started writing this in Python, but I don't know how to complete it. I think I could use the csv module to do that.
with open(path, 'r') as csvfile:
    rows=[row for row in csvfile]
batch_size = 100
listing = []
string = ''
count = 0
for index, row in enumerate(rows):
    if count >= batch_size:
        listing.append(string)
        string = ''
        count = 0
    ','.join((string,row))
    count += 1
How could I do that with python 3.6?
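import numpy
import pandas

arr = pandas.read_csv(path).Symbol.values
# numpy.split needs len(arr) to be a multiple of 3, otherwise it raises a ValueError
symbol_groups = numpy.split(arr, len(arr) // 3)
result = [','.join(symbols) for symbols in symbol_groups]
This should do what you're looking for, as long as the number of symbols is a multiple of three.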
with open(path, 'r') as csvfile:
    rows = [row.strip('\n') for row in csvfile]
batch_size = 100  # set to 3 to reproduce the example in the question
listing = []
string = ''
count = 0
for row in rows[1:]:  # skip the header line
    if count >= batch_size:
        listing.append(string)
        string = ''
        count = 0
    if count == 0:
        string = row
    else:
        string = ','.join((string, row))
    count += 1
if string:
    # Append the last batch, which the loop itself never flushes
    listing.append(string)
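The same batching can also be sketched with list slicing, which avoids the manual counter and copes with a final batch that is shorter than the others. The function name batch_join is made up for this example, and the symbols are assumed to be already loaded into a plain list.

```python
def batch_join(symbols, batch_size=3, sep=","):
    """Join consecutive groups of batch_size symbols into one string each."""
    return [
        sep.join(symbols[start:start + batch_size])
        for start in range(0, len(symbols), batch_size)
    ]


# First seven symbols from the CSV in the question.
symbols = ["AACAF", "AACAY", "AACTF", "AAGC", "AAGIY", "AAIGF", "AAMAF"]
print(batch_join(symbols))
# ['AACAF,AACAY,AACTF', 'AAGC,AAGIY,AAIGF', 'AAMAF']
```

The last string simply holds whatever symbols are left over when the total is not a multiple of the batch size.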

How to get the ID of each node from topological sort?

I have a network (a directed acyclic graph):
dag_1 <- barabasi.game(20)
I applied a topological sort:
top1 <- topo_sort(dag_1)
top1
+ 20/20 vertices, from 0ee5d26:
[1] 5 8 11 13 14 15 16 17 18 20 4 7 12 19 2 10 9 6 3 1
If I type top1 and hit enter, the results are above. I need to access the vector
5 8 11 13, ..., 1
I tried top1[1] and top1[[1]]. Neither of them gave me the vector.
How can I get it?
top1 is an object of class igraph.vs (a vertex sequence), so indexing it, e.g. top1[1:10], returns vertices of the graph rather than plain numbers. To get an ordinary vector of the vertex IDs, use:
as.vector(top1)

Vertex Labels in igraph R

I am using igraph to plot an undirected force network.
I have a dataframe of nodes and links as follows:
> links
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363
> nodes
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8
I plot these using igraph as follows:
gg <- graph.data.frame(links,directed=FALSE)
plot(gg, vertex.color = 'lightblue', edge.label=links$value, vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2 )
On this plot, igraph has used the proxy indexes for source and target as vertex labels.
I want to use the real ID's, in my links table expressed as sourceID and targetID.
So, for:
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
This would show as:
(1450552) ----- 0.6245 ----- (1519842)
Instead of:
(3) ----- 0.6245 ----- (4)
(Note that the proxy indexes are zero indexed in the links dataframe, and one indexed in the nodes dataframe. This offset by 1 is necessary for igraph plotting).
I know I need to somehow match or map the proxy indexes to their corresponding name within the nodes dataframe. However, I am at a loss, as I do not know the order in which igraph plots labels.
How can I achieve this?
I have consulted the following questions to no avail:
Vertex Labels in igraph with R
how to specify the labels of vertices in R
R igraph rename vertices
You can specify the labels like this:
library(igraph)
gg <- graph.data.frame(
  links, directed = FALSE,
  vertices = rbind(
    setNames(links[, c(1, 4)], c("id", "label")),
    setNames(links[, c(2, 5)], c("id", "label"))))
plot(gg, vertex.color = 'lightblue', edge.label = links$value,
     vertex.size = 1, edge.color = "darkgreen",
     vertex.label.font = 1, edge.label.font = 1, edge.label.cex = 1,
     vertex.label.cex = 2)
You could also pass
merge(rbind(
        setNames(links[, c(1, 4)], c("id", "label")),
        setNames(links[, c(2, 5)], c("id", "label"))),
      nodes,
      by.x = "label", by.y = "name")
to the vertices argument if you needed the other node attributes.
Data:
links <- read.table(header=T, text="
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363")
nodes <- read.table(header=T, text="
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8")
It appears I was able to repurpose the answer to this question to achieve this.
r igraph - how to add labels to vertices based on vertex id
The key was to use the vertex.label argument of plot() and select a subset of nodes$name.
For the index we can use the default vertex names that igraph assigns automatically. To extract these, you can type V(gg)$name.
Within plot(gg) we can then write:
vertex.label = nodes[c(as.numeric(V(gg)$name)+1),]$name
# 1 Convert to numeric
# 2 Add 1 for offset between proxy links index and nodes index
# 3 Select subset of nodes with above as row index. Return name column
As full code:
gg <- graph.data.frame(links,directed=FALSE)
plot(gg, vertex.color = 'lightblue', edge.label=links$value, vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2, vertex.label = nodes[c(as.numeric(V(gg)$name)+1),]$name)
With the data above, this gave: [plot of the graph with the real IDs as vertex labels]
The easiest solution would be to reorder the columns of links, because according to the documentation:
"If vertices is NULL, then the first two columns of d are used as a symbolic edge list and additional columns as edge attributes."
Hence, your code will give the correct output after running:
links <- links[,c(4,5,3)]
