Finding the cumulative sum of the value of nodes in a DAG - r

Suppose I have the following directed acyclic graph (DAG) with each node having a weight of 1.
I am interested in calculating the accumulated sum of each node based on the values of its ancestors. With every node weighted 1, as stated above, this is what I would expect to get:
This is what I tried to do:
library(tidygraph, quietly = TRUE)
library(tidyverse)
library(ggraph)
# create adjacencies
grafo_df <- tribble(
  ~from, ~to,
  "C", "A",
  "C", "B",
  "A", "D",
  "B", "D")
# create the graph
grafo <- as_tbl_graph(grafo_df)
# calculate accumulated sum
grafo %>%
  arrange(node_topo_order()) %>%
  mutate(
    revenue = 1,
    cum_weight = map_dfs(1, .f = function(node, path, ...) {
      sum(.N()$revenue[c(node, path$node)])
    })) %>%
  as_tibble() %>%
  unnest("cum_weight")
#> # A tibble: 4 x 3
#> name revenue cum_weight
#> <chr> <dbl> <dbl>
#> 1 C 1 1
#> 2 A 1 2
#> 3 B 1 2
#> 4 D 1 3
Created on 2021-05-13 by the reprex package (v2.0.0)
As you can see, the accumulated sum of D is 3 and not 4. I expected 4, because the value of D should be the sum of the accumulated values of A and B, and I do not understand why the result differs.
I have tried to understand the solution given here, but had a hard time understanding it
How can I get the accumulated sum?
Update # 1
I am not concerned (for the moment) with the complexity of the algorithm, that is, if the algorithm does it in O(V + E) it is not relevant.
An important point mentioned in this question is the problem of counting twice: the partial sum of A is C(1) + A(1) = 2 and the partial sum of B is C(1) + B(1) = 2, so taking the value of D as the partial sums of A (2) + B (2) would count C twice. I think that objection does not apply in this situation, for the following reason:
Let's imagine that each of these 4 nodes (A, B, C and D) are internet nodes that generate revenue of $1 each, so the total accumulated income of the 4 nodes would be $4. If D is the convergence node of the rest of nodes, then in a scenario where D stops working, the income of the remaining nodes and that of D would no longer be possible, therefore, its value is $4.
Update # 2
If I add a new path from C to D, the value of D should still be 4, because the number of nodes it depends on is unchanged; what should matter in the accumulated sum is the number of dependent nodes. For example, in the solution proposed by @ThomasIsCoding, adding this new path makes the value of D equal to 5. I suspect their algorithm uses the degrees as a parameter when calculating the cumulative sum. However, if I add an additional node, the calculation is correct.
Update # 3
The example that I have placed is simple with the intention that it is easy to understand the objective, however, I did not specify that it should be generalizable for a graph with many nodes with three different topologies. The outermost layers are trees, the middle layers are rings, and the innermost layer is a full mesh.

Here is an igraph option using distances() with the argument mode = "in".
If your nodes are unweighted, i.e., revenue = 1 for all nodes:
library(igraph)
library(dplyr)
g <- graph_from_data_frame(grafo_df)
data.frame(name = names(V(g))) %>%
mutate(revenue = 1) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 1 3
3 B 1 2
4 F 1 1
5 D 1 5
If your nodes are weighted, e.g.,
data.frame(name = names(V(g))) %>%
mutate(revenue = 1:n()) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 2 7
3 B 3 4
4 F 4 4
5 D 5 15
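To make the matrix expression above less opaque, here is a rough base R sketch of the same idea without igraph. It builds the "is an ancestor of, or is the node itself" matrix by hand and multiplies it by the weights. The variable names (edges, parent, reach) are made up for this illustration, and the edge list is copied from the Data section of this answer.

```r
# Edge list copied from the Data section of this answer
edges <- data.frame(
  from = c("C", "C", "A", "C", "B", "F"),
  to   = c("A", "B", "D", "D", "D", "A"),
  stringsAsFactors = FALSE
)
nodes <- unique(c(edges$from, edges$to))   # C, A, B, F, D
n <- length(nodes)

# parent[i, j] is 1 when there is an edge nodes[j] -> nodes[i]
parent <- matrix(0, n, n, dimnames = list(nodes, nodes))
parent[cbind(edges$to, edges$from)] <- 1

# reach[i, j] is 1 when nodes[j] is nodes[i] itself or one of its ancestors.
# Each multiplication by (I + parent) extends the paths by one edge;
# n - 1 rounds are enough because a simple path has at most n - 1 edges.
one_step <- diag(n) + parent
reach <- one_step
for (round in seq_len(n - 1)) {
  reach <- ((reach %*% one_step) > 0) * 1
}

# Same role as rowSums(reachability %*% diag(revenue)) in the answer
cum_unweighted <- drop(reach %*% rep(1, n))
cum_weighted   <- drop(reach %*% seq_len(n))
print(cum_unweighted)
print(cum_weighted)
```

Because each ancestor contributes one entry per row, a node reachable through several paths (C reaches D both directly and via A and B) is still counted only once.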
Data
grafo_df <- tribble(
~from, ~to,
"C", "A",
"C", "B",
"A", "D",
"C", "D",
"B", "D",
"F", "A"
)
and the DAG by plot(g) is given as

Now that the question is clear, I can propose an algorithm. I cannot code it, since I don't know the language you are using.
For each node Ni in the graph we calculate the set of ancestors Ai; with every node weighted 1, the accumulated sum for each node is then |Ai| + 1.
Initialize all nodes with an empty ancestor set Ai = {}, then:
1. Start with a set S0 containing all nodes with no incoming edges.
2. Initialize the next set Sn+1 as empty.
3. Iterate over Sn. For each node N:
   - For all nodes D with an incoming edge from N:
     - Merge the ancestor set of D with the ancestor set of N plus N itself.
     - Remove the edge N->D.
     - If D has no other incoming edges, add it to Sn+1.
4. If Sn+1 is not empty, increase the pass to n+1 and repeat from 2.
The main limitation of this solution is its complexity; I'll try to find an optimized version later.
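Since the question is tagged R, the ancestor-set procedure described above could be sketched in base R roughly as follows. This is only an illustration under assumptions: the function name cumulative_sum_dag, the from/to edge data frame and the optional named revenue vector are invented here, and the sketch counts incoming edges instead of literally deleting them. It does not try to address the complexity concern.

```r
# Hypothetical helper: for every node, sum the weights of its ancestors plus itself.
# `edges` is a data frame with character columns `from` and `to` describing a DAG.
cumulative_sum_dag <- function(edges, revenue = NULL) {
  nodes <- unique(c(edges$from, edges$to))
  if (is.null(revenue)) revenue <- setNames(rep(1, length(nodes)), nodes)

  # Every node starts with an empty ancestor set
  ancestors <- lapply(setNames(nodes, nodes), function(v) character(0))

  # Track the number of unprocessed incoming edges per node
  indegree <- setNames(rep(0L, length(nodes)), nodes)
  for (v in edges$to) indegree[[v]] <- indegree[[v]] + 1L

  current <- nodes[indegree == 0L]            # S0: nodes with no incoming edges
  while (length(current) > 0) {
    nxt <- character(0)                       # Sn+1
    for (node in current) {
      for (child in edges$to[edges$from == node]) {
        # Merge: the child inherits node and all of node's ancestors
        ancestors[[child]] <- union(ancestors[[child]],
                                    c(ancestors[[node]], node))
        indegree[[child]] <- indegree[[child]] - 1L   # "remove" the edge
        if (indegree[[child]] == 0L) nxt <- c(nxt, child)
      }
    }
    current <- nxt
  }

  # Sets guarantee that a shared ancestor is counted once per node
  vapply(nodes, function(v) sum(revenue[c(v, ancestors[[v]])]), numeric(1))
}

# The graph from the question, including the extra C -> D path from Update # 2
edges <- data.frame(
  from = c("C", "C", "A", "B", "C"),
  to   = c("A", "B", "D", "D", "D"),
  stringsAsFactors = FALSE
)
result <- cumulative_sum_dag(edges)
print(result)
```

With unit weights this gives D a value of 4 whether or not the direct C -> D edge is present, which is the behaviour asked for in Update # 2.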

Related

for loop to determine mutually exclusive/exhaustive network membership

I want to run a for loop that assigns group_ids to a dataset that has two columns, name and location. The goal is to plot large mutually exclusive and exhaustive networks. It is sort of like contact tracing, where individuals interact with locations and then, as a function of both individual interaction and locations, you get a finite network of people who belong to a group. So A and C both interact with location B. Both A and C are in the same group, but so is every other person who interacted with location B. Likewise, any other locations, x1-xn, that interacted with A or C or anyone else from location B are part of the same group. And so forth.
Basically, in the for loop, I want to do the following. First, for the first row in the data frame, I want to check if the name exists in the initialized name_group_id list. If it does, then assign the group_id associated with that name. If not, then check if the location exists in the location_group_id list. If it does, return that group_id; if it does not, then create a new group_id and add this row's name/group_id to name_group_id and the location/group_id to location_group_id.
This code does not work yet, as row three should have group_id 1 because its location, location 1, was already seen in row 1. But it seems close. Help tweaking this code would be awesome.
Thanks, everyone.
# Create data frame with the sample data
df <- data.frame(name = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
                 location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5))
# Initialize two lists
name_group_id <- list()
location_group_id <- list()
# Counter for creating new group IDs
group_id_counter <- 0
# For loop that implements the logic
for (i in 1:nrow(df)) {
  row <- df[i, ]
  name <- row$name
  location <- row$location
  # Check if the name exists in name_group_id
  if (name %in% names(name_group_id)) {
    group_id <- name_group_id[[name]]
  } else {
    # Check if the location exists in location_group_id
    if (location %in% names(location_group_id)) {
      group_id <- location_group_id[[location]]
    } else {
      group_id_counter <- group_id_counter + 1
      group_id <- paste0("Group ", group_id_counter)
      name_group_id[[name]] <- group_id
      location_group_id[[location]] <- group_id
    }
  }
  # Assign the group_id to the current row
  df[i, "group_id"] <- group_id
}
Right now, this code would return group_id 2 to row three, but it should be group_id 1 because location 1 was already seen in row 1 and assigned group_id 1. Help would be greatly appreciated.
Expected results for the fake data frame supplied are:
name location Group ID
a 1 1
a 2 1
b 1 1
b 3 1
b 4 1
c 3 1
c 2 1
d 5 2
d 6 2
e 7 3
e 8 3
f 4 1
g 9 4
g 10 4
h 5 2
As Gregor Thomas mentioned, this is the same as determining connected components in graphs. Another term for this is equivalence groups. The code below is a bit of a hack using an internal function of one of my packages (on CRAN); you could do this using the exported functions, but then you would have to convert the data into the right form, so this is easier:
First make a list of all node 'identifiers':
x <- c(unique(df$name), unique(df$location))
Determine which nodes belong to the same group
res <- reclin2:::equivalence(x = x,
df$name, df$location)
The result res is a vector with identifiers for each group for each value in x. To get the end result:
res[match(df$name, x)]
Edit: the OP mentioned that the datasets are large; this should work on large datasets.
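For readers who would rather not rely on an internal package function, the connected-components idea can also be sketched in base R with a small union-find. This is an illustrative sketch rather than the method used above: the helper find_root, the key prefixes "name:" and "loc:", and the choice to number groups by first appearance are all assumptions made here.

```r
df <- data.frame(
  name = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
  location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5),
  stringsAsFactors = FALSE
)

# Prefix the keys so a name can never collide with a location label
name_key <- paste0("name:", df$name)
loc_key  <- paste0("loc:", df$location)
keys <- unique(c(name_key, loc_key))

# parent[i] points towards the representative of the component containing keys[i]
parent <- seq_along(keys)

# Hypothetical helper: follow parent pointers until a self-pointing root is found
find_root <- function(i) {
  while (parent[[i]] != i) i <- parent[[i]]
  i
}

idx_name <- match(name_key, keys)
idx_loc  <- match(loc_key, keys)

# Each row links a name to a location, so merge their two components
for (r in seq_len(nrow(df))) {
  root_a <- find_root(idx_name[r])
  root_b <- find_root(idx_loc[r])
  if (root_a != root_b) parent[[max(root_a, root_b)]] <- min(root_a, root_b)
}

# Number the components in order of first appearance in the data
roots <- vapply(idx_name, find_root, integer(1))
df$group_id <- match(roots, unique(roots))
print(df)
```

On the sample data this reproduces the expected Group ID column from the question. It has no path compression, so for very large inputs a dedicated graph library is likely the better choice.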

How to number nodes when creating decision trees in R?

In R I am creating a data frame of the structure of decision trees. The issue I'm facing is that I have to number the nodes of the trees in a certain way that will allow me to plot them later. However, I'm struggling to find a good way to number the nodes. Hopefully my example below will explain the issue.
For example, if I have a column in my data frame that describes the path or direction of the nodes, like so:
df <- data.frame(
var = c("P", "L", "R", "RL", "RR",
"P", "L", "R" , "RL", "RR", "LL", "LR", "RRL", "RRR")
)
Here, P means the parent node, L means left node, R means right node, RL means the left node from the previous right node etc... The diagram below shows what the decision trees made from df$var would look like:
So, as we can see, every time we reach a P in df$var, we start a new decision tree, as it is the parent.
Now, I want to try and number the nodes, so I can plot them. I initially tried numbering the nodes sequentially, like so:
df <- df %>%
group_by(newVal = cumsum(var == "P")) %>%
mutate(node = 1:length(var)) %>%
ungroup() %>%
select(-newVal)
df
var node
P 1
L 2
R 3
RL 4
RR 5
P 1
L 2
R 3
RL 4
RR 5
LL 6
LR 7
RRL 8
RRR 9
For clarity, that would look like this:
But as you can see (mainly in the 2nd tree), due to the original ordering of df$var, it results in a non-intuitive numbering of the nodes. This presents a problem when I try to plot the tree.
The issue is that when I'm plotting the trees, I have to create data frames (one for each tree) with 2 columns, from and to, where we go from node x to node y. Using the 2nd image as an example, my data frames for plotting would look like this:
tree.1.edges <- data.frame(
  from = c(1,1,3,3),
  to = c(2,3,4,5)
)
tree.2.edges <- data.frame(
  from = c(1,1,2,2,3,3,5,5),
  to = c(2,3,6,7,4,5,8,9)
)
I'm finding it difficult to come up with a way to automate the process of creating the tree edge data frames using my method of sequentially numbering the nodes. Does anyone have any suggestions as to a better way I could do this?
Prefix
This is my solution. It returns a list of the edges with correctly numbered nodes.
The Nodes are numbered like this:
Parent node number < Child node number
Left node number < Right node number
Code
library(tidyverse)
df <- data.frame(
var = c("P", "L", "R", "RL", "RR",
"P", "L", "R" , "RL", "RR", "LL", "LR", "RRL", "RRR"),
stringsAsFactors = FALSE # important for character operations
)
#enumerate tree ids
# a new tree is initialized when a parent node "P" is initialized
df$tree <-cumsum(df$var=="P") # Cumsum increments for every TRUE by one
#sort nodes so that Left nodes are in front of Right nodes
# and every deeper level of the tree is numbered after
# the preceding level
df <- df %>% group_by(tree) %>% mutate(level = nchar(var)) %>%
group_by(tree) %>% arrange(level, # arrange by level first
# custom alphabet function where P comes first
# As L comes in front of R in the alphabet
# longer strings are
# correctly sorted
ifelse(var=="P",1,match(LETTERS,var)+1),
.by_group = TRUE)
# define the nodes as row numbers resetting at every tree
df <- df %>% group_by(tree) %>% mutate(node = row_number())
## At this point the nodes are numbered according to your specifications
# Find out parent node by deleting the last character from every node name (var)
df <- df %>% group_by(tree) %>% mutate(parent_node_name=substr(var,0,nchar(var)-1))
# define parent node of P as NA
df$parent_node_name[df$var=="P"] <- NA
# define parent nodes vars with still empty parent node name as "P"
df$parent_node_name[df$parent_node_name==""] <- "P"
# Match parent node names to node numbers
df <- df %>%
group_by(tree) %>%
mutate(parent_node_num = match(parent_node_name,var))
# split the dataframe into a list of dfs, one for each tree
list_edges <- split(df,df$tree)
# for every dataframe in the list, replace by a result dataframe (res)
list_edges <- lapply(list_edges, function(df_tree){
res <- data.frame(
from = df_tree$parent_node_num,
to = df_tree$node
)
# delete NAs from result
res <- res[!is.na(res$from),]
return(res)
})
# Show result
list_edges
# $`1`
# from to
# 2 1 2
# 3 1 3
# 4 3 4
# 5 3 5
#
# $`2`
# from to
# 2 1 2
# 3 1 3
# 4 3 4
# 5 3 5
# 6 2 6
# 7 2 7
# 8 5 8
# 9 5 9
The code is quite convoluted, but you can insert df at any point to look at the intermediate results. Or simply post a comment.
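As a shorter illustration of the key step (the parent of a node is its path with the last letter removed), here is a base R sketch. It is not the code above: the helper edges_for_tree is hypothetical, and it numbers nodes level by level with L before R, so the numbers for the second tree differ from the sequential numbering used in the question.

```r
var <- c("P", "L", "R", "RL", "RR",
         "P", "L", "R", "RL", "RR", "LL", "LR", "RRL", "RRR")
tree <- cumsum(var == "P")   # a new tree starts at every "P"

# Hypothetical helper: build a from/to edge table for the paths of one tree
edges_for_tree <- function(paths) {
  # Treat the root as the empty path, so every child is "parent path + one letter"
  paths[paths == "P"] <- ""
  # Order by depth, then alphabetically (L before R); the root sorts first
  paths <- paths[order(nchar(paths), paths)]
  # Dropping the last letter of a path gives the path of its parent
  parent_path <- substr(paths, 1, nchar(paths) - 1)
  edges <- data.frame(from = match(parent_path, paths),
                      to = seq_along(paths))
  edges[paths != "", ]       # the root has no incoming edge
}

edge_list <- lapply(split(var, tree), edges_for_tree)
print(edge_list)
```

Each element of edge_list is a from/to data frame that can be passed to a plotting function, one per tree.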

How to make a tapply style structure containing the N of multiple groups?

Probably very simple, but I'm stuck right now. The following code basically returns a list of the Var1 means grouped by Var2 (Group1's mean: 5, Group2's mean: 7, Group3's mean: 4) that can be referred to by Mean_group[GroupX]. How would I create a similar structure that just shows how many cases there are in each group, so I can access the N of each group like this:
N_of_each_group <- N of cases belonging to each group
N_of_each_group[GroupX]?
data <- data.frame(group = c("A", "B", "C", "A", "B", "B", "B"), values = c(1,2,5,4,6,3,4))
Mean_group <- tapply(as.numeric(data$values), data$group, mean, na.rm = TRUE)
It's meant for a Shiny app and the N_of_each_group should be added into a text always matching the group the given user is part of. So if there are 10 people in group 1 and 15 people in group 2 user A from group 1 is supposed to see a text saying "There were 10 people in your group" and user B from group 2 sees a text that reads "There were 15 people in your group".
dplyr excels at this:
library(dplyr)
data %>% group_by(group) %>% summarise(mean_group=mean(values, na.rm=TRUE), n_group=n())
# A tibble: 3 x 3
group mean_group n_group
<chr> <dbl> <int>
1 A 2.5 2
2 B 3.75 4
3 C 5 1
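If a tapply-style structure indexed by group name is what is needed (for example inside a Shiny text output), a base R sketch could look like the following. The variable user_group and the message wording are placeholders, and counting only non-missing values is an assumption about what "N" should mean here.

```r
data <- data.frame(group = c("A", "B", "C", "A", "B", "B", "B"),
                   values = c(1, 2, 5, 4, 6, 3, 4))

# Same shape as the means: one entry per group, addressable by group name
Mean_group <- tapply(data$values, data$group, mean, na.rm = TRUE)
N_of_each_group <- tapply(data$values, data$group, function(x) sum(!is.na(x)))

# Hypothetical lookup for the group of the current user
user_group <- "A"
msg <- sprintf("There were %d people in your group",
               N_of_each_group[[user_group]])
print(msg)
```

If missing values should count as cases too, table(data$group) gives the row count per group and can be indexed by name in the same way.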

How to create table with both forward and backward information for edges with igraph

I have created a table with igraph listing the data as follows :
where a,b,c,d,e are the edges.
a and b are mutual edges,
with the weight values of 1 for a->b, 2 for b->a (There is no self-loop).
By the way I used the following code to create the above table:
library(igraph)
library(dplyr)
g <- data.frame(from = c("a", "b", "c", "d", "e"),
to = c("b", "a", "a", "b", "a"), weight = c(1:5)) %>%
igraph::graph_from_data_frame()
Now I hope to create another table listing both the forward and backward information between the edges, as well as the weight values like:
Does anyone know how to do this with igraph?
First you could get a list of the pairs of nodes that share an edge, regardless of direction
simplified <- as.undirected(g, mode="collapse")
pairs <- ends(simplified, E(simplified))
Then we can write a helper function that returns the weight of the edge between two given nodes, or NA if that edge does not exist
get_edge_weight<- Vectorize(function(a, b) {
e <- E(g)[a %->% b]
if(length(e)==1) {
e$weight
} else {
NA
}
})
Then you can build your desired data.frame with
data.frame(from=pairs[,1], to=pairs[,2],
fwd=get_edge_weight(pairs[,1], pairs[,2]),
back=get_edge_weight(pairs[,2], pairs[,1])
)
# from to fwd back
# b a b 1 2
# c a c NA 3
# d b d NA 4
# e a e NA 5
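The same table can also be derived from the edge list alone, without querying the graph object. The following base R sketch assumes there is at most one edge per direction between any two nodes; the helper key and the column names fwd and back are chosen here only to mirror the output above.

```r
edges <- data.frame(from = c("a", "b", "c", "d", "e"),
                    to = c("b", "a", "a", "b", "a"),
                    weight = 1:5,
                    stringsAsFactors = FALSE)

# Put each pair of endpoints in a canonical order so a->b and b->a share one row
pairs <- unique(data.frame(from = pmin(edges$from, edges$to),
                           to = pmax(edges$from, edges$to),
                           stringsAsFactors = FALSE))

# Look up weights by a "from->to" key; a missing key yields NA
key <- function(a, b) paste(a, b, sep = "->")
weight_of <- setNames(edges$weight, key(edges$from, edges$to))

pairs$fwd  <- unname(weight_of[key(pairs$from, pairs$to)])
pairs$back <- unname(weight_of[key(pairs$to, pairs$from)])
print(pairs)
```

Here "forward" simply means from the alphabetically smaller node to the larger one, which happens to match the orientation in the output above.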

Simple data-manipulation in R

@Aniko points out that one way to view my problem is that I need to find the connected components of a graph, where the vertices are the groups and the variables group and nominated_group indicate an edge between two groups. My goal is to create a variable parent_group which indexes the connected components. Or as I put it before:
I have a data frame with four variables: ID, group, nominated_ID, and nominated_group.
Consider sister-groups: Groups A and B are sister-groups if there is at least one case in the data where group==A and nominated_group==B, or vice versa.
I would like to create a variable parent_group which takes on a unique value for each set of sister-groups. In other words, no nominations should occur between cases in different parent_groups. Making the parent_group sequential numbers seems like a good idea.
Many thanks for the help I already received here! I can't really contribute here but note that I try to pay it forward at stats.exchange and on wikipedia.
In my fake data, A and B are sister-groups. Either case ID=4 or ID=5 are sufficient to make this true. Each group is also their own sister-group. The goal, the creation of parent_group, should result in one parent_group for all cases in A or B, and another parent_group for group C
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
group = c("A", "A", "B", "B", "A", "C"),
nominated_ID = c(9, 8, 4, 9, 2, 7) )
df$nominated_group <- with(df, group[match(nominated_ID, ID)])
df
ID group nominated_ID nominated_group
1 9 A 9 A
2 5 A 8 <NA>
3 2 B 4 B
4 4 B 9 A
5 3 A 2 B
6 7 C 7 C
Consider a graph with the groups as its vertices and the edges indicating that the two groups occur for the same ID. Then I think you are looking for connected components of this graph. The following is a quick and dirty (and probably not optimal) implementation of this idea using the graph package:
library(graph)
#make some fake data
nom <- data.frame(group = c("A","A","A","B","B","C","C"),
group2 = c("A","A","B","B","A","C","C"),
stringsAsFactors=FALSE)
#remove duplicated pairs
#it will keep A-B distinct from B-A, could probably be fixed
nom1 <- nom[!duplicated(nom),]
#define empty graph
grps <- union(unique(nom$group), unique(nom$group2))
gg <- new("graphNEL", nodes=grps, edgeL=list())
#add an edge for every pair
for (i in 1:nrow(nom1)) gg <- addEdge(nom1$group[i], nom1$group2[i], gg, 1)
#find connected components
cc <- connComp(gg)
#assign parent by matching within cc
nom$parent <- apply(nom, 1,
function(x) which(sapply(cc, function(y) x["group"] %in% y)))
nom
group group2 parent
1 A A 1
2 A A 1
3 A B 1
4 B B 1
5 B A 1
6 C C 2
7 C C 2

Resources