How to number nodes when creating decision trees in R? - r

In R I am creating a data frame that describes the structure of decision trees. I have to number the nodes of the trees in a way that will allow me to plot them later, but I'm struggling to find a good way to do it. Hopefully the example below explains the issue.
For example, if I have a column in my data frame that describes the path or direction of the nodes, like so:
df <- data.frame(
var = c("P", "L", "R", "RL", "RR",
"P", "L", "R" , "RL", "RR", "LL", "LR", "RRL", "RRR")
)
Here, P means the parent node, L means the left node, R means the right node, RL means the left node from the previous right node, and so on. The diagram below shows what the decision trees made from df$var would look like:
So, as we can see, every time we reach a P in df$var, we start a new decision tree, as it is the parent.
Now, I want to try and number the nodes, so I can plot them. I initially tried numbering the nodes sequentially, like so:
library(dplyr)
df <- df %>%
group_by(newVal = cumsum(var == "P")) %>%
mutate(node = 1:length(var)) %>%
ungroup() %>%
select(-newVal)
df
var node
P 1
L 2
R 3
RL 4
RR 5
P 1
L 2
R 3
RL 4
RR 5
LL 6
LR 7
RRL 8
RRR 9
For clarity, that would look like this:
But as you can see (mainly in the 2nd tree), due to the original ordering of df$var, it results in a non-intuitive numbering of the nodes. This presents a problem when I try to plot the tree.
The issue is that, to plot the trees, I have to create a data frame for each tree with two columns, from and to, where each row is an edge going from node x to node y. Using the 2nd image as an example, my data frames for plotting would look like this:
tree.1.Edges <- data.frame(
from = c(1,1,3,3),
to = c(2,3,4,5)
)
tree.2.edges <- data.frame(
from = c(1,1,2,2,3,3,5,5),
to = c(2,3,6,7,4,5,8,9)
)
I'm finding it difficult to come up with a way to automate the creation of the tree edge data frames using my method of sequentially numbering the nodes. Does anyone have any suggestions as to a better way I could do this?

Prefix
This is my solution. It returns a list of the edges with correctly numbered nodes.
The Nodes are numbered like this:
Parent node number < Child node number
Left node number < Right node number
Code
library(tidyverse)
df <- data.frame(
var = c("P", "L", "R", "RL", "RR",
"P", "L", "R" , "RL", "RR", "LL", "LR", "RRL", "RRR"),
stringsAsFactors = FALSE # important for character operations
)
#enumerate tree ids
# a new tree is initialized when a parent node "P" is initialized
df$tree <-cumsum(df$var=="P") # Cumsum increments for every TRUE by one
# sort nodes so that every deeper level of the tree is numbered after
# the preceding level
df <- df %>% mutate(level = nchar(var)) %>%
group_by(tree) %>% arrange(level, # arrange by level first
# within a level the root "P" comes first; the remaining
# nodes keep their original row order, so a left node stays
# in front of its right sibling as long as the input lists it first
var != "P",
.by_group = TRUE)
# define the nodes as row numbers resetting at every tree
df <- df %>% group_by(tree) %>% mutate(node = row_number())
## At this point the nodes are numbered according to your specifications
# Find out parent node by deleting the last character from every node name (var)
df <- df %>% group_by(tree) %>% mutate(parent_node_name=substr(var,0,nchar(var)-1))
# define parent node of P as NA
df$parent_node_name[df$var=="P"] <- NA
# define parent nodes vars with still empty parent node name as "P"
df$parent_node_name[df$parent_node_name==""] <- "P"
# Match parent node names to node numbers
df <- df %>%
group_by(tree) %>%
mutate(parent_node_num = match(parent_node_name,var))
# split the dataframe into a list of dfs, one for each tree
list_edges <- split(df,df$tree)
# for every dataframe in the list, replace by a result dataframe (res)
list_edges <- lapply(list_edges, function(df_tree){
res <- data.frame(
from = df_tree$parent_node_num,
to = df_tree$node
)
# delete NAs from result
res <- res[!is.na(res$from),]
return(res)
})
# Show result
list_edges
# $`1`
# from to
# 2 1 2
# 3 1 3
# 4 3 4
# 5 3 5
#
# $`2`
# from to
# 2 1 2
# 3 1 3
# 4 3 4
# 5 3 5
# 6 2 6
# 7 2 7
# 8 5 8
# 9 5 9
The code is quite convoluted, but you can insert df at any point to look at the intermediate results. Or simply post a comment.
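A different numbering scheme avoids the sorting step altogether. As a minimal sketch (not the approach used in the answer above), each node can be given a heap-style id computed from its path string alone: the root is 1, and a node with id k has left child 2k and right child 2k + 1. The helper names path_to_id and tree_edges are hypothetical, and note that the resulting ids are no longer consecutive.

```r
# Hypothetical helper: heap-style id for one path string ("P", "L", "RL", ...).
path_to_id <- function(path) {
  if (path == "P") return(1)
  id <- 1
  for (step in strsplit(path, "")[[1]]) {
    # going left doubles the id, going right doubles it and adds 1
    id <- 2 * id + (step == "R")
  }
  id
}

# Hypothetical helper: edge list for one tree, given its vector of path strings.
tree_edges <- function(paths) {
  ids <- vapply(paths, path_to_id, numeric(1), USE.NAMES = FALSE)
  child <- sort(ids[ids != 1])
  # the parent of node k is always k %/% 2 in this scheme
  data.frame(from = child %/% 2, to = child)
}

var <- c("P", "L", "R", "RL", "RR",
         "P", "L", "R", "RL", "RR", "LL", "LR", "RRL", "RRR")
tree <- cumsum(var == "P")  # a new tree starts at every "P"
list_edges_heap <- lapply(split(var, tree), tree_edges)
list_edges_heap[["2"]]
```

Because the parent id follows from the child id, no lookup of parent names is needed, and the order of rows in df$var does not affect the numbering.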

Related

Finding the cumulative sum of the value of nodes in a DAG

Suppose I have the following directed acyclic graph (DAG) with each node having a weight of 1.
I am interested in calculating the accumulated sum of each node based on the value of its ancestor. Assuming as I said earlier that the weight of each node is 1, then this is what I would expect to get
This is what I tried to do:
library(tidygraph, quietly = TRUE)
library(tidyverse)
library(ggraph)
# create adjacencies
grafo_df <- tribble(
~from, ~to,
"C", "A",
"C", "B",
"A", "D",
"B", "D")
# create the graph
grafo <- as_tbl_graph(grafo_df)
# calculate accumulated sum
grafo %>%
arrange(node_topo_order()) %>%
mutate(
revenue = 1,
cum_weight = map_dfs(1, .f = function(node, path, ...) {
sum(.N()$revenue[c(node, path$node)])
})) %>%
as_tibble() %>%
unnest("cum_weight")
#> # A tibble: 4 x 3
#> name revenue cum_weight
#> <chr> <dbl> <dbl>
#> 1 C 1 1
#> 2 A 1 2
#> 3 B 1 2
#> 4 D 1 3
Created on 2021-05-13 by the reprex package (v2.0.0)
As you can see, the accumulated sum of D comes out as 3 and not 4, although the value of D should be the sum of the accumulated values of A and B. I do not understand why D does not add up to 4.
I have tried to understand the solution given here, but had a hard time understanding it
How can I get the accumulated sum?
Update # 1
I am not concerned (for the moment) with the complexity of the algorithm, that is, if the algorithm does it in O(V + E) it is not relevant.
Something important mentioned in this question is the problem of double counting: the partial sum of A is C(1) + A(1) = 2, and the partial sum of B is C(1) + B(1) = 2, so one could argue that the value of D should not be the sum of the partial sums A(2) + B(2), because C would be counted twice. I think that objection does not apply in this situation, for the following reason:
Let's imagine that each of these 4 nodes (A, B, C and D) are internet nodes that generate revenue of $1 each, so the total accumulated income of the 4 nodes would be $4. If D is the convergence node of the rest of nodes, then in a scenario where D stops working, the income of the remaining nodes and that of D would no longer be possible, therefore, its value is $4.
Update # 2
If I add a new path from C to D, the value of D should still be 4, because the number of nodes it depends on is unchanged; what should matter in the accumulated sum is the number of dependent nodes. For example, in the solution proposed by @ThomasIsCoding, adding this new path makes the value of D 5. I think this is partly because their algorithm uses degrees as a parameter to calculate the cumulative sum; however, if I add an additional node, the calculation is correct.
Update # 3
The example that I have placed is simple with the intention that it is easy to understand the objective, however, I did not specify that it should be generalizable for a graph with many nodes with three different topologies. The outermost layers are trees, the middle layers are rings, and the innermost layer is a full mesh.
Here is an igraph option using distances with the argument mode = "in".
If your nodes are unweighted, i.e., revenue=1 for all nodes
library(igraph)
library(dplyr)
g <- graph_from_data_frame(grafo_df)
data.frame(name = names(V(g))) %>%
mutate(revenue = 1) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 1 3
3 B 1 2
4 F 1 1
5 D 1 5
If your nodes are weighted, e.g.,
data.frame(name = names(V(g))) %>%
mutate(revenue = 1:n()) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 2 7
3 B 3 4
4 F 4 4
5 D 5 15
Data
grafo_df <- tribble(
~from, ~to,
"C", "A",
"C", "B",
"A", "D",
"C", "D",
"B", "D",
"F", "A"
)
and the DAG by plot(g) is given as
Now that the question is clear, I can propose an algorithm. I cannot code it, since I don't know the language that you are using.
For each node Ni in the graph we will calculate the set of ancestors Ai, then the accumulated sum for each node will be |Ai| + 1.
1. Initialize all nodes with an empty ancestor set Ai = {}
2. Start with a set S0 containing all nodes with no incoming edges
3. Initialize the next set Sn+1
4. Iterate over Sn, for each node N:
   For all nodes D with an incoming edge from N:
   - Merge the ancestor set of D with the ancestor set of N plus N itself
   - Remove the edge N->D
   - If D has no other incoming edges, add it to Sn+1
5. If Sn+1 is not empty, increase pass to n+1 and repeat from 2.
The main limitation of this solution is its complexity; I'll try to find an optimized version later.
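The steps above can be sketched in base R. This is only an illustration under stated assumptions: every node has weight 1, the graph is the four-node example from the question, and the variable names (ancestors, frontier, remaining) are made up for this sketch rather than taken from any package.

```r
# Edge list of the example DAG from the question
edges <- data.frame(from = c("C", "C", "A", "B"),
                    to   = c("A", "B", "D", "D"),
                    stringsAsFactors = FALSE)
nodes <- unique(c(edges$from, edges$to))

# one (initially empty) ancestor set per node
ancestors <- setNames(replicate(length(nodes), character(0), simplify = FALSE),
                      nodes)

remaining <- edges                           # edges not yet processed
frontier <- setdiff(nodes, remaining$to)     # nodes with no incoming edges

while (length(frontier) > 0) {
  next_frontier <- character(0)
  for (n in frontier) {
    children <- remaining$to[remaining$from == n]
    for (d in children) {
      # merge: ancestors of d gain n and everything above n
      ancestors[[d]] <- union(ancestors[[d]], c(ancestors[[n]], n))
    }
    # remove the edges n -> d that were just processed
    remaining <- remaining[remaining$from != n, , drop = FALSE]
    for (d in unique(children)) {
      # d is ready once none of its incoming edges remain
      if (!(d %in% remaining$to)) next_frontier <- c(next_frontier, d)
    }
  }
  frontier <- unique(next_frontier)
}

# with unit weights, the accumulated sum is the ancestor count plus the node itself
cum_weight <- lengths(ancestors) + 1
cum_weight
```

Because each ancestor set is a set, adding a direct edge from C to D does not change the result for D, which is the behaviour asked for in Update # 2.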

How to search for Specific Rows in R and return that row's information back

I am working with RNA seq data with Gene names as my first column and cluster gene expression data as the following columns. There are a lot of genes however I am only interested in about 200 of them. Is there a way to just target those specific genes and then create a data matrix with them. I can retrieve information from columns
Mydata.1 <- x[c("Gene Name", "Cluster_1")]
But not from rows. For example, this fails:
Mydata.1 <- x[c("Malat1", "Cd74")]
Does anyone know how I would be able to do that?
Thanks!
This answer uses a logical vector for subsetting your dataframe rows. For more on this take a look at: http://adv-r.had.co.nz/Subsetting.html#data-types.
# Mockup data
x <- data.frame(
`Gene Name` = c("HPRT1", "ABC", "Malat1", "Cd74"),
Cluster_1 = 1:4,
Cluster_2 = 5:8,
check.names = FALSE
)
# Defining gene names of interest to look for
target_genes <- c("Malat1", "Cd74")
# Getting a logical vector that implicitly codes for row positions
# Note: we need to wrap Gene Name in backticks (``) because of the space character in "Gene Name"
row_matches <- x$`Gene Name` %in% target_genes
# Subsetting the gene expression matrix (actually a dataframe object)
# mydata2: dataframe whose rows are for target genes only
# Note: the empty placeholder after the comma in the subsetting below indicates all columns
mydata2 <- x[row_matches, ]
mydata2
#> Gene Name Cluster_1 Cluster_2
#> 3 Malat1 3 7
#> 4 Cd74 4 8
Alternatively we can also use the function subset for more concise code:
# Mockup data
x <- data.frame(
`Gene Name` = c("HPRT1", "ABC", "Malat1", "Cd74"),
Cluster_1 = 1:4,
Cluster_2 = 5:8,
check.names = FALSE
)
# Defining gene names of interest to look for
target_genes <- c("Malat1", "Cd74")
# As an alternative use the function subset
mydata2 <- subset(x, `Gene Name` %in% target_genes)
mydata2
#> Gene Name Cluster_1 Cluster_2
#> 3 Malat1 3 7
#> 4 Cd74 4 8
To find the data you want, you can use this code:
newdata <- mydata[which(mydata$gene == 'THE_GENE_U_LOOK_FOR'), ]

Calculations across more than two different dataframes in R

I'm trying to transfer some work previously done in Excel into R. All I need to do is transform two basic count_if formulae into readable R script. In Excel, I would use three tables and calculate across those using 'point-and-click' methods, but now I'm lost in how I should address it in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
Operator = c("A","B","C"),
Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
Area = c("Torbay","Torquay","Tooting","Torrington","Taunton","Torpley"),
SumLocations = c(1000,500,500,250,600,750)
)
OperatorAreaData <- data.frame(
Operator = c("A","A","A","B","B","B","C","C","C","C","C"),
Area = c("Torbay","Tooting","Taunton",
"Torbay","Taunton","Torrington",
"Tooting","Torpley","Torquay","Torbay","Torrington"),
Locations = c(250,400,200,
100,400,75,
100,750,500,650,175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one with the count of areas that the operator operates in, and another with the count of areas in which that operator owns more than 50% of locations.
So the new resulting dataframe would look like this
Operator Locations AreaCount Own_GE_50percent
A 850 3 1
B 575 3 1
C 2175 5 4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$"AreaCount" <- cbind(OpAreaCount$AreaCount)
This is fairly straightforward, but I'm stuck in how to calculate the second column calculation with the condition of 50%.
library(dplyr)
OperatorAreaData %>%
inner_join(AreaData, by="Area") %>%
group_by(Operator) %>%
summarise(AreaCount = n_distinct(Area),
Own_GE_50percent = sum(Locations > (SumLocations/2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
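The dplyr result above is a separate summary table. As a hedged sketch of how the two counts could be attached to OperatorData itself, here is a base R version of the same calculation; the intermediate names merged and owns_majority are illustrative only.

```r
OperatorData <- data.frame(
  Operator = c("A", "B", "C"),
  Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
  Area = c("Torbay", "Torquay", "Tooting", "Torrington", "Taunton", "Torpley"),
  SumLocations = c(1000, 500, 500, 250, 600, 750)
)
OperatorAreaData <- data.frame(
  Operator = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Area = c("Torbay", "Tooting", "Taunton",
           "Torbay", "Taunton", "Torrington",
           "Tooting", "Torpley", "Torquay", "Torbay", "Torrington"),
  Locations = c(250, 400, 200, 100, 400, 75, 100, 750, 500, 650, 175)
)

# bring the area totals next to each operator/area row
merged <- merge(OperatorAreaData, AreaData, by = "Area")
# TRUE where the operator holds more than half of the area's locations
merged$owns_majority <- merged$Locations > merged$SumLocations / 2

# per-operator counts, returned as arrays named by operator
area_count <- tapply(merged$Area, merged$Operator,
                     function(a) length(unique(a)))
majority_count <- tapply(merged$owns_majority, merged$Operator, sum)

# look the counts up by operator name so row order cannot get mixed up
key <- as.character(OperatorData$Operator)
OperatorData$AreaCount <- as.vector(area_count[key])
OperatorData$Own_GE_50percent <- as.vector(majority_count[key])
OperatorData
```

Looking the counts up by operator name, rather than relying on cbind and row position, keeps the result correct even if OperatorData is sorted differently from the summary.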

IGraph: network distance until stop node/vertex

I have an igraph network that contains two types of nodes, one set that describes my points/nodes of interest (NOI) and another set that act as barriers (B) in my network. Now I'd like to measure the total length of all edges that are connected starting from a specific NOI until a barrier is approached.
Here a short example using a ring-shape in igraph:
set.seed(123)
g <- make_ring(10) %>%
set_edge_attr("weight", value = rnorm(10,100,20))%>%
set_vertex_attr("barrier", value = c(0,0,1,0,0,1,0,0,1,0))%>%
set_vertex_attr("color", value = c("green","green","red",
"green","green","red",
"green","green","red","green"))
For example, when starting from node 1 (NOI, green), all edges up to nodes 9 and 3 are reachable (nodes 9 and 3 are barriers and block further travel). Thus the total connected length of edges for NOI 1 is the sum of the weights of edges 1--2, 2--3, 1--10 and 10--9. The same value holds for node 10 as the starting node. In the end I am interested in a list/data frame of all NOI and their total length of reachable network. How best to proceed in R using igraph? Is there a built-in function in igraph?
Here's one possible strategy. First, I set a name for each node so it will be preserved during graph transformations
V(g)$name = seq.int(vcount(g))
Now I drop all the barriers and split the graph up into separate connected nodes of interest that will all share the same length.
gd <- g %>% induced_subgraph(V(g)[V(g)$barrier==0]) %>% decompose()
Then we can write a helper function that takes a subgraph, finds all the incident edges for the nodes of that subgraph in the original graph, extracts the weights, and sums them up
get_connected_length <- function(x) {
incident_edges(g, V(g)$name %in% V(x)$name) %>% do.call("c", .) %>% unique() %>% .$weight %>% sum()
}
Now we apply the function to each of the subgraphs and extract the node names
n <- gd %>% Map(function(x) V(x)$name, .)
w <- gd %>% Map(get_connected_length, .)
And we can combine that data all together in a matrix
do.call("rbind", Map(cbind, n, w))
# [,1] [,2]
# [1,] 1 361.5366
# [2,] 2 361.5366
# [3,] 10 361.5366
# [4,] 4 335.1701
# [5,] 5 335.1701
# [6,] 7 318.2184
# [7,] 8 318.2184

permute dataframe but must have unique rows

Say I have a dataframe like this:
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this, e.g. if I needed 100 permutations, I could generate 500 and discard those that fail the criteria. However, in my real data I have hundreds of rows and just using sample almost always leads to a failure. I wonder if there is a better algorithm for doing this? Perhaps a birth-death algorithm?
Setup:
library(tidyverse)
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d <- rownames_to_column(d)
I want the rownames to put it back in order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
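samp_uniq_n <- function(vec, n) {
  x <- vec
  out <- rep(NA, n)
  for (i in seq_len(n)) {
    # Here would be a good place to make sure sampling is even possible.
    # table() orders its counts by sorted value, so pair them with sorted values
    probs <- prop.table(table(x))
    vals <- sort(unique(x))
    # index via sample.int(): sample(v, 1) would draw from 1:v when v is a single number
    out[i] <- vals[sample.int(length(vals), 1, prob = as.vector(probs))]
    x <- x[x != out[i]]
    vec <- vec[-min(which(vec == out[i]))]
  }
  return(list(out = out, vec = vec))
}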
Now, we need to split the data into a list of rows that have the same time and side and start the sampling with the largest such:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
this_id <- samp$out
d_split_desc[[i]]$id <- this_id
id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
arrange(as.numeric(rowname)) %>%
select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
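Whichever sampling approach is used, it helps to verify the result against the constraint from the question. A minimal sketch of such a check follows; the function name has_duplicate_ids is hypothetical, and the two data frames are the original data and the failing permutation shown in the question.

```r
# TRUE if some id occurs more than once within the same time/side combination
has_duplicate_ids <- function(df) {
  any(duplicated(df[, c("time", "side", "id")]))
}

time <- c(1, 3, 5, 6, 11, 15, 15, 18, 18, 20)
side <- c("L", "R", "R", "L", "L", "L", "L", "R", "R", "R")

# original data from the question: no clash
d_orig <- data.frame(time = time, side = side,
                     id = c(1, 2, 1, 2, 4, 3, 4, 2, 1, 1))

# the failing permutation from the question: id 2 appears twice at time 18, side "R"
d_bad <- data.frame(time = time, side = side,
                    id = c(1, 1, 4, 1, 4, 2, 3, 2, 2, 1))

has_duplicate_ids(d_orig)
has_duplicate_ids(d_bad)
```

The same call on d_permute would confirm that a generated permutation respects the constraint before it is used.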
