I am trying to create an edge list based on binary splits.
If I have a data frame that only contains the node number and some other metric, then I can manually create an edge list for the nodes. For example, if my data frame looks like this:
dfTest <- data.frame(
node = c(1,2,3,4,5),
var = c("milk", NA, "coffee", NA, NA),
isLeaf = c(F, T, F, T, T)
)
> dfTest
node var isLeaf
1 1 milk FALSE
2 2 <NA> TRUE
3 3 coffee FALSE
4 4 <NA> TRUE
5 5 <NA> TRUE
Then, based on the var or isLeaf column, I can manually create an edge list to connect the nodes. For example, as node 2 is a leaf, I know that node 1 must go to node 2. Then (as they are binary splits) I know node 1 must also connect to node 3. And as nodes 4 and 5 are leaf nodes, I know that they must both be children of node 3.
Manually creating the edge list would look like this:
edges <- data.frame(
from = c(1, 1, 3, 3),
to = c(2, 3, 4, 5)
)
The to column is easy to find: it will always be 2:nrow(dfTest), in this case 2, 3, 4, 5. The from column, however, is proving difficult to find.
Just for a visual aid, the resulting tree would look like this:
Is there any way to do this without having to manually work out the edges?
EDIT:
In response to an answer, I'm adding a slightly larger dataset to use:
dfTest <- data.frame(
node = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
var = c("milk", "milk", NA, NA, "coffee", "sugar", NA, NA, "milk", NA, NA),
isLeaf = c(F, F, T, T, F, F, T, T, F, T, T)
)
A little explanation:
From the var column I know that milk (the root, node 1) splits to another milk (node 2). I can then see that node 2 splits to NA (node 3) and NA (node 4). As these are binary splits, I know that node 2 can't split any further, so I must go back to the previous node that has only had one split, in this case node 1 (milk), which then splits to the right on coffee (node 5). Again, as these are binary splits, I know that coffee (node 5) must split to sugar (node 6). Sugar (node 6) is followed by two NAs (nodes 7 and 8). Now I must go back to coffee (node 5) and split to the right to get milk (node 9), which splits to two NAs (nodes 10 and 11).
The desired node/edge list should look like this:
edges <- data.frame(
from = c(1,2,2,1,5,6,6,5,9,9),
to = c(2,3,4,5,6,7,8,9,10,11)
)
Which would produce a tree like this:
As per your update, perhaps you can try the code below
grps <- split(dfTest, ~ cumsum(!isLeaf))
edges <- do.call(
rbind,
lapply(
grps,
function(x) {
with(x, expand.grid(from = node[!isLeaf], to = node[isLeaf]))
}
)
)
for (k in seq_along(grps)) {
if (nrow(grps[[k]]) == 1) {
lleaf <- with(grps[[k + 1]], node[!isLeaf])
rleaf <- with(grps[[k + 2]], node[!isLeaf])
edges <- rbind(edges, data.frame(from = grps[[k]]$node, to = c(lleaf, rleaf)))
}
}
edges <- `row.names<-`(edges[with(edges, order(from, to)), ], NULL)
which gives
> edges
from to
1 1 2
2 1 5
3 2 3
4 2 4
5 5 6
6 5 9
7 6 7
8 6 8
9 9 10
10 9 11
Visualization
library(igraph)
graph_from_data_frame(edges) %>%
plot(layout = layout_as_tree)
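The backtracking described in the question ("go back to the previous node that only had one split") can also be written as a small stack-based loop. This is only a sketch, not the method from the answer above: it assumes the rows are listed in pre-order (parent, then left subtree, then right subtree) and that every non-leaf node has exactly two children. The function name build_edges is made up for illustration.

```r
# Sketch: derive the edge list from a pre-order node listing.
# Assumes every non-leaf node has exactly two children.
build_edges <- function(node, is_leaf) {
  from <- c()
  to <- c()
  stack <- c()      # internal nodes still waiting for children
  remaining <- c()  # how many children each stacked node still needs
  for (i in seq_along(node)) {
    if (length(stack) > 0) {
      top <- length(stack)
      # the current node is the next child of the node on top of the stack
      from <- c(from, stack[top])
      to <- c(to, node[i])
      remaining[top] <- remaining[top] - 1
      if (remaining[top] == 0) {
        # both children found: drop this parent and fall back to the previous one
        stack <- stack[-top]
        remaining <- remaining[-top]
      }
    }
    if (!is_leaf[i]) {
      # a split node expects two children of its own
      stack <- c(stack, node[i])
      remaining <- c(remaining, 2)
    }
  }
  data.frame(from = from, to = to)
}

dfTest <- data.frame(
  node = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  var = c("milk", "milk", NA, NA, "coffee", "sugar", NA, NA, "milk", NA, NA),
  isLeaf = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
)
edges <- build_edges(dfTest$node, dfTest$isLeaf)
```

Here the rows of edges come out in the same order as the desired list in the question (to runs from 2 to 11), so no re-sorting is needed.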
Related
I'm trying to write a function that combines up to 4 (fair 6 sided) dice rolls to create a specific value (named 'target.mountain') as many times as possible given the numbers shown on the dice.
Then return these values along with any that aren't used in said combination. If the other numbers that aren't used to form the 'target.mountain' can sum to be within the range (5-10) then do so.
So as an example say I roll 4,3,2,5 and my target.mountain value is 9
I would do
4 + 5 -> 9 and as 2 + 3 = 5 my function would return 9, 5
Another example could be
Roll = (2,3,6,4) --> (6 + 3), (4 + 2) --> 9, 6
Once these values have been found, list them so the output appears like
[1] 9, 5 (example 1)
[1] 9, 6 (example 2)
How do I go about doing this?
If you have ever played the board game 'Mountain Goats' then that may shed some light on how I need the dice to work as I just cannot figure it out!
Let's make the problem a bit harder, say 5 dice.
library(tidyverse)
rolls <- sample(1:6,replace = TRUE, size = 5)
target.mountain <- 7
#Make all possible combinations of the dice:
map_dfr(seq_along(rolls),~ combn(seq_along(rolls),.x,simplify = FALSE) %>%
map(~tibble(dice = list(.), sum = sum(rolls[.]), rolls = list(rolls[.]),length = length(.)))) %>%
#filter to only those combinations which equal the target
filter(sum == target.mountain) %>%
#Now make all possible combinations of the sets that equal the target
{map2(.x = list(.), .y = nrow(.) %>% map(.x = seq(.), .f = combn,x=.,simplify = FALSE) %>% unlist(recursive = FALSE),
~.x[unlist(.y),])} %>%
#Subset to non-overlapping sets
subset(map_lgl(.,~length(reduce(.x$dice,union))==length(unlist(.x$dice)))) -> part1
map(part1, as.data.frame)
#[[1]]
# dice sum rolls length
#1 1, 3 7 3, 4 2
#
#[[2]]
# dice sum rolls length
#1 4, 5 7 6, 1 2
#
#[[3]]
# dice sum rolls length
#1 2, 3, 5 7 2, 4, 1 3
#
#[[4]]
# dice sum rolls length
#1 1, 3 7 3, 4 2
#2 4, 5 7 6, 1 2
From here you can apply whatever rules you want:
part1 %>%
#subset to the largest number of sets
subset(map_dbl(.,nrow) == max(map_dbl(.,nrow))) %>%
#subset to the fewest number of total dice
subset(map_dbl(.,~sum(.x$length)) == min(map_dbl(.,~sum(.x$length)))) %>%
#if there are still ties, pick the first
`[[`(1) -> part2
as.data.frame(part2)
# dice sum rolls length
#1 1, 3 7 3, 4 2
#2 4, 5 7 6, 1 2
A possible solution to the problem:
target.mountain = 9
dice <- c(4,3,2,5)
library(tidyverse)
fn <- function(target.mountain, dice){
fltr <- map(seq_along(dice), ~combn(dice, .x, sum) == target.mountain)
out <- map(seq_along(dice), ~combn(dice, .x))
sum_target <- map2(out, fltr, ~.x[, .y]) %>%
purrr::discard(.x = ., function(x) length(x) == 0) %>%
keep(.x = ., .p = function(x) length(x) == min(lengths(.))) %>%
flatten_dbl()
no_sum_target <- dice[!(dice %in% sum_target)]
result <- toString(c(sum(sum_target), no_sum_target))
return(result)
}
fn(target.mountain = target.mountain, dice = dice)
#> [1] "9, 3, 2"
Created on 2021-03-29 by the reprex package (v1.0.0)
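Neither answer above applies the second rule from the question (sum the unused dice if their total falls in the range 5 to 10). Below is a rough base R sketch of one way to do that. The names best_groups and mountain_roll and the lo/hi arguments are invented here, and the tie-breaking (first match found, fewest dice first) is an assumption rather than a rule from the game.

```r
# Sketch: use as many disjoint groups of dice as possible to hit the target,
# then sum the leftovers if that sum is within [lo, hi].
best_groups <- function(dice, target) {
  best <- list(count = 0, rest = dice)
  n <- length(dice)
  if (n == 0) return(best)
  for (k in seq_len(n)) {
    for (idx in combn(n, k, simplify = FALSE)) {
      if (sum(dice[idx]) == target) {
        # use this group, then search the remaining dice recursively
        sub <- best_groups(dice[-idx], target)
        if (sub$count + 1 > best$count) {
          best <- list(count = sub$count + 1, rest = sub$rest)
        }
      }
    }
  }
  best
}

mountain_roll <- function(dice, target, lo = 5, hi = 10) {
  g <- best_groups(dice, target)
  leftover <- sum(g$rest)
  # combine the unused dice only when their sum is in range
  rest <- if (leftover >= lo && leftover <= hi) leftover else g$rest
  c(rep(target, g$count), rest)
}

ex1 <- mountain_roll(c(4, 3, 2, 5), target = 9)
ex2 <- mountain_roll(c(2, 3, 6, 4), target = 9)
```

With the two rolls from the question, ex1 is 9, 5 and ex2 is 9, 6. The search is exhaustive, which is fine for 4 or 5 dice but would get slow for many more.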
This may look simple but I am not able to do this. I want to plot two types of nodes, small and big using a cutoff. The values are in the attribute degree.
Here is a small toy example
g1 <- graph(edges=c(1,2, 2,3, 3, 1, 4,2), n=4, directed=F) %>%
set_vertex_attr("names", value = LETTERS[1:4])
g1_degree <- degree(g1, mode = "total")
g1_degree
[1] 2 3 2 1
g1 <- set_vertex_attr(g1, "degree", value = g1_degree)
plot(g1, vertex.size=V(g1)$degree)
This gives me every node according to the degree, but I want nodes of degree 2 and 3 big and 1 small.
So I tried to edit the values within V(g1)$degree
ifelse(V(g1)$degree < 2, yes = V(g1)$degree==1, no = V(g1)$degree==5)
FALSE FALSE FALSE TRUE
Ok, I checked my degree values, but how can I overwrite the TRUE or FALSE using the cutoffs I need?
Here are two solutions.
One with ifelse, like in the question.
g1 <- set_vertex_attr(g1, "degree", value = ifelse(V(g1)$degree < 2, 1, 5))
V(g1)$degree
#[1] 5 5 5 1
And another with findInterval. This has better performance than ifelse, which can be important in large data sets.
i <- findInterval(V(g1)$degree, c(0, 2, Inf))
g1 <- set_vertex_attr(g1, "degree", value = c(1, 5)[i])
V(g1)$degree
#[1] 5 5 5 1
With different new sizes set with the findInterval index, c(10, 50)[i], the graph would look like below.
g1 <- set_vertex_attr(g1, "degree", value = c(10, 50)[i])
plot(g1, vertex.size = V(g1)$degree)
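To see the two approaches side by side without needing the graph, here is a small base R sketch on the degree vector from the question. The three-class cutoffs and sizes at the end are made up for illustration.

```r
# Degrees of the four vertices in the toy graph
deg <- c(2, 3, 2, 1)

# Two size classes: degree < 2 is small, everything else is big
size_ifelse <- ifelse(deg < 2, 10, 50)
size_interval <- c(10, 50)[findInterval(deg, c(0, 2, Inf))]

# findInterval extends naturally to more classes (hypothetical cutoffs):
# [0, 2) -> 5, [2, 3) -> 20, [3, Inf) -> 50
size_three <- c(5, 20, 50)[findInterval(deg, c(0, 2, 3, Inf))]
```

Any of these vectors can be passed straight to vertex.size in plot(), which avoids overwriting the degree attribute with the display sizes.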
I have a data frame that contains two columns, an ID column and a column with sub ID's that are related to the corresponding ID. The sub ID's can again have sub ID's (in this case the previous sub ID is now an ID).
library(tibble)
df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))
df
# A tibble: 6 x 2
id sub_id
<dbl> <dbl>
1 1 2
2 1 3
3 2 4
4 2 5
5 3 6
6 7 8
I would like to write a function that finds all sub ID's that are related to an ID. It should return a vector with all sub ID's.
find_all_sub_ids <- function (data, id) {
data %>% ...
}
find_all_sub_ids(df, id = 1)
[1] 2 3 4 5 6
find_all_sub_ids(df, id = 2)
[1] 4 5
find_all_sub_ids(df, id = 9)
[1] NULL
This is very different from everything I have done in R so far and it was hard for me to phrase a good title for this question. So it is possible that with the right phrasing I could have already found an answer by just googling.
My first intuition for solving this was while loops. Since I also do not know how many sublevels there could be the function should continue until all are found. I never used while loops though and don't really know how I could implement them here.
Maybe someone knows a good solution for this problem. Thanks!
Edit: Forgot to assign the tibble to df and to use this argument in the function call.
With igraph:
library(igraph)
g <- graph_from_data_frame(df, directed = TRUE)
find_all_subs <- function(g,id){
#find child nodes, first one being origin
r <- igraph::subcomponent(g,match(id, V(g)$name),"out")$name
#remove origin
as.numeric(r[-1])
}
find_all_subs(g,1)
[1] 2 3 4 5 6
find_all_subs(g,2)
[1] 4 5
I think it's easiest to formulate this as a graph problem.
Your data.frame describes a directed graph (vertices going from id to sub_id), and you are interested in which nodes are reachable from a certain vertex.
Using tidygraph, this can be achieved as such:
library(tidyverse)
library(tidygraph)
df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))
find_all_sub_ids <- function (id) {
if (!(id %in% df$id)) {
return(NULL)
}
grph <- df %>%
as_tbl_graph(directed = TRUE)
id <- which(grph %>% pull(name) == as.character(id))
grph %>%
activate(nodes) %>%
mutate(reachable = !is.na(bfs_dist(id))) %>%
as_tibble() %>%
filter(reachable) %>%
pull(name) %>%
as.numeric()
}
To see which nodes are reachable, we use bfs_dist: reachable nodes have a non-NA distance from the given node (see here for explanation).
This gives
> find_all_sub_ids(1)
[1] 1 2 3 4 5 6
> find_all_sub_ids(2)
[1] 2 4 5
> find_all_sub_ids(9)
NULL
The advantage of such an approach is that it can search many levels deep without you needing to write a loop explicitly.
Edit
There was a bug in my code: tidygraph::bfs_dist uses a different id than I expected. Fixed it now.
On the new example:
> find_all_sub_ids(10)
[1] 10 200 300
I did it using a dataframe. The following works.
x= c(1,1,2,2,3,7)
y = c(2, 3, 4, 5, 6, 8)
df <- data.frame(cbind(x,y))
colnames(df) =c('id', 'sub_id')
find_all_sub_ids <- function (df, id_requested) {
si <- df[df$id==id_requested,]$sub_id
return(si)
}
find_all_sub_ids(df,id=2)
[1] 4 5
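Note that the function above only returns the direct sub IDs. Since the question mentions while loops, here is a sketch of how the same lookup could be repeated level by level in base R until nothing new turns up. It keeps the find_all_sub_ids(data, id) signature from the question. The setdiff step is an addition of mine so that the loop also stops if the data contains a cycle.

```r
df <- data.frame(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))

find_all_sub_ids <- function(data, id) {
  found <- c()
  frontier <- id
  while (length(frontier) > 0) {
    # direct sub IDs of everything found in the previous round
    children <- data$sub_id[data$id %in% frontier]
    # keep only IDs not seen before, so the loop always terminates
    children <- setdiff(children, found)
    found <- c(found, children)
    frontier <- children
  }
  if (length(found) == 0) NULL else found
}

res1 <- find_all_sub_ids(df, id = 1)
res2 <- find_all_sub_ids(df, id = 2)
res9 <- find_all_sub_ids(df, id = 9)
```

For the sample data, res1 is 2 3 4 5 6, res2 is 4 5 and res9 is NULL, as requested in the question.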
I have a large data set that I want to represent as a network graph using igraph. I just don't understand how to get the colors right. Is it possible to get an igraph plot where each edge has the same color as its vertex? In my example below, I would like to color vertices and edges according to the status 'sampled' or 'unsampled'. Another problem is that not all the edges appear in the plot, and I don't understand why.
My code so far is:
d <- data.frame(individual=c(1:10), mother_id = c(0,0,0,0,0,1,3,7,6,7), father_id = c(0,0,0,0,0,4,1,6,7,6) , generation = c(0,0,0,0,0,1,1,2,2,2), status=c("sampled","unsampled","unsampled","sampled",'sampled',"sampled","unsampled","unsampled","sampled",'sampled'))
#Just some settings for layout plot
g <- d$generation
n <- nrow(d)
pos <- matrix(data = NA, nrow = n, ncol = 2)
pos[, 2] <- max(g) - g
pos[, 1] <- order(g, partial = order(d$individual, decreasing = TRUE)) - cumsum(c(0, table(g)))[g + 1]
#Plotting the igraph
G <- graph_from_data_frame(d)
plot(G, rescale = T, vertex.label = d$individual, layout = pos,
edge.arrow.mode = "-",
vertex.color = d$status,
edge.color = d$status,
asp = 0.35)
My question is somewhat similar to the question "Ggraph node color to match edge color", but I would like to do it with the igraph package.
Thanks for your help
If you plot(G), you will most likely see that the graph built from the data frame is not what you expect. That is why you don't see all the edges (e.g. the column father_id is not used to build edges at all).
By default igraph takes the first column as "from" and the second one as "to". That is why you see edges such as 1 to 0, 2 to 0 and so on.
You can fix this by passing in two objects, one with the edges and their attributes, and one with the nodes and their attributes.
It is not so clear to me where the edges should be. However, your code should look something like this:
dd <- read.table(text = "
from to type
1 6 A
3 7 B
7 8 A
6 9 B
7 10 A
4 6 B
1 7 A
6 8 B
7 9 B
6 10 A ", header=T )
nodes <- data.frame(id=unique(c(dd$from, dd$to)) )
nodes$type <- sample(LETTERS[1:2], 8, replace = T )
nodes$x <- c(8,3,5,7,1,2,4,10) # this is for the layout
nodes$y <- c(1, 2, 4, 5, 6, 8, 5, 7)
nodes
id type x y
1 1 B 8 1
2 3 A 3 2
3 7 B 5 4
4 6 A 7 5
5 4 A 1 6
6 8 B 2 8
7 9 A 4 5
8 10 A 10 7
G <- graph_from_data_frame(dd, vertices = nodes ) # directed T or F?
V(G)$color <- ifelse( V(G)$type == "A", "pink", "skyblue")
E(G)$color <- ifelse( E(G)$type == "A", "pink", "skyblue")
edge_attr(G)
vertex_attr(G)
plot(G)
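If the edges are meant to run from each parent to the individual, the two data frames can be built directly from d instead of being typed in by hand. This is a sketch under two assumptions that are not stated in the question: a parent id of 0 means "no known parent", and an edge takes the color of the child it points to. The color names are arbitrary.

```r
d <- data.frame(
  individual = 1:10,
  mother_id = c(0, 0, 0, 0, 0, 1, 3, 7, 6, 7),
  father_id = c(0, 0, 0, 0, 0, 4, 1, 6, 7, 6),
  status = c("sampled", "unsampled", "unsampled", "sampled", "sampled",
             "sampled", "unsampled", "unsampled", "sampled", "sampled")
)

# One edge per (parent, child) pair, mothers first, then fathers
edges <- rbind(
  data.frame(from = d$mother_id, to = d$individual),
  data.frame(from = d$father_id, to = d$individual)
)
edges <- edges[edges$from != 0, ]  # assumption: 0 means unknown parent

# Vertex table: the first column is the vertex name, color follows status
status_cols <- c(sampled = "pink", unsampled = "skyblue")
vertices <- data.frame(
  name = d$individual,
  status = d$status,
  color = unname(status_cols[as.character(d$status)])
)

# Assumption: each edge is colored like the child it points to
edges$color <- vertices$color[match(edges$to, vertices$name)]

# With igraph loaded, these can then be used as:
# G <- graph_from_data_frame(edges, vertices = vertices)
# plot(G)
```

This gives the same ten from/to pairs as the dd table above. plot() uses the color vertex and edge attributes by default, so no extra arguments should be needed for the coloring.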
This may be a rather complex question so if someone can at least point me in the right direction I can probably figure out the rest on my own.
Sample data:
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5), C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))
A B C D
1 1 6 5 5
2 4 5 3 NA
3 5 NA 1 3
4 3 5 5 10
5 NA 3 3 4
6 5 5 7 5
I would like to find all possible permutations of letter sequences of different lengths from the table shown above. For example, one valid letter sequence might be: A C A D D B. Another valid sequence could be B C C.
However, there are a few exceptions to this I'd like to follow:
1. Must be able to specify the minimum length of the returned sequence.
Note that in my example above, the min sequence length was 3 and the max sequence length was equal to the number of rows. I would like to be able to specify the min value (the max value will always be equal to the number of rows, 6 in the case of the sample data).
Note that if the sequence length is shorter than 6, it cannot be generated from skipping rows. In other words, any short sequences must come from consecutive rows. Clarification based on comments: Short sequences do not have to start on row 1. A short sequence could start on row 3 and continue onward through consecutive rows to row 6.
2. Letters with an NA value are not available for sampling.
Note that in row 2 there is an NA in the D column. This means that D would not be available for sampling in row 2. So A B D would be a valid combination but A D D would not be valid.
3. The sequences must be ranked based on the values in each cell.
Notice how each cell has a specific value in it. Each sequence chosen can be ranked by summing up the value shown in the table for the chosen letter. Using the example from above A C A D D B would have a rank of 1+3+5+10+4+5. So when generating all possible sequence they should be ordered from highest rank to lowest rank.
I would like to apply all three of these rules to the data table listed above to find all combinations of sequences possible of minimum length 3 and maximum length 6.
Please let me know if I need to clarify anything!
In principle, you want to do this using expand.grid I believe. Using your example data, I worked out the basics here:
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5),
B = c(6, 5, NA, 5, 3, 5),
C = c(5, 3, 1, 5, 3, 7),
D = c(5, NA, 3, 10, 4, 5))
dat[,1][!is.na(dat[,1])] <- paste("A",na.omit(dat[,1]),sep="-")
dat[,2][!is.na(dat[,2])] <- paste("B",na.omit(dat[,2]),sep="-")
dat[,3][!is.na(dat[,3])] <- paste("C",na.omit(dat[,3]),sep="-")
dat[,4][!is.na(dat[,4])] <- paste("D",na.omit(dat[,4]),sep="-")
transp_data <- as.data.frame(t(dat))
data_list <- list(V1 = as.vector(na.omit(transp_data$V1)),
V2 = as.vector(na.omit(transp_data$V2)),
V3 = as.vector(na.omit(transp_data$V3)),
V4 = as.vector(na.omit(transp_data$V4)),
V5 = as.vector(na.omit(transp_data$V5)),
V6 = as.vector(na.omit(transp_data$V6)))
This code lets you essentially transform your data frame into a list of vectors of different lengths (one element for each variable in your original data, but omitting NAs and such). The reason you would want to do this is because it makes finding the acceptable combinations trivially easy by using the expand.grid function.
To solve for the six, you would simply use:
grid_6 <- do.call(what = expand.grid,
args = data_list)
This would give you a list of all possible permutations that met your criteria for the six (i.e. there were no NA elements). You can extract the numeric data back using some regular expressions (not a very vectorized way of doing it, but this is a complex thing that I don't have time to fully put into a function).
grid_6_letters <- grid_6
for(x in 1:ncol(grid_6_letters)) {
for(y in 1:nrow(grid_6_letters)) {
grid_6_letters[y,x] <- gsub(pattern = "-[0-9]*",replacement = "",x = grid_6_letters[y,x])
}
}
grid_6_numbers <- grid_6
for(x in 1:ncol(grid_6_numbers)) {
for(y in 1:nrow(grid_6_numbers)) {
grid_6_numbers[y,x] <- gsub(pattern = "^[ABCD]-",replacement = "",x = grid_6_numbers[y,x])
}
grid_6_numbers[[x]] <- as.numeric(grid_6_numbers[[x]])
}
grid_6_letters$Total <- rowSums(grid_6_numbers)
grid_6_letters <- grid_6_letters[order(grid_6_letters$Total,decreasing = TRUE),]
Anyway, if you wanted to get the various lower-level combinations, you could do it by simply using expand.grid on subsets of the list and combining them using rbind (with some judicious use of setNames as needed). Example:
grid_3 <- rbind(setNames(do.call(what = expand.grid,args = list(data_list[1:3],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")),
setNames(do.call(what = expand.grid,args = list(data_list[2:4],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")),
setNames(do.call(what = expand.grid,args = list(data_list[3:5],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")),
setNames(do.call(what = expand.grid,args = list(data_list[4:6],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")))
Anyway, with some time and programming, you can likely wrap this into a function that is much better than my example, but hopefully it will get you started.
Sorry I don't do any R anymore, so I'll try to help with a dirty code...
addPointsToSequence <- function(seq0, currRow){
i<-0;
for(i in 1:4){# 4 is the number of columns
seq2 = seq0
if (!is.na(dat[currRow,i])){
# add the point at the end of seq2
seq2 = cbind(seq2,dat[currRow,i])
# here I add the value, but you may prefer
# adding the colnames(dat)[i] and using the value to estimate the value of this sequence, in another variable
if(length(seq2) >= 3){
# save seq2 as an existing sequence where you need to
print (seq2)
}
if(currRow < 6){# 6 is the number of rows in dat (use nrow?)
addPointsToSequence(seq2, currRow+1)
}
}
}
}
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5), C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))
for (startingRow in 1:4){
#4 is the last row you can start from to make a length3 sequence
emptySequence <- c()
addPointsToSequence(emptySequence, startingRow)
}
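For what it's worth, the three rules from the question (minimum length, consecutive rows only, NA cells unavailable, ranking by summed values) can be combined into one function along the lines of the expand.grid idea above. This is a sketch rather than a tuned solution. The name rank_sequences and the output columns are my own choices.

```r
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5),
                  C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))

rank_sequences <- function(dat, min_len = 3) {
  m <- as.matrix(dat)
  n <- nrow(m)
  stopifnot(min_len >= 1, min_len <= n)
  out <- list()
  # every window of consecutive rows with at least min_len rows
  for (start in seq_len(n - min_len + 1)) {
    for (end in (start + min_len - 1):n) {
      rows <- start:end
      # letters that are not NA in each row of the window
      choices <- lapply(rows, function(r) colnames(m)[!is.na(m[r, ])])
      grid <- expand.grid(choices, stringsAsFactors = FALSE)
      # rank = sum of the cell values of the chosen letters
      score <- Reduce(`+`, lapply(seq_along(rows), function(j) {
        unname(m[rows[j], grid[[j]]])
      }))
      out[[length(out) + 1]] <- data.frame(
        sequence = do.call(paste, c(unname(as.list(grid)), sep = " ")),
        start_row = start,
        score = score,
        stringsAsFactors = FALSE
      )
    }
  }
  res <- do.call(rbind, out)
  res <- res[order(res$score, decreasing = TRUE), ]
  rownames(res) <- NULL
  res
}

res <- rank_sequences(dat, min_len = 3)
head(res)
```

For the sample data this produces 3144 sequences, the best being B B A D D C with a rank of 37. The number of sequences grows multiplicatively with the number of rows, so for much larger tables you would probably want to keep only the top few per window.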