The visual in the snapshot below is created using the "visNetwork" package. My problem is that I have to hard-code the edges, and after calling visHierarchicalLayout() the nodes do not appear in order. Please help me with a dynamic approach so that, no matter how many nodes there are, consecutive numbers are connected in order without hard-coding. Thanks.
library(visNetwork)
nodes <- data.frame(id = 1:7, label = 1:7)
edges <- data.frame(from = c(1,2,3,4,5,6),
to = c(2,3,4,5,6,7))
visNetwork(nodes, edges, width = "100%") %>%
visEdges(arrows = "to") %>%
visHierarchicalLayout()
Using the level attribute does the job: it aligns the network based on the order given.
library(visNetwork)
nodes <- data.frame(id = 1:7, label = 1:7, level = 1:7)
# Extract the id
num <- nodes$id
# Repeat the numbers
num2 <- rep(num, each = 2)
# Remove the first and last numbers
num3 <- num2[c(-1, -length(num2))]
# Create a data frame
edges <- as.data.frame(matrix(num3, ncol = 2, byrow = TRUE))
names(edges) <- c("from", "to")
visNetwork(nodes, edges, width = "100%") %>%
visEdges(arrows = "to") %>%
visHierarchicalLayout()
If I understand your question correctly, you want to create the edges data frame based on the id in the nodes data frame. Here is one option.
# Extract the id
num <- nodes$id
# Repeat the numbers
num2 <- rep(num, each = 2)
# Remove the first and last numbers
num3 <- num2[c(-1, -length(num2))]
# Create a data frame
edges <- as.data.frame(matrix(num3, ncol = 2, byrow = TRUE))
names(edges) <- c("from", "to")
edges
# from to
# 1 1 2
# 2 2 3
# 3 3 4
# 4 4 5
# 5 5 6
# 6 6 7
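The same from/to pairs can also be built by pairing each id with its successor. This is only a minimal sketch in base R; make_chain_edges is a hypothetical helper name, and it assumes the ids are already in the order you want them connected.

```r
# Hypothetical helper: connect each id to the next one in the vector
make_chain_edges <- function(ids) {
  data.frame(from = head(ids, -1),  # every id except the last
             to   = tail(ids, -1))  # every id except the first
}

edges <- make_chain_edges(1:7)
edges
```

Because nothing depends on the number of nodes, the same call should work for any length of id vector; a single id gives an empty edge table.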
So, I have a list of multiple tidygraph objects, and what I'm trying to do is return the index of a specific tidygraph object, selected by the user. Hopefully my example below will explain the problem.
(ASIDE: I have attempted a solution that I show below, but at the moment it is super slow to run. I'm hoping to come up with a different, faster solution.)
To begin, I create some data to turn into tidygraph objects, then I create the tidygraph objects and put them all together into a list:
library(tidygraph)
# create some data for the tbl_graph
nodes <- data.frame(name = c("Hadley", "David", "Romain", "Julia"),
level = c(1,1,1,1),
rank = c(1,1,1,1))
nodes1 <- data.frame(name = c("Hadley", "David", "Romain", "Julia"),
level = c(1,1,1,1),
rank = c(2,2,2,2))
nodes2 <- data.frame(name = c("Hadley", "David", "Romain", "Julia"),
level = c(1,1,1,1),
rank = c(3,3,3,3))
nodes3 <- data.frame(name = c("Hadley", "David", "Romain", "Julia"),
level = c(2,2,2,2),
rank = c(1,1,1,1))
edges <- data.frame(from = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
to = c(2, 3, 4, 1, 1, 2, 1, 2, 3))
# create the tbl_graphs
tg <- tbl_graph(nodes = nodes, edges = edges)
tg_1 <- tbl_graph(nodes = nodes1, edges = edges)
tg_2 <- tbl_graph(nodes = nodes2, edges = edges)
tg_3 <- tbl_graph(nodes = nodes3, edges = edges)
# put into list
myList <- list(tg, tg_1, tg_2, tg_3)
For clarity, looking at the 1st list element looks like this:
> myList[1]
[[1]]
# A tbl_graph: 4 nodes and 9 edges
#
# A directed simple graph with 1 component
#
# Node Data: 4 × 3 (active)
name level rank
<chr> <dbl> <dbl>
1 Hadley 1 1
2 David 1 1
3 Romain 1 1
4 Julia 1 1
#
# Edge Data: 9 × 2
from to
<int> <int>
1 1 2
2 1 3
3 1 4
# … with 6 more rows
We can see that each object has a variable called level and another called rank. What I'm trying to do is return the list index of an object by selecting the level and rank number. So, for example, if I select level = 1 and rank = 2, my function would return the index of the object with those values (in this case the 2nd list element). My attempted solution is below, but it's a very slow process. Is there a better way to achieve what I want?
My Attempted Solution
In my solution, I begin by turning each of the tidygraph objects into a tibble to make them easier to manipulate, and this is why my function is so slow. In my data, I could have up to 200,000 tidygraph objects in a list, so going through them and converting them all to tibbles is a very slow process. I do that like so:
# separating out the list to make it easier to manipulate
list_obj <- lapply(myList, function(x){
edges <- tidygraph::activate(x, edges) %>% tibble::as_tibble()
nodes <- tidygraph::activate(x, nodes) %>% tibble::as_tibble()
return(list(edges = edges, nodes = nodes))
} )
And then this is the function I actually use to extract the index of the chosen object:
# this function returns the tree index asked for by user
getTreeListNumber <- function(listObj, level, rank){
res <- 0
listNumber <- NA
for(i in 1:length(listObj)){
res <- level %in% listObj[[i]]$nodes$level && rank %in% listObj[[i]]$nodes$rank
if(res == TRUE){
listNumber <- i
}
}
return(listNumber)
}
For example:
> getTreeListNumber(list_obj, level = 1, rank = 2)
[1] 2
By selecting the level and rank, the function returns the object's index within the list. But is there a faster way to achieve this result?
You may try -
getTreeListNumber <- function(listObj, level, rank){
which(sapply(listObj, function(x) {
nodes <- tidygraph::activate(x, nodes) %>% tibble::as_tibble()
all(nodes$level == level & nodes$rank == rank)
}))
}
getTreeListNumber(myList, 1, 2)
#[1] 2
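If the lookup is needed many times, it may help to compute one key per list element once and then answer each query with a plain vector comparison. The sketch below is an assumption-laden illustration: it uses plain data frames as stand-ins for the node tables, get_tree_index is a hypothetical name, and it assumes level and rank are constant within each graph (as in the example data). For real tbl_graph objects, which are also igraph objects, the attributes could presumably be read with igraph::vertex_attr() rather than converting to a tibble.

```r
# Stand-in node tables, one per graph (hypothetical data mirroring the question)
node_tables <- list(
  data.frame(level = rep(1, 4), rank = rep(1, 4)),
  data.frame(level = rep(1, 4), rank = rep(2, 4)),
  data.frame(level = rep(1, 4), rank = rep(3, 4)),
  data.frame(level = rep(2, 4), rank = rep(1, 4))
)

# Build the lookup once: one "level_rank" key per list element,
# taken from the first node row (assumes the values are constant per graph)
keys <- vapply(node_tables,
               function(nd) paste(nd$level[1], nd$rank[1], sep = "_"),
               character(1))

# Hypothetical helper: each query is now a cheap comparison against the keys
get_tree_index <- function(keys, level, rank) {
  which(keys == paste(level, rank, sep = "_"))
}

get_tree_index(keys, level = 1, rank = 2)
# [1] 2
```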
I'm working with the stackoverflow developer survey data-set and attempting to predict compensation based on technologies worked with and collaborative tools worked with. These two attributes are multi-valued with semicolons separating the individual values.
For instance, under the CollabToolsWorkedWith attribute in one row, there is Confluence;Jira;Github;Slack;Microsoft;Teams;Google Suite. I want to give each of these values their own column with a value of either 0 or 1 if the row had that value.
The end result would have each row contain a column for every single value under CollabToolsWorkedWith and each column would contain 0's and 1's based on whether or not the row contained that value.
You may get a quicker answer next time if you provide some sample data that everyone can quickly access. I found the 2020 data online. Here is my answer:
# read the data frame
rm(list = ls())
df <- read.csv("survey_results_public.csv")
# figure out which column you are talking about
data.frame(colnames(df))
table(df$NEWCollabToolsWorkedWith)
# convert to lower case and character
df$NEWCollabToolsWorkedWith <- as.character(df$NEWCollabToolsWorkedWith)
df$NEWCollabToolsWorkedWith <- tolower(df$NEWCollabToolsWorkedWith)
# keep only the useful variables and separate based on ;
library(tidyverse)
library(splitstackshape)
namesdf <- df %>% select(NEWCollabToolsWorkedWith)
namesdf <- cSplit(namesdf,"NEWCollabToolsWorkedWith", sep = ";", direction = "wide", drop=TRUE,
type.convert = TRUE)
# stack stuff on top of each other to find unique list of tools/platforms
long_data_frame <-
namesdf %>%
pivot_longer(cols = starts_with("NEWCollabToolsWorkedWith"), # use the columns created by cSplit
names_to ="unique", # name of new column
names_prefix = "_",
values_drop_na = TRUE) %>%
distinct(value)
# clean the variable names
library(janitor)
long_data_frame$value = as.character(long_data_frame$value)
long_data_frame$value = janitor::make_clean_names(long_data_frame$value)
# get final unique list
table(long_data_frame$value)
> table(long_data_frame$value)
confluence facebook_workplace github gitlab
1 1 1 1
google_suite_docs_meet_etc jira microsoft_azure microsoft_teams
1 1 1 1
slack stack_overflow_for_teams trello
1 1 1
# create new variables
df$confluence <- NA
df$jira <- NA
df$slack = NA
df$microsoft_azure =NA
df$trello = NA
df$github = NA
df$gitlab = NA
df$google_suite_docs_meet_etc = NA
df$microsoft_teams = NA
df$stack_overflow_for_teams = NA
df$facebook_workplace =NA
# make a dummy variable based on string match
df$confluence <- as.integer(grepl(pattern = "confluence", x = df$NEWCollabToolsWorkedWith))
df$jira <- as.integer(grepl(pattern = "jira", x = df$NEWCollabToolsWorkedWith))
df$slack <- as.integer(grepl(pattern = "slack", x = df$NEWCollabToolsWorkedWith))
df$microsoft_azure <- as.integer(grepl(pattern = "microsoft azure", x = df$NEWCollabToolsWorkedWith))
df$trello <- as.integer(grepl(pattern = "trello", x = df$NEWCollabToolsWorkedWith))
df$github <- as.integer(grepl(pattern = "github", x = df$NEWCollabToolsWorkedWith))
df$gitlab <- as.integer(grepl(pattern = "gitlab", x = df$NEWCollabToolsWorkedWith))
df$google_suite_docs_meet_etc <- as.integer(grepl(pattern = "google", x = df$NEWCollabToolsWorkedWith))
df$microsoft_teams <- as.integer(grepl(pattern = "microsoft teams", x = df$NEWCollabToolsWorkedWith))
df$stack_overflow_for_teams <- as.integer(grepl(pattern = "overflow", x = df$NEWCollabToolsWorkedWith))
df$facebook_workplace <- as.integer(grepl(pattern = "facebook", x = df$NEWCollabToolsWorkedWith))
# proof that it went through
table(df$facebook_workplace)
> table(df$facebook_workplace)
0 1
62881 1580
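The dummy columns can also be generated without typing out each tool name. The following is a rough base R sketch on made-up input (the tools vector is hypothetical, not the survey data): it splits on ";" and compares whole values, so one tool name cannot accidentally match as a substring of another.

```r
# Hypothetical sample of a semicolon-separated column
tools <- c("Confluence;Jira;Slack", "Github;Slack", NA, "Jira")

# Split each entry into its individual values
parts <- strsplit(tools, ";", fixed = TRUE)

# Unique values across all rows (sort() drops the NA by default)
all_tools <- sort(unique(unlist(parts)))

# One row per entry, one 0/1 column per tool
dummies <- t(vapply(parts,
                    function(p) as.integer(all_tools %in% p),
                    integer(length(all_tools))))
colnames(dummies) <- all_tools
dummies
```

The resulting matrix could then be attached to the original data frame with cbind(). Rows with a missing value simply get 0 in every column.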
I'm trying to create a graph in R using igraph based on rules. I have a graph with nodes, each of which has several attributes. I'd like to add edges based on those attributes. Toy example:
library(igraph)
make_empty_graph() %>%
add_vertices(
nv = 5,
attr = list(
this_attr = sample(c("a", "b"), 5, replace = TRUE)
)
) %>%
{something here to add edges where this_attr is the same}
This appears to be a solution if I were using Gremlin in Python, but I don't grok it/igraph enough to translate to igraph: Gremlin: adding edges between nodes having the same property
If tidygraph would make this easier, that'd be an acceptable dependency.
Any help would be appreciated.
Edit: This works but feels super messy.
g <- igraph::make_empty_graph() %>%
igraph::add_vertices(
nv = 5,
attr = list(
sample_attr = sample(c("a", "b"), 5, replace = TRUE)
)
)
g %>%
igraph::vertex_attr() %>%
unname() %>%
purrr::map(
function(this_attribute) {
unique(this_attribute) %>%
purrr::map(
function(this_value) {
utils::combn(
which(this_attribute == this_value), 2
) %>%
as.integer()
}
) %>% unlist()
}
) %>%
unlist() %>%
igraph::add_edges(g, .)
Something similar but cleaner would be fantastic.
So, I don't think igraph has anything as succinct as the Gremlin example, in which a general statement connects any vertex (A) with any vertex (B) if they share an attribute. However, R provides a bunch of ways to do this with matrices (as @Julius showed) and data frames. Below is how I'd go about this problem with igraph and R.
Given the following graph:
set.seed(4321)
g <- make_empty_graph() %>%
add_vertices(nv = 5, attr = list(sample_attr = sample(c("a", "b"), 5, replace = TRUE)))
We can make a data frame with information taken from the vertices and then left_join it to itself using the attribute column. I'm assuming direction doesn't matter here and that we want to get rid of duplicates. If that is the case, then simply filter the node columns using a < operator.
edge_list <- data.frame(
#id = V(g)$name #if it has a name.....
id = 1:vcount(g), #if no name exists, then the order of a vertex represents an id
attr = V(g)$sample_attr #the first item in this vector corresponds to the first vertex/node
) %>%
dplyr::left_join(., ., by = 'attr') %>% #join the data frame with itself
dplyr::filter(id.x < id.y) #remove self pointing edges and duplicates
# 1 %--% 2 equals 2 %--% 1 connection and are duplicates
Once we have the edge list information in a data frame, we need to convert the pair of node columns into a pairwise vector. This can be done by converting the columns into a matrix, transposing the matrix so that the rows are now columns, then converting the matrix into a single (pairwise) vector.
edge_vector <- edge_list %>%
dplyr::select(id.x, id.y) %>% #select only the node/vertex columns
as.matrix %>% #convert into a matrix so we can make a pairwise vector
t %>% #transpose matrix because matrices convert to vectors by columns
c #now we have a pairwise vector
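To see why the transpose matters, here is a small base R illustration with a made-up three-edge list (the values are hypothetical and not taken from the graph above). Flattening the matrix directly reads it column by column, which breaks up the pairs; transposing first keeps each from/to pair adjacent.

```r
# Hypothetical edge list: 1-2, 1-3, 2-3
el <- data.frame(id.x = c(1, 1, 2), id.y = c(2, 3, 3))
m  <- as.matrix(el)

wrong <- c(m)     # column-major: all id.x values, then all id.y values
right <- c(t(m))  # row-major after transposing: from, to, from, to, ...

wrong
# [1] 1 1 2 2 3 3
right
# [1] 1 2 1 3 2 3
```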
Now, all we need to do is add the pairwise vector and the associated attributes to the graph.
g <- add_edges(g,
edge_vector,
attr = list(this_attr = edge_list$attr)) #order of pairwise vector matches order of edgelist
Let's plot this to see if it worked.
set.seed(4321)
plot(g,
vertex.label = V(g)$sample_attr,
vertex.color = ifelse(V(g)$sample_attr == 'a', 'pink', 'skyblue'),
edge.arrow.size = 0)
Another potential solution is to start with a data frame instead of an empty graph. The data frame would represent a node list that we can join to itself and create an edge list.
set.seed(4321)
node_list <- data.frame(id = 1:5,
attr= sample(c('a', 'b'), 5, replace = T))
edge_list <- merge(node_list, node_list, by = 'attr') %>% #base R merge
.[.$id.x < .$id.y, c('id.x', 'id.y', 'attr')] #rearrange columns in base so first two are node ids
g <- graph_from_data_frame(d = edge_list, directed = F, vertices = node_list)
set.seed(4321)
plot(g,
vertex.label = V(g)$attr,
vertex.color = ifelse(V(g)$attr == 'a', 'pink', 'skyblue'),
edge.arrow.size = 0)
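If you would rather avoid the self-join, one more option is to split the vertex ids by attribute value and take all pairs within each group. This is only a sketch with hypothetical attribute values; pairs_for is a made-up helper name. The length check matters because combn() treats a single positive integer n as the set 1:n, which would create edges that should not exist for a group with one member.

```r
ids         <- 1:5
attr_values <- c("a", "b", "a", "a", "b")  # hypothetical attribute values

# Group the vertex ids by their attribute value
groups <- split(ids, attr_values)

# Hypothetical helper: all pairs within one group as a pairwise vector
pairs_for <- function(members) {
  if (length(members) < 2) return(integer(0))  # nothing to connect
  c(utils::combn(members, 2))  # 2 x k matrix flattened column-wise
}

edge_vector <- unlist(lapply(groups, pairs_for), use.names = FALSE)
edge_vector
# [1] 1 3 1 4 3 4 2 5
```

A vector in this form is what add_edges() expects, so it could be passed to it in the same way as the pairwise vector built earlier.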
Given a graph,
g <- make_empty_graph() %>%
add_vertices(nv = 5, attr = list(this_attr = sample(c("a", "b"), 5, replace = TRUE)))
we can first define this adjacency matrix in terms of the attribute
(auxAdj <- tcrossprod(table(1:gorder(g), V(g)$this_attr)) - diag(gorder(g)))
# 1 2 3 4 5
# 1 0 1 1 1 0
# 2 1 0 1 1 0
# 3 1 1 0 1 0
# 4 1 1 1 0 0
# 5 0 0 0 0 0
and use it to add edges as in
g <- add_edges(g, c(t(which(auxAdj == 1, arr.ind = TRUE))))
where
c(t(which(auxAdj == 1, arr.ind = TRUE)))
# [1] 2 1 3 1 4 1 1 2 3 2 4 2 1 3 2 3 4 3 1 4 2 4 3 4
meaning that we want edges (2,1), (3,1), (4,1) and so on.
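Note that this vector lists every pair in both directions, e.g. both (2,1) and (1,2). If only one edge per pair is wanted, one possibility is to keep just the upper triangle of the matrix before extracting the indices. The sketch below uses base R only and assumes attribute values a, a, a, a, b, which would be consistent with the matrix printed above; the actual values depend on the random sample.

```r
this_attr <- c("a", "a", "a", "a", "b")  # assumed values for illustration

# TRUE wherever two vertices share the attribute value
same <- outer(this_attr, this_attr, "==")

# Drop the diagonal and the lower triangle so each pair appears once
same[lower.tri(same, diag = TRUE)] <- FALSE

pairs       <- which(same, arr.ind = TRUE)
edge_vector <- c(t(pairs))
edge_vector
# [1] 1 2 1 3 2 3 1 4 2 4 3 4
```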
Looking for some help on joining in order to make a forceNetwork() graph using networkD3. I can't figure out what's wrong with the code below, as I'm getting the following warning messages.
I used this code before and I got it to work back then - just not sure what's different this time as I feel the input file is the same.
Warning messages:
1: Column `src`/`name` joining factors with different levels, coercing to character vector
2: Column `target`/`name` joining factors with different levels, coercing to character vector
# Load package
library(networkD3)
library(dplyr)
# Create data
src <- c(all_artists$from)
target <- c(all_artists$to)
networkData <- data.frame(src, target, stringsAsFactors = TRUE)
networkData
nodes <- data.frame(name = unique(c(src, target)), size = all_artists$related_artist_followers, stringsAsFactors = TRUE)
nodes$id <- 0:(nrow(nodes) - 1)
nodes
width <- c(all_artists$related_artist_followers)
width
# create a data frame of the edges that uses id 0:9 instead of their names
edges <- networkData %>%
left_join(nodes, by = c("src" = "name")) %>%
select(-src) %>%
rename(source = id) %>%
left_join(nodes, by = c("target" = "name")) %>%
select(-target) %>%
rename(target = id)
The dataset shows the artists that are related to each other: for each edge, from holds the source artist and to holds the related artist.
from        to          artist_popularity
Jay-Z       Kanye West  80
Jay-Z       P. Diddy    60
Kanye West  Kid Cudi    40
The line where you build the nodes data frame seems unlikely to work as expected because there's no connection between the length of unique(c(src, target)) and all_artists$related_artist_followers. You could count the number of times a node/name appears in the networkData$src or all_artists$from column with...
nodes$size <- sapply(nodes$name, function(name) sum(networkData$src %in% name))
Once you have the nodes data frame created, it's easy to convert the names in the networkData data frame to zero-indexed indices with...
networkData$src <- match(networkData$src, nodes$name) - 1
networkData$target <- match(networkData$target, nodes$name) - 1
Note that it is also mandatory to provide a Value parameter for the Links data frame and a Group parameter for the Nodes data frame (any parameter that does not have a default value in the help file is mandatory, otherwise you might get an error or unexpected behavior; that goes for all R functions, not just networkD3). You can create columns in your data frames for them like this...
networkData$value <- 1
nodes$group <- 1
So all together in a reproducible example, you might have...
from <- c("Jay-Z", "Jay-Z", "Kanye West")
to <- c("Kanye West", "P. Diddy", "Kid Cudi")
artist_popularity <- c(80, 60, 40)
all_artists <- data.frame(from, to, artist_popularity, stringsAsFactors = FALSE)
networkData <- data.frame(src = all_artists$from, target = all_artists$to,
stringsAsFactors = FALSE)
nodes <- data.frame(name = unique(c(networkData$src, networkData$target)),
stringsAsFactors = FALSE)
nodes$size <- sapply(nodes$name, function(name) sum(networkData$src %in% name))
networkData$src <- match(networkData$src, nodes$name) - 1
networkData$target <- match(networkData$target, nodes$name) - 1
networkData$value <- 1
nodes$group <- 1
library(networkD3)
forceNetwork(Links = networkData, Nodes = nodes, Source = "src",
Target = "target", Value = "value", NodeID = "name",
Nodesize = "size", Group = "group", opacityNoHover = 1)
So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the full missing data in wide form. All of these data frames contain a column that has a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # assumed source of melt(), which is used below
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
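For readers who want to see the mechanics, the same first-non-NA logic can be sketched in base R. coalesce_base is a hypothetical helper written for illustration, not part of dplyr, and the vectors are made-up values.

```r
# Hypothetical base R stand-in for coalesce(): walk through the vectors
# and fill the remaining NAs from each following one
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

mathscore <- c(450, NA, NA, 480)
math4     <- c(NA, 410, NA, 999)
math5     <- c(NA, 420, 430, NA)

filled <- coalesce_base(mathscore, math4, math5)
filled
# [1] 450 410 430 480
```

The fourth value stays 480 even though math4 holds 999 there, because a later vector is only consulted where the earlier ones are NA.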
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
# school4 and school5 store the scores as math4/read4 and math5/read5
sch4r <- as.data.frame(subset(school4, select = -c(math4)))
sch4m <- as.data.frame(subset(school4, select = -c(read4)))
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
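One possibly simpler route is to rename the wide columns to common names, stack the two grade tables, and merge on id and grade, so that values are matched by key rather than by row position. This is a sketch on tiny made-up data frames with the same column layout as above; math_full and read_full are hypothetical column names introduced for the illustration.

```r
# Tiny hypothetical versions of the three data frames
school_lf <- data.frame(id = c(1, 2, 1, 2), grade = c(4, 4, 5, 5),
                        mathscore = c(NA, 410, 450, NA),
                        readscore = c(400, NA, NA, 470))
school4 <- data.frame(id = c(1, 2), math4 = c(405, 410),
                      read4 = c(400, 415), grade = 4)
school5 <- data.frame(id = c(1, 2), math5 = c(450, 455),
                      read5 = c(460, 470), grade = 5)

# Give both grade tables the same column names, then stack them
common <- c("id", "grade", "math_full", "read_full")
full <- rbind(
  setNames(school4[c("id", "grade", "math4", "read4")], common),
  setNames(school5[c("id", "grade", "math5", "read5")], common)
)

# Match on id and grade, then fill NAs from the full values
out <- merge(school_lf, full, by = c("id", "grade"), all.x = TRUE)
out$mathscore <- ifelse(is.na(out$mathscore), out$math_full, out$mathscore)
out$readscore <- ifelse(is.na(out$readscore), out$read_full, out$readscore)
out <- out[c("id", "grade", "mathscore", "readscore")]
out
```

Because the match is by key, the replacement tables do not need the same number of rows as the long data, and ids missing from them simply keep their NA.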