Network graph R - joining

I'm looking for help with joining data in order to make a forceNetwork() graph using networkD3. I can't figure out what's wrong with the code below; it produces the following warning messages.
I have used this code before and it worked then, so I'm not sure what's different this time, as I believe the input file is the same.
Warning messages:
1: Column `src`/`name` joining factors with different levels, coercing to character vector
2: Column `target`/`name` joining factors with different levels, coercing to character vector
# Load package
library(networkD3)
library(dplyr)
# Create data
src <- c(all_artists$from)
target <- c(all_artists$to)
networkData <- data.frame(src, target, stringsAsFactors = TRUE)
networkData
nodes <- data.frame(name = unique(c(src, target)), size = all_artists$related_artist_followers, stringsAsFactors = TRUE)
nodes$id <- 0:(nrow(nodes) - 1)
nodes
width <- c(all_artists$related_artist_followers)
width
# create a data frame of the edges that uses id 0:9 instead of their names
edges <- networkData %>%
left_join(nodes, by = c("src" = "name")) %>%
select(-src) %>%
rename(source = id) %>%
left_join(nodes, by = c("target" = "name")) %>%
select(-target) %>%
rename(target = id)
The dataset shows which artists are related to each other: from holds the source artist and to holds the related artist.
from        to          artist_popularity
Jay-Z       Kanye West  80
Jay-Z       P. Diddy    60
Kanye West  Kid Cudi    40

The line where you build the nodes data frame seems unlikely to work as expected because there's no connection between the length of unique(c(src, target)) and all_artists$related_artist_followers. You could count the number of times a node/name appears in the networkData$src or all_artists$from column with...
nodes$size <- sapply(nodes$name, function(name) sum(networkData$src %in% name))
Once you have the nodes data frame created, it's easy to convert the names in the networkData data frame to zero-indexed indices with...
networkData$src <- match(networkData$src, nodes$name) - 1
networkData$target <- match(networkData$target, nodes$name) - 1
Note that it is also mandatory to provide a Value parameter for the Links data frame and a Group parameter for the Nodes data frame (any parameter that does not have a default value in the help file is mandatory; otherwise you might get an error or unexpected behavior. That goes for all R functions, not just networkD3). You can create columns in your data frames for them like this...
networkData$value <- 1
nodes$group <- 1
So all together in a reproducible example, you might have...
from <- c("Jay-Z", "Jay-Z", "Kanye West")
to <- c("Kanye West", "P. Diddy", "Kid Cudi")
artist_popularity <- c(80, 60, 40)
all_artists <- data.frame(from, to, artist_popularity, stringsAsFactors = FALSE)
networkData <- data.frame(src = all_artists$from, target = all_artists$to,
stringsAsFactors = FALSE)
nodes <- data.frame(name = unique(c(networkData$src, networkData$target)),
stringsAsFactors = FALSE)
nodes$size <- sapply(nodes$name, function(name) sum(networkData$src %in% name))
networkData$src <- match(networkData$src, nodes$name) - 1
networkData$target <- match(networkData$target, nodes$name) - 1
networkData$value <- 1
nodes$group <- 1
library(networkD3)
forceNetwork(Links = networkData, Nodes = nodes, Source = "src",
Target = "target", Value = "value", NodeID = "name",
Nodesize = "size", Group = "group", opacityNoHover = 1)
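The warnings in the question come from joining factor columns whose level sets differ. As a minimal sketch, assuming the same three-row sample data, the original left_join() approach also works without warnings when the names are kept as character vectors. The column names source and target are just illustrative choices here.

```r
library(dplyr)

# Sample data from the question, kept as character (not factor) columns
all_artists <- data.frame(
  from = c("Jay-Z", "Jay-Z", "Kanye West"),
  to   = c("Kanye West", "P. Diddy", "Kid Cudi"),
  stringsAsFactors = FALSE
)

# One row per distinct artist, with a zero-based id as networkD3 expects
nodes <- data.frame(
  name = unique(c(all_artists$from, all_artists$to)),
  stringsAsFactors = FALSE
)
nodes$id <- seq_len(nrow(nodes)) - 1L

# Replace the artist names with their ids via two joins
edges <- all_artists %>%
  left_join(nodes, by = c("from" = "name")) %>%
  rename(source = id) %>%
  left_join(nodes, by = c("to" = "name")) %>%
  rename(target = id) %>%
  select(source, target)

edges
```

Both character columns join cleanly, so no coercion warning is raised.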

Related

Combining rows based on conditions and saving others (in R)

I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two data frames. All rows where code and gender match should be combined (so columns are added). The code and gender variables of those two rows should each become one column (gender3 and code3), and the eye-tracking data should be split into eye_first for random1 and eye_second for random2.
All rows for which no perfect match on code and gender was found should go into a new dataset.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your two data frames and use merge:
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr:
# all = TRUE keeps rows whose ids appear in only one of the two data frames
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE)
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
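For reference, here is a base R sketch of both steps on the question's data. It assumes the original eye columns contain no missing values, so an NA after the full merge can only mean that the row had no partner in the other data frame.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400),
                      stringsAsFactors = FALSE)
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450),
                      stringsAsFactors = FALSE)

# Rows whose code AND gender appear in both data frames
matched <- merge(random1, random2, by = c("code", "gender"),
                 suffixes = c("_first", "_second"))

# Full merge, then keep only rows that are missing one side
# (assumption: eye itself is never NA in the inputs)
all_rows <- merge(random1, random2, by = c("code", "gender"),
                  suffixes = c("_first", "_second"), all = TRUE)
unmatched <- all_rows[is.na(all_rows$eye_first) | is.na(all_rows$eye_second), ]

matched
unmatched
```

Note that "DEF" shows up twice among the unmatched rows, once per data frame, because its gender differs between the two experiments.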

Filter data in loop over vector and bind data frames

I am a noobie at writing for loops and could use some support getting mine to work. I have a data frame that I would like to filter based on values stored in a vector. I need to filter the data frame multiple times, once for each value in my vector. After each filter, I want to store the subsetted data frame and bind all the data frames together.
I found this super handy code on how to store the results of each filtered dataset and bind together - Append data frames together in a for loop
However, I can't get the code to loop over each of the values in my vector. It only loops over the first value.
Here is an example:
area <- data.frame(
land = c("68N03E220090", "68N03E244635", "68N03E244352", "68N03E223241"),
type = c("home", "mobile", "home", "vacant"),
object_id = c(NA, 7, NA, 34)
)
block <- c("68N03E22", "68N03E24")
datalist = list()
for (value in block){
df <- area %>% filter(is.na(object_id) & grepl(paste0("^", block),land))
df$value <- value
datalist[[value]] <- df # add it to your list
}
df_filtered <- dplyr::bind_rows(datalist)
I receive a warning:
Warning messages:
1: In grepl(paste0("^", block), land) :
argument 'pattern' has length > 1 and only the first element will be used
2: In grepl(paste0("^", block), land) :
argument 'pattern' has length > 1 and only the first element will be used
I think you've got a typo/error in your filter; do you get the correct output when you change "block" to "value" in your grepl? E.g.
library(tidyverse)
area <- data.frame(
land = c("68N03E220090", "68N03E244635", "68N03E244352", "68N03E223241"),
type = c("home", "mobile", "home", "vacant"),
object_id = c(NA, 7, NA, 34)
)
block <- c("68N03E22", "68N03E24")
datalist = list()
for (value in block){
df <- area %>% filter(is.na(object_id) & grepl(paste0("^", value),land))
df$value <- value
datalist[[value]] <- df # add it to your list
}
df_filtered <- dplyr::bind_rows(datalist)
df_filtered
#> land type object_id value
#> 1 68N03E220090 home NA 68N03E22
#> 2 68N03E244352 home NA 68N03E24
For this example, you could also avoid the for-loop by using:
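df_filtered_2 <- area %>%
  # anchor the alternation with "^" so it only matches at the start, like the loop does
  filter(is.na(object_id) & grepl(pattern = paste0("^(", paste0(block, collapse = "|"), ")"), x = land)) %>%
  mutate(value = str_sub(land, 1, 8))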
identical(df_filtered, df_filtered_2)
#> [1] TRUE
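If you would rather keep the "one filter per value" structure but drop the explicit loop bookkeeping, a base R sketch could look like the following. It swaps grepl() for startsWith(), which does literal prefix matching, and the name prefix is just an illustrative argument name.

```r
area <- data.frame(
  land = c("68N03E220090", "68N03E244635", "68N03E244352", "68N03E223241"),
  type = c("home", "mobile", "home", "vacant"),
  object_id = c(NA, 7, NA, 34),
  stringsAsFactors = FALSE  # startsWith() needs character, not factor
)
block <- c("68N03E22", "68N03E24")

# One filtered data frame per prefix, collected in a list
pieces <- lapply(block, function(prefix) {
  hit <- is.na(area$object_id) & startsWith(area$land, prefix)
  out <- area[hit, , drop = FALSE]
  out$value <- rep(prefix, nrow(out))  # rep() also copes with zero matching rows
  out
})

# Stack the pieces into a single data frame
df_filtered <- do.call(rbind, pieces)
df_filtered
```

Because each call sees exactly one prefix, the "pattern has length > 1" warning from the question cannot occur.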

Separating multi-valued attributes into individual attributes R

I'm working with the stackoverflow developer survey data-set and attempting to predict compensation based on technologies worked with and collaborative tools worked with. These two attributes are multi-valued with semicolons separating the individual values.
For instance, under the CollabToolsWorkedWith attribute in one row, there is Confluence;Jira;Github;Slack;Microsoft;Teams;Google Suite. I want to give each of these values their own column with a value of either 0 or 1 if the row had that value.
The end result would have each row contain a column for every single value under CollabToolsWorkedWith and each column would contain 0's and 1's based on whether or not the row contained that value.
You may get a quicker answer next time if you provide some sample data that everyone can quickly access. I found the 2020 data online. Here is my answer:
# read the data frame
rm(list = ls())
df <- read.csv("survey_results_public.csv")
# figure out which column you are talking about
data.frame(colnames(df))
table(df$NEWCollabToolsWorkedWith)
# convert to lower case and character
df$NEWCollabToolsWorkedWith <- as.character(df$NEWCollabToolsWorkedWith)
df$NEWCollabToolsWorkedWith <- tolower(df$NEWCollabToolsWorkedWith)
# keep only the useful variables and separate based on ;
library(tidyverse)
library(splitstackshape)
namesdf <- df %>% select(NEWCollabToolsWorkedWith)
namesdf <- cSplit(namesdf,"NEWCollabToolsWorkedWith", sep = ";", direction = "wide", drop=TRUE,
type.convert = TRUE)
# stack stuff on top of each other to find unique list of tools/platforms
long_data_frame <-
namesdf %>%
pivot_longer(cols = starts_with("NEWCollabToolsWorkedWith"), # use the columns created by cSplit
names_to ="unique", # name of new column
names_prefix = "_",
values_drop_na = TRUE) %>%
distinct(value)
# clean the variable names
library(janitor)
long_data_frame$value = as.character(long_data_frame$value)
long_data_frame$value = janitor::make_clean_names(long_data_frame$value)
# get final unique list
table(long_data_frame$value)
#> confluence                  facebook_workplace        github           gitlab
#> 1                           1                         1                1
#> google_suite_docs_meet_etc  jira                      microsoft_azure  microsoft_teams
#> 1                           1                         1                1
#> slack                       stack_overflow_for_teams  trello
#> 1                           1                         1
# create new variables
df$confluence <- NA
df$jira <- NA
df$slack = NA
df$microsoft_azure =NA
df$trello = NA
df$github = NA
df$gitlab = NA
df$google_suite_docs_meet_etc = NA
df$microsoft_teams = NA
df$stack_overflow_for_teams = NA
df$facebook_workplace =NA
# make a dummy variable based on string match
df$confluence <- as.integer(grepl(pattern = "confluence", x = df$NEWCollabToolsWorkedWith))
df$jira <- as.integer(grepl(pattern = "jira", x = df$NEWCollabToolsWorkedWith))
df$slack <- as.integer(grepl(pattern = "slack", x = df$NEWCollabToolsWorkedWith))
df$microsoft_azure <- as.integer(grepl(pattern = "microsoft azure", x = df$NEWCollabToolsWorkedWith))
df$trello <- as.integer(grepl(pattern = "trello", x = df$NEWCollabToolsWorkedWith))
df$github <- as.integer(grepl(pattern = "github", x = df$NEWCollabToolsWorkedWith))
df$gitlab <- as.integer(grepl(pattern = "gitlab", x = df$NEWCollabToolsWorkedWith))
df$google_suite_docs_meet_etc <- as.integer(grepl(pattern = "google", x = df$NEWCollabToolsWorkedWith))
df$microsoft_teams <- as.integer(grepl(pattern = "microsoft teams", x = df$NEWCollabToolsWorkedWith))
df$stack_overflow_for_teams <- as.integer(grepl(pattern = "overflow", x = df$NEWCollabToolsWorkedWith))
df$facebook_workplace <- as.integer(grepl(pattern = "facebook", x = df$NEWCollabToolsWorkedWith))
# proof that it went through
table(df$facebook_workplace)
#>     0     1
#> 62881  1580
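The per-tool grepl() lines above can also be generated rather than typed out. The following is a minimal sketch on made-up toy data (the three rows are not from the survey), assuming exact tool names should be matched rather than substrings.

```r
# Toy stand-in for the survey column; values are invented for illustration
df <- data.frame(
  id = 1:3,
  CollabToolsWorkedWith = c("Confluence;Jira;Slack", "Github;Slack", NA),
  stringsAsFactors = FALSE
)

# Split each row on ";" into a list of character vectors
parts <- strsplit(df$CollabToolsWorkedWith, ";", fixed = TRUE)

# Unique tool names across all rows (sort() drops the NA)
tools <- sort(unique(unlist(parts)))

# One 0/1 column per tool: does this row's vector contain the tool?
dummies <- vapply(
  tools,
  function(tool) as.integer(vapply(parts, function(p) tool %in% p, logical(1))),
  integer(nrow(df))
)

df <- cbind(df, dummies)
df
```

Rows with a missing value get 0 in every tool column; whether that or NA is preferable depends on the model being fitted.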

Using visNetwork to dynamically update nodes in R

The snapshot below was created using the "visNetwork" package. My problem is that I have to hard-code the edges, and after using visHierarchicalLayout() the nodes do not appear in order. Please help me with a dynamic approach so that, however many nodes there are, I get consecutive numbers in order without hard-coding. Thanks.
library(visNetwork)
nodes <- data.frame(id = 1:7, label = 1:7)
edges <- data.frame(from = c(1,2,3,4,5,6),
to = c(2,3,4,5,6,7))
visNetwork(nodes, edges, width = "100%") %>%
visEdges(arrows = "to") %>%
visHierarchicalLayout()
Using the level attribute does the job: it aligns the network based on the order given.
library(visNetwork)
nodes <- data.frame(id = 1:7, label = 1:7, level = 1:7)
# Extract the id
num <- nodes$id
# Repeat the numbers
num2 <- rep(num, each = 2)
# Remove the first and last numbers
num3 <- num2[c(-1, -length(num2))]
# Create a data frame
edges <- as.data.frame(matrix(num3, ncol = 2, byrow = TRUE))
names(edges) <- c("from", "to")
visNetwork(nodes, edges, width = "100%") %>%
visEdges(arrows = "to") %>%
visHierarchicalLayout()
If I understand your question correctly, you want to create the edges data frame based on the id in the nodes data frame. Here is one option.
# Extract the id
num <- nodes$id
# Repeat the numbers
num2 <- rep(num, each = 2)
# Remove the first and last numbers
num3 <- num2[c(-1, -length(num2))]
# Create a data frame
edges <- as.data.frame(matrix(num3, ncol = 2, byrow = TRUE))
names(edges) <- c("from", "to")
edges
# from to
# 1 1 2
# 2 2 3
# 3 3 4
# 4 4 5
# 5 5 6
# 6 6 7
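A shorter way to get the same consecutive pairs, sketched here with a hypothetical helper named consecutive_edges(), is to pair every id except the last with every id except the first:

```r
# Hypothetical helper: link each id to the one that follows it
consecutive_edges <- function(ids) {
  data.frame(from = head(ids, -1), to = tail(ids, -1))
}

nodes <- data.frame(id = 1:7, label = 1:7, level = 1:7)
edges <- consecutive_edges(nodes$id)
edges
```

This gives the same from/to pairs as the rep()/matrix() approach and works for any number of nodes.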

Merging Long-Form Data that has NAs with Wide-Form Complete Data To Override NAs

So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long-form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the complete data in wide form. All of these data frames contain a column that has a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
library(reshape2) # assumed source of melt(); the original post does not say
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
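To illustrate how coalesce() picks values, here is a tiny sketch on made-up vectors (the numbers are invented, and 999 marks values that should never be chosen):

```r
library(dplyr)

long_score <- c(410, NA, NA, NA)    # the column with gaps
grade4     <- c(999, 430, NA, NA)   # first fallback
grade5     <- c(999, 999, 450, NA)  # second fallback

# Position by position, take the first non-missing value
filled <- coalesce(long_score, grade4, grade5)
filled
```

A position stays NA only when every input is NA there, and all vectors must have the same length (or length 1).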
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
# school4/school5 name their score columns math4/read4 and math5/read5
sch4r <- as.data.frame(subset(school4, select = -c(math4)))
sch4m <- as.data.frame(subset(school4, select = -c(read4)))
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
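One possibly more compact variant, sketched on small made-up data rather than the simulated data above, is to rename the wide score columns to a common pair of names, stack the two grades, and join on id and grade before coalescing. The names math_full and read_full are hypothetical. Joining by key also avoids relying on the rows being in the same order.

```r
library(dplyr)

# Toy stand-ins for school_lf, school4 and school5 (values are invented)
school_lf <- data.frame(
  id        = c(1, 2, 1, 2),
  grade     = c(4, 4, 5, 5),
  mathscore = c(410, NA, NA, 450),
  readscore = c(NA, 420, 460, NA)
)
school4 <- data.frame(id = c(1, 2), grade = 4, math4 = c(410, 430), read4 = c(405, 420))
school5 <- data.frame(id = c(1, 2), grade = 5, math5 = c(440, 450), read5 = c(460, 470))

# Give both wide tables the same column names, then stack them
full_scores <- bind_rows(
  rename(school4, math_full = math4, read_full = read4),
  rename(school5, math_full = math5, read_full = read5)
)

# Join by key and fill the gaps in the long-form scores
final_df <- school_lf %>%
  left_join(full_scores, by = c("id", "grade")) %>%
  mutate(
    mathscore = coalesce(mathscore, math_full),
    readscore = coalesce(readscore, read_full)
  ) %>%
  select(-math_full, -read_full)

final_df
```

If the replacement data itself has NAs, those positions simply stay NA in the result; the row count is unchanged as long as each id/grade pair appears at most once in the stacked table.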
