I want to create a graph from a data frame with multiple data columns, where all of the columns contain vertices, like this:
example data
If two vertices are found in a row together, then they should be connected in the graph. In my example, vertex "Case no. 3" should be connected to the following vertices: "case no. 1", "Jon", "case no. 5", "Bill" (NA should be ignored).
Thanks in advance!
Your question is really about manipulating the raw data, because you need to construct your edge list correctly. To do this, you need two columns: the sender of the link (col. 1) and the receiver of the link (col. 2). Self-directed links are allowed (e.g., from 'a' to 'a'). Any other columns are always treated as characteristics of the links.
Your example edge list shows 3 columns of vertices: this is not a valid edge list, because only the first two columns are read as vertices. So,
you'll have to construct a valid edge list by manipulating the data (see below).
Then, you should tell igraph what your edge list is and construct a graph, like in this answer and/or this one (sorry for the shameless self-promotion).
In order to construct a valid edgelist from the example you provide, with tidyverse tools and the %>% operator:
library(dplyr)

# ↓ SAMPLE DATA (colnames are different from the ones you provided) ↓
raw_data <- data.frame(case_no = c(1, 2, 3, 4),
                       related_case = c(3, 5, 5, NA),
                       received_by = c("Jon", "Wendy", "Jon", NA),
                       packed_by = c(NA, "Wendy", "Bill", NA))
# ↓ First series of links ↓
# FROM and TO must be the first two columns: igraph reads those as the edges
# and treats every later column as an edge attribute.
edges_list <- raw_data %>%
  select(FROM = case_no, TO = received_by, related_case) %>%
  mutate(TYPE = 'Received') # ↑ THIS IS ONLY THE FIRST COLUMN OF RECEIVERS
# ↓ APPEND THE SECOND LIST OF RECEIVERS TO THE FIRST VERSION OF THE EDGE LIST ↓
edges_list <- select(raw_data, FROM = case_no, TO = packed_by, related_case) %>%
  mutate(TYPE = 'Packed') %>% # ↑ HERE THE SECOND COLUMN OF RECEIVERS ↑
  rbind(edges_list)
edges_list <- na.omit(edges_list) # ← REMOVE ROWS CONTAINING NA
edges_list %>% igraph::graph_from_data_frame(directed = TRUE) %>%
  igraph::plot.igraph() # CREATE YOUR GRAPH
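If the goal is the one stated in the question (every pair of vertices that share a row gets an edge, NA ignored), a different construction may be closer to it. The following is only a sketch under assumptions: it reuses the sample data from this answer, prefixes case numbers with "case" (my own labelling choice, so that numbers and person names stay distinct), and uses a hypothetical helper called `row_pairs`.

```r
library(igraph)

raw_data <- data.frame(
  case_no      = c(1, 2, 3, 4),
  related_case = c(3, 5, 5, NA),
  received_by  = c("Jon", "Wendy", "Jon", NA),
  packed_by    = c(NA, "Wendy", "Bill", NA),
  stringsAsFactors = FALSE
)

# Assumption: label case numbers so they read as "case 3" rather than "3"
raw_data$case_no <- paste("case", raw_data$case_no)
raw_data$related_case <- ifelse(is.na(raw_data$related_case),
                                NA_character_,
                                paste("case", raw_data$related_case))

# Hypothetical helper: all unordered pairs of the non-NA values in row i
row_pairs <- function(i) {
  v <- unlist(raw_data[i, ], use.names = FALSE)
  v <- unique(v[!is.na(v)])
  if (length(v) < 2) return(NULL)  # a row with a single value yields no edge
  t(combn(v, 2))
}

edge_mat <- do.call(rbind, lapply(seq_len(nrow(raw_data)), row_pairs))

# simplify() merges edges that were generated by more than one row
g <- simplify(graph_from_edgelist(edge_mat, directed = FALSE))

neighbors(g, "case 3")$name
```

With this sample data, "case 3" ends up adjacent to "case 1", "Jon", "case 5" and "Bill". Note that a vertex appearing alone in its row (here "case 4") produces no edge and is therefore absent from the graph.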
I have the following shapefile:
heitaly<- readOGR("ProvCM01012017/ProvCM01012017_WGS84.shp")
FinalData<- merge(italy, HT, by.x="COD_PROV", by.y="Domain")
However, I'm not interested in all of Italy, only in some provinces. How can I select them?
There are many ways to select a category in a shapefile. I don't know what you need it for: it could be, for example, to colour a specific region in a plot, or to select a row from the shapefile's attribute table.
To plot:
plot(shape, col = shape$column_name == "element") # general example
plot(heitaly, col = heitaly$COD_PROV == "name of province") # your shapefile
To get the attribute table:
df <- shape %>% data.frame
This will give you the complete attribute table.
row <- shape %>% data.frame %>% slice(1)
This will give you the first row with all columns. If you change the number 1 to another number, for example 3, it will give you the information for row number 3.
I hope this is useful.
I am trying to visualise migration data with a Sankey diagram, in which names of nodes will be repeated between the "from" and "to" columns of the data frame.
Unfortunately, highcharter tries to use single nodes and makes the edges go back and forth:
# import and prepare the data
flows <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/13_AdjacencyDirectedWeighted.csv",
header = TRUE,
check.names = FALSE)
flows$from <- rownames(flows)
library(tidyr)
flows <- flows %>%
pivot_longer(-from, names_to = "to", values_to = "weight")
# visualise
library(highcharter)
hchart(flows, "sankey")
How would one force the nodes to be placed on two separate columns, while keeping the same colour for each area/continent?
I have used the workaround of renaming the "to" nodes so they don't share names (e.g. prepending "to " to each of them), but I would like to keep the same names and have the colours match.
# extra data preparation step for partial workaround
flows$to <- paste("to", flows$to)
I had the same trouble and it was very frustrating. The only way that worked relatively well for me was, following your approach, generating white space before the names in the "to" column, like this:
data %>% data_to_sankey() %>% mutate(to = paste(" ", to)) %>% hchart(type = "sankey")
I hope this can help you.
Thank you!
A fairly common data issue in some circles is coding an instrument, for example this one, where related items are separated in the instrument. The idea is to avoid cuing the respondent that all of these questions - say all those beginning with A or D in this example, are related.
rm(list=ls())
library(tidyverse) # provides as_tibble(), mutate() and %>%
Column_Names = c('A2','B1','D1','A1','B3','B2','D3','A3','A4','D2')
Matrix <- matrix(nrow=10,ncol=10,floor(runif(n=100,min=0, max=10)))
Data <- as_tibble(Matrix)
names(Data) = Column_Names
Identifier = sample(letters, size=10)
Data <- Data %>% mutate(Identifier = Identifier)
rm(Matrix, Column_Names, Identifier)
Scoring these instruments typically requires you to take all the related items (B1, B2, B3, ...) and do something with them. This is often something very simple, such as a mean, and it may be the case that some items have to be recoded (0=9, 1=8, 2=7, 3=6, ...). All of that is easy. Typically there are sub-scales (A, B, C, D here), and these are calculated on fixed subsets of the variables: A1, A2, A3 might be one subscale, and D1, D2 and D3 another, for example.
The hard bit is selecting the columns. What happens in real examples, is there might be 50 items, and they should be in a fixed order. (These instruments provide the questions in a fixed order). The tricky bit is that different investigators will use different NAMES for the SAME variables. They don't use names like the ones in my example, where each subscale has a name that refers directly to the subscale.
There are only two sane ways to do this, and get it right (I think). One is to rename the variables to a common set of names, and score these. The other is to pick the variables out by position - so subscale A is columns 1, 3 and 8, subscale B is columns 2,5 and 6.
This is trivial, if you put the variable numbers in your code.
Data %>% select(2,5,6) %>% names()
It's not very generalisable. I have a case where I have 70 variables, 8 subscales with 4 to 8 variables in each, and an overall score with 40 variables. I would like, but can't see a way, to read in the column numbers from a file, and pull them out.
Scales <- c('A','B','D')
Numbers <- c('1,4,8,9','2,5,6','3,7,10')
Scale <- tibble(Scales,Numbers)
I'd like to get them out
Data %>% select(2,5,6) %>% names()
works, but
Data %>% select(Scale$Numbers[1]) %>% names()
Data %>% select(as.numeric(Scale$Numbers[1])) %>% names()
don't, and neither do many efforts involving quo, !!, and the like. I know Hadley Wickham disapproves of using column positions and I get why, but this is a reasonable use case, maybe the only one I've ever come across.
This is what I'm doing for this particular study -
Somatic <- Questionnaire1 %>%
select( 1,7,16,32,43,
45,50,51, Identifier) %>%
rowwise(Identifier) %>%
mutate(Somatic = mean(c_across(1:8),na.rm=TRUE)) %>%
select(Identifier, Somatic)
Cognitive <- Questionnaire %>%
select( 2, 4, 8,11,15,
21,22,23,36,42,
46,54,59, Identifier) %>%
rowwise(Identifier) %>%
mutate(Cognitive = mean(c_across(1:13),na.rm=TRUE)) %>%
select(Identifier, Cognitive)
In this application, there are nine of these: 8 subscales and one overall score. In this study, there are 5 different instruments being used, with five different sets of variables, and a total of twenty subscales. This is why I want to have a programmatic solution.
Suggestions welcomed!
Thanks
Anthony
It sounds like you are going to need to create some type of mapping regardless. Consider this approach, which creates a vector of attributes capturing the group each question belongs to. I used "I" for the Identifier, as the length needs to match the number of columns.
Attributes <- c("A", "B", "D", "A", "B", "B", "D", "A", "A", list(c("B", "D")), "I")
attr(Data, "Group") <- Attributes
Data %>%
select_if(purrr::map_lgl(attr(Data, "Group"), `%in%`, x = "B")) %>% names()
#[1] "B1" "B3" "B2" "D2"
Consider using a named list. Is this what you want?
Scales <- c('A','B','D')
Numbers = list(A = c(1,4,8,9), B = c(2,5,6), D = c(3, 7, 10))
Scale <- tibble(Scales,Numbers)
Data %>% select(Scale$Numbers$B) %>% names()
#[1] "B1" "B3" "B2"
We can also make this into a search function.
find <- function(x){
Data %>% select(Scale$Numbers[[x]]) %>% names()
}
find("B")
#[1] "B1" "B3" "B2"
Here is another approach so you don't have to identify the column numbers in advance.
Data %>%
select(starts_with("B")) %>%
names()
#[1] "B1" "B3" "B2"
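If the column numbers arrive as comma-separated strings, as in the question's `Scale` table (for example after being read from a file), they can be split into integers before being passed to `select()`. This is only a sketch: `parse_positions`, `positions`, `scores` and `Scored` are hypothetical names, and the seed is arbitrary.

```r
library(dplyr)
library(tibble)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
Column_Names <- c('A2','B1','D1','A1','B3','B2','D3','A3','A4','D2')
Data <- as_tibble(matrix(floor(runif(100, min = 0, max = 10)),
                         nrow = 10, ncol = 10,
                         dimnames = list(NULL, Column_Names)))
Data <- Data %>% mutate(Identifier = sample(letters, size = 10))

# The mapping as it might come from a file: one string of positions per scale
Scale <- tibble(Scales  = c('A', 'B', 'D'),
                Numbers = c('1,4,8,9', '2,5,6', '3,7,10'))

# Hypothetical helper: "2,5,6" -> c(2L, 5L, 6L)
parse_positions <- function(x) as.integer(strsplit(x, ",", fixed = TRUE)[[1]])

# Named list of integer positions, one element per scale
positions <- setNames(lapply(Scale$Numbers, parse_positions), Scale$Scales)

Data %>% select(all_of(positions$B)) %>% names()

# One mean score per respondent and scale (a 10 x 3 matrix with columns A, B, D)
scores <- sapply(positions, function(idx) rowMeans(Data[idx], na.rm = TRUE))
Scored <- bind_cols(Data["Identifier"], as_tibble(scores))
```

The `select()` call above returns the names "B1", "B3" and "B2", matching the hard-coded `select(2,5,6)`. Keeping the positions in a named list means a new instrument only needs a new mapping file, not new code.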
I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected, I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have the LODs contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,-2.3,7.6,0.1,45.6,12.2,-0.1,22.2,0.6),
"TN" = c(100.3,56.2,-10.5,0.4,-0.3,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,-10.5,0.2,14.6,489.3,0.3,14.4,54.6,88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
'LOD' = c(0.6, 11, 0.3)) %>%
mutate(halfLOD=LOD/2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it out into a list of dataframes that are specific analytes:
dat %>%
pivot_longer(cols = c('TP','TN','DOC')) %>%
arrange(name) %>%
split(.$name)
I have tried to apply a function using map(), however I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process to achieve a final product dat2 that should look like this I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,0.3,7.6,0.3,45.6,12.2,0.3,22.2,0.6),
"TN" = c(100.3,56.2,5.5,5.5,5.5,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,0.15,0.15,14.6,489.3,0.3,14.4,54.6,88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from this question, however, if you run it you will see that where values exist that are not found in detect_level an NA is returned. Additionally, it does not appear to have worked for $TN or $DOC, even in cases when the $LOD value from detect_level was present.
dat %>%
mutate(across(all_of(unique(detect_level$Parameter)),
~ {i1 <- detect_level$Parameter == cur_column()
detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not comfortable at all with the purrr language here and have only adapted this code from the question linked, so I would appreciate if this is the direction an answerer chooses, that they might comment code to explain briefly what is happening "under the hood".
Thank you in advance!
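Perhaps this helps:
library(dplyr)
# across() applies the function to each column named in detect_level$Parameter;
# cur_column() returns the name of the column currently being processed, and
# match() uses it to look up that column's LOD in detect_level.
# pmax() then raises every value below the LOD up to the LOD itself.
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case (values below the LOD replaced by half the LOD):
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))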
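The pivoting approach started in the question can also be carried through with a join instead of `across()`. The following is a sketch of that alternative, not the method used in this answer; it assumes a four-row subset of the question's data (called `dat_small` here, a hypothetical name) to keep the example short.

```r
library(dplyr)
library(tidyr)

# Assumption: the first four rows of the question's data, without the date column
dat_small <- tibble(Lab_ID    = as.character(1:4),
                    Tributary = c('sawmill', 'paint', 'herring', 'water'),
                    TP  = c(1.5, 15.7, -2.3, 7.6),
                    TN  = c(100.3, 56.2, -10.5, 0.4),
                    DOC = c(56.0, 120.3, -10.5, 0.2))

detect_level <- tibble(Parameter = c('TP', 'TN', 'DOC'),
                       LOD = c(0.6, 11, 0.3)) %>%
  mutate(halfLOD = LOD / 2)

dat2_small <- dat_small %>%
  # One row per sample and analyte
  pivot_longer(cols = all_of(detect_level$Parameter),
               names_to = "Parameter", values_to = "value") %>%
  # Attach each analyte's LOD and halfLOD from the lookup table
  left_join(detect_level, by = "Parameter") %>%
  # Values strictly below the LOD become half the LOD
  mutate(value = if_else(value < LOD, halfLOD, value)) %>%
  select(-LOD, -halfLOD) %>%
  # Back to one column per analyte
  pivot_wider(names_from = Parameter, values_from = value)
```

The join makes the lookup explicit: after `left_join()` every measurement sits next to its own LOD, so the replacement is an ordinary row-wise comparison. Values exactly equal to the LOD are left unchanged, as in the expected `dat2` from the question.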
I’m having a bit of trouble using graph_from_data_frame properly - ERROR: ... the data frame should contain at least two columns when it already does.
I have a data frame; let's use a cohort of students as an example.
Each row is a student name, and there are multiple columns of metadata, most of which are irrelevant. I would like to use one specific column, “Class”, denoting which class they’re in (let's say they're in 15 classes of 30 each). I would like to make a graph such that every student is a vertex, and students with the same value in the “Class” column get an undirected edge.
What would this command look like?
Just an update to add some context: the number of nodes/edges I wished to plot was incredibly large (it's not literally a class of students), so much so that the 1-to-1 representations used in the examples would be unfeasible. Hence, I was looking for a more efficient way to encode edges.
library(tidyverse)
library(igraph)
df = tibble(
  class = c("1","1","1","2","2","2","3","3","3"),
  name = c("a","b","c","d","e","f","g","h","i")
)
names = df %>% select(name)
relations = df %>%
  mutate(name2 = df$name)
# Start from an empty table so the loop does not depend on a class called "1"
edge_list = tibble()
for (i in unique(df$class)){
  from = relations %>%
    filter(class == i) %>%
    select(name)
  to = relations %>%
    filter(class == i) %>%
    select(name2)
  # Form relationships between all students in each class
  edge_list = bind_rows(edge_list, tidyr::crossing(from, to))
}
# Prevent self-loop edges and duplicate relationships
edge_list = edge_list %>% filter(name != name2)
edge_list = edge_list[!duplicated(t(apply(edge_list, 1, sort))), ]
plot(graph_from_data_frame(edge_list, directed = FALSE, vertices = names))
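The same edge list can probably be built without generating and then discarding self-loops and mirrored pairs. A minimal sketch, assuming the same toy data and a hypothetical helper `class_pairs`: `combn()` produces each unordered pair within a class exactly once.

```r
library(igraph)

df <- data.frame(
  class = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  name  = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every unordered pair of class members, one pair per row
class_pairs <- function(members) {
  if (length(members) < 2) return(NULL)  # a class of one has no edges
  t(combn(members, 2))
}

# Assumes at least one class has two or more members
edge_mat <- do.call(rbind, lapply(split(df$name, df$class), class_pairs))
edge_df  <- data.frame(from = edge_mat[, 1], to = edge_mat[, 2])

# Passing the vertex table keeps students without classmates as isolated vertices
g <- graph_from_data_frame(edge_df, directed = FALSE,
                           vertices = df[c("name", "class")])
```

The number of edges still grows quadratically with class size, so for very large groups this only removes the filtering overhead; it does not change how many edges the graph holds.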