Storing node attributes by name in igraph for subsetting plots - r

I am working with a large social network that spans 5 years of data collection. My intention is to subset the data by month/year to analyze the change in various network metrics.
I want to store node attributes in the data frame so that they can be recalled after subsetting. For the reproducible example, I want to subset my global network by weight (e.g. keeping only edges with weight > 2).
Is there a way to store node attributes (i.e. Male/Female) into the nodes on the original (larger data frame) that can be recalled after filtering out various nodes (and ignored if the node is missing)? Meaning, if a node's order was changed in the df it will still be associated with the correct sex (i.e. node 1 is always a male regardless if other nodes have been removed or reordered).
I've found answers for creating node attributes on each individual sub-network, but I need to generate ~50 subnetworks, and labelling every network's nodes by hand, in the correct order, would be hours of extra work (e.g. V(any.given.subnetwork)$Sex <- c("male","female","male","male","female","male")).
Very small reproducible example below to illustrate my code:
library(igraph)
library(ggraph)
library(dplyr)
#Load Global Data
set.seed(43)
df <- data.frame(id=1:6,id2=c(2,3,4,5,6,1), weight= c(4,3,1,8,2,7))
#make an igraph graph from the df dataframe
df.df <- graph.data.frame(df, directed = TRUE)
#make df.df an adjacency matrix
df.mat <- as_adjacency_matrix(df.df, type = "both", names = TRUE,
sparse = FALSE, attr= "weight")
#make it an igraph object
df.mat <- graph.adjacency(df.mat, mode= "directed", weighted=TRUE, diag = FALSE)
summary(df.mat)
#Applying "sex" attribute to the nodes
V(df.mat)$Sex <- c("male","female","male","male","female","male")
#Plot
ggraph(df.mat, layout = "nicely") +
geom_edge_link(aes(alpha= weight)) + geom_node_point(aes(color = factor(Sex)))
#keep only the edges of the global "df" that have a weight greater than 2
df.mat.01 <- df %>%
filter(weight > 2)
#make an igraph graph from the df.01 dataframe
df.df.01 <- graph.data.frame(df.mat.01, directed = TRUE)
#make df.df.01 an adjacency matrix
df.mat01 <- as_adjacency_matrix(df.df.01, type = "both", names = TRUE,
sparse = FALSE, attr= "weight")
#make it an igraph object
df.mat01 <- graph.adjacency(df.mat01, mode= "directed", weighted=TRUE, diag = FALSE)
ggraph(df.mat01, layout = "nicely") +
geom_edge_link(aes(alpha= weight)) + geom_node_point(aes(color = factor(Sex)))
#Error in factor(Sex) : object 'Sex' not found

The package tidygraph makes such manipulations straightforward. You can convert your graph into a tbl_graph which is easier to manipulate, and can be plotted directly with ggraph or converted back to an igraph:
library(tidygraph)
df_mat <- as_tbl_graph(df.mat)
df_mat
#> # A tbl_graph: 6 nodes and 6 edges
#> #
#> # A directed simple graph with 1 component
#> #
#> # Node Data: 6 x 2 (active)
#> name Sex
#> <chr> <chr>
#> 1 1 male
#> 2 2 female
#> 3 3 male
#> 4 4 male
#> 5 5 female
#> 6 6 male
#> #
#> # Edge Data: 6 x 3
#> from to weight
#> <int> <int> <dbl>
#> 1 1 2 4
#> 2 2 3 3
#> 3 3 4 1
#> # ... with 3 more rows
For example, to filter out all edges with weight <= 2 and plot with ggraph, we can do:
df_mat %>%
activate(edges) %>%
filter(weight > 2) %>%
ggraph(layout = "nicely") +
geom_edge_link(aes(alpha= weight)) +
geom_node_point(aes(color = factor(Sex)))
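If you would rather stay in plain igraph, one option is to keep the node attributes in a separate lookup table keyed by node name and pass it to graph_from_data_frame() through its vertices argument each time a subnetwork is built. A minimal sketch, assuming a hypothetical node_info table and a threshold of weight > 3 (chosen here only so that one node actually drops out):

```r
library(igraph)

# Global edge list from the question
df <- data.frame(id = 1:6, id2 = c(2, 3, 4, 5, 6, 1), weight = c(4, 3, 1, 8, 2, 7))

# Hypothetical lookup table: one row per node, first column is the node name
node_info <- data.frame(
  name = as.character(1:6),
  Sex  = c("male", "female", "male", "male", "female", "male")
)

# Keep only the heavier edges (assumed threshold, for illustration)
sub_edges <- df[df$weight > 3, ]

# Keep only the lookup rows for nodes that still appear in an edge
kept_names <- as.character(c(sub_edges$id, sub_edges$id2))
sub_nodes  <- node_info[node_info$name %in% kept_names, ]

# Attributes are attached by name, not by position
g_sub <- graph_from_data_frame(sub_edges, directed = TRUE, vertices = sub_nodes)

V(g_sub)$name
V(g_sub)$Sex
```

Because the attributes are matched by name rather than by position, the same node_info table can be reused for every subnetwork, and nodes that are missing from a subnetwork are simply left out.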


Spatial point distance analysis by group in R

I have a dataset which looks like this, though much larger
# Fake data for stack
exdb <- data.frame(zone = c(1,1,1,2,2,2),
site = c("study", "collect", "collect", "study", "collect", "collect"),
x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
I need to run a point analysis between the STUDY site and the COLLECT site to see the distance in metres. The problem is that I have many different ZONES or groups that are all independent (i.e the distance from a point in zone 1 is irrelevant to a point in zone 2).
For this reason I need to do two things:
the point analysis, which computes the distance in metres between the one study site per zone and the multiple collect sites,
and then a FOREACH or a LOOP function which calculates this distance for every group in the data set.
An optimal output would look like:
exdb <- data.frame(zone = c(1,1,1,2,2,2),
site = c("study", "collect", "collect", "study", "collect", "collect"),
x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815),
dist = c(0, 10.3, 30.4, 0, 12.5, 11.2))
Where the study site in each zone is always 0, as it is the distance from this site, and the distance to each collect site is ONLY CALCULATED TO THE STUDY SITE IN EACH UNIQUE ZONE.
Thank you very much.
Kil
Simple base R version; the only extra package needed is geosphere, for distGeo.
Starting with exdb as above.
First add a new column called dist with the value "study" because the plan is to self-merge on zone and site=="study":
> exdb$dist = "study"
Self-Merge, keeping only the coordinate columns:
> MM = merge(exdb, exdb,
by.x=c("zone","site"),
by.y=c("zone","dist"))[,c("x.x","y.x","x.y","y.y")]
Use distGeo to overwrite the dist column. Keeps it neat and tidy:
> exdb$dist = distGeo(MM[,2:1],MM[,4:3])
> exdb
zone site x y dist
1 1 study 53.30773 -6.222291 0.0000
2 1 collect 53.31066 -6.217151 473.2943
3 1 collect 53.30709 -6.215080 485.8806
4 2 study 53.31383 -6.214152 0.0000
5 2 collect 53.31909 -6.218723 659.5238
6 2 collect 53.31879 -6.215815 563.1349
Returns the same answer as @wimpel's, but with fewer dependencies and in fewer lines of code.
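For a version with no package dependencies at all, the great-circle distance can be written out by hand. This is only a sketch: it swaps in the haversine formula on a sphere of radius 6371 km (so results differ slightly from distGeo, which uses an ellipsoid), assumes x is latitude and y is longitude, and assumes exactly one study site per zone. The helper haversine_m is a hypothetical name.

```r
exdb <- data.frame(
  zone = c(1, 1, 1, 2, 2, 2),
  site = c("study", "collect", "collect", "study", "collect", "collect"),
  x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
  y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815)
)

# Haversine distance in metres between (lat1, lon1) and (lat2, lon2), vectorised
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# One study row per zone; look up each row's study site by zone
study <- exdb[exdb$site == "study", c("zone", "x", "y")]
idx   <- match(exdb$zone, study$zone)

# Distance from every row to the study site of its own zone
exdb$dist <- haversine_m(study$x[idx], study$y[idx], exdb$x, exdb$y)
exdb
```

The match() call does the grouping, so no explicit loop over zones is needed, and the study rows come out as 0 because they are compared with themselves.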
Maybe something like this?
Assuming x and y are latitude and longitude, we can pivot the table so that each row holds both points of a pair (study and collect), then use the haversine function, which returns kilometres, and multiply by 1000 to get the distance in metres:
library(tidyverse)
library(pracma)
#>
#> Attaching package: 'pracma'
#> The following object is masked from 'package:purrr':
#>
#> cross
data <- data.frame(zone = c(1, 1, 1, 2, 2, 2), site = c(
"study", "collect", "collect", "study",
"collect", "collect"
), x = c(
53.307726, 53.310660, 53.307089,
53.313831, 53.319087, 53.318792
), y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
data %>%
pivot_wider(names_from = site, values_from = c(x, y)) %>%
unnest(y_collect, y_study, x_collect, x_study) %>%
mutate(
dist = list(x_study, y_study, x_collect, y_collect) %>% pmap_dbl(~haversine(c(..1, ..2), c(..3, ..4)) * 1000)
)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: unnest() has a new interface. See ?unnest for details.
#> Try `df %>% unnest(c(y_collect, y_study, x_collect, x_study))`, with `mutate()` if needed
#> # A tibble: 4 x 6
#> zone x_study x_collect y_study y_collect dist
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 53.3 53.3 -6.22 -6.22 472.
#> 2 1 53.3 53.3 -6.22 -6.22 484.
#> 3 2 53.3 53.3 -6.21 -6.22 659.
#> 4 2 53.3 53.3 -6.21 -6.22 563.
Created on 2021-09-13 by the reprex package (v2.0.1)
I'm still learning the spatial side but does this work?
library(sf)
library(tidyverse)
exdb %>%
arrange(zone, desc(site)) %>% #ensure study is first
st_as_sf(coords = c("x", "y"), crs = 4326) %>%
group_by(zone) %>%
mutate(
study_coord = geometry[1],
dist = st_distance(geometry, study_coord, by_element = TRUE)
)
I believe this should work, but I could not reproduce the distances in your desired output.
library(data.table)
library(purrr) # Or tidyverse
library(geosphere)
# Make your data a data.table (mydata stands for your own data frame, e.g. exdb)
setDT(mydata)
# Split to a list based on zone and site
L <- split(mydata, by = c("zone", "site"), flatten = FALSE)
# Loop over list
L <- lapply(L, function(zone) {
#get reference point to take distance from
point.study <- c(zone$study$y,zone$study$x)
zone$study$dist <- 0
# Calculate distance
zone$collect$dist <- unlist(purrr::pmap( list(a = zone$collect$y,
b = zone$collect$x ),
~(geosphere::distGeo( point.study, c(..1, ..2)))))
return(zone)
})
# Rowbind the results together
data.table::rbindlist(lapply(L, data.table::rbindlist))
# zone site x y dist
# 1: 1 study 53.30773 -6.222291 0.0000
# 2: 1 collect 53.31066 -6.217151 473.2943
# 3: 1 collect 53.30709 -6.215080 485.8806
# 4: 2 study 53.31383 -6.214152 0.0000
# 5: 2 collect 53.31909 -6.218723 659.5238
# 6: 2 collect 53.31879 -6.215815 563.1349

Can I separate two groups of vertices in an arc plot in ggraph/ggplot2 in R?

I have made a graph of two distinct sets of electrodes with the following code:
elec <- c("Fp1","Fp2","F7","F3","Fz","F4","F8","FC5","FC1","FC2","FC6","C3","Cz","C4","CP5","CP1","CP2","CP6","P7","P3","Pz","P4","P8","POz","Oz",
"Fp1","Fp2","F7","F3","Fz","F4","F8","FC5","FC1","FC2","FC6","C3","Cz","C4","CP5","CP1","CP2","CP6","P7","P3","Pz","P4","P8","POz","Oz")
edgelist <- get.edgelist(net)
# get vertex labels
label <- get.vertex.attribute(net, "name")
# get vertex groups
group <- get.vertex.attribute(net, "group")
# get vertex fill color
fill <- get.vertex.attribute(net, "color")
# get family
family <- get.vertex.attribute(net, "family")
# get vertex degree
degrees <- degree(net)
# data frame with groups, degree, labels and id
nodes <- data.frame(group, degrees, family, label, fill, id=1:vcount(net))
nodes$family <- factor(nodes$family, levels = unique(nodes$family))
nodes$label <- factor(nodes$label, levels = unique(nodes$label))
nodes <- as_tibble(nodes)
# prepare data for edges
edges <- as_tibble(edgelist)
net.tidy <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE, node_key = "label")
ggraph(net.tidy, layout = "linear") +
geom_edge_arc(alpha = 0.5) +
scale_edge_width(range = c(0.2, 2)) +
scale_colour_manual(values= vrtxc) +
geom_node_point(aes(size = degrees, color = family)) +
geom_node_text(aes(label = elec), angle = 90, hjust = 1, nudge_y = -0.5, size = 3) +
coord_cartesian(clip = "off") +
theme_graph()+
theme(legend.position = "top")
And I got a great graph that I like.
However, I would like to separate the two sets of electrodes a little in the middle, where Oz is, to show that they are different. In the node attributes they are differentiated by group (1, 2), and I would like to know whether this information could be used to spread the two sets of vertices along the x axis: one set of 25 electrodes to the left, the other to the right, leaving a space between them.
I attach a look at the data in case it's useful:
> nodes
# A tibble: 50 x 6
group degrees family label fill id
<fct> <dbl> <fct> <fct> <fct> <int>
1 1 5 fronp Fp1_1 #3B9AB2 1
2 1 9 fronp Fp2_1 #3B9AB2 2
3 1 6 fron F7_1 #5DAABC 3
4 1 7 fron F3_1 #5DAABC 4
5 1 11 fron Fz_1 #5DAABC 5
6 1 9 fron F4_1 #5DAABC 6
7 1 11 fron F8_1 #5DAABC 7
8 1 8 fronc FC5_1 #88BAAE 8
9 1 6 fronc FC1_1 #88BAAE 9
10 1 4 fronc FC2_1 #88BAAE 10
# … with 40 more rows
In case someone is interested, I ended up inserting some dummy rows in the nodes dataframe and it did the trick.
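The dummy-row idea can be sketched roughly as follows. This is not the original poster's code: the toy node and edge tables, the gap_ labels, and the n_gap setting are all hypothetical, and it relies on the linear layout placing nodes in the order of the node table.

```r
library(dplyr)
library(tidygraph)
library(ggraph)

# Hypothetical toy data: two groups of three electrodes each
nodes <- tibble(
  label = c("Fp1_1", "Fz_1", "Oz_1", "Fp1_2", "Fz_2", "Oz_2"),
  group = c(1, 1, 1, 2, 2, 2)
)
edges <- tibble(
  from = c("Fp1_1", "Fz_1", "Oz_1"),
  to   = c("Fz_2", "Oz_2", "Fp1_2")
)

# Dummy rows that only take up space between the two groups
n_gap <- 2
dummy <- tibble(label = paste0("gap_", seq_len(n_gap)), group = NA_real_)

nodes_gap <- bind_rows(
  filter(nodes, group == 1),
  dummy,
  filter(nodes, group == 2)
)

# Edges refer to nodes by label, so inserting rows does not break them
net_gap <- tbl_graph(nodes = nodes_gap, edges = edges,
                     directed = TRUE, node_key = "label")

# Draw only the real nodes; the dummy rows stay invisible
p <- ggraph(net_gap, layout = "linear") +
  geom_edge_arc(alpha = 0.5) +
  geom_node_point(aes(filter = !is.na(group), color = factor(group))) +
  geom_node_text(aes(filter = !is.na(group), label = label),
                 angle = 90, hjust = 1, nudge_y = -0.1, size = 3) +
  coord_cartesian(clip = "off") +
  theme_graph()
p
```

Increasing n_gap widens the space between the two sets of vertices.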

Using custom function to apply across multiple groups and subsets

I am having trouble trying to apply a custom function to multiple groups within a data frame and mutate it to the original data. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses that mean value in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, and the result is divided by the mean of the negative control minus the mean of the positive control (then multiplied by 100).
Each observation, including the + and - controls, should have a calculated percent inhibition and, as a double check, for each experiment (grouping) the mean of the pct inhib of the - controls should be around 0 and that of the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
uninhibited <- as.vector(uninhibited)
inhibited <- as.vector(inhibited)
unknown <- as.vector(unknown)
mu_u <- mean(uninhibited, na.rm = TRUE)
mu_i <- mean(inhibited, na.rm = TRUE)
percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data to one target, box, and replicate (below), but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
filter(target == "A", box == "1", replicate == 1)
uninhib <-
subset$value[subset$sample == "uninhib"]
inhib <-
subset$value[subset$sample == "inhib"]
pct <- subset %>%
mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by and do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss on how.
I have tried:
inhibition <- data %>%
group_by(target, box, replicate) %>%
mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"], .$value[.$sample == "inhib"], .$value)))
But I get an error that the columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
group_by(target, box, replicate) %>%
mutate(pct = {
x <- split(value, sample)
percent_inhibition(x$uninhib, x$inhib, value)
})
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
group_by(target, box, replicate) %>%
mutate(pct = percent_inhibition(value[sample == "uninhib"],
value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123) ; data <-
tibble(
target = sample(LETTERS[1:3], n, replace = TRUE),
box = sample(as.character(1:3), n, replace = TRUE),
replicate = sample(1:3, n, replace = TRUE),
sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
value = rnorm(n)
)
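The double check mentioned in the question (control means of about 0 and 100 within each group) is easy to see on a small hand-made data set. A minimal sketch, assuming a hypothetical toy tibble with only a target grouping column:

```r
library(dplyr)

percent_inhibition <- function(uninhibited, inhibited, unknown) {
  mu_u <- mean(uninhibited, na.rm = TRUE)
  mu_i <- mean(inhibited, na.rm = TRUE)
  (mu_u - unknown) / (mu_u - mu_i) * 100
}

# Hypothetical toy data: two groups, two controls of each type per group
toy <- tibble(
  target = rep(c("A", "B"), each = 4),
  sample = rep(c("uninhib", "uninhib", "inhib", "inhib"), times = 2),
  value  = c(10, 12, 1, 3, 20, 24, 2, 6)
)

# Inside a grouped mutate(), bare column names refer to the current group only
res <- toy %>%
  group_by(target) %>%
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"],
                                  value)) %>%
  ungroup()

# Mean percent inhibition per group and control type
check <- res %>%
  group_by(target, sample) %>%
  summarise(mean_pct = mean(pct), .groups = "drop")
check
```

In group A the control means are 11 and 2, so a value of 10 gives (11 - 10) / (11 - 2) * 100, about 11.1, and the two uninhibited controls average to 0 while the two inhibited controls average to 100. The earlier attempt with .$value failed because . is the whole data frame rather than the current group.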

can I make a separate dataframe for edgelist attributes between root/terminal vertices in igraph? (R)

here is the head of the data I am working with:
motherinst inst time dist speed
2 20080713_235233_es_0_JWC 20080714_163628_es_0_XKK 0.6971644 1.4921751 2.1403490
3 20080714_163628_es_0_XKK 20080715_160601_es_0_LAL 0.9788542 2.3070819 2.3569210
7 20080715_160601_es_0_LAL 20080716_153449_es_1_UOW 0.9783333 2.8299124 2.8925851
8 20080715_160601_es_1_CUA 20080716_153449_es_2_GOC 0.9783333 0.4322427 0.4418154
9 20080715_160601_es_2_KOE 20080716_153449_es_3_POU 0.9783333 4.1533350 4.2453168
10 20080715_160601_es_2_KOE 20080716_153449_es_4_SOA 0.9783333 2.1224896 2.1694954
What I want to do is be able to specify a root and terminal vertex (either by # or "inst") and then make a separate dataframe with every single value of "dist" running down in rows between all the vertices in that root-terminal pair. Dist is stored as an edge attribute. So, essentially, I'm trying to make a dataframe of the total distance between all of those vertices, with each of their distances stored down the rows.
i.e.
JWC-XKK
1 0.69
2 .....
3 .....
Here is one approach:
library(dplyr)
library(stringr)
foo %>%
mutate_at(vars(1:2), ~str_sub(., start = -3)) %>%
select(motherinst, inst, weight = dist) %>%
mutate(weight = weight * 1.5)-> foo_graph
foo_graph
motherinst inst weight
1 JWC XKK 2.238263
2 XKK LAL 3.460623
3 LAL UOW 4.244869
4 CUA GOC 0.648364
5 KOE POU 6.230003
6 KOE SOA 3.183734
Multiplication by 1.5 is just to make the width of the edges easier to see
library(igraph)
foo_plot <- graph.data.frame(foo_graph, directed = F)
plot(foo_plot, layout=layout_in_circle, edge.width = E(foo_plot)$weight, vertex.size = 30)
Plot:
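To get the per-edge dist values between a chosen root and terminal vertex, which is what the question asks for, one possibility is to ask igraph for the edges along the path and read the dist attribute off them. A minimal sketch, assuming shortened three-letter vertex names, that a directed path from root to terminal exists, and a hypothetical helper path_dists():

```r
library(igraph)

# Edge list shaped like the question's data, with shortened vertex names
foo <- data.frame(
  motherinst = c("JWC", "XKK", "LAL", "CUA", "KOE", "KOE"),
  inst       = c("XKK", "LAL", "UOW", "GOC", "POU", "SOA"),
  dist       = c(1.4921751, 2.3070819, 2.8299124, 0.4322427, 4.1533350, 2.1224896)
)

# Extra columns (here dist) become edge attributes
g <- graph_from_data_frame(foo, directed = TRUE)

# Hypothetical helper: one row per edge on the path from root to terminal
path_dists <- function(graph, root, terminal) {
  ep <- shortest_paths(graph, from = root, to = terminal,
                       output = "epath")$epath[[1]]
  end_names <- ends(graph, ep)
  data.frame(from = end_names[, 1], to = end_names[, 2], dist = ep$dist)
}

res <- path_dists(g, "JWC", "UOW")
res
sum(res$dist)  # total distance along the path
```

Because the graph has no weight attribute, shortest_paths() here picks the path with the fewest edges; in a tree-like lineage there is only one path anyway.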

r dendrogram - groupLabels do not match real labels (package dendextend)

Let's do a quick 3-clusters classification on the iris dataset with the FactoMineR package:
library(FactoMineR)
model <- HCPC(iris[,1:4], nb.clust = 3)
summary(model$data.clust$clust)
1 2 3
50 62 38
We see that 50 observations are in cluster 1, 62 in cluster 2 and 38 in cluster 3.
Now we want to visualize these 3 clusters in a dendrogram, using the dendextend package, which makes it easy to draw pretty ones:
library(dendextend)
library(dplyr)
model$call$t$tree %>%
as.dendrogram() %>%
color_branches(k = 3, groupLabels = unique(model$data.clust$clust)) %>%
plot()
The problem is that the labels on the dendrogram don't match the true labels of the classification. Cluster 2 should be the biggest one (62 observations according to the data), but on the dendrogram we clearly see it is the smallest one.
I tried different things but nothing has worked so far, so if you have any idea of which input to give to groupLabels = in order to match the real labels, that would be great.
Looking inside dendextend::color_branches, we can see that group labels are assigned using the command g <- dendextend::cutree(dend, k = k, h = h, order_clusters_as_data = FALSE).
This fact can be used for building a map between the cluster labels assigned by HCPC and group labels assigned by dendextend::color_branches.
library(FactoMineR)
library(dendextend)
library(dplyr)
model <- HCPC(iris[,1:4], nb.clust = 3)
clust.hcpc <- as.numeric(model$data.clust$clust)
clust.cutree <- dendextend::cutree(model$call$t$tree, k=3, order_clusters_as_data = FALSE)
idx <- order(as.numeric(names(clust.cutree)))
clust.cutree <- clust.cutree[idx]
( tbl <- table(clust.hcpc, clust.cutree) )
###########
clust.cutree
clust.hcpc 1 2 3
1 50 0 0
2 0 0 62
3 0 36 2
This table shows that cluster labels 2 and 3 are matched with group labels 3 and 2, respectively. (Surprisingly, for two sample units this rule is not true.)
The group labels that need to be passed to dendextend::color_branches can be found as follows:
( lbls <- apply(tbl,2,which.max) )
##############
1 2 3
1 3 2
Here is the dendrogram:
model$call$t$tree %>%
color_branches(k=3, groupLabels =lbls) %>%
set("labels_cex", .5) %>%
plot(horiz=T)
