Calculate alters' density in a large graph in R

I'm working with a large social network in R (560120 ties). I want to calculate the local density of nodes, as well as the density of their alters.
I achieved the former with the following code snippet, using the package igraph.
g <- graph_from_data_frame(edgelist, directed = FALSE)
egonet_list <- make_ego_graph(g)
dat <- data.frame(
id = names(V(g)),
egonet_density = lapply(egonet_list, graph.density) %>% unlist()
)
However, I run into memory troubles when I try to calculate the network of ego's alters. I try to run the following:
alter_list <- make_ego_graph(g, order = 2, mindist = 1)
It does work for smaller graphs, but with my network setup, it is eating up all of my RAM (>110GB) and crashing.
Does anyone have a suggestion how to solve this issue in a memory-friendly way?

You can calculate the local density for one node at a time without saving the alter graphs.
library(igraph)
library(purrr)
make_ego_graph(g, order = 2, nodes = 1, mindist = 1)
V(g) %>%
map_dbl(~ make_ego_graph(g, order = 2, nodes = .x, mindist = 1)[[1]] %>%
graph.density())
You can take the code within map and write a function called get_alter_density() and use lapply if you prefer.
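A minimal sketch of that suggestion, assuming igraph 1.0 or later (for edge_density and sample_gnp). The helper name get_alter_density comes from the sentence above; its body, the small random test graph g_small, and the use of vapply instead of lapply are illustrative assumptions, not the answerer's code.

```r
library(igraph)

# Hypothetical helper: density of the order-2 neighbourhood of one vertex,
# with the ego itself excluded (mindist = 1). Only one alter graph is held
# in memory at a time.
get_alter_density <- function(graph, v) {
  alter_graph <- make_ego_graph(graph, order = 2, nodes = v, mindist = 1)[[1]]
  edge_density(alter_graph)
}

# Small random graph standing in for the real network (assumption)
set.seed(1)
g_small <- sample_gnp(50, 0.1)

# One numeric value per vertex; NaN where the alter graph has fewer than two nodes
alter_density <- vapply(
  seq_len(vcount(g_small)),
  function(v) get_alter_density(g_small, v),
  numeric(1)
)
```

Because each alter graph is discarded as soon as its density is computed, peak memory should depend on the largest single neighbourhood rather than on the whole list of ego graphs.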

Related

Plotting a dendrogram in base R with only two leaves

I have some (legacy) code that plots a dendrogram from an n by n matrix of distances, using base R (4.1.1).
This works fine for n>=3, but fails for n=2.
numElements <- 2
data <- matrix(1, nrow = numElements, ncol = numElements)
data[1,2] <- 0
data <- (data + t(data))/2
d3 <- as.dist(data)
clust3 <- hclust(d3)
plot(clust3, hang = -1)
For n=2 I get this error:
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
invalid dendrogram input
I would like a dendrogram with two leaves, which will show the height between just those two leaves.
Unfortunately, graphics:::plotHclust calls external C code, so I can't modify that directly. And also unfortunately, I'm trying to run this on a pre-built container on a virtual machine, so I need a base R solution without additional packages (else I'd just use ggdendro). I can catch the case of only two samples and run a separate plotting function, that is fine.
The plot method for hclust objects can't handle a monotonic increase in the height of the dendrogram, as described in this answer. Instead, you can use as.dendrogram to convert the result to a dendrogram object and plot that:
numElements <- 2
data <- matrix(1, nrow = numElements, ncol = numElements)
data[1,2] <- 0
data <- (data + t(data))/2
d3 <- as.dist(data)
clust3 <- hclust(d3)
plot(as.dendrogram(clust3, hang = -1))
Created on 2022-08-19 with reprex v2.0.2
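Since the question mentions catching the two-sample case separately, here is a rough base R sketch of such a wrapper. The function name plot_hclust_safe and the three-leaf cut-off are assumptions for illustration, based only on the behaviour reported in the question (n >= 3 works, n = 2 fails).

```r
# Hypothetical wrapper: use the hclust plot method when it is known to work,
# and fall back to the dendrogram method for trees with fewer than three leaves.
plot_hclust_safe <- function(hc, ...) {
  n_leaves <- length(hc$order)
  if (n_leaves < 3) {
    plot(as.dendrogram(hc), ...)
  } else {
    plot(hc, hang = -1, ...)
  }
  invisible(n_leaves)
}

# Two-element distance matrix, as in the question
d2 <- as.dist(matrix(c(0, 0.5, 0.5, 0), nrow = 2))
plot_hclust_safe(hclust(d2))
```

This uses only the stats and graphics packages, so it should run in a container without extra packages installed.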

R: How to Efficiently Visualize a Large Graph Network

I simulated some graph network data (~10,000 observations) in R and tried to visualize it using the visNetwork library in R. However, the data is very cluttered and is very difficult to analyze visually (I understand that in real life, network data is meant to be analyzed using graph query language).
For the time being, is there anything I can do to improve the visualization of the graph network I created (so I can explore some of the linkages and nodes that are all piled on top of each other)?
Can libraries such as 'networkD3' and 'DiagrammeR' be used to better visualize this network?
I have attached my reproducible code below:
library(igraph)
library(dplyr)
library(visNetwork)
#create file from which to sample from
x5 <- sample(1:10000, 10000, replace=T)
#convert to data frame
x5 = as.data.frame(x5)
#create first file (take a random sample from the created file)
a = sample_n(x5, 9000)
#create second file (take a random sample from the created file)
b = sample_n(x5, 9000)
#combine
c = cbind(a,b)
#create dataframe
c = data.frame(c)
#rename column names
colnames(c) <- c("a","b")
graph <- graph.data.frame(c, directed=F)
graph <- simplify(graph)
graph
plot(graph)
library(visNetwork)
nodes <- data.frame(id = V(graph)$name, title = V(graph)$name)
nodes <- nodes[order(nodes$id, decreasing = F),]
edges <- get.data.frame(graph, what="edges")[1:2]
visNetwork(nodes, edges) %>% visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
visInteraction(navigationButtons = TRUE)
Thanks
At the request of the OP, I am applying the method used in a previous answer,
"Visualizing the result of dividing the network into communities", to this problem.
The network in the question was not created with a specified random seed.
Here, I specify the seed for reproducibility.
## reproducible version of OP's network
library(igraph)
library(dplyr)
set.seed(1234)
#create file from which to sample from
x5 <- sample(1:10000, 10000, replace=T)
#convert to data frame
x5 = as.data.frame(x5)
#create first file (take a random sample from the created file)
a = sample_n(x5, 9000)
#create second file (take a random sample from the created file)
b = sample_n(x5, 9000)
#combine
c = cbind(a,b)
#create dataframe
c = data.frame(c)
#rename column names
colnames(c) <- c("a","b")
graph <- graph.data.frame(c, directed=F)
graph <- simplify(graph)
As noted by the OP, a simple plot is a mess. The referenced previous answer
broke this into two parts:
Plot all of the small components
Plot the giant component
1. Small components
Different components get different colors to help separate them.
## Visualize the small components separately
SmallV = which(components(graph)$membership != 1)
SmallComp = induced_subgraph(graph, SmallV)
LO_SC = layout_components(SmallComp, layout=layout_with_graphopt)
plot(SmallComp, layout=LO_SC, vertex.size=9, vertex.label.cex=0.8,
vertex.color=rainbow(18, alpha=0.6)[components(graph)$membership[SmallV]])
More could be done with this, but that is fairly easy and not the substance of the question, so I will leave this as the representation of the small components.
2. Giant component
Simply plotting the giant component is still hard to read. Here are two
approaches to improving the display. Both rely on grouping the vertices.
For this answer, I will use cluster_louvain to group the nodes, but you
could try other community detection methods. cluster_louvain produces 47
communities.
## Now try for the giant component
GiantV = which(components(graph)$membership == 1)
GiantComp = induced_subgraph(graph, GiantV)
GC_CL = cluster_louvain(GiantComp)
max(GC_CL$membership)
[1] 47
Giant method 1 - grouped vertices
Create a layout that emphasizes the communities
GC_Grouped = GiantComp
E(GC_Grouped)$weight = 1
for(i in unique(membership(GC_CL))) {
GroupV = which(membership(GC_CL) == i)
GC_Grouped = add_edges(GC_Grouped, combn(GroupV, 2), attr=list(weight=6))
}
set.seed(1234)
LO = layout_with_fr(GC_Grouped)
colors <- rainbow(max(membership(GC_CL)))
par(mar=c(0,0,0,0))
plot(GC_CL, GiantComp, layout=LO,
vertex.size = 5,
vertex.color=colors[membership(GC_CL)],
vertex.label = NA, edge.width = 1)
This provides some insight, but the many edges make it a bit hard to read.
Giant method 2 - contracted communities
Plot each community as a single vertex. The size of the vertex
reflects the number of nodes in that community. The color represents
the degree of the community node.
## Contract the communities in the giant component
CL.Comm = simplify(contract(GiantComp, membership(GC_CL)))
D = unname(degree(CL.Comm))
set.seed(1234)
par(mar=c(0,0,0,0))
plot(CL.Comm, vertex.size=sqrt(sizes(GC_CL)),
vertex.label=1:max(membership(GC_CL)), vertex.label.cex = 0.8,
vertex.color=round((D-29)/4)+1)
This is much cleaner, but loses any internal structure of the communities.
Just a tip for 'real-life'. The best way to deal with large graphs is to either 1) filter the edges you are using by some measure, or 2) use some related variable as weight.
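As a rough illustration of option 1, the sketch below filters edges by a weight threshold before plotting. The random graph g_demo, the random weight attribute, and the 90% cut-off are all placeholders (assumptions), standing in for whatever measure is meaningful in a real network.

```r
library(igraph)

set.seed(42)
g_demo <- sample_gnp(200, 0.05)

# Hypothetical edge measure: a random weight standing in for a real variable
E(g_demo)$weight <- runif(ecount(g_demo))

# Keep only the strongest 10% of edges, then drop vertices left isolated
threshold <- quantile(E(g_demo)$weight, 0.9)
g_filtered <- delete_edges(g_demo, which(E(g_demo)$weight < threshold))
g_filtered <- delete_vertices(g_filtered, which(degree(g_filtered) == 0))

# Option 2 in passing: the same variable also drives the edge width
plot(g_filtered, vertex.size = 4, vertex.label = NA,
     edge.width = 1 + 2 * E(g_filtered)$weight)
```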

How to subset a network graph keeping the top n components?

I have a disconnected network with many small components.
I would like to keep only those components that are above the 75th percentile in size.
Using decompose produces a list of networks that cannot be plotted as one.
library(igraph)
set.seed(123)
g <- erdos.renyi.game(100, 0.02, directed = FALSE, loops = TRUE)
components(g)$csize
components <- which(components(g)$csize>=quantile(components(g)$csize,.75))
g_final <- igraph::decompose(g, max.comps = length(components), min.vertices = 2)
I think that what you really want to use is induced_subgraph.
I really dislike that you used the name of the function components as the name of a variable, so I have changed it here to Components.
Components <- which(components(g)$csize>=quantile(components(g)$csize,.75))
BigComp <-induced_subgraph(g,
which(components(g)$membership %in% Components))
plot(BigComp)
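The same idea can be wrapped in a small function so that components() is computed only once. This is a sketch: the name keep_large_components and its prob argument are hypothetical, and sample_gnp is used here in place of erdos.renyi.game, assuming igraph 1.0 or later.

```r
library(igraph)

# Hypothetical helper: keep the components whose size is at or above
# the given quantile of component sizes, as a single graph.
keep_large_components <- function(graph, prob = 0.75) {
  comp <- components(graph)
  big <- which(comp$csize >= quantile(comp$csize, prob))
  induced_subgraph(graph, which(comp$membership %in% big))
}

set.seed(123)
g_demo <- sample_gnp(100, 0.02)
g_big <- keep_large_components(g_demo)

# Sizes of the components that survived the filter
components(g_big)$csize
```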

R getting subtrees from dendrogram based on cutree labels

I have clustered a large dataset and found 6 clusters I am interested in analyzing more in depth.
I found the clusters using hclust with "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.
For example
library(gplots)
library(dendextend)
data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)
I used cutree to get the labels for each of the rows in my dataset.
I know I can get the data for each corresponding cluster (cluster 1 for example) with:
c1_data = data[clusters == 1,]
Is there any easy way to get the subtrees for each corresponding label as returned by dendextend::cutree? For example, say I am interested in getting the
I know I can access the branches of the dendrogram doing something like
subtree <- dnd[[1]][[2]]
but how can I get exactly the subtree corresponding to cluster 1?
I have tried
dnd[clusters == 1]
but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?
================= UPDATED answer
This can now be solved using the get_subdendrograms from dendextend.
# needed packages:
# install.packages("gplots")
# install.packages("viridis")
# install.packages("devtools")
# devtools::install_github('talgalili/dendextend') # dendextend from github
library(dendextend) # provides %>%, set(), color_branches() and get_subdendrograms()
# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
dend_list <- get_subdendrograms(dend, 5)
# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)
This can also be used within a heatmap:
# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
================= OLDER answer
I think what can be helpful for you are these two functions:
The first one just iterates through all clusters and extracts substructure. It requires:
the dendrogram object from which we want to get the subdendrograms
the clusters labels (e.g. returned by cutree)
Returns a list of subdendrograms.
extractDendrograms <- function(dendr, clusters){
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dendr, which(clusters==clust.id))
})
}
The second one performs a depth-first search to determine which subtree contains the cluster and, once a subtree matches the full cluster, returns it. Here, we assume that all elements of a cluster are in one subtree. It requires:
the dendrogram object
positions of the elements in cluster
Returns the subdendrogram corresponding to the cluster of the given elements.
getSubDendrogram<-function(dendr, my.clust){
if(all(unlist(dendr) %in% my.clust))
return(dendr)
if(any(unlist(dendr[[1]]) %in% my.clust ))
return(getSubDendrogram(dendr[[1]], my.clust))
else
return(getSubDendrogram(dendr[[2]], my.clust))
}
Using these two functions we can use the variables you have provided in the question and get the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6) )
my.sub.dendrograms <- extractDendrograms(dnd, clusters)
plotting all six elements from the list gives all subdendrograms
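A self-contained sketch of that last step, using only base R. It repeats the two functions from above so that the block runs by itself, and it assumes the correction mentioned earlier (cutree applied to hc rather than dnd). The names cluster_ids and my_subdends and the 2 x 3 panel layout are illustrative choices.

```r
# Rebuild the objects from the question
hc <- hclust(dist(iris[, 1:4]), method = "ward.D")
dnd <- as.dendrogram(hc)
clusters <- cutree(hc, k = 6)

# Depth-first search for the subtree whose leaves all belong to the cluster
getSubDendrogram <- function(dendr, my.clust) {
  if (all(unlist(dendr) %in% my.clust))
    return(dendr)
  if (any(unlist(dendr[[1]]) %in% my.clust))
    return(getSubDendrogram(dendr[[1]], my.clust))
  getSubDendrogram(dendr[[2]], my.clust)
}

# One subdendrogram per cluster label
extractDendrograms <- function(dendr, clusters) {
  lapply(unique(clusters), function(clust.id) {
    getSubDendrogram(dendr, which(clusters == clust.id))
  })
}

cluster_ids <- unique(clusters)
my_subdends <- extractDendrograms(dnd, clusters)

# Plot the six subdendrograms in a 2 x 3 grid
par(mfrow = c(2, 3))
for (i in seq_along(my_subdends)) {
  plot(my_subdends[[i]], main = paste("Cluster", cluster_ids[i]))
}
```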
EDIT
As suggested in the comment, I add a function that as an input takes a dendrogram dend and the number of subtrees k, but it still uses the previously defined, recursive function getSubDendrogram:
prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=FALSE) {
clusters <- cutree(dend, k, order_clusters_as_data)
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dend, which(clusters==clust.id))
})
}
A test case for 5 substructures:
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
subdend.list <- prune_cutree_to_dendlist(dend, 5)
#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(subdend.list, plot)
I ran a benchmark with rbenchmark against the function suggested by Tal Galili (here named prune_cutree_to_dendlist2), and the results are quite promising for the DFS approach above:
library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5),
prune_cutree_to_dendlist2(dend, 5), replications=5)
                                test replications elapsed relative user.self
1  prune_cutree_to_dendlist(dend, 5)            5    0.02        1     0.020
2 prune_cutree_to_dendlist2(dend, 5)            5   60.82     3041    60.643
I have now written the function prune_cutree_to_dendlist to do what you asked for. I should add it to dendextend at some point in the future.
In the meantime, here is an example of the code and output (the function is a bit slow. Making it faster relies on having prune be faster, which I won't get to fixing in the near future.)
# install.packages("dendextend")
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
set("labels_to_character")
dend <- dend %>% color_branches(k=5)
# plot(dend)
prune_cutree_to_dendlist <- function(dend, k) {
clusters <- cutree(dend,k, order_clusters_as_data = FALSE)
# unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
# k <- length(unique_clusters)
# for(i in unique_clusters) {
dends <- vector("list", k)
for(i in 1:k) {
leaves_to_prune <- labels(dend)[clusters != i]
dends[[i]] <- prune(dend, leaves_to_prune)
}
class(dends) <- "dendlist"
dends
}
prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
How did you get 6 clusters using hclust? You can cut the tree at any point, so you just ask cutree to give you more clusters:
clusters = cutree(hclusters, number_of_clusters)
If you have a lot of data this may not be very handy though. In these cases I manually pick the clusters that I want to study further and then run hclust only on the data in these clusters. I don't know of any functionality in hclust that allows you to do this automatically, but it's quite easy:
good_clusters = c(which(clusters==1),
which(clusters==2)) #or whichever clusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(dist(new_df)) # hclust needs a distance object, not raw data
new_clusters = cutree(new_hclusters, new_number_of_clusters)

Add elements to a previous subplot within an active base R graphics device?

Let's say I generate 9 groups of data in a list data and plot them each with a for loop. I could use *apply here too, whichever you prefer.
data = list()
layout(mat = matrix(1:9, nrow = 3))
for(i in 1:9){
data[[i]] = rnorm(n = 100, mean = i, sd = 1)
plot(data[[i]])
}
After creating all the data, I want to decide which one is best:
best_data = which.min(sapply(data, sd))
Now I want to highlight that best data on the plot to distinguish it. Is there a plotting function that lets me go back to a specified sub-plot in the active device and add an element (maybe a title)?
I know I could make a second for loop: for loop 1 generates the data, then I assess which is best, then for loop 2 creates the plots, but this seems less efficient and more verbose.
Does such a plotting function exist for base R graphics?
@rawr's answer is simple and easy. But I thought I'd point out another option that allows you to select the "best" data set before you plot, in case you want more flexibility to plot the "best" data set differently from the rest.
For example:
# Create the data
data = lapply(1:9, function(i) rnorm(n = 100, mean = i, sd = 1))
par(mar=c(4,4,1,1))
layout(mat = matrix(1:9, nrow = 3))
rng = range(data)
# Select the data set with the lowest SD (computed once, before plotting)
best = which.min(sapply(data, sd))
# Plot each data set
invisible(lapply(1:9, function(i) {
# Highlight the data set with the lowest SD by drawing its points as red crosses
plot(data[[i]], col=ifelse(best==i,"red","black"), pch=ifelse(best==i, 3, 1), ylim=rng)
}))
