How to plot only large communities/clusters in R

I have an igraph graph in g. Since the graph is huge, I only want to plot communities with more than 10 members, but I want to plot them all in one plot.
My idea to remove unwanted elements is:
g <- delete_vertices(g, V(g)[igraph::clusters(g)$csize < 10])
but for some reason this plots a lot of single nodes, which is the opposite of what I am trying to achieve. Can you tell me where I went wrong?

Your idea is great, but the problem is that
igraph::clusters(g)$csize < 10
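only returns a logical vector with one entry per cluster, flagging the clusters that have fewer than 10 members. To delete vertices you need a vector with one entry per vertex, so you have to look up which vertices belong to those clusters through the membership vector.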
Hence, we may proceed as follows.
set.seed(1)
g1 <- erdos.renyi.game(100, 1 / 70)
cls <- clusters(g1)
cls$csize
# [1] 1 1 43 2 11 1 1 1 2 1 2 5 1 1 4 4 1 1 1 1 2 1 2 1
# [25] 4 1 1 1 1 1 # Two clusters of interest
g2 <- delete_vertices(g1, V(g1)[cls$membership %in% which(cls$csize <= 10)])
plot(g2)
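The key step is translating per-cluster sizes into a per-vertex filter. A minimal base-R sketch of that lookup, using a made-up membership vector (hypothetical data, not taken from the graph above) in place of cls$membership:

```r
# Hypothetical membership vector: vertex i belongs to cluster membership[i].
membership <- c(1, 1, 1, 2, 3, 3, 1, 2, 1)

# Per-cluster sizes, analogous to cls$csize.
csize <- tabulate(membership)

# One entry per cluster: this is the kind of vector `csize < 3` produces.
small_cluster <- csize < 3

# One entry per vertex: look up each vertex's cluster size via membership.
small_vertex <- csize[membership] < 3

# The lookup agrees with the %in% / which() form used in the answer.
same <- identical(small_vertex, membership %in% which(small_cluster))
same
```

With igraph, the per-vertex form corresponds to cls$csize[cls$membership], which has one element per vertex and can therefore be used to index V(g1).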

Related

How can I automate a basic genetic distance matrix in R?

I'm trying to create an algorithm that would produce a distance matrix from a dataframe. The idea is that the dataframe will contain three or more aligned genetic sequences and the algorithm will calculate the number of differences between each sequence and convert this into a dataframe. Hence, the input data would look something like this:
taxon1 taxon2 taxon3
1 g g g
2 a c c
3 a a a
4 a t c
5 g g g
6 c t t
So far, I have the following code to calculate the difference between two sequences (taxon 1 and taxon 2):
distance1_2 <- 0
for (i in 1:length(taxon1)){
if (taxon1[i] == taxon2[i]){
distance1_2 <- distance1_2
}
else{
distance1_2 <- distance1_2 + 1
}
}
distance1_2
How can I automate this without manually repeating the same code for each individual taxon combination? The finished matrix should look something like this:
t1 t2 t3
t1 0 4 5
t2 4 0 5
t3 5 5 0

Perhaps something like the following is what you are after:
outer(df, df, Vectorize(\(x,y) sum(x != y)))
#> taxon1 taxon2 taxon3
#> taxon1 0 3 3
#> taxon2 3 0 1
#> taxon3 3 1 0
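For readers who prefer a more explicit version, the same pairwise count can be sketched with nested sapply() calls. The data frame below is retyped from the question, and the names seqs and dist_mat are assumptions made for illustration.

```r
# Aligned sequences retyped from the question, one column per taxon.
seqs <- data.frame(
  taxon1 = c("g", "a", "a", "a", "g", "c"),
  taxon2 = c("g", "c", "a", "t", "g", "t"),
  taxon3 = c("g", "c", "a", "c", "g", "t"),
  stringsAsFactors = FALSE
)

# For every pair of columns, count the positions where the bases differ.
dist_mat <- sapply(seqs, function(a) sapply(seqs, function(b) sum(a != b)))
dist_mat
```

If the result is meant to feed a clustering function, as.dist(dist_mat) converts the square matrix into a dist object.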

How to compute total within sum of square in hierarchical clustering

I have read several textbooks and online tutorials about clustering algorithms. In the k-means algorithm, when you run kmeans(), the total within-cluster sum of squares (TWSS) is included in the result. But when we run hclust() for agglomerative hierarchical clustering, we cannot find this information. So is it possible to compute the TWSS for hclust()? And is it reasonable to calculate the TWSS for hclust()?
The original data set is something like this:
1 -1.6768555093 -1.33937070 1.246858892 1.23171108 2.186761
2 -3.0832450282 1.28841533 0.286807651 1.54836547 3.494282
3 -1.4664760903 0.80289181 1.940444140 1.84226142 3.543522
4 -3.1109618863 0.32801815 -0.497680172 2.54236639 2.501975
5 -2.7603333486 0.49249130 1.041125723 1.75577604 2.868788
6 -4.3145154475 -2.01808802 1.227723818 0.09547962 2.570594
7 -1.6097707596 0.25391455 2.978627043 0.07428535 4.510882
Below is my code, where minClusters = 1 and maxClusters = 10:
hierarchy_mod <- hclust(Eucli_dis,method = "complete")
memb <- cutree(hierarchy_mod,minClusters:maxClusters)
memb_DT <- data.table(memb)
The result is a matrix, which I converted to a data.table:
1 2 3 4 5 6 7 8 9 10
1: 1 1 1 1 1 1 1 1 1 1
2: 1 1 1 1 1 1 1 1 2 2
3: 1 1 1 1 1 1 1 1 2 2
4: 1 1 1 1 1 1 1 1 1 1
5: 1 1 1 1 1 1 1 1 2 2
...
My problem now is that I don't know how to compute the TWSS in this scenario. I checked online tutorials and textbooks, but none of them calculate the TWSS for hclust().
Thank you!

TWSS is useful for comparing different kmeans results because the starting configuration is usually random, so different runs can give different results. That does not happen in hierarchical clustering, since the clustering process is deterministic. But you can easily write R commands to compute it for any clustering result. First we need a reproducible example:
set.seed(4242)
x <- matrix(rnorm(125), 25, 5)
x.dist <- dist(x)
x.clus <- hclust(x.dist, method="complete")
plot(x.clus)
x.grps <- cutree(x.clus, 3:5)
We are clustering 25 rows (cases) by 5 columns (variables). We will look at solutions involving 3 to 5 clusters. We can use the scale() function to compute the sums of squares by cluster and then sum them:
x.SS <- aggregate(x, by=list(x.grps[, 1]), function(x) sum(scale(x,
scale=FALSE)^2))
x.SS
SS <- rowSums(x.SS[, -1]) # Sum of squares for each cluster
TSS <- sum(x.SS[, -1]) # Total (within) sum of squares
You will have to run this code for x.grps[, 1], x.grps[, 2], and x.grps[, 3]. Or make it into a function and use apply() to get them all:
TSS <- function(x, g) {
sum(aggregate(x, by=list(g), function(x) sum(scale(x,
scale=FALSE)^2))[, -1])
}
TSS.all <- apply(x.grps, 2, function(g) TSS(x, g))
TSS.all
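One way to sanity-check such a function is to compare it with kmeans(), which reports the same quantity as tot.withinss. The sketch below uses a hypothetical helper name, twss, on simulated data; it is an illustration under those assumptions rather than part of the answer's own code.

```r
set.seed(4242)
x <- matrix(rnorm(125), 25, 5)

# Total within-cluster sum of squares: centre each cluster on its own
# column means, square the deviations, and add everything up.
twss <- function(x, g) {
  sum(sapply(split(seq_len(nrow(x)), g), function(idx) {
    sum(scale(x[idx, , drop = FALSE], scale = FALSE)^2)
  }))
}

# Applied to the clusters found by kmeans(), it should match tot.withinss.
km <- kmeans(x, centers = 3)
matches_kmeans <- all.equal(twss(x, km$cluster), km$tot.withinss)
matches_kmeans

# Applied to hclust() cuts, it gives one TWSS value per number of clusters.
hc <- hclust(dist(x), method = "complete")
twss_by_k <- sapply(1:5, function(k) twss(x, cutree(hc, k)))
twss_by_k
```

Because the cuts of one dendrogram are nested, the values in twss_by_k cannot increase as the number of clusters grows, which is what makes an elbow-style plot of TWSS against k possible for hclust() as well.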

How to filter out small subgraphs in R

Suppose I have a network like this with multiple subgraphs.
How can I keep only the subgraph with the largest number of vertices and remove the rest? In this case I want to keep the subgraph on the left and remove the 3-vertex one on the lower right. Thanks!

Given
set.seed(1)
g <- sample_gnp(20, 1 / 20)
plot(g)
we wish to keep the subgraph with 6 vertices. Using
(clu <- components(g))
# $membership
# [1] 1 2 3 4 5 4 5 5 6 7 8 9 10 3 5 11 5 3 12 5
# $csize
# [1] 1 1 3 2 6 1 1 1 1 1 1 1
# $no
# [1] 12
gMax <- induced_subgraph(g, V(g)[clu$membership == which.max(clu$csize)])
we then get
plot(gMax)
This assumes that there is a single largest connected subgraph. Otherwise the "first" one will be chosen.
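If ties matter, one option is to keep every component that shares the maximum size. The sketch below works on the membership vector printed above, retyped by hand so that it runs without igraph; keep_first and keep_all are illustrative names.

```r
# Membership vector copied from the components() output above.
membership <- c(1, 2, 3, 4, 5, 4, 5, 5, 6, 7, 8, 9, 10, 3, 5, 11, 5, 3, 12, 5)
csize <- tabulate(membership)

# which.max() returns only the first index that attains the maximum.
keep_first <- which(membership == which.max(csize))

# Comparing against max() keeps the vertices of every tied component.
keep_all <- which(membership %in% which(csize == max(csize)))

keep_first
keep_all
```

Either index vector can be passed to induced_subgraph(g, ...). Here both select the same six vertices because the largest component is unique.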

Calculating molecular formulas out of mass of certain elements

For a chemistry project at school I want to calculate molecular masses of all possible combinations of molecular formulas including carbon (1 atom up to 100), oxygen (1 up to 50), hydrogen (1 up to 200), nitrogen (1 up to 20) and sulfur (1 up to 10) and save the results in one vector and the corresponding molecular formula string in another vector. The masses are numeric values: 12, 16, 1, 14 and 32. The strings are "C", "O", "H", "N", "S".
I also want to delete molecular formulas that make no sense, like C1 O100 H0 N20 S10, from the string vector, along with the corresponding mass. To be more specific, I only want to keep the ones with an O/C ratio between 0 and 1, an H/C ratio between 1 and 2, an N/C ratio between 0 and 0.2 and an S/C ratio between 0 and 0.1.
Is there an easy way to do this? Is a for loop the only way, or is there a faster way (maybe arrays?), and how can I take the ratios between the elements into account?
I would be very happy for some ideas or basic code to solve this.
@Gregor: so excluding the atom ratios that don't make sense would probably be better before the whole list is created? @Barker: yes, atoms like nitrogen should go from 0 to max. I am very new to R, so when I try a loop I end up with only the last value calculated (here with a reduced number of dimensions):
z=matrix(0,1,5*20*10*2*2)
C=12
O=16
H=1
N=14
S=32
for( u in 1:length(z)) {
for(i in 1:5) {
for (j in 1:20) {
for(k in 1:10 ) {
for(l in 0:1) {
for(m in 0:1){
z[1,u] <- C*i+H*j+O*k+N*l+S*m
}
}
}
}
}
}
Does anyone know where the mistake is here?

expand.grid is a good place to start when generating combinations. For example, to create a data.frame with combinations of H and C you could do this:
mol = expand.grid(C = 1:3, H = 1:4)
mol
# C H
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
You can add the other elements to expand.grid as well, and adjust the inputs up to 1:200 or however many you want. If your computer has enough memory, you may be able to create the full data frame specified in your question, but with 100 × 50 × 200 × 20 × 10 = 200 million rows it is very big. If you can reduce the total number of combinations to around 1 million, it will be much easier on your memory.
The next step would be to delete rows that don't meet your ratio criteria. Here's one example, to make sure that the number of H is between 1 and 2 times the number of C:
mol = mol[mol$H >= mol$C & mol$H <= 2 * mol$C, ]
mol
# C H
# 1 1 1
# 4 1 2
# 5 2 2
# 8 2 3
# 9 3 3
# 11 2 4
# 12 3 4
Repeat steps like that for all your conditions.
Finally you can calculate the weights and put it in a new column:
mol$weight = with(mol, C * 12 + H * 1)
mol
# C H weight
# 1 1 1 13
# 4 1 2 14
# 5 2 2 26
# 8 2 3 27
# 9 3 3 39
# 11 2 4 28
# 12 3 4 40
You could use matrix multiplication for the weight calculation, but there's no need with a small number of possible elements. If you had 20 or more possible input elements it would make sense to do it that way.
Bonus! Formulas can be created with paste or paste0:
mol$formula = paste0("C", mol$C, " H", mol$H)
mol
# C H weight formula
# 1 1 1 13 C1 H1
# 4 1 2 14 C1 H2
# 5 2 2 26 C2 H2
# 8 2 3 27 C2 H3
# 9 3 3 39 C3 H3
# 11 2 4 28 C2 H4
# 12 3 4 40 C3 H4
Of course, most of these still won't make chemical sense - C1 H1 isn't something that would really exist, but maybe you can come up with even smarter conditions to get rid of more of the impossibilities!
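Putting the steps together for all five elements might look like the sketch below. The ranges are deliberately cut down so the grid stays small, the ratio bounds are the ones given in the question, and the helper part and the column name mass are illustrative choices.

```r
# Reduced ranges (an assumption, chosen only to keep the grid tiny).
mol <- expand.grid(C = 1:5, H = 1:10, O = 0:5, N = 0:1, S = 0:1)

# Ratio constraints from the question, written without division so that
# the comparisons stay in whole numbers.
keep <- with(mol,
  H >= C & H <= 2 * C &  # 1 <= H/C <= 2
  O <= C &               # 0 <= O/C <= 1
  5 * N <= C &           # 0 <= N/C <= 0.2
  10 * S <= C            # 0 <= S/C <= 0.1
)
mol <- mol[keep, ]

# Integer masses as given in the question.
mol$mass <- with(mol, 12 * C + 1 * H + 16 * O + 14 * N + 32 * S)

# Formula strings; elements with a count of zero are left out.
part <- function(sym, n) ifelse(n > 0, paste0(sym, n), "")
mol$formula <- with(mol, paste0(part("C", C), part("H", H), part("O", O),
                                part("N", N), part("S", S)))
head(mol)
```

With these small ranges no sulfur-containing formula survives, because S/C <= 0.1 needs at least 10 carbon atoms per sulfur atom. Filtering right after expand.grid keeps the mass and formula columns from being computed for rows that are thrown away anyway.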

Visualize strongly connected components in R

I have a weighted directed graph with three strongly connected components (SCCs).
The SCCs are obtained from the igraph::clusters function
library(igraph)
SCC<- clusters(graph, mode="strong")
SCC$membership
[1] 9 2 7 7 8 2 6 2 2 5 2 2 2 2 2 1 2 4 2 2 2 3 2 2 2 2 2 2 2 2
SCC$csize
[1] 1 21 1 1 1 1 2 1 1
SCC$no
[1] 9
I want to visualize the SCCs with circles and a colored background, as in the graph below. Is there any way to do this in R? Thanks!

Take a look at the mark.groups argument of plot.igraph. Something like the following will do the trick:
# Create some toy data
set.seed(1)
library(igraph)
graph <- erdos.renyi.game(20, 1/20, directed = TRUE) # directed, so that mode="strong" has an effect
# Do the clustering
SCC <- clusters(graph, mode="strong")
# Add colours and use the mark.groups argument
V(graph)$color <- rainbow(SCC$no)[SCC$membership]
plot(graph, mark.groups = split(1:vcount(graph), SCC$membership))
