How can I automate a basic genetic distance matrix in R?

I'm trying to create an algorithm that would produce a distance matrix from a dataframe. The idea is that the dataframe will contain three or more aligned genetic sequences and the algorithm will calculate the number of differences between each sequence and convert this into a dataframe. Hence, the input data would look something like this:
  taxon1 taxon2 taxon3
1      g      g      g
2      a      c      c
3      a      a      a
4      a      t      c
5      g      g      g
6      c      t      t
So far, I have the following code to calculate the difference between two sequences (taxon 1 and taxon 2):
distance1_2 <- 0
for (i in seq_along(taxon1)) {
  if (taxon1[i] == taxon2[i]) {
    distance1_2 <- distance1_2
  } else {
    distance1_2 <- distance1_2 + 1
  }
}
distance1_2
How can I automate this without manually repeating the same code for each individual taxon combination? The finished matrix should look something like this:
   t1 t2 t3
t1  0  4  5
t2  4  0  5
t3  5  5  0

I am not sure whether it is the following you want:
outer(df, df, Vectorize(\(x,y) sum(x != y)))
#>        taxon1 taxon2 taxon3
#> taxon1      0      3      3
#> taxon2      3      0      1
#> taxon3      3      1      0
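For comparison, the same pairwise counting can be sketched with nested sapply() calls in base R. The data frame below is retyped from the question, and dist_mat is a name chosen here for illustration. For this input the pairwise counts are 3, 3 and 1, so the target matrix in the question appears to show only the intended layout.

```r
# Example data retyped from the question (one column per taxon).
df <- data.frame(
  taxon1 = c("g", "a", "a", "a", "g", "c"),
  taxon2 = c("g", "c", "a", "t", "g", "t"),
  taxon3 = c("g", "c", "a", "c", "g", "t")
)

m <- as.matrix(df)
taxa <- colnames(m)

# For every pair of columns, count the positions where the bases differ.
dist_mat <- sapply(taxa, function(a) {
  sapply(taxa, function(b) sum(m[, a] != m[, b]))
})

dist_mat
```

Because the loop runs over column names, it should extend to any number of taxa without further changes.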

Related

How to compute total within sum of square in hierarchical clustering

I have read several textbooks and online tutorials about clustering algorithms. With the k-means algorithm, running kmeans() returns the total within sum of squares (TWSS). But when we run hclust() for agglomerative hierarchical clustering, this information is not available. Is it possible to compute the TWSS for hclust()? And is it reasonable to calculate the TWSS for hclust() at all?
The original data set is something like this:
1 -1.6768555093 -1.33937070 1.246858892 1.23171108 2.186761
2 -3.0832450282 1.28841533 0.286807651 1.54836547 3.494282
3 -1.4664760903 0.80289181 1.940444140 1.84226142 3.543522
4 -3.1109618863 0.32801815 -0.497680172 2.54236639 2.501975
5 -2.7603333486 0.49249130 1.041125723 1.75577604 2.868788
6 -4.3145154475 -2.01808802 1.227723818 0.09547962 2.570594
7 -1.6097707596 0.25391455 2.978627043 0.07428535 4.510882
Below is my code. Here, minClusters = 1 and maxClusters = 10.
hierarchy_mod <- hclust(Eucli_dis,method = "complete")
memb <- cutree(hierarchy_mod,minClusters:maxClusters)
memb_DT <- data.table(memb)
The result is a matrix, which I converted to a data.table:
1 2 3 4 5 6 7 8 9 10
1: 1 1 1 1 1 1 1 1 1 1
2: 1 1 1 1 1 1 1 1 2 2
3: 1 1 1 1 1 1 1 1 2 2
4: 1 1 1 1 1 1 1 1 1 1
5: 1 1 1 1 1 1 1 1 2 2
...
My problem now is that I don't know how to compute the TWSS in this scenario. I checked online tutorials and textbooks, but none of them calculate the TWSS for hclust()...
Thank you!
TWSS is useful for comparing different kmeans results because the starting configuration is usually random, so different runs can give different results. That does not happen in hierarchical clustering, since the clustering process is deterministic. But you can easily write R commands to compute it for any cluster result. First we need to make a reproducible example:
set.seed(4242)
x <- matrix(rnorm(125), 25, 5)
x.dist <- dist(x)
x.clus <- hclust(x.dist, method="complete")
plot(x.clus)
x.grps <- cutree(x.clus, 3:5)
We are clustering 25 rows (cases) by 5 columns (variables). We will look at solutions involving 3 to 5 clusters. We can use the scale() function to compute the sums of squares by cluster and then sum them:
x.SS <- aggregate(x, by=list(x.grps[, 1]),
                  function(x) sum(scale(x, scale=FALSE)^2))
x.SS
SS <- rowSums(x.SS[, -1]) # Sum of squares for each cluster
TSS <- sum(x.SS[, -1]) # Total (within) sum of squares
You will have to run this code for x.grps[, 1], x.grps[, 2], and x.grps[, 3]. Or make it into a function and use apply() to get them all:
TSS <- function(x, g) {
  sum(aggregate(x, by=list(g),
                function(x) sum(scale(x, scale=FALSE)^2))[, -1])
}
TSS.all <- apply(x.grps, 2, function(g) TSS(x, g))
TSS.all
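As a sanity check on the approach above, the same per-cluster centering can be applied to a kmeans() fit, where the result should agree with the tot.withinss value that kmeans() reports. This is only a sketch: twss() is a helper name introduced here for illustration and is not part of any package.

```r
# Total within-cluster sum of squares for data x and cluster labels g.
# twss() is a hypothetical helper written for this illustration.
twss <- function(x, g) {
  blocks <- split(as.data.frame(x), g)
  # Center each cluster on its own column means, then sum the squares.
  sum(sapply(blocks, function(b) sum(scale(b, scale = FALSE)^2)))
}

set.seed(4242)
x <- matrix(rnorm(125), 25, 5)

# Cross-check against kmeans(), which reports the same quantity.
km <- kmeans(x, centers = 3)
all.equal(twss(x, km$cluster), km$tot.withinss)

# The same helper can be applied to any cut of an hclust() tree.
x.clus <- hclust(dist(x), method = "complete")
twss_by_k <- sapply(3:5, function(k) twss(x, cutree(x.clus, k)))
twss_by_k
```

Splitting on the label vector rather than calling aggregate() is a design choice made here so that the helper stays close to the definition of the statistic.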

How to plot only large communities/clusters in R

I have an igraph graph in g. Since the graph is huge, I only want to plot communities with more than 10 members, but I want to plot them all in one plot.
My idea to remove unwanted elements is:
g <- delete_vertices(g, V(g)[igraph::clusters(g)$csize < 10])
but for some reason this plots a lot of single nodes, which is the opposite of what I am trying to achieve. Can you tell me where I went wrong?
Your idea is great, but the problem is that
igraph::clusters(g)$csize < 10
only returns a logical vector with one entry per cluster, flagging the clusters that contain fewer than 10 members. What you need to know instead is which vertices belong to those clusters.
Hence, we may proceed as follows.
set.seed(1)
g1 <- erdos.renyi.game(100, 1 / 70)
cls <- clusters(g1)
cls$csize
# [1] 1 1 43 2 11 1 1 1 2 1 2 5 1 1 4 4 1 1 1 1 2 1 2 1
# [25] 4 1 1 1 1 1 # Two clusters of interest
g2 <- delete_vertices(g1, V(g1)[cls$membership %in% which(cls$csize <= 10)])
plot(g2)
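The step from cluster sizes back to vertices can be seen on a toy membership vector without igraph. The vectors below are made-up values standing in for cls$membership and cls$csize, and the threshold of 2 is chosen only to keep the example small.

```r
# Hypothetical stand-ins for cls$membership and cls$csize.
membership <- c(1, 1, 1, 2, 3, 3, 4)
csize <- tabulate(membership)   # one size per cluster id

# Comparing csize gives one logical value per cluster, not per vertex.
small_clusters <- which(csize <= 2)

# %in% translates cluster ids into one logical value per vertex.
drop_vertex <- membership %in% small_clusters
kept <- which(!drop_vertex)     # vertices that would be kept
kept
```

Indexing V(g) directly with the per-cluster vector selects vertices by position rather than by cluster, which is presumably why isolated nodes remained in the original attempt.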

Function to apply a nested clusters to class "dist"

I have a data.frame called mydf.
x y
0 A
1 A
2 A
3 A
0 B
2 B
0 C
3 C
3 D
...(20,000 rows)
I am using the GMD package (elbow method) to automatically identify clusters and decide the number of clusters.
library("GMD")
dist.obj <- dist(mydf$x[mydf$y=="A"])
hclust.obj <- hclust(dist.obj)
css.obj <- css.hclust(dist.obj,hclust.obj)
elbow.obj <- elbow.batch(css.obj)
k <- elbow.obj$k
cutree.obj <- cutree(hclust.obj,k=k)
mydf$cluster <- cutree.obj
I would like to apply this script to all categories (A, B, C, D, etc.) in column y automatically, without repeating it for each category one by one.
Problem 1: I get the error "Error: cannot allocate vector of size 2.1 Gb" when running this step:
css.obj <- css.hclust(dist.obj,hclust.obj)
Problem 2: I can run the following step, but then I am stuck and can't get any further.
dist.obj <- lapply(split(mydf, mydf$y), dist)
The desired result for mydf is
x y cluster
0 A 1
1 A 1
2 A 2
3 A 3
0 B 1
2 B 2
0 C 1
3 C 2
3 D 1
Can you please help me? Any solution is well appreciated. Cheers!
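One possible direction for the per-category part is to let ave() run the clustering inside each level of y. The sketch below makes strong assumptions: it uses a fixed k = 2 in place of the GMD elbow step, cluster_group() is a hypothetical helper name, and it does not address the memory error from css.hclust().

```r
# Data retyped from the question (first nine rows only).
mydf <- data.frame(
  x = c(0, 1, 2, 3, 0, 2, 0, 3, 3),
  y = c("A", "A", "A", "A", "B", "B", "C", "C", "D")
)

# cluster_group() is a hypothetical helper; k = 2 is an assumed, fixed
# number of clusters standing in for the elbow-based choice.
cluster_group <- function(v, k = 2) {
  if (length(v) < 2) return(rep(1, length(v)))   # hclust() needs >= 2 points
  cutree(hclust(dist(v)), k = min(k, length(v)))
}

# ave() applies the helper within each level of y and keeps the row order.
mydf$cluster <- ave(mydf$x, mydf$y, FUN = cluster_group)
mydf
```

To pick k per category, the body of cluster_group() is where an elbow-style rule would go, since each call only sees the values of one category.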

Working with long data format in R

Good day,
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
df <- data.frame(d,e,f)
I have data that looks like the above. What I need to do is, for each unique element of d, find the first non-zero value in f and then find the corresponding value in e. To be specific, I want another vector g so it looks like this:
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
g <- c(7,7,7,6,6,6,7,7,7)
df <- data.frame(d,e,f,g)
Suggestions to do this easily? I thought I could use split(), but I am having trouble using which() after the split. I can use ave like this:
foo <- function(x){which(x>0)[1]}
df$t <- ave(df$f,df$d,FUN=foo)
But I am having trouble finding the value of e. Any help is appreciated.
Someone else can provide a base R solution, but here's a way to do this using plyr:
> ddply(df,.(d),transform,g = head(e[f != 0],1))
d e f g
1 1 5 0 7
2 1 6 0 7
3 1 7 1 7
4 2 5 0 6
5 2 6 1 6
6 2 7 0 6
7 3 5 0 7
8 3 6 0 7
9 3 7 1 7
Note that I took your note about the "first nonzero element" literally, even though your example data only had a single unique nonzero element in the column (by group).
Here's a way in base R (it relies on d being sorted and on each group having exactly one nonzero f, as in the example):
g <- inverse.rle(list(lengths=rle(d)$lengths, values=e[f != 0]))
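If a group may contain several nonzero values of f, a variant based on ave() over row indices can pick the first one explicitly. This is a sketch built on the example data from the question; groups without any nonzero f would receive NA.

```r
d <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
e <- c(5, 6, 7, 5, 6, 7, 5, 6, 7)
f <- c(0, 0, 1, 0, 1, 0, 0, 0, 1)
df <- data.frame(d, e, f)

# ave() passes the row indices of each group to FUN; the single value
# returned is recycled over all rows of that group.
df$g <- ave(seq_len(nrow(df)), df$d, FUN = function(idx) {
  first_hit <- which(df$f[idx] != 0)[1]   # position of first nonzero f
  df$e[idx][first_hit]                    # matching value of e
})

df
```

Working on row indices instead of on f itself is what gives FUN access to both f and e for the same group.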

Generating random number by length of blocks of data in R data frame

I am trying to simulate the measuring order n times and see how measuring order affects my study subject. To do this, I am trying to generate random integers in a new column of a dataframe. I have a big dataframe and I would like to add a column whose random numbers depend on the number of observations in each block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
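A quick way to convince yourself that the ave() call produces a valid order within every block is to check that each block holds the integers 1 to its length exactly once. The sketch below also repeats the draw with replicate(), since the question mentions simulating n times; n_sim and orders are names introduced here for illustration.

```r
df <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3))

set.seed(1)  # only to make the illustration repeatable
df$D <- ave(df$A, df$A, FUN = function(x) sample(length(x)))

# Each block should hold every integer from 1 to its length exactly once.
is_perm <- tapply(df$D, df$A, function(d) all(sort(d) == seq_along(d)))
is_perm

# Repeating the draw n_sim times gives one column per simulated order.
n_sim <- 5
orders <- replicate(n_sim, ave(df$A, df$A, FUN = function(x) sample(length(x))))
dim(orders)
```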
This is really easy with ddply from plyr.
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df) {
  df$D <- sample(nrow(df))
  df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D = 0
counts = table(df$A)
for (i in seq_along(counts)) {
  df$D[df$A == names(counts)[i]] = sample(counts[i])
}
