Selecting clusters below a certain height in a dendrogram in R, but only if the cluster is bigger than one

I'm looking to write some simple code that will select for certain clusters below a threshold height and highlight them (either with a box or by colour).
So far I have used cutree, which selects the clusters I am after, but it also selects all the clusters of size 1.
I've managed to use which to select the clusters I actually want, but this is only a very small section of my data, and I don't want to go through the rest manually to choose them. Is there a way to cut the tree but only select clusters bigger than one?
This is the code I'm using at the moment:
plot(hClust,hang = -1,cex=0.5)
abline(h= 0.0018,col = 'blue')
ct <- cutree(hClust, h = 0.0018)
clust <- rect.hclust(hClust, h=0.0018, which = c(1,2,4,8,23))

You do not provide your data so I will illustrate with the built-in mtcars data. Of course, the heights are different than yours. Same set-up as your problem:
hClust =hclust(dist(mtcars))
plot(hClust,hang = -1, cex=0.8)
abline(h= 28,col = 'blue')
Now we can call rect.hclust without visibly drawing the boxes (border=0) to get the clusters numbered as rect.hclust sees them. Then we can select the clusters with more than one point and put boxes around only those.
clust <- rect.hclust(hClust, h=28, border=0)
NumMemb = sapply(clust, length)
clust <- rect.hclust(hClust, h=28, which=which(NumMemb>1))
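If only the membership of the larger clusters is needed, rather than boxes on the plot, the same filter can be sketched with cutree and table. This is a minimal sketch on the mtcars example above; the names ct, sizes, keep and members are made up for illustration. Note that cutree numbers clusters by order of first appearance in the data, whereas rect.hclust numbers them from left to right in the plot, so these ids should not be passed to which=.

```r
hClust <- hclust(dist(mtcars))
ct <- cutree(hClust, h = 28)           # named vector: car -> cluster id
sizes <- table(ct)                     # number of members per cluster
keep <- names(sizes)[sizes > 1]        # ids of clusters with more than one member
members <- split(names(ct), ct)[keep]  # member names for those clusters only
lengths(members)                       # sizes of the clusters that were kept
```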

Related

iGraph - Spacing between vertices

I have a dataset called data. The data is not that important, but every interaction has a name. I want to create a graph in iGraph with the following code:
tab <- count(data, B, S, K)
factors <- table(interaction(tab$B, tab$K),interaction(tab$S,tab$K))
graph1 <- graph_from_incidence_matrix(factors)
plot(graph1, vertex.size = 40, layout = layout.bipartite)
However, I get the following:
All the names of interactions are completely mixed together. I can make it a little more readable by lowering the vertex.size, but I want to find a solution to my problem.
I want to create more space between the vertices, but I cannot seem to find the right way.
I have tried laying out the graph manually with tkplot, but it is annoying to have to arrange the vertices by hand each time.
Best regards

How to cut a dendrogram in r

Okay so I'm sure this has been asked before but I can't find a nice answer anywhere after many hours of searching.
I have some data, I run a classification then I make a dendrogram.
The problem has to do with aesthetics, specifically: (1) how to cut according to the number of groups (in this example I want 3), (2) how to make the group labels align with the branches of the tree, and (3) how to re-scale so that there aren't any huge gaps between the groups.
More on (3): I have a dataset which is very species-rich, and there would be ~1000 groups without cutting. If I cut at, say, 3, the tree has some branches on the right and one 'miles' off to the right, which I would want to re-scale so that it's closer. All of this is possible via external programs, but I want to do it all in R!
Bonus points if you can put an average silhouette width plot nested into the top right of this plot
Here is an example using the iris data:
library(ggplot2)
library(vegan)  # provides vegdist()
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
plot(cut(hcd_ward10, h = 10)$upper, main = "Upper tree of cut at h=10")
I suspect what you want to look at is the dendextend R package (it also has a paper in Bioinformatics).
I am not fully sure about your question on (3), since I am not sure I understand what rescaling means. What I can tell you is that you can do quite a lot with dendextend. Here is a quick example of coloring the branches and labels for 3 groups.
library(ggplot2)
library(vegan)
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
# install.packages("dendextend")  # run once if the package is not installed
library(dendextend)
dend <- hcd_ward10
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
plot(dend)
You can also get an interactive dendrogram by using plotly (ggplot method is available through dendextend):
library(plotly)
library(ggplot2)
p <- ggplot(dend)
ggplotly(p)
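For point (1) of the question, cutting into a fixed number of groups, base R alone may be enough. The sketch below assumes plain dist() in place of vegdist() (both give Euclidean distances here) and uses the made-up names hc and groups; it is an illustration, not part of the dendextend answer.

```r
df <- iris[, 1:4]                        # drop the Species column
hc <- hclust(dist(df), method = "ward.D2")
groups <- cutree(hc, k = 3)              # group id (1 to 3) for every row
table(groups)                            # group sizes
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 3, border = "red")   # box the three groups on the plot
```

The groups vector can then be joined back to the original data to inspect which rows fall into which group.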

Node labels on circular phylogenetic tree

I am trying to create a circular phylogenetic tree. I have this code:
library(ape)  # provides as.phylo() and the fan plot
fit <- hclust(dist(Data[,-4]), method = "complete", members = NULL)
nclus= 3
color=c('red','blue','green')
color_list=rep(color,nclus/length(color))
clus=cutree(fit,nclus)
plot(as.phylo(fit),type='fan',tip.color=color_list[clus],label.offset=0.2,no.margin=TRUE, cex=0.70, show.node.label = TRUE)
And this is the result:
I am also trying to show a label for each node and to color the branches. Any suggestions on how to do that?
Thanks!
When you say "color branches" I assume you mean color the edges. This seems to work, but I have to think there's a better way.
Using the built-in mtcars dataset here, since you did not provide your data.
library(ape)  # provides as.phylo() and plot.phylo()
plot.fan <- function(hc, nclus=3) {
  palette <- c('red','blue','green','orange','black')[1:nclus]
  clus <- cutree(hc,nclus)
  X <- as.phylo(hc)
  # for each cluster, index of the last edge that ends at one of its tips
  edge.clus <- sapply(1:nclus,function(i)max(which(X$edge[,2] %in% which(clus==i))))
  order <- order(edge.clus)
  # convert those indices into run lengths, then expand to one cluster id per edge
  edge.clus <- c(min(edge.clus),diff(sort(edge.clus)))
  edge.clus <- rep(order,edge.clus)
  plot(X,type='fan',
       tip.color=palette[clus],edge.color=palette[edge.clus],
       label.offset=0.2,no.margin=TRUE, cex=0.70)
}
fit <- hclust(dist(mtcars[,c("mpg","hp","wt","disp")]))
plot.fan(fit,3); plot.fan(fit,5)
Regarding "label the nodes", if you mean label the tips, it looks like you've already done that. If you want different labels, unfortunately, unlike plot.hclust(...) the labels=... argument is rejected. You could experiment with the tiplabels(....) function, but it does not seem to work very well with type="fan". The labels come from the row names of Data, so your best bet IMO is to change the row names prior to clustering.
If you actually mean label the nodes (the connection points between the edges), have a look at nodelabels(...). I don't provide a working example because I can't imagine what labels you would put there.

In R, how can I make the branches of a classification tree not overlap in a plot?

I have a tree with a lot of branches. Here is my code to plot the tree. The problem is that the labels overlap each other, especially towards the bottom of the tree. Is there any way to plot the tree so that the labels don't overlap?
par(mfrow=c(1,1))
plot(prunedTree, type=c("uniform"))
text(prunedTree)
Note: I used type=c("uniform") because it improved the readability of the lower branches. Also, prunedTree is of class "tree" from the tree package.
Here's a sample of what is being produced currently.
EDIT: Code to fully reproduce the issue.
library(tree)  # provides tree(), plot.tree() and text.tree()
load(url("https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda"))
samsungData$subject <- factor(samsungData$subject)
samsungData$activity <- factor(samsungData$activity)
samsungData <- samsungData[, !c(duplicated(names(samsungData)))]
names(samsungData) <- gsub("[.]", "", names(samsungData))
samsungData <- data.frame(samsungData)
trainDF <- samsungData[samsungData$subject %in% c(1,3,5,6),]
tree1 <- tree(activity ~ ., data=trainDF)
plot(tree1)
text(tree1)
You have several general options:
Use a wider graphics device (e.g. png(...,width = 1200,height = ...))
Shrink the text using cex = 0.5 (or smaller)
Use more concise column (i.e. variable) names
Some combination of the previous three.
I thought I could get text.tree to use fewer significant digits in labeling the splits, but I can't seem to do that. rpart appears to use only 4 digits by default, so that would save you some space as well.
In addition to joran's suggestions listed above, you can play with these parameters:
srt to rotate your text.
col to give the text different colors.
For example :
plot(tree1)
text(tree1,col=rainbow(5),srt=85,cex=0.8)

Trying to determine why my heatmap made using heatmap.2 and using breaks in R is not symmetrical

I am trying to cluster a protein-DNA interaction dataset and draw a heatmap using heatmap.2 from the R package gplots. My matrix is symmetrical.
Here is a copy of the dataset I am using after it is run through Pearson: DataSet
Here is the complete process I follow to generate these graphs: I generate a distance matrix using some correlation (in my case Pearson), then pass that matrix to R and run the following code on it:
library(RColorBrewer);
library(gplots);
library(MASS);
args <- commandArgs(TRUE);
matrix_a <- read.table(args[1], sep='\t', header=T, row.names=1);
mtscaled <- as.matrix(scale(matrix_a))
# location <- args[2];
# setwd(args[2]);
pdf("result.pdf", pointsize = 15, width = 18, height = 18)
mycol <- c("blue","white","red")
my.breaks <- c(seq(-5, -.6, length.out=6),seq(-.5999999, .1, length.out=4),seq(.100009,5, length.out=7))
#colors <- colorpanel(75,"midnightblue","mediumseagreen","yellow")
result <- heatmap.2(mtscaled, Rowv=T, scale='none', dendrogram="row", symm = T, col=bluered(16), breaks=my.breaks)
dev.off()
The issue I am having is that once I use breaks to help control the color separation, the heatmap no longer looks symmetrical.
Here is the heatmap before I use breaks, as you can see the heatmap looks symmetrical:
Here is the heatmap when breaks are used:
I have played with the cutoffs for the sequences to make sure that, for instance, one sequence does not end exactly where the other begins, but I am not able to solve this problem. I would like to use the breaks to help bring out the clusters more.
Here is an example of what it should look like, this image was made using cluster maker:
I don't expect it to look identical to that, but I would like it if my heatmap is more symmetrical and I had better definition in terms of the clusters. The image was created using the same data.
After some investigating I noticed that after running my matrix through heatmap or heatmap.2 the values were changing. For example, the interaction between Pacdh-2 and pegg-2, taken from the provided dataset, had a value of 0.0250313 before the matrix was sent to heatmap.
After that I looked at the matrix values using result$carpet, and the values for the two interactions were then -0.224333135 and -1.09805379.
So then I decided to reorder the original matrix based on the dendrogram from the clustered matrix, so that I could be sure the values would be the same. I used the following Stack Overflow question for help:
Order of rows in heatmap?
Here is the code used for that:
rowInd <- rev(order.dendrogram(result$rowDendrogram))
colInd <- rowInd
data_ordered <- matrix_a[rowInd, colInd]
I then used another program "matrix2png" to draw the heatmap:
I still have to play around with the colors but at least now the heatmap is symmetrical and clustered.
Looking into it even more, the issue seems to be that I was running scale(matrix_a). When I change my code to just mtscaled <- as.matrix(matrix_a), the result looks symmetrical.
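The likely reason is that scale() centers and scales each column independently, so entry [i, j] and entry [j, i] are transformed with different means and standard deviations. A small sketch on made-up random data (the names x, m and s are illustrative only) shows the effect:

```r
set.seed(1)
x <- matrix(rnorm(50), nrow = 10)  # 10 made-up items measured on 5 variables
m <- cor(t(x))                     # 10 x 10 symmetric Pearson correlation between items
s <- scale(m)                      # column-wise centering and scaling
max(abs(m - t(m)))                 # asymmetry of the raw matrix: effectively zero
max(abs(s - t(s)))                 # asymmetry after scale(): clearly non-zero
```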
I'm certainly not the person to attempt reproducing and testing this from that strange data object without code that would read it properly, but here's an idea:
..., col=bluered(20)[4:20], ...
Here's another thought, which should return the full range of red, which the above strategy would not:
shift.BR<- colorRamp(c("blue","white", "red"), bias=0.5 )((1:16)/16)
heatmap.2( ...., col=rgb(shift.BR, maxColorValue=255), .... )
Or you can use this vector:
> rgb(shift.BR, maxColorValue=255)
[1] "#1616FF" "#2D2DFF" "#4343FF" "#5A5AFF" "#7070FF" "#8787FF" "#9D9DFF" "#B4B4FF" "#CACAFF" "#E1E1FF" "#F7F7FF"
[12] "#FFD9D9" "#FFA3A3" "#FF6C6C" "#FF3636" "#FF0000"
There was a somewhat similar question (also today) that was asking for a blue to red solution for a set of values from -1 to 3 with white at the center. This is the code and output for that question:
test <- seq(-1,3, len=20)
shift.BR <- colorRamp(c("blue","white", "red"), bias=2)((1:20)/20)
tpal <- rgb(shift.BR, maxColorValue=255)
barplot(test,col = tpal)
(But that would seem to be the wrong direction for the bias in your situation.)
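Whatever palette is chosen, heatmap.2 needs exactly one more break than there are colors. A minimal sketch, assuming evenly spaced breaks that are symmetric around zero so that the blue/red boundary sits at 0 (the range -5 to 5 is taken from the question; n, cols and brks are illustrative names):

```r
n <- 16
cols <- colorRampPalette(c("blue", "white", "red"))(n)  # n colors, blue to red
brks <- seq(-5, 5, length.out = n + 1)                  # n + 1 evenly spaced breaks
length(brks) == length(cols) + 1                        # the condition heatmap.2 checks
# Hypothetical call, assuming matrix_a from the question is loaded:
# heatmap.2(as.matrix(matrix_a), symm = TRUE, scale = "none", col = cols, breaks = brks)
```

Uneven breaks, as in the question, also work as long as the count stays at one more than the number of colors.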
