Okay so I'm sure this has been asked before but I can't find a nice answer anywhere after many hours of searching.
I have some data, I run a classification then I make a dendrogram.
The problem has to do with aesthetics, specifically: (1) how to cut according to the number of groups (in this example I want 3), (2) how to make the group labels align with the branches of the tree, and (3) how to re-scale so that there aren't any huge gaps between the groups.
More on (3): I have a dataset which is very species rich, and there would be ~1000 groups without cutting. If I cut at, say, 3, the tree has some branches on the right and one 'miles' off to the right, which I would want to re-scale so that it's closer. All of this is possible via external programs, but I want to do it all in R!
Bonus points if you can put an average silhouette width plot nested into the top right of this plot
Here is an example using the iris data:
library(ggplot2)
library(vegan)  # provides vegdist()
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
plot(cut(hcd_ward10, h = 10)$upper, main = "Upper tree of cut at h=10")
I suspect what you would want to look at is the dendextend R package (it also has a paper in bioinformatics).
I am not fully sure about your question on (3), since I am not sure I understand what rescaling means. What I can tell you is that you can do quite a lot with dendextend. Here is a quick example for coloring the branches and labels for 3 groups.
library(ggplot2)
library(vegan)
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
# install.packages("dendextend")  # run once if the package is not installed yet
library(dendextend)
dend <- hcd_ward10
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
plot(dend)
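If you only need point (1), base R can cut to a fixed number of groups without extra packages. This is a minimal sketch, assuming the same iris data and using stats::dist() in place of vegdist() (both give Euclidean distances here). The object names hc and groups are only for illustration.

```r
# Base R only: cluster, cut into k = 3 groups, and outline the groups on the plot
df <- iris[, 1:4]                                   # drop the Species column
hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")

groups <- cutree(hc, k = 3)                         # one group id per observation
print(table(groups))                                # group sizes

plot(hc, labels = FALSE, hang = -1, main = "Ward clustering, k = 3")
rect.hclust(hc, k = 3, border = c("red", "blue", "green"))  # box around each group
```

cutree() takes either k (number of groups) or h (cut height), so you do not have to guess a height to get exactly 3 groups.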
You can also get an interactive dendrogram by using plotly (ggplot method is available through dendextend):
library(plotly)
library(ggplot2)
p <- ggplot(dend)
ggplotly(p)
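For the bonus part of the question, one way to nest an average silhouette width plot in the top right corner is base graphics with par(fig = ..., new = TRUE). This is only a sketch: it assumes the cluster package is installed, uses dist() instead of vegdist(), and the inset position and the range of k (2 to 8) are arbitrary choices of mine.

```r
library(cluster)  # provides silhouette()

df <- iris[, 1:4]
d  <- dist(df)
hc <- hclust(d, method = "ward.D2")

# average silhouette width for each candidate number of groups
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  mean(silhouette(cutree(hc, k = k), d)[, "sil_width"])
})

# main plot: the dendrogram without leaf labels
plot(as.dendrogram(hc), leaflab = "none", main = "Ward clustering of iris")

# inset: draw a second plot into the top right part of the same device
op <- par(fig = c(0.6, 0.98, 0.55, 0.98), new = TRUE,
          mar = c(3, 3, 1, 1), mgp = c(1.8, 0.6, 0))
plot(ks, avg_sil, type = "b", xlab = "k", ylab = "avg. sil. width",
     cex.axis = 0.7, cex.lab = 0.8)
par(op)  # restore the previous graphics settings
```

The fig vector is c(x1, x2, y1, y2) in fractions of the device, so you can move or resize the inset by changing those four numbers.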
I am trying to create a circular phylogenetic tree. I have this part of the code:
library(ape)  # provides as.phylo()
fit<- hclust(dist(Data[,-4]), method = "complete", members = NULL)
nclus= 3
color=c('red','blue','green')
color_list=rep(color,nclus/length(color))
clus=cutree(fit,nclus)
plot(as.phylo(fit),type='fan',tip.color=color_list[clus],label.offset=0.2,no.margin=TRUE, cex=0.70, show.node.label = TRUE)
And this is the result:
I am also trying to show a label for each node and to color the branches. Any suggestions on how to do that?
Thanks!
When you say "color branches" I assume you mean color the edges. This seems to work, but I have to think there's a better way.
Using the built-in mtcars dataset here, since you did not provide your data.
library(ape)  # provides as.phylo() and plot.phylo()
plot.fan <- function(hc, nclus=3) {
  palette <- c('red','blue','green','orange','black')[1:nclus]
  clus <- cutree(hc,nclus)
  X <- as.phylo(hc)
  # index of the last edge that ends in a tip of each cluster
  edge.clus <- sapply(1:nclus,function(i)max(which(X$edge[,2] %in% which(clus==i))))
  order <- order(edge.clus)
  # turn those end positions into run lengths, then assign each run of edges to its cluster
  edge.clus <- c(min(edge.clus),diff(sort(edge.clus)))
  edge.clus <- rep(order,edge.clus)
  plot(X,type='fan',
       tip.color=palette[clus],edge.color=palette[edge.clus],
       label.offset=0.2,no.margin=TRUE, cex=0.70)
}
fit <- hclust(dist(mtcars[,c("mpg","hp","wt","disp")]))
plot.fan(fit,3); plot.fan(fit,5)
Regarding "label the nodes", if you mean label the tips, it looks like you've already done that. If you want different labels, unfortunately, unlike plot.hclust(...) the labels=... argument is rejected. You could experiment with the tiplabels(....) function, but it does not seem to work very well with type="fan". The labels come from the row names of Data, so your best bet IMO is to change the row names prior to clustering.
If you actually mean label the nodes (the connection points between the edges), have a look at nodelabels(...). I don't provide a working example because I can't imagine what labels you would put there.
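For what it's worth, here is a rough sketch of both ideas, assuming the ape package and the same mtcars columns as above. The shortened row names and the choice to print ape's internal node numbers are my own placeholders, not something from the question.

```r
library(ape)  # as.phylo(), plot.phylo(), nodelabels()

dat <- mtcars[, c("mpg", "hp", "wt", "disp")]
# change the tip labels by renaming rows before clustering (placeholder names)
rownames(dat) <- paste0("car", seq_len(nrow(dat)))

fit <- hclust(dist(dat))
X <- as.phylo(fit)

plot(X, type = "fan", label.offset = 0.2, no.margin = TRUE, cex = 0.7)
# with no text argument, nodelabels() prints the internal node numbers
nodelabels(frame = "none", cex = 0.6, col = "grey30")
```

nodelabels() also accepts a text vector and a node vector if you want your own labels on selected nodes only.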
I have a plot of categorical variables as below:
http://i.imgur.com/d1hJP21.png
This is a very small subset of the actual data (n > 10000)
While jittering handles the overplotting, it is ugly and can lead to ambiguity. I was keen to instead place bubbles to show the number of points that are co-incident.
I can't seem to find a simple and repeatable way to do this.
Thank you in advance!
Edit:
Thanks for the feedback. Here is what I hope is a reproducible example:
First, a CSV of the data (long, but relevant in this example):
ID,g,wf,fi
1824848,14,2,4
1314001,14,2,3
670960,14,1,3
1313235,15,3,4
1172304,3,5,4
1859973,15,1,3
1826951,14,1,4
1868238,15,1,2
1911869,15,1,4
1911861,15,1,2
926829,14,1,3
1609578,3,4,4
1306895,3,5,4
1199557,15,1,4
692849,10,3,4
1923352,3,5,4
1881724,4,4,4
1384603,3,5,4
1928829,15,1,4
493503,3,5,4
902650,15,1,3
1887582,6,4,4
1887584,3,5,4
1933992,13,1,4
635372,3,3,4
1892765,15,1,2
1934773,13,2,4
1892530,14,2,4
936786,3,5,4
1897585,13,3,4
1895932,15,1,3
422785,15,1,3
1219573,8,1,4
1897817,3,2,4
1899612,14,3,4
1939157,15,1,4
1952043,14,1,3
1938048,14,1,3
1896607,15,1,2
1941385,15,1,3
1959437,3,5,4
1064010,15,1,3
1951600,13,3,4
541439,15,1,4
1938609,3,5,4
1958667,15,1,2
1943792,10,1,4
1943782,14,1,4
1893714,14,1,4
1335502,15,1,1
1950179,3,2,4
1959069,15,1,2
1958811,15,1,2
1958808,15,3,4
1959878,15,1,1
1949904,15,1,3
1961475,15,1,4
1876863,15,1,4
384705,15,1,3
1966338,15,1,4
1980290,3,4,4
1966997,15,2,4
1967107,15,1,1
1976077,15,1,2
1967579,11,1,4
1967387,4,2,4
1973408,3,3,4
1684881,3,3,3
...and the plot code:
sx <- ggplot(dx, aes(x=fi, y=wf)) +
geom_point(shape=19, alpha=1, size=1, position=position_jitter(width=0.1,height=.1))
print(sx)
I really don't know where to go from here, other than manually making a count matrix...
Thanks again (sorry, new to stackoverflow).
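One possible starting point for the count step, sketched in base R only and not tested on the full data: tabulate the coincident (fi, wf) pairs and scale a bubble by each count. It uses just the first eight rows of the CSV above, and the name counts is only for illustration.

```r
# first eight rows of the CSV from the question
dx <- read.csv(text = "ID,g,wf,fi
1824848,14,2,4
1314001,14,2,3
670960,14,1,3
1313235,15,3,4
1172304,3,5,4
1859973,15,1,3
1826951,14,1,4
1868238,15,1,2")

# count how many points share each (fi, wf) position
counts <- as.data.frame(table(fi = dx$fi, wf = dx$wf), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]        # drop combinations that never occur
counts$fi <- as.integer(counts$fi)
counts$wf <- as.integer(counts$wf)

# one bubble per position; radius ~ sqrt(count) so that area tracks the count
symbols(counts$fi, counts$wf, circles = sqrt(counts$Freq), inches = 0.25,
        xlab = "fi", ylab = "wf", bg = "grey80")
text(counts$fi, counts$wf, labels = counts$Freq, cex = 0.8)
```

In ggplot2 itself, geom_count() does the same counting for you and maps the count to point size, so it may be the shorter route if you want to stay with the plot code above.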
I have a tree with a lot of branches. Here is my code to plot the tree. The problem is that the labels overlap each other, especially towards the bottom of the tree. Is there any way to plot the tree so that the labels don't overlap?
par(mfrow=c(1,1))
plot(prunedTree, type=c("uniform"))
text(prunedTree)
Note: I used type=c("uniform") because it improved the readability of the lower branches. Also, prunedTree is of class "tree" from the tree package.
Here's a sample of what is being produced currently.
EDIT: Code to fully reproduce the issue.
library(tree)  # provides tree()
load(url("https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda"))
samsungData$subject <- factor(samsungData$subject)
samsungData$activity <- factor(samsungData$activity)
samsungData <- samsungData[, !c(duplicated(names(samsungData)))]
names(samsungData) <- gsub("[.]", "", names(samsungData))
samsungData <- data.frame(samsungData)
trainDF <- samsungData[samsungData$subject %in% c(1,3,5,6),]
tree1 <- tree(activity ~ ., data=trainDF)
plot(tree1)
text(tree1)
You have several general options:
Use a wider graphics device (e.g. png(..., width = 1200, height = ...)).
Shrink the text using cex = 0.5 (or smaller)
Use more concise column (i.e. variable) names
Some combination of the previous three.
I thought I could get text.tree to use fewer significant digits in labeling the splits, but I can't seem to do that. rpart appears to use only 4 digits by default, so that would save you some space as well.
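A small sketch of the first two options combined with the rpart idea. To keep it self-contained it makes some assumptions: it fits rpart on the built-in iris data rather than the Samsung data, and it writes to a wide pdf() device in a temporary file instead of png().

```r
library(rpart)  # recommended package that ships with R

# stand-in model on built-in data (the question uses tree() on samsungData)
fit <- rpart(Species ~ ., data = iris)

# a wide device gives the labels more horizontal room
out <- file.path(tempdir(), "tree_wide.pdf")
pdf(out, width = 16, height = 8)
plot(fit, uniform = TRUE, margin = 0.05)
text(fit, cex = 0.8, digits = 3)   # smaller text, fewer digits in split labels
dev.off()
```

text.rpart() has a digits argument, which is the knob I could not find for text.tree.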
In addition to joran's suggestions above, you can play with these parameters:
srt to rotate your text
col to give the text different colors
For example :
plot(tree1)
text(tree1,col=rainbow(5),srt=85,cex=0.8)
I would really appreciate it if anyone could show me how to color the main branches of a fan cluster plot.
Please use the following example:
library(ape)
library(cluster)
data(mtcars)
plot(as.phylo(hclust(dist(mtcars))),type="fan")
You will need to be more specific about what you mean by "color the main branches" but this may give you some ideas:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.color=c("black", "green")[1+(phyl$edge.length >40) ])
The odd numbered edges are the radial arms in a fan plot so this mildly ugly (or perhaps devilishly clever?) hack colors only the arms with length greater than 40:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.color=c("black", "black", "green")[
c(TRUE, FALSE) + 1 + (phyl$edge.length >40) ])
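To see what that index expression does, here it is on a made-up vector of six edge lengths (the numbers are hypothetical, not taken from mtcars). c(TRUE, FALSE) is recycled along the vector, so odd positions get 1 and even positions get 0 before the other terms are added.

```r
len <- c(50, 50, 10, 60, 45, 5)           # hypothetical edge lengths

# odd position: 1 + 1 + (len > 40)  -> 2 or 3
# even position: 0 + 1 + (len > 40) -> 1 or 2
idx <- c(TRUE, FALSE) + 1 + (len > 40)

cols <- c("black", "black", "green")[idx]
print(data.frame(position = seq_along(len), len, idx, cols))
```

Only index 3 maps to green, and only an odd position with length above 40 can reach 3, which is why the even-numbered edges stay black regardless of their length.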
If you want to color the main branches to indicate which class that sample belongs to, then you might find the function ColorDendrogram in the R package sparcl useful (can be downloaded from here). Here's some sample code:
library(sparcl)
# Create a fake two sample dataset
set.seed(1)
x <- matrix(rnorm(100*20),ncol=20)
y <- c(rep(1,50),rep(2,50))
x[y==1,] <- x[y==1,]+2
# Perform hierarchical clustering
hc <- hclust(dist(x),method="complete")
# Plot
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=3)
This will generate a dendrogram where the leaves are colored according to which of the two samples they came from.