How to consistently plot a tree after removing tips? - r

Imagine I have a tree (or dendrogram)
require(ape)
fulltree <- rtree(n=50, br=NULL)
...and then I remove some tips
prunetree <- drop.tip(fulltree, tip=5)
If I plot the pruned tree, R rescales it so that only those tips remaining are considered.
par(mfrow=c(1,2))
plot(fulltree, type="fan")
plot(prunetree, type="fan")
But this makes it really hard to tell what part of the tree is now missing.
Is there a simple way that I can plot the pruned tree in the same scale/arrangement/etc. as the complete tree so that none of the remaining branches appear to move around? (In this example, I would get some kind of pac-man shape rather than a full circle) I'm thinking this could be done by coloring branches white or light grey. It would be really useful if someone wanted to animate a tree that was losing tips.

The problem, as you stated, is that the data is removed from the new tree, so it is rescaled. To fix this, you might be better off plotting the full tree with a different color for the tip(s) you want to drop.
We can do this using the excellent package ggtree (amongst other methods):
set.seed(1234)
library(ggtree)
library(gridExtra)
fulltree <- rtree(n=10, br=NULL)
col <- rep(1, 2*fulltree$Nnode + 1)
col[5] <- 10
grid.arrange(ggtree(fulltree, layout = "fan") + geom_text(aes(label=label)),
ggtree(fulltree, col = col, layout = "circular") + geom_text(aes(label=label)))
The actual coloring comes from the col[5] <- 10: change the 5 to the index of the tip you want to mark as dropped, and the 10 to your desired colour.

Thanks jeremycg for the ggtree tip. I think this was more what I was looking for.
require(ape)
library(ggtree)
library(gridExtra)
library(ggplot2)
set.seed(1234)
fulltree <- rtree(n=50, br=NULL)
#These are the tips to drop
prunetips <- c("t41","t44","t42","t8")
#But get the tips to keep
keeptips <- fulltree$tip.label[!fulltree$tip.label %in% prunetips]
#Group the tips to keep
prunetree <- groupOTU(fulltree, keeptips)
#And plot
ggtree(prunetree, layout="fan", aes(color=group))+
scale_color_manual(values=c("lightgrey","black"))+
geom_tiplab()
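
For anyone who would rather stay within ape, the same greying-out idea can be sketched with plot.phylo alone. This is only a sketch under assumptions: it relies on ape's which.edge() to find the edges spanned by the kept tips, and the names kept_edges, edge_col and tip_col are made up for illustration. Because the full tree is what gets plotted, none of the remaining branches move.

```r
library(ape)

set.seed(1234)
fulltree <- rtree(n = 50, br = NULL)

# Tips to grey out (same labels as in the ggtree example above)
prunetips <- c("t41", "t44", "t42", "t8")
keeptips  <- setdiff(fulltree$tip.label, prunetips)

# Edges on the paths connecting the kept tips stay black;
# everything else (edges leading only to pruned tips) turns light grey
kept_edges <- which.edge(fulltree, keeptips)
edge_col   <- rep("lightgrey", nrow(fulltree$edge))
edge_col[kept_edges] <- "black"

# Grey out the labels of the pruned tips as well
tip_col <- ifelse(fulltree$tip.label %in% prunetips, "lightgrey", "black")

plot(fulltree, type = "fan", edge.color = edge_col, tip.color = tip_col)
```

Swapping "lightgrey" for "white" should give the pac-man look described in the question, and redrawing with a growing prunetips vector would be one way to build the frames of an animation.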

Related

How to cut a dendrogram in r

Okay so I'm sure this has been asked before but I can't find a nice answer anywhere after many hours of searching.
I have some data, I run a classification then I make a dendrogram.
The problem has to do with aesthetics, specifically: (1) how to cut according to the number of groups (in this example I want 3), (2) make the group labels aligned with the branches of the trees, (3) re-scale so that there aren't any huge gaps between the groups.
More on (3). I have a dataset which is very species rich and there would be ~1000 groups without cutting. If I cut at say 3, the tree has some branches on the right and one 'miles' off to the right which I would want to re-scale so that it's closer. All of this is possible via external programs but I want to do it all in r!
Bonus points if you can put an average silhouette width plot nested into the top right of this plot
Here is example using iris data
library(vegan)  # vegdist() comes from vegan
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
plot(cut(hcd_ward10, h = 10)$upper, main = "Upper tree of cut at h=10")
I suspect what you would want to look at is the dendextend R package (it also has a paper in bioinformatics).
I am not fully sure about your question on (3), since I am not sure I understand what rescaling means. What I can tell you is that you can do quite a lot with dendextend. Here is a quick example for coloring the branches and labels for 3 groups.
library(vegan)       # vegdist()
library(dendextend)  # color_branches(), color_labels(); run install.packages("dendextend") once if needed
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
dend <- hcd_ward10
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
plot(dend)
You can also get an interactive dendrogram by using plotly (ggplot method is available through dendextend):
library(plotly)
library(ggplot2)
p <- ggplot(dend)
ggplotly(p)
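
For point (1) of the question, getting the three groups themselves, a base-R sketch may also help. It assumes plain dist() in place of vegdist() (both give Euclidean distances here) and uses cutree() for the memberships and rect.hclust() to mark the groups on the plot; the object names hc and groups are illustrative only.

```r
# Hierarchical clustering on the four numeric iris columns
hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")

# Cut the tree into exactly k = 3 groups; returns one group id per row
groups <- cutree(hc, k = 3)
table(groups)

# Draw the dendrogram and box the three groups
plot(hc, labels = FALSE, hang = -1, main = "Ward clustering, cut into 3 groups")
rect.hclust(hc, k = 3, border = 2:4)
```

The groups vector can then be passed on to other tools, for example as the cluster argument of a silhouette calculation.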

Node labels on circular phylogenetic tree

I am trying to create circular phylogenetic tree. I have this part of code:
fit<- hclust(dist(Data[,-4]), method = "complete", members = NULL)
nclus= 3
color=c('red','blue','green')
color_list=rep(color,nclus/length(color))
clus=cutree(fit,nclus)
plot(as.phylo(fit),type='fan',tip.color=color_list[clus],label.offset=0.2,no.margin=TRUE, cex=0.70, show.node.label = TRUE)
And this is the result:
I am also trying to show a label for each node and to color the branches. Any suggestions on how to do that?
Thanks!
When you say "color branches" I assume you mean color the edges. This seems to work, but I have to think there's a better way.
Using the built-in mtcars dataset here, since you did not provide your data.
library(ape)  # as.phylo() and the plot method for "phylo" objects
plot.fan <- function(hc, nclus=3) {
palette <- c('red','blue','green','orange','black')[1:nclus]
clus <-cutree(hc,nclus)
X <- as.phylo(hc)
edge.clus <- sapply(1:nclus,function(i)max(which(X$edge[,2] %in% which(clus==i))))
order <- order(edge.clus)
edge.clus <- c(min(edge.clus),diff(sort(edge.clus)))
edge.clus <- rep(order,edge.clus)
plot(X,type='fan',
tip.color=palette[clus],edge.color=palette[edge.clus],
label.offset=0.2,no.margin=TRUE, cex=0.70)
}
fit <- hclust(dist(mtcars[,c("mpg","hp","wt","disp")]))
plot.fan(fit,3); plot.fan(fit,5)
Regarding "label the nodes", if you mean label the tips, it looks like you've already done that. If you want different labels, unfortunately, unlike plot.hclust(...) the labels=... argument is rejected. You could experiment with the tiplabels(....) function, but it does not seem to work very well with type="fan". The labels come from the row names of Data, so your best bet IMO is to change the row names prior to clustering.
If you actually mean label the nodes (the connection points between the edges), have a look at nodelabels(...). I don't provide a working example because I can't imagine what labels you would put there.
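
In case it is useful, here is a minimal sketch of nodelabels(...) on the same mtcars clustering. It assumes the default behaviour of ape's nodelabels(), which prints the internal node numbers when no text is supplied; the styling arguments are just one possible choice.

```r
library(ape)

fit <- hclust(dist(mtcars[, c("mpg", "hp", "wt", "disp")]))
phy <- as.phylo(fit)

# Draw the fan tree first, then add labels to the internal nodes
plot(phy, type = "fan", label.offset = 0.2, no.margin = TRUE, cex = 0.70)

# With no text argument, the node numbers (Ntip + 1, ...) are used as labels
nodelabels(cex = 0.6, frame = "circle", bg = "white")
```

To show custom text instead, pass a character vector with one entry per internal node as the first argument.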

How to use a bubbleplot in ggplot2/R to deal with overplotting

I have a plot of categorical variables as below:
http://i.imgur.com/d1hJP21.png
This is a very small subset of the actual data (n > 10000)
While jittering handles the overplotting, it is ugly and can lead to ambiguity. I was keen to instead place bubbles to show the number of points that are co-incident.
I can't seem to find a simple and repeatable way to do this.
Thank you in advance!
Edit:
Thanks for the feedback. Here is what I hope is a reproducible example:
First, a CSV of the data (long, but relevant in this example):
ID,g,wf,fi
1824848,14,2,4
1314001,14,2,3
670960,14,1,3
1313235,15,3,4
1172304,3,5,4
1859973,15,1,3
1826951,14,1,4
1868238,15,1,2
1911869,15,1,4
1911861,15,1,2
926829,14,1,3
1609578,3,4,4
1306895,3,5,4
1199557,15,1,4
692849,10,3,4
1923352,3,5,4
1881724,4,4,4
1384603,3,5,4
1928829,15,1,4
493503,3,5,4
902650,15,1,3
1887582,6,4,4
1887584,3,5,4
1933992,13,1,4
635372,3,3,4
1892765,15,1,2
1934773,13,2,4
1892530,14,2,4
936786,3,5,4
1897585,13,3,4
1895932,15,1,3
422785,15,1,3
1219573,8,1,4
1897817,3,2,4
1899612,14,3,4
1939157,15,1,4
1952043,14,1,3
1938048,14,1,3
1896607,15,1,2
1941385,15,1,3
1959437,3,5,4
1064010,15,1,3
1951600,13,3,4
541439,15,1,4
1938609,3,5,4
1958667,15,1,2
1943792,10,1,4
1943782,14,1,4
1893714,14,1,4
1335502,15,1,1
1950179,3,2,4
1959069,15,1,2
1958811,15,1,2
1958808,15,3,4
1959878,15,1,1
1949904,15,1,3
1961475,15,1,4
1876863,15,1,4
384705,15,1,3
1966338,15,1,4
1980290,3,4,4
1966997,15,2,4
1967107,15,1,1
1976077,15,1,2
1967579,11,1,4
1967387,4,2,4
1973408,3,3,4
1684881,3,3,3
...and the plot code:
library(ggplot2)
# dx is the data frame read from the CSV above (columns ID, g, wf, fi)
sx <- ggplot(dx, aes(x=fi, y=wf)) +
geom_point(shape=19, alpha=1, size=1, position=position_jitter(width=0.1,height=.1))
print(sx)
I really don't know where to go from here, other than manually making a count matrix...
Thanks again (sorry, new to stackoverflow).
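
A possible starting point, offered as a sketch rather than a definitive answer: the counting does not have to be done by hand. The example below assumes a small dx built from the first six rows of the CSV above, counts coincident (fi, wf) pairs with aggregate(), and maps the count to point size. ggplot2's geom_count() does the same counting internally; the names counts, bubbles and bubbles2 are illustrative only.

```r
library(ggplot2)

# First six rows of the CSV above (wf and fi columns only)
dx <- data.frame(fi = c(4, 3, 3, 4, 4, 3),
                 wf = c(2, 2, 1, 3, 5, 1))

# One row per distinct (fi, wf) pair, with n = number of coincident points
counts <- aggregate(n ~ fi + wf, data = transform(dx, n = 1L), FUN = sum)

# Bubble plot: point area proportional to the count
bubbles <- ggplot(counts, aes(x = fi, y = wf, size = n)) +
  geom_point(shape = 19) +
  scale_size_area(max_size = 10)

# Shortcut that counts for you, straight from the raw data
bubbles2 <- ggplot(dx, aes(x = fi, y = wf)) +
  geom_count()

print(bubbles)
```

With the full data set, reading the CSV into dx and running the same two steps should scale without changes, since the number of distinct (fi, wf) pairs stays small.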

In R, how can I make the branches of a classification tree not overlap in a plot?

I have a tree with a lot of branches. Here is my code to plot the tree. The problem is that the labels overlap each other, especially towards the bottom of the tree. Is there any way to plot the tree so that the labels don't overlap?
par(mfrow=c(1,1))
plot(prunedTree, type=c("uniform"))
text(prunedTree)
Note: I used type=c("uniform") because it helped the readability of the lower branches. Also, prunedTree is of class "tree" from the tree package.
Here's a sample of what is being produced currently.
EDIT: Code to fully reproduce the issue.
library(tree)  # tree() and its plot/text methods
load(url("https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda"))
samsungData$subject <- factor(samsungData$subject)
samsungData$activity <- factor(samsungData$activity)
samsungData <- samsungData[, !c(duplicated(names(samsungData)))]
names(samsungData) <- gsub("[.]", "", names(samsungData))
samsungData <- data.frame(samsungData)
trainDF <- samsungData[samsungData$subject %in% c(1,3,5,6),]
tree1 <- tree(activity ~ ., data=trainDF)
plot(tree1)
text(tree1)
You have several general options:
Use a wider graphics device (e.g. png(...,width = 1200,height = ...))
Shrink the text using cex = 0.5 (or smaller)
Use more concise column (i.e. variable) names
Some combination of the previous three.
I thought I could get text.tree to use fewer significant digits in labeling the splits, but I can't seem to do that. rpart appears to use only 4 digits by default, so that would save you some space as well.
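
The first two options can be sketched as follows. This assumes a small tree fitted on the built-in iris data as a stand-in for the Samsung data, and a temporary PNG file as the output; the file name, device size and cex value are arbitrary choices to adjust.

```r
library(tree)

# Stand-in tree; replace with tree1 or prunedTree from the question
fit <- tree(Species ~ ., data = iris)

# A wide device leaves more horizontal room for the split labels
out <- tempfile(fileext = ".png")
png(out, width = 1200, height = 800)
plot(fit, type = "uniform")
text(fit, cex = 0.5)  # smaller text reduces overlap further
dev.off()
```

Shorter variable names would then reduce the label width on top of this.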
In addition to joran's suggestions above, you can play with these parameters:
srt to rotate your text.
col to give the text different colors.
For example :
plot(tree1)
text(tree1,col=rainbow(5),srt=85,cex=0.8)

How to color branches in cluster dendrogram?

I would really appreciate it if someone could show me how to color the main branches of a fan cluster plot.
Please use the following example:
library(ape)
library(cluster)
data(mtcars)
plot(as.phylo(hclust(dist(mtcars))),type="fan")
You will need to be more specific about what you mean by "color the main branches" but this may give you some ideas:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.color=c("black", "green")[1+(phyl$edge.length >40) ])
The odd numbered edges are the radial arms in a fan plot so this mildly ugly (or perhaps devilishly clever?) hack colors only the arms with length greater than 40:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.color=c("black", "black", "green")[
c(TRUE, FALSE) + 1 + (phyl$edge.length >40) ])
If you want to color the main branches to indicate which class that sample belongs to, then you might find the function ColorDendrogram in the R package sparcl useful. Here's some sample code:
library(sparcl)
# Create a fake two sample dataset
set.seed(1)
x <- matrix(rnorm(100*20),ncol=20)
y <- c(rep(1,50),rep(2,50))
x[y==1,] <- x[y==1,]+2
# Perform hierarchical clustering
hc <- hclust(dist(x),method="complete")
# Plot
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=3)
This will generate a dendrogram where the leaves are colored according to which of the two samples they came from.
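
If "main branches" means the clades you get by cutting the tree into k clusters, a minimal ape-only sketch for the fan plot could look like this. It assumes k = 3 and an arbitrary three-color palette, and uses ape's which.edge() to collect the edges inside each cluster; edges above the clusters keep a neutral grey. The names pal and edge_col are illustrative only.

```r
library(ape)

hc   <- hclust(dist(mtcars))
phy  <- as.phylo(hc)
clus <- cutree(hc, k = 3)            # named vector: car name -> cluster id
pal  <- c("red", "blue", "darkgreen")

# Start with a neutral color, then paint the edges within each cluster
edge_col <- rep("grey40", nrow(phy$edge))
for (i in seq_along(pal)) {
  tips <- names(clus)[clus == i]
  edge_col[which.edge(phy, tips)] <- pal[i]
}

plot(phy, type = "fan",
     edge.color = edge_col,
     tip.color  = pal[clus[phy$tip.label]],  # match tips by name
     cex = 0.7)
```

Changing k and extending pal to the same length should carry over to finer cuts.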

Resources