Add a dendrogram to a plotly::subplot figure - r

Since this post doesn't seem to have received a response, I tried generating the figure myself using R's plotly.
What I'm trying to do is plot several homologous genomic DNA segments, which are essentially horizontally laid out boxes that represent genes, and to their left a phylogenetic tree that represents the evolutionary relationships between the species of the respective genomes.
The genes belong to several groups, not all of which are represented in every genome.
Here is the list of data.frames that represent the genomic DNA segments:
dna.segs.list <- list(data.frame(name=c(paste0("B.",1:3),paste0("C.",1:3)),y=0.2,width=0.75,group=c(rep("B",3),rep("C",3)),stringsAsFactors=F),
data.frame(name=c(paste0("A.",1:2),paste0("C.",1:3)),y=0.2,width=0.75,group=c(rep("A",2),rep("C",3)),stringsAsFactors=F),
data.frame(name=c(paste0("A.",1:2),"B.1"),y=0.2,width=0.75,group=c(rep("A",2),"B"),stringsAsFactors=F),
data.frame(name=c(paste0("B.",1:3),paste0("C.",1:3)),y=0.2,width=0.75,group=c(rep("B",3),rep("C",3)),stringsAsFactors=F),
data.frame(name=paste0("A.",1:3),y=0.2,width=0.75,group=rep("A",3),stringsAsFactors=F))
Here's how I create a single plot of all of them:
library(magrittr) # provides the %>% pipe used below
x.range <- c(-1,9)
dna.segs.plot.list <- lapply(1:length(dna.segs.list),function(s){
dna.seg.df <- dna.segs.list[[s]]
dna.seg.df$group <- factor(dna.seg.df$group,levels=c("A","B","C"))
dna.seg.plot <- plotly::plot_ly(dna.seg.df,showlegend=s==1) %>%
plotly::add_bars(x=~name,y=~y,width=~width,color=~group,colors=c("red","blue","green")) %>%
plotly::layout(legend=list(x=1,y=0)) %>%
plotly::layout(xaxis=list(title=NA,zeroline=F,tickangle=45,range=x.range),yaxis=list(title=NA,zeroline=F,showgrid=F,range=c(0,1),showticklabels=F))
return(dna.seg.plot)
})
dna.segs.plot <- plotly::subplot(dna.segs.plot.list,shareX = F,nrows = length(dna.segs.plot.list))
Which gives:
The first problem is the legend: I need it to be drawn only once (otherwise it repeats for each genome), yet it should still include all gene groups.
Then I create the phylogenetic tree and convert it to a ggplot object so that I can add it to the dna.segs.plot:
library(ggplot2) # ggplot(), annotate(), theme() etc. are used unqualified below
tree.obj <- ape::read.tree(text="(((species1:0.08,species2:0.075):0.028,(species3:0.06,species4:0.06):0.05):0.0055,species5:0.1);")
tree.dend <- dendextend::as.ggdend(phylogram::as.dendrogram.phylo(tree.obj))
leaf.heights <- dplyr::filter(tree.dend$nodes,!is.na(leaf))$height
leaf.xs <- dplyr::filter(tree.dend$nodes,!is.na(leaf))$x
leaf.seqments.idx <- which(tree.dend$segments$yend %in% leaf.heights & tree.dend$segments$x %in% leaf.xs)
tree.dend$segments$yend[leaf.seqments.idx] <- max(tree.dend$segments$yend[leaf.seqments.idx])
tree.dend$segments$col[leaf.seqments.idx] <- "black"
tree.dend$labels$y <- max(tree.dend$segments$yend[leaf.seqments.idx])
tree.dend$labels$x <- tree.dend$segments$x[leaf.seqments.idx]
tree.dend$labels$col <- "black"
tree.dend$segments$lwd <- 0.5
tree.ggdend <- ggplot(tree.dend,labels=F,horiz=T)+guides(fill=F)+coord_flip()+annotate("text",size=4.5,hjust=0,x=tree.dend$label$x,y=tree.dend$label$y,label=tree.dend$label$label)+labs(x="",y="")+theme_minimal()+
theme(axis.text=element_blank(),axis.ticks=element_blank(),panel.grid=element_blank(),legend.position="none",legend.text=element_blank(),legend.background=element_blank(),legend.key=element_blank())
And finally, to combine the two I use:
dna.segs.tree.plot <- plotly::subplot(tree.ggdend,plotly::plotly_empty(),dna.segs.plot %>% plotly::layout(showlegend=T),nrows=1,margin=c(0,0,0,0),widths=c(0.39,0.02,0.59))
Which gives me:
Which is close to what I want but the issues I need help with are:
Having the tips of the tree and the DNA segments aligned
How to keep the tree labels from being run over by the branches, as they are now
How to remove the ---(black,solid,1) (NA,1) entries from the legend (I'm assuming they get added due to the tree)
Taking care of the legend issue I described above - getting it to show all groups.
Thanks
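For the legend issue described above, one possible approach is to add one trace per gene group and use plotly's legendgroup and showlegend trace attributes, so that each group gets a single legend entry no matter how many subplots it appears in. This is only a sketch on toy data: group.cols, segs and seen are illustrative names, not objects from the question, and the tree alignment is not addressed here.

```r
library(plotly)

# Assumed fixed colour per group (illustrative, mirrors the colours in the question)
group.cols <- c(A = "red", B = "blue", C = "green")

# Toy stand-in for dna.segs.list: no single segment contains every group
segs <- list(
  data.frame(name = c("B.1", "C.1"), y = 0.2, group = c("B", "C")),
  data.frame(name = c("A.1", "C.1"), y = 0.2, group = c("A", "C"))
)

seen <- character(0)  # groups that already have a legend entry
plots <- vector("list", length(segs))
for (s in seq_along(segs)) {
  p <- plot_ly()
  for (g in unique(segs[[s]]$group)) {
    p <- add_bars(p, data = segs[[s]][segs[[s]]$group == g, ],
                  x = ~name, y = ~y, name = g,
                  legendgroup = g,              # ties traces of one group together
                  showlegend = !(g %in% seen),  # legend entry only on first use
                  marker = list(color = group.cols[[g]]))
    seen <- union(seen, g)
  }
  plots[[s]] <- p
}
fig <- subplot(plots, nrows = length(plots), shareX = FALSE)
```

Because the legend entry is decided per group rather than per subplot, a group that is absent from the first genome still shows up once it is first encountered.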

Related

Set common y axis limits from a list of ggplots

I am running a function that returns a custom ggplot from an input data (it is in fact a plot with several layers on it). I run the function over several different input data and obtain a list of ggplots.
I want to create a grid with these plots to compare them but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y axes limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess it's through the use of ggplot_build. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build and maybe this one in particular is not a standard one as it comes from a "custom" ggplot.
For reference, these plots are created with the gseaplot2 function from the enrichplot package.
I don't know how to "upload" an R object, but if that would help, let me know how to do it.
Thanks!
edit after comments (thanks for your suggestions!)
Here is an example of a gseaplot2 plot. GSEA stands for Gene Set Enrichment Analysis; it is a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it, with another bar plot at the bottom.
and here is the grid I create to compare the plots generated from different data:
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function and input all of the datasets and then create the grid by facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
library(enrichplot) # provides gseaplot2()
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
gseaplot2(ewp2, geneSetID=1, subplots=1:2)
And this is how I generate the plot list (probably there is a much more elegant way):
plot_list = list()
for(i in 1:3) {
fig_i = gseaplot2(ewp2,
geneSetID=i,
subplots=1:2)
plot_list[[i]] = fig_i
}
ggpubr::ggarrange(plotlist=plot_list) # ggarrange() comes from the ggpubr package
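The idea of extracting the y limits from each plot and applying the overall range to all of them can be sketched as follows for ordinary ggplot objects with a continuous y scale. The plots here are stand-ins built from made-up data. The object returned by gseaplot2 with several subplots may not be a single ggplot, in which case the same steps would have to be applied to the relevant panel; that part is an assumption and is not covered by this sketch.

```r
library(ggplot2)

# Stand-in plots with different y ranges (illustrative data only)
plot_list <- lapply(c(1, 5, 10), function(k) {
  d <- data.frame(x = 1:10, y = (1:10) * k)
  ggplot(d, aes(x, y)) + geom_line()
})

# Trained y range of each plot, then the range across all plots
y_ranges <- lapply(plot_list, function(p) layer_scales(p)$y$range$range)
common_ylim <- range(unlist(y_ranges))

# coord_cartesian() zooms without dropping data outside the limits
plot_list <- lapply(plot_list, function(p) p + coord_cartesian(ylim = common_ylim))
```

Using coord_cartesian() rather than ylim() is a deliberate choice here: ylim() removes out-of-range observations before statistics are computed, which can change layers such as smooths.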

r coding for customising vegan plot

I am attempting to produce an NMDS plot in vegan, but really struggling with the code. I am trying to display the site points and species points differently, with the site points coloured according to treatment. Both lines work individually, but I cannot work out how to combine these two lines of code into one line to form one graph. I am using ordipointlabel to prevent overlap. These are the two lines of code I want to combine into one.
ordipointlabel(NMDS10, scaling=2, display="species", select=sel)
ordipointlabel(NMDS10,display="sites", col=c(rep("darkgreen",4),rep("blue4",4)),cex=0.75)
You can modify the object returned by ordipointlabel directly and make it look the way you wish. Please see the example:
library(vegan)
data(dune)
NMDS10 <- metaMDS(dune[1:8, ])
pdf(file = NULL)
y <- ordipointlabel(NMDS10, display=c("sites", "species"))
dev.off()
# select sites & species
sel <- unlist(dimnames(dune[1:8, ]))[-(20:ncol(dune))]
# messing with ordipointlabel object
y$points <- y$points[rownames(y$points) %in% sel, ]
y$args$pcol[] = rep("red", length(y$args$pcol))
y$args$pcol[1:8] <- c(rep("darkgreen", 4), rep("blue4", 4))
y$par$cex <- 0.75
plot(y)

How to cut a dendrogram in r

Okay so I'm sure this has been asked before but I can't find a nice answer anywhere after many hours of searching.
I have some data, I run a classification then I make a dendrogram.
The problem has to do with aesthetics, specifically: (1) how to cut according to the number of groups (in this example I want 3), (2) how to make the group labels align with the branches of the tree, and (3) how to re-scale so that there aren't any huge gaps between the groups.
More on (3). I have a dataset which is very species rich, and there would be ~1000 groups without cutting. If I cut at, say, 3, the tree has some branches on the right and one 'miles' off to the right, which I would want to re-scale so that it's closer. All of this is possible via external programs but I want to do it all in r!
Bonus points if you can put an average silhouette width plot nested into the top right of this plot
Here is example using iris data
library(ggplot2)
library(vegan) # provides vegdist()
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
plot(cut(hcd_ward10, h = 10)$upper, main = "Upper tree of cut at h=10")
I suspect what you would want to look at is the dendextend R package (it also has a paper in bioinformatics).
I am not fully sure about your question on (3), since I am not sure I understand what rescaling means. What I can tell you is that you can do quite a lot with dendextend. Here is a quick example for coloring the branches and labels for 3 groups.
library(ggplot2)
library(vegan)
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
# install.packages("dendextend") # run once if the package is not installed
library(dendextend)
dend <- hcd_ward10
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
plot(dend)
You can also get an interactive dendrogram by using plotly (ggplot method is available through dendextend):
library(plotly)
library(ggplot2)
p <- ggplot(dend)
ggplotly(p)
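For point (1) of the question, cutting by number of groups rather than by height, base R's cutree() and rect.hclust() may already be enough. The sketch below uses dist() instead of vegdist() so that it runs without extra packages; that swap, and the variable names, are choices made for this example only.

```r
# Base R only; iris stands in for the real data, as in the question
df <- iris[, 1:4]
hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")

k <- 3
groups_k <- cutree(hc, k = k)  # membership for exactly k groups

# A height that yields the same k groups: midway between the
# (k-1)-th and k-th largest merge heights
top_heights <- sort(hc$height, decreasing = TRUE)
h_cut <- mean(top_heights[(k - 1):k])
groups_h <- cutree(hc, h = h_cut)

# Draw the tree and box the k groups
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = k, border = 2:4)
table(groups_k)
```

The computed h_cut can also be passed to cut() on the dendrogram if the upper or lower parts of the tree are needed separately.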

How to color connecting lines when comparing two dendrograms in R dendextend

I am using dendextend's tanglegram to compare two dendrograms. Almost everything is working, including coloring the node labels to correspond to the clusters. What is not working is that I want all the connecting lines (from a node label in one dendrogram to the node label in the other dendrogram) to be black. Currently they are randomly colored, which makes it difficult to interpret.
Here is the R script. What is curious to me is that if I run it in RGui, it is correct (all connecting lines black), but the png has connecting lines of various colors.
suppressPackageStartupMessages(library(dendextend))
library(dendextend)
# for StackOverflow question, data is here rather than loaded from csv file
hab1 <- data.frame(matrix(c(100,90.6,88.9,89.2,91.2,98.2,91,55.9,91.5,97.2,90.6,100,93.9,85.3,98.3,90.1,96.2,53,88.7,91.6,88.9,93.9,100,82.9,94.4,88,93.4,51.9,87.1,90.4,89.2,85.3,82.9,100,86.6,89.8,85.2,60.7,95.8,91,91.2,98.3,94.4,86.6,100,90.6,96.4,53.4,89.2,92.2,98.2,90.1,88,89.8,90.6,100,90.4,56,91.8,97,91,96.2,93.4,85.2,96.4,90.4,100,52.4,88.6,92,55.9,53,51.9,60.7,53.4,56,52.4,100,59.8,56,91.5,88.7,87.1,95.8,89.2,91.8,88.6,59.8,100,93.3,97.2,91.6,90.4,91,92.2,97,92,56,93.3,100),nrow=10,ncol=10))
# set the column names, which are used for node labels
colnames(hab1) <- c("W01","W02","W03","W04","W05","W06","W07","W08","W09","W10")
hclust1 <- hclust(as.dist(100 - hab1), method="average")
dend1 <- as.dendrogram(hclust1)
hab2 <- data.frame(matrix(c(100,89.5,87.4,88.1,90.1,96.4,89.7,55.1,89.9,96,89.5,100,93.3,85.3,98.3,89.5,96,52.9,88.2,91.6,87.4,93.3,100,82.4,93.9,87,92.7,51.5,86.1,89.9,88.1,85.3,82.4,100,86.6,89.3,85.1,60.6,95.2,91,90.1,98.3,93.9,86.6,100,90.1,96.2,53.3,88.7,92.2,96.4,89.5,87,89.3,90.1,100,89.7,55.5,90.7,96.4,89.7,96,92.7,85.1,96.2,89.7,100,52.2,88,91.8,55.1,52.9,51.5,60.6,53.3,55.5,52.2,100,59.4,55.9,89.9,88.2,86.1,95.2,88.7,90.7,88,59.4,100,92.7,96,91.6,89.9,91,92.2,96.4,91.8,55.9,92.7,100),nrow=10,ncol=10))
# set the column names, which are used for node labels
colnames(hab2) <- c("W01","W02","W03","W04","W05","W06","W07","W08","W09","W10")
hclust2 <- hclust(as.dist(100 - hab2), method="average")
dend2 <- as.dendrogram(hclust2)
# colors for the node labels
colors_to_use1 <- c("purple","orange","blue","darkolivegreen","orange","purple","magenta","red","darkolivegreen","cyan")
colors_to_use2 <- c("purple","orange","blue","darkolivegreen","orange","purple","magenta","red","darkolivegreen","cyan")
# sort the colors based on their order in dend1
colors_to_use_dend1 <- colors_to_use1[order.dendrogram(dend1)]
labels_colors(dend1) <- colors_to_use_dend1
# sort the colors based on their order in dend2
colors_to_use_dend2 <- colors_to_use2[order.dendrogram(dend2)]
labels_colors(dend2) <- colors_to_use_dend2
dends_1_2 <- dendlist(dend1, dend2)
x <- dends_1_2 %>% untangle(method = "step2side") %>% tanglegram(color_lines = c("black"))
png("Exclude vs not exclude.png")
x %>% plot(main = "Exclude vs not exclude")
dev.off( )
What am I doing wrong? I have tried using common_subtrees_color_lines = FALSE for tanglegram, but to no avail.
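One possible explanation, offered as an assumption rather than a confirmed diagnosis: tanglegram() draws the figure as soon as it is called (which is what shows up in RGui), while the later plot(x) call redraws it inside the png device without the color_lines argument. If that is the cause, opening the device first and passing the options in the drawing call should give black lines in the file. The sketch below uses the built-in USArrests data in place of hab1 and hab2.

```r
library(dendextend)

# Two small dendrograms on the same items (stand-ins for dend1 and dend2)
d <- dist(USArrests[1:8, ])
dend_a <- as.dendrogram(hclust(d, method = "average"))
dend_b <- as.dendrogram(hclust(d, method = "complete"))

dl <- untangle(dendlist(dend_a, dend_b), method = "step2side")

out_file <- tempfile(fileext = ".png")
png(out_file)  # open the device BEFORE drawing
tanglegram(dl,
           color_lines = "black",
           common_subtrees_color_lines = FALSE,
           main = "Exclude vs not exclude")
dev.off()
```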

Gantt plot in base r - modifying plot properties

I would like to ask a follow-up question related to the answer given in this post [Gantt style time line plot (in base R) ] on Gantt plots in base r. I feel like this is worth a new question as I think these plots have a broad appeal. I'm also hoping that a new question would attract more attention. I also feel like I need more space than the comments of that question to be specific.
The following code was given by @digEmAll. It takes a dataframe with columns referring to a start time, end time, and grouping variable and turns that into a Gantt plot. I have modified @digEmAll's function very slightly to get the bars/segments in the Gantt plot to be contiguous to one another rather than having a gap. Here it is:
plotGantt <- function(data, res.col='resources',
start.col='start', end.col='end', res.colors=rainbow(30))
{
#slightly enlarge Y axis margin to make space for labels
op <- par('mar')
par(mar = op + c(0,1.2,0,0))
minval <- min(data[,start.col])
maxval <- max(data[,end.col])
res.colors <- rev(res.colors)
resources <- sort(unique(data[,res.col]),decreasing=T)
plot(c(minval,maxval),
c(0.5,length(resources)+0.5),
type='n', xlab='Duration',ylab=NA,yaxt='n' )
axis(side=2,at=1:length(resources),labels=resources,las=1)
for(i in 1:length(resources))
{
yTop <- i+0.5
yBottom <- i-0.5
subset <- data[data[,res.col] == resources[i],]
for(r in 1:nrow(subset))
{
color <- res.colors[((i-1)%%length(res.colors))+1]
start <- subset[r,start.col]
end <- subset[r,end.col]
rect(start,yBottom,end,yTop,col=color)
}
}
par(mar=op) # reset the plotting margins
}
Here are some sample data. You will notice that I have four groups 1-4. However, not all dataframes have all four groups. Some only have two, some only have 3.
mydf1 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,1,1,2,2,2,1,1,1))
mydf2 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,2,2,3,4,3,2,1,1))
mydf3 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(4,4,4,4,4,4,3,2,3,3))
mydf4 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,1,2,3,3,3,2,1,1))
Here I run the above function, but specify four colors for plotting:
plotGantt(mydf1, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf2, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf3, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf4, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
These are the plots:
What I would like to do is modify the function so that:
1) it will plot on the y-axis all four groups regardless of whether they actually appear in the data or not.
2) Have the same color associated with each group for every plot regardless of how many groups there are. As you can see, mydf2 has four groups and all four colors are plotted (1-red, 2-orange, 3-yellow, 4-gray). These colors are also plotted with the same groups for mydf3, as that only contains groups 2,3,4 and the colors are picked in reverse order. However, mydf1 and mydf4 have different colors plotted for each group as they do not have any group 4's. Gray is still the first color chosen, but now it is used for the highest-numbered group present (group 2 in mydf1 and group 3 in mydf4).
It appears to me that the main thing I need to work on is the vector 'resources' inside the function, and have that not just contain the unique groups but all. When I try manually overriding to make sure it contains all the groups, e.g. doing something as simple as resources <-as.factor(1:4) then I get an error:
'Error in rect(start, yBottom, end, yTop, col = color) : cannot mix zero-length and non-zero-length coordinates'
Presumably the for loop does not know how to plot data that do not exist for groups that don't exist.
I hope that this is a replicable/readable question and it's clear what I'm trying to do.
EDIT: I realize that to solve the color problem, I could just specify the colors for the 3 groups that exist in each of these sample dfs. However, my intention is to use this plot as an output to a function whereby it wouldn't be known ahead of time if all of the groups exist for a particular df.
I slightly modified your function to account for NA in start and end dates:
plotGantt <- function(data, res.col='resources',
start.col='start', end.col='end', res.colors=rainbow(30))
{
#slightly enlarge Y axis margin to make space for labels
op <- par('mar')
par(mar = op + c(0,1.2,0,0))
minval <- min(data[,start.col],na.rm=T)
maxval <- max(data[,end.col],na.rm=T)
res.colors <- rev(res.colors)
resources <- sort(unique(data[,res.col]),decreasing=T)
plot(c(minval,maxval),
c(0.5,length(resources)+0.5),
type='n', xlab='Duration',ylab=NA,yaxt='n' )
axis(side=2,at=1:length(resources),labels=resources,las=1)
for(i in 1:length(resources))
{
yTop <- i+0.5
yBottom <- i-0.5
subset <- data[data[,res.col] == resources[i],]
for(r in 1:nrow(subset))
{
color <- res.colors[((i-1)%%length(res.colors))+1]
start <- subset[r,start.col]
end <- subset[r,end.col]
rect(start,yBottom,end,yTop,col=color)
}
}
par(mar=op) # reset the plotting margins
invisible()
}
In this way, if you simply append all your possible group values to your data, you'll get them printed on the y axis, e.g.:
mydf1 <- data.frame(startyear=2000:2009, endyear=2001:2010,
group=c(1,1,1,1,2,2,2,1,1,1))
# add all the group values you want to print with NA dates
mydf1 <- rbind(mydf1,data.frame(startyear=NA,endyear=NA,group=1:4))
plotGantt(mydf1, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
About the colors, at the moment the ordered res.colors are applied to the sorted groups; so the 1st color in res.colors is applied to 1st (sorted) group and so on...
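If a fixed group-to-color association is wanted, one option is to pass the full set of groups and a named color vector explicitly instead of deriving both from the data. The following is a hypothetical variant, not the original function: plotGanttFixed, all.groups and group.colors are names introduced for this sketch.

```r
# Hypothetical variant of plotGantt: 'all.groups' and the named 'group.colors'
# are illustrative additions, not part of the original function
plotGanttFixed <- function(data, res.col, start.col, end.col,
                           all.groups, group.colors) {
  op <- par(mar = par("mar") + c(0, 1.2, 0, 0))  # room for y labels
  on.exit(par(op))                               # restore margins on exit
  groups <- rev(all.groups)                      # first group ends up at the top
  plot(range(c(data[[start.col]], data[[end.col]]), na.rm = TRUE),
       c(0.5, length(groups) + 0.5),
       type = "n", xlab = "Duration", ylab = NA, yaxt = "n")
  axis(side = 2, at = seq_along(groups), labels = groups, las = 1)
  for (i in seq_along(groups)) {
    rows <- data[data[[res.col]] == groups[i], , drop = FALSE]
    if (nrow(rows) == 0) next  # absent group: keep the axis row, draw nothing
    rect(rows[[start.col]], i - 0.5, rows[[end.col]], i + 0.5,
         col = group.colors[[as.character(groups[i])]])
  }
  invisible(groups)
}

mydf1 <- data.frame(startyear = 2000:2009, endyear = 2001:2010,
                    group = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1))
cols <- c("1" = "red", "2" = "orange", "3" = "yellow", "4" = "gray99")
shown <- plotGanttFixed(mydf1, "group", "startyear", "endyear",
                        all.groups = 1:4, group.colors = cols)
```

Because colors are looked up by group name, group 1 is red in every plot whether or not groups 3 and 4 occur, and no NA rows need to be appended to the data.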
