Fixing Aesthetics for Dendrogram Plots - r

I am trying to plot 4 different dendrograms on a single plot and divide the data into 4 clusters for each of the dendrograms. However, after running the code, the plots that I keep getting are as follows.
This is the code that I am using:
library(tibble)      # remove_rownames(), column_to_rownames()
library(dplyr)       # %>%
library(dendextend)  # rect.dendrogram(), cutree() for dendrograms
data <- df[, c(2, 1305:2185)]
data_HC <- data %>% remove_rownames %>% column_to_rownames(var = "Name")
h1 <- data_HC %>% dist %>% hclust(method='average') %>% as.dendrogram
h2 <- data_HC %>% dist %>% hclust(method='complete') %>% as.dendrogram
h3 <- data_HC %>% dist %>% hclust(method='ward.D') %>% as.dendrogram
h4 <- data_HC %>% dist %>% hclust(method='single') %>% as.dendrogram
compare_clusters <- function(data_1, data_2, data_3, data_4) {
  par(mfrow = c(2, 2))
  cols <- c('red', 'green', 'blue', 'pink')
  plot(data_1, main = 'Average Linkage')
  cut_avg_h1 <- cutree(data_1, k = 4)
  rect.dendrogram(data_1, k = 4, border = cols)
  plot(data_2, main = 'Complete Linkage')
  cut_avg_h2 <- cutree(data_2, k = 4)
  rect.dendrogram(data_2, k = 4, border = cols)
  plot(data_3, main = "Ward's Linkage")
  cut_avg_h3 <- cutree(data_3, k = 4)
  rect.dendrogram(data_3, k = 4, border = cols)
  plot(data_4, main = 'Single Linkage')
  cut_avg_h4 <- cutree(data_4, k = 4)
  rect.dendrogram(data_4, k = 4, border = cols)
}
plot <- compare_clusters(h1, h2, h3, h4)
There are several things I am trying to do here.
(1) Get rid of the two main branches for Ward's method so that all 4 clusters are more clearly visible. Ideally I want to plot the data only from height = 50.
(2) Reduce the font size of all the labels so the names are more legible.
(3) Reduce the gap between the branches and the labels. Right now, the distance between the names and the branches is huge.
(4) Save this picture. I have not been able to save it, and I have looked everywhere for how to do so.
I would be very grateful for any help on this! Thank you.

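(1) One option may be to change the height of all the trees with raise.dendrogram(). Alternatively, plotting the two main subtrees, h3[[1]] and h3[[2]], separately may help.
(2) With library(dendextend), use set("labels_cex", 0.5); change the 0.5 to adjust the label size.
(3) Sorry, I could not find a solution for this one.
(4) To save the picture, open a png() device before plotting and close it with dev.off() afterwards: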
png(filename="your/file/location/name.png")
compare_clusters(h1, h2, h3, h4)
dev.off()
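The suggestions above can be put together in a small sketch. It is only an illustration under stated assumptions: it uses the built-in USArrests data in place of the asker's data_HC, writes to a temporary file (out_file is a hypothetical name), and the image size and resolution are arbitrary choices.

```r
library(dendextend)

# Assumption: USArrests stands in for the asker's data_HC.
dend <- as.dendrogram(hclust(dist(USArrests[1:20, ]), method = "ward.D"))

# Shrink the leaf labels; 0.5 is the value suggested in the answer.
dend_small <- set(dend, "labels_cex", 0.5)

# Open the device first, draw, then close it so the file is written.
out_file <- file.path(tempdir(), "dendrograms.png")  # hypothetical location
png(filename = out_file, width = 1600, height = 1200, res = 150)
par(mfrow = c(1, 2))
plot(dend_small, main = "Ward's Linkage")
rect.dendrogram(dend_small, k = 4, border = c("red", "green", "blue", "pink"))
# Plot only the first main subtree, as in the h3[[1]] suggestion.
plot(dend_small[[1]], main = "First main subtree")
dev.off()
```

Nothing is shown on screen while the png() device is open; the plots go to the file until dev.off() is called.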

Related

Changing color of columns above a certain value in barplot() in R

I am trying to create a barplot that illustrates the number of issue reports for certain issues in an 'imaginary' business. I have used abline() to add a line at a certain value and would like to change the color of the bar above this point. Is such a thing possible to do? Thanks in advance.
My code is as follows:
H<-c(30,35,7,12)
M<-c("Issue 1","Issue 2","Issue 3","Issue 4")
par(mar=c(5,5,5,5))
barplot(H,names.arg=M,ylab="No. of issues",col="light grey",
main="Issue Analysis",border="black",las=2)
abline(h=20,col="red",lty=2)
And the produced graph is:
So, you should see that above the red line is the portion of the first two bars that I want to highlight.
Thanks
You can divide each bar into two parts (some of which will be zero) and draw them as stacked bars.
To avoid a line between the stacked segments, I first draw the stacked bars without borders and then draw only the border of the original (non-stacked) bars.
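h <- 20
H <- c(30, 35, 7, 12)
M <- paste("Issue", seq_along(H))
# Row 1: part of each bar up to h; row 2: part above h (0 for shorter bars)
H2 <- rbind(pmin(H, h), pmax(H - h, 0))
# Stacked bars without borders
barplot(H2, names.arg = M, ylab = "No. of issues", col = gray(c(.7, .8)),
        main = "Issue Analysis", las = 2, border = NA)
# Outline of the original bars drawn on top
barplot(H, col = NA, yaxt = 'n', add = TRUE)
abline(h = h, col = "red", lty = 2)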
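As a quick check of the splitting step, here is the arithmetic on its own, using the same H and h as above (the names below and above are introduced only for this illustration):

```r
h <- 20
H <- c(30, 35, 7, 12)

below <- pmin(H, h)      # part of each bar at or under the threshold: 20 20 7 12
above <- pmax(H - h, 0)  # part of each bar over the threshold:        10 15 0 0

# Each column of H2 is one bar; the two rows stack back up to the original height.
H2 <- rbind(below, above)
stopifnot(all(colSums(H2) == H))
```

Because the two rows always sum to the original heights, the stacked plot has the same bar outlines as the plain barplot of H.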
If you want to use ggplot2, you can use the following function I created (the code could also be used without wrapping it in a function). It works by slightly modifying the object returned by ggplot_build. You can set the threshold and the colour (colorbar) to give the part of each bar above the threshold. Here are examples with thresholds of 20 and 10:
H<-c(30,35,7,12)
M<-c("Issue 1","Issue 2","Issue 3","Issue 4")
df <- data.frame(H = H,
M = M)
library(ggplot2)
library(dplyr)
library(tidyr)
threshold_bar <- function(df, threshold, colorbar){
  p <- ggplot(df, aes(x = M, y = H)) +
    geom_bar(stat = "identity") +
    geom_hline(yintercept = threshold, colour = "red")
  q <- ggplot_build(p)
  q$data[[1]] <- q$data[[1]] %>%
    group_by(group) %>%
    group_modify(~ add_row(.x, .before = 0)) %>%
    fill(everything(), .direction = "up") %>%
    mutate(ymin = ifelse(row_number() == 2 & y > threshold, threshold, ymin)) %>%
    mutate(fill = case_when(ymin == threshold ~ colorbar,
                            TRUE ~ fill)) %>%
    ungroup()
  q <- ggplot_gtable(q)
  plot(q)
}
threshold <- 20
colorbar <- "blue"
threshold_bar(df, threshold, colorbar)
threshold <- 10
colorbar <- "blue"
threshold_bar(df, threshold, colorbar)
Created on 2022-08-16 by the reprex package (v2.0.1)
Please note: The function could of course be optimized and extended if you want.

Using text annotations with plotly::subplot

I have data I'd like to plot the distribution density of. The data are from three groups, where for each there are three states, each with a probability, and these probabilities sum to 1.
I'm trying to use R's plotly to plot, for each group, the density of the probabilities, color coded by state, and add some text annotation to each such group plot. Finally I'm trying to combine all of these group plots using plotly::subplot.
Here's the code to generate the data and a list of group plots:
library(dplyr)
library(reshape2)
library(plotly)
set.seed(1)
plot.list <- lapply(1:3,function(g){
if(g == 1){
show.legend <- T
} else{
show.legend <- F
}
df <- data.frame(id=LETTERS,t(sapply(1:length(LETTERS),function(x){
probs <- runif(3,0,1)
return(probs/sum(probs))
}))) %>% dplyr::rename(S1=X1,S2=X2,S3=X3) %>%
reshape2::melt() %>% dplyr::rename(state=variable,probability=value)
df$state <- factor(df$state,levels=c("S1","S2","S3"))
density.df <- do.call(rbind,lapply(levels(df$state),function(s){
dens <- density(dplyr::filter(df,state == s)$probability)
return(data.frame(x=dens$x,y=dens$y,state=s,stringsAsFactors=F))
}))
density.df$state <- factor(density.df$state,levels=levels(df$state))
dens.plot <- plot_ly(x=~density.df$x,y=~density.df$y,type='scatter',mode='lines',color=~density.df$state,showlegend=show.legend) %>%
layout(xaxis=list(title="Probability",zeroline=F),yaxis=list(title="Count",zeroline=F)) %>%
add_annotations(x=0.75,y="top",text=paste0("text: ",g))
if(show.legend) dens.plot <- dens.plot %>% add_annotations(text="State",xref="paper",yref="paper",x=1.02,xanchor="left",y=1.02,yanchor="top",legendtitle=T,showarrow=F)
return(dens.plot)
})
Note that I'm only adding the legend to the first group so it appears only once in the final grouped plot (there's probably a more elegant way of achieving that).
And here's the plotly::subplot command I'm using:
subplot(plot.list,nrows=3,shareX=T,shareY=T,titleX=T,titleY=T)
Which gives:
As you can see, the text annotation is stuck at the "top" of the first plot rather than at the top of each individual plot.
Any idea how I can get each annotation to be located at the top of its corresponding subplot?
Preamble. For reasons that are not entirely obvious to me (but relating to how values for annotations are scaled when running subplot), annotations seem to go awry with vertically stacked subplots. To see this, run the MWE at https://plot.ly/r/text-and-annotations/#subplot-annotations, but change
subplot(p1, p2, titleX = TRUE, titleY = TRUE)
to
subplot(p1, p2, titleX = TRUE, titleY = TRUE, nrows = 2)
In the vertically stacked version, the annotations are not where we would expect them to be. To achieve your desired outcome would require some post-processing of the subplot output. Now, on to your main question.
First, in add_annotations, add xref and yref arguments that correspond to each subplot. In each element of plot.list, I also add an additional element y_anno to keep track of where we would like the annotation to go (at the maximum value of the densities in each subplot).
plot.list <- lapply(1:3,function(g){
if(g == 1){
show.legend <- T
} else{
show.legend <- F
}
df <- data.frame(id=LETTERS,t(sapply(1:length(LETTERS),function(x){
probs <- runif(3,0,1)
return(probs/sum(probs))
}))) %>% dplyr::rename(S1=X1,S2=X2,S3=X3) %>%
reshape2::melt() %>% dplyr::rename(state=variable,probability=value)
df$state <- factor(df$state,levels=c("S1","S2","S3"))
density.df <- do.call(rbind,lapply(levels(df$state),function(s){
dens <- density(dplyr::filter(df,state == s)$probability)
return(data.frame(x=dens$x,y=dens$y,state=s,stringsAsFactors=F))
}))
density.df$state <- factor(density.df$state,levels=levels(df$state))
dens.plot <- plot_ly(x=~density.df$x,
y=~density.df$y,
type='scatter',
mode='lines',
color=~density.df$state,
showlegend=show.legend) %>%
layout(xaxis=list(title="Probability",zeroline=F),yaxis=list(title="Count",zeroline=F)) %>%
add_annotations(x = 0.75,
y = max(density.df$y),
text = paste0("text: ", g),
xref = paste0("x", g), # add this
yref = paste0("y", g), # add this
ax = 0,
ay = 0)
if(show.legend) dens.plot <- dens.plot %>% add_annotations(text="State",xref="paper",yref="paper",x=1.02,xanchor="left",y=1.02,yanchor="top",legendtitle=T,showarrow=F)
dens.plot$y_anno <- max(density.df$y) # add this
return(dens.plot)
})
Now if we run subplot(plot.list,nrows=3,shareX=T,shareY=T,titleX=T,titleY=T), the text will be in each subplot, but not at the top (due to the phenomenon I described in the preamble). To fix this, we can post-process the subplot output:
p <- subplot(plot.list, nrows = 3, shareX = TRUE, shareY = TRUE, titleX = TRUE, titleY = TRUE)
for (i in seq_along(plot.list)) {
  for (j in seq_along(p$x$layout$annotations)) {
    if (p$x$layout$annotations[[j]]$yref == paste0("y", i)) {
      p$x$layout$annotations[[j]]$y <- plot.list[[i]]$y_anno
    }
  }
}
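To see what the loop does without building any plots, here is a minimal sketch on a plain nested list. Everything in it is an assumption made for illustration: the list only mimics the p$x$layout$annotations fields used above, and the yref strings and the y_anno values are hypothetical.

```r
# Mock object: only the fields the loop touches (not a real plotly object).
p <- list(x = list(layout = list(annotations = list(
  list(text = "text: 1", yref = "y1",    y = 0.9),
  list(text = "State",   yref = "paper", y = 1.02),  # legend title, paper coordinates
  list(text = "text: 2", yref = "y2",    y = 0.5)
))))
y_anno <- c(2.1, 3.4)  # hypothetical per-subplot density maxima

for (i in seq_along(y_anno)) {
  for (j in seq_along(p$x$layout$annotations)) {
    # Only annotations tied to the i-th y axis are moved.
    if (identical(p$x$layout$annotations[[j]]$yref, paste0("y", i))) {
      p$x$layout$annotations[[j]]$y <- y_anno[i]
    }
  }
}

ys <- sapply(p$x$layout$annotations, function(a) a$y)
```

The annotation with yref = "paper" keeps its original position, which is why the legend title is not affected by the post-processing.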
Now p gives us
which is close to what we want.

Drawing rectangles around specified labels in a dendrogram with 'dendextend'

I'm currently constructing a dendrogram and I'm using 'dendextend' to tweak the look of it.
I've been able to do everything I want to (labelling leaves and highlighting branches of my chosen clusters), except drawing rectangles around pre-defined clusters.
My data (which can be sourced from this file: Barra_IBS_example.matrix) was clustered with 'pvclust', so 'pvrect' draws the rects in the correct position, but it cuts the labels (see image below), so I want to reproduce it with 'rect.dendrogram', however, I can't figure out how to tell the function to use the clustering data from 'pvclust'.
This is the code I'm using:
idnames <- dimnames(ibs_mat)[[1]]
ibs.pv <- pvclust(ibs_mat, nboot=1000)
ibs.clust <- pvpick(ibs.pv, alpha=0.95)
names(ibs.clust$clusters) <- paste0("Cluster", 1:length(ibs.clust$clusters))
# Choose a colour palette
pal <- brewer.pal(length(ibs.clust$clusters), "Paired")
# Transform the list to a dataframe
ibs_meta <- bind_rows(lapply(names(ibs.clust$clusters),
function(l) data.frame(Cluster=l, Sample = ibs.clust$clusters[[l]])))
# Add the rest of the non-clustered samples (and assign them as Cluster0), add colour to each cluster
ibs_table <- ibs_meta %>%
rbind(., data.frame(Cluster = "Cluster0",
Sample = idnames[!idnames %in% .$Sample])) %>%
mutate(Cluster_int=as.numeric(sub("Cluster", "", Cluster))) %>%
mutate(Cluster_col=ifelse(Cluster_int==0, "#000000",
pal[Cluster_int])) %>%
.[match(ibs.pv$hclust$labels[ibs.pv$hclust$order], .$Sample),]
hcd <- as.dendrogram(ibs.pv) %>%
#pvclust_show_signif(ibs.pv, show_type = "lwd", signif_value = c(2, 1),alpha=0.25) %>%
set("leaves_pch", ifelse(ibs_table$Cluster_int>0,19,18)) %>% # node point type
set("leaves_cex", 1) %>% # node point size
set("leaves_col", ibs_table$Cluster_col) %>% #node point color
branches_attr_by_labels(ibs_meta$Sample, TF_values = c(2, Inf), attr = c("lwd")) %>% # change branch width
# rect.dendrogram(k=12, cluster = ibs_table$Cluster_int, border = 8, lty = 5, lwd = 1.5,
# lower_rect = 0) %>% # add rectangles around clusters
plot(main="Barramundi samples IBS based clustering")
pvrect(ibs.pv, alpha=0.95, lwd=1.5)
Many thanks, Ido
OK, this took more work than I had hoped, but I have a solution for you.
I created a new function called pvrect2 and just pushed it to the latest version of dendextend on GitHub. Here is a self-contained example demonstrating the solution:
devtools::install_github('talgalili/dendextend')
library(pvclust)
library(dendextend)
data(lung) # 916 genes for 73 subjects
set.seed(13134)
result <- pvclust(lung[, 1:20], method.dist="cor", method.hclust="average", nboot=10)
par(mar = c(9,2.5,2,0))
dend <- as.dendrogram(result)
dend %>%
pvclust_show_signif(result, signif_value = c(3,.5)) %>%
pvclust_show_signif(result, signif_value = c("black", "grey"), show_type = "col") %>%
plot(main = "Cluster dendrogram with AU/BP values (%)")
# pvrect(result, alpha=0.95)
pvrect2(result, alpha=0.95)
text(result, alpha=0.95)

How can I create subplots in plotly using R where each subplot is two traces

Here is a toy example I have got stuck on
library(plotly)
library(dplyr)
# construct data.frame
df <- tibble(x=c(3,2,3,5,5,5,2),y=c("a","a","a","b","b","b","b"))
# construct data.frame of last y values
latest <- df %>%
group_by(y) %>%
slice(n())
# plot for one value of y (NB not sure why value for 3 appears?)
p <- plot_ly() %>%
add_histogram(data=subset(df,y=="b"),x= ~x) %>%
add_histogram(data=subset(latest,y=="b"),x= ~x,marker=list(color="red")) %>%
layout(barmode="overlay",showlegend=FALSE,title= ~y)
p
How can I set these up as subplots, one for each unique value of y? In the real-world example, I would have 20 different y's, so I would ideally loop or apply the code. In addition, it would be good to set a standard x scale of, say, c(1:10) and have, for example, 2 rows.
TIA
(1) Build a list containing each of the plots.
(2) Set the bin sizes manually for the histograms; otherwise the automatic selection will choose different bins for each of the traces within a plot (making it look strange, as in your example, where the bars of each trace have different widths).
(3) Use subplot to put it all together.
(4) Add titles to the individual subplots using a list of annotations, as explained here.
Like this:
N = nlevels(factor(df$y))
plot_list = vector("list", N)
lab_list = vector("list", N)
for (i in 1:N) {
  this_y = levels(factor(df$y))[i]
  # x is given as a formula (~x) so it is looked up in the data of each trace
  p <- plot_ly() %>%
    add_trace(type="histogram", data=subset(df,y==this_y), x=~x, marker=list(color="blue"),
              autobinx=F, xbins=list(start=0.5, end=6.5, size=1)) %>%
    add_trace(type="histogram", data=subset(latest,y==this_y), x=~x, marker=list(color="red"),
              autobinx=F, xbins=list(start=0.5, end=6.5, size=1)) %>%
    layout(barmode="overlay", showlegend=FALSE)
  plot_list[[i]] = p
  # title position in paper coordinates; these two y values assume exactly 2 rows
  titlex = 0.5
  titley = c(1.05, 0.45)[i]
  lab_list[[i]] = list(x=titlex, y=titley, text=this_y,
                       showarrow=F, xref='paper', yref='paper', font=list(size=18))
}
subplot(plot_list, nrows = 2) %>%
  layout(annotations = lab_list)
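For the case with many values of y, the same steps can be wrapped in a function and applied with lapply. This is a rough sketch rather than a definitive implementation: make_panel and bins are hypothetical names, the bin range 0.5 to 10.5 is an assumption chosen to cover the x scale of 1 to 10 mentioned in the question, and the subplot titles are left out for brevity.

```r
library(plotly)
library(dplyr)

# Toy data from the question
df <- tibble(x = c(3, 2, 3, 5, 5, 5, 2), y = c("a", "a", "a", "b", "b", "b", "b"))
latest <- df %>% group_by(y) %>% slice(n()) %>% ungroup()

# Shared bins so that both traces in every panel use the same bar widths
bins <- list(start = 0.5, end = 10.5, size = 1)

# One panel per value of y: all values in blue, the latest value overlaid in red
make_panel <- function(this_y) {
  plot_ly() %>%
    add_histogram(data = df[df$y == this_y, ], x = ~x,
                  marker = list(color = "blue"), autobinx = FALSE, xbins = bins) %>%
    add_histogram(data = latest[latest$y == this_y, ], x = ~x,
                  marker = list(color = "red"), autobinx = FALSE, xbins = bins) %>%
    layout(barmode = "overlay", showlegend = FALSE,
           xaxis = list(range = c(0.5, 10.5)))
}

plot_list <- lapply(unique(df$y), make_panel)
fig <- subplot(plot_list, nrows = 2, shareX = TRUE)
```

With 20 values of y, only the nrows argument would need to change; printing fig displays the combined figure.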

How to adjust lines length in dendrogram?

Using this code in R,
library("dendextend")
library("dendextendRcpp")
dist2 <- read.csv("distanceMatrix.csv",sep=";",header=TRUE)
mat <- as.matrix(dist2)
# using piping to get the dend
dend <- dist2 %>% dist %>% hclust %>% as.dendrogram %>% set("labels", colnames(mat))
foo <- function(k){
svg(filename = "dendrogram_newest.svg",width = 25,height = 14)
# plot + color the dend's branches before, based on k clusters:
dend %>% color_branches(k) %>% plot()
# add horiz line:
abline(h = heights_per_k.dendrogram(dend)[k], lwd = 2, lty = 2, col = "purple")
dev.off()}
foo(6)
I get this:
So, how can I shorten these lines? As it is, the plot is almost unreadable.
And yes, my labels are in the same order as the first row of my distanceMatrix.csv. This order has nothing to do with the relationships inside the distance matrix. I mean, the dendrogram itself is OK, but the labels are not the right ones.
Thanks
