Plotting a dendrogram in base R with only two leaves

I have some (legacy) code that plots a dendrogram from an n by n matrix of distances, using base R (4.1.1).
This works fine for n>=3, but fails for n=2.
numElements <- 2
data <- matrix(1, nrow = numElements, ncol = numElements)
data[1,2] <- 0
data <- (data + t(data))/2
d3 <- as.dist(data)
clust3 <- hclust(d3)
plot(clust3, hang = -1)
For n=2 I get this error:
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
invalid dendrogram input
I would like a dendrogram with two leaves, which will show the height between just those two leaves.
Unfortunately, graphics:::plotHclust calls external C code, so I can't modify that directly. And also unfortunately, I'm trying to run this on a pre-built container on a virtual machine, so I need a base R solution without additional packages (else I'd just use ggdendro). I can catch the case of only two samples and run a separate plotting function, that is fine.

The base plot function for hclust objects can't handle a monotonic increase in the height of the dendrogram, as described in this answer. You could instead use as.dendrogram to convert the result to a dendrogram object, like this:
numElements <- 2
data <- matrix(1, nrow = numElements, ncol = numElements)
data[1,2] <- 0
data <- (data + t(data))/2
d3 <- as.dist(data)
clust3 <- hclust(d3)
plot(as.dendrogram(clust3, hang = -1)) # hang is an argument of as.dendrogram(), not of the dendrogram plot method
Created on 2022-08-19 with reprex v2.0.2
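Since the question mentions catching the two-sample case separately, here is a minimal base R sketch of that idea. The helper name plot_hclust_safe and the threshold of three leaves are assumptions for illustration (the question only reports that n>=3 works and n=2 fails); it is not part of the original answer.

```r
# Hypothetical helper: plot an hclust object, falling back to the
# dendrogram plot method when there are fewer than three leaves.
plot_hclust_safe <- function(hc, ...) {
  n_leaves <- length(hc$order)
  if (n_leaves < 3) {
    # plot.hclust fails here, so go through as.dendrogram() instead
    plot(as.dendrogram(hc), ...)
  } else {
    plot(hc, hang = -1, ...)
  }
  invisible(n_leaves)
}

# Two-leaf example with the same distance (0.5) as in the question
d <- as.dist(matrix(c(1, 0.5, 0.5, 1), nrow = 2))
hc2 <- hclust(d)
plot_hclust_safe(hc2, ylab = "Height")
```

The y-axis of the two-leaf plot then shows the single merge height between the two samples.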

Related

Save multiple ggplots from a for loop in a single plot in a particular layout

I am trying to plot a single image that contains 35 ggplots. The order of the plots in the single image is fixed and is shown below.
I also want blank grids as shown in the grid image. Each grid should have the plot with a particular drug number. I have a data frame "drug_dctv2" which I am splitting, and making into a list to read data into the for loop.
The problem is: In plot_list[[i]], only the last plot is saved 35 times with i (1 to 35). I am also not sure how to save the plots in the particular order as shown in the grid.
Through my internet search, I found libraries like "cowplot" and "gridExtra", but I couldn't find a proper way to use them.
I made a plot layout file which contains the drug names in the following order as shown in the grid image and in place of blank spaces, I inserted "tab". But I do not find a way to proceed from there.
I am new to R. Any help and suggestion will be appreciated.
The data set looks like the sample below. Each drug has 10 data points.
Drug_name  conc  viab
Drug_1     1     1.0265
Drug_1     0.1   1.2365
Drug_1     0.01  0.5896
--         --    --
Drug_2     1     2.0584
Drug_2     0.1   1.0277
Drug_2     0.01  1.5696
--         --    --
split <- split(file,rep(1:35,each=10)) #### this will be used in the for loop
plot_list = list()
for(i in 1:length(split))
{
data <- split[[i]]
c <- data$conc
v <- data$viab
p = ggplot(data = data,aes(x=c,y=v))+geom_point()+ylim(0,1.5)+
scale_x_continuous(trans='log10')+
theme(axis.text = element_blank(),axis.title = element_blank()) +
geom_line(data=line_data, aes(x=x,y=y2),color ="red",size=1)
plot_list[[i]] = p
}
Thank you in advance !!
ggplot2, like many tidyverse packages, uses delayed non-standard evaluation. The expression you provide inside aes is not evaluated until the plot is built (e.g. printed or saved).
The expression in your question refers to the vectors c and v defined in the for loop. These vectors change on each iteration, but the aes call only stores a reference to c and v in the environment where the for loop is running, so the c and v values used in every plot are the ones available when the plot is printed or saved, i.e. those from the last iteration.
You can, as mentioned in the comments, use a column from the data frame directly, since the data frame itself is stored in the plot when ggplot() is called.
An alternative, if you want to keep using c and v, is to make sure each iteration runs in an independent environment, so that the references to c and v point to a different c and v on each iteration. This can be done, for instance, by replacing the for loop with an lapply call.
plot_list <- lapply(split, function(data_drug) {
c <- data_drug$conc
v <- data_drug$viab
ggplot(data = data_drug,aes(x=c,y=v))+geom_point()+ylim(0,1.5)+
scale_x_continuous(trans='log10')+
theme(axis.text = element_blank(),axis.title = element_blank()) +
geom_line(data=line_data, aes(x=x,y=y2),color ="red",size=1)
})
This is one beautiful example where a for loop and an lapply call produce different results and it's a great learning experience about non standard evaluation and variable environments.
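For completeness, the first option mentioned above (mapping the data frame columns directly) can be sketched as follows. The toy data frame drug_data is made up for illustration, and the red geom_line layer from the question is left out because line_data is not shown there.

```r
library(ggplot2)

# Hypothetical toy data standing in for the asker's data frame
drug_data <- data.frame(
  Drug_name = rep(paste0("Drug_", 1:3), each = 3),
  conc = rep(c(1, 0.1, 0.01), times = 3),
  viab = c(1.03, 1.24, 0.59, 1.06, 1.03, 1.17, 0.80, 0.90, 1.00)
)
by_drug <- split(drug_data, drug_data$Drug_name)

plot_list <- list()
for (i in seq_along(by_drug)) {
  # conc and viab are looked up in the data frame attached to each plot,
  # so a plain for loop is safe here
  plot_list[[i]] <- ggplot(by_drug[[i]], aes(x = conc, y = viab)) +
    geom_point() +
    scale_x_continuous(trans = "log10") +
    ylim(0, 1.5)
}
```

Because each plot carries its own data frame, no loop variable is referenced when the plots are eventually drawn.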
To combine the plots look at cowplot::plot_grid https://wilkelab.org/cowplot/articles/plot_grid.html
Something like this should work
library(cowplot)
plot_grid(
plot_list[[35]], plot_list[[5]], plot_list[[3]], plot_list[[2]],
plot_list[[34]], plot_list[[1]], plot_list[[4]], plot_list[[6]],
plot_list[[32]], plot_list[[8]], NULL, NULL,
plot_list[[30]], plot_list[[7]], plot_list[[33]] , NULL,
labels = "AUTO", ncol = 4
)
You can put all the function arguments in a list and use do.call to call the function with the arguments:
plot_order <- c(
35, 5, 3, 2,
34, 1, 4, 6,
32, 8, NA, NA
)
plot_grid_args <- c(plot_list[plot_order], list(ncol = 4))
do.call(plot_grid, plot_grid_args)
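The NA entries in plot_order rely on a base R rule: indexing a list with NA returns NULL for that slot, and NULL is what the explicit plot_grid call earlier uses for a blank cell. A small illustration with a stand-in list (the names items and picked are made up for this sketch):

```r
# Stand-in for plot_list: any list behaves the same way
items <- list("plot_a", "plot_b", "plot_c")

# NA in the index vector yields a NULL element at that position
picked <- items[c(3, 1, NA, 2)]
vapply(picked, is.null, logical(1))
```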
I was finally able to solve this problem.
I made a variable, layout, holding the position of each drug in the split list, in grid order. For example, drug_35 has to come first on the grid and sits at the 35th position in the split list, so 35 comes first in layout, and so on.
I made a text file with the grid layout as shown in the image above, read that file in the R script, and built the layout variable from it with a few lines of code. For simplicity I am not showing those lines here, but I hope the concept is clear.
lay <- read.delim("layout.txt",stringsAsFactors = FALSE,sep = "\t", header = F)
lay1 = c(t(lay))
col_n = ncol(lay)
row_n = nrow(lay)
split <- split(file,rep(1:35,each=10))
## layout = 35 5 3 2 34 1 4 6 32 8 0 0 30 7 33 .....
## 0 means blank spaces
png("PLOT.png", width = 6, height = 10, units = "in", res = 400)
par(mfrow=c(row_n,col_n),mar=c(2,0.7,1.5,0.5)) ## margins: bottom, left, top and right
for(i in layout)
{
if(i== 0) { frame(); next; }
## Here if 0 comes then the for loop will be skipped and frame() will generate a blank in the grid image
data <- split[[i]]
c <- data$conc
v <- data$viab
plot(c,v,xlab = NULL,ylab = NULL, axes = F,log = "x")
}
dev.off()
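The lines that build the layout vector are not shown in the answer. One possible way to do it, as a sketch under assumptions: the layout is a character matrix of drug names with empty strings for blank cells, and the list elements are in the order given by drug_names. Both objects below are hypothetical.

```r
# Hypothetical layout: drug names in grid order, "" marking a blank cell
lay <- matrix(c("Drug_3", "Drug_1", "",
                "Drug_2", "",       ""),
              nrow = 2, byrow = TRUE)

# Assumed order of the elements in the split list
drug_names <- paste0("Drug_", 1:3)

# Row-wise vector of list positions; cells without a match become 0
layout_idx <- match(c(t(lay)), drug_names, nomatch = 0)
layout_idx
```

A vector like layout_idx can then drive the plotting loop, with 0 handled by frame() as in the answer.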

R getting subtrees from dendrogram based on cutree labels

I have clustered a large dataset and found 6 clusters I am interested in analyzing more in depth.
I found the clusters using hclust with "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.
For example
library(gplots)
library(dendextend)
data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)
I used cutree to get the labels for each of the rows in my dataset.
I know I can get the data for each corresponding cluster (cluster 1 for example) with:
c1_data = data[clusters == 1,]
Is there any easy way to get the subtrees for each corresponding label as returned by dendextend::cutree? For example, say I am interested in getting the
I know I can access the branches of the dendrogram doing something like
subtree <- dnd[[1]][[2]]
but how I can get exactly the subtree corresponding to cluster 1?
I have tried
dnd[clusters == 1]
but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?
================= UPDATED answer
This can now be solved using the get_subdendrograms from dendextend.
# needed packages:
# install.packages("gplots")
# install.packages("viridis")
# install.packages("devtools")
# devtools::install_github('talgalili/dendextend') # dendextend from github
library(dendextend) # provides %>%, set, color_branches and get_subdendrograms
# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
dend_list <- get_subdendrograms(dend, 5)
# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)
This can also be used within a heatmap:
# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
================= OLDER answer
I think what can be helpful for you are these two functions:
The first one just iterates through all clusters and extracts substructure. It requires:
the dendrogram object from which we want to get the subdendrograms
the clusters labels (e.g. returned by cutree)
Returns a list of subdendrograms.
extractDendrograms <- function(dendr, clusters){
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dendr, which(clusters==clust.id))
})
}
The second one performs a depth-first search to determine in which subtree the cluster exists and, if that subtree matches the full cluster, returns it. Here, we assume that all elements of a cluster are in one subtree. It requires:
the dendrogram object
positions of the elements in cluster
Returns the subdendrogram corresponding to the cluster of given elements.
getSubDendrogram<-function(dendr, my.clust){
if(all(unlist(dendr) %in% my.clust))
return(dendr)
if(any(unlist(dendr[[1]]) %in% my.clust ))
return(getSubDendrogram(dendr[[1]], my.clust))
else
return(getSubDendrogram(dendr[[2]], my.clust))
}
Using these two functions we can use the variables you have provided in the question and get the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6) )
my.sub.dendrograms <- extractDendrograms(dnd, clusters)
plotting all six elements from the list gives all subdendrograms
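As a self-contained check of the depth-first approach, the sketch below runs it in base R only, using stats::cutree on the hclust object as suggested above. The function is a renamed copy of getSubDendrogram (the snake_case name is just for this example), and the final line compares subtree sizes with the cluster sizes.

```r
hc <- hclust(dist(iris[, 1:4]), method = "ward.D")
dnd <- as.dendrogram(hc)
clusters <- cutree(hc, k = 6)   # labels in the order of the data rows

# Descend until the current node contains only members of the cluster
get_sub_dendrogram <- function(dendr, members) {
  if (all(unlist(dendr) %in% members)) return(dendr)
  if (any(unlist(dendr[[1]]) %in% members)) {
    get_sub_dendrogram(dendr[[1]], members)
  } else {
    get_sub_dendrogram(dendr[[2]], members)
  }
}

sub_dends <- lapply(sort(unique(clusters)), function(id) {
  get_sub_dendrogram(dnd, which(clusters == id))
})

# Number of leaves in each extracted subtree
sizes <- vapply(sub_dends, function(d) attr(d, "members"), numeric(1))
sizes
```

Each entry of sizes should equal the corresponding count in table(clusters), which is a quick way to confirm that every subtree covers exactly one cluster.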
EDIT
As suggested in the comment, I add a function that as an input takes a dendrogram dend and the number of subtrees k, but it still uses the previously defined, recursive function getSubDendrogram:
prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=FALSE) {
clusters <- cutree(dend, k, order_clusters_as_data)
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dend, which(clusters==clust.id))
})
}
A test case for 5 substructures:
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
subdend.list <- prune_cutree_to_dendlist(dend, 5)
#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(subdend.list, plot)
I have performed some benchmark using rbenchmark with the function suggested by Tal Galili (here named prune_cutree_to_dendlist2) and the results are quite promising for the DFS approach from the above:
library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5),
prune_cutree_to_dendlist2(dend, 5), replications=5)
                                test replications elapsed relative user.self
1  prune_cutree_to_dendlist(dend, 5)            5    0.02        1     0.020
2 prune_cutree_to_dendlist2(dend, 5)            5   60.82     3041    60.643
I have now written a function, prune_cutree_to_dendlist, to do what you asked for. I should add it to dendextend at some point in the future.
In the meantime, here is an example of the code and output. (The function is a bit slow. Making it faster relies on making prune faster, which I won't get to in the near future.)
# install.packages("dendextend")
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
set("labels_to_character")
dend <- dend %>% color_branches(k=5)
# plot(dend)
prune_cutree_to_dendlist <- function(dend, k) {
clusters <- cutree(dend,k, order_clusters_as_data = FALSE)
# unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
# k <- length(unique_clusters)
# for(i in unique_clusters) {
dends <- vector("list", k)
for(i in 1:k) {
leves_to_prune <- labels(dend)[clusters != i]
dends[[i]] <- prune(dend, leves_to_prune)
}
class(dends) <- "dendlist"
dends
}
prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
How did you get 6 clusters using hclust? You can cut the tree at any point, so you can just ask cutree to give you more clusters:
clusters = cutree(hclusters, number_of_clusters)
If you have a lot of data this may not be very handy though. In those cases I manually pick the clusters that I want to study further and then run hclust only on the data in these clusters. I don't know of any functionality in hclust that allows you to do this automatically, but it's quite easy:
good_clusters = c(which(clusters==1),
which(clusters==2)) #or whichever clusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(dist(new_df)) # hclust needs a dissimilarity object, not the raw data
new_clusters = cutree(new_hclusters, new_number_of_clusters)

Color branches of dendrogram using an existing column

I have a data frame which I am trying to cluster. I am using hclust right now. In my data frame, there is a FLAG column which I would like to color the dendrogram by. By the resulting picture, I am trying to figure out similarities among various FLAG categories. My data frame looks something like this:
FLAG ColA ColB ColC ColD
I am clustering on colA, colB, colC and colD. I would like to cluster these and color them according to FLAG categories. Ex - color red if 1, blue if 0 (I have only two categories). Right now I am using the vanilla version of cluster plotting.
hc<-hclust(dist(data[2:5]),method='complete')
plot(hc)
Any help in this regard would be highly appreciated.
If you want to color the branches of a dendrogram based on a certain variable then the following code (largely taken from the help for the dendrapply function) should give the desired result:
x<-1:100
dim(x)<-c(10,10)
groups<-sample(c("red","blue"), 10, replace=TRUE)
x.clust<-as.dendrogram(hclust(dist(x)))
local({
colLab <<- function(n) {
if(is.leaf(n)) {
a <- attributes(n)
i <<- i+1
attr(n, "edgePar") <-
c(a$nodePar, list(col = mycols[i], lab.font= i%%3))
}
n
}
mycols <- groups
i <- 0
})
x.clust.dend <- dendrapply(x.clust, colLab)
plot(x.clust.dend)
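Note that the code above assigns colours in the order the leaves are visited. If the colours should follow the rows of the original data (as with a FLAG column), one option is to look up each leaf's original row index instead. The sketch below is base R only; the flag vector is a made-up stand-in for the FLAG column, and color_leaf_edges is a hypothetical name.

```r
set.seed(1)
x <- matrix(1:100, nrow = 10)
flag <- sample(c(0, 1), 10, replace = TRUE)   # hypothetical FLAG column
leaf_cols <- ifelse(flag == 1, "red", "blue") # one colour per data row

dend <- as.dendrogram(hclust(dist(x)))

color_leaf_edges <- function(node) {
  if (is.leaf(node)) {
    # For a leaf, order.dendrogram() returns its row index in the data
    attr(node, "edgePar") <- list(col = leaf_cols[order.dendrogram(node)])
  }
  node
}

coloured <- dendrapply(dend, color_leaf_edges)
plot(coloured)
```

With this lookup, the edge leading to each leaf gets the colour of the row it represents, regardless of where the leaf ends up in the tree.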
I think Arhopala's answer is good. I took the liberty of going a step further and added the function assign_values_to_leaves_edgePar to the dendextend package (starting from version 0.17.2, which is now on github). This version of the function is a bit more robust and flexible than Arhopala's answer since:
It is a general function which can work in different problems/settings
The function can deal with other edgePar parameters (col, lwd, lty)
The function offers recycling of partial vectors, and various warning messages when needed.
To install the dendextend package you can use install.packages('dendextend'), but for the latest version, use the following code:
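require2 <- function (package, ...) {
# character.only = TRUE is needed because the package name is passed as a string
if (!require(package, character.only = TRUE)) install.packages(package); library(package, character.only = TRUE)
}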
## require2('installr')
## install.Rtools() # run this if you are using Windows and don't have Rtools installed (you must have it for devtools)
# Load devtools:
require2("devtools")
devtools::install_github('talgalili/dendextend')
Now that we have dendextend installed, here is a second take on Arhopala's answer:
x<-1:100
dim(x)<-c(10,10)
set.seed(1)
groups<-sample(c("red","blue"), 10, replace=TRUE)
x.clust<-as.dendrogram(hclust(dist(x)))
x.clust.dend <- x.clust
x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = groups, edgePar = "col") # add the colors.
x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = 3, edgePar = "lwd") # make the lines thick
plot(x.clust.dend)
Here is the result:
p.s.: I personally prefer using pipes for this type of coding (which will give the same result as above, but is easier to read):
x.clust <- x %>% dist %>% hclust %>% as.dendrogram
x.clust.dend <- x.clust %>%
assign_values_to_leaves_edgePar(value = groups, edgePar = "col") %>% # add the colors.
assign_values_to_leaves_edgePar(value = 3, edgePar = "lwd") # make the lines thick
plot(x.clust.dend)

R: How to have nice intersections between multiple wireframes (lattice)

Given the following R code:
require(lattice)
x <- c(1:10)
y <- c(1:10)
g <- expand.grid(x = 1:10, y = 1:10, gr = 1:2)
g$z <- c(as.vector(outer(x,y,"*")), rep(50,100))
wireframe(z ~ x * y, data = g, groups = gr)
The intersection of the resulting surfaces is ugly, since it follows the grid-lines.
Is there a way to make the intersection between the surfaces look nicer (besides increasing the resolution of the grid)? Maybe by passing some parameters or using another package for visualization?
Well, since I couldn't resist goofing off, here are a few possibilities for ways to smooth your data.
The package scvm appears to have some 2D model-fitting tools.
The fields package is recommended here: How can I smooth an array in R?
The DiceKriging package is reviewed here: https://stats.stackexchange.com/questions/13510/fitting-multivariate-natural-cubic-spline

Add to ggplot with element of different length

I'm new to ggplot2 and I'm trying to figure out how I can add a line to an already existing plot I created. The original plot, which is the cumulative distribution of a column of data T1 from a data frame x, has about 100,000 elements in it. I have successfully plotted this using ggplot2 and stat_ecdf() with the code I posted below. Now I want to add another line using a set of (x,y) coordinates, but when I try this using geom_line() I get the error message:
Error in data.frame(x = c(0, 7.85398574631245e-07, 3.14159923334398e-06, :
arguments imply differing number of rows: 1001, 100000
Here's the code I'm trying to use:
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(aes(ts,ps))
That's what produces the error from above. Now here's the code using base graphics that I used to use but that I am now trying to move away from:
plot(ecdf(x$T1),xlab="T1",ylab="Cum. Prob.",xlim=c(0,4),ylim=c(0,1),main="Empirical vs. Theoretical Distribution of T1")
lines(ts,ps)
I've seen some other posts about adding lines in general, but what I haven't seen is how to add a line when the two originating vectors are not of the same length. (Note: I don't want to just use 100,000 (x,y) coordinates.)
As a bonus, is there an easy way, similar to using abline, to add a drop line on a ggplot2 graph?
Any advice would be much appreciated.
ggplot works with data.frames, so you need to put ts and ps into a data.frame and then pass this extra data.frame to geom_line through its data argument:
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
tpdf <- data.frame(ts=ts,ps=ps)
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(data=tpdf, aes(ts,ps))
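The bonus question about a drop line is not covered above. In ggplot2, geom_vline and geom_hline play roughly the role of abline for vertical and horizontal lines. The sketch below is a variation on the answer's code under a few assumptions: the sample is smaller, ps stops at 0.999 so that qchisq does not return Inf, and the drop line position (x = 2) is arbitrary.

```r
library(ggplot2)

set.seed(42)
x <- data.frame(T1 = rchisq(1000, 1))

# Theoretical curve in its own data frame, as in the answer
ps <- seq(0, 0.999, 0.001)
theory <- data.frame(ts = 0.5 * qchisq(ps, 1), ps = ps)

p <- ggplot(x, aes(T1)) +
  stat_ecdf() +
  geom_line(data = theory, aes(ts, ps), colour = "red") +
  geom_vline(xintercept = 2, linetype = "dashed") +  # vertical drop line
  coord_cartesian(xlim = c(0, 4))
p
```

For a line that stops at the curve rather than spanning the whole panel, geom_segment with explicit start and end coordinates is the usual alternative.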

Resources