How to label just one observation in hierarchical clustering tree with dendextend?

I'd like to create a hierarchical clustering tree of a relatively large dataset (>3000 obs). Unfortunately, by including so many labels at the terminal nodes, the tree looks very cluttered and contains lots of unnecessary information. So to reduce the clutter, I'd like to just label one observation of interest. I have removed all of the labels but I don't know how to retrieve and add the label that I'm interested in.
For this MWE, let's assume I'd like to add only the label k to my dendrogram.
library(dendextend)
library(cluster)
library(tidyverse)
set.seed(1)
a <- rnorm(20)
b <- rnorm(20)
c <- rnorm(20)
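# data.frame() builds three columns; as.data.frame(a, b, c) would treat b as row names
df <- data.frame(a, b, c)
# name the 20 observations "a" to "t" so that "k" exists as a leaf label
rownames(df) <- letters[seq_len(nrow(df))]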
my_dist <- dist(df)
my_clust <- hclust(my_dist)
my_dend <- as.dendrogram(my_clust)
plot(color_branches(my_dend, k = 3), leaflab = "none", horiz = TRUE)

You can specify the labels with the set function. If you only want to show one, set the others to the empty string.
LAB = rep("", nobs(my_dend))
LAB[15] = "N15"
my_dend = set(my_dend, "labels", LAB)
plot(color_branches(my_dend, k = 3), horiz = TRUE)
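Note that the labels of a dendrogram are stored in leaf (plotting) order, so position 15 is the 15th leaf, not necessarily observation 15. A minimal sketch of selecting the label by name instead, assuming the observations are named "a" to "t" via row names (an assumption about what the MWE intended):

```r
library(dendextend)

set.seed(1)
df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
rownames(df) <- letters[seq_len(nrow(df))]   # assumed naming: "a" to "t"

my_dend <- as.dendrogram(hclust(dist(df)))

# Colour the branches first, while the labels are still unique
my_dend <- color_branches(my_dend, k = 3)

# labels() returns the leaf labels in plotting order, not in data order
LAB <- labels(my_dend)
LAB[LAB != "k"] <- ""                        # blank every label except "k"

my_dend <- set(my_dend, "labels", LAB)
plot(my_dend, horiz = TRUE)
```

Colouring before blanking the labels is a deliberate choice here: cutting a tree whose labels are no longer unique may trigger warnings.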

Circular heat maps in R?

Similar questions have been asked before; however, none of the answers solve my problem.
I'm trying to join together two (or more) separate heat maps and turn them into a circle, similar to the example in the circlize package tutorial.
In my data, I have multiple matrices, where each matrix represents a different year. I want to create a circular heat map where each section of the circle is a single year.
In my example below, I am just using 2 years (so 2 heat maps), but I can't seem to get it to work:
library(circlize)
# create matrix
mat1 <- matrix(runif(80), 10, 8)
mat2 <- matrix(runif(80), 10, 8)
rownames(mat1) <- rownames(mat2) <- paste0('a', 1:10)
colnames(mat1) <- colnames(mat2) <- paste0('b', 1:8)
# join together
matX <- cbind(mat1, mat2)
# set splits
split <- c(rep('a', 8), rep('b', 8))
split = factor(split, levels = unique(split))
# create circular heatmap
col_fun1 = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
circos.heatmap(matX, split = split, col = col_fun1, rownames.side = "inside")
circos.clear()
The above code makes:
I'm not sure where I am going wrong. When I use the ComplexHeatmap package, the matrices are split correctly, as shown below:
# using ComplexHeatmap package
library(ComplexHeatmap)
Heatmap(matX, column_split = split, show_row_dend = FALSE, show_column_dend = FALSE)
Any suggestions as to how I could achieve this?
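One possible direction, sketched under the assumption that circos.heatmap() arranges the rows of the matrix around the circle and applies split to rows (so a split of length 16 cannot match a matrix with 10 rows): stack the yearly matrices by row and split on the year. The year prefixes in the row names are hypothetical and only keep the names unique.

```r
library(circlize)

set.seed(1)
mat1 <- matrix(runif(80), 10, 8)
mat2 <- matrix(runif(80), 10, 8)
colnames(mat1) <- colnames(mat2) <- paste0("b", 1:8)
rownames(mat1) <- paste0("y1_a", 1:10)   # hypothetical year prefixes
rownames(mat2) <- paste0("y2_a", 1:10)

# Stack by row: each year contributes a block of rows, i.e. one sector
matX <- rbind(mat1, mat2)

# One split value per row of matX
split <- factor(rep(c("year1", "year2"), each = nrow(mat1)),
                levels = c("year1", "year2"))

col_fun1 <- colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
circos.heatmap(matX, split = split, col = col_fun1, rownames.side = "inside")
circos.clear()
```

With this layout the columns b1 to b8 become the concentric rings, which differs from the ComplexHeatmap column split above; transpose each matrix first if the original columns should run around the circle.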

How do I display box-plots of different data sets above each other in R?

I'm new to R and wondering whether it is possible to display these two box plots either side by side or above each other to allow for comparison, rather than producing two separate box plots.
PBe <- PB$`Enterococci (cfu/100ml)`
BRe <- BR$`Enterococci (cfu/100ml)`
boxplot(BRe, horizontal = TRUE, col = "3", outline=FALSE)
boxplot(PBe, horizontal = TRUE, col = "4", outline=FALSE)
You could use the boxplot function directly, passing both vectors as a named list:
boxplot(list(BRe = BRe, PBe = PBe), col = c(3, 4))
You can add any other parameters you need.
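As a small illustration of carrying over the other parameters, here is a sketch that reuses horizontal = TRUE and outline = FALSE from the question. The data are simulated stand-ins, since the original PB and BR data frames are not available:

```r
set.seed(1)
# Simulated stand-ins for the two Enterococci vectors (assumed data)
BRe <- sample(100:1000, 50, replace = TRUE)
PBe <- sample(100:1000, 100, replace = TRUE)

# One call draws both boxes on a shared axis, stacked above each other
bp <- boxplot(list(BRe = BRe, PBe = PBe),
              horizontal = TRUE, outline = FALSE,
              col = c(3, 4),
              xlab = "Enterococci (cfu/100ml)")
```

Because both boxes share one axis, the two distributions can be compared directly.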
We obviously don't have your data, so let's make a minimal reproducible example.
First we create two data frames, one called PB and one called BR. Each has a numeric column called Enterococci (cfu/100ml) containing random numbers between 100 and 1000:
set.seed(1)
PB <- data.frame(a = sample(100:1000, 100, TRUE))
BR <- data.frame(a = sample(100:1000, 50, TRUE))
names(PB) <- "Enterococci (cfu/100ml)"
names(BR) <- "Enterococci (cfu/100ml)"
Now, if we extract these columns as per your code, we can concatenate them using c():
PBe <- PB$`Enterococci (cfu/100ml)`
BRe <- BR$`Enterococci (cfu/100ml)`
value <- c(PBe, BRe)
Now, the trick is to create another vector that labels which data frame these numbers originally came from as a factor variable. We can do that with:
dataset <- factor(c(rep("PB", nrow(PB)), rep("BR", nrow(BR))))
And now we can just call plot on these two vectors. This will automatically give us a side-by-side boxplot:
plot(dataset, value, xlab = "Data set", ylab = "Enterococci (cfu/100ml)")
If you would prefer it to be horizontal, we can do:
boxplot(value ~ dataset, horizontal = TRUE,
        ylab = "Data set",
        xlab = "Enterococci (cfu/100ml)")

How to combine state distribution plot and separate legend in traminer?

Plotting several clusters using seqdplot in TraMineR can make the legend messy, especially in combination with numerous states. This calls for additional options for modifying the legend, which are available with the function seqlegend. However, I have a hard time combining a state distribution plot (seqdplot) with a separate modified legend (seqlegend). Ideally one wants to plot the clusters (e.g. 9) without a legend and then add the separate legend in the available bottom right cell, but instead the separate legend generates a new plot window. Can anyone help?
Here's an example using the biofam data. With the data I use in my own research the legend becomes much more messy since I have 11 states.
#Data
library(TraMineR)
library(WeightedCluster)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
#OM distances
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE")
#9 clusters
wardCluster <- hclust(as.dist(biofam.om), method = "ward.D2")
cluster9 <- cutree(wardCluster, k = 9)
#State distribution plot
seqdplot(biofam.seq, group = cluster9, with.legend = FALSE)
#Separate legend
seqlegend(biofam.seq, title = "States", ncol = 2)
#Combine state distribution plot and separate legend
#??
Thank you.
The seqplot function does not let you control the number of columns of the legend, nor does it let you add a legend title. So you have to compose the plot yourself by generating a separate plot for each group with the legend disabled and adding the legend afterwards. Here is how you can do that:
cluster9 <- factor(cluster9)
levc <- levels(cluster9)
lev <- length(levc)
# 5 x 2 grid: nine cluster panels plus one cell for the legend
par(mfrow=c(5,2))
for (i in 1:lev)
  seqdplot(biofam.seq[cluster9 == levc[i],], border=NA, main=levc[i], with.legend=FALSE)
seqlegend(biofam.seq, ncol=4, cex = 1.2, title='States')
Update, Oct 1, 2018:
Since TraMineR V 2.0-9, the seqplot family of functions supports (when applicable) the argument ncol to control the number of columns in the legend. To add a title to the legend, you still have to proceed as shown above.
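A minimal sketch of that newer route, assuming TraMineR 2.0-9 or later is installed and that ncol is passed straight to seqdplot as the update describes:

```r
library(TraMineR)

data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

# Same OM distances and Ward clustering as in the question
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE")
cluster9 <- cutree(hclust(as.dist(biofam.om), method = "ward.D2"), k = 9)

# Grouped state distribution plot with a two-column legend (no legend title)
seqdplot(biofam.seq, group = cluster9, border = NA, ncol = 2)
```

This keeps the automatic layout of the grouped plot; the manual par(mfrow) composition remains the option when a legend title is needed.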
AFAIK seqlegend() doesn't work when the other plots you are drawing use the group argument. In your case the only thing seqlegend() adds is the title "States". If you want to customize what appears in the legend, you can do that by providing the alphabet and the state labels used in your analysis.
The package's website has several walkthroughs and guides covering the various options.
#Data
library(TraMineR)
library(WeightedCluster)
data(biofam)
## Generate alphabet and states
alphabet <- 0:7
states <- letters[seq_along(alphabet)]
biofam.seq <- seqdef(biofam[501:600, 10:25], states = states, alphabet = alphabet)
#OM distances
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE")
#9 clusters
wardCluster <- hclust(as.dist(biofam.om), method = "ward.D2")
cluster9 <- cutree(wardCluster, k = 9)
#State distribution plot
seqdplot(biofam.seq, group = cluster9, with.legend = TRUE)

Edit betadisper permutest plot

I have used the script below to generate a betadisper plot comparing 2 communities.
In my "df", the first column contains the station names (13 stations).
I have 2 questions:
1) There is a point behind the "ABC" label, so how do I make the label transparent? Preferably with a different colour for each community.
2) How do I add the station names next to each point so I can visually compare which stations are most similar?
Script:
library(vegan)
# Station names are read from the first column into the row names
df <- read.csv("NMDS matrix_csv_NEW.csv", header = TRUE, row.names = 1, sep = ",")
df
Label <- rownames(df)
Label
# Dissimilarity matrix (Bray-Curtis is the vegdist default)
dis <- vegdist(df)
groups <- factor(c(rep(1,8), rep(2,5)), labels = c("ABC","DEF"))
groups
mod <- betadisper(dis, groups)
mod
anova(mod)
permutest(mod, pairwise = TRUE)
plot(mod)
plot(mod, ellipse = TRUE, hull = FALSE, main = "MultiVariate Permutation")
To answer 2), here's how to plot the station names on top of the points.
text(mod$vectors[,1:2], labels=Label)
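A slightly fuller sketch of the same idea on dummy data, offsetting the labels so they do not cover the points and colouring them by community. The station names and colours are hypothetical, and label = FALSE is used on the assumption that plot.betadisper in your vegan version accepts it to suppress the centroid labels:

```r
library(vegan)

set.seed(1)
df <- matrix(runif(13 * 100), nrow = 13)
rownames(df) <- paste0("St", 1:13)          # hypothetical station names
groups <- factor(c(rep(1, 8), rep(2, 5)), labels = c("ABC", "DEF"))

mod <- betadisper(vegdist(df), groups)

# Hypothetical colour per community, looked up by group name
grp_cols <- c(ABC = "blue", DEF = "darkgreen")

# Draw the ordination without the boxed centroid labels
plot(mod, ellipse = TRUE, hull = FALSE, label = FALSE,
     main = "MultiVariate Permutation")

# pos = 3 places each name just above its point; cex shrinks the text
text(mod$vectors[, 1:2], labels = rownames(df),
     col = grp_cols[as.character(groups)], pos = 3, cex = 0.7)
```

The rows of mod$vectors follow the row order of the input data, which is why the row names can be used directly as labels.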
Here is a possible solution to your problem.
Download the myplotbetadisp.r file from this link and place the file in the working directory (warning, do not save the file as myplotbetadisp.r.txt!).
Some additional options are available in the myplotbetadisper function:
fillrect, fill colour of the box where centroid labels are printed;
coltextrect, vector of colours for centroid labels;
alphaPoints, alpha transparency for centroid points;
labPoints, vector of labels plotted close to points;
poslabPoints, position specifier for the text in labPoints.
library(vegan)
# A dummy data generation process
set.seed(1)
n <- 100
df <- matrix(runif(13*n),nrow=13)
# Compute dissimilarity indices
dis <- vegdist(df)
groups <- factor(c(rep(1,8), rep(2,5)), labels = c("ABC","DEF"))
# Analysis of multivariate homogeneity of group dispersions
mod <- betadisper(dis, groups)
source("myplotbetadisp.r")
labPts <- LETTERS[1:13]
col.fill.rect <- addAlpha(col2rgb("gray65"), alpha=0.5)
col.text.rect <- apply(col2rgb(c("blue","darkgreen")), 2, addAlpha, alpha=0.5)
transp.centroids <- 0.7
myplotbetadisper(mod, ellipse = TRUE, hull = FALSE,
                 fillrect=col.fill.rect, coltextrect=col.text.rect,
                 alphaPoints=transp.centroids, labPoints=labPts,
                 main= "MultiVariate Permutation")
Hope it can help you.

Add elements to a previous subplot within an active base R graphics device?

Let's say I generate 9 groups of data in a list data and plot them each with a for loop. I could use *apply here too, whichever you prefer.
data = list()
layout(mat = matrix(1:9, nrow = 3))
for(i in 1:9){
  data[[i]] = rnorm(n = 100, mean = i, sd = 1)
  plot(data[[i]])
}
After creating all the data, I want to decide which one is best:
best_data = which.min(sapply(data, sd))
Now I want to highlight that best data on the plot to distinguish it. Is there a plotting function that lets me go back to a specified sub-plot in the active device and add an element (maybe a title)?
I know I could make a second for loop: for loop 1 generates the data, then I assess which is best, then for loop 2 creates the plots, but this seems less efficient and more verbose.
Does such a plotting function exist for base R graphics?
@rawr's answer is simple and easy. But I thought I'd point out another option that allows you to select the "best" data set before you plot, in case you want more flexibility to plot the "best" data set differently from the rest.
For example:
# Create the data: a list of nine numeric vectors
data = lapply(1:9, function(i) rnorm(n = 100, mean = i, sd = 1))
par(mar=c(4,4,1,1))
layout(mat = matrix(1:9, nrow = 3))
rng = range(data)
# Select the vector with the lowest SD once, before plotting
best = which.min(sapply(data, sd))
# Plot each vector, highlighting the one with the lowest SD in red
invisible(lapply(1:9, function(i) {
  plot(data[[i]], col=ifelse(best==i,"red","black"), pch=ifelse(best==i, 3, 1), ylim=rng)
}))
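For the original question of returning to an earlier panel, a minimal sketch using par(mfg), which sets the figure to be drawn in next within an mfrow/mfcol array. It assumes par(mfcol = c(3, 3)) as a stand-in for the column-wise layout() call, and it only adds margin annotations, because the user coordinates after the jump are still those of the last plot drawn:

```r
set.seed(42)
par(mfcol = c(3, 3))                 # 3 x 3 grid, filled column by column

data <- vector("list", 9)
for (i in 1:9) {
  data[[i]] <- rnorm(n = 100, mean = i, sd = 1)
  plot(data[[i]], main = paste("Group", i))
}

best_data <- which.min(sapply(data, sd))

# Convert the panel index into (row, column) for column-wise filling
best_row <- (best_data - 1) %% 3 + 1
best_col <- (best_data - 1) %/% 3 + 1

# Jump back to that panel and annotate it with low-level functions
par(mfg = c(best_row, best_col))
box(col = "red", lwd = 2)
mtext("lowest SD", side = 3, line = 0.2, col = "red", cex = 0.7)
```

To add points or lines in data coordinates after the jump, the panel's par("usr") would have to be saved when it is first drawn and restored afterwards.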
