Clusters labels in dendrogram - r

I'm wondering: is there any way to add cluster labels to a dendrogram? See this simple example:
hc = hclust(dist(mtcars))
plot(hc, hang = -1)
rect.hclust(hc, k = 3, border = "red")
Desired output should look like this:
Thanks for any suggestions!

You need to get the coordinates of the place to put your clusters' labels:
First axis:
As you are calling rect.hclust, you might as well assign the result so you can use it to find where each cluster begins (the first one begins at 1, the second at 1 + the length of the first, etc.).
rh <- rect.hclust(hc, k = 3, border = "red")
beg_clus <- head(cumsum(c(1, lengths(rh))), -1)
Second axis:
You just want to be above the red rectangle, whose top edge sits midway between the height at which the tree has k-1 clusters and the height at which it has k clusters. Let's say you're aiming at 4/5 of that distance instead of 1/2:
y_clus <- weighted.mean(rev(hc$height)[2:3], c(4, 1))
Putting the labels:
text(x=beg_clus, y=y_clus, col="red", labels=LETTERS[1:3], font=2)
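The steps above can be wrapped into one small function so the same placement works for other values of k. This is only a sketch: label_clusters is a hypothetical helper name (not part of base R or any package), the default 4:1 weighting simply repeats the choice made above, and it assumes k is at least 2.

```r
# Hypothetical helper (not from any package): draws the rectangles and
# puts one label at the left edge of each cluster, above the rectangle.
label_clusters <- function(hc, k, labels = LETTERS[seq_len(k)], w = c(4, 1)) {
  # rect.hclust returns the members of each cluster, ordered left to right
  rh <- rect.hclust(hc, k = k, border = "red")
  # x: index of the first leaf of each cluster
  x <- head(cumsum(c(1, lengths(rh))), -1)
  # y: weighted mean of the merge heights for k-1 and k clusters
  y <- weighted.mean(rev(hc$height)[(k - 1):k], w)
  text(x = x, y = y, labels = labels, col = "red", font = 2)
  invisible(list(x = x, y = y))
}

hc <- hclust(dist(mtcars))
plot(hc, hang = -1)
pos <- label_clusters(hc, k = 3)
```

The function returns the coordinates invisibly, so you can reuse them if you want to nudge the labels afterwards.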

An alternative to adding the text labels by hand is the rect.hclust.labels function in the mjcgraphics package, which deals with cluster labels. See https://github.com/drmjc/mjcgraphics and https://rdrr.io/github/drmjc/mjcgraphics/man/rect.hclust.labels.html
library(mjcgraphics)
rect.hclust.labels(hc, k=3, border = 1 ) # adds labels to clusters

Related

Superimpose text across panels

I made a matrix of figures using par(mfrow = c(3, 4)). Now I want to overlay each row with text like this:
What is the simplest way to achieve this?
Thank you!
OPTION 1
Add the text after plotting all figures:
par(list(mfrow = c(3, 4),
mar=c(2,2,1,1)))
lapply(1:12,FUN=function(x) plot(1:100,runif(100),cex=0.2))
##You will have to manually adjust these values to fit your figure
xval = -150
yval = 0.5
y_incr = 1.59
text(x=xval, y=yval, labels="TextToAdd3",col=rgb(0,0,1,0.5), cex=3, xpd=NA)
text(x=xval, y=yval+y_incr, labels="TextToAdd2",col=rgb(0,0,1,0.5), cex=3, xpd=NA)
text(x=xval, y=yval+y_incr*2, labels="TextToAdd1",col=rgb(0,0,1,0.5), cex=3, xpd=NA)
OPTION 2
Centre the caption on the left margin every time you plot in the third column. This means less fiddling with manually adjusted values (the plot looks the same as above):
par(list(mfrow = c(3, 4),
mar=c(2,2,1,1)))
texts=list("TextToAdd1",
"TextToAdd2",
"TextToAdd3")
for(i in 1:12){
plot(1:100,runif(100),cex=0.2)
if((i+1)%%4==0){
# i is 3, 7 or 11 here; ceiling(i/4) gives the row number 1, 2 or 3
mtext(text=texts[[ceiling(i/4)]],side=2,line=par()$mar[2], las=1,col=rgb(0,0,1,0.5), cex=3,adj=0.5)
}
}
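If the index arithmetic in the loop looks opaque, here is a small sketch of how panel numbers map to rows and columns when mfrow fills the grid row by row. The variable names (n_row, n_col, panel_row, panel_col) are made up for illustration.

```r
n_row <- 3
n_col <- 4
panel <- seq_len(n_row * n_col)        # panels are drawn in the order 1..12

panel_row <- ceiling(panel / n_col)    # row of each panel
panel_col <- (panel - 1) %% n_col + 1  # column of each panel

in_third_col <- panel_col == 3
panel[in_third_col]      # 3 7 11
panel_row[in_third_col]  # 1 2 3
```

The test (i+1)%%4==0 used in the loop picks out exactly these third-column panels, and the row number is what you need to index the list of captions.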

Easiest way to plot inequalities with hatched fill?

Refer to the above plot. I have drawn the equations in Excel and then shaded by hand. You can see it is not very neat. You can see there are six zones, each bounded by two or more equations. What is the easiest way to draw inequalities and shade the regions using hatched patterns?
To build on @agstudy's answer, here's a quick-and-dirty way to represent inequalities in R:
plot(NA,xlim=c(0,1),ylim=c(0,1), xaxs="i",yaxs="i") # Empty plot
a <- curve(x^2, add = TRUE) # First curve
b <- curve(2*x^2-0.2, add = TRUE) # Second curve
names(a) <- c('xA','yA')
names(b) <- c('xB','yB')
with(as.list(c(b,a)),{
id <- yB<=yA
# b<a area
polygon(x = c(xB[id], rev(xA[id])),
y = c(yB[id], rev(yA[id])),
density=10, angle=0, border=NULL)
# b>a area
polygon(x = c(xB[!id], rev(xA[!id])),
y = c(yB[!id], rev(yA[!id])),
density=10, angle=90, border=NULL)
})
If the area in question is surrounded by more than 2 equations, just add more conditions:
plot(NA,xlim=c(0,1),ylim=c(0,1), xaxs="i",yaxs="i") # Empty plot
a <- curve(x^2, add = TRUE) # First curve
b <- curve(2*x^2-0.2, add = TRUE) # Second curve
d <- curve(0.5*x^2+0.2, add = TRUE) # Third curve
names(a) <- c('xA','yA')
names(b) <- c('xB','yB')
names(d) <- c('xD','yD')
with(as.list(c(a,b,d)),{
# Basically you have three conditions:
# curve a is below curve b, curve b is below curve d and curve d is above curve a.
# For each curve, keep the coordinates that satisfy the two conditions that concern it.
idA <- yA<=yD & yA<=yB
idB <- yB>=yA & yB<=yD
idD <- yD<=yB & yD>=yA
polygon(x = c(xB[idB], xD[idD], rev(xA[idA])),
y = c(yB[idB], yD[idD], rev(yA[idA])),
density=10, angle=0, border=NULL)
})
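Another way to look at the three-curve case, as a rough sketch: evaluate all curves on one shared x grid, then take the region as "above a and below both b and d". This is a variation on the code above rather than the same method; it does not reuse the output of curve(), and the grid of 201 points is an arbitrary choice.

```r
x  <- seq(0, 1, length.out = 201)   # shared grid (size is arbitrary)
yA <- x^2
yB <- 2 * x^2 - 0.2
yD <- 0.5 * x^2 + 0.2

lower <- yA              # region is above curve a ...
upper <- pmin(yB, yD)    # ... and below both curve b and curve d
ok <- lower <= upper     # x values where such a region exists

plot(NA, xlim = c(0, 1), ylim = c(0, 1), xaxs = "i", yaxs = "i",
     xlab = "x", ylab = "y")
lines(x, yA); lines(x, yB); lines(x, yD)
# Walk along the lower boundary, then back along the upper one
polygon(x = c(x[ok], rev(x[ok])),
        y = c(lower[ok], rev(upper[ok])),
        density = 10, angle = 45, border = NA)
```

This only works as written when the x values that satisfy the conditions form one contiguous run, which is the case for these three curves.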
In R, there is only limited support for fill patterns, and they can only be
applied to rectangles and polygons. This is available only in traditional (base) graphics, not in ggplot2 or lattice.
It is possible to fill a rectangle or polygon with a set of lines drawn
at a certain angle, with a specific separation between the lines. A density
argument controls the separation between the lines (in terms of lines per inch)
and an angle argument controls the angle of the lines.
Here is an example from the help:
plot(c(1, 9), 1:2, type = "n")
polygon(1:9, c(2,1,2,1,NA,2,1,2,1),
density = c(10, 20), angle = c(-45, 45))
EDIT
Another option is to use alpha blending to differentiate between regions. Here, using @plannapus's example and the gridBase package to superimpose polygons, you can do something like this:
library(grid) # provides pushViewport, grid.polygon and gpar
library(gridBase)
vps <- baseViewports()
pushViewport(vps$figure,vps$plot)
with(as.list(c(a,b,d)),{
grid.polygon(x = xA, y = yA,gp =gpar(fill='red',lty=1,alpha=0.2))
grid.polygon(x = xB, y = yB,gp =gpar(fill='green',lty=2,alpha=0.2))
grid.polygon(x = xD, y = yD,gp =gpar(fill='blue',lty=3,alpha=0.2))
}
)
upViewport(2)
There are several submissions on the MATLAB Central File Exchange that will produce hatched plots in various ways for you.
I think a tool that will come in handy for you here is gnuplot.
Take a look at the following demos:
fillbetween
statistics
some tricks

avoiding over-crowding of labels in r graphs

I am working on avoiding overcrowding of the labels in the following plot:
set.seed(123)
position <- c(rep (0,5), rnorm (5,1,0.1), rnorm (10, 3,0.1), rnorm (3, 4, 0.2), 5, rep(7,5), rnorm (3, 8,2), rnorm (10,9,0.5),
rep (0,5), rnorm (5,1,0.1), rnorm (10, 3,0.1), rnorm (3, 4, 0.2), 5, rep(7,5), rnorm (3, 8,2), rnorm (10,9,0.5))
group <- c(rep (1, length (position)/2),rep (2, length (position)/2) )
mylab <- paste ("MR", 1:length (group), sep = "")
barheight <- 0.5
y.start <- c(group-barheight/2)
y.end <- c(group+barheight/2)
mydf <- data.frame (position, group, barheight, y.start, y.end, mylab)
plot(0,type="n",ylim=c(0,3),xlim=c(0,10),axes=F,ylab="",xlab="")
#Create two horizontal lines
require(fields)
yline(1,lwd=4)
yline(2,lwd=4)
#Create text for the lines
text(10,1.1,"Group 1",cex=0.7)
text(10,2.1,"Group 2",cex=0.7)
#Draw vertical bars
lng = length(position)/2
lg1 = lng+1
lg2 = lng*2
segments(mydf$position[1:lng],mydf$y.start[1:lng],y1=mydf$y.end[1:lng])
segments(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2],y1=mydf$y.end[lg1:lg2])
text(mydf$position[1:lng],mydf$y.start[1:lng]+0.65, mydf$mylab[1:lng], srt = 90)
text(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2]+0.65, mydf$mylab[lg1:lg2], srt = 90)
You can see some areas are crowded with labels, where the x values are the same or similar. I want to display only one label when there are multiple labels at the same point. For example,
mydf$position[1:5] are all 0,
but corresponding labels mydf$mylab[1:5] -
MR1 MR2 MR3 MR4 MR5
I just want to display the first one "MR1".
Similarly, points that are too close together (say, within a difference of 0.35) should be considered a single cluster, and only the first label should be displayed. In this way I would be able to get rid of the overcrowding of labels. How can I achieve this?
If you space the labels out and add some extra lines you can label every marker.
clpl <- function(xdata, names, y=1, dy=0.25, add=FALSE){
o = order(xdata)
xdata=xdata[o]
names=names[o]
if(!add)plot(0,type="n",ylim=c(y-1,y+2),xlim=range(xdata),axes=F,ylab="",xlab="")
abline(h=y,lwd=4)
segments(xdata,y-dy,xdata,y+dy)
tpos = seq(min(xdata),max(xdata),len=length(xdata))
text(tpos,y+2*dy,names,srt=90,adj=0)
segments(xdata,y+dy,tpos,y+2*dy)
}
Then using your data:
clpl(mydf$position[lg1:lg2],mydf$mylab[lg1:lg2])
gives:
You could then think about labelling clusters underneath the main line.
I've not given much thought to doing multiple lines in a plot, but I think with a bit of mucking with my code and the add parameter it should be possible. You could also use colour to show clusters. I'm fairly sure these techniques are present in some of the clustering packages for R...
Obviously with a lot of markers even this is going to get smushed, but with a lot of clusters the same thing is going to happen. Maybe you end up labelling clusters with this technique?
In general, I agree with @Joran that cluster labelling can't be automated, but you've said that labelling a group of lines with the first label in the cluster would be OK, so it is possible to automate some of the process.
Putting the following code after the line lg2 = lng*2 gives the result shown in the image below:
clust <- cutree(hclust(dist(mydf$position[1:lng])),h=0.75)
u <- rep(T,length(unique(clust)))
clust.labels <- sapply(c(1:lng),function (i)
{
if (u[clust[i]])
{
u[clust[i]] <<- F
as.character(mydf$mylab)[i]
}
else
{
""
}
})
segments(mydf$position[1:lng],mydf$y.start[1:lng],y1=mydf$y.end[1:lng])
segments(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2],y1=mydf$y.end[lg1:lg2])
text(mydf$position[1:lng],mydf$y.start[1:lng]+0.65, clust.labels, srt = 90)
text(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2]+0.65, mydf$mylab[lg1:lg2], srt = 90)
(I've only labelled the clusters on the lower line -- the same principle could be applied to the upper line too). The parameter h of cutree() might have to be adjusted case-by-case to give the resolution of labels that you want, but this approach is at least easier than labelling every cluster by hand.
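The "first label per cluster" step can also be written without the <<- bookkeeping, for example with duplicated(). A minimal sketch on made-up toy positions (these numbers are not the data from the question), using the same cutree(hclust(dist(...)), h = 0.75) clustering as above:

```r
# Toy data for illustration only
position <- c(0, 0, 0, 1.0, 1.1, 3.0, 3.05, 5, 7, 7)
labs <- paste0("MR", seq_along(position))

# Same clustering as above: cut the tree at height 0.75
clust <- cutree(hclust(dist(position)), h = 0.75)

# Keep the label of the first member of each cluster, blank out the rest
first_only <- ifelse(duplicated(clust), "", labs)
first_only
# "MR1" "" "" "MR4" "" "MR6" "" "MR8" "MR9" ""
```

The resulting vector can be passed to text() in place of clust.labels.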

Bubble chart for integer variables where the largest bubble has a diameter of 1 (on the x or y axis scale)?

I want to achieve the following outcomes:
1. Rescale the size of the bubbles such that the largest bubble has a diameter of 1 (on whichever has the more compressed scale of the x and y axes).
2. Rescale the size of the bubbles such that the smallest bubble has a diameter of 1 mm.
3. Have a legend with the first and last points the minimum non-zero frequency and the maximum frequency.
The best I have been able to do is as follows, but I need a more general solution where the value of maxSize is computed rather than hard-coded. If I was doing it in the traditional R plots I would use par("pin") to work out the size of plot area and work backwards, but I cannot figure out how to access this information with ggplot2. Any suggestions?
library(ggplot2)
agData = data.frame(
class=rep(1:7,3),
drv = rep(1:3,rep(7,3)),
freq = as.numeric(xtabs(~class+drv,data = mpg))
)
agData = agData[agData$freq != 0,]
rng = range(agData$freq)
mn = rng[1]
mx = rng[2]
minimumArea = mx - mn
maxSize = 20
minSize = max(1,maxSize * sqrt(mn/mx))
qplot(class,drv,data = agData, size = freq) + theme_bw() +
scale_area(range = c(minSize,maxSize),
breaks = seq(mn,mx,minimumArea/4), limits = rng)
Here is what it looks like so far:
When no ggplot, lattice or other high-level package seems to do the job without hours of fine-tuning, I always revert to base graphics. The following code gets you what you want, and after it I have another example based on how I would have plotted it.
Note, however, that I have set the maximum radius to 1 cm; divide size.range by 2 if you want the sizes to be diameters instead. I just thought radius gave me nicer plots, and you'll probably want to adjust things anyway.
size.range <- c(.1, 1) # Min and max radius of circles, in cm
# Calculate the relative radius of each circle
radii <- sqrt(agData$freq)
radii <- diff(size.range)*(radii - min(radii))/diff(range(radii)) + size.range[1]
# Plot in two panels
mar0 <- par("mar")
layout(t(1:2), widths=c(4,1))
# Panel 1: The circles
par(mar=c(mar0[1:3],.5))
symbols(agData$class, agData$drv, radii, inches=size.range[2]/cm(1), bg="black")
# Panel 2: The legend
par(mar=c(mar0[1],.5,mar0[3:4]))
symbols(c(0,0), 1:2, size.range, xlim=c(-4, 4), ylim=c(-2,4),
inches=1/cm(1), bg="black", axes=FALSE, xlab="", ylab="")
text(0, 3, "Freq")
text(c(2,0), 1:2, range(agData$freq), col=c("black", "white"))
# Reset par settings
par(mar=mar0)
Now follows my suggestion. The largest circle has a radius of 1 cm and the areas of the circles are proportional to agData$freq, without forcing a size for the smallest circle. Personally I think this is easier to read (both code and figure) and looks nicer.
with(agData, symbols(class, drv, sqrt(freq),
inches=size.range[2]/cm(1), bg="black"))
with(agData, text(class, drv, freq, col="white"))
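To make the difference between the two radius rules concrete, here is a small numeric sketch. The frequencies are made-up values chosen for easy arithmetic, and r_scaled / r_prop are names introduced just for this illustration.

```r
freq <- c(1, 4, 9, 25)        # made-up frequencies
size.range <- c(0.1, 1)       # min and max radius in cm, as above

r <- sqrt(freq)               # area proportional to freq => radius ~ sqrt(freq)

# Rule 1: stretch the radii linearly so they span size.range exactly
r_scaled <- diff(size.range) * (r - min(r)) / diff(range(r)) + size.range[1]
r_scaled   # 0.100 0.325 0.550 1.000

# Rule 2: only fix the largest radius; areas stay proportional to freq
r_prop <- size.range[2] * r / max(r)
r_prop     # 0.2 0.4 0.6 1.0
```

With the first rule the smallest circle is pinned to the minimum size, so areas are no longer strictly proportional to the frequencies; with the second rule they are.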

R par(omd) does not contain within mfrow

I would like two plots to show up in two separate spaces in the plot, so I do:
par(mfrow=c(1,2))
plot(1:10,1:10)
Now I would like the second plot to be about 25% shorter than the first plot so I adjust omd:
tmp <- par()$omd
tmp[4] <- 0.75
par(omd=tmp)
plot(1:10,1:10)
The problem is that the second plot shows up on top of the first plot. How do I avoid this margin issue?
Maybe try using layout instead?
layout(matrix(c(1, 1, 0, 2), ncol = 2L), widths = c(1,1),heights = c(0.5,1))
par(mar = c(3,2,2,2))
plot(1:10,1:10)
par(mar = c(3,2,2,2))
plot(1:10,1:10)
I guess maybe you'd want to set the heights to c(0.25,0.75) to get a 25% reduction?
Edit
But I don't think that omd does what you think it does. It changes the region inside the outer margins, which will always include both plot regions when setting par(mfrow = c(1,2)). What you really want to change, I think, is plt, which alters the size of the current plotting region (using quartz, as I'm on a Mac):
quartz(width = 5,height = 5)
par(mfrow=c(1,2))
vec <- par("plt")
plot(1:10,1:10)
par(plt = vec * c(1,1,1,0.75))
plot(1:5,1:5)
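One caveat, offered as a sketch rather than a correction: plt is c(x1, x2, y1, y2) in fractions of the figure region, so multiplying y2 by 0.75 lowers the top edge but does not make the plot region exactly 25% shorter. If you want that, you could shrink the height (y2 - y1) instead. The name plt_short is made up for this example, and no particular device is assumed.

```r
par(mfrow = c(1, 2))
plot(1:10, 1:10)

plt <- par("plt")                 # c(x1, x2, y1, y2) of the plot region
plt_short <- plt
# Keep the bottom edge, move the top edge so the height is 75% of the original
plt_short[4] <- plt[3] + 0.75 * (plt[4] - plt[3])

par(plt = plt_short)
plot(1:10, 1:10)
```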
