How to generate a Scree Plot for Hierarchical Cluster in R? - r

I have generated a dendrogram using plot() function and used hclust() for hierarchical clustering. I am looking to generate a scree plot for the same. Any suggestions?

It's a little late, but I have an answer.
# creating a dissimilarity matrix
res.dist <- dist(USArrests, method = "euclidean")
# creating an object of class "hclust"
res.hc <- hclust(d = res.dist, method = "ward.D2")
As can be found in the documentation to hclust, it is a list of values. You can inspect them by using
View(res.hc)
Now, the variable height has exactly what is needed for a scree plot. The following code generates a scree plot:
> ggplot(res.hc$height %>%
+ as.tibble() %>%
+ add_column(groups = length(res.hc$height):1) %>%
+ rename(height=value),
+ aes(x=groups, y=height)) +
+ geom_point() +
+ geom_line()
Basically, what you do is plot the height for a number of groups.
(It might not be very elegant, I'd be delighted to hear shorter versions to generate the same outcome).
My outcome is:

library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),
rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

Please do take a look at the youtube link here, which will be of use
https://www.youtube.com/watch?v=aMYCFtoBrdA
Regards
code
Link to R code on google drive for download
https://drive.google.com/file/d/0Byo-GmbU7XciVGRQcTk3QzdTMjA/view?usp=sharing
R code
#-----------------------------------------------
# Hierarchical clustering with the sample data
#------------------------------------------------
# Reading data into R similar to CARDS
temp_str <- "Name physics math
P 15 20
Q 20 15
R 26 21
X 44 52
Y 50 45
Z 57 38
A 80 85
B 90 88
C 98 98"
base_data <- read.table(textConnection(
temp_str), header = TRUE)
closeAllConnections()
# Check distinct categories of Variables useing STR function
str(base_data)
# Plot data
plot(base_data$physics, base_data$math,
pch=21, bg=c("red","green3","blue","red","green3","blue",
"red","green3","blue")[unclass(base_data$Name)],
main="Base Data")
# Step 01- obtain distance matrix (right way)
my_dist <- dist(base_data[c(2,3)], method = "euclidean")
print(my_dist)
# Step 02- Apply Hierarchical Clustering
fit <- hclust(my_dist, method="ward.D2")
# Step 03- Display dendogram
plot(fit, labels = base_data$Name)
Dendogram_Height=0
for (i in 2:9) Dendogram_Height[i] <- fit$height[i-1]
plot(1:9, Dendogram_Height, type="b", xlab="Sequence of merging",
ylab="Dendogram Height")
plot(9:1, Dendogram_Height, type="b", xlab="# of clusters",
ylab="Dendogram Height")
# Step 04- draw dendogram with color borders
# One can use this step to take a look at execution
rect.hclust(fit, k=8, border="red")
plot(fit, labels = base_data$Name)
rect.hclust(fit, k=7, border="red")
plot(fit, labels = base_data$Name)
rect.hclust(fit, k=6, border="red")
# draw color borders around required clusterd
plot(fit, labels = base_data$Name)
rect.hclust(fit, k=3, border="blue")
# cut tree into 3 clusters
my_groups <- cutree(fit, k=3)

Related

How to add labels to original data given clustering result using hclust

Just say I have some unlabeled data which I know should be clustered into six catergories, like for example this dataset:
library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)
If I create an hclust object with a sample of 60 from the original dataset like so:
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw)#Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')
I know that I can then add the labels to the dendrogram for viewing like this:
observedLabels <- c(rep(1,), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")
However, I would like to the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add a extra column with the correct label that each observation has been clustered into.
It would seems that ts.samp$cluster <- hc$label should add the cluster to the data frame, however hc$label returns NULL.
Can anyone help with extracting this information?
You need to define a level where you cut your dendrogram, this will form the groups.
Use:
labels <- cutree(hc, k = 3) # you set the number of k that's more appropriate, see how to read a dendrogram
ts.samp$grouping <- labels
Let's look at the dendrogram in order to find the best number for k:
plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups
It looks like either 2 or 3 might be good. You need to find the highest jump in the vertical lines (Height).
Use the horizontal lines at that height and count the cluster "formed".

R map a k-means clustering of a self organising map back to the data

I've used K-means clustering to classify a self-organising map (SOM and would not like to back code the data with the SOM clusters.
Example script below.
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
# Prepare SOM
set.seed(590507)
som1 <- som(dt,
somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
# Plot codes map
myPal1=colorRampPalette(c("black","orange","red","green"))
plot(som1,
type="codes",
palette.name = myPal1,
main="Codes",
shape="straight",
border ="gray")
# Extract the codebooks from SOM
cds <- as.data.frame(som1$codes)
# Compute WSS for up to 6 clusters for codebook vectors
wss <- (nrow(cds)-1)*sum(apply(cds,2,var))
for (i in 2:6){
wss[i] <- sum(kmeans(cds,centers=i)$withinss)
}
# Plot the scree plot
par(mar = c(8,5,8,2))
plot(1:6,
wss,
type="b",
xlab="Number of Clusters",
ylab="Within groups sum of squares",
main="Within cluster sum of squares (WCSS)",
col="blue",
lwd =2)
# Scree plot - 3 clusters look sensible choice
nCls =3
som1.km <- kmeans(cds, nCls, nstart = 20)
# Plot the SOM codes map with 3 clusters as background
MyPal3 <- c("grey80", 'aquamarine', 'burlywood1')
par(mar = c(0,5,0,2))
plot(som1,
type="codes",
palette.name= myPal1,
bgcol = MyPal3[som1.km$cluster],
main = "k-mean cluster",
shape="straight",
border ="gray"
)
legend("right",
x=7,
y=4,
cex=1.5,
title="Cluster",
legend = c(1:nCls),
fill= MyPal3[c(1:nCls)]
)
# Get the SOM cell number number assoicated with each of the 150 data
SOM.clss <- as.data.frame(som1$unit.classif)
names(SOM.clss) <- "Cell.Nmbr"
unique(SOM.clss)
# Get the k-means 3-class classification of the 36 SOM cells
kMns.clst <- as.data.frame(som1.km$cluster)
names(kMns.clst) <- "Clstr"
# Add a SOM cell reference for a lookup table
kMns.clst$Cell.Nmbr <- 1:nrow(kMns.clst)
# Use the lookup table to map the cluster number to each datum
dt.clst <- merge(SOM.clss,kMns.clst,by="Cell.Nmbr")
# Add the cluster column to the original data
iris.clst <- cbind(iris,dt.clst)
# Compute means as a reality check
aggregate(iris.clst[,1:4],
by=list(iris.clst$Clstr),
FUN=mean
)
The answer seems to make sense but I'm not sure if the approach is correct. Is this correct and if so is there a more efficient method of doing this back-coding exercise?

Edit betadisper permutest plot

I have used the script below to generate this betadisper plot between 2 communities.
In my "df", the first column is station names (x13)
I have 2 questions:
There is a point behind the "ABC" label, so how do I make the label transparent? Preferably adding different colours to each community?
How do I add the station names next to each point so I can visually compare which stations are most similar?
Script:
df <-read.csv("NMDS matrix_csv_NEW.csv", header=T, row.names=1, sep= ",")
df
Label<-rownames(df)
Label
dis <- vegdist(df)
groups <- factor(c(rep(1,8), rep(2,5)), labels = c("ABC","DEF"))
groups
mod <- betadisper(dis, groups)
mod
anova(mod)
permutest(mod, pairwise = TRUE)
plot(mod)
plot(mod, ellipse = TRUE, hull = FALSE, main= "MultiVariate Permutation")
To answer 2), here's how to plot the station names on top of the points.
text(mod$vectors[,1:2], label=Label)
Here is a possibile solution to your problem.
Download the myplotbetadisp.r file from this link and place the file in the working directory (warning, do not save the file as myplotbetadisp.r.txt!).
Some additional options are available in myplotbetadisper function:
fillrect, filling color of the box where centroid labels are printed;
coltextrect, vector of colors for centroid labels;
alphaPoints, alpha trasparency for centroid points;
labpoints, vectors of labels plotted close to points;
poslabPoints, position specifier for the text in labpoints.
library(vegan)
# A dummy data generation process
set.seed(1)
n <- 100
df <- matrix(runif(13*n),nrow=13)
# Compute dissimilarity indices
dis <- vegdist(df)
groups <- factor(c(rep(1,8), rep(2,5)), labels = c("ABC","DEF"))
# Analysis of multivariate homogeneity of group dispersions
mod <- betadisper(dis, groups)
source("myplotbetadisp.r")
labPts <- LETTERS[1:13]
col.fill.rect <- addAlpha(col2rgb("gray65"), alpha=0.5)
col.text.rect <- apply(col2rgb(c("blue","darkgreen")), 2, addAlpha, alpha=0.5)
transp.centroids <- 0.7
myplotbetadisper(mod, ellipse = TRUE, hull = FALSE,
fillrect=col.fill.rect, coltextrect=col.text.rect,
alphaPoints=transp.centroids, labPoints=labPts,
main= "MultiVariate Permutation")
Here is the plot
Hope it can help you.

using R to plot interaction plot

I have created a model using following
age hrs charges
530.6071 792.10 3474.60
408.6071 489.70 1247.06
108.0357 463.00 1697.07
106.6071 404.15 1676.33
669.4643 384.65 1701.13
556.4643 358.15 1630.30
665.4643 343.85 2468.83
508.4643 342.35 3366.44
106.0357 335.25 2876.82
interaction_model <- rlm( charges~age+hrs+age*hrs, age_vs_hrs_charges_cleaned);
Any idea how i can plot this in 3D?
I already plotted using
library(effects);
plot(effect(term="age:hrs", mod=interaction_model,default.levels=20),multiline=TRUE);
but this is not very clear visualization.
Any help?
There are several ways to do this.
model <- lm( charges~age+hrs+age*hrs, df)
# set up grid of (x,y) values
age <- seq(0,1000, by=20)
hrs <- seq(0,1000, by=20)
gg <- expand.grid(age=age, hrs=hrs)
# prediction from the linear model
gg$charges <-predict(model,newdata=gg)
# contour plot
library(ggplot2)
library(colorRamps)
library(grDevices)
jet.colors <- colorRampPalette(matlab.like(9))
ggplot(gg, aes(x=age, y=hrs, z=charges))+
stat_contour(aes(color=..level..),binwidth=200, size=2)+
scale_color_gradientn(colours=jet.colors(8))
# 3D scatterplot
library(scatterplot3d)
scatterplot3d(gg$age, gg$hrs, gg$charges)
# interactive 3D scatterplot (just a screen shot here)
library(rgl)
plot3d(gg$age,gg$hrs,gg$charges)
# interactive 3D surface plot with shading (screen shot)
colorjet <- jet.colors(100)
open3d()
rgl.surface(x=age, z=hrs, y=0.05*gg$charges,
color=colorzjet[ findInterval(gg$charges, seq(min(gg$charges), max(gg$charges), length=100))] )
axes3d()
A little while ago I wrote a couple of functions to display the results of a (general) linear model, together with colour coded data points, in either 3D (interactive, using rgl) or 2D (using a contour plot) :
# plot predictions of a (general) linear model as a function of two explanatory variables as an image / contour plot
# together with the actual data points
# mean value is used for any other variables in the model
plotImage=function(model=NULL,plotx=NULL,ploty=NULL,plotPoints=T,plotContours=T,plotLegend=F,npp=1000,xlab=NULL,ylab=NULL,zlab=NULL,xlim=NULL,ylim=NULL,pch=16,cex=1.2,lwd=0.1,col.palette=NULL) {
n=npp
require(rockchalk)
require(aqfig)
require(colorRamps)
require(colorspace)
require(MASS)
mf=model.frame(model);emf=rockchalk::model.data(model)
if (is.null(xlab)) xlab=plotx
if (is.null(ylab)) ylab=ploty
if (is.null(zlab)) zlab=names(mf)[[1]]
if (is.null(col.palette)) col.palette=rev(rainbow_hcl(1000,c=100))
x=emf[,plotx];y=emf[,ploty];z=mf[,1]
if (is.null(xlim)) xlim=c(min(x)*0.95,max(x)*1.05)
if (is.null(ylim)) ylim=c(min(y)*0.95,max(y)*1.05)
preds=predictOMatic(model,predVals=c(plotx,ploty),n=npp,divider="seq")
zpred=matrix(preds[,"fit"],npp,npp)
zlim=c(min(c(preds$fit,z)),max(c(preds$fit,z)))
par(mai=c(1.2,1.2,0.5,1.2),fin=c(6.5,6))
graphics::image(x=seq(xlim[1],xlim[2],len=npp),y=seq(ylim[1],ylim[2],len=npp),z=zpred,xlab=xlab,ylab=ylab,col=col.palette,useRaster=T,xaxs="i",yaxs="i")
if (plotContours) graphics::contour(x=seq(xlim[1],xlim[2],len=npp),y=seq(ylim[1],ylim[2],len=npp),z=zpred,xlab=xlab,ylab=ylab,add=T,method="edge")
if (plotPoints) {cols1=col.palette[(z-zlim[1])*999/diff(zlim)+1]
pch1=rep(pch,length(n))
cols2=adjustcolor(cols1,offset=c(-0.3,-0.3,-0.3,1))
pch2=pch-15
points(c(rbind(x,x)),c(rbind(y,y)), cex=cex,col=c(rbind(cols1,cols2)),pch=c(rbind(pch1,pch2)),lwd=lwd) }
box()
if (plotLegend) vertical.image.legend(zlim=zlim,col=col.palette) # TO DO: add z axis label, maybe make legend a bit smaller?
}
# plot predictions of a (general) linear model as a function of two explanatory variables as an interactive 3D plot
# mean value is used for any other variables in the model
plotPlaneFancy=function(model=NULL,plotx1=NULL,plotx2=NULL,plotPoints=T,plotDroplines=T,npp=50,x1lab=NULL,x2lab=NULL,ylab=NULL,x1lim=NULL,x2lim=NULL,cex=1.5,col.palette=NULL,segcol="black",segalpha=0.5,interval="none",confcol="lightgrey",confalpha=0.4,pointsalpha=1,lit=T,outfile="graph.png",aspect=c(1,1,0.3),zoom=1,userMatrix=matrix(c(0.80,-0.60,0.022,0,0.23,0.34,0.91,0,-0.55,-0.72,0.41,0,0,0,0,1),ncol=4,byrow=T),windowRect=c(0,29,1920,1032)) { # or library(colorRamps);col.palette <- matlab.like(1000)
require(rockchalk)
require(rgl)
require(colorRamps)
require(colorspace)
require(MASS)
mf=model.frame(model);emf=rockchalk::model.data(model)
if (is.null(x1lab)) x1lab=plotx1
if (is.null(x2lab)) x2lab=plotx2
if (is.null(ylab)) ylab=names(mf)[[1]]
if (is.null(col.palette)) col.palette=rev(rainbow_hcl(1000,c=100))
x1=emf[,plotx1]
x2=emf[,plotx2]
y=mf[,1]
if (is.null(x1lim)) x1lim=c(min(x1),max(x1))
if (is.null(x2lim)) x2lim=c(min(x2),max(x2))
preds=predictOMatic(model,predVals=c(plotx1,plotx2),n=npp,divider="seq",interval=interval)
ylim=c(min(c(preds$fit,y)),max(c(preds$fit,y)))
open3d(zoom=zoom,userMatrix=userMatrix,windowRect=windowRect)
if (plotPoints) plot3d(x=x1,y=x2,z=y,type="s",col=col.palette[(y-min(y))*999/diff(range(y))+1],size=cex,aspect=aspect,xlab=x1lab,ylab=x2lab,zlab=ylab,lit=lit,alpha=pointsalpha)
if (!plotPoints) plot3d(x=x1,y=x2,z=y,type="n",col=col.palette[(y-min(y))*999/diff(range(y))+1],size=cex,aspect=aspect,xlab=x1lab,ylab=x2lab,zlab=ylab)
if ("lwr" %in% names(preds)) persp3d(x=unique(preds[,plotx1]),y=unique(preds[,plotx2]),z=matrix(preds[,"lwr"],npp,npp),color=confcol, alpha=confalpha, lit=lit, back="lines",add=TRUE)
ypred=matrix(preds[,"fit"],npp,npp)
cols=col.palette[(ypred-min(ypred))*999/diff(range(ypred))+1]
persp3d(x=unique(preds[,plotx1]),y=unique(preds[,plotx2]),z=ypred,color=cols, alpha=0.7, lit=lit, back="lines",add=TRUE)
if ("upr" %in% names(preds)) persp3d(x=unique(preds[,plotx1]),y=unique(preds[,plotx2]),z=matrix(preds[,"upr"],npp,npp),color=confcol, alpha=confalpha, lit=lit, back="lines",add=TRUE)
if (plotDroplines) segments3d(x=rep(x1,each=2),y=rep(x2,each=2),z=matrix(t(cbind(y,fitted(model))),nc=1),col=segcol,lty=2,alpha=segalpha)
if (!is.null(outfile)) rgl.snapshot(outfile, fmt="png", top=TRUE)
}
Here is what you get as output with your model :
data=data.frame(age=c(530.6071,408.6071,108.0357,106.6071,669.4643,556.4643,665.4643,508.4643,106.0357),
hrs=c(792.10,489.70,463.00,404.15,384.65,358.15,343.85,342.35,335.25),
charges=c(3474.60,1247.06,1697.07,1676.33,1701.13,1630.30,2468.83,3366.44,2876.82))
library(MASS)
fit1=rlm( charges~age+hrs+age*hrs, data)
plotPlaneFancy(fit1, plotx1 = "age", plotx2 = "hrs")
plotPlaneFancy(fit1, plotx1 = "age", plotx2 = "hrs",interval="confidence")
(or interval="prediction" to show 95% prediction intervals)
plotImage(fit1,plotx="age",ploty="hrs",plotContours=T,plotLegend=T)

Easiest way to plot inequalities with hatched fill?

Refer to the above plot. I have drawn the equations in excel and then shaded by hand. You can see it is not very neat. You can see there are six zones, each bounded by two or more equations. What is the easiest way to draw inequalities and shade the regions using hatched patterns ?
To build up on #agstudy's answer, here's a quick-and-dirty way to represent inequalities in R:
plot(NA,xlim=c(0,1),ylim=c(0,1), xaxs="i",yaxs="i") # Empty plot
a <- curve(x^2, add = TRUE) # First curve
b <- curve(2*x^2-0.2, add = TRUE) # Second curve
names(a) <- c('xA','yA')
names(b) <- c('xB','yB')
with(as.list(c(b,a)),{
id <- yB<=yA
# b<a area
polygon(x = c(xB[id], rev(xA[id])),
y = c(yB[id], rev(yA[id])),
density=10, angle=0, border=NULL)
# a>b area
polygon(x = c(xB[!id], rev(xA[!id])),
y = c(yB[!id], rev(yA[!id])),
density=10, angle=90, border=NULL)
})
If the area in question is surrounded by more than 2 equations, just add more conditions:
plot(NA,xlim=c(0,1),ylim=c(0,1), xaxs="i",yaxs="i") # Empty plot
a <- curve(x^2, add = TRUE) # First curve
b <- curve(2*x^2-0.2, add = TRUE) # Second curve
d <- curve(0.5*x^2+0.2, add = TRUE) # Third curve
names(a) <- c('xA','yA')
names(b) <- c('xB','yB')
names(d) <- c('xD','yD')
with(as.list(c(a,b,d)),{
# Basically you have three conditions:
# curve a is below curve b, curve b is below curve d and curve d is above curve a
# assign to each curve coordinates the two conditions that concerns it.
idA <- yA<=yD & yA<=yB
idB <- yB>=yA & yB<=yD
idD <- yD<=yB & yD>=yA
polygon(x = c(xB[idB], xD[idD], rev(xA[idA])),
y = c(yB[idB], yD[idD], rev(yA[idA])),
density=10, angle=0, border=NULL)
})
In R, there is only limited support for fill patterns and they can only be
applied to rectangles and polygons.This is and only within the traditional graphics, no ggplot2 or lattice.
It is possible to fill a rectangle or polygon with a set of lines drawn
at a certain angle, with a specific separation between the lines. A density
argument controls the separation between the lines (in terms of lines per inch)
and an angle argument controls the angle of the lines.
here an example from the help:
plot(c(1, 9), 1:2, type = "n")
polygon(1:9, c(2,1,2,1,NA,2,1,2,1),
density = c(10, 20), angle = c(-45, 45))
EDIT
Another option is to use alpha blending to differentiate between regions. Here using #plannapus example and gridBase package to superpose polygons, you can do something like this :
library(gridBase)
vps <- baseViewports()
pushViewport(vps$figure,vps$plot)
with(as.list(c(a,b,d)),{
grid.polygon(x = xA, y = yA,gp =gpar(fill='red',lty=1,alpha=0.2))
grid.polygon(x = xB, y = yB,gp =gpar(fill='green',lty=2,alpha=0.2))
grid.polygon(x = xD, y = yD,gp =gpar(fill='blue',lty=3,alpha=0.2))
}
)
upViewport(2)
There are several submissions on the MATLAB Central File Exchange that will produce hatched plots in various ways for you.
I think a tool that will come handy for you here is gnuplot.
Take a look at the following demos:
feelbetween
statistics
some tricks

Resources