Cutting out a cluster from dendrogram - r

I am using the approach from this link to plot a dendrogram with labels colored by category.
Specifically, I am looking at the second answer there (Tree cut and Rectangles around clusters for a horizontal dendrogram in R), which uses the code below:
d <- dist(t(mat[,3:ncol(mat)]), method = "euclidean")
H.fit <- hclust(d, method="ward")
groups <- cutree(H.fit, k=16) # cut tree into clusters
hcdata<- dendro_data(H.fit, type="rectangle")
hcdata$labels <- merge(x = hcdata$labels, y = pm_gtex_comb, by.x = "label", by.y = "sample",all=TRUE)
ggplot() +
geom_segment(data=segment(hcdata), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(hcdata), aes(x, y, label=label, hjust=0, color=cluster),
size=3) +
geom_rect(data=rect, aes(xmin=X1-.3, xmax=X2+.3, ymin=0, ymax=ymax),
color="red", fill=NA)+
geom_hline(yintercept=0.33, color="blue")+
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme_dendro()
I have 16 clusters with 145 labels, and I want to cut out some of the clusters so that I can focus (zoom in) on only a couple of them. Is there any way to do this on an hclust object? This is only for a cleaner visualization, as the figure gets messy with 145 labels. Since I want to color by label, I think ggdendro suits this well.
For an example, see section 3) Zooming-in on dendrograms at this link:
http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html

You could try prune from the package dendextend (which can do lots of other nifty things):
library(dendextend)
hc <- hclust(dist(USArrests), "ave")
clusters <- cutree(hc, k=3)
par(mfrow=c(1,2), mar=c(6, 4, 2, 3))
plot(as.dendrogram(hc), main="regular")
plot(dend <- prune(as.dendrogram(hc), names(clusters[clusters==1])),
ylim=range(hc$height), main="without cluster #1")
Or, if you insist on ggdendro:
ggdendro::ggdendrogram(dend)
A ggplot2 plot can also be created by using dendextend:
library(dendextend)
ggd1 <- as.ggdend(dend)
library(ggplot2)
ggplot(ggd1)
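If an extra package is not an option, base R can zoom in as well. The sketch below is only an illustration, in the spirit of the zooming example linked in the question: cut() on a dendrogram splits it at a given height into an $upper tree and a list of $lower branches, and any single branch can then be plotted on its own. The cut height h = 100 and the choice of the first branch are arbitrary assumptions for the USArrests data, not values taken from the question.

```r
# Base R only: split the tree at a height and plot a single branch.
hc   <- hclust(dist(USArrests), "ave")
dend <- as.dendrogram(hc)

# h = 100 is an arbitrary, illustrative cut height
parts <- cut(dend, h = 100)

# parts$upper is the tree above the cut;
# parts$lower is a list of sub-dendrograms below it
branch <- parts$lower[[1]]
plot(branch, main = "first branch below h = 100")
```

Lowering h yields more, smaller branches, so the height can be tuned until the branch of interest is isolated.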

Related

How to plot skewed and normal data in r

Using hypothetical data I want to generate these three plots in one plot.
I wonder how I can do it. Is it possible to do it using ggplot2 or fGarch packages?
Here is an approach with ggplot2:
library("ggplot2")
x <- 0:100
y <- c(dnorm(x, mean=50, sd=10),
dlnorm(x, meanlog=3, sdlog=.7),
dlnorm(100-x, meanlog=3, sdlog=.7))
df <- data.frame(
x=x,
y=y,
type=rep(c("normal", "right skewed", "left skewed"), each=101)
)
ggplot(df, aes(x, y, color=type)) + geom_line()

Changing ellipse line type in fviz_cluster

I am using fviz_cluster from the factoextra package to plot my kmeans results, obtained using the kmeans function.
Below is the example from the factoextra package documentation.
data("iris")
iris.scaled <- scale(iris[, -5])
km.res <- kmeans(iris.scaled, 3, nstart = 25)
fviz_cluster(km.res, data = iris[, -5], repel=TRUE,
ellipse.type = "convex")
Running this command should show three clusters, each with a different colour. However, I want all of them to share the same colour and differ only in the line type of the ellipses. Do you know how to do it?
One solution is to take the data that fviz_cluster() computes and build a custom plot from it with ggplot.
Basically, you just need the x, y coordinates of each point plus the cluster assignment, and then you can recreate the plot yourself.
First save the data used for the plot by fviz_cluster(), then use chull() to find the convex hull for each cluster, then plot.
library(ggplot2)
library(factoextra)
library(dplyr) # provides %>%, group_by() and slice()
# your example:
iris.scaled <- scale(iris[, -5])
km.res <- kmeans(iris.scaled, 3, nstart = 25)
p <- fviz_cluster(km.res, data = iris[, -5], repel=TRUE,
ellipse.type = "convex") # save to access $data
# save '$data'
data <- p$data # this is all you need
# calculate the convex hull using chull(), for each cluster
hull_data <- data %>%
group_by(cluster) %>%
slice(chull(x, y))
# plot: you can now customize this by using ggplot syntax
ggplot(data, aes(x, y)) + geom_point() +
geom_polygon(data = hull_data, alpha = 0.5, aes(fill=cluster, linetype=cluster))
Of course, you can now change the axis labels, add a title, and label each point if you need to.
Here is an example possibly closer to your needs:
ggplot(data, aes(x, y)) + geom_point() +
geom_polygon(data = hull_data, alpha=0.2, lwd=1, aes(color=cluster, linetype=cluster))
linetype varies the line type per cluster, and lwd makes the lines thicker. It is also better to drop the fill aesthetic and map color instead.
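For the case asked about (one fixed colour, only the line type varying), a minimal sketch could look like the following. It is an illustration under stated assumptions, not the factoextra API: to stay self-contained it uses the first two principal components from prcomp() as a stand-in for p$data, computes the hulls with base R instead of dplyr, and the names pts and p_hull are hypothetical.

```r
library(ggplot2)

set.seed(1)
iris.scaled <- scale(iris[, -5])
km.res <- kmeans(iris.scaled, 3, nstart = 25)

# Assumption: the first two principal components stand in for p$data
pc  <- prcomp(iris.scaled)$x[, 1:2]
pts <- data.frame(x = pc[, 1], y = pc[, 2],
                  cluster = factor(km.res$cluster))

# Convex hull per cluster, using base R only
hull_data <- do.call(rbind, lapply(split(pts, pts$cluster),
                                   function(d) d[chull(d$x, d$y), ]))

# colour and fill are constants (outside aes); only linetype is mapped
p_hull <- ggplot(pts, aes(x, y)) +
  geom_point() +
  geom_polygon(data = hull_data,
               aes(linetype = cluster, group = cluster),
               colour = "black", fill = NA)
print(p_hull)
```

Setting colour outside aes() is what keeps every outline the same colour, while linetype inside aes() still produces a legend entry per cluster.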

Coloring clusters in ggdendro with long labels

I am creating dendrograms using ggdendro and coloring them according to cutpoints in the branches. I'm using the approach provided by @jlhoward in this question (Colorize Clusters in Dendogram with ggplot2) but I run into problems when my leaf labels are very long.
Here is some example code:
df <- USArrests
labs <- paste("veryverylongtitlename",1:50,sep="")
rownames(df) <- labs
library(ggplot2)
library(ggdendro)
hc <- hclust(dist(df), "ave") # hierarchical clustering
dendr <- dendro_data(hc, type="rectangle") # convert for ggplot
clust <- cutree(hc,k=2) # find 2 clusters
clust.df <- data.frame(label=names(clust), cluster=factor(clust))
# dendr[["labels"]] has the labels, merge with clust.df based on the label column
dendr[["labels"]] <- merge(dendr[["labels"]],clust.df, by="label")
# plot the dendrogram; note use of color=cluster in geom_text(...)
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend,
yend=yend)) +
geom_text(data=label(dendr), aes(x, y, label=label, hjust=0, color=cluster),
size=3) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())
As you can see, the labels here get cut off. I found this answer (decrease size of dendogram (or y-axis) ggplot), but I don't want to use it because I very much like the ability to use cutree to define my clusters. How can I manipulate the above code to fit the long labels? Many thanks!
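One possible direction, sketched here under assumptions rather than offered as a definitive fix, is to reserve more room on the label side by enlarging the multiplicative expansion that the code already passes to scale_y_reverse(). The variable label_room is a hypothetical tuning knob, and the value 1 is a guess that may need adjusting for the device size and label length. The cutree-based colouring is kept unchanged.

```r
library(ggplot2)
library(ggdendro)

df <- USArrests
rownames(df) <- paste0("veryverylongtitlename", 1:50)

hc    <- hclust(dist(df), "ave")
dendr <- dendro_data(hc, type = "rectangle")
clust <- cutree(hc, k = 2)
clust.df <- data.frame(label = names(clust), cluster = factor(clust))
dendr[["labels"]] <- merge(dendr[["labels"]], clust.df, by = "label")

# Hypothetical knob: fraction of the height range added on each side
# of the axis (the original code used 0.2)
label_room <- 1

p_long <- ggplot() +
  geom_segment(data = segment(dendr),
               aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = label(dendr),
            aes(x, y, label = label, color = cluster),
            hjust = 0, size = 3) +
  coord_flip() +
  scale_y_reverse(expand = c(label_room, 0)) +
  theme_dendro()
print(p_long)
```

The expansion is applied to both ends of the axis, so this also leaves empty space on the root side of the tree; that is the trade-off of this simple approach.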

ggplot2: add conditional density curves describing both dimensions of scatterplot

I have scatterplots of 2D data from two categories. I want to add density lines for each dimension -- not outside the plot (cf. Scatterplot with marginal histograms in ggplot2) but right on the plotting surface. I can get this for the x-axis dimension, like this:
set.seed(123)
dim1 <- c(rnorm(100, mean=1), rnorm(100, mean=4))
dim2 <- rnorm(200, mean=1)
cat <- factor(c(rep("a", 100), rep("b", 100)))
mydf <- data.frame(cbind(dim2, dim1, cat))
ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) +
geom_point() +
stat_density(aes(x=dim1, y=(-2+(..scaled..))),
position="identity", geom="line")
It looks like this:
But I want an analogous pair of density curves running vertically, showing the distribution of points in the y-dimension. I tried
stat_density(aes(y=dim2, x=0+(..scaled..)), position="identity", geom="line")
but receive the error "stat_density requires the following missing aesthetics: x".
Any ideas? thanks
You can get the densities of the dim2 variables. Then, flip the axes and store them in a new data.frame. After that it is simply plotting them on top of the other graph.
p <- ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) +
geom_point() +
stat_density(aes(x=dim1, y=(-2+(..scaled..))),
position="identity", geom="line")
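stuff <- ggplot_build(p)
# extract the x range, to make the new densities align with the y-axis
# (written for an older ggplot2; in ggplot2 >= 3.0 the range is at stuff$layout$panel_params[[1]]$x.range)
xrange <- stuff[[2]]$ranges[[1]]$x.range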
## Get densities of dim2
ds <- do.call(rbind, lapply(unique(mydf$cat), function(lev) {
dens <- with(mydf, density(dim2[cat==lev]))
data.frame(x=dens$y+xrange[1], y=dens$x, cat=lev)
}))
p + geom_path(data=ds, aes(x=x, y=y, color=factor(cat)))
So far I can produce:
distrib_horiz <- stat_density(aes(x=dim1, y=(-2+(..scaled..))),
position="identity", geom="line")
ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) +
geom_point() + distrib_horiz
And:
distrib_vert <- stat_density(data=mydf, aes(x=dim2, y=(-2+(..scaled..))),
position="identity", geom="line")
ggplot(data=mydf, aes(x=dim2, y=dim1, colour=as.factor(cat))) +
geom_point() + distrib_vert + coord_flip()
But combining them is proving tricky.
So far I have only a partial solution since I didn't manage to obtain a vertical stat_density line for each individual category, only for the total set. Maybe this can nevertheless help as a starting point for finding a better solution. My suggestion is to try with the ggMarginal() function from the ggExtra package.
p <- ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) +
geom_point() + stat_density(aes(x=dim1, y=(-2+(..scaled..))),
position="identity", geom="line")
library(ggExtra)
ggMarginal(p,type = "density", margins = "y", size = 4)
This is what I obtain:
I know it's not perfect, but maybe it's a step in a helpful direction. At least I hope so. Looking forward to seeing other answers.

Making adjustments to a forest plot using ggplot2

I'm trying to create a forest plot in R from meta-analysis results. However, I'm having difficulties adjusting the line thickness & the center points as well as getting rid of the automatic legend and creating my own legend.
#d is a data frame with 4 columns
#d$x gives variable names
#d$y gives center point
#d$ylo gives lower limits
#d$yhi gives upper limits
#data
d <- data.frame(x = toupper(letters[1:10]),
y = rnorm(10, 0, 0.1))
d <- transform(d, ylo = y-1/10, yhi=y+1/10)
d$x <- factor(d$x, levels=rev(d$x)) # reverse the ordering so it matches the order used in the function
credplot.gg <- function(d){
require(ggplot2)
p <- ggplot(d, aes(x=x, y=y, ymin=ylo, ymax=yhi,group=x,colour=x))+
geom_pointrange()+ theme_bw()+ coord_flip()+
guides(color=guide_legend(title="Cohort"))+
geom_hline(yintercept=0, colour = 'red', lty=1)+
xlab('Cohort') + ylab('Beta') + ggtitle('rs6467890_CACNA2D1')
return(p)
}
credplot.gg(d)
The issues that I'm having are:
When I insert "size" into ggplot(d, aes(x=x, y=y, ymin=ylo, ymax=yhi, group=x,colour=x), size=1.5), the lines and points are extremely large.
How do I get rid of the legend that is automatically generated with the plot, and how do I create my own legend?
I'm fairly new to R, so any help is greatly appreciated.
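As a starting point for both issues, here is a minimal sketch under assumptions, not a definitive answer. It sets size as a constant inside geom_pointrange() rather than in ggplot(), and hides the automatic legend with theme(legend.position = "none"). The size value 0.4, the seed, and the name p_forest are illustrative choices.

```r
library(ggplot2)

set.seed(42)  # illustrative seed so the random example data are reproducible
d <- data.frame(x = toupper(letters[1:10]), y = rnorm(10, 0, 0.1))
d <- transform(d, ylo = y - 1/10, yhi = y + 1/10)
d$x <- factor(d$x, levels = rev(d$x))

p_forest <- ggplot(d, aes(x = x, y = y, ymin = ylo, ymax = yhi, colour = x)) +
  geom_pointrange(size = 0.4) +                # constant size, set outside aes()
  geom_hline(yintercept = 0, colour = "red") + # reference line at Beta = 0
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none") +            # drop the automatic legend
  xlab("Cohort") + ylab("Beta")
print(p_forest)
```

If a legend is wanted after all, removing the theme(legend.position = "none") line and adding a scale such as scale_colour_manual() with explicit values and labels is one way to control what it shows.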
