I want to plot the gradient plot of intensities, something like this:
I thought about creating a gradient grid whose distribution follows my "I" function, but I have no idea how to do it, or whether there is a dedicated R package for this task.
Thank you so much for even thinking about this.
a <- 5*10^(-6)
d <- 0.5*0.005
l <- 500*10^(-9)
n <- pi
theta <- seq(-n,n,length=3500)
I <- function(x){(cos((pi*d*sin(x))/l))^2*(sin((pi*a*sin(x))/l)/((pi*a*sin(x))/l))^2}
y1 <- lapply(theta,I)
y <- unlist(y1)
df <- data.frame(theta,y)
I2 <- function(x){(sin((pi*a*sin(x))/l)/((pi*a*sin(x))/l))^2}
y12 <- lapply(theta,I2)
y2 <- unlist(y12)
df2 <- data.frame(theta,y2)
p = ggplot()
p +
geom_line(data = df, aes(theta,y)) +
xlim(-0.3,0.3) +
geom_line(data = df2, aes(theta,y2))
Making use of patchwork this could be achieved like so:
For the gradient make a second ggplot of rectangles using e.g. geom_rect where you map intensity on color and/or fill
This gradient plot could then be glued to the main plot via patchwork
To get a nice gradient plot:
I tripled the number of grid points for the gradient plot,
mapped the cube root of intensity on color, and
got rid of all unnecessary elements like the y-axis, the color guide, ...
BTW:
As your functions are vectorized you don't need lapply to compute the intensities.
Instead of adjusting the limits via xlim() (which removes rows falling outside of the range), set them using coord_cartesian.
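As a quick check of the vectorization point, here is a minimal sketch comparing the lapply version with a direct call on the whole vector. The constants and I2 are copied from the question; y_loop, y_vec and same are names introduced here for illustration.

```r
# Constants and envelope function copied from the question
a <- 5*10^(-6)
l <- 500*10^(-9)
I2 <- function(x){(sin((pi*a*sin(x))/l)/((pi*a*sin(x))/l))^2}

theta <- seq(-pi, pi, length.out = 3500)

# Element-by-element evaluation, as in the question
y_loop <- unlist(lapply(theta, I2))

# Direct call on the whole vector
y_vec <- I2(theta)

# Both approaches should agree up to floating-point tolerance
same <- isTRUE(all.equal(y_loop, y_vec))
print(same)
```

Since sin(), cos() and the arithmetic operators all work element-wise, the direct call gives the same values with less code.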
library(ggplot2)
library(tibble)
library(patchwork)
a <- 5*10^(-6)
d <- 0.5*0.005
l <- 500*10^(-9)
n <- pi
theta <- seq(-n,n,length=3500)
I <- function(x){(cos((pi*d*sin(x))/l))^2*(sin((pi*a*sin(x))/l)/((pi*a*sin(x))/l))^2}
y <- I(theta)
df <- data.frame(theta,y)
I2 <- function(x){(sin((pi*a*sin(x))/l)/((pi*a*sin(x))/l))^2}
y2 <- I2(theta)
df2 <- data.frame(theta,y2)
p1 = ggplot() +
geom_line(data = df, aes(theta,y)) +
geom_line(data = df2, aes(theta,y2)) +
coord_cartesian(xlim = c(-0.3,0.3))
g <- tibble(
xmin = seq(-n, n, length = 3 * 3500),
xmax = dplyr::lead(xmin),
y = I(xmin)
)
p2 <- ggplot(g, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1, color = y^(1/3))) +
geom_rect() +
coord_cartesian(xlim = c(-0.3,0.3)) +
guides(color = "none") +
theme_minimal() +
theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
p1 / p2 + plot_layout(heights = c(10, 1))
#> Warning: Removed 1 rows containing missing values (geom_rect).
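The warning most likely comes from dplyr::lead(), which returns NA for the last element, so the final rectangle has no xmax and is dropped. A minimal sketch of one way to avoid it, assuming it is acceptable to take both bounds from the same vector of grid points (edges is a name introduced here):

```r
n <- pi

# Same grid as in the answer; consecutive points serve as rectangle bounds
edges <- seq(-n, n, length.out = 3 * 3500)

g <- data.frame(
  xmin = head(edges, -1),  # every grid point except the last
  xmax = tail(edges, -1)   # every grid point except the first
)

# No missing bounds remain, so geom_rect() has nothing to remove
anyNA(g)  # FALSE
```

The intensity column can then be added as before, e.g. by evaluating the intensity function at g$xmin.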
Using sample data:
library(tidyverse)
library(plotly)
myplot <- diamonds %>% ggplot(aes(clarity, price)) +
geom_boxplot() +
facet_wrap(~ clarity, ncol = 8, scales = "free", strip.position = "bottom") +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
ggplotly(myplot)
Returns something like:
Where the inside facets are horribly scaled compared to the first and last and there is a lot of extra padding. I tried to find a solution from these questions:
ggplotly not working properly when number are facets are more
R: facet_wrap does not render correctly with ggplotly in Shiny app
With trial and error I used panel.spacing.x = unit(-0.5, "line") in theme() and it looks a bit better, with a lot of the extra padding gone, but the internal facets are still noticeably smaller.
Also, as an extra but less important question: the strip labels are at the top in the ggplotly() output, even though I set them at the bottom. This seems to be an ongoing issue here; does anyone have a hacky workaround?
Edit: in my real dataset I need y-axis labels for each of the facets because their scales are quite different, so I kept them in the example; this is why I need facet_wrap. Screenshot of my real dataset for explanation:
Updated answer (2): just use fixfacets()
I've put together a function fixfacets(fig, facets, domain_offset) that turns this:
...by using this:
f <- fixfacets(figure = fig, facets = unique(df$clarity), domain_offset = 0.06)
...into this:
This function should now be pretty flexible with regards to number of facets.
Complete code:
library(tidyverse)
library(plotly)
# YOUR SETUP:
df <- data.frame(diamonds)
df['price'][df$clarity == 'VS1', ] <- filter(df['price'], df['clarity']=='VS1')*2
myplot <- df %>% ggplot(aes(clarity, price)) +
geom_boxplot() +
facet_wrap(~ clarity, scales = 'free', shrink = FALSE, ncol = 8, strip.position = "bottom", dir='h') +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
fig <- ggplotly(myplot)
# Custom function that takes a ggplotly figure and its facets as arguments.
# The upper x-values for each domain is set programmatically, but you can adjust
# the look of the figure by adjusting the width of the facet domain and the
# corresponding annotations labels through the domain_offset variable
fixfacets <- function(figure, facets, domain_offset){
# work on a local copy of the figure that was passed in
fig <- figure
n_facets <- length(facets)
# split x ranges from 0 to 1 into
# intervals corresponding to number of facets
# xHi = highest x for shape
xHi <- seq(0, 1, len = n_facets+1)
xHi <- xHi[2:length(xHi)]
xOs <- domain_offset
# Shape manipulations, identified by dark grey background: "rgba(217,217,217,1)"
# structure: p$x$layout$shapes[[2]]$
shp <- fig$x$layout$shapes
j <- 1
for (i in seq_along(shp)){
# isTRUE() also covers shapes whose fillcolor is missing or NA
if (isTRUE(shp[[i]]$fillcolor == "rgba(217,217,217,1)")){
#fig$x$layout$shapes[[i]]$fillcolor <- 'rgba(0,0,255,0.5)' # optionally change color for each label shape
fig$x$layout$shapes[[i]]$x1 <- xHi[j]
fig$x$layout$shapes[[i]]$x0 <- (xHi[j] - xOs)
#fig$x$layout$shapes[[i]]$y <- -0.05
j<-j+1
}
}
# annotation manipulations, identified by label name
# structure: p$x$layout$annotations[[2]]
ann <- fig$x$layout$annotations
annos <- facets
j <- 1
for (i in seq_along(ann)){
if (ann[[i]]$text %in% annos){
# put each annotation between high and low x,
# and set adjustment to center
fig$x$layout$annotations[[i]]$x <- (((xHi[j]-xOs)+xHi[j])/2)
fig$x$layout$annotations[[i]]$xanchor <- 'center'
#print(fig$x$layout$annotations[[i]]$y)
#fig$x$layout$annotations[[i]]$y <- -0.05
j<-j+1
}
}
# domain manipulations
# set high and low x for each facet domain
xax <- names(fig$x$layout)
j <- 1
for (i in seq_along(xax)){
if (!is.na(pmatch('xaxis', xax[i]))){
#print(fig[['x']][['layout']][[xax[i]]][['domain']][2])
fig[['x']][['layout']][[xax[i]]][['domain']][2] <- xHi[j]
fig[['x']][['layout']][[xax[i]]][['domain']][1] <- xHi[j] - xOs
j<-j+1
}
}
return(fig)
}
f <- fixfacets(figure = fig, facets = unique(df$clarity), domain_offset = 0.06)
f
Updated answer (1): How to handle each element programmatically!
The elements of your figure that require some editing to meet your needs with regards to maintaining the scaling of each facet and fix the weird layout, are:
x label annotations through fig$x$layout$annotations,
x label shapes through fig$x$layout$shapes, and
the position where each facet starts and stops along the x axis through fig$x$layout$xaxis$domain
The only real challenge was referencing, for example, the correct shapes and annotations among many other shapes and annotations. The code snippet below will do exactly this to produce the following plot:
The code snippet might need some careful tweaking for each case with regards to facet names, and number of names, but the code in itself is pretty basic so you shouldn't have any problem with that. I'll polish it a bit more myself when I find the time.
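To make the domain arithmetic easier to follow, here is a minimal sketch of how the facet x-domains are laid out. facet_domains() is a hypothetical helper introduced here for illustration, not part of plotly; it only reproduces the xHi and offset calculation used in the code.

```r
# Hypothetical helper: x-domains for n_facets panels laid out left to right
facet_domains <- function(n_facets, domain_offset) {
  # Upper bounds: n_facets equally spaced points in (0, 1]
  x_hi <- seq(0, 1, length.out = n_facets + 1)[-1]
  # Each panel spans from (upper bound - offset) to the upper bound
  data.frame(x0 = x_hi - domain_offset, x1 = x_hi)
}

doms <- facet_domains(n_facets = 8, domain_offset = 0.06)
print(doms)

# TRUE when no panel starts before the previous one ends
no_overlap <- all(doms$x0[-1] >= doms$x1[-nrow(doms)])
print(no_overlap)
```

With this layout, panels stay separate as long as the offset does not exceed 1 / n_facets, so the offset probably needs to shrink when the number of facets grows.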
Complete code:
library(tidyverse)
library(plotly)
# YOUR SETUP:
df <- data.frame(diamonds)
df['price'][df$clarity == 'VS1', ] <- filter(df['price'], df['clarity']=='VS1')*2
myplot <- df %>% ggplot(aes(clarity, price)) +
geom_boxplot() +
facet_wrap(~ clarity, scales = 'free', shrink = FALSE, ncol = 8, strip.position = "bottom", dir='h') +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
fig <- ggplotly(myplot)
# MY SUGGESTED SOLUTION:
# get info about facets
# through unique levels of clarity
facets <- unique(df$clarity)
n_facets <- length(facets)
# split x ranges from 0 to 1 into
# intervals corresponding to number of facets
# xHi = highest x for shape
xHi <- seq(0, 1, len = n_facets+1)
xHi <- xHi[2:length(xHi)]
# specify an offset from highest to lowest x for shapes
xOs <- 0.06
# Shape manipulations, identified by dark grey background: "rgba(217,217,217,1)"
# structure: p$x$layout$shapes[[2]]$
shp <- fig$x$layout$shapes
j <- 1
for (i in seq_along(shp)){
# isTRUE() also covers shapes whose fillcolor is missing or NA
if (isTRUE(shp[[i]]$fillcolor == "rgba(217,217,217,1)")){
#fig$x$layout$shapes[[i]]$fillcolor <- 'rgba(0,0,255,0.5)' # optionally change color for each label shape
fig$x$layout$shapes[[i]]$x1 <- xHi[j]
fig$x$layout$shapes[[i]]$x0 <- (xHi[j] - xOs)
j<-j+1
}
}
# annotation manipulations, identified by label name
# structure: p$x$layout$annotations[[2]]
ann <- fig$x$layout$annotations
annos <- facets
j <- 1
for (i in seq_along(ann)){
if (ann[[i]]$text %in% annos){
# put each annotation between high and low x,
# and set adjustment to center
fig$x$layout$annotations[[i]]$x <- (((xHi[j]-xOs)+xHi[j])/2)
fig$x$layout$annotations[[i]]$xanchor <- 'center'
j<-j+1
}
}
# domain manipulations
# set high and low x for each facet domain
lot <- names(fig$x$layout)
j <- 1
for (i in seq_along(lot)){
if (!is.na(pmatch('xaxis', lot[i]))){
#print(p[['x']][['layout']][[lot[i]]][['domain']][2])
fig[['x']][['layout']][[lot[i]]][['domain']][2] <- xHi[j]
fig[['x']][['layout']][[lot[i]]][['domain']][1] <- xHi[j] - xOs
j<-j+1
}
}
fig
Initial answers based on built-in functionalities
With many variables of very different values, it seems that you're going to end up with a challenging format no matter what, meaning either
facets will have varying width, or
labels will cover facets or be too small to be readable, or
the figure will be too wide to display without a scrollbar.
So what I'd suggest is rescaling your price column for each unique clarity and setting scales='free_x'. I still hope someone will come up with a better answer, but here's what I would do:
Plot 1: Rescaled values and scales='free_x'
Code 1:
#install.packages("scales")
library(tidyverse)
library(plotly)
library(scales)
library(data.table)
df <- data.frame(diamonds)
df['price'][df$clarity == 'VS1', ] <- filter(df['price'], df['clarity']=='VS1')*2
# rescale price for each clarity
setDT(df)
clarities <- unique(df$clarity)
for (c in clarities){
df[clarity == c, price := rescale(price)]
}
df$price <- rescale(df$price)
myplot <- df %>% ggplot(aes(clarity, price)) +
geom_boxplot() +
facet_wrap(~ clarity, scales = 'free_x', shrink = FALSE, ncol = 8, strip.position = "bottom") +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
p <- ggplotly(myplot)
p
This will of course only give insight into the internal distribution of each category since the values have been rescaled. If you want to show the raw price data, and maintain readability, I'd suggest making room for a scrollbar by setting the width large enough.
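To illustrate what the per-category rescaling does, here is a minimal base-R sketch on toy data. The data frame toy and the helper rescale01() are made up for illustration; with its default arguments scales::rescale() performs the same min-max scaling to the range 0 to 1.

```r
set.seed(42)

# Toy data standing in for the clarity and price columns used above
toy <- data.frame(
  clarity = rep(c("A", "B", "C"), each = 5),
  price   = c(runif(5, 100, 200), runif(5, 1000, 5000), runif(5, 10, 20))
)

# Min-max scaling to [0, 1]
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# ave() applies the function within each clarity group and keeps the row order
toy$price_scaled <- ave(toy$price, toy$clarity, FUN = rescale01)

# Each group now spans 0 to 1, whatever its original price level was
print(tapply(toy$price_scaled, toy$clarity, range))
```

This is why the rescaled plot shows the shape of each distribution but no longer lets you compare price levels between categories.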
Plot 2: scales='free' and big enough width:
Code 2:
library(tidyverse)
library(plotly)
df <- data.frame(diamonds)
df['price'][df$clarity == 'VS1', ] <- filter(df['price'], df['clarity']=='VS1')*2
myplot <- df %>% ggplot(aes(clarity, price)) +
geom_boxplot() +
facet_wrap(~ clarity, scales = 'free', shrink = FALSE, ncol = 8, strip.position = "bottom") +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
p <- ggplotly(myplot, width = 1400)
p
And, of course, if your values don't vary too much across categories, scales='free_x' will work just fine.
Plot 3: scales='free_x'
Code 3:
library(tidyverse)
library(plotly)
df <- data.frame(diamonds)
df['price'][df$clarity == 'VS1', ] <- filter(df['price'], df['clarity']=='VS1')*2
myplot <- df %>% ggplot(aes(clarity, price)) +
geom_boxplot() +
facet_wrap(~ clarity, scales = 'free_x', shrink = FALSE, ncol = 8, strip.position = "bottom") +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
p <- ggplotly(myplot)
p
Sometimes it is helpful to consider a different plot altogether if you struggle with the selected plot. It all depends on what you wish to visualise. Sometimes box plots work, sometimes histograms work, and sometimes densities work.
Here is an example of how a density plot can give you a quick idea of data distribution for many parameters.
library(tidyverse)
library(plotly)
myplot <- diamonds %>% ggplot(aes(price, colour = clarity)) +
geom_density(aes(fill = clarity), alpha = 0.25) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
ggplotly(myplot)
I'm trying to produce a scatter plot with geom_point where the points are circumscribed by a smoothed polygon, with geom_polygon.
Here's my point data:
set.seed(1)
df <- data.frame(x=c(rnorm(30,-0.1,0.1),rnorm(30,0,0.1),rnorm(30,0.1,0.1)),y=c(rnorm(30,-1,0.1),rnorm(30,0,0.1),rnorm(30,1,0.1)),val=rnorm(90),cluster=c(rep(1,30),rep(2,30),rep(3,30)),stringsAsFactors=F)
I color each point according to the interval that df$val falls in. Here's the interval data:
intervals.df <- data.frame(interval=c("(-3,-2]","(-2,-0.999]","(-0.999,0]","(0,1.96]","(1.96,3.91]","(3.91,5.87]","not expressed"),
start=c(-3,-2,-0.999,0,1.96,3.91,NA),end=c(-2,-0.999,0,1.96,3.91,5.87,NA),
col=c("#2f3b61","#436CE8","#E0E0FF","#7d4343","#C74747","#EBCCD6","#D3D3D3"),stringsAsFactors=F)
Assigning colors and intervals to the points:
df <- cbind(df,do.call(rbind,lapply(df$val,function(x){
if(is.na(x)){
return(data.frame(col=intervals.df$col[nrow(intervals.df)],interval=intervals.df$interval[nrow(intervals.df)],stringsAsFactors=F))
} else{
idx <- which(intervals.df$start <= x & intervals.df$end >= x)
return(data.frame(col=intervals.df$col[idx],interval=intervals.df$interval[idx],stringsAsFactors=F))
}
})))
Preparing the colors for the legend, which will show each interval:
df$interval <- factor(df$interval,levels=intervals.df$interval)
colors <- intervals.df$col
names(colors) <- intervals.df$interval
Here's where I constructed the smoothed polygons (using a function courtesy of this link):
clusters <- sort(unique(df$cluster))
cluster.cols <- c("#ff00ff","#088163","#ccbfa5")
splinePolygon <- function(xy,vertices,k=3, ...)
{
# Assert: xy is an n by 2 matrix with n >= k.
# Wrap k vertices around each end.
n <- dim(xy)[1]
if (k >= 1) {
data <- rbind(xy[(n-k+1):n,], xy, xy[1:k, ])
} else {
data <- xy
}
# Spline the x and y coordinates.
data.spline <- spline(1:(n+2*k), data[,1], n=vertices, ...)
x <- data.spline$x
x1 <- data.spline$y
x2 <- spline(1:(n+2*k), data[,2], n=vertices, ...)$y
# Retain only the middle part.
cbind(x1, x2)[k < x & x <= n+k, ]
}
library(data.table)
hulls.df <- do.call(rbind,lapply(1:length(clusters),function(l){
dt <- data.table(df[which(df$cluster==clusters[l]),])
hull <- dt[, .SD[chull(x,y)]]
spline.hull <- splinePolygon(cbind(hull$x,hull$y),100)
return(data.frame(x=spline.hull[,1],y=spline.hull[,2],val=NA,cluster=clusters[l],col=cluster.cols[l],interval=NA,stringsAsFactors=F))
}))
hulls.df$cluster <- factor(hulls.df$cluster,levels=clusters)
And here's my ggplot command:
library(ggplot2)
p <- ggplot(df,aes(x=x,y=y,colour=interval))+geom_point(cex=2,shape=1,stroke=1)+labs(x="X", y="Y")+theme_bw()+theme(legend.key=element_blank(),panel.border=element_blank(),strip.background=element_blank())+scale_color_manual(drop=FALSE,values=colors,name="DE")
p <- p+geom_polygon(data=hulls.df,aes(x=x,y=y,group=cluster),color=hulls.df$col,fill=NA)
which produces:
My question is: how do I add a legend for the polygons under the legend for the points? I want it to be a legend with 3 lines colored according to the cluster colors, with the corresponding cluster number beside each line.
Slightly different output, only changing the last line of your code, it may solve your purpose:
p+geom_polygon(data=hulls.df,aes(x=x,y=y,group=cluster, fill=cluster),alpha=0.1)
Say, you want to add a legend of the_factor. My basic idea is,
(1) put the_factor into mapping by using unused aes arguments; aes(xx = the_factor)
(2) if (1) affects something, delete the effect by using scale_xx_manual()
(3) modify the legend by using guides(xx = guide_legend(override.aes = list()))
In your case, aes(fill) and aes(alpha) are unused. The former is the better choice because mapping fill has no visible effect on the open-circle points (shape=1). So I used aes(fill=as.factor(cluster)).
p <- ggplot(df,aes(x=x,y=y,colour=interval, fill=as.factor(cluster))) + # add aes(fill=...)
geom_point(cex=2, shape=1, stroke=1) +
labs(x="X", y="Y",fill="cluster") + # add fill="cluster"
theme_bw() + theme(legend.key=element_blank(),panel.border=element_blank(),strip.background=element_blank()) + scale_color_manual(drop=FALSE,values=colors,name="DE") +
guides(fill = guide_legend(override.aes = list(colour = cluster.cols, pch=0))) # add
p <- p+geom_polygon(data=hulls.df,aes(x=x,y=y,group=cluster), color=hulls.df$col,fill=NA)
Of course, you can make the same graph by using aes(alpha = the_factor). Because alpha does affect the points, you need to neutralize it with scale_alpha_manual().
g <- ggplot(df, aes(x=x,y=y,colour=interval)) +
geom_point(cex=2, shape=1, stroke=1, aes(alpha=as.factor(cluster))) + # add aes(alpha)
labs(x="X", y="Y",alpha="cluster") + # add alpha="cluster"
theme_bw() + theme(legend.key=element_blank(),panel.border=element_blank(),strip.background=element_blank()) + scale_color_manual(drop=FALSE,values=colors,name="DE") +
scale_alpha_manual(values=c(1,1,1)) + # add
guides(alpha = guide_legend(override.aes = list(colour = cluster.cols, pch=0))) # add
g <- g+geom_polygon(data=hulls.df,aes(x=x,y=y,group=cluster), color=hulls.df$col,fill=NA)
What you are asking for is two colour scales. My understanding is that this is not possible. But you can give the impression of having two colour scales with a bit of a cheat and using the filled symbols (shapes 21 to 25).
p <- ggplot(df, aes(x = x, y = y, fill = interval)) +
geom_point(cex = 2, shape = 21, stroke = 1, colour = NA)+
labs(x = "X", y = "Y") +
theme_bw() +
theme(legend.key = element_blank(), panel.border = element_blank(), strip.background = element_blank()) +
scale_fill_manual(drop=FALSE, values=colors, name="DE") +
geom_polygon(data = hulls.df, aes(x = x, y = y, colour = cluster), fill = NA) +
scale_colour_manual(values = cluster.cols)
p
Alternatively, use a filled polygon with a low alpha:
p <- ggplot(df,aes(x=x,y=y,colour=interval))+
geom_point(cex=2,shape=1,stroke=1)+
labs(x="X", y="Y")+
theme_bw() +
theme(legend.key = element_blank(),panel.border=element_blank(), strip.background=element_blank()) +
scale_color_manual(drop=FALSE,values=colors,name="DE", guide = guide_legend(override.aes = list(fill = NA))) +
geom_polygon(data=hulls.df,aes(x=x,y=y,group=cluster, fill = cluster), alpha = 0.2, show.legend = TRUE) +
scale_fill_manual(values = cluster.cols)
p
But this might make the point colours difficult to see.
Length of x-axis is important for my plot because it allows one to compare between facets, therefore I want facets to have different x-axis sizes. Here is my example data:
group1 <- seq(1, 10, 2)
group2 <- seq(1, 20, 3)
x = c(group1, group2)
mydf <- data.frame (X =x , Y = rnorm (length (x),5,1),
groups = c(rep(1, length (group1)), rep(2, length(group2))))
And my code:
library(ggplot2)
library(scales) # provides comma() for the axis labels
p1 = ggplot(data=mydf,aes(x=X,y=Y,color=factor(groups)) )+
geom_point(size=2)+
scale_x_continuous(labels=comma)+
theme_bw()
p1+facet_grid(groups ~ .,scales = "fixed",space="free_x")
And the resulting figure:
Panel 1 has x-axis values less than 10, whereas panel 2 has x-axis values extending to 20. Still, both panels have the same size on the x-axis. Is there any way to make the x-axis panel size different for different panels, so that they correspond to their (x-axis) values?
I found an example from some different package that shows what I am trying to do, here is the figure:
Maybe something like this can get you started. There's still some formatting to do, though.
library(grid)
library(gridExtra)
library(dplyr)
library(ggplot2)
p1 <- ggplot(data=mydf[mydf$groups==1,],aes(x=X,y=Y))+
geom_point(size=2)+
theme_bw()
p2 <- ggplot(data=mydf[mydf$groups==2,],aes(x=X,y=Y))+
geom_point(size=2)+
theme_bw()
summ <- mydf %>% group_by(groups) %>% summarize(len=diff(range(X)))
summ$p <- summ$len/max(summ$len)
summ$q <- 1-summ$p
ng <- nullGrob()
grid.arrange(arrangeGrob(p1,ng,widths=unlist(summ[1,3:4])),
arrangeGrob(p2,ng,widths=unlist(summ[2,3:4])))
I'm sure there's a way to make this more general, and the axes don't line up perfectly yet, but it's a beginning.
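One possible way to generalize the width calculation to any number of groups is sketched below in base R. The data are rebuilt from the question; len, p and widths are names chosen here for illustration.

```r
# Example data from the question (only the columns needed here)
group1 <- seq(1, 10, 2)
group2 <- seq(1, 20, 3)
mydf <- data.frame(
  X = c(group1, group2),
  groups = c(rep(1, length(group1)), rep(2, length(group2)))
)

# Length of the x-range covered by each group
len <- tapply(mydf$X, mydf$groups, function(v) diff(range(v)))

# Share of the widest group, plus the remainder to fill with an empty grob
p <- len / max(len)
widths <- cbind(plot = p, filler = 1 - p)
print(widths)
```

Row i of widths could then be passed as the widths argument of arrangeGrob() for the i-th plot and a nullGrob(), presumably inside a loop or lapply() over the groups.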
Here is a solution following OP's clarifying comment ("I guess axis will be same but the boxes will be of variable size. Is it possible by plotting them separately and aligning in grid?").
library(plyr); library(ggplot2)
buffer <- 0.5 # Extra space around the box
#Calculate box parameters
mydf.box <- ddply(mydf, .(groups), summarise,
max.X = max(X) + buffer,
min.X = 0,
max.Y = max(Y) + buffer,
min.Y = 0,
X = mean(X), Y = mean(Y)) #Dummy values for X and Y needed for geom_rect
p2 <- ggplot(data=mydf,aes(x=X, y=Y) )+
geom_rect(data = mydf.box, aes( xmax = max.X, xmin = min.X,
ymax = max.Y, ymin = min.Y),
fill = "white", colour = "black") +
geom_point(size=2) + facet_grid(groups ~ .,scales = "free_y") +
theme_classic() +
#Extra formatting to make your plot like the example
theme(panel.background = element_rect(fill = "grey85"),
strip.text.y = element_text(angle = 0),
strip.background = element_rect(colour = NA, fill = "grey65"))
Didzis Elferts showed how to plot a dendrogram using ggplot2 and ggdendro:
horizontal dendrogram in R with labels
here is the code:
labs = paste("sta_",1:50,sep="") #new labels
rownames(USArrests)<-labs #set new row names
hc <- hclust(dist(USArrests), "ave")
library(ggplot2)
library(ggdendro)
#convert cluster object to use with ggplot
dendr <- dendro_data(hc, type="rectangle")
#your own labels are supplied in geom_text() and label=label
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=3) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())
Does anyone know how to colorize the different clusters? For example, how would you colorize 2 clusters (k=2)?
This approach is very similar to @DidzisElferts', just a little simpler.
df <- USArrests # really bad idea to muck up internal datasets
labs <- paste("sta_",1:50,sep="") # new labels
rownames(df) <- labs # set new row names
library(ggplot2)
library(ggdendro)
hc <- hclust(dist(df), "ave") # hierarchical clustering
dendr <- dendro_data(hc, type="rectangle") # convert for ggplot
clust <- cutree(hc,k=2) # find 2 clusters
clust.df <- data.frame(label=names(clust), cluster=factor(clust))
# dendr[["labels"]] has the labels, merge with clust.df based on label column
dendr[["labels"]] <- merge(dendr[["labels"]],clust.df, by="label")
# plot the dendrogram; note use of color=cluster in geom_text(...)
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x, y, label=label, hjust=0, color=cluster),
size=3) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())
A workaround would be to plot the cluster object with plot() and then use the function rect.hclust() to draw borders around the clusters (the number of clusters is set with the argument k=). If the result of rect.hclust() is saved as an object, it is a list in which each element contains the observations belonging to one cluster.
plot(hc)
gg<-rect.hclust(hc,k=2)
Now this list can be converted to a data frame where the column clust contains names for the clusters (in this example two groups); the names are repeated according to the lengths of the list elements.
clust.gr<-data.frame(num=unlist(gg),
clust=rep(c("Clust1","Clust2"),times=sapply(gg,length)))
head(clust.gr)
num clust
sta_1 1 Clust1
sta_2 2 Clust1
sta_3 3 Clust1
sta_5 5 Clust1
sta_8 8 Clust1
sta_9 9 Clust1
New data frame is merged with label() information of dendr object (dendro_data() result).
text.df<-merge(label(dendr),clust.gr,by.x="label",by.y="row.names")
head(text.df)
label x y num clust
1 sta_1 8 0 1 Clust1
2 sta_10 28 0 10 Clust2
3 sta_11 41 0 11 Clust2
4 sta_12 31 0 12 Clust2
5 sta_13 10 0 13 Clust1
6 sta_14 37 0 14 Clust2
When plotting dendrogram use text.df to add labels with geom_text() and use column clust for colors.
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=text.df, aes(x=x, y=y, label=label, hjust=0,color=clust), size=3) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())
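If drawing the base plot is not needed, the same kind of label-to-cluster table can probably be built with cutree() instead of rect.hclust(). A minimal sketch under that assumption follows; clust.gr2 is a name introduced here, and cutree() may number the clusters differently than rect.hclust() orders them.

```r
# Work on a copy of the built-in data, with the labels from the question
df <- USArrests
rownames(df) <- paste("sta_", 1:50, sep = "")
hc <- hclust(dist(df), "ave")

# cutree() returns a named integer vector: observation label -> cluster id
clust <- cutree(hc, k = 2)

clust.gr2 <- data.frame(
  label = names(clust),
  clust = paste0("Clust", clust),
  stringsAsFactors = FALSE
)

head(clust.gr2)
table(clust.gr2$clust)
```

The result could then be merged with label(dendr) by the label column, in the same way as text.df above.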
Adding to @DidzisElferts' and @jlhoward's code, the dendrogram itself can be coloured.
library(ggplot2)
library(ggdendro)
library(plyr)
library(zoo)
df <- USArrests # really bad idea to muck up internal datasets
labs <- paste("sta_", 1:50, sep = "") # new labels
rownames(df) <- labs # set new row names
cut <- 4 # Number of clusters
hc <- hclust(dist(df), "ave") # hierarchical clustering
dendr <- dendro_data(hc, type = "rectangle")
clust <- cutree(hc, k = cut) # find 'cut' clusters
clust.df <- data.frame(label = names(clust), cluster = clust)
# Split dendrogram into upper grey section and lower coloured section
height <- unique(dendr$segments$y)[order(unique(dendr$segments$y), decreasing = TRUE)]
cut.height <- mean(c(height[cut], height[cut-1]))
dendr$segments$line <- ifelse(dendr$segments$y == dendr$segments$yend &
dendr$segments$y > cut.height, 1, 2)
dendr$segments$line <- ifelse(dendr$segments$yend > cut.height, 1, dendr$segments$line)
# Number the clusters
dendr$segments$cluster <- c(-1, diff(dendr$segments$line))
change <- which(dendr$segments$cluster == 1)
for (i in 1:cut) dendr$segments$cluster[change[i]] = i + 1
dendr$segments$cluster <- ifelse(dendr$segments$line == 1, 1,
ifelse(dendr$segments$cluster == 0, NA, dendr$segments$cluster))
dendr$segments$cluster <- na.locf(dendr$segments$cluster)
# Consistent numbering between segment$cluster and label$cluster
clust.df$label <- factor(clust.df$label, levels = levels(dendr$labels$label))
clust.df <- arrange(clust.df, label)
clust.df$cluster <- factor((clust.df$cluster), levels = unique(clust.df$cluster), labels = (1:cut) + 1)
dendr[["labels"]] <- merge(dendr[["labels"]], clust.df, by = "label")
# Positions for cluster labels
n.rle <- rle(dendr$segments$cluster)
N <- cumsum(n.rle$lengths)
N <- N[seq(1, length(N), 2)] + 1
N.df <- dendr$segments[N, ]
N.df$cluster <- N.df$cluster - 1
# Plot the dendrogram
ggplot() +
geom_segment(data = segment(dendr),
aes(x=x, y=y, xend=xend, yend=yend, size=factor(line), colour=factor(cluster)),
lineend = "square", show.legend = FALSE) +
scale_colour_manual(values = c("grey60", rainbow(cut))) +
scale_size_manual(values = c(.1, 1)) +
geom_text(data = N.df, aes(x = x, y = y, label = factor(cluster), colour = factor(cluster + 1)),
hjust = 1.5, show.legend = FALSE) +
geom_text(data = label(dendr), aes(x, y, label = label, colour = factor(cluster)),
hjust = -0.2, size = 3, show.legend = FALSE) +
scale_y_reverse(expand = c(0.2, 0)) +
labs(x = NULL, y = NULL) +
coord_flip() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
panel.background = element_rect(fill = "white"),
panel.grid = element_blank())
The 2-cluster and 4-cluster solutions:
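On how the cut height in the code above is chosen: it lies halfway between the (cut-1)-th and cut-th largest merge heights, so a horizontal cut there should split the tree into cut clusters. A minimal sketch checking this with cutree(), assuming the merge heights in hc$height carry the same information as the segment y-values used above (n_clusters is a name introduced here):

```r
df <- USArrests
hc <- hclust(dist(df), "ave")
cut <- 4  # number of clusters; needs to be at least 2 for this calculation

# Merge heights, largest first
height <- sort(unique(hc$height), decreasing = TRUE)

# Halfway between the two merges that separate cut - 1 and cut clusters
cut.height <- mean(c(height[cut], height[cut - 1]))

# Cutting the tree at that height and counting the resulting groups
n_clusters <- length(unique(cutree(hc, h = cut.height)))
print(n_clusters)
```

Segments above cut.height belong to the grey upper part of the tree, and everything below it is coloured by cluster.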
A short way to achieve a similar result is to use the package dendextend, taken from this overview.
df <- USArrests # really bad idea to muck up internal datasets
labs <- paste("sta_",1:50,sep="") # new labels
rownames(df) <- labs # set new row names
require(magrittr)
require(ggplot2)
require(dendextend)
dend <- df %>% dist %>%
hclust %>% as.dendrogram %>%
set("branches_k_color", k = 4) %>% set("branches_lwd", 0.7) %>%
set("labels_cex", 0.6) %>% set("labels_colors", k = 4) %>%
set("leaves_pch", 19) %>% set("leaves_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
Note: The order of the states is slightly different compared to those above - not really changing interpretation though.
For those that are still looking for a convenient way to do this, you can use my package ggdendroplot (https://github.com/NicolasH2/ggdendroplot).
If you have the data from the posted example:
labs = paste("sta_",1:50,sep="") #new labels
rownames(USArrests)<-labs #set new row names
hc <- hclust(dist(USArrests), "ave")
...you can use ggdendroplot with ggplot to get a colored dendrogram:
devtools::install_github("NicolasH2/ggdendroplot")
library(ggdendroplot)
library(ggplot2)
ggplot() + geom_dendro(hc, dendrocut = 30)
You can turn it sideways, on its head, etc. It's basically just a ggplot layer, so you can modify the graph further as you wish and add it to other ggplots. Check out the GitHub page to see what you can do with ggdendroplot.