Clustering plot. Add the cluster number: Function fviz_cluster (factoextra) - r

I have the following R code:
library(factoextra)
kms<-kmeans(df,18,nstart=100)
fviz_cluster(kms, data = df, alpha=0.2,shape=19,geom = "point")
It outputs the following picture:
Is it possible to add the cluster number inside each cluster in the picture, or to show the cluster numbers instead of the center points?
UPDATE.
I found something similar to what I want to achieve at: https://www.r-bloggers.com/2016/11/hybrid-hierarchical-k-means-clustering-for-optimizing-clustering-outputs-unsupervised-machine-learning/
I tried it, but I got an error, so I had to install the version of factoextra used there and change the code. I ended up with this:
fviz_cluster(kms, data = df,frame.level = 0.68)
How can I remove all the numbers except the ones at the center of each cluster?

There doesn't seem to be a simple solution; here is a potential workaround:
library(tidyverse)
library(factoextra)
data("iris")
# Select a single point for each category (i.e. setosa = the 25th value)
# label the selected value, then label the rest of the points with nothing ("")
iris$label <- c(rep("", 24), "setosa", rep("", 25),
rep("", 23), "versicolor", rep("", 26),
rep("", 24), "virginica", rep("", 25))
# Remove species column (5) and label column and scale the data
iris.scaled <- scale(iris[, -c(5,6)])
# K-means clustering
km.res <- kmeans(iris.scaled, 3, nstart = 10)
# Visualize clusters
fviz_cluster(km.res, iris[, -c(5,6)], alpha = 0.2, shape = 19, geom = c("point")) +
# Label the points (only the 3 with actual labels show up on the plot)
geom_text(aes(label = iris$label))
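If the goal is one number per cluster placed at the cluster's center, another possible route is to skip fviz_cluster and build the plot directly with ggplot2. The sketch below is an assumption-laden illustration, not factoextra's own method: it projects the scaled data onto the first two principal components with prcomp(), averages the projected coordinates per cluster, and labels those centroids. The names pts and centers are made up for this example.

```r
library(ggplot2)

# Scale the four numeric iris columns and cluster them
iris.scaled <- scale(iris[, 1:4])
set.seed(42)
km.res <- kmeans(iris.scaled, centers = 3, nstart = 10)

# Project the observations onto the first two principal components
pca <- prcomp(iris.scaled)
pts <- data.frame(x = pca$x[, 1],
                  y = pca$x[, 2],
                  cluster = factor(km.res$cluster))

# One row per cluster: mean position of its points in PCA space
centers <- aggregate(cbind(x, y) ~ cluster, data = pts, FUN = mean)

# Faint points, plus a single cluster number at each centroid
p <- ggplot(pts, aes(x = x, y = y, colour = cluster)) +
  geom_point(alpha = 0.2, shape = 19) +
  geom_text(data = centers, aes(label = cluster),
            colour = "black", size = 6, fontface = "bold") +
  labs(x = "PC1", y = "PC2")
print(p)
```

Because the labels come from a separate three-row data frame, no individual point is labelled, so there is nothing to blank out afterwards.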

Related

Adjust plot margins to show figure legend

How do I adjust my plot size to make the heatmap legend visible?
I tried par(oma=c(0,0,1,0)+1, mar=c(0,0,0,0)+1) but it completely truncated my plot.
# Correlation Matrix
dat.cor <- cor(samp.matrix, method="pearson", use="pairwise.complete.obs")
cx <- redgreen(50)
# Correlation plot - heatmap
png("Heatmap_cor.matrix.png")
#par(oma=c(0,0,1,0), mar=c(0,0,0,0))
leg <- seq(min(dat.cor, na.rm=T), max(dat.cor, na.rm=T), length=10)
image(dat.cor, main="Correlation between Glioma vs Non-Tumor\n Gene Expression", col=cx, axes=F)
axis(1,at=seq(0,1,length=ncol(dat.cor)),label=dimnames(dat.cor)[[2]], cex.axis=0.9,las=2)
axis(2,at=seq(0,1,length=ncol(dat.cor)),label=dimnames(dat.cor)[[2]], cex.axis=0.9,las=2)
dev.off()
It would be a lot easier to help you with your problem if you included a minimal reproducible example. Please see https://stackoverflow.com/help/how-to-ask to get tips on improving your questions and improve your chances of getting an answer.
In order to replicate your issue, I downloaded a subset of the GEO dataset and used the mean affy intensities to create an approximation of your heatmap:
# Load libraries
library(tidyverse)
#BiocManager::install("affyio")
library(affyio)
# GSE data downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4290
list_of_files <- fs::dir_ls("~/Desktop/GSE4290_RAW/")
# Load the CEL files
CEL_list <- list()
for (f in seq_along(list_of_files)) {
CEL_list[[f]] <- read.celfile(list_of_files[[f]],
intensity.means.only = TRUE)
}
# Rename each element of the list with the corresponding sample name
names(CEL_list) <- gsub(x = basename(list_of_files),
pattern = ".CEL.gz",
replacement = "")
# Create a matrix of the mean intensities for all genes
samp.matrix <- map(CEL_list, pluck, "INTENSITY", "MEAN") %>%
bind_cols() %>%
as.matrix()
# Calculate correlations between samples
dat.cor <- cor(samp.matrix, method = "pearson",
use = "pairwise.complete.obs")
# Specify a colour palette (green/red is NOT colourblind friendly)
cx <- colorRampPalette(viridis::inferno(50))(50)
# Plot the heatmap
png("Heatmap_cor.matrix.png")
par(oma = c(0, 0, 1, 0), mar = c(6, 6, 4, 7), xpd = TRUE)
leg <- seq(from = 0.1, to = 1, length.out = 10)
image(dat.cor, main="Correlation between Glioma vs Non-Tumor\n Gene Expression", col=cx, axes=F)
axis(1,at=seq(0,1,length=ncol(dat.cor)),label=dimnames(dat.cor)[[2]], cex.axis=0.9,las=2)
axis(2,at=seq(0,1,length=ncol(dat.cor)),label=dimnames(dat.cor)[[2]], cex.axis=0.9,las=2)
legend(1.1, 1.1, title = "Correlation", legend = leg,
fill = colorRampPalette(viridis::inferno(50))(10))
dev.off()
Does this solve your problem?
Also, one of the great things about R is that people write packages that make these kinds of tasks easier. One example is the pheatmap package, which makes clustering samples and annotating sample groups much more straightforward; I've found that the final image can be 'nicer' than a plot built from scratch. For example:
library(pheatmap)
pheatmap(mat = dat.cor, color = cx, border_color = "white", legend = TRUE,
main = "Correlation between Glioma vs Non-Tumor\n Gene Expression")
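For anyone who wants to try the margin-plus-legend approach without downloading the GEO data, here is a minimal sketch on simulated data. The matrix m, the sample names and the palette are invented for illustration. It widens the right margin, allows drawing outside the plot region with xpd = TRUE, and ties the legend entries to the same breaks that image() uses.

```r
# Simulated stand-in for the expression matrix: 20 "genes" x 10 "samples"
set.seed(1)
m <- matrix(rnorm(200), nrow = 20)
colnames(m) <- paste0("S", 1:10)
cor_mat <- cor(m, method = "pearson")

# Ten colours need eleven break points
pal <- hcl.colors(10, "Inferno")
breaks <- seq(min(cor_mat), max(cor_mat), length.out = 11)

# Wide right margin for the legend; xpd = TRUE permits drawing in the margin
op <- par(mar = c(5, 5, 4, 9), xpd = TRUE)
image(cor_mat, col = pal, breaks = breaks, axes = FALSE,
      main = "Correlation between samples")
ticks <- seq(0, 1, length.out = ncol(cor_mat))
axis(1, at = ticks, labels = colnames(cor_mat), las = 2)
axis(2, at = ticks, labels = colnames(cor_mat), las = 2)

# One legend entry per colour, labelled with the interval it covers
legend(x = 1.1, y = 1, title = "Correlation", fill = pal,
       legend = sprintf("%.2f to %.2f", head(breaks, -1), tail(breaks, -1)))
par(op)
```

Passing breaks explicitly is a design choice here: it keeps the colours in the legend consistent with the colours in the image.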

R multiple plots of time series xts with only 1 legend

I want to produce multiple graphs of a time series xts object in different windows. The issue is that I cannot add only one legend (for the last plot). My code is the following:
dev.new(width=3,height=9)
par(mfrow=c(3,1))
plot(csum_GVMP[,c(-2,-3)],main=" ",minor.ticks="years",cex.axis = 1,major.ticks="years",grid.ticks.on=FALSE,grid.ticks.lty=0,col=color)
addLegend("bottomleft",legend.names = c("","","","","","",""))
plot(csum_ERC[,c(-2,-3)],main=" ",minor.ticks="years",cex.axis = 1,major.ticks="years",grid.ticks.on=FALSE,grid.ticks.lty=0,col=color)
addLegend("bottomleft",legend.names = c("","","","","","",""))
plot(csum_MD[,c(-2,-3)],main=" ",minor.ticks="years",cex.axis = 1,major.ticks="years",grid.ticks.on=FALSE,grid.ticks.lty=0,col=color)
As you can see, I added blank legend names for the first and second plots, but the result is that the same plot is repeated, showing only the plot for csum_GVMP (image: here).
If I leave addLegend out, the plot looks like this (image: here), which is what I want, except that I would now like to add a single legend. If I leave out the addLegend command for the first and second plots, the figures are not even plotted.
Does anybody know how to handle this? Thank you in advance.
Here you go. If you uncomment the addLegend calls, the graphs are duplicated, as I mentioned in the post.
I hope this helps
set.seed(10)
library(MASS)
library(xts)
date=seq(as.Date("2000/1/1"), as.Date("2000/1/10"), "days")
matrixA=as.numeric(mvrnorm(n = 30, 0.5, 0.2, tol = 1e-6, empirical = TRUE, EISPACK = FALSE))
matrixA=matrix(matrixA,10,3)
matrixA.ts=as.xts(matrixA,date)
matrixB=as.numeric(mvrnorm(n = 30, 0.5, 0.2, tol = 1e-6, empirical = TRUE, EISPACK = FALSE))
matrixB=matrix(matrixB,10,3)
matrixB.ts=as.xts(matrixB,date)
par(mfrow=c(2,1))
plot(as.xts(matrixA,date),main="A")
#addLegend("bottomleft",legend.names = c("A","B"))
plot(as.xts(matrixB,date),main="B")
#addLegend("bottomleft",legend.names = c("",""))
You should be able to see this
I'm not particularly happy with this solution, but it solves the immediate problem.
The strategy is to "build" the plot to completion before plotting/printing it. See below.
set.seed(10)
library(MASS)
library(xts)
date <- seq(as.Date("2000-01-01"), as.Date("2000-01-10"), "days")
matrixA <- matrix(mvrnorm(n = 30, 0.5, 0.2, empirical = TRUE), 10, 3)
matrixA.ts <- xts(matrixA, date)
matrixB <- matrix(mvrnorm(n = 30, 0.5, 0.2, empirical = TRUE), 10, 3)
matrixB.ts <- xts(matrixB, date)
# Create the first plot, but do not draw it
# Assign the result to 'p1'
p1 <- plot(matrixA.ts, main = "A")
p1 <- addLegend("bottomleft", legend.names = c("A","B"))
# Create the second plot without drawing it
# Assign the result to 'p2'
p2 <- plot(matrixB.ts, main = "B")
p2 <- addLegend("bottomleft", legend.names = c("",""))
# Set up the device layout, and draw both plots
par(mfrow=c(2,1))
p1
p2

Draw vertical quantile lines over histogram

I currently generate the following plot using ggplot in R:
The data is stored in a single dataframe with three columns: PDF (y-axis in the plot above), mids(x) and dataset name. This is created from histograms.
What I want to do is to plot a color-coded vertical line for each dataset representing the 95th quantile, like I manually painted below as an example:
I tried to use + geom_line(stat="vline", xintercept="mean") but of course I'm looking for the quantiles, not for the mean, and AFAIK ggplot does not allow that. Colors are fine.
I also tried + stat_quantile(quantiles = 0.95) but I'm not sure what it does exactly. Documentation is very scarce. Colors, again, are fine.
Please note that density values are very low, down to 1e-8. I don't know if the quantile() function likes that.
I understand that calculating the quantile of a histogram is not quite the same as calculating that of a list of numbers. I don't know if it would help, but the HistogramTools package contains an ApproxQuantile() function for histogram quantiles.
Minimum working example is included below. As you can see I obtain a data frame from each histogram, then bind the dataframes together and plot that.
library(ggplot2)
v <- c(1:30, 2:50, 1:20, 1:5, 1:100, 1, 2, 1, 1:5, 0, 0, 0, 5, 1, 3, 7, 24, 77)
h <- hist(v, breaks=c(0:100))
df1 <- data.frame(h$mids,h$density,rep("dataset1", 100))
colnames(df1) <- c('Bin','Pdf','Dataset')
df2 <- data.frame(h$mids*2,h$density*2,rep("dataset2", 100))
colnames(df2) <- c('Bin','Pdf','Dataset')
df_tot <- rbind(df1, df2)
ggplot(data=df_tot[which(df_tot$Pdf>0),], aes(x=Bin, y=Pdf, group=Dataset, colour=Dataset)) +
geom_point(aes(color=Dataset), alpha = 0.7, size=1.5)
Precomputing these values and plotting them separately seems like the simplest option. Doing so with dplyr requires minimal effort:
library(dplyr)
q.95 <- df_tot %>%
group_by(Dataset) %>%
summarise(Bin_q.95 = quantile(Bin, 0.95))
ggplot(data=df_tot[which(df_tot$Pdf>0),],
aes(x=Bin, y=Pdf, group=Dataset, colour=Dataset)) +
geom_point(aes(color=Dataset), alpha = 0.7, size=1.5) +
geom_vline(data = q.95, aes(xintercept = Bin_q.95, colour = Dataset))
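Note that quantile(Bin, 0.95) treats every bin midpoint as equally likely, so it ignores the Pdf column. If a density-weighted quantile of each histogram is wanted instead, one possible sketch is to accumulate the normalised density and take the first bin at which it reaches 0.95. The helper hist_quantile below is hypothetical (written for this example, not taken from any package), and it assumes equal bin widths within each dataset.

```r
# Rebuild the example data from the question
v <- c(1:30, 2:50, 1:20, 1:5, 1:100, 1, 2, 1, 1:5, 0, 0, 0, 5, 1, 3, 7, 24, 77)
h <- hist(v, breaks = 0:100, plot = FALSE)
df_tot <- rbind(
  data.frame(Bin = h$mids,     Pdf = h$density,     Dataset = "dataset1"),
  data.frame(Bin = h$mids * 2, Pdf = h$density * 2, Dataset = "dataset2")
)

# Hypothetical helper: smallest bin midpoint whose cumulative,
# normalised density reaches p (assumes equal bin widths)
hist_quantile <- function(bin, pdf, p = 0.95) {
  ord <- order(bin)
  cdf <- cumsum(pdf[ord]) / sum(pdf)
  bin[ord][which(cdf >= p)[1]]
}

# One quantile per dataset, in a data frame suitable for geom_vline()
q95 <- sapply(split(df_tot, df_tot$Dataset),
              function(d) hist_quantile(d$Bin, d$Pdf))
q95_df <- data.frame(Dataset = names(q95), Bin_q95 = unname(q95))
print(q95_df)
```

The resulting q95_df can be passed to geom_vline(data = q95_df, aes(xintercept = Bin_q95, colour = Dataset)) in the same way as q.95 above. Dividing by sum(pdf) means the very small density values mentioned in the question do not matter, only their relative sizes do.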

R - draw new layer behind current plot

Just curious: when plotting in R, one can easily reorder the executed statements to change the order of the "layers" on the plot, e.g.
plot(x, type = "n")
lines(y)
points(x)
to get x drawn over y. Is there any way to do this ad hoc, e.g.
plot(x)
lines(y, behind = TRUE) # fictional option behind
While there isn't explicitly a behind option or layers in plot, an easy way to overlay two plots might be using the add = TRUE option in plot. Here is an example with artificial data:
# Load sp package for creating artificial data
library(sp)
# Create sample town points
towns <- data.frame(lon = sample(100), lat = sample(100))
towns <- SpatialPoints(towns)
# Create sample polygon grid
grd <- GridTopology(c(1,1), c(10,10), c(10,10))
polys <- as.SpatialPolygons.GridTopology(grd)
# Plot polygons
plot(polys)
# Add towns (in red colour)
plot(towns, add = TRUE, col = 'red')
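For plain base graphics, a related option is the panel.first argument of plot.default(), which is evaluated after the axes are set up but before the main data are drawn. It does not add a layer behind a plot that already exists, but it does let the background element be specified in the same plot() call. A minimal sketch with made-up data:

```r
# Made-up data: x is a random walk, y is a noisy copy of it
set.seed(1)
x <- cumsum(rnorm(20))
y <- x + rnorm(20, sd = 0.5)

# lines(y) is evaluated first, so the points of x are drawn on top of it
plot(x, pch = 19, ylim = range(x, y),
     panel.first = lines(y, col = "grey60", lwd = 3))
```

This relies on lazy evaluation, so it is safest when plot() dispatches directly to plot.default(), as it does for a numeric vector.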
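As another example, you can draw lines as separate layers with ggplot2, after reshaping the data with reshape2::melt:
library(reshape2)
library(ggplot2)
a <- c(3, 6, 16, 17, 11, 21)
b <- c(0.3, 2.3, 9, 9, 5, 12)
c <- c(3, 7, 9, 7, 6, 10)
dat <- data.frame(a = a, b = b, c = c)
# Long format: one row per (variable, value) pair
dat <- melt(dat)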
Add an explicit 'x' variable to our data frame:
dat$x <- rep(1:6,times=3)
Then just plot the graph:
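# Example colours, one per series (assumed here; not defined in the original)
colours <- c(a = "red", b = "blue", c = "darkgreen")
ggplot(dat, aes(x = x, y = value)) +
geom_line(aes(colour = variable)) +
scale_colour_manual(values = colours) +
# labs(title = ...) replaces opts(title = ...), which was removed from ggplot2
labs(x = "time[h]", y = "a", colour = "", title = "bla")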
Finally, there is explicit support for layers in other packages, such as in PBSmapping for maps.

R: How do I display clustered matrix heatmap (similar color patterns are grouped)

I searched a lot of questions about heatmap throughout the site and packages, but I still have a problem.
I have clustered data (kmeans/EM/DBscan..), and I want to create a heatmap by grouping the same cluster. I want the similar color patterns to be grouped in the heatmap, so generally, it looks like a block-diagonal.
I tried to order the data by the cluster number and display it,
k = kmeans(data, 3)
d = data.frame(data)
d = data.frame(d, k$cluster)
d = d[order(d$k.cluster),]
heatmap(as.matrix(d))
but the result is still not sorted (see the linked image). I want it to be sorted by cluster number, so that it looks like this:
Can I do this in R?
I searched lots of packages and tried many ways, but I still have a problem.
Thanks a lot.
You can do this using reshape2 and ggplot2 as follows:
library(reshape2)
library(ggplot2)
# Create dummy data
set.seed(123)
df <- data.frame(
a = sample(1:5, 1000, replace=TRUE),
b = sample(1:5, 1000, replace=TRUE),
c = sample(1:5, 1000, replace=TRUE)
)
# Perform clustering
k <- kmeans(df, 3)
# Append id and cluster
dfc <- cbind(df, id=seq(nrow(df)), cluster=k$cluster)
# Add idsort, the id number ordered by cluster
dfc$idsort <- dfc$id[order(dfc$cluster)]
dfc$idsort <- order(dfc$idsort)
# use reshape2::melt to create data.frame in long format
dfm <- melt(dfc, id.vars=c("id", "idsort"))
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value))
You should set Rowv and Colv to NA if you don't want the dendrograms and the subsequent reordering. By the way, you should also switch off the scaling. Using the df from Andrie's answer:
heatmap(as.matrix(df)[order(k$cluster),],Rowv=NA,Colv=NA,scale="none",labRow=NA)
In fact, heatmap() is built on image(), so you can use image() directly to construct exactly the plot you want. heatmap() uses layout() internally, so it is difficult to set the margins. With image() you could do, for example:
myHeatmap <- function(x,ord,xlab="",ylab="",main="My Heatmap",
col=heat.colors(5), ...){
op <- par(mar=c(3,0,2,0)+0.1)
on.exit(par(op))
nc <- NCOL(x)
nr <- NROW(x)
labCol <- names(x)
x <- t(x[ord,])
image(1L:nc, 1L:nr, x, xlim = 0.5 + c(0, nc), ylim = 0.5 +
c(0, nr), axes = FALSE, xlab=xlab, ylab=ylab, main=main,
col=col,...)
axis(1, 1L:nc, labels = labCol, las = 2, line = -0.5, tick = 0)
axis(2, 1L:nr, labels = NA, las = 2, line = -0.5, tick = 0)
}
library(RColorBrewer)
myHeatmap(df,order(k$cluster),col=brewer.pal(5,"BuGn"))
This produces a plot with smaller margins at the sides. You can also manipulate the axes, colours, and so on. You should definitely take a look at the RColorBrewer package.
(This custom function is based on the internal plotting used by heatmap btw, simplified for the illustration and to get rid of all the dendrogram stuff)
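To make the cluster blocks easier to see once the rows are ordered, one possible extension is to draw a separator line where each cluster ends. The sketch below uses only base graphics and a smaller version of the dummy data; block_ends is a name introduced for this example.

```r
# Dummy data, similar in shape to the example above
set.seed(123)
df <- data.frame(
  a = sample(1:5, 200, replace = TRUE),
  b = sample(1:5, 200, replace = TRUE),
  c = sample(1:5, 200, replace = TRUE)
)
k <- kmeans(df, 3)

# Reorder the rows so that members of the same cluster are adjacent
ord <- order(k$cluster)
m <- as.matrix(df)[ord, ]

# Row index at which each cluster block ends
block_ends <- cumsum(table(k$cluster))

# image() expects z with one row per x value, hence the transpose
image(x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = t(m),
      axes = FALSE, xlab = "", ylab = "", col = heat.colors(5))
axis(1, at = seq_len(ncol(m)), labels = colnames(m), tick = FALSE)

# Horizontal separators between consecutive cluster blocks
abline(h = head(block_ends, -1) + 0.5, lwd = 2)
```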
