I visualized my CLARA results using fviz_cluster (factoextra, which builds on ggplot2), and I would like the medoid of each cluster to stand out from the other data points, for example by changing its shape or color. The issue is that I have more than 800,000 data points, so the medoids are impossible to see with "show.clust.cent" alone.
How can I give the medoids different colors and make them much bigger than the other data points, or hide all data points except the medoids? I also tried star.plot, but it didn't work.
I know the row numbers of the medoids and thought about adding them manually, but I don't know how to integrate that into fviz_cluster.
Can anyone help me with this? Thank you!
fviz_cluster(clara.res,
             palette = c("#004c6d", "#00ffff", "#00a1c1", "#6efa75", "#78ab63",
                         "#cc0089", "#ffc334", "#ff9509", "#ffb6de", "#00cfe3"), # color palette
             ellipse.type = "t", geom = "point", show.clust.cent = TRUE,
             repel = TRUE, pointsize = 0.5,
             ggtheme = theme_classic())
Will this be ok for you?
library(tidyverse)
# Simulate n points around a random centre with a random spread
fpoint = function(n) tibble(
  Dlm1 = rnorm(n, sample(-20:20, 1), sample(1:5, 1)),
  Dlm2 = rnorm(n, sample(-20:20, 1), sample(1:5, 1))
)
# Ten clusters of 1000 points each
df = tibble(cluster = paste(1:10)) %>%
  mutate(data = map(cluster, ~fpoint(1000))) %>%
  unnest(data)
df %>% ggplot(aes(Dlm1, Dlm2, color = cluster)) +
  geom_point(alpha = 0.2, pch = 21) +
  stat_ellipse(size = 0.7)
The idea is to put the data into a tibble and plot it with standard ggplot2.
Update 1
library(factoextra)
library(cluster)
df = USArrests %>% na.omit() %>% scale()
kmed = pam(df, k = 4)
fviz_cluster(kmed, data = df, alpha=0.3, geom = "point", show.clust.cent = TRUE,repel = TRUE, pointsize = 2)
Is that the point?
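If the goal is mainly to make the medoids stand out, another option is to draw the plot directly with ggplot2 and add the medoids as a second, larger layer. The following is a minimal sketch under some assumptions: it uses USArrests as stand-in data, projects it onto the first two principal components with prcomp() for plotting, and takes the medoid row indices from the i.med component of the clara object. The names pts and med are only illustrative.

```r
library(cluster)
library(ggplot2)

set.seed(123)
x <- scale(USArrests)               # stand-in data; replace with your own matrix
clara.res <- clara(x, k = 4)

# Project onto the first two principal components for plotting
pc <- prcomp(x)
pts <- data.frame(pc$x[, 1:2], cluster = factor(clara.res$clustering))

# clara stores the row indices of the medoids in $i.med
med <- pts[clara.res$i.med, ]

p <- ggplot(pts, aes(PC1, PC2, colour = cluster)) +
  geom_point(size = 0.5, alpha = 0.2) +                 # faint background points
  geom_point(data = med, aes(fill = cluster),           # large, outlined medoids
             shape = 23, size = 5, colour = "black")
p
```

To show only the medoids, drop the first geom_point() layer.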
library(FactoMineR)
library(dplyr)
data('iris')
pca.irix <- PCA(iris[ ,1:4])
gg <- factoextra::fviz_pca_biplot(X = pca.irix,
                                  # samples
                                  fill.ind = iris$Species, col.ind = 'black',
                                  pointshape = 21, pointsize = 1.5,
                                  geom.ind = 'point', repel = TRUE,
                                  geom.var = FALSE)
I would like to obtain a plot exactly like the one above, but without the species setosa.
I started with this, but I do not know how to continue:
setosa_wo <- iris %>%
filter(Species != 'setosa')
gg + scale_x_continuous(limits = c((-2), 2)) + scale_y_continuous(limits = c((-2), 2))
How can I remove one colored group from the plot while keeping the rest of the plot the same?
One way to remove one or more groups from the plot is to filter the data used by the layers. A look at gg$layers shows that the PCA plot is composed of six layers, but only the first two use the groups as fill color. I therefore filtered the data of just these two layers, which gives a plot with setosa removed.
EDIT: Following the suggestion by @DaveArmstrong, I added his code to fix the axis ranges at their original values, and additionally restored the original colors.
library(FactoMineR)
library(ggplot2)
pca.irix <- PCA(iris[ ,1:4])
gg <- factoextra::fviz_pca_biplot(X = pca.irix,
# samples
fill.ind = iris$Species, col.ind = 'black',
pointshape = 21, pointsize = 1.5,
geom.ind = 'point', repel = T,
geom.var = FALSE )
# First: Get the ranges
yrg <- ggplot2::layer_scales(gg)$y$range$range
xrg <- ggplot2::layer_scales(gg)$x$range$range
# Filter the data
gg$layers[[1]]$data <- dplyr::filter(gg$layers[[1]]$data, Fill. != "setosa")
gg$layers[[2]]$data <- dplyr::filter(gg$layers[[2]]$data, Fill. != "setosa")
gg +
# Set the limits to the original ones
ggplot2::coord_cartesian(xlim=xrg, ylim=yrg, expand=FALSE) +
# Add original colors
ggplot2::scale_fill_manual(values = scales::hue_pal()(3)[2:3])
Created on 2020-10-16 by the reprex package (v0.3.0)
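To decide which layers need filtering, it can help to list each layer's geom and the columns of its data. Here is a small sketch on a plain ggplot2 plot rather than on the fviz_pca_biplot output (the two-layer toy plot and the names layer_info and has_group are made up for illustration); the same lapply() call should work on gg$layers, where the grouping column is Fill. instead of Species.

```r
library(ggplot2)

# A toy plot with two layers that each carry their own data
p <- ggplot() +
  geom_point(data = iris, aes(Sepal.Length, Sepal.Width, fill = Species),
             shape = 21) +
  geom_hline(data = data.frame(y = 3), aes(yintercept = y), linetype = 2)

# For every layer: which geom is it, and which columns does its data have?
layer_info <- lapply(p$layers, function(l) {
  list(geom = class(l$geom)[1], columns = names(l$data))
})
str(layer_info)

# Only layers whose data contain the grouping column need to be filtered
has_group <- vapply(layer_info, function(i) "Species" %in% i$columns, logical(1))
has_group
```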
I am using a gene expression dataset from ~100 cells.
I want to generate a dot plot indicating which cells express which genes, like the example image but without the color delineations.
I have tried ggplot2 solutions, but (from what I can tell) ggplot2 cannot plot numerous variables on each axis. I've looked into more complex packages like Seurat and cRegulome (the example image is from cRegulome), but these put more information into the graphical output than I want.
Below is an example of the type of data frame I am working with.
Cell_A<-c(0,0,1,0,1,0,1,0)
Cell_B<-c(1,1,1,0,0,0,1,0)
Cell_C<-c(1,0,1,0,0,1,0,1)
Cell_D<-c(0,0,0,1,1,1,1,0)
Cell_E<-c(1,1,1,1,1,0,1,1)
Cell_F<-c(0,0,0,0,0,1,1,0)
Cell_G<-c(1,1,1,1,1,1,1,1)
Cell_H<-c(1,1,1,1,1,1,1,1)
Genes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene6","Gene7","Gene8")
fake_data <- data.frame(Cell_A, Cell_B, Cell_C, Cell_D, Cell_E,
Cell_F, Cell_G,Cell_H, row.names = Genes)
How can I manipulate this dataset to get the graphical output I want?
You can do this by reshaping the data and using geom_point. Map the size aesthetic to your count variable and it will work well. The legend is currently a bit nonsensical but can be manually tweaked if you do not have any other sizes than 0 and 1.
library(tidyverse)
Cell_A<-c(0,0,1,0,1,0,1,0)
Cell_B<-c(1,1,1,0,0,0,1,0)
Cell_C<-c(1,0,1,0,0,1,0,1)
Cell_D<-c(0,0,0,1,1,1,1,0)
Cell_E<-c(1,1,1,1,1,0,1,1)
Cell_F<-c(0,0,0,0,0,1,1,0)
Cell_G<-c(1,1,1,1,1,1,1,1)
Cell_H<-c(1,1,1,1,1,1,1,1)
Genes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene6","Gene7","Gene8")
fake_data <- data.frame(Cell_A, Cell_B, Cell_C, Cell_D, Cell_E,
Cell_F, Cell_G,Cell_H, row.names = Genes)
fake_data %>%
rownames_to_column(var = "gene") %>%
gather(cell, count, -gene) %>%
ggplot() +
geom_point(aes(x = gene, y = cell, size = count))
Created on 2019-08-02 by the reprex package (v0.3.0)
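One way to sidestep the odd size legend, assuming the data only ever contain 0 and 1, is to keep just the rows where the gene is expressed and draw those as equally sized points. A minimal sketch of that idea follows; it reshapes with base R instead of gather(), and the names long and expressed are illustrative.

```r
library(ggplot2)

fake_data <- data.frame(
  Cell_A = c(0,0,1,0,1,0,1,0), Cell_B = c(1,1,1,0,0,0,1,0),
  Cell_C = c(1,0,1,0,0,1,0,1), Cell_D = c(0,0,0,1,1,1,1,0),
  Cell_E = c(1,1,1,1,1,0,1,1), Cell_F = c(0,0,0,0,0,1,1,0),
  Cell_G = c(1,1,1,1,1,1,1,1), Cell_H = c(1,1,1,1,1,1,1,1),
  row.names = paste0("Gene", 1:8)
)

# Long format: one row per gene/cell combination
long <- data.frame(
  gene  = factor(rep(rownames(fake_data), times = ncol(fake_data)),
                 levels = rownames(fake_data)),
  cell  = factor(rep(names(fake_data), each = nrow(fake_data)),
                 levels = names(fake_data)),
  count = unlist(fake_data, use.names = FALSE)
)

# Keep only expressed combinations, so no size legend is needed
expressed <- long[long$count == 1, ]

ggplot(expressed, aes(x = gene, y = cell)) +
  geom_point(size = 3) +
  scale_x_discrete(drop = FALSE) +   # keep genes and cells without any
  scale_y_discrete(drop = FALSE)     # expression on the axes
```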
This solution is a base R solution that relies on matplot().
fake_data2 <- sweep(fake_data, 2, seq_len(length(fake_data)), FUN = '*')
fake_data2[fake_data2 == 0] <- NA_integer_
matplot(x = seq_along(Genes), y = as.matrix(fake_data2),
        cex = colSums(fake_data) / 3, pch = 16, col = 1,
        yaxt = 'n', xaxt = 'n', ann = FALSE)
axis(1, at = seq_along(Genes), Genes)
axis(2, at = seq_len(length(fake_data)), names(fake_data), las = 1)
You didn't provide enough detail on what point size you wanted. Here, the size is based on the number of 1 values in each column.
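To make the sweep() step more concrete, here is a small worked example on a made-up 3 x 2 table (the names m and pos are hypothetical). Each column is multiplied by its column index, so every 1 becomes the y position of that cell's row in the plot, and zeros are turned into NA so that matplot() does not draw them.

```r
# Toy expression table: rows are genes, columns are cells
m <- data.frame(Cell_A = c(0, 1, 1), Cell_B = c(1, 0, 1),
                row.names = c("Gene1", "Gene2", "Gene3"))

# Multiply column j by j: 1s in Cell_A stay 1, 1s in Cell_B become 2
pos <- sweep(m, 2, seq_len(ncol(m)), FUN = "*")

# Zeros mean "not expressed"; NA values are not drawn
pos[pos == 0] <- NA
pos

# Point size per column, as in the answer: number of 1s divided by 3
colSums(m) / 3
```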
Here is a toy example I am stuck on:
library(plotly)
library(dplyr)
# construct data.frame
df <- tibble(x=c(3,2,3,5,5,5,2),y=c("a","a","a","b","b","b","b"))
# construct data.frame of last y values
latest <- df %>%
group_by(y) %>%
slice(n())
# plot for one value of y (NB not sure why value for 3 appears?)
p <- plot_ly() %>%
add_histogram(data=subset(df,y=="b"),x= ~x) %>%
add_histogram(data=subset(latest,y=="b"),x= ~x,marker=list(color="red")) %>%
layout(barmode="overlay",showlegend=FALSE,title= ~y)
p
How can I set these up as subplots, one for each unique value of y? In the real-world example I would have 20 different y values, so ideally I would loop over or apply the code. In addition, it would be good to set a standard x scale of, say, 1 to 10 and to arrange the plots in, for example, 2 rows.
TIA
- Build a list containing each of the plots.
- Set the bin sizes manually for the histograms; otherwise the automatic selection will choose different bins for each trace within a plot (making it look strange, as in your example, where the bars of each trace have different widths).
- Use subplot to put it all together.
- Add titles to the individual subplots using a list of annotations, as explained here.
Like this:
# One plot (and one title annotation) per level of y
y_levels = levels(factor(df$y))
N = length(y_levels)
plot_list = vector("list", N)
lab_list = vector("list", N)
for (i in 1:N) {
  this_y = y_levels[i]
  # Use the same manual bins for both traces so the bars line up
  p <- plot_ly() %>%
    add_trace(type = "histogram", data = subset(df, y == this_y), x = ~x,
              marker = list(color = "blue"),
              autobinx = FALSE, xbins = list(start = 0.5, end = 6.5, size = 1)) %>%
    add_trace(type = "histogram", data = subset(latest, y == this_y), x = ~x,
              marker = list(color = "red"),
              autobinx = FALSE, xbins = list(start = 0.5, end = 6.5, size = 1)) %>%
    layout(barmode = "overlay", showlegend = FALSE)
  plot_list[[i]] = p
  # Title position in paper coordinates (these y values fit two rows)
  titlex = 0.5
  titley = c(1.05, 0.45)[i]
  lab_list[[i]] = list(x = titlex, y = titley, text = this_y,
                       showarrow = FALSE, xref = 'paper', yref = 'paper',
                       font = list(size = 18))
}
subplot(plot_list, nrows = 2) %>%
  layout(annotations = lab_list)
Using this code in R,
library("dendextend")
library("dendextendRcpp")
dist2 <- read.csv("distanceMatrix.csv",sep=";",header=TRUE)
mat <- as.matrix(dist2)
# using piping to get the dend
dend <- dist2 %>% dist %>% hclust %>% as.dendrogram %>% set("labels", colnames(mat))
foo <- function(k){
  svg(filename = "dendrogram_newest.svg", width = 25, height = 14)
  # plot + color the dend's branches before, based on k clusters:
  dend %>% color_branches(k) %>% plot()
  # add horiz line:
  abline(h = heights_per_k.dendrogram(dend)[k], lwd = 2, lty = 2, col = "purple")
  dev.off()
}
foo(6)
I get this:
So, how can I shorten these lines? As it is, the plot is almost unreadable.
And yes, my labels are in the same order as the first row of my distanceMatrix.csv. That order has nothing to do with the relationships inside the distance matrix. In other words, the dendrogram itself is fine, but the labels are attached to the wrong leaves.
Thanks
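On the label issue, one possible cause (an assumption, since the CSV and the plot are not shown) is the combination of dist() and set("labels", ...). dist() treats its input as raw observations and computes new distances between the rows, whereas as.dist() reuses an existing distance matrix. In addition, assigning labels to a dendrogram fills them in leaf order from left to right, not in the original row order. Here is a base R sketch with made-up data, in which the labels come from the matrix dimnames and therefore stay attached to the right leaves:

```r
set.seed(42)

# Made-up points; their pairwise distances stand in for distanceMatrix.csv
pts <- matrix(rnorm(12), ncol = 2,
              dimnames = list(paste0("item", 1:6), NULL))
mat <- as.matrix(dist(pts))          # square distance matrix with dimnames

# as.dist() reuses the stored distances instead of recomputing them
hc   <- hclust(as.dist(mat))
dend <- as.dendrogram(hc)

# Labels are taken from the dimnames and reordered to match the leaves
labels(dend)
rownames(mat)[order.dendrogram(dend)]   # same vector

plot(dend)
```

With data read from a CSV, this assumes the matrix is square and that its row and column names are set before calling as.dist().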