Visualizing the CLARA cluster center/medoid - r

I visualized my CLARA results using fviz_cluster (which is built on ggplot2), and I would like to make the medoid of each cluster more prominent than the other data points (for example by changing its shape or color). The issue is that I have more than 800,000 data points, so the medoids are impossible to see with "show.clust.cent" alone.
How can I color the medoids differently and make them much bigger than the other data points, or hide all data points except the medoids? I also tried star.plot, but somehow it didn't work.
I know the row numbers of the medoids and thought of adding them manually, but I don't know how to integrate that into fviz_cluster.
Can anyone help me with this? Thank you!
fviz_cluster(clara.res,
palette = c("#004c6d",
"#00ffff",
"#00a1c1",
"#6efa75",
"#78ab63",
"#cc0089",
"#ffc334",
"#ff9509",
"#ffb6de",
"#00cfe3"
), # color palette
ellipse.type = "t",geom = "point",show.clust.cent = TRUE,repel = TRUE,pointsize = 0.5,
ggtheme = theme_classic())

Will this be ok for you?
library(tidyverse)
fpoint = function(n) tibble(
Dlm1 = rnorm(n, sample(-20:20,1), sample(1:5,1)),
Dlm2 = rnorm(n, sample(-20:20,1), sample(1:5,1))
)
df = tibble(cluster = paste(1:10)) %>%
mutate(data = map(cluster, ~fpoint(1000))) %>%
unnest(data)
df %>% ggplot(aes(Dlm1, Dlm2, color=cluster))+
geom_point(alpha = 0.2, pch=21)+
stat_ellipse(size=0.7)
Write data to tibble and use standard ggplot.
Update 1
library(factoextra)
library(cluster)
df = USArrests %>% na.omit() %>% scale()
kmed = pam(df, k = 4)
fviz_cluster(kmed, data = df, alpha=0.3, geom = "point", show.clust.cent = TRUE,repel = TRUE, pointsize = 2)
Is that the point?
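If the goal is only to make the medoids stand out, one option is to skip fviz_cluster and draw them as a separate layer in plain ggplot2. The following is a minimal sketch, not a tested recipe for the original data: it assumes two numeric columns (the names x and y and the simulated data are made up for illustration) and relies on the medoids, i.med and clustering components that cluster::clara returns. With more than two variables the data would first have to be projected to two dimensions (fviz_cluster uses PCA for that).

```r
library(cluster)
library(ggplot2)

set.seed(42)
# Hypothetical two-column data standing in for the real 800,000-row data set
dat <- data.frame(x = c(rnorm(500, 0), rnorm(500, 5)),
                  y = c(rnorm(500, 0), rnorm(500, 5)))
clara.res <- clara(dat, k = 2, samples = 10)

# All observations with their cluster assignment
pts <- data.frame(dat, cluster = factor(clara.res$clustering))
# One row per medoid; row i of $medoids is the medoid of cluster i
meds <- data.frame(clara.res$medoids,
                   cluster = factor(seq_len(nrow(clara.res$medoids))))

p <- ggplot(pts, aes(x, y, colour = cluster)) +
  geom_point(size = 0.3, alpha = 0.1) +            # faint background points
  geom_point(data = meds, aes(fill = cluster),     # large filled diamonds on top
             shape = 23, size = 5, colour = "black")
p
```

To show only the medoids, drop the first geom_point() layer. The row numbers of the medoids in the original data are available as clara.res$i.med.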

Related

R ggplot: overlay two conditional density plots (same binary outcome variable) - possible?

I know how to plot several density curves/polygrams on one plot, but not conditional density plots.
Reproducible example:
require(ggplot2)
# generate data
a <- runif(200, min=0, max = 1000)
b <- runif(200, min=0, max = 1000)
c <- sample(c("A", "B"), 200, replace =T)
df <- data.frame(a,b,c)
# plot 1
ggplot(df, aes(a, fill = c)) +
geom_density(position='fill', alpha = 0.5)
# plot 2
ggplot(df, aes(b, fill = c)) +
geom_density(position='fill', alpha = 0.5)
In my real data I have a bunch of these paired conditional density plots and I would need to overlay one over the other to see (and show) how different (or similar) they are. Does anyone know how to do this?
One way would be to plot the two versions as layers. The overlapping areas will be slightly different, depending on the layer order, based on how alpha works in ggplot2. This may or may not be what you want. You might fiddle with the two alphas, or vary the border colors, to distinguish them more.
ggplot(df, aes(fill = c)) +
geom_density(aes(a), position='fill', alpha = 0.5) +
geom_density(aes(b), position='fill', alpha = 0.5)
For example, you might make it so the fill only applies to one layer, but the other layer distinguishes groups using the group aesthetic, and perhaps a different linetype. This one seems more readable to me, especially if there is a natural ordering to the two variables that justifies putting one in the "foreground" and one in the "background."
ggplot(df) +
geom_density(aes(a, group = c), position='fill', alpha = 0.2, linetype = "dashed") +
geom_density(aes(b, fill = c), position='fill', alpha = 0.5)
I'm not so sure if "on top of one another" is a great idea. Jon's ideas are probably the way to go. But what about just plotting side by side? Our brains can cope with that, and we can compare the two panels pretty well.
Make it long, then use facet.
Another option might be an animated graph (see 2nd code chunk below).
require(ggplot2)
#> Loading required package: ggplot2
library(tidyverse)
a <- runif(200, min=0, max = 1000)
b <- runif(200, min=0, max = 1000)
#### BAAAAAD idea to call anything "c" in R!!! Don't do this. ever!
d <- sample(c("A", "B"), 200, replace =T)
df <- data.frame(a,b,d)
df %>% pivot_longer(cols = c(a,b)) %>%
ggplot(aes(value, fill = d)) +
geom_density(position='fill', alpha = 0.5) +
facet_grid(~name)
library(gganimate)
p <- df %>% pivot_longer(cols = c(a,b)) %>%
ggplot(aes(value, fill = d)) +
geom_density(position='fill', alpha = 0.5) +
labs(title = "{closest_state}")
p_anim <- p + transition_states(name)
animate(p_anim, duration = 2, fps = 5)
Created on 2022-06-14 by the reprex package (v2.0.1)
Although it is not the overlay you might have thought of, it facilitates the comparison of density curves:
library(tidyverse)
library(ggridges)
library(truncnorm)
DF <- tibble(
alpha = rtruncnorm(n = 200, a = 0, b = 1000, mean = 500, sd = 50),
beta = rtruncnorm(n = 200, a = 0, b = 1000, mean = 550, sd = 50)
)
DF <- DF %>%
pivot_longer(c(alpha, beta), names_to = "name", values_to = "meas") %>%
mutate(name = factor(name))
DF %>%
ggplot(aes(meas, name, fill = factor(stat(quantile)))) +
stat_density_ridges(
geom = "density_ridges_gradient",
calc_ecdf = T,
quantiles = 4,
quantile_lines = T
) +
scale_fill_viridis_d(name = "Quartiles")

Force plotly violin plot not to display a violin on zero values

I have measurements from several groups which I would like to plot as violin plots:
set.seed(1)
df <- data.frame(val = c(runif(100,1,5),runif(100,1,5),rep(0,100)),
group = c(rep("A",100),rep("B",100),rep("C",100)))
Using R's ggplot2:
library(ggplot2)
ggplot(data = df, aes(x = group, y = val, color = group)) + geom_violin()
I get:
But when I try to get the equivalent with R's plotly using:
library(plotly)
plot_ly(x = df$group, y = df$val, split = df$group, type = 'violin', box = list(visible = F), points = F, showlegend = T, color = df$group)
I get:
Where group "C" gets an inflated/artificial violin.
Any idea how to deal with this and not by using ggplotly?
I did not find a way to fix the behaviour of plotly (probably worth making a bug report for this). A workaround would be to filter your data to only draw violin plots on groups whose range is greater than zero. If you also need to show where the other groups are, you can use a boxplot for these.
To demonstrate, I use library(data.table) for the filtering stage. You could use dplyr or base versions of the same procedure if you prefer:
library(data.table)
# flag groups whose values are not all identical (range greater than zero)
setDT(df)[, toplot := diff(range(val)) > 0, group]
Now we can plot the groups using different trace styles depending on whether they should have violins or not
plot_ly() %>%
add_trace(data = df[(toplot)], x = ~group, y = ~val, split = ~group,
type = 'violin', box = list(visible = F), points = F) %>%
add_boxplot(data = df[(!toplot)], x = ~group, y = ~val, split = ~group)
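For readers who prefer not to use data.table, the same filtering step can be sketched in base R. This is only an illustration of the idea in the answer above; the names spread, violin_df and box_df are made up here, and the data are the simulated data from the question.

```r
set.seed(1)
df <- data.frame(val = c(runif(100, 1, 5), runif(100, 1, 5), rep(0, 100)),
                 group = rep(c("A", "B", "C"), each = 100))

# Spread (max - min) of val within each group, repeated for every row of that group
spread <- ave(df$val, df$group, FUN = function(v) diff(range(v)))
df$toplot <- spread > 0

violin_df <- df[df$toplot, ]   # groups that get a violin trace
box_df    <- df[!df$toplot, ]  # constant groups, drawn as a boxplot instead
```

violin_df and box_df can then be passed as the data argument of the two traces in place of df[(toplot)] and df[(!toplot)].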

Using ggplot2 and viridis, fill histogram based on other variable

I am trying to create the top left graph in this figure in ggplot, using viridis to make the colour gradient.
Here is my sample data:
# simulate t-values
data = data.frame(sim =1:10000,
t_0= rt(n = 10000,df =12, ncp=0),
t_1 = rt(n = 10000,df =12, ncp=1.2))
# compute p-values
data = data %>%
mutate(p_0 = 2 * pt(abs(t_0), df = 12, lower.tail = FALSE), # two-sided p-value; lower.tail takes a single logical, not a vector
p_1 = 2 * pt(abs(t_1), df = 12, lower.tail = FALSE))
# convert from wide to long
data.long = data %>%
gather(condition,measurement, t_0:p_1) %>%
separate(col=condition, into=c("para","hyp"), sep = "_")
# convert to wide repeated measures format
data.wide = data.long %>% spread(key = para, measurement)
To create the graphs on the left, I need to colour the histogram according to the corresponding values in the right graphs. If t = 0 (corresponding to a p close to 1), the graph should be yellow; if t > 4 (corresponding to a p close to 0), the fill should be dark blue. This post shows how to create a similar graph using scale_fill_gradientn, which unfortunately does not work with the discrete values I have created using cut().
This is the closest I have come, however I want the graph to have yellow for x=0 blending to dark blue at the edges.
# create bins based on t-values
t0bins <- seq(-12, 12, by = 1)
# compute corresponding p-values
pt0bins <- 2*pt(t0bins, df = 12, lower.tail = FALSE)
ggplot(data.wide, aes(x=t, fill=cut(..x.., breaks=get("t0bins", envir=.GlobalEnv)))) +
geom_histogram(binwidth=0.1)+
scale_fill_viridis(discrete=T)
which gives:
You can try
library(tidyverse)
library(viridis)
data.wide %>%
mutate(bins=cut(t, breaks=t0bins)) %>%
{ggplot(.,aes(x=t, fill=bins)) +
geom_histogram(binwidth=0.1)+
scale_x_continuous(limits =c(-12,12)) +
scale_fill_manual(drop=FALSE,values = c(viridis(nlevels(.$bins)/2), viridis(nlevels(.$bins)/2, direction = -1)))}
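The key step in this answer is the mirrored palette passed to scale_fill_manual: the first half of the bins runs from the dark end of viridis to yellow, and the second half runs back again, so the bins next to t = 0 are yellow. A small sketch of just that step, assuming the t0bins breaks from the question (the names n_bins, half and pal are introduced here for illustration):

```r
library(viridisLite)  # provides viridis(); the viridis package re-exports it

t0bins <- seq(-12, 12, by = 1)
n_bins <- length(t0bins) - 1   # cut() with these breaks yields one level per interval
half   <- n_bins / 2

# dark -> yellow for negative t, then yellow -> dark for positive t
pal <- c(viridis(half), viridis(half, direction = -1))
```

This only works as written when the number of bins is even and the breaks are symmetric around zero, as they are here.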

R: PCA ggplot Error "arguments imply differing number of rows"

I have a dataset:
https://docs.google.com/spreadsheets/d/1ZgyRQ2uTw-MjjkJgWCIiZ1vpnxKmF3o15a5awndttgo/edit?usp=sharing
that I'm trying to run a PCA on, to produce a graph like the one provided in this post:
https://stats.stackexchange.com/questions/61215/how-to-interpret-this-pca-biplot-coming-from-a-survey-of-what-areas-people-are-i
However, an error doesn't seem to go away:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
arguments imply differing number of rows: 0, 1006
Below is my code; I have trouble finding the source of the error and would appreciate help locating it. Any hints?
The goal is to produce a PCA graph grouped by the levels of Happiness.in.life. I modified the original code to fit my dataset. Originally, the groups were determined by gender, which has 2 levels. What I'm attempting to do is build a graph based on the 5 levels of Happiness.in.life. However, it seems I can't use the old code as is...
Thanks!
library(magrittr)
library(dplyr)
library(tidyr)
df <- happiness_reduced %>% dplyr::select(Happiness.in.life:Internet.usage, Happiness.in.life)
head(df)
vars_on_hap <- df %>% dplyr::select(-Happiness.in.life)
head(vars_on_hap)
group<-df$Happiness.in.life
fit <- prcomp(vars_on_hap)
pcData <- data.frame(fit$x)
vPCs <- fit$rotation[, c("PC1", "PC2")] %>% as.data.frame()
multiple <- min(
(max(pcData[,"PC1"]) - min(pcData[,"PC1"]))/(max(vPCs[,"PC1"])-
min(vPCs[,"PC1"])),
(max(pcData[,"PC2"]) - min(pcData[,"PC2"]))/(max(vPCs[,"PC2"])-
min(vPCs[,"PC2"]))
)
ggplot(pcData, aes(x=PC1, y=PC2)) +
geom_point(aes(colour=groups)) +
coord_equal() +
geom_text(data=vPCs,
aes(x = fit$rotation[, "PC1"]*multiple*0.82,
y = fit$rotation[,"PC2"]*multiple*0.82,
label=rownames(fit$rotation)),
size = 2, vjust=1, color="black") +
geom_segment(data=vPCs,
aes(x = 0,
y = 0,
xend = fit$rotation[,"PC1"]*multiple*0.8,
yend = fit$rotation[,"PC2"]*multiple*0.8),
arrow = arrow(length = unit(.2, 'cm')),
color = "grey30")
Here is an approach on how to plot the result of PCA in ggplot2:
library(tidyverse)
library(ggrepel)
Scaling the variables prior to PCA is usually a good idea (though not always, for instance when they are all in the same units):
hapiness %>% #this is the data from google drive. In the future try not to post such links on SO because they tend to become unusable after some time has passed
select(-Happiness.in.life) %>%
prcomp(center = TRUE, scale. = TRUE) -> fit
Now we can proceed to plotting the fit:
fit$x %>% #coordinates of the points are in x element
as.data.frame()%>% #convert matrix to data frame
select(PC1, PC2) %>% #select the first two PC
bind_cols(hapiness = as.factor(hapiness$Happiness.in.life)) %>% #add the coloring variable
ggplot() +
geom_point(aes(x = PC1, y = PC2, colour = hapiness)) + #plot points and color
geom_segment(data = fit$rotation %>% #data we want plotted by geom_segment is in rotation element
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(), #get to row names so you can label after
aes(x = 0, y = 0, xend = PC1 * 7, yend = PC2* 7, group = rowname), #I scaled the rotation by 7 so it fits in the plot nicely
arrow = arrow(angle = 20, type = "closed", ends = "last",length = unit(0.2,"cm")),
color = "grey30") +
geom_text_repel(data = fit$rotation %>%
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(),
aes(x = PC1*7,
y = PC2*7,
label = rowname)) +
coord_equal(ratio = fit$sdev[2]^2 / fit$sdev[1]^2) + #I like setting the ratio to the ratio of eigen values
xlab(paste("PC1", round(fit$sdev[1]^2/ sum(fit$sdev^2) *100, 2), "%")) +
ylab(paste("PC2", round(fit$sdev[2]^2/ sum(fit$sdev^2) *100, 2), "%")) +
theme_bw()
Look at all the happy people on the left. (It is hard to notice because of the colors used; I suggest the jco palette from the ggpubr library: get_palette('jco', 5), i.e. scale_color_manual(values = get_palette('jco', 5)).)
Quite a similar plot can be achieved with the ggord library:
library(ggord)
ggord(fit, grp_in = as.factor(hapiness$Happiness.in.life),
size = 1, ellipse = F, ext = 1.2, vec_ext = 5)
The major difference is that ggord uses equal scaling for the axes. Also, I scaled the rotation by 5 instead of 7 as in the first plot.
As you can see I do not like many intermediate data frames.
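The axis labels in the first plot report the share of variance explained by each component, computed from the squared standard deviations in fit$sdev. As a standalone sketch of that calculation, here it is on the built-in USArrests data (used only as a stand-in, since the happiness data is not included here; var_explained and axis_labels are illustrative names):

```r
fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared sdev values are the eigenvalues; dividing by their sum gives proportions
var_explained <- fit$sdev^2 / sum(fit$sdev^2)

# Labels for the first two axes, rounded as in the plot code above
axis_labels <- paste0("PC", 1:2, " (", round(100 * var_explained[1:2], 2), "%)")
axis_labels
```

The same proportions appear in the "Proportion of Variance" row of summary(fit).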

Drawing rectangles around specified labels in a dendrogram with 'dendextend'

I'm currently constructing a dendrogram and I'm using 'dendextend' to tweak the look of it.
I've been able to do everything I want to (labelling leaves and highlighting branches of my chosen clusters), except drawing rectangles around pre-defined clusters.
My data (which can be sourced from this file: Barra_IBS_example.matrix) was clustered with 'pvclust', so 'pvrect' draws the rects in the correct position, but it cuts the labels (see image below), so I want to reproduce it with 'rect.dendrogram', however, I can't figure out how to tell the function to use the clustering data from 'pvclust'.
This is the code I'm using:
idnames <- dimnames(ibs_mat)[[1]]
ibs.pv <- pvclust(ibs_mat, nboot=1000)
ibs.clust <- pvpick(ibs.pv, alpha=0.95)
names(ibs.clust$clusters) <- paste0("Cluster", 1:length(ibs.clust$clusters))
# Choose a colour palette
pal <- brewer.pal(length(ibs.clust$clusters), "Paired")
# Transform the list to a dataframe
ibs_meta <- bind_rows(lapply(names(ibs.clust$clusters),
function(l) data.frame(Cluster=l, Sample = ibs.clust$clusters[[l]])))
# Add the rest of the non-clustered samples (and assign them as Cluster0), add colour to each cluster
ibs_table <- ibs_meta %>%
rbind(., data.frame(Cluster = "Cluster0",
Sample = idnames[!idnames %in% .$Sample])) %>%
mutate(Cluster_int=as.numeric(sub("Cluster", "", Cluster))) %>%
mutate(Cluster_col=ifelse(Cluster_int==0, "#000000",
pal[Cluster_int])) %>%
.[match(ibs.pv$hclust$labels[ibs.pv$hclust$order], .$Sample),]
hcd <- as.dendrogram(ibs.pv) %>%
#pvclust_show_signif(ibs.pv, show_type = "lwd", signif_value = c(2, 1),alpha=0.25) %>%
set("leaves_pch", ifelse(ibs_table$Cluster_int>0,19,18)) %>% # node point type
set("leaves_cex", 1) %>% # node point size
set("leaves_col", ibs_table$Cluster_col) %>% #node point color
branches_attr_by_labels(ibs_meta$Sample, TF_values = c(2, Inf), attr = c("lwd")) %>% # change branch width
# rect.dendrogram(k=12, cluster = ibs_table$Cluster_int, border = 8, lty = 5, lwd = 1.5,
# lower_rect = 0) %>% # add rectangles around clusters
plot(main="Barramundi samples IBS based clustering")
pvrect(ibs.pv, alpha=0.95, lwd=1.5)
Many thanks, Ido
ok, this took more work than I had hoped, but I got a solution for you.
I created a new function called pvrect2 and just pushed it to the latest version of dendextend on github. Here is a self contained example demonstrating the solution:
devtools::install_github('talgalili/dendextend')
library(pvclust)
library(dendextend)
data(lung) # 916 genes for 73 subjects
set.seed(13134)
result <- pvclust(lung[, 1:20], method.dist="cor", method.hclust="average", nboot=10)
par(mar = c(9,2.5,2,0))
dend <- as.dendrogram(result)
dend %>%
pvclust_show_signif(result, signif_value = c(3,.5)) %>%
pvclust_show_signif(result, signif_value = c("black", "grey"), show_type = "col") %>%
plot(main = "Cluster dendrogram with AU/BP values (%)")
# pvrect(result, alpha=0.95)
pvrect2(result, alpha=0.95)
text(result, alpha=0.95)

Resources