Count the number of clusters on a UMAP scatterplot (without providing labels)?

I have a dataframe with columns 'x' and 'y' corresponding to the x/y coordinates of a scatterplot I've made using ggplot2. I'm looking for some way to ask, "how many clusters exist here?". I understand that some user input may be required to define what counts as a 'cluster'.
I have found some success using Seurat, because it contains a function to label clusters. However, it is more about finding the clusters that correspond to a vector of labels provided by the user (e.g. I provide 5 unique labels, so it goes and finds 5 clusters).
Min Reprex:
Seurat's LabelClusters function is very useful for labeling clusters starting solely from X/Y coordinates:
library(umap)
library(Seurat)
library(ggplot2)
my_umap <- umap(iris[, 1:4])                 # embed the four iris measurements
my_umap <- as.data.frame(my_umap$layout)
my_umap$id <- iris$Species
colnames(my_umap) <- c("x", "y", "id")
p <- ggplot(my_umap, aes(x = x, y = y, color = id)) + geom_point()
LabelClusters(plot = p, id = "id", color = "black")
Image of result
However, I need to detect the total number of clusters in these data (without providing labels). By this I mean first detecting how many clusters exist; perhaps here we would see 5 clusters instead of 3:
And here is that result

First of all, the number of clusters is quite subjective. Different algorithms give different results, and it is up to the analyst to decide what an appropriate clustering is. Some clusterings are better than others, but which algorithm calls clusters best depends on the data. For example, if you feed random noise to a clustering algorithm, it will most likely still find clusters, but because the data are noise, those clusters are meaningless.
I'm going to caution against calling clusters in UMAP space. UMAP is a nearest-neighbour embedding method that is constrained to 2 or 3 dimensions, and this constraint discards a lot of potentially useful information. A generally useful strategy is to compute a (shared) nearest-neighbour graph on the data (or on a number of principal components), for example with bluster::makeSNNGraph(), and then use igraph::cluster_louvain(), or some other community-detection method, to call clusters.
library(ggplot2)
library(umap)
data <- as.matrix(iris[, 1:4])
my_umap <- as.data.frame(umap(data)$layout)  # 2-D embedding, used for plotting only
graph <- bluster::makeSNNGraph(data)         # shared nearest-neighbour graph on the full data
#> Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, :
#> detected tied distances to neighbors, see ?'BiocNeighbors-ties'
clust <- igraph::cluster_louvain(graph)      # Louvain community detection on the graph
my_umap$clust <- factor(clust$membership)
ggplot(my_umap, aes(V1, V2, colour = clust)) +
  geom_point()
Created on 2022-01-31 by the reprex package (v2.0.1)
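To read the "how many clusters?" answer straight off the clustering object, you can inspect the Louvain membership vector; a quick sketch using the objects created above:
# number of communities found by Louvain, and their sizes
length(unique(clust$membership))
table(clust$membership)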

Related

DBSCAN Clustering returning single cluster with noise points

I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
The algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' used to identify the eps value. Since there are 17 attributes to be considered for clustering, I have taken k = 17*2 = 34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I try for eps and minPts, the result is the same.
Can anyone tell where I have gone wrong?
Thanks in advance.
You have two options:
Increase the search radius (the eps parameter)
Decrease the minimum number of points (minPts) required to define a core point.
I would start by decreasing the minPts parameter, since I think it is very high; if the algorithm does not find that many points within the radius, it will not group more points into a cluster.
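For instance, a quick retry with the same eps but a much smaller minPts (the values here are illustrative, not tuned):
# illustrative only: same eps, much smaller minPts
db2 <- dbscan::dbscan(data1, eps = 4, minPts = 10)
db2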
A typical problem with using DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method would be to use a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster; k-means will just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal indicating that clustering will not find much.
To improve the clustering, you need to work on the data: you can remove irrelevant variables, and you may have very skewed variables that should be transformed first. You could also try a non-linear embedding before clustering.
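A hedged sketch of that pre-processing idea (the log1p transform, and the assumption that all 17 retained variables are non-negative, are mine and not tested on this dataset):
library(dbscan)
# reload the raw values (path as in the question), drop the ID column and NAs
raw <- na.omit(read.csv("CC GENERAL.csv")[, 2:18])
# log1p compresses the heavy right tails typical of monetary variables
transformed <- scale(log1p(raw))
# re-read the knee from the new kNN distance plot; the eps below is a placeholder
kNNdistplot(transformed, k = 34)
db <- dbscan(transformed, eps = 2, minPts = 34)
db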

Point pattern classification with spatstat: how to choose the right bandwidth?

I'm still trying to find the best way to classify bivariate point patterns:
Point pattern classification with spatstat: what am I doing wrong?
I have now analysed 110 samples of my dataset using @Adrian's suggestion with sigma=bw.diggle (as I wanted an automatic bandwidth selection). f is a "resource selection function" (RSF) which describes the relationship between the intensity of the Cancer point process and the covariate (here, the kernel density of Immune):
library(spatstat)
Cancer <- split(cells)[["tumor"]]
Immune <- split(cells)[["bcell"]]
Dimmune <- density(Immune, sigma = bw.diggle)
f <- rhohat(Cancer, Dimmune)
I am in doubt about some of the results I've got. A dozen rho functions looked weird (disrupted, single peak). After changing to the default sigma=NULL or to sigma=bw.scott (which are smoother), the functions became "better" (see examples below). I also experimented with the following manipulations:
cells # bivariate point pattern with marks "tumor" and "bcell"
o.marks <- cells$marks # original marks
#A) randomly re-assign original marks
a.marks <- sample(cells$marks)
#B) replace marks randomly with a 50/50 proportion
b.marks <- as.factor(sample(c("tumor", "bcell"), replace = TRUE, size = length(o.marks)))
#C) random (homogeneous?) pattern with the original number of points
randt <- runifpoint(npoints(subset(cells, marks == "tumor")), win = cells$window)
randb <- runifpoint(npoints(subset(cells, marks == "bcell")), win = cells$window)
cells <- superimpose(tumor = randt, bcell = randb)
#D) tumor points are associated with bcell points (is "clustered" the right term?)
Cancer <- rpoint(npoints(subset(cells, marks == "tumor")), Dimmune, win = cells$window)
#E) tumor points are segregated from bcell points
reversedD <- Dimmune
density.scale.v <- sort(unique(as.vector(Dimmune$v)[!is.na(as.vector(Dimmune$v))])) # density scale
density.scale.v.rev <- rev(density.scale.v) # reversed density scale
new.image.v <- Dimmune$v
# loop over the pixel matrix, replacing each density value with its mirror
for (row in 1:nrow(Dimmune$v)) {
  for (col in 1:ncol(Dimmune$v)) {
    if (is.na(Dimmune$v[row, col])) next
    number <- which(density.scale.v == Dimmune$v[row, col])
    new.image.v[row, col] <- density.scale.v.rev[number]
  }
}
reversedD$v <- new.image.v # reversed density
Cancer <- rpoint(npoints(subset(cells, marks == "tumor")), reversedD, win = cells$window)
A better way to generate inverse density heatmaps is given by @Adrian in his post below.
I could not generate rpoint patterns for the bw.diggle density as it produced negative numbers. Thus I replaced the negatives with Dimmune$v[which(Dimmune$v < 0)] <- 0 and could then run rpoint. As @Adrian explains in his post below, this is normal and can be handled more easily by using the density.ppp option positive=TRUE.
I first used bw.diggle because hopskel.test indicated "clustering" for all my patterns. I am now going to use bw.scott for my analysis, but can this decision be justified somehow? Is there a better criterion than "the RSF function looks weird"?
Some examples (rho plots for samples 10, 20 and 110; images omitted).
That is a lot of questions!
Please try to ask only one question per post.
But here are some answers to your technical questions about spatstat.
Negative values:
The help for density.ppp explains that small negative values can occur because of numerical effects. To force the density values to be non-negative, use the argument positive=TRUE in the call to density.ppp. For example density(Immune, bw.diggle, positive=TRUE).
Reversed image: to reverse the ordering of values in an image Z you can use the following code:
V <- Z
A <- order(Z[])
V[][A] <- Z[][rev(A)]
Then V is the order-reversed image.
Tips for your code:
To generate a random point pattern with the same number of points and in the same window as an existing point pattern X, use Y <- runifpoint(ex=X).
To extract the marks of a point pattern X, use a <- marks(X). To assign new marks to a point pattern X, use marks(X) <- b.
To randomly permute the marks attached to the points in a point pattern X, use Y <- rlabel(X).
To assign new marks to a point pattern X where the new marks are drawn randomly-with-replacement from a given vector of values m, use Y <- rlabel(X, m, permute=FALSE).
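A minimal sketch of those tips, using the amacrine pattern that ships with spatstat as a stand-in for your cells object:
library(spatstat)
X <- amacrine                       # bivariate point pattern with marks "on"/"off"
Y <- runifpoint(ex = X)             # random pattern, same n and window as X
m <- marks(X)                       # extract the marks
Yperm <- rlabel(X)                  # randomly permute the existing marks
Ynew <- rlabel(X, c("on", "off"), permute = FALSE)  # draw new marks with replacement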

Rphenograph output is a "large communities" object (68 elements, 2.5 MB). How to plot it? ggplot2?

I used some raw output files from our flow cytometer, which give me a .csv of the intensities measured at each wavelength for every event/cell.
This resulted in a .csv with around 25,000 cells and around 240 measuring points.
Importing the .csv file into RStudio and removing some measurements yielded a matrix of 25,000 obs x 73 variables.
Then I used Rphenograph to calculate the neighbourhoods, which worked well.
But now the output seems to be a data frame or some other object that I genuinely have no idea how to plot.
library(readr)
library(dplyr)
library(cytofkit)   # provides Rphenograph()
Data1 <- read_csv("CD4_3.csv", skip = 17)
Data_selected <- select(Data1, ends_with(".A"))
rpheno_out <- Rphenograph(Data_selected)
I hoped to get a plot which looks/resembles a tSNE plot.
Instead, I only got an error telling me that ggplot can't handle it:
ggplot(rpheno_out) + geom_point()
Error: data must be a data frame, or other object coercible by fortify(), not an S3 object with class communities
I think you've misunderstood what the Rphenograph() function returns; the doc states:
A simple R implementation of the PhenoGraph algorithm, which is a clustering method designed for high-dimensional single-cell data analysis. It works by creating a graph ("network") representing phenotypic similarities between cells by calculating the Jaccard coefficient between nearest-neighbor sets, and then identifying communities using the well-known Louvain method in this graph.
This only builds clusters from your data based on the information you provide. The output contains no dimensionally reduced version of your input.
If you want to see what your clustering looks like, you have to apply your favorite dimension-reduction method and plot the result, color-coded with the cluster info from Rphenograph().
To give you an example, I've done this with the code provided in the function's doc:
library(cytofkit)
## Example from Rphenograph's doc
iris_unique <- unique(iris) # remove duplicates
data <- as.matrix(iris_unique[, 1:4])
Rphenograph_out <- Rphenograph(data, k = 45)
## Added bit to see the results
pca <- prcomp(iris_unique[, 1:4], retx = TRUE, rank. = 2)
par(mfrow = c(1, 2))
plot(pca$x, col = Rphenograph_out$membership, lwd = 3,
     main = "Color by Rphenograph cluster")
plot(pca$x, col = iris_unique$Species, lwd = 3,
     main = "Color by Species")
Results in two side-by-side PCA plots, colored by Rphenograph cluster and by species (image omitted).
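Since the question asks specifically about ggplot2, the same idea translates directly; a sketch assuming the objects created in the code above:
library(ggplot2)
df <- data.frame(pca$x, cluster = factor(Rphenograph_out$membership))
ggplot(df, aes(PC1, PC2, colour = cluster)) +
  geom_point() +
  ggtitle("Color by Rphenograph cluster")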

revealing clusters of interaction in igraph

I have an interaction network, and I used the following code to make an adjacency matrix, calculate the dissimilarity between the nodes of the network, and then cluster them into modules:
ADJ1 <- abs(adjacent_mat)^6
dissADJ1 <- 1 - ADJ1
hierADJ <- hclust(as.dist(dissADJ1), method = "average")
Now I would like those modules to appear when I plot the igraph.
library(igraph)
g <- simplify(graph_from_adjacency_matrix(adjacent_mat, weighted = TRUE))
plot.igraph(g)
However the only thing that I have found thus far to translate hclust output to graph is as per the following tutorial: http://gastonsanchez.com/resources/2014/07/05/Pretty-tree-graph/
library(ape)
phylo_tree <- as.phylo(hierADJ)
graph_edges <- phylo_tree$edge
graph_net <- graph.edgelist(graph_edges)
plot(graph_net)
which is useful for hierarchical lineage, but I instead just want the nodes that closely interact to cluster together, as follows:
Can anyone recommend how to use a command such as components from igraph to get these clusters to show?
igraph provides a bunch of different layout algorithms which are used to place nodes in the plot.
A good one to start with for a weighted network like this is the force-directed layout (implemented by layout.fruchterman.reingold in igraph).
Below is an example of using the force-directed layout with some simple simulated data.
First, we create some mock data and clusters, along with some "noise" to make it more realistic:
library('dplyr')
library('igraph')
library('RColorBrewer')

set.seed(1)

# generate a couple of clusters
nodes_per_cluster <- 30
n <- 10
nvals <- nodes_per_cluster * n

# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) +
                     rnorm(nvals, sd=1),
                   nrow=nodes_per_cluster, byrow=TRUE)

# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) +
                     rnorm(nvals, sd=1),
                   nrow=nodes_per_cluster, byrow=TRUE)

# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
                  rnorm(nvals, sd=1.5),
                nrow=nodes_per_cluster, byrow=TRUE)

dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
                   paste0('cluster2_', 1:nodes_per_cluster),
                   paste0('noise_', 1:nodes_per_cluster))
Next, we can use Pearson correlation to construct our adjacency matrix:
# create correlation matrix
cor_mat <- cor(t(dat))
# shift to [0,1] to separate positive and negative correlations
adj_mat <- (cor_mat + 1) / 2
# get rid of low correlations and self-loops
adj_mat <- adj_mat^3
adj_mat[adj_mat < 0.5] <- 0
diag(adj_mat) <- 0
Cluster the data using hclust and cutree:
# convert to dissimilarity matrix and cluster using hclust
dissim_mat <- 1 - adj_mat
dend <- dissim_mat %>%
  as.dist %>%
  hclust
clusters <- cutree(dend, h = 0.65)
# color the nodes
pal <- colorRampPalette(brewer.pal(11, "Spectral"))(length(unique(clusters)))
node_colors <- pal[clusters]
Finally, create an igraph graph from the adjacency matrix and plot it using the fruchterman.reingold layout:
# create graph
g <- graph.adjacency(adj_mat, mode='undirected', weighted=TRUE)
# set node color and plot using a force-directed layout (fruchterman-reingold)
V(g)$color <- node_colors
coords_fr = layout.fruchterman.reingold(g, weights=E(g)$weight)
# igraph plot options
igraph.options(vertex.size=8, edge.width=0.75)
# plot network
plot(g, layout=coords_fr, vertex.color=V(g)$color)
In the above code, I generated two "clusters" of correlated rows, and a third group of "noise".
Hierarchical clustering (hclust + cutree) is used to assign the data points to clusters, and they are colored based on cluster membership.
The result looks like this: (image of the clustered network omitted)
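If you would rather have igraph itself find and draw the groups (the question mentions igraph's own commands), plot() has a method for communities objects that shades each cluster. A short sketch on the same graph, using Louvain rather than components:
cl <- cluster_louvain(g, weights = E(g)$weight)   # community detection with igraph itself
plot(cl, g, layout = coords_fr)                   # the communities plot method shades each group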
For some more examples of clustering and plotting graphs with igraph, check out: http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html
You haven't shared any toy data for us to play with and suggest code improvements, but your question states that you are only interested in plotting your clusters distinctly; that is, graphical presentation.
Although igraph comes with some nice force-directed layout algorithms, such as layout.fruchterman.reingold, layout_with_kk, etc., in the presence of a large number of nodes they can quickly become difficult to interpret or make sense of at all.
Like this: (example plot omitted)
With these traditional methods of visualising networks:
the layout algorithms, rather than the data, determine the visualisation
similar networks may end up being visualised very differently
a large number of nodes makes the visualisation difficult to interpret
Instead, I find Hive Plots to be better at displaying important network properties, which in your instance are the clusters and the edges.
In your case, you can:
Plot each cluster on a different straight line
Order the placement of nodes intelligently, so that nodes with certain properties are placed at the very end or start of each straight line
Colour the edges to identify the direction of each edge
To achieve this you will need to:
Use the ggnetwork package to turn your igraph object into a dataframe
Map your clusters to the nodes present in this dataframe
Generate coordinates for the straight lines and map these to each cluster
Use ggplot to visualise (a rough sketch follows below)
There is also a HiveR package in R, should you wish to use a packaged solution. You might also find another visualisation technique for graphs very useful: BioFabric
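A rough sketch of that ggnetwork recipe, reusing g and clusters from the first answer; the coordinate scheme (one vertical line per cluster, nodes ordered by degree rank) is my own assumption, not a full hive plot:
library(igraph)
library(ggnetwork)
library(ggplot2)
V(g)$cluster <- factor(clusters)              # attach cluster membership to the nodes
deg <- degree(g)
xy <- cbind(as.numeric(clusters),             # x: one straight line per cluster
            ave(deg, clusters, FUN = rank))   # y: degree rank within each cluster
n <- ggnetwork(g, layout = xy)                # flatten to a data frame, keeping attributes
ggplot(n, aes(x, y, xend = xend, yend = yend)) +
  geom_edges(alpha = 0.2) +
  geom_nodes(aes(colour = cluster), size = 2) +
  theme_blank()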

Graphing results of dbscan in R

Your comments, suggestions, or solutions are/will be greatly appreciated, thank you.
I'm using the fpc package in R to do a DBSCAN analysis of some very dense data (3 sets of 40,000 points in the range -3 to 6).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all other clusters but this one.
dbscan() creates a special data type to store all of this cluster data. It's not indexed like a data frame would be (but maybe there is a way to represent it as such?).
I can graph the dbscan type using a basic plot() call. But, like I said, this will graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan) it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector, whose length is equal to the number of rows in your data, that indicates which cluster each observation is a member of. So you can use this vector to subset your data to extract only those clusters you'd like, and then plot just those data points.
For example, if we use the first example from the help page:
library(fpc)
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds, to plot only the points in clusters 1-3:
#Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3,])
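To also colour those points by their cluster (noise points carry the label 0 in ds$cluster):
keep <- ds$cluster %in% 1:3
plot(x[keep, ], col = ds$cluster[keep])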
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It is very useful for examining the main patterns in a scatterplot when you would otherwise have too many points to make sense of the data.
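A minimal sketch, reusing the simulated matrix x from the answer above; smoothScatter shades by point density, so tens of thousands of points stay readable:
smoothScatter(x)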
The probably most sensible way of plotting DBSCAN results is using alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.
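A hedged sketch with the alphahull package (my assumption of its API, not the answerer's code), outlining one cluster from the earlier fpc example with the alpha radius set to the eps value:
library(alphahull)
pts <- unique(x[ds$cluster == 1, ])   # ashape needs distinct points
ah <- ashape(pts, alpha = 0.2)        # alpha radius = the eps used in dbscan above
plot(ah)                              # draws the alpha-shape boundary and the points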
