Density-based clustering that allows the user to specify the number of clusters - r

I have data that consists of roughly 100,000 points on a 2-d graph. Each point has X and Y coordinates. I'm looking for an algorithm that will cluster these points based on density but I want to specify the number of clusters.
I originally tried K-Means since this would allow me to specify the number of clusters. However, my data naturally "clumps" into ridges. K-Means would inevitably bisect some of these ridges. DBSCAN seems like a better fit simply due to the shape of my data, but with DBSCAN I can't specify the number of clusters I'd like.
Essentially, what I'm trying to find is an algorithm that will optimally cluster the graph into N groups based on density, where N is supplied by me. At this point I don't care where it's implemented (R, Python, FORTRAN...).
Any direction you can provide would be much appreciated.

In an area of high density, the points tend to be close together, so clustering on the (Euclidean) distance may give similar results (though not always).
For example, with these three normals in 2 dimensions:
x1 <- mnormt::rmnorm(200, c(10,10), matrix(c(20,0,0,.1), 2, 2))
x2 <- mnormt::rmnorm(100, c(10,20), matrix(c(20,0,0,.1), 2, 2))
x3 <- mnormt::rmnorm(300, c(23, 15), matrix(c(.1,0,0,35), 2, 2))
xx <- rbind(x1, x2, x3)
plot(xx, col=rep(c("grey10","pink2", "green4"), times=c(200,100,300)))
We can apply different clustering algorithms:
# hierarchical
clustering <- hclust(dist(xx, method = "euclidean"),
                     method = "ward.D")
h.cl <- cutree(clustering, k = 3)
# K-means and DBSCAN
k.cl <- kmeans(xx, centers = 3L)
d.cl <- dbscan::dbscan(xx, eps = 1)
And we see that, on this particular example, hierarchical clustering and DBSCAN produce similar results, whereas K-means cuts one of the clusters in the wrong place.
opar <- par(mfrow=c(3,1), mar = c(1,1,1,1))
plot(xx, col = k.cl$cluster, main="K-means")
plot(xx, col = d.cl$cluster, main="DBSCAN")
plot(xx, col = h.cl, main="Hierarchical")
par(opar)
Of course, there is no guarantee this will work on your particular data.
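If you specifically want a density-based hierarchy that you can cut into exactly N groups, another option worth trying (my own suggestion, not tested on your data) is HDBSCAN from the dbscan package, which exposes its hierarchy as an hclust object:
# sketch only: minPts is a guess and should be tuned to your data
hdb <- dbscan::hdbscan(xx, minPts = 10)
n.cl <- cutree(hdb$hc, k = 3)   # force exactly 3 groups from the density hierarchy
plot(xx, col = n.cl, main = "HDBSCAN hierarchy cut at k = 3")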

Related

Calculation of allowed space within Monte Carlo simulated data of 3 variables (cube in 3D coordinates)

I'm working on calculating the robust working range of a process. For this purpose I'm building models from DOE data and simulating data with a Monte Carlo approach. Filtering the data with a criterion for the response leads to an allowed space (see plots for better visualization).
In the example below there are 3 variables, and the goal is to calculate the biggest possible cube (with edges parallel to the axes) within the allowed space. This would describe the working range of the process. The coding is just to get every variable into the same range (-1 to 1).
library(tidyverse)
library(MASS)
library(ggplot2)
library(gridExtra)
library(rgl)
df <- data.frame(X1 = runif(100, 0, 2),
                 X2 = runif(100, 10, 30),
                 X3 = runif(100, 5, 75)) %>%
  mutate(Y1 = 2*X1 - 2*X2 + X3)
f1<-Y1~X1+X2+X3
model1<- lm(f1, data=df)
m.c <- NULL
n <- 10000
for (k in 1:n) {
  X1 <- runif(1, 0, 2)
  X2 <- runif(1, 10, 30)
  X3 <- runif(1, 5, 75)
  m.c <- rbind(m.c, data.frame(X1, X2, X3))
}
m.c_coded <- m.c %>%
  mutate(predict1 = predict(model1, newdata = .)) %>%
  mutate(X1 = (X1 - 1)/1) %>%    # code each variable to the range -1 to 1
  mutate(X2 = (X2 - 20)/10) %>%
  mutate(X3 = (X3 - 40)/35)
Space <- m.c_coded %>%
  filter(predict1 <= 0)
p1 <- ggplot(Space) + geom_point(aes(X1, X2)) + xlim(-1, 1) + ylim(-1, 1)
p2 <- ggplot(Space) + geom_point(aes(X1, X3)) + xlim(-1, 1) + ylim(-1, 1)
p3 <- ggplot(Space) + geom_point(aes(X2, X3)) + xlim(-1, 1) + ylim(-1, 1)
grid.arrange(arrangeGrob(p1, p2, p3, nrow = 1), nrow = 1)
MODR_plot3D <- plot3d(x = Space$X1, y = Space$X2, z = Space$X3, type = "p",
                      xlim = c(-1, 1), ylim = c(-1, 1), zlim = c(-1, 1))
There are specialized programs (DOE software) that can calculate this so-called design space, but I want to implement it in my R script. Sadly, I have no idea how to calculate the position (the edges) of this cube. My approach would be to find the maximum distance from the center of the cube to the surface.
Does anyone have an idea how I can calculate this cube in a proper way? If possible, I would like to extend this to the n-dimensional case as well.
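Not an answer, but a rough sketch of the "maximum distance to the surface" idea: grid-search the half-width of an axis-aligned cube around a candidate centre (in coded units) and accept it only while all eight corners stay inside the allowed space (predict1 <= 0). The centre, the step size and the decoding back to original units are illustrative assumptions, and checking only the corners is itself a simplification that is safe for the linear model above but not in general:
corners <- as.matrix(expand.grid(c(-1, 1), c(-1, 1), c(-1, 1)))  # 8 corner directions
cube_ok <- function(center, half, model) {
  pts  <- sweep(corners * half, 2, center, "+")   # cube corners in coded units
  orig <- data.frame(X1 = pts[, 1] * 1  + 1,      # decode (assumed coding)
                     X2 = pts[, 2] * 10 + 20,
                     X3 = pts[, 3] * 35 + 40)
  all(predict(model, newdata = orig) <= 0)
}
center <- c(-0.5, 0.5, -0.5)                      # hypothetical centre
halfs  <- seq(0.05, 1, by = 0.05)
ok     <- sapply(halfs, function(h) cube_ok(center, h, model1))
max(c(0, halfs[ok]))                              # largest feasible half-width at this centre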

How can I create a scatter plot in R to visualise the result of a SOM clustering model?

I have a dataset (this is just a dummy, my real datasets are much larger) in which there are five variables: two spatial variables X and Y (basically pairs of coordinates) and three attributes A, B and C associated with each X,Y point:
X Y A B C
1 1 34 11 26
1 2 47 16 31
1 3 60 21 36
1 4 73 26 41
1 5 86 31 46
2 1 99 36 51
... with 15 more rows
If I run a k-Means Clustering model on the dataset, I can easily produce a plot in which each X,Y point is coloured according to the related cluster:
library(tidyverse)
#Read the dataset
My_ds <- read_delim("test_dataset.csv",delim = ",", escape_double = FALSE, trim_ws = TRUE)
#Set the number of clusters
kClusters <- 3
#Create the model
kMeans <- kmeans(My_ds[ , c("A", "B", "C")], centers = kClusters)
#Plot the result
ggplot(My_ds, aes(X, Y)) +
  geom_point(col = kMeans$cluster, size = 15) +
  theme_minimal()
k-Means scatter plot
With the kohonen package I can also use a different clustering approach based on self-organising maps (SOM):
library(kohonen)
#Prepare the dataset
My_ds_SOM <- as.matrix(scale(My_ds[ , c("A", "B", "C")]))
#Set the grid
My_Grid <- somgrid(xdim = 3, ydim = 3, topo = "hexagonal")
#Create the model
My_Model <- som(X = My_ds_SOM, grid = My_Grid)
However, I cannot find a way to produce a scatter plot similar to the one above based on the SOM clusters. With k-Means I used kMeans$cluster to control the colour of the X,Y points; what should I use with SOM?
Update 1
OK, I made some progress thanks to this blog post. The key is to perform clustering on the SOM nodes, to isolate groups of samples with similar metrics.
First, an estimate of the number of clusters that would be suitable can be ascertained using a K-means algorithm and looking for an elbow-point in the plot of within cluster sum of squares (WCSS):
#View WCSS for K-means
mydata <- getCodes(My_Model)
wcss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:8) {  # second number is of one's choosing (I used number_of_nodes - 1)
  wcss[i] <- sum(kmeans(mydata, centers = i)$withinss)
}
plot(wcss)
WCSS plot
Then I use hierarchical clustering and the SOM plot function to visualise the clusters on the node map:
#Define colour palette
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
#Use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(getCodes(My_Model))), 3)
#Plot these results
plot(My_Model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(My_Model, som_cluster)
Clusters on node map
Finally, I assign labels to the original data by combining the som_cluster variable, which maps nodes to clusters, with the My_Model$unit.classif variable, which maps data samples to nodes:
#Get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[My_Model$unit.classif]
#Add the assignment as a column in the original data
My_ds$cluster <- cluster_assignment
#Plot the result
ggplot(My_ds, aes(X, Y)) +
  geom_point(col = My_ds$cluster, size = 15) +
  theme_minimal()
SOM+hierarchical scatter plot
Applying hierarchical clustering on top of the SOM nodes makes the process a bit convoluted, as SOM already helps reduce the dimensions and cluster neighbouring nodes together. But this was the only way I could get what I wanted.
Update 2
Some more progress. This time I'm focusing on making the whole process fully automatic. Specifically, I want to avoid choosing 1) the SOM grid size and 2) the number of clusters during the clustering of the node map.
Regarding point 1, I used a rule of thumb suggested by Vesanto J, Alhoniemi E. Clustering of the self-organizing map. IEEE Transactions on neural networks. 2000 May;11(3):586-600, which is #nodes = 5*sqrt(#observations). Therefore, setting the grid for the SOM model works like this:
My_dim <- as.integer(sqrt(5*sqrt(nrow(My_ds_SOM))))
My_Grid <- somgrid(xdim = My_dim, ydim = My_dim, topo = "hexagonal")
Of course, this works best with large datasets. In any case, this approach should be a starting point only, the grid size can (and should) then be adjusted by looking at the resulting node count plot, weight vector plot and heatmap.
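For reference, the diagnostic plots mentioned above can be produced with kohonen's built-in plot types (picking variable A for the heatmap is just an example):
plot(My_Model, type = "counts")   # node count plot: many empty nodes suggest the grid is too large
plot(My_Model, type = "codes")    # weight (codebook) vector plot
plot(My_Model, type = "property",
     property = getCodes(My_Model)[, "A"],
     main = "Heatmap of A")       # per-variable heatmap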
About point 2, when using hierarchical clustering to cluster the codebook vectors, the kgs function of the maptree package allows the optimal number of clusters to be calculated automatically:
library(maptree)
distance <- dist(getCodes(My_Model))
clustering <- hclust(distance)
optimal_k <- kgs(clustering, distance, maxclus = 20)
clusters <- as.integer(names(optimal_k[which(optimal_k == min(optimal_k))]))
som_cluster <- cutree(clustering, clusters)
Also in this case, the number of clusters determined by the code can be compared to the one suggested by the WCSS plot, to check if there is a significant discrepancy.

How to get Principal Component Data in PAM in R

I create a graph with the autoplot function using the mtcars data and get a graph like this.
Here is my code:
library(cluster)
library(NbClust)
library(ggplot2)
library(ggfortify)
x <- mtcars
number.cluster <- NbClust(x, distance = "euclidean", min.nc = 1, max.nc = 5, method = "complete", index = "ch")
best.cluster <- as.numeric(number.cluster$Best.nc[1])
x.pam <- pam(x, best.cluster)
autoplot(x.pam, data = x, frame = T) + ggtitle("PAM MTCARS")
My question is: how do I get the PC1 and PC2 coordinates shown in this graph?
Thank you.
You can use layer_data() to get the data used for a ggplot object:
p <- autoplot(x.pam, data = x, frame = T) + ggtitle("PAM MTCARS")
layer_data(p, 1L) # coordinates of all points
layer_data(p, 2L) # coordinates of points that contribute to polygons
Your entire process is flawed. First you use complete linkage to estimate the number of clusters, but rather than using the "best" clustering found, you then cluster again with PAM.
You use Euclidean distance, but in Euclidean space k-means will usually work better than PAM - PAM shines when you don't have Euclidean geometry and cannot use k-means.
And then you want to use this PCA plot, which is heavily distorted (almost the entire variance is in the first component; the y axis is visualizing pretty much random deviation). Just use PCA directly if you want these coordinates, rather than reconstructing them from the plot.
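If you do want the coordinates computed from scratch rather than extracted from the plot, a minimal sketch (assuming autoplot runs an unscaled PCA on the raw data; compare against layer_data(p, 1L) and switch scale. to TRUE if the values don't match):
pca <- prcomp(x, scale. = FALSE)
head(pca$x[, 1:2])   # PC1 / PC2 coordinates for each car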

spatial clustering in R (simple example)

I have this simple data.frame
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
data=data.frame(lat,lon)
The idea is to find the spatial clusters based on distance.
First, I plot the map (lon, lat):
plot(data$lon,data$lat)
so clearly I have three clusters based on the distance between the positions of the points.
To this end, I've tried this code in R:
d <- as.matrix(dist(cbind(data$lon, data$lat))) # create distance matrix
d <- ifelse(d < 5, d, 0)                        # keep only distances < 5
d=as.dist(d)
hc<-hclust(d) # hierarchical clustering
plot(hc)
data$clust <- cutree(hc,k=3) # cut the dendrogram to generate 3 clusters
This gives:
Now I try to plot the same points, but with colors from the clusters:
plot(data$lon, data$lat, col = c("red","blue","green")[data$clust], pch = 19)
Here are the results, which are not what I'm looking for.
Actually, I want to find something like this plot
Thank you for your help.
What about something like this:
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
km <- kmeans(cbind(lat, lon), centers = 3)
plot(lon, lat, col = km$cluster, pch = 20)
Here's a different approach. First, it assumes that the coordinates are WGS-84 and not UTM (flat). Then it clusters all neighbors within a given radius into the same cluster using hierarchical clustering (with method = "single", which adopts a 'friends of friends' clustering strategy).
In order to compute the distance matrix, I'm using the rdist.earth method from the package fields. The default earth radius for this package is 6378.388 (the equatorial radius) which might not be what one is looking for, so I've changed it to 6371. See this article for more info.
library(fields)
lon = c(31.621785, 31.641773, 31.617269, 31.583895, 31.603284)
lat = c(30.901118, 31.245008, 31.163886, 30.25058, 30.262378)
threshold.in.km <- 40
coors <- data.frame(lon,lat)
#distance matrix
dist.in.km.matrix <- rdist.earth(coors,miles = F,R=6371)
#clustering
fit <- hclust(as.dist(dist.in.km.matrix), method = "single")
clusters <- cutree(fit,h = threshold.in.km)
plot(lon, lat, col = clusters, pch = 20)
This could be a good solution if you don't know the number of clusters (like the k-means option), and is somewhat related to the dbscan option with minPts = 1.
---EDIT---
With the original data:
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
data=data.frame(lat,lon)
dist <- rdist.earth(data[, c("lon", "lat")], miles = F, R = 6371) # rdist.earth expects lon/lat column order; dist <- dist(data) if data is UTM
fit <- hclust(as.dist(dist), method = "single")
clusters <- cutree(fit,h = 1000) #h = 2 if data is UTM
plot(lon, lat, col = clusters, pch = 20)
Since you have spatial data to cluster, DBSCAN is well suited to your data.
You can do this clustering with the dbscan() function provided by fpc, an R package.
library(fpc)
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
DBSCAN <- dbscan(cbind(lat, lon), eps = 1.5, MinPts = 3)
plot(lon, lat, col = DBSCAN$cluster, pch = 20)

spatial distribution of points, R

What would be an easy way to generate 3 different spatial distributions of points (N = 20 points) using R? For example, 1) random, 2) uniform, and 3) clustered, all on the same space (a 50 x 50 grid)?
1) Here's one way to get a very even spacing of 5 points in a 25 by 25 grid numbered from 1 in each direction: put points at (3,18), (8,3), (13,13), (18,23), (23,8); you should be able to generalize from there (see the sketch after point 3 below).
2) As you suggest, you could use runif ... but I'd have assumed from your question that you actually want points on the lattice (i.e. integers), in which case you might use sample.
Are you sure you want continuous rather than discrete random variables?
3) This one is "underdetermined" - depending on how you want to define things there's a bunch of ways you might do it. e.g. if it's on a grid, you could sample points in such a way that points close to (but not exactly on) already sampled points had a much higher probability than ones further away; a similar setup works for continuous variables. Or you could generate more points than you need and eliminate the loneliest ones. Or you could start with random uniform points and them make them gravitate toward their neighbors. Or you could generate a few cluster-centers (4-10, say), and then scatter points about those centers. Or you could do any of a hundred other things.
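A small sketch of points 1) and 2) above (my own illustration): the even 5-point pattern written so the construction is explicit, plus random sampling on the integer lattice with sample():
# 1) even spacing: reproduces (3,18), (8,3), (13,13), (18,23), (23,8)
i      <- 1:5
even.x <- 5 * i - 2                           # 3, 8, 13, 18, 23
even.y <- 5 * (((2 * i + 1) %% 5) + 1) - 2    # 18, 3, 13, 23, 8
# 2) random points on the integer lattice
rand.x <- sample(1:25, 20, replace = TRUE)
rand.y <- sample(1:25, 20, replace = TRUE)
opar <- par(mfrow = c(1, 2))
plot(even.x, even.y, xlim = c(1, 25), ylim = c(1, 25), main = "even")
plot(rand.x, rand.y, xlim = c(1, 25), ylim = c(1, 25), main = "random (lattice)")
par(opar)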
A bit late, but the answers above do not really address the problem. Here is what you are looking for:
library(sp)
# make a grid of size 50*50
x1<-seq(1:50)-0.5
x2<-x1
grid<-expand.grid(x1,x2)
names(grid)<-c("x1","x2")
# make a grid a spatial object
coordinates(grid) <- ~x1+x2
gridded(grid) <- TRUE
First: random sampling
# random sampling
random.pt <- spsample(x = grid, n= 20, type = 'random')
Second: regular sampling
# regular sampling
regular.pt <- spsample(x = grid, n= 20, type = 'regular')
Third: clustered at a distance of 2 from a random location (can go outside the area)
# random sampling of one location
ori <- data.frame(spsample(x = grid, n= 1, type = 'random'))
# select randomly 20 distances between 0 and 2
n.point <- 20
h <- runif(n.point, 0, 2)
# empty dataframe
dxy <- data.frame(matrix(nrow=n.point, ncol=2))
# take a random angle from the randomly selected location and make a dataframe of the new distances from the original sampling points, in a random direction
angle <- runif(n = n.point,min=0,max=2*pi)
dxy[,1]= h*sin(angle)
dxy[,2]= h*cos(angle)
cluster <- data.frame(x=rep(NA, 20), y=rep(NA, 20))
cluster$x <- ori$coords.x1 + dxy$X1
cluster$y <- ori$coords.x2 + dxy$X2
# make a spatial object and plot
coordinates(cluster)<- ~ x+y
plot(grid)
plot(cluster, add=T, col='green')
plot(random.pt, add=T, col= 'red')
plot(regular.pt, add=T, col= 'blue')
