spatial clustering in R (simple example)

I have this simple data.frame
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
data=data.frame(lat,lon)
The idea is to find spatial clusters based on the distance between the points.
First, I plot the points (lon, lat):
plot(data$lon, data$lat)
Clearly there are three clusters, given the distances between the positions of the points.
To find them, I've tried this code in R:
d <- as.matrix(dist(cbind(data$lon, data$lat))) # create the distance matrix
d <- ifelse(d < 5, d, 0) # zero out distances >= 5 (the intent: keep only distances < 5)
d=as.dist(d)
hc<-hclust(d) # hierarchical clustering
plot(hc)
data$clust <- cutree(hc,k=3) # cut the dendrogram to generate 3 clusters
This gives the following dendrogram:
Now I try to plot the same points, coloured by cluster:
plot(data$lon, data$lat, col = c("red","blue","green")[data$clust], pch = 19)
Here is the result:
This is not what I'm looking for. Actually, I want to obtain something like this plot:
Thank you for your help.

What about something like this:
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
km <- kmeans(cbind(lat, lon), centers = 3)
plot(lon, lat, col = km$cluster, pch = 20)
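One caveat worth noting: kmeans() starts from random centers, so repeated runs can produce different clusterings. A minimal tweak for reproducibility (the seed value is arbitrary):
set.seed(42) # fix the random start so the result is reproducible
km <- kmeans(cbind(lat, lon), centers = 3, nstart = 25) # keep the best of 25 random starts
plot(lon, lat, col = km$cluster, pch = 20)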

Here's a different approach. First, it assumes the coordinates are WGS-84 (lon/lat) rather than UTM (planar). It then assigns all neighbours within a given radius to the same cluster using hierarchical clustering (with method = "single", which adopts a 'friends of friends' clustering strategy).
In order to compute the distance matrix, I'm using the rdist.earth function from the fields package. The default Earth radius for this package is 6378.388 km (the equatorial radius), which might not be what one is looking for, so I've changed it to the mean radius of 6371 km. See this article for more info.
library(fields)
lon = c(31.621785, 31.641773, 31.617269, 31.583895, 31.603284)
lat = c(30.901118, 31.245008, 31.163886, 30.25058, 30.262378)
threshold.in.km <- 40
coors <- data.frame(lon,lat)
#distance matrix
dist.in.km.matrix <- rdist.earth(coors,miles = F,R=6371)
#clustering
fit <- hclust(as.dist(dist.in.km.matrix), method = "single")
clusters <- cutree(fit,h = threshold.in.km)
plot(lon, lat, col = clusters, pch = 20)
This could be a good solution when you don't know the number of clusters in advance (unlike the k-means option, which requires it), and it is closely related to the dbscan option with minPts = 1, sketched below.
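For completeness, a minimal sketch of that dbscan variant (assuming the dbscan package; with minPts = 1 there are no noise points, and every group of points chained together within eps forms one cluster, matching a single-linkage tree cut at h = eps):
library(dbscan)
# reuse the distance matrix from above; eps plays the role of the tree-height cutoff
db <- dbscan(as.dist(dist.in.km.matrix), eps = threshold.in.km, minPts = 1)
plot(lon, lat, col = db$cluster, pch = 20)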
---EDIT---
With the original data:
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
data=data.frame(lat,lon)
d.km <- rdist.earth(data[, c("lon", "lat")], miles = FALSE, R = 6371) # rdist.earth expects (lon, lat) column order; use dist(data) instead if data is UTM
fit <- hclust(as.dist(d.km), method = "single")
clusters <- cutree(fit, h = 1000) # h = 2 if data is UTM
plot(lon, lat, col = clusters, pch = 20)

Since you have spatial data to cluster, DBSCAN is well suited to your data.
You can do this clustering with the dbscan() function provided by fpc, an R package.
library(fpc)
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
DBSCAN <- dbscan(cbind(lat, lon), eps = 1.5, MinPts = 3)
plot(lon, lat, col = DBSCAN$cluster, pch = 20)
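One thing to watch: fpc's dbscan() labels noise points as cluster 0, and col = 0 is the background colour in base graphics, so noise points would be invisible in the plot above. A small offset avoids that:
plot(lon, lat, col = DBSCAN$cluster + 1L, pch = 20) # shift so noise (0) plots as black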

Related

How can I create a scatter plot in R to visualise the result of a SOM clustering model?

I have a dataset (this is just a dummy; my real datasets are much larger) in which there are five variables: two spatial variables X and Y (basically pairs of coordinates) and three attributes A, B and C associated with each X,Y point:
X Y A B C
1 1 34 11 26
1 2 47 16 31
1 3 60 21 36
1 4 73 26 41
1 5 86 31 46
2 1 99 36 51
... with 15 more rows
If I run a k-Means Clustering model on the dataset, I can easily produce a plot in which each X,Y point is coloured according to the related cluster:
library(tidyverse)
#Read the dataset
My_ds <- read_delim("test_dataset.csv",delim = ",", escape_double = FALSE, trim_ws = TRUE)
#Set the number of clusters
kClusters <- 3
#Create the model
kMeans <- kmeans(My_ds[ , c("A", "B", "C")], centers = kClusters)
#Plot the result
ggplot(My_ds, aes(X, Y)) +
  geom_point(col = kMeans$cluster, size = 15) +
  theme_minimal()
k-Means scatter plot
With the kohonen package I can also use a different clustering approach based on self-organising maps (SOM):
library(kohonen)
#Prepare the dataset
My_ds_SOM <- as.matrix(scale(My_ds[ , c("A", "B", "C")]))
#Set the grid
My_Grid <- somgrid(xdim = 3, ydim = 3, topo = "hexagonal")
#Create the model
My_Model <- som(X = My_ds_SOM, grid = My_Grid)
However, I cannot find a way to produce a scatter plot similar to the one above and based on the SOM clusters. With k-Means I used kMeans$cluster to control the colour of the X,Y points, what should I use with SOM?
Update 1
OK, I made some progress thanks to this blog post. The key is to perform clustering on the SOM nodes, to isolate groups of samples with similar metrics.
First, a suitable number of clusters can be estimated by running k-means for a range of cluster counts and looking for an elbow point in the plot of the within-cluster sum of squares (WCSS):
#View WCSS for K-means
mydata <- getCodes(My_Model)
wcss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:8) { # the second number is of one's choosing (I used number_of_nodes - 1)
  wcss[i] <- sum(kmeans(mydata, centers = i)$withinss)
}
plot(wcss)
WCSS plot
Then I use hierarchical clustering and the SOM plot function to visualise the clusters on the node map:
#Define colour palette
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
#Use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(getCodes(My_Model))), 3)
#Plot these results
plot(My_Model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(My_Model, som_cluster)
Clusters on node map
Finally, I assign labels to the original data using the som_cluster vector, which maps nodes to clusters, together with My_Model$unit.classif, which maps data samples to nodes:
#Get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[My_Model$unit.classif]
#Add the assignment as a column in the original data
My_ds$cluster <- cluster_assignment
#Plot the result
ggplot(My_ds, aes(X, Y)) +
  geom_point(col = My_ds$cluster, size = 15) +
  theme_minimal()
SOM+hierarchical scatter plot
Applying hierarchical clustering on top of the SOM nodes makes the process a bit convoluted, since the SOM already reduces the dimensionality and clusters neighbouring nodes together. But this was the only way I could get what I wanted.
Update 2
Some more progress. This time I'm focusing on making the whole process fully automatic. Specifically, I want to avoid choosing 1) the SOM grid size and 2) the number of clusters during the clustering of the node map.
Regarding point 1, I used a rule of thumb suggested by Vesanto J, Alhoniemi E. Clustering of the self-organizing map. IEEE Transactions on Neural Networks. 2000 May;11(3):586-600, which is #nodes = 5*sqrt(#observations). Therefore, setting the grid for the SOM model works like this:
My_dim <- as.integer(sqrt(5*sqrt(nrow(My_ds_SOM))))
My_Grid <- somgrid(xdim = My_dim, ydim = My_dim, topo = "hexagonal")
Of course, this works best with large datasets. In any case, this approach should be a starting point only; the grid size can (and should) then be adjusted by looking at the resulting node count plot, weight vector plot and heatmap, as sketched below.
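For reference, a minimal sketch of those three diagnostics, using plot types documented in ?plot.kohonen:
plot(My_Model, type = "counts") # node count plot: how many samples map to each node
plot(My_Model, type = "codes") # weight (codebook) vector plot
plot(My_Model, type = "dist.neighbours") # U-matrix style heatmap of distances to neighbouring nodes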
About point 2, when using hierarchical clustering to cluster the codebook vectors, the kgs function of the maptree package allows the optimal number of clusters to be calculated automatically:
library(maptree)
distance <- dist(getCodes(My_Model))
clustering <- hclust(distance)
optimal_k <- kgs(clustering, distance, maxclus = 20)
clusters <- as.integer(names(optimal_k[which(optimal_k == min(optimal_k))]))
som_cluster <- cutree(clustering, clusters)
Also in this case, the number of clusters determined by the code can be compared to the one suggested by the WCSS plot, to check if there is a significant discrepancy.

Density based clustering that allows user to specify number of clusters

I have data that consists of roughly 100,000 points on a 2-d graph. Each point has X and Y coordinates. I'm looking for an algorithm that will cluster these points based on density but I want to specify the number of clusters.
I originally tried K-Means since this would allow me to specify the number of clusters. However, my data naturally "clumps" into ridges. K-Means would inevitably bisect some of these ridges. DBSCAN seems like a better fit simply due to the shape of my data, but with DBSCAN I can't specify the number of clusters I'd like.
Essentially what I'm trying to find is an algorithm that will optimally cluster the graph into N groups based on density, where N is supplied by me. At this point I don't care where it's implemented (R, Python, FORTRAN...).
Any direction you can provide would be much appreciated.
In an area of high density the points tend to be close together, so clustering on the (Euclidean) distance may give similar results (though not always).
For example, with these three normals in 2 dimensions:
x1 <- mnormt::rmnorm(200, c(10,10), matrix(c(20,0,0,.1), 2, 2))
x2 <- mnormt::rmnorm(100, c(10,20), matrix(c(20,0,0,.1), 2, 2))
x3 <- mnormt::rmnorm(300, c(23, 15), matrix(c(.1,0,0,35), 2, 2))
xx <- rbind(x1, x2, x3)
plot(xx, col=rep(c("grey10","pink2", "green4"), times=c(200,100,300)))
We can apply different clustering algorithms:
# hierarchical
clustering <- hclust(dist(xx, method = "euclidean"), method = "ward.D")
h.cl <- cutree(clustering, k=3)
# K-means and dbscan
k.cl <- kmeans(xx, centers = 3L)
d.cl <- dbscan::dbscan(xx, eps = 1)
On this particular example we see that hierarchical clustering and DBSCAN produce similar results, whereas k-means cuts one of the clusters in the wrong place.
opar <- par(mfrow=c(3,1), mar = c(1,1,1,1))
plot(xx, col = k.cl$cluster, main="K-means")
plot(xx, col = d.cl$cluster, main="DBSCAN")
plot(xx, col = h.cl, main="Hierarchical")
par(opar)
Of course, there is no guarantee this will work on your particular data. Note that the hierarchical option also satisfies your main constraint: cutree(clustering, k = N) lets you fix the number of clusters directly.

Best way to cluster long/lat hotspot points in one city in R?

I am new to R and (unsupervised) machine learning. I'm trying to find the best clustering solution for my data in R.
What is my data about?
I have a dataset with +/- 800 long / lat WGS84 coordinates in one city.
Long is in the range 6.90 - 6.95
lat is in the range 52.29 - 52.33
What do I want?
I want to find "hotspots" based on their density. As example: minimum 5 long/lat points in a range of 50 meter. This is a point plot example:
Why do I want this?
For example: let's assume that every single point is a car accident. By clustering the points I hope to see which areas need attention (a minimum of x points within a range of x meters needs attention).
What have I found?
The following clustering algorithms seem possible for my case:
DBscan (https://cran.r-project.org/web/packages/dbscan/dbscan.pdf)
HDBscan(https://cran.r-project.org/web/packages/dbscan/vignettes/hdbscan.html)
OPTICS (https://www.rdocumentation.org/packages/dbscan/versions/0.9-8/topics/optics)
City Clustering Algorithm (https://cran.r-project.org/web/packages/osc/vignettes/paper.pdf)
My questions
What is the best solution or algorithm for my case in R?
Is it true that I have to convert my long/lat coordinates to a (Haversine) distance matrix first?
I found something interesting at: https://gis.stackexchange.com/questions/64392/finding-clusters-of-points-based-distance-rule-using-r
I changed this code a bit, using the outlier clusters as the places where a lot happens:
# 0. Load the required packages #
library(sp)        # SpatialPointsDataFrame
library(geosphere) # distm
library(dplyr)     # count, filter, select
# 1. Make SpatialPointsDataFrame (x and y are the vectors of longitudes and latitudes) #
xy <- SpatialPointsDataFrame(
  matrix(c(x, y), ncol = 2), data.frame(ID = seq_along(x)),
  proj4string = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))
# 2. Use DISTM function to generate distance matrix.#
mdist <- distm(xy)
# 3. Use hierarchical clustering with the complete linkage method #
hc <- hclust(as.dist(mdist), method = "complete")
# 4. Show dendrogram #
plot(hc, labels = input$street, xlab = "", sub = "", cex = 0.7)
# 5. Set distance: in my case 300 meters #
d=300
# 6. define clusters based on a tree "height" cutoff "d" and add them to the SpDataFrame
xy$clust <- cutree(hc, h=d)
# 7. Add clusters to dataset#
input$cluster <- xy@data[["clust"]]
# 8. Plot clusters #
plot(input$long, input$lat, col=input$cluster, pch=20)
text(input$long, input$lat, labels =input$cluster)
# 9. Count n in cluster#
selection2 <- input %>% count(cluster)
# 10. Make a boxplot #
boxplot(selection2$n)
#11. Get first outlier#
outlier <- boxplot.stats(selection2$n)$out
outlier <- sort(outlier)
outlier <- as.numeric(outlier[1])
#12. Filter clusters greater than outlier#
selection3 <- selection2 %>% filter(n >= outlier) %>% select(cluster)
#13. Make a new DF with all outlier clusters#
heatclusters <- input %>% filter(cluster %in% selection3$cluster)
#14. Plot outlier clusters#
plot(heatclusters$long, heatclusters$lat, col=heatclusters$cluster)
#15. Plot on density map (googlemap is a ggmap basemap created earlier, not shown) #
googlemap + geom_point(aes(x = long, y = lat), data = heatclusters, color = "red", size = 0.1, shape = ".") +
  stat_density2d(data = heatclusters,
                 aes(x = long, y = lat, fill = ..level..), alpha = .2, size = 0.1,
                 bins = 10, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red")
I don't know if this is a good solution, but it seems to work. Maybe someone has another suggestion?
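One alternative, since you already compute a Haversine distance matrix with distm(): feed it straight into DBSCAN, which finds dense hotspots directly and labels everything else as noise. A sketch, assuming the dbscan package (eps = 50 and minPts = 5 mirror your "minimum 5 points within 50 meters" rule; mdist is in meters, and the hotspot column name is just for illustration):
library(dbscan)
db <- dbscan(as.dist(mdist), eps = 50, minPts = 5)
input$hotspot <- db$cluster # 0 = noise, i.e. not part of any hotspot
plot(input$long, input$lat, col = input$hotspot + 1L, pch = 20)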

How to get Principal Component Data in PAM in R

I create a graph using the autoplot() function on the mtcars data and get a graph like this.
Here is my code:
library(cluster)
library(NbClust)
library(ggplot2)
library(ggfortify)
x <- mtcars
number.cluster <- NbClust(x, distance = "euclidean", min.nc = 1, max.nc = 5, method = "complete", index = "ch")
best.cluster <- as.numeric(number.cluster$Best.nc[1])
x.pam <- pam(x, best.cluster)
autoplot(x.pam, data = x, frame = T) + ggtitle("PAM MTCARS")
My question is: how do I get the PC1 and PC2 coordinates shown on this graph?
Thank you.
You can use layer_data() to get the data used for a ggplot object:
p <- autoplot(x.pam, data = x, frame = T) + ggtitle("PAM MTCARS")
layer_data(p, 1L) # coordinates of all points
layer_data(p, 2L) # coordinates of points that contribute to polygons
Your entire process is flawed. First you use complete linkage to estimate the number of clusters; but rather than using the "best" clustering found, you then cluster again with PAM instead.
You use Euclidean distance, but in Euclidean space k-means will usually work better than PAM; PAM shines when you don't have Euclidean geometry and hence cannot use k-means.
And then you want to use this PCA plot, which is heavily distorted (almost the entire variance is in the first component; the y axis visualizes pretty much random deviation). If you want these coordinates, run PCA directly rather than reconstructing them from the plot.
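If all you need are the coordinates, a minimal sketch is to run the PCA yourself (ggfortify computes a PCA behind the scenes for pam objects, so this should reproduce the plotted coordinates up to sign and scaling, though the exact match depends on its settings):
pca <- prcomp(x) # PCA on the raw mtcars data (unscaled, as in the default)
scores <- pca$x[, 1:2] # PC1 and PC2 coordinates for each car
head(scores)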

Spatial correlogram using the raster package

Dear Crowd
Problem
I tried to calculate a spatial correlogram with the packages ncf, pgirmess, SpatialPack and spdep. However, I had trouble defining the start and end points of the distance bins; I'm only interested in the spatial autocorrelation at smaller distances, but with finer bins there. Additionally, as the raster is quite large (1.8 megapixels), I ran into memory trouble with all of these packages except SpatialPack.
So I tried to produce my own code, using the Moran function from the raster package. But I must have made some error, as the result for the complete dataset differs somewhat from the one produced by the other packages. If there is no error in my code, it might at least help others with similar problems.
Question
I'm not sure whether my focal matrix is erroneous. Could you please tell me whether the central pixel needs to be incorporated? Using the test data I can't show the differences between the methods, but on my complete dataset there are visible differences, as shown in the image below. The bins are not exactly the same (50 m vs. 69 m), which might explain part of the differences, though at the first bin this explanation does not seem plausible to me. Or might the irregular shape of my raster, and different ways of handling NAs, cause the difference?
Comparison of Own method with the one from SpatialPack
Runable Example
Testdata
The code for calculating the testdata is taken from http://www.petrkeil.com/?p=1050#comment-416317
# packages used for the data generation
library(raster)
library(vegan) # will be used for PCNM
# empty matrix and spatial coordinates of its cells
side=30
my.mat <- matrix(NA, nrow=side, ncol=side)
x.coord <- rep(1:side, each=side)*5
y.coord <- rep(1:side, times=side)*5
xy <- data.frame(x.coord, y.coord)
# all pairwise euclidean distances between the cells
xy.dist <- dist(xy)
# PCNM axes of the dist. matrix (from 'vegan' package)
pcnm.axes <- pcnm(xy.dist)$vectors
# using the 8th PCNM axis as my artificial z variable
z.value <- pcnm.axes[,8]*200 + rnorm(side*side, 0, 1)
# plotting the artificial spatial data
r <- rasterFromXYZ(xyz = cbind(xy,z.value))
plot(r, axes=F)
Own Code
library(raster)
sp.Corr <- matrix(nrow = 0, ncol = 2)
formerBreak <- 0 # matters only for the first run
for (i in seq(10, 200, 10)) # calculate Moran's I for these bins
{
  cat(paste0("..", i)) # print the bin that is currently calculated
  w <- focalWeight(r, d = i, type = 'circle')
  wTemp <- w # temporarily save the weight matrix
  if (formerBreak > 0) # on every run after the first
  {
    midpoint <- ceiling(ncol(w)/2) # get the midpoint
    idx <- (midpoint - formerBreak):(midpoint + formerBreak)
    w[idx, idx] <- w[idx, idx] * (wOld == 0) # set the previous focal weights to 0
    w <- w * (1/sum(w)) # normalize so the weights sum to 1
  }
  wOld <- wTemp # save this weight matrix for the next run
  mor <- Moran(r, w = w)
  sp.Corr <- rbind(sp.Corr, c(Moran = mor, Distance = i))
  formerBreak <- i/res(r)[1] # divide the break by the raster resolution to translate it to the focal window
}
plot(x = sp.Corr[,2], y = sp.Corr[,1], type = "l", ylab = "Moran's I", xlab = "Upper bound of distance")
Other methods to calculate the Spatial Correlogram
library(SpatialPack)
sp.Corr <- summary(modified.ttest(z.value, z.value, coords = xy, nclass = 21))
plot(x = sp.Corr$coef[,1], y = sp.Corr$coef[,4], type = "l", ylab = "Moran's I", xlab = "Upper bound of distance")
library(ncf)
ncf.cor <- correlog(x.coord, y.coord, z.value,increment=10, resamp=1)
plot(ncf.cor)
In order to compare the results of the correlograms in your case, two things should be considered. (i) Your code only works for bins proportional to the resolution of your raster; a small difference in the bins could then include or exclude a substantial number of pairs. (ii) The irregular shape of the raster has a strong impact on which pairs are considered when computing the correlation for a given distance interval. So your code should deal with both: allow any value for the bin length, and account for the irregular shape of the raster. A small modification of your code to tackle those problems is below.
# SpatialPack correlation
library(SpatialPack)
test <- modified.ttest(z.value,z.value,coords = xy,nclass = 21)
# Own correlation
bins <- test$upper.bounds
library(raster)
sp.Corr <- matrix(nrow = 0,ncol = 2)
for (i in bins) {
cat(paste0("..",i)) #print the bin, which is currently calculated
w = focalWeight(r,d = i,type = 'circle')
wTemp <- w #temporarily saves the weigtht matrix
if (i > bins[1]) {
midpoint <- ceiling(dim(w)/2) # get the midpoint
half_range <- floor(dim(wOld)/2)
w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
(midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])] <-
w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
(midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])]*(wOld==0)
w <- w * (1/sum(w)) #normalizes the vector to sum the weights to 1
}
wOld <- wTemp #save this weight matrix for the next run
mor <- Moran(r,w=w)
sp.Corr <- rbind(sp.Corr,c(Moran =mor,Distance = i))
}
# Comparing
plot(x=test$upper.bounds, test$imoran[,1], col = 2,type = "b",ylab = "Moran's I",xlab="Upper bound of distance", lwd = 2)
lines(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
points(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
legend('topright', legend = c('SpatialPack', 'Own code'), col = 2:3, lty = 1, lwd = 2:1)
The image shows that the results from the SpatialPack package and from the modified code are the same.
