DBSCAN clustering with additional features (R)

Can I apply DBSCAN with other features in addition to location? And if so, how can it be done in R or Spark?
I prepared an R table of 3 columns, one each for latitude, longitude and score (the feature I want to cluster on in addition to the spatial ones). When I run DBSCAN with the following R code, I get the plot below, which suggests that the algorithm builds clusters on each pair of columns: (long, lat), (long, score), (lat, score), ...
My R code:
library(fpc)  # assuming fpc's dbscan(); its plot method draws the pairwise panels

df <- read.table("/home/ahmedelgamal/Desktop/preparedData")
var <- dbscan(df, eps = .013)
plot(x = var, data = df)
and the plot I get:

You are misinterpreting the plot.
You don't get one clustering per panel; all panels show the same clusters, just projected onto different pairs of attributes.
But you also have the issue that the R version is (to my knowledge) only fast for Euclidean distance.
In your current code, points are neighbors if (lat[i]-lat[j])^2 + (lon[i]-lon[j])^2 + (score[i]-score[j])^2 <= eps^2. This is bad because: 1) latitude and longitude are not Euclidean, so you should be using haversine distance instead; 2) your additional attribute has a much larger scale, so you pretty much only cluster points with near-zero score; and 3) your score attribute is skewed.
For this problem you should probably be using Generalized DBSCAN: points are similar if their haversine distance is less than e.g. 1 mile (you want to measure geographic distance here, not coordinates, because of distortion) and if their score differs by a factor of at most 1.1 (i.e. compare score[y] / score[x], or work in log space). Since you want both conditions to hold, the usual Euclidean DBSCAN implementation is not enough; you need a Generalized DBSCAN that allows multiple conditions. Look for an implementation of Generalized DBSCAN (I believe there is one in ELKI that you may be able to access from Spark), or implement it yourself. It's not very hard to do.
If quadratic runtime is okay for you, you can probably use any distance-matrix-based DBSCAN, and simply "hack" a binary distance matrix:
compute Haversine distances
compute Score dissimilarity
distance = 0 if haversine < distance-threshold and score-dissimilarity < score-threshold, otherwise 1.
run DBSCAN with precomputed distance matrix and eps=0.5 (since it is a binary matrix, don't change eps!)
It's reasonably fast, but needs O(n^2) memory. In my experience, the indexes of ELKI yield a good speedup if you have larger data, and are worth a try if you run out of memory or time.
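A minimal R sketch of this binary-matrix recipe, assuming a data frame df with columns lon, lat and score (the package choice, thresholds and minPts below are illustrative, not from the question):
library(geosphere)  # distm()/distHaversine() for geographic distances
library(dbscan)

# Pairwise haversine distances in meters (O(n^2) memory!).
geo <- distm(df[, c("lon", "lat")], fun = distHaversine)

# Pairwise score ratios >= 1; assumes strictly positive scores.
ratio <- exp(abs(outer(log(df$score), log(df$score), "-")))

# Binary "distance": 0 if both conditions hold, 1 otherwise.
bin <- ifelse(geo < 1609 & ratio < 1.1, 0, 1)  # 1609 m is roughly 1 mile

# eps = 0.5 separates the two binary values; don't change it.
res <- dbscan::dbscan(as.dist(bin), eps = 0.5, minPts = 5)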

You need to scale your data. V3 has a much larger range than V1 and V2, so DBSCAN currently mostly ignores V3.
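A minimal sketch of that fix (z-scoring is one reasonable choice; eps then has to be re-tuned on the rescaled data):
# Standardize all columns so no single attribute dominates the distance.
df_scaled <- as.data.frame(scale(df))
var <- dbscan(df_scaled, eps = .5)  # eps here is only a placeholder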

Related

R and SPSS: Different results for Hierarchical Cluster Analysis

I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)
d <- dist(datA, method = "euclidean")  # avoid shadowing base::dist
hc <- hclust(d, method = "ward.D2")
table(cutree(hc, k = 2))
The resulting cluster sizes are:
        1    2
SPSS  712  288
R     610  390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible when comparing the dendrograms; the same applies to the 3-10 cluster solutions). If I'm not mistaken, "ward.D2" takes the squared distances into account, which is why I passed the plain distance matrix here. However, I tried several combinations of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when the same method is applied). Finally, I excluded duplicate cases (N=29) from my data, guessing that those might have caused differences when being allocated (randomly) at some point. None of this resulted in matching outputs from R and SPSS.
I also tried running the analysis with the agnes() function from the cluster package, which again gave results different from both SPSS and hclust() (but that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is having tied distances handled differently in two algorithms. Tied distances might be present with the original full distance matrix, or might occur during the joining process. If one program searches the matrix and finds two or more distances tied at the minimum value at that step, and it selects the first one, while another program selects the last one, or one or both select one at random from among the ties, different results could occur.
I'd suggest starting with a small example with some data with randomness added to values to make tied distances unlikely and see if the two programs produce matching results on those data. If not, there's a deeper problem. If so, then tie handling might be the issue.
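A rough sketch of that check in R, using datA from the question (the jitter magnitude is arbitrary, but must stay tiny relative to the 1-5 Likert steps):
set.seed(1)
# Break exact ties by adding negligible noise to every value.
datJ <- datA + matrix(runif(prod(dim(datA)), -1e-6, 1e-6), nrow = nrow(datA))
hcJ <- hclust(dist(datJ, method = "euclidean"), method = "ward.D2")
table(cutree(hcJ, k = 2))  # compare against SPSS run on the same jittered data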

Different results between fpc::dbscan and dbscan::dbscan

I want to run DBSCAN in R on some GPS coordinates. I have a distance matrix (dis_matrix) that I fed into the following functions:
dbscan::dbscan(dis_matrix, eps=50, minPts = 5,borderPoints=TRUE)
fpc::dbscan(dis_matrix,eps = 50,MinPts = 5,method = "dist")
and I'm getting very different results from the two functions, both in the number of clusters and in whether a point is noise or belongs to a cluster. Basically, the results are inconsistent between the two implementations. I have no clue why they generate such different results, although here
http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
we can see that for the iris data both functions give the same result.
My distance matrix comes from a function (geosphere::distm) which calculates the spatial distance between more than 2000 coordinates.
Furthermore, I implemented DBSCAN myself according to this pseudo-code
source: https://cse.buffalo.edu/~jing/cse601/fa13/materials/clustering_density.pdf
My results match what I obtained from the fpc package.
Can anyone see why they differ? I already looked into both functions and haven't found anything.
The documentation of geosphere::distm says that it returns a plain matrix, not a dist object. dbscan::dbscan therefore assumes you passed a data matrix, not precomputed distances. Convert your matrix into a dist object with as.dist first; this should resolve the problem.
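In code, something like this (a sketch; coords stands for the ~2000-point lon/lat matrix from the question):
library(geosphere)
library(dbscan)

dis_matrix <- distm(coords)  # plain n x n matrix of distances in meters
d <- as.dist(dis_matrix)     # mark it as precomputed distances
res <- dbscan::dbscan(d, eps = 50, minPts = 5, borderPoints = TRUE)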

DBSCAN (density-based clustering): unit of measure for eps (R)

I was trying to use the dbscan package in R to cluster some spatial data. The dbscan::dbscan function takes eps and minPts as input. I have a data frame with two columns, longitude and latitude, expressed in decimal degrees, like the following:
df <- data.frame(lon = c(seq(1, 5, 1), seq(1, 5, 1)),
                 lat = c(1.1, 3.1, 1.2, 4.1, 2.1, 2.2, 3.2, 2.4, 1.4, 5.1))
and I apply the algorithm:
db <- fpc::dbscan(df, eps = 1, MinPts = 2)
Will eps here be defined in degrees or in some other unit? I'm really trying to understand in which unit this maximum-distance value eps is expressed, so any help is appreciated.
Never use the fpc package, always use dbscan::dbscan instead.
If you have latitude and longitude, you need to choose an appropriate distance function such as Haversine.
The default distance function, Euclidean, ignores the spherical nature of earth. The eps value then is a mixture of degrees latitude and longitude, but these do not correspond to uniform distances! One degree east at the equator is much farther than one degree east in Vancouver.
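A quick numeric illustration of that distortion (geosphere's haversine assumed; 49.3 approximates Vancouver's latitude):
library(geosphere)
# One degree east along the equator vs. at Vancouver's latitude, in meters.
distHaversine(c(0, 0), c(1, 0))        # ~111,000 m
distHaversine(c(0, 49.3), c(1, 49.3))  # ~72,500 m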
Even then, you need to pay attention to units. One implementation of Haversine may yield radians, another one meters, and of course someone crazy will work in miles.
Unfortunately, as far as I can tell, none of the R implementations can accelerate Haversine distance. So it may be much faster to cluster the data in ELKI instead (you need to add an index yourself though).
If your data is small enough, you can however use a precomputed distance matrix (dist object) in R. But that will take O(n²) time and memory, so it is not very scalable.
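For the toy df above, a sketch of that precomputed-matrix route, so eps is in meters rather than degrees (geosphere is an assumption for the haversine step, and the eps value is purely illustrative):
library(geosphere)
library(dbscan)

d <- as.dist(distm(df[, c("lon", "lat")], fun = distHaversine))  # meters
db <- dbscan::dbscan(d, eps = 100000, minPts = 2)  # eps = 100 km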

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:
library(fpc)
ds <- dbscan(USArrests, eps = 20)
Choosing eps was merely trial and error in this case. However, I am wondering if there is a function or code available to automate the choice of the best eps/minPts. I know some books recommend producing a plot of the sorted distances to the kth nearest neighbour: the x-axis represents "points sorted according to distance to kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".
This type of plot is useful for helping choose appropriate values for eps and minPts. I hope I have provided enough information for someone to help me out. I wanted to post a picture of what I meant, but I'm still a newbie and can't post images just yet.
There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.
For epsilon, there are various aspects. It again boils down to choosing whatever works on this data set and this minPts and this distance function and this normalization. You can try to do a knn distance histogram and choose a "knee" there, but there might be no visible one, or multiple.
OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.
Naively, one can imagine OPTICS as trying all values of epsilon at the same time and putting the results into a cluster hierarchy.
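For what it's worth, the dbscan package on CRAN does ship an optics() implementation; a minimal sketch (USArrests reused from the question, parameter values illustrative):
library(dbscan)
opt <- optics(USArrests, minPts = 4)    # reachability order; no eps cut yet
plot(opt)                               # reachability plot: valleys are clusters
res <- extractDBSCAN(opt, eps_cl = 20)  # flat, DBSCAN-like cut at eps_cl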
The first thing you need to check however - pretty much independent of whatever clustering algorithm you are going to use - is to make sure you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.
MinPts
As Anony-Mousse explained, 'A low minPts means it will build more clusters from noise, so don't choose it too small.'
minPts is best set by a domain expert who understands the data well. Unfortunately, in many cases we don't have that domain knowledge, especially after the data has been normalized. One heuristic approach is to use ln(n), where n is the total number of points to be clustered.
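For example (a trivial sketch; x stands for your data matrix):
n <- nrow(x)                     # total number of points to be clustered
minPts <- max(2, round(log(n)))  # ln(n) heuristic, floored at 2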
epsilon
There are several ways to determine it:
1) k-distance plot
In a clustering with minPts = k, we expect the k-distances of core points and border points to fall within a certain range, while noise points can have much greater k-distances, so we can observe a knee point in the k-distance plot. However, sometimes there may be no obvious knee, or there can be multiple knees, which makes it hard to decide.
2) DBSCAN extensions like OPTICS
OPTICS produces hierarchical clusters, and we can extract significant flat clusters from the hierarchy by visual inspection. An OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS also proposed an automatic way to extract flat clusters that requires no human intervention; for more information you can read this paper.
3) sensitivity analysis
Basically, we want to choose a radius that clusters as many truly regular points as possible (points that are similar to other points), while at the same time detecting more of the noise (outlier points). We can plot the percentage of regular points (points belonging to a cluster) against epsilon: different epsilon values on the x-axis, the corresponding percentage of regular points on the y-axis. Hopefully we can spot a segment where that percentage is especially sensitive to epsilon, and we choose the upper bound of that segment as our optimal parameter.
One common and popular way of managing the epsilon parameter of DBSCAN is to compute a k-distance plot of your dataset. Basically, you compute the k-nearest neighbors (k-NN) for each data point to understand the density distribution of your data, for different k. k-NN is handy because it is a non-parametric method. Once you choose a minPts value (which strongly depends on your data), you fix k to that value. Then you use as epsilon the k-distance corresponding to the area of the k-distance plot (for your fixed k) with a low slope.
For details on choosing parameters, see the paper below on p. 11:
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
For two-dimensional data: use default value of minPts=4 (Ester et al., 1996)
For more than 2 dimensions: minPts=2*dim (Sander et al., 1998)
Once you know which MinPts to choose, you can determine Epsilon:
Plot the k-distances with k=minPts (Ester et al., 1996)
Find the 'elbow' in the graph: the k-distance value at the elbow is your epsilon value.
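With the dbscan package this plot is a single call (USArrests and the minPts = 4 rule of thumb reused from above; the abline height is just an eyeballed guide):
library(dbscan)
kNNdistplot(USArrests, k = 4)  # sorted distance of each point to its 4th neighbor
abline(h = 20, lty = 2)        # read eps off at the elbow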
If you have the resources, you can also test a bunch of epsilon and minPts values and see what works. I do this using expand.grid and mapply.
# Establish search parameters.
k <- c(25, 50, 100, 200, 500, 1000)
eps <- c(0.001, 0.01, 0.02, 0.05, 0.1, 0.2)
# Perform grid search over all (k, eps) combinations.
grid <- expand.grid(k = k, eps = eps)
results <- mapply(grid$k, grid$eps, FUN = function(k, eps) {
  cluster <- dbscan(data, minPts = k, eps = eps)$cluster
  sizes <- table(cluster)  # cluster sizes, including noise (cluster 0)
  cat(c("k =", k, "; eps =", eps, ";", sizes, "\n"))
})
See this webpage, section 5: http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning
It gives detailed instructions on how to find epsilon. MinPts ... not so much.

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors; that is, instead of using the individual variables as clustering predictors, I intend to use the correlations between the variable vectors of pairs of data rows.
Is there a function in R that does such vector-based clustering? If not, and I need to do it manually, what is the right data format to feed into a function such as cmeans or kmeans?
Say I have m variables and n data rows; the m variables constitute one vector for each data row, so I have an n × n matrix of correlations or cosines. Can this matrix be plugged into a clustering function directly, or is some processing required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix(rnorm(n * k), nrow = k)
x <- x * row(x)  # 10 dimensions, with less information in some of them
# Clustering on a correlation-based dissimilarity
library(cluster)
r <- pam(1 - cor(x), diss = TRUE, k = 5)
# Check the results in the first two principal components
plot(prcomp(t(x))$x[, 1:2], col = r$clustering, pch = 16, cex = 3)
Clustering in R is often a bit limited. This is a design limitation of R, since it relies heavily on low-level C code for performance. The fast k-means implementation included with R is an example of such low-level code, and it is in turn tied to Euclidean distance.
There are a dozen extensions and alternatives available in the community around R: PAM, CLARA and CLARANS, for example. They aren't exactly k-means, but closely related. There should be a "spherical k-means" somewhere that is sensible for cosine distance. And there is the whole family of hierarchical clusterings, which scale rather badly (usually O(n^3), with O(n^2) for a few exceptions) but are very easy to understand conceptually.
If you want to explore more clustering options, have a look at ELKI; it should allow clustering by correlation-based distances with various methods, including k-means (and it also includes such distance functions). It's not R, though, but Java, so if you are bound to R, it won't work for you.
