Cluster assignments differ sometimes in two DBSCAN implementations - r

I have implemented the DBSCAN algorithm in R, and I am matching the cluster assignments against the DBSCAN implementation in the fpc library. Testing is done on synthetic data generated as in the fpc library's dbscan example:
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.3))
Clustering is done with parameters as below:
eps = 0.2
MinPts = 5
I am comparing the cluster assignments of fpc::dbscan with those of my implementation. In most runs every point is classified identically by both implementations.
But there are cases where 1 or 2 points, and rarely 5 or 6 points, are assigned to different clusters in my implementation than in the fpc implementation. I have noticed that only the classification of border points differs. After plotting, I have seen that the points whose cluster membership does not match are positioned such that they could be assigned to any of the surrounding clusters, depending on which cluster's seed point they are discovered from first.
I am showing an image with 150 points (to avoid clutter), where the classification of one point differs. Note that the mismatched point's cluster number is always greater in my implementation than in the fpc implementation.
Plot of clusters.
Top inset is fpc::dbscan, bottom inset is my dbscan implementation
Note: the point which differs in my implementation is marked with an exclamation mark (!)
I am also uploading zoomed images of the mismatch section:
My dbscan implementation output
+ are core points
o are border points
- are noise points
! highlights the differing point
fpc::dbscan implementation output
triangles are core points
coloured circles are border points
black circles are noise points
Another example:
My dbscan implementation output
fpc::dbscan implementation output
EDIT
Equal x-y scaled example
As requested by Anony-Mousse
Across different cases, sometimes my implementation seems to have classified the mismatched point correctly, and sometimes the fpc implementation seems to have classified it correctly. See below:
fpc::dbscan (the plot with triangles) seems to have classified the mismatched point correctly
my dbscan implementation (the plot with + symbols) seems to have classified the mismatched point correctly
Question
I am new to cluster analysis, so I have another question: is this type of difference acceptable?
In my implementation I scan from the first point to the last point in the order they are supplied, and fpc::dbscan scans the points in the same order. In that case both implementations should discover the mismatched point (marked by !) from the same cluster seed. I have also generated some cases in which fpc::dbscan marks a point as noise, but my implementation assigns it to some cluster. Why is this difference occurring?
Code segments on request.

DBSCAN is known to be order dependent for border points. They will be assigned to the cluster from which they are first discovered. If a border point is not dense itself, but is in the vicinity of dense points from two different clusters, it can be assigned to either.
This is why DBSCAN is often described as "order independent, except for border points".
Try shuffling the data (or reversing!), then rerunning your algorithm. The results should change.
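One way to check this with fpc (a sketch, reusing the synthetic data x and the parameters from the question; cluster IDs may be renumbered between runs, so compare the partitions rather than the raw labels):
library(fpc)
d_fwd <- dbscan(x, eps = 0.2, MinPts = 5)
d_rev <- dbscan(x[nrow(x):1, ], eps = 0.2, MinPts = 5)   # same points, reversed order
# put the reversed run's labels back into the original point order and cross-tabulate
table(forward = d_fwd$cluster, reverse = rev(d_rev$cluster))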
As I assume neither yours nor the fpc implementation has index support (to speed up range queries and make the algorithm run in O(n log n)), I'd guess that one of the implementations is processing the points in forward order and the other in backward order. Update: indexes should not play much of a role, as they don't change the order across clusters, only within one cluster.
Another option for "generating" this difference is to
keep the first (non-noise) cluster assignment of each point (IIRC this is what the official DBSCAN pseudocode does), or
keep the last cluster assignment of each point (fpc::dbscan seems to do this).
These will also produce different results for objects that are border points of more than one cluster. There is also the possibility of assigning such points to both clusters, which yields a non-strict partitioning of the data set. Usually, the benefits of having a strict partitioning outweigh having a fully deterministic result.
Don't get me wrong: the "overwrite" strategy of fpc::dbscan doesn't substantially change the results. I would probably implement it that way myself.
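As an illustration only (this is not the fpc source), the two strategies differ in a single line of the label-update step:
# labels[p] holds the current assignment of point p; 0 = unclassified / noise
assign_label <- function(labels, p, new_cluster, keep_first = TRUE) {
  if (!keep_first || labels[p] <= 0)   # keep_first: never overwrite an earlier cluster
    labels[p] <- new_cluster           # otherwise: always take the latest cluster (overwrite)
  labels
}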
Are any non-border points affected?

Related

Spatstat, using the Matérn cluster process to generate homogeneous landscapes, how do I interpret the Ripley K function?

I am looking to develop a point process that ranges from homogeneous, i.e. no correlation between points, to a clustered point process that does have correlation between points. From experimentation I can see that using the Matérn cluster process I can generate landscapes that are clustered.
library(spatstat)
plot(rMatClust(kappa=3,r=0.1,mu=50))
I want to use the simplest code that increases the level of homogeneity, i.e. decreases the dependence of points on each other. I do not want to use a binary model where the pattern is either homogeneous or not, i.e. just a Poisson process, which can be generated with:
plot(rpoispp(150))
From experimentation I noticed that if I increase the radius of the clusters using the Matérn cluster process, I do seem to create a pseudo homogeneous pattern.
plot(rMatClust(kappa=3,r=0.3,mu=50))
plot(rMatClust(kappa=3,r=0.7,mu=50))
Is this a good way of generating degrees of homogeneity? I understand that I can use statistical tests to measure the degree of clustering compared to a complete Poisson process, such as the Ripley K test. For example, if I assign the Matérn cluster process data to variables, such as:
a<-rMatClust(kappa=3,r=0.1,mu=50)
b<-rMatClust(kappa=3,r=0.3,mu=50)
c<-rMatClust(kappa=3,r=0.7,mu=50)
Then use the Ripley K test and plot the results:
plot(Kest(a))
plot(Kest(b))
plot(Kest(c))
I can see that the difference between a homogeneous Poisson process and the clustered point process decreases. I still do not fully understand the significance of the various K values with respect to edge effects and so forth, or how to interpret the Ripley K function, but I think this is the right direction to be heading in? How do I interpret the Ripley K function? Another problem is the number of points in each plot: I do not have a consistent number of points from plot to plot, as can be seen by:
summary(a)
summary(b)
summary(c)
Any knowledgeable feedback on this is greatly appreciated.
The standard terminology is that you want to generate a clustered point pattern.
The function rMatClust generates a clustered point pattern at random, in a two-stage process. The first stage is to generate "parent" points completely at random. The second stage is to generate, for each "parent", a random number of "offspring" points, and to place the "offspring" points inside a circle of radius R around their "parent". The final result is the collection of all "offspring" points. From this description (and help(rMatClust)) you can figure out what happens for different parameter values.
The K function (not the "K test") is a summary of the spacing between points in a point pattern. At a distance r, the value of K(r) is the normalised average number of points observed to fall within distance r of a typical point in the pattern. It is normalised so that it does not depend on the number of points, making it possible to compare patterns with different numbers of points.
When you plot the K function, one of the curves is the theoretical curve that would be expected if the points are completely random, and the other curves are computed from the data point pattern. This allows you to assess whether the point pattern appears to be clustered.
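One way to make that comparison explicit is a simulation envelope around the K function (a sketch, using the pattern a from your code; envelope and Kest are spatstat functions):
E <- envelope(a, Kest, nsim = 39)   # 39 simulations of complete randomness give a pointwise 5% envelope
plot(E)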
I strongly suggest you do some reading in Chapter 7 of the spatstat book. You can download this chapter for free.

computer vision: segmentation setup. Graph cut potentials

I have been trying to teach myself some simple computer vision algorithms and am trying to solve a problem where I have some noise corrupted image and all I am trying to do is separate the black background from the foreground which has some signal. Now, the background RGB channels are not all completely zero as they can have some noise. However, the human eye can easily discern the foreground from the background.
So, what I did was use the SLIC algorithm to break the image down into superpixels. The idea is that since the image is noise corrupted, doing statistics on the patches might result in better classification of background and foreground because of the higher SNR.
After this, I get around 100 patches which should have similar profile and the result of SLIC seems reasonable. I have been reading about graph cuts (the Kolmogorov paper) and it seemed like something nice to try for the binary problem I have. So, I constructed a graph which is a first order MRF and I have edges between the immediate neighbours (4-connected graph).
Now, I was wondering what possible unary and binary terms I can use here to do my segmentation. So, I was thinking for the unary term, I can model it as a simple Gaussian where the background should have a zero mean intensity and the foreground should have some non-zero mean. However, I am struggling to figure out how to encode this. Should I just assume some noise variance and compute probabilities directly using patch statistics?
Similarly, for neighbouring patches I want to encourage them to take the same label, but I am not sure what binary term I can design to reflect that. Just using the difference between the labels (1 or 0) seems weird...
Sorry for the long-winded question. Hoping someone can give some helpful hint on how to start.
You could build your CRF model over superpixels, such that a superpixel has a connection to another superpixel if it is a neighbour of it.
For your statistical model Pixel Wise Posteriors are simple and cheap to compute.
So, I suggest the following for the unary terms of the CRF:
Build foreground and background histograms over texture per pixel (assuming you have a mask, or a reasonable amount of marked foreground pixels; note: pixels, not superpixels).
For each superpixel, make an independence assumption over the pixels within it, so that a superpixel's likelihood of being either foreground or background is the product over each observation in the superpixel (in practice, we sum logs). The individual likelihood terms come from the histograms that you generated.
Compute the posterior for foreground as the cumulative likelihood described above for foreground divided by the sum of the cumulative likelihoods of both. Similar for background.
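A rough numerical sketch of those unary terms in R (illustrative only: it uses greyscale intensities in [0, 1] instead of a texture descriptor, toy marked pixels, and made-up names):
nbins <- 32
breaks <- seq(0, 1, length.out = nbins + 1)
fg_pixels <- runif(500, 0.4, 1.0)                                    # toy marked foreground pixels
bg_pixels <- pmin(abs(rnorm(500, 0, 0.05)), 1)                       # toy marked background pixels
fg_p <- hist(fg_pixels, breaks = breaks, plot = FALSE)$counts + 1    # +1 for smoothing
bg_p <- hist(bg_pixels, breaks = breaks, plot = FALSE)$counts + 1
fg_p <- fg_p / sum(fg_p); bg_p <- bg_p / sum(bg_p)

superpixel_posterior <- function(pix) {
  bins <- findInterval(pix, breaks, all.inside = TRUE)
  ll_fg <- sum(log(fg_p[bins]))                   # independence over pixels: sum of log-likelihoods
  ll_bg <- sum(log(bg_p[bins]))
  m <- max(ll_fg, ll_bg)                          # normalise in log space to avoid underflow
  p_fg <- exp(ll_fg - m) / (exp(ll_fg - m) + exp(ll_bg - m))
  c(foreground = p_fg, background = 1 - p_fg)
}

superpixel_posterior(runif(80, 0.5, 0.9))         # posterior for one bright superpixel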
The pairwise terms between superpixels can be as simple as the difference between the mean observed textures (pixel-wise) for each, passed through a kernel such as a radial basis function.
Alternatively, you could compute histograms over each superpixel's observed texture (again, pixel-wise) and compute the Bhattacharyya distance between each neighbouring pair of superpixels.
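For reference, the Bhattacharyya distance between two normalised histograms is a one-liner (sketch):
bhattacharyya <- function(p, q) {
  bc <- sum(sqrt(p * q))                 # Bhattacharyya coefficient, 1 = identical histograms
  -log(max(bc, .Machine$double.eps))     # distance; guard against log(0)
}
bhattacharyya(fg_p, bg_p)                # e.g. the two histograms from the sketch above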

Clustering GPS data using DBSCAN but clusters are not meaningful (in terms of size)

I am working with GPS data (latitude, longitude). For density based clustering I have used DBSCAN in R.
Advantages of DBSCAN in my case:
I don't have to predefine numbers of clusters
I can calculate a distance matrix (using the Haversine distance formula) and use that as input to dbscan
library(fossil)
dist<- earth.dist(df, dist=T) #df is dataset containing lat long values
library(fpc)
dens<-dbscan(dist,MinPts=25,eps=0.43,method="dist")
Now, when I look at the clusters, they are not meaningful. Some clusters have points which are more than 1km apart. I want dense clusters but not that big in size.
I have tried different values of MinPts and eps, and I have also used a k-nearest-neighbour distance plot to get an optimal value of eps for MinPts=25.
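A k-NN distance plot like that can be produced from the distance matrix above with base R (a sketch; the "knee" of the sorted curve is the usual eps candidate):
k <- 24                                                                  # roughly MinPts - 1
knn_dist <- apply(as.matrix(dist), 1, function(row) sort(row)[k + 1])    # +1 skips the zero self-distance
plot(sort(knn_dist), type = "l", xlab = "points sorted by distance", ylab = paste0(k, "-NN distance"))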
What dbscan does is go to every point in my dataset; if a point p has at least MinPts points in its eps neighbourhood it starts a cluster, but at the same time it also joins clusters that are density-reachable (which I guess is what is creating the problem for me).
It really is a big question, essentially "how do I reduce the size of a cluster without losing too much of its information", but I will break it down into the following points:
How do I remove border points in a cluster? I know which points are in which cluster using dens$cluster, but how would I know whether a particular point is core or border?
Is cluster 0 always noise?
I was under the impression that the size of a cluster would be comparable to eps. But that's not the case, because density-reachable clusters are combined together.
Is there any other clustering method which has the advantages of dbscan but can give me more meaningful clusters?
OPTICS is another alternative, but will it solve my issue?
Note: By meaningful I mean that closer points should be in the same cluster, but points which are 1 km or more apart should not be in the same cluster.
DBSCAN doesn't claim the radius is the maximum cluster size.
Have you read the article? DBSCAN looks for arbitrarily shaped clusters; eps is just the core size of a point, roughly the size used for density estimation, and any point within this radius of a core point will be part of a cluster.
This makes it essentially the maximum step size to connect dense points. But they may still form a chain of density-connected points of arbitrary shape or size.
I don't know what cluster 0 is in your R implementation. I've experimented with the R implementation, but it was waaaay slower than all the others. I don't recommend using R; there are much better tools for cluster analysis available, such as ELKI. Try running DBSCAN with your settings in ELKI, with LatLngDistanceFunction and a sort-tile-recursive bulk-loaded R-tree index. You'll be surprised how fast it can be compared to R.
OPTICS is looking for the same density connected type of clusters. Are you sure this arbitrarily-shaped type of clusters is what you are looking for?
IMHO, you are using the wrong method for your goals (and you aren't really explaining what you are trying to achieve)
If you want a hard limit on the cluster diameter, use complete-linkage hierarchical clustering.
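In R that could look something like this (a sketch, reusing the haversine distance matrix dist from the question and assuming earth.dist returns kilometres):
hc <- hclust(as.dist(dist), method = "complete")   # complete linkage on the haversine distances
clusters <- cutree(hc, h = 1)                      # cut at 1 km: no two points in a cluster are farther apart than 1 km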

Minimising interpolation error between two data sets

In the top of the diagrams below we can see some value (y-axis) changing over time (x-axis).
As this happens, we sample the value at different and unpredictable times; we also alternate the sampling between two data sets, indicated by red and blue.
When computing the value at any time, we expect that both red and blue data sets will return similar values. However as shown in the three smaller boxes this is not the case. Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value.
Initially I used linear interpolation to obtain a value; next I tried Catmull-Rom interpolation. The former results in values which come close together and then drift apart between each data point; the latter results in values which remain closer, but where the average error is greater.
Can anyone suggest another strategy or interpolation method which will provide greater smoothing (perhaps by using a greater number of sample points from each data set)?
I believe what you ask does not have a straightforward answer without further knowledge of the underlying sampled process. By its nature, the value of the function between samples can be almost anything, so I think there is no way to guarantee the convergence of the interpolations of two sample arrays.
That said, if you have prior knowledge of the underlying process, then you can choose among several interpolation methods to minimise the errors. For example, if you measure the drag force as a function of the wing velocity, you know the relation is quadratic (a*V^2). Then you can choose polynomial fitting of the 2nd order and have a pretty good match between the interpolations of the two series.
Try B-splines: Catmull-Rom interpolates (goes through the data points), B-spline does smoothing.
For example, for uniformly-spaced data (not your case)
Bspline(t) = (data(t-1) + 4*data(t) + data(t+1)) / 6
Of course the interpolated red / blue curves depend on the spacing of the red / blue data points,
so cannot match perfectly.
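A direct way to apply that smoothing mask in R (a sketch; it assumes roughly uniform spacing and simply keeps the endpoints unchanged):
bspline_smooth <- function(y) {
  n <- length(y)
  c(y[1], (y[1:(n - 2)] + 4 * y[2:(n - 1)] + y[3:n]) / 6, y[n])   # (y[t-1] + 4*y[t] + y[t+1]) / 6
}
bspline_smooth(c(0, 1, 0, 2, 0, 3, 0))   # toy series: the spikes are damped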
I'd like to quote Introduction to Catmull-Rom Splines to suggest not using Catmull-Rom for this interpolation task.
One of the features of the Catmull-Rom spline is that the specified curve will pass through all of the control points - this is not true of all types of splines.
By definition your red interpolated curve will pass through all red data points and your blue interpolated curve will pass through all blue points. Therefore you won't get a best fit for both data sets.
You might change your boundary conditions and use data points from both data sets for a piecewise approximation as shown in these slides.
I agree with ysap that this question cannot be answered as you may be expecting. There may be better interpolation methods, depending on your model dynamics - as with ysap, I recommend methods that utilize the underlying dynamics, if known.
Regarding the red/blue samples, I think you have made a good observation about sampled and interpolated data sets and I would challenge your original expectation that:
When computing the value at any time, we expect that both red and blue data sets will return similar values.
I do not expect this. If you assume that you cannot interpolate perfectly - and particularly if the interpolation error is large compared to the errors in the samples - then you are certain to have a continuous error function that exhibits its largest errors farthest (in time) from your sample points. Therefore two data sets that have different sample points should exhibit the behaviour you see, because points that are far (in time) from red sample points may be near (in time) to blue sample points and vice versa - if staggered as your points are, this is sure to be true. Thus I would expect what you show, that:
Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value.
(If you do not have information about underlying dynamics (except frequency content), then Giacomo's points on sampling are key - however, you need not interpolate if looking at info below Nyquist.)
When sampling the original continuous function, the sampling frequency should comply with the Nyquist-Shannon sampling theorem, otherwise the sampling process introduces an error (also known as aliasing). The error, being different in the two datasets, results in a different value when you interpolate.
Therefore, you need to know the highest frequency B of the original function and then collect samples with a frequency at least 2B. If your function has very high frequencies and you cannot sample that fast, you should at least try to filter them away before sampling.
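To see this concretely (a sketch in R): a 9 Hz sine sampled at 10 Hz, well below the Nyquist rate of 18 Hz, aliases to a low-frequency wave, and two sample sets taken at different offsets disagree after interpolation:
f <- function(t) sin(2 * pi * 9 * t)                 # original 9 Hz signal
t_fine <- seq(0, 2, by = 0.001)
t_red  <- seq(0, 2, by = 0.1)                        # 10 Hz sampling
t_blue <- seq(0.05, 2, by = 0.1)                     # same rate, shifted offsets
red  <- approx(t_red,  f(t_red),  xout = t_fine)$y   # linear interpolation of each set
blue <- approx(t_blue, f(t_blue), xout = t_fine)$y
plot(t_fine, f(t_fine), type = "l", col = "grey")
lines(t_fine, red, col = "red"); lines(t_fine, blue, col = "blue")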

Graph Drawing With Weighted Edges

I'm looking to build an algorithm (or reuse one) that organizes nodes and edges on a 2 dimensional canvas where edges can have corresponding weights.
Any starting material and info would be helpful.
What would the weights do to affect their placement on your canvas?
That being said, you might want to look into graphviz and, more specifically, the DOT language, which organizes nodes on a canvas.
Many graph visualization frameworks use a force-based simulation, in which all nodes exert a repulsive force against each other (with their mass being their size), and edges exert tension on the nodes they connect. This creates aesthetically-arranged graph visualizations.
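A toy spring-embedder along those lines (an illustrative R sketch, not graphviz or any particular library; edge weights in adj scale the attraction):
force_layout <- function(adj, iters = 300) {
  n <- nrow(adj)
  k <- sqrt(1 / n)                              # ideal edge length for a unit-area canvas
  pos <- matrix(runif(2 * n), ncol = 2)
  for (step in seq_len(iters)) {
    disp <- matrix(0, n, 2)
    for (i in 1:(n - 1)) for (j in (i + 1):n) {
      d <- pos[i, ] - pos[j, ]
      len <- sqrt(sum(d^2)) + 1e-9
      f <- (k^2 / len^2) * d                    # repulsion between every pair of nodes
      if (adj[i, j] > 0)
        f <- f - (len / k) * adj[i, j] * d      # spring tension along (weighted) edges
      disp[i, ] <- disp[i, ] + f
      disp[j, ] <- disp[j, ] - f
    }
    pos <- pos + 0.01 * disp                    # small step towards equilibrium
  }
  pos
}
adj <- matrix(0, 4, 4); adj[cbind(1:3, 2:4)] <- 1; adj <- adj + t(adj)   # tiny path graph
plot(force_layout(adj))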
Although again, I'm not sure where you want node "weights" to come into play. Do you want weighted nodes to be more in the center? To be larger? Farther apart?
Many graph/network layout algorithms are implicitly capable of handling weighted networks, but you may need to do some pre-processing and tweaks to the implementation to get it to work. Usually the first step is to determine whether your weights represent "similarities" (usually interpreted to mean that stronger weights should place nodes closer together) or "dissimilarities" (stronger weights = farther apart). The most common case is the former, so you will need to translate them to dissimilarities, often done by subtracting each edge value from the maximum observed edge value in the network. The matrix of dissimilarity values for each edge can then be fed to the algorithm and interpreted as desired distances in the layout space for each edge (i.e. "spring lengths") - usually after multiplying by some constant to transform to display units (pixels).
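The similarity-to-dissimilarity step described above is simple (illustrative sketch; the values and names are made up):
sim <- c(5, 2, 9, 7)          # toy edge weights, larger = more similar
dissim <- max(sim) - sim      # invert: larger now means "should be drawn farther apart"
spring_len <- dissim * 40     # multiply by a constant to get display units (pixels)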
If you tell me what language you are using, I may be able to point you to some code examples.
