Choosing eps and minpts for DBSCAN (R)? - r

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:
library(fpc)
ds <- dbscan(USArrests,eps=20)
Choosing eps was merely by trial and error in this case. However I am wondering if there is a function or code available to automate the choice of the best eps/minpts. I know some books recommend producing a plot of the kth sorted distance to its nearest neighbour. That is, the x-axis represents "Points sorted according to distance to kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".
This type of plot is useful for helping choose an appropriate value for eps and minpts. I hope I have provided enough information for someone to help me out. I wanted to post a picture of what I meant, but I'm still a newbie so I can't post an image just yet.

There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.
For epsilon, there are various aspects. It again boils down to choosing whatever works on this data set and this minPts and this distance function and this normalization. You can try to do a knn distance histogram and choose a "knee" there, but there might be no visible one, or multiple.
OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.
Naively, one can imagine OPTICS as doing all values of Epsilon at the same time, and putting the results in a cluster hierarchy.
The first thing you need to check however - pretty much independent of whatever clustering algorithm you are going to use - is to make sure you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.

MinPts
As Anony-Mousse explained, 'A low minPts means it will build more clusters from noise, so don't choose it too small.'.
minPts is best set by a domain expert who understands the data well. Unfortunately, in many cases we lack that domain knowledge, especially after the data has been normalized. One heuristic is to use ln(n), where n is the total number of points to be clustered.
epsilon
There are several ways to determine it:
1) k-distance plot
In a clustering with minPts = k, we expect the k-distances of core points and border points to fall within a certain range, while noise points can have a much greater k-distance, so we can observe a knee point in the k-distance plot. However, sometimes there is no obvious knee, or there are multiple knees, which makes it hard to decide.
2) DBSCAN extensions like OPTICS
OPTICS produces hierarchical clusters, and we can extract significant flat clusters from the hierarchy by visual inspection. An OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS also proposed an automatic way to extract flat clusters, where no human intervention is required; for more information you can read this paper.
3) sensitivity analysis
Basically we want to choose a radius that clusters as many truly regular points (points that are similar to other points) as possible, while at the same time detecting as much noise (outlier points) as possible. We can plot the percentage of regular points (points that belong to a cluster) against epsilon, with the different epsilon values on the x-axis and the corresponding percentage of regular points on the y-axis. Hopefully we can spot a segment where the percentage of regular points is more sensitive to the epsilon value, and we choose the epsilon at the upper bound of that segment as our optimal parameter.
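The sensitivity analysis in 3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dbscan function below is a minimal O(n^2) version written for this example (not the fpc or pyclustering implementation), the helper name clustered_fraction is made up, and the nine points are toy data.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal O(n^2) DBSCAN sketch; returns one label per point, -1 = noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

def clustered_fraction(points, eps, min_pts):
    """Share of points that end up in some cluster (the 'regular' points)."""
    labels = dbscan(points, eps, min_pts)
    return sum(1 for label in labels if label != -1) / len(labels)

# Toy data: two tight squares and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (2.5, 9.0)]
for eps in (0.05, 0.2, 1.0, 10.0):
    print(eps, clustered_fraction(points, eps, min_pts=3))
```

On this toy data the fraction jumps from 0 to 8/9 between eps = 0.05 and eps = 0.2, stays flat up to eps = 1.0, and only reaches 1 once eps is large enough to swallow the outlier, so the flat stretch marks the range of epsilon values where the result is stable.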

One common and popular way of choosing the epsilon parameter of DBSCAN is to compute a k-distance plot of your dataset. Basically, you compute the k-nearest neighbors (k-NN) of each data point, for different k, to understand the density distribution of your data. k-NN is handy because it is a non-parametric method. Once you choose a minPts (which strongly depends on your data), you fix k to that value. Then you use as epsilon the k-distance corresponding to the area of the k-distance plot (for your fixed k) with a low slope.

For details on choosing parameters, see the paper below on p. 11:
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
For two-dimensional data: use default value of minPts=4 (Ester et al., 1996)
For more than 2 dimensions: minPts=2*dim (Sander et al., 1998)
Once you know which MinPts to choose, you can determine Epsilon:
Plot the k-distances with k=minPts (Ester et al., 1996)
Find the 'elbow' in the graph; the k-distance value at the elbow is your Epsilon value.
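As a rough illustration of the k-distance step, the sorted k-distance curve can be computed directly. This is a minimal pure-Python sketch: the function name k_distances and the toy points are made up, and it assumes the convention that a point does not count as its own neighbour (some implementations count it, which shifts k by one).

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Toy data: a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = k_distances(points, k=2)
print(curve)
```

Plotting curve against its index gives the k-distance plot; here the four square corners sit at distance 1.0 and the outlier jumps to about 13.45, and that jump is the 'elbow'.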

If you have the resources, you can also test a bunch of epsilon and minPts values and see what works. I do this using expand.grid and mapply.
# Establish search parameters.
library(dbscan)  # assumption: dbscan::dbscan, which takes `minPts` (fpc::dbscan uses `MinPts`)
k <- c(25, 50, 100, 200, 500, 1000)
eps <- c(0.001, 0.01, 0.02, 0.05, 0.1, 0.2)
# Perform grid search.
grid <- expand.grid(k = k, eps = eps)
results <- mapply(grid$k, grid$eps, FUN = function(k, eps) {
  cluster <- dbscan(data, minPts = k, eps = eps)$cluster
  sizes <- table(cluster)  # cluster sizes; label 0 is noise
  cat(c("k =", k, "; eps =", eps, ";", sizes, "\n"))
})

See this webpage, section 5: http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning
It gives detailed instructions on how to find epsilon. MinPts ... not so much.

Related

How to calculate NME(Normalized Mean Error) between ground-truth and predicted landmarks when some of gt has no corresponding in predicted?

I am trying to learn about facial landmark detection models, and I notice that many of them use NME (Normalized Mean Error) as the performance metric:
The formula is straightforward: it calculates the L2 distance between the ground-truth points and the model's predictions, then divides it by a normalization factor, which varies between datasets.
However, when applying this formula to a landmark detector that someone else developed, I have to deal with a non-trivial situation: some detectors may not be able to generate enough landmarks for some input images (perhaps because of NMS, problems inherent to the model, image quality, etc.). Thus some of the ground-truth points might not have a corresponding point in the prediction result.
So how should I solve this problem? Should I just add such missing points to a "failure result set", use FR to measure the model, and ignore them when doing the NME calculation?
Suppose, for example, that the output of the neural network is a 10x1 vector
holding your points as [x1, y1, x2, y2, ..., x5, y5]. This vector has a fixed length, determined by the number of output neurons in your model.
If points appear to be missing (say you see 4 of 5 points), it is because some predicted points fall outside the image width and height, or have negative coordinates, like [-0.1, -0.2, 0.5, 0.7, ...]. There the first two values cannot be seen on the image, so the point looks missing, but it is still in the vector and you can calculate NME.
In some custom neural nets missing points can still occur, because missing values are replaced with the points of largest error.
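For illustration, here is a minimal Python sketch of the convention proposed in the question: average the normalized L2 error over the landmarks that do have a prediction, and report the missing ones separately so they can feed a failure count. This is one possible convention, not a standard definition; the function name nme, the use of None for a missing point, and the sample coordinates are all assumptions made for this example.

```python
import math

def nme(gt, pred, norm):
    """Mean L2 error over matched landmarks, divided by a normalization factor.

    pred[i] is None when the detector produced no point for gt[i].
    Returns (nme, number_of_missing_points); nme is None if nothing matched.
    """
    errors = [math.dist(g, p) for g, p in zip(gt, pred) if p is not None]
    missing = sum(1 for p in pred if p is None)
    if not errors:
        return None, missing
    return sum(errors) / len(errors) / norm, missing

gt = [(10.0, 10.0), (20.0, 10.0), (15.0, 20.0)]
pred = [(13.0, 14.0), None, (15.0, 20.0)]   # second landmark was not detected
value, missing = nme(gt, pred, norm=10.0)    # norm: e.g. an inter-ocular distance
print(value, missing)
```

Whichever convention is used, it should be applied identically to every detector being compared, since dropping missing points lowers NME for detectors that skip hard landmarks.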

Determine mode locations of the kernel density estimate of multimodal univariate data

If I have a density function and I plot it with a particular bandwidth, I visually determine that there are 7 local maximums. I would just like to know how to plot separate distributions of the particular maximums on the same plot.
Also, is it possible to know exactly where the maximums occur by running some code? I can make ball-park estimates using the plot, but is there an R function that I can use to get the exact points? I would like to know the mean and variance of the 7 densities that I have identified.
Specifically, I have the following:
plot(density(stamp, bw=0.0013,kernel = "gaussian"))
Determining which modes are real in a kernel density estimate depends on which bandwidth you choose. This is a complicated matter, and I don't advise relying on a single bandwidth, as even different optimal rules of thumb can give you different answers. In general, a KDE has fewer modes than the underlying density when it is oversmoothed and more when it is undersmoothed. Many papers cover this topic and give you options for assessing the veracity of a mode; Silverman's mode test for Gaussian kernels, Friedman and Fisher's PRIM algorithm, Marron's SiZer, and Minnotte and Scott's mode tree are good places to start.
A naive thing you can do, given a single KDE choice of bandwidth is check the run lengths.
In fact, with the bandwidth you have chosen, I find 9 modes. Just calculate the sign change of the difference in the series, and calculate the cumulative length of the runs in order to find the points. Every other point will be a mode or an antimode, depending on which came first. (You can check the sign to determine this)
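library(BSDA)  # provides the Stamp data set
dstamp <- density(Stamp$thickness, bw=0.0013, kernel = "gaussian")
# End positions of the runs of rising/falling slope in the density curve.
chng <- cumsum(rle(sign(diff(dstamp$y)))$lengths)
plot(dstamp)
# Every other turning point; + 1 because diff() shifts indices by one.
abline(v = dstamp$x[chng[seq(1,length(chng),2)] + 1])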
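The same sign-change idea can be written out in a language-neutral way. The following Python sketch works on any list of density values; the function name mode_indices and the sample values are made up for illustration, and flat plateaus (zero differences) are simply not reported as modes.

```python
def mode_indices(y):
    """Indices where the slope of y switches from rising to falling."""
    # Sign of each consecutive difference: +1 rising, -1 falling, 0 flat.
    signs = [(b > a) - (b < a) for a, b in zip(y, y[1:])]
    # signs[i] describes the step from y[i] to y[i + 1], so a rise followed
    # by a fall puts the peak at index i + 1.
    return [i + 1 for i in range(len(signs) - 1)
            if signs[i] > 0 and signs[i + 1] < 0]

density = [0.0, 0.2, 0.5, 0.3, 0.1, 0.4, 0.9, 0.6, 0.2]
peaks = mode_indices(density)
print(peaks)  # [2, 6]
```

The mode locations are then the x grid values at those indices, so their precision is limited by the spacing of the density grid.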
Since I needed something to get only the strongest modes, I created a dead simple algorithm that lets you tune sensitivity by tweaking the number of density samples (to decrease local noise) and by setting a minimum density threshold, proportional to the maximum density (to decrease the global noise).
find_posterior_modes <- function(x, n.samples = 100, filter = .1) {
  d <- density(x, n = n.samples)
  # Keep interior grid points that beat both neighbours and the density threshold.
  modes <- with(d, sapply(2:(n.samples - 1), function(i) if (y[i] > y[i - 1] && y[i] > y[i + 1] && y[i] > max(y) * filter) x[i]))
  unlist(modes)
}
I recently released the package ModEstM. It uses the same method as shayaa, with two features to suppress the less significant modes:
it is possible to choose the bandwidth of the density estimation, by choosing the "adjust" parameter of the density function,
the modes are presented in decreasing order of the corresponding density.

DBSCAN Clustering with additional features

Can I apply DBSCAN with other features in addition to location ? and if it is available how can it be done through R or Spark ?
I tried preparing an R table of 3 columns one for latitude, longitude and score (the feature I wanna cluster upon in addition to space feature) and when tried running DBSCAN with the following R code, I get the following plot which tells that the algorithm makes clusters upon each pair of columns (long, lat), (long, score), (lat, score), ...
my R Code:
df = read.table("/home/ahmedelgamal/Desktop/preparedData")
var = dbscan(df, eps = .013)
plot(x = var, data = df)
and the plot I get:
You are misinterpreting the plot.
You don't get one result per plot, but all plots show the same clusters, only in different attributes.
But you also have the issue that the R version is (to my knowledge) only fast for Euclidean distance.
In your current code, points are neighbors if (lat[i]-lat[j])^2+(lon[i]-lon[j])^2+(score[i]-score[j])^2 <= eps^2. This is bad because: 1) latitude and longitude are not Euclidean, so you should be using haversine distance instead; 2) your additional attribute has a much larger scale, so you pretty much only cluster points with near-zero score; and 3) your score attribute is skewed.
For this problem you should probably be using Generalized DBSCAN. Points are similar if their haversine distance is less than e.g. 1 mile (you want to measure geographic distance here, not coordinates, because of distortion) and if their score differs by a factor of at most 1.1 (i.e. compare score[y] / score[x] or work in logspace?). Since you want both conditions to hold, the usual Euclidean DBSCAN implementation is not enough; you need a Generalized DBSCAN that allows multiple conditions. Look for an implementation of Generalized DBSCAN (I believe there is one in ELKI that you may be able to access from Spark), or implement it yourself. It's not very hard to do.
If quadratic runtime is okay for you, you can probably use any distance-matrix-based DBSCAN, and simply "hack" a binary distance matrix:
compute Haversine distances
compute Score dissimilarity
distance = 0 if haversine < distance-threshold and score-dissimilarity < score-threshold, otherwise 1.
run DBSCAN with precomputed distance matrix and eps=0.5 (since it is a binary matrix, don't change eps!)
It's reasonably fast, but needs O(n^2) memory. In my experience, the indexes of ELKI yield a good speedup if you have larger data, and are worth a try if you run out of memory or time.
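The binary distance matrix "hack" can be sketched as follows. This is a minimal pure-Python illustration, not a full pipeline: the function names, the thresholds (1 mile, ratio 1.1), the Earth radius constant, and the four sample rows are assumptions for this example, and scores are assumed to be strictly positive so that the ratio is defined.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def binary_distance_matrix(rows, max_miles=1.0, max_ratio=1.1):
    """rows: (lat, lon, score) tuples with score > 0. 0 = neighbours, 1 = not."""
    n = len(rows)
    matrix = [[1] * n for _ in range(n)]
    for i, (lat_i, lon_i, s_i) in enumerate(rows):
        for j, (lat_j, lon_j, s_j) in enumerate(rows):
            close = haversine_miles(lat_i, lon_i, lat_j, lon_j) < max_miles
            similar = max(s_i, s_j) / min(s_i, s_j) <= max_ratio
            if close and similar:  # both conditions must hold
                matrix[i][j] = 0
    return matrix

rows = [(30.0444, 31.2357, 100.0),   # hypothetical sample points
        (30.0450, 31.2360, 105.0),   # a few dozen metres away, similar score
        (30.0450, 31.2360, 300.0),   # same place, very different score
        (31.2001, 29.9187, 100.0)]   # more than a hundred miles away
m = binary_distance_matrix(rows)
for row in m:
    print(row)
```

The resulting matrix is what gets handed to a DBSCAN implementation that accepts precomputed distances, with eps=0.5 as described above; only the first two rows come out as mutual neighbours here.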
You need to scale your data. V3 has a much larger range than V1 and V2, so V3 currently dominates the distance and V1 and V2 are mostly ignored.

Can a very large (or very small) value in feature vector using SVC bias results? [scikit-learn]

I am trying to better understand how the values of my feature vector may influence the result. For example, let's say I have the following vector with the final value being the result (this is a classification problem using an SVC, for example):
0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682, -1.918, 0.251, 0.376, 0.025291666666667, -200, 9, 1
You'll notice that most of the values center around 0, however, there is one value that is orders of magnitude smaller, -200.
I'm concerned that this value is skewing the prediction and is being weighted unfairly heavier than the rest simply because the value is so much different.
Is this something to be concerned about when creating a feature vector? Or will the statistical test I use to evaluate my vector control for this large (or small) value based on the training set I provide it with? Are there methods available in sci-kit learn specifically that you would recommend to normalize the vector?
Thank you for your help!
Yes, it is something you should be concerned about. SVMs are heavily influenced by differences in feature scale, so you need a preprocessing technique to reduce that effect. The most popular ones are:
Linearly rescale each feature dimension to the [0,1] or [-1,1] interval
Normalize each feature dimension so it has mean=0 and variance=1
Decorrelate values by the transformation sigma^(-1/2)*X, where sigma = cov(X) (the data covariance matrix)
Each can be easily performed using scikit-learn (although to achieve the third one you will need scipy for the matrix square root and inversion).
I am trying to better understand how the values of my feature vector may influence the result.
Then here's the math for you. Let's take the linear kernel as a simple example. It takes a sample x and a support vector sv, and computes the dot product between them. A naive Python implementation of a dot product would be
def dot(x, sv):
    return sum(x_i * sv_i for x_i, sv_i in zip(x, sv))
Now if one of the features has a much more extreme range than all the others (either in x or in sv, or worse, in both), then the term corresponding to this feature will dominate the sum.
A similar situation arises with the polynomial and RBF kernels. The poly kernel is just a (shifted) power of the linear kernel:
def poly_kernel(x, sv, d, gamma):
    return (dot(x, sv) + gamma) ** d
and the RBF kernel is the exponential of minus a constant times the squared distance between x and sv:
import math

def rbf_kernel(x, sv, gamma):
    diff = [x_i - sv_i for x_i, sv_i in zip(x, sv)]
    return math.exp(-gamma * dot(diff, diff))
In each of these cases, if one feature has an extreme range, it will dominate the result and the other features will effectively be ignored, except to break ties.
scikit-learn tools to deal with this live in the sklearn.preprocessing module: MinMaxScaler and StandardScaler rescale each feature, while Normalizer rescales each sample to unit norm.
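To make the domination effect concrete, here is a small hand-rolled sketch of per-feature standardization (mean 0, variance 1), which is the kind of rescaling StandardScaler performs. The standardize function and the four two-feature rows are made up for illustration, and the sketch assumes no feature is constant (a zero standard deviation would divide by zero).

```python
import math

def dot(x, sv):
    return sum(x_i * sv_i for x_i, sv_i in zip(x, sv))

def standardize(rows):
    """Rescale every feature (column) to mean 0 and variance 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Made-up data: the first feature is small, the second sits around -200.
rows = [[0.5, -200.0], [-0.5, -180.0], [1.0, -220.0], [-1.0, -200.0]]

raw = dot(rows[0], rows[1])                # the second feature swamps the first
scaled = standardize(rows)
rescaled = dot(scaled[0], scaled[1])       # both features now contribute
print(raw, rescaled)
```

Before scaling, the dot product is 35999.75, of which only -0.25 comes from the first feature; after scaling, both features are on the same footing. In practice the means and standard deviations should be computed on the training set only and then reused for new samples.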

approximation methods

I attached image:
(source: piccy.info)
The image shows a plot of a function that is defined at given points, for example at x = 1..N.
The other plot, drawn as a semitransparent curve, is what I want to get from the original one, i.e. I want to approximate the original function so that it becomes smooth.
Are there any methods for doing that?
I have heard about the least squares method, which can be used to approximate a function by a straight line or by a parabola. But I do not need a parabolic approximation.
I probably need to approximate it by a trigonometric function.
So are there any methods for doing that?
And one idea: is it possible to use the least squares method for this problem, if we can derive it for trigonometric functions?
One more question!
If I use the discrete Fourier transform and think of the function as a sum of waves, then maybe the noise has special features by which we can identify it, so that we can set the corresponding frequencies to zero and then perform the inverse Fourier transform.
If you think that is possible, what can you suggest in order to identify the frequency of the noise?
Unfortunately, many of the solutions presented here don't solve the problem and/or are plain wrong.
There are many approaches, and each is built for specific conditions and requirements that you must be aware of!
a) Approximation theory: If you have a very sharply defined function without errors (given either by definition or by data) and you want to trace it as exactly as possible, you use polynomial or rational approximation by Chebyshev or Legendre polynomials, meaning that you approximate the function by a polynomial or, if it is periodic, by a Fourier series.
b) Interpolation: If you have a function where some points (but not the whole curve!) are given and you need a function that passes through these points, you can use several methods:
Newton-Gregory, Newton with divided differences, Lagrange, Hermite, spline
c) Curve fitting: You have a function with given points and you want to draw a curve of a given (!) functional form that approximates the points as closely as possible. There are linear and nonlinear algorithms for this case.
Your drawing implies:
It is not remotely like a mathematical function.
It is not sharply defined by data or by a function.
You need to fit the curve, not some points.
What you want and need is
d) Smoothing: Given a curve or data points with noise or rapidly changing elements, you only want to see the slow changes over time.
You can do that with LOESS as Jacob suggested (but I find that overkill, especially because choosing a reasonable span needs some experience). For your problem, I simply recommend the running average as suggested by Jim C.
http://en.wikipedia.org/wiki/Running_average
Sorry, cdonner and Orendorff, your proposals are well meant but completely wrong, because you are using the right tools for the wrong problem.
These guys used a sixth-degree polynomial to fit climate data and embarrassed themselves completely.
http://scienceblogs.com/deltoid/2009/01/the_australians_war_on_science_32.php
http://network.nationalpost.com/np/blogs/fullcomment/archive/2008/10/20/lorne-gunter-thirty-years-of-warmer-temperatures-go-poof.aspx
Use loess in R (free).
E.g. here the loess function approximates a noisy sine curve.
(source: stowers-institute.org)
As you can see you can tweak the smoothness of your curve with span
Here's some sample R code from here:
Step-by-Step Procedure
Let's take a sine curve, add some "noise" to it, and then see how the loess "span" parameter affects the look of the smoothed curve.
Create a sine curve and add some noise:
period <- 120
x <- 1:120
y <- sin(2*pi*x/period) + runif(length(x),-1,1)
Plot the points on this noisy sine curve:
plot(x,y, main="Sine Curve + 'Uniform' Noise")
mtext("showing loess smoothing (local regression smoothing)")
Apply loess smoothing using the default span value of 0.75:
y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))
Compute loess smoothed values for all points along the curve:
y.predict <- predict(y.loess, data.frame(x=x))
Plot the loess smoothed curve along with the points that were already plotted:
lines(x,y.predict)
You could use a digital filter like a FIR filter. The simplest FIR filter is just a running average. For more sophisticated treatment, look at something like an FFT.
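A running average is short enough to write by hand. The sketch below is one simple variant, a centred window that shrinks near the ends of the series; the function name running_average, the window size, and the sample values are choices made for this example.

```python
def running_average(y, window=5):
    """Centred moving average; the window shrinks near the ends of the series."""
    half = window // 2
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out

noisy = [1, 5, 1, 5, 1, 5, 1]
smooth = running_average(noisy, window=3)
print(smooth)
```

A larger window gives a smoother curve but also flattens genuine features, so the window plays the same role here as span does for loess.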
This is called curve fitting. The best way to do this is to find a numeric library that can do it for you. Here is a page showing how to do this using scipy. The picture on that page shows what the code does:
(source: scipy.org)
Now it's only 4 lines of code, but the author doesn't explain it at all. I'll try to explain briefly here.
First you have to decide what form you want the answer to be. In this example the author wants a curve of the form
f(x) = p0 cos (2π/p1 x + p2) + p3 x
You might instead want the sum of several curves. That's OK; the formula is an input to the solver.
The goal of the example, then, is to find the constants p0 through p3 to complete the formula. scipy can find this array of four constants. All you need is an error function that scipy can use to see how close its guesses are to the actual sampled data points.
fitfunc = lambda p, x: p[0]*cos(2*pi/p[1]*x+p[2]) + p[3]*x # Target function
errfunc = lambda p: fitfunc(p, Tx) - tX # Distance to the target function
errfunc takes just one parameter: an array of length 4. It plugs those constants into the formula and calculates an array of values on the candidate curve, then subtracts the array of sampled data points tX. The result is an array of error values; presumably scipy will take the sum of the squares of these values.
Then just put some initial guesses in and scipy.optimize.leastsq crunches the numbers, trying to find a set of parameters p where the error is minimized.
p0 = [-15., 0.8, 0., -1.] # Initial guess for the parameters
p1, success = optimize.leastsq(errfunc, p0[:])
The result p1 is an array containing the four constants. success is 1, 2, 3, or 4 if the solver actually found a solution. (If the errfunc is sufficiently crazy, the solver can fail.)
This looks like a polynomial approximation. You can play with polynomials in Excel ("Add Trendline" to a chart, select Polynomial, then increase the order to the level of approximation that you need). It shouldn't be too hard to find an algorithm/code for that.
Excel can show the equation that it came up with for the approximation, too.

Resources