Spatstat, using the Matérn cluster process to generate homogeneous landscapes, how do I interpret the Ripley K function?

I am looking to develop a point process that ranges from homogeneous, i.e. no correlation between points, to a clustered point process in which the points are correlated. From experimentation I can see that using the Matérn cluster process I can generate landscapes that are clustered.
library(spatstat)
plot(rMatClust(kappa=3,r=0.1,mu=50))
I want to use the simplest code that lets me increase the level of homogeneity, i.e. decrease the dependence of the points on each other. I do not want a binary model where the pattern is either homogeneous or not, i.e. just a Poisson process, which can be generated with:
plot(rpoispp(150))
From experimentation I noticed that if I increase the radius of the clusters in the Matérn cluster process, I do seem to create a pseudo-homogeneous pattern.
plot(rMatClust(kappa=3,r=0.3,mu=50))
plot(rMatClust(kappa=3,r=0.7,mu=50))
Is this a good way of generating degrees of homogeneity? I understand that I can use statistical tests, such as Ripley's K, to measure the degree of clustering relative to a completely random Poisson process. For example, if I assign the Matérn cluster process data to variables, such as:
a<-rMatClust(kappa=3,r=0.1,mu=50)
b<-rMatClust(kappa=3,r=0.3,mu=50)
c<-rMatClust(kappa=3,r=0.7,mu=50)
Then compute Ripley's K and plot the results:
plot(Kest(a))
plot(Kest(b))
plot(Kest(c))
I can see that the difference between a homogeneous Poisson process and the clustered point process decreases. I still do not fully understand the significance of the various K values, the role of edge effects, and how to interpret the Ripley K function, but I think this is the right direction to be heading in? How do I interpret the Ripley K function? Another problem is that the number of points is not consistent across plots, as can be seen by:
summary(a)
summary(b)
summary(c)
Any knowledgeable feedback on this is greatly appreciated.

The standard terminology is that you want to generate a clustered point pattern.
The function rMatClust generates a clustered point pattern at random, in a two-stage process. The first stage is to generate "parent" points completely at random. The second stage is to generate, for each "parent", a random number of "offspring" points, and to place the "offspring" points inside a circle of radius R around their "parent". The final result is the collection of all "offspring" points. From this description (and help(rMatClust)) you can figure out what happens for different parameter values.
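To make this concrete, here is a minimal hand-rolled sketch of the same two-stage construction (rMatClust does all of this internally, and in addition simulates the parents in an expanded window so that clusters can spill in across the boundary; the variable names here are illustrative):
library(spatstat)
kappa <- 3; R <- 0.1; mu <- 50
parents <- rpoispp(kappa)              # stage 1: Poisson "parent" points
n_off <- rpois(parents$n, mu)          # random number of "offspring" per parent
ang <- runif(sum(n_off), 0, 2 * pi)    # offspring uniform in a disc of radius R
rad <- R * sqrt(runif(sum(n_off)))
off_x <- rep(parents$x, n_off) + rad * cos(ang)
off_y <- rep(parents$y, n_off) + rad * sin(ang)
ok <- off_x >= 0 & off_x <= 1 & off_y >= 0 & off_y <= 1   # clip to the unit square
plot(ppp(off_x[ok], off_y[ok], window = owin(c(0, 1), c(0, 1))))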
The K function (not the "K test") is a summary of the spacing between points in a point pattern. At a distance r, the value of K(r) is the normalised average number of points observed to fall within distance r of a typical point in the pattern. It is normalised so that it does not depend on the number of points, making it possible to compare patterns with different numbers of points.
When you plot the K function, one of the curves is the theoretical curve that would be expected if the points were completely random, and the other curves are estimates computed from the data point pattern (they differ in the edge correction applied). This allows you to assess whether the point pattern appears to be clustered.
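If it helps, a compact way to formalise this visual comparison is a simulation envelope of K under complete spatial randomness. A minimal sketch, reusing the pattern a from the question:
a <- rMatClust(kappa = 3, r = 0.1, mu = 50)
# 39 simulations gives a pointwise test at roughly the 5% level;
# clustering shows as the empirical curve rising above the envelope
plot(envelope(a, Kest, nsim = 39))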
I strongly suggest you do some reading in Chapter 7 of the spatstat book. You can download this chapter for free.

Related

pointwise envelopes not including Theoretical line Foxall J

I am computing pointwise envelopes for Foxall's J function to investigate whether some point patterns of interest are clustered with, avoid, or are independent of other point patterns or polygons.
I am doing this using spatstat with a syntax similar to this:
envelope(my_pattern_of_interest, fun=Jfox, funargs=list(Y=my_other_pattern), ...)
I am calculating the envelopes for several replicated patterns (i.e., different transects) and then pooling the envelopes (with pool) before plotting.
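For concreteness, the workflow looks roughly like this (the pattern names are placeholders; note that pool needs the envelopes to have been computed with savefuns = TRUE):
library(spatstat)
e1 <- envelope(pattern1, fun = Jfox, funargs = list(Y = other1), nsim = 39, savefuns = TRUE)
e2 <- envelope(pattern2, fun = Jfox, funargs = list(Y = other2), nsim = 39, savefuns = TRUE)
plot(pool(e1, e2))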
So far, whenever Jfox is calculated between two point patterns, the shaded area representing the simulation envelope includes the theoretical line (represented by red dashes), as in this example:
Figure 1
Instead, when Jfox is calculated between a point pattern and polygons, the envelope area frequently does not include the theoretical line (at least in some regions), as in this example:
Figure 2
What does this mean?
From my understanding so far, if - or rather "where", since these are pointwise envelopes - the observed line (solid black) is within the shaded area, the observed pattern (say, clustering as in Figure 2) is not significantly different from what could be observed at random. Alternatively, if/where the observed line is outside the shaded area, then I have a significant difference.
What does the fact that the theoretical line lies outside the shaded area mean? Is it giving me another piece of information? How should I interpret this?
Thank you.

Is there a way to locate the knee of the k-nearest neighbour graph?

I am trying to write a function in R that automatically chooses the optimal parameters epsilon and MinPts in a DBSCAN analysis. I found the k-nearest-neighbour plot very useful for selecting the optimal eps. However, I am trying to make the whole process automatic, and I was wondering whether there is any method to locate the exact position of the knee so as to obtain the most representative eps possible.
I tried listing the slopes at each point of a kNN-dist plot and taking the maximum value (since the knee represents the point of maximum curvature), but it wasn't very useful. Is it, indeed, possible to find the knee automatically, or should one just look at the plot visually?
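One heuristic worth trying instead of raw slopes is to take the point of the sorted kNN-distance curve that lies farthest from the straight line joining its endpoints (the idea behind the "kneedle" method). A minimal sketch, assuming the dbscan package; find_knee is an illustrative name:
library(dbscan)
find_knee <- function(x, k) {
  d <- sort(kNNdist(x, k = k))               # sorted k-NN distances
  n <- length(d)
  x1 <- 1; y1 <- d[1]; x2 <- n; y2 <- d[n]   # chord endpoints
  # perpendicular distance of each curve point from the chord
  num <- abs((y2 - y1) * seq_len(n) - (x2 - x1) * d + x2 * y1 - y2 * x1)
  d[which.max(num / sqrt((y2 - y1)^2 + (x2 - x1)^2))]
}
# eps <- find_knee(my_data, k = MinPts - 1)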
Thanks in advance.

Clustering GPS data using DBSCAN but clusters are not meaningful (in terms of size)

I am working with GPS data (latitude, longitude). For density based clustering I have used DBSCAN in R.
Advantages of DBSCAN in my case:
I don't have to predefine the number of clusters
I can calculate a distance matrix (using the haversine distance formula) and use that as input to dbscan
library(fossil)
dist <- earth.dist(df, dist = TRUE)  # df is the dataset containing lat/long values
library(fpc)
dens <- dbscan(dist, MinPts = 25, eps = 0.43, method = "dist")
Now, when I look at the clusters, they are not meaningful. Some clusters have points that are more than 1 km apart. I want dense clusters, but not ones that big.
I have tried different values of MinPts and eps, and I have also used the k-nearest-neighbour distance graph to get an optimum value of eps for MinPts = 25.
What dbscan does is go to every point in my dataset; if a point p has MinPts points in its eps-neighbourhood it will make a cluster, but at the same time it also joins clusters that are density-reachable (which I guess is what is creating the problem for me).
It really is a big question, particularly "how to reduce the size of a cluster without affecting its information too much", but I will write it down as the following points:
1. How can I remove border points in a cluster? I know which points are in which cluster using dens$cluster, but how would I know whether a particular point is core or border?
2. Is cluster 0 always noise?
3. I was under the impression that the size of a cluster would be comparable to eps. But that's not the case, because density-reachable clusters are combined together.
4. Is there any other clustering method that has the advantages of dbscan but can give me more meaningful clusters?
5. OPTICS is another alternative, but will it solve my issue?
Note: by meaningful I mean that closer points should be in a cluster, but points that are 1 km or more apart should not be in the same cluster.
DBSCAN doesn't claim the radius is the maximum cluster size.
Have you read the article? It looks for arbitrarily shaped clusters; eps is just the core size of a point, roughly the size used for density estimation; any point within this radius of a core point will be part of a cluster.
This makes it essentially the maximum step size to connect dense points. But they may still form a chain of density-connected points, of arbitrary shape or size.
I don't know what cluster 0 is in your R implementation. I've experimented with the R implementation, but it was waaaay slower than all the others. I don't recommend using R; there are much better tools for cluster analysis available, such as ELKI. Try running DBSCAN with your settings in ELKI, with LatLngDistanceFunction and an R-tree index bulk-loaded with sort-tile-recursive. You'll be surprised how fast it can be compared to R.
OPTICS looks for the same density-connected type of clusters. Are you sure this arbitrarily shaped type of cluster is what you are looking for?
IMHO, you are using the wrong method for your goals (and you aren't really explaining what you are trying to achieve).
If you want a hard limit on the cluster diameter, use complete-linkage hierarchical clustering.
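As a sketch of that suggestion, reusing the haversine distance matrix from the question (the 1 km cut height is illustrative; earth.dist reports kilometres):
hc <- hclust(dist, method = "complete")  # 'dist' is the earth.dist matrix from above
clusters <- cutree(hc, h = 1.0)          # complete linkage: no cluster has diameter > 1 km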

Cluster assignments differ sometimes in two DBSCAN implementations

I have implemented the DBSCAN algorithm in R, and I am matching the cluster assignments with the DBSCAN implementation of the fpc library. Testing is done on synthetic data generated as in the fpc library's dbscan example:
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2), runif(10, 0, 10) + rnorm(n, sd = 0.3))  # 10 cluster centres, recycled across the n points
Clustering is done with parameters as below:
eps = 0.2
MinPts = 5
I am comparing the cluster assignments of fpc::dbscan with those of my implementation. In most runs, every point is classified identically by both implementations.
But there are some cases where 1 or 2 points, and in rare cases 5 or 6 points, are assigned to different clusters in my implementation than in the fpc implementation. I have noticed that only the classification of border points differs. After plotting, I have seen that the points whose cluster membership does not match between the implementations are positioned such that they could be assigned to any of their surrounding clusters, depending on which cluster's seed point discovers them first.
I am showing an image with 150 points (to avoid clutter) in which one point's classification differs. Note that the mismatched point's cluster number is always greater in my implementation than in the fpc implementation.
[Figure: plot of clusters. Top inset: fpc::dbscan; bottom inset: my dbscan implementation. The point which differs in my implementation is marked with an exclamation mark (!).]
I am also uploading zoomed images of the mismatch section:
[Figure: my dbscan implementation output. '+' marks core points, 'o' marks border points, '-' marks noise points, and '!' highlights the differing point.]
[Figure: fpc::dbscan implementation output. Triangles are core points, coloured circles are border points, black circles are noise points.]
Another example:
[Figure: my dbscan implementation output]
[Figure: fpc::dbscan implementation output]
EDIT
[Figure: equal x-y scaled example, as requested by Anony-Mousse]
In different cases, sometimes my implementation seems to have classified the mismatched point correctly, and sometimes the fpc implementation seems to have. See below:
[Figure: fpc::dbscan (the triangle plot) seems to have classified the mismatched point correctly]
[Figure: my dbscan implementation (the + plot) seems to have classified the mismatched point correctly]
Question
I am new to cluster analysis, so I have another question: are these types of differences allowable?
In my implementation I scan from the first point to the last point, in the order supplied; in fpc::dbscan the points are scanned in the same order. In that case both implementations should have discovered the mismatched point (marked by !) from the same cluster centre. I have also generated some cases in which fpc::dbscan marks a point as noise but my implementation assigns it to some cluster. Why is this difference occurring?
Code segments on request.
DBSCAN is known to be order-dependent for border points. They will be assigned to the cluster from which they are first discovered. If a border point is not dense itself, but is in the vicinity of dense points from two different clusters, it can be assigned to either.
This is why DBSCAN is often described as "order independent, except for border points".
Try shuffling the data (or reversing!), then rerunning your algorithm. The results should change.
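For example, a minimal sketch using fpc::dbscan on the synthetic data from the question:
library(fpc)
set.seed(1)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2), runif(10, 0, 10) + rnorm(n, sd = 0.3))
d1 <- dbscan(x, eps = 0.2, MinPts = 5)
perm <- sample(n)                            # shuffle the point order
d2 <- dbscan(x[perm, ], eps = 0.2, MinPts = 5)
# labels may be permuted between runs, but border points can genuinely move
table(d1$cluster, d2$cluster[order(perm)])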
As I assume neither your implementation nor the fpc implementation has index support (to speed up range queries and make the algorithm run in O(n log n)), I'd guess that one of the implementations is processing the points in forward order and the other in backward order. Update: indexes should not play much of a role, as they don't change the order across clusters, only within one cluster.
Another option for "generating" this difference is to
1. keep the first (non-noise) cluster assignment of each point (IIRC the official DBSCAN pseudocode), or
2. keep the last cluster assignment of each point (fpc::dbscan seems to do this).
These will also generate different results on objects that are border points of more than one cluster. There is also the possibility of assigning such points to both clusters, which yields a non-strict partitioning of the data set. Usually, the benefits of having a strict partitioning are more important than having a fully deterministic result.
Don't get me wrong: the "overwrite" strategy of fpc::dbscan doesn't substantially change the results. I would probably even implement it that way myself.
Are any non-border points affected?

Point Sequence Interpolation

Given an arbitrary sequence of points in space, how would you produce a smooth continuous interpolation between them?
2D and 3D solutions are welcome. Solutions that produce a list of points at arbitrary granularity and solutions that produce control points for Bezier curves are also appreciated.
Also, it would be cool to see an iterative solution that could approximate early sections of the curve as it received the points, so you could draw with it.
The Catmull-Rom spline is guaranteed to pass through all the control points. I find this to be handier than trying to adjust intermediate control points for other types of splines.
This PDF by Christopher Twigg has a nice brief introduction to the mathematics of the spline. The best summary sentence is:
Catmull-Rom splines have C1 continuity, local control, and interpolation, but do not lie within the convex hull of their control points.
Said another way, if the points indicate a sharp bend to the right, the spline will bank left before turning to the right (there's an example picture in that document). The tightness of those turns is controllable, in this case using his tau parameter in the example matrix.
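If a code sketch helps, here is a minimal uniform Catmull-Rom segment in R with that tau tension parameter (the function name and defaults are mine; tau = 0.5 gives the classic spline):
# interpolate between p1 and p2, using neighbours p0 and p3 for the tangents
catmull_rom <- function(p0, p1, p2, p3, n = 20, tau = 0.5) {
  t <- seq(0, 1, length.out = n)
  b0 <- -tau*t + 2*tau*t^2 - tau*t^3
  b1 <- 1 + (tau - 3)*t^2 + (2 - tau)*t^3
  b2 <- tau*t + (3 - 2*tau)*t^2 + (tau - 2)*t^3
  b3 <- -tau*t^2 + tau*t^3
  outer(b0, p0) + outer(b1, p1) + outer(b2, p2) + outer(b3, p3)  # n x dim matrix
}
pts <- rbind(c(0, 0), c(1, 2), c(3, 3), c(4, 0))
seg <- catmull_rom(pts[1, ], pts[2, ], pts[3, ], pts[4, ])
plot(seg, type = "l", asp = 1); points(pts, col = "red")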
Here is another example with some downloadable DirectX code.
One way is the Lagrange polynomial, which is a method for producing a polynomial that goes through all the given data points.
During my first year at university, I wrote a little tool to do this in 2D; you can find it on this page, it is called Lagrange solver. Wikipedia's page also has a sample implementation.
How it works is this: you have a polynomial p(x) of degree n-1, where n is the number of points you have. It has the form a_(n-1) x^(n-1) + a_(n-2) x^(n-2) + ... + a_0, where _ is subscript and ^ is power. You then turn this into a set of simultaneous equations:
p(x_1) = y_1
p(x_2) = y_2
...
p(x_n) = y_n
You convert the above into an augmented matrix and solve for the coefficients a_0 ... a_(n-1). Then you have a polynomial which goes through all the points, and you can now interpolate between them.
Note, however, that this may not suit your purpose, as it offers no way to adjust the curvature etc.; you are stuck with a single solution that cannot be changed.
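For illustration, a small R sketch of this approach, solving the Vandermonde system for the coefficients (names are illustrative; solving this system directly is numerically poor for many points, but fine as a demonstration):
lagrange_interp <- function(x, y, xout) {
  V <- outer(x, seq_along(x) - 1, `^`)  # Vandermonde matrix: V[i, j] = x_i^(j-1)
  a <- solve(V, y)                      # coefficients a_0 ... a_(n-1)
  sapply(xout, function(x0) sum(a * x0 ^ (seq_along(a) - 1)))
}
x <- c(0, 1, 2, 4); y <- c(1, 3, 2, 5)
grid <- seq(0, 4, length.out = 100)
plot(grid, lagrange_interp(x, y, grid), type = "l"); points(x, y, col = "red")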
You should take a look at B-splines. Their advantage over Bezier curves is that each part is only dependent on local points. So moving a point has no effect on parts of the curve that are far away, where "far away" is determined by a parameter of the spline.
The problem with the Lagrange polynomial is that adding a point can have extreme effects on seemingly arbitrary parts of the curve; there is no "localness" like that described above.
Have you looked at the Unix spline command? Can that be coerced into doing what you want?
There are several algorithms for interpolating (and extrapolating) between an arbitrary (but finite) set of points. You should check out Numerical Recipes, which also includes C++ implementations of those algorithms.
Unfortunately, Lagrange or other forms of polynomial interpolation will not work on an arbitrary set of points. They only work on a set of points that is ordered in one dimension, e.g. x, such that x_i < x_(i+1).
For an arbitrary set of points, e.g. an aeroplane flight path where each point is a (longitude, latitude) pair, you will be better off simply modelling the aeroplane's journey with its current longitude, latitude, and velocity. By adjusting the rate at which the aeroplane can turn (its angular velocity) depending on how close it is to the next waypoint, you can achieve a smooth curve.
The resulting curve would not be mathematically significant, nor give you Bezier control points. However, the algorithm would be computationally simple regardless of the number of waypoints and could produce an interpolated list of points at arbitrary granularity. It would also not require you to provide the complete set of points up front; you could simply add waypoints to the end of the set as required.
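A minimal sketch of this steering idea in R (all names and constants are illustrative; the stopping tolerance is set to one turning radius so the point cannot orbit a waypoint forever):
smooth_path <- function(waypoints, speed = 0.05, max_turn = 0.2) {
  pos <- waypoints[1, ]; heading <- 0
  path <- list(pos)
  tol <- speed / max_turn  # roughly the turning radius
  for (i in 2:nrow(waypoints)) {
    target <- waypoints[i, ]
    while (sqrt(sum((target - pos)^2)) > tol) {
      want <- atan2(target[2] - pos[2], target[1] - pos[1])
      turn <- atan2(sin(want - heading), cos(want - heading))  # wrap to [-pi, pi]
      heading <- heading + max(-max_turn, min(max_turn, turn)) # cap the turn rate
      pos <- pos + speed * c(cos(heading), sin(heading))
      path[[length(path) + 1]] <- pos
    }
  }
  do.call(rbind, path)
}
wp <- rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2))
plot(smooth_path(wp), type = "l", asp = 1); points(wp, col = "red")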
I came up with the same problem and implemented it with some friends the other day. I'd like to share the example project on GitHub.
https://github.com/johnjohndoe/PathInterpolation
Feel free to fork it.
Google "orthogonal regression".
Whereas least-squares techniques try to minimize vertical distance between the fit line and each f(x), orthogonal regression minimizes the perpendicular distances.
Addendum
In the presence of noisy data, the venerable RANSAC algorithm is worth checking out too.
In the 3D graphics world, NURBS are popular. Further info is easily googled.
