Creation of correlated marks. E.g. point sizes varying with inter-point distances - r

I recently dabbled a bit into point pattern analysis and wonder if there is any standard practice to create mark correlation structures varying with the inter-point distance of point locations. Clearly, I understand how to simulate independent marks, as it is frequently mentioned e.g.
library(spatstat)
data(finpines)
set.seed(0907)
marks(finpines) <- rnorm(npoints(finpines), 30, 5)
plot(finpines)
More generally speaking, assume we have a fair amount of points, say n=100 with coordinates x and y in an arbitrary observation window (e.g. rectangle). Every point carries a characteristic, for example the size of the point as a continuous variable. Also, we can examine every pairwise distance between the points. Is there a way to introduce correlation structure between the marks (of pairs of points) which depends on the inter-point distance between the point locations?
Furthermore, I am aware of the existence of mark analysing techniques like
fin <- markcorr(finpines, correction = "best")
plot(fin)
When it comes to interpretation, my lack of knowledge forces me to trust my colleagues (non-scientists). Besides, I looked at several references given in the documentation of the spatstat functions; in particular, I had a look at "Statistical Analysis and Modelling of Spatial Point Patterns", p. 347, where inhibition and mutual stimulation are explained as deviations of the normalised mark correlation function from 1 (independence of marks).

I think the best bet is to use a random field model conditional on your locations. Unfortunately the package RandomFields is not on CRAN at the moment, but hopefully it will return soon. I think it may be possible to install an old version of RandomFields from the archives if you want to get going immediately.
Thus, the procedure is:
Use spatstat to generate the random locations you like.
Extract the coordinates (either coords() or as.data.frame.ppp()).
Define a model in RandomFields (e.g. RMexp()).
Simulate the model on the given coordinates (RFsimulate()).
Convert back to a marked point pattern in spatstat (either ppp() or as.ppp()).
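The steps above can be sketched without RandomFields, using base R only. This is a minimal illustration, not the RandomFields API: it assumes uniformly random stand-in locations in the unit square, an exponential covariance (the same family as RMexp()), and made-up values for the mean, variance and range. The marks are drawn as a Gaussian random field evaluated at the point locations via a Cholesky factor of the covariance matrix.

```r
set.seed(1)

# Stand-in locations: n uniform points in the unit square (assumption;
# in practice these would come from coords() of a spatstat pattern).
n <- 100
x <- runif(n)
y <- runif(n)

# Pairwise inter-point distances.
D <- as.matrix(dist(cbind(x, y)))

# Assumed model parameters (hypothetical values for illustration).
mu     <- 30    # mean mark
sigma2 <- 25    # mark variance
scale  <- 0.2   # correlation range of the exponential model

# Exponential covariance: nearby points get strongly correlated marks.
Sigma <- sigma2 * exp(-D / scale)

# chol() returns upper-triangular U with t(U) %*% U = Sigma,
# so mu + t(U) %*% z has covariance Sigma for standard normal z.
U <- chol(Sigma)
m <- mu + drop(t(U) %*% rnorm(n))

# With spatstat loaded, the marks could then be attached to a pattern X
# built from the same coordinates, e.g.:
# marks(X) <- m
```

A smaller `scale` makes the correlation die off faster with distance; as `scale` approaches zero the marks become effectively independent, which is the case already shown in the question.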


PCL RANSAC model fitting: How can I initialise the model parameters?

I'm reading the PCL tutorial on plane segmentation, because I want to find 3D circles in a very large and dense point cloud I have.
I know already the approximate values for center, radius and orientation of the circle, but I have found no way so far to inform the SACSegmentation object of this fact. I could also name 3 inliers to compute initial values on, but I also don't find a way to do this.
My pointcloud is extremely large (10-20M points), so just random samples will likely be prohibitive, especially since I know already more or less what the parameter values should be and only want to optimize them.
Question: How can I set the starting point of the Sample Consensus optimization procedure?
To segment and optimize the model:
Set SACSegmentation::setOptimizeCoefficients(true).
Use SACSegmentation::segment, which takes in an initial guess (or the final model to segment with, iff optimize coefficients is set to false).
You can provide your guess here. Depending on the optimization method used, this can reduce the computational load.

What is the difference between metric and non-metric MDS for a beginner?

I am fairly new to data science and would like to know in simple words (like teaching your grandmother) what the difference between metric and non-metric Multidimensional scaling is.
I have been googling for 2 days and watching different videos and wasn't able to quite understand some of the terms people are using to describe the difference, maybe I am lacking some basic knowledge but I don't know in which area so if you have an idea of what I should have a firm understanding of before tackling this subject, I would appreciate the advice. Here is what I know:
Multidimensional scaling is a way of reducing dimensions to be able to visualize or represent data in a more friendly manner. I know that there are several ways for MDS like metric and non metric, PCA and FA (maybe FA is a part of PCA, I'm not sure).
The example I am trying to apply this on is a set of data showing different cities and attributes related to these cities. For example, on a score from 1-7 (1 lowest - 7 highest), this is the score of each city and the corresponding attribute.
City        Clean   Friendly   Expensive   Beautiful
Berlin        4        2           5           6
Geneva        6        3           7           7
Paris         3        4           6           7
Barcelona     2        6           3           4
How do I know if I should be using metric or non-metric MDS? Are there general rules of thumb or simple logic that I can use to decide without going deep into the technical process?
Thank you
Well, I might not be able to give you a specific answer but a simple answer would be that metric MDS already has the input matrix in the form of distances (i.e. actual distances between cities) and therefore the distances have meaning in the input matrix and create a map of actual physical locations from those distances.
In non-metric MDS, the distances are just a representation of the rankings (i.e. high as in 7 or low as in 1) and they do not have any meaning on their own but they are needed to create the map using euclidean geometry and the map then just shows the similarity in rankings represented by distances between coordinates on the map.
Metric MDS deals with an item x item input matrix whose entries represent Euclidean distance (this special case of metric MDS is called classical MDS and is equivalent to PCA) or any other distance between items.
Non-metric MDS deals with some distance-like measure (let's call it dissimilarity) between items. There is no requirement for the dissimilarity to satisfy formal properties of a distance/metric (see this wiki for needed properties). The only requirement is that it should be possible to order the dissimilarity values for all item x item pairs in non-decreasing order.
In your case, the item x attribute matrix contains ordinal data (data on a scale 1-7). Euclidean distance won't be appropriate here, but e.g. Pearson "distance" or cosine "distance" are usually used for such data and, as they're not proper distances, non-metric MDS should then be chosen.
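As a rough illustration of the distinction, the sketch below runs classical (metric) MDS on the city table from the question with base R's cmdscale(), and then shows the one property non-metric MDS relies on: only the rank order of the dissimilarities matters, so any monotone transformation leaves that input unchanged. Euclidean distance is used here purely to keep the example in base R; as noted above, it is not necessarily the right choice for ordinal scores.

```r
# City scores from the question (1 = lowest, 7 = highest).
scores <- matrix(c(4, 2, 5, 6,
                   6, 3, 7, 7,
                   3, 4, 6, 7,
                   2, 6, 3, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("Berlin", "Geneva", "Paris", "Barcelona"),
                                 c("Clean", "Friendly", "Expensive", "Beautiful")))

# Metric case: the numeric values of the distances are taken at face value.
d <- dist(scores)               # Euclidean distances between cities
coords <- cmdscale(d, k = 2)    # classical MDS into 2 dimensions

# Non-metric case: only the ordering of the dissimilarities is used.
dv <- as.vector(d)
d_ranks <- rank(dv)

# A monotone transformation (here squaring) changes the values but not
# their order, so a non-metric method sees the same input.
same_order <- identical(order(dv), order(dv^2))
```

Plotting `coords` with the city names as labels gives the usual MDS map; a non-metric fit (for example MASS::isoMDS, if the MASS package is available) would instead try to preserve `d_ranks`.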

Why is k-means clustering ignoring a significant patch of data?

I'm working with a set of co-ordinates, and want to dynamically (I have many sets that need to go through this process) understand how many distinct groups there are within the data. My approach was to apply k-means to investigate whether it would find the centroids and I could go from there.
When plotting some data with 6 distinct clusters (visually) the k-means algorithm continues to ignore two significant clusters while putting many centroids into another.
See image below:
Red are the co-ordinate data points and blue are centroids that k-means has provided. In this specific case I've gone for 15 (arbitrary), but it still doesn't recognise those patches of data on the right hand side, rather putting a mid point between them while putting in 8 in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in x and y co-ordinates. I've tried standardising the data, but this doesn't make any difference.
Any thoughts on why this is, or other potential methodologies that could be applied to try and dynamically understand the number of distinct clusters there are in the data?
You could try a self-organizing map (SOM):
A SOM is a clustering algorithm based on neural networks that creates a discretized representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction.
This algorithm is also well suited to clustering because it does not require an a priori choice of the number of clusters (in k-means you need to choose k; here you do not). In your case, it will hopefully find the optimal number of clusters automatically, and you can visualize the result.
There is a very nice Python package called somoclu which implements this algorithm and offers an easy way to visualize the result. Otherwise you can use R. Here you can find a blog post with a tutorial, and the CRAN package manual for SOM.
K-means is a randomized algorithm and it can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the lowest within-cluster sum of squares, i.e., the best of the local minima found.
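The restart strategy described above is built into base R's kmeans() through the nstart argument. The sketch below uses synthetic data (six made-up Gaussian blobs, not the data from the question) and compares a single random start with the best of 25 starts.

```r
# Synthetic data: six well-separated blobs (assumed for illustration).
set.seed(42)
centres <- rbind(c(0, 0), c(5, 0), c(0, 5), c(5, 5), c(10, 2), c(10, 8))
xy <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  cbind(rnorm(50, centres[i, 1], 0.4), rnorm(50, centres[i, 2], 0.4))
}))

# One random initialisation: may end in a poor local minimum.
set.seed(1)
single <- kmeans(xy, centers = 6, nstart = 1, iter.max = 50)

# 25 random initialisations: kmeans() keeps the run with the lowest
# total within-cluster sum of squares.
set.seed(1)
multi <- kmeans(xy, centers = 6, nstart = 25, iter.max = 50)

c(single = single$tot.withinss, multi = multi$tot.withinss)
```

Comparing `tot.withinss` across several values of `centers` (an "elbow" plot) is one common, if subjective, way to get a feel for the number of clusters.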

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that Levenshtein distance is good to be used as a distance function for strings. Also, since I do not know in advance the number of clusters, hierarchical clustering is the way to go and not k-means.
Although I understand the problem in its abstract form, I do not know what the easiest way to actually do it is. For example, is MATLAB or R a better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)?
For both, one may easily find a Levenshtein distance implementation. The clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
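set.seed(1)
# Return a vector of n random lowercase strings, each k characters long
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
# 30 strings in three artificial groups, prefixed "aa", "bb" and "cc"
strs <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix
d <- adist(strs)
rownames(d) <- strs
# Hierarchical clustering (complete linkage is the hclust default)
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
# Attach each string's cluster id
df <- data.frame(str = strs, cluster = cutree(hc, k = 3))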
In this example, we artificially create a set of 30 random char(5) strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
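When the number of clusters is not known in advance, as in the question, one option is to cut the dendrogram at a distance threshold instead of at a fixed k. The sketch below uses a small made-up word list and an arbitrary threshold of 2 edits; with complete linkage, every pair of strings inside a resulting cluster is then at most that many edits apart.

```r
# Hypothetical example strings (not from the question).
words <- c("cat", "cats", "bat", "dog", "dogs", "dig",
           "house", "mouse", "horse")

# Levenshtein distance matrix.
d <- adist(words)
dimnames(d) <- list(words, words)

# Complete linkage: merge height = largest pairwise distance in the cluster.
hc <- hclust(as.dist(d), method = "complete")

# Cut by height rather than by number of clusters (threshold is an assumption).
h <- 2
cl <- cutree(hc, h = h)

# Strings grouped by cluster id.
groups <- split(words, cl)

# Largest within-cluster edit distance, per cluster.
diam <- sapply(split(seq_along(words), cl), function(i) max(d[i, i]))
```

The number of clusters then falls out of the data as `length(groups)`; a different linkage method would not give the same diameter guarantee.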
ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. Word segmentation through cross-lingual word-to-phoneme alignment. Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
We would of course appreciate additional contributions.
While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).
Most often the OMA is carried out in three steps. First, you define your sequences. From your description I can assume that each letter is a separate "state", the building block in a sequence. Second, you will employ one of the several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you will feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to gain popularity due to the additional information on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in the sequence analysis.
In R the most convenient package with a great number of functions is TraMineR, the website can be found here. Its user guide is very accessible, and developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, except for the decision on the number of clusters. The guide for TraMineR shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA; cluster membership is contained in the clusterward1 object, with which you can do whatever you want: plotting, recoding as variables etc. The diss=TRUE option indicates that the data object is the dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is to choose the right distance algorithm, suitable for your particular application. Once you have that and can justify the choice, the rest is quite easy. Good luck!
If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

Calculus, How can you find an equation from a series of numbers?

I'm analyzing financial data and would like to find the inflection points of a line. I know I can do this using derivatives, but first I need an equation. Is there a way to generate an equation based on a series of numbers? I would need to do this programmatically.
Spline interpolation is probably more useful for you than polynomial interpolation: if you fit a polynomial, it must inevitably head off to +/- infinity outside your data range.
You will also want a method which allows a slightly loose fit: financial data is often a bit noisy which can result in very weird curves if you try to fit it exactly.
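A loose spline fit of this kind can be sketched with base R's smooth.spline(). The data below is synthetic (a noisy sine wave standing in for a price series), and the smoothing level `df = 8` is an arbitrary assumption that would need tuning on real data. Inflection points are taken as the places where the fitted second derivative changes sign.

```r
# Synthetic noisy series (assumption: stands in for the real data).
set.seed(1)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(length(x), sd = 0.02)

# Smoothing spline: a loose fit, not forced through every point.
fit <- smooth.spline(x, y, df = 8)

# Second derivative of the fitted curve on the original grid.
d2 <- predict(fit, x, deriv = 2)$y

# An inflection point lies between grid points where the sign flips.
cross <- which(diff(sign(d2)) != 0)
inflection_x <- (x[cross] + x[cross + 1]) / 2
```

For this test curve one of the detected points should fall near x = pi, where the sine wave changes curvature. Sign changes very close to the ends of the range are less trustworthy, since the fitted second derivative tends towards zero there.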
There are established procedures for turning a set of existing data points into a polynomial; this is called Polynomial Interpolation. This article in Wikipedia: http://en.wikipedia.org/wiki/Polynomial_interpolation
explains it mathematically. You can probably Google for algorithms easily enough.
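Given enough points, your polynomial tracks the original, unknown function reasonably well, so the polynomial's turning points (where its first derivative is zero) should roughly coincide with the peaks and troughs of your data, and its inflection points with the places where the data changes curvature.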
On the other hand, we all know there's not really a function behind financial data. So if I were you I'd scan along those points and find every point that has a smaller value to either side of it, and declare that a high; and vice versa for lows. Force-fitting this data into a fictitious function isn't going to make it any more useful.
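The scan described above, marking every point with a smaller value on either side as a high and the reverse as a low, needs no fitted function at all. A minimal base R sketch on a made-up series:

```r
# Hypothetical series of values.
y <- c(1, 3, 2, 5, 4, 6, 1)

# Sign of each step: +1 for a rise, -1 for a fall.
steps <- sign(diff(y))

# A rise followed by a fall marks a high; a fall followed by a rise, a low.
highs <- which(diff(steps) == -2) + 1   # 2 4 6
lows  <- which(diff(steps) ==  2) + 1   # 3 5

y[highs]   # 3 5 6
y[lows]    # 2 4
```

Flat stretches (equal neighbouring values) are not flagged by this strict comparison, and on noisy data nearly every small wiggle counts as a high or low, so some smoothing beforehand may still be useful.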
Update: Tom Smith advises that spline interpolation is to be preferred to polynomial interpolation for this kind of thing, and Wikipedia bears him out. Or rather, it's bullish on his answer.
What you are thinking of is analytical calculus; when you have discrete data (e.g. points), you have to do it numerically. Now, a line usually doesn't have inflection points, so I guess you're thinking of a curve. You can either interpolate some kind of curve through the points and then calculate its first derivative (also numerically, but on a larger number of points), or you can calculate the first derivative directly from the points you have; an inflection point is where the first derivative has a local extremum, i.e. where the second derivative changes sign. Which approach is better depends on how many points you actually have.
But really, this is just theory since we don't know the nature of data, or the language or anything.
For more on the subject search: numerical analysis on wiki, and go from there.
I think curve fitting might help you in this case. Here is a discussion which might be handy.
cheers
