Can I use a custom distance measure for kmeans function?

Can I use a custom distance measure for kmeans function? - r

I am using the function, kmeans, to perfrom K-means clustering.
I have a special data which need a custom distance measure function and custom mean function.
Can I put (1) a custom distance measure function and (2) custom mean function to the kmeans function?
It seems it uses Euclidean measure only.

The standard kmeans does not allow this, for good reasons. It uses some clever algorithms (Hartigan and Wong; which is why it is much faster than the standard Lloyd textbook algorithm you find in about 100 other R packages). But these only work for the classic k-means scenario with squared deviations (which means assigning each cluster to the Euclidean nearest center, but it actually optimizes least-squares, not Euclidean distances).
I doubt you can simply plug in other distances and centroid functions into the Hartigan and Wong method (apart from it being written in Fortran, so you cannot just plug in a R function there anyway).
Beware that there are very few known combinations where other distances and means are known to always converge well. Bregman divergences should be fine, and cosine is equivalent to squared Euclidean on a sphere, so it will also work.

Related

Difference between distm function or the distVincentyEllipsoid in R

Could you fully explain the big difference in using the distm function or the distVincentyEllipsoid function to calculate the distance of geodesic coordinates in R?
I noticed that using distm for this calculation, it takes much longer. Could you please explain to me beyond the difference, why does this happen?
Thank you!

Following on from your previous question here: Distance calculation optimization in R
The speed relates to the level of computation required to produce the returned object, not necessarily the difference between the computation of distances (I am not sure what great circle computation the distm() function uses as it's default). Indeed the geosphere:: documentation here: https://cran.r-project.org/web/packages/geosphere/geosphere.pdf suggests that distVincentyEllipsoid() calculation is "very accurate" but "computationally more intensive" than other great circle methods while this would make you suspect a slower computation, it is because of the way I have structured the code in my answer to return a vector of distances between each row (not a matrix of distances between each and every point).
Conversely, your distm() calculation in your original code returns a matrix of multiple vectors between each and every point. For your problem, this is not necessary so long as the data is ordered, that is why I have done so. Additionally, the use of hierarchical clustering to cluster the points based on these distances into 3 (your defined number) clusters is also not necessary as we can use the percentile of distances between each point values to do the same. Again the speed benefit relates to computing the clusters on a single vector rather than a matrix.
Please note, I am a data analyst with a background in accounting/finance and not a GIS specialist by any means. That being said my use of the distVincentyEllipsoid() function comes from my general understanding that this returns a pretty accurate estimation of great circle distances as a vector (as a opposed to a matrix). Moreover, having used this in the past to optimise logistics operations for pricing purposes, I can attest to the fact these computations have been tested in the market and found to be sound.

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that Levenshtein distance is good to be used as a distance function for strings. Also, since I do not know in advance the number of clusters, hierarchical clustering is the way to go and not k-means.
Although I get the problem in its abstract form, I do not know what is the easie way to actually do it. For example, is MATLAB or R a better choice for the actual implementation of hierarchical clustering with the custom function (Levenshtein distance).
For both software, one may easily find a Levenshtein distance implementation. The clustering part seems harder. For example Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can you any of you gurus show me the way to how to implement the hierarchical clustering in either MATLAB or R with a custom function?

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n,k){ # vector of n random char(k) strings
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=3)
df <- data.frame(str,cutree(hc,k=3))
In this example, we create a set of 30 random char(5) strings artificially in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run heirarchal clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster id's to the original strings.

ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. Word segmentation through cross-lingual word-to-phoneme alignment. Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
We would of course appreciate additional contributions.

While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).
Most often the OMA is carried out in three steps. First, you define your sequences. From your description I can assume that each letter is a separate "state", the building block in a sequence. Second, you will employ one of the several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you will feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to gain popularity due to the additional information on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in the sequence analysis.
In R the most convenient package with a great number of functions is TraMineR, the website can be found here. Its user guide is very accessible, and developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, except for the decision on the number of clusters. The guide for TraMineR shows that is the syntax is very straighforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA, cluster membership is contained in the clusterward1 object, which which you can do whatever you want: plotting, recoding as variables etc. The diss=TRUE option indicates that the data object is the dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is to choose the right distance algorithm, suitable for your particular application. Once you have that, being able to justify the choice, the rest is quite easy. Good luck!

If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

In what situation would a taylor series for a polynomial be necessary?

I'm having a hard time understanding why it would be useful to use the Taylor series for a function in order to gain an approximation of a function, instead of just using the function itself when programming. If I can tell my computer to compute e^(.1) and it will give me an exact value, why would I take an approximation instead?

Taylor series are generally not used to approximate functions. Usually, some form of minimax polynomial is used.
Taylor series converge slowly (it takes many terms to get the accuracy desired) and are inefficient (they are more accurate near the point around which they are centered and less accurate away from it). The largest use of Taylor series is likely in mathematics classes and papers, where they are useful for examining the properties of functions and for learning about calculus.
To approximate functions, minimax polynomials are often used. A minimax polynomial has the minimum possible maximum error for a particular situation (interval over which a function is to be approximated, degree available for the polynomial). There is usually no analytical solution to finding a minimax polynomial. They are found numerically, using the Remez algorithm. Minimax polynomials can be tailored to suit particular needs, such as minimizing relative error or absolute error, approximating a function over a particular interval, and so on. Minimax polynomials need fewer terms than Taylor series to get acceptable results, and they “spread” the error over the interval instead of being better in the center and worse at the ends.
When you call the exp function to compute ex, you are likely using a minimax polynomial, because somebody has done the work for you and constructed a library routine that evaluates the polynomial. For the most part, the only arithmetic computer processors can do is addition, subtraction, multiplication, and division. So other functions have to be constructed from those operations. The first three give you polynomials, and polynomials are sufficient to approximate many functions, such as sine, cosine, logarithm, and exponentiation (with some additional operations of moving things into and out of the exponent field of floating-point values). Division adds rational functions, which is useful for functions like arctangent.

For two reasons. First and foremost - most processors do not have hardware implementations of complex operations like exponentials, logarithms, etc... In such cases the programming language may provide a library function for computing those - in other words, someone used a taylor series or other approximation for you.
Second, you may have a function that not even the language supports.
I recently wanted to use lookup tables with interpolation to get an angle and then compute the sin() and cos() of that angle. Trouble is that it's a DSP with no floating point and no trigonometric functions so those two functions are really slow (software implementation). Instead I put sin(x) in the table instead of x and then used the taylor series for y=sqrt(1-x*x) to compute the cos(x) from that. This taylor series is accurate over the range I needed with only 5 terms (denominators are all powers of two!) and can be implemented in fixed point using plain C and generates code that is faster than any other approach I could think of.

Quadratic programming solver that guarantees a boundary point?

I have a problem that I have expressed as the minimization of a convex quadratic program with linear constraints. The problem is that I want to disallow any point that is strictly interior (i.e. I only find the answer useful if it is on a vertex of the feasible region.
I'd like to do this without modifying the objective function. I have already considered several modifications that would make this a non-issue, but they all have the unfortunate result of making the program non-convex.
By my estimation my only option for an efficient solution would be a solver that uses a penalty method to approach a solution from the outside of the feasible region. Does anyone know a decent solver for this?
My current objective function is a sum of parabolic cylinders.

Can you just find the vertices of the feasible region and then take the one which minimizes the objective function? This should just involve a bit of linear algebra and then a limited number of evaluations of the objective function.

R : multidimensional scaling

I have several questions:
1. What's the difference between isoMDS and cmdscale?
2. May I use asymmetric matrix?
3. Is there any way to determine optimal number of dimensions (in result)?

One of the MDS methods is distance scaling and it is divided in metric and non-metric. Another one is the classical scaling (also called distance geometry by those in bioinformatics). Classical scaling can be carried out in R by using the command cmdscale. Kruskal's method of nonmetric distance scaling (using the stress function and isotonic regression) can be carried out by using the command isoMDS in library MASS.
The standard treatment of classical scaling yields an eigendecomposition problem and as such is the same as PCA if the goal is dimensionality reduction. The distance scaling methods, on the other hand, use iterative procedures to arrive at a solution.
If you refer to the distance structure, I guess you should pass a structure of the class dist which is an object with distance information. Or a (symmetric) matrix of distances, or an object which can be coerced to such a matrix using as.matrix(). (As I read in the help, only the lower triangle of the matrix is used, the rest is ignored).
(for classical scaling method): One way of determining the dimensionality of the resulting configuration is to look at the eigenvalues of the doubly centered symmetric matrix B (= HAH). The usual strategy is to plot the ordered eigenvalues (or some function of them) against dimension and then identify a dimension at which the eigenvalues become “stable” (i.e., do not change perceptively). At that dimension, we may observe an “elbow” that shows where stability occurs (for points of a n-dimensional space, stability in the plot should occur at dimension n+1). For easier graphical interpretation of a classical scaling solution, we usually choose n to be small, of the order 2 or 3.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex