First of all, can someone explain what vector quantization is, its purpose, and what it does? Secondly, an explanation of how k-means is used to do this would be appreciated as well.
For the record, I don't know if this will make a difference in the explanation, but I'm trying to learn about vector quantization in the context of boundary descriptors. If I calculated a number of boundary descriptors for a particular segment in an image and I wanted to vector quantize them using k-means, what would this mean, what would this do, why would I want to do it, and how would I do it?
Vector quantization is the process of discretizing a random variable valued in some vector space. The result is the projection of that random variable onto a finite set of knots. It is used for signal transmission, quadrature, variance reduction and a lot of other applications.
Optimal quantization consists in choosing the knots in such a way as to minimize the mean L^p discretization error.
The k-means method, also called the Lloyd algorithm, consists in starting from an arbitrary set of knots (or codebook) and iteratively replacing each one of them by the L^p-median (or simply by the mean, for quadratic quantization) of the probability distribution restricted to the Voronoi cell of that knot. An interactive animation is available here.
The historical reference on the Lloyd algorithm is the following:
Stuart P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory, vol. 28, issue 2, pp. 129–137, 1982
The k-means algorithm always decreases the quantization error, but it does not always converge to the globally optimal quantizer. However, in the case of one-dimensional log-concave distributions, it does converge to the unique global minimum.
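To connect this with the boundary-descriptor question: below is a minimal R sketch of quadratic vector quantization via k-means, where the simulated 2-D data and the codebook size of 8 are illustrative stand-ins for your descriptors.
set.seed(42)
descriptors <- matrix(rnorm(200), ncol = 2)           # stand-ins for boundary descriptor vectors
km <- kmeans(descriptors, centers = 8, nstart = 20,
             algorithm = "Lloyd")                     # Lloyd iterations with 8 knots
codebook <- km$centers                                # the codebook (the 8 knots)
quantized <- codebook[km$cluster, ]                   # each descriptor replaced by its nearest knot
mean(rowSums((descriptors - quantized)^2))            # mean quadratic quantization error
Quantizing a descriptor thus simply means replacing it by the nearest codebook vector; the last line is the mean quadratic error that the Lloyd iterations decrease.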
The optimal quantization web site contains an extensive bibliography on the matter of vector quantization and functional quantization.
I recently dabbled a bit in point pattern analysis and wonder whether there is any standard practice for creating mark correlation structures that vary with the inter-point distance of the point locations. Clearly, I understand how to simulate independent marks, as is frequently demonstrated, e.g.:
library(spatstat)
data(finpines)
set.seed(0907)
marks(finpines) <- rnorm(npoints(finpines), 30, 5)
plot(finpines)
More generally speaking, assume we have a fair number of points, say n = 100, with coordinates x and y in an arbitrary observation window (e.g. a rectangle). Every point carries a characteristic, for example the size of the point as a continuous variable. Also, we can examine every pairwise distance between the points. Is there a way to introduce a correlation structure between the marks (of pairs of points) which depends on the inter-point distance between the point locations?
Furthermore, I am aware of mark analysis techniques such as
fin <- markcorr(finpines, correction = "best")
plot(fin)
When it comes to interpretation, my lack of knowledge forces me to trust my colleagues (non-scientists). Besides, I looked at several references given in the documentation of the spatstat functions; in particular, I had a look at "Statistical Analysis and Modelling of Spatial Point Patterns", p. 347, where inhibition and mutual stimulation are explained as deviations from 1 (independence of marks) of the normalised mark correlation function.
I think the best bet is to use a random field model conditional on your locations. Unfortunately the package RandomFields is not on CRAN at the moment, but hopefully it will return soon. I think it may be possible to install an old version of RandomFields from the archives if you want to get going immediately.
Thus, the procedure is (a rough code sketch follows the list):
Use spatstat to generate the random locations you like.
Extract the coordinates (either coords() or as.data.frame.ppp()).
Define a model in RandomFields (e.g. RMexp()).
Simulate the model on the given coordinates (RFsimulate()).
Convert back to a marked point pattern in spatstat (either ppp() or as.ppp()).
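Putting those steps together, here is a rough sketch, assuming the archived RandomFields package installs; the intensity, the exponential covariance and its parameters, and the mark mean of 30 are all illustrative choices, not prescriptions.
library(spatstat)
library(RandomFields)
set.seed(0907)
X <- rpoispp(100)                          # step 1: random locations in the unit square
xy <- coords(X)                            # step 2: extract the coordinates
model <- RMexp(var = 25, scale = 0.1)      # step 3: exponential covariance, decaying with distance
vals <- RFsimulate(model, x = xy$x, y = xy$y, spConform = FALSE)   # step 4: field values at the points
marks(X) <- 30 + as.numeric(vals)          # step 5: attach the (shifted) field values as marks
plot(X)
plot(markcorr(X, correction = "best"))     # nearby points now carry correlated marks
Because the covariance decays with distance, marks of points that are close together end up more similar than marks of distant points, which is exactly the distance-dependent correlation structure asked about.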
I'm wondering how one chooses a specific k in Shi-Malik Algo.
Do we choose several ks and rank them via their SSE measures?
Does k reflect the number of clusters we assume for the data?
kind regards Mikey
Yes, k is the number of natural groupings we believe there are in the data.
You can find K by exploring the eigenvalues.
One tool which is specifically designed for spectral clustering is the eigengap heuristic (also called the spectral gap): the number of clusters k is usually given by the value of k that maximizes the eigengap (the difference between consecutive eigenvalues). That is, choose the number k such that the eigenvalues λ1, . . . , λk are all very small but λk+1 is relatively large.
The larger this eigengap is, the closer the eigenvectors are to those of the ideal case, and hence the better spectral clustering works. If you are interested in the justification for this procedure, it is based on perturbation theory and spectral graph theory.
You can read more here: A Tutorial on Spectral Clustering - Ulrike von Luxburg
Another way to explore the natural grouping is through the number of connected components and the spectrum of the Laplacian matrix: the number of times 0 appears as an eigenvalue of the Laplacian is the number of connected components in the graph. Your affinity matrix can be considered as a graph, so look at how many connected components that graph has. That will give you a sense of the natural structure of your data.
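As a minimal R sketch of both heuristics (the simulated three-group data, the Gaussian affinity and its bandwidth sigma = 1 are illustrative assumptions, not part of the original question):
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2),
           matrix(rnorm(100, mean = 20), ncol = 2))     # three well-separated groups
W <- exp(-as.matrix(dist(X))^2 / 2)                     # Gaussian affinity, sigma = 1
diag(W) <- 0
d <- rowSums(W)
L <- diag(1 / sqrt(d)) %*% (diag(d) - W) %*% diag(1 / sqrt(d))   # normalized Laplacian
lambda <- sort(eigen(L, symmetric = TRUE)$values)       # eigenvalues in ascending order
round(lambda[1:6], 4)                                   # the first few eigenvalues
plot(lambda[1:10], type = "b", ylab = "eigenvalue")     # scree of the smallest eigenvalues
which.max(diff(lambda[1:10]))                           # largest eigengap -> suggested k
With well-separated groups like these, the first three eigenvalues are essentially zero and the largest gap sits right after the third one, both pointing to k = 3.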
In addition, as you mentioned, you can set a validation criterion (for example, SSE) and look at its value under different values of k. That is fine once you have labeled data (which is not always the case in clustering) and you know that this criterion/quality measure is really meaningful.
I have run the clv package, which contains the S_Dbw and SD validity indexes for clustering, in R Commander (http://cran.r-project.org/web/packages/clv/index.html).
I evaluated my clustering results from the DBSCAN, k-means and Kohonen algorithms with the S_Dbw index, but for all three algorithms S_Dbw is "Inf".
Does that mean "infinite"? Why did I get "Inf"? Is there any problem with my clustering results?
In general, when is S_Dbw index result "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering finds the partition with the optimal minimum distance between partitions. And DBSCAN finds the partitioning of the dataset in which all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously: do not assume that because one algorithm scores higher than another on a particular measure, it works better. All that you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure to compare different results of the same algorithm is different. Then, obviously, there shouldn't be a built-in advantage of one algorithm over itself. There might still be a similar effect with respect to parameters; for example, the in-cluster distances in k-means obviously go down when you increase k.
In fact, many of the measures are not even well-defined on DBSCAN results, because DBSCAN has the concept of noise points, which the indexes do not have, AFAIK.
Do not assume that the measure will either give you an indication of what is "true" or "correct". And even less, what is useful or new. Because you should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data. Which probably is not some measure number.
Back to the indices. They usually are designed entirely around k-means. From a short look at S_Dbw, I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value becomes infinity - i.e., undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this without diverting from the original index and turning it into yet another index.
Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfectly, or every clustering can trivially be improved by assigning each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster", then I do not know of any appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.
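For illustration, here are the two common ways a double becomes Inf in R; a singleton "cluster" with zero spread, as described in the other answer, would plausibly trigger the second one inside the index.
.Machine$double.xmax * 2   # overflow: result too large to represent -> Inf
1 / 0                      # division by a zero denominator -> Inf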
I have several questions:
1. What's the difference between isoMDS and cmdscale?
2. May I use asymmetric matrix?
3. Is there any way to determine optimal number of dimensions (in result)?
One of the MDS methods is distance scaling, which is divided into metric and non-metric scaling. Another is classical scaling (also called distance geometry by those in bioinformatics). Classical scaling can be carried out in R with the command cmdscale. Kruskal's method of non-metric distance scaling (using the stress function and isotonic regression) can be carried out with the command isoMDS in the MASS library.
The standard treatment of classical scaling yields an eigendecomposition problem and as such is the same as PCA if the goal is dimensionality reduction. The distance scaling methods, on the other hand, use iterative procedures to arrive at a solution.
If you refer to the distance structure: you should pass an object of class dist, which holds the distance information, or a (symmetric) matrix of distances, or an object that can be coerced to such a matrix using as.matrix(). (As I read in the help, only the lower triangle of the matrix is used; the rest is ignored.)
(For the classical scaling method): One way of determining the dimensionality of the resulting configuration is to look at the eigenvalues of the doubly centered symmetric matrix B (= HAH). The usual strategy is to plot the ordered eigenvalues (or some function of them) against dimension and then identify a dimension at which the eigenvalues become "stable" (i.e., do not change perceptibly). At that dimension, we may observe an "elbow" that shows where stability occurs (for points of an n-dimensional space, stability in the plot should occur at dimension n+1). For easier graphical interpretation of a classical scaling solution, we usually choose n to be small, of the order of 2 or 3.
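A small R sketch contrasting the two functions on the built-in eurodist distances (an arbitrary example data set); the scree plot of eigenvalues is the usual device for choosing the dimension of the classical solution.
library(MASS)                                   # for isoMDS
d <- eurodist                                   # symmetric object of class "dist" (road distances)
cl <- cmdscale(d, k = 2, eig = TRUE)            # classical scaling: a single eigendecomposition
plot(cl$eig, type = "b", ylab = "eigenvalue")   # look for the "elbow" to pick the dimension
nm <- isoMDS(d, k = 2)                          # Kruskal's non-metric scaling: iterative
nm$stress                                       # stress of the resulting 2-D configuration
cmdscale solves the eigenproblem once, whereas isoMDS starts from the classical solution by default and iterates to reduce the stress.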
I'm having trouble determining from this research paper exactly how I can reproduce the Standard Vector Quantization algorithm to determine the language of an unidentified speech input, based on a training set of data. Here's some basic info:
Abstract info
Language recognition (e.g. Japanese, English, German, etc.) using acoustic features is an important yet difficult problem for current speech technology. ... The speech data base used in this paper contains 20 languages: 16 sentences uttered twice by 4 males and 4 females. The duration of each sentence is about 8 seconds. The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language is characterized by its own VQ codebook.
Recognition Algorithms
The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language, k, is characterized by its own VQ codebook. In the recognition stage, input speech is quantized with each codebook and the accumulated quantization distortion, d_k, is calculated. The language which has the minimal distortion is recognized. For calculating the VQ distortion, several LPC spectral distortion measures are applied; in this case, the WLR (weighted likelihood ratio) distance.
Standard VQ Algorithm:
A codebook for each language is generated using training sentences. The accumulated distance for the input vectors in a sentence is defined by a formula given as an image in the original paper (essentially, the distances from each input vector to its nearest codebook entry are summed). The distance d can be any distance which corresponds to the acoustic features, and it must be the same as the one used for codebook generation. Each language is characterized by its VQ codebook.
My question is, how exactly do I do this? I have a set of 50 sentences in English. In MATLAB, I can easily calculate the WLR for any given signal. But how do I build a codebook, since I must use the WLR for "codebook generation" for English? I'm also curious how to compare a VQ codebook of size 16 (which was found to be the best size) to a given input signal. If anyone could help distill this paper down for me, I'd appreciate it greatly.
Thanks!
The second question (comparing a codebook to a given signal) is the easier one: for each codebook entry V_k_j you calculate the distance d to the input vector. The j with the smallest distance d corresponds to the best-fitting codebook entry. As the distance function you can use the WLR.
Building the codebook (training) is a bit more complicated. You must divide your sentences into feature vectors and then use some clustering algorithm (like k-means) to cluster these vectors into N (here 16) clusters. Then find the mean of every cluster; these means become the codebook entries. That is the first thing that comes to mind.
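As a rough sketch of both steps in R (the question mentions MATLAB, but the pattern is the same): squared Euclidean distance stands in for the WLR here, and the feature matrices are placeholders for whatever LPC-derived vectors you extract from your 50 English sentences.
set.seed(1)
features_en <- matrix(rnorm(2000), ncol = 10)     # placeholder feature vectors from the English training sentences
# Training: the codebook of size 16 is simply the set of 16 cluster centres
codebook_en <- kmeans(features_en, centers = 16, nstart = 10)$centers
# Recognition: accumulate, over the vectors of an utterance, the distance to the nearest codebook entry
accumulated_distortion <- function(utterance, codebook) {
  sum(apply(utterance, 1, function(v) min(colSums((t(codebook) - v)^2))))
}
test_utterance <- matrix(rnorm(500), ncol = 10)   # placeholder vectors from an unknown utterance
accumulated_distortion(test_utterance, codebook_en)
With one codebook per language, the unknown utterance is scored against each codebook and assigned to the language with the smallest accumulated distortion.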
Another algorithm (which I believe will work better) can be found here.
Also, two simple training algorithms are described on Wikipedia.