I'm having trouble determining from this research paper exactly how I can reproduce the Standard Vector Quantization algorithm to determine the language of an unidentified speech input, based on a training set of data. Here's some basic info:
Abstract info
Language recognition (e.g. Japanese, English, German, etc) using acoustic features is an important yet difficult problem for current speech
technology. ... The speech data base used in this paper contains 20 languages: 16
sentences uttered twice by 4 males and 4 females. The duration of each
sentence is about 8 seconds. The first algorithm is based on the standard
Vector Quantization (VQ) technique. Every language is characterized
by its own VQ codebook, C_k.
Recognition Algorithms
The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language, k, is characterized by its own VQ codebook, C_k. In the recognition stage, input speech is quantized by C_k and the accumulated quantization distortion, d_k, is calculated. The language which has the minimal distortion is recognized. For calculating the VQ distortion, several LPC spectral distortion measures are applied... in this case, the WLR -- weighted likelihood ratio -- distance:
d_WLR(x, y) = sum_{i=1..p} (r_i(x) - r_i(y)) * (c_i(x) - c_i(y)), where r_i and c_i denote the autocorrelation and cepstrum coefficients of each LPC spectrum.
Standard VQ Algorithm:
A codebook, C_k = {v_k_j : j = 1, ..., N}, for each language k is generated using training sentences. The accumulated distance for the input vectors x_t (t = 1, ..., T) of a sentence is defined as:

d_k = sum_{t=1..T} min_j d(x_t, v_k_j)
The distance d can be any distance which corresponds to the acoustic features, and it must be the same as the one used for codebook generation. Each language is characterized by its VQ codebook, C_k.
My question is, how exactly do I do this? I have a set of 50 sentences in English. In MATLAB, I can easily calculate the WLR for any given signal. But how do I formulate a codebook, since I must use the WLR for "codebook generation" for English? I'm also curious as to how to compare a VQ codebook of size 16 (which was found to be the best size) to a given input signal. If anyone could help distill this paper down for me, I'd appreciate it greatly.
Thanks!
The second question (comparing the codebook to a given signal) is the easier one: for each codebook entry V_k_j you must calculate the distance d to the input signal. The j with the smallest distance d will correspond to the best-fitting codebook entry. As a distance function you can use the WLR.
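To make that concrete, here is a minimal Python sketch of the recognition step (the function names, the array shapes, and the Euclidean placeholder distance are my own assumptions; in the paper's setup the distance would be the WLR computed from LPC features):

```python
import numpy as np

def euclidean(x, v):
    # placeholder distance; swap in the WLR to match the paper
    return np.linalg.norm(x - v)

def quantize_distortion(frames, codebook, dist=euclidean):
    """Accumulated quantization distortion d_k of one sentence.

    frames   : (T, p) array, one feature vector per frame
    codebook : (N, p) array, N codebook entries (e.g. N = 16)
    """
    return sum(min(dist(x, v) for v in codebook) for x in frames)

def recognize(frames, codebooks, dist=euclidean):
    # pick the language whose codebook yields the minimal distortion
    return min(codebooks,
               key=lambda lang: quantize_distortion(frames, codebooks[lang], dist))
```

Here `codebooks` would be a dict mapping each language name to its (N, p) codebook array.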
Building the codebook (training) is a bit more complicated. You must divide your sentences into feature vectors and then use some clustering algorithm (like k-means) to cluster these vectors into N (16) clusters. Then find the mean of every cluster; this mean will be a codebook entry. That is the first thing that comes to mind.
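A rough Python sketch of that training procedure (a sketch under my own assumptions, not the paper's exact method: plain k-means with a Euclidean metric, whereas the paper requires the same WLR distance at training and recognition time):

```python
import numpy as np

def train_codebook(frames, N=16, iters=50, seed=0):
    """k-means codebook: N entries clustering the training frames."""
    frames = np.asarray(frames, dtype=float)
    rng = np.random.default_rng(seed)
    # start from N randomly chosen training frames
    codebook = frames[rng.choice(len(frames), size=N, replace=False)].copy()
    for _ in range(iters):
        # assign every frame to its nearest codebook entry
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each entry to the mean of the frames assigned to it
        for j in range(N):
            members = frames[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook
```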
Another algorithm (which I believe will work better) can be found here.
Also, two simple training algorithms are described on Wikipedia.
Related
So I learned about the concept of information entropy from Khan Academy, where it was phrased in the form of "average number of yes-or-no questions needed per symbol". They also gave an alternative form using logarithms.
So let's say we have a symbol generator that produces A,B, and C.
P(A)=1/2, P(B)=1/3, and P(C)=1/6
According to their method, I would get a decision tree like this ("first method"): ask "Is it A?" first, so A costs one question; otherwise ask "Is it B?", so B and C each cost two questions.
Then I would multiply each symbol's probability of occurring by the number of questions it needs, giving
(1/2)*1 + (1/3)*2 + (1/6)*2 = 1.5 bits
but their other method gives
-(1/2)log2(1/2)-(1/3)log2(1/3)-(1/6)log2(1/6)= 1.459... bits
The difference is small, but still significant. I've tried this with different combinations and probabilities and got similar results. Is there something I'm missing? Am I using either method wrong, or is one of them more conditional?
Your second calculation is correct.
The problem with your decision tree approach is that the decision tree is not optimal (and indeed, no binary decision tree could be for those probabilities). Your “is it B” decision node represents less than one bit of information, since once you get there you already know it’s probably B. So your decision tree represents a potential encoding of symbols which is expected to consume 1.5 bits on average, but it represents slightly less than 1.5 bits of information.
In order to have a binary tree which represents an optimal encoding, each node needs to have balanced probabilities. This is not possible if some symbol has a probability whose denominator is not a power of 2.
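A quick check of both numbers in plain Python (nothing assumed beyond the probabilities above):

```python
import math

probs = {'A': 1/2, 'B': 1/3, 'C': 1/6}

# cost of the question tree: 1 question for A, 2 for B or C
avg_questions = probs['A'] * 1 + probs['B'] * 2 + probs['C'] * 2
print(avg_questions)   # 1.5

# Shannon entropy: the lower bound on average questions per symbol
entropy = -sum(p * math.log2(p) for p in probs.values())
print(entropy)         # 1.4591...
```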
I'm wondering how one chooses a specific k in Shi-Malik Algo.
Do we choose several ks and rank them via their SSE measures?
Does k reflect the number of clusters we assume for the data?
Kind regards, Mikey
Yes, k is the number of natural groupings we believe there are in the data.
You can find K by exploring the eigenvalues.
One tool which is particularly suited to spectral clustering is the eigengap heuristic (also called the spectral gap): the number of clusters k is usually given by the value of k that maximizes the eigengap (the difference between consecutive eigenvalues). That is, choose the number k such that all eigenvalues λ1, ..., λk are very small but λk+1 is comparatively large.
The larger this eigengap is, the closer the eigenvectors are to those of the ideal case, and hence the better spectral clustering works. If you're interested in the justification for this procedure, it is based on perturbation theory and spectral graph theory.
You can read more here: A Tutorial on Spectral Clustering - Ulrike von Luxburg
Another way to explore the natural grouping: the number of connected components and the spectrum of the Laplacian matrix. The number of times 0 appears as an eigenvalue of the Laplacian is the number of connected components in the graph. Your affinity matrix can be considered as a graph; look at how many connected components the graph has. That will give you a sense of the natural structure of your data.
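A short Python sketch combining both checks (a helper of my own, assuming a symmetric affinity matrix and the unnormalized Laplacian L = D - W):

```python
import numpy as np

def analyze_spectrum(affinity, k_max=10, tol=1e-10):
    W = np.asarray(affinity, dtype=float)
    L = np.diag(W.sum(axis=1)) - W             # unnormalized Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))   # ascending order
    n_components = int(np.sum(eigvals < tol))  # multiplicity of eigenvalue 0
    gaps = np.diff(eigvals[:k_max + 1])        # lambda_{k+1} - lambda_k
    k = int(np.argmax(gaps)) + 1               # k that maximizes the eigengap
    return k, n_components
```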
In addition, as you mentioned, we can set a validation criterion (for example, SSE) and look at its value under different values of k. That's fine once you have labeled data (which is not always the case in clustering) and you know that this criterion/quality measure is really meaningful.
First of all, can someone explain what vector quantization is, its purpose, and what it does? Secondly, an explanation of how k-means is used to do this would be appreciated as well.
For the record, I don't know if this will make a difference in the explanation, but I'm trying to learn about vector quantization in the context of boundary descriptors. If I calculated a number of boundary descriptors for a particular segment in an image and I wanted to vector quantize them using k-means, what would this mean, what would it do, why would I want to do it, and how would I do it?
Vector quantization is the process of discretizing a random variable valued in some vector space. The result is the projection of that random variable onto a finite set of knots. It is used for signal transmission, quadrature, variance reduction and a lot of other applications.
Optimal quantization consists in choosing the knots in such a way to minimize the mean L^p discretization error.
K-means, also called Lloyd's algorithm, consists of starting from an arbitrary set of knots (or codebook) and iteratively replacing each one of them by the L^p-median (or simply by the mean, for quadratic quantization) of the probability distribution restricted to the Voronoi cell of that knot. An interactive animation is available here.
The historical reference on the Lloyd algorithm is the following
Stuart P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory, vol. 28, issue 2, pp. 129–137, 1982
The k-means algorithm always decreases the quantization error, but it does not always converge to the globally optimal quantizer. However, in the case of one-dimensional log-concave distributions, it does converge to the unique global minimum.
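A minimal sketch of the Lloyd iteration on a one-dimensional Gaussian (a toy setup of my own, with quadratic distortion): the printed error never increases, and since the Gaussian is log-concave the knots approach the unique optimal quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)     # 1-D Gaussian (log-concave)
knots = rng.uniform(-3, 3, size=8)    # arbitrary initial codebook

for step in range(20):
    # Voronoi assignment: nearest knot for every sample
    cells = np.abs(samples[:, None] - knots[None, :]).argmin(axis=1)
    print(step, np.mean((samples - knots[cells]) ** 2))  # quadratic error
    # replace each knot by the mean of its Voronoi cell
    for j in range(len(knots)):
        pts = samples[cells == j]
        if len(pts):
            knots[j] = pts.mean()
```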
The optimal quantization web site contains an extensive bibliography on the matter of vector quantization and functional quantization.
Starting off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to be using genetic algorithms in the R language. I DO NOT have the time to go through huge books on genetic algorithms. Heck! I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need to know of resources which will help me quickly understand what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage, and also a couple of IEEE papers on the subject.
Any working code in R(even in C) or pointers to which R modules(if any) to be used would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.
I assume from your question that you have some function F(metabolites) which yields a spectrum, but you do not have the inverse function F'(spectrum) to get back the metabolites. The search space of metabolites is large, so rather than brute-force it you wish to try an approximate method (such as a genetic algorithm) which will make a more efficient random search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and the trial spectrum. The smoother this function is the better the search will work. If it can only yield true/false it will be a purely random search and you'd be better off with brute force.
Given F and your score (aka fitness) function, all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectra, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain-specific, because you can speed the process greatly by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small, but it will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
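To make that concrete, here is a toy Python sketch of the boolean-string representation with truncation selection, one-point crossover, and per-bit mutation. The fitness function here is a made-up stand-in; in your problem it would compare F(genome) against the measured spectrum:

```python
import random

GENOME_LEN, POP, KEEP, GENS = 100, 50, 10, 200
TARGET = 25   # hypothetical target, standing in for the real spectrum match

def fitness(genome):
    # toy score: how close the number of "on" metabolites is to TARGET
    return -abs(sum(genome) - TARGET)

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)   # one-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.01):
    # flip each bit independently with a small probability
    return [bit ^ (random.random() < rate) for bit in genome]

pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:KEEP]                    # truncation selection
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(POP - KEEP)]

print(fitness(max(pop, key=fitness)))       # best score found
```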
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
I've found the article "The science of computing: genetic algorithms", by Peter J. Denning (American Scientist, vol 80, 1, pp 12-14). That article is simple and useful if you want to understand what genetic algorithms do, and is only 3 pages to read!!
I was glancing through the contents of Concrete Mathematics online. I had at least heard of most of the functions and tricks mentioned, but there is a whole section on Special Numbers. These numbers include Stirling numbers, Eulerian numbers, harmonic numbers, and so on. Now I have never encountered any of these weird numbers. How do they aid in computational problems? Where are they generally used?
Harmonic Numbers appear almost everywhere! Musical Harmonies, analysis of Quicksort...
Stirling Numbers (first and second kind) arise in a variety of combinatorics and partitioning problems.
Eulerian Numbers also occur in several places, most notably in permutations and in the coefficients of polylogarithm functions.
A lot of the numbers you mentioned are used in the analysis of algorithms. You may not have these numbers in your code, but you'll need them if you want to estimate how long it will take for your code to run. You might see them in your code too. Some of these numbers are related to combinatorics, counting how many ways something can happen.
Sometimes it's not enough to know how many possibilities there are because you need to enumerate over the possibilities. Volume 4 of Knuth's TAOCP, in progress, gives the algorithms you need.
Here's an example of using Fibonacci numbers as part of a numerical integration problem.
Harmonic numbers are a discrete analog of logarithms and so they come up in difference equations just like logs come up in differential equations. Here's an example of physical applications of harmonic means, related to harmonic numbers. See the book Gamma for many examples of harmonic numbers in action, especially the chapter "It's a harmonic world."
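The analogy with logarithms is quantitative: H_n = 1 + 1/2 + ... + 1/n differs from ln(n) by roughly the Euler-Mascheroni constant γ ≈ 0.5772, as a two-line check shows:

```python
import math

def H(n):
    return sum(1 / k for k in range(1, n + 1))  # n-th harmonic number

n = 10_000
print(H(n), math.log(n) + 0.5772156649)  # differ by about 1/(2n)
```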
These special numbers can help out in computational problems in many ways. For example:
You want to find out when your program to compute the GCD of 2 numbers is going to take the longest amount of time: try two consecutive Fibonacci numbers (see the sketch after this list).
You want to have a rough estimate of the factorial of a large number, but your factorial program is taking too long: Use Stirling's Approximation.
You're testing for prime numbers, but for some numbers you always get the wrong answer: it could be you're using Fermat's primality test, in which case the Carmichael numbers are your culprits.
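Here is the sketch promised above for the GCD example: counting Euclid's division steps on consecutive Fibonacci numbers, the classic worst case.

```python
def gcd_steps(a, b):
    # Euclid's algorithm, counting division steps
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps

fib = [1, 1]
while len(fib) < 31:
    fib.append(fib[-1] + fib[-2])

for n in (10, 20, 30):
    print(n, gcd_steps(fib[n], fib[n - 1]))  # steps grow linearly in n
```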
The most common general case I can think of is in looping. Most of the time you specify a loop using a (start;stop;step) type of syntax, in which case it may be possible to reduce the execution time by using properties of the numbers involved.
For example, summing up all the numbers from 1 to n when n is large in a loop is definitely slower than using the identity sum = n*(n + 1)/2.
There are a large number of examples like these. Many of them are in cryptography, where the security of information systems sometimes depends on tricks like these. They can also help you with performance issues, memory issues, because when you know the formula, you may find a faster/more efficient way to compute other things -- things that you actually care about.
For more information, check out wikipedia, or simply try out Project Euler. You'll start finding patterns pretty fast.
Most of these numbers count certain kinds of discrete structures (for instance, Stirling Numbers count Subsets and Cycles). Such structures, and hence these sequences, implicitly arise in the analysis of algorithms.
There is an extensive list at OEIS that lists almost all sequences that appear in Concrete Math. A short summary from that list:
Golomb's Sequence
Binomial Coefficients
Rencontres Numbers
Stirling Numbers
Eulerian Numbers
Hyperfactorials
Genocchi Numbers
You can browse the OEIS pages for the respective sequences to get detailed information about the "properties" of these sequences (though not exactly applications, if that's what you're most interested in).
Also, if you want to see real-life uses of these sequences in analysis of algorithms, flip through the index of Knuth's Art of Computer Programming, and you'll find many references to "applications" of these sequences. John D. Cook already mentioned applications of Fibonacci & Harmonic numbers; here are some more examples:
Stirling Cycle Numbers arise in the analysis of the standard algorithm that finds the maximum element of an array (TAOCP Sec. 1.2.10): how many times must the current maximum value be updated while finding the maximum? It turns out that the probability that the maximum will need to be updated k times when finding a maximum in an array of n elements is p[n][k] = StirlingCycle[n, k+1]/n!. From this, we can derive that, on average, approximately log(n) updates will be necessary. (A quick simulation of this appears after these examples.)
Genocchi Numbers arise in connection with counting the number of BDDs that are "thin" (TAOCP 7.1.4 Exercise 174).
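Here is the quick simulation mentioned above for the maximum-finding example; it estimates the update-count distribution empirically and compares it with StirlingCycle[n, k+1]/n!, with the Stirling numbers computed via the standard recurrence:

```python
import random
from math import factorial

def stirling_cycle(n, k):
    # unsigned Stirling numbers of the first kind:
    # c(n, k) = c(n-1, k-1) + (n-1) * c(n-1, k)
    c = [[0] * (n + 1) for _ in range(n + 1)]
    c[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            c[i][j] = c[i - 1][j - 1] + (i - 1) * c[i - 1][j]
    return c[n][k]

def max_updates(perm):
    # how many times the running maximum gets replaced
    best, updates = perm[0], 0
    for x in perm[1:]:
        if x > best:
            best, updates = x, updates + 1
    return updates

n, trials = 8, 200_000
counts = [0] * n
for _ in range(trials):
    counts[max_updates(random.sample(range(n), n))] += 1
for k in range(n):
    print(k, counts[k] / trials, stirling_cycle(n, k + 1) / factorial(n))
```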
Not necessarily a magic number from the reference you mentioned, but nonetheless --
0x5f3759df
-- the notorious magic number used to calculate inverse square root of a number by giving a good first estimate to Newton's Approximation of Roots, often attributed to the work of John Carmack - more info here.
Not programming related, huh? :)
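For the curious, a small Python transcription of the trick (struct reinterprets the float's bit pattern the way the original C pointer cast did; one Newton step then sharpens the estimate):

```python
import struct

def fast_inv_sqrt(x):
    # reinterpret the float's bits as a 32-bit unsigned integer
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)              # the magic first estimate
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    return y * (1.5 - 0.5 * x * y * y)     # one Newton-Raphson step

print(fast_inv_sqrt(4.0))  # approximately 0.5
```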
Is this directly programming related? Surely related, but I don't know how closely.
Special numbers, such as e, pi, etc., come up all over the place. I don't think that anyone would argue about these two. The golden ratio also appears with amazing frequency, in everything from art to other special numbers themselves (look at the ratio between successive Fibonacci numbers).
Various sequences and families of numbers also appear in many places in mathematics and therefore, in programming too. A beautiful place to look is the Encyclopedia of integer sequences.
I'll suggest this is an experience thing. For example, when I took linear algebra many, many years ago, I learned about the eigenvalues and eigenvectors of a matrix. I'll admit that I did not at all appreciate their significance until I saw them in use in a variety of places: in statistics, in terms of what they tell you about the uncertainty of an estimate from a covariance matrix and the size and shape of a confidence ellipse, in principal component analysis, and in the long-term state of a Markov process; in numerical methods, where they tell you about the convergence of a method, be it in optimization or an ODE solver; in mechanical engineering, where you see them as principal stresses and strains.