Python - Clustering MFCC Vectors

I am currently doing a speaker verification project using hidden Markov models. I have no accurate results on voice signals yet, though I have tested the system on various data samples (not involving voice).
I extracted the MFCCs of the voice signals using scikits.talkbox. I assumed that no parameters needed to be changed and that the defaults are already suited to such a project. I suspect that my problem lies in the vector quantization of the MFCC vectors. I chose k-means as my algorithm, using SciPy's kmeans clustering function. I was wondering if there is a prescribed number of clusters for this kind of work; I originally set mine to 32. The sample rates of my voice files are 8000 and 22050 Hz. Additionally, I recorded them myself and manually removed the silence using Audacity.
Any suggestions?
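For reference, the vector-quantization step described above can be sketched with SciPy's kmeans as follows. The MFCC array here is synthetic stand-in data (real frames would come from talkbox), and the 32-entry codebook matches the number of clusters mentioned in the question:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Stand-in for real MFCC frames: rows are frames, columns are coefficients.
# Shapes and values here are illustrative, not from an actual recording.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(500, 13))

features = whiten(mfcc)                      # scipy's kmeans expects whitened features
codebook, distortion = kmeans(features, 32)  # build a 32-entry codebook
codes, dists = vq(features, codebook)        # map each frame to its nearest centroid

print(codebook.shape, codes.shape)
```

The `codes` array is the quantized sequence that would then be fed to the HMM.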

Related

Selection of initial medoids in PAM algorithm

I have read a couple of different articles on how PAM selects the initial medoids but I am getting conflicting views.
Some propose that the first k medoids are selected randomly, while others suggest that the algorithm initially selects the k most representative medoids in the dataset (without clarifying how that "representativeness" is determined). Below I have listed these resources:
Medoid calculation
Drawbacks of K-Medoid (PAM) Algorithm
https://paginas.fe.up.pt/~ec/files_1112/week_06_Clustering_part_II.pdf
https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples/
1) My question is whether someone could explain in more detail how the algorithm selects the initial k medoids, since from what I understand different initial selections can lead to different results.
2) Also, is that one of the reasons for using CLARA (apart from minimizing computing time and RAM usage) - that is, to find medoids through resampling that are the "optimal" options?
As an aside, I am using R with the function pam(). I am open to other functions in other libraries if there is a better alternative I am not aware of.
Read the original sources.
There is a lot of nonsense written later, unfortunately.
PAM consists of two algorithms:
BUILD to choose the initial medoids (not randomly)
SWAP to make the best improvements (not k-means style)
The k-means-style algorithm works much worse than PAM. Any description of PAM that doesn't mention these two parts is inaccurate (and there are quite a few of these...)
The R package seems to use the real PAM algorithm:
By default, when medoids are not specified, the algorithm first looks for a good initial set of medoids (this is called the build phase). Then it finds a local minimum for the objective function, that is, a solution such that there is no single switch of an observation with a medoid that will decrease the objective (this is called the swap phase)
CLARA will clearly find worse solutions than PAM, as it runs PAM on a sample, and if the optimum medoids are not in the sample, then they cannot be found.
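To make the two phases concrete, here is a minimal Python sketch of BUILD and SWAP on a precomputed distance matrix. This is an illustrative simplification of the original algorithm, not the implementation used by R's pam():

```python
import numpy as np

def pam(dist, k):
    """Sketch of PAM's two phases on a symmetric (n, n) distance matrix:
    BUILD picks initial medoids greedily (not randomly); SWAP then tries
    every medoid/non-medoid exchange until none lowers the total cost."""
    n = dist.shape[0]

    def cost(meds):
        # Sum of each point's distance to its nearest medoid.
        return dist[:, meds].min(axis=1).sum()

    # BUILD: the first medoid minimizes the total distance to all points;
    # each further medoid is the point whose addition reduces the cost most.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        d_near = dist[:, medoids].min(axis=1)
        gains = np.maximum(d_near[None, :] - dist, 0.0).sum(axis=1)
        gains[medoids] = -np.inf           # never re-pick a chosen medoid
        medoids.append(int(np.argmax(gains)))

    # SWAP: accept any single exchange that strictly decreases the cost.
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    return sorted(medoids)

# Tiny 1-D example: two well-separated groups of points.
pts = np.array([0.0, 1.0, 2.0, 1.5, 0.5, 100.0, 101.0, 99.0, 100.5])
dist = np.abs(pts[:, None] - pts[None, :])
medoids = pam(dist, 2)
```

Because BUILD is deterministic given the distances, two runs of PAM on the same data produce the same initial medoids, which is precisely what distinguishes it from random k-means-style seeding.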

customizable cross-validation in h2o (features that depend on the training set)

I have a model where some of the input features are calculated from the training dataset (e.g. the average or median of a value). I am trying to perform n-fold cross-validation on this model, but that means that the values for these features would be different depending on the samples selected for training/validation in each fold. Is there a way in h2o (I'm using it in R) to perhaps pass a function that calculates those features once the training set has been determined?
It seems like a pretty intuitive feature to have, but I have not been able to find any documentation on something like it out-of-the-box. Does it exist? If so, could someone point me to a resource?
There's no way to do this while using the built-in cross-validation in H2O. If H2O were written in pure R or Python, then it would be easy to extend it to allow a user to pass in a function to create custom features within the cross-validation loop; however, the core of H2O is written in Java, so automatically translating an arbitrary user-defined function from R or Python, first into a REST call and then into Java, is not trivial.
Instead, what you'd have to do is write a loop to do the cross-validation yourself and compute the features within the loop.
It sounds like you may be doing target encoding (or something similar), and if that's the case, you'll be interested in this PR to add target encoding in H2O. In the discussion, we talk about the same issue that you're having.
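The manual loop described above can be sketched as follows. This is a plain-numpy illustration (synthetic data, a per-category training-set mean as the fold-dependent feature, and a trivial "model"), not H2O API code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
cat = rng.integers(0, 3, size=n)          # a categorical input column
y = cat + rng.normal(scale=0.1, size=n)   # toy target correlated with the category

n_folds = 5
folds = np.arange(n) % n_folds            # simple deterministic fold assignment
scores = []
for f in range(n_folds):
    train, valid = folds != f, folds == f
    # Fold-dependent feature: the per-category mean of y, computed on the
    # training rows only, then applied to both splits (target encoding).
    means = {c: y[train][cat[train] == c].mean() for c in np.unique(cat[train])}
    overall = y[train].mean()             # fallback for unseen categories
    x_valid = np.array([means.get(c, overall) for c in cat[valid]])
    # Stand-in "model": predict the encoded value directly; record the fold MSE.
    scores.append(np.mean((x_valid - y[valid]) ** 2))

print(np.mean(scores))
```

The key point is that `means` is recomputed inside the loop from the training rows of that fold only, so the validation rows never leak into the feature.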

Trying to find the cut-off point of a built-in function in R, since currently it is not running

In R, I am trying to use the Markov chain package to convert clickstream data to a Markov chain. I have 4 GB of RAM, but the program cannot run the command (after a lot of time). This is because at some point the ongoing conversion cannot allocate more than 3969 MB of data (that is what the screen says). I am trying to find out at what point the program will fail: if I have, say, `n` nodes, up to how many nodes (obviously fewer than n) or rows (the rows might contain the same or different nodes) will the program run? I am trying to do attribution modelling in R. The conversion paths are converted from clickstream form to a Markov chain, and I am trying to find the transition matrix from that.
Attached is an image of the code and a sample dataset; here h, c, d, p are different nodes.
The function converts this data into a Markov chain containing a lot of important things, out of which I mainly need the transition matrix and the steady state. As I increase the data size (the number of different channel paths or users is not important; it is the number of different nodes that matters), the function fails because it cannot allocate more than the 4 GB of RAM. I tried trial and error to find the point beyond which the function stops working, but it did not help. Is there a way to know up to which node (or row) the function will work, so that I can generate the transition matrix up to that point? I would also like to know how memory usage grows with each additional node, as I suspect the relationship between the two is not linear.
Please let me know if the question is not specific enough and if it might need any more details.
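Without the package internals it is hard to say exactly where the allocation fails, but a back-of-envelope estimate and a memory-friendly alternative can be sketched. A dense n-by-n matrix of 8-byte doubles needs 8 * n^2 bytes, so a single dense transition matrix in ~4 GB caps out around n ≈ 22,000 nodes; counting transitions sparsely instead scales with the number of observed transitions. The clickstream paths below are made-up stand-ins:

```python
from collections import Counter, defaultdict

# Hypothetical clickstream: each row is one user's ordered path of nodes.
paths = [
    ["h", "c", "d", "p"],
    ["h", "d", "d", "p"],
    ["c", "h", "p"],
]

# Sparse transition counts: memory grows with the number of *observed*
# transitions, not with n^2 as a dense transition matrix does.
counts = defaultdict(Counter)
for path in paths:
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1

# Row-normalize the counts to get transition probabilities.
P = {a: {b: c / sum(row.values()) for b, c in row.items()}
     for a, row in counts.items()}

# Rough dense-matrix ceiling for ~4 GB of 8-byte doubles: 8 * n**2 bytes.
max_nodes = int((4e9 / 8) ** 0.5)
print(max_nodes)
```

So rather than asking how many rows the dense conversion can survive, it may be cheaper to build the transition matrix sparsely and only densify (or solve for the steady state) at the end.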

Library to train GMMs from MFCC

I am trying to build a basic Emotion detector from speech using MFCCs, their deltas and delta-deltas. A number of papers talk about getting a good accuracy by training GMMs on these features.
I cannot seem to find a ready-made package to do the same. I did play around with scikit-learn in Python, Voicebox and similar toolkits in MATLAB, and Rmixmod, stochmod, mclust, mixtools and some other packages in R. What would be the best library for fitting GMMs to training data?
The challenging problem is the training data, which contains the emotion information embedded in the feature set. The same features that encapsulate emotion should be used in the test signal. The testing with a GMM will only be as good as your universal background model. In my experience, with a GMM you can typically only separate male from female and a few unique speakers. Simply feeding the MFCCs into a GMM would not be sufficient, since a GMM does not capture time-varying information. Emotional speech contains time-varying parameters, such as pitch and changes in pitch over time, in addition to the frequency variations captured by the MFCC parameters. I am not saying it is not possible with the current state of technology, but it is challenging in a good way.
If you want to use Python, here is the code in the famous speech recognition toolkit Sphinx.
http://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/sphinxtrain/python/cmusphinx/gmm.py
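If scikit-learn (mentioned in the question) is an option, a per-class GMM classifier over MFCC-style features is only a few lines. The data below is synthetic stand-in for real MFCC + delta + delta-delta frames, and the component count is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-ins for the frames of two emotion classes; 39 dims is the usual
# 13 MFCC + 13 delta + 13 delta-delta layout. Real frames would come from
# a feature extractor, not a random generator.
rng = np.random.default_rng(0)
frames_a = rng.normal(loc=0.0, size=(300, 39))
frames_b = rng.normal(loc=3.0, size=(300, 39))

# One GMM per class; diagonal covariances are common for MFCC features.
gmm_a = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(frames_a)
gmm_b = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(frames_b)

# Classify a test utterance by the higher average log-likelihood per frame.
test = rng.normal(loc=0.0, size=(50, 39))
pred = "a" if gmm_a.score(test) > gmm_b.score(test) else "b"
```

As the answer above notes, this frame-level likelihood comparison ignores temporal dynamics; it is a baseline, not a complete emotion recognizer.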

R: Find function minimum with genetic programming

I am currently using RGP as a genetic programming library. If anyone has an idea for another library (better documentation, more active development, etc.) I would like to hear your suggestions.
The question is rather simple: given a function with n parameters in R, how can I find the global minimum using genetic programming? I tried modifying one of the example programs, but it seems this example uses linear regression, which I don't think is appropriate in my situation.
Does anyone have any example code I could use?
I can recommend using HeuristicLab. There are several algorithms implemented: Genetic Algorithm, Evolution Strategy, Simulated Annealing, Particle Swarm Optimization, and more, which might be interesting if you're looking into the minimization of real-valued functions. The software is implemented in C# and runs on Windows. It offers a GUI where you can optimize several provided test functions (Rosenbrock, Schaffer, Ackley, etc.). There's also a very good implementation of genetic programming (GP) available, but from my impression you don't need GP. In genetic programming you evolve a function given the output data of an unknown function. I think in your case the function is known, and you need to find those parameters that minimize the function's output.
The latest major version of the software was released to the public in 2010 and has since been further developed in several minor releases. We now have a release about two times a year. There's a google group where you can ask for help which is getting more and more active and there are some video tutorials that show the functionality. Check out the tour video on youtube which gives an overview of the features in less than 3 minutes. The research group around Prof. Affenzeller - a researcher in the field of Metaheuristics - has developed this software and is situated in Austria. I'm part of this group also.
Check out the how-tos on implementing your function in the GUI or, if you know C#, on implementing your problem as a plugin.
You can use a genetic algorithm instead of GP to find the minimum of a function with n variables.
Basically what you do is:
assign initial values
generate an initial population of n chromosomes
while (true)
    evaluate the fitness f(x, y) for each chromosome
    if we have reached a satisfactory solution of f(x, y), exit the loop
    set up the selection scheme (tournament selection)
    select chromosomes, keeping the best ones (elitism)
    apply crossover
    create mutations
    alter duplicated chromosomes
    replace the original population with the new chromosomes
end while
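The loop above can be sketched as a small real-valued GA. This is an illustrative Python version (the bounds, population size, and operator settings are arbitrary choices, not defaults from any particular library):

```python
import random

def ga_minimize(f, n, bounds=(-5.0, 5.0), pop_size=50, generations=200,
                mutation_rate=0.1, elite=2):
    """Minimal genetic algorithm for minimizing f over n real variables:
    tournament selection, one-point crossover, Gaussian mutation, elitism."""
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)                        # evaluate fitness, best first
        nxt = [c[:] for c in pop[:elite]]      # elitism: keep the best as-is
        while len(nxt) < pop_size:
            # Tournament selection: best of 3 random chromosomes, twice.
            a = min(random.sample(pop, 3), key=f)
            b = min(random.sample(pop, 3), key=f)
            # One-point crossover.
            cut = random.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]
            # Gaussian mutation on each gene with some probability.
            for i in range(n):
                if random.random() < mutation_rate:
                    child[i] += random.gauss(0, 0.3)
            nxt.append(child)
        pop = nxt                              # replace the population
    return min(pop, key=f)

random.seed(0)
best = ga_minimize(lambda x: sum(v * v for v in x), n=3)  # sphere function
```

Here the stopping criterion is simply a fixed number of generations; a "satisfactory solution" check could replace it.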
