Library to train GMMs from MFCC - r

I am trying to build a basic Emotion detector from speech using MFCCs, their deltas and delta-deltas. A number of papers talk about getting a good accuracy by training GMMs on these features.
I cannot seem to find a ready made package to do the same. I did play around with scilearn in Python, Voicebox and similar toolkits in Matlab and Rmixmod, stochmod, mclust, mixtools and some other packages in R. What would be the best library to calculate GMMs from trained data?

Challenging problem is training data, which contains the emotion information, embedded in feature set. The same features that encapsulate emotions should be used in the test signal. The testing with GMM will only be good as your universal background model. In my experience typically with GMM you can only separate male female and a few unique speakers. Simply feeding the MFCC’s into GMM would not be sufficient, since GMM does not hold time varying information. Since emotional speech would contain time varying parameters such as pitch and changes in pitch over periods in addition to the frequency variations MFCC parameters. I am not saying it not possible with current state of technology but challenging in a good way.

If you want to use Python, here is the code in the famous speech recognition toolkit Sphinx.
http://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/sphinxtrain/python/cmusphinx/gmm.py

Related

Difference between "mlp" and "mlpML"

I'm using the Caret package from R to create prediction models for maximum energy demand. What i need to use is neural network multilayer perceptron, but in the Caret package i found out there's 2 of the mlp method, which is "mlp" and "mlpML". what is the difference between the two?
I have read description from a book (Advanced R Statistical Programming and Data Models: Analysis, Machine Learning, and Visualization) but it still doesnt answer my question.
Caret has 238 different models available! However many of them are just different methods to call the same basic algorithm.
Besides mlp there are 9 other methods of calling a multi-layer-perceptron one of which is mlpML. The real difference is only in the parameters of the function call and which model you need depends on your use case and what you want to adapt about the basic model.
Chances are, if you don't know what mlpML or mlpWeightDecay,etc. does you are fine to just use the basic mlp.
Looking at the official documentation we can see that:
mlp(size) while mlpML(layer1,layer2,layer3) so in the first method you can only tune the size of the multi-layer-perceptron while in the second call you can tune each layer individually.
Looking at the source code here:
https://github.com/topepo/caret/blob/master/models/files/mlp.R
and here:
https://github.com/topepo/caret/blob/master/models/files/mlpML.R
It seems that the difference is that mlpML allows several hidden layers:
modelInfo <- list(label = "Multi-Layer Perceptron, with multiple layers",
while mlp has one single layer with hidden units.
The official documentation also hints at this difference. In my opinion, it is not particularly useful to have many different models that differ only very slightly, and the documentation does not explain those slight differences well.

Predicting next word with text2vec in R

I am building a language model in R to predict a next word in the sentence based on the previous words. Currently my model is a simple ngram model with Kneser-Ney smoothing. It predicts next word by finding ngram with maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower order ngrams, which can be advantageous in the cases where higher order ngrams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it 'fails in the cases where the n-gram cannot not capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n<5).
I am looking into more advanced methods and I found text2vec package, which allows to map words into vector space where words with similar meaning are represented with similar (close) vectors. I have a feeling that this representation can be helpful for the next word prediction, but i cannot figure out how exactly to define the training task. My quesiton is if text2vec is the right tool to use for next word prediction and if yes, what is the suitable prediction algorithm that can be used for this task?
You can try char-rnn or word-rnn (google a little bit).
For character-level model R/mxnet implementation take a look to mxnet examples. Probably it is possible to extend this code to word-level model using text2vec GloVe embeddings.
If you will have any success, let us know (I mean text2vec or/and mxnet developers). I will be very interesting case for R community. I wanted to perform such model/experiment, but still haven't time for that.
There is one implemented solution as an complete example using word embeddings. In fact, the paper from Makarenkov et al. (2017) named Language Models with Pre-Trained (GloVe) Word Embeddings presents a step-by-step implementation of training a Language Model, using Recurrent Neural Network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide the instructions to run de code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main
function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name,
tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The code in Python is here https://github.com/vicmak/ProofSeer.
I also found that #Dmitriy Selivanov recently published a nice and friendly tutorial using its text2vec package which can be useful to address the problem from the R perspective. (It would be great if he could comment further).
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/

Train SVM on a very large dataset stored on hard drive

There exist a very large own-collected dataset of size [2000000 12672] where the rows shows the number of instances and the columns, the number of features. This dataset occupies ~60 Gigabyte on the local hard disk. I want to train a linear SVM on this dataset. The problem is that I have only 8 Gigabyte of RAM! so I cannot load all data once. Is there any solution to train the SVM on this large dataset? Generating the dataset is on my own desire, and currently are is HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms, on-line and off-line.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: supports feeding in the entire dataset at once, achieving higher accuracy than an On-line model
Many typical algorithms have both on-line, and off-line implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an off-line only algorithm. The reason for this is a lot of the fine details around "shattering" the dataset. I won't go too far into the math here, but if you read into it it should become apparent.
It's also worth noting that the complexity of an SVM is somewhere between n^2 and n^3, meaning that even if you could load everything into memory it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
When moving to the full dataset you would have to run this on a much larger machine than your own, but AWS should have something large enough for you, though at your size of data I highly advise using something other than an SVM. At large data sizes, neural net approaches really shine, and can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know with a good offering of out-of-core algorithms is dato. It's a commercial product, but might be your best solution here.
A stochastic gradient descent approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that compared to a traditional SVM, the SGD approach significantly decreases the training time (this is due to 1, the pairwise learning method and 2, only a subset of the observations end up being used to train the model).
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.
You don't say why you want Linear SVM, but if you can consider another model that often gives superior results then check out the hpelm python package. It can read an HDF5 file directly. You can find it here https://pypi.python.org/pypi/hpelm It trains on segmented data, that can even be pre-loaded (called async) to speed up reading from slow hard disks.

Comparing R to Matlab for Data Mining

Instead of starting to code in Matlab, I recently started learning R, mainly because it is open-source. I am currently working in data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still exploring different packages implemented in R.
I have quick question: how do you compare R to Matlab for data mining application, its popularity, pros and cons, industry and academic acceptance etc.? Which one would you choose and why?
I went through various comparisons for Matlab vs R against various metrics but I am specifically interested to get answer for its applicability in Data Mining and ML.
Since both language are pretty new for me I was just wondering if R would be a good choice or not.
I appreciate any kind of suggestions.
For the past three years or so, i have used R daily, and the largest portion of that daily use is spent on Machine Learning/Data Mining problems.
I was an exclusive Matlab user while in University; at the time i thought it was
an excellent set of tools/platform. I am sure it is today as well.
The Neural Network Toolbox, the Optimization Toolbox, Statistics Toolbox,
and Curve Fitting Toolbox are each highly desirable (if not essential)
for someone using MATLAB for ML/Data Mining work, yet they are all separate from
the base MATLAB environment--in other words, they have to be purchased separately.
My Top 5 list for Learning ML/Data Mining in R:
Mining Association Rules in R
This refers to a couple things: First, a group of R Package that all begin arules (available from CRAN); you can find the complete list (arules, aruluesViz, etc.) on the Project Homepage. Second, all of these packages are based on a data-mining technique known as Market-Basked Analysis and alternatively as Association Rules. In many respects, this family of algorithms is the essence of data-mining--exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R Package in the set mentioned above is called arules; On the CRAN Package page for arules, you will find links to a couple of excellent secondary sources (vignettes in R's lexicon) on the arules package and on Association Rules technique in general.
The standard reference, The Elements of Statistical
Learning by Hastie et al.
The most current edition of this book is available in digital form for free. Likewise, at the book's website (linked to just above) are all data sets used in ESL, available for free download. (As an aside, i have the free digital version; i also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major
ML rubrics--e.g., neural metworks, SVM, KNN; unsupervised
techniques (LDA, PCA, MDS, SOM, clustering), numerous flavors of regression, CART,
Bayesian techniques, as well as model aggregation techniques (Boosting, Bagging)
and model tuning (regularization). Finally, get the R Package that accompanies the book from CRAN (which will save the trouble of having to download the enter the datasets).
CRAN Task View: Machine Learning
The +3,500 Packages available
for R are divided up by domain into about 30 package families or 'Task Views'. Machine Learning
is one of these families. The Machine Learning Task View contains about 50 or so
Packages. Some of these Packages are part of the core distribution, including e1071
(a sprawling ML package that includes working code for quite a few of
the usual ML categories.)
Revolution Analytics Blog
With particular focus on the posts tagged with Predictive Analytics
ML in R tutorial comprised of slide deck and R code by Josh Reich
A thorough study of the code would, by itself, be an excellent introduction to ML in R.
And one final resource that i think is excellent, but didn't make in the top 5:
A Guide to Getting Stared in Machine Learning [in R]
posted at the blog A Beautiful WWW
Please look at the CRAN Task Views and in particular at the CRAN Task View on Machine Learning and Statistical Learning which summarises this nicely.
Both Matlab and R are good if you are doing matrix-heavy operations. Because they can use highly optimized low-level code (BLAS libraries and such) for this.
However, there is more to data-mining than just crunching matrixes. A lot of people totally neglect the whole data organization aspect of data mining (as opposed to say, plain machine learning).
And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or matlab to take an O(n^2) algorithm down to O(n log n) runtime. First of all, it totally goes against the way R and Matlab are designed (use bulk math operations wherever possible), secondly it will kill your performance. Interpreted R code for example seems to run at around 50% of the speed of the C code (try R built-in k-means vs. flexclus k-means); and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab, and benchmark it against the native one.
Don't get me wrong. There is a lot of stuff where R and matlab are just elegant and excellent for prototyping. You can solve a lot of things in just 10 lines of code, and get a decent performance out of it. Writing the same thing by hand would be hundreds of lines, and probably 10x slower. But sometimes you can optimize by a level of complexity, which for large data sets does beat the optimized matrix operations of R and matlab.
If you want to scale up to "Hadoop size" on the long run, you will have to think about data layout and organization, too, unless all you need is a linear scan over the data. But then, you could just be sampling, too!
Yesterday I found two new books about Data mining. These series of books entitled by ‘Data Mining’ address the need by presenting in-depth description of novel mining algorithms and many useful applications. In addition to understanding each section deeply, the two books present useful hints and strategies to solving problems in the following chapters.The progress of data mining technology and large public popularity establish a need for a comprehensive text on the subject. Books are: “New Fundamental Technologies in Data Mining” here http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining & “Knowledge-Oriented Applications in Data Mining” here http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining These are open access books so you can download it for free or just read on online reading platform like I do. Cheers!
We should not forget the origin sources for these two software: scientific computation and also signal processing leads to Matlab but statistics leads to R.
I used matlab a lot in University since we have one installed on Unix and open to all students. However, the price for Matlab is too high especially compared to free R. If your major focus is not on matrix computation and signal processing, R should work well for your needs.
I think it also depends in which field of study you are. I know of people in coastal research that use a lot of Matlab. Using R in this group would make your life more difficult. If a colleague has solved a problem, you can't use it because he fixed it using Matlab.
I would also look at the capabilities of each when you are dealing with large amounts of data. I know that R can have problems with this, and might be restrictive if you are used to an iterative data mining process. For example looking at multiple models concurrently. I don't know if MATLAB has a data limitation.
I admit to favoring MATLAB for data mining problems, and I give some of my reasoning here:
Why MATLAB for Data Mining?
I will admit to only a passing familiarity with R/S-Plus, but I'll make the following observations:
R definitely has more of a statistical focus than MATLAB. I prefer building my own tools in MATLAB, so that I know exactly what they're doing, and I can customize them, but this is more of a necessity in MATLAB than it would be in R.
Code for new statistical techniques (spatial statistics, robust statistics, etc.) often appears early in S-Plus (I assume that this carries over to R, at least some).
Some years ago, I found the commercial version of R, S-Plus to have an extremely limited capacity for data. I cannot say what the state of R/S-Plus is today, but you may want to check if your data will fit into such tools comfortably.

Which particular software development tasks have you used math for? And which branch of math did you use?

I'm not looking for a general discussion on if math is important or not for programming.
Instead I'm looking for real world scenarios where you have actually used some branch of math to solve some particular problem during your career as a software developer.
In particular, I'm looking for concrete examples.
I frequently find myself using De Morgan's theorem when as well as general Boolean algebra when trying to simplify conditionals
I've also occasionally written out truth tables to verify changes, as in the example below (found during a recent code review)
(showAll and s.ShowToUser are both of type bool.)
// Before
(showAll ? (s.ShowToUser || s.ShowToUser == false) : s.ShowToUser)
// After!
showAll || s.ShowToUser
I also used some basic right-angle trigonometry a few years ago when working on some simple graphics - I had to rotate and centre a text string along a line that could be at any angle.
Not revolutionary...but certainly maths.
Linear algebra for 3D rendering and also for financial tools.
Regression analysis for the same financial tools, like correlations between financial instruments and indices, and such.
Statistics, I had to write several methods to get statistical values, like the F Probability Distribution, the Pearson product moment coeficient, and some Linear Algebra correlations, interpolations and extrapolations for implementing the Arbitrage pricing theory for asset pricing and stocks.
Discrete math for everything, linear algebra for 3D, analysis for physics especially for calculating mass properties.
[Linear algebra for everything]
Projective geometry for camera calibration
Identification of time series / statistical filtering for sound & image processing
(I guess) basic mechanics and hence calculus for game programming
Computing sizes of caches to optimize performance. Not as simple as it sounds when this is your critical path, and you have to go back and work out the times saved by using the cache relative to its size.
I'm in medical imaging, and I use mostly linear algebra and basic geometry for anything related to 3D display, anatomical measurements, etc...
I also use numerical analysis for handling real-world noisy data, and a good deal of statistics to prove algorithms, design support tools for clinical trials, etc...
Games with trigonometry and AI with graph theory in my case.
Graph theory to create a weighted graph to represent all possible paths between two points and then find the shortest or most efficient path.
Also statistics for plotting graphs and risk calculations. I used both Normal distribution and cumulative normal distribution calculations. Pretty commonly used functions in Excel I would guess but I actully had to write them myself since there is no built-in support in the .NET libraries. Sadly the built in Math support in .NET seem pretty basic.
I've used trigonometry the most and also a small amount a calculus, working on overlays for GIS (mapping) software, comparing objects in 3D space, and converting between coordinate systems.
A general mathematical understanding is very useful if you're using 3rd party libraries to do calculations for you, as you ofter need to appreciate their limitations.
i often use math and programming together, but the goal of my work IS the math so use software to achive that.
as for the math i use; mostly Calculus (FFT's analysing continuous and discrete signals) with a slash of linar algebra (CORDIC) to do trig on a MCU with no floating point chip.
I used a analytic geometry for simple 3d engine in opengl in hobby project on high school.
Some geometry computation i had used for dynamic printing reports, where was another 90° angle layout than.
A year ago I used some derivatives and integrals for store analysis (product item movement in store).
Bot all the computation can be found on internet or high-school book.
Statistics mean, standard-deviation, for our analysts.
Linear algebra - particularly gauss-jordan elimination and
Calculus - derivatives in the form of difference tables for generating polynomials from a table of (x, f(x))
Linear algebra and complex analysis in electronic engineering.
Statistics in analysing data and translating it into other units (different project).
I used probability and log odds (log of the ratio of two probabilities) to classify incoming emails into multiple categories. Most of the heavy lifting was done by my colleague Fidelis Assis.
Real world scenarios: better rostering of staff, more efficient scheduling of flights, shortest paths in road networks, optimal facility/resource locations.
Branch of maths: Operations Research. Vague definition: construct a mathematical model of a (normally complex) real world business problem, and then use mathematical tools (e.g. optimisation, statistics/probability, queuing theory, graph theory) to interrogate this model to aid in the making of effective decisions (e.g. minimise cost, maximise efficency, predict outcomes etc).
Statistics for scientific data analyses such as:
calculation of distributions, z-standardisation
Fishers Z
Reliability (Alpha, Kappa, Cohen)
Discriminance analyses
scale aggregation, poling, etc.
In actual software development I've only really used quite trivial linear algebra, geometry and trigonometry. Certainly nothing more advanced than the first college course in each subject.
I have however written lots of programs to solve really quite hard math problems, using some very advanced math. But I wouldn't call any of that software development since I wasn't actually developing software. By that I mean that the end result wasn't the program itself, it was an answer. Basically someone would ask me what is essentially a math question and I'd write a program that answered that question. Sure I’d keep the code around for when I get asked the question again, and sometimes I’d send the code to someone so that they could answer the question themselves, but that still doesn’t count as software development in my mind. Occasionally someone would take that code and re-implement it in an application, but then they're the ones doing the software development and I'm the one doing the math.
(Hopefully this new job I’ve started will actually let me to both, so we’ll see how that works out)

Resources