Maximum Likelihood - Estimating the Number of Maxima - Markov

I'm training a Hidden Markov Model using EM, and want to get some estimate of how "certain" I can be about the learned parameters (i.e., the estimated transition, emission, and prior probabilities). In general, different initial conditions result in different parameters, but in many cases the different parameter sets have similar likelihoods.
I'm looking for some way to probe the likelihood terrain to get an estimate of the number of local maxima, in order to get a better idea of the range of results I might get. (Running the algorithm takes quite a long time, so I can't run it enough times to do a "naive" sensitivity analysis.)
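The closest thing I can think of is a rough probe: many short EM runs (a few iterations each) from random starts, then counting the distinct log-likelihood plateaus they settle near. A sketch of what I mean, where fit_hmm_em() stands in for my existing EM routine (a hypothetical wrapper, not a real function):
loglik <- sapply(1:50, function(seed) {
  fit_hmm_em(obs, n_states = 3, max_iter = 20, seed = seed)$logLik  # short, cheap runs
})
round(sort(loglik), 1)              # eyeball the clusters of log-likelihood values
length(unique(round(loglik, 1)))    # crude count of distinct plateaus / local maxima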
Any standard methods to do so?

Related

Hitting a Target In-Stock Rate Through More Accurate Prediction Intervals in Univariate Time Series Forecasting

My job is to make sure that an online retailer achieves a certain service level (in stock rate) for their products, while avoiding aging and excess stock. I have a robust cost and leadtime simulation model. One of the inputs into that model is a vector of prediction intervals for cumulative demand over the next leadtime weeks.
I've been reading about quantile regression, conformal models, gradient boosting, and quantile random forests... frankly, all of these are far above my head, and they seem focused on multivariate regression of non-time-series data. I know that I can't just regress against time, so I'm not even sure how to set up a complex regression method correctly. Moreover, since I'm forecasting many thousands of items every week, the parameter setting and tuning needs to be completely automated.
To date, I've been using a handful of traditional forecast methods (TSB [a variation of Croston], ETS, ARIMA, etc.), including hybrids, using R packages like hybridForecast. My prediction intervals are almost universally much narrower than our actual results (e.g., in a sample of 500 relatively steady-selling items, 20% of actuals fell below my ARIMA 1% prediction interval, and 12% were above the 99% prediction interval).
I switched to using simulation + bootstrapping the residuals to build my intervals, but the results are directionally the same as above.
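For reference, what I have in mind for that step looks roughly like the sketch below (assuming the forecast package; y is one item's weekly demand series and the leadtime of 6 weeks is illustrative):
library(forecast)
fit <- auto.arima(y)                 # or ets(y)
leadtime <- 6
sims <- replicate(2000, sum(simulate(fit, nsim = leadtime, future = TRUE, bootstrap = TRUE)))
quantile(sims, probs = c(0.01, 0.10, 0.50, 0.90, 0.99))   # cumulative-demand prediction intervals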
I'm looking for the simplest way to arrive at a univariate time series model with more accurate prediction intervals for cumulative demand over the leadtime weeks, particularly at the upper/lower 10% and beyond. All my current models are trained on MSE, so one step is probably to use something more like pinball loss, scored against cumulative demand (rather than the per-period error). Unfortunately, I'm totally unfamiliar with how to write a custom loss function for the legacy forecasting libraries (much less the new sexy ones above).
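(For concreteness, the pinball loss itself is simple to score by hand; in this sketch, cum_actual and cum_pred_q90 are hypothetical vectors of realised and predicted cumulative demand over the leadtime.)
pinball <- function(actual, pred, tau) {
  # mean pinball (quantile) loss: under-prediction is weighted by tau, over-prediction by 1 - tau
  mean(ifelse(actual >= pred, tau * (actual - pred), (1 - tau) * (pred - actual)))
}
pinball(cum_actual, cum_pred_q90, tau = 0.90)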
I'd deeply appreciate any advice!
A side note: we already have an AWS setup that can compute each item from an R job in parallel, so computing time is not a major factor.

Speed up estimation of overlapping additive models (mgcv)

I have some set of variables and I'm fitting many (hundreds of thousands) additive models, each of which includes a subset of all the variables. The dependent variable is the same in every case, and some of the models overlap or are nested. Not all of the independent variables have to enter the model nonparametrically. For clarity, I might have a set of variables {x1,x2,x3,x4,x5} and estimate:
a) y=c+f(x1)+f(x2),
b) y=c+x1+f(x2),
c) y=c+f(x1)+f(x2)+x3, etc.
I'm wondering if there is anything I can do to speed up the gam estimation in this case? Is there anything that is being calculated over and over again that I could calculate once and supply to the function?
What I have already tried:
Memoization since the models repeat exactly from time to time.
Reluctantly switched from thin plate regression splines to cubic regression splines (quite a significant improvement).
The mgcv guide says:
The user can retain most of the advantages of the t.p.r.s. approach by supplying a reduced set of covariate values from which to obtain the basis - typically the number of covariate values used will be substantially smaller than the number of data, and substantially larger than the basis dimension, k.
This caused quite a noticeable improvement with smaller models, e.g. 5 smooths, but not with larger models, e.g. 10 smooths. In fact, in the latter case, it often caused the estimation to take (potentially much) longer.
What I'd like to try but don't know if it's possible:
One obvious thing that repeats itself in both, say, y=c+f(x1)+f(x2) and y=c+x1+f(x2), is the calculation of the basis for f(x2). If I were to use the same knots every time, how (if it's possible at all) could I precalculate the basis for every variable and then supply that to mgcv? Would you expect this to bring a significant time improvement?
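For concreteness, what I have in mind is something like fixing the knot locations through gam()'s knots argument, so that every model containing f(x2) uses exactly the same cubic-regression-spline setup (a sketch; dat is my data frame and k = 10 is illustrative):
library(mgcv)
kn <- list(x1 = quantile(dat$x1, probs = seq(0, 1, length.out = 10)),
           x2 = quantile(dat$x2, probs = seq(0, 1, length.out = 10)))
m_a <- gam(y ~ s(x1, bs = "cr", k = 10) + s(x2, bs = "cr", k = 10), data = dat, knots = kn)
m_b <- gam(y ~ x1 + s(x2, bs = "cr", k = 10), data = dat, knots = kn)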
Is there anything else you'd recommend?

How to consider different costs for different types of errors in SVM using R

Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation, and there we can specify the cost function via the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes when the reference is No, or predicted No when the reference is Yes.
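For reference, this is the kind of cost function I mean (illustrative: a false negative costs five times a false positive; d is a data frame with the binary response Y coded 0/1):
library(boot)
cost <- function(obs, pred) {
  # obs: observed 0/1 responses; pred: fitted probabilities from the glm
  mean(ifelse(obs == 1 & pred < 0.5, 5, ifelse(obs == 0 & pred >= 0.5, 1, 0)))
}
glm.fit <- glm(Y ~ ., data = d, family = binomial)
cv.glm(d, glm.fit, cost = cost, K = 10)$delta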
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost (loss) function instead of using the built-in loss function?
Besides the answer by Yueguoguo, there are also three more solutions: the standard Wrapper approach, hyperplane tuning, and the class weights built into e1071.
The Wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs; a model trained to optimise plain accuracy on the resampled data is then optimal under the costs.
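A minimal sketch of the resampling idea (illustrative 5:1 cost ratio; d is a data frame with a factor response Y):
idx.no  <- which(d$Y == "No")
idx.yes <- which(d$Y == "Yes")
d.weighted <- d[c(idx.no, rep(idx.yes, times = 5)), ]   # replicate the costlier class 5x
# any accuracy-optimising classifier trained on d.weighted now implicitly reflects the 5:1 costs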
The second idea is frequently used in text mining. In an SVM, the classification is derived from the distance to the separating hyperplane; for linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically a question of whether its distance is positive or negative. However, one can also shift the decision threshold: instead of deciding at 0, move it, for example, to 0.8. That way the classifications are shifted in one or the other direction, while the general shape of the data is not altered.
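With e1071 this can be done on the decision values (a sketch; which class the positive sign corresponds to depends on the factor level order of Y):
library(e1071)
fit  <- svm(Y ~ ., data = d.train)
pred <- predict(fit, d.test, decision.values = TRUE)
dv   <- attr(pred, "decision.values")           # signed distances to the hyperplane
cls  <- strsplit(colnames(dv)[1], "/")[[1]]     # column name has the form "classA/classB"
y.hat <- ifelse(dv[, 1] > 0.8, cls[1], cls[2])  # decide at 0.8 instead of at 0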
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken: it is the regularisation parameter of the SVM.
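For example (illustrative weights; the names must match the factor levels of Y):
library(e1071)
fit <- svm(Y ~ ., data = d.train, kernel = "radial",
           class.weights = c(Yes = 5, No = 1))   # errors on "Yes" are penalised five times as heavily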
The loss function for the SVM hyperplane parameters is tuned automatically thanks to the theoretical foundation of the algorithm; the remaining hyperparameters are tuned via cross-validation. Say an RBF kernel is used: cross-validation then selects the optimal combination of C (cost) and gamma (kernel parameter) for the best performance, measured by certain metrics (e.g., mean squared error). In e1071, this can be done with the tune method, where the ranges of the hyperparameters as well as the cross-validation setting (i.e., 5-, 10-, or more-fold cross-validation) can be specified.
To obtain comparable cross-validation results using an Area-Under-Curve type of error measurement, one can train different models with different hyperparameter configurations and then validate each model against sets of pre-labelled data.
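A sketch of such a grid search with tune (hypothetical data and parameter ranges):
library(e1071)
tuned <- tune(svm, Y ~ ., data = d.train,
              ranges = list(gamma = 2^(-4:1), cost = 2^(0:4)),
              tunecontrol = tune.control(sampling = "cross", cross = 10))
tuned$best.parameters    # best cost/gamma combination found
summary(tuned)           # cross-validated performance of every combination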
Hope the answer helps.

poLCA: unstable estimates

I am trying to run a latent class analysis with covariates using the poLCA package. However, every time I run the model, the multinomial logit coefficients come out different. I have accounted for changes in the order of the classes, and I set a very high number of repetitions (nrep=1500). However, rerunning the model I still obtain different results. For example, I have 3 classes (high, low, medium). No matter the order in which the classes are considered in the estimation, the multinomial model gives me different coefficients for the same comparisons (such as low vs. high and medium vs. high) across different estimations. Should I further increase the number of repetitions in order to get stable results? Any idea why this is happening? I know that with set.seed() I can replicate the results, but I would like to obtain stable estimates in order to be able to claim the validity of the results. Thank you very much!
From the manual (?poLCA):
As long as probs.start=NULL, each function call will use different (random) initial starting parameters.
You need to use set.seed() or set probs.start in order to get consistent results across function calls.
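A sketch of that recipe (f is your poLCA formula and dat your data; ordering the classes by their estimated share is just one illustrative convention):
library(poLCA)
set.seed(123)                                        # makes the random restarts themselves reproducible
lc <- poLCA(f, dat, nclass = 3, nrep = 50, maxiter = 5000)
# pin the class order and reuse the start values that produced the best solution
new.start <- poLCA.reorder(lc$probs.start, order(lc$P, decreasing = TRUE))
lc2 <- poLCA(f, dat, nclass = 3, probs.start = new.start)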
Actually, if you are not converging to the same solution from different starting points, you have a data problem.
LCA uses a kind of maximum likelihood estimation. If there is no convergence, you have an under-identification problem: you have too little information to estimate the number of classes that you have. Fewer classes might run, or you will have to impose some a priori restrictions.
You might wish to read Latent Class and Latent Transition Analysis by Collins. It was a great help for me.

Random forest gets worse as the number of trees increases

I am running into difficulties when using randomForest (in R) for a classification problem. My R code, an image, and data are here:
http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip
The observer is presented with either a faint image (contrast=con) buried in noise or just noise on each trial. He rates his confidence (rating) that the face is present. I have categorised rating to be a yes/no judgement (y). The face is either inverted (invert=1) or not in each block of 100 trials (one file). I use the contrast (1st column of predictor matrix x) and the pixels (the rest of the columns) to predict y.
It is critical to my application that I have an "importance image" at the end which shows how much each pixel contributes to the decision y. I have 1000 trials (length of y) and 4248 pixels + contrast = 4249 predictors (ncol of x). Using glmnet (logistic ridge regression) on this problem works fine:
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
However, randomForest does not work at all:
fit <- randomForest(x = x, y = y, ntree = 100)
and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, and for glmnet it is 8.9%.
Please let me know what I am doing wrong with randomForest, and how to fix it.
Ridge regression's only parameter, lambda, is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi, and the error rate you get out of cv.glmnet relates to that. randomForest gives you the OOB error, which is akin to an error on a dedicated test set (which is what you are interested in).
randomForest requires you to calibrate it manually (i.e., have a dedicated validation set to see which parameters work best), and there are a few to consider: the depth of the trees (via fixing the number of examples in each node or the number of nodes), the number of randomly chosen attributes considered at each split, and the number of trees. You can use tuneRF to find a good value of mtry.
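For example (a sketch; x and y are the predictor matrix and response from the question, and nodesize = 5 is illustrative):
library(randomForest)
yf <- factor(y)                                    # make sure randomForest does classification
tuneRF(x, yf, mtryStart = floor(sqrt(ncol(x))), ntreeTry = 200,
       stepFactor = 1.5, improve = 0.01)           # OOB-based search for a good mtry
rf.mod <- randomForest(x = x, y = yf, ntree = 500, nodesize = 5)
plot(rf.mod)                                       # OOB error as a function of the number of trees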
When evaluated on the training set, the more trees you add, the better your predictions get. However, you may see the predictive ability on a test set stop improving, or even diminish, after a certain number of trees are grown -- this is usually attributed to overfitting. The OOB error estimates (or, if you provide one, the test set passed to randomForest) track this: if rf.mod is your fitted RF model, then plot(rf.mod) lets you see roughly at which point adding more trees stops helping, so you can choose ntree accordingly (note that predict on a fitted forest uses all the trees that were grown).
In short, you are not comparing the two models' performances correctly (as pointed out by Hong Ooi), and your parameters might also be off and/or you might be overfitting (although that is unlikely with just 100 trees).
