R Supervised Latent Dirichlet Allocation Package - r

I'm using this LDA package for R. Specifically I am trying to do supervised latent dirichlet allocation (slda). In the linked package, there's an slda.em function. However what confuses me is that it asks for alpha, eta and variance parameters. As far as I understand, I thought these parameters are unknowns in the model. So my question is, did the author of the package mean to say that these are initial guesses for the parameters? If yes, there doesn't seem to be a way of accessing them from the result of running slda.em.
Aside from coding the extra EM steps in the algorithm, is there a suggested way to guess reasonable values for these parameters?

Since you are trying to generate a supervised model, the typical approach would be to use cross validation to determine the model parameters. So you hold out some of the data as your test set, train the a model on the remaining data, and evaluate the model performance, repeating k times. You then continue to repeat with different model parameters to determine which result in the best model performance.
In the specific case of slda, I would run demo(slda) to see the author's implementation of it. When you run the demo, you'll see that he sets alpha=1.0, eta=0.1, and variance=0.25. I'd suggest using these as your starting point, and then use cross validation to determine better parameters if you need to improve model performance.

Related

mlr3 optimized average of ensemble

I try to optimize the averaged prediction of two logistic regressions in a classification task using a superlearner.
My measure of interest is classif.auc
The mlr3 help file tells me (?mlr_learners_avg)
Predictions are averaged using weights (in order of appearance in the
data) which are optimized using nonlinear optimization from the
package "nloptr" for a measure provided in measure (defaults to
classif.acc for LearnerClassifAvg and regr.mse for LearnerRegrAvg).
Learned weights can be obtained from $model. Using non-linear
optimization is implemented in the SuperLearner R package. For a more
detailed analysis the reader is referred to LeDell (2015).
I have two questions regarding this information:
When I look at the source code I think LearnerClassifAvg$new() defaults to "classif.ce", is that true?
I think I could set it to classif.auc with param_set$values <- list(measure="classif.auc",optimizer="nloptr",log_level="warn")
The help file refers to the SuperLearner package and LeDell 2015. As I understand it correctly, the proposed "AUC-Maximizing Ensembles through Metalearning" solution from the paper above is, however, not impelemented in mlr3? Or do I miss something? Could this solution be applied in mlr3? In the mlr3 book I found a paragraph regarding calling an external optimization function, would that be possible for SuperLearner?
As far as I understand it, LeDell2015 proposes and evaluate a general strategy that optimizes AUC as a black-box function by learning optimal weights. They do not really propose a best strategy or any concrete defaults so I looked into the defaults of the SuperLearner package's AUC optimization strategy.
Assuming I understood the paper correctly:
The LearnerClassifAvg basically implements what is proposed in LeDell2015 namely, it optimizes the weights for any metric using non-linear optimization. LeDell2015 focus on the special case of optimizing AUC. As you rightly pointed out, by setting the measure to "classif.auc" you get a meta-learner that optimizes AUC. The default with respect to which optimization routine is used deviates between mlr3pipelines and the SuperLearner package, where we use NLOPT_LN_COBYLA and SuperLearner ... uses the Nelder-Mead method via the optim function to minimize rank loss (from the documentation).
So in order to get exactly the same behaviour, you would need to implement a Nelder-Mead bbotk::Optimizer similar to here that simply wraps stats::optim with method Nelder-Mead and carefully compare settings and stopping criteria. I am fairly confident that NLOPT_LN_COBYLA delivers somewhat comparable results, LeDell2015 has a comparison of the different optimizers for further reference.
Thanks for spotting the error in the documentation. I agree, that the description is a little unclear and I will try to improve this!

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

different values by fitting a boosted tree twice

I use the R-package adabag to fit boosted trees to a (large) data set (140 observations with 3 845 predictors).
I executed this method twice with same parameter and same data set and each time different values of the accuracy returned (I defined a simple function which gives accuracy given a data set).
Did I make a mistake or is usual that in each fitting different values of the accuracy return? Is this problem based on the fact that the data set is large?
function which returns accuracy given the predicted values and true test set values.
err<-function(pred_d, test_d)
{
abs.acc<-sum(pred_d==test_d)
rel.acc<-abs.acc/length(test_d)
v<-c(abs.acc,rel.acc)
return(v)
}
new Edit (9.1.2017):
important following question of the above context.
As far as I can see I do not use any "pseudo randomness objects" (such as generating random numbers etc.) in my code, because I essentially fit trees (using r-package rpart) and boosted trees (using r-package adabag) to a large data set. Can you explain me where "pseudo randomness" enters, when I execute my code?
Edit 1: Similar phenomenon happens also with tree (using the R-package rpart).
Edit 2: Similar phenomenon did not happen with trees (using rpart) on the data set iris.
There's no reason you should expect to get the same results if you didn't set your seed (with set.seed()).
It doesn't matter what seed you set if you're doing statistics rather than information security. You might run your model with several different seeds to check its sensitivity. You just have to set it before anything involving pseudo randomness. Most people set it at the beginning of their code.
This is ubiquitous in statistics; it affects all probabilistic models and processes across all languages.
Note that in the case of information security it's important to have a (pseudo) random seed which cannot be easily guessed by brute force attacks, because (in a nutshell) knowing a seed value used internally by a security program paves the way for it to be hacked. In science and statistics it's the opposite - you and anyone you share your code/research with should be aware of the seed to ensure reproducibility.
https://en.wikipedia.org/wiki/Random_seed
http://www.grasshopper3d.com/forum/topics/what-are-random-seed-values

How to consider different costs for different types of errors in SVM using R

Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross validation and there we can specify the cost function in the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors:predicted Yes|reference is No or predicted No|reference is Yes.
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost(loss) function instead of using built-in loss function?
Besides the Answer by Yueguoguo, there is also three more solutions, the standard Wrapper approach, hyperplane tuning and the one in e1017.
The Wrapper approach (available out of the box for example in weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model if trained to optimise accuracy is optimal under the costs.
The second idea is frequently used in textminining. The classification is svm's are derived from distance to the hyperplane. For linear separable problems this distance is {1,-1} for the support vectors. The classification of a new example is then basically, whether the distance is positive or negative. However, one can also shift this distance and not make the decision and 0 but move it for example towards 0.8. That way the classifications are shifted in one or the other direction, while the general shape of the data is not altered.
Finally, some machine learning toolkits have a build in parameter for class specific costs like class.weights in the e1017 implementation. the name is due to the fact that the term cost is pre-occupied.
The loss function for SVM hyperplane parameters is automatically tuned thanks to the beautiful theoretical foundation of the algorithm. SVM applies cross-validation for tuning hyperparameters. Say, an RBF kernel is used, cross validation is to select the optimal combination of C (cost) and gamma (kernel parameter) for the best performance, measured by certain metrics (e.g., mean squared error). In e1071, the performance can be obtained by using tune method, where the range of hyperparameters as well as attribute of cross-validation (i.e., 5-, 10- or more fold cross validation) can be specified.
To obtain comparative cross-validation results by using Area-Under-Curve type of error measurement, one can train different models with different hyperparameter configurations and then validate the model against sets of pre-labelled data.
Hope the answer helps.

How to use a glmnet model as a node model for the mob function (of the R package party)?

I am using the mob function of the R package party. My question concerns the model parameter of this function.
How can I define a StatModel object (from the package modeltools) - let's call it glmnetModel - so that the nodes models of the mob estimation are glmnet models (more precisely I would like to use the cv.glmnet function as the main estimation function in the fit slot of glmnetModel) ?
One difficulty is to extend correctly the reweight function (and maybe the estfun and deviance functions ?) like it is suggested here (section 2.1).
Anybody has an idea ?
NB : I have seen some extensions (for SVM : here) but I am not able to use them correctly.
Thank you very much !
Dominique
I'm not sure whether the inferential framework of the parameter instability tests in the MOB algorithm holds for either glmnet or svm.
The assumption is that the model's objective function (e.g., residual sum of squares or log-likelihood) is additive in the observations and that the corresponding first order conditions are consequently additive as well. Then, under certain weak regularity conditions, a central limit theorem holds for the parameter estimates. This can be extended to a functional central limit theorem upon which the parameter instability tests in MOB are based.
If these assumptions do not hold, the p-values may not be valid and hence may lead to too many or too few splits or a biased splitting variable selection. Whether or not this happens in practice for the models you are interested in, I don't know. You would have to check this - either theoretically (which may be hard) or in simulation studies.
Technically, the reimplementation of mob() in the partykit package makes it much easier to plug in new models. A lot less glue code (without S4 classes) is necessary now. See vignette("mob", package = "partykit") for details.

Resources