Cross-validation predictions for lightGBM - r

Is there a simple way to recover cross-validation predictions from the model built using lgb.cv from lightGBM?
I am doing a grid search combined with cross validation. Ultimately I would like to obtain the predictions for each of the defined hold-out folds so I can also stack a few models.

The only way I have found is to continue training the same model (booster) with lgb.train; to do this you must pass init_model='model.txt'. Then save the model's best iteration like this: bst.save_model('model.txt', num_iteration=bst.best_iteration). Note this is the Python API, sorry; I have asked the same question, but for the Python API.
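On the R side, one way to recover out-of-fold predictions is to define the folds yourself, pass them to lgb.cv, and then score each fold's booster on its hold-out rows. A minimal sketch, assuming the returned lgb.CVBooster exposes its per-fold boosters via $boosters (an internal detail that may change between lightgbm versions); the data here are placeholders:

```r
library(lightgbm)

# Placeholder data
X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
dtrain <- lgb.Dataset(data = X, label = y)

# Define the hold-out folds ourselves so we know which rows each booster never saw
set.seed(42)
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(X)))
folds <- split(seq_len(nrow(X)), fold_id)   # each element = test row indices for one fold

cv <- lgb.cv(
  params  = list(objective = "regression", learning_rate = 0.1),
  data    = dtrain,
  nrounds = 100,
  folds   = folds
)

# Assemble out-of-fold predictions by scoring each fold's booster on its hold-out rows.
# NOTE: cv$boosters[[i]]$booster is an internal field and may differ across versions.
oof <- numeric(nrow(X))
for (i in seq_along(folds)) {
  bst <- cv$boosters[[i]]$booster
  oof[folds[[i]]] <- predict(bst, X[folds[[i]], , drop = FALSE])
}

head(oof)   # hold-out predictions, usable as a stacking feature
```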

Related

How to run model diagnostics and validate binomial GAMs?

I'm looking for methods to test the overall fit of a model, run model diagnostics to help with model selection, and validate binomial GAMs.
If anyone knows of a way to do this in R, that would be extremely helpful as well (i.e., packages and functions). I have heard of DHARMa, but am at a loss as to how I would use the package.
Any links with more information would also be appreciated.
Currently, all I have been able to do is ROC curves and AUC values.
Thanks
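For reference, a minimal sketch of the kind of DHARMa-based check being asked about, assuming a binomial mgcv::gam fit; the simulated data and model formula are placeholders:

```r
library(mgcv)
library(DHARMa)

# Placeholder model: simulated binary data from mgcv's own generator
d <- gamSim(eg = 1, n = 400, dist = "binary")
m <- gam(y ~ s(x0) + s(x1) + s(x2), family = binomial, data = d, method = "REML")

# mgcv's built-in diagnostics: convergence info, basis-dimension (k) checks, residual plots
gam.check(m)

# DHARMa: simulation-based scaled residuals, which stay interpretable for a
# binary response where raw residual plots are uninformative
res <- simulateResiduals(fittedModel = m, n = 500)
plot(res)             # QQ plot of scaled residuals + residuals vs. predicted
testDispersion(res)   # formal dispersion test
```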

How to do cross-validation in R using neuralnet?

I'm trying to build a predictive model using the neuralnet package. First I'm splitting my dataset into training (80%) and test (20%) sets. But an ANN is such a powerful technique that my model easily overfits the training set and performs poorly on the external test set.
[Plot: predicted vs. true values; training set on the right, test set on the left]
Is there a way to do cross-validation on the training set so that my model doesn't overfit? How can I do this with my own function?
Plus, are there any other approaches when dealing with deep learning? I've heard you can tweak the weights of the model in order to improve its quality on external data.
Thanks in advance!
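One common pattern is to write the k-fold loop yourself around neuralnet(). A minimal sketch, assuming a numeric regression problem; the data, formula, and network size are placeholders, and predict() on a neuralnet object requires a reasonably recent version of the package (older versions use compute()):

```r
library(neuralnet)

# Placeholder data, scaled so the network trains reliably
dat <- as.data.frame(scale(na.omit(airquality[, c("Ozone", "Temp", "Wind", "Solar.R")])))

k <- 5
set.seed(123)
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

cv_rmse <- sapply(seq_len(k), function(i) {
  train <- dat[fold_id != i, ]
  test  <- dat[fold_id == i, ]

  nn <- neuralnet(Ozone ~ Temp + Wind + Solar.R,
                  data = train,
                  hidden = c(5),
                  linear.output = TRUE,
                  stepmax = 1e6)

  pred <- predict(nn, test)          # compute(nn, test)$net.result on older versions
  sqrt(mean((pred - test$Ozone)^2))  # hold-out RMSE for this fold
})

mean(cv_rmse)   # average hold-out error; compare against training error to gauge overfitting
```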

identifying key columns/features used by decision tree regression

In Azure ML, I have a predictive regression model using boosted decision tree regression and it is reasonably accurate.
The input dataset has over 450 columns and the model has done a good job of predicting against test data sets, without over-fitting.
To report on the results I need to know which features/columns the model mainly used to make predictions, but I can't find this information easily when looking at the trained model data.
How do I identify this information? I'm happy to import the result dataset into R to help find this, but I just need pointers on which direction to start working in.
Usually in Microsoft Azure Machine Learning, the features that contribute most to predictions are shown in the output of the Train Model module.
But when you use a decision tree algorithm, the output of the Train Model module is the set of constructed 'trees' themselves.
To find the features that had the most impact on predictions when using decision tree algorithms, you can use the Permutation Feature Importance module, wired into your experiment as follows.
The parameters of Permutation Feature Importance are Random Seed and Metric for Measuring Performance (in this case, Regression - Coefficient of Determination).
The left input of Permutation Feature Importance is your trained model, and the right input is your test data.
The output of Permutation Feature Importance is a table of features and their importance scores.
You can add an Execute R Script module to extract the features and scores from the Permutation Feature Importance output.
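A rough sketch of what that Execute R Script body might look like in Azure ML Studio (classic); the maml.* port helpers are supplied by the module environment, and the column names "Feature" and "Score" are assumptions about the Permutation Feature Importance output:

```r
# Runs inside an Execute R Script module; input port 1 is wired to the
# Permutation Feature Importance output.
dataset <- maml.mapInputPort(1)

# Keep the feature names and importance scores, sorted from most to least important.
# Column names are assumed to be "Feature" and "Score" - adjust to match your output.
ranked <- dataset[order(-dataset$Score), c("Feature", "Score")]

# Send the ranked table to the module's output port (e.g. for Convert to CSV).
maml.mapOutputPort("ranked")
```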

caret: Linear Model parameter estimates (mean and s.e.) via LOOCV

I am wondering if there is a way to extract out all of the lm parameter estimates from the results of the cross-validation training runs from caret::train() with an lm model.
I have a gist of the R code I used to do some of this checking, where I directly access the train() output object and get the data.frame indexes used in each cross-validation run. But I was wondering whether there are already functions that do this for me, because I would think that 1) if it were a good or reasonable idea, the functionality would already be there, or 2) if the functionality is not there, my desire to do this may not be a good idea.
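A rough sketch of that kind of manual extraction, assuming method = "lm" with LOOCV; the formula and data are placeholders, and relying on fit$control$index to hold each resample's training rows is an assumption about caret's internals:

```r
library(caret)

# Placeholder data and formula
df  <- mtcars
set.seed(1)
fit <- train(mpg ~ wt + hp,
             data      = df,
             method    = "lm",
             trControl = trainControl(method = "LOOCV"))

# Refit lm() on each resample's training rows and collect the coefficients.
# fit$control$index is assumed to hold the row indices used to train each resample.
cv_coefs <- t(sapply(fit$control$index, function(idx) {
  coef(lm(mpg ~ wt + hp, data = df[idx, ]))
}))

colMeans(cv_coefs)               # mean of the CV parameter estimates
apply(cv_coefs, 2, sd)           # spread of the CV parameter estimates
coef(summary(fit$finalModel))    # lm fit on the full training set, for comparison
```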
As a second part of the question: you can see in the gist that when I compute the mean and standard error of a single linear parameter over all of the cross-validation estimates, the mean of the CV estimates matches the estimate from an lm fit on the entire training set quite well, but the standard error of the CV estimates is much smaller than the standard error from that single lm run on the whole training set. Is that expected, or am I computing/thinking about this wrong?
EDIT: I think the answer to the second part can be found by reading this answer.
Thanks in advance,
Matt

R Supervised Latent Dirichlet Allocation Package

I'm using this LDA package for R. Specifically, I am trying to do supervised latent Dirichlet allocation (sLDA). In the linked package there's an slda.em function. However, what confuses me is that it asks for alpha, eta and variance parameters. As far as I understand, these parameters are unknowns in the model. So my question is, did the author of the package mean that these are initial guesses for the parameters? If so, there doesn't seem to be a way of accessing them from the result of running slda.em.
Aside from coding the extra EM steps in the algorithm, is there a suggested way to guess reasonable values for these parameters?
Since you are trying to build a supervised model, the typical approach would be to use cross-validation to determine the model parameters. So you hold out some of the data as your test set, train a model on the remaining data, and evaluate the model's performance, repeating k times. You then repeat this with different model parameters to determine which ones give the best model performance.
In the specific case of slda, I would run demo(slda) to see the author's implementation of it. When you run the demo, you'll see that he sets alpha=1.0, eta=0.1, and variance=0.25. I'd suggest using these as your starting point, and then use cross validation to determine better parameters if you need to improve model performance.
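A rough sketch of that kind of grid search around slda.em, starting from the demo's values; documents, vocab, and ratings are placeholders for your own corpus, and the slda.em / slda.predict argument names follow demo(slda) but may differ across versions of the lda package:

```r
library(lda)

# Placeholders: `documents` and `vocab` in the lda package's list format,
# `ratings` a numeric response, `K` the number of topics.
K <- 10
param_grid <- expand.grid(alpha    = c(0.5, 1.0, 2.0),
                          eta      = c(0.05, 0.1, 0.5),
                          variance = c(0.1, 0.25, 0.5))

k <- 5
set.seed(1)
fold_id <- sample(rep(seq_len(k), length.out = length(documents)))

cv_rmse <- apply(param_grid, 1, function(p) {
  fold_err <- sapply(seq_len(k), function(i) {
    tr <- which(fold_id != i)
    te <- which(fold_id == i)

    fit <- slda.em(documents = documents[tr], K = K, vocab = vocab,
                   num.e.iterations = 10, num.m.iterations = 4,
                   alpha = p[["alpha"]], eta = p[["eta"]],
                   annotations = ratings[tr],
                   params = sample(c(-1, 1), K, replace = TRUE),  # initial coefficients, as in the demo
                   variance = p[["variance"]],
                   logistic = FALSE, method = "sLDA")

    pred <- slda.predict(documents[te], fit$topics, fit$model,
                         alpha = p[["alpha"]], eta = p[["eta"]])
    sqrt(mean((pred - ratings[te])^2))   # hold-out RMSE for this fold
  })
  mean(fold_err)
})

cbind(param_grid, cv_rmse)[which.min(cv_rmse), ]   # best parameter combination
```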
