Is XGBoost effective for variable selection? - r

I now understand the use of XGBoost; I realize this was an amateur question.
Can XGBoost be used for variable elimination and selection, like LASSO, or do we need to use LASSO first to eliminate variables and then use XGBoost to get the final prediction?

XGBoost is quite effective for prediction in the presence of redundant variables (features), since the underlying gradient boosting algorithm is itself fairly robust to multicollinearity.
Even so, it is highly recommended to remove (or engineer away) redundant features from the training data, whichever algorithm you choose (LASSO or XGBoost).
Additionally, you can combine the two methods using ensemble learning.
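For illustration, here is a minimal sketch (not part of the original answer) of using XGBoost's gain-based feature importance as a rough variable-screening step in R; the simulated data, parameter values, and the 0.01 cut-off are assumptions for the example only:

```r
# Hedged sketch: rank features by XGBoost's "Gain" importance and keep those
# above an arbitrary threshold. Data and threshold are illustrative only.
library(xgboost)

set.seed(1)
X <- matrix(rnorm(500 * 20), ncol = 20)
colnames(X) <- paste0("x", 1:20)
y <- as.numeric(X[, 1] + 0.5 * X[, 2] + rnorm(500) > 0)

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic", eta = 0.1),
                 data = dtrain, nrounds = 50)

imp <- xgb.importance(feature_names = colnames(X), model = bst)
head(imp)                             # features ranked by Gain
keep <- imp$Feature[imp$Gain > 0.01]  # arbitrary cut-off for illustration
keep
```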

xgboost has built-in regularization (similar to LASSO) that is applied during training.
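As a rough sketch of what that looks like in practice (parameter values are illustrative, not recommendations), the L1 penalty alpha and the L2 penalty lambda are passed alongside the other training parameters:

```r
# Hedged sketch: XGBoost's built-in penalties on leaf weights.
# alpha = L1 (LASSO-like) penalty, lambda = L2 (ridge-like) penalty.
library(xgboost)

set.seed(2)
X <- matrix(rnorm(500 * 20), ncol = 20)
y <- as.numeric(X[, 1] - X[, 3] + rnorm(500) > 0)

dtrain <- xgb.DMatrix(data = X, label = y)
params <- list(objective = "binary:logistic",
               eta    = 0.1,
               alpha  = 1,   # L1 regularization term
               lambda = 1)   # L2 regularization term
bst <- xgb.train(params = params, data = dtrain, nrounds = 100)
```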

Related

How to do cross-validation in R using neuralnet?

I'm trying to build a predictive model using the neuralnet package. First I'm splitting my dataset into training (80%) and test (20%) sets. But an ANN is such a powerful technique that my model easily overfits the training set and performs poorly on the external test set.
[Plot: predicted vs. true values; training set on the right, test set on the left]
Is there a way to do cross-validation on the training set so that my model doesn't overfit? How can I do this with a function I build myself?
Plus, are there any other approaches when dealing with deep learning? I've heard you can tweak the weights of the model in order to improve its quality on external data.
Thanks in advance!
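One common way to do this (a hedged sketch, not from the original thread; the data and variable names are made up for illustration) is to write the k-fold loop yourself around neuralnet:

```r
# Hedged sketch: manual 5-fold cross-validation with the neuralnet package.
library(neuralnet)

set.seed(42)
mydata <- data.frame(x1 = runif(200), x2 = runif(200))
mydata$y <- 2 * mydata$x1 - mydata$x2 + rnorm(200, sd = 0.1)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(mydata)))
cv_rmse <- numeric(k)

for (i in 1:k) {
  train <- mydata[folds != i, ]
  test  <- mydata[folds == i, ]
  nn <- neuralnet(y ~ x1 + x2, data = train, hidden = 3, linear.output = TRUE)
  pred <- compute(nn, test[, c("x1", "x2")])$net.result
  cv_rmse[i] <- sqrt(mean((test$y - pred)^2))
}
mean(cv_rmse)  # average hold-out error across the folds
```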

Boosted trees and Variable Interactions in R

In a boosted-trees classification model (e.g. AdaBoost), how can one see which variables interact with each other and how strongly? I would like to do this with the R gbm package if possible.
To extract interactions between input variables, you can also fit a model with explicit interaction terms, e.g. with lm: http://www.r-bloggers.com/r-tutorial-series-regression-with-interaction-variables/
You can use ?interact.gbm. See also this cross-validated question, which directs to a vignette of a related technique from the package dismo.
In general, these interactions may not necessarily agree with the interaction terms estimated in a linear model.
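As a concrete, hedged sketch of the interact.gbm approach (simulated data and illustrative settings, not from the original answer):

```r
# Hedged sketch: pairwise interaction strength (Friedman's H statistic)
# from a gbm model via interact.gbm. Values near 0 mean no interaction.
library(gbm)

set.seed(1)
df <- data.frame(x1 = runif(500), x2 = runif(500), x3 = runif(500))
df$y <- df$x1 * df$x2 + rnorm(500, sd = 0.1)  # true x1:x2 interaction

fit <- gbm(y ~ x1 + x2 + x3, data = df, distribution = "gaussian",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
           bag.fraction = 0.8)

interact.gbm(fit, data = df, i.var = c("x1", "x2"), n.trees = 500)
interact.gbm(fit, data = df, i.var = c("x1", "x3"), n.trees = 500)
```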

SVM Feature Selection in R

I am training an SVM classifier. Right now, I have about 4000 features, but a lot of them are redundant/uninformative. I want to reduce the features in the model to about 20-50. I would like to use greedy hill climbing, removing one feature at a time.
The removed feature should be the least important feature. After training an SVM, how do I get the ranking of the importance of the features? If I am using libsvm in R, how do I get the weight of each feature, or some other similar type of indicator of importance? Thanks!
I would reduce the dimensionality of the problem first using PCA (Principal Component Analysis), then apply the SVM. See, e.g., Andrew Ng's lecture videos.
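If you do want a per-feature importance from the SVM itself, one common approach for a linear kernel (a hedged sketch, not from the original answer; it uses e1071, an R interface to libsvm, and the iris data purely for illustration) is to recover the primal weight vector from the support vectors:

```r
# Hedged sketch: feature weights from a *linear* SVM fitted with e1071.
# The absolute weights give a rough importance ranking (on scaled features).
library(e1071)

data(iris)
X <- as.matrix(iris[iris$Species != "setosa", 1:4])
y <- factor(iris$Species[iris$Species != "setosa"])

fit <- svm(X, y, kernel = "linear", scale = TRUE)

w <- t(fit$coefs) %*% fit$SV           # primal weight vector (1 x n_features)
sort(abs(drop(w)), decreasing = TRUE)  # candidate order for greedy elimination
```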

prediction intervals with caret

I've been using the caret package in R to run some boosted regression tree and random forest models and am hoping to generate prediction intervals for a set of new cases using the inbuilt cross-validation routine.
The trainControl function allows you to save the hold-out predictions at each of the n-folds, but I'm wondering whether unknown cases can also be predicted at each fold using the built-in functions, or whether I need to use a separate loop to build the models n-times.
Any advice much appreciated
Check the R package quantregForest, available on CRAN. It can easily calculate prediction intervals for random forest models. There's a nice paper by the author of the package explaining the background of the method. (Sorry, I can't say anything about prediction intervals for BRT models; I'm still looking for those myself...)
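As a short, hedged sketch of what that looks like (simulated data; the 2.5%/97.5% quantiles are just one choice of interval):

```r
# Hedged sketch: approximate prediction intervals from quantile regression
# forests with quantregForest.
library(quantregForest)

set.seed(7)
X <- data.frame(x1 = runif(300), x2 = runif(300))
y <- 3 * X$x1 + rnorm(300, sd = 0.3)

qrf <- quantregForest(x = X, y = y, ntree = 500)

Xnew <- data.frame(x1 = runif(5), x2 = runif(5))
# 2.5% and 97.5% conditional quantiles give a ~95% prediction interval.
# (In older package versions this argument was named 'quantiles'.)
predict(qrf, newdata = Xnew, what = c(0.025, 0.975))
```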

How to do feature selection with randomForest package?

I'm using randomForest in order to find out the most significant variables. I was expecting some output that defines the accuracy of the model and also ranks the variables based on their importance. But I am a bit confused now. I tried randomForest and then ran importance() to extract the importance of variables.
But then I saw another command, rfcv (Random Forest Cross-Validation for feature selection), which I suppose should be the most appropriate for this purpose, but the question I have regarding it is: how do I get the list of the most important variables? How do I see the output after running it? Which command should I use?
Another thing: What is the difference between randomForest and predict.randomForest?
I am not very familiar with randomforest and R therefore any help would be appreciated.
Thank you in advance!
After you have built a randomForest model, you use predict.randomForest to apply that model to new data, e.g. build a random forest with the training data, then run your validation data through that model with predict.randomForest.
As for rfcv, there is an option recursive which (from the help) controls:
whether variable importance is (re-)assessed at each step of variable
reduction
It's all in the help file.
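To make that concrete, here is a hedged sketch (simulated data, illustrative settings) showing importance(), rfcv(), and where predict.randomForest fits in:

```r
# Hedged sketch: rank variables with importance(), then use rfcv() to see how
# cross-validated error changes as less important variables are dropped.
library(randomForest)

set.seed(3)
X <- data.frame(matrix(rnorm(300 * 10), ncol = 10))
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(300) > 0, "a", "b"))

rf  <- randomForest(x = X, y = y, importance = TRUE)
imp <- importance(rf, type = 1)              # mean decrease in accuracy
imp[order(imp, decreasing = TRUE), , drop = FALSE]

cv <- rfcv(trainx = X, trainy = y, cv.fold = 5, recursive = TRUE)
cv$error.cv   # CV error for each number of retained variables

# predict.randomForest is what predict() dispatches to when scoring new data,
# e.g. predict(rf, newdata = some_new_data)
```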
