How to find feature contributions in R?

In R, I trained a model with good performance. It takes time to train, since it has over 200 predictors (features). Is there a way to find out which features contribute most?
Thanks.

This question is too vague as it stands, since there are many ways to determine variable importance depending on the model and the problem at hand. In any case, have a look at what is available in the caret package here:
http://caret.r-forge.r-project.org/varimp.html
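As a model-agnostic illustration of one such approach, here is a minimal sketch of permutation importance in base R. This is a swapped-in technique, not the caret method linked above: the `lm` fit on `mtcars` is only a stand-in for the asker's model, and `rmse` is a hypothetical helper defined for this example. The idea is to shuffle one predictor at a time and see how much the prediction error grows.

```r
# Permutation importance sketch: shuffle one predictor at a time and
# measure how much the prediction error increases.
set.seed(42)

# Stand-in model (assumption): replace with your own fitted model.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hypothetical helper: root mean squared error of a model on a data set.
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}

baseline <- rmse(fit, mtcars)
predictors <- c("wt", "hp", "qsec")

# For each predictor, average the error increase over 20 random shuffles.
importance <- sapply(predictors, function(p) {
  increases <- replicate(20, {
    shuffled <- mtcars
    shuffled[[p]] <- sample(shuffled[[p]])
    rmse(fit, shuffled) - baseline
  })
  mean(increases)
})

print(sort(importance, decreasing = TRUE))
```

For simplicity the sketch scores the model on its own training data; with a held-out set the ranking is usually more trustworthy. With 200 predictors the loop only calls `predict`, so it does not require retraining the model.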

Related

Specification of a mixed model using glmmLasso package

I have a dataset containing repeated measures and quite a lot of variables per observation. Therefore, I need a principled way to select explanatory variables. Regularized regression methods sound like a good fit for this problem.
Upon looking for a solution, I found out about the glmmLasso package quite recently. However, I have difficulties defining a model. I found a demo file online, but since I'm a beginner with R, I had a hard time understanding it.
(demo: https://rdrr.io/cran/glmmLasso/src/demo/glmmLasso-soccer.r)
Since I cannot share the original data, I would suggest you use the soccer dataset (the same dataset used in glmmLasso demo file). The variable team is repeated in observations and should be taken as a random effect.
# sample data
library(glmmLasso)
data("soccer")
I would appreciate it if you could explain the parameters lambda and family, and how to tune them.
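For orientation, here is a rough sketch modeled on the linked demo file, not a definitive recipe. In `glmmLasso()`, `lambda` is the lasso penalty strength (larger values shrink more fixed-effect coefficients to zero), and `family` is an ordinary GLM family object describing the response distribution and link (the default is Gaussian with identity link). The choice of predictors, the Poisson family for `points`, and the lambda grid below are assumptions taken over from or simplified from the demo. One common way to tune `lambda` is to fit over a grid and keep the value with the lowest BIC.

```r
library(glmmLasso)
data("soccer")

# Predictors used in the demo; the lasso penalty assumes comparable scales,
# so they are centered and scaled first.
predictors <- c("transfer.spendings", "ave.unfair.score", "ball.possession",
                "tackles", "ave.attend", "sold.out")
soccer[, predictors] <- scale(soccer[, predictors])

# Coarse grid of penalty values (assumption: range borrowed from the demo).
lambdas <- seq(500, 0, by = -50)
bic <- rep(Inf, length(lambdas))

for (j in seq_along(lambdas)) {
  # try() guards against fits that fail to converge for some lambda.
  fit <- try(
    glmmLasso(points ~ transfer.spendings + ave.unfair.score +
                ball.possession + tackles + ave.attend + sold.out,
              rnd = list(team = ~1),          # random intercept per team
              family = poisson(link = log),   # count-like response
              data = soccer,
              lambda = lambdas[j],
              switch.NR = FALSE, final.re = TRUE),
    silent = TRUE)
  if (!inherits(fit, "try-error")) bic[j] <- fit$bic
}

best_lambda <- lambdas[which.min(bic)]
print(best_lambda)
```

Cross-validation over the same grid is an alternative to BIC if predictive performance is the main goal.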

How to estimate gamma and cost parameters for SVM quickly

I want to train SVMs in R and I know there are functions such as e1071::tune.svm() that can be used to find the optimal parameters for the SVM. However, it seems there are some formulas out there (e.g. used in this report) that can give you a reasonable estimate of these parameters.
Since a grid-search for the parameters can take quite a lot of time on larger datasets and usually, one has to provide a range of possible values anyway, I wondered whether there is a package that implements formulas to get a quick estimate for the gamma and cost parameters for the SVM?
So far, I've found out that caret::train() might use such an approach to estimate sigma (which should be the reciprocal of 2*gamma^2), but I haven't tried it yet, since other calculations are still running (and probably will be for the next few days). Is there also an implementation to estimate cost, or at least to give a range of reasonable values?
I have found a similar question that asks for alternatives to grid-search in general. However, I would be interested in an R implementation of such alternatives and also, I hope things have developed further since the more general question was posted years ago.
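As a starting point, here is a minimal sketch of the kind of estimate caret relies on for radial SVMs: `kernlab::sigest()`, which derives a plausible range for the kernel parameter from quantiles of pairwise distances in the data. The `iris` data is only a stand-in. Note that kernlab writes the kernel as exp(-sigma * ||x - y||^2), so, as far as the parameterization goes, its `sigma` plays the same role as `gamma` in e1071. No comparable closed-form estimate for cost is shown here; that remains an open part of the question.

```r
library(kernlab)

set.seed(1)

# Stand-in data (assumption): numeric predictors only.
x <- as.matrix(iris[, 1:4])

# sigest() returns three values (lower, middle, upper) spanning a range
# of sigma values that usually work reasonably for the RBF kernel.
sigma_range <- sigest(x, frac = 0.5, scaled = TRUE)
print(sigma_range)

# The middle value is a common single-point choice; the outer two can
# bound a much smaller grid than a blind search would need.
sigma_guess <- unname(sigma_range[2])
```

The estimate is based on a random subsample (`frac`), so it is cheap even on larger data sets, but it will vary slightly between runs unless the seed is fixed.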

R: Evaluate Gradient Boosting Machines (GBM) for Regression

Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how should they be interpreted?
I think maybe you are overthinking this one! Take a step back and think about what matters: the error. You have forecasted values and you have observed values. The difference tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. You can find a lot more in the documentation, so if you haven't taken a look, I would recommend doing so! I have posted the link at the bottom.
-Carmine
gbm_documentation
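To make the error measures above concrete, here is a small base-R sketch. The `observed` and `predicted` vectors are made-up numbers for illustration only; in practice `predicted` would come from `predict()` on held-out data.

```r
# Made-up values for illustration (assumption).
observed  <- c(10, 12, 15, 20)
predicted <- c(11, 11, 16, 18)

errors <- observed - predicted

mse  <- mean(errors^2)                  # mean squared error: 1.75
rmse <- sqrt(mse)                       # same units as the response
mae  <- mean(abs(errors))               # mean absolute error: 1.25
mpe  <- mean(errors / observed) * 100   # mean percentage error (signed)

print(c(MSE = mse, RMSE = rmse, MAE = mae, MPE = mpe))
```

MSE and RMSE penalize large misses more heavily than MAE, while MPE keeps the sign and so indicates whether the model tends to over- or under-predict.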

Dealing with class imbalance with mlr3

Lately I have been advised to switch my machine learning framework to mlr3, but I am finding the transition somewhat more difficult than I expected. In my current project I am dealing with highly imbalanced data, which I would like to balance before training my model. I found this tutorial, which explains how to deal with imbalance via pipelines and a graph learner:
https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/
I am afraid that this approach will also perform class balancing when predicting on new data. Why would I want to do this and reduce my testing sample?
So two questions arise:
Am I correct not to balance classes in testing data?
If so, is there a way of doing this in mlr3?
Of course I could just subset the training data manually and deal with imbalance myself but that's just not fun anymore! :)
Anyway, thanks for any answers,
Cheers!
To answer your questions:
I am afraid that this approach will also perform class balancing when predicting on new data.
This is not correct, where did you get this?
Am I correct not to balance classes in testing data?
Class balancing usually works by adding or removing rows (or by adjusting weights). Adding or removing rows should not be applied during the prediction step, as we want exactly one predicted value for each row in the data. Weights, on the other hand, usually have no effect during the prediction phase.
Your assumption is correct.
If so, is there a way of doing this in mlr3?
Just use the PipeOp as described in the blog post.
During training, it will do the specified over- or under- sampling, while it does nothing during the prediction.
Cheers,
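A minimal sketch of this behavior, assuming the mlr3 and mlr3pipelines packages: the `german_credit` task, the rpart learner, and the 70/30 split are arbitrary choices for illustration, and the parameter values follow the pattern used in the blog post rather than any tuned setting.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)

# Example task shipped with mlr3 (assumption: any imbalanced task works).
task <- tsk("german_credit")
train_ids <- sample(task$nrow, 0.7 * task$nrow)
test_ids  <- setdiff(seq_len(task$nrow), train_ids)

# Undersample the majority class down to the size of the minority class.
po_under <- po("classbalancing", id = "undersample",
               adjust = "major", reference = "minor",
               ratio = 1, shuffle = FALSE)

# Wrap the pipeline so it behaves like an ordinary learner.
learner <- GraphLearner$new(po_under %>>% lrn("classif.rpart"))

learner$train(task, row_ids = train_ids)            # balancing happens here
pred <- learner$predict(task, row_ids = test_ids)   # no rows are dropped here

# One prediction per test row.
print(length(pred$response) == length(test_ids))
```

Because the balancing step lives inside the graph learner, resampling with `resample()` or `benchmark()` applies it to each training fold only, so the test folds keep their original class distribution.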

Can ensemble classifiers underperform the best single classifier?

I have recently run an ensemble classifier in MLR (R) on a multicenter data set. I noticed that the ensemble over three classifiers (which were trained on different data modalities) was worse than the best single classifier.
This was unexpected to me. I was using logistic regressions (without any parameter optimization) as simple base classifiers and a Partial Least Squares (PLS) Discriminant Analysis as the super learner, since the base-learner predictions ought to be correlated. I also tested different super learners, such as naive Bayes (NB) and logistic regression. The results did not change.
Here are my specific questions:
1) Do you know whether this can occur in principle?
(I also googled a bit and found this blog, which seems to indicate that it can:
https://blogs.sas.com/content/sgf/2017/03/10/are-ensemble-classifiers-always-better-than-single-classifiers/)
2) Especially if you are as surprised as I was: do you know of any checks I could do in mlr to make sure that there isn't a bug? I have tried a different cross-validation scheme (originally I used leave-center-out CV, but since some centers provided very little data, I wasn't sure whether this might lead to weird model fits of the super learner), but the result still holds. I also tried combining different data modalities, and they show the same phenomenon.
I would be grateful to hear whether you have experienced this and, if not, whether you know what the problem could be.
Thanks in advance!
Yes, this can happen: ensembles are not guaranteed to give a better result than the best single model. Cases where this can happen are also discussed in this Cross Validated question
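A toy illustration of how this can come about, using hand-made predictions and a simple majority vote (an assumption for illustration; the question used a stacked super learner, not voting): when two weaker classifiers make the same mistakes, they can outvote a much better third one.

```r
# Ten cases with known labels (made-up data).
truth <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

# Hypothetical helper: flip the labels at the given positions.
flip <- function(labels, idx) {
  labels[idx] <- 1 - labels[idx]
  labels
}

pred_a <- flip(truth, 1)               # strong classifier: 1 error
pred_b <- flip(truth, c(2, 3, 6, 7))   # weak classifier: 4 errors
pred_c <- flip(truth, c(2, 3, 6, 7))   # weak, with the same 4 errors

# Majority vote over the three classifiers.
vote <- as.numeric(pred_a + pred_b + pred_c >= 2)

accuracy <- function(p) mean(p == truth)
print(c(a = accuracy(pred_a), b = accuracy(pred_b),
        c = accuracy(pred_c), ensemble = accuracy(vote)))
# a = 0.9, b = 0.6, c = 0.6, ensemble = 0.6
```

The ensemble helps most when the base learners are individually decent and make different errors; strongly correlated errors or a small sample for fitting the combiner can erase the benefit.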
