I'm trying to perform a multivariate lasso regression on a dataset with 300 independent variables and 11 response variables using glmnet library. I'd like to group some of the input variables and then apply multivariate grouped lasso regression so that all the grouped variables are either selected or discarded by the lasso model depending on their significance. How can I achieve this? I did look into grplasso package but it doesn't support multivariate regression.
I assume you mean multinomial regression, since you have a multiclass problem (11 classes). In addition, you want to apply the group lasso. My recommendation is the msgl package, because it supports the group lasso, the sparse group lasso, and the regular lasso as well. You choose among them through the alpha parameter:

alpha: the α value; 0 gives the group lasso, 1 gives the lasso, and values between 0 and 1 give a sparse group lasso penalty.
You can use it for binary classification or multiclass classification, as in your problem. You can also tune your lambda using cross-validation with the same package. The documentation is pretty clear, and there is a nice getting-started page with an example of how to group your variables and perform your analysis. In my personal experience with this package, it is incredibly fast, but it is not as user-friendly as the glmnet package.
One more thing: msgl depends on another package, sglOptim, which needs to be installed as well.
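To make this concrete, here is a minimal sketch assuming the msgl 2.x interface (msgl::lambda, msgl::cv, msgl::fit) and a hypothetical grouping of your 300 predictors into 30 blocks of 10; check the getting-started page for the exact signatures in your installed version:

```r
library(msgl)  # also pulls in sglOptim

# x: n x 300 numeric matrix of predictors
# y: factor with 11 classes
# grouping: one group label per column of x (hypothetical 30 groups of 10)
grouping <- rep(1:30, each = 10)

# Candidate lambda sequence: 50 values, down to 5% of lambda_max
lambda <- msgl::lambda(x, y, grouping = grouping,
                       alpha = 0,            # alpha = 0 => pure group lasso
                       d = 50, lambda.min = 0.05)

# 10-fold cross-validation over the lambda sequence
fit.cv <- msgl::cv(x, y, grouping = grouping, alpha = 0,
                   lambda = lambda, fold = 10)

# Final fit; whole groups of coefficients enter or leave the model together
fit <- msgl::fit(x, y, grouping = grouping, alpha = 0, lambda = lambda)
```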
Related
I know that the R packages `tidymodels` and `caret` both support multinomial classification. However, does either package support ordinal regression/classification? Here is what I mean by "ordinal regression/classification":
Regular regression: estimate numbers, e.g., numeric values ranging from 1 to 250
Binary classification: estimate binary values, e.g., 0 or 1; positive or negative; yes or no
Multinomial classification: estimate one of more than two categories, e.g., rock|paper|scissors; red|white|green. Crucially, there is no natural order among these categories.
Ordinal regression/classification: estimate ordered categories, e.g., 1st|2nd|3rd; gold|silver|bronze|no-medal; very-low|low|medium|high|very-high. Crucially, there IS a natural order among the categories.
Ordered categories are treated either as regression or as classification tasks, both of which are inappropriate. They are often treated as regression problems by converting the categories to numbers (numeric type in R), but this is not always appropriate, especially when there is no consistently meaningful interval distance between categories or when decimal fractions between the categories are meaningless in the real world. (E.g., if gold|silver|bronze|no-medal is converted to 3|2|1|0, what would an estimated score of 2.5 mean? Is the distance between gold and silver the same as the distance between bronze and no medal?) Ordered categories are also often treated as classification problems by encoding them as an unordered factor in R and then simply carrying out multinomial classification. But this is also unsatisfactory because it completely ignores the order among the categories, losing a vital part of the real-world information that should help improve the performance of the classification.
So, the appropriate technique for estimating such outcomes is to encode them as ordered factors in R and then apply ordinal regression (also called ordinal classification), a form of logistic regression adapted to ordinal categorical outcomes. R implements ordinal regression in the MASS::polr and rms::orm functions, among others. This is a classification technique that preserves and uses the order information of the outcome variable to improve the estimates beyond what regular regression or multinomial classification can do.
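For reference, a minimal sketch of the MASS::polr route on hypothetical data (the variable names and the data-generating step are made up for illustration):

```r
library(MASS)

# Hypothetical data: an ordered outcome driven by two predictors
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$medal <- cut(d$x1 + d$x2 + rnorm(100), breaks = 4,
               labels = c("no-medal", "bronze", "silver", "gold"),
               ordered_result = TRUE)  # encode as an ordered factor

# Proportional-odds logistic regression; Hess = TRUE keeps the Hessian
# so summary() can report standard errors
fit <- polr(medal ~ x1 + x2, data = d, Hess = TRUE)
summary(fit)
```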
However, these packages are normally used for statistical analyses rather than for machine learning, and machine learning frameworks in R like tidymodels and caret do not seem to have interfaces for them. So, is there any built-in support for ordinal regression/classification in `tidymodels`, `caret`, or another machine learning framework in R?
There is an entire section of this book devoted to ordinal outcomes in caret. Have a look there.
I was wondering if anyone knows of an R package that would allow me to fit an ordinal logistic regression with LASSO regularization or, alternatively, a beta regression, still with the LASSO? And if you also know of a nice tutorial to help me code that in R (with appropriate cross-validation), that would be even better!
Some context: my response variable is a satisfaction score between 0 and 10 (actually, the values lie between 2 and 10), so I can model it with a beta regression or convert its values into ranked categories. My interest is in identifying the important variables explaining this score, but as I have too many potential explanatory variables (p = 12) relative to my sample size (n = 105), I need a penalized regression method for model selection, hence my interest in the LASSO.
The ordinalNet package does this. There's a paper with an example here:
https://www.jstatsoft.org/article/download/v099i06/1440
Also the glmnetcr package: https://cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf
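A minimal sketch of the ordinalNet route, assuming x is a numeric predictor matrix and the score has been binned into ordered categories (the cut points and names below are made up for illustration):

```r
library(ordinalNet)

# Hypothetical: bin the 2-10 satisfaction score into ordered categories
y <- cut(score, breaks = c(1, 4, 6, 8, 10),
         labels = c("low", "medium", "high", "very-high"),
         ordered_result = TRUE)

# Cumulative-logit (proportional-odds) model with a lasso penalty
fit <- ordinalNet(as.matrix(x), y, family = "cumulative",
                  link = "logit", alpha = 1)  # alpha = 1 => lasso
summary(fit)

# Cross-validation over the lambda sequence
cvfit <- ordinalNetTune(as.matrix(x), y, family = "cumulative",
                        link = "logit", alpha = 1, nFolds = 5)
```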
I'm running code with around 200,000 observations, of which 10,000 were treated; I'm trying to match the remainder using the MatchIt package.
Because of one of these variables, a warning message appears and I don't know whether I should just ignore it. The message is: glm.fit: fitted probabilities numerically 0 or 1 occurred
The code that I'm running is similar to the one below:

```r
m.out <- matchit(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
                 data = mydata, method = "nearest",
                 exact = c("VAR1", "VAR3", "VAR5"))
```
For illustration, let's say the variable with the issue is VAR5. It is a character variable with about 200 distinct text values. So, my question is whether this warning is a real problem, or whether it just means this variable has too many levels for the size of my data, making a treatment/control prediction impossible to find. Either way, is there something I can do to avoid this warning?
MatchIt by default uses logistic regression through the glm function to estimate propensity scores. This warning means that the logistic regression model has been overfit, with some variables perfectly predicting treatment status. This may indicate a violation of positivity (i.e., your two groups are fundamentally different from each other), but, as you mentioned, it could just be that a relatively unimportant feature has many categories, and some of these perfectly overlap with treatment. There are a few ways to handle this problem; one of them is indeed to drop VAR5, but you can also try to estimate your own propensity scores outside MatchIt using a method that doesn't suffer from this problem and then supply those propensity scores to matchit() through the distance argument.
Two methods come to mind. The first is to use brglm2, a package that implements an alternate method of fitting logistic regression models so that fitted probabilities are never 0 or 1. This method is easy to implement because it just uses a slight variation of the glm function.
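A minimal sketch of the brglm2 route, reusing the hypothetical variable names from the question (brglm2 plugs into the standard glm call through its method argument):

```r
library(brglm2)
library(MatchIt)

# Bias-reduced logistic regression: fitted probabilities stay strictly
# inside (0, 1), so separation no longer produces 0/1 fitted values
ps.fit <- glm(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
              data = mydata, family = binomial("logit"),
              method = "brglmFit")

# Supply the estimated propensity scores to matchit() via `distance`
m.out <- matchit(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
                 data = mydata, method = "nearest",
                 distance = fitted(ps.fit))
```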
A second is to use a machine learning method that performs regularization (i.e., variable selection) so that only the variables and levels of the factors that are important for the analysis are included. You could use glmnet to perform lasso or elastic net logistic regression, you could use gbm or twang to do generalized boosted modeling, or you could use SuperLearner to stack several machine learning methods and take the best predictions from them. You can then supply the predicted values to matchit().
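And a sketch of the regularization route with glmnet (a lasso here, via alpha = 1; the variable names are again the hypothetical ones from the question):

```r
library(glmnet)
library(MatchIt)

# Model matrix for the covariates; drop the intercept column
x <- model.matrix(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
                  data = mydata)[, -1]

# Cross-validated lasso logistic regression for the propensity score
cvfit <- cv.glmnet(x, mydata$var, family = "binomial", alpha = 1)
ps <- as.numeric(predict(cvfit, newx = x,
                         s = "lambda.min", type = "response"))

# Pass the predicted propensity scores to matchit()
m.out <- matchit(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
                 data = mydata, method = "nearest", distance = ps)
```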
Apologies in advance for no data samples:
I built a random forest of 128 trees with no tuning, using one binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against that of an already-built forest when predicting on cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random forest is what's known as a "black box" learning algorithm, because there is no straightforward way to interpret the relationship between input and outcome variables. You can, however, use something like a variable importance plot or a partial dependence plot to get a sense of which variables contribute most to the predictions.
Here are some discussions on variable importance plots, also here and here. They are implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of the plot depends on the metric you use to assess variable importance. For example, with MeanDecreaseAccuracy, a high value for a variable means that randomly permuting its values substantially increases the model's classification error; in other words, the model relies heavily on that variable.
Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
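A minimal sketch of both plots with randomForest, assuming a data frame mydata with a binary factor outcome y and predictors x1..x4 (hypothetical names matching the question's setup):

```r
library(randomForest)

# y must be a factor for classification; x1..x4 are continuous predictors
rf <- randomForest(y ~ x1 + x2 + x3 + x4, data = mydata,
                   ntree = 128, importance = TRUE)

# Permutation importance (MeanDecreaseAccuracy) and Gini importance
varImpPlot(rf)

# Partial dependence of the prediction on x1
partialPlot(rf, pred.data = mydata, x.var = "x1")
```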
In practice, 4 explanatory variables is not many, so you can simply run a binary logistic regression (possibly with L2 regularization) for a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. This is implemented in the glmnet package. Basically, L2 regularization, also known as ridge, adds a penalty term to your loss function that shrinks the coefficients to reduce variance, at the expense of increased bias. This effectively reduces prediction error if the reduction in variance more than compensates for the bias (which is often the case). Since you only have 4 input variables, I suggest L2 instead of L1 (also known as lasso, which also performs automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
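A minimal sketch of the ridge route with cv.glmnet, again assuming the hypothetical column names y (binary) and x1..x4 in mydata:

```r
library(glmnet)

x <- as.matrix(mydata[, c("x1", "x2", "x3", "x4")])
y <- mydata$y  # binary outcome (factor or 0/1)

# alpha = 0 => ridge (L2); cv.glmnet tunes the shrinkage parameter lambda
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0)

# Shrunken coefficients at the CV-selected lambda
coef(cvfit, s = "lambda.min")
```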
Within R's rpart package for classification/regression trees, is it possible to specify prior weights for the predictor variables? Alternatively, is this possible with the BART (Bayesian Additive Regression Trees) package, random forests, or any other package in R?
Based on expert opinion, I would like to force certain predictor variables to be included.