Ordinal logistic regression (or Beta regression) with a LASSO regularization in R?

Does anyone know of an R package that would allow me to fit an ordinal logistic regression with a LASSO regularization or, alternatively, a Beta regression with the LASSO? And if you also know of a good tutorial showing how to code that in R (with appropriate cross-validation), that would be even better!
Some context: My response variable is a satisfaction score between 0 and 10 (actually, values lie between 2 and 10), so I can model it with a Beta regression (after rescaling it to the open interval (0, 1)) or convert its values into ranked categories. My goal is to identify important variables explaining this score, but as I have many potential explanatory variables (p = 12) relative to my sample size (n = 105), I want to use a penalized regression method for model selection, hence my interest in the LASSO.

The ordinalNet package does this. There's a paper with examples here:
https://www.jstatsoft.org/article/download/v099i06/1440
Also the glmnetcr package: https://cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf
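To make the ordinalNet route concrete, here is a minimal hedged sketch. It assumes hypothetical objects `predictors` (a numeric matrix of your 12 candidate variables) and `satisfaction` (your score); the function names are from the ordinalNet package, but check its reference manual for the exact defaults in your version:

```r
# Sketch: LASSO-penalized cumulative-logit (ordinal) model with ordinalNet.
# 'predictors' and 'satisfaction' are placeholders for your own data.
library(ordinalNet)

x <- as.matrix(predictors)
y <- factor(satisfaction, ordered = TRUE)  # score recoded as ordered categories

# alpha = 1 gives the pure LASSO penalty (alpha = 0 would be ridge)
fit <- ordinalNet(x, y, family = "cumulative", link = "logit", alpha = 1)
coef(fit)  # variables with nonzero coefficients are the ones "selected"

# Cross-validated tuning of the penalty strength lambda
cvfit <- ordinalNetCV(x, y, family = "cumulative", link = "logit", alpha = 1)
summary(cvfit)
```

With n = 105 and p = 12, the cross-validation folds stay reasonably sized, but it is worth repeating the CV with different seeds to check that the selected variable set is stable.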

Related

How to find RERI for additive scale interaction in a cox regression model in R

I would like to find the RERI of an additive interaction for a Cox regression model in R. Any idea whether it's possible and, if so, how to do it?
I know it can be done with the stdReg package, which is well explained here https://link.springer.com/article/10.1007/s10654-018-0375-y#Sec11, but the authors explain how to do it with two binary variables, and I would like to do it with one continuous variable and one binary variable.

Does the function multinom() from R's nnet package fit a multinomial logistic regression, or a Poisson regression?

The documentation for the multinom() function from the nnet package in R says that it "[f]its multinomial log-linear models via neural networks" and that "[t]he response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes." Even when I go to add a tag for nnet on this question, the description says that it is software for fitting "multinomial log-linear models."
Granting that statistics has wildly inconsistent jargon that is rarely operationally defined by whoever is using it, the documentation for the function even mentions having a count response and so seems to indicate that this function is designed to model count data. Yet virtually every resource I've seen treats it exclusively as if it were fitting a multinomial logistic regression. In short, everyone interprets the results in terms of logged odds relative to the reference (as in logistic regression), not in terms of logged expected count (as in what is typically referred to as a log-linear model).
Can someone clarify what this function is actually doing and what the fitted coefficients actually mean?
nnet::multinom is fitting a multinomial logistic regression as I understand...
If you check the source code of the package (https://github.com/cran/nnet/blob/master/R/multinom.R and https://github.com/cran/nnet/blob/master/R/nnet.R), you will see that the multinom function does indeed accept counts (a common input format for a multinomial regression model; see also the MGLM or mclogit packages, for example), and that it fits the multinomial regression model using a softmax transform to go from predictions on the additive log-ratio scale to predicted probabilities. The softmax transform is exactly the inverse link function of a multinomial regression model. The way the multinom model predictions are obtained (cf. predictions from nnet::multinom) is also exactly what you would expect for a multinomial regression model using an additive log-ratio parameterization, i.e. with one outcome category as the baseline.
That is, the coefficients predict the logged odds relative to the reference baseline category (i.e. it is doing a logistic regression), not the logged expected counts (like a log-linear model).
This is shown by the fact that model predictions are calculated as
fit <- nnet::multinom(...)
X <- model.matrix(fit) # covariate matrix / design matrix
betahat <- t(rbind(0, coef(fit))) # model coefficients, with explicit zero row added for the reference category, then transposed
preds <- mclustAddons::softmax(X %*% betahat)
Furthermore, I verified that the vcov matrix returned by nnet::multinom matches the one given by the standard formula for the covariance matrix of a multinomial regression model; see Faster way to calculate the Hessian / Fisher Information Matrix of a nnet::multinom multinomial regression in R using Rcpp & Kronecker products.
Is it not the case that a multinomial regression model can always be reformulated as a Poisson loglinear model (i.e. as a Poisson GLM) using the Poisson trick (glmnet e.g. uses the Poisson trick to fit multinomial regression models as a Poisson GLM)?
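To make the claim above concrete, here is a small runnable check, using the built-in iris data as a stand-in, that multinom's fitted coefficients are log-odds relative to a baseline category: the hand-written softmax of the linear predictor should reproduce predict(..., type = "probs"):

```r
# Verify that multinom() predictions equal the softmax of the linear
# predictor, with the first factor level as the baseline category.
library(nnet)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

X <- model.matrix(~ Sepal.Length + Sepal.Width, data = iris)  # design matrix
B <- t(rbind(0, coef(fit)))            # explicit zero column for the baseline class
eta <- X %*% B                         # log-odds relative to the baseline
probs <- exp(eta) / rowSums(exp(eta))  # softmax, written out by hand

all.equal(unname(probs), unname(predict(fit, type = "probs")), tolerance = 1e-6)
```

If the two matrices agree, that confirms the log-odds (logistic) interpretation of the coefficients rather than a logged-expected-count one.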

Does glmnet package support multivariate grouped lasso regression?

I'm trying to perform a multivariate lasso regression on a dataset with 300 independent variables and 11 response variables using glmnet library. I'd like to group some of the input variables and then apply multivariate grouped lasso regression so that all the grouped variables are either selected or discarded by the lasso model depending on their significance. How can I achieve this? I did look into grplasso package but it doesn't support multivariate regression.
I assume you mean multinomial regression, since you have a multiclass problem (11 classes). In addition, you want to apply the group lasso. My recommendation is the msgl package, because it supports the group lasso, the sparse group lasso, and the regular lasso as well. You choose between them via the alpha parameter:
alpha: 0 for group lasso, 1 for lasso; values between 0 and 1
give a sparse group lasso penalty.
You can use it for binary or multiclass classification, as in your problem. You can also tune lambda via cross-validation with the same package. The documentation is pretty clear, and there is a nice getting-started page with an example of how to group your variables and run the analysis. In my personal experience the package is incredibly fast, but it is not as user-friendly as the glmnet package.
One more thing: the package depends on another prerequisite package that needs to be installed as well, sglOptim.
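A hedged sketch of the msgl workflow described above. The function names shown (msgl::lambda, msgl::cv) are from recent versions of the package and may differ in older releases, so treat this as an outline and check the package's getting-started vignette; `x`, `classes`, and `grouping` are placeholders for your own data:

```r
# Sketch: group lasso for multiclass classification with msgl.
# 'x' is the 300-column predictor matrix, 'classes' the response factor,
# and 'grouping' a length-300 vector assigning each column to a group.
library(msgl)

# Compute a decreasing lambda sequence for the chosen penalty
lam <- msgl::lambda(x, classes, grouping = grouping,
                    alpha = 0,        # 0 = group lasso, 1 = lasso
                    d = 50,           # length of the lambda sequence
                    lambda.min = 0.05)

# 10-fold cross-validation over that sequence
cvfit <- msgl::cv(x, classes, grouping = grouping,
                  alpha = 0, lambda = lam, fold = 10)

cvfit  # inspect cross-validation error to pick the best lambda
```

With alpha = 0, whole groups of coefficients enter or leave the model together, which is exactly the all-or-nothing selection behavior asked about.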

Fusing learners with preprocessing in mlr - what settings to use?

I am conducting a benchmark analysis comparing different learners (logistic regression, gradient boosting, random forest, extreme gradient boosting) with the mlr package.
I understand that there are two different types of preprocessing (data- and learner-dependent vs. independent). Now I would like to conduct the data-dependent preprocessing using mlr's wrapper function makePreprocWrapperCaret().
However, I am unsure about the settings. As far as I understand, I should impute missing values with the median (or mean) for logistic regression, but for tree-based models, for example, with very large values.
Question 1) How would I impute NAs with very large values in the code below (for the tree-based models)?
Next, as far as I understand, I should cut off outliers for the logistic regression (e.g. at the 1st and 99th percentiles). However, for tree-based models that is not necessary.
Question 2) How can I cut off outliers (e.g. at the 1st and 99th percentiles) in the code below?
Lastly (again, if I understood correctly), I should standardize the data for the logistic regression. However, I can only find the "center" option within makePreprocWrapperCaret(), which is not exactly the same.
Question 3) How can I standardize in the code below?
Many thanks in advance!!
lrn_logreg = makePreprocWrapperCaret("classif.logreg", method = c("medianImpute")) # logistic regression --> include standardization + outlier cutoff
lrn_gbm = makePreprocWrapperCaret("classif.gbm") # gradient boosting --> include imputation with large values
lrn_rf = makePreprocWrapperCaret("classif.randomForest") # random forest --> include imputation with large values
lrn_xgboost = makePreprocWrapperCaret("classif.xgboost") # eXtreme gradient boosting --> include imputation with large values
You can have a look at the mlr tutorial for imputation: https://mlr.mlr-org.com/articles/tutorial/impute.html
1) You can use mlr's makeImputeWrapper(). For the maximum, you can use imputeMax() inside makeImputeWrapper().
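A short sketch of point 1. Per the mlr imputation docs, imputeMax(multiplier = m) imputes max(x) + m * diff(range(x)), i.e. a value well above anything observed, which tree-based models can exploit as a "missingness" signal; the learner name is taken from the question's code:

```r
# Question 1: impute NAs in numeric/integer features with a very large value
# (max + 2 * range here) for the tree-based learners.
library(mlr)

lrn_gbm = makeImputeWrapper("classif.gbm",
                            classes = list(numeric = imputeMax(multiplier = 2),
                                           integer = imputeMax(multiplier = 2)))
```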
2) For cutting off the highest and lowest values you can write your own preprocWrapper: https://mlr.mlr-org.com/articles/tutorial/preproc.html
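For point 2, here is a sketch of such a custom wrapper, following the train/predict pattern from the mlr preprocessing tutorial linked above. It caps (winsorizes) numeric features at the 1st and 99th percentiles learned on the training data, then applies the same caps at prediction time; treat the details as an assumption-laden outline rather than a polished implementation:

```r
# Question 2: winsorize numeric features at the 1%/99% quantiles via a
# custom preprocessing wrapper.
library(mlr)

trainfun = function(data, target, args) {
  num = setdiff(names(data)[sapply(data, is.numeric)], target)
  lo = sapply(data[num], quantile, probs = 0.01, na.rm = TRUE)
  hi = sapply(data[num], quantile, probs = 0.99, na.rm = TRUE)
  for (j in num) data[[j]] = pmin(pmax(data[[j]], lo[j]), hi[j])
  # 'control' carries the learned cutoffs to the predict step
  list(data = data, control = list(lo = lo, hi = hi))
}

predictfun = function(data, target, args, control) {
  for (j in names(control$lo))
    data[[j]] = pmin(pmax(data[[j]], control$lo[j]), control$hi[j])
  data
}

lrn_logreg = makePreprocWrapper("classif.logreg",
                                train = trainfun, predict = predictfun)
```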
3) For normalization there is already a preprocessing helper: normalizeFeatures().
See also here: https://mlr.mlr-org.com/reference/normalizeFeatures.html
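For point 3, note also that makePreprocWrapperCaret() accepts both a centering and a scaling option, and combining the two amounts to standardization (subtract the mean, divide by the standard deviation). A sketch, assuming the ppc.center/ppc.scale parameters documented for that wrapper:

```r
# Question 3: center + scale = standardize the features for logistic regression.
library(mlr)

lrn_logreg = makePreprocWrapperCaret("classif.logreg",
                                     ppc.center = TRUE, ppc.scale = TRUE)
```

Alternatively, normalizeFeatures(task, method = "standardize") standardizes the task itself before any learner is applied.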

Multivariate Analysis on random forest results

Apologies in advance for no data samples:
I built a random forest of 128 trees with no tuning, with one binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against an already-built forest predicting on cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random Forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between input and outcome variables. You can however use something like the variable importance plot or partial dependence plot to give you a sense of what variables are contributing the most in making predictions.
Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of this plot depends on the metric you are using to assess variable importance. For example if you use MeanDecreaseAccuracy, a high value for a variable would mean that on average, a model that includes this variable reduces classification error by a good amount.
Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with L2 regularization) for a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients to reduce variance, at the expense of increased bias. This effectively reduces prediction error if the reduction in variance more than compensates for the bias (which is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as the lasso, which additionally does automatic feature selection). See this answer for tuning the ridge and lasso shrinkage parameter using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
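The importance and partial-dependence tools mentioned above can be sketched as follows, using simulated stand-in data (`d` with four predictors and a binary outcome driven by an x1-by-x2 interaction, mimicking the situation where univariate analysis finds nothing):

```r
# Sketch: variable importance and partial dependence from randomForest.
library(randomForest)

set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                x3 = rnorm(200), x4 = rnorm(200))
# Outcome depends on an interaction, not on any single variable alone
d$y <- factor(ifelse(d$x1 * d$x2 + rnorm(200, sd = 0.5) > 0, "yes", "no"))

fit <- randomForest(y ~ ., data = d, ntree = 128, importance = TRUE)

varImpPlot(fit)                    # which variables drive predictions
partialPlot(fit, d, x.var = "x1")  # marginal effect of x1 on the prediction
```

For a pure interaction like this, expect x1 and x2 to rank high in importance even though each is uninformative univariately; plotting partial dependence of x1 separately within subsets of x2 is one simple way to surface the interaction itself.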
