Automatic scaling of predictors in glmnet - r

In An Introduction to Statistical Learning, James and colleagues state
"In contrast, the ridge regression coefficient estimates can change substantially
when multiplying a given predictor by a constant. Therefore, it is best to apply ridge regression after standardizing the predictors."
I am using the glmnet package to conduct ridge and lasso regression. However, none of the predictors that were highly significant in a backward stepwise regression have non-zero coefficients when I use the glmnet() and cv.glmnet() functions. I am willing to accept that the stepwise regression may have delivered spurious results (there are MANY posts warning against it), but I want to make certain that the lack of even a single non-zero predictor in the lasso procedure is due to the flaws in stepwise regression rather than some scaling error on my part.
I have read that the glmnet function scales and then unscales predictors automatically, 'under the hood' as it were. Can anyone verify this?
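One way to check this empirically is to rescale a predictor and compare the fits. The sketch below uses simulated data (an assumption, not the data from the question) and relies on the `standardize` argument of glmnet(), which defaults to TRUE, with coefficients reported on the original scale of the predictors. If that is what happens, multiplying a column by a constant should leave the predictions unchanged and simply divide that column's coefficient by the constant.

```r
library(glmnet)

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), n, 3)
y <- 2 * x[, 1] - x[, 2] + rnorm(n)

# Same data, but with the first predictor multiplied by 1000.
x_big <- x
x_big[, 1] <- x_big[, 1] * 1000

# standardize = TRUE is the default; it is spelled out here for clarity.
fit_raw <- glmnet(x, y, alpha = 1, standardize = TRUE)
# Reuse the same lambda path so the two fits are directly comparable.
fit_big <- glmnet(x_big, y, alpha = 1, standardize = TRUE,
                  lambda = fit_raw$lambda)

s <- fit_raw$lambda[10]
b_raw <- drop(as.matrix(coef(fit_raw, s = s)))
b_big <- drop(as.matrix(coef(fit_big, s = s)))

# Undo the factor of 1000 on the first predictor's coefficient
# (position 2, after the intercept) and compare side by side.
b_big_back <- b_big * c(1, 1000, 1, 1)
print(cbind(raw = b_raw, rescaled_back = b_big_back))

# Predictions from the two fits at the same lambda.
p_raw <- drop(predict(fit_raw, newx = x, s = s))
p_big <- drop(predict(fit_big, newx = x_big, s = s))
print(max(abs(p_raw - p_big)))
```

Setting `standardize = FALSE` in the second fit is a quick way to see what a genuine scaling problem would look like.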

Related

Can GLMNet perform Logistic regression?

I am using GLMNet to perform LASSO on binary logistic models, with cv.GLMNet to test selection consistency, and would like to compare its performance with plain GLM logistic regression. For fairness' sake in comparing the outputs, I would like to use GLMNet to perform this regression. However, I am unable to find any way to do so, other than using GLMNet with alpha = 0 and lambda = 0 (ridge without a penalty). I am unsure about this method: it seems slightly janky, GLMNet's manual discourages supplying a single lambda value (for speed reasons), and it gives me no z-values to determine the confidence level of the coefficients. (Essentially, the ideal output would be something similar to what R's GLM function produces.)
I've read through the entire manual and can't find a method of doing this. Is there a way to perform logistic regression with GLMNet, without the penalty, in order to get output similar to R's GLM?
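The alpha = 0, lambda = 0 approach described above can be sketched as follows. The data are simulated (an assumption), and the tighter `thresh` convergence setting is a choice made here to bring the two sets of estimates closer; it is not something the question specifies.

```r
library(glmnet)

set.seed(2)
n <- 500
x <- matrix(rnorm(n * 3), n, 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- rbinom(n, 1, plogis(0.5 + x[, 1] - 0.5 * x[, 2]))

# Unpenalized logistic regression with base R.
fit_glm <- glm(y ~ x, family = binomial)

# glmnet with the penalty switched off (lambda = 0).
fit_net <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0,
                  thresh = 1e-12)

b_glm <- unname(coef(fit_glm))
b_net <- unname(drop(as.matrix(coef(fit_net))))
print(cbind(glm = b_glm, glmnet = b_net))

# Standard errors and z-values come only from the glm fit.
print(summary(fit_glm)$coefficients)
```

The point estimates should be close, but only glm() reports standard errors and z-values, which is the limitation the question runs into.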

sLDA for predicting categorical response instead of continuous in R

I have a collection of documents, that might have latent topics associated with them. It is likely that each document might relate to one or more topics. I have a master file of all possible "topics"/categories and descriptions to these topics. I am seeking to create a model that predicts the topics for each document.
I could potentially use Supervised text classification using RTextTools, but that would only help me categorize documents to belong to one category or another. I am seeking to find a solution that would not only help me determine the topic proportions to the document, but also give the term-topic/category distributions.
sLDA seems like a good fit, but it seems to only predict continuous variable outcomes instead of categorical.
LDA (here meaning linear discriminant analysis) is more of a classification method, predicting classes. Another option is multinomial logistic regression. LDA can be harder to train than multinomial logistic regression, and the improvement in fit it offers may be small.
update: LDA is a classification method. Unlike logistic regression, where you directly model Pr(Y = k|X = x) using the logit link, LDA uses Bayes' theorem for prediction. It is generally more popular than logistic regression (and its multi-class extension, multinomial logistic regression) for multi-class problems.
LDA assumes that the observations in each class are drawn from a Gaussian distribution with a common covariance matrix, and so can provide some improvement over logistic regression when this assumption approximately holds. In contrast, logistic regression can outperform LDA if these Gaussian assumptions do not hold. To sum up, while both are appropriate for developing linear classification models, linear discriminant analysis makes more assumptions about the underlying data than logistic regression does, which makes logistic regression the more flexible and robust method when those assumptions do not hold. So what I meant was that it is important to understand your data well and see which method might fit it better. A good source for reading about and comparing classification methods:
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
I suggest the classification chapter of An Introduction to Statistical Learning. Hope this helps.
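As a rough illustration of the comparison in this answer, the sketch below fits linear discriminant analysis and multinomial logistic regression to the built-in iris data. The dataset is an assumption chosen for convenience and has nothing to do with the document collection in the question. It uses lda() from MASS and multinom() from nnet.

```r
library(MASS)  # lda()
library(nnet)  # multinom()

fit_lda <- lda(Species ~ ., data = iris)
fit_mlr <- multinom(Species ~ ., data = iris, trace = FALSE)

# Class membership probabilities from each model.
post_lda <- predict(fit_lda, iris)$posterior
post_mlr <- predict(fit_mlr, iris, type = "probs")

# In-sample accuracy, for a quick side-by-side look only.
acc_lda <- mean(predict(fit_lda, iris)$class == iris$Species)
acc_mlr <- mean(predict(fit_mlr, iris) == iris$Species)
print(c(lda = acc_lda, multinom = acc_mlr))
```

In-sample accuracy flatters both models; a held-out set or cross-validation would be needed for a fair comparison.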

can we get probabilities the same way that we get them in logistic regression through random forest?

I have a data set with a binary 0-1 variable (click & purchase; click & not-purchase) against a vector of attributes. I used logistic regression to get the probabilities of purchase. How can I use Random Forest to get the same probabilities? Is it by using Random Forest regression, or is it Random Forest classification with type='prob' in R, which gives the probability of a categorical variable?
It won't give you the same result, since the structures of the two methods are different. Logistic regression is given by an explicit linear specification, whereas RF is a collective vote from multiple randomized trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here are the major differences between the two:
RF tends to give a fit that is more robust to noise, outliers, overfitting, multicollinearity and so on, which are common pitfalls in regression-type solutions. Basically, if you don't know or don't want to know much about what's going on with the input data, RF is a good start.
Logistic regression will be good if you know the data well and know how to properly specify the equation, or if you want to engineer how the fit/prediction works. The explicit form of the GLM specification allows you to do that.
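A minimal sketch of the type = "prob" route mentioned in the question, assuming the randomForest package and simulated click/purchase data (the variable names x1, x2 and purchase are hypothetical). The response must be a factor for the forest to run in classification mode.

```r
library(randomForest)

set.seed(3)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$purchase <- factor(rbinom(n, 1, plogis(-0.5 + 1.2 * dat$x1)),
                       levels = c(0, 1))

# Logistic regression probabilities of purchase.
fit_glm <- glm(purchase ~ x1 + x2, data = dat, family = binomial)
p_glm <- predict(fit_glm, newdata = dat, type = "response")

# Random forest in classification mode.
fit_rf <- randomForest(purchase ~ x1 + x2, data = dat, ntree = 500)

# Without newdata, predict() returns out-of-bag predictions;
# type = "prob" gives the fraction of tree votes for each class.
p_rf <- predict(fit_rf, type = "prob")[, "1"]

print(head(cbind(glm = p_glm, rf = p_rf)))
print(cor(p_glm, p_rf))
```

The forest's "probabilities" are vote fractions, so they will be comparable to, but not the same as, the fitted probabilities from glm().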

Why is it inadvisable to get statistical summary information for regression coefficients from glmnet model?

I have a regression model with binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (selected variables and their coefficients) to glm to get that information (standard errors, etc.).
I searched the R documentation, and it seems I can use the "method" option in glm to specify a user-defined function.
But I failed to do so. Could someone help me with this?
"It is a very natural question to ask for standard errors of regression
coefficients or other estimated quantities. In principle such standard
errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for
this is that standard errors are not very meaningful for strongly
biased estimates such as arise from penalized estimation methods.
Penalized estimation is a procedure that reduces the variance of
estimators by introducing substantial bias. The bias of each estimator
is therefore a major component of its mean squared error, whereas its
variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is
impossible to obtain a sufficiently precise estimate of the bias. Any
bootstrap-based calculations can only give an assessment of the
variance of the estimates. Reliable estimates of the bias are only
available if reliable unbiased estimates are available, which is
typically not the case in situations in which penalized estimates are
used.
Reporting a standard error of a penalized estimate therefore tells
only part of the story. It can give a mistaken impression of great
precision, completely ignoring the inaccuracy caused by the bias. It
is certainly a mistake to make confidence statements that are only
based on an assessment of the variance of the estimates, such as
bootstrap-based confidence intervals do."
Jelle Goeman, Ph.D. Leiden University, Author of the Penalized package in R.
There are CRAN packages, hdi and selectiveInference, which provide inference for high-dimensional models; you might want to take a look at those.
I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty introduced by the selection of the best model itself.
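For concreteness, the "refit with glm" shortcut mentioned above looks roughly like the sketch below, on simulated data (an assumption). It is shown only to make the caveat tangible: the standard errors it prints ignore the selection step, so they should not be read as valid inference.

```r
library(glmnet)

set.seed(4)
n <- 300
p <- 10
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# Lasso-penalized logistic regression with cross-validated lambda.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
b <- as.matrix(coef(cvfit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(b)[b != 0], "(Intercept)")
print(selected)

# Unpenalized refit on the selected predictors only
# (assumes at least one predictor was selected).
refit_dat <- data.frame(y = y, x[, selected, drop = FALSE])
refit <- glm(y ~ ., data = refit_dat, family = binomial)

# These standard errors do NOT account for the selection step.
print(summary(refit)$coefficients)
```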

is there a way to only include factors that are significant at P<0.05 in a backward elimination in logistic regression

When doing a backward elimination using the step(), is it possible to only include those factors that are significant, for example, at P<0.05?
I am using this line at the moment
step(FulMod3,direction="backward",trace=FALSE)
to get my final model.
Answers to these questions give starting points
Logistic Regression in R (SAS-like output)
Stepwise Regression using P-Values to drop variables with nonsignificant p-values
In particular they point you towards fastbw in the rms package, which can be used in conjunction with rms::lrm (logistic regression). They also explain why stepwise regression via p values is often a really, really, really, BAD idea: see also http://www.stata.com/support/faqs/stat/stepwise.html . There are a few contexts where it is appropriate (otherwise Frank Harrell, the author of the rms package and crusader against foolish uses of stepwise regression, wouldn't have written fastbw), but they are relatively rare, usually dominated by (e.g.) penalized regression approaches or by stepwise approaches via AIC (as implemented in step): see e.g. https://stats.stackexchange.com/questions/13686/what-are-modern-easily-used-alternatives-to-stepwise-regression and https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection
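For readers who still want to see what p-value-based backward elimination does mechanically, here is a minimal base-R sketch using drop1() with a likelihood-ratio test. The helper backward_p() is hypothetical (it is not part of step() or rms), the data are simulated, and the warnings above about stepwise selection apply in full.

```r
set.seed(5)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.3 + 1.5 * dat$x1))

full <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

# Hypothetical helper: repeatedly drop the term with the largest
# likelihood-ratio p-value until every remaining term has p <= alpha.
backward_p <- function(fit, alpha = 0.05) {
  repeat {
    if (length(attr(terms(fit), "term.labels")) == 0) break
    tab <- drop1(fit, test = "Chisq")
    pvals <- tab[["Pr(>Chi)"]][-1]  # first row is "<none>"
    cand <- rownames(tab)[-1]
    if (max(pvals) <= alpha) break
    worst <- cand[which.max(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

final <- backward_p(full)
kept <- attr(terms(final), "term.labels")
print(kept)
```

By contrast, step(full, direction = "backward") drops terms by AIC rather than by a p-value cutoff.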
