Why is it inadvisable to get statistical summary information for regression coefficients from a glmnet model?

I have a regression model with a binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (the selected variables and their coefficients) to glm to get the usual summary information (standard errors, etc.).
I searched the R documentation, and it seems I can use the "method" option in glm to specify a user-defined function.
But I failed to get it to work. Could someone help me with this?

"It is a very natural question to ask for standard errors of regression
coefficients or other estimated quantities. In principle such standard
errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for
this is that standard errors are not very meaningful for strongly
biased estimates such as arise from penalized estimation methods.
Penalized estimation is a procedure that reduces the variance of
estimators by introducing substantial bias. The bias of each estimator
is therefore a major component of its mean squared error, whereas its
variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is
impossible to obtain a sufficiently precise estimate of the bias. Any
bootstrap-based calculations can only give an assessment of the
variance of the estimates. Reliable estimates of the bias are only
available if reliable unbiased estimates are available, which is
typically not the case in situations in which penalized estimates are
used.
Reporting a standard error of a penalized estimate therefore tells
only part of the story. It can give a mistaken impression of great
precision, completely ignoring the inaccuracy caused by the bias. It
is certainly a mistake to make confidence statements that are only
based on an assessment of the variance of the estimates, such as
bootstrap-based confidence intervals do."
Jelle Goeman, Ph.D. Leiden University, Author of the Penalized package in R.

There are CRAN packages, hdi and selectiveInference, which provide inference for high-dimensional models; you might want to take a look at those...
I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty introduced by the selection of the best model itself...
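For illustration, here is a minimal sketch of that "refit glm on the glmnet-selected predictors" approach on simulated data (all variable names below are made up, not your data), with the caveat above that the resulting standard errors ignore the selection step:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
colnames(x) <- paste0("x", 1:10)
y <- rbinom(200, 1, plogis(x[, 1] - 2 * x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial")
b <- as.matrix(coef(cvfit, s = "lambda.1se"))                 # coefficients at the chosen lambda
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")  # names of the selected predictors

dat <- data.frame(y = y, x)
refit <- glm(reformulate(selected, response = "y"), data = dat, family = binomial)
summary(refit)  # SEs and p-values here do not account for the selection process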

Related

Estimating Robust Standard Errors from Covariate Balanced Propensity Score Output

I'm using the Covariate Balancing Propensity Score (CBPS) package and I want to estimate robust standard errors for my ATT results that incorporate the weights. The MatchIt and twang tutorials both recommend using the survey package to incorporate weights into the estimate of robust standard errors, and it seems to work:
library(survey)  # provides svydesign() and svyglm()
design.CBPS <- svydesign(ids = ~1, weights = CBPS.object$weights, data = SUCCESS_All.01)
SE <- svyglm(dv ~ treatment, design = design.CBPS)  # summary(SE) reports the design-based robust SEs
Additionally, the survey SEs are substantially different from the coefficients and SEs produced by the default lm() approach provided by the CBPS package. For those more familiar with either the CBPS or survey packages, is there any reason why this would be inappropriate or violate some assumption of the CBPS method? I don't see anything in the CBPS documentation about how best to estimate standard errors, so that's why I'm slightly concerned.
Sandwich (robust) standard errors are the most commonly used standard errors after propensity score weighting (including CBPS). For the ATE, they are known to be conservative (too large), and for the ATT, they can be either too large or too small. For parametric methods like CBPS, it is possible to use M-estimation to account for both the estimation of the propensity scores and the outcome model, but this is fairly complicated, especially for specialized models like CBPS.
The alternative is to use the bootstrap, where you bootstrap both the propensity score estimation and estimation of the treatment effect. The WeightIt documentation contains an example of how to do bootstrapping to estimate the confidence interval around a treatment effect estimate.
Using the survey package is one way to get robust standard errors, but there are other packages you can use, such as the sandwich package as recommended in the MatchIt documentation. Under no circumstances should you use or even consider the usual lm() standard errors; these are completely inaccurate for inverse probability weights. The AsyVar() function in CBPS seems like it should provide valid standard errors, but in my experience these are also wildly inaccurate (compared to a bootstrap); the function doesn't even get the treatment effect right.
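As a rough sketch of the sandwich-package route mentioned above (the data frame dat, the outcome dv, the treatment indicator, and the weights w below are simulated stand-ins, not actual CBPS output):

library(sandwich)
library(lmtest)

set.seed(1)
dat <- data.frame(dv = rnorm(200), treatment = rbinom(200, 1, 0.5))
w <- runif(200, 0.5, 2)  # stand-in for CBPS.object$weights

fit <- lm(dv ~ treatment, data = dat, weights = w)
coeftest(fit, vcov. = vcovHC(fit, type = "HC0"))  # robust (sandwich) SEs for the weighted model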
I recommend you use a bootstrap. It may take some time (you ideally want around 1000 bootstrap replications), but these standard errors will be the most accurate.
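A minimal sketch of that bootstrap, re-fitting CBPS inside each replicate (dv and treatment follow your question; the data frame dat and the covariates x1, x2 are simulated stand-ins):

library(CBPS)
library(boot)

set.seed(1)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))      # stand-in covariates
dat$treatment <- rbinom(300, 1, plogis(0.5 * dat$x1))
dat$dv <- dat$treatment + dat$x1 + rnorm(300)

att_boot <- function(data, indices) {
  d <- data[indices, ]                                        # resample rows
  ps <- CBPS(treatment ~ x1 + x2, data = d, ATT = 1)          # re-estimate the propensity scores
  fit <- lm(dv ~ treatment, data = d, weights = ps$weights)   # weighted outcome model
  coef(fit)["treatment"]                                      # ATT estimate for this replicate
}

boot_out <- boot(dat, att_boot, R = 1000)  # ~1000 replications; this can take a while
boot.ci(boot_out, type = "perc")           # percentile confidence interval for the ATT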

Coefficient value of covariate in Cox PH model is too big

I am trying to develop a Cox PH model with time-varying covariates in R. I use the coxph function from the survival package. There was no trouble during the estimation process, but the coefficient of one covariate is far too large, in particular 2.5e+32.
I can't figure out the reason for this problem or how to tackle it. This variable is nonstationary and the proportional hazards assumption is violated. Could either of these facts cause such a large coefficient?
More information would help in framing your problem.
Anyway, I doubt non-proportionality is to blame. It would imply that you have some outliers heavily biasing your coefficient beyond reasonable expectations. You could give this a quick look by plotting the output of cox.zph.
Another possible explanation is that this rather depends on the unit of measure you used to define your covariate. Can the magnitude of the coefficient be meaningfully interpreted? If so, you could simply re-scale/standardise/log-transform that covariate to obtain a 'more manageable' coefficient (if this is theoretically appropriate).
This could also be due to so-called 'complete separation', which has been discussed here and here.
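As a quick sketch of the first two suggestions, using the built-in lung dataset as a stand-in for your model:

library(survival)

fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)

zp <- cox.zph(fit)   # test of the proportional hazards assumption
print(zp)
plot(zp)             # a clear trend over time, or extreme points, flags problems

# If a covariate's scale produces unwieldy coefficients, standardise it and refit
fit2 <- coxph(Surv(time, status) ~ scale(age) + ph.ecog, data = lung)
summary(fit2)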

My understanding of: How does cv.glmnet work to choose the optimal lambda?

I wish to confirm my understanding of CV procedure in the glmnet package to explain it to a reviewer of my paper. I will be grateful if someone can add information to clarify the answer further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test data (and further decreasing the training data), I went with lasso, choosing lambda through cross-validation, as a means to minimise overfitting. After training the model with cv.glmnet I tested its classification accuracy on the same dataset (bootstrapped x 10000 for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but lasso with its penalizing term chosen by cross-validation should lessen its effect.
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:
In each step of 10-fold cross-validation, the data were divided randomly into two groups containing 9/10ths of the data for training and 1/10th for internal validation (i.e., measuring the binomial deviance/error of the model developed with that lambda). Lambda vs. deviance was plotted. When the process was repeated 9 more times, 95% confidence intervals of lambda vs. deviance were derived. The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance. High lambda is the factor that minimises overfitting because the regression model is not allowed to improve by assigning large coefficients to the variables. The model is then trained on the entire dataset using a least-squares approximation that minimises model error penalized by the lambda term. Because the lambda term is chosen through cross-validation (and not from the entire dataset), the choice of lambda is somewhat independent of the data.
I suspect my explanation can be much improved, or that the experts reading this will point out flaws in the methodology.
Thanks in advance.
A bit late I guess, but here goes.
By default, coef() and predict() on a cv.glmnet object use lambda.1se. It is the largest λ at which the cross-validated error (MSE for Gaussian models; binomial deviance in your case) is within one standard error of its minimum. Along the lines of overfitting, this usually reduces overfitting by selecting a simpler model (fewer non-zero terms) whose error is still close to that of the model with the least error. You can also check out this post. I'm not sure whether this is what you mean by "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance."
The main issue with your approach is calculating accuracy on the same data you trained on. That does not tell you how well the model will perform on unseen data, and bootstrapping the same data does not correct for this optimism. For an estimate of the error you should instead use the error from the cross-validation itself. If the model does not work when trained on 90% of the data, I don't see why training it on all of the data would change that.
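To make the lambda.1se / lambda.min distinction concrete, here is a sketch on simulated data of the same shape as yours (29 predictors, 106 rows; everything below is made-up data, not your dataset):

library(glmnet)

set.seed(1)
x <- matrix(rnorm(106 * 29), 106, 29)
y <- rbinom(106, 1, plogis(x[, 1] - x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance", nfolds = 10)
plot(cvfit)                    # cross-validated deviance vs log(lambda), with one-SE error bars

cvfit$lambda.min               # lambda with the lowest cross-validated deviance
cvfit$lambda.1se               # largest lambda within one SE of that minimum (the default)
coef(cvfit, s = "lambda.1se")  # the sparser model referred to above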

Confidence intervals for Ridge regression

I can't compute confidence intervals for a ridge regression. I have this model:
model5 <- glmnet(train_x, train_y, family = "gaussian", alpha = 0, lambda = 0.01)
And when I make predictions I use this command:
test_pred <- predict(model5, test_x, type = "link")
Does anyone know how to compute confidence intervals for the predictions?
It turns out that glmnet doesn't offer standard errors (and therefore doesn't give you confidence intervals) as explained here and also addressed in this vignette (excerpt below):
It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g. using the bootstrap. Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
Reporting a standard error of a penalized estimate therefore tells only part of the story. It can give a mistaken impression of great precision, completely ignoring the inaccuracy caused by the bias. It is certainly a mistake to make confidence statements that are only based on an assessment of the variance of the estimates, such as bootstrap-based confidence intervals do.
Reliable confidence intervals around the penalized estimates can be obtained in the case of low dimensional models using the standard generalized linear model theory as implemented in lm, glm and coxph. Methods for constructing reliable confidence intervals in the high-dimensional situation are, to my knowledge, not available.
However, if you insist on confidence intervals, check out this post.
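If you do go down that road, a bootstrap percentile interval for the predictions is one option, with the caveat from the excerpt above that it reflects only the variance and ignores the shrinkage bias (the data below are simulated stand-ins for train_x, train_y and test_x):

library(glmnet)

set.seed(1)
train_x <- matrix(rnorm(100 * 5), 100, 5)
train_y <- as.numeric(train_x %*% c(1, -1, 0.5, 0, 0) + rnorm(100))
test_x  <- matrix(rnorm(10 * 5), 10, 5)

B <- 500
preds <- replicate(B, {
  idx <- sample(nrow(train_x), replace = TRUE)            # resample training rows
  fit <- glmnet(train_x[idx, ], train_y[idx], family = "gaussian",
                alpha = 0, lambda = 0.01)
  as.numeric(predict(fit, test_x, type = "link"))
})

t(apply(preds, 1, quantile, probs = c(0.025, 0.975)))     # per-observation 95% bounds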

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they give "residual std error" and "residual sum of squares" instead.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be poisson distributed and a generalized linear model (glm) in the poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
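For a concrete illustration of those residual checks, here is a sketch using the built-in mtcars data and an arbitrary exponential model (not your actual models):

fit <- nls(mpg ~ a * exp(b * wt), data = mtcars, start = list(a = 40, b = -0.3))
summary(fit)                    # estimates, standard errors, residual standard error

res <- residuals(fit)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")  # look for fanning or curvature
abline(h = 0, lty = 2)

qqnorm(res); qqline(res)        # check approximate normality of the residuals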
