Confidence intervals for Ridge regression - r

I can't work out how to compute confidence intervals for a ridge regression. I have this model:
model5 <- glmnet(train_x, train_y, family = "gaussian", alpha = 0, lambda = 0.01)
And for the prediction I use this command:
test_pred <- predict(model5, test_x, type = "link")
Does anyone know how to get confidence intervals for the predictions?

It turns out that glmnet doesn't offer standard errors (and therefore doesn't give you confidence intervals) as explained here and also addressed in this vignette (excerpt below):
It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
Reporting a standard error of a penalized estimate therefore tells only part of the story. It can give a mistaken impression of great precision, completely ignoring the inaccuracy caused by the bias. It is certainly a mistake to make confidence statements that are only based on an assessment of the variance of the estimates, such as bootstrap-based confidence intervals do.
Reliable confidence intervals around the penalized estimates can be obtained in the case of low dimensional models using the standard generalized linear model theory as implemented in lm, glm and coxph. Methods for constructing reliable confidence intervals in the high-dimensional situation are, to my knowledge, not available.
However, if you insist on confidence intervals, check out this post.
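For anyone who does insist, here is a minimal sketch of bootstrap percentile intervals for the ridge predictions. The data are simulated stand-ins for train_x, train_y and test_x (an assumption, not the asker's data), and B = 200 is an arbitrary choice. As the excerpt above warns, intervals like these only reflect the variance of the predictions and ignore the bias introduced by the penalty, so they should not be read as proper confidence intervals.

```r
library(glmnet)

set.seed(1)
# Simulated stand-ins for train_x, train_y and test_x (assumptions for illustration)
n <- 100; p <- 5
train_x <- matrix(rnorm(n * p), n, p)
train_y <- drop(train_x %*% c(2, -1, 0.5, 0, 0)) + rnorm(n)
test_x  <- matrix(rnorm(10 * p), 10, p)

B <- 200  # number of bootstrap resamples (arbitrary)
boot_pred <- replicate(B, {
  idx <- sample(n, replace = TRUE)            # resample training rows with replacement
  fit <- glmnet(train_x[idx, ], train_y[idx],
                family = "gaussian", alpha = 0, lambda = 0.01)
  drop(predict(fit, test_x, type = "link"))   # predictions for the test rows
})

# boot_pred has one row per test observation and one column per resample.
# Percentile interval per test row: variance only, the ridge bias is ignored.
ci <- t(apply(boot_pred, 1, quantile, probs = c(0.025, 0.975)))
colnames(ci) <- c("lower", "upper")
ci
```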

Related

How do I compute average marginal effect from glm with robust errors?

I am estimating a logit model with glm() and using export_summs(glm_model, robust = TRUE) to get robust standard errors. Next, I want to compute the average marginal effects using the robust standard errors, which I have not managed to do so far. Is there a way to compute margins with robust errors?

My understanding of : How does CV.GLMNET work to choose optimal lambda?

I wish to confirm my understanding of CV procedure in the glmnet package to explain it to a reviewer of my paper. I will be grateful if someone can add information to clarify the answer further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test data (and further decreasing training data) I went with lasso choosing lambda through cross-validation as a means to minimise overfitting. After training the model with cv.glmnet I tested its classification accuracy on the same dataset (bootstrapped x 10000 for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but lasso with its penalizing term chosen by cross-validation is going to lessen its effect.
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is :
In each step of 10 fold cross-validation, data were divided randomly
into two groups containing 9/10th data for training and 1/10th for
internal validation (i.e., measuring binomial deviance/error of model
developed with that lambda). Lambda vs. deviance was plotted. When the
process was repeated 9 more times, 95% confidence intervals of lambda
vs. deviance were derived. The final lambda value to go into the model
was the one that gave the best compromise between high lambda and low
deviance. High lambda is the factor that minimises overfitting because
the regression model is not allowed to improve by assigning large
coefficients to the variables. The model is then trained on the entire
dataset using least squares approximation that minimises model error
penalized by lambda term. Because the lambda term is chosen through
cross-validation (and not from the entire dataset), the choice of
lambda is somewhat independent of the data.
I suspect my explanation can be much improved, and that the experts reading this can point out flaws in the methodology.
Thanks in advance.
A bit late I guess, but here goes.
By default glmnet chooses the lambda.1se. It is the largest λ at which the MSE is within one standard error of the minimal MSE. Along the lines of overfitting, this usually reduces overfitting by selecting a simpler model (less non zero terms) but whose error is still close to the model with the least error. You can also check out this post. Not very sure if you mean this with "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance."
The main issue with your approach is calculating its accuracy on the same training data. This does not tell you how good the model will perform on unseen data, and bootstrapping does not address the error in the accuracy. For an estimate of the error, you should actually use the error from the cross validation. If your model does not work on 90% of the data, I don't see how using all of the training data works.
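A minimal sketch of where these quantities live on a cv.glmnet object. The data are simulated with the same shape as in the question (106 rows, 29 inputs); the effect sizes are made up for illustration.

```r
library(glmnet)

set.seed(1)
# Simulated binary-outcome data, same dimensions as in the question (assumption)
n <- 106; p <- 29
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

cvfit$lambda.min   # lambda with the lowest cross-validated deviance
cvfit$lambda.1se   # largest lambda within one standard error of that minimum

# Cross-validated deviance and its standard error at lambda.1se:
# an out-of-sample estimate, unlike accuracy measured on the training data
i <- which(cvfit$lambda == cvfit$lambda.1se)
c(cv_deviance = cvfit$cvm[i], se = cvfit$cvsd[i])

# Number of nonzero coefficients (intercept excluded) at each choice of lambda
n_min <- sum(as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1] != 0)
n_1se <- sum(as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1] != 0)
c(lambda.min = n_min, lambda.1se = n_1se)
```

plot(cvfit) draws the deviance-versus-log(lambda) curve with error bars and marks both lambda values.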

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they give "residual std error" and "residual sum of squares" instead.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
R-squared is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be poisson distributed and a generalized linear model (glm) in the poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
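A minimal sketch of the residual checks and the residual standard error comparison described above, using only base R. The data are simulated from an exponential decay and the two candidate models are made up for illustration; raw residuals are plotted against fitted values here, as a simple stand-in for the standardized residual plot.

```r
set.seed(42)
# Simulated data from an exponential decay (assumption for illustration)
x <- seq(0.5, 10, length.out = 40)
y <- 5 * exp(-0.4 * x) + rnorm(length(x), sd = 0.15)
d <- data.frame(x = x, y = y)

# Two candidate models fitted with nls(): exponential decay and a straight line
m_exp <- nls(y ~ a * exp(-b * x), data = d, start = list(a = 4, b = 0.3))
m_lin <- nls(y ~ a + b * x,       data = d, start = list(a = 4, b = -0.5))

# Residual diagnostics for one model: look for trends or funnels in the
# residual plot and for departures from the line in the Q-Q plot
op <- par(mfrow = c(1, 2))
plot(fitted(m_exp), resid(m_exp), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
qqnorm(resid(m_exp)); qqline(resid(m_exp))
par(op)

# Residual standard error of each fit, to be looked at last
rse <- c(exponential = summary(m_exp)$sigma, linear = summary(m_lin)$sigma)
rse
```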

Why is it inadvisable to get statistical summary information for regression coefficients from glmnet model?

I have a regression model with binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (selected variables and their coefficients) to glm to get the information (Standard errors, etc).
I searched the R documentation, and it seems I can use the "method" option in glm to specify a user-defined function.
But I failed to do so, could someone help me with this?
"It is a very natural question to ask for standard errors of regression
coefficients or other estimated quantities. In principle such standard
errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for
this is that standard errors are not very meaningful for strongly
biased estimates such as arise from penalized estimation methods.
Penalized estimation is a procedure that reduces the variance of
estimators by introducing substantial bias. The bias of each estimator
is therefore a major component of its mean squared error, whereas its
variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is
impossible to obtain a sufficiently precise estimate of the bias. Any
bootstrap-based calculations can only give an assessment of the
variance of the estimates. Reliable estimates of the bias are only
available if reliable unbiased estimates are available, which is
typically not the case in situations in which penalized estimates are
used.
Reporting a standard error of a penalized estimate therefore tells
only part of the story. It can give a mistaken impression of great
precision, completely ignoring the inaccuracy caused by the bias. It
is certainly a mistake to make confidence statements that are only
based on an assessment of the variance of the estimates, such as
bootstrap-based confidence intervals do."
Jelle Goeman, Ph.D. Leiden University, Author of the Penalized package in R.
There are the CRAN packages hdi and selectiveInference, which provide inference for high-dimensional models; you might want to take a look at those...
I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty produced by the selection process of the best model itself...
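For reference, the "refit with glm" shortcut looks roughly like the sketch below. The data and variable names (v1 to v10) are simulated assumptions. The standard errors it prints are exactly the ones cautioned against above: they treat the selected variables as if they had been fixed in advance.

```r
library(glmnet)

set.seed(1)
# Simulated binary-outcome data; only v1 and v2 have real effects (assumption)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Names of predictors with nonzero lasso coefficients at lambda.1se
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")

# Naive refit: these standard errors ignore that the data chose the variables
refit <- glm(y ~ ., data = data.frame(y = y, x[, selected, drop = FALSE]),
             family = binomial)
summary(refit)$coefficients
```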

Goodness of fit functions in R

What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
Just the first part of that question can fill entire books. Just some quick choices:
lm() for standard linear models
glm() for generalised linear models (eg for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
Then there are domain-specific models, such as time series, micro-econometrics, mixed-effects and much more. Several of the Task Views, such as Econometrics, discuss this in more detail. As for goodness of fit, that is also something one could easily spend an entire book discussing.
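A quick sketch of three of these functions on datasets that ship with R (cars and mtcars), just to show the calling pattern and where the usual fit summaries are found. The choice of datasets and formulas is only for illustration.

```r
# Linear model: stopping distance versus speed
fit_lm <- lm(dist ~ speed, data = cars)
r2 <- summary(fit_lm)$r.squared      # R-squared of the linear fit
f  <- summary(fit_lm)$fstatistic     # F statistic with its degrees of freedom

# Logistic regression: transmission type versus car weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
dev <- deviance(fit_glm)             # residual deviance

# Local regression: a smooth curve without a parametric form
fit_loess <- loess(dist ~ speed, data = cars)

# Overlay the straight line and the loess curve on the data
plot(cars)
abline(fit_lm, lty = 2)
lines(cars$speed, predict(fit_loess))
```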
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. In fact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to a seriously misspecified model (see Harrell's book on "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. Chi squared (the sum of the squared residuals) is the metric that is optimized in that case, but it is not normalized so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that.
The Quick R site has a reasonably good summary of basic functions used for fitting models and testing the fits, along with sample R code:
http://www.statmethods.net/stats/regression.html
The main thing you should ensure is
that your residuals are normally
distributed. Unfortunately I'm not
sure of an automated way to do that.
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
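A minimal sketch of that idea, assuming simulated data and a simple lm() fit as the source of the residuals. qqnorm(..., plot.it = FALSE) returns the theoretical and sample quantiles, so the correlation can be computed directly. shapiro.test() in base R is a ready-made formal normality test.

```r
set.seed(7)
# Residuals from a simple linear fit on simulated data (illustrative only)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.3)
res <- resid(lm(y ~ x))

# Theoretical (qq$x) and sample (qq$y) quantiles, in the order of the data
qq <- qqnorm(res, plot.it = FALSE)
r_all <- cor(qq$x, qq$y)

# Same correlation restricted to the middle 97% of the residuals
keep  <- res >= quantile(res, 0.015) & res <= quantile(res, 0.985)
r_mid <- cor(qq$x[keep], qq$y[keep])

c(all = r_all, middle_97 = r_mid)

# Formal test of normality for comparison
shapiro.test(res)
```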
Best to keep it simple, and see if linear methods work "well enuff". You can judge your goodness of fit GENERALLY by looking at the R squared AND the F statistic together, never separately. Adding variables to your model that have no bearing on your dependent variable can increase R squared, so you must also consider the F statistic.
You should also compare your model to other nested, or simpler, models. Do this using a log-likelihood ratio test, so long as the dependent variables are the same.
Jarque–Bera test is good for testing the normality of the residual distribution.
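A minimal base-R sketch of both suggestions, on simulated data where x2 has no effect by construction. The likelihood ratio test and the Jarque–Bera statistic are computed by hand here; packaged versions exist, for example jarque.bera.test() in the tseries package.

```r
set.seed(3)
n  <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)       # x2 has no effect by construction (assumption)

m0 <- lm(y ~ x1)                  # simpler, nested model
m1 <- lm(y ~ x1 + x2)             # larger model

# Likelihood ratio test: twice the log-likelihood difference, compared with
# a chi-squared distribution on the difference in number of parameters
lr   <- as.numeric(2 * (logLik(m1) - logLik(m0)))
df   <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
p_lr <- pchisq(lr, df = df, lower.tail = FALSE)

# Jarque-Bera statistic from the skewness and kurtosis of the residuals
e    <- resid(m1)
s    <- mean((e - mean(e))^3) / mean((e - mean(e))^2)^1.5   # skewness
k    <- mean((e - mean(e))^4) / mean((e - mean(e))^2)^2     # kurtosis
jb   <- n / 6 * (s^2 + (k - 3)^2 / 4)
p_jb <- pchisq(jb, df = 2, lower.tail = FALSE)

c(LR = lr, p_LR = p_lr, JB = jb, p_JB = p_jb)
```

A large p-value for the likelihood ratio test means the extra variable does not improve the fit noticeably; a small p-value for Jarque–Bera suggests the residuals are not normal.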
