Query about ridge regression - optimum value of lambda - r

I have a query about the cv.glmnet() function in R which is supposed to find the "optimum" value of the parameter lambda for ridge regression.
In the example code below, if you experiment a bit with values of lambda smaller than the one cv.glmnet() returns, you will find that the error sum of squares is actually much smaller than what you get at cv.fit$lambda.min.
I have noticed this with many datasets. Even the example in the well-known book "Introduction to Statistical Learning" (ISLR) by Gareth James et al. has this problem (Section 6.6.1, using the Hitters dataset). The actual value of lambda that minimizes the MSE is smaller than the one the ISLR book gives. This is true both on the training data and on new test data.
What is the reason for this? What exactly is cv.fit$lambda.min returning?
Ravi
library(glmnet)
data(mtcars)
y = mtcars$hp
X = model.matrix(hp~mpg+wt+drat, data=mtcars)[ ,-1]
X
lambdas = 10^seq(3, -2, by=-.1)
fit = glmnet(X, y, alpha=0, lambda=lambdas)
summary(fit)
cv.fit = cv.glmnet(X, y, alpha=0, lambda=lambdas)
# what is the optimum value of lambda?
(opt.lambda = cv.fit$lambda.min) # 1.995262
y.pred = predict(fit, s=0.01, newx=X, exact=T) # gives lower SSE
# Sum of Squares Error
(sse = sum((y.pred - y)^2))

cv.glmnet searches for the lambda that minimizes the cross-validated (out-of-fold) error, not the error on the data the model was fit to. A smaller lambda will almost always give a smaller SSE on the training data, because less shrinkage lets the coefficients track that data more closely; lambda.min is instead the value whose average error over the held-out folds is lowest.
From ?cv.glmnet:
The function runs glmnet nfolds+1 times; the first to get the lambda
sequence, and then the remainder to compute the fit with each of the
folds omitted. The error is accumulated, and the average error and
standard deviation over the folds is computed.
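To see the distinction concretely, here is a minimal sketch (reusing the mtcars objects defined above) that compares the training-set SSE, which always favours less shrinkage, with the cross-validated error that lambda.min actually minimizes:
# training-set SSE for a given lambda (a small lambda fits the training data most closely)
sse.at <- function(s) sum((predict(fit, s = s, newx = X) - y)^2)
sse.at(0.01)               # small lambda: lower training SSE
sse.at(cv.fit$lambda.min)  # lambda.min: higher training SSE...
min(cv.fit$cvm)            # ...but the lowest average out-of-fold (cross-validated) error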

Related

Gelman-Rubin statistic for MCMCglmm model in R

I have a multivariate model with this (approximate) form:
library(MCMCglmm)
mod.1 <- MCMCglmm(
cbind(OFT1, MIS1, PC1, PC2) ~
trait-1 +
trait:sex +
trait:date,
random = ~us(trait):squirrel_id + us(trait):year,
rcov = ~us(trait):units,
family = c("gaussian", "gaussian", "gaussian", "gaussian"),
data= final_MCMC,
prior = prior.invgamma,
verbose = FALSE,
pr=TRUE, #this saves the BLUPs
nitt=103000, #number of iterations
thin=100, #interval at which the Markov chain is stored
burnin=3000)
For publication purposes, I've been asked to report the Gelman-Rubin statistic to indicate that the model has converged.
I have been trying to run:
gelman.diag(mod.1)
But, I get this error:
Error in mcmc.list(x) : Arguments must be mcmc objects
Any suggestions on the proper approach? I assume the error means I can't pass my mod.1 output to gelman.diag(), but I am not sure what I am supposed to pass instead. My knowledge is quite limited here, so I'd appreciate any and all help!
Note that I haven't added the data here, but I suspect the answer is more code syntax and not data related.
gelman.diag requires an mcmc.list. If you run the model more than once (ideally with overdispersed starting values), extract the 'Sol' component (the fixed-effect samples) from each run and combine them in an mcmc.list (below, the same model is simply fitted twice):
library(MCMCglmm)
model1 <- MCMCglmm(PO~1, random=~FSfamily, data=PlodiaPO, verbose=FALSE,
nitt=1300, burnin=300, thin=1)
model2 <- MCMCglmm(PO~1, random=~FSfamily, data=PlodiaPO, verbose=FALSE,
nitt=1300, burnin=300, thin=1 )
mclist <- mcmc.list(model1$Sol, model2$Sol)
gelman.diag(mclist)
# gelman.diag(mclist)
#Potential scale reduction factors:
# Point est. Upper C.I.
#(Intercept) 1 1
According to the documentation, it is intended for more than one MCMC chain:
Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run with starting values that are overdispersed relative to the posterior distribution.
The input x here is
x - An mcmc.list object with more than one chain, and with starting values that are overdispersed with respect to the posterior distribution.
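As a hedged extension of the same idea (not part of the original answer): the variance components of an MCMCglmm fit are stored in the VCV element, so their convergence can be checked in exactly the same way.
# sketch: Gelman-Rubin diagnostic for the (co)variance components of the two runs above
vcv.list <- mcmc.list(model1$VCV, model2$VCV)
gelman.diag(vcv.list)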

cv.glmnet and Leave-one out CV

I'm trying to use the cv.glmnet function to find the best lambda (using ridge regression) in order to predict the class membership of some objects.
So the code that I have used is:
CVGLM<-cv.glmnet(x,y,nfolds=34,type.measure = "class",alpha=0,grouped = FALSE)
Actually I'm not using k-fold cross-validation because my dataset is too small; in fact I have only 34 rows. So I'm setting nfolds to the number of rows to perform leave-one-out CV.
Now, I have some questions:
1) First of all: does the cv.glmnet function only tune the hyperparameter lambda, or does it also test the "final model"?
2) Once I have the best lambda, what do I do next? Should I use the predict function?
If so, which data should I use, given that I used all of my data to find lambda with LOO CV?
3) How can I calculate R^2 from the cv.glmnet function?
Here is an attempt to answer your questions:
1) cv.glmnet tests the performance of each lambda using the cross-validation scheme you specify. Here is an example:
library(glmnet)
data(iris)
# find the best lambda for the iris classification
CVGLM <- cv.glmnet(as.matrix(iris[,-5]),
iris[,5],
nfolds = nrow(iris),
type.measure = "class",
alpha = 0,
grouped = FALSE,
family = "multinomial")
The misclassification error at the best lambda is
min(CVGLM$cvm)
#output
0.06
If you test this independently using LOOCV and best lambda:
z <- lapply(1:nrow(iris), function(x){
fit <- glmnet(as.matrix(iris[-x,-5]),
iris[-x,5],
alpha = 0,
lambda = CVGLM$lambda.min,
family="multinomial")
pred <- predict(fit, as.matrix(iris[x,-5]), type = "class")
return(data.frame(pred, true = iris[x,5]))
})
z <- do.call(rbind, z)
and check the error rate it is:
sum(z$pred != z$true)/150
#output
0.06
so it looks like there is no need to test the performance using the same method as in cv.glmnet since it will be the same.
2) when you have the optimal lambda you should fit a model on the whole data set using the glmnet function, as sketched below. What you do with the model afterwards is entirely up to you; most people train a model to predict something.
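A minimal sketch of that step, reusing the iris objects from above (new_x is a hypothetical matrix of new observations with the same four columns as the training data):
# refit on all of the data at the chosen lambda
final_fit <- glmnet(as.matrix(iris[, -5]), iris[, 5],
                    alpha = 0, lambda = CVGLM$lambda.min,
                    family = "multinomial")
# predict(final_fit, newx = new_x, type = "class")  # new_x: matrix of new observations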
3) what is R^2 for a classification problem? If you could explain that then you could calculate it.
R^2 = Explained variation / Total variation
what is this in terms of classes?
Anyhow, R^2 is not used for classification; instead one uses AUC, deviance, accuracy, balanced accuracy, kappa, Youden's J and so on. Most of these are used for binary classification, but some are available for multinomial problems.
I suggest this as further reading
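If you want a single summary number for this example, here is a hedged sketch using the z data frame built in part 1 (accuracy and a simple per-class balanced accuracy instead of R^2):
acc     <- mean(z$pred == z$true)                         # overall accuracy
bal_acc <- mean(tapply(z$pred == z$true, z$true, mean))   # average of the per-class accuracies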

Choose model by BIC in a stepwise algorithm after choosing model from glmnet

I have data where the number of observations n is smaller than the number of variables p. The response variable is binary. For example:
n <- 10
p <- 100
x <- matrix(rnorm(n*p), ncol = p)
y <- rbinom(n, size = 1, prob = 0.5)
I would like to fit a logistic model to these data, so I used the code:
library(glmnet)
model <- glmnet(x, y, family = "binomial", intercept = FALSE)
The function returns 100 models for different $\lambda$ values (the penalization parameter in LASSO regression). I would like to choose the biggest model that still has n - 1 parameters or fewer (so fewer than the number of observations). Let's say the chosen model corresponds to lambda_opt.
model_one <- glmnet(x, y, family = "binomial", intercept = FALSE, lambda = lambda_opt)
Now I would like to do the second step: use the step function on my model to choose the submodel that is best in terms of BIC (the Bayesian Information Criterion). Unfortunately, the step function doesn't work for objects of class glmnet.
step(model_one, direction = "backward", k = log(n))
How can I perform such procedure? Is there any other function for this specific class (glmnet) to do what I want?
BIC is a fine way to select a penalty parameter from the sequence returned by glmnet; it's faster than cross-validation and works quite well, at least in the settings where I've tried it.
Compute the residual sum of squares for each value of the penalty parameter in the sequence (use predict(model, x) to get the fitted values).
model$df gives you the degrees of freedom.
Combine those to get a BIC and pick the value of lambda corresponding to the lowest BIC, as in the sketch below.
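A minimal sketch of that procedure, reusing x, y, n, and model from the question; the Gaussian-style BIC formula n*log(RSS/n) + df*log(n) is my assumption here (for a binomial fit you might prefer the deviance over the RSS):
fits <- predict(model, newx = x, type = "response")  # fitted probabilities, one column per lambda
rss  <- colSums((y - fits)^2)                        # residual sum of squares for each lambda
bic  <- n * log(rss / n) + model$df * log(n)         # Gaussian-style BIC (assumption, see above)
lambda_opt <- model$lambda[which.min(bic)]           # lambda with the lowest BIC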

Performing Anova on Bootstrapped Estimates from Quantile Regression

So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
i.no <- which(quantiles==i)
q.Result[[i.no]] <- rq(FML, tau=i, data, method="fn", na.action=na.omit)
}
Then I call anova.rq, which runs a Wald test on all the models and outputs a p-value for each covariate, telling me whether the effect of each covariate varies significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
That works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their errors is preferable, which I do with a slight modification of the code above.
Q.mod <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
Here's where I get stuck. The quantreg package currently cannot perform the anova (Wald) test on bootstrapped estimates. The package documentation specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the bootstrapping method.
Looking at the details of the anova.rq method, I can see that it requires two components not present in the quantile model when bootstrapping.
1) Hinv (the inverse Hessian matrix). The package documentation specifically states: "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J, which, according to the documentation, is the "Unscaled outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv. as for the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)"
Can I calculate or estimate Hinv and J from the bootstrapped estimates? If not, what is the best way to proceed?
Any help on this is much appreciated. This is my first time posting a question here, though I've greatly benefited from the answers to other people's questions in the past.
For question 2: you can use the R argument for resampling. For example:
anova(object, ..., test = "Wald", joint = TRUE, score =
"tau", se = "nid", R = 10000, trim = NULL)
Where R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
Just a heads up, you'll probably get a better response to your questions if you only include 1 question per post.
I consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse' computed from bootstrapped estimates. However, we concluded that estimates from different taus could be compared using a Wald test as follows.
From the rqs object produced by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest (in this case VAR, the first covariate in FML) for each tau:
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # extract the OLS (linear regression) estimate for VAR
Then compute the Wald statistic and get the p-value, using the number of quantiles as the degrees of freedom:
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that the bootstrapped betas are normally distributed. If you're running many taus it can be cumbersome to check all those QQ plots, so just sum them by row:
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working; if anyone can think of anything wrong with my solution, please share.

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that the R-squared is not defined for non-linear models, or is at least very tricky; to quote from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
It sounds like f contains your predicted values. So take the sum of squared distances from them to the actual values, divided by n times the variance of y;
so something like
1-sum((y-f)^2)/(length(y)*var(y))
should give you a quasi R-squared value, as long as your model is reasonably close to a linear model and n is pretty big.
As a direct answer to the question asked (rather than arguing that R2/pseudo-R2 values aren't useful), the nagelkerke function in the rcompanion package will report various pseudo R2 values for nonlinear least squares (nls) models, as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
library(rcompanion)  # provides nagelkerke() and the BrendonSmall data
data(BrendonSmall)
quadplat = function(x, a, b, clx) {
ifelse(x < clx, a + b * x + (-0.5*b/clx) * x * x,
a + b * clx + (-0.5*b/clx) * clx * clx)}
model = nls(Sodium ~ quadplat(Calories, a, b, clx),
data = BrendonSmall,
start = list(a = 519,
b = 0.359,
clx = 2304))
nullfunct = function(x, m){m}
null.model = nls(Sodium ~ nullfunct(Calories, m),
data = BrendonSmall,
start = list(m = 1346))
nagelkerke(model, null=null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)            # default to unit weights
rss <- sum(w * res ^ 2)                   # (weighted) residual sum of squares
resp <- pred + res                        # reconstruct the observed response
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]              # residual degrees of freedom
int.df <- 1                               # df used by the intercept-only null model
tss <- sum(w * (resp - center)^2)         # (weighted) total sum of squares
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
            adj.R.squared = adj.r.sq)
which is also the pseudo R2 calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit is compared to just drawing a flat horizontal line through the data. This can make sense for nls models if your null model allows for an intercept-only model, and it can also make sense for particular other nonlinear models. E.g. for a scam model that uses strictly increasing splines (bs="mpi" in the spline term), the fitted model for the worst possible scenario (e.g. data that were strictly decreasing) would be a flat line, and hence would result in an R2 of zero.
The adjusted R2 then also penalizes models with higher numbers of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892436/ (besides, if one swears by using information criteria to do model selection, the question becomes which one to use: AIC, BIC, EBIC, AICc, QIC, etc.).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
I think would also make sense if you have normal Gaussian errors, i.e. the correlation between the observed and fitted y (clipped at zero, so that a negative relationship would imply zero predictive power), squared, and then adjusted for the number of fitted parameters in the adjusted version. If y and yfitted go in the same direction this would be the R2 and adjusted R2 value as reported for a regular linear model. To me this would make perfect sense at least, so I don't agree with outright rejecting the usefulness of pseudo R2 values for nls models, as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1 - residual deviance / null deviance
See here and here for some useful discussion.
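As a small hedged illustration of that deviance-based formula (using a plain glm here rather than a GAM, purely to keep it short):
# McFadden-style (deviance-based) pseudo R^2 for a simple Poisson glm
fit <- glm(cyl ~ mpg, family = poisson, data = mtcars)
1 - fit$deviance / fit$null.deviance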
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
As an alternative approach to this problem, I have used the following procedure several times:
compute a fit on the data with the nls function,
using the resulting model, make predictions,
plot the data against the values predicted by the model (if the model is good, the points should lie near the y = x line),
compute the R2 of that linear regression (see the sketch below).
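A minimal sketch of that procedure, assuming a fitted nls object fit and an observed response vector y (both names are illustrative):
pred <- predict(fit)                    # predictions from the nls fit
plot(pred, y); abline(0, 1, col = 2)    # points should lie near the y = x line
summary(lm(y ~ pred))$r.squared         # R2 of the observed-vs-predicted regression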
Best wishes to all. Patrick.
With the modelr package you can call modelr::rsquare(nls_model, data):
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
library(rcompanion)  # for nagelkerke()
nullfunct <- function(x, m){m}
null_model <- nls(mpg ~ nullfunct(wt, m),
data = mtcars,
start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
library(magrittr)  # for the %>% pipe
lm(mpg ~ predict(nls_model), data = mtcars) %>% broom::glance()
# 0.795
Like they say, it's only an approximation.
