Lambda sequence in glmnet in ridge vs lasso in R

I'm having a problem with cv.glmnet computing an unreasonable lambda sequence for ridge.
I'm running both ridge and lasso regression with glmnet on exactly the same data. Lasso is fine, but ridge isn't.
ridge.cv <- cv.glmnet(preds[train.i,], resp[train.i], alpha=0, family="binomial", type.measure="class")
lasso.cv <- cv.glmnet(preds[train.i,], resp[train.i], alpha=1, family="binomial", type.measure="class")
> range(lasso.cv$glmnet.fit$dev.ratio)
[1] 1.117039e-14 9.334558e-01
> range(ridge.cv$glmnet.fit$dev.ratio)
[1] 1.117039e-14 1.852909e-01
> range(lasso.cv$lambda)
[1] 0.002812585 0.268474838
> range(ridge.cv$lambda)
[1] 2.812585 268.474838
So lasso computes a reasonable lambda sequence, yielding a reasonable range of deviance explained. Ridge, however, computes a lambda sequence exactly 1000 times higher than lasso's, yielding a ridiculous range of deviance explained. The predictor matrix is 891 x 1028.
Any idea why this might happen and how to fix it? I can of course supply my own sequence, but I'd like to know why it happens, in case it is a symptom of a bigger problem.

From the glmnet help file:
lambda: The actual sequence of ‘lambda’ values used. When ‘alpha=0’, the largest lambda reported does not quite give the zero coefficients reported (‘lambda=inf’ would in principle). Instead, the largest ‘lambda’ for ‘alpha=0.001’ is used, and the sequence of ‘lambda’ values is derived from this.
Basically, in the case of ridge regression glmnet derives lambda.max (i.e., the value of lambda at which all coefficients are reported as zero) from alpha = 0.001, which makes it exactly 1000 times larger than the lambda.max derived for alpha = 1 (the lasso case).
I'm not exactly sure what you mean by "fix it", since the range of meaningful lambda values changes depending on your value of alpha.
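That said, if you want the ridge path to cover roughly the same effective region as the lasso path, you can pass your own sequence. A minimal sketch, assuming the grid endpoints below suit your data (preds, resp, and train.i are the objects from the question):
# A hand-built, log-spaced grid; the endpoints here are illustrative only
lambda.grid <- exp(seq(log(100), log(1e-4), length.out = 100))
# Same call as before, but with the user-supplied grid instead of the automatic one
ridge.cv <- cv.glmnet(preds[train.i, ], resp[train.i],
                      alpha = 0, lambda = lambda.grid,
                      family = "binomial", type.measure = "class")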


How to find RMSE value? And what is a good RMSE value?

I am forecasting electrical power output, with data sets ranging from 200 to 4000 observations. I have computed the forecasts, but I do not know how to calculate the RMSE and the R value (correlation coefficient) in R. I tried calculating it in Excel and got an RMSE of 0.0078. So I basically have two questions:
How do I calculate the RMSE and R value in R?
What is a good RMSE value? Is 0.007 a good one?
Here are two functions: one computes the MSE, and the second calls the first and takes the square root to give the RMSE.
These functions accept a fitted model (the output of lm, glm, and many others), not a data set.
mse <- function(x, na.rm = TRUE, ...) {
  # x is a fitted model; work from its residuals
  e <- resid(x)
  mean(e^2, na.rm = na.rm)
}
rmse <- function(x, ...) sqrt(mse(x, ...))
Like I said in a comment to the question, a value is not good on its own; it's good compared to values obtained from other fitted models.
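For instance, a quick sketch using the built-in mtcars data:
# Two nested models fit to the same data; compare their training RMSE
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
rmse(fit1)
rmse(fit2)  # smaller, i.e. fit2 tracks the training data more closely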
Root Mean Square Error (RMSE) is the standard deviation of the prediction errors (residuals). Residuals measure how far data points are from the regression line; RMSE measures how spread out those residuals are. In other words, it tells you how concentrated the data are around the line of best fit. RMSE is commonly used in climatology, forecasting, and regression analysis to verify experimental results.
The formula is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(f_i - o_i)^2}$$
where:
f = forecasts (expected values or unknown results),
o = observed values (known results),
N = sample size.
That is, take the mean of the squared differences between forecasts and observations, then take the square root.
You can use whichever form you want, as both express the same quantity. The "R" you are referring to is the Pearson correlation coefficient, which measures how strongly the forecasts and observations vary together.
As for question 2: whether an RMSE value is good depends on the scale of your data, since RMSE is expressed in the same units as the response. Smaller is better, but a single value is only meaningful relative to the range of the observations and to the RMSE of competing models.
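If you have the forecasts and observations as plain vectors rather than a fitted model, a minimal sketch (the vectors here are made up for illustration):
# Hypothetical forecast/observation pairs; substitute your own data
observed <- c(210, 340, 1800, 2900, 3950)
forecast <- c(225, 330, 1750, 3010, 3880)
sqrt(mean((forecast - observed)^2))  # RMSE, in the same units as the data
cor(forecast, observed)              # Pearson correlation coefficient R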

Subset selection with LASSO involving categorical variables

I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used model.matrix() function on the independent variables, it automatically created dummy values for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, the reference level is "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficient results from LASSO, I noticed that worker_typecontr and worker_typeother both have zero coefficients. How should I interpret this? What is the coefficient for FTE in this case? Should I just take this variable out of the formula?
Perhaps this question is suited more for Cross Validated.
Ridge Regression and the Lasso are both "shrinkage" methods, typically used to deal with a high-dimensional predictor space.
The fact that your Lasso regression reduces some of the beta coefficients to zero indicates that the Lasso is doing exactly what it was designed for! By its mathematical definition, the Lasso assumes that a number of the coefficients are truly equal to zero. The interpretation of a coefficient that goes to zero is that this predictor does not explain any of the variance in the response relative to the non-zero predictors.
Why does the Lasso shrink some coefficients to zero? We need to investigate how the coefficients are chosen. The Lasso is essentially a multiple linear regression problem that is solved by minimizing the residual sum of squares, plus a special L1 penalty term that shrinks coefficients toward 0. This is the quantity that is minimized:
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$
where p is the number of predictors and lambda is a non-negative tuning parameter. When lambda = 0, the penalty term drops out and you have ordinary multiple linear regression. As lambda becomes larger, the coefficients are shrunk more strongly toward zero: the fit gains bias but loses variance. With lambda too small the model remains subject to overfitting; with lambda too large it underfits.
A cross-validation approach should be taken to select the tuning parameter lambda: take a grid of lambda values, compute the cross-validation error for each, and choose the value for which that error is lowest (as sketched below).
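In R this grid search is exactly what cv.glmnet() automates; a sketch with simulated data:
library(glmnet)
set.seed(42)
x <- matrix(rnorm(100 * 20), 100, 20)          # 100 observations, 20 predictors
y <- drop(x[, 1:5] %*% rnorm(5)) + rnorm(100)  # only 5 predictors are truly active
cv <- cv.glmnet(x, y, alpha = 1)               # 10-fold CV over an automatic lambda grid
cv$lambda.min                                  # the lambda with the lowest CV error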
The Lasso is useful in some situations and helps generate simple models, but special consideration should be paid to the nature of the data itself, and to whether another method such as Ridge Regression or OLS Regression is more appropriate given how many predictors are truly related to the response.
Note: See equation 6.7 on page 221 of "An Introduction to Statistical Learning", which is freely available online.
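To see the dummy coding concretely, here is a minimal sketch with a made-up worker_type factor (not the asker's data), forcing FTE to be the reference level:
worker_type <- factor(c("FTE", "contr", "other", "FTE"),
                      levels = c("FTE", "contr", "other"))
model.matrix(~ worker_type)   # attributes omitted below
#   (Intercept) worker_typecontr worker_typeother
# 1           1                0                0
# 2           1                1                0
# 3           1                0                1
# 4           1                0                0
The reference level never gets its own column; its effect is absorbed into the intercept. So a zero lasso coefficient on worker_typecontr means "no detectable difference from FTE for contractors", not that the whole variable should be removed.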

GLMNet convergence issue for penalized regression

I am working on network models for political networks. One of the things I am doing is penalized inference, using an adaptive lasso approach in which I set a penalty factor for glmnet. My model has two kinds of parameters, alphas and phis: the alphas are fixed effects that I want to keep in the model, while the phis are penalized.
I use the coefficients from an MLE fit with glm() to compute the adaptive weights, which are passed through the penalty.factor argument of glmnet().
This is the code:
# Generate Generalized Linear Model
GenLinMod = glm(y ~ X, family = "poisson")
# Set coefficients
coefficients = coef(GenLinMod)
# Set penalty
penalty = 1/(coefficients[-1])^2
# Protect alphas
penalty[1:(n-1)] = 0
# Generate Generalized Linear Model with adaptive lasso procedure
GenLinModNet = glmnet(XS, y, family = "poisson", penalty.factor = penalty, standardize = FALSE)
For some networks this code executes just fine, however I have certain networks for which I get these errors:
Error: Matrices must have same number of columns in rbind2(.Call(dense_to_Csparse, x), y)
In addition: Warning messages:
1: from glmnet Fortran code (error code -1); Convergence for 1th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
2: In getcoef(fit, nvars, nx, vnames) :
an empty model has been returned; probably a convergence issue
The odd thing is that they all use the same code, so I am wondering if it is a data problem.
Additional information:
+ In one case I have over 500 alphas and 21 phis and these errors appear; in another case that does not work, I have 200 alphas and 28 phis. On the other hand, I have a case with over 600 alphas and 28 phis that converges nicely.
+ I have tried various settings for lambda.min.ratio and nlambda, to no avail.
Additional question: Is the first entry of penalty the one associated with the intercept, or is the intercept's entry added automatically by glmnet()? I did not find clarity about this in the glmnet vignette. My thought is that I shouldn't include a term for the intercept, since the documentation says the penalty is internally rescaled to sum to nvars, and I assume the intercept isn't one of my variables.
I'm not 100% sure about this, but I think I have found the root of the problem.
I've tried all kinds of manual lambda sequences, even very large starting lambdas (in the 1000s), and none of it seemed to do any good. However, when I stopped leaving the alphas unpenalized, everything converged nicely. So it probably has something to do with the number of unpenalized variables: maybe keeping all the alphas unpenalized forces glmnet into some divergent state, or maybe there is some sort of collinearity going on. My "solution", which is really just a different approach, is to penalize the alphas with the same weight used for one of the phis. This rests on the assumption that some phis are significant and that the alphas can be just as significant, instead of being fixed (which effectively makes them infinitely significant). I'm not completely satisfied, because this is a different model, but it is interesting to note that the problem seems tied to the number of unpenalized variables.
Also, to answer my additional question: the glmnet vignette says that the penalty term is internally rescaled to sum to nvars. Since the intercept is not one of the variables, my guess is that it does not need an entry in penalty.factor. I've tried both including and excluding the extra term and the results seem to be the same, so perhaps glmnet automatically drops it when it detects that the length is one more than it should be.
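A small check is consistent with that reading. In the sketch below (simulated data, not the asker's networks), penalty.factor has one entry per column of x and none for the intercept, which glmnet leaves unpenalized in any case:
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rpois(100, exp(0.2 * x[, 1]))  # Poisson response driven by the first column
pf <- c(0, 0, 1, 1, 1)              # first two variables unpenalized, like the alphas
fit <- glmnet(x, y, family = "poisson", penalty.factor = pf)
length(pf) == ncol(x)               # TRUE: one weight per variable, intercept excluded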

Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events have on an outcome. I'm using the glmnet() package, specifically using the Poisson feature. Here's my code:
# de <- data imported from sql connection
x <- model.matrix(~.,data = de[,2:7])
y <- (de[,1])
reg <- cv.glmnet(x,y, family = "poisson", alpha = 1)
reg1 <- glmnet(x,y, family = "poisson", alpha = 1)
Co <- coef(?reg or reg1?, s = ???)
summ <- summary(Co)
c <- data.frame(Name= rownames(Co)[summ$i],
Lambda= summ$x)
c2 <- c[with(c, order(-Lambda)), ]
The beginning imports a large amount of data from my database in SQL. I then put it in matrix format and separate the response from the predictors.
This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() and cv.glmnet() functions. I realize that cv.glmnet() performs k-fold cross-validation of glmnet(), but what does that mean in practical terms? They provide the same values for lambda, but I want to make sure I'm not missing something important about the difference between the two.
I'm also unclear as to why it runs fine when I specify alpha = 1 (supposedly the default), but not when I leave it out.
Thanks in advance!
glmnet is an R package that fits penalized regression models such as ridge and lasso. The alpha argument determines which type of model is fit: alpha = 0 gives a ridge model, and alpha = 1 gives a lasso model.
cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted with nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping folds of approximately equal size; each fold in turn serves as the validation set while the model is fit on the other 9. The bias-variance trade-off is the usual motivation for such validation methods, and in the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
In your example, you can run plot(reg) or look at reg$lambda.min to see the value of lambda that gives the smallest CV error, and then derive the test MSE for that value of lambda. By default, glmnet() performs ridge or lasso regression over an automatically selected range of lambda values, which may not include the one with the lowest test MSE. Hope this helps!
Between reg$lambda.min and reg$lambda.1se: lambda.min gives you the lowest cross-validated error, but depending on how flexible you can be with the error, you may want to choose reg$lambda.1se instead, as this value further shrinks the number of predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
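Applied to the code in the question, that means extracting the coefficients from the cross-validated object at a CV-chosen value of s rather than guessing one; a sketch:
Co <- coef(reg, s = "lambda.min")   # coefficients at the lambda with the lowest CV error
# or, for a sparser model whose error is within one standard error of the best:
Co <- coef(reg, s = "lambda.1se")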

Lack of convergence of glmnet when lambda=0 for family="poisson"

While getting a handle on glmnet versus glm, I ran into convergence problems for lambda = 0 and family = "poisson". My understanding is that with lambda = 0 (and alpha = 1, the default), the answers should be essentially the same as glm's.
Below is code changed slightly from the Poisson example on the glmnet help page (?glmnet). The only change is that nzc = p, so that all variables are in the true model.
N=1000; p=50
nzc=p
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta
mu=exp(f)
y=rpois(N,mu)
#With lambda=0 glmnet throws the convergence error shown below
fit=glmnet(x,y,family="poisson",lambda=0)
#It works with default lambda passed in
# but estimates are quite different from glm.
fit=glmnet(x,y,family="poisson") #use default lambdas
fit2=glm(y~x,family="poisson")
plot(coef(fit2)[2:(p+1)],
coef(fit,s=min(fit$lambda))[2:(p+1)],
xlab="glm",ylab="glmnet")
abline(0,1)
#works fine with gaussian response and lambda=0 or default lambda
#glm and glmnet identical
mu = f
y=rnorm(N,mu)
fit=glmnet(x,y,family="gaussian",lambda=0)
fit2=glm(y~x)
plot(coef(fit2)[2:(p+1)], coef(fit)[2:(p+1)])
abline(0,1)
Here's the error message:
Warning messages:
1: from glmnet Fortran code (error code -1); Convergence for 1th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
2: In getcoef(fit, nvars, nx, vnames) : an empty model has been returned; probably a convergence issue
Updated:
The problem seems to be with the intercept estimated by glmnet when family = "poisson", and not with the setting of lambda per se.
fit=glmnet(x,y,family="poisson")
#intercept should be close to 0
coef(fit)[1,]
#but it is huge
#passing in intercept=FALSE however generates the convergence error again
fit=glmnet(x,y,family="poisson", intercept=FALSE)
I think you are confused about lambda and alpha. alpha is the elastic-net mixing parameter: setting it to 0 gives ridge regression, and setting it to 1 gives the lasso; typically it is set to something between 0.1 and 1. lambda controls the strength of the penalty and is typically not set by hand; there is a warning on the help page NOT to set it to a single value:
WARNING: use with care. Do not supply a single value for lambda
I don't know why you would expect a lasso penalty to give the same fit as an unpenalized Poisson model. The whole point of a penalized model is to be less subject to the biases and constraints of an ordinary regression model.
You get the error because you try to pass lambda = 0 to glmnet.
If you want to select the coefficients from glmnet for lambda = 0, you could use:
coef(fit, s=0)
This automatically selects the last (smallest) value of lambda in the fitted sequence. I guess you've basically done that already with s = min(fit$lambda). If you want to go even smaller than that, you may have to put in a lambda sequence manually, but this is a little bit tricky (glmnet seems a little stubborn about its lambdas).
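One way to do that, sketched below: supply a decreasing sequence that ends at 0, so glmnet can warm-start its way down the path instead of jumping straight to the unpenalized fit (the grid endpoints here are illustrative; x and y are from the question's simulation):
# Decreasing lambda sequence ending at 0; warm starts along the path
# make the final unpenalized fit easier to reach.
lambdas <- c(exp(seq(log(1), log(0.001), length.out = 50)), 0)
fit <- glmnet(x, y, family = "poisson", lambda = lambdas)
coef(fit, s = 0)   # coefficients at the unpenalized end of the path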
Also keep in mind that there may be some bias in glmnet's estimates, so they can differ slightly from the results of glm.
