How does glmnet deal with a high number of unpenalized parameters? - r

I'm trying to fit a Poisson model with glmnet where I know that a lot of the parameters have to be included in the model and left unpenalized, so I'm setting their penalty.factor to 0. Unfortunately, glmnet has convergence problems when the number of unpenalized parameters is high. Here is an example adapted from the glmnet manual where the algorithm doesn't converge if the number of unpenalized parameters is high.
library(glmnet)
N <- 500; p <- 300
nzc <- 5
x <- matrix(rnorm(N * p), N, p)
beta <- rnorm(nzc)
f <- x[, seq(nzc)] %*% beta
mu <- exp(f)
y <- rpois(N, mu)
pena <- rep(0, ncol(x))
pena[1:10] <- 1  # penalize only the first ten parameters (290 left unpenalized)
fit <- glmnet(x, y, family = "poisson", penalty.factor = pena, alpha = 0.9)
# fit does not converge
pena <- rep(0, ncol(x))
pena[1:250] <- 1  # penalize the first 250 parameters (50 left unpenalized)
fit <- glmnet(x, y, family = "poisson", penalty.factor = pena, alpha = 0.9)
# fit converges
I know that the model doesn't make sense for this example, but at least I can reproduce the same error as in my data. It looks like lambda_max is set to infinity. Does this mean that the algorithm can't find a finite lambda_max at which all the penalized parameters are zero? How can I fit a model where a lot of the parameters are unpenalized?
Thanks a lot
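One possible workaround (my own suggestion, not from the original post, and the lambda range below is arbitrary): supply an explicit, decreasing lambda sequence so that glmnet does not have to derive lambda_max itself.
lam <- exp(seq(log(1), log(1e-4), length.out = 100))  # arbitrary range; adjust to your data
fit <- glmnet(x, y, family = "poisson", penalty.factor = pena,
              alpha = 0.9, lambda = lam)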

Related

confidence interval of estimates in a fitted hybrid model by spatstat

Hybrid Gibbs models are flexible for fitting spatial point pattern data; however, I am confused about how to get a confidence interval for the fitted model's estimates. For instance, I fitted a hybrid Geyer model including a hardcore and a Geyer saturation component, and got the estimates:
Mo.hybrid <- Hybrid(H = Hardcore(), G = Geyer(81, 1))
my.hybrid <- ppm(my.X ~ 1, Mo.hybrid, correction = "bord")
# beta = 1.629279e-06
# Hard core distance: 31.85573
# Fitted G interaction parameter gamma: 10.241487
What I am interested in is gamma, which represents the aggregation of points. Obviously, the data X is a sample, e.g. of cells in an anatomical image. In order to report a statistical result, a confidence interval for gamma is needed. However, I do not have replicates of the image data.
Can I simulate the fitted hybrid model 10 times and then refit each simulation to get a confidence interval for the estimate? Something like:
mo.Y <- rmhmodel(cif = c("hardcore", "geyer"),
                 par = list(list(beta = 1.629279e-06, hc = 31.85573),
                            list(beta = 1, gamma = 10.241487, r = 81, sat = 1)),
                 w = my.X)
Y1 <- rmh(model = mo.Y, control = list(nrep = 1e6, p = 1, fixall = TRUE),
          start = list(n.start = npoint(my.X)))
Y1.fit <- ppm(Y1 ~ 1, Mo.hybrid, rbord = 0.1)
# simulate and fit Y2, Y3, ..., Y10 in the same way
or:
Y10 <- simulate(my.hybrid, nsim = 10)
Y1.fit <- ppm(Y10[[1]] ~ 1, Mo.hybrid, rbord = 0.1)
# fit Y2, Y3, ..., Y10 in the same way
Certainly the two algorithms seem different: rmh() can control the simulated intensity while simulate() does not.
Now the questions are:
Is it right to use simulation to get a confidence interval for the estimate?
Or can the fitted model provide an interval estimate that could be extracted directly?
If simulation is OK, which algorithm is better in my case?
The function confint calculates confidence intervals for the canonical parameters of a statistical model. It is defined in the standard stats package. You can apply it to fitted point process models in spatstat: in your example just type confint(my.hybrid).
You wanted a confidence interval for the non-canonical parameter gamma. The canonical parameter is theta = log(gamma), so if you do exp(confint(my.hybrid)) you can read off the confidence interval for gamma.
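A minimal sketch of that back-transformation, using the fitted model from the question:
ci <- confint(my.hybrid)   # CIs for the canonical parameters, theta = log(gamma) among them
exp(ci)                    # read off the row for the Geyer interaction to get the CI for gamma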
Confidence intervals and other forms of inference for fitted point process models are discussed in detail in the spatstat book chapters 9, 10 and 13.
The confidence intervals described above are the asymptotic ones (based on the asymptotic variance matrix using the central limit theorem).
If you really wanted to estimate the variance-covariance matrix by simulation, it would be safer and easier to fit the model using method='ho' (which performs the simulation) and then apply confint as before (which would then use the variance of the simulations rather than the asymptotic variance).
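A minimal sketch of that suggestion, reusing the formula and interaction from the question:
my.hybrid.ho <- ppm(my.X ~ 1, Mo.hybrid, correction = "bord", method = "ho")
exp(confint(my.hybrid.ho))   # CI based on the simulation variance; exponentiate to get gamma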
rmh.ppm and simulate.ppm are essentially the same algorithm, apart from some book-keeping. The differences observed in your example occur because you passed different arguments. You could have passed the same arguments to either of these functions.

auto.arima produces non-Gaussian residuals

I'm using R's auto.arima function, but it seems that it does not always produce Gaussian errors. I cannot find any documentation saying that it bootstraps the prediction error when the errors are not Gaussian, or describing what it does when the errors are not Gaussian.
Estimation does not require Gaussian errors, even when a Gaussian likelihood is being used. A Gaussian likelihood is almost the same as least squares and will give consistent estimates for any error distribution with finite variance.
The only time that the distribution of residuals really matters is when producing prediction intervals. If the residuals are not Gaussian, the default prediction intervals will not necessarily have the correct coverage. But then you can set bootstrap=TRUE and get bootstrapped prediction intervals which are based on the empirical distribution of the residuals.
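A minimal sketch of how this is typically requested (the object names are illustrative):
library(forecast)
fit <- auto.arima(y)                            # y is your time series
fc  <- forecast(fit, h = 12, bootstrap = TRUE)  # prediction intervals from resampled residuals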

Subset selection with LASSO involving categorical variables

I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used model.matrix() function on the independent variables, it automatically created dummy values for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, the reference level is "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficient results from the LASSO, I noticed that worker_typecontr and worker_typeother both have zero coefficients. How should I interpret the results? What is the coefficient for FTE in this case? Should I just take this variable out of the formula?
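For reference, a minimal sketch of the dummy coding being described (the data frame d is hypothetical, and FTE is assumed to be the first factor level):
d$worker_type <- factor(d$worker_type, levels = c("FTE", "contr", "other"))
x <- model.matrix(~ worker_type, data = d)
head(x)   # columns: (Intercept), worker_typecontr, worker_typeother; FTE is absorbed into the intercept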
Perhaps this question is suited more for Cross Validated.
Ridge Regression and the Lasso are both "shrinkage" methods, typically used to deal with high dimensional predictor space.
The fact that your Lasso regression reduces some of the beta coefficients to zero indicates that the Lasso is doing exactly what it is designed for! By its mathematical definition, the Lasso assumes that a number of the coefficients are truly equal to zero. The interpretation of coefficients that go to zero is that these predictors do not explain any of the variance in the response compared to the non-zero predictors.
Why does the Lasso shrink some coefficients to zero? We need to look at how the coefficients are chosen. The Lasso is essentially a multiple linear regression problem that is solved by minimizing the residual sum of squares (RSS) plus a special L1 penalty term that shrinks coefficients towards 0. The quantity being minimized is
RSS + lambda * sum_{j=1}^{p} |beta_j|
where p is the number of predictors and lambda is a non-negative tuning parameter. When lambda = 0, the penalty term drops out and you have ordinary multiple linear regression. As lambda grows, the coefficients are shrunk more strongly towards zero, so the fit has more bias but less variance; at small values of lambda the fit has less bias but higher variance (i.e. it is more prone to overfitting).
Cross-validation should be used to select the tuning parameter lambda: take a grid of lambda values, compute the cross-validation error for each value, and select the lambda for which the cross-validation error is lowest.
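A minimal sketch of that procedure with glmnet (assuming x is the model.matrix of predictors and y is the response):
library(glmnet)
cvfit <- cv.glmnet(x, y, alpha = 1)    # alpha = 1 is the Lasso; CV over a lambda grid by default
cvfit$lambda.min                       # lambda with the lowest cross-validation error
coef(cvfit, s = "lambda.min")          # coefficients at that lambda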
The Lasso is useful in some situations and helps in generating simple models, but special consideration should be paid to the nature of the data itself and to whether another method, such as ridge regression or OLS regression, is more appropriate given how many predictors are truly expected to be related to the response.
Note: See equation 6.7 on page 221 of "An Introduction to Statistical Learning", which is freely available online.

Comparing AIC for different types of models (beta and normal)

I have responses which are proportions, mainly centered around 0.6-0.7, with few of them close to 0 or 1. I have tried fitting both normal and beta models, and the normal models yield lower AIC than the beta models. I use lm for fitting the normal model and betareg for the beta model.
But I wonder if it is really possible to compare AIC values across different model types like that. I do, of course, use the same response variable and the same data for both regressions.
Note: I tried to read about the Kullback-Leibler divergence here: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4049836 ("The AIC Criterion and Symmetrizing the Kullback–Leibler Divergence"), but got confused by this sentence on page two: "It is also assumed as in [12] that the search is carried out in a parametric family of distribution including the true model.", where [12] refers to Akaike's 1974 article. Does this imply that I cannot compare the AIC of a beta and a normal model, since the true model cannot be both beta and normal?
Note 2: I tried to logit-transform the responses and then fit a normal model, but that just made the residual plots look worse.
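For what it's worth, this is how the comparison is usually computed; the sketch assumes a hypothetical data frame d with a proportion response y and predictors x1, x2, and does not by itself settle whether the comparison is meaningful:
library(betareg)
fit.norm <- lm(y ~ x1 + x2, data = d)
fit.beta <- betareg(y ~ x1 + x2, data = d)
AIC(fit.norm, fit.beta)   # both AICs are computed from full log-likelihoods on the same response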

re-estimating confidence of C5.0 rules using another dataset

Context: I have fitted a glmnet to my data, but for operational reasons we would actually like to have a rule set. So I then fitted a C5.0Rules model to the class predicted by my glmnet, i.e. the C5.0Rules model is essentially approximating my glmnet. However, as a result, the C5.0Rules model reports a very high confidence (and other performance metrics), because its target is easy. A natural approach to correct this is to re-estimate the confidence (and the other performance metrics) using the real response, or another dataset. But I need to do this in such a way that the model remembers the new confidence, so that in the future it will report the corrected confidence level along with its predictions. How do I do that?
Reproducible example:
library(glmnet)
library(C50)
library(caret)
data(churn)
## original glmnet
glmnet <- train(churn ~ . - state - area_code - international_plan - voice_mail_plan,
                data = churnTrain, method = "glmnet")
## only retain useful predictors
temp <- varImp(glmnet)$importance
reducedVar <- rownames(temp)[temp > 0]
churnTrain2 <- data.frame(churnTrain[, match(reducedVar, colnames(churnTrain))],
                          prediction = fitted(glmnet))
## fit my C5.0, which approximates the glmnet prediction
C5 <- train(prediction ~ ., data = churnTrain2, method = "C5.0Rules")
summary(C5)  ## notice the high confidence and performance measures
(An alternative approach I can think of is to get C5.0 to predict the predicted probability instead of class, but this turns it into a regression problem so I won't be able to use C5.0)
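One partial sketch (my own suggestion, not a full answer): re-estimate the performance metrics against the real response with caret. Note that this reports corrected metrics but does not store them inside the fitted C5.0 model:
realPred <- predict(C5, newdata = churnTrain2)   # rule-set predictions
confusionMatrix(realPred, churnTrain$churn)      # accuracy etc. against the real response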
