Question regarding LASSO confidence intervals using the selectiveInference package in R

I want to get confidence intervals for LASSO regression. For this, I used the selectiveInference package in R.
The fixedLassoInf function in this package provides confidence intervals for lasso regression for a given value of lambda. Also, we can pass the coefficient vector obtained from the glmnet package to this function.
The coefficients for LASSO logistic regression for a given lambda, obtained with the glmnet package, are as follows:
require(ISLR)
require(glmnet)
require(selectiveInference)
y1 <- Default$default
x1 <- model.matrix(default ~ student + balance + income + student*income, Default)[, -1]
lasso.mod1 <- glmnet(x1,y1, alpha = 1, lambda = 0.0003274549,family='binomial')
lasso.mod1$beta
4 x 1 sparse Matrix of class "dgCMatrix"
                              s0
studentYes         -6.131640e-01
balance             5.635401e-03
income              2.429232e-06
studentYes:income              .
Then I used the fixedLassoInf function in the selectiveInference package to get the confidence intervals:
y1 <- Default$default
# 'lambda' is assumed to be defined earlier, consistent with the penalty used in the glmnet fit
beta <- coef(lasso.mod1, x = x1, y = y1, s = lambda/1000, exact = TRUE)
y1 <- ifelse(y1 == "No", 0, 1)   # levels of Default$default are "No"/"Yes"
out <- fixedLassoInf(x1, y1, beta, lambda, family = "binomial", alpha = 0.05)
out
However, I am getting the following warning messages:
Warning messages:
1: In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = "partial", :
Solution beta does not satisfy the KKT conditions (to within specified tolerances)
2: In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = "partial", :
Solution beta does not satisfy the KKT conditions (to within specified tolerances). You might try rerunning glmnet with a lower setting of the 'thresh' parameter, for a more accurate convergence.
3: glm.fit: algorithm did not converge
Also, the output I am getting does not look correct:
Call:
fixedLassoInf(x = x1, y = (y1), beta = beta, lambda = lambda,
    family = "binomial", alpha = 0.05)

Testing results at lambda = 0.000, with alpha = 0.050

 Var     Coef   Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   1 1142.801  1884.776       1      -Inf  -60.633           0          0
   2    0.386  1664.734       0     0.023       Inf          0          0
   3    0.029  3318.110       0     0.001       Inf          0          0
   4   -0.029 -1029.985       1      -Inf   -0.003           0          0

Note: coefficients shown are partial regression coefficients
Based on the warning messages, there is a problem with the Karush-Kuhn-Tucker (KKT) conditions.
Can anyone help me to figure this out?
Thank you.

One of my university teachers always said
Fitting is an art, not a technique.
What I mean: do expect that you will need manual work for parameter guessing and multiple iterations of fitting. You might even question the method of fitting itself, but let's not go down that path.
Anyhow, R will not do the magic of finding the correct model (here: the number of parameters for the LASSO) for you. From the output you show, you seem to have 4 variables, of which 3 are close to zero, so I suggest starting with:
Bounding the maximum number of variables in the model, e.g. dfmax = 2 seems a good start
Limiting the maximum number of variables ever to be nonzero, e.g. pmax = 2
The glmnet documentation details further options.
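A minimal sketch (mine, not part of the original answer) of such a refit, reusing x1 and y1 from the question; the values of dfmax, pmax and thresh are illustrative assumptions, with thresh lowered as the KKT warning itself suggests:
# Refit with a cap on the number of nonzero coefficients and a tighter
# convergence tolerance, then inspect the selected coefficients again.
lasso.mod2 <- glmnet(x1, y1, alpha = 1, lambda = 0.0003274549, family = "binomial",
                     dfmax = 2, pmax = 2, thresh = 1e-10)
lasso.mod2$beta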

Related

emmeans: regrid() for binomial GLMM with user-defined link function

I have fitted a binomial GLMM in R with a modified link function with a fixed guessing probability as suggested in this thread - except that the guessing probability is 1/2 and not 1/3. Therefore the sigmoidal activation in my case becomes:
P(correct) = 0.5 + 0.5*(exp(term)/(1 + exp(term))).
My model looks like this:
library(lme4)
m = 2
mod = glmer(correct ~ group*stim_strength + (stim_strength|subject) ,
family=binomial(link=mafc.logit(m)), data=obs_data)
where: guessing probability is 1/m; correct is a categorical variable indicating correct/incorrect response; group is a factor with two levels; stim_strength is numerical with values in [0,1]; mafc.logit is the function suggested in the thread.
I'm essentially fitting separate psychometric curves of the stimulus strength (stim_strength) for the two groups, while taking into account the inter-subject fluctuations in slope and intercept (random effect structure (stim_strength|subject))
This is what I get:
plot_model(mod, type = 'emm', terms = c('stim_strength', 'group'))
(plot of the estimated marginal means against stim_strength for the two groups)
The model describes the data nicely, and I now want to perform some post-hoc analyses on it. Specifically, I want to run for example:
mod.emm = emmeans(mod, ~group|stim_strength, at=list(stim_strength=c(.25,.75)))
confint(regrid(mod.emm))
contrast(regrid(mod.emm), 'pairwise', simple = 'group', combine = TRUE, adjust = 'holm')
i.e. compute confidence intervals for the %correct of the two groups at some specified values of stim_strength, and compare the %correct of the two groups at these values.
Note that I'm using regrid(), because I want the analyses to be done on the back-transformed values, not on the linear part of the model!
However, regrid() won't work with a user-defined link function. In fact, the regrid() call is simply ignored here, as you can see e.g. from the output of the confint() call above (estimates are labelled as prob, but they're clearly not transformed to [.5, 1]):
stim_strength = 0.25:
group prob SE df asymp.LCL asymp.UCL
1 -1.329 0.173 Inf -1.716 -0.942
2 -0.553 0.161 Inf -0.913 -0.192
stim_strength = 0.75:
group prob SE df asymp.LCL asymp.UCL
1 1.853 0.372 Inf 1.018 2.687
2 3.375 0.395 Inf 2.489 4.261
Similarly, when adding type='response' in emmeans, I get the message:
Unknown transformation "mafc.logit(2)": no transformation done
Any workaround?
Thanks!
Looking at the linked suggestion, it appears that mafc.logit() is a function that returns a list with all the information needed to implement the transform. All you need to do is update the emmGrid object with that information:
mod.emm <- update(mod.emm, tran = mafc.logit(2))
confint(regrid(mod.emm), adjust = 'holm')
# etc...
See, for example, this vignette section and possibly other parts of that vignette.
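For context, here is a minimal hand-rolled sketch of what such a link object can look like for the 1/m-guessing ("mAFC") logit, using the standard link-glm fields that emmeans understands via tran =. This is my own illustration under stated assumptions, not the actual mafc.logit() source:
# Sketch of a 2AFC (guessing rate 1/m) logit link in the standard "link-glm" layout.
mafc_logit_sketch <- function(m = 2) {
  gamma <- 1 / m                                                  # guessing probability
  structure(list(
    linkfun  = function(mu)  qlogis((mu - gamma) / (1 - gamma)),  # mu in (gamma, 1) -> eta
    linkinv  = function(eta) gamma + (1 - gamma) * plogis(eta),   # eta -> mu in (gamma, 1)
    mu.eta   = function(eta) (1 - gamma) * dlogis(eta),           # d mu / d eta
    valideta = function(eta) TRUE,
    name     = paste0("mafc.logit(", m, ")")
  ), class = "link-glm")
}
# With an object of this form attached via update(mod.emm, tran = ...),
# regrid() and type = "response" back-transform onto the [0.5, 1] probability scale.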

Standard Error of the Regression for NLS Model

I am currently working on a non-linear analysis of various datasets using an nls model. In addition, I want to calculate the standard error of the regression for the nls model.
The formula of the standard error of regression:
n  <- nrow(na.omit(data))
SE <- sqrt(sum((pv - av)^2) / (n - 2))
where pv is the predicted value and av is the actual value.
I have a problem calculating the standard error. Should I calculate the predicted and actual values first? Are the values based on the dataset? Any help is highly appreciated. Thank you.
R provides this via sigma:
fm <- nls(demand ~ a + b * Time, BOD, start = list(a = 1, b = 1))
sigma(fm)
## [1] 3.085016
This also works, since deviance() gives the residual sum of squares:
sqrt(deviance(fm) / (nobs(fm) - length(coef(fm))))
## [1] 3.085016
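To connect this with the formula in the question: yes, you first get the predicted values from the fitted model and the actual values from the data, then combine them. A small sketch, assuming the same BOD fit as above:
pv <- fitted(fm)          # predicted values from the nls fit
av <- BOD$demand          # actual values from the data
n  <- nobs(fm)
p  <- length(coef(fm))    # number of estimated parameters (here 2, so n - p = n - 2)
sqrt(sum((pv - av)^2) / (n - p))
## should again match sigma(fm): 3.085016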

SIMR package - effect sizes

I'm using the SIMR package to estimate power and effect sizes for my models. I don't understand how the package estimates the effect sizes, though, or what kind of effect it reports (is it Cohen's d?).
E.g.
For my model, in which AQ and LSAS are continuous predictors and cond is a categorical (3-level) predictor, I get this output (for AQ):
> model.cnv.cue = lme4::lmer(DV ~ AQ_centr + cond + LSAS_centr + (1 | code), data = mydata, REML = FALSE)
> powerSim(model.cnv.cue,nsim = 200)
Power for predictor 'AQ_centr', (95% confidence interval):
60.50% (53.36, 67.32)
Test: Kenward Roger (package pbkrtest)
Effect size for AQ_centr is -0.048
Based on 200 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 153
Time elapsed: 0 h 0 m 23 s
nb: result might be an observed power calculation
Is it Cohen's d = -0.048? Or r? What does Kenward Roger test have to do with this?
And then, when I run it for the categorical predictor, there are no effect sizes reported:
> model.cnv.cue = lme4::lmer(CNV_500_cue ~ cond + AQ_centr + LSAS_centr + (1 | code), data = ANT, REML = FALSE)
> powerSim(model.cnv.cue,nsim = 200)
Power for predictor 'cond', (95% confidence interval):
95.50% (91.63, 97.92)
Test: Likelihood ratio
Based on 200 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 153
Time elapsed: 0 h 0 m 13 s
nb: result might be an observed power calculation
So how does the package estimate the effect sizes? And how to get effect sizes for categorical predictors?
The effect size -0.048 is the slope of your predictor AQ_centr.
Kenward-Roger tests are used to calculate the p-values for your continuous predictor; for your categorical predictor, likelihood ratio tests are used. Instead of Kenward-Roger you could also have used bootstrap, etc. (it is just the way the p-values are computed).
Your 3-level categorical predictor is probably split into 2 dummy variables when entering the model. If you are interested in the effect of one specific dummy variable (let's say cond2), you can run a z-test on it, like so:
powerSim(model.cnv.cue, fixed('cond2', 'z'), nsim=200)
To find out about the dummy variables, you can take a look at the model summary:
summary(model.cnv.cue)$coef
More info can be found here:
https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12504
https://besjournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2F2041-210X.12504&file=mee312504-sup-0001-AppendixS1.html
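If you want to control the effect size rather than just read off the fitted slope, simr also lets you set the slope you want to detect before simulating. A sketch under my own assumptions (the value 0.1 is purely illustrative; "kr" requests the same Kenward-Roger test reported above):
library(simr)
fixef(model.cnv.cue)["AQ_centr"] <- 0.1   # posit the slope you want to be able to detect
powerSim(model.cnv.cue, fixed("AQ_centr", "kr"), nsim = 200)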

Model evaluation in R with confusion matrix

Hi, I have used the ROCR package to check the performance of a model, and I would like to do more evaluation, like a confusion matrix with kappa values, or k-fold cross-validation.
Below are the model and the predictions; any help would be great.
model <- cv.glmnet(sparesemx[train.set, ],
                   first.round[train.set],
                   alpha = 0.05,
                   family = 'binomial')

training$sparse.fr.hat <- predict(model, newx = sparesemx, type = 'response')[, 1]

predictions <- prediction(training$sparse.fr.hat[test.set],
                          first.round[test.set])
perform <- performance(predictions, 'tpr', 'fpr')
plot(perform)
performance(predictions, 'auc')
I am trying to use the caret library with the confusionMatrix() function, but I am unable to generate the matrix. I have tried several inputs for the two arguments, but I am not sure what is needed.
Worked example, step by step in explicit detail.
library(OptimalCutpoints)
library(caret)
library(glmnet)
library(e1071)
data(elas) #predicting for variable "status"
Split the elas data into training (dev) and testing (val)
sample.ind <- sample(2,
                     nrow(elas),
                     replace = TRUE,
                     prob = c(0.6, 0.4))
elas.dev <- elas[sample.ind == 1, ]
elas.val <- elas[sample.ind == 2, ]
This example uses a logistic model, so this is how the formula is specified; it plays the same role as your sparesemx matrix.
formula.glm<-glm(status ~ gender + elas, data = elas, family = binomial)
xfactors<-model.matrix(formula.glm)[,-1]
glmnet.x<-as.matrix(xfactors)
glmmod<-glmnet(x=glmnet.x[sample.ind==1,],y=elas.dev$status,alpha=1,
family='binomial')
#if you care; the lasso model includes both predictors
#cv.glmmod <- cv.glmnet(x=glmnet.x[sample.ind==1,], y=elas.dev$status, alpha=1, family='binomial')
#plot(cv.glmmod)
#cv.glmmod$lambda.min
#coef(cv.glmmod, s="lambda.min")
Now you have to get the predicted values for the status variable using the two selected predictors from glmnet, which you did.
bestglm<-glm(status ~ gender + elas, data = elas.dev, family = binomial)
You got about as far as here. I'm using the fitted.values from my object and you're using prediction(), but either way you should end up with a column of actual values and a column of fitted values. This still doesn't tell you where the cutpoint is: where do you draw the line between what is "positive" and what is "negative"?
I suggest using OptimalCutpoints for this.
Set this up for optimal.cutpoints; the container that comes next is just a data.frame in which both variables have the same length. It contains actual versus predicted values from the glm.
container.for.OC<-data.frame(fit=bestglm$fitted.values, truth=elas.dev$status)
I am using the Youden criteria here but there are many choices for the criteria.
optimal.cutpoint.Youden<-optimal.cutpoints(X = fit ~ truth , tag.healthy = 0,
methods = "Youden", pop.prev = NULL, data=container.for.OC,
control = control.cutpoints(), ci.fit = FALSE, conf.level = 0.95, trace = FALSE)
summary(optimal.cutpoint.Youden)
Here is what I got:
Area under the ROC curve (AUC): 0.818 (0.731, 0.905)

CRITERION: Youden
Number of optimal cutoffs: 1

                    Estimate
cutoff             0.4863188
Se                 0.9180328
Sp                 0.5882353
PPV                0.8000000
NPV                0.8000000
DLR.Positive       2.2295082
DLR.Negative       0.1393443
FP                14.0000000
FN                 5.0000000
Optimal criterion  0.5062681
#not run
#plot(optimal.cutpoint.Youden)
Now apply what you've learned from the Youden cutoff to your validation set, elas.val.
This should match the cutoff from the table above.
MaxYoudenCutoff <- optimal.cutpoint.Youden$Youden$Global$optimal.cutoff$cutoff
This will give you the predicted levels from the Youden cutpoint. They have to be a factor object for your confusionMatrix function.
val.predicted<-predict(object=bestglm, newdata=elas.val, type="response")
val.factor.level<-factor(ifelse(val.predicted >=MaxYoudenCutoff,"1","0"))
Like before, make a small container for the confusionMatrix function.
container.for.CM <- data.frame(truth=factor(elas.val$status), fit=val.factor.level)
confusionMatrix(data=container.for.CM$fit, reference=container.for.CM$truth)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0  7  8
         1  6 37

               Accuracy : 0.7586
                 95% CI : (0.6283, 0.8613)
    No Information Rate : 0.7759
    P-Value [Acc > NIR] : 0.6895

                  Kappa : 0.342
 Mcnemar's Test P-Value : 0.7893

            Sensitivity : 0.5385
            Specificity : 0.8222
         Pos Pred Value : 0.4667
         Neg Pred Value : 0.8605
             Prevalence : 0.2241
         Detection Rate : 0.1207
   Detection Prevalence : 0.2586
      Balanced Accuracy : 0.6803

       'Positive' Class : 0
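The question also asked about k-fold evaluation; caret can cross-validate essentially the same logistic model. A small sketch under my own assumptions (5 folds, arbitrary "neg"/"pos" level labels), reusing the elas.dev data from above:
library(caret)
elas.dev$status.f <- factor(elas.dev$status, labels = c("neg", "pos"))  # caret wants valid factor level names
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
cv.fit <- train(status.f ~ gender + elas, data = elas.dev,
                method = "glm", family = binomial,
                trControl = ctrl, metric = "ROC")
cv.fit   # cross-validated ROC, sensitivity and specificity across the 5 folds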

Cubic spline method for longitudinal series data?

I have a serial data formatted as follows:
time milk Animal_ID
30 25.6 1
31 27.2 1
32 24.4 1
33 17.4 1
34 33.6 1
35 25.4 1
33 29.4 2
34 25.4 2
35 24.7 2
36 27.4 2
37 22.4 2
80 24.6 3
81 24.5 3
82 23.5 3
83 25.5 3
84 24.4 3
85 23.4 3
. . .
Generally, about 300 animals have milk records at different time points over a short period. However, if we pool their data together and ignore the different Animal_IDs, we get a curve of milk ~ time like the line in the figure below:
The figure also shows the data for one example animal; each animal's series is short and highly variable. My purpose is to smooth each animal's data, but it would be good if the model also allowed the general pattern learned from the whole data to be included. I tried different smoothing models (ns, bs, smooth.spline) with the following format, but it just did not work:
mod <- lme(milk ~ bs(time, df=3), data=dat, random = ~1|Animal_ID)
I am hoping that somebody who has already dealt with this problem can give me some advice. Thanks.
The full dataset can be accessed from here:
https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0
I would suggest you use the mgcv package. It is one of the recommended R packages and fits a class of models called generalized additive mixed models. You can simply load it with library(mgcv). It is a very powerful library, which can handle everything from the simplest linear regression model to generalized linear models, additive models, and generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv via
ls("package:mgcv")
And you can see there are many of them.
For your specific data and problem, you may use a model with formula:
model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')
In mgcv, s() sets up a smooth function, represented by the spline basis implied by bs. "cr" is the cubic regression spline basis, which is exactly what you want. k is the number of knots. It should be chosen depending on the number of unique values of the variable time in your data set. If you set k to exactly this number, you end up with a smoothing spline, while any smaller value gives a regression spline. However, both will be penalized (if you know what penalization means). I read your data in:
dat <- na.omit(read.csv("data.txt", header = TRUE)) ## I saved you data into file "data.txt"
dat$Animal_ID <- factor(dat$Animal_ID)
nrow(dat) ## 12624 observations
length(unique(dat$time)) ## 157 unique time points
length(ID <- levels(dat$Animal_ID)) ## 355 cows
There are 157 unique values, so I reckon k = 100 is possibly appropriate.
For Animal_ID (coerced to a factor), we need a model term for the random effect. "re" is a special class for i.i.d. random effects. It is passed to bs for internal matrix-construction reasons (so this is not really a smooth function!).
Now, to fit a GAM, you can call the legacy gam or the constantly developing bam ("gam for big data"). I think you will want the latter. They have the same calling convention, similar to lm and glm. For example, you can do:
fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
As you can see, bam allows multi-core parallel computation via nthreads, while discrete is a newly developed feature which speeds up matrix formation.
Finally, since you are dealing with time series data, you might consider some temporal autocorrelation. mgcv allows an AR(1) correlation structure, whose correlation coefficient is passed via the bam argument rho. However, you need an extra index AR.start to tell mgcv how the time series breaks up into pieces: whenever a new Animal_ID starts, AR.start gets a TRUE to indicate a new segment of the time series. See ?bam for details.
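A small sketch (mine, assuming dat is sorted by Animal_ID and then time) of how such an index could be built and passed to bam; the value rho = 0.6 is purely illustrative:
# One AR(1) segment per cow: TRUE marks the first row of each cow's series.
dat <- dat[order(dat$Animal_ID, dat$time), ]
AR.start <- !duplicated(dat$Animal_ID)
fit.ar <- bam(model, data = dat, family = "gaussian",
              rho = 0.6, AR.start = AR.start)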
mgcv also provides
summary.gam function for model summary
gam.check for basic model checking
plot.gam function for plotting individual terms
predict.gam (or predict.bam) for prediction on new data.
For example, the summary of the above suggested model is:
> summary(fit)
Family: gaussian
Link function: identity
Formula:
milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.1950 0.2704 96.89 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 10.81 13.67 5.908 1.99e-11 ***
s(Animal_ID) 351.43 354.00 136.449 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.805 Deviance explained = 81.1%
fREML = 29643 Scale est. = 5.5681 n = 12624
The edf (effective degrees of freedom) may be thought of as a measure of the degree of non-linearity. We put in k = 100, but ended up with edf = 10.81. This suggests that the spline s(time) has been heavily penalized. You can view what s(time) looks like with:
plot.gam(fit, page = 1)
Note that the random effect s(Animal_ID) also has a "smooth"; that is, a cow-specific constant. For random effects, a Gaussian QQ plot is returned.
The diagnostic figures returned by
invisible(gam.check(fit))
look OK, so I think the model is acceptable (I am not offering you model selection, so think up a better model if you believe there is one).
If you want to make prediction for Animal_ID = 26, you may do
newd <- data.frame(time = 1:150, Animal_ID = 26)
oo <- predict.gam(fit, newd, type = "link", se.fit = TRUE)
Note that
You need to include both variables in newd (otherwise mgcv complains about a missing variable).
Since you have only one spline smooth, s(time), and the random effect term s(Animal_ID) is a constant per Animal_ID, it is OK to use type = 'link' for individual prediction. By the way, type = 'terms' is slower than type = 'link'.
If you want to make prediction for more than one cows, try something like this:
pred.ID <- ID[1:10] ## predict first 10 cows
newd <- data.frame(time = rep(1:150, times = length(pred.ID)),
                   Animal_ID = factor(rep(pred.ID, each = 150)))
oo <- predict.bam(fit, newd, type = "link", se.fit = TRUE)
Note that
I have used predict.bam here, as we now have 150 * 10 = 1500 data points to predict, and we also require se.fit = TRUE. This is rather expensive, so predict.bam is faster than predict.gam. In particular, if you fitted your model with bam(..., discrete = TRUE), you can use predict.bam(..., discrete = TRUE). The prediction process goes through the same matrix-formation steps as the model fitting (see ?smoothCon, used in model fitting, and ?PredictMat, used in prediction, if you are keen to know more about the internal structure of mgcv).
I specified Animal_ID as a factor, because it is a random effect.
For more on mgcv, you can refer to the library manual. Check especially ?mgcv, ?gam, ?bam, and ?s.
Final update
Though I said that I would not help you with model selection, I think this model is better (it gives a higher adjusted R-squared) and is also more reasonable in the following sense:
model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
The last term imposes a random slope. This implies that we assume each individual cow has a different growing/declining pattern of milk production, which is a more sensible assumption for your problem; the earlier model with only a random intercept is not sufficient. After adding this random slope, the smooth term s(time) looks smoother. This is a good sign, not a bad sign, because we want a simple explanation for s(time), don't we? Compare the s(time) you get from both models and see what you discover.
I have also reduced k from 100 to 20. As we saw in the previous fit, the edf for this term is about 10, so k = 20 is quite sufficient.
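A minimal sketch (mine, reusing the data and bam settings from above) of fitting this random-slope model and comparing it with the earlier fit:
# Fit the random-intercept + random-slope model and compare with `fit`.
model2 <- milk ~ s(time, bs = "cr", k = 20) + s(Animal_ID, bs = "re") + s(Animal_ID, time, bs = "re")
fit2 <- bam(model2, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
summary(fit2)$r.sq            # compare with summary(fit)$r.sq from the earlier model
plot.gam(fit2, select = 1)    # s(time) should now look smoother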
