error flexible calibration curve with val.prob.ci.2 in LASSO logistic regression model (internal calibration) in R - r

I want to fit a flexible calibration curve after developing a logistic regression model with the cv.glmnet function (LASSO) in R.
Here's part of my code:
install.packages("glmnet")
install.packages("CalibrationCurves")
#building model with 4 predictors, binomial outcome, 10 fold cross-validation for identifying lambda.min
cv.fit <-cv.glmnet(x=data.matrix(data[,2:5]),y=data$outcome, alpha=1,standardize=TRUE,intercept=TRUE,type.measure="deviance",nfolds=10,weights=WeightsTest2)
#calculating predicted probabilities of model in original dataset
pred_original_logodds <- c(predict(cv.fit, newx=data.matrix(data[,2:5]), s="lambda.min",type="response"))
#calculating probabilities
predict_original <- exp(pred_original_logodds)/(1+exp(pred_original_logodds))
#Fit calibration plot using val.prob.ci.2 function
CalibrationCurves::val.prob.ci.2(p = predict_original, y = data$outcome)
I get the following warning message: Warning: collapsing to unique 'x' values.
#Output
A 95% confidence interval is given for the calibration intercept, calibration slope and c-statistic.
        Dxy      C (ROC)           R2            D
  0.5737520    0.7868760    0.2907194    0.2128636
   D:Chi-sq          D:p            U     U:Chi-sq
127.6538494    0.0000000    0.5822965  348.4664350
        U:p            Q        Brier    Intercept
  0.0000000   -0.3694329    0.2689491   -1.4232385
      Slope         Emax Brier scaled         Eavg
  6.7630170    0.4582670   -0.4947364    0.3431554
        ECI
 13.4254419
The flexible calibration curve is not properly fitted, I think due to the "unique x values". What does this warning message mean?
Regards,
Max
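One thing that may be worth checking in the code above (an observation, not a confirmed diagnosis): assuming the model is fitted with family = "binomial", predict(..., type = "response") already returns probabilities, so applying exp(x)/(1+exp(x)) to them a second time squeezes every prediction into roughly 0.5 to 0.73. That alone could produce a calibration slope far from 1. The warning text itself matches the one base R interpolation emits when x contains tied values, which many identical predicted probabilities would cause. A minimal base R sketch with made-up probabilities:

```r
# Hypothetical predicted probabilities, standing in for type = "response" output
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)

# Applying the inverse logit to values that are already probabilities
p_twice <- exp(p) / (1 + exp(p))
range(p_twice)  # everything ends up between 0.5 and plogis(1)

# The transform is only appropriate for log-odds (type = "link")
logodds <- qlogis(p)
p_once <- plogis(logodds)
all.equal(p_once, p)
```

If this applies, either request type = "link" and keep the transformation, or keep type = "response" and pass the result straight to val.prob.ci.2.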

Related

Confidence Interval of the predicted mean of a LMER object for large dataset

I would like to get the confidence interval (CI) for the predicted mean of a Linear Mixed Effect Model on a large dataset (~40k rows), which is itself a subset of an even larger dataset. This CI is then used for estimating the uncertainty of another calculation that uses the mean and its related CI as input data.
I managed to create a prediction estimate and interval for the full dataset, but a Prediction Interval is not the same as, and is much larger than, a CI. Besides bootstrapping (which takes far too much time with this much data), I cannot find a method that would allow me to estimate a CI, either because it throws errors or because it only offers Prediction Intervals.
I moved into LME quite recently and might therefore have overlooked some obvious method.
Here is what I did so far in more detail:
The input data is confidential and I can therefore unfortunately not share any extract.
But in general, we have one dependent variable (y) representing the probability of an event, two categorical variables (c1 and c2), and two continuous variables (x1 and x2), with a weighting factor (w1). Some values in the dataset are missing. An extract of the first rows of the data could look like the example below:
c1      c2     x1  x2  w1   y
London  small  1   10  NA   NA
London  small  1   20  NA   NA
London  large  2   10  0.2  0.1
Paris   small  1   10  0.2  0.23
Paris   large  2   10  0.3  0.3
Based on this input data, I am then fitting a LMER model in the following form:
lmer1 <- lme4::lmer( y ~ x1 * poly(x2, 5) + ((x1 * poly(x2 ,5)) | c1),
data = df,
weights = w1,
control = lme4::lmerControl(check.conv.singular = lme4::.makeCC(action = "ignore", tol = 1e-3)))
This runs for some minutes and returns several warnings:
Warning messages: 1: In optwrap(optimizer, devfun, getStart(start,
rho$pp), lower = rho$lower, : convergence code 5 from nloptwrap:
NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above)
was reached.
2: In checkConv(attr(opt, “derivs”), opt$par, ctrl =
control$checkConv, : unable to evaluate scaled gradient
3: In checkConv(attr(opt, “derivs”), opt$par, ctrl =
control$checkConv, : Model failed to converge: degenerate Hessian with
11 negative eigenvalues
I increased the MAXEVAL parameter but this still did not help to get rid of the warnings and I found that despite these warnings, the model is still fitted. I therefore started to apply different methods to get a prediction of the mean for the whole dataset and the related CI for the mean.
predictInterval
I started with creating a Prediction Interval for the full dataset:
predictions <- merTools::predictInterval(lmer1,
newdata = df,
which = "full",
n.sims = 1000,
include.resid.var = FALSE,
level=0.95,
stat="mean")
However, as stated above, the Prediction Interval is not the same as the CI (see also https://datascienceplus.com/prediction-interval-the-wider-sister-of-confidence-interval/).
I found that the general predict function has the option to set interval to either “prediction” or “confidence”, but this option does not exist with the prediction from a LMER object. And I could not find another possibility to switch from Prediction Interval to CI – even though I would believe that the data drawn should be sufficient to do this.
confint
I then saw that there is a function called “confint”, but when running this function I get the following error:
predicition_ci = lme4::confint.merMod(lmer1)
Computing profile confidence intervals ...
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling
detected new, lower deviance
In addition: Warning messages:
1: In commonArgs(par, fn, control, environment()) : maxfun < 10 *
length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, x@theta, lower = x@lower, calc.derivs
= TRUE, : convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
I found this thread (Error when estimating CI for GLMM using confint()), which said that I need to reduce the “devtol” parameter by setting a different profile. But doing so results in the same error:
lmer1_devtol = profile(lmer1, devtol = 1e-7)
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling
detected new, lower deviance
In addition: Warning messages:
1: In commonArgs(par, fn, control, environment()) : maxfun < 10 *
length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, x@theta, lower = x@lower, calc.derivs
= TRUE, : convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
add_ci
I found the function “add_ci” but this again resulted in another error:
predictions_ci = ciTools::add_ci(df, lmer1,
alpha = 0.05)
Error in levelfun(r, n, allow.new.levels = allow.new.levels) : new
levels detected in newdata
I then set the new “allow.new.levels” parameter to TRUE like in the description of the prediction function, but this parameter seems not to be carried through:
predictions_ci = ciTools::add_ci(df, lmer1,
alpha = 0.05,
allow.new.levels = TRUE)
Error in levelfun(r, n, allow.new.levels = allow.new.levels) : new
levels detected in newdata
Diag
I found a method to calculate CI intervals for the sleepstudy data, which uses a matrix conversion with diag.
Designmat <- model.matrix(as.formula("y ~ x1 * poly(x2, 5)")[-2], df)
predvar <- diag(Designmat %*% vcov(lmer1) %*% t(Designmat))
#With new data
newdat = df
newdat$pred <- predict(lmer1, newdat, allow.new.levels = TRUE)
Designmat <- model.matrix(formula(lmer1)[-2], newdat)
But the diag method does not work for such large datasets.
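A likely reason the diag approach fails at this size: Designmat %*% vcov(lmer1) %*% t(Designmat) builds a full n x n matrix (roughly 40k x 40k here) only to read off its diagonal. A minimal sketch of the usual workaround follows, using simulated stand-ins: X and V are hypothetical replacements for the design matrix and for as.matrix(vcov(lmer1)). As in the snippet above, this captures uncertainty in the fixed effects only.

```r
set.seed(1)
n <- 2000
p <- 4

# Hypothetical design matrix (intercept plus 3 covariates)
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

# Hypothetical positive definite covariance matrix of the fixed effects
A <- matrix(rnorm(p * p), p, p)
V <- crossprod(A) / 100

# Row-wise quadratic forms x_i' V x_i without forming the n x n matrix
predvar <- rowSums((X %*% V) * X)
se_fit <- sqrt(predvar)

# Check against the naive diag() formula on a small subset
naive <- diag(X[1:50, ] %*% V %*% t(X[1:50, ]))
all.equal(predvar[1:50], naive)
```

An approximate 95% CI for the fixed-effect part of the prediction would then be the prediction with re.form = NA plus or minus 1.96 * se_fit.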
bootMer
As said earlier, bootstrapping the confidence interval with bootMer takes too much time for this subset of data (I started it one day ago and it is still running). I tried parallel processing with the sleepstudy sample data, but this did not increase the speed dramatically, so I assume the same would hold for my large dataset.
merBoot <- bootMer(lmer1, predict, nsim = 1000, re.form = NA)
Others
I have read through all these posts (and more), but none of them helped me get the CI in reasonable time for my case. But maybe I have overlooked something.
https://stats.stackexchange.com/questions/344012/confidence-intervals-from-bootmer-in-r-and-pros-cons-of-different-interval-type
https://stats.stackexchange.com/questions/117641/how-trustworthy-are-the-confidence-intervals-for-lmer-objects-through-effects-pa
How to get coefficients and their confidence intervals in mixed effects models?
Error when estimating CI for GLMM using confint()
https://stats.stackexchange.com/questions/235018/r-extract-and-plot-confidence-intervals-from-a-lmer-object-using-ggplot
How to get confidence intervals for lmer object?
Confidence intervals for the predicted probabilities from glmer object, error with bootMer
https://rdrr.io/cran/ciTools/man/add_ci.lmerMod.html
Error when estimating Confidence interval in lme4
https://fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/
Unsurprisingly, though unfortunately for you, the nonconvergence of the mixed model and the difficulty in generating confidence intervals result from misusing a linear model for data with a limited dependent variable. Proceeding because "despite these warnings, the model is still fitted" is dangerous: estimates from a fit that has not converged should not be used for predictions. As you described, the dependent variable (y) represents the probability of an event, which is a continuous variable between zero and one. Using a linear model to predict a probability is a linear probability regression, which requires censoring predicted outcomes (e.g. forcing all predicted values greater than .99 to be .99 and all predicted values smaller than .01 to be .01) and adjusting for heteroscedastic variances using weighted least squares (see https://bookdown.org/ccolonescu/RPoE4/heteroskedasticity.html). Giving continuous variables both fixed and random effects also burdens convergence, and some or all of the random effects of continuous variables may not be necessary. The use of weights can also be problematic.
Instead of a linear probability regression, beta regression works best for dependent variables which are proportions and probabilities. Beta regression without random effects is done in betareg::betareg(). glmmTMB::glmmTMB() handles beta regression with random effects. Start from a simple setting where only the intercept has random effects such as
glmmTMB(y ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), family = beta_family(link = "logit"), data = df)
You may compare the result with glmer() and lmer()
glmer(y ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), family = gaussian(link = "logit"), data = df)
lmer(log(y/(1-y)) ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), data = df)
glmer() and lmer() with the above specifications are closely related but not identical: glmer() with gaussian(link = "logit") models the mean of y through a logit link with normal residuals on the original scale, while lmer() assumes that log(y/(1-y)) has normal residuals. glmmTMB() with the beta family assumes that y follows a beta distribution. lmer() results are easier to explain and receive wider support from other packages, since they are linear models. On the other hand, glmmTMB() may fit better according to AIC, BIC, and log likelihood. Note that all three require y strictly inside (0, 1), boundaries excluded. To include occasional zeros and ones, adjust observations at both boundaries by a small tolerance, usually equal to half of the smaller distance from a boundary to its closest observed value (see https://stats.stackexchange.com/questions/109702 and https://graphworkflow.com/eda/bounded01/). For probabilities with many zeros, many ones, or both, zero-, one-, and zero-one-inflated beta regression can be fitted via gamlss::gamlss(). See Korosteleva, O. (2019). Advanced regression models with SAS and R. CRC Press.
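The boundary adjustment described above can be sketched in base R as follows. This is only one reading of the rule (half of the smaller distance from a boundary to its closest interior observation), and the vector y is made up for illustration:

```r
# Hypothetical outcome with an exact zero and an exact one
y <- c(0, 0.02, 0.10, 0.23, 0.30, 0.97, 1)

# Distance from each boundary to the closest interior observation
dist_lower <- min(y[y > 0])        # 0.02
dist_upper <- 1 - max(y[y < 1])    # 0.03

# Tolerance: half of the smaller of the two distances
eps <- 0.5 * min(dist_lower, dist_upper)

# Pull boundary values just inside (0, 1); interior values are unchanged
y_adj <- pmin(pmax(y, eps), 1 - eps)
y_adj
```

With missing values in y, as in the question's data, the min() and max() calls would need na.rm = TRUE.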
Add random slopes if likelihood ratio tests show they are necessary. Make sure there are enough levels in c1 (e.g. more than 10 different cities) to justify a mixed effect model. The {glmmTMB} package extends glm() and glmer(). An alternative, the {brms} package, takes a Bayesian approach. Note that the weights = argument in glmmTMB(), as in glm(), treats the values in weights as inversely proportional to the dispersions; they are not automatically rescaled to sum to one, and integer values are interpreted as numbers of observation units. Therefore, you need to investigate what w1 stands for and evaluate how to use it in modeling.
merTools::predictInterval() generates many kinds of intervals for mixed models, some comparable to confidence intervals and prediction intervals in linear models without random effects. However, it supports lmer() model objects only. See https://cran.r-project.org/web/packages/merTools/vignettes/merToolsIntro.html and https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html.
predictInterval(lmer(), include.resid.var = F) includes uncertainty from both fixed and random effects of all coefficients including the intercept but excludes variation from multiple measurements of the same group or individual. This can be considered similar to prediction intervals of linear models without random effects. predictInterval(lmer(), include.resid.var = F, fix.intercept.variance = T) generates a shorter CI than the above by accounting for covariance between the fixed and random effects of the intercept. predictInterval(lmer(), include.resid.var = F, ignore.fixed.terms = "(Intercept)") also shortens the CI by removing uncertainty from the fixed effect of the intercept. If there are no random slopes, only a random intercept, the last two methods are comparable to confidence intervals of linear models without random effects. confint(lmer()) and confint(profile(lmer())) generate confidence intervals of model parameters such as a slope, so they do not produce confidence intervals of predicted outcomes.
You may also find the following functions and packages useful for generating CIs of mixed effect models.
ggeffect() from {ggeffects}, predictions() from {marginaleffects}, margins() from {margins}, and prediction() from {prediction}
They can produce predictions averaged over observed distribution of covariates, instead of making predictions by holding some predictors at specific values such as means or modes which can be misleading and not useful.

Making sense of gbm survival prediction model

I am a newbie in using and making sense of ML methods and currently doing survival analysis using gbm package in R.
I have difficulty understanding some of the output of the survival prediction model. I have checked this tutorial and this post, but I still have trouble making sense of the output.
Here is my code for analysis based on example data:
rm(list=ls(all=TRUE))
library(randomForestSRC)
library(gbm)
library(survival)
library(Hmisc)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
model <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph",
cv.folds = 5)
summary(model)
best.iter <- gbm.perf(model, plot.it = TRUE, method = 'cv',
overlay = TRUE) #to get the optimal number of Boosting iterations
best.iter
#Use the best number of trees to produce predicted values for each observation in newdata
# returns a vector of predictions f(x) on the log hazard scale, using n.trees trees
# By default the predictions are on log hazard scale for coxph
# proportional hazard model assumes h(t|x)=lambda(t)*exp(f(x)).
# estimate the f(x) component of the hazard function
pred.train <- predict(object=model, newdata=data.train, n.trees = best.iter)
pred.test <- predict(object=model, newdata=data.test, n.trees = best.iter)
#training set
Hmisc::rcorr.cens(-pred.train, Surv(data.train$days, data.train$status))
#val set
Hmisc::rcorr.cens(-pred.test, Surv(data.test$days, data.test$status))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = data.train$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all:
surv.rate <- exp(-exp(pred.train)*basehaz.cum)
surv.rate
res_train <- data.train
# predicted outcome for train set
res_train$pred <- pred.train
res_train$survival_rate <- surv.rate
res_train
# Estimate the cumulative baseline hazard function, this time using test data
basehaz.cum <- basehaz.gbm(t=data.test$days, #The survival times.
delta=data.test$status, #The censoring indicator
f.x=pred.test, #The predicted values of the regression model on the log hazard scale.
t.eval = data.test$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all at specified time is:
surv.rate <- exp(-exp(pred.test)*basehaz.cum)
surv.rate
res_test <- data.test
# predicted outcome for test set
res_test$pred <- pred.test
res_test$survival_rate <- surv.rate
res_test
#--------------------------------------------------
#Estimate survival rate at time of interest
# Specify time of interest
time.interest <- sort(unique(data.train$days[data.train$status==1]))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = time.interest, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
#For individual $i$ in test set, estimation of survival function is:
surf.i <- exp(-exp(pred.test[1])*basehaz.cum) #survival rate
#Estimation of survival rate of all at specified time is:
specif.time <- time.interest[10]
surv.rate <- exp(-exp(pred.test)*basehaz.cum[10])
cat("Survival Rate of all at time", specif.time, "\n")
print(surv.rate)
The output returned from the predict function represents the f(x) component of the hazard function ( h(t|x)=lambda(t)*exp(f(x)) ).
My questions:
• A bit confused about whether hazard ratios can be calculated here?
• Wondering how can I divide the population into low-risk and high-risk groups? Can I rely on the estimated f(x) component of the hazard function to do the scoring system for the training set? I aim from this to have a scoring system where I show KM plots for low and high-risk groups for training and test sets.
• How can I construct calibration curve plots where I can plot observed survival vs. predicted survival for the training set and test set?
Amer, thanks for reading my tutorial!
As you mentioned that "The output returned from the predict function represents the f(x) component of the hazard function ( h(t|x)=lambda(t)*exp(f(x)) )", maybe we need to understand the hazard function, i.e. h(t|x).
Before this, please make sure that you have basic knowledge of survival analysis. If not, I recommend reading the great post; I think it would help you answer these questions.
Back to your questions:
Yes. The predict function returns f(x), the log hazard relative to the baseline, so exp(f(x)) is the hazard ratio relative to the baseline hazard.
Sure! Based on the values of the hazard ratio, we can divide the population into low-risk and high-risk groups. A simple choice is to use the median of the hazard ratios as the cutoff value. I think the cutoff value should be derived from the training set and then applied to the test set. If your model is effective, the KM curves for the low-risk and high-risk groups will differ significantly (as measured by a log-rank test).
Calibration curve plots are often used to evaluate the performance of a model that outputs probabilities or likelihoods in [0, 1]. We can calculate the survival function and then specify a time point of interest, e.g. 5 years. Finally, we compare the predicted survival probabilities with the actual survival state at the specified time, just as we would when evaluating a binary classification model. For more details on obtaining the survival function, refer to my tutorial; the principles can be found in the post mentioned above.
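The risk grouping described above can be sketched in base R. The vectors f_train and f_test are simulated stand-ins for pred.train and pred.test from the question (hypothetical values, not gbm output); the cutoff is the training-set median and is reused unchanged on the test set.

```r
set.seed(42)

# Hypothetical log-hazard predictions f(x), standing in for pred.train / pred.test
f_train <- rnorm(200)
f_test <- rnorm(80)

# Hazard ratio of each individual relative to the baseline hazard
hr_train <- exp(f_train)

# Hazard ratio between two individuals: exp(f(x1) - f(x2))
hr_1_vs_2 <- exp(f_train[1] - f_train[2])

# Cutoff derived from the training set only
cutoff <- median(f_train)

# Apply the same cutoff to both sets
group_train <- factor(ifelse(f_train > cutoff, "high", "low"), levels = c("low", "high"))
group_test <- factor(ifelse(f_test > cutoff, "high", "low"), levels = c("low", "high"))

table(group_train)
table(group_test)
```

With the real data, the group factors could then be passed to survival::survfit() for the KM curves and survival::survdiff() for the log-rank test, e.g. survfit(Surv(days, status) ~ group_test, data = data.test).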

unexpected output in plotting ROC curve for SVM classifier

While trying to plot a ROC curve for an SVM doing discrete classification, I got a curve that does not match its accuracy rate. The accuracy of the SVM computed from the confusion matrix was 88.1%,
but it produced the following curve.
[Figure: ROC curve for SVM in R]
I also calculated the area under the curve (AUC), which gives 1.0; this would mean that the accuracy has to be 100%, not 88.1%.
here is the code that i used to produce it
library(e1071) #provides svm()
x <- subset(mov[3522:4521,-17])
q <- (mov[3522:4521,17])
svm_model1 <- svm(x,q,cost = .1 , gamma =.5,probability = TRUE)
a <- predict(svm_model1,type="prob", newdata = mov[3522:4521,-17],
probability = TRUE)
library(ROCR)
rocc <-prediction (attr(a, "probabilities")[,2],mov[3522:4521,'y'])
per <- performance(rocc, "tpr","fpr")
plot(per,colorize=T,lwd= 3,main="ROC curve for SVM")
I found a question that could be related, but unfortunately I couldn't understand it.
Does anyone know why I got this ROC curve?
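A note on the reasoning above, offered as a sketch and not as a diagnosis of this particular dataset: AUC measures how well the scores rank positives above negatives, while accuracy depends on one fixed decision threshold, so an AUC of 1.0 does not force the accuracy at that threshold to be 100%. The toy scores below are made up to show this. Also note that the code above predicts on the same rows the model was trained on, which tends to inflate the AUC.

```r
# Made-up labels and scores: every positive scores higher than every negative
labels <- c(0, 0, 0, 0, 1, 1, 1, 1)
scores <- c(0.10, 0.20, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# AUC as the share of (positive, negative) pairs ranked correctly, ties count half
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Accuracy at a fixed 0.5 threshold: two negatives score above 0.5
acc_05 <- mean((scores > 0.5) == labels)

c(auc = auc, accuracy_at_0.5 = acc_05)
```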

How to detect Heteroscedasticity in Random Forest Model?

I am working on a regression model with Random Forest, and I want to judge whether there is heteroscedasticity in the model.
When I develop a linear model, I can see that there is heteroscedasticity and the curve looks like the graph below. I want to check a similar residual plot for the Random Forest model.
I am working in R.
It is an expense model based on Income, Branch, and TotalFamilyMember.
We can recreate the plot with the residuals from the predicted values:
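#Using the regression example from ?randomForest
library(randomForest)
#Assumption: airq is airquality with incomplete rows dropped (it is not defined in the original snippet)
airq <- na.omit(airquality)
ozone.rf <- randomForest(Ozone ~ ., data=airq, mtry=3,
importance=TRUE)
#Find residuals by subtracting predicted from actual values
err <- airq$Ozone - ozone.rf$predicted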
#Make data frame holding residuals and fitted values
df <- data.frame(Residuals=err, Fitted.Values=ozone.rf$predicted)
#Sort data by fitted values
df2 <- df[order(df$Fitted.Values),]
#Create plot
plot(Residuals~Fitted.Values, data=df2)
#Add origin line at (0,0) with grey color #8
abline(0,0, col=8)
#Add the same smoothing line from lm regression with color red #2
lines(lowess(df2$Fitted.Values, df2$Residuals), col=2)
Update
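There is an easier way. The plot is essentially a regression of the residuals on the fitted values, so this gives a very similar plot (plot.lm shows the residuals and fitted values of the auxiliary regression, so any linear trend is removed and the x-axis is rescaled):
fitted.values <- ozone.rf$predicted
residuals <- ozone.rf$y - fitted.values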
plot(lm(residuals ~ fitted.values), which=1)
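As a rough numeric companion to the residual plot, one option is to check whether the absolute residuals trend with the fitted values. This is a swapped-in heuristic, not something the answer above proposes, and the data below are simulated (fitted_vals and resid_vals are hypothetical stand-ins for ozone.rf$predicted and the residuals):

```r
set.seed(7)
n <- 500

# Simulated fitted values and residuals whose spread grows with the fitted value
fitted_vals <- runif(n, 1, 10)
resid_vals <- rnorm(n, mean = 0, sd = 0.5 * fitted_vals)

# Rank correlation between residual magnitude and fitted value
check <- cor.test(abs(resid_vals), fitted_vals, method = "spearman", exact = FALSE)
rho <- unname(check$estimate)
p_value <- check$p.value

c(rho = rho, p_value = p_value)
```

A clearly positive rho with a small p-value suggests the residual spread increases with the fitted values. With the random forest above, the inputs would be abs(ozone.rf$y - ozone.rf$predicted) and ozone.rf$predicted.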

Estimating mean time to failure with survreg/flexsurvreg with standard error

I am trying to estimate the mean time to failure for a Weibull distribution fitted to some survival data with flexsurvreg from the flexsurv package. I need to be able to estimate the standard error for use in a simulation model.
Using flexsurvreg with the lung data as an example:
require(flexsurv)
lungS <- Surv(lung$time,lung$status)
lungfit <- flexsurvreg(lungS~1,dist="weibull")
lungfit
Call:
flexsurvreg(formula = lungS ~ 1, dist = "weibull")
Maximum likelihood estimates:
         est    L95%    U95%
shape    1.32    1.14    1.52
scale  418.00  372.00  469.00
N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1153.851, df = 2
AIC = 2311.702
Now, calculating the mean is just a case of plugging in the estimated parameter values into the standard formula, but is there an easy way of getting out the standard error of this estimate? Can survreg do this?
In flexsurv version 0.2, if x is the fitted model object, then x$cov is the covariance matrix of the parameter estimates, with positive parameters on the log scale. You could then use the asymptotic normal property of maximum likelihood estimators. Simulate a large number of multivariate normal vectors, with the estimates as means, and this covariance matrix (using e.g. rmvnorm from the mvtnorm package). This gives you replicates of the parameter estimates under sampling uncertainty. Calculate the corresponding mean survival for each replicate, then take the SD or quantiles of the resulting sample to get the standard error or a confidence interval.
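The simulation described above can be sketched in base R. The point estimates come from the printed output in the question; the standard errors on the log scale are backed out from the printed 95% intervals, and the covariance between log(shape) and log(scale) is ASSUMED to be zero purely for illustration. In practice the covariance matrix from the fitted object should be used in place of Sigma.

```r
set.seed(123)

# Estimates on the log scale (values from the printed flexsurvreg output)
est_log <- c(log(1.32), log(418))

# Approximate SEs on the log scale, backed out from the printed 95% intervals
se_log <- c(log(1.52 / 1.14), log(469 / 372)) / (2 * qnorm(0.975))

# ASSUMPTION: zero covariance between the two parameters (illustration only)
Sigma <- diag(se_log^2)

# Multivariate normal draws via the Cholesky factor (rows are replicates)
n_sim <- 10000
z <- matrix(rnorm(n_sim * 2), n_sim, 2)
draws_log <- sweep(z %*% chol(Sigma), 2, est_log, "+")

shape <- exp(draws_log[, 1])
scale <- exp(draws_log[, 2])

# Weibull mean time to failure for each replicate: scale * gamma(1 + 1/shape)
mttf <- scale * gamma(1 + 1 / shape)

mttf_hat <- 418 * gamma(1 + 1 / 1.32)   # plug-in point estimate
se_mttf <- sd(mttf)                     # simulation-based standard error
ci_mttf <- quantile(mttf, c(0.025, 0.975))
```

mvtnorm::rmvnorm, as mentioned above, would replace the manual Cholesky step.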
