How to calculate risk difference from a regression model in R?

I know how to calculate risk difference with a 2x2 table, but I have no idea how to do this with a regression model, even though it is a quite widely used method when you need to adjust variables in question.
In case I'm not making any sense, here's an article that discusses proper ways to calculate risk difference, but unfortunately it doesn't contain any code: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-016-0217-0

If I have understood your question, the book Regression Modeling Strategies (and its companion R package, rms) may be what you are looking for. For example:
# Load the rms package by F. E. Harrell (install it first if needed)
library(rms)
# The aml example data set ships with the survival package
library(survival)
# Fit nonparametric (Kaplan-Meier) survival curves for the two groups in x
fit <- npsurv(Surv(time, status) ~ x, data = aml)
# Plot the Kaplan-Meier survival curves
plot(fit)
# Plot the difference between the two survival curves over time (the 'risk difference'), with 95% confidence limits
survdiffplot(fit, xlim = c(0, 60))
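For a covariate-adjusted risk difference from a regression model, which is what the question asks about, one commonly used approach is marginal standardization: fit a logistic model, predict everyone's risk once as exposed and once as unexposed, and take the difference of the average predicted risks. The following is only a minimal sketch on simulated data; the variable names (outcome, exposed, age) are hypothetical and not taken from the linked article.

```r
set.seed(1)
n <- 2000
age <- rnorm(n, mean = 50, sd = 10)
# Exposure depends on age, so age confounds the crude comparison
exposed <- rbinom(n, 1, plogis(-0.5 + 0.02 * (age - 50)))
outcome <- rbinom(n, 1, plogis(-1 + 0.8 * exposed + 0.03 * (age - 50)))
d <- data.frame(outcome, exposed, age)

# Adjusted logistic regression
fit <- glm(outcome ~ exposed + age, data = d, family = binomial)

# Average predicted risk if everyone were exposed / unexposed
risk1 <- mean(predict(fit, newdata = transform(d, exposed = 1), type = "response"))
risk0 <- mean(predict(fit, newdata = transform(d, exposed = 0), type = "response"))

# Adjusted (standardized) risk difference
rd <- risk1 - risk0
rd
```

A confidence interval for rd is usually obtained by bootstrapping the whole procedure. A binomial model with an identity link, glm(..., family = binomial(link = "identity")), estimates a risk difference directly as a coefficient, but such fits can fail to converge.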

Related

Survival Curves For Cox PH Models. Checking My Understanding About Plotting Them

I'm using the book Applied Survival Analysis Using R by Moore to try to model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the Cox model. Because of this I'm wondering whether my understanding of the model is wrong. My data is simple: a time column t, an event indicator column i (1 for event, 0 for censored), and a predictor column p with 6 factor levels.
I believe I can plot estimated survival curves for a Cox model as shown below. But I don't understand how to use survfit with base plotting, or functions from survminer, to achieve the same end. Here is some generic code to clarify my question. I'll use the pharmacoSmoking data set to demonstrate my issue.
library(survival)
library(asaur)
t<-pharmacoSmoking$longestNoSmoke
i<-pharmacoSmoking$relapse
p<-pharmacoSmoking$levelSmoking
data<-as.data.frame(cbind(t,i,p))
model <- coxph(Surv(data$t, data$i) ~ p, data=data)
As I understand it, with the following code snippets, modeled after book examples, a baseline (cumulative) hazard at my reference factor level for p may be given from
base<-basehaz(model, centered=F)
An estimate of the survival curve is given by
s<-exp(-base$hazard)
t<-base$time
plot(s~t, type = "l")
The survival curve associated with a different factor level may then be given by
beta_n<-model$coefficients #only one coef in this case
s_n <- s^(exp(beta_n))
lines(s_n~t)
where beta_n is the coefficient for the nth factor level from the Cox model. The code above gives what I think are estimated survival curves for heavy vs. light smokers in the pharmacoSmoking dataset.
Since that's a fair bit of code, I was looking to packages for a one-liner solution. I had a hard time with the documentation for survival (there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable ="p" , data=data)
This gives me something different from my prior code, although it is similar. Is the method I used earlier incorrect? Or is there a different methodology that accounts for the difference? The survminer code doesn't work on my real data (I get a 'cannot allocate vector of size ...' error; my data is ~1m rows), which seems weird considering I can make plots using what I did before with no problem. This is the primary reason I am wondering whether I understand how to plot survival curves for my model.
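For the one-liner being asked about, survfit() on a coxph fit accepts a newdata argument and returns one predicted curve per row. The sketch below uses the aml data set from the survival package instead of pharmacoSmoking, purely so that it is self-contained; it also recomputes the manual baseline-hazard curve from the question so the two routes can be compared. In principle they should agree up to numerical tolerance, but treat that as something to check rather than a guarantee.

```r
library(survival)

# Cox model with a two-level factor predictor
fit <- coxph(Surv(time, status) ~ x, data = aml)

# One row of newdata per curve wanted (here: both levels of x)
nd <- data.frame(x = factor(levels(aml$x), levels = levels(aml$x)))
sf <- survfit(fit, newdata = nd)

# Base graphics plot of the model-based survival curves
plot(sf, lty = 1:2, xlab = "Time", ylab = "Estimated survival")
legend("topright", legend = levels(aml$x), lty = 1:2)

# Manual route from the question: baseline hazard at the reference level
base <- basehaz(fit, centered = FALSE)
s_ref <- exp(-base$hazard)
s_other <- s_ref^exp(coef(fit)[[1]])

# Largest gap between the manual curve and survfit's second curve
max(abs(s_other - sf$surv[, 2]))
```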

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glm_global,
type = "pearson"), xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing the AIC of the global GLMM (1144.7) to that of the global GLM (1149.2), I get a delta AIC > 2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown by the spline correlogram for the GLMM.
Correlog_glmm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glmm_global,
type = "pearson"), xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack” thinking this would allow me more flexibility with regards to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Spatial autocorrelation can be taken into account in your model in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could go with the package nlme and specify a correlation structure for your residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this case the spatial autocorrelation is treated as continuous and can be approximated by a global function.
Second, you could go with the package mgcv and add a bivariate spline of the spatial coordinates to your model. This way you can capture a spatial pattern and even map it. Strictly speaking, this method does not model the spatial autocorrelation itself, but it may solve the problem. If space is discrete in your case, you could use a Markov random field smooth. This website is very helpful for finding examples: https://www.fromthebottomoftheheap.net
Third, you could go with the package brms. It lets you specify very complex models with other correlation structures for your residuals (CAR and SAR). The package uses a Bayesian approach.
I hope this helps. Good luck.
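The second option (a bivariate spline of the coordinates in mgcv) could be sketched roughly as follows. This uses simulated data, and all variable names (Encounter, cover, X, Y) are hypothetical stand-ins for the asker's columns; it is not a recommendation of a final model.

```r
library(mgcv)

set.seed(42)
n <- 500
X <- runif(n)
Y <- runif(n)
cover <- rnorm(n)
# A smooth spatial surface that induces spatially structured outcomes
spatial <- sin(2 * pi * X) * cos(2 * pi * Y)
Encounter <- rbinom(n, 1, plogis(-0.3 + 0.7 * cover + 1.5 * spatial))
d <- data.frame(Encounter, cover, X, Y)

# Binomial GAM: parametric covariate effect plus a 2-D smooth of the coordinates
fit_gam <- gam(Encounter ~ cover + s(X, Y), data = d,
               family = binomial, method = "REML")
summary(fit_gam)

# Pearson residuals, e.g. to feed back into a spline correlogram
res <- residuals(fit_gam, type = "pearson")
```

Whether the smooth has absorbed the spatial pattern can then be checked by rerunning the correlogram on res.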

How to extract average ROC curve predictions using ROCR?

The ROCR library in R offers the ability to plot an average ROC curve (right from the ROCR reference manual):
library(ROCR)
# plot ROC curves for several cross-validation runs (dotted
# in grey), overlaid by the vertical average curve and boxplots
# showing the vertical spread around the average.
data(ROCR.xval)
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf,col="grey82",lty=3)
plot(perf,lwd=3,avg="vertical",spread.estimate="boxplot",add=TRUE)
Lovely. Unfortunately, there's seemingly no ability to obtain the average ROC curve itself as an object/dataframe/etc. for further statistical testing (say, with pROC). I did do some research (albeit perhaps after the fact), and I found this post:
Global variables in R
Looking through ROCR's code reveals the following lines for passing a result to a plot:
performance_plots.R, (starting at line 451)
## compute average curve
perf.avg <- perf.sampled
perf.avg@x.values <- list( rowMeans( data.frame( perf.avg@x.values)))
perf.avg@y.values <- list(rowMeans( data.frame( perf.avg@y.values)))
perf.avg@alpha.values <- list( alpha.values )
So, using the trace function I looked up here (General suggestions for debugging in R):
trace(.performance.plot.horizontal.avg, edit=TRUE)
I added the following line to the performance_plots.R after the lines listed above:
perf.rocr.avg <<- perf.avg # note the global assignment operator `<<-`
A horrible hack, yet it works as I can plot perf.rocr.avg without a problem. Unfortunately, when using pROC, I can't compare my averaged ROC curve because it requires a pROC roc object. That's fine, but the catch is that the pROC roc object requires the original prediction and reference data to create. As far as I can tell, ROCR is averaging the ROC curves themselves and not the predictions, so it seems I can't get what I want out of ROCR.
Is there a way to reverse-engineer the predictions from the averaged ROC curve created by ROCR?
I ran into the same problem. As I see it, the average ROC curve generated by the ROCR package is just a set of numeric values; other statistical attributes (e.g. confidence intervals) are missing. That means statistics computed on the average ROC curve may not make sense, which is why a roc object can't be built from a list of (tpr, fpr) points in the pROC package. However, I found a paper that addresses this problem, i.e. the comparison between average ROC curves. The title is "The average area under correlated receiver operating characteristic curves: a nonparametric approach based on generalized two-sample Wilcoxon statistics". I hope that's helpful.
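If the goal is only to get the vertically averaged curve as a data frame, that can be done without patching ROCR, by interpolating each run's TPR onto a common FPR grid and averaging. The sketch below is an approximation of the idea, not ROCR's internal code; the grid size of 101 points is an arbitrary assumption. It does not recover predictions and does not produce a pROC roc object.

```r
library(ROCR)
data(ROCR.xval)

pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred, "tpr", "fpr")

# Common false-positive-rate grid (arbitrary resolution)
fpr_grid <- seq(0, 1, length.out = 101)

# Interpolate each cross-validation run's TPR onto the grid;
# ties = max keeps the top of each vertical step of the ROC curve
tpr_by_run <- mapply(function(fpr, tpr) {
  approx(fpr, tpr, xout = fpr_grid, ties = max, rule = 2)$y
}, perf@x.values, perf@y.values)

# Vertical average across runs, as an ordinary data frame
avg_roc <- data.frame(fpr = fpr_grid, tpr = rowMeans(tpr_by_run))
head(avg_roc)
```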

Regression kriging of binomial data

I use gstat to predict binomial data, but the predicted values go above 1 and below 0. Does anyone know how I can deal with this issue? Thanks.
library(sp)
library(gstat)
data(meuse)
data(meuse.grid)
coordinates(meuse) <- ~x+y
coordinates(meuse.grid) <- ~x+y
gridded(meuse.grid) <- TRUE
#glm model
glm.lime <- glm(lime~dist+ffreq, meuse, family=binomial(link="logit"))
summary(glm.lime)
#variogram of residuals
var <- variogram(lime~dist+ffreq, data=meuse)
fit.var <- fit.variogram(var, vgm(nugget=0.9, "Sph", range=sqrt(diff(meuse@bbox[1,])^2 + diff(meuse@bbox[2,])^2)/4, psill=var(glm.lime$residuals)))
plot(var, fit.var, plot.numbers=TRUE)
#universal kriging
kri <- krige(lime~dist+ffreq, meuse, meuse.grid, fit.var)
spplot(kri[1])
In general, with this kind of regression kriging approach there is no guarantee that the model will be valid as the calculation of the trend and the residuals is separated. A few notes on your code. Notice that you use variogram to calculate the residual variogram, but variogram uses a normal linear model to calculate the trend and thus also the residuals. You need to determine your residuals from your glm, and then calculate a residual variogram based on that.
You could do this manually, or have a look at the fit.gstatModel function from the GSIF package. You could also have a look at binom.krige from the geoRglm package. This thread on R-sig-geo might also be interesting:
Taking residuals from a GLM is rather different from using indicator
variables. Also there may even be some differences depending on which
kind of GLM residuals you take. Running a GLM and exploring the residuals,
e.g. via variograms, is something I consider routine practice, but
it does not always tell you the whole story. Fitting a GLGM
(generalised linear geostatistical model) can be more conclusive since
you can do inference on the model parameters and assess the relevance
of the spatial term more objectively. This was the original motivation
for geoRglm doing all the modelling at once and not in two steps such
as fitting a model without correlation and then modelling residuals.
This came with the extra burden of calibrating the MCMC algorithms.
Later spBayes came on the scene and indeed looks promising, proposing a
more general framework, whereas geoRglm is rather specific to
univariate binomial and Poisson models.
As Roger says there is scope to play around with other alternatives
like the GLMM or maybe MCMCpack, but this is certainly not ready
"out-of-the-box" and code will need to be adapted for spatial
purposes.
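The manual route mentioned in the answer (take the residuals from the glm itself, then model their variogram) might look roughly like the sketch below. The variogram starting values are rough guesses, and adding kriged response-scale residuals to the fitted probabilities is a heuristic of this sketch; the result is still not guaranteed to stay within [0, 1], which is one reason the model-based approaches discussed in the quoted thread exist.

```r
library(sp)
library(gstat)

data(meuse)
data(meuse.grid)
coordinates(meuse) <- ~x+y
coordinates(meuse.grid) <- ~x+y
gridded(meuse.grid) <- TRUE

# Trend: logistic regression fitted on the plain data frame
glm.lime <- glm(lime ~ dist + ffreq, data = as.data.frame(meuse),
                family = binomial(link = "logit"))

# Residuals from the GLM itself (observed 0/1 minus fitted probability)
meuse$res <- residuals(glm.lime, type = "response")

# Variogram of the GLM residuals; no trend terms, hence res ~ 1
v.res <- variogram(res ~ 1, meuse)
# Starting values are assumptions, not tuned
v.fit <- fit.variogram(v.res, vgm(psill = 0.1, model = "Sph",
                                  range = 500, nugget = 0.05))

# Ordinary kriging of the residuals onto the grid
res.kr <- krige(res ~ 1, meuse, meuse.grid, model = v.fit)

# GLM trend on the probability scale plus kriged residuals
trend <- predict(glm.lime, newdata = as.data.frame(meuse.grid), type = "response")
p.hat <- trend + res.kr$var1.pred
range(p.hat)  # inspect whether values leave [0, 1]
```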

Regression evaluation in R

Are there any utilities/packages for showing various performance metrics of a regression model on some labeled test data? Basic stuff I can easily write like RMSE, R-squared, etc., but maybe with some extra utilities for visualization, or reporting the distribution of prediction confidence/variance, or other things I haven't thought of. This is usually reported in most training utilities (like caret's train), but only over the training data (AFAICT). Thanks in advance.
This question is really quite broad and should be focused a bit, but here's a small subset of functions written to work with linear models:
x <- rnorm(100)
y <- rnorm(100)
model <- lm(x~y)
#general summary
summary(model)
#Visualize some diagnostics
plot(model)
#Coefficient values
coef(model)
#Confidence intervals
confint(model)
#predict values
predict(model)
#predict new values
predict(model, newdata = data.frame(y = 1:10))
#Residuals
resid(model)
#Standardized residuals
rstandard(model)
#Studentized residuals
rstudent(model)
#AIC
AIC(model)
#BIC
BIC(model)
#Cook's distance
cooks.distance(model)
#DFFITS
dffits(model)
#lots of measures related to model fit
influence.measures(model)
Bootstrap confidence intervals for parameters of models can be computed using the recommended package boot. It is a very general package requiring you to write a simple wrapper function to return the parameter of interest, say fit the model with some supplied data and return one of the model coefficients, whilst it takes care of the rest, doing the sampling and computation of intervals etc.
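As a rough illustration of the wrapper-function pattern described above, here is a minimal sketch that bootstraps the slope of a linear model. The data are simulated, and the helper name coef_fun is made up for this example.

```r
library(boot)

set.seed(123)
d <- data.frame(y = rnorm(100))
d$x <- 2 + 0.5 * d$y + rnorm(100)

# Wrapper: refit the model on the resampled rows and return the slope
coef_fun <- function(data, idx) {
  coef(lm(x ~ y, data = data[idx, ]))[["y"]]
}

# 999 bootstrap replicates of the slope
b <- boot(d, statistic = coef_fun, R = 999)

# Percentile bootstrap confidence interval
ci <- boot.ci(b, type = "perc")
ci
```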
Consider also the caret package, which is a wrapper around a large number of modelling functions, but also provides facilities to compare model performance using a range of metrics using an independent test set or a resampling of the training data (k-fold, bootstrap). caret is well documented and quite easy to use, though to get the best out of it, you do need to be familiar with the modelling function you want to employ.
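Since the question is specifically about labeled test data, here is a minimal base R sketch of held-out evaluation: split the data, fit on the training part, and compute RMSE, MAE, R-squared and prediction-interval coverage on the test part. The data and the 150/50 split are simulated assumptions for illustration only.

```r
set.seed(2024)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Hold out 50 rows as the test set
test_idx <- sample(n, 50)
train <- d[-test_idx, ]
test <- d[test_idx, ]

model <- lm(y ~ x1 + x2, data = train)
pred <- predict(model, newdata = test)
err <- test$y - pred

# Test-set performance metrics
metrics <- c(
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  R2   = 1 - sum(err^2) / sum((test$y - mean(test$y))^2)
)
metrics

# Share of test observations inside the 95% prediction intervals
pi_test <- predict(model, newdata = test, interval = "prediction", level = 0.95)
coverage <- mean(test$y >= pi_test[, "lwr"] & test$y <= pi_test[, "upr"])
coverage

# Observed vs. predicted on the test set
plot(test$y, pred, xlab = "Observed", ylab = "Predicted")
abline(0, 1, lty = 2)
```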
