I use gstat to predict binomial data, but the predicted values go above 1 and below 0. Does anyone know how I can deal with this issue? Thanks.
library(sp)
library(gstat)
data(meuse)
data(meuse.grid)
coordinates(meuse) <- ~x+y
coordinates(meuse.grid) <- ~x+y
gridded(meuse.grid) <- TRUE
#glm model
glm.lime <- glm(lime~dist+ffreq, meuse, family=binomial(link="logit"))
summary(glm.lime)
#variogram of residuals
var <- variogram(lime~dist+ffreq, data=meuse)
fit.var <- fit.variogram(var, vgm(nugget=0.9, "Sph", range=sqrt(diff(meuse@bbox[1,])^2 + diff(meuse@bbox[2,])^2)/4, psill=var(glm.lime$residuals)))
plot(var, fit.var, plot.numbers=TRUE)
#universal kriging
kri <- krige(lime~dist+ffreq, meuse, meuse.grid, fit.var)
spplot(kri[1])
In general, with this kind of regression-kriging approach there is no guarantee that the predictions will be valid (here, that they stay between 0 and 1), because the trend and the residuals are modelled separately. A few notes on your code: you use variogram to calculate the residual variogram, but variogram fits an ordinary linear model to obtain the trend, and therefore the residuals. krige does the same, so the binomial GLM never enters the prediction. You need to take the residuals from your glm and then calculate a residual variogram based on those.
You could do this manually, or have a look at the fit.gstatModel function from the GSIF package. You could also have a look at binom.krige from the geoRglm package. This thread on R-sig-geo might also be interesting:
Taking residuals from a GLM is rather different from using indicator
variables. Also there may even be some differences depending on which
kind of GLM residuals you take. Running a GLM and exploring the residuals,
e.g. via variograms, is something I consider routine practice, but
it does not always tell you the whole story. Fitting a GLGM
(generalised linear geostatistical model) can be more conclusive since
you can do inference on the model parameters and assess the relevance
of the spatial term more objectively. This was the original motivation
for geoRglm doing all the modelling at once and not in two steps, such
as fitting a model without correlation and then modelling residuals.
This came with the extra burden of calibrating the MCMC algorithms.
Later spBayes came on the scene and indeed looks promising, proposing a
more general framework, whereas geoRglm is rather specific to
univariate binomial and Poisson models.
As Roger says there is scope to play around with other alternatives
like the GLMM or maybe MCMCpack, but this is certainly not ready
"out-of-the-box" and code will need to be adapted for spatial
purposes.
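The manual route suggested in the answer above (take the residuals from the GLM, then compute a variogram of those residuals) could be sketched roughly as follows. This is only an illustration, not the answerer's code: the choice of Pearson residuals, the starting values passed to vgm and the object names v.res, fit.res and p.trend are all assumptions.

```r
library(sp)
library(gstat)

data(meuse)
data(meuse.grid)
coordinates(meuse) <- ~x+y
coordinates(meuse.grid) <- ~x+y
gridded(meuse.grid) <- TRUE

# Binomial GLM for the trend, fitted on the attribute table
glm.lime <- glm(lime ~ dist + ffreq, data = meuse@data,
                family = binomial(link = "logit"))

# Residuals taken from the GLM itself (Pearson type is an assumed choice)
meuse$res <- residuals(glm.lime, type = "pearson")

# Variogram of the GLM residuals; res ~ 1 means no further trend is removed
v.res <- variogram(res ~ 1, meuse)

# Starting values are rough guesses and should be adapted to the data
fit.res <- fit.variogram(v.res,
                         vgm(psill = var(meuse$res) / 2, model = "Sph",
                             range = 800, nugget = var(meuse$res) / 2))
plot(v.res, fit.res)

# Trend predictions on the response scale stay within [0, 1]
p.trend <- predict(glm.lime, newdata = as.data.frame(meuse.grid),
                   type = "response")
range(p.trend)
```

How the kriged residuals are then combined with the trend still decides whether the final map stays inside [0, 1], which is why the one-step approaches mentioned above (geoRglm, spBayes) may be preferable.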
I know how to calculate a risk difference from a 2x2 table, but I have no idea how to do this with a regression model, even though it is quite a widely used method when you need to adjust for other variables.
In case I'm not making any sense, here's an article that discusses proper ways to calculate risk difference, but unfortunately it doesn't contain any code: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-016-0217-0
If I have understood your question, I think the book Regression Modeling Strategies may be what you are looking for. For example:
# Load the rms package by F. E. Harrell et al.; remember to install the package first
library(rms)
library(survival)  # provides Surv() and the aml example data
# Create a fit object using the aml example data from the survival package
fit <- npsurv(Surv(time, status) ~ x, data = aml)
# Then you can plot a Kaplan-Meier survival curve.
plot(fit)
# Then plot the difference between the two survival curves (the 'risk difference' over time), with 95% confidence limits
survdiffplot(fit, xlim = c(0,60))
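For a binary outcome rather than survival data, one common way to get an adjusted risk difference from a regression model is marginal standardisation: fit a logistic regression, predict everyone's risk with the exposure set to 1 and then to 0, and average the difference. The sketch below is a different technique from the rms example above, not something taken from the linked article, and it runs on simulated data; the variable names (outcome, exposed, age) and the bootstrap size are assumptions for illustration.

```r
set.seed(1)

# Simulated data: binary exposure, one confounder (age), binary outcome
n <- 2000
age <- rnorm(n, mean = 50, sd = 10)
exposed <- rbinom(n, 1, plogis(-0.5 + 0.02 * (age - 50)))
outcome <- rbinom(n, 1, plogis(-2 + 0.8 * exposed + 0.03 * (age - 50)))
d <- data.frame(outcome, exposed, age)

# Helper: adjusted risk difference by marginal standardisation
adjusted_rd <- function(dat) {
  fit <- glm(outcome ~ exposed + age, data = dat, family = binomial)
  risk1 <- mean(predict(fit, newdata = transform(dat, exposed = 1),
                        type = "response"))
  risk0 <- mean(predict(fit, newdata = transform(dat, exposed = 0),
                        type = "response"))
  risk1 - risk0
}

rd <- adjusted_rd(d)

# Percentile bootstrap confidence interval for the risk difference
rd_boot <- replicate(500, adjusted_rd(d[sample(nrow(d), replace = TRUE), ]))
rd_ci <- quantile(rd_boot, c(0.025, 0.975))

rd
rd_ci
```

The same helper works with more covariates; only the model formula inside adjusted_rd needs to change.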
Sorry for a quite stupid question. I am doing multiple comparisons of morphologic traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons affect my level of inference, and what the effect of the potential multicollinearity in my data is. Perhaps a reasonable option would be to use my bootstraps to generate maximum likelihood estimates and then generate AICc values to compare all of my parameters, to see which come out as most important... The problem is that although the approach is (more or less) clear to me, I don't know how to implement it in R. Can anybody be so kind as to throw some light on this for me?
So far, here is an example (using R, but not my data):
library(boot)
data(iris)
head(iris)
# The function
pearson <- function(data, indices){
  dt <- data[indices,]
  c(
    cor(dt[,1], dt[,2], method='p'),
    median(dt[,1]),
    median(dt[,2])
  )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate the Pearson correlation (r) with 1000 replications
set.seed(12345)
dat <- iris[,c(1,2)]
dat <- na.omit(dat)
results <- boot(dat, statistic=pearson, R=1000)
# 95% CIs
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% (-0.2490, 0.0423 )
Calculations and Intervals on Original Scale
plot(results)
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case correlation). Multicollinearity only becomes an issue when you fit a model, e.g. multiple regression, with several highly correlated predictors.
Multiple comparisons are always a problem though, because they inflate your type-I error rate. The way to address that is to apply a multiple comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides though, especially if you have a lot of predictors and few observations: it may lower your power so much that you won't be able to find any effect, no matter how big it is.
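As a small sketch of such a correction, the code below tests all pairwise correlations among the four numeric iris columns and adjusts the p-values with base R's p.adjust. Using cor.test p-values instead of the bootstrap intervals from the question is an assumption made to keep the example short.

```r
data(iris)

# All pairs of the four numeric columns
vars <- names(iris)[1:4]
pairs <- combn(vars, 2)

# Raw p-value of the Pearson correlation test for each pair
pvals <- apply(pairs, 2, function(p) {
  cor.test(iris[[p[1]]], iris[[p[2]]], method = "pearson")$p.value
})
names(pvals) <- apply(pairs, 2, paste, collapse = " ~ ")

# Holm (Bonferroni-Holm) and Benjamini-Hochberg (FDR) adjustments
adj <- data.frame(
  raw  = pvals,
  holm = p.adjust(pvals, method = "holm"),
  fdr  = p.adjust(pvals, method = "BH")
)
adj
```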
In a high-dimensional setting like this, your best bet may be some sort of regularized regression method. With regularization, you put all predictors into your model at once, similarly to multiple regression. The trick is that you constrain the model so that all of the regression slopes are pulled towards zero, so that only the ones with big effects "survive". The machine learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted using the glmnet package. There are also Bayesian equivalents in so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression using the brms package.
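A minimal LASSO sketch with glmnet might look like the following. It uses the built-in mtcars data purely as a stand-in, and the choices of alpha = 1, 5 folds and lambda.min are assumptions, not recommendations for the morphological data in the question.

```r
library(glmnet)

set.seed(42)

# glmnet expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

# alpha = 1 is the LASSO; alpha = 0 would be ridge, values in between elastic net
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

# Coefficients at the cross-validated lambda; slopes shrunk to exactly zero
# are the predictors that did not "survive"
lasso_coef <- coef(cv_fit, s = "lambda.min")
lasso_coef
```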
I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
                    Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
                  data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
library(ncf)  # provides spline.correlog()
Correlog_glm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                       y = Data_scaled[, "X"],
                                       z = residuals(glm_global, type = "pearson"),
                                       xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
library(lme4)  # provides glmer()
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
                       Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
                     data=Data_scaled, family=binomial)
When comparing the AIC of the global GLMM (1144.7) to that of the global GLM (1149.2) I get a Delta AIC value >2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown by the spline correlogram for the GLMM model.
Correlog_glmm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                        y = Data_scaled[, "X"],
                                        z = residuals(glmm_global, type = "pearson"),
                                        xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack” thinking this would allow me more flexibility with regards to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have temporal sequence in the data. When I ran the global model
library(geepack)  # provides geeglm()
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
                       Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
                     id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Spatial autocorrelation can be taken into account in your model in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could use the package nlme and specify a correlation structure for the residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this case the spatial autocorrelation is treated as continuous and approximated by a global function of distance. For a binary response, MASS::glmmPQL accepts these nlme correlation structures through its correlation argument.
Second, you could use the package mgcv and add a bivariate spline of the spatial coordinates to your model. This way you can capture a spatial pattern and even map it. Strictly speaking, this method does not model the spatial autocorrelation itself, but it may solve the problem. If space is discrete in your case, you could use a Markov random field smooth. This website is very helpful for finding examples: https://www.fromthebottomoftheheap.net
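The mgcv option could be sketched like this. The data are simulated because Data_scaled from the question is not available; the column names X, Y, cover and Encounter only mimic the question, and the REML smoothing-parameter method is an assumed choice.

```r
library(mgcv)

set.seed(7)

# Simulated stand-in data: coordinates, one covariate, and a binary response
# whose probability depends on a smooth spatial surface
n <- 500
d <- data.frame(X = runif(n), Y = runif(n), cover = rnorm(n))
spatial <- sin(2 * pi * d$X) * cos(2 * pi * d$Y)
d$Encounter <- rbinom(n, 1, plogis(-0.3 + 0.8 * d$cover + 1.5 * spatial))

# Binomial GAM with a bivariate smooth of the coordinates
fit_gam <- gam(Encounter ~ cover + s(X, Y),
               family = binomial, data = d, method = "REML")

summary(fit_gam)

# Map of the estimated spatial surface
plot(fit_gam, scheme = 2)
```

After fitting, the residual correlogram from the question can be recomputed on residuals(fit_gam, type = "pearson") to check whether the spatial smooth has absorbed the autocorrelation.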
Third, you could use the package brms. It lets you specify very complex models with other correlation structures for the residuals (CAR and SAR). The package uses a Bayesian approach.
I hope this helps. Good luck.
Context: I have fitted a glmnet to my data, but for operational reasons we would actually like to have a rule set. So I then fitted a C5.0Rules model to the predicted class from my glmnet, i.e. the C5.0Rules model is essentially approximating my glmnet. As a result, however, the C5.0Rules model reports a very high confidence (and other performance metrics), because its target is easy. A natural way to correct this is to re-estimate the confidence (and other performance metrics) using the real response, or another dataset. But I need to do this so that the model remembers the new confidence, so that in the future it reports the corrected confidence level along with the prediction. How do I do that?
Reproducible example:
library(glmnet)
library(C50)
library(caret)
data(churn)
## original glmnet
glmnet=train(churn~.-state-area_code-international_plan-voice_mail_plan,data=churnTrain,method="glmnet")
## only retain useful predictors
temp=varImp(glmnet)$importance
reducedVar=rownames(temp)[temp>0]
churnTrain2=data.frame(churnTrain[,match(reducedVar,colnames(churnTrain))],
prediction=fitted(glmnet))
## fit my C5.0 which approximates the glmnet prediction
C5=train(prediction~.,data=churnTrain2,method="C5.0Rules")
summary(C5) ## notice the high confidence and performance measure.
(An alternative approach I can think of is to get C5.0 to predict the predicted probability instead of class, but this turns it into a regression problem so I won't be able to use C5.0)
Are there any utilities/packages for showing various performance metrics of a regression model on some labeled test data? Basic stuff I can easily write like RMSE, R-squared, etc., but maybe with some extra utilities for visualization, or reporting the distribution of prediction confidence/variance, or other things I haven't thought of. This is usually reported in most training utilities (like caret's train), but only over the training data (AFAICT). Thanks in advance.
This question is really quite broad and should be focused a bit, but here's a small subset of functions written to work with linear models:
x <- rnorm(100)
y <- rnorm(100)
model <- lm(x~y)
#general summary
summary(model)
#Visualize some diagnostics
plot(model)
#Coefficient values
coef(model)
#Confidence intervals
confint(model)
#predict values
predict(model)
#predict new values
predict(model, newdata = data.frame(y = 1:10))
#Residuals
resid(model)
#Standardized residuals
rstandard(model)
#Studentized residuals
rstudent(model)
#AIC
AIC(model)
#BIC
BIC(model)
#Cook's distance
cooks.distance(model)
#DFFITS
dffits(model)
#lots of measures related to model fit
influence.measures(model)
Bootstrap confidence intervals for parameters of models can be computed using the recommended package boot. It is a very general package requiring you to write a simple wrapper function to return the parameter of interest, say fit the model with some supplied data and return one of the model coefficients, whilst it takes care of the rest, doing the sampling and computation of intervals etc.
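A wrapper of the kind described above might look like the following sketch. The model (mpg on wt in mtcars), the number of replicates and the percentile interval type are assumptions chosen only to keep the example small.

```r
library(boot)

set.seed(2024)

# Wrapper: refit the model on the resampled rows and return one coefficient
slope_fun <- function(data, indices) {
  fit <- lm(mpg ~ wt, data = data[indices, ])
  coef(fit)[["wt"]]
}

# boot() does the resampling; boot.ci() computes the interval
b <- boot(mtcars, statistic = slope_fun, R = 500)
ci <- boot.ci(b, type = "perc")

b$t0   # slope estimated on the original data
ci
```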
Consider also the caret package, which is a wrapper around a large number of modelling functions, but also provides facilities to compare model performance using a range of metrics using an independent test set or a resampling of the training data (k-fold, bootstrap). caret is well documented and quite easy to use, though to get the best out of it, you do need to be familiar with the modelling function you want to employ.
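For performance on held-out labelled data specifically, a rough sketch using caret's createDataPartition and postResample together with base R prediction intervals could look like this. The mtcars data, the 70/30 split and the model formula are illustrative assumptions.

```r
library(caret)

set.seed(123)

# Hold out roughly 30% of the rows as a labelled test set
idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]

model <- lm(mpg ~ wt + hp, data = train_set)

# Point predictions and test-set metrics (RMSE, R-squared, MAE)
pred <- predict(model, newdata = test_set)
test_metrics <- postResample(pred = pred, obs = test_set$mpg)
test_metrics

# 95% prediction intervals and their empirical coverage on the test set
pred_int <- predict(model, newdata = test_set, interval = "prediction")
coverage <- mean(test_set$mpg >= pred_int[, "lwr"] &
                 test_set$mpg <= pred_int[, "upr"])
coverage

# Observed vs predicted plot as a quick visual check
plot(test_set$mpg, pred, xlab = "Observed mpg", ylab = "Predicted mpg")
abline(0, 1, lty = 2)
```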