I'm trying to run a nest survival model using the logistic-exposure method of Shaffer (2004). I have a range of parameters and wish to compare all possible models and then estimate model-averaged parameters using shrinkage, as in Burnham and Anderson (2002). However, I am having trouble figuring out how to estimate confidence intervals for the shrinkage-adjusted parameters.
Is it possible to estimate confidence intervals for the model-averaged parameters estimated using shrinkage? I can easily extract the mean estimates for the model-averaged parameters with shrinkage using model.average$coef.shrinkage but am unclear how to get the corresponding confidence intervals.
Any help is gratefully appreciated. I'm currently working with the MuMIn package as I get errors with AICcmodavg regarding the link function.
Below is a simplified version of the code I'm using:
library(MuMIn)
# Logistical Exposure Link Function
# See Shaffer, T. 2004. A unifying approach to analyzing nest success.
# Auk 121(2): 526-540.
logexp <- function(days = 1)
{
    linkfun <- function(mu) qlogis(mu^(1/days))
    linkinv <- function(eta) plogis(eta)^days
    ## derivative of the inverse link; binomial()$mu.eta(eta) equals dlogis(eta)
    ## and avoids the old .Call("logit_mu_eta", ...), which newer R versions reject
    mu.eta <- function(eta) days * plogis(eta)^(days - 1) * binomial()$mu.eta(eta)
    valideta <- function(eta) TRUE
    link <- paste("logexp(", days, ")", sep = "")
    structure(list(linkfun = linkfun, linkinv = linkinv,
                   mu.eta = mu.eta, valideta = valideta, name = link),
              class = "link-glm")
}
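(For context: the link assumes a constant daily survival probability s over the exposure period, so the modelled success probability over d exposure days is mu = s^d; linkfun and linkinv above simply map between mu and logit(s).)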
# randomly generate data
nest.data <- data.frame(egg      = rep(1, 100),
                        chick    = runif(100),
                        exposure = trunc(rnorm(100, 113, 10)),
                        density  = rnorm(100, 0, 1),
                        height   = rnorm(100, 0, 1))
nest.data$chick[nest.data$chick<=0.5] <- 0
nest.data$chick[nest.data$chick!=0] <- 1
# run global logistic exposure model
glm.logexp <- glm(chick/egg ~ density * height,
                  family = binomial(logexp(days = nest.data$exposure)),
                  data = nest.data)
# evaluate all possible models
model.set <- dredge(glm.logexp)
# model-average across the full candidate set and estimate parameters using shrinkage
mod.avg <- model.avg(model.set, beta=TRUE)
(mod.avg$coef.shrinkage)
Any ideas on how to extract/generate the corresponding confidence intervals?
Thanks
Amy
After pondering about this for a while, I have come up with the following solution based on equation 5 in Lukacs, P. M., Burnham, K. P., & Anderson, D. R. (2009). Model selection bias and Freedman’s paradox. Annals of the Institute of Statistical Mathematics, 62(1), 117–125. Any comments on its validity would be appreciated.
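In symbols, my reading of that equation (the shrinkage-adjusted, unconditional variance of the model-averaged coefficient; please correct me if I have transcribed it wrongly) is

$$\widehat{\operatorname{var}}(\hat{\bar{\beta}}_j) = \sum_{i=1}^{R} w_i \left[ \widehat{\operatorname{var}}(\hat{\beta}_{j,i} \mid g_i) + (\hat{\beta}_{j,i} - \hat{\bar{\beta}}_j)^2 \right],$$

where $w_i$ is the Akaike weight of candidate model $g_i$, $\hat{\beta}_{j,i}$ is the estimate of coefficient $j$ under model $g_i$ (taken as 0 with zero variance when the term is absent, which is what the NA-to-zero replacement below does), and $\hat{\bar{\beta}}_j$ is the shrinkage estimate.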
The code follows on from that above:
# MuMIn generated shrinkage estimate
shrinkage.coef <- mod.avg$coef.shrinkage
# beta parameters for each variable/model combination
coef.array <- mod.avg$coefArray
coef.array <- replace(coef.array, is.na(coef.array), 0) # replace NAs with zeros
# generate empty dataframe for estimates
shrinkage.estimates <- data.frame(shrinkage.coef,variance=NA)
# calculate shrinkage-adjusted variance based on Lukacs et al, 2009
for(i in 1:dim(coef.array)[3]){
    input <- data.frame(coef.array[, , i], weight = model.set$weight)
    variance <- rep(NA, nrow(input))
    for (j in 1:nrow(input)){  # sum over candidate models (rows), not columns
        variance[j] <- input$weight[j] *
            (input$Std..Err[j]^2 + (input$Estimate[j] - shrinkage.estimates$shrinkage.coef[i])^2)
    }
    shrinkage.estimates$variance[i] <- sum(variance)
}
# calculate 95% confidence intervals (1.96 * standard error, i.e. the square root of the variance above)
shrinkage.estimates$lci <- shrinkage.estimates$shrinkage.coef - 1.96*sqrt(shrinkage.estimates$variance)
shrinkage.estimates$uci <- shrinkage.estimates$shrinkage.coef + 1.96*sqrt(shrinkage.estimates$variance)
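As a cross-check, newer versions of MuMIn can report the full (shrinkage) average and Wald intervals for it directly; a minimal sketch, assuming your MuMIn version's methods for "averaging" objects accept a full argument (check ?confint.averaging):

# built-in alternative (sketch): full/shrinkage average and its 95% CIs
summary(mod.avg)               # prints both the "full" and the "conditional" average
confint(mod.avg, full = TRUE)  # Wald 95% CIs for the full (shrinkage) coefficients

Comparing these against the intervals computed above is a useful sanity check on the hand-rolled calculation.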
I've been trying to manually reproduce the response values given by the predict.glm function from the stats package in R, but I'm unable to do so; I only know how to do it for a binomial model. I would really appreciate some help. I created two small models (one with a Gamma family and one with an inverse Gaussian family).
library(stats)
library(dplyr)
data("USArrests")
#Gamma distribution model
model_gam <- glm(Rape~Murder + Assault + UrbanPop, data=USArrests, family=Gamma)
print(summary(model_gam))
responses_gam <- model_gam %>% predict(USArrests[1,], type="response")
print(responses_gam)
#Trying to manually get responses for gamma model
paste(coef(model_gam), names(coef(model_gam)), sep="*", collapse="+")
# "0.108221470842499*(Intercept)+-0.00122165587689519*Murder+-9.47425665022909e-05*Assault+-0.000467789606041651*UrbanPop"
print(USArrests[1,])
#Murder: 13.2, Assault: 236, UrbanPop: 58
x = 0.108221470842499 - 0.00122165587689519 * 13.2 - 9.47425665022909e-05 * 236 - 0.000467789606041651 * 58
# This is wrong. Do I have to include the dispersion? (which is 0.10609)
print (exp(x)/(1+exp(x)))
# result should be (from predict function): 26.02872
# exp(x)/(1+exp(x)) gives: 0.510649
# Inverse Gaussian distribution model
model_gaus <- glm(Rape~Murder + Assault + UrbanPop, data=USArrests, family=inverse.gaussian(link="log"))
responses_gaus <- model_gaus %>% predict(USArrests[1,], type="response")
print(summary(model_gaus))
print(responses_gaus)
#Trying to manually get responses for the inverse Gaussian model
paste(coef(model_gaus), names(coef(model_gaus)), sep="*", collapse="+")
# "0.108221470842499*(Intercept)+-0.00122165587689519*Murder+-9.47425665022909e-05*Assault+-0.000467789606041651*UrbanPop"
x = 1.70049202188329-0.0326196928618521* 13.2 -0.00234379099421488*236-0.00991369000675323*58
# Dispersion in this case is 0.004390825
print(exp(x)/(1+exp(x)))
# result should be (from predict function): 26.02872
# exp(x)/(1+exp(x)) it is: 0.5353866
built-in predict()
predict(model_gaus)["Alabama"] ## 3.259201
by hand
cat(paste(round(coef(model_gaus),5), names(coef(model_gaus)), sep="*", collapse="+"),"\n")
## 1.70049*(Intercept)+0.03262*Murder+0.00234*Assault+0.00991*UrbanPop
USArrests["Albama",]
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
The intercept is always multiplied by 1, so we have
1.70049*1+0.03262*13.2+0.00234*236+0.00991*58
## [1] 3.258094
(close enough, since I rounded some things)
You don't need to do anything with the dispersion (it affects standard errors, not the fitted mean), and no inverse-link transformation is needed to match this number because predict() with its default type = "link" returns the linear predictor itself.
using the model matrix
Mathematically, the regression equation is defined as X %*% beta where beta is the vector of coefficients and X is the model matrix (for your example, it's a column of ones for the intercept plus your predictors; for models with categorical predictors or more complex terms like splines, it's a little more complicated). You can extract the model matrix from the fitted model with model.matrix():
Xg <- model.matrix(model_gaus)
drop(Xg["Alabama",] %*% coef(model_gaus))
For the Gamma model, you would use exactly the same procedure, but at the end you would transform the linear predictor you computed by 1/x (the inverse link function for the Gamma's default inverse link). (Note that you need predict(..., type = "response") to get the inverse-transformed prediction; otherwise, with the default type = "link", R gives you just the plain linear predictor.) If you used a log link instead you would exponentiate. More generally,
invlinkfun <- family(fitted_model)$linkinv
X <- model.matrix(fitted_model)
beta <- coef(fitted_model)
invlinkfun(X %*% beta)
The inverse Gaussian family uses a 1/mu^2 link by default (inverse.gaussian()$linkinv is function(eta) { 1/sqrt(eta) }), although the model above overrides that default with a log link.
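To make that concrete for the Gamma fit in the question, a minimal sketch (it relies only on model_gam defined above; Gamma() keeps its default inverse link there, so the inverse link is 1/eta):

# reproduce predict(model_gam, USArrests["Alabama", ], type = "response") by hand
Xg  <- model.matrix(model_gam)
eta <- drop(Xg["Alabama", ] %*% coef(model_gam))  # linear predictor (link scale)
family(model_gam)$linkinv(eta)                    # inverse link: 1/eta for Gamma's default link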
I fit a Generalized Additive Model in the Negative Binomial family using gam from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random variable ran. I fit the following model
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb")
I have read in the Negative Binomial Regression book that it is still possible for the model to be overdispersed. I have found code to check for overdispersion in a glm, but I am failing to find it for a gam. I have also encountered suggestions to just check the QQ plot and the standardised residuals vs. predicted values, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model (not used in this excerpt)
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and compute the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot)
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data; the bottoms of the bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
> sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
[1] 28.61546
> sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
[1] 0.5918471
In both cases these should be approximately 1; we see substantial overdispersion in the Poisson GAM and slight under-dispersion in the negative binomial GAM.
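If you want that check as a reusable one-liner, here is a small wrapper around the same Pearson calculation (a sketch; it should work for any glm or mgcv::gam fit that supports Pearson residuals):

# Pearson estimate of the dispersion parameter; values well above 1 indicate
# overdispersion relative to the conditional distribution assumed by the model
dispersion <- function(model) {
    sum(residuals(model, type = "pearson")^2) / df.residual(model)
}
dispersion(m_pois)  # ~28.6, as above
dispersion(m_nb)    # ~0.59, as above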
I am attempting to run a penalized regression on claims data, where the penalized covariates are simply the origin and development year and there are no unpenalized covariates. You occasionally run across negative y values, which is a nightmare because the link function is log.
A workaround is to redefine the quasipoisson family; I believe David Firth wrote the following code:
quasipoisson <- function (link = "log")
## Amended by David Firth, 2003.01.16, at points labelled ###
## to cope with negative y values
##
## Computes Pearson X^2 rather than Poisson deviance
##
## Starting values are all equal to the global mean
{
    linktemp <- substitute(link)
    if (!is.character(linktemp)) {
        linktemp <- deparse(linktemp)
        if (linktemp == "link")
            linktemp <- eval(link)
    }
    if (any(linktemp == c("log", "identity", "sqrt")))
        stats <- make.link(linktemp)
    else stop(paste(linktemp, "link not available for poisson family;",
                    "available links are \"identity\", \"log\" and \"sqrt\""))
    variance <- function(mu) mu
    validmu <- function(mu) all(mu > 0)
    dev.resids <- function(y, mu, wt) wt * (y - mu)^2 / mu   ###
    aic <- function(y, n, mu, wt, dev) NA
    initialize <- expression({
        n <- rep(1, nobs)
        mustart <- rep(mean(y), length(y))                   ###
    })
    structure(list(family = "quasipoisson", link = linktemp,
                   linkfun = stats$linkfun, linkinv = stats$linkinv,
                   variance = variance, dev.resids = dev.resids,
                   aic = aic, mu.eta = stats$mu.eta,
                   initialize = initialize, validmu = validmu,
                   valideta = stats$valideta),
              class = "family")
}
This acts as a workaround and is great when applying a standard GLM to claims data, i.e. you have:
model <- glm(claims ~ origin + dev, family = quasipoisson(),
subset=!is.na(claims), data=dataset)
Now I am interested in applying penalized regression in a similar manner; since the same issues arise with the log link, I would like to use this amended family. I am very new to this area, but I have installed the 'penalized' package in R and have tried the following:
GLM_train <- penalized(claims, ~ origin + dev, unpenalized = ~0,
                       data = df_train, model = "quasipoisson", lambda2 = 1)
Where 'df_train' is simply a dataframe containing three columns:
'claims' - the aggregate claim amount (wrt the triangle)
'origin'- the claim origin
'dev' - the development period (in months)
An example of df_train:
claims<-c(0.000, 8752.912, 16357.009, 20878.468, 22479.931, 21516.718, 22413.001, 22488.901, 21516.718, 23285.851 ,21793.339, 22765.522,
12253.054, 14257.608 ,19807.112 ,21696.767 ,22040.339, 21004.614 ,22061.039 ,22061.039, 20678.934, 21735.359, 20678.934, 21735.359)
origin<-c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12)
dev <-c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2)
df_train <- data.frame(claims = claims, origin = origin, dev = dev)
I have tried playing around with the models available within the package, but the GLM does not converge. I am hoping that introducing the quasipoisson family will change this; is there any way to run the penalized regression with this predefined quasipoisson as the model?
Apologies again if trivial - I am very new to this!
Thanks!
I have been struggling with the following problem for some time and would be very grateful for any help. I am running a logit model in R using the mlogit function and am able to generate the predicted probability of choosing each alternative for a given value of the predictors as follows:
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
Fish_fit<-Fish[-(1:4),]
Fish_test<-Fish[1:4,]
m <- mlogit(mode ~price+ catch | income, data = Fish_fit)
predict(m, newdata = Fish_test)
I cannot, however, work out how to add confidence intervals to the predicted probability estimates. I have already tried adding arguments to the predict function, but none seem to generate them. Any ideas on how it can be achieved would be much appreciated.
One approach here is Monte Carlo simulation. You'd simulate repeated draws from a multivariate-normal sampling distribution whose parameters are given by your model results.
For each simulation, estimate your predicted probabilities, and use their empirical distribution over simulations to get your confidence intervals.
library(MASS)
est_betas <- m$coefficients
est_preds <- predict(m, newdata = Fish_test)
sim_betas <- mvrnorm(1000, m$coefficients, vcov(m))
sim_preds <- apply(sim_betas, 1, function(x) {
m$coefficients <- x
predict(m, newdata = Fish_test)
})
sim_ci <- apply(sim_preds, 1, quantile, c(.025, .975))
cbind(prob = est_preds, t(sim_ci))
# prob 2.5% 97.5%
# beach 0.1414336 0.10403634 0.1920795
# boat 0.3869535 0.33521346 0.4406527
# charter 0.3363766 0.28751240 0.3894717
# pier 0.1352363 0.09858375 0.1823240
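One practical note: the interval endpoints are themselves Monte Carlo estimates, so if you need reproducible and stable quantiles, set a seed before drawing the simulated coefficients and/or increase the number of draws, for example:

set.seed(123)                                        # any fixed seed (arbitrary choice here)
sim_betas <- mvrnorm(5000, m$coefficients, vcov(m))  # more draws give steadier 2.5%/97.5% quantiles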
First time asking a question here; I'll do my best to be explicit, but let me know if I should provide more info! Also, this is a long question... hopefully it is simple to solve for someone ;)! Using R, I'm modelling multivariate GARCH models based on a paper (Manera et al. 2012).
I model the Constant Conditional Correlation (CCC) and Dynamic Conditional Correlation (DCC) models with external regressors in the mean equations, using R version 3.0.1 with the rugarch package (version 1.2-2) for the univariate GARCH models with external regressors and the ccgarch package (version 0.2.0-2) for the CCC/DCC models. (I'm currently looking into the rmgarch package, but it seems to cover only DCC and I need the CCC model too.)
My problem is with the mean equations of my models. In the paper mentioned above, the parameter estimates of the mean equations differ between the CCC and DCC models, and I don't know how to achieve that in R.
(I'm currently searching Google and looking into Tsay's book "Analysis of Financial Time Series" and Engle's book "Anticipating Correlations" to find my mistake.)
What I mean by "my mean equations don't change between the CCC and DCC models" is the following: I specify the univariate GARCH models for my n = 5 time series with the rugarch package. Then I take the estimated GARCH parameters (ARCH + GARCH terms) and use them in both the CCC and DCC functions eccc.sim() and dcc.sim(). From the eccc.estimation() and dcc.estimation() functions I can then retrieve the estimates for the variance equations as well as the correlation matrices, but not for the mean equations.
I post the R code (a reproducible version and my original one) for the univariate models and the CCC model only. Thank you for reading my post!
Note: in the code below, "data.repl" is a "zoo" object of dim 843x22 (9 daily commodity return series plus explanatory variable series). The multivariate GARCH model is for 5 series only.
Reproducible code:
# libraries:
library(rugarch)
library(ccgarch)
library(quantmod)
# Creating fake data:
dataRegr <- matrix(rep(rnorm(3149, 11, 1),1), ncol=1, nrow=3149)
dataFuelsLag1 <- matrix(rep(rnorm(3149, 24, 8),2), ncol=2, nrow=3149)
#S&P 500 via quantmod and Yahoo Finance
T0 <- "2000-06-23"
T1 <- "2012-12-31"
getSymbols("^GSPC", src="yahoo", from=T0, to=T1)
sp500.close <- GSPC[,"GSPC.Close"]
getSymbols("UBS", src="yahoo", from=T0, to=T1)
ubs.close <- UBS[,"UBS.Close"]
dataReplic <- merge(sp500.close, ubs.close, all=TRUE)
dataReplic[which(is.na(dataReplic[,2])),2] <- 0 #replace NA
### (G)ARCH modelling ###
#########################
# External regressors: macrovariables and all fuels+biofuel Working's T index
ext.regr.ext <- dataRegr
regre.fuels <- cbind(dataFuelsLag1, dataRegr)
### spec of GARCH(1,1) spec with AR(1) ###
garch11.fuels <- as.list(1:2)
for(i in 1:2){
garch11.fuels[[i]] <- ugarchspec(mean.model = list(armaOrder=c(1,0),
external.regressors = as.matrix(regre.fuels[,-i])))
}
### fit of GARCH(1,1) AR(1) ###
garch11.fuels.fit <- as.list(1:2)
for(i in 1:2){
garch11.fuels.fit[[i]] <- ugarchfit(garch11.fuels[[i]], dataReplic[,i])
}
##################################################################
#### CCC fuels: with external regressors in the mean equation ####
##################################################################
nObs <- nrow(dataReplic)   # number of observations in the example data
coef.unlist <- sapply(garch11.fuels.fit, coef)
cccFuels.a <- rep(0.1, 2)
cccFuels.A <- diag(coef.unlist[6,])
cccFuels.B <- diag(coef.unlist[7, ])
cccFuels.R <- psych::corr.test(as.data.frame(dataReplic))$r   # corr.test() is from the psych package
# model=extended (Jeantheau (1998))
ccc.fuels.sim <- eccc.sim(nobs = nObs, a=cccFuels.a, A=cccFuels.A,
B=cccFuels.B, R=cccFuels.R, model="extended")
ccc.fuels.eps <- ccc.fuels.sim$eps
ccc.fuels.est <- eccc.estimation(a=cccFuels.a, A=cccFuels.A,
B=cccFuels.B, R=cccFuels.R,
dvar=ccc.fuels.eps, model="extended")
ccc.fuels.condCorr <- round(psych::corr.test(ccc.fuels.est$std.resid)$r, digits = 3)
My original code:
### (G)ARCH modelling ###
#########################
# External regressors: macrovariables and all fuels+biofuel Working's T index
ext.regr.ext <- as.matrix(data.repl[-1,c(10:13, 16, 19:22)])
regre.fuels <- cbind(fuel.lag1, ext.regr.ext) #fuel.lag1 is the pre-lagged series
### spec of GARCH(1,1) spec with AR(1) ###
garch11.fuels <- as.list(1:5)
for(i in 1:5){
garch11.fuels[[i]] <- ugarchspec(mean.model = list(armaOrder=c(1,0),
external.regressors = as.matrix(regre.fuels[,-i])))
}# regre.fuels[,-i] => "-i" because I model an AR(1) for each mean equation
### fit of GARCH(1,1) AR(1) ###
garch11.fuels.fit <- as.list(1:5)
for(i in 1:5){
j <- i
if(j==5){j <- 7} #because 5th "fuels" is actually column #7 in data.repl
garch11.fuels.fit[[i]] <- ugarchfit(garch11.fuels[[i]], as.matrix(data.repl[-1,j]))
}
#fuelsLag1.names <- paste(cmdty.names[fuels.ind], "(-1)")
fuelsLag1.names <- cmdty.names[fuels.ind]
rowNames.ext <- c("Constant", fuelsLag1.names, "Working's T Gasoline", "Working's T Heating Oil",
"Working's T Natural Gas", "Working's T Crude Oil",
"Working's T Soybean Oil", "Junk Bond", "T-bill",
"SP500", "Exch.Rate")
ic.n <- c("Akaike", "Bayes")
garch11.ext.univSpec <- univ.spec(garch11.fuels.fit, ols.fit.ext, rowNames.ext,
rowNum=c(1:15), colNames=cmdty.names[fuels.ind],
ccc=TRUE)
##################################################################
#### CCC fuels: with external regressors in the mean equation ####
##################################################################
# From my GARCH(1,1)-AR(1) model, I extract ARCH and GARCH
# in order to model a CCC GARCH model:
nObs <- length(data.repl[-1,1])
coef.unlist <- sapply(garch11.fuels.fit, coef)
cccFuels.a <- rep(0.1, length(fuels.ind))
cccFuels.A <- diag(coef.unlist[17,])
cccFuels.B <- diag(coef.unlist[18, ])
#based on Engle(2009) book, page 31:
cccFuels.R <- corr.test(data.repl[,fuels.ind], data.repl[,fuels.ind])$r
# model=extended (Jeantheau (1998))
# "allow the squared errors and variances of the series to affect
# the dynamics of the individual conditional variances
ccc.fuels.sim <- eccc.sim(nobs = nObs, a=cccFuels.a, A=cccFuels.A,
B=cccFuels.B, R=cccFuels.R, model="extended")
ccc.fuels.eps <- ccc.fuels.sim$eps
ccc.fuels.est <- eccc.estimation(a=cccFuels.a, A=cccFuels.A,
B=cccFuels.B, R=cccFuels.R,
dvar=ccc.fuels.eps, model="extended")
ccc.fuels.condCorr <- round(corr.test(ccc.fuels.est$std.resid,
ccc.fuels.est$std.resid)$r,digits=3)
colnames(ccc.fuels.condCorr) <- cmdty.names[fuels.ind]
rownames(ccc.fuels.condCorr) <- cmdty.names[fuels.ind]
lowerTri(ccc.fuels.condCorr, rep=NA)
Are you aware that there is a whole package rmgarch for multivariate GARCH models?
Per its DESCRIPTION, it covers "Feasible multivariate GARCH models including DCC, GO-GARCH and Copula-GARCH."
Well, I hope this is not too late. Here is what I found from the rmgarch manual: "the CCC model is calculated using a static GARCH copula (Normal) model".
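In case it helps, a minimal DCC sketch with rmgarch would look roughly like the following (untested here; it reuses the two-column dataReplic object from the reproducible code above, which holds price levels rather than returns, and you would add your AR(1)/external-regressor terms to each ugarchspec() as in the question):

library(rmgarch)   # loads rugarch as well

# one univariate GARCH(1,1) spec per series; add external.regressors to the
# mean.model list as in the question if you need them in the mean equations
uspec <- multispec(replicate(2, ugarchspec(mean.model = list(armaOrder = c(1, 0)))))

# DCC(1,1) layered on the univariate specs
dcc_spec <- dccspec(uspec = uspec, dccOrder = c(1, 1), distribution = "mvnorm")
dcc_fit  <- dccfit(dcc_spec, data = dataReplic)
dcc_fit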