Quick way to calculate a confidence interval after changing dispersion parameter - r

I'm teaching a modeling class in R. The students are all SAS users, and I have to create course materials that exactly match (when possible) SAS output. I'm working on the Poisson regression section and trying to match PROC GENMOD, with a "dscale" option that modifies the dispersion index so that the deviance/df==1.
Easy enough to do, but I need confidence intervals. I'd like to show the students how to do it without hand calculating them. Something akin to confint_default() or confint()
Data
skin_cancer <- data.frame(CASES=c(1,16,30,71,102,130,133,40,4,38,
119,221,259,310,226,65),
CITY=c(rep(0,8),rep(1,8)),
N=c(172875, 123065,96216,92051,72159,54722,
32185,8328,181343,146207,121374,111353,
83004,55932,29007,7583),
agegp=c(1:8,1:8))
skin_cancer$ln_n = log(skin_cancer$N)
The model
fit <- glm(CASES ~ CITY, family="poisson", offset=ln_n, data=skin_cancer)
Changing the dispersion index
summary(fit, dispersion= deviance(fit) / df.residual(fit)))
That gets me the "correct" standard errors (correct according to SAS). But obviously I can't run confint() on a summary() object.
Any ideas? Bonus points if you can tell me how to change the dispersion index within the model so I don't have to do it within the summary() call.
Thanks.

This is an interesting question, and slightly deeper than it seems.
The simplest potential answer is to use family="quasipoisson" instead of poisson:
fitQ <- update(fit, family="quasipoisson")
confint(fitQ)
However, this won't let you adjust the dispersion to be whatever you want; it specifically changes the dispersion to the estimate R calculates in summary.glm, which is based on the Pearson chi-squared (sum of squared Pearson residuals) rather than the deviance, i.e.
sum((object$weights * object$residuals^2)[object$weights > 0])/df.r
You should be aware that stats:::confint.glm() (which actually uses MASS:::confint.glm) computes profile confidence intervals rather than Wald confidence intervals (i.e., this is not just a matter of adjusting the standard deviations).
If you're satisfied with Wald confidence intervals (which are generally less accurate) you could hack stats::confint.default() as follows (note that the dispersion title is a little bit misleading, as this function basically assumes that the original dispersion of the model is fixed to 1: this won't work as expected if you use a model that estimates dispersion).
confint_wald_glm <- function(object, parm, level=0.95, dispersion=NULL) {
cf <- coef(object)
pnames <- names(cf)
if (missing(parm))
parm <- pnames
else if (is.numeric(parm))
parm <- pnames[parm]
a <- (1 - level)/2
a <- c(a, 1 - a)
pct <- stats:::format.perc(a, 3)
fac <- qnorm(a)
ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm,
pct))
ses <- sqrt(diag(vcov(object)))[parm]
if (!is.null(dispersion)) ses <- sqrt(dispersion)*ses
ci[] <- cf[parm] + ses %o% fac
ci
}
confint_wald_glm(fit)
confint_wald_glm(fit,dispersion=2)

Related

why I can't get a confidence interval using predict function in R

I am trying to get a confidence interval for my response in a poisson regression model. Here is my data:
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
What I've done so far:
(i) read my data
(ii) fit a Y by poisson regression with X as a covariate
model <- glm(Y ~ X, family = "poisson", data = mydata)
(iii) use predict()
predict(model,newdata=data.frame(X=4),se.fit=TRUE,interval="confidence",level=0.95, type = "response")
I was expecting to get "fit, lwr, upr" for my response but I got the following instead:
$fit
1
30.21439
$se.fit
1
6.984273
$residual.scale
[1] 1
Could anyone offer some suggestions? I am new to R and struggling with this problem for a long time.
Thank you very much.
First, the function predict() that you are using is the method predict.glm(). If you look at its help file, it does not even have arguments 'interval' or 'level'. It doesn't flag them as erroneous because predict.glm() has the (in)famous ... argument, that absorbs all 'extra' arguments. You can write confidence=34.2 and interval="woohoo" and it still gives the same answer. It only produces the estimate and the standard error.
Second, one COULD then take the fit +/- 2*se to get an approximate 95 percent confidence interval. However, without getting into the weeds of confidence intervals, pivotal statistics, non-normality in the response scale, etc., this doesn't give very satisfying intervals because, for instance, they often include impossible negative values.
So, I think a better approach is to form an interval in the link scale, then transform it (this is still an approximation, but probably better):
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
model <- glm(Y ~ X, family = "poisson")
tmp <- predict(model, newdata=data.frame(X=4),se.fit=TRUE, type = "link")
exp(tmp$fit - 2*tmp$se.fit)
1
19.02976
exp(tmp$fit + 2*tmp$se.fit)
1
47.97273

Getting the AR(p) values from ARMA model in R

I am interested to get the values of the AR(p) part from any ARIMA model in R.
For example, suppose I estimate the following ARIMA model
I am interested to get these values defined as:
What I do is this:
ar_model <- auto.arima(mydata) # fit an ARIMA model as decided by the auto.arima
fitted_ar_model <- fitted(ar_model)
resid_ar_model <- residuals(ar_model)
ar_p <- fitted_ar_model - resid_ar_model
The thing with code is that I am not sure whether the residuals(ar_model) include only the contemporaneous residuals or also the moving average part.
I will attempt an answer here. Unfortunately, writing formulas in SO is really cumbersome. So, firstly, I think you need to make a distinction between the "epsilon" of your DGP (data generation process), also called shocks or errors (which are unobservable) and the model residuals. Shocks are unobservable because you don't know the true values of the model parameters (in this case the AR and MA parameters) - you can only estimate them. So there is a difference (in a simple AR(1) model without a constant for ease of writing using a "hat" notation for estimates) between y_t - phi_1*y_{t-1} and y_t - \hat{phi_1}*y_{t-1} - the former being the "shock" and the latter being the "residual". Typically you would also have as notation \hat{y_t} = \hat{phi_1}*y_{t-1} and so residual defined as e_t = y_t - \hat{y_t}.
Now so, given that, I think what you would like to get is y_t - e_t - \hat{phi}*e_{t-1}, \hat{phi} being the estimated MA1 parameter. The following is an example where I have explicitly fitted an ARMA(2,1) without a constant to the LakeHuron time series. I have looked at the model residuals using residuals (resid in df) and also calculated by hand (resid2 in df) - they differ a bit due to approximating initial conditions that the residuals method makes which I didn't really dig into. You will see that the values get closer and closer as t increases and the influence of the initial conditions diminishes (this is due to the fact that we have a stationary ARMA model). In the same way I have calculated the "AR_fit" as I call it once as ar_1 * y_{t-1} + ar_2 * y_{t-2} (ar_fit1 in df) and a second time as y_t - e_t - ma_1 * e_{t-1} (ar_fit2 in df). Again, you will see that the values converge as t increases (due to the same initial conditions used to calculate the residuals using the residuals function).
library(forecast)
data <- LakeHuron - mean(LakeHuron)
nobs <- length(data)
ar_model <- arima(data, order = c(2, 0, 1), include.mean = FALSE)
fitted <- fitted(ar_model)
resid <- residuals(ar_model)
ar1 <- ar_model$coef[1]
ar2 <- ar_model$coef[2]
ma1 <- ar_model$coef[3]
df <- data.frame(data, fitted, resid)
resid2 = data[3:nobs] - ar1*data[2:(nobs-1)] - ar2*data[1:(nobs-2)] - ma1*resid[2:(nobs-1)]
resid2 <- c(NA, NA, resid2)
df$resid2 <- resid2
ar_fit1 <- ar1*data[2:(nobs-1)] + ar2*data[1:(nobs-2)]
ar_fit1<- c(NA, NA, ar_fit1)
df$ar_fit1 <- ar_fit1
ar_fit2 <- data[3:nobs] - resid[3:nobs] - ma1*resid[2:(nobs-1)]
ar_fit2<- c(NA, NA, ar_fit2)
df$ar_fit2 <- ar_fit2

Performing Anova on Bootstrapped Estimates from Quantile Regression

So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
i.no <- which(quantiles==i)
q.Result[[i.no]] <- rq(FML, tau=i, data, method="fn", na.action=na.omit)
}
Then i call anova.rq which runs a Wald test on all the models and outputs a pvalue for each covariate telling me whether the effects of each covariate vary significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
Thats works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their error is preferable. Which i conduct with a slight modification of the code above.
q.Result <- rqs(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
Here's where i get stuck. The quantreg currently cannot peform the anova (Wald) test on boostrapped estimates. The information files on the quantreg packages specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the boostrapping method.
Looking at the details of the anova.rq method. I can see that it requires 2 components not present in the quantile model when bootstrapping.
1) Hinv (Inverse Hessian Matrix). The package information files specifically states "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J which, according to the information files, is "Unscaled Outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv. as for the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)"
Can i calculate or estimate Hinv and J from the bootstrapped estimates? If not what is the best way to proceed?
Any help on this much appreciated. This my first timing posting a question here, though I've greatly benefited from the answers to other peoples questions in the past.
For question 2: You can use R = for resampling. For example:
anova(object, ..., test = "Wald", joint = TRUE, score =
"tau", se = "nid", R = 10000, trim = NULL)
Where R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
Just a heads up, you'll probably get a better response to your questions if you only include 1 question per post.
Consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse' computed from bootstrapped estimates. However we resolved that estimates from different taus could be compared using Wald test as follows.
From object rqs produced by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped Beta values for variable of interest in this case VAR, the first covariate in FML for each tau
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # Extract liner estimate data linear estimate
Then compute wald statistic and get pvalue with number of quantiles for degrees of freedom
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that bootstrapped Betas are normally distributed, and if you're running many taus it can be cumbersome to check all those QQ plots so just sum them by row
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working, and if anyone can think of anything wrong with my solution, please share

Finding critical values for the Pearson correlation coefficient

I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
cor(x,y)
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function( n, alpha = .05 ) {
df <- n - 2
critical.t <- qt(alpha/2, df, lower.tail = F)
critical.r <- sqrt( (critical.t^2) / ( (critical.t^2) + df ) )
return(critical.r)
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r( 100 )
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
methods(cor.test)
getAnywhere(cor.test.default)
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative, less = c(-Inf, z + sigma *
qnorm(conf.level)), greater = c(z - sigma * qnorm(conf.level),
Inf), two.sided = z + c(-1, 1) * sigma * qnorm((1 +
conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
I would do the same. But if you are using a Spearman correlation you need to convert t into r using a different formula.
just change the last line before the return in the function with this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that the r squared is not defined for non-linear models, or at least very tricky, quote from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
Sounds like f are your predicted values. So the distance from them to the actual values devided by n * variance of y
so something like
1-sum((y-f)^2)/(length(y)*var(y))
should give you a quasi rsquared value, so long as your model is reasonably close to a linear model and n is pretty big.
As a direct answer to the question asked (rather than argue that R2/pseudo R2 aren't useful) the nagelkerke function in the rcompanion package will report various pseudo R2 values for nonlinear least square (nls) models as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
require(nls)
data(BrendonSmall)
quadplat = function(x, a, b, clx) {
ifelse(x < clx, a + b * x + (-0.5*b/clx) * x * x,
a + b * clx + (-0.5*b/clx) * clx * clx)}
model = nls(Sodium ~ quadplat(Calories, a, b, clx),
data = BrendonSmall,
start = list(a = 519,
b = 0.359,
clx = 2304))
nullfunct = function(x, m){m}
null.model = nls(Sodium ~ nullfunct(Calories, m),
data = BrendonSmall,
start = list(m = 1346))
nagelkerke(model, null=null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)
rss <- sum(w * res ^ 2)
resp <- pred + res
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]
int.df <- 1
tss <- sum(w * (resp - center)^2)
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
adj.R.squared = adj.r.sq)
which is also the pseudo R2 as calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit becomes compared to if you would just draw a flat horizontal line through them. This can make sense for nls models if your null model is one that allows for an intercept only model. Also for particular other nonlinear models it can make sense. E.g. for a scam model that uses stricly increasing splines (bs="mpi" in the spline term), the fitted model for the worst possible scenario (e.g. where your data was strictly decreasing) would be a flat line, and hence would result in an R2 of zero. Adjusted R2 then also penalize models with higher nrs of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892436/ (besides if one swears by using information criteria to do model selection the question becomes which one to use - AIC, BIC, EBIC, AICc, QIC, etc).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
I think would also make sense if you have normal Gaussian errors - i.e. the correlation between the observed and fitted y (clipped at zero, so that a negative relationship would imply zero predictive power) squared, and then adjusted for the nr of fitted parameters in the adjusted version. If y and yfitted go in the same direction this would be the R2 and adjusted R2 value as reported for a regular linear model. To me this would make perfect sense at least, so I don't agree with outright rejecting the usefulness of pseudo R2 values for nls models as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1-residual deviance/null deviance
See here and here for some useful discussion.
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
As an alternative to this problem I used at several times the following procedure:
compute a fit on data with the nls function
using the resulting model make predictions
Trace (plot...) the data against the values predicted by the model (if the model is good, points should be near the bissectrix).
Compute the R2 of the linear régression.
Best wishes to all. Patrick.
With the modelr package
modelr::rsquare(nls_model, data)
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
nullfunct <- function(x, m){m}
null_model <- nls(mpg ~ nullfunct(wt, m),
data = mtcars,
start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
lm(mpg ~ predict(nls_model), data = mtcars) %>% broom::glance()
# 0.795
Like they say, it's only an approximation.

Resources