tensor product smoothers the gam() function te() in R - gamlss

I am working with additive models, but there is no interaction between age and height in my data. How can I modify the following code to remove the interaction between age and height? Do I need to remove ga(~te(lheight,lage,k=10)) from the code? Also, if I want to keep the interaction, how can I calculate the coefficient of height in the Mu Coefficients section? I need to write out the formulas for the mu, sigma and nu coefficients.
library(gamlss.add)
library(parallel)  # provides detectCores()
nC <- detectCores()
dbhh <- transform(dbhh, lage = log(age), lheight = log(ht))
M1 <- gamlss(head ~ 1, family = BCTo, data = dbhh, n.cyc = 100)
M2 <- stepGAICAll.A(M1,
                    scope = list(lower = ~1,
                                 upper = ~pb(lheight) + pb(lage) +
                                   ga(~te(lheight, lage, k = 10))),
                    k = 4, parallel = "multicore", ncpus = nC)
dbhh1 <- subset(dbhh, (resid(M2) > -3.5) & (resid(M2) < 3.5))
M3 <- gamlss(head ~ ga(~te(log(age), log(ht), k = 10)),
             sigma.fo = ~pb(log(ht)), nu.fo = ~pb(log(age)),
             tau.fo = ~pb(log(ht)), family = BCTo, data = dbhh1,
             n.cyc = 100)


Quick way to calculate a confidence interval after changing dispersion parameter

I'm teaching a modeling class in R. The students are all SAS users, and I have to create course materials that exactly match (when possible) SAS output. I'm working on the Poisson regression section and trying to match PROC GENMOD, with a "dscale" option that modifies the dispersion index so that the deviance/df==1.
Easy enough to do, but I need confidence intervals. I'd like to show the students how to do it without hand calculating them, using something akin to confint.default() or confint().
Data
skin_cancer <- data.frame(CASES=c(1,16,30,71,102,130,133,40,4,38,
119,221,259,310,226,65),
CITY=c(rep(0,8),rep(1,8)),
N=c(172875, 123065,96216,92051,72159,54722,
32185,8328,181343,146207,121374,111353,
83004,55932,29007,7583),
agegp=c(1:8,1:8))
skin_cancer$ln_n = log(skin_cancer$N)
The model
fit <- glm(CASES ~ CITY, family="poisson", offset=ln_n, data=skin_cancer)
Changing the dispersion index
summary(fit, dispersion = deviance(fit) / df.residual(fit))
That gets me the "correct" standard errors (correct according to SAS). But obviously I can't run confint() on a summary() object.
Any ideas? Bonus points if you can tell me how to change the dispersion index within the model so I don't have to do it within the summary() call.
Thanks.
This is an interesting question, and slightly deeper than it seems.
The simplest potential answer is to use family="quasipoisson" instead of poisson:
fitQ <- update(fit, family="quasipoisson")
confint(fitQ)
However, this won't let you adjust the dispersion to be whatever you want; it specifically changes the dispersion to the estimate R calculates in summary.glm, which is based on the Pearson chi-squared (sum of squared Pearson residuals) rather than the deviance, i.e.
sum((object$weights * object$residuals^2)[object$weights > 0])/df.r
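As a quick check of that statement, the sketch below compares the dispersion reported for a quasi-Poisson fit with the Pearson chi-squared statistic divided by the residual degrees of freedom, and with the deviance-based value the question asks for. It uses the built-in warpbreaks data rather than the data from the question; that choice, and the names fit_p, fit_q and disp_*, are assumptions made purely for illustration.

```r
# Illustrative data: warpbreaks ships with base R (not the question's data)
fit_p <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
fit_q <- update(fit_p, family = quasipoisson)

# Dispersion estimated by summary.glm() for the quasi-Poisson fit
disp_quasi <- summary(fit_q)$dispersion

# Pearson chi-squared / residual df, computed by hand
disp_pearson <- sum(residuals(fit_p, type = "pearson")^2) / df.residual(fit_p)

# Deviance / residual df: the quantity the question wants to use instead
disp_deviance <- deviance(fit_p) / df.residual(fit_p)

c(quasi = disp_quasi, pearson = disp_pearson, deviance = disp_deviance)
```

The first two values should agree, while the deviance-based value will in general differ, which is why quasipoisson alone does not reproduce a deviance-scaled analysis.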
You should be aware that stats:::confint.glm() (which actually uses MASS:::confint.glm) computes profile confidence intervals rather than Wald confidence intervals (i.e., this is not just a matter of adjusting the standard deviations).
If you're satisfied with Wald confidence intervals (which are generally less accurate), you could hack stats::confint.default() as follows. Note that the name of the dispersion argument is a little bit misleading, as this function basically assumes that the original dispersion of the model is fixed at 1; it won't work as expected if you use a model that estimates dispersion.
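confint_wald_glm <- function(object, parm, level=0.95, dispersion=NULL) {
  cf <- coef(object)
  pnames <- names(cf)
  if (missing(parm))
    parm <- pnames
  else if (is.numeric(parm))
    parm <- pnames[parm]
  a <- (1 - level)/2
  a <- c(a, 1 - a)
  # percentage labels as in confint.default(), built without relying on
  # an unexported stats helper whose name differs between R versions
  pct <- paste(format(100 * a, trim = TRUE, scientific = FALSE, digits = 3), "%")
  fac <- qnorm(a)
  ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm,
                                                             pct))
  ses <- sqrt(diag(vcov(object)))[parm]
  # rescale the standard errors, assuming the fitted dispersion was 1
  if (!is.null(dispersion)) ses <- sqrt(dispersion)*ses
  ci[] <- cf[parm] + ses %o% fac
  ci
}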
confint_wald_glm(fit)
confint_wald_glm(fit,dispersion=2)
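For the specific case in the question, the same calculation can be written out by hand. This is a minimal sketch under the assumption that the model's original dispersion is fixed at 1 (as it is for family="poisson"); the names phi, se_scaled and ci_wald are introduced here for illustration only.

```r
# Data and model from the question
skin_cancer <- data.frame(
  CASES = c(1, 16, 30, 71, 102, 130, 133, 40,
            4, 38, 119, 221, 259, 310, 226, 65),
  CITY  = rep(0:1, each = 8),
  N     = c(172875, 123065, 96216, 92051, 72159, 54722, 32185, 8328,
            181343, 146207, 121374, 111353, 83004, 55932, 29007, 7583)
)
skin_cancer$ln_n <- log(skin_cancer$N)
fit <- glm(CASES ~ CITY, family = "poisson", offset = ln_n, data = skin_cancer)

# Deviance-based dispersion, so that deviance / df == 1 after scaling
phi <- deviance(fit) / df.residual(fit)

# Scale the model-based standard errors by sqrt(phi)
se_scaled <- sqrt(phi) * sqrt(diag(vcov(fit)))

# 95% Wald intervals on the log scale
z <- qnorm(0.975)
ci_wald <- cbind(lower = coef(fit) - z * se_scaled,
                 upper = coef(fit) + z * se_scaled)
ci_wald
```

The scaled standard errors should match those printed by summary(fit, dispersion = phi), so the intervals are consistent with that summary table.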

Interpretation of output and prediction from cpglmm

I am helping a colleague fit a compound Poisson generalized linear mixed model in R, using the cpglmm function from the cplm package. The model involves a three-way interaction and I would like to compute some interpretable quantities. So far, I have tried to calculate some odds ratios, but I am not sure this is the right way to do it.
# Fit model with three-way interaction in fixed effects #
m <- cpglmm(ncs ~ diversity_index*diversity_speciality*n_authors + selfcit +
n_refs + (1|region), data = diversity)
# Calculate Odds-ratio #
se <- sqrt(diag(vcov(m)))
tab <- cbind(Est = m$fixef,
S.E. = se,
LL = m$fixef - 1.96 * se,
UL = m$fixef + 1.96 * se)
print(exp(tab), digits=3)
I also want to compute some predicted values, e.g. predicted probabilities or the like, but I can't get predict() to work for the cpglmm fit. Are there any functions I could use?

Creating R Squared function for CPLM package

For my graduate research I'm using the CPLM package (specifically the cpglmm function) to account for zero-inflated data (Tweedie compound Poisson distribution) in a data set looking at the effects of logging on breeding bird densities. This isn't a widely used package like lme4, nlme, etc. Therefore, the model validation methods that can be used on these more commonly used packages cannot be used on cpglmm.
I'm currently at the stage of describing the fit of my models and am trying to calculate R-squared values, both marginal and conditional. Unfortunately I cannot use the r2glmm package or MuMIn to calculate R-squared values because they do not support cpglmm. Therefore, I've had to calculate those values manually through an example found here (example found in Appendix 6 under cpglmm parasite models, pg. 33). Here's the script from that example:
# Fit null model without fixed effects (but including all random effects)
parmodCPr <- cpglmm(Parasite ~ 1 + (1 | Population) + (1 | Container), data = DataAll)
# Fit alternative model including fixed and all random effects
parmodCPf <- cpglmm(Parasite ~ Sex + Treatment + Habitat + (1 | Population) +
(1 | Container), data = DataAll)
# Calculation of the variance in fitted values
VarF <- var(as.vector(model.matrix(parmodCPf) %*% fixef(parmodCPf)))
# getting the observation-level variance Null model
phiN <- parmodCPr@phi # the dispersion parameter
pN <- parmodCPr@p # the index parameter
mu <- exp(fixef(parmodCPr) + 0.5 * (VarCorr(parmodCPr)$Population[1] + VarCorr(parmodCPr)$Container[1]))
VarOdN <- phiN * mu^(pN - 2) # the delta method
# Full model
phiF <- parmodCPf@phi # the dispersion parameter
pF <- parmodCPf@p # the index parameter
VarOdF <- phiF * mu^(pF - 2) # the delta method
# R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
R2glmmM <- VarF/(VarF + sum(as.numeric(VarCorr(parmodCPf))) + VarOdF)
# R2[GLMM(c)] - conditional R2[GLMM] for full model
R2glmmC <- (VarF + sum(as.numeric(VarCorr(parmodCPf))))/(VarF + sum(as.numeric(VarCorr(parmodCPf))) +
VarOdF)
What I would like to be able to do is write a function in R using this code that takes my models as input and outputs both the marginal and conditional R-squared values (R2glmmM and R2glmmC). I'd greatly appreciate any help with this problem. Hopefully I have supplied enough information.
Thanks.
Believe I figured it out. Here's an example I wrote up:
R2glmm <- function(model, NULLmodel) {
  # model: the alternate (full) cpglmm fit; NULLmodel: the matching null fit
  # Calculation of the variance in fitted values
  VarALT <- var(as.vector(model.matrix(model) %*% fixef(model)))
  # getting the observation-level variance Null model
  phiNULL <- NULLmodel$phi # the dispersion parameter
  pNULL <- NULLmodel$p # the index parameter
  # YEAR is the random-effect grouping factor in my models
  mu <- exp(fixef(NULLmodel) + 0.5 * (VarCorr(NULLmodel)$YEAR[1]))
  VarOdNULL <- phiNULL * mu^(pNULL - 2) # the delta method
  # Alternate model
  phiALT <- model$phi # the dispersion parameter
  pALT <- model$p # the index parameter
  VarOdALT <- phiALT * mu^(pALT - 2) # the delta method
  # R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
  R2glmmM <- VarALT/(VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
  # R2[GLMM(c)] - conditional R2[GLMM] for full model
  R2glmmC <- (VarALT + sum(as.numeric(VarCorr(model))))/(VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
  return(c(R2glmmM, R2glmmC))
}
Variables containing ALT refer to the alternate model. "model" represents any cpglmm model you need to run through the function, and NULLmodel is the corresponding null model (no fixed effects other than the intercept, same random effects). Replace YEAR with the name of your own random-effect grouping factor.
Hope this helps someone out. Been working on this problem and other related ones for ages now.
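The final arithmetic step is independent of cplm, so it can be checked on its own. The sketch below assumes the three variance components have already been extracted and are on the same scale; the helper name r2_from_components and the numbers passed to it are made up for illustration.

```r
# Hypothetical helper: combine variance components into marginal and
# conditional R-squared (all inputs assumed to be on the same scale).
r2_from_components <- function(var_fixed, var_random, var_obs) {
  var_random_total <- sum(var_random)          # sum over all random effects
  total <- var_fixed + var_random_total + var_obs
  c(marginal    = var_fixed / total,
    conditional = (var_fixed + var_random_total) / total)
}

# Made-up variance components, purely to show the arithmetic:
# marginal = 2 / 8, conditional = (2 + 2) / 8
r2 <- r2_from_components(var_fixed = 2, var_random = c(1, 1), var_obs = 4)
r2
```

The conditional value can never be smaller than the marginal one, because it adds the random-effect variance to the numerator while leaving the denominator unchanged.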

Bootstrapping CI for a Logistic Regression Model

I have a logistic regression model that I am using to predict size at maturity for king crab, but I am having trouble setting up the code for bootstrapping using the boot package. This is what I have:
#FEMALE GKC SAM#
LowerChatham<-read.table(file=file.choose(),header=TRUE)
#LOGISTIC REGRESSION FIT#
glm.out<-glm(Mature~CL,family=binomial(link=logit),data=LowerChatham)
plot(Mature~CL,data=LowerChatham)
lines(LowerChatham$CL,glm.out$fitted,col="red")
title(main="Lower Chatham")
summary(glm.out)
segments(98.9,0,98.9,0.5,col=1,lty=3,lwd=3)
SAM<-data.frame(CL=98.97)
predict(glm.out,SAM,type="response")
I would like to bootstrap the statistic CL=98.97, since I am interested in the size at which 50% of crab are mature, but I have no idea how to set up my function to specify that statistic, let alone how to set up the bootstrap in general to get my 95% C.I. Any help would be greatly appreciated! Thanks!
In each bootstrap iteration, you want to do something like
range <- 1:100 # this could be any substantively meaningful range
p <- predict(glm.out, newdata = data.frame(CL=range), "response")
range[match(TRUE,p>.5)] # smallest CL at which the predicted probability of maturity exceeds 0.5
where you specify a range of values of CL to whatever precision you need. Then calculate the predicted probability of maturity at each of those levels. Then find the threshold value in the range where the predicted probability crosses 0.5. This is the statistic it sounds like you want to bootstrap.
You also don't need the boot package to do this. If you define a function that resamples the data and returns that statistic as its result, you can just do replicate(1000, myfun()) to get your bootstrap distribution, as follows:
myfun <- function(){
  # resample rows with replacement and refit the model
  srows <- sample(1:nrow(LowerChatham),nrow(LowerChatham),TRUE)
  glm.out <- glm(Mature ~ CL, family=binomial(link=logit), data=LowerChatham[srows,])
  range <- 1:100 # this could be any substantively meaningful range
  p <- predict(glm.out, newdata = data.frame(CL=range), "response")
  return(range[match(TRUE,p>.5)]) # smallest CL with predicted probability of maturity above 0.5
}
bootdist <- replicate(1000, myfun()) # your distribution
quantile(unlist(bootdist),c(.025,.975)) # 95% CI
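Because the fitted logit is linear in CL, the size at which the predicted probability equals 0.5 can also be obtained directly as -intercept/slope, which avoids the grid and its limited precision. A sketch of bootstrapping that statistic on simulated data follows; the simulated data frame crab, its true value of 100, and the helper name l50 are assumptions made for illustration and are not taken from the question.

```r
set.seed(1)

# Simulated stand-in for the crab data: true 50% maturity at CL = 100
n <- 300
crab <- data.frame(CL = runif(n, 60, 140))
crab$Mature <- rbinom(n, 1, plogis(-20 + 0.2 * crab$CL))

# Size at 50% maturity: solve b0 + b1 * CL = 0 for CL
l50 <- function(d) {
  b <- coef(glm(Mature ~ CL, family = binomial(link = "logit"), data = d))
  unname(-b[1] / b[2])
}

l50_hat <- l50(crab)

# Nonparametric bootstrap: resample rows with replacement and re-estimate
boot_l50 <- replicate(500, l50(crab[sample(nrow(crab), replace = TRUE), ]))

# Percentile 95% confidence interval
ci <- quantile(boot_l50, c(0.025, 0.975))
l50_hat
ci
```

With real data, replace crab by the observed data frame; the rest of the sketch stays the same.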

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that R-squared is not defined for nonlinear models, or is at least very tricky to define. To quote from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
It sounds like f are your predicted values. So take the sum of squared distances from them to the actual values, divided by n * the variance of y,
so something like
1-sum((y-f)^2)/(length(y)*var(y))
should give you a quasi rsquared value, so long as your model is reasonably close to a linear model and n is pretty big.
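To see how close that approximation is to 1 - RSS/TSS, here is a small sketch using an nls fit to the built-in mtcars data. The model and the names nls_fit, r2_exact and r2_approx are chosen only for illustration.

```r
# Nonlinear least-squares fit on a built-in data set
nls_fit <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
y <- mtcars$mpg
f <- predict(nls_fit)

rss <- sum((y - f)^2)

# 1 - RSS/TSS, with TSS computed around the mean of y
r2_exact <- 1 - rss / sum((y - mean(y))^2)

# The formula above: var() divides by n - 1, so n * var(y) is slightly
# larger than TSS and the result is slightly larger than r2_exact
r2_approx <- 1 - rss / (length(y) * var(y))

c(exact = r2_exact, approx = r2_approx)
```

The two values differ by (1 - r2_exact)/n, which is why the approximation is only reasonable when n is fairly large.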
As a direct answer to the question asked (rather than argue that R2/pseudo R2 aren't useful) the nagelkerke function in the rcompanion package will report various pseudo R2 values for nonlinear least square (nls) models as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
library(rcompanion) # provides nagelkerke() and the BrendonSmall data
data(BrendonSmall)
quadplat = function(x, a, b, clx) {
  ifelse(x < clx, a + b * x + (-0.5*b/clx) * x * x,
         a + b * clx + (-0.5*b/clx) * clx * clx)}
model = nls(Sodium ~ quadplat(Calories, a, b, clx),
            data = BrendonSmall,
            start = list(a = 519,
                         b = 0.359,
                         clx = 2304))
# null model: a single constant
nullfunct = function(x, m){m}
null.model = nls(Sodium ~ nullfunct(Calories, m),
                 data = BrendonSmall,
                 start = list(m = 1346))
nagelkerke(model, null=null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)
rss <- sum(w * res ^ 2)
resp <- pred + res
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]
int.df <- 1
tss <- sum(w * (resp - center)^2)
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
adj.R.squared = adj.r.sq)
which is also the pseudo R2 as calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit is than a flat horizontal line drawn through the data. This can make sense for nls models if your null model is an intercept-only model. It can also make sense for particular other nonlinear models. E.g. for a scam model that uses strictly increasing splines (bs="mpi" in the spline term), the fitted model for the worst possible scenario (e.g. where your data were strictly decreasing) would be a flat line, and hence would result in an R2 of zero. The adjusted R2 then also penalizes models with higher numbers of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892436/ (besides, if one swears by using information criteria to do model selection, the question becomes which one to use: AIC, BIC, EBIC, AICc, QIC, etc.).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
would, I think, also make sense if you have normal (Gaussian) errors: i.e. the squared correlation between the observed and fitted y (clipped at zero, so that a negative relationship implies zero predictive power), adjusted for the number of fitted parameters in the adjusted version. If y and yfitted go in the same direction, this would be the R2 and adjusted R2 value as reported for a regular linear model. To me, at least, this makes perfect sense, so I don't agree with outright rejecting the usefulness of pseudo R2 values for nls models as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1-residual deviance/null deviance
See here and here for some useful discussion.
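A minimal sketch of that deviance-based quantity, using a Poisson GLM on the built-in InsectSprays data. The data set and the name r2_dev are assumptions for illustration; the same two deviances are available for other glm-type fits that store a null deviance.

```r
# Poisson GLM with a single categorical predictor
fit <- glm(count ~ spray, family = poisson, data = InsectSprays)

# 1 - residual deviance / null deviance
r2_dev <- 1 - deviance(fit) / fit$null.deviance
r2_dev
```

Because the intercept-only model is nested in the fitted model, the residual deviance cannot exceed the null deviance, so the value lies between 0 and 1.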
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
As an alternative approach to this problem, I have several times used the following procedure:
Compute a fit on the data with the nls function.
Use the resulting model to make predictions.
Plot the data against the values predicted by the model (if the model is good, the points should lie near the bisector, i.e. the 1:1 line).
Compute the R2 of the linear regression between the two.
Best wishes to all. Patrick.
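The procedure above can be sketched as follows, on simulated exponential-decay data. The data, the model form and the names dat, fit and r2_obs_pred are all assumptions made for illustration.

```r
set.seed(42)

# Simulated data: exponential decay plus Gaussian noise
x <- seq(0, 10, length.out = 50)
y <- 5 * exp(-0.3 * x) + rnorm(length(x), sd = 0.2)
dat <- data.frame(x = x, y = y)

# Step 1: fit the nonlinear model
fit <- nls(y ~ a * exp(-b * x), data = dat, start = list(a = 4, b = 0.2))

# Step 2: predictions from the fitted model
dat$pred <- predict(fit)

# Step 3: observed against predicted, with the 1:1 line for reference
plot(dat$pred, dat$y, xlab = "Predicted", ylab = "Observed")
abline(0, 1, lty = 2)

# Step 4: R-squared of the linear regression of observed on predicted
r2_obs_pred <- summary(lm(y ~ pred, data = dat))$r.squared
r2_obs_pred
```

This value equals the squared correlation between observed and predicted values, so it ignores any systematic offset from the 1:1 line; the plot in step 3 is what reveals such an offset.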
With the modelr package
modelr::rsquare(nls_model, data)
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
nullfunct <- function(x, m){m}
null_model <- nls(mpg ~ nullfunct(wt, m),
data = mtcars,
start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
broom::glance(lm(mpg ~ predict(nls_model), data = mtcars)) # see the r.squared column
# 0.795
Like they say, it's only an approximation.
