How is the estimated degrees of freedom in a GAM determined? (R)

I was working on my GAM model:
y <- c(0.0000943615,0.0074918919,0.0157332851,0.0783308615,
0.1546375803,0.5558444681,0.8583806898,0.9617216854,
0.9848004112,0.9964662546)
x <- log(c(0.05, 0.1, 0.15, 0.2, 0.4, 0.8, 1.6, 3.2, 4.5, 6.4))
fit.gam <- mgcv::gam(y ~ s(x,k=-1, bs="cr"))
summary(fit.gam)
coef(fit.gam)
The model summary tells me that the edf of s(x) is 6.893 with p-value = 0.0017:
> summary(fit.gam)
Family: gaussian
Link function: identity
Formula:
y ~ s(x, k = -1, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46135 0.00629 73.34 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 6.893 7.902 585.7 0.0017 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.998 Deviance explained = 100%
GCV = 0.0018783 Scale est. = 0.0003957 n = 10
Yet the fitted model still contains 9 coefficients for s(x):
> coef(fit.gam)
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4 s(x).5 s(x).6
0.4613501 -0.3450787 -0.3229509 -0.2895761 -0.1783854 0.1976228 0.5040469
s(x).7 s(x).8 s(x).9
0.6135856 0.6338979 0.6470116
My question: I understand that the GAM penalizes the smooth of x to some extent, so that the estimated degrees of freedom of s(x) is 6.893 < 9; but from the coefficients of s(x) it is hard to tell which basis functions were penalized. How should I understand the relationship between the edf and the coefficients of s(x)? Thanks!
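For reference, a minimal way to inspect this on the fitted object (the smoothing penalty shrinks the wiggly directions of the basis jointly rather than zeroing out individual coefficients, so no single coefficient disappears):
fit.gam$edf        # per-coefficient effective degrees of freedom, intercept included
sum(fit.gam$edf)   # total edf: 1 for the intercept plus ~6.893 for s(x)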

Related

Difference between glm with sandwich package and glmrob for Poisson distribution

I'm having some doubts with glmrob (package robustbase). I want to use glmrob to get the same results as I get with glm + sandwich.
I was writing:
library(sandwich)
library(lmtest)  # provides coeftest()
p_3 <- glm(formula = var1 ~ var2,
           family = poisson(link = log),
           data = p3,
           na.action = na.omit)
coeftest(p_3, vcov = sandwich)
Both variables are categorical. var1 has two categories and var2 has four.
Now I'm trying to use glmrob to get everything in the same step:
library(robustbase)
p_2 <- glmrob(formula = var1 ~ var2,
              family = poisson(link = log),
              data = p3,
              na.action = na.omit,
              method = "Mqle",
              control = glmrobMqle.control(tcc = 1.2))
summary(p_2) and summary(p_3) don't yield the same results, so I think I need to change the two lines method = "Mqle" and control = glmrobMqle.control(tcc = 1.2), but I don't really know how.
Maybe I have to use method = "MT", since it works for Poisson models, but I'm not sure.
The outputs:
With glmrob:
summary(p_2)
Call: glmrob(formula = dummy28_n ~ coexistencia, family = poisson(link = log), data = p3, na.action = na.omit, method = "Mqle", control = glmrobMqle.control(tcc = 1.2))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7020 0.5947 -2.862 0.00421 **
coexistenciaPobreza energética 1.1627 0.9176 1.267 0.20514
coexistenciaInseguridad residencial 0.7671 0.6930 1.107 0.26830
coexistenciaCoexistencia de inseguridades 1.3688 0.6087 2.249 0.02453 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robustness weights w.r * w.x:
143 weights are ~= 1. The remaining 3 ones are
71 124 145
0.6266 0.6266 0.6266
Number of observations: 146
Fitted by method ‘Mqle’ (in 8 iterations)
(Dispersion parameter for poisson family taken to be 1)
No deviance values available
Algorithmic parameters:
     acc      tcc    maxit test.acc
  0.0001   1.2000       50   "coef"
With glm and sandwich:
coef <- coeftest(p_3, vcov = sandwich)
coef
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.79176 0.52705 -3.3996 0.0006748 ***
coexistenciaPobreza energética 1.09861 0.72648 1.5122 0.1304744
coexistenciaInseguridad residencial 0.69315 0.60093 1.1535 0.2487189
coexistenciaCoexistencia de inseguridades 1.32972 0.53259 2.4967 0.0125349 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
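A side note (my observation, not from the original thread): coeftest(p_3, vcov = sandwich) keeps the ordinary ML point estimates and only swaps in robust standard errors, whereas glmrob's Mqle estimator downweights observations and therefore changes the point estimates themselves, so the two outputs are not expected to match exactly. A sketch of how to see this, where the large tcc value is an illustrative assumption that effectively disables the downweighting:
library(robustbase)
## With a huge tuning constant almost no observation is downweighted, so the
## Mqle estimates should approach the classical glm estimates; the standard
## errors will still differ from the sandwich-based ones.
p_check <- glmrob(var1 ~ var2, family = poisson(link = log), data = p3,
                  na.action = na.omit, method = "Mqle",
                  control = glmrobMqle.control(tcc = 1e6))
summary(p_check)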

How do I get the minimum model for a quasipoisson GLM

I have run a quasi-Poisson GLM with the following code:
Output3 <- glm(GCN ~ DHSI + N + P, data = PondsTask2, family = quasipoisson(link = "log"))
and received this output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.69713 0.56293 -3.015 0.00272 **
DHSI 3.44795 0.74749 4.613 0.00000519 ***
N -0.59648 0.36357 -1.641 0.10157
P -0.01964 0.37419 -0.052 0.95816
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
DHSI is statistically significant, but the other two variables are not. How do I go about dropping variables until I have the minimum model?
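A minimal sketch of one standard route (my addition, reusing the model above): quasi-families have no likelihood or AIC, so step() does not apply, but drop1() with an F test supports backwards elimination.
## Drop the least significant term, refit, and repeat until every
## remaining term is significant.
drop1(Output3, test = "F")               # P has the largest p-value here
Output3b <- update(Output3, . ~ . - P)
drop1(Output3b, test = "F")              # then reconsider N, and so on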

Getting adjusted r-squared value for each line in a geom_smooth gam

I produced the below graph using ggplot2.
PlotEchi = ggplot(data = Echinoidea,
                  aes(x = Year, y = mean, group = aspect, linetype = aspect, shape = aspect)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = .025, position = pd) +
  geom_point(position = pd, size = 2) +
  geom_smooth(method = "gam", formula = y ~ s(x, k = 3), se = F, size = 0.5, colour = "black") +
  xlab("") +
  ylab("Abundance (mean +/- SE)") +
  facet_wrap(~ species, scales = "free", ncol = 1) +
  scale_y_continuous(limits = c(0, max(Echinoidea$mean + Echinoidea$se))) +
  scale_x_continuous(limits = c(min(Echinoidea$Year - 0.125), max(Echinoidea$Year + 0.125)))
What I would like to do is easily retrieve the adjusted R-squared for each of the fitted lines without fitting an individual mgcv::gam for each plotted line via model <- gam(df, formula = y ~ s(x1)....). Any ideas?
This is not really possible, because ggplot2 throws away the fitted object. You can see this in the source here.
1. Solving the problem by patching ggplot2
One ugly workaround is to patch the ggplot2 code on the fly to print out the results. You can do this as follows. The initial assignment throws an error, but things work anyway. To undo this, just restart your R session.
library(ggplot2)
# assignInNamespace patches `predictdf.glm` from ggplot2 and adds
# a line that prints the summary of the model. For some reason, this
# creates an error, but things work nonetheless.
assignInNamespace("predictdf.glm", function(model, xseq, se, level) {
  pred <- stats::predict(model, newdata = data.frame(x = xseq), se.fit = se,
                         type = "link")
  print(summary(model)) # this is the line I added
  if (se) {
    std <- stats::qnorm(level / 2 + 0.5)
    data.frame(
      x = xseq,
      y = model$family$linkinv(as.vector(pred$fit)),
      ymin = model$family$linkinv(as.vector(pred$fit - std * pred$se.fit)),
      ymax = model$family$linkinv(as.vector(pred$fit + std * pred$se.fit)),
      se = as.vector(pred$se.fit)
    )
  } else {
    data.frame(x = xseq, y = model$family$linkinv(as.vector(pred)))
  }
}, "ggplot2")
Now we can make a plot with the patched ggplot2:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(se = F, method = "gam", formula = y ~ s(x, bs = "cs"))
Console output:
Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "cs")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4280 0.0365 93.91 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 1.546 9 5.947 5.64e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.536 Deviance explained = 55.1%
GCV = 0.070196 Scale est. = 0.066622 n = 50
Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "cs")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.77000 0.03797 72.96 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 1.564 9 1.961 8.42e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.268 Deviance explained = 29.1%
GCV = 0.075969 Scale est. = 0.072074 n = 50
Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "cs")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.97400 0.04102 72.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 1.279 9 1.229 0.001 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.191 Deviance explained = 21.2%
GCV = 0.088147 Scale est. = 0.08413 n = 50
Note: I do not recommend this approach.
2. Solving the problem by fitting models via tidyverse
I think it's better to just run your models separately. Doing so is quite easy with tidyverse and broom, so I'm not sure why you wouldn't want to do it.
library(tidyverse)
library(broom)
iris %>%
  nest(-Species) %>%
  mutate(fit = map(data, ~ mgcv::gam(Sepal.Width ~ s(Sepal.Length, bs = "cs"), data = .)),
         results = map(fit, glance),
         R.square = map_dbl(fit, ~ summary(.)$r.sq)) %>%
  unnest(results) %>%
  select(-data, -fit)
# Species R.square df logLik AIC BIC deviance df.residual
# 1 setosa 0.5363514 2.546009 -1.922197 10.93641 17.71646 3.161460 47.45399
# 2 versicolor 0.2680611 2.563623 -3.879391 14.88603 21.69976 3.418909 47.43638
# 3 virginica 0.1910916 2.278569 -7.895997 22.34913 28.61783 4.014793 47.72143
As you can see, the extracted R-squared values are exactly the same as those printed by the patched ggplot2 above.
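For completeness, a base-R sketch of the same extraction (my addition, using the same iris example), with no tidyverse dependency:
## One gam per species; pull the adjusted R-squared from each summary.
sapply(split(iris, iris$Species), function(d)
  summary(mgcv::gam(Sepal.Width ~ s(Sepal.Length, bs = "cs"), data = d))$r.sq)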

R fGarch: presample matrix for garchSpec()

I am trying to specify a GARCH model with the function fGarch::garchSpec() and I need a specific presample. As defined in the manual:
presample: a numeric three column matrix with start values for the
series, for the innovations, and for the conditional
variances.
But I am pretty sure that this is not the correct order. After reading the manuals and code for the functions garchFit, garchSpec, and garchSim, I am still quite confused.
The question is: how exactly do I build the presample matrix?
You don't need to set the presample argument; garchSpec() will supply a "good" guess, and for estimating the parameters it will not matter. If you want to simulate data, I would just make sure the burn-in, n.start, is large enough.
Let's look at an example:
library(fGarch)
## First we simulate some data without setting presample:
# we set up the model by spec:
set.seed(911)
spec <- garchSpec(model = list(mu = 0.02, omega = 0.05, alpha = 0.2, beta = 0.75))
# then simulate our GARCH(1,1) model:
garchSim <- garchSim(spec, n = 200, n.start = 1)
plot(garchSim)
And estimates:
> garchFit(~ garch(1, 1), data = garchSim)
Error Analysis:
Estimate Std. Error t value Pr(>|t|)
mu -0.02196 0.05800 -0.379 0.7049
omega 0.03567 0.02681 1.331 0.1833
alpha1 0.12074 0.04952 2.438 0.0148 *
beta1 0.84527 0.05597 15.103 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log Likelihood:
-265.8417 normalized: -1.329209
Let us now try to add a very extreme presample. In the above model (and with this seed) the presample was:
> spec@presample
Presample:
time z h y
1 0 -0.4324072 1 0.02
Now we replace it with c(100, 0.1, 0.1). Since my model is a GARCH(1,1) without any ARMA part, I only need to set 3 values, as described in the documentation ?garchSpec. After updating spec, we estimate the same model:
set.seed(911)
spec@presample <- matrix(c(100, 0.1, 0.1), ncol = 3)
garchFit(~ garch(1, 1), data = garchSim)
with the same output:
Error Analysis:
Estimate Std. Error t value Pr(>|t|)
mu -0.02196 0.05800 -0.379 0.7049
omega 0.03567 0.02681 1.331 0.1833
alpha1 0.12074 0.04952 2.438 0.0148 *
beta1 0.84527 0.05597 15.103 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log Likelihood:
-265.8417 normalized: -1.329209
The likelihood and estimates are identical, but notice when we simulate with the new spec:
set.seed(911)
garchSim <- garchSim(spec, n = 200, n.start = 1)
plot(garchSim)
The extreme initial sample we supplied has messed up our nice simulation. But by increasing the burn-in, n.start, we get:
set.seed(911)
garchSim <- garchSim(spec, n = 200, n.start = 100)
plot(garchSim)
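One further observation (my reading of the printed presample above, not an authoritative statement of the API): the columns appear in the order z (innovations), h (conditional variances), y (series), which does not match the order the manual describes. A hedged sketch of building a matrix to mirror that observed layout:
## Assumption: same column order as the printed default presample (z, h, y).
spec@presample <- matrix(c(0, 1, 0.02), ncol = 3)  # z = 0, h = 1, y = mu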

Variable Selection with mgcv

Is there a way of automating variable selection for a GAM in R, similar to step? I've read the documentation of step.gam and gam.selection, but I've yet to see an answer with code that works. Additionally, I've tried method = "REML" and select = TRUE, but neither removes insignificant variables from the model.
I've theorized that I could create a step model and then use those variables to create the GAM, but that does not seem computationally efficient.
Example:
library(mgcv)
set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1),
                  pred1 = rnorm(100, 10, 1),
                  pred2 = rnorm(100, 0, 1),
                  pred3 = rnorm(100, 0, 1),
                  pred4 = rnorm(100, 0, 1))
model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)
summary(model)
#Family: gaussian
#Link function: identity
#Formula:
#rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4)
#Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.02267 0.08426 0.269 0.788
#Approximate significance of smooth terms:
# edf Ref.df F p-value
#s(pred1) 0.8770 9 0.212 0.1174
#s(pred2) 1.8613 9 0.638 0.0374 *
#s(pred3) 0.5439 9 0.133 0.1406
#s(pred4) 0.4504 9 0.091 0.1775
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#R-sq.(adj) = 0.0887 Deviance explained = 12.3%
#-REML = 129.06 Scale est. = 0.70996 n = 100
Marra and Wood (2011, Computational Statistics and Data Analysis 55, 2372-2387) compare various approaches to feature selection in GAMs. They concluded that an additional penalty term in the smoothness selection procedure gave the best results. In mgcv::gam() this is activated with the select = TRUE argument; alternatively, shrinkage smoothers (bs = "ts", the shrinkage version of the thin-plate basis, or bs = "cs", the shrinkage version of the cubic spline) build a comparable penalty directly into the basis:
model <- gam(rsp ~ s(pred1, bs = "ts") + s(pred2, bs = "ts") + s(pred3, bs = "ts") + s(pred4, bs = "ts"),
             data = dat, method = "REML")
model <- gam(rsp ~ s(pred1, bs = "cr") + s(pred2, bs = "cr") + s(pred3, bs = "cr") + s(pred4, bs = "cr"),
             data = dat, method = "REML", select = TRUE)
(Note that non-shrinkage bases such as bs = "cc" or bs = "tp" on their own, without select = TRUE, do not perform any selection.)
In addition to specifying select = TRUE in your call to function gam, you can increase the value of argument gamma to get stronger penalization. For example, we generate some data:
library("mgcv")
set.seed(2)
dat <- gamSim(1, n=400, dist="normal", scale=5)
## Gu & Wahba 4 term additive model
We fit a GAM with 'standard' penalization and variable selection:
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML")
summary(b)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.890 0.246 32.07 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x0) 1.363 1.640 0.804 0.3174
## s(x1) 1.681 2.088 11.309 1.35e-05 ***
## s(x2) 5.931 7.086 16.240 < 2e-16 ***
## s(x3) 1.002 1.004 4.102 0.0435 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) = 0.253 Deviance explained = 27.1%
## -REML = 1212.5 Scale est. = 24.206 n = 400
par(mfrow = c(2, 2))
plot(b)
We fit a GAM with stronger penalization and variable selection:
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML", select = TRUE, gamma = 7)
## summary(b2)
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.8898 0.2604 30.3 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x0) 5.330e-05 9 0.000 0.1868
## s(x1) 5.427e-01 9 0.967 7.4e-05 ***
## s(x2) 1.549e+00 9 6.210 < 2e-16 ***
## s(x3) 6.155e-05 9 0.000 0.0812 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) = 0.163 Deviance explained = 16.7%
## -REML = 179.46 Scale est. = 27.115 n = 400
plot(b2)
According to the documentation, increasing the value of gamma produces smoother models, because it multiplies the effective degrees of freedom in the GCV or UBRE/AIC criterion.
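Concretely (a standard form of the criterion, stated here for context rather than quoted from the thread), the GCV score that mgcv minimizes is

    V_g = n * D / (n - gamma * tau)^2

where D is the model deviance, tau the effective degrees of freedom, and gamma the gamma argument; inflating gamma makes each effective degree of freedom cost more, so smoother fits win.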
A possible downside is thus that all non-linear effects will be shrunken towards linear effects, and all linear effects towards zero. This is what we observe in the plots and output above: with a higher value of gamma, some effects are practically penalized out (edf values close to 0, F values of 0), while the other effects are closer to linear (edf values closer to 1).
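As a follow-up sketch (my addition, reusing the objects above), the terms that select = TRUE has penalized away can be identified programmatically from the smooth-term table and then dropped for a final refit:
## summary.gam stores the smooth-term table; near-zero edf marks terms the
## double penalty has effectively removed.
st <- summary(b2)$s.table
rownames(st)[st[, "edf"] < 0.01]             # here: s(x0) and s(x3)
b3 <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML")  # reduced refit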
