Is there a way of automating variable selection of a GAM in R, similar to step? I've read the documentation of step.gam and selection.gam, but I've yet to see an answer with code that works. Additionally, I've tried method = "REML" and select = TRUE, but neither removes insignificant variables from the model.
I've considered creating a step model first and then using those variables to create the GAM, but that does not seem computationally efficient.
Example:
library(mgcv)
set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1),
pred1 = rnorm(100, 10, 1),
pred2 = rnorm(100, 0, 1),
pred3 = rnorm(100, 0, 1),
pred4 = rnorm(100, 0, 1))
model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
data = dat, method = "REML", select = TRUE)
summary(model)
#Family: gaussian
#Link function: identity
#Formula:
#rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4)
#Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.02267 0.08426 0.269 0.788
#Approximate significance of smooth terms:
# edf Ref.df F p-value
#s(pred1) 0.8770 9 0.212 0.1174
#s(pred2) 1.8613 9 0.638 0.0374 *
#s(pred3) 0.5439 9 0.133 0.1406
#s(pred4) 0.4504 9 0.091 0.1775
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#R-sq.(adj) = 0.0887 Deviance explained = 12.3%
#-REML = 129.06 Scale est. = 0.70996 n = 100
Marra and Wood (2011, Computational Statistics and Data Analysis 55: 2372-2387) compared various approaches to feature selection in GAMs and concluded that an additional penalty term in the smoothness selection procedure gave the best results. In mgcv::gam() this penalty is activated with the select = TRUE argument; alternatively, the shrinkage smoothers bs = "ts" and bs = "cs" build a similar penalty into the basis itself, so they perform selection even without select = TRUE. Any of the following variations therefore works:
model <- gam(rsp ~ s(pred1,bs="ts") + s(pred2,bs="ts") + s(pred3,bs="ts") + s(pred4,bs="ts"), data = dat, method = "REML")
model <- gam(rsp ~ s(pred1,bs="cr") + s(pred2,bs="cr") + s(pred3,bs="cr") + s(pred4,bs="cr"),
data = dat, method = "REML",select=T)
model <- gam(rsp ~ s(pred1,bs="cc") + s(pred2,bs="cc") + s(pred3,bs="cc") + s(pred4,bs="cc"),
data = dat, method = "REML")
model <- gam(rsp ~ s(pred1,bs="tp") + s(pred2,bs="tp") + s(pred3,bs="tp") + s(pred4,bs="tp"), data = dat, method = "REML")
In addition to specifying select = TRUE in your call to gam(), you can increase the value of the gamma argument to get stronger penalization. For example, we generate some data:
library("mgcv")
set.seed(2)
dat <- gamSim(1, n=400, dist="normal", scale=5)
## Gu & Wahba 4 term additive model
We fit a GAM with 'standard' penalization and no extra selection penalty:
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML")
summary(b)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.890 0.246 32.07 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x0) 1.363 1.640 0.804 0.3174
## s(x1) 1.681 2.088 11.309 1.35e-05 ***
## s(x2) 5.931 7.086 16.240 < 2e-16 ***
## s(x3) 1.002 1.004 4.102 0.0435 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) = 0.253 Deviance explained = 27.1%
## -REML = 1212.5 Scale est. = 24.206 n = 400
par(mfrow = c(2, 2))
plot(b)
We fit a GAM with stronger penalization and variable selection:
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML", select = TRUE, gamma = 7)
summary(b2)
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.8898 0.2604 30.3 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x0) 5.330e-05 9 0.000 0.1868
## s(x1) 5.427e-01 9 0.967 7.4e-05 ***
## s(x2) 1.549e+00 9 6.210 < 2e-16 ***
## s(x3) 6.155e-05 9 0.000 0.0812 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) = 0.163 Deviance explained = 16.7%
## -REML = 179.46 Scale est. = 27.115 n = 400
plot(b2)
According to the documentation, increasing the value of gamma produces smoother models, because it multiplies the effective degrees of freedom in the GCV or UBRE/AIC criterion.
A possible downside is that all non-linear effects will be shrunk towards linear effects, and all linear effects towards zero. This is what we observe in the plots and output above: with a higher value of gamma, some effects are practically penalized out (edf values close to 0, F-values of 0), while the remaining effects are closer to linear (edf values closer to 1).
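If you want to see at a glance which terms the penalty has effectively removed, you can inspect the per-term effective degrees of freedom. The helper below is a small sketch of my own (not an mgcv function); it sums the per-coefficient edf stored in the fitted object for each smooth and reports the terms below a chosen cutoff:
## Sketch: list smooth terms whose effective degrees of freedom fall below
## a cutoff, i.e. terms the selection penalty has shrunk to ~zero.
edf_by_term <- function(m, cutoff = 0.1) {
  edf <- vapply(m$smooth,
                function(s) sum(m$edf[s$first.para:s$last.para]),
                numeric(1))
  names(edf) <- vapply(m$smooth, function(s) s$label, character(1))
  edf[edf < cutoff]
}
edf_by_term(b2)  # for the fit above: s(x0) and s(x3) are effectively removed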
Related
I am trying to get the nonlinear formula of y0, a, and b for my curve so I can plot it on my graph. The summary(nls.mod) output does not show the y0 that I need to plot the curve, and I am not sure why, as I have tried everything. The code is below:
# BH this version of plot is used for diagnostic plots for
# BH residuals of a linear model, i.e. using lm.
plot(mdl3 <- lm(ETR ~ wp_Mpa + I(wp_Mpa^2) + I(wp_Mpa^3), data = dat3))
prd <- data.frame(wp_Mpa = seq(-4, 0, by = 0.5))  # prediction grid over the predictor
result <- prd
result$mdl3 <- predict(mdl3, newdata = prd)       # 'newdata', not 'newdat3'
# BH use nls to fit this model y0+a*exp(b*x)
nls.mod <- nls(ETR ~ y0 + a*exp(b*wp_Mpa), start=c(a = -4, b = 0), data=dat3.no_na)
summary(nls.mod)
and here is the output:
Formula: ETR ~ y0 + a * exp(b * wp_Mpa)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 85.85515 8.62005 9.960 <2e-16 ***
b 0.14157 0.07444 1.902 0.0593 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 58.49 on 134 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: 1.515e-06
As you can see, for some reason only a and b show up, but y0 is supposed to appear above them.
I tried reassigning the variables, and that just gave me the same output. I contacted a stats consultant, and they said I needed to change the variables, but it still didn't work.
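For what it's worth, a likely explanation (my reading of ?nls, not confirmed against the poster's data): nls only estimates the parameters that are given starting values; any other name in the formula must be a column of the data. Adding y0 to start should make it appear in the summary:
## Sketch: give y0 a starting value so nls treats it as a parameter to
## estimate rather than looking it up in the data; y0 = 0 is just a guess.
nls.mod2 <- nls(ETR ~ y0 + a * exp(b * wp_Mpa),
                start = c(y0 = 0, a = -4, b = 0),
                data = dat3.no_na)
summary(nls.mod2)  # should now report y0, a and b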
I'm having some doubts about glmrob (package robustbase). I want to use glmrob to get the same results I was getting with glm + sandwich.
I was writing:
library(lmtest)     # for coeftest()
library(sandwich)   # for the sandwich variance estimator
p_3 <- glm(formula = var1 ~ var2,
           family = poisson(link = log),
           data = p3,
           na.action = na.omit)
coeftest(p_3, vcov = sandwich)
Both variables are categorical. var1 has two categories and var2 has four.
Now I'm trying to use glmrob to get everything in the same step:
p_2 <- glmrob(formula = var1~ var2,
family = poisson (link=log),
data = p3,
na.action = na.omit,
method= "Mqle",
control = glmrobMqle.control(tcc= 1.2)
)
summary(p_2) and summary(p_3) don't yield the same results, so I think I need to make some changes to these two lines: method = "Mqle", control = glmrobMqle.control(tcc = 1.2). But I don't really know which ones.
Maybe I have to use method = "MT", as it works for Poisson models, but I'm not sure.
The outputs:
With glmrob:
summary(p_2)
Call: glmrob(formula = dummy28_n ~ coexistencia, family = poisson(link = log), data = p3, na.action = na.omit, method = "Mqle", control = glmrobMqle.control(tcc = 1.2))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7020 0.5947 -2.862 0.00421 **
coexistenciaPobreza energética 1.1627 0.9176 1.267 0.20514
coexistenciaInseguridad residencial 0.7671 0.6930 1.107 0.26830
coexistenciaCoexistencia de inseguridades 1.3688 0.6087 2.249 0.02453 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robustness weights w.r * w.x:
143 weights are ~= 1. The remaining 3 ones are
71 124 145
0.6266 0.6266 0.6266
Number of observations: 146
Fitted by method ‘Mqle’ (in 8 iterations)
(Dispersion parameter for poisson family taken to be 1)
No deviance values available
Algorithmic parameters:
acc tcc
0.0001 1.2000
maxit
50
test.acc
"coef"
with glm and sandwich:
coef <- coeftest(p_3, vcov = sandwich)
coef
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.79176 0.52705 -3.3996 0.0006748 ***
coexistenciaPobreza energética 1.09861 0.72648 1.5122 0.1304744
coexistenciaInseguridad residencial 0.69315 0.60093 1.1535 0.2487189
coexistenciaCoexistencia de inseguridades 1.32972 0.53259 2.4967 0.0125349 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
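For completeness, this is roughly what the method = "MT" call mentioned above would look like (a sketch only; I have not verified that it brings the estimates any closer to the sandwich results):
## Sketch of glmrob with the MT estimator (see ?robustbase::glmrob):
p_2mt <- glmrob(var1 ~ var2,
                family = poisson(link = log),
                data = p3,
                na.action = na.omit,
                method = "MT")
summary(p_2mt)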
Let's consider this code as an example:
a = 10
b = 2
c = 1.05
set.seed(123,kind="Mersenne-Twister",normal.kind="Inversion")
x = runif(100)
data = data.frame(X = x, Y = a + b/c * (((1-x)^-c)-1))
fit_sp <- minpack.lm::nlsLM( formula = Y ~ a + b/c * (((1-X)^-c)-1),
data = data, start = c(a = 0, b = 0.1, c = 0.01),
control = nls.control(maxiter = 1000),
lower = c(0.0001,0.0001,0.0001))
fit_sp
Nonlinear regression model
model: Y ~ a + b/c * (((1 - X)^-c) - 1)
data: data
a b c
10.00 2.00 1.05
residual sum-of-squares: 1.507e-26
Number of iterations to convergence: 13
Achieved convergence tolerance: 1.49e-08
summary(fit_sp)
Formula: Y ~ a + b/c * (((1 - X)^-c) - 1)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.000e+01 1.516e-15 6.598e+15 <2e-16 ***
b 2.000e+00 4.372e-16 4.574e+15 <2e-16 ***
c 1.050e+00 5.319e-17 1.974e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.246e-14 on 97 degrees of freedom
Number of iterations to convergence: 13
Achieved convergence tolerance: 1.49e-08
I know that nonlinear least squares finds the coefficients that minimize the sum of squared residuals. But how can I obtain the standard errors and t-statistics for the parameter estimates by hand?
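To my understanding, summary() uses the standard linearization of the model at the solution: with Jacobian J, the residual variance is estimated as sigma^2 = RSS/(n - p), the covariance of the estimates as sigma^2 * (J'J)^-1, and each t value is the estimate divided by its standard error. A minimal by-hand sketch for the fit above (assuming the internal Jacobian is accessible via fit_sp$m$gradient(), as for nls objects):
## Reproduce summary()'s coefficient table by hand.
J <- fit_sp$m$gradient()        # Jacobian of the model at the solution
r <- residuals(fit_sp)
n <- length(r); p <- ncol(J)
sigma2 <- sum(r^2) / (n - p)    # residual variance estimate
covmat <- sigma2 * solve(crossprod(J))
se   <- sqrt(diag(covmat))      # standard errors
tval <- coef(fit_sp) / se       # t statistics
pval <- 2 * pt(abs(tval), df = n - p, lower.tail = FALSE)
cbind(Estimate = coef(fit_sp), `Std. Error` = se,
      `t value` = tval, `Pr(>|t|)` = pval)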
I would like to know if my response variable (y) is significantly affected by a categorical variable ("x") with two levels ("high" and "low").
This response variable shows spatial autocorrelation, which is represented by numerical spatial eigenvectors named "MEM2" and "MEM5". I'm trying to account for these spatial eigenvectors in a gamm fit, but I don't know where they belong in the formula.
I've tried two ways:
(1) First:
m2 <- gamm(y ~ x, random=list(MEM2=~1, MEM5=~1), family=poisson )
summary(m2)
summary(m2$gam)
The output
Family: poisson
Link function: log
Formula:
y ~ x
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4574 0.2029 2.254 0.0300 *
xbaixa -0.6330 0.3338 -1.897 0.0655 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0625
Scale est. = 1 n = 40
(2) Second:
m2 <- gamm(y ~ x +s(MEM2) + s(MEM5),
random=list(MEM2=~1, MEM5=~1), family=poisson )
summary(m2)
summary(m2$gam)
The output:
Family: poisson
Link function: log
Formula:
y ~ x + s(MEM2) + s(MEM5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3434 0.2275 1.509 0.140
xbaixa -0.4438 0.3689 -1.203 0.237
Approximate significance of smooth terms:
edf Ref.df F p-value
s(MEM2) 1 1 1.343 0.254
s(MEM5) 1 1 0.994 0.325
R-sq.(adj) = 0.0567
Scale est. = 1 n = 40
Does anyone know if either of these is right? And which one should I use?
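For context, here is a sketch of a third specification worth considering (my suggestion, not from the post): spatial eigenvectors are continuous covariates, while the list-style random argument of gamm expects grouping factors, so MEMs are more conventionally entered as fixed (possibly smooth) terms, in which case plain gam() suffices:
## Sketch: enter the spatial eigenvectors as smooth covariates only,
## with no `random` argument (there is no true grouping structure here).
m3 <- gam(y ~ x + s(MEM2) + s(MEM5), family = poisson, method = "REML")
summary(m3)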
I produced the graph below using ggplot2.
# Note: `pd` is a position-adjustment object (e.g. position_dodge) defined elsewhere.
PlotEchi = ggplot(data=Echinoidea,
aes(x=Year, y=mean, group = aspect, linetype = aspect, shape=aspect)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.025, position=pd) +
geom_point(position=pd, size=2) +
geom_smooth(method = "gam", formula = y~s(x, k=3), se=F, size = 0.5,colour="black") +
xlab("") +
ylab("Abundance (mean +/- SE)") +
facet_wrap(~ species, scales = "free", ncol=1) +
scale_y_continuous(limits=c(0, max(Echinoidea$mean+Echinoidea$se))) +
scale_x_continuous(limits=c(min(Echinoidea$Year-0.125), max(Echinoidea$Year+0.125)))
What I would like to do is easily retrieve the adjusted R-square for each of the fitted lines without doing an individual mgcv::gam for each plotted line using model<-gam(df, formula = y~s(x1)....). Any ideas?
This is not really possible, because ggplot2 throws away the fitted model object. You can see this in the ggplot2 source for predictdf.glm.
1. Solving the problem by patching ggplot2
One ugly workaround is to patch the ggplot2 code on the fly so that it prints out the results. You can do this as follows. The initial assignment throws an error, but things work anyway. To undo the patch, just restart your R session.
library(ggplot2)
# assignInNamespace patches `predictdf.glm` from ggplot2 and adds
# a line that prints the summary of the model. For some reason, this
# creates an error, but things work nonetheless.
assignInNamespace("predictdf.glm", function(model, xseq, se, level) {
pred <- stats::predict(model, newdata = data.frame(x = xseq), se.fit = se,
type = "link")
print(summary(model)) # this is the line I added
if (se) {
std <- stats::qnorm(level / 2 + 0.5)
data.frame(
x = xseq,
y = model$family$linkinv(as.vector(pred$fit)),
ymin = model$family$linkinv(as.vector(pred$fit - std * pred$se.fit)),
ymax = model$family$linkinv(as.vector(pred$fit + std * pred$se.fit)),
se = as.vector(pred$se.fit)
)
} else {
data.frame(x = xseq, y = model$family$linkinv(as.vector(pred)))
}
}, "ggplot2")
Now we can make a plot with the patched ggplot2:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() + geom_smooth(se = F, method = "gam", formula = y ~ s(x, bs = "cs"))
Console output:
Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "cs")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4280 0.0365 93.91 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 1.546 9 5.947 5.64e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.536 Deviance explained = 55.1%
GCV = 0.070196 Scale est. = 0.066622 n = 50
Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "cs")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.77000 0.03797 72.96 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 1.564 9 1.961 8.42e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.268 Deviance explained = 29.1%
GCV = 0.075969 Scale est. = 0.072074 n = 50
Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "cs")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.97400 0.04102 72.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 1.279 9 1.229 0.001 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.191 Deviance explained = 21.2%
GCV = 0.088147 Scale est. = 0.08413 n = 50
Note: I do not recommend this approach.
2. Solving the problem by fitting models via tidyverse
I think it's better to just run your models separately. Doing so is quite easy with tidyverse and broom, so I'm not sure why you wouldn't want to do it.
library(tidyverse)
library(broom)
iris %>% nest(data = -Species) %>%
mutate(fit = map(data, ~mgcv::gam(Sepal.Width ~ s(Sepal.Length, bs = "cs"), data = .)),
results = map(fit, glance),
R.square = map_dbl(fit, ~ summary(.)$r.sq)) %>%
unnest(results) %>%
select(-data, -fit)
# Species R.square df logLik AIC BIC deviance df.residual
# 1 setosa 0.5363514 2.546009 -1.922197 10.93641 17.71646 3.161460 47.45399
# 2 versicolor 0.2680611 2.563623 -3.879391 14.88603 21.69976 3.418909 47.43638
# 3 virginica 0.1910916 2.278569 -7.895997 22.34913 28.61783 4.014793 47.72143
As you can see, the extracted R-squared values are exactly the same in both cases.