Visualize Generalized Additive Model (GAM) in R

I want to produce a GAM plot that looks like the one in this answer: https://stats.stackexchange.com/questions/179947/statistical-differences-between-two-hourly-patterns/446048#446048
How can I accomplish this?
The model is
model = gam(y ~ s(t) + g, data = d)

The general way to do this is to compute model estimates (fitted values) over the range of the covariate(s) of interest for each group. The reproducible example below illustrates one way to do this, using {mgcv} to fit the GAM and my {gratia} package for some helper functions that facilitate the process.
library("gratia")
library("mgcv")
library("ggplot2")
eg_data <- data_sim("eg4", n = 400, dist = "normal", scale = 2, seed = 1)
m <- gam(y ~ s(x2) + fac, data = eg_data, method = "REML")
ds <- data_slice(m, x2 = evenly(x2, n = 100), fac = evenly(fac))
fv <- fitted_values(m, data = ds)
The last line gets you fitted values from the model at the covariate combinations specified in the data slice:
> fv
# A tibble: 300 × 6
        x2 fac   fitted    se   lower   upper
     <dbl> <fct>  <dbl> <dbl>   <dbl>   <dbl>
 1 0.00131 1     -1.05  0.559 -2.15    0.0412
 2 0.00131 2     -3.35  0.563 -4.45   -2.25
 3 0.00131 3      1.13  0.557  0.0395  2.22
 4 0.0114  1     -0.849 0.515 -1.86    0.160
 5 0.0114  2     -3.14  0.519 -4.16   -2.13
 6 0.0114  3      1.34  0.513  0.332   2.34
 7 0.0215  1     -0.642 0.474 -1.57    0.287
 8 0.0215  2     -2.94  0.480 -3.88   -2.00
 9 0.0215  3      1.54  0.473  0.616   2.47
10 0.0316  1     -0.437 0.439 -1.30    0.424
# … with 290 more rows
# ℹ Use `print(n = ...)` to see more rows
This object is in a form suitable for plotting with ggplot():
fv |>
  ggplot(aes(x = x2, y = fitted, colour = fac)) +
  geom_point(data = eg_data, mapping = aes(y = y), size = 0.5) +
  geom_ribbon(aes(x = x2, ymin = lower, ymax = upper, fill = fac, colour = NULL),
              alpha = 0.2) +
  geom_line()
which produces a plot of the fitted curves with their pointwise confidence ribbons.
You can enhance and/or modify this using your ggplot skills.
The basic point with this model is that you have a common smooth effect of a covariate (here x2) plus group means (for the factor fac). Hence the curves are "parallel".
Note that there's a lot of variation around the estimated curves in this model because the simulated data are from a richer model with group-specific smooths and smooth effects of other covariates.
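If the parallel-curves assumption is too restrictive, mgcv also allows group-specific smooths via a by-factor smooth. The sketch below is my own illustration on made-up data, not part of the answer above; it uses only mgcv (a recommended package that ships with R):

```r
library("mgcv")

set.seed(1)
# made-up data: three groups, each with its own trend in x2
d <- data.frame(x2  = runif(300),
                fac = factor(rep(1:3, each = 100)))
d$y <- sin(2 * pi * d$x2) * as.numeric(d$fac) +
  as.numeric(d$fac) + rnorm(300, sd = 0.3)

# group means plus a separate smooth of x2 for each level of fac,
# so the fitted curves are no longer constrained to be parallel
m_by <- gam(y ~ fac + s(x2, by = fac), data = d, method = "REML")
length(m_by$smooth)  # 3: one smooth per factor level
```

The data_slice()/fitted_values() plotting workflow from the answer applies unchanged to such a model.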

gg.bs30 <- ggplot(data, aes(x = Predictor, y = Output, col = class)) +
  geom_point() +
  geom_smooth(method = 'gam', formula = y ~ splines::bs(x, 30)) +
  facet_grid(class ~ .)
print(gg.bs30)
Code from -> https://github.com/mariocastro73/ML2020-2021/blob/master/scripts/gams-with-ggplot-classes.R

Related

Looping over grouped data using the nls function in R

I have a dataset grouped by GaugeID, and an nls model that I want to fit to each group, producing one output value per group.
library(tidyverse)
library(stats)
# sample of data (yearly): first column is the gauge (grouping variable), then year, then the two formula inputs PETvP and ETvP
# A tibble: 20 x 4
   GaugeID  WATERYR PETvP  ETvP
   <chr>      <dbl> <dbl> <dbl>
 1 06892000    1981 0.854 0.754
 2 06892000    1982 0.798 0.708
 3 06892000    1983 1.12  0.856
 4 06892000    1984 0.905 0.720
 5 06892000    1985 0.721 0.618
 6 06892000    1986 0.717 0.625
 7 06892000    1987 0.930 0.783
 8 06892000    1988 1.57  0.945
 9 06892000    1989 1.15  0.739
10 06892000    1990 0.933 0.805
11 08171300    1981 0.854 0.754
12 08171300    1982 0.798 0.708
13 08171300    1983 1.12  0.856
14 08171300    1984 0.905 0.720
15 08171300    1985 0.721 0.618
16 08171300    1986 0.717 0.625
17 08171300    1987 0.930 0.783
18 08171300    1988 1.57  0.945
19 08171300    1989 1.15  0.739
20 08171300    1990 0.933 0.805
# attempted for loop
for (i in unique(yearly$GaugeID)) {
  myValue = nls(ETvP[i] ~ I(1 + PETvP[i] - (1 + PETvP[i]^(w))^(1/w)), data = yearly,
                start = list(w = 2), trace = TRUE)
}
I get the following error:
Error in model.frame.default(formula = ~ETvP + i + PETvP, data = yearly) :
  variable lengths differ (found for 'i')
I haven't found much information on looping with the nls function. Essentially, I am fitting curves and need the curve parameter (w) output for each gauge.
It works if I assign the formula to just one gauge (if I subset the data, i.e for the first gauge), but not when I try to use it on the entire data frame with grouped data.
For example, this works
# gaugeA
# A tibble: 10 x 4
   GaugeID  WATERYR PETvP  ETvP
   <chr>      <dbl> <dbl> <dbl>
 1 06892000    1981 0.854 0.754
 2 06892000    1982 0.798 0.708
 3 06892000    1983 1.12  0.856
 4 06892000    1984 0.905 0.720
 5 06892000    1985 0.721 0.618
 6 06892000    1986 0.717 0.625
 7 06892000    1987 0.930 0.783
 8 06892000    1988 1.57  0.945
 9 06892000    1989 1.15  0.739
10 06892000    1990 0.933 0.805
test = nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)), data = gaugeA,
           start = list(w = 2), trace = TRUE)
1.574756 (4.26e+00): par = (2)
0.2649549 (1.46e+00): par = (2.875457)
0.09466832 (3.32e-01): par = (3.59986)
0.08543699 (2.53e-02): par = (3.881397)
0.08538308 (9.49e-05): par = (3.907099)
0.08538308 (1.13e-06): par = (3.907001)
> test
Nonlinear regression model
model: ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w))
data: gaugeA
w
3.907
residual sum-of-squares: 0.08538
Number of iterations to convergence: 5
Achieved convergence tolerance: 1.128e-06
Any ideas on how I can get the subset results for my entire grouped dataframe? It has over 600 different gauges in it. Thank you in advance.
Any of the following will work:
Using summarise:
df %>%
  group_by(GaugeID) %>%
  summarise(result = list(nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
                              data = cur_data(),
                              start = list(w = 2)))) %>%
  pull(result)
[[1]]
Nonlinear regression model
model: ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w))
data: cur_data()
w
3.607
residual sum-of-squares: 0.01694
Number of iterations to convergence: 5
Achieved convergence tolerance: 7.11e-08
[[2]]
Nonlinear regression model
model: ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w))
data: cur_data()
w
1.086
residual sum-of-squares: 0.1532
Number of iterations to convergence: 5
Achieved convergence tolerance: 2.685e-07
Using map:
df %>%
  group_split(GaugeID) %>%
  map(~ nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
            data = .x,
            start = list(w = 2)))
I usually prefer purrr and dplyr for looping functions over grouped data.
I can't test against your data, but maybe this works (note that the model must refer to the current group's data, not a single subset, and must be wrapped in list() inside summarise):
library(dplyr)
library(purrr)
yearly %>%
  group_by(GaugeID) %>%
  summarise(test = list(nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
                            data = cur_data(), start = list(w = 2), trace = TRUE)))
A single model can be formulated, eliminating loops: ensure that GaugeID is a factor, subscript w by GaugeID in the formula, and provide a starting-value list whose w component is a vector with one starting value per level of GaugeID.
df$GaugeID <- factor(df$GaugeID)
fo <- ETvP ~ 1 + PETvP - (1 + PETvP^(w[GaugeID]))^(1/w[GaugeID])
st <- list(w = rep(2, nlevels(df$GaugeID)))
fm <- nls(fo, df, start = st)
fm
summary(fm)
data.frame(GaugeID = levels(df$GaugeID), coef(summary(fm)), check.names = FALSE)
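A runnable sketch of this subscripted-parameter approach on simulated data (the gauge names and the "true" w values below are invented for illustration):

```r
set.seed(42)
# two fake gauges whose true w values are 2 and 4
sim <- data.frame(GaugeID = factor(rep(c("A", "B"), each = 25)),
                  PETvP   = runif(50, 0.5, 2))
w_true   <- c(2, 4)[as.integer(sim$GaugeID)]
sim$ETvP <- 1 + sim$PETvP - (1 + sim$PETvP^w_true)^(1 / w_true) +
  rnorm(50, sd = 0.01)

# subscripting w by the factor fits one w per gauge in a single nls() call
fo <- ETvP ~ 1 + PETvP - (1 + PETvP^(w[GaugeID]))^(1 / w[GaugeID])
st <- list(w = rep(2, nlevels(sim$GaugeID)))
fm <- nls(fo, sim, start = st)
coef(fm)  # two estimates, close to the true values 2 and 4
```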

How to add interaction terms in multinomial regression

I am using the mlogit function from the mlogit package to run a multinomial logit regression. I am not sure how to add interaction terms into my model. Here is a toy dataset and my attempt to add interactions:
library(mlogit)
data <- data.frame(y  = sample(1:3, 24, replace = TRUE),
                   x1 = c(rep(1, 12), rep(2, 12)),
                   x2 = rep(c(rep(1, 4), rep(2, 4), rep(3, 4)), 2),
                   x3 = rnorm(24),
                   z1 = sample(1:10, 24, replace = TRUE))
m0 <- mlogit(y ~ 0 | x1 + x2 + x3 + z1, shape = "wide", data = data)     # main effects only
m1 <- mlogit(y ~ 0 | (x1 + x2 + x3 + z1)^2, shape = "wide", data = data) # all possible 2-way interactions?
The output from summary(m1) shows:
Coefficients :
                Estimate Std. Error z-value Pr(>|z|)
(Intercept):2   86.41088  164.93831  0.5239   0.6003
(Intercept):3   62.43859  163.57346  0.3817   0.7027
x1:2           -32.27065   82.62474 -0.3906   0.6961
x1:3             0.24661   84.07429  0.0029   0.9977
x2:2           -75.09247   81.36496 -0.9229   0.3561
x2:3           -85.16452   81.40983 -1.0461   0.2955
x3:2           113.11778  119.15990  0.9493   0.3425
x3:3           112.77622  117.74567  0.9578   0.3382
z1:2            11.18665   22.32508  0.5011   0.6163
z1:3            13.15552   22.26441  0.5909   0.5546
x1:2            34.01298   39.66983  0.8574   0.3912
x1:3            32.19141   39.48373  0.8153   0.4149
x1:2           -53.86747   59.75696 -0.9014   0.3674
x1:3           -47.97693   59.09055 -0.8119   0.4168
x1:2            -6.98799   11.29920 -0.6185   0.5363
x1:3           -10.41574   11.52313 -0.9039   0.3660
x2:2             0.59185    6.68807  0.0885   0.9295
x2:3             2.63458    4.94419  0.5329   0.5941
x2:2             0.80945    2.03769  0.3972   0.6912
x2:3             2.60383    2.21878  1.1735   0.2406
x3:2            -0.64112    1.64678 -0.3893   0.6970
x3:3            -2.14289    1.98436 -1.0799   0.2802
It is not clear to me from the first column which specific interactions were output. Any pointers would be greatly appreciated!
This might be a clearer way to do it:
library(dplyr)
library(broom)
library(nnet)
multinom(formula = y ~ (x1 + x2 + x3 + z1)^2, data = data) %>%
  tidy()
# A tibble: 22 x 6
   y.level term        estimate std.error statistic p.value
   <chr>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 2       (Intercept)  -158.       247.    -0.640    0.522
 2 2       x1           -388.       247.    -1.57     0.116
 3 2       x2            -13.4      248.    -0.0543   0.957
 4 2       x3            120.       334.     0.360    0.719
 5 2       z1            173.       968.     0.179    0.858
 6 2       x1:x2         337.       248.     1.36     0.174
 7 2       x1:x3          40.2      334.     0.120    0.904
 8 2       x1:z1         -53.8      968.    -0.0555   0.956
 9 2       x2:x3        -137.      1018.    -0.135    0.893
10 2       x2:z1         -76.6      910.    -0.0841   0.933
# … with 12 more rows

Extracting lm stats into a table

I have made a graph that displays R², the p-value, and the equation from linear regressions in the top-left corner using stat_poly_eq().
Now I wish to have the stats from the linear regression extracted into a table.
For an example, in the mtcars dataset, if I want to do linear regression on plots of hp against disp for each cylinder group (e.g. 4, 6, 8) and then extract the linear regression stats into a table, how could I do that?
Thanks!
Here's the graph I have:
library(ggplot2)
library(ggpmisc)
formula <- y~x
ggplot(mtcars, aes(disp, hp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  theme_bw() +
  facet_wrap(~cyl, scales = "free") +
  stat_poly_eq(
    aes(label = paste(stat(adj.rr.label), stat(eq.label), stat(p.value.label), sep = "*\", \"*")),
    formula = formula, parse = TRUE, size = 3)
Do you mean something like this?
With nest_by(), split the remaining columns into a separate tibble for each cyl.
With summarise(), fit each lm. You need to wrap it in a list.
Operate on it like a normal list with map() and compute what you need: the coefficients (extractable with broom::tidy) and adj.r.squared (with summary(.)$adj.r.squared).
unnest() the result of broom::tidy to get a single tibble.
library(dplyr)
library(tidyr)
library(purrr)
mtcars %>%
  nest_by(cyl) %>%
  summarise(mdl = list(lm(hp ~ disp, data)), .groups = "drop") %>%
  mutate(adjrsquared = map_dbl(mdl, ~ summary(.)$adj.r.squared),
         mdl = map(mdl, broom::tidy)) %>%
  unnest(mdl)
#> # A tibble: 6 x 7
#>     cyl term        estimate std.error statistic p.value adjrsquared
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>       <dbl>
#> 1     4 (Intercept)  47.0      25.3        1.86  0.0960       0.0988
#> 2     4 disp          0.339     0.234      1.45  0.182        0.0988
#> 3     6 (Intercept) 177.       42.0        4.22  0.00829      0.117
#> 4     6 disp         -0.300     0.224     -1.34  0.238        0.117
#> 5     8 (Intercept) 178.       77.4        2.30  0.0405      -0.0682
#> 6     8 disp          0.0890    0.216      0.413 0.687       -0.0682
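If you prefer to avoid the tidyverse dependency, the same table can be sketched in base R with split() and lapply() (my own illustration; column names differ slightly from the tibble above):

```r
# fit hp ~ disp within each cylinder group of the built-in mtcars data
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(hp ~ disp, data = d))

# bind the per-group coefficient tables together, tagging each with its group
tab <- do.call(rbind, lapply(names(fits), function(cyl) {
  s <- summary(fits[[cyl]])
  data.frame(cyl  = cyl,
             term = rownames(coef(s)),
             coef(s),                      # Estimate, Std. Error, t value, p
             adjrsquared = s$adj.r.squared,
             row.names = NULL)
}))
tab  # 6 rows: intercept and slope for cyl = 4, 6, 8
```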

How to pipe the tidy-ed lm model CI's into ggplot2?

I have the following code that computes, for every year between 1961 and 2018, the effects of the two predictor variables, bases on balls per game (BB) and home runs per game (HR), on the response variable runs per game (R):
rm(list = ls())
library(dbplyr)
library(tidyverse)
library(broom)
library(Lahman)
fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(BB = BB / G,
         HR = HR / G,
         R = R / G) %>%
  group_by(yearID) %>%
  do(tidy(lm(R ~ BB + HR, data = .), conf.int = TRUE)) %>%
  filter(term == "BB")
fit
> fit
# A tibble: 58 x 8
# Groups:   yearID [58]
   yearID term  estimate std.error statistic p.value conf.low conf.high
    <int> <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1   1961 BB      0.0845     0.168     0.502   0.623  -0.274      0.443
 2   1962 BB      0.142      0.273     0.520   0.610  -0.434      0.718
 3   1963 BB      0.339      0.242     1.40    0.178  -0.171      0.849
 4   1964 BB     -0.105      0.302    -0.349   0.731  -0.742      0.532
 5   1965 BB      0.235      0.253     0.928   0.366  -0.299      0.768
 6   1966 BB      0.104      0.216     0.482   0.636  -0.351      0.559
 7   1967 BB      0.0660     0.223     0.296   0.771  -0.405      0.537
 8   1968 BB     -0.199      0.203    -0.983   0.340  -0.627      0.229
 9   1969 BB      0.153      0.163     0.942   0.357  -0.185      0.492
10   1970 BB      0.239      0.157     1.52    0.143  -0.0874     0.566
# ... with 48 more rows
I would now like to pass this "fit" (a tibble) to ggplot to show the per-year estimates as points, together with the regression line and the CIs computed by the lm model, rather than simply recomputing them with geom_smooth(method = "lm").
I have tried the following without success. I know that augment from broom should operate on the lm model output directly, so the code below is wrong, but it illustrates what I'm trying to achieve:
augment(fit) %>%
  ggplot() +
  geom_point(aes(yearID, estimate)) +
  geom_line(aes(yearID, .fitted), col = "blue")
How can I do that without "cheating" (fitting the lm twice, once myself and once again inside ggplot), as in:
fit %>% ggplot(aes(yearID,estimate)) + geom_point() + geom_smooth(method = "lm")
I took a similar route to Patrick, using map() and nest():
library(tidyverse)
library(broom)
library(Lahman)
library(magrittr)
fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(
    BB = BB / G,
    HR = HR / G,
    R = R / G
  ) %>%
  nest(data = -yearID) %>%
  mutate(
    model = map(data, ~ lm(R ~ BB + HR, .x)), # apply model to all nested groups
    m_tidy = map(model, tidy),                # tidy up
    est = map_dbl(m_tidy, ~ .x %>%            # pull BB estimate from each group
                    filter(term == "BB") %>%
                    pull(estimate))
  )
Now at this point you could pipe with %$% straight into the next portion, but I've kept them separate here to talk about mimicking the confidence interval properly. The geom_smooth() confidence interval is based on the t-distribution, not the normal distribution, so we have to do a bit of extra work to get our intervals to match:
fit %$%
  lm(est ~ yearID) %>%
  augment() %>%
  mutate(m.se.fit = .se.fit * qt(1 - (1 - 0.95) / 2, nrow(fit))) %>% # 95% conf int calc
  ggplot(aes(yearID, est)) +
  geom_point() +
  geom_line(aes(y = .fitted), col = "blue") +
  geom_ribbon(aes(ymin = .fitted - m.se.fit, ymax = .fitted + m.se.fit), alpha = .2)
This plot essentially mirrors the desired plot:
fit %>%
  ggplot(aes(yearID, est)) +
  geom_point() +
  geom_smooth(method = "lm")
Created on 2019-10-23 by the reprex package (v0.3.0)
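Essentially the same t-based band can also be obtained directly from base R's predict.lm() with interval = "confidence", which saves computing the half-width by hand. A sketch on made-up data standing in for the per-year BB estimates (not the Lahman fit itself):

```r
set.seed(1)
# made-up stand-in for the per-year BB estimates
d <- data.frame(yearID = 1961:2018)
d$est <- 0.3 + 0.002 * (d$yearID - 1961) + rnorm(nrow(d), sd = 0.1)

m    <- lm(est ~ yearID, data = d)
band <- predict(m, interval = "confidence", level = 0.95)  # fit, lwr, upr
d    <- cbind(d, band)
# d now has columns ready for geom_line(aes(y = fit)) and
# geom_ribbon(aes(ymin = lwr, ymax = upr))
head(d)
```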
You can try the map functions from the purrr package, which is included in the tidyverse. Possible code for the problem you describe is listed below. It should also be possible with lapply if you are not that familiar with purrr.
library(tidyverse)
library(broom)
library(Lahman)
fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(BB = BB / G,
         HR = HR / G,
         R = R / G) %>%
  group_by(yearID) %>%
  # consolidate your data
  nest() %>%
  # creates a new nested column with your regression data
  mutate(model = map(data, function(df)
           tidy(lm(R ~ BB + HR, data = df), conf.int = TRUE) %>%
             filter(term == "BB")
         ),
         # extract the column estimate
         model_est = map_dbl(model, function(df)
           df %>% pull(estimate)
         ),
         # extract the column conf.low
         model_conf.low = map_dbl(model, function(df)
           df %>% pull(conf.low)
         ),
         # extract the column conf.high
         model_conf.high = map_dbl(model, function(df)
           df %>% pull(conf.high)
         )
  )
fit %>%
  ggplot(aes(yearID, model_est)) +
  geom_point() +
  geom_line(aes(yearID, model_conf.low)) +
  geom_line(aes(yearID, model_conf.high))

How to build a summary table of glm's parameters and AICcWt

I have a dataset that I am using to build generalised linear models. The response variable is binary (absence/presence) and the explanatory variables are categorical.
CODE
library(tidyverse)
library(AICcmodavg)
# Data
set.seed(123)
t <- tibble(ID = 1:100,
            A = as.factor(sample(c(0, 1), 100, T)),
            B = as.factor(sample(c("black", "white"), 100, T)),
            C = as.factor(sample(c("pos", "neg", "either"), 100, T)))
# Candidate set of models - Binomial family because response variable
# is binary (0 for absent & 1 for present)
# Global model is A ~ B_black + C_either
m1 <- glm(A ~ 1, binomial, t)
m2 <- glm(A ~ B, binomial, t)
m3 <- glm(A ~ C, binomial, t)
m4 <- glm(A ~ B + C, binomial, t)
# List with all models
ms <- list(null = m1, m_B = m2, m_C = m3, m_BC = m4)
# Summary table
aic_tbl <- aictab(ms)
PROBLEM
I want to build a table that summarises the coefficients, standard errors, and Akaike weights of the models within my candidate set.
QUESTION
Can anyone suggest how to best build this table using my list of models and AIC table?
Just to point it out: broom gets you half-way there by turning the model output into a data frame, which you can then reshape.
library(broom)
bind_rows(lapply(ms, tidy), .id="key")
    key        term estimate std.error statistic p.value
1  null (Intercept)   -0.120     0.200    -0.600   0.549
2   m_B (Intercept)    0.000     0.283     0.000   1.000
3   m_B      Bwhite   -0.241     0.401    -0.601   0.548
4   m_C (Intercept)   -0.480     0.353    -1.359   0.174
5   m_C        Cneg    0.805     0.507     1.588   0.112
6   m_C        Cpos    0.308     0.490     0.629   0.530
7  m_BC (Intercept)   -0.363     0.399    -0.910   0.363
8  m_BC      Bwhite   -0.251     0.408    -0.615   0.538
9  m_BC        Cneg    0.811     0.508     1.597   0.110
10 m_BC        Cpos    0.327     0.492     0.665   0.506
And if you must insist of the layout of your table, I came up with the following (arguably clumsy) way of rearranging everything:
out <- bind_rows(lapply(ms, tidy), .id="mod")
t1 <- out %>% select(mod, term, estimate) %>% spread(term, estimate) %>% base::t
t2 <- out %>% select(mod, term, std.error) %>% spread(term, std.error) %>% base::t
rownames(t2) <- paste0(rownames(t2), "_std_e")
tmp <- rbind(t1, t2[-1,])
new_t <- as.data.frame(tmp[-1,])
colnames(new_t) <- tmp[1,]
new_t
Alternatively, you may want to familiarise yourself with packages that are meant to display model output for publication, e.g. texreg or stargazer come to mind:
library(texreg)
screenreg(ms)
======================================================
                 null       m_B       m_C       m_BC
------------------------------------------------------
(Intercept)     -0.12      0.00     -0.48     -0.36
                (0.20)    (0.28)    (0.35)    (0.40)
Bwhite                    -0.24               -0.25
                          (0.40)              (0.41)
Cneg                                 0.80      0.81
                                    (0.51)    (0.51)
Cpos                                 0.31      0.33
                                    (0.49)    (0.49)
------------------------------------------------------
AIC            140.27    141.91    141.66    143.28
BIC            142.87    147.12    149.48    153.70
Log Likelihood -69.13    -68.95    -67.83    -67.64
Deviance       138.27    137.91    135.66    135.28
Num. obs.      100       100       100       100
======================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
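On the Akaike-weight column itself: the AICcWt that aictab() reports follows the standard formula w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), where Δ_i is each model's information-criterion value minus the minimum across the set. A base-R sketch using plain AIC (so the numbers will differ slightly from aictab()'s small-sample-corrected AICc):

```r
set.seed(123)
# rebuild a small candidate set (binary response, one categorical predictor)
d <- data.frame(A = factor(sample(c(0, 1), 100, TRUE)),
                B = factor(sample(c("black", "white"), 100, TRUE)))
ms <- list(null = glm(A ~ 1, binomial, d),
           m_B  = glm(A ~ B, binomial, d))

aics  <- sapply(ms, AIC)
delta <- aics - min(aics)                     # delta_i = AIC_i - min(AIC)
wts   <- exp(-delta / 2) / sum(exp(-delta / 2))
round(wts, 3)  # model weights; they always sum to 1
```

Joining a named vector like this onto the tidy coefficient table by model name gives the weight column of the desired summary.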
