I have two GAMs which have the same predictor variables but different response (dependent) variables. I would like to combine the two GAMs into a set of plots where the smooth components (partial effects) of each predictor variable are in the same panel (differentiated by e.g. colour). Reproducible example:
# Required packages
require(mgcv)
require(mgcViz)
# Dataset
data("swiss")
# GAM models
fit1 <- mgcv::gam(Fertility ~ s(Examination) + s(Education), data = swiss)
fit2 <- mgcv::gam(Agriculture ~ s(Examination) + s(Education), data = swiss)
# Converting GAM objects to a gamViz objects
viz_fit1 <- mgcViz::getViz(fit1)
viz_fit2 <- mgcViz::getViz(fit2)
# Make plotGAM objects
trt_fit1 <- plot(viz_fit1, allTerms = T) + l_fitLine()
trt_fit2 <- plot(viz_fit2, allTerms = T) + l_fitLine()
# Print plots
print(trt_fit1, pages = 1)
print(trt_fit2, pages = 1)
Plot of fit1 looks like this:
And fit2 like this:
So I would like to combine the two Examination smooths into one panel and the two Education smooths into another, showing the response variable (from the different GAMs) with a different colour/linetype.
You could also do this using my {gratia} 📦 and the compare_smooths() function:
library("gratia")
library("mgcv")
# Dataset
data("swiss")
# GAM models
fit1 <- gam(Fertility ~ s(Examination) + s(Education),
data = swiss, method = "REML")
fit2 <- gam(Agriculture ~ s(Examination) + s(Education),
data = swiss, method = "REML")
# create an object that contains the info to compare smooths
comp <- compare_smooths(fit1, fit2)
# plot
draw(comp)
This produces
The output from compare_smooths() is a nested data frame (tibble)
r$> comp
# A tibble: 4 × 5
model smooth type by data
<chr> <chr> <chr> <chr> <list>
1 fit1 s(Education) TPRS NA <tibble [100 × 3]>
2 fit2 s(Education) TPRS NA <tibble [100 × 3]>
3 fit1 s(Examination) TPRS NA <tibble [100 × 3]>
4 fit2 s(Examination) TPRS NA <tibble [100 × 3]>
So if you want to do customising of the plot etc, you'll need to know how to work with nested data frames or just do
library("tidyr")
unnest(comp, data)
which gets you:
r$> unnest(comp, data)
# A tibble: 400 × 8
model smooth type by est se Education Examination
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 fit1 s(Education) TPRS NA 1.19 3.48 1 NA
2 fit1 s(Education) TPRS NA 1.37 3.20 1.53 NA
3 fit1 s(Education) TPRS NA 1.56 2.94 2.05 NA
4 fit1 s(Education) TPRS NA 1.75 2.70 2.58 NA
5 fit1 s(Education) TPRS NA 1.93 2.49 3.10 NA
6 fit1 s(Education) TPRS NA 2.11 2.29 3.63 NA
7 fit1 s(Education) TPRS NA 2.28 2.11 4.15 NA
8 fit1 s(Education) TPRS NA 2.44 1.95 4.68 NA
9 fit1 s(Education) TPRS NA 2.59 1.82 5.20 NA
10 fit1 s(Education) TPRS NA 2.72 1.71 5.73 NA
# … with 390 more rows
To create your own plots then, we proceed from the unnested data frames and add the confidence interval
ucomp <- unnest(comp, data) %>%
add_confint()
Then plot each panel in turn
library("ggplot2")
library("dplyr")
p_edu <- ucomp |>
filter(smooth == "s(Education)") |> # <-- only one comparison at a time
ggplot(aes(x = Education, y = est)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = model),
alpha = 0.2) +
geom_line(aes(colour = model)) +
scale_fill_brewer(palette = "Set1") + # <-- change fill scale
scale_colour_brewer(palette = "Set1") + # <-- change colour scale
geom_rug(data = swiss, # <-- rug
mapping = aes(x = Education, y = NULL),
sides = "b", alpha = 0.4) +
labs(title = "s(Education)", y = "Estimate",
colour = "Model", fill = "Model")
p_exam <- ucomp |>
filter(smooth == "s(Examination)") |>
ggplot(aes(x = Examination, y = est)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = model),
alpha = 0.2) +
geom_line(aes(colour = model)) +
scale_fill_brewer(palette = "Set1") + # <-- change fill scale
scale_colour_brewer(palette = "Set1") + # <-- change colour scale
geom_rug(data = swiss, # <-- rug
mapping = aes(x = Examination, y = NULL),
sides = "b", alpha = 0.4) +
labs(title = "s(Examination)", y = "Estimate",
colour = "Model", fill = "Model")
Now use the {patchwork} package to put the plots together
library("patchwork")
p_edu + p_exam + plot_layout(guides = "collect")
which produces
This is all {ggplot2}, so if you want more control over the colours you'll need to look at other scales (see ?scale_fill_manual, for example), or supply another ready-made discrete scale if you want to use an existing palette.
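As a minimal sketch of the manual-scale route: the data frame `toy` and the two hex colours below are made up for illustration and merely stand in for the unnested comparison data; only the scale_fill_manual() / scale_colour_manual() calls are the point.

```r
library("ggplot2")

# Hypothetical stand-in for the unnested comparison data (not real model output)
toy <- data.frame(
  model = rep(c("fit1", "fit2"), each = 5),
  x     = rep(1:5, times = 2),
  est   = c(1, 2, 3, 2, 1, 2, 2, 1, 0, -1)
)
toy$lower_ci <- toy$est - 0.5
toy$upper_ci <- toy$est + 0.5

# A named vector ties each colour to a model, regardless of factor order
model_cols <- c(fit1 = "#1b9e77", fit2 = "#d95f02")

p_toy <- ggplot(toy, aes(x = x, y = est)) +
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = model),
              alpha = 0.2) +
  geom_line(aes(colour = model)) +
  scale_fill_manual(values = model_cols) +     # same palette for ribbons...
  scale_colour_manual(values = model_cols) +   # ...and for lines
  labs(colour = "Model", fill = "Model")
p_toy
```

Using the same named vector for both scales keeps the ribbon and line of each model matched, and the shared legend title lets ggplot2 merge the two legends.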
I could make some of this easier in {gratia} - I could allow users to provide a scale to be used for the colour and fill, and also if they supply the raw data I could draw the rugs too.
If you want them in the same plot, you can pull the data from your fit with trt_fit1[["plots"]][[1]]$data$fit and plot them yourself. I looked at the plot style from the mgcViz github. You can add a second axis or scale as necessary.
library(tidyverse)
exam_dat <-
bind_rows(trt_fit1[["plots"]][[1]]$data$fit %>% mutate(fit = "Fit 1"),
trt_fit2[["plots"]][[1]]$data$fit %>% mutate(fit = "Fit 2"))
ggplot(data = exam_dat, aes(x = x, y = y, colour = fit)) +
geom_line() +
labs(x = "Examination", y = "s(Examination)") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
To simply place them side by side, you could use gridExtra, since each element of trt_fit1[["plots"]] and trt_fit2[["plots"]] carries a ggplot object (ggObj).
gridExtra::grid.arrange(
trt_fit1[["plots"]][[2]]$ggObj,
trt_fit2[["plots"]][[2]]$ggObj,
nrow = 1)
Created on 2022-02-18 by the reprex package (v2.0.1)
I'm using ggplot to visualize many linear regressions and facet them by groups. I'd like geom_smooth() to show the trend line as one color if P < 0.05, a different color if P < 0.10, and not show it at all if P ≥ 0.10.
I managed to do this using a loop to extract P-values from lm() for each regression, then join them with the data used for plotting. Then I add another column of color names to pass to aes(), determined conditionally from the P-values, and use scale_color_identity() to achieve my goal.
Here's an example:
library(tidyverse)
#make mtcars a tibble and cyl a factor, for convenience
mtcars1 <- as_tibble(mtcars) %>% dplyr::mutate(cyl = as.factor(cyl))
#initialize a list to store p-values from lm() for each level of factor
p.list <- vector(mode = "list", length = length(levels(mtcars1$cyl)))
names(p.list) <- levels(mtcars1$cyl)
#loop to calculate p-values for each level of mtcars$cyl
for(i in seq_along(levels(mtcars1$cyl))){
mtcars.sub <- mtcars1 %>% dplyr::filter(cyl == levels(.$cyl)[i])
lm.pval <- mtcars.sub %>%
dplyr::distinct(cyl) %>%
dplyr::mutate(P =
summary(lm(mpg ~ disp, data = mtcars.sub))$coefficients[2,4] ##extract P-value
)
p.list[[i]] <- lm.pval
}
#join p-values to dataset and add column to use with scale_color_identity()
mtcars.p <- mtcars1 %>% dplyr::left_join(dplyr::bind_rows(p.list, .id = "cyl"), by = "cyl") %>%
dplyr::mutate(p.color = ifelse(P < 0.05, "black",
ifelse(P < 0.10, "lightblue", NA)))
#plot
ggplot(data = mtcars.p, aes(x = disp, y = mpg)) +
geom_smooth(method = "lm",
se = FALSE,
aes(color = p.color)) +
geom_point() +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
This seems like too many initial steps for something that should be relatively easy. Are these steps necessary, or is there a more efficient way of doing this? Can ggplot or any other packages out there do this on their own, without having to first extract p-values from lm()?
After specifying your regression function, you can include the line function within ggplot:
myline<-lm(mpg ~ disp, data = mtcars)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
geom_abline(slope = coef(myline)[[2]], intercept = coef(myline)[[1]], color='blue')+
geom_point(color='red') +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
Alternatively, geom_smooth() can fit and draw the line itself. It does not take slope or intercept arguments; method='lm' is enough, and the line is then fitted separately within each facet:
geom_smooth(method='lm', se=FALSE, color='blue')+
We may simplify the steps with a group by operation and also instead of extracting each component, the output can be in a tibble with tidy from broom
library(broom)
library(dplyr)
library(tidyr)
mtcars1 %>%
group_by(cyl) %>%
summarise(out = list(tidy(lm(mpg ~ disp, data = cur_data())))) %>%
unnest(out)
-output
# A tibble: 6 x 6
cyl term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) 40.9 3.59 11.4 0.00000120
2 4 disp -0.135 0.0332 -4.07 0.00278
3 6 (Intercept) 19.1 2.91 6.55 0.00124
4 6 disp 0.00361 0.0156 0.232 0.826
5 8 (Intercept) 22.0 3.35 6.59 0.0000259
6 8 disp -0.0196 0.00932 -2.11 0.0568
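To get from per-group p-values to the conditional colouring asked about, one possible sketch is below. The names `p_vals`, `p_label` and `p_sig` are introduced here for illustration only; the idea is to label each group by its p-value and hand geom_smooth() only the groups that pass a threshold.

```r
library(dplyr)
library(ggplot2)

mtcars1 <- as_tibble(mtcars) %>% mutate(cyl = as.factor(cyl))

# One row per cyl: p-value of the disp slope, plus a label for the legend
p_vals <- mtcars1 %>%
  group_by(cyl) %>%
  group_modify(~ tibble(P = summary(lm(mpg ~ disp, data = .x))$coefficients[2, 4])) %>%
  ungroup() %>%
  mutate(p_label = case_when(P < 0.05 ~ "P < 0.05",
                             P < 0.10 ~ "P < 0.10",
                             TRUE ~ NA_character_))

p_sig <- mtcars1 %>%
  left_join(p_vals, by = "cyl") %>%
  ggplot(aes(x = disp, y = mpg)) +
  geom_point() +
  # the layer only sees groups with P < 0.10, so no line is drawn otherwise
  geom_smooth(data = function(d) dplyr::filter(d, !is.na(p_label)),
              aes(colour = p_label),
              method = "lm", formula = y ~ x, se = FALSE) +
  scale_colour_manual(name = NULL,
                      values = c("P < 0.05" = "black", "P < 0.10" = "lightblue")) +
  facet_wrap(~cyl, scales = "free")
p_sig
```

The p-values still have to come from lm() at some point, since geom_smooth() does not expose them, but the loop and the identity scale are no longer needed.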
I have created a boxplot using ggplot from the results of my tukey test. I have added the letters of significance above my boxes but the letters are not in order. I wish for my first sample to be "a" and then have "b" and then "c".
I used the following code;
value_max =
Rosettes %>%
group_by(Genotype) %>%
summarize(max_value = max(X0.5xMS))
hsd=HSD.test(aov(X0.5xMS~Genotype, data=Rosettes), trt = "Genotype", group = T)
sig.letters <- hsd$groups[order(row.names(hsd$groups)), ]
p <- ggplot(data = Rosettes, aes(x = Genotype, y = X0.5xMS)) +
geom_boxplot(aes(fill = Genotype,)) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ theme(axis.text.x = element_text(angle = 90)) +
geom_text(data = value_max, aes(x=Genotype, y = 0.1 + max_value, label = sig.letters$groups), vjust=0)+
stat_boxplot(geom = 'errorbar', width = 0.1)+
ggtitle("Rosette Tukey Results \n 0.5xMS") + theme(plot.title = element_text(hjust=0.5))+
xlab("Genotype") + ylab("Rosette Area (cm2)"); p
This code has given me the desired graph, the order of the letters is my only issue. If anyone could help, I would be very grateful.
Does this help? I know it is not your data and not your functions (aov and HSD.test), but I believe it does what you want. (Please find more about the compact letter display here.)
library(emmeans)
library(multcomp)
library(multcompView)
# set up model
model <- lm(weight ~ group, data = PlantGrowth)
# get (adjusted) weight means per group
model_means <- emmeans(object = model,
specs = "group")
# add letters to each mean
model_means_cld <- cld(object = model_means,
adjust = "Tukey",
Letters = letters,
reversed = TRUE, # <---- this one here!
alpha = 0.05)
# show output
model_means_cld
#> group emmean SE df lower.CL upper.CL .group
#> trt2 5.53 0.197 27 5.02 6.03 a
#> ctrl 5.03 0.197 27 4.53 5.53 ab
#> trt1 4.66 0.197 27 4.16 5.16 b
#>
#> Confidence level used: 0.95
#> Conf-level adjustment: sidak method for 3 estimates
#> P value adjustment: tukey method for comparing a family of 3 estimates
#> significance level used: alpha = 0.05
Created on 2021-10-18 by the reprex package (v2.0.1)
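To put those letters on a boxplot, a rough sketch (still on PlantGrowth, not the asker's Rosettes data) is to join the letters to the label positions by group name instead of relying on row order. The helper names `letters_df`, `max_df` and `label_df` are made up for this example.

```r
library(emmeans)
library(multcomp)
library(multcompView)
library(ggplot2)

model <- lm(weight ~ group, data = PlantGrowth)
model_means_cld <- cld(emmeans(model, specs = "group"),
                       adjust = "Tukey", Letters = letters,
                       reversed = TRUE, alpha = 0.05)

# The letters come padded with spaces; trim them and keep one row per group
letters_df <- data.frame(group  = model_means_cld$group,
                         letter = trimws(model_means_cld$.group))

# Largest observation per group, used to position each label
max_df <- aggregate(weight ~ group, data = PlantGrowth, FUN = max)

# Merge by group so every letter sits above its own box
label_df <- merge(max_df, letters_df, by = "group")

ggplot(PlantGrowth, aes(x = group, y = weight)) +
  geom_boxplot(aes(fill = group)) +
  geom_text(data = label_df,
            aes(y = weight + 0.1, label = letter), vjust = 0)
```

Because the letters and the y positions are matched on the group key, the order of rows in either table no longer matters.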
For a very basic demonstration, I'm trying to show that the log-transformed linear model is the best one for a given set of data. To demonstrate that, I want to compare it graphically with a standard lm and a square-root transform, to show that the log transform fits best of the three. The question is: how do I create multiple overlapping lm lines in one plot? If I could label them, that would also be great.
Here is sample true data with starter ggplot
library(tidyverse)
p=runif(100,1,100)
q=6+3*log(p)+rnorm(100)
sample <- data.frame(p,q)
ggplot(data = sample) +
geom_point(mapping = aes(x = p, y = q))
You could compute the lines yourself, e.g. like this:
# Make a tibble containing name of transform and the actual function
transforms <- tibble(Transform = c("log", "sqrt", "linear"),
Function = list(log, sqrt, function(x) x))
# Compute the regression coefs and turn it into a tidy table
lm_df <- transforms %>%
group_by(Transform) %>%
group_modify(~ {
lm(q ~ .x$Function[[1]](p), data = sample) %>%
broom::tidy() %>%
select(term, estimate) %>%
pivot_longer(estimate) %>%
mutate(Function = .x$Function)
})
> lm_df
# A tibble: 6 x 5
# Groups: Transform [3]
Transform term name value Function
<chr> <chr> <chr> <dbl> <list>
1 linear (Intercept) estimate 12.6 <fn>
2 linear .x$Function[[1]](p) estimate 0.0834 <fn>
3 log (Intercept) estimate 5.89 <fn>
4 log .x$Function[[1]](p) estimate 2.99 <fn>
5 sqrt (Intercept) estimate 9.35 <fn>
6 sqrt .x$Function[[1]](p) estimate 1.11 <fn>
# Evaluate the functions at different x values
lm_df <- lm_df %>%
pivot_wider(names_from = term, values_from = value) %>%
rename("Intercept" = `(Intercept)`, "Slope" = `.x$Function[[1]](p)`) %>%
group_modify(~ {
tibble(
y = .x$Intercept + .x$Slope * .x$Function[[1]](seq(0, max(sample$p))),
x = seq(0, max(sample$p))
)
})
> lm_df
# A tibble: 300 x 3
# Groups: Transform [3]
Transform y x
<chr> <dbl> <int>
1 linear 12.6 0
2 linear 12.7 1
3 linear 12.8 2
4 linear 12.9 3
5 linear 12.9 4
6 linear 13.0 5
7 linear 13.1 6
8 linear 13.2 7
9 linear 13.3 8
10 linear 13.4 9
# ... with 290 more rows
# Plot the functions
ggplot() +
geom_point(data = sample, mapping = aes(x = p, y = q)) +
geom_line(data = lm_df, aes(x = x, y = y, color = Transform))
This doesn't handle the labeling (you could use annotate() to add labels manually), but:
gg0 <- ggplot(data = sample, aes(x=p, y=q)) + geom_point()
gg0 + geom_smooth(method="lm", formula=y~x) +
geom_smooth(method="lm", formula=y~log(x), colour="red") +
geom_smooth(method="lm", formula=y~sqrt(x), colour="purple")
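For the labelling part, one option (a sketch, not the only way) is to map colour to a constant string inside aes() for each layer, so that ggplot2 builds a legend for you. The seed, the object names `sample_df` and `gg_labelled`, and the colour choices are arbitrary.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the sketch is repeatable
p <- runif(100, 1, 100)
q <- 6 + 3 * log(p) + rnorm(100)
sample_df <- data.frame(p, q)

gg_labelled <- ggplot(sample_df, aes(x = p, y = q)) +
  geom_point() +
  # a constant inside aes() becomes a legend entry
  geom_smooth(aes(colour = "linear"), method = "lm",
              formula = y ~ x, se = FALSE) +
  geom_smooth(aes(colour = "log"), method = "lm",
              formula = y ~ log(x), se = FALSE) +
  geom_smooth(aes(colour = "sqrt"), method = "lm",
              formula = y ~ sqrt(x), se = FALSE) +
  scale_colour_manual(name = "Transform",
                      values = c(linear = "blue", log = "red", sqrt = "purple"))
gg_labelled
```

The legend then names each line, which avoids placing annotate() labels by hand.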
I need to plot the relationship between x and y where polynomials of x predict y. This is done using the poly() function in order to ensure polynomials are orthogonal.
How do I plot this relationship considering the linear, quadratic and cubic terms together? The issue is that the coefficients for the different terms are not on the same scale as x.
I provide some example code below. I have tried reassigning the contrast values for each polynomial to x.
This solution gives impossible predicted values.
Thank you in advance for your help !
Best wishes,
Eric
Here is an example code:
x = sample(0:6,100,replace = TRUE)
y = (x*0.2) + (x^2*.05) + (x^3*0.001)
y = y + rnorm(100)
x = poly(x,3)
m = lm(y~x)
TAB = summary(m)$coefficients
### Reassigning the corresponding contrast values to each polynomial of x:
eq = function(x,TAB,start) {
#argument 'start' is used to determine the position of the linear coefficient, quadratic and cubic follow
pols = poly(x,3)
x1=pols[,1]; x2=pols[,2]; x3=pols[,3]
TAB[1,1] + x1[x]*TAB[start,1] + x2[x] * TAB[start+1,1] + x3[x] * TAB[start+2,1]
}
plot(eq(0:7,TAB,2))
Actually, you can use poly directly in formula for lm().
y ~ poly(x, 3) in lm() might be what you want.
For plot, I'll use ggplot2 package which has geom_smooth() function. It can draw the fitted curve. You should specify
method = "lm" argument
and the formula
library(tidyverse)
x <- sample(0:6,100,replace = TRUE)
y <- (x*0.2) + (x^2*.05) + (x^3*0.001)
eps <- rnorm(100)
(df <- tibble(y = y + eps, x = x)) # tibble() replaces the deprecated data_frame()
#> # A tibble: 100 x 2
#> y x
#> <dbl> <int>
#> 1 3.34 4
#> 2 1.23 5
#> 3 1.38 3
#> 4 -0.115 2
#> 5 1.94 5
#> 6 3.87 6
#> 7 -0.707 3
#> 8 0.954 3
#> 9 1.19 3
#> 10 -1.34 0
#> # ... with 90 more rows
Using your simulated data set,
df %>%
ggplot() + # this should be declared at first with the data set
aes(x, y) + # aesthetic
geom_point() + # data points
geom_smooth(method = "lm", formula = y ~ poly(x, 3)) # lm fit
If you want to remove the points: erase geom_point()
df %>%
ggplot() +
aes(x, y) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3))
If you would rather keep the points but make them less prominent, set alpha to a value below 1:
df %>%
ggplot() +
aes(x, y) +
geom_point(alpha = .3) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3))
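If you need the fitted values themselves rather than a ggplot layer, a minimal base-R sketch is to keep poly() inside the model formula and let predict() rebuild the orthogonal basis for the new x values, so the coefficients never have to be rescaled by hand. The seed and the names `m`, `m_raw` and `new_x` are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- sample(0:6, 100, replace = TRUE)
y <- 0.2 * x + 0.05 * x^2 + 0.001 * x^3 + rnorm(100)

# poly() stays inside the formula, so predict() can re-apply the same
# orthogonal basis to new values of x
m <- lm(y ~ poly(x, 3))

new_x <- data.frame(x = seq(0, 6, by = 0.1))
new_x$fit <- predict(m, newdata = new_x)

plot(x, y)
lines(new_x$x, new_x$fit)

# Sanity check: raw polynomials span the same space, so the fitted
# curve is the same even though the coefficients differ
m_raw <- lm(y ~ poly(x, 3, raw = TRUE))
all.equal(unname(new_x$fit), unname(predict(m_raw, newdata = new_x)))
```

Replacing x with poly(x, 3) before fitting, as in the question, loses that link, which is why evaluating the basis at new points by hand gives implausible predictions.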
Consider this simple example:
dataframe <- data_frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
> dataframe
# A tibble: 6 x 2
x y
<dbl> <dbl>
1 1 12
2 2 24
3 3 24
4 4 34
5 5 12
6 6 15
dataframe %>% ggplot(., aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x)
Here the standard errors are computed with the default option. However, I would like to use the robust variance-covariance matrix available in the packages sandwich and lmtest.
That is, using vcovHC(mymodel, "HC3")
Is there a way to get that in a simple way using the geom_smooth() function?
UPDATE: 2021-03-17 It was recently pointed out to me that the ggeffects package handles different VCOVs automatically, including the trickier HAC case that I originally demonstrated below. Quick example of the latter:
library(ggeffects)
library(sandwich) ## For HAC and other robust VCOVs
d <- data.frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
reg1 <- lm(y ~ x, data = d)
plot(ggpredict(reg1, "x", vcov.fun = "vcovHAC"))
#> Loading required namespace: ggplot2
## This gives you a regular ggplot2 object. So you can add layers as you
## normally would. E.g. If you'd like to compare with the original data...
library(ggplot2)
last_plot() +
geom_point(data = d, aes(x, y)) +
labs(caption = 'Shaded region indicates HAC 95% CI.')
Created on 2021-03-17 by the reprex package (v1.0.0)
My original answer follows below...
HC robust SEs (simple)
This is easily done now thanks to the estimatr package and its family of lm_robust functions. E.g.
library(tidyverse)
library(estimatr)
d <- data.frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
d %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = 'lm_robust', formula = y~x, fill="#E41A1C") + ## Robust (HC) SEs
geom_smooth(method = 'lm', formula = y~x, col = "grey50") + ## Just for comparison
labs(
title = "Plotting HC robust SEs in ggplot2",
subtitle = "Regular SEs in grey for comparison"
) +
theme_minimal()
Created on 2020-03-08 by the reprex package (v0.3.0)
HAC robust SEs (a bit more legwork)
The one caveat is that estimatr does not yet offer support for HAC (i.e. heteroscedasticity and autocorrelation consistent) SEs a la Newey-West. However, it is possible to obtain these manually with the sandwich package... which is kind of what the original question was asking anyway. You can then plot them using geom_ribbon().
I'll say for the record that HAC SEs don't make much sense for this particular data set. But here's an example of how you could do it, riffing off this excellent SO answer on a related topic.
library(tidyverse)
library(sandwich)
d <- data.frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
reg1 <- lm(y~x, data = d)
## Generate a prediction DF
pred_df <- data.frame(fit = predict(reg1))
## Get the design matrix
X_mat <- model.matrix(reg1)
## Get HAC VCOV matrix and calculate SEs
v_hac <- NeweyWest(reg1, prewhite = FALSE, adjust = TRUE) ## HAC VCOV (adjusted for small data sample)
#> Warning in meatHAC(x, order.by = order.by, prewhite = prewhite, weights =
#> weights, : more weights than observations, only first n used
var_fit_hac <- rowSums((X_mat %*% v_hac) * X_mat) ## Point-wise variance for predicted mean
se_fit_hac <- sqrt(var_fit_hac) ## SEs
## Add these to pred_df and calculate the 95% CI
pred_df <-
pred_df %>%
mutate(se_fit_hac = se_fit_hac) %>%
mutate(
lwr_hac = fit - qt(0.975, df=reg1$df.residual)*se_fit_hac,
upr_hac = fit + qt(0.975, df=reg1$df.residual)*se_fit_hac
)
pred_df
#> fit se_fit_hac lwr_hac upr_hac
#> 1 20.95238 4.250961 9.149822 32.75494
#> 2 20.63810 2.945392 12.460377 28.81581
#> 3 20.32381 1.986900 14.807291 25.84033
#> 4 20.00952 1.971797 14.534936 25.48411
#> 5 19.69524 2.914785 11.602497 27.78798
#> 6 19.38095 4.215654 7.676421 31.08548
## Plot it
bind_cols(
d,
pred_df
) %>%
ggplot(aes(x = x, y = y, ymin=lwr_hac, ymax=upr_hac)) +
geom_point() +
geom_ribbon(fill="#E41A1C", alpha=0.3, col=NA) + ## Robust (HAC) SEs
geom_smooth(method = 'lm', formula = y~x, col = "grey50") + ## Just for comparison
labs(
title = "Plotting HAC SEs in ggplot2",
subtitle = "Regular SEs in grey for comparison",
caption = "Note: Do HAC SEs make sense for this dataset? Definitely not!"
) +
theme_minimal()
Created on 2020-03-08 by the reprex package (v0.3.0)
Note that you could also use this approach to manually calculate and plot other robust SE predictions (e.g. HC1, HC2,etc.) if you so wished. All you would need to do is use the relevant sandwich estimator. For instance, using vcovHC(reg1, type = "HC2") instead of NeweyWest(reg1, prewhite = FALSE, adjust = TRUE) will give you an identical HC-robust CI to the first example that uses the estimatr package.
I am very new to this whole robust SE thing, but I was able to generate the following:
zz = '
x y
1 1 12
2 2 24
3 3 24
4 4 34
5 5 12
6 6 15
'
df <- read.table(text = zz, header = TRUE)
df
library(sandwich)
library(lmtest)
lm.model<-lm(y ~ x, data = df)
coef(lm.model)
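## vcovHC() returns the covariance matrix of the coefficients; project it
## onto the design matrix to get the point-wise SE of each fitted value
X <- model.matrix(lm.model)
se <- sqrt(rowSums((X %*% vcovHC(lm.model, type = "HC3")) * X))
fit <- predict(lm.model)
predframe <- with(df,data.frame(x,
y = fit,
lwr = fit - 1.96 * se,
upr = fit + 1.96 * se))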
library(ggplot2)
ggplot(df, aes(x = x, y = y))+
geom_point()+
geom_line(data = predframe)+
geom_ribbon(data = predframe, aes(ymin = lwr,ymax = upr), alpha = 0.3)