Plotting smooth functions from my GAM in ggplot - r

I have created a GAM and set up the predictions, but I am having trouble plotting the smooth functions from my model. I have been trying to plot these in ggplot but am struggling with the arguments/aesthetics now that I have added month as well. I have seen some people say to use geom_smooth() too, but I'm not sure. If anyone can advise me on this that would be great. I have added my data, model and predictions below:
model
mod = gam(co2 ~ s(timeStep, k = 200, bs = "cs") + s(month, k = 12, bs = "cc"),
data = carbonD,
family = gaussian(link = "identity"))
predictions
#create predictions
preds = predict(mod, type = 'terms', se.fit = TRUE)
#combine our predictions with coefficients
fit = preds$fit + coef(mod)[1]
data snippet
carbonD
co2 month year timeStep
1 315.42 1 1959 1
2 316.31 2 1959 2
3 316.50 3 1959 3
4 317.56 4 1959 4
5 318.13 5 1959 5
6 318.00 6 1959 6
7 316.39 7 1959 7
8 314.65 8 1959 8
9 313.68 9 1959 9
10 313.18 10 1959 10
11 314.66 11 1959 11
12 315.43 12 1959 12
13 316.27 1 1960 13
14 316.81 2 1960 14
15 317.42 3 1960 15

There are two ways to plot your exact model in ggplot. One is to use geom_smooth, but you can't do this with two variables on the right hand side. Actually, in your case it would be possible because month is calculable from the time step, but let's ignore that for now and just plot your model predictions directly using a ribbon and a line.
First, load the required packages and create the model (note that because we only have a snippet of your data, I have had to reduce the number of knots):
library(mgcv)
library(ggplot2)
mod = gam(co2 ~ s(timeStep, k = 4, bs = "cs") + s(month, k = 12, bs = "cc"),
data = carbonD,
family = gaussian(link = "identity"))
Now we create a little data frame of the values we want our predictions at, with 1000 points across the range of our data:
newdata <- data.frame(timeStep = seq(1, 15, length.out = 1000),
month = (seq(1, 15, length.out = 1000) - 1) %% 12 + 1)
Now we make our predictions and use the standard error fit to create an upper and lower confidence band.
pred <- predict(mod, newdata, type = 'response', se.fit = TRUE)
newdata$co2 <- pred$fit
newdata$lower <- pred$fit - 1.96 * pred$se.fit
newdata$upper <- pred$fit + 1.96 * pred$se.fit
Now we can plot our results:
ggplot(carbonD, aes(timeStep, co2)) +
geom_point() +
geom_ribbon(data = newdata, alpha = 0.3,
aes(ymin = lower, ymax = upper, fill = "confidence interval")) +
geom_line(data = newdata, aes(color = "GAM")) +
scale_fill_manual(values = "lightblue", name = NULL) +
scale_color_manual(values = "darkblue", name = NULL) +
theme_minimal(base_size = 16)
It is also possible to use your gam within geom_smooth directly, but you need to be able to express the model in terms of y and x, where x is the time step. You can get the month by subtracting 1 from the time step, getting this number modulo 12, and adding 1 again, so it is possible to avoid explicitly creating a prediction data frame, at the cost of making the plotting code more complex:
ggplot(carbonD, aes(timeStep, co2)) +
geom_point() +
geom_smooth(formula = y ~ s(x, k = 4, bs = "cs") +
s((x - 1) %% 12 + 1, k = 12, bs = "cc"),
method = "gam", size = 0.7,
method.args = list(family = gaussian(link = "identity")),
aes(color = "gam", fill = "confidence interval")) +
scale_fill_manual(values = "lightblue", name = NULL) +
scale_color_manual(values = "darkblue", name = NULL) +
theme_minimal(base_size = 16)
As a caveat to this, it is not clear to me that you should have both month and timestep, since one is just the modulus of the other. It may be better to use just timestep alone, or use year and month if you want to separate the long-term and seasonal effects.
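If the goal is to look at each smooth on its own (the long-term trend and the seasonal cycle separately), one option is to plot the per-term partial effects returned by predict(type = "terms"). The following is only a sketch under assumptions: it rebuilds carbonD from R's built-in co2 series (whose first values match the snippet in the question), uses k = 20 purely to keep the fit fast, and the helper names partial() and plot_partial() are made up for illustration.

```r
library(mgcv)
library(ggplot2)

# Assumption: rebuild carbonD from R's built-in `co2` series, whose first
# values match the snippet in the question.
carbonD <- data.frame(
  co2      = as.numeric(co2),
  month    = as.numeric(cycle(co2)),
  timeStep = seq_along(co2)
)

# k = 20 is an arbitrary choice to keep the example fast
mod <- gam(co2 ~ s(timeStep, k = 20, bs = "cs") + s(month, k = 12, bs = "cc"),
           data = carbonD)

# One column of partial effects (and standard errors) per smooth term
tm <- predict(mod, type = "terms", se.fit = TRUE)

# Hypothetical helper: collect one term's partial effect and an approximate 95% band
partial <- function(term, x) {
  data.frame(x     = x,
             fit   = tm$fit[, term],
             lower = tm$fit[, term] - 1.96 * tm$se.fit[, term],
             upper = tm$fit[, term] + 1.96 * tm$se.fit[, term])
}

trend    <- partial("s(timeStep)", carbonD$timeStep)
seasonal <- partial("s(month)", carbonD$month)
seasonal <- seasonal[!duplicated(seasonal$x), ]  # one row per month

# Hypothetical helper: ribbon + line for a single partial effect
plot_partial <- function(df, xlab) {
  ggplot(df, aes(x, fit)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), fill = "lightblue", alpha = 0.5) +
    geom_line(colour = "darkblue") +
    labs(x = xlab, y = "Partial effect on co2")
}

plot_partial(trend, "timeStep")
plot_partial(seasonal, "month")
```

The partial effects are centred, so the intercept would need to be added back to put the trend on the original co2 scale.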

The simplest way would be to use geom_smooth with LOESS: geom_smooth(method="loess", span=0.5) and play with the span parameter to get a more smooth or wiggly shape.

Related

Plotting a confidence interval on a poisson glm (ggplot)

I have a Poisson glm and I am trying to plot 95% confidence intervals using ggplot. My issue arises when I use geom_ribbon(). I think my model and CIs are all set up fine; it's just the ggplot code that I cannot get to work. If anyone knows what I've done wrong in the geom_ribbon call, that would be great.
model, CIs and plot code
#creating the poisson GLM model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#make predictions
model3_preds = predict(model3, type = 'response')
#create predictions for confidence intervals
predictions_model3 = predict(model3, aids, se.fit = TRUE, type = 'response')
#calculate 95% confidence intervals limit
upper_mod3 = predictions_model3$fit+1.96*predictions_model3$se.fit
lower_mod3 = predictions_model3$fit-1.96*predictions_model3$se.fit
#combining our predictions and confidence intervals into a df
predframe_model3 = data.frame(lwr = lower_mod3, upr = upper_mod3, data = aids$date, cases = aids$cases)
#plotting our model with 95% confidence intervals around the mean
ggplot(aids, aes(date, cases)) +
geom_ribbon(data = predframe_model3, aes(ymin = lwr, ymax = upr), fill = 'grey') +
geom_point() +
geom_line(aes(date, model3_preds), col = 'red')
aids data snippet if needed
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
In this case, you can plot directly with ggplot without a prediction data frame, using geom_smooth:
ggplot(aids, aes(date, cases)) +
geom_smooth(method = glm, formula = y ~ x, color = "red",
method.args = list(family = poisson)) +
geom_point()
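If you would rather keep the explicit prediction data frame, the likely culprit in the question's code appears to be that the column in predframe_model3 is named data rather than date, so the ribbon layer cannot find the inherited x aesthetic. A minimal sketch of that approach follows, using only the six rows shown in the question. It also builds the interval on the link scale and back-transforms it, which is my own suggestion rather than something from the question, so that the lower limit cannot go below zero.

```r
library(ggplot2)

# The six rows shown in the question
aids <- data.frame(cases   = c(2, 6, 10, 8, 12, 9),
                   quarter = c(1, 2, 3, 4, 1, 2),
                   date    = c(83, 83.25, 83.5, 83.75, 84, 84.25))

model3 <- glm(cases ~ date, data = aids, family = poisson(link = "log"))

# Predict on the link (log) scale, build the interval there, then back-transform
pr <- predict(model3, type = "link", se.fit = TRUE)

predframe <- data.frame(
  date  = aids$date,   # must be called `date` to match aes(date, cases)
  cases = aids$cases,
  fit   = exp(pr$fit),
  lwr   = exp(pr$fit - 1.96 * pr$se.fit),
  upr   = exp(pr$fit + 1.96 * pr$se.fit)
)

ggplot(predframe, aes(date, cases)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey") +
  geom_point() +
  geom_line(aes(y = fit), colour = "red")
```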

Plotting prediction intervals for mixed effects model

I have implemented a mixed effects model for my experiment for how error rate affects reaction time. I now want to calculate prediction intervals and then plot them.
Here is an example of my df
ppid error_rate RT pNum
1 1_1_4 2.865371 0.43339 1
2 1_1_77 11.459301 0.45000 1
3 1_1_80 2.865371 0.38320 1
4 1_2_26 3.820155 0.49990 1
5 1_2_31 2.865371 0.56680 1
6 1_2_32 3.820155 0.58330 1
7 1_2_33 2.865371 0.50000 1
8 1_2_40 3.820155 0.44980 1
9 1_2_43 2.865371 0.56660 1
10 1_2_54 11.459301 0.46670 1
11 1_2_63 2.865371 0.43350 1
12 1_2_64 2.865371 0.46680 1
13 1_2_71 2.865371 0.54990 1
14 1_2_76 2.865371 0.48350 1
15 1_2_85 2.865371 0.53340 1
16 1_2_88 3.820155 0.43340 1
17 1_2_89 3.820155 0.53320 1
18 1_3_0 3.820155 0.45080 1
19 1_3_1 2.865371 0.45022 1
20 1_3_19 2.865371 0.46651 1
I then implement the mixed effects model, generate some prediction intervals for each data point and then combine my original data with the predictions:
library(lme4)
library(merTools)
library(ggplot2)
fit <- lmer(formula = RT ~ error_rate + (1 + error_rate | pNum), data = data)
pred <- cbind(data, predictInterval(fit, data))
I then plot this using ggplot and get the following plot:
ggplot(pred) +
geom_line(aes(x = error_rate, y = fit)) +
geom_ribbon(aes(x = error_rate, ymin = lwr, ymax = upr), alpha = .2) +
geom_jitter(aes(x = error_rate, y = RT), alpha = .1) +
ylab("RT")
My plot makes sense to me: I have the black line indicating the predicted values for each error rate, and a shaded area which denotes the intervals. However I'm unsure why I'm getting the straight vertical lines in the middle of each error rate level within my data points? Also my horizontal prediction line seems wonky... does anybody know why this might be, and how to eradicate it? Many thanks!
One way to have a line connecting the error_rate values without the vertical lines is to plot the mean of the y variable fit at each error_rate. This is done with stat_summary as below.
ggplot(pred, aes(x = error_rate, y = fit)) +
stat_summary(fun = mean, geom = "line", show.legend = FALSE) +
geom_ribbon(aes(x = error_rate, ymin = lwr, ymax = upr), alpha = 0.2) +
geom_jitter(aes(x = error_rate, y = RT), alpha = 0.1) +
ylab("RT")
Note: In the question code, the ribbon is plotted with alpha = 0.2 and the points with alpha = 0.1. Would it make more sense to have the points less transparent than the underlying prediction band? And therefore to swap the alpha values?

how to extract equation of a log log linear function on r [duplicate]

I have the following data in a data.frame called t
DayNum MeanVolume StdDev StdErr
1 13 207.0500 41.00045 5.125057
2 15 142.7625 27.87236 3.484045
3 18 77.5500 19.43928 2.429910
4 21 66.3750 20.56403 2.570504
5 26 67.0500 29.01576 3.626970
6 29 66.4750 25.94537 3.243171
7 33 76.9625 25.31374 3.164218
8 36 91.2875 37.01719 4.627149
9 40 102.0500 29.39898 3.674872
10 43 100.8250 24.22830 3.028538
11 47 120.5125 28.80592 3.600740
12 50 147.8875 35.82894 4.478617
13 54 126.7875 45.43204 5.679004
14 57 139.8500 56.01117 7.001397
15 60 179.1375 69.64526 8.705658
16 64 149.7625 39.10265 4.887831
17 68 229.5250 121.08411 15.135514
18 71 236.5125 76.23146 9.528933
19 75 243.2750 101.69474 12.711842
20 78 331.6750 141.25344 17.656680
21 82 348.2875 122.86359 15.357948
22 85 353.7750 187.24641 23.405801
23 89 385.4000 154.05826 19.257283
24 92 500.9875 263.43714 32.929642
25 95 570.2250 301.82686 37.728358
26 98 692.2250 344.71226 43.089032
27 102 692.8000 283.94120 35.492650
28 105 759.2000 399.19323 49.899153
29 109 898.2375 444.94289 55.617861
30 112 920.1000 515.79597 64.474496
I am trying to fit x = DayNum to y = MeanVolume in t.
Here is what I did:
Fit to data
model<-lm(log(t$MeanVolume) ~ t$DayNum, data=t)
Plot data
plot(MeanVolume~DayNum, data=t, ylab="Mean Volume (mm3)", xlim=c(0,120), ylim=c(0,1000))
arrows(t$DayNum, t$MeanVolume-t$StdErr, t$DayNum, t$MeanVolume+t$StdErr, length=0.01, angle=90, code=3)
Create fit data
t$pred<-exp(predict(model))
Plot fit
lines(t$DayNum,t$pred,col="blue")
On the other hand, if I use ggplot2 to do this by using
ggplot(data = t, mapping = aes(x = DayNum, y=MeanVolume)) +
geom_line() +
geom_point(size=3, color="blue") +
geom_smooth(method="glm", method.args=list(family=gaussian(link="log"))) +
labs(x="Days", y="Mean Volume (mm3)", title="Data") +
geom_errorbar(aes(ymin = MeanVolume - StdErr, ymax = MeanVolume + StdErr), width=.2)
I get the following plot
As you can see, the fitted curve in the ggplot case is better than in the base plot case. Why? I would also like to obtain the fitted parameters, such as the intercept and slope of the exponential fit line. How can I extract them from the ggplot call?
An lm with a log-transformed y is not the same as a glm with a Gaussian error distribution and a log link (as to why, check the link in the comment by @Lyngbakr):
gz <- read.table("somet.txt")
gz <- as.data.frame(gz)
model_lm <- lm(log(MeanVolume) ~ DayNum, data = gz)
model_glm <- glm(MeanVolume ~ DayNum, data = gz, family = gaussian(link = "log"))
pred_lm <- exp(predict(model_lm))
pred_glm <- predict(model_glm, type = "response")
plot(MeanVolume ~ DayNum, data = gz, ylab = "Mean Volume (mm3)", xlim = c(0,120), ylim = c(0,1000))
arrows(gz$DayNum, gz$MeanVolume - gz$StdErr, gz$DayNum, gz$MeanVolume + gz$StdErr, length = 0.01, angle = 90, code = 3)
lines(gz$DayNum, pred_lm, col = "blue")
lines(gz$DayNum, pred_glm, col = "red")
legend("topleft", col = c("blue", "red"), lty = 1, legend = c("lm", "glm"))
As for the second part of the question:
library(ggplot2)
p = ggplot(data = gz, mapping = aes(x = DayNum, y=MeanVolume)) +
geom_line() +
geom_point(size = 3, color="blue") +
geom_smooth(method = "glm", method.args = list(family = gaussian(link = "log"))) +
labs(x = "Days", y = "Mean Volume (mm3)", title = "Data") +
geom_errorbar(aes(ymin = MeanVolume - StdErr, ymax = MeanVolume + StdErr), width=.2)
To extract the data from a ggplot, one can use:
build = ggplot_build(p)
The data for the curve are in build$data[[3]]:
p + geom_line(data = build$data[[3]], aes(x = x, y = y), lty = 2, color = "red", size = 1.5)
These data are the same as those in pred_glm, only denser (more data points). As far as I am aware, there is no method to extract the coefficients from the ggplot object, just the predictions, but you can always build the glm model as described above.
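To make that last point concrete, here is a rough sketch of pulling the intercept and slope out of the separately fitted glm and writing the curve as an exponential. The data frame is retyped from the table in the question; the names a0 and k are just illustrative.

```r
# DayNum and MeanVolume retyped from the table in the question
gz <- data.frame(
  DayNum = c(13, 15, 18, 21, 26, 29, 33, 36, 40, 43, 47, 50, 54, 57, 60,
             64, 68, 71, 75, 78, 82, 85, 89, 92, 95, 98, 102, 105, 109, 112),
  MeanVolume = c(207.0500, 142.7625, 77.5500, 66.3750, 67.0500, 66.4750,
                 76.9625, 91.2875, 102.0500, 100.8250, 120.5125, 147.8875,
                 126.7875, 139.8500, 179.1375, 149.7625, 229.5250, 236.5125,
                 243.2750, 331.6750, 348.2875, 353.7750, 385.4000, 500.9875,
                 570.2250, 692.2250, 692.8000, 759.2000, 898.2375, 920.1000)
)

# Same model that geom_smooth() fits in the plot above
model_glm <- glm(MeanVolume ~ DayNum, data = gz, family = gaussian(link = "log"))

b  <- coef(model_glm)
a0 <- exp(b[["(Intercept)"]])  # multiplicative constant on the response scale
k  <- b[["DayNum"]]            # growth rate per day on the log scale

# With a log link, the fitted curve is MeanVolume = a0 * exp(k * DayNum)
curve_by_hand <- a0 * exp(k * gz$DayNum)

summary(model_glm)$coefficients  # estimates with standard errors
```

Here curve_by_hand should reproduce fitted(model_glm), which is a quick way to confirm the equation has been read off correctly.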

Estimating the Poisson distribution

I have a graph and I calculated the degree distribution and degree as follows:
library(igraph) # for these two functions
dd <- degree_distribution(graph)
d <- degree(graph)
From this, I fitted a power law to see whether my distribution follows it:
degree = 1:max(d)
probability = dd[-1]
nonzero.position = which(probability != 0)
probability = probability[nonzero.position]
degree = degree[nonzero.position]
reg = lm(log(probability) ~ log(degree))
cozf = coef(reg)
power.law.fit = function(x) exp(cozf[[1]] + cozf[[2]] * log(x))
From that, I plotted the points and the power law using ggplot2, which resulted in the following image:
df <- data.frame(x = degree, y = probability)
print(
ggplot(df, aes(x,y,colour="Distribution"))+
geom_point(shape = 4) +
stat_function(fun = power.law.fit, geom = "line", aes(colour="Power Law"))+
labs(title = "Graph", subtitle = "Degree distribution",
x="K", y="P(k)", colour="Legend")+
scale_color_brewer(palette="Dark2")
)
As you can see, my distribution does not follow a power law! I would like to estimate the Poisson distribution and plot it on the same graph.
Even though I'm not sure whether my distribution follows a Poisson distribution, I would like to draw it together with the power law. I have no idea how to estimate this distribution (Poisson) from the data, or how to calculate the average degree.
Can anyone help me?
The graph used to calculate the distribution and the degree is very large (700 thousand vertices), so I did not put the data of the graphs. The explanation of the answer can be based on any graph.
From ?dpois:
The Poisson distribution has density
p(x) = λ^x exp(-λ)/x!
for x = 0, 1, 2, … . The mean and variance are E(X) = Var(X) = λ.
So I'll generate some dummy data with a secret lambda:
mysecret <- ####
x <- data.frame(xes = rpois(50, mysecret))
> x$xes
[1] 0 2 2 1 1 4 1 1 0 2 2 2 1 0 0 1 2 3 2 4 2 1 0 3 2 1 3 1 2 1 5 0 2 3 2 1 0 1 2 3 0 1 2 2 0 3 2 2 2 3
> mean(x$xes)
[1] 1.66
> var(x$xes)
[1] 1.371837
So two good guesses for my secret lambda are 1.66 and 1.37. Let's try them:
library(ggplot2)
ggplot(x, aes(xes)) +
geom_histogram(aes(y = ..density.., color = "Raw data"),
fill = "white", binwidth = 1, center = 0, size = 1.5) +
stat_summary(fun = dpois, aes(x = xes, y = xes, color = "Density based on E(X)"),
fun.args = list(lambda = 1.66), geom = "line", size = 1.5) +
stat_summary(fun = dpois, aes(x = xes, y = xes, color = "Density based on Var(X)"),
fun.args = list(lambda = 1.37), geom = "line", size = 1.5)
They're both pretty good. You can't really use the built-in stat_function or geom_density for generating these, since Poisson distributions are only defined for integers. The histogram and summary functions work well though, since they're only estimated at the data points themselves, not interpolated.
If you want more detail, you can use the MASS package:
MASS::fitdistr(x$xes, dpois, start = list(lambda = 1))
lambda
1.6601563
(0.1822258)
So let's try constructing from that:
library(dplyr)
df <- tibble(xes = seq.int(max(x$xes)+1)-1,
dens.m = dpois(xes, 1.66),
dens.u = dpois(xes, 1.66+0.18),
dens.l = dpois(xes, 1.66-0.18))
> df
# A tibble: 6 x 4
xes dens.m dens.u dens.l
<dbl> <dbl> <dbl> <dbl>
1 0 0.19013898 0.15881743 0.22763769
2 1 0.31563071 0.29222406 0.33690378
3 2 0.26197349 0.26884614 0.24930880
4 3 0.14495866 0.16489230 0.12299234
5 4 0.06015785 0.07585046 0.04550717
6 5 0.01997240 0.02791297 0.01347012
ggplot(x, aes(xes)) +
geom_histogram(aes(y = ..density..), color = "black",
fill = "white", binwidth = 1, center = 0, size = 1.5) +
geom_ribbon(data = df, aes(xes, ymin = dens.l, ymax = dens.u), fill = "grey50", alpha = 0.5) +
geom_line(data = df, aes(xes, dens.m, color = "Based on E(X)\n+/-1 SD of lambda"), size = 1.5)
Based on these two methods and visual interpretation, you should feel comfortable saying λ = 1.66+/-0.18.
For reference, my secret initial value was 1.5.
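Applied to the degree data in the question, the same idea might look like the sketch below. Since the graph itself is not available, d here is simulated as a stand-in for degree(graph); the lambda of 4 and the sample size are arbitrary assumptions. The maximum-likelihood estimate of λ for a Poisson distribution is simply the sample mean, i.e. the average degree.

```r
library(ggplot2)

set.seed(1)
# Stand-in for d <- degree(graph); lambda = 4 and n = 1000 are arbitrary
d <- rpois(1000, lambda = 4)

# Poisson MLE of lambda is the sample mean, i.e. the average degree
lambda_hat <- mean(d)

# Empirical degree distribution on 0..max(d), next to the fitted Poisson pmf
k  <- 0:max(d)
df <- data.frame(
  k         = k,
  empirical = as.numeric(table(factor(d, levels = k))) / length(d),
  poisson   = dpois(k, lambda_hat)
)

ggplot(df, aes(k)) +
  geom_point(aes(y = empirical, colour = "Distribution"), shape = 4) +
  geom_line(aes(y = poisson, colour = "Poisson")) +
  labs(x = "K", y = "P(k)", colour = "Legend")
```

The Poisson probabilities are only evaluated at integer k and joined by a line for readability; the power-law curve from the question could be added as another layer on the same plot.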

ggplot2: how to get robust confidence interval for predictions in geom_smooth?

consider this simple example
dataframe <- data_frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
> dataframe
# A tibble: 6 x 2
x y
<dbl> <dbl>
1 1 12
2 2 24
3 3 24
4 4 34
5 5 12
6 6 15
dataframe %>% ggplot(., aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x)
Here the standard errors are computed with the default option. However, I would like to use the robust variance-covariance matrix available in the packages sandwich and lmtest.
That is, using vcovHC(mymodel, "HC3")
Is there a way to get that in a simple way using the geom_smooth() function?
UPDATE: 2021-03-17 It was recently pointed out to me that the ggeffects package handles different VCOVs automatically, including the trickier HAC case that I originally demonstrated below. Quick example of the latter:
library(ggeffects)
library(sandwich) ## For HAC and other robust VCOVs
d <- data.frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
reg1 <- lm(y ~ x, data = d)
plot(ggpredict(reg1, "x", vcov.fun = "vcovHAC"))
#> Loading required namespace: ggplot2
## This gives you a regular ggplot2 object. So you can add layers as you
## normally would. E.g. If you'd like to compare with the original data...
library(ggplot2)
last_plot() +
geom_point(data = d, aes(x, y)) +
labs(caption = 'Shaded region indicates HAC 95% CI.')
Created on 2021-03-17 by the reprex package (v1.0.0)
My original answer follows below...
HC robust SEs (simple)
This is easily done now thanks to the estimatr package and its family of lm_robust functions. E.g.
library(tidyverse)
library(estimatr)
d <- data.frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
d %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = 'lm_robust', formula = y~x, fill="#E41A1C") + ## Robust (HC) SEs
geom_smooth(method = 'lm', formula = y~x, col = "grey50") + ## Just for comparison
labs(
title = "Plotting HC robust SEs in ggplot2",
subtitle = "Regular SEs in grey for comparison"
) +
theme_minimal()
Created on 2020-03-08 by the reprex package (v0.3.0)
HAC robust SEs (a bit more legwork)
The one caveat is that estimatr does not yet offer support for HAC (i.e. heteroscedasticity and autocorrelation consistent) SEs a la Newey-West. However, it is possible to obtain these manually with the sandwich package... which is kind of what the original question was asking anyway. You can then plot them using geom_ribbon().
I'll say for the record that HAC SEs don't make much sense for this particular data set. But here's an example of how you could do it, riffing off this excellent SO answer on a related topic.
library(tidyverse)
library(sandwich)
d <- data.frame(x = c(1,2,3,4,5,6),
y = c(12,24,24,34,12,15))
reg1 <- lm(y~x, data = d)
## Generate a prediction DF
pred_df <- data.frame(fit = predict(reg1))
## Get the design matrix
X_mat <- model.matrix(reg1)
## Get HAC VCOV matrix and calculate SEs
v_hac <- NeweyWest(reg1, prewhite = FALSE, adjust = TRUE) ## HAC VCOV (adjusted for small data sample)
#> Warning in meatHAC(x, order.by = order.by, prewhite = prewhite, weights =
#> weights, : more weights than observations, only first n used
var_fit_hac <- rowSums((X_mat %*% v_hac) * X_mat) ## Point-wise variance for predicted mean
se_fit_hac <- sqrt(var_fit_hac) ## SEs
## Add these to pred_df and calculate the 95% CI
pred_df <-
pred_df %>%
mutate(se_fit_hac = se_fit_hac) %>%
mutate(
lwr_hac = fit - qt(0.975, df=reg1$df.residual)*se_fit_hac,
upr_hac = fit + qt(0.975, df=reg1$df.residual)*se_fit_hac
)
pred_df
#> fit se_fit_hac lwr_hac upr_hac
#> 1 20.95238 4.250961 9.149822 32.75494
#> 2 20.63810 2.945392 12.460377 28.81581
#> 3 20.32381 1.986900 14.807291 25.84033
#> 4 20.00952 1.971797 14.534936 25.48411
#> 5 19.69524 2.914785 11.602497 27.78798
#> 6 19.38095 4.215654 7.676421 31.08548
## Plot it
bind_cols(
d,
pred_df
) %>%
ggplot(aes(x = x, y = y, ymin=lwr_hac, ymax=upr_hac)) +
geom_point() +
geom_ribbon(fill="#E41A1C", alpha=0.3, col=NA) + ## Robust (HAC) SEs
geom_smooth(method = 'lm', formula = y~x, col = "grey50") + ## Just for comparison
labs(
title = "Plotting HAC SEs in ggplot2",
subtitle = "Regular SEs in grey for comparison",
caption = "Note: Do HAC SEs make sense for this dataset? Definitely not!"
) +
theme_minimal()
Created on 2020-03-08 by the reprex package (v0.3.0)
Note that you could also use this approach to manually calculate and plot other robust SE predictions (e.g. HC1, HC2,etc.) if you so wished. All you would need to do is use the relevant sandwich estimator. For instance, using vcovHC(reg1, type = "HC2") instead of NeweyWest(reg1, prewhite = FALSE, adjust = TRUE) will give you an identical HC-robust CI to the first example that uses the estimatr package.
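As a sanity check on the row-wise formula used above for the point-wise variance, sqrt(diag(X V X')), the sketch below plugs in the ordinary (non-robust) vcov(reg1) and compares the result with what predict() reports. This uses base R only; the names se_manual and se_builtin are just illustrative.

```r
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(12, 24, 24, 34, 12, 15))
reg1 <- lm(y ~ x, data = d)

X_mat <- model.matrix(reg1)

# Same computation as for the HAC case, but with the classical VCOV
se_manual  <- sqrt(rowSums((X_mat %*% vcov(reg1)) * X_mat))

# Standard errors of the fitted mean as computed by predict.lm()
se_builtin <- predict(reg1, se.fit = TRUE)$se.fit

all.equal(as.numeric(se_manual), as.numeric(se_builtin))
```

If the two agree, swapping vcov(reg1) for any sandwich estimator gives the corresponding robust standard errors of the fitted mean.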
I am very new to this whole robust SE thing, but I was able to generate the following:
zz = '
x y
1 1 12
2 2 24
3 3 24
4 4 34
5 5 12
6 6 15
'
df <- read.table(text = zz, header = TRUE)
df
library(sandwich)
library(lmtest)
lm.model<-lm(y ~ x, data = df)
coef(lm.model)
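## Robust (HC3) variance-covariance matrix of the coefficients
vc = vcovHC(lm.model, type = "HC3")
## Point-wise SE of the fitted mean, sqrt(diag(X V X')), one value per row
## (sqrt(diag(vc)) alone gives the two coefficient SEs, which would be recycled)
X = model.matrix(lm.model)
se = sqrt(rowSums((X %*% vc) * X))
fit = predict(lm.model)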
predframe <- with(df,data.frame(x,
y = fit,
lwr = fit - 1.96 * se,
upr = fit + 1.96 * se))
library(ggplot2)
ggplot(df, aes(x = x, y = y))+
geom_point()+
geom_line(data = predframe)+
geom_ribbon(data = predframe, aes(ymin = lwr,ymax = upr), alpha = 0.3)
