Plotting prediction intervals for mixed effects model - r

I have implemented a mixed effects model for my experiment for how error rate affects reaction time. I now want to calculate prediction intervals and then plot them.
Here is an example of my df
ppid error_rate RT pNum
1 1_1_4 2.865371 0.43339 1
2 1_1_77 11.459301 0.45000 1
3 1_1_80 2.865371 0.38320 1
4 1_2_26 3.820155 0.49990 1
5 1_2_31 2.865371 0.56680 1
6 1_2_32 3.820155 0.58330 1
7 1_2_33 2.865371 0.50000 1
8 1_2_40 3.820155 0.44980 1
9 1_2_43 2.865371 0.56660 1
10 1_2_54 11.459301 0.46670 1
11 1_2_63 2.865371 0.43350 1
12 1_2_64 2.865371 0.46680 1
13 1_2_71 2.865371 0.54990 1
14 1_2_76 2.865371 0.48350 1
15 1_2_85 2.865371 0.53340 1
16 1_2_88 3.820155 0.43340 1
17 1_2_89 3.820155 0.53320 1
18 1_3_0 3.820155 0.45080 1
19 1_3_1 2.865371 0.45022 1
20 1_3_19 2.865371 0.46651 1
I then implement the mixed effects model, generate some prediction intervals for each data point and then combine my original data with the predictions:
library(lme4)
library(merTools)
library(ggplot2)
fit <- lmer(formula = RT ~ error_rate + (1 + error_rate | pNum), data = data)
pred <- cbind(data, predictInterval(fit, data))
I then plot this using ggplot and get the following plot:
ggplot(pred) +
geom_line(aes(x = error_rate, y = fit)) +
geom_ribbon(aes(x = error_rate, ymin = lwr, ymax = upr), alpha = .2) +
geom_jitter(aes(x = error_rate, y = RT), alpha = .1) +
ylab("RT")
My plot makes sense to me: I have the black line indicating the predicted values for each error rate, and a shaded area which denotes the intervals. However I'm unsure why I'm getting the straight vertical lines in the middle of each error rate level within my data points? Also my horizontal prediction line seems wonky... does anybody know why this might be, and how to eradicate it? Many thanks!

One way to draw a line connecting the error_rate values without the vertical lines is to plot the mean of the y variable fit at each error_rate. This is done with stat_summary as below.
ggplot(pred, aes(x = error_rate, y = fit)) +
stat_summary(fun = mean, geom = "line", show.legend = FALSE) + # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
geom_ribbon(aes(x = error_rate, ymin = lwr, ymax = upr), alpha = 0.2) +
geom_jitter(aes(x = error_rate, y = RT), alpha = 0.1) +
ylab("RT")
Note: In the question code, the ribbon is plotted with alpha = 0.2 and the points with alpha = 0.1. Would it make more sense to have the points less transparent than the underlying prediction band? And therefore to swap the alpha values?
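The vertical segments appear when several rows share one error_rate but carry different fit values, so geom_line joins them up and down at that x position. Below is a minimal base R sketch of the averaging that stat_summary performs. The pred_toy data frame and its numbers are made up for illustration and are not the asker's data.

```r
# Hypothetical stand-in for `pred`: several rows per error_rate,
# each with a slightly different fitted value.
set.seed(1)
pred_toy <- data.frame(
  error_rate = rep(c(2.865371, 3.820155, 11.459301), each = 5),
  fit = rep(c(0.48, 0.49, 0.46), each = 5) + rnorm(15, sd = 0.01)
)
pred_toy$lwr <- pred_toy$fit - 0.10
pred_toy$upr <- pred_toy$fit + 0.10

# Average fit, lwr and upr within each error_rate so that every
# x value ends up with exactly one row.
pred_mean <- aggregate(cbind(fit, lwr, upr) ~ error_rate,
                       data = pred_toy, FUN = mean)
pred_mean
```

A collapsed data frame like pred_mean could be passed to geom_line and geom_ribbon, which would also smooth the ribbon edges, while the raw data still goes to geom_jitter.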

Related

Plotting smooth functions from my GAM in ggplot

I have created a GAM and set up the predictions, but I am having trouble plotting the smooth functions from my model. I have been trying to plot these in ggplot, but I am unsure about the arguments/aesthetics now that I have added month as well. I have seen some people say to use geom_smooth() too, but I'm not sure. If anyone can advise me on this that would be great. I have added my data, model and predictions below:
model
mod = gam(co2 ~ s(timeStep, k = 200, bs = "cs") + s(month, k = 12, bs = "cc"),
data = carbonD,
family = gaussian(link = "identity"))
predictions
#create predictions
preds = predict(mod, type = 'terms', se.fit = TRUE)
#combine our predictions with coefficients
fit = preds$fit + coef(mod)[1]
data snippet
carbonD
co2 month year timeStep
1 315.42 1 1959 1
2 316.31 2 1959 2
3 316.50 3 1959 3
4 317.56 4 1959 4
5 318.13 5 1959 5
6 318.00 6 1959 6
7 316.39 7 1959 7
8 314.65 8 1959 8
9 313.68 9 1959 9
10 313.18 10 1959 10
11 314.66 11 1959 11
12 315.43 12 1959 12
13 316.27 1 1960 13
14 316.81 2 1960 14
15 317.42 3 1960 15
There are two ways to plot your exact model in ggplot. One is to use geom_smooth, but you can't do this with two variables on the right hand side. Actually, in your case it would be possible because month is calculable from the time step, but let's ignore that for now and just plot your model predictions directly using a ribbon and a line.
First, load the required packages and create the model (note because we only have a snippet of your data, I have had to reduce the number of knots)
library(mgcv)
library(ggplot2)
mod = gam(co2 ~ s(timeStep, k = 4, bs = "cs") + s(month, k = 12, bs = "cc"),
data = carbonD,
family = gaussian(link = "identity"))
Now we create a little data frame of the values we want our predictions at, with 1000 points across the range of our data:
newdata <- data.frame(timeStep = seq(1, 15, length.out = 1000),
month = (seq(1, 15, length.out = 1000) - 1) %% 12 + 1)
Now we make our predictions and use the standard error fit to create an upper and lower confidence band.
pred <- predict(mod, newdata, type = 'response', se.fit = TRUE)
newdata$co2 <- pred$fit
newdata$lower <- pred$fit - 1.96 * pred$se.fit
newdata$upper <- pred$fit + 1.96 * pred$se.fit
Now we can plot our results:
ggplot(carbonD, aes(timeStep, co2)) +
geom_point() +
geom_ribbon(data = newdata, alpha = 0.3,
aes(ymin = lower, ymax = upper, fill = "confidence interval")) +
geom_line(data = newdata, aes(color = "GAM")) +
scale_fill_manual(values = "lightblue", name = NULL) +
scale_color_manual(values = "darkblue", name = NULL) +
theme_minimal(base_size = 16)
It is also possible to use your gam within geom_smooth directly, but you need to be able to express the model in terms of y and x, where x is the time step. You can get the month by subtracting 1 from the time step, getting this number modulo 12, and adding 1 again, so it is possible to avoid explicitly creating a prediction data frame, at the cost of making the plotting code more complex:
ggplot(carbonD, aes(timeStep, co2)) +
geom_point() +
geom_smooth(formula = y ~ s(x, k = 4, bs = "cs") +
s((x - 1) %% 12 + 1, k = 12, bs = "cc"),
method = "gam", size = 0.7,
method.args = list(family = gaussian(link = "identity")),
aes(color = "gam", fill = "confidence interval")) +
scale_fill_manual(values = "lightblue", name = NULL) +
scale_color_manual(values = "darkblue", name = NULL) +
theme_minimal(base_size = 16)
As a caveat to this, it is not clear to me that you should have both month and timestep, since one is just the modulus of the other. It may be better to use just timestep alone, or use year and month if you want to separate the long-term and seasonal effects.
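The relationship between month and timeStep mentioned above can be checked directly on the data snippet. This is a small sketch in which carbon_toy is a hand-typed copy of the two relevant columns from the 15 rows shown in the question:

```r
# Month and timeStep columns copied from the 15-row snippet in the question.
carbon_toy <- data.frame(month = c(1:12, 1:3), timeStep = 1:15)

# Recover the month from the time step: shift to 0-based, wrap at 12,
# then shift back to 1-based.
carbon_toy$month_from_step <- (carbon_toy$timeStep - 1) %% 12 + 1

# TRUE for this snippet: month carries no information beyond timeStep.
all(carbon_toy$month_from_step == carbon_toy$month)
```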
The simplest way would be to use geom_smooth with LOESS: geom_smooth(method="loess", span=0.5) and play with the span parameter to get a more smooth or wiggly shape.

Plotting a confidence interval on a Poisson GLM (ggplot)

I have a Poisson GLM and I am trying to plot 95% confidence intervals using ggplot. My issue arises when I use geom_ribbon(). I think my model and CIs are all set up fine; it's just the ggplot code that I cannot get to work. If anyone knows what I've done wrong in the geom_ribbon call, that would be great.
model, CIs and plot code
#creating the poisson GLM model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#make predictions
model3_preds = predict(model3, type = 'response')
#create predictions for confidence intervals
predictions_model3 = predict(model3, aids, se.fit = TRUE, type = 'response')
#calculate 95% confidence intervals limit
upper_mod3 = predictions_model3$fit+1.96*predictions_model3$se.fit
lower_mod3 = predictions_model3$fit-1.96*predictions_model3$se.fit
#combining our predictions and confidence intervals into a df
predframe_model3 = data.frame(lwr = lower_mod3, upr = upper_mod3, data = aids$date, cases = aids$cases)
#plotting our model with 95% confidence intervals around the mean
ggplot(aids, aes(date, cases)) +
geom_ribbon(data = predframe_model3, aes(ymin = lwr, ymax = upr), fill = 'grey') +
geom_point() +
geom_line(aes(date, model3_preds), col = 'red')
aids data snippet if needed
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
In this case, you can plot directly with ggplot without a prediction data frame, using geom_smooth:
ggplot(aids, aes(date, cases)) +
geom_smooth(method = glm, formula = y ~ x, color = "red",
method.args = list(family = poisson)) +
geom_point()
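If you would rather keep a prediction data frame, note that the question's predframe_model3 names its x column data rather than date, so geom_ribbon cannot find the date aesthetic it inherits from aes(date, cases). Below is a minimal sketch of the manual route using only the six rows from the question (aids_toy and predframe are names I made up). It builds the interval on the link scale and back-transforms it, which is my own choice here because it keeps the lower bound positive. The plot is only built if ggplot2 is installed.

```r
# Six rows copied from the snippet in the question.
aids_toy <- data.frame(
  cases = c(2, 6, 10, 8, 12, 9),
  date  = c(83.00, 83.25, 83.50, 83.75, 84.00, 84.25)
)

fit <- glm(cases ~ date, data = aids_toy, family = poisson(link = "log"))

# Predict on the link (log) scale, build the interval there,
# then back-transform with exp() so the bounds stay positive.
link_pred <- predict(fit, type = "link", se.fit = TRUE)
predframe <- data.frame(
  date  = aids_toy$date,   # named `date` to match aes(date, cases)
  cases = aids_toy$cases,
  fit   = exp(link_pred$fit),
  lwr   = exp(link_pred$fit - 1.96 * link_pred$se.fit),
  upr   = exp(link_pred$fit + 1.96 * link_pred$se.fit)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(predframe, aes(date, cases)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey") +
    geom_point() +
    geom_line(aes(y = fit), colour = "red")
}
```

With only six rows the band is wide; the same code should apply unchanged to the full aids data.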

Explain the code underlying a linear model in R visualised with ggplot

I am trying to understand how linear modelling can be used as an alternative to the t-test when analysing gene expression data. For a single gene, I have a data frame of 20 gene expression values in total: group 1 (n=10) and group 2 (n=10).
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
group = c(rep(1, 10), rep(2, 10)))
The data can be (box)plotted using ggplot as shown below:
plot <- gexp %>%
ggplot(aes(x = group, y = expression)) +
geom_boxplot() +
geom_point()
plot
I wish to model the expression in groups 1 and 2 using the regression formula:
Y = Beta0 + (Beta1 x X) + e where Y is the expression I want to model and X represents the two groups that are encoded as 0 and 1 respectively. Therefore, the expression in group 1 (when x = 0) is equal to Beta0; and the expression in group 2 (when x = 1) is equal to Beta0 + Beta1.
If this is modelled with:
mod1 <- lm(expression ~ group, data = gexp)
mod1
The above code outputs an intercept of 2.75 and a slope of 1.58. It is the visualisation of the linear model that I don't understand. I would be grateful for a clear explanation of the below code:
plot +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
I get why the data.frame values are the ones chosen (the value of 4.33 is the sum of the intercept, Beta0 and the slope, Beta1) , but it is the geom_abline arguments I do not understand. Why is the intercept calculation as shown? In the text I am using it states, '...we need to subtract the slope from the intercept when plotting the linear model because groups 1 and 2 are encoded as 0 and 1 in the model, but plotted as 1 and 2 on the figure.' I don't follow this point and would be grateful for an explanation, without getting too technical.
I believe your code is correct if the group variable is encoded as a factor.
library(ggplot2)
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
group = factor(c(rep(1, 10), rep(2, 10))))
plot <-
ggplot(gexp, aes(x = group, y = expression)) +
geom_boxplot() +
geom_point()
mod1 <- lm(expression ~ group, data = gexp)
plot +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
Created on 2022-03-30 by the reprex package (v2.0.1)
To understand the difference between factors and integers in specifying linear models, you can have a look at the model matrix.
model.matrix(y ~ f, data = data.frame(f = 1:3, y = 1))
#> (Intercept) f
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> attr(,"assign")
#> [1] 0 1
model.matrix(y ~ f, data = data.frame(f = factor(1:3), y = 1))
#> (Intercept) f2 f3
#> 1 1 0 0
#> 2 1 1 0
#> 3 1 0 1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$f
#> [1] "contr.treatment"
In the first model matrix, what you specify is what you get: you're modelling something as a function of the intercept and the f variable. In this model, f is treated as a number, so f = 2 counts as twice as much as f = 1.
This works a little bit differently when f is a factor. A k-level factor gets split up in k-1 dummy variables, where each dummy variable encodes with 1 or 0 whether it deviates from the reference level (the first factor level). By modelling it in this way, you don't consider that the 2nd factor level might be twice the 1st factor level.
Because in ggplot2, the first factor level is displayed at position = 1 and not at position = 0 (how it is modelled), your calculated intercept is off. You need to subtract 1 * slope from the calculated intercept to get it to display right in ggplot2.
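The arithmetic can be checked with the data from the question. This is a small sketch in which plot_intercept and line_at are helper names I introduced for illustration:

```r
gexp <- data.frame(
  expression = c(2.7, 0.4, 1.8, 0.8, 1.9, 5.4, 5.7, 2.8, 2.0, 4.0,
                 3.9, 2.8, 3.1, 2.1, 1.9, 6.4, 7.5, 3.6, 6.6, 5.4),
  group = factor(rep(1:2, each = 10))
)
mod1 <- lm(expression ~ group, data = gexp)

b0 <- unname(coef(mod1)[1])  # mean of group 1 (modelled at x = 0)
b1 <- unname(coef(mod1)[2])  # difference between the group means

# On the plot the groups sit at x = 1 and x = 2, one unit to the right
# of where the model places them, so the line's intercept at x = 0
# must be one slope lower.
plot_intercept <- b0 - b1
line_at <- function(x) plot_intercept + b1 * x

line_at(1)  # passes through the group 1 mean
line_at(2)  # passes through the group 2 mean
```

Without the subtraction, the line would pass through b0 + b1 at x = 1 and b0 + 2 * b1 at x = 2, which is one slope too high at both positions.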

R loess prediction does not match ggplot geom_smooth(). Error in my prediction formula?

I am trying to predict, on my own, the loess values provided by ggplot geom_smooth(). I have attached links to my data and the output plot of the predictions.
Data can be found here. I followed an example provided from this post about loess prediction to reproduce the values from ggplot, so I think I am on the right track, but I am missing something.
library("ggplot2")
load(file="data5a.RData")
lsmod = loess(Flux~DA_SQ_KM, data=data5a, control=loess.control(surface="direct"))
xrange <- max(data5a$DA_SQ_KM,na.rm=TRUE)
xseq <- c(0.01,0.05,0.1,0.2,0.3,0.5,seq(from=1, to=xrange, length=100))
pred = predict(lsmod,newdata=data.frame(DA_SQ_KM = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(data5a, aes(DA_SQ_KM, Flux)) +
geom_point()+
geom_smooth(method="loess")+
geom_smooth(aes_auto(loess.DF), data=loess.DF, stat="identity",col="red")+
geom_smooth(method="lm",se=FALSE,col="green")+
theme(legend.position = "bottom")+
scale_y_log10()+
scale_x_log10()
Where is the error in my code for reproducing the data in the blue curve that is predicted by geom_smooth()?
Here is an image of the output within ggplot:
UPDATE 1:
I have included updated code here based on input provided by Roland. I have modified my code to use the mgcv::gam function since my data contains greater than 1000 points. The issue still remains that I cannot reproduce the model created by geom_smooth within ggplot. A new issue has also emerged with the confidence intervals.
library("ggplot2")
library("mgcv")
load(file="data5a.RData")
#Attempt to re-create the gam model myself
gammod = mgcv::gam(Flux~s(DA_SQ_KM, bs = "cs"),data=data5a)
xrange <- max(data5a$DA_SQ_KM,na.rm=TRUE)
xseq <- c(0.001,0.01,0.05,0.1,0.2,0.3,0.5,seq(from=1, to=xrange, length=100))
pred = predict(gammod ,newdata=data.frame(DA_SQ_KM = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
gam.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(data5a, aes(DA_SQ_KM, Flux)) +
geom_point()+
geom_smooth(aes_auto(gam.DF), data=gam.DF, stat="identity",col="red")+
stat_smooth(method=mgcv::gam,formula = y ~ s(x, bs = "cs"),se=TRUE,col="purple")+
theme(legend.position = "bottom")+
scale_y_log10()+
scale_x_log10()
Here is the gam output within ggplot:
ggplot2 fits the model to the transformed variables if you use scale_* transformations:
DF <- data.frame(x = 1:3, y = c(10, 100, 1e3))
library(ggplot2)
p <- ggplot(DF, aes(x, y)) +
geom_point() +
scale_y_log10() +
stat_smooth(method = "lm", n = 3)
g <- ggplot_build(p)
g[["data"]][[2]]
# x y ymin ymax se PANEL group colour fill size linetype weight alpha
#1 1 1 1 1 0 1 -1 #3366FF grey60 1 1 1 0.4
#2 2 2 2 2 0 1 -1 #3366FF grey60 1 1 1 0.4
#3 3 3 3 3 0 1 -1 #3366FF grey60 1 1 1 0.4
Note the zero SEs, which indicate a perfect fit.
log10(predict(lm(y ~ x, data = DF)))
# 1 2 3
#NaN 2.568202 2.937016
predict(lm(log10(y) ~ x, data = DF))
#1 2 3
#1 2 3
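It follows that a manual fit should only match geom_smooth if it is fitted to the log10-transformed variables. Here is a minimal sketch with the same toy DF; back-transforming with 10^ is my addition, so that the values can be handed to a layer that sits under scale_y_log10():

```r
DF <- data.frame(x = 1:3, y = c(10, 100, 1e3))

# Fit on the transformed response, as ggplot2 does after scale_y_log10().
fit_log <- lm(log10(y) ~ x, data = DF)

# Predictions are on the log10 scale; convert them back to the original
# scale before plotting, because the scale applies log10 again.
pred_log  <- predict(fit_log)
pred_orig <- 10^pred_log
pred_orig
```

For the question's data this would presumably mean fitting something like loess(log10(Flux) ~ log10(DA_SQ_KM), data = data5a) and back-transforming both the x grid and the predictions. I have not run this, since the data file is not available here.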

Estimating the Poisson distribution

I have a graph and I calculated the degree distribution and degree as follows:
library(igraph) # for these two functions
dd <- degree_distribution(graph)
d <- degree(graph)
From this, I fitted a power law to see whether my distribution follows one:
degree = 1:max(d)
probability = dd[-1]
nonzero.position = which(probability != 0)
probability = probability[nonzero.position]
degree = degree[nonzero.position]
reg = lm(log(probability) ~ log(degree))
cozf = coef(reg)
power.law.fit = function(x) exp(cozf[[1]] + cozf[[2]] * log(x))
From that, I plotted the points and power law using ggplot2.
Resulting in the following image:
df <- data.frame(x = degree, y = probability)
print(
ggplot(df, aes(x,y,colour="Distribution"))+
geom_point(shape = 4) +
stat_function(fun = power.law.fit, geom = "line", aes(colour="Power Law"))+
labs(title = "Graph", subtitle = "Degree distribution",
x="K", y="P(k)", colour="Legend")+
scale_color_brewer(palette="Dark2")
)
As you can see, my distribution does not follow Power Law! I would like to estimate the Poisson distribution and plot on the same graph.
Even though I'm not sure whether my distribution follows a Poisson distribution, I would like to draw it together with the power law. I have no idea how to estimate the Poisson distribution from the data, or how to calculate the average degree.
Can anyone help me?
The graph used to calculate the distribution and the degree is very large (700 thousand vertices), so I did not put the data of the graphs. The explanation of the answer can be based on any graph.
From ?dpois:
The Poisson distribution has density
p(x) = λ^x exp(-λ)/x!
for x = 0, 1, 2, … . The mean and variance are E(X) = Var(X) = λ.
So I'll generate some dummy data with a secret lambda:
mysecret <- ####
x <- data.frame(xes = rpois(50, mysecret))
> x$xes
[1] 0 2 2 1 1 4 1 1 0 2 2 2 1 0 0 1 2 3 2 4 2 1 0 3 2 1 3 1 2 1 5 0 2 3 2 1 0 1 2 3 0 1 2 2 0 3 2 2 2 3
> mean(x$xes)
[1] 1.66
> var(x$xes)
[1] 1.371837
So two good guesses for my secret lambda are 1.66 and 1.37. Let's try them:
library(ggplot2)
ggplot(x, aes(xes)) +
geom_histogram(aes(y = after_stat(density), color = "Raw data"),
fill = "white", binwidth = 1, center = 0, size = 1.5) +
stat_summary(fun = dpois, aes(x = xes, y = xes, color = "Density based on E(X)"),
fun.args = list(lambda = 1.66), geom = "line", size = 1.5) +
stat_summary(fun = dpois, aes(x = xes, y = xes, color = "Density based on Var(X)"),
fun.args = list(lambda = 1.37), geom = "line", size = 1.5)
They're both pretty good. You can't really use the built-in stat_function or geom_density for generating these, since Poisson distributions are only defined for integers. The histogram and summary functions work well though, since they're only estimated at the data points themselves, not interpolated.
If you want more detail, you can use the MASS package:
MASS::fitdistr(x$xes, dpois, start = list(lambda = 1))
lambda
1.6601563
(0.1822258)
So let's try constructing from that:
library(dplyr)
df <- tibble(xes = seq.int(max(x$xes)+1)-1,
dens.m = dpois(xes, 1.66),
dens.u = dpois(xes, 1.66+0.18),
dens.l = dpois(xes, 1.66-0.18))
> df
# A tibble: 6 x 4
xes dens.m dens.u dens.l
<dbl> <dbl> <dbl> <dbl>
1 0 0.19013898 0.15881743 0.22763769
2 1 0.31563071 0.29222406 0.33690378
3 2 0.26197349 0.26884614 0.24930880
4 3 0.14495866 0.16489230 0.12299234
5 4 0.06015785 0.07585046 0.04550717
6 5 0.01997240 0.02791297 0.01347012
ggplot(x, aes(xes)) +
geom_histogram(aes(y = after_stat(density)), color = "black",
fill = "white", binwidth = 1, center = 0, size = 1.5) +
geom_ribbon(data = df, aes(xes, ymin = dens.l, ymax = dens.u), fill = "grey50", alpha = 0.5) +
geom_line(data = df, aes(xes, dens.m, color = "Based on E(X)\n+/-1 SD of lambda"), size = 1.5)
Based on these two methods and visual interpretation, you should feel comfortable saying λ = 1.66+/-0.18.
For reference, my secret initial value was 1.5.
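The fitdistr result can also be reproduced by hand, because the maximum-likelihood estimate of a Poisson λ is simply the sample mean, with approximate standard error sqrt(λ/n). Here is a base R sketch using the 50 values printed above (lambda_hat, se_lambda and pois_df are names I made up):

```r
# The 50 simulated values printed earlier in this answer.
xes <- c(0, 2, 2, 1, 1, 4, 1, 1, 0, 2, 2, 2, 1, 0, 0, 1, 2, 3, 2, 4,
         2, 1, 0, 3, 2, 1, 3, 1, 2, 1, 5, 0, 2, 3, 2, 1, 0, 1, 2, 3,
         0, 1, 2, 2, 0, 3, 2, 2, 2, 3)

lambda_hat <- mean(xes)                        # MLE of lambda
se_lambda  <- sqrt(lambda_hat / length(xes))   # approximate standard error

# Poisson probabilities at the integer values only.
k <- 0:max(xes)
pois_df <- data.frame(k = k, density = dpois(k, lambda_hat))
pois_df
```

For a graph, the average degree is mean(d), with d <- degree(graph) as in the question, and dpois(k, mean(d)) evaluated at the integer degrees k gives the Poisson curve to draw next to the power law.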
