Explain the code underlying a linear model in R visualised with ggplot

I am trying to understand how linear modelling can be used as an alternative to the t-test when analysing gene expression data. For a single gene, I have a dataframe of 20 expression values: 10 in group 1 and 10 in group 2.
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
                  group = c(rep(1, 10), rep(2, 10)))
The data can be (box)plotted using ggplot as shown below:
plot <- gexp %>%
  ggplot(aes(x = group, y = expression)) +
  geom_boxplot() +
  geom_point()
plot
I wish to model the expression in groups 1 and 2 using the regression formula:
Y = Beta0 + (Beta1 * X) + e, where Y is the expression I want to model and X encodes the two groups as 0 and 1 respectively. The expression in group 1 (where X = 0) is therefore equal to Beta0, and the expression in group 2 (where X = 1) is equal to Beta0 + Beta1.
If this is modelled with:
mod1 <- lm(expression ~ group, data = gexp)
mod1
The above code outputs an intercept of 2.75 and a slope of 1.58. It is the visualisation of the linear model that I don't understand. I would be grateful for a clear explanation of the code below:
plot +
  geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
             aes(x = x, y = y),
             colour = "red", size = 5) +
  geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
              slope = coefficients(mod1)[2])
I get why the data.frame values are the ones chosen (the value of 4.33 is the sum of the intercept, Beta0, and the slope, Beta1), but it is the geom_abline arguments I do not understand. Why is the intercept calculated as shown? The text I am using states: '...we need to subtract the slope from the intercept when plotting the linear model because groups 1 and 2 are encoded as 0 and 1 in the model, but plotted as 1 and 2 on the figure.' I don't follow this point and would be grateful for an explanation, without getting too technical.

I believe your code would be correct if the group variable were encoded as a factor.
library(ggplot2)
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
                  group = factor(c(rep(1, 10), rep(2, 10))))
plot <-
  ggplot(gexp, aes(x = group, y = expression)) +
  geom_boxplot() +
  geom_point()
mod1 <- lm(expression ~ group, data = gexp)
plot +
  geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
             aes(x = x, y = y),
             colour = "red", size = 5) +
  geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
              slope = coefficients(mod1)[2])
Created on 2022-03-30 by the reprex package (v2.0.1)
To understand the difference between factors and integers in specifying linear models, you can have a look at the model matrix.
model.matrix(y ~ f, data = data.frame(f = 1:3, y = 1))
#>   (Intercept) f
#> 1           1 1
#> 2           1 2
#> 3           1 3
#> attr(,"assign")
#> [1] 0 1
model.matrix(y ~ f, data = data.frame(f = factor(1:3), y = 1))
#>   (Intercept) f2 f3
#> 1           1  0  0
#> 2           1  1  0
#> 3           1  0  1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$f
#> [1] "contr.treatment"
Created on 2022-03-30 by the reprex package (v2.0.1)
In the first model matrix, what you specify is what you get: you're modelling something as a function of the intercept and the f variable. In this model, f = 2 contributes twice as much as f = 1.
This works a little differently when f is a factor. A k-level factor is split into k - 1 dummy variables, where each dummy variable encodes with 1 or 0 whether an observation deviates from the reference level (the first factor level). Modelled this way, the 2nd factor level is not assumed to be twice the 1st.
Because ggplot2 displays the first factor level at position 1 and not at position 0 (which is where the model places it), the calculated intercept is off by exactly one slope unit. You need to subtract 1 * slope from the intercept to make the line display correctly in ggplot2.
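To convince yourself, you can also re-encode the groups as 0/1 and skip the adjustment entirely. A minimal sketch, assuming the gexp and mod1 objects from above (the group01 column is a hypothetical helper):
# Recode factor levels 1/2 as numeric 0/1 so the model and the plot share an x-axis
gexp$group01 <- as.numeric(gexp$group) - 1
mod2 <- lm(expression ~ group01, data = gexp)
coef(mod2)  # same coefficients as mod1: intercept 2.75, slope 1.58
ggplot(gexp, aes(x = group01, y = expression)) +
  geom_point() +
  geom_abline(intercept = coef(mod2)[1], slope = coef(mod2)[2])
Here the points sit at x = 0 and x = 1, exactly where the model places them, so the unadjusted intercept draws the line through both group means.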

Related

Calculating odds ratios between deciles of data in R

Similar to: How to calculate and plot odds-ratios and their standard errors from a logistic regression in R?
But I would like to plot the Phenotypes separately in the plot.
Data (subset of 20,000 similar lines):
ID PHENO    SCORE
 1     1   -0.001
 2     1    0.132
 3     1    0.023
 4     0 -0.20032
 5     1   -0.002
 6     0    0.012
 7     1    -0.23
 8     0    0.321
 9     0    -0.21
10     0   -0.497
I have then run a logistic regression (glm) on this data. I would like to put the scores into deciles, or some other meaningful division, and then work out the odds ratio of having the phenotype (1 = has the disease, 0 = control) per division of score, ideally between cases and controls, using R.
To split into deciles I do:
library(dplyr)
# place each value into a decile
data$decile <- ntile(data$SCORE, 10)
I then followed the question above, but want the plot to show the cases and controls separately. I would like to end up with an image like the one below (with case (1) vs control (0) from the PHENO column):
Any help would be appreciated.
First of all, I generated some random data to make the example reproducible. You can make your target and decile variables factors, and extract the odds ratios and confidence intervals with coef and confint wrapped in exp. After joining the estimates back to the data, you can take the mean per decile and PHENO. To create the graph you can use geom_pointrange like this:
# Generate random data
set.seed(7)
data <- data.frame(ID = rep(c(1:10), 2000),
                   PHENO = sample(c(0, 1), replace = TRUE, size = 20000),
                   SCORE = rnorm(20000, 0, 1))
library(dplyr)
library(ggplot2)
# place each value into a decile
data <- data %>% mutate(decile = ntile(SCORE, 10))
# convert PHENO and decile to factor
data$PHENO <- as.factor(data$PHENO)
data$decile <- as.factor(data$decile)
# model
fit <- glm(PHENO ~ decile, data = data, family = 'binomial')
# Extract odds ratios with confidence intervals (the first row is the
# intercept, i.e. the odds for decile 1; the remaining rows are odds
# ratios relative to decile 1)
results <- as.data.frame(exp(cbind(coef(fit), confint(fit))))
#> Waiting for profiling to be done...
# Rename the columns of the results dataframe
colnames(results) <- c('odds_ratio', '2.5', '97.5')
# Add a decile column so the estimates can be joined back to the data
results$decile <- factor(1:10)
# Join data and results dataframe
data <- left_join(data, results, by = 'decile')
# Take the mean per decile and phenotype
data_sum <- data %>%
  group_by(decile, PHENO) %>%
  summarise(odds_ratio = mean(odds_ratio),
            `2.5` = mean(`2.5`),
            `97.5` = mean(`97.5`))
#> `summarise()` has grouped output by 'decile'. You can override using the
#> `.groups` argument.
# plot
ggplot(data_sum, aes(x = decile, y = odds_ratio, ymin = `2.5`, ymax = `97.5`,
                     color = PHENO, shape = PHENO)) +
  geom_pointrange(position = position_dodge(width = 0.4)) +
  scale_color_manual(values = c('blue', 'green')) +
  scale_shape_manual(values = c(18, 16)) +
  guides(shape = 'none') +
  theme_classic() +
  labs(x = 'Decile', y = 'Odds ratio', color = '')
Created on 2022-10-29 with reprex v2.0.2
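As a side note, broom::tidy can produce the same exponentiated estimates and profile confidence intervals in one tidy dataframe (a sketch, assuming the fit object from above):
library(broom)
# one row per model term, exponentiated, with 95% profile confidence intervals
tidy(fit, exponentiate = TRUE, conf.int = TRUE)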

Plotting prediction intervals for mixed effects model

I have implemented a mixed effects model for my experiment on how error rate affects reaction time. I now want to calculate prediction intervals and then plot them.
Here is an example of my df
     ppid error_rate      RT pNum
1   1_1_4   2.865371 0.43339    1
2  1_1_77  11.459301 0.45000    1
3  1_1_80   2.865371 0.38320    1
4  1_2_26   3.820155 0.49990    1
5  1_2_31   2.865371 0.56680    1
6  1_2_32   3.820155 0.58330    1
7  1_2_33   2.865371 0.50000    1
8  1_2_40   3.820155 0.44980    1
9  1_2_43   2.865371 0.56660    1
10 1_2_54  11.459301 0.46670    1
11 1_2_63   2.865371 0.43350    1
12 1_2_64   2.865371 0.46680    1
13 1_2_71   2.865371 0.54990    1
14 1_2_76   2.865371 0.48350    1
15 1_2_85   2.865371 0.53340    1
16 1_2_88   3.820155 0.43340    1
17 1_2_89   3.820155 0.53320    1
18 1_3_0    3.820155 0.45080    1
19 1_3_1    2.865371 0.45022    1
20 1_3_19   2.865371 0.46651    1
I then implement the mixed effects model, generate some prediction intervals for each data point and then combine my original data with the predictions:
library(lme4)
library(merTools)
library(ggplot2)
fit <- lmer(formula = RT ~ error_rate + (1 + error_rate | pNum), data = data)
pred <- cbind(data, predictInterval(fit, data))
I then plot this using ggplot and get the following plot:
ggplot(pred) +
  geom_line(aes(x = error_rate, y = fit)) +
  geom_ribbon(aes(x = error_rate, ymin = lwr, ymax = upr), alpha = .2) +
  geom_jitter(aes(x = error_rate, y = RT), alpha = .1) +
  ylab("RT")
My plot makes sense to me: I have the black line indicating the predicted values for each error rate, and a shaded area denoting the intervals. However, I'm unsure why I'm getting straight vertical lines in the middle of each error rate level within my data points. Also, my horizontal prediction line seems wonky... does anybody know why this might be, and how to eradicate it? Many thanks!
The vertical lines appear because each error_rate value carries several predictions (one per observation), and geom_line connects all of them, running up and down within each x position. One way to draw a single line through the error_rate values without those vertical segments is to plot the mean of the y variable fit at each error_rate. This is done with stat_summary, as below.
ggplot(pred, aes(x = error_rate, y = fit)) +
  stat_summary(fun.y = mean, geom = "line", show.legend = FALSE) +
  geom_ribbon(aes(x = error_rate, ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_jitter(aes(x = error_rate, y = RT), alpha = 0.1) +
  ylab("RT")
Note: in the question's code, the ribbon is plotted with alpha = 0.2 and the points with alpha = 0.1. It would arguably make more sense to have the points less transparent than the underlying prediction band, i.e. to swap the two alpha values.
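Another option is to collapse the predictions to one row per error_rate before plotting, which also removes the vertical jumps from the ribbon edges. A sketch, assuming the pred dataframe from the question (pred_line is a hypothetical name):
library(dplyr)
# average the fitted values and interval bounds within each error_rate
pred_line <- pred %>%
  group_by(error_rate) %>%
  summarise(fit = mean(fit), lwr = mean(lwr), upr = mean(upr))
ggplot() +
  geom_ribbon(data = pred_line, aes(x = error_rate, ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line(data = pred_line, aes(x = error_rate, y = fit)) +
  geom_jitter(data = pred, aes(x = error_rate, y = RT), alpha = 0.3) +
  ylab("RT")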

R loess prediction does not match ggplot geom_smooth(). Error in my prediction formula?

I am trying to predict, on my own, the loess values that ggplot's geom_smooth() provides. I have attached links to my data and the output plot of the predictions.
The data can be found here. I followed an example provided in this post about loess prediction to reproduce the values from ggplot, so I think I am on the right track, but I am missing something.
library("ggplot2")
load(file="data5a.RData")
lsmod = loess(Flux ~ DA_SQ_KM, data = data5a, control = loess.control(surface = "direct"))
xrange <- max(data5a$DA_SQ_KM, na.rm = TRUE)
xseq <- c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5, seq(from = 1, to = xrange, length = 100))
pred = predict(lsmod, newdata = data.frame(DA_SQ_KM = xseq), se = TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(data5a, aes(DA_SQ_KM, Flux)) +
  geom_point() +
  geom_smooth(method = "loess") +
  geom_smooth(aes_auto(loess.DF), data = loess.DF, stat = "identity", col = "red") +
  geom_smooth(method = "lm", se = FALSE, col = "green") +
  theme(legend.position = "bottom") +
  scale_y_log10() +
  scale_x_log10()
Where is the error in my code for reproducing the data in the blue curve that is predicted by geom_smooth()?
Here is an image of the output within ggplot:
UPDATE 1:
I have included updated code here based on input provided by Roland. I have modified my code to use the mgcv::gam function, since my data contains more than 1,000 points (the threshold at which geom_smooth's default method switches from loess to gam). The issue remains that I cannot reproduce the model created by geom_smooth within ggplot. A new issue has also emerged with the confidence intervals.
library("ggplot2")
library("mgcv")
load(file="data5a.RData")
# Attempt to re-create the gam model myself
gammod = mgcv::gam(Flux ~ s(DA_SQ_KM, bs = "cs"), data = data5a)
xrange <- max(data5a$DA_SQ_KM, na.rm = TRUE)
xseq <- c(0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, seq(from = 1, to = xrange, length = 100))
pred = predict(gammod, newdata = data.frame(DA_SQ_KM = xseq), se = TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
gam.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(data5a, aes(DA_SQ_KM, Flux)) +
  geom_point() +
  geom_smooth(aes_auto(gam.DF), data = gam.DF, stat = "identity", col = "red") +
  stat_smooth(method = mgcv::gam, formula = y ~ s(x, bs = "cs"), se = TRUE, col = "purple") +
  theme(legend.position = "bottom") +
  scale_y_log10() +
  scale_x_log10()
Here is the gam output within ggplot:
ggplot2 fits the model to the transformed variables if you use scale_* transformations:
DF <- data.frame(x = 1:3, y = c(10, 100, 1e3))
library(ggplot2)
p <- ggplot(DF, aes(x, y)) +
  geom_point() +
  scale_y_log10() +
  stat_smooth(method = "lm", n = 3)
g <- ggplot_build(p)
g[["data"]][[2]]
#   x y ymin ymax se PANEL group  colour   fill size linetype weight alpha
# 1 1 1    1    1  0     1    -1 #3366FF grey60    1        1      1   0.4
# 2 2 2    2    2  0     1    -1 #3366FF grey60    1        1      1   0.4
# 3 3 3    3    3  0     1    -1 #3366FF grey60    1        1      1   0.4
Note the zero SEs, which indicate a perfect fit.
log10(predict(lm(y ~ x, data = DF)))
#        1        2        3
#      NaN 2.568202 2.937016
predict(lm(log10(y) ~ x, data = DF))
# 1 2 3
# 1 2 3
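Applied to the loess question above, the practical consequence is that the reference model should be fitted to the log10-transformed variables, because that is what geom_smooth sees once scale_x_log10 and scale_y_log10 are in place. A sketch under that assumption, using the data5a object from the question (lsmod_log, xgrid and loess.log.DF are hypothetical names; span and degree are left at their shared defaults):
# Fit on the scale geom_smooth actually sees after the scale_*_log10 transforms
lsmod_log <- loess(log10(Flux) ~ log10(DA_SQ_KM), data = data5a)
# Predict on a grid evenly spaced in log10(x), as ggplot2 does
xgrid <- 10^seq(log10(min(data5a$DA_SQ_KM, na.rm = TRUE)),
                log10(max(data5a$DA_SQ_KM, na.rm = TRUE)), length.out = 80)
pr <- predict(lsmod_log, newdata = data.frame(DA_SQ_KM = xgrid), se = TRUE)
ci <- pr$se.fit * qt(0.975, pr$df)
# Back-transform so the curve can be drawn in the original units
loess.log.DF <- data.frame(DA_SQ_KM = xgrid,
                           Flux = 10^pr$fit,
                           ymin = 10^(pr$fit - ci),
                           ymax = 10^(pr$fit + ci))
ggplot(data5a, aes(DA_SQ_KM, Flux)) +
  geom_point() +
  geom_smooth(method = "loess") +
  geom_ribbon(data = loess.log.DF, aes(ymin = ymin, ymax = ymax),
              fill = "red", alpha = 0.2) +
  geom_line(data = loess.log.DF, colour = "red") +
  scale_y_log10() +
  scale_x_log10()
The red curve should now track the blue geom_smooth curve closely, up to differences in the prediction grid.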

How to plot lm slope modeled using poly()?

I need to plot the relationship between x and y, where polynomials of x predict y. This is done using the poly() function, which ensures the polynomials are orthogonal.
How do I plot this relationship considering the linear, quadratic and cubic terms together? The issue is that the coefficients for the different terms are not on the same scale as x.
I provide some example code below. I have tried reassigning the corresponding contrast values for each polynomial to x, but this solution gives impossible predicted values.
Thank you in advance for your help!
Best wishes,
Eric
Here is some example code:
x = sample(0:6, 100, replace = TRUE)
y = (x * 0.2) + (x^2 * .05) + (x^3 * 0.001)
y = y + rnorm(100)
x = poly(x, 3)
m = lm(y ~ x)
TAB = summary(m)$coefficients
### Reassigning the corresponding contrast values to each polynomial of x:
eq = function(x, TAB, start) {
  # 'start' gives the row of the linear coefficient; quadratic and cubic follow
  pols = poly(x, 3)
  x1 = pols[, 1]; x2 = pols[, 2]; x3 = pols[, 3]
  TAB[1, 1] + x1[x] * TAB[start, 1] + x2[x] * TAB[start + 1, 1] + x3[x] * TAB[start + 2, 1]
}
plot(eq(0:7, TAB, 2))
Actually, you can use poly() directly in the formula for lm(): y ~ poly(x, 3) might be what you want.
For the plot, I'll use the ggplot2 package, whose geom_smooth() function can draw the fitted curve. You should specify the method = "lm" argument and the formula:
library(tidyverse)
x <- sample(0:6, 100, replace = TRUE)
y <- (x * 0.2) + (x^2 * .05) + (x^3 * 0.001)
eps <- rnorm(100)
(df <- data_frame(y = y + eps, x = x))
#> # A tibble: 100 x 2
#>         y     x
#>     <dbl> <int>
#>  1  3.34      4
#>  2  1.23      5
#>  3  1.38      3
#>  4 -0.115     2
#>  5  1.94      5
#>  6  3.87      6
#>  7 -0.707     3
#>  8  0.954     3
#>  9  1.19      3
#> 10 -1.34      0
#> # ... with 90 more rows
Using your simulated data set,
df %>%
  ggplot() + # the data set should be declared first
  aes(x, y) + # aesthetics
  geom_point() + # data points
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) # lm fit
If you want to remove the points, erase geom_point():
df %>%
  ggplot() +
  aes(x, y) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3))
Transparency solution: set alpha to a value less than 1:
df %>%
  ggplot() +
  aes(x, y) +
  geom_point(alpha = .3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3))
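If you do want the predicted values by hand (for a base-graphics plot, say), the safe route is to let predict() apply the orthogonal-polynomial scaling stored in the fitted model instead of rebuilding the basis yourself. A sketch, assuming the df tibble from above (m2 and newdat are hypothetical names):
# predict() re-applies the poly() coefficients saved in the model object,
# so the raw x values never need to be matched to the basis manually
m2 <- lm(y ~ poly(x, 3), data = df)
newdat <- data.frame(x = seq(0, 6, by = 0.1))
newdat$y_hat <- predict(m2, newdata = newdat)
plot(y ~ x, data = df, col = "grey50")
lines(y_hat ~ x, data = newdat, lwd = 2)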

Estimating the Poisson distribution

I have a graph and I calculated the degree distribution and degree as follows:
library(igraph) # for these two functions
dd <- degree_distribution(graph)
d <- degree(graph)
From this, I fitted a power law, to see whether my degree distribution follows one:
degree = 1:max(d)
probability = dd[-1]
nonzero.position = which(probability != 0)
probability = probability[nonzero.position]
degree = degree[nonzero.position]
reg = lm(log(probability) ~ log(degree))
cozf = coef(reg)
power.law.fit = function(x) exp(cozf[[1]] + cozf[[2]] * log(x))
From that, I plotted the points and the fitted power law using ggplot2, resulting in the following image:
df <- data.frame(x = degree, y = probability)
print(
  ggplot(df, aes(x, y, colour = "Distribution")) +
    geom_point(shape = 4) +
    stat_function(fun = power.law.fit, geom = "line", aes(colour = "Power Law")) +
    labs(title = "Graph", subtitle = "Degree Distribution",
         x = "K", y = "P(k)", colour = "Legend") +
    scale_color_brewer(palette = "Dark2")
)
As you can see, my distribution does not follow a power law! I would like to estimate a Poisson distribution and plot it on the same graph.
Even though I'm not sure whether my distribution follows a Poisson distribution or not, I would like to draw it together with the power law. I have no idea how to estimate this distribution from the data or how to calculate the average degree.
Can anyone help me?
The graph used to calculate the distribution and the degrees is very large (700 thousand vertices), so I have not included its data; the explanation can be based on any graph.
From ?dpois:
The Poisson distribution has density
p(x) = λ^x exp(-λ)/x!
for x = 0, 1, 2, … . The mean and variance are E(X) = Var(X) = λ.
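As a quick sanity check of that formula (λ = 1.5 here is an arbitrary choice), the density computed by hand matches dpois():
> lambda <- 1.5
> lambda^2 * exp(-lambda) / factorial(2)  # p(2) from the formula
[1] 0.2510214
> dpois(2, lambda)
[1] 0.2510214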
So I'll generate some dummy data with a secret lambda:
mysecret <- ####
x <- data.frame(xes = rpois(50, mysecret))
> x$xes
[1] 0 2 2 1 1 4 1 1 0 2 2 2 1 0 0 1 2 3 2 4 2 1 0 3 2 1 3 1 2 1 5 0 2 3 2 1 0 1 2 3 0 1 2 2 0 3 2 2 2 3
> mean(x$xes)
[1] 1.66
> var(x$xes)
[1] 1.371837
So two good guesses for my secret lambda are 1.66 and 1.37. Let's try them:
library(ggplot2)
ggplot(x, aes(xes)) +
  geom_histogram(aes(y = ..density.., color = "Raw data"),
                 fill = "white", binwidth = 1, center = 0, size = 1.5) +
  stat_summary(fun.y = dpois, aes(x = xes, y = xes, color = "Density based on E(X)"),
               fun.args = list(lambda = 1.66), geom = "line", size = 1.5) +
  stat_summary(fun.y = dpois, aes(x = xes, y = xes, color = "Density based on Var(X)"),
               fun.args = list(lambda = 1.37), geom = "line", size = 1.5)
They're both pretty good. You can't really use the built-in stat_function or geom_density to draw these, since the Poisson distribution is only defined on the integers. The histogram and summary functions work well though, since they are evaluated only at the data points themselves rather than interpolated.
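If you prefer to draw the fitted pmf explicitly instead of going through stat_summary, you can evaluate dpois() on the integers yourself and add points and segments. A sketch, assuming the x dataframe from above (pmf is a hypothetical name):
# Evaluate the Poisson pmf only at integer k, then overlay it on the histogram
pmf <- data.frame(k = 0:max(x$xes))
pmf$p <- dpois(pmf$k, lambda = mean(x$xes))
ggplot(x, aes(xes)) +
  geom_histogram(aes(y = ..density..), fill = "white", color = "black",
                 binwidth = 1, center = 0) +
  geom_segment(data = pmf, aes(x = k, xend = k, y = 0, yend = p), color = "red") +
  geom_point(data = pmf, aes(x = k, y = p), color = "red", size = 2)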
If you want more detail, you can use the MASS package:
MASS::fitdistr(x$xes, dpois, start = list(lambda = 1))
    lambda
  1.6601563
 (0.1822258)
So let's try constructing from that:
library(dplyr)
df <- data_frame(xes = seq.int(max(x$xes) + 1) - 1,
                 dens.m = dpois(xes, 1.66),
                 dens.u = dpois(xes, 1.66 + 0.18),
                 dens.l = dpois(xes, 1.66 - 0.18))
> df
# A tibble: 6 x 4
    xes     dens.m     dens.u     dens.l
  <dbl>      <dbl>      <dbl>      <dbl>
1     0 0.19013898 0.15881743 0.22763769
2     1 0.31563071 0.29222406 0.33690378
3     2 0.26197349 0.26884614 0.24930880
4     3 0.14495866 0.16489230 0.12299234
5     4 0.06015785 0.07585046 0.04550717
6     5 0.01997240 0.02791297 0.01347012
ggplot(x, aes(xes)) +
  geom_histogram(aes(y = ..density..), color = "black",
                 fill = "white", binwidth = 1, center = 0, size = 1.5) +
  geom_ribbon(data = df, aes(xes, ymin = dens.l, ymax = dens.u), fill = "grey50", alpha = 0.5) +
  geom_line(data = df, aes(xes, dens.m, color = "Based on E(X)\n+/-1 SD of lambda"), size = 1.5)
Based on these two methods and visual interpretation, you should feel comfortable saying λ = 1.66+/-0.18.
For reference, my secret initial value was 1.5.
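If visual inspection is not enough, a chi-squared goodness-of-fit test against the fitted Poisson is a reasonable follow-up. A sketch, assuming the x dataframe from above; the upper tail is folded into the last cell so the probabilities sum to 1, and small expected counts in the tail mean the p-value should be read with some care:
kmax <- max(x$xes)
obs <- as.vector(table(factor(x$xes, levels = 0:kmax)))
p <- dpois(0:kmax, lambda = mean(x$xes))
p[kmax + 1] <- 1 - sum(p[-(kmax + 1)])  # fold P(X >= kmax) into the last cell
chisq.test(obs, p = p)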
