For a very basic demonstration, I'm trying to show that a log-transformed linear model fits a given set of data best. To demonstrate that, I'm comparing it to a standard lm and a square-root transform, to show graphically that the log transform fits best compared to the other two. The question is: how do I create multiple overlapping lm lines in one plot? If I could label them, that would also be great.
Here is sample data with a starter ggplot:
library(tidyverse)
p <- runif(100, 1, 100)
q <- 6 + 3 * log(p) + rnorm(100)
sample <- data.frame(p, q)
ggplot(data = sample) +
geom_point(mapping = aes(x = p, y = q))
You could compute the lines yourself, e.g. like this:
# Make a tibble containing name of transform and the actual function
transforms <- tibble(Transform = c("log", "sqrt", "linear"),
Function = list(log, sqrt, function(x) x))
# Compute the regression coefs and turn it into a tidy table
lm_df <- transforms %>%
group_by(Transform) %>%
group_modify(~ {
lm(q ~ .x$Function[[1]](p), data = sample) %>%
broom::tidy() %>%
select(term, estimate) %>%
pivot_longer(estimate) %>%
mutate(Function = .x$Function)
})
> lm_df
# A tibble: 6 x 5
# Groups: Transform [3]
Transform term name value Function
<chr> <chr> <chr> <dbl> <list>
1 linear (Intercept) estimate 12.6 <fn>
2 linear .x$Function[[1]](p) estimate 0.0834 <fn>
3 log (Intercept) estimate 5.89 <fn>
4 log .x$Function[[1]](p) estimate 2.99 <fn>
5 sqrt (Intercept) estimate 9.35 <fn>
6 sqrt .x$Function[[1]](p) estimate 1.11 <fn>
# Evaluate the functions at different x values
lm_df <- lm_df %>%
pivot_wider(names_from = term, values_from = value) %>%
rename("Intercept" = `(Intercept)`, "Slope" = `.x$Function[[1]](p)`) %>%
group_modify(~ {
tibble(
y = .x$Intercept + .x$Slope * .x$Function[[1]](seq(0, max(sample$p))),
x = seq(0, max(sample$p))
)
})
> lm_df
# A tibble: 300 x 3
# Groups: Transform [3]
Transform y x
<chr> <dbl> <int>
1 linear 12.6 0
2 linear 12.7 1
3 linear 12.8 2
4 linear 12.9 3
5 linear 12.9 4
6 linear 13.0 5
7 linear 13.1 6
8 linear 13.2 7
9 linear 13.3 8
10 linear 13.4 9
# ... with 290 more rows
# Plot the functions
ggplot() +
geom_point(data = sample, mapping = aes(x = p, y = q)) +
geom_line(data = lm_df, aes(x = x, y = y, color = Transform))
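If you would rather not transcribe the coefficients by hand, here is a sketch of an alternative (assuming the same sample data frame as above): fit each model once and let predict() evaluate it over a grid.
# Sketch: fit each model, then let predict() evaluate it over a grid
models <- list(
  linear = lm(q ~ p, data = sample),
  log    = lm(q ~ log(p), data = sample),
  sqrt   = lm(q ~ sqrt(p), data = sample)
)
grid <- data.frame(p = seq(min(sample$p), max(sample$p), length.out = 200))
pred_df <- purrr::map_dfr(models,
                          ~ data.frame(grid, q = predict(.x, newdata = grid)),
                          .id = "Transform")
ggplot(sample, aes(p, q)) +
  geom_point() +
  geom_line(data = pred_df, aes(colour = Transform))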
The following doesn't handle the labeling (you could use annotate() to add labels manually; see the sketch after the code), but it is much more compact:
gg0 <- ggplot(data = sample, aes(x=p, y=q)) + geom_point()
gg0 + geom_smooth(method="lm", formula=y~x) +
geom_smooth(method="lm", formula=y~log(x), colour="red") +
geom_smooth(method="lm", formula=y~sqrt(x), colour="purple")
Related
I am trying to understand how linear modelling can be used as an alternative to the t-test when analysing gene expression data. For a single gene, I have a data frame of 20 gene expression values in total: group 1 (n = 10) and group 2 (n = 10).
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
group = c(rep(1, 10), rep(2, 10)))
The data can be (box)plotted using ggplot as shown below:
plot <- gexp %>%
ggplot(aes(x = group, y = expression)) +
geom_boxplot() +
geom_point()
plot
I wish to model the expression in groups 1 and 2 using the regression formula:
Y = Beta0 + Beta1*X + e
where Y is the expression I want to model and X represents the two groups, which are encoded as 0 and 1 respectively. Therefore, the expression in group 1 (when X = 0) is equal to Beta0, and the expression in group 2 (when X = 1) is equal to Beta0 + Beta1.
If this is modelled with:
mod1 <- lm(expression ~ group, data = gexp)
mod1
The above code outputs an intercept of 2.75 and a slope of 1.58. It is the visualisation of the linear model that I don't understand. I would be grateful for a clear explanation of the below code:
plot +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
I get why the data.frame values are the ones chosen (the value of 4.33 is the sum of the intercept, Beta0, and the slope, Beta1), but it is the geom_abline() arguments I do not understand. Why is the intercept calculated as shown? The text I am using states, '...we need to subtract the slope from the intercept when plotting the linear model because groups 1 and 2 are encoded as 0 and 1 in the model, but plotted as 1 and 2 on the figure.' I don't follow this point and would be grateful for an explanation, without getting too technical.
I believe your code is correct, provided the group variable is encoded as a factor.
library(ggplot2)
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
group = factor(c(rep(1, 10), rep(2, 10))))
plot <-
ggplot(gexp, aes(x = group, y = expression)) +
geom_boxplot() +
geom_point()
mod1 <- lm(expression ~ group, data = gexp)
plot +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
Created on 2022-03-30 by the reprex package (v2.0.1)
To understand the difference between factors and integers in specifying linear models, you can have a look at the model matrix.
model.matrix(y ~ f, data = data.frame(f = 1:3, y = 1))
#> (Intercept) f
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> attr(,"assign")
#> [1] 0 1
model.matrix(y ~ f, data = data.frame(f = factor(1:3), y = 1))
#> (Intercept) f2 f3
#> 1 1 0 0
#> 2 1 1 0
#> 3 1 0 1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$f
#> [1] "contr.treatment"
Created on 2022-03-30 by the reprex package (v2.0.1)
In the first model matrix, what you specify is what you get: you're modelling something as a function of the intercept and the f variable. In this model, you assume that f = 2 has twice the effect of f = 1.
This works a little differently when f is a factor. A k-level factor is split up into k-1 dummy variables, where each dummy variable encodes with 1 or 0 whether an observation deviates from the reference level (the first factor level). By modelling it this way, you don't assume that the 2nd factor level has twice the effect of the 1st.
Because ggplot2 displays the first factor level at position 1 and not at position 0 (which is how it is modelled), your calculated intercept is off. You need to subtract 1 * slope from the calculated intercept to get it to display correctly in ggplot2.
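A minimal sketch to see the same point the other way (dummy is a hypothetical helper column added here): if you plot expression against the 0/1 dummy that the model actually uses, geom_abline() needs no correction at all.
# plot against the modelled 0/1 dummy; no intercept correction needed
gexp$dummy <- as.numeric(gexp$group) - 1   # group 1 -> 0, group 2 -> 1
ggplot(gexp, aes(x = dummy, y = expression)) +
  geom_point() +
  geom_abline(intercept = coefficients(mod1)[1], slope = coefficients(mod1)[2])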
I have two GAMs with the same predictor variables but different response variables. I would like to combine the two GAMs into a set of plots where the smooth component (partial residuals) of each predictor variable appears in the same panel (differentiated with e.g. colour). Reproducible example:
# Required packages
require(mgcv)
require(mgcViz)
# Dataset
data("swiss")
# GAM models
fit1 <- mgcv::gam(Fertility ~ s(Examination) + s(Education), data = swiss)
fit2 <- mgcv::gam(Agriculture ~ s(Examination) + s(Education), data = swiss)
# Converting GAM objects to a gamViz objects
viz_fit1 <- mgcViz::getViz(fit1)
viz_fit2 <- mgcViz::getViz(fit2)
# Make plotGAM objects
trt_fit1 <- plot(viz_fit1, allTerms = T) + l_fitLine()
trt_fit2 <- plot(viz_fit2, allTerms = T) + l_fitLine()
# Print plots
print(trt_fit1, pages = 1)
print(trt_fit2, pages = 1)
The plot of fit1 shows panels for s(Examination) and s(Education), and the plot of fit2 shows the same two panels for the Agriculture model.
So I would like to combine the two Examination smooths into one panel, and the two Education smooths into another, distinguishing the two GAMs with different colours/linetypes.
You could also do this using my {gratia} 📦 and the compare_smooths() function:
library("gratia")
library("mgcv")
# Dataset
data("swiss")
# GAM models
fit1 <- gam(Fertility ~ s(Examination) + s(Education),
data = swiss, method = "REML")
fit2 <- gam(Agriculture ~ s(Examination) + s(Education),
data = swiss, method = "REML")
# create an object that contains the info to compare the smooths
comp <- compare_smooths(fit1, fit2)
# plot
draw(comp)
This produces a figure with one panel per smooth, comparing the two models by colour.
The output from compare_smooths() is a nested data frame (tibble):
r$> comp
# A tibble: 4 × 5
model smooth type by data
<chr> <chr> <chr> <chr> <list>
1 fit1 s(Education) TPRS NA <tibble [100 × 3]>
2 fit2 s(Education) TPRS NA <tibble [100 × 3]>
3 fit1 s(Examination) TPRS NA <tibble [100 × 3]>
4 fit2 s(Examination) TPRS NA <tibble [100 × 3]>
So if you want to customise the plot etc., you'll need to know how to work with nested data frames, or just do
library("tidyr")
unnest(comp, data)
which gets you:
r$> unnest(comp, data)
# A tibble: 400 × 8
model smooth type by est se Education Examination
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 fit1 s(Education) TPRS NA 1.19 3.48 1 NA
2 fit1 s(Education) TPRS NA 1.37 3.20 1.53 NA
3 fit1 s(Education) TPRS NA 1.56 2.94 2.05 NA
4 fit1 s(Education) TPRS NA 1.75 2.70 2.58 NA
5 fit1 s(Education) TPRS NA 1.93 2.49 3.10 NA
6 fit1 s(Education) TPRS NA 2.11 2.29 3.63 NA
7 fit1 s(Education) TPRS NA 2.28 2.11 4.15 NA
8 fit1 s(Education) TPRS NA 2.44 1.95 4.68 NA
9 fit1 s(Education) TPRS NA 2.59 1.82 5.20 NA
10 fit1 s(Education) TPRS NA 2.72 1.71 5.73 NA
# … with 390 more rows
To create your own plots, we then proceed from the unnested data frame and add the confidence interval:
ucomp <- unnest(comp, data) %>%
add_confint()
Then plot each panel in turn
library("ggplot2")
library("dplyr")
p_edu <- ucomp |>
filter(smooth == "s(Education)") |> # <-- only one comparison at a time
ggplot(aes(x = Education, y = est)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = model),
alpha = 0.2) +
geom_line(aes(colour = model)) +
scale_fill_brewer(palette = "Set1") + # <-- change fill scale
scale_colour_brewer(palette = "Set1") + # <-- change colour scale
geom_rug(data = swiss, # <-- rug
mapping = aes(x = Education, y = NULL),
sides = "b", alpha = 0.4) +
labs(title = "s(Education)", y = "Estimate",
colour = "Model", fill = "Model")
p_exam <- ucomp |>
filter(smooth == "s(Examination)") |>
ggplot(aes(x = Examination, y = est)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = model),
alpha = 0.2) +
geom_line(aes(colour = model)) +
scale_fill_brewer(palette = "Set1") + # <-- change fill scale
scale_colour_brewer(palette = "Set1") + # <-- change colour scale
geom_rug(data = swiss, # <-- rug
mapping = aes(x = Examination, y = NULL),
sides = "b", alpha = 0.4) +
labs(title = "s(Examination)", y = "Estimate",
colour = "Model", fill = "Model")
Now use the {patchwork} package to put the plots together
library("patchwork")
p_edu + p_exam + plot_layout(guides = "collect")
which produces the two panels side by side with a single shared legend.
This is all using {ggplot2}, so you'll need to look at other scales if you want more control over the colours: ?scale_fill_manual for example, or other ready-made discrete scales if you want to use an existing palette.
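For example, a sketch of fixing the colours per model (the colour names here are arbitrary): replace the two brewer scales in p_edu and p_exam with a single manual scale that covers both aesthetics.
# one manual scale applied to both colour and fill
scale_colour_manual(values = c(fit1 = "darkorange", fit2 = "steelblue"),
                    aesthetics = c("colour", "fill"))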
I could make some of this easier in {gratia} - I could allow users to provide a scale to be used for the colour and fill, and also if they supply the raw data I could draw the rugs too.
If you want them in the same plot, you can pull the data from your fit with trt_fit1[["plots"]][[1]]$data$fit and plot them yourself. I looked at the plot style from the mgcViz GitHub page. You can add a second axis or scale as necessary.
library(tidyverse)
exam_dat <-
bind_rows(trt_fit1[["plots"]][[1]]$data$fit %>% mutate(fit = "Fit 1"),
trt_fit2[["plots"]][[1]]$data$fit %>% mutate(fit = "Fit 2"))
ggplot(data = exam_dat, aes(x = x, y = y, colour = fit)) +
geom_line() +
labs(x = "Examination", y = "s(Examination)") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
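Extending the same idea, a sketch that shows both terms at once with facets. This assumes, as above, that each element of trt_fit*[["plots"]] stores its fitted curve in $data$fit with columns x and y, and that the terms appear in the order Examination, Education:
# bind both smooth terms from both fits and facet by term
all_terms <- bind_rows(
  trt_fit1[["plots"]][[1]]$data$fit %>% mutate(fit = "Fit 1", term = "s(Examination)"),
  trt_fit2[["plots"]][[1]]$data$fit %>% mutate(fit = "Fit 2", term = "s(Examination)"),
  trt_fit1[["plots"]][[2]]$data$fit %>% mutate(fit = "Fit 1", term = "s(Education)"),
  trt_fit2[["plots"]][[2]]$data$fit %>% mutate(fit = "Fit 2", term = "s(Education)")
)
ggplot(all_terms, aes(x = x, y = y, colour = fit)) +
  geom_line() +
  facet_wrap(~ term, scales = "free_x") +
  theme_bw()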
To simply get them displayed side by side, you could use gridExtra, as fit1 and fit2 each provide a ggplot object.
gridExtra::grid.arrange(
trt_fit1[["plots"]][[2]]$ggObj,
trt_fit2[["plots"]][[2]]$ggObj,
nrow = 1)
Created on 2022-02-18 by the reprex package (v2.0.1)
I'm using ggplot to visualize many linear regressions and facet them by groups. I'd like geom_smooth() to show the trend line as one color if P < 0.05, a different color if P < 0.10, and not show it at all if P ≥ 0.10.
I managed to do this using a loop to extract P-values from lm() for each regression, then join them with the data used for plotting. Then I add another column of color names to pass to aes(), determined conditionally from the P-values, and use scale_color_identity() to achieve my goal.
Here's an example:
library(tidyverse)
#make mtcars a tibble and cyl a factor, for convenience
mtcars1 <- as_tibble(mtcars) %>% dplyr::mutate(cyl = as.factor(cyl))
#initialize a list to store p-values from lm() for each level of factor
p.list <- vector(mode = "list", length = length(levels(mtcars1$cyl)))
names(p.list) <- levels(mtcars1$cyl)
#loop to calculate p-values for each level of mtcars$cyl
for(i in seq_along(levels(mtcars1$cyl))){
mtcars.sub <- mtcars1 %>% dplyr::filter(cyl == levels(.$cyl)[i])
lm.pval <- mtcars.sub %>%
dplyr::distinct(cyl) %>%
dplyr::mutate(P =
summary(lm(mpg ~ disp, data = mtcars.sub))$coefficients[2,4] ##extract P-value
)
p.list[[i]] <- lm.pval
}
#join p-values to dataset and add column to use with scale_color_identity()
mtcars.p <- mtcars1 %>% dplyr::left_join(dplyr::bind_rows(p.list, .id = "cyl"), by = "cyl") %>%
dplyr::mutate(p.color = ifelse(P < 0.05, "black",
ifelse(P < 0.10, "lightblue", NA)))
#plot
ggplot(data = mtcars.p, aes(x = disp, y = mpg)) +
geom_smooth(method = "lm",
se = FALSE,
aes(color = p.color)) +
geom_point() +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
This seems like too many initial steps for something that should be relatively easy. Are these steps necessary, or is there a more efficient way of doing this? Can ggplot or any other packages out there do this on their own, without having to first extract p-values from lm()?
After fitting your regression model, you can draw its line within ggplot using geom_abline():
myline <- lm(mpg ~ disp, data = mtcars)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
geom_abline(slope = coef(myline)[[2]], intercept = coef(myline)[[1]], color = 'blue') +
geom_point(color = 'red') +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
To draw a similar line with geom_smooth() instead, let it fit the model itself (geom_smooth() does not accept slope/intercept arguments). Note that geom_smooth() fits per facet, so with facet_wrap(~cyl) each panel gets its own line:
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = 'blue') +
We may simplify the steps with a grouped operation; instead of extracting each component, the output can be returned in a tibble with tidy() from broom:
library(broom)
library(dplyr)
library(tidyr)
mtcars1 %>%
group_by(cyl) %>%
summarise(out = list(tidy(lm(mpg ~ disp, data = cur_data())))) %>%
unnest(out)
Output:
# A tibble: 6 x 6
cyl term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) 40.9 3.59 11.4 0.00000120
2 4 disp -0.135 0.0332 -4.07 0.00278
3 6 (Intercept) 19.1 2.91 6.55 0.00124
4 6 disp 0.00361 0.0156 0.232 0.826
5 8 (Intercept) 22.0 3.35 6.59 0.0000259
6 8 disp -0.0196 0.00932 -2.11 0.0568
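To finish the original task, a sketch (reusing the question's colour rule): keep the slope rows, derive the colour, and join back to the data, after which the question's ggplot call works unchanged.
pvals <- mtcars1 %>%
  group_by(cyl) %>%
  summarise(out = list(tidy(lm(mpg ~ disp, data = cur_data())))) %>%
  unnest(out) %>%
  filter(term == "disp") %>%    # keep only the slope rows
  transmute(cyl,
            p.color = ifelse(p.value < 0.05, "black",
                             ifelse(p.value < 0.10, "lightblue", NA)))
mtcars.p <- left_join(mtcars1, pvals, by = "cyl")
# mtcars.p can now be plotted with the same ggplot call as in the question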
I need to plot the relationship between x and y where polynomials of x predict y. This is done using the poly() function in order to ensure the polynomials are orthogonal.
How do I plot this relationship considering the linear, quadratic, and cubic terms together? The issue is that the coefficients for the different terms are not on the same scale as x.
I provide some example code below. I have tried reassigning the contrast values for each polynomial to x, but this solution gives impossible predicted values.
Thank you in advance for your help!
Best wishes,
Eric
Here is some example code:
x <- sample(0:6, 100, replace = TRUE)
y <- (x * 0.2) + (x^2 * 0.05) + (x^3 * 0.001)
y <- y + rnorm(100)
x <- poly(x, 3)
m <- lm(y ~ x)
TAB <- summary(m)$coefficients
### Reassigning the corresponding contrast values to each polynomial of x:
eq = function(x,TAB,start) {
#argument 'start' is used to determine the position of the linear coefficient, quadratic and cubic follow
pols = poly(x,3)
x1=pols[,1]; x2=pols[,2]; x3=pols[,3]
TAB[1,1] + x1[x]*TAB[start,1] + x2[x] * TAB[start+1,1] + x3[x] * TAB[start+2,1]
}
plot(eq(0:7,TAB,2))
Actually, you can use poly() directly in the formula for lm(): y ~ poly(x, 3) might be what you want.
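If you do want predicted values by hand, here is a sketch in base R (with freshly simulated data): predict() applies the stored orthogonal basis to new x values, so there is no need to rescale the coefficients yourself.
set.seed(1)
x <- sample(0:6, 100, replace = TRUE)
y <- x * 0.2 + x^2 * 0.05 + x^3 * 0.001 + rnorm(100)
m <- lm(y ~ poly(x, 3))          # keep poly() inside the formula
newx <- seq(0, 6, length.out = 100)
pred <- predict(m, newdata = data.frame(x = newx))  # basis handled for us
plot(x, y)
lines(newx, pred, col = "blue")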
For the plot, I'll use the ggplot2 package, whose geom_smooth() function can draw the fitted curve. You should specify the method = "lm" argument and the formula:
library(tidyverse)
x <- sample(0:6,100,replace = TRUE)
y <- (x*0.2) + (x^2*.05) + (x^3*0.001)
eps <- rnorm(100)
(df <- tibble(y = y + eps, x = x))
#> # A tibble: 100 x 2
#> y x
#> <dbl> <int>
#> 1 3.34 4
#> 2 1.23 5
#> 3 1.38 3
#> 4 -0.115 2
#> 5 1.94 5
#> 6 3.87 6
#> 7 -0.707 3
#> 8 0.954 3
#> 9 1.19 3
#> 10 -1.34 0
#> # ... with 90 more rows
Using your simulated data set,
df %>%
ggplot() + # this should be declared at first with the data set
aes(x, y) + # aesthetic
geom_point() + # data points
geom_smooth(method = "lm", formula = y ~ poly(x, 3)) # lm fit
If you want to remove the points, just erase geom_point():
df %>%
ggplot() +
aes(x, y) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3))
For transparency, set alpha to less than 1:
df %>%
ggplot() +
aes(x, y) +
geom_point(alpha = .3) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3))
I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.
Example:
library(plyr)
library(ggplot2)
set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
val = c(1:10, 100, 101))
df
# fact val
# 1 b 1
# 2 a 2
# 3 a 3
# 4 a 4
# 5 b 5
# 6 a 6
# 7 b 7
# 8 b 8
# 9 b 9
# 10 a 10
# 11 b 100
# 12 a 101
by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))
by.fact.df
# fact 0% 25% 50% 75% 100%
# 1 a 2 3.25 5.0 9.00 101
# 2 b 1 5.50 7.5 8.75 100
# What I can do...with faults (a) and (b) above
ggplot(by.fact.df,
aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity")
# What I want...
ggplot(df, aes(x = fact, y = val)) +
geom_boxplot()
Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them to geom_point()?
Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them, rather than having ggplot2 do the calculations.
UPDATE
I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:
df %>%
group_by(fact) %>%
do(ldply(boxplot.stats(.$val), data.frame))
Source: local data frame [6 x 3]
Groups: fact
fact .id X..i..
1 a stats 2
2 a stats 4
3 a stats 10
4 a stats 13
5 a stats 16
6 a n 9
Here's my answer, using the built-in functions quantile() and boxplot.stats().
geom_boxplot() does the calculations for the boxplot slightly differently than boxplot.stats(). Read ?geom_boxplot and ?boxplot.stats to understand the implementation below.
# Function to calculate boxplot stats to match ggplot2's implementation in geom_boxplot
my_boxplot.stats <- function(x) {
  quantiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
  labels <- names(quantile(x))
  # replace the upper whisker to match geom_boxplot
  quantiles[5] <- boxplot.stats(x)$stats[5]
  res <- data.frame(rbind(quantiles))
  names(res) <- labels
  res$out <- boxplot.stats(x)$out
  return(res)
}
Code to calculate the stats and plot it
library(dplyr)
df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>%
ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity") + geom_point()
To get the correct statistics, you have to do some more calculations than just finding the quantiles. The geom_boxplot() function with stat = "identity" does not draw the outliers, so you have to calculate the statistics without the outliers and then use geom_point() to draw the outliers separately. The following function (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:
box.df <- df %>% group_by(fact) %>% do({
stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
iqr <- diff(stats[c(2, 4)])
coef <- 1.5
outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
if (any(outliers)) {
stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm=TRUE)
}
outlier_values = .$val[outliers]
if (length(outlier_values) == 0) outlier_values <- NA_real_
res <- as.list(t(stats))
names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
res$out <- outlier_values
as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
##
## fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1 a 2 3.25 5.0 9.00 10 101
## 2 b 1 5.50 7.5 8.75 9 100
ggplot(box.df, aes(x = fact, y = out, middle = median,
ymin = lower.whisker, ymax = upper.whisker,
lower = lower.hinge, upper = upper.hinge)) +
geom_boxplot(stat = "identity") +
geom_point()