I am creating very basic plots to visualize trends in a dataset. I'm using the gam smoother with geom_smooth, but this is over-fitting the data at several sites. For example, in the very first facet, the smoother over-fits the red data points.
Is there a way to adjust my code to cap the number of inflection points (e.g., at most 2 or 3), so that the smoother focuses on the overarching trends in the dataset?
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs"),
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
EDIT (2022-12-19): Following some of the comments raised by @GavinSimpson, I've made some plots showing how changing k changes the shape of the geom_smooth curve.
For these plots, I've tried to create an example dataset that has a rough quadratic section followed by a roughly linear section. Ideally, I want to find a k value that would capture the inflection point at the 'peak' of the quadratic section and the transition from quadratic to linear (x ~= 5 & 10).
library(ggpubr)
library(ggplot2)
df <- data.frame(
x1 = c(1:19),
y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1, #quadratic
0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10)) #linear
pk <- ggplot(data = df,
aes(y = y1, x = x1)) +
geom_point(col = "red") +
theme_bw()
# vary k from 2 to not specified
ggarrange(nrow = 3, ncol = 2,
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 2)) + ggtitle("k=2"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 3)) + ggtitle("k=3"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 4)) + ggtitle("k=4"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 5)) + ggtitle("k=5"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 6)) + ggtitle("k=6"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) + ggtitle("k unconstrained"))
The first plot throws a warning (not an error):
Warning message:
In smooth.construct.cr.smooth.spec(object, data, knots) :
basis dimension, k, increased to minimum possible
This indicates that k is being increased from 2 to 3, the minimum for this basis.
Only when k = 5 do we start to see the smoother begin to capture the two main inflection points in the data (x ~= 5 & 10). But what does k = 5 mean? It also seems that it's added another slight inflection point around x ~= 16.
At k = 6, the SE is reduced and the line comes even closer to each data point, and it looks nearly identical to the figure when k is not specified at all, which might be ok in this situation, but for my dataset, when k isn't specified there is a lot of over-fitting when I really just want to see the most important inflection points.
So still no clear answer on how to specify inflection points.
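On the question of what k means: in mgcv, k is the basis dimension of the smooth, not a count of inflection points. It sets an upper limit on flexibility (at most k - 1 effective degrees of freedom for a single smooth), and the penalty can then shrink the fit below that limit. A minimal sketch on the example data above, assuming geom_smooth(method = "gam") fits through mgcv::gam with REML smoothness selection (which, as far as I know, is the default in recent ggplot2 versions):

```r
library(mgcv)

# Example data from the edit above
df <- data.frame(
  x1 = 1:19,
  y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1,
         0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10))

ks <- 3:6

# Fit the same smooth for each basis dimension k
fits <- lapply(ks, function(k) {
  gam(y1 ~ s(x1, bs = "cs", k = k), data = df, method = "REML")
})

# Effective degrees of freedom used by the smooth (model total minus the intercept)
edf <- sapply(fits, function(m) sum(m$edf) - 1)

# Compare the cap (k - 1) with what the penalized fit actually used
data.frame(k = ks, max_edf = ks - 1, edf = round(edf, 2))
```

If the edf column sits close to max_edf, the cap is what is limiting the curve; if it is well below, the penalty is doing the smoothing and raising k further should change little.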
Thanks to the suggestion from @tpetzoldt, I was able to work out how to limit the number of inflection points through k in the gam formula.
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs", k = 2), # k = basis dimension (caps wiggliness); k = 2 is raised to the minimum of 3 for this basis
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
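Because k only caps flexibility, it can help to check how many turning points the fitted curve actually has for a candidate k. A rough sketch using the example data from the edit above; count_turning_points is a hypothetical helper (not part of mgcv or ggplot2) that counts how often the fitted curve switches between rising and falling on a fine grid:

```r
library(mgcv)

df <- data.frame(
  x1 = 1:19,
  y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1,
         0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10))

# Hypothetical helper: number of rise/fall switches along a curve
count_turning_points <- function(y) {
  slope_sign <- sign(diff(as.numeric(y)))
  slope_sign <- slope_sign[slope_sign != 0]  # ignore flat steps
  sum(diff(slope_sign) != 0)
}

# Fine grid over the observed x range
grid <- data.frame(x1 = seq(1, 19, length.out = 200))

# Turning points of the fitted curve for each candidate k
turning_points <- sapply(3:6, function(k) {
  fit <- gam(y1 ~ s(x1, bs = "cs", k = k), data = df, method = "REML")
  count_turning_points(predict(fit, newdata = grid))
})
names(turning_points) <- paste0("k=", 3:6)
turning_points
```

Picking the smallest k whose curve shows the number of turning points you expect is a pragmatic choice, not a statistical test.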
Related
I am trying to fit association-dissociation SPR kinetics data for a protein and small molecule for two concentrations using ggplot2. The data is here.
The time variable indicates the time in seconds, the sample variable indicates the two concentrations (32nM and 8nM), and the values variable is the readout.
I have imported the data and am running the following code to plot it:
# LINE PLOT
ggplot(data) +
geom_point(aes(x = time, y = values), size = 1, color = "black") +
geom_smooth(aes(x = time, y = values, color = sample), method = "loess", se = F) +
scale_x_continuous(expand = c(0, 0), limits = c(0, NA)) +
#scale_y_continuous(expand = c(0, 0), limits = c(0, 60)) +
scale_color_npg(breaks = c("2nM", "4nM", "8nM", "16nM", "32nM")) +
theme_linedraw() +
labs(x = "Time (seconds)",
y = "Response Units") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
Here is the plot:
As you can see, the fit did not work using method = "loess". I need something like this (there are 5 concentrations here):
The fitting requires 1:1 Langmuir model but I am not sure how I can do that in ggplot. Can someone please help me?
Here is the equation:
This is from the pbm package, which fits this kind of data.
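For reference, the textbook 1:1 Langmuir model has an association phase, R(t) = Req * (1 - exp(-(kon * C + koff) * t)) with Req = Rmax * kon * C / (kon * C + koff), followed by an exponential decay with rate koff once the analyte is removed at time t0. A minimal base-R sketch of that form follows. The function name langmuir_1to1, its argument names, and the parameter values in the usage lines are my own illustrations, and the pbm implementation may be parameterized differently:

```r
# Textbook 1:1 Langmuir binding curve (hypothetical helper, not the pbm function)
# t:    time points (s)
# t0:   time at which association ends and dissociation starts (s)
# conc: analyte concentration (M)
# kon:  association rate constant (1/(M*s)); koff: dissociation rate constant (1/s)
# rmax: response at saturation
langmuir_1to1 <- function(t, t0, conc, kon, koff, rmax) {
  kobs <- kon * conc + koff              # observed association rate
  req  <- rmax * kon * conc / kobs       # equilibrium response at this concentration
  assoc <- function(tt) req * (1 - exp(-kobs * tt))
  ifelse(t <= t0,
         assoc(t),                               # association phase
         assoc(t0) * exp(-koff * (t - t0)))      # dissociation phase
}

# Illustrative values only
time <- seq(0, 400, by = 50)
resp <- langmuir_1to1(time, t0 = 120, conc = 32e-9,
                      kon = 9e5, koff = 2.5e-3, rmax = 1)
round(resp, 3)
```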
Your data are smooth enough that you need only use geom_line, not geom_smooth:
df %>%
ggplot(aes(time, values, color = sample)) +
geom_line(size = 2, na.rm = TRUE) +
geom_point(color = 'black', size = 1) +
theme_linedraw(base_size = 16) +
xlim(c(0, 400))
Edit
It is possible to fit the results to the data using non-linear least squares, employing the binding1to1 function from pbm, but it requires a bit of method tweaking to get the model to fit. It would probably be better to create a model then plot the predictions rather than using geom_smooth. However, if you really wanted to, you could do:
df %>%
ggplot(aes(time, values, color = sample)) +
geom_smooth(method = nls, se = FALSE, n = 1000,
formula = y ~ binding1to1(x, 123, 32e-9, kon, koff, rmax),
method.args = list(
start = list(kon = 2000, koff = 0.02, rmax = 2e4),
control = nls.control(minFactor = 1e-6, maxiter = 1000)
),
data = df[df$time > 0 & df$sample == "32nM",]) +
geom_smooth(method = nls, se = FALSE, n = 1000,
formula = y ~ binding1to1(x, 123, 8e-9, kon, koff, rmax),
method.args = list(
start = list(kon = 3000, koff = 0.02, rmax = 2e4),
control = nls.control(minFactor = 1e-9, maxiter = 10000)
),
data = df[df$time > 0 & df$sample == "8nM",]) +
geom_point(color = 'black', size = 1) +
theme_linedraw(base_size = 16) +
xlim(c(0, 400))
If you want to actually fit a model from which to extract the parameters and plot, you can do:
library(tidyverse)
library(pbm)
df <- read.csv("SPR.csv") %>%
filter(time >= 0) %>%
mutate(sample = as.numeric(gsub("\\D+", "", sample)) * 1e-9,
values = values * 1e-3) %>%
group_by(sample) %>%
mutate(tmax = time[which.max(values)])
fit_fun <- function(time, tmax, sample, kon, koff, rmax) {
unlist(Map(function(time, tmax, sample) {
binding1to1(time, tmax, sample, kon, koff, rmax)
}, time, tmax, sample))
}
mod <- nls(values ~ fit_fun(time, tmax, sample, kon, koff, rmax),
data = df,
start = list(kon = 3000, koff = 0.02, rmax = 2),
control = nls.control(minFactor = 1e-9, maxiter = 10000))
This gives us a model with the best fitting values for the various parameters:
mod
#> Nonlinear regression model
#> model: values ~ fit_fun(time, tmax, sample, kon, koff, rmax)
#> data: df
#> kon koff rmax
#> 8.925e+05 2.521e-03 5.445e-02
#> residual sum-of-squares: 5.219e-05
#>
#> Number of iterations to convergence: 536
#> Achieved convergence tolerance: 5.155e-07
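One quantity these parameters give directly is the equilibrium dissociation constant, KD = koff / kon. A small sketch using the values printed above (with a live model you would take them from coef(mod)). It assumes kon is in 1/(M*s), which follows from converting the concentrations to molar in the data preparation step:

```r
# Estimates copied from the printed model output above
est <- c(kon = 8.925e5, koff = 2.521e-3, rmax = 5.445e-2)

kd    <- unname(est["koff"] / est["kon"])  # equilibrium dissociation constant, in molar
kd_nM <- kd * 1e9                          # same value in nanomolar

signif(kd_nM, 3)
#> [1] 2.82
```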
We can then predict the output of the model over the range of our input variables:
pred_df <- expand.grid(time = 0:400, sample = c(8, 32) * 1e-9,
tmax = df$tmax[1])
pred_df$values <- predict(mod, pred_df)
And we can plot it like this:
df %>%
ggplot(aes(time, values, color = factor(sample))) +
geom_line(data = pred_df, size = 1) +
geom_point(color = 'black', size = 1) +
theme_linedraw(base_size = 16) +
xlim(c(0, 400))
This is a follow up question to Combine ggflags with linear regression in ggplot2
I have a plot like below with a log-linear model for x and y for certain countries that I have made in R with ggplot2 and ggflags:
The problem is when I want to print out the regression equation, the R2 and the p-value with the help of stat_regline_equation and stat_cor, I get values for a linear model and not the log-linear model I want to use.
How can I solve this?
library(ggplot2)
library(ggflags)
library(ggpubr)
library(SciViews)
set.seed(123)
Data <- data.frame(
country = c("at", "be", "dk", "fr", "it"),
x = runif(5),
y = runif(5)
)
ggplot(Data, aes(x = x, y = y, country = country, size = 11)) +
geom_flag() +
scale_country() +
scale_size(range = c(10, 10)) +
geom_smooth(aes(group = 1), method = "lm", formula = y ~ log(x), se = FALSE, size = 1) +
stat_regline_equation(label.y = 0.695,
aes(group = 1, label = ..eq.label..), size = 5.5) +
stat_cor(aes(group = 1,
label =paste(..rr.label.., ..p.label.., sep = "~`,`~")),
label.y = 0.685, size = 5.5, digits= 1)
edit: I have also tried to use ln(x) instead of log(x) but I do not get any results when printing out the coefficient from that either.
There are four things you need to do:
1. Provide your regression formula to the formula argument of stat_regline_equation
2. Use sub to change "x" to "log(x)" in eq.label
3. Change the x aesthetic of stat_cor to log(x)
4. Fix the x limits inside coord_cartesian to compensate
ggplot(Data, aes(x = x, y = y, country = country, size = 11)) +
geom_flag() +
scale_country() +
scale_size(range = c(10, 10)) +
geom_smooth(aes(group = 1), method = "lm", formula = y ~ log(x),
se = FALSE, size = 1) +
stat_regline_equation(label.y = 0.695, label.x = 0.25,
aes(group = 1, label = sub("x", "log(x)", ..eq.label..)),
size = 5.5,
formula = y ~ log(x),
check_overlap = TRUE, output.type = "latex") +
stat_cor(aes(group = 1, x = log(x),
label =paste(..rr.label.., ..p.label.., sep = "~`,`~")),
label.x = 0.25,
label.y = 0.65, size = 5.5, digits= 1, check_overlap = TRUE) +
coord_cartesian(xlim = c(0.2, 1))
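To check that the printed labels really correspond to the log-linear model, the same fit can be run directly with lm() and compared with the plot. A short sketch using the example data from the question:

```r
set.seed(123)
Data <- data.frame(
  country = c("at", "be", "dk", "fr", "it"),
  x = runif(5),
  y = runif(5)
)

# Same model that geom_smooth() draws: y regressed on log(x)
fit  <- lm(y ~ log(x), data = Data)
smry <- summary(fit)

coef(fit)                                 # intercept and slope on log(x)
smry$r.squared                            # R^2 of the log-linear fit
smry$coefficients["log(x)", "Pr(>|t|)"]   # p-value of the slope
```

For a single predictor, R^2 equals the squared Pearson correlation between log(x) and y, which is why giving stat_cor the log(x) aesthetic should reproduce it.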
With the following ggplot2 code in R:
require(ggplot2)
df <- data.frame(x = rep(c(0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 6.5, 7.0),4), y = c(0.15,0.17,0.07,0.17,0.01,0.15,0.18,0.04,-0.06,-0.08,0,0.03,-0.27,-0.93,0.04,0.12,0.08,0.15,0.04,0.15,0.03,0.09,0.11,0.13,-0.11,-0.32,-0.7,-0.78,0.07,0.04,0.06,0.12,-0.15,0.05,-0.08,0.14,-0.02,-0.14,-0.24,-0.32,-0.78,-0.81,-0.04,-0.25,-0.09,0.02,-0.13,-0.2,-0.04,0,0.02,-0.05,-0.19,-0.37,-0.57,-0.81))
ggplot(df, aes(x = x, y = y)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE, aes(color = "black")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), size = 1, se = FALSE, aes(color = "green")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, se = FALSE, aes(color = "orange")) +
stat_smooth(method = "gam", formula = y ~ s(x), size = 1, se = FALSE, aes(color = "blue")) +
theme(legend.justification=c(1,1),legend.position=c(0.45,0.45),legend.title=element_blank()) +
scale_color_manual(values=c("black","green","orange","blue"), labels=c("linear","quadratic","cubic","smooth"))
The legend is fine, but for some reason three of the four curves are not colored as intended: the orange curve should be green, the blue curve should be orange, and the green curve should be blue. What am I missing?
The strings you use in the aesthetic call are arbitrary (even though you have called them after colours). They will be converted into a factor column internally in ggplot, and the levels of the factor are determined alphabetically. The factor levels are mapped across to the vector of values in the scale_color_manual call in the order they are put.
So you can just use "a" through "d" as arbitrary strings for the color aesthetic to keep track of them and control their ordering:
ggplot(df, aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ x,
size = 1, se = FALSE, aes(color = "a")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2),
size = 1, se = FALSE, aes(color = "b")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 3),
size = 1, se = FALSE, aes(color = "c")) +
stat_smooth(method = "gam", formula = y ~ s(x),
size = 1, se = FALSE, aes(color = "d")) +
theme(legend.justification = c(1, 1),
legend.position = c(0.45, 0.45),
legend.title = element_blank()) +
scale_color_manual(values = c("black", "green", "orange", "blue"),
labels = c("linear", "quadratic", "cubic", "smooth"))
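The alphabetical matching described above can be reproduced in base R to see exactly which colour each string from the question ended up with. A small sketch (the pairing by position mimics what an unnamed values vector does):

```r
aes_strings <- c("black", "green", "orange", "blue")  # strings used inside aes() in the question
values      <- c("black", "green", "orange", "blue")  # values passed to scale_color_manual()

# The discrete levels are ordered alphabetically...
lv <- levels(factor(aes_strings))
lv

# ...and then paired with `values` by position
setNames(values, lv)
```

So the string "blue" is drawn in green, "green" in orange, and "orange" in blue, which matches the mix-up in the question.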
From the online help (my emphasis):
values
a set of aesthetic values to map data values to. The values will be matched in order (usually alphabetical) with the limits of the scale, or with breaks if provided. If this is a named vector, then the values will be matched based on the names instead. Data values that don't match will be given na.value.
So try
values=c("linear"="black","quadratic"="green","cubic"="orange","smooth"="blue")
or something similar. I can't check my code as you haven't provided your input data.
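One caveat with named values: the names have to match the strings used inside aes(), not the legend labels. A sketch of that approach, using made-up data (x, y and the seed are placeholders, not the data from the question):

```r
library(ggplot2)

# Placeholder data, only to make the example runnable
set.seed(1)
df <- data.frame(x = rep(seq(0, 7, by = 0.5), 4))
df$y <- 0.1 - 0.02 * df$x^2 + rnorm(nrow(df), sd = 0.1)

# Names are the strings used in aes(); values are the colours to draw
cols <- c(linear = "black", quadratic = "green", cubic = "orange", smooth = "blue")

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm",  formula = y ~ x,          se = FALSE, aes(color = "linear")) +
  stat_smooth(method = "lm",  formula = y ~ poly(x, 2), se = FALSE, aes(color = "quadratic")) +
  stat_smooth(method = "lm",  formula = y ~ poly(x, 3), se = FALSE, aes(color = "cubic")) +
  stat_smooth(method = "gam", formula = y ~ s(x),       se = FALSE, aes(color = "smooth")) +
  scale_color_manual(values = cols, breaks = names(cols)) +  # breaks fixes the legend order
  theme(legend.title = element_blank())

p
```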
I am working on experimental data in which I applied different stressors over time (day of experiment) to seastars. I measured their weight at different time points as well as their righting time (activity), and used a GAM to model the data over time. When I plot the model together with the mean values for each time point, the weight model underestimates the measured mean weight, and the righting-time model shows a sinusoidal pattern that the data don't support. It would be amazing if someone could explain why that is and whether it is possible to fit the model better to my data. Thank you so much for your help.
My data look like this for weight:
Day.in.experiment  Treatment  Individual  Weight.sea.star..g.
numeric            factor     factor      numeric
And like this for the righting time:
Day.in.experiment  Treatment  Individual  Time..sec.
numeric            factor     factor      numeric
This is my code for the weight model:
model_weight <- bam (Weight.sea.star..g. ~ Treatment +
s (Day.in.experiment, k = 3) +
s (Day.in.experiment, by = Treatment, k = 3) +
s (Individual, bs = "re"),
family = gaussian, data = data_subset)
r1 <- start_value_rho (model_weight, plot = TRUE, lag = 2)
model_weight_autocorrelation <- bam (Weight.sea.star..g. ~ Treatment +
s (Day.in.experiment, k = 3) +
s (Day.in.experiment, by = Treatment, k = 3) +
s (Individual, bs = "re"),
family = gaussian, data = data_subset,
rho = r1, AR.start = data_subset$start.event)
And this is the code for the activity model:
model_righting <- bam (Time..sec. ~ Treatment +
s (Day.in.experiment, k = 7) +
s (Day.in.experiment, by = Treatment, k = 7) +
s (Individual, bs = "re") +
s (Individual, Day.in.experiment, bs = "re"),
family = gaussian, data = data_subset)
r1 <- start_value_rho (model_righting, plot = TRUE, lag = 2)
model_righting_autocorrelation <- bam (Time..sec. ~ Treatment +
s (Day.in.experiment, k = 7) +
s (Day.in.experiment, by = Treatment, k = 7) +
s (Individual, bs = "re")+
s (Individual, Day.in.experiment, bs = "re"),
family = gaussian, data = data_subset,
rho = r1, AR.start = data_subset$start.event)
In order to plot the model, I extracted the fitted data first and then plotted it with ggplot (). I provide the code for weight here, but I proceeded the very same way for righting time. To plot the mean values of the actual data I used another file called data_weight_mean.
data_model_weight_autocorrelation <- visreg (model_weight_autocorrelation, "Day.in.experiment", by = "Treatment",
breaks = c ("NoHW", "THW", "TEHW"),
gg = TRUE, overlay = TRUE, jitter = TRUE, lwd = 0.5,
rug = FALSE, partial = FALSE, plot = FALSE)
ggplot () +
scale_fill_manual (breaks = c ("NoHW", "THW", "TEHW"),
values = c ("cadetblue3", "aquamarine3","goldenrod3"),
labels = c ("No Heatwave", "Ambient Heatwave", "Extended Heatwave")) +
scale_color_manual (breaks = c ("NoHW", "THW", "TEHW"),
values = c ("cadetblue3", "aquamarine3","goldenrod3"),
labels = c ("No Heatwave", "Ambient Heatwave", "Extended Heatwave")) +
geom_ribbon (data = data_model_weight_autocorrelation$fit, aes (x = Day.in.experiment,
ymin = visregLwr, ymax = visregUpr,
fill = Treatment), alpha = 0.4, colour = "grey", size = 0.2) +
geom_line (data = data_model_weight_autocorrelation$fit, aes (x = Day.in.experiment, y = visregFit,
colour = Treatment),
size = 1.1) +
geom_point (data = data_weight_mean, aes (x = Day.in.experiment, y = Weight.sea.star..g., colour = Treatment),
size = 1) +
scale_y_continuous (limits = c (-2, 12),
breaks = c (0, 4, 8, 12),
labels = c (0, 4, 8, 12)) +
labs (x = "\nDay of Experiment", y = expression (atop ("Weight [g]")),
colour = "Treatment", fill = "Treatment") +
theme_bw () +
theme (panel.grid.major.x = element_blank (),
panel.grid.minor.x = element_blank (),
panel.grid.major.y = element_blank (),
panel.grid.minor.y = element_blank (),
legend.position = c (0.15, 0.85),
legend.direction = "vertical",
legend.title = element_text (colour = "black", size = 18),
legend.text = element_text (colour = "black", size = 16),
axis.line = element_line (),
axis.title = element_text (colour = "black", size = 24),
axis.text = element_text (colour = "black", size = 18))
While completing a project for understanding central limit theorem for exponential distribution, I ran into an annoying error message when plotting simulated vs theoretical distributions. When I run the code below, I get an error: 'mapping' is not used by stat_function().
By mapping I assume the error is referring to the aes parameter, which I later map to color red using scale_color_manual in order to show it in a legend.
My question is two-fold: why is this error happening, and is there a more efficient way to create a legend without using scale_color_manual?
Thank you!
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
It's not an error, it's a warning:
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
#> Warning: `mapping` is not used by stat_function()
Created on 2020-05-01 by the reprex package (v0.3.0)
You can suppress the warning by calling geom_line(stat = "function") rather than stat_function():
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
geom_line(stat = "function", fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
In my opinion, the warning is erroneous, and an issue has been filed about this problem: https://github.com/tidyverse/ggplot2/issues/3611
However, it's not that easy to solve, and therefore as of now the warning is there.
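Another way to sidestep the warning, sketched here as an alternative rather than a fix for stat_function() itself, is to evaluate the theoretical density on a grid and draw it with geom_line(), which accepts a mapping like any other layer. The grid range (2 to 8) is an arbitrary choice for this example, and after_stat() assumes ggplot2 3.3.0 or later:

```r
library(ggplot2)

lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(Mean = rowMeans(matrix(total_exp, n_sims)))

# Theoretical sampling distribution of the mean, evaluated on a grid
theo <- data.frame(Mean = seq(2, 8, length.out = 301))
theo$density <- dnorm(theo$Mean, mean = 1 / lambda, sd = 1 / lambda / sqrt(40))

p <- ggplot(exp_data, aes(x = Mean)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.3,
                 color = "black", fill = "steelblue") +
  geom_density(aes(color = "Simulation")) +
  # Precomputed curve: the colour mapping creates the legend entry
  geom_line(data = theo, aes(y = density, color = "Theoretical")) +
  scale_color_manual("Legend", values = c(Theoretical = "red", Simulation = "blue")) +
  theme_classic(base_size = 10)

p
```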
I'm unable to recreate your issue -- when I run your code a plot is generated (below), which suggests the issue is likely to do with your environment. A general 'solution' is to clear your workspace using the menu dropdown or similar: Session -> Clear workspace..., then re-run your code.
To simplify the color handling, you can reduce scale_color_manual to
scale_color_manual("Legend", values = c('blue','red')), but the named version you have now is a bit clearer in my view. Anything beyond that has more to do with changing the data structure and mapping.
Apologies, I don't have the rep to make a comment.