How to add a logarithmic nonlinear fit to ggplot?

I'd like to fit a logarithmic curve through my data using nls.
library(dplyr)
library(ggplot2)
a <- 3
b <- 2
Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
                y = a + b*log(x))
Y %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(shape = 19, size = 2) +
  geom_smooth(method = "nls",
              formula = y ~ p1 + p2*log(x),
              start = list(a = a, b = b),
              se = FALSE,
              control = list(maxiter = 100))
This gives me an error:
Error in method(formula, data = data, weights = weight, ...) :
number of iterations exceeded maximum of 100
What is going wrong?

Here's some text I copied and pasted after doing ?nls:
Warning
Do not use nls on artificial "zero-residual" data.
The nls function uses a relative-offset convergence criterion that compares the numerical imprecision at the current parameter estimates to the residual sum-of-squares. This performs well on data of the form
y = f(x, θ) + eps
(with var(eps) > 0). It fails to indicate convergence on data of the form
y = f(x, θ)
because the criterion amounts to comparing two components of the round-off error. If you wish to test nls on artificial data please add a noise component, as shown in the example below.
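A quick aside of mine to make that warning concrete: the same nls() call fails on exact, zero-residual data but converges once a tiny noise component is added.
# Zero-residual data: the relative-offset criterion ends up comparing
# round-off to round-off, so nls() typically errors with "number of
# iterations exceeded maximum ...".
exact <- data.frame(x = c(0.2, 0.5, 1, 2, 5, 10))
exact$y <- 3 + 2 * log(exact$x)
try(nls(y ~ p1 + p2 * log(x), data = exact, start = list(p1 = 1, p2 = 1)))
# The same call converges once a small noise component is present.
noisy <- exact
noisy$y <- noisy$y + rnorm(nrow(noisy), sd = 0.001)
nls(y ~ p1 + p2 * log(x), data = noisy, start = list(p1 = 1, p2 = 1))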
That warning inspired me to try this:
> library(dplyr)
> library(ggplot2)
> a <- 3
> b <- 2
> Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
+ y = a + b*log(x)*(1 + rnorm(length(x), sd=0.001)))
> Y %>%
+ ggplot(aes(x = x, y = y)) +
+ geom_point(shape = 19, size = 2) +
+ geom_smooth(method = "nls",
+ formula = y ~ p1 + p2*log(x),
+ start = list(p1 = a, p2 = b),
+ se = FALSE,
+ control = list(maxiter = 100))
Note: your code had start = list(a = a, b = b), which is a mistake because a and b are not the parameter names in your formula (those are p1 and p2). Aside from correcting that, adding the *(1 + rnorm(length(x), sd = 0.001)) noise term is the only change I made.
The resulting graph made it seem like everything worked fine.
I'd generally recommend doing the fit separately, however, and then plotting it with predict. That way you can always check the quality of the fit to see if it worked before plotting.
> fit <- nls(data=Y, formula = y ~ p1 + p2*log(x), start = list(p1 = a, p2 = b))
> summary(fit)
Formula: y ~ p1 + p2 * log(x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 3.001926 0.001538 1952 4.14e-13 ***
p2 1.999604 0.001114 1795 5.78e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003619 on 4 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.623e-08
> new_x = data.frame(x = seq(from = 0.2, to = 10, length.out = 100))
> ggplot(data = Y, aes(x = x, y = y)) +
+   geom_point() +
+   geom_line(data = new_x,
+             aes(x = x, y = predict(fit, newdata = new_x)),
+             color = 'blue')
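A side note not in the original question: this model is linear in its parameters p1 and p2, so nls isn't strictly needed. A sketch of the same fit via lm, which also lets geom_smooth draw standard-error bands and has no convergence criterion to trip over:
# y = p1 + p2*log(x) is linear in p1 and p2, so lm() fits it directly,
# with no starting values or noise tricks required.
ggplot(Y, aes(x = x, y = y)) +
  geom_point(shape = 19, size = 2) +
  geom_smooth(method = "lm", formula = y ~ log(x))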

Related

How do I overlay a set of functions onto a ggplot2 graph when the function relies on the parameters of the inherited dataframe?

Effectively, what I want to do is generate a graph like this more efficiently, using the reproducible data I provide below.
I have plotted experimental data points of [Substrate Concentration] vs. ([Substrate Concentration] / velocity), grouped by [Inhibitor Concentration]. Each of the lines is a continuous function that accepts the parameters Substrate_Conc and Inhibitor_Conc.
It is somewhat difficult to plot a function that relies on two inherited parameters. For instance, it is easy to plot a function that relies on a single variable x, but to my knowledge x cannot be specified as an argument.
fx <- function(x){return(1 / x)}
ggplot() + theme_bw() + xlim(c(1, 100)) + stat_function(fun = fx, col = "black") + labs(x = "x", y = "y")
This simply plots the graph of 1 / x, but I cannot specify x as a parameter. In my case I need to specify x = Substrate_Conc plus a second parameter, Inhibitor_Conc, which I have grouped by.
My code:
library(dplyr) # For %>% operator
library(ggplot2) # for plots
library(scales) # for pretty_breaks()
library(stats) # For nls() function
# Construct a dataframe of experimental Inhibitor Concentration with corresponding substrate concentrations
# And with corresponding velocities
df <- data.frame(Inhibitor_Conc = rep(c(0, 6, 12, 24), each = 5),
                 Substrate_Conc = rep(c(2, 5, 10, 20, 50), times = 4),
                 velocity = c(0.0552, 0.1128, 0.1476, 0.1799, 0.2261,
                              0.0242, 0.0690, 0.0774, 0.1905, 0.1861,
                              0.0231, 0.0420, 0.0979, 0.1329, 0.1722,
                              0.0138, 0.0393, 0.0855, 0.1042, 0.1562))
# Compute S_over_V for the Woolf_Hanes Plot
df <- df %>%
mutate(S_over_V = Substrate_Conc/velocity)
# Fit a function to describe this dataset. The Competitive Model is the best model to describe this enzyme kinetics dataset.
# But I will get around to fitting the other kinetic models such as Uncompetitive for comparison.
Competitive_Fit <- nls(data = df, formula = velocity ~ (V_Max * Substrate_Conc) / (K_M * (1 + Inhibitor_Conc/K_I) + Substrate_Conc),
start = list(K_M = median(df$Substrate_Conc), K_I = median(df$Substrate_Conc), V_Max = max(df$velocity)),
control = nls.control(maxiter = 100))
# Print a summary of the fit: It is pretty good.
print(summary(Competitive_Fit))
# Extract the parameters: the first entry is K_M, followed by K_I, then V_Max
Parameters <- coefficients(Competitive_Fit)
Competitive_Woolf_H_Function <- function(S, K_M, K_I, V_Max, I)
{
# Where S is the substrate concentration, and I is the inhibitor concentration
predicted_S_over_V = ((K_M / V_Max) + ((1 / V_Max)*S) + (K_M / (V_Max * K_I)) * I)
return(predicted_S_over_V)
}
# First construct a pretty Woolf Hanes Plot without any fitting:
# A Woolf-Hanes plot is simply ([Substrate Concentration] / Velocity) plotted against Substrate Concentration
Woolf_Hanes_Plot <- ggplot(data = df, aes(x = Substrate_Conc, y = S_over_V)) + theme_bw() +
geom_point(aes(group = factor(Inhibitor_Conc), col = factor(Inhibitor_Conc), shape = factor(Inhibitor_Conc))) +
geom_vline(xintercept = 0, col = "black") + geom_hline(yintercept = 0, col = "black") +
labs(x = "Substrate Concentration (mM)", y = "[Substrate] / Velocity", title = "Woolf Hanes Plot of Enzymatic Inhibtion") +
scale_x_continuous(breaks = scales::pretty_breaks(n=20), limits = c(-50,50)) +
scale_y_continuous(breaks = scales::pretty_breaks(n=20)) +
scale_color_manual(values = c("red", "green", "blue", "purple"), name = "Inhibitor Concentration:") +
scale_shape_manual(values = c(15,16,17,3), name = "Inhibitor Concentration:")
# Add the fitted lines to the plot.
# Very redundant calls to the Competitive_Woolf_H_Function
# With the only difference being the Inhibitor Concentration is different
Woolf_Hanes_Plot_Fitted <- Woolf_Hanes_Plot + stat_function(size = 1, fun = Competitive_Woolf_H_Function,
args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 0),
col = "red") +
stat_function(size = 1, fun = Competitive_Woolf_H_Function,
args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 6),
col = "green") +
stat_function(size = 1, fun = Competitive_Woolf_H_Function,
args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 12),
col = "blue") +
stat_function(size = 1, fun = Competitive_Woolf_H_Function,
args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 24),
col = "violet")
print(Woolf_Hanes_Plot_Fitted)
Effectively, I have created a dataframe of Inhibitor_Conc, Substrate_Conc, and velocity, fit the competitive-inhibition model to this enzyme kinetics data, and used nonlinear regression to extract the unknown parameters K_M, K_I, and V_Max that describe my dataset.
I then used those estimated parameters to create a function which returns values for the Woolf-Hanes version of the competitive-inhibition equation. The hallmark of this function is that, whatever the parameters K_M, K_I, and V_Max are, the lines as a function of substrate concentration will always be parallel.
The most inefficient part of my code is calling stat_function four times, manually specifying I = 0, I = 6, I = 12, and I = 24, even though those inhibitor concentrations are already in the plotted dataframe. I did it that way only to demonstrate my desired output. How can I instead pass Substrate_Conc and Inhibitor_Conc into the function I call in stat_function()? Thank you!
The output of the nonlinear model yields the parameters:
Formula: velocity ~ (V_Max * Substrate_Conc)/(K_M * (1 + Inhibitor_Conc/K_I) +
Substrate_Conc)
Parameters:
Estimate Std. Error t value Pr(>|t|)
K_M 6.81741 1.40083 4.867 0.000145 ***
K_I 7.64747 1.92212 3.979 0.000971 ***
V_Max 0.24724 0.01475 16.763 5.24e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01489 on 17 degrees of freedom
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.661e-06
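One way to avoid the four near-identical stat_function() calls is to evaluate the fitted function once over a prediction grid and draw all the lines with a single geom_line(). A minimal sketch, reusing the df, Parameters, and Competitive_Woolf_H_Function objects defined above:
# Cross a fine substrate grid with the inhibitor concentrations already
# in df, evaluate the fitted Woolf-Hanes function once, and let the
# colour mapping split the result into one line per concentration.
pred_grid <- expand.grid(
  Substrate_Conc = seq(min(df$Substrate_Conc), max(df$Substrate_Conc), length.out = 100),
  Inhibitor_Conc = unique(df$Inhibitor_Conc))
pred_grid$S_over_V <- Competitive_Woolf_H_Function(
  S = pred_grid$Substrate_Conc, I = pred_grid$Inhibitor_Conc,
  K_M = Parameters[["K_M"]], K_I = Parameters[["K_I"]], V_Max = Parameters[["V_Max"]])
Woolf_Hanes_Plot +
  geom_line(data = pred_grid, size = 1,
            aes(x = Substrate_Conc, y = S_over_V, col = factor(Inhibitor_Conc)))
This keeps the inhibitor concentrations in the data, so adding a fifth concentration would require no new plotting code.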

Using ggplot2 in R creating multiple smoothed/fitted lines

I am having trouble producing a figure in R using ggplot2. No stats are needed; I just need a visual representation of my data. I have 7 participants, and I want to plot a line for each participant through a scatterplot. The slope and shape of the line differ for each participant, but on average the trend is somewhat exponential.
I have used the code below in R, however I am only getting linear models. When changing the method to loess, the lines are too wiggly. Can someone please help me make this more presentable? Essentially I'm after a line of best fit for each participant, while still being able to use fullrange = FALSE.
Furthermore, should I be using stat_smooth or geom_smooth? Is there a difference?
ggplot(data, aes(x=x, y=y, group = athlete)) +
geom_point() +
stat_smooth(method = "lm", se=FALSE, fullrange = FALSE)
Thanks in advance for any help!
I don't have your data, so I'll just do this with the mpg dataset.
As you've noted, you can use geom_smooth() and specify a method such as "loess". You can pass arguments through to that method just as you would when calling the underlying function directly.
With loess, the smoothing parameter is span. You can play around with this until you're happy with the results.
library(ggplot2)
data(mpg)
g <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point()
g + geom_smooth(se = FALSE, method = 'loess', span = .8) + ggtitle("span 0.8")
g + geom_smooth(se = FALSE, method = 'loess', span = 1) + ggtitle("span 1")
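Two asides of my own: stat_smooth() and geom_smooth() are essentially aliases in ggplot2, so either works here. And if loess is too wiggly while a plain linear fit is too straight, a middle ground is to keep method = "lm" but hand it a curved formula, for example a quadratic; fullrange = FALSE still applies:
# lm with a polynomial formula: smoother than loess, curvier than a
# straight line. The formula uses the smoother's own x and y variables.
g + geom_smooth(se = FALSE, method = 'lm', formula = y ~ poly(x, 2)) +
  ggtitle("lm with a quadratic formula")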
There is, to my knowledge, no built-in method for achieving this, but you can do it with some manual plotting. First, since you expect an exponential relationship, it might make sense to run a linear regression using log(y) as the response (I'll be using u and v, in order not to confuse them with the x and y aesthetics in the graph):
library(dplyr)    # for tibble() and mutate()
library(ggplot2)
tb1 = tibble(
  u = rep(runif(100, 0, 5), 3),
  a = c(rep(-.5, 100), rep(-1, 100), rep(-2, 100)),
  v = exp(a*u + rnorm(3*100, 0, .1))
) %>% mutate(a = as.factor(a))
lm1 = lm(log(v) ~ a:u, tb1)
summary(lm1)
gives you:
Call:
lm(formula = log(v) ~ a:u, data = tb1)
Residuals:
Min 1Q Median 3Q Max
-0.263057 -0.069510 -0.001262 0.062407 0.301033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.013696 0.012234 -1.12 0.264
a-2:u -1.996670 0.004979 -401.04 <2e-16 ***
a-1:u -1.001412 0.004979 -201.14 <2e-16 ***
a-0.5:u -0.495636 0.004979 -99.55 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1002 on 296 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9983
F-statistic: 6.025e+04 on 3 and 296 DF, p-value: < 2.2e-16
Under "Coefficients" you can find the intercept and the "slopes" for the curves (actually the exponential factors). You can see that they closely match the factors we used for generating the data.
To plot the fitting curves, you can use the "predicted" values, produced from your linear model using predict:
ggplot(tb1, aes(u, v, colour=a)) +
geom_point() +
geom_line(data=tb1 %>% mutate(v = exp(predict(lm1))))
If you want to have the standard error ribbons, it's a little more work, but still possible:
p1 = predict(lm1, se.fit=T)
tb2 = tibble(
u = tb1$u,
a = tb1$a,
v = exp(p1$fit),
vmin = exp(p1$fit - 1.96*p1$se.fit),
vmax = exp(p1$fit + 1.96*p1$se.fit)
)
ggplot(tb2, aes(u, v, colour=a)) +
geom_ribbon(aes(fill=a, ymin=vmin, ymax=vmax), colour=NA, alpha=.25) +
geom_line(size=.5) +
geom_point(data=tb1)
produces:

How to display standardized y-scores in interaction plot

I am trying to plot a two-way interaction of standardized data in R using the package "interplot". However, the displayed y-scores are not standardized anymore. Why is that and how can I fix that?
I have tried to change the y-limits and to use the "scale_y_continuous()" function.
# generate data
x <- rnorm(100, 0, 1)
y <- x + rnorm(100, 0, 1)
z <- y + rnorm(100, 0, 1)
df <- as.data.frame(cbind(x,y,z))
# build model with interaction term
model1 <- glm(y ~ x*z, data=df)
# plot interaction
require(interplot)
interplot(model1, var1 = "x",var2 = "z", ci = 0.95, predPro = TRUE,
var2_vals = c(-1, 1), hist=F) + xlim(-3, 3) +
theme_classic()
I expect the y-scale to display values between -3 and +3, since the scores are standardized. However, the displayed y-values are between 20 and 80.
With the help of the example in ?interplot:
set.seed(123)
# generate data
x <- rnorm(100, 0, 1)
y <- x + rnorm(100, 0, 1)
z <- y + rnorm(100, 0, 1)
df <- as.data.frame(cbind(x,y,z))
# build model with interaction term
model1 <- glm(y ~ x*z, data=df)
# lm(y ~ x*z, data=df) would be equivalent: glm() with its default gaussian family is a linear model
# plot interaction
require(interplot, quietly = TRUE, warn.conflicts = FALSE)
interplot(model1, var1 = "x",var2 = "z", ci = 0.95,
predPro = TRUE, var2_vals = c(-1,1)) +
xlim(-3, 3) +
xlab("x values") +
ylab("Estimated Coefficient for z") +
ggtitle('Estimated Coefficient of z by x conditionally to y in c(-1,1)') +
theme_classic()
interplot(model1, var1 = "x",var2 = "z", ci = 0.95) +
xlim(-3, 3) +
xlab("x values") +
ylab("Estimated Coefficient for z") +
ggtitle('Estimated Coefficient of z by x') +
theme_classic()
#> Warning: Removed 28 rows containing missing values (geom_path).
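As the comment in the code above hints, a quick check of my own confirms that glm() with its default gaussian family fits the same model as lm():
# The coefficients agree, so model1 really is just a linear model.
all.equal(coef(glm(y ~ x*z, data = df)), coef(lm(y ~ x*z, data = df)))
#> [1] TRUE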

How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case.
I am calibrating a machine with a couple of standards (a, b, c) with accepted values x. My machine measures y for these standards, with a certain error (standard deviation of 1 in this example).
Then I measure replicates of a sample, yielding ynew. Now I want to convert these values to the accepted measurement scale (the x-axis). To do this, I can of course do some linear algebra and convert the slope and intercept that I got from my standard measurements to a reversed slope and intercept, as follows: x = -(intercept / slope) + (1 / slope) * y.
This works nicely to convert the input values, but how do I get proper estimates of the errors?
In R, I've tried the following:
library(broom) # for tidy lm
library(ggplot2) # for plotting
library(dplyr) # to allow piping
# find confidence value
cv <- function(x, level = 95) {
qt(1 - ((100 - level) / 100) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
}
# find confidence interval
ci <- function(x, level = 95) {
xbar <- mean(x)
xci <- cv(x, level = level)
c(fit = xbar, lwr = xbar - xci, upr = xbar + xci)
}
set.seed(1337)
# create fake data
dat <- data.frame(id = rep(letters[1:3], 20),
x = rep(c(1, 7, 10), 20)) %>%
mutate(y = rnorm(n(), -20 + 1.25 * x, 1))
# generate linear model
mod <- lm(y ~ x, dat)
# tidy
mod_aug <- augment(mod)
# these are the new samples that my machine measures
ynew <- rnorm(10, max(dat$y) + 3)
# predict new x-value based on y-value that is outside of range
## predict(mod, newdata = data.frame(y = ynew), interval = "predict")
# Error in eval(predvars, data, env) : object 'x' not found
# or tidy
## augment(mod, newdata = data.frame(y = ynew))
# 50 row df that doesn't make sense
# found this function that should do the job, but it doesn't extrapolate
## approx(x = mod$fitted.values, y = dat$x, xout = ynew)$y
# [1] NA NA NA NA NA NA NA NA NA NA
# this one from Hmisc does allow for extrapolation
with_approx <- Hmisc::approxExtrap(x = mod_aug$.fitted, y = mod_aug$x, xout = ynew)$y
# but in case of lm, isn't using the slope and intercept of a model okay too?
with_itc_slp <- (- coef(mod)[1] / coef(mod)[2]) + (1 / coef(mod)[2] * ynew)
# this would be the 95% prediction interval of the model at the average
# sample position. Could also use "confidence" but this is more correct?
avg_prediction <- predict(mod,
newdata = data.frame(x = mean(with_itc_slp)),
interval = "prediction")
# plot it
ggplot(dat, aes(x = x, y = y, col = id)) +
geom_point() +
geom_hline(yintercept = ynew, col = "gray") +
geom_smooth(aes(group = 1), method = "lm", se = F, fullrange = T,
col = "lightblue") +
geom_smooth(aes(group = 1), method = "lm") +
# 95% CI of the new sample
annotate("pointrange", x = 1, y = mean(ynew),
ymin = ci(ynew)[2], ymax = ci(ynew)[3], col = "green") +
# 95% prediction interval of the linear model at the average transformed
# x-position
annotate("pointrange", x = mean(with_approx), y = mean(ynew),
ymin = avg_prediction[2], ymax = avg_prediction[3], col = "green") +
# transformed using approx
annotate("point", x = with_approx, y = ynew, size = 3, col = "blue",
shape = 1) +
# transformed using intercept and slope
annotate("point", x = with_itc_slp, y = ynew, size = 3, col = "red",
shape = 2) +
# it's pretty
coord_fixed()
resulting in this plot:
Now how do I go from these 95% CIs in the y-direction to transformed sample x-values with a confidence interval in the x-direction?
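One textbook route to an x-direction interval is first-order (delta-method) error propagation through the inverted line x0 = (y0 - b0) / b1, combining the coefficient covariance matrix with the uncertainty of the replicate mean. A rough sketch under those assumptions, reusing mod and ynew from above:
# First-order (delta-method) error propagation for the inverse prediction.
b <- coef(mod)                          # intercept b0 and slope b1
V <- vcov(mod)                          # covariance matrix of (b0, b1)
y0 <- mean(ynew)
se_y0 <- sd(ynew) / sqrt(length(ynew))  # SE of the replicate mean
x0 <- unname((y0 - b[1]) / b[2])
# Gradient of x0 with respect to (b0, b1), evaluated at the estimates
grad <- c(-1 / b[2], -(y0 - b[1]) / b[2]^2)
se_x0 <- sqrt(drop(t(grad) %*% V %*% grad) + (se_y0 / b[2])^2)
x0 + c(-1, 1) * 1.96 * se_x0            # approximate 95% CI in x
If a packaged solution is preferred, the investr package is built for exactly this kind of inverse-estimation (calibration) problem.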

2 polynomial regressions in a ggplot() graph

This is my Dataset:
As you can see, there are two quantitative variables (X, Y) and 1 categorical variable (molar, with two factors: M1, M2).
I would like to represent, in one single graph, two polynomial regressions and their respective prediction intervals: one for the M1 factor and one for the M2 factor. Each polynomial regression has its own degree (degree 4 for M1, degree 6 for M2).
I want to use the ggplot() function (from the ggplot2 package in R). I have actually produced this figure, but with all the data merged (that is, with no distinction between factors). This is the code I used:
# Fit a linear model
m <- lm(Y ~ X+I(X^2)+I(X^3)+I(X^4), data = Dataset)
# cbind the predictions to Dataset
mpi <- cbind(Dataset, predict(m, interval = "prediction"))
ggplot(mpi, aes(x = X)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y = Y)) +
geom_line(aes(y = fit), colour = "blue", size = 1)
With this result:
So, I would like to have two polynomial regressions of different degrees (one for M1 and one for M2), each with its respective prediction interval. What would be the exact code?
UPDATE - New code! I ran this code with no success:
M1=subset(Dataset,Dataset$molar=="M1",select=X:Y)
M2=subset(Dataset,Dataset$molar=="M2",select=X:Y)
M1.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M1",select=X:Y))
M2.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M2",select=X:Y))
newdf <- data.frame(x = seq(0, 1, c(408,663)))
M1.P <- cbind(data=subset(Dataset,Dataset$molar=="M1",select=X:Y), predict(M1.R, interval = "prediction"))
M2.P <- cbind(data=subset(Dataset,Dataset$molar=="M2",select=X:Y), predict(M2.R, interval = "prediction"))
p = cbind(as.data.frame(rbind(M1.P, M2.P)), f = factor(rep(1:2, c(408,663)), x = rep(newdf$x, 2))
mdf = with(Dataset, data.frame(x = rep(x, 2), y = c(subset(Dataset,Dataset$molar=="M1",select=Y), subset(Dataset,Dataset$molar=="M2",select=Y),
f = factor(rep(1:2, c(408,663))))
ggplot(mdf, aes(x = x, y = y, colour = f)) + geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
These are the messages I get now:
[98] WARNING: Warning in if (n < 0L) stop("wrong sign in 'by' argument") :
the condition has length > 1 and only the first element will be used
Warning in if (n > .Machine$integer.max) stop("'by' argument is much too small") :
the condition has length > 1 and only the first element will be used
Warning in 0L:n :
numerical expression has 2 elements: only the first used
Warning in if (by > 0) pmin(x, to) else pmax(x, to) :
the condition has length > 1 and only the first element will be used
[99] WARNING: Warning in predict.lm(M1.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[100] WARNING: Warning in predict.lm(M2.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[101] ERROR: <text>
I think I am closer but still can't see it. Help!
Here is one way. If you have more than two models/levels in the factor, you should look into code that loops over the levels of the factor and fits the models that way (a sketch of that idea follows at the end of this answer).
Anyway, first some dummy data:
library(ggplot2)
set.seed(100)
x <- runif(100)
y1 <- 2 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) + rnorm(100)
y2 <- -1 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) +
(-0.3 * x^5) + (2.4 * x^6) + rnorm(100)
df <- data.frame(x, y1, y2)
Fit our two models:
m1 <- lm(y1 ~ poly(x, 4), data = df)
m2 <- lm(y2 ~ poly(x, 6), data = df)
Now predict at some new locations x, and stick the predictions together with x and f, a factor indexing the model, in a tidy format:
newdf <- data.frame(x = seq(0, 1, length = 100))
p1 <- predict(m1, newdata = newdf, interval = "prediction")
p2 <- predict(m2, newdata = newdf, interval = "prediction")
p <- cbind(as.data.frame(rbind(p1, p2)), f = factor(rep(1:2, each = 100)),
x = rep(newdf$x, 2))
Melt the original data into tidy form:
mdf <- with(df, data.frame(x = rep(x, 2), y = c(y1, y2),
f = factor(rep(1:2, each = 100))))
Draw the plot, using colour to distinguish the models/data:
ggplot(mdf, aes(x = x, y = y, colour = f)) +
geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
This gets us the combined plot of both fits with their prediction ribbons.
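As for the more-than-two-levels case mentioned at the start of this answer, a sketch of looping over the factor levels, with a per-level polynomial degree, might look like this:
# Fit one model per level of f and row-bind the predictions; degrees
# maps each factor level to its polynomial degree.
degrees <- c("1" = 4, "2" = 6)
fits <- lapply(levels(mdf$f), function(lev) {
  d <- subset(mdf, f == lev)
  m <- lm(y ~ poly(x, degrees[[lev]]), data = d)
  out <- cbind(newdf, as.data.frame(
    predict(m, newdata = newdf, interval = "prediction")))
  out$f <- factor(lev, levels = levels(mdf$f))
  out
})
p_all <- do.call(rbind, fits)  # drop-in replacement for p in the plot code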
