ggplot exponential smooth with tuning parameter inside exp - r

ggplot provides various "smoothing methods" or "formulas" that determine the form of the trend line. However, it is unclear to me how the parameters of the formula are specified, and how I can get the exponential formula to fit my data. In other words, how do I tell ggplot that it should fit the parameter inside the exp?
df <- data.frame(x = c(65,53,41,32,28,26,23,19))
df$y <- c(4,3,2,8,12,8,20,15)
x y
1 65 4
2 53 3
3 41 2
4 32 8
5 28 12
6 26 8
7 23 20
8 19 15
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(x)) +
  geom_point()
p
Problematic fit:
However, if a suitable value is hard-coded for the parameter inside the exponential, the trend line takes a reasonable form:
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(-0.09 * x)) +
  geom_point()
p

Here is an approach with method nls instead of glm.
You can pass additional parameters to nls with a list supplied in method.args =. Here we define the starting values from which the a and r coefficients are fit.
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 10, r = -0.01)),
              color = "black") +
  geom_point()
As discussed in the comments, the best way to get the coefficients on the graph is by fitting the model outside the ggplot call.
model.coeff <- coef(nls( y ~ a * exp(r * x), data = df, start = c(a = 50, r = -0.04)))
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 50, r = -0.04)),
              color = "black") +
  geom_point() +
  geom_text(x = 40, y = 15,
            label = as.expression(substitute(italic(y) == a %.% italic(e)^(r %.% x),
                                             list(a = format(unname(model.coeff["a"]), digits = 3),
                                                  r = format(unname(model.coeff["r"]), digits = 3)))),
            parse = TRUE)
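If the plotmath expression feels heavy, a plain-text label is a simpler (if less polished) alternative. This is just a sketch reusing the model.coeff vector fitted above; annotate() also avoids geom_text drawing one copy of the label per data row:
label_text <- sprintf("y = %.3g * exp(%.3g * x)",
                      unname(model.coeff["a"]), unname(model.coeff["r"]))
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 50, r = -0.04)),
              color = "black") +
  geom_point() +
  annotate("text", x = 40, y = 15, label = label_text)  # drawn once, not per row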

Firstly, to pass additional parameters to the function passed to the method param of geom_smooth, you can pass a list of named parameters to method.args.
Secondly, the problem you're seeing is that glm is placing the coefficient in front of the whole term, y ~ coef * exp(x), instead of inside it, y ~ exp(coef * x), as you want. You could use optimization to solve the latter outside of glm, but you can also fit it within the GLM paradigm by a transformation: a log link. This works because it's like taking the equation you want to fit, y = exp(coef * x), and taking the log of both sides, so you're now fitting log(y) = coef * x, which is equivalent to what you want to fit and works with the GLM paradigm. (This ignores the intercept. The coefficients also end up on the transformed link scale, but it's easy enough to convert them back if you like.)
You can run this outside of ggplot to see what the models look like:
df <- data.frame(
  x = c(65, 53, 41, 32, 28, 26, 23, 19),
  y = c(4, 3, 2, 8, 12, 8, 20, 15)
)
bad_model <- glm(y ~ exp(x), family = gaussian(link = 'identity'), data = df)
good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)
# this is bad
summary(bad_model)
#>
#> Call:
#> glm(formula = y ~ exp(x), family = gaussian(link = "identity"),
#> data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -7.7143 -2.9643 -0.8571 3.0357 10.2857
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.714e+00 2.437e+00 3.986 0.00723 **
#> exp(x) -3.372e-28 4.067e-28 -0.829 0.43881
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 41.57135)
#>
#> Null deviance: 278.00 on 7 degrees of freedom
#> Residual deviance: 249.43 on 6 degrees of freedom
#> AIC: 56.221
#>
#> Number of Fisher Scoring iterations: 2
# this is better
summary(good_model)
#>
#> Call:
#> glm(formula = y ~ x, family = gaussian(link = "log"), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.745 -2.600 0.046 1.812 6.080
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.93579 0.51361 7.663 0.000258 ***
#> x -0.05663 0.02054 -2.757 0.032997 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 12.6906)
#>
#> Null deviance: 278.000 on 7 degrees of freedom
#> Residual deviance: 76.143 on 6 degrees of freedom
#> AIC: 46.728
#>
#> Number of Fisher Scoring iterations: 6
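As mentioned above, the log-link coefficients are on the link scale, and converting back is straightforward. A minimal sketch (not part of the original output): the fitted model is log(E[y]) = b0 + b1*x, i.e. E[y] = exp(b0) * exp(b1*x).
exp(coef(good_model)["(Intercept)"])  # the multiplicative constant, ~ a in y = a * exp(r * x)
coef(good_model)["x"]                 # ~ r, already the rate inside exp(r * x)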
From here, you can reproduce what geom_smooth is going to do: make a sequence of x values across the domain and use the predictions as the y values for the line:
# new data is a sequence across the domain of the model
new_df <- data.frame(x = seq(min(df$x), max(df$x), length = 501))
# `type = 'response'` because we want values for y back in y units
new_df$bad_pred <- predict(bad_model, newdata = new_df, type = 'response')
new_df$good_pred <- predict(good_model, newdata = new_df, type = 'response')
library(tidyr)
library(ggplot2)
new_df %>%
  # reshape to long form for ggplot
  gather(model, y, contains('pred')) %>%
  ggplot(aes(x, y)) +
  geom_line(aes(color = model)) +
  # plot original points on top
  geom_point(data = df)
Of course, it's a lot easier to let ggplot handle all that for you:
ggplot(df, aes(x, y)) +
  geom_smooth(
    method = 'glm',
    formula = y ~ x,
    method.args = list(family = gaussian(link = 'log'))
  ) +
  geom_point()

Related

R Different Prediction Result for Formula Containing "%>%"

In R, when I use predict to get a confidence interval for a certain x (x = 42) under the model y = (centered x) + (centered x)^2, I found two possible ways:
model1 = lm(y ~ scale(x, center=T, scale=F) + I( (scale(x, center=T, scale=F))^2 ), data=data)
model2 = lm(y ~ (x %>% scale(center=T, scale=F)) + I( (x %>% scale(center=T, scale=F))^2 ), data=data)
The summary results for the two models are the same. But when I ran:
predict(model1, data.frame(x=42), interval="confidence", level=0.95)
predict(model2, data.frame(x=42), interval="confidence", level=0.95)
The results are different. I am wondering why. Does R treat the above two formulas differently because of the usage of "%>%"?
The dataset is a practice dataset from Kutner's textbook SENIC.txt, y is the 11th column, x is the 12th column.
The issue here is with scale and not %>%. scale returns a matrix, which seems to affect the outcome.
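You can check this yourself (a quick demonstration, not part of the original answer):
z <- scale(1:5, center = TRUE, scale = FALSE)
class(z)  # "matrix" "array" -- a one-column matrix, not a numeric vector
dim(z)    # 5 1
z[, 1]    # indexing with [, 1] recovers an ordinary numeric vector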
One solution is to write a vector-to-vector equivalent of scale and use it:
library("magrittr") # for `%>%`
set.seed(1)
dataset = data.frame(x = rnorm(30))
dataset[["y"]] = 1 + (3 * dataset[["x"]]) + rnorm(30, mean = 0, sd = 0.1)
scale_vector = function(x, ...) {
  stopifnot(inherits(x, "numeric"))
  scale(x, ...)[, 1]
}
lm(y ~ scale_vector(x, center=T, scale=F) + I( (scale_vector(x, center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ scale_vector(x, center = T, scale = F) + I((scale_vector(x,
#> center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> scale_vector(x, center = T, scale = F)
#> 2.99296
#> I((scale_vector(x, center = T, scale = F))^2)
#> -0.01645
lm(y ~ (x %>% scale_vector(center=T, scale=F)) + I( (x %>% scale_vector(center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ (x %>% scale_vector(center = T, scale = F)) +
#> I((x %>% scale_vector(center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> x %>% scale_vector(center = T, scale = F)
#> 2.99296
#> I((x %>% scale_vector(center = T, scale = F))^2)
#> -0.01645
Besides, if you do not mind using the tidyverse, this might be cleaner:
library("magrittr") # for `%>%`
set.seed(1)
dataset = tibble::tibble(x = rnorm(30),
                         y = 1 + (3 * x) + rnorm(30, mean = 0, sd = 0.1))
dataset %>%
  dplyr::mutate(x_scaled = scale(x, center = TRUE, scale = FALSE)[, 1]) %>%
  lm(y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Call:
#> lm(formula = y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Coefficients:
#> (Intercept) x_scaled I(x_scaled^2)
#> 1.27423 2.99296 -0.01645

Cannot plot p-value on simple logistic regression

I am trying to plot a simple logistic regression in R.
I am following this tutorial to conduct the logistic regression and calculate a p-value (https://mgimond.github.io/Stats-in-R/Logistic.html). I am trying to use ggplot2 and ggpmisc to plot the regression, and I have been following this guide (http://cran.nexr.com/web/packages/ggpmisc/vignettes/user-guide-1.html#stat_fit_glance) to use stat_fit_glance to display a p-value:
require(cowplot)
require(ggplot2)
library(ggpmisc)
library(rms)
dataset=read.table('input.txt', header=TRUE)
model <- glm(variable ~ ancestry, data=dataset, family=binomial)
summary(model)
#plot logistic regression curve
plot <- ggplot(dataset, aes(x = ancestry, y = variable)) +
  geom_point(alpha = .5, color = dataset$colorsite) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_glance(method = "glm", method.args = list(formula = formula),
                  geom = "text",
                  aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))
ggsave("output.pdf")
However, the output comes out as:
> source("C:/Users/Deven/Desktop/logistic/script.R")
Saving 7 x 7 in image
`geom_smooth()` using formula 'y ~ x'
Warning message:
Computation failed in `stat_fit_glance()`:
object of type 'closure' is not subsettable
I have also tried stat_cor from ggpubr, but that seem to be generating different p-values and R^2 values from what I have calculated.
UPDATE BASED ON COMMENTS:
+ stat_poly_eq(formula = y ~ x, method="glm", aes(x = ancestry, y = variable, label = paste(..p.value.label..,sep = "~~~~")),parse = TRUE) fails due to
1: Computation failed in `stat_poly_eq()`:
Method 'glm' not yet implemented.
If I remove method it defaults to a linear regresssion (and gives p values that do not correspond to a logistic regression).
SECOND UPDATE
model <- glm(variable ~ ancestry, data=dataset, family=binomial)
summary(model)
#plot logistic regression curve
plot <- ggplot(dataset, aes(x = ancestry, y = variable)) +
  geom_point(alpha = .5, color = dataset$colorsite) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_tidy(method = "glm", method.args = list(family = binomial, formula = y ~ x),
                mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
                                              after_stat(x_estimate), after_stat(x_p.value))))
ggsave("variable.pdf")
yields the following error:
Saving 7 x 7 in image
`geom_smooth()` using formula 'y ~ x'
Warning message:
Computation failed in `stat_fit_tidy()`:
no applicable method for 'tidy' applied to an object of class "c('glm', 'lm')"
YET ANOTHER UPDATE
library(ggplot2)
library(ggpmisc)
da =read.table('data.txt', header=TRUE)
model = glm(variable ~ ancestry,family=binomial,data=da)
summary(model)
ggplot(da, aes(x = ancestry, y = variable)) + geom_point() +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_tidy(method = "glm", method.args = list(family = binomial, formula = y ~ x),
                mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
                                              after_stat(x_estimate), after_stat(x_p.value))))
ggsave("test.pdf")
works in theory, but the p-value it gives me is very different from the p-value I calculated manually (which corresponds to the one from lrm(variable ~ ancestry, dataset))...
Not sure at all what is going on here...
There is a table on the ggpmisc help page that specifies which stats can be applied to each type of model.
You have a glm, so glance() from broom will not give you a p-value. Using an example:
library(ggplot2)
library(ggpmisc)
da = MASS::Pima.tr
da$label = as.numeric(da$type=="Yes")
model = glm(label ~ bmi,family=binomial,data=da)
summary(model)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.11156 0.92806 -4.430 9.41e-06 ***
bmi 0.10482 0.02738 3.829 0.000129 ***
You can see that glance() will not give you a p-value:
broom::glance(model)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 256. 199 -120. 244. 251. 240. 198 200
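By contrast, tidy() does return one row per coefficient, including p-values, which is easy to verify yourself:
broom::tidy(model)
# one row per term, with columns estimate, std.error, statistic, and p.value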
You need to use tidy() and, as @JonSpring mentioned in the comments, provide the formula, so something like this:
ggplot(da,aes(x = bmi,y = label)) + geom_point() +
stat_smooth(method="glm",se=FALSE,method.args = list(family=binomial)) +
stat_fit_tidy(method = "glm",method.args = list(family=binomial,formula=y~x),
mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
after_stat(x_estimate),after_stat(x_p.value))))
Thank you for all the help, but unfortunately nothing automated worked, so I came up with this instead:
require(cowplot)
require(ggplot2)
library(ggpmisc)
library(rms)
dataset=read.table('data.txt', header=TRUE)
model <- glm(variable ~ ancestry, data=dataset, family=binomial)
summary(model)
M1 <- glm(variable ~ ancestry, dataset, family = binomial)
M1
M1$null.deviance
M1$deviance
modelChi <- M1$null.deviance - M1$deviance    # likelihood-ratio chi-square statistic
pseudo.R2 <- modelChi / M1$null.deviance      # deviance-based pseudo R-squared
pseudo.R2
test <- lrm(variable ~ ancestry, dataset)
Chidf <- M1$df.null - M1$df.residual          # degrees of freedom for the test
chisq.prob <- 1 - pchisq(modelChi, Chidf)     # p-value of the likelihood-ratio test
chisq.prob
#plot logistic regression curve
all_variable <- ggplot(dataset, aes(x = ancestry, y = variable)) +
  geom_point(alpha = .5, color = dataset$colorsite) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  annotate("text", x = -Inf, y = Inf, hjust = 0, vjust = 2.5,
           label = paste("p-value: ", signif(chisq.prob, digits = 3),
                         "\nR2: ", signif(pseudo.R2, digits = 3), sep = "")) +
  ggtitle("Title not relevant to Stack Overflow")
ggsave("variable.pdf")

How to plot lm slope modeled using poly()?

I need to plot the relationship between x and y, where polynomials of x predict y. This is done using the poly() function in order to ensure the polynomials are orthogonal.
How do I plot this relationship considering the linear, quadratic, and cubic terms together? The issue is that the coefficients for the different terms are not scaled as x is.
I provide some example code below. I have tried reassigning the corresponding contrast values for each polynomial to x.
This solution gives impossible predicted values.
Thank you in advance for your help!
Best wishes,
Eric
Here is an example code:
x = sample(0:6,100,replace = TRUE)
y = (x*0.2) + (x^2*.05) + (x^3*0.001)
y = y + rnorm(100)
x = poly(x,3)
m = lm(y~x)
TAB = summary(m)$coefficients
### Reassigning the corresponding contrast values to each polynomial of x:
eq = function(x, TAB, start) {
  # argument 'start' gives the position of the linear coefficient; quadratic and cubic follow
  pols = poly(x, 3)
  x1 = pols[, 1]; x2 = pols[, 2]; x3 = pols[, 3]
  TAB[1, 1] + x1[x]*TAB[start, 1] + x2[x]*TAB[start + 1, 1] + x3[x]*TAB[start + 2, 1]
}
plot(eq(0:7,TAB,2))
Actually, you can use poly() directly in the formula for lm(): y ~ poly(x, 3) might be what you want.
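The key point is that predict() re-applies the stored orthogonal basis to new data, so you never have to reconstruct the contrast values by hand. A minimal sketch, assuming x and y are your original vectors (before x is overwritten with poly(x, 3)):
m <- lm(y ~ poly(x, 3))
# predict() maps new x values through the same poly() basis automatically
predict(m, newdata = data.frame(x = 0:6))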
For the plot, I'll use the ggplot2 package, whose geom_smooth() function can draw the fitted curve. You should specify the method = "lm" argument and the formula:
library(tidyverse)
x <- sample(0:6,100,replace = TRUE)
y <- (x*0.2) + (x^2*.05) + (x^3*0.001)
eps <- rnorm(100)
(df <- data_frame(y = y + eps, x = x))
#> # A tibble: 100 x 2
#> y x
#> <dbl> <int>
#> 1 3.34 4
#> 2 1.23 5
#> 3 1.38 3
#> 4 -0.115 2
#> 5 1.94 5
#> 6 3.87 6
#> 7 -0.707 3
#> 8 0.954 3
#> 9 1.19 3
#> 10 -1.34 0
#> # ... with 90 more rows
Using your simulated data set,
df %>%
  ggplot() +       # declare the data set first
  aes(x, y) +      # aesthetics
  geom_point() +   # data points
  geom_smooth(method = "lm", formula = y ~ poly(x, 3))  # lm fit
If you want to remove the points, just drop geom_point():
df %>%
  ggplot() +
  aes(x, y) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3))
For transparency, set alpha to less than 1:
df %>%
  ggplot() +
  aes(x, y) +
  geom_point(alpha = .3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3))

Drawing the glm decision boundary with ggplot's stat_smooth() function returns wrong line

I want to plot the decision boundary after I fit a logistic regression model to my data. I use ggplot and the stat_smooth() function to define the decision boundary line. However, the plot returned is wrong. For a reproducible example, see below:
#-----------------------------------------------------------------------------------------------------
# CONSTRUCT THE DATA
#-----------------------------------------------------------------------------------------------------
library(data.table)
library(ggplot2)
library(ggthemes)  # for theme_economist()

X.1_Y.1 <- rnorm(1000, mean = 1.5, sd = 0.3)
X.2_Y.1 <- rnorm(1000, mean = 1.5, sd = 5)
X.1_Y.0 <- rnorm(99000, mean = 0, sd = 1)
X.2_Y.0 <- rnorm(99000, mean = 0, sd = 1)
data <- data.table(X.1 = c(X.1_Y.1, X.1_Y.0),
                   X.2 = c(X.2_Y.1, X.2_Y.0),
                   Y = c(rep(1, 1000), rep(0, 99000)))
#-----------------------------------------------------------------------------------------------------
# FIT A LOGISTIC MODEL ON THE DATA
#-----------------------------------------------------------------------------------------------------
model <- glm(Y ~ X.1 + X.2, data, family = "binomial")
summary(model)
#Call:
# glm(formula = Y ~ ., family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.6603 -0.1194 -0.0679 -0.0384 4.6263
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.04055 0.06636 -91.02 <2e-16 ***
# X.1 1.60828 0.03854 41.73 <2e-16 ***
# X.2 0.43272 0.01673 25.87 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 11200.3 on 99999 degrees of freedom
#Residual deviance: 8218.5 on 99997 degrees of freedom
#AIC: 8224.5
#-------------------------------------------------------------------------------------------------------
# DEFINE AND DRAW THE DECISION BOUNDARY
#-------------------------------------------------------------------------------------------------------
# 0 = -6.04 + 1.61 * X.1 + 0.44 * X2 => X2 = 6.04/0.44 - 1.61/0.44 * X.1
setDT(data)
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  stat_smooth(formula = x.2 ~ 6.04/0.44 - (1.61/0.44) * X.1, color = "blue", size = 2) +
  coord_equal() +
  theme_economist()
This returns the following plot:
You can easily see that the line drawn is wrong: according to the formula, X.2 should be 6.04/0.44 when X.1 = 0, which is clearly not the case in this plot.
Could you tell me where my code errs and how to correct it?
Your advice will be appreciated.
If you are trying to plot a line on your graph that you fit yourself, you should not be using stat_smooth; you should be using stat_function. For example:
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  stat_function(fun = function(x) {6.04/0.44 - (1.61/0.44) * x}, color = "blue", size = 2) +
  coord_equal()
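Since this particular boundary is a straight line, geom_abline() would also work; here is a sketch using the same hand-derived intercept and slope:
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  # intercept and slope taken from rearranging 0 = -6.04 + 1.61*X.1 + 0.44*X.2
  geom_abline(intercept = 6.04/0.44, slope = -1.61/0.44, color = "blue", size = 2) +
  coord_equal()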

How to add a logarithmic nonlinear fit to ggplot?

I'd like to fit a logarithmic curve through my data using nls.
library(dplyr)
library(ggplot2)
a <- 3
b <- 2
Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
                y = a + b*log(x))
Y %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(shape = 19, size = 2) +
  geom_smooth(method = "nls",
              formula = y ~ p1 + p2*log(x),
              start = list(a = a, b = b),
              se = FALSE,
              control = list(maxiter = 100))
This gives me an error:
Error in method(formula, data = data, weights = weight, ...) :
number of iterations exceeded maximum of 100
What is going wrong?
Here's some text I copied and pasted after doing ?nls:
Warning
Do not use nls on artificial "zero-residual" data.
The nls function uses a relative-offset convergence criterion that compares the numerical imprecision at the current parameter estimates to the residual sum-of-squares. This performs well on data of the form
y = f(x, θ) + eps
(with var(eps) > 0). It fails to indicate convergence on data of the form
y = f(x, θ)
because the criterion amounts to comparing two components of the round-off error. If you wish to test nls on artificial data please add a noise component, as shown in the example below.
That inspired me to try this:
> library(dplyr)
> library(ggplot2)
> a <- 3
> b <- 2
> Y <- data_frame(x = c(0.2, 0.5, 1, 2, 5, 10),
+ y = a + b*log(x)*(1 + rnorm(length(x), sd=0.001)))
> Y %>%
+ ggplot(aes(x = x, y = y)) +
+ geom_point(shape = 19, size = 2) +
+ geom_smooth(method = "nls",
+ formula = y ~ p1 + p2*log(x),
+ start = list(p1 = a, p2 = b),
+ se = FALSE,
+ control = list(maxiter = 100))
Note: your code had start = list(a = a, b = b), which is a typo, because a and b aren't defined in your formula (it uses p1 and p2). Aside from that, adding the *(1 + rnorm(length(x), sd = 0.001)) is the only thing I did.
The resulting graph made it seem like everything worked fine.
I'd generally recommend doing the fit separately, however, and then plotting it with predict. That way you can always check the quality of the fit to see if it worked before plotting.
> fit <- nls(data=Y, formula = y ~ p1 + p2*log(x), start = list(p1 = a, p2 = b))
> summary(fit)
Formula: y ~ p1 + p2 * log(x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 3.001926 0.001538 1952 4.14e-13 ***
p2 1.999604 0.001114 1795 5.78e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003619 on 4 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.623e-08
> new_x = data.frame(x=seq(from=0.2, to=10, length.out=100))
> ggplot(data=Y, aes(x=x, y=y)) +
+   geom_point() +
+   geom_line(data=new_x,
+             aes(x=x, y=predict(fit, newdata=new_x)),  # map the x column, not the whole data frame
+             color='blue')
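One last note, not from the original answer: because y ~ p1 + p2*log(x) is linear in the parameters, nls isn't strictly needed here; ordinary lm() fits the same curve with no starting values at all:
# the model is linear in the coefficients once log(x) is treated as a regressor
fit_lm <- lm(y ~ log(x), data = Y)
coef(fit_lm)  # (Intercept) plays the role of p1, the log(x) coefficient of p2

# or directly inside ggplot:
ggplot(Y, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ log(x), se = FALSE)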
