constrained multiple linear regression in R - r

Suppose I have to estimate coefficients a,b in regression:
y=a*x+b*z+c
I know in advance that y is always in range y>=0 and y<=x, but regression model produces sometimes y outside of this range.
Sample data:
mydata<-data.frame(y=c(0,1,3,4,9,11),x=c(1,3,4,7,10,11),z=c(1,1,1,9,6,7))
round(predict(lm(y~x+z,data=mydata)),2)
1 2 3 4 5 6
-0.87 1.79 3.12 4.30 9.34 10.32
First predicted value is <0.
I tried model without intercept: all predictions are >0, but third prediction of y is >x (4.03>3)
round(predict(lm(y~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.76 2.94 4.03 4.67 8.92 9.68
I also considered to model proportion y/x instead of y:
mydata$y2x<-mydata$y/mydata$x
round(predict(lm(y2x~x+z,data=mydata)),2)
1 2 3 4 5 6
0.15 0.39 0.50 0.49 0.97 1.04
round(predict(lm(y2x~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.08 0.33 0.46 0.47 0.99 1.07
But now sixth prediction is >1, but proportion should be in range [0,1].
I also tried to apply method where glm is used with offset option: Regression for a Rate variable in R
and
http://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset
but this was not successfull.
Please note, in my data dependent variable: proportion y/x is both zero-inflated and one-inflated.
Any idea, what is suitable approach to build a model in R ('glm','lm')?

You're on the right track: if 0 ≤ y ≤ x then 0 ≤ (y/x) ≤ 1. This suggests fitting y/x to a logistic model in glm(...). Details are below, but considering that you've only got 6 points, this is a pretty good fit.
The major concern is that the model is not valid unless the error in (y/x) is Normal with constant variance (or, equivalently, the error in y increases with x). If this is true then we should get a (more or less) linear Q-Q plot, which we do.
One nuance: the interface to the glm logistic model wants two columns for y: "number of successes (S)" and "number of failures (F)". It then calculates the probability as S/(S+F). So we have to provide two columns which mimic this: y and x-y. Then glm(...) will calculate y/(y+(x-y)) = y/x.
Finally, the fit summary suggests that x is important and z may or may not be. You might want to try a model that excludes z and see if that improves AIC.
fit = glm(cbind(y,x-y)~x+z, data=mydata, family=binomial(logit))
summary(fit)
# Call:
# glm(formula = cbind(y, x - y) ~ x + z, family = binomial(logit),
# data = mydata)
# Deviance Residuals:
# 1 2 3 4 5 6
# -0.59942 -0.35394 0.62705 0.08405 -0.75590 0.81160
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.0264 1.2177 -1.664 0.0961 .
# x 0.6786 0.2695 2.518 0.0118 *
# z -0.2778 0.1933 -1.437 0.1507
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 13.7587 on 5 degrees of freedom
# Residual deviance: 2.1149 on 3 degrees of freedom
# AIC: 15.809
par(mfrow=c(2,2))
plot(fit) # residuals, Q-Q, Scale-Location, and Leverage Plots
mydata$pred <- predict(fit, type="response")
par(mfrow=c(1,1))
plot(mydata$y/mydata$x,mydata$pred,xlim=c(0,1),ylim=c(0,1), xlab="Actual", ylab="Predicted")
abline(0,1, lty=2, col="blue")

Related

Calculating odds ratio from glm output

It is my first time doing logistic regressions and I am currently trying to teach myself how to find the odds ratio. I got my coefficients from r as shown below.
(Intercept) totalmins
0.2239254 1.2424020
To exponentiate the regression coefficient I did the following:
exp1.242/exp1.242+1 = 0.77
Really not sure if this is the correct process or not.
Any advice on how I would go about calculating odds ratio would be greatly appreciated
detection- 1/0 data if animal was detected at site
total mins- time animal spent at site
here's the output
glm(formula = detection ~ totalmins, family = binomial(link = "logit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.81040 -0.63571 0.00972 0.37355 1.16771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.49644 0.81818 -1.829 0.0674 .
totalmins 0.21705 0.08565 2.534 0.0113
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 41.194 on 33 degrees of freedom
Residual deviance: 21.831 on 32 degrees of freedom
(1 observation deleted due to missingness)
AIC: 25.831
Number of Fisher Scoring iterations: 8
This model evaluates the log odds of detecting an animal at the site based on the time in minutes that the animal spent on the site. The model output indicates:
log odds(animal detected | time on site) = -1.49644 + 0.21705 * minutes animal on site
To convert to odds ratios, we exponentiate the coefficients:
odds(animal detected) = exp(-1.49644) * exp(0.21705 * minutes animal on site)
Therefore, the odds and probability of detection if the animal spends 0 minutes on site is e(-1.49644) or 0.2239. The odds ratio of detection if an animal is on site for X minutes is calculated as follows. We'll model odds ratios for minutes 0 through 10, and calculate the associated probability of detection.
# odds of detection if animal on site for X minutes
coef_df <- data.frame(intercept=rep(-1.49644,11),
slopeMinutes=rep(0.21705,11),
minutesOnSite=0:10)
coef_df$minuteValue <- coef_df$minutesOnSite * coef_df$slopeMinutes
coef_df$intercept_exp <- exp(coef_df$intercept)
coef_df$slope_exp <- exp(coef_df$minuteValue)
coef_df$odds <- coef_df$intercept_exp * coef_df$slope_exp
coef_df$probability <- coef_df$odds / (1 + coef_df$odds)
...and the output:
> coef_df[,c(3:4,6:8)]
minutesOnSite intercept_exp slope_exp odds probability
1 0 0.2239 1.000 0.2239 0.1830
2 1 0.2239 1.242 0.2782 0.2177
3 2 0.2239 1.544 0.3456 0.2569
4 3 0.2239 1.918 0.4294 0.3004
5 4 0.2239 2.383 0.5335 0.3479
6 5 0.2239 2.960 0.6629 0.3986
7 6 0.2239 3.678 0.8235 0.4516
8 7 0.2239 4.569 1.0232 0.5057
9 8 0.2239 5.677 1.2712 0.5597
10 9 0.2239 7.053 1.5793 0.6123
11 10 0.2239 8.763 1.9622 0.6624
>
See also How to get probability from GLM output for another example using space shuttle autolander data from the MASS package.

How do I use the glm() function?

I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even if just looking at the data I see a clear interaction between A and B, the GLM says that p-value>>>0.05. Am I doing something wrong?
First of all I create the data frame including my data for the GLM, which consists on a Y dependent variable and two factors, A and B. These are two level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see how it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just looking on the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when A and B are present (that is A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as p-value>>>0.05
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state Y is continuous, the data shows that Y is rather a fraction. Hence, probably the reason you tried to apply GLM in the first place.
To model fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fullfilled. See the following cross-validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fullfilled.
An alternative to model fractions are beta regression or fractional repsonse models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
The family=binomial implies Logit (Logistic) Regression, which is itself produces a binary result.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try to fit a different model, logistic is not appropriate.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

How can I compare regression coefficients across three (or more) groups using R?

Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this in CrossValidated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use anova function in R. For example, using the data in the question and shown reproducibly in the note at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
giving the following which is significant at the 2.7% level so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested in age and the overall intercept is omitted. Thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance so you will want to lower the cutoff for p values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)

Regression on constant

I need to run a regression on a constant. In Eviews, I don't need to put any thing as a predictor when I run a regression on constant.I don't know how to do that in R. Does any one knows what should I write in this commnd?
fit= lm(r~?)
You can specify a constant as 1 in a formula:
r <- 1:5
fit <- lm(r ~ 1)
summary(fit)
# Call:
# lm(formula = r ~ 1)
#
# Residuals:
# 1 2 3 4 5
# -2.00e+00 -1.00e+00 2.22e-16 1.00e+00 2.00e+00
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.0000 0.7071 4.243 0.0132 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.581 on 4 degrees of freedom
Note that you don't need lm to get this result:
mean(r)
#[1] 3
sd(r)/sqrt(length(r))
#[1] 0.7071068
However, you might want to use lm in order to have a Null model against which you can compare other models ...
Edit:
Since you comment that you need "the p-value", I suggest to use a t-test instead.
t.test(r)
# One Sample t-test
#
#data: r
#t = 4.2426, df = 4, p-value = 0.01324
#alternative hypothesis: true mean is not equal to 0
#95 percent confidence interval:
# 1.036757 4.963243
#sample estimates:
#mean of x
# 3
This is equivalent, but more efficient computationally.

Polynomial regression line through origin with equation in calibration curve

I would like a second order(? is it) regression line plotted through zero and crucially I need the equation for this relationship.
Here's my data:
ecoli_ug_ml A420 rpt
1 0 0.000 1
2 10 0.129 1
3 20 0.257 1
4 30 0.379 1
5 40 0.479 1
6 50 0.579 1
7 60 0.673 1
8 70 0.758 1
9 80 0.838 1
10 90 0.912 1
11 100 0.976 1
12 0 0.000 2
13 10 0.126 2
14 20 0.257 2
15 30 0.382 2
16 40 0.490 2
17 50 0.592 2
18 60 0.684 2
19 70 0.772 2
20 80 0.847 2
21 90 0.917 2
22 100 0.977 2
23 0 0.000 3
24 10 0.125 3
25 20 0.258 3
26 30 0.376 3
27 40 0.488 3
28 50 0.582 3
29 60 0.681 3
30 70 0.768 3
31 80 0.846 3
32 90 0.915 3
33 100 0.977 3
My plot looks like this: (sci2 is just some axis and text formatting, can include if necessary)
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x - 1,2)) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u."))) +
sci2
When I view this, the fit of this line to the points is spectacularly good.
When I check out coef, I think there is non-zero y-intercept (which is unacceptable for my purposes) but to be honest I don't really understand these lines:
coef(lm(A420 ~ poly(ecoli_ug_ml, 2, raw=TRUE), data = calib))
(Intercept) poly(ecoli_ug_ml, 2, raw = TRUE)1
-1.979021e-03 1.374789e-02
poly(ecoli_ug_ml, 2, raw = TRUE)2
-3.964258e-05
Therefore, I assume the plot is actually not quite right either.
So, what I need is to generate a regression line forced through zero and get the equation for it, and, understand what it's saying when it gives me said equation. If I could annotate the plot area with the equation directly I would be incredibly stoked.
I have spent approximately 8 hours trying to work this out now, I checked excel and got a formula in 8 seconds but I would really like to get into using R for this. Thanks!
To clarify: the primary purpose of this plot is not to demonstrate the distribution of these data but rather to provide a visual confirmation that the equation I generate from these points fits the readings well
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T),data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T), data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.979e-03 1.926e-03 -1.028 0.312
# poly(ecoli_ug_ml, 2, raw = T)1 1.375e-02 8.961e-05 153.419 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.964e-05 8.631e-07 -45.932 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004379 on 30 degrees of freedom
# Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
# F-statistic: 8.343e+04 on 2 and 30 DF, p-value: < 2.2e-16
So the intercept is not exactly 0 but it is small compared to the Std. Error. In other words, the intercept is not significantly different from 0.
You can force a fit without the intercept this way (note the -1 in the formula):
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T)-1,data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T) - 1, data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# poly(ecoli_ug_ml, 2, raw = T)1 1.367e-02 5.188e-05 263.54 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.905e-05 6.396e-07 -61.05 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004383 on 31 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 3.4e+05 on 2 and 31 DF, p-value: < 2.2e-16
Note that the coefficients do not change appreciably.
EDIT (Response to OP's comment)
The formula specified in stat_smooth(...) is just passed directly to the lm(...) function, so you can specify in stat_smooth(...) any formula that works in lm(...). The point of the results above is that, even without forcing the intercept to 0, it is extremely small (-2e-3) compared to the range in y (0-1), so plotting curves with and without will give nearly indistinguishable results. You can see this for yourself by running this code:
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x,2,raw=T),colour="red") +
stat_smooth(method="lm", formula=y~-1+poly(x,2,raw=T),colour="blue") +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))
The blue and red curves are nearly, but not quite on top of each other (you may have to open up your plot window to see it). And no, you do not have to do this "outside of ggplot."
The problem you reported relates to using the default raw=F. This causes poly(...) to use orthogonal polynomials, which by definition have constant terms. So using y~-1+poly(x,2) doesn't really make sense, whereas using y~-1+poly(x,2,raw=T) does make sense.
Finally, if all this business of using poly(...) with or without raw=T is causing confusion, you can achieve the exact same result using formula = y~ -1 + x + I(x^2). This fits a second order polynomial (a*x +b*x^2) and suppresses the constant term.
I think you are misinterpreting that Intercept and also how stat_smooth works. Polynomial fits done by statisticians typically do not use the raw=TRUE parameter. The default is FALSE and the polynomials are constructed to be orthogonal to allow proper statistical assessment of the fit improvement when looking at the standard errors. It is instructive to look at what happens if you attempt to eliminate the Intercept by using -1 or 0+ in the formula. Try with your data and code to get rid of the intercept:
....+
stat_smooth(method="lm", formula=y~0+poly(x - 1,2)) + ...
You will see the fitted line intercepting the y axis at -0.5 and change. Now look at the non-raw value of the intercept.
coef(lm(A420~poly(ecoli_ug_ml,2),data=ecoli))
(Intercept) poly(ecoli_ug_ml, 2)1 poly(ecoli_ug_ml, 2)2
0.5466667 1.7772858 -0.2011251
So the intercept is shifting the whole curve upward to let the polynomial fit have the best fitting curvature. If you want to draw a line with ggplot2 that meets some different specification you should calculate it outside of ggplot2 and then plot it without the error bands because it really won't have the proper statistical properties.
Nonetheless, this is the way to apply what in this case is a trivial amount of adjustment and I am offering it only as an illustration of how to add an externally derived set of values. I think _ad_hoc_ adjustments like this are dangerous in practice:
mod <- lm(A420~poly(ecoli_ug_ml,2), data=ecoli)
ecoli$shifted_pred <- predict(mod) - predict( mod, newdata=list(ecoli_ug_ml=0))
ggplot(ecoli, aes(ecoli_ug_ml, A420)) +
geom_point(shape=ecoli$rpt) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))+
geom_line(data=ecoli, aes(x= ecoli_ug_ml, y=shifted_pred ) )

Resources