In this experiment, four different diets were tested on animals, and researchers then measured their effects on blood coagulation time.
## Data
coag diet
1 62 A
2 60 A
3 63 A
4 59 A
5 63 B
6 67 B
7 71 B
8 64 B
9 65 B
10 66 B
11 68 C
12 66 C
13 71 C
14 67 C
15 68 C
16 68 C
17 56 D
18 62 D
19 60 D
20 61 D
21 63 D
22 64 D
23 63 D
24 59 D
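For reference, the data above can be reconstructed as a data frame like this (an assumed construction; the original appears to be the classic coagulation dataset, e.g. from the faraway package):

```r
# Assumed reconstruction of the printed data above.
coagulation <- data.frame(
  coag = c(62, 60, 63, 59,
           63, 67, 71, 64, 65, 66,
           68, 66, 71, 67, 68, 68,
           56, 62, 60, 61, 63, 64, 63, 59),
  diet = rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8))
)
```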
I am trying to fit a linear model for coag ~ diet using the lm function in R.
The results should look like the following:
> modelSummary$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.100000e+01 1.183216 5.155441e+01 9.547815e-23
dietB 5.000000e+00 1.527525 3.273268e+00 3.802505e-03
dietC 7.000000e+00 1.527525 4.582576e+00 1.805132e-04
dietD -1.071287e-14 1.449138 -7.392579e-15 1.000000e+00
My code thus far does not produce results that look like these:
coagulation$x1 <- 1*(coagulation$diet=="B")
coagulation$x2 <- 1*(coagulation$diet=="C")
coagulation$x3 <- 1*(coagulation$diet=="D")
modelSummary <- lm(coag~1+x1+x2+x3, data=coagulation)
"diet" is a character variable and is treated as a factor, so you may leave out the dummy coding and just do:
summary(lm(coag ~ diet, data=coagulation))$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.100000e+01 1.183216 5.155441e+01 9.547815e-23
# dietB 5.000000e+00 1.527525 3.273268e+00 3.802505e-03
# dietC 7.000000e+00 1.527525 4.582576e+00 1.805132e-04
# dietD 2.991428e-15 1.449138 2.064281e-15 1.000000e+00
Even if "diet" were a numeric variable that you wanted R to treat as categorical rather than continuous, no dummy coding would be needed; you would just add it as factor(diet) in the formula.
As you can see, the 1 + is also redundant, since lm includes the (Intercept) by default. To omit the intercept, use 0 + (or - 1).
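For example, with the intercept omitted each coefficient becomes the plain group mean for its diet:

```r
# Sketch: 0 + drops the intercept, so each diet's coefficient is simply
# that group's mean coagulation time (assumes the coagulation data frame).
coef(lm(coag ~ 0 + diet, data = coagulation))
# dietA dietB dietC dietD
#    61    66    68    61
```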
That presentation is a property of summary(modelSummary) (class summary.lm), not modelSummary (class lm).
summary(modelSummary)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.100000e+01 1.183216 5.155441e+01 9.547815e-23
# x1 5.000000e+00 1.527525 3.273268e+00 3.802505e-03
# x2 7.000000e+00 1.527525 4.582576e+00 1.805132e-04
# x3 2.991428e-15 1.449138 2.064281e-15 1.000000e+00
You may also consider coding diet explicitly as a factor:
coagulation$diet <- factor(coagulation$diet)
modelSummary <- lm(coag ~ diet, data = coagulation)
summary(modelSummary)
Call:
lm(formula = coag ~ diet, data = coagulation)
Residuals:
Min 1Q Median 3Q Max
-5.00 -1.25 0.00 1.25 5.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.100e+01 1.183e+00 51.554 < 2e-16 ***
dietB 5.000e+00 1.528e+00 3.273 0.003803 **
dietC 7.000e+00 1.528e+00 4.583 0.000181 ***
dietD 2.991e-15 1.449e+00 0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I am trying to do robust multiple regression for a dataset where a few outliers don't allow me to see the underlying patterns through the usual linear models.
I am using the function lmrob in the package robustbase, and I was surprised by the number of significant relationships that I found. I decided to try the method with random data; this is the code:
library(robustbase)
set.seed(4)
ax<-data.frame(a1=rnorm(20,3),
a2=rnorm(20,5),
a3=rnorm(20,4),
a4=rnorm(20,6),
a5=rnorm(20,2))
axm<-lmrob(a1~a2*a3*a4*a5,data=ax)
summary(axm)
And the output:
Call:
lmrob(formula = a1 ~ a2 * a3 * a4 * a5, data = ax)
\--> method = "MM"
Residuals:
1 2 3 4 5 6 7 8 9 10 11 12 13
-34.740270 -0.049493 -0.044379 0.002770 0.219825 0.041285 0.156152 -0.072825 0.034824 -0.014757 -0.088263 -0.185045 -0.079679
14 15 16 17 18 19 20
-0.045121 -0.007576 0.008813 0.010451 0.015716 0.060781 0.040187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1160.5907 94.0095 -12.35 0.000247 ***
a2 205.6910 15.8689 12.96 0.000204 ***
a3 327.9787 24.2161 13.54 0.000172 ***
a4 193.2384 15.7300 12.29 0.000252 ***
a5 734.2203 49.8960 14.71 0.000124 ***
a2:a3 -57.6229 4.0533 -14.22 0.000142 ***
a2:a4 -33.5644 2.6130 -12.85 0.000212 ***
a3:a4 -54.1622 4.0438 -13.39 0.000180 ***
a2:a5 -138.8395 9.2697 -14.98 0.000116 ***
a3:a5 -198.4961 12.3168 -16.12 8.67e-05 ***
a4:a5 -123.0895 8.2792 -14.87 0.000119 ***
a2:a3:a4 9.3344 0.6659 14.02 0.000150 ***
a2:a3:a5 37.1371 2.2502 16.50 7.89e-05 ***
a2:a4:a5 23.0014 1.5152 15.18 0.000110 ***
a3:a4:a5 32.9766 2.0388 16.18 8.55e-05 ***
a2:a3:a4:a5 -6.0817 0.3660 -16.62 7.68e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robust residual standard error: 0.4039
Multiple R-squared: 0.9861, Adjusted R-squared: 0.934
Convergence in 5 IRWLS iterations
Robustness weights:
observation 1 is an outlier with |weight| = 0 ( < 0.005);
9 weights are ~= 1. The remaining 10 ones are
2 3 5 7 8 11 12 13 14 19
0.9986 0.9989 0.9732 0.9864 0.9970 0.9957 0.9810 0.9965 0.9989 0.9979
Algorithmic parameters:
tuning.chi bb tuning.psi refine.tol rel.tol scale.tol solve.tol eps.outlier
1.548e+00 5.000e-01 4.685e+00 1.000e-07 1.000e-07 1.000e-10 1.000e-07 5.000e-03
eps.x warn.limit.reject warn.limit.meanrw
1.150e-09 5.000e-01 5.000e-01
nResample max.it best.r.s k.fast.s k.max maxit.scale trace.lev mts compute.rd
500 50 2 1 200 200 0 1000 0
fast.s.large.n
2000
psi subsampling cov compute.outlier.stats
"bisquare" "nonsingular" ".vcov.avar1" "SM"
seed : int(0)
According to this, the other random variables are related to the first one and have high predictive power over it, which makes no sense.
What is happening here? Am I doing the regression wrong?
Edit: for replicability, I set a seed for which the p-values are extremely low.
I think I may have found the explanation for such low p-values: it turns out that the MM estimate with small sample sizes (as in my example with 20) is prone to type 1 error. One of the authors of the robustbase package has published an article proposing an alternative estimator for these cases, but I'm afraid that for my data it doesn't work much better.
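Beyond the estimator issue, part of the problem is visible from plain counting: the four-way interaction formula estimates 16 coefficients from 20 observations. A quick check with ordinary lm on the same random data:

```r
# Sketch: a2*a3*a4*a5 expands to 2^4 = 16 terms (intercept, 4 main
# effects, 6 two-way, 4 three-way, 1 four-way interaction), leaving only
# 4 residual degrees of freedom from 20 observations -- the model nearly
# interpolates the noise.
set.seed(4)
ax <- data.frame(a1 = rnorm(20, 3), a2 = rnorm(20, 5), a3 = rnorm(20, 4),
                 a4 = rnorm(20, 6), a5 = rnorm(20, 2))
fit <- lm(a1 ~ a2 * a3 * a4 * a5, data = ax)
length(coef(fit))  # 16
df.residual(fit)   # 4
```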
I am learning to work with apply family functions and R loops.
I am working with a basic data table that has a y (outcome variable) column and an x (predictor variable) column, with 100 rows.
I have already used the lm() function to run a regression for the data.
Model.1<-lm(y~x, data = data)
Coefficients:
(Intercept) x
13.87 4.89
summary(Model.1)
Residuals:
Min 1Q Median 3Q Max
-4.1770 -1.7005 -0.0011 1.5625 6.4893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.87039 0.95625 14.51 <2e-16 ***
x 4.88956 0.09339 52.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.195 on 98 degrees of freedom
Multiple R-squared: 0.9655, Adjusted R-squared: 0.9651
F-statistic: 2741 on 1 and 98 DF, p-value: < 2.2e-16
anova(Model.1)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 13202 13202.5 2740.9 < 2.2e-16 ***
Residuals 98 472 4.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
attributes(Model.1)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
$class
[1] "lm"
I now want to randomly sample 100 observations from my "y" and "x" table. This is the function I created to draw the random sample with replacement:
draw_100<-function(){
random_100=sample(data, 100, replace = TRUE)
}
Running random_100 gives me this output:
random_100
x x.1 y
1 8.112187 8.112187 53.69602
2 8.403589 8.403589 53.79438
3 9.541786 9.541786 58.48542
4 8.989281 8.989281 57.08601
5 6.965905 6.965905 46.62331
6 10.167800 10.167800 63.91487
7 10.683152 10.683152 65.84915
8 10.703093 10.703093 66.24738
9 8.337231 8.337231 51.87687
10 13.106177 13.106177 75.94588
11 10.726036 10.726036 65.19384
12 8.601641 8.601641 51.95095
13 10.338696 10.338696 62.92599
14 5.771682 5.771682 42.14190
15 6.161545 6.161545 46.36998
16 9.874543 9.874543 63.67148
17 8.540996 8.540996 58.85341
18 9.866002 9.866002 63.26319
19 8.622546 8.622546 57.05820
20 9.539929 9.539929 64.76654
21 9.498090 9.498090 61.38521
22 8.206142 8.206142 53.43508
23 8.245825 8.245825 58.29646
24 12.192542 12.192542 76.17440
25 6.955028 6.955028 49.73094
26 10.237639 10.237639 65.71210
27 10.927818 10.927818 67.18048
28 8.536011 8.536011 52.97402
29 9.574403 9.574403 60.53908
30 9.507752 9.507752 58.40020
31 5.838214 5.838214 41.93612
32 10.702791 10.702791 64.54986
33 6.704084 6.704084 46.88057
34 12.914798 12.914798 78.99422
35 16.607947 16.607947 96.60247
36 8.334241 8.334241 55.32263
37 12.287914 12.287914 71.46411
38 11.214098 11.214098 68.53254
39 7.722161 7.722161 50.81632
40 14.065276 14.065276 80.31033
41 10.402173 10.402173 64.36506
42 10.984727 10.984727 64.25032
43 8.491214 8.491214 58.36475
44 9.120864 9.120864 61.24240
45 10.251654 10.251654 60.56177
46 4.497277 4.497277 33.20243
47 11.384417 11.384417 68.61502
48 14.033980 14.033980 83.95417
49 9.909422 9.909422 62.27733
50 8.692219 8.692219 55.73567
51 12.864750 12.864750 79.08818
52 9.886267 9.886267 65.87693
53 10.457541 10.457541 61.36505
54 13.395296 13.395296 76.01832
55 10.343134 10.343134 60.84247
56 10.233329 10.233329 65.12074
57 10.756491 10.756491 70.05930
58 9.287774 9.287774 57.65071
59 11.704419 11.704419 72.65211
60 13.075236 13.075236 77.87956
61 12.066161 12.066161 69.34647
62 10.044714 10.044714 65.80648
63 13.331926 13.331926 80.72634
64 10.816099 10.816099 67.11356
65 10.377846 10.377846 63.14035
66 11.824583 11.824583 67.51041
67 7.114326 7.114326 51.80456
68 9.752344 9.752344 59.36107
69 10.869720 10.869720 67.97186
70 10.366262 10.366262 66.28012
71 10.656127 10.656127 67.86625
72 6.246312 6.246312 45.95457
73 8.003875 8.003875 49.29802
74 11.541176 11.541176 67.89918
75 11.799510 11.799510 73.15802
76 9.787112 9.787112 62.90187
77 13.187445 13.187445 80.26162
78 13.019787 13.019787 75.69156
79 3.854378 3.854378 35.82556
80 11.724234 11.724234 71.79034
81 6.953864 6.953864 45.72355
82 12.822231 12.822231 76.93698
83 9.285428 9.285428 59.61610
84 10.259240 10.259240 62.37958
85 10.613086 10.613086 63.91694
86 8.547155 8.547155 54.72216
87 15.069100 15.069100 86.23767
88 7.816772 7.816772 51.41676
89 13.854272 13.854272 88.10100
90 9.495968 9.495968 61.61393
91 9.881453 9.881453 65.24259
92 7.475875 7.475875 50.80777
93 13.286219 13.286219 81.15708
94 9.703433 9.703433 60.75532
95 5.415999 5.415999 42.55981
96 12.997555 12.997555 78.12987
97 11.893787 11.893787 68.97691
98 5.228217 5.228217 37.38417
99 8.392504 8.392504 54.81151
100 8.077527 8.077527 51.47045
I am hitting a road block: how do I take this new random sample of 100 values, fit a regression model to it, and extract the coefficient and standard error?
I thought I might need to use the sapply() function, but I truly believe I am overthinking this, because when I ran the regression model on the stored random sample, it was identical to Model.1. Something is off.
Model.2<-lm(y~x, data = random_100)
Call:
lm(formula = y ~ x, data = random_100)
Coefficients:
(Intercept) x
13.87 4.89
The coefficients (intercept and slope) were identical to Model.1:
Call:
lm(formula = y ~ x, data = random_100)
Residuals:
Min 1Q Median 3Q Max
-4.1770 -1.7005 -0.0011 1.5625 6.4893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.87039 0.95625 14.51 <2e-16 ***
x 4.88956 0.09339 52.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.195 on 98 degrees of freedom
Multiple R-squared: 0.9655, Adjusted R-squared: 0.9651
F-statistic: 2741 on 1 and 98 DF, p-value: < 2.2e-16
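One sketch of a way forward (an assumption about the intent, since no fix is shown above): sample(data, 100, replace = TRUE) samples *columns* of a data frame, which is why the output above shows a duplicated x column (x and x.1). To resample observations, sample row indices instead:

```r
# Sketch on made-up data with the same structure as the question's table.
set.seed(1)
data <- data.frame(x = rnorm(100, mean = 10))
data$y <- 14 + 5 * data$x + rnorm(100, sd = 2)

# Sample 100 *rows* with replacement, not columns:
draw_100 <- function() data[sample(nrow(data), 100, replace = TRUE), ]

# Fit on one resample and extract the slope's estimate and std. error:
m <- lm(y ~ x, data = draw_100())
summary(m)$coefficients["x", c("Estimate", "Std. Error")]

# replicate() repeats the whole draw-and-fit, giving a bootstrap
# distribution of the slope:
slopes <- replicate(200, coef(lm(y ~ x, data = draw_100()))["x"])
sd(slopes)  # bootstrap standard error of the slope
```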
I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even though, just looking at the data, I see a clear interaction between A and B, the GLM says the p-value is far above 0.05. Am I doing something wrong?
First of all I create the data frame including my data for the GLM, which consists of a Y dependent variable and two factors, A and B. These are two-level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see what it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just by looking at the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when both A and B are present (that is, A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as the p-value is far above 0.05:
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state that Y is continuous, the data shows that Y is actually a fraction, which is probably the reason you tried to apply a GLM in the first place.
Modeling fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fulfilled. See the following Cross Validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fulfilled.
Alternatives for modeling fractions are beta regression and fractional response models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
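A further option, shown here only as a sketch, is a quasi-binomial GLM: it keeps the logit link but estimates a dispersion parameter instead of assuming binomial variance, and it does not emit the "non-integer #successes" warning:

```r
# Sketch: same model as the question's glm, but with family=quasibinomial.
# The coefficients are identical to the binomial fit; the standard errors
# are rescaled by the estimated dispersion.
my_glm_quasi <- glm(Y ~ A * B, data = my_data, family = quasibinomial)
summary(my_glm_quasi)$coefficients
```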
The family=binomial implies logit (logistic) regression, which models a binary outcome.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try fitting a different model; logistic regression is not appropriate here.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for both factors and their interaction:
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
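The interaction test can also be read off an ordinary lm fit of the same model:

```r
# Sketch: the Wald t-test for A:B in lm matches the aov F-test for the
# interaction above (t^2 = F for a 1-df term fitted last in the sequence).
summary(lm(Y ~ A * B, data = my_data))$coefficients["A:B", ]
```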
I would like a second-order (I think that's the term) regression line forced through zero, and crucially I need the equation for this relationship.
Here's my data:
ecoli_ug_ml A420 rpt
1 0 0.000 1
2 10 0.129 1
3 20 0.257 1
4 30 0.379 1
5 40 0.479 1
6 50 0.579 1
7 60 0.673 1
8 70 0.758 1
9 80 0.838 1
10 90 0.912 1
11 100 0.976 1
12 0 0.000 2
13 10 0.126 2
14 20 0.257 2
15 30 0.382 2
16 40 0.490 2
17 50 0.592 2
18 60 0.684 2
19 70 0.772 2
20 80 0.847 2
21 90 0.917 2
22 100 0.977 2
23 0 0.000 3
24 10 0.125 3
25 20 0.258 3
26 30 0.376 3
27 40 0.488 3
28 50 0.582 3
29 60 0.681 3
30 70 0.768 3
31 80 0.846 3
32 90 0.915 3
33 100 0.977 3
My plot looks like this: (sci2 is just some axis and text formatting, can include if necessary)
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x - 1,2)) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u."))) +
sci2
When I view this, the fit of this line to the points is spectacularly good.
When I check out coef, I think there is a non-zero y-intercept (which is unacceptable for my purposes), but to be honest I don't really understand these lines:
coef(lm(A420 ~ poly(ecoli_ug_ml, 2, raw=TRUE), data = calib))
(Intercept) poly(ecoli_ug_ml, 2, raw = TRUE)1
-1.979021e-03 1.374789e-02
poly(ecoli_ug_ml, 2, raw = TRUE)2
-3.964258e-05
Therefore, I assume the plot is actually not quite right either.
So, what I need is to generate a regression line forced through zero and get the equation for it, and, understand what it's saying when it gives me said equation. If I could annotate the plot area with the equation directly I would be incredibly stoked.
I have spent approximately 8 hours trying to work this out now; I checked Excel and got a formula in 8 seconds, but I would really like to get into using R for this. Thanks!
To clarify: the primary purpose of this plot is not to demonstrate the distribution of these data but rather to provide a visual confirmation that the equation I generate from these points fits the readings well
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T),data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T), data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.979e-03 1.926e-03 -1.028 0.312
# poly(ecoli_ug_ml, 2, raw = T)1 1.375e-02 8.961e-05 153.419 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.964e-05 8.631e-07 -45.932 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004379 on 30 degrees of freedom
# Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
# F-statistic: 8.343e+04 on 2 and 30 DF, p-value: < 2.2e-16
So the intercept is not exactly 0 but it is small compared to the Std. Error. In other words, the intercept is not significantly different from 0.
You can force a fit without the intercept this way (note the -1 in the formula):
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T)-1,data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T) - 1, data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# poly(ecoli_ug_ml, 2, raw = T)1 1.367e-02 5.188e-05 263.54 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.905e-05 6.396e-07 -61.05 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004383 on 31 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 3.4e+05 on 2 and 31 DF, p-value: < 2.2e-16
Note that the coefficients do not change appreciably.
EDIT (Response to OP's comment)
The formula specified in stat_smooth(...) is just passed directly to the lm(...) function, so you can specify in stat_smooth(...) any formula that works in lm(...). The point of the results above is that, even without forcing the intercept to 0, it is extremely small (-2e-3) compared to the range in y (0-1), so plotting curves with and without will give nearly indistinguishable results. You can see this for yourself by running this code:
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x,2,raw=T),colour="red") +
stat_smooth(method="lm", formula=y~-1+poly(x,2,raw=T),colour="blue") +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))
The blue and red curves are nearly, but not quite on top of each other (you may have to open up your plot window to see it). And no, you do not have to do this "outside of ggplot."
The problem you reported relates to using the default raw=F. This causes poly(...) to use orthogonal polynomials, which by definition have constant terms. So using y~-1+poly(x,2) doesn't really make sense, whereas using y~-1+poly(x,2,raw=T) does make sense.
Finally, if all this business of using poly(...) with or without raw=T is causing confusion, you can achieve the exact same result using formula = y~ -1 + x + I(x^2). This fits a second order polynomial (a*x +b*x^2) and suppresses the constant term.
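A quick check of that equivalence (using the calib data frame from the question):

```r
# Sketch: both formulas build the same design matrix (x and x^2, no
# intercept), so the fitted coefficients agree.
fit_poly <- lm(A420 ~ -1 + poly(ecoli_ug_ml, 2, raw = TRUE), data = calib)
fit_I    <- lm(A420 ~ -1 + ecoli_ug_ml + I(ecoli_ug_ml^2), data = calib)
all.equal(unname(coef(fit_poly)), unname(coef(fit_I)))  # TRUE
```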
I think you are misinterpreting that intercept and also how stat_smooth works. Polynomial fits done by statisticians typically do not use the raw=TRUE parameter: the default is FALSE, and the polynomials are constructed to be orthogonal, which allows proper statistical assessment of the improvement in fit from each term when looking at the standard errors. It is instructive to look at what happens if you attempt to eliminate the intercept by using -1 or 0 + in the formula. Try this with your data and code:
....+
stat_smooth(method="lm", formula=y~0+poly(x - 1,2)) + ...
You will see the fitted line intercepting the y axis at -0.5 and change. Now look at the non-raw value of the intercept.
coef(lm(A420~poly(ecoli_ug_ml,2),data=ecoli))
(Intercept) poly(ecoli_ug_ml, 2)1 poly(ecoli_ug_ml, 2)2
0.5466667 1.7772858 -0.2011251
So the intercept is shifting the whole curve upward to let the polynomial fit have the best-fitting curvature. If you want to draw a line with ggplot2 that meets some different specification, you should calculate it outside of ggplot2 and then plot it without the error bands, because it really won't have the proper statistical properties.
Nonetheless, this is the way to apply what in this case is a trivial amount of adjustment, and I am offering it only as an illustration of how to add an externally derived set of values. I think ad hoc adjustments like this are dangerous in practice:
mod <- lm(A420~poly(ecoli_ug_ml,2), data=ecoli)
ecoli$shifted_pred <- predict(mod) - predict( mod, newdata=list(ecoli_ug_ml=0))
ggplot(ecoli, aes(ecoli_ug_ml, A420)) +
geom_point(shape=ecoli$rpt) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))+
geom_line(data=ecoli, aes(x= ecoli_ug_ml, y=shifted_pred ) )
I want to run a regression with a bunch of independent variables from my dataset. There are a lot of predictors, so I do not want to write them all out. Is there a notation to span multiple columns so I don't have to type each?
My attempt was doing this (where my predictors are column 20 to 43):
modelAllHexSubscales = lm(HHdata$garisktot~HHdata[,20:43])
Obviously, this does not work because HHdata[,20:43] is a matrix of data, whereas I really need it to see the data as HHdata[,20]+HHdata[,21] etc.
Here's another alternative:
# if garisktot is in columns 20:43
modelAllHexSubscales <- lm(garisktot ~ ., data=HHdata[,20:43])
# if it isn't
modelData <- data.frame(HHdata["garisktot"],HHdata[,20:43])
modelAllHexSubscales <- lm(garisktot ~ ., data=modelData)
You can generate the formula by pasting the column names together first:
f <- as.formula(paste('garisktot ~', paste(colnames(HHdata)[20:43], collapse='+')))
modelAllHexSubscales <- lm(f, HHdata)
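Base R's reformulate() does the same pasting for you:

```r
# Sketch: reformulate(termlabels, response) builds the formula directly
# from the column names, without manual paste().
f2 <- reformulate(colnames(HHdata)[20:43], response = "garisktot")
modelAllHexSubscales <- lm(f2, data = HHdata)
```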
Have you tried to do it directly, as in
> y
[1] 10 19 30 42 51 59 72 78
> X
[,1] [,2]
[1,] 1 1.0
[2,] 2 3.0
[3,] 3 5.5
[4,] 4 7.0
[5,] 5 9.0
[6,] 6 11.0
[7,] 7 13.0
[8,] 8 16.0
> summary(lm(y ~ X))
Call:
lm(formula = y ~ X)
Residuals:
1 2 3 4 5 6 7 8
-0.1396 -1.2774 0.9094 1.4472 0.3094 -1.8283 1.0340 -0.4547
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.647 2.004 -1.321 0.24366
X1 15.436 3.177 4.859 0.00464 **
X2 -2.649 1.535 -1.726 0.14490
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.363 on 5 degrees of freedom
Multiple R-squared: 0.9978, Adjusted R-squared: 0.9969
F-statistic: 1124 on 2 and 5 DF, p-value: 2.32e-07
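Applied to the question's data, the same matrix approach would look like this (a sketch; the matrix's column names become the coefficient names):

```r
# Sketch: lm accepts a matrix on the right-hand side of the formula.
X <- as.matrix(HHdata[, 20:43])
modelAllHexSubscales <- lm(HHdata$garisktot ~ X)
```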