A priori contrasts for lm() in R

I'm having trouble setting a priori contrasts and would like to ask for some help. The following code should give two contrasts against the factor level "d".
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c",8),rep("d",8),rep("h",8)))
contrasts(A) <- cbind("d vs h"=c(0,1,-1),"d vs c"=c(-1,1,0))
summary.lm(aov(Response~A))
What I get is:
Call:
aov(formula = Response ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.000e+00 -3.136e-16 -8.281e-18 -8.281e-18 1.000e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0000 0.1091 36.661 < 2e-16 ***
Ad vs h -1.0000 0.1543 -6.481 2.02e-06 ***
Ad vs c 2.0000 0.1543 12.961 1.74e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5345 on 21 degrees of freedom
Multiple R-squared: 0.8889, Adjusted R-squared: 0.8783
F-statistic: 84 on 2 and 21 DF, p-value: 9.56e-11
But I expect the Estimate of (Intercept) to be 5.00, as it should be equal to the level d, right? Also the other estimates look strange...
I know I can get the correct values more easily using relevel(A, ref="d") (where they are displayed correctly), but I am interested in learning the correct formulation for testing my own hypotheses. If I run a similar example with the following code (from a website), it works as expected:
irrigation<-factor(c(rep("Control",10),rep("Irrigated 10mm",10),rep("Irrigated20mm",10)))
biomass<-1:30
contrastmatrix<-cbind("10 vs 20"=c(0,1,-1),"c vs 10"=c(-1,1,0))
contrasts(irrigation)<-contrastmatrix
summary.lm(aov(biomass~irrigation))
Call:
aov(formula = biomass ~ irrigation)
Residuals:
Min 1Q Median 3Q Max
-4.500e+00 -2.500e+00 3.608e-16 2.500e+00 4.500e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.5000 0.5528 28.04 < 2e-16 ***
irrigation10 vs 20 -10.0000 0.7817 -12.79 5.67e-13 ***
irrigationc vs 10 10.0000 0.7817 12.79 5.67e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.028 on 27 degrees of freedom
Multiple R-squared: 0.8899, Adjusted R-squared: 0.8817
F-statistic: 109.1 on 2 and 27 DF, p-value: 1.162e-13
I would really appreciate some explanation for this.
Thanks, Jeremias

I think the problem lies in the understanding of contrasts (see ?contrasts for details). Let me explain in detail:
If you use the default way for factor A,
A <- factor(c(rep("c",8),rep("d",8),rep("h",8)))
> contrasts(A)
d h
c 0 0
d 1 0
h 0 1
so the model lm fits is
Mean(Response) = Intercept + beta_1 * I(d = 1) + beta_2 * I(h = 1)
summary.lm(aov(Response~A))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.000 0.189 10.6 7.1e-10 ***
Ad 3.000 0.267 11.2 2.5e-10 ***
Ah 3.000 0.267 11.2 2.5e-10 ***
So for group c, the mean is just the intercept, 2; for group d, the mean is 2 + 3 = 5; and the same for group h.
What happens if you use your own contrasts:
contrasts(A) <- cbind("d vs h"=c(0,1,-1),"d vs c"=c(-1,1,0))
A
[1] c c c c c c c c d d d d d d d d h h h h h h h h
attr(,"contrasts")
d vs h d vs c
c 0 -1
d 1 1
h -1 0
The model you fit turns out to be
Mean(Response) = Intercept + beta_1 * (I(d = 1) - I(h = 1)) + beta_2 * (I(d = 1) - I(c = 1))
= Intercept + (beta_1 + beta_2) * I(d = 1) - beta_2 * I(c = 1) - beta_1 * I(h = 1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.000 0.109 36.66 < 2e-16 ***
Ad vs h -1.000 0.154 -6.48 2.0e-06 ***
Ad vs c 2.000 0.154 12.96 1.7e-11 ***
So for group c, the mean is 4 - 2 = 2, for group d, the mean is 4 - 1 + 2 = 5, for group h, the mean is 4 - (-1) = 5.
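To see this numerically, here is a quick check (using the same data as above) that the fitted coefficients reproduce the group means exactly as in the expansion above:

```r
# Same data as above; verify the group means implied by the coefficients.
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c",8), rep("d",8), rep("h",8)))
contrasts(A) <- cbind("d vs h" = c(0,1,-1), "d vs c" = c(-1,1,0))
co <- coef(lm(Response ~ A))          # intercept 4, "d vs h" -1, "d vs c" 2
mean_c <- co[1] - co[3]               # 4 - 2     = 2
mean_d <- co[1] + co[2] + co[3]       # 4 - 1 + 2 = 5
mean_h <- co[1] - co[2]               # 4 - (-1)  = 5
tapply(Response, A, mean)             # c = 2, d = 5, h = 5
```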
==========================
Update:
The easiest way to do your contrast is to set the base (reference) level to be d.
contrasts(A) <- contr.treatment(3, base = 2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.00e+00 1.89e-01 26.5 < 2e-16 ***
A1 -3.00e+00 2.67e-01 -11.2 2.5e-10 ***
A3 -4.86e-17 2.67e-01 0.0 1
If you want to use your contrast:
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c",8),rep("d",8),rep("h",8)))
mat <- cbind(rep(1/3, 3), "d vs h"=c(0,1,-1), "d vs c"=c(-1,1,0))
mymat <- solve(t(mat))
my.contrast <- mymat[,2:3]
contrasts(A) <- my.contrast
summary.lm(aov(Response~A))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.00e+00 1.09e-01 36.7 < 2e-16 ***
Ad vs h 7.69e-16 2.67e-01 0.0 1
Ad vs c 3.00e+00 2.67e-01 11.2 2.5e-10 ***
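A quick sanity check (a sketch with the same data) that this generalized-inverse coding makes the coefficients exactly the requested mean differences, with the intercept being the unweighted grand mean:

```r
# With contrasts built via solve(t(mat)), the coefficients are the comparisons
# defined by the columns of mat (here the design is balanced, 8 per group).
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c",8), rep("d",8), rep("h",8)))
mat <- cbind(rep(1/3, 3), "d vs h" = c(0,1,-1), "d vs c" = c(-1,1,0))
contrasts(A) <- solve(t(mat))[, 2:3]
co <- coef(lm(Response ~ A))
m <- tapply(Response, A, mean)        # c = 2, d = 5, h = 5
co[1]                                 # grand mean (2 + 5 + 5)/3 = 4
co[2]                                 # m["d"] - m["h"] = 5 - 5 = 0
co[3]                                 # m["d"] - m["c"] = 5 - 2 = 3
```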
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm

Related

Linear regression on dynamic groups in R

I have a data.table data_dt on which I want to run a linear regression, where the user can choose the number of columns in groups G1 and G2 via the variable n_col. The following code works, but it is slow due to the extra time spent creating matrices. To improve performance, is there a way to remove Steps 1, 2, and 3 altogether by tweaking the formula of the lm() call and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create independent variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of dependent variables
G1 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of dependent variables
G2 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results -
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
This may be done more easily by creating the formula with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
response = 'SPI'), data = data_dt)
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
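As a minimal illustration of what reformulate() does (using the column names from the output above), it builds a formula object from character vectors, so the set of predictors can be chosen programmatically at run time:

```r
# reformulate() turns character vectors into a formula; no data needed here.
f <- reformulate(c("LMI", "MPI", "ALT", "LPP25", "LPP40", "LPP60"),
                 response = "SPI")
f             # SPI ~ LMI + MPI + ALT + LPP25 + LPP40 + LPP60
class(f)      # "formula"
```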

How can I compare nls parameters across groups in R

I have a data set like this
iu       sample  obs
1.5625   s       0.312
1.5625   s       0.302
3.125    s       0.335
3.125    s       0.333
6.25     s       0.423
6.25     s       0.391
12.5     s       0.562
12.5     s       0.56
25       s       0.84
25       s       0.843
50       s       1.202
50       s       1.185
100      s       1.408
100      s       1.338
200      s       1.42
200      s       1.37
1.5625   t       0.317
1.5625   t       0.313
3.125    t       0.345
3.125    t       0.343
6.25     t       0.413
6.25     t       0.404
12.5     t       0.577
12.5     t       0.557
25       t       0.863
25       t       0.862
50       t       1.22
50       t       1.197
100      t       1.395
100      t       1.364
200      t       1.425
200      t       1.415
I want to use R to recreate the SAS code below. I believe this SAS code performs a single nonlinear fit in which three parameters are shared across the subsets and one parameter differs between them.
proc nlin data=assay;
 model obs=D+(A-D)/(1+(iu/((Cs*(sample="S")+Ct*(sample="T"))))**(B));
 parms D=1 B=1 Cs=1 Ct=1 A=1;
run;
So I wrote something like this, but get an error:
nlm_1 <- nls(obs ~ (a - d) / (1 + (iu / c[sample]) ^ b) + d, data = csf_1, start = list(a = 0.3, b = 1.8, c = c(25, 25), d = 1.4))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
But without [sample], the model can be fitted:
nlm_1 <- nls(obs ~ (a - d) / (1 + (iu / c) ^ b) + d, data = csf_1, start = list(a = 0.3, b = 1.8, c = c(25), d = 1.4))
summary(nlm_1)
Formula: obs ~ (a - d)/(1 + (iu/c)^b) + d
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.31590 0.00824 38.34 <2e-16 ***
b 1.83368 0.06962 26.34 <2e-16 ***
c 25.58422 0.55494 46.10 <2e-16 ***
d 1.44777 0.01171 123.63 <2e-16 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02049 on 28 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 6.721e-06
I don't get it. Could someone tell me what's wrong with my code, and how I can achieve my goal in R? Thanks!
Thanks to @akrun. After converting csf_1$sample to a factor, I finally got what I wanted.
csf_1[, 2] <- as.factor(c(rep("s", 16), rep("t", 16)))
nlm_1 <- nls(obs ~ (a - d) / (1 + (iu / c[sample]) ^ b) + d, data = csf_1, start = list(a = 0.3, b = 1.8, c = c(25, 25), d = 1.4))
summary(nlm_1)
Formula: obs ~ (a - d)/(1 + (iu/c[sample])^b) + d
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.315874 0.008102 38.99 <2e-16 ***
b 1.833303 0.068432 26.79 <2e-16 ***
c1 26.075317 0.656779 39.70 <2e-16 ***
c2 25.114050 0.632787 39.69 <2e-16 ***
d 1.447901 0.011518 125.71 <2e-16 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02015 on 27 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 6.225e-06
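The key point is that c[sample] indexes the parameter vector by the factor's integer codes, which is why sample must be a factor (or integer). A minimal self-contained sketch with simulated data (the parameter values and noise level here are made up for illustration; the group-specific parameter is called cc to avoid confusion with base R's c):

```r
# Simulate a four-parameter logistic with a group-specific midpoint.
set.seed(1)
iu <- rep(1.5625 * 2^(0:7), times = 2)
sample <- factor(rep(c("s", "t"), each = 8))
true_c <- c(s = 26, t = 25)                       # group-specific parameter
obs <- (0.32 - 1.45) / (1 + (iu / true_c[sample])^1.8) + 1.45 +
  rnorm(16, sd = 0.01)
# cc[sample] picks cc[1] for rows with sample == "s" and cc[2] for "t":
fit <- nls(obs ~ (a - d) / (1 + (iu / cc[sample])^b) + d,
           start = list(a = 0.3, b = 1.8, cc = c(25, 25), d = 1.4))
coef(fit)    # a, b, cc1, cc2, d
```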

Test if intercepts in ancova model are significantly different in R

I ran a model explaining the weight of some plant as a function of time and trying to incorporate the treatment effect.
mod <- lm(weight ~time + treatment)
The model looks like this (plot not shown), and the model summary is:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done by simply interpreting the t-value and p-value for the intercept? I guess not, because that is the baseline (treatment A in this case). I'm a bit puzzled by this, as not much attention is paid to differences in intercepts in most sources I looked up.
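Not an authoritative answer, but one standard way to test this joint hypothesis is a nested-model F-test: given the common slope, the treatment term tests exactly H0 that all four intercepts are equal. A sketch with simulated data (the data frame, effect sizes, and noise level are made up to mimic the summary above):

```r
# Simulate four treatments sharing one slope but with different intercepts.
set.seed(42)
df <- data.frame(time = rep(1:10, 4),
                 treatment = rep(c("A", "B", "C", "D"), each = 10))
df$weight <- -37 + 4.7 * df$time +
  c(A = 0, B = 8, C = 5, D = 20)[df$treatment] + rnorm(40, sd = 9)
mod0 <- lm(weight ~ time, data = df)              # equal intercepts
mod1 <- lm(weight ~ time + treatment, data = df)  # separate intercepts
anova(mod0, mod1)  # F-test of intercept1 = intercept2 = intercept3 = intercept4
```

This is the same F-test that appears for the treatment row of the sequential ANOVA table.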

How is Pr(>|t|) in a linear regression in R calculated?

What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed in R?
I understand that Pr(>|t|) is a p-value, but I do not understand how the value is calculated.
For example, the Pr(>|t|) of x1 is displayed as 0.021 in the output below, and I want to know how this value was calculated.
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6 -2 0 2 6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the t value column are obtained by dividing the coefficient estimate (in the Estimate column) by its standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value: the probability that the absolute value of a t-distributed random variable is greater than 2.965. Using the symmetry of the t-distribution, this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus number of estimated coefficients = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
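Putting it together, the whole Pr(>|t|) column can be reproduced from the estimates and standard errors:

```r
# Recompute every p-value in the coefficient table from first principles.
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y  <- c(100,120,150,180,210,220,250,280,310,330)
fit <- lm(y ~ x1 + x2)
tab <- summary(fit)$coefficients
tvals <- tab[, "Estimate"] / tab[, "Std. Error"]
pvals <- 2 * pt(abs(tvals), df.residual(fit), lower.tail = FALSE)
cbind(reported = tab[, "Pr(>|t|)"], recomputed = pvals)   # identical columns
```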

contrasts in anova

I understand contrasts from previous posts, and I think I am doing the right thing, but it is not giving me what I would expect.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
12.48229, 12.1188, 11.76914, 0, 0,
13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g",5),rep("i",5),rep("t",5)))
type
[1] g g g g g i i i i i t t t t t
Levels: g i t
When I run this:
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.514 1.729 6.659 2.33e-05 ***
typei -4.240 2.445 -1.734 0.109
typet 2.222 2.445 0.909 0.381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
Here my reference is type "g", so typei is the difference between type "g" and type "i", and typet is the difference between type "g" and type "t".
I wanted to see two more contrasts here: the average of types "g" and "i" versus type "t", and type "i" versus type "t". So I set the contrasts:
> contrasts(type) <- cbind( c(-1,-1,2),c(0,-1,1))
> summary.lm(aov(x~type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.8412 0.9983 10.860 1.46e-07 ***
type1 -0.6728 1.4118 -0.477 0.642
type2 4.2399 2.4453 1.734 0.109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
When I try to do the second contrast by changing my reference, I get different results. I don't understand what is wrong with my contrasts.
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
mat <- cbind(rep(1/3, 3), "g+i vs t"=c(-1/2, -1/2, 1),"i vs t"=c(0, -1, 1))
mymat <- solve(t(mat))
my.contrast <- mymat[,2:3]
contrasts(type) <- my.contrast
summary.lm(aov(x ~ type))
my.contrast
> my.contrast
     g+i vs t i vs t
[1,]  -1.3333      1
[2,]   0.6667     -1
[3,]   0.6667      0
> contrasts(type) <- my.contrast
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.274 -0.414 0.097 0.663 5.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.841 0.998 10.86 1.5e-07 ***
typeg+i vs t 4.342 2.118 2.05 0.063 .
typei vs t 6.462 2.445 2.64 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.87 on 12 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.271
F-statistic: 3.6 on 2 and 12 DF, p-value: 0.0594
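A quick check (with the same data) that the generalized-inverse coding delivers the intended comparisons of the unweighted group means:

```r
# Verify the coefficients against the group means computed directly.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
       12.48229, 12.1188, 11.76914, 0, 0,
       13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g",5), rep("i",5), rep("t",5)))
mat <- cbind(rep(1/3, 3), "g+i vs t" = c(-1/2, -1/2, 1), "i vs t" = c(0, -1, 1))
contrasts(type) <- solve(t(mat))[, 2:3]
co <- coef(lm(x ~ type))
m <- tapply(x, type, mean)
co[2] - (m["t"] - (m["g"] + m["i"]) / 2)   # 0: "g+i vs t" coefficient = 4.342
co[3] - (m["t"] - m["i"])                  # 0: "i vs t" coefficient = 6.462
```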
