Regression without intercept in R and Stata

Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?

I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what Stata is doing, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, but skipping sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully understand why, in the second regression (with no constant), the dFALSE coefficient captures the effect of the intercept from the default regression (with constant).)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_def),
x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
y = model.response(mod_frame_nocon),
x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_asnum),
x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
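In R's no-intercept parameterisation the factor d is expanded into both levels, so dFALSE plays the role of the intercept and dTRUE = 0.4375 + 0.1513 = 0.5888 is the intercept plus the dTRUE effect from the default model. Stata's noconstant instead keeps the base level of the factor omitted, leaving only the single TRUE dummy, which is exactly the as.numeric(d) model matrix above. A minimal sketch of how the Stata fit can be reproduced directly with lm():
# same coefficients (x ~ 0.976, dummy ~ -0.232) and 48 residual df as
# Stata's `regress y x i.d_enc, noconstant`
summary(lm(y ~ 0 + x + as.numeric(d), data = df))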

Related

Missing standard errors and confidence for mixed model and ggeffects

I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data. It seems to happen specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
x = seq(-0.1, 0.1, length.out = n),
g = floor(runif(n) * 3),
f = letters[1 + g] %>% as.factor(),
y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
This seems worth opening an issue on the ggeffects issues list ?
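One quick way to probe the singular-fit hypothesis, as a sketch using lme4's own helpers on the model above:
library(lme4)
isSingular(model, tol = 1e-4)  # TRUE if the random-effects variance is numerically zero
VarCorr(model)                 # estimated random-effects standard deviation
ranef(model)$f                 # conditional modes; identical values across groups point to the issue above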

Are regression coefficients same as effect size?

I am trying to implement and analyse a full factorial experiment in R, but I don't understand why the results presented in the book are different. Here are the problem details:
I tried to use a least-squares model to estimate the effects of the different factors (gap, power and flow rate), but the effect sizes mentioned in the book are completely different:
My implementation of the problem in R and the results are as follows:
et_rate = c(550, 669, 633, 642, 1037, 749, 1075, 729,
604, 650, 601, 635, 1052, 868, 1063, 860)
gap = factor(rep(1:2, times = 8))
flw_rate = factor(rep(1:2, each = 2, times = 4))
pwr = factor(rep(1:2, each = 4, times= 2))
df <- data.frame(gap, flw_rate, pwr, et_rate)
md3 <- lm(et_rate ~ .^3, data = df)
summary(md3)
And my results are:
Call:
lm(formula = et_rate ~ .^3, data = df)
Residuals:
Min 1Q Median 3Q Max
-65.50 -11.12 0.00 11.12 65.50
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 577.00 33.56 17.193 1.33e-07 ***
gap2 82.50 47.46 1.738 0.12036
flw_rate2 40.00 47.46 0.843 0.42382
pwr2 467.50 47.46 9.850 9.50e-06 ***
gap2:flw_rate2 -61.00 67.12 -0.909 0.39000
gap2:pwr2 -318.50 67.12 -4.745 0.00145 **
flw_rate2:pwr2 -15.50 67.12 -0.231 0.82317
gap2:flw_rate2:pwr2 22.50 94.92 0.237 0.81859
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 47.46 on 8 degrees of freedom
Multiple R-squared: 0.9661, Adjusted R-squared: 0.9364
F-statistic: 32.56 on 7 and 8 DF, p-value: 2.896e-05
I also fitted a reduced model with only gap, power and their interaction, which gives:
Call:
lm(formula = et_rate ~ gap * pwr, data = df)
Coefficients:
(Intercept) gap2 pwr2 gap2:pwr2
597.0 52.0 459.7 -307.2
I was expecting the coefficients of my model to be equal to the effect size estimates, but they are completely different from those mentioned in the solution in the book. Am I mistaken in my approach to getting the effect sizes?
1) Using Helmert contrasts as in @Roland's comment:
options(contrasts = c(unordered = "contr.helmert", ordered = "contr.poly"))
md3 <- lm(et_rate ~ .^3, df)
2 * coef(md3)
giving the effects shown in the question as:
(Intercept) gap1 flw_rate1 pwr1
1552.125 -101.625 7.375 306.125
gap1:flw_rate1 gap1:pwr1 flw_rate1:pwr1 gap1:flw_rate1:pwr1
-24.875 -153.625 -2.125 5.625
2) Using md3 from above, this also gives the effects shown in the question:
mm <- model.matrix(md3)
crossprod(mm, df$et_rate) / 8
giving:
[,1]
(Intercept) 1552.125
gap1 -101.625
flw_rate1 7.375
pwr1 306.125
gap1:flw_rate1 -24.875
gap1:pwr1 -153.625
flw_rate1:pwr1 -2.125
gap1:flw_rate1:pwr1 5.625
3) This gives the coded factors table shown in the question:
coded <- mm[1:8, 2:4]
coded
giving:
gap1 flw_rate1 pwr1
1 -1 -1 -1
2 1 -1 -1
3 -1 1 -1
4 1 1 -1
5 -1 -1 1
6 1 -1 1
7 -1 1 1
8 1 1 1
coded could also be obtained using the following where the indexing picks out the main effect columns:
H2 <- matrix(c(-1, -1, -1, 1), 2)
kronecker(kronecker(H2, H2), H2)[, c(2:3, 5)]
The Total column in the question sums the two replicates:
Total <- rowSums(matrix(df$et_rate, 8))
Total
## [1] 1154 1319 1234 1277 2089 1617 2138 1589
and in terms of Total and coded we can get the effects:
coef(lm(Total ~ .^3, as.data.frame(coded)))
## (Intercept) gap1 flw_rate1 pwr1
## 1552.125 -101.625 7.375 306.125
## gap1:flw_rate1 gap1:pwr1 flw_rate1:pwr1 gap1:flw_rate1:pwr1
## -24.875 -153.625 -2.125 5.625
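As a quick cross-check (the design is a balanced 2^3 factorial, so the ±1 contrast columns are orthogonal), approaches 1) and 2) should agree:
all.equal(2 * coef(md3), drop(crossprod(mm, df$et_rate) / 8))
# should be TRUE: twice the Helmert coefficients and X'y / 8 give the same effects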

Renaming integers within a data.frame

Using R...
I have a data.frame with five variables.
One of the variables colr has values ranging from 1 to 5.
Defined as an integer with values 1, 2, 3, 4, and 5.
Problem: I would like to build a regression model where the values within colr, the integers 1,2,3,4, and 5 are reported as independent variables with the following names.
1 = Silver,
2 = Blue,
3 = Pink,
4 = Other than Silver, Blue or Pink,
5 = Color Not Reported.
Question: Is there a way to extract or rename these values in a way that is different from the following (as this process does not rename, e.g., 1 to Silver in the summary regression output):
lm(dependent_variable ~ I(colr.f == 1) +
     I(colr.f == 2) +
     I(colr.f == 3) +
     I(colr.f == 4) +
     I(colr.f == 5),
   data = df)
I am open to any method that would allow me to create and name these different values independently but would prefer to see if there is a way to do so using the tidyverse or dplyr as this is something I have to do frequently when building multivariate models.
Thank you for any help.
If you have this:
df <- data.frame(int = sample(5, 20, TRUE), value = rnorm(20))
df
#> int value
#> 1 3 -0.62042198
#> 2 4 0.85009260
#> 3 5 -1.04971518
#> 4 1 -2.58255471
#> 5 1 0.62357772
#> 6 4 0.00286785
#> 7 4 -0.05981318
#> 8 4 0.72961261
#> 9 4 -0.03156315
#> 10 1 -2.05486209
#> 11 5 1.77099554
#> 12 1 1.02790956
#> 13 1 -0.70354012
#> 14 1 0.27353731
#> 15 2 -0.04817215
#> 16 2 0.17151374
#> 17 5 -0.54824346
#> 18 2 0.41123284
#> 19 5 0.05466070
#> 20 1 -0.41029986
You can do this:
library(tidyverse)
df <- df %>% mutate(color = factor(c("red", "green", "orange", "blue", "pink"))[int])
df
#> int value color
#> 1 3 -0.62042198 orange
#> 2 4 0.85009260 blue
#> 3 5 -1.04971518 pink
#> 4 1 -2.58255471 red
#> 5 1 0.62357772 red
#> 6 4 0.00286785 blue
#> 7 4 -0.05981318 blue
#> 8 4 0.72961261 blue
#> 9 4 -0.03156315 blue
#> 10 1 -2.05486209 red
#> 11 5 1.77099554 pink
#> 12 1 1.02790956 red
#> 13 1 -0.70354012 red
#> 14 1 0.27353731 red
#> 15 2 -0.04817215 green
#> 16 2 0.17151374 green
#> 17 5 -0.54824346 pink
#> 18 2 0.41123284 green
#> 19 5 0.05466070 pink
#> 20 1 -0.41029986 red
Which allows a regression like this:
lm(value ~ color, data = df) %>% summary()
#>
#> Call:
#> lm(formula = value ~ color, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.03595 -0.33687 -0.00447 0.46149 1.71407
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2982 0.4681 0.637 0.534
#> colorgreen -0.1200 0.7644 -0.157 0.877
#> colororange -0.9187 1.1466 -0.801 0.436
#> colorpink -0.2413 0.7021 -0.344 0.736
#> colorred -0.8448 0.6129 -1.378 0.188
#>
#> Residual standard error: 1.047 on 15 degrees of freedom
#> Multiple R-squared: 0.1451, Adjusted R-squared: -0.0829
#> F-statistic: 0.6364 on 4 and 15 DF, p-value: 0.6444
Created on 2020-02-16 by the reprex package (v0.3.0)
I'm not sure I'm understanding your question the right way, but can't you just use
library(dplyr)
df <- df %>%
  mutate(color = factor(colr.f, levels = 1:5,
                        labels = c("silver", "blue", "pink", "not s, b, p", "not reported")))
and then just run the regression on color only.
/edit for clarification. Making up some data:
df <- data.frame(
x=rnorm(100),
color=factor(rep(c(1,2,3,4,5), each=20),
labels=c("Silver", "Blue", "Pink", "Not S, B, P", "Not reported")),
y=rnorm(100, 4))
m1 <- lm(y~x+color, data=df)
m2 <- lm(y~x+color-1, data=df)
summary(m1)
Call:
lm(formula = y ~ x + color, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.93238 0.19312 20.362 <2e-16 ***
x 0.13588 0.09856 1.379 0.171
colorBlue -0.07862 0.27705 -0.284 0.777
colorPink -0.02167 0.27393 -0.079 0.937
colorNot S, B, P 0.15238 0.27221 0.560 0.577
colorNot reported 0.14139 0.27230 0.519 0.605
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.0268, Adjusted R-squared: -0.02496
F-statistic: 0.5177 on 5 and 94 DF, p-value: 0.7623
summary(m2)
Call:
lm(formula = y ~ x + color - 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.13588 0.09856 1.379 0.171
colorSilver 3.93238 0.19312 20.362 <2e-16 ***
colorBlue 3.85376 0.19570 19.692 <2e-16 ***
colorPink 3.91071 0.19301 20.262 <2e-16 ***
colorNot S, B, P 4.08477 0.19375 21.083 <2e-16 ***
colorNot reported 4.07377 0.19256 21.156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.9578, Adjusted R-squared: 0.9551
F-statistic: 355.5 on 6 and 94 DF, p-value: < 2.2e-16
The first model is a model with intercept, therefore one of the factor levels must be dropped to avoid perfect multicollinearity. In this case, the "effect" of silver is the value of the intercept, while the "effect" of the other colors is the intercept coefficient value + their respective coefficient value.
The second model is estimated without intercept (without constant), so you can see the individual effects. However, you should probably know what you are doing before estimating the model without intercept.
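To see the correspondence explicitly, the no-intercept coefficients in m2 can be recovered from m1 (a small check using the fits above):
cf1 <- coef(m1)
cf2 <- coef(m2)
# the reference level's "effect" in m2 is m1's intercept
all.equal(unname(cf2["colorSilver"]), unname(cf1["(Intercept)"]))
# every other level: m1's intercept plus that level's dummy coefficient
all.equal(unname(cf2["colorBlue"]), unname(cf1["(Intercept)"] + cf1["colorBlue"]))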
With base R.
labels <- c("Silver", "Blue", "Pink", "Other Color", "Color Not Reported")
df$colr.f2 <- factor(df$colr.f, labels = labels, levels = seq_along(labels))
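Either way, the regression can then be run on the relabelled factor; y below is just a placeholder for whatever the dependent variable is called:
fit_int   <- lm(y ~ colr.f2, data = df)      # with intercept: Silver is the reference level
fit_noint <- lm(y ~ colr.f2 - 1, data = df)  # without intercept: one coefficient per colour level
summary(fit_int)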

How can we find a smooth function for our data?

Suppose I have this small data T
69 59 100 70 35 1
matplot(t(T[1,]), type="l",xaxt="n")
I want to find a polynomial that fits the data (even an overfit is OK).
Is there any way I can do this in R?
First the data.
y <- scan(text = '69 59 100 70 35 1')
x <- seq_along(y)
Now a 2nd degree polynomial fit. This is fit with lm.
fit <- lm(y ~ poly(x, 2))
summary(fit)
#
#Call:
#lm(formula = y ~ poly(x, 2))
#
#Residuals:
# 1 2 3 4 5 6
# 7.0000 -20.6571 17.8286 0.4571 -6.7714 2.1429
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 55.667 6.848 8.128 0.00389 **
#poly(x, 2)1 -52.829 16.775 -3.149 0.05130 .
#poly(x, 2)2 -46.262 16.775 -2.758 0.07028 .
#---
#Signif. codes:
#0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 16.78 on 3 degrees of freedom
#Multiple R-squared: 0.8538, Adjusted R-squared: 0.7564
#F-statistic: 8.761 on 2 and 3 DF, p-value: 0.05589
Finally, the plot of both the original data and of the fitted values.
newy <- predict(fit, data.frame(x))
plot(y, type = "b")
lines(x, newy, col = "red")
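Since even an overfit is acceptable here: with six points, a degree-5 polynomial interpolates the data exactly (zero residuals). A small sketch along the same lines as the fit above:
fit5 <- lm(y ~ poly(x, 5))
round(resid(fit5), 10)  # essentially zero: the curve passes through every point
xx <- seq(min(x), max(x), length.out = 100)
lines(xx, predict(fit5, data.frame(x = xx)), col = "blue")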

Calculate T statistics for beta in linear regression model

I have the following equations for calculating the t statistic of a simple linear regression model:
t = beta1 / SE(beta1)
SE(beta1) = sqrt((RSS / var(x1)) * (1 / (n - 2)))
If I try this on a simple example in R, I am not able to get the same results as the linear model in R.
x <- c(1,2,4,8,16)
y <- c(1,2,3,4,5)
mod <- lm(y~x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5
-0.74194 0.01613 0.53226 0.56452 -0.37097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.50000 0.44400 3.378 0.0431 *
x 0.24194 0.05376 4.500 0.0205 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6558 on 3 degrees of freedom
Multiple R-squared: 0.871, Adjusted R-squared: 0.828
F-statistic: 20.25 on 1 and 3 DF, p-value: 0.02049
If I do this by hand I get a different value.
var(x)
37.2
sum(resid(mod)^2)
1.290323
beta1=0.24194
SE(beta1)=sqrt((1.290323/37.2)*(1/3))
SE(beta1)=0.1075269
So t= 0.24194/0.1075269=2.250042
So why is my calculation exactly half of the value from R? Does it have something to do with one- vs. two-tailed tests? The value for t(0.05/2) is 3.18.
Regards,
Jan
The different result was caused by a missing term in your formula for se(beta). It should be:
se(beta) = sqrt((1 / (n - 2)) * rss / (var(x) * (n - 1)))
The formula is usually written out as:
se(beta) = sqrt((1 / (n - 2)) * rss / sum((x - mean(x)) ^ 2))
rather than in terms of var(x).
For the sake of completeness, here's also the computational check:
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-10-30
x <- c(1, 2, 4, 8, 16)
y <- c(1, 2, 3, 4, 5)
n <- length(x)
mod <- lm(y ~ x)
summary(mod)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Residuals:
#> 1 2 3 4 5
#> -0.74194 0.01613 0.53226 0.56452 -0.37097
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.50000 0.44400 3.378 0.0431 *
#> x 0.24194 0.05376 4.500 0.0205 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6558 on 3 degrees of freedom
#> Multiple R-squared: 0.871, Adjusted R-squared: 0.828
#> F-statistic: 20.25 on 1 and 3 DF, p-value: 0.02049
mod_se_b <- summary(mod)$coefficients[2, 2]
rss <- sum(resid(mod) ^ 2)
se_b <- sqrt((1 / (n - 2)) * rss / (var(x) * (n - 1)))
all.equal(se_b, mod_se_b)
#> [1] TRUE
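The same check works for the t statistic itself, which is just the coefficient divided by the corrected standard error:
t_b <- unname(coef(mod)["x"]) / se_b
mod_t_b <- summary(mod)$coefficients[2, 3]
all.equal(t_b, mod_t_b)
# should be TRUE, matching the t value of 4.500 reported by summary(mod)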
