I am trying to implement and analyse a full factorial experiment in R, but I don't understand why my results differ from those presented in the book. Here are the problem details:
I used a least-squares model to estimate the effects of the factors gap, power and flow rate, but the effect sizes reported in the book are completely different.
My implementation of the problem in R and the results are as follows:
et_rate = c(550, 669, 633, 642, 1037, 749, 1075, 729,
604, 650, 601, 635, 1052, 868, 1063, 860)
gap = factor(rep(1:2, times = 8))
flw_rate = factor(rep(1:2, each = 2, times = 4))
pwr = factor(rep(1:2, each = 4, times= 2))
df <- data.frame(gap, flw_rate, pwr, et_rate)
md3 <- lm(et_rate ~ .^3, data = df)
summary(md3)
And my results are:
Call:
lm(formula = et_rate ~ .^3, data = df)
Residuals:
Min 1Q Median 3Q Max
-65.50 -11.12 0.00 11.12 65.50
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 577.00 33.56 17.193 1.33e-07 ***
gap2 82.50 47.46 1.738 0.12036
flw_rate2 40.00 47.46 0.843 0.42382
pwr2 467.50 47.46 9.850 9.50e-06 ***
gap2:flw_rate2 -61.00 67.12 -0.909 0.39000
gap2:pwr2 -318.50 67.12 -4.745 0.00145 **
flw_rate2:pwr2 -15.50 67.12 -0.231 0.82317
gap2:flw_rate2:pwr2 22.50 94.92 0.237 0.81859
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 47.46 on 8 degrees of freedom
Multiple R-squared: 0.9661, Adjusted R-squared: 0.9364
F-statistic: 32.56 on 7 and 8 DF, p-value: 2.896e-05
I also fitted a reduced model with only the gap and power terms:
Call:
lm(formula = et_rate ~ gap * pwr, data = df)
Coefficients:
(Intercept) gap2 pwr2 gap2:pwr2
597.0 52.0 459.7 -307.2
I was expecting the coefficients of my model to equal the effect estimates, but they are completely different from those given in the book's solution. Am I mistaken in my approach to getting the effect sizes?
1) Using Helmert contrasts as in @Roland's comment:
options(contrasts = c(unordered = "contr.helmert", ordered = "contr.poly"))
md3 <- lm(et_rate ~ .^3, df)
2 * coef(md3)
giving the effects shown in the question as:
(Intercept) gap1 flw_rate1 pwr1
1552.125 -101.625 7.375 306.125
gap1:flw_rate1 gap1:pwr1 flw_rate1:pwr1 gap1:flw_rate1:pwr1
-24.875 -153.625 -2.125 5.625
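As a check on the factor of 2, note that a main effect in this 2^3 design is just the difference between the mean responses at the high and low levels of that factor. For example, for gap (using the df built in the question):
with(df, diff(tapply(et_rate, gap, mean)))
##        2
## -101.625
which matches the gap effect above.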
2) Using md3 from above, this also gives the effects shown in the question:
mm <- model.matrix(md3)
crossprod(mm, df$et_rate) / 8
giving:
[,1]
(Intercept) 1552.125
gap1 -101.625
flw_rate1 7.375
pwr1 306.125
gap1:flw_rate1 -24.875
gap1:pwr1 -153.625
flw_rate1:pwr1 -2.125
gap1:flw_rate1:pwr1 5.625
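The division by 8 works because each contrast column of mm contains eight +1s and eight -1s, so crossprod(mm, df$et_rate) / 8 is exactly the difference between the two level means for each effect. The columns are also mutually orthogonal, which a quick check confirms:
crossprod(mm)                      # 16 on the diagonal, 0 elsewhere
all(crossprod(mm) == diag(16, 8))
## [1] TRUE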
3) This gives the coded factors table shown in the question:
coded <- mm[1:8, 2:4]
coded
giving:
gap1 flw_rate1 pwr1
1 -1 -1 -1
2 1 -1 -1
3 -1 1 -1
4 1 1 -1
5 -1 -1 1
6 1 -1 1
7 -1 1 1
8 1 1 1
coded could also be obtained using the following where the indexing picks out the main effect columns:
H2 <- matrix(c(-1, -1, -1, 1), 2)
kronecker(kronecker(H2, H2), H2)[, c(2:3, 5)]
The Total column in the question sums the two replicates:
Total <- rowSums(matrix(df$et_rate, 8))
Total
## [1] 1154 1319 1234 1277 2089 1617 2138 1589
and in terms of Total and coded we can get the effects:
coef(lm(Total ~ .^3, as.data.frame(coded)))
## (Intercept) gap1 flw_rate1 pwr1
## 1552.125 -101.625 7.375 306.125
## gap1:flw_rate1 gap1:pwr1 flw_rate1:pwr1 gap1:flw_rate1:pwr1
## -24.875 -153.625 -2.125 5.625
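One housekeeping note: options(contrasts = ...) changes a global setting for the whole session, so once you are done you may want to restore R's default treatment contrasts:
options(contrasts = c(unordered = "contr.treatment", ordered = "contr.poly"))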
Related
Random-coefficient Poisson models are rather difficult to fit, and there tends to be some variability in parameter estimates between lme4 and glmmADMB. But in my case:
# Packages
library(lme4)
library(glmmADMB)
#Open my dataset
myds<-read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/my_glmm_dataset.csv")
str(myds)
# 'data.frame': 526 obs. of 10 variables:
# $ Bioma : chr "Pampa" "Pampa" "Pampa" "Pampa" ...
# $ estacao : chr "verao" "verao" "verao" "verao" ...
# $ ciclo : chr "1°" "1°" "1°" "1°" ...
# $ Hour : int 22 23 0 1 2 3 4 5 6 7 ...
# $ anthill : num 23.5 23.5 23.5 23.5 23.5 ...
# $ formigueiro: int 2 2 2 2 2 2 2 2 2 2 ...
# $ ladenant : int 34 39 29 25 20 31 16 28 21 12 ...
# $ unladen : int 271 258 298 317 316 253 185 182 116 165 ...
# $ UR : num 65.7 69 71.3 75.8 78.1 ...
# $ temp : num 24.3 24.3 24 23.7 23.1 ...
I have a count of insects (ladenant) as a function of biome (Bioma), temperature (temp) and humidity (UR), with formigueiro (ant nest) treated as a random effect because the nests are pseudoreplicates. I then try to model the relationship using lme4 and glmmADMB.
First I try lme4:
m.laden.1 <- glmer(ladenant ~ Bioma + poly(temp,2) + UR + (1 | formigueiro), data = myds, family = poisson(link = "log"))
summary(m.laden.1)
# Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
# Family: poisson ( log )
# Formula: ladenant ~ Bioma + poly(temp, 2) + UR + (1 | formigueiro)
# Data: myds
# AIC BIC logLik deviance df.resid
# 21585.9 21615.8 -10786.0 21571.9 519
# Scaled residuals:
# Min 1Q Median 3Q Max
# -10.607 -4.245 -1.976 2.906 38.242
# Random effects:
# Groups Name Variance Std.Dev.
# formigueiro (Intercept) 0.02049 0.1432
# Number of obs: 526, groups: formigueiro, 5
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.7379495 0.0976701 7.556 4.17e-14 ***
# BiomaTransition 1.3978383 0.0209623 66.684 < 2e-16 ***
# BiomaPampa -0.1256759 0.0268164 -4.687 2.78e-06 ***
# poly(temp, 2)1 7.1035195 0.2079550 34.159 < 2e-16 ***
# poly(temp, 2)2 -7.2900687 0.2629908 -27.720 < 2e-16 ***
# UR 0.0302810 0.0008029 37.717 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Correlation of Fixed Effects:
# (Intr) BmTrns BimPmp p(,2)1 p(,2)2
# BiomaTrnstn -0.586
# BiomaPampa -0.199 0.352
# ply(tmp,2)1 -0.208 0.267 0.312
# ply(tmp,2)2 -0.191 0.085 -0.175 -0.039
# UR -0.746 0.709 0.188 0.230 0.316
# optimizer (Nelder_Mead) convergence code: 0 (OK)
# Model is nearly unidentifiable: very large eigenvalue
# - Rescale variables?
Second I try glmmADMB:
m.laden.2 <- glmmadmb(ladenant ~ Bioma + poly(temp,2) + UR + (1 | formigueiro), data = myds, family = "poisson", link = "log")
summary(m.laden.2)
# Call:
# glmmadmb(formula = ladenant ~ Bioma + poly(temp, 2) + UR + (1 |
# formigueiro), data = myds, family = "poisson", link = "log")
# AIC: 12033.9
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.52390 0.26923 5.66 1.5e-08 ***
# BiomaTransition 0.23967 0.08878 2.70 0.0069 **
# BiomaPampa 0.09680 0.05198 1.86 0.0626 .
# poly(temp, 2)1 -0.38754 0.55678 -0.70 0.4864
# poly(temp, 2)2 -1.16028 0.39608 -2.93 0.0034 **
# UR 0.01560 0.00261 5.97 2.4e-09 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Number of observations: total=526, formigueiro=5
# Random effect variance(s):
# Group=formigueiro
# Variance StdDev
# (Intercept) 0.07497 0.2738
Despite the different statistical packages and estimation methods, the two models give completely different significance levels for the Bioma variable. My question is:
is there any other approach that can be used to compare the results and choose a final model?
Thanks in advance.
I got a little bit carried away. tl;dr as pointed out in comments, it's hard to get glmmADMB to work with a Poisson model, but a model with overdispersion (e.g. negative binomial) is clearly a lot better. Furthermore, you should probably incorporate some aspect of random slopes in the model ...
Packages (colorblindr is optional):
library(lme4)
library(glmmADMB)
library(glmmTMB)
library(broom.mixed)
library(tidyverse) ## ggplot2, dplyr, tidyr, purrr ...
library(colorblindr) ## remotes::install_github("clauswilke/colorblindr")
theme_set(theme_bw())
Get data: standardize input variables so we can easily make a sensible coefficient plot
## read.csv doesn't work for me out of the box, locale/encoding issues
myds <- readr::read_csv("my_glmm_dataset.csv") %>%
mutate(across(formigueiro, as.factor),
across(c(UR, temp), ~ drop(scale(.))))
Formulas and models:
form <- ladenant ~ Bioma + poly(temp,2) + UR + (1 | formigueiro)
## random slopes, all independent (also tried with correlations (| instead
## of ||), but fails)
form_x <- ladenant ~ Bioma + poly(temp,2) + UR + (1 + UR + poly(temp,2) || formigueiro)
glmer_pois <- glmer(form, data = myds, family = poisson(link = "log"))
## fails
glmmADMB_pois <- try(glmmadmb(form, data = myds, family = "poisson"))
## fails ("Parameters were estimated, but standard errors were not:
## the most likely problem is that the curvature at MLE was zero or negative"
glmmTMB_pois <- glmmTMB(form, data = myds, family = poisson)
glmer_nb2 <- glmer.nb(form, data = myds)
glmmADMB_nb2 <- glmmadmb(form, data = myds, family = "nbinom2")
glmmTMB_nb2 <- update(glmmTMB_pois, family = "nbinom2")
glmmTMB_nb1 <- update(glmmTMB_pois, family = "nbinom1")
glmmTMB_nb2ext <- update(glmmTMB_nb2, formula = form_x)
Put it all together:
modList <- tibble::lst(glmer_pois, glmmTMB_pois, glmer_nb2, glmmADMB_nb2, glmmTMB_nb2,
glmmTMB_nb1, glmmTMB_nb2ext)
bbmle::AICtab(modList)
dAIC df
glmmTMB_nb2ext 0.0 11
glmer_nb2 27.0 8
glmmADMB_nb2 27.1 8
glmmTMB_nb2 27.1 8
glmmTMB_nb1 79.5 8
glmmTMB_pois 16658.0 7
glmer_pois 16658.0 7
The 'nb2' models are all OK, but the random-slopes model is considerably better.
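To see why the Poisson fits do so badly, a rough overdispersion check is to compare the sum of squared Pearson residuals with the residual degrees of freedom; a ratio far above 1 indicates overdispersion and points towards the negative binomial models. A minimal sketch for the Poisson glmer fit above (pearson_disp is just a throwaway helper name):
## ratio of Pearson chi-square to residual df; >> 1 means overdispersion
pearson_disp <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}
pearson_disp(glmer_pois)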
Coefficient plots, including the fixed effects from all methods:
tt <- (purrr::map_dfr(modList, tidy, effects = "fixed", conf.int = TRUE,
.id = "model") |>
dplyr::filter(term != "(Intercept)") |>
tidyr::separate(model, into = c("platform", "distrib"))
)
ggplot(tt, aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high,
colour = platform, shape = distrib)) +
geom_pointrange(position = position_dodge(width = 0.25)) +
geom_vline(xintercept = 0, lty =2) +
scale_colour_OkabeIto()
Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what R is doing under the hood, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, skipping the sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully explain why, in the second regression with no constant, the dFALSE coefficient captures the effect of the intercept from the default regression with a constant.)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_def),
x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
y = model.response(mod_frame_nocon),
x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
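The reason dFALSE reproduces the intercept is that the two model matrices span the same column space: the dFALSE and dTRUE columns add up to a column of ones, so dropping the intercept while keeping both dummies is just a reparameterisation of the model with an intercept. A quick check (b is just a throwaway name here):
## the two dummy columns sum to the intercept column
all(rowSums(mod_matrix_nocon[, c("dFALSE", "dTRUE")]) == 1)
#> [1] TRUE
## same fit, different parameterisation: dFALSE plays the role of the intercept,
## and dTRUE equals intercept + dTRUE from the default model
b <- coef(lm(y ~ x + d, data = df))
unname(b["(Intercept)"] + b["dTRUE"])
#> [1] 0.5888348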
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_asnum),
x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
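For completeness, the same Stata-like fit can also be obtained directly from lm(), without going through lm.fit(), by forcing d into the model as a single numeric dummy:
coef(lm(y ~ x + as.numeric(d) + 0, data = df))
#>             x as.numeric(d) 
#>     0.9762140    -0.2322012
which reproduces Stata's noconstant estimates.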
Suppose I have this small dataset T:
69 59 100 70 35 1
matplot(t(T[1,]), type="l",xaxt="n")
I want to find a polynomial that fits the data (even an overfit is fine).
Is there any way to do this in R?
First the data.
y <- scan(text = '69 59 100 70 35 1')
x <- seq_along(y)
Now a 2nd-degree polynomial fit, done with lm().
fit <- lm(y ~ poly(x, 2))
summary(fit)
#
#Call:
#lm(formula = y ~ poly(x, 2))
#
#Residuals:
# 1 2 3 4 5 6
# 7.0000 -20.6571 17.8286 0.4571 -6.7714 2.1429
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 55.667 6.848 8.128 0.00389 **
#poly(x, 2)1 -52.829 16.775 -3.149 0.05130 .
#poly(x, 2)2 -46.262 16.775 -2.758 0.07028 .
#---
#Signif. codes:
#0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 16.78 on 3 degrees of freedom
#Multiple R-squared: 0.8538, Adjusted R-squared: 0.7564
#F-statistic: 8.761 on 2 and 3 DF, p-value: 0.05589
Finally, plot both the original data and the fitted values.
newy <- predict(fit, data.frame(x))
plot(y, type = "b")
lines(x, newy, col = "red")
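Since overfitting is explicitly allowed, the highest-degree polynomial these six points can support is degree 5, and it passes through every point exactly (zero residual degrees of freedom). A minimal sketch (fit5 and newx are just illustrative names):
fit5 <- lm(y ~ poly(x, 5))
round(resid(fit5), 10)          # all (numerically) zero: the curve interpolates the data
newx <- seq(min(x), max(x), length.out = 100)
plot(x, y, type = "b")
lines(newx, predict(fit5, data.frame(x = newx)), col = "blue")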
What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed in R?
I understand that Pr(>|t|) is a p-value, but I do not understand how it is calculated.
For example, the Pr(>|t|) of x1 is displayed as 0.021 in the output below, and I want to know how this value was calculated.
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6 -2 0 2 6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the column t-value are obtained by dividing the coefficient estimate (which is in the Estimate column) by the standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value: the probability that a t-distributed random variable (with the residual degrees of freedom) exceeds 2.965 in absolute value. Using the symmetry of the t-distribution, this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus the number of estimated coefficients = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
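Putting the pieces together, the whole coefficient table can be reproduced by hand from the model matrix; this is essentially what summary.lm() does internally (X and the other names below are just for illustration):
X <- cbind(1, x1, x2)                      # model matrix with intercept
fit <- lm(y ~ x1 + x2)
rdf <- length(y) - ncol(X)                 # 10 - 3 = 7
sigma2 <- sum(resid(fit)^2) / rdf          # residual variance
se <- sqrt(diag(solve(crossprod(X))) * sigma2)
tval <- coef(fit) / se
pval <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
cbind(Estimate = coef(fit), "Std. Error" = se, "t value" = tval, "Pr(>|t|)" = pval)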
I have the following equations for calculating the t statistic of a simple linear regression model:
t = beta1 / SE(beta1)
SE(beta1) = sqrt((RSS / var(x1)) * (1 / (n - 2)))
If I try this on a simple example in R, I am not able to get the same results as the linear model in R.
x <- c(1,2,4,8,16)
y <- c(1,2,3,4,5)
mod <- lm(y~x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5
-0.74194 0.01613 0.53226 0.56452 -0.37097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.50000 0.44400 3.378 0.0431 *
x 0.24194 0.05376 4.500 0.0205 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6558 on 3 degrees of freedom
Multiple R-squared: 0.871, Adjusted R-squared: 0.828
F-statistic: 20.25 on 1 and 3 DF, p-value: 0.02049
If I do this by hand I get a different value:
var(x)
37.2
sum(resid(mod)^2)
1.290323
beta1=0.24194
SE(beta1)=sqrt((1.290323/37.2)*(1/3))
SE(beta1)=0.1075269
So t = 0.24194 / 0.1075269 = 2.250042.
Why is my calculation exactly half of the value from R? Does it have something to do with one- vs. two-tailed tests? The value for t(0.05/2) is 3.18.
Regards,
Jan
The different result was caused by a missing term in your formula for se(beta). It should be:
se(beta) = sqrt((1 / (n - 2)) * rss / (var(x) * (n - 1)))
The formula is usually written out as:
se(beta) = sqrt((1 / (n - 2)) * rss / sum((x - mean(x)) ^ 2))
rather than in terms of var(x).
For the sake of completeness, here's also the computational check:
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-10-30
x <- c(1, 2, 4, 8, 16)
y <- c(1, 2, 3, 4, 5)
n <- length(x)
mod <- lm(y ~ x)
summary(mod)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Residuals:
#> 1 2 3 4 5
#> -0.74194 0.01613 0.53226 0.56452 -0.37097
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.50000 0.44400 3.378 0.0431 *
#> x 0.24194 0.05376 4.500 0.0205 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6558 on 3 degrees of freedom
#> Multiple R-squared: 0.871, Adjusted R-squared: 0.828
#> F-statistic: 20.25 on 1 and 3 DF, p-value: 0.02049
mod_se_b <- summary(mod)$coefficients[2, 2]
rss <- sum(resid(mod) ^ 2)
se_b <- sqrt((1 / (n - 2)) * rss / (var(x) * (n - 1)))
all.equal(se_b, mod_se_b)
#> [1] TRUE
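And going one step further, the corrected standard error reproduces the t value reported by summary(mod):
beta1 <- coef(mod)[["x"]]
beta1 / se_b
#> [1] 4.5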