I have the following model:
y = b1_group1*X1 + b1_group2*X1 + b2_group1*X2 + b2_group2*X2 + ... +
b10_group1*X10 + b10_group2*X10
Easily made in R as follows:
OLS <- lm(Y ~ t1:Group + t2:Group + t3:Group + t4:Group + t5:Group + t6:Group +
t7:Group + t8:Group + t9:Group + t10:Group,weights = weight, Alldata)
In STATA, I can now do the following test:
test (b1_group1=b1_group2) (b2_group1=b2_group2) (b3_group1=b3_group2)
b1_group1 - b1_group2 = 0
b2_group1 - b2_group2 = 0
b3_group1 - b3_group2 = 0
Which tells me, by means of an F test, whether the coefficients on X1, X2 and X3 are jointly different between Group 1 and Group 2.
Can somebody please tell me how to do this in R? Thanks!
Look at this example:
library(car)
mod <- lm(mpg ~ disp + hp + drat*wt, mtcars)
linearHypothesis(mod, c("disp = hp", "disp = drat", "disp = drat:wt" ))
Linear hypothesis test
Hypothesis:
disp - hp = 0
disp - drat = 0
disp - drat:wt = 0
Model 1: restricted model
Model 2: mpg ~ disp + hp + drat * wt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 29 211.80
2 26 164.67 3 47.129 2.4804 0.08337 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
See ?linearHypothesis for a variety of other ways to specify the test.
Alternative:
The above shows you a quick and easy way to carry out hypothesis tests. Users with a solid understanding of the algebra of hypothesis tests may find the following approach more convenient, at least for simple versions of the test. Let's say we want to test whether or not the coefficients on cyl and carb are identical.
mod <- lm(mpg ~ disp + hp + cyl + carb, mtcars)
The following tests are equivalent:
Test one:
linearHypothesis(mod, c("cyl = carb" ))
Linear hypothesis test
Hypothesis:
cyl - carb = 0
Model 1: restricted model
Model 2: mpg ~ disp + hp + cyl + carb
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 238.83
2 27 238.71 1 0.12128 0.0137 0.9076
Test two:
rmod<- lm(mpg ~ disp + hp + I(cyl + carb), mtcars)
anova(mod, rmod)
Analysis of Variance Table
Model 1: mpg ~ disp + hp + cyl + carb
Model 2: mpg ~ disp + hp + I(cyl + carb)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 27 238.71
2 28 238.83 -1 -0.12128 0.0137 0.9076
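The same restricted-versus-unrestricted comparison extends to the group-interaction model in the original question. The sketch below is only an illustration: it uses mtcars as stand-in data, and `grp`, `full` and `restricted` are hypothetical names, not the asker's variables. Equality of a slope across the two groups is imposed by replacing the `x:grp` term with a plain `x` term, and anova() then gives the joint F test for all replaced terms.

```r
# Stand-in data: `grp` is a hypothetical two-level grouping factor built from `am`.
dat <- transform(mtcars, grp = factor(am, labels = c("g1", "g2")))

# Unrestricted model: every slope is allowed to differ between the groups.
full <- lm(mpg ~ disp:grp + hp:grp + wt:grp + qsec:grp, data = dat)

# Restricted model: disp, hp and wt each share one slope across groups,
# while qsec keeps its group-specific slopes.
restricted <- lm(mpg ~ disp + hp + wt + qsec:grp, data = dat)

# Joint F test of the three equality restrictions (3 numerator df).
joint_test <- anova(restricted, full)
print(joint_test)
```

If the original fit used weights, the same `weights` argument would have to be passed to both lm() calls.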
If my model looks like this, Y=β0+β1X1+β2X2+β3X3+β4X4, and I want to perform an F test (5%) in R for β1=β2, how do I do it?
The only tutorials I can find online deal with β1=β2=0, but that's not what I'm looking for here.
Here's an example in R testing whether the coefficient for vs is the same as the coefficient for am:
data(mtcars)
mod <- lm(mpg ~ hp + disp + vs + am, data=mtcars)
library(car)
linearHypothesis(mod, "vs=am")
# Linear hypothesis test
#
# Hypothesis:
# vs - am = 0
#
# Model 1: restricted model
# Model 2: mpg ~ hp + disp + vs + am
#
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 28 227.07
# 2 27 213.52 1 13.547 1.7131 0.2016
The glht function from the multcomp package can do this (among other things). For example, if your model is
mod1 <-lm( y ~ x1 + x2 + x3 + x4)
then you can use:
summary(multcomp::glht(mod1, "x1-x2=0"))
Run the model with and without the constraint and then use anova to compare them. No packages are used.
mod1 <- lm(mpg ~ cyl + disp + hp + drat, mtcars)
mod2 <- lm(mpg ~ I(cyl + disp) + hp + drat, mtcars) # constraint imposed
anova(mod2, mod1)
giving:
Analysis of Variance Table
Model 1: mpg ~ I(cyl + disp) + hp + drat
Model 2: mpg ~ cyl + disp + hp + drat
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 252.95
2 27 244.90 1 8.0513 0.8876 0.3545
The underlying calculation is the following. It gives the same result as above.
L <- matrix(c(0, 1, -1, 0, 0), 1) # hypothesis is L %*% beta == 0
q <- nrow(L) # 1
co <- coef(mod1)
resdf <- df.residual(mod1) # = nobs(mod1) - length(co) = 32 - 5 = 27
SSH <- t(L %*% co) %*% solve(L %*% vcov(mod1) %*% t(L)) %*% L %*% co
SSH/q # F value
## [,1]
## [1,] 0.8876363
pf(SSH/q, q, resdf, lower.tail = FALSE) # p value
## [,1]
## [1,] 0.3544728
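The calculation above handles one restriction, but the same formula works for several restrictions at once when L has one row per restriction. A minimal sketch, assuming an ordinary lm fit; `wald_f_test` is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: Wald F test of H0: L %*% beta == 0 for an lm fit.
wald_f_test <- function(model, L) {
  L <- rbind(L)                        # accept a single row given as a vector
  q <- nrow(L)                         # number of restrictions
  est <- L %*% coef(model)             # estimated value of each restriction
  f_value <- drop(t(est) %*% solve(L %*% vcov(model) %*% t(L)) %*% est) / q
  resdf <- df.residual(model)
  c(F = f_value, df1 = q, df2 = resdf,
    p = pf(f_value, q, resdf, lower.tail = FALSE))
}

mod1 <- lm(mpg ~ cyl + disp + hp + drat, mtcars)

# Two restrictions at once: cyl = disp and hp = drat.
# Columns follow coef(mod1): (Intercept), cyl, disp, hp, drat.
L2 <- rbind(c(0, 1, -1, 0,  0),
            c(0, 0,  0, 1, -1))
res <- wald_f_test(mod1, L2)
print(res)

# Cross-check against the restricted-model comparison.
mod_r <- lm(mpg ~ I(cyl + disp) + I(hp + drat), mtcars)
cmp <- anova(mod_r, mod1)
print(cmp)
```

For a linear model the two routes should agree on the F statistic up to rounding, which makes the anova() comparison a convenient check on a hand-built L.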
Suppose in R I have multiple GLM objects from multiple glm() function calls.
glm_01
glm_02
...
glm_nn
...and suppose that I want to do all possible pairwise comparisons using a chi-squared or F ANOVA test.
anova(glm_01, glm_02, test = "F")
anova(glm_01, glm_03, test = "F")
anova(glm_01, glm_04, test = "F")
...
I don't want to do this manually because the list of models is quite long. Instead I'd like to grab a list of relevant model objects (anything starting with "glm_") and do all pairwise comparisons automatically. However I'm unsure how to pass the model objects (rather than their names in string form) to the anova() function.
As a simple example:
data(mtcars)
# create some models
glm_01 <- glm(mpg ~ cyl , mtcars, family = gaussian())
glm_02 <- glm(mpg ~ cyl + disp , mtcars, family = gaussian())
glm_03 <- glm(mpg ~ cyl + disp + hp , mtcars, family = gaussian())
glm_04 <- glm(mpg ~ cyl + disp + hp + wt, mtcars, family = gaussian())
# get list of relevant model objects from the R environment
model_list <- ls()
model_list <- model_list[substr(model_list, 1, 4) == "glm_"]
# create a table to store the pairwise ANOVA results
n_models <- length(model_list)
anova_table <- matrix(0, nrow = n_models, ncol = n_models)
# loop through twice and do pairwise comparisons
for(row_index in 1:n_models) {
  for(col_index in 1:n_models) {
    anova_table[row_index, col_index] <- anova(model_list[row_index], model_list[col_index], test = "F")$'Pr(>F)'[2]
  }
}
...but of course this loop at the end doesn't work because I'm not passing model objects to anova(), I'm passing the names of the objects as strings instead. How do I tell anova() to use the object that the string refers to, instead of the string itself?
Thank you.
======================
Possible solution:
data(mtcars)
glm_list <- list()
glm_list$glm_01 <- glm(mpg ~ cyl , mtcars, family = gaussian())
glm_list$glm_02 <- glm(mpg ~ cyl + disp , mtcars, family = gaussian())
glm_list$glm_03 <- glm(mpg ~ cyl + disp + hp , mtcars, family = gaussian())
glm_list$glm_04 <- glm(mpg ~ cyl + disp + hp + wt, mtcars, family = gaussian())
# create a table to store the pairwise ANOVA results
n_models <- length(glm_list)
anova_table <- matrix(0, nrow = n_models, ncol = n_models)
# loop through twice and do pairwise comparisons
row_idx <- 0
col_idx <- 0
for(row_glm in glm_list)
{
  row_idx <- row_idx + 1
  for(col_glm in glm_list)
  {
    col_idx <- col_idx + 1
    anova_table[row_idx, col_idx] <- anova(row_glm, col_glm, test = "F")$'Pr(>F)'[2]
  }
  col_idx <- 0
}
row_idx <- 0
The easiest way to do this would be to keep all your models in a list. This makes it simple to iterate over them. For example, you can create all of your models and do a pairwise comparison between all of them like this:
data(mtcars)
f_list <- list(mpg ~ cyl,
mpg ~ cyl + disp,
mpg ~ cyl + disp + hp,
mpg ~ cyl + disp + hp + wt)
all_glms <- lapply(f_list, glm, data = mtcars, family = gaussian)
all_pairs <- as.data.frame(combn(length(all_glms), 2))
result <- lapply(all_pairs, function(i) anova(all_glms[[i[1]]], all_glms[[i[2]]]))
Which gives you:
result
#> $V1
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 29 270.74 1 37.594
#>
#> $V2
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp + hp
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 28 261.37 2 46.965
#>
#> $V3
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 27 170.44 3 137.89
#>
#> $V4
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp
#> Model 2: mpg ~ cyl + disp + hp
#> Resid. Df Resid. Dev Df Deviance
#> 1 29 270.74
#> 2 28 261.37 1 9.3709
#>
#> $V5
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 29 270.74
#> 2 27 170.44 2 100.3
#>
#> $V6
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp + hp
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 28 261.37
#> 2 27 170.44 1 90.925
Created on 2020-08-25 by the reprex package (v0.3.0)
If you want to reference arbitrary objects in an accessible environment by symbol without putting them into a list object, the standard way to return the top object on the search list whose symbol is equal to a string is get(), or the vector equivalent mget(). I.e. get("glm_01") gets you the top object on the search list that has the symbol glm_01. The most minimal modification to your approach would be to wrap your calls to model_list[row_index] and model_list[col_index] in get().
You can be more precise about where to look for objects by assigning the models in a named environment and only getting from that environment (using the envir parameter to get()).
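A minimal sketch of the get()/mget() route described above, assuming the models are stored in a dedicated environment; `model_env` and the other object names are illustrative only. Each unordered pair is compared once, rather than filling a full square table.

```r
# Keep the models in their own environment (hypothetical name `model_env`)
# so that lookups by name cannot pick up unrelated objects.
model_env <- new.env()
model_env$glm_01 <- glm(mpg ~ cyl,                  data = mtcars, family = gaussian())
model_env$glm_02 <- glm(mpg ~ cyl + disp,           data = mtcars, family = gaussian())
model_env$glm_03 <- glm(mpg ~ cyl + disp + hp,      data = mtcars, family = gaussian())
model_env$glm_04 <- glm(mpg ~ cyl + disp + hp + wt, data = mtcars, family = gaussian())

# Names of the relevant objects, as strings.
model_names <- ls(model_env, pattern = "^glm_")

# get() turns one name into the object; mget() does the same for a vector of names.
first_model <- get("glm_01", envir = model_env)
models <- mget(model_names, envir = model_env)

# All unordered pairs of model names, one pair per column.
pairs <- combn(model_names, 2)

# p-value of the F test for each pair.
p_values <- apply(pairs, 2, function(nm) {
  anova(models[[nm[1]]], models[[nm[2]]], test = "F")[["Pr(>F)"]][2]
})
names(p_values) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
print(p_values)
```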
As a reproducible example, let's use the following nonsensical model:
> library(glmmTMB)
> summary(glmmTMB(am ~ disp + hp + (1|carb), data = mtcars))
Family: gaussian ( identity )
Formula: am ~ disp + hp + (1 | carb)
Data: mtcars
AIC BIC logLik deviance df.resid
34.1 41.5 -12.1 24.1 27
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
carb (Intercept) 2.011e-11 4.485e-06
Residual 1.244e-01 3.528e-01
Number of obs: 32, groups: carb, 6
Dispersion estimate for gaussian family (sigma^2): 0.124
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7559286 0.1502385 5.032 4.87e-07 ***
disp -0.0042892 0.0008355 -5.134 2.84e-07 ***
hp 0.0043626 0.0015103 2.889 0.00387 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Actually, my real model family is nbinom2. I want to make a contrast test between disp and hp. So, I try:
> glht(glmmTMB(am ~ disp + hp + (1|carb), data = mtcars), linfct = matrix(c(0,1,-1)))
Error in glht.matrix(glmmTMB(am ~ disp + hp + (1 | carb), data = mtcars), :
‘ncol(linfct)’ is not equal to ‘length(coef(model))’
How can I avoid this error?
Thank you!
The problem is actually fairly simple: linfct needs to be a matrix with the number of columns equal to the number of parameters. You specified matrix(c(0,1,-1)) without specifying numbers of rows or columns, so R made a column matrix by default. Adding nrow=1 seems to work.
library(glmmTMB)
library(multcomp)
m1<- glmmTMB(am ~ disp + hp + (1|carb), data = mtcars)
modelparm.glmmTMB <- function (model, coef. = function(x) fixef(x)[[component]],
vcov. = function(x) vcov(x)[[component]],
df = NULL, component="cond", ...) {
multcomp:::modelparm.default(model, coef. = coef., vcov. = vcov.,
df = df, ...)
}
glht(m1, linfct = matrix(c(0,1,-1),nrow=1))
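The shape problem described above can be checked in base R without fitting anything. This small sketch only inspects the dimensions of the two contrast matrices; `col_form` and `row_form` are illustrative names.

```r
v <- c(0, 1, -1)

# Without nrow or ncol, matrix() builds a single column: 3 rows, 1 column.
col_form <- matrix(v)
dim(col_form)

# With nrow = 1 there is one row (one contrast) and one column per
# fixed-effect coefficient: (Intercept), disp, hp.
row_form <- matrix(v, nrow = 1)
dim(row_form)
```

Since glht() compares ncol(linfct) with the number of coefficients, only the second form lines up with a three-coefficient model.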
Is there a standard way to estimate a confidence interval for the variance parameter of a linear model with only fixed effects? E.g. given:
reg=lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
how can I get the confidence interval for the variance parameter? confint only covers the fixed effects, and lmer from lme4 does not accept a model without a level-2 random effect, which is my case here.
Unfortunately, you have to implement it yourself.
Like so :
reg <- lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
alpha <- 0.05
n <- length(resid(reg))
sigma <- summary(reg)$sigma
sigma*n/qchisq(1-alpha/2, df = n-2) ; sigma*n/qchisq(alpha/2, df = n-2)
> sigma*n/qchisq(1-alpha/2, df = n-2) ; sigma*n/qchisq(alpha/2, df = n-2)
[1] 0.4600539
[1] 1.287194
It comes from the relation :
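For reference, the textbook result for a linear model with normal errors and p estimated coefficients is that (n - p) s^2 / sigma^2 follows a chi-squared distribution with n - p degrees of freedom, where s is the residual standard error. A minimal sketch of the interval built from that result is below. Note that it uses the squared residual standard error and the residual degrees of freedom (27 for this model), so it will not reproduce the numbers printed above, which were computed from sigma, n and n - 2.

```r
reg <- lm(100/mpg ~ disp + hp + wt + am, data = mtcars)
alpha <- 0.05

df_res <- df.residual(reg)      # n - p = 32 - 5 = 27
s2 <- summary(reg)$sigma^2      # estimate of the error variance sigma^2

# Confidence interval for sigma^2; assumes normally distributed errors.
ci_var <- c(lower = df_res * s2 / qchisq(1 - alpha / 2, df = df_res),
            upper = df_res * s2 / qchisq(alpha / 2, df = df_res))
ci_var

# Taking square roots gives the corresponding interval for sigma.
sqrt(ci_var)
```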
I assume you are looking for the summary() function.
The code shows the following:
data(mtcars)
reg<-lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
summary(reg)
# Call:
# lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.6923 -0.3901 0.0579 0.3649 1.2608
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.740648 0.738594 1.003 0.32487
# disp 0.002703 0.002715 0.996 0.32832
# hp 0.005275 0.003253 1.621 0.11657
# wt 1.001303 0.302761 3.307 0.00267 **
# am 0.155815 0.375515 0.415 0.68147
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6754 on 27 degrees of freedom
# Multiple R-squared: 0.8527, Adjusted R-squared: 0.8309
# F-statistic: 39.08 on 4 and 27 DF, p-value: 7.369e-11
To select it, you can store the summary as a variable and select the coefficients.
summa<-summary(reg)
summa$coefficients
With that, you can pick the standard error of the coefficient you want and build a confidence interval at the level of interest. To learn how the confidence interval is computed, one can read how it is done here.
R does it automatically using confint(object, parm, level).
In your case, confint(reg, level = 0.95)
I am trying to fit two nested models and then test them against each other using the anova function. The commands used are:
probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male, data=dt,
family=binomial(link = "probit"))
nprobit <- update(probit, . ~ . - afqt1)
anova(nprobit, probit, test="Rao")
However, the variable afqt1 apparently contains NAs, and because the update call does not use the same subset of the data, anova() returns an error:
Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, :
models were not all fitted to the same size of dataset
Is there a simple way to refit the model on the same dataset as the original model?
As suggested in the comments, a straightforward approach to this is to use the model data from the first fit (e.g. probit) and update's ability to overwrite arguments from the original call.
Here's a reproducible example:
data(mtcars)
mtcars[1,2] <- NA
nobs( xa <- lm(mpg~cyl+disp, mtcars) )
## [1] 31
nobs( update(xa, .~.-cyl) ) ##not nested
## [1] 32
nobs( xb <- update(xa, .~.-cyl, data=xa$model) ) ##nested
## [1] 31
It is easy enough to define a convenience wrapper around this:
update_nested <- function(object, formula., ..., evaluate = TRUE){
update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
This forces the data argument of the updated call to re-use the data from the first model fit.
nobs( xc <- update_nested(xa, .~.-cyl) )
## [1] 31
all.equal(xb, xc) ##only the `call` component will be different
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE
So now you can easily do anova:
anova(xa, xc)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 269.97
## 2 29 312.96 -1 -42.988 4.4584 0.04378 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
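The same idea carries over to the glm() probit setting of the question. The sketch below is an illustration on mtcars, not the asker's data: the missing values in `wt` are introduced artificially, and `probit_full` and `probit_red` are hypothetical names.

```r
dat <- mtcars
dat$wt[1:3] <- NA    # artificial missing values in the variable to be dropped

# Full model: rows with missing wt are dropped by the default na.action.
probit_full <- glm(vs ~ mpg + wt, data = dat,
                   family = binomial(link = "probit"))

# Refit without wt on exactly the rows used by the full model.
probit_red <- update(probit_full, . ~ . - wt, data = probit_full$model)

n_full <- nobs(probit_full)
n_red  <- nobs(probit_red)
c(full = n_full, reduced = n_red)

# Both fits now use the same observations, so the comparison is valid.
rao <- anova(probit_red, probit_full, test = "Rao")
print(rao)
```

One caveat: the stored model frame names its columns after the formula terms, so a formula with transformed terms such as log(x) may not refit cleanly from `$model` and could need the original data subset instead.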
The other approach suggested is na.omit on the data frame prior to the lm() call. At first I thought this would be impractical when dealing with a big data frame (e.g. 1000 cols) and with a large number of vars in the various specifications (e.g ~15 vars), but not because of speed. This approach requires manual bookkeeping of which vars should be sanitized of NAs and which shouldn't, and is precisely what the OP seems intent to avoid. The biggest drawback would be that you must always keep in sync the formula with the subsetted data frame.
This however can be overcome rather easily, as it turns out:
data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA
nobs( xa <- lm(mpg~cyl + disp + hp + drat + wt + qsec + vs + am + gear +
carb, mtcars) )
## [1] 21
nobs( xb <- update(xa, .~.-cyl) ) ##not nested
## [1] 22
nobs( xb <- update_nested(xa, .~.-cyl) ) ##nested
## [1] 21
nobs( xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))])) ) ##nested
## [1] 21
all.equal(xb, xc)
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE
anova(xa, xc)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 10 104.08
## 2 11 104.42 -1 -0.34511 0.0332 0.8591