How to run a linear hypothesis test with an interaction term in R?

My regression is as follows:
model <- lm(y ~ a:b + a + b + c)
I want to test whether the coefficients of my interaction "a:b" and my variable "a" are jointly equal to 0, or whether at least one differs from 0.
I know that I need to use linearHypothesis.
But I have only managed to test whether at least one of the interaction coefficients differs from 0:
linearHypothesis(model,matchCoefs(model,":"))
Do you know how to also include my variable "a" in the linearHypothesis call?
Thanks for your help.

You can pass the names of the variables you want to test as being jointly equal to 0.
Dummy data:
a = rnorm(100)
b = rnorm(100)
c = rnorm(100)
y = 4 + 1*a + 3*b + 0.5*c + 2*a*b + rnorm(100)
mod = lm(y ~ a:b + a + b + c)
car::linearHypothesis(mod, c("a", "a:b"))
Linear hypothesis test

Hypothesis:
a = 0
a:b = 0

Model 1: restricted model
Model 2: y ~ a:b + a + b + c

  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     97 657.20
2     95 116.31  2    540.88 220.89 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If only one of a or a:b were present in the data-generating process, the null would still be rejected; it is only when neither is present that we fail to reject. To see that, try setting y = 4 + 3*b + 0.5*c + rnorm(100) and y = 4 + 3*b + 0.5*c + a + rnorm(100). But the test isn't all-powerful: with too much noise in y we would fail to reject the null even though a and a:b are in the model (try y = 4 + 1*a + 3*b + 0.5*c + 2*a*b + rnorm(100, sd=1000)).
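As a base-R cross-check (my own sketch, not part of the answer above): a joint test that a = 0 and a:b = 0 is an exclusion restriction, so the nested-model F-test from anova() reproduces the linearHypothesis result without needing car at all.

```r
## Joint test of a = 0 and a:b = 0 via nested-model comparison:
## fit the full model and the model with both terms excluded,
## then compare with anova(). The F statistic and p-value match
## car::linearHypothesis(full, c("a", "a:b")).
set.seed(1)
a <- rnorm(100); b <- rnorm(100); c <- rnorm(100)
y <- 4 + 1*a + 3*b + 0.5*c + 2*a*b + rnorm(100)
full       <- lm(y ~ a:b + a + b + c)
restricted <- lm(y ~ b + c)        # a and a:b both dropped
anova(restricted, full)            # joint F-test on 2 degrees of freedom
```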


Diff-in-diff using Instrumental Variables: How to implement with ivreg? (interaction between endogenous and exogenous variables)

I'd like to estimate the effect of a treatment on two separate groups, so something of the form

Y = beta_0 + beta_1*M + beta_2*Treat + beta_3*Treat*M + error

Treat being the treatment and M the dummy separating the two groups.
The problem is that the treatment is correlated with other variables that affect Y. Luckily, there exists a variable Z that serves as an instrument for Treat. What I've been able to implement in R was to "manually" run 2SLS, following the stages

Treat = alpha_0 + alpha_1*Z + e

and

Y = beta_0 + beta_1*M + beta_2*Treat_hat + beta_3*Treat_hat*M + u
To provide a reproducible example, first a simulation
n <- 100
set.seed(271)
Z <- runif(n)
e <- rnorm(n, sd = 0.5)
M <- as.integer(runif(n)) # dummy
u <- rnorm(n)
# Treat = 1 + 2*Z + e
alpha_0 <- 1
alpha_1 <- 2
Treat <- alpha_0 + alpha_1*Z + e
# Y = 3 + M + 2*Treat + 3*Treat*M + e + u (omitted vars that determine Treat affect Y)
beta_0 <- 3
beta_1 <- 1
beta_2 <- 2
beta_3 <- 3
Y <- beta_0 + beta_1*M + beta_2*Treat + beta_3 * M*Treat + e + u
The first stage regression
fs <- lm(Treat ~ Z)
stargazer::stargazer(fs, type = "text")
===============================================
                        Dependent variable:    
                    ---------------------------
                               Treat           
-----------------------------------------------
Z                             2.383***         
                              (0.168)          
Constant                      0.835***         
                              (0.096)          
-----------------------------------------------
Observations                    100            
R2                             0.671           
Adjusted R2                    0.668           
Residual Std. Error       0.445 (df = 98)      
F Statistic           200.053*** (df = 1; 98)  
===============================================
And second stage
Treat_hat <- fitted(fs)
ss <- lm(Y ~ M + Treat_hat + M:Treat_hat)
stargazer::stargazer(ss, type = "text")
===============================================
                        Dependent variable:    
                    ---------------------------
                                 Y             
-----------------------------------------------
M                              1.230           
                              (1.717)          
Treat_hat                     2.243***         
                              (0.570)          
M:Treat_hat                   2.636***         
                              (0.808)          
Constant                      2.711**          
                              (1.213)          
-----------------------------------------------
Observations                    100            
R2                             0.727           
Adjusted R2                    0.718           
Residual Std. Error       2.539 (df = 96)      
F Statistic            85.112*** (df = 3; 96)  
===============================================
The problem now is that those standard errors aren't adjusted for the first stage, which looks like quite a bit of work to do manually. As I'd do for any other IV regression, I'd prefer to just use AER::ivreg.
But I can't seem to get the same regression going there. Here are several attempts, none of which quite does the same thing:
AER::ivreg(Y ~ M + Treat + M:Treat | Z)
AER::ivreg(Y ~ M + Treat + M:Treat | M + Z)
Warning message:
In ivreg.fit(X, Y, Z, weights, offset, ...) :
more regressors than instruments
These make sense, I guess
AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)
Call:
AER::ivreg(formula = Y ~ M + Treat + M:Treat | M + Z + M:Z)

Coefficients:
(Intercept)           M       Treat     M:Treat
      2.641       1.450       2.229       2.687
Surprisingly close, but not quite.
I couldn't find a way to tell ivreg that Treat and M:Treat aren't really two separate endogenous variables, but really just the same endogenous variable moved around and interacted with an exogenous one.
In conclusion,
i) Is there some way to mess with ivreg and make this work?
ii) Is there some other function for 2SLS that can just manually accept 1st and 2nd stage formulas without this sort of restriction, and that adjusts standard errors?
iii) What's the simplest way to get the correct SEs if there are no other alternatives? I didn't come across any direct R code, just a bunch of matrix multiplication formulas (although I didn't dig too deep for this one).
Thank you
Essentially, if Z is a valid instrument for Treat, then M:Z should be a valid instrument for M:Treat, so to me this makes sense:
AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)
I actually managed to back out the correct parameter values with a modified simulation:
n <- 100
set.seed(271)
Z <- runif(n)
e <- rnorm(n, sd = 0.5)
M <- round(runif(n)) # note: I changed from as.integer() to round() in order to get some 1's in the regression
u <- rnorm(n)
# Treat = 1 + 2*Z + e
alpha_0 <- 1
alpha_1 <- 2
Treat <- alpha_0 + alpha_1*Z + e
beta_0 <- 3
beta_1 <- 1
beta_2 <- 2
beta_3 <- 3
Y <- beta_0 + beta_1*M + beta_2*Treat + beta_3*M*Treat # note: no error term, so the fit below is exact
Now:
my_ivreg <- AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)
summary(my_ivreg)
Call:
AER::ivreg(formula = Y ~ M + Treat + M:Treat | M + Z + M:Z)

Residuals:
       Min         1Q     Median         3Q        Max
-1.332e-14 -7.105e-15 -3.553e-15 -8.882e-16  3.553e-15

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 3.000e+00  2.728e-15 1.100e+15   <2e-16 ***
M           1.000e+00  3.810e-15 2.625e+14   <2e-16 ***
Treat       2.000e+00  1.255e-15 1.593e+15   <2e-16 ***
M:Treat     3.000e+00  1.792e-15 1.674e+15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.633e-15 on 96 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
Wald test: 1.794e+31 on 3 and 96 DF, p-value: < 2.2e-16
Which is what we were looking for...
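On question (iii), the first-stage-adjusted standard errors can also be computed by hand in a few lines of base R. This is my own sketch of the standard homoskedastic 2SLS formula (not the only possible correction, and not code from the answer above): use the first-stage fitted regressors for the point estimates, but the ORIGINAL regressors when computing the residual variance.

```r
## Manual 2SLS with correct (first-stage-adjusted) standard errors.
## beta = (Xhat'Xhat)^-1 Xhat'Y, Var(beta) = sigma^2 (Xhat'Xhat)^-1,
## where sigma^2 comes from residuals Y - X*beta (original X, not Xhat).
set.seed(271)
n <- 100
Z <- runif(n); e <- rnorm(n, sd = 0.5); M <- round(runif(n)); u <- rnorm(n)
Treat <- 1 + 2*Z + e
Y <- 3 + M + 2*Treat + 3*M*Treat + e + u

X <- cbind(1, M, Treat, M*Treat)   # second-stage design (endogenous columns)
W <- cbind(1, M, Z, M*Z)           # instruments: M:Z instruments M:Treat

Xhat <- W %*% solve(crossprod(W), crossprod(W, X))  # project X onto instruments
beta <- solve(crossprod(Xhat), crossprod(Xhat, Y))  # 2SLS point estimates
res  <- Y - X %*% beta             # residuals with original X, not Xhat
sigma2 <- sum(res^2) / (n - ncol(X))
se <- sqrt(diag(sigma2 * solve(crossprod(Xhat))))   # adjusted SEs
cbind(estimate = beta, se = se)
```

This reproduces what `AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)` reports, and makes explicit where the naive second stage goes wrong: `lm(Y ~ M + Treat_hat + M:Treat_hat)` uses residuals against `Xhat`, which overstates the error variance.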

anova.cca() in the vegan package shows: Error in terms.formula(formula, "Condition", data = data): 'data' argument is of the wrong type

Thanks for looking at this question!
I am running a redundancy analysis in R with the vegan package, using two data sets (one of microbial species, the other of environmental variables, e.g. pH, DOC, ...).
Code:
mod <- rda(species ~ ., data = environment)
summary(mod)
Call:
rda(formula = species ~ Treatment + SR + DOC + NH4 + NO3 + AG + BG + CB +
    XYL + LAP + NAG + PHOS + PHOX + PEOX + pH + AP + MBC + MBN,
    data = environment)

Partitioning of variance:
              Inertia Proportion
Total          644.87     1.0000
Constrained    560.18     0.8687
Unconstrained   84.69     0.1313

Eigenvalues, and their contribution to the variance
Importance of components:
                          RDA1     RDA2     RDA3      RDA4      RDA5      RDA6     RDA7
Eigenvalue            237.4215 126.5881  66.9355  31.62478  21.82210  18.56123  9.04857
Proportion Explained    0.3682   0.1963   0.1038   0.04904   0.03384   0.02878  0.01403
Cumulative Proportion   0.3682   0.5645   0.6683   0.71731   0.75115   0.77993  0.79396
...
And then I run the following code; it works fine:
anova.cca(mod)
Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999

Model: rda(formula = species ~ Treatment + SR + DOC + NH4 + NO3 + AG + BG +
    CB + XYL + LAP + NAG + PHOS + PHOX + PEOX + pH + AP + MBC + MBN,
    data = environment)
         Df Variance      F Pr(>F)    
Model    21   560.18 7.2444  0.001 ***
Residual 23    84.69
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
After the above, I run the following to test the significance of each axis:
anova.cca(mod,by = "axis")
Error in terms.formula(formula, "Condition", data = data) :
'data' argument is of the wrong type
I don't know what happened; it confuses me. I ran the package example:
data(dune)
data(dune.env)
dune.Manure <- rda(dune ~ ., dune.env)
anova.cca(dune.Manure,by='axis')
Permutation test for rda under reduced model
Forward tests for axes
Permutation: free
Number of permutations: 999

Model: rda(formula = dune ~ A1 + Moisture + Management + Use + Manure, data = dune.env)
         Df Variance      F Pr(>F)   
RDA1      1  22.3955 7.4946  0.003 **
RDA2      1  16.2076 5.4239  0.066 . 
RDA3      1   7.0389 2.3556  0.679
RDA4      1   4.0380 1.3513  0.998
....
Residual  7  20.9175
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This works. I compared the example data with mine and found no mistakes (I think). So why does anova.cca(mod, by = "axis") throw this error, and how can I deal with it? Thanks a lot!

Why is drop1 ignoring linear terms for mixed models?

I have six fixed factors: A, B, C, D, E and F, and one random factor R. I want to test linear terms, pure quadratic terms and two-way interactions in R. So I constructed the full linear mixed model and tried to test its terms with drop1:
full.model <- lmer(Z ~ A + B + C + D + E + F
+ I(A^2) + I(B^2) + I(C^2) + I(D^2) + I(E^2) + I(F^2)
+ A:B + A:C + A:D + A:E + A:F
+ B:C + B:D + B:E + B:F
+ C:D + C:E + C:F
+ D:E + D:F
+ E:F
+ (1 | R), data=mydata, REML=FALSE)
drop1(full.model, test="Chisq")
It seems that drop1 is completely ignoring linear terms:
Single term deletions
Model:
Z ~ A + B + C + D + E + F + I(A^2) + I(B^2) + I(C^2) + I(D^2) +
I(E^2) + I(F^2) + A:B + A:C + A:D + A:E + A:F + B:C + B:D +
B:E + B:F + C:D + C:E + C:F + D:E + D:F + E:F + (1 | R)
Df AIC LRT Pr(Chi)
<none> 127177
I(A^2) 1 127610 434.81 < 2.2e-16 ***
I(B^2) 1 127378 203.36 < 2.2e-16 ***
I(C^2) 1 129208 2032.42 < 2.2e-16 ***
I(D^2) 1 127294 119.09 < 2.2e-16 ***
I(E^2) 1 127724 548.84 < 2.2e-16 ***
I(F^2) 1 127197 21.99 2.747e-06 ***
A:B 1 127295 120.24 < 2.2e-16 ***
A:C 1 127177 1.75 0.185467
A:D 1 127240 64.99 7.542e-16 ***
A:E 1 127223 48.30 3.655e-12 ***
A:F 1 127242 66.69 3.171e-16 ***
B:C 1 127180 5.36 0.020621 *
B:D 1 127202 27.12 1.909e-07 ***
B:E 1 127300 125.28 < 2.2e-16 ***
B:F 1 127192 16.60 4.625e-05 ***
C:D 1 127181 5.96 0.014638 *
C:E 1 127298 122.89 < 2.2e-16 ***
C:F 1 127176 0.77 0.380564
D:E 1 127223 47.76 4.813e-12 ***
D:F 1 127182 6.99 0.008191 **
E:F 1 127376 201.26 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If I exclude interactions from the model:
full.model <- lmer(Z ~ A + B + C + D + E + F
+ I(A^2) + I(B^2) + I(C^2) + I(D^2) + I(E^2) + I(F^2)
+ (1 | R), data=mydata, REML=FALSE)
drop1(full.model, test="Chisq")
then the linear terms get tested:
Single term deletions
Model:
Z ~ A + B + C + D + E + F + I(A^2) + I(B^2) + I(C^2) + I(D^2) +
I(E^2) + I(F^2) + (1 | R)
Df AIC LRT Pr(Chi)
<none> 127998
A 1 130130 2133.9 < 2.2e-16 ***
B 1 130177 2181.0 < 2.2e-16 ***
C 1 133464 5467.6 < 2.2e-16 ***
D 1 129484 1487.9 < 2.2e-16 ***
E 1 130571 2575.0 < 2.2e-16 ***
F 1 128009 12.7 0.0003731 ***
I(A^2) 1 128418 422.2 < 2.2e-16 ***
I(B^2) 1 128193 197.4 < 2.2e-16 ***
I(C^2) 1 129971 1975.1 < 2.2e-16 ***
I(D^2) 1 128112 115.6 < 2.2e-16 ***
I(E^2) 1 128529 533.0 < 2.2e-16 ***
I(F^2) 1 128017 21.3 3.838e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Because this is the way drop1 works (it's not specific to mixed models - you would find this behaviour for a regular linear model fitted with lm as well). From ?drop1:
The hierarchy is respected when considering terms to be added or dropped: all main effects contained in a second-order interaction must remain, and so on.
I discuss this at some length in this CrossValidated post
The statistically tricky part is that testing lower-level interactions in a model that also contains higher-level interactions is (depending on who you talk to) either (i) hard to do correctly or (ii) just plain silly (for the latter position, see part 5 of Bill Venables's "exegeses on linear models"). The rubric for this is the principle of marginality. At the very least, the meaning of the lower-order terms depends sensitively on how contrasts in the model are coded (e.g. treatment vs. midpoint/sum-to-zero). My default rule is that if you're not sure you understand exactly why this might be a problem, you shouldn't violate the principle of marginality.
However, as Venables actually describes in the linked article, you can get R to violate marginality if you want (p. 15):
To my delight I see that marginality constraints between factor terms are by default honoured and students are not led down the logically slippery ‘Type III sums of squares’ path. We discuss why it is that no main effects are shown, and it makes a useful tutorial point.
The irony is, of course, that Type III sums of squares were available all along if only people understood what they really were and how to get them. If the call to drop1 contains any formula as the second argument, the sections of the model matrix corresponding to all non-intercept terms are omitted seriatim from the model, giving some sort of test for a main effect ...
Provided you have used a contrast matrix with zero-sum columns they will be unique, and they are none other than the notorious ‘Type III sums of squares’. If you use, say, contr.treatment contrasts, though, so that the columns do not have sum zero, you get nonsense. This sensitivity to something that should in this context be arbitrary ought to be enough to alert anyone to the fact that something silly is being done.
In other words, using scope = . ~ . will force drop1 to ignore marginality. You do this at your own risk - you should definitely be able to explain to yourself what you're actually testing when you follow this procedure ...
For example:
set.seed(101)
dd <- expand.grid(A=1:10,B=1:10,g=factor(1:10))
dd$y <- rnorm(1000)
library(lme4)
m1 <- lmer(y~A*B+(1|g),data=dd)
drop1(m1,scope=.~.)
## Single term deletions
##
## Model:
## y ~ A * B + (1 | g)
## Df AIC
## <none> 2761.9
## A 1 2761.7
## B 1 2762.4
## A:B 1 2763.1
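To see the contrast sensitivity Venables warns about, here is a small lm() illustration of my own (not part of the answer above, and the data are invented): with an interaction in the model, the scope = . ~ . "main effect" test depends on the contrast coding, while the interaction test does not.

```r
## Two fits of the same model, differing only in factor coding.
## Under contr.treatment the "main effect" of f is tied to the baseline
## level of g; under contr.sum it is the Type-III-style average effect.
set.seed(101)
d <- expand.grid(f = factor(c("lo", "hi")), g = factor(c("x", "y")),
                 rep = 1:25)
d$y <- rnorm(nrow(d), mean = (d$f == "hi") * (d$g == "y"))  # pure interaction
m_trt <- lm(y ~ f * g, data = d,
            contrasts = list(f = "contr.treatment", g = "contr.treatment"))
m_sum <- lm(y ~ f * g, data = d,
            contrasts = list(f = "contr.sum", g = "contr.sum"))
drop1(m_trt, scope = . ~ ., test = "F")  # main-effect rows depend on coding
drop1(m_sum, scope = . ~ ., test = "F")  # the "notorious" Type III tests
```

The f:g row is identical in both tables; only the main-effect rows move, which is exactly why Venables calls the treatment-contrast version nonsense.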

nls line of best fit - how to force plotting of line?

I am trying to write a basic function to add some lines of best fit to plots using nls.
This works fine unless the data just happens to be defined exactly by the formula passed to nls. I'm aware of the issues and that this is documented behaviour as reported here.
My question though is how can I get around this and force a line of best fit to be plotted regardless of the data exactly being described by the model? Is there a way to detect the data matches exactly and plot the perfectly fitting curve? My current dodgy solution is:
#test data
x <- 1:10
y <- x^2
plot(x, y, pch=20)
# polynomial line of best fit
f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co <- coef(fit)
curve(f(x, a=co[1], b=co[2], d=co[3]), add = TRUE, col="red", lwd=2)
Which fails with the error:
Error in nls(y ~ f(x, a, b, d), start = c(a = 1, b = 1, d = 1)) :
singular gradient
The easy fix I apply is to jitter the data slightly, but this seems a bit destructive and hackish.
# the above code works after doing...
y <- jitter(x^2)
Is there a better way?
Use Levenberg-Marquardt.
x <- 1:10
y <- x^2
f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=0, d=0))
Error in nls(y ~ f(x, a, b, d), start = c(a = 1, b = 0, d = 0)) :
number of iterations exceeded maximum of 50
library(minpack.lm)
fit <- nlsLM(y ~ f(x,a,b,d), start = c(a=1, b=0, d=0))
summary(fit)
Formula: y ~ f(x, a, b, d)

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a        1          0     Inf   <2e-16 ***
b        0          0      NA       NA
d        0          0      NA       NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0 on 7 degrees of freedom

Number of iterations to convergence: 1
Achieved convergence tolerance: 1.49e-08
Note that I had to adjust the starting values and the result is sensitive to starting values.
fit <- nlsLM(y ~ f(x,a,b,d), start = c(a=1, b=0.1, d=0.1))
Parameters:
    Estimate Std. Error    t value Pr(>|t|)    
a  1.000e+00  2.083e-09  4.800e+08  < 2e-16 ***
b -7.693e-08  1.491e-08 -5.160e+00  0.00131 ** 
d  1.450e-07  1.412e-08  1.027e+01  1.8e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.191e-08 on 7 degrees of freedom

Number of iterations to convergence: 3
Achieved convergence tolerance: 1.49e-08
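One aside worth noting (my own observation, separate from the nlsLM answer): the model in the question, a*x^2 + b*x + d, is linear in its parameters, so for this particular case lm() fits it directly with no iteration and no singular-gradient problem, even when the data lie exactly on the curve.

```r
## Exact-fit data that break nls() are no problem for lm(),
## because a quadratic polynomial is linear in a, b and d.
x <- 1:10
y <- x^2
fit <- lm(y ~ I(x^2) + x)      # a = I(x^2) coef, b = x coef, d = intercept
coef(fit)                      # a = 1, b = 0, d = 0 (up to rounding)
plot(x, y, pch = 20)
lines(x, fitted(fit), col = "red", lwd = 2)
```

Of course this only helps when the formula passed to the plotting function happens to be linear in its parameters; for genuinely nonlinear formulas the nlsLM route above is the way to go.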

R not removing terms when doing MAM

I want to find a minimum adequate model (MAM), but I'm having difficulty removing some terms:
full.model<-glm(SSB_sq~Veg_height+Bare+Common+Birds_Foot+Average_March+Average_April+
Average_May+Average_June15+Average_June20+Average_June25+Average_July15
+Average_July20+Average_July25,family="poisson")
summary(full.model)
I believe I have to remove these terms to start the MAM like so:
model1<-update(full.model,~.-Veg_height:Bare:Common:Birds_Foot:Average_March
:Average_April:Average_May:Average_June15:Average_June20:Average_June25:
Average_July15:Average_July20:Average_July25,family="poisson")
summary(model1)
anova(model1,full.model,test="Chi")
But I get this output:
anova(model1,full.model,test="Chi")
Analysis of Deviance Table
Model 1: SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
Average_April + Average_May + Average_June15 + Average_June20 +
Average_June25 + Average_July15 + Average_July20 + Average_July25
Model 2: SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
Average_April + Average_May + Average_June15 + Average_June20 +
Average_June25 + Average_July15 + Average_July20 + Average_July25
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 213 237.87
2 213 237.87 0 0
I've tried putting plus signs in model1 instead of colons (I was clutching at straws from my notes), but the same thing happens.
Why are both my models the same? I've tried searching Google, but I'm not very good with the terminology, so my searches aren't bringing up much.
Your update call tries to remove a single 13-way interaction term (all your predictors joined by colons), which was never in the model to begin with, so nothing is actually dropped and the two models are identical. If I read your intention correctly, are you trying to fit a null model with no terms in it? If so, a simpler way is just to use SSB_sq ~ 1 as the formula, meaning a model with only an intercept.
fit <- lm(sr ~ ., data = LifeCycleSavings)
fit0 <- lm(sr ~ 1, data = LifeCycleSavings)
## or via an update:
fit01 <- update(fit, . ~ 1)
Which gives, for example:
> anova(fit)
Analysis of Variance Table
Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.118 14.1157 0.0004922 ***
pop75 1 53.34 53.343 3.6889 0.0611255 .
dpi 1 12.40 12.401 0.8576 0.3593551
ddpi 1 63.05 63.054 4.3605 0.0424711 *
Residuals 45 650.71 14.460
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(fit, fit0)
Analysis of Variance Table
Model 1: sr ~ pop15 + pop75 + dpi + ddpi
Model 2: sr ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 650.71
2 49 983.63 -4 -332.92 5.7557 0.0007904 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
An explanation of the formulae I use:
The first model used the shortcut ., which means all remaining variables in argument data (in my model it meant all variables in LifeCycleSavings on the RHS of the formula, except for sr which is already on the LHS).
In the second model (fit0), we only include 1 on the RHS of the formula. In R, 1 means an intercept, so sr ~ 1 means fit an intercept-only model. By default, an intercept is assumed, hence we did not need 1 when specifying the first model fit.
If you want to suppress an intercept, add - 1 or + 0 to your formula.
For your data, the first model would be:
full.model <- glm(SSB_sq ~ ., data = FOO, family = "poisson")
where FOO is the data frame holding your variables - you are using a data frame, aren't you? The null, intercept-only model would be specified using one of:
null.model <- glm(SSB_sq ~ 1, data = FOO, family = "poisson")
or
null.model <- update(full.model, . ~ 1)
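For the model simplification itself (an aside, assuming the usual backward-elimination reading of "minimum adequate model"), base R's drop1() and step() automate what the question is attempting term by term, shown here with the same LifeCycleSavings example as above:

```r
## Backward elimination via drop1()/step() instead of removing
## terms manually with update().
fit <- lm(sr ~ ., data = LifeCycleSavings)
drop1(fit, test = "F")                     # single-term deletion tests
mam <- step(fit, direction = "backward", trace = 0)
formula(mam)                               # terms retained by AIC
```

step() drops terms by AIC rather than by likelihood-ratio p-values, so if your course requires p-value-based deletion, call drop1() repeatedly and remove the least significant term yourself at each round.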
