One-way ANOVA for stratified samples in R

I have a stratified sample with three groups ("a", "b", "c") that were drawn from a larger population N. All groups have 30 observations, but their proportions in N are unequal, hence their sampling weights differ.
I use the survey package in R to calculate summary statistics and linear regression models and would like to know how to calculate a one-way ANOVA correcting for the survey design (if necessary).
My assumption (please correct me if I'm wrong) is that the standard error for the variance should normally be higher for a group whose weight is smaller, so a simple ANOVA that does not account for the survey design would not be reliable.
Here is an example. Any help would be appreciated.
## One-way ANOVA tests in R for surveys with a stratified sampling design
library("survey")
# create test data
test.df <- data.frame(
  id = 1:90,
  variable = c(rnorm(n = 30, mean = 150, sd = 10),
               rnorm(n = 30, mean = 150, sd = 10),
               rnorm(n = 30, mean = 140, sd = 10)),
  groups = c(rep("a", 30),
             rep("b", 30),
             rep("c", 30)),
  weights = c(rep(1, 30),    # oversampled: each observation represents 1 unit
              rep(1, 30),
              rep(100, 30))) # undersampled: each observation represents 100 units
# correct for survey design
test.df.survey <- svydesign(id = ~id,
                            strata = ~groups,
                            weights = ~weights,
                            data = test.df)
## descriptive statistics
# boxplot (outcome on the left-hand side of the formula)
svyboxplot(variable ~ groups, test.df.survey)
# means
svyby(~variable, ~groups, test.df.survey, svymean)
# variances
svyby(~variable, ~groups, test.df.survey, svyvar)
### ANOVA ###
## One-way ANOVA without correcting for survey design
summary(aov(formula = variable ~ groups, data = test.df))

Hmm, this is an interesting question. As far as I know, it is difficult to take weights into account in a one-way ANOVA, so I decided to show you how I would approach this problem.
I'm going to use a two-way ANOVA and then some post hoc tests.
First of all, let's build a linear model based on your data and check what it looks like.
library(car)
library(agricolae)
model.lm = lm(variable ~ groups * weights, data = test.df)
shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.98238, p-value = 0.263
leveneTest(variable ~ groups * factor(weights), data = test.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.6422 0.07692 .
87
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The residuals are close to normal (Shapiro-Wilk p = 0.263). Levene's test is borderline (p = 0.077): homogeneity of variance, which a parametric test such as ANOVA assumes, is not rejected at the 5% level but would be at the 10% level. Let's perform the test anyway.
Several plots to check that the data fit this test:
hist(resid(model.lm))
plot(model.lm)
The diagnostic plots don't look bad, actually.
Let's run the two-way ANOVA:
anova(model.lm)
Analysis of Variance Table
Response: variable
Df Sum Sq Mean Sq F value Pr(>F)
groups 2 2267.8 1133.88 9.9566 0.0001277 ***
Residuals 87 9907.8 113.88
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
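As you can see, the results are very close to yours. Note that the table contains only a groups term: weights is constant within each group, so the weights and interaction terms are aliased with groups and dropped. Now a post hoc test: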
(result.hsd = HSD.test(model.lm, list('groups', 'weights')))
$statistics
MSerror Df Mean CV MSD
113.8831 87 147.8164 7.2195 6.570186
$parameters
test name.t ntr StudentizedRange alpha
Tukey groups:weights 3 3.372163 0.05
$means
variable std r Min Max Q25 Q50 Q75
a:1 150.8601 11.571185 30 113.3240 173.0429 145.2710 151.9689 157.8051
b:1 151.8486 8.330029 30 137.1907 176.9833 147.8404 150.3161 154.7321
c:100 140.7404 11.762979 30 118.0823 163.9753 131.6112 141.1810 147.8231
$comparison
NULL
$groups
variable groups
b:1 151.8486 a
a:1 150.8601 a
c:100 140.7404 b
attr(,"class")
[1] "group"
And another way:
aov_cont<- aov(test.df$variable ~ test.df$groups * test.df$weights)
summary(aov_cont)
Df Sum Sq Mean Sq F value Pr(>F)
test.df$groups 2 2268 1133.9 9.957 0.000128 ***
Residuals 87 9908 113.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(TukeyHSD(aov_cont))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = test.df$variable ~ test.df$groups * test.df$weights)
$`test.df$groups`
diff lwr upr p adj
b-a 0.9884608 -5.581725 7.558647 0.9315792
c-a -10.1197048 -16.689891 -3.549519 0.0011934
c-b -11.1081657 -17.678352 -4.537980 0.0003461
In summary, the results are very close to yours. Personally, I would run a two-way ANOVA with the * operator, or with + when you are sure that your variables do not interact (an additive model).
Group c, which has the larger weight, differs substantially from groups a and b.

According to the head statistician of our institute, there is no easy implementation of this kind of analysis in any common modeling environment. The reason is that ANOVA and ANCOVA are linear models that were not developed further after the emergence of general linear models (and later generalized linear models, GLMs) in the 1970s.
A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA... simply use a GLM instead.
summary(svyglm(variable~groups,test.df.survey))
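To get a single ANOVA-style test of the groups factor from the design-based model, one option is a Wald test on the fitted svyglm object. The following is a minimal sketch, assuming the survey package is installed; it uses regTermTest() from that package, and the seed and the object names design, fit and wald are my own choices for illustration, not part of the question.

```r
library(survey)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
test.df <- data.frame(
  id       = 1:90,
  variable = c(rnorm(30, 150, 10), rnorm(30, 150, 10), rnorm(30, 140, 10)),
  groups   = rep(c("a", "b", "c"), each = 30),
  weights  = rep(c(1, 1, 100), each = 30)
)

# Same stratified design as in the question
design <- svydesign(id = ~id, strata = ~groups, weights = ~weights,
                    data = test.df)

# Design-based linear model, then a Wald test that all group effects are zero
fit  <- svyglm(variable ~ groups, design = design)
wald <- regTermTest(fit, ~groups)
print(wald)
wald$p  # p-value of the overall test for the groups term
```

In this particular layout the weights are constant within each group, so the estimated group means should coincide with those from an unweighted lm(); what the design changes is the standard errors, and hence the test.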

Related

R: Calculating ANOVA Sum sqr for a model with interacting numerical and categorical variables

I need to know how the Sum Sq column of the anova() function in R is calculated for a linear model of the form:
modelXg <-lm(Y ~ X * group, data)
(which is equivalent to lm(Y~ X+group+X:group, data=dat) )
where: "X" is a numerical variable, and "group" is a categorical one.
The function anova(modelXg) returns a table like:
Analysis of Variance Table
Response: TMIN
Df Sum Sq Mean Sq F value Pr(>F)
X 1 6476 6476.1 282.9208 < 2.2e-16 ***
group 1 1176 1176.4 51.3956 7.666e-13 ***
X:group 1 64 64.2 2.8058 0.09393 .
Residuals 45130 1033029 22.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What I need is to know how to calculate all the terms of the Sum Sq column, described in a way as easy and reproducible as possible, because I need to implement it in C#.
I have already searched a lot across the Net, but I didn't find this exact case. I found some useful info in "Interpretation of Sum Sq in ANOVA with numeric independent variable", but it is incomplete for this case because the model there does not involve the interaction between the two variables.
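For an lm fit, anova() reports sequential (Type I) sums of squares: each row is the drop in the residual sum of squares when that term is added to the model containing the terms above it. A minimal sketch of that calculation follows, on simulated data; the data, the seed and the helper rss() are assumptions made up for illustration, not taken from the question.

```r
set.seed(42)  # assumption: simulated stand-in for the asker's data
n   <- 200
dat <- data.frame(
  X     = rnorm(n),
  group = factor(sample(c("g1", "g2"), n, replace = TRUE))
)
dat$Y <- 1 + 2 * dat$X + 0.5 * (dat$group == "g2") +
  0.3 * dat$X * (dat$group == "g2") + rnorm(n)

# Residual sum of squares of a least-squares fit (hypothetical helper)
rss <- function(formula) sum(resid(lm(formula, data = dat))^2)

rss0 <- rss(Y ~ 1)          # intercept only
rss1 <- rss(Y ~ X)          # add X
rss2 <- rss(Y ~ X + group)  # add group
rss3 <- rss(Y ~ X * group)  # add the X:group interaction

by_hand <- c(X = rss0 - rss1, group = rss1 - rss2,
             "X:group" = rss2 - rss3, Residuals = rss3)

tab <- anova(lm(Y ~ X * group, data = dat))
cbind(by_hand, anova = tab[["Sum Sq"]])
```

Each of the four fits is an ordinary least-squares problem, so a port to another language only needs a least-squares solver and the residuals of each nested model.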

R code to test the difference between coefficients of regressors from one panel regression

I am trying to compare two regression coefficients from the same panel regression, estimated over two different time periods, in order to test whether their difference is statistically significant. Running my panel regression first with observations over 2007-2009, I get an estimate of the coefficient I am interested in, which I want to compare with the estimate of the same coefficient obtained from the same panel model applied over the period 2010-2017.
Based on "R code to test the difference between coefficients of regressors from one regression", I tried to compute a likelihood ratio test. In the linked discussion, they use a simple linear equation. If I use the same commands in R as described in the answer, I get results based on a chi-squared distribution, and I don't understand whether and how I can interpret them.
In R, I did the following:
linearHypothesis(reg.pannel.recession.fe, "Exp_Fri=0.311576")
where reg.pannel.recession.fe is the panel regression over the period 2007-2009, Exp_Fri is the coefficient of this regression I want to compare, 0.311576 is the estimated coefficient over the period 2010-2017.
I get the following results using linearHypothesis():
How can I interpret that? Should I use another function, since these are plm objects?
Thank you very much for your help.
You get an F test in that example because, as stated in the vignette:
The method for "lm" objects calls the default method, but it changes the default test to "F" [...]
You can also set the test to F, but basically linearHypothesis works whenever the standard error of the coefficient can be estimated from the variance-covariance matrix, as also said in the vignette:
The default method will work with any model
object for which the coefficient vector can be retrieved by ‘coef’
and the coefficient-covariance matrix by ‘vcov’ (otherwise the
argument ‘vcov.’ has to be set explicitly)
So using an example from the package:
library(plm)
data(Grunfeld)
wi <- plm(inv ~ value + capital,
data = Grunfeld, model = "within", effect = "twoways")
linearHypothesis(wi,"capital=0.3",test="F")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
Res.Df Df F Pr(>F)
1 170
2 169 1 6.4986 0.01169 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
linearHypothesis(wi,"capital=0.3")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
Res.Df Df Chisq Pr(>Chisq)
1 170
2 169 1 6.4986 0.0108 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can also carry out the t-test by hand:
tested_value = 0.3
BETA = coefficients(wi)["capital"]
SE = coefficients(summary(wi))["capital",2]
tstat = (BETA- tested_value)/SE
# two-sided p-value; abs() keeps it valid when the estimate is below the tested value
pvalue = as.numeric(2*pt(-abs(tstat),wi$df.residual))
pvalue
[1] 0.01168515
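The hand-computed p-value matches the F test above because, for a single linear restriction, the F statistic is the square of the t statistic; the chi-squared version uses the same statistic with an asymptotic reference distribution. A minimal sketch of that equivalence in base R, using an ordinary lm on the built-in mtcars data rather than a plm model (the tested value b0 = -3 is an arbitrary assumption for illustration):

```r
full <- lm(mpg ~ hp + wt, data = mtcars)

b0    <- -3  # hypothetical value to test the wt coefficient against
est   <- coef(full)["wt"]
se    <- coef(summary(full))["wt", "Std. Error"]
tstat <- unname((est - b0) / se)
p_t   <- 2 * pt(-abs(tstat), df.residual(full))

# Restricted model: the wt coefficient is fixed at b0 through an offset
restricted <- lm(mpg ~ hp + offset(b0 * wt), data = mtcars)
cmp        <- anova(restricted, full)

c(t_squared = tstat^2, F = cmp$F[2])      # same statistic
c(p_t = p_t, p_F = cmp[["Pr(>F)"]][2])    # same p-value

# Chi-squared version: same statistic, asymptotic reference distribution
p_chisq <- pchisq(tstat^2, df = 1, lower.tail = FALSE)
```

The chi-squared p-value is always slightly smaller than the F-based one in finite samples, which is the pattern visible in the two linearHypothesis() outputs above.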

get p-values from post-hoc duncan test in r

I want to perform a post-hoc duncan test (use "agricolae" package in r) after running one-way anova comparing the means of 3 groups.
## run one-way anova
> t1 <- aov(q3a ~ pgy,data = pgy)
> summary(t1)
Df Sum Sq Mean Sq F value Pr(>F)
pgy 2 13 6.602 5.613 0.00367 **
Residuals 6305 7416 1.176
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1541 observations deleted due to missingness
## run post-hoc duncan test
> duncan.test(t1,"pgy",group = T, console = T)
Study: t1 ~ "pgy"
Duncan's new multiple range test
for q3a
Mean Square Error: 1.176209
pgy, means
q3a std r Min Max
PGY1 1.604292 1.068133 2656 1 5
PGY2 1.711453 1.126446 2017 1 5
PGY3 1.656269 1.057937 1635 1 5
Groups according to probability of means differences and alpha level( 0.05 )
Means with the same letter are not significantly different.
q3a groups
PGY2 1.711453 a
PGY3 1.656269 ab
PGY1 1.604292 b
However, the output only tells me that the means of PGY1 and PGY2 are different, without p-values for each group comparison (post-hoc pairwise t tests would generate p-values for each group comparison).
How can I get p value from a duncan test?
Thanks!!
One solution would be to use PostHocTest from the DescTools package.
Here is an example using the warpbreaks sample data.
require(DescTools);
res <- aov(breaks ~ tension, data = warpbreaks);
PostHocTest(res, method = "duncan");
#
# Posthoc multiple comparisons of means : Duncan's new multiple range test
# 95% family-wise confidence level
#
#$tension
# diff lwr.ci upr.ci pval
#M-L -10.000000 -17.95042 -2.049581 0.01472 *
#H-L -14.722222 -23.08443 -6.360012 0.00072 ***
#H-M -4.722222 -12.67264 3.228197 0.23861
The pairwise differences between the means for every group are given in the first column (e.g. M-L, and so on), along with confidence intervals and p-values.
For example, the difference in the mean breaks between H and M is not statistically significant.
If performing Duncan's test is not a critical requirement, you can also run pairwise.t.test with various other multiple comparison corrections. For example, using Bonferroni's method:
with(warpbreaks, pairwise.t.test(breaks, tension, p.adj = "bonferroni"));
#
# Pairwise comparisons using t tests with pooled SD
#
#data: breaks and tension
#
# L M
#M 0.0442 -
#H 0.0015 0.7158
#
#P value adjustment method: bonferroni
Results are consistent with those from the post-hoc Duncan's test.

Interpreting output of analysis of deviance table from anova() model comparison

I have a large multivariate abundance dataset and I am interested in comparing multiple models that fit different combinations of three categorical predictor variables to my species matrix response variable. I have been using anova() to compare my different models, but I am having difficulty interpreting the output. Below, I have given my code as well as the corresponding R output.
invert.mvabund <- mvabund(mva.dat)
null<-manyglm(mva.dat~1, family='negative.binomial')
m1 <- manyglm(mva.dat~Habitat+Detritus, family='negative.binomial')
m2 <- manyglm(mva.dat~Habitat*Detritus, family='negative.binomial')
m3 <- manyglm(mva.dat~Habitat*Detritus+Block, family='negative.binomial')
anova(null,m1,m2,m3)
Analysis of Deviance Table
null: mva.dat ~ 1
m1: mva.dat ~ Habitat + Detritus
m2: mva.dat ~ Habitat * Detritus
m3: mva.dat ~ Habitat * Detritus + Block
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
null 99
m1 94 5 257.2 0.001 ***
m2 90 4 87.7 0.003 **
m3 81 9 173.5 0.003 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
How do I interpret these results? Is m2 the best-fitting model because it has the lowest deviance, even though it has a higher p-value than m1? Is this because the p-value is suggesting that there is a significant level of deviance, so the optimal model will have a higher p-value? Any suggestions on how to interpret these results would be much appreciated- I haven't been able to find a clear answer in my Google searches. Thanks!

anova.rq() in quantreg package in R

I'm interested in comparing estimates from different quantiles (same outcome, same covariates) using the anova.rqlist function called by anova in the quantreg package in R. However, the math in the function is beyond my rudimentary expertise. Let's say I fit 3 models at different quantiles:
library(quantreg)
data(Mammals) # data in quantreg to be used as a useful example
fit1 <- rq(weight ~ speed + hoppers + specials, tau = .25, data = Mammals)
fit2 <- rq(weight ~ speed + hoppers + specials, tau = .5, data = Mammals)
fit3 <- rq(weight ~ speed + hoppers + specials, tau = .75, data = Mammals)
Then I compare them using:
anova(fit1, fit2, fit3, test="Wald", joint=FALSE)
My question is: which of these models is being used as the basis of the comparison?
My understanding of the Wald test (wiki entry) is
W = (θ^ − θ0)² / var(θ^)
where θ^ is the estimate of the parameter(s) of interest θ that is compared with the proposed value θ0.
So my question is: what is the anova function in quantreg choosing as θ0?
Based on the p-value returned from the anova, my best guess is that it is choosing the lowest quantile specified (i.e. tau = 0.25). Is there a way to specify the median (tau = 0.5) or, better yet, the mean estimate obtained using lm(y ~ x1 + x2 + x3, data)?
anova(fit1, fit2, fit3, joint=FALSE)
actually produces
Quantile Regression Analysis of Deviance Table
Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.25 0.5 0.75 }
Df Resid Df F value Pr(>F)
speed 2 319 1.0379 0.35539
hoppersTRUE 2 319 4.4161 0.01283 *
specialsTRUE 2 319 1.7290 0.17911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
while
anova(fit3, fit1, fit2, joint=FALSE)
produces the exact same result
Quantile Regression Analysis of Deviance Table
Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.5 0.25 0.75 }
Df Resid Df F value Pr(>F)
speed 2 319 1.0379 0.35539
hoppersTRUE 2 319 4.4161 0.01283 *
specialsTRUE 2 319 1.7290 0.17911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The order of the models is clearly being changed in the anova, but how is it that the F value and Pr(>F) are identical in both tests?
All the quantiles you input are used and there is not one model used as a reference.
I suggest you read this post and the related answer to understand what your "theta.0" is.
I believe what you are trying to do is to test whether the regression lines are parallel. In other words whether the effects of the predictor variables (only income here) are uniform across quantiles.
You can use the anova() from the quantreg package to answer this question. You should indeed use several fits for each quantile.
When you use joint=FALSE as you did, you get coefficient-wise comparisons. But you only have one coefficient so there is only one line! And your results tell you that the effect of income is not uniform across quantiles in your example. Use several predictor variables and you will get several p-values.
You can do an overall test of equality of the entire sets of coefficients if you do not use joint=FALSE and that would give you a "Joint Test of Equality of Slopes" and therefore only one p-value.
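Why the order of the models does not matter can be seen from the general form of a Wald test of equality: the statistic depends only on the set of contrasts being tested, not on which estimate is treated as the "reference". The sketch below is not the code of anova.rq(); it is a generic illustration in base R with made-up slope estimates and a made-up covariance matrix, and the helper wald() is hypothetical.

```r
set.seed(123)

# Hypothetical slope estimates at tau = 0.25, 0.5, 0.75 (illustrative numbers)
b <- c(0.8, 1.1, 1.5)

# A random positive-definite matrix standing in for their joint covariance
A <- matrix(rnorm(9), 3, 3)
V <- crossprod(A) / 10

# Wald statistic for H0: R b = 0
wald <- function(R) {
  d <- R %*% b
  drop(t(d) %*% solve(R %*% V %*% t(R)) %*% d)
}

# Equality of the three slopes written as adjacent differences ...
R_adjacent  <- rbind(c(1, -1, 0),
                     c(0, 1, -1))
# ... or as differences from the first slope taken as the "reference"
R_reference <- rbind(c(-1, 1, 0),
                     c(-1, 0, 1))

c(adjacent = wald(R_adjacent), reference = wald(R_reference))
```

Both contrast matrices span the same hypothesis (all three slopes equal), so the two statistics agree, which is consistent with getting identical F values after reordering the fits.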
EDIT:
I think theta.0 is the average slope for all 'tau' values or the actual estimate from 'lm()', rather than a specific slope of any of the models. My reasoning is that 'anova.rq()' does not require any specific low value of 'tau' or even the median 'tau'.
There are several ways to test this. Either do the calculations by hand with theta.0 equal to the average value, or compare many combinations, because then you could have a situation where some of your models are close to the model with a low 'tau' value but not to the 'lm()' value. So if theta.0 is the slope of the first model with the lowest 'tau', then your Pr(>F) will be high, whereas in the other case it will be low.
This question should maybe have been asked on Cross Validated.
