Within and between factors in regression models in R

I'm trying to run an rmANOVA and a corresponding regression model. In the experiment, participants completed a questionnaire measuring how much of a trait X they have (SCORE). They then performed a task in which each participant was exposed to three conditions (COND: nSCM, SCM, SC) while their brain responses were measured (ERP).
This is what the data look like:
> head(df)
code SEX AGE SCORE COND ERP
1 AA1407 male 29 14 nSCM -3.0348373
2 AN0312 male 26 13 nSCM -1.8799240
3 BR1410 male 23 30 nSCM 0.4284033
4 EZ2404 male 23 23 nSCM -0.7615117
5 HA1012 female 27 22 nSCM -2.9301698
6 HS3004 male 30 16 nSCM -0.5468492
Since I am a bit confused about how to use different types of variables in R, maybe someone could also reassure me about the following:
> sapply(df,class)
code SEX AGE SCORE COND ERP
"factor" "factor" "numeric" "numeric" "factor" "numeric"
Based on the experimental design, the ANOVA design has one between-subject IV: SCORE, one within-subject IV: COND and the DV is ERP (right?).
This is the model I used and the summary:
> anERP <- aov(ERP ~ COND*SCORE, data = df)
> summary(anERP)
Df Sum Sq Mean Sq F value Pr(>F)
COND 2 0.21 0.105 0.027 0.9736
SCORE 1 16.87 16.868 4.297 0.0419 *
COND:SCORE 2 0.58 0.289 0.074 0.9291
Residuals 69 270.85 3.925
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, IF this is right (please let me know if anything doesn't seem right), I should also find an effect for SCORE when I build a regression model, right? Also, I'm not sure how to interpret this effect, since AQ is an interval variable (scores in range 6-35). I would appreciate a little help here.
Now I'm very confused about what this model should look like for regression. I started with a simple lm model with SCORE and COND as fixed effects:
> lmERP <- lm(ERP ~ SCORE*COND, data = df)
> summary(lmERP)
Call:
lm(formula = ERP ~ SCORE * COND, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.2554 -1.0916 0.1975 1.4582 3.3097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.04269 1.06193 -2.865 0.00552 **
SCORE 0.06458 0.05229 1.235 0.22108
CONDSCM -0.08141 1.50180 -0.054 0.95693
CONDnSCM 0.36646 1.50180 0.244 0.80795
SCORE:CONDSCM 0.01111 0.07396 0.150 0.88104
SCORE:CONDnSCM -0.01707 0.07396 -0.231 0.81814
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.981 on 69 degrees of freedom
Multiple R-squared: 0.0612, Adjusted R-squared: -0.006827
F-statistic: 0.8997 on 5 and 69 DF, p-value: 0.4864
However, here the main effect of SCORE doesn't reach significance. How is that possible? Shouldn't rmANOVA and regression show roughly similar results (or at least the same main effects)?
I guess I'm not applying the right linear model here, since it doesn't seem to recognise there are both within and between subject factors in the design.
I have read hundreds of webpages, tutorials and forums and I'm still completely confused about these models. Thank you in advance for any piece of advice!

Repeated-measures or mixed-model designs can be very confusing to specify using R's base aov function. In the code you have written, for example, aov will treat all the specified factors as independent (i.e., between-subject). I highly recommend using a library that makes it easier to specify these types of designs.
The ez library contains ezANOVA, which makes these tests simple to perform, provided that all your cases are complete (all factors are fully crossed, with no missing data). Assuming that your code column uniquely identifies each subject and you wanted to include all factors from your data set, the test would look something like this:
library(ez)
my.aov <- ezANOVA(data = df, dv = ERP, wid = code, between = .(SEX, AGE, SCORE), within = COND)
It is also possible to implement these designs with the lme4 package (the ez library's ezMixed function is a wrapper around lme4, whereas ezANOVA itself builds on aov and the car package's Anova). While lme4 allows for more flexible model specifications and can tolerate incomplete data, its syntax is more difficult. Bodo Winter's tutorial on lme4 is a good start, if you want to go really deep.
As an aside, there is usually little point in performing both an ANOVA and a linear regression. Unless the two tests are specified in a way that treats the factors differently, the results will be equivalent.
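A minimal base-R sketch of both points, using simulated data (the data frame sim, its effect sizes, and the subject labels are assumptions for illustration, not the asker's data). It first shows that the ANOVA table of the lm fit reproduces the aov table, so the apparent disagreement in the question comes from comparing an F test for SCORE with the t test of a single coefficient (the SCORE slope in the reference level of COND). It then shows one way to declare COND as within-subject in aov via an Error() term.

```r
set.seed(42)

# Simulated stand-in for the question's data: 25 subjects x 3 conditions
n_subj <- 25
score <- sample(6:35, n_subj, replace = TRUE)   # one SCORE per subject
subj_effect <- rnorm(n_subj, sd = 1.5)          # subject-level random shift

sim <- expand.grid(code = factor(sprintf("S%02d", 1:n_subj)),
                   COND = factor(c("SC", "SCM", "nSCM")))
idx <- as.integer(sim$code)                     # subject index for each row
sim$SCORE <- score[idx]
sim$ERP <- -3 + 0.07 * sim$SCORE + subj_effect[idx] + rnorm(nrow(sim))

# 1) Same formula, same model: aov() and lm() give the same ANOVA table
fit_aov <- aov(ERP ~ COND * SCORE, data = sim)
fit_lm  <- lm(ERP ~ COND * SCORE, data = sim)
tab_aov <- summary(fit_aov)[[1]]
tab_lm  <- anova(fit_lm)
print(tab_aov)
print(tab_lm)

# 2) Repeated-measures specification: COND varies within subject (code),
#    SCORE varies between subjects
fit_rm <- aov(ERP ~ COND * SCORE + Error(code / COND), data = sim)
print(summary(fit_rm))
```

In the Error() version, SCORE is tested in the between-subject stratum and COND and COND:SCORE in the within-subject stratum, which is the distinction the plain aov and lm calls ignore.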

Related

Multivariable regression interaction term with categorical variables

I am kind of new to R and am working on glm model and wanted to look for the interaction effect of BMI groups and patient groups (4 groups) on mortality (binary) in subgroup analysis. I have the following codes:
model <- glm(death~patient.group*bmi.group, data = data, family = "binomial")
summary(model)
and I get the following:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4798903 0.0361911 -96.153 < 2e-16 ***
patient.group2 0.0067614 0.0507124 0.133 0.894
patient.group3 0.0142658 0.0503444 0.283 0.777
patient.group4 0.0212416 0.0497523 0.427 0.669
bmi.group2 0.1009282 0.0478828 2.108 0.035 *
bmi.group3 0.2397047 0.0552043 4.342 1.41e-05 ***
patient.group2:bmi.group2 -0.0488768 0.0676473 -0.723 0.470
patient.group3:bmi.group2 -0.0461319 0.0672853 -0.686 0.493
patient.group4:bmi.group2 -0.1014986 0.0672675 -1.509 0.131
patient.group2:bmi.group3 -0.0806240 0.0791977 -1.018 0.309
patient.group3:bmi.group3 -0.0008951 0.0785683 -0.011 0.991
patient.group4:bmi.group3 -0.0546519 0.0795683 -0.687 0.492
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So as displayed I will have a p-value for each of the patient.group:bmi.group. My question is, is there a way I can get a single p-value for patient.group:bmi.group instead of one for each subgroup? I have tried to look for answers online but I still could not find the answer :(
Many thanks in advance.
It depends on whether you regard your patient and BMI groups as factors or continuous covariates. If they are covariates, @jay.sf's suggestion is appropriate. It fits a single degree of freedom term for the interaction between the linear effect of patient group and the linear effect of BMI group.
But this depends on both the ordering and definition of the groups. It assumes, for example, that the "difference" between patient groups 1 and 2 is the same as that between patient groups 2 and 3 and so on. Is the ordering of patient groups such that, in some way, group 1 < group 2 < group 3 < group 4? Similarly for BMI. This model would also assume that a change of 1 unit on the patient scale was "the same" as a change of one unit on the BMI scale. I don't know if these are reasonable assumptions.
It would be more usual to consider both patient group and BMI group as factors. This assumes no ordering in groups, nor that the difference between any two groups was equal to that between any other two. In this case, jay.sf's suggestion would give a misleading answer.
To illustrate my point...
First, generate some artificial data as you haven't provided any:
library(tidyverse)
data <- tibble() %>%
expand(patient.group=1:4, bmi.group=1:3, rep=1:5) %>%
mutate(
z=-0.25*patient.group + 0.75*bmi.group,
# inverse logit: P(death) = exp(z) / (1 + exp(z))
death=rbernoulli(nrow(.), exp(z)/(1+exp(z)))
) %>%
select(-z)
Fit a simple continuous covariate model with interaction, as per jay.sf's suggestion:
covariateModel <- glm(death~patient.group * bmi.group, data = data, family = "binomial")
summary(covariateModel)
Giving, in part
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6962 1.8207 -1.481 0.139
patient.group 0.7407 0.6472 1.144 0.252
bmi.group 1.2697 0.8340 1.523 0.128
patient.group:bmi.group -0.3807 0.2984 -1.276 0.202
Here, the p value for the patient.group:bmi.group interaction is a Wald test based on a single degree of freedom z test.
A slightly more complicated approach is necessary to fit the factor model with interaction and obtain a test for the "overall" interaction effect.
mainEffectModel <- glm(death~as.factor(patient.group) + as.factor(bmi.group), data = data, family = "binomial")
interactionModel <- glm(death~as.factor(patient.group) * as.factor(bmi.group), data = data, family = "binomial")
anova(mainEffectModel, interactionModel, test="Chisq")
Giving
Analysis of Deviance Table
Model 1: death ~ as.factor(patient.group) + as.factor(bmi.group)
Model 2: death ~ as.factor(patient.group) * as.factor(bmi.group)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 54 81.159
2 48 70.579 6 10.58 0.1023
Here, the change in deviance is a likelihood-ratio test statistic and is approximately distributed as chi-squared on (4-1) x (3-1) = 6 degrees of freedom.
The two approaches give similar answers using my particular dataset, but they may not always do so. Both are statistically correct, but which one is most appropriate depends on your particular situation. We don't have enough information to comment.
This excellent post provides more context.
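As a compact variant of the factor-model comparison above, the sketch below uses drop1() on the interaction model, which returns one likelihood-ratio test for the whole interaction term. The data are simulated here (the sample size, effect sizes, and the name dat are assumptions), and the groups are stored as factors from the start.

```r
set.seed(1)

# Simulated data: 4 patient groups x 3 BMI groups, 50 patients per cell
dat <- expand.grid(patient.group = factor(1:4),
                   bmi.group = factor(1:3),
                   rep = 1:50)
z <- -1 - 0.25 * as.integer(dat$patient.group) + 0.75 * as.integer(dat$bmi.group)
dat$death <- rbinom(nrow(dat), size = 1, prob = plogis(z))

full <- glm(death ~ patient.group * bmi.group, data = dat, family = binomial)

# drop1() respects marginality, so only the interaction is dropped:
# one row, one p-value, on (4-1) x (3-1) = 6 degrees of freedom
lrt <- drop1(full, test = "Chisq")
print(lrt)

# The same test written as an explicit comparison of nested models
main <- update(full, . ~ . - patient.group:bmi.group)
cmp <- anova(main, full, test = "Chisq")
print(cmp)
```

Both calls report the same change in deviance; drop1() just saves fitting the main-effects model by hand.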

How to get value of group = 0 in linear mixed model

I have a very simple stat question probably.
So, I am fitting linear mixed models like this:
lme(dependent ~ Group + Sex + Age + npgs, data=boookclub, random = ~ 1| subject)
Group is a factor variable with levels = 0, 1 , 2 , 3
The dependent are continuous variables standardized (mean 0) and the others are covariates with sex being factor, with Male/Female levels, Age being numerical, and npgs being numerical continuous standardized as well.
When I get the table with beta, standard error, t and p values, I get this:
Value Std.Error DF t-value p-value
(Intercept) -0.04550502 0.02933385 187 -1.551280 0.0025
Group1 0.04219801 0.03536929 181 1.193069 0.2344
Group2 0.03350827 0.03705896 181 0.904188 0.3671
Group3 0.00192119 0.03012654 181 0.063771 0.9492
SexMale 0.03866387 0.05012901 181 0.771287 0.4415
Age -0.00011675 0.00148684 181 -0.078520 0.9375
npgs 0.15308844 0.01637163 181 9.350835 0.0000
SexMale:Age 0.00492966 0.00276117 181 1.785352 0.0759
My problem is: how do I get the beta of Group0? In this case the intercept is Group0 but also the average of npgs, being npgs standardized. How do I get the Beta of Group0? And how can I check if Group0 is significantly associated to the dependent? I'd like to see the effect of all Group levels.
Thanks
The easiest way to do what you want may be with the emmeans package, but you may also have some conceptual issues. Technical details first, then conceptual:
Technical
Fitting an example (this isn't necessarily statistically sensible, but I wanted an example with a categorical fixed effect)
library(nlme)
m1 <- lme(Yield~Variety, random = ~1|Block, data=Alfalfa)
As with your example, the effects are "intercept" (= mean of the baseline group, which is the "Cossack" variety in this case [by default, the alphabetically-first group]), "Ladak" (difference between Ladak and Cossack means) and "Ranger" (similarly). (As @Ben hints in the comments above, R automatically generates dummies for [most of] the levels of the categorical variables [factors] in your model.)
coef(summary(m1))
## Value Std.Error DF t-value p-value
## (Intercept) 1.57166667 0.11665326 64 13.4729767 2.373343e-20
## VarietyLadak 0.09458333 0.07900687 64 1.1971532 2.356624e-01
## VarietyRanger -0.01916667 0.07900687 64 -0.2425949 8.090950e-01
The emmeans package is a convenient way to see predicted values for each group without recoding.
library(emmeans)
emmeans(m1, spec = ~Variety)
## Variety emmean SE df lower.CL upper.CL
## Cossack 1.57 0.117 5 1.27 1.87
## Ladak 1.67 0.117 5 1.37 1.97
## Ranger 1.55 0.117 5 1.25 1.85
Conceptual
You can't "check if Group0 is significantly associated with the dependent [response] variable". You can only check whether the response variable differs significantly between two groups, or whether it differs significantly among all groups (e.g. the results of anova()). You have to pick a baseline. (If you insist, you can test all pairwise comparisons among groups; emmeans can help with this too.) If you "remove the intercept" (by fitting Yield ~ Variety - 1, or by looking at the results that emmeans produces) then the difference you are quantifying is the difference between the mean of a particular group and zero. This is usually not a meaningful question; in the example here, for instance, this would be testing whether an alfalfa variety gave a yield that was significantly different from zero, which is probably not very interesting.
On the other hand, if you are just interested in estimating the expected value in each group (conditioning on the baseline values of the other variables in the model), along with the standard errors/CIs, then the answers you get from emmeans are perfectly sensible.
There's a related question here that explains why you get an NA value if you manually create dummies for every level of your factor ...
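For readers without emmeans, here is a rough sketch that uses only nlme and the same Alfalfa example. It recovers the per-group means in two ways (adding the coefficients to the intercept, and refitting without an intercept) and runs the overall test of whether the groups differ. The object names m0 and group_means are introduced here for illustration.

```r
library(nlme)

# Default (treatment) coding: intercept = baseline mean, others = differences
m1 <- lme(Yield ~ Variety, random = ~ 1 | Block, data = Alfalfa)

# Group means rebuilt by hand: baseline, baseline + each difference
group_means <- fixef(m1)[1] + c(0, fixef(m1)[2:3])
names(group_means) <- levels(Alfalfa$Variety)
print(group_means)

# Same means from a fit without an intercept; the t-tests in this
# parameterisation compare each group mean with zero
m0 <- lme(Yield ~ Variety - 1, random = ~ 1 | Block, data = Alfalfa)
print(fixef(m0))

# Overall test: does mean Yield differ among varieties at all?
print(anova(m1))
```

The two parameterisations describe the same fitted model; only the questions answered by the individual coefficient tests change.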

wilcoxon test using data stratification

I have a really basic problem. I have the concentrations of one chemical stored in one column and the gender of the study participant in a second column.
What is the code to do the wilcoxon test to see if there is a difference between the concentrations found in boys and the concentrations found in girls? Some explanation of the code would also be useful for me to understand how it works. Thanks!
I got this code for the ANOVA test to work which is also fine. Can anyone tell me if it does the thing that I need?
av <- aov(UC_MEHP ~ BQF05C1, data=data)
av
summary(av)
the output looks like this
> av <- aov(UC_MEHP ~ BQF05C1, data=data)
> av
Call:
aov(formula = UC_MEHP ~ BQF05C1, data = data)
Terms:
BQF05C1 Residuals
Sum of Squares 0.3445 2917.4564
Deg. of Freedom 1 151
Residual standard error: 4.395555
Estimated effects may be unbalanced
21 observations deleted due to missingness
> summary(av)
Df Sum Sq Mean Sq F value Pr(>F)
BQF05C1 1 0.3 0.344 0.018 0.894
Residuals 151 2917.5 19.321
21 observations deleted due to missingness
I'm sorry, I know it's not a very advanced question...
From ?wilcox.test:
## S3 method for class 'formula'
wilcox.test(formula, data, subset, na.action, ...)
...
formula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric
variable giving the data values and ‘rhs’ a factor with two
levels giving the corresponding groups.
So wilcox.test(UC_MEHP ~ BQF05C1, data=data) should work (assuming that BQF05C1 is the column specifying gender and UC_MEHP is the concentration).
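A small self-contained sketch of that call on simulated data, since the original data set isn't available. The values, the "boy"/"girl" labels, and the name dat are assumptions; only the column names are taken from the question.

```r
set.seed(123)

# Simulated concentrations for two groups, with a couple of missing values
dat <- data.frame(
  UC_MEHP = c(rlnorm(40, meanlog = 1.0, sdlog = 0.8),
              rlnorm(40, meanlog = 1.3, sdlog = 0.8)),
  BQF05C1 = factor(rep(c("boy", "girl"), each = 40))
)
dat$UC_MEHP[c(3, 50)] <- NA

# Left of ~ : the numeric measurement; right of ~ : a two-level grouping factor.
# Rows with missing values are dropped by the formula method's default na.action.
res <- wilcox.test(UC_MEHP ~ BQF05C1, data = dat)
print(res)

# The p-value alone
res$p.value
```

Unlike aov, which compares group means and assumes roughly normal residuals, this rank-based test asks whether values in one group tend to be larger than in the other.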

Anova table comparing groups, in R, exported to latex?

I mostly work with observational data, but I read a lot of experimental hard-science papers that report results in the form of ANOVA tables, with letters indicating the significance of the differences between the groups, and then p-values of the F-stat for the joint significance of what is essentially a factor variable regression. Here is an example that I've pulled off of Google image search.
I think that this might be a useful way to present summary statistics about groupwise differences (or lack thereof) in an observational dataset, before I go ahead and try to control for them in various ways. I'm not sure exactly what test the letters are typically representing (Tukey something?), but pairwise t-tests would suit my purposes fine.
My main question: how can I get such an output from a factor variable regression in R, and how can I seamlessly export it into latex?
Here is some example data:
var = c(3500,600,12400,6400,1500,0,4400,400,900,2000,350,0,5800,0,12800,1200,350,800,2500,2000,0,3200,1100,0,0,0,0,0,1000,0,0,0,0,0,12400,6000,1700,3500,3000,1000,0,0,3500,5000,1000,3600,1600,3500,0,900,4200,0,0,0,0,1700,1250,500,950,500,600,1080,500,980,670,1200,600,550,4000,600,2800,650,0,3700,12500,0,0,0,1200,2700,0,NA,0,0,0,3700,2000,3500,0,0,0,3500,800,1400,0,500,7000,3500,0,0,0,0,2700,0,0,0,0,2000,5000,0,0,7000,0,4800,0,0,0,0,1800,0,2500,1600,4600,0,2000,5400,4500,3200,0,12200,0,3500,0,0,2800,3600,3000,0,3150,0,0,3750,2800,0,1000,1500,6000,3090,2800,600,0,0,1000,3800,3000,0,800,600,1200,0,240,1000,300,3600,0,1200,300,2700,NA,1300,1200,1400,4600,3200,750,300,750,1200,700,870,900,3200,1300,1500,1200,0,960,1800,8000,1200,NA,0,1080,1300,1080,900,700,5000,1500,3750,0,1400,900,1400,400,3900,0,1400,1600,960,1200,2600,420,3400,2500,500,4000,0,4250,570,600,4550,2000,0,0,4300,2000,0,0,0,0,NA,0,2060,2600,1600,1800,3000,900,0,0,3200,0,1500,3000,0,3700,6000,0,0,1250,1200,12800,0,1000,1100,0,950,2500,800,3000,3600,3600,1500,0,0,3600,800,0,1000,1600,1700,0,3500,3700,3000,350,700,3500,0,0,0,0,1500,0,400,0,0,0,0,0,0,0,500,0,0,0,0,5600,0,0,0)
factor = as.factor(c(5,2,5,5,5,3,4,5,5,5,3,1,1,1,5,3,6,6,6,5,5,5,3,5,3,3,3,3,4,3,3,3,4,3,5,5,3,5,3,3,3,3,5,3,3,3,3,3,5,5,5,5,5,3,3,5,3,5,5,3,5,5,4,3,5,5,5,5,5,5,4,5,3,5,4,4,3,4,3,5,3,3,5,5,5,3,5,5,4,3,3,5,5,4,3,3,5,3,3,4,3,3,3,3,5,5,3,5,5,3,3,5,4,3,3,3,4,4,5,3,1,5,5,1,5,5,5,3,3,4,5,5,5,3,3,4,5,4,5,3,5,5,5,3,3,3,3,3,3,3,3,3,3,3,4,3,3,3,3,3,3,3,4,5,4,6,4,3,5,5,3,5,3,3,4,3,5,5,5,3,5,3,3,5,5,5,3,4,3,3,3,5,3,5,3,5,5,3,5,3,5,5,5,5,5,3,5,3,5,3,4,5,5,5,6,5,5,5,5,4,5,3,5,3,3,5,4,3,5,3,4,5,3,5,3,5,3,1,5,1,5,3,5,5,5,3,6,3,5,3,5,2,5,5,5,1,5,5,6,5,4,5,4,3,3,3,5,3,3,3,3,5,3,3,3,3,3,3,5,5,5,4,4,4,5,5,3,5,4,5,5,4,3,3,3,4,3,5,5,4,3,3))
do a simple regression on them and you get the following
m = lm((var-mean(var,na.rm=TRUE))~factor-1)
summary(m)
Call:
lm(formula = (var - mean(var, na.rm = TRUE)) ~ factor - 1)
Residuals:
Min 1Q Median 3Q Max
-2040.5 -1240.2 -765.5 957.1 10932.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
factor1 -82.42 800.42 -0.103 0.9181
factor2 -732.42 1600.84 -0.458 0.6476
factor3 -392.17 204.97 -1.913 0.0567 .
factor4 -65.19 377.32 -0.173 0.8629
factor5 408.07 204.13 1.999 0.0465 *
factor6 303.30 855.68 0.354 0.7233
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2264 on 292 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.02677, Adjusted R-squared: 0.006774
F-statistic: 1.339 on 6 and 292 DF, p-value: 0.2397
It looks pretty clear that factors 3 and 5 are different from zero, different from each other, but that factor 3 is not different from 2 and factor 5 is not different from 6, respectively (at whatever p value).
How can I get this into anova table output like in the example above? And is there a clean way to get this into latex, ideally in a form that allows a lot of variables?
The following answers only the third question.
It looks like xtable does what you'd like to do - exporting R tables to $\LaTeX$ code.
There's a nice gallery as well.
I've found both in a wiki post on stackoverflow.
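For the other parts of the question, a minimal sketch under stated assumptions: the data are simulated (four groups, names dat, m and pw invented here), the pairwise comparisons use base R's pairwise.t.test, and the LaTeX export uses xtable only if that package is installed. It does not produce the letter annotations themselves; those are usually a compact letter display, for which the cld function in the multcomp package is one option.

```r
set.seed(1)

# Simulated stand-in: four groups of 30 with different means
dat <- data.frame(
  group = factor(rep(1:4, each = 30)),
  value = rnorm(120, mean = rep(c(0, 0, 1, 2), each = 30))
)

# Joint F test for the factor
m <- lm(value ~ group, data = dat)
print(anova(m))

# Pairwise t-tests with Holm-adjusted p-values (lower-triangular matrix)
pw <- pairwise.t.test(dat$value, dat$group, p.adjust.method = "holm")
print(pw$p.value)

# Export both tables as LaTeX, if xtable is available
if (requireNamespace("xtable", quietly = TRUE)) {
  print(xtable::xtable(anova(m)))
  print(xtable::xtable(pw$p.value, digits = 3))
}
```

The printed LaTeX can be pasted into a document or written to a file through the file argument of print.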

How to perform single factor ANOVA in R with samples organized by column?

I have a data set where the samples are grouped by column. The following sample dataset is similar to my data's format:
a = c(1,3,4,6,8)
b = c(3,6,8,3,6)
c = c(2,1,4,3,6)
d = c(2,2,3,3,4)
mydata = data.frame(cbind(a,b,c,d))
When I perform a single factor ANOVA in Excel using the above dataset, I get the following results:
I know a typical format in R is as follows:
group measurement
a 1
a 3
a 4
. .
. .
. .
d 4
And the command to perform ANOVA in R would be aov(measurement ~ group, data = mydata). How do I perform single factor ANOVA in R with samples organized by column rather than by row? In other words, how do I duplicate the Excel results using R? Many thanks for the help.
You stack them in the long format:
mdat <- stack(mydata)
mdat
values ind
1 1 a
2 3 a
3 4 a
4 6 a
5 8 a
6 3 b
7 6 b
snipped output
> aov( values ~ ind, mdat)
Call:
aov(formula = values ~ ind, data = mdat)
Terms:
ind Residuals
Sum of Squares 18.2 65.6
Deg. of Freedom 3 16
Residual standard error: 2.024846
Estimated effects may be unbalanced
Given the warning it might be safer to use lm:
> anova(lm(values ~ ind, mdat))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 3 18.2 6.0667 1.4797 0.2578
Residuals 16 65.6 4.1000
> summary(lm(values~ind, mdat))
Call:
lm(formula = values ~ ind, data = mdat)
Residuals:
Min 1Q Median 3Q Max
-3.40 -1.25 0.00 0.90 3.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4000 0.9055 4.859 0.000174 ***
indb 0.8000 1.2806 0.625 0.540978
indc -1.2000 1.2806 -0.937 0.362666
indd -1.6000 1.2806 -1.249 0.229491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.025 on 16 degrees of freedom
Multiple R-squared: 0.2172, Adjusted R-squared: 0.07041
F-statistic: 1.48 on 3 and 16 DF, p-value: 0.2578
And please don't ask me why Excel gives a different answer. Excel has generally been shown to be highly unreliable when it comes to statistics. The onus is on Excel to explain why it doesn't give answers comparable to R.
Edit in response to comments: The Excel Data Analysis Pack ANOVA procedure creates an output, but it does not use an Excel function for that process, so when you change the data in the cells from which it was derived and then hit F9 (or the equivalent menu recalculation command), there will be no change in the output section. David Heiser's pages assessing Excel's problems with statistical calculations (http://www.daheiser.info/excel/frontpage.html) document this and other sources of user and numerical problems. Heiser started his efforts, which are now at least a decade long, with the expectation that Microsoft would take responsibility for these errors, but they have consistently ignored his and others' efforts at identifying errors and suggesting better procedures. There was also a six-section Special Report in the June 2008 issue of "Computational Statistics & Data Analysis", edited by BD McCullough, that covers various statistical concerns with Excel.
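To check the numbers independently of any particular tool, the sketch below rebuilds the one-way ANOVA for the same four columns by hand and compares it with aov. The variable names (long, ss_between, ss_within, F_manual) are introduced here for illustration.

```r
# The question's data, one column per group
mydata <- data.frame(a = c(1, 3, 4, 6, 8),
                     b = c(3, 6, 8, 3, 6),
                     c = c(2, 1, 4, 3, 6),
                     d = c(2, 2, 3, 3, 4))

# Wide -> long: 'values' holds the measurements, 'ind' the column of origin
long <- stack(mydata)
tab <- summary(aov(values ~ ind, data = long))[[1]]
print(tab)

# The same F statistic from the sums of squares
k <- ncol(mydata)                 # number of groups
N <- nrow(long)                   # total number of observations
grand <- mean(long$values)
ss_between <- sum(nrow(mydata) * (colMeans(mydata) - grand)^2)
ss_within  <- sum((long$values - ave(long$values, long$ind))^2)
F_manual <- (ss_between / (k - 1)) / (ss_within / (N - k))

c(ss_between = ss_between, ss_within = ss_within, F = F_manual)
```

The hand computation gives sums of squares of 18.2 and 65.6 and F = 1.4797 on 3 and 16 degrees of freedom, matching the tables above.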
