ANOVA table comparing groups in R, exported to LaTeX?

I mostly work with observational data, but I read a lot of experimental hard-science papers that report results in the form of ANOVA tables, with letters indicating the significance of the differences between the groups, and then p-values of the F-stat for the joint significance of what is essentially a factor-variable regression. Here is an example that I've pulled off of a Google image search.
I think that this might be a useful way to present summary statistics about groupwise differences (or lack thereof) in an observational dataset, before I go ahead and try to control for them in various ways. I'm not sure exactly what test the letters typically represent (Tukey something?), but pairwise t-tests would suit my purposes fine.
My main question: how can I get such an output from a factor-variable regression in R, and how can I seamlessly export it to LaTeX?
Here is some example data:
var = c(3500,600,12400,6400,1500,0,4400,400,900,2000,350,0,5800,0,12800,1200,350,800,2500,2000,0,3200,1100,0,0,0,0,0,1000,0,0,0,0,0,12400,6000,1700,3500,3000,1000,0,0,3500,5000,1000,3600,1600,3500,0,900,4200,0,0,0,0,1700,1250,500,950,500,600,1080,500,980,670,1200,600,550,4000,600,2800,650,0,3700,12500,0,0,0,1200,2700,0,NA,0,0,0,3700,2000,3500,0,0,0,3500,800,1400,0,500,7000,3500,0,0,0,0,2700,0,0,0,0,2000,5000,0,0,7000,0,4800,0,0,0,0,1800,0,2500,1600,4600,0,2000,5400,4500,3200,0,12200,0,3500,0,0,2800,3600,3000,0,3150,0,0,3750,2800,0,1000,1500,6000,3090,2800,600,0,0,1000,3800,3000,0,800,600,1200,0,240,1000,300,3600,0,1200,300,2700,NA,1300,1200,1400,4600,3200,750,300,750,1200,700,870,900,3200,1300,1500,1200,0,960,1800,8000,1200,NA,0,1080,1300,1080,900,700,5000,1500,3750,0,1400,900,1400,400,3900,0,1400,1600,960,1200,2600,420,3400,2500,500,4000,0,4250,570,600,4550,2000,0,0,4300,2000,0,0,0,0,NA,0,2060,2600,1600,1800,3000,900,0,0,3200,0,1500,3000,0,3700,6000,0,0,1250,1200,12800,0,1000,1100,0,950,2500,800,3000,3600,3600,1500,0,0,3600,800,0,1000,1600,1700,0,3500,3700,3000,350,700,3500,0,0,0,0,1500,0,400,0,0,0,0,0,0,0,500,0,0,0,0,5600,0,0,0)
factor = as.factor(c(5,2,5,5,5,3,4,5,5,5,3,1,1,1,5,3,6,6,6,5,5,5,3,5,3,3,3,3,4,3,3,3,4,3,5,5,3,5,3,3,3,3,5,3,3,3,3,3,5,5,5,5,5,3,3,5,3,5,5,3,5,5,4,3,5,5,5,5,5,5,4,5,3,5,4,4,3,4,3,5,3,3,5,5,5,3,5,5,4,3,3,5,5,4,3,3,5,3,3,4,3,3,3,3,5,5,3,5,5,3,3,5,4,3,3,3,4,4,5,3,1,5,5,1,5,5,5,3,3,4,5,5,5,3,3,4,5,4,5,3,5,5,5,3,3,3,3,3,3,3,3,3,3,3,4,3,3,3,3,3,3,3,4,5,4,6,4,3,5,5,3,5,3,3,4,3,5,5,5,3,5,3,3,5,5,5,3,4,3,3,3,5,3,5,3,5,5,3,5,3,5,5,5,5,5,3,5,3,5,3,4,5,5,5,6,5,5,5,5,4,5,3,5,3,3,5,4,3,5,3,4,5,3,5,3,5,3,1,5,1,5,3,5,5,5,3,6,3,5,3,5,2,5,5,5,1,5,5,6,5,4,5,4,3,3,3,5,3,3,3,3,5,3,3,3,3,3,3,5,5,5,4,4,4,5,5,3,5,4,5,5,4,3,3,3,4,3,5,5,4,3,3))
Do a simple regression on them and you get the following:
m = lm((var-mean(var,na.rm=TRUE))~factor-1)
summary(m)
Call:
lm(formula = (var - mean(var, na.rm = TRUE)) ~ factor - 1)
Residuals:
Min 1Q Median 3Q Max
-2040.5 -1240.2 -765.5 957.1 10932.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
factor1 -82.42 800.42 -0.103 0.9181
factor2 -732.42 1600.84 -0.458 0.6476
factor3 -392.17 204.97 -1.913 0.0567 .
factor4 -65.19 377.32 -0.173 0.8629
factor5 408.07 204.13 1.999 0.0465 *
factor6 303.30 855.68 0.354 0.7233
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2264 on 292 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.02677, Adjusted R-squared: 0.006774
F-statistic: 1.339 on 6 and 292 DF, p-value: 0.2397
It looks pretty clear that factors 3 and 5 are different from zero and from each other, but that factor 3 is not different from factor 2 and factor 5 is not different from factor 6 (at whatever p-value).
How can I get this into ANOVA-table output like in the example above? And is there a clean way to get this into LaTeX, ideally in a form that allows a lot of variables?

The following answers only the third question.
It looks like xtable does what you'd like to do - exporting R tables to $\LaTeX$ code.
There's a nice gallery as well.
I've found both in a wiki post on stackoverflow.
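Going beyond that, a minimal sketch of the full pipeline: TukeyHSD for the pairwise comparisons, the multcompView package (an assumption; it is not mentioned above) to turn them into significance letters, and xtable for the export:
library(multcompView)  # assumed installed; provides multcompLetters
library(xtable)
# One-way ANOVA on the example data ('var' and 'factor' as defined above),
# followed by Tukey HSD pairwise comparisons
m   <- aov(var ~ factor)
tuk <- TukeyHSD(m)
# Compact letter display: groups sharing a letter are not significantly different
cld <- multcompLetters(tuk$factor[, "p adj"])$Letters
# Group means plus significance letters, emitted as LaTeX code
tab <- data.frame(mean   = tapply(var, factor, mean, na.rm = TRUE),
                  letter = cld[levels(factor)])
print(xtable(tab), type = "latex")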

How can I change the default object value for lm()?

Apologies for any grammar issues as English is not my first language.
I am trying to investigate whether black applicants are discriminated against in comparison with white applicants when submitting their resumes (Bertrand and Mullainathan, 2004).
I do the following:
> resume <- read.csv("resume.csv", header=T)
> fit <- lm(call~race, data=resume)
> summary(fit)
Call:
lm(formula = call ~ race, data = resume)
Residuals:
Min 1Q Median 3Q Max
-0.09651 -0.09651 -0.06448 -0.06448 0.93552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.064476 0.005505 11.713 < 2e-16 ***
racewhite 0.032033 0.007785 4.115 3.94e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2716 on 4868 degrees of freedom
Multiple R-squared: 0.003466, Adjusted R-squared: 0.003261
F-statistic: 16.93 on 1 and 4868 DF, p-value: 3.941e-05
As you can see from the summary, it displays 'racewhite' as the variable and I have no idea how to change this so it instead displays 'raceblack'.
I know this might be quite a simple question, but thank you in advance for helping me out :)
lm appears to be treating race as a factor. By default, the first level of the factor is used as the reference (baseline), and coefficients for the other levels represent differences between the given level and that baseline. The "first" level of a factor defined by character values is the first level when the levels are sorted alphabetically.
You can change this by using relevel on the factor before the call to lm (note that reorder sorts levels by a summary statistic, whereas relevel directly sets the reference level). Or, assuming race has only two levels, you can simply reinterpret the coefficient by changing its sign: based on the results above, the coefficient for race with white as the reference will be -0.032033. All other statistics (std. error, p-value, etc.) will be unchanged.
It would have been helpful to see at least some of your input data.
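A minimal sketch of the releveling approach (assuming race has exactly the levels "black" and "white"):
# Make "white" the reference level so the reported coefficient is black - white
resume$race <- relevel(factor(resume$race), ref = "white")
fit <- lm(call ~ race, data = resume)
summary(fit)  # the coefficient row is now 'raceblack', with estimate -0.032033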

How can I test if my dependent variable increases or decreases from one year to another?

I have a dataset called data, which contains measurements of a contaminant, my dependent variable. All observations in one year are independent of those in the following years.
My predictors are Species (three levels) and Year (three levels): Basically, I need to see if there is an increase/decrease in the contaminant data over time for each species separately.
So far I have tried this code:
model1 <- lm(contaminant ~ Species * Year, data = data)
# Using Year as numeric (a covariate): I do not care about the difference
# in contaminant load among species in each year. I simply want to test
# whether the slopes for each species are significant.
1st Question: Am I doing this correctly by treating Year as a number? Or is there a specific way/code to treat time series? I actually want a p-value that tells me that Series1 in the graph below (made with the average values of each group) had a significant increase over time.
[Figure: line graph of the mean contaminant value per species across the three years]
My summary output looks like this:
Call: lm(formula = Contaminant ~ Species * Year, data = data)
Residuals:
Min 1Q Median 3Q Max
-5.1135 -1.3595 -0.1475 1.3225 7.3652
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -588.6625 996.6024 -0.591 0.556036
Species2 -823.3590 1320.9209 -0.623 0.534451
Species3 -4798.0032 1393.0990 -3.444 0.000830 ***
Year 0.2930 0.4941 0.593 0.554484
Species2:Year 0.4092 0.6549 0.625 0.533462
Species3:Year 2.3802 0.6907 3.446 0.000824 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.21 on 103 degrees of freedom
Multiple R-squared: 0.3853, Adjusted R-squared: 0.3555
F-statistic: 12.91 on 5 and 103 DF, p-value: 9.428e-10
2nd Question: Why is my summary output giving me only two interactions? Why does it not also provide Species1:Year?
3rd Question: Does anybody know how to make a graph like this in R? So far I am only able to do it in Excel using the mean values of each group.
Thanks
1) If your observations include only whole years and not full dates, it is fine to add the year as a numeric variable. As long as you do not convert it to a factor, the model assumes a constant change per year.
2) Dummy (or one-hot) encoding always tests the difference of a group from a baseline group. That means Species2 tests S2 - S1, and Species3 tests S3 - S1. The same holds for the interaction terms, so the plain Year coefficient is the slope for the baseline species; see the first sketch after this list for testing each species' slope directly.
3) There are multiple possibilities, but it will take more than 1-2 lines of code. See e.g. http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)/#line-graphs-1 and the second sketch after this list.
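Regarding 2), a sketch for testing each species' slope directly (assumes Species is a factor): relevel so the species of interest is the baseline, and the Year row of the summary then reports that species' own slope.
# Species 2 as baseline: the 'Year' coefficient is now species 2's slope
data$Species <- relevel(data$Species, ref = "2")
summary(lm(Contaminant ~ Species * Year, data = data))
Regarding 3), a minimal ggplot2 sketch of the line graph, assuming the column names used in the model above:
library(ggplot2)
# Mean contaminant per species and year, then one line per species
means <- aggregate(Contaminant ~ Species + Year, data = data, FUN = mean)
ggplot(means, aes(x = Year, y = Contaminant, colour = factor(Species))) +
  geom_line() +
  geom_point() +
  labs(y = "Mean contaminant", colour = "Species")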

Within and between factors in regression models in R

I'm trying to run an rmANOVA and a corresponding regression model. In the experiment, participants completed a questionnaire evaluating how much of a trait X they have (SCORE). Then they performed a task in which each participant was exposed to three conditions (COND: nSCM, SCM, SC). Their brain responses were measured (ERP).
This is what it looks like:
> head(df)
code SEX AGE SCORE COND ERP
1 AA1407 male 29 14 nSCM -3.0348373
2 AN0312 male 26 13 nSCM -1.8799240
3 BR1410 male 23 30 nSCM 0.4284033
4 EZ2404 male 23 23 nSCM -0.7615117
5 HA1012 female 27 22 nSCM -2.9301698
6 HS3004 male 30 16 nSCM -0.5468492
Since I am a bit confused about how to use different types of variables in R, maybe someone could also reassure me about the following:
> sapply(df,class)
code SEX AGE SCORE COND ERP
"factor" "factor" "numeric" "numeric" "factor" "numeric"
Based on the experimental design, the ANOVA design has one between-subject IV: SCORE, one within-subject IV: COND and the DV is ERP (right?).
This is the model I used and the summary:
> anERP <- aov(ERP ~ COND*SCORE, data = df)
> summary(anERP)
Df Sum Sq Mean Sq F value Pr(>F)
COND 2 0.21 0.105 0.027 0.9736
SCORE 1 16.87 16.868 4.297 0.0419 *
COND:SCORE 2 0.58 0.289 0.074 0.9291
Residuals 69 270.85 3.925
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, IF this is right (please let me know if anything doesn't seem right), I should also find an effect of SCORE when I build a regression model, right? Also, I'm not sure how to interpret this effect, since SCORE is an interval variable (range 6-35). I would appreciate a little help here.
Now I'm very confused about what this model should look like as a regression. I started with a simple lm model with SCORE and COND as fixed effects:
> lmERP <- lm(ERP ~ SCORE*COND, data = df)
> summary(lmERP)
Call:
lm(formula = ERP ~ SCORE * COND, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.2554 -1.0916 0.1975 1.4582 3.3097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.04269 1.06193 -2.865 0.00552 **
SCORE 0.06458 0.05229 1.235 0.22108
CONDSCM -0.08141 1.50180 -0.054 0.95693
CONDnSCM 0.36646 1.50180 0.244 0.80795
SCORE:CONDSCM 0.01111 0.07396 0.150 0.88104
SCORE:CONDnSCM -0.01707 0.07396 -0.231 0.81814
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.981 on 69 degrees of freedom
Multiple R-squared: 0.0612, Adjusted R-squared: -0.006827
F-statistic: 0.8997 on 5 and 69 DF, p-value: 0.4864
However, here the main effect of SCORE does not reach significance. How is that possible? Shouldn't the rmANOVA and the regression show roughly similar results (or at least the same main effects)?
I guess I'm not applying the right linear model here, since it doesn't seem to recognise that there are both within- and between-subject factors in the design.
I have read hundreds of webpages, tutorials and forums and I'm still completely confused about these models. Thank you in advance for any piece of advice!
Repeated-measures or mixed-model designs can be very confusing to specify using R's base aov function. In the code you have written, for example, aov will treat all the specified factors as independent (i.e., between-subject). I highly recommend using a library that makes it easier to specify these types of designs.
The ez library contains ezANOVA, which makes these tests simple to perform, provided that all your cases are complete (all factors are fully crossed, with no missing data). Assuming that your CODE column uniquely identifies each subject and you wanted to include all factors from your data set, the test would look something like this:
my.aov <- ezANOVA(data = df, dv = ERP, wid = CODE, between = .(SEX, AGE, SCORE), within = COND)
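For comparison, a sketch of the base aov specification that declares COND as within-subject through an Error() term (assuming the code column identifies subjects and the design is balanced):
# Error(code/COND) nests condition within subject, so COND and COND:SCORE
# are tested against the within-subject error stratum
anERP <- aov(ERP ~ COND * SCORE + Error(code/COND), data = df)
summary(anERP)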
It is also possible to implement these designs with the lme4 package (in fact, ezANOVA is a wrapper around lme4's functions). While lme4 allows for more flexible model specifications and can tolerate incomplete data, its syntax is more difficult. Bodo Winter's tutorial on lme4 is a good start, if you want to go really deep.
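A minimal lme4 sketch for this design (assuming code identifies subjects; lmerTest is one common way to obtain p-values for the fixed effects):
library(lme4)
library(lmerTest)  # adds Satterthwaite p-values to lmer output
# A random intercept per subject accounts for the repeated measures
mod <- lmer(ERP ~ COND * SCORE + (1 | code), data = df)
summary(mod)
anova(mod)  # F-tests for the fixed effects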
As an aside, there is usually little point in performing both an ANOVA and a linear regression. Unless the two tests are specified in a way that treats the factors differently, the results will be equivalent.

Passing strings as variable names in an R for loop, but keeping names in results

OK, I'm working on a silly toy problem in R (part of an edX course, actually), running a bunch of bivariate logits and looking at the p-values. And I'm trying to add some coding practice to my data-crunching practice by doing the chore as a for loop rather than as a bunch of individual models. So I pulled the variable names I wanted out of the data frame, stuck them in a vector, and passed that vector to glm() with a for loop.
After about an hour and a half of searching and hacking around to deal with the inevitable variable length errors, I realized that R was interpreting the elements of the variable vector as character strings rather than variable names. Solved that problem, ended up with a final working loop as follows:
for (i in 1:length(dumber)) {
print(summary(glm(WorldSeries ~ get(dumber[i]) , data=baseball, family=binomial)))
}
where dumber is the vector of independent variable names and WorldSeries is the dependent variable.
And that was awesome... except for one little problem. The console output is a bunch of model summaries, which is what I want, but the summaries aren't labelled with the variable names. Instead, they're just labelled with the code from the for loop! For example, here are the summaries for two of the variables my little loop went through:
Call:
glm(formula = WorldSeries ~ get(dumber[i]), family = binomial,
data = baseball)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5610 -0.5209 -0.5088 -0.4902 2.1268
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08725 6.07285 -0.014 0.989
get(dumber[i]) -4.65992 15.06881 -0.309 0.757
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.926 on 113 degrees of freedom
Residual deviance: 84.830 on 112 degrees of freedom
(130 observations deleted due to missingness)
AIC: 88.83
Number of Fisher Scoring iterations: 4
Call:
glm(formula = WorldSeries ~ get(dumber[i]), family = binomial,
data = baseball)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9871 -0.8017 -0.5089 -0.5089 2.2643
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.03868 0.43750 0.088 0.929559
get(dumber[i]) -0.25220 0.07422 -3.398 0.000678 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.96 on 242 degrees of freedom
AIC: 230.96
Number of Fisher Scoring iterations: 4
That's obviously hopeless, especially as the number of elements of the variable vector increases. I'm sure if I knew a lot more about object-oriented programming than I do, I'd be able to just create some kind of complicated object that has the elements of dumber matched to the model summaries, or directly tinker with the summaries to insert the elements of dumber into where it currently just reads "get(dumber[i])". But I currently know jack-all about OOP (I'm learning! It's slow!). So does anyone wanna clue me in? Thanks!
You could do this (only send the outcome and predictor columns one at a time to glm):
for (i in 1:length(dumber)) {
print(summary(glm(WorldSeries ~ . , data=baseball[, c("WorldSeries", dumber[i])],
family=binomial)))
}
You could also do this (label each output with the current element of 'dumber'):
for (i in 1:length(dumber)) {
  print(paste0("Current predictor is ", dumber[i]))
  print(summary(glm(WorldSeries ~ get(dumber[i]), data=baseball, family=binomial)))
}
As you progress down the road to R mastery, you would probably want to build a list of summary objects and then use lapply to print or cat your tailored output.
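A sketch of that list-based approach: reformulate() builds each formula from the string, so the coefficient rows carry the real variable names, and a header printed before each summary labels the output.
# Fit one model per predictor and keep them in a list named by predictor
fits <- lapply(dumber, function(v) {
  glm(reformulate(v, response = "WorldSeries"),
      data = baseball, family = binomial)
})
names(fits) <- dumber
# Print each summary under a header naming its predictor
invisible(lapply(dumber, function(v) {
  cat("\n=== Predictor:", v, "===\n")
  print(summary(fits[[v]]))
}))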

How to perform single factor ANOVA in R with samples organized by column?

I have a data set where the samples are grouped by column. The following sample dataset is similar to my data's format:
a = c(1,3,4,6,8)
b = c(3,6,8,3,6)
c = c(2,1,4,3,6)
d = c(2,2,3,3,4)
mydata = data.frame(cbind(a,b,c,d))
When I perform a single factor ANOVA in Excel using the above dataset, I get the following results:
I know a typical format in R is as follows:
group measurement
a 1
a 3
a 4
. .
. .
. .
d 4
And the command to perform the ANOVA in R would be aov(measurement ~ group, data = mydata). How do I perform a single-factor ANOVA in R with samples organized by column rather than by row? In other words, how do I duplicate the Excel results using R? Many thanks for the help.
You stack them in the long format:
mdat <- stack(mydata)
mdat
values ind
1 1 a
2 3 a
3 4 a
4 6 a
5 8 a
6 3 b
7 6 b
snipped output
> aov( values ~ ind, mdat)
Call:
aov(formula = values ~ ind, data = mdat)
Terms:
ind Residuals
Sum of Squares 18.2 65.6
Deg. of Freedom 3 16
Residual standard error: 2.024846
Estimated effects may be unbalanced
Given the warning it might be safer to use lm:
> anova(lm(values ~ ind, mdat))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 3 18.2 6.0667 1.4797 0.2578
Residuals 16 65.6 4.1000
> summary(lm(values~ind, mdat))
Call:
lm(formula = values ~ ind, data = mdat)
Residuals:
Min 1Q Median 3Q Max
-3.40 -1.25 0.00 0.90 3.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4000 0.9055 4.859 0.000174 ***
indb 0.8000 1.2806 0.625 0.540978
indc -1.2000 1.2806 -0.937 0.362666
indd -1.6000 1.2806 -1.249 0.229491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.025 on 16 degrees of freedom
Multiple R-squared: 0.2172, Adjusted R-squared: 0.07041
F-statistic: 1.48 on 3 and 16 DF, p-value: 0.2578
And please don't ask me why Excel gives a different answer. Excel has generally been shown to be highly unreliable when it comes to statistics. The onus is on Excel to explain why it doesn't give answers comparable to R.
Edit in response to comments: The Excel Data Analysis Pack ANOVA procedure creates an output, but it does not use an Excel function for that process, so when you change the data in the cells from which it was derived and then hit F9 (or the equivalent menu recalculation command), there will be no change in the output section. This and other sources of user and numerical problems are documented in various pages of David Heiser's efforts at assessing Excel's problems with statistical calculations: http://www.daheiser.info/excel/frontpage.html. Heiser began his efforts, now at least a decade long, with the expectation that Microsoft would take responsibility for these errors, but it has consistently ignored his and others' efforts at identifying errors and suggesting better procedures. There was also a six-section Special Report in the June 2008 issue of "Computational Statistics & Data Analysis", edited by B.D. McCullough, covering various statistical concerns with Excel.
