Why do R and statsmodels give slightly different ANOVA results?

Using a small R sample dataset and the ANOVA example from statsmodels, the degrees of freedom for one of the variables are reported differently, and the F values also differ slightly. Perhaps the two use slightly different default approaches? Can I set up statsmodels to use R's defaults?
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
##R code on R sample dataset
#> anova(with(ChickWeight, lm(weight ~ Time + Diet)))
#Analysis of Variance Table
#
#Response: weight
# Df Sum Sq Mean Sq F value Pr(>F)
#Time 1 2042344 2042344 1576.460 < 2.2e-16 ***
#Diet 3 129876 43292 33.417 < 2.2e-16 ***
#Residuals 573 742336 1296
#write.csv(file='ChickWeight.csv', x=ChickWeight, row.names=F)
cw = pd.read_csv('ChickWeight.csv')
cw_lm=ols('weight ~ Time + Diet', data=cw).fit()
print(sm.stats.anova_lm(cw_lm, typ=2))
# sum_sq df F PR(>F)
#Time 2024187.608511 1 1523.368567 9.008821e-164
#Diet 108176.538530 1 81.411791 2.730843e-18
#Residual 764035.638024 575 NaN NaN
The head and tail of the datasets are the same, as are the mean, min, max, and median of weight and Time.

Looks like "Diet" only has one degree of freedom in the statsmodels call which means it was probably treated as a continuous variable whereas in R it has 3 degrees of freedom so it probably was a factor/discrete random variable.
To make ols() treat "Diet" as a categorical random variable, use
cw_lm=ols('weight ~ C(Diet) + Time', data=cw).fit()
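A quick check on the R side supports this: in R's built-in ChickWeight data, Diet is already stored as a factor with 4 levels, which is why R assigns it 3 degrees of freedom.
str(ChickWeight$Diet)
# Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...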

Related

Model simplification (two-way ANOVA)

I am using ANOVA to analyse results from an experiment to see whether there are any effects of my explanatory variables (Heating and Dungfauna) on my response variable (Biomass). I started by looking at the main effects and interaction:
full.model <- lm(log(Biomass) ~ Heating*Dungfauna, data= df)
anova(full.model)
I understand that it is necessary to carry out model simplification, removing non-significant interactions or effects to eventually reach the simplest model that still explains the results. I tried two ways of removing the interaction. However, when I remove the interaction manually (Heating*Dungfauna -> Heating+Dungfauna), the new ANOVA gives a different output than when I use this model simplification 'shortcut':
new.model <- update(full.model, .~. -Dungfauna:Heating)
anova(new.model)
Which way is the appropriate way to remove the interaction and simplify the model?
In both cases the data is log transformed -
lm(log(CC_noAcari_EmergencePatSoil)~ Dungfauna*Heating, data= biomass)
ANOVA output from manually changing Heating*Dungfauna to Heating+Dungfauna:
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Heating 2 4.806 2.403 5.1799 0.01012 *
Dungfauna 1 37.734 37.734 81.3432 4.378e-11 ***
Residuals 39 18.091 0.464
ANOVA output from using simplification 'shortcut':
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Dungfauna 1 41.790 41.790 90.0872 1.098e-11 ***
Heating 2 0.750 0.375 0.8079 0.4531
Residuals 39 18.091 0.464
R's anova and aov functions compute Type I ("sequential") sums of squares, so the order in which the predictors are specified matters. A model specified as y ~ A + B tests the effect of A ignoring B, and then the effect of B conditioned on A; y ~ B + A tests B first and then A conditioned on B. Notice that your first model specifies Dungfauna*Heating, while your comparison model uses Heating+Dungfauna, so the terms enter in different orders.
Consider this simple example using the "mtcars" data set. Here I specify two additive models (no interactions). Both models specify the same predictors, but in different orders:
add.model <- lm(log(mpg) ~ vs + cyl, data = mtcars)
anova(add.model)
Df Sum Sq Mean Sq F value Pr(>F)
vs 1 1.22434 1.22434 48.272 1.229e-07 ***
cyl 1 0.78887 0.78887 31.103 5.112e-06 ***
Residuals 29 0.73553 0.02536
add.model2 <- lm(log(mpg) ~ cyl + vs, data = mtcars)
anova(add.model2)
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 2.00795 2.00795 79.1680 8.712e-10 ***
vs 1 0.00526 0.00526 0.2073 0.6523
Residuals 29 0.73553 0.02536
You could specify Type II or Type III sums of squares using car::Anova:
car::Anova(add.model, type = 2)
car::Anova(add.model2, type = 2)
Which gives the same result for both models:
Sum Sq Df F value Pr(>F)
vs 0.00526 1 0.2073 0.6523
cyl 0.78887 1 31.1029 5.112e-06 ***
Residuals 0.73553 29
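In base R, drop1() gives equivalent marginal tests (a sketch: for an additive model such as this one, each term is tested conditional on all the others, matching the Type II table above):
drop1(add.model, test = "F")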
summary also provides equivalent (and consistent) metrics regardless of the order of predictors, though it's not quite a formal ANOVA table:
summary(add.model)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.92108 0.20714 18.930 < 2e-16 ***
vs -0.04414 0.09696 -0.455 0.652
cyl -0.15261 0.02736 -5.577 5.11e-06 ***

Repeated Measures: From SPSS to R

I am looking to run a mixed-effects model in R based on how I used to run the stats in SPSS with a repeated-measures ANOVA. The key below describes how the repeated-measures ANOVA was set up in SPSS. How would I convert this to lme4 in R?
Key:
EBT100... is the name of the task, Genotype is my IV, and my within-subject factors are Day (5 levels) and Cue (9 levels). Att is my DV.
In R, this is the code that I am trying to run:
lmeModel <- lmer(Att ~ Genotype*Day*Cue + (1|Subject), data = dat)
My Genotype Effect is the same between R and SPSS (p~0.12), but all of my interactions are different (Genotype x Day, Genotype x Cue, Genotype x Day x Cue).
R (lme4) Output:
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Genotype 488 243.9 2 32 2.272 0.11954
Day 25922 6480.4 4 1408 60.356 < 2.2e-16 ***
Cue 35821 4477.6 8 1408 41.703 < 2.2e-16 ***
Genotype:Day 3646 455.7 8 1408 4.244 4.751e-05 ***
Genotype:Cue 736 46.0 16 1408 0.429 0.97560
Day:Cue 5063 158.2 32 1408 1.474 0.04352 *
Genotype:Day:Cue 3297 51.5 64 1408 0.480 0.99984
SPSS Repeated Measures ANOVA output:
F.value Pr(>F)
Genotype 2.272 0.120
Day 9.603 0.000
Cue 83.916 0.000
Genotype:Day 0.675 0.712
Genotype:Cue 0.863 0.613
Day:Cue 3.168 0.000
Genotype:Day:Cue 1.031 0.411
You can see that the main effect of Genotype is the same for both R and SPSS, but the interactions differ, and the DenDF values in the R output do not look correct. Any idea why this would be?
Even more...
Using ezANOVA, with the same dataset that I am using for lme4, this is my code:
anova <- ezANOVA(data = dat,
wid = Subject,
dv = Att,
within = .(Day, Cue),
between = Genotype,
type = 3)
ezANOVA Output:
Effect DFn DFd F p p<.05 ges
2 Genotype 2 32 2.2715034 1.195449e-01 0.044348362
3 Day 4 128 9.6034152 8.003233e-07 * 0.103474748
5 Cue 8 256 83.9162989 3.938364e-67 * 0.137556761
4 Genotype:Day 8 128 0.6753544 7.124675e-01 0.015974029
6 Genotype:Cue 16 256 0.8624463 6.133218e-01 0.003267726
7 Day:Cue 32 1024 3.1679308 1.257738e-08 * 0.022046134
8 Genotype:Day:Cue 64 1024 1.0313631 4.115000e-01 0.014466102
How can I convert ezANOVA to lme4?
Any information would be greatly appreciated!
Thank you!
First off: It would be very beneficial and instructive if you could share your data, which allows for an easier comparison of lmer results with those from SPSS/ezANOVA.
Personally I prefer mixed-effects (i.e. hierarchical) models as I find them easier to understand (and construct), so I am not that familiar with repeated-measures ANOVA. Translating the latter into the former boils down to correctly translating the within/between effects of your RM-ANOVA into the appropriate terms of your lmer mixed-effects model.
Provided I understood you correctly, the following seems consistent with your problem statement:
Genotype is your fixed effect
Subject is your random (grouping or blocking) effect
Day is a within-Subject effect
Cue is a within-Subject effect
The corresponding lmer model should look something like this:
lmer(Att ~ Genotype * Day * Cue + (Day:Cue|Subject), data = dat)
If this is not tractable, you should try
lmer(Att ~ Genotype * Day * Cue + (Day|Subject) + (Cue|Subject) + (1|Subject), data = dat)
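As a sketch of how to get an F table from either fit (assuming your data frame is called dat and the lmerTest package is installed; its lmer() wrapper makes anova() report F tests with Satterthwaite denominator degrees of freedom):
library(lmerTest)
m <- lmer(Att ~ Genotype * Day * Cue + (Day|Subject) + (Cue|Subject) + (1|Subject), data = dat)
anova(m) # Type III tests with Satterthwaite df by default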

Access z-value and other statistics in output of Zelig relogit

I want to compute a logit regression for rare events. I decided to use the Zelig package (relogit function) to do so.
Usually, I use stargazer to extract and save regression results. However, there seem to be compatibility issues with these two packages (Using stargazer with Zelig).
I now want to extract the following information from the Zelig relogit output:
Coefficients, z values, p values, number of observations, log likelihood, AIC
I have managed to extract the p-values and coefficients, but I failed at the rest. These values must be accessible somehow, because they are reported in the summary() output; however, I did not manage to store the summary output as a useful R object, and it cannot be processed in the same way as a regular glm summary (https://stats.stackexchange.com/questions/176821/relogit-model-from-zelig-package-in-r-how-to-get-the-estimated-coefficients).
A reproducible example:
##Initiate package, model and data
require(Zelig)
data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
data = mid, model = "relogit")
##Call summary on output (reports in console most of the needed information)
summary(z.out1)
##Storing the summary fails and only produces a useless object
summary(z.out1) -> z.out1.sum
##Some of the output I can access as follows
z.out1$get_coef() -> z.out1.coeff
z.out1$get_pvalue() -> z.out1.p
z.out1$get_se() -> z.out1.se
However, I did not find similar accessor methods for the other elements, such as z values, AIC, etc. Since they are shown in the summary() call, they should be accessible somehow.
The summary call result:
Model:
Call:
z5$zelig(formula = conflict ~ major + contig + power + maxdem +
mindem + years, data = mid)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0742 -0.4444 -0.2772 0.3295 3.1556
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.535496 0.179685 -14.111 < 2e-16
major 2.432525 0.157561 15.439 < 2e-16
contig 4.121869 0.157650 26.146 < 2e-16
power 1.053351 0.217243 4.849 1.24e-06
maxdem 0.048164 0.010065 4.785 1.71e-06
mindem -0.064825 0.012802 -5.064 4.11e-07
years -0.063197 0.005705 -11.078 < 2e-16
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3979.5 on 3125 degrees of freedom
Residual deviance: 1868.5 on 3119 degrees of freedom
AIC: 1882.5
Number of Fisher Scoring iterations: 6
Next step: Use 'setx' method
Use from_zelig_model() to recover the underlying fitted model; that exposes the deviance, AIC, etc.
m <- from_zelig_model(z.out1)
m$aic
...
z values are the coefficients divided by their standard errors:
z.out1$get_coef()[[1]]/z.out1$get_se()[[1]]
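Putting it together (a sketch building on the code above; from_zelig_model() returns the underlying fitted model, so the usual extractor functions should apply):
m <- from_zelig_model(z.out1)
z_values <- coef(m) / sqrt(diag(vcov(m))) # coefficient / standard error
AIC(m) # 1882.5, matching the summary output
logLik(m) # log likelihood
nobs(m) # number of observations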

Confused about ANOVA in R

I am new to R and statistics and am trying to do a two-factor ANOVA on a dataset in a csv file, where each factor's values are in their own column. I was using
> mydata <- read.csv("myfile.csv")
> model = lm(result ~ factor1 * factor2, data=mydata)
As a check, I tried the ChickWeight data from the R sample datasets.
> anova(with(ChickWeight, lm(weight ~ Time + Diet)))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 2042344 2042344 1576.460 < 2.2e-16 ***
Diet 3 129876 43292 33.417 < 2.2e-16 ***
Residuals 573 742336 1296
> write.csv(file="ChickWeight.csv", x=ChickWeight, row.names=F)
> data = read.csv("ChickWeight.csv", header=T)
> anova(lm(weight ~ Time + Diet, data=data))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 2042344 2042344 1537.033 < 2.2e-16 ***
Diet 1 108177 108177 81.412 < 2.2e-16 ***
Residuals 575 764036 1329
Noticeably, degrees of freedom are lost for the Diet column when the data is read from csv into a dataframe. What am I missing here?
Got the clue from this post: Why do R and statsmodels give slightly different ANOVA results?
When the data is read from the CSV file, the Diet column becomes an ordinary numeric column, but for ANOVA it has to be a factor variable. (R cannot take care of this automatically because a numeric column could equally well hold a continuous measurement or category codes, and the CSV file stores no class information, so the factor class is lost on the round trip.)
So the solution was:
> data$Diet = factor(data$Diet)
> anova(lm("weight ~ Time + Diet", data=data))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 2042344 2042344 1576.460 < 2.2e-16 ***
Diet 3 129876 43292 33.417 < 2.2e-16 ***
Residuals 573 742336 1296
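Alternatively (a sketch), the column type can be declared at read time, so the round trip through CSV preserves the factor:
> data = read.csv("ChickWeight.csv", colClasses = c(Diet = "factor"))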

R: logistic regression using frequency table, cannot find correct Pearson Chi Square statistics

I implemented logistic regression on the following data frame and got reasonable results (the same as using STATA). But the Pearson chi-square statistic and degrees of freedom I got from R are very different from STATA's, which in turn gave me a very small p-value, and I cannot get the area under the ROC curve. Could anyone help me find out why residuals() does not work as expected on a glm() fit with prior weights, and how to deal with the area under the ROC curve?
Following is my code and output.
1. Data
Here is my data frame test_data, y is outcome, x1 and x2 are covariates:
y x1 x2 freq
0 No 0 268
0 No 1 14
0 Yes 0 109
0 Yes 1 1
1 No 0 31
1 No 1 6
1 Yes 0 45
1 Yes 1 6
I generated this data frame from the original data by counting the occurrences of each covariate pattern and storing the count in the new variable freq.
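That aggregation step might look like this (a sketch; raw_data is a hypothetical name for the original one-row-per-observation data):
library(dplyr)
# count occurrences of each (y, x1, x2) pattern into a freq column
test_data <- raw_data %>% count(y, x1, x2, name = "freq")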
2. GLM Model
Then I did the logistic regression as:
logit=glm(y~x1+x2, data=test_data, family=binomial, weights=freq)
Output shows:
Deviance Residuals:
1 2 3 4 5 6 7 8
-7.501 -3.536 -8.818 -1.521 11.957 3.501 10.409 2.129
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2010 0.1892 -11.632 < 2e-16 ***
x1 1.3538 0.2516 5.381 7.39e-08 ***
x2 1.6261 0.4313 3.770 0.000163 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 457.35 on 7 degrees of freedom
Residual deviance: 416.96 on 5 degrees of freedom
AIC: 422.96
Number of Fisher Scoring iterations: 5
Coefficients are the same as STATA.
3. Chi Square Statistics
When I tried to calculate the Pearson chi-square:
pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)
I got 488 instead of the 1.3 given by STATA. Also, the degrees of freedom R generates, chisq_dof = df.residual(logit) = 5, differ from STATA's 1. So I got an extremely small p-value (~e^-100).
4. Discrimination
Then I calculated the area under ROC curve as:
library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf$y, fitted(logit))$A
The output is:
[1] 0.5
Warning message:
In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") :
cannot compute exact p-value with ties
Thanks!
I figured out how to solve this problem eventually. The data set used above should first be summarised into covariate patterns; the Pearson chi-square is then calculated from its definition. The R code follows:
# extract covariate patterns
library(dplyr)
temp=test_data %>% group_by(x1, x2) %>% summarise(m=sum(freq),y=sum(freq*y))
temp$pred=fitted(logit)[1:4] # fitted probabilities; rows 1:4 cover the four covariate patterns
# calculate Pearson chi square
temp=mutate(temp, pearson=(y-m*pred)/sqrt(m*pred*(1-pred)))
pearson_chi2 = with(temp, sum(pearson^2))
temp_dof = 4-(2+1) #dof=J-(p+1)
# calculate p-value
pchisq(pearson_chi2, temp_dof, lower.tail=F)
The resulting p-value is 0.241941, the same as STATA's.
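An equivalent route (a sketch, reusing temp from above, where m is the number of trials and y the number of events per pattern) is to refit the model on the grouped data with a two-column cbind(events, failures) response; glm() then treats each covariate pattern as one binomial observation, and the standard Pearson residuals and df.residual() give the grouped statistics directly:
logit_g = glm(cbind(y, m - y) ~ x1 + x2, data = temp, family = binomial)
sum(residuals(logit_g, type = "pearson")^2) # grouped Pearson chi-square, ~1.3
df.residual(logit_g) # 4 covariate patterns - 3 parameters = 1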
In order to calculate the AUC, we should first expand the covariate patterns back to the "original" data, then use the expanded data to get the AUC. Note that we have 392 zeros and 88 ones in the frequency table. My code follows:
# expand observation
y_expand=c(rep(0,392),rep(1,88))
logit_mf = model.frame(logit)
logit_pred = fitted(logit)
logit_mf$freq=test_data$freq
# expand prediction
yhat_expand=rep(logit_pred, logit_mf$freq)
library(verification)
roc.area(y_expand, yhat_expand)$A
The AUC is 0.6760, the same as STATA's.
