How can I obtain the R-squared out of an ANOVA in R

I'm looking for the method/function that returns the R-squared of an ANOVA model in R.
Could not find anything so far.
Thanks

tl;dr: you can get the R-squared of the ANOVA by looking at the summary output of the corresponding linear model.
Let's go step by step:
1) Let's use the data from here
pain <- c(4, 5, 4, 3, 2, 4, 3, 4, 4, 6, 8, 4, 5, 4, 6, 5, 8, 6, 6, 7, 6, 6, 7, 5, 6, 5, 5)
drug <- c(rep("A", 9), rep("B", 9), rep("C", 9))
migraine <- data.frame(pain, drug)
2) Let's get the anova:
AOV <- aov(pain ~ drug, data=migraine)
summary(AOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 28.22 14.111 11.91 0.000256 ***
## Residuals 24 28.44 1.185
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
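Incidentally, aov() fits the model through lm() under the hood, so the fitted object inherits from both classes; you can verify this directly:
class(AOV)
## [1] "aov" "lm"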
3) Now, the ANOVA is directly related to the linear model, so let's fit that model and obtain the ANOVA from it:
LM <- lm(pain ~ drug, data=migraine)
anova(LM)
## Analysis of Variance Table
##
## Response: pain
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 28.222 14.1111 11.906 0.0002559 ***
## Residuals 24 28.444 1.1852
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As expected, the results are exactly the same. This means that...
4) We can get the R-squared from the linear model:
summary(LM)
## Call:
## lm(formula = pain ~ drug, data = migraine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7778 -0.7778 0.1111 0.3333 2.2222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6667 0.3629 10.104 4.01e-10 ***
## drugB 2.1111 0.5132 4.114 0.000395 ***
## drugC 2.2222 0.5132 4.330 0.000228 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.089 on 24 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4562
## F-statistic: 11.91 on 2 and 24 DF, p-value: 0.0002559
So the R-squared is 0.498
But what if we don't believe this?
5) What is the R-squared? It is the regression sum of squares divided by the total sum of squares (i.e., the regression sum of squares plus the residual sum of squares). So let's find those numbers in the ANOVA table and calculate the R-squared directly:
# We use the tidy function from the broom package to extract values
library(broom)
tidy_aov <- tidy(AOV)
tidy_aov
## term df sumsq meansq statistic p.value
## 1 drug 2 28.22222 14.111111 11.90625 0.0002558807
## 2 Residuals 24 28.44444 1.185185 NA NA
# The values we need are in the sumsq column of this data frame
sum_squares_regression <- tidy_aov$sumsq[1]
sum_squares_residuals <- tidy_aov$sumsq[2]
R_squared <- sum_squares_regression /
(sum_squares_regression + sum_squares_residuals)
R_squared
## 0.4980392
So we get the same result: R-squared is 0.4980392
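For completeness, here is a base-R sketch that avoids the broom dependency: the sums of squares can be pulled straight out of the anova table of the linear model.
ss <- anova(LM)$"Sum Sq" # regression SS and residual SS, in that order
ss[1] / sum(ss)
## [1] 0.4980392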

If you want to calculate the adjusted R-squared, you can apply the following formula (from https://www.statisticshowto.datasciencecentral.com/adjusted-r2/): adjusted R-squared = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k is the number of predictors:
s <- summary(LM)
r2 <- s$r.squared
n <- dim(migraine)[1]
k <- 2 # number of predictors (the two drug dummy variables)
#adjusted R-square
1 - ((1-r2)*(n-1)/(n-k-1))
#the same as
s$adj.r.squared
Adjustment means a penalty for additional variables ('k' in the formula), much as in the AIC calculation. If the goodness of fit, i.e. the ratio of explained to residual variation, does not increase appreciably when a new independent variable is added, you shouldn't include it.
So, R-squared will always increase as more and more variables are involved, while adjusted R-squared stops improving after a certain number of regressors.
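To see the penalization in action, here is a small sketch (the noise column is invented purely for illustration): adding a pure-noise predictor to the migraine model can only increase the multiple R-squared, while the adjusted R-squared is free to drop.
set.seed(1)
migraine$noise <- rnorm(nrow(migraine))
LM2 <- lm(pain ~ drug + noise, data = migraine)
summary(LM2)$r.squared # never lower than the 0.498 from before
summary(LM2)$adj.r.squared # may well be lower than the 0.4562 from before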

Related

R polynomial regression or group values and test between groups + outcome interpretation

I am trying to model the relation between the scar acquisition rate of a wild population of animals and time; I have calculated yearly rates beforehand.
If you look at the plot below, it seems to me that rates rise through the middle of the period and then fall again. I have tried to fit a polynomial LM with the code
model1 <- lm(Rate~poly(year, 2, raw = TRUE),data=yearlyratesub)
summary(model1)
model1
I have plotted using:
g <-ggplot(yearlyratesub, aes(year, Rate)) + geom_point(shape=1) + geom_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))
g
The model output was:
Call:
lm(formula = Rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.126332 -0.037683 -0.002602 0.053222 0.083503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.796e+03 3.566e+03 -2.467 0.0297 *
poly(year, 2, raw = TRUE)1 8.747e+00 3.545e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.174e-03 8.813e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0666 on 12 degrees of freedom
Multiple R-squared: 0.3369, Adjusted R-squared: 0.2264
F-statistic: 3.048 on 2 and 12 DF, p-value: 0.08503
How can I interpret that now? The overall model p-value is not significant, but the intercept and the individual slopes are?
Should I rather try another fit than x², or even group the values and test between groups, e.g. with an ANOVA? I know the LM has a low fit, but I guess that's because I have few values, and maybe x² is not the right form...?
I would be happy about input regarding model and outcome interpretation.
Grouping
Since the data was not provided (next time please provide a complete reproducible question including all inputs), we used the data in the Note at the end. We see that the model is highly significant if we group the points using the indicated breakpoints.
g <- factor(findInterval(yearlyratesub$year, c(2007.5, 2014.5))+1); g
## [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
## Levels: 1 2 3
fm <- lm(rate ~ g, yearlyratesub)
summary(fm)
giving
Call:
lm(formula = rate ~ g, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.018491 0.006091 0.029684 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.110854 0.019694 5.629 0.000111 ***
g2 0.127783 0.024687 5.176 0.000231 ***
g3 -0.006714 0.027851 -0.241 0.813574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03939 on 12 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.738
F-statistic: 20.72 on 2 and 12 DF, p-value: 0.0001281
We could consider combining the outer two groups.
g2 <- factor(g == 2)
fm2 <- lm(rate ~ g2, yearlyratesub)
summary(fm2)
giving:
Call:
lm(formula = rate ~ g2, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.016813 0.007096 0.031363 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10750 0.01341 8.015 2.19e-06 ***
g2TRUE 0.13114 0.01963 6.680 1.52e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03793 on 13 degrees of freedom
Multiple R-squared: 0.7744, Adjusted R-squared: 0.757
F-statistic: 44.62 on 1 and 13 DF, p-value: 1.517e-05
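If we want a formal check that collapsing the outer two groups costs nothing, a quick sketch is an F-test comparing the two nested models (fm2 is a restriction of fm):
anova(fm2, fm)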
Sinusoid
Looking at the graph, it seems that the points are turning up at the left and right edges, suggesting a sinusoidal fit: a + b * cos(c * year).
fm3 <- nls(rate ~ cbind(a = 1, b = cos(c * year)),
yearlyratesub, start = list(c = 0.5), algorithm = "plinear")
summary(fm3)
giving
Formula: rate ~ cbind(a = 1, b = cos(c * year))
Parameters:
Estimate Std. Error t value Pr(>|t|)
c 0.4999618 0.0001449 3449.654 < 2e-16 ***
.lin.a 0.1787200 0.0150659 11.863 5.5e-08 ***
.lin.b 0.0753754 0.0205818 3.662 0.00325 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05688 on 12 degrees of freedom
Number of iterations to convergence: 2
Achieved convergence tolerance: 5.241e-08
Comparison
Plotting the fits and looking at their residual sum of squares and AIC we have
plot(yearlyratesub)
# fm0 from Note at end, fm and fm2 are grouping models, fm3 is sinusoidal
L <- list(fm0 = fm0, fm = fm, fm2 = fm2, fm3 = fm3)
for(i in seq_along(L)) {
lines(fitted(L[[i]]) ~ year, yearlyratesub, col = i, lwd = 2)
}
legend("topright", names(L), col = seq_along(L), lwd = 2)
giving the following, where a lower residual sum of squares and a lower AIC (which takes the number of parameters into account) are better. We see that fm fits most closely based on residual sum of squares, with fm2 fitting almost as well; however, once the number of parameters is taken into account via AIC, fm2 has the lowest value and so is the most favored by that criterion.
cbind(RSS = sapply(L, deviance), AIC = sapply(L, AIC))
## RSS AIC
## fm0 0.05488031 -33.59161
## fm 0.01861659 -49.80813
## fm2 0.01870674 -51.73567
## fm3 0.04024237 -38.24512
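As a cross-check, BIC, which penalizes parameters more heavily than AIC, can be tabulated the same way:
sapply(L, BIC)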
Note
yearlyratesub <-
structure(list(year = c(2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019), rate = c(0.14099813521287,
0.0949946651016247, 0.0904788394070601, 0.11694517831575, 0.26786193592875,
0.256346628540479, 0.222029818828298, 0.180116679856725, 0.285467976459104,
0.174019208113095, 0.28461698734932, 0.0574827955982996, 0.103378448084776,
0.114593695172686, 0.141105952837639)), row.names = c(NA, -15L
), class = "data.frame")
fm0 <- lm(rate ~ poly(year, 2, raw = TRUE), yearlyratesub)
summary(fm0)
giving
Call:
lm(formula = rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.128335 -0.038289 -0.002715 0.054090 0.084792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.930e+03 3.621e+03 -2.466 0.0297 *
poly(year, 2, raw = TRUE)1 8.880e+00 3.600e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.207e-03 8.949e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06763 on 12 degrees of freedom
Multiple R-squared: 0.3381, Adjusted R-squared: 0.2278
F-statistic: 3.065 on 2 and 12 DF, p-value: 0.0841

Interpretation of Sum Sq in ANOVA with numeric independent variable

I don't know how to interpret the Sum of Squares in a numeric independent variable.
summary(aov(Petal.Width ~ Petal.Length + Species, iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Petal.Length 1 80.26 80.26 2487.02 < 2e-16 ***
## Species 2 1.60 0.80 24.77 5.48e-10 ***
## Residuals 146 4.71 0.03
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The sum of squares for Species is clear to me (the sum of squared deviations from the group means), but how do I interpret it if the independent variable is numeric, like Petal.Length?
The components of this linear model are not orthogonal, so we cannot calculate the sum of squares (SS) of each component independently of the others. Rather, we must make a sequence of model comparisons. In this case aov considers the following models, owing to the order in which the components were listed in the formula:
fm0 <- lm(Petal.Width ~ 1, iris) # null model
fm1 <- lm(Petal.Width ~ Petal.Length, iris)
fm2 <- lm(Petal.Width ~ Petal.Length + Species, iris) # full model
Note that the residual sum of squares (RSS) of a model fm is sum(resid(fm)^2), and R has a function specifically for this: deviance(fm). Keeping this in mind, we can decompose the RSS of the null model like this:
deviance(fm0) # RSS of null model
= (deviance(fm0) - deviance(fm1)) # SS of Petal.Length
+ (deviance(fm1) - deviance(fm2)) # SS of Species
+ deviance(fm2) # RSS of full model
and each sum of squares reported in the table in the question is one of the
lines above. That is,
deviance(fm0) - deviance(fm1) # SS of Petal.Length
## [1] 80.25984
deviance(fm1) - deviance(fm2) # SS of Species
## [1] 1.598453
deviance(fm2) # RSS of full model
## [1] 4.711643
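As a sanity check, the three pieces telescope back to the RSS of the null model:
(deviance(fm0) - deviance(fm1)) + (deviance(fm1) - deviance(fm2)) + deviance(fm2)
## [1] 86.56994
deviance(fm0) # the same, by construction
## [1] 86.56994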
Note
Note that the SS values we get depend on the sequence of models we use. For example, if we use the following sequence, which considers Species before Petal.Length (whereas above we considered Petal.Length and then Species), we get different SS values.
fm0 # same null model as above
fm1a <- lm(Petal.Width ~ Species, iris)
fm2 # same full model as above
deviance(fm0) - deviance(fm1a) # SS of Species
## [1] 80.41333
deviance(fm1a) - deviance(fm2) # SS of Petal.Length
## [1] 1.444957
deviance(fm2) # RSS of full model
## [1] 4.711643
and note that this does correspond to aov if we list the components in that order, i.e. this time we listed Species before Petal.Length to change the sequence of models that aov would consider:
summary(aov(Petal.Width ~ Species + Petal.Length, iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 1245.89 < 2e-16 ***
## Petal.Length 1 1.44 1.44 44.77 4.41e-10 ***
## Residuals 146 4.71 0.03
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
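If order-independent tests are wanted instead, base R's drop1() tests each term after all the others; the car package (an assumption here, it is not part of the question) offers the same idea as so-called Type II tests:
drop1(fm2, test = "F") # marginal F-tests from base R
# library(car); Anova(fm2, type = 2) # Type II ANOVA, if car is installed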

How is Pr(>|t|) in a linear regression in R calculated?

What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed by R?
I understand that the value of Pr(>|t|) is a p-value, but I do not understand how the value is calculated.
For example, although the value of Pr(>|t|) for x1 is displayed as 0.021 in the output below, I want to know how this value was calculated:
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6 -2 0 2 6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the column t-value are obtained by dividing the coefficient estimate (which is in the Estimate column) by the standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value. It is the probability that the absolute value of a t-distributed random variable is greater than 2.965. Using the symmetry of the t-distribution, this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus total number of coefficients = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
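Putting it all together, here is a sketch that reproduces the whole Pr(>|t|) column from scratch; it should match the summary output above:
fit <- lm(y ~ x1 + x2)
est <- coef(summary(fit))[, "Estimate"]
se <- coef(summary(fit))[, "Std. Error"]
tval <- est / se
2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE)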

How to perform a three-way (binary factors) between-subjects ANOVA with main effects and all interactions in R

The study randomized participants by Source (Expert vs Attractive) and by Argument (Strong vs Weak), and participants were categorized by Monitor type (High vs Low). I want to test the significance of the main effects, the two-way interactions, and the three-way interaction in the following data frame. Specifically:
Main effects = Self-Monitors (High vs. Low), Argument (Strong vs. Weak), Source (Attractive vs. Expert)
Two-way interactions = Self-Monitors*Argument, Self-Monitors*Source, Argument*Source
Three-way interaction = Self-Monitors*Argument*Source
This is the code:
data<-data.frame(Monitor=c(rep("High.Self.Monitors", 24),rep("Low.Self.Monitors", 24)),
Argument=c(rep("Strong", 24), rep("Weak", 24), rep("Strong", 24), rep("Weak", 24)),
Source=c(rep("Expert",12),rep("Attractive",12),rep("Expert",12),rep("Attractive",12),
rep("Expert",12),rep("Attractive",12),rep("Expert",12),rep("Attractive",12)),
Response=c(4,3,4,5,2,5,4,6,3,4,5,4,4,4,2,3,5,3,2,3,4,3,2,4,3,5,3,2,6,4,4,3,5,3,2,3,5,5,7,5,6,4,3,5,6,7,7,6,
3,5,5,4,3,2,1,5,3,4,3,4,5,4,3,2,4,6,2,4,4,3,4,3,5,6,4,7,6,7,5,6,4,6,7,5,6,4,4,2,4,5,4,3,4,2,3,4))
data$Monitor<-as.factor(data$Monitor)
data$Argument<-as.factor(data$Argument)
data$Source<-as.factor(data$Source)
I'd like to obtain the main effects, as well as all two-way interactions and the three-way interaction. However, if I type in anova(lm(Response ~ Monitor*Argument*Source, data=data)) I obtain:
Analysis of Variance Table
Response: Response
Df Sum Sq Mean Sq F value Pr(>F)
Monitor 1 24.000 24.0000 13.5322 0.0003947 ***
Source 1 0.667 0.6667 0.3759 0.5413218
Monitor:Source 1 0.667 0.6667 0.3759 0.5413218
Residuals 92 163.167 1.7736
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and if I enter summary(lm(Response ~ Monitor*Argument*Source, data=data))
Call:
lm.default(formula = Response ~ Monitor * Argument * Source,
data = data)
Residuals:
Min 1Q Median 3Q Max
-2.7917 -0.7917 0.2083 1.2083 2.5417
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4583 0.2718 12.722 < 2e-16 ***
MonitorLow.Self.Monitors 1.1667 0.3844 3.035 0.00313 **
ArgumentWeak NA NA NA NA
SourceExpert 0.3333 0.3844 0.867 0.38817
MonitorLow.Self.Monitors:ArgumentWeak NA NA NA NA
MonitorLow.Self.Monitors:SourceExpert -0.3333 0.5437 -0.613 0.54132
ArgumentWeak:SourceExpert NA NA NA NA
MonitorLow.Self.Monitors:ArgumentWeak:SourceExpert NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.332 on 92 degrees of freedom
Multiple R-squared: 0.1344, Adjusted R-squared: 0.1062
F-statistic: 4.761 on 3 and 92 DF, p-value: 0.00394
Any thoughts or ideas?
Edit
Your data isn't as well randomized as you say. In order to estimate a three-way interaction you would have to have a group of subjects with, for example, the "Low", "Strong" and "Expert" combination of levels of the three factors. You do not have such a group.
Look, for example, at:
table(data[,1:3])
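A fully crossed design needs every Monitor x Argument x Source cell to be non-empty, but here several cells contain no observations at all (note that Monitor, with 48 values, is silently recycled to the 96 rows when the data frame is built), which is why the corresponding coefficients come out as NA. A quick way to see the empty cells:
xtabs(~ Monitor + Argument + Source, data = data) # look for zero cells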

Regression on constant

I need to run a regression on a constant. In EViews, I don't need to put anything as a predictor when I run a regression on a constant. I don't know how to do that in R. Does anyone know what I should write in this command?
fit= lm(r~?)
You can specify a constant as 1 in a formula:
r <- 1:5
fit <- lm(r ~ 1)
summary(fit)
# Call:
# lm(formula = r ~ 1)
#
# Residuals:
# 1 2 3 4 5
# -2.00e+00 -1.00e+00 2.22e-16 1.00e+00 2.00e+00
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.0000 0.7071 4.243 0.0132 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.581 on 4 degrees of freedom
Note that you don't need lm to get this result:
mean(r)
#[1] 3
sd(r)/sqrt(length(r))
#[1] 0.7071068
However, you might want to use lm in order to have a Null model against which you can compare other models ...
Edit:
Since you comment that you need "the p-value", I suggest to use a t-test instead.
t.test(r)
# One Sample t-test
#
#data: r
#t = 4.2426, df = 4, p-value = 0.01324
#alternative hypothesis: true mean is not equal to 0
#95 percent confidence interval:
# 1.036757 4.963243
#sample estimates:
#mean of x
# 3
This is equivalent, but more efficient computationally.
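A quick sketch confirming that the two p-values agree exactly (reusing the fit object from above):
all.equal(coef(summary(fit))["(Intercept)", "Pr(>|t|)"], t.test(r)$p.value)
#[1] TRUE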
