I don't know how to interpret the Sum of Squares in a numeric independent variable.
summary(aov(Petal.Width ~ Petal.Length + Species, iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Petal.Length 1 80.26 80.26 2487.02 < 2e-16 ***
## Species 2 1.60 0.80 24.77 5.48e-10 ***
## Residuals 146 4.71 0.03
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The sum of squares in Species are clear to me (sum of squared deviations from the group means) but how to interpret them if you have a numeric independent variable like Petal.Length?
The components of this linear model are not orthogonal, so we cannot calculate the sum of squares (SS) of each component independently of the others. Rather, we must take a sequence of model comparisons. In this case aov considers these models, owing to the order in which the components were listed in the formula:
fm0 <- lm(Petal.Width ~ 1, iris) # null model
fm1 <- lm(Petal.Width ~ Petal.Length, iris)
fm2 <- lm(Petal.Width ~ Petal.Length + Species, iris) # full model
Note that the residual sum of squares (RSS) of a model fm is sum(resid(fm)^2), and R has a function specifically for this: deviance(fm). Keeping this in mind, we can decompose the RSS of the null model like this:
deviance(fm0)                         # RSS of null model
  = (deviance(fm0) - deviance(fm1))   # SS of Petal.Length
  + (deviance(fm1) - deviance(fm2))   # SS of Species
  + deviance(fm2)                     # RSS of full model
and each sum of squares reported in the table in the question corresponds to one of the lines above. That is,
deviance(fm0) - deviance(fm1) # SS of Petal.Length
## [1] 80.25984
deviance(fm1) - deviance(fm2) # SS of Species
## [1] 1.598453
deviance(fm2) # RSS of full model
## [1] 4.711643
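As a sanity check, the components telescope back to the RSS of the null model (the total sum of squares of Petal.Width about its mean):
(deviance(fm0) - deviance(fm1)) +
  (deviance(fm1) - deviance(fm2)) +
  deviance(fm2)
## [1] 86.56994
deviance(fm0)
## [1] 86.56994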
Note
The SS values we get depend on the sequence of models used. For example, if we use the following sequence, which considers Species before Petal.Length (whereas above we considered Petal.Length and then Species), we get different SS values.
fm0 # same null model as above
fm1a <- lm(Petal.Width ~ Species, iris)
fm2 # same full model as above
deviance(fm0) - deviance(fm1a) # SS of Species
## [1] 80.41333
deviance(fm1a) - deviance(fm2) # SS of Petal.Length
## [1] 1.444957
deviance(fm2) # RSS of full model
## [1] 4.711643
Note that this corresponds to the aov output if we list the components in that order, i.e. this time we list Species before Petal.Length to change the sequence of models that aov considers:
summary(aov(Petal.Width ~ Species + Petal.Length, iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 1245.89 < 2e-16 ***
## Petal.Length 1 1.44 1.44 44.77 4.41e-10 ***
## Residuals 146 4.71 0.03
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
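Only the split between the two terms depends on the order; the total is the same in both tables:
sum(summary(aov(Petal.Width ~ Petal.Length + Species, iris))[[1]][["Sum Sq"]])
## [1] 86.56994
sum(summary(aov(Petal.Width ~ Species + Petal.Length, iris))[[1]][["Sum Sq"]])
## [1] 86.56994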
I am trying to predict which variables impact lift, the sales rate for food goods on promotion. In my dataset, lift is my dependent variable and I have eight possible independent variables. Here are the first couple of rows of my dataset.
I need to do this analysis for 20 different products across 30 different stores. I want to know if it is possible to run 20 regressions on all of the products simultaneously in R. This way I would only have to run 30 regressions manually, one for each store, and I would get results for each store. I would like to use stepwise because this is what I am familiar with.
Here is the code I have written so far using only one regression at a time:
data0<- subset(data0, Store == "Store 1")
data0<- subset(data0, Product == "Product 1")
########Summary Stats
head(data0)
summary(data0)
str(data0)
###Data Frame (pdata.frame() requires the plm package)
library(plm)
data0 <- pdata.frame(data0, index = c("Product", "Time"))
data0 <- data.frame(data0)
###Stepwise
step_qtr_1v <- lm(Lift ~ Depth + Length + Copromotion +
                    Category.Sales.On.Merch + Quality.Support.Binary,
                  data = data0)
summary(step_qtr_1v)
I am new to R so would appreciate simplicity. Thank you.
It's really important to follow the guidelines when asking a question. Nonetheless, I've made a toy example with the iris dataset.
In order to run the same regression multiple times over different parts of your dataset, you can use the lapply() function, which applies a function over a vector or list (in this case, the species names). The only thing you have to do is pass each species to the subset argument of the lm() function:
data("iris")
species <- unique(iris$Species)
species
Running species shows the levels of this variable:
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
And running colnames(iris) tells us what variables to use:
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
The lapply function can be run thereafter like so:
models <- lapply(species, function(x) {
lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
})
lapply(models, summary)
The result:
[[1]]
Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
Residuals:
Min 1Q Median 3Q Max
-0.38868 -0.07905 0.00632 0.10095 0.48238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.86547 0.34331 2.521 0.0152 *
Petal.Width 0.46253 0.23410 1.976 0.0542 .
Sepal.Length 0.11606 0.10162 1.142 0.2594
Sepal.Width -0.02865 0.09334 -0.307 0.7602
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1657 on 46 degrees of freedom
Multiple R-squared: 0.1449, Adjusted R-squared: 0.08914
F-statistic: 2.598 on 3 and 46 DF, p-value: 0.06356
[[2]]
Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
Residuals:
Min 1Q Median 3Q Max
-0.61706 -0.13086 -0.02966 0.09854 0.54311
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16506 0.40032 0.412 0.682
Petal.Width 1.36021 0.23569 5.771 6.37e-07 ***
Sepal.Length 0.43586 0.07938 5.491 1.67e-06 ***
Sepal.Width -0.10685 0.14625 -0.731 0.469
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2319 on 46 degrees of freedom
Multiple R-squared: 0.7713, Adjusted R-squared: 0.7564
F-statistic: 51.72 on 3 and 46 DF, p-value: 8.885e-15
[[3]]
Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
Residuals:
Min 1Q Median 3Q Max
-0.7325 -0.1493 0.0516 0.1555 0.5866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46503 0.47686 0.975 0.335
Petal.Width 0.21565 0.17410 1.239 0.222
Sepal.Length 0.74297 0.07129 10.422 1.07e-13 ***
Sepal.Width -0.08225 0.15999 -0.514 0.610
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2819 on 46 degrees of freedom
Multiple R-squared: 0.7551, Adjusted R-squared: 0.7391
F-statistic: 47.28 on 3 and 46 DF, p-value: 4.257e-14
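Since lapply() returns an unnamed list, it can help to label the fits by species (a small optional step) so individual summaries are easy to pick out:
names(models) <- species
summary(models[["virginica"]])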
BTW, you are not performing any stepwise regression in your code, but the example above is easily modified to do so; see the sketch below.
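A minimal sketch of that modification, assuming backward selection by AIC on the same toy variables: wrap each lm() fit in step().
models_step <- lapply(species, function(x) {
  fit <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
            data = iris, subset = iris$Species == x)
  step(fit, direction = "backward", trace = 0)  # silent AIC-based search
})
lapply(models_step, summary)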
Hope this helps.
The default OLS regression in R gives me the p-value regarding whether or not the coefficient is different from zero.
Is there a way to change this default regarding coefficients that are different from one?
Thank you
Just carry out the corresponding linear hypothesis test. In R, use the function car::linearHypothesis():
mod <- lm(Sepal.Width~., iris)
then run either of the following (lht() is an alias for linearHypothesis()) to test whether the coefficient for Petal.Length equals 1:
car::linearHypothesis(mod, "Petal.Length = 1")
car::lht(mod, "Petal.Length = 1")
Linear hypothesis test
Hypothesis:
Petal.Length = 1
Model 1: restricted model
Model 2: Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + Species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 145 24.837
2 144 10.328 1 14.509 202.31 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
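For a single coefficient you can also compute this by hand, shifting the estimate by the hypothesized value; a minimal sketch:
est <- coef(summary(mod))["Petal.Length", ]
t_stat <- (est["Estimate"] - 1) / est["Std. Error"]  # test against 1, not 0
p_val <- 2 * pt(-abs(t_stat), df = df.residual(mod))
c(t_stat, p_val)  # t_stat^2 reproduces the F statistic above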
I'm looking for the method/function that returns the R-squared of an anova model in R.
Could not find anything so far.
Thanks
tl;dr: you can get the R-squared of the anova by looking at the summary output of the corresponding linear model
Let's go step by step:
1) Let's use the following example data:
pain <- c(4, 5, 4, 3, 2, 4, 3, 4, 4, 6, 8, 4, 5, 4, 6, 5, 8, 6, 6, 7, 6, 6, 7, 5, 6, 5, 5)
drug <- c(rep("A", 9), rep("B", 9), rep("C", 9))
migraine <- data.frame(pain, drug)
2) Let's get the anova:
AOV <- aov(pain ~ drug, data=migraine)
summary(AOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 28.22 14.111 11.91 0.000256 ***
## Residuals 24 28.44 1.185
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3) Now, the anova is directly related to the linear model, so let's fit that model and get the anova from it:
LM <- lm(pain ~ drug, data=migraine)
anova(LM)
## Analysis of Variance Table
##
## Response: pain
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 28.222 14.1111 11.906 0.0002559 ***
## Residuals 24 28.444 1.1852
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As expected, the results are exactly the same. This means that...
4) We can get the R-squared from the linear model:
summary(LM)
## Call:
## lm(formula = pain ~ drug, data = migraine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7778 -0.7778 0.1111 0.3333 2.2222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6667 0.3629 10.104 4.01e-10 ***
## drugB 2.1111 0.5132 4.114 0.000395 ***
## drugC 2.2222 0.5132 4.330 0.000228 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.089 on 24 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4562
## F-statistic: 11.91 on 2 and 24 DF, p-value: 0.0002559
So the R-squared is 0.498
But what if we don't believe this?
5) What is the R-squared? It's the regression sum of squares divided by the total sum of squares (i.e., the regression sum of squares plus the residual sum of squares). So let's find those numbers in the anova and calculate the R-squared directly:
# We use the tidy function from the broom package to extract values
library(broom)
tidy_aov <- tidy(AOV)
tidy_aov
## term df sumsq meansq statistic p.value
## 1 drug 2 28.22222 14.111111 11.90625 0.0002558807
## 2 Residuals 24 28.44444 1.185185 NA NA
# The values we need are in the sumsq column of this data frame
sum_squares_regression <- tidy_aov$sumsq[1]
sum_squares_residuals <- tidy_aov$sumsq[2]
R_squared <- sum_squares_regression /
(sum_squares_regression + sum_squares_residuals)
R_squared
## 0.4980392
So we get the same result: R-squared is 0.4980392
If you want to calculate the Adjusted R-square then you can apply the following formula (from https://www.statisticshowto.datasciencecentral.com/adjusted-r2/):
s <- summary(LM)
r2 <- s$r.squared
n <- dim(migraine)[1]  # number of observations
k <- 2                 # number of predictors (the two drug dummies)
#adjusted R-square
1 - ((1-r2)*(n-1)/(n-k-1))
#the same as
s$adj.r.squared
Adjustment means a penalty for additional variables ('k' in the formula), much as in the AIC calculation. If adding a new independent variable does not improve the goodness of fit (the ratio of explained to residual variation) enough, you shouldn't include it.
So R-square will always increase as you involve more and more variables, while Adjusted R-square stops improving after a certain number of regressors.
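A quick way to see this for yourself (a sketch using a pure-noise regressor; the exact numbers vary with the seed):
set.seed(1)
noisy <- summary(lm(pain ~ drug + rnorm(27), data = migraine))
noisy$r.squared      # never below 0.498: R-squared cannot decrease
noisy$adj.r.squared  # typically drops below 0.4562 for a useless regressor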
I'm making a scatterplot matrix with ggpairs (GGally) as follows, but I'd like to display the p-values for each term in my aov results in the upper panels, rather than just the overall and by-species correlation values that come with the package.
How can I get the right column from this aov result into my upper plots? Can I write a custom function to do this, and how? Is it even possible using ggpairs? Thanks.
library(GGally);library(ggplot2)
pm <- ggpairs(data = iris,
mapping = aes(color = Species),
columns = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
pm
result <- aov(Sepal.Length ~ Sepal.Width*Petal.Length, data = iris)
print(summary(result))
Df Sum Sq Mean Sq F value Pr(>F)
Sepal.Width 1 1.41 1.41 12.9 0.000447 ***
Petal.Length 1 84.43 84.43 771.4 < 2e-16 ***
Sepal.Width:Petal.Length 1 0.35 0.35 3.2 0.075712 .
Residuals 146 15.98 0.11
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
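A custom panel function is possible: ggpairs() accepts one through its upper argument. Below is a minimal sketch that fits a per-panel aov() of y on x and prints its p-value; the simple y ~ x model is an assumption, so you would need to wire in your own interaction terms.
library(GGally); library(ggplot2)
aov_p <- function(data, mapping, ...) {
  x <- eval_data_col(data, mapping$x)  # extract the panel's variables
  y <- eval_data_col(data, mapping$y)
  p <- summary(aov(y ~ x))[[1]][["Pr(>F)"]][1]  # assumed per-panel model
  ggally_text(label = paste0("p = ", signif(p, 2))) + theme_void()
}
ggpairs(data = iris,
        columns = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        upper = list(continuous = aov_p))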
I need to run a regression on a constant. In EViews, I don't need to put anything as a predictor when I run a regression on a constant. I don't know how to do that in R. Does anyone know what I should write in this command?
fit= lm(r~?)
You can specify a constant as 1 in a formula:
r <- 1:5
fit <- lm(r ~ 1)
summary(fit)
# Call:
# lm(formula = r ~ 1)
#
# Residuals:
# 1 2 3 4 5
# -2.00e+00 -1.00e+00 2.22e-16 1.00e+00 2.00e+00
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.0000 0.7071 4.243 0.0132 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.581 on 4 degrees of freedom
Note that you don't need lm to get this result:
mean(r)
#[1] 3
sd(r)/sqrt(length(r))
#[1] 0.7071068
However, you might want to use lm in order to have a Null model against which you can compare other models ...
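For example (a sketch with a made-up predictor x), the intercept-only fit serves as the baseline in an F test:
set.seed(42)
x <- rnorm(5)       # hypothetical predictor, for illustration only
fit1 <- lm(r ~ x)
anova(fit, fit1)    # F test of whether x improves on the constant-only model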
Edit:
Since you comment that you need "the p-value", I suggest using a t-test instead.
t.test(r)
# One Sample t-test
#
#data: r
#t = 4.2426, df = 4, p-value = 0.01324
#alternative hypothesis: true mean is not equal to 0
#95 percent confidence interval:
# 1.036757 4.963243
#sample estimates:
#mean of x
# 3
This is equivalent, but more efficient computationally.
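You can check the equivalence directly; the intercept row of the lm summary reproduces the t-test:
summary(fit)$coefficients["(Intercept)", ]
# estimate 3, t = 4.2426, p = 0.01324, matching t.test(r) above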