Linear regression for each category of a variable - r

Let's say I'm working with the iris dataset in R:
data(iris)
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
I want to perform a linear regression where Petal.Length is the dependent variable, and Sepal.Length is the independent variable. How can I, in R, perform this regression for each Species category at once, getting values of P, R² and F for each test?

Use by(), which splits a data frame by a grouping factor and applies a function to each piece:
by(iris, iris$Species, \(x) summary(lm(Petal.Length ~ Sepal.Length, x)))
# iris$Species: setosa
#
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.40856 -0.08027 -0.00856 0.11708 0.46512
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.80305 0.34388 2.335 0.0238 *
# Sepal.Length 0.13163 0.06853 1.921 0.0607 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.1691 on 48 degrees of freedom
# Multiple R-squared: 0.07138, Adjusted R-squared: 0.05204
# F-statistic: 3.69 on 1 and 48 DF, p-value: 0.0607
#
# ---------------------------------------------------------
# iris$Species: versicolor
#
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.68611 -0.22827 -0.04123 0.19458 0.79607
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.18512 0.51421 0.360 0.72
# Sepal.Length 0.68647 0.08631 7.954 2.59e-10 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.3118 on 48 degrees of freedom
# Multiple R-squared: 0.5686, Adjusted R-squared: 0.5596
# F-statistic: 63.26 on 1 and 48 DF, p-value: 2.586e-10
#
# ---------------------------------------------------------
# iris$Species: virginica
#
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.68603 -0.21104 0.06399 0.18901 0.66402
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.61047 0.41711 1.464 0.15
# Sepal.Length 0.75008 0.06303 11.901 6.3e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2805 on 48 degrees of freedom
# Multiple R-squared: 0.7469, Adjusted R-squared: 0.7416
# F-statistic: 141.6 on 1 and 48 DF, p-value: 6.298e-16
To elaborate on my comment, we can extract the desired values very easily:
by(iris, iris$Species, \(x) lm(Petal.Length ~ Sepal.Length, x)) |>
  lapply(\(x) {
    with(summary(x), c(r2 = r.squared, f = fstatistic,
                       p = do.call(pf, c(as.list(unname(fstatistic)),
                                         lower.tail = FALSE))))
  }) |>
  do.call(what = rbind)
# r2 f.value f.numdf f.dendf p
# setosa 0.07138289 3.689765 1 48 6.069778e-02
# versicolor 0.56858983 63.263024 1 48 2.586190e-10
# virginica 0.74688439 141.636664 1 48 6.297786e-16

If you would like to pull out those values, we can use
library(dplyr)

df <- iris
list_res <- df %>%
  base::split(., df$Species, drop = FALSE) %>%
  lapply(., function(x) {
    fit <- lm(Petal.Length ~ Sepal.Length, data = x) %>%
      summary()
    r <- fit$r.squared
    coeffs <- fit$coefficients %>%
      as_tibble()
    f <- fit$fstatistic[[1]]
    list_res <- list(r, coeffs, f)
    names(list_res) <- c("R-Squared", "Coefficients", "F-Value")
    return(list_res)
  })
That returns a list of three objects for each regression model, including the desired values. I've left the coefficients table as it is, since it's always good to know which independent variable your p-values belong to. If you want those p-values pulled out separately, we can use coeffs <- fit$coefficients[, 4] %>% as.list(), for instance.
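As a minimal sketch (an addition, not part of the answer above), the same per-species models can be reduced to just the slope p-values in one named vector, using only base R:

```r
data(iris)

# Fit Petal.Length ~ Sepal.Length within each species and keep only
# the Pr(>|t|) entry for the slope.
pvals <- sapply(split(iris, iris$Species), function(x) {
  fit <- summary(lm(Petal.Length ~ Sepal.Length, data = x))
  fit$coefficients["Sepal.Length", "Pr(>|t|)"]
})
pvals
# setosa ~0.0607, versicolor ~2.59e-10, virginica ~6.30e-16,
# matching the Pr(>|t|) values printed above.
```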

Related

How can I run multiple stepwise linear regressions at once?

I am trying to predict which variables impact lift, which is the sales rate for food goods on promotion. In my dataset, lift is my dependent variable and I have eight possible independent variables. Here are the first couple of rows of my dataset.
I need to do this analysis for 20 different products across 30 different stores. I want to know if it is possible to run 20 regressions on all of the products simultaneously in R. This way I would only have to run 30 regressions manually, one for each store, and I would get results for each store. I would like to use stepwise because this is what I am familiar with.
Here is the code I have written so far using only one regression at a time:
data0<- subset(data0, Store == "Store 1")
data0<- subset(data0, Product == "Product 1")
########Summary Stats
head(data0)
summary(data0)
str(data0)
###Data Frame
library(plm)  # pdata.frame() comes from the plm package
data0<-pdata.frame(data0, index=c("Product","Time"))
data0<-data.frame(data0)
###Stepwise
step_qtr_1v<- lm(Lift ~
+ Depth
+ Length
+ Copromotion
+ Category.Sales.On.Merch
+ Quality.Support.Binary
, data = data0)
summary(step_qtr_1v)
I am new to R so would appreciate simplicity. Thank you.
It's really important to follow the guidelines when asking a question. Nonetheless, I've made a toy example with the iris dataset.
In order to run the same regressions multiple times over different parts of your dataset, you can use the lapply() function, which applies a function over a vector or list (in this case, the name of the species). The only thing you have to do is pass this to the subset argument in the lm() function:
data("iris")
species <- unique(iris$Species)
species
Running species shows the levels of this variable:
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
And running colnames(iris) tells us what variables to use:
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
The lapply function can be run thereafter like so:
models <- lapply(species, function(x) {
  lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
     data = iris, subset = iris$Species == x)
})
lapply(models, summary)
The result:
[[1]]
Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
Residuals:
Min 1Q Median 3Q Max
-0.38868 -0.07905 0.00632 0.10095 0.48238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.86547 0.34331 2.521 0.0152 *
Petal.Width 0.46253 0.23410 1.976 0.0542 .
Sepal.Length 0.11606 0.10162 1.142 0.2594
Sepal.Width -0.02865 0.09334 -0.307 0.7602
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1657 on 46 degrees of freedom
Multiple R-squared: 0.1449, Adjusted R-squared: 0.08914
F-statistic: 2.598 on 3 and 46 DF, p-value: 0.06356
[[2]]
Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
Residuals:
Min 1Q Median 3Q Max
-0.61706 -0.13086 -0.02966 0.09854 0.54311
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16506 0.40032 0.412 0.682
Petal.Width 1.36021 0.23569 5.771 6.37e-07 ***
Sepal.Length 0.43586 0.07938 5.491 1.67e-06 ***
Sepal.Width -0.10685 0.14625 -0.731 0.469
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2319 on 46 degrees of freedom
Multiple R-squared: 0.7713, Adjusted R-squared: 0.7564
F-statistic: 51.72 on 3 and 46 DF, p-value: 8.885e-15
[[3]]
Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
data = iris, subset = iris$Species == x)
Residuals:
Min 1Q Median 3Q Max
-0.7325 -0.1493 0.0516 0.1555 0.5866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46503 0.47686 0.975 0.335
Petal.Width 0.21565 0.17410 1.239 0.222
Sepal.Length 0.74297 0.07129 10.422 1.07e-13 ***
Sepal.Width -0.08225 0.15999 -0.514 0.610
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2819 on 46 degrees of freedom
Multiple R-squared: 0.7551, Adjusted R-squared: 0.7391
F-statistic: 47.28 on 3 and 46 DF, p-value: 4.257e-14
BTW, you are not performing any stepwise regression in your code. But the above example can be easily modified to do so.
Hope this helps.
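As a hedged sketch of that modification (an addition, not from the original answer): wrapping step() around each per-species fit turns the example above into a stepwise search within each group.

```r
data(iris)
species <- unique(iris$Species)

# Fit the full model per species, then let step() add/drop terms by AIC
step_models <- lapply(species, function(s) {
  full <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
             data = subset(iris, Species == s))
  step(full, direction = "both", trace = 0)  # trace = 0 silences the search log
})

lapply(step_models, formula)  # the selected formula for each species
```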

How to run a model with variables from different dataframes multiple times with lapply in R

I have 2 dataframes
#dummy df for examples:
set.seed(1)
df1 <- data.frame(t = 1:16,
                  A = sample(20, 16),
                  B = sample(30, 16),
                  C = sample(30, 16))
df2 <- data.frame(t = 1:16,
                  A = sample(20, 16),
                  B = sample(30, 16),
                  C = sample(30, 16))
I want to do this for every column in both dataframes (except the t column):
model <- lm(df2$A ~ df1$A, data = NULL)
I have tried something like this:
model <- function(yvar, xvar){
  lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
}
lapply(names(data), model)
but it obviously doesn't work. What am I doing wrong?
In the end, what I really want is to get the coefficients and other statistics from the models. But what is stopping me is how to run a linear model with variables from different dataframes multiple times.
I guess the output should look something like this:
# [[1]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[2]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[3]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
Since df1 and df2 have the same column names, you can do this:
model <- function(var){
  lm(df1[[var]] ~ df2[[var]])
}
result <- lapply(names(df1)[-1], model)
result
#[[1]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.1504 -0.4763
#[[2]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 3.0227 0.6374
#[[3]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.4240 0.2411
To get summary statistics from the models you can use broom::tidy:
purrr::map_df(result, broom::tidy, .id = 'model_num')
# model_num term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 (Intercept) 15.2 3.03 5.00 0.000194
#2 1 df2[[var]] -0.476 0.248 -1.92 0.0754
#3 2 (Intercept) 3.02 4.09 0.739 0.472
#4 2 df2[[var]] 0.637 0.227 2.81 0.0139
#5 3 (Intercept) 15.4 4.40 3.50 0.00351
#6 3 df2[[var]] 0.241 0.272 0.888 0.390
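If you want the model-level statistics (R², F, p-value) rather than the coefficients, broom::glance works the same way. A sketch, rebuilt on the same dummy data so it runs standalone:

```r
library(broom)
library(purrr)

set.seed(1)
df1 <- data.frame(t = 1:16, A = sample(20, 16), B = sample(30, 16), C = sample(30, 16))
df2 <- data.frame(t = 1:16, A = sample(20, 16), B = sample(30, 16), C = sample(30, 16))
result <- lapply(names(df1)[-1], function(var) lm(df1[[var]] ~ df2[[var]]))

# One row per model: r.squared, statistic (the F value), p.value, ...
glanced <- purrr::map_df(result, broom::glance, .id = "model_num")
glanced
```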

Missing data behaviour in lm: complete cases used even with predictors without missing data

My question: what is the most efficient way of removing a predictor with NAs and considering the complete cases excluding that predictor?
The question arises from the following regression situation with NAs, in which there are missing values in Ozone (mostly) and Solar.R.
data(airquality)
summary(airquality)
# Ozone Solar.R Wind Temp Month
# Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000
# 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000
# Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000
# Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993
# 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000
# Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000
# NA's :37 NA's :7
# Day
# Min. : 1.0
# 1st Qu.: 8.0
# Median :16.0
# Mean :15.8
# 3rd Qu.:23.0
# Max. :31.0
Regressing Wind on the remaining variables considers only the complete cases:
summary(lm(Wind ~ ., data = airquality))
#
# Call:
# lm(formula = Wind ~ ., data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -4.3908 -2.2800 -0.3078 1.4132 9.6501
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 15.519460 2.918393 5.318 5.96e-07 ***
# Ozone -0.060746 0.011798 -5.149 1.23e-06 ***
# Solar.R 0.003791 0.003216 1.179 0.241
# Temp -0.036604 0.044576 -0.821 0.413
# Month -0.159671 0.208082 -0.767 0.445
# Day 0.017353 0.031238 0.556 0.580
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.822 on 105 degrees of freedom
# (42 observations deleted due to missingness)
# Multiple R-squared: 0.3994, Adjusted R-squared: 0.3708
# F-statistic: 13.96 on 5 and 105 DF, p-value: 1.857e-10
If Ozone is removed from the formula with - Ozone, lm still uses only the cases that are complete with Ozone included. This is different from manually removing Ozone:
summary(lm(Wind ~ . - Ozone, data = airquality))
#
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -6.012 -2.323 -0.361 1.493 9.605
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 24.3159074 2.6354288 9.227 3.09e-15 ***
# Solar.R 0.0009228 0.0035281 0.262 0.794
# Temp -0.1900820 0.0369159 -5.149 1.21e-06 ***
# Month 0.0313046 0.2280600 0.137 0.891
# Day 0.0008969 0.0346116 0.026 0.979
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.143 on 106 degrees of freedom
# (42 observations deleted due to missingness)
# Multiple R-squared: 0.2477, Adjusted R-squared: 0.2193
# F-statistic: 8.727 on 4 and 106 DF, p-value: 3.961e-06
summary(lm(Wind ~ Solar.R + Temp + Wind + Month + Day, data = airquality))
#
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Wind + Month + Day, data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -8.1779 -2.2063 -0.2757 1.9448 9.3510
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.660271 2.416766 9.790 < 2e-16 ***
# Solar.R 0.002980 0.003113 0.957 0.340
# Temp -0.186386 0.032725 -5.695 6.89e-08 ***
# Month 0.074952 0.206334 0.363 0.717
# Day -0.011028 0.030304 -0.364 0.716
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.158 on 141 degrees of freedom
# (7 observations deleted due to missingness)
# Multiple R-squared: 0.2125, Adjusted R-squared: 0.1901
# F-statistic: 9.511 on 4 and 141 DF, p-value: 7.761e-07
It is indeed unfortunate and surprising that Wind ~ . - Ozone considers Ozone when finding complete cases; it seems worth discussion on the r-devel@r-project.org mailing list, if you want to pursue it. In the meantime, how about
summary(lm(Wind ~ ., data = subset(airquality, select = -Ozone)))
?
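To see why this matters, compare the complete-case counts with and without the Ozone column; the counts match the residual degrees of freedom printed above (105 + 6 = 111 and 141 + 5 = 146):

```r
data(airquality)
aq2 <- subset(airquality, select = -Ozone)

sum(complete.cases(airquality))  # 111 rows when Ozone's NAs are counted
sum(complete.cases(aq2))         # 146 rows once Ozone is dropped first
```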

R- How to save the console data into a row/matrix or data frame for future use?

I'm performing a multiple regression to find the best model to predict prices. See the following output in the R console.
I'd like to store the first column (Estimate) in a row/matrix or data frame for future use, such as deploying on the web with R Shiny.
(Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY
 - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H)
Can somebody kindly advise? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef on your lm output.
e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
See also names(m) which shows you some things you can extract, e.g. m$residuals (or equivalently, resid(m)).
And also methods(class='lm') will show you some other functions that work on a lm.
> methods(class='lm')
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta dfbetas drop1 dummy.coef effects extractAIC family
[16] formula hatvalues influence initialize kappa labels logLik model.frame model.matrix nobs plot predict print proj qr
[31] residuals rstandard rstudent show simulate slotsFromS3 summary variable.names vcov
(oddly, 'coef' is not in there? ah well)
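If the goal is to keep those estimates around for later use (e.g. a Shiny app), one hedged option, an addition to the answer above using the same stand-in iris model, is to turn coef() into a data frame:

```r
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)

# One row per term: the Estimate column as a reusable data frame
est <- data.frame(term = names(coef(m)), estimate = unname(coef(m)))
est
# saveRDS(est, "estimates.rds")  # persist it; readRDS() later in the app
```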
Besides, I'd like to know if there is a command to show the "residual percentage"
= (actual value - fitted value) / actual value; currently the residuals() command
only shows the values below, but I need the percentages instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
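There is no built-in percentage option, but it is one line of arithmetic. A sketch with a stand-in iris model, since fit3ab isn't shown:

```r
# Stand-in model; replace with fit3ab
fit <- lm(Sepal.Length ~ Sepal.Width + Species, iris)

actual <- fitted(fit) + resid(fit)           # reconstructs the observed response
pct_resid <- (actual - fitted(fit)) / actual # (actual - fitted) / actual
head(pct_resid)
```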

contrasts in anova

I understand the contrasts from previous posts and I think I am doing the right thing but it is not giving me what I would expect.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
       12.48229, 12.1188, 11.76914, 0, 0,
       13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g",5),rep("i",5),rep("t",5)))
type
[1] g g g g g i i i i i t t t t t
Levels: g i t
When I run this:
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.514 1.729 6.659 2.33e-05 ***
typei -4.240 2.445 -1.734 0.109
typet 2.222 2.445 0.909 0.381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
Here my reference is my type "g", so my typei is the difference between type "g" and type "i", and my typet is the difference between type "g" and type "t".
I wanted to see two more contrasts here: the difference between the average of types "g" and "i" and type "t", and the difference between type "i" and type "t".
So I set the contrasts:
> contrasts(type) <- cbind(c(-1, -1, 2), c(0, -1, 1))
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.8412 0.9983 10.860 1.46e-07 ***
type1 -0.6728 1.4118 -0.477 0.642
type2 4.2399 2.4453 1.734 0.109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
When I try to do the second contrast by changing my reference I get different results. I don't understand what is wrong with my contrast.
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
mat <- cbind(rep(1/3, 3), "g+i vs t" = c(-1/2, -1/2, 1), "i vs t" = c(0, -1, 1))
mymat <- solve(t(mat))
my.contrast <- mymat[, 2:3]
contrasts(type) <- my.contrast
summary.lm(aov(x ~ type))
> my.contrast
     g+i vs t i vs t
[1,]  -1.3333      1
[2,]   0.6667     -1
[3,]   0.6667      0
> contrasts(type) <- my.contrast
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.274 -0.414 0.097 0.663 5.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.841 0.998 10.86 1.5e-07 ***
typeg+i vs t 4.342 2.118 2.05 0.063 .
typei vs t 6.462 2.445 2.64 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.87 on 12 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.271
F-statistic: 3.6 on 2 and 12 DF, p-value: 0.0594
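As a sanity check (an addition, not part of the original exchange), the last two estimates can be reproduced directly from the group means, confirming that this contrast matrix tests what was intended:

```r
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
       12.48229, 12.1188, 11.76914, 0, 0,
       13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(rep(c("g", "i", "t"), each = 5))

m <- tapply(x, type, mean)               # per-group means
m[["t"]] - mean(m[c("g", "i")])          # "g+i vs t" estimate, ~4.342
m[["t"]] - m[["i"]]                      # "i vs t" estimate, ~6.462
```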