Adjusted means with emmeans for a variable with 3 levels - r

I know how to get adjusted means with emmeans when a variable has 2 levels, such as sex.
sex == 1: men, sex == 2: women, so 2 levels.
The associated model with the subsequent adjusted mean (EMM) calculation is:
mean_MF <- lm(LZ~age + SES_3 + sex, data = MF)
summary(mean_MF)
emmeans(mean_MF, ~ sex)
and the output looks like this:
> emmeans(mean_MF, ~ sex)
sex emmean SE df lower.CL upper.CL
1 7.05 0.0193 20894 7.02 7.09
2 6.96 0.0187 20894 6.93 7.00
Results are averaged over the levels of: belastet_SZ, belastet_SNZ, guteSeiten_SZ, guteSeiten_SNZ, SES_3
Confidence level used: 0.95
But when I want to calculate the adjusted means of a variable with 3 levels, I get only a single adjusted mean at one common value instead of one for each of the 3 levels.
For example, age (Alter) has 3 categories, which are coded as follows:
18-30 years: 1
31-40 years: 2
41-51 years: 3
What else do I need to add to the emmeans call so that I get the adjusted means of all three levels?
F_Alter <- lm(LZ~ SES_3 + Alter, data = Frauen)
summary(F_Alter)
emmeans(F_Alter, ~ Alter)
The summary of (F_Alter) looks as follows:
> summary(F_Alter)
Call:
lm(formula = LZ ~ SES_3 + Alterfactor, data = Frauen)
Residuals:
Min 1Q Median 3Q Max
-7.2303 -1.1162 0.1951 1.1220 3.8838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.44956 0.05653 131.777 < 2e-16 ***
SES_3mittel -0.42539 0.04076 -10.437 < 2e-16 ***
SES_3niedrig -1.11411 0.05115 -21.781 < 2e-16 ***
Alterfactor -0.07309 0.02080 -3.513 0.000444 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.889 on 14481 degrees of freedom
(5769 observations deleted due to missingness)
Multiple R-squared: 0.03287, Adjusted R-squared: 0.03267
F-statistic: 164 on 3 and 14481 DF, p-value: < 2.2e-16
In the following output I get only a single value of 1.93 instead of my 3 levels and their respective EMMs.
emmeans(F_Alter, ~ Alter)
Alter emmean SE df lower.CL upper.CL
1.93 6.8 0.0179 14481 6.76 6.83
Results are averaged over the levels of: SES_3
Confidence level used: 0.95
What can I change in the emmeans formula to get the output for my 3 age levels (1, 2, 3)?

The predictor Alter in the original question was not coded as a factor, and so it was being treated as a continuous numeric variable in the model estimation and by emmeans.
The problem is fixed by creating a new factor variable,
Frauen$Alterfactor <- as.factor(Frauen$Alter)
and then using this new variable as the predictor in the model.
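A minimal sketch of the difference, on simulated data (the data frame d and its values are made up for illustration; only the column names LZ, Alter and Alterfactor follow the question). With a numeric predictor, lm() fits a single slope; with a factor, it fits one coefficient per non-reference level, which is what allows emmeans to report one mean per level.

```r
set.seed(1)
# Simulated stand-in for the Frauen data (hypothetical values)
d <- data.frame(Alter = sample(1:3, 300, replace = TRUE))
d$LZ <- 7 - 0.1 * d$Alter + rnorm(300)

# Numeric coding: a single slope for Alter
fit_num <- lm(LZ ~ Alter, data = d)

# Factor coding: separate coefficients for levels 2 and 3
d$Alterfactor <- as.factor(d$Alter)
fit_fac <- lm(LZ ~ Alterfactor, data = d)

names(coef(fit_num))  # "(Intercept)" "Alter"
names(coef(fit_fac))  # "(Intercept)" "Alterfactor2" "Alterfactor3"

# With the factor model, emmeans() returns one row per age group
# (only run if the emmeans package is installed)
if (requireNamespace("emmeans", quietly = TRUE)) {
  print(emmeans::emmeans(fit_fac, ~ Alterfactor))
}
```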

Related

Plotting mean and standard error of mean from linear regression

I've run a multiple linear regression where pred_acc is the continuous dependent variable and emotion_pred and emotion_target are two dummy-coded independent variables (0 and 1). I am also interested in the interaction between the two independent variables.
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = data_almost_final)
summary(model)
Residuals:
Min 1Q Median 3Q Max
-0.66049 -0.19522 0.01235 0.19213 0.67284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.97222 0.06737 14.432 < 2e-16 ***
emotion_pred 0.45988 0.09527 4.827 8.19e-06 ***
emotion_target 0.24383 0.09527 2.559 0.012719 *
emotion_pred:emotion_target -0.47840 0.13474 -3.551 0.000703 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2858 on 68 degrees of freedom
(1224 observations deleted due to missingness)
Multiple R-squared: 0.2555, Adjusted R-squared: 0.2227
F-statistic: 7.781 on 3 and 68 DF, p-value: 0.0001536
In case some context is needed: I ran a survey in which couples had to predict their partners' preferences. The predicting individual was in emotion state 0 or 1 (emotion_pred), and the target individual was in emotion state 0 or 1 (emotion_target). Accordingly, there are four combinations.
Now I want to plot the regression with the mean of each combination of the independent variables (0,1; 1,0; 1,1; 0,0) and add error bars showing the standard error of each mean. I have no idea how to do this. Can anyone help?
Here's an extract from my data:
pred_acc emotion_pred emotion_target
1 1.0000000 1 0
2 1.2222222 0 1
3 0.7777778 0 0
4 1.1111111 1 1
5 1.3888889 1 1
[Image: sketch of how I want the plot to look]
You can use emmip from the emmeans package:
library(emmeans)
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = d2)
emmip(model, emotion_pred ~ emotion_target, CIs = TRUE, style = "factor")
If you want more control over the image or just to get the values you can use the emmeans function directly:
> emmeans(model , ~ emotion_pred * emotion_target )
emotion_pred emotion_target emmean SE df lower.CL upper.CL
0 0 0.778 0.196 1 -1.718 3.27
1 0 1.000 0.196 1 -1.496 3.50
0 1 1.222 0.196 1 -1.274 3.72
1 1 1.250 0.139 1 -0.515 3.01
Then you can convert this result to a data frame with as.data.frame() and use ggplot to make whatever graph you like.
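A hedged sketch of that last step on simulated data (the data frame d is made up; only the column names follow the question). It uses base R's predict() with se.fit = TRUE as a stand-in for emmeans, which for this saturated two-factor model gives the four cell means and their standard errors. The output of as.data.frame(emmeans(...)) could be used in place of cells, with its emmean and SE columns.

```r
set.seed(42)
# Simulated data: 18 observations in each of the four combinations
d <- expand.grid(emotion_pred = factor(0:1), emotion_target = factor(0:1))
d <- d[rep(1:4, each = 18), ]
d$pred_acc <- 1 + rnorm(nrow(d), sd = 0.3)

model <- lm(pred_acc ~ emotion_pred * emotion_target, data = d)

# One row per combination, with its estimated mean and standard error
cells <- expand.grid(emotion_pred = factor(0:1), emotion_target = factor(0:1))
pr <- predict(model, newdata = cells, se.fit = TRUE)
cells$mean <- unname(pr$fit)
cells$se <- unname(pr$se.fit)

# Points joined by lines, with error bars of +/- one standard error
# (only run if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cells, aes(x = emotion_target, y = mean,
                         colour = emotion_pred, group = emotion_pred)) +
    geom_point() +
    geom_line() +
    geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.1)
  print(p)
}
```

Replacing mean - se and mean + se with confidence limits gives interval bars instead of standard-error bars.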

Test if intercepts in ancova model are significantly different in R

I ran a model explaining the weight of some plant as a function of time and trying to incorporate the treatment effect.
mod <- lm(weight ~ time + treatment, data = df)
The model looks like this (plot not shown), with the model summary being:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done by simply interpreting the t-value and p-value for the intercept? I guess not, because that row is the baseline (treatment A in this case). I'm a bit puzzled by this, as most sources I looked up pay little attention to differences in intercepts.
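One common reading of that H0, sketched on simulated data (the data frame df below is made up; only the variable names follow the question): in a model with a common slope, equal intercepts means that all treatment coefficients are zero, so the hypothesis can be tested by comparing the models with and without treatment. This is the same F test as the treatment row of the sequential ANOVA table above.

```r
set.seed(123)
# Simulated stand-in for the plant data (hypothetical values)
df <- data.frame(
  time = rep(1:30, times = 4),
  treatment = factor(rep(c("A", "B", "C", "D"), each = 30))
)
shift <- c(A = 0, B = 8, C = 5, D = 20)
df$weight <- -37 + 4.7 * df$time +
  unname(shift[as.character(df$treatment)]) +
  rnorm(nrow(df), sd = 9.5)

m0 <- lm(weight ~ time, data = df)              # one common intercept
m1 <- lm(weight ~ time + treatment, data = df)  # one intercept per treatment

# F test of H0: all four treatment intercepts are equal
cmp <- anova(m0, m1)
print(cmp)

# The treatment row here reports the same F statistic and p-value
print(anova(m1))
```

The t-tests on treatmentB, treatmentC and treatmentD in summary() each compare a single treatment with the baseline A, whereas this F test covers all of them jointly.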

How is Pr(>|t|) in a linear regression in R calculated?

What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed by R?
I understand that the value of Pr(>|t|) is a p-value, but I do not understand how the value is calculated.
For example, the value of Pr(>|t|) for x1 is displayed as 0.021 in the output below, and I want to know how this value was calculated.
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6 -2 0 2 6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the column t-value are obtained by dividing the coefficient estimate (which is in the Estimate column) by the standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value. It is the probability that the absolute value of a t-distributed random variable (with the residual degrees of freedom) is greater than 2.965. Using the symmetry of the t-distribution, this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus total number of coefficients = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
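The same check can be run for every coefficient at once. This is a small sketch using the data from the question; the object names fit, coefs, tval and pval are only illustrative.

```r
x1 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
x2 <- c(20, 30, 60, 70, 100, 110, 140, 150, 180, 190)
y  <- c(100, 120, 150, 180, 210, 220, 250, 280, 310, 330)

fit <- lm(y ~ x1 + x2)
coefs <- summary(fit)$coefficients

# t value = estimate / standard error, for each row of the table
tval <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Residual degrees of freedom: observations minus estimated coefficients
rdf <- df.residual(fit)

# Two-sided p-value from the t distribution
pval <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)

# Compare with the column that summary() reports
all.equal(pval, coefs[, "Pr(>|t|)"])
```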

Regression models with categorical variable: dummy code or convert to factor

I know that this might be a little bit of a silly question, but the main reason that I want to ask is because I have been taught DUMMY CODE! DUMMY CODE! DUMMY CODE! By multiple teachers in multiple classes all using R.
So I did this comparison on the Auto data set in the ISLR package.
library(ISLR)
Auto$c3 <- ifelse(Auto$cylinders == 3, 1, 0)
Auto$c4 <- ifelse(Auto$cylinders == 4, 1, 0)
Auto$c5 <- ifelse(Auto$cylinders == 5, 1, 0)
Auto$c6 <- ifelse(Auto$cylinders == 6, 1, 0)
Auto$c8 <- ifelse(Auto$cylinders == 8, 1, 0)
Auto$cylinders <- as.factor(Auto$cylinders)
summary(lm(mpg~displacement + cylinders, data = Auto))
summary(lm(mpg~displacement + c4 + c5 + c6 + c8, data = Auto))
Call:
lm(formula = mpg ~ displacement + cylinders, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-10.692 -2.694 -0.347 2.157 20.307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.33811 2.25278 10.80 < 2e-16 ***
displacement -0.05225 0.00693 -7.54 3.3e-13 ***
cylinders4 10.67609 2.23296 4.78 2.5e-06 ***
cylinders5 10.60478 3.39198 3.13 0.0019 **
cylinders6 7.04473 2.46493 2.86 0.0045 **
cylinders8 8.65170 2.92786 2.95 0.0033 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.39 on 386 degrees of freedom
Multiple R-squared: 0.687, Adjusted R-squared: 0.683
F-statistic: 170 on 5 and 386 DF, p-value: <2e-16
> summary(lm(mpg~displacement + c4 + c5 + c6 + c8, data = Auto))
Call:
lm(formula = mpg ~ displacement + c4 + c5 + c6 + c8, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-10.692 -2.694 -0.347 2.157 20.307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.33811 2.25278 10.80 < 2e-16 ***
displacement -0.05225 0.00693 -7.54 3.3e-13 ***
c4 10.67609 2.23296 4.78 2.5e-06 ***
c5 10.60478 3.39198 3.13 0.0019 **
c6 7.04473 2.46493 2.86 0.0045 **
c8 8.65170 2.92786 2.95 0.0033 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.39 on 386 degrees of freedom
Multiple R-squared: 0.687, Adjusted R-squared: 0.683
F-statistic: 170 on 5 and 386 DF, p-value: <2e-16
Both produce the same output, which in my head is not surprising. The thing that does surprise me is the fact that I have been taught to dummy code instead of converting to factor. Is there any analytical, computational, or any reason at all to dummy code over using a factor variable? Using a factor seems so much easier, requires less code, and you don't end up with a bunch of extra variables. The only possible advantage of dummy coding that I can see versus using a factor is that you can select your reference group, which I'm guessing you can probably do with a factor too.
Dummy coding can be done easily using the dummies package.
library(dummies)
#sample data
auto <- tail(ISLR::Auto,10)
#dummy coding
auto_dummyCoded <- cbind(auto, dummy(c("cylinders"), data=auto))
auto_dummyCoded
In the dummy coding above, two new variables are added (i.e. cylinders4, cylinders6), as there are two cylinders categories in the sample data.
Now, instead of dummy coding, let's convert the cylinders column to a factor before passing it to lm:
auto$cylinders <- as.factor(auto$cylinders)
fit <- lm(mpg ~ cylinders, data = auto, x = TRUE)
Let's print fit$x to see how the cylinders column was coded internally. R has encoded it as a single indicator column, cylinders6, plus a constant intercept column. That is one indicator fewer than the number of categories in the cylinders column, alongside one constant column. It is just an alternative way of dummy coding!
(Intercept) cylinders6
388 1 0
389 1 1
390 1 0
391 1 0
392 1 0
393 1 0
394 1 0
395 1 0
396 1 0
397 1 0
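The reference-group point raised in the question can be sketched with the built-in mtcars data (used here as a stand-in so the example needs no extra packages; the names cars, cyl_f, c6, c8 and cyl_ref8 are illustrative). model.matrix() shows the indicator columns R builds from a factor, and relevel() changes which level is the baseline.

```r
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)  # levels "4", "6", "8"

# Indicator columns R creates internally (baseline = first level, "4")
colnames(model.matrix(~ cyl_f, data = cars))

# Hand-made dummies give the same fit as the factor
cars$c6 <- as.integer(cars$cyl == 6)
cars$c8 <- as.integer(cars$cyl == 8)
fit_factor <- lm(mpg ~ disp + cyl_f, data = cars)
fit_dummy  <- lm(mpg ~ disp + c6 + c8, data = cars)
all.equal(unname(coef(fit_factor)), unname(coef(fit_dummy)))

# Choose a different reference group without recoding by hand
cars$cyl_ref8 <- relevel(cars$cyl_f, ref = "8")
colnames(model.matrix(~ cyl_ref8, data = cars))
```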

Difference between lm(y~x1/x2) and aov(y~x1+Error(x2))

I have trouble understanding the difference between these two notations.
According to R intro, y~x1/x2 represents x2 being nested within x1. If x1 is a factor and x2 a continuous variable, is lm(y~x1/x2) a correct representation of a nested ANCOVA?
What confuses me is that some online help topics suggest using aov(y~x1+Error(x2)) to represent a nested ANOVA, yet the two calls give completely different results.
For example:
x2 = rnorm(1000,2)
x1 = rep( c("A","B"), each=500)
y = x2*3+rnorm(1000)
Under this scenario I would expect x2 to be significant and x1 to be non significant.
summary(aov(y~x1+Error(x2)))
Error: x2
Df Sum Sq Mean Sq
x1 1 9262 9262
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 0.0 0.0003 0 0.985
Residuals 997 967.9 0.9708
aov() works as expected. However, lm()....
summary(lm( y~x1/x2))
Call:
lm(formula = y ~ x1/x2)
Residuals:
Min 1Q Median 3Q Max
-3.4468 -0.6352 0.0092 0.6526 2.8294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08727 0.09566 0.912 0.3618
x1B -0.24501 0.13715 -1.786 0.0743 .
x1A:x2 2.94012 0.04362 67.401 <2e-16 ***
x1B:x2 3.06272 0.04326 70.806 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9838 on 996 degrees of freedom
Multiple R-squared: 0.9058, Adjusted R-squared: 0.9055
F-statistic: 3191 on 3 and 996 DF, p-value: < 2.2e-16
x1 is marginally significant, and in many iterations it is highly significant. How can these results be so different?
What am I missing? Are those two formulas not supposed to represent the same thing? Or am I misunderstanding something about the underlying statistics?
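One piece that can be checked directly is how R expands the nesting operator. In this small sketch (it does not depend on the simulated data), terms() shows that y ~ x1/x2 is shorthand for y ~ x1 + x1:x2, i.e. a separate x2 slope within each level of x1, which matches the x1A:x2 and x1B:x2 rows in the lm() output above. Whether either formula is the appropriate nested model is a separate question that this does not settle.

```r
# Expand the nesting operator without fitting anything
nested <- terms(y ~ x1 / x2)
attr(nested, "term.labels")

# The explicit form produces the same model terms
expanded <- terms(y ~ x1 + x1:x2)
identical(attr(nested, "term.labels"), attr(expanded, "term.labels"))
```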
