Plotting mean and standard error of the mean from linear regression in R

I've run a multiple linear regression where pred_acc is the continuous dependent variable and emotion_pred and emotion_target are two dummy-coded (0/1) independent variables. I am also interested in the interaction between the two independent variables.
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = data_almost_final)
summary(model)
Residuals:
Min 1Q Median 3Q Max
-0.66049 -0.19522 0.01235 0.19213 0.67284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.97222 0.06737 14.432 < 2e-16 ***
emotion_pred 0.45988 0.09527 4.827 8.19e-06 ***
emotion_target 0.24383 0.09527 2.559 0.012719 *
emotion_pred:emotion_target -0.47840 0.13474 -3.551 0.000703 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2858 on 68 degrees of freedom
(1224 observations deleted due to missingness)
Multiple R-squared: 0.2555, Adjusted R-squared: 0.2227
F-statistic: 7.781 on 3 and 68 DF, p-value: 0.0001536
In case some context is needed: I did a survey where couples had to predict their partner's preferences. The predicting individual was either in emotion state 0 or 1 (emotion_pred), and the target individual was either in emotion state 0 or 1 (emotion_target). Accordingly, there are four combinations.
Now I want to plot the regression with the mean of each combination of the independent variables (0,1; 1,0; 1,1; 0,0) and add error bars showing the standard error of the mean. Can anyone help me with this?
Here's an extraction from my data:
pred_acc emotion_pred emotion_target
1 1.0000000 1 0
2 1.2222222 0 1
3 0.7777778 0 0
4 1.1111111 1 1
5 1.3888889 1 1
Sketch of how I want it to look:

Using emmip from the emmeans library:
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = d2)  # d2 is the posted data, with the two predictors coded as factors
emmip(model, emotion_pred ~ emotion_target, CIs = TRUE, style = "factor")
If you want more control over the image or just to get the values you can use the emmeans function directly:
> emmeans(model, ~ emotion_pred * emotion_target)
emotion_pred emotion_target emmean SE df lower.CL upper.CL
0 0 0.778 0.196 1 -1.718 3.27
1 0 1.000 0.196 1 -1.496 3.50
0 1 1.222 0.196 1 -1.274 3.72
1 1 1.250 0.139 1 -0.515 3.01
Then you can use ggplot on this dataframe to make whatever graph you like.
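For example, here is a minimal sketch, assuming the model fitted above and that emotion_pred and emotion_target are coded as factors (as in the output), that draws the four cell means with error bars of plus or minus one standard error:
library(emmeans)
library(ggplot2)

# Turn the EMM grid into a plain data frame for plotting
emm <- as.data.frame(emmeans(model, ~ emotion_pred * emotion_target))

# Four cell means, connected within emotion_pred, with +/- 1 SE error bars
ggplot(emm, aes(x = emotion_target, y = emmean,
                colour = emotion_pred, group = emotion_pred)) +
  geom_point(position = position_dodge(width = 0.2)) +
  geom_line(position = position_dodge(width = 0.2)) +
  geom_errorbar(aes(ymin = emmean - SE, ymax = emmean + SE),
                width = 0.1, position = position_dodge(width = 0.2)) +
  labs(x = "emotion_target", y = "estimated mean pred_acc")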

Related

Type 3 Anova and lm summary outputs different results, continuous variables

After running an lm() model, I've found that the coefficients in the summary() output and the Anova (type III) output are not the same. I think lm() is supposed to use type III sums of squares, so I'm hoping to figure out why I'm getting different results. The p-values are the same. I've tried setting options(contrasts = c("contr.sum", "contr.poly")) in a couple of different ways for the Anova() call, but it doesn't seem to change the output, and I'm pretty sure the two Anovas are doing the same thing. I don't think I actually need contrasts for my data, because all of my variables are continuous, but I was just trying everything.
#First version of type III Anova:
ImmMarmod.III.aov <- car::Anova(IMMMAR_model.logRTL, type = 3)
ImmMarmod.III.aov
Anova Table (Type III tests)
Response: LOG10RTL
Sum Sq Df F value Pr(>F)
(Intercept) 2.36246 1 57.1843 8.789e-09 ***
fat 0.00208 1 0.0504 0.8237
hap 0.03801 1 0.9200 0.3442
bka 0.08196 1 1.9839 0.1681
age 0.00760 1 0.1841 0.6706
Residuals 1.40465 34
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#second version of type III Anova:
options(contrasts = c("contr.treatment", "contr.poly"))
Anova(IMMMAR_model.logRTL, type=3) #type III
Anova Table (Type III tests)
#Response: LOG10RTL
# Sum Sq Df F value Pr(>F)
#(Intercept) 2.36246 1 57.1843 8.789e-09 ***
#fat 0.00208 1 0.0504 0.8237
#hap 0.03801 1 0.9200 0.3442
#bka 0.08196 1 1.9839 0.1681
#age 0.00760 1 0.1841 0.6706
#Residuals 1.40465 34
summary(IMMMAR_model.logRTL)
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.247246 0.032696 7.562 8.79e-09 ***
#fat 0.007855 0.034987 0.225 0.824
#hap 0.029878 0.031149 0.959 0.344
#bka 0.042890 0.030450 1.409 0.168
#age 0.014579 0.033982 0.429 0.671
I was expecting the results to be the same between summary(lm) and Anova(model, type = 3).
Any input would be greatly welcomed!
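Note that the two outputs above are in fact consistent: car::Anova() reports sums of squares and F statistics rather than coefficient estimates, and for a term with a single degree of freedom the type III F statistic is simply the square of the t statistic from summary(lm), which is why the p-values match. A quick check with the values printed above:
# F = t^2 for 1-df terms, e.g.:
7.562^2  # 57.18, matching the intercept's F value of 57.1843
0.225^2  # 0.051, matching the F value for fat (0.0504; t is rounded)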

Adjusted mean with emmeans at 3 expressions

I know how to get the adjusted mean from emmeans when a variable has 2 levels, such as sex:
sex == 1 : men, sex == 2 : women --> 2 levels.
The associated model with the subsequent adjusted mean (EMM) calculation is:
mean_MF <- lm(LZ~age + SES_3 + sex, data = MF)
summary(mean_MF)
emmeans(mean_MF, ~ sex)
and the output looks like this:
> emmeans(mean_MF, ~ sex)
sex emmean SE df lower.CL upper.CL
1 7.05 0.0193 20894 7.02 7.09
2 6.96 0.0187 20894 6.93 7.00
Results are averaged over the levels of: belastet_SZ, belastet_SNZ, guteSeiten_SZ, guteSeiten_SNZ, SES_3
Confidence level used: 0.95
But if I want to calculate the adjusted mean of a variable with 3 values, I only get a single adjusted mean at one common value instead of one for each of the 3 levels.
e.g. for age (Alter), there are 3 levels, coded as follows:
18-30 years: 1
31-40 years: 2
41-51 years: 3
What else do I need to add to the emmeans call so that I get the adjusted means of all three levels?
F_Alter <- lm(LZ~ SES_3 + Alter, data = Frauen)
summary(F_Alter)
emmeans(F_Alter, ~ Alter)
The summary of (F_Alter) looks as follows:
> summary(F_Alter)
Call:
lm(formula = LZ ~ SES_3 + Alterfactor, data = Frauen)
Residuals:
Min 1Q Median 3Q Max
-7.2303 -1.1162 0.1951 1.1220 3.8838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.44956 0.05653 131.777 < 2e-16 ***
SES_3mittel -0.42539 0.04076 -10.437 < 2e-16 ***
SES_3niedrig -1.11411 0.05115 -21.781 < 2e-16 ***
Alterfactor -0.07309 0.02080 -3.513 0.000444 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.889 on 14481 degrees of freedom
(5769 observations deleted due to missingness)
Multiple R-squared: 0.03287, Adjusted R-squared: 0.03267
F-statistic: 164 on 3 and 14481 DF, p-value: < 2.2e-16
In the following output I only get a single value of 1.93 instead of my 3 levels with their respective EMMs.
emmeans(F_Alter, ~ Alter)
Alter emmean SE df lower.CL upper.CL
1.93 6.8 0.0179 14481 6.76 6.83
Results are averaged over the levels of: SES_3
Confidence level used: 0.95
What can I change in the emmeans call to get output for all three age levels (1, 2, 3)?
The predictor Alter in the original question was not coded as a factor, and so it was being treated as a continuous numeric variable in the model estimation and by emmeans.
The problem is fixed by creating a new factor variable,
Frauen$Alterfactor = as.factor(Frauen$Alter)
and then using this new variable as the predictor in the model.
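Putting the fix together, a minimal sketch using the variable names from the question:
# Recode Alter as a factor, refit, and ask emmeans for the factor
Frauen$Alterfactor <- as.factor(Frauen$Alter)
F_Alter <- lm(LZ ~ SES_3 + Alterfactor, data = Frauen)
emmeans(F_Alter, ~ Alterfactor)  # now returns three rows, one per age level (1, 2, 3)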

Why is my summary in R only including some of my variables?

I am trying to see if there is a relationship between the number of bat calls and the time of pup rearing season. The pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes the p-values for Pre and Post pup production. I created a sample data set below; with the sample data set I just get an error, but with my actual data set I get the output I described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
> Calls Pup
1 55 Pre
2 60 Middle
3 180 Post
4 160 Post
5 110 Middle
6 50 Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, I'm just wondering why the above output isn't showing "Middle". Sorry my sample data set didn't work out the same, but maybe that error message will help better understand the problem.
For R to treat a categorical variable correctly, you have to indicate that Pup is a qualitative (dummy) variable by using factor. (Calls also has to be numeric; in the sample it was created as character strings, which is what triggers the error.)
> Calls <- as.numeric(Calls)  # Calls was built from strings; lm() needs a numeric response
> Pup <- factor(Pup)
> q <- data.frame(Calls, Pup)
> q1 <- lm(Calls ~ Pup, data = q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all categories of the dummy variable, then you must remove the intercept from the regression; otherwise you will fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
If you include a categorical variable like Pup in a regression, then by default R includes a dummy variable for each level of that variable except one. You can get a coefficient for PupMiddle if you instead omit the intercept, like this:
q1<-lm(Calls~Pup - 1, data=q)

How to interpret results of hurdle model that seem unusual?

I have discrete count data (trap_catch) for two groups within the variable in_tree (1 = trap in a tree, 0 = trap not in a tree), and I want to see if counts differ between these two groups. The data is overdispersed and there are many zeroes, so I have come to the conclusion that I need a hurdle model. Is this OK?
trap_id trap_catch in_tree
1 0 0
2 10 1
3 0 0
4 0 1
5 9 1
6 3 0
Here is an example of how the data is set up. My code is as follows:
library(pscl)  # hurdle() is provided by the pscl package
mod.hurdle <- hurdle(trap_catch ~ in_tree, data = data, dist = "negbin")
summary(mod.hurdle)
The results I get are as follows and seem very different from any examples I have read:
Pearson residuals:
Min 1Q Median 3Q Max
-0.8986 -0.6635 -0.2080 0.2474 6.8513
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2582 0.1285 9.793 < 2e-16 ***
in_tree 1.3722 0.3100 4.426 9.58e-06 ***
Log(theta) -0.2056 0.2674 -0.769 0.442
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5647 0.1944 8.049 8.32e-16 ***
in_tree 16.0014 1684.1379 0.010 0.992
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.8142
Number of iterations in BFGS optimization: 8
Log-likelihood: -513.7 on 5 Df
I am confused as to how to interpret these results.
I apologise in advance for my lack of understanding - I am very new to this type of analysis.
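One likely explanation for the oddest-looking part: the zero-hurdle coefficient for in_tree (estimate 16.0 with a standard error of 1684) is the classic symptom of complete separation, which would arise if every trap in a tree caught at least one individual. A quick cross-tabulation, sketched here under that assumption, can confirm it:
# Check the binary (zero hurdle) part for separation: if all in_tree == 1
# traps have a nonzero catch, the logit estimate diverges and its SE explodes
with(data, table(in_tree, trap_catch > 0))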

How do I use the glm() function?

I'm trying to fit a generalized linear model (GLM) to my data using R. I have a continuous variable Y and two categorical factors, A and B. Each factor is coded as 0 or 1, for absence or presence.
Even though just looking at the data I can see a clear interaction between A and B, the GLM says the p-value is far above 0.05. Am I doing something wrong?
First of all I create the data frame with my data for the GLM, which consists of a dependent variable Y and two factors, A and B. These are two-level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see how it looks:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just by looking at the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when both A and B are present (that is, A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as the p-value is far above 0.05:
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state Y is continuous, the data shows that Y is in fact a fraction, which is probably the reason you tried to apply a GLM in the first place.
Fractions (i.e. continuous values bounded by 0 and 1) can be modeled with logistic regression if certain assumptions are fulfilled. See the following Cross Validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fulfilled.
Alternatives for modeling fractions are beta regression and fractional response models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
The family=binomial implies logit (logistic) regression, which itself models a binary outcome.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try fitting a different model; logistic regression is not appropriate here.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
