Why is my summary in R only including some of my variables?

I am trying to see if there is a relationship between number of bat calls and the time of pup rearing season. The pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only included the p-values for Pre and Post pup production. I created a sample data set below. With the sample data set, I just get an error.... with my actual data set I get the output I described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
> Calls Pup
1 55 Pre
2 60 Middle
3 180 Post
4 160 Post
5 110 Middle
6 50 Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, just wondering why the above output isn't showing "Middle". Sorry my sample data set didn't work out the same but maybe that error message will help better understand the problem.

For R to treat Pup as a dummy variable, you have to declare it as a qualitative (categorical) variable using factor(). Calls must also be numeric rather than quoted strings for lm() to use it as the response:
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
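As a minimal sketch of why "Middle" does not get its own row, the sample can be rebuilt with numeric Calls (the quoted values in the question make Calls non-numeric, which lm() cannot use as a response). The name group_means is only for illustration. By default R sorts factor levels alphabetically and uses the first one as the reference level, so "Middle" is absorbed into the intercept.

```r
# Sample data from the question, with Calls stored as numbers
Calls <- c(55, 60, 180, 160, 110, 50)
Pup   <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q     <- data.frame(Calls, Pup)

# Levels are sorted alphabetically; the first one is the reference level
levels(q$Pup)

q1 <- lm(Calls ~ Pup, data = q)

# The intercept is the mean of the reference group ("Middle");
# the other coefficients are differences from that mean
group_means <- tapply(q$Calls, q$Pup, mean)
coef(q1)
group_means
```

Here the intercept equals the "Middle" group mean, and PupPost and PupPre are the differences of the other two group means from it.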
If you want R to show every category of the dummy variable, you must remove the intercept from the regression; otherwise you fall into the dummy variable trap (the full set of dummies is perfectly collinear with the intercept).
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234

If you include a categorical variable like Pup in a regression, R by default creates a dummy variable for each of its levels except one, the reference level, which is absorbed into the intercept. You can get a coefficient for PupMiddle by omitting the intercept instead, like this:
q1<-lm(Calls~Pup - 1, data=q)
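If the intercept should stay in the model, another option is to change which level serves as the reference. The sketch below assumes the numeric version of the sample data and uses relevel() to make "Pre" the reference, so that "Middle" shows up as a coefficient; q2 is a name chosen here for illustration.

```r
# Sample data with numeric Calls
Calls <- c(55, 60, 180, 160, 110, 50)
Pup   <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q     <- data.frame(Calls, Pup)

# Make "Pre" the reference level instead of the alphabetical default
q$Pup <- relevel(q$Pup, ref = "Pre")

q2 <- lm(Calls ~ Pup, data = q)

# Coefficients are now differences from the "Pre" group mean
coef(q2)
```

The fitted values are the same as before; only the comparison that each coefficient (and its p-value) represents has changed.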

Related

Why does the use of log generate missing values (NA in regression coefficient)?

I have four variables: CO2 emissions, GDP per capita, GDP per capita squared, and number of tourist arrivals. I am trying to fit a model of the impact of tourist arrivals on CO2 emissions in order to derive the tourism-induced Environmental Kuznets Curve. Below are the code and the summary results.
Without log
Yt<-Data$`CO2 emissions`
X1t<-Data$`GDP per capita`
X2t<-Data$`GDP per caita square`
X3t<-Data$`Number of Tourist arrival`
model<-lm(Yt~X1t+X2t+X3t)
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.238e-02 7.395e-03 1.674 0.100187
X1t -2.581e-05 6.710e-05 -0.385 0.702139
X2t 1.728e-07 4.572e-08 3.780 0.000413 ***
X3t 1.928e-07 3.501e-08 5.507 1.2e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02252 on 51 degrees of freedom
Multiple R-squared: 0.9475, Adjusted R-squared: 0.9444
F-statistic: 306.5 on 3 and 51 DF, p-value: < 2.2e-16
With log
LYt<-(log(Yt))
LX1t<-(log(X1t))
LX2t<-(log(X2t))
LX3t<-(log(X3t))
model1<-lm(LYt~LX1t+LX2t+LX3t)
summary(model1)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.38623 0.46040 -20.387 < 2e-16 ***
LX1t 0.83679 0.09834 8.509 2.01e-11 ***
LX2t NA NA NA NA
LX3t 0.17802 0.06888 2.585 0.0126 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2863 on 52 degrees of freedom
Multiple R-squared: 0.9074, Adjusted R-squared: 0.9038
F-statistic: 254.7 on 2 and 52 DF, p-value: < 2.2e-16
It is pretty evident that GDP per capita and GDP per capita squared are perfectly collinear. However, why do the regression coefficients show missing values (NA) only in the log-transformed model?
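One way to see why only the log model drops a term: log(x^2) equals 2*log(x), so the two logged regressors are exact multiples of each other, whereas x and x^2 are related nonlinearly and lm() can estimate both. The sketch below uses simulated data and hypothetical names (x, x2, y), not the data from the question.

```r
set.seed(1)

# Simulated stand-ins for GDP per capita and its square
x  <- runif(55, min = 1, max = 10)
x2 <- x^2
y  <- exp(0.8 * log(x) + rnorm(55, sd = 0.1))

# In levels, x and x2 are correlated but not linearly dependent
m_raw <- lm(y ~ x + x2)
coef(m_raw)

# In logs, log(x2) is exactly 2 * log(x), so one term cannot be estimated
m_log <- lm(log(y) ~ log(x) + log(x2))
coef(m_log)

# Confirms the exact linear dependence between the logged regressors
all.equal(log(x2), 2 * log(x))
```

Under these assumptions the log model reports NA for log(x2), mirroring the "1 not defined because of singularities" note in the output above.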

Quantify the significance of a slope

This is more of a technical question rather than coding. Is it possible to test whether the slope of a data set is significant or not?
So I have the following plot:
I managed to determine the slope of this blue geom_smooth line, which is basically the mean of all the pink lines. Is it possible to test if the slope of that blue line is significant based on the dataset?
I'm not directly interested in the code but just want to check if it's possible to determine a possible significance in the data set.
This shows the p-value, 0.0544, of the slope in the output:
summary(lm(demand ~ Time, BOD))
giving:
Call:
lm(formula = demand ~ Time, data = BOD)
Residuals:
1 2 3 4 5 6
-1.9429 -1.6643 5.3143 0.5929 -1.5286 -0.7714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.5214 2.6589 3.205 0.0328 *
Time 1.7214 0.6387 2.695 0.0544 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.085 on 4 degrees of freedom
Multiple R-squared: 0.6449, Adjusted R-squared: 0.5562
F-statistic: 7.265 on 1 and 4 DF, p-value: 0.05435
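To work with the slope test programmatically rather than reading it off the printout, the coefficient table can be indexed directly. This is a small sketch on the same built-in BOD data; the names fit, slope_row, p_value and ci are chosen here for illustration.

```r
# Same model as above, on the built-in BOD data set
fit <- lm(demand ~ Time, data = BOD)

# Row of the coefficient table for the slope:
# estimate, standard error, t value and p-value
slope_row <- coef(summary(fit))["Time", ]
p_value   <- slope_row[["Pr(>|t|)"]]
p_value

# A 95% confidence interval for the slope gives the same information
# in another form: it contains 0 exactly when p > 0.05
ci <- confint(fit, "Time", level = 0.95)
ci
```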

Using summary() and summary.lm() on planned comparisons in R - why do the outputs differ?

I have run a three-way independent ANOVA in R. (Sound) Manipulation being my independent variable with the three levels: congruent (KON), incongruent (INK) and no sound (control). Furthermore, I have constructed planned comparisons. The first comparison c1 is the contrast of KON & INK vs. the control group and the second comparison c2 is the contrast of KON vs. INK. The outputs look like this:
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Manipulation 2 11.97 5.985 2.388 0.0975 .
Manipulation: control vs. Experimental 1 7.97 7.970 3.181 0.0778 .
Manipulation: INK vs. KON 1 4.00 3.999 1.596 0.2097
Residuals 91 228.01 2.506
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary.lm(model)
Residuals:
Min 1Q Median 3Q Max
-2.5062 -1.3333 -0.3333 1.1398 4.4111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0317 0.1647 18.411 <2e-16 ***
Manipulationc1 -0.2214 0.1172 -1.889 0.0621 .
Manipulationc2 -0.2531 0.2003 -1.263 0.2097
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.583 on 91 degrees of freedom
Multiple R-squared: 0.04988, Adjusted R-squared: 0.02899
F-statistic: 2.388 on 2 and 91 DF, p-value: 0.0975
What strikes me is that R uses my pre-defined label of the comparisons, i.e. "control vs. experimental" and "INK vs. KON" in the first summary() output, yet it uses something else in the second output summary.lm(). Why is this?
Furthermore, it seems odd, that the p-value of the first comparison differs across the two outputs, i.e. 0.0778 in case of summary() and 0.0621 in case of summary.lm(). Where is this difference coming from?
You should inspect class(model):
M <- aov(formula = Petal.Length ~ Species, data = iris)
summary(M)
summary.lm(M)
class(M)
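The first class listed is "aov", so summary(M) dispatches to summary.aov(M) and prints the ANOVA table. Calling summary.lm(M) directly treats the same fit as a plain linear model and prints the coefficient table, labelled with the coefficient names rather than the labels given to the ANOVA summary.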
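A short sketch of that dispatch on the iris example, assuming nothing beyond base R; s_aov and s_lm are names chosen here. It shows that the two calls return different kinds of summary object built from the same fit.

```r
M <- aov(Petal.Length ~ Species, data = iris)

# An aov fit carries two classes; "aov" comes first, so it wins dispatch
class(M)

s_aov <- summary(M)      # same as summary.aov(M): one F test per term
s_lm  <- summary.lm(M)   # one t test per coefficient

class(s_aov)
class(s_lm)

# Coefficient rows are named after the contrast columns, not the terms
rownames(coef(s_lm))
```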

Display category labels in regression output in R

Using this R linear modelling tutorial, I'm finding that the format of the model output is annoyingly different from that shown in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?
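A likely explanation, offered as an assumption since the session settings are not shown, is that the contrasts option has been changed from the default treatment coding to sum-to-zero coding, for example via options(contrasts = c("contr.sum", "contr.poly")). The sketch below sets the coding per model instead of globally, so both labels can be compared; fit_treatment and fit_sum are illustrative names.

```r
pitch <- c(233, 204, 242, 130, 112, 142)
sex   <- factor(c(rep("female", 3), rep("male", 3)))
my.df <- data.frame(sex, pitch)

# Treatment (dummy) coding: coefficient is labelled with the level name
fit_treatment <- lm(pitch ~ sex, data = my.df,
                    contrasts = list(sex = "contr.treatment"))
names(coef(fit_treatment))

# Sum-to-zero coding: coefficient is labelled with a number,
# and the intercept becomes the mean of the group means
fit_sum <- lm(pitch ~ sex, data = my.df,
              contrasts = list(sex = "contr.sum"))
names(coef(fit_sum))
coef(fit_sum)
```

The sum-coded fit reproduces the intercept and the sex1 estimate from the output above, which supports the assumption. Running options("contrasts") shows what the current session uses.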

How to perform a three-way (binary factors) between-subjects ANOVA with main effects and all interactions in R

The study randomized participants by Source (Expert vs Attractive) and by Argument (Strong vs Weak), and categorized them by Monitor type (High vs Low). I want to test the significance of the main effects, the two-way interactions, and the three-way interaction in the following data frame. Specifically:
Main effects = Self-Monitors (High vs. Low), Argument (Strong vs. Weak), Source (Attractive vs. Expert)
Two-way interactions = Self-Monitors*Argument, Self-Monitors*Source, Argument*Source
Three-way interaction = Self-Monitors*Argument*Source
This is the code:
data<-data.frame(Monitor=c(rep("High.Self.Monitors", 24),rep("Low.Self.Monitors", 24)),
Argument=c(rep("Strong", 24), rep("Weak", 24), rep("Strong", 24), rep("Weak", 24)),
Source=c(rep("Expert",12),rep("Attractive",12),rep("Expert",12),rep("Attractive",12),
rep("Expert",12),rep("Attractive",12),rep("Expert",12),rep("Attractive",12)),
Response=c(4,3,4,5,2,5,4,6,3,4,5,4,4,4,2,3,5,3,2,3,4,3,2,4,3,5,3,2,6,4,4,3,5,3,2,3,5,5,7,5,6,4,3,5,6,7,7,6,
3,5,5,4,3,2,1,5,3,4,3,4,5,4,3,2,4,6,2,4,4,3,4,3,5,6,4,7,6,7,5,6,4,6,7,5,6,4,4,2,4,5,4,3,4,2,3,4))
data$Monitor<-as.factor(data$Monitor)
data$Argument<-as.factor(data$Argument)
data$Source<-as.factor(data$Source)
I'd like to obtain the main effects, as well as all two-way interactions and the three-way interaction. However, if I type in anova(lm(Response ~ Monitor*Argument*Source, data=data)) I obtain:
Analysis of Variance Table
Response: Response
Df Sum Sq Mean Sq F value Pr(>F)
Monitor 1 24.000 24.0000 13.5322 0.0003947 ***
Source 1 0.667 0.6667 0.3759 0.5413218
Monitor:Source 1 0.667 0.6667 0.3759 0.5413218
Residuals 92 163.167 1.7736
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and if I enter summary(aov(Response ~ Monitor*Argument*Source, data=data))
Call:
lm.default(formula = Response ~ Monitor * Argument * Source,
data = data)
Residuals:
Min 1Q Median 3Q Max
-2.7917 -0.7917 0.2083 1.2083 2.5417
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4583 0.2718 12.722 < 2e-16 ***
MonitorLow.Self.Monitors 1.1667 0.3844 3.035 0.00313 **
ArgumentWeak NA NA NA NA
SourceExpert 0.3333 0.3844 0.867 0.38817
MonitorLow.Self.Monitors:ArgumentWeak NA NA NA NA
MonitorLow.Self.Monitors:SourceExpert -0.3333 0.5437 -0.613 0.54132
ArgumentWeak:SourceExpert NA NA NA NA
MonitorLow.Self.Monitors:ArgumentWeak:SourceExpert NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.332 on 92 degrees of freedom
Multiple R-squared: 0.1344, Adjusted R-squared: 0.1062
F-statistic: 4.761 on 3 and 92 DF, p-value: 0.00394
Any thoughts or ideas?
Edit
Your data are not as well randomized as you say. To estimate a three-way interaction, every combination of levels of the three factors must be observed; for instance, you would need a group of subjects with the "Low", "Strong" and "Expert" combination. You do not have such a group.
Look at:
table(data[,1:3])
For example.
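The sketch below rebuilds only the design columns, mirroring the layout of the data.frame() call in the question (including the fact that the 48-element Monitor vector is recycled to 96 rows), and counts the empty cells. The names design and cells are chosen here for illustration.

```r
# Same layout as in the question; Monitor is written out at full length
# to make the recycling that data.frame() performs explicit
Monitor  <- rep(rep(c("High.Self.Monitors", "Low.Self.Monitors"), each = 24), times = 2)
Argument <- rep(rep(c("Strong", "Weak"), each = 24), times = 2)
Source   <- rep(rep(c("Expert", "Attractive"), each = 12), times = 4)
design   <- data.frame(Monitor, Argument, Source)

# Three-way table of cell counts; zeros mark combinations never observed
cells <- table(design)
cells
sum(cells == 0)

# Monitor and Argument always change together, so they are confounded
table(design$Monitor, design$Argument)
```

With this layout every High row is also a Strong row and every Low row is also a Weak row, which is why the Argument terms come out as NA.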
