Quantify the significance of a slope - R

This is more of a statistics question than a coding one: is it possible to test whether the slope of a data set is significant?
So I have the following plot:
I managed to determine the slope of this blue geom_smooth line, which is basically the mean of all the pink lines. Is it possible to test whether the slope of that blue line is significant, based on the dataset?
I'm not directly interested in the code; I just want to know whether such a significance test is possible for this data set.

Yes, this is possible: fit the line with lm() and look at the summary, which reports the p-value of the slope (here 0.0544) in its output:
summary(lm(demand ~ Time, BOD))
giving:
Call:
lm(formula = demand ~ Time, data = BOD)
Residuals:
1 2 3 4 5 6
-1.9429 -1.6643 5.3143 0.5929 -1.5286 -0.7714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.5214 2.6589 3.205 0.0328 *
Time 1.7214 0.6387 2.695 0.0544 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.085 on 4 degrees of freedom
Multiple R-squared: 0.6449, Adjusted R-squared: 0.5562
F-statistic: 7.265 on 1 and 4 DF, p-value: 0.05435
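If you want the slope's p-value on its own rather than reading it off the printed table, one way (a small sketch using the same model) is to index the coefficient matrix returned by summary():
fit <- lm(demand ~ Time, data = BOD)
coef(summary(fit))["Time", "Pr(>|t|)"]   # p-value of the Time slope, about 0.0544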

Related

Why does the use of log generate missing values (NA) in the regression coefficients?

I have four series: CO2 emissions, GDP per capita, GDP per capita squared, and number of tourist arrivals. I am trying to run a model of the impact of tourist arrivals on CO2 emissions in order to derive the tourism-induced Environmental Kuznets Curve. Below are the code and summary results.
Without log
Yt<-Data$`CO2 emissions`
X1t<-Data$`GDP per capita`
X2t<-Data$`GDP per caita square`
X3t<-Data$`Number of Tourist arrival`
model<-lm(Yt~X1t+X2t+X3t)
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.238e-02 7.395e-03 1.674 0.100187
X1t -2.581e-05 6.710e-05 -0.385 0.702139
X2t 1.728e-07 4.572e-08 3.780 0.000413 ***
X3t 1.928e-07 3.501e-08 5.507 1.2e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02252 on 51 degrees of freedom
Multiple R-squared: 0.9475, Adjusted R-squared: 0.9444
F-statistic: 306.5 on 3 and 51 DF, p-value: < 2.2e-16
With log
LYt<-(log(Yt))
LX1t<-(log(X1t))
LX2t<-(log(X2t))
LX3t<-(log(X3t))
model1<-lm(LYt~LX1t+LX2t+LX3t)
summary(model1)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.38623 0.46040 -20.387 < 2e-16 ***
LX1t 0.83679 0.09834 8.509 2.01e-11 ***
LX2t NA NA NA NA
LX3t 0.17802 0.06888 2.585 0.0126 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2863 on 52 degrees of freedom
Multiple R-squared: 0.9074, Adjusted R-squared: 0.9038
F-statistic: 254.7 on 2 and 52 DF, p-value: < 2.2e-16
It is pretty evident that GDP per capita and GDP per capita square are perfectly collinear. However, why do the regression coefficients show missing values (NA) only in the case of the log-transformed model?
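A minimal sketch with made-up numbers shows where the NA comes from: x and x^2 are not exactly linearly dependent, but log(x^2) = 2*log(x), so after the log transform the two predictors are exact linear combinations of each other and lm() drops the redundant one.
x1 <- c(1000, 1500, 2200, 3100, 4000)    # made-up GDP-per-capita-like values
x2 <- x1^2                               # its square
all.equal(log(x2), 2 * log(x1))          # TRUE: exact linear dependence after log
y  <- rnorm(5)                           # arbitrary response, just for illustration
coef(lm(y ~ log(x1) + log(x2)))          # the log(x2) coefficient comes back NA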

Why is my summary in R only including some of my variables?

I am trying to see if there is a relationship between the number of bat calls and the time of pup rearing season. The Pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes the p-values for Pre and Post pup production. I created a sample data set below; with the sample data set I just get an error, while with my actual data set I get the output I described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
> Calls Pup
1 55 Pre
2 60 Middle
3 180 Post
4 160 Post
5 110 Middle
6 50 Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, I'm just wondering why the above output isn't showing "Middle". Sorry my sample data set didn't produce the same output, but maybe that error message will help in understanding the problem.
For R to treat Pup correctly, you have to indicate that it is a qualitative (categorical) variable by using factor:
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all categories of the variable, then you must remove the intercept from the regression; otherwise you would fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
If you include a categorical variable like Pup in a regression, then by default R includes a dummy variable for each level of that variable except one. You can get a coefficient for PupMiddle if you instead omit the intercept, like this:
q1<-lm(Calls~Pup - 1, data=q)
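If you want to see exactly which dummy columns R builds from Pup, a quick check (a sketch using the same small data frame q) is to look at the design matrix directly:
model.matrix(~ Pup, data = q)      # intercept plus PupPost and PupPre; Middle is the baseline level
model.matrix(~ Pup - 1, data = q)  # one column per level, no intercept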

Using summary() and summary.lm() on planned comparisons in R - why do the outputs differ?

I have run a one-way independent ANOVA in R, with (Sound) Manipulation as my independent variable with three levels: congruent (KON), incongruent (INK) and no sound (control). Furthermore, I have constructed planned comparisons. The first comparison c1 is the contrast of KON & INK vs. the control group and the second comparison c2 is the contrast of KON vs. INK. The outputs look like this:
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Manipulation 2 11.97 5.985 2.388 0.0975 .
Manipulation: control vs. Experimental 1 7.97 7.970 3.181 0.0778 .
Manipulation: INK vs. KON 1 4.00 3.999 1.596 0.2097
Residuals 91 228.01 2.506
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary.lm(model)
Residuals:
Min 1Q Median 3Q Max
-2.5062 -1.3333 -0.3333 1.1398 4.4111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0317 0.1647 18.411 <2e-16 ***
Manipulationc1 -0.2214 0.1172 -1.889 0.0621 .
Manipulationc2 -0.2531 0.2003 -1.263 0.2097
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.583 on 91 degrees of freedom
Multiple R-squared: 0.04988, Adjusted R-squared: 0.02899
F-statistic: 2.388 on 2 and 91 DF, p-value: 0.0975
What strikes me is that R uses my pre-defined labels for the comparisons, i.e. "control vs. experimental" and "INK vs. KON", in the first summary() output, yet it uses something else in the second output, summary.lm(). Why is this?
Furthermore, it seems odd that the p-value of the first comparison differs across the two outputs, i.e. 0.0778 in the case of summary() and 0.0621 in the case of summary.lm(). Where is this difference coming from?
You should inspect class(model):
M <- aov(formula = Petal.Length ~ Species, data = iris)
summary(M)
summary.lm(M)
class(M)
class(M) is c("aov", "lm"): "aov" comes first, so summary(M) dispatches to summary.aov(M), whereas summary.lm(M) explicitly calls the lm method, which is why the two outputs look different.
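As a rough sketch of the same dispatch difference with labelled planned contrasts (the contrast values and labels below are made up for the iris example, not taken from the question): if you attach your own contrasts to the factor, summary() can display them under your labels via the split argument of summary.aov(), while summary.lm() reports the same contrasts as coefficient t-tests.
contrasts(iris$Species) <- cbind(c(-2, 1, 1), c(0, -1, 1))   # two planned contrasts (made up)
M <- aov(Petal.Length ~ Species, data = iris)
summary(M, split = list(Species = list("setosa vs others" = 1,
                                        "versicolor vs virginica" = 2)))  # labelled F-tests
summary.lm(M)   # the same contrasts appear as Species1 / Species2 t-tests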

Display category labels in regression output in R

Using this R linear modelling tutorial, I'm finding the format of the model output is annoyingly different from that provided in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?
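One plausible cause (an assumption, not something stated in the tutorial) is the contrast coding in effect for the session: R's default treatment contrasts label the dummy "sexmale", whereas sum-to-zero contrasts label it "sex1". A small sketch of how the option changes the label:
options(contrasts = c("contr.treatment", "contr.poly"))   # R's default coding
summary(lm(pitch ~ sex, my.df))                           # slope row labelled "sexmale"
options(contrasts = c("contr.sum", "contr.poly"))          # sum-to-zero coding
summary(lm(pitch ~ sex, my.df))                           # slope row labelled "sex1", as in the output above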

How does lm() know which predictors are categorical?

Normally, you and I (assuming you're not a bot) can easily identify whether a predictor is categorical or quantitative. For example, gender is obviously categorical, and your last vote can be classified categorically.
Basically, we can identify categorical predictors easily. But what happens when we put some data into R and its lm function makes dummy variables for a predictor? How does it do that?
Somewhat related Question on StackOverflow.
Look up R's factor function. Here is a small demo: the first model uses the number of cylinders as a numerical variable, the second model uses it as a categorical variable.
> summary(lm(mpg~cyl,mtcars))
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
> summary(lm(mpg~factor(cyl),mtcars))
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
Hxd1011 addressed the more difficult case, when a categorical variable is stored as a number and R therefore treats it as numerical by default; if this is not the desired behaviour, we must use the factor function.
Your example with the predictor ShelveLoc in the dataset Carseats is easier because it's a text (character) variable, and therefore it can only be a categorical variable.
> head(Carseats$ShelveLoc)
[1] Bad Good Medium Medium Bad Bad
Levels: Bad Good Medium
R decides this from the feature's type. You can check it by using str(dataset). If the feature is of factor type, then lm will create dummies for that feature.
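As a quick illustration of that check (a sketch using the built-in mtcars data), you can list each column's class and convert any numerically coded categories yourself:
sapply(mtcars, class)            # every column is numeric, so cyl would enter lm() as a number
mtcars$cyl <- factor(mtcars$cyl)
str(mtcars$cyl)                  # now a factor with levels "4", "6", "8", so lm() would build dummies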
