The default OLS regression output in R gives me the p-value for whether each coefficient is different from zero.
Is there a way to change this default so that I can test whether a coefficient is different from one?
Thank you
Just carry out the corresponding linear hypothesis test. In R, use the function car::linearHypothesis:
mod <- lm(Sepal.Width~., iris)
then run either of the following to test whether the coefficient for Petal.Length equals 1:
car::linearHypothesis(mod, "Petal.Length = 1")
car::lht(mod, "Petal.Length = 1")
Linear hypothesis test
Hypothesis:
Petal.Length = 1
Model 1: restricted model
Model 2: Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + Species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 145 24.837
2 144 10.328 1 14.509 202.31 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
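For intuition, the same test can also be computed by hand from the summary table: the t statistic is (estimate − 1)/SE, and because this is a single linear restriction its square equals the F statistic reported by linearHypothesis. A minimal sketch:

```r
# Sketch: test H0: coefficient of Petal.Length = 1 by hand.
mod <- lm(Sepal.Width ~ ., iris)
est <- coef(summary(mod))["Petal.Length", "Estimate"]
se  <- coef(summary(mod))["Petal.Length", "Std. Error"]

tstat <- (est - 1) / se                                        # shift the null from 0 to 1
pval  <- 2 * pt(abs(tstat), df = df.residual(mod), lower.tail = FALSE)

# tstat^2 reproduces the F statistic from car::linearHypothesis
c(t = tstat, p = pval)
```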
I don't know how to interpret the sum of squares for a numeric independent variable.
summary(aov(Petal.Width ~ Petal.Length + Species, iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Petal.Length 1 80.26 80.26 2487.02 < 2e-16 ***
## Species 2 1.60 0.80 24.77 5.48e-10 ***
## Residuals 146 4.71 0.03
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The sum of squares for Species is clear to me (the sum of squared deviations from the group means), but how do I interpret it for a numeric independent variable like Petal.Length?
The components of this linear model are not orthogonal, so we cannot calculate the sum of squares (SS) of each component independently of the others. Rather, we must take a sequence of model comparisons. In this case aov considers the following models, owing to the order in which the components were listed in the formula:
fm0 <- lm(Petal.Width ~ 1, iris) # null model
fm1 <- lm(Petal.Width ~ Petal.Length, iris)
fm2 <- lm(Petal.Width ~ Petal.Length + Species, iris) # full model
Note that the residual sum of squares (RSS) of a model fm is sum(resid(fm)^2) and R has a function specifically for this which is deviance(fm). Keeping this in mind we can decompose the RSS of the null model like this:
deviance(fm0)                         # RSS of null model
  = (deviance(fm0) - deviance(fm1))   # SS of Petal.Length
  + (deviance(fm1) - deviance(fm2))   # SS of Species
  + deviance(fm2)                     # RSS of full model
and each sum of squares reported in the table in the question is one of the
lines above. That is,
deviance(fm0) - deviance(fm1) # SS of Petal.Length
## [1] 80.25984
deviance(fm1) - deviance(fm2) # SS of Species
## [1] 1.598453
deviance(fm2) # RSS of full model
## [1] 4.711643
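As a quick sanity check (not part of the original table), the sequential deviance differences can be compared directly against the Sum Sq column that aov reports:

```r
# Verify that the aov Sum Sq column equals the sequential deviance differences.
fm0 <- lm(Petal.Width ~ 1, iris)                       # null model
fm1 <- lm(Petal.Width ~ Petal.Length, iris)
fm2 <- lm(Petal.Width ~ Petal.Length + Species, iris)  # full model

ss_aov <- summary(aov(Petal.Width ~ Petal.Length + Species, iris))[[1]][["Sum Sq"]]
ss_dev <- c(deviance(fm0) - deviance(fm1),   # SS of Petal.Length
            deviance(fm1) - deviance(fm2),   # SS of Species
            deviance(fm2))                   # RSS of full model

all.equal(ss_aov, ss_dev)
```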
Note
Note that the SS values we get depend on the sequence of models we use. For example, if we use this sequence, which considers Species before Petal.Length (whereas above we considered Petal.Length and then Species), we get different SS values.
fm0 # same null model as above
fm1a <- lm(Petal.Width ~ Species, iris)
fm2 # same full model as above
deviance(fm0) - deviance(fm1a) # SS of Species
## [1] 80.41333
deviance(fm1a) - deviance(fm2) # SS of Petal.Length
## [1] 1.444957
deviance(fm2) # RSS of full model
## [1] 4.711643
and note that this does correspond to what aov reports if we list the components in that order; i.e., this time we listed Species before Petal.Length to change the sequence of models that aov considers:
summary(aov(Petal.Width ~ Species + Petal.Length, iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 1245.89 < 2e-16 ***
## Petal.Length 1 1.44 1.44 44.77 4.41e-10 ***
## Residuals 146 4.71 0.03
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I'm making a scatterplot matrix with ggpairs (from GGally) as follows, but I'd like to display the p-values for each term in my aov results in the upper panels, rather than just the overall and by-species correlation values that come with the package.
How can I get the right-hand column from this aov result into my upper plots? Can I write a custom function to do this, and how? Is it even possible using ggpairs? Thanks.
library(GGally);library(ggplot2)
pm <- ggpairs(data = iris,
mapping = aes(color = Species),
columns = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
pm
result <- aov(Sepal.Length ~ Sepal.Width*Petal.Length, data = iris)
print(summary(result))
Df Sum Sq Mean Sq F value Pr(>F)
Sepal.Width 1 1.41 1.41 12.9 0.000447 ***
Petal.Length 1 84.43 84.43 771.4 < 2e-16 ***
Sepal.Width:Petal.Length 1 0.35 0.35 3.2 0.075712 .
Residuals 146 15.98 0.11
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
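One possible approach (a sketch, not tested against every GGally version) is to supply a custom panel function through the `upper` argument of ggpairs. The helper `aov_p_panel` below is hypothetical: for each panel's pair of variables it fits a simple one-term aov and prints that term's p-value, using the GGally helpers `eval_data_col` and `ggally_text`:

```r
library(GGally)
library(ggplot2)

# Hypothetical panel function: fit aov(y ~ x) for the panel's variable pair
# and display the p-value of the x term in the panel.
aov_p_panel <- function(data, mapping, ...) {
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  p <- summary(aov(y ~ x))[[1]][["Pr(>F)"]][1]
  ggally_text(label = paste0("p = ", signif(p, 3)), ...) +
    theme_void()
}

pm <- ggpairs(data = iris,
              mapping = aes(color = Species),
              columns = c("Sepal.Length", "Sepal.Width",
                          "Petal.Length", "Petal.Width"),
              upper = list(continuous = aov_p_panel))
pm
```

A multi-term model like Sepal.Width*Petal.Length would need a different helper, since each ggpairs panel only sees one x/y pair at a time.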
I need to run a regression on a constant. In EViews, I don't need to specify any predictor when I run a regression on a constant, but I don't know how to do that in R. Does anyone know what I should write in this command?
fit= lm(r~?)
You can specify a constant as 1 in a formula:
r <- 1:5
fit <- lm(r ~ 1)
summary(fit)
# Call:
# lm(formula = r ~ 1)
#
# Residuals:
# 1 2 3 4 5
# -2.00e+00 -1.00e+00 2.22e-16 1.00e+00 2.00e+00
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.0000 0.7071 4.243 0.0132 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.581 on 4 degrees of freedom
Note that you don't need lm to get this result:
mean(r)
#[1] 3
sd(r)/sqrt(length(r))
#[1] 0.7071068
However, you might want to use lm in order to have a Null model against which you can compare other models ...
Edit:
Since you comment that you need "the p-value", I suggest using a t-test instead.
t.test(r)
# One Sample t-test
#
#data: r
#t = 4.2426, df = 4, p-value = 0.01324
#alternative hypothesis: true mean is not equal to 0
#95 percent confidence interval:
# 1.036757 4.963243
#sample estimates:
#mean of x
# 3
This is equivalent, but computationally more efficient.
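The equivalence is easy to verify numerically: the intercept of the intercept-only lm is the sample mean, and its p-value matches the one-sample t-test (a quick check using the same toy data):

```r
r   <- 1:5
fit <- lm(r ~ 1)     # intercept-only regression
tt  <- t.test(r)     # one-sample t-test against mu = 0

# intercept = sample mean; identical p-values from both approaches
c(coef(fit), mean(r))
c(summary(fit)$coefficients[1, 4], tt$p.value)
```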
I am running a regression in R:
fbReg <- lm(y~x2+x7+x8,table.b1)
I then produce an ANOVA table to analyze the significance of the regression:
anova(fbReg)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 76.193 76.193 26.172 3.100e-05 ***
x7 1 139.501 139.501 47.918 3.698e-07 ***
x8 1 41.400 41.400 14.221 0.0009378 ***
Residuals 24 69.870 2.911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Is there anything I can do to make my ANOVA table sum the sums of squares for x2, x7, and x8 instead of having them separate?
Essentially, I'd like the ANOVA table to look like this:
df SS MS FvAL PR(>F)
Regression 3 257.094 ETC....
Error(Residual) 24 69.870 ETC.....
Thanks
To illustrate my comment:
> lm2 <- lm(Fertility ~ Catholic+Education+Agriculture, data = swiss)
> lm1 <- lm(Fertility ~ 1, data = swiss)
> anova(lm1,lm2)
Analysis of Variance Table
Model 1: Fertility ~ 1
Model 2: Fertility ~ Catholic + Education + Agriculture
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 7178.0
2 43 2567.9 3 4610.1 25.732 1.089e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
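The same single-row test also falls out of summary() of the full model: the overall F statistic reported there compares the full model against the intercept-only model, so it matches the anova(lm1, lm2) comparison (a quick check on the swiss example):

```r
lm2 <- lm(Fertility ~ Catholic + Education + Agriculture, data = swiss)
lm1 <- lm(Fertility ~ 1, data = swiss)  # intercept-only (null) model

a <- anova(lm1, lm2)

# model-comparison F equals the overall F from summary(lm2)
c(a$F[2], summary(lm2)$fstatistic["value"])
```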
I am trying to analyze some visual transect data of organisms to generate a habitat distribution model. Once organisms are sighted, they are followed and point data are collected at a given time interval. Because of the autocorrelation among these "follows," I wish to use a GAM-GEE approach similar to that of Pirotta et al. 2011, using the packages 'yags' and 'splines' (http://www.int-res.com/abstracts/meps/v436/p257-272/). Their R scripts are shown here (http://www.int-res.com/articles/suppl/m436p257_supp/m436p257_supp1-code.r). I have used this code with limited success, and with multiple issues of models failing to converge.
Below is the structure of my data:
> str(dat2)
'data.frame': 10792 obs. of 4 variables:
$ dist_slag : num 26475 26340 25886 25400 24934 ...
$ Depth : num -10.1 -10.5 -16.6 -22.2 -29.7 ...
$ dolphin_presence: int 0 0 0 0 0 0 0 0 0 0 ...
$ block : int 1 1 1 1 1 1 1 1 1 1 ...
> head(dat2)
dist_slag Depth dolphin_presence block
1 26475.47 -10.0934 0 1
2 26340.47 -10.4870 0 1
3 25886.33 -16.5752 0 1
4 25399.88 -22.2474 0 1
5 24934.29 -29.6797 0 1
6 24519.90 -26.2370 0 1
Here is the summary of my block variable (indicating the number of groups within which autocorrelation exists):
> summary(dat2$block)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 39.00 76.00 73.52 111.00 148.00
However, I would like to use the package 'gamm4', as I am more familiar with Professor Simon Wood's packages and functions, and it appears gamm4 might be the most appropriate. It is important to note that the models have a binary response (organism presence or absence along a transect), which is why I think gamm4 is more appropriate than gamm. The gamm help provides the following example of autocorrelation within factors:
## more complicated autocorrelation example - AR errors
## only within groups defined by `fac'
e <- rnorm(n,0,sig)
for (i in 2:n) e[i] <- 0.6*e[i-1]*(fac[i-1]==fac[i]) + e[i]
y <- f + e
b <- gamm(y~s(x,k=20),correlation=corAR1(form=~1|fac))
Following this example, the following is the code I used for my dataset
b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),random=(form=~1|block), family=binomial(),data=dat)
However, examining the output (summary(b$gam) and specifically summary(b$mer)), I am either unsure how to interpret the results, or do not believe that the autocorrelation within groups is being taken into account.
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
>
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
How do I ensure that autocorrelation is indeed being accounted for within each unique value of the "block" variable? What is the simplest way to interpret the output for "summary(b$mer)"?
The results do differ from a normal gam (package mgcv) using the same variables and parameters without the "correlation=..." term, indicating that something different is occurring.
However, when I use a different variable for the correlation term (season), I get the SAME output:
> dat2 <- data.frame(dist_slag = dat$dist_slag, Depth = dat$Depth, dolphin_presence = dat$dolphin_presence,
+ block = dat$block, season=dat$season)
> head(dat2)
dist_slag Depth dolphin_presence block season
1 26475.47 -10.0934 0 1 F
2 26340.47 -10.4870 0 1 F
3 25886.33 -16.5752 0 1 F
4 25399.88 -22.2474 0 1 F
5 24934.29 -29.6797 0 1 F
6 24519.90 -26.2370 0 1 F
> summary(dat2$season)
F S
3224 7568
> b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),correlation=corAR1(1, form=~1 | season), family=binomial(),data=dat2)
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
I just want to make sure it is correctly allowing for correlation within each value for the "block" variable. How do I formulate the model to say that autocorrelation can exist within each single value for block, but assume independence among blocks?
On another note, I am also receiving the following warning message after model completion for larger models (with many more variables than 2):
Warning message:
In mer_finalize(ans) : false convergence (8)
gamm4 is built on top of lme4, which does not allow a correlation parameter (in contrast to the nlme package, which underlies mgcv::gamm). mgcv::gamm does handle binary data, although it uses PQL, which is generally less accurate than the Laplace/GHQ approximations used in gamm4/lme4. It is unfortunate (!!) that you're not getting a warning telling you that the correlation argument is being ignored (when I try a simple example passing a correlation argument to lme4, I do get a warning, but it's possible that the extra argument is getting swallowed somewhere inside gamm4).
Your desired autocorrelation structure ("autocorrelation can exist within each single value for block, but assume independence among blocks") is exactly the way correlation structures are coded in nlme (and hence in mgcv::gamm).
I would use mgcv::gamm, and would suggest that, if at all possible, you try it out on some simulated data with known structure (or use the data set provided in the supplementary material above and see whether you can reproduce their qualitative conclusions with your alternative approach).
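A sketch of the mgcv::gamm call, using simulated stand-in data with the same column names as the dat2 described in the question (the real data would be substituted): `corAR1(form = ~ 1 | block)` specifies exactly the requested structure, AR(1) correlation within each block and independence between blocks.

```r
library(mgcv)   # attaches nlme, which provides corAR1

# Simulated stand-in for dat2 (same column names as in the question).
set.seed(1)
n <- 200
dat2 <- data.frame(dist_slag = runif(n, 0, 30000),
                   Depth     = -runif(n, 0, 50),
                   block     = rep(1:10, each = 20))
dat2$dolphin_presence <- rbinom(n, 1, plogis(-1 + dat2$dist_slag / 20000))

# AR(1) errors within each block; blocks are treated as independent.
b <- gamm(dolphin_presence ~ s(dist_slag) + s(Depth),
          correlation = corAR1(form = ~ 1 | block),
          family = binomial(), data = dat2)

summary(b$gam)   # smooth terms
summary(b$lme)   # the estimated AR(1) parameter (Phi) appears here
```

Note that gamm returns $lme rather than the $mer component that gamm4 returns, and the fitted correlation structure shows up in the lme summary.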
StackOverflow is nice, but there is probably more mixed-model expertise on r-sig-mixed-models@r-project.org