Calculating between and within variance (mean squares) in mgcv::gam - r

I would like to calculate the between and within variability of a parametric term (or mean squares of parametric term and residuals) in a mgcv::gam, but can't figure out how to do that with mgcv.
Below is an example. I've created a gam object. Then used the summary and anova.gam functions, but they only provide F-values. I assume the F-value of the parametric term can be interpreted like its linear model equivalent - as the mean sq of group divided by the mean sq of residuals.
library(mgcv)
demo <- read.csv("https://stats.idre.ucla.edu/stat/data/demo3.csv")
## Convert variables to factor
demo <- within(demo, {
group <- factor(group)
time <- factor(time)
id <- factor(id)
})
gam1 <- gam(pulse ~ group + s(id, bs="re"), method="ML", data = demo)
summary(gam1)
Family: gaussian
Link function: identity
Formula:
pulse ~ group + s(id, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.583 3.425 6.886 6.48e-07 ***
group2 18.417 4.844 3.802 0.000976 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(id) 7.944e-05 6 0 1
R-sq.(adj) = 0.369 Deviance explained = 39.7%
-ML = 92.376 Scale est. = 140.77 n = 24
anova.gam(gam1)
Family: gaussian
Link function: identity
Formula:
pulse ~ group + s(id, bs = "re")
Parametric Terms:
df F p-value
group 1 14.46 0.000976
Approximate significance of smooth terms:
edf Ref.df F p-value
s(id) 7.944e-05 6.000e+00 0 1
I would like to calculate the mean squares. With a simple linear model, like in the example below, I can use the anova function and get the mean squares. How can I calculate the mean squares in a gam using mgcv?
library(stats)
lm1 <- lm(pulse ~ group + id, data = demo)
anova(lm1)
Analysis of Variance Table
Response: pulse
Df Sum Sq Mean Sq F value Pr(>F)
group 1 2035.04 2035.04 10.636 0.004903 **
id 6 35.58 5.93 0.031 0.999829
Residuals 16 3061.33 191.33
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
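The numbers already printed above are consistent with that reading. As a rough check, and assuming that s(id) contributes essentially nothing (its edf is about 8e-05, so the gam is effectively pulse ~ group), the gam scale estimate should behave like a residual mean square that pools the id and residual sums of squares from the lm table over 24 - 2 = 22 degrees of freedom. The sketch below only redoes that arithmetic with values copied from the output; it does not call mgcv.

```r
# Values copied from the anova(lm1) and summary(gam1) output above
ss_group <- 2035.04
ss_id    <- 35.58
ss_resid <- 3061.33
n        <- 24

# With s(id) shrunk to ~0 edf, the id sum of squares is pooled into the residual
ms_group        <- ss_group / 1
ms_resid_pooled <- (ss_id + ss_resid) / (n - 2)

round(ms_resid_pooled, 2)             # 140.77, matches the reported Scale est.
round(ms_group / ms_resid_pooled, 2)  # 14.46, matches the reported F for group
```

So in this particular fit the parametric F is the group mean square divided by the scale estimate. This shortcut relies on the random effect being shrunk to zero; it is not a general recipe for models where the smooth terms use non-negligible edf.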

Related

What is ga() in the gamlss package doing?

I have been looking into the gamlss package for fitting semiparametric models and came across something strange in the ga() function. Even if the model is specified as having a gamma distribution, fitted using REML, the output for the model is Gaussian, fitted using GCV.
Example::
library(mgcv)
library(gamlss)
library(gamlss.add)
data(rent)
ga3 <- gam(R~s(Fl)+s(A), method="REML", data=rent, family=Gamma(log))
gn3 <- gamlss(R~ga(~s(Fl)+s(A), method="REML"), data=rent, family=GA)
Model summary for the GAM::
summary(ga3)
Family: Gamma
Link function: log
Formula:
R ~ s(Fl) + s(A)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.667996 0.008646 771.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Fl) 1.263 1.482 442.53 <2e-16 ***
s(A) 4.051 4.814 36.34 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.302 Deviance explained = 28.8%
-REML = 13979 Scale est. = 0.1472 n = 1969
Model summary for the GAMLSS::
summary(getSmo(gn3))
Family: gaussian
Link function: identity
Formula:
Y.var ~ s(Fl) + s(A)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.306e-13 8.646e-03 0 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Fl) 1.269 1.492 440.14 <2e-16 ***
s(A) 3.747 4.469 38.83 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.294 Deviance explained = 29.6%
GCV = 0.97441 Scale est. = 0.97144 n = 1969
Question::
Why is the model output giving the incorrect distribution and fitting method? Is there something that I am missing here and this is correct?
When you use ga(), gamlss calls mgcv's gam() in the background without specifying the family, so the smoothers are fitted assuming a normal distribution. That is why the summary of the fitted smoother shows Family: gaussian and Link function: identity. Note also that the scale estimate returned when using ga() relates to the normal distribution.
Yes. With ga(), each gamlss iteration calls mgcv's gam() in the background, and it uses the correct local working variable and local weights for a gamma distribution on each iteration.
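The idea of fitting a non-Gaussian model through repeated Gaussian fits on a working variable can be illustrated with ordinary IRLS for a Gamma GLM. This is a minimal base R sketch on simulated data (all names and values are made up for illustration), and it shows the GLM analogue only; it is not the algorithm gamlss itself runs.

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- rgamma(n, shape = 5, rate = 5 / exp(1 + 2 * x))  # Gamma response, log link

# Reference fit with the Gamma family specified directly
ref <- glm(y ~ x, family = Gamma(link = "log"))

# Hand-rolled IRLS: every inner fit is a plain (weighted) Gaussian least squares
eta <- log(y)                       # starting linear predictor (y > 0)
for (i in 1:25) {
  mu <- exp(eta)
  z  <- eta + (y - mu) / mu         # working variable for the log link
  w  <- rep(1, n)                   # (dmu/deta)^2 / V(mu) = mu^2 / mu^2 = 1
  inner <- lm(z ~ x, weights = w)   # looks "gaussian / identity" if summarised
  eta <- fitted(inner)
}

rbind(glm = coef(ref), irls = coef(inner))
```

The inner lm knows nothing about the Gamma family, yet the outer loop reproduces the Gamma fit. In the same spirit, the summary of getSmo(gn3) describes only the last inner Gaussian fit on the working variable, not the overall model.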

How to compare the slope of interaction variables in mixed effect model in r

I want to test the effects of island area and land use, and the interaction between island area and land use on species richness. For land use, I have three groups, namely forest, farmland and mix. The data is based on transects on different islands, so the island ID is set as random effect.
My model looks like this:
#model = lmer(SR ~ Area + land_use + Area:land_use + (1|islandID), data = transect_ZS)
#summary(model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ Area + land_use + Area:land_use + (1 | islandID)
Data: transect_ZS
REML criterion at convergence: 184.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.66105 -0.56159 -0.00294 0.57259 1.72096
Random effects:
Groups Name Variance Std.Dev.
islandID (Intercept) 0.1524 0.3903
Residual 0.6805 0.8249
Number of obs: 70, groups: islandID, 34
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.9996 0.5187 57.0061 -1.927 0.05893 .
Area 0.9064 0.2834 40.9977 3.198 0.00267 **
land_useforest 0.6563 0.5569 62.0889 1.179 0.24309
land_usemix 0.9611 0.6373 55.3032 1.508 0.13723
Area:land_useforest -0.8318 0.3034 63.4045 -2.742 0.00793 **
Area:land_usemix -0.7756 0.4748 56.3692 -1.633 0.10795
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results told me that island area and the interaction terms have significant effect on SR:
# > anova(model)
#Type III Analysis of Variance Table with Satterthwaite's method
# Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#Area 3.0359 3.03590 1 27.448 4.4615 0.04390 *
#land_use 1.5520 0.77601 2 57.617 1.1404 0.32679
#Area:land_use 5.1658 2.58288 2 60.935 3.7958 0.02795 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
I then used the lsmeans function to conduct Tukey's pairwise comparisons:
#lsmeans(model, pairwise ~ Area:land_use, adjust="tukey")
The results indicate that species richness differs significantly between farmland and forest, right? Should this difference be interpreted as a significant difference in the intercept of the species richness-area relationship between farmland and forest in this model? That is, is species richness in farmland transects higher than in forest transects?
#$contrasts
contrast estimate SE df t.ratio p.value
1.19968425045037 farmland - 1.19968425045037 forest 3.4153 0.288 62.6 1.185 0.0466
1.19968425045037 farmland - 1.19968425045037 mix -0.0306 0.426 64.0 -0.072 0.9972
1.19968425045037 forest - 1.19968425045037 mix -0.3722 0.377 63.9 -0.987 0.5087
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 3 estimates
But how do I test whether the slope of the species richness-area relationship differs significantly between farmland and forest in this model? That is, how do I test whether the species richness-area relationship is steeper for farmland transects than for forest transects?
I think you want
lstrends(model, pairwise ~ land_use, var = "Area", adjust="tukey")
The functions lsmeans and lstrends are in the emmeans package, in which they are equivalent to emmeans and emtrends respectively. So look at the documentation for those functions. The lsmeans package is just a front end.
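To see what a pairwise slope comparison is measuring, here is a minimal base R sketch on simulated data. It uses a plain lm and ignores the islandID random effect, and the data frame, slopes and noise level are all invented for illustration. With treatment contrasts, the Area:land_use coefficient is exactly the difference between a group's Area slope and the reference group's slope.

```r
set.seed(42)
n <- 60
d <- data.frame(
  Area = runif(n, 0, 3),
  land_use = factor(rep(c("farmland", "forest", "mix"), each = n / 3))
)
true_slope <- c(farmland = 0.9, forest = 0.1, mix = 0.2)  # hypothetical values
d$SR <- true_slope[as.character(d$land_use)] * d$Area + rnorm(n, sd = 0.5)

# Interaction model; farmland is the reference level (alphabetical order)
fit <- lm(SR ~ Area * land_use, data = d)

# Area slope within each land use, from separate per-group regressions
slopes <- sapply(split(d, d$land_use),
                 function(g) coef(lm(SR ~ Area, data = g))[["Area"]])

slope_diff       <- slopes[["forest"]] - slopes[["farmland"]]
interaction_coef <- coef(fit)[["Area:land_useforest"]]
c(slope_diff = slope_diff, interaction_coef = interaction_coef)
```

Read this way, the Area:land_useforest row in the summary above is already a test of the farmland versus forest slope difference. The pairwise slope comparison additionally covers forest versus mix and applies the Tukey adjustment.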

Extracting t-stat p values from lm in R

I have run a regression model in R using the lm function. The resulting ANOVA table gives me the F-value for each coefficient (which doesn't really make sense to me). What I would like to know is the t-stat for each coefficient and its corresponding p-value. How do I get this? Is it stored by the function or does it require additional computation?
Here is the code and output:
library(lubridate)
library(RCurl)
library(plyr)
[in] fit <- lm(btc_close ~ vix_close + gold_close + eth_close, data = all_dat)
# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
anova(fit) # anova table
[out]
Analysis of Variance Table
Response: btc_close
Df Sum Sq Mean Sq F value Pr(>F)
vix_close 1 20911897 20911897 280.1788 <2e-16 ***
gold_close 1 91902 91902 1.2313 0.2698
eth_close 1 42716393 42716393 572.3168 <2e-16 ***
Residuals 99 7389130 74638
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If my statistics knowledge serves me correctly, these f-values are meaningless. Theoretically, I should receive an F-value for the model and a T-value for each coefficient.
Here is an example with comments of how you can extract just the t-values.
# Some dummy data
n <- 1e3L
df <- data.frame(x = rnorm(n), z = rnorm(n))
df$y <- with(df, 0.01 * x^2 + z/3)
# Run regression
lr1 <- lm(y ~ x + z, data = df)
# R has special summary method for class "lm"
summary(lr1)
# Call:
# lm(formula = y ~ x + z, data = df)
# Residuals:
# Min 1Q Median 3Q Max
# -0.010810 -0.009025 -0.005259 0.003617 0.096771
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.0100122 0.0004313 23.216 <2e-16 ***
# x 0.0008105 0.0004305 1.883 0.06 .
# z 0.3336034 0.0004244 786.036 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.01363 on 997 degrees of freedom
# Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
# F-statistic: 3.09e+05 on 2 and 997 DF, p-value: < 2.2e-16
# Now, if you only want the t-values
summary(lr1)[["coefficients"]][, "t value"]
# Or (better practice as explained in comments by Axeman)
coef(summary(lr1))[, "t value"]
# (Intercept) x z
# 23.216317 1.882841 786.035718
You could try this:
summary(fit)
summary(fit)$coefficients[, 4] # p-values
summary(fit)$coefficients[, 3] # t-values
As Benjamin has already answered, I would advise using broom::tidy() to coerce the model object to a tidy dataframe. The statistic column will contain the relevant test statistic and is easily available for plotting with ggplot2.
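If broom is not installed, a similar data frame can be assembled from the coefficient matrix in base R. This is a minimal sketch on made-up data; the column names mirror the ones broom::tidy() uses, but the object is built by hand and is only an approximation of what tidy() returns.

```r
set.seed(1)
df <- data.frame(x = rnorm(100), z = rnorm(100))
df$y <- 1 + 0.5 * df$x + rnorm(100)
fit_demo <- lm(y ~ x + z, data = df)   # stand-in for the real model

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(fit_demo))

tidy_like <- data.frame(
  term      = rownames(ctab),
  estimate  = ctab[, "Estimate"],
  std.error = ctab[, "Std. Error"],
  statistic = ctab[, "t value"],       # the t-statistics
  p.value   = ctab[, "Pr(>|t|)"],
  row.names = NULL
)
tidy_like
```

Selecting columns by name rather than by position keeps the code readable and guards against picking the wrong column.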
You can use this to extract only the t-values:
summary(fit)$coefficients[,3]

why step() returns weird results from backward elimination for full model using lmerTest

I am confused about why the results from running step(model) in lmerTest look abnormal.
m0 <- lmer(seed ~ connection*age + (1|unit), data = test)
step(m0)
note: Both "connection" and "age" have been set as.factor()
Random effects:
Chi.sq Chi.DF elim.num p.value
unit 0.25 1 1 0.6194
Fixed effects:
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
connection 1 0.01746 0.017457 1.5214 0.22142
age 1 0.07664 0.076643 6.6794 0.01178 *
connection:age 1 0.04397 0.043967 3.8317 0.05417 .
Residuals 72 0.82617 0.011475
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Least squares means:
Estimate Standard Error DF t-value Lower CI Upper CI p-value
Final model:
Call:
lm(formula = fo, data = mm, contrasts = l.lmerTest.private.contrast)
Coefficients:
(Intercept) connectionD ageB connectionD:ageB
-0.84868 -0.07852 0.01281 0.09634
Why does it not show me the final model?
What happened here is that the random effect was eliminated as non-significant (NS) according to the likelihood ratio test. The anova method for the resulting fixed-effects model, an "lm" object, was then applied, and no elimination of NS fixed effects was done. You are right that the output differs from that for "lmer" objects and that there are no (differences of) least squares means. If you want the latter, you may try the lsmeans package. For backward elimination of NS effects in the final model, you may use the stats::step function.
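A minimal sketch of that last suggestion on simulated data (the variables x1, x2, x3 are invented for illustration). Note that stats::step selects by AIC rather than by p-values, so it will not necessarily drop the same terms as an NS-based elimination.

```r
set.seed(123)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)           # only x1 truly affects y

full <- lm(y ~ x1 + x2 + x3, data = d)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                      # the model that step() settled on
kept <- attr(terms(reduced), "term.labels")
```

The returned object is an ordinary "lm" fit, so summary() and anova() work on it as usual.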

Getting p-value via nls in r

I'm a complete novice in R and I'm trying to do a non-linear least squares fit to some data. The following (SC4 and t are my data columns) seems to work:
fit = nls(SC4 ~ fin+(inc-fin)*exp(-t/T), data=asc4, start=c(fin=0.75,inc=0.55,T=150.0))
The "summary(fit)" command produces an output that doesn't include a p-value, which is ultimately, aside from the fitted parameters, what I'm trying to get. The parameters I get look sensible.
Formula: SC4 ~ fin + (inc - fin) * exp(-t/T)
Parameters:
Estimate Std. Error t value Pr(>|t|)
fin 0.73703 0.02065 35.683 <2e-16 ***
inc 0.55671 0.02206 25.236 <2e-16 ***
T 51.48446 21.25343 2.422 0.0224 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04988 on 27 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: 4.114e-06
So is there any way to get a p-value? I'd be happy to use another command other than nls if that will do the job. In fact, I'd be happy to use gnuplot for example if there's some way to get a p-value from it (in fact gnuplot is what I normally use for graphics).
PS I'm looking for a p-value for the overall fit, rather than for individual coefficients.
The way to do this in R is to fit your current model, then fit a reduced model with fewer variables, and compare the two with anova(new_model, previous_model). The computed F-score will be close to one if you cannot reject the null hypothesis that the parameters of the removed variables are equal to zero. For a standard linear regression, the summary function usually does this for you automatically.
So for example this is how you would do it for the standard linear regression:
> x = rnorm(100)
> y=rnorm(100)
> reg = lm(y~x)
> summary(reg)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.3869 -0.6815 -0.1137 0.7431 2.5939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002802 0.105554 -0.027 0.9789
x -0.182983 0.104260 -1.755 0.0824 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.056 on 98 degrees of freedom
Multiple R-squared: 0.03047, Adjusted R-squared: 0.02058
F-statistic: 3.08 on 1 and 98 DF, p-value: 0.08237
But then if you use anova you should get the same F-score:
> anova(reg)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 3.432 3.4318 3.0802 0.08237 .
Residuals 98 109.186 1.1141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
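The same comparison can be sketched for the nls case by computing the F statistic by hand against a constant-only (mean) model. This is a minimal sketch on simulated data shaped like the question's model; the data frame asc4_demo, the noise level and the starting values are assumptions, and the F test is only approximate for nonlinear models.

```r
set.seed(7)
t <- seq(10, 300, by = 10)
asc4_demo <- data.frame(t = t)        # hypothetical stand-in for asc4
asc4_demo$SC4 <- 0.75 + (0.55 - 0.75) * exp(-t / 50) +
  rnorm(length(t), sd = 0.01)

# Full nonlinear model, same form as in the question
fit_full <- nls(SC4 ~ fin + (inc - fin) * exp(-t / T), data = asc4_demo,
                start = c(fin = 0.75, inc = 0.55, T = 100))

# Reduced model: a single constant, whose least squares fit is the mean
rss_null <- sum((asc4_demo$SC4 - mean(asc4_demo$SC4))^2)
df_null  <- nrow(asc4_demo) - 1

rss_full <- deviance(fit_full)        # residual sum of squares of the nls fit
df_full  <- df.residual(fit_full)

# Approximate F test for the full model against the constant-only model
f_stat <- ((rss_null - rss_full) / (df_null - df_full)) / (rss_full / df_full)
p_val  <- pf(f_stat, df_null - df_full, df_full, lower.tail = FALSE)
c(F = f_stat, p = p_val)
```

A small p-value here says that the exponential model fits better than a flat line, which is the closest analogue to the overall F-test that summary() prints for lm.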
