I have been looking into the gamlss package for fitting semiparametric models and came across something strange in the ga() function. Even when the model is specified with a gamma distribution and fitted using REML, the output for the fitted smoother reports a Gaussian family fitted using GCV.
Example::
library(mgcv)
library(gamlss)
library(gamlss.add)
data(rent)
ga3 <- gam(R~s(Fl)+s(A), method="REML", data=rent, family=Gamma(log))
gn3 <- gamlss(R~ga(~s(Fl)+s(A), method="REML"), data=rent, family=GA)
Model summary for the GAM::
summary(ga3)
Family: Gamma
Link function: log
Formula:
R ~ s(Fl) + s(A)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.667996 0.008646 771.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Fl) 1.263 1.482 442.53 <2e-16 ***
s(A) 4.051 4.814 36.34 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.302 Deviance explained = 28.8%
-REML = 13979 Scale est. = 0.1472 n = 1969
Model summary for the GAMLSS::
summary(getSmo(gn3))
Family: gaussian
Link function: identity
Formula:
Y.var ~ s(Fl) + s(A)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.306e-13 8.646e-03 0 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Fl) 1.269 1.492 440.14 <2e-16 ***
s(A) 3.747 4.469 38.83 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.294 Deviance explained = 29.6%
GCV = 0.97441 Scale est. = 0.97144 n = 1969
Question::
Why does the model output report the wrong distribution and fitting method? Or am I missing something here and this is actually correct?
When you use the ga() function, gamlss calls the gam() function from mgcv in the background without specifying the family. As a result, the splines are fitted assuming a normal distribution. That is why, when you display the fitted smoothers, you see Family: gaussian and Link function: identity. Note also that the scale estimate returned when using ga() relates to the normal distribution.
Yes, when you use the ga() function, each gamlss iteration calls the gam() function from mgcv in the background. On each iteration it uses the correct local working variable and local weights for a gamma distribution.
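To see why the inner fit reports a Gaussian family, here is a minimal sketch of one working-model step, as used in IRLS-type fitting for a gamma model with log link. The simulated data and variable names are illustrative; this is not the actual gamlss internals:

```r
library(mgcv)

set.seed(1)
n  <- 500
x  <- runif(n)
mu <- exp(1 + 2 * x)                        # current fitted means, log link
y  <- rgamma(n, shape = 2, rate = 2 / mu)   # gamma response with E[y] = mu

# Working response and weights for a gamma/log model:
eta <- log(mu)
z   <- eta + (y - mu) / mu   # z = eta + (y - mu) * d(eta)/d(mu)
w   <- rep(1, n)             # for gamma with log link the IRLS weights are constant

# The working response z and weights w go to gam() without a family
# argument, so the inner smoother fit is Gaussian -- which is what
# summary(getSmo(...)) then reports:
inner <- gam(z ~ s(x), weights = w)
inner$family$family   # "gaussian"
```

The point estimates of the smoothers are still appropriate for the gamma model, because the distributional information enters through z and w rather than through the family of the inner fit.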
I have a GLM with (quasi)poisson family.
My dataset has 3 variables:
rate_data
rate_benchmark
X
So fitting the model:
model <- glm(formula = rate_data ~ offset(log(rate_benchmark)) + X - 1,
             family = poisson, data = data)       # or family = quasipoisson
model_null <- glm(formula = rate_data ~ offset(log(rate_benchmark)) - 1,
                  family = poisson, data = data)  # or family = quasipoisson
When using "poisson" it gives me warnings about non-integer values, which it doesn't give me for the quasipoisson. However, when testing whether my beta is zero with anova(model_null, model, test = "LRT"), the two families give me completely different deviances (and hence different p-values).
Which model am I supposed to use? My first thought was the quasipoisson, but the absence of warnings does not necessarily mean it is correct.
The Poisson and quasi-Poisson models differ in their assumptions about the form of the function relating the mean and variance of each observation. The Poisson assumes the variance equals the mean; the quasi-Poisson assumes that $\sigma^2 = \theta\mu$, which reduces to the Poisson when $\theta=1$. Consequently, the deviance and p-values will, as you have observed, be different between the two models.
You can in fact run a Poisson regression on non-integer data, at least in R; you'll still get the "right" coefficient estimates etc. The warnings are just that, warnings; they don't indicate an algorithm failure. Here's an example:
z <- 1 + 2*runif(100)
x <- rgamma(100,2,sqrt(z + z*z/2))
summary(glm(x~z, family=poisson))
... blah blah blah ...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.56390 0.17214 3.276 0.00105 **
z 0.43119 0.08368 5.153 2.57e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
... blah blah blah ...
There were 50 or more warnings (use warnings() to see the first 50)
Now we'll compare to a pure "quasi" model with the same link function and relationship between mean and variance; the "quasi" model makes no assumptions about integer values for the target variable:
summary(glm(x~z, family=quasi(link="log", variance="mu")))
... stuff ...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5639 0.2621 2.151 0.03392 *
z 0.4312 0.1274 3.384 0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasi family taken to be 2.318851)
Note that the parameter estimates are exactly the same, but the standard errors are different; this is due to the different calculations of variance, as reflected by the different dispersion parameters.
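You can check this scaling directly: the quasi-model standard errors equal the Poisson ones multiplied by the square root of the estimated dispersion:

```r
# Poisson SEs from the first fit, scaled by sqrt(dispersion) from the quasi fit:
sqrt(2.318851) * c(0.17214, 0.08368)
# reproduces the quasi-model SEs of 0.2621 and 0.1274 (up to rounding)
```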
Now for the quasi-Poisson model, which will, again, give us the same parameter estimates as the Poisson model, but with different standard errors:
summary(glm(x~z, family=quasipoisson))
... more stuff ...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5639 0.2621 2.151 0.03392 *
z 0.4312 0.1274 3.384 0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 2.31885)
Since the mean-variance relationship and link functions are the same as in the "quasi" model, the model results are the same also.
The Poisson distribution deals with counts -- the actual number of objects you counted in a defined volume, or the actual number of events you counted in a defined period of time.
If you normalized to a rate, the distribution is not Poisson.
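If you do have the underlying counts and exposures, you can keep a genuine Poisson likelihood by modelling the counts with the log exposure as an offset rather than modelling the rate directly; a minimal simulated sketch (variable names are illustrative):

```r
set.seed(1)
n        <- 200
exposure <- runif(n, 1, 10)   # e.g. person-years or volume observed
x        <- rnorm(n)
counts   <- rpois(n, lambda = exposure * exp(0.5 + 0.3 * x))

# Model the counts, not the rate counts/exposure:
fit <- glm(counts ~ x + offset(log(exposure)), family = poisson)
coef(fit)   # estimates of the rate model on the log scale
```

The fitted coefficients then describe log(counts/exposure) as a linear function of x, with no non-integer-response warnings.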
I fitted the following GAMM using the R package gamlss:
model <- gamlss(Overlap ~ Diff.Long + Diff.Fzp + DiffSeason + random(Xnumber),
                family = BEZI(mu.link = "logit", sigma.link = "log", nu.link = "logit"),
                data = data, trace = FALSE)
The output of this model is:
******************************************************************
Family: c("BEZI", "Zero Inflated Beta")
Call: gamlss(formula = Overlap ~ Diff.Long + Diff.Fzp + DiffSeason +
random(Xnumber), family = BEZI(mu.link = "logit",
sigma.link = "log", nu.link = "logit"), data = data, trace = F)
Fitting method: RS()
------------------------------------------------------------------
Mu link function: logit
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.188647 0.208359 -0.905 0.36715
Diff.Long -0.002072 0.000736 -2.814 0.00575 **
Diff.Fzp -0.030909 0.013749 -2.248 0.02648 *
DiffSeasonEW-LW -0.617976 0.211260 -2.925 0.00415 **
DiffSeasonLW-LW -0.356989 0.270548 -1.320 0.18963
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.865 0.126 14.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------
Nu link function: logit
Nu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.3470 0.3156 -7.437 2.02e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 126
Degrees of Freedom for the fit: 11.15247
Residual Deg. of Freedom: 114.8475
at cycle: 5
Global Deviance: -43.54531
AIC: -21.24037
SBC: 10.39118
******************************************************************
I'm not yet very familiar with additive models, and I am trying to find the significance of my random effect ("Xnumber"). I know that the package mgcv has a way to do this, but gamlss is the only package that has the distribution I need (zero-inflated beta).
If anyone knows of any functions I can use, that would be great. Or is it already in the summary and I just don't know where to look?
I am a newbie when it comes to stats, and I am performing Kendall-Theil Sen (Siegel variation) linear regressions for my dissertation. I am really not sure how to interpret the output, and Google hasn't offered much help either. I know the estimate reflects the regression coefficients, but I am completely oblivious as to how to write this up correctly.
This is my output for one regression:
Call:
mblm(formula = SCI_TotalScore ~ DefeatTotalScore, dataframe = complete_dat_totals)
Residuals:
Min 1Q Median 3Q Max
-11.6089 -4.0464 -0.3589 6.1411 14.3911
Coefficients:
Estimate MAD V value Pr(>|V|)
(Intercept) 23.8589 7.6916 4095.0 < 2e-16 ***
DefeatTotalScore -0.2500 0.2553 245.5 1.05e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.247 on 88 degrees of freedom
Can anyone offer any help?
I would like to calculate the between- and within-group variability of a parametric term (or the mean squares of the parametric term and the residuals) in an mgcv::gam, but can't figure out how to do that with mgcv.
Below is an example. I've created a gam object, then used the summary and anova.gam functions, but they only provide F-values. I assume the F-value of the parametric term can be interpreted like its linear-model equivalent: the mean square of the group term divided by the mean square of the residuals.
library(mgcv)
demo <- read.csv("https://stats.idre.ucla.edu/stat/data/demo3.csv")
## Convert variables to factor
demo <- within(demo, {
group <- factor(group)
time <- factor(time)
id <- factor(id)
})
gam1 <- gam(pulse ~ group + s(id, bs="re"), method="ML", data = demo)
summary(gam1)
Family: gaussian
Link function: identity
Formula:
pulse ~ group + s(id, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.583 3.425 6.886 6.48e-07 ***
group2 18.417 4.844 3.802 0.000976 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(id) 7.944e-05 6 0 1
R-sq.(adj) = 0.369 Deviance explained = 39.7%
-ML = 92.376 Scale est. = 140.77 n = 24
anova.gam(gam1)
Family: gaussian
Link function: identity
Formula:
pulse ~ group + s(id, bs = "re")
Parametric Terms:
df F p-value
group 1 14.46 0.000976
Approximate significance of smooth terms:
edf Ref.df F p-value
s(id) 7.944e-05 6.000e+00 0 1
I would like to calculate the mean squares. With a simple linear model, like in the example below, I can use the anova function and get the mean squares. How can I calculate the mean squares in a gam using mgcv?
library(stats)
lm1 <- lm(pulse ~ group + id, data = demo)
anova(lm1)
Analysis of Variance Table
Response: pulse
Df Sum Sq Mean Sq F value Pr(>F)
group 1 2035.04 2035.04 10.636 0.004903 **
id 6 35.58 5.93 0.031 0.999829
Residuals 16 3061.33 191.33
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I'm a complete novice in R and I'm trying to do a non-linear least squares fit to some data. The following (SC4 and t are my data columns) seems to work:
fit = nls(SC4 ~ fin+(inc-fin)*exp(-t/T), data=asc4, start=c(fin=0.75,inc=0.55,T=150.0))
The "summary(fit)" command produces output that doesn't include a p-value, which, aside from the fitted parameters, is ultimately what I'm trying to get. The parameters I get look sensible.
Formula: SC4 ~ fin + (inc - fin) * exp(-t/T)
Parameters:
Estimate Std. Error t value Pr(>|t|)
fin 0.73703 0.02065 35.683 <2e-16 ***
inc 0.55671 0.02206 25.236 <2e-16 ***
T 51.48446 21.25343 2.422 0.0224 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04988 on 27 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: 4.114e-06
So is there any way to get a p-value? I'd be happy to use a command other than nls if that will do the job. In fact, I'd be happy to use gnuplot, for example, if there's some way to get a p-value from it (gnuplot is what I normally use for graphics).
PS I'm looking for a p-value for the overall fit, rather than for individual coefficients.
The way to do this in R is to fit your current model, then fit a reduced model with fewer variables, and compare the two with anova(model_reduced, model_full). The computed F-statistic will be close to one if you cannot reject the null hypothesis that the parameters of the variables you removed are equal to zero. For a standard linear regression, the summary function usually does this comparison against the intercept-only model for you automatically.
So, for example, this is how you would do it for a standard linear regression:
> x = rnorm(100)
> y=rnorm(100)
> reg = lm(y~x)
> summary(reg)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.3869 -0.6815 -0.1137 0.7431 2.5939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002802 0.105554 -0.027 0.9789
x -0.182983 0.104260 -1.755 0.0824 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.056 on 98 degrees of freedom
Multiple R-squared: 0.03047, Adjusted R-squared: 0.02058
F-statistic: 3.08 on 1 and 98 DF, p-value: 0.08237
But then if you use anova you should get the same F-score:
> anova(reg)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 3.432 3.4318 3.0802 0.08237 .
Residuals 98 109.186 1.1141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
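The same comparison carries over to nls: fit the full model and a reduced (constant-only) model, then call anova on the pair. A sketch with simulated data mimicking the question's model (the parameter names fin, inc, T and the data are illustrative, not the asker's actual data):

```r
set.seed(1)
t_  <- seq(0, 300, length.out = 30)
SC4 <- 0.75 + (0.55 - 0.75) * exp(-t_ / 150) + rnorm(30, sd = 0.02)
d   <- data.frame(t = t_, SC4 = SC4)

# Full exponential model, as in the question:
full <- nls(SC4 ~ fin + (inc - fin) * exp(-t / T), data = d,
            start = c(fin = 0.75, inc = 0.55, T = 150))
# Reduced, constant-only model:
null <- nls(SC4 ~ a, data = d, start = c(a = 0.6))

anova(null, full)   # F-test for the overall fit of the full model
```

The p-value from this F-test answers the "overall fit" question, i.e. whether the exponential model improves on a flat line.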